Domesticating Talend

We’ve started working with Talend, and specifically with the ‘big data’ point-and-drag IDE. I’m reasonably happy with it: it does pretty much what it says on the box, but the ability to integrate its output with our product and approach is not great. The intention of the product appears to be mainly to run the ETL jobs from within the IDE, but there’s an ‘export job’ facility that dumps a ZIP file containing shell and batch scripts, some generated JARs, and all the dependencies, all bundled up for execution from the command line.

The trouble is that our use case does not match up well with this approach – we need to embed the Talend-generated code inside our service, which for us then means getting the generated JARs into our service project using Maven. The nasty bit then is immediately obvious – how do we version and deploy the Talend-generated JAR files?

My first tentative approach is going to be as follows. The first step is to use the Talend job export facility to export the ZIP to a standard location with a standard name. The second step is to use Maven with the following pom.xml, and invoke a standard mvn release:prepare release:perform to get a single unified JAR into our Maven repository:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>com.somoglobal</groupId>
    <artifactId>Apptimiser</artifactId>
    <version>1.14.2</version>
    <relativePath/>
  </parent>

  <groupId>com.somoglobal.talend</groupId>
  <artifactId>PostAttribution</artifactId>
  <packaging>jar</packaging>
  <version>1.0.7-SNAPSHOT</version>
  <name>PostAttribution</name>

  <description>
    The Talend PostAttribution project packaged as a jar.
  </description>

  <scm>
    <connection>scm:svn:https://svn.somodigital.com/mobfusion/talend/PostAttribution/trunk</connection>
    <developerConnection>scm:svn:https://svn.somodigital.com/mobfusion/talend/PostAttribution/trunk</developerConnection>
    <url>https://svn.somodigital.com/mobfusion/talend/PostAttribution/trunk</url>
  </scm>
  
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>
  
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-clean-plugin</artifactId>
        <version>2.5</version>
        <configuration>
          <filesets>
            <fileset>
              <directory>temp/unzip</directory>
              <includes>
                <include>**</include>
              </includes>
              <followSymlinks>false</followSymlinks>
            </fileset>
          </filesets>
        </configuration>
      </plugin>
      
      <plugin>
        <!-- http://evgeny-goldin.com/wiki/Copy-maven-plugin -->
        <groupId>com.github.goldin</groupId>
        <artifactId>copy-maven-plugin</artifactId>
        <version>0.2.5</version>
        <executions>
          <execution>
            <id>obtain-jars</id>
            <phase>prepare-package</phase>
            <goals>
              <goal>copy</goal>
            </goals>
            <configuration>
              <resources>
                <!-- unpack the zip for a normal build, when temp/ sits beside the pom -->
                <resource>
                  <runIf>{{ new File( project.basedir, 'temp' ).isDirectory() }}</runIf>
                  <description>Unpacking Talend export</description>
                  <targetPath>${project.build.outputDirectory}</targetPath>
                  <file>temp/newExportFolder.zip</file>
                  <zipEntries>
                    <zipEntry>**/*_0_1.jar</zipEntry>
                  </zipEntries>
                  <unpack>true</unpack>
                </resource>
                
                <!-- unpack the zip when doing release:perform, where the build runs from a checkout two levels down and temp/ is not beside the pom -->
                <resource>
                  <runIf>{{ !(new File( project.basedir, 'temp' ).isDirectory()) }}</runIf>
                  <description>Unpacking Talend export</description>
                  <targetPath>${project.build.outputDirectory}</targetPath>
                  <file>../../temp/newExportFolder.zip</file>
                  <zipEntries>
                    <zipEntry>**/*_0_1.jar</zipEntry>
                  </zipEntries>
                  <unpack>true</unpack>
                </resource>
                
                <!-- unpack the jars -->
                <resource>
                  <description>Unpacking jar files</description>
                  <targetPath>${project.build.outputDirectory}</targetPath>
                  <directory>${project.build.outputDirectory}</directory>
                  <includes>
                    <include>*.jar</include>
                  </includes>
                  <unpack>true</unpack>
                </resource>
                
                <!-- discard the jars -->
                <resource>
                  <description>cleaning jar files</description>
                  <targetPath>${project.build.outputDirectory}</targetPath>
                  <directory>${project.build.outputDirectory}</directory>
                  <includes>
                    <include>*.jar</include>
                  </includes>
                  <clean>true</clean>
                </resource>
              </resources>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
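
Once the release has been performed, the service project should be able to pull the Talend job in like any other dependency. A minimal sketch of the consuming side follows. The coordinates come from the pom above; the version 1.0.7 is an assumption, being what release:prepare would propose by default for 1.0.7-SNAPSHOT.

```xml
<!-- In the consuming service's pom.xml.
     Version 1.0.7 is assumed: the default release version for 1.0.7-SNAPSHOT. -->
<dependency>
  <groupId>com.somoglobal.talend</groupId>
  <artifactId>PostAttribution</artifactId>
  <version>1.0.7</version>
</dependency>
```

Note that the zipEntry pattern above only picks up the generated job JARs, so any third-party libraries bundled in the Talend export would presumably still need to be declared as ordinary dependencies of the service.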

Big shout out to Evgeny Goldin for his copy-maven plugin that makes this easier.

Toward a vision of Sustainable Server Programming

For a number of years I’ve been thinking that I should write down some of the ideas I’ve had and lessons I’ve learned over way too many years banging on a keyboard. In my head this has always been centered around the vague label “sustainable server development”. Let me try to peel away some layers of that, and make the label a little less vague.

We will begin with “server”. A substantial amount of what I’ve written over the past decade or so I would handwave label as “server side”. But what do I mean by that? For me a “server” is a program intended to be mainly headless, running unattended for considerable periods of time, probably on remote hardware, and providing a well defined service. By preference for me a server runs under some form of Unix. Really the landscape has collapsed to three platforms now: some form of Unix, Windows, and the widely variable set of options for software embedded in specialist hardware (although these days, that is very frequently a Unix variant as well). I’ve done a little bit against Windows, and always found the experience frustratingly complex and ultimately unrewarding. Unix was built from the ground up to provide many of the facilities needed for “server side” coding (I’m particularly thinking of a robust security model, abstracted hardware and networking facilities, and sophisticated multi-processing facilities), and provides the coder with access to multiple layers of the stack between her code and the hardware in ways that Windows makes difficult. Bringing that back to a statement: for me a “server” is a headless, service-oriented piece of code running under Unix, and required to be robust, performant and reliable.

So. “sustainable”. Like any coin, this has two sides (I know, technically all coins have at least three faces): “sustainable server” and “sustainable development”. I believe the two are really linked, and hope over a series of articles to illustrate this. When I talk about a “sustainable server”, I mean something that has been built to minimise hassle and surprise for administrators and for code maintainers. When I talk about “sustainable development” I mean approaches that make building and maintaining robust, reliable and performant code a pleasant and simple 9-to-5 job, rather than a heroic nightmare of late nights, pizza and caffeine.

I am not a fan of heroic coding. There is plenty of clinical evidence that a tired and stressed coder is a bad coder: some clinical studies suggest that a few days’ disturbed sleep has the same effect on cognition as being seriously inebriated. We have a culture that is proving very hard to break, where a mad hack over a sleepless week resulting in partially completed, un-documented, un-maintainable code is an effort to be applauded (and repeated) rather than treated as an unwelcome and undesired exception. While the coder is subject to a variety of lunacies from project managers and product owners, we are our own worst enemies if we keep committing to unhealthy and irrational death marches. A calm and rational approach to developing server side services should make it unnecessary: most of the problems to be solved in this space have been solved before, and we’ve got a lot of historical precedents to call on. Most of the time, none of this has to be considered hard or complex, so please just go take a cold shower and a walk around the block, and calm down.

Let me point to an example outside the coding world. Watch a carpenter, or a blacksmith, at work. There’s no sense of rush or panic or urgency. The craftsman knows how long each part of the process takes, has learned from the past, and is happy to re-use established patterns. She gives herself time to deal with the hard parts of the problem by knocking away the simple parts efficiently. And most relevantly: if a project manager rushes in and says “this needs to be done in half the time”, the response is not “oh, in that case we’d better order pizzas because it will be a long night.”

The key elements of what I would classify as ‘good’ server software are as follows:

1) Clarity of purpose. The software does one thing, provides one well defined service;
2) Predictability. The software behaves in a well defined and documented fashion;
3) Robustness. The server should be resilient and gracefully adapt to environmental changes, cope with malformed or malicious requests, and not collapse under extreme load;
4) Manageability. Administrators should be able to monitor and configure the service with ease, and should be able to programmatically manage the service state;
5) Performance. Requests should be responded to within 50ms – 100ms or better under all conditions. In particular performance should not degrade unpredictably under load.

In my experience a lot of coders – and managers of coders – have the idea that setting these goals as a minimum base requirement is unrealistic and expensive. Twaddle and nonsense. Let me point to two exemplars, both available as FOSS and both initially built out by small teams: Varnish and the core Apache server. In both cases, these are not part of the base operating system, they are services run on a server. In both cases, all the goals above are amply met. And in both cases, there is a development and maintenance infrastructure around the code which is palpably sustainable and effective.

Varnish is a particularly fine example. There were no surprises or complexities in installing it and running it. It worked as expected ‘out of the box’ without intensive configuration. It’s very easy to monitor and manage. It does one thing, extremely well, and does it in the fashion described, documented and expected. And most importantly it just runs and runs and runs without intervention or alarm.

Let’s make all server software that good, and knock off work at 5pm, enjoy our weekends, take up hobbies and stop these panicked head-long rushes into the night. Our partners, family, waistlines and hearts will thank us for it.

Amazon SWF, aspectj-maven-plugin and JaCoCo

Which could be subtitled as “6+ hours of my life I will never get back”.

I’m leaving this here in case someone else finds it useful. The short story is thus: I’m working on a product using the Amazon Flow SDK, writing in Eclipse under Java 7. We use Maven and JaCoCo, and believe in TDD, high levels of code coverage, and code that doesn’t suck. It turns out that getting all these bits to work together is far more complex than it has any reason to be.

There were several conflicting problems: the AspectJ AOP tool was not successfully dealing with the @Asynchronous annotation on my Workflow implementation, which meant that tests were failing. Various attempts to get that working resulted in the aspectj-maven-plugin failing in various horrible ways, JaCoCo instrumentation failing in various horrible ways, or both.

Here’s an edited version of the pom.xml to show how I sorted it in the end (which reminds me that I need a better way of adding code snippets here):

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <!-- snip: the parent pom contains base definitions for JaCoCo -->
  </parent>

  <artifactId>XXXX</artifactId>
  <version>1.0.0-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>XXX</name>
  <url>https://...</url>

  <description>
  </description>

  <scm>
    <!-- snip -->
  </scm>

  <properties>
    <!-- snip -->
  </properties>

  <dependencies>
    <!-- snip -->
    <dependency>
      <groupId>org.aspectj</groupId>
      <artifactId>aspectjrt</artifactId>
      <version>1.7.3</version>
    </dependency>

    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk-flow-build-tools</artifactId>
      <version>1.5.2</version>
    </dependency>

    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk</artifactId>
      <version>1.5.2</version>
    </dependency>

    <dependency>
      <groupId>org.freemarker</groupId>
      <artifactId>freemarker</artifactId>
      <version>2.3.20</version>
    </dependency>
  </dependencies>

  <build>
    <resources>
      <resource>
        <directory>src/main/resources</directory>
        <filtering>true</filtering>
      </resource>
    </resources>

    <plugins>
      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>aspectj-maven-plugin</artifactId>
        <version>1.4</version>
        <configuration>
          <showWeaveInfo>true</showWeaveInfo>
          <!-- Java 1.7 specified everywhere the plugin allows it -->
          <source>1.7</source>
          <target>1.7</target>
          <Xlint>ignore</Xlint>
          <complianceLevel>1.7</complianceLevel>
          <encoding>UTF-8</encoding>
          <verbose>true</verbose>
          <!-- the Flow SDK aspects live in the AWS SDK jar -->
          <aspectLibraries>
            <aspectLibrary>
              <groupId>com.amazonaws</groupId>
              <artifactId>aws-java-sdk</artifactId>
            </aspectLibrary>
          </aspectLibraries>
          <!-- only weave the workflow and activity classes -->
          <sources>
            <basedir>src/main/java</basedir>
            <includes>
              <include>com/xxx/yyy/workflow/*.java</include>
              <include>com/xxx/yyy/workflow/activities/*.java</include>
            </includes>
          </sources>
        </configuration>

        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>test-compile</goal>
            </goals>
          </execution>
        </executions>

        <!-- pin the AspectJ version used by the plugin itself -->
        <dependencies>
          <dependency>
            <groupId>org.aspectj</groupId>
            <artifactId>aspectjrt</artifactId>
            <version>1.7.3</version>
          </dependency>
          <dependency>
            <groupId>org.aspectj</groupId>
            <artifactId>aspectjtools</artifactId>
            <version>1.7.3</version>
          </dependency>
        </dependencies>
      </plugin>

      <plugin>
        <groupId>org.jacoco</groupId>
        <artifactId>jacoco-maven-plugin</artifactId>
        <executions>
          <execution>
            <id>prepare-agent</id>
            <goals>
              <goal>prepare-agent</goal>
            </goals>
            <configuration>
              <!-- keep the JaCoCo agent away from the AspectJ classes -->
              <excludes>
                <exclude>**/aspectj/*</exclude>
              </excludes>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

The key bits are using the correct versions of AspectJ and the maven plugin, correctly specifying Java 1.7 everywhere possible, and then telling the JaCoCo plugin which classes to exclude when attempting to instrument. This solution is not perfect, and I don’t expect it to be the final solution, as it results in JaCoCo cheerfully reporting that classes generated by the Flow SDK annotation processor have no coverage, but it’s better than a poke in the eye with a decaying ferret.
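
One possible way to quieten the report, offered as an untested sketch rather than a verified fix: the JaCoCo report goal takes its own excludes list of class files, so the generated classes could be filtered out at reporting time. The class name patterns below are assumptions about what the Flow SDK annotation processor generates, so check target/classes for the real names first. The parent pom may also already bind the report goal, in which case only the configuration element would be needed.

```xml
<plugin>
  <groupId>org.jacoco</groupId>
  <artifactId>jacoco-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>report</id>
      <goals>
        <goal>report</goal>
      </goals>
      <configuration>
        <excludes>
          <!-- assumed names of Flow SDK generated classes; adjust to match target/classes -->
          <exclude>**/*ClientImpl.class</exclude>
          <exclude>**/*ClientFactoryImpl.class</exclude>
          <exclude>**/*ClientExternalImpl.class</exclude>
          <exclude>**/*ClientExternalFactoryImpl.class</exclude>
        </excludes>
      </configuration>
    </execution>
  </executions>
</plugin>
```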

Eclipse, how I loathe thee

For something which is theoretically the industry standard, Eclipse remains profoundly buggy and unstable. Of course, the argument is that the reason the user experience is a nightmare of crashes and weird behaviour is that it’s the plugins that are broken, but that’s kind of like saying a car is substandard only because it has a lousy engine and bald tires.

My latest adventure was trying to configure the Checkstyle plugin. As soon as I tried to configure it, it would present me with the wonderfully meaningful error message:

Unhandled event loop exception – No more handles

Now, if you google for the checkstyle plugin with that message, you will see several thousand developers have had the same problem. The wonderfully obvious solution (can you detect the sarcasm here?) for 64-bit Ubuntu 13.04 was to do two things: install libwebkitgtk, and hack the eclipse.ini.

sudo apt-get install libwebkitgtk-1.0-0

and then in eclipse.ini add the following beneath the -vmargs invocation:

-Dorg.eclipse.swt.browser.DefaultType=webkit

For reference, the whole eclipse.ini now looks like this for me

-startup
plugins/org.eclipse.equinox.launcher_1.3.0.v20130327-1440.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.200.v20130521-0416
-product
org.eclipse.epp.package.jee.product
--launcher.defaultAction
openFile
-showsplash
org.eclipse.platform
--launcher.XXMaxPermSize
256m
--launcher.defaultAction
openFile
--launcher.appendVmargs
-vmargs
-Dosgi.requiredJavaVersion=1.7
-Dorg.eclipse.swt.browser.DefaultType=webkit
-Xms3072m
-Xmx3072m
-XX:+UseParallelGC
-XX:PermSize=256M
-XX:MaxPermSize=512m

The triumph of open source.

Eight, Eight and Eight

The great fight for labour relations and working conditions in the 19th Century was for the Eight Hour Day.

Eight hours work, Eight hours sleep, Eight hours leisure. That was the target, and for most people in the industrialised west this was achieved in the 1970s and 80s. Generally.

So what has happened since, in our hyper-connected, always-on world? I haven’t worked less than 9 hours a day for the last two years, and the commute is added to that.

So now it’s more like: Ten hours work, Two hours commute, Six hours sleep, Four hours managing banks, mortgages, taxes, insurance. Knock yourself out, the other two hours are for leisure, as long as you don’t shop, maintain the house or bathe.