Archive for the 'Programming' Category

Splitting Files By Column Value Using Awk

Thursday, August 9th, 2012

At the day job a data fairy gives me a giant pipe delimited text file that contains data for a bunch of our customers. The customer ID is contained in one of the columns. Ideally I'd like to have one file per customer but it's usually very difficult to get data fairies to do the things you want.

For reference here's a reasonable facsimile of what the file looks like. Let's pretend this is some sort of interesting survey. Bonus points if you can figure out a question that would make sense for these answers.

FIELD1|FIELD2|CUSTOMER|FIELDN
"Once in college but it wasn't my idea."|3|"CUST1"|"blah blah"
"Like your mom."|14|"CUST2"|""
"Blame it on the dog."|15|"CUST1"|"Frankenberry"
"That wasn't chicken."|9|"CUST2"|"Definitely the mouth."
"Never professionally"|26|"CUST3"|"And then she stepped on the ball!"

What we want is three files: one for each customer. We drop the split file in a different directory for each customer to keep things a little neater and we name the file with the customer code prepended to the original file name. All nice and orderly.

As with many things involving text files this winds up being stupid easy using Awk. I'm showing it here mostly so I can find it again and because this type of command line file processing always makes me giddy. The comments should do a good enough job of explaining things.

#! /usr/bin/awk -f
BEGIN {
  # CUSTOMER is the 1-based number of the column holding the customer code,
  # passed in with -v.  Note: no $ here; in BEGIN there's no record yet and
  # we want the variable itself, not a field.
  if (CUSTOMER < 1) {
    print "Usage: split -v CUSTOMER=[split column] [files]";
    exit 1;
  }

  # Set the input and output field delimiters
  FS="|";
  OFS="|";
  "mkdir -p split" | getline;
}

{
  # If this is the first line of a file...
  if (FNR==1) {
    # Grab the entire first row as the header
    header=$0;

    # Close open files from the previous file (if any)
    for(customer in customers) {
      close(customers[customer]);
    }
 
    # Clear the array of customers / output files   
    delete customers;
  }

  if (FNR!=1) {
    # Grab the customer code and strip out the quotes
    customer=tolower($CUSTOMER);
    gsub(/"/, "", customer);

    # Store the output file name.  This is the customer code followed 
    # by the original file name.
    outputFile="split/" customer "/" customer "_" FILENAME;

    # If this is the first time in this file we've seen this customer code...
    if (!(customer in customers)) {
      system("mkdir -p split/" customer);

      # Overwrite any previous output file and print the header
      print header > outputFile; 
      # Track the fact that we've seen this customer code and store the output file
      customers[customer]=outputFile;
    }

    # Append the current line to the output file
    print >> outputFile;
  }
}

I'm sure someone could do this more succinctly and without some of the odd things I've done in there (maybe parameterize the delimiters or the output directory structure), but I kind of like it. It's already proved useful for a number of other cases for me. Also the fact that it's relatively tiny and super fast is all the answer I need if one of the co-workers asks why I didn't write it in Java.
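For what it's worth, the parameterizing isn't much extra work. Here's a self-contained sketch of the same idea as a one-liner, with the column number and output directory passed in with -v flags (the survey.txt contents are made up, customer code in column 3, output under split/):

```shell
# Fake input file so the example stands on its own.
printf 'F1|F2|CUSTOMER|F3\n"a"|1|"CUST1"|"x"\n"b"|2|"CUST2"|"y"\n' > survey.txt

# Same logic as the script above: remember the header, strip quotes from the
# customer code, make a directory per customer, write header + rows.
awk -F'|' -v col=3 -v dir=split '
  FNR==1 { header=$0; next }
  { c=tolower($col); gsub(/"/, "", c);
    f=dir "/" c "/" c "_" FILENAME;
    if (!(c in seen)) { system("mkdir -p " dir "/" c); print header > f; seen[c]=1 }
    print >> f }' survey.txt
```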


He's Got This Ultimate Set of Tools

Sunday, May 6th, 2012

"Relax, all right? My old man is a television repairman, he's got this ultimate set of tools. I can fix it." If you don't remember your Fast Times at Ridgemont High quotes you're probably not alone. The scene is worth remembering because the context is ridiculous. So it is sometimes with software development. The cost and effort of fixing the existing implementation is sometimes just too great. The changes cut too deep. You're better off throwing out the current stuff and starting from scratch.

In software development you rarely understand your problem domain perfectly, if ever. You learn what your customers want through trial and error. Sometimes your organization has made such poor attempts at delivering the product people want that you can't help but throw away what you've currently got and try again with what you learned from your previous attempt.

Managers usually hate to hear such talk from developers. Developers always want to rewrite things. But in some rare cases they're absolutely right. Refactoring is great if you're even remotely close to what you want to do. But what if your product is built on bad assumptions of epic proportions?

Could CVS have been refactored incrementally to arrive at git? Could Windows have been refactored to create Linux? Could Mac OS have been refactored to create OS X? Could Internet Explorer have been refactored to create Chrome? When do you come to the realization that what you want, what you need, is so far away from what you have that you can't get there from here? When is the cost of making changes to your current product so artificially inflated by technical debt and faulty abstractions that it's better to throw it all away?

That's the advantage your competition has. You've shown them your near miss at a great product. If the people in your organization advocating a rewrite were magically transported into a competing startup that was creating a competing product from scratch would you be at all worried? If the answer is "yes" then you should use the advantages you have (those very same people plus a more intimate knowledge of the problem domain and where you went wrong) and do something about it. Plus if something in your product actually proves useful you can copy and refactor it into the new product.

There are certainly risks but the rewards are incredible.


Autowiring Jackson Deserializers in Spring

Wednesday, May 2nd, 2012

Recently I was working in a Spring 3.1 controller for a page with a multi-select of some other entity in the system. Let's say an edit user page that has a User object for which you're selecting Role objects (with Role being a persistent entity with an ID). And let's further say that I'm doing some fancy in place editing of a user within a user list so I want to use AJAX and JSON to submit the user to the server, for whatever reason (probably because it's rad \oo/).

Okay, now that we have our contrived scenario, I want to serialize the collection of roles on a user so that they're a JSON array of IDs of said roles. That part is pretty easy. Let's just make all of our persistent entities either extend some BaseDomainObject or implement some interface with getId and then write a generic JSON serializer for Jackson:

package com.runningasroot.webapp.spring.jackson;

import java.io.IOException;
import org.codehaus.jackson.JsonGenerator;
import org.codehaus.jackson.JsonProcessingException;
import org.codehaus.jackson.map.JsonSerializer;
import org.codehaus.jackson.map.SerializerProvider;
import org.springframework.stereotype.Component;
import com.runningasroot.persistence.BaseDomainObject;

@Component
public class RunningAsRootDomainObjectSerializer extends JsonSerializer<BaseDomainObject> {

    @Override
    public void serialize(BaseDomainObject value, JsonGenerator jgen, SerializerProvider provider) 
            throws IOException, JsonProcessingException {
        jgen.writeNumber(value.getId());
    }
}

Awesome if that's what I want. We'll assume it is. Now if I submit this JSON back to the server I want to convert those IDs into real live boys, er, domain objects. To do this I need a deserializer that has access to some service that can find a domain object by ID. I'll leave figuring out ways to genericize this for multiple domain objects as an exercise for the reader because frankly that's not the part I'm interested in.
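That said, here's a rough sketch of the single-entity version so you can see the shape of the thing. RoleService and its findById method are made up for illustration (only the lookup-by-ID part matters), and this only works once the HandlerInstantiator business below is wired up, since Jackson won't inject anything on its own:

```java
package com.runningasroot.webapp.spring.jackson;

import java.io.IOException;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonProcessingException;
import org.codehaus.jackson.map.DeserializationContext;
import org.codehaus.jackson.map.JsonDeserializer;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class RoleListDeserializer extends JsonDeserializer<Role> {

    // Injected by Spring because our HandlerInstantiator asks the
    // ApplicationContext for this class instead of newing it up.
    @Autowired
    private RoleService roleService;

    @Override
    public Role deserialize(JsonParser jp, DeserializationContext ctxt)
            throws IOException, JsonProcessingException {
        // Each element of the JSON array is a bare ID; swap it for the real entity.
        return roleService.findById(jp.getLongValue());
    }
}
```

Used with contentUsing on a Collection&lt;Role&gt; setter, Jackson calls this once per array element, so the deserializer only ever deals with a single ID.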

So how do I control how Jackson instantiates deserializers and make sure that I can inject Spring beans into them? You would think it would be very easy and it is. Figuring it out turned out to be unnecessarily hard. The latest version of Jackson has a class for this and even says that's what it's for. So let's make us an implementation of a HandlerInstantiator that is aware of Spring's ApplicationContext. Note that you could do this entirely differently with an interface from Spring but who cares? Here's what I did:

package com.runningasroot.webapp.spring;

import org.codehaus.jackson.map.DeserializationConfig;
import org.codehaus.jackson.map.HandlerInstantiator;
import org.codehaus.jackson.map.JsonDeserializer;
import org.codehaus.jackson.map.JsonSerializer;
import org.codehaus.jackson.map.KeyDeserializer;
import org.codehaus.jackson.map.MapperConfig;
import org.codehaus.jackson.map.SerializationConfig;
import org.codehaus.jackson.map.introspect.Annotated;
import org.codehaus.jackson.map.jsontype.TypeIdResolver;
import org.codehaus.jackson.map.jsontype.TypeResolverBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationContext;
import org.springframework.stereotype.Component;

@Component
public class SpringBeanHandlerInstantiator extends HandlerInstantiator {

    private ApplicationContext applicationContext;

    @Autowired
    public SpringBeanHandlerInstantiator(ApplicationContext applicationContext) {
        this.applicationContext = applicationContext;
    }

    @Override
    public JsonDeserializer<?> deserializerInstance(DeserializationConfig config,
            Annotated annotated,
            Class<? extends JsonDeserializer<?>> deserClass) {
        try {
            return (JsonDeserializer<?>) applicationContext.getBean(deserClass);
        } catch (Exception e) {
            // Return null and let the default behavior happen
        }
        return null;
    }

    @Override
    public KeyDeserializer keyDeserializerInstance(DeserializationConfig config,
            Annotated annotated,
            Class<? extends KeyDeserializer> keyDeserClass) {
        try {
            return (KeyDeserializer) applicationContext.getBean(keyDeserClass);
        } catch (Exception e) {
            // Return null and let the default behavior happen
        }
        return null;
    }

    // Two other methods omitted because if you don't get the idea yet then you don't 
    // deserve to see them.  phbbbbt.
}

Great, now we just need to hook up a custom ObjectMapper to use this thing and we're home free (extra shit that would probably trip you up as well, included at no extra charge):

package com.runningasroot.webapp.spring;

import org.codehaus.jackson.map.DeserializationConfig;
import org.codehaus.jackson.map.HandlerInstantiator;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.map.SerializationConfig.Feature;
import org.codehaus.jackson.map.annotate.JsonSerialize;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.ApplicationContext;
import org.springframework.stereotype.Component;
import com.fasterxml.jackson.module.hibernate.HibernateModule;

@Component
public class RunningAsRootObjectMapper extends ObjectMapper {

    @Autowired
    ApplicationContext applicationContext;

    public RunningAsRootObjectMapper() {
        // Problems serializing Hibernate lazily initialized collections?  Fix here.
        HibernateModule hm = new HibernateModule();
        hm.configure(com.fasterxml.jackson.module.hibernate.HibernateModule.Feature.FORCE_LAZY_LOADING, true);
        this.registerModule(hm);

        // Jackson confused by what to set or by extra properties?  Fix it.
        this.setSerializationInclusion(JsonSerialize.Inclusion.NON_NULL);
        this.configure(DeserializationConfig.Feature.FAIL_ON_UNKNOWN_PROPERTIES, false);
        this.configure(Feature.FAIL_ON_EMPTY_BEANS, false);
    }

    @Override
    @Autowired
    public void setHandlerInstantiator(HandlerInstantiator hi) {
        super.setHandlerInstantiator(hi);
    }
}

Now you just have to tell everything to use your custom object mapper. This can be found elsewhere on the web but I'll include it here in case of link rot:

package com.runningasroot.webapp.spring;

import javax.annotation.PostConstruct;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.converter.HttpMessageConverter;
import org.springframework.http.converter.json.MappingJacksonHttpMessageConverter;
import org.springframework.stereotype.Component;
import org.springframework.web.servlet.mvc.annotation.AnnotationMethodHandlerAdapter;

@Component
public class JacksonConfigurer {
    private AnnotationMethodHandlerAdapter annotationMethodHandlerAdapter;
    private RunningAsRootObjectMapper objectMapper;

    @PostConstruct
    public void init() {
        HttpMessageConverter<?>[] messageConverters = annotationMethodHandlerAdapter.getMessageConverters();
        for (HttpMessageConverter<?> messageConverter : messageConverters) {
            if (messageConverter instanceof MappingJacksonHttpMessageConverter) {
                MappingJacksonHttpMessageConverter m = (MappingJacksonHttpMessageConverter) messageConverter;
                m.setObjectMapper(objectMapper);
            }
        }
    }

    @Autowired
    public void setAnnotationMethodHandlerAdapter(AnnotationMethodHandlerAdapter annotationMethodHandlerAdapter) {
        this.annotationMethodHandlerAdapter  = annotationMethodHandlerAdapter;
    }

    @Autowired
    public void setObjectMapper(RunningAsRootObjectMapper objectMapper) {
        this.objectMapper = objectMapper;
    }
}

I think you can also perform this bit of trickery inside of an application-context.xml. But whatever works for you works. I think Yogi Berra said that.
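Untested, but the XML flavor would presumably look something like this, assuming Spring 3.1's mvc namespace is declared and that the @Component-scanned mapper ends up with the default bean name:

```xml
<!-- A sketch: wire the custom mapper into the Jackson message converter in XML
     instead of the JacksonConfigurer above.  The ref name is illustrative. -->
<mvc:annotation-driven>
    <mvc:message-converters>
        <bean class="org.springframework.http.converter.json.MappingJacksonHttpMessageConverter">
            <property name="objectMapper" ref="runningAsRootObjectMapper"/>
        </bean>
    </mvc:message-converters>
</mvc:annotation-driven>
```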

Of course you still need to annotate your getters and setters with special Jackson annotations:

@JsonSerialize(contentUsing=RunningAsRootDomainObjectSerializer.class) 
public Collection<Role> getRoles() {
    ...
}

// Some deserializer with some hot Spring injection going on in the back end (if you know what I mean)
@JsonDeserialize(contentUsing=RoleListDeserializer.class)
public void setRoles(Collection<Role> roles) {
    ...
}

So there you have it: an example of a Spring Jackson JSON serializer that serializes the contents of collections of domain objects as an array of IDs and then deserializes JSON arrays of IDs into domain objects to be put into a collection. Say that three times fast.


I Think We're Going to Need a Bigger Box

Tuesday, April 10th, 2012

I was reading this post on the Instagram buyout by Facebook today and it got me to thinking about the benefits of the cloud, DevOps, horizontal scalability (one of my favorites), and well thought out architectures and monitoring.

One of the more interesting things about the $1 billion purchase price is that Instagram has 13 employees and 35 million users. That's just so crazy to me. It also ends up being yet another argument against the "bigger box" method of solving scalability issues. Eventually you cannot simply add more RAM to fix things. Trying to solve your problems that way is like trying to solve world hunger by breeding a single, giant cow.


Let's Just Burn It All Down and Start Again

Saturday, April 7th, 2012

All software sucks to some extent including everything you are working on right now. If you reexamine your code six months from now and don't think it sucks then it probably means you didn't learn anything in those six months. That's the downside of being a software developer. You feel like the code you're working around is some degree of horrible. For the most part you just accept it and try to make incremental improvements to things. If you're lucky you'll work on something that you think is magnificent (and then think it's shit in six months).

But what happens when the code is truly horrific? For example: you wrote your own FTP client, your own templating engine, you have mutating getters, there's database access in your pages and data objects, you cut and paste DDL statements into SQL clients and call it "upgrading the schema", etc. We can argue about whether some of those things are truly bad but from my perspective they're pretty rotten. Throw that into a 100k+ line code base with many active customers and too few developers and then you've got some real fun.

In these situations I can envision a more ideal code base pretty easily. Update the libraries and start using them, fix the schema that no longer matches the problem domain (if it ever did), start pushing things into neat little tiers, get rid of that shitty build, run a continuous integration build server, use Chef or Puppet to manage configuration, scale your shit horizontally and get all elasticy with the cloud, etc. Pretty soon I've built a shining city on the hill in my mind. The only problem is I'm still calf deep in shit and I need to go back to standing on my head just as soon as my lunch break is over.

My solution has always been to burn everything to the ground and start over. It's not a popular position even among software developers. "Let's just slowly fix everything that is wrong," they say. It sounds good but progress on paying down your massive technical debt always seems to take a backseat to a shiny new feature (with its own share of technical debt). Pretty soon you're not even paying the interest on that debt. Nope. Burn it all down. Or at least build a new bridge next to the old bridge and then blow the old bridge up. Maybe you can even be nice enough to divert traffic first.

The "fix in place" crowd always sounds like this to me: "I bought a new motorcycle. It's a Honda. I kind of want a Harley instead. Can you turn it into a Harley while I ride it around? Thanks. xxxooo"

At least I'll always have these rants before the void. Thanks for listening.


Geek TGI Friday's Flair

Monday, September 19th, 2011

TGI Friday's walls are littered with "vintage" wall decor. Red Lobster has old lobster traps and fish photos all over their walls. Then it hit me: geek hangouts need their own brand of wall flair. Why not outdated tech books?

I've got a ton of books on technologies that aren't in widespread use any more. I'd donate them but even Goodwill doesn't want stuff like that. When you think about it, it makes sense. So where do they go? The landfill? I like to pretend I'm much more environmentally friendly than that.

Some hangout for geeks needs to step up and offer a free appetizer or something for anyone who brings in a tech book that was published before, say, 2000. That seems like a reasonable cutoff. Then all the geeky people can laugh at the titles lining the shelves above their tables. "PowerBuilder? Oh, shit! I wrote something in that once!" (Apologies to Sybase, but you really need to give up on that shit.)


Better Programmer Interviews

Thursday, April 14th, 2011

One of my former co-workers wrote some of his thoughts on crappy interview questions as well as some advice on improving the situation. My latest job was the first time I had to write code during the interview. It was interesting although I think the problem was a bit trivial. The thing I liked about it was that it started with an OOAD design question about a specific problem and then segued into you coding your solution. It was kind of a nice "eat your own design dog food while I watch" moment.

However, when I read the post I mentioned above it occurred to me that you might be able to use open source projects to improve on this a bit. The idea I had was to use an open source library on which you depend and have the interviewee either address a bug or add a feature that you've wanted. This can easily be a take home type of question as well. The plus is you get to do a code review on their submission and get a feature you want. The open source software community benefits as well. It's just wins all around, baby!


MP3s and Ratings

Friday, August 13th, 2010

Don't you hate when you put ratings on most of the songs in your massive music library only to find that you need to do it again when you switch players? On Ubuntu I use Banshee which allows you to save ratings to the ID3 tag right in the MP3 file. That means those ratings are available from any Banshee player. Nice.

The problem is that I'm working a contract gig that sort of requires Windows (well, they think they do at least) and I don't fully trust the in-progress port of Banshee to Windows. So, I'm using iTunes (which I hate). I think it'd be nice if other players could read that same custom ID3 tag to get the ratings, but I realize that many people have an issue with subjective information (the ratings) being stored in a repository meant for common, supposedly objective information about the song itself. Then there's the whole issue of standardizing on the custom tag. In a perfect world more stuff would use a plugin based design and you could simply write an extension to get the ratings from wherever you wanted.

A simple import / export to an agreed upon format could also sort of solve the problem but you can't get people to agree on things and you would then have some annoying synchronization issues. I think it'd be swell if something like last.fm acted as that song and ratings repository since they're a bit of a de facto standard supported by most MP3 players. It seems simple to stick the rating in there when you scrobble whatever you're listening to. Then it's just a hop, skip, and a jump to an import / export to get up and running. It also feels like it'd add some value to their existing service. Somebody get on that…


Recreating Foreign Keys in MySQL

Tuesday, October 20th, 2009

The short version of this story is that I had a test server that was inadvertently configured to use the MyISAM engine of MySQL. This engine doesn't support foreign keys. It will quietly ignore your attempts to add them. I meant to use the InnoDB engine (which does support foreign keys). Of course, who hasn't done that? Am I right?

I fixed the engine problem quickly enough. Next I wanted to take a version of our production / dev / whatever database that had the foreign keys and export the necessary "alter table" statements to add them to the fixed version of the test database. I couldn't find anything so I whipped up this SELECT statement to generate a script based on my limited understanding of MySQL. If it helps someone else then great.

SELECT concat('ALTER TABLE `', table_name,
       '` ADD CONSTRAINT `', constraint_name,
       '` FOREIGN KEY (`', column_name,
       '`) REFERENCES `', referenced_table_name,
       '`(`', referenced_column_name, '`);')
FROM information_schema.key_column_usage
WHERE referenced_table_name IS NOT NULL
  AND constraint_schema = 'ourserverdb'
ORDER BY table_name, column_name;

This of course results in a whole bunch of rows of the form:

ALTER TABLE `licensekeys` ADD CONSTRAINT `FK_keysIssuerId__appuserId` FOREIGN KEY (`issuer_id`) REFERENCES `app_user`(`id`);
ALTER TABLE `subscription` ADD CONSTRAINT `FK_subscription_entity_group_id__entityGroupId` FOREIGN KEY (`entity_group_id`) REFERENCES `entityGroup`(`id`);
ALTER TABLE `user_role` ADD CONSTRAINT `FK_userRoleRoleId__roleId` FOREIGN KEY (`role_id`) REFERENCES `role`(`id`);

From there it's just a little copy / paste into the MySQL command prompt and I'm done. Incidentally mysqldump with the --no-data flag didn't do quite what I wanted since the foreign key creation is in the middle of a CREATE TABLE statement. There are surely other ways to do this but this is what worked for me.


Shell Scripting Madness

Friday, October 16th, 2009

Every now and then I bask in the beauty of the simple things. I'm not talking about children smiling, flowers, or any of that other crap. Shell scripting, baby! Today I had to move some SQL statements in some XML document into a Java class. So I needed to change this (which I didn't write):

SELECT CASE
    WHEN primaryStartAge < 20  THEN ' 0 to 19'
    WHEN primaryStartAge BETWEEN 20 AND 29 THEN '20 to 29'
    WHEN primaryStartAge BETWEEN 30 AND 39 THEN '30 to 39'
    WHEN primaryStartAge BETWEEN 40 AND 49 THEN '40 to 49'
    WHEN primaryStartAge BETWEEN 50 AND 59 THEN '50 to 59'
    WHEN primaryStartAge BETWEEN 60 AND 69 THEN '60 to 69'
    WHEN primaryStartAge > 70 THEN '70 and up'
    END as "Primary Start Age Range",
    count(1) as "Count" FROM analyticsResults
    WHERE calculatorType like ?
    GROUP BY CASE
    WHEN primaryStartAge < 20  THEN ' 0 to 19'
    WHEN primaryStartAge BETWEEN 20 AND 29 THEN '20 to 29'
    WHEN primaryStartAge BETWEEN 30 AND 39 THEN '30 to 39'
    WHEN primaryStartAge BETWEEN 40 AND 49 THEN '40 to 49'
    WHEN primaryStartAge BETWEEN 50 AND 59 THEN '50 to 59'
    WHEN primaryStartAge BETWEEN 60 AND 69 THEN '60 to 69'
    WHEN primaryStartAge > 70 THEN '70 and up'
    END
ORDER BY 1 ASC

to something like this (which I still didn't write):

"SELECT CASE "
            + "    WHEN primaryStartAge < 20  THEN ' 0 to 19' "
            + "    WHEN primaryStartAge BETWEEN 20 AND 29 THEN '20 to 29' "
            + "    WHEN primaryStartAge BETWEEN 30 AND 39 THEN '30 to 39' "
            + "    WHEN primaryStartAge BETWEEN 40 AND 49 THEN '40 to 49' "
            + "    WHEN primaryStartAge BETWEEN 50 AND 59 THEN '50 to 59' "
            + "    WHEN primaryStartAge BETWEEN 60 AND 69 THEN '60 to 69' "
            + "    WHEN primaryStartAge > 70 THEN '70 and up' "
            + "    END as \"Primary Start Age Range\", "
            + "    count(1) as \"Count\" FROM analyticsResults "
            + "    WHERE calculatorType like ? "
            + "    GROUP BY CASE "
            + "    WHEN primaryStartAge < 20  THEN ' 0 to 19' "
            + "    WHEN primaryStartAge BETWEEN 20 AND 29 THEN '20 to 29' "
            + "    WHEN primaryStartAge BETWEEN 30 AND 39 THEN '30 to 39' "
            + "    WHEN primaryStartAge BETWEEN 40 AND 49 THEN '40 to 49' "
            + "    WHEN primaryStartAge BETWEEN 50 AND 59 THEN '50 to 59' "
            + "    WHEN primaryStartAge BETWEEN 60 AND 69 THEN '60 to 69' "
            + "    WHEN primaryStartAge > 70 THEN '70 and up' "
            + "END "
            + "ORDER BY 1 ASC";

I could copy and paste and fix it manually, use a text editor with regex search and replace, or something equally bland. Since it was Friday, though, I decided to treat myself and do it from a Cygwin shell. This got me close enough and made me giddy with satisfaction:

getclip |sed -e 's/"/\\"/g' -e 's/^/"/g' -e 's/$/ " +/g' |putclip

This grabs the contents of the clipboard, replaces all quotes with escaped quotes, replaces the beginning of each line with a double quote, and replaces the end of each line with a space / double quote / space / plus combo. It then sticks it back into the clipboard. It's not fancy, it could be better, but it was a minor bright point. And thanks to Cygwin it happened in Windows. Sort of.
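If you're not on Cygwin (no getclip / putclip), the sed part works the same on plain stdin, which is also a handy way to sanity-check it:

```shell
# Feed a couple of SQL-ish lines through the exact sed pipeline from above,
# minus the clipboard plumbing.
printf 'SELECT x\nFROM "t"\n' | sed -e 's/"/\\"/g' -e 's/^/"/g' -e 's/$/ " +/g'
# "SELECT x " +
# "FROM \"t\" " +
```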
