Splitting Files By Column Value Using Awk

At the day job a data fairy gives me a giant pipe delimited text file that contains data for a bunch of our customers. The customer ID is contained in one of the columns. Ideally I'd like to have one file per customer but it's usually very difficult to get data fairies to do the things you want.

For reference here's a reasonable facsimile of what the file looks like. Let's pretend this is some sort of interesting survey. Bonus points if you can figure out a question that would make sense for these answers.

FIELD1|FIELD2|CUSTOMER|FIELDN
"Once in college but it wasn't my idea."|3|"CUST1"|"blah blah"
"Like your mom."|14|"CUST2"|""
"Blame it on the dog."|15|"CUST1"|"Frankenberry"
"That wasn't chicken."|9|"CUST2"|"Definitely the mouth."
"Never professionally"|26|"CUST3"|"And then she stepped on the ball!"

What we want is three files: one for each customer. We drop the split file in a different directory for each customer to keep things a little neater and we name the file with the customer code prepended to the original file name. All nice and orderly.

As with many things involving text files this winds up being stupid easy using Awk. I'm showing it here mostly so I can find it again and because this type of command line file processing always makes me giddy. The comments should do a good enough job of explaining things.

#! /usr/bin/awk -f
BEGIN {
  if($CUSTOMER < 1) {
    print "Usage: split -vCUSTOMER=[split column] [files]";
    exit;
  }

  # Set the input and output field delimiters
  FS="|";
  OFS="|";
  "mkdir -p split" | getline;
}

{
  # If this is the first line of a file...
  if (FNR==1) {
    # Grab the entire first row as the header
    header=$0;

    # Close open files from the previous file (if any)
    for(customer in customers) {
      close(customers[customer]);
    }
 
    # Clear the array of customers / output files   
    delete customers;
  }

  if (FNR!=1) {
    # Grab the customer code and strip out the quotes
    customer=tolower($CUSTOMER);
    gsub(/"/, "", customer);

    # Store the output file name.  This is the customer code followed 
    # by the original file name.
    outputFile="split/" customer "/" customer "_" FILENAME;

    # If this is the first time this file we've seen this customer code...
    if(customers[customer]=="") {
      ("mkdir -p split/" customer) | getline;

      # Overwrite any previous output file and print the header
      print header > outputFile; 
      # Track the fact that we've seen this customer code and store the output file
      customers[customer]=outputFile;
    }

    # Append the current line to the output file
    print >> outputFile;
  }
}

I'm sure someone could do this more succinctly and without some of the odd things I've done in there (maybe parameterize the delimiters or the output directory structure), but I kind of like it. It's already proved useful for a number of other cases for me. Also the fact that it's relatively tiny and super fast is all the answer I need if one of the co-workers asks why I didn't write it in Java.

Share

Leave a Reply