Joining CSV Files

Have you ever needed to combine data from two csv (comma-separated value) files? Maybe you have transaction data in one, so you know who made payments, and shipping addresses in the other?

I’m sure it’s possible to do by importing the files into Excel and… perhaps writing some formulas or macros and clicking buttons. I’ll leave that as an exercise for the reader. :)

But what if you could join the files, pick out the columns you want, even filter the rows based on certain criteria, all at the command line?

As long as the two csv files have some sort of identifier in common, you can! There is an amazing project called csvkit that provides utilities to do all of this.

From the csvkit documentation: “csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats.”

Example

In this example we’re going to join up the Transactions and Customers CSV exports from Chargify.

(If you’re not already using Chargify you can sign up for a free Developer account, but it will be quite a bit of work to populate a test site with enough data to produce these exports. You’ll probably have more fun using csv files you already have available. :) )

First let’s use the csvcut command to look at the column or field names in each file. Here is the transactions file:

$ csvcut -n transactions.csv
  1: id
  2: created_at
  3: type
  4: memo
  5: amount_in_cents
  6: ending_balance_in_cents
  7: subscription_id
  8: customer_id
  9: customer_name
 10: product_id
 11: success
 12: kind
 13: payment_id
 14: gateway_transaction_id
 15: customer_organization
 16: gateway_order_id

And the customers file:

$ csvcut -n customers.csv
  1: id
  2: first_name
  3: last_name
  4: organization
  5: email
  6: created_at
  7: reference
  8: address
  9: address_2
 10: city
 11: state
 12: zip
 13: country
 14: phone
 15: created_at
 16: updated_at

Note that field #8 in the transactions file is called customer_id. That’s going to match up with field #1 in the customers file, called simply id.

The goal for this step is to create a new csv file that contains all of the lines in the transactions.csv file, with each line having the fields from the appropriate line of the customers.csv file appended to it.

Have a look at the documentation for csvjoin, which we’re going to use to combine these files.

Based on this, we’ll need to specify our filenames, transactions.csv and customers.csv and the two field names, customer_id and id. Note that the order is important – the first field name must be found in the first csv file, and so on.

We should also do a LEFT join, so that all the rows in the transactions file are preserved. In theory, you can’t have a transaction without a customer, but if for some reason the customers file is truncated you wouldn’t want to miss one of the payments simply because it didn’t match a line in the customers file.

So far, our command looks like this:

csvjoin -c "customer_id,id" --left transactions.csv customers.csv

Note that the transactions export contains all of the transactions that were exported. We’re assuming that only Payments were exported, but if we only want the successful payments, we’ll need to filter it further. We can do that by “piping” the output of the join into another csvkit utility called csvgrep. Again, have a look at the documentation for csvgrep.

We’ll use the pipe symbol | which sends the output of one command into another, and then add the csvgrep command:

csvjoin -c "customer_id,id" --left transactions.csv customers.csv | csvgrep -c "success" -m "true" |

Not only can we filter for the successful payments, we can also trim the file down to only the columns we need. You guessed it, there’s a utility for that – it’s csvcut, which we already used to print the names of the columns. Here is the documentation for csvcut again.

Examine the column names above and decide which ones you need.

Once again we’ll use the pipe symbol | to feed the output of one command into the next, and add the csvcut command to the end:

csvjoin -c "customer_id,id" --left transactions.csv customers.csv | csvgrep -c "success" -m "true" | csvcut -c "customer_name,address,address_2,city,state,zip,amount_in_cents,success"

I included the name and address fields, and also the amount and success fields, just as a sanity check that nothing went wrong with the filtering.

Note that I picked the customer_name field from the transactions file, then some address fields from the customers file, then went back to transactions for the amount and success fields. You can re-order the fields in any way you want.

To send the results to a file, append > paid_orders.csv.

csvjoin -c "customer_id,id" --left transactions.csv customers.csv | csvgrep -c "success" -m "true" | csvcut -c "customer_name,address,address_2,city,state,zip,amount_in_cents,success" > paid_orders.csv

You can then use that file for whatever you need, for example a mail-merge, or to send to a fulfillment service.

History

I know I learned about csvkit on Twitter, but I didn’t remember the details. Looking back, I see that Joe Germuska was one of the authors of the original csvcut utility. I ‘know’ Joe from way back at Apache Struts. Never underestimate the power of your social network – without that connection, I might not have heard about this amazing tool!

Next Steps

This is only a small part of what csvkit can do. The next thing I’d encourage you to check out is csvstat which will give you a summary overview of your data. Here is a tutorial on examining data.

Have fun!

Example

History

Next Steps

References