Jeddi Sea Org

One of my clients recently asked me if I could 'just simply add some column headings to the CSV'.

They were referring to a CSV file produced by the INTO OUTFILE feature of MySQL, and I naively wandered off to find the option to enabled that handy feature.

Alas, no option. No feature. No enabling. No headings. It turns out that lots of people (well, for small values of lots) ask how to do this periodically ... and the answer is seldom 'here, use this code'.

So ... here, use this code.

First, the caveats.

Obviously this is an 'as is' lump of code. If you can't understand what it's doing and the many, many ways that it may go horribly wrong, you probably shouldn't use it. This isn't stackoverflow.

Your SQL script needs to be stored in a file, and must meet these requirements:

The SELECT and FROM must be on their own lines
Every element you're selecting must be on its own line
Every element you're selecting must have the AS sensible_name construct (the sensible_name is the csv heading)

And here's the code fragment.

#!/bin/bash
#
# 2013-05
# jedd

# Determine the first line of a CSV file based on the MySQL
# fragment that will produce the csv data.

# Requirements for the MySQL script file:
#  *  SELECT and FROM are on their own lines (optional leading whitespace)
#  *  every element being selected is on its own line
#  *  every element being selected has an AS var_name

# Example:
#
# SELECT
#   table1.alpha  AS   alpha,
#   table2.bravo  AS   bravo,
#   ...
# FROM
#   ...

# TODO
#  * sanely cope with columns lacking an AS 

# ----------------------------------------------------------------------------
# We want a single parameter - the SQL file to analyse
if [ $# -ne 1 ]; then
  echo "Wrong number of parameters."
  echo "Usage: `basename ${0}` SQLFILE.sql > firstlineofcsv"
  echo "Exiting."
  exit
fi

SQLFILE=$1

# Find SELECT and FROM lines then add / minus from each to
# determine line number range of all the elements being selected.
SELECTLINENUMBER=`grep -n "^ *SELECT" ${SQLFILE} | cut -f1 -d:`
FROMLINENUMBER=`grep -n "^ *FROM" ${SQLFILE} | cut -f1 -d:`
RANGESTART=`expr ${SELECTLINENUMBER} + 1`
RANGEEND=`expr ${FROMLINENUMBER} - 1`

# Filter the bejesus out of this file ...
sed  -n ${RANGESTART},${RANGEEND}p  ${SQLFILE}     | \
grep -v "^ *--"                                    | \
grep -v  "^$"                                      | \
grep -v  "^ *$"                                    | \
grep "AS"                                          | \
awk -F"AS" '{print $2}'                            | \
awk -F","  '{print $1}'                            | \
tr -d ' '                                          | \
sed "s/^/\"/"                                      | \
sed "s/$/\"/"                                      | \
tr '\n' ','                                        | \
sed "s/,$/\n/"

# Explanation - first sed, 'p' means print output, then we ignore all 
# comment line, then empty and solely whitespace lines.  Then we filter
# in only lines with 'AS', identifying the portion after the AS, stripping
# the trailing ,, then stripping whitespace left and right, then putting a
# quote at the start, and end, of each field, then turning each newline
# into a comma, ditching the ultimate comma, and adding a newline.

Explanation is provided in the code – it's not rocket surgery by any stretch, and it's easy to adjust to taste. Also I'm quite confident that filtering ugliness could be reduced in size, but I found it easier to debug and adjust in that format. There's little gain to be had from improving the performance of something like this, as you're typically dealing with a few hundred lines at most.

Obviously it's not immediately useful in this form. On my systems I've embedded this code in the cron script that does my extracts.

MySQL's INTO OUTFILE will choke if the OUTFILE already exists, so I produce both the data file, and this headings file, then cat them together into the actual output (and then bzip2 and biabam them off to the lucky winner).

A good-looking MySQL script will be formatted like this. Names have been changed to protect the crufty.

-- 2013-05-07
-- jedd
--
-- Present the ... data (from second tab on the WC web site) into 
-- a format useful for ... 
--
-- This will be accompanied by a vanilla dump (also in csv format) of ...
-- [ firstname , surname , postcode ] to uniquely identify a person.

-- Timestamp here bewilderingly isn't an actual timestamp, but a bigint,
-- so we convert on the fly with : from_unixtime(preferenceTimestamp/1000)

SELECT
  person.id                               AS person_id,

  PreferentialOption.id                   AS po_id,
  PreferentialOption.reference            AS po_reference,

  -- PreferentialOption.feedbackTimestamp AS po_when_raw,
  from_unixtime(PreferentialOption.feedbackTimestamp/1000)    AS po_when_actual,

  SecondTable.firstname                   AS firstname,
  SecondTable.surname                     AS surname,

  SecondTable.postcode                    AS address_postcode

FROM
  ...

The above will generate output that looks like this:

"person_id","po_id","po_reference","po_when_actual","firstname","surname","address_postcode"

Bish bash bosh. Job done.