Thursday, June 6, 2013

Moving My Own Little Mountain


Last year, I returned to a question about Toronto's weather that has been bugging me for years.  The problem was that to find the answer, I needed some data - and whilst that data is freely available, it's just not in a form I could easily query.  It was obvious I was going to have to build something pretty big at home, just so I had all the data in one place where I could query it.

So, back in the winter, I wrote a program for the Raspberry Pi (http://www.raspberrypi.org/faqs) to grab the data I needed for 18 locations across Canada, for every day on record, and put it somewhere for me to process at a later date…  Naturally, the Raspberry Pi did this job admirably.  OK, it was slow, but then again, for a $35 computer I wasn't expecting speed.  What the Raspberry Pi did do extremely well was let me teach myself a few things I didn't know before - like how to build with the GCC compiler on a Linux distro that I'd installed from scratch.
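The fetcher itself is nothing clever: just a loop over stations and months that downloads one CSV per station per month.  My actual program was C, built with GCC as above, but the shape of it, sketched here in PHP (the same language as the processing script below), is roughly this.  The endpoint URL, station IDs and year range are placeholders, not the real service:

    <?php
    // Rough sketch only: the real fetcher was written in C, and the URL,
    // station IDs and year range below are placeholders for illustration.
    $stations = array(5097, 31688);            // placeholder station IDs
    $baseUrl  = 'http://example.org/climate';  // placeholder endpoint

    foreach ($stations as $stationId) {
        for ($year = 1950; $year <= 2012; $year++) {
            for ($month = 1; $month <= 12; $month++) {
                $url = sprintf('%s?station=%d&year=%d&month=%d&format=csv',
                               $baseUrl, $stationId, $year, $month);
                $csv = @file_get_contents($url);
                if ($csv === false) {
                    continue; // skip months with no data
                }
                $file = sprintf('data/%d/%04d-%02d.csv', $stationId, $year, $month);
                @mkdir(dirname($file), 0755, true);
                file_put_contents($file, $csv);
                sleep(1); // be gentle on the server (and on the Pi)
            }
        }
    }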

What I ended up with was ~37,000 files to process.  This was a bit of a tall order, especially as I was a new dad with twins, so the entire project had to be put on hold for a bit.

Now that it's almost summer a year on, I finally have some time to deal with this.  First, I had a look at the files.  They are CSVs with 17 lines of junk above the data that has to be stripped out, followed by roughly 720 lines of data per month (one line for each hour, 24 hours a day, multiplied by ~30 days).

So, I created a MySQL database with the corresponding fields for the stations and the readings.  Next, I wrote a simple PHP script to enumerate each folder and, for each file, strip out the header junk, break each line of data into the correct fields, and load these as a record into the readings table.  That works out to roughly 1.5 million records per station.
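A minimal sketch of the loader is below, using PDO.  The table and column names (readings, station_id, reading_time, temperature, humidity), the CSV column positions and the folder layout are illustrative only - the real table mirrors whatever columns the files actually contain - but the 17-line header skip is exactly as described above:

    <?php
    // Sketch of the loader; table/column names, CSV column positions and the
    // folder layout are assumptions for illustration.
    $pdo = new PDO('mysql:host=localhost;dbname=weather', 'user', 'pass',
                   array(PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));

    $insert = $pdo->prepare(
        'INSERT INTO readings (station_id, reading_time, temperature, humidity)
         VALUES (?, ?, ?, ?)');

    $baseDir = '/path/to/csv';                     // one sub-folder per station
    foreach (glob($baseDir . '/*', GLOB_ONLYDIR) as $stationDir) {
        $stationId = basename($stationDir);
        foreach (glob($stationDir . '/*.csv') as $file) {
            $lines = file($file, FILE_IGNORE_NEW_LINES);
            $rows  = array_slice($lines, 17);      // drop the 17 lines of junk
            $pdo->beginTransaction();              // one transaction per file
            foreach ($rows as $row) {
                $fields = str_getcsv($row);
                if (count($fields) < 9) {
                    continue;                      // ignore short/blank lines
                }
                $insert->execute(array($stationId, $fields[0], $fields[6], $fields[8]));
            }
            $pdo->commit();
        }
    }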

Loading a station takes about 5 hours over wifi from my laptop to a Mac Mini server - a little over 80 rows a second - but again, speed isn't the concern here; the point is that I have a way to get the files parsed and loaded with about an hour of programming time invested in a PHP script that I can easily repoint at new folders.

This will likely take a week or so (processing one city whilst I'm away at work, and another through the night as I sleep) to get all 26 to 27 million records into the system.  Then the next step is to augment the data.  The biggest item is to add the day of the week to each reading, so that needs to be done as well.
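The day-of-week step should just be one pass over the table once everything is loaded.  Assuming the timestamp lives in a DATETIME column (I'm calling it reading_time here, matching the sketch above), something along these lines should do it - MySQL's DAYOFWEEK() returns 1 for Sunday through 7 for Saturday:

    <?php
    // Sketch of the augmentation step; the column names are assumptions.
    $pdo = new PDO('mysql:host=localhost;dbname=weather', 'user', 'pass');

    // Add the column once, then backfill it from the existing timestamp.
    $pdo->exec('ALTER TABLE readings ADD COLUMN day_of_week TINYINT');
    $pdo->exec('UPDATE readings SET day_of_week = DAYOFWEEK(reading_time)');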

If all goes well, I can get round to querying this data to get some answers by the end of the month!