Finishing data (Frantically) for the PoliInformatics Conference

I have spent the last three days piping data from Hadoop into the Amazon instance we created. It's 252 GB, which turns out to be a pretty huge chunk of data. PostgreSQL says that it can handle up to 3 terabytes, but Andrew and I are pondering the wisdom of trying to host this amount locally. I'm not sure it will be fast enough and accessible enough to fill its intended purpose of being easier to navigate than Hadoop. In any case, I made the mistake of making my local machine an integral part of the upload chain, so I have had to leave it at the data lab, where there is enough bandwidth to handle the whole process: logging into Hadoop, grabbing each file that mentions climate change, removing files that break a certain set of rules I set (for example, if their content fields contain gobbledygook instead of text), and sending them to the Amazon instance.

The script Andrew and I wrote to do this (mostly Andrew) is pretty cool. It's in Python and on GitHub under “Fetchandingest.py”. I learned a lot about PostgreSQL and Python from writing it, specifically ways to log in and run a command automatically as part of a larger script, either on the Hadoop cluster or from Python, which is a nifty trick. Also, the way the script is set up made it very easy to troubleshoot, and I learned how to print a log of errors into a file, which I will use frequently!
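To give a sense of the shape of that fetch-and-ingest loop, here is a minimal sketch. The HDFS paths, the `documents` table, the connection settings, and the `is_valid()` rule are all hypothetical stand-ins I made up for illustration, not the actual logic in Fetchandingest.py — the real filtering rules and schema are in the script on GitHub.

```python
# Sketch of a fetch-and-ingest loop: pull files out of HDFS, filter out
# garbage records, and insert the rest into PostgreSQL, logging errors
# to a file instead of stopping the run. Table name, paths, filter rule,
# and connection settings are placeholders.
import logging
import subprocess

import psycopg2

# Append errors (with tracebacks) to a log file for later troubleshooting.
logging.basicConfig(filename="ingest_errors.log", level=logging.ERROR)


def is_valid(text):
    """Hypothetical rule: reject records whose content field is mostly
    non-printable junk instead of readable text."""
    if not text.strip():
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / len(text) > 0.9


def fetch_hdfs_file(path):
    """Run an HDFS command non-interactively from Python -- the same idea as
    logging into the cluster and running the command by hand."""
    result = subprocess.run(
        ["hdfs", "dfs", "-cat", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ingest(paths, conn_info):
    conn = psycopg2.connect(**conn_info)
    cur = conn.cursor()
    for path in paths:
        try:
            content = fetch_hdfs_file(path)
            if not is_valid(content):
                continue  # skip files that break the validity rules
            cur.execute(
                "INSERT INTO documents (source_path, content) VALUES (%s, %s)",
                (path, content),
            )
            conn.commit()
        except Exception:
            # Write the error to the log file and move on to the next file.
            logging.exception("Failed to ingest %s", path)
            conn.rollback()
    cur.close()
    conn.close()
```

The part I want to reuse is the error handling: each failure gets written to the log file with a traceback and the loop keeps going, so one bad file never kills a multi-day upload.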