Incubator Blog .gov Data Processing

Week 2

Progress This Week

We have been attempting to access and process the captures from the .gov cluster, hosted by Altiscale, and make it more accessible for social science research. As the person doing most of the coding, I've encountered three main problems.

  1. The cluster is huge and captures were not collected scientifically. As a result, I need a way to test whether what I'm finding (and what’s missing) is a result of what is actually in the cluster, how the data was collected, or as a result of the web itself and how it has evolved and changed. In short, I lacked a mechanism to test for reliability and validity.

  2. Functionality of the scripts. I was using very blunt tools (basically eyeballing the results to see if they passed a "sniff test") to evaluate whether my scripts were doing what I wanted them to do.

  3. The scripts I run are very slow.

This week I've made progress with the second and third. By using a SOCKs proxy to tunnel into the online domain of the cluster, thanks to help from Andrew, I've been able to see what jobs are running and better chose when to run my scripts. As well, I have removed one element that we believe was taking a great deal of time: calling a python function to do a regular expression search for a list of terms. It turns out that while blunt, Apache Pig can be used to do the same. My scripts have speed up as a result. Lastly, I was able to scrap a single WARC file and have data on what is in it, which can now be used to test the efficacy of each script I run. I am still running word counts on this but hope to finish soon.