Success! Data a hit at PoliInfomatics Workshop

The data was a big hit at the workshop! People seemed excited about the possibilities, and I got a lot of requests to share the data and scripts, from people as far away as Germany. Excellent. I was also able to speak with the lead data engineer at the Internet Archive, who helped me construct a gold standard for estimating total content on the web, since many of the crawls were incomplete. It turns out that around each election cycle from 2004 to the present, the Internet Archive conducted an exhaustive crawl of the .gov domain between the end of October and the end of January. If I aggregate across these three months, I get an estimate of total content on the .gov domain every two years, which will allow for a more statistically rigorous analysis of the data.
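To make the aggregation idea concrete, here is a minimal sketch in Python. It is not the actual processing script: it assumes the crawl records are available as (URL, capture date) pairs, it approximates the crawl window as November 1 through January 31, and it uses the number of distinct URLs as a stand-in for "total content". The function names `election_cycle` and `unique_urls_per_cycle` are made up for illustration.

```python
from collections import defaultdict
from datetime import date


def election_cycle(capture, first_cycle=2004):
    """Return the election year whose crawl window contains `capture`, or None.

    Assumption: the window is approximated as November and December of an
    even-numbered year plus January of the following year.
    """
    if capture.month in (11, 12):
        year = capture.year
    elif capture.month == 1:
        year = capture.year - 1
    else:
        return None
    if year < first_cycle or year % 2 != 0:
        return None
    return year


def unique_urls_per_cycle(captures):
    """Pool all captures in each window and count distinct URLs per cycle."""
    seen = defaultdict(set)
    for url, capture_date in captures:
        cycle = election_cycle(capture_date)
        if cycle is not None:
            seen[cycle].add(url)
    return {cycle: len(urls) for cycle, urls in sorted(seen.items())}


# Toy records, invented purely for illustration.
captures = [
    ("a.gov/", date(2004, 11, 3)),
    ("a.gov/", date(2004, 12, 1)),   # repeat capture, counted once
    ("b.gov/", date(2005, 1, 15)),   # January belongs to the 2004 cycle
    ("c.gov/", date(2005, 6, 1)),    # outside any window, ignored
    ("a.gov/", date(2006, 11, 10)),
]
totals = unique_urls_per_cycle(captures)
print(totals)
```

Pooling the three months before counting means a page captured several times within one window is only counted once, which is the point of treating the whole window as a single exhaustive crawl.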