As Chris mentioned in his recent post, we were visited Thursday by James Turk, a developer working on the Sunlight Foundation’s Open States project, which collects legislative data from the states and makes it available on its own site in more ways, and simpler ones, than the states themselves do. James’s presentation gave us an eye-opening glimpse into how little attention most states have given to making their data open and accessible. James described a complicated and messy situation: No states provide their data to Open States directly. Very few states have APIs or databases. Data must be collected from 143 separate websites, most of which date back to the nineties and are written in non-standard HTML. The scraping alone takes 37,000 lines of code. It is a huge undertaking, and the results are impressive.
Over the course of his presentation, James used Open States as a jumping-off point for helpful recommendations on scraping, cleaning, and publishing data. Here, somewhat out of order but largely verbatim, are the main points of advice James had for us at the start of our own scraping projects:
Don’t always let the scope of a problem scare you. If the developers at the Sunlight Foundation had known how difficult it would be to get state data, they might not have tried it.
Learn the language and its ecosystem. Python has one of the richest ecosystems of any programming language, meaning you can find an existing library for most things instead of writing your own.
If scraping frequently, be polite to sites. Don’t cause downtime or other problems for the systems you are scraping.
Save data to an intermediate format, then transform. Limit your scraper to scraping. Keeping the two steps separate will save you time, because you can rework the transformation without scraping again.
Grab everything that’s easy to grab when scraping. You can filter out the parts you don’t need. Trying to remember the logic of a page’s structure ten weeks later is a lot harder than pulling everything the first time through.
Publish both your data and your code. Someone else will probably be interested in your data. People using your code will have improvements for you. ScraperWiki does this well.
Don’t be embarrassed to publish your code! Scrapers are ugly. Publish them anyway.
But don’t publish your API key. If you do happen to publish your key, notify the content provider, even if it was only up momentarily, and especially if the key gives you access to change the database.
If you want your process to be repeatable and modifiable, write code. If, on the other hand, you’re pretty confident that you’re only going to do it once, use a tool like Google Refine.
Use the pipeline approach (“the Unix way”). Create lots of little functions that can be reused in other scrapers, then chain them together. This is good practice in programming generally, and scraping is an ideal problem to apply it to.
Never delete your code! You will have a use for it later.
Join the Python community. The Python community is friendly. Go to PyCon as a newbie (great time!). Join the local Python user group.
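Several of these points (being polite, saving to an intermediate format before transforming, and chaining small functions) fit together naturally. What follows is only a rough sketch of how they might combine, not code from Open States. All function names are made up for illustration, the `fetch` argument stands in for whatever download function you actually use, and the JSON-lines file is just one possible intermediate format.

```python
import json
import os
import tempfile
import time


def fetch_politely(urls, fetch, delay=1.0, sleep=time.sleep):
    """Yield (url, body) pairs, pausing `delay` seconds between requests."""
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # be polite: never hammer the site
        yield url, fetch(url)


def to_records(pages):
    """Grab everything that is easy to grab: the URL and the untouched body."""
    for url, body in pages:
        yield {"url": url, "body": body}


def save_jsonl(records, path):
    """Write one JSON object per line (the intermediate format)."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count


def load_jsonl(path):
    """Read the intermediate file back, one record at a time."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield json.loads(line)


def clean(records):
    """Transform step, kept separate from scraping: collapse whitespace."""
    for record in records:
        yield {**record, "body": " ".join(record["body"].split())}


if __name__ == "__main__":
    # A dictionary plays the role of the website so the sketch runs offline.
    fake_site = {"page-1": "  HB 1:   An act  ", "page-2": "SB 2:\nAnother act"}
    path = os.path.join(tempfile.mkdtemp(), "raw.jsonl")

    # Stage 1: scrape and save. Stage 2: load and transform.
    save_jsonl(to_records(fetch_politely(fake_site, fake_site.get, delay=0)), path)
    for record in clean(load_jsonl(path)):
        print(record)
```

Because each stage is a small generator, the cleaning step can be rewritten and rerun against the saved file as often as needed without touching the site again.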
In addition to this general advice, James recommended several specific resources to take advantage of while writing our code, including the XML Path Language (XPath), the Python libraries scrapelib (for fetching pages) and lxml (for parsing HTML and XML), and the Python Package Index (PyPI). Details on all of this and on James’s experience with Open States can be found in his slides and my notes (PDF, 107 KB) from his presentation and the hour-long discussion that followed. It was an informative and engaging presentation, delivered at just the right time for our projects, on an inspiring example of data wrangling for the public good.