Data Digging and Display with Python

“(11/8/2014) The Craftsman worked with speed and precision. He was architect, researcher, tester, and creator. From the translucent python coiled around his arm, he wrought the tools to build his dreams. By his own admission the quiet creature was still mostly mystery. As he worked, the anaconda’s gentle chatter aligned with the appearance of biblical calculations, charts, and even great pandas for heavy lifting. His speed was such that failures were brief, successes frequent, and learning constant. Each challenge was met with forty questions soon answered. I came to realize in watching the motion of the serpent that most of it was in fact invisible and the Craftsman’s genius was not purely in his skill. Knowing which visible facets of the potentially endless tool to use at each moment and quickly uncovering new ones as needed was the true foundation of his virtuosity.”

-Excerpt from Diary of a Novice Data Librarian

Huge thanks to Rahul “The Craftsman” Dave for his excellent instruction during both weeklong sessions of Python training. Congratulations are also in order for those bold Pythonians who started without prior programming experience. The three afternoon classes and day-long hackathon required a significant investment of time, focus, and patience, but the payoff was worth it.

The course introduced some of the fundamentals of programming, then rapidly moved on to advanced topics in statistics and visualization, all while stressing the value of GitHub for version control. The hackathon brought everything together and provided an avenue for participants to experiment on their own projects in a safe and encouraging environment.

After everyone was set up, day one got off to a fast start with examples that used Python’s data analysis library, Pandas, to analyze a dataset from Goodreads. These early demonstrations showed the process of coding and set the class trajectory by giving a sense of Python’s potential.
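A first Pandas demonstration of this kind might look something like the sketch below. The table of book ratings here is entirely hypothetical, standing in for the real dataset used in class:

```python
import pandas as pd

# Hypothetical book-rating records standing in for the class dataset
books = pd.DataFrame({
    "title":  ["Book A", "Book B", "Book C", "Book D"],
    "year":   [1999, 2005, 2005, 2012],
    "rating": [3.9, 4.2, 3.5, 4.8],
})

# A typical first look: summary statistics, then a grouped aggregate
print(books["rating"].describe())
print(books.groupby("year")["rating"].mean())
```

Even two lines like these convey why Pandas was the centerpiece of day one: loading, summarizing, and grouping data takes almost no code.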

On day two, Rahul took a step back and explained some Python basics and how to apply them to building more complex functions, while everyone followed along on their own computers. The class then practiced by doing a text analysis of Hamlet using the NumPy library. We also learned how to use the twitter and Requests libraries to get data from the Twitter and Wikipedia APIs.
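The core of a text analysis like the Hamlet exercise is tokenizing and counting words. A minimal standard-library sketch (the line of verse below stands in for the full play, which in class would have been loaded from a file):

```python
import re
from collections import Counter

# A single famous line standing in for the full text of Hamlet
text = "To be, or not to be, that is the question"

# Lowercase, split on runs of letters, and count word frequencies
words = re.findall(r"[a-z']+", text.lower())
counts = Counter(words)

print(counts.most_common(3))
```

From here, NumPy becomes useful for the numerical side, for example turning the counts into arrays to compute frequency distributions across acts or scenes.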

Day three was the final day of class and returned to Goodreads from a different perspective. This time we learned how to use the PyQuery library for web scraping. We also did a text analysis of Genesis. Finally, Jeremy Guillette gave an example of using Python to do statistical analysis and manipulate GeoJSON. Jeremy taught a valuable lesson about sample size and illustrated his point with a series of maps.
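As a rough illustration of the scraping idea, the sketch below uses the standard library’s `html.parser` in place of PyQuery, and an inline HTML snippet in place of a live page, so that it is self-contained; the element names and links are made up:

```python
from html.parser import HTMLParser

# Inline HTML standing in for a fetched page; the class fetched
# and parsed real Goodreads pages with PyQuery instead.
html = ('<ul><li class="book"><a href="/b/1">Dune</a></li>'
        '<li class="book"><a href="/b/2">Emma</a></li></ul>')

class LinkScraper(HTMLParser):
    """Collect (href, text) pairs for every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None:
            self.links.append((self._href, data))
            self._href = None

scraper = LinkScraper()
scraper.feed(html)
print(scraper.links)
```

With PyQuery the same extraction collapses to a CSS selector such as `d('li.book a')`, which is exactly why the class reached for it.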

The top and bottom 10% of US counties in terms of kidney cancer mortality rates.

Kidney cancer mortality rates by state.

Take a look at Susan Berstler’s post here to learn more about our GitHub session.

Course materials are still available online here.

The hackathon started with a brief final lesson before the class organized itself into teams and started to work on projects. Read on for some of the participants’ own stories about the classes and hackathon, which range from joining data sets, to working with Twitter, to plotting data on book weeding, to rejuvenating inherited code, to web scraping faculty directories, to building a movie selection tool:

During the last four DST4L sessions I not only learned what a data frame is and how to manipulate it, but we also reinforced our Git skills by posting our work on GitHub. I attempted two projects during the project day. The first project tried to join our instructional statistics with course data information. Unfortunately, I wasn’t able to find a common data point to join the two data sets, but I did learn how to load and manipulate over 420,000 lines of data. More importantly, I was able to explore a data set that I couldn’t even open on my laptop before learning Python. For my second project, I started scraping PDF documents from WPI’s website. While I ran a little short on time, I was able to pull down the PDFs programmatically and save them into a folder on my computer. A seemingly small step, but one that will allow us to start looking at first-year student work in an entirely different way.
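The heart of a PDF-harvesting script like this is finding the PDF links on a page and resolving them to full URLs. A minimal sketch, with a made-up page URL and inline HTML standing in for a real fetched page:

```python
import re
from urllib.parse import urljoin

# Hypothetical directory page; in practice the HTML would be
# fetched first, e.g. with requests.get(page_url).text
page_url = "https://example.edu/projects/"
html = ('<a href="report1.pdf">One</a> <a href="notes.html">Notes</a>'
        ' <a href="report2.pdf">Two</a>')

# Find links ending in .pdf and resolve them against the page URL
pdf_urls = [urljoin(page_url, href)
            for href in re.findall(r'href="([^"]+\.pdf)"', html)]
print(pdf_urls)

# Downloading and saving would then be a loop along these lines:
#   for url in pdf_urls:
#       data = requests.get(url).content
#       open(url.rsplit("/", 1)[-1], "wb").write(data)
```

A regular expression is fine for a quick harvest like this; for messier pages an HTML parser is the more robust choice.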

Kristen and Heather
We ventured to analyze tweets authored by a list of Twitter users. We had already imported the twitter module and created a Twitter app. We imported a module called Tweepy, which we found through a web search, to try to access the list. After a good amount of trial and error, we decided to temporarily abandon the idea of accessing the list and just take a look at a single hashtag or user.
To do this we used the Twitter app we had created and wrote a function to authorize each query. We limited the number of queries to 100, but later set other limits. We parsed the search results as JSON and called these “statuses.” We printed all statuses, which produced a very long list of keys and their values. To take a look at the different keys — the actual fields for each status — we printed the keys of just the first status, then printed its values.

We wanted to convert the JSON into CSV, so we looked up a method of doing so on the web. After more trial and error, we were advised by the workshop leader to use Pandas. With Pandas we created a nice data frame and were able to view the data in a table. We printed just the head of the table to take a look. Since there were so many keys for each result, the table had many columns and required a lot of scrolling to view even one status. We played with deleting certain keys, as well as printing specific keys, to make the table more manageable. We defined and printed just one column, text, and converted it to CSV format.
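The JSON-to-CSV step might be sketched as follows. The two statuses below are fabricated, with only a few of the dozens of fields a real Twitter search returns:

```python
import pandas as pd

# Fake statuses standing in for real Twitter search results,
# which carry many more keys per status
statuses = [
    {"id": 1, "text": "Learning pandas at #DST4L", "retweet_count": 3},
    {"id": 2, "text": "Scraping with PyQuery",     "retweet_count": 1},
]

# A list of dicts becomes a data frame directly
df = pd.DataFrame(statuses)
print(df.head())

# Keep just the text column and render it as CSV
csv_text = df["text"].to_csv(index=False)
print(csv_text)
```

Passing a real file path to `to_csv` instead of capturing the string would write the CSV to disk.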

We did not have time to use the numerical stack for analysis, but that is definitely something we will try as we continue to work with this code. We are also still optimistic about finding the right method to analyze a list of users. Thankfully there is a ton of documentation on using Python with the Twitter API. We were grateful for the opportunity to start trying out some of the Python we had learned over the course of the week at Saturday’s DST4L hackathon.

For my Python project, I worked with the same data set that I used for the OpenRefine exercise (and which I demonstrated at our session in October and summarized on the DST4L blog). This set contains “weeding candidates” from my library’s print collection for mathematics. I was curious to see how Python might graph the data via variables such as publication dates, circulation rates, etc., so as to inform both broad policies for de-accession or retention and item-specific decisions. I didn’t get too far with the visualization because I had issues converting the data from Excel to CSV format, but I was able to see the tables within Python and to build new variables based on the data. That in itself was valuable experience.
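Building a new variable from weeding data might look like the sketch below; the five records are invented, and the derived “decade” column is the kind of variable that makes a bar chart of circulation straightforward:

```python
import pandas as pd

# Hypothetical weeding candidates; the real set came from an
# Excel export of the mathematics collection
weeding = pd.DataFrame({
    "pub_year":  [1962, 1975, 1988, 1991, 2003],
    "checkouts": [0, 2, 0, 5, 9],
})

# Derive a new variable: decade of publication
weeding["decade"] = (weeding["pub_year"] // 10) * 10

# Circulation by decade; chaining .plot(kind="bar") would chart it
print(weeding.groupby("decade")["checkouts"].sum())
```

Grouping like this turns item-level rows into the broad patterns a retention policy needs.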

A legacy Python program was handed down to me that was created about seven years ago by a programmer who is no longer at my institution. It scrapes PubMed metadata and transforms it into XML that can be batch uploaded into my institutional repository. The input is a text file of PubMed IDs and the output is an XML file. I want to edit the program to better suit the current workflow and fix some problems. Anna worked with me on this project. Rahul helped us get the script into IPython and present the output in a print statement (in readable “prettyxml”) rather than writing it to a file, for easier troubleshooting. We were able to make some of the changes I need (yay!), but other problems are more difficult. For instance, we discovered that the author data available from PubMed is not structured in a helpful way: each full name is stored in one field rather than as separate last name, first name, and middle initial fields. This makes it difficult to add the metadata to my IR without editing each name by hand, as is necessary now; one of my goals is to minimize this editing if possible. This project highlights the issues of working with a legacy program and with problematic data sources. It also heightened my appreciation for the excellent work the original programmer did!
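A heuristic splitter for a combined name field might look like the sketch below. It assumes a “Last, First Middle” layout, which will not hold for every record — suffixes, multi-part surnames, and initial-only names all need extra care:

```python
def split_author(full_name):
    """Heuristically split a 'Last, First Middle' name into
    (last, first, middle). A sketch only: real bibliographic
    names frequently break this simple pattern."""
    last, _, rest = full_name.partition(", ")
    parts = rest.split()
    first = parts[0] if parts else ""
    middle = " ".join(parts[1:])
    return last, first, middle

print(split_author("Curie, Marie S"))
```

Even an imperfect splitter can cut the hand-editing down to the exceptional cases.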

Basically, my project was to scrape data from the HBS faculty directory on the public site. I was curious to see what information was offered there, so I inspected the elements on the page and practiced scraping at different levels of the elements to see what I would get. When I was happy with what I had (name, title, URL), I printed the three elements that I wanted, and then exported the result into a CSV file. A simple project, but it took me a while to get all the Python scripting right.
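The export step of a project like this is a good fit for the standard library’s `csv` module. The rows below are made up; an in-memory buffer stands in for the output file so the sketch is self-contained:

```python
import csv
import io

# Hypothetical scraped (name, title, url) triples; the real
# values came from the faculty directory page
rows = [
    ("Ada Lovelace", "Professor", "https://example.edu/ada"),
    ("Alan Turing",  "Lecturer",  "https://example.edu/alan"),
]

# In practice, replace the buffer with open("faculty.csv", "w", newline="")
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "title", "url"])  # header row
writer.writerows(rows)
print(buf.getvalue())
```

Letting `csv.writer` handle quoting and delimiters avoids the subtle bugs of joining fields with commas by hand.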

I’m just about done scraping Common Sense Media’s movie reviews for children 5 and under. Next steps: (1) fetch the Rotten Tomatoes rating for each movie, (2) try to identify which ones are available at local libraries (via WorldCat API) or (3) as streaming video through Hoopla/Hulu/Netflix (Guidebox API) and (4) explore and visualize the data.

Thanks for reading!
