Did you know that the Internet Archive (IA) has digitized out-of-copyright texts from some of the world’s most venerable universities and made them freely available to download?
To the astronomy research community this means free access to some of the world’s most important scientific works, such as Isaac Newton’s famed Principia, Galileo’s Dialogo de Cecco di Ronchitti, or Alexander von Humboldt’s Cosmos. These formidable tomes frequently sell for thousands of dollars at auction. It is therefore kind of unbelievable that the Internet Archive not only lets individual users download anything from its 4.5M-item textual corpus, but also happily cooperates with other digital libraries that serve specific research communities, allowing them to ingest content en masse. Did somebody say free stuff?
This past spring semester, the J.G. Wolbach Library at the Harvard-Smithsonian Center for Astrophysics hosted an experimental Data Science for Librarians course in an effort to get the local librarian community trained and proactively involved with data research projects at their respective institutions. Our team’s final project for the course seemed simple at the start:
We were tasked with identifying Astronomy related items in the Internet Archive that could be readily ingested into the SAO/NASA Astrophysics Data System (ADS).
The idea behind this is that the ingested historical literature would fill the serial and monographic gaps in the ADS and give the Astrophysics community increased access to important literature. Sounds easy, right?
We ran into two major obstacles from the get-go: (1) the data elements in the Internet Archive describe “book-like” objects, whereas the ADS bibliographic data primarily describes content at the article level, which meant the data elements in each portal’s metadata schema did not map cleanly onto each other; and (2) historical serials are a web of cataloging tangles that will perplex even the sleuthiest of bibliographic detectives. We realized that we needed ISSN data to accurately match serials in the ADS to those in the Internet Archive — yet we did not have this data.
Despite these roadblocks, we decided to re-frame the scope of our project to focus exclusively on monographic bibliographic records in the ADS and map those records to matches found in the Internet Archive. We agreed to come back to serials at some point in the future, should time permit. The ADS could then incorporate IA’s digitized monographs into its existing records by either ingesting the items into stub records or overlaying existing black-and-white copies with higher quality scans.
(Although we filtered for texts only, occasionally an interesting item would sneak by…)
How we got our initial data set
After a month of trying to figure out what our project was, we commenced with a volley of searches for ADS’s scanned monographs published prior to 1923. The advanced query form and the supplemental FAQ page were invaluable at this stage and we were able to quickly identify the 1,442 records we were interested in. The search returned sweet, sweet XML and we were in the data science business. We were able to get our data from ADS using a simple URL dropped into a web browser. Our next challenge was to compare this data to what we could find in the Internet Archive.
The three best metadata elements gleaned from ADS bib records that we could use to identify possible IA records were authors, publication dates, and titles. But first we needed to clean our data as much as possible before hammering the Internet Archive’s API with requests. The date field needed to be limited to year (removing the initial “n/a” standing for month). Titles and authors needed to be stripped of extraneous punctuation.
Open Refine to the Rescue
We started to dig into the data cleaning with Python, but a heretic in our midst knew that OpenRefine could clean up titles, authors, and dates using its internal expression language, GREL; it could then programmatically create custom queries based on our data elements and fetch the IA data that we wanted. Thus, we decided to take a stab at doing our entire project — from data clean-up to collecting data to data clean-up, again — within OpenRefine. Our group’s approach: why make life harder when there’s a tool that will do all the work for you?
Step 1: Cleaning the ADS Data using GREL expressions
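We did these transforms with GREL inside OpenRefine, but a rough Python equivalent of the clean-up (the sample record below is illustrative, not taken from our actual data set) might look like this:

```python
import re

def clean_date(raw_date):
    """Reduce an ADS date string to the four-digit year alone."""
    match = re.search(r"\b(\d{4})\b", raw_date)
    return match.group(1) if match else ""

def clean_text(raw):
    """Strip extraneous punctuation from a title or author string."""
    cleaned = re.sub(r"[^\w\s]", " ", raw)       # drop punctuation
    return re.sub(r"\s+", " ", cleaned).strip()  # collapse whitespace

# Illustrative record, not from our data set:
record = {"title": "A treatise on astronomy: theoretical & practical.",
          "author": "Woodhouse, R.", "date": "00/1818"}
record = {"title": clean_text(record["title"]),
          "author": clean_text(record["author"]),
          "date": clean_date(record["date"])}
```

The same logic in GREL is a handful of `replace()` and `match()` calls applied as column transforms.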
Step 2: Fetching Internet Archive metadata
Both the Internet Archive and the ADS get their bibliographic records from a diverse set of contributors rather than creating their own. There is no common content standard between the two systems, or even within each individual system, and this inconsistency across metadata standards makes getting exact matches on all fields difficult. We decided that the following search strategies would best balance precision and recall:
| Search | Strategy |
|--------|----------|
| T0DA | Full title entered as a title search, date entered as a date search, and author entered as an author search |
| T8DA | First eight words of the title entered as a title search, date entered as a date search, and author entered as an author search |
| T8_A | Same as T8DA but without the date |
| T8D_ | Same as T8DA but without the author |
| T0 | Full title entered as a title search |
| T8__ | First eight words of the title entered as a title search |
| t8Da | First eight words of the title entered as a keyword search, date entered as a date search, and words in the author field entered as keyword searches |
| t8_a | Same as t8Da but without the date |
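In Python, a strategy like T8DA against the Internet Archive’s advancedsearch endpoint can be sketched as below. The fielded query syntax (`title:`, `creator:`, `date:` with a year range) reflects our assumptions about IA’s Solr-style query language; check the API documentation before relying on it:

```python
from urllib.parse import urlencode

def t8da_query(title, year, author):
    """Build an Internet Archive advancedsearch URL for the T8DA strategy:
    first eight words of the title, plus date and author, as fielded searches."""
    t8 = " ".join(title.split()[:8])
    q = (f"title:({t8}) AND creator:({author}) "
         f"AND date:[{year}-01-01 TO {year}-12-31]")
    params = {"q": q, "fl[]": "identifier,title,creator,year",
              "rows": 10, "output": "json"}
    return "https://archive.org/advancedsearch.php?" + urlencode(params)

# Illustrative values, not a real record from our set:
url = t8da_query("A treatise on astronomy theoretical and practical for the use",
                 "1818", "Woodhouse")
```

Dropping the `creator:` or `date:` clause gives the T8D_ and T8_A variants.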
Using OpenRefine’s handy fetch URLs feature enabled us to query the Internet Archive API using ADS metadata and retrieve results in JSON, all from within our Refine project. We were then able to take this JSON and parse it into digestible morsels. Yum Yum!
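The JSON that comes back is a Solr-style envelope. Parsing it in Python (shown here against a mocked-up response with an invented identifier, rather than a live API call) looks roughly like:

```python
import json

# A mocked-up response in the Solr-style shape the advancedsearch API returns;
# the identifier and record values are illustrative, not real IA data:
raw = json.dumps({
    "response": {
        "numFound": 1,
        "docs": [{"identifier": "treatiseonastro00wood",
                  "title": "A treatise on astronomy",
                  "creator": "Woodhouse, Robert", "year": "1818"}]
    }
})

def parse_ia_response(text):
    """Extract the fields we care about from an advancedsearch JSON response."""
    docs = json.loads(text).get("response", {}).get("docs", [])
    return [(d.get("identifier"), d.get("title"), d.get("year")) for d in docs]

hits = parse_ia_response(raw)
```

In OpenRefine the same extraction is a `parseJson()` call in a GREL transform.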
Step 3: Cleaning-up our fetched Internet Archive JSON
Once we had performed all searches, we combined the responses into a single column (Fig. 6) and parsed the JSON to extract Internet Archive fields (Fig. 7), making sure to retain information about the kind of search that had resulted in the match. For each matching pair of records we tallied up the number of search dimensions the pair matched on over the course of the several searches (T0, D, A, etc.).
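The tallying step can be sketched in Python: for each (ADS record, IA record) pair, accumulate the set of dimensions (title, date, author) matched across all the searches that returned that pair. The bibcodes and identifiers below are illustrative, not real records:

```python
from collections import defaultdict

# Which dimensions each search matches on (T = title, D = date, A = author):
DIMS = {"T0DA": {"T", "D", "A"}, "T8DA": {"T", "D", "A"},
        "T8_A": {"T", "A"}, "T8D_": {"T", "D"},
        "T0": {"T"}, "T8__": {"T"},
        "t8Da": {"T", "D", "A"}, "t8_a": {"T", "A"}}

def tally(matches):
    """matches: iterable of (ads_bibcode, ia_identifier, search_code) tuples.
    Returns {(ads, ia): set of dimensions the pair matched on overall}."""
    scores = defaultdict(set)
    for ads, ia, search in matches:
        scores[(ads, ia)] |= DIMS[search]
    return scores

# Illustrative pairs: itemA matched two complementary searches, itemB only one.
scores = tally([("1818tras.book.....W", "itemA", "T8D_"),
                ("1818tras.book.....W", "itemA", "T8_A"),
                ("1818tras.book.....W", "itemB", "T8__")])
```

A pair that accumulates all three dimensions is a much stronger candidate match than one that only ever matched on title.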
Get our final data set on figshare! We encourage everyone to download and play with the data.
Data wrap-up and lessons learned
The importance of documentation (and fine cooking with recipes). The Internet Archive provided great documentation for its API, which let us quickly shape searches that returned JSON results; from there it was only a matter of learning enough GREL (Google Refine Expression Language) to construct the URL fetches in OpenRefine. Refine also comes with recipes — small pieces of GREL that you can easily adapt into something personally useful. We picked through these examples and were cooking with gas in no time.
OpenRefine’s low barrier to entry. Using OpenRefine, we could see our results immediately, evaluate our success, and adjust our searches to improve results. OpenRefine also sports a built-in JSON parser, which simplified the task of extracting specific data elements from our IA searches. Everything was contained within the OpenRefine project; you could see data transformations immediately and didn’t have to structure and save the output. This allowed us to make quick advances when we were first tackling the messiness of our data.
Four different approaches. OpenRefine works within a web browser, but each project is a locally hosted instance. You can export and distribute a project, but you can’t work on it collaboratively, so each team member had to undertake this stage of the process independently. We weren’t able to use GitHub to coordinate electronically as a group and instead emailed and posted copies of our data sets and projects. When we met in person, we honed our strategies and identified a plan for the challenge of evaluating the “matchiness” of our IA results.
Stay tuned for How to Beat Bibliographic Data into Submission, pt. 2!! In this future post we will go over how to turn an Internet Archive dataset into a pretty data story using a really low barrier tool for newbie data scientists. Awww yeaaahhh…
And now for your enjoyment we leave you with this special link from our dataset.
Dive into Data Links
- Figshare: Internet Archive DST4L Data sets
- SAO/NASA Astrophysics Data System (ADS)
- OpenRefine
- Internet Archive
- GREL string functions
- Jython for cleaning
- R and R Studio
|Jennifer Prentice, TA for the Data Science for Librarians course held in spring semester 2013 at the John G. Wolbach Library at the Harvard-Smithsonian Center for Astrophysics. Currently attending Simmons in pursuit of her MSLIS.|
|Colin Van Alstine, Library Assistant for Harvard Library, Information and Technical Services. Aspiring library data wrangler with a computer science background.|
|Amy Benson, Librarian & Archivist for Digital Projects at Harvard’s Schlesinger Library. Amy brought a wealth of experience to our group and has been at the forefront of digital initiatives at Harvard, working to digitize Schlesinger’s collections, archive the web, and preserve born-digital materials.|
|JJ Ford, CFA Librarian at the JG Wolbach Library. Formerly worked as Digital Collections Librarian for the Biodiversity Heritage Library and still likes to blog about environmental conservation; went to UCLA, then Simmons. She is a Scorpio, Aquarius rising, born in the year of the water boar.|