Our data science training has officially begun. We’ve got a great group of participants from several academic and nonprofit institutions in the area, including Harvard, Radcliffe, MIT, UMass, Brandeis, Community Change, the Smithsonian, and Simmons.
Thursday’s class was largely about introductions and logistics. Chris gave us a broad overview of the course and the value of data science skills in libraries, and we discussed our own motivations for taking the class. It was exciting for me as a library student to hear how these skills might apply to the many positions and responsibilities of the professional librarians in the group.
On Saturday we installed some tools on our laptops in preparation for the work we’ll be doing in the coming weeks. Our first download was Google Refine, a useful application for quickly cleaning up, viewing, and sharing data. Chris gave us a brief tour of its features using an example XML file generated from the Astrophysics Data System (ADS), a bibliographic database that will serve as the data source for most of our class projects this semester. Then our Python instructor, Rahul Dave, a programmer and computational scientist at Harvard, walked us through the installation of (1) the Enthought Python Distribution (EPD) and (2) the Natural Language Toolkit (NLTK), a natural language processing package available in the EPD repository.
Our homework for this week: work through the first chapter of Natural Language Processing with Python, an O’Reilly publication available for free on the NLTK website.
We’re off to a great start!