Past Courses

Recent studies suggest that organizations will soon face a shortfall of skilled talent able to help them take advantage of big data. Meanwhile, government initiatives have encouraged the research community to share its data more openly, raising new challenges for researchers. Librarians can assist in this new data-driven environment. Data Scientist Training for Librarians (or Data Savvy Librarians) is an experimental course offered by the Harvard Library to train librarians to respond to the growing data needs of their communities. In the course, librarians work hands-on through the research data lifecycle, using the latest tools for extracting, wrangling, storing, analyzing and visualizing data. By experiencing the research data lifecycle themselves, becoming data savvy and embracing the data science culture, librarians can begin to imagine how their services might be transformed. See the IDCC14 poster and the article in Information Outlook 18:3 (2014): 21-24.

The course is informal, and meeting times are flexible depending on the needs and wishes of the participants. Use cases and problems covered in the course have an astrophysics focus, but sources outside astrophysics are covered as well. Participants are expected to have at least a basic knowledge of HTML and a little coding experience.


DST4L Round 1:
January 31-May 30, 2013
Thursdays 4-6, Saturdays 10-2

DST4L Round 2:
July 26 - Intro to DST4L
Aug 23-24 - Software Carpentry Bootcamp
Aug 28 - Nov 26 - Classes by T. Morris, R. Dave, L. Cherny
Dec 14 - Projects/Class Ends
Mainly Tues 10-1 & 11-2, also Wed, Thurs (see syllabus)

Past Instructors/Speakers:
Tom Morris, Rahul Dave, Lynn Cherny, David Dietrich, Software Carpentry, James Turk, Gabriel Florit, Christopher Erdmann (Organizer), Erin Braswell, Matt Carroll, Jay Luker, Alex Storer, Seth Woodworth, Raymond Randall


Visit speakers page

Teaching Assistants:

Jennifer Prentice, Simmons GSLIS

Louise Rubin, Harvard-Smithsonian Center for Astrophysics

Jeremy Guillette

James Damon

Brendan Short

Colin Van Alstine

Christine Eslao

Vernica Downey



Course Material:
Software Carpentry Bootcamp
Course 1 Syllabus, Course 2 Syllabus
Course 1 Kit (Login required), Course 2 Notes, Course 2 Pinboard Notes
Course 1 Data Stories
Class Projects
First Course

Second Course

General Course Outline:

  • Overview
  • Data Sources
  • Data Extraction
  • Data Cleansing
  • Sharing
  • Presentation
  • Visualization
  • Advanced Topics

The aim of the course is to upgrade the skill sets of librarians so that they can better serve the data needs of their communities, while also learning how to improve current library workflows and systems. Because the course is experimental, the teaching style is informal and adapts to the needs of the instructors and students. Participants get their hands dirty with data and work with some of the latest technologies currently used by data scientists. As in the first course, the class presents its work to the library community at the conclusion.

Harvard-Smithsonian CfA, 60 Garden Street, Cambridge, MA
Harvard Library, 90 Mt Auburn, Cambridge, MA


basic statistics, web scraping, data wrangling, text extraction, data repositories, PDF extraction, regular expressions, recommender systems, simple machine learning/naive Bayes, static/dynamic visualization


Unix Shell

HTML, JavaScript, XML, JSON, R/RStudio, Python/IPython Notebook

NLTK/scikit, pandas, basic NumPy/matplotlib

Excel, OpenRefine (aka Google Refine), Google Fusion Tables/Charts, Tableau, Gephi, D3

Sublime Text

SQL, NoSQL, MongoDB, SQLShare

Git, GitHub

DataWrangler, BibSoup, ScraperWiki, Dataverse, FigShare, WordPress, Solr, Hadoop, MapReduce, Logstash


See blog for more