Now that our group projects are underway, we are busily working on code for extracting, comparing, and creating datasets. This has given rise within my own group, and presumably within others, to the question, “where do we store this data so that we can share and interact with it easily?” Chris’s presentation Thursday was intended partially to address this question and partially to discuss data management issues generally.
Chris began his presentation with an overview of two options for using SQL databases to manage and share our data:
- SQLite, which Python supports by default, and
- SQLShare, a cloud-based database platform developed at the University of Washington. See creator Bill Howe’s slides on using SQLShare for collaborative data management.
Chris emphasized that it is not necessary to build a relational database to take advantage of these tools. Simply running SQL queries on tables, or flat files, can be very powerful and provide a lot of flexibility. Chris’s prepared notes (PDF, 68 KB) for the session include resources for learning about these solutions and a basic introduction to interacting with them through Python.
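To make the flat-file idea concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table name and sample rows are invented for illustration; the point is simply that you can run SQL over a single table without designing a relational schema:

```python
import sqlite3

# Invented sample rows standing in for a flat file (e.g., a CSV you've read in)
rows = [("alpha", 1.5), ("beta", 2.5), ("alpha", 3.0)]

conn = sqlite3.connect(":memory:")  # in-memory database; no server required
conn.execute("CREATE TABLE measurements (name TEXT, value REAL)")
conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)

# A plain SQL query over one flat table -- no relational design needed
query = "SELECT name, AVG(value) FROM measurements GROUP BY name ORDER BY name"
for name, avg in conn.execute(query):
    print(name, avg)  # → alpha 2.25, then beta 2.5
```

Swapping `:memory:` for a filename gives you a single-file database you can pass around to collaborators, which is part of SQLite’s appeal for small shared datasets.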
The second part of Chris’s presentation dealt with data management and the current state of data sharing among researchers. Many of the participants in our class are, or will be, involved in data management and data research services for the faculty at their home institutions and are interested in learning about data sharing platforms for that reason. Chris encouraged us to upload our data to a data repository not only for the sake of publishing our data, but also as a means of gaining first-hand knowledge of the data-sharing process researchers experience.
He compared two popular open data-sharing platforms, Figshare and Dataverse:

Figshare
- Uses DOIs as unique identifiers.
- Offers a simple interface and social networking features.
- Requires few metadata fields.
- Has become a lightweight publication solution: researchers are publishing posters and slides as well as datasets.
- Example from the Center for Astrophysics:

Dataverse
- Uses HDLs (Handles) as unique identifiers.
- Allows researchers to set permissions on who can view datasets.
- Can be customized and tailored to an institution’s brand.
- Tracks versions and includes version numbers in auto-generated data citations.
- Examples from the Center for Astrophysics:
Chris explained that, while the NSF, NIH, and other grant-giving institutions are requiring researchers to publish their data and create data management plans, many researchers are protective of their data or won’t publish their data unless it is necessary. Chris views his role as a librarian as that of a collaborator or advisor, rather than an enforcer, and tries to help scientists at the Center for Astrophysics by responding to their data needs.
To read more about all the topics discussed in this post, see my notes (PDF, 92 KB).
. . .
Rahul was on hand to answer our questions during our Saturday hack day, and we had plenty to keep him occupied for the full three hours. While helping one group that was searching for geocoordinates in papers indexed in the ADS, he wrote up a little script that may prove useful to anyone learning to perform searches programmatically through APIs. Rahul’s script, available as a Gist, uses the Requests, JSON, pprint, time, and re libraries to “get full text search results from ADS and try matching degree latitudes.”
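For anyone wondering what “matching degree latitudes” might look like, here is a hedged sketch of just that step using the re library. The pattern and sample text are my own illustration, not taken from Rahul’s Gist, and the live ADS API request is omitted so the snippet runs offline:

```python
import re

# A latitude in decimal degrees: optional sign, 0-90 with optional decimals,
# optionally followed by "deg" or a degree sign and an N/S suffix. This is a
# guess at the kind of pattern such a script might use, not Rahul's regex.
LAT_RE = re.compile(r"[-+]?(?:90(?:\.0+)?|[0-8]?\d(?:\.\d+)?)\s*(?:°|deg)?\s*[NS]?")

def find_latitudes(text):
    """Return candidate degree-latitude strings found in a block of full text."""
    return [m.group(0).strip() for m in LAT_RE.finditer(text)]

sample = "The survey field is centered at 42.36 N, with a second pointing at -71.06 deg."
print(find_latitudes(sample))  # → ['42.36 N', '-71.06 deg']
```

In a real pipeline, `text` would come from the full-text search results returned by the ADS API (fetched with Requests and parsed from JSON), and the candidate matches would still need filtering, since bare numbers in prose can trigger false positives.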