“(9/22/2014) The spectacles were perpetrated by wizards of a sort. Magicians of Dada, or ta-da or someodd. As I first understood it, their power was mystical or perhaps equinoctial. All I knew was that their magic was more than illusion, and for them it was something other than magic. It could be studied, taught, and learned. It was, in fact, science.” -Excerpt from Diary of a Novice Data Librarian.
DST4L is underway, beginning with an introduction to linked data, Open Refine, and APIs. Seth and Ruben from Free Your Metadata led the class on day one with assistance from Open Refine’s lead developer, Tom Morris. On day two, Tom delved into the finer points of working with Refine, using examples inspired by the class. We learned about Open Refine’s kit of tools, related technologies, and its strengths and limitations. Hands-on work was an essential part of both classes, so if you’re reading this without having attended in person, make sure to practice on your own.
So what is linked data, how does it relate to Open Refine, and why are either of them important?
Linked data is a concept that librarians are acutely aware of, even if they know it by another name or no name at all. Card catalogs, book indexes, footnotes, and bibliographies can be considered analog versions of linked data. Any particular topic is tied in various ways to a multitude of other topics. We used to record these connections on paper, but now we’re recording them online in vast quantities. Just as there are certain rules for the punctuation and formatting of a citation or MARC 21 record, there is a structure to linked data online. RDF (Resource Description Framework) is that structure. It is worth reading more about RDF, but in-depth knowledge of it isn’t required to take advantage of its benefits. As long as you understand that it’s a set of rules governing how linked data is described on the web, you should be fine.
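To make the idea a little more concrete, here is a minimal sketch in Python of the subject-predicate-object statements (often called triples) that RDF is built on. It is an illustration only: every example.org URI and the objects_of helper are made up for this post, and a real project would use an RDF library and published vocabularies instead.

```python
# Each statement is a (subject, predicate, object) triple.
# All example.org URIs below are hypothetical placeholders.
MOBY_DICK = "http://example.org/book/moby-dick"
MELVILLE = "http://example.org/person/herman-melville"
NYC = "http://example.org/place/new-york-city"

AUTHOR = "http://example.org/vocab/author"
BIRTH_PLACE = "http://example.org/vocab/birthPlace"
TITLE = "http://example.org/vocab/title"

triples = [
    (MOBY_DICK, TITLE, "Moby-Dick"),
    (MOBY_DICK, AUTHOR, MELVILLE),
    (MELVILLE, BIRTH_PLACE, NYC),
]


def objects_of(subject, predicate):
    """Return every object linked to `subject` by `predicate` (hypothetical helper)."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Follow two links in a row: book -> author -> author's birthplace.
author = objects_of(MOBY_DICK, AUTHOR)[0]
birthplace = objects_of(author, BIRTH_PLACE)[0]
print(birthplace)
```

Following one link and then another is the digital equivalent of chasing a footnote to a bibliography entry and then on to the catalog.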
Open Refine is an application designed to manipulate data, and it includes features that take advantage of linked data. Open Refine allows you to work with large data sets (hundreds of thousands or even millions of rows) that are inconsistent or poorly organized. Faceting and filtering allow you to examine your data and find errors. Faceting groups the data in a column by cell value. If a column holds names of cities as strings of text, faceting that column will list each distinct string along with the number of times it appears. The problem with messy data is that sometimes multiple strings of text will represent the same thing with slight variations, for instance ‘NYC’, ‘N.Y.C.’, and ‘nyc’. Once we’ve faceted, one of Open Refine’s powerful data cleaning options becomes available. Clustering is an automated way of collecting strings that the software deems similar to one another so that you can change them to one standard string. You can change ‘N.Y.C.’ and its siblings to ‘NYC’, ‘New York City’, or even ‘The Big Apple’ if you want. The important thing is that they have a unified identity.
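For readers who like to see the mechanics, faceting and clustering can be approximated in a few lines of Python. This is a simplified sketch and not Open Refine’s actual code: the fingerprint function here only lowercases, strips punctuation, and sorts unique words, and the function names and sample city list are invented for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical messy column of city names.
cities = ["NYC", "N.Y.C.", "nyc", "Boston", "NYC", "boston "]


def facet(values):
    """Count how many times each distinct string appears in the column."""
    return Counter(values)


def fingerprint(value):
    """Simplified key: trim, lowercase, drop punctuation, sort unique words."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group distinct strings that share a fingerprint; keep groups of 2 or more."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def standardize(values, members, replacement):
    """Replace every string in `members` with one agreed-upon string."""
    return [replacement if value in members else value for value in values]


print(facet(cities))
clusters = cluster(cities)
print(clusters)
cleaned = standardize(cities, set(clusters[0]), "New York City")
print(cleaned)
```

The facet shows ‘NYC’ twice and each of the other variants once, the cluster step proposes that the three New York spellings belong together, and the final step is where a person decides which standard string they should share.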
The goal of your project may be to display or export your data for other purposes, but Open Refine also enables users to go further and build on their data by linking it. You may have clustered all of your versions of NYC into a standard string, but to a computer the string ‘NYC’ has no more meaning than the string ‘Call me Ishmael’. Linking the text strings in each cell to the web is done through reconciliation or Named Entity Recognition. After this process is complete, ‘NYC’ will be a link that takes you to a page for the city and, more importantly, the value in the cell will be better understood by your computer. With linked data it’s possible to create much more powerful queries than you otherwise could, because you can compare your own data to data stored in collections online.
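The basic idea of reconciliation can be sketched as a lookup that swaps a bare string for an identifier. The sketch below is a toy under stated assumptions: the AUTHORITY and ALIASES tables and the example.org URIs are hypothetical stand-ins for an online collection, whereas real reconciliation asks a web service for candidate matches and lets you review them.

```python
# Hypothetical authority file: preferred label -> identifier (URI).
AUTHORITY = {
    "new york city": "http://example.org/place/new-york-city",
    "boston": "http://example.org/place/boston",
}

# Hypothetical alternate names that point at a preferred label.
ALIASES = {"nyc": "new york city"}


def reconcile(cell):
    """Return the URI matching a cell value, or None when nothing matches."""
    key = cell.strip().lower()
    key = ALIASES.get(key, key)
    return AUTHORITY.get(key)


def reconcile_column(cells):
    """Map every cell value in a column to its URI (or None)."""
    return {cell: reconcile(cell) for cell in cells}


matches = reconcile_column(["NYC", "Boston", "Call me Ishmael"])
for cell, uri in matches.items():
    print(cell, "->", uri)
```

Once ‘NYC’ carries an identifier rather than just three letters, anything else published about that identifier can be pulled in alongside your own rows. ‘Call me Ishmael’ matches nothing and stays a plain string.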
Now that you know why it’s valuable, learn how to use it. Visit Free Your Metadata.
Thanks again to Seth, Ruben, and Tom for their superb guidance and support!