On a rainy Thursday morning, Tom Morris (@tfmorris on Twitter) gave the class an introduction to OpenRefine. As is often the case with learning a new technology and using it on a variety of platforms, there were some mild growing pains. We spent the first twenty minutes of class setting up the software. While it would have been nice to jump straight into the finer points of faceting and clustering a couple minutes sooner, hiccoughs like these offer learning experiences that shouldn’t be overlooked. Consulting with our neighbors brought the class together to troubleshoot. As we get closer to tackling the class projects, we’ll need to be comfortable doing problem solving as parts of a team.
OpenRefine (Wikipedia page) is an open source data cleaning application. It is compatible with a variety of formats including: CSV, TSV, fixed width field, JSON, XML, Excel, Google spreadsheets, and RDF. It can access specific files within compressed folders and unzip them for use. It functions by creating a local web server and opening a browser to display the interface. While secure, this setup may drain system memory to the point of needing to be shut down when an overly intense process is attempted. OpenRefine also keeps a log of all work done on a data set. This log can be used to undo and redo transformations or be saved as JSON for use with other data sets. This makes OpenRefine an excellent resource for working with groups of similar files that need similar treatments.
After OpenRefine was installed and we had some background on its development history, we started to learn about faceting. Faceting allows the data to be broken into manageable characteristics for analysis. Facets can be created to filter different aspects of strings, numbers and dates. Custom facets can be made using Google Refine Expression Language (GREL), Jython, or Clojure. Scatter plots are easy to make and illustrate the data. Each column is compared to the others so it is possible to look for patterns at a glance. Narrowing the data set while in scatter plot view can be done using an intuitive click and drag interface on the plot. Text and numeric facets must be used with the appropriate data type. By understanding the function of each type of facet and using them together, it is possible to narrow results to very specific data points.
Faceting is a necessary tool, but clustering performs an essential function when cleaning data that was entered by humans. Clustering searches columns for terms that are similar. When it finds terms that could have resulted from variant spellings, misspellings, fat fingers, or some other error, it displays them and allows the user to select or create a unifying term. The process has customization options and multiple search algorithms but may require a lot of memory. Ngram is used for transpositions, fingerprint excels at word order problems and the two phonetic methods are suited to variant spellings.
OpenRefine is a powerful and free tool. While reading up on it, I got sidetracked learning about Freebase and can’t wait to try my hand at using them together. Thanks again to Tom for coming to teach us!
Click below for a PDF of the original class notes: