We were visited last Thursday by Tom Morris, one of the lead developers on the data refinement tool OpenRefine. OpenRefine (also known simply as Refine and by its former name, Google Refine) is free, desktop-based software described by its developers as a “power tool for working with messy data.” Tom presented an overview of Refine’s many features, including facets, clustering, and reconciliation, and demonstrated some of the transformations that you can perform on your data using the Refine Expression Language. He recommended Refine’s online screencasts and the “Recipes” page on the OpenRefine wiki as useful starting points for learning about what’s possible with the product. Much of what was covered is contained in Tom’s slides and my notes (PDF, 108 KB).
There are things that Excel can do that Refine can’t. For instance, Refine does not let you edit a transformation once you’ve applied it to your cells. If you find you’ve made a mistake, you have to perform your transformation again from scratch. Refine also does not store formulas in its cells, as Excel can, so it is impossible to copy a formula from one cell to another. However, Refine’s built-in functionality for identifying and cleaning up messy data is far more robust; its clustering algorithms alone would make Refine worth trying. In addition, Refine makes your operation history not only visible (so that you can keep track of what you’ve done to your data) but also reusable: if you have a process that you perform repeatedly in the same way, and you are careful to use files with exactly the same number of columns and the same column names, you can extract all the operations you’ve performed on one file and apply them to subsequent files. Refine can even pull content from an API and parse JSON, which is useful when a custom script isn’t possible or necessary for your project.
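For readers curious about what clustering involves, here is a minimal Python sketch of key-collision clustering, loosely modeled on the “fingerprint” keying method described in the OpenRefine documentation. It is an illustration only, not Refine’s actual code: the function names `fingerprint` and `cluster_values` and the sample publisher names are hypothetical, and Refine’s own normalization may differ in its details.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalized key so that near-duplicates collide."""
    # Trim, lowercase, and decompose accented characters, then drop
    # anything that is not plain ASCII (for example, combining accents).
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    # Remove ASCII punctuation, then split on whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    # Sort and de-duplicate the tokens so that word order does not matter.
    return " ".join(sorted(set(tokens)))


def cluster_values(values):
    """Group values that share a fingerprint; keep groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    # A cluster is only interesting if it holds more than one distinct spelling.
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical sample data, invented for this sketch.
publishers = [
    "Oxford University Press",
    "oxford university press.",
    "University Press, Oxford",
    "Elsevier",
]
clusters = cluster_values(publishers)
```

In this sketch the three Oxford variants end up in a single cluster, while “Elsevier” is left alone because nothing else shares its key. In Refine itself you would review each suggested cluster and choose the value to merge into, rather than merging automatically.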
To read about some of the ways that people have applied Refine’s functionality to library-related data, see
- Owen Stephens’s blog post about using OpenRefine for e-journal data, and
- the Free Your Metadata website, which provides tutorials and examples for, among other things, using Refine to clean data and reconcile local controlled vocabularies with LCSH.
Do you see a need for a tool like OpenRefine in your library? If you’ve used it, how has it impacted your work and the services you’re able to provide to your users? When do you choose to use Refine, and when is a different tool more appropriate? We’d love to hear from you.