OpenRefine in Action: Working with Strings

OpenRefine has already proved a valuable tool in the workplace. Days after Tom’s first session I produced a set of CSV files with usage data from Harvard’s library reporting system. I wanted to upload and share the data in a Google spreadsheet, but first had to modify the date information to enable sorting and filtering.

First I tried the Text Import Wizard in Excel.  The resulting spreadsheet was a mess; much manual clean-up would have been needed to make the data presentable. (It might have worked if I hadn’t missed Lynn Cherney’s Excel training!)

OpenRefine handled the import much better. Data lined up in neat columns, and it was easy to apply text faceting to review the content for errors.

With the project created, I started working on the date columns. One of the options under Edit cells > Common transformations is ‘To date’. Selecting ‘to date’ transformed days and months from names into numbers. For example Monday, June 16, 2008 became 2008-06-16. The transform worked very well, but it was not complete. For some reason there were still several rows on each page that were not changed.

In a small data set it would be easy to adjust a few cells manually.  Moving the mouse to a cell and clicking the blue edit button brings up a list of data types. I selected date from the list, clicked Apply and the date was transformed. But there were too many rows to do the transform manually.

I used ‘Undo’ to restore the dates to text, then decided to try the custom transform functions. I read through the documentation on string functions, experimented, and used ‘Undo’ a lot.  I found in OpenRefine Recipes an example where split and join were used to extract and reorder text segments. Using the recipe as a model, I created an expression that removed the first segment containing the day text and then joined the remaining two segments with a  space “ “ -as shown below.

Time stamp information was  then removed from the end of the modified string using an expression that Tom taught us in class: value[0, -11] 

The substring function also came in handy.  Again working from the Recipes, I used substring to extract the first six characters of the text and then combine them with the 4-digit year at the end. Substring worked well in this context because each element was a fixed length.

Creating the expressions took time, but definitely less time than manual clean-up, and ultimately less time than other available tools  because the expressions were applied to multiple files. As the learning curve levels out OpenRefine is becoming a real time-saver!

Leave a Reply