OpenRefine, a Power Tool for Messy Data: Home

Get started with OpenRefine, a powerful application for cleaning and transforming tabular data.

What is OpenRefine?

OpenRefine is a free, open source application for cleaning and transforming many kinds of data files. Written in Java, it runs locally on Windows, macOS, and Linux, and you work with it through your web browser.

Refine is great for quickly getting an overview of the contents of a data set, resolving inconsistencies, and enhancing it with other data—all in a visual, interactive, and efficient manner.

Exciting Trailer from Google!

Introduction

Refine started as a project called Freebase Gridworks, developed at Metaweb. Google acquired Metaweb in 2010 and relaunched the tool as Google Refine. Official Google support ended in 2012, prompting a transition to the open source project OpenRefine. Google Refine and OpenRefine are essentially the same application, so many tutorials and much of the documentation use the names interchangeably.

 

Google created a series of slick trailers that act as a good introduction to Refine:

Introduction: https://youtu.be/B70J_H_zAWM
Data Transformation: https://youtu.be/cO8NVCs_Ba0
Data Augmentation: https://youtu.be/5tsyz3ibYzk

Refine is very flexible, so if you have anything that can be represented in a tabular format (spreadsheets, databases, XML data, RDF, arrays, data stored in JSON), Refine can help you with it. It is also designed to be extensible, and the community has created numerous specialized plugins and extensions.

Use Cases

If you have messy data, such as:

dates in different formats, numeric data stored inconsistently as text strings, inconsistent categorical data, typos, extra white space, or multi-valued cells

 

Dates            Numbers         Categories
2015-10-14       $1,000          ID
10/14/2015       1000            I.D.
10/14/15         1,000           US-ID
Oct 14, 2015     1000 dollars    idaho
Wed, Oct 14th    US$1000         Idaho,
42291            $1k             Ihaho

“Using OpenRefine by Ruben Verborgh and Max De Wilde, September 2013”

   

 

OpenRefine can help!
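
To give a sense of what reconciling the date variants above involves, here is a minimal Python sketch that tries a handful of formats in turn and converts spreadsheet serial numbers such as 42291. It is only an illustration, not how OpenRefine works internally: the function name `to_iso` and the format list are assumptions made for this example, and the serial-number conversion assumes the Excel 1900 date system.

```python
from datetime import date, datetime, timedelta

# Candidate formats, tried in order (an assumption for this example).
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%b %d, %Y"]

# Day zero commonly used for Excel 1900-system serial dates (assumption).
EXCEL_EPOCH = date(1899, 12, 30)


def to_iso(value):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    value = value.strip()
    if value.isdigit():
        # A bare integer is treated as a spreadsheet serial date.
        return (EXCEL_EPOCH + timedelta(days=int(value))).isoformat()
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # e.g. "Wed, Oct 14th" has no year, so it is left for review


dates = ["2015-10-14", "10/14/2015", "10/14/15", "Oct 14, 2015",
         "Wed, Oct 14th", "42291"]
cleaned_dates = [to_iso(d) for d in dates]
```

Values that cannot be parsed come back as `None` rather than being guessed, which mirrors the interactive approach of reviewing the leftovers by hand.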

Refine use cases include:

  • Clean - discover and fix inconsistencies with faceting, clustering, cell transforms, GREL expressions...
  • Transform - change formats or reshape data by splitting or joining multi-valued cells, splitting columns, transposing columns and rows...
  • Extend - enrich data by combining files, merging projects, fetching URLs, reconciling with online databases...
  • Automate - reuse your processing routine by exporting the operation history as JSON!
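
The clustering mentioned under "Clean" can be sketched in a few lines of Python. The example below is a simplified take on the key-collision ("fingerprint") idea: values that reduce to the same normalized key are grouped as likely variants of one another. The function names `fingerprint` and `cluster` are made up for this illustration, and the real feature in OpenRefine may normalize values differently.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified key: trim, lowercase, strip accents and punctuation,
    then sort the unique whitespace-separated tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = "".join(
        ch for ch in text
        if not unicodedata.combining(ch) and ch not in string.punctuation
    )
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values whose keys collide; keep groups with 2+ distinct variants."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return {key: members for key, members in groups.items()
            if len(set(members)) > 1}


# Sample values in the spirit of the messy categorical column above.
states = ["ID", "I.D.", "US-ID", "idaho", "Idaho,", "Ihaho"]
clusters = cluster(states)
```

With this sample, "ID" and "I.D." share one key and "idaho" and "Idaho," share another. The typo "Ihaho" does not collide with anything, which is why a distance-based (nearest-neighbor) method is a useful complement to key collision.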

 

A Power Tool

David Huynh, a Google developer who originally worked on the project, says OpenRefine is

A power tool for working with messy data.

More powerful than a spreadsheet,

More interactive and visual than scripting,

More provisional / exploratory / experimental / playful than a database.

You can get a copy of his introduction and tutorial here.

 

Working with Refine calls for a different mindset: you identify patterns in the data, use those patterns to isolate the rows you want to change, and transform them as a batch.
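
That identify, isolate, and batch-transform workflow can be mimicked in plain Python to show the idea. This is a rough sketch with hypothetical helper names (`is_plain_number`, `normalize`) and a tiny sample list; in Refine itself the same steps are carried out interactively with facets and cell transforms.

```python
import re
from collections import Counter

# Sample values in the spirit of a messy amount column.
amounts = ["$1,000", "1000", "1,000", "1000 dollars", "US$1000", "$1k"]


def is_plain_number(value):
    """True when the cell is already a bare integer string."""
    return value.isdigit()


def normalize(value):
    """Pull out the number, drop thousands separators, expand a 'k' suffix."""
    match = re.search(r"(\d[\d,]*)\s*(k?)", value.lower())
    if match is None:
        return value  # leave anything unrecognized untouched
    number = int(match.group(1).replace(",", ""))
    return str(number * 1000 if match.group(2) else number)


# 1. Identify patterns: a facet-like count of clean versus messy cells.
facet = Counter(is_plain_number(v) for v in amounts)

# 2. Isolate: remember the positions of the rows that need fixing.
selected = [i for i, v in enumerate(amounts) if not is_plain_number(v)]

# 3. Transform as a batch: apply one rule to every selected row.
cleaned = list(amounts)
for i in selected:
    cleaned[i] = normalize(amounts[i])
```

The point is the order of operations: count first, select second, change last, so each transformation touches only the rows it was designed for.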

Digital Infrastructure Librarian: Ask me about Refine!