12. OpenRefine for normalizing data


Print Friendly, PDF & Email
Normalizing data: the process of organizing data into tables in such a way that the results of using the database are always unambiguous as intended.

Download this dataset and copy it to your C:\ drive workspace and uncompress it.  Load the Buyer2010_gt25ac table to ArcMap.  Save your ArcMap project to your C:\ drive workspace .

Problem:

There are some problems with our list of buyers who purchased a parcel larger than 25 acres in 2010.  Load the Buyer2010_gt25ac table and submit the following attribute query:

ADDRESS1 LIKE 'PO%' or ADDRESS1 LIKE 'P.%'

Show only the selected records (1,130 of them).  How many different combinations of “PO Box” do you see?  I see: “P. O.”, “P.O.”, “p.O.”, “.. Box”, “.. box”, and more.  While each of these are certainly meant to represent post office box, structured query clients like ArcMap consider these to be different.  Thus, you must explicitly capture all the combinations in your query to select all of the post office box records.  After cleaning the P. O. Box, you need to fix all of the other near-sames like “Church” and “Chruch” and “CHURCH” and “church”, and many many more.  Yes, the problem with this approach is it is difficult to capture all of the possible combinations of near-same word and word combinations and abbreviations.  This problem is compounded when you realize there are more fields to clean like the buyer’s name, city, state, and zip code.

While you might be able to normalize this database using a series of queries and summaries in ArcMap or Excel, there are better tools for the job.  OpenRefine is a better choice when faced with un-normalized, dirty data.

Terminology:

Facet: A facet is summarization of unique values contained within a column of data

Clustered Facet: A grouping of near-similar facets

OpenRefine Live Demo

Load the Prop2010_gt100ac table to your ArcMap project.  This table contains data about sales from 2010 involving properties 100 acres and larger.  There are 2,700 records total in this data set but I suspect, I know, that some properties are listed multiple times.  Furthermore, there are quite a few inconsistencies.  For example, there are multiple spellings of R-Ranch Road.

facetex1

Our goal here is to normalize these entries.  Open refine allows for the import of CSV and TXT files.  Export your Prop2010_gt100ac table as a ‘Text file’ in your workspace on the C:\ drive.  Name this file P2010_g100ac.txt.

OpenRefine Installation:

OpenRefine is an open source, desktop software that will run on Windows, Mac, and Linux.  It is a green install, so all you have to do is download the appropriate package from here (v2.7 Windows kit), unzip the file into your working directory, then double-click the openrefine.exe to run.  Also a good idea to copy the OpenRefine installation zip file up to your network drive for easy access later on.

In-Class Demo:

  • Installation
  • Load data
  • Create text facet and clean
  • Export results

 

Recommended OpenRefine Tutorials:

Cleaning Data with Refine (schoolofdata.org)

Clustering text facets in OpenRefine (padjo.org)