12. Open Refine Continued

 

 

If you didn’t archive the OpenRefine 2.7 (Windows Kit) last Wednesday, redownload (OpenRefine Download) and extract it to your working directory.  Start OpenRefine (double click the openrefine.exe) and let the browser install Java again.  Once Java has installed, restart OpenRefine.

We will be working with this dataset (Prop2010_gt100ac.txt).  Save this text file to your working directory and load it into a new OpenRefine project.

DATA EXPLORATION

Lets focus first on the PARCEL_NO field.

  • Generate a text facet on PARCEL_NO and then sort the facets based on count
    • There are 2119 total choices
    • There are 62 entries for ‘074-016’
    • There are at least 10 entries with ‘N/A’ but I expect there are more
    • Notice some of the parcels are formatted with ‘###-###’, some with ‘### ###’, and some with ‘###/###’
  • Delete the facet (click on ‘x’ in top-left of facet box)

FIRST STEPS

  • Create a new field based on the PARCEL_NO field called PARCEL_NO2
    • dropdown-arrow>Edit column>Add column based on this column>OK
  • Trim leading and trailing whitespace
    • dropdown-arrow>Edit cells>Common transformations>Trim leading and trailing whitespaces
  • Find all sequential white spaces and replace with a single blank space
    • dropdown-arrow>Edit cells>Transform
      value.replace(/\s+/," ")
  • Find all ‘/’ and ‘\’ and replace with a single blank space
    • dropdown-arrow>Edit cells>Transform
      value.replace(/\//," ") 
      value.replace(/\\/," ")
  • Find all ‘-‘ and replace with a single blank space
    • dropdown-arrow>Edit cells>Transform
      value.replace(/\-/," ")

REVIEW YOUR EDITS

Create a text facet on PARCEL_NO2 field and compare

  • Total number of facets reduced to 2080
  • ‘N/A’ now ‘N A’
  • No ‘-‘ or ‘\’ or ‘/’ characters in the parcel number
  • Scroll down and you will see parcel ids separated with commas, ‘and’, and several others with various text.  Leave these for now.

TRY TO CLUSTER THE PARCEL_NO2 FIELD

  • dropdown-arrow>Edit cells>Cluster and edit…