Week 10 (Wed): Cleaning Data…


My original plan for today was to cover spatial selections; instead I will first briefly cover steps to take to normalize the street names ([STREET_NAM]) in the

Prop2015

table.  After this, we will move on to spatial selections.

Problem:

Recall we have 403,751 records in the Prop_2015 table.  Our goal in this exercise is to normalize street names contained in the [STREET_NAM] field.  Problems I see are the following:

  • variable string case (Church vs CHURCH vs church)
  • abbreviated vs not (Drive vs Dr vs Dr.)
  • words vs symbols (and vs &)

Solution:

Use a combination of summaries, queries, and table joins to manually filter and fix problems.

Prepare full street address in Prop2015 table

The three fields you are interested in are STDIRECT, STREET_NAM, and STTYPE.  The full address is a concatenation of these three fields.  However, before we can do the concatenation, we must normalize the STDIRECT and  STTYPE fields.

  • Summarize STDIRECT field (right-click on the field >> Summarize >> name ‘sum_stdirect’ and save in your FGDB
    • you are looking for variations like ‘N’ and ‘ N’ and ‘North’ or ‘NE’ and ‘ne’
    • luckily, there are no variations
  • Summarize the STTYPE field (name ‘sum_sttype’ and save in your FGDB)
    • looking for variations of abbreviations for the same type of road (‘Dr’ vs ‘ Dr’ vs ‘Dr.’ vs ‘Drive’ vs ‘drive’)
    • lucky again, nothing jumps out at me

Move on to creating the full address…

  • Create a new field ‘fulladdr’ (text, Length = 200)
  • Use the Field Calculator to populate the [fulladdr] field:
    concatenate the three fields
    [STDIRECT] & " " & [STREET_NAM] & " " & [STTYPE] 
    
    remove the leading (LTrim) and trailing (RTrim) spaces
    RTrim(LTrim([STDIRECT] & " " & [STREET_NAM] & " " & [STTYPE]))
    
    might as well fix the variable case problem, use LCase to turn all text to lower case text
    LCase(RTrim(LTrim([STDIRECT] & " " & [STREET_NAM] & " " & [STTYPE])))

Reduction of the Prop2015 table on the fulladdr field

  • Summarize the fulladdr field (right-click on field name >> Summarize >> name output ‘sum_fulladdr’, save in FGDB and hit OK)
    • yields 95,403 records
    • the Count_fulladdr field contains the number of records with the fulladdr value
  • When you sort the count field (big-to-little) you see that there are 913 records with a value of ‘Peachtree’, 722 records with ‘peachtree rd’
  • Submit this attribute query to the sum_fulladdr table:
    • To select all records that begin with ‘peachtree’ (138 records selected):
      fulladdr Like 'peachtree%'
    • To select all records that contain ‘peachtree’ anywhere (204 records selected):
      fulladdr Like '%peachtree%'
    • To select all records that contain ‘peachtree’ anywhere but does not contain ‘oak’ – removes ‘peachtree oak dr’ (203 records):
      fulladdr Like '%peachtree%' AND NOT fulladdr LIKE '%oak%'
    • To select all records that belong to ‘peachtree road’ (53 selected, only peachtree road variations remain):
      fulladdr Like '%peachtree%' AND NOT (fulladdr LIKE '%oak%' OR fulladdr LIKE '%park%' OR fulladdr LIKE '%pkwy%' OR fulladdr LIKE '%pl%' OR fulladdr LIKE '%place%' OR fulladdr LIKE '%reserve%' OR fulladdr LIKE '%ridge%' OR fulladdr LIKE '%run%' OR fulladdr LIKE '%sta%' OR fulladdr LIKE '%strm%' OR fulladdr LIKE '%ter%' OR fulladdr LIKE '%trails%' OR fulladdr LIKE '%vw%' OR fulladdr LIKE '%walk%' OR fulladdr LIKE '%way%' OR fulladdr LIKE '%st%' OR fulladdr LIKE '%shl%' OR fulladdr LIKE '%street%' OR fulladdr LIKE '%ct%' OR fulladdr LIKE '%ct%' OR fulladdr LIKE '%blvd%' OR fulladdr LIKE '%old%' OR fulladdr LIKE '%heights%' OR fulladdr LIKE '%battle%' OR fulladdr LIKE '%blvd%' OR fulladdr LIKE '%cir%' OR fulladdr LIKE '%club%' OR fulladdr LIKE '%corner%' OR fulladdr LIKE '%cors%' OR fulladdr LIKE '%court%' OR fulladdr LIKE '%drive%' OR fulladdr LIKE '%dr%' OR fulladdr LIKE '%glen%' OR fulladdr LIKE '%hill%' OR fulladdr LIKE '%hollow%' OR fulladdr LIKE '%blvd%' OR fulladdr LIKE '%lake%' OR fulladdr LIKE '%landing%' OR fulladdr LIKE '%lane%' OR fulladdr LIKE '%ln%' OR fulladdr LIKE '%memorial%')
    • Flip to the selected records view, and inspect the selected records.  I see that there remains ‘peachtree ave’ and peachtree avenue ne’.   Add the part to remove any record that contains ‘ave’ to the query and reapply (51 records selected)
      fulladdr Like '%peachtree%' AND NOT (fulladdr LIKE '%oak%' OR fulladdr LIKE '%park%' OR fulladdr LIKE '%pkwy%' OR fulladdr LIKE '%pl%' OR fulladdr LIKE '%place%' OR fulladdr LIKE '%reserve%' OR fulladdr LIKE '%ridge%' OR fulladdr LIKE '%run%' OR fulladdr LIKE '%sta%' OR fulladdr LIKE '%strm%' OR fulladdr LIKE '%ter%' OR fulladdr LIKE '%trails%' OR fulladdr LIKE '%vw%' OR fulladdr LIKE '%walk%' OR fulladdr LIKE '%way%' OR fulladdr LIKE '%st%' OR fulladdr LIKE '%shl%' OR fulladdr LIKE '%street%' OR fulladdr LIKE '%ct%' OR fulladdr LIKE '%ct%' OR fulladdr LIKE '%blvd%' OR fulladdr LIKE '%old%' OR fulladdr LIKE '%heights%' OR fulladdr LIKE '%battle%' OR fulladdr LIKE '%blvd%' OR fulladdr LIKE '%cir%' OR fulladdr LIKE '%club%' OR fulladdr LIKE '%corner%' OR fulladdr LIKE '%cors%' OR fulladdr LIKE '%court%' OR fulladdr LIKE '%drive%' OR fulladdr LIKE '%dr%' OR fulladdr LIKE '%glen%' OR fulladdr LIKE '%hill%' OR fulladdr LIKE '%hollow%' OR fulladdr LIKE '%blvd%' OR fulladdr LIKE '%lake%' OR fulladdr LIKE '%landing%' OR fulladdr LIKE '%lane%' OR fulladdr LIKE '%ln%' OR fulladdr LIKE '%memorial%' OR fulladdr Like '%ave%')
  • At this point, you want to copy/paste your query into a Notepad or other editor for your documentation purposes.

Prepare the sum_fulladr table for join back to main table

You have found 53 records that belong to Peachtree Road.  These 53 records in the summary table represent 870 entries in the original Prop2015 table (right-click on the Count field >> Statistics) to label ‘Peachtree Road’.  You will add a new fields (text, length=200) to both the summary and original tables in which you will populate with ‘Peachtree Road’.

  • From the sum_fulladdr table:
    • Add a new field: click table options >> Add Field >> Name: nAddresSum, Type: text, Length: 200
    • Populate the selected records with “Peachtree Road”:
      • With the 51 selected records still selected: right-click on nAddresSum >> Field Calculator >> enter “Peachtree Road” in the bottom box and hit OK
    • Most records in the summary table will have a <Null> record
    • Unselect all records
  • From the Prop2015 table:
    • Unselect all records
    • Add a new field: click table options >> Add Field >> Name: AddressFix, Type: text, Length: 200
    • Join the sum_fulladdr table to this table (common field in both tables is fulladdr)
    • Perform a final field calculation to populate the AddressFix field with the values in the nAddresSum field
      • Right-click on the AddressFix field >> Field Calculator >> scroll down and click on the nAddresSum entry ([sum_fulladdr.nAddresSum]) then hit OK
      • This calculation might take a few minutes.  Oh yeah, you should probably save your project  before you do the calculation!
    • Remove All Joins then fort AddressFix field from big-to-little
    • Submit a query to select all NULL records (402,881 selected records):
      AddressFix IS NULL
    • Submit a query to select all non-NULL records (870 selected records):
      Not AddressFix IS NULL

In conclusion:

I use these methods all the time to filter and fix small datasets, however, that was a lot of work to fix only one street name.  We will look at another solution Friday during lab.

Print Friendly, PDF & Email