Tutorial: FuzzyWuzzy String Matching in Python – Improving Merge Accuracy Across Data Products and Naming Conventions

[Figure: District naming preview – example of two datasets with comparable variables]

If you work with manually entered string data or data coming from multiple providers, you may encounter the reality of not being able to (a) merge the data, or (b) produce correct summary statistics.  Regarding (a), take the example in the picture of Indian district names exported from two data sources: we’d have a high success rate if we tried to merge on the existing names, but would miss out on Prakasam (Source B), which is labeled Ongole (Prakasam) in Source A.  This level of success is pretty magical, but not realistic – I’ve encountered situations where maybe 30% of the raw data merge correctly. This is bad for power and for representativeness.  In Stata you might be satisfied with 85+% _merge == 3, but we should be striving for 100%, since the dropped 15% could be dramatically different in composition from our 85% subsample.

Similarly, you may want to map your data using a related shapefile.  If the administrative names (or associated identifier values) don’t match, you’ll have an incomplete map.  The solution is to export the attribute table from your shapefile (*.shp) and use it as one of the input spreadsheets for the script I describe below.

As for (b), I recently opened a file where workers self-reported their primary occupation.  Entries that *should’ve been* “construction” spanned “CONSTRUCTION”, “contruction”, “ccontrucion”, “construct”, “construciton.”  You get the idea.  If you wanted to calculate the share of workers by occupation, you could either manually edit each entry or adopt the FuzzyWuzzy-based approach described here.  While there are other methods to achieve similar results, this is the workflow that has worked for me, and I hope it will also work for you.

FuzzyWuzzy is a fantastic Python package that uses Levenshtein distance to calculate similarity scores between string entries.  “CONSTRUCTION” and “CONSTRUCTION” would yield a 100% match, while “CONSTRUCTION” and “CANSTRICTION” would generate a lower score.  FuzzyWuzzy can be installed via pip or other Python package management tools.

Take the Indian districts example with two distinct datasets, each possessing unique entries.  I would ideally like to match District D in Dataset B (e.g., Prakasam, Andhra Pradesh) to the best match in Dataset A.  FuzzyWuzzy will generate those matching scores and provide you with the N (user-selected) entries having the highest scores.  If all goes well, the entry with the highest score will indeed be our match, but we need to verify that first.

I’ve included a sample Python script that’s worked for my purposes and can be easily retrofitted for a range of applications.  I’ll briefly describe the steps I take from start to end.

  1. Take two datasets and export the variables you wish to merge as distinct spreadsheets.  In this example, I am interested in merging the names of district-state pairs from “ICRISAT_Meso_Names.csv” and “District_Records_1971_Clean.csv.”  They need not possess the same variable names since the function <districtMatch> takes as input arguments the district and state variable names from either input file.  The former is a complete dataset, the latter a list of administrative names for a 1971 shapefile.  Once these two are merged, I’ll be able to project variable values into a map.
  2. I run the <districtMatch> function on these two files which generates “ICRI_1971_DistrictMatches.csv” into /Matched_Results.  The function’s default is to output the top 3 matches, but you can edit this.
  3. If there’s a perfect match, the “perfect_match” field in the DistrictMatches file is given a “True” value.  You need not spend any more time on those entries since they will merge correctly.  You’ll only need to focus on “False” matches and will need to do this manually.  To save time, you may even sort on “perfect_match”, rather than hunt-and-peck for Falses.
  4. To address the False matches, I generate two new variables in ICRI_1971_DistrictMatches.csv and save it as an XLSX file to retain highlighting/cell formatting.
    1. “best_imperfect_match” – identifies which match column is in fact correct. (i.e., 2 refers to ‘match2’, 3 to ‘match3’, etc.)
    2. “explicit_match” – in the event the match is not captured in the top 3 match hits, I’ll have to manually identify the correct match and type it in.
  5. Once I’ve made all the necessary edits, I can then import this XLSX file into Stata and run some basic code to get the two datasets to merge properly.  I provide example code in the Stata do-file “fuzzywuzzy_Merge_Sample.do”.
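For readers who would rather stay in Python, the resolution logic from steps 3–5 can be sketched in pandas (the rows below are made-up examples, and the final merge against the master dataset is left out for brevity):

```python
import pandas as pd

# Hypothetical hand-edited match file, mirroring the fields described above
matches = pd.DataFrame({
    "using_original": ["Prakasam, Andhra Pradesh", "Guntur, Andhra Pradesh"],
    "Match1": ["Ongole (Prakasam), Andhra Pradesh", "Guntur, Andhra Pradesh"],
    "perfect_match": [False, True],
    "best_imperfect_match": [1, None],  # 1 means the Match1 column holds the correct match
    "explicit_match": [None, None],
})

def resolve(row):
    # Perfect matches need no manual input
    if row["perfect_match"]:
        return row["Match1"]
    # A manually typed-in match overrides everything else
    if pd.notnull(row["explicit_match"]):
        return row["explicit_match"]
    # Otherwise use the match column flagged as correct
    return row["Match{0}".format(int(row["best_imperfect_match"]))]

matches["merge_key"] = matches.apply(resolve, axis=1)
print(matches[["using_original", "merge_key"]])
```

The resulting merge_key column is what you would then merge on.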

If you’re interested in merging on a single variable (i.e., state names, without districts), you can easily modify the function to include only a single variable input name from each dataset and skip the joining of district and state names.

(Appropriately-indented Python code available on the Resources page)

# -- DistrictNameMatching.py
# Author: Anthony Louis D'Agostino (ald2187 [at] columbia.edu)
# Purpose: Given CSV lists of district-state name pairs, identifies the best match given the fuzzywuzzy library
# Notes: Default number of matches currently set to 3, though can be modified as input argument.
# **** AUTHOR NOT LIABLE FOR ANY DAMAGES INCURRED THROUGH THE USE OR [ESPECIALLY] MISUSE OF THIS PRODUCT *****

import os
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd

def districtMatch(master, using, master_dist, master_state, using_dist, using_state, outFile, num_match=3):
    """
    Takes two sets of district-state names and produces a CSV with a set number (default = 3)
    of candidate matches per entry, plus a flag for whether the name has been perfectly matched.

    Manual work is then required for districts where a perfect match has not been made.

    master: file containing the master list of districts
    using: file containing the using list of districts; each of these districts is compared
        against the universe of districts from the master file
    master_dist: variable name pertaining to districts in master file
    master_state: variable name pertaining to states in master file
    using_dist: variable name pertaining to districts in using file
    using_state: variable name pertaining to states in using file
    num_match: number of matches generated, default is 3
    outFile: path and filename for the output file, without extension (".csv" is appended)
    """

    # sep=None makes pandas sniff the delimiter, which requires the Python parsing engine
    master_dists = pd.read_csv(master, quotechar='"', skipinitialspace=True, sep=None, engine='python')
    print(" *** Now printing column values for master file *** ")
    print(list(master_dists.columns.values))

    using_dists = pd.read_csv(using, quotechar='"', skipinitialspace=True, sep=None, engine='python')
    print(" *** Now printing column values for using file *** ")
    print(list(using_dists.columns.values))

    # -- concatenate district and state names into "District, State" strings
    master_names = master_dists[master_dist].map(str) + ", " + master_dists[master_state]
    using_names = using_dists[using_dist].map(str) + ", " + using_dists[using_state]

    # -- for each using entry, retrieve the num_match best-scoring master entries
    fhp_new = [process.extract(x, master_names, limit=num_match) for x in using_names]

    # -- generate column names: Match1, Match2, ...
    cols = ["Match{0}".format(i) for i in range(1, num_match + 1)]
    fhp_matches = pd.DataFrame(fhp_new, columns=cols)

    # -- keep only the matched string from each (match, score, ...) tuple
    d = {}
    for x in range(1, num_match + 1):
        d["Match{0}".format(x)] = [y[0] for y in fhp_matches["Match" + str(x)]]

    d['using_original'] = using_names

    # -- flag entries whose top match is an exact string match
    d['perfect_match'] = d['Match1'] == d['using_original']

    out = pd.DataFrame(d)
    # out.to_stata(outFile + ".dta")  # alternative: write a Stata DTA directly
    out.to_csv(str(outFile + ".csv"))
    print("******************************************")
    print("*** Your analysis has been completed! *** ")
    print("******************************************")

    return out

"""
BASIC FILES/PATHS WHOSE USE IS REPEATED
"""

# -- Replace below path with your correct directory structure
baseDir = os.path.join("/Users", "xadx", "Desktop", "Dropbox", "Working_Current", "Fuzzy_Wuzzy")
outDir = os.path.join(baseDir, "Matched_Results")

# -- In case preferred path does not already exist
if not os.path.exists(outDir):
    os.makedirs(outDir)

"""
MATCHING ICRISAT and 1971 SHAPEFILE
"""

master_file = os.path.join(baseDir, "ICRISAT_Meso_Names.csv")
input_1971 = os.path.join(baseDir, "District_Records_1971_Clean.csv")

# -- Directory into which matched results spreadsheet is saved
outFile = os.path.join(outDir, "ICRI_1971_DistrictMatches")

icri_1971_match = districtMatch(input_1971, master_file, 'NAME', 'STATE_UT', 'distname', 'stname', outFile)

# -- Alternatively, call without saving the returned DataFrame in the workspace:
# districtMatch(input_1971, master_file, 'NAME', 'STATE_UT', 'distname', 'stname', outFile)

			
