Tutorial: FuzzyWuzzy String Matching in Python -- Improving Merge Accuracy Across Data Products and Naming Conventions
• Anthony Louis D'Agostino •
If you work with manually-entered string character data or data coming from multiple providers, you may encounter the reality of not being able to a.) merge the data, or b.) produce correct summary statistics. Regarding a.), take the example in the picture of Indian district names exported from two data sources — we’d have a high success rate if we tried to merge on the existing names, but would miss out on Prakasam (Source B) which is labeled Ongole (Prakasam) in Source A. This level of success is pretty magical, but not realistic – I’ve encountered situations where maybe 30% of the raw data merge correctly. This is bad for power and for representativeness. In Stata you might be satisfyced with 85+% _merge == 3, but we should be striving for 100% since the dropped 15% could be dramatically different in composition than our 85% subsample.
Similarly, you may want to map your data using a related shapefile. If the administrative names (or associated identifier values) don’t match, you’ll have an incomplete map. The solution is to export the Attribute Table from your *.SHP and use this as one of the input spreadsheets into the script I describe below.
As per b.), I recently opened a file where workers self-reported their primary occupation. Entries that should’ve been “construction” spanned “CONSTRUCTION”, “contruction”, “ccontrucion”, “construct”, “construciton.” You get the idea. If you wanted to calculate the share of workers by occupation, you could either manually edit each entry or adopt a FuzzyWuzzy-based approach described here. While there are other methods to achieve similar results, this is the workflow that has worked for me and I hope will also work for you.
FuzzyWuzzy is a fantastic Python package which uses a distance matching algorithm to calculate proximity measures between string entries. “CONSTRUCTION” and “CONSTRUCTION” would yield a 100% match, while “CONSTRUCTION” and “CANSTRICTION” would generate a lower score. FuzzyWuzzy can be installed via pip or other Python package management tools.
Take the Indian districts example with two distinct datasets each possessing unique entries. I would ideally like to match District D in Dataset B (e.g., Prakasam, Andhra Pradesh) to the best match in Dataset B. FuzzyWuzzy will generate those matching scores and provide you with N (user-selected) entries having the highest score. If all
goes well, the entry with the highest score will indeed be our match, but we need to verify that first.
I’ve included a sample Python script that’s worked for my purposes and can be easily retrofitted for a range of applications. I’ll briefly describe the steps I take from start to end.
Take two datasets and export the variables you wish to merge as distinct spreadsheets. In this example, I am interested in merging the names of district-state pairs from “ICRISAT_Meso_Names.csv” and “District_Records_1971_Clean.csv.” They need not possess the same variable names since the function <districtMatch> takes as input arguments the district and state variable names from either input file. The former is a complete dataset, the latter a list of administrative names for a 1971 shapefile. Once these two are merged, I’ll be able to project variable values into a map.
I run the <districtMatch> function on these two files which generates “ICRI_1971_DistrictMatches.csv” into /Matched_Results. The function’s default is to output the top 3 matches, but you can edit this.
If there’s a perfect match, the “perfect_match” field in the DistrictMatches file is given a “True” value. You need not spend any more time on those entries since they will merge correctly. You’ll only need to focus on “False” matches and will need to do this manually. To save time, you may even sort on “perfect_match”, rather than hunt-and-peck for Falses.
To address the False matches, I generate two new variables in ICRI_1971_DistrictMatches.csv and save it as an XLSX file to retain highlighting/cell formatting.
“best_imperfect_match” – identifies which match column is in fact correct. (i.e., 2 refers to ‘match2’, 3 to ‘match3’, etc.)
“explicit_match” – in the event the match is not captured in the top 3 match hits, I’ll have to manually identify the correct match and type it in.
Once I’ve made all the necessary edits, I can then import this XLSX file into Stata and run some basic code to get the two datasets merge properly. I provide example code in this Stata “fuzzywuzzy_Merge_Sample.do”.
If you’re interested in merging on a single variable (i.e., state names, without districts), you can easily modify the function to include only a single variable input name from each dataset and skip the joining of district and state names.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
[Learn more about bidirectional Unicode characters](https://github.co/hiddenchars)
[ Show hidden characters ](<>)
| | \# — DistrictNameMatching.py |
|---|---|
| | \# Author: Anthony Louis D'Agostino (ald2187 \[at\] columbia.edu) |
| | \# Purpose: Given CSV lists of district-state name pairs, identifies the best match given the fuzzywuzzy library |
| | \# Notes: Default number of matches currently set to 3, though can be modified as input argument. |
| | |
| | importos |
| | importnumpyasnp |
| | fromfuzzywuzzyimportfuzz |
| | fromfuzzywuzzyimportprocess |
| | importpandasaspd |
| | |
| | |
| | |
| | defdistrictMatch(master, using, master\_dist, master\_state, using\_dist, using\_state, outFile, num\_match=3): |
| | """ |
| | This function takes two sets of district-state names, and produces a DTA with a set number (default=3) |
| | of matches with a flag for whether the district name has been completely matched. |
| | |
| | Manual work is then required for districts where a perfect match has not been made. |
| | |
| | master: file containing the master list of districts |
| | using: file containing using list of districts, eg., each of these districts is compared against these |
| | universe of master districts from the master file |
| | master\_dist: variable name pertaining to districts in master file |
| | master\_state: variable name pertaining to states in master file |
| | using\_dist: variable name pertaining to districts in using file |
| | using\_state: variable name pertaining to states in using file |
| | num\_match: number of matches generated, default is 3 |
| | outFile: includes path and filename for an outputted DTA file – should be "\*.dta" |
| | """ |
| | |
| | master\_dists=pd.read\_csv(master, quotechar='"', skipinitialspace=True, sep=None) |
| | print" \*\*\* Now printing column values for master file \*\*\* " |
| | printlist(master\_dists.columns.values) |
| | |
| | using\_dists=pd.read\_csv(using, quotechar='"', skipinitialspace=True, sep=None) |
| | print" \*\*\* Now printing column values for using file \*\*\* " |
| | printlist(using\_dists.columns.values) |
| | |
| | \# — concatenate district and state names |
| | master\_names=master\_dists\[master\_dist\].map(str) +", "+master\_dists\[master\_state\] |
| | using\_names=using\_dists\[using\_dist\].map(str) +", "+using\_dists\[using\_state\] |
| | |
| | fhp\_new= \[process.extract(x, master\_names, limit=num\_match) forxinusing\_names\] |
| | |
| | \# — generate column names |
| | lab="" |
| | i=1 |
| | whilei<=num\_match: |
| | lab=lab+" "+"Match"+str(i) |
| | i+=1 |
| | |
| | |
| | |
| | fhp\_matches=pd.DataFrame(fhp\_new, columns=lab.split()) |
| | |
| | d={} |
| | forxinrange(1,num\_match+1): |
| | d\["Match{0}".format(x)\]=\[y\[0\] foryinfhp\_matches\['Match'+str(x)\]\] |
| | |
| | |
| | d\['using\_original'\] =using\_names |
| | |
| | |
| | \#match1 = \[x\[0\] for x in fhp\_matches\['Match1'\]\] |
| | d\['perfect\_match'\] =d\['Match1'\] ==d\['using\_original'\] |
| | |
| | \#fhp\_matches\['perfect\_match'\] = pd.Series(perfect\_match, index = fhp\_matches.index) |
| | out=pd.DataFrame(d) |
| | \#out.to\_stata(str(outFile + ".dta")) |
| | out.to\_csv(str(outFile+".csv")) |
| | print"\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*" |
| | print"\*\*\* Your analysis has been completed! \*\*\* " |
| | print"\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*\*" |
| | |
| | returnout |
| | |
| | |
| | """ |
| | BASIC FILES/PATHS WHOSE USE IS REPEATED |
| | """ |
| | |
| | |
| | baseDir=os.path.join("<insert path>") |
| | |
| | |
| | outDir=os.path.join(baseDir, "Matched\_Results") |
| | |
| | ifnotos.path.exists(outDir): |
| | os.makedirs(outDir) |
| | |
| | |
| | |
| | |
| | |
| | """ |
| | ICRISAT and 1971 Polygon borders |
| | """ |
| | |
| | master\_file=os.path.join(baseDir, "ICRISAT\_Meso\_Names.csv") |
| | input\_1971=os.path.join(baseDir,"District\_Records\_1971\_Clean.csv") |
| | |
| | outFile=os.path.join(outDir, "ICRI\_1971\_DistrictMatches") |
| | icri\_1971\_match=districtMatch(input\_1971, master\_file, 'NAME', 'STATE\_UT', 'distname', 'stname', outFile) |
| | |
| | \# — alternatively, don't save as a workspace object |
| | districtMatch(input\_1971, master\_file, 'NAME', 'STATE\_UT', 'distname', 'stname', outFile) |
| | |
[view raw](https://gist.github.com/a8dx/20caedaf942f1810e16994bfdb57e8dc/raw/ec412554285823b873e37d523142082d79cd36d5/DistrictNameMatching.py)
[
DistrictNameMatching.py
](https://gist.github.com/a8dx/20caedaf942f1810e16994bfdb57e8dc#file-districtnamematching-py)
hosted with ❤ by [GitHub](https://github.com)