Using SQL Joins to Perform Fuzzy Matches on Multiple Identifiers Jedediah J. Teres, MDRC, New York, NY

Using SQL Joins to Perform Fuzzy Matches on Multiple Identifiers Jedediah J. Teres, MDRC, New York, NY ABSTRACT Matching observations from different data sources is problematic without a reliable shared identifier. Using multiple identifiers can be more restrictive as it requires multiple exact matches. One way around this is to create a match score based on the number of matching identifiers. This score can be weighted to favor certain matches or sets of matches (e.g., first name and last name) over others (e.g., first name and date of birth). This paper builds on a previous paper describing a technique for creating and using a match score in the context of a SQL join to find matches in cases when all identifiers are not exactly the same and allows for the use of COMPGED to find close matches without requiring strict equality. INTRODUCTION This paper expands upon a previously described method of combining data sets using multiple variables as matching criteria in by using COMPGED to allow for fuzzy matching. Prior to SAS 9.2, using COMPGED in the context of a SQL JOIN produced a note to the log each time a character was compared to a blank space. With the release of SAS 9.2, this is no longer an issue, and COMPGED can be used to expand the flexibility of JOINS in SQL. Knowledge of the SELECT statement and JOINs in is assumed. Prior knowledge of COMPGED is helpful but not necessary. An understanding of the PUT function is needed. SAMPLE DATA SETS Two SAS data sets are used for illustration purposes in this paper: REF and CHK. Data set REF Obs FNAME LNAME SSN NINO DOB 1 John Baldwin 123-45-6789 BM567891E 1/3/1946 2 Robert Plant 234-56-7890 YZ912345H 8/20/1948 3 Jimmy Page 345-67-8901 HL234567B 1/9/1944 4 John Bonham 456-78-9012 YZ891234G 5/31/1948 5 Ray Davies 567-89-0123 HL456789D 6/21/1944 6 Dave Davies 678-90-1234 KR789123F 2/3/1947 7 Peter Quaife 789-01-2345 AA123456A 12/31/1943 8 Mick Avory 890-12-3456 HL345678C 2/15/1944 Data set REF contains 5 variables and 8 observations. Each observation represents a unique sample member; there are no duplicate observations in the data set REF. Data set CHK 1 John Jones 123-45-6789 BM567891E 1/3/1946 Led Zeppelin 2 Robert Plant 234-56-7890 YZ912235H 8/20/1948 Led Zeppelin 3 Jimmy Page 345-67-8907 HL234567D 1/9/1944 Led Zeppelin 4 John Bonham 456-78-9012 YZ891234G 5/31/1948 Led Zeppelin 5 Ray Davies 567-89-0123 HL456789D 6/21/1944 The Kinks 6 Dave Davies 678-90-1234 KR789213F 2/3/1947 The Kinks 7 Pete Quiafe 789-01-2346 AA132456A 12/31/1943 The Kinks 8 Mick Avery 890-21-3456 HL345678C 2/15/1944 The Kinks 9 Jay Davie 567-89-1023 HI636789D 6/12/1984 Blue Devils 10 Dave David 678-90-1324 KR789213F 2/3/1974 The Minks Data set CHK contains 6 variables and 10 observations. Each observation represents a unique individual; there are no duplicate observations in the data set CHK. The data set CHK contains the same variables as the data set REF, but there is an additional variable, BAND. The goal is to merge the data set CHK to the data set REF in order to pick up the BAND variable for the people in the sample. 1

Using AN INNER JOIN TO COMBINE DATA SETS In its most basic form, the syntax for combing data sets using an inner join in is as straightforward as a match-merge in a DATA step. Here, the data sets REF and CHK are joined using the National Insurance Number field (NINO) as the common identifier. create table inner_join1 as on (ref.nino eq chk.nino) The resulting data set contains only 4 observations. The original data set REF had 8 observations, so 3 observations failed to return a match. 1 John Baldwin 123-45-6789 BM567891E 1/3/1946 Led Zeppelin 2 John Bonham 456-78-9012 YZ891234G 5/31/1948 Led Zeppelin 3 Mick Avory 890-12-3456 HL345678C 2/15/1944 The Kinks 4 Ray Davies 567-89-0123 HL456789D 6/21/1944 The Kinks Upon further investigation, it becomes clear that several NINOs do not match. John Baldwin BM567891E John Jones BM567891E Robert Plant YZ912345H Robert Plant YZ912235H Jimmy Page HL234567B Jimmy Page HL234567D John Bonham YZ891234G John Bonham YZ891234G Ray Davies HL456789D Ray Davies HL456789D Dave Davies KR789123F Dave Davies KR789213F Peter Quaife AA123456A Pete Quiafe AA132456A Mick Avory HL345678C Mick Avery HL345678C These are all fairly minor differences consisting mostly of the transposition of two numbers or the errant substitution of a letter. Each represents a possible data entry error. With a data set this small, values could be edited so they match on both files. In a larger data set, that would not be a viable solution, so it's well worth considering what other options are available. COMPGED Since NINO is a character variable, we can use the COMPGED function to compute the generalized edit distance between the two character values. The SAS documentation on the COMPGED function states that Generalized edit distance is a generalization of Levenshtein edit distance, which is a measure of dissimilarity between two strings. The finer points of COMPGED are beyond the scope of this paper, but the general syntax takes the form COMPGED(string-1, string-2 <,cutoff> <,modifiers>) The COMPGED function returns a value based on the difference between the two character strings. The default values for minor transpositions and substitutions tend to be 100 or less. Using the COMPGED function as part of the criteria in the JOIN would allow for greater flexibility and not require strict equality, making fuzzier matches possible. Technically, we are still matching on a binary outcome whether COMPGED returns a value less than 100 or not. It s the method of calculating that value that allows for some fuzziness. create table inner_join2a as on (compged(ref.nino,chk.nino) le 100) Note that in order to use COMPGED, we must specify the tolerance that is, what is the maximum level of dissimilarity between two values of NINO that is acceptable? By specifying that the value returned by the 2

COMPGED function must be less than or equal to 100, we are allowing for only very slight differences. COMPGED will return 0 if two strings are equal, so we are allowing strict equality as well. The resulting data set has 8 observations, but all is not well. Robert Plant from Led Zeppelin is missing, and Dave Davies returned a match for someone in a band called The Minks. 1 John Baldwin 123-45-6789 BM567891E 1/3/1946 Led Zeppelin 2 John Bonham 456-78-9012 YZ891234G 5/31/1948 Led Zeppelin 3 Jimmy Page 345-67-8901 HL234567B 1/9/1944 Led Zeppelin 4 Mick Avory 890-12-3456 HL345678C 2/15/1944 The Kinks 5 Dave Davies 678-90-1234 KR789123F 2/3/1947 The Kinks 6 Ray Davies 567-89-0123 HL456789D 6/21/1944 The Kinks 7 Peter Quaife 789-01-2345 AA123456A 12/31/1943 The Kinks 8 Dave Davies 678-90-1234 KR789123F 2/3/1947 The Minks The members of these bands have been assigned both British National Insurance Numbers (NINOs) and American Social Security Numbers (SSNs) for illustration purposes. While we were unable to match Robert Plant on his NINO, perhaps looking for fuzzy matches on SSN would return the desired match. The following code: create table inner_join2b as on (compged(ref.ssn,chk.ssn) le 100) Returns the following error message: ERROR: Function COMPGED requires a character expression as argument 1. ERROR: Function COMPGED requires a character expression as argument 2. ERROR: Expression using less than or equal (<=) has components that are of different data types. COMPGED requires character expressions, or strings. The PUT function returns a string value, and can be nested in the COMPGED function. The following code: create table inner_join2c as on (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100) Creates the following data set: 1 Ray Davies 567-89-0123 HL456789D 6/21/1944 Blue Devils 2 John Baldwin 123-45-6789 BM567891E 1/3/1946 Led Zeppelin 3 John Bonham 456-78-9012 YZ891234G 5/31/1948 Led Zeppelin 4 Jimmy Page 345-67-8901 HL234567B 1/9/1944 Led Zeppelin 5 Robert Plant 234-56-7890 YZ912345H 8/20/1948 Led Zeppelin 6 Mick Avory 890-12-3456 HL345678C 2/15/1944 The Kinks 7 Dave Davies 678-90-1234 KR789123F 2/3/1947 The Kinks 8 Ray Davies 567-89-0123 HL456789D 6/21/1944 The Kinks 9 Peter Quaife 789-01-2345 AA123456A 12/31/1943 The Kinks 10 Dave Davies 678-90-1234 KR789123F 2/3/1947 The Minks 3

We have successfully rescued Robert Plant from limbo but at the cost of exacerbating the over-match issue from the previous JOIN. Now we have Dave Davies from The Minks as well as Ray Davies from Blue Devils. Clearly, we must refine the match criteria. USING A WEIGHTED MATCH SCORE Until now, we have allowed for fuzzy matches, but we have treated them the same as exact matches. What would happen if we assigned more weight to exact matches while still allowing for fuzzy matches? The code below creates a weighted match score (see Teres 2009) that gives twice as much weight to exact matches as fuzzy matches. This match score is then used to select the observation with the closest match as defined by the HAVING clause. The following query: proc sql ; create table inner_join3b as select ref.*, band, ((2*(ref.ssn eq chk.ssn)) + (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100)) as wms on (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100) group by ref.ssn having calculated wms eq max(calculated wms) Creates the following data set: wms 1 John Baldwin 123-45-6789 BM567891E 1/3/1946 Led Zeppelin 3 2 John Bonham 456-78-9012 YZ891234G 5/31/1948 Led Zeppelin 3 3 Jimmy Page 345-67-8901 HL234567B 1/9/1944 Led Zeppelin 1 4 Robert Plant 234-56-7890 YZ912345H 8/20/1948 Led Zeppelin 3 5 Mick Avory 890-12-3456 HL345678C 2/15/1944 The Kinks 1 6 Dave Davies 678-90-1234 KR789123F 2/3/1947 The Kinks 3 7 Ray Davies 567-89-0123 HL456789D 6/21/1944 The Kinks 3 8 Peter Quaife 789-01-2345 AA123456A 12/31/1943 The Kinks 1 At this point, we have met the goal of combining the two data sets. However, the weighted match score (WMS) takes on only 2 values, 1 and 3. It may be desirable to include all available identifying information in the creation of the match score to allow for greater variability in the quality of the matches. The following code expands the idea of including both the exact and fuzzy matches to all identifiers. Note the use of the nested PUT function with the FORMAT MMDDYY10 for DOB. The following query: create table inner_join5 as select ref.*, band, ((2*(ref.ssn eq chk.ssn)) + (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100) + (2*(ref.nino eq chk.nino)) + (compged(ref.nino, chk.nino) le 100) + (2*(ref.fname eq chk.fname)) + (compged(ref.fname, chk.fname) le 100) + (2*(ref.lname eq chk.lname)) + (compged(ref.lname, chk.lname) le 100) + (2*(ref.dob eq chk.dob)) + (compged(put(ref.dob,mmddyy10.), put(chk.dob,mmddyy10.)) le 100)) as wms 4

on ( (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100) or (compged(ref.nino, chk.nino) le 100) or (compged(ref.fname, chk.fname) le 100) or (compged(ref.lname, chk.lname) le 100) or (compged(put(ref.dob,mmddyy10.), put(chk.dob,mmddyy10.)) le 100) ) group by ref.ssn having ((calculated wms eq max(calculated wms))) order by band, ref.lname, ref.fname ; Produces the following data set: wms 1 John Baldwin 123-45-6789 BM567891E 1/3/1946 Led Zeppelin 12 2 John Bonham 456-78-9012 YZ891234G 5/31/1948 Led Zeppelin 15 3 Jimmy Page 345-67-8901 HL234567B 1/9/1944 Led Zeppelin 11 4 Robert Plant 234-56-7890 YZ912345H 8/20/1948 Led Zeppelin 12 5 Mick Avory 890-12-3456 HL345678C 2/15/1944 The Kinks 11 6 Dave Davies 678-90-1234 KR789123F 2/3/1947 The Kinks 13 7 Ray Davies 567-89-0123 HL456789D 6/21/1944 The Kinks 15 8 Peter Quaife 789-01-2345 AA123456A 12/31/1943 The Kinks 7 Here the match score is weighted so that it favors exact matches, assigning them twice the weight of fuzzier matches. It becomes much clearer that some matches are better than others. For example, Peter Quaife had a WMS value of 7 because while his entries in REF and CHK matched exactly on NINO and DOB, there were slight discrepancies in FNAME, LNAME, and SSN (each differed by one character). On the other hand, John Bonham and Ray Davies had much stronger matches, with WMS values of 15, indicating that they were exact matches. Because different identifiers might be more or less important, however, it s hard to say whether matches with weighted match scores of 15 are twice as good as a match with a value of 7. This approach could be taken even further. Matches on SSN, fuzzy or otherwise, could be given more weight than matches on first name, for example. COMPGED options can also be refined to change the penalties associated with certain types of mismatches between character expressions to best fit any given context. CONCLUSIONS The use of the COMPGED function in joins offers an incredible amount of flexibility when combining data sets. Allowing for fuzzy matches opens the door to over-matching a data set by including spurious matches, however. Creating a weighted match score to favor exact matches is a useful tool for refining results when combining data sets. REFERENCES SAS Institute Inc. 2011. SAS 9.2 Language Reference: Dictionary, Fourth Edition. Cary, NC: SAS Institute Inc. Functions and Call Routines: COMPGED. Staum, Paulette (2007). Fuzzy Matching using the COMPGED Function. Proceedings of the 2007 NorthEast SAS Users Group Conference. Teres, Jedediah J (2009). Using SQL Joins to Perform Weighted Matches on Multiple Identifiers. Proceedings of the 2009 NorthEast SAS Users Group Conference. ACKNOWLEDGMENTS SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. The author wishes to thank Paulette Staum for her encouragement, and Chris Bost and Aaron Hill for reviewing this paper. 5

CONTACT INFORMATION Jedediah Teres MDRC 16 East 34 th St, 19 th Floor New York, NY 10016 (212) 340-8807 telephone (212) 684-0832 fax jed.teres@mdrc.org www.mdrc.org 6