Using SQL Joins to Perform Fuzzy Matches on Multiple Identifiers Jedediah J. Teres, MDRC, New York, NY



Similar documents
Subsetting Observations from Large SAS Data Sets

Using Edit-Distance Functions to Identify Similar Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC

Imelda C. Go, South Carolina Department of Education, Columbia, SC

Taming the PROC TRANSPOSE

Introduction to SAS Mike Zdeb ( , #122

Counting the Ways to Count in SAS. Imelda C. Go, South Carolina Department of Education, Columbia, SC

Tales from the Help Desk 3: More Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board

More Tales from the Help Desk: Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board

A Closer Look at PROC SQL s FEEDBACK Option Kenneth W. Borowiak, PPD, Inc., Morrisville, NC

Identifying Invalid Social Security Numbers

Paper An Introduction to SAS PROC SQL Timothy J Harrington, Venturi Partners Consulting, Waukegan, Illinois

Introduction to Proc SQL Steven First, Systems Seminar Consultants, Madison, WI

Katie Minten Ronk, Steve First, David Beam Systems Seminar Consultants, Inc., Madison, WI

Labels, Labels, and More Labels Stephanie R. Thompson, Rochester Institute of Technology, Rochester, NY

9.1 SAS. SQL Query Window. User s Guide

Performing Queries Using PROC SQL (1)

CHAPTER 1 Overview of SAS/ACCESS Interface to Relational Databases

Search and Replace in SAS Data Sets thru GUI

Paper Creating Variables: Traps and Pitfalls Olena Galligan, Clinops LLC, San Francisco, CA

Let the CAT Out of the Bag: String Concatenation in SAS 9 Joshua Horstman, Nested Loop Consulting, Indianapolis, IN

Paper Merges and Joins Timothy J Harrington, Trilogy Consulting Corporation

Tips to Use Character String Functions in Record Lookup

SQL Simple Queries. Chapter 3.1 V3.0. Napier University Dr Gordon Russell

Information Systems SQL. Nikolaj Popov

Improved Interaction Interpretation: Application of the EFFECTPLOT statement and other useful features in PROC LOGISTIC

Using DATA Step MERGE and PROC SQL JOIN to Combine SAS Datasets Dalia C. Kahane, Westat, Rockville, MD

Encoding the Password

Chapter 9, More SQL: Assertions, Views, and Programming Techniques

COMP 110 Prasun Dewan 1

The Essentials of Finding the Distinct, Unique, and Duplicate Values in Your Data

MySQL for Beginners Ed 3

What You re Missing About Missing Values

Innovative Techniques and Tools to Detect Data Quality Problems

Reading Delimited Text Files into SAS 9 TS-673

Using the SQL Procedure

How to Create a Custom TracDat Report With the Ad Hoc Reporting Tool

Normalizing SAS Datasets Using User Define Formats

Alternatives to Merging SAS Data Sets But Be Careful

Effective Use of SQL in SAS Programming

Oracle Database 12c: Introduction to SQL Ed 1.1

2Creating Reports: Basic Techniques. Chapter

Fun with PROC SQL Darryl Putnam, CACI Inc., Stevensville MD

Nine Steps to Get Started using SAS Macros

Fuzzy Matching in Audit Analytics. Grant Brodie, President, Arbutus Software

Using SQL Queries in Crystal Reports

Writing a complaint letter

SUBQUERIES AND VIEWS. CS121: Introduction to Relational Database Systems Fall 2015 Lecture 6

EXST SAS Lab Lab #4: Data input and dataset modifications

Chapter 1 Overview of the SQL Procedure

The SET Statement and Beyond: Uses and Abuses of the SET Statement. S. David Riba, JADE Tech, Inc., Clearwater, FL

Table Lookups: From IF-THEN to Key-Indexing

PROC SQL for SQL Die-hards Jessica Bennett, Advance America, Spartanburg, SC Barbara Ross, Flexshopper LLC, Boca Raton, FL

Advanced Query for Query Developers

Technical Paper. Reading Delimited Text Files into SAS 9

CSC 443 Data Base Management Systems. Basic SQL

Workflow Solutions for Very Large Workspaces

A s h o r t g u i d e t o s ta n d A r d i s e d t e s t s

You have got SASMAIL!

Programming Idioms Using the SET Statement

MOC 20461C: Querying Microsoft SQL Server. Course Overview

Using the COMPUTE Block in PROC REPORT Jack Hamilton, Kaiser Foundation Health Plan, Oakland, California

Storing and Using a List of Values in a Macro Variable

DBF Chapter. Note to UNIX and OS/390 Users. Import/Export Facility CHAPTER 7

# or ## - how to reference SQL server temporary tables? Xiaoqiang Wang, CHERP, Pittsburgh, PA

Chapter 2 The Data Table. Chapter Table of Contents

PharmaSUG Paper AD08

SAS Programming Tips, Tricks, and Techniques

ADP Workforce Now V3.0

Improving Maintenance and Performance of SQL queries

Applications Development ABSTRACT PROGRAM DESIGN INTRODUCTION SAS FEATURES USED


Lab Manual. Databases. Microsoft Access. Peeking into Computer Science Access Lab manual

Antech Automated Services User Guide

CONVERTING FROM NETKEEPER ISAM v6.32 to NETKEEPER SQL

Paper TU_09. Proc SQL Tips and Techniques - How to get the most out of your queries

Outline. SAS-seminar Proc SQL, the pass-through facility. What is SQL? What is a database? What is Proc SQL? What is SQL and what is a database

That Mysterious Colon (:) Haiping Luo, Dept. of Veterans Affairs, Washington, DC

Foundations & Fundamentals. A PROC SQL Primer. Matt Taylor, Carolina Analytical Consulting, LLC, Charlotte, NC

Selecting Features by Attributes in ArcGIS Using the Query Builder

Internet/Intranet, the Web & SAS. II006 Building a Web Based EIS for Data Analysis Ed Confer, KGC Programming Solutions, Potomac Falls, VA

Access Tutorial 2: Tables

Configuring an Alternative Database for SAS Web Infrastructure Platform Services

Fine Grained Auditing In Oracle 10G

SONA SYSTEMS RESEARCHER DOCUMENTATION

Database Query 1: SQL Basics

Time needed. Before the lesson Assessment task:

STIR Education Micro-Innovations that raise results STUDENT ATTENDANCE SCANNER

Using SAS With a SQL Server Database. M. Rita Thissen, Yan Chen Tang, Elizabeth Heath RTI International, RTP, NC

Advanced Subqueries In PROC SQL

SAS Enterprise Guide A Quick Overview of Developing, Creating, and Successfully Delivering a Simple Project

Transcription:

Using SQL Joins to Perform Fuzzy Matches on Multiple Identifiers Jedediah J. Teres, MDRC, New York, NY ABSTRACT Matching observations from different data sources is problematic without a reliable shared identifier. Using multiple identifiers can be more restrictive as it requires multiple exact matches. One way around this is to create a match score based on the number of matching identifiers. This score can be weighted to favor certain matches or sets of matches (e.g., first name and last name) over others (e.g., first name and date of birth). This paper builds on a previous paper describing a technique for creating and using a match score in the context of a SQL join to find matches in cases when all identifiers are not exactly the same and allows for the use of COMPGED to find close matches without requiring strict equality. INTRODUCTION This paper expands upon a previously described method of combining data sets using multiple variables as matching criteria in by using COMPGED to allow for fuzzy matching. Prior to SAS 9.2, using COMPGED in the context of a SQL JOIN produced a note to the log each time a character was compared to a blank space. With the release of SAS 9.2, this is no longer an issue, and COMPGED can be used to expand the flexibility of JOINS in SQL. Knowledge of the SELECT statement and JOINs in is assumed. Prior knowledge of COMPGED is helpful but not necessary. An understanding of the PUT function is needed. SAMPLE DATA SETS Two SAS data sets are used for illustration purposes in this paper: REF and CHK. Data set REF Obs FNAME LNAME SSN NINO DOB 1 John Baldwin 123-45-6789 BM567891E 1/3/1946 2 Robert Plant 234-56-7890 YZ912345H 8/20/1948 3 Jimmy Page 345-67-8901 HL234567B 1/9/1944 4 John Bonham 456-78-9012 YZ891234G 5/31/1948 5 Ray Davies 567-89-0123 HL456789D 6/21/1944 6 Dave Davies 678-90-1234 KR789123F 2/3/1947 7 Peter Quaife 789-01-2345 AA123456A 12/31/1943 8 Mick Avory 890-12-3456 HL345678C 2/15/1944 Data set REF contains 5 variables and 8 observations. Each observation represents a unique sample member; there are no duplicate observations in the data set REF. Data set CHK 1 John Jones 123-45-6789 BM567891E 1/3/1946 Led Zeppelin 2 Robert Plant 234-56-7890 YZ912235H 8/20/1948 Led Zeppelin 3 Jimmy Page 345-67-8907 HL234567D 1/9/1944 Led Zeppelin 4 John Bonham 456-78-9012 YZ891234G 5/31/1948 Led Zeppelin 5 Ray Davies 567-89-0123 HL456789D 6/21/1944 The Kinks 6 Dave Davies 678-90-1234 KR789213F 2/3/1947 The Kinks 7 Pete Quiafe 789-01-2346 AA132456A 12/31/1943 The Kinks 8 Mick Avery 890-21-3456 HL345678C 2/15/1944 The Kinks 9 Jay Davie 567-89-1023 HI636789D 6/12/1984 Blue Devils 10 Dave David 678-90-1324 KR789213F 2/3/1974 The Minks Data set CHK contains 6 variables and 10 observations. Each observation represents a unique individual; there are no duplicate observations in the data set CHK. The data set CHK contains the same variables as the data set REF, but there is an additional variable, BAND. The goal is to merge the data set CHK to the data set REF in order to pick up the BAND variable for the people in the sample. 1

Using AN INNER JOIN TO COMBINE DATA SETS In its most basic form, the syntax for combing data sets using an inner join in is as straightforward as a match-merge in a DATA step. Here, the data sets REF and CHK are joined using the National Insurance Number field (NINO) as the common identifier. create table inner_join1 as on (ref.nino eq chk.nino) The resulting data set contains only 4 observations. The original data set REF had 8 observations, so 3 observations failed to return a match. 1 John Baldwin 123-45-6789 BM567891E 1/3/1946 Led Zeppelin 2 John Bonham 456-78-9012 YZ891234G 5/31/1948 Led Zeppelin 3 Mick Avory 890-12-3456 HL345678C 2/15/1944 The Kinks 4 Ray Davies 567-89-0123 HL456789D 6/21/1944 The Kinks Upon further investigation, it becomes clear that several NINOs do not match. John Baldwin BM567891E John Jones BM567891E Robert Plant YZ912345H Robert Plant YZ912235H Jimmy Page HL234567B Jimmy Page HL234567D John Bonham YZ891234G John Bonham YZ891234G Ray Davies HL456789D Ray Davies HL456789D Dave Davies KR789123F Dave Davies KR789213F Peter Quaife AA123456A Pete Quiafe AA132456A Mick Avory HL345678C Mick Avery HL345678C These are all fairly minor differences consisting mostly of the transposition of two numbers or the errant substitution of a letter. Each represents a possible data entry error. With a data set this small, values could be edited so they match on both files. In a larger data set, that would not be a viable solution, so it's well worth considering what other options are available. COMPGED Since NINO is a character variable, we can use the COMPGED function to compute the generalized edit distance between the two character values. The SAS documentation on the COMPGED function states that Generalized edit distance is a generalization of Levenshtein edit distance, which is a measure of dissimilarity between two strings. The finer points of COMPGED are beyond the scope of this paper, but the general syntax takes the form COMPGED(string-1, string-2 <,cutoff> <,modifiers>) The COMPGED function returns a value based on the difference between the two character strings. The default values for minor transpositions and substitutions tend to be 100 or less. Using the COMPGED function as part of the criteria in the JOIN would allow for greater flexibility and not require strict equality, making fuzzier matches possible. Technically, we are still matching on a binary outcome whether COMPGED returns a value less than 100 or not. It s the method of calculating that value that allows for some fuzziness. create table inner_join2a as on (compged(ref.nino,chk.nino) le 100) Note that in order to use COMPGED, we must specify the tolerance that is, what is the maximum level of dissimilarity between two values of NINO that is acceptable? By specifying that the value returned by the 2

COMPGED function must be less than or equal to 100, we are allowing for only very slight differences. COMPGED will return 0 if two strings are equal, so we are allowing strict equality as well. The resulting data set has 8 observations, but all is not well. Robert Plant from Led Zeppelin is missing, and Dave Davies returned a match for someone in a band called The Minks. 1 John Baldwin 123-45-6789 BM567891E 1/3/1946 Led Zeppelin 2 John Bonham 456-78-9012 YZ891234G 5/31/1948 Led Zeppelin 3 Jimmy Page 345-67-8901 HL234567B 1/9/1944 Led Zeppelin 4 Mick Avory 890-12-3456 HL345678C 2/15/1944 The Kinks 5 Dave Davies 678-90-1234 KR789123F 2/3/1947 The Kinks 6 Ray Davies 567-89-0123 HL456789D 6/21/1944 The Kinks 7 Peter Quaife 789-01-2345 AA123456A 12/31/1943 The Kinks 8 Dave Davies 678-90-1234 KR789123F 2/3/1947 The Minks The members of these bands have been assigned both British National Insurance Numbers (NINOs) and American Social Security Numbers (SSNs) for illustration purposes. While we were unable to match Robert Plant on his NINO, perhaps looking for fuzzy matches on SSN would return the desired match. The following code: create table inner_join2b as on (compged(ref.ssn,chk.ssn) le 100) Returns the following error message: ERROR: Function COMPGED requires a character expression as argument 1. ERROR: Function COMPGED requires a character expression as argument 2. ERROR: Expression using less than or equal (<=) has components that are of different data types. COMPGED requires character expressions, or strings. The PUT function returns a string value, and can be nested in the COMPGED function. The following code: create table inner_join2c as on (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100) Creates the following data set: 1 Ray Davies 567-89-0123 HL456789D 6/21/1944 Blue Devils 2 John Baldwin 123-45-6789 BM567891E 1/3/1946 Led Zeppelin 3 John Bonham 456-78-9012 YZ891234G 5/31/1948 Led Zeppelin 4 Jimmy Page 345-67-8901 HL234567B 1/9/1944 Led Zeppelin 5 Robert Plant 234-56-7890 YZ912345H 8/20/1948 Led Zeppelin 6 Mick Avory 890-12-3456 HL345678C 2/15/1944 The Kinks 7 Dave Davies 678-90-1234 KR789123F 2/3/1947 The Kinks 8 Ray Davies 567-89-0123 HL456789D 6/21/1944 The Kinks 9 Peter Quaife 789-01-2345 AA123456A 12/31/1943 The Kinks 10 Dave Davies 678-90-1234 KR789123F 2/3/1947 The Minks 3

We have successfully rescued Robert Plant from limbo but at the cost of exacerbating the over-match issue from the previous JOIN. Now we have Dave Davies from The Minks as well as Ray Davies from Blue Devils. Clearly, we must refine the match criteria. USING A WEIGHTED MATCH SCORE Until now, we have allowed for fuzzy matches, but we have treated them the same as exact matches. What would happen if we assigned more weight to exact matches while still allowing for fuzzy matches? The code below creates a weighted match score (see Teres 2009) that gives twice as much weight to exact matches as fuzzy matches. This match score is then used to select the observation with the closest match as defined by the HAVING clause. The following query: proc sql ; create table inner_join3b as select ref.*, band, ((2*(ref.ssn eq chk.ssn)) + (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100)) as wms on (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100) group by ref.ssn having calculated wms eq max(calculated wms) Creates the following data set: wms 1 John Baldwin 123-45-6789 BM567891E 1/3/1946 Led Zeppelin 3 2 John Bonham 456-78-9012 YZ891234G 5/31/1948 Led Zeppelin 3 3 Jimmy Page 345-67-8901 HL234567B 1/9/1944 Led Zeppelin 1 4 Robert Plant 234-56-7890 YZ912345H 8/20/1948 Led Zeppelin 3 5 Mick Avory 890-12-3456 HL345678C 2/15/1944 The Kinks 1 6 Dave Davies 678-90-1234 KR789123F 2/3/1947 The Kinks 3 7 Ray Davies 567-89-0123 HL456789D 6/21/1944 The Kinks 3 8 Peter Quaife 789-01-2345 AA123456A 12/31/1943 The Kinks 1 At this point, we have met the goal of combining the two data sets. However, the weighted match score (WMS) takes on only 2 values, 1 and 3. It may be desirable to include all available identifying information in the creation of the match score to allow for greater variability in the quality of the matches. The following code expands the idea of including both the exact and fuzzy matches to all identifiers. Note the use of the nested PUT function with the FORMAT MMDDYY10 for DOB. The following query: create table inner_join5 as select ref.*, band, ((2*(ref.ssn eq chk.ssn)) + (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100) + (2*(ref.nino eq chk.nino)) + (compged(ref.nino, chk.nino) le 100) + (2*(ref.fname eq chk.fname)) + (compged(ref.fname, chk.fname) le 100) + (2*(ref.lname eq chk.lname)) + (compged(ref.lname, chk.lname) le 100) + (2*(ref.dob eq chk.dob)) + (compged(put(ref.dob,mmddyy10.), put(chk.dob,mmddyy10.)) le 100)) as wms 4

on ( (compged(put(ref.ssn,9.),put(chk.ssn,9.)) le 100) or (compged(ref.nino, chk.nino) le 100) or (compged(ref.fname, chk.fname) le 100) or (compged(ref.lname, chk.lname) le 100) or (compged(put(ref.dob,mmddyy10.), put(chk.dob,mmddyy10.)) le 100) ) group by ref.ssn having ((calculated wms eq max(calculated wms))) order by band, ref.lname, ref.fname ; Produces the following data set: wms 1 John Baldwin 123-45-6789 BM567891E 1/3/1946 Led Zeppelin 12 2 John Bonham 456-78-9012 YZ891234G 5/31/1948 Led Zeppelin 15 3 Jimmy Page 345-67-8901 HL234567B 1/9/1944 Led Zeppelin 11 4 Robert Plant 234-56-7890 YZ912345H 8/20/1948 Led Zeppelin 12 5 Mick Avory 890-12-3456 HL345678C 2/15/1944 The Kinks 11 6 Dave Davies 678-90-1234 KR789123F 2/3/1947 The Kinks 13 7 Ray Davies 567-89-0123 HL456789D 6/21/1944 The Kinks 15 8 Peter Quaife 789-01-2345 AA123456A 12/31/1943 The Kinks 7 Here the match score is weighted so that it favors exact matches, assigning them twice the weight of fuzzier matches. It becomes much clearer that some matches are better than others. For example, Peter Quaife had a WMS value of 7 because while his entries in REF and CHK matched exactly on NINO and DOB, there were slight discrepancies in FNAME, LNAME, and SSN (each differed by one character). On the other hand, John Bonham and Ray Davies had much stronger matches, with WMS values of 15, indicating that they were exact matches. Because different identifiers might be more or less important, however, it s hard to say whether matches with weighted match scores of 15 are twice as good as a match with a value of 7. This approach could be taken even further. Matches on SSN, fuzzy or otherwise, could be given more weight than matches on first name, for example. COMPGED options can also be refined to change the penalties associated with certain types of mismatches between character expressions to best fit any given context. CONCLUSIONS The use of the COMPGED function in joins offers an incredible amount of flexibility when combining data sets. Allowing for fuzzy matches opens the door to over-matching a data set by including spurious matches, however. Creating a weighted match score to favor exact matches is a useful tool for refining results when combining data sets. REFERENCES SAS Institute Inc. 2011. SAS 9.2 Language Reference: Dictionary, Fourth Edition. Cary, NC: SAS Institute Inc. Functions and Call Routines: COMPGED. Staum, Paulette (2007). Fuzzy Matching using the COMPGED Function. Proceedings of the 2007 NorthEast SAS Users Group Conference. Teres, Jedediah J (2009). Using SQL Joins to Perform Weighted Matches on Multiple Identifiers. Proceedings of the 2009 NorthEast SAS Users Group Conference. ACKNOWLEDGMENTS SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. The author wishes to thank Paulette Staum for her encouragement, and Chris Bost and Aaron Hill for reviewing this paper. 5

CONTACT INFORMATION Jedediah Teres MDRC 16 East 34 th St, 19 th Floor New York, NY 10016 (212) 340-8807 telephone (212) 684-0832 fax jed.teres@mdrc.org www.mdrc.org 6