The Essentials of Finding the Distinct, Unique, and Duplicate Values in Your Data




Carter Sevick, MS, DoD Center for Deployment Health Research, San Diego, CA

ABSTRACT

Whether by design or by error, there are data sets out there that contain variables, combinations of variables, or whole observations that are not unique. For example, duplications of a patient identifier should be expected in a medical records file, but perhaps not in a master list of patients. Unanticipated duplicates in your data can be the source of hours of headaches, yet they may be surprisingly easy to resolve once you apply a few well-chosen techniques. Basic knowledge of PROC SORT and DATA step processing (including SET and MERGE) is assumed.

INTRODUCTION

Every data set should have a primary key, and every row should have a unique instance of that primary key. So goes the theory, but in real life there are imperfect data sets. Sometimes whole rows are loaded into a data set more than once. A person may be entered into a system twice, perhaps with conflicting information, such as different birth dates. If a data set has rows duplicated on its primary key, then you cannot count how many distinct individuals are represented. SAS performs table merges brilliantly when the relationship between two tables is one-to-one or one-to-many, but if two tables are merged with an unobserved many-to-many relationship then you may get unexpected results. And if you have duplicated data in a large data set, it is not practical to simply print out the data set to find and deal with the offending rows.

In this paper we will explore how to deal with simple row duplication, how to isolate rows that are unique, and how to isolate rows that are duplicated. Resolving conflicting data, such as the birthday example above, may not be possible given your level of access to the data, and the offending rows may need to be collected and sent to someone else.
THE WHOLE ROW IS DUPLICATED: GETTING THE DISTINCT LIST

You have been given the task of collecting demographic information on a list of subjects to complement a study that you are currently working on. Each subject has a four-digit subject ID number named sub_id. The data set is as follows:

Data Set: sub_list

  Obs    sub_id
    1    1204
    2    5023
    3    3029
    4    1106
    5    2034
    6    5023

You can see that one subject is represented twice, but remember that with real-life data you will not be able to see the problem. Consequently, whenever you are confronted with a new data set, the first action you should take is to check for duplicates. Fortunately, the SORT procedure has a simple option for this situation: NODUPKEY. Including this option on the PROC SORT statement causes the procedure to compare the BY values of the current row with those of the previous row; the current row is deleted if an exact match is found.

When using PROC SORT it is always a good idea to use the OUT= option to avoid overwriting the original data set (unless you really mean to). When using NODUPKEY it is even more important to use OUT=, since deleted rows are gone for good, or at least until you can get your DBA to reload your data set. To eliminate the duplicates in sub_list we can use the following code:

PROC SORT DATA = sub_list OUT = sub_list_tmp NODUPKEY;
  BY sub_id;
RUN;

NOTE: There were 6 observations read from the data set WORK.SUB_LIST.
NOTE: 1 observations with duplicate key values were deleted.
NOTE: The data set WORK.SUB_LIST_TMP has 5 observations and 1 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.12 seconds
      cpu time            0.01 seconds

The log file reveals a very handy feature. Notice that in the second note SAS tells us how many rows were deleted. If you just want to check for duplicates, but do not want to create another data set, you can set OUT = _NULL_ and you will still get this information in the log:

PROC SORT DATA = sub_list OUT = _NULL_ NODUPKEY;
  BY sub_id;
RUN;

The demographics file is said to have one row per individual and contains the following fields:

Variables in Creation Order

  #  Variable     Type  Len  Format  Label
  1  sub_id       Char    4          Unique identifier of a subject
  2  gender       Char    1          Gender of subject
  3  dob          Num     8  DATE9.  Subject's date of birth
  4  race_ethnic  Char    1          Race/ethnicity of subject
  5  education    Char    1          Education level of subject

In order to access the needed information we submit the following code:

*** Merge to demographic set ***;
PROC SORT DATA = sub_list OUT = sub_list_tmp NODUPKEY;
  BY sub_id;
RUN;

PROC SORT DATA = study_demographics OUT = study_demo_tmp;
  BY sub_id;
RUN;

DATA demo_subset;
  MERGE sub_list_tmp (IN = a)
        study_demo_tmp;
  BY sub_id;
  IF a;
RUN;

This program is simple and straightforward, so no problems should be expected. Or should they? You must always remember the most important rule of SAS programming: check your log file.

61   DATA demo_subset;
62     MERGE
63       sub_list_tmp (IN = a)
64       study_demo_tmp;
65     BY sub_id;
66
67     IF a;
68   RUN;

NOTE: There were 5 observations read from the data set WORK.SUB_LIST_TMP.
NOTE: There were 11 observations read from the data set WORK.STUDY_DEMO_TMP.
NOTE: The data set WORK.DEMO_SUBSET has 8 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

There are no errors or warnings, but if you look closely you will notice that WORK.SUB_LIST_TMP has 5 observations while the output data set has 8. If each data set were truly structured as one row per subject, then the output data set should also have 5 observations. Clearly, the demographics file has some duplication by subject ID. Since the output table is small, we can print it in its entirety and inspect it, though this will almost never be possible in real life. Later in this paper we will cover what to do if the table is too large to be printed.

  1  1106  F  05SEP1973  2  C
  2  1204  M  22JAN1972  1  H
  3  1204  M  23JAN1972  1  H
  4  1204  M  22JAN1972  1  H
  5  2034  M  31JAN1962  3  H
  6  3029  F  20AUG1970  1  G
  7  5023  M  13APR1969  2  C
  8  5023  M  13APR1969  2  C

As you can see, subject 5023 has two identical rows and subject 1204 has three rows (two that are identical, and one with a different birthday). This situation is different from the one with the data set sub_list: there we had only a single field to worry about, but here we have five. Consider the result of sorting demo_subset with NODUPKEY and BY sub_id alone:

  1  1106  F  05SEP1973  2  C
  2  1204  M  22JAN1972  1  H
  3  2034  M  31JAN1962  3  H
  4  3029  F  20AUG1970  1  G
  5  5023  M  13APR1969  2  C

If this is run then we will get a data set (above) that has rows unique by sub_id, and subject 5023 will have all of his information preserved intact, but not 1204: two of his rows will be deleted, leaving you with a birthday that may not be correct. If SAS is using its default sort engine, then the row that is read first by the SORT procedure will be preserved and the later ones deleted. This can change under certain circumstances, and if you need that default behavior to be ensured then you should use the EQUALS option on the PROC SORT statement. Either way, the outcome is undesirable.

To make sure that only whole-row duplicates are deleted, you must sort on all variables in the set. If you have a data set with a large number of variables, then you can use the name list _ALL_. This will signal PROC SORT to include all variables in the BY statement:

BY _ALL_;

Unfortunately, SAS does not let you choose the order of the BY variables; the order is determined by the order in which they appear in the data set. PROC CONTENTS displays this information in the column with the # header. So, the above code is equivalent to:

BY sub_id gender dob race_ethnic education;

Fortunately for us, SAS is pretty smart, and there is a recourse if you want the result set sorted first by dob, or by one of the other variables:

BY dob _ALL_;

NOTE: There were 8 observations read from the data set WORK.DEMO_SUBSET.
NOTE: Duplicate BY variable(s) specified. Duplicates will be ignored.
NOTE: 2 observations with duplicate key values were deleted.
NOTE: The data set WORK.DEMO_SUBSET_TMP has 6 observations and 5 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.06 seconds
      cpu time            0.00 seconds

In the above submission, SAS actually sees the following:

BY dob sub_id gender dob race_ethnic education;

but the data set will be sorted as if the second instance of dob weren't there. For the purposes of this project we will use the following code:

BY sub_id gender dob race_ethnic education;

The result follows:

  1  1106  F  05SEP1973  2  C
  2  1204  M  22JAN1972  1  H
  3  1204  M  23JAN1972  1  H
  4  2034  M  31JAN1962  3  H
  5  3029  F  20AUG1970  1  G
  6  5023  M  13APR1969  2  C

So, all of our whole-row duplicates have been taken care of, but we still have to deal with the fact that subject 1204 has two rows with two different birthdays. In the next section we will investigate BY-group processing as a way to sort out these data.
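Before moving on, note that the distinct-versus-duplicate question can also be asked in PROC SQL, which the conclusion mentions as another route to the same end. The following is a minimal sketch, not from the original paper (the output table name dup_ids is invented for illustration), that lists each key value appearing on more than one row of demo_subset_tmp:

```sas
PROC SQL;
  /* List each sub_id that occurs on more than one row, with its count. */
  CREATE TABLE dup_ids AS
    SELECT sub_id, COUNT(*) AS n_rows
    FROM demo_subset_tmp
    GROUP BY sub_id
    HAVING COUNT(*) > 1;
QUIT;
```

Run against the six-row demo_subset_tmp above, this should leave a single row, sub_id 1204 with n_rows of 2, matching the duplicated key isolated with BY-group processing in the next section.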

ROWS WITH DUPLICATE PRIMARY KEYS BUT CONFLICTING DATA VALUES

BY-GROUP PROCESSING

Like many SAS procedures, the DATA step can use a BY statement. Just as with the procedures, the input data set must be presorted on the same variables that appear in the BY statement. When you include a BY statement in the DATA step, two temporary variables are created for each variable in the BY statement. These temporary variables are named FIRST.variable-name and LAST.variable-name, and each takes a value of either 1 or 0. Consider the following program and output:

data test;
  input col @@;
  datalines;
2 2 2 1 1 3 3 3 3
;

proc sort data = test;
  by col;
run;

data ByProcess;
  set test;
  by col;
  first = first.col;
  last  = last.col;
run;

  Obs  col  first  last
    1    1      1     0
    2    1      0     1
    3    2      1     0
    4    2      0     0
    5    2      0     1
    6    3      1     0
    7    3      0     0
    8    3      0     0
    9    3      0     1

Remember that FIRST.col and LAST.col are temporary variables and are not visible in PROC PRINT output. To see the values they took, they were assigned to the variables first and last, respectively. Whenever the DATA step processes the first row in a BY group, FIRST.variable-name is given the value 1, and 0 otherwise. Likewise, LAST.variable-name is assigned the value 1 when the DATA step is processing the last row in a BY group, and 0 otherwise.

BY-group processing has a multitude of uses beyond the topic of this paper. It can be used to deduplicate a data set, subset out duplicate rows, evaluate data across multiple rows, and much more.

USING BY PROCESSING

In order to isolate rows that are unique or duplicated on a variable or list of variables, we can use BY processing. Since each row should be unique by sub_id, we will use it as our BY variable. The data set demo_subset_tmp is already sorted by sub_id and so is ready for BY processing without further sorting.

DATA demo_by_sub_id;
  SET demo_subset_tmp;
  BY sub_id;
  first = FIRST.sub_id;
  last  = LAST.sub_id;
RUN;

                              first  last
  1  1106  F  05SEP1973  2  C     1     1
  2  1204  M  22JAN1972  1  H     1     0
  3  1204  M  23JAN1972  1  H     0     1
  4  2034  M  31JAN1962  3  H     1     1
  5  3029  F  20AUG1970  1  G     1     1
  6  5023  M  13APR1969  2  C     1     1

The above table reveals a recurring pattern: if a row is unique by the BY variables, then FIRST.var and LAST.var will both be equal to one; otherwise, there are at least two rows with the same BY values. The following two DATA steps isolate the unique and the duplicated rows.

DATA unique_rows;
  SET demo_subset_tmp;
  BY sub_id;
  IF FIRST.sub_id = 1 AND LAST.sub_id = 1;
RUN;

  1  1106  F  05SEP1973  2  C
  2  2034  M  31JAN1962  3  H
  3  3029  F  20AUG1970  1  G
  4  5023  M  13APR1969  2  C

DATA dup_rows;
  SET demo_subset_tmp;
  BY sub_id;
  IF ^(FIRST.sub_id = 1 AND LAST.sub_id = 1);
RUN;

  1  1204  M  22JAN1972  1  H
  2  1204  M  23JAN1972  1  H

A WORD ON EFFICIENCY

If you are working with large data sets, even a simple DATA step (like those above) may take a significant amount of time to run. In such circumstances, anything we can do to limit the number of reads becomes important. Multiple output data sets can be specified on the DATA statement, and rows can be conditionally output to each set using the OUTPUT statement. The syntax for this statement is:

OUTPUT data_set_name;

It is optional to specify a data set name: if you do not, the current row is output to all data sets named in the DATA statement; if a data set name is specified, the current row is output to that data set only. The following DATA step does the job of both of the previous steps, and demo_subset_tmp is read only once:

DATA unique_rows dup_rows;
  SET demo_subset_tmp;
  BY sub_id;
  IF FIRST.sub_id = 1 AND LAST.sub_id = 1 THEN OUTPUT unique_rows;
  ELSE OUTPUT dup_rows;
RUN;

CONCLUSION

At first, duplicated data seems to present significant challenges to our day-to-day data management operations, but we have seen that SAS offers several useful tools to help resolve the issues. In addition to what has been shown, PROC SORT offers several other options to help the programmer. PROC SQL is another tool that offers many routes to the same end. I would encourage everyone to investigate the SAS documentation for these procedures.

ACKNOWLEDGMENTS

The author would like to thank his wife Myrna for her support and patience throughout this process. Also, Tyler Smith, PhD, and the members of the Data Team at the DoD Center for Deployment Health Research, who have been an invaluable source of inspiration, support, and assistance. The members include Besa Smith, Isabel Jacobson, Cynthia Leard, Charlene Wong, Anna Bukowinski, Molly Kelton, LtCol Christopher Phillips, and Martin White. Thank you to the leadership of WUSS 2008 for making this venue possible.

RECOMMENDED READING

SAS Institute Inc. 2006. SAS OnlineDoc 9.1.3, "SAS Language Reference: Concepts." http://support.sas.com/onlinedoc/913/docmainpage.jsp
SAS Institute Inc. 2006. SAS OnlineDoc 9.1.3, "SAS Language Reference: Dictionary." http://support.sas.com/onlinedoc/913/docmainpage.jsp

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Carter Sevick
Naval Health Research Center
DoD Center for Deployment Health Research
140 Sylvester Rd.
San Diego, CA 92106-3521
Work Phone: (619) 767-4762
Fax: (619) 553-7601
E-mail: carter.sevick@med.navy.mil

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. (R) indicates USA registration. Other brand and product names are trademarks of their respective companies.
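A NOTE ON DUPOUT. The conclusion mentions that PROC SORT offers other helpful options; one worth knowing, assuming you are running SAS 9 or later, is DUPOUT=, which routes the observations that NODUPKEY deletes into a second data set instead of discarding them. Below is a minimal sketch, not from the original paper (the output names demo_distinct and key_dups are invented for illustration), applied to this paper's demo_subset:

```sas
/* DUPOUT= (SAS 9+) captures the rows that NODUPKEY would otherwise
   delete, giving the distinct list and the discards in one pass.  */
PROC SORT DATA = demo_subset OUT = demo_distinct
          NODUPKEY DUPOUT = key_dups;
  BY sub_id;
RUN;
```

Note that DUPOUT= captures only the deleted rows, i.e. the second and later occurrences of each key, so key_dups is not identical to the dup_rows data set built with FIRST./LAST. processing above, which also keeps the first occurrence of each duplicated key.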

APPENDIX: DATA SETS

DATA sub_list;
  INPUT sub_id : $4.;
  DATALINES;
1204
5023
3029
1106
2034
5023
;

DATA study_demographics;
  INPUT sub_id      : $4.
        gender      : $1.
        dob         : DATE9.
        race_ethnic : $1.
        education   : $1.;
  FORMAT dob DATE9.;
  LABEL sub_id      = "Unique identifier of a subject"
        gender      = "Gender of subject"
        dob         = "Subject's date of birth"
        race_ethnic = "Race/ethnicity of subject"
        education   = "Education level of subject";
  DATALINES;
2039 F 20NOV1971 1 G
1204 M 22JAN1972 1 H
5023 M 13APR1969 2 C
3025 M 10APR1968 2 G
1204 M 23JAN1972 1 H
3029 F 20AUG1970 1 G
1106 F 05SEP1973 2 C
2034 M 31JAN1962 3 H
5023 M 13APR1969 2 C
1204 M 22JAN1972 1 H
6101 F 08AUG1970 2 C
;