One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University



Similar documents
Introduction to Criteria-based Deduplication of Records, continued SESUG 2012

Dup, Dedup, DUPOUT - New in PROC SORT Heidi Markovitz, Federal Reserve Board of Governors, Washington, DC

Remove Orphan Claims and Third party Claims for Insurance Data Qiling Shi, NCI Information Systems, Inc., Nashville, Tennessee

The Essentials of Finding the Distinct, Unique, and Duplicate Values in Your Data

Statistics and Analysis. Quality Control: How to Analyze and Verify Financial Data

SQL SUBQUERIES: Usage in Clinical Programming. Pavan Vemuri, PPD, Morrisville, NC

Paper Merges and Joins Timothy J Harrington, Trilogy Consulting Corporation

Post Processing Macro in Clinical Data Reporting Niraj J. Pandya

Normalizing SAS Datasets Using User Define Formats

A Method for Cleaning Clinical Trial Analysis Data Sets

Tales from the Help Desk 3: More Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board

Subsetting Observations from Large SAS Data Sets

Overview. NT Event Log. CHAPTER 8 Enhancements for SAS Users under Windows NT

PharmaSUG Paper AD11

SAS Programming Tips, Tricks, and Techniques

Normalized EditChecks Automated Tracking (N.E.A.T.) A SAS solution to improve clinical data cleaning

SAS/Data Integration Studio Creating and Using A Generated Transformation Jeff Dyson, Financial Risk Group, Cary, NC

ABSTRACT THE ISSUE AT HAND THE RECIPE FOR BUILDING THE SYSTEM THE TEAM REQUIREMENTS. Paper DM

Effective Use of SQL in SAS Programming

Five Little Known, But Highly Valuable, PROC SQL Programming Techniques. a presentation by Kirk Paul Lafler

Essential Project Management Reports in Clinical Development Nalin Tikoo, BioMarin Pharmaceutical Inc., Novato, CA

Intro to Longitudinal Data: A Grad Student How-To Paper Elisa L. Priest 1,2, Ashley W. Collinsworth 1,3 1

Using the Magical Keyword "INTO:" in PROC SQL

Salary. Cumulative Frequency

Simple Rules to Remember When Working with Indexes Kirk Paul Lafler, Software Intelligence Corporation, Spring Valley, California

Topics. Database Essential Concepts. What s s a Good Database System? Using Database Software. Using Database Software. Types of Database Programs

Managing Data Issues Identified During Programming

From Database to your Desktop: How to almost completely automate reports in SAS, with the power of Proc SQL

Top Ten SAS DBMS Performance Boosters for 2009 Howard Plemmons, SAS Institute Inc., Cary, NC

Guide to Operating SAS IT Resource Management 3.5 without a Middle Tier

Table Lookups: From IF-THEN to Key-Indexing

Taming the PROC TRANSPOSE

UNIX Operating Environment

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

Hands-On Workshops HW003

Technical Paper. Defining an ODBC Library in SAS 9.2 Management Console Using Microsoft Windows NT Authentication

Programming Tricks For Reducing Storage And Work Space Curtis A. Smith, Defense Contract Audit Agency, La Mirada, CA.

SAS Credit Scoring for Banking 4.3

Innovative Techniques and Tools to Detect Data Quality Problems

Using SAS With a SQL Server Database. M. Rita Thissen, Yan Chen Tang, Elizabeth Heath RTI International, RTP, NC

Automate Data Integration Processes for Pharmaceutical Data Warehouse

Foundations & Fundamentals. A PROC SQL Primer. Matt Taylor, Carolina Analytical Consulting, LLC, Charlotte, NC

Unit 10: Microsoft Access Queries

Interfacing SAS Software, Excel, and the Intranet without SAS/Intrnet TM Software or SAS Software for the Personal Computer

SUGI 29 Applications Development

Release 2.1 of SAS Add-In for Microsoft Office Bringing Microsoft PowerPoint into the Mix ABSTRACT INTRODUCTION Data Access

Efficient Techniques and Tips in Handling Large Datasets Shilong Kuang, Kelley Blue Book Inc., Irvine, CA

A Microsoft Access Based System, Using SAS as a Background Number Cruncher David Kiasi, Applications Alternatives, Upper Marlboro, MD

ABSTRACT INTRODUCTION %CODE MACRO DEFINITION

PharmaSUG 2014 Paper CC23. Need to Review or Deliver Outputs on a Rolling Basis? Just Apply the Filter! Tom Santopoli, Accenture, Berwyn, PA

Using Proc SQL and ODBC to Manage Data outside of SAS Jeff Magouirk, National Jewish Medical and Research Center, Denver, Colorado

Using Macros to Automate SAS Processing Kari Richardson, SAS Institute, Cary, NC Eric Rossland, SAS Institute, Dallas, TX

Utilizing Clinical SAS Report Templates Sunil Kumar Gupta Gupta Programming, Thousand Oaks, CA

Automating SAS Macros: Run SAS Code when the Data is Available and a Target Date Reached.

USING SAS WITH ORACLE PRODUCTS FOR DATABASE MANAGEMENT AND REPORTING

Configuring an Alternative Database for SAS Web Infrastructure Platform Services

Importing Excel Files Into SAS Using DDE Curtis A. Smith, Defense Contract Audit Agency, La Mirada, CA

More Tales from the Help Desk: Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board

Handling Missing Values in the SQL Procedure

Data Warehousing. Paper

Using DATA Step MERGE and PROC SQL JOIN to Combine SAS Datasets Dalia C. Kahane, Westat, Rockville, MD

Shipping Administration Getting Started Guide

Health Services Research Utilizing Electronic Health Record Data: A Grad Student How-To Paper

SAS, Excel, and the Intranet

Make Better Decisions with Optimization

Integrity Constraints and Audit Trails Working Together Gary Franklin, SAS Institute Inc., Austin, TX Art Jensen, SAS Institute Inc.

An macro: Exploring metadata EG and user credentials in Linux to automate notifications Jason Baucom, Ateb Inc.

Transferring vs. Transporting Between SAS Operating Environments Mimi Lou, Medical College of Georgia, Augusta, GA

How to Create a Custom TracDat Report With the Ad Hoc Reporting Tool

College SAS Tools. Hope Opportunity Jobs

Session Attribution in SAS Web Analytics

SAS BI Dashboard 4.3. User's Guide. SAS Documentation

Paper PO03. A Case of Online Data Processing and Statistical Analysis via SAS/IntrNet. Sijian Zhang University of Alabama at Birmingham

SUGI 29 Coders' Corner

OS/390 SAS/MXG Computer Performance Reports in HTML Format

Building and Customizing a CDISC Compliance and Data Quality Application Wayne Zhong, Accretion Softworks, Chester Springs, PA

Dynamic Decision-Making Web Services Using SAS Stored Processes and SAS Business Rules Manager

Need for Speed in Large Datasets The Trio of SAS INDICES, PROC SQL and WHERE CLAUSE is the Answer, continued

SAS BI Dashboard 3.1. User s Guide

Order from Chaos: Using the Power of SAS to Transform Audit Trail Data Yun Mai, Susan Myers, Nanthini Ganapathi, Vorapranee Wickelgren

How To Write A Clinical Trial In Sas

From Validating Clinical Trial Data Reporting with SAS. Full book available for purchase here.

Using Pharmacovigilance Reporting System to Generate Ad-hoc Reports

Guido s Guide to PROC FREQ A Tutorial for Beginners Using the SAS System Joseph J. Guido, University of Rochester Medical Center, Rochester, NY

Search and Replace in SAS Data Sets thru GUI

Transcription:

One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University ABSTRACT In real world, analysts seldom come across data which is in ready to use format. Data cleaning is a critical aspect and one of the major problems faced is duplicate records within a dataset. For instance, we may want to have all unique Employee ID s in our dataset or we may want unique transactions from a list of customer transactions. This paper will cover various options available for treating duplicates records. Usage of options such as NODUPKEY in combination with DUPOUT, how NOUNIQUEKEY, NODUPRECS and NODUPKEY are different from each other, will be some of the points discussed here. INTRODUCTION Analysts spend the majority of time in the data exploration and preparation phase of a data mining project. A critical aspect of this phase is the detection and removal of duplicate records. This paper discusses how to resolve the problem of duplicate records. There are various ways in SAS to tackle the problem of duplication. A few options are mentioned in this paper and include NODUPRECS (alias NODUP), NODUPKEY. NOUNIQUEKEY is new to SAS version 9.3 and will be discussed as well. DATASET For the purpose of explanation, I have created a dataset Emp_Details containing details of employees working for different departments for a company XYZ. This dataset has 3 columns Emp_ID, Emp_Name and Dept. Table 1 contains the code for creating the dataset; whereas, Table 2 displays the dataset. Table 1. data Emp_Details; Input Emp_ID 6. Emp_Name $ Dept $; datalines; 110011 Jack Sales 110012 Jonny Finance 110016 Jill Sales 110025 Harry Admin 110018 Harry Admin 110030 Mary Finance 110031 Potter Finance 110030 Mary Finance 110041 Tom Admin ;

In Table 2, observation number 6 and 8 are exact duplicates, something which is definitely a case for deletion. However, observation number 4 and 5 have the same Employee Name (Harry) and Dept (Admin), but different employee IDs. This could be a possibility in real life where two people with same name work for the same department. Table 2. 4 110025 Harry Admin 5 110018 Harry Admin 6 110030 Mary Finance 7 110031 Potter Finance 8 110030 Mary Finance 9 110041 Tom Admin VARIOUS WAYS OF HANDLING DUPLICATE RECORDS NODUPRECS One of the most commonly used option is NODUPRECS. It deletes all those records which have the same values across all the variables. Example1 shows the code used to delete duplicate record using NODUPRECS and partial log displays 1 duplicate observations were deleted. Example 1 Proc Sort data=emp_details noduprecs; by Emp_ID; NOTE: There were 9 observations read from the data set WORK.EMP_DETAILS. NOTE: 1 duplicate observations were deleted. NOTE: The data set WORK.EMP_DETAILS has 8 observations and 3 variables.

Output Dataset 4 110018 Harry Admin 5 110025 Harry Admin 6 110030 Mary Finance 7 110031 Potter Finance 8 110041 Tom Admin Let s consider another example using the same dataset, after changing the BY variable. In example 2, the BY variable has been changed to Dept and we see in the SAS log that 0 duplicate observations were deleted. The reason is that SAS sorts the data on BY variable and compares one observation with the previous observation, if it encounters an exact match (same values across all variables) the record is deleted. So in example 2, after sorting by Dept., SAS couldn t find observation number 6 and 8 one after the other. Therefore, it is critical to use the correct BY variable to get the desired result. Example 2 Proc Sort data=emp_details noduprecs; by Dept; NOTE: There were 9 observations read from the data set WORK.EMP_DETAILS. NOTE: 0 duplicate observations were deleted. NOTE: The data set WORK.EMP_DETAILS has 9 observations and 3 variables. NODUPKEY NODUPKEY is another option available in SAS to handle duplicate records. SAS looks for exact matches for BY variables specified. The difference between NODUP and NODUPKEY is that NODUP looks for exact matches for all the variables, whereas NODUPKEY looks for exact matches for BY variables specified in the code. Example 3 contains the code for removing duplicates based on three BY variables: EMP_ID, Emp_Name and Dept. The result is similar to

the one in example 1. However we would get different results by not specifying all the variables in BY variable. Example 3 Proc Sort data=emp_details nodupkey dupout=dup_rec; by Emp_ID Emp_Name Dept; NOTE: There were 9 observations read from the data set WORK.EMP_DETAILS. NOTE: 1 observations with duplicate key values were deleted. NOTE: The data set WORK.DUP_REC has 1 observations and 3 variables. NOTE: The data set WORK.EMP_DETAILS has 8 observations and 3 variables. Output Dataset 4 110018 Harry Admin 5 110025 Harry Admin 6 110030 Mary Finance 7 110031 Potter Finance 8 110041 Tom Admin NOUNIQUEKEY NOUNIQUEKEY is a new option added in SAS version 9.3. This option eliminates unique records based on BY variable. Duplicate records deleted from the original dataset can be saved in another dataset by using the option OUT. If option OUT is not used, the original dataset is overwritten and contains only the duplicate cases. UNIQUEOUT option can be used to store all the unique cases from the original dataset. Example 4 contains the code for eliminating unique records from duplicates by using NOUNIQUEKEY and UNIQUEOUT option.

Example 4 Proc Sort data=emp_details out=dup_rec nouniquekey uniqueout=unique_rec; by Emp_ID Emp_Name Dept ; NOTE: There were 9 observations read from the data set WORK.EMP_DETAILS. NOTE: 7 observations with unique key values were deleted. NOTE: The data set WORK.DUP_REC has 2 observations and 3 variables. NOTE: The data set WORK.UNIQUE_REC has 7 observations and 3 variables. Output Dataset: DUP_REC 1 110030 Mary Finance 2 110030 Mary Finance Output Dataset: UNIQUE_REC 4 110018 Harry Admin 5 110025 Harry Admin 6 110031 Potter Finance 7 110041 Tom Admin CONCLUSION This was a brief discussion on removing duplicates using the readily available options in SAS in combination with PROC SORT. There are various other ways in which we can remove duplicate records. For instance using PROC SQL, data step, Macros etc.

REFERENCES SAS Institute Inc. (2012), Base SAS(R) 9.3 Procedures Guide, Second Edition Cary, NC. ACKNOWLEDGEMENTS I would like to thank Dr. Joni Shreve for her valuable guidance and support while I was writing this paper. CONTACT INFORMATION Your comments and questions are valued and welcomed. You can contact the author at: Jaya Dhillon, MSA Student Louisiana State University Phone: 919-324-445 E-Mail: jdhill1@lsu.edu SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. indicates USA registration. Other brand and product names are trademarks of their respective companies.