Macro utility to compare multiple SAS data sets Krish Krishnan, Statistical Programming Scientist, Quintiles



Similar documents
Post Processing Macro in Clinical Data Reporting Niraj J. Pandya

How To Write A Clinical Trial In Sas

Chapter 25 Specifying Forecasting Models

Workflow Solutions for Very Large Workspaces

Directions for the Well Allocation Deck Upload spreadsheet

Switching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ

A Closer Look at PROC SQL s FEEDBACK Option Kenneth W. Borowiak, PPD, Inc., Morrisville, NC

PRN Medical Transport Employment Application Packet

We begin by defining a few user-supplied parameters, to make the code transferable between various projects.

Same Data Different Attributes: Cloning Issues with Data Sets Brian Varney, Experis Business Analytics, Portage, MI

Integrity Constraints and Audit Trails Working Together Gary Franklin, SAS Institute Inc., Austin, TX Art Jensen, SAS Institute Inc.

ing Automated Notification of Errors in a Batch SAS Program Julie Kilburn, City of Hope, Duarte, CA Rebecca Ottesen, City of Hope, Duarte, CA

Search and Replace in SAS Data Sets thru GUI

A Macro to Create Data Definition Documents

Time & Attendance Supervisor Basics for ADP Workforce Now. Automatic Data Processing, LLC ES Canada

Labels, Labels, and More Labels Stephanie R. Thompson, Rochester Institute of Technology, Rochester, NY

Alternatives to Merging SAS Data Sets But Be Careful

PO-18 Array, Hurray, Array; Consolidate or Expand Your Input Data Stream Using Arrays

Essential Project Management Reports in Clinical Development Nalin Tikoo, BioMarin Pharmaceutical Inc., Novato, CA

OBJECT_EXIST: A Macro to Check if a Specified Object Exists Jim Johnson, Independent Consultant, North Wales, PA

Can SAS Enterprise Guide do all of that, with no programming required? Yes, it can.

How To Create An Audit Trail In Sas

Applications Development ABSTRACT PROGRAM DESIGN INTRODUCTION SAS FEATURES USED

Custom Reporting Basics for ADP Workforce Now. Automatic Data Processing, LLC ES Canada

SAS Add-In 2.1 for Microsoft Office: Getting Started with Data Analysis

REx: An Automated System for Extracting Clinical Trial Data from Oracle to SAS

Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER

PharmaSUG Paper QT26

ABSTRACT INTRODUCTION %CODE MACRO DEFINITION

A/P Payment Selection Based on A/R Cash Receipts AP-1108

Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science

Teamstudio USER GUIDE

NATIONAL FLEET DATABASE FLEET OWNER/MOTOR TRADER MANUAL. Page 1

Payroll Basics for ADP Workforce Now. Automatic Data Processing, LLC ES Canada

CUSTOMER PORTAL USER GUIDE FEBRUARY 2007

How to Create a Data Table in Excel 2010

Quick Start to Data Analysis with SAS Table of Contents. Chapter 1 Introduction 1. Chapter 2 SAS Programming Concepts 7

Hands-On Lab. Building a Data-Driven Master/Detail Business Form using Visual Studio Lab version: Last updated: 12/10/2010.

PharmaSUG 2014 Paper CC23. Need to Review or Deliver Outputs on a Rolling Basis? Just Apply the Filter! Tom Santopoli, Accenture, Berwyn, PA

Using Macros to Automate SAS Processing Kari Richardson, SAS Institute, Cary, NC Eric Rossland, SAS Institute, Dallas, TX

Eclipse Palm Sales Force Automation. Release (Eterm)

Let the CAT Out of the Bag: String Concatenation in SAS 9 Joshua Horstman, Nested Loop Consulting, Indianapolis, IN

Importing Excel Files Into SAS Using DDE Curtis A. Smith, Defense Contract Audit Agency, La Mirada, CA

SQL Server Integration Services with Oracle Database 10g

This document was derived from simulation software created by Steve Robbins which was supported by NSF DUE

Microsoft Access 2010 Overview of Basics

How to easily convert clinical data to CDISC SDTM

Internet/Intranet, the Web & SAS. II006 Building a Web Based EIS for Data Analysis Ed Confer, KGC Programming Solutions, Potomac Falls, VA

Gentran Integration Suite. Archiving and Purging. Version 4.2

Windows File Management A Hands-on Class Presented by Edith Einhorn

Release Notes DAISY 4.0

Building and Customizing a CDISC Compliance and Data Quality Application Wayne Zhong, Accretion Softworks, Chester Springs, PA

Introduction to SAS on Windows

Spam Marshall SpamWall Step-by-Step Installation Guide for Exchange 5.5

Review Easy Guide for Administrators. Version 1.0

EDI Advantage Integration SO-1184

SAS Enterprise Guide in Pharmaceutical Applications: Automated Analysis and Reporting Alex Dmitrienko, Ph.D., Eli Lilly and Company, Indianapolis, IN

An macro: Exploring metadata EG and user credentials in Linux to automate notifications Jason Baucom, Ateb Inc.

Flat Pack Data: Converting and ZIPping SAS Data for Delivery

Methodologies for Converting Microsoft Excel Spreadsheets to SAS datasets

Intellect Platform - The Workflow Engine Basic HelpDesk Troubleticket System - A102

Vim, Emacs, and JUnit Testing. Audience: Students in CS 331 Written by: Kathleen Lockhart, CS Tutor

Be a More Productive Cross-Platform SAS Programmer Using Enterprise Guide

SPSS: Getting Started. For Windows

Abstract. Introduction. System Requirement. GUI Design. Paper AD


IBM Campaign and IBM Silverpop Engage Version 1 Release 2 August 31, Integration Guide IBM

ABSTRACT INTRODUCTION THE MAPPING FILE GENERAL INFORMATION

ScriptLogic File System Auditor User Guide

Creating a Simple Macro

MICROSOFT ACCESS 2003 TUTORIAL

What's New in ADP Reporting?

Instant Interactive SAS Log Window Analyzer

Configuration Manager

Directions for the AP Invoice Upload Spreadsheet

Applications Development

An Introduction to SAS/SHARE, By Example

TIBCO Spotfire Metrics Modeler User s Guide. Software Release 6.0 November 2013

DI SHAREPOINT PORTAL. User Guide

PROC SQL for SQL Die-hards Jessica Bennett, Advance America, Spartanburg, SC Barbara Ross, Flexshopper LLC, Boca Raton, FL

How to Create a Custom TracDat Report With the Ad Hoc Reporting Tool

Paper PO03. A Case of Online Data Processing and Statistical Analysis via SAS/IntrNet. Sijian Zhang University of Alabama at Birmingham

SAS ODS HTML + PROC Report = Fantastic Output Girish K. Narayandas, OptumInsight, Eden Prairie, MN

Hypercosm. Studio.

A User s Guide to Helm

Database Security. The Need for Database Security

Statistics and Data Analysis

BlueJ Teamwork Tutorial

Creating Codes with Spreadsheet Upload

One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University

IBM Business Monitor V8.0 Global monitoring context lab

Transcription:

2013 acro utility to compare multiple data sets Krish Krishnan, tatistical rogramming cientist, Quintiles bstract his paper describes an efficient method to compare pairs of data sets. he utility presented here will compare one or more pairs of data sets and will generate a condensed user friendly report outlining the findings. he summary report generated by this utility lists above and beyond the 16 comparison results that can be obtained from the return codes stored in the automatic macro I from. aximum efficiency from this macro can be achieved if the business process involves independent validation of analysis data sets, tables, listings or any other outputs that requires creation of permanent data sets by both production and independent Q programmer. he macro can be modified by end user to accommodate other scenarios (e.g. to check only the data structure compliance of current data transfer to previous data transfer, print the entire output by removing the I option etc.). wo programs are needed and both the files containing the codes are available for download through ropbox (sign in not required): https://www.dropbox.com/s/rwzt03u4nlh1uww/mprashrd.sas https://www.dropbox.com/s/qe65mcp9qy09d19/unompare.sas Introduction utput elivery ystem () is used in conjunction with traffic lighting to compare one or more pairs of data sets and a ich ext ormat () output is generated that identifies 16 comparison results that comes out of the. raffic lighting is automatically created in the report to display in green or red whether the compare passed or failed. In addition to the 16 standard comparison results, the following additional checks are done as well: Will always check and flag as error, if production data set date is after Q data set date. If enabled, the report will flag when variable ordering (position) between the production and Q data set does not match. ( option in procedure lists the variables by the position in the data set). his check is an optional feature. When comparing data sets, if business process dictates any specific character (e.g. a blank space) needs to be ignored, this utility will ignore a specific character or a list of characters (optional feature). pecific criterion can be used (other than the default of 0.00001) to judge the equality of numeric values. How to run the macro acro program is called mprashbrd.sas (ompare dash board). his macro in general is not expected to be changed unless the functionality needs to be modified. he approach taken by this paper is to only illustrate how to run this macro and how to interpret the summary report. unompare.sas calls the macro mprashbrd and this program lists the simulated pairs of data sets to compare. his program is expected to be modified to meet the end user s expectation. oth files (mprashbrd.sas and unompare.sas) can be downloaded from the above mentioned ropbox. acro mprashbrd can be invoked as follows: %mprashbrd(_set =%str(wk.),_oot=%str(tudyootirectoryoeshere), _I=%str(rogrammingirectory),_ile=%str(unompare), _itle1 =%str(ponsor Inc.), _itle2 =%str(rotocol 0000001), _mpres=%str(),_riter =0.000000001,_arum =es) ; age 1 of 8

able 1: he following table describes the macro parameters used in mprashbrd macro: acro parameter escription 1. _set ser created work data set, containing the list of production and Q data sets to compare. 2. _oot haracters string describing the project root directory location. sed in the footnote for identification purpose only. 3. _I haracter string describing the programming directory within the root location. sed in the footnote for identification purpose only. 4. _ile haracter string providing the file name. xtension (.) is not required. s a good practice, keep the same name for the and file. or demonstration (in section xample), several s are created within the same program. 5. _itle1 itle 1 to be used in the report. 6. _itle2 itle 2 to be used in the report. 7. _mpres haracters that can be ignored when comparing. Impacts character variables only. _mpres=%str() o be used for as is comparison. _mpres=%str( @).g. to ignore blank space and @ sign. ote there is a space before @ sign in %str( @). 8. _riter riterion to be used in _riter =%str() default of 0.00001 will be used. _riter=0.000000001 bsolute relative difference of 0.000000001 is used by. 9. _arum ompare variable position. ( in ). _arum =es ompares variable position. _arum=o Ignores variable position acro parameters _oot, _I, _ile, _itle1, _itle2 are used either in the footnotes or titles to provide identifying information about the name of the file, location of the file and the title for the report ( and file name are assumed to be the same). alues for macro parameters _set, _oot, _I, _ile, _itle1, _itle2 will remain constant in this paper and therefore will not be repeated during the various illustrations. How to interpret the summary report his report will list the production data set name and the production I along with the 16 comparison results (e.g. data set labels, data set types, variable labels, variable types etc.) that are obtained from the return codes in the automatic macro I when comparing the production data set to the Q data set. In addition to these 16 results, report also lists the following two checks: Whether production data set date is after Q data set date. Whether variable position match between production and Q. he macro will compare any pair of data sets and in the report under the column utcome will print a ass or ail to identify whether a compare passed or failed all the comparison checks. olumn unrder in the report indicates the order in which the end user supplied the pairs of data sets to compare. o interpret the 16 standard comparison results, the following macro return codes table (able 2) was extracted from reference #1 (ase () 9.2 rocedures uide, esults: rocedure). he footnote generated in the report will always provide a description of these 16 results. able 2: it ondition ode Hex escription 1 1 0001X ata set labels differ 2 2 0002X ata set types differ 3 I 4 0004X ariable has different informat 4 8 0008X ariable has different format age 2 of 8

it ondition ode Hex escription 5 H 16 0010X ariable has different length 6 32 0020X ariable has different label 7 64 0040X ase data set has observation not in comparison 8 128 0080X omparison data set has observation not in base 9 256 0100X ase data set has group not in comparison 10 512 0200X omparison data set has group not in base 11 1024 0400X ase data set has variable not in comparison 12 2048 0800X omparison data set has variable not in base 13 4096 1000X value comparison was unequal 14 8192 2000X onflicting variable types 15 16384 4000X variables do not match 16 32768 8000X atal error: comparison not done When the report flags production data set date is > Q data set date, this indicates potential problems; a production programming update was made and Q programmer failed to implement the update and re-run the compare or production programmer ran the program and failed to communicate the information to the Q programmer. his can lead to quality issues and to ensure the report is clean and comply with the business process correctly, Q programmer will need to re-run the program and re-qc the output (this corrective measure will ensure Q data set date is after production data set date). hese 16 standard comparisons and the check to ensure production data set date is after Q data set date are mandatory and will always be checked by the macro. ollowing comparison checks are optional and can be enabled or disabled while invoking the macro: _arum =es: Whenever the variable position (or order of the variables in the data set) is critical and if this needs to be compared between the production and Q data set, this option can be enabled. When the report identifies this as an issue, production or Q programmer will need to reorder the variable. eference # 4 provides detailed description on how to reorder variables in a data set. _mpres: efault for this parameter is %str(), and comparison will be done as is in this case. In certain cases, production programmer can use special characters (e.g. @,, $ etc.) while generating outputs and the corresponding image () data sets (e.g. blank space used for indenting sub categories). his may not need to be verified by the Q programmer electronically since visual cosmetic check of the production output is always expected. In this scenario, use the option_mpres=str( ) and this instructs the macro to ignore blank space when comparing production to Q data sets. Instead of a single character, list of characters can also be passed to this macro parameter. s an example, _mpres=%str(@ $ ) will instruct the macro to ignore @, blank space, $ and. _riter: efault for this parameter is %str(). In this case, default criterion for judging the equality of numeric values will be used. efault as per the is 0.00001. If any other criterion needs to be used, this parameter can be used (e.g. _riter=0.000000001). or all the optional comparison checks outlined, report will dynamically change the title of the report to indicate what option is enabled. age 3 of 8

inally the generated report will indicate in one line for each pair of comparison data sets, whether the compare passed (in green) or failed (in red). he report will print all the failures first followed by the successful matching comparison. his is intended so the user can take corrective measure to address all failed comparison. xamples section of unompare. is presented here to support the demonstration of all examples. * imulated data sets are temporary (WK). tility should be used on permanent data sets ; *++++++++++++++++++++++++++; data class1(label="tudent ata") ; retain ge ; set sashelp.class ; if height = 69 then height = height + 0.000000001 ; run ; *++++++++++++++++++++++++++; data ; set sashelp.cars(drop=origin) ; if ype='' then ype=' ' ; format type $15. ; informat type $15. ; label ype = '' ; run ; data qc(label="nbalanced Quotes:'(") ; set sashelp.cars(drop=msrp) ; run ; *++++++++++++++++++++++++++; data prep(where=(order>0)); infile datalines dsd; input order rodib :$8. Qib :$8. rodset :$30. Qcset :$30. ; datalines; 1,work,sashelp,class1,class 1,sashelp,work,class,class1 1,sashelp,sashelp,cars,cars 1,work,work,otxist,otxist 1,work,work,prod,qc ; run ; %include "mprashrd.sas" ; able 3: here are five pairs of comparisons done and they are as follows: unrder roduction Q escription of the comparison data set () data set (Q) 1. work.class1 sashelp.class WK.1 () was intentionally created after H. (Q). When Height=69, Height was incremented by 0.000000001. ariable position for has been changed. 2. sashelp.class work.class1 ole of () and (Q) reversed. 3. sashelp.cars sashelp.cars () and (Q) are same. lean compare is expected. 4. work.otxist work.otxist omparing non existing data sets. 5. work.prod work.qc In addition to several standard compare differences, = is different from =. ote: unrder will be referred frequently in the examples. age 4 of 8

In all the examples listed below note that the macro parameters _set, _oot, _I, _ile, _itle1, _itle2 are constant. ection How to run the macro? provides the default values for these macro parameters. efer to able 3, whenever unrder is used to identify the pairs of data sets compared. Q data set name is not printed in the report. xample 1: et us say macro is invoked as follows: %mprashbrd(_mpres=%str(), _riter =%str(), _arum=es ) ; ince _mpres=%str(), comparison will be done as is. ince _riter=%str(), default riterion value of 0.00001 will be used. unrder=1:, _ and are identified as compare differences. unrder=2: and are identified as compare differences (and not _). unrder=3: omparison is clean. unrder=4: onexistent data set is flagged as. unrder=5: s expected has various compare difference when compared to Q. ee utput 1 for the report displayed by using this macro. xample 2: %mprashbrd(_ile =%str(unompare2), _mpres=%str(), _riter =%str(0.000000001), _arum=es ) ; ompare results using unrder has the identifier: ince _riter =0.000000001, unrder=1 and 2 does not have difference any more. s expected, all others are similar to xample 1. ee utput 2 for the report displayed by using this macro. xample 3: %mprashbrd(_ile =%str(unompare3), _mpres=%str( ), _riter =%str(), _arum=es ) ; ince _mpres=%str( ), spacing is ignored during comparison and hence unrder=5 no longer has difference. ll others are similar to xample 1. ee utput 3 for the report displayed by using this macro. age 5 of 8

xample 4: %mprashbrd(_ile =%str(unompare4), _mpres=%str( ), _riter =%str(0.000000001), _arum=o ) ; In this example, spacing differences are ignored for character variables and absolute relative difference of 0.000000001 is used by and variable ordering is ignored. his example is also compared against xample 1. unrder=1: mall increment made to Height and ordering (for variable ) is ignored. unrder=2: lean compare. ame reasons as unrder=1 and also Q date is > date. unrder=3: s always compare matched 100%. unrder=4: emains the same as xample 1. unrder=5: difference of ype= versus ype= is no longer present since spacing is ignored. ee utput 4 for the report displayed by using this macro. onclusion his utility is a handy tool for Q programmer to efficiently find out quickly and easily if there are any compare differences. his can be used by the lead Q programmer to ensure all output comparisons matched electronically. his report can be used as an audit trail to indicate all outputs passed independent electronic validation before the deliverable is released. ast but not the least, this tool has been a huge timesaver whenever all study outputs have to re-run due to underlying datasets getting changed during the deliverable or when several re-runs are planned or expected during the course of the project. eferences 1. ase () 9.2 rocedures uide, esults: rocedure http://support.sas.com/documentation/cdl/en/proc/61895/h/default/viewer.htm#a000146743.htm 2. aria. eiss (2003)., ailoring utput. http://analytics.ncsu.edu/sesug/2003/01-eiss.pdf 3. Kavitha adduri (2012). harma 2012 - aper 24, et s compare two libraries! http://www.pharmasug.org/proceedings/2012//harma-2012-24.pdf 4. Imelda (2002). 2002 - eordering ariables in a ata et http://analytics.ncsu.edu/sesug/2002/12.pdf cknowledgments uthor would like to thank the employer Quintiles for the approval to present this paper at this forum and thanks to 2013 board members Kenny issett and izette lonzo for accepting the abstract and paper. pecial thanks to ommy etzlaff and rema anjit for their valuable feedback on this paper. ontact information our comments and questions are valued and encouraged. ontact the author at: Krish Krishnan tatistical rogramming cientist Quintiles Krish.Krishnan@quintiles.com Krish.Krishnan@usa.com and all other Institute Inc. product or service names are registered trademarks or trademarks of Institute Inc. in the and other countries. indicates registration. ther brand and product names are registered trademarks or trademarks of their respective companies. age 6 of 8

age 7 of 8 utput 1: ompare is done 'as is' when comparing character variables. acro variable =es. efault II (0.00001) is used in ' ' bs utome u n r d e r ibref set ame I H _ 1 ail 1 work class1 X X X 2 ail 2 sashelp class X X 3 ail 4 work otxist X 4 ail 5 work prod X X X X X X X X 5 ass 3 sashelp cars utput 2: ompare is done 'as is' when comparing character variables. acro variable =es. II=0.000000001 option is used in ' ' bs utome u n r d e r ibref set ame I H _ 1 ail 1 work class1 X X 2 ail 2 sashelp class X 3 ail 4 work otxist X 4 ail 5 work prod X X X X X X X X 5 ass 3 sashelp cars

age 8 of 8 utput 3: ompare will ignore characters within left and right arrow -> <- when comparing. acro variable =es. efault II (0.00001) is used in ' ' bs utome u n r d e r ibref set ame I H _ 1 ail 1 work class1 X X X 2 ail 2 sashelp class X X 3 ail 4 work otxist X 4 ail 5 work prod X X X X X X X 5 ass 3 sashelp cars utput 4: ompare will ignore characters within left and right arrow -> <- when comparing. acro variable =o. II=0.000000001 option is used in ' ' bs utome u n r d e r ibref set ame I H _ 1 ail 1 work class1 X 2 ail 4 work otxist X 3 ail 5 work prod X X X X X X 4 ass 2 sashelp class 5 ass 3 sashelp cars