DATA CLEANING: LONGITUDINAL STUDY CROSS-VISIT CHECKS




Paper 1314-2014

Lauren Parlett, PhD
Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD

ABSTRACT

Cross-visit checks are a vital part of data cleaning for longitudinal studies. By their nature, longitudinal studies collect the same information repeatedly, and some of these variables are expected to remain static, disappear, increase, or decrease over time. This paper reviews naïve and better approaches to handling one-variable and two-variable consistency checks. For a single-variable check, the better approach features the ALLCOMB function, introduced in SAS 9.2. For a two-variable check, the better approach uses BY processing variables to flag inconsistencies. This paper will provide you the tools to enhance your longitudinal data cleaning process.

INTRODUCTION

One of my first tasks as a data analyst was to "clean" the data collected in a longitudinal study that began long before I was hired. Researchers collected data on the subjects on a yearly basis, using the same forms for each visit. Not only did the collected data have to make sense across forms, but the data also had to make sense over time. For example, if a subject had indicated at his first visit that his mother was dead, the data should not show her alive at a later visit. That sort of inconsistency had to be flagged for further investigation. My predecessor had written adequate code for the cross-visit checks; however, I could not shake the feeling that the code could be more efficient and more dynamic. In this paper I will provide examples of cross-visit checks for one-variable and two-variable scenarios. For each, I will go through my initial naïve approach and then come at the problem from a more sophisticated angle. Since the ALLCOMB function is an important part of these methods, SAS 9.2 or higher is required.
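The mother's-vital-status example above is a one-way consistency rule: once a value reaches the absorbing state (dead), it must never move back. As a language-neutral sketch, here is that rule in Python (the function name and list-of-visits layout are illustrative assumptions, not part of the study's SAS code):

```python
# Illustrative sketch only: a one-way consistency rule such as "a mother
# reported dead (0) at an earlier visit must not be reported alive (1) later".
# Visits are a list of 0/1 values; None represents missing/unknown.

def flag_inconsistent(visits):
    """Return True if any earlier visit reports 0 while a later visit
    reports 1, skipping missing values."""
    seen_zero = False
    for value in visits:
        if value is None:
            continue          # missing/unknown cannot contradict anything
        if value == 0:
            seen_zero = True
        elif seen_zero:       # a 1 after an earlier 0: inconsistency
            return True
    return False

print(flag_inconsistent([1, 0, 1, 1]))    # 0 at visit 2, 1 at visit 3 -> True
print(flag_inconsistent([1, 0, None, 1])) # a missing visit does not hide it -> True
print(flag_inconsistent([1, 1, 0, 0]))    # consistent decline -> False
```

The same shape of rule covers the vision check later in the paper, where normal vision (1) may become poor (0) but not the reverse.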
SETTING UP YOUR DATA

The data set used in this paper is completely artificial in order to illustrate concepts and possible pitfalls. The pretend study concerns the association between vision and stroke over time. Presented here are 5 individuals (ID) who each had 4 visits (Visit). At each visit, four pieces of information were collected: whether the person accompanying the subject -- the informant -- was new (NewInfrmnt), the date of birth of the informant (InfrmntDOB), whether the subject's vision was normal (NormVision), and whether a stroke had occurred in the past (StrkHist). Missing or unknown values are denoted by the value 99. Otherwise, variable coding is standard.

DATA mydata;
   INPUT ID Visit NewInfrmnt InfrmntDOB :MMDDYY10. NormVision StrkHist;
DATALINES;
10 1 1 01/13/1970 1 0
10 2 0 01/13/1970 0 0
10 3 0 02/22/1968 1 1
10 4 0 02/22/1968 1 1
11 1 1 03/04/1977 1 0
11 2 0 03/04/1977 0 0
11 3 0 03/04/1977 99 0
11 4 0 03/04/1977 1 1
12 1 1 04/14/1955 1 0
12 2 0 04/14/1955 1 0
12 3 0 04/14/1955 99 0
12 4 0 04/14/1955 1 0
13 1 1 05/21/1966 1 1
13 2 1 06/16/1980 1 99
13 3 0 06/16/1980 1 0

13 4 . 06/16/1980 1 1
14 1 1 07/08/1972 1 0
14 2 0 07/09/1972 1 0
14 3 0 07/09/1972 1 0
14 4 0 07/09/1972 1 99
;

The first step in setting up the data is to convert the missing code to a missing value. Then transpose the data so that there is one observation per subject, with each row containing all of the variable values across the visits. Only one variable will be transposed at a time. We will start with vision.

DATA mydata (DROP=i);
   SET mydata;
   ARRAY vars {*} _NUMERIC_;
   DO i=1 TO HBOUND(vars);
      IF vars{i} = 99 THEN vars{i} = .;
   END;
RUN;

PROC SORT data=mydata;
   BY ID;
RUN;

PROC TRANSPOSE data=mydata PREFIX=v OUT=vision_trans(DROP=_NAME_);
   BY ID;         *Each subject gets a row;
   ID Visit;      *Names each new variable according to visit number;
   VAR NormVision;
RUN;

After running that code, the vision_trans dataset will look like this:

ID  v1  v2  v3  v4
10   1   0   1   1
11   1   0   .   1
12   1   1   .   1
13   1   1   1   1
14   1   1   1   1

Output 1. Study Data after Transposing by Visit Number

Each subject is on one row, and the vision values for each visit are in separate columns. Now we are ready to begin checking across visits.

ONE-VARIABLE CHECK

NAÏVE APPROACH

My first approach to the problem was, admittedly, simplistic, though it did produce the expected results. Someone could present with normal vision and subsequently have poor vision, but not vice versa. My approach was to type out logic code (IF/THEN statements) to compare each visit. Ultimately, the code should identify IDs 10 and 11 for further review. My code looked like this:

DATA vision_check;
   SET vision_trans;
   *Compare against Visit 1;
   IF v1=0 AND v2=1 THEN flag=1;
   IF v1=0 AND v3=1 THEN flag=1;

   IF v1=0 AND v4=1 THEN flag=1;
   *Compare against Visit 2;
   IF v2=0 AND v3=1 THEN flag=1;
   IF v2=0 AND v4=1 THEN flag=1;
   *Compare against Visit 3;
   IF v3=0 AND v4=1 THEN flag=1;
RUN;

The output dataset vision_check:

ID  v1  v2  v3  v4  flag
10   1   0   1   1     1
11   1   0   .   1     1

Output 2. One-Variable Naive Approach: Dataset after Flagging Discrepancies Using IF/THEN Statements

Nothing wrong there! But things could get a little hairy if the study lasted another two years. How many lines of logic code would that be for this one variable? You can calculate it yourself using the formula for permutations, where n is the total number of visits and r is the number of visits rearranged at one time:

   P(n, r) = n! / (n - r)!

For example, four visits with two visits evaluated at a time would require 12 lines of IF/THEN code:

   P(4, 2) = 4! / (4 - 2)! = (4 x 3 x 2 x 1) / (2 x 1) = 12

Six visits with two comparisons would require 30 lines. Seven visits with two comparisons? 42 lines. That is a lot of hand coding, and doing it for each variable would take a very long time. There should be an easier way. That is when I tried an array and a DO loop to cut down on the number of lines.

DATA vision_checkv2 (DROP=i);
   SET vision_trans;
   ARRAY visit {*} v:;
   DO i=2 TO HBOUND(visit);
      IF visit{i-1} > visit{i} THEN flag=1;
   END;
RUN;

That was a lot faster than coding 6 or 30 logic lines. But there was something wrong with the vision_checkv2 dataset.

ID  v1  v2  v3  v4  flag
10   1   0   1   1     1
11   1   0   .   1     1
12   1   1   .   1     1

Output 3. One-Variable Naive Approach: Dataset after Flagging Discrepancies Using Array and Do Loop

The code caught a missing value for subject 12. In SAS, a missing value (.) is considered less than 1; therefore, that observation was flagged as having a discrepancy where none existed. I tried again with another condition on the comparison: if either value is missing, the code skips to the next iteration of the DO loop.
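That adjacent-pair comparison with a skip-on-missing condition can also be sketched outside SAS. In this illustrative Python version (the function name is hypothetical; visit values are taken from Output 1), missing values are represented as None:

```python
# Compare each visit only with the immediately preceding one,
# skipping any pair that contains a missing value (None).
def adjacent_flag(visits):
    for prev, curr in zip(visits, visits[1:]):
        if prev is None or curr is None:
            continue          # skip pairs with a missing value
        if curr > prev:       # vision may worsen (1 -> 0) but not improve (0 -> 1)
            return True
    return False

print(adjacent_flag([1, 0, 1, 1]))     # ID 10 -> True
print(adjacent_flag([1, 0, None, 1]))  # ID 11 -> False
```

As the ID 11 row hints, skipping pairs that contain a missing value can also skip real discrepancies that span the gap.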

DATA vision_checkv3 (DROP=i);
   SET vision_trans;
   ARRAY visit {*} v:;
   DO i=2 TO HBOUND(visit);
      IF visit{i} = . OR visit{i-1} = . THEN CONTINUE;
      IF visit{i} > visit{i-1} THEN flag=1;
   END;
RUN;

And what do we get for vision_checkv3?

ID  v1  v2  v3  v4  flag
10   1   0   1   1     1

Output 4. One-Variable Naive Approach: Dataset after Flagging Discrepancies Using Array and Do Loop, Controlling for Missing Values

That's not right either! Subject 11 is now gone because the code could not compare the second and fourth visits, where there was a discrepancy.

BETTER APPROACH

The answer involves a rarely used SAS function, ALLCOMB, available in SAS 9.2 and later. The purpose of the ALLCOMB function is to generate combinations of variables in a specific order. Order matters in our checks (we must compare an earlier visit against a later visit, in that order), which is why we use ALLCOMB rather than ALLPERM. This seems at odds with our first naïve approach, where permutations dictated the number of lines, but that was because we could interchange the order of the logic statements without changing the outcome. That is not the case with this one-variable approach. ALLCOMB works on elements of an array. The function's parameters are the count (or row), the number of chosen elements, and the source of the elements. The result of the ALLCOMB function is a matrix of all possible combinations in a sorted order; the sorting algorithm outputs the same matrix every time it is run. The process for the code is summarized below.

1. Start a DATA step with the transposed vision dataset.
2. Put the visits into an explicit array (visit).
3. Calculate the number of visits in the array (n).
4. Declare the number of visits to evaluate at one time (k).
5. Calculate the number of times the DO loop must run through the rows of the ALLCOMB matrix (ncomb).
6. Set up a DO loop that runs ncomb times to compare k visits at a time.
7. Call the ALLCOMB function to interchange two visits and select a certain row (i).
8.
Write a logic statement comparing the earlier and later visits, accounting for missing values.

Here is the code for the one-variable approach:

DATA vision_onevar (DROP=n k ncomb i);
   SET vision_trans;
   ARRAY visit {*} v:;
   n = dim(visit);
   k = 2;
   ncomb = comb(n,k);
   DO i=1 TO ncomb;
      CALL allcomb(i, k, of visit[*]);
      IF visit{1} < visit{2} AND visit{1} ^= . THEN flag=1;
   END;
RUN;

The resulting vision_onevar dataset:

ID  v1  v2  v3  v4  flag
10   1   1   0   1     1
11   1   .   0   1     1

Output 5. One-Variable Better Approach: Dataset after Flagging Discrepancies Using ALLCOMB Function

It flagged the proper IDs, just like the first naïve way, but this code can accommodate many additional visits without adding lines to the data step. Note that the values are no longer aligned with the right visits; they are in reverse order. That can be fixed by copying the visit values to temporary variables before the DO loop and restoring them afterward. The code in the two-variable better approach can serve as your template for this. Finally, the less-than operator can be changed to greater-than or not-equal-to, depending on your particular coding scheme and logic needs.

TWO-VARIABLE CHECK

What if your consistency check depends on two variables rather than one? That is, the value of one variable is allowed to change only if another variable for that same visit also changes. That is where the two-variable approach comes in.

NAÏVE APPROACH

Our example involves the variables NewInfrmnt and InfrmntDOB. The date of birth of the informant should not change unless the informant is new (NewInfrmnt=1). According to this logic, subjects 10 and 14 should be flagged for further review. My first instinct was to expand the better code from the one-variable check, where the ALLCOMB function is used. First, I transposed the two variables, adding the prefixes "new" and "dob" to their new columns. Next, I merged the resulting datasets. Before going into the comparison, I stored the original values in temporary variables, since the ALLCOMB function ultimately reverses them. Then, two ALLCOMB function calls are employed rather than just one.
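Conceptually, comparing every pair of visits this way amounts to walking all (earlier, later) combinations. A small Python sketch of that pairwise idea (illustrative only -- the list layout and function name are assumptions, with values copied from the study data) makes the behavior easy to trace:

```python
from itertools import combinations

# For every (earlier, later) visit pair, flag when the informant DOB
# differs but the later visit did not declare a new informant.
def pairwise_flag(new, dob):
    for i, j in combinations(range(len(dob)), 2):  # i < j: earlier vs. later
        if None in (dob[i], dob[j], new[j]):
            continue                               # skip pairs with missing values
        if dob[i] != dob[j] and new[j] == 0:
            return True
    return False

# Subject 13: a new informant at visit 2, then the same informant afterward.
print(pairwise_flag([1, 1, 0, None],
                    ["05/21/66", "06/16/80", "06/16/80", "06/16/80"]))  # True
```

Tracing the pairs shows how subject 13 ends up flagged: the visit 1 vs. visit 3 pair has differing DOBs and new3=0, even though the informant change was properly declared at visit 2.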
The code:

DATA inft_check (DROP=n k ncomb i j m temp:);
   MERGE inft_trans inft_trans2;
   BY ID;
   ARRAY new {*} new:;
   ARRAY dob {*} dob:;
   ARRAY temp1 {4};
   ARRAY temp2 {4};
   DO i=1 TO HBOUND(new);
      temp1{i} = new{i};
      temp2{i} = dob{i};
   END;
   n = dim(new);
   k = 2;
   ncomb = comb(n,k);
   DO j=1 TO ncomb;
      CALL allcomb(j, k, of new[*]);
      CALL allcomb(j, k, of dob[*]);
      IF dob{1} ^= dob{2} AND new{2}=0 AND dob{1} ^= . AND dob{2} ^= . THEN flag=1;
   END;

   DO m=1 TO HBOUND(new);
      new{m} = temp1{m};
      dob{m} = temp2{m};
   END;
   FORMAT dob1-dob4 MMDDYY9.;
RUN;

The resulting inft_check dataset:

ID  new1  new2  new3  new4  dob1      dob2      dob3      dob4      flag
10     1     0     0     0  01/13/70  01/13/70  02/22/68  02/22/68     1
13     1     1     0     .  05/21/66  06/16/80  06/16/80  06/16/80     1
14     1     0     0     0  07/08/72  07/09/72  07/09/72  07/09/72     1

Output 6. Two-Variable Naive Approach: Dataset after Flagging Discrepancies Using Two ALLCOMB Functions

That's not right. What went wrong? It turns out that the code compared visit 1 and visit 3 for subject 13 and found that a new informant was not specified even though the dates of birth were different. So, it flagged subject 13. However, the new informant WAS specified at visit 2, so subsequent visits had no need to indicate a new informant. The two-ALLCOMB approach was not the way to go.

BETTER APPROACH

A more elegant solution requires no function calls. Rather than using the transposed dataset, the two-variable approach makes use of the long dataset. Keep that in mind if your data is structured as wide. The code:

PROC SORT data=mydata;
   BY ID InfrmntDOB;
RUN;

DATA inft_twovar;
   SET mydata (KEEP=ID Visit InfrmntDOB NewInfrmnt);
   BY ID InfrmntDOB;
   IF first.InfrmntDOB AND NewInfrmnt=0 THEN flag=1;
   FORMAT InfrmntDOB MMDDYY9.;
RUN;

This approach relies on a BY processing variable (first.InfrmntDOB). When a dataset is SET with one or more BY variables, SAS creates first. and last. variables marking the beginning and end of each BY group. In this better approach, if an observation is the first of a group of dates of birth for a subject, first.InfrmntDOB equals 1. If, in that same observation, no new informant is declared (NewInfrmnt=0), then the observation is flagged as having a discrepancy. The resulting inft_twovar dataset:

ID  Visit  NewInfrmnt  InfrmntDOB  flag
10      3           0    02/22/68     1
14      2           0    07/09/72     1

Output 7. Two-Variable Better Approach: Dataset after Flagging Discrepancies Using BY Variables

The dataset has flagged the proper IDs.
It shows you the visit where the error occurred. Keep in mind that this method

will not flag visits where there is a new date of birth but the new-informant variable is missing. Those should be flagged and hand-checked using other code.

CONCLUSION

Checking for cross-visit consistency is an important part of data cleaning. By doing it, you can catch typos, inconsistent assessments, and more. Sometimes the naïve ways of coding work, such as the first naïve approach in the one-variable example. Unfortunately, those solutions are not always dynamic and can balloon when additional follow-up visits are added to the study. Since longitudinal data collection (or any data collection!) is bound to have missing or unknown values, the newly offered -- though rarely talked about -- ALLCOMB function is essential for checking more than just one visit against the next. When examining two variables, using the BY statement in a DATA step and the resulting BY processing variables (FIRST. and LAST.) is a good way to flag discrepancies when one value relies on another. The examples provided in this paper can be scaled up and supplemented according to your specific data cleaning needs.

ACKNOWLEDGEMENTS

This work was completed while working on the Johns Hopkins University BIOCARD study. Funding for the BIOCARD study is provided by the National Institute on Aging, NIH, grant U01AG033655. Special thanks to the SAS Support Community for providing ideas about the two-variable cross-visit check. Also, thank you to Joseph Guido for reviewing this paper.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Lauren Parlett, PhD
Johns Hopkins University Bloomberg School of Public Health
Department of Epidemiology
615 N. Wolfe St., Rm W6024
Baltimore, MD 21205
(410) 502-5448 (phone)
(410) 502-4623 (fax)

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.