Innovative Techniques and Tools to Detect Data Quality Problems


Paper DM05

Hong Qi and Allan Glaser
Merck & Co., Inc., Upper Gwynnedd, PA

ABSTRACT

High quality data are essential for accurate and meaningful analysis and reporting, and it is widely accepted that high quality data lead directly to higher programming productivity and faster throughput. Historically, data have been examined for quality with PROC FREQ, PROC UNIVARIATE, and similar tools. Those approaches are very useful, but additional work is often warranted. This paper presents innovative - yet simple - approaches to checking different aspects of data quality, including both structural and content issues. These tools can be implemented with a minimum of up-front development and programming time, and are easy to use.

INTRODUCTION

Good quality data are essential not only for accurate and meaningful analysis and reporting, but also for high programming productivity and faster throughput. Before using data for further work, a data user needs an overall appraisal of data quality, including any potential issues in data format and values. As a first step towards this goal, some fundamental aspects of the data should always be checked, such as the distribution of data values, formats, duplicate records, outliers, etc. Traditionally, data are examined with PROC FREQ, PROC UNIVARIATE, and similar tools. Those approaches are very helpful, but additional programming work is often required. This paper introduces five innovative yet simple approaches to checking data quality: detecting duplicate records, using a scatterplot matrix to detect outliers, examining the distribution of key variable values, presenting record populations at the dataset level, and determining whether numeric variables are real- or integer-valued. The following sections summarize each approach with an introduction, typical usage and sample outputs, and programming considerations.
DETECTION OF DUPLICATE RECORDS

Duplicate records may occur in a dataset for a variety of reasons, such as repeated data entry, sample re-testing, etc. It is common practice to identify duplicate records and remove a subset of them from the dataset before data processing. PROC SORT NODUPKEY can be used for this purpose; however, it does not always keep the specific record(s) that are needed for analysis. Therefore, a user-friendly approach to identifying duplicate records and selecting a specific set of records for summary and analysis purposes is proposed.

TYPICAL USAGE AND SAMPLE OUTPUT

There are several scenarios in which duplicate records may arise, one of which is laboratory-generated test results. For example, a subject is supposed to have one result per time point for a laboratory test, but may get multiple results for the same test at the same time point due to repeated testing. This issue needs to be identified up front so that a rationale can be built for deciding which test result should be included in the analysis. In this scenario, our program, %dp0rcd, generates a message in the SAS log file (Figure 1) and a listing of all the duplicate records in the SAS output file (Figure 2) to help the user understand where the duplicate records occur.

* * datadir.lab contains duplicate records at the level of AN STUDYWK LABCODE * *

Figure 1. Message in the SAS log file generated from %dp0rcd.

datadir.lab: duplicate records at the level of AN STUDYWK LABCODE

[listing of the duplicate records, one row per record, showing STUDYID, AN,
STUDYWK, LABCODE, EXAMVALU, EXAMUNIT, VISIT, and SCHVTFL: two WBC count
results (units 10[3]/microL) for the same subject in the same study week,
one at a scheduled visit (SCHVTFL=Y) and one at an unscheduled visit
(SCHVTFL=N)]

Figure 2. Listing in the SAS output file generated from %dp0rcd.

The output listing shows that two results for WBC count were recorded for the same subject in the same study week. However, one result was recorded at an unscheduled visit. The user needs to make a judgment, according to the study design, as to which result should be used in the subsequent summary and analysis. Note that PROC SORT NODUPKEY might remove one of the two records with no justification for keeping the other.

A macro approach is adopted to identify duplicate records in a dataset. As usual, some macro parameters must be specified, such as the name of a dataset and a list of variables whose values should not be repeated. Our program, %dp0rcd, detects duplicate records in a few simple steps. First, the program tests whether there are duplicate records in the dataset by comparing the total number of records (a) and the number of unique records (b) at the specified variable level. If a > b, the program finds all the duplicate records, sends a message to the log file, and prints the duplicate records with all the variables in the dataset. For example, the following code identifies the laboratory records that repeat by variables AN (unique subject identifier), STUDYWK (study week), and LABCODE (laboratory test code) in the data set DATADIR.LABS:

%dp0rcd(inds = datadir.labs, varlist = AN STUDYWK LABCODE);

The macro is easy to use, and can be applied to any data scenario.
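A minimal sketch of one way to implement this detection is shown below. It uses BY-group processing rather than the record-count comparison described above, and the macro and work-dataset names are illustrative, not the production %dp0rcd code:

```sas
%macro dup_check(inds=, varlist=);
  %local lastvar;
  %let lastvar = %scan(&varlist, -1);   /* last key variable, for FIRST./LAST. */

  proc sort data=&inds out=_srt;
    by &varlist;
  run;

  /* A record is a duplicate if it is not both the first and the last
     record of its key group */
  data _dups;
    set _srt;
    by &varlist;
    if not (first.&lastvar and last.&lastvar);
  run;

  proc print data=_dups;
    title "&inds: duplicate records at the level of &varlist";
  run;
%mend dup_check;
```

A call such as %dup_check(inds=datadir.labs, varlist=AN STUDYWK LABCODE) would then list every record whose key values repeat, leaving the decision of which record to keep to the user.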
SCATTERPLOT MATRIX TO DETECT OUTLIERS

Graphic data representations are very helpful in detecting outliers or data anomalies, and PROC UNIVARIATE, PROC TIMEPLOT, and PROC PLOT are frequently used. PROC PLOT is especially useful, because it helps to place data in proper context with other data. Real-world data may include many variables and many relationships and dependencies; deciding which pairs of variables to plot can require a substantial amount of up-front investigation, and the process can easily overlook important relationships. These concerns can be easily addressed with straightforward macro code and some of PROC PLOT's often overlooked capabilities. A group of scatterplots, commonly known as a matrix or constellation, allows the user to very easily look at data relationships and spot potential anomalies.

TYPICAL USAGE AND SAMPLE OUTPUT

Demographic and patient characteristic data are commonly used in the pharmaceutical industry, and the quality of those data is important. Suppose that we are examining data from an adult population, and have both HEIGHT and HEIGHT_UNIT_OF_MEASURE. A quick glance at the data may show legitimate values and a reasonable mix of inches and centimeters. Looking at the values of each of these variables separately may not reveal any problem. Plotting the data, however, would easily show a potentially incorrect unit of measure; in Figure 3, for example, there appear to be two heights in inches that are recorded as centimeters (indicated by the "B" around 60 cm), and further investigation is warranted. Keep in mind that the scatterplot matrix contains a similar plot for each pair of variables.

[plot of HEIGHT (vertical axis, roughly 50 to 200) against
HEIGHT_UNIT_OF_MEASURE (horizontal axis, values "cm" and "in"); most "cm"
values cluster between 150 and 200, but a "B" near 60 cm marks two
suspicious records]

Figure 3. Sample plot from scatterplot constellation.

The programming approach must be flexible and allow plotting only numeric data, only character data, or both. Options are also needed to allow inclusion or exclusion of specific variables, handling of missing values, and changes to the plot size. Using a macro approach, parameters capture the user's preferences. The macro then executes PROC CONTENTS against the target dataset. The resulting metadata is filtered based upon the parameters, and PROC SQL is then used to assign the desired list of variable names to a macro variable.
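That metadata step can be sketched as follows; the dataset and macro-variable names are illustrative:

```sas
/* Extract variable metadata from the target dataset */
proc contents data=&input_data out=_meta(keep=name type) noprint;
run;

/* Build a space-separated list of (here) the numeric variable names;
   in OUT= datasets from PROC CONTENTS, TYPE=1 is numeric, TYPE=2 is character */
proc sql noprint;
  select name into :variable_list separated by ' '
  from _meta
  where type = 1;
quit;
```

The WHERE clause is the point at which the user's include/exclude preferences would be applied.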
If missing values are to be included, a data step is used to assign printable characters to missing values. This typically means that missing character data are changed to a short string of asterisks and missing numeric data are changed to zeros, but the substitution values are controlled by macro parameters and can be easily changed. Lastly, PROC PLOT is invoked with options and variables determined by the macro parameters and the above preprocessing. By default, if a list of variables is specified, then each pairwise combination of variables will create a separate plot. Since a large number of plots may be produced, and since the user will generally be looking only for gross anomalies, multiple plots can usually be placed on a single page. This illustrates the general coding approach:

proc plot hpercent=&hpercent vpercent=&vpercent nolegend data=&input_data ;
  plot (&variable_list) ;

We have found that hpercent and vpercent values of 33 and 50, respectively, are generally good, and will put 6 plots on a page. Larger plots are better if the data volume is large, if data values are not clustered, or if character variables have many different values or long values. The execution time for this tool is essentially a linear function of the volume of data. If the user decides to include missing values, the data must also be processed through a data step, and that roughly doubles the time.

EXAMINING THE DISTRIBUTION OF KEY VARIABLE VALUES

Data can be collected or generated under different guidelines. Data values may vary in the same dataset or domain across studies. It is beneficial to understand the data values before processing the data, for the following reasons. First, data values contribute to data quality. Any unexpected data values can be documented in feedback to data collection, data entry, data mapping, data generation programs, etc., for data quality improvement. Second, it is often necessary to generate derived variables with reference to the data values in the input dataset - for example, to derive a study-week variable from the value of the visit name, which can be "day-30" or "day-30 screen".
Knowing what to expect in the data values definitely makes coding more effective. Moreover, key variable values are often reviewed during the validation of an analysis data set after data processing. PROC FREQ and PROC UNIVARIATE can be used for this purpose; however, the variable types need to be understood before using these procedures, to prevent errors or lengthy SAS output. Therefore, a user-friendly macro approach to examining the distribution of key variable values is proposed.

TYPICAL USAGE AND SAMPLE OUTPUT

This macro can be used to examine the distribution of key variables for any dataset, e.g., a raw dataset or a derived dataset. For instance, in clinical trial studies, we usually need to provide tables describing subject characteristics of the study population. To design such a table and assign appropriate formats for the population categories to be summarized, we first need to know the distribution of age, race, and gender in the demographic dataset. In this scenario, our program, %dist0var, generates the following in the SAS output file: (1) the frequency distribution for the character variables RACE and GENDER, from the FREQ procedure; and (2) output from the UNIVARIATE procedure for the numeric variable AGE. Understanding the variable values for race (Figure 4), the user can design the format in which race values are to be displayed in the table. After examining the data range for AGE, the user can choose to produce the frequency distribution of AGE or of age groups with an additional macro call specifying one more macro parameter.

[frequency table for RACE, with columns Frequency, Percent, Cumulative
Frequency, and Cumulative Percent, and rows for the values Asian, Hispa,
black, multi, and white; the counts were lost in transcription]

Figure 4. Distribution of Race generated from %dist0var.

The macro is designed to let the user examine the distribution of key variables with a simple macro call. Parameters to be specified include the name of a data set and a list of key variables. The FREQ and UNIVARIATE procedures are built into the macro. Because the FREQ procedure can generate lengthy output for numeric variables, the macro by default examines character variables with the FREQ procedure and numeric variables with the UNIVARIATE procedure. It is then the user's decision whether to examine the numeric variables with the FREQ procedure after reviewing the output from the UNIVARIATE procedure. Macro %dist0var is composed of a few simple steps. First, it separates the key variables into character and numeric variables. Second, the program runs the FREQ or UNIVARIATE procedure according to the variable type. For numeric variables, it allows the user to choose the procedure used to examine the data distribution. For example, the following code generates the distribution of data values for variables AGE, RACE, and GENDER in the data set DATARAW.S_DEMOS:

%dist0var(inds = dataraw.s_demos, varlist = AGE RACE GENDER);

Upon reviewing the output, if the user decides to examine AGE with the FREQ procedure, another macro call can be specified as:

%dist0var(inds = dataraw.s_demos, varlist = AGE, numvar = Y);

The macro can be specified without prior knowledge of the variable types, and can be applied to any data scenario.

DATASET-LEVEL PRESENTATION OF RECORD POPULATIONS

Real-world datasets typically include some missing values. Some of the variables with missing values may stand alone, while others may be closely related to other variables.
Finding problems with the distribution of missing values is often difficult, and looking at individual records may be tedious, time-consuming, and incomplete. It is helpful to look at the overall distribution of missing values, and this can be done with a concise listing of a dataset's variables that includes information about the data density. This listing can be easily produced with a small amount of SAS code, and simple graphics can be added to facilitate interpretation.

TYPICAL USAGE AND SAMPLE OUTPUT

Demographic and patient characteristic data can be used again to illustrate the issues. Some specifics that may be characterized as poor data quality include measurements without a corresponding unit of measure, missing key information, an unexpected number of missing values for certain variables, etc. In the contrived sample output in Figure 5, the DEMOG dataset is noted as having 10 variables and 120 records. The variables are listed alphabetically, and the maximum actual length and defined length are given for character variables. The number of records that contain non-missing values is given, and a simple graphic illustrates it. The position of the asterisk indicates what fraction of the records have non-missing values. Lastly, the variable labels are shown.

DATASET: DEMOG    VARIABLES: 10    RECORDS: 120

             LENGTH     # RECORDS
VARIABLE    ACT/ DEF    POPULATED    % RECORDS POPULATED
AGE                     ALL          |                    *|   AGE
AGE_YR                  NONE         |*                    |   AGE IN YEARS
BRTH_DT                 92           |               *     |   DATE OF BIRTH
CSNT_DT                 ALL          |                    *|   CONSENT DATE
GENDER       1/   6     92           |               *     |   GENDER
INV_NAME    43/ 200     ALL          |                    *|   INVESTIGATOR NAME
INV_NUM      6/   6     ALL          |                    *|   INVESTIGATOR NUMBER
PRJC_DSC    35/  40     ALL          |                    *|   PROJECT DESCRIPTION
PROTOCOL                ALL          |                    *|   PROTOCOL ID
PAT_NUM                 ALL          |                    *|   PATIENT NUMBER

Figure 5. Example of record populations.

In this example, a concise overview of the entire dataset is presented. Note that AGE is populated on all records, but the calculation of AGE_YR apparently failed, since that variable has only missing values. Looking at the counts, it appears that BRTH_DT and GENDER may be missing on the same records. Also note that INV_NAME is using only 43 of the permitted 200 bytes. The programming approach uses a simple macro with just one parameter that points to the target dataset. PROC CONTENTS initially extracts metadata, which is then loaded into macro variables. A single pass is made through the target dataset, in which missing values are counted and character variable lengths are checked.
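That single pass can be sketched as a data step; the dataset, variable, and macro-variable names here are illustrative:

```sas
/* One pass over the data: count missing values and track the longest
   actual value of a character variable */
data _null_;
  set demog end=eof;
  retain n_miss_gender 0 max_len_inv 0;
  if missing(gender) then n_miss_gender + 1;
  if not missing(inv_name) then
    max_len_inv = max(max_len_inv, length(inv_name));
  if eof then do;
    call symputx('n_miss_gender', n_miss_gender);   /* counts used later  */
    call symputx('max_len_inv',  max_len_inv);      /* in the report step */
  end;
run;
```

In the real macro, the variables to check would come from the PROC CONTENTS metadata rather than being hard-coded.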
A data step is used to create the output, although PROC PRINT or other approaches could easily have been used. The simple graphic allows a quick visual check of the prevalence of missing values. A character string is given the value of a series of blanks with a vertical bar in the first and last positions. For each variable, the percentage of populated records is determined and rounded to the nearest multiple of 5. That number is used to place an asterisk in the proper position in the string, using the unusual technique of putting the SUBSTR function on the left side of the equals sign:

substr(scale, position, 1) = '*' ;
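A data-step sketch of how such a ruler might be built follows; the string length and variable names are illustrative:

```sas
/* Build a 23-character ruler and mark the populated percentage on it */
data _report;
  set _counts;                     /* one row per variable (illustrative)  */
  length scale $23;
  scale = ' ';                     /* blank-padded to 23 characters        */
  substr(scale,  1, 1) = '|';      /* left end of the ruler                */
  substr(scale, 23, 1) = '|';      /* right end of the ruler               */
  pct = round(100 * n_populated / n_records, 5);  /* nearest multiple of 5 */
  substr(scale, 2 + pct/5, 1) = '*';  /* 0% -> column 2, 100% -> column 22 */
run;
```

Because pct is a multiple of 5, the expression 2 + pct/5 always yields an integer column between the two bars.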

The execution time for this tool is essentially a linear function of the volume of data, since the entire dataset must be processed once and every variable value must be checked.

DETERMINING IF NUMERIC VARIABLES ARE REAL- OR INTEGER-VALUED

Variables in a SAS dataset are either character or numeric. Some computer languages allow or require variables to be defined in more detail; e.g., a numeric variable may be restricted to having only integer values. The distinction between integer and real values can be very subtle, and ignoring it can lead to faulty logic and erroneous results. Even if data are intended to be integer, calculations and manipulations can lead to those data being assigned real values. If those values are very close to integers, comparisons, calculations, and reporting may not yield accurate results. In practice, it is often helpful to know two characteristics of numeric variables. First, does each variable contain only integer values, or does it contain real values? And, for real-valued variables, what is the maximum deviation from the nearest integer? Together, these characteristics give insight into the legitimacy of the data, and the authors have developed tools to help answer these questions. There are many related issues regarding real values, including moving data across platforms, comparing values, etc. This paper will not address those topics; the interested reader should look at the references.

TYPICAL USAGE AND SAMPLE OUTPUTS

Consider, for example, that we are working with the weights of adult patients. The data should originally be captured in whole pounds, and the presentation of the results depends on having integer values. It is possible that some values were captured in pounds/ounces or kilograms, and then converted to pounds. If that happens, the resulting values may not be integers.
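One way to check for this condition is to compare each value with its floor and ceiling; this sketch uses illustrative dataset and variable names, not the authors' production code:

```sas
/* Distance from each non-missing WEIGHT value to its nearest integer */
data _dev;
  set patients(keep=weight);
  if not missing(weight) then
    min_del = min(weight - floor(weight), ceil(weight) - weight);
run;

/* The maximum of those minimum distances: 0 means WEIGHT is integer-valued;
   a small nonzero maximum suggests unintended real values */
proc means data=_dev max;
  var min_del;
run;
```

The same pattern extends to every numeric variable once the variable list has been obtained from PROC CONTENTS.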
A concise report shows all of the numeric variables for the dataset, identifies which are real and which are integer, and, for real variables, shows the maximum deviation from the nearest integer. A real-valued variable's deviation should approach 0.5 as the number of values increases and as the distribution approaches uniformity across fractional values.

NUMERIC VARIABLES for DATASET and MAXIMUM DISTANCE to NEAREST INTEGER

AGE        real
AGE_YR     integer
BRTH_DT    integer
CSNT_DT    integer
HEIGHT     integer
WEIGHT     real

Figure 6. Example of numeric variable characteristics.

In the example shown in Figure 6, most of the numeric variables have only integer values, and AGE appears to have values that are legitimately non-integer. WEIGHT, however, has only values that are close to integers: all WEIGHT values lie within a very small distance of the nearest integer. This suggests that WEIGHT is intended to have integer values, but that calculations or manipulations may have had an undesirable side effect. Missing values, of course, are ignored. Our tool is a macro that requires only a pointer to the target dataset. PROC CONTENTS determines which variables are numeric. Each record in the target dataset is read, and the actual value of each numeric variable is compared with its floor and ceil values. The results are summarized via PROC MEANS and then reported. Recall that the FLOOR function returns the largest integer that does not exceed its argument, while the CEIL function returns the smallest integer that is not less than its argument. This illustrates the general approach:

data ... ;
  min_del = min(value - floor(value), ceil(value) - value) ;

proc means ... max ;
  var min_del ;

Using both minimum and maximum may be confusing. The MIN function in the data step ensures that we are looking at the closest integer; each value must be compared to both its floor and its ceil. For example, a value with fractional part 0.68 is only 0.32 from the nearest integer. PROC MEANS returns the largest of those minimums, since we want to know the greatest deviation from an integer. More thorough approaches can be envisioned. One would be to present a simple graphic illustrating the distribution of deviations for each variable. Another would capture the distribution between floor and ceil, not just the distance to the nearest integer. In practice, the authors have not found such thoroughness necessary.

CONCLUSION AND DISCUSSION

The approaches presented in this paper offer innovative ways to understand data and detect data quality issues. They can be implemented with a minimum of up-front development and programming time, and can be easily used in any data scenario. Therefore, they are recommended for routine checking of data quality.

ACKNOWLEDGEMENTS

The authors would like to acknowledge William Wilkins, a colleague who has done research in this area and has offered guidance and suggestions to others.

RECOMMENDED READING

Two SAS documents are excellent references for issues related to floating point representation of numeric data: "Numeric Precision 101" and "Dealing with Numeric Representation Error in SAS Applications", found at support.sas.com/techsup/technote/ts654.pdf and support.sas.com/techsup/technote/ts230.html, respectively.
CONTACT INFORMATION

The authors are members of Merck's Scientific Programming management team. Questions, comments, and suggestions are welcome, and should be directed to the authors at Hong_Qi@Merck.com and Allan_Glaser@Merck.com.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.


More information

The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon

The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon ABSTRACT Effective business development strategies often begin with market segmentation,

More information

IBM SPSS Direct Marketing 19

IBM SPSS Direct Marketing 19 IBM SPSS Direct Marketing 19 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This document contains proprietary information of SPSS

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

Scheduling Programming Activities and Johnson's Algorithm

Scheduling Programming Activities and Johnson's Algorithm Scheduling Programming Activities and Johnson's Algorithm Allan Glaser and Meenal Sinha Octagon Research Solutions, Inc. Abstract Scheduling is important. Much of our daily work requires us to juggle multiple

More information

Scatter Chart. Segmented Bar Chart. Overlay Chart

Scatter Chart. Segmented Bar Chart. Overlay Chart Data Visualization Using Java and VRML Lingxiao Li, Art Barnes, SAS Institute Inc., Cary, NC ABSTRACT Java and VRML (Virtual Reality Modeling Language) are tools with tremendous potential for creating

More information

Statistics Chapter 2

Statistics Chapter 2 Statistics Chapter 2 Frequency Tables A frequency table organizes quantitative data. partitions data into classes (intervals). shows how many data values are in each class. Test Score Number of Students

More information

REx: An Automated System for Extracting Clinical Trial Data from Oracle to SAS

REx: An Automated System for Extracting Clinical Trial Data from Oracle to SAS REx: An Automated System for Extracting Clinical Trial Data from Oracle to SAS Edward McCaney, Centocor Inc., Malvern, PA Gail Stoner, Centocor Inc., Malvern, PA Anthony Malinowski, Centocor Inc., Malvern,

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

Using Edit-Distance Functions to Identify Similar E-Mail Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC

Using Edit-Distance Functions to Identify Similar E-Mail Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC Paper 073-29 Using Edit-Distance Functions to Identify Similar E-Mail Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC ABSTRACT Version 9 of SAS software has added functions which can efficiently

More information

SAS CLINICAL TRAINING

SAS CLINICAL TRAINING SAS CLINICAL TRAINING Presented By 3S Business Corporation Inc www.3sbc.com Call us at : 281-823-9222 Mail us at : info@3sbc.com Table of Contents S.No TOPICS 1 Introduction to Clinical Trials 2 Introduction

More information

Subsetting Observations from Large SAS Data Sets

Subsetting Observations from Large SAS Data Sets Subsetting Observations from Large SAS Data Sets Christopher J. Bost, MDRC, New York, NY ABSTRACT This paper reviews four techniques to subset observations from large SAS data sets: MERGE, PROC SQL, user-defined

More information

Survey Analysis: Options for Missing Data

Survey Analysis: Options for Missing Data Survey Analysis: Options for Missing Data Paul Gorrell, Social & Scientific Systems, Inc., Silver Spring, MD Abstract A common situation researchers working with survey data face is the analysis of missing

More information

Lecture 1: Review and Exploratory Data Analysis (EDA)

Lecture 1: Review and Exploratory Data Analysis (EDA) Lecture 1: Review and Exploratory Data Analysis (EDA) Sandy Eckel seckel@jhsph.edu Department of Biostatistics, The Johns Hopkins University, Baltimore USA 21 April 2008 1 / 40 Course Information I Course

More information

More Tales from the Help Desk: Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board

More Tales from the Help Desk: Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board More Tales from the Help Desk: Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board INTRODUCTION In 20 years as a SAS consultant at the Federal Reserve Board, I have seen SAS users make

More information

Make Better Decisions with Optimization

Make Better Decisions with Optimization ABSTRACT Paper SAS1785-2015 Make Better Decisions with Optimization David R. Duling, SAS Institute Inc. Automated decision making systems are now found everywhere, from your bank to your government to

More information

Summarizing and Displaying Categorical Data

Summarizing and Displaying Categorical Data Summarizing and Displaying Categorical Data Categorical data can be summarized in a frequency distribution which counts the number of cases, or frequency, that fall into each category, or a relative frequency

More information

ABSTRACT INTRODUCTION THE MAPPING FILE GENERAL INFORMATION

ABSTRACT INTRODUCTION THE MAPPING FILE GENERAL INFORMATION An Excel Framework to Convert Clinical Data to CDISC SDTM Leveraging SAS Technology Ale Gicqueau, Clinovo, Sunnyvale, CA Marc Desgrousilliers, Clinovo, Sunnyvale, CA ABSTRACT CDISC SDTM data is the standard

More information

containing Kendall correlations; and the OUTH = option will create a data set containing Hoeffding statistics.

containing Kendall correlations; and the OUTH = option will create a data set containing Hoeffding statistics. Getting Correlations Using PROC CORR Correlation analysis provides a method to measure the strength of a linear relationship between two numeric variables. PROC CORR can be used to compute Pearson product-moment

More information

SAS Rule-Based Codebook Generation for Exploratory Data Analysis Ross Bettinger, Senior Analytical Consultant, Seattle, WA

SAS Rule-Based Codebook Generation for Exploratory Data Analysis Ross Bettinger, Senior Analytical Consultant, Seattle, WA SAS Rule-Based Codebook Generation for Exploratory Data Analysis Ross Bettinger, Senior Analytical Consultant, Seattle, WA ABSTRACT A codebook is a summary of a collection of data that reports significant

More information

Tutorial for proteome data analysis using the Perseus software platform

Tutorial for proteome data analysis using the Perseus software platform Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information

More information

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices.

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. White Paper Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. Contents Data Management: Why It s So Essential... 1 The Basics of Data Preparation... 1 1: Simplify Access

More information

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc. A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc. Introduction: The Basel Capital Accord, ready for implementation in force around 2006, sets out

More information

AP Physics 1 and 2 Lab Investigations

AP Physics 1 and 2 Lab Investigations AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks

More information

BUSINESS DATA ANALYSIS WITH PIVOTTABLES

BUSINESS DATA ANALYSIS WITH PIVOTTABLES BUSINESS DATA ANALYSIS WITH PIVOTTABLES Jim Chen, Ph.D. Professor Norfolk State University 700 Park Avenue Norfolk, VA 23504 (757) 823-2564 jchen@nsu.edu BUSINESS DATA ANALYSIS WITH PIVOTTABLES INTRODUCTION

More information

S P S S Statistical Package for the Social Sciences

S P S S Statistical Package for the Social Sciences S P S S Statistical Package for the Social Sciences Data Entry Data Management Basic Descriptive Statistics Jamie Lynn Marincic Leanne Hicks Survey, Statistics, and Psychometrics Core Facility (SSP) July

More information

Data Presentation. Paper 126-27. Using SAS Macros to Create Automated Excel Reports Containing Tables, Charts and Graphs

Data Presentation. Paper 126-27. Using SAS Macros to Create Automated Excel Reports Containing Tables, Charts and Graphs Paper 126-27 Using SAS Macros to Create Automated Excel Reports Containing Tables, Charts and Graphs Tugluke Abdurazak Abt Associates Inc. 1110 Vermont Avenue N.W. Suite 610 Washington D.C. 20005-3522

More information

Statistics and Analysis. Quality Control: How to Analyze and Verify Financial Data

Statistics and Analysis. Quality Control: How to Analyze and Verify Financial Data Abstract Quality Control: How to Analyze and Verify Financial Data Michelle Duan, Wharton Research Data Services, Philadelphia, PA As SAS programmers dealing with massive financial data from a variety

More information

SAS R IML (Introduction at the Master s Level)

SAS R IML (Introduction at the Master s Level) SAS R IML (Introduction at the Master s Level) Anton Bekkerman, Ph.D., Montana State University, Bozeman, MT ABSTRACT Most graduate-level statistics and econometrics programs require a more advanced knowledge

More information

Using Excel for Analyzing Survey Questionnaires Jennifer Leahy

Using Excel for Analyzing Survey Questionnaires Jennifer Leahy University of Wisconsin-Extension Cooperative Extension Madison, Wisconsin PD &E Program Development & Evaluation Using Excel for Analyzing Survey Questionnaires Jennifer Leahy G3658-14 Introduction You

More information

SPSS Manual for Introductory Applied Statistics: A Variable Approach

SPSS Manual for Introductory Applied Statistics: A Variable Approach SPSS Manual for Introductory Applied Statistics: A Variable Approach John Gabrosek Department of Statistics Grand Valley State University Allendale, MI USA August 2013 2 Copyright 2013 John Gabrosek. All

More information

One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University

One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University ABSTRACT In real world, analysts seldom come across data which is in

More information

Dongfeng Li. Autumn 2010

Dongfeng Li. Autumn 2010 Autumn 2010 Chapter Contents Some statistics background; ; Comparing means and proportions; variance. Students should master the basic concepts, descriptive statistics measures and graphs, basic hypothesis

More information

Common Tools for Displaying and Communicating Data for Process Improvement

Common Tools for Displaying and Communicating Data for Process Improvement Common Tools for Displaying and Communicating Data for Process Improvement Packet includes: Tool Use Page # Box and Whisker Plot Check Sheet Control Chart Histogram Pareto Diagram Run Chart Scatter Plot

More information

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

More information

Counting the Ways to Count in SAS. Imelda C. Go, South Carolina Department of Education, Columbia, SC

Counting the Ways to Count in SAS. Imelda C. Go, South Carolina Department of Education, Columbia, SC Paper CC 14 Counting the Ways to Count in SAS Imelda C. Go, South Carolina Department of Education, Columbia, SC ABSTRACT This paper first takes the reader through a progression of ways to count in SAS.

More information

1 J (Gr 6): Summarize and describe distributions.

1 J (Gr 6): Summarize and describe distributions. MAT.07.PT.4.TRVLT.A.299 Sample Item ID: MAT.07.PT.4.TRVLT.A.299 Title: Travel Time to Work (TRVLT) Grade: 07 Primary Claim: Claim 4: Modeling and Data Analysis Students can analyze complex, real-world

More information

IBM SPSS Direct Marketing 20

IBM SPSS Direct Marketing 20 IBM SPSS Direct Marketing 20 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This edition applies to IBM SPSS Statistics 20 and to

More information

Exercise 1.12 (Pg. 22-23)

Exercise 1.12 (Pg. 22-23) Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

More information

Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER

Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER JMP Genomics Step-by-Step Guide to Bi-Parental Linkage Mapping Introduction JMP Genomics offers several tools for the creation of linkage maps

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

Trade Flows and Trade Policy Analysis. October 2013 Dhaka, Bangladesh

Trade Flows and Trade Policy Analysis. October 2013 Dhaka, Bangladesh Trade Flows and Trade Policy Analysis October 2013 Dhaka, Bangladesh Witada Anukoonwattaka (ESCAP) Cosimo Beverelli (WTO) 1 Introduction to STATA 2 Content a. Datasets used in Introduction to Stata b.

More information

Paper 70-27 An Introduction to SAS PROC SQL Timothy J Harrington, Venturi Partners Consulting, Waukegan, Illinois

Paper 70-27 An Introduction to SAS PROC SQL Timothy J Harrington, Venturi Partners Consulting, Waukegan, Illinois Paper 70-27 An Introduction to SAS PROC SQL Timothy J Harrington, Venturi Partners Consulting, Waukegan, Illinois Abstract This paper introduces SAS users with at least a basic understanding of SAS data

More information

CRISP - DM. Data Mining Process. Process Standardization. Why Should There be a Standard Process? Cross-Industry Standard Process for Data Mining

CRISP - DM. Data Mining Process. Process Standardization. Why Should There be a Standard Process? Cross-Industry Standard Process for Data Mining Mining Process CRISP - DM Cross-Industry Standard Process for Mining (CRISP-DM) European Community funded effort to develop framework for data mining tasks Goals: Cross-Industry Standard Process for Mining

More information

KEY FEATURES OF SOURCE CONTROL UTILITIES

KEY FEATURES OF SOURCE CONTROL UTILITIES Source Code Revision Control Systems and Auto-Documenting Headers for SAS Programs on a UNIX or PC Multiuser Environment Terek Peterson, Alliance Consulting Group, Philadelphia, PA Max Cherny, Alliance

More information

SAS Data Set Encryption Options

SAS Data Set Encryption Options Technical Paper SAS Data Set Encryption Options SAS product interaction with encrypted data storage Table of Contents Introduction: What Is Encryption?... 1 Test Configuration... 1 Data... 1 Code... 2

More information

DBF Chapter. Note to UNIX and OS/390 Users. Import/Export Facility CHAPTER 7

DBF Chapter. Note to UNIX and OS/390 Users. Import/Export Facility CHAPTER 7 97 CHAPTER 7 DBF Chapter Note to UNIX and OS/390 Users 97 Import/Export Facility 97 Understanding DBF Essentials 98 DBF Files 98 DBF File Naming Conventions 99 DBF File Data Types 99 ACCESS Procedure Data

More information

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Section 14 Simple Linear Regression: Introduction to Least Squares Regression Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship

More information

Demographics of Atlanta, Georgia:

Demographics of Atlanta, Georgia: Demographics of Atlanta, Georgia: A Visual Analysis of the 2000 and 2010 Census Data 36-315 Final Project Rachel Cohen, Kathryn McKeough, Minnar Xie & David Zimmerman Ethnicities of Atlanta Figure 1: From

More information

Appendix III: SPSS Preliminary

Appendix III: SPSS Preliminary Appendix III: SPSS Preliminary SPSS is a statistical software package that provides a number of tools needed for the analytical process planning, data collection, data access and management, analysis,

More information

Gamma Distribution Fitting

Gamma Distribution Fitting Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

More information

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)

More information

Building and Customizing a CDISC Compliance and Data Quality Application Wayne Zhong, Accretion Softworks, Chester Springs, PA

Building and Customizing a CDISC Compliance and Data Quality Application Wayne Zhong, Accretion Softworks, Chester Springs, PA WUSS2015 Paper 84 Building and Customizing a CDISC Compliance and Data Quality Application Wayne Zhong, Accretion Softworks, Chester Springs, PA ABSTRACT Creating your own SAS application to perform CDISC

More information

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database.

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database. Physical Design Physical Database Design (Defined): Process of producing a description of the implementation of the database on secondary storage; it describes the base relations, file organizations, and

More information

A Technique for Storing and Manipulating Incomplete Dates in a Single SAS Date Value

A Technique for Storing and Manipulating Incomplete Dates in a Single SAS Date Value A Technique for Storing and Manipulating Incomplete Dates in a Single SAS Date Value John Ingersoll Introduction: This paper presents a technique for storing incomplete date values in a single variable

More information

Paper 2917. Creating Variables: Traps and Pitfalls Olena Galligan, Clinops LLC, San Francisco, CA

Paper 2917. Creating Variables: Traps and Pitfalls Olena Galligan, Clinops LLC, San Francisco, CA Paper 2917 Creating Variables: Traps and Pitfalls Olena Galligan, Clinops LLC, San Francisco, CA ABSTRACT Creation of variables is one of the most common SAS programming tasks. However, sometimes it produces

More information

ACL Command Reference

ACL Command Reference ACL Command Reference Test or Operation Explanation Command(s) Key fields*/records Output Basic Completeness Uniqueness Frequency & Materiality Distribution Multi-File Combinations, Comparisons, and Associations

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

DATA CONSISTENCY, COMPLETENESS AND CLEANING. By B.K. Tyagi and P.Philip Samuel CRME, Madurai

DATA CONSISTENCY, COMPLETENESS AND CLEANING. By B.K. Tyagi and P.Philip Samuel CRME, Madurai DATA CONSISTENCY, COMPLETENESS AND CLEANING By B.K. Tyagi and P.Philip Samuel CRME, Madurai DATA QUALITY (DATA CONSISTENCY, COMPLETENESS ) High-quality data needs to pass a set of quality criteria. Those

More information

PharmaSUG2010 HW06. Insights into ADaM. Matthew Becker, PharmaNet, Cary, NC, United States

PharmaSUG2010 HW06. Insights into ADaM. Matthew Becker, PharmaNet, Cary, NC, United States PharmaSUG2010 HW06 Insights into ADaM Matthew Becker, PharmaNet, Cary, NC, United States ABSTRACT ADaM (Analysis Dataset Model) is meant to describe the data attributes such as structure, content, and

More information

Methodologies for Converting Microsoft Excel Spreadsheets to SAS datasets

Methodologies for Converting Microsoft Excel Spreadsheets to SAS datasets Methodologies for Converting Microsoft Excel Spreadsheets to SAS datasets Karin LaPann ViroPharma Incorporated ABSTRACT Much functionality has been added to the SAS to Excel procedures in SAS version 9.

More information

Quick Start to Data Analysis with SAS Table of Contents. Chapter 1 Introduction 1. Chapter 2 SAS Programming Concepts 7

Quick Start to Data Analysis with SAS Table of Contents. Chapter 1 Introduction 1. Chapter 2 SAS Programming Concepts 7 Chapter 1 Introduction 1 SAS: The Complete Research Tool 1 Objectives 2 A Note About Syntax and Examples 2 Syntax 2 Examples 3 Organization 4 Chapter by Chapter 4 What This Book Is Not 5 Chapter 2 SAS

More information

Descriptive Statistics

Descriptive Statistics Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

More information

Data exploration with Microsoft Excel: univariate analysis

Data exploration with Microsoft Excel: univariate analysis Data exploration with Microsoft Excel: univariate analysis Contents 1 Introduction... 1 2 Exploring a variable s frequency distribution... 2 3 Calculating measures of central tendency... 16 4 Calculating

More information

Chapter 32 Histograms and Bar Charts. Chapter Table of Contents VARIABLES...470 METHOD...471 OUTPUT...472 REFERENCES...474

Chapter 32 Histograms and Bar Charts. Chapter Table of Contents VARIABLES...470 METHOD...471 OUTPUT...472 REFERENCES...474 Chapter 32 Histograms and Bar Charts Chapter Table of Contents VARIABLES...470 METHOD...471 OUTPUT...472 REFERENCES...474 467 Part 3. Introduction 468 Chapter 32 Histograms and Bar Charts Bar charts are

More information