Innovative Techniques and Tools to Detect Data Quality Problems


Paper DM05

Hong Qi and Allan Glaser
Merck & Co., Inc., Upper Gwynnedd, PA

ABSTRACT

High quality data are essential for accurate and meaningful analysis and reporting, and it is widely accepted that high quality data lead directly to higher programming productivity and faster throughput. Historically, data have been examined for quality with PROC FREQ, PROC UNIVARIATE, and similar tools. Those approaches are very useful, but additional work is often warranted. This paper presents innovative - yet simple - approaches to checking different aspects of data quality, including both structural and content issues. These tools can be implemented with a minimum of up-front development and programming time, and are easy to use.

INTRODUCTION

Good quality data are essential not only for accurate and meaningful analysis and reporting, but also for high programming productivity and faster throughput. Before using data for further work, a data user needs an overall appraisal of data quality, including any potential issues in data format and values. As a first step towards this goal, some fundamental aspects of the data should always be checked, such as the distribution of data values, formats, duplicate records, outliers, etc. Traditionally, data are examined with PROC FREQ, PROC UNIVARIATE, and similar tools. Those approaches are very helpful, but additional programming work is often required. This paper introduces five innovative yet simple approaches to checking data quality: detecting duplicate records, using a scatterplot matrix to detect outliers, examining the distribution of key variable values, presenting record populations at the dataset level, and determining whether numeric variables are real- or integer-valued. The following sections summarize each approach with an introduction, typical usage and sample outputs, and programming considerations.
DETECTION OF DUPLICATE RECORDS

Duplicate records may occur in a dataset for a variety of reasons, such as repeated data entry, sample re-testing, etc. It is common practice to identify duplicate records and remove a subset of them from the dataset before data processing. PROC SORT NODUPKEY can be used for this purpose; however, it does not always keep the specific record(s) that are needed for analysis. Therefore, a user-friendly approach to identifying duplicate records and selecting a specific set of records for summary and analysis purposes is proposed.

TYPICAL USAGE AND SAMPLE OUTPUT

There are several scenarios in which duplicate records may arise, one of which is laboratory-generated test results. For example, a subject is supposed to have one result per time point for a laboratory test, but may get multiple results for the same test at the same time point due to repeated testing. This issue needs to be identified up front so that a rationale can be built for deciding which test result should be included in the analysis. In this scenario, our program, %dp0rcd, generates a message in the SAS log file (Figure 1) and a listing of all the duplicate records in the SAS output file (Figure 2) to help the user understand where the duplicate records occur.

* * datadir.lab contains duplicate records at the level of AN STUDYWK LABCODE * *

Figure 1. Message in the SAS log file generated from %dp0rcd.

datadir.lab: duplicate records at the level of AN STUDYWK LABCODE

[listing of the duplicate records, one row per record, showing STUDYID, AN,
STUDYWK, LABCODE, EXAMVALU, EXAMUNIT, VISIT, and SCHVTFL: two WBC count
results (units 10[3]/microL) for the same subject in the same study week,
one at a scheduled visit (SCHVTFL=Y) and one at an unscheduled visit
(SCHVTFL=N)]

Figure 2. Listing in the SAS output file generated from %dp0rcd.

The output listing shows that two results for WBC count were recorded for the same subject in the same study week. However, one result was recorded at an unscheduled visit. The user needs to make a judgment, according to the study design, as to which result should be used in the subsequent summary and analysis. Note that PROC SORT NODUPKEY might remove one of the two records with no justification for keeping the other.

A macro approach is adopted to identify duplicate records in a dataset. As usual, some macro parameters must be specified, such as the name of a dataset and a list of variables whose values should not be repeated. Our program, %dp0rcd, detects duplicate records in a few simple steps. First, the program tests whether there are duplicate records in the dataset by comparing the total number of records (a) and the number of unique records (b) at the specified variable level. If a > b, the program finds all the duplicate records, sends a message to the log file, and prints the duplicate records with all the variables in the dataset. For example, the following code identifies the laboratory records that repeat by variables AN (unique subject identifier), STUDYWK (study week), and LABCODE (laboratory test code) in the data set DATADIR.LABS:

%dp0rcd(inds = datadir.labs, varlist = AN STUDYWK LABCODE);

The macro is easy to use, and can be applied to any data scenario.
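A minimal sketch of one way to implement this detection is shown below. It uses BY-group processing rather than the record-count comparison described above, and the macro and work-dataset names are illustrative, not the production %dp0rcd code:

```sas
%macro dup_check(inds=, varlist=);
  %local lastvar;
  %let lastvar = %scan(&varlist, -1);   /* last key variable, for FIRST./LAST. */

  proc sort data=&inds out=_srt;
    by &varlist;
  run;

  /* A record is a duplicate if it is not both the first and the last
     record of its key group */
  data _dups;
    set _srt;
    by &varlist;
    if not (first.&lastvar and last.&lastvar);
  run;

  proc print data=_dups;
    title "&inds: duplicate records at the level of &varlist";
  run;
%mend dup_check;
```

A call such as %dup_check(inds=datadir.labs, varlist=AN STUDYWK LABCODE) would then list every record whose key values repeat, leaving the decision of which record to keep to the user.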
SCATTERPLOT MATRIX TO DETECT OUTLIERS

Graphic data representations are very helpful in detecting outliers or data anomalies, and PROC UNIVARIATE, PROC TIMEPLOT, and PROC PLOT are frequently used. PROC PLOT is especially useful, because it helps to place data in proper context with other data. Real-world data may include many variables and many relationships and dependencies; deciding which pairs of variables to plot can require a substantial amount of up-front investigation, and the process can easily overlook important relationships. These concerns can be easily addressed with straightforward macro code and some of PROC PLOT's often overlooked capabilities. A group of scatterplots, commonly known as a matrix or constellation, allows the user to very easily look at data relationships and spot potential anomalies.

TYPICAL USAGE AND SAMPLE OUTPUT

Demographic and patient characteristic data are commonly used in the pharmaceutical industry, and the quality of those data is important. Suppose that we are examining data from an adult population, and have both HEIGHT and HEIGHT_UNIT_OF_MEASURE. A quick glance at the data may show legitimate values and a reasonable mix of inches and centimeters. Looking at the values of each of these variables separately may not reveal any problem. Plotting the data, however, would easily show a potentially incorrect unit of measure; in Figure 3, for example, there appear to be two heights in inches that are recorded as centimeters (indicated by the "B" around 60 cm), and further investigation is warranted. Keep in mind that the scatterplot matrix contains a similar plot for each pair of variables.

[plot of HEIGHT (vertical axis, roughly 50 to 200) against
HEIGHT_UNIT_OF_MEASURE (horizontal axis, values "cm" and "in"); most "cm"
values cluster between 150 and 200, but a "B" near 60 cm marks two
suspicious records]

Figure 3. Sample plot from scatterplot constellation.

The programming approach must be flexible and allow plotting only numeric data, only character data, or both. Options are also needed to allow inclusion or exclusion of specific variables, handling of missing values, and changes to the plot size. Using a macro approach, parameters capture the user's preferences. The macro then executes PROC CONTENTS against the target dataset. The resulting metadata is filtered based upon the parameters, and PROC SQL is then used to assign the desired list of variable names to a macro variable.
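That metadata step can be sketched as follows; the dataset and macro-variable names are illustrative:

```sas
/* Extract variable metadata from the target dataset */
proc contents data=&input_data out=_meta(keep=name type) noprint;
run;

/* Build a space-separated list of (here) the numeric variable names;
   in OUT= datasets from PROC CONTENTS, TYPE=1 is numeric, TYPE=2 is character */
proc sql noprint;
  select name into :variable_list separated by ' '
  from _meta
  where type = 1;
quit;
```

The WHERE clause is the point at which the user's include/exclude preferences would be applied.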
If missing values are to be included, a data step is used to assign printable characters to missing values. This typically means that missing character data are changed to a short string of asterisks and missing numeric data are changed to zeros, but the substitution values are controlled by macro parameters and can be easily changed. Lastly, PROC PLOT is invoked with options and variables determined by the macro parameters and the above preprocessing. By default, if a list of variables is specified, then each pairwise combination of variables will create a separate plot. Since a large number of plots may be produced, and since the user will generally be looking only for gross anomalies, multiple plots can usually be placed on a single page. This illustrates the general coding approach:

proc plot hpercent=&hpercent vpercent=&vpercent nolegend data=&input_data ;
  plot (&variable_list) ;

We have found that hpercent and vpercent values of 33 and 50, respectively, are generally good, and will put 6 plots on a page. Larger plots are better if the data volume is large, if data values are not clustered, or if character variables have many different values or long values. The execution time for this tool is essentially a linear function of the volume of data. If the user decides to include missing values, the data must also be processed through a data step, and that roughly doubles the time.

EXAMINING THE DISTRIBUTION OF KEY VARIABLE VALUES

Data can be collected or generated under different guidelines. Data values may vary in the same dataset or domain across studies. It is beneficial to understand the data values before processing the data, for the following reasons. First, data values contribute to data quality. Any unexpected data values can be documented in feedback to data collection, data entry, data mapping, data generation programs, etc., for data quality improvement. Second, it is often necessary to generate derived variables with reference to the data values in the input dataset - for example, to derive a study-week variable from the value of the visit name, which can be "day-30" or "day-30 screen".
Knowing what to expect in the data values definitely makes coding more effective. Moreover, key variable values are often reviewed during the validation of an analysis data set after data processing. PROC FREQ and PROC UNIVARIATE can be used for this purpose; however, the variable types need to be understood before using these procedures, to prevent errors or lengthy SAS output. Therefore, a user-friendly macro approach to examining the distribution of key variable values is proposed.

TYPICAL USAGE AND SAMPLE OUTPUT

This macro can be used to examine the distribution of key variables for any dataset, e.g., a raw dataset or a derived dataset. For instance, in clinical trial studies, we usually need to provide tables describing subject characteristics of the study population. To design such a table and assign appropriate formats for the population categories to be summarized, we first need to know the distribution of age, race, and gender in the demographic dataset. In this scenario, our program, %dist0var, generates the following in the SAS output file: (1) the frequency distribution for the character variables RACE and GENDER, from the FREQ procedure; and (2) output from the UNIVARIATE procedure for the numeric variable AGE. Understanding the variable values for race (Figure 4), the user can design the format in which race values are to be displayed in the table. After examining the data range for AGE, the user can choose to produce the frequency distribution of AGE or of age groups with an additional macro call specifying one more macro parameter.

[frequency table for RACE, with columns Frequency, Percent, Cumulative
Frequency, and Cumulative Percent, and rows for the values Asian, Hispa,
black, multi, and white; the counts were lost in transcription]

Figure 4. Distribution of Race generated from %dist0var.

The macro is designed to let the user examine the distribution of key variables with a simple macro call. Parameters to be specified include the name of a data set and a list of key variables. The FREQ and UNIVARIATE procedures are built into the macro. Because the FREQ procedure can generate lengthy output for numeric variables, the macro by default examines character variables with the FREQ procedure and numeric variables with the UNIVARIATE procedure. It is then the user's decision whether to examine the numeric variables with the FREQ procedure after reviewing the output from the UNIVARIATE procedure. Macro %dist0var is composed of a few simple steps. First, it separates the key variables into character and numeric variables. Second, the program runs the FREQ or UNIVARIATE procedure according to the variable type. For numeric variables, it allows the user to choose the procedure used to examine the data distribution. For example, the following code generates the distribution of data values for variables AGE, RACE, and GENDER in the data set DATARAW.S_DEMOS:

%dist0var(inds = dataraw.s_demos, varlist = AGE RACE GENDER);

Upon reviewing the output, if the user decides to examine AGE with the FREQ procedure, another macro call can be specified as:

%dist0var(inds = dataraw.s_demos, varlist = AGE, numvar = Y);

The macro can be specified without prior knowledge of the variable types, and can be applied to any data scenario.

DATASET-LEVEL PRESENTATION OF RECORD POPULATIONS

Real-world datasets typically include some missing values. Some of the variables with missing values may stand alone, while others may be closely related to other variables.
Finding problems with the distribution of missing values is often difficult, and looking at individual records may be tedious, time-consuming, and incomplete. It is helpful to look at the overall distribution of missing values, and this can be done with a concise listing of a dataset's variables that includes information about the data density. This listing can be easily produced with a small amount of SAS code, and simple graphics can be added to facilitate interpretation.

TYPICAL USAGE AND SAMPLE OUTPUT

Demographic and patient characteristic data can be used again to illustrate the issues. Some specifics that may be characterized as poor data quality include measurements without a corresponding unit of measure, missing key information, an unexpected number of missing values for certain variables, etc. In the contrived sample output in Figure 5, the DEMOG dataset is noted as having 10 variables and 120 records. The variables are listed alphabetically, and the maximum actual length and defined length are given for character variables. The number of records that contain non-missing values is given, and a simple graphic illustrates it. The position of the asterisk indicates what fraction of the records have non-missing values. Lastly, the variable labels are shown.

DATASET: DEMOG    VARIABLES: 10    RECORDS: 120

             LENGTH     # RECORDS
VARIABLE    ACT/ DEF    POPULATED    % RECORDS POPULATED
AGE                     ALL          |                    *|   AGE
AGE_YR                  NONE         |*                    |   AGE IN YEARS
BRTH_DT                 92           |               *     |   DATE OF BIRTH
CSNT_DT                 ALL          |                    *|   CONSENT DATE
GENDER       1/   6     92           |               *     |   GENDER
INV_NAME    43/ 200     ALL          |                    *|   INVESTIGATOR NAME
INV_NUM      6/   6     ALL          |                    *|   INVESTIGATOR NUMBER
PRJC_DSC    35/  40     ALL          |                    *|   PROJECT DESCRIPTION
PROTOCOL                ALL          |                    *|   PROTOCOL ID
PAT_NUM                 ALL          |                    *|   PATIENT NUMBER

Figure 5. Example of record populations.

In this example, a concise overview of the entire dataset is presented. Note that AGE is populated on all records, but the calculation of AGE_YR apparently failed, since that variable has only missing values. Looking at the counts, it appears that BRTH_DT and GENDER may be missing on the same records. Also note that INV_NAME is using only 43 of the permitted 200 bytes. The programming approach uses a simple macro with just one parameter that points to the target dataset. PROC CONTENTS initially extracts metadata, which is then loaded into macro variables. A single pass is made through the target dataset, in which missing values are counted and character variable lengths are checked.
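That single pass can be sketched as a data step; the dataset, variable, and macro-variable names here are illustrative:

```sas
/* One pass over the data: count missing values and track the longest
   actual value of a character variable */
data _null_;
  set demog end=eof;
  retain n_miss_gender 0 max_len_inv 0;
  if missing(gender) then n_miss_gender + 1;
  if not missing(inv_name) then
    max_len_inv = max(max_len_inv, length(inv_name));
  if eof then do;
    call symputx('n_miss_gender', n_miss_gender);   /* counts used later  */
    call symputx('max_len_inv',  max_len_inv);      /* in the report step */
  end;
run;
```

In the real macro, the variables to check would come from the PROC CONTENTS metadata rather than being hard-coded.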
A data step is used to create the output, although PROC PRINT or other approaches could easily have been used. The simple graphic allows a quick visual check of the prevalence of missing values. A character string is given the value of a series of blanks with a vertical bar in the first and last positions. For each variable, the percentage of populated records is determined and rounded to the nearest multiple of 5. That number is used to place an asterisk in the proper position in the string, using the unusual technique of putting the SUBSTR function on the left side of the equals sign:

substr(scale, position, 1) = '*' ;
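A data-step sketch of how such a ruler might be built follows; the string length and variable names are illustrative:

```sas
/* Build a 23-character ruler and mark the populated percentage on it */
data _report;
  set _counts;                     /* one row per variable (illustrative)  */
  length scale $23;
  scale = ' ';                     /* blank-padded to 23 characters        */
  substr(scale,  1, 1) = '|';      /* left end of the ruler                */
  substr(scale, 23, 1) = '|';      /* right end of the ruler               */
  pct = round(100 * n_populated / n_records, 5);  /* nearest multiple of 5 */
  substr(scale, 2 + pct/5, 1) = '*';  /* 0% -> column 2, 100% -> column 22 */
run;
```

Because pct is a multiple of 5, the expression 2 + pct/5 always yields an integer column between the two bars.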

The execution time for this tool is essentially a linear function of the volume of data, since the entire dataset must be processed once and every variable value must be checked.

DETERMINING IF NUMERIC VARIABLES ARE REAL- OR INTEGER-VALUED

Variables in a SAS dataset are either character or numeric. Some computer languages allow or require variables to be defined in more detail; e.g., a numeric variable may be restricted to having only integer values. The distinction between integer and real values can be very subtle, and ignoring it can lead to faulty logic and erroneous results. Even if data are intended to be integer, calculations and manipulations can lead to those data being assigned real values. If those values are very close to integers, comparisons, calculations, and reporting may not yield accurate results. In practice, it is often helpful to know two characteristics of numeric variables. First, does each variable contain only integer values, or does it contain real values? And, for real-valued variables, what is the maximum deviation from the nearest integer? Together, these characteristics give insight into the legitimacy of the data, and the authors have developed tools to help answer these questions. There are many related issues regarding real values, including moving data across platforms, comparing values, etc. This paper will not address those topics; the interested reader should look at the references.

TYPICAL USAGE AND SAMPLE OUTPUTS

Consider, for example, that we are working with the weights of adult patients. The data should originally be captured in whole pounds, and the presentation of the results depends on having integer values. It is possible that some values were captured in pounds/ounces or kilograms, and then converted to pounds. If that happens, the resulting values may not be integers.
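One way to check for this condition is to compare each value with its floor and ceiling; this sketch uses illustrative dataset and variable names, not the authors' production code:

```sas
/* Distance from each non-missing WEIGHT value to its nearest integer */
data _dev;
  set patients(keep=weight);
  if not missing(weight) then
    min_del = min(weight - floor(weight), ceil(weight) - weight);
run;

/* The maximum of those minimum distances: 0 means WEIGHT is integer-valued;
   a small nonzero maximum suggests unintended real values */
proc means data=_dev max;
  var min_del;
run;
```

The same pattern extends to every numeric variable once the variable list has been obtained from PROC CONTENTS.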
A concise report shows all of the numeric variables for the dataset, identifies which are real and which are integer, and, for real variables, shows the maximum deviation from the nearest integer. A real-valued variable's deviation should approach 0.5 as the number of values increases and as the distribution approaches uniformity across fractional values.

NUMERIC VARIABLES for DATASET and MAXIMUM DISTANCE to NEAREST INTEGER

AGE        real
AGE_YR     integer
BRTH_DT    integer
CSNT_DT    integer
HEIGHT     integer
WEIGHT     real

Figure 6. Example of numeric variable characteristics.

In the example shown in Figure 6, most of the numeric variables have only integer values, and AGE appears to have values that are legitimately non-integer. WEIGHT, however, has only values that are close to integers: all WEIGHT values lie within a very small distance of the nearest integer. This suggests that WEIGHT is intended to have integer values, but that calculations or manipulations may have had an undesirable side effect. Missing values, of course, are ignored. Our tool is a macro that requires only a pointer to the target dataset. PROC CONTENTS determines which variables are numeric. Each record in the target dataset is read, and the actual value of each numeric variable is compared with its floor and ceil values. The results are summarized via PROC MEANS and then reported. Recall that the FLOOR function returns the largest integer that does not exceed its argument, while the CEIL function returns the smallest integer that is not less than its argument. This illustrates the general approach:

data ... ;
  min_del = min(value - floor(value), ceil(value) - value) ;

proc means ... max ;
  var min_del ;

Using both minimum and maximum may be confusing. The MIN function in the data step ensures that we are looking at the closest integer; each value must be compared to both its floor and its ceil. For example, a value with fractional part 0.68 is only 0.32 from the nearest integer. PROC MEANS returns the largest of those minimums, since we want to know the greatest deviation from an integer. More thorough approaches can be envisioned. One would be to present a simple graphic illustrating the distribution of deviations for each variable. Another would capture the distribution between floor and ceil, not just the distance to the nearest integer. In practice, the authors have not found such thoroughness necessary.

CONCLUSION AND DISCUSSION

The approaches presented in this paper offer innovative ways to understand data and detect data quality issues. They can be implemented with a minimum of up-front development and programming time, and can be easily used in any data scenario. Therefore, they are recommended for routine checking of data quality.

ACKNOWLEDGEMENTS

The authors would like to acknowledge William Wilkins, a colleague who has done research in this area and has offered guidance and suggestions to others.

RECOMMENDED READING

Two SAS documents are excellent references for issues related to floating point representation of numeric data: "Numeric Precision 101" and "Dealing with Numeric Representation Error in SAS Applications", found at support.sas.com/techsup/technote/ts654.pdf and support.sas.com/techsup/technote/ts230.html, respectively.
CONTACT INFORMATION

The authors are members of Merck's Scientific Programming management team. Questions, comments, and suggestions are welcome, and should be directed to the authors at Hong_Qi@Merck.com and Allan_Glaser@Merck.com.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.


More information

The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon

The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon ABSTRACT Effective business development strategies often begin with market segmentation,

More information

IBM SPSS Direct Marketing 19

IBM SPSS Direct Marketing 19 IBM SPSS Direct Marketing 19 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This document contains proprietary information of SPSS

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

Scheduling Programming Activities and Johnson's Algorithm

Scheduling Programming Activities and Johnson's Algorithm Scheduling Programming Activities and Johnson's Algorithm Allan Glaser and Meenal Sinha Octagon Research Solutions, Inc. Abstract Scheduling is important. Much of our daily work requires us to juggle multiple

More information

Scatter Chart. Segmented Bar Chart. Overlay Chart

Scatter Chart. Segmented Bar Chart. Overlay Chart Data Visualization Using Java and VRML Lingxiao Li, Art Barnes, SAS Institute Inc., Cary, NC ABSTRACT Java and VRML (Virtual Reality Modeling Language) are tools with tremendous potential for creating

More information

Statistics Chapter 2

Statistics Chapter 2 Statistics Chapter 2 Frequency Tables A frequency table organizes quantitative data. partitions data into classes (intervals). shows how many data values are in each class. Test Score Number of Students

More information

REx: An Automated System for Extracting Clinical Trial Data from Oracle to SAS

REx: An Automated System for Extracting Clinical Trial Data from Oracle to SAS REx: An Automated System for Extracting Clinical Trial Data from Oracle to SAS Edward McCaney, Centocor Inc., Malvern, PA Gail Stoner, Centocor Inc., Malvern, PA Anthony Malinowski, Centocor Inc., Malvern,

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

Using Edit-Distance Functions to Identify Similar E-Mail Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC

Using Edit-Distance Functions to Identify Similar E-Mail Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC Paper 073-29 Using Edit-Distance Functions to Identify Similar E-Mail Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC ABSTRACT Version 9 of SAS software has added functions which can efficiently

More information

SAS CLINICAL TRAINING

SAS CLINICAL TRAINING SAS CLINICAL TRAINING Presented By 3S Business Corporation Inc www.3sbc.com Call us at : 281-823-9222 Mail us at : info@3sbc.com Table of Contents S.No TOPICS 1 Introduction to Clinical Trials 2 Introduction

More information

Subsetting Observations from Large SAS Data Sets

Subsetting Observations from Large SAS Data Sets Subsetting Observations from Large SAS Data Sets Christopher J. Bost, MDRC, New York, NY ABSTRACT This paper reviews four techniques to subset observations from large SAS data sets: MERGE, PROC SQL, user-defined

More information

Survey Analysis: Options for Missing Data

Survey Analysis: Options for Missing Data Survey Analysis: Options for Missing Data Paul Gorrell, Social & Scientific Systems, Inc., Silver Spring, MD Abstract A common situation researchers working with survey data face is the analysis of missing

More information

Lecture 1: Review and Exploratory Data Analysis (EDA)

Lecture 1: Review and Exploratory Data Analysis (EDA) Lecture 1: Review and Exploratory Data Analysis (EDA) Sandy Eckel seckel@jhsph.edu Department of Biostatistics, The Johns Hopkins University, Baltimore USA 21 April 2008 1 / 40 Course Information I Course

More information

More Tales from the Help Desk: Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board

More Tales from the Help Desk: Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board More Tales from the Help Desk: Solutions for Simple SAS Mistakes Bruce Gilsen, Federal Reserve Board INTRODUCTION In 20 years as a SAS consultant at the Federal Reserve Board, I have seen SAS users make

More information

Make Better Decisions with Optimization

Make Better Decisions with Optimization ABSTRACT Paper SAS1785-2015 Make Better Decisions with Optimization David R. Duling, SAS Institute Inc. Automated decision making systems are now found everywhere, from your bank to your government to

More information

Summarizing and Displaying Categorical Data

Summarizing and Displaying Categorical Data Summarizing and Displaying Categorical Data Categorical data can be summarized in a frequency distribution which counts the number of cases, or frequency, that fall into each category, or a relative frequency

More information

ABSTRACT INTRODUCTION THE MAPPING FILE GENERAL INFORMATION

ABSTRACT INTRODUCTION THE MAPPING FILE GENERAL INFORMATION An Excel Framework to Convert Clinical Data to CDISC SDTM Leveraging SAS Technology Ale Gicqueau, Clinovo, Sunnyvale, CA Marc Desgrousilliers, Clinovo, Sunnyvale, CA ABSTRACT CDISC SDTM data is the standard

More information

containing Kendall correlations; and the OUTH = option will create a data set containing Hoeffding statistics.

containing Kendall correlations; and the OUTH = option will create a data set containing Hoeffding statistics. Getting Correlations Using PROC CORR Correlation analysis provides a method to measure the strength of a linear relationship between two numeric variables. PROC CORR can be used to compute Pearson product-moment

More information

SAS Rule-Based Codebook Generation for Exploratory Data Analysis Ross Bettinger, Senior Analytical Consultant, Seattle, WA

SAS Rule-Based Codebook Generation for Exploratory Data Analysis Ross Bettinger, Senior Analytical Consultant, Seattle, WA SAS Rule-Based Codebook Generation for Exploratory Data Analysis Ross Bettinger, Senior Analytical Consultant, Seattle, WA ABSTRACT A codebook is a summary of a collection of data that reports significant

More information

Tutorial for proteome data analysis using the Perseus software platform

Tutorial for proteome data analysis using the Perseus software platform Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information

More information

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices.

White Paper. Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. White Paper Thirsting for Insight? Quench It With 5 Data Management for Analytics Best Practices. Contents Data Management: Why It s So Essential... 1 The Basics of Data Preparation... 1 1: Simplify Access

More information

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc.

A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc. A Hybrid Modeling Platform to meet Basel II Requirements in Banking Jeffery Morrision, SunTrust Bank, Inc. Introduction: The Basel Capital Accord, ready for implementation in force around 2006, sets out

More information

AP Physics 1 and 2 Lab Investigations

AP Physics 1 and 2 Lab Investigations AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks

More information

BUSINESS DATA ANALYSIS WITH PIVOTTABLES

BUSINESS DATA ANALYSIS WITH PIVOTTABLES BUSINESS DATA ANALYSIS WITH PIVOTTABLES Jim Chen, Ph.D. Professor Norfolk State University 700 Park Avenue Norfolk, VA 23504 (757) 823-2564 jchen@nsu.edu BUSINESS DATA ANALYSIS WITH PIVOTTABLES INTRODUCTION

More information

S P S S Statistical Package for the Social Sciences

S P S S Statistical Package for the Social Sciences S P S S Statistical Package for the Social Sciences Data Entry Data Management Basic Descriptive Statistics Jamie Lynn Marincic Leanne Hicks Survey, Statistics, and Psychometrics Core Facility (SSP) July

More information

Data Presentation. Paper 126-27. Using SAS Macros to Create Automated Excel Reports Containing Tables, Charts and Graphs

Data Presentation. Paper 126-27. Using SAS Macros to Create Automated Excel Reports Containing Tables, Charts and Graphs Paper 126-27 Using SAS Macros to Create Automated Excel Reports Containing Tables, Charts and Graphs Tugluke Abdurazak Abt Associates Inc. 1110 Vermont Avenue N.W. Suite 610 Washington D.C. 20005-3522

More information

Statistics and Analysis. Quality Control: How to Analyze and Verify Financial Data

Statistics and Analysis. Quality Control: How to Analyze and Verify Financial Data Abstract Quality Control: How to Analyze and Verify Financial Data Michelle Duan, Wharton Research Data Services, Philadelphia, PA As SAS programmers dealing with massive financial data from a variety

More information

SAS R IML (Introduction at the Master s Level)

SAS R IML (Introduction at the Master s Level) SAS R IML (Introduction at the Master s Level) Anton Bekkerman, Ph.D., Montana State University, Bozeman, MT ABSTRACT Most graduate-level statistics and econometrics programs require a more advanced knowledge

More information

Using Excel for Analyzing Survey Questionnaires Jennifer Leahy

Using Excel for Analyzing Survey Questionnaires Jennifer Leahy University of Wisconsin-Extension Cooperative Extension Madison, Wisconsin PD &E Program Development & Evaluation Using Excel for Analyzing Survey Questionnaires Jennifer Leahy G3658-14 Introduction You

More information

SPSS Manual for Introductory Applied Statistics: A Variable Approach

SPSS Manual for Introductory Applied Statistics: A Variable Approach SPSS Manual for Introductory Applied Statistics: A Variable Approach John Gabrosek Department of Statistics Grand Valley State University Allendale, MI USA August 2013 2 Copyright 2013 John Gabrosek. All

More information

One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University

One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University One problem > Multiple solutions; various ways of removing duplicates from dataset using SAS Jaya Dhillon, Louisiana State University ABSTRACT In real world, analysts seldom come across data which is in

More information

Dongfeng Li. Autumn 2010

Dongfeng Li. Autumn 2010 Autumn 2010 Chapter Contents Some statistics background; ; Comparing means and proportions; variance. Students should master the basic concepts, descriptive statistics measures and graphs, basic hypothesis

More information

Common Tools for Displaying and Communicating Data for Process Improvement

Common Tools for Displaying and Communicating Data for Process Improvement Common Tools for Displaying and Communicating Data for Process Improvement Packet includes: Tool Use Page # Box and Whisker Plot Check Sheet Control Chart Histogram Pareto Diagram Run Chart Scatter Plot

More information

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

More information

Counting the Ways to Count in SAS. Imelda C. Go, South Carolina Department of Education, Columbia, SC

Counting the Ways to Count in SAS. Imelda C. Go, South Carolina Department of Education, Columbia, SC Paper CC 14 Counting the Ways to Count in SAS Imelda C. Go, South Carolina Department of Education, Columbia, SC ABSTRACT This paper first takes the reader through a progression of ways to count in SAS.

More information

1 J (Gr 6): Summarize and describe distributions.

1 J (Gr 6): Summarize and describe distributions. MAT.07.PT.4.TRVLT.A.299 Sample Item ID: MAT.07.PT.4.TRVLT.A.299 Title: Travel Time to Work (TRVLT) Grade: 07 Primary Claim: Claim 4: Modeling and Data Analysis Students can analyze complex, real-world

More information

IBM SPSS Direct Marketing 20

IBM SPSS Direct Marketing 20 IBM SPSS Direct Marketing 20 Note: Before using this information and the product it supports, read the general information under Notices on p. 105. This edition applies to IBM SPSS Statistics 20 and to

More information

Exercise 1.12 (Pg. 22-23)

Exercise 1.12 (Pg. 22-23) Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

More information

Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER

Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER JMP Genomics Step-by-Step Guide to Bi-Parental Linkage Mapping Introduction JMP Genomics offers several tools for the creation of linkage maps

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

Trade Flows and Trade Policy Analysis. October 2013 Dhaka, Bangladesh

Trade Flows and Trade Policy Analysis. October 2013 Dhaka, Bangladesh Trade Flows and Trade Policy Analysis October 2013 Dhaka, Bangladesh Witada Anukoonwattaka (ESCAP) Cosimo Beverelli (WTO) 1 Introduction to STATA 2 Content a. Datasets used in Introduction to Stata b.

More information

Paper 70-27 An Introduction to SAS PROC SQL Timothy J Harrington, Venturi Partners Consulting, Waukegan, Illinois

Paper 70-27 An Introduction to SAS PROC SQL Timothy J Harrington, Venturi Partners Consulting, Waukegan, Illinois Paper 70-27 An Introduction to SAS PROC SQL Timothy J Harrington, Venturi Partners Consulting, Waukegan, Illinois Abstract This paper introduces SAS users with at least a basic understanding of SAS data

More information

CRISP - DM. Data Mining Process. Process Standardization. Why Should There be a Standard Process? Cross-Industry Standard Process for Data Mining

CRISP - DM. Data Mining Process. Process Standardization. Why Should There be a Standard Process? Cross-Industry Standard Process for Data Mining Mining Process CRISP - DM Cross-Industry Standard Process for Mining (CRISP-DM) European Community funded effort to develop framework for data mining tasks Goals: Cross-Industry Standard Process for Mining

More information

KEY FEATURES OF SOURCE CONTROL UTILITIES

KEY FEATURES OF SOURCE CONTROL UTILITIES Source Code Revision Control Systems and Auto-Documenting Headers for SAS Programs on a UNIX or PC Multiuser Environment Terek Peterson, Alliance Consulting Group, Philadelphia, PA Max Cherny, Alliance

More information

SAS Data Set Encryption Options

SAS Data Set Encryption Options Technical Paper SAS Data Set Encryption Options SAS product interaction with encrypted data storage Table of Contents Introduction: What Is Encryption?... 1 Test Configuration... 1 Data... 1 Code... 2

More information

DBF Chapter. Note to UNIX and OS/390 Users. Import/Export Facility CHAPTER 7

DBF Chapter. Note to UNIX and OS/390 Users. Import/Export Facility CHAPTER 7 97 CHAPTER 7 DBF Chapter Note to UNIX and OS/390 Users 97 Import/Export Facility 97 Understanding DBF Essentials 98 DBF Files 98 DBF File Naming Conventions 99 DBF File Data Types 99 ACCESS Procedure Data

More information

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Section 14 Simple Linear Regression: Introduction to Least Squares Regression Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship

More information

Demographics of Atlanta, Georgia:

Demographics of Atlanta, Georgia: Demographics of Atlanta, Georgia: A Visual Analysis of the 2000 and 2010 Census Data 36-315 Final Project Rachel Cohen, Kathryn McKeough, Minnar Xie & David Zimmerman Ethnicities of Atlanta Figure 1: From

More information

Appendix III: SPSS Preliminary

Appendix III: SPSS Preliminary Appendix III: SPSS Preliminary SPSS is a statistical software package that provides a number of tools needed for the analytical process planning, data collection, data access and management, analysis,

More information

Gamma Distribution Fitting

Gamma Distribution Fitting Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

More information

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)

More information

Building and Customizing a CDISC Compliance and Data Quality Application Wayne Zhong, Accretion Softworks, Chester Springs, PA

Building and Customizing a CDISC Compliance and Data Quality Application Wayne Zhong, Accretion Softworks, Chester Springs, PA WUSS2015 Paper 84 Building and Customizing a CDISC Compliance and Data Quality Application Wayne Zhong, Accretion Softworks, Chester Springs, PA ABSTRACT Creating your own SAS application to perform CDISC

More information

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database.

Physical Design. Meeting the needs of the users is the gold standard against which we measure our success in creating a database. Physical Design Physical Database Design (Defined): Process of producing a description of the implementation of the database on secondary storage; it describes the base relations, file organizations, and

More information

A Technique for Storing and Manipulating Incomplete Dates in a Single SAS Date Value

A Technique for Storing and Manipulating Incomplete Dates in a Single SAS Date Value A Technique for Storing and Manipulating Incomplete Dates in a Single SAS Date Value John Ingersoll Introduction: This paper presents a technique for storing incomplete date values in a single variable

More information

Paper 2917. Creating Variables: Traps and Pitfalls Olena Galligan, Clinops LLC, San Francisco, CA

Paper 2917. Creating Variables: Traps and Pitfalls Olena Galligan, Clinops LLC, San Francisco, CA Paper 2917 Creating Variables: Traps and Pitfalls Olena Galligan, Clinops LLC, San Francisco, CA ABSTRACT Creation of variables is one of the most common SAS programming tasks. However, sometimes it produces

More information

ACL Command Reference

ACL Command Reference ACL Command Reference Test or Operation Explanation Command(s) Key fields*/records Output Basic Completeness Uniqueness Frequency & Materiality Distribution Multi-File Combinations, Comparisons, and Associations

More information

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.

More information

DATA CONSISTENCY, COMPLETENESS AND CLEANING. By B.K. Tyagi and P.Philip Samuel CRME, Madurai

DATA CONSISTENCY, COMPLETENESS AND CLEANING. By B.K. Tyagi and P.Philip Samuel CRME, Madurai DATA CONSISTENCY, COMPLETENESS AND CLEANING By B.K. Tyagi and P.Philip Samuel CRME, Madurai DATA QUALITY (DATA CONSISTENCY, COMPLETENESS ) High-quality data needs to pass a set of quality criteria. Those

More information

PharmaSUG2010 HW06. Insights into ADaM. Matthew Becker, PharmaNet, Cary, NC, United States

PharmaSUG2010 HW06. Insights into ADaM. Matthew Becker, PharmaNet, Cary, NC, United States PharmaSUG2010 HW06 Insights into ADaM Matthew Becker, PharmaNet, Cary, NC, United States ABSTRACT ADaM (Analysis Dataset Model) is meant to describe the data attributes such as structure, content, and

More information

Methodologies for Converting Microsoft Excel Spreadsheets to SAS datasets

Methodologies for Converting Microsoft Excel Spreadsheets to SAS datasets Methodologies for Converting Microsoft Excel Spreadsheets to SAS datasets Karin LaPann ViroPharma Incorporated ABSTRACT Much functionality has been added to the SAS to Excel procedures in SAS version 9.

More information

Quick Start to Data Analysis with SAS Table of Contents. Chapter 1 Introduction 1. Chapter 2 SAS Programming Concepts 7

Quick Start to Data Analysis with SAS Table of Contents. Chapter 1 Introduction 1. Chapter 2 SAS Programming Concepts 7 Chapter 1 Introduction 1 SAS: The Complete Research Tool 1 Objectives 2 A Note About Syntax and Examples 2 Syntax 2 Examples 3 Organization 4 Chapter by Chapter 4 What This Book Is Not 5 Chapter 2 SAS

More information

Descriptive Statistics

Descriptive Statistics Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

More information

Data exploration with Microsoft Excel: univariate analysis

Data exploration with Microsoft Excel: univariate analysis Data exploration with Microsoft Excel: univariate analysis Contents 1 Introduction... 1 2 Exploring a variable s frequency distribution... 2 3 Calculating measures of central tendency... 16 4 Calculating

More information

Chapter 32 Histograms and Bar Charts. Chapter Table of Contents VARIABLES...470 METHOD...471 OUTPUT...472 REFERENCES...474

Chapter 32 Histograms and Bar Charts. Chapter Table of Contents VARIABLES...470 METHOD...471 OUTPUT...472 REFERENCES...474 Chapter 32 Histograms and Bar Charts Chapter Table of Contents VARIABLES...470 METHOD...471 OUTPUT...472 REFERENCES...474 467 Part 3. Introduction 468 Chapter 32 Histograms and Bar Charts Bar charts are

More information