Data Cleaning and Missing Data Analysis



Similar documents
Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Multiple Imputation for Missing Data: A Cautionary Tale

A Basic Introduction to Missing Data

Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random

Data analysis process

Missing Data & How to Deal: An overview of missing data. Melissa Humphries Population Research Center

Problem of Missing Data

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

Dealing with Missing Data

A REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA

Additional sources Compilation of sources:

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Missing Data: Patterns, Mechanisms & Prevention. Edith de Leeuw

APPLIED MISSING DATA ANALYSIS

MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS)

IBM SPSS Missing Values 22

Analyzing Structural Equation Models With Missing Data

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Missing Data Part 1: Overview, Traditional Methods Page 1

Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest

Handling missing data in Stata a whirlwind tour

Missing Data. Katyn & Elena

DISCRIMINANT FUNCTION ANALYSIS (DA)

Imputing Missing Data using SAS

Handling attrition and non-response in longitudinal data

Missing data: the hidden problem

Imputing Attendance Data in a Longitudinal Multilevel Panel Data Set

Visualization of missing values using the R-package VIM

Imputation and Analysis. Peter Fayers

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection

Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out

Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis

An introduction to using Microsoft Excel for quantitative data analysis

Multiple logistic regression analysis of cigarette use among high school students

UNDERSTANDING THE TWO-WAY ANOVA

IBM SPSS Missing Values 20

CHOOSING APPROPRIATE METHODS FOR MISSING DATA IN MEDICAL RESEARCH: A DECISION ALGORITHM ON METHODS FOR MISSING DATA

Multivariate Analysis of Variance (MANOVA)

Best Practices for Missing Data Management in Counseling Psychology

Dealing with Missing Data

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

Moderation. Moderation

STATISTICA Formula Guide: Logistic Regression. Table of Contents

SPSS and AM statistical software example.

In almost any research you perform, there is the potential for missing or

Instructions for SPSS 21

The Dummy s Guide to Data Analysis Using SPSS

Module 14: Missing Data Stata Practical

Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 22

LOGISTIC REGRESSION ANALYSIS

(Howell, D.C. (2008) The analysis of missing data. In Outhwaite, W. & Turner, S. Handbook of Social Science Methodology. London: Sage.

A Review of Methods for Missing Data

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

4. Descriptive Statistics: Measures of Variability and Central Tendency

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

2. Making example missing-value datasets: MCAR, MAR, and MNAR

Mode and Patient-mix Adjustment of the CAHPS Hospital Survey (HCAHPS)

SPSS: Getting Started. For Windows

Stephen du Toit Mathilda du Toit Gerhard Mels Yan Cheng. LISREL for Windows: PRELIS User s Guide

SPSS Introduction. Yi Li

Before LISREL: Preparing the Data using PRELIS

Introduction to Data Analysis in Hierarchical Linear Models

Binary Logistic Regression

Multinomial Logistic Regression

Missing data in randomized controlled trials (RCTs) can

Discussion. Seppo Laaksonen Introduction

An introduction to modern missing data analyses

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Craig K. Enders Arizona State University Department of Psychology

Missing Data. Paul D. Allison INTRODUCTION

Analysis of Longitudinal Data with Missing Values.

Workpackage 11 Imputation and Non-Response. Deliverable 11.2

Dealing with missing data: Key assumptions and methods for applied analysis

Foundation of Quantitative Data Analysis

Introduction. Chapter Before you start Formulation

Introduction to Quantitative Methods

A Review of Methods. for Dealing with Missing Data. Angela L. Cool. Texas A&M University

How to choose an analysis to handle missing data in longitudinal observational studies

WHAT IS A JOURNAL CLUB?

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA)

Assignments Analysis of Longitudinal data: a multilevel approach

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Lesson 3: Calculating Conditional Probabilities and Evaluating Independence Using Two-Way Tables

Comparison of Imputation Methods in the Survey of Income and Program Participation

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

An Analysis of Four Missing Data Treatment Methods for Supervised Learning

A Guide to Imputing Missing Data with Stata Revision: 1.4

Appendix I: Methodology

Transcription:

Data Cleaning and Missing Data Analysis Dan Merson vagabond@psu.edu India McHale imm120@psu.edu April 13, 2010

Overview Introduction to SACS What do we mean by Data Cleaning and why do we do it? The SACS data cleaning procedure The importance of addressing missing data Types of missingness Why we care Missing data analysis procedure What do we do when data are missing? An introduction to weighting

Student-Athlete Climate Study National, anonymous, population survey of student-athletes' perceptions and experiences pertaining to campus climate and its impact on student-athletes athletes : Academic success Athletic success Athletic identity Almost 9,000 student-athletes from all three divisions and all NCAA sports Focuses on the voices of those who have traditionally not been heard SACS

What do we mean by Data Cleaning and why do we do it? Find and eliminate data entry and other errors Examine missing data and perhaps account for it in some way (statistically) Prepare the data for analysis Data Cleaning

General data cleaning process 1. Define and determine error types 2. Search for and identify error instances 3. Correct errors 4. Document error types and instances 5. Modify data entry process to reduce future errors Data Cleaning

The SACS data cleaning procedure 1. Check for and delete duplicate data entries (use SPSS Identify Duplicate Cases procedure or Data Preparation module). 2. Perform descriptive statistics to see if the data make sense. (e.g., Do the max and min values fall within the question s s expected range? Does the mean make sense for that question?) 1. Frequency tables for categorical variables 2. Histograms for Likert/ordinal/continuous variables 3. Scatterplots for write-in in continuous variables (e.g. age) to look for outliers and nonsense values Use frequency tables & histograms to examine the normality of distributions, the range, and extreme skew and kurtosis. Data Cleaning

The SACS data cleaning procedure 3. Add to the list of new variables created for analysis purposes, including dummy variables (Appendix A). Also consider combining variables. There may be sets of questions that could be combined into a metric beyond the factor scales already identified (Appendix B). 4. Make sure dichotomous variables are coded with 1=the category of interest and 0=not. 5. Fix variable & value labels (e.g., typos, wrong scale). 6. Make sure variables are of the appropriate type (numerical/string) and variable measures are of the right level (SPSS language: scale/ordinal/nominal). Data Cleaning

The SACS data cleaning procedure 7. Examine write-in in text for Other responses and recode into already-existing existing categories. 8. Recode Institution write-in in values to IPEDS institutional codes (ID variable for merging institutional data). 9. Clean qualitative answers, but retain the respondent s voice. Edit the punctuation and grammar, but don t change the meaning of the response. 10. Code missing data properly (skipped due to skip pattern = 999999, unanswered for any other reason (system missing) = sysmis). Check any items where SRC coded 0=No Response and bring to the group for discussion. Data Cleaning

The SACS data cleaning procedure 11. Reverse recode if necessary to ensure the conceptual consistency of items within a group of questions. Reverse value labels as well. Also run correlations on the set of questions. Negative correlations are further evidence that reverse coding may be necessary. To reverse code, use the following equation: New Value = (High Value + 1) - Original Value For example, if the High Value for a metric is 5, we obtain the New (transformed) Value by subtracting each Original Value from 6 (i.e., 5 + 1). Data Cleaning

The SACS data cleaning procedure 12. Create placeholder variables that will help us find variables when using point and click and will make the dataset more readable use CAPS for the placeholder variable NAMES and LABELS. 13. Perform descriptive statistics again. Make sure everything still makes sense (e.g., check to make sure the dummy variables add up to the number of cases). 14. Perform a missing data analysis to determine survey fatigue and if there is a pattern to the missing data. Follow the procedure outlined in Missing Data Analysis Procedure.doc. Data Cleaning

The importance of addressing missing data Why are there missing data? Fatigue Sensitivity Lack of knowledge Not applicable Data processing errors Programming errors Missing Data

Types of Missingness Missing Completely at Random (MCAR) There is no pattern the the cases with missing data are indistinguishable from those that are complete. Awesome! Missing at Random (MAR) Only vary depending on other observable variables (e.g. income and parental education). Want key IVs to be MAR with respect to the DV. Missing not at Random (MNAR) Not random, but missingness cannot be predicted by variables in the model Not ignorable Missing Data

Why do we care? Independent Variables (IVs) Sample size and statistical power Bias Data no longer be representative Dependent Variables (DVs( DVs) Sample size and statistical power Generalizability We must limit our conclusions to the cases which do not have missing data and, therefore, limit our generalization to populations which share the same qualities as those cases. Missing Data

Missing Data Analysis Procedures 1. Check for drop out/fatigue. 2. Create dummy variables representing cases that are missing data. 3. Test to see if the missing data are biased or if they are randomly distributed along each of the other IVs and DVs of interest. Chi Square test for categorical variables T-test for continuous variables Little s Chi Square Test for MCAR 4. SPSS Data Preparation and Missing Values Analysis modules Missing Data

What do we do when Data are Missing? If the Data are MCAR Listwise (casewise)) deletion: uses only complete cases Pairwise deletion: uses all available cases (different cases, different Ns) Listwise is preferred over pairwise when sample size is large in relation to the number of cases that have missing data ( 5%). Missing Data

What do we do when Data are Missing? Methods of Missing Data Replacement Mean substitution: substitute mean value computed from available cases Regression methods: predict value based on regression equation with other variables as predictors (assumes MAR) Missing Data

What do we do when Data are Missing? Hot Deck: identify the most similar case to the case with missing data and impute the value Approximate Bayesian Bootstrap (ABB): use logistic regression to estimate the probability of response/non-response on the dependent variable. Missing Data

What do we do when Data are Missing? Maximum Likelihood Estimation (MLE): use all available data to generate maximum likelihood-based statistics. EM in SPSS. (Assumes MAR) Multiple Imputation: combines methods of MLE to produce multiple data sets with imputed values for missing cases Most recent method. Good because the variance/covariance structure s s preserved. More computationally demanding Missing Data

Weighting A value assigned to each case in the data file. Normally used to make the statistics computed from the data more representative of the population. The value indicates how much each case will count in a frequency distribution, cross-tab, or other descriptive statistics of a central tendency (i.e., percent, mean) A weight of 2 would mean that the case would count as if it were two identical cases. A weight of 1 would mean that a case only counts as one case (unweighted data) Weights are often fractions, but always positive and non-zero. Only one weight per case can be used. If we weight for different characteristics, these weights must be combined together into one weight. Weighting

Common Types of Weighting Design Weights Normally used to compensate for over- or under-sampling of specific cases. Used to present statistics that are representative of the population. Post-Stratification (Non-response) Weights Used to compensate for that fact that persons with certain characteristics are not as likely to respond to the survey. Weighting

Calculating Weights Design Weights If we know the sampling fraction (ratio of sampling size to population size) for each case, the weight is merely the inverse of the sampling fraction. Post-Stratification (Non-response) Weights Base on population estimates. Weighting

Calculating Weights When weighting on multiple characteristics we must use a more complicated procedure that I haven t t learned yet. Weight each separately but sequentially Use a huge X row by Y column table Manual iteration Automatic iteration (Ranking) Logistic regression Weighting

Problems with Weights Standard errors are more difficult to evaluate. The process can be difficult, with little clear advice. Creating practical weights requires arbitrary choices about inclusion of weighting factors and interactions, pooling of weighting cells, and truncation of weights. Weighting

Reference List SPSS Missing Values 17.0. (2007). Chicago, IL: SPSS Inc. Allison, P. D. (2001). Missing Data.. Thousand Oaks, CA: SAGE Publications. Chapman, A. D. (2005). Principles and Methods of Data Cleaning.. Copenhagen, Denmark: Global Biodiversity Information Facility. Garson, G. D. (2009, January 14). Sampling. Statnotes: : Topics in Multivariate Analysis Retrieved April 13, 2010, from http://faculty.chass.ncsu.edu/garson/pa765/sampling.htm son/pa765/sampling.htm Garson, G. D. (2009, November 18). Data Imputation for Missing Values. V Statnotes: : Topics in Multivariate Analysis Retrieved January 22, 2010, from http://faculty.chass.ncsu.edu/garson/pa765/missing.htm Gelman,, A. (2007). Struggles with survey weighting and regression modeling. Statistical Science, 22(2), 153-164. 164. doi: : 10.1214/088342306000000691 Little, R. J. A. (1988). A Test of Missing Completely at Random for Multivariate Data with Missing Values. Journal of the American Statistical Association, 83(404). doi: 10.2307/2290157 Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data.. New York: John Wiley & Sons. Maletic,, J. I., & Marcus, A. (2000, October 20-22). 22). Data cleansing: Beyond integrity analysis. Paper presented at the Conference on Information Quality (IQ2000), Boston, MA. References