Missing Data: Patterns, Mechanisms & Prevention. Edith de Leeuw



Similar documents
Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random

A REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA

Missing Data & How to Deal: An overview of missing data. Melissa Humphries Population Research Center

A Basic Introduction to Missing Data

Dealing with Missing Data

Data Cleaning and Missing Data Analysis

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

IBM SPSS Missing Values 22

Analyzing Structural Equation Models With Missing Data

Handling attrition and non-response in longitudinal data

Data analysis process

Visualization of missing values using the R-package VIM

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

IBM SPSS Missing Values 20

Multiple Imputation for Missing Data: A Cautionary Tale

Psy 210 Conference Poster on Sex Differences in Car Accidents 10 Marks

SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES

A full analysis example Multiple correlations Partial correlations

Working with SPSS. A Step-by-Step Guide For Prof PJ s ComS 171 students

SPSS Resources. 1. See website (readings) for SPSS tutorial & Stats handout

Problem of Missing Data

Imputing Attendance Data in a Longitudinal Multilevel Panel Data Set

Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out

An introduction to using Microsoft Excel for quantitative data analysis

Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

Best Practices for Missing Data Management in Counseling Psychology

Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis

Imputation and Analysis. Peter Fayers

2. Making example missing-value datasets: MCAR, MAR, and MNAR

Projects Involving Statistics (& SPSS)

7. Comparing Means Using t-tests.

Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University

MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS)

January 26, 2009 The Faculty Center for Teaching and Learning

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

Simple Linear Regression Inference

Linear Models in STATA and ANOVA

Handling missing data in Stata a whirlwind tour

A Review of Methods for Missing Data

Additional sources Compilation of sources:

Survey Process White Paper Series 20 Pitfalls to Avoid When Conducting Marketing Research

Chapter 13 Introduction to Linear Regression and Correlation Analysis

Directions for using SPSS

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES.

Introduction to Quantitative Methods

SPSS Tests for Versions 9 to 13

Introduction to Multilevel Modeling Using HLM 6. By ATS Statistical Consulting Group

The Data Analysis of Primary and Secondary School Teachers Educational Technology Training Result in Baotou City of Inner Mongolia Autonomous Region

Independent t- Test (Comparing Two Means)

Chapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS

Assignments Analysis of Longitudinal data: a multilevel approach

Mathematics within the Psychology Curriculum

The Dummy s Guide to Data Analysis Using SPSS

Marketing Research Core Body Knowledge (MRCBOK ) Learning Objectives

SPSS Guide How-to, Tips, Tricks & Statistical Techniques

Psych. Research 1 Guide to SPSS 11.0

Missing data: the hidden problem

Module 14: Missing Data Stata Practical

An introduction to IBM SPSS Statistics

Main Effects and Interactions

Missing Data. Paul D. Allison INTRODUCTION

What is Data Analysis. Kerala School of MathematicsCourse in Statistics for Scientis. Introduction to Data Analysis. Steps in a Statistical Study

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Missing Data Part 1: Overview, Traditional Methods Page 1

Factor Analysis. Principal components factor analysis. Use of extracted factors in multivariate dependency models

5. Correlation. Open HeightWeight.sav. Take a moment to review the data file.

Running head: SCHOOL COMPUTER USE AND ACADEMIC PERFORMANCE. Using the U.S. PISA results to investigate the relationship between

Chapter 2 Probability Topics SPSS T tests

How to choose an analysis to handle missing data in longitudinal observational studies

Electronic Thesis and Dissertations UCLA

Psyc 250 Statistics & Experimental Design. Correlation Exercise

Module 2: Introduction to Quantitative Data Analysis

Inferential Statistics. What are they? When would you use them?

An introduction to modern missing data analyses

Analysis of Longitudinal Data with Missing Values.

Introduction Course in SPSS - Evening 1

T-test & factor analysis

IBM SPSS Statistics 20 Part 1: Descriptive Statistics

Imputation of missing network data: Some simple procedures

Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear.

A Review of Methods. for Dealing with Missing Data. Angela L. Cool. Texas A&M University

Dealing with Missing Data

Crosstabulation & Chi Square

Examining Differences (Comparing Groups) using SPSS Inferential statistics (Part I) Dwayne Devonish

Imputing Missing Data using SAS

10. Analysis of Longitudinal Studies Repeat-measures analysis

Profile analysis is the multivariate equivalent of repeated measures or mixed ANOVA. Profile analysis is most commonly used in two cases:

Exploratory Data Analysis

SPSS Step-by-Step Tutorial: Part 2

Introduction to Analysis of Variance (ANOVA) Limitations of the t-test

Transcription:

Missing Data: Patterns, Mechanisms & Prevention Edith de Leeuw Thema middag Nonresponse en Missing Data, Universiteit Groningen, 30 Maart 2006

Item-Nonresponse Pattern General pattern: various variables missing Var 1 p??? = missing Case 1 n??? Copyright De Leeuw 2

Why a Problem? Gaps in data matrix Loss of information Bad image (quality criterion) Ignoring (deletion of missing cases) has problems: Analyses are performed on different (sub) data sets Analyses can be inconsistent with each other Difficult to present results consistently over analyses Potential for bias Strong assumption (MCAR) Copyright De Leeuw 3

Why a Problem continued Potential for biased results Univariate analysis and (general) low itemnonresponse: bias is generally small Multivariate analysis, even with low item nonresponse for each question, cumulates to a substantial proportion of records that are missing So: do something Simply ignoring (standard option in SPSS and other packages) not wise Copyright De Leeuw 4

Important Distinctions Missing Completely At Random (MCAR) Missing values random sample of all values Missing At Random (MAR) Missing values random sample of all values within classes defined by covariates (conditional) Not Missing At Random (NMAR) Missingness is related to unobserved (missing) value (Little & Rubin, 1987, p14) Copyright De Leeuw 5

A Silly Example Hoytink, 2004 Copyright De Leeuw 6

Illustration: Survey Research Interviewer overlooks a question by accident Turns two pages in one Elderly person has difficulty remembering event Participant refuses to answer Copyright De Leeuw 7

Sources Item-Nonresponse Researcher (by design) Interviewer Respondent Questionnaire Method of Data Collection Interaction between sources, e.g, respondent and questionnaire Copyright De Leeuw 8

What Can Be Done Missing by Design Special analyses (e.g., multi-level analysis) Partial Non-Response (e.g., break-of) Prevent Adjust: Delete cases and treat as unit-nonresponse (weighting) Keep cases and impute missing answers Item Non-Response Prevent Adjust (impute!) Copyright De Leeuw 9

What is Known Respondents: Age and Education Interviewer: Training and Supervision Topic: Sensitive Questions Questionnaire: Lay-out, Do-not-know category, Number of categories, graphical tools Mode: SAQ, CAI Copyright De Leeuw 10

Mechanisms I: Interviewer Interviewer fails to: Ask question Record answer Record answer correctly In post-interview editing this will often be coded as missing Fails to probe (ask again) Causes of failure: Mistakes (e.g., wrong routing) Purpose, cheating (e.g., fast interview, not wanting to go to much trouble) Copyright De Leeuw 11

Prevention I: Interviewer Mistakes: Train interviewers in correct procedures Give instruction about the questionnaire Avoid mistakes by: Ergonomic lay-out questionnaire or interviewer schedule (e.g., far less chance of skipping, routing errors, etc) Use of computer-assisted interviewing (e.g., no routing errors, range checks ) Cheating: Stricter supervision CAI Copyright De Leeuw 12

Mechanisms II: Respondent Respondent Skips question by mistake Refuses to answer Not able to provide (correct) answer Causes: Badly designed self-administered questionnaire (mistake) Sensitive question (refusal) A problem in the total question-answer process (not able to provide, e.g. memory in retrospective questions) Copyright De Leeuw 13

Prevention II: Respondent Write good questions and test them: Comprehension question & answer categories Inclusion of all relevant answer categories Avoid mistakes (cf. Interviewer mistakes) Provide help (good instructions, etc) Ergonomic lay-out questionnaire CSAQ Pretest! Special formats Sensitive questions Retrospective questions Copyright De Leeuw 14

Mechanisms and Prevention III: The Questionnaire Good questionnaire helps to avoid mistakes of interviewer and/or respondent Question should be understood, categories should fit and be exhaustive (keep questions understandable) Pretest this: Expert reviews, cognitive interviewing,etc Lay-out should be clear and guide from question to question Use graphical language consistently SAQ, such as web/internet questionnaire Copyright De Leeuw 15

Prevent and then Adjust: Why Adjust? Remember: respondent age and education consistently correlate with item-nonresponse: NOT MCAR, So standard solution (pairwise/listwise) not adequate Use age & education in model Impute missing data to get a complete data -set All analyses are on ONE data-set Consistent with each other Retain all data Copyright De Leeuw 16

Two Phases In Adjustment Phase I: Diagnosis, Inspect Patterns of Missingness Suggest processes Suggest solutions Phase II: Cure, Adjust for Missing Use what you know from phase 1 Use any available information you have Plan for nonresponse Copyright De Leeuw 17

Patterns of Item Nonresponse Various variables missing (Missing =?) Var 1 p? Unsystematic (MCAR)??? MAR? NMAR or or Case 1 n??? Copyright De Leeuw 18

Tools for Exploring Missingness Descriptive statistics Graphical representations View data matrix on screen/special plots Statistical tests Usual tests for MCAR Software: SPSS-MVA module Dedicated Programs for Missing Data e.g. SOLAS, NORM Some DIY-tricks using SPSS or any other program Copyright De Leeuw 19

Simple Procedures DIY Recode all variables into new variables with values: 1 = missing, 0 = observed These variables are missingness indicators Use your favorite standard program and do simple tests like SPSS MVA does Descriptives on the recoded variables (missingness indicators) Cross-tabulation missingness indicator with (substantive) categorical variables T-tests with (substantive) interval variables Copyright De Leeuw 20

DIY-MVA and MORE Use new variables (missingness indicators) Use favorite standard program Examples SPSS Explore Graphs Boxplot with missingness indicator on category axis Correlations between missingness indicators PCA Correlations substantive vars with indicators Pairwise deletion! Why?. Copyright De Leeuw 21

Example MSCOHORT.SAV Data set from educational research Order of variables: idnr, father education (fatheduc), father occupation (fathocc), sex, iqlo, iqpm, iqws, education (educ), occupation (occup) Note 1: iglo,iqpm,iqws are three IQ-tests Note 2: 2 variables measure father of pupil rest of variables measure pupil! Note 3: Missing data are indicated by missing value 999 Copyright De Leeuw 22

Step 1: Make Indicator Variables value 1 if missing, 0 if not! RECODE fatheduc (MISSING=1) (ELSE=0) INTO misfe RECODE fathocc (MISSING=1) (ELSE=0) INTO misfo RECODE sex (MISSING=1) (ELSE=0) INTO missex RECODE iqlo (MISSING=1) (ELSE=0) INTO misiqlo RECODE iqpm (MISSING=1) (ELSE=0) INTO misiqpm RECODE. Copyright De Leeuw 23

SPSS Recode Into Different Variable Copyright De Leeuw 24

Step 2: SPSS Descriptives Descriptive Statistics MISFE MISFO MISSEX MISIQLO MISIQPM MISIQWS MISEDUC MISOCC Valid N (listwise) N Minimum Maximum Mean 5690.00 1.00.4374.4961 5690.00 1.00.1065.3085 5690.00 1.00 6.854E-03 8.251E-02 5690.00 1.00 8.471E-02.2785 5690.00 1.00.1230.3285 5690.00 1.00.1253.3311 5690.00 1.00.5557.4969 5690.00 1.00.5891.4920 5690 Std. Deviation Copyright De Leeuw 25

Step 3 Test MCAR How about gender?: Crosstabs MISOCC * SEX Crosstabulation MISOCC.00 1.00 Count % within MISOCC % within SEX Adjusted Residual Count % within MISOCC % within SEX Adjusted Residual SEX 0 1 Total 1586 751 2337 67.9% 32.1% 100.0% 54.0% 27.7% 41.4% 20.1-20.1 1352 1962 3314 40.8% 59.2% 100.0% 46.0% 72.3% 58.6% -20.1 20.1 chi 2 : 4003 df = 1 p=.00 Phi= 0.27 Total Count % within MISOCC % within SEX Adjusted Residual 2938 2713 5651 52.0% 48.0% 100.0% 100.0% 100.0% 100.0% 1=f 0= m Misoc=missing on occupation Is this MCAR? Copyright De Leeuw 26

Step 4 Test MCAR continued How about IQ?: T-test Group Statistics (P=.00) IQLO MISOCC.00 1.00 Std. Std. Error N Mean Deviation Mean 2091 102.21 14.29.31 3117 97.98 13.87.25 Misoc=missing on occupation Is this MCAR? 1=missing 0= data available Copyright De Leeuw 27

Boxplot of IQ-score grouped by Missingness indicator Occupation 160 140 3826 3827 3633 2076 4087 3806 2936 1640 4993 1647 5079 1083 3544 1999 59 4372 5092 2198 2078 2065 967 5450 3923 48 4970 5191 4960 4923 120 100 80 iq-score 60 N = 2091 3117.00 1.00 MISOCC Copyright De Leeuw 28

Patterns in Missingness 1: Correlations between Missingness Indicators (ignore significance) MISFE MISFO MISSEX MISIQLO MISIQPM MISIQWS MISEDUC MISOCC MISFE Pearson Correlation 1.000.220**.038**.036** -.017 -.022.164**.189** Sig. (2-tailed)..000.004.007.187.092.000.000 N 5690 5690 5690 5690 5690 5690 5690 5690 MISFO Pearson Correlation.220** 1.000.061**.110**.072**.071**.027*.017 Sig. (2-tailed).000..000.000.000.000.044.190 N 5690 5690 5690 5690 5690 5690 5690 5690 MISSEX Pearson Correlation.038**.061** 1.000.036**.021.014.070**.065** Sig. (2-tailed).004.000..007.117.305.000.000 N 5690 5690 5690 5690 5690 5690 5690 5690 MISIQLO Pearson Correlation.036**.110**.036** 1.000.614**.609** -.054** -.063** Sig. (2-tailed).007.000.007..000.000.000.000 N 5690 5690 5690 5690 5690 5690 5690 5690 MISIQPM Pearson Correlation -.017.072**.021.614** 1.000.970** -.078** -.093** Sig. (2-tailed).187.000.117.000..000.000.000 N 5690 5690 5690 5690 5690 5690 5690 5690 MISIQWS Pearson Correlation -.022.071**.014.609**.970** 1.000 -.086** -.100** Sig. (2-tailed).092.000.305.000.000..000.000 N 5690 5690 5690 5690 5690 5690 5690 5690 Copyright De Leeuw 29 MISEDUC Pearson Correlation.164**.027*.070** -.054** -.078** -.086** 1.000.898**

Patterns 2: Correlations Missingness Indicators and Substantive Vars (Pairwise Deletion!) FATHEDUC Pearson Correlation MISFE MISFO MISSEX MISIQLO MISIQPM MISIQWS MISEDUC MISOCC..072.036.025.097.094.027.023 FATHOCC Pearson Correlation.063. -.002 -.021 -.037 -.035.018.008 SEX Pearson Correlation.193 -.010. -.074 -.157 -.160.184.267 IQLO Pearson Correlation -.447 -.044 -.023..102.101 -.124 -.146 IQPM Pearson Correlation -.271 -.012 -.034.047. -.016 -.074 -.091 IQWS Pearson Correlation -.415.004 -.004.040 -.019. -.057 -.091 EDUC Pearson Correlation -.245 -.031.047.034.087.081. -.154 OCCUP Pearson Correlation -.192 -.029.032.037.095.091 -.029. Copyright De Leeuw 30

Suggested Readings De Leeuw, E.D., Hox, J., and Huisman, M. (2003). Prevention and treatment of item nonresponse. Journal of Official Statistics, 19, 2, 153-176. Schafer, J.L. and Graham, J.W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177. Copyright De Leeuw 31