Missing Data. Katyn & Elena

What to Do with Missing Data
The standard approach is complete-case analysis (listwise deletion), i.e., delete cases with missing data so that only complete cases are left.
Two other popular options:
  Multiple imputation
  Full information maximum likelihood

The Way Data Are Missing Matters
MCAR: missing completely at random.
  In this case, listwise deletion doesn't create bias.
MAR: missing at random.
  The probability that a value is missing depends only on available (observed) information.
  If everything that matters for missingness is in the model, there is no bias.
MNAR: missing not at random (this is a problem).
  E.g., people with higher incomes are less likely to reveal their income because they feel self-conscious; or people with college degrees don't reveal their income, and level of education is itself missing.
Note: we can rarely tell whether data are MAR or MNAR. Imputation methods assume MAR.
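The three mechanisms can be made concrete with a small simulation. This is a minimal sketch on made-up data: the variables `edu` and `income`, the logistic missingness probabilities, and all coefficients are assumptions chosen for illustration, not values from the slides.

```python
import math
import random
from statistics import mean

random.seed(0)
n = 20000
# Hypothetical population: years of education, and log-income that rises with education.
edu = [random.gauss(14, 2) for _ in range(n)]
income = [10 + 0.1 * (e - 14) + random.gauss(0, 0.5) for e in edu]

def logistic(z):
    return 1 / (1 + math.exp(-z))

def observed(p_missing):
    """Keep each (edu, income) pair with probability 1 - p_missing."""
    return [(e, y) for e, y, p in zip(edu, income, p_missing) if random.random() >= p]

def slope(pairs):
    """Least-squares slope of income on education."""
    mx = mean(e for e, _ in pairs)
    my = mean(y for _, y in pairs)
    sxy = sum((e - mx) * (y - my) for e, y in pairs)
    sxx = sum((e - mx) ** 2 for e, _ in pairs)
    return sxy / sxx

mcar = observed([0.3] * n)                                 # unrelated to any variable
mar = observed([logistic(e - 14) for e in edu])            # depends only on observed education
mnar = observed([logistic(4 * (y - 10)) for y in income])  # depends on income itself

true_mean = mean(income)
mcar_mean = mean(y for _, y in mcar)
mar_mean = mean(y for _, y in mar)
mnar_mean = mean(y for _, y in mnar)
mar_slope = slope(mar)  # the model conditions on education, the driver of missingness

print(f"mean income: true {true_mean:.3f}, MCAR {mcar_mean:.3f}, "
      f"MAR {mar_mean:.3f}, MNAR {mnar_mean:.3f}")
print(f"slope of income on education under MAR: {mar_slope:.3f} (simulated with 0.1)")
```

In this setup the complete-case mean is roughly unbiased under MCAR and too low under both MAR and MNAR, while under MAR the regression of income on education still recovers the slope, because education is the only thing that drives missingness and it is in the model.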

Problems with Complete-Case Analysis
Can lead to bias if observations with missing values differ systematically from complete cases.
Can result in a small sample and, as a result, larger standard errors.
Could reweight to make the complete-case sample representative.
  But survey weighting is a mess (Gelman, 2007, "Struggles with Survey Weighting and Regression Modeling").

Bad Imputation Strategies
Need to fill in the missing values. But how?
What about just replacing them with the mean value?
  Distorts the distribution of the variable, and distorts relationships between variables (correlations will be pulled towards zero).
How about including an indicator variable for missingness (and replacing missing values with 0 or the mean)?
  Leads to biased coefficients for other predictors in the model, because it forces the slope to be the same across both missing-data groups.
  Could add interactions, but then the estimates will be similar to those from complete-case analysis.
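The distortion from mean imputation is easy to see numerically. The following is a minimal sketch on simulated data; the 0.8 correlation and the 40% MCAR missingness rate are arbitrary assumptions.

```python
import random
from statistics import mean, pstdev

random.seed(1)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.8 * xi + random.gauss(0, 0.6) for xi in x]  # simulated so that corr(x, y) is about 0.8

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)
    return cov / (pstdev(a) * pstdev(b))

# Delete 40% of y completely at random, then fill every gap with the observed mean.
missing = [random.random() < 0.4 for _ in range(n)]
obs_mean = mean(yi for yi, m in zip(y, missing) if not m)
y_imp = [obs_mean if m else yi for yi, m in zip(y, missing)]

sd_full, sd_imp = pstdev(y), pstdev(y_imp)
r_full, r_imp = corr(x, y), corr(x, y_imp)
print(f"sd of y:    full {sd_full:.3f}, mean-imputed {sd_imp:.3f}")
print(f"corr(x, y): full {r_full:.3f}, mean-imputed {r_imp:.3f}")
```

Both the spread of `y` and its correlation with `x` shrink after imputation, even though the data here are MCAR.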

Better Imputation Strategies
Could just generate random x values from the observed distribution of x values.
But it is better to use information from other variables if available.
  Fit a regression predicting the x variable from the other variables.
  Fill in missing values with predicted values from the regression.
  Predicted values will be less variable than the original data.
  Can add uncertainty back in by adding the prediction error from the regression.
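A minimal sketch of the two regression variants, deterministic (predicted value only) and stochastic (predicted value plus a draw of the residual error). The data, the single predictor `z`, and the coefficients are assumptions for illustration; a fuller treatment would also reflect uncertainty in the fitted coefficients, which this sketch ignores.

```python
import random
from statistics import mean, pstdev

random.seed(2)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]             # fully observed predictor (hypothetical)
x = [1 + 0.8 * zi + random.gauss(0, 0.6) for zi in z]  # variable to be imputed
missing = [random.random() < 0.4 for _ in range(n)]    # 40% of x treated as missing

# Fit x ~ z by least squares on the complete cases only.
zc = [zi for zi, m in zip(z, missing) if not m]
xc = [xi for xi, m in zip(x, missing) if not m]
mz, mx = mean(zc), mean(xc)
b = sum((zi - mz) * (xi - mx) for zi, xi in zip(zc, xc)) / sum((zi - mz) ** 2 for zi in zc)
a = mx - b * mz
resid_sd = pstdev([xi - (a + b * zi) for zi, xi in zip(zc, xc)])

# Deterministic: fill gaps with the fitted value.
x_det = [a + b * zi if m else xi for zi, xi, m in zip(z, x, missing)]
# Stochastic: fitted value plus a random draw with the residual standard deviation.
x_sto = [a + b * zi + random.gauss(0, resid_sd) if m else xi
         for zi, xi, m in zip(z, x, missing)]

sd_true, sd_det, sd_sto = pstdev(x), pstdev(x_det), pstdev(x_sto)
print(f"sd of x: true {sd_true:.3f}, deterministic {sd_det:.3f}, stochastic {sd_sto:.3f}")
```

The deterministic fill understates the variability of `x`; adding the residual draw brings the spread back close to that of the full data.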

[Figure, two panels: (1) just using predicted values to fill in missing values; (2) adding in regression error.]
Figure from http://lane.compbio.cmu.edu/courses/gelmanmissing.pdf

What to Include in the Regression
Include any variables you think will make a better prediction.
For example, in predicting income, maybe you have information on whether the respondent received income support from disability payments or welfare. Put it in the regression.
The goal is not causal inference; it is accurate prediction.

Other Methods
Matching: for each unit with a missing value of y, find a unit with similar X values and take its y value.
  Also called hot-deck imputation.
  Can be combined with regression, where similarity is defined as closeness in the predicted value from the regression.
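Matching on the regression's predicted value can be sketched as follows. The data and the helper names `predict` and `hot_deck` are hypothetical. With a single predictor, closeness in predicted value is the same as closeness in x; the two differ once several predictors are combined into one prediction.

```python
import random
from statistics import mean

random.seed(3)
n = 500
x = [random.uniform(0, 10) for _ in range(n)]
y_full = [2 + 1.5 * xi + random.gauss(0, 1) for xi in x]
y = [None if random.random() < 0.3 else yi for yi in y_full]  # about 30% of y missing

# Donors are the units with y observed; fit y ~ x on them by least squares.
donors = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mx = mean(d[0] for d in donors)
my = mean(d[1] for d in donors)
b = sum((xi - mx) * (yi - my) for xi, yi in donors) / sum((xi - mx) ** 2 for xi, _ in donors)
a = my - b * mx

def predict(xi):
    return a + b * xi

def hot_deck(xi):
    """Return the observed y of the donor whose predicted value is closest."""
    target = predict(xi)
    _, donor_y = min(donors, key=lambda d: abs(predict(d[0]) - target))
    return donor_y

y_imputed = [hot_deck(xi) if yi is None else yi for xi, yi in zip(x, y)]
print(sum(yi is None for yi in y), "values imputed from", len(donors), "donors")
```

Every imputed value is a y that was actually observed on some donor, so imputations cannot fall outside the range of the real data.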

Multiple Imputation
Often we are missing data for several variables in the analysis.
Two approaches:
  Multivariate imputation (MVN): fit a multivariate model to all the variables that have missing values.
  Iterated Chained Equations (ICE): apply univariate methods iteratively.

MVN
Assume a multivariate distribution for all imputation variables, and impute missing values as draws from the posterior predictive distribution of the missing data, given the observed data.
Use MCMC methods to approximate the distribution and draw imputed values.
Often assume multivariate normal (MVN).

ICE
1. Fill in missing values with random values from the distribution of each variable.
2. Regress variable 1 on all other variables (which now have complete data).
3. Fill in missing values in variable 1 with the observed value that most closely matches the predicted value plus noise.
Perform steps 2 and 3 for all variables, continuing until the missing values converge.
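The loop above can be sketched for two variables. This is a simplified illustration, not Stata's ICE implementation: the data are simulated, step 3 uses the predicted value plus a residual draw rather than matching to the closest observed value, and a fixed number of cycles stands in for a convergence check.

```python
import random
from statistics import mean, pstdev

random.seed(4)
n = 2000
a_true = [random.gauss(0, 1) for _ in range(n)]
b_true = [0.7 * ai + random.gauss(0, 0.7) for ai in a_true]
miss_a = [random.random() < 0.25 for _ in range(n)]  # True where a is missing
miss_b = [random.random() < 0.25 for _ in range(n)]  # True where b is missing

def fit(xs, ys):
    """Simple least-squares fit: returns intercept, slope, residual sd."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    resid_sd = pstdev([y - (intercept + slope * x) for x, y in zip(xs, ys)])
    return intercept, slope, resid_sd

def corr(u, v):
    mu, mv = mean(u), mean(v)
    cov = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)
    return cov / (pstdev(u) * pstdev(v))

def initial_fill(values, miss):
    """Step 1: replace each missing entry with a random observed value."""
    observed = [v for v, m in zip(values, miss) if not m]
    return [random.choice(observed) if m else v for v, m in zip(values, miss)]

a = initial_fill(a_true, miss_a)
b = initial_fill(b_true, miss_b)

# Steps 2-3, cycled over both variables.
for _ in range(20):
    for target, miss, predictor in ((a, miss_a, b), (b, miss_b, a)):
        # Fit on rows where the target was actually observed.
        xs = [p for p, m in zip(predictor, miss) if not m]
        ys = [t for t, m in zip(target, miss) if not m]
        intercept, slope, resid_sd = fit(xs, ys)
        # Re-impute the target's missing entries: prediction plus noise.
        for i, m in enumerate(miss):
            if m:
                target[i] = intercept + slope * predictor[i] + random.gauss(0, resid_sd)

r_true, r_ice = corr(a_true, b_true), corr(a, b)
print(f"corr(a, b): full data {r_true:.3f}, after chained imputation {r_ice:.3f}")
```

Because the imputations include random noise, individual imputed values never settle to fixed numbers; what stabilizes across cycles is their distribution, which is why summaries such as means or correlations are the things to monitor.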

Which One?
MVN makes an assumption about the joint distribution of all the variables.
ICE doesn't assume this, and it's also possible to tailor each regression model appropriately (logistic for a binary variable, etc.), but you have to specify each model correctly.
It may not make a difference, though.
  Lee & Carlin (2010) simulate data, induce missing data using different mechanisms, and then use Stata MVN and Stata ICE.
  They find that both gave similar results (and both were less biased than complete-case analysis).

Evaluating Imputations: Trace Plots
Check that there are no systematic trends.

Multilevel Imputation
Have data on students (test scores, demographics).
Have data on schools (public vs. private).
Best to separate into two data sets and then use the results from one in the other (possibly back and forth).
  So, first impute individual-level variables using individual-level data and observed group-level measurements.
  Then, at the group level, include aggregated forms of the individual-level measurements when imputing missing data at that level.
Maybe choose the order based on what you care about? It is not clear what the best way to do this is.

Inference with Multiple Imputation
There is uncertainty about our imputation model that needs to be accounted for in our analysis.
Create multiple complete datasets using different imputed values, and run the analysis on each dataset.
The final estimate is the average of the coefficients across the m datasets.
Its variance reflects both the variance within each imputed dataset and the variance between datasets.
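A minimal sketch of the pooling step, assuming the standard combining rules (Rubin's rules): the pooled estimate is the mean of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The five coefficient estimates and standard errors below are made-up numbers, and the function name `pool` is hypothetical.

```python
from statistics import mean, variance

def pool(estimates, std_errors):
    """Combine one coefficient across m imputed datasets."""
    m = len(estimates)
    q_bar = mean(estimates)                     # pooled point estimate
    within = mean(se ** 2 for se in std_errors) # average squared standard error
    between = variance(estimates)               # sample variance across imputations (divides by m - 1)
    total = within + (1 + 1 / m) * between      # total variance of the pooled estimate
    return q_bar, within, between, total

# Hypothetical results for one coefficient from m = 5 imputed datasets.
estimates = [0.52, 0.48, 0.55, 0.50, 0.45]
std_errors = [0.10, 0.11, 0.10, 0.12, 0.10]

q_bar, within, between, total = pool(estimates, std_errors)
print(f"pooled estimate {q_bar:.3f}, pooled standard error {total ** 0.5:.3f}")
```

The between-imputation term is what a single imputed dataset cannot provide: it measures how much the answer moves when the missing values are filled in differently.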