Problem of Missing Data



Similar documents
A Basic Introduction to Missing Data

Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Dealing with Missing Data

Review of the Methods for Handling Missing Data in. Longitudinal Data Analysis

Imputing Missing Data using SAS

Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Handling attrition and non-response in longitudinal data

Handling missing data in Stata a whirlwind tour

Missing Data. A Typology Of Missing Data. Missing At Random Or Not Missing At Random

Analyzing Structural Equation Models With Missing Data

2. Making example missing-value datasets: MCAR, MAR, and MNAR

Introduction to mixed model and missing data issues in longitudinal studies

A REVIEW OF CURRENT SOFTWARE FOR HANDLING MISSING DATA

Multiple Imputation for Missing Data: A Cautionary Tale

Missing Data in Longitudinal Studies: To Impute or not to Impute? Robert Platt, PhD McGill University

A Mixed Model Approach for Intent-to-Treat Analysis in Longitudinal Clinical Trials with Missing Values

Imputation and Analysis. Peter Fayers

Missing data: the hidden problem

How to choose an analysis to handle missing data in longitudinal observational studies

Overview. Longitudinal Data Variation and Correlation Different Approaches. Linear Mixed Models Generalized Linear Mixed Models

Workpackage 11 Imputation and Non-Response. Deliverable 11.2

Missing Data & How to Deal: An overview of missing data. Melissa Humphries Population Research Center

Statistical modelling with missing data using multiple imputation. Session 4: Sensitivity Analysis after Multiple Imputation

MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS)

Missing Data Techniques for Structural Equation Modeling

Analysis of Longitudinal Data with Missing Values.

Dealing with Missing Data

Data Cleaning and Missing Data Analysis

Missing Data. Paul D. Allison INTRODUCTION

Missing Data. Katyn & Elena

Bayesian Approaches to Handling Missing Data

Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest

A Review of Methods for Missing Data

Sensitivity Analysis in Multiple Imputation for Missing Data

Missing data in randomized controlled trials (RCTs) can

Dr James Roger. GlaxoSmithKline & London School of Hygiene and Tropical Medicine.

Dealing with missing data: Key assumptions and methods for applied analysis

Missing data are ubiquitous in clinical research.

Re-analysis using Inverse Probability Weighting and Multiple Imputation of Data from the Southampton Women s Survey

PATTERN MIXTURE MODELS FOR MISSING DATA. Mike Kenward. London School of Hygiene and Tropical Medicine. Talk at the University of Turku,

Imputing Attendance Data in a Longitudinal Multilevel Panel Data Set

Missing Data Sensitivity Analysis of a Continuous Endpoint An Example from a Recent Submission

An introduction to modern missing data analyses

A Review of Missing Data Treatment Methods

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

Imputation of missing data under missing not at random assumption & sensitivity analysis

Best Practices for Missing Data Management in Counseling Psychology

Discussion. Seppo Laaksonen Introduction

A Review of Methods. for Dealing with Missing Data. Angela L. Cool. Texas A&M University

Bayesian Statistics in One Hour. Patrick Lam

Comparison of Imputation Methods in the Survey of Income and Program Participation

IBM SPSS Missing Values 22

A Pseudo Nearest-Neighbor Approach for Missing Data Recovery on Gaussian Random Data Sets

APPLIED MISSING DATA ANALYSIS

Applied Missing Data Analysis in the Health Sciences. Statistics in Practice

Using Medical Research Data to Motivate Methodology Development among Undergraduates in SIBS Pittsburgh

Imputation of missing network data: Some simple procedures

Missing Data Part 1: Overview, Traditional Methods Page 1

IBM SPSS Missing Values 20

Missing data and net survival analysis Bernard Rachet

Reject Inference in Credit Scoring. Jie-Men Mok

Analysis of Incomplete Survey Data Multiple Imputation via Bayesian Bootstrap Predictive Mean Matching

Note on the EM Algorithm in Linear Regression Model

A Guide to Imputing Missing Data with Stata Revision: 1.4

AMELIA II: A Program for Missing Data

CHOOSING APPROPRIATE METHODS FOR MISSING DATA IN MEDICAL RESEARCH: A DECISION ALGORITHM ON METHODS FOR MISSING DATA

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Module 14: Missing Data Stata Practical

Missing-data imputation

Introduction to Longitudinal Data Analysis

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Imputing Values to Missing Data

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

Lecture 3: Linear methods for classification

CHAPTER 8 EXAMPLES: MIXTURE MODELING WITH LONGITUDINAL DATA

Assignments Analysis of Longitudinal data: a multilevel approach

Electronic Theses and Dissertations UC Riverside

(Howell, D.C. (2008) The analysis of missing data. In Outhwaite, W. & Turner, S. Handbook of Social Science Methodology. London: Sage.

Implementation of Pattern-Mixture Models Using Standard SAS/STAT Procedures

Multiple Regression: What Is It?

Introduction to Fixed Effects Methods

Transcription:

VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians; Promote good statistical practice; Increase participation & visibility of VA statisticians at national meetings such as the VA HSR&D and ASA Joint Statistical Meetings; Increase participation of statisticians in VA research merit review 1 / 46

Poll Your affiliation and title: Your Research areas: Your level of statistical knowledge: beginner, intermediate, and advanced Topic of future statistical cyberseminar you like to take: Need of statistical help in your research: No, Yes. 2 / 46

Problem of Missing Data Standard statistical methods have been developed to analyze rectangular data sets. The rows are observation units The columns are variables. The entries are values (real numbers). The concern is what happen when some of these values are not observed. Most statistical software creates special codes for the missing values Some statistical software exclude subjects with missing values - complete-case analysis, which will be valid in a very limited cases. 3 / 46

Missing-data Pattern and Mechanism Which method to choose for the analysis of missing data depends on missing-data patterns, as well as the missing-data mechanism. 4 / 46

Univariate Missing Data Pattern A B C 1 88 42 63 2 84 82 12 3 26 59 66 4 7 28 73 5 12 75 2 6 79 41 NA 7 81 84 NA 8 64 19 NA This data set has three variables A, B and C, and the variable C has missing values, denoted by NA. It can be seen that five of the eight subjects in the data set are completely observed, while the values of variable C of the other three subjects are missing. This is the simplest missing data pattern. 5 / 46

Attrition in Longitudinal Studies Attrition (dropout) in longitudinal studies is a very frequent occurrence, especially for long follow-ups or difficult study populations such as children. If in a study, subjects drop out before the end of the study and do not return, then this scenario leads to the so called monotone missing data pattern. 6 / 46

Example on Attrition A multi-clinic observational study on a prospective cohort of primary care patients with clinical depression. The study evaluated depressive symptoms, mental and physical health for 966 clinically depressed persons from 6 large U.S. clinics. These persons responded to several questionnaires that measured their physical and mental health at baseline, 6 weeks, 3 months, and nine months after baseline. At 9 month after baseline each patient was interviewed by a psychiatrist who determined whether the patient still suffered from clinical depression after nine months. 7 / 46

Example on Attrition, continued Unit non-response over time (X: observed, M=missing) Baseline 6 weeks 3 months nine months Freq X X X X 759 X X X M 92 X X M M 27 X M M M 2 8 / 46

File-matching Missing Data Pattern Sometimes two or more variables are collected on units, but are never jointly observed. A B C 45 34 NA 99 6 NA 16 84 NA 79 37 NA 38 95 NA 32 NA 5 48 NA 34 91 NA 25 NA 19 30 NA 20 31 9 / 46

File-matching Missing Data Pattern, Continued It is important to know that for this missing data pattern, parameters relating to the association between these jointly unobservable variables (A,B,C) are not estimable from the data. In the above data, there is no information in the data about the partial associations of B and C given A. In practice, analyses of data with this pattern require to make strong assumptions about these partial associations. 10 / 46

General Missing Data Pattern) A B C D 25 26 88 32 NA 42 66 21 27 86 54 NA 28 NA 92 20 29 20 83 NA 30 89 NA 41 NA NA NA 35 32 NA NA 33 11 / 46

Missing Data Mechanism Understanding the missing data mechanism is important for the analysis of the missing data. Rational of using a method for dealing with a missing data set relies on the missingness mechanism. To explain the missing data mechanisms, we first define a vector of indicator variables R that identify which variables are observed and which are missing. For example, R = (0,1) T indicates that the first variable is missing and the second variable is observed. 12 / 46

Assumption on Missing Data Assumption: Missingness indicators hide true values that are meaningful for analysis. 13 / 46

Example on Death Consider a randomized trial with two treatments, and suppose that a primary outcome of the study is a quality-of-life health score Y measured one year after randomization to treatment. For participants who die within a year of randomization to treatment, Y is not available due to death. 14 / 46

Poll Do you consider patients who died before the end of the study as missing quality-of-life health score? Yes No Not Sure 15 / 46

Example on Death, Continued Do you consider patients who died before the end of the study as missing quality-of-life health score? It does not make sense to treat those outcomes as missing, given that quality of life is meaningless concept for people who not alive. 16 / 46

Example on Non-response in Opinion Polls We are interested in polling individuals about how they will vote in a future referendum, where the available responses are yes, no, or missing. 17 / 46

Poll Do you consider patients who with missing as missing response? Yes No Not Sure 18 / 46

Example, Continued Individuals who fail to respond to the question may be refusing to reveal real answers or may have no interest in voting. Assumption would not apply to individuals who would not vote, and these individuals define a stratum of the population that is not relevant to the outcome of the referendum. Assumption would apply to individuals who do not respond to the initial poll but would vote in the referendum. For these individuals it would make sense to treat their responses on the referendum as missing. 19 / 46

Missing Data Mechanism Since some of the variables Y in the data set may be missing, we partition Y into two parts, Y obs and Y mis, the observed part and the missing part. Below we introduce the three missing data mechanisms, missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). 20 / 46

Missing Completely at Random (MCAR) The data are called MCAR if missingness does not depend on the value of data Y, missing or observed. Statistically, that is f (R Y ) = f (R) for all Y. (1) The consequence of the above relationship means that missingness indicator is conditionally independent of the study variables, so data on the study variables, regardless of missing or observed, have nothing to do with the mechanism that the data are missing. 21 / 46

Example on MCAR Consider an experiment involving random sampling. Under MCAR people with missing values are really just a random sample of the study population. As a result, people with observed data still make up a valid random sample from the study population for the ultimate inference, though with some loss in efficiency because of the smaller sample size. Therefore when data are MCAR, you can ignore the units with missing values and proceed with the complete data analysis, and still make valid inference. 22 / 46

Missing at Random (MAR) MAR is a less restrictive assumption on missing data mechanism than MCAR. Data are called MAR when missingness depends only on the observed components Y obs of Y, but not on the missing components Y mis. Statistically, that is f (R Y ) = f (R Y obs ) for all Y mis. (2) Despite its name, MAR does not mean that the missing data are a simple random sample of all data values, which is the case for MCAR. MAR lessens MCAR in that the missing values can behave like a random sample of all values within subpopulations defined by the observed data rather than the entire data values. 23 / 46

Example on MAR In many VA studies, race is a the key variable but missing substantially. If the reason for missing race is only related to age of a veteran, and if age is observed for all study subjects, the MAR holds. 24 / 46

Missing not at random (MNAR) The mechanism is called missing not at random (MNAR) if the distribution of R depends on the missing values in Y. Strictly speaking, MNAR comes in two forms: (i) missingness depends on the missing value itself or (ii) missingness depends on an unobserved variable. The latter is common in latent variable models or hidden Markov models. 25 / 46

Example on MNAR If the reason for missing race is related to race itself or some unmeasured covariates, the MAR does not hold. 26 / 46

Illustration Example We demonstrate the three mechanisms and their consequences through the following simple example. Consider a simple linear relationship between two variables x and y, where the predictor variable x is fully observed, but some of y are missing according to different missing mechanisms. The variables x and y follow the linear regression y = x + ε, where x and ε independently follows standard normal distributions. We generate an independent sample of (x,y) with size 300. The sample is referred to as the full data, because all values of x and y are known. Then three samples with y-missing data are generated from the full data, according to the missing mechanisms MCAR, MAR and MNAR, respectively. 27 / 46

Result Full data MCAR MAR MNAR (Intercept) 0.08(0.06) 0.01(0.08) 0.07(0.13) 0.44(0.09) x 1.02(0.06) 1.01(0.08) 1.15(0.12) 0.70(0.09) R 2 0.53 0.54 0.36 0.23 Adj. R 2 0.53 0.53 0.35 0.23 Num. obs. 300 148 163 201 28 / 46

Methods for Dealing with Missing Data- Complete data Analysis As the name suggests, the complete-case method (also known as list-wise deletion) makes use of cases with complete data on all variables of interest in the analysis. Cases with incomplete data on one or more variables are discarded. This approach is easiest to implement since standard full data analysis can be applied without modification, and is the default in many statistical software (i.e., cases with any missing values are automatically ignored). The disadvantages include loss of precision due to a smaller sample size, and bias in complete-data summaries when the cases with missing data differ systematically from the completely observed cases (i.e., when data are not missing completely at random). 29 / 46

Weighted Complete Data Method The earlier complete-case strategy can potentially lead to biased summaries because the sample of observed cases may not be representative of the full sample. Another strategy reweights the sample to make it more representative. For example, if response rate is twice as likely among men as women, data from each man in the sample could receive a weight of 2 in order to make the data more representative. This strategy is commonly used in sample surveys, especially in the case of unit nonresponse, where all the survey items are missing for cases in the sample that did not participate. This method usually requires the MAR assumption. 30 / 46

Maximum Likelihood (ML) Method under MAR The fundamental idea behind the ML methods is conveyed by its name: find the values of the parameters that are most probable, or most likely, for the data that have actually been observed. Denote D obs as the observed data, which have the probability density p(d obs θ). Here θ is a set of the parameters of interest. Then the likelihood function L(θ D obs ) is equal to p(d obs θ), i.e. L(θ x) = p(d obs θ), and is thought of as a function of θ where the data D obs are fixed. 31 / 46

Maximum Likelihood Method under MAR, continued The next step is to deduce a numerical value for θ using our knowledge of L(θ x). The ML principle says we should choose an estimate of θ that maximizes the likelihood function. In other words, we choose a value of the parameter that best explains the observed data, and this value is called the ML estimate of θ. 32 / 46

Calculation of the ML estimates Finding the maximum can be easy or hard, depending on the form of p(d obs θ). For the normal (gaussian) distribution, θ consists of the mean µ and variance σ 2, and we can find the derivative of the log(lθ D obs )), set it equal to zero, and solve directly for µ and σ 2. But for most problems, these analytic expressions are hard to come by, and we have to use more elaborate techniques. One might use Newton-Raphson or quasi-newton algorithms to directly maximize the likelihood of the observed data. One of the most popular techniques for maximizing the likelihood function is the expectation-maximization (EM) algorithm. 33 / 46

Imputation-based Methods implicit modeling methods Hot deck imputation: replacing missing values by values from similar responding units in the sample. Cold deck imputation: replace a missing value of an item by a constant value from an external source. Model-based procedures: define a model for the observed data and base imputation on a draw from the posterior distribution of missing data conditional on observed data under that model. 34 / 46

Single Imputation Single imputation substitutes some reasonable guess (imputation) for each missing value and then perform the analysis as if there were no missing data. Mean imputation: A popular approach replaces each missing value with the unconditional mean of the observed values for a particular variable. This single-imputation approach can lead to underestimating standard errors. Additionally, this method distorts relationships between variables since it pushes the correlations between variables towards zero. 35 / 46

Single Imputation, continued Conditional mean imputation: For data sets with few missing variables, conditional mean imputation is an improvement on the unconditional mean approach since it replaces each missing value with the conditional mean of the variable based on other fully observed variables in the data set. Regression imputation: replace missing values by predicted values from a regression of the missing item on items observed for the unit, usually calculated from units with both observed and missing variables present. Mean imputation is a special case. Stochastic regression imputation: replace missing values by a value predicted by regression imputation plus a residual, drawn to reflect uncertainty in the predicted value. 36 / 46

Nearest Neighbor Hot Deck on imputing Missing Outcome Define a metric to measure distance between units, based on the values of observed covariates Choose imputed values that come from responding units close to the units with missing value. 37 / 46

Nearest Neighbor Hot Deck, Cont Let X = (X 1,..., X K ) be K covariates that are observable for all units, and let x i = (x i1,..., x ik ) T be the values of K covariates for a unit i for which y i is missing. Let d(i,j) be the metric between two covariates x i and x j of unit i and unit j. 38 / 46

Nearest Neighbor Hot Deck, Cont For the unit i, we choose an imputed value for y i from those units j that are such that (1) y j,x j1,..., x jk are observed, and (2) d(i,j) is less than some value d 0. 39 / 46

Comments on Imputation Imputation should generally be Conditional on observed variable, to reduce bias due to response, improve precision, and preserve association between missing and observed variables. Multivariate, to preserve associations between missing variables. Draws from the predictive distribution rather than means. Account for imputation uncertainty multiple imputation. 40 / 46

Multiple Imputation, cont Imputation: Create D imputations of the missing data, (1) (d) Y mis,..., Y mis, under a suitable model. Analysis: Analyze each of the m completed datasets in the same way. Combination: Combine the m sets of estimates and SE s using Rubin s (1987) rules. 41 / 46

Picture Illustration Observed D ata???? Im putation 1 2 m 42 / 46

Software for Multiple Imputation - SAS PROC MI implements multiple imputation by the multivariate normal model. PROC MIANALYZE takes the results of the complete-data analysis per dataset. IVEware is a SAS callable software application that implements multiple imputation using chained equations, called sequential regressions. 43 / 46

Software for Multiple Imputation - STATA The ice package: a user-contributed Stata package that provides multiple imputation by chained equations. 44 / 46

Software for Multiple Imputation - SPSS Command, MULTIPLE IMPUTATION: a part of the Missing Values module, supports multiple imputations by chained equations. Imputation and analysis can be done in a largely automatic fashion and is well integrated with the software for complete-data analysis. 45 / 46

Software for Multiple Imputation - R Amelia: multiple imputations based on the multivariate normal model. BaBooN: multiple imputations by chained equations. cat: multiple imputation of categorical data according to the log-linear model. kmi: a Kaplan-Meier multiple imputation, specifically designed to impute missing censoring times. mi: a chained equations approach based on Bayesian regression methods. mice: MI by the chained equations. 46 / 46