How to choose an analysis to handle missing data in longitudinal observational studies

Transcription:

How to choose an analysis to handle missing data in longitudinal observational studies
ICH, 25th February 2015
Ian White, MRC Biostatistics Unit, Cambridge, UK

Plan
- Why are missing data a problem?
- Methods: multiple imputation
- When is MI the best approach?
- When is MI not the best approach?
- How to decide
- Examples
This is a talk about how to choose the analysis method, not about how to do MI.
Based on work funded by the Population Health Sciences Research Network, done by Shaun Seaman (BSU) with Chris Power & Leah Li (ICH) and Alastair Leyland, Seeromanie Harding & Michaela Benzeval (MRC Social and Public Health Sciences Unit).

Why are missing data a problem?
1. Loss of power (compared to the power achieved with no missing data)
   - can't regain lost power
2. Any analysis must make an untestable assumption about the missing data
   - wrong assumption -> biased estimates
3. Some popular analyses with missing data give
   - biased estimates, no matter how the missingness arises [missing indicator method]
   - biased standard errors, resulting in incorrect p-values and confidence intervals [mean imputation]
   - inefficient estimates [complete case]

Missing data are a problem: so what must we do?
1. Loss of power
   - minimise the amount of missing data
2. Any analysis must make an untestable assumption about the missing data
   - think carefully about the right assumption
   - perform sensitivity analyses around that assumption
3. Some popular analyses with missing data give biased estimates / biased standard errors / inefficient estimates
   - make a good choice of analysis (today's topic)

Menu of analyses
- Complete-cases analysis (CCA)
- Simple imputation
  - mean imputation
  - regression imputation
  - stochastic imputation
  - last observation carried forward
- Multiple imputation (MI)
- Inverse probability weighting (IPW)
- Likelihood-based methods (mixed models etc.)
  - includes complex Bayesian modelling

A simple problem - and a useful graph
Satisfaction variable measured at two times; some missing values on sat96 (shown as ".")
id   sat96  sat94
99   .      14
101  9      12
102  .      12
105  20     22
106  14     18
107  22     18
[Figure: missingness plot of sat94 and sat96 (observed vs missing); 6 individuals, 67% complete cases]

Complete-cases analysis
Individuals with a missing value (ids 99 and 102) are dropped, leaving:
id   sat96  sat94
101  9      12
105  20     22
106  14     18
107  22     18
- Usually inefficient
- Default in most stats packages

Mean imputation
Each missing sat96 is replaced by the mean of the observed sat96 values (16.25):
id   sat96  sat94
99   16.25  14
101  9      12
102  16.25  12
105  20     22
106  14     18
107  22     18
Makes results too certain and distorts associations between variables
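The mean-imputation step on the toy data can be reproduced with a minimal sketch. The variable names are illustrative, and `None` is used here to stand for the "." missing-value marker. It also shows one symptom of "too certain": the spread of sat96 shrinks after imputation.

```python
from statistics import mean, stdev

# Toy data from the slide: (id, sat96, sat94); None marks a missing sat96.
rows = [(99, None, 14), (101, 9, 12), (102, None, 12),
        (105, 20, 22), (106, 14, 18), (107, 22, 18)]

# Mean of the observed sat96 values (9, 20, 14, 22).
observed = [sat96 for _, sat96, _ in rows if sat96 is not None]
fill = mean(observed)

# Replace each missing sat96 with that mean.
imputed = [fill if sat96 is None else sat96 for _, sat96, _ in rows]

print(fill)             # value imputed for ids 99 and 102
print(stdev(observed))  # spread among the observed values
print(stdev(imputed))   # spread after mean imputation (smaller)
```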

Mean imputation again
[Figure: scatter plot of sat96 (vertical axis, 0 to 25) against sat94 (horizontal axis, 10 to 25); points with a missing value marked x]

Regression imputation
[Figure: scatter plot of sat96 (vertical axis, 0 to 25) against sat94 (horizontal axis, 10 to 25); points with a missing value marked x]
Better than mean imputation as it preserves relationships between variables, but exaggerates correlations

Stochastic imputation
[Figure: scatter plot of sat96 (vertical axis, 0 to 25) against sat94 (horizontal axis, 10 to 25); points with a missing value marked x]

Stochastic imputation
id   sat96  sat94
99   15.1   14
101  9      12
102  9.5    12
105  20     22
106  14     18
107  22     18
Still over-precise because it treats imputed values as correct
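Regression and stochastic imputation on the same toy data can be sketched as below. This assumes a simple linear regression of sat96 on sat94 fitted to the four complete cases, with normally distributed noise added for the stochastic version; the slides do not state the exact model used. The seed is arbitrary, so the random draws will not reproduce the 15.1 and 9.5 shown on the slide.

```python
import random
from statistics import mean

# Complete cases from the toy data as (sat94, sat96) pairs.
complete = [(12, 9), (22, 20), (18, 14), (18, 22)]
# sat94 values for the two individuals whose sat96 is missing.
missing_sat94 = {99: 14, 102: 12}

# Least-squares fit of sat96 on sat94 using the complete cases.
xs = [x for x, _ in complete]
ys = [y for _, y in complete]
x_bar, y_bar = mean(xs), mean(ys)
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in complete)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual standard deviation (divisor n - 2 for a two-parameter line).
residuals = [y - (intercept + slope * x) for x, y in complete]
sigma = (sum(r ** 2 for r in residuals) / (len(complete) - 2)) ** 0.5

# Regression imputation: the fitted value, with no noise.
regression_imputed = {i: intercept + slope * x for i, x in missing_sat94.items()}

# Stochastic imputation: fitted value plus a random residual.
rng = random.Random(2015)  # arbitrary seed, for repeatability only
stochastic_imputed = {i: v + rng.gauss(0, sigma)
                      for i, v in regression_imputed.items()}

print(regression_imputed)
print(stochastic_imputed)
```

Repeating the stochastic step several times, each with fresh draws, gives the several completed data sets that multiple imputation analyses.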

Multiple imputation
Imputed data set 1:
id   sat96  sat94
99   15.1   14
101  9      12
102  9.5    12
105  20     22
106  14     18
107  22     18
Imputed data set 2:
id   sat96  sat94
99   21.3   14
101  9      12
102  18.8   12
105  20     22
106  14     18
107  22     18

Basics of multiple imputation
- Not "making up data" but "making up data honestly"!
- Idea is to impute data several times in order to express the full uncertainty about the missing data
  - uses the "imputation model"
- Each completed data set is analysed using standard methods
  - the "substantive model"
- The results are combined using Rubin's rules, which allow for variation between imputed data sets as a source of uncertainty
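The combination step can be sketched as follows, using the standard form of Rubin's rules: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the input numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
from statistics import mean, variance

def rubins_rules(estimates, variances):
    """Pool m per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    pooled = mean(estimates)          # pooled point estimate
    within = mean(variances)          # average within-imputation variance
    between = variance(estimates)     # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from m = 5 imputed data sets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]

pooled, total = rubins_rules(estimates, variances)
print(pooled, total, total ** 0.5)  # estimate, total variance, pooled standard error
```

The between-imputation term is what a single stochastic imputation leaves out, which is why that method is over-precise.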

Missing at random
- MI is usually done assuming missing at random (MAR): the probability of data being missing depends only on observed variables, not on unobserved variables
  - e.g. whether a GP measures cholesterol depends only on the patient's age, sex, smoking, blood pressure and whether diabetic -> cholesterol is MAR
- The opposite is missing not at random (MNAR)
  - e.g. whether a researcher interviews a patient with severe mental illness is likely to depend on their current symptom severity as well as their age, sex, etc. -> symptom severity is MNAR

When MI is or isn't a good choice
- MI is probably applicable to all missing data problems
  - aim here is to see when we might in practice prefer some other analysis
- Assume the substantive model is a regression analysis: regressing an outcome on an exposure, adjusting for several confounders
  - where all these may be repeatedly measured (e.g. lifecourse epidemiology)
- A particular alternative to MI is CCA

When is MI the best choice? 1. Incomplete confounders
- MI is most applicable when there is a lot of missing data in the confounders
  - e.g. here <10% of data points are missing but complete-cases analysis would discard 44% of the observations
[Figure: missingness plot of Outcome, Exposure and C1-C6 (observed vs missing); 100 individuals, 56% complete cases]

When is MI the best choice? 2. Auxiliary variables
- Auxiliary variables are variables that are
  - not in the substantive model
  - associated with the missing data
  - sometimes observed when data are missing
  and can therefore be used to improve the imputations
- Examples: outcomes from case-notes are a useful auxiliary for an interview-collected outcome
  - NB the auxiliary variable is collected whether or not the main variable is collected
- Auxiliary variables are easy to include in multiple imputation and usually hard to include in other methods
- But strong associations are needed (e.g. correlation >0.3) before auxiliary variables make a discernible difference

When may MI not be the best choice?
1. Very little missing data
2. Missing data only in the outcome
3. Other special missing data patterns
4. Multilevel data
5. Interactions in the model
6. Mis-specified model
7. Simple missing data patterns
8. Too much missing data

1. Very little missing data
- With very small amounts of missing data, any method (e.g. CCA) is adequate
  - what matters is the % of incomplete cases
- But how much data is very little?
  - Harrell (2001): <5% incomplete cases
  - Barzi & Woodward (2004) and Burton et al (2010): <10% incomplete cases
- Depends on other factors: e.g. consider a binary outcome with prevalence 1% and 1% missing data
  - if in fact all missing values are cases then prevalence is 2% not 1%
  - so results are still sensitive to (extreme) departures from MAR
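The prevalence example can be checked with a few lines of arithmetic. The sample size of 10,000 is an assumption made only to turn the percentages into whole numbers.

```python
# Hypothetical study size, chosen so that 1% is a whole number of people.
n = 10_000
n_missing = n // 100                         # 1% of outcomes are missing
n_observed = n - n_missing
cases_observed = round(0.01 * n_observed)    # 1% prevalence among the observed

# Under MAR-type reasoning the missing people look like the observed ones.
prevalence_observed = cases_observed / n_observed

# Extreme departure from MAR: every missing outcome is actually a case.
prevalence_extreme = (cases_observed + n_missing) / n

print(prevalence_observed)  # about 1%
print(prevalence_extreme)   # about 2%
```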

2. Missing data only in the outcome
- Assume no auxiliary variables
- Incomplete cases hold no information about the substantive model
- Hence it is entirely appropriate to restrict analysis to complete cases
  - makes the same MAR assumption as MI etc.
  - MI would just give CC results + random error (if imputing from the substantive model)
- This is why MI is less relevant in randomised trials
[Figure: missingness plot of Outcome, Exposure and C1-C6 (observed vs missing); 100 individuals, 49% complete cases]

3. Other special missing data patterns: missing data in the exposure
- Incomplete cases hold very little information about the regression model (still assuming no auxiliary variables)
- Often reasonable to restrict analysis to complete cases
- But note the different assumptions:
  - MI: being missing may depend on Outcome but not on Exposure (MAR)
  - CC: being missing may depend on Exposure but not on Outcome
[Figure: missingness plot of Outcome, Exposure and C1-C6 (observed vs missing); 100 individuals, 49% complete cases]

Other special missing data patterns: Introducing the FICO
- MI enables us to make use of the incomplete cases
- How much information do the incomplete cases hold?
- We approximate this by the Fraction of Incomplete Cases among cases with outcome and exposure Observed (FICO)
  - Small FICO (e.g. <10%) & no auxiliary variables -> complete cases is adequate
  - Large FICO or auxiliary variables -> MI is needed
[Figure: missingness plot of Outcome, Exposure and C1-C6 (observed vs missing); 100 individuals, 49% complete cases, FICO=0%]
White IR, Carlin JB. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Statistics in Medicine 2010; 28: 2920-2931.
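The FICO definition translates directly into a short function. This is a minimal sketch: the function name, the record layout (one dict per individual, with `None` for a missing value) and the example data are all hypothetical.

```python
def fico(records, outcome, exposure):
    """Fraction of incomplete cases among those with outcome and exposure observed."""
    # Restrict to individuals whose outcome and exposure are both observed.
    eligible = [r for r in records
                if r[outcome] is not None and r[exposure] is not None]
    if not eligible:
        raise ValueError("no records with both outcome and exposure observed")
    # Among those, count the ones missing at least one other variable.
    incomplete = [r for r in eligible if any(v is None for v in r.values())]
    return len(incomplete) / len(eligible)

# Hypothetical data set with one confounder, c1.
records = [
    {"outcome": 1.0, "exposure": 0, "c1": 5},        # complete case
    {"outcome": 2.0, "exposure": 1, "c1": None},     # incomplete, counts towards FICO
    {"outcome": None, "exposure": 1, "c1": 3},       # outcome missing: excluded
    {"outcome": 1.5, "exposure": None, "c1": None},  # exposure missing: excluded
]

print(fico(records, "outcome", "exposure"))
```

Individuals missing the outcome or exposure are left out of both numerator and denominator, which is why a pattern with missing data only in the outcome has FICO=0% however many incomplete cases there are.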

FICO: simple examples
[Figure: three missingness plots of Outcome, Exposure and C1-C6 (observed vs missing), each with 100 individuals]
- Panel 1: 50% incomplete cases; FICO=0%
- Panel 2: 50% incomplete cases; FICO=0%
- Panel 3: 50% incomplete cases; FICO=50%

FICO: more realistic illustration
[Figure: two missingness plots of Outcome, Exposure and C1-C6 (observed vs missing), each with 100 individuals]
- Panel 1: 77% incomplete cases; FICO=50%. Worth imputing
- Panel 2: 54% incomplete cases; FICO=0%. Not worth imputing?

FICO: summary
CCA is a reasonable alternative to MI if
- there are few incomplete cases among those with complete outcome and exposure (e.g. assessed by low FICO)
- there are no (strong) auxiliary variables

4. Multilevel data
- MI is harder for multilevel data. Options include
  - impute ignoring clustering (underestimates clustering, hence standard errors likely to be too small)
  - if clusters are large, impute with cluster as fixed effects (overestimates clustering)
  - REALCOM: stand-alone software that can be called from MLwiN or Stata
  - R: some facilities in mice; jomo
- If missing data are only in the outcome, again complete cases may be appropriate (look at FICO in level 1 units)
  - usually involves mixed models
- Repeated measures can be seen as correlated, not multilevel (use wide format)

5. Interactions in the substantive model
- Key fact about MI: the imputation model must contain all the variables in the substantive model
- If the substantive model contains interactions then these need to be reflected in the imputation model
  - e.g. if you are exploring whether a particular association differs between boys and girls, and you impute assuming that the association is the same in boys and girls, then you are biasing your analysis
- Easy to see the problem; harder to fix it
Bartlett JW et al. Multiple imputation of covariates by fully conditional specification: Accommodating the substantive model. Statistical Methods in Medical Research (online).

6. Model mis-specification
Let's go back to this missing data pattern.
[Figure: missingness plot of Outcome, Exposure and C1-C6 (observed vs missing); 100 individuals, 49% complete cases]
There are two reasons why we might not be happy with complete-cases analysis:
1. Some auxiliary variables may predict both Outcome and whether Outcome is missing
2. We may not believe the model

Model mis-specification (example)
- Illustrate in a hypothetical RCT of an anti-influenza drug
  - drug would be used before flu was definitively diagnosed
  - drug is effective in people with flu
  - drug is harmful in those without flu
  - so we care about the balance of benefit and harm
  - measure these in quality-adjusted life-hours (!)
- Substantive model is a regression of quality-adjusted life-hours on assignment to flu drug
  - model is mis-specified because there's an omitted interaction with flu status

Model mis-specification (example)
Complete data (mean outcome by arm):
Flu status  Count  Placebo  Flu drug  Difference
Flu         200    20       24        +4
Not flu     800    22       21        -1
All         1000                      0
What if data are missing at random for half the not-flu?
Flu status  Count  Placebo  Flu drug  Difference
Flu         200    20       24        +4
Not flu     400    22       21        -1
All         600                       +0.7
We wrongly conclude overall benefit

Mis-specified models
- Can solve this by inverse probability weighting (IPW)
  - weight each person by 1 / their probability of being observed
  - here, weight each flu case as 1 and each non-flu case as 2
  - restores the "right" answer
- Could also solve it by imputing missing outcomes with the correct IM (i.e. one which allows for an interaction between flu status and drug given)
- In general, IPW is appropriate for protecting against model mis-specification
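The flu example can be reproduced numerically. The sketch below assumes equal allocation to placebo and drug within each flu stratum, so that the overall treatment difference is the weighted average of the stratum differences; the function and variable names are illustrative.

```python
# Stratum -> (number analysed, treatment difference in mean outcome).
full_data = {"flu": (200, 4.0), "not flu": (800, -1.0)}
observed = {"flu": (200, 4.0), "not flu": (400, -1.0)}  # half of not-flu missing

# Probability of being observed in each stratum.
prob_observed = {"flu": 1.0, "not flu": 0.5}

def overall_difference(strata, weights=None):
    """Weighted average of the stratum-specific treatment differences."""
    if weights is None:
        weights = {s: 1.0 for s in strata}
    total_weight = sum(n * weights[s] for s, (n, _) in strata.items())
    weighted_sum = sum(n * weights[s] * diff for s, (n, diff) in strata.items())
    return weighted_sum / total_weight

# Inverse probability weights: 1 for flu, 2 for not flu.
ipw = {s: 1.0 / p for s, p in prob_observed.items()}

print(overall_difference(full_data))      # no missing data: 0
print(overall_difference(observed))       # unweighted complete cases: about +0.7
print(overall_difference(observed, ipw))  # IPW: back to 0
```

Each observed not-flu participant stands in for two people, so the weighted sample has the same flu to not-flu balance as the full trial.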

7. Simple missing data patterns
[Figure: missingness plot of Outcome, Exposure and C1-C6 (observed vs missing); 1000 individuals, 80% incomplete cases, FICO=0%]
- This pattern has only 1 incomplete pattern, perhaps because
  - Outcome and C2-C6 are measured at interview in adult life
  - and Exposure & C1 are measured at birth
- MI would require correct imputation models for 6 variables
- IPW is a good alternative: it only requires one model for being a complete case given Exposure & C1 (and any auxiliary variables)

Simple missing data patterns (ctd)
[Figure: missingness plot of Outcome, Exposure and C1-C6 (observed vs missing); 1000 individuals, 85% incomplete cases, FICO=21%]
- This pattern is similar but also has some extra missing data (presumably missing items in those interviewed)
- Could use IPW-MI hybrid
  - build model for being interviewed -> weights
  - impute only among those interviewed (with weights in the imputation model)
Seaman S, White I, Copas A, Li L. Combining multiple imputation and inverse-probability weighting. Biometrics 2012; 68: 129-137.

8. Too much missing data
- In principle, MI can handle very large amounts of missing data, but the impact of anything you do wrong is much greater with more missing data
  - departures from MAR will be very influential
    - sensitivity analysis to departures from MAR will identify this problem
  - MI errors will matter a lot
    - e.g. with 70% missing data, omitting the outcome variable from the imputation model would dilute associations by 70%
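The 70% dilution can be illustrated with a small simulation. It is a sketch under assumptions that are not in the talk: a single standard normal covariate with true slope 1, values missing completely at random, and a deliberately flawed imputation that draws from the observed covariate values while ignoring the outcome.

```python
import random

rng = random.Random(1)  # arbitrary seed, for repeatability only
n = 20_000

# Simulated data: y depends on x with a true slope of 1.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [xi + rng.gauss(0, 1) for xi in x]

def slope(xs, ys):
    """Least-squares slope from regressing ys on xs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    return sxy / sxx

# Make about 70% of x missing, independently of everything else.
is_missing = [rng.random() < 0.7 for _ in range(n)]
observed_x = [xi for xi, m in zip(x, is_missing) if not m]

# Flawed imputation: sample from the observed x values, ignoring y.
x_imputed = [rng.choice(observed_x) if m else xi
             for xi, m in zip(x, is_missing)]

print(slope(x, y))          # close to the true slope of 1
print(slope(x_imputed, y))  # close to 0.3, i.e. diluted by about 70%
```

The imputed x values are unrelated to y, so only the roughly 30% of genuine values contribute to the estimated association.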

How to choose an analysis: what to consider
1. Fraction of missing values for each variable in model
2. Fraction of incomplete cases
3. Fraction of incomplete cases among those with observed outcome and exposure (FICO)
4. Availability of auxiliary variables
5. Distribution of number of missing values
6. Patterns of jointly missing data
7. Reasons for missing data
8. Plausible missingness mechanisms
9. Clustering of data
- low FICO & no AVs -> CCA?
- simple pattern -> IPW?
- possible departures from MAR

Example 1 (auxiliary variables)
- Southampton Women's Survey (Crozier et al., 2009)
  - 1987 women interviewed pre-pregnancy
  - 1553 (78%) interviewed at early pregnancy
  - 1893 (95%) interviewed at late pregnancy
- Analysis: regress mother's daily caffeine consumption at early pregnancy on her examination qualifications and age at conception
- MAR was considered plausible
- FICO is 1.5%: main missing data are in the outcome
  - suggests a complete cases analysis
- But recent caffeine consumption at pre-pregnancy and late pregnancy are useful auxiliary variables for caffeine consumption at early pregnancy
  - we therefore recommend MI with auxiliary variables

Example 2 (repeated exposure)
- 1958 birth cohort
- Exposure: maternal interest in the education of the participant in childhood, reported by teachers at ages 7, 11 and 16 years and formed into a summary measure
- Outcome: participants' cognitive function at 50 years
Exposures  Count  %
0          313    3%
1          1420   15%
2          3678   38%
3          4238   44%
Total      9649   100%
Table shows the 9649/17638 with observed outcome.
The FICO is calculated here as the fraction of incomplete cases among those with observed outcome and partly or fully observed exposure. This works out as (1420+3678)/(1420+3678+4238)=55%.
We recommend MI.
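The FICO arithmetic for this example can be checked directly from the table counts. The sketch reads the "Exposures" column as the number of the three exposure reports (ages 7, 11 and 16) that were observed, which is an interpretation consistent with the calculation on the slide.

```python
# Participants with observed outcome, by number of exposure reports observed.
counts = {0: 313, 1: 1420, 2: 3678, 3: 4238}

total_with_outcome = sum(counts.values())

# Denominator: exposure partly or fully observed (1, 2 or 3 reports).
eligible = counts[1] + counts[2] + counts[3]
# Numerator: exposure only partly observed (1 or 2 reports).
incomplete = counts[1] + counts[2]

fico = incomplete / eligible

print(total_with_outcome)  # matches the table total
print(round(100 * fico))   # FICO as a percentage
```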

Example 3 (repeated outcomes)
- 1958 Birth Cohort
- Outcome: the trajectory of maths scores at age 7 to 16 years
- Exposure: birth weight
- Covariates: none
- The graph summarises the data in a "wide" format (1 record per child)
[Figure: missingness plot of bwtkg, math7a, math11a and math16a (observed vs missing); 18560 individuals]

Example 3 ctd.
- The graphs below summarise the data in the more appropriate "long" format (1 record per wave per child)
  - proportion of complete cases is 71%
  - FICO=0 (obviously)
- Recommend a random-effects model (easier than MI)
[Figure: three missingness plots of bwtkg and math (observed vs missing), each with 18560 individuals]
- Age 7: 75% complete cases; FICO=0%
- Age 11: 70% complete cases; FICO=0%
- Age 16: 58% complete cases; FICO=0%

Summary: the alternatives to MI
1. Very little missing data
2. Missing data only in the outcome
3. Other patterns with low FICO
   (1-3: low FICO & no AVs -> CCA?)
4. Multilevel data -> REALCOM etc.?
5. Interactions in the model -> care
6. Mis-specified model
7. Simple missing data patterns
   (6-7: IPW?)
8. Too much missing data -> sensitivity analysis