Dealing with Missing Data



Transcription:

Dealing with Missing Data
Roch Giorgi (email: roch.giorgi@univ-amu.fr)
UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France
BioSTIC, APHM, Hôpital Timone, Marseille, France
January 23, 2014
EPAAC WP9 Satellite Meeting, Ispra (Italy)

Background (1)
  The importance of quality control is well known
  Covariate values may be missing for some subjects
    Collected routinely: tumor size, lymph node status, metastasis (mainly)
    Collected for specific studies: estrogen receptor, socioprofessional category, ...
  Missing values may concern
    Dependent variable: time/status in survival analysis
    Independent variable(s): tumor size, ...
  Whatever the question (incidence, survival, ...)

Roch Giorgi, SESSTIM, Faculty of Medicine, Aix-Marseille University, CENSUR working survival group

Background (2)
  Consequences of missing data
    Loss of irrelevant/non-informative information
      No impact on estimates
    Loss of relevant/informative information
      Impact depends on the percentage of missing values
      Possible bias in both point estimates and standard errors
      Loss of statistical power
    Univariate or multivariate analysis?
      Multivariate analysis: the total percentage of missing values increases
  What can we do?
    Discard the whole data set?
    Choose an appropriate method to perform the analysis?

Objectives
  Present an overview of
    The types of missing data
    Some methods used to deal with missing data
  Provide outline guidelines

Missing Data Mechanism: Notations
  $Y = (y_{ij})$: $(n \times k)$ rectangular data set without missing values
  $M = (m_{ij})$: missing-data indicator matrix
    $m_{ij} = 1$ if $y_{ij}$ is missing
    $m_{ij} = 0$ if $y_{ij}$ is present
  M defines the missingness pattern
  [Figure: three schematic data grids (subjects 1, 2, ..., n in rows; variables Y_1, ..., Y_k in columns; "?" marks a missing value) illustrating the univariate, monotone, and non-monotone missingness patterns]
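The notation above can be made concrete with a small sketch. It assumes missing values are coded as `None`; the function names (`missingness_matrix`, `is_monotone`, `classify_pattern`) are illustrative and do not come from the slides. The monotone check is done for the column order as given, whereas a pattern is usually called monotone if some reordering of the variables makes it so.

```python
from typing import List, Optional

Row = List[Optional[float]]


def missingness_matrix(data: List[Row]) -> List[List[int]]:
    """Return M with m_ij = 1 if y_ij is missing (None) and 0 otherwise."""
    return [[1 if value is None else 0 for value in row] for row in data]


def is_monotone(m: List[List[int]]) -> bool:
    """True if, in the given column order, a missing value in a row
    is always followed by missing values in all later columns."""
    for row in m:
        seen_missing = False
        for flag in row:
            if seen_missing and flag == 0:
                return False
            if flag == 1:
                seen_missing = True
    return True


def classify_pattern(m: List[List[int]]) -> str:
    """Label the pattern as complete, univariate, monotone or non-monotone."""
    incomplete_columns = [
        j for j in range(len(m[0])) if any(row[j] == 1 for row in m)
    ]
    if not incomplete_columns:
        return "complete"
    if len(incomplete_columns) == 1:
        return "univariate"
    return "monotone" if is_monotone(m) else "non-monotone"


# Hypothetical data: 3 subjects, 3 variables.
dropout_like = [[1.0, 2.0, 3.0], [1.5, 2.5, None], [0.7, None, None]]
scattered = [[1.0, None, 3.0], [None, 2.0, 1.0]]

m_dropout = missingness_matrix(dropout_like)
m_scattered = missingness_matrix(scattered)
pattern_dropout = classify_pattern(m_dropout)
pattern_scattered = classify_pattern(m_scattered)
```

The monotone case typically arises from drop-out in follow-up data, which is why the example is named `dropout_like`.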

Missing Data Mechanism: Classification
  Characterized by the conditional distribution of M given Y
  Missing Completely At Random (MCAR)
    Missingness mechanism is independent of the values of the data Y, whether missing ($Y_{mis}$) or observed ($Y_{obs}$)
  Missing At Random (MAR)
    Missingness mechanism depends only on $Y_{obs}$, not on $Y_{mis}$
  Missing Not At Random (MNAR)
    Missingness mechanism depends on $Y_{mis}$
  Ignorable (MCAR, MAR) / non-ignorable (MNAR) missing data
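A small simulation may help to tell the three mechanisms apart. The sketch below is not from the slides: the variables (age always observed, tumor size possibly missing), the distributions, and the missingness probabilities are all assumptions chosen for illustration. It compares the mean tumor size among subjects with an observed value to the mean in the full simulated sample.

```python
import math
import random
from typing import Dict, List


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def simulate_observed_means(n: int = 20000, seed: int = 1) -> Dict[str, float]:
    """Mean tumor size in the full sample and among observed values
    under three hypothetical missingness mechanisms."""
    rng = random.Random(seed)
    sizes: List[float] = []
    observed: Dict[str, List[float]] = {"MCAR": [], "MAR": [], "MNAR": []}
    for _ in range(n):
        age = rng.gauss(60.0, 10.0)
        # Tumor size is correlated with age (assumed relationship).
        size = 30.0 + 0.5 * (age - 60.0) + rng.gauss(0.0, 8.0)
        sizes.append(size)
        p_missing = {
            "MCAR": 0.3,                              # unrelated to any data
            "MAR": logistic((age - 60.0) / 5.0),      # depends on observed age
            "MNAR": logistic((size - 30.0) / 5.0),    # depends on size itself
        }
        for mechanism, p in p_missing.items():
            if rng.random() >= p:                     # value is observed
                observed[mechanism].append(size)
    result = {"true": sum(sizes) / n}
    for mechanism, values in observed.items():
        result[mechanism] = sum(values) / len(values)
    return result


means = simulate_observed_means()
for name, value in means.items():
    print(f"{name:>5}: mean tumor size = {value:.2f}")
```

In this setup the observed mean is close to the full-sample mean under MCAR and too low under both MAR and MNAR. The difference is that under MAR the bias can be removed by conditioning on the observed age, whereas under MNAR it cannot be removed using observed data alone.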

Missing Data Mechanism
  What do we learn from this?
    MCAR, MAR: handling missing data in an appropriate way does not require modelling the missingness process
  Statistical tests
    H0: MCAR vs MAR? Yes
    H0: ignorable vs non-ignorable? No
  Classical methods used to handle missing data
    Provide valid statistical inferences with ignorable missing data
    Are not valid with non-ignorable missing data
  Sensitivity analyses under various scenarios of nonresponse when the MNAR hypothesis is suspected (e.g. self-reported characteristics such as psychological disorders, quality of life, income, ...)

Classical Methods
  Complete cases
  Indicator variable
  Multiple imputation
  and others

Complete Cases Method
  Based only on the individuals with no missing values on the covariates included in the analysis
  The preferred method of many statistical software packages!
  Pros
    Easy to perform (but that is not necessarily an advantage)
    Unbiased results under the MCAR hypothesis
  Cons
    Reduction of sample size
    Loss of statistical power
    Bias in standard errors
    Inappropriate variable selection (regression analysis)
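The reduction of sample size can be much larger than the per-covariate missingness suggests. The following sketch uses a made-up data set of 10 subjects and 3 covariates, with `None` as the missing-value code; the function names are illustrative.

```python
from typing import List, Optional

Row = List[Optional[float]]


def missing_fraction_by_column(data: List[Row]) -> List[float]:
    """Fraction of missing values for each covariate."""
    n = len(data)
    k = len(data[0])
    return [sum(row[j] is None for row in data) / n for j in range(k)]


def complete_case_fraction(data: List[Row]) -> float:
    """Fraction of subjects with no missing value on any covariate."""
    complete = sum(all(value is not None for value in row) for row in data)
    return complete / len(data)


# Hypothetical data: each covariate is missing for 2 of 10 subjects,
# and the missing values fall on different subjects.
data: List[Row] = [[1.0, 1.0, 1.0] for _ in range(10)]
for subject, covariate in [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2)]:
    data[subject][covariate] = None

per_covariate = missing_fraction_by_column(data)
kept = complete_case_fraction(data)
print(per_covariate, kept)
```

Each covariate is only 20% missing, yet a complete-cases analysis using all three keeps just 40% of the subjects.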

Indicator Variable Method
  Creation of a missing-data indicator variable
  Missing data are treated as just another category
  Pros
    Includes all the observations in the analysis
    No loss of statistical power
    May help to interpret results (similarity with another category)
  Cons
    Biased estimates (usually)
    May not help to interpret results (absence of similarity)
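As a minimal sketch of the recoding step, assuming a categorical covariate (here a made-up T-stage variable) with `None` as the missing-value code, the missing values become an extra level that is then dummy-coded like any other. The function names and the label `"missing"` are illustrative choices.

```python
from typing import List, Optional, Tuple


def add_missing_category(
    values: List[Optional[str]], missing_label: str = "missing"
) -> List[str]:
    """Recode missing values as an additional category."""
    return [missing_label if value is None else value for value in values]


def dummy_code(values: List[str], reference: str) -> Tuple[List[str], List[List[int]]]:
    """Return the non-reference levels and one 0/1 indicator column per level."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


t_stage = ["T1", "T2", None, "T3", None, "T1"]
recoded = add_missing_category(t_stage)
levels, design = dummy_code(recoded, reference="T1")
print(levels)
print(design)
```

All six subjects stay in the analysis, but the coefficient of the `"missing"` level mixes subjects from every true category, which is one way to see why the resulting estimates are usually biased.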

Multiple Imputation (MI): Principle
  Step 1 (MAR assumption): imputation of the missing values to create M completed data sets
  Step 2: analysis of each of these completed data sets, giving estimates and standard errors
  Step 3: combination to produce a single set of estimates with their standard errors
  [Diagram: incomplete data set -> imputation model -> completed data sets 1, 2, ..., M -> analysis model -> estimates e_1 (se_1), e_2 (se_2), ..., e_M (se_M) -> combined estimate e (se)]

MI: Imputation of the Missing Values
  Goal: to account for the relationships between $Y_{mis}$ and $Y_{obs}$, while taking into account the uncertainty of the imputation
    $Y^{*} \sim f(Y_{mis} \mid Y_{obs})$
  Imputation model (non-exhaustive list)
    Continuous variable (e.g. age at diagnosis): propensity methods, predictive mean matching
    Binary data (e.g. M-stage): logistic regression
    Categorical data (e.g. T-stage): polytomous logistic regression, proportional odds
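To illustrate one of the options listed above, here is a deliberately simplified sketch of predictive mean matching for a single continuous variable with one fully observed predictor. It is an assumption-laden illustration, not the procedure used in any particular software: a full implementation would also draw the regression coefficients from their posterior distribution before matching, which is omitted here. The data, the function names, and the number of donors `k` are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(
    x: List[float], y: List[Optional[float]], k: int, rng: random.Random
) -> List[float]:
    """Replace each missing y by the observed y of a donor drawn at random
    among the k observed subjects with the closest predicted value."""
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([p[0] for p in observed], [p[1] for p in observed])
    donors_pool = [(intercept + slope * xi, yi) for xi, yi in observed]
    completed: List[float] = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)
            continue
        target = intercept + slope * xi
        nearest = sorted(donors_pool, key=lambda p: abs(p[0] - target))[:k]
        completed.append(rng.choice(nearest)[1])  # random donor: imputation uncertainty
    return completed


# Hypothetical data: age is complete, tumor size has two missing values.
age = [45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0]
size = [18.0, 22.0, None, 27.0, 30.0, None, 35.0, 41.0]

# Five completed data sets, each from a different random draw.
completed_sets = [pmm_impute(age, size, k=3, rng=random.Random(seed)) for seed in range(5)]
```

Because each imputed value is borrowed from an observed subject, the imputations always stay within the range of plausible values, and the random choice among donors is what makes the completed data sets differ from one another.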

MI: Analyses of the Completed Data Sets
  Analysis model: classical methods used to estimate
    Incidence
    Survival
    Effect of prognostic factors
  Independent analyses
    Each applied to one of the new completed data sets

MI: Combined Analysis
  Combination of the M estimates into an overall estimate and variance-covariance matrix using Rubin's rules
    Takes into account the uncertainty due to missing data
  Statistics that can be combined
    Mean, proportion, regression coefficient, ...
  Statistics that may require transformation
    Odds ratio, hazard ratio, baseline hazard, survival probability, ...
  Statistics that cannot be combined
    P-value, likelihood ratio test statistic, ...
  Adapted from: White IR, et al. Statistics in Medicine 2009
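For a single scalar parameter, Rubin's rules can be sketched as follows. The pooled estimate is the mean of the M estimates, and the total variance is the mean within-imputation variance plus (1 + 1/M) times the between-imputation variance. The input values below are made up; they stand for log hazard ratios, to illustrate that a hazard ratio is pooled on the log scale and back-transformed afterwards.

```python
import math
from statistics import mean, variance
from typing import List, Tuple


def pool_rubin(estimates: List[float], std_errors: List[float]) -> Tuple[float, float, float]:
    """Pool M estimates of one parameter with Rubin's rules.

    Returns the pooled estimate, its standard error and the degrees of freedom.
    """
    m = len(estimates)
    if m < 2 or len(std_errors) != m:
        raise ValueError("need at least two estimates, with one standard error each")
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average within-imputation variance
    between = variance(estimates)                   # between-imputation variance (divisor M - 1)
    total = within + (1.0 + 1.0 / m) * between
    if between == 0.0:
        df = math.inf
    else:
        df = (m - 1) * (1.0 + within / ((1.0 + 1.0 / m) * between)) ** 2
    return pooled, math.sqrt(total), df


# Hypothetical log hazard ratios and standard errors from M = 5 completed data sets.
log_hr = [0.50, 0.55, 0.45, 0.60, 0.40]
se_log_hr = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled_log_hr, pooled_se, df = pool_rubin(log_hr, se_log_hr)
pooled_hr = math.exp(pooled_log_hr)  # back-transform after pooling on the log scale
print(f"HR = {pooled_hr:.2f}, SE(log HR) = {pooled_se:.3f}, df = {df:.1f}")
```

The pooled standard error (about 0.132 here) is larger than the 0.10 obtained in each completed data set, which reflects the extra uncertainty due to the missing data.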

MI: Issues and Guidance for Practice (1)
  How many missing values at most?
    Do not think in terms of the percentage of missing values per covariate, but in terms of the percentage of the original data set that is lost when all variables used for the analyses are considered
    Think about the missingness mechanism
  Which variables to include in the imputation model?
    Covariates and outcome from the analysis model
      In a survival model: status, time (t, log(t)) or cumulative baseline hazard function
    All predictors of the incomplete variable
    The number of variables in the imputation model may be greater than in the analysis model
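When the cumulative baseline hazard is used as a covariate in the imputation model, it has to be estimated first. One common choice, assumed here rather than taken from the slides, is the Nelson-Aalen estimator, which adds d/n at each event time (d events among n subjects still at risk). The sketch below evaluates it at each subject's own follow-up time; the data and the function name are hypothetical.

```python
from typing import List, Tuple


def nelson_aalen_at_own_time(times: List[float], events: List[int]) -> List[float]:
    """Nelson-Aalen cumulative hazard evaluated at each subject's follow-up time.

    events[i] is 1 if subject i had the event and 0 if censored.
    """
    event_times = sorted({t for t, d in zip(times, events) if d == 1})
    steps: List[Tuple[float, float]] = []
    cumulative = 0.0
    for t in event_times:
        at_risk = sum(1 for s in times if s >= t)
        n_events = sum(1 for s, d in zip(times, events) if s == t and d == 1)
        cumulative += n_events / at_risk
        steps.append((t, cumulative))
    result: List[float] = []
    for s in times:
        value = 0.0
        for t, h in steps:
            if t <= s:
                value = h
            else:
                break
        result.append(value)
    return result


# Hypothetical follow-up times (years) and status (1 = event, 0 = censored).
times = [2.0, 3.0, 3.0, 5.0, 8.0]
status = [1, 1, 0, 1, 0]

cum_hazard = nelson_aalen_at_own_time(times, status)
print(cum_hazard)  # to be added, with status, as a predictor in the imputation model
```

In this example the increments are 1/5 at t = 2, 1/4 at t = 3 and 1/2 at t = 5, so the subject followed for 8 years is assigned a cumulative hazard of 0.95.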

MI: Issues and Guidance for Practice (2)
  Should we pay attention to the form of the imputation model?
    Yes in theory, but hard to do (linearity? interaction terms? ...)
  How many imputations are necessary?
    M = 5-10 is usually considered adequate
    Other rules exist, based on the fraction of missing data
  Do we have to perform new imputations for each analysis?
    The imputed data sets may be used for several analyses
    This requires attention when building the imputation model (it should be more "congenial")
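Two of the rules alluded to above can be written down in a few lines. The first is Rubin's relative efficiency of M imputations compared with an infinite number, 1 / (1 + γ/M), where γ is the fraction of missing information. The second is the rule of thumb discussed by White, Royston and Wood (2011) that M should be at least the percentage of incomplete cases. The function names and the example values are illustrative assumptions.

```python
import math


def relative_efficiency(fraction_missing_information: float, m: int) -> float:
    """Efficiency of an estimate based on m imputations, relative to infinitely many."""
    return 1.0 / (1.0 + fraction_missing_information / m)


def imputations_rule_of_thumb(percent_incomplete_cases: float) -> int:
    """Rule of thumb: at least as many imputations as the percentage of incomplete cases."""
    return max(1, math.ceil(percent_incomplete_cases))


# Hypothetical scenario: 30% missing information, 30% of subjects incomplete.
eff_5 = relative_efficiency(0.3, 5)
eff_20 = relative_efficiency(0.3, 20)
suggested_m = imputations_rule_of_thumb(30.0)
print(f"M=5: {eff_5:.3f}, M=20: {eff_20:.3f}, rule of thumb: M >= {suggested_m}")
```

With 30% missing information, M = 5 already gives a relative efficiency of about 0.94, which is the usual argument for small M. The rule of thumb asks for more imputations because it also aims at reproducible standard errors and p-values, not only at efficient point estimates.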

MI: Issues and Guidance for Practice (3)
  Is there a particular model-building strategy?
    Variable selection can be performed on all imputed data sets, or on a single data set (after merging) with an appropriate weighting procedure
    Model checking can be performed on each imputed data set
    Predictions can be obtained using Rubin's rules
  How can we be confident that the missingness mechanism is ignorable?
    Think about your data
    Perform sensitivity analyses

Thank you
Roch Giorgi (email: roch.giorgi@univ-amu.fr)
CENSUR (Challenges in the Estimation of Net SURvival) working survival group
French National Research Agency (ANR-12-BSV1-0028)

References
  Eisemann N, Waldmann A, Katalinic A. Imputation of missing values of tumour stage in population-based cancer registration. BMC Medical Research Methodology 2011;11:129.
  Giorgi R, Belot A, Gaudart J, Launoy G; French Network of Cancer Registries FRANCIM. The performance of multiple imputation for missing covariate data within the context of regression relative survival analysis. Statistics in Medicine 2008;27(30):6310-31.
  Howlader N, Noone AM, Yu M, Cronin KA. Use of imputed population-based cancer registry data as a method of accounting for missing information: application to estrogen receptor status for breast cancer. American Journal of Epidemiology 2012;176(4):347-56.
  Little RJA, Rubin DB. Statistical Analysis with Missing Data (2nd edn). Wiley: New York, 2002.
  Nur U, Shack LG, Rachet B, Carpenter JR, Coleman MP. Modelling relative survival in the presence of incomplete data: a tutorial. International Journal of Epidemiology 2010;39(1):118-28.
  Resseguier N, Giorgi R, Paoletti X. Sensitivity analysis when data are missing not-at-random. Epidemiology 2011;22(2):282.
  White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 2011;30(4):377-99.