Dealing with Missing Data
Roch Giorgi (email: roch.giorgi@univ-amu.fr)
UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France
BioSTIC, APHM, Hôpital Timone, Marseille, France
January 23, 2014
EPAAC WP9 Satellite Meeting, Ispra (Italy)
Background (1)
- Importance of quality control is well known
- Covariate values may be missing for some subjects
  - Collected routinely: tumor size, lymph node status, metastasis (mainly)
  - Collected for specific studies: estrogen receptor, socioprofessional category, ...
- Missing values may concern
  - Dependent variable: time/status in survival analysis
  - Independent variable(s): tumor size, ...
- Whatever the question (incidence, survival, ...)
Roch Giorgi, SESSTIM, Faculty of Medicine, Aix-Marseille University; CENSUR working survival group
Background (2)
- Consequences of missing data
  - Loss of irrelevant/non-informative information
    - No impact on estimates
  - Loss of relevant/informative information
    - Impact depends on the percentage of missing values
    - Possible bias in both point estimates and standard errors
    - Loss of statistical power
- Univariate/multivariate analysis?
  - Multivariate analysis: increase of the total percentage of missing values
- What can we do?
  - Discard the whole data set?
  - Choose an appropriate method to perform the analysis?
Objectives
- Present an overview of
  - The types of missing data
  - Some methods used to deal with missing data
- Provide outline guidelines
Missing Data Mechanism: Notations
- $Y = (y_{ij})$: $(n \times k)$ rectangular data set without missing values
- $M = (m_{ij})$
  - $m_{ij} = 1$ if $y_{ij}$ is missing
  - $m_{ij} = 0$ if $y_{ij}$ is present
  - Defines the missingness pattern
[Figure: three missingness patterns for subjects 1, ..., n and variables $Y_1$, ..., $Y_k$, with "?" marking missing values. Panels: Univariate, Monotone, Non-monotone.]
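The indicator matrix and the three patterns can be sketched in code. The following is a minimal illustration in plain Python, assuming missing entries are encoded as `None`. The function names `missingness_matrix` and `classify_pattern` are hypothetical, and the monotone check assumes the usual definition: the variables can be reordered so that whenever a variable is missing for a subject, all later variables are missing too.

```python
from typing import List, Optional, Sequence


def missingness_matrix(data: Sequence[Sequence[Optional[float]]]) -> List[List[int]]:
    """Return M with m_ij = 1 if y_ij is missing (None) and 0 if it is present."""
    return [[1 if value is None else 0 for value in row] for row in data]


def classify_pattern(m: Sequence[Sequence[int]]) -> str:
    """Classify a missingness matrix as complete, univariate, monotone or non-monotone."""
    k = len(m[0])
    missing_counts = [sum(row[j] for row in m) for j in range(k)]
    incomplete = [j for j in range(k) if missing_counts[j] > 0]
    if not incomplete:
        return "complete"
    if len(incomplete) == 1:
        return "univariate"
    # Under a monotone pattern the sets of subjects with missing values are
    # nested, so ordering the variables by their number of missing values
    # recovers a valid ordering whenever one exists.
    order = sorted(range(k), key=lambda j: missing_counts[j])
    for row in m:
        seen_missing = False
        for j in order:
            if row[j] == 1:
                seen_missing = True
            elif seen_missing:
                # An observed value after a missing one breaks monotonicity.
                return "non-monotone"
    return "monotone"


# Hypothetical toy data: 3 subjects, 3 variables.
data = [
    [1.0, 2.0, 3.0],
    [1.5, 2.5, None],
    [0.5, None, None],
]
M = missingness_matrix(data)
print(M)                    # [[0, 0, 0], [0, 0, 1], [0, 1, 1]]
print(classify_pattern(M))  # monotone
```

The pattern matters in practice because some imputation strategies are simpler under a monotone pattern than under a non-monotone one.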
Missing Data Mechanism: Classification
- Characterized by the conditional distribution of $M$ given $Y$
- Missing Completely At Random (MCAR)
  - Missingness mechanism independent of the values of the data $Y$, whether missing ($Y_{mis}$) or observed ($Y_{obs}$)
- Missing At Random (MAR)
  - Missingness mechanism depends only on $Y_{obs}$, not on $Y_{mis}$
- Missing Not At Random (MNAR)
  - Missingness mechanism depends on $Y_{mis}$
- Ignorable (MCAR, MAR) / non-ignorable (MNAR) missing data
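The three definitions can be written compactly in terms of the conditional distribution of $M$ given $Y$. In the sketch below, $\phi$ denotes the unknown parameters of the missingness mechanism; this symbol is an assumption borrowed from the usual textbook notation (e.g. Little and Rubin, 2002) and does not appear on the slide.

```latex
\begin{align*}
\text{MCAR:}\quad & f(M \mid Y_{obs}, Y_{mis}, \phi) = f(M \mid \phi) \\
\text{MAR:}\quad  & f(M \mid Y_{obs}, Y_{mis}, \phi) = f(M \mid Y_{obs}, \phi) \\
\text{MNAR:}\quad & f(M \mid Y_{obs}, Y_{mis}, \phi) \text{ still depends on } Y_{mis}
\end{align*}
```

MCAR is therefore a special case of MAR, which is why both are grouped as ignorable.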
Missing Data Mechanism
- What do we learn from that?
  - MCAR, MAR: handling missing data in an appropriate way does not require modelling the missingness process
- Statistical tests
  - $H_0$: MCAR vs MAR? Yes
  - $H_0$: ignorable vs non-ignorable? No
- Classical methods used to handle missing data
  - Provide valid statistical inferences with ignorable missing data
  - Are not valid with non-ignorable missing data
- Sensitivity analyses under various scenarios of nonresponse when the MNAR hypothesis is suspected (e.g. self-reported characteristics such as psychological disorders, quality of life, income, ...)
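One simple way to build such scenarios is a delta adjustment: values imputed under MAR are shifted by a fixed amount `delta`, and the analysis is repeated for several values of `delta`. This is a sketch of one possible approach, not necessarily the one the author has in mind; the function name `delta_adjusted_means` and the toy numbers are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence


def delta_adjusted_means(
    observed: Sequence[float],
    imputed: Sequence[float],
    deltas: Sequence[float],
) -> Dict[float, float]:
    """Mean of the completed variable when every imputed value is shifted by delta.

    delta = 0 corresponds to the MAR analysis; other values represent MNAR
    scenarios in which non-respondents differ systematically from respondents.
    """
    results = {}
    for delta in deltas:
        completed = list(observed) + [value + delta for value in imputed]
        results[delta] = mean(completed)
    return results


# Hypothetical values: three observed scores and two scores imputed under MAR.
observed = [10.0, 12.0, 14.0]
imputed = [11.0, 13.0]
scenarios = delta_adjusted_means(observed, imputed, deltas=[0.0, -2.0, -4.0])
for delta, estimate in scenarios.items():
    print(f"delta = {delta:+.1f}: mean = {estimate:.2f}")
```

If the conclusions are stable over a plausible range of `delta`, the MAR-based results can be reported with more confidence.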
Classical Methods
- Complete cases
- Indicator variable
- Multiple imputation
- ... and others
Complete Cases Method
- Based only on the individuals having no missing values on the covariates included in the analysis
- The preferred method of many statistical software packages!
- Pros
  - Easy to perform! (but not necessarily a good point)
  - Unbiased results under the MCAR hypothesis
- Cons
  - Reduction of sample size
  - Loss of statistical power
  - Bias in standard errors
  - Inappropriate variable selection (regression analysis)
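A small simulation can illustrate why the MCAR assumption matters for a complete-case analysis. The sketch below uses plain Python and entirely hypothetical data: `x` is always observed, `y` has true mean 0, and `y` is deleted either completely at random or depending on `x`. The quantity compared is the complete-case mean of `y`; the function name, sample size, and seed are illustrative assumptions.

```python
import random
from statistics import mean
from typing import Tuple


def simulate_complete_case_means(n: int = 20000, seed: int = 2014) -> Tuple[float, float, float]:
    """Return the mean of y on the full data, and on complete cases under MCAR and MAR."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [xi + rng.gauss(0.0, 1.0) for xi in x]  # true mean of y is 0

    # MCAR: each y is deleted with probability 0.5, whatever the data values.
    mcar_kept = [yi for yi in y if rng.random() >= 0.5]

    # MAR: y is deleted whenever the fully observed x is positive,
    # so missingness depends only on observed data.
    mar_kept = [yi for xi, yi in zip(x, y) if xi <= 0.0]

    return mean(y), mean(mcar_kept), mean(mar_kept)


full, cc_mcar, cc_mar = simulate_complete_case_means()
print(f"full data:           {full:.3f}")
print(f"complete cases MCAR: {cc_mcar:.3f}")  # close to the full-data mean
print(f"complete cases MAR:  {cc_mar:.3f}")   # clearly shifted below 0
```

In this setup the complete-case mean under MAR estimates the mean of `y` among subjects with `x <= 0`, which in theory is about -0.8, instead of the population mean of 0.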
Indicator Variable Method
- Creation of a missing data indicator variable
- Treat missing data as just another category
- Pros
  - Includes all the observations in the analysis
  - No loss of statistical power
  - May help to interpret results (similarity with another category)
- Cons
  - Biased estimates (usually)
  - May not help to interpret results (absence of similarity)
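As an illustration, the recoding step of this method can be sketched in a few lines of plain Python. The helper names `add_missing_category` and `dummy_code`, the label `"Missing"`, and the T-stage values are hypothetical; the resulting dummy variables would then be passed to whatever regression routine is in use.

```python
from typing import List, Optional, Sequence, Tuple


def add_missing_category(values: Sequence[Optional[str]], label: str = "Missing") -> List[str]:
    """Replace missing entries (None) by an explicit extra category."""
    return [label if value is None else value for value in values]


def dummy_code(values: Sequence[str], reference: str) -> Tuple[List[str], List[List[int]]]:
    """Return the non-reference levels and one 0/1 column per level."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows


# Hypothetical T-stage values with two missing entries.
t_stage = ["T1", None, "T2", "T1", None, "T3"]
recoded = add_missing_category(t_stage)
levels, design = dummy_code(recoded, reference="T1")
print(levels)  # ['Missing', 'T2', 'T3']
for row in design:
    print(row)
```

All six subjects stay in the analysis, but the coefficient of the `Missing` column has no clinical meaning of its own, which is one source of the bias listed under the cons.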
Multiple Imputation (MI): Principle
- Step 1 (MAR assumption)
  - Imputation of the missing values to create M completed data sets
- Step 2
  - Analysis of each of these completed data sets, giving estimates and standard errors
- Step 3
  - Combination to produce a single set of estimates with their standard errors
[Diagram: incomplete data set -> imputation model -> completed data sets 1, 2, ..., M -> analysis model -> $e_1$ ($se_1$), $e_2$ ($se_2$), ..., $e_M$ ($se_M$) -> combined $e$ ($se$)]
MI: Imputation of the Missing Values
- Goal: to account for the relationships between $Y_{mis}$ and $Y_{obs}$, while taking into account the uncertainty of the imputation
  - $Y^{*} \sim f(Y_{mis} \mid Y_{obs})$
- Imputation model (non-exhaustive)
  - Continuous variable (e.g. age at diagnosis): propensity methods, predictive mean matching
  - Binary data (e.g. M-stage): logistic regression
  - Categorical data (e.g. T-stage): polytomous logistic regression, proportional odds
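To make the idea of an imputation model concrete, here is a simplified sketch of predictive mean matching for one continuous variable with a single fully observed predictor. It is an illustration under strong assumptions, not a full implementation: uncertainty in the regression coefficients is approximated by refitting on a bootstrap sample of the observed cases, and the names `fit_line`, `pmm_impute`, the donor pool size `k`, and the data are all hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares intercept and slope of y on x (slope 0 if x is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope


def pmm_impute(
    x: Sequence[float],
    y: Sequence[Optional[float]],
    k: int = 3,
    rng: Optional[random.Random] = None,
) -> List[Optional[float]]:
    """Return a copy of y where each missing value is replaced by a donor's observed value."""
    rng = rng or random.Random()
    observed = [i for i, value in enumerate(y) if value is not None]
    # Bootstrap the observed cases so that each completed data set is based
    # on slightly different regression coefficients.
    boot = [rng.choice(observed) for _ in observed]
    intercept, slope = fit_line([x[i] for i in boot], [y[i] for i in boot])
    completed = list(y)
    for i, value in enumerate(y):
        if value is None:
            target = intercept + slope * x[i]
            # The k observed subjects whose predicted value is closest to the target.
            donors = sorted(
                observed, key=lambda j: abs(intercept + slope * x[j] - target)
            )[:k]
            completed[i] = y[rng.choice(donors)]
    return completed


# Hypothetical data: a fully observed predictor and an incomplete age at diagnosis.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
age_at_diagnosis = [42.0, 47.0, None, 55.0, 61.0, None, 70.0, 74.0]
rng = random.Random(1)
completed_sets = [pmm_impute(predictor, age_at_diagnosis, k=3, rng=rng) for _ in range(5)]
for completed in completed_sets:
    print(completed)
```

Because imputed values are borrowed from real subjects, they always stay within the range of the observed data, which is one reason this method is popular for skewed or bounded variables.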
MI: Analyses of the Completed Data Sets
- Analysis model: classical methods used to estimate
  - Incidence
  - Survival
  - Effect of prognostic factors
- Independent analyses
  - Each applied to the new completed data sets
MI: Combined Analysis
- Combination of the M estimates into an overall estimate and variance-covariance matrix using Rubin's rules
- Takes into account the uncertainty due to missing data
- Statistics that can be combined
  - Mean, proportion, regression coefficient, ...
- Statistics that may require transformation
  - Odds ratio, hazard ratio, baseline hazard, survival probability, ...
- Statistics that cannot be combined
  - P-value, likelihood ratio test statistic, ...
Adapted from: White IR, et al. Statistics in Medicine 2009
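For a single parameter, Rubin's rules reduce to a few lines of arithmetic: the pooled estimate is the mean of the M estimates, and the total variance is the mean within-imputation variance plus (1 + 1/M) times the between-imputation variance. The sketch below applies them to hypothetical log hazard ratios, pooling on the log scale and back-transforming afterwards. The function name `pool_rubin` and the numbers are illustrative, and the interval uses a normal quantile (1.96) as a simplification instead of the t reference distribution.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pool_rubin(estimates: Sequence[float], variances: Sequence[float]) -> Tuple[float, float]:
    """Pool M scalar estimates and their variances with Rubin's rules.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    pooled = mean(estimates)        # overall estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (M - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Hypothetical log hazard ratios and standard errors from M = 5 completed data sets.
log_hr = [0.40, 0.50, 0.45, 0.55, 0.35]
se = [0.10, 0.11, 0.10, 0.12, 0.10]

pooled_log_hr, total_var = pool_rubin(log_hr, [s ** 2 for s in se])
pooled_se = math.sqrt(total_var)

# Back-transform to the hazard ratio scale only after pooling.
hazard_ratio = math.exp(pooled_log_hr)
ci_low = math.exp(pooled_log_hr - 1.96 * pooled_se)
ci_high = math.exp(pooled_log_hr + 1.96 * pooled_se)
print(f"HR = {hazard_ratio:.2f} (approx. 95% CI {ci_low:.2f} to {ci_high:.2f})")
```

Pooling the hazard ratios directly, without the log transformation, would average a skewed quantity, which is why such statistics appear in the "may require transformation" group.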
MI: Issues and Guidance for Practice (1)
- How many missing values at most?
  - Do not think in terms of the % of missing values per covariate, but in terms of the % reduction from the original data set when all variables used for the analyses are considered
  - Think about the missingness mechanism
- Which variables to include in the imputation model?
  - Covariates and outcome from the analysis model
    - In a survival model: status, time (t, log(t)) or cumulative baseline hazard function
  - All predictors of the incomplete variable
  - The number of variables in the imputation model may be greater than in the analysis model
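The difference between the two ways of counting can be checked directly on a data set. The sketch below, in plain Python with hypothetical variable names and data, computes the percentage of missing values per variable and the percentage of subjects lost to a complete-case analysis when all variables are used together.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[float]]


def percent_missing_by_variable(rows: List[Record]) -> Dict[str, float]:
    """Percentage of missing values (None) for each variable."""
    n = len(rows)
    return {
        name: 100.0 * sum(1 for row in rows if row[name] is None) / n
        for name in rows[0]
    }


def percent_incomplete_cases(rows: List[Record]) -> float:
    """Percentage of subjects with at least one missing value."""
    n = len(rows)
    incomplete = sum(1 for row in rows if any(value is None for value in row.values()))
    return 100.0 * incomplete / n


# Hypothetical data: 10 subjects, each variable missing for 2 different subjects.
rows: List[Record] = []
for i in range(10):
    rows.append({
        "tumor_size": None if i in (0, 1) else 20.0,
        "lymph_nodes": None if i in (2, 3) else 1.0,
        "metastasis": None if i in (4, 5) else 0.0,
    })

print(percent_missing_by_variable(rows))  # 20.0 for each variable
print(percent_incomplete_cases(rows))     # 60.0
```

Each covariate looks acceptable on its own at 20% missing, yet a complete-case multivariate analysis would discard 60% of the subjects.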
MI: Issues and Guidance for Practice (2)
- Should we pay attention to the form of the imputation model?
  - Yes in theory, but hard to do (linearity? interaction terms? ...)
- How many imputations are necessary?
  - M = 5 to 10 is usually considered adequate
  - Other rules exist, based on the fraction of missing data
- Do we have to perform new imputations for each analysis?
  - The imputed data sets may be used for several analyses
  - The imputation model then needs to be built with care (to be more "congenial" with the analysis models)
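Two such rules can be sketched as follows. The first is Rubin's relative efficiency of M imputations compared with an infinite number, 1 / (1 + gamma / M), where gamma is the fraction of missing information. The second is the rule of thumb, discussed in White et al. (2011), that M should be at least the percentage of incomplete cases. The function names and the default minimum of 5 are assumptions made for this illustration.

```python
import math


def relative_efficiency(fraction_missing_info: float, m: int) -> float:
    """Efficiency of m imputations relative to infinitely many: 1 / (1 + gamma / m)."""
    return 1.0 / (1.0 + fraction_missing_info / m)


def imputations_rule_of_thumb(fraction_incomplete_cases: float, minimum: int = 5) -> int:
    """Number of imputations at least equal to the percentage of incomplete cases."""
    # Rounding before taking the ceiling avoids floating-point artefacts
    # such as 100 * 0.07 evaluating to slightly more than 7.
    percent = round(100.0 * fraction_incomplete_cases, 6)
    return max(minimum, math.ceil(percent))


for m in (5, 10, 20):
    print(f"M = {m:2d}: relative efficiency = {relative_efficiency(0.3, m):.3f}")

print(imputations_rule_of_thumb(0.27))  # 27
print(imputations_rule_of_thumb(0.02))  # 5
```

The efficiency argument explains why M = 5 to 10 was long considered enough, while the percentage rule targets reproducibility of the results and often asks for more imputations.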
MI: Issues and Guidance for Practice (3)
- Is there a particular model building strategy?
  - Variable selection can be performed on all imputed data sets, or on a single data set (after merging) with an appropriate weighting procedure
  - Model checking could be performed on each imputed data set
  - Predictions could be obtained using Rubin's rules
- How to be confident that the missingness mechanism is ignorable?
  - Think about your data
  - Perform sensitivity analyses
Thank you
Roch Giorgi (email: roch.giorgi@univ-amu.fr)
CENSUR working survival group: Challenges in the Estimation of Net SURvival
French National Research Agency (ANR-12-BSV1-0028)
References
- Eisemann N, Waldmann A, Katalinic A. Imputation of missing values of tumour stage in population-based cancer registration. BMC Medical Research Methodology 2011;11:129.
- Giorgi R, Belot A, Gaudart J, Launoy G; French Network of Cancer Registries FRANCIM. The performance of multiple imputation for missing covariate data within the context of regression relative survival analysis. Statistics in Medicine 2008;27(30):6310-31.
- Howlader N, Noone AM, Yu M, Cronin KA. Use of imputed population-based cancer registry data as a method of accounting for missing information: application to estrogen receptor status for breast cancer. American Journal of Epidemiology 2012;176(4):347-56.
- Little RJA, Rubin DB. Statistical Analysis with Missing Data (2nd edn). Wiley: New York, 2002.
- Nur U, Shack LG, Rachet B, Carpenter JR, Coleman MP. Modelling relative survival in the presence of incomplete data: a tutorial. International Journal of Epidemiology 2010;39(1):118-28.
- Resseguier N, Giorgi R, Paoletti X. Sensitivity analysis when data are missing not-at-random. Epidemiology 2011;22(2):282.
- White IR, Royston P, Wood AM. Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine 2011;30(4):377-99.