Missing data and net survival analysis Bernard Rachet




Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics Warwick, 27-29 July 2015 Missing data and net survival analysis Bernard Rachet

General context
- Population-based, routine data: cancer registry data; clinical data (tumour, treatment, comorbidity)
- Cancer survival and the roles played by patient, tumour and healthcare factors
- (Very) large data sets, but incomplete information, which we have handled using a multiple imputation procedure with Rubin's rules

Preliminary results of on-going work

Multiple imputation procedure
Under the Missing At Random (MAR) assumption:
1. Impute the missing data from $f(Y_M \mid Y_O)$ to give K complete data sets
2. Fit the substantive model to each of the K data sets, to obtain K estimates of the parameters and estimates of their variance
3. Combine them using Rubin's rules

Multiple imputation steps
Incomplete data -> (imputation) -> K completed data sets -> (analysis) -> K analysis results -> (pooling) -> final results

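Pooling K estimates: Rubin's rules
Given K completed data sets, there are K estimates $\hat{\beta}_k$ with variances $\hat{\sigma}_k^2$, $k = 1, \ldots, K$.
Pooled estimate:
$$\hat{\beta}_{MI} = \frac{1}{K} \sum_{k=1}^{K} \hat{\beta}_k$$
Total variance:
$$\hat{V}_{MI} = \hat{W} + \left(1 + \frac{1}{K}\right) \hat{B}$$
Within-imputation variance:
$$\hat{W} = \frac{1}{K} \sum_{k=1}^{K} \hat{\sigma}_k^2$$
Between-imputation variance:
$$\hat{B} = \frac{1}{K-1} \sum_{k=1}^{K} \left(\hat{\beta}_k - \hat{\beta}_{MI}\right)^2$$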
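The pooling step can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the work presented: the function name `pool_rubin` and the three example estimates are hypothetical, and only the standard library is assumed.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool K point estimates and their variances with Rubin's rules."""
    k = len(estimates)
    if k < 2 or k != len(variances):
        raise ValueError("need K >= 2 estimates with matching variances")
    pooled = mean(estimates)        # pooled estimate: average of the K estimates
    within = mean(variances)        # W: within-imputation variance
    between = variance(estimates)   # B: sample variance, divisor K - 1
    total = within + (1 + 1 / k) * between
    return pooled, within, between, total


# Hypothetical log excess hazard ratios from K = 3 imputed data sets
est = [1.0, 1.2, 1.4]
var = [0.04, 0.05, 0.06]
pooled, within, between, total = pool_rubin(est, var)
print(pooled, within, between, total)
```

The between-imputation term is inflated by (1 + 1/K) to reflect the finite number of imputations, so the total variance approaches W + B as K grows.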

Multiple imputation procedure: congeniality
1. Imputation model congenial with substantive model
2. Given the substantive model $f(Y \mid X)$, $f(Y \mid X)\, g(X)$ is a congenial imputation model if both f and g are correctly specified
3. Valid inference (under MAR) if $f(Y \mid X)\, g(X)$ (approximately) represents data structure and substantive model

Concepts and measures of interest
Aims
- Prognosis of a cancer and impact at population level
Concepts
- Excess hazard
- Excess hazard ratio
- Net survival
- Crude probabilities of death from cancer and other causes
Relative survival data setting
- Population-based data
- Expected mortality hazard from life tables, by single year of age and sex, and calendar year, geography, deprivation

Nur et al, 2009 - Settings
- Population-based cohort of colorectal cancer patients
- Complete information on age, sex, follow-up time, vital status, deprivation, comorbidity, surgical treatment
- Tumour stage, morphology and grade: 45% incomplete data
Relative survival data setting
$$\lambda(x) = \lambda_P(x) + \exp(x\beta)$$
Substantive model: generalised linear model (Dickman et al, Stat Med 2005)
- $d_j \sim \mathrm{Poisson}(\mu_j)$; $\mu_j = \lambda_j y_j$; $y_j$: person-time at risk; $d_{Pj}$: expected number of deaths (from life tables)
- Link function: $\log(\mu_j - d_{Pj}) = \log(y_j) + x\beta$, with $\log(y_j)$ as the offset
- Excess hazard ratio (+ Ederer-2 relative survival)
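To make the link function concrete, the sketch below writes the Poisson log-likelihood implied by $\mu_j = d_{Pj} + y_j \exp(x\beta)$. It is an illustration under assumptions, not the fitting code used in the tutorial: the function name, the row layout and the single-stratum numbers are all hypothetical.

```python
import math


def excess_poisson_loglik(beta, rows):
    """Poisson log-likelihood (up to a constant) for the excess hazard GLM.

    Each row is (d, d_exp, y, x): observed deaths, expected deaths from
    life tables, person-time at risk, and a covariate tuple.
    The mean is mu = d_exp + y * exp(x . beta), i.e. log(mu - d_exp) = log(y) + x . beta.
    """
    total = 0.0
    for d, d_exp, y, x in rows:
        lin = sum(b * xi for b, xi in zip(beta, x))
        mu = d_exp + y * math.exp(lin)
        total += d * math.log(mu) - mu
    return total


# Hypothetical single stratum: 30 deaths, 10 expected, 100 person-years.
# With an intercept only, the likelihood peaks at an excess rate of (30 - 10) / 100 = 0.2.
rows = [(30, 10.0, 100.0, (1.0,))]
beta_hat = math.log(0.2)
ll_at_max = excess_poisson_loglik((beta_hat,), rows)
ll_above = excess_poisson_loglik((beta_hat + 0.1,), rows)
ll_below = excess_poisson_loglik((beta_hat - 0.1,), rows)
print(ll_at_max, ll_above, ll_below)
```

In practice this likelihood would be maximised over all strata by a GLM routine with a user-defined link; the single-stratum case is used here only because its maximum can be checked by hand.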

Data description

Variable     Category               No.       %
Patients                            29 563    100.0
Stage        I                      2 193     12.3
             II                     7 326     41.0
             III                    7 726     43.2
             IV                     643       3.6
             Missing                11 684    (39.5)
Morphology   Adenocarcinoma         23 693    90.7
             Mucinous and serous    2 314     8.9
             Other                  128       0.5
             Neoplasm, NOS          3 428     (11.6)
Grade        I                      3 212     14.5
             II                     16 047    72.4
             III/IV                 2 907     13.1
             Missing                7 397     (25.0)

Missing information associated with:
- Older ages
- More deprived categories
- Less treatment with curative intent
- Higher probability of death

Missing information in several variables
Multiple imputation using Full Conditional Specification (chained equations; van Buuren, 1999)
- Same basic assumptions as in multiple imputation
- Assumes a joint (multivariate) distribution exists without specifying its form
$$f(Y_{i,1}, Y_{i,2}, \ldots, Y_{i,p}) = f(Y_{i,p} \mid Y_{i,1}, \ldots, Y_{i,p-1})\, f(Y_{i,p-1} \mid Y_{i,1}, \ldots, Y_{i,p-2}) \cdots f(Y_{i,2} \mid Y_{i,1})\, f(Y_{i,1})$$
Imputation model (joint model for the data): $Y \sim N(\beta, \Omega)$
Gibbs sampler to:
1. Estimate the parameters in the joint imputation model
2. Impute the missing data
Multivariate problem split into a series of univariate problems

Imputation models
Outcomes
- Ordinal regression for stage and grade
- Polytomous regression for morphology
Covariables
- Other two covariables with incomplete information
- Sex, age, deprivation, comorbidity, treatment, cancer site
- Vital status
- Follow-up time (years): piecewise function (0, 0.5, 1, 2, 3, 4, 5, 5+)
- Time-dependent effects (categorical) for deprivation and age
Substantive (excess hazard) model includes all these variables, with (binary) time-dependent effects

Results

Variable     Category               No.       %         % after imputation
Patients                            29 563    100.0
Stage        I                      2 193     12.3      10.1
             II                     7 326     41.0      36.1
             III                    7 726     43.2      47.4
             IV                     643       3.6       6.2
             Missing                11 684    (39.5)
Morphology   Adenocarcinoma         23 693    90.7      90.5
             Mucinous and serous    2 314     8.9       8.9
             Other                  128       0.5       0.5
             Neoplasm, NOS          3 428     (11.6)
Grade        I                      3 212     14.5      13.6
             II                     16 047    72.4      72.0
             III/IV                 2 907     13.1      14.4
             Missing                7 397     (25.0)

Missing information associated with:
- Older ages
- More deprived categories
- Less treatment with curative intent
- Higher probability of death

Results
Excess hazard ratios (EHR) with 95% CI, by period since diagnosis over which the EHR was estimated

Stage, five years**
             Complete-case analysis (16 223 cases)    Multiple imputation (29 563 cases)
Stage        EHR     95% CI                           EHR     95% CI
I            1.0     -                                1.0     -
II           3.6     2.7 to 4.7                       2.6     2.2 to 3.0
III          10.2    7.7 to 13.5                      7.0     5.9 to 8.4
IV           26.4    19.6 to 35.5                     16.5    13.8 to 19.8
Missing

Age, first year and second to fifth years
             Complete-case analysis (16 223 cases)    Multiple imputation (29 563 cases)
             First year        Second to fifth        First year        Second to fifth
Age (years)  EHR (95% CI)      EHR (95% CI)           EHR (95% CI)      EHR (95% CI)
15 to 44     1.0 (-)           1.0 (-)                1.0 (-)           1.0 (-)
45 to 54     1.1 (0.8 to 1.5)  1.3 (1.0 to 1.6)       1.3 (1.0 to 1.6)  1.3 (1.1 to 1.5)
55 to 64     1.4 (1.0 to 1.9)  1.2 (1.0 to 1.5)       1.7 (1.4 to 2.1)  1.3 (1.1 to 1.5)
65 to 74     2.0 (1.5 to 2.7)  1.2 (1.0 to 1.5)       2.4 (2.0 to 2.9)  1.3 (1.1 to 1.6)
75 to 84     2.7 (2.0 to 3.7)  1.1 (0.9 to 1.4)       3.6 (2.9 to 4.3)  1.4 (1.2 to 1.6)
85 to 99     4.0 (2.9 to 5.5)  0.9 (0.7 to 1.3)       5.4 (4.4 to 6.6)  1.5 (1.2 to 1.9)

Other results
Indicator approach:
- Systematically underestimates variance of EHRs
- Overestimates EHRs for tumour morphology
- Underestimates EHRs for age and deprivation
- Does not identify time-dependent effects

Stage-specific survival
[Figure: relative survival (%) against years since diagnosis (0 to 5). Left panel: before imputation (stages I, II, III, IV and missing). Right panel: after imputation (stages I, II, III, IV).]

Limitations
- Tutorial paper: no systematic evaluation
- Relatively simple substantive model: piecewise model, categorical variables
Further recent methodological developments in:
- multiple imputation
- net survival, flexible modelling
More systematic evaluation: simulations

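Concepts and measures of interest
Excess hazard:
$$\lambda_E(t) = \lambda_O(t) - \lambda_P(t)$$
$$\lambda_O(t)\,dt = \frac{dN^W(t)}{Y^W(t)}; \qquad \lambda_P(t)\,dt = \frac{\sum_{i=1}^{n} Y_i^W(t)\,\lambda_{Pi}(t)\,dt}{Y^W(t)}$$
Net survival:
$$S_E(t) = \exp\left(-\int_0^t \lambda_E(u)\,du\right)$$
Crude mortality:
$$F_C(t) = \int_0^t S_O(u)\,\lambda_E(u)\,du$$
Weights: $W_i(t) = 1 / S_{Pi}(t)$, where $S_{Pi}(t)$ is the expected probability of surviving up to t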
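The relationship between the excess hazard, net survival and the crude probability of death from cancer can be checked numerically. The sketch below assumes the observed and expected hazards are already available on a regular time grid (it does not implement the weighted estimator itself); the constant hazards of 0.25 and 0.05 per year and the function names are hypothetical.

```python
import math


def cumulative_trapezoid(values, step):
    """Running integral of values sampled on a regular grid (trapezoid rule)."""
    out = [0.0]
    for a, b in zip(values, values[1:]):
        out.append(out[-1] + 0.5 * (a + b) * step)
    return out


def net_and_crude(lam_o, lam_p, step):
    """Net survival S_E(t) and crude cancer mortality F_C(t) on the grid."""
    lam_e = [o - p for o, p in zip(lam_o, lam_p)]          # excess hazard
    net = [math.exp(-h) for h in cumulative_trapezoid(lam_e, step)]
    s_o = [math.exp(-h) for h in cumulative_trapezoid(lam_o, step)]
    crude = cumulative_trapezoid([s * e for s, e in zip(s_o, lam_e)], step)
    return net, crude


# Hypothetical constant hazards over 0 to 5 years, grid step 0.01 year
step = 0.01
n_points = 501
lam_o = [0.25] * n_points   # observed (all-cause) hazard
lam_p = [0.05] * n_points   # expected (life-table) hazard
net, crude = net_and_crude(lam_o, lam_p, step)
print(net[-1], crude[-1])
```

With constant hazards the results have closed forms, exp(-0.2 t) for net survival and 0.8 (1 - exp(-0.25 t)) for crude mortality, which makes the example easy to verify. Net survival is always at least as high as one minus the crude probability of death from cancer, because other causes of death are removed rather than competing.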

Modelling approach
Flexible multivariable excess hazard model
- Excess hazard
- Time-dependent and non-linear effects (splines)
- Variables affecting both mortality processes (cancer and other causes of death) included in the model
Net survival is the mean of individual net survival functions predicted by the model
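Written out, the last statement reads as follows. This is a sketch of the notation only, assuming n patients with covariate vectors $x_i$ and a model-predicted cumulative excess hazard $\Lambda_E(t \mid x_i)$; neither symbol is defined on this slide.

```latex
S_E(t) = \frac{1}{n} \sum_{i=1}^{n} S_E(t \mid x_i)
       = \frac{1}{n} \sum_{i=1}^{n} \exp\left\{-\Lambda_E(t \mid x_i)\right\}
```

Averaging the individual survival functions, rather than evaluating the model at the mean covariate values, gives a population-level net survival that reflects the covariate distribution of the cohort.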

Multiple imputation procedure: congeniality
1. Imputation model congenial with substantive model
2. Given the substantive model $f(Y \mid X)$, $f(Y \mid X)\, g(X)$ is a congenial imputation model if both f and g are correctly specified
3. Valid inference (under MAR) if $f(Y \mid X)\, g(X)$ (approximately) represents data structure and substantive model
4. Problematic within net survival setting and with nonlinear and time-dependent effects

Falcaro et al, 2015 - Study settings
Data
- 44,461 men diagnosed with a colorectal cancer in 1998-2006, followed up to 2009
- Age at diagnosis (continuous), tumour stage (4 categories), deprivation (5 categories)
Missing stage: 30% (R = 1 if stage missing)
- MCAR: $\mathrm{logit}\,\Pr(R_i = 1 \mid Z_i) = \delta_0$
- MAR on X: $\mathrm{logit}\,\Pr(R_i = 1 \mid Z_i) = \alpha_0 + \alpha_1(\mathrm{age}_i - 60)$
- MAR: $\mathrm{logit}\,\Pr(R_i = 1 \mid Z_i) = \gamma_0 + \gamma_1(\mathrm{age}_i - 60) + \gamma_2 T_i + \gamma_3 D_i$
100 simulated data sets per scenario
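The three missingness mechanisms can be simulated from their logistic models as sketched below. Everything numeric here is an assumption for illustration: the coefficient values, the simulated cohort and the function names are hypothetical, and $T_i$ and $D_i$ are read as follow-up time and vital status, which the slide does not state explicitly.

```python
import math
import random


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))


def simulate_missing(records, coefs, rng):
    """Draw R_i (1 = stage missing) from a logistic model.

    records: iterable of (age, time, dead); coefs: (intercept, b_age, b_time, b_dead).
    """
    b0, b_age, b_time, b_dead = coefs
    flags = []
    for age, time, dead in records:
        lin = b0 + b_age * (age - 60) + b_time * time + b_dead * dead
        flags.append(1 if rng.random() < expit(lin) else 0)
    return flags


rng = random.Random(2015)
# Hypothetical cohort: age at diagnosis, follow-up time (years), death indicator
records = [(rng.uniform(40, 90), rng.uniform(0, 10), rng.randint(0, 1))
           for _ in range(20000)]
logit_30 = math.log(0.3 / 0.7)  # intercept giving 30% missing when all slopes are zero

mcar = simulate_missing(records, (logit_30, 0.0, 0.0, 0.0), rng)
mar_x = simulate_missing(records, (logit_30, 0.03, 0.0, 0.0), rng)
mar = simulate_missing(records, (logit_30, 0.03, -0.05, 0.5), rng)

prop_mcar = sum(mcar) / len(mcar)
old = [r for r, rec in zip(mar_x, records) if rec[0] > 75]
young = [r for r, rec in zip(mar_x, records) if rec[0] < 55]
print(prop_mcar, sum(old) / len(old), sum(young) / len(young))
```

Under MCAR the missing proportion is the same at every age, while under the MAR scenarios it rises with age; in a real simulation the intercepts would be tuned so that each scenario gives about 30% missing stage overall.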

Distribution on fully observed data and empirical expected distribution in remaining complete records

Substantive model
Flexible log cumulative excess hazard model:
$$\ln \Lambda_E(t \mid x_i) = s_1(\ln t; \gamma_1, k_1) + \beta x_i + s_2(\mathrm{age}_i; \gamma_2, k_2)$$
Flexible functions: restricted cubic splines
- Baseline excess hazard: 5 df, 4 internal knots and 2 boundary knots
- Age (continuous): 3 df, 2 internal knots
Covariables: deprivation and stage
Aims: estimate effect of stage (log EHR) and stage-specific net survival at 1, 5 and 10 years since diagnosis

Imputation models
Outcome (stage)
- Ordinal or multinomial logistic regression
Covariables
- Survival time and log(survival time), or Nelson-Aalen estimate of the cumulative hazard
- Event indicator
- Age splines defined as in the substantive model
- Deprivation dummy variables
30 imputations
Net survival: Rubin's rules applied on $\log(-\log S_E(t))$ to obtain approximate normality, then back-transformed
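Pooling net survival on the complementary log-log scale can be sketched as below. This is a simplified illustration, not the authors' code: the standard errors are moved to the transformed scale with the delta method, the interval uses a normal quantile of 1.96 instead of the t reference distribution normally used with Rubin's rules, and the function name and input values are hypothetical.

```python
import math
from statistics import mean, variance


def pool_net_survival(surv, se):
    """Pool K net survival estimates on the log(-log S) scale.

    Returns (pooled, lower, upper) on the survival scale.
    """
    k = len(surv)
    # Transform each estimate: g = log(-log S)
    g = [math.log(-math.log(s)) for s in surv]
    # Delta method: Var(g) is approximately Var(S) / (S * log S)^2
    g_var = [(e / (s * math.log(s))) ** 2 for s, e in zip(surv, se)]
    g_bar = mean(g)
    total = mean(g_var) + (1 + 1 / k) * variance(g)
    half = 1.96 * math.sqrt(total)  # assumption: normal quantile, not Rubin's t

    def back(v):
        return math.exp(-math.exp(v))

    # g decreases as S increases, so the upper limit of g gives the lower limit of S
    return back(g_bar), back(g_bar + half), back(g_bar - half)


# Hypothetical 5-year net survival estimates from K = 3 imputed data sets
pooled, lower, upper = pool_net_survival([0.60, 0.62, 0.58], [0.02, 0.02, 0.02])
same, _, _ = pool_net_survival([0.5, 0.5, 0.5], [0.03, 0.03, 0.03])
print(pooled, lower, upper)
```

Working on this scale keeps the back-transformed limits inside (0, 1), which pooling directly on the survival scale does not guarantee.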

Multiple imputation strategy

Strategy          Functional form        How survival is modelled in the imputation
MI_ologit_surv    Ordinal logistic       Survival time and log survival time
MI_ologit_na      Ordinal logistic       Nelson-Aalen estimate of cumulative hazard
MI_mlogit_surv    Multinomial logistic   Survival time and log survival time
MI_mlogit_na      Multinomial logistic   Nelson-Aalen estimate of cumulative hazard
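For the two "_na" strategies, each patient's Nelson-Aalen cumulative hazard at their own follow-up time enters the imputation model as a covariate. A minimal sketch of that estimator follows, assuming all-cause deaths as events; the function name and the four-patient data set are hypothetical.

```python
def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard H(t) = sum over t_j <= t of d_j / n_j.

    times: follow-up times; events: 1 = death, 0 = censored.
    Returns a dict mapping each distinct follow-up time to H at that time.
    """
    cum = 0.0
    out = {}
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)                       # n_j
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)  # d_j
        cum += deaths / at_risk
        out[t] = cum
    return out


# Hypothetical follow-up times (years) and death indicators
times = [1.0, 2.0, 2.0, 3.0]
events = [1, 1, 0, 1]
H = nelson_aalen(times, events)
# Covariate for the imputation model: H evaluated at each patient's own time
covariate = [H[t] for t in times]
print(covariate)
```

This quadratic-time version is meant for readability; with cohorts of the size described here a sorted single pass would be preferable.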

Results
Bias in log excess hazard ratio estimates for stage (reference stage 1), 100 replications
Poor results with ordered logit, even under the MCAR scenario

Stage-specific net survival at 1 year, 100 replications

Results Bias in stage-specific net survival estimates at 1 year, 100 replications

Comments
Promising results, even though the parameter estimated in the substantive model (here excess hazard) does not correspond to the final outcome of interest (net survival)
Limitations
- No time-dependent effects of stage
- Which joint model?
- Which variables in the imputation models?
  - Vital status
  - Nelson-Aalen estimates of cumulative hazard
  - Interactions with time since diagnosis (age at diagnosis, deprivation, ...)
  - Other relevant interactions (tumour stage, region, ...)
  - Other factors (treatment variables, co-morbidities, hospital volume, surgeon's experience, ...)

Limitations and challenges: preliminary study
- Simulated data set: colon cancer, 12,048 men followed up at least 5 years
- Baseline excess hazard: 5 df, 4 internal knots
- Covariables: stage, deprivation, age
- Time-dependent effects of stage: 2 df, 1 internal knot for each higher stage
- Non-linear effects of age: 3 df, 2 internal knots
Substantive model:
$$\ln \Lambda_E(t \mid x_i) = s_1(\ln t; \gamma_1, k_1) + \beta x_i + s_2(\mathrm{age}_i; \gamma_2, k_2) + s_{3j}(\mathrm{stage}_j \times t; \gamma_3, k_3)$$
- Missing stage simulated as in previous example
- 100 data sets per scenario, with 30% missing stage
- Focus on MAR here

Limitations and challenges: preliminary study

Net survival function
Stage   Time (year)   Complete   MAR
1       1             0.95       0.99
1       5             0.91       0.99
2       1             0.90       0.97
2       5             0.78       0.90
3       1             0.77       0.86
3       5             0.46       0.59
4       1             0.32       0.41
4       5             0.06       0.09

- Simulation of missingness mechanisms as in previous example
- Same imputation model was applied (multinomial, Nelson-Aalen)

Results: excess hazard ratios for stage
[Figures, one panel each for tumour stages 2, 3 and 4 (reference stage 1): true EHR, complete-case EHRs and imputed EHRs against time since diagnosis (0 to 5 years).]

Results: stage-specific net survival
[Figures, one panel each for tumour stages 1, 2, 3 and 4: net survival (0 to 1) against time since diagnosis (0 to 5 years).]

Conclusion and development
Why MI?
- Strength: clear division between imputation and analysis stages; both efficiency and MAR plausibility increased
- Challenge: incompatibility between imputation and substantive models, leading to asymptotically biased estimates
Development
- Define joint model for flexible excess hazard models
- Multiple imputation by fully conditional specification with substantive model compatible algorithm (SMC-FCS): Bartlett JW et al. Statistical Methods in Medical Research 2015

References
Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: John Wiley & Sons; 1987.
Van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 1999; 18: 681-94.
White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med 2009; 28: 1982-98.
Nur U, Shack LG, Rachet B, Carpenter JR, Coleman MP. Modelling relative survival in the presence of incomplete data: a tutorial. Int J Epidemiol 2010; 39: 118-28.
Carpenter JR, Kenward MG. Multiple imputation and its application. Chichester: John Wiley & Sons; 2013.
Falcaro M, Nur U, Rachet B, Carpenter JR. Estimating excess hazard ratios and net survival when covariate data are missing: strategies for multiple imputation. Epidemiology 2015; 26: 421-8.
Bartlett JW, Seaman SR, White IR, Carpenter JR. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Methods Med Res 2015; 24: 462-97.
http://www.missingdata.org.uk/