Handling missing data in Stata: a whirlwind tour





Handling missing data in Stata: a whirlwind tour
2012 Italian Stata Users Group Meeting
Jonathan Bartlett
www.missingdata.org.uk
20th September 2012

Outline
- The problem of missing data and a principled approach
- Missing data assumptions
- Complete case analysis
- Multiple imputation
- Inverse probability weighting
- Conclusions


The problem of missing data

Missing data are a pervasive problem in epidemiological, clinical, social, and economic studies. Missing data always cause some loss of information which cannot be recovered, but statistical methods can often help us make the best use of the data which have been observed. More seriously, missing data can introduce bias into our estimates.

Untestable assumptions

Whether missing data cause bias depends on how missingness is associated with our variables. Crucially, with missing data we cannot empirically verify the required assumptions. For example, consider the following distribution of smoking status (for males in THIN, from [1]):

  Smoking status        n     % of sample   % of those observed
  Non              82,479              36                    48
  Ex               30,294              13                    18
  Current          57,599              25                    34
  Missing          56,661              25                   n/a

Are the percentages in the last column unbiased estimates?
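The percentages in the table can be reproduced from the counts alone. A quick pure-Python check (Python rather than Stata, purely for illustration):

```python
# Counts from the THIN smoking-status table above
counts = {"Non": 82479, "Ex": 30294, "Current": 57599, "Missing": 56661}

total = sum(counts.values())           # full sample
observed = total - counts["Missing"]   # those with smoking status recorded

pct_sample = {k: round(100 * v / total) for k, v in counts.items()}
pct_observed = {k: round(100 * v / observed)
                for k, v in counts.items() if k != "Missing"}

print(pct_sample)    # {'Non': 36, 'Ex': 13, 'Current': 25, 'Missing': 25}
print(pct_observed)  # {'Non': 48, 'Ex': 18, 'Current': 34}
```

The "% of those observed" column simply renormalises the three observed categories; whether those renormalised percentages are unbiased is exactly the untestable question the slide poses.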

A principled approach to missing data

We cannot be sure that the required assumptions are true given the observed data. Data analysis and contextual knowledge should be used to decide what assumption(s) about missingness are plausible. We can then choose a statistical method which is valid under this/these assumption(s).


Rubin's classification

Rubin developed a classification for missing data mechanisms [2]. We introduce the three types in a very simple setting. We assume we have one fully observed variable X (age), and one partially observed variable Y (blood pressure (BP)). We will let R indicate whether Y is observed (R = 1) or is missing (R = 0).

Missing completely at random

The missing values in BP (Y) are said to be missing completely at random (MCAR) if missingness is independent of BP (Y) and age (X), i.e. those subjects with missing BP do not differ systematically (in terms of BP or age) from those with BP observed. In terms of the missingness indicator R, MCAR means

    P(R = 1 | X, Y) = P(R = 1)

e.g. 1 in 10 questionnaires were mistakenly printed with a page missing.

Example - blood pressure (simulated data)

We assume age has been categorised into 30-50 and 50-70. n = 200, but only 99 subjects have BP observed:

  Age      n    Mean (SD) BP
  30-50   72    129.7 (10.3)
  50-70   27    160.6 (11.7)

Checking MCAR

With the observed data, we could investigate whether age (X) is associated with missingness of blood pressure (R). If it is, we can conclude the data are not MCAR. If it is not, we cannot necessarily conclude the data are MCAR: it is possible (though arguably unlikely in this case) that BP is associated with missingness in BP, even if age is not.

Example - blood pressure (simulated data)

We compare the distribution of age in those with BP observed (r = 1) and those with BP missing (r = 0):

. tab age r, chi2 row

(Key: frequency, row percentage)

                     r
      age |       0        1 |   Total
    ------+-----------------+--------
    30-50 |      28       72 |     100
          |   28.00    72.00 |  100.00
    50-70 |      73       27 |     100
          |   73.00    27.00 |  100.00
    ------+-----------------+--------
    Total |     101       99 |     200
          |   50.50    49.50 |  100.00

         Pearson chi2(1) =  40.5041   Pr = 0.000

The p-value (< 0.001) from the chi-squared test gives strong evidence that missingness is associated with age.
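The chi-squared statistic Stata reports can be reproduced directly from the four cell counts. A pure-Python sketch (illustrative only, not part of the original Stata session):

```python
def pearson_chi2(table):
    """Pearson chi-squared statistic for a two-way contingency table:
    the sum over cells of (observed - expected)^2 / expected, where
    expected = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    return sum(
        (table[i][j] - row_totals[i] * col_totals[j] / grand) ** 2
        / (row_totals[i] * col_totals[j] / grand)
        for i in range(len(row_totals))
        for j in range(len(col_totals))
    )

# Rows: age group (30-50, 50-70); columns: r = 0 (missing), r = 1 (observed)
chi2 = pearson_chi2([[28, 72], [73, 27]])  # matches Stata's reported 40.5041
```

With 1 degree of freedom, a statistic of 40.5 corresponds to p < 0.001, matching the Stata output.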

Missing at random

BP (Y) is missing at random (MAR) given age (X) if missingness is independent of BP (Y) given age (X). This means that amongst subjects of the same age, missingness in BP is independent of BP. In terms of the missingness indicator R, MAR means

    P(R = 1 | X, Y) = P(R = 1 | X)

Checking MAR

We cannot check whether MAR holds based on the observed data. To do this we would need to check whether, within categories of age, those with missing BP had higher/lower BP than those with it observed.

[Figure: BP MAR given age - scatter plot of SBP (mmHg, 100-200) against age group (30-50: 72/100 observed; 50-70: 27/100 observed), distinguishing observed from missing values.]

A different representation of MAR

We have defined MCAR and MAR in terms of how P(R = 1 | Y, X) depends on age (X) and BP (Y). From the plot, we see that MAR can also be viewed in terms of the conditional distribution of BP (Y) given age (X). MAR implies that

    f(Y | X, R = 0) = f(Y | X, R = 1) = f(Y | X)

That is, the distribution of BP (Y), given age (X), is the same whether or not BP (Y) is observed. This key consequence of MAR is directly exploited by multiple imputation.

Missing not at random

If data are neither MCAR nor MAR, they are missing not at random (MNAR). This means the chance of seeing Y depends on Y, even after conditioning on X. Equivalently,

    f(Y | X, R = 0) ≠ f(Y | X, R = 1)

MNAR is much more difficult to handle. Essentially the data cannot tell us how the missing values differ from the observed values (given X). We are thus led to conducting sensitivity analyses.


Complete case analysis

Complete case (CC) (or complete records) analysis involves using only data from those subjects for whom all of the variables involved in our analysis are observed. CC is the default approach of most statistical packages (including Stata) when we have missing data. By only analysing a subset of records, our estimates will be less precise than had there been no missing data. Arguably more importantly, our estimates may be biased if the complete records differ systematically from the incomplete records. However, CC can be unbiased in certain situations in which the complete records are systematically different.

Validity of complete case analysis

CC analysis is valid provided the probability of being a complete case is independent of the outcome, given the covariates in the model of interest [3]. Note that this condition has nothing to do with which variable(s) have missing values, and it does not fit into the MCAR/MAR/MNAR classification. It is not true, as is sometimes stated, that CC is always biased if data are not MCAR!

The complete case assumption

The validity of the assumption required for CC analysis to be unbiased depends on the model of interest. Returning to the example of estimating mean BP, we can think of this as the following linear model with no covariates:

    BP_i = α + ε_i,  with ε_i ~ N(0, σ²_ε)

Here CC analysis is unbiased only if missingness is independent of BP (Y), i.e. P(R = 1 | Y) = P(R = 1).

Estimating mean BP - complete case analysis

. reg sbp

      Source |       SS       df       MS          Number of obs =      99
-------------+------------------------------      F(  0,    98) =    0.00
       Model |           0     0           .      Prob > F      =       .
    Residual |  29924.3689    98  305.350703      R-squared     =  0.0000
-------------+------------------------------      Adj R-squared =  0.0000
       Total |  29924.3689    98  305.350703      Root MSE      =  17.474

         sbp |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
       _cons |   138.1012   1.756232    78.63   0.000     134.616    141.5864

The estimated mean (138.1) is biased downwards (truth = 145). This is because missingness is associated with BP: the higher the BP, the greater the chance that BP is missing.
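As a cross-check, the complete-case mean can be recomputed from the rounded group summaries in the earlier table. A pure-Python back-of-the-envelope (it matches Stata's 138.1012 only approximately, because the group means were rounded to one decimal place):

```python
# (n observed, mean BP among observed) per age group, from the earlier table
cc_groups = [(72, 129.7),   # age 30-50
             (27, 160.6)]   # age 50-70

cc_n = sum(n for n, _ in cc_groups)
cc_mean = sum(n * mean for n, mean in cc_groups) / cc_n
print(round(cc_mean, 1))  # 138.1, well below the true mean of 145
```

The downward bias is visible directly: the under-observed 50-70 group has the higher mean BP, so the unweighted complete-case average is pulled down.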

A model for which CC is unbiased

. reg sbp age

      Source |       SS       df       MS          Number of obs =      99
-------------+------------------------------      F(  1,    97) =  163.17
       Model |  18767.6873     1  18767.6873      Prob > F      =  0.0000
    Residual |  11156.6816    97  115.017336      R-squared     =  0.6272
-------------+------------------------------      Adj R-squared =  0.6233
       Total |  29924.3689    98  305.350703      Root MSE      =  10.725

         sbp |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
         age |    30.9154   2.420199    12.77   0.000    26.11197    35.71882
       _cons |   129.6697   1.263908   102.59   0.000    127.1612    132.1782

This CC analysis is unbiased, because we condition on the cause of missingness (age). Of course, this alternative model does not (by itself) give an estimate of mean BP.


Multiple imputation

Multiple imputation (MI) involves filling in each missing value multiple times, which results in multiple completed datasets. We then analyse each completed dataset separately and combine the estimates using formulae developed by Rubin ("Rubin's rules"). By using observed data from all cases, estimates based on MI are generally more efficient than those from CC. And, in some settings, MI may remove bias present in CC estimates.

MI in a very simple setting

There are many different imputation methods. We describe one (the "classic" one) in the context of a very simple setting. Suppose we have two continuous variables X and Y. X is fully observed, but Y has some missing values. Our task is to impute the missing values in Y using X.

[Figure: Imputing Y from X - scatter plot of Y (0-2) against X (0-4), with three missing Y values marked "?".]

Linear regression imputation

1. Fit the linear regression of Y on X using the complete cases:

       Y = α + βX + ε,  where ε ~ N(0, σ²)

2. This gives estimates α̂, β̂ and σ̂².

3. To create the mth imputed dataset:

   3.1 Draw new values α_m, β_m and σ²_m based on α̂, β̂ and σ̂².

   3.2 For each subject with observed X_i but missing Y_i, create the imputation Y_i(m) by:

           Y_i(m) = α_m + β_m X_i + ε_i(m)

       where ε_i(m) is a random draw from N(0, σ²_m).
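The steps above can be sketched in a few lines of code. This is a pure-Python illustration (not Stata's mi impute), and step 3.1 is simplified: α_m and β_m are drawn from approximate normal sampling distributions, while σ²_m is held fixed at σ̂² rather than being drawn as well:

```python
import random

def impute_linear(xs, ys, m=4, seed=1):
    """Sketch of linear-regression imputation: fit Y = a + b*X on the
    complete cases, then create m completed datasets. Missing values in
    ys are represented by None."""
    rng = random.Random(seed)
    cc = [(x, y) for x, y in zip(xs, ys) if y is not None]  # complete cases
    n = len(cc)
    xbar = sum(x for x, _ in cc) / n
    ybar = sum(y for _, y in cc) / n
    sxx = sum((x - xbar) ** 2 for x, _ in cc)
    b = sum((x - xbar) * (y - ybar) for x, y in cc) / sxx   # slope
    a = ybar - b * xbar                                     # intercept
    s2 = sum((y - a - b * x) ** 2 for x, y in cc) / (n - 2) # sigma-hat^2
    datasets = []
    for _ in range(m):
        # Step 3.1 (simplified): draw new intercept and slope
        a_m = rng.gauss(a, (s2 * (1 / n + xbar ** 2 / sxx)) ** 0.5)
        b_m = rng.gauss(b, (s2 / sxx) ** 0.5)
        # Step 3.2: fill in each missing Y with a_m + b_m * X + noise
        datasets.append([
            y if y is not None else a_m + b_m * x + rng.gauss(0, s2 ** 0.5)
            for x, y in zip(xs, ys)
        ])
    return datasets
```

Run on the toy data of the next slide, observed Y values are carried through unchanged in every imputation, while each imputation fills in subjects 8-10 differently, exactly as the table shows.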

The end result

  Subject |  Data    | Imputation 1 | Imputation 2 | Imputation 3 | Imputation 4
          |  Y    X  |   Y     X    |   Y     X    |   Y     X    |   Y     X
     1    | 1.1  3.4 |  1.1   3.4   |  1.1   3.4   |  1.1   3.4   |  1.1   3.4
     2    | 1.5  3.9 |  1.5   3.9   |  1.5   3.9   |  1.5   3.9   |  1.5   3.9
     3    | 2.3  2.6 |  2.3   2.6   |  2.3   2.6   |  2.3   2.6   |  2.3   2.6
     4    | 3.6  1.9 |  3.6   1.9   |  3.6   1.9   |  3.6   1.9   |  3.6   1.9
     5    | 0.8  2.2 |  0.8   2.2   |  0.8   2.2   |  0.8   2.2   |  0.8   2.2
     6    | 3.6  3.3 |  3.6   3.3   |  3.6   3.3   |  3.6   3.3   |  3.6   3.3
     7    | 3.8  1.7 |  3.8   1.7   |  3.8   1.7   |  3.8   1.7   |  3.8   1.7
     8    |  ?   0.8 |  0.2   0.8   |  0.8   0.8   |  0.3   0.8   |  2.3   0.8
     9    |  ?   2.0 |  1.7   2.0   |  2.4   2.0   |  1.8   2.0   |  3.5   2.0
    10    |  ?   3.2 |  2.7   3.2   |  2.5   3.2   |  1.0   3.2   |  1.7   3.2

The analysis stage

For each imputation, we estimate our parameter of interest θ and record its standard error, e.g. θ = E(Y), the average value of Y. Let θ̂_m and Var(θ̂_m) denote the estimate of θ and its variance from the mth imputation. Our overall estimate of θ is then the average of the estimates from the imputed datasets:

    θ̂_MI = (1/M) Σ_{m=1}^{M} θ̂_m

where M denotes the number of imputations used.

Variance estimation

The within-imputation variance is given by

    σ²_W = (1/M) Σ_{m=1}^{M} Var(θ̂_m)

This quantifies uncertainty due to the fact we have a finite sample (the usual cause of uncertainty in estimates). The between-imputation variance is given by

    σ²_B = (1/(M-1)) Σ_{m=1}^{M} (θ̂_m - θ̂_MI)²

This quantifies uncertainty due to the missing data. The overall uncertainty in our estimate θ̂_MI is then given by

    Var(θ̂_MI) = σ²_W + (1 + 1/M) σ²_B

Inference

The MI estimate and its variance can be used to form confidence intervals and perform hypothesis tests. Implementations of MI in statistical packages like Stata automate the process of analysing each imputation and combining the results.

Assumptions for MI

MI gives unbiased estimates provided the data are MAR and the imputation model(s) is correctly specified. To be correctly specified, the imputation model(s) must include all variables involved in our model of interest. The plausibility of MAR can be guided by data analysis and contextual knowledge. Often we have variables which are associated with missingness and with the variable(s) being imputed, but which are not in the model of interest. Including these in the imputation model increases the likelihood of MAR holding.

Specification of imputation models

We should also ensure, as best as possible, that our imputation models are reasonably well specified. For example, if a variable has a highly skewed distribution, imputing using normal linear regression is probably not a good idea. Various diagnostics can be used to aid this process, e.g. comparing the distributions of imputed and observed values.

MI in Stata

Historically the only imputation command in Stata was Patrick Royston's ice command, which performed ICE/FCS imputation (more on this later). Stata 11 included imputation using the multivariate normal model. Stata 12 adds ICE/FCS imputation functionality.

Imputing missing BP values in Stata: Step 1 - mi set the data

e.g.

. mi set wide

Alternatives include mlong and flong. This only affects how Stata organises the imputed datasets.

Imputing missing BP values in Stata: Step 2 - mi register variables

At a minimum, we must mi register the variables with missing values which we want to impute, e.g.

. mi register imputed sbp

Imputing missing BP values in Stata: Step 3 - imputing the missing values

We are now ready to impute the missing values. Since we have missing values in only one continuous variable, we impute using a linear regression imputation model:

. mi impute reg sbp age, add(10) rseed(5123)

Univariate imputation                 Imputations =      10
Linear regression                           added =      10
Imputed: m=1 through m=10                 updated =       0

               |        Observations per m
      Variable | Complete  Incomplete  Imputed  Total
  -------------+-------------------------------------
           sbp |       99         101      101    200

(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

Imputing missing BP values in Stata: Step 4 - analysing the imputed datasets

We are now ready to analyse the imputed datasets. This is done with Stata's mi estimate command, which supports most of Stata's estimation commands.

. mi estimate: reg sbp

Multiple-imputation estimates             Imputations     =        10
Linear regression                         Number of obs   =       200
                                          Average RVI     =    0.7163
                                          Largest FMI     =    0.4420
                                          Complete DF     =       199
DF adjustment:   Small sample             DF:     min     =     35.63
                                                  avg     =     35.63
                                                  max     =     35.63
Within VCE type:          OLS             F(   0,     .)  =         .
                                          Prob > F        =         .

         sbp |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
       _cons |   145.3263   1.747398    83.17   0.000    141.7811    148.8715

The estimate is quite close to the true value (145).

Other MI imputation methods in Stata

In addition to linear regression, Stata's mi command offers imputation using:
- logistic, ordinal logistic, and multinomial logistic models
- predictive mean matching
- truncated normal regression, for imputing bounded continuous variables
- interval regression, for imputing censored continuous variables
- Poisson regression, for imputing count data
- negative binomial regression, for imputing overdispersed count data

MI with more than one variable

So far we have considered settings with one variable partially observed. Often we have datasets with multiple partially observed variables. Stata 11/12 supports imputation with the multivariate normal model. What if we have categorical or binary variables with missing values? More on this in tomorrow's course...


Inverse probability weighting

Inverse probability weighting (IPW) for missing data takes a different approach [4]. We perform a CC analysis, but weight the complete cases by the inverse of their probability of having data observed (i.e. not being missing). Those who had a small chance of being observed are given increased weight, to compensate for those similar subjects who are missing. This requires us to model how missingness depends on fully observed variables.

Using IPW to estimate mean BP

Recall our previous analysis of missingness in BP and age:

. tab age r, chi2 row

(Key: frequency, row percentage)

                     r
      age |       0        1 |   Total
    ------+-----------------+--------
    30-50 |      28       72 |     100
          |   28.00    72.00 |  100.00
    50-70 |      73       27 |     100
          |   73.00    27.00 |  100.00
    ------+-----------------+--------
    Total |     101       99 |     200
          |   50.50    49.50 |  100.00

         Pearson chi2(1) =  40.5041   Pr = 0.000

The probability of observing BP is 0.72 for 30-50 year olds, and 0.27 for 50-70 year olds. So the weight for 30-50 year olds is 1/0.72 = 1.39, and for 50-70 year olds it is 1/0.27 = 3.70.

The IPW estimator

Since we are interested in estimating a simple parameter (mean BP), we can manually calculate the IPW estimate:

    (72 × 129.7 × 1.39 + 27 × 160.6 × 3.70) / (72 × 1.39 + 27 × 3.70) = 145.1

IPW appears to have removed the bias from the simple CC estimate of mean BP.
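The hand calculation above is easy to check in code. A pure-Python sketch using the exact (unrounded) weights 1/0.72 and 1/0.27:

```python
# (n observed, mean BP among observed, P(BP observed)) per age group
groups = [
    (72, 129.7, 0.72),  # age 30-50
    (27, 160.6, 0.27),  # age 50-70
]

# Weighted mean with weight 1/P(observed) attached to each observed subject
numerator = sum(n * mean / p for n, mean, p in groups)
denominator = sum(n / p for n, mean, p in groups)
ipw_mean = numerator / denominator  # ≈ 145.15, close to the truth of 145
```

Note that with the exact weights each group's total weight is n/p = 100, its original group size: the weighting exactly restores each age group to its share of the full sample, which is why the bias disappears.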

IPW more generally: Step 1 - constructing weights

With multiple fully observed variables, we can use logistic regression to model missingness:

. logistic r age

Logistic regression                       Number of obs   =       200
                                          LR chi2(1)      =     42.00
                                          Prob > chi2     =    0.0000
Log likelihood = -117.62122               Pseudo R2       =    0.1515

           r | Odds Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
         age |   .1438356   .0455618    -6.12   0.000    .0773103    .2676059
       _cons |   2.571428   .5727026     4.24   0.000    1.661869      3.9788

. predict pr, pr
. gen wgt = 1/pr
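We can sanity-check the fitted model against the earlier tabulation: back-transforming the reported odds ratios recovers the observed probabilities 0.72 and 0.27, and hence the weights 1.39 and 3.70. A pure-Python sketch (using the coefficients from the output above; Stata's predict/gen do this for us):

```python
import math

# Stata reports odds ratios, so take logs to recover the linear predictor
b_cons = math.log(2.571428)   # _cons
b_age = math.log(0.1438356)   # age (coded 1 = 50-70, 0 = 30-50)

for age in (0, 1):
    p = 1 / (1 + math.exp(-(b_cons + b_age * age)))  # P(r = 1 | age)
    w = 1 / p                                        # IPW weight
    # age = 0: p ≈ 0.72, w ≈ 1.39;  age = 1: p ≈ 0.27, w ≈ 3.70
```

With a single binary covariate the logistic model is saturated, so its fitted probabilities equal the observed proportions; with several covariates the model imposes structure, which is exactly why we fit it rather than tabulate.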

IPW more generally: Step 2 - parameter estimation

We can then pass the constructed weights to our estimation command:

. reg sbp [pweight=wgt]
(sum of wgt is 2.0000e+02)

Linear regression                         Number of obs   =        99
                                          F(  0,    98)   =      0.00
                                          Prob > F        =         .
                                          R-squared       =    0.0000
                                          Root MSE        =    19.008

             |             Robust
         sbp |      Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
-------------+---------------------------------------------------------------
       _cons |   145.1274   2.162726    67.10   0.000    140.8356    149.4193

Notice that the SE is larger (2.16) than the MI SE (1.75).


Problems caused by missing data and a principled approach

Missing data reduce precision and can potentially bias estimates and inferences. Producing valid estimates requires additional assumptions to be made about the missingness; ad-hoc methods should generally be avoided. Both data analysis and contextual knowledge should guide us in thinking about missingness in a given setting. We can then choose a statistical method which accommodates missing data under our chosen assumption (e.g. MAR).

Complete case analysis

Complete case (CC) analysis is the default method of most software packages, including Stata. CC analysis is generally biased unless data are MCAR, but it can be unbiased in certain non-MCAR settings when the model of interest is a regression model. Even when it is unbiased, CC may be inefficient compared to other methods.

Multiple imputation

Multiple imputation is a flexible approach to handling missing data under the MAR assumption [5]. Stata 12 now includes a comprehensive range of MI commands, including ICE/FCS MI. In settings where both CC and MI are unbiased, MI will generally give more precise estimates. We must carefully consider the plausibility of the MAR assumption and whether the imputation models are correctly specified.

Inverse probability weighting

IPW involves performing a weighted CC analysis. Rather than model the partially observed variable, we model the observation/missingness indicator R. The weights based on this model are then passed to our estimation command; most Stata estimation commands support weights. Sometimes modelling missingness may be easier than modelling the partially observed variable (e.g. if the partially observed variable has a tricky distribution). However, IPW estimators can be quite inefficient compared to MI or maximum likelihood, and IPW is also difficult (or impossible) to use in settings with complicated patterns of missingness.

Sensitivity to the MAR assumption

Since we can never definitively verify that our assumptions (e.g. MAR) hold, we should consider sensitivity analysis. MI can also be used to perform MNAR sensitivity analyses [6]. If you want to learn more, come on our missing data short course at LSHTM in June, and/or visit our website www.missingdata.org.uk.

References I

[1] L. Marston, J. R. Carpenter, K. R. Walters, R. W. Morris, I. Nazareth, and I. Petersen. Issues in multiple imputation of missing data for large general practice clinical databases. Pharmacoepidemiology and Drug Safety, 19:618-626, 2010.

[2] D. B. Rubin. Inference and missing data. Biometrika, 63:581-592, 1976.

[3] I. R. White and J. B. Carlin. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Statistics in Medicine, 28:2920-2931, 2010.

References II

[4] S. R. Seaman and I. R. White. Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research, 2011.

[5] J. A. C. Sterne, I. R. White, J. B. Carlin, M. Spratt, P. Royston, M. G. Kenward, A. M. Wood, and J. R. Carpenter. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. British Medical Journal, 339:157-160, 2009.

[6] J. R. Carpenter, M. G. Kenward, and I. R. White. Sensitivity analysis after multiple imputation under missing at random: a weighting approach. Statistical Methods in Medical Research, 16:259-275, 2007.