Missing Data Katyn & Elena

What to do with Missing Data
The standard approach is complete-case analysis (listwise deletion), i.e., delete cases with missing data so only complete cases are left.
Two other popular options:
Multiple Imputation
Full Information Maximum Likelihood

The way data is missing matters
MCAR: missing completely at random. In this case, listwise deletion doesn't create bias.
MAR: missing at random. The probability that data is missing depends only on available information. If you have everything that matters for missingness in the model, then there is no bias.
MNAR: missing not at random (this is a problem). E.g., people with higher incomes are less likely to reveal their income because they feel self-conscious, or people who have college degrees don't reveal their income and we have missing data on level of education.
Note: we can rarely tell if data is MAR or MNAR. Imputation methods assume MAR.
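
To make the MCAR vs. MNAR distinction concrete, here is a minimal simulation sketch. The income distribution, the sample size, and the non-response probabilities are all invented for illustration; none of them come from the slides.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical incomes (arbitrary units) drawn from a lognormal distribution.
incomes = [rng.lognormvariate(10.5, 0.6) for _ in range(20000)]
median_income = statistics.median(incomes)

# MCAR: every respondent has the same 30% chance of not reporting income.
mcar_observed = [y for y in incomes if rng.random() > 0.3]

# MNAR: missingness depends on the unobserved value itself. Respondents
# above the median withhold income 50% of the time, the rest 10%.
mnar_observed = [
    y for y in incomes
    if rng.random() > (0.5 if y > median_income else 0.1)
]

true_mean = statistics.fmean(incomes)
mcar_mean = statistics.fmean(mcar_observed)
mnar_mean = statistics.fmean(mnar_observed)

print(f"true mean:               {true_mean:.0f}")
print(f"complete-case mean MCAR: {mcar_mean:.0f}")
print(f"complete-case mean MNAR: {mnar_mean:.0f}")
```

Under these assumptions the complete-case mean stays close to the truth under MCAR and is pulled downward under MNAR, because high earners are under-represented among the complete cases.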

Problems with Complete Case Analysis
Can lead to bias if observations with missing values differ systematically from complete cases.
Can result in a small sample and, as a result, larger standard errors.
Could reweight to make the complete-case sample representative.
But survey weighting is a mess: Gelman, 2007, Struggles with Survey Weighting and Regression Modeling.

Bad Imputation Strategies
Need to fill in the missing values. But how?
What about just replacing with the mean value? This distorts the distribution of the variable, and distorts relationships between variables (correlations will be pulled towards zero).
How about including an indicator variable for missingness (and replacing missing values with 0 or the mean)? This leads to biased coefficients for other predictors in the model because it forces the slope to be the same across both missing-data groups.
Could add interactions, but then estimates will be similar to complete-case analysis.
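
The effect of mean imputation can be checked with a small simulation. This is a sketch on made-up data (two correlated normal variables, 40% of y deleted completely at random); the helper `pearson` is defined here and is not from the slides.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.8 * xi + rng.gauss(0, 0.6) for xi in x]  # correlation is about 0.8

# Delete 40% of y completely at random, then fill in the observed mean.
missing = [rng.random() < 0.4 for _ in range(n)]
y_mean = statistics.fmean(yi for yi, m in zip(y, missing) if not m)
y_imputed = [y_mean if m else yi for yi, m in zip(y, missing)]

corr_full = pearson(x, y)
corr_imputed = pearson(x, y_imputed)
sd_full = statistics.pstdev(y)
sd_imputed = statistics.pstdev(y_imputed)

print(f"correlation: full data {corr_full:.2f}, mean-imputed {corr_imputed:.2f}")
print(f"sd of y:     full data {sd_full:.2f}, mean-imputed {sd_imputed:.2f}")
```

Both the spread of y and its correlation with x shrink after mean imputation, which is the distortion the slide warns about.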

Better Imputation Strategies
Could just generate random x values from the observed distribution of x values.
But it is better to use information from other variables if available.
Fit a regression predicting the x variable using other variables.
Fill in missing values with predicted values from the regression.
Predicted values will be less variable than the original data.
Can add uncertainty back in by adding the prediction error from the regression.
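
The difference between filling in bare predictions and adding the prediction error back can be sketched as follows. The data, the single predictor z, and the helper `fit_line` are assumptions made for this example; the sketch also ignores uncertainty in the fitted coefficients themselves.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (sse / (len(xs) - 2)) ** 0.5

rng = random.Random(2)
n = 4000
z = [rng.gauss(0, 1) for _ in range(n)]               # fully observed
x = [1.0 + 0.7 * zi + rng.gauss(0, 1.0) for zi in z]  # will have gaps
missing = [rng.random() < 0.5 for _ in range(n)]

# Fit the imputation regression on the complete cases only.
a, b, resid_sd = fit_line(
    [zi for zi, m in zip(z, missing) if not m],
    [xi for xi, m in zip(x, missing) if not m],
)

# Deterministic: fill in the predicted value.
x_det = [a + b * zi if m else xi for zi, xi, m in zip(z, x, missing)]
# Stochastic: predicted value plus a draw of the regression error.
x_sto = [
    a + b * zi + rng.gauss(0, resid_sd) if m else xi
    for zi, xi, m in zip(z, x, missing)
]

sd_true = statistics.pstdev(x)
sd_det = statistics.pstdev(x_det)
sd_sto = statistics.pstdev(x_sto)
print(f"sd of x: true {sd_true:.2f}, predicted only {sd_det:.2f}, "
      f"predicted + error {sd_sto:.2f}")
```

With predictions alone the completed variable is too narrow; adding the error term restores roughly the original spread.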

Figure (two panels): just using predicted values to fill in missing values; adding in regression error.
Figure from http://lane.compbio.cmu.edu/courses/gelmanmissing.pdf

What to include in the Regression
Include any variables you think will make a better prediction.
For example, in predicting income, maybe you have information on whether the respondent received income support from disability payments or welfare. Put it in the regression.
The goal is not causal inference; it is accurate prediction.

Other Methods
Matching: for each unit with a missing value of y, find a unit with similar X values and take its y value.
Also called hot-deck imputation.
Can be combined with regression, where similarity is defined as closeness in the predicted value from the regression.
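
A minimal sketch of matching, assuming a single numeric X and "similar" meaning smallest absolute distance. The function name `hot_deck_impute` and the toy records are hypothetical.

```python
def hot_deck_impute(records):
    """Fill each missing y with the y of the donor whose x is closest.

    records: list of (x, y) pairs, where y is None when missing.
    Returns the list of y values with the gaps filled in.
    """
    donors = [(x, y) for x, y in records if y is not None]
    completed = []
    for x, y in records:
        if y is None:
            # Nearest donor by absolute distance in x.
            _, y = min(donors, key=lambda d: abs(d[0] - x))
        completed.append(y)
    return completed

records = [(1.0, 10.0), (2.0, 20.0), (2.1, None), (5.0, 50.0), (4.6, None)]
completed = hot_deck_impute(records)
print(completed)  # [10.0, 20.0, 20.0, 50.0, 50.0]
```

To combine this with regression as the slide suggests, the distance would be computed on the regression's predicted values rather than on x directly.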

Multiple Imputation
Often we are missing data for several variables in the analysis.
Two approaches:
Multivariate imputation (MVN): fit a multivariate model to all the variables that have missing values.
Iterated Chained Equations (ICE): apply univariate methods iteratively.

MVN
Assume a multivariate distribution for all imputation variables and impute missing values as draws from the posterior predictive distribution of the missing data, given the observed data.
Use MCMC methods to approximate the distribution and draw imputed values.
Often assume multivariate normal (MVN).

ICE
1. Fill in missing values with random values from the distribution of each variable.
2. Regress variable 1 on all other variables (which now have complete data).
3. Fill in missing values in variable 1 with the closest matched value to the predicted value + noise.
Perform steps 2 and 3 for all variables, continuing until the imputed values converge.
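
The steps above can be sketched in a few lines. This is a simplified illustration, not Stata's ICE: it assumes exactly two continuous variables (so each regression has one predictor), runs a fixed number of iterations instead of testing convergence, and the names `fit_line` and `chained_imputation` are invented here.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    a = my - b * mx
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, (sse / (len(xs) - 2)) ** 0.5

def chained_imputation(data, n_iter=10, seed=0):
    """data: dict of two equal-length lists; None marks a missing value."""
    rng = random.Random(seed)
    names = list(data)
    observed = {k: [v for v in data[k] if v is not None] for k in names}
    # Step 1: start from random draws of each variable's observed values.
    filled = {
        k: [v if v is not None else rng.choice(observed[k]) for v in data[k]]
        for k in names
    }
    for _ in range(n_iter):
        for target in names:
            other = next(k for k in names if k != target)
            obs_rows = [i for i, v in enumerate(data[target]) if v is not None]
            mis_rows = [i for i, v in enumerate(data[target]) if v is None]
            # Step 2: regress the target on the other (currently complete) variable.
            a, b, sd = fit_line(
                [filled[other][i] for i in obs_rows],
                [filled[target][i] for i in obs_rows],
            )
            # Step 3: prediction + noise, then take the closest observed value.
            for i in mis_rows:
                noisy = a + b * filled[other][i] + rng.gauss(0, sd)
                filled[target][i] = min(
                    observed[target], key=lambda v: abs(v - noisy)
                )
    return filled

# Made-up data: y depends on x; 30% of each variable is deleted at random.
rng = random.Random(3)
x_true = [rng.gauss(0, 1) for _ in range(300)]
y_true = [2 * xi + rng.gauss(0, 0.5) for xi in x_true]
data = {
    "x": [v if rng.random() > 0.3 else None for v in x_true],
    "y": [v if rng.random() > 0.3 else None for v in y_true],
}
completed = chained_imputation(data)
print(sum(v is None for v in data["x"]), "missing x values filled in")
```

With more than two variables, step 2 would become a multiple regression on all the other variables, and each model could be tailored to the variable's type.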

Which one?
MVN makes an assumption about the joint distribution of all the variables.
ICE doesn't assume this, and it's also possible to tailor each regression model appropriately (logistic for a binary variable, etc.), but you have to specify each model correctly.
May not make a difference though. Lee & Carlin (2010) simulate data, then induce missing data using different mechanisms, then use Stata MVN and Stata ICE. They find that both gave similar results (and both were less biased than complete-case analysis).

Evaluating Imputations: Trace Plots
Check that there are no systematic trends.

Multilevel Imputation
Have data on students (test scores, demographics).
Have data on schools (public vs. private).
Best to separate into two datasets and then use the results from one in the other (possibly back and forth).
So, first impute individual-level variables using individual-level data and observed group-level measurements.
Then, at the group level, include aggregated forms of the individual-level measurements when imputing missing data at that level.
Maybe choose the order based on what you care about? It is not clear what the best way to do this is.

Inference with Multiple Imputation
There is uncertainty about our imputation model that needs to be accounted for in our analysis.
Create multiple complete datasets using different imputed values, and run the analysis on each dataset.
The final estimate is the average of the coefficients across the m datasets.
Its variance will reflect both the variance within each dataset and the variance between datasets.
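
The combining step is commonly done with Rubin's rules: the pooled estimate is the mean of the m coefficients, and the total variance is W + (1 + 1/m) B, where W is the average within-dataset variance and B is the between-dataset variance of the coefficients. The slide's own formula did not survive, so the sketch below assumes these standard rules; the function name and the numbers are made up.

```python
import statistics

def pool_estimates(estimates, variances):
    """Combine one coefficient across m imputed datasets (Rubin's rules).

    estimates: the coefficient estimated on each imputed dataset.
    variances: its squared standard error on each imputed dataset.
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # W: average within variance
    between = statistics.variance(estimates)   # B: sample variance, m - 1 denominator
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from m = 5 imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]
pooled, total_var = pool_estimates(estimates, variances)
print(f"pooled estimate {pooled:.3f}, standard error {total_var ** 0.5:.3f}")
```

Here the between-imputation term adds 0.03 to the average within variance of 0.044, so ignoring it would understate the uncertainty.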