Multiple Imputation in Multivariate Problems When the Imputation and Analysis Models Differ


Statistica Neerlandica (2003) Vol. 57, nr. 1

Joseph L. Schafer*
Department of Statistics and The Methodology Center, The Pennsylvania State University, 326 Thomas Building, University Park, PA 16802, USA

Bayesian multiple imputation (MI) has become a highly useful paradigm for handling missing values in many settings. In this paper, I compare Bayesian MI with other methods (maximum likelihood, in particular) and point out some of its unique features. One key aspect of MI, the separation of the imputation phase from the analysis phase, can be advantageous in settings where the models underlying the two phases do not agree.

Key Words and Phrases: missing data, nonresponse.

1 Fundamentals

In modern statistical practice, the occurrence of missing values is usually viewed as a random phenomenon. Let $Y_{\text{com}} = (Y_{\text{obs}}, Y_{\text{mis}})$ denote a set of complete data and $Y_{\text{obs}}$ the data actually observed. The missing data, $Y_{\text{mis}}$, denotes real or hypothetical quantities which, if available, would simplify the analysis. Our primary interest is in some aspect of the population distribution of $Y_{\text{com}}$, $P(Y_{\text{com}} \mid \theta)$, not the distribution of $Y_{\text{obs}}$, nor the process that partitions $Y_{\text{com}}$ into $Y_{\text{obs}}$ and $Y_{\text{mis}}$. Nevertheless, we suppose that the partitioning of $Y_{\text{com}}$ is encoded in a set of random variables $R$. For example, if $Y_{\text{com}}$ is a data matrix containing both observed and missing values, then $R$ could be a matrix of the same size containing 1's and 0's to show whether the corresponding elements of $Y_{\text{com}}$ are observed or missing. We will refer to $R$ as the missingness, and to $P(R \mid Y_{\text{com}}; \xi)$ as the distribution of missingness. Note that $R$ is itself completely observed. We do not posit a distribution for $R$ because we want to model it. Rather, we hope to avoid modeling $R$ because our model might not be plausible.
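The encoding of missingness as an indicator matrix can be made concrete with a small sketch. The data values below are hypothetical, and the use of `np.nan` to mark missing entries is an assumption of this illustration rather than anything prescribed by the paper.

```python
import numpy as np

# Hypothetical 4 x 3 data matrix; np.nan marks a missing element of Y_com.
Y = np.array([
    [1.2, 0.7, np.nan],
    [0.4, np.nan, np.nan],
    [2.1, 1.9, 1.5],
    [np.nan, 0.3, 0.8],
])

# R has the same shape as Y: 1 where the element is observed, 0 where it is missing.
R = (~np.isnan(Y)).astype(int)

# Y_obs collects the observed values; Y_mis is, by definition, unavailable.
Y_obs = Y[R == 1]

print(R)
print("fraction observed per column:", R.mean(axis=0))
```

Note that `R` is fully known even though `Y` is not, which is the sense in which the missingness is itself completely observed.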
Reasons for missingness are often not present in $Y_{\text{com}}$, because explaining $R$ is usually not the purpose of data collection. Nevertheless, these reasons may be related to aspects of $Y_{\text{com}}$ and, by omission, may induce relationships between $R$ and $Y_{\text{com}}$. The distribution of missingness should be regarded as a mathematical device to describe the rates and patterns of missing values and to capture approximately relationships between $R$ and $Y_{\text{com}}$ in a correlational (not causal) sense. Usually, our main reason for introducing $P(R \mid Y_{\text{com}}; \xi)$ is to clarify the conditions under which we may avoid specifying it.

*jls@stat.psu.edu. This work was supported by grant 1-P50-DA10075, National Institute on Drug Abuse. Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

Elementary probability theory suggests that $P(Y_{\text{obs}} \mid \theta) = \int P(Y_{\text{com}} \mid \theta)\, dY_{\text{mis}}$ may provide a basis for inference about $\theta$, both as a sampling distribution for $Y_{\text{obs}}$ and as a likelihood function for $\theta$. But that is true only under certain conditions. Missing data are missing at random (MAR) if the distribution of missingness does not depend on $Y_{\text{mis}}$, $P(R \mid Y_{\text{com}}; \xi) = P(R \mid Y_{\text{obs}}; \xi)$ (RUBIN, 1976). They are missing completely at random (MCAR) if it does not depend on $Y_{\text{obs}}$ or $Y_{\text{mis}}$, $P(R \mid Y_{\text{com}}; \xi) = P(R; \xi)$ (LITTLE and RUBIN, 1987). Missing not at random (MNAR) refers to any violation of MAR. RUBIN (1976) showed that $P(Y_{\text{obs}} \mid \theta)$ provides a correct sampling distribution for frequentist inference about $\theta$ under MCAR, and a correct likelihood function for likelihood/Bayes inference under the weaker condition of MAR. (To be precise, one also needs the parameters of the missingness distribution, $\xi$, to be distinct from or independent of $\theta$; see LITTLE and RUBIN, 1987, for details.)

That models for $R$ may be avoided more often in a likelihood/Bayesian mode than in a frequentist mode suggests that missing-data procedures based on likelihood may be more useful than those motivated by frequentist arguments alone. For the most part, I believe that is true. Many of the older data-editing procedures (e.g. listwise deletion) are based on frequentist arguments and are generally valid only under the strong and often implausible assumption of MCAR. Even if MCAR does hold, these procedures may be inefficient.
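The three mechanisms can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the paper: a bivariate normal population with correlation 0.6, logistic response probabilities, and $Y_1$ always observed. The regression-based estimator is used here as a simple likelihood-style estimate of the mean of $Y_2$ for this monotone pattern; it stands in for the general idea of inference from $P(Y_{\text{obs}} \mid \theta)$, not for any specific program.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 100_000, 0.6

# Complete data: standard bivariate normal, so the true mean of Y2 is 0.
y1 = rng.standard_normal(n)
y2 = rho * y1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Response indicators for Y2 (True = observed); Y1 is always observed.
observed = {
    "MCAR": rng.random(n) < 0.5,                  # depends on nothing
    "MAR": rng.random(n) < logistic(-2.0 * y1),   # depends only on observed Y1
    "MNAR": rng.random(n) < logistic(-2.0 * y2),  # depends on Y2 itself
}

def complete_case_mean(r):
    """Listwise deletion: average Y2 over respondents only."""
    return y2[r].mean()

def regression_mean(r):
    """Fit Y2 ~ Y1 on respondents, then evaluate the line at the mean of all Y1."""
    slope, intercept = np.polyfit(y1[r], y2[r], deg=1)
    return intercept + slope * y1.mean()

results = {name: (complete_case_mean(r), regression_mean(r))
           for name, r in observed.items()}
for name, (cc, reg) in results.items():
    print(f"{name}: complete-case {cc:+.3f}, regression-based {reg:+.3f}")
```

In this setup listwise deletion is roughly unbiased only under MCAR, the regression-based estimate remains roughly unbiased under MAR, and neither recovers the true mean under MNAR.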
Methods for analyzing data without a true likelihood function (marginal modeling based on generalized estimating equations, for example) can handle missing values only if they are MCAR (ZEGER, LIANG and ALBERT, 1988) or if one specifies a correct model for $R$ (ROBINS, ROTNITZKY and SCHARFSTEIN, 1998). As a general principle, I believe that an analyst's time and effort are better spent building an intelligent model for the data rather than modeling the missingness, unless departures from MAR are suspected to be very serious.

Consider a multivariate problem where we collect items $Y_j$, $j = 1,\ldots,p$ for subjects $i = 1,\ldots,n$, so that $Y_{\text{com}}$ is an $n \times p$ data matrix, but portions of $Y_{\text{com}}$ are missing for reasons beyond our control. If $Y_1,\ldots,Y_p$ represent repeated measures of the same variable at different occasions, and if subjects who drop out do not return (i.e., if $Y_j$ is missing then $Y_{j+1},\ldots,Y_p$ are missing as well), then MAR is easy to understand: it means that a subject's probability of dropping out at occasion $j$, given that he has not yet dropped out, may depend on previous responses, $Y_1,\ldots,Y_{j-1}$, but not on the present or future, $Y_j,\ldots,Y_p$. In more general multivariate settings, however, MAR is not very intuitive; it means that a subject's probabilities of responding to $Y_1,\ldots,Y_p$ may depend only on his or her own set of observed items, a set that changes from one subject to the next. One could say that this condition seems odd or unnatural. However, the apparent awkwardness of MAR does not imply that assuming MAR is unwise. COLLINS, SCHAFER and KAM (2001) demonstrated that, with continuous data, an erroneous assumption of MAR (failing to take into account a cause or correlate of missingness) may have only a minor impact on estimates and standard errors unless the relationships between the omitted cause or correlate of missingness and the outcomes are unusually strong ($\rho$ much greater than 0.5). In typical social science applications, I believe that such strong correlations between causes of missingness and outcomes are the exception rather than the rule, and assuming MAR will probably not lead us far astray.

When MNAR is a serious concern (e.g., in a clinical trial where subjects drop out if the treatment is not working), it may be necessary to jointly model $Y_{\text{com}}$ and $R$, but such models must be based on other unverifiable assumptions. Methods for joint modeling of $Y_{\text{com}}$ and $R$ in longitudinal studies with dropout are reviewed by LITTLE (1995) and by VERBEKE and MOLENBERGHS (2000).

2 Bayesian multiple imputation

Inferences about $\theta$ may proceed from a likelihood based on $P(Y_{\text{obs}} \mid \theta)$ under MAR or from a joint model for $Y_{\text{com}}$ and $R$ under MNAR. Procedures for ML estimation in multivariate missing-data problems are becoming widely available. Programs for fitting linear mixed models to longitudinal data (e.g., PROC MIXED in SAS) allow missing responses. ML for latent-variable models with incomplete data is found in Amos (ARBUCKLE and WOTHKE, 1999), Mx (NEALE et al., 1999), LISREL 8.5 (JÖRESKOG and SÖRBOM, 2001), LTA (COLLINS et al., 1999) and Mplus (MUTHÉN and MUTHÉN, 1998). All of these programs assume that the missing values are MAR.

A useful alternative to direct likelihood methods is given by multiple imputation (MI) (RUBIN, 1987). Suppose that $Q$ is some aspect of the distribution of $Y_{\text{com}}$ to be estimated, and that an estimate $\hat{Q}$ and standard error $\sqrt{U}$ could be easily calculated if $Y_{\text{mis}}$ were available.
In MI, $Y_{\text{mis}}$ is replaced by $m > 1$ simulated versions, $Y_{\text{mis}}^{(1)},\ldots,Y_{\text{mis}}^{(m)}$, resulting in $m$ estimates and standard errors, $(\hat{Q}_j, \sqrt{U_j})$, $j = 1,\ldots,m$. From RUBIN (1987), an overall estimate for $Q$ is $\bar{Q} = m^{-1}\sum_{j=1}^{m}\hat{Q}_j$ with a standard error of $\sqrt{T}$, where $T = \bar{U} + (1 + m^{-1})B$, $\bar{U} = m^{-1}\sum_{j=1}^{m}U_j$, and $B = (m-1)^{-1}\sum_{j=1}^{m}(\hat{Q}_j - \bar{Q})^2$. If $(\hat{Q} - Q)/\sqrt{U}$ is approximately $N(0, 1)$ with complete data, then $(\bar{Q} - Q)/\sqrt{T} \sim t_{\nu}$ provides tests and intervals for $Q$, where $\nu = (m-1)(1 + r^{-1})^2$ and $r = (1 + m^{-1})B/\bar{U}$.

Where do the imputations $Y_{\text{mis}}^{(1)},\ldots,Y_{\text{mis}}^{(m)}$ come from? They are drawn from a Bayesian posterior predictive distribution for $Y_{\text{mis}}$. The model for complete data, $P(Y_{\text{com}}; \theta)$, implies a conditional distribution for the missing values given the observed ones, $P(Y_{\text{mis}} \mid Y_{\text{obs}}; \theta)$. We could use this distribution for imputation if $\theta$ were known. But because $\theta$ is unknown, we must generate $m$ independent random draws $\theta^{(1)},\ldots,\theta^{(m)}$ from a Bayesian posterior distribution given $Y_{\text{obs}}$ and $R$, which depends only on $Y_{\text{obs}}$ if the missing data are MAR. Once the random parameters are drawn, the imputations follow as $Y_{\text{mis}}^{(j)} \sim P(Y_{\text{mis}} \mid Y_{\text{obs}}; \theta^{(j)})$, $j = 1,\ldots,m$. In most applications, the number of imputations does not need to be large; $m = 5$ is often enough. Computational strategies for creating MIs in multivariate settings are reviewed by SCHAFER (1997). Programs for creating MIs include NORM (SCHAFER, 1999), the S+MissingData library in S-PLUS (SCHIMERT et al., 2001), SAS PROC MI (YUAN, 2000), Amelia (KING et al., 2001), SOLAS (STATISTICAL SOLUTIONS, 2002), MICE (VAN BUUREN and OUDSHOORN, 1999), and IVEware (RAGHUNATHAN, SOLENBERGER and VAN HOEWYK, 2000). An excellent resource for information about these and other programs for MI is the website [...]. These programs assume that the missing data are MAR. However, in some cases, we can also use them to generate imputations under certain MNAR models by creating variables out of the missing-data pattern $R$ and treating these variables as additional covariates; this idea will be explored in Section 5 below.

It is easy to show that, when the estimand $Q$ is a function of $\theta$, MI provides approximate Bayesian inferences for $Q$ (SCHAFER, 1997, Sec. [...]). Therefore, with a large sample and a diffuse prior distribution, MI and direct likelihood methods produce similar answers. From this standpoint, MI seems to be an unnecessarily circuitous way to summarize the likelihood for $\theta$. Why is MI worth considering at all?

One reason is that, unlike ML, MI separates the inference into two phases: the imputation phase, in which $Y_{\text{mis}}^{(1)},\ldots,Y_{\text{mis}}^{(m)}$ are created, and the analysis phase, in which the $(\hat{Q}_j, \sqrt{U_j})$, $j = 1,\ldots,m$ are calculated and combined. Because the phases are distinct, imputation and analysis may be carried out on different occasions by different persons. For example, the National Center for Health Statistics in the United States has recently released a multiply-imputed version of the Third National Health and Nutrition Examination Survey (NHANES III). Special techniques and software were used to create the imputations, but now that they exist, the imputed data files may be analyzed in a straightforward manner by anyone familiar with standard techniques of survey data analysis.
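The combining step of the analysis phase follows directly from the formulas of RUBIN (1987) stated earlier in this section. The sketch below is one way to code them; the function name `pool_rubin`, the dictionary keys, and the example inputs are hypothetical choices for illustration, not taken from any of the programs cited.

```python
import numpy as np

def pool_rubin(q_hats, u_hats):
    """Combine m point estimates and their squared standard errors."""
    q_hats = np.asarray(q_hats, dtype=float)
    u_hats = np.asarray(u_hats, dtype=float)  # U_j = squared standard errors
    m = q_hats.size

    q_bar = q_hats.mean()               # overall estimate, Q-bar
    u_bar = u_hats.mean()               # within-imputation variance, U-bar
    b = q_hats.var(ddof=1)              # between-imputation variance, B
    t = u_bar + (1.0 + 1.0 / m) * b     # total variance, T
    r = (1.0 + 1.0 / m) * b / u_bar     # relative increase in variance
    # Degrees of freedom for the t reference distribution; infinite if B = 0.
    nu = (m - 1) * (1.0 + 1.0 / r) ** 2 if r > 0 else np.inf

    return {"estimate": q_bar, "total_variance": t,
            "std_error": np.sqrt(t), "r": r, "df": nu}

# Hypothetical estimates and squared standard errors from m = 3 imputed datasets.
pooled = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(pooled)
```

An interval for $Q$ would then take the form $\bar{Q} \pm t_{\nu}\sqrt{T}$, with the $t$ quantile obtained from any standard statistical library.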
The NHANES III Multiply Imputed Data Set is available at nhanes/nh3data.htm.

Another advantage of separating imputation from the analysis, and the primary topic of the rest of this paper, is that the imputation and analysis may be carried out under different models. For example, it is a relatively simple matter to include extra variables in an imputation procedure but ignore them in later analyses. Properties of MI when the imputation and analysis models differ have been explored from a theoretical standpoint by MENG (1994) and RUBIN (1996), and from a practical standpoint by COLLINS, SCHAFER and KAM (2001). Discrepancies between the models are not necessarily harmful and can often be advantageous.

3 Comparing the results from MI and ML

To understand what happens when the imputation and analysis models differ, it helps to compare first the results of an ML procedure applied to $Y_{\text{obs}}$ alone (which uses a single model) to an analysis based on MI (which uses two models). This discussion, which is summarized from COLLINS, SCHAFER and KAM (2001), assumes that the model for the complete-data population $P(Y_{\text{com}}; \theta)$ used in the ML analysis is the same model used to obtain estimates and standard errors $(\hat{Q}_j, \sqrt{U_j})$, $j = 1,\ldots,m$ after MI. Without this assumption, there is no guarantee that the same population parameters are being estimated under the two methods. At present, we will also limit our attention to situations where the user of ML and the imputer both assume that the missing values are MAR, so that neither one is specifying a distribution for the missingness. This does not mean that the user of ML and the imputer are making identical assumptions about the missing data, because their models may still be different. For example, the imputer may use additional variables that do not appear in the analyses, or he may specify different inter-variable relationships (e.g. a structured covariance matrix versus an unstructured one).

Proposition 1. If the user of the ML procedure and the imputer use the same set of input data (same variables and observational units); if their models apply equivalent distributional assumptions to the variables and the relationships among them; if the sample size is large; and if the number of imputations is sufficiently large; then the results from the ML and MI procedures will be essentially identical.

Under these conditions, MI approximates a Bayesian analysis under the same model used in the ML procedure, and the asymptotic equivalence between Bayesian and likelihood-based procedures is well known (GELMAN et al., 1995). With large samples the effect of a prior distribution diminishes, and Bayesian and ML analyses produce similar results, especially when the prior is diffuse. For example, suppose that we want to regress a single variable $Y$ on a set of predictors $X_1,\ldots,X_p$ containing missing values.
We could do this in AMOS by specifying the model with a path from each predictor $X_j$ to $Y$, an error term for $Y$ that is uncorrelated with the predictors, and arbitrary correlations among the $X_j$'s. If the sample size is large, the parameter estimates and standard errors from AMOS will be essentially identical to those that would be obtained if the missing values were multiply imputed $m$ times using NORM, the imputed datasets were analyzed using conventional regression software, and the results combined according to RUBIN's (1987) rules.

When MI is carried out under these conditions, the imputer's and analyst's procedures are said to be congenial (MENG, 1994). Under congeniality, there is no major theoretical reason to prefer ML estimates to MI or vice-versa, because the properties of the two methods are very similar. In many real-world applications of MI, however, the imputer's and analyst's models are uncongenial. Some types of uncongeniality are clearly harmful; for example, an imputer might create MIs under a model that is grossly misspecified. But MENG (1994) and RUBIN (1996) have demonstrated that uncongeniality can also be beneficial, particularly when the imputer can take advantage of extra information unavailable or unused by the analyst. My next two propositions describe different kinds of uncongeniality; the first is rather benign, whereas the second may help or harm.
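Before turning to those propositions, the congenial case of Proposition 1 can be checked numerically in a much simpler setting than the AMOS and NORM example. The sketch below is an assumption-laden illustration, not the procedure of any cited program: two normal variables, $Y_2$ missing at random given $Y_1$, imputations drawn from a normal linear regression under the standard noninformative prior, and the mean of $Y_2$ as the estimand. Only point estimates are compared with the ML estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 2000, 50

# Y1 is fully observed; Y2 is MAR, with response probability depending on Y1.
y1 = rng.standard_normal(n)
y2 = 1.0 + 0.6 * y1 + 0.8 * rng.standard_normal(n)   # true E(Y2) = 1.0
obs = rng.random(n) < 1.0 / (1.0 + np.exp(y1))        # True where Y2 is observed
n_mis = int((~obs).sum())

# Least-squares fit of Y2 on Y1 among respondents.
X_obs = np.column_stack([np.ones(n - n_mis), y1[obs]])
X_mis = np.column_stack([np.ones(n_mis), y1[~obs]])
beta_hat, *_ = np.linalg.lstsq(X_obs, y2[obs], rcond=None)
resid = y2[obs] - X_obs @ beta_hat
df = X_obs.shape[0] - 2
xtx_inv = np.linalg.inv(X_obs.T @ X_obs)

# ML estimate of E(Y2) for this monotone pattern.
ml_mean = beta_hat[0] + beta_hat[1] * y1.mean()

# Imputation phase plus analysis phase, repeated m times.
q_hats, u_hats = [], []
for _ in range(m):
    sigma2 = resid @ resid / rng.chisquare(df)                  # draw sigma^2
    beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)  # draw beta
    y2_imp = y2.copy()
    y2_imp[~obs] = X_mis @ beta + np.sqrt(sigma2) * rng.standard_normal(n_mis)
    q_hats.append(y2_imp.mean())            # complete-data estimate
    u_hats.append(y2_imp.var(ddof=1) / n)   # complete-data squared SE

q_hats, u_hats = np.array(q_hats), np.array(u_hats)
q_bar = q_hats.mean()
t = u_hats.mean() + (1 + 1 / m) * q_hats.var(ddof=1)
print(f"ML: {ml_mean:.3f}  MI: {q_bar:.3f}  MI std. error: {np.sqrt(t):.3f}")
```

With the same variables, the same distributional assumptions, a large sample and many imputations, the two point estimates should agree closely, as the proposition states.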

Proposition 2. If the user of the ML procedure and the imputer use the same set of input data (same variables and observational units) but the imputer's model assumes a more general distributional form than the analyst's model, then ML and MI tend to produce similar parameter estimates, but the standard errors from MI may be larger.

For example, suppose that $Y_1,\ldots,Y_p$ are repeated measures of a variable over time, and missing values arise from dropout. Researcher A uses SAS PROC MIXED to fit a linear model with random intercepts and time-slopes for each individual, and then manipulates the fixed effects to estimate $E(Y_p)$, the population mean response at the final occasion. Researcher B multiply imputes the missing responses using NORM, calculates the sample mean of $Y_p$ in each imputed dataset, and combines the results by RUBIN's (1987) rules. The two researchers will obtain similar estimates of $E(Y_p)$, but A's standard errors may be slightly smaller than B's, because PROC MIXED assumes a patterned covariance structure for $Y_1,\ldots,Y_p$ whereas NORM applies an unstructured model. In my experience with real data, I have found that the increase in standard errors that arises when the imputation model is more general than the analysis model is often barely noticeable.

Proposition 3. If the user of the ML procedure and the imputer use the same sample of units but a different set of variables, then the results from ML and MI could be quite different, even though the ML user's model is equivalent to the model used by the analyst of the imputed dataset.

This can easily happen in practice. For example, suppose that Researcher A creates MIs for a set of variables $Y_1,\ldots,Y_p$ using NORM, but also includes in the imputation procedure some additional covariates $Z_1,\ldots,Z_q$.
After imputation, he ignores $Z_1,\ldots,Z_q$, fits a covariance structure model to $Y_1,\ldots,Y_p$ with LISREL using each of the imputed datasets and combines the results by RUBIN's (1987) rules. Researcher B fits the same covariance structure model directly to $Y_1,\ldots,Y_p$ without imputation using AMOS and finds that her results differ from A's. Many of her parameter estimates are similar to A's, but others are different enough to cause worry; some of her standard errors are similar, some are smaller and others are larger. In this example, discrepancies arise not because of inherent differences between MI and ML but because A included a set of auxiliary variables whereas B did not. If B had figured out a way to include the auxiliary variables in AMOS without altering the marginal model for $Y_1,\ldots,Y_p$, then her results would have been essentially identical to A's.

From this discussion, it is clear that non-trivial differences between ML and MI can arise when the imputation and analysis models are uncongenial. In the remainder of this paper, I present situations where uncongeniality can be used to our benefit as part of an intelligent missing-data strategy.

4 Using auxiliary variables in MI

Let $Y_1,\ldots,Y_p$ denote the key items or variables that we wish to analyze. If missing values occur in these variables, there may be other completely or partially observed variables $Z_1,\ldots,Z_q$ which are not inherently of interest, but which may potentially contain useful information for predicting the missing values. If so, then we may include $Z_1,\ldots,Z_q$ in a multiple imputation procedure but exclude them from the subsequent analysis. When is it beneficial to use $Z_1,\ldots,Z_q$ in this fashion? Are there potential dangers in doing so? COLLINS, SCHAFER and KAM (2001) (henceforth CSK) explore these questions in depth using simulation. Without going into details, I will attempt to explain and interpret our major findings.

CSK classify auxiliary variables $Z_1,\ldots,Z_q$ into three types. Type A variables are correlated with the outcomes $Y_1,\ldots,Y_p$ and may also help to explain why $Y_1,\ldots,Y_p$ are sometimes missing; that is, they are related to missingness. Type B variables are correlated with the outcomes $Y_1,\ldots,Y_p$ but unrelated to missingness. Type C variables are unrelated to $Y_1,\ldots,Y_p$. Suppose that we plan to multiply impute the missing values in $Y_1,\ldots,Y_p$ using one of the currently available MI programs (e.g. NORM) which assumes MAR. Because Type A variables are related to missingness, MAR is violated if they are excluded from the imputation procedure; therefore, including them may help to reduce bias. Type B variables will not reduce bias under MAR, but they may increase precision of the final parameter estimates because they contain useful information for predicting the missing values of $Y_1,\ldots,Y_p$. (Under MNAR conditions, Type B variables may help to reduce bias.) Type C variables will neither reduce bias nor increase precision; rather, they may reduce precision because they make the imputation model unnecessarily large and complicated.
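The precision gain from a Type B variable can be sketched with a small Monte Carlo experiment. This is not the CSK simulation design; the sample size, the correlation of 0.9, the 50% MCAR missingness rate, and the use of a regression-based estimator as a stand-in for MI with $Z$ in the imputation model are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho, reps = 200, 0.9, 500

without_z, with_z = [], []
for _ in range(reps):
    # Z is fully observed and strongly correlated with Y; true E(Y) = 0.
    z = rng.standard_normal(n)
    y = rho * z + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    obs = rng.random(n) < 0.5   # missingness unrelated to Z or Y (Type B setting)

    # Ignoring Z: mean of the observed Y values.
    without_z.append(y[obs].mean())

    # Using Z: regress Y on Z among respondents, evaluate at the mean of all Z.
    slope, intercept = np.polyfit(z[obs], y[obs], deg=1)
    with_z.append(intercept + slope * z.mean())

var_without, var_with = np.var(without_z), np.var(with_z)
print(f"mean without Z: {np.mean(without_z):+.3f}, with Z: {np.mean(with_z):+.3f}")
print(f"sampling variance without Z: {var_without:.5f}, with Z: {var_with:.5f}")
```

Both estimators are centered on the true mean, so $Z$ does nothing for bias here; its contribution is the smaller sampling variance of the estimate that uses it.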
In one set of simulations, CSK show that the biases incurred by failing to include Type A variables in the imputation procedure are not as serious as some have previously thought. Including or excluding a Type A variable $Z_k$ made little difference unless the correlation between $Z_k$ and an outcome $Y_j$ was unusually strong (much greater than 0.4) and the rate of missing values in $Y_j$ was high (50% or more). With weaker correlations and lower rates of missing values, biases in parameters pertaining to $Y_j$ and its relationships to other variables were barely noticeable when $Z_k$ was excluded. In another set of simulations, CSK show that including a Type B variable can substantially increase precision under MAR conditions if its correlation with the outcomes is strong (say, 0.9). This situation could easily arise in longitudinal studies, because repeated measures on individuals over time are highly correlated; responses at one occasion may be very useful for imputing missing responses at another. Under MNAR conditions, where the probability that $Y_j$ is missing depends directly on $Y_j$, an auxiliary variable $Z_k$ that is highly correlated with $Y_j$ will tend to reduce bias but will not eliminate it entirely. Finally, CSK show that the costs of unnecessarily including Type C variables in the imputation procedure tend to be minimal. Overall, there are potentially important gains and small risks associated with auxiliary variables in MI. Therefore, we suggest that users of MI be quite liberal in deciding whether to include extra variables in the imputation procedure, even when these variables are not likely to appear in subsequent analyses.

Agencies that collect and distribute data to the public for secondary analyses often have access to extra information (e.g., finely detailed geographic identifiers) that is not released for reasons of confidentiality, but which may be highly predictive of missing values. If the agency uses this additional information in an imputation procedure, the imputed data files released to the public will produce more efficient estimates than are possible by any missing-data procedure the user can implement himself. This is an example of the phenomenon known as superefficiency (MENG, 1994; RUBIN, 1996).

LITTLE and YAO (1996) describe an innovative use of MI with auxiliary variables for reducing bias in the estimation of an intent-to-treat (ITT) effect. In many randomized experiments, some subjects do not adhere to the treatment regimen to which they have been assigned. If this noncompliance is also accompanied by dropout, then the traditional ITT analysis (which compares subjects who completed the study on the basis of the group to which they were assigned, ignoring the treatment actually received) may produce a biased estimate of the actual ITT effect. The reason for this bias is that the rates of dropout are often related to the treatment actually received. In these settings, the treatment actually received provides useful information for imputing the missing values, but after imputation it is the treatment assigned (not treatment received) that is used to estimate the ITT effect.
In fairness, one should note that in principle it is also possible to include auxiliary variables in a likelihood-based procedure, so that the user of ML may derive the same benefits from them as the user of MI. In practice, however, this may be tricky. Consider a longitudinal study with dropout. Suppose that we have access to a covariate which, although it is not of substantive interest, may be highly predictive of dropping out. For example, at each occasion, we might ask, "How likely is it that you will remain in this study for at least one more assessment?" (1 = unlikely, 2 = somewhat likely, 3 = very likely). Incorporating this variable into a longitudinal analysis might convert a seriously MNAR situation to one that is effectively MAR. However, if we simply include this variable in our longitudinal regression model as a time-varying covariate, we have substantially changed the meaning and interpretation of the model, making it more difficult to estimate the marginal treatment effects of interest. The correct way to add this variable to the model is to make it an additional response, jointly modeling this variable and the outcome of interest as a bivariate function of treatment group and time. Such models tend to be difficult to fit with current software for longitudinal analysis.

Note that the advice given about auxiliary variables (that including them in an imputation procedure has potential benefits but few risks) does not apply to variables that are functions or summaries of the missingness $R$. For example, suppose that $Y_1,\ldots,Y_p$ are repeated measures in a longitudinal study with dropout, and we notice that subjects who drop out early tend to have different response trajectories from those who drop out later or not at all. It might seem reasonable to create a variable $Z$ equal to the number of occasions for which the subject remains in the study ($1, 2,\ldots,p$) and include it in the imputation model. Because this new variable is a function of $R$, including it in an MI procedure will produce imputations that are consistent with a particular MNAR model. Without $Z$, an MI procedure that assumes MAR will produce correct results under any MAR situation. But once $Z$ has been included, the same procedure may produce biased results under many MAR mechanisms. In fact, the results may be nonsensical because, unless care is taken, these summaries of $R$ may introduce parameters into the imputation model that cannot be identified from the observed data. (In this particular example, the correlation between $Z$ and $Y_p$ is not identified, because $Y_p$ is observed if and only if $Z = p$.) MNAR models are an important and potentially useful application for MI, but the user should be fully aware of the implications of these models and the special challenges they pose.

5 Imputation under MNAR models

When serious departures from MAR are suspected, it may be necessary to investigate alternative ways to jointly model the data and the missingness. Selection models specify a marginal distribution for the complete data, $P(Y_{\text{com}})$, and a conditional distribution for the missingness, $P(R \mid Y_{\text{com}})$. Selection models have intuitive appeal, but their results can be highly sensitive to unverifiable assumptions about the shape of the complete-data population (LITTLE and RUBIN, 1987, Chapter 11; KENWARD, 1998). Pattern-mixture models posit a marginal distribution for the patterns of missingness, $P(R)$, followed by a conditional model for the data distribution within patterns, $P(Y_{\text{com}} \mid R)$ (LITTLE, 1993). In pattern-mixture models, some unverifiable assumptions must inevitably be made to identify all the parameters in $P(Y_{\text{com}} \mid R)$.
These assumptions are no less onerous than those made by selection models, but in one sense they are more honest because the parameters that cannot be estimated from the observed data are made explicit. Pattern-mixture models for longitudinal data with dropout are reviewed by LITTLE (1995) and VERBEKE and MOLENBERGHS (2000).

One mildly unfortunate aspect of pattern-mixture models is that the effects of scientific interest are usually parameters of the marginal distribution of the complete data, not its conditional distribution within patterns. Therefore, when fitting these models, one must somehow manipulate the estimates from $P(R)$ and $P(Y_{\text{com}} \mid R)$ to obtain the desired estimates for $P(Y_{\text{com}}) = \sum_{R} P(R)\, P(Y_{\text{com}} \mid R)$. Alternatively, this process of averaging the results across patterns may be carried out by MI. Suppose that we generate imputations $Y_{\text{mis}}^{(1)},\ldots,Y_{\text{mis}}^{(m)}$ under a pattern-mixture model. Once these imputations exist, we may forget about $R$ and use the imputed datasets to estimate the parameters of $P(Y_{\text{com}})$ directly. In many cases, the model applied to $Y_{\text{com}}$ in this analysis phase may deviate from the implied model $P(Y_{\text{com}}) = \sum_{R} P(R)\, P(Y_{\text{com}} \mid R)$ of the imputation phase, resulting in a mild form of uncongeniality. In practice, of course, neither of these models will be true, and the observed data probably contain little information that would allow us to distinguish one from the other.

MI may also be used with selection models. In fact, any manner of specifying a joint model for $Y_{\text{com}}$ and $R$ may be used, as long as it is possible to sample from the posterior predictive distribution $P(Y_{\text{mis}} \mid Y_{\text{obs}}, R)$ under the model. Once $Y_{\text{mis}}^{(1)},\ldots,Y_{\text{mis}}^{(m)}$ have been created, further modeling of the missingness is unnecessary, and the analysis of the imputed datasets may proceed as usual. For this reason, MI seems to be an ideal device for sensitivity analyses. If imputations are generated under a variety of alternative models, the imputed datasets may be analyzed in the same manner and compared directly, without having to worry about the fact that the form of $P(Y_{\text{com}})$ and the meaning of its parameters may vary from one imputation model to the next.

MI under a variety of MNAR models is possible with current software. Any of the multivariate models implemented in the S+MissingData library (SCHIMERT et al., 2001), which apply to continuous and categorical variables, may be jointly applied to a set of outcomes $Y_1,\ldots,Y_p$ and to summaries of the missingness $R$. In most cases, some aspect of this joint distribution will not be identifiable from the observed data, and successful use of the routines may require omitting certain inter-variable relationships, use of an informative prior distribution, or both.

Example. SCHAFER (1997) previously analyzed incomplete data from the National Crime Survey conducted by the U.S. Bureau of the Census. Residents in a sample of housing units were interviewed to determine whether they had been victimized by crime in the preceding half-year. Six months later, the same units were visited again and the residents were asked the same question.
Missing values occurred at both occasions. The data are shown in Table 1. Let $Y_j$ denote victimization status (1 = no, 2 = yes) at occasions $j = 1, 2$. SCHAFER (1997, Section 7.3.4) generated $m = 10$ imputations of the missing values of $Y_1$ and $Y_2$ under the MAR assumption and used them to test hypotheses of independence and symmetry in the $Y_1 \times Y_2$ table.

Table 1. Victimization status from the National Crime Survey. Rows: victimized in first period? (No, Yes, Missing). Columns: victimized in second period? (No, Yes, Missing).

With S+MissingData, it is also quite easy to generate imputations under MNAR models. Let $R_j$ denote the missingness indicator for $Y_j$ (1 = response, 2 = missing value). In principle, any loglinear model may be applied to the four-way table $Y_1 \times Y_2 \times R_1 \times R_2$, but many of these models will be under-identified. Models that contain associations between $(Y_1, Y_2)$ and $(R_1, R_2)$ correspond to various hypotheses of MNAR. Perhaps the simplest MNAR model worth considering is the one containing the associations $Y_1Y_2$, $R_1R_2$, $Y_1R_1$ and $Y_2R_2$, which allows missingness at each occasion to be influenced by the response only at that occasion. That model has 8 free parameters, the maximum that can be identified from the nine observed frequencies reported in Table 1. As noted by FAY (1986), even though this model appears to be saturated, ML estimates under this model will not necessarily reproduce the observed frequencies in Table 1. Models more complicated than this one will not have unique ML estimates. It is still possible to produce MIs under more complicated models, provided that a proper prior distribution is applied to the parameters. It is probably unwise to do so, however, unless the prior distribution truly does reflect the analyst's a priori state of knowledge.

Using S+MissingData, I generated $m = 10$ imputations under the model $Y_1Y_2$, $R_1R_2$, $Y_1R_1$, $Y_2R_2$ and then collapsed the imputed tables over $R_1$ and $R_2$. Using standard techniques, I then calculated estimates and standard errors for the log of the odds ratio $\alpha = \pi_{11}\pi_{22}\pi_{12}^{-1}\pi_{21}^{-1}$ and the difference $\delta = \pi_{12} - \pi_{21}$ from each table, where $\pi_{ij} = P(Y_1 = i, Y_2 = j)$ (AGRESTI, 1990). The results are shown in Table 2 below. Combining the results by RUBIN's (1987) rules gives an estimate of 3.79 for the odds ratio $\alpha$ with a 95% interval of (2.16, 6.64), and an estimate of $-0.058$ for the difference $\delta$ with a 95% interval of $(-0.205, 0.089)$. The estimates are close to those reported by SCHAFER (1997) under the MAR assumption (3.60 for $\alpha$, $-0.039$ for $\delta$), but the new intervals are 15% wider for $\alpha$ and 270% wider for $\delta$.
Assuming MAR, the evidence for a shift in victimization rates from one occasion to the next ($\delta \neq 0$) was fairly strong ($p = 0.06$), but under the MNAR model the evidence has disappeared ($p = 0.40$). The S-PLUS code for creating these imputations and performing the post-imputation analyses is provided in the Appendix.

Table 2. Multiple imputations of $(Y_1, Y_2)$ under the MNAR model, with estimates and standard errors for the log-odds ratio $\alpha$ and difference $\delta$.
Rows: imputed counts for $(Y_1, Y_2)$ = (no, no), (no, yes), (yes, no), (yes, yes); $\log\hat{\alpha}$ and its SE; $\hat{\delta}$ and its SE. Columns: imputations 1 to 10.
Legible $\hat{\delta}$ entries: $-0.020$, $-0.037$, $-0.063$, $-0.144$, [...], $-0.142$, $-0.060$, $-0.104$.
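The per-table estimates and the pooling step described in this example can be sketched as follows. This is a minimal Python illustration, not the S-PLUS code of the Appendix. The function names `log_odds_ratio`, `offdiag_difference` and `rubin_pool` are hypothetical. The first table contains only the four fully observed counts entered in the Appendix (392, 55, 76, 38), and the three "toy" tables are made-up counts used solely to exercise the pooling function; none of them are the imputed tables of Table 2. The degrees of freedom follow the usual large-sample form of Rubin's (1987) rules, $\nu = (m-1)[1 + W/((1 + m^{-1})B)]^2$.

```python
import math
from statistics import mean, variance


def log_odds_ratio(table):
    """Log odds ratio and its large-sample SE for a 2x2 table of counts."""
    (n11, n12), (n21, n22) = table
    est = math.log(n11 * n22 / (n12 * n21))
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    return est, se


def offdiag_difference(table):
    """Estimate of pi_12 - pi_21 and its multinomial SE."""
    (n11, n12), (n21, n22) = table
    n = n11 + n12 + n21 + n22
    p12, p21 = n12 / n, n21 / n
    var = (p12 * (1 - p12) + p21 * (1 - p21) + 2 * p12 * p21) / n
    return p12 - p21, math.sqrt(var)


def rubin_pool(estimates, std_errors):
    """Combine m point estimates and SEs; returns (estimate, total SE, df)."""
    m = len(estimates)
    qbar = mean(estimates)                      # pooled point estimate
    w = mean([se ** 2 for se in std_errors])    # within-imputation variance
    b = variance(estimates)                     # between-imputation variance
    total = w + (1 + 1 / m) * b
    if b == 0:
        df = math.inf
    else:
        df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    return qbar, math.sqrt(total), df


if __name__ == "__main__":
    # Fully observed cells only (rows: Y1 = no/yes, columns: Y2 = no/yes).
    complete_cases = [[392, 55], [76, 38]]
    print(log_odds_ratio(complete_cases))
    print(offdiag_difference(complete_cases))

    # Made-up tables, standing in for m = 3 imputed tables.
    toy_tables = [[[40, 10], [12, 8]], [[38, 12], [11, 9]], [[41, 9], [13, 7]]]
    ests, ses = zip(*(log_odds_ratio(t) for t in toy_tables))
    print(rubin_pool(list(ests), list(ses)))
```

An interval is then formed as the pooled estimate plus or minus a $t_\nu$ quantile times the total standard error; for the odds ratio, the endpoints are exponentiated, as in the Appendix.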

What are we to make of these results? I am not entirely sure. On the one hand, we may be tempted to impute under MNAR models routinely because it is not difficult to do so. On the other hand, I suspect that these models may, in many cases, be unnecessarily complex. In this example, I tested the joint significance of the $Y_1R_1$ and $Y_2R_2$ associations by a likelihood-ratio test and found no evidence for them ($p = 0.98$). By nature, the data can provide almost no information about these associations; we can estimate them only in a very indirect way, by assuming that other associations (e.g., $Y_1R_2$) do not exist. What can we possibly achieve by adding to a model parameters that are barely identified, except to increase the width of our interval estimates by an arbitrary amount? This joint test for $Y_1R_1$ and $Y_2R_2$ is correctly interpreted as a test for MCAR, not a test for MAR. Nevertheless, omitting $Y_1R_1$ and $Y_2R_2$ results in a procedure that is valid under any MAR missingness model (a rather broad class), whereas including them produces results that are valid only under this particular MNAR model (a rather narrow class). It seems paradoxical that the inclusion of additional parameters, about which the data contain little evidence, produces a model that may in some sense be more restrictive than the model that omits them. Rather than rely heavily on poorly estimated MNAR models, I would prefer to examine auxiliary variables that may be related to missingness in $Y_1$ and $Y_2$, and include them in a richer imputation model under the assumption of MAR.

6 Analyses by less-than-fully parametric methods

In their quest for robustness, statisticians are fond of relaxing assumptions. In missing-data problems, however, there is an unpleasant aspect to this: once we dispense with a true likelihood function for $Y_{\mathrm{com}}$, we must usually bid farewell to inferential procedures that are valid under general MAR conditions.
Incomplete-data procedures derived from frequentist arguments are typically valid only under MCAR (RUBIN, 1976) or they require strong models for the missingness. A good example of the latter is weighting. If $X_1, \ldots, X_n$ is a random sample from a density $f(x)$, and the value $X_i = x$ becomes missing with probability $1 - g(x)$, then the observed data will be sampled from $f^*(x) \propto f(x)g(x)$, and consistent nonparametric estimates of moments of $f$ are available by applying weights $w_i \propto g^{-1}(X_i)$ to the observed $X_i$'s. The problem, of course, is that $g(x)$ is unknown and cannot be estimated from the observed data without strong assumptions or powerful auxiliary information. Therefore, statisticians may feel the need to choose between (a) fully parametric analyses that make strong assumptions about $P(Y_{\mathrm{com}})$, and (b) semi- or non-parametric analyses that weaken the assumptions about $P(Y_{\mathrm{com}})$ but make strong assumptions about the distribution of missingness. With MI, we can (almost) have the best of both worlds. If we create imputations $Y_{\mathrm{mis}}^{(1)}, \ldots, Y_{\mathrm{mis}}^{(m)}$ from a predictive distribution $P(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}})$ derived from a fully

parametric model, and then analyze the imputed datasets by a less-than-fully parametric method, we may be able to achieve better performance and greater robustness than is possible with any procedure that handles missing data and estimation in a single step. Any erroneous parametric assumptions in the imputation phase will effectively be applied only to the missing part of the dataset, not to the observed data. This type of uncongeniality was anticipated by MENG (1994), whose results suggest that these procedures tend to perform well as long as the imputation model is not grossly misspecified. MENG's (1994) theorems encompass certain types of estimation procedures but not others. For example, they do apply to certain techniques of design-based estimation commonly applied to sample surveys, but apparently they do not apply to semiparametric regression using generalized estimating equations (GEE) (MENG, 1999). Of course, although we may find it difficult to prove good performance analytically for the latter, that does not imply that good performance will not be seen in practice. Experience suggests that Bayesian MI does interact well with a variety of semi- and nonparametric estimation procedures.

Example. Consider a classic cluster sample, where $y_i$ is the observed total and $m_i$ is the sample size in cluster $i$, $i = 1, \ldots, n$. The usual estimate of the population mean is $\bar{y} = \sum_i y_i / \sum_i m_i$, with $\hat{V}(\bar{y}) = n \sum_i (y_i - m_i\bar{y})^2 / [(n-1)(\sum_i m_i)^2]$ as its estimated variance. As noted by SKINNER in his discussion of MENG (1994), these estimates appear to have no interpretation from a parametric likelihood or Bayesian standpoint under any population model. I examined the performance of Bayesian MI in conjunction with $\bar{y}$ and $\hat{V}(\bar{y})$ under a two-stage normal population, $\mu_i \sim N(\mu, \tau)$, $y_{ij} \sim N(\mu_i, \sigma^2)$, $i = 1, \ldots, n$, $j = 1, \ldots, m_i$.
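The estimator $\bar{y}$, its variance estimate $\hat{V}(\bar{y})$, and one draw from the two-stage normal population can be sketched in Python as below. This is only an illustration of the complete-data step, not the simulation code used for this paper, and it does not perform any imputation. The function names are hypothetical, $\tau$ and $\sigma^2$ are both treated as variances (an assumption that is consistent with the design effect reported for this example), the parameter values are those of the simulation described in this example, and the interval uses a normal quantile of 1.96 purely for convenience.

```python
import math
import random


def cluster_mean_and_variance(totals, sizes):
    """Ratio estimate ybar = sum(y_i) / sum(m_i) and its estimated variance
    n * sum((y_i - m_i * ybar)^2) / ((n - 1) * (sum(m_i))^2)."""
    n = len(totals)
    m_total = sum(sizes)
    ybar = sum(totals) / m_total
    ss = sum((y - m * ybar) ** 2 for y, m in zip(totals, sizes))
    return ybar, n * ss / ((n - 1) * m_total ** 2)


def simulate_two_stage_sample(n, mu, tau, sigma2, sizes=(20, 40), rng=None):
    """Draw n clusters with mu_i ~ N(mu, tau) and y_ij ~ N(mu_i, sigma2).

    tau and sigma2 are variances (assumed); cluster sizes are chosen from
    `sizes` with equal probability. Returns cluster totals and sizes.
    """
    rng = rng or random.Random()
    totals, ms = [], []
    for _ in range(n):
        m_i = rng.choice(sizes)
        mu_i = rng.gauss(mu, math.sqrt(tau))
        y_i = sum(rng.gauss(mu_i, math.sqrt(sigma2)) for _ in range(m_i))
        totals.append(y_i)
        ms.append(m_i)
    return totals, ms


if __name__ == "__main__":
    rng = random.Random(2003)  # arbitrary seed, for reproducibility only
    totals, sizes = simulate_two_stage_sample(50, 10.0, 5.0, 20.0, rng=rng)
    ybar, v = cluster_mean_and_variance(totals, sizes)
    half_width = 1.96 * math.sqrt(v)  # normal approximation
    print(f"ybar = {ybar:.3f}, interval = "
          f"({ybar - half_width:.3f}, {ybar + half_width:.3f})")
```

With multiply imputed data, the same function would be applied to each completed dataset and the resulting pairs of estimates and variances combined by Rubin's rules.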
I performed a simulation with $\mu = 10$, $\tau = 5$, $\sigma^2 = 20$, and $m_i = 20$ or 40 with equal probability, which produces a rather large design effect (deff $\approx$ 7.5). Drawing a sample of $n = 50$ clusters, I imposed missing values on the observations within each cluster in an MCAR fashion at a rate of 25%. I then generated five imputations of the missing values by a Bayesian procedure using a two-stage normal model and weakly informative prior distributions for the variance components. Finally, I calculated $\bar{y}$ and $\hat{V}(\bar{y})$ from each of the five imputed datasets and combined them by RUBIN's (1987) rules to obtain a single estimate and 95% confidence interval for the population mean. For comparison, I also calculated $\bar{y}$ and $\hat{V}(\bar{y})$ from the complete data (before missing values were imposed) and from the observed data alone (treating the incomplete data as a sample of clusters of smaller sizes). The entire procedure was repeated 1,000 times.

Over the 1,000 repetitions, the average point estimate was [...] using the complete data, [...] using the observed data alone, and [...] using MI, all very close to the true mean $\mu = 10$. The variances of the estimates for the three methods were [...] for complete data, [...] for observed data, and [...] for MI. Interestingly, an inflation of variance from [...] to [...] and from [...] to [...] corresponds to

rates of missing information of 3% and 5%, respectively, even though the actual observations are missing at a rate of 25%. Because the design effect in this example is strong, reducing the per-cluster sample sizes by 25% produces only a slight loss of information. Therefore, a high-quality missing-data procedure should produce interval estimates that are only slightly wider than those obtained from complete data.

How did the intervals fare? With complete data, 938 (94%) of the simulated intervals covered the true mean $\mu = 10$, and the average interval width was [...]. Using the observed data alone, 993 of the intervals covered $\mu = 10$, and the average width of the intervals was 1.87, an increase of 36%. Treating the observed data in each cluster as if it were complete data from a smaller cluster produces intervals that are unnecessarily wide and inefficient. This result seems a bit odd, given the very simple nature of the missingness distribution. With MI, however, the performance of the intervals was outstanding; 948 of them covered the true mean, and their average width (1.40) was only slightly greater than that of the complete-data intervals.

7 Discussion and future directions

In the early years of MI, many thought that imputing missing values under one model and analyzing the imputed datasets under another model (or no model at all) was ludicrous and potentially harmful (FAY, 1992). We have seen, however, that in many situations of practical interest, this strategy can be quite beneficial. Many important questions still need to be addressed in the area of MNAR models. A quick look at recent issues in biostatistics journals reveals a surprisingly large number of articles on selection and pattern-mixture models, particularly for dropout in longitudinal studies. Should we fit these models simply because we can? I hope that, in the future, we will see published analyses of data under MNAR models that answer more questions than they raise.
I hope to see detailed comparisons of the performance of various MAR and MNAR models in realistic scenarios where the true nature of the complete-data population and the missingness are unknown. The performance of Bayesian MI in conjunction with popular semi- or non-parametric analyses also needs further study. For example, consider the current procedures for design-based variance estimation from stratified multistage cluster samples as implemented in SUDAAN (RESEARCH TRIANGLE INSTITUTE, 1998). To what extent do the complexities of the sample design need to be incorporated into the imputation model? Do we need to account for each level of clustering, or will it be sufficient to incorporate only the highest level, the ultimate clusters that drive the variance estimation procedures?

Appendix

#########################################
# S-PLUS code for multiple imputation of victimization status from
# the National Crime Survey using an MNAR model
#########################################

# Enter the data
Y1 <- rep(c("Crime-Free", "Victim", "NA"), 3)
Y2 <- rep(c("Crime-Free", "Victim", "NA"), each=3)
count <- c(392, 76, 31, 55, 38, 7, 33, 9, 115)
R1 <- rep(c("Observed", "Observed", "Missing"), 3)
R2 <- rep(c("Observed", "Observed", "Missing"), each=3)
Crime <- data.frame(Y1=factor(Y1), Y2=factor(Y2), R1=factor(R1),
    R2=factor(R2), count=count)

# Attach the S+MissingData library
library(missing)

# Fit the loglinear model by maximum likelihood
Crime.mle <- mdLoglin(Crime,
    margins = count ~ Y1:Y2 + Y1:R1 + Y2:R2 + R1:R2, na.proc="em")

# Generate m=10 imputations using the default noninformative prior
set.seed(184)
Crime.imp <- impLoglin(Crime,
    margins = count ~ Y1:Y2 + Y1:R1 + Y2:R2 + R1:R2,
    prior = 0.5, nimpute = 10, control=list(niter=1000))

# Reduce the imputed data to Y1 x Y2 tables, collapsing over R's
Crime.imp.Y1xY2 <- miEval(oldUnclass(crosstabs(frequency ~ Y1 + Y2,
    data=Crime.imp)), vnames="Crime.imp")

# Compute estimates, SE's for log-odds ratios and combine them
# by Rubin's (1987) rules
logodds.est <- miEval(log(Crime.imp.Y1xY2[1,1]*Crime.imp.Y1xY2[2,2]/
    (Crime.imp.Y1xY2[2,1]*Crime.imp.Y1xY2[1,2])))
logodds.se <- miEval(sqrt(sum(1/Crime.imp.Y1xY2)))
logodds.result <- miMeanSE(logodds.est, logodds.se, df=Inf)
print(exp(logodds.result$est))
print(exp(logodds.result$est +
    c(-1,1)*qt(.975, logodds.result$df)*logodds.result$std.err))

# Do the same for the difference in proportions P(Y2=yes)-P(Y1=yes),
# which equals pi_12 - pi_21
diff.est <- miEval((Crime.imp.Y1xY2[1,2]-Crime.imp.Y1xY2[2,1])/
    sum(Crime.imp.Y1xY2))
diff.se <- miEval(sqrt((1/sum(Crime.imp.Y1xY2))*
    ((Crime.imp.Y1xY2[1,2]/sum(Crime.imp.Y1xY2))*
     (1-Crime.imp.Y1xY2[1,2]/sum(Crime.imp.Y1xY2)) +
     (Crime.imp.Y1xY2[2,1]/sum(Crime.imp.Y1xY2))*
     (1-Crime.imp.Y1xY2[2,1]/sum(Crime.imp.Y1xY2)) +
     2*(Crime.imp.Y1xY2[1,2]/sum(Crime.imp.Y1xY2))*
     (Crime.imp.Y1xY2[2,1]/sum(Crime.imp.Y1xY2)))))
diff.result <- miMeanSE(diff.est, diff.se, df=Inf)
print(diff.result$est)
print(diff.result$est +
    c(-1,1)*qt(.975, diff.result$df)*diff.result$std.err)

# Create Table 2 for displaying the imputations and results
Table2 <- matrix(NA, 8, 10)
dimnames(Table2) <- list(c("no, no ", "no, yes", "yes, no ", "yes, yes",
    "logodds", "se", "diff", "se"), format(1:10))
for(i in 1:10) {
    Table2[1:4,i] <- as.vector(t(Crime.imp.Y1xY2[[i]]))
    Table2[5,i] <- logodds.est[[i]]
    Table2[6,i] <- logodds.se[[i]]
    Table2[7,i] <- diff.est[[i]]
    Table2[8,i] <- diff.se[[i]]
}

# Test the joint significance of the Y1*R1 and Y2*R2 associations
Crime.mle2 <- mdLoglin(Crime, margins = count ~ Y1:Y2 + R1:R2,
    na.proc = "em")
lrtest <- 2*(Crime.mle$algorithm$likelihood -
    Crime.mle2$algorithm$likelihood)
print(1-pchisq(lrtest, 2))

References

Agresti, A. (1990), Categorical data analysis, John Wiley and Sons, New York.
Arbuckle, J. L. and W. Wothke (1999), AMOS 4.0 User's Guide, Smallwaters, Inc., Chicago.
Collins, L. M., B. P. Flaherty, S. L. Hyatt and J. L. Schafer (1999), WinLTA User's Guide, Version 2.0, The Methodology Center, The Pennsylvania State University, University Park, PA.
Collins, L. M., J. L. Schafer and C. M. Kam (2001), A comparison of inclusive and restrictive strategies in modern missing-data procedures, Psychological Methods 6.
Fay, R. E. (1986), Causal models for patterns of nonresponse, Journal of the American Statistical Association 81.
Fay, R. E. (1992), When are inferences from multiple imputation valid? Proceedings of the Survey Research Methods Section of the American Statistical Association.
Gelman, A., D. B. Rubin, J. Carlin and H. Stern (1995), Bayesian data analysis, Chapman and Hall, London.
Jöreskog, K. G. and D. Sörbom (2001), LISREL 8.5, Scientific Software International, Inc., Chicago.
Kenward, M. G. (1998), Selection models for repeated measurements for nonrandom dropout: an illustration of sensitivity, Statistics in Medicine 17.
King, G., J. Honaker, A. Joseph and K. Scheve (2001), Analyzing incomplete political science data: an alternative algorithm for multiple imputation, American Political Science Review 95.
Little, R. J. A. (1993), Pattern-mixture models for multivariate incomplete data, Journal of the American Statistical Association 88.

Little, R. J. A. (1995), Modeling the dropout mechanism in repeated-measures studies, Journal of the American Statistical Association 90.
Little, R. J. A. and D. B. Rubin (1987), Statistical analysis with missing data, John Wiley and Sons, New York.
Meng, X. L. (1994), Multiple-imputation inferences with uncongenial sources of input (with discussion), Statistical Science 10.
Meng, X. L. (1999), A congenial overview and investigation of imputation inferences under uncongeniality. Paper presented at International Conference on Survey Nonresponse, Portland, October.
Muthén, L. K. and B. O. Muthén (1998), Mplus User's Guide, Muthén and Muthén, Los Angeles.
Neale, M. C., S. M. Boker, G. Xie and H. H. Maes (1999), Mx: Statistical Modeling (5th ed.), Department of Psychiatry, Virginia Commonwealth University, Richmond, VA.
Raghunathan, T. E., P. W. Solenberger and J. Van Hoewyk (2000), IVEware: Imputation and Variance Estimation Software, Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI.
RESEARCH TRIANGLE INSTITUTE (1998), SUDAAN: Software for the Statistical Analysis of Correlated Data, Version 7, Research Triangle Institute, Research Triangle Park, NC.
Robins, J. M., A. Rotnitzky and D. O. Scharfstein (1998), Semiparametric regression for repeated outcomes with non-ignorable non-response, Journal of the American Statistical Association 93.
Rubin, D. B. (1976), Inference and missing data, Biometrika 63.
Rubin, D. B. (1987), Multiple imputation for nonresponse in surveys, John Wiley and Sons, New York.
Rubin, D. B. (1996), Multiple imputation after 18+ years, Journal of the American Statistical Association 91.
Schafer, J. L. (1997), Analysis of incomplete multivariate data, Chapman and Hall, London.
Schafer, J. L. (1999), NORM: Multiple imputation of incomplete multivariate data under a normal model, Software for Windows, Department of Statistics, The Pennsylvania State University, University Park, PA.
Schimert, J., J. L. Schafer, T. Hesterberg, C. Fraley and D. Clarkson (2001), Analyzing missing values in S-PLUS, Insightful Corporation, Seattle, WA.
STATISTICAL SOLUTIONS INC. (2002), SOLAS for missing data analysis, Version 3, Statistical Solutions, Cork, Ireland.
Van Buuren, S. and C. G. M. Oudshoorn (1999), Flexible multivariate imputation by MICE, TNO/VGZ/PG, TNO Prevention and Health, Leiden.
Verbeke, G. and G. Molenberghs (2000), Linear mixed models for longitudinal data, Springer-Verlag, New York.
Yuan, Y. C. (2000), Multiple imputation for missing data: concepts and new development, Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference, Paper 267, SAS Institute, Cary, NC.
Zeger, S. L., K. Y. Liang and P. S. Albert (1988), Models for longitudinal data: a generalized estimating equation approach, Biometrics 44.

Received: February 2002, Revised: October 2002.