# Multiple Imputation in Multivariate Problems When the Imputation and Analysis Models Differ


Statistica Neerlandica (2003), Vol. 57, nr. 1

Joseph L. Schafer*
Department of Statistics and The Methodology Center, The Pennsylvania State University, 326 Thomas Building, University Park, PA 16802, USA

Bayesian multiple imputation (MI) has become a highly useful paradigm for handling missing values in many settings. In this paper, I compare Bayesian MI with other methods, maximum likelihood in particular, and point out some of its unique features. One key aspect of MI, the separation of the imputation phase from the analysis phase, can be advantageous in settings where the models underlying the two phases do not agree.

Key words and phrases: missing data, nonresponse.

## 1 Fundamentals

In modern statistical practice, the occurrence of missing values is usually viewed as a random phenomenon. Let Y_com = (Y_obs, Y_mis) denote a set of complete data and Y_obs the data actually observed. The missing data, Y_mis, denote real or hypothetical quantities which, if available, would simplify the analysis. Our primary interest is in some aspect of the population distribution of Y_com, P(Y_com | θ), not the distribution of Y_obs, nor the process that partitions Y_com into Y_obs and Y_mis. Nevertheless, we suppose that the partitioning of Y_com is encoded in a set of random variables R. For example, if Y_com is a data matrix containing both observed and missing values, then R could be a matrix of the same size containing 1's and 0's to show whether the corresponding elements of Y_com are observed or missing. We will refer to R as the missingness, and to P(R | Y_com; ξ) as the distribution of missingness. Note that R is itself completely observed. We posit a distribution for R not because we want to model it; rather, we hope to avoid modeling R, because any model we might specify for it could be implausible.
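As a small illustration of this encoding, the indicator matrix R can be computed directly from a data matrix in which missing entries are marked; the data values below are hypothetical.

```python
import numpy as np

# Hypothetical 3 x 3 data matrix Y_com with missing entries coded as np.nan.
Y_com = np.array([[1.2, 3.4, np.nan],
                  [2.1, np.nan, 0.7],
                  [0.5, 1.8, 2.2]])

# R has the same shape as Y_com: 1 where the element is observed, 0 where missing.
R = (~np.isnan(Y_com)).astype(int)
print(R)
```

Note that R itself, unlike Y_com, contains no missing entries: the missingness is always fully observed.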
* This work was supported by grant 1-P50-DA10075, National Institute on Drug Abuse. Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

Reasons for missingness are often not present in Y_com, because explaining R is usually not the purpose of data collection. Nevertheless, these reasons may be related to aspects of Y_com and, by omission, may induce relationships between R and Y_com. The distribution of missingness should be regarded as a mathematical device to describe

the rates and patterns of missing values and to capture approximately the relationships between R and Y_com in a correlational (not causal) sense. Usually, our main reason for introducing P(R | Y_com; ξ) is to clarify the conditions under which we may avoid specifying it. Elementary probability theory suggests that P(Y_obs | θ) = ∫ P(Y_com | θ) dY_mis may provide a basis for inference about θ, both as a sampling distribution for Y_obs and as a likelihood function for θ. But that is true only under certain conditions. Missing data are missing at random (MAR) if the distribution of missingness does not depend on Y_mis: P(R | Y_com; ξ) = P(R | Y_obs; ξ) (RUBIN, 1976). They are missing completely at random (MCAR) if it depends on neither Y_obs nor Y_mis: P(R | Y_com; ξ) = P(R; ξ) (LITTLE and RUBIN, 1987). Missing not at random (MNAR) refers to any violation of MAR. RUBIN (1976) showed that P(Y_obs | θ) provides a correct sampling distribution for frequentist inference about θ under MCAR, and a correct likelihood function for likelihood/Bayes inference under the weaker condition of MAR. (To be precise, one also needs the parameters of the missingness distribution, ξ, to be distinct from or independent of θ; see LITTLE and RUBIN, 1987, for details.) That models for R may be avoided more often in a likelihood/Bayesian mode than in a frequentist mode suggests that missing-data procedures based on likelihood may be more useful than those motivated by frequentist arguments alone. For the most part, I believe that is true. Many of the older data-editing procedures (e.g. listwise deletion) are based on frequentist arguments and are generally valid only under the strong and often implausible assumption of MCAR. Even if MCAR does hold, these procedures may be inefficient.
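To make the MAR condition concrete, here is a small simulation of my own (not from the paper): Y_2 is made missing with a probability that depends only on the fully observed Y_1, so the mechanism is MAR but not MCAR, and the complete-case (listwise-deletion) mean of Y_2 is biased.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
y1 = rng.normal(size=n)             # always observed
y2 = y1 + rng.normal(size=n)        # true E(Y_2) = 0

# P(Y_2 missing) depends on Y_1 only: P(R | Y_com; xi) = P(R | Y_obs; xi), i.e. MAR.
p_miss = 1.0 / (1.0 + np.exp(-y1))
observed = rng.random(n) > p_miss   # True where Y_2 is observed

# Listwise deletion is biased here: subjects with large Y_1 (hence large Y_2)
# are more likely to be missing, dragging the complete-case mean well below 0.
print(round(y2[observed].mean(), 2))
```

A likelihood-based analysis of Y_obs under a correct bivariate model would not suffer this bias under MAR; only the complete-case summary does.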
Methods for analyzing data without a true likelihood function (marginal modeling based on generalized estimating equations, for example) can handle missing values only if they are MCAR (ZEGER, LIANG and ALBERT, 1988) or if one specifies a correct model for R (ROBINS, ROTNITZKY and SCHARFSTEIN, 1998). As a general principle, I believe that an analyst's time and effort are better spent building an intelligent model for the data rather than modeling the missingness, unless departures from MAR are suspected to be very serious. Consider a multivariate problem where we collect items Y_j, j = 1,...,p for subjects i = 1,...,n, so that Y_com is an n × p data matrix, but portions of Y_com are missing for reasons beyond our control. If Y_1,...,Y_p represent repeated measures of the same variable at different occasions, and if subjects who drop out do not return (i.e., if Y_j is missing then Y_{j+1},...,Y_p are missing as well), then MAR is easy to understand: it means that a subject's probability of dropping out at occasion j, given that he has not yet dropped out, may depend on the previous responses Y_1,...,Y_{j-1}, but not on the present or future Y_j,...,Y_p. In more general multivariate settings, however, MAR is not very intuitive; it means that a subject's probabilities of responding to Y_1,...,Y_p may depend only on his or her own set of observed items, a set that changes from one subject to the next. One could say that this condition seems odd or unnatural. However, the apparent awkwardness of MAR does not imply that assuming MAR is unwise. COLLINS, SCHAFER and KAM (2001) demonstrated that, with continuous data,

an erroneous assumption of MAR (failing to take into account a cause or correlate of missingness) may have only a minor impact on estimates and standard errors unless the relationships between the omitted cause or correlate of missingness and the outcomes are unusually strong (ρ much greater than 0.5). In typical social science applications, I believe that such strong correlations between causes of missingness and outcomes are the exception rather than the rule, and assuming MAR will probably not lead us far astray. When MNAR is a serious concern (e.g., in a clinical trial where subjects drop out if the treatment is not working), it may be necessary to jointly model Y_com and R, but such models must be based on other unverifiable assumptions. Methods for joint modeling of Y_com and R in longitudinal studies with dropout are reviewed by LITTLE (1995) and by VERBEKE and MOLENBERGHS (2000).

## 2 Bayesian multiple imputation

Inferences about θ may proceed from a likelihood based on P(Y_obs | θ) under MAR or from a joint model for Y_com and R under MNAR. Procedures for ML estimation in multivariate missing-data problems are becoming widely available. Programs for fitting linear mixed models to longitudinal data (e.g., PROC MIXED in SAS) allow missing responses. ML for latent-variable models with incomplete data is found in Amos (ARBUCKLE and WOTHKE, 1999), Mx (NEALE et al., 1999), LISREL 8.5 (JÖRESKOG and SÖRBOM, 2001), LTA (COLLINS et al., 1999) and Mplus (MUTHÉN and MUTHÉN, 1998). All of these programs assume that the missing values are MAR. A useful alternative to direct likelihood methods is given by multiple imputation (MI) (RUBIN, 1987). Suppose that Q is some aspect of the distribution of Y_com to be estimated, and that an estimate Q̂ and a standard error √U could be easily calculated if Y_mis were available.
In MI, Y_mis is replaced by m > 1 simulated versions Y_mis^(1),...,Y_mis^(m), resulting in m estimates and standard errors (Q̂_j, U_j), j = 1,...,m. From RUBIN (1987), an overall estimate for Q is Q̄ = m⁻¹ Σ_{j=1}^m Q̂_j, with a standard error of √T, where T = Ū + (1 + m⁻¹)B, Ū = m⁻¹ Σ_{j=1}^m U_j, and B = (m − 1)⁻¹ Σ_{j=1}^m (Q̂_j − Q̄)². If (Q̂ − Q)/√U is approximately N(0, 1) with complete data, then (Q̄ − Q)/√T ∼ t_ν provides tests and intervals for Q, where ν = (m − 1)(1 + r⁻¹)² and r = (1 + m⁻¹)B/Ū. Where do the imputations Y_mis^(1),...,Y_mis^(m) come from? They are drawn from a Bayesian posterior predictive distribution for Y_mis given Y_obs. The model for complete data, P(Y_com; θ), implies a conditional distribution for the missing values given the observed ones, P(Y_mis | Y_obs; θ). We could use this distribution for imputation if θ were known. But because θ is unknown, we must generate m independent random draws θ^(1),...,θ^(m) from a Bayesian posterior distribution given Y_obs and R, which depends only on Y_obs if the missing data are MAR. Once the random parameters are drawn, the imputations follow as Y_mis^(j) ∼ P(Y_mis | Y_obs; θ^(j)), j = 1,...,m. In most applications, the number of imputations does not need to be large; m = 5 is often enough. Computational strategies for creating MIs in multivariate settings are

reviewed by SCHAFER (1997). Programs for creating MIs include NORM (SCHAFER, 1999), the S+MissingData library in S-PLUS (SCHIMERT et al., 2001), SAS PROC MI (YUAN, 2000), Amelia (KING et al., 2001), SOLAS (STATISTICAL SOLUTIONS, 2002), MICE (VAN BUUREN and OUDSHOORN, 1999), and IVEware (RAGHUNATHAN, SOLENBERGER and VAN HOEWYK, 2000). An excellent resource for information about these and other programs for MI is the website. These programs assume that the missing data are MAR. However, in some cases, we can also use them to generate imputations under certain MNAR models by creating variables out of the missing-data pattern R and treating these variables as additional covariates; this idea will be explored in Section 5 below. It is easy to show that, when the estimand Q is a function of θ, MI provides approximate Bayesian inferences for Q (SCHAFER, 1997). Therefore, with a large sample and a diffuse prior distribution, MI and direct likelihood methods produce similar answers. From this standpoint, MI seems to be an unnecessarily circuitous way to summarize the likelihood for θ. Why is MI worth considering at all? One reason is that, unlike ML, MI separates the inference into two phases: the imputation phase, in which Y_mis^(1),...,Y_mis^(m) are created, and the analysis phase, in which the (Q̂_j, U_j), j = 1,...,m are calculated and combined. Because the phases are distinct, imputation and analysis may be carried out on different occasions by different persons. For example, the National Center for Health Statistics in the United States has recently released a multiply-imputed version of the Third National Health and Nutrition Examination Survey (NHANES III). Special techniques and software were used to create the imputations, but now that they exist, the imputed data files may be analyzed in a straightforward manner by anyone familiar with standard techniques of survey data analysis.
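RUBIN's (1987) combining rules quoted above are simple enough to sketch directly; the estimates and within-imputation variances below are hypothetical numbers, not the output of any real imputation run.

```python
import math

def rubin_combine(q_hats, u_s):
    """Rubin's (1987) rules: q_hats are the m complete-data estimates Q_j;
    u_s are the squared standard errors (within-imputation variances) U_j."""
    m = len(q_hats)
    q_bar = sum(q_hats) / m                              # overall estimate
    u_bar = sum(u_s) / m                                 # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in q_hats) / (m - 1)  # between-imputation variance
    t = u_bar + (1 + 1 / m) * b                          # total variance
    r = (1 + 1 / m) * b / u_bar                          # relative increase in variance
    nu = (m - 1) * (1 + 1 / r) ** 2                      # df of the t reference distribution
    return q_bar, math.sqrt(t), nu

# Hypothetical results from m = 5 imputed datasets.
q_bar, se, nu = rubin_combine([1.0, 1.2, 0.9, 1.1, 1.05],
                              [0.04, 0.05, 0.04, 0.05, 0.045])
print(q_bar, round(se, 3), nu)   # 1.05 0.245 64.0
```

The interval Q̄ ± t_ν √T then follows from any t-quantile routine.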
The NHANES III Multiply Imputed Data Set is available at nhanes/nh3data.htm. Another advantage of separating imputation from analysis, and the primary topic of the rest of this paper, is that the imputation and analysis may be carried out under different models. For example, it is a relatively simple matter to include extra variables in an imputation procedure but ignore them in later analyses. Properties of MI when the imputation and analysis models differ have been explored from a theoretical standpoint by MENG (1994) and RUBIN (1996), and from a practical standpoint by COLLINS, SCHAFER and KAM (2001). Discrepancies between the models are not necessarily harmful and can often be advantageous.

## 3 Comparing the results from MI and ML

To understand what happens when the imputation and analysis models differ, it helps first to compare the results of an ML procedure applied to Y_obs alone (which uses a single model) to an analysis based on MI (which uses two models). This discussion, which is summarized from COLLINS, SCHAFER and KAM (2001), assumes that the

model for the complete-data population P(Y_com; θ) used in the ML analysis is the same model used to obtain the estimates and standard errors (Q̂_j, U_j), j = 1,...,m after MI. Without this assumption, there is no guarantee that the same population parameters are being estimated under the two methods. At present, we will also limit our attention to situations where the user of ML and the imputer both assume that the missing values are MAR, so that neither one is specifying a distribution for the missingness. This does not mean that the user of ML and the imputer are making identical assumptions about the missing data, because their models may still be different. For example, the imputer may use additional variables that do not appear in the analyses, or he may specify different inter-variable relationships (e.g. a structured covariance matrix versus an unstructured one).

Proposition 1. If the user of the ML procedure and the imputer use the same set of input data (same variables and observational units); if their models apply equivalent distributional assumptions to the variables and the relationships among them; if the sample size is large; and if the number of imputations is sufficiently large; then the results from the ML and MI procedures will be essentially identical.

Under these conditions, MI approximates a Bayesian analysis under the same model used in the ML procedure, and the asymptotic equivalence between Bayesian and likelihood-based procedures is well known (GELMAN et al., 1995). With large samples the effect of a prior distribution diminishes, and Bayesian and ML analyses produce similar results, especially when the prior is diffuse. For example, suppose that we want to regress a single variable Y on a set of predictors X_1,...,X_p containing missing values.
We could do this in AMOS by specifying a model with a path from each predictor X_j to Y, a Y-error term that is uncorrelated with the predictors, and arbitrary correlations among the X_j's. If the sample size is large, the parameter estimates and standard errors from AMOS will be essentially identical to those that would be obtained if the missing values were multiply imputed m times using NORM, the imputed datasets were analyzed using conventional regression software, and the results combined according to RUBIN's (1987) rules. When MI is carried out under these conditions, the imputer's and analyst's procedures are said to be congenial (MENG, 1994). Under congeniality, there is no major theoretical reason to prefer ML estimates to MI or vice versa, because the properties of the two methods are very similar. In many real-world applications of MI, however, the imputer's and analyst's models are uncongenial. Some types of uncongeniality are clearly harmful; for example, an imputer might create MIs under a model that is grossly misspecified. But MENG (1994) and RUBIN (1996) have demonstrated that uncongeniality can also be beneficial, particularly when the imputer can take advantage of extra information unavailable to or unused by the analyst. My next two propositions describe different kinds of uncongeniality; the first is rather benign, whereas the second may help or harm.
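To see the congenial regression case in action, here is a toy end-to-end MI pipeline in plain Python rather than AMOS or NORM. It is a simplification: the posterior draw of the imputation-model parameters is approximated by refitting on a bootstrap resample of the complete cases, and the data, coefficients, and missingness rate are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 2_000, 5
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)   # analysis model: regress Y on X
x_missing = rng.random(n) < 0.3          # 30% of X lost (MCAR, for simplicity)
cc = np.flatnonzero(~x_missing)          # complete cases

slopes = []
for _ in range(m):
    # Approximate a posterior draw of the imputation-model parameters
    # by refitting X | Y on a bootstrap resample of the complete cases.
    boot = rng.choice(cc, size=cc.size, replace=True)
    b1, b0 = np.polyfit(y[boot], x[boot], 1)
    resid_sd = np.std(x[boot] - (b0 + b1 * y[boot]))

    # Impute: conditional mean plus random residual noise (never the mean alone).
    x_imp = x.copy()
    x_imp[x_missing] = (b0 + b1 * y[x_missing]
                        + resid_sd * rng.normal(size=x_missing.sum()))

    # Analysis phase: ordinary regression of Y on the filled-in X.
    slope, _ = np.polyfit(x_imp, y, 1)
    slopes.append(slope)

print(round(float(np.mean(slopes)), 2))  # close to the true slope 1.5
```

Combining the m slopes and their standard errors by RUBIN's (1987) rules would complete the inference; under this congenial setup the answer tracks what a direct ML fit of the joint model would give.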

Proposition 2. If the user of the ML procedure and the imputer use the same set of input data (same variables and observational units) but the imputer's model assumes a more general distributional form than the analyst's model, then ML and MI tend to produce similar parameter estimates, but the standard errors from MI may be larger.

For example, suppose that Y_1,...,Y_p are repeated measures of a variable over time, and missing values arise from dropout. Researcher A uses SAS PROC MIXED to fit a linear model with random intercepts and time-slopes for each individual, and then manipulates the fixed effects to estimate E(Y_p), the population mean response at the final occasion. Researcher B multiply imputes the missing responses using NORM, calculates the sample mean of Y_p in each imputed dataset, and combines the results by RUBIN's (1987) rules. The two researchers will obtain similar estimates of E(Y_p), but A's standard errors may be slightly smaller than B's, because PROC MIXED assumes a patterned covariance structure for Y_1,...,Y_p whereas NORM applies an unstructured model. In my experience with real data, I have found that the increase in standard errors that arises when the imputation model is more general than the analysis model is often barely noticeable.

Proposition 3. If the user of the ML procedure and the imputer use the same sample of units but a different set of variables, then the results from ML and MI could be quite different, even though the ML user's model is equivalent to the model used by the analyst of the imputed dataset.

This can easily happen in practice. For example, suppose that Researcher A creates MIs for a set of variables Y_1,...,Y_p using NORM, but also includes in the imputation procedure some additional covariates Z_1,...,Z_q.
After imputation, he ignores Z_1,...,Z_q, fits a covariance structure model to Y_1,...,Y_p with LISREL using each of the imputed datasets, and combines the results by RUBIN's (1987) rules. Researcher B fits the same covariance structure model directly to Y_1,...,Y_p without imputation using AMOS and finds that her results differ from A's. Many of her parameter estimates are similar to A's, but others are different enough to cause worry; some of her standard errors are similar, some are smaller and others are larger. In this example, discrepancies arise not because of inherent differences between MI and ML but because A included a set of auxiliary variables whereas B did not. If B had figured out a way to include the auxiliary variables in LISREL without altering the marginal model for Y_1,...,Y_p, then her results would have been essentially identical to A's. From this discussion, it is clear that non-trivial differences between ML and MI can arise when the imputation and analysis models are uncongenial. In the remainder of this paper, I present situations where uncongeniality can be used to our benefit as part of an intelligent missing-data strategy.

## 4 Using auxiliary variables in MI

Let Y_1,...,Y_p denote the key items or variables that we wish to analyze. If missing values occur in these variables, there may be other completely or partially observed variables Z_1,...,Z_q which are not inherently of interest, but which may potentially contain useful information for predicting the missing values. If so, then we may include Z_1,...,Z_q in a multiple imputation procedure but exclude them from the subsequent analysis. When is it beneficial to use Z_1,...,Z_q in this fashion? Are there potential dangers in doing so? COLLINS, SCHAFER and KAM (2001) (henceforth CSK) explore these questions in depth using simulation. Without going into details, I will attempt to explain and interpret our major findings. CSK classify auxiliary variables Z_1,...,Z_q into three types. Type A variables are correlated with the outcomes Y_1,...,Y_p and may also help to explain why Y_1,...,Y_p are sometimes missing; that is, they are related to missingness. Type B variables are correlated with the outcomes Y_1,...,Y_p but unrelated to missingness. Type C variables are unrelated to Y_1,...,Y_p. Suppose that we plan to multiply impute the missing values in Y_1,...,Y_p using one of the currently available MI programs (e.g. NORM) which assume MAR. Because Type A variables are related to missingness, MAR is violated if they are excluded from the imputation procedure; therefore, including them may help to reduce bias. Type B variables will not reduce bias under MAR, but they may increase the precision of the final parameter estimates because they contain useful information for predicting the missing values of Y_1,...,Y_p. (Under MNAR conditions, Type B variables may help to reduce bias.) Type C variables will neither reduce bias nor increase precision; rather, they may reduce precision because they make the imputation model unnecessarily large and complicated.
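A toy sketch of my own (not one of the CSK simulations) of why a Type B variable helps: an auxiliary Z correlated about 0.9 with the outcome Y predicts the missing Y values well, so an imputation model that includes Z recovers much of the lost information.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
z = rng.normal(size=n)                                     # Type B auxiliary variable
y = 0.9 * z + np.sqrt(1 - 0.9 ** 2) * rng.normal(size=n)   # corr(Y, Z) = 0.9

missing = rng.random(n) < 0.5                              # MCAR missingness in Y
obs = ~missing

# Imputation model: regress Y on Z using the observed cases, then predict.
# (A proper MI procedure would also add residual noise and draw the regression
# parameters from their posterior; both are omitted here for brevity.)
b1, b0 = np.polyfit(z[obs], y[obs], 1)
y_pred = b0 + b1 * z[missing]

# The predictions track the unobserved truth closely because corr(Y, Z) is high.
print(round(float(np.corrcoef(y_pred, y[missing])[0, 1]), 2))
```

With a weakly correlated (Type C) Z the same correlation would be near zero, illustrating why such variables add complexity without adding information.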
In one set of simulations, CSK show that the biases incurred by failing to include Type A variables in the imputation procedure are not as serious as some have previously thought. Including or excluding a Type A variable Z_k made little difference unless the correlation between Z_k and an outcome Y_j was unusually strong (much greater than 0.4) and the rate of missing values in Y_j was high (50% or more). With weaker correlations and lower rates of missing values, biases in parameters pertaining to Y_j and its relationships to other variables were barely noticeable when Z_k was excluded. In another set of simulations, CSK show that including a Type B variable can substantially increase precision under MAR conditions if its correlation with the outcomes is strong (say, 0.9). This situation could easily arise in longitudinal studies, because repeated measures on individuals over time are highly correlated; responses at one occasion may be very useful for imputing missing responses at another. Under MNAR conditions, where the probability that Y_j is missing depends directly on Y_j, an auxiliary variable Z_k that is highly correlated with Y_j will tend to reduce bias but will not eliminate it entirely. Finally, CSK show that the costs of unnecessarily including Type C variables in the imputation procedure tend to be minimal. Overall, there are potentially

from those who drop out later or not at all. It might seem reasonable to create a variable Z equal to the number of occasions for which the subject remains in the study (1, 2,...,p) and include it in the imputation model. Because this new variable is a function of R, including it in an MI procedure will produce imputations that are consistent with a particular MNAR model. Without Z, an MI procedure that assumes MAR will produce correct results under any MAR situation. But once Z has been included, the same procedure may produce biased results under many MAR mechanisms. In fact, the results may be nonsensical because, unless care is taken, these summaries of R may introduce parameters into the imputation model that cannot be identified from the observed data. (In this particular example, the correlation between Z and Y_p is not identified, because Y_p is observed if and only if Z = p.) MNAR models are an important and potentially useful application for MI, but the user should be fully aware of the implications of these models and the special challenges they pose.

## 5 Imputation under MNAR models

When serious departures from MAR are suspected, it may be necessary to investigate alternative ways to jointly model the data and the missingness. Selection models specify a marginal distribution for the complete data, P(Y_com), and a conditional distribution for the missingness, P(R | Y_com). Selection models have intuitive appeal, but their results can be highly sensitive to unverifiable assumptions about the shape of the complete-data population (LITTLE and RUBIN, 1987, Chapter 11; KENWARD, 1998). Pattern-mixture models posit a marginal distribution for the patterns of missingness, P(R), followed by a conditional model for the data distribution within patterns, P(Y_com | R) (LITTLE, 1993). In pattern-mixture models, some unverifiable assumptions must inevitably be made to identify all the parameters in P(Y_com | R).
These assumptions are no less onerous than those made by selection models, but in one sense they are more honest, because the parameters that cannot be estimated from the observed data are made explicit. Pattern-mixture models for longitudinal data with dropout are reviewed by LITTLE (1995) and VERBEKE and MOLENBERGHS (2000). One mildly unfortunate aspect of pattern-mixture models is that the effects of scientific interest are usually parameters of the marginal distribution of the complete data, not its conditional distribution within patterns. Therefore, when fitting these models, one must somehow manipulate the estimates from P(R) and P(Y_com | R) to obtain the desired estimates for P(Y_com) = Σ_R P(R) P(Y_com | R). Alternatively, this process of averaging the results across patterns may be carried out by MI. Suppose that we generate imputations Y_mis^(1),...,Y_mis^(m) under a pattern-mixture model. Once these imputations exist, we may forget about R and use the imputed datasets to estimate the parameters of P(Y_com) directly. In many cases, the model applied to Y_com in this analysis phase may deviate from the implied model

P(Y_com) = Σ_R P(R) P(Y_com | R) of the imputation phase, resulting in a mild form of uncongeniality. In practice, of course, neither of these models will be true, and the observed data probably contain little information that would allow us to distinguish one from the other. MI may also be used with selection models. In fact, any manner of specifying a joint model for Y_com and R may be used, as long as it is possible to sample from the posterior predictive distribution P(Y_mis | Y_obs, R) under the model. Once Y_mis^(1),...,Y_mis^(m) have been created, further modeling of the missingness is unnecessary, and the analysis of the imputed datasets may proceed as usual. For this reason, MI seems to be an ideal device for sensitivity analyses. If imputations are generated under a variety of alternative models, the imputed datasets may be analyzed in the same manner and compared directly, without having to worry about the fact that the form of P(Y_com) and the meaning of its parameters may vary from one imputation model to the next. MI under a variety of MNAR models is possible with current software. Any of the multivariate models implemented in the S+MissingData library (SCHIMERT et al., 2001), which apply to continuous and categorical variables, may be jointly applied to a set of outcomes Y_1,...,Y_p and to summaries of the missingness R. In most cases, some aspect of this joint distribution will not be identifiable from the observed data, and successful use of the routines may require omitting certain inter-variable relationships, use of an informative prior distribution, or both.

Example. SCHAFER (1997) previously analyzed incomplete data from the National Crime Survey conducted by the U.S. Bureau of the Census. Residents in a sample of housing units were interviewed to determine whether they had been victimized by crime in the preceding half-year. Six months later, the same units were visited again and the residents were asked the same question.
Missing values occurred at both occasions. The data are shown in Table 1. Let Y_j denote victimization status (1 = no, 2 = yes) at occasions j = 1, 2. SCHAFER (1997, Section 7.3.4) generated m = 10 imputations of the missing values of Y_1 and Y_2 under the MAR assumption and used them to test hypotheses of independence and symmetry in the Y_1 × Y_2 table. With S+MissingData, it is also quite easy to generate imputations under MNAR models. Let R_j denote the missingness indicator for Y_j (1 = response, 2 = missing value).

Table 1. Victimization status from the National Crime Survey. Rows: victimized in first period (No, Yes, Missing); columns: victimized in second period (No, Yes, Missing). (The cell counts were not preserved in this transcription.)

In principle, any

loglinear model may be applied to the four-way table Y_1 × Y_2 × R_1 × R_2, but many of these models will be under-identified. Models that contain associations between (Y_1, Y_2) and (R_1, R_2) correspond to various hypotheses of MNAR. Perhaps the simplest MNAR model worth considering is the one containing the associations Y_1 Y_2, R_1 R_2, Y_1 R_1 and Y_2 R_2, which allows missingness at each occasion to be influenced by the response only at that occasion. That model has 8 free parameters, the maximum that can be identified from the nine observed frequencies reported in Table 1. As noted by FAY (1986), even though this model appears to be saturated, ML estimates under this model will not necessarily reproduce the observed frequencies in Table 1. Models more complicated than this one will not have unique ML estimates. It is still possible to produce MIs under more complicated models, provided that a proper prior distribution is applied to the parameters. It is probably unwise to do so, however, unless the prior distribution truly does reflect the analyst's a priori state of knowledge. Using S+MissingData, I generated m = 10 imputations under the model Y_1 Y_2, R_1 R_2, Y_1 R_1, Y_2 R_2 and then collapsed the imputed tables over R_1 and R_2. Using standard techniques, I then calculated estimates and standard errors for the log of the odds ratio α = p_11 p_22 / (p_12 p_21) and the difference δ = p_12 − p_21 from each table, where p_ij = P(Y_1 = i, Y_2 = j) (AGRESTI, 1990). The results are shown in Table 2 below. Combining the results by RUBIN's (1987) rules gives an estimate of 3.79 for the odds ratio α with a 95% interval of (2.16, 6.64), and an estimate of −0.058 for the difference δ with a 95% interval of (−0.205, 0.089). The estimates are close to those reported by SCHAFER (1997) under the MAR assumption (3.60 for α, −0.039 for δ), but the new intervals are 15% wider for α and 270% wider for δ.
Assuming MAR, the evidence for a shift in victimization rates from one occasion to the next (δ ≠ 0) was fairly strong (p = 0.06), but under the MNAR model the evidence has disappeared (p = 0.40). The S-PLUS code for creating these imputations and performing the post-imputation analyses is provided in the Appendix.

Table 2. Multiple imputations of (Y_1, Y_2) under the MNAR model, with estimates and standard errors for the log-odds ratio α and the difference δ. Columns: imputed counts for (Y_1, Y_2) = (no, no), (no, yes), (yes, no), (yes, yes); log α̂ with its SE; δ̂ with its SE. (The table entries were not fully preserved in this transcription.)
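The per-table estimates in Table 2 have simple closed forms. The sketch below computes log α̂ and δ̂ with delta-method standard errors (AGRESTI, 1990) from a 2 × 2 table of counts; the counts shown are hypothetical, since the NCS cell frequencies are not reproduced in this transcription.

```python
import math

def table_summaries(n11, n12, n21, n22):
    """Log odds ratio log(alpha) and difference delta = p12 - p21 for a 2x2
    table of counts, each with its delta-method standard error."""
    n = n11 + n12 + n21 + n22
    log_alpha = math.log((n11 * n22) / (n12 * n21))
    se_log_alpha = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    p12, p21 = n12 / n, n21 / n
    delta = p12 - p21
    se_delta = math.sqrt((p12 + p21 - (p12 - p21) ** 2) / n)  # multinomial variance
    return log_alpha, se_log_alpha, delta, se_delta

# Hypothetical counts for (Y1, Y2) = (no,no), (no,yes), (yes,no), (yes,yes).
la, se_la, d, se_d = table_summaries(40, 10, 20, 30)
print(round(la, 3), round(se_la, 3), round(d, 3), round(se_d, 3))
```

Applied to each of the m imputed tables, these (estimate, SE) pairs are exactly what RUBIN's (1987) rules combine.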

What are we to make of these results? I am not entirely sure. On the one hand, we may be tempted to impute under MNAR models routinely because it is not difficult to do so. On the other hand, I suspect that these models may, in many cases, be unnecessarily complex. In this example, I tested the joint significance of the Y_1 R_1 and Y_2 R_2 associations by a likelihood-ratio test and found no evidence for them (p = 0.98). By nature, the data can provide almost no information about these associations; we can estimate them only in a very indirect way, by assuming that other associations (e.g., Y_1 R_2) do not exist. What can we possibly achieve by adding parameters to a model that are barely identified, except to increase the width of our interval estimates by an arbitrary amount? This joint test for Y_1 R_1 and Y_2 R_2 is correctly interpreted as a test for MCAR, not a test for MAR. Nevertheless, omitting Y_1 R_1 and Y_2 R_2 results in a procedure that is valid under any MAR missingness model (a rather broad class), whereas including them produces results that are valid only under this particular MNAR model (a rather narrow class). It seems paradoxical that the inclusion of additional parameters, about which the data contain little evidence, produces a model that may in some sense be more restrictive than the model that omits them. Rather than rely heavily on poorly estimated MNAR models, I would prefer to examine auxiliary variables that may be related to missingness in Y_1 and Y_2, and include them in a richer imputation model under the assumption of MAR.

## 6 Analyses by less-than-fully parametric methods

In their quest for robustness, statisticians are fond of relaxing assumptions. In missing-data problems, however, there is an unpleasant aspect to this: once we dispense with a true likelihood function for Y_com, we must usually bid farewell to inferential procedures that are valid under general MAR conditions.
Incomplete-data procedures derived from frequentist arguments are typically valid only under MCAR (Rubin, 1976), or they require strong models for the missingness. A good example of the latter is weighting. If X_1, ..., X_n is a random sample from a density f(x), and the value X_i = x becomes missing with probability 1 − g(x), then the observed data will be sampled from f*(x) ∝ f(x)g(x), and consistent nonparametric estimates of the moments of f are available by applying weights w_i ∝ 1/g(X_i) to the observed X_i's. The problem, of course, is that g(x) is unknown and cannot be estimated from the observed data without strong assumptions or powerful auxiliary information. Therefore, statisticians may feel the need to choose between (a) fully parametric analyses that make strong assumptions about P(Y_com), and (b) semi- or non-parametric analyses that weaken the assumptions about P(Y_com) but make strong assumptions about the distribution of missingness. With MI, we can (almost) have the best of both worlds. If we create imputations Y_mis^(1), ..., Y_mis^(m) from a predictive distribution P(Y_mis | Y_obs) derived from a fully
parametric model, and then analyze the imputed datasets by a less-than-fully parametric method, we may be able to achieve better performance and greater robustness than is possible with any procedure that handles missing data and estimation in a single step. Any erroneous parametric assumptions in the imputation phase will effectively be applied only to the missing part of the dataset, not to the observed data. This type of uncongeniality was anticipated by Meng (1994), whose results suggest that these procedures tend to perform well as long as the imputation model is not grossly misspecified. Meng's (1994) theorems encompass certain types of estimation procedures but not others. For example, they do apply to certain techniques of design-based estimation commonly applied to sample surveys, but apparently they do not apply to semiparametric regression using generalized estimating equations (GEE) (Meng, 1999). Of course, although we may find it difficult to prove good performance analytically for the latter, that does not imply that good performance will not be seen in practice. Experience suggests that Bayesian MI does interact well with a variety of semi- and nonparametric estimation procedures.

Example. Consider a classic cluster sample, where y_i is the observed total and m_i is the sample size in cluster i, i = 1, ..., n. The usual estimate of the population mean is ȳ = Σ_i y_i / Σ_i m_i, with V̂(ȳ) = n Σ_i (y_i − m_i ȳ)² / [(n − 1)(Σ_i m_i)²] as its estimated variance. As noted by Skinner in his discussion of Meng (1994), these estimates appear to have no interpretation from a parametric likelihood or Bayesian standpoint under any population model. I examined the performance of Bayesian MI in conjunction with ȳ and V̂(ȳ) under a two-stage normal population, μ_i ~ N(μ, τ), y_ij ~ N(μ_i, σ²), i = 1, ..., n, j = 1, ..., m_i.
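The design-based estimator, its variance, and the Rubin (1987) combining step used in the simulation that follows can be sketched together (Python here, though the paper's own code is S-PLUS; all numerical inputs are invented, not taken from the simulation):

```python
import math

# Design-based estimates for one (completed) cluster sample:
# y[i] is the cluster total and sizes[i] the cluster sample size.
def cluster_mean(y, sizes):
    return sum(y) / sum(sizes)

def cluster_mean_var(y, sizes):
    # n * sum_i (y_i - m_i * ybar)^2 / ((n - 1) * (sum_i m_i)^2)
    n = len(y)
    ybar = cluster_mean(y, sizes)
    ss = sum((yi - mi * ybar) ** 2 for yi, mi in zip(y, sizes))
    return n * ss / ((n - 1) * sum(sizes) ** 2)

def rubin_combine(estimates, variances):
    # Rubin's (1987) rules: pool m complete-data analyses into one
    # estimate, a total variance, and reference degrees of freedom.
    m = len(estimates)
    qbar = sum(estimates) / m                 # combined point estimate
    ubar = sum(variances) / m                 # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between
    t = ubar + (1 + 1 / m) * b                # total variance
    df = (m - 1) * (1 + ubar / ((1 + 1 / m) * b)) ** 2
    return qbar, t, df

# Invented totals and sizes for one imputed dataset of 4 clusters
y = [50.0, 90.0, 40.0, 120.0]
sizes = [5, 10, 4, 11]
ybar = cluster_mean(y, sizes)        # 300 / 30 = 10.0
v = cluster_mean_var(y, sizes)

# Invented per-imputation results from five imputed datasets
qhat = [10.2, 9.8, 10.0, 10.1, 9.9]
u = [0.50, 0.52, 0.48, 0.51, 0.49]
qbar, t, df = rubin_combine(qhat, u)
# Approximate 95% interval (normal quantile; df is large here)
ci = (qbar - 1.96 * math.sqrt(t), qbar + 1.96 * math.sqrt(t))
```

The quantity (1 + 1/m)·b/t from the combining step is also the usual approximate estimate of the rate of missing information, the figure reported a little further on.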
I performed a simulation with μ = 10, τ = 5, σ² = 20, and m_i = 20 or 40 with equal probability, which produces a rather large design effect (deff ≈ 7.5). Drawing a sample of n = 50 clusters, I imposed missing values on the observations within each cluster in an MCAR fashion at a rate of 25%. I then generated five imputations of the missing values by a Bayesian procedure using a two-stage normal model and weakly informative prior distributions for the variance components. Finally, I calculated ȳ and V̂(ȳ) from each of the five imputed datasets and combined them by Rubin's (1987) rules to obtain a single estimate and 95% confidence interval for the population mean. For comparison, I also calculated ȳ and V̂(ȳ) from the complete data (before missing values were imposed) and from the observed data alone (treating the incomplete data as a sample of clusters of smaller sizes). The entire procedure was repeated 1,000 times. Over the 1,000 repetitions, the average point estimate was … using the complete data, … using the observed data alone, and … using MI, all very close to the true mean μ = 10. The variances of the estimates for the three methods were … for complete data, … for observed data, and … for MI. Interestingly, the inflation of variance from complete data to observed data, and from complete data to MI, corresponds to
rates of missing information of 3% and 5%, respectively, even though the actual observations are missing at a rate of 25%. Because the design effect in this example is strong, reducing the per-cluster sample sizes by 25% produces only a slight loss of information. Therefore, a high-quality missing-data procedure should produce interval estimates that are only slightly wider than those obtained from complete data. How did the intervals fare? With complete data, 938 (94%) of the simulated intervals covered the true mean μ = 10, and the average interval width was …. Using the observed data alone, 993 of the intervals covered μ = 10, and the average width of the intervals was 1.87, an increase of 36%. Treating the observed data in each cluster as if they were complete data from a smaller cluster produces intervals that are unnecessarily wide and inefficient. This result seems a bit odd, given the very simple nature of the missingness distribution. With MI, however, the performance of the intervals was outstanding: 948 of them covered the true mean, and their average width (1.40) was only slightly greater than that of the complete-data intervals.

7 Discussion and future directions

In the early years of MI, many thought that imputing missing values under one model and analyzing the imputed datasets under another model (or no model at all) was ludicrous and potentially harmful (Fay, 1992). We have seen, however, that in many situations of practical interest, this strategy can be quite beneficial. Many important questions still need to be addressed in the area of MNAR models. A quick look at recent issues of biostatistics journals reveals a surprisingly large number of articles on selection and pattern-mixture models, particularly for dropout in longitudinal studies. Should we fit these models simply because we can? I hope that, in the future, we will see published analyses of data under MNAR models that answer more questions than they raise.
I hope to see detailed comparisons of the performance of various MAR and MNAR models in realistic scenarios where the true nature of the complete-data population and the missingness are unknown. The performance of Bayesian MI in conjunction with popular semi- or non-parametric analyses also needs further study. For example, consider the current procedures for design-based variance estimation from stratified multistage cluster samples as implemented in SUDAAN (Research Triangle Institute, 1998). To what extent do the complexities of the sample design need to be incorporated into the imputation model? Do we need to account for each level of clustering, or will it be sufficient to incorporate only the highest level, the ultimate clusters that drive the variance estimation procedures?

Appendix

#########################################
# S-PLUS code for multiple imputation of victimization status from
# the National Crime Survey using an MNAR model
#########################################

# Enter the data
Y1 <- rep(c("Crime-Free", "Victim", NA), 3)
Y2 <- rep(c("Crime-Free", "Victim", NA), each=3)
count <- c(392, 76, 31, 55, 38, 7, 33, 9, 115)
R1 <- rep(c("Observed", "Observed", "Missing"), 3)
R2 <- rep(c("Observed", "Observed", "Missing"), each=3)
Crime <- data.frame(Y1=factor(Y1), Y2=factor(Y2), R1=factor(R1),
                    R2=factor(R2), count=count)

# Attach the S+ MissingData library
library(missing)

# Fit the loglinear model by maximum likelihood
Crime.mle <- mdLoglin(Crime,
  margins = count ~ Y1:Y2 + Y1:R1 + Y2:R2 + R1:R2, na.proc="em")

# Generate m=10 imputations using the default noninformative prior
set.seed(184)
Crime.imp <- impLoglin(Crime,
  margins = count ~ Y1:Y2 + Y1:R1 + Y2:R2 + R1:R2,
  prior = 0.5, nimpute = 10, control=list(niter=1000))

# Reduce the imputed data to Y1 x Y2 tables, collapsing over R's
Crime.imp.Y1xY2 <- miEval(oldUnclass(crosstabs(frequency ~ Y1 + Y2,
  data=Crime.imp)), vnames="Crime.imp")

# Compute estimates, SE's for log-odds ratios and combine them
# by Rubin's (1987) rules
logodds.est <- miEval(log(Crime.imp.Y1xY2[1,1]*Crime.imp.Y1xY2[2,2]/
  (Crime.imp.Y1xY2[2,1]*Crime.imp.Y1xY2[1,2])))
logodds.SE <- miEval(sqrt(sum(1/Crime.imp.Y1xY2)))
logodds.result <- miMeanSE(logodds.est, logodds.SE, df=Inf)
print(exp(logodds.result$est))
print(exp(logodds.result$est +
  c(-1,1)*qt(.975, logodds.result$df)*logodds.result$std.err))

# Do the same for the difference in proportions P(Y1=yes) - P(Y2=yes)
diff.est <- miEval((Crime.imp.Y1xY2[1,2] - Crime.imp.Y1xY2[2,1])/
  sum(Crime.imp.Y1xY2))
diff.SE <- miEval(sqrt((1/sum(Crime.imp.Y1xY2))*
  ((Crime.imp.Y1xY2[1,2]/sum(Crime.imp.Y1xY2))*
   (1 - Crime.imp.Y1xY2[1,2]/sum(Crime.imp.Y1xY2)) +
   (Crime.imp.Y1xY2[2,1]/sum(Crime.imp.Y1xY2))*
   (1 - Crime.imp.Y1xY2[2,1]/sum(Crime.imp.Y1xY2)) +
   2*(Crime.imp.Y1xY2[1,2]/sum(Crime.imp.Y1xY2))*
     (Crime.imp.Y1xY2[2,1]/sum(Crime.imp.Y1xY2)))))
diff.result <- miMeanSE(diff.est, diff.SE, df=Inf)
print(diff.result$est)
print(diff.result$est +
  c(-1,1)*qt(.975, diff.result$df)*diff.result$std.err)

# Create Table 2 for displaying the imputations and results
Table2 <- matrix(NA, 8, 10)
dimnames(Table2) <- list(c("no, no", "no, yes", "yes, no", "yes, yes",
  "logodds", "SE", "diff", "SE"), format(1:10))
for(i in 1:10){
  Table2[1:4, i] <- as.vector(t(Crime.imp.Y1xY2[[i]]))
  Table2[5, i] <- logodds.est[[i]]
  Table2[6, i] <- logodds.SE[[i]]
  Table2[7, i] <- diff.est[[i]]
  Table2[8, i] <- diff.SE[[i]]
}

# Test the joint significance of the Y1:R1 and Y2:R2 associations
Crime.mle2 <- mdLoglin(Crime, margins = count ~ Y1:Y2 + R1:R2,
  na.proc = "em")
lrtest <- 2*(Crime.mle$algorithm$likelihood -
  Crime.mle2$algorithm$likelihood)
print(1 - pchisq(lrtest, 2))

References

Agresti, A. (1990), Categorical data analysis, John Wiley and Sons, New York.
Arbuckle, J. L. and W. Wothke (1999), AMOS 4.0 User's Guide, Smallwaters, Inc., Chicago.
Collins, L. M., B. P. Flaherty, S. L. Hyatt and J. L. Schafer (1999), WinLTA User's Guide, Version 2.0, The Methodology Center, The Pennsylvania State University, University Park, PA.
Collins, L. M., J. L. Schafer and C. M. Kam (2001), A comparison of inclusive and restrictive strategies in modern missing-data procedures, Psychological Methods 6.
Fay, R. E. (1986), Causal models for patterns of nonresponse, Journal of the American Statistical Association 81.
Fay, R. E. (1992), When are inferences from multiple imputation valid?, Proceedings of the Survey Research Methods Section of the American Statistical Association.
Gelman, A., D. B. Rubin, J. Carlin and H. Stern (1995), Bayesian data analysis, Chapman and Hall, London.
Jöreskog, K. G. and D. Sörbom (2001), LISREL 8.5, Scientific Software International, Inc., Chicago.
Kenward, M. G.
(1998), Selection models for repeated measurements for nonrandom dropout: an illustration of sensitivity, Statistics in Medicine 17.
King, G., J. Honaker, A. Joseph and K. Scheve (2001), Analyzing incomplete political science data: an alternative algorithm for multiple imputation, American Political Science Review 95.
Little, R. J. A. (1993), Pattern-mixture models for multivariate incomplete data, Journal of the American Statistical Association 88.

Little, R. J. A. (1995), Modeling the dropout mechanism in repeated-measures studies, Journal of the American Statistical Association 90.
Little, R. J. A. and D. B. Rubin (1987), Statistical analysis with missing data, John Wiley and Sons, New York.
Meng, X. L. (1994), Multiple-imputation inferences with uncongenial sources of input (with discussion), Statistical Science 10.
Meng, X. L. (1999), A congenial overview and investigation of imputation inferences under uncongeniality, paper presented at the International Conference on Survey Nonresponse, Portland, October 1999.
Muthén, L. K. and B. O. Muthén (1998), Mplus User's Guide, Muthén and Muthén, Los Angeles.
Neale, M. C., S. M. Boker, G. Xie and H. H. Maes (1999), Mx: Statistical Modeling (5th ed.), Department of Psychiatry, Virginia Commonwealth University, Richmond, VA.
Raghunathan, T. E., P. W. Solenberger and J. Van Hoewyk (2000), IVEware: Imputation and Variance Estimation Software, Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI.
Research Triangle Institute (1998), SUDAAN: Software for the Statistical Analysis of Correlated Data, Version 7, Research Triangle Institute, Research Triangle Park, NC.
Robins, J. M., A. Rotnitzky and D. O. Scharfstein (1998), Semiparametric regression for repeated outcomes with non-ignorable non-response, Journal of the American Statistical Association 93.
Rubin, D. B. (1976), Inference and missing data, Biometrika 63.
Rubin, D. B. (1987), Multiple imputation for nonresponse in surveys, John Wiley and Sons, New York.
Rubin, D. B. (1996), Multiple imputation after 18+ years, Journal of the American Statistical Association 91.
Schafer, J. L. (1997), Analysis of incomplete multivariate data, Chapman and Hall, London.
Schafer, J. L.
(1999), NORM: Multiple imputation of incomplete multivariate data under a normal model, software for Windows, Department of Statistics, The Pennsylvania State University, University Park, PA.
Schimert, J., J. L. Schafer, T. Hesterberg, C. Fraley and D. Clarkson (2001), Analyzing missing values in S-PLUS, Insightful Corporation, Seattle, WA.
Statistical Solutions Inc. (2002), SOLAS for missing data analysis, Version 3, Statistical Solutions, Cork, Ireland.
Van Buuren, S. and C. G. M. Oudshoorn (1999), Flexible multivariate imputation by MICE, TNO/VGZ/PG, TNO Prevention and Health, Leiden.
Verbeke, G. and G. Molenberghs (2000), Linear mixed models for longitudinal data, Springer-Verlag, New York.
Yuan, Y. C. (2000), Multiple imputation for missing data: concepts and new development, Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference, Paper 267, SAS Institute, Cary, NC.
Zeger, S. L., K. Y. Liang and P. S. Albert (1988), Models for longitudinal data: a generalized estimating equation approach, Biometrics 44.

Received: February 2002. Revised: October 2002.


### Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out

Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out Sandra Taylor, Ph.D. IDDRC BBRD Core 23 April 2014 Objectives Baseline Adjustment Introduce approaches Guidance

### Journal Article Reporting Standards (JARS)

APPENDIX Journal Article Reporting Standards (JARS), Meta-Analysis Reporting Standards (MARS), and Flow of Participants Through Each Stage of an Experiment or Quasi-Experiment 245 Journal Article Reporting

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### CHOOSING APPROPRIATE METHODS FOR MISSING DATA IN MEDICAL RESEARCH: A DECISION ALGORITHM ON METHODS FOR MISSING DATA

CHOOSING APPROPRIATE METHODS FOR MISSING DATA IN MEDICAL RESEARCH: A DECISION ALGORITHM ON METHODS FOR MISSING DATA Hatice UENAL Institute of Epidemiology and Medical Biometry, Ulm University, Germany

### Data Cleaning and Missing Data Analysis

Data Cleaning and Missing Data Analysis Dan Merson vagabond@psu.edu India McHale imm120@psu.edu April 13, 2010 Overview Introduction to SACS What do we mean by Data Cleaning and why do we do it? The SACS

### BMJ Open. To condition or not condition? Analyzing change in longitudinal randomized controlled trials

To condition or not condition? Analyzing change in longitudinal randomized controlled trials Journal: BMJ Open Manuscript ID bmjopen-0-00 Article Type: Research Date Submitted by the Author: -Jun-0 Complete

### When Does it Make Sense to Perform a Meta-Analysis?

CHAPTER 40 When Does it Make Sense to Perform a Meta-Analysis? Introduction Are the studies similar enough to combine? Can I combine studies with different designs? How many studies are enough to carry

### Statistics Graduate Courses

Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

### Sensitivity analysis of longitudinal binary data with non-monotone missing values

Biostatistics (2004), 5, 4,pp. 531 544 doi: 10.1093/biostatistics/kxh006 Sensitivity analysis of longitudinal binary data with non-monotone missing values PASCAL MININI Laboratoire GlaxoSmithKline, UnitéMéthodologie

### Fairfield Public Schools

Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

### What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling Jeff Wooldridge NBER Summer Institute, 2007 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of Groups and

### Dr James Roger. GlaxoSmithKline & London School of Hygiene and Tropical Medicine.

American Statistical Association Biopharm Section Monthly Webinar Series: Sensitivity analyses that address missing data issues in Longitudinal studies for regulatory submission. Dr James Roger. GlaxoSmithKline

### Bayesian Approaches to Handling Missing Data

Bayesian Approaches to Handling Missing Data Nicky Best and Alexina Mason BIAS Short Course, Jan 30, 2012 Lecture 1. Introduction to Missing Data Bayesian Missing Data Course (Lecture 1) Introduction to

### Copyright 2010 The Guilford Press. Series Editor s Note

This is a chapter excerpt from Guilford Publications. Applied Missing Data Analysis, by Craig K. Enders. Copyright 2010. Series Editor s Note Missing data are a real bane to researchers across all social

### 2. Simple Linear Regression

Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

### Introduction to Regression and Data Analysis

Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities

Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Oscar Kipersztok Mathematics and Computing Technology Phantom Works, The Boeing Company P.O.Box 3707, MC: 7L-44 Seattle, WA 98124

### How to Use a Monte Carlo Study to Decide on Sample Size and Determine Power

STRUCTURAL EQUATION MODELING, 9(4), 599 620 Copyright 2002, Lawrence Erlbaum Associates, Inc. TEACHER S CORNER How to Use a Monte Carlo Study to Decide on Sample Size and Determine Power Linda K. Muthén

### Regression Modeling Strategies

Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions

### South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)

### Statistical Rules of Thumb

Statistical Rules of Thumb Second Edition Gerald van Belle University of Washington Department of Biostatistics and Department of Environmental and Occupational Health Sciences Seattle, WA WILEY AJOHN

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### IS YOUR DATA WAREHOUSE SUCCESSFUL? Developing a Data Warehouse Process that responds to the needs of the Enterprise.

IS YOUR DATA WAREHOUSE SUCCESSFUL? Developing a Data Warehouse Process that responds to the needs of the Enterprise. Peter R. Welbrock Smith-Hanley Consulting Group Philadelphia, PA ABSTRACT Developing

### Chapter 11 Introduction to Survey Sampling and Analysis Procedures

Chapter 11 Introduction to Survey Sampling and Analysis Procedures Chapter Table of Contents OVERVIEW...149 SurveySampling...150 SurveyDataAnalysis...151 DESIGN INFORMATION FOR SURVEY PROCEDURES...152

### Department of Epidemiology and Public Health Miller School of Medicine University of Miami

Department of Epidemiology and Public Health Miller School of Medicine University of Miami BST 630 (3 Credit Hours) Longitudinal and Multilevel Data Wednesday-Friday 9:00 10:15PM Course Location: CRB 995