Multiple Imputation in Multivariate Problems When the Imputation and Analysis Models Differ


Statistica Neerlandica (2003) Vol. 57, nr. 1
Joseph L. Schafer*
Department of Statistics and The Methodology Center, The Pennsylvania State University, 326 Thomas Building, University Park, PA 16802, USA
Bayesian multiple imputation (MI) has become a highly useful paradigm for handling missing values in many settings. In this paper, I compare Bayesian MI with other methods (maximum likelihood, in particular) and point out some of its unique features. One key aspect of MI, the separation of the imputation phase from the analysis phase, can be advantageous in settings where the models underlying the two phases do not agree.
Key Words and Phrases: missing data, nonresponse.
1 Fundamentals
In modern statistical practice, the occurrence of missing values is usually viewed as a random phenomenon. Let $Y_{com} = (Y_{obs}, Y_{mis})$ denote a set of complete data and $Y_{obs}$ the data actually observed. The missing data, $Y_{mis}$, denotes real or hypothetical quantities which, if available, would simplify the analysis. Our primary interest is in some aspect of the population distribution of $Y_{com}$, $P(Y_{com} \mid \theta)$, not the distribution of $Y_{obs}$, nor the process that partitions $Y_{com}$ into $Y_{obs}$ and $Y_{mis}$. Nevertheless, we suppose that the partitioning of $Y_{com}$ is encoded in a set of random variables $R$. For example, if $Y_{com}$ is a data matrix containing both observed and missing values, then $R$ could be a matrix of the same size containing 1's and 0's to show whether the corresponding elements of $Y_{com}$ are observed or missing. We will refer to $R$ as the missingness, and to $P(R \mid Y_{com}; \xi)$ as the distribution of missingness. Note that $R$ is itself completely observed. We do not posit a distribution for $R$ because we want to model it. Rather, we hope to avoid modeling $R$ because our model might not be plausible.
Reasons for missingness are often not present in $Y_{com}$, because explaining $R$ is usually not the purpose of data collection. Nevertheless, these reasons may be related to aspects of $Y_{com}$ and, by omission, may induce relationships between $R$ and $Y_{com}$. The distribution of missingness should be regarded as a mathematical device to describe the rates and patterns of missing values and to capture approximately relationships between $R$ and $Y_{com}$ in a correlational (not causal) sense. Usually, our main reason for introducing $P(R \mid Y_{com}; \xi)$ is to clarify the conditions under which we may avoid specifying it.
* This work was supported by grant 1P50DA10075, National Institute on Drug Abuse.
Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.
Elementary probability theory suggests that $P(Y_{obs} \mid \theta) = \int P(Y_{com} \mid \theta)\, dY_{mis}$ may provide a basis for inference about $\theta$, both as a sampling distribution for $Y_{obs}$ and as a likelihood function for $\theta$. But that is true only under certain conditions. Missing data are missing at random (MAR) if the distribution of missingness does not depend on $Y_{mis}$, $P(R \mid Y_{com}; \xi) = P(R \mid Y_{obs}; \xi)$ (RUBIN, 1976). They are missing completely at random (MCAR) if it does not depend on $Y_{obs}$ or $Y_{mis}$, $P(R \mid Y_{com}; \xi) = P(R; \xi)$ (LITTLE and RUBIN, 1987). Missing not at random (MNAR) refers to any violation of MAR. RUBIN (1976) showed that $P(Y_{obs} \mid \theta)$ provides a correct sampling distribution for frequentist inference about $\theta$ under MCAR, and a correct likelihood function for likelihood/Bayes inference under the weaker condition of MAR. (To be precise, one also needs the parameters of the missingness distribution, $\xi$, to be distinct from or independent of $\theta$; see LITTLE and RUBIN, 1987, for details.)
That models for $R$ may be avoided more often in a likelihood/Bayesian mode than in a frequentist mode suggests that missing-data procedures based on likelihood may be more useful than those motivated by frequentist arguments alone. For the most part, I believe that is true. Many of the older data-editing procedures (e.g. listwise deletion) are based on frequentist arguments and are generally valid only under the strong and often implausible assumption of MCAR. Even if MCAR does hold, these procedures may be inefficient.
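The practical difference between MCAR, MAR and MNAR can be illustrated with a small simulation. The sketch below is not from the paper: the function name `compare_mechanisms`, the logistic response probabilities and the correlation of 0.6 are all assumptions chosen for illustration. It generates a complete variable $Y_1$ and a partly missing variable $Y_2$ with true mean 0, then estimates $E(Y_2)$ in two ways: from the complete cases only, and from a regression of $Y_2$ on $Y_1$ that ignores $R$ (which corresponds to the ML estimate under a bivariate normal model).

```python
import math
import random


def compare_mechanisms(n=20000, rho=0.6, seed=0):
    """Estimate E(Y2) (true value 0) when Y2 is partly missing and Y1 is complete.

    Returns {mechanism: (complete_case_mean, regression_based_mean)}.
    All settings are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    y1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y2 = [rho * a + math.sqrt(1.0 - rho**2) * rng.gauss(0.0, 1.0) for a in y1]
    u = [rng.random() for _ in range(n)]

    def logistic(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Probability that Y2 is observed, as a function of (y1, y2).
    mechanisms = {
        "MCAR": lambda a, b: 0.5,                # depends on nothing
        "MAR": lambda a, b: logistic(2.0 * a),   # depends only on observed Y1
        "MNAR": lambda a, b: logistic(2.0 * b),  # depends on Y2 itself
    }

    mean_y1_all = sum(y1) / n
    results = {}
    for name, prob in mechanisms.items():
        # Keep the pairs for which Y2 is observed under this mechanism.
        obs = [(a, b) for a, b, ui in zip(y1, y2, u) if ui < prob(a, b)]
        k = len(obs)
        m1 = sum(a for a, _ in obs) / k
        m2 = sum(b for _, b in obs) / k  # complete-case mean of Y2
        sxy = sum((a - m1) * (b - m2) for a, b in obs)
        sxx = sum((a - m1) ** 2 for a, _ in obs)
        slope = sxy / sxx
        # Regression estimate: fitted line evaluated at the mean of ALL Y1 values.
        results[name] = (m2, m2 + slope * (mean_y1_all - m1))
    return results


if __name__ == "__main__":
    for mechanism, (cc, reg) in compare_mechanisms().items():
        print(f"{mechanism}: complete-case {cc:+.3f}, regression-based {reg:+.3f}")
```

In runs of this sketch, the complete-case mean should be close to 0 only under MCAR, the regression-based estimate should be close to 0 under both MCAR and MAR, and neither estimate should recover the true mean under this MNAR mechanism.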
Methods for analyzing data without a true likelihood function (marginal modeling based on generalized estimating equations, for example) can handle missing values only if they are MCAR (ZEGER, LIANG and ALBERT, 1988) or if one specifies a correct model for $R$ (ROBINS, ROTNITZKY and SCHARFSTEIN, 1998). As a general principle, I believe that an analyst's time and effort are better spent building an intelligent model for the data rather than modeling the missingness, unless departures from MAR are suspected to be very serious.
Consider a multivariate problem where we collect items $Y_j$, $j = 1, \ldots, p$ for subjects $i = 1, \ldots, n$, so that $Y_{com}$ is an $n \times p$ data matrix, but portions of $Y_{com}$ are missing for reasons beyond our control. If $Y_1, \ldots, Y_p$ represent repeated measures of the same variable at different occasions, and if subjects who drop out do not return (i.e., if $Y_j$ is missing then $Y_{j+1}, \ldots, Y_p$ are missing as well), then MAR is easy to understand: it means that a subject's probability of dropping out at occasion $j$, given that he has not yet dropped out, may depend on previous responses, $Y_1, \ldots, Y_{j-1}$, but not on the present or future, $Y_j, \ldots, Y_p$. In more general multivariate settings, however, MAR is not very intuitive; it means that a subject's probabilities of responding to $Y_1, \ldots, Y_p$ may depend only on his or her own set of observed items, a set that changes from one subject to the next. One could say that this condition seems odd or unnatural. However, the apparent awkwardness of MAR does not imply that assuming MAR is unwise. COLLINS, SCHAFER and KAM (2001) demonstrated that, with continuous data, an erroneous assumption of MAR (failing to take into account a cause or correlate of missingness) may have only a minor impact on estimates and standard errors unless the relationships between the omitted cause or correlate of missingness and the outcomes are unusually strong ($\rho$ much greater than 0.5). In typical social science applications, I believe that such strong correlations between causes of missingness and outcomes are the exception rather than the rule, and assuming MAR will probably not lead us far astray.
When MNAR is a serious concern (e.g., in a clinical trial where subjects drop out if the treatment is not working) it may be necessary to jointly model $Y_{com}$ and $R$, but such models must be based on other unverifiable assumptions. Methods for joint modeling of $Y_{com}$ and $R$ in longitudinal studies with dropout are reviewed by LITTLE (1995) and by VERBEKE and MOLENBERGHS (2000).
2 Bayesian multiple imputation
Inferences about $\theta$ may proceed from a likelihood based on $P(Y_{obs} \mid \theta)$ under MAR or from a joint model for $Y_{com}$ and $R$ under MNAR. Procedures for ML estimation in multivariate missing-data problems are becoming widely available. Programs for fitting linear mixed models to longitudinal data (e.g., PROC MIXED in SAS) allow missing responses. ML for latent-variable models with incomplete data is found in Amos (ARBUCKLE and WOTHKE, 1999), Mx (NEALE et al., 1999), LISREL 8.5 (JÖRESKOG and SÖRBOM, 2001), LTA (COLLINS et al., 1999) and Mplus (MUTHÉN and MUTHÉN, 1998). All of these programs assume that the missing values are MAR.
A useful alternative to direct likelihood methods is given by multiple imputation (MI) (RUBIN, 1987). Suppose that $Q$ is some aspect of the distribution of $Y_{com}$ to be estimated, and that an estimate $\hat{Q}$ and standard error $\sqrt{U}$ could be easily calculated if $Y_{mis}$ were available.
In MI, $Y_{mis}$ is replaced by $m > 1$ simulated versions, $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(m)}$, resulting in $m$ estimates and standard errors, $(\hat{Q}_j, \sqrt{U_j})$, $j = 1, \ldots, m$. From RUBIN (1987), an overall estimate for $Q$ is $\bar{Q} = m^{-1} \sum_{j=1}^{m} \hat{Q}_j$ with a standard error of $\sqrt{T}$, where $T = \bar{U} + (1 + m^{-1})B$, $\bar{U} = m^{-1} \sum_{j=1}^{m} U_j$, and $B = (m-1)^{-1} \sum_{j=1}^{m} (\hat{Q}_j - \bar{Q})^2$. If $(\hat{Q} - Q)/\sqrt{U}$ is approximately $N(0, 1)$ with complete data, then $(\bar{Q} - Q)/\sqrt{T} \sim t_{\nu}$ provides tests and intervals for $Q$, where $\nu = (m-1)(1 + r^{-1})^2$ and $r = (1 + m^{-1})B/\bar{U}$.
Where do the imputations $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(m)}$ come from? They are drawn from a Bayesian posterior predictive distribution for $Y_{mis}$ given $Y_{obs}$. The model for complete data, $P(Y_{com}; \theta)$, implies a conditional distribution for the missing values given the observed ones, $P(Y_{mis} \mid Y_{obs}; \theta)$. We could use this distribution for imputation if $\theta$ were known. But because $\theta$ is unknown, we must generate $m$ independent random draws $\theta^{(1)}, \ldots, \theta^{(m)}$ from a Bayesian posterior distribution given $Y_{obs}$ and $R$, which depends only on $Y_{obs}$ if the missing data are MAR. Once the random parameters are drawn, the imputations follow as $Y_{mis}^{(j)} \sim P(Y_{mis} \mid Y_{obs}; \theta^{(j)})$, $j = 1, \ldots, m$. In most applications, the number of imputations does not need to be large; $m = 5$ is often enough.
Computational strategies for creating MIs in multivariate settings are reviewed by SCHAFER (1997). Programs for creating MIs include NORM (SCHAFER, 1999), the S+MissingData library in S-PLUS (SCHIMERT et al., 2001), SAS PROC MI (YUAN, 2000), Amelia (KING et al., 2001), SOLAS (STATISTICAL SOLUTIONS, 2002), MICE (VAN BUUREN and OUDSHOORN, 1999), and IVEware (RAGHUNATHAN, SOLENBERGER and VAN HOEWYK, 2000). An excellent resource for information about these and other programs for MI is the website […]. These programs assume that the missing data are MAR. However, in some cases, we can also use them to generate imputations under certain MNAR models by creating variables out of the missing-data pattern $R$ and treating these variables as additional covariates; this idea will be explored in Section 5 below.
It is easy to show that, when the estimand $Q$ is a function of $\theta$, MI provides approximate Bayesian inferences for $Q$ (SCHAFER, 1997, Sec. […]). Therefore, with a large sample and a diffuse prior distribution, MI and direct likelihood methods produce similar answers. From this standpoint, MI seems to be an unnecessarily circuitous way to summarize the likelihood for $\theta$. Why is MI worth considering at all?
One reason is that, unlike ML, MI separates the inference into two phases: the imputation phase, in which $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(m)}$ are created, and the analysis phase, in which the $(\hat{Q}_j, \sqrt{U_j})$, $j = 1, \ldots, m$ are calculated and combined. Because the phases are distinct, imputation and analysis may be carried out on different occasions by different persons. For example, the National Center for Health Statistics in the United States has recently released a multiply-imputed version of the Third National Health and Nutrition Examination Survey (NHANES III). Special techniques and software were used to create the imputations, but now that they exist, the imputed data files may be analyzed in a straightforward manner by anyone familiar with standard techniques of survey data analysis.
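The combining step of the analysis phase, RUBIN's (1987) rules from earlier in this section, can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function name `pool_rubin` and its interface are assumptions, and the inputs are the $m$ completed-data estimates $\hat{Q}_j$ and their squared standard errors $U_j$.

```python
import math


def pool_rubin(estimates, variances):
    """Combine m completed-data results by Rubin's (1987) rules.

    estimates : the point estimates Q_hat_j, one per imputed dataset
    variances : the squared standard errors U_j, one per imputed dataset
    Returns (Q_bar, sqrt(T), nu): the pooled estimate, its standard error,
    and the degrees of freedom of the reference t distribution.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m > 1 estimates and one variance per estimate")

    q_bar = sum(estimates) / m                               # overall estimate
    u_bar = sum(variances) / m                               # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = u_bar + (1.0 + 1.0 / m) * b                          # total variance
    r = (1.0 + 1.0 / m) * b / u_bar                          # relative increase in variance
    # If the estimates are identical (b == 0), the t reference reduces to a normal.
    nu = (m - 1) * (1.0 + 1.0 / r) ** 2 if r > 0 else math.inf
    return q_bar, math.sqrt(t), nu
```

An interval for $Q$ then takes the form $\bar{Q} \pm t_{\nu, 1-\alpha/2}\sqrt{T}$, with the quantile taken from any $t$-distribution routine.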
The NHANES III Multiply Imputed Data Set is available at nhanes/nh3data.htm.
Another advantage of separating imputation from the analysis (and the primary topic of the rest of this paper) is that the imputation and analysis may be carried out under different models. For example, it is a relatively simple matter to include extra variables in an imputation procedure but ignore them in later analyses. Properties of MI when the imputation and analysis models differ have been explored from a theoretical standpoint by MENG (1994) and RUBIN (1996), and from a practical standpoint by COLLINS, SCHAFER and KAM (2001). Discrepancies between the models are not necessarily harmful and can often be advantageous.
3 Comparing the results from MI and ML
To understand what happens when the imputation and analysis models differ, it helps to compare first the results of an ML procedure applied to $Y_{obs}$ alone (which uses a single model) to an analysis based on MI (which uses two models). This discussion, which is summarized from COLLINS, SCHAFER and KAM (2001), assumes that the model for the complete-data population $P(Y_{com}; \theta)$ used in the ML analysis is the same model used to obtain estimates and standard errors $(\hat{Q}_j, \sqrt{U_j})$, $j = 1, \ldots, m$ after MI. Without this assumption, there is no guarantee that the same population parameters are being estimated under the two methods. At present, we will also limit our attention to situations where the user of ML and the imputer both assume that the missing values are MAR, so that neither one is specifying a distribution for the missingness. This does not mean that the user of ML and the imputer are making identical assumptions about the missing data, because their models may still be different. For example, the imputer may use additional variables that do not appear in the analyses, or he may specify different intervariable relationships (e.g. a structured covariance matrix versus an unstructured one).
Proposition 1. If the user of the ML procedure and the imputer use the same set of input data (same variables and observational units); if their models apply equivalent distributional assumptions to the variables and the relationships among them; if the sample size is large; and if the number of imputations is sufficiently large; then the results from the ML and MI procedures will be essentially identical.
Under these conditions, MI approximates a Bayesian analysis under the same model used in the ML procedure, and the asymptotic equivalence between Bayesian and likelihood-based procedures is well known (GELMAN et al., 1995). With large samples the effect of a prior distribution diminishes, and Bayesian and ML analyses produce similar results, especially when the prior is diffuse. For example, suppose that we want to regress a single variable $Y$ on a set of predictors $X_1, \ldots, X_p$ containing missing values.
We could do this in AMOS by specifying the model with a path from each predictor $X_j$ to $Y$, an error term for $Y$ that is uncorrelated with the predictors, and arbitrary correlations among the $X_j$'s. If the sample size is large, the parameter estimates and standard errors from AMOS will be essentially identical to those that would be obtained if the missing values were multiply imputed $m$ times using NORM, the imputed datasets were analyzed using conventional regression software, and the results combined according to RUBIN's (1987) rules.
When MI is carried out under these conditions, the imputer's and analyst's procedures are said to be congenial (MENG, 1994). Under congeniality, there is no major theoretical reason to prefer ML estimates to MI or vice versa, because the properties of the two methods are very similar. In many real-world applications of MI, however, the imputer's and analyst's models are uncongenial. Some types of uncongeniality are clearly harmful; for example, an imputer might create MIs under a model that is grossly misspecified. But MENG (1994) and RUBIN (1996) have demonstrated that uncongeniality can also be beneficial, particularly when the imputer can take advantage of extra information unavailable or unused by the analyst. My next two propositions describe different kinds of uncongeniality; the first is rather benign, whereas the second may help or harm.
Proposition 2. If the user of the ML procedure and the imputer use the same set of input data (same variables and observational units) but the imputer's model assumes a more general distributional form than the analyst's model, then ML and MI tend to produce similar parameter estimates, but the standard errors from MI may be larger.
For example, suppose that $Y_1, \ldots, Y_p$ are repeated measures of a variable over time, and missing values arise from dropout. Researcher A uses SAS PROC MIXED to fit a linear model with random intercepts and time slopes for each individual, and then manipulates the fixed effects to estimate $E(Y_p)$, the population mean response at the final occasion. Researcher B multiply imputes the missing responses using NORM, calculates the sample mean of $Y_p$ in each imputed dataset, and combines the results by RUBIN's (1987) rules. The two researchers will obtain similar estimates of $E(Y_p)$, but A's standard errors may be slightly smaller than B's, because PROC MIXED assumes a patterned covariance structure for $Y_1, \ldots, Y_p$ whereas NORM applies an unstructured model. In my experience with real data, the increase in standard errors that arises when the imputation model is more general than the analysis model is often barely noticeable.
Proposition 3. If the user of the ML procedure and the imputer use the same sample of units but a different set of variables, then the results from ML and MI could be quite different, even though the ML user's model is equivalent to the model used by the analyst of the imputed dataset.
This can easily happen in practice. For example, suppose that Researcher A creates MIs for a set of variables $Y_1, \ldots, Y_p$ using NORM, but also includes in the imputation procedure some additional covariates $Z_1, \ldots, Z_q$.
After imputation, he ignores $Z_1, \ldots, Z_q$, fits a covariance structure model to $Y_1, \ldots, Y_p$ with LISREL using each of the imputed datasets and combines the results by RUBIN's (1987) rules. Researcher B fits the same covariance structure model directly to $Y_1, \ldots, Y_p$ without imputation using AMOS and finds that her results differ from A's. Many of her parameter estimates are similar to A's, but others are different enough to cause worry; some of her standard errors are similar, some are smaller and others are larger. In this example, discrepancies arise not because of inherent differences between MI and ML but because A included a set of auxiliary variables whereas B did not. If B had figured out a way to include the auxiliary variables in AMOS without altering the marginal model for $Y_1, \ldots, Y_p$, then her results would have been essentially identical to A's.
From this discussion, it is clear that nontrivial differences between ML and MI can arise when the imputation and analysis models are uncongenial. In the remainder of this paper, I present situations where uncongeniality can be used to our benefit as part of an intelligent missing-data strategy.
4 Using auxiliary variables in MI
Let $Y_1, \ldots, Y_p$ denote the key items or variables that we wish to analyze. If missing values occur in these variables, there may be other completely or partially observed variables $Z_1, \ldots, Z_q$ which are not inherently of interest, but which may potentially contain useful information for predicting the missing values. If so, then we may include $Z_1, \ldots, Z_q$ in a multiple imputation procedure but exclude them from the subsequent analysis. When is it beneficial to use $Z_1, \ldots, Z_q$ in this fashion? Are there potential dangers in doing so? COLLINS, SCHAFER and KAM (2001) (henceforth CSK) explore these questions in depth using simulation. Without going into details, I will attempt to explain and interpret our major findings.
CSK classify auxiliary variables $Z_1, \ldots, Z_q$ into three types. Type A variables are correlated with the outcomes $Y_1, \ldots, Y_p$ and may also help to explain why $Y_1, \ldots, Y_p$ are sometimes missing; that is, they are related to missingness. Type B variables are correlated with the outcomes $Y_1, \ldots, Y_p$ but unrelated to missingness. Type C variables are unrelated to $Y_1, \ldots, Y_p$. Suppose that we plan to multiply impute the missing values in $Y_1, \ldots, Y_p$ using one of the currently available MI programs (e.g. NORM) which assumes MAR. Because Type A variables are related to missingness, MAR is violated if they are excluded from the imputation procedure; therefore, including them may help to reduce bias. Type B variables will not reduce bias under MAR, but they may increase precision of the final parameter estimates because they contain useful information for predicting the missing values of $Y_1, \ldots, Y_p$. (Under MNAR conditions, Type B variables may help to reduce bias.) Type C variables will neither reduce bias nor increase precision; rather, they may reduce precision because they make the imputation model unnecessarily large and complicated.
In one set of simulations, CSK show that the biases incurred by failing to include Type A variables in the imputation procedure are not as serious as some have previously thought. Including or excluding a Type A variable $Z_k$ made little difference unless the correlation between $Z_k$ and an outcome $Y_j$ was unusually strong (much greater than 0.4) and the rate of missing values in $Y_j$ was high (50% or more). With weaker correlations and lower rates of missing values, biases in parameters pertaining to $Y_j$ and its relationships to other variables were barely noticeable when $Z_k$ was excluded. In another set of simulations, CSK show that including a Type B variable can substantially increase precision under MAR conditions if its correlation with the outcomes is strong (say, 0.9). This situation could easily arise in longitudinal studies, because repeated measures on individuals over time are highly correlated; responses at one occasion may be very useful for imputing missing responses at another. Under MNAR conditions, where the probability that $Y_j$ is missing depends directly on $Y_j$, an auxiliary variable $Z_k$ that is highly correlated with $Y_j$ will tend to reduce bias but will not eliminate it entirely. Finally, CSK show that the costs of unnecessarily including Type C variables in the imputation procedure tend to be minimal. Overall, there are potentially important gains and small risks associated with auxiliary variables in MI. Therefore, we suggest that users of MI be quite liberal in deciding whether to include extra variables in the imputation procedure, even when these variables are not likely to appear in subsequent analyses.
Agencies that collect and distribute data to the public for secondary analyses often have access to extra information (e.g., finely detailed geographic identifiers) that is not released for reasons of confidentiality, but which may be highly predictive of missing values. If the agency uses this additional information in an imputation procedure, the imputed data files released to the public will produce more efficient estimates than are possible by any missing-data procedure the user can implement himself. This is an example of the phenomenon known as superefficiency (MENG, 1994; RUBIN, 1996).
LITTLE and YAO (1996) describe an innovative use of MI with auxiliary variables for reducing bias in the estimation of an intent-to-treat (ITT) effect. In many randomized experiments, some subjects do not adhere to the treatment regimen to which they have been assigned. If this noncompliance is also accompanied by dropout, then the traditional ITT analysis (which compares subjects who completed the study on the basis of the group to which they were assigned, ignoring the treatment actually received) may produce a biased estimate of the actual ITT effect. The reason for this bias is that the rates of dropout are often related to the treatment actually received. In these settings, the treatment actually received provides useful information for imputing the missing values, but after imputation it is the treatment assigned (not treatment received) that is used to estimate the ITT effect.
In fairness, one should note that in principle it is also possible to include auxiliary variables in a likelihood-based procedure, so that the user of ML may derive the same benefits from them as the user of MI. In practice, however, this may be tricky. Consider a longitudinal study with dropout. Suppose that we have access to a covariate which, although it is not of substantive interest, may be highly predictive of dropping out. For example, at each occasion, we might ask, "How likely is it that you will remain in this study for at least one more assessment?" (1 = unlikely, 2 = somewhat likely, 3 = very likely). Incorporating this variable into a longitudinal analysis might convert a seriously MNAR situation to one that is effectively MAR. However, if we simply include this variable in our longitudinal regression model as a time-varying covariate, we have substantially changed the meaning and interpretation of the model, making it more difficult to estimate the marginal treatment effects of interest. The correct way to add this variable to the model is to make it an additional response, jointly modeling this variable and the outcome of interest as a bivariate function of treatment group and time. Such models tend to be difficult to fit with current software for longitudinal analysis.
Note that the advice given about auxiliary variables (that including them in an imputation procedure has potential benefits but few risks) does not apply to variables that are functions or summaries of the missingness $R$. For example, suppose that $Y_1, \ldots, Y_p$ are repeated measures in a longitudinal study with dropout, and we notice that subjects who drop out early tend to have different response trajectories from those who drop out later or not at all. It might seem reasonable to create a variable $Z$ equal to the number of occasions for which the subject remains in the study $(1, 2, \ldots, p)$ and include it in the imputation model. Because this new variable is a function of $R$, including it in an MI procedure will produce imputations that are consistent with a particular MNAR model. Without $Z$, an MI procedure that assumes MAR will produce correct results under any MAR situation. But once $Z$ has been included, the same procedure may produce biased results under many MAR mechanisms. In fact, the results may be nonsensical because, unless care is taken, these summaries of $R$ may introduce parameters into the imputation model that cannot be identified from the observed data. (In this particular example, the correlation between $Z$ and $Y_p$ is not identified, because $Y_p$ is observed if and only if $Z = p$.) MNAR models are an important and potentially useful application for MI, but the user should be fully aware of the implications of these models and the special challenges they pose.
5 Imputation under MNAR models
When serious departures from MAR are suspected, it may be necessary to investigate alternative ways to jointly model the data and the missingness. Selection models specify a marginal distribution for the complete data, $P(Y_{com})$, and a conditional distribution for the missingness, $P(R \mid Y_{com})$. Selection models have intuitive appeal, but their results can be highly sensitive to unverifiable assumptions about the shape of the complete-data population (LITTLE and RUBIN, 1987, Chapter 11; KENWARD, 1998). Pattern-mixture models posit a marginal distribution for the patterns of missingness, $P(R)$, followed by a conditional model for the data distribution within patterns, $P(Y_{com} \mid R)$ (LITTLE, 1993). In pattern-mixture models, some unverifiable assumptions must inevitably be made to identify all the parameters in $P(Y_{com} \mid R)$.
These assumptions are no less onerous than those made by selection models, but in one sense they are more honest because the parameters that cannot be estimated from the observed data are made explicit. Pattern-mixture models for longitudinal data with dropout are reviewed by LITTLE (1995) and VERBEKE and MOLENBERGHS (2000).
One mildly unfortunate aspect of pattern-mixture models is that the effects of scientific interest are usually parameters of the marginal distribution of the complete data, not its conditional distribution within patterns. Therefore, when fitting these models, one must somehow manipulate the estimates from $P(R)$ and $P(Y_{com} \mid R)$ to obtain the desired estimates for $P(Y_{com}) = \sum_R P(R)\, P(Y_{com} \mid R)$. Alternatively, this process of averaging the results across patterns may be carried out by MI. Suppose that we generate imputations $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(m)}$ under a pattern-mixture model. Once these imputations exist, we may forget about $R$ and use the imputed datasets to estimate the parameters of $P(Y_{com})$ directly. In many cases, the model applied to $Y_{com}$ in this analysis phase may deviate from the implied model $P(Y_{com}) = \sum_R P(R)\, P(Y_{com} \mid R)$ of the imputation phase, resulting in a mild form of uncongeniality. In practice, of course, neither of these models will be true, and the observed data probably contain little information that would allow us to distinguish one from the other.
MI may also be used with selection models. In fact, any manner of specifying a joint model for $Y_{com}$ and $R$ may be used, as long as it is possible to sample from the posterior predictive distribution $P(Y_{mis} \mid Y_{obs}, R)$ under the model. Once $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(m)}$ have been created, further modeling of the missingness is unnecessary, and the analysis of the imputed datasets may proceed as usual. For this reason, MI seems to be an ideal device for sensitivity analyses. If imputations are generated under a variety of alternative models, the imputed datasets may be analyzed in the same manner and compared directly, without having to worry about the fact that the form of $P(Y_{com})$ and the meaning of its parameters may vary from one imputation model to the next.
MI under a variety of MNAR models is possible with current software. Any of the multivariate models implemented in the S+MissingData library (SCHIMERT et al., 2001), which apply to continuous and categorical variables, may be jointly applied to a set of outcomes $Y_1, \ldots, Y_p$ and to summaries of the missingness $R$. In most cases, some aspect of this joint distribution will not be identifiable from the observed data, and successful use of the routines may require omitting certain intervariable relationships, use of an informative prior distribution, or both.
Example. SCHAFER (1997) previously analyzed incomplete data from the National Crime Survey conducted by the U.S. Bureau of the Census. Residents in a sample of housing units were interviewed to determine whether they had been victimized by crime in the preceding half-year. Six months later, the same units were visited again and the residents were asked the same question.
Missing values occurred at both occasions. The data are shown in Table 1. Let $Y_j$ denote victimization status (1 = no, 2 = yes) at occasions $j = 1, 2$. SCHAFER (1997, Section 7.3.4) generated $m = 10$ imputations of the missing values of $Y_1$ and $Y_2$ under the MAR assumption and used them to test hypotheses of independence and symmetry in the $Y_1 \times Y_2$ table.
Table 1. Victimization status from the National Crime Survey. Rows: victimized in first period? (No, Yes, Missing). Columns: victimized in second period? (No, Yes, Missing). [The nine cell counts are not preserved in this transcription.]
With S+MissingData, it is also quite easy to generate imputations under MNAR models. Let $R_j$ denote the missingness indicator for $Y_j$ (1 = response, 2 = missing value). In principle, any loglinear model may be applied to the four-way table $Y_1 \times Y_2 \times R_1 \times R_2$, but many of these models will be underidentified. Models that contain associations between $(Y_1, Y_2)$ and $(R_1, R_2)$ correspond to various hypotheses of MNAR. Perhaps the simplest MNAR model worth considering is the one containing the associations $Y_1 Y_2$, $R_1 R_2$, $Y_1 R_1$ and $Y_2 R_2$, which allows missingness at each occasion to be influenced by the response only at that occasion. That model has 8 free parameters, the maximum that can be identified from the nine observed frequencies reported in Table 1. As noted by FAY (1986), even though this model appears to be saturated, ML estimates under this model will not necessarily reproduce the observed frequencies in Table 1. Models more complicated than this one will not have unique ML estimates. It is still possible to produce MIs under more complicated models, provided that a proper prior distribution is applied to the parameters. It is probably unwise to do so, however, unless the prior distribution truly does reflect the analyst's a priori state of knowledge.
Using S+MissingData, I generated $m = 10$ imputations under the model $Y_1 Y_2$, $R_1 R_2$, $Y_1 R_1$, $Y_2 R_2$ and then collapsed the imputed tables over $R_1$ and $R_2$. Using standard techniques, I then calculated estimates and standard errors for the log of the odds ratio $\alpha = \pi_{11} \pi_{22} \pi_{12}^{-1} \pi_{21}^{-1}$ and the difference $\delta = \pi_{12} - \pi_{21}$ from each table, where $\pi_{ij} = P(Y_1 = i, Y_2 = j)$ (AGRESTI, 1990). The results are shown in Table 2 below. Combining the results by RUBIN's (1987) rules gives an estimate of 3.79 for the odds ratio $\alpha$ with a 95% interval of (2.16, 6.64), and an estimate of $-0.058$ for the difference $\delta$ with a 95% interval of $(-0.205, 0.089)$. The estimates are close to those reported by SCHAFER (1997) under the MAR assumption (3.60 for $\alpha$, $-0.039$ for $\delta$), but the new intervals are 15% wider for $\alpha$ and 270% wider for $\delta$.
Assuming MAR, the evidence for a shift in victimization rates from one occasion to the next ($\delta \neq 0$) was fairly strong ($p = 0.06$), but under the MNAR model the evidence has disappeared ($p = 0.40$). The S-PLUS code for creating these imputations and performing the post-imputation analyses is provided in the Appendix.

Table 2. Multiple imputations of $(Y_1, Y_2)$ under MNAR model, with estimates and standard errors for the log-odds ratio $\alpha$ and difference $\delta$. Rows: imputed counts for (no, no), (no, yes), (yes, no), (yes, yes); $\log\hat{\alpha}$; SE; $\hat{\delta}$; SE. Columns: imputations 1 to 10. Surviving $\hat{\delta}$ entries (column positions not recoverable): $-0.020$, $-0.037$, $-0.063$, $-0.144$, [...], $-0.142$, $-0.060$, $-0.104$.
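The per-table quantities summarized in Table 2 follow from standard large-sample formulas. As a rough Python sketch (the function name `table_estimates` is hypothetical, and the formulas mirror those in the Appendix code), the example below computes the log-odds ratio, the difference $\pi_{12} - \pi_{21}$ and their standard errors from a single 2 x 2 table. For illustration it uses only the four fully observed cells from the Appendix data (392, 55, 76, 38), which form a complete-case table, not one of the imputed tables.

```python
import math


def table_estimates(n11, n12, n21, n22):
    """Log-odds ratio and difference pi12 - pi21, with standard errors,
    from a 2 x 2 table of counts (rows: Y1, columns: Y2)."""
    n = n11 + n12 + n21 + n22
    log_or = math.log(n11 * n22 / (n12 * n21))
    se_log_or = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    p12, p21 = n12 / n, n21 / n
    diff = p12 - p21
    # Multinomial variance of the difference of two cell proportions
    se_diff = math.sqrt((p12 * (1 - p12) + p21 * (1 - p21) + 2 * p12 * p21) / n)
    return log_or, se_log_or, diff, se_diff


# Complete cases only (Y1 = no/yes by Y2 = no/yes), taken from the Appendix counts.
log_or, se_log_or, diff, se_diff = table_estimates(392, 55, 76, 38)
print(round(log_or, 3), round(se_log_or, 3), round(diff, 3), round(se_diff, 3))
```

In the multiple-imputation analysis, the same four numbers would be computed once per imputed table and then combined across the ten tables by Rubin's rules.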
What are we to make of these results? I am not entirely sure. On the one hand, we may be tempted to impute under MNAR models routinely because it is not difficult to do so. On the other hand, I suspect that these models may, in many cases, be unnecessarily complex. In this example, I tested the joint significance of the $Y_1R_1$ and $Y_2R_2$ associations by a likelihood-ratio test and found no evidence for them ($p = 0.98$). By nature, the data can provide almost no information about these associations; we can estimate them only in a very indirect way, by assuming that other associations (e.g., $Y_1R_2$) do not exist. What can we possibly achieve by adding parameters to a model that are barely identified, except to increase the width of our interval estimates by an arbitrary amount?

This joint test for $Y_1R_1$ and $Y_2R_2$ is correctly interpreted as a test for MCAR, not a test for MAR. Nevertheless, omitting $Y_1R_1$ and $Y_2R_2$ results in a procedure that is valid under any MAR missingness model (a rather broad class), whereas including them produces results that are valid only under this particular MNAR model (a rather narrow class). It seems paradoxical that the inclusion of additional parameters, about which the data contain little evidence, produces a model that may in some sense be more restrictive than the model that omits them. Rather than rely heavily on poorly estimated MNAR models, I would prefer to examine auxiliary variables that may be related to missingness in $Y_1$ and $Y_2$, and include them in a richer imputation model under the assumption of MAR.

6 Analyses by less-than-fully parametric methods

In their quest for robustness, statisticians are fond of relaxing assumptions. In missing-data problems, however, there is an unpleasant aspect to this: once we dispense with a true likelihood function for $Y_{com}$, we must usually bid farewell to inferential procedures that are valid under general MAR conditions.
Incomplete-data procedures derived from frequentist arguments are typically valid only under MCAR (RUBIN, 1976) or they require strong models for the missingness. A good example of the latter is weighting. If $X_1, \ldots, X_n$ is a random sample from a density $f(x)$, and the value $X_i = x$ becomes missing with probability $1 - g(x)$, then the observed data will be sampled from $f^*(x) \propto f(x)g(x)$, and consistent nonparametric estimates of moments of $f$ are available by applying weights $w_i \propto g^{-1}(X_i)$ to the observed $X_i$'s. The problem, of course, is that $g(x)$ is unknown and cannot be estimated from the observed data without strong assumptions or powerful auxiliary information. Therefore, statisticians may feel the need to choose between (a) fully parametric analyses that make strong assumptions about $P(Y_{com})$, and (b) semi- or nonparametric analyses that weaken the assumptions about $P(Y_{com})$ but make strong assumptions about the distribution of missingness.

With MI, we can (almost) have the best of both worlds. If we create imputations $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(m)}$ from a predictive distribution $P(Y_{mis} \mid Y_{obs})$ derived from a fully parametric model, and then analyze the imputed datasets by a less-than-fully parametric method, we may be able to achieve better performance and greater robustness than is possible with any procedure that handles missing data and estimation in a single step. Any erroneous parametric assumptions in the imputation phase will effectively be applied only to the missing part of the dataset, not to the observed data. This type of uncongeniality was anticipated by MENG (1994), whose results suggest that these procedures tend to perform well as long as the imputation model is not grossly misspecified.

MENG's (1994) theorems encompass certain types of estimation procedures but not others. For example, they do apply to certain techniques of design-based estimation commonly applied to sample surveys, but apparently they do not apply to semiparametric regression using generalized estimating equations (GEE) (MENG, 1999). Of course, although we may find it difficult to prove good performance analytically for the latter, that does not imply that good performance will not be seen in practice. Experience suggests that Bayesian MI does interact well with a variety of semi- and nonparametric estimation procedures.

Example. Consider a classic cluster sample, where $y_i$ is the observed total and $m_i$ is the sample size in cluster $i$, $i = 1, \ldots, n$. The usual estimate of the population mean is $\bar{y} = \sum_i y_i / \sum_i m_i$, with $\hat{V}(\bar{y}) = n \sum_i (y_i - m_i\bar{y})^2 / [(n - 1)(\sum_i m_i)^2]$ as its estimated variance. As noted by SKINNER in his discussion of MENG (1994), these estimates appear to have no interpretation from a parametric likelihood or Bayesian standpoint under any population model. I examined the performance of Bayesian MI in conjunction with $\bar{y}$ and $\hat{V}(\bar{y})$ under a two-stage normal population, $\mu_i \sim N(\mu, \tau)$, $y_{ij} \sim N(\mu_i, \sigma^2)$, $i = 1, \ldots, n$, $j = 1, \ldots, m_i$.
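The two estimators just defined can be written down directly. The following Python sketch computes $\bar{y}$ and $\hat{V}(\bar{y})$ from cluster totals and cluster sample sizes. The function name `cluster_mean` and the toy inputs are hypothetical and are not the simulated data.

```python
def cluster_mean(totals, sizes):
    """Estimate of the population mean from a cluster sample,
    with its estimated variance as given in the text."""
    n = len(totals)
    if n < 2 or n != len(sizes):
        raise ValueError("need at least two clusters with matching sizes")
    total_size = sum(sizes)                       # sum of m_i
    y_bar = sum(totals) / total_size              # sum of y_i over sum of m_i
    resid_ss = sum((y - m * y_bar) ** 2 for y, m in zip(totals, sizes))
    var_hat = n * resid_ss / ((n - 1) * total_size ** 2)
    return y_bar, var_hat


# Toy data for illustration: two clusters of size one with totals 2 and 4.
y_bar, var_hat = cluster_mean([2.0, 4.0], [1, 1])
print(y_bar, var_hat)
```

With multiply imputed data, the function would be applied to each completed dataset in turn, and the resulting estimates and variances would be combined by Rubin's rules.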
I performed a simulation with $\mu = 10$, $\tau = 5$, $\sigma^2 = 20$, and $m_i = 20$ or 40 with equal probability, which produces a rather large design effect (deff $\approx$ 7.5). Drawing a sample of $n = 50$ clusters, I imposed missing values on the observations within each cluster in an MCAR fashion at a rate of 25%. I then generated five imputations of the missing values by a Bayesian procedure using a two-stage normal model and weakly informative prior distributions for the variance components. Finally, I calculated $\bar{y}$ and $\hat{V}(\bar{y})$ from each of the five imputed datasets and combined them by RUBIN's (1987) rules to obtain a single estimate and 95% confidence interval for the population mean. For comparison, I also calculated $\bar{y}$ and $\hat{V}(\bar{y})$ from the complete data (before missing values were imposed) and from the observed data alone (treating the incomplete data as a sample of clusters of smaller sizes). The entire procedure was repeated 1,000 times.

Over the 1,000 repetitions, the average point estimate was [...] using the complete data, [...] using the observed data alone, and [...] using MI, all very close to the true mean $\mu = 10$. The variances of the estimates for the three methods were [...] for complete data, [...] for observed data, and [...] for MI. Interestingly, an inflation of variance from [...] to [...] and from [...] to [...] corresponds to rates of missing information of 3% and 5%, respectively, even though the actual observations are missing at a rate of 25%. Because the design effect in this example is strong, reducing the per-cluster sample sizes by 25% produces only a slight loss of information. Therefore, a high-quality missing-data procedure should produce interval estimates that are only slightly wider than those obtained from complete data.

How did the intervals fare? With complete data, 938 (94%) of the simulated intervals covered the true mean $\mu = 10$, and the average interval width was [...]. Using the observed data alone, 993 of the intervals covered $\mu = 10$, and the average width of the intervals was 1.87, an increase of 36%. Treating the observed data in each cluster as if it were complete data from a smaller cluster produces intervals that are unnecessarily wide and inefficient. This result seems a bit odd, given the very simple nature of the missingness distribution. With MI, however, the performance of the intervals was outstanding; 948 of them covered the true mean, and their average width was only slightly greater than those from complete data (1.40).

7 Discussion and future directions

In the early years of MI, many thought that imputing missing values under one model and analyzing the imputed datasets under another model (or no model at all) was ludicrous and potentially harmful (FAY, 1992). We have seen, however, that in many situations of practical interest, this strategy can be quite beneficial.

Many important questions still need to be addressed in the area of MNAR models. A quick look at recent issues in biostatistics journals reveals a surprisingly large number of articles on selection and pattern-mixture models, particularly for dropout in longitudinal studies. Should we fit these models simply because we can? I hope that, in the future, we will see published analyses of data under MNAR models that answer more questions than they raise.
I hope to see detailed comparisons of the performance of various MAR and MNAR models in realistic scenarios where the true nature of the complete-data population and the missingness are unknown.

The performance of Bayesian MI in conjunction with popular semi- or nonparametric analyses also needs further study. For example, consider the current procedures for design-based variance estimation from stratified multistage cluster samples as implemented in SUDAAN (RESEARCH TRIANGLE INSTITUTE, 1998). To what extent do the complexities of the sample design need to be incorporated into the imputation model? Do we need to account for each level of clustering, or will it be sufficient to incorporate only the highest level, the ultimate clusters that drive the variance estimation procedures?
Appendix

#########################################
# S-PLUS code for multiple imputation of victimization status from
# the National Crime Survey using an MNAR model
#########################################
# Enter the data
Y1 <- rep(c("CrimeFree", "Victim", "NA"), 3)
Y2 <- rep(c("CrimeFree", "Victim", "NA"), each = 3)
count <- c(392, 76, 31, 55, 38, 7, 33, 9, 115)
R1 <- rep(c("Observed", "Observed", "Missing"), 3)
R2 <- rep(c("Observed", "Observed", "Missing"), each = 3)
Crime <- data.frame(Y1 = factor(Y1), Y2 = factor(Y2),
    R1 = factor(R1), R2 = factor(R2), count = count)
# Attach the S+MissingData library
library(missing)
# Fit the loglinear model by maximum likelihood
Crime.mle <- mdLoglin(Crime,
    margins = count ~ Y1:Y2 + Y1:R1 + Y2:R2 + R1:R2, na.proc = "em")
# Generate m = 10 imputations using the default noninformative prior
set.seed(184)
Crime.imp <- impLoglin(Crime,
    margins = count ~ Y1:Y2 + Y1:R1 + Y2:R2 + R1:R2,
    prior = 0.5, nimpute = 10, control = list(niter = 1000))
# Reduce the imputed data to Y1 x Y2 tables, collapsing over R's
Crime.imp.Y1xY2 <- miEval(oldUnclass(crosstabs(frequency ~ Y1 + Y2,
    data = Crime.imp)), vnames = "Crime.imp")
# Compute estimates, SE's for log-odds ratios and combine them
# by Rubin's (1987) rules
logodds.est <- miEval(log(Crime.imp.Y1xY2[1,1] * Crime.imp.Y1xY2[2,2] /
    (Crime.imp.Y1xY2[2,1] * Crime.imp.Y1xY2[1,2])))
logodds.se <- miEval(sqrt(sum(1 / Crime.imp.Y1xY2)))
logodds.result <- miMeanSE(logodds.est, logodds.se, df = Inf)
print(exp(logodds.result$est))
print(exp(logodds.result$est +
    c(-1, 1) * qt(.975, logodds.result$df) * logodds.result$std.err))
# Do the same for the difference in proportions pi12 - pi21,
# which equals P(Y2 = yes) - P(Y1 = yes)
diff.est <- miEval((Crime.imp.Y1xY2[1,2] - Crime.imp.Y1xY2[2,1]) /
    sum(Crime.imp.Y1xY2))
diff.se <- miEval(sqrt((1 / sum(Crime.imp.Y1xY2)) *
    ((Crime.imp.Y1xY2[1,2] / sum(Crime.imp.Y1xY2)) *
     (1 - Crime.imp.Y1xY2[1,2] / sum(Crime.imp.Y1xY2)) +
     (Crime.imp.Y1xY2[2,1] / sum(Crime.imp.Y1xY2)) *
     (1 - Crime.imp.Y1xY2[2,1] / sum(Crime.imp.Y1xY2)) +
     2 * (Crime.imp.Y1xY2[1,2] / sum(Crime.imp.Y1xY2)) *
     (Crime.imp.Y1xY2[2,1] / sum(Crime.imp.Y1xY2)))))
diff.result <- miMeanSE(diff.est, diff.se, df = Inf)
print(diff.result$est)
print(diff.result$est +
    c(-1, 1) * qt(.975, diff.result$df) * diff.result$std.err)
# Create Table 2 for displaying the imputations and results
Table2 <- matrix(NA, 8, 10)
dimnames(Table2) <- list(c("no, no ", "no, yes", "yes, no ", "yes, yes",
    "logodds", "se", "diff", "se"), format(1:10))
for(i in 1:10) {
    Table2[1:4, i] <- as.vector(t(Crime.imp.Y1xY2[[i]]))
    Table2[5, i] <- logodds.est[[i]]
    Table2[6, i] <- logodds.se[[i]]
    Table2[7, i] <- diff.est[[i]]
    Table2[8, i] <- diff.se[[i]]
}
# Test the joint significance of the Y1*R1 and Y2*R2 associations
Crime.mle2 <- mdLoglin(Crime, margins = count ~ Y1:Y2 + R1:R2,
    na.proc = "em")
lrtest <- 2 * (Crime.mle$algorithm$likelihood -
    Crime.mle2$algorithm$likelihood)
print(1 - pchisq(lrtest, 2))

References

Agresti, A. (1990), Categorical data analysis, John Wiley and Sons, New York.
Arbuckle, J. L. and W. Wothke (1999), AMOS 4.0 User's Guide, Smallwaters, Inc., Chicago.
Collins, L. M., B. P. Flaherty, S. L. Hyatt and J. L. Schafer (1999), WinLTA User's Guide, Version 2.0, The Methodology Center, The Pennsylvania State University, University Park, PA.
Collins, L. M., J. L. Schafer and C. M. Kam (2001), A comparison of inclusive and restrictive strategies in modern missing-data procedures, Psychological Methods 6,
Fay, R. E. (1986), Causal models for patterns of nonresponse, Journal of the American Statistical Association 81,
Fay, R. E. (1992), When are inferences from multiple imputation valid? Proceedings of the Survey Research Methods Section of the American Statistical Association,
Gelman, A., D. B. Rubin, J. Carlin and H. Stern (1995), Bayesian data analysis, Chapman and Hall, London.
Jöreskog, K. G. and D. Sörbom (2001), LISREL 8.5, Scientific Software International, Inc., Chicago.
Kenward, M. G. (1998), Selection models for repeated measurements for nonrandom dropout: an illustration of sensitivity, Statistics in Medicine 17,
King, G., J. Honaker, A. Joseph and K. Scheve (2001), Analyzing incomplete political science data: an alternative algorithm for multiple imputation, American Political Science Review 95,
Little, R. J. A. (1993), Pattern-mixture models for multivariate incomplete data, Journal of the American Statistical Association 88,
Little, R. J. A. (1995), Modeling the drop-out mechanism in repeated-measures studies, Journal of the American Statistical Association 90,
Little, R. J. A. and D. B. Rubin (1987), Statistical analysis with missing data, John Wiley and Sons, New York.
Meng, X. L. (1994), Multiple-imputation inferences with uncongenial sources of input (with discussion), Statistical Science 10,
Meng, X. L. (1999), A congenial overview and investigation of imputation inferences under uncongeniality. Paper presented at International Conference on Survey Nonresponse, Portland, October
Muthén, L. K. and B. O. Muthén (1998), Mplus User's Guide, Muthén and Muthén, Los Angeles.
Neale, M. C., S. M. Boker, G. Xie and H. H. Maes (1999), Mx: Statistical Modeling (5th ed.), Department of Psychiatry, Virginia Commonwealth University, Richmond, VA.
Raghunathan, T. E., P. W. Solenberger and J. Van Hoewyk (2000), IVEware: Imputation and Variance Estimation Software, Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI.
Research Triangle Institute (1998), SUDAAN: Software for the Statistical Analysis of Correlated Data, Version 7, Research Triangle Institute, Research Triangle Park, NC.
Robins, J. M., A. Rotnitzky and D. O. Scharfstein (1998), Semiparametric regression for repeated outcomes with nonignorable nonresponse, Journal of the American Statistical Association 93,
Rubin, D. B. (1976), Inference and missing data, Biometrika 63,
Rubin, D. B. (1987), Multiple imputation for nonresponse in surveys, John Wiley and Sons, New York.
Rubin, D. B. (1996), Multiple imputation after 18+ years, Journal of the American Statistical Association 91,
Schafer, J. L. (1997), Analysis of incomplete multivariate data, Chapman and Hall, London.
Schafer, J. L. (1999), NORM: Multiple imputation of incomplete multivariate data under a normal model, Software for Windows, Department of Statistics, The Pennsylvania State University, University Park, PA.
Schimert, J., J. L. Schafer, T. Hesterberg, C. Fraley and D. Clarkson (2001), Analyzing missing values in S-PLUS, Insightful Corporation, Seattle, WA.
Statistical Solutions Inc. (2002), SOLAS for missing data analysis, Version 3, Statistical Solutions, Cork, Ireland.
Van Buuren, S. and C. G. M. Oudshoorn (1999), Flexible multivariate imputation by MICE, TNO/VGZ/PG , TNO Prevention and Health, Leiden.
Verbeke, G. and G. Molenberghs (2000), Linear mixed models for longitudinal data, Springer-Verlag, New York.
Yuan, Y. C. (2000), Multiple imputation for missing data: concepts and new development, Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference, Paper 267, SAS Institute, Cary, NC.
Zeger, S. L., K. Y. Liang and P. S. Albert (1988), Models for longitudinal data: a generalized estimating equation approach, Biometrics 44,

Received: February 2002, Revised: October 2002.
Institut f. Statistik u. Wahrscheinlichkeitstheorie 040 Wien, Wiedner Hauptstr. 80/07 AUSTRIA http://www.statistik.tuwien.ac.at Visualization of missing values using the Rpackage VIM M. Templ and P.
More informationCONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE
1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 3 Problems with small populations 8 II. Why Random Sampling is Important 9 A myth,
More informationSampling Distributions and the Central Limit Theorem
135 Part 2 / Basic Tools of Research: Sampling, Measurement, Distributions, and Descriptive Statistics Chapter 10 Sampling Distributions and the Central Limit Theorem In the previous chapter we explained
More informationSummary of Probability
Summary of Probability Mathematical Physics I Rules of Probability The probability of an event is called P(A), which is a positive number less than or equal to 1. The total probability for all possible
More informationLife Table Analysis using Weighted Survey Data
Life Table Analysis using Weighted Survey Data James G. Booth and Thomas A. Hirschl June 2005 Abstract Formulas for constructing valid pointwise confidence bands for survival distributions, estimated using
More informationHow To Use A Monte Carlo Study To Decide On Sample Size and Determine Power
How To Use A Monte Carlo Study To Decide On Sample Size and Determine Power Linda K. Muthén Muthén & Muthén 11965 Venice Blvd., Suite 407 Los Angeles, CA 90066 Telephone: (310) 3919971 Fax: (310) 3918971
More informationDealing with Missing Data
Dealing with Missing Data Roch Giorgi email: roch.giorgi@univamu.fr UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France BioSTIC, APHM, Hôpital Timone, Marseille, France January
More informationAppendix Methodology and Statistics
Appendix Methodology and Statistics Introduction Science and Engineering Indicators (SEI) contains data compiled from a variety of sources. This appendix explains the methodological and statistical criteria
More informationChapter 5: Analysis of The National Education Longitudinal Study (NELS:88)
Chapter 5: Analysis of The National Education Longitudinal Study (NELS:88) Introduction The National Educational Longitudinal Survey (NELS:88) followed students from 8 th grade in 1988 to 10 th grade in
More informationMINITAB ASSISTANT WHITE PAPER
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. OneWay
More informationComparison of Imputation Methods in the Survey of Income and Program Participation
Comparison of Imputation Methods in the Survey of Income and Program Participation Sarah McMillan U.S. Census Bureau, 4600 Silver Hill Rd, Washington, DC 20233 Any views expressed are those of the author
More informationSAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
More informationChallenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Dropout
Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Dropout Sandra Taylor, Ph.D. IDDRC BBRD Core 23 April 2014 Objectives Baseline Adjustment Introduce approaches Guidance
More informationJournal Article Reporting Standards (JARS)
APPENDIX Journal Article Reporting Standards (JARS), MetaAnalysis Reporting Standards (MARS), and Flow of Participants Through Each Stage of an Experiment or QuasiExperiment 245 Journal Article Reporting
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation:  Feature vector X,  qualitative response Y, taking values in C
More informationCHOOSING APPROPRIATE METHODS FOR MISSING DATA IN MEDICAL RESEARCH: A DECISION ALGORITHM ON METHODS FOR MISSING DATA
CHOOSING APPROPRIATE METHODS FOR MISSING DATA IN MEDICAL RESEARCH: A DECISION ALGORITHM ON METHODS FOR MISSING DATA Hatice UENAL Institute of Epidemiology and Medical Biometry, Ulm University, Germany
More informationData Cleaning and Missing Data Analysis
Data Cleaning and Missing Data Analysis Dan Merson vagabond@psu.edu India McHale imm120@psu.edu April 13, 2010 Overview Introduction to SACS What do we mean by Data Cleaning and why do we do it? The SACS
More informationBMJ Open. To condition or not condition? Analyzing change in longitudinal randomized controlled trials
To condition or not condition? Analyzing change in longitudinal randomized controlled trials Journal: BMJ Open Manuscript ID bmjopen000 Article Type: Research Date Submitted by the Author: Jun0 Complete
More informationWhen Does it Make Sense to Perform a MetaAnalysis?
CHAPTER 40 When Does it Make Sense to Perform a MetaAnalysis? Introduction Are the studies similar enough to combine? Can I combine studies with different designs? How many studies are enough to carry
More informationStatistics Graduate Courses
Statistics Graduate Courses STAT 7002Topics in StatisticsBiological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
More informationSensitivity analysis of longitudinal binary data with nonmonotone missing values
Biostatistics (2004), 5, 4,pp. 531 544 doi: 10.1093/biostatistics/kxh006 Sensitivity analysis of longitudinal binary data with nonmonotone missing values PASCAL MININI Laboratoire GlaxoSmithKline, UnitéMéthodologie
More informationFairfield Public Schools
Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity
More informationWhat s New in Econometrics? Lecture 8 Cluster and Stratified Sampling
What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling Jeff Wooldridge NBER Summer Institute, 2007 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of Groups and
More informationDr James Roger. GlaxoSmithKline & London School of Hygiene and Tropical Medicine.
American Statistical Association Biopharm Section Monthly Webinar Series: Sensitivity analyses that address missing data issues in Longitudinal studies for regulatory submission. Dr James Roger. GlaxoSmithKline
More informationBayesian Approaches to Handling Missing Data
Bayesian Approaches to Handling Missing Data Nicky Best and Alexina Mason BIAS Short Course, Jan 30, 2012 Lecture 1. Introduction to Missing Data Bayesian Missing Data Course (Lecture 1) Introduction to
More informationCopyright 2010 The Guilford Press. Series Editor s Note
This is a chapter excerpt from Guilford Publications. Applied Missing Data Analysis, by Craig K. Enders. Copyright 2010. Series Editor s Note Missing data are a real bane to researchers across all social
More information2. Simple Linear Regression
Research methods  II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according
More informationIntroduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
More informationBasics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar
More informationAnother Look at Sensitivity of Bayesian Networks to Imprecise Probabilities
Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Oscar Kipersztok Mathematics and Computing Technology Phantom Works, The Boeing Company P.O.Box 3707, MC: 7L44 Seattle, WA 98124
More informationHow to Use a Monte Carlo Study to Decide on Sample Size and Determine Power
STRUCTURAL EQUATION MODELING, 9(4), 599 620 Copyright 2002, Lawrence Erlbaum Associates, Inc. TEACHER S CORNER How to Use a Monte Carlo Study to Decide on Sample Size and Determine Power Linda K. Muthén
More informationRegression Modeling Strategies
Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions
More informationSouth Carolina College and CareerReady (SCCCR) Probability and Statistics
South Carolina College and CareerReady (SCCCR) Probability and Statistics South Carolina College and CareerReady Mathematical Process Standards The South Carolina College and CareerReady (SCCCR)
More informationStatistical Rules of Thumb
Statistical Rules of Thumb Second Edition Gerald van Belle University of Washington Department of Biostatistics and Department of Environmental and Occupational Health Sciences Seattle, WA WILEY AJOHN
More informationLeast Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN13: 9780470860809 ISBN10: 0470860804 Editors Brian S Everitt & David
More informationIS YOUR DATA WAREHOUSE SUCCESSFUL? Developing a Data Warehouse Process that responds to the needs of the Enterprise.
IS YOUR DATA WAREHOUSE SUCCESSFUL? Developing a Data Warehouse Process that responds to the needs of the Enterprise. Peter R. Welbrock SmithHanley Consulting Group Philadelphia, PA ABSTRACT Developing
More informationChapter 11 Introduction to Survey Sampling and Analysis Procedures
Chapter 11 Introduction to Survey Sampling and Analysis Procedures Chapter Table of Contents OVERVIEW...149 SurveySampling...150 SurveyDataAnalysis...151 DESIGN INFORMATION FOR SURVEY PROCEDURES...152
More informationDepartment of Epidemiology and Public Health Miller School of Medicine University of Miami
Department of Epidemiology and Public Health Miller School of Medicine University of Miami BST 630 (3 Credit Hours) Longitudinal and Multilevel Data WednesdayFriday 9:00 10:15PM Course Location: CRB 995
More informationAdequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection
Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics
More informationPenalized regression: Introduction
Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20thcentury statistics dealt with maximum likelihood
More information