Multiple Imputation in Multivariate Problems When the Imputation and Analysis Models Differ


Statistica Neerlandica (2003) Vol. 57, nr. 1

Joseph L. Schafer*
Department of Statistics and The Methodology Center, The Pennsylvania State University, 326 Thomas Building, University Park, PA 16802, USA

Bayesian multiple imputation (MI) has become a highly useful paradigm for handling missing values in many settings. In this paper, I compare Bayesian MI with other methods (maximum likelihood, in particular) and point out some of its unique features. One key aspect of MI, the separation of the imputation phase from the analysis phase, can be advantageous in settings where the models underlying the two phases do not agree.

Key Words and Phrases: missing data, nonresponse.

1 Fundamentals

In modern statistical practice, the occurrence of missing values is usually viewed as a random phenomenon. Let $Y_{com} = (Y_{obs}, Y_{mis})$ denote a set of complete data and $Y_{obs}$ the data actually observed. The missing data, $Y_{mis}$, denote real or hypothetical quantities which, if available, would simplify the analysis. Our primary interest is in some aspect of the population distribution of $Y_{com}$, $P(Y_{com} \mid \theta)$, not the distribution of $Y_{obs}$, nor the process that partitions $Y_{com}$ into $Y_{obs}$ and $Y_{mis}$. Nevertheless, we suppose that the partitioning of $Y_{com}$ is encoded in a set of random variables $R$. For example, if $Y_{com}$ is a data matrix containing both observed and missing values, then $R$ could be a matrix of the same size containing 1's and 0's to show whether the corresponding elements of $Y_{com}$ are observed or missing. We will refer to $R$ as the missingness, and to $P(R \mid Y_{com}; \xi)$ as the distribution of missingness. Note that $R$ is itself completely observed. We posit a distribution for $R$ not because we want to model it. Rather, we hope to avoid modeling $R$, because our model might not be plausible.
Reasons for missingness are often not present in $Y_{com}$, because explaining $R$ is usually not the purpose of data collection. Nevertheless, these reasons may be related to aspects of $Y_{com}$ and, by omission, may induce relationships between $R$ and $Y_{com}$. The distribution of missingness should be regarded as a mathematical device to describe the rates and patterns of missing values and to capture approximately the relationships between $R$ and $Y_{com}$ in a correlational (not causal) sense. Usually, our main reason for introducing $P(R \mid Y_{com}; \xi)$ is to clarify the conditions under which we may avoid specifying it.

* This work was supported by grant 1-P50-DA10075, National Institute on Drug Abuse. Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

Elementary probability theory suggests that $P(Y_{obs} \mid \theta) = \int P(Y_{com} \mid \theta)\, dY_{mis}$ may provide a basis for inference about $\theta$, both as a sampling distribution for $Y_{obs}$ and as a likelihood function for $\theta$. But that is true only under certain conditions. Missing data are missing at random (MAR) if the distribution of missingness does not depend on $Y_{mis}$, $P(R \mid Y_{com}; \xi) = P(R \mid Y_{obs}; \xi)$ (RUBIN, 1976). They are missing completely at random (MCAR) if it does not depend on $Y_{obs}$ or $Y_{mis}$, $P(R \mid Y_{com}; \xi) = P(R; \xi)$ (LITTLE and RUBIN, 1987). Missing not at random (MNAR) refers to any violation of MAR. RUBIN (1976) showed that $P(Y_{obs} \mid \theta)$ provides a correct sampling distribution for frequentist inference about $\theta$ under MCAR, and a correct likelihood function for likelihood/Bayes inference under the weaker condition of MAR. (To be precise, one also needs the parameters of the missingness distribution, $\xi$, to be distinct from or independent of $\theta$; see LITTLE and RUBIN, 1987, for details.)

That models for $R$ may be avoided more often in a likelihood/Bayesian mode than in a frequentist mode suggests that missing-data procedures based on likelihood may be more useful than those motivated by frequentist arguments alone. For the most part, I believe that is true. Many of the older data-editing procedures (e.g. listwise deletion) are based on frequentist arguments and are generally valid only under the strong and often implausible assumption of MCAR. Even if MCAR does hold, these procedures may be inefficient.
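The distinction between MCAR, MAR and MNAR drawn above can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not in the paper: two standard normal variables with correlation `rho`, a logistic missingness rule with slope 2, and a hypothetical function name `simulate`. $Y_1$ is always observed and $Y_2$ is sometimes missing; the true mean of $Y_2$ is 0. The sketch compares the complete-case mean of $Y_2$ (listwise deletion) with a regression-based estimate that uses the observed $Y_1$ values, which is one simple likelihood-motivated alternative.

```python
import math
import random
from statistics import fmean


def simulate(mechanism, n=20000, rho=0.6, seed=0):
    """Return (complete-case mean of Y2, regression-based mean of Y2).

    Y1 is always observed; Y2 is deleted according to `mechanism`
    ("MCAR", "MAR" or "MNAR"). The true mean of Y2 is 0.
    """
    rng = random.Random(seed)
    y1_all = []      # every Y1 value (always observed)
    complete = []    # (Y1, Y2) pairs for which Y2 is observed
    for _ in range(n):
        y1 = rng.gauss(0, 1)
        y2 = rho * y1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        if mechanism == "MCAR":
            p_miss = 0.5                            # depends on nothing
        elif mechanism == "MAR":
            p_miss = 1 / (1 + math.exp(-2 * y1))    # depends on observed Y1 only
        else:
            p_miss = 1 / (1 + math.exp(-2 * y2))    # MNAR: depends on Y2 itself
        y1_all.append(y1)
        if rng.random() >= p_miss:
            complete.append((y1, y2))

    x = [a for a, _ in complete]
    y = [b for _, b in complete]
    cc_mean = fmean(y)                              # listwise deletion

    # Least-squares regression of Y2 on Y1 among complete cases,
    # evaluated at the mean of Y1 over all subjects.
    mx, my = fmean(x), fmean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    reg_mean = intercept + slope * fmean(y1_all)
    return cc_mean, reg_mean


for mech in ("MCAR", "MAR", "MNAR"):
    cc, reg = simulate(mech)
    print(f"{mech}: complete-case mean = {cc:+.3f}, regression-based mean = {reg:+.3f}")
```

In this setup the complete-case mean should be close to zero only under MCAR, the regression-based estimate should remain close to zero under MAR, and both should be off under MNAR, which is the pattern the definitions lead one to expect.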
Methods for analyzing data without a true likelihood function (marginal modeling based on generalized estimating equations, for example) can handle missing values only if they are MCAR (ZEGER, LIANG and ALBERT, 1988) or if one specifies a correct model for $R$ (ROBINS, ROTNITZKY and SCHARFSTEIN, 1998). As a general principle, I believe that an analyst's time and effort are better spent building an intelligent model for the data rather than modeling the missingness, unless departures from MAR are suspected to be very serious.

Consider a multivariate problem where we collect items $Y_j$, $j = 1, \ldots, p$ for subjects $i = 1, \ldots, n$, so that $Y_{com}$ is an $n \times p$ data matrix, but portions of $Y_{com}$ are missing for reasons beyond our control. If $Y_1, \ldots, Y_p$ represent repeated measures of the same variable at different occasions, and if subjects who drop out do not return (i.e., if $Y_j$ is missing then $Y_{j+1}, \ldots, Y_p$ are missing as well), then MAR is easy to understand: it means that a subject's probability of dropping out at occasion $j$, given that he has not yet dropped out, may depend on previous responses, $Y_1, \ldots, Y_{j-1}$, but not on the present or future, $Y_j, \ldots, Y_p$. In more general multivariate settings, however, MAR is not very intuitive; it means that a subject's probabilities of responding to $Y_1, \ldots, Y_p$ may depend only on his or her own set of observed items, a set that changes from one subject to the next. One could say that this condition seems odd or unnatural. However, the apparent awkwardness of MAR does not imply that assuming MAR is unwise. COLLINS, SCHAFER and KAM (2001) demonstrated that, with continuous data, an erroneous assumption of MAR (failing to take into account a cause or correlate of missingness) may have only a minor impact on estimates and standard errors unless the relationships between the omitted cause or correlate of missingness and the outcomes are unusually strong ($\rho$ much greater than 0.4). In typical social science applications, I believe that such strong correlations between causes of missingness and outcomes are the exception rather than the rule, and assuming MAR will probably not lead us far astray.

When MNAR is a serious concern (e.g., in a clinical trial where subjects drop out if the treatment is not working), it may be necessary to jointly model $Y_{com}$ and $R$, but such models must be based on other unverifiable assumptions. Methods for joint modeling of $Y_{com}$ and $R$ in longitudinal studies with dropout are reviewed by LITTLE (1995) and by VERBEKE and MOLENBERGHS (2000).

2 Bayesian multiple imputation

Inferences about $\theta$ may proceed from a likelihood based on $P(Y_{obs} \mid \theta)$ under MAR or from a joint model for $Y_{com}$ and $R$ under MNAR. Procedures for ML estimation in multivariate missing-data problems are becoming widely available. Programs for fitting linear mixed models to longitudinal data (e.g., PROC MIXED in SAS) allow missing responses. ML for latent-variable models with incomplete data is found in Amos (ARBUCKLE and WOTHKE, 1999), Mx (NEALE et al., 1999), LISREL 8.5 (JÖRESKOG and SÖRBOM, 2001), LTA (COLLINS et al., 1999) and Mplus (MUTHÉN and MUTHÉN, 1998). All of these programs assume that the missing values are MAR.

A useful alternative to direct likelihood methods is given by multiple imputation (MI) (RUBIN, 1987). Suppose that $Q$ is some aspect of the distribution of $Y_{com}$ to be estimated, and that an estimate $\hat{Q}$ and standard error $\sqrt{U}$ could be easily calculated if $Y_{mis}$ were available.
In MI, $Y_{mis}$ is replaced by $m > 1$ simulated versions, $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(m)}$, resulting in $m$ estimates and standard errors, $(\hat{Q}_j, \sqrt{U_j})$, $j = 1, \ldots, m$. From RUBIN (1987), an overall estimate for $Q$ is $\bar{Q} = m^{-1} \sum_{j=1}^{m} \hat{Q}_j$ with a standard error of $\sqrt{T}$, where $T = \bar{U} + (1 + m^{-1})B$, $\bar{U} = m^{-1} \sum_{j=1}^{m} U_j$, and $B = (m-1)^{-1} \sum_{j=1}^{m} (\hat{Q}_j - \bar{Q})^2$. If $(\hat{Q} - Q)/\sqrt{U}$ is approximately $N(0, 1)$ with complete data, then $(\bar{Q} - Q)/\sqrt{T} \sim t_{\nu}$ provides tests and intervals for $Q$, where $\nu = (m-1)(1 + r^{-1})^2$ and $r = (1 + m^{-1})B/\bar{U}$.

Where do the imputations $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(m)}$ come from? They are drawn from a Bayesian posterior predictive distribution for $Y_{mis}$. The model for complete data, $P(Y_{com}; \theta)$, implies a conditional distribution for the missing values given the observed ones, $P(Y_{mis} \mid Y_{obs}; \theta)$. We could use this distribution for imputation if $\theta$ were known. But because $\theta$ is unknown, we must generate $m$ independent random draws $\theta^{(1)}, \ldots, \theta^{(m)}$ from a Bayesian posterior distribution given $Y_{obs}$ and $R$, which depends only on $Y_{obs}$ if the missing data are MAR. Once the random parameters are drawn, the imputations follow as $Y_{mis}^{(j)} \sim P(Y_{mis} \mid Y_{obs}; \theta^{(j)})$, $j = 1, \ldots, m$. In most applications, the number of imputations does not need to be large; $m = 5$ is often enough.

Computational strategies for creating MI's in multivariate settings are reviewed by SCHAFER (1997). Programs for creating MI's include NORM (SCHAFER, 1999), the S+MissingData library in S-PLUS (SCHIMERT et al., 2001), SAS PROC MI (YUAN, 2000), Amelia (KING et al., 2001), SOLAS (STATISTICAL SOLUTIONS, 2002), MICE (VAN BUUREN and OUDSHOORN, 1999), and IVEware (RAGHUNATHAN, SOLENBERGER and VAN HOEWYK, 2000). An excellent resource for information about these and other programs for MI is the website [...]. These programs assume that the missing data are MAR. However, in some cases, we can also use them to generate imputations under certain MNAR models by creating variables out of the missing-data pattern $R$ and treating these variables as additional covariates; this idea will be explored in Section 5 below.

It is easy to show that, when the estimand $Q$ is a function of $\theta$, MI provides approximate Bayesian inferences for $Q$ (SCHAFER, 1997, Sec. [...]). Therefore, with a large sample and a diffuse prior distribution, MI and direct likelihood methods produce similar answers. From this standpoint, MI seems to be an unnecessarily circuitous way to summarize the likelihood for $\theta$. Why is MI worth considering at all?

One reason is that, unlike ML, MI separates the inference into two phases: the imputation phase, in which $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(m)}$ are created, and the analysis phase, in which the $(\hat{Q}_j, \sqrt{U_j})$, $j = 1, \ldots, m$ are calculated and combined. Because the phases are distinct, imputation and analysis may be carried out on different occasions by different persons. For example, the National Center for Health Statistics in the United States has recently released a multiply-imputed version of the Third National Health and Nutrition Examination Survey (NHANES III). Special techniques and software were used to create the imputations, but now that they exist, the imputed data files may be analyzed in a straightforward manner by anyone familiar with standard techniques of survey data analysis.
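The two phases described in this section can be sketched in a few lines. The function `rubin_combine` below implements the combining rules stated above ($\bar{Q}$, $\bar{U}$, $B$, $T$, $r$ and $\nu$). The function `impute_normal` is only an illustrative assumption, not the method of any program cited here: it treats a single normal variable with values missing completely at random, draws $(\mu, \sigma^2)$ from the posterior under the standard noninformative prior, and then draws the missing values given those parameters. All names and the toy data are hypothetical.

```python
import math
import random
from statistics import fmean, variance


def rubin_combine(estimates, variances):
    """Combine m complete-data estimates Q_j and squared standard errors U_j."""
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = fmean(estimates)                                  # overall estimate
    u_bar = fmean(variances)                                  # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)    # between-imputation variance
    t = u_bar + (1 + 1 / m) * b                               # total variance
    r = (1 + 1 / m) * b / u_bar                               # relative increase in variance
    nu = (m - 1) * (1 + 1 / r) ** 2 if r > 0 else math.inf    # degrees of freedom
    return {"estimate": q_bar, "se": math.sqrt(t), "T": t,
            "B": b, "U_bar": u_bar, "r": r, "df": nu}


def impute_normal(y_obs, n_mis, m, rng):
    """Create m completed datasets for a univariate normal sample (assumed model)."""
    n = len(y_obs)
    y_bar, s2 = fmean(y_obs), variance(y_obs)
    completed = []
    for _ in range(m):
        # Draw parameters from their posterior: sigma^2 is (n-1)s^2 divided by
        # a chi-square(n-1) variate, then mu | sigma^2 ~ N(y_bar, sigma^2 / n).
        sigma2 = (n - 1) * s2 / rng.gammavariate((n - 1) / 2, 2)
        mu = rng.gauss(y_bar, math.sqrt(sigma2 / n))
        # Draw the missing values given the drawn parameters.
        y_mis = [rng.gauss(mu, math.sqrt(sigma2)) for _ in range(n_mis)]
        completed.append(list(y_obs) + y_mis)
    return completed


# Imputation phase (hypothetical data: 60 observed values, 40 missing).
rng = random.Random(1)
y_obs = [rng.gauss(10, 2) for _ in range(60)]
datasets = impute_normal(y_obs, n_mis=40, m=5, rng=rng)

# Analysis phase: the complete-data estimate is the sample mean, with U = s^2 / N.
q_hats = [fmean(d) for d in datasets]
u_hats = [variance(d) / len(d) for d in datasets]
result = rubin_combine(q_hats, u_hats)
print(result)
```

Because the analysis phase only needs the completed datasets, `rubin_combine` would work unchanged with any other complete-data estimator that returns an estimate and a squared standard error.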
The NHANES III Multiply Imputed Data Set is available at nhanes/nh3data.htm.

Another advantage of separating imputation from the analysis, and the primary topic of the rest of this paper, is that the imputation and analysis may be carried out under different models. For example, it is a relatively simple matter to include extra variables in an imputation procedure but ignore them in later analyses. Properties of MI when the imputation and analysis models differ have been explored from a theoretical standpoint by MENG (1994) and RUBIN (1996), and from a practical standpoint by COLLINS, SCHAFER, and KAM (2001). Discrepancies between the models are not necessarily harmful and can often be advantageous.

3 Comparing the results from MI and ML

To understand what happens when the imputation and analysis models differ, it helps to compare first the results of an ML procedure applied to $Y_{obs}$ alone (which uses a single model) to an analysis based on MI (which uses two models). This discussion, which is summarized from COLLINS, SCHAFER and KAM (2001), assumes that the model for the complete-data population $P(Y_{com}; \theta)$ used in the ML analysis is the same model used to obtain estimates and standard errors $(\hat{Q}_j, \sqrt{U_j})$, $j = 1, \ldots, m$ after MI. Without this assumption, there is no guarantee that the same population parameters are being estimated under the two methods. At present, we will also limit our attention to situations where the user of ML and the imputer both assume that the missing values are MAR, so that neither one is specifying a distribution for the missingness. This does not mean that the user of ML and the imputer are making identical assumptions about the missing data, because their models may still be different. For example, the imputer may use additional variables that do not appear in the analyses, or he may specify different inter-variable relationships (e.g. a structured covariance matrix versus an unstructured one).

Proposition 1. If the user of the ML procedure and the imputer use the same set of input data (same variables and observational units); if their models apply equivalent distributional assumptions to the variables and the relationships among them; if the sample size is large; and if the number of imputations is sufficiently large; then the results from the ML and MI procedures will be essentially identical.

Under these conditions, MI approximates a Bayesian analysis under the same model used in the ML procedure, and the asymptotic equivalence between Bayesian and likelihood-based procedures is well known (GELMAN et al., 1995). With large samples the effect of a prior distribution diminishes, and Bayesian and ML analyses produce similar results, especially when the prior is diffuse. For example, suppose that we want to regress a single variable $Y$ on a set of predictors $X_1, \ldots, X_p$ containing missing values.
We could do this in AMOS by specifying the model with a path from each predictor $X_j$ to $Y$, an error term for $Y$ that is uncorrelated with the predictors, and arbitrary correlations among the $X_j$'s. If the sample size is large, the parameter estimates and standard errors from AMOS will be essentially identical to those that would be obtained if the missing values were multiply imputed $m$ times using NORM, the imputed datasets were analyzed using conventional regression software, and the results combined according to RUBIN's (1987) rules.

When MI is carried out under these conditions, the imputer's and analyst's procedures are said to be congenial (MENG, 1994). Under congeniality, there is no major theoretical reason to prefer ML estimates to MI or vice versa, because the properties of the two methods are very similar. In many real-world applications of MI, however, the imputer's and analyst's models are uncongenial. Some types of uncongeniality are clearly harmful; for example, an imputer might create MI's under a model that is grossly misspecified. But MENG (1994) and RUBIN (1996) have demonstrated that uncongeniality can also be beneficial, particularly when the imputer can take advantage of extra information unavailable or unused by the analyst. My next two propositions describe different kinds of uncongeniality; the first is rather benign, whereas the second may help or harm.

Proposition 2. If the user of the ML procedure and the imputer use the same set of input data (same variables and observational units) but the imputer's model assumes a more general distributional form than the analyst's model, then ML and MI tend to produce similar parameter estimates, but the standard errors from MI may be larger.

For example, suppose that $Y_1, \ldots, Y_p$ are repeated measures of a variable over time, and missing values arise from dropout. Researcher A uses SAS PROC MIXED to fit a linear model with random intercepts and time-slopes for each individual, and then manipulates the fixed effects to estimate $E(Y_p)$, the population mean response at the final occasion. Researcher B multiply imputes the missing responses using NORM, calculates the sample mean of $Y_p$ in each imputed dataset, and combines the results by RUBIN's (1987) rules. The two researchers will obtain similar estimates of $E(Y_p)$, but A's standard errors may be slightly smaller than B's, because PROC MIXED assumes a patterned covariance structure for $Y_1, \ldots, Y_p$ whereas NORM applies an unstructured model. In my experience with real data, the increase in standard errors that arises when the imputation model is more general than the analysis model is often barely noticeable.

Proposition 3. If the user of the ML procedure and the imputer use the same sample of units but a different set of variables, then the results from ML and MI could be quite different, even though the ML user's model is equivalent to the model used by the analyst of the imputed dataset.

This can easily happen in practice. For example, suppose that Researcher A creates MI's for a set of variables $Y_1, \ldots, Y_p$ using NORM, but also includes in the imputation procedure some additional covariates $Z_1, \ldots, Z_q$.
After imputation, he ignores $Z_1, \ldots, Z_q$, fits a covariance structure model to $Y_1, \ldots, Y_p$ with LISREL using each of the imputed datasets and combines the results by RUBIN's (1987) rules. Researcher B fits the same covariance structure model directly to $Y_1, \ldots, Y_p$ without imputation using AMOS and finds that her results differ from A's. Many of her parameter estimates are similar to A's, but others are different enough to cause worry; some of her standard errors are similar, some are smaller and others are larger. In this example, discrepancies arise not because of inherent differences between MI and ML but because A included a set of auxiliary variables whereas B did not. If B had figured out a way to include the auxiliary variables in AMOS without altering the marginal model for $Y_1, \ldots, Y_p$, then her results would have been essentially identical to A's.

From this discussion, it is clear that non-trivial differences between ML and MI can arise when the imputation and analysis models are uncongenial. In the remainder of this paper, I present situations where uncongeniality can be used to our benefit as part of an intelligent missing-data strategy.

4 Using auxiliary variables in MI

Let $Y_1, \ldots, Y_p$ denote the key items or variables that we wish to analyze. If missing values occur in these variables, there may be other completely or partially observed variables $Z_1, \ldots, Z_q$ which are not inherently of interest, but which may potentially contain useful information for predicting the missing values. If so, then we may include $Z_1, \ldots, Z_q$ in a multiple imputation procedure but exclude them from the subsequent analysis. When is it beneficial to use $Z_1, \ldots, Z_q$ in this fashion? Are there potential dangers in doing so? COLLINS, SCHAFER and KAM (2001) (henceforth CSK) explore these questions in depth using simulation. Without going into details, I will attempt to explain and interpret our major findings.

CSK classify auxiliary variables $Z_1, \ldots, Z_q$ into three types. Type A variables are correlated with the outcomes $Y_1, \ldots, Y_p$ and may also help to explain why $Y_1, \ldots, Y_p$ are sometimes missing; that is, they are related to missingness. Type B variables are correlated with the outcomes $Y_1, \ldots, Y_p$ but unrelated to missingness. Type C variables are unrelated to $Y_1, \ldots, Y_p$.

Suppose that we plan to multiply impute the missing values in $Y_1, \ldots, Y_p$ using one of the currently available MI programs (e.g. NORM) which assumes MAR. Because Type A variables are related to missingness, MAR is violated if they are excluded from the imputation procedure; therefore, including them may help to reduce bias. Type B variables will not reduce bias under MAR, but they may increase precision of the final parameter estimates because they contain useful information for predicting the missing values of $Y_1, \ldots, Y_p$. (Under MNAR conditions, Type B variables may help to reduce bias.) Type C variables will neither reduce bias nor increase precision; rather, they may reduce precision because they make the imputation model unnecessarily large and complicated.
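The role of a Type A variable can be sketched with a toy simulation. Everything here is an illustrative assumption rather than the CSK design: one outcome $Y$ with true mean 0, one auxiliary variable $Z$ with correlation 0.9, a logistic missingness rule that depends on $Z$ only, and a simple imputation scheme that approximates posterior parameter draws by bootstrapping the complete cases before fitting a normal linear regression. The hypothetical function `mi_mean` estimates $E(Y)$ from imputations created with or without $Z$.

```python
import math
import random
from statistics import fmean, variance


def fit_line(x, y):
    """Least-squares intercept, slope and residual variance of y on x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return intercept, slope, rss / (len(x) - 2)


def mi_mean(z, y, use_aux, m=5, seed=0):
    """MI estimate of E(Y); entries of y equal to None are missing."""
    rng = random.Random(seed)
    complete = [(a, b) for a, b in zip(z, y) if b is not None]
    estimates = []
    for _ in range(m):
        # Bootstrap the complete cases so the fitted parameters vary across imputations.
        boot = [rng.choice(complete) for _ in complete]
        bz = [a for a, _ in boot]
        by = [b for _, b in boot]
        if use_aux:
            intercept, slope, s2 = fit_line(bz, by)            # imputation model includes Z
        else:
            intercept, slope, s2 = fmean(by), 0.0, variance(by)  # Z omitted
        filled = [b if b is not None
                  else rng.gauss(intercept + slope * a, math.sqrt(s2))
                  for a, b in zip(z, y)]
        estimates.append(fmean(filled))
    return fmean(estimates)


def make_data(n=4000, rho=0.9, seed=42):
    """Simulate (Z, Y) with Y missing with a probability that depends on Z."""
    rng = random.Random(seed)
    z, y = [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)
        yi = rho * zi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        p_miss = 1 / (1 + math.exp(-2 * zi))
        z.append(zi)
        y.append(None if rng.random() < p_miss else yi)
    return z, y


z, y = make_data()
with_aux = mi_mean(z, y, use_aux=True)
without_aux = mi_mean(z, y, use_aux=False)
print(f"E(Y) estimate with Z: {with_aux:+.3f}; without Z: {without_aux:+.3f}")
```

With a correlation this strong and roughly half of the values missing, omitting $Z$ should leave a clearly visible bias, whereas including it should bring the estimate close to zero. Weaker correlations or lower missingness rates would be expected to shrink the difference.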
In one set of simulations, CSK show that the biases incurred by failing to include Type A variables in the imputation procedure are not as serious as some have previously thought. Including or excluding a Type A variable $Z_k$ made little difference unless the correlation between $Z_k$ and an outcome $Y_j$ was unusually strong (much greater than 0.4) and the rate of missing values in $Y_j$ was high (50% or more). With weaker correlations and lower rates of missing values, biases in parameters pertaining to $Y_j$ and its relationships to other variables were barely noticeable when $Z_k$ was excluded.

In another set of simulations, CSK show that including a Type B variable can substantially increase precision under MAR conditions if its correlation with the outcomes is strong (say, 0.9). This situation could easily arise in longitudinal studies, because repeated measures on individuals over time are highly correlated; responses at one occasion may be very useful for imputing missing responses at another. Under MNAR conditions, where the probability that $Y_j$ is missing depends directly on $Y_j$, an auxiliary variable $Z_k$ that is highly correlated with $Y_j$ will tend to reduce bias but will not eliminate it entirely.

Finally, CSK show that the costs of unnecessarily including Type C variables in the imputation procedure tend to be minimal. Overall, there are potentially important gains and small risks associated with auxiliary variables in MI. Therefore, we suggest that users of MI be quite liberal in deciding whether to include extra variables in the imputation procedure, even when these variables are not likely to appear in subsequent analyses.

Agencies that collect and distribute data to the public for secondary analyses often have access to extra information (e.g., finely detailed geographic identifiers) that is not released for reasons of confidentiality, but which may be highly predictive of missing values. If the agency uses this additional information in an imputation procedure, the imputed data files released to the public will produce more efficient estimates than are possible by any missing-data procedure the user can implement himself. This is an example of the phenomenon known as superefficiency (MENG, 1994; RUBIN, 1996).

LITTLE and YAO (1996) describe an innovative use of MI with auxiliary variables for reducing bias in the estimation of an intent-to-treat (ITT) effect. In many randomized experiments, some subjects do not adhere to the treatment regimen to which they have been assigned. If this noncompliance is also accompanied by dropout, then the traditional ITT analysis (which compares subjects who completed the study on the basis of the group to which they were assigned, ignoring the treatment actually received) may produce a biased estimate of the actual ITT effect. The reason for this bias is that the rates of dropout are often related to the treatment actually received. In these settings, the treatment actually received provides useful information for imputing the missing values, but after imputation it is the treatment assigned (not treatment received) that is used to estimate the ITT effect.
In fairness, one should note that in principle it is also possible to include auxiliary variables in a likelihood-based procedure, so that the user of ML may derive the same benefits from them as the user of MI. In practice, however, this may be tricky. Consider a longitudinal study with dropout. Suppose that we have access to a covariate which, although it is not of substantive interest, may be highly predictive of dropping out. For example, at each occasion, we might ask, "How likely is it that you will remain in this study for at least one more assessment?" (1 = unlikely, 2 = somewhat likely, 3 = very likely). Incorporating this variable into a longitudinal analysis might convert a seriously MNAR situation to one that is effectively MAR. However, if we simply include this variable in our longitudinal regression model as a time-varying covariate, we have substantially changed the meaning and interpretation of the model, making it more difficult to estimate the marginal treatment effects of interest. The correct way to add this variable to the model is to make it an additional response, jointly modeling this variable and the outcome of interest as a bivariate function of treatment group and time. Such models tend to be difficult to fit with current software for longitudinal analysis.

Note that the advice given about auxiliary variables (that including them in an imputation procedure has potential benefits but few risks) does not apply to variables that are functions or summaries of the missingness $R$. For example, suppose that $Y_1, \ldots, Y_p$ are repeated measures in a longitudinal study with dropout, and we notice that subjects who drop out early tend to have different response trajectories from those who drop out later or not at all. It might seem reasonable to create a variable $Z$ equal to the number of occasions for which the subject remains in the study $(1, 2, \ldots, p)$ and include it in the imputation model. Because this new variable is a function of $R$, including it in an MI procedure will produce imputations that are consistent with a particular MNAR model. Without $Z$, an MI procedure that assumes MAR will produce correct results under any MAR situation. But once $Z$ has been included, the same procedure may produce biased results under many MAR mechanisms. In fact, the results may be nonsensical because, unless care is taken, these summaries of $R$ may introduce parameters into the imputation model that cannot be identified from the observed data. (In this particular example, the correlation between $Z$ and $Y_p$ is not identified, because $Y_p$ is observed if and only if $Z = p$.) MNAR models are an important and potentially useful application for MI, but the user should be fully aware of the implications of these models and the special challenges they pose.

5 Imputation under MNAR models

When serious departures from MAR are suspected, it may be necessary to investigate alternative ways to jointly model the data and the missingness. Selection models specify a marginal distribution for the complete data, $P(Y_{com})$, and a conditional distribution for the missingness, $P(R \mid Y_{com})$. Selection models have intuitive appeal, but their results can be highly sensitive to unverifiable assumptions about the shape of the complete-data population (LITTLE and RUBIN, 1987, Chapter 11; KENWARD, 1998). Pattern-mixture models posit a marginal distribution for the patterns of missingness, $P(R)$, followed by a conditional model for the data distribution within patterns, $P(Y_{com} \mid R)$ (LITTLE, 1993). In pattern-mixture models, some unverifiable assumptions must inevitably be made to identify all the parameters in $P(Y_{com} \mid R)$.
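Both identification problems raised above (the dropout-time variable $Z$ and the within-pattern distributions of a pattern-mixture model) can be seen in a toy example with $p = 2$ occasions. The data, the function name `marginal_mean_y2` and the shift parameter `delta` are hypothetical; in particular, setting the dropouts' mean to the completers' mean plus `delta` is just one simple identifying restriction chosen for illustration, not one proposed in the paper.

```python
from statistics import fmean, pvariance

# Hypothetical (Y1, Y2) pairs; None marks a subject who dropped out before occasion 2.
data = [(1.0, 1.2), (0.5, 0.9), (2.0, None), (1.5, 1.4), (2.5, None), (0.8, 1.0)]

# Z = number of occasions for which the subject remained in the study.
z = [1 if y2 is None else 2 for _, y2 in data]

# Among subjects with Y2 observed, Z is constant, so the sample correlation
# between Z and Y2 cannot be computed from the observed data.
z_when_y2_observed = [zi for zi, (_, y2) in zip(z, data) if y2 is not None]
print("variance of Z where Y2 is observed:", pvariance(z_when_y2_observed))


def marginal_mean_y2(data, delta):
    """Pattern-mixture estimate of E(Y2) = sum over patterns of P(R) E(Y2 | R).

    The dropouts' mean of Y2 is not estimable; here it is set, by assumption,
    to the completers' mean plus a user-supplied shift `delta`.
    """
    observed = [y2 for _, y2 in data if y2 is not None]
    p_complete = len(observed) / len(data)       # estimate of P(Z = 2)
    mean_completers = fmean(observed)            # estimable from the data
    mean_dropouts = mean_completers + delta      # supplied, not estimated
    return p_complete * mean_completers + (1 - p_complete) * mean_dropouts


for delta in (0.0, 0.3, 0.6):
    print(f"delta = {delta}: estimated E(Y2) = {marginal_mean_y2(data, delta):.3f}")
```

Varying `delta` changes the marginal estimate while leaving the fit to the observed data untouched, which is why such restrictions cannot be checked empirically.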
These assumptions are no less onerous than those made by selection models, but in one sense they are more honest because the parameters that cannot be estimated from the observed data are made explicit. Pattern-mixture models for longitudinal data with dropout are reviewed by LITTLE (1995) and VERBEKE and MOLENBERGHS (2000).

One mildly unfortunate aspect of pattern-mixture models is that the effects of scientific interest are usually parameters of the marginal distribution of the complete data, not its conditional distribution within patterns. Therefore, when fitting these models, one must somehow manipulate the estimates from $P(R)$ and $P(Y_{com} \mid R)$ to obtain the desired estimates for $P(Y_{com}) = \sum_{R} P(R)\, P(Y_{com} \mid R)$. Alternatively, this process of averaging the results across patterns may be carried out by MI. Suppose that we generate imputations $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(m)}$ under a pattern-mixture model. Once these imputations exist, we may forget about $R$ and use the imputed datasets to estimate the parameters of $P(Y_{com})$ directly. In many cases, the model applied to $Y_{com}$ in this analysis phase may deviate from the implied model $P(Y_{com}) = \sum_{R} P(R)\, P(Y_{com} \mid R)$ of the imputation phase, resulting in a mild form of uncongeniality. In practice, of course, neither of these models will be true, and the observed data probably contain little information that would allow us to distinguish one from the other.

MI may also be used with selection models. In fact, any manner of specifying a joint model for $Y_{com}$ and $R$ may be used, as long as it is possible to sample from the posterior predictive distribution $P(Y_{mis} \mid Y_{obs}, R)$ under the model. Once $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(m)}$ have been created, further modeling of the missingness is unnecessary, and the analysis of the imputed datasets may proceed as usual. For this reason, MI seems to be an ideal device for sensitivity analyses. If imputations are generated under a variety of alternative models, the imputed datasets may be analyzed in the same manner and compared directly, without having to worry about the fact that the form of $P(Y_{com})$ and the meaning of its parameters may vary from one imputation model to the next.

MI under a variety of MNAR models is possible with current software. Any of the multivariate models implemented in the S+MissingData library (SCHIMERT et al., 2001), which apply to continuous and categorical variables, may be jointly applied to a set of outcomes $Y_1, \ldots, Y_p$ and to summaries of the missingness $R$. In most cases, some aspect of this joint distribution will not be identifiable from the observed data, and successful use of the routines may require omitting certain inter-variable relationships, use of an informative prior distribution, or both.

Example. SCHAFER (1997) previously analyzed incomplete data from the National Crime Survey conducted by the U.S. Bureau of the Census. Residents in a sample of housing units were interviewed to determine whether they had been victimized by crime in the preceding half-year. Six months later, the same units were visited again and the residents were asked the same question.
Missing values occurred at both occasions. The data are shown in Table 1. Let $Y_j$ denote victimization status (1 = no, 2 = yes) at occasions $j = 1, 2$. SCHAFER (1997, Section 7.3.4) generated $m = 10$ imputations of the missing values of $Y_1$ and $Y_2$ under the MAR assumption and used them to test hypotheses of independence and symmetry in the $Y_1 \times Y_2$ table.

Table 1. Victimization status from the National Crime Survey. Rows: Victimized in first period? (No, Yes, Missing). Columns: Victimized in second period? (No, Yes, Missing).

With S+MissingData, it is also quite easy to generate imputations under MNAR models. Let $R_j$ denote the missingness indicator for $Y_j$ (1 = response, 2 = missing value). In principle, any loglinear model may be applied to the four-way table $Y_1 \times Y_2 \times R_1 \times R_2$, but many of these models will be under-identified. Models that contain associations between $(Y_1, Y_2)$ and $(R_1, R_2)$ correspond to various hypotheses of MNAR. Perhaps the simplest MNAR model worth considering is the one containing the associations $Y_1Y_2$, $R_1R_2$, $Y_1R_1$ and $Y_2R_2$, which allows missingness at each occasion to be influenced by the response only at that occasion. That model has 8 free parameters, the maximum that can be identified from the nine observed frequencies reported in Table 1. As noted by FAY (1986), even though this model appears to be saturated, ML estimates under this model will not necessarily reproduce the observed frequencies in Table 1. Models more complicated than this one will not have unique ML estimates. It is still possible to produce MI's under more complicated models, provided that a proper prior distribution is applied to the parameters. It is probably unwise to do so, however, unless the prior distribution truly does reflect the analyst's a priori state of knowledge.

Using S+MissingData, I generated $m = 10$ imputations under the model $Y_1Y_2$, $R_1R_2$, $Y_1R_1$, $Y_2R_2$ and then collapsed the imputed tables over $R_1$ and $R_2$. Using standard techniques, I then calculated estimates and standard errors for the log of the odds ratio $\alpha = \pi_{11}\pi_{22}\pi_{12}^{-1}\pi_{21}^{-1}$ and the difference $\delta = \pi_{12} - \pi_{21}$ from each table, where $\pi_{ij} = P(Y_1 = i, Y_2 = j)$ (AGRESTI, 1990). The results are shown in Table 2 below. Combining the results by RUBIN's (1987) rules gives an estimate of 3.79 for the odds ratio $\alpha$ with a 95% interval of (2.16, 6.64), and an estimate of $-0.058$ for the difference $\delta$ with a 95% interval of $(-0.205, 0.089)$. The estimates are close to those reported by SCHAFER (1997) under the MAR assumption (3.60 for $\alpha$, $-0.039$ for $\delta$), but the new intervals are 15% wider for $\alpha$ and 270% wider for $\delta$.
Assuming MAR, the evidence for a shift in victimization rates from one occasion to the next ($\delta \neq 0$) was fairly strong ($p = 0.06$), but under the MNAR model the evidence has disappeared ($p = 0.40$). The S-PLUS code for creating these imputations and performing the post-imputation analyses is provided in the Appendix.

Table 2. Multiple imputations of $(Y_1, Y_2)$ under the MNAR model, with estimates and standard errors for the log-odds ratio $\alpha$ and the difference $\delta$. Rows: imputed counts for (no, no), (no, yes), (yes, no), (yes, yes); $\log\hat{\alpha}$ and its SE; $\hat{\delta}$ and its SE. [Table body not recoverable from the source; the legible $\hat{\delta}$ entries are $-0.020$, $-0.037$, $-0.063$, $-0.144$, $-0.142$, $-0.060$, $-0.104$.]
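The combining step by RUBIN's (1987) rules can be sketched as below. This is a minimal Python illustration under stated assumptions: the function name `rubin_pool` is hypothetical, the per-imputation estimates and standard errors are made-up placeholders (not the entries of Table 2), and the interval uses the normal quantile 1.96, which mirrors the Appendix's choice of infinite complete-data degrees of freedom.

```python
import math

def rubin_pool(estimates, std_errors):
    """Combine m point estimates and SEs by Rubin's (1987) rules.

    Returns the pooled estimate, its standard error, and the
    degrees of freedom of the t reference distribution.
    """
    m = len(estimates)
    q_bar = sum(estimates) / m
    u_bar = sum(se ** 2 for se in std_errors) / m           # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    total_var = u_bar + (1 + 1 / m) * b
    if b > 0:
        df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2
    else:
        df = math.inf  # no between-imputation variation
    return q_bar, math.sqrt(total_var), df

# Hypothetical log-odds-ratio estimates and SEs from m = 5 imputed tables.
log_or = [1.30, 1.41, 1.25, 1.38, 1.33]
se = [0.27, 0.26, 0.28, 0.27, 0.27]

est, pooled_se, df = rubin_pool(log_or, se)
z = 1.96  # normal quantile; an assumption, see lead-in
print("odds ratio:", math.exp(est))
print("95% interval:", (math.exp(est - z * pooled_se), math.exp(est + z * pooled_se)))
```

Pooling on the log scale and exponentiating afterwards, as done here, matches the order of operations in the Appendix.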

What are we to make of these results? I am not entirely sure. On the one hand, we may be tempted to impute under MNAR models routinely because it is not difficult to do so. On the other hand, I suspect that these models may, in many cases, be unnecessarily complex. In this example, I tested the joint significance of the $Y_1R_1$ and $Y_2R_2$ associations by a likelihood-ratio test and found no evidence for them ($p = 0.98$). By nature, the data can provide almost no information about these associations; we can estimate them only in a very indirect way, by assuming that other associations (e.g., $Y_1R_2$) do not exist. What can we possibly achieve by adding parameters to a model that are barely identified, except to increase the width of our interval estimates by an arbitrary amount?

This joint test for $Y_1R_1$ and $Y_2R_2$ is correctly interpreted as a test for MCAR, not a test for MAR. Nevertheless, omitting $Y_1R_1$ and $Y_2R_2$ results in a procedure that is valid under any MAR missingness model (a rather broad class), whereas including them produces results that are valid only under this particular MNAR model (a rather narrow class). It seems paradoxical that the inclusion of additional parameters, about which the data contain little evidence, produces a model that may in some sense be more restrictive than the model that omits them. Rather than rely heavily on poorly estimated MNAR models, I would prefer to examine auxiliary variables that may be related to missingness in $Y_1$ and $Y_2$, and include them in a richer imputation model under the assumption of MAR.

6 Analyses by less-than-fully parametric methods

In their quest for robustness, statisticians are fond of relaxing assumptions. In missing-data problems, however, there is an unpleasant aspect to this: once we dispense with a true likelihood function for $Y_{com}$, we must usually bid farewell to inferential procedures that are valid under general MAR conditions.
Incomplete-data procedures derived from frequentist arguments are typically valid only under MCAR (RUBIN, 1976) or they require strong models for the missingness. A good example of the latter is weighting. If $X_1, \ldots, X_n$ is a random sample from a density $f(x)$, and the value $X_i = x$ becomes missing with probability $1 - g(x)$, then the observed data will be sampled from $f^*(x) \propto f(x)g(x)$, and consistent nonparametric estimates of moments of $f$ are available by applying weights $w_i \propto g^{-1}(X_i)$ to the observed $X_i$'s. The problem, of course, is that $g(x)$ is unknown and cannot be estimated from the observed data without strong assumptions or powerful auxiliary information. Therefore, statisticians may feel the need to choose between (a) fully parametric analyses that make strong assumptions about $P(Y_{com})$, and (b) semi- or non-parametric analyses that weaken the assumptions about $P(Y_{com})$ but make strong assumptions about the distribution of missingness.

With MI, we can (almost) have the best of both worlds. If we create imputations $Y_{mis}^{(1)}, \ldots, Y_{mis}^{(m)}$ from a predictive distribution $P(Y_{mis} \mid Y_{obs})$ derived from a fully parametric model, and then analyze the imputed datasets by a less-than-fully parametric method, we may be able to achieve better performance and greater robustness than is possible with any procedure that handles missing data and estimation in a single step. Any erroneous parametric assumptions in the imputation phase will effectively be applied only to the missing part of the dataset, not to the observed data. This type of uncongeniality was anticipated by MENG (1994), whose results suggest that these procedures tend to perform well as long as the imputation model is not grossly misspecified. MENG's (1994) theorems encompass certain types of estimation procedures but not others. For example, they do apply to certain techniques of design-based estimation commonly applied to sample surveys, but apparently they do not apply to semiparametric regression using generalized estimating equations (GEE) (MENG, 1999). Of course, although we may find it difficult to prove good performance analytically for the latter, that does not imply that good performance will not be seen in practice. Experience suggests that Bayesian MI does interact well with a variety of semi- and nonparametric estimation procedures.

Example. Consider a classic cluster sample, where $y_i$ is the observed total and $m_i$ is the sample size in cluster $i$, $i = 1, \ldots, n$. The usual estimate of the population mean is $\bar{y} = \sum_i y_i / \sum_i m_i$, with $\hat{V}(\bar{y}) = n \sum_i (y_i - m_i \bar{y})^2 / [(n-1)(\sum_i m_i)^2]$ as its estimated variance. As noted by SKINNER in his discussion of MENG (1994), these estimates appear to have no interpretation from a parametric likelihood or Bayesian standpoint under any population model. I examined the performance of Bayesian MI in conjunction with $\bar{y}$ and $\hat{V}(\bar{y})$ under a two-stage normal population, $\mu_i \sim N(\mu, \tau)$, $y_{ij} \sim N(\mu_i, \sigma^2)$, $i = 1, \ldots, n$, $j = 1, \ldots, m_i$.
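The cluster-sample estimator $\bar{y}$ and its variance estimate $\hat{V}(\bar{y})$ defined above can be computed from cluster totals and cluster sizes alone. The Python sketch below is illustrative: the function name `cluster_mean` and the toy data are hypothetical and are not taken from the simulation.

```python
def cluster_mean(totals, sizes):
    """Ratio estimate of the population mean and its estimated variance.

    totals[i] is the observed total y_i and sizes[i] the sample size m_i
    in cluster i. Implements
        ybar = sum(y_i) / sum(m_i)
        V    = n * sum((y_i - m_i * ybar)^2) / ((n - 1) * sum(m_i)^2)
    """
    n = len(totals)
    if n < 2 or n != len(sizes):
        raise ValueError("need at least two clusters with matching sizes")
    m_total = sum(sizes)
    ybar = sum(totals) / m_total
    ss = sum((y - m * ybar) ** 2 for y, m in zip(totals, sizes))
    var = n * ss / ((n - 1) * m_total ** 2)
    return ybar, var

# Toy data: four clusters of unequal size (hypothetical values).
totals = [212.0, 388.5, 190.2, 421.7]
sizes = [20, 40, 20, 40]
ybar, var = cluster_mean(totals, sizes)
print(ybar, var ** 0.5)
```

Applied to each imputed dataset in turn, the function would yield the per-imputation estimates and variances that are then combined by RUBIN's (1987) rules.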
I performed a simulation with $\mu = 10$, $\tau = 5$, $\sigma^2 = 20$, and $m_i = 20$ or 40 with equal probability, which produces a rather large design effect (deff $\approx$ 7.5). Drawing a sample of $n = 50$ clusters, I imposed missing values on the observations within each cluster in an MCAR fashion at a rate of 25%. I then generated five imputations of the missing values by a Bayesian procedure using a two-stage normal model and weakly informative prior distributions for the variance components. Finally, I calculated $\bar{y}$ and $\hat{V}(\bar{y})$ from each of the five imputed datasets and combined them by RUBIN's (1987) rules to obtain a single estimate and 95% confidence interval for the population mean. For comparison, I also calculated $\bar{y}$ and $\hat{V}(\bar{y})$ from the complete data (before missing values were imposed) and from the observed data alone (treating the incomplete data as a sample of clusters of smaller sizes). The entire procedure was repeated 1,000 times.

Over the 1,000 repetitions, the average point estimates from the complete data, from the observed data alone, and from MI were all very close to the true mean $\mu = 10$. Interestingly, the inflation of variance of the observed-data and MI estimates relative to the complete-data estimate corresponds to rates of missing information of 3% and 5%, respectively, even though the actual observations are missing at a rate of 25%. Because the design effect in this example is strong, reducing the per-cluster sample sizes by 25% produces only a slight loss of information. Therefore, a high-quality missing-data procedure should produce interval estimates that are only slightly wider than those obtained from complete data.

How did the intervals fare? With complete data, 938 (94%) of the simulated intervals covered the true mean $\mu = 10$. Using the observed data alone, 993 of the intervals covered $\mu = 10$, and the average width of the intervals was 1.87, an increase of 36% over the complete-data width. Treating the observed data in each cluster as if it were complete data from a smaller cluster produces intervals that are unnecessarily wide and inefficient. This result seems a bit odd, given the very simple nature of the missingness distribution. With MI, however, the performance of the intervals was outstanding; 948 of them covered the true mean, and their average width was only slightly greater than those from complete data (1.40).

7 Discussion and future directions

In the early years of MI, many thought that imputing missing values under one model and analyzing the imputed datasets under another model (or no model at all) was ludicrous and potentially harmful (FAY, 1992). We have seen, however, that in many situations of practical interest, this strategy can be quite beneficial. Many important questions still need to be addressed in the area of MNAR models. A quick look at recent issues in biostatistics journals reveals a surprisingly large number of articles on selection and pattern-mixture models, particularly for dropout in longitudinal studies. Should we fit these models simply because we can? I hope that, in the future, we will see published analyses of data under MNAR models that answer more questions than they raise.
I hope to see detailed comparisons of the performance of various MAR and MNAR models in realistic scenarios where the true nature of the complete-data population and the missingness are unknown. The performance of Bayesian MI in conjunction with popular semi- or non-parametric analyses also needs further study. For example, consider the current procedures for design-based variance estimation from stratified multistage cluster samples as implemented in SUDAAN (RESEARCH TRIANGLE INSTITUTE, 1998). To what extent do the complexities of the sample design need to be incorporated into the imputation model? Do we need to account for each level of clustering, or will it be sufficient to incorporate only the highest level, the ultimate clusters that drive the variance estimation procedures?

Appendix

#########################################
# S-PLUS code for multiple imputation of victimization status from
# the National Crime Survey using an MNAR model
#########################################

# Enter the data
Y1 <- rep(c("Crime-Free", "Victim", "NA"), 3)
Y2 <- rep(c("Crime-Free", "Victim", "NA"), each=3)
count <- c(392, 76, 31, 55, 38, 7, 33, 9, 115)
R1 <- rep(c("Observed", "Observed", "Missing"), 3)
R2 <- rep(c("Observed", "Observed", "Missing"), each=3)
Crime <- data.frame(Y1=factor(Y1), Y2=factor(Y2), R1=factor(R1),
    R2=factor(R2), count=count)

# Attach the S+MissingData library
library(missing)

# Fit the loglinear model by maximum likelihood
Crime.mle <- mdLoglin(Crime,
    margins = count ~ Y1:Y2 + Y1:R1 + Y2:R2 + R1:R2, na.proc="em")

# Generate m=10 imputations using the default noninformative prior
set.seed(184)
Crime.imp <- impLoglin(Crime,
    margins = count ~ Y1:Y2 + Y1:R1 + Y2:R2 + R1:R2,
    prior = 0.5, nimpute = 10, control=list(niter=1000))

# Reduce the imputed data to Y1 x Y2 tables, collapsing over R's
Crime.imp.Y1xY2 <- miEval(oldUnclass(crosstabs(frequency ~ Y1 + Y2,
    data=Crime.imp)), vnames="Crime.imp")

# Compute estimates, SE's for log-odds ratios and combine them
# by Rubin's (1987) rules
logodds.est <- miEval(log(Crime.imp.Y1xY2[1,1]*Crime.imp.Y1xY2[2,2]/
    (Crime.imp.Y1xY2[2,1]*Crime.imp.Y1xY2[1,2])))
logodds.se <- miEval(sqrt(sum(1/Crime.imp.Y1xY2)))
logodds.result <- miMeanSE(logodds.est, logodds.se, df=Inf)
print(exp(logodds.result$est))
print(exp(logodds.result$est +
    c(-1,1)*qt(.975,logodds.result$df)*logodds.result$std.err))

# Do the same for the difference in proportions,
# pi12 - pi21 = P(Y2=yes) - P(Y1=yes)
diff.est <- miEval((Crime.imp.Y1xY2[1,2]-Crime.imp.Y1xY2[2,1])/
    sum(Crime.imp.Y1xY2))
diff.se <- miEval(sqrt((1/sum(Crime.imp.Y1xY2))*
    ((Crime.imp.Y1xY2[1,2]/sum(Crime.imp.Y1xY2))*
    (1-Crime.imp.Y1xY2[1,2]/sum(Crime.imp.Y1xY2)) +
    (Crime.imp.Y1xY2[2,1]/sum(Crime.imp.Y1xY2))*
    (1-Crime.imp.Y1xY2[2,1]/sum(Crime.imp.Y1xY2)) +
    2*(Crime.imp.Y1xY2[1,2]/sum(Crime.imp.Y1xY2))*
    (Crime.imp.Y1xY2[2,1]/sum(Crime.imp.Y1xY2)))))
diff.result <- miMeanSE(diff.est, diff.se, df=Inf)
print(diff.result$est)
print(diff.result$est +
    c(-1,1)*qt(.975,diff.result$df)*diff.result$std.err)

# Create Table 2 for displaying the imputations and results
Table2 <- matrix(NA, 8, 10)
dimnames(Table2) <- list(c("no, no ","no, yes","yes, no ","yes, yes",
    "logodds","se","diff","se"), format(1:10))
for(i in 1:10) {
    Table2[1:4,i] <- as.vector(t(Crime.imp.Y1xY2[[i]]))
    Table2[5,i] <- logodds.est[[i]]
    Table2[6,i] <- logodds.se[[i]]
    Table2[7,i] <- diff.est[[i]]
    Table2[8,i] <- diff.se[[i]]
}

# Test the joint significance of the Y1*R1 and Y2*R2 associations
Crime.mle2 <- mdLoglin(Crime, margins = count ~ Y1:Y2 + R1:R2,
    na.proc = "em")
lrtest <- 2*(Crime.mle$algorithm$likelihood -
    Crime.mle2$algorithm$likelihood)
print(1-pchisq(lrtest,2))

References

Agresti, A. (1990), Categorical data analysis, John Wiley and Sons, New York.
Arbuckle, J. L. and W. Wothke (1999), AMOS 4.0 User's Guide, Smallwaters, Inc., Chicago.
Collins, L. M., B. P. Flaherty, S. L. Hyatt and J. L. Schafer (1999), WinLTA User's Guide, Version 2.0, The Methodology Center, The Pennsylvania State University, University Park, PA.
Collins, L. M., J. L. Schafer and C. M. Kam (2001), A comparison of inclusive and restrictive strategies in modern missing-data procedures, Psychological Methods 6.
Fay, R. E. (1986), Causal models for patterns of nonresponse, Journal of the American Statistical Association 81.
Fay, R. E. (1992), When are inferences from multiple imputation valid? Proceedings of the Survey Research Methods Section of the American Statistical Association.
Gelman, A., D. B. Rubin, J. Carlin and H. Stern (1995), Bayesian data analysis, Chapman and Hall, London.
Jöreskog, K. G. and D. Sörbom (2001), LISREL 8.5, Scientific Software International, Inc., Chicago.
Kenward, M. G. (1998), Selection models for repeated measurements for nonrandom dropout: an illustration of sensitivity, Statistics in Medicine 17.
King, G., J. Honaker, A. Joseph and K. Scheve (2001), Analyzing incomplete political science data: an alternative algorithm for multiple imputation, American Political Science Review 95.
Little, R. J. A. (1993), Pattern-mixture models for multivariate incomplete data, Journal of the American Statistical Association 88.
Little, R. J. A. (1995), Modeling the dropout mechanism in repeated-measures studies, Journal of the American Statistical Association 90.
Little, R. J. A. and D. B. Rubin (1987), Statistical analysis with missing data, John Wiley and Sons, New York.
Meng, X. L. (1994), Multiple-imputation inferences with uncongenial sources of input (with discussion), Statistical Science 10.
Meng, X. L. (1999), A congenial overview and investigation of imputation inferences under uncongeniality. Paper presented at International Conference on Survey Nonresponse, Portland, October.
Muthén, L. K. and B. O. Muthén (1998), Mplus User's Guide, Muthén and Muthén, Los Angeles.
Neale, M. C., S. M. Boker, G. Xie and H. H. Maes (1999), Mx: Statistical Modeling (5th ed.), Department of Psychiatry, Virginia Commonwealth University, Richmond, VA.
Raghunathan, T. E., P. W. Solenberger and J. Van Hoewyk (2000), IVEware: Imputation and Variance Estimation Software, Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI.
Research Triangle Institute (1998), SUDAAN: Software for the Statistical Analysis of Correlated Data, Version 7, Research Triangle Institute, Research Triangle Park, NC.
Robins, J. M., A. Rotnitzky and D. O. Scharfstein (1998), Semiparametric regression for repeated outcomes with non-ignorable non-response, Journal of the American Statistical Association 93.
Rubin, D. B. (1976), Inference and missing data, Biometrika 63.
Rubin, D. B. (1987), Multiple imputation for nonresponse in surveys, John Wiley and Sons, New York.
Rubin, D. B. (1996), Multiple imputation after 18+ years, Journal of the American Statistical Association 91.
Schafer, J. L. (1997), Analysis of incomplete multivariate data, Chapman and Hall, London.
Schafer, J. L. (1999), NORM: Multiple imputation of incomplete multivariate data under a normal model, Software for Windows, Department of Statistics, The Pennsylvania State University, University Park, PA.
Schimert, J., J. L. Schafer, T. Hesterberg, C. Fraley and D. Clarkson (2001), Analyzing missing values in S-PLUS, Insightful Corporation, Seattle, WA.
Statistical Solutions Inc. (2002), SOLAS for missing data analysis, Version 3, Statistical Solutions, Cork, Ireland.
Van Buuren, S. and C. G. M. Oudshoorn (1999), Flexible multivariate imputation by MICE, TNO/VGZ/PG, TNO Prevention and Health, Leiden.
Verbeke, G. and G. Molenberghs (2000), Linear mixed models for longitudinal data, Springer-Verlag, New York.
Yuan, Y. C. (2000), Multiple imputation for missing data: concepts and new development, Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference, Paper 267, SAS Institute, Cary, NC.
Zeger, S. L., K. Y. Liang and P. S. Albert (1988), Models for longitudinal data: a generalized estimating equation approach, Biometrics 44.

Received: February 2002, Revised: October 2002.


More information

Analysis of Longitudinal Data with Missing Values.

Analysis of Longitudinal Data with Missing Values. Analysis of Longitudinal Data with Missing Values. Methods and Applications in Medical Statistics. Ingrid Garli Dragset Master of Science in Physics and Mathematics Submission date: June 2009 Supervisor:

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Missing data in randomized controlled trials (RCTs) can

Missing data in randomized controlled trials (RCTs) can EVALUATION TECHNICAL ASSISTANCE BRIEF for OAH & ACYF Teenage Pregnancy Prevention Grantees May 2013 Brief 3 Coping with Missing Data in Randomized Controlled Trials Missing data in randomized controlled

More information

Sampling Error Estimation in Design-Based Analysis of the PSID Data

Sampling Error Estimation in Design-Based Analysis of the PSID Data Technical Series Paper #11-05 Sampling Error Estimation in Design-Based Analysis of the PSID Data Steven G. Heeringa, Patricia A. Berglund, Azam Khan Survey Research Center, Institute for Social Research

More information

Overview of Factor Analysis

Overview of Factor Analysis Overview of Factor Analysis Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone: (205) 348-4431 Fax: (205) 348-8648 August 1,

More information

Modern Methods for Missing Data

Modern Methods for Missing Data Modern Methods for Missing Data Paul D. Allison, Ph.D. Statistical Horizons LLC www.statisticalhorizons.com 1 Introduction Missing data problems are nearly universal in statistical practice. Last 25 years

More information

MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS)

MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS) MISSING DATA IMPUTATION IN CARDIAC DATA SET (SURVIVAL PROGNOSIS) R.KAVITHA KUMAR Department of Computer Science and Engineering Pondicherry Engineering College, Pudhucherry, India DR. R.M.CHADRASEKAR Professor,

More information

A Review of Methods for Missing Data

A Review of Methods for Missing Data Educational Research and Evaluation 1380-3611/01/0704-353$16.00 2001, Vol. 7, No. 4, pp. 353±383 # Swets & Zeitlinger A Review of Methods for Missing Data Therese D. Pigott Loyola University Chicago, Wilmette,

More information

CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE 1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 3 Problems with small populations 8 II. Why Random Sampling is Important 9 A myth,

More information

Dealing with Missing Data

Dealing with Missing Data Dealing with Missing Data Roch Giorgi email: roch.giorgi@univ-amu.fr UMR 912 SESSTIM, Aix Marseille Université / INSERM / IRD, Marseille, France BioSTIC, APHM, Hôpital Timone, Marseille, France January

More information

Comparison of Imputation Methods in the Survey of Income and Program Participation

Comparison of Imputation Methods in the Survey of Income and Program Participation Comparison of Imputation Methods in the Survey of Income and Program Participation Sarah McMillan U.S. Census Bureau, 4600 Silver Hill Rd, Washington, DC 20233 Any views expressed are those of the author

More information

Regression Modeling Strategies

Regression Modeling Strategies Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions

More information

Approaches for Analyzing Survey Data: a Discussion

Approaches for Analyzing Survey Data: a Discussion Approaches for Analyzing Survey Data: a Discussion David Binder 1, Georgia Roberts 1 Statistics Canada 1 Abstract In recent years, an increasing number of researchers have been able to access survey microdata

More information

Fairfield Public Schools

Fairfield Public Schools Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

More information

Chapter 5: Analysis of The National Education Longitudinal Study (NELS:88)

Chapter 5: Analysis of The National Education Longitudinal Study (NELS:88) Chapter 5: Analysis of The National Education Longitudinal Study (NELS:88) Introduction The National Educational Longitudinal Survey (NELS:88) followed students from 8 th grade in 1988 to 10 th grade in

More information

Visualization of missing values using the R-package VIM

Visualization of missing values using the R-package VIM Institut f. Statistik u. Wahrscheinlichkeitstheorie 040 Wien, Wiedner Hauptstr. 8-0/07 AUSTRIA http://www.statistik.tuwien.ac.at Visualization of missing values using the R-package VIM M. Templ and P.

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out

Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out Challenges in Longitudinal Data Analysis: Baseline Adjustment, Missing Data, and Drop-out Sandra Taylor, Ph.D. IDDRC BBRD Core 23 April 2014 Objectives Baseline Adjustment Introduce approaches Guidance

More information

MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. Philip Kostov and Seamus McErlean

Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. Philip Kostov and Seamus McErlean Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. by Philip Kostov and Seamus McErlean Working Paper, Agricultural and Food Economics, Queen

More information

Life Table Analysis using Weighted Survey Data

Life Table Analysis using Weighted Survey Data Life Table Analysis using Weighted Survey Data James G. Booth and Thomas A. Hirschl June 2005 Abstract Formulas for constructing valid pointwise confidence bands for survival distributions, estimated using

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

Sensitivity analysis of longitudinal binary data with non-monotone missing values

Sensitivity analysis of longitudinal binary data with non-monotone missing values Biostatistics (2004), 5, 4,pp. 531 544 doi: 10.1093/biostatistics/kxh006 Sensitivity analysis of longitudinal binary data with non-monotone missing values PASCAL MININI Laboratoire GlaxoSmithKline, UnitéMéthodologie

More information

Data Cleaning and Missing Data Analysis

Data Cleaning and Missing Data Analysis Data Cleaning and Missing Data Analysis Dan Merson vagabond@psu.edu India McHale imm120@psu.edu April 13, 2010 Overview Introduction to SACS What do we mean by Data Cleaning and why do we do it? The SACS

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

Dr James Roger. GlaxoSmithKline & London School of Hygiene and Tropical Medicine.

Dr James Roger. GlaxoSmithKline & London School of Hygiene and Tropical Medicine. American Statistical Association Biopharm Section Monthly Webinar Series: Sensitivity analyses that address missing data issues in Longitudinal studies for regulatory submission. Dr James Roger. GlaxoSmithKline

More information

Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities

Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Oscar Kipersztok Mathematics and Computing Technology Phantom Works, The Boeing Company P.O.Box 3707, MC: 7L-44 Seattle, WA 98124

More information

Marketing Mix Modelling and Big Data P. M Cain

Marketing Mix Modelling and Big Data P. M Cain 1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored

More information

IS YOUR DATA WAREHOUSE SUCCESSFUL? Developing a Data Warehouse Process that responds to the needs of the Enterprise.

IS YOUR DATA WAREHOUSE SUCCESSFUL? Developing a Data Warehouse Process that responds to the needs of the Enterprise. IS YOUR DATA WAREHOUSE SUCCESSFUL? Developing a Data Warehouse Process that responds to the needs of the Enterprise. Peter R. Welbrock Smith-Hanley Consulting Group Philadelphia, PA ABSTRACT Developing

More information

CHOOSING APPROPRIATE METHODS FOR MISSING DATA IN MEDICAL RESEARCH: A DECISION ALGORITHM ON METHODS FOR MISSING DATA

CHOOSING APPROPRIATE METHODS FOR MISSING DATA IN MEDICAL RESEARCH: A DECISION ALGORITHM ON METHODS FOR MISSING DATA CHOOSING APPROPRIATE METHODS FOR MISSING DATA IN MEDICAL RESEARCH: A DECISION ALGORITHM ON METHODS FOR MISSING DATA Hatice UENAL Institute of Epidemiology and Medical Biometry, Ulm University, Germany

More information

Basics of Statistical Machine Learning

Basics of Statistical Machine Learning CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

More information

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics

More information

Automated Statistical Modeling for Data Mining David Stephenson 1

Automated Statistical Modeling for Data Mining David Stephenson 1 Automated Statistical Modeling for Data Mining David Stephenson 1 Abstract. We seek to bridge the gap between basic statistical data mining tools and advanced statistical analysis software that requires

More information

From the help desk: Bootstrapped standard errors

From the help desk: Bootstrapped standard errors The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution

More information

Bayesian Approaches to Handling Missing Data

Bayesian Approaches to Handling Missing Data Bayesian Approaches to Handling Missing Data Nicky Best and Alexina Mason BIAS Short Course, Jan 30, 2012 Lecture 1. Introduction to Missing Data Bayesian Missing Data Course (Lecture 1) Introduction to

More information

Fixed-Effect Versus Random-Effects Models

Fixed-Effect Versus Random-Effects Models CHAPTER 13 Fixed-Effect Versus Random-Effects Models Introduction Definition of a summary effect Estimating the summary effect Extreme effect size in a large study or a small study Confidence interval

More information

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling Jeff Wooldridge NBER Summer Institute, 2007 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of Groups and

More information

Inequality, Mobility and Income Distribution Comparisons

Inequality, Mobility and Income Distribution Comparisons Fiscal Studies (1997) vol. 18, no. 3, pp. 93 30 Inequality, Mobility and Income Distribution Comparisons JOHN CREEDY * Abstract his paper examines the relationship between the cross-sectional and lifetime

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

Missing data and net survival analysis Bernard Rachet

Missing data and net survival analysis Bernard Rachet Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics Warwick, 27-29 July 2015 Missing data and net survival analysis Bernard Rachet General context Population-based,

More information

Penalized regression: Introduction

Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

INTRODUCTORY STATISTICS

INTRODUCTORY STATISTICS INTRODUCTORY STATISTICS FIFTH EDITION Thomas H. Wonnacott University of Western Ontario Ronald J. Wonnacott University of Western Ontario WILEY JOHN WILEY & SONS New York Chichester Brisbane Toronto Singapore

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Example G Cost of construction of nuclear power plants

Example G Cost of construction of nuclear power plants 1 Example G Cost of construction of nuclear power plants Description of data Table G.1 gives data, reproduced by permission of the Rand Corporation, from a report (Mooz, 1978) on 32 light water reactor

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

Statistical Rules of Thumb

Statistical Rules of Thumb Statistical Rules of Thumb Second Edition Gerald van Belle University of Washington Department of Biostatistics and Department of Environmental and Occupational Health Sciences Seattle, WA WILEY AJOHN

More information

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)

More information

Nonrandomly Missing Data in Multiple Regression Analysis: An Empirical Comparison of Ten Missing Data Treatments

Nonrandomly Missing Data in Multiple Regression Analysis: An Empirical Comparison of Ten Missing Data Treatments Brockmeier, Kromrey, & Hogarty Nonrandomly Missing Data in Multiple Regression Analysis: An Empirical Comparison of Ten s Lantry L. Brockmeier Jeffrey D. Kromrey Kristine Y. Hogarty Florida A & M University

More information

Teaching Multivariate Analysis to Business-Major Students

Teaching Multivariate Analysis to Business-Major Students Teaching Multivariate Analysis to Business-Major Students Wing-Keung Wong and Teck-Wong Soon - Kent Ridge, Singapore 1. Introduction During the last two or three decades, multivariate statistical analysis

More information

Nonparametric adaptive age replacement with a one-cycle criterion

Nonparametric adaptive age replacement with a one-cycle criterion Nonparametric adaptive age replacement with a one-cycle criterion P. Coolen-Schrijner, F.P.A. Coolen Department of Mathematical Sciences University of Durham, Durham, DH1 3LE, UK e-mail: Pauline.Schrijner@durham.ac.uk

More information

Bootstrapping Big Data

Bootstrapping Big Data Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

More information

[This document contains corrections to a few typos that were found on the version available through the journal s web page]

[This document contains corrections to a few typos that were found on the version available through the journal s web page] Online supplement to Hayes, A. F., & Preacher, K. J. (2014). Statistical mediation analysis with a multicategorical independent variable. British Journal of Mathematical and Statistical Psychology, 67,

More information

Analysis of Bayesian Dynamic Linear Models

Analysis of Bayesian Dynamic Linear Models Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main

More information