Development and validation of a prediction model with missing predictor data: a practical approach


Journal of Clinical Epidemiology 63 (2010) 205–214

Development and validation of a prediction model with missing predictor data: a practical approach

Yvonne Vergouwe a,*, Patrick Royston b, Karel G.M. Moons a, Douglas G. Altman c

a Julius Center for Health Sciences and Primary Care, University Medical Centre Utrecht, Str 6.131, P.O. Box 85500, 3508 GA, Utrecht, The Netherlands
b MRC Clinical Trials Unit, London, United Kingdom
c Cancer Research UK/NHS Centre for Statistics in Medicine, Oxford, United Kingdom

Accepted 30 March 2009

Abstract

Objective: To illustrate the sequence of steps needed to develop and validate a clinical prediction model when missing predictor values have been multiply imputed.

Study Design and Setting: We used data from consecutive primary care patients suspected of deep venous thrombosis (DVT) to develop and validate a diagnostic model for the presence of DVT. Missing values were imputed 10 times with the MICE conditional imputation method. After the selection of predictors and transformations for continuous predictors according to three different methods, we estimated regression coefficients and performance measures.

Results: The three methods to select predictors and transformations of continuous predictors showed similar results. Rubin's rules could easily be applied to estimate regression coefficients and performance measures, once predictors and transformations were selected.

Conclusion: We provide a practical approach for model development and validation with multiply imputed data. © 2010 Elsevier Inc. All rights reserved.

Keywords: Missing values; Multiple imputation; Clinical prediction models; Model development; Model validation; Deep venous thrombosis

1. Introduction

Interest in multivariable prediction models for diagnostic and prognostic research has grown over the past decade.
This work is supported by the Netherlands Organization for Scientific Research Grant ZON-MW (Y. Vergouwe and K.G.M. Moons); UK Medical Research Council (P. Royston); and Cancer Research UK (D.G. Altman).
* Corresponding author. Tel.: + ; fax: + . E-mail address: y.vergouwe@umcutrecht.nl (Y. Vergouwe).

Prediction models enable physicians to explicitly convert combinations of multiple predictor values into an estimated absolute risk of disease presence (in the case of diagnosis) or of the occurrence of a disease-related event (in the case of prognosis). Prediction models are developed with data of patients from a development set, often using multivariable regression analysis. The models are subsequently validated in new, similar patients (a validation set) [1,2].

Missing observations are almost universally encountered in clinical data sets, no matter how strictly studies have been designed or how hard investigators try to prevent them. The easiest way to deal with missing values is to exclude all patients with a missing value on any of the considered variables. Such a complete case analysis may sacrifice useful information and may cause biased results [3,4]. Imputation based on observed patient characteristics (conditional imputation) has been advocated to deal with the missing values [3]. To take the uncertainty of the imputed values into account, missing values should be imputed multiple (m) times, for which several iterative algorithms are available. The resulting m completed data sets are each analyzed separately by standard methods, and the m results are combined into one final point estimate and variance, with the standard error equal to the square root of the variance [3,5]. Combining the m results is straightforward when a single analysis is considered. The m point estimates are averaged, and the m variances can be combined, taking the variability between the m data sets into account with a components-of-variance argument (Rubin's rules) [3].
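The pooling just described can be sketched in a few lines. This is an illustrative helper of our own (names are not from the paper), assuming the m analyses each yield a point estimate and a squared standard error:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their within-imputation variances
    with Rubin's rules (illustrative helper, not the authors' code)."""
    q = np.asarray(estimates, dtype=float)   # m point estimates
    u = np.asarray(variances, dtype=float)   # m squared standard errors
    m = len(q)
    q_bar = q.mean()                          # pooled point estimate
    u_bar = u.mean()                          # within-imputation variance
    b = q.var(ddof=1)                         # between-imputation variance
    t = u_bar + (1 + 1 / m) * b               # total variance
    return q_bar, t, np.sqrt(t)               # estimate, variance, SE
```

The same helper applies whether the quantity of interest is a regression coefficient or a performance measure, which is the principle used throughout the paper.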
The development of a prediction model follows a sequence of steps [6], including selection of predictors, selection of transformations for continuous predictors [7,8], and estimation of the regression coefficients. Hence, model development with multiply imputed data is not straightforward and is seldom illustrated. Here, we demonstrate the development and validation of a prediction model obtained with logistic

regression in the presence of multiply imputed data. In one model, continuous predictors are modeled with transformations if necessary; in a second model, the continuous predictors are dichotomized. Further, three different methods to select predictors and transformations are applied. We also encountered another practical problem, typical of real-life data: two continuous predictors were recorded partly as dichotomous and partly as continuous. We impute the continuous values by using the observed value for the dichotomous variable and the distribution of the continuous variable where available. Two empirical data sets on the diagnosis of deep venous thrombosis (DVT) are used, with minor to major percentages of missing predictor values: one data set to develop the model [9] and one to validate the model [10].

2. Data sets on the diagnosis of DVT

2.1. Empirical data

We used the data of 2,086 consecutive primary care patients suspected of DVT. The data originated from a large cross-sectional diagnostic study that was performed between January 1, 2002 and January 1, 2006 among over 100 primary care physicians in The Netherlands. For specific details and main results of the diagnostic study, we refer to the literature [9,10]. In brief, suspicion of DVT was based on swelling, redness, or pain of the lower extremities. Information was systematically documented on patient history and physical examination. Blood samples were taken to determine D-dimer plasma concentration. D-dimer is a fibrin degradation fragment that is related to the presence of thrombotic diseases, such as DVT. DVT was considered present if one of the proximal deep veins of the symptomatic leg was not completely compressible on repeated compression ultrasonography.
For the present analysis, we used the data of the first 1,295 patients [9] for model development and the data of the subsequent 791 patients to validate the various prediction models [10]. The model validation was therefore a temporal validation [1,2]. Two hundred and eighty-nine patients of the development set had DVT (22%) and 131 patients (17%) of the validation set. A description of the candidate predictors is given in Table 1 for both data sets. Ten candidate predictors were studied for model development, of which four were continuous: age, duration of main symptoms, difference in calf circumference, and D-dimer value.

2.2. Imputation of missing values

The percentage of missing values in the development set ranged from % for age to 39% for calf circumference difference and 55% for the D-dimer value (Table 1). Initially, the difference in calf circumference was only reported as a dichotomous value, that is, >3 or ≤3 cm. Similarly, the D-dimer plasma concentration was first only provided by the lab as a dichotomous test result, that is, >500 or ≤500 ng/ml. Later on, the continuous calf circumferences and D-dimer plasma concentrations were provided. This explains the high percentage of missing data on the continuous calf circumference and D-dimer test result in the development set. The number of missing values for the dichotomous calf circumference and D-dimer test result was lower, because the values were either recorded dichotomously or the recorded continuous values could be dichotomized. A complete case analysis for model development including the continuous values of the D-dimer test and calf circumference used only 326 patients (25% of the data). A complete case analysis with the dichotomous values used 934 patients (72%). Before missing values were imputed, we studied the missing data mechanism [11]. We created indicator variables for missing values for each variable with missing data.
Fitted logistic regression models with the indicator variable as outcome and the other variables as covariates showed that missingness was for some variables associated with observed values. Explained variation of the missingness as estimated with the regression models varied between 36% for

Table 1. Distribution of candidate predictors to diagnose DVT. Columns: development set (N = 1,295), n (%) and % missing data; validation set (N = 791), n (%) and % missing data. Some values were lost in transcription.

Female gender 826 (63) (61)
Age, yr (a) 61 (34–82) 61 (36–82)
Oral contraceptive use 123 (10) (9) 2.0
Duration of symptoms, d (a) 5 (1–20) (1–15) 15.0
Malignancy present 77 (6) (5) 8.9
Recent surgery 162 (13) (12) 9.1
Vein distension 229 (18) (17) 13.0
Leg trauma present 186 (14) (17) 3.4
Calf circumference difference, cm (a) 2 (0–4) (0–4) 28.6
Calf circumference difference >3 cm (b) 500 (39) (40) 7.7
D-dimer, ng/ml (a) 886 (160–6,288) (220–4,881) 18.8
D-dimer >500 ng/ml (b) 838 (65) (58) 18.8

(a) Continuous candidate predictor, median (10th and 90th centiles).
(b) Dichotomous value recorded or continuous observed value dichotomized.

malignancy present and recent surgery to 2% for duration of symptoms. We can assume that the missing values were at least partly missing at random (MAR), and imputation of the missing values may reduce bias. Missing not at random (MNAR) can unfortunately never be excluded, because this mechanism depends on unobserved variables.

We can distinguish two types of missingness for the continuous D-dimer and calf circumference values: either dichotomous values are observed or values are completely missing. In the former situation, the dichotomous information was used to impute the missing continuous values. All candidate predictors plus the outcome variable were used in the multiple imputation of missing values [12]. Ten imputations were performed using the methods described in Appendix A. Transformations of continuous variables were considered to enhance the flexibility of the imputation models. Simulation studies have shown that the required number of repeated imputations (m) can be as low as three for data with 20% of missing entries [13]. We had two predictors with approximately twice this percentage, and decided that 10 repeated imputations (i.e., m = 10) would be a conservative choice. Unless rates of missing information are unusually high, there tends to be little or no practical benefit to using more than 10 imputations [11].

3. Model development and model validation

3.1. Model development in general

When developing a prediction model, various issues and choices need to be addressed. We briefly discuss three common steps in the development of prediction models. First, the number of candidate predictors is commonly too large to include them all in the prediction model. The data to hand can be used to select predictors, for instance with a backward elimination procedure.
Second, the shape of the relation of continuous predictors with the outcome variable can be studied with nonlinear functions such as fractional polynomials (FPs) [14] and spline functions [15]. An advantage of the multivariable fractional polynomial (MFP) procedure is that selection of predictors and transformations is done simultaneously, in such a way as to preserve the nominal type 1 error probability. Third, given the selected predictors and transformations, the regression coefficients are estimated. As prediction models are developed to estimate outcome probabilities in new, similar patients, the regression coefficients from a model may benefit from being shrunk toward zero. With such shrinkage, better predictions will be found in new patients [6,16,17]. A heuristic shrinkage factor can be estimated as (χ²_model − df) / χ²_model, with χ²_model the model chi-square and df the number of degrees of freedom. The model chi-square is the difference in −2 log likelihood between a model with only an intercept and the fitted model. The number of degrees of freedom is in this case the total number of degrees of freedom considered in the process of selecting from all candidate predictors plus all considered transformations [6,17,18]. Shrinkage of the regression coefficients is particularly worthwhile if the sample size is relatively small.

3.2. Model development with multiple completed data sets

We performed the following three steps of model development in each of the multiple completed data sets: (1) backward elimination of predictors and FP transformations simultaneously (for simplicity and to maximize power, we considered only FP1 transformations); (2) estimation of regression coefficients; and (3) estimation of a heuristic shrinkage factor. Backward elimination of predictors and transformations was performed with the MFP procedure and the Akaike Information Criterion (AIC) stopping rule [19].
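The heuristic shrinkage factor (χ²_model − df) / χ²_model defined in Section 3.1 can be computed as follows. This is our illustrative Python sketch (the paper's analyses used Stata), with hypothetical names, using a near-unpenalized logistic fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def heuristic_shrinkage(X, y, df_considered):
    """Estimate (chi2_model - df) / chi2_model, where chi2_model is the
    difference in -2 log likelihood between the intercept-only model and
    the fitted model. Sketch only, not the paper's Stata code."""
    y = np.asarray(y)
    n = len(y)
    p0 = np.full(n, y.mean())                 # intercept-only predictions
    dev0 = 2 * n * log_loss(y, p0)            # -2 log likelihood, null model
    fit = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)
    dev1 = 2 * n * log_loss(y, fit.predict_proba(X)[:, 1])
    chi2_model = dev0 - dev1                  # model chi-square
    # df_considered: ALL degrees of freedom considered during selection
    return (chi2_model - df_considered) / chi2_model
```

Multiplying the estimated coefficients (not the intercept) by this factor shrinks predictions toward the mean, as recommended when the sample size is small.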
This rule corresponds to a P-value of 0.157 for a predictor with one degree of freedom [20]. An outline of the MFP procedure can be found in the literature [8,21,22]; see also Appendix B. To select the predictors and transformations, three different methods were applied. The first method (WALD) used Wald statistics based on Rubin's rules (for details, see Appendix B); the variable with the lowest Wald statistic was eliminated from the model. The second method (majority) involved applying backward elimination to each of the 10 completed data sets separately, resulting in 10 sets of selected predictors and transformations; the final set comprised those predictors and transformations that were selected in 50% or more of the 10 data sets. The third method (STACK) involved stacking the 10 data sets into a single large data set with 10 × n records, which was then analyzed as one data set with weighted regression (Appendix B). WALD and STACK each produced a single set of predictors and transformations, whereas the majority method necessitated a majority vote to choose the set of predictors and transformations.

Given the finally selected predictors and transformations for each of the three selection methods, a model was fitted in each of the 10 completed data sets. We used Rubin's rules to combine the estimated regression coefficients and variances from the 10 different completed data sets (see also Appendix C) [3]. A heuristic shrinkage factor was estimated, as described above, for each of the 10 models, and the 10 shrinkage factors were averaged.

3.3. Model validation with multiple completed data sets

Validation of prediction models includes the estimation of performance measures in the development data set and in a validation set. We studied calibration graphically with predicted risks on the X-axis and the observed outcomes on the

Y-axis (calibration plot). The corresponding calibration line was described via a logistic regression model with the observed outcome regressed on the linear predictor (the log odds of the predicted risk) [23]. The slope and intercept of the calibration line are ideally 1 and 0, respectively (perfect calibration). Discrimination was studied with the concordance (c) index [6,24]. Further, the squared Pearson correlation between the predicted probability and the binary outcome was estimated as a measure of explained variation.

Each of the 10 completed development data sets gave a (different) set of regression coefficients. Per development data set, the corresponding regression coefficients were used to calculate the predicted risk of each patient in that data set. The predicted risks were then compared with the observed outcomes to estimate the model performance, such as calibration and discrimination. The 10 performance estimates were averaged and their variances pooled according to Rubin's rules. Because the independent validation set also contained incomplete patient records, the same multiple imputation procedure was used to complete the records before estimating the model performance in the validation set. In each of the 10 completed validation data sets, we applied the final model from the development phase; that is, selection of predictors was based on one of the three methods described before, and the same averaged regression coefficients were applied to all 10 completed data sets. This resulted in 10 performance estimates that were averaged using Rubin's rules.

4. Case study: DVT

4.1. Model development

We first examine only patients with complete cases and then the completed data sets obtained with multiple imputation. We illustrate the model development methods in the presence of multiply imputed data as described in Section 3.2 with the data sets on the diagnosis of DVT.
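The validation measures of Section 3.3 can be estimated per completed data set along the following lines. This is a sketch of our own with simulated data; note that the intercept here comes from the same model as the slope, whereas some authors instead estimate the intercept with the slope fixed at 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def validation_measures(y, predicted_risk):
    """Calibration slope/intercept and c-index for one completed data set
    (illustrative sketch, not the authors' code)."""
    lp = np.log(predicted_risk / (1 - predicted_risk))  # linear predictor
    cal = LogisticRegression(C=1e10, max_iter=1000).fit(lp.reshape(-1, 1), y)
    slope = float(cal.coef_[0, 0])          # ideally 1
    intercept = float(cal.intercept_[0])    # ideally 0
    c_index = roc_auc_score(y, predicted_risk)  # c-index = AUC for binary y
    return slope, intercept, c_index
```

Under Rubin's rules, this function would be called once per completed data set and the 10 resulting estimates averaged, with their variances pooled.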
We consider a model that may contain continuous predictors, modeled with the MFP algorithm, and a second model with only dichotomous predictors; the continuous predictors (calf circumference, D-dimer, age, and duration of symptoms) are dichotomized for this purpose.

4.1.1. Complete case analysis

The model including continuous predictors, analyzed with the MFP procedure (MFP model), was based on 326 patients and contained four selected predictors, three of which were continuous, with log transformations in two (Table 2). The model with dichotomous predictors (dichotomous model) was based on 934 patients. Seven predictors were selected; yet the model discriminated much less well than the MFP model with only four predictors. For comparison, we also developed a dichotomous model on the 326 patients with completely observed continuous predictor values (Table 2). The two dichotomous models contained mainly the same predictors. Standard errors were smaller for the regression coefficients of the model that was based on the larger group of patients, as

Table 2. Regression coefficients, standard errors, and powers for the predictors selected with backward elimination and the AIC stopping rule on the complete cases. Columns: MFP (N = 326), Beta (SE) and power; Dichotomous (N = 934), Beta (SE); Dichotomous (N = 326), Beta (SE). Some values were lost in transcription.

Female gender 0 92 (0.184) 93 (0.331)
Age, yr 29 (10) 1 NA NA
Age >61 yr NA 97 (0.187) 0
Oral contraceptive use (71)
Duration of symptoms, d 0 NA NA
Duration of symptoms >5 d NA (0.186) 86 (0.305)
Malignancy present
Recent surgery
Vein distension 0 83 (15) 0
Leg trauma present (0.596) NA 12 (70) (94)
Calf circumference difference, cm 32 (a) (0.108) 0 NA NA
Calf circumference difference >3 cm NA (0.188) (0.306)
D-dimer, ng/ml (b) (25) 0 NA NA
D-dimer >500 ng/ml NA (0.595) (21)
Constant (1.303) NA (05) (44)
Performance: c-index (19) 22 (15) 41 (22)
Performance: R² 50 (54) 28 (28) 76 (45)

MFP: continuous predictors are modelled with multivariable fractional polynomials; Dichotomous: continuous predictors are dichotomised; Beta: regression coefficient; SE: standard error; power: selected power transformation in the MFP analysis, where 0 stands for natural logarithm; NA: not applicable.
(a) Predictor is scaled and transformed: log ((calf circumference difference + 1)/10).
(b) Predictor is scaled and transformed: log (D-dimer/100,000).

expected. The performance measures were higher for the model that was based on only 326 patients. This could be a result of greater optimism or bias in the performance measures caused by the selected subgroup with completely observed values.

Table 3. Predictor and power selection with the MFP procedure that allows continuous predictors, in the 10 completed data sets separately and for the majority, WALD, and STACK methods (N = 1,295 in each completed data set). Columns: completed data sets 1–10, then majority, then WALD/STACK. Some entries were lost in transcription.

Female gender x x x x x x x x x x x x
Age, yr
Oral contraceptive use x x x x x x x – x x x x
Duration of symptoms, d 1 3 – – – – 1 1 –
Malignancy present – – x – – – x – – – – –
Recent surgery – – – x – – – – – – – –
Vein distension x – x x x x x x x x x x
Leg trauma present x x x x x x x x x x x x
Calf circumference difference, cm
D-dimer, ng/ml

x = dichotomous variable is selected; – = variable is not selected; number = continuous variable is selected with the given power (0 = natural logarithm; 1 = linear; 3 = cubic).

4.1.2. Analysis with the multiple completed data sets

The three different procedures for selecting predictors and transformations in the model using continuous predictors gave very similar results (Table 3). The majority method selected one additional predictor, that is, duration of symptoms, compared with the STACK and WALD methods. The selection across the 10 completed data sets was very consistent (Table 3). The predictors selected with the majority method were also selected in all or nearly all individual data sets; the other predictors were selected in only one or two individual data sets. When the four continuous predictors were dichotomized, all 10 candidate predictors were selected with all three selection procedures. The majority method selected all predictors in all 10 data sets, except recent surgery in data set 7 and malignancy present in data sets 5 and 6.
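The majority rule just described (keep what is selected in at least half of the completed data sets) can be sketched as follows; the predictor names here are invented for illustration:

```python
from collections import Counter

def majority_selection(selected_sets, m=None):
    """selected_sets: one set of selected predictor names per completed
    data set. Returns predictors chosen in >= 50% of the m sets
    (a sketch of the majority rule, not the authors' code)."""
    m = m or len(selected_sets)
    counts = Counter(p for s in selected_sets for p in s)
    return {p for p, c in counts.items() if c >= m / 2}

# hypothetical selections from m = 4 completed data sets
sets_ = [{"ddimer", "calf", "sex"}, {"ddimer", "calf"},
         {"ddimer", "calf", "age"}, {"ddimer", "calf", "sex"}]
print(sorted(majority_selection(sets_)))  # -> ['calf', 'ddimer', 'sex']
```

The same vote can be taken over (predictor, power) pairs to settle the functional form of a continuous predictor.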
Table 4. Regression coefficients, standard errors, and powers for the predictors selected with backward elimination and the AIC stopping rule on the multiple completed data sets (N = 1,295 in each completed data set). Columns: MFP, WALD/STACK method (a), Beta (SE) and power; Dichotomous, WALD/STACK/majority method (b), Beta (SE). Some values were lost in transcription.

Female gender 88 (0.198) NA 68 (0.167)
Age, yr 17 (06) 1 NA
Age >61 yr NA 16 (0.171)
Oral contraceptive use 53 (0.341) NA (86)
Duration of symptoms, d 0 NA
Duration of symptoms >5 d NA (0.167)
Malignancy present 0 76 (93)
Recent surgery (18)
Vein distension 56 (17) NA 10 (0.181)
Leg trauma present (81) NA 47 (33)
Calf circumference difference, cm (c) (02) 0 NA
Calf circumference difference >3 cm NA (0.164)
D-dimer test, ng/ml (d) (0.133) 0 NA
D-dimer test >500 ng/ml NA (73)
Constant (37) NA (93)
Performance: c-index 75 (15) 20 (14)
Performance: R² (37) 26 (24)

MFP: continuous predictors are modelled with multivariable fractional polynomials; Dichotomous: continuous predictors are dichotomised; Beta: regression coefficient; SE: standard error; power: selected power transformation in the MFP analysis, where 0 stands for natural logarithm; NA: not applicable.
(a) The WALD and STACK methods resulted in identical MFP models. The majority method included one extra predictor, the linear association of duration of symptoms (beta = 14); regression coefficients of the other predictors were very similar.
(b) The WALD, STACK, and majority methods resulted in identical dichotomized models.
(c) Predictor is scaled and transformed: log ((calf circumference difference + 1)/10).
(d) Predictor is scaled and transformed: log (D-dimer/100,000).

The extra selected

predictors could apparently compensate, at least partly, for the information lost by dichotomizing the continuous predictors (Table 4). The predictor duration of symptoms that was additionally selected with the majority method had a weak effect (beta = 14), changed the regression coefficients of the other predictors only slightly, and increased the performance minimally (the c-index did not change and R² increased from to 0.368). The performance of the dichotomous model was lower than that of the MFP models even though more predictors were included. This shows that keeping the continuous predictors continuous (in particular the D-dimer test and calf circumference) is important for the prediction of DVT. Imputation of the missing continuous values, partly with the dichotomous information, was therefore particularly necessary.

In comparison with the models developed with the complete cases (Table 2), the models developed with the completed data sets contained more predictors, possibly as a result of increased power. The standard errors (estimated as the square root of the pooled variances) were in general smaller. Furthermore, in the complete case analysis, nonlinearity was not detected for the predictor calf circumference. The performance of the model with dichotomous predictors is clearly inferior despite the inclusion of additional predictors, as indicated by a lower c-index and a lower R². The performance of the models derived from the completed data sets was less good than that of the models derived from the complete case analysis. This may be the result of optimism or selection bias in the small sample (n = 326), because a similar discrepancy in performance was shown for the two dichotomous models that were developed with two different sample sizes (n = 326 and n = 934).
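The conditional imputation of continuous values from their observed dichotomous versions (e.g. D-dimer recorded only as >500 vs. ≤500 ng/ml) can be mimicked with a simple hot-deck-style draw from donors on the matching side of the cut-off. This stand-in is our own and is cruder than the regression-based imputation actually used (Appendix A):

```python
import numpy as np

def impute_from_dichotomy(cont, dich, cutoff, rng):
    """Fill NaNs in `cont` by resampling observed donor values lying on the
    side of `cutoff` indicated by the dichotomous record `dich` (True means
    above the cut-off). Hot-deck sketch, not the paper's method."""
    cont = np.asarray(cont, dtype=float).copy()
    dich = np.asarray(dich, dtype=bool)
    observed = ~np.isnan(cont)
    for above in (True, False):
        donors = cont[observed & ((cont > cutoff) == above)]
        fill = ~observed & (dich == above)
        if fill.any():
            cont[fill] = rng.choice(donors, size=fill.sum())
    return cont
```

The key point the sketch shares with the paper's approach is that an imputed continuous value is constrained to be consistent with the observed dichotomous record.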
The heuristic shrinkage factors varied between and for the MFP models fitted to the completed data sets, with a mean value of . The shrinkage factor for the dichotomized model varied between and , with a mean of . These values are close to 1 and indicate little optimism in the models and little need to compensate for regression to the mean.

4.2. Model validation in independent data

4.2.1. Complete case analysis

The number of patients on which the analyses were performed depended on the number of completely observed cases for the different predictors in the models (Table 5). The c-index and R² of the MFP models were slightly higher than in the development data (Table 4), whereas the c-index and R² of the dichotomous model were lower.

4.2.2. Analysis with the multiple completed data sets

The estimates after imputation (Table 6) were somewhat closer to the estimates for the development data (Table 4). The c-index of the MFP models was again slightly higher than in the development data (Table 4), whereas the c-index of the dichotomous model and the R² of all three models were lower. Assessment of calibration showed similar results in the complete case analysis (Table 5) and the completed data sets (Table 6), which is confirmed in Fig. 1. The MFP models showed calibration slopes larger than 1, indicating that the regression coefficients may not be extreme enough, whereas the dichotomous model showed a slope smaller than 1. In general, predictions and observed proportions of DVT were in good agreement for the two MFP models. Predictions above 25% were too high for the dichotomous model. The broader range of predicted risks of DVT for the MFP models compared with the predicted risks of the dichotomous model is in agreement with the higher c-index for the MFP models.

5. Discussion

Missing data are commonplace in clinical studies. The main message we wish to bring out here is that good statistical methods are available to enable credible, practical analyses of such data sets.
It is often unclear from reports whether a prediction model was developed or validated in the presence of missing data. Authors usually ignore cases with missing observations and perform complete case analyses. More recently, awareness has been growing of the usefulness of multiple imputation methodology, a powerful

Table 5. Predictive performance of the three developed models estimated with complete case analysis in the validation data. Columns: MFP W/S (N = 418), MFP majority (N = 369), Dichotomous (N = 629). Values are estimates (SE); some values were lost in transcription.

Calibration slope (0.146) (0.148) (0.108)
Calibration intercept (0.163) 72 (0.170) (0.119)
c-index 90 (19) 86 (20) (24)
R² 46 (36) 52 (38) 93 (16)

MFP W/S: continuous predictors are modeled with the MFP procedure, predictors are selected with the WALD or STACK method; MFP majority: continuous predictors are modeled with the MFP procedure, predictors are selected with the majority method; Dichotomous: continuous predictors are dichotomized, predictors are selected with the WALD, STACK, or majority method.

Table 6. Predictive performance of the three developed models estimated in the multiply imputed validation data. Columns: MFP W/S, MFP majority, Dichotomous. Values are means of 10 estimates (SE); some values were lost in transcription.

Calibration slope (0.113) (0.114) 06 (0.103)
Calibration intercept (0.117) (0.117) (0.108)
c-index 79 (17) 82 (17) (21)
R² (41) (41) (25)

MFP W/S: continuous predictors are modeled with the MFP procedure, predictors are selected with the WALD or STACK method; MFP majority: continuous predictors are modeled with the MFP procedure, predictors are selected with the majority method; Dichotomous: continuous predictors are dichotomized, predictors are selected with the WALD, STACK, or majority method.

Fig. 1. Calibration curves corresponding to different models and data sets: MFP WALD/STACK, MFP majority, and dichotomous models applied in the incomplete data set (A, B, and C, respectively) and in one completed data set (D, E, and F). The dotted line indicates perfect calibration; the solid line shows the relation between observed and predicted values. Triangles indicate observed proportions for five quintile-based risk groups with 95% confidence limits. Below the main plot, the vertical lines indicate distributions of the predicted risks by outcome (present or absent).

approach pioneered by Rubin and others, to handle such data efficiently [3,5,13]. Several mainstream statistical packages now offer well-developed software for creating multiple completed data sets and analyzing them. Nevertheless, several important issues remain open [25], including identifying satisfactory methods of model development and validation. A guiding principle is to be found in Rubin's rules: a quantity of interest (be it a regression coefficient or a performance measure) should be estimated in each of m completed data sets, together with its variance, and pooled over the m data sets, using Rubin's rules, to give a single estimate and variance (see Appendix C).

Predictor and transformation selection in multivariable regression analysis are commonly based on likelihood ratio (LR) tests. A previously proposed approximation of the LR test for multiply imputed data [26] showed disappointing performance in the presence of nonlinear correlations [27]. We therefore used three other methods for predictor selection in developing a prediction model: WALD (based on Wald statistics for pooled estimates), a majority method, and STACK (a proposed weighted regression method) [28]. In the present data set, the predictors and transformations selected with the three methods were very similar (Table 3).
However, this is a practical case study, and generalization of the results is not possible. Each of the three methods has its own advantages. Estimation of the Wald statistic in the WALD method follows Rubin's rules and is a sound and well-established approach. However, it was recently shown that the use of Wald statistics to select the power in an FP model can result in biased estimates [27]. The majority method gives much insight into the variability between the completed data sets. Variability may be found not only in the predictors selected, but also in the selection of powers for one particular continuous predictor, which results in different functional forms (Table 3). If predictor and transformation selection is based on the majority method, more than 10 imputations may be necessary to obtain stable results. The big advantage of the STACK method is that only one data set needs to be analyzed. The analysis leads directly to a single set of selected predictors, corresponding regression coefficients, and standard errors.

We used the AIC as the stopping rule in the backward elimination procedure, which corresponds to a P-value of 0.157 for variables with one degree of freedom (i.e., dichotomous variables that are modeled with one regression coefficient). Other P-values are also regularly used in predictor and

transformation selection, either more standard values such as 0.05 and 0.1, or higher values up to 0.5. In large data sets with strong predictors, the 0.05 level suffices. In small data sets, more liberal P-values are advocated to increase the probability that real predictors are selected, at the expense of also selecting more noise variables [29]. An alternative to the traditional non-Bayesian variable selection applied here for multiply imputed data is an approach that draws on the Bayesian frameworks of multiple imputation and variable selection [30]. This approach has been applied for selection of variables in linear regression models. We focused in this paper on the traditional approach, because we believe that this is of greatest practical relevance to many data analysts.

One step in model development that we did not consider in our analysis is internal validation. The development data are resampled, and different samples can be used for development and validation. The most efficient procedure is bootstrapping [31,32]. It is unclear how resampling should be applied in the presence of multiply imputed data. Each completed data set can be bootstrapped, or each bootstrap sample with missing values can be imputed. Further research is necessary on this topic [33].

Few researchers have estimated model performance measures in multiply imputed data. Consistent with Rubin's rules, we applied the model to the patients of each completed data set, which resulted in 10 predicted risks per patient. Predicted risks were based on one single model, but on 10 different predictor values for predictors with a missing value. Accordingly, the 10 estimates of performance measures (i.e., c-index or R²) and variances were pooled. Another approach would be to average the 10 predicted risks of each patient, which would result in one performance estimate [34].
This approach yielded slightly higher estimates in our data. For instance, the estimates of the slope, c-index, and R² were 1.315, 0.901, and for the MFP W/S model in the external validation data (vs , 79, and 0.329). Another possible approach is to report all 10 estimates, or the median and range [35].

As is often the case, the models were dominated by a small number of strongly influential predictors: here, D-dimer test result and difference in calf circumference. Models in which these predictors were dichotomized performed less well, although extra predictors were selected. This confirms the conclusion of an earlier paper that dichotomization of continuous predictors is an unwise strategy [36]. We are aware that the clinical interpretation of variables that enter a model continuously may be less straightforward, particularly when a transformation has been applied. Graphical representations can overcome this problem: the modeled association can easily be plotted with the predictor value on the X-axis against the outcome value on the Y-axis.

If we assume that the missing data were missing at random (MAR), the results after multiple imputation are less biased than the results of a complete case analysis. The MAR assumption states that the probability that a data value is missing depends only on values of variables that were actually observed. In other words, we assumed that the missingness of a variable does not depend on the values of variables on which we did not obtain data. Conceptually, it also excludes dependence of the occurrence of missing values on the true, but unobserved, value of the variable itself (missing not at random, MNAR) [11]. In general, the MAR mechanism is assumed to make imputation possible. Recently, a tool has been developed to investigate the MAR assumption: the index of sensitivity to nonignorability [37]. However, the assumption cannot be formally tested, because the true values cannot be observed.
An important step in imputing the missing data is the specification of the imputation models; this is an explicit attempt to model the MAR process. Imputation models were specified for each candidate predictor with missing data, irrespective of the quantity of missing data. All candidate predictors (10 in total) and the presence of DVT entered the imputation models [13]. We did not consider extra variables for the imputation models, because we did not expect a substantial increase in explained variance beyond the 10 candidate predictors and the outcome variable.

In the development set, some proportions of missing values were high: 55% and 39% for D-dimer test and calf circumference, respectively. We could perform a conditional imputation, because for most missing continuous values a dichotomized value was observed. This rather special situation also occurs in cancer data. Concentrations of markers such as estrogen and progesterone receptors in breast cancer were formerly recorded dichotomously as low vs. high; when assessment of the markers became more common, actual concentrations were recorded. Here, conditional imputation might also be applied to the older data.

In conclusion, this case study illustrated methods to deal with missing values in the development and validation process of a clinical prediction model. We found that multiple imputation and the corresponding Rubin's rules can be used for such analyses. Further experience with these methods in other empirical data sets is needed to formulate general guidelines for prediction modeling in the presence of missing data.

Appendix A. Imputation method

Multiple imputation was performed using the ice program [38] for Stata, an implementation of the MICE regression switching algorithm [39]. MICE requires specification of conditional models for each incomplete variable given all other variables. m = 10 completed data sets were created, each using 10 cycles of regression switching.
Imputations for each continuous variable were drawn from a normal approximation to the posterior distribution of the corresponding conditional model. Logistic models were used for imputing binary predictors. After preliminary investigation, calf difference and D-dimer were log transformed to approximate normality before imputation began, and were included in the conditional imputation models for the other variables as linear and quadratic terms. Continuous values of log((calf difference + 1)/10) and log(D-dimer/100,000) were imputed when only dichotomized values were available, by sampling from normal distributions truncated at the known cut-off values. Otherwise, missing values for these variables were imputed by assuming complete normal distributions. Parameter estimates for all regression models were combined across imputations using the micombine command [38], which applies Rubin's rules.

Appendix B. Methods of predictor and transformation selection with the MFP procedure in multiple completed data sets

Brief outline of the MFP procedure

The MFP approach to building regression models combines selection of predictors with determination of functional relationships for continuous predictors. Predictors are selected by backward elimination, using either a conventional stopping rule such as P < 0.05 for testing the statistical significance of a predictor, or the AIC (see below).

Consider FP modeling of a single continuous predictor, x. Usually one chooses between FP2, FP1, linear, or null functions of x, because functions more complex than FP2 are rarely needed. Sometimes, for example to maximize statistical power or to impose monotonicity on a functional relationship, FP1 may be the most complex function considered. For exponents (powers) p, q in the set {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, with 0 denoting a logarithmic transformation, an FP2 function has the two possible forms β0 + β1x^p + β2x^q or β0 + β1x^p + β2x^p log(x), the former when p ≠ q and the latter when p = q. An FP1 function is β0 + β1x^p, or β0 + β1 log(x) when p = 0. Significance testing for selecting an FP function and for selecting x uses a sequence of LR tests, as follows.
FP2 is compared with the null model (β0 only) on 4 df; if this test is not significant, x is dropped. If significant, FP2 is further compared with a linear function on 3 df. The linear function is chosen if this test is not significant; otherwise, the final test is between the FP2 and FP1 functions, on 2 df. Because more liberal significance levels have been advocated for predictor selection in prognostic model development, we used the AIC, which corresponds to a P-value of 0.157 for 1 df. Using the AIC to select an FP function involves comparing the penalized log likelihoods of the FP2, FP1, linear, and null models for x. The AIC is defined as (-2 log likelihood) + (2 × df). Ignoring β0, the dfs of these four models are 4, 2, 1, and 0, respectively. The model with the lowest AIC is selected.

To choose a final multivariable model from a set of predictors, the MFP algorithm uses a sequence of such tests (using either significance testing or AIC minimization) in an iterative, back-fitting manner. Predictors are first ordered according to their statistical significance in a full linear model. Then each predictor is visited in turn, and the procedure just described for selecting a predictor or FP function is applied. Currently selected predictors and (where necessary) their FP functions are included in the models. The procedure continues until there is no further change in the selected predictors and functions, typically taking two to three cycles to completion.

WALD: Wald tests based on Rubin's rules

If the log likelihood is quadratic, which is exactly true in a normal errors model and approximately true in other models with small coefficients, then Wald and LR tests are equivalent. As an approximation, we used Wald test statistics as if they were LR statistics. For example, let the pooled regression estimate for a single predictor x be β and the pooled standard error be s, obtained by applying Rubin's rules to the estimates from the m completed data sets. The Wald χ² statistic for testing β = 0 is (β/s)².
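The pooled Wald statistic just described can be sketched in a few lines. This is an illustrative helper under Rubin's rules (the function and variable names are ours, not the paper's): the point estimates are averaged, the total variance combines within- and between-imputation components, and the Wald χ² is the squared ratio of the pooled estimate to its pooled standard error.

```python
import math

def pooled_wald_chi2(estimates, variances):
    """Pool one regression coefficient over m imputations with Rubin's
    rules and return the Wald chi-squared statistic for testing beta = 0.
    `estimates` and `variances` hold the per-imputation values."""
    m = len(estimates)
    q_bar = sum(estimates) / m                               # pooled estimate
    w = sum(variances) / m                                   # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = w + (1 + 1 / m) * b                                  # total variance (Rubin's rules)
    return (q_bar / math.sqrt(t)) ** 2                       # (beta / s)^2

# Hypothetical coefficient for one predictor from m = 10 completed data sets.
est = [0.52, 0.49, 0.55, 0.51, 0.48, 0.53, 0.50, 0.54, 0.47, 0.52]
var = [0.010] * 10
print(round(pooled_wald_chi2(est, var), 1))  # → 24.3
```

Treating this χ² as if it were an LR statistic is exactly the approximation the WALD method makes.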
The AIC of a model with df degrees of freedom was then defined as (-Wald χ² statistic) + (2 × df).

STACK: a weighted regression method

We vertically stacked the 10 completed data sets for the 1,295 or 791 patients into one large data set of length 12,950 or 7,910. Fitting models to this single stacked data set, ignoring its special structure, yields valid parameter estimates but standard errors that are too small. To correct the standard error of a regression coefficient for a predictor x, we used the weight

w = (1 - q(x)) / m,

where q(x), the fraction of missing data for x, equals the number of missing values of x divided by n. The weights were used in regression models on x, applied so that the log likelihood was scaled by w. In Stata, such a scheme is known as importance weighting. It provides an approximate adjustment for the multiple data sets and for the missing data. Further justification, results, and discussion of STACK have been given recently [40].

Because STACK provides a type of log likelihood for any given model for x, conditional on other covariates, it also yields an AIC, defined as usual as (-2 log likelihood) + (2 × df). Note that because q(x) will (almost always) vary across the xs, STACK does not impart a meaningful overall likelihood to a multivariable model. However, the MFP procedure requires only a likelihood for models involving different functions of x; contributions to the likelihood from other covariates are irrelevant.
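The STACK weight is a one-line computation. A minimal sketch (our illustration; the missing-value count below is hypothetical, chosen to match the reported 55% missing D-dimer in the development set of n = 1,295):

```python
def stack_weight(n_missing: int, n: int, m: int) -> float:
    """Weight for the stacked analysis of predictor x:
    w = (1 - q(x)) / m, where q(x) = n_missing / n is the
    fraction of missing values of x."""
    q = n_missing / n
    return (1 - q) / m

# Hypothetical count: 712 of 1,295 patients missing D-dimer (q ≈ 0.55),
# stacked over m = 10 imputations.
w_ddimer = stack_weight(n_missing=712, n=1295, m=10)
# A fully observed predictor gets w = 1/m, i.e., 0.1 for m = 10.
w_complete = stack_weight(n_missing=0, n=1295, m=10)
print(round(w_ddimer, 3), w_complete)  # → 0.045 0.1
```

The weight shrinks toward zero as the fraction of missing (hence imputed, partly redundant) values of x grows, which is what inflates the standard errors back toward their correct size.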

Appendix C. Combining the results from the m different completed data sets

The computation of the multiple imputation point estimate and variance given the m (here, m = 10) completed data sets follows Rubin's rules [41]. Let Qi and Wi denote the point estimate and variance, respectively, from the ith (i = 1, ..., 10) completed data set. The multiple imputation point estimate Q* of Q is the arithmetic mean of the 10 completed-data estimates. The estimated variance T of Q* is obtained by a components-of-variance argument, leading to the formula

T = W̄ + (1 + 1/m) B,

where W̄ is the within-imputation variance, W̄ = (1/m) Σ(i=1..m) Wi, and B is the between-imputation variance, B = (1/(m-1)) Σ(i=1..m) (Qi - Q*)².

References

[1] Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med 1999;130:515-24.
[2] Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med 2000;19:453-73.
[3] Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.
[4] Little RA. Regression with missing X's: a review. J Am Stat Assoc 1992;87:1227-37.
[5] Schafer JL. Analysis of incomplete multivariate data. London: Chapman & Hall/CRC Press; 1997.
[6] Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361-87.
[7] Harrell FE Jr, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modelling strategies for improved prognostic prediction. Stat Med 1984;3:143-52.
[8] Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J R Stat Soc A 1999;162:71-94.
[9] Oudega R, Moons KG, Hoes AW. Ruling out deep venous thrombosis in primary care. A simple diagnostic algorithm including D-dimer testing. Thromb Haemost 2005;94:200-5.
[10] Toll D, Oudega R, Vergouwe Y, Moons K, Hoes A. A new diagnostic rule for deep vein thrombosis: safety and efficiency in clinically relevant subgroups. Fam Pract 2008;25:3-8.
[11] Schafer JL. Multiple imputation: a primer. Stat Methods Med Res 1999;8(1):3-15.
[12] Moons KG, Donders RA, Stijnen T, Harrell FE Jr. Using the outcome for imputation of missing predictor values was preferred. J Clin Epidemiol 2006;59:1092-101.
[13] van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 1999;18:681-94.
[14] Royston P, Altman DG. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling (with discussion). Appl Stat 1994;43:429-67.
[15] Harrell FE Jr. Regression modeling strategies. With applications to linear models, logistic regression, and survival analysis. New York: Springer-Verlag; 2001.
[16] Spiegelhalter DJ. Probabilistic prediction in patient management and clinical trials. Stat Med 1986;5:421-33.
[17] Copas JB. Regression, prediction and shrinkage. J R Stat Soc B 1983;45:311-54.
[18] van Houwelingen HC, Thorogood J. Construction, validation and updating of a prognostic model for kidney graft survival. Stat Med 1995;14:1999-2008.
[19] Atkinson AC. A note on the generalized information criterion for choice of a model. Biometrika 1980;67:413-8.
[20] Sauerbrei W. The use of resampling methods to simplify regression models in medical statistics. Appl Stat 1999;48:313-29.
[21] Royston P, Sauerbrei W. Building multivariable regression models with continuous covariates in clinical epidemiology, with an emphasis on fractional polynomials. Methods Inf Med 2005;44:561-71.
[22] Royston P, Sauerbrei W. Multivariable model-building: a pragmatic approach to regression analysis based on fractional polynomials for continuous variables. Chichester: Wiley; 2008.
[23] Miller ME, Hui SL. Validation techniques for logistic regression models. Stat Med 1991;10:1213-26.
[24] Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29-36.
[25] White IR, Wood AM, Royston P. Multiple imputation in practice. Stat Methods Med Res; in press.
[26] Meng X-L, Rubin DB. Performing likelihood ratio tests with multiply-imputed data sets. Biometrika 1992;79:103-11.
[27] Royston P, White IR, Wood AM. Building multivariable fractional polynomial models in multiply imputed data. Submitted; available on request.
[28] Wood AM, White IR, Royston P. How should variable selection be performed with multiply imputed data? Stat Med 2008;27:3227-46.
[29] Steyerberg EW, Eijkemans MJC, Harrell FE Jr, Habbema JDF. Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets. Stat Med 2000;19:1059-79.
[30] Yang X, Belin TR, Boscardin WJ. Imputation and variable selection in linear regression models with missing covariates. Biometrics 2005;61:498-506.
[31] Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman & Hall; 1993.
[32] Steyerberg EW, Harrell FE Jr, Borsboom GJJ, Eijkemans MJC, Vergouwe Y, Habbema JDF. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001;54:774-81.
[33] Heymans MW, van Buuren S, Knol DL, van Mechelen W, de Vet HC. Variable selection under multiple imputation using the bootstrap in a prognostic study. BMC Med Res Methodol 2007;7:33.
[34] Burd RS, Jang TS, Nair SS. Predicting hospital mortality among injured children using a national trauma database. J Trauma 2006;60:792-801.
[35] Clark TG, Altman DG. Developing a prognostic model in the presence of missing data: an ovarian cancer case study. J Clin Epidemiol 2003;56(1):28-37.
[36] Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006;25:127-41.
[37] Troxel AB, Ma G, Heitjan DF.
An index of sensitivity to nonignorability. Stat Sin 2004;14:1221-37.
[38] Royston P. Multiple imputation of missing values: update of ice. Stata Journal 2005;5:527-36.
[39] van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 1999;18(6):681-94.
[40] Wood AM, White IR, Royston P. How should variable selection be performed with multiply imputed data? Stat Med 2008;27(17):3227-46.
[41] Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.


More information

THE RISK DISTRIBUTION CURVE AND ITS DERIVATIVES. Ralph Stern Cardiovascular Medicine University of Michigan Ann Arbor, Michigan. stern@umich.

THE RISK DISTRIBUTION CURVE AND ITS DERIVATIVES. Ralph Stern Cardiovascular Medicine University of Michigan Ann Arbor, Michigan. stern@umich. THE RISK DISTRIBUTION CURVE AND ITS DERIVATIVES Ralph Stern Cardiovascular Medicine University of Michigan Ann Arbor, Michigan stern@umich.edu ABSTRACT Risk stratification is most directly and informatively

More information

Problem of Missing Data

Problem of Missing Data VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;

More information

Multivariate Logistic Regression

Multivariate Logistic Regression 1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

More information

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

More information

Comparison of Imputation Methods in the Survey of Income and Program Participation

Comparison of Imputation Methods in the Survey of Income and Program Participation Comparison of Imputation Methods in the Survey of Income and Program Participation Sarah McMillan U.S. Census Bureau, 4600 Silver Hill Rd, Washington, DC 20233 Any views expressed are those of the author

More information

Penalized regression: Introduction

Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

13. Poisson Regression Analysis

13. Poisson Regression Analysis 136 Poisson Regression Analysis 13. Poisson Regression Analysis We have so far considered situations where the outcome variable is numeric and Normally distributed, or binary. In clinical work one often

More information

Applications of R Software in Bayesian Data Analysis

Applications of R Software in Bayesian Data Analysis Article International Journal of Information Science and System, 2012, 1(1): 7-23 International Journal of Information Science and System Journal homepage: www.modernscientificpress.com/journals/ijinfosci.aspx

More information

Sensitivity Analysis in Multiple Imputation for Missing Data

Sensitivity Analysis in Multiple Imputation for Missing Data Paper SAS270-2014 Sensitivity Analysis in Multiple Imputation for Missing Data Yang Yuan, SAS Institute Inc. ABSTRACT Multiple imputation, a popular strategy for dealing with missing values, usually assumes

More information

AP Physics 1 and 2 Lab Investigations

AP Physics 1 and 2 Lab Investigations AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks

More information

Time series analysis as a framework for the characterization of waterborne disease outbreaks

Time series analysis as a framework for the characterization of waterborne disease outbreaks Interdisciplinary Perspectives on Drinking Water Risk Assessment and Management (Proceedings of the Santiago (Chile) Symposium, September 1998). IAHS Publ. no. 260, 2000. 127 Time series analysis as a

More information

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the

More information

5. Multiple regression

5. Multiple regression 5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful

More information

Simple Linear Regression Inference

Simple Linear Regression Inference Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

More information

Guide to Biostatistics

Guide to Biostatistics MedPage Tools Guide to Biostatistics Study Designs Here is a compilation of important epidemiologic and common biostatistical terms used in medical research. You can use it as a reference guide when reading

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

More information

Linear Models and Conjoint Analysis with Nonlinear Spline Transformations

Linear Models and Conjoint Analysis with Nonlinear Spline Transformations Linear Models and Conjoint Analysis with Nonlinear Spline Transformations Warren F. Kuhfeld Mark Garratt Abstract Many common data analysis models are based on the general linear univariate model, including

More information

A Bayesian hierarchical surrogate outcome model for multiple sclerosis

A Bayesian hierarchical surrogate outcome model for multiple sclerosis A Bayesian hierarchical surrogate outcome model for multiple sclerosis 3 rd Annual ASA New Jersey Chapter / Bayer Statistics Workshop David Ohlssen (Novartis), Luca Pozzi and Heinz Schmidli (Novartis)

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Cairo University Faculty of Economics and Political Science Statistics Department English Section Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Prepared

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Power and sample size in multilevel modeling

Power and sample size in multilevel modeling Snijders, Tom A.B. Power and Sample Size in Multilevel Linear Models. In: B.S. Everitt and D.C. Howell (eds.), Encyclopedia of Statistics in Behavioral Science. Volume 3, 1570 1573. Chicester (etc.): Wiley,

More information

Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

More information

Item Imputation Without Specifying Scale Structure

Item Imputation Without Specifying Scale Structure Original Article Item Imputation Without Specifying Scale Structure Stef van Buuren TNO Quality of Life, Leiden, The Netherlands University of Utrecht, The Netherlands Abstract. Imputation of incomplete

More information

Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance. Chapter 2: Statistical models of default Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

More information

Appendix 1: Time series analysis of peak-rate years and synchrony testing.

Appendix 1: Time series analysis of peak-rate years and synchrony testing. Appendix 1: Time series analysis of peak-rate years and synchrony testing. Overview The raw data are accessible at Figshare ( Time series of global resources, DOI 10.6084/m9.figshare.929619), sources are

More information

JUST THE MATHS UNIT NUMBER 1.8. ALGEBRA 8 (Polynomials) A.J.Hobson

JUST THE MATHS UNIT NUMBER 1.8. ALGEBRA 8 (Polynomials) A.J.Hobson JUST THE MATHS UNIT NUMBER 1.8 ALGEBRA 8 (Polynomials) by A.J.Hobson 1.8.1 The factor theorem 1.8.2 Application to quadratic and cubic expressions 1.8.3 Cubic equations 1.8.4 Long division of polynomials

More information

Credit Risk Analysis Using Logistic Regression Modeling

Credit Risk Analysis Using Logistic Regression Modeling Credit Risk Analysis Using Logistic Regression Modeling Introduction A loan officer at a bank wants to be able to identify characteristics that are indicative of people who are likely to default on loans,

More information

BookTOC.txt. 1. Functions, Graphs, and Models. Algebra Toolbox. Sets. The Real Numbers. Inequalities and Intervals on the Real Number Line

BookTOC.txt. 1. Functions, Graphs, and Models. Algebra Toolbox. Sets. The Real Numbers. Inequalities and Intervals on the Real Number Line College Algebra in Context with Applications for the Managerial, Life, and Social Sciences, 3rd Edition Ronald J. Harshbarger, University of South Carolina - Beaufort Lisa S. Yocco, Georgia Southern University

More information

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,

More information

COURSE PLAN BDA: Biomedical Data Analysis Master in Bioinformatics for Health Sciences. 2015-2016 Academic Year Qualification.

COURSE PLAN BDA: Biomedical Data Analysis Master in Bioinformatics for Health Sciences. 2015-2016 Academic Year Qualification. COURSE PLAN BDA: Biomedical Data Analysis Master in Bioinformatics for Health Sciences 2015-2016 Academic Year Qualification. Master's Degree 1. Description of the subject Subject name: Biomedical Data

More information

How To Model The Fate Of An Animal

How To Model The Fate Of An Animal Models Where the Fate of Every Individual is Known This class of models is important because they provide a theory for estimation of survival probability and other parameters from radio-tagged animals.

More information

Survival Analysis of Left Truncated Income Protection Insurance Data. [March 29, 2012]

Survival Analysis of Left Truncated Income Protection Insurance Data. [March 29, 2012] Survival Analysis of Left Truncated Income Protection Insurance Data [March 29, 2012] 1 Qing Liu 2 David Pitt 3 Yan Wang 4 Xueyuan Wu Abstract One of the main characteristics of Income Protection Insurance

More information

Model Validation Techniques

Model Validation Techniques Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

More information

Poisson Models for Count Data

Poisson Models for Count Data Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

More information

Imputation of missing data under missing not at random assumption & sensitivity analysis

Imputation of missing data under missing not at random assumption & sensitivity analysis Imputation of missing data under missing not at random assumption & sensitivity analysis S. Jolani Department of Methodology and Statistics, Utrecht University, the Netherlands Advanced Multiple Imputation,

More information

II. DISTRIBUTIONS distribution normal distribution. standard scores

II. DISTRIBUTIONS distribution normal distribution. standard scores Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

More information

The Internal Rate of Return Model for Life Insurance Policies

The Internal Rate of Return Model for Life Insurance Policies The Internal Rate of Return Model for Life Insurance Policies Abstract Life insurance policies are no longer seen solely as a means of insuring life. Due to many new features introduced by life insurers,

More information

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996)

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996) MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL by Michael L. Orlov Chemistry Department, Oregon State University (1996) INTRODUCTION In modern science, regression analysis is a necessary part

More information

Incorporating transportation costs into inventory replenishment decisions

Incorporating transportation costs into inventory replenishment decisions Int. J. Production Economics 77 (2002) 113 130 Incorporating transportation costs into inventory replenishment decisions Scott R. Swenseth a, Michael R. Godfrey b, * a Department of Management, University

More information

Module 4 - Multiple Logistic Regression

Module 4 - Multiple Logistic Regression Module 4 - Multiple Logistic Regression Objectives Understand the principles and theory underlying logistic regression Understand proportions, probabilities, odds, odds ratios, logits and exponents Be

More information

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

More information

Validation, updating and impact of clinical prediction rules: A review

Validation, updating and impact of clinical prediction rules: A review Journal of Clinical Epidemiology 61 (2008) 1085e1094 REVIEW ARTICLE Validation, updating and impact of clinical prediction rules: A review D.B. Toll, K.J.M. Janssen, Y. Vergouwe, K.G.M. Moons* Julius Center

More information

Analyzing Structural Equation Models With Missing Data

Analyzing Structural Equation Models With Missing Data Analyzing Structural Equation Models With Missing Data Craig Enders* Arizona State University cenders@asu.edu based on Enders, C. K. (006). Analyzing structural equation models with missing data. In G.

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional

More information