Development and validation of a prediction model with missing predictor data: a practical approach


Journal of Clinical Epidemiology 63 (2010) 205–214

Development and validation of a prediction model with missing predictor data: a practical approach

Yvonne Vergouwe a,*, Patrick Royston b, Karel G.M. Moons a, Douglas G. Altman c

a Julius Center for Health Sciences and Primary Care, University Medical Centre Utrecht, Str 6.131, P.O. Box 85500, 3508 GA, Utrecht, The Netherlands
b MRC Clinical Trials Unit, London, United Kingdom
c Cancer Research UK/NHS Centre for Statistics in Medicine, Oxford, United Kingdom

Accepted 30 March 2009

Abstract

Objective: To illustrate the sequence of steps needed to develop and validate a clinical prediction model when missing predictor values have been multiply imputed.

Study Design and Setting: We used data from consecutive primary care patients suspected of deep venous thrombosis (DVT) to develop and validate a diagnostic model for the presence of DVT. Missing values were imputed 10 times with the MICE conditional imputation method. After the selection of predictors and transformations for continuous predictors according to three different methods, we estimated regression coefficients and performance measures.

Results: The three methods to select predictors and transformations of continuous predictors showed similar results. Rubin's rules could easily be applied to estimate regression coefficients and performance measures, once predictors and transformations were selected.

Conclusion: We provide a practical approach for model development and validation with multiply imputed data. © 2010 Elsevier Inc. All rights reserved.

Keywords: Missing values; Multiple imputation; Clinical prediction models; Model development; Model validation; Deep venous thrombosis

1. Introduction

Interest in multivariable prediction models for diagnostic and prognostic research has grown over the past decade.
This work is supported by the Netherlands Organization for Scientific Research Grant ZON-MW (Y. Vergouwe and K.G.M. Moons); UK Medical Research Council (P. Royston); and Cancer Research UK (D.G. Altman).
* Corresponding author. Tel.: + ; fax: + . E-mail address: y.vergouwe@umcutrecht.nl (Y. Vergouwe).

Prediction models enable physicians to explicitly convert combinations of multiple predictor values into an estimated absolute risk of disease presence (in the case of diagnosis) or of the occurrence of a disease-related event (in the case of prognosis). Prediction models are developed with data of patients from a development set, often using multivariable regression analysis. The models are subsequently validated in new, similar patients (a validation set) [1,2].

Missing observations are almost universally encountered in clinical data sets, no matter how strictly studies have been designed or how hard investigators try to prevent them. The easiest way to deal with missing values is to exclude all patients with a missing value on any of the considered variables. Such a complete case analysis may sacrifice useful information and may cause biased results [3,4]. Imputation based on observed patient characteristics (conditional imputation) has been advocated to deal with the missing values [3]. To take the uncertainty of the imputed values into account, missing values should be imputed multiple (m) times, for which several iterative algorithms are available. The resulting m completed data sets are each analyzed separately by standard methods, and the m results are combined into one final point estimate and variance, with the standard error equal to the square root of the variance [3,5]. Combining the m results is straightforward when a single analysis is considered. The m point estimates are averaged, and the m variances can be combined, taking the variability between the m data sets into account with a components-of-variance argument (Rubin's rules) [3].
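The pooling just described can be sketched in a few lines. This is an illustrative helper of our own (names are not from the paper), assuming the m analyses each yield a point estimate and a squared standard error:

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their within-imputation variances
    with Rubin's rules (illustrative helper, not the authors' code)."""
    q = np.asarray(estimates, dtype=float)   # m point estimates
    u = np.asarray(variances, dtype=float)   # m squared standard errors
    m = len(q)
    q_bar = q.mean()                          # pooled point estimate
    u_bar = u.mean()                          # within-imputation variance
    b = q.var(ddof=1)                         # between-imputation variance
    t = u_bar + (1 + 1 / m) * b               # total variance
    return q_bar, t, np.sqrt(t)               # estimate, variance, SE
```

The same helper applies whether the quantity of interest is a regression coefficient or a performance measure, which is the principle used throughout the paper.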
The development of a prediction model follows a sequence of steps [6], including selection of predictors, selection of transformations for continuous predictors [7,8], and estimation of the regression coefficients. Hence, model development with multiply imputed data is not straightforward and is seldom illustrated. Here, we demonstrate the development and validation of a prediction model obtained with logistic

regression in the presence of multiply imputed data. In one model, continuous predictors are modeled with transformations if necessary; in a second model, the continuous predictors are dichotomized. Further, three different methods to select predictors and transformations are applied. We also encountered another practical problem, typical of real-life data: two continuous predictors were recorded partly as dichotomous and partly as continuous. We impute the continuous values by using the observed value for the dichotomous variable and the distribution of the continuous variable where available. Two empirical data sets on the diagnosis of deep venous thrombosis (DVT) are used, with minor to major percentages of missing predictor values: one data set to develop the model [9] and one to validate the model [10].

2. Data sets on the diagnosis of DVT

2.1. Empirical data

We used the data of 2,086 consecutive primary care patients suspected of DVT. The data originated from a large cross-sectional diagnostic study that was performed between January 1, 2002 and January 1, 2006 among over 100 primary care physicians in The Netherlands. For specific details and main results of the diagnostic study, we refer to the literature [9,10]. In brief, suspicion of DVT was based on swelling, redness, or pain of the lower extremities. Information was systematically documented on patient history and physical examination. Blood samples were taken to determine D-dimer plasma concentration. D-dimer is a fibrin degradation fragment that is related to the presence of thrombotic diseases, such as DVT. DVT was considered present if one of the proximal deep veins of the symptomatic leg was not completely compressible on repeated compression ultrasonography.
For the present analysis, we used the data of the first 1,295 patients [9] for model development and the data of the subsequent 791 patients to validate the various prediction models [10]. The model validation was therefore a temporal validation [1,2]. Two hundred and eighty-nine patients of the development set had DVT (22%) and 131 patients (17%) of the validation set. A description of the candidate predictors is given in Table 1 for both data sets. Ten candidate predictors were studied for model development, of which four were continuous: age, duration of main symptoms, difference in calf circumference, and D-dimer value.

2.2. Imputation of missing values

The percentage of missing values in the development set ranged from % for age to 39% for calf circumference difference and 55% for the D-dimer value (Table 1). Initially, the difference in calf circumference was only reported as a dichotomous value, that is, >3 or ≤3 cm. Similarly, the D-dimer plasma concentration was first only provided by the lab as a dichotomous test result, that is, >500 or ≤500 ng/ml. Later on, the continuous calf circumferences and D-dimer plasma concentrations were provided. This explains the high percentage of missing data on the continuous calf circumference and D-dimer test result in the development set. The number of missing values for the dichotomous calf circumference and D-dimer test result was lower, because the values were either recorded dichotomously or the recorded continuous values could be dichotomized. A complete case analysis for model development including the continuous values of the D-dimer test and calf circumference used only 326 patients (25% of the data). A complete case analysis with the dichotomous values used 934 patients (72%). Before missing values were imputed, we studied the missing data mechanism [11]. We created indicator variables for missing values for each variable with missing data.
Fitted logistic regression models with the indicator variable as outcome and the other variables as covariates showed that missingness was for some variables associated with observed values. Explained variation of the missingness as estimated with the regression models varied between 36% for

Table 1. Distribution of candidate predictors to diagnose DVT. Columns: development set (N = 1,295), n (%) and % missing data; validation set (N = 791), n (%) and % missing data. Some values were lost in transcription.

Female gender 826 (63) (61)
Age, yr (a) 61 (34–82) 61 (36–82)
Oral contraceptive use 123 (10) (9) 2.0
Duration of symptoms, d (a) 5 (1–20) (1–15) 15.0
Malignancy present 77 (6) (5) 8.9
Recent surgery 162 (13) (12) 9.1
Vein distension 229 (18) (17) 13.0
Leg trauma present 186 (14) (17) 3.4
Calf circumference difference, cm (a) 2 (0–4) (0–4) 28.6
Calf circumference difference >3 cm (b) 500 (39) (40) 7.7
D-dimer, ng/ml (a) 886 (160–6,288) (220–4,881) 18.8
D-dimer >500 ng/ml (b) 838 (65) (58) 18.8

(a) Continuous candidate predictor, median (10th and 90th centiles).
(b) Dichotomous value recorded or continuous observed value dichotomized.

malignancy present and recent surgery to 2% for duration of symptoms. We can assume that the missing values were at least partly missing at random (MAR), and imputation of the missing values may reduce bias. Missing not at random (MNAR) can unfortunately never be excluded, because this mechanism depends on unobserved variables.

We can distinguish two types of missingness for the continuous D-dimer and calf circumference values: either dichotomous values are observed or values are completely missing. In the former situation, the dichotomous information was used to impute the missing continuous values. All candidate predictors plus the outcome variable were used in the multiple imputation of missing values [12]. Ten imputations were performed using the methods described in Appendix A. Transformations of continuous variables were considered to enhance the flexibility of the imputation models. Simulation studies have shown that the required number of repeated imputations (m) can be as low as three for data with 20% of missing entries [13]. We had two predictors with approximately twice this percentage, and decided that 10 repeated imputations (i.e., m = 10) would be a conservative choice. Unless rates of missing information are unusually high, there tends to be little or no practical benefit to using more than 10 imputations [11].

3. Model development and model validation

3.1. Model development in general

When developing a prediction model, various issues and choices need to be addressed. We briefly discuss three common steps in the development of prediction models. First, the number of candidate predictors is commonly too large to include them all in the prediction model. The data to hand can be used to select predictors, for instance with a backward elimination procedure.
Second, the shape of the relation of continuous predictors with the outcome variable can be studied with nonlinear functions such as fractional polynomials (FPs) [14] and spline functions [15]. An advantage of the multivariable fractional polynomial (MFP) procedure is that selection of predictors and transformations is done simultaneously, in such a way as to preserve the nominal type 1 error probability. Third, given the selected predictors and transformations, the regression coefficients are estimated. As prediction models are developed to estimate outcome probabilities in new, similar patients, the regression coefficients from a model may benefit from being shrunk toward zero. With such shrinkage, better predictions will be found in new patients [6,16,17]. A heuristic shrinkage factor can be estimated as (χ²_model − df) / χ²_model, with χ²_model the model chi-square and df the number of degrees of freedom. The model chi-square is the difference in −2 log likelihood between a model with only an intercept and the fitted model. The number of degrees of freedom is in this case the total number of degrees of freedom considered in the process of selecting from all candidate predictors plus all considered transformations [6,17,18]. Shrinkage of the regression coefficients is particularly worthwhile if the sample size is relatively small.

3.2. Model development with multiple completed data sets

We performed the following three steps of model development in each of the multiple completed data sets: (1) backward elimination of predictors and FP transformations simultaneously (for simplicity and to maximize power, we considered only FP1 transformations); (2) estimation of regression coefficients; and (3) estimation of a heuristic shrinkage factor. Backward elimination of predictors and transformations was performed with the MFP procedure and the Akaike Information Criterion (AIC) stopping rule [19].
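The heuristic shrinkage factor (χ²_model − df) / χ²_model defined in Section 3.1 can be computed as follows. This is our illustrative Python sketch (the paper's analyses used Stata), with hypothetical names, using a near-unpenalized logistic fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def heuristic_shrinkage(X, y, df_considered):
    """Estimate (chi2_model - df) / chi2_model, where chi2_model is the
    difference in -2 log likelihood between the intercept-only model and
    the fitted model. Sketch only, not the paper's Stata code."""
    y = np.asarray(y)
    n = len(y)
    p0 = np.full(n, y.mean())                 # intercept-only predictions
    dev0 = 2 * n * log_loss(y, p0)            # -2 log likelihood, null model
    fit = LogisticRegression(C=1e10, max_iter=1000).fit(X, y)
    dev1 = 2 * n * log_loss(y, fit.predict_proba(X)[:, 1])
    chi2_model = dev0 - dev1                  # model chi-square
    # df_considered: ALL degrees of freedom considered during selection
    return (chi2_model - df_considered) / chi2_model
```

Multiplying the estimated coefficients (not the intercept) by this factor shrinks predictions toward the mean, as recommended when the sample size is small.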
This rule corresponds to a P-value of 0.157 for a predictor with one degree of freedom [20]. An outline of the MFP procedure can be found in the literature [8,21,22]; see also Appendix B. To select the predictors and transformations, three different methods were applied. The first method (WALD) used Wald statistics based on Rubin's rules (for details, see Appendix B); the variable with the lowest Wald statistic was eliminated from the model. The second method (majority) involved applying backward elimination to each of the 10 completed data sets separately, resulting in 10 sets of selected predictors and transformations; the final set comprised those predictors and transformations that were selected in 50% or more of the 10 data sets. The third method (STACK) involved stacking the 10 data sets into a single large data set with 10 × n records, which was then analyzed as one data set with weighted regression (Appendix B). WALD and STACK each produced a single set of predictors and transformations, whereas the majority method necessitated a majority vote to choose the set of predictors and transformations.

Given the finally selected predictors and transformations for each of the three selection methods, a model was fitted in each of the 10 completed data sets. We used Rubin's rules to combine the estimated regression coefficients and variances from the 10 different completed data sets (see also Appendix C) [3]. A heuristic shrinkage factor was estimated, as described above, for each of the 10 models, and the 10 shrinkage factors were averaged.

3.3. Model validation with multiple completed data sets

Validation of prediction models includes the estimation of performance measures in the development data set and in a validation set. We studied calibration graphically with predicted risks on the X-axis and the observed outcomes on the

Y-axis (calibration plot). The corresponding calibration line was described via a logistic regression model with the observed outcome regressed on the linear predictor (the log odds of the predicted risk) [23]. The slope and intercept of the calibration line are ideally 1 and 0, respectively (perfect calibration). Discrimination was studied with the concordance (c) index [6,24]. Further, the squared Pearson correlation between the predicted probability and the binary outcome was estimated as a measure of explained variation.

Each of the 10 completed development data sets gave a (different) set of regression coefficients. Per development data set, the corresponding regression coefficients were used to calculate the predicted risk of each patient in that data set. The predicted risks were then compared with the observed outcomes to estimate the model performance, such as calibration and discrimination. The 10 performance estimates were averaged and their variances pooled according to Rubin's rules. Because the independent validation set also contained incomplete patient records, the same multiple imputation procedure was used to complete the records before estimating the model performance in the validation set. In each of the 10 completed validation data sets, we applied the final model from the development phase; that is, selection of predictors was based on one of the three methods described before, and the same averaged regression coefficients were applied to all 10 completed data sets. This resulted in 10 performance estimates that were averaged using Rubin's rules.

4. Case study: DVT

4.1. Model development

We first examine only patients with complete cases and then the completed data sets obtained with multiple imputation. We illustrate the model development methods in the presence of multiply imputed data as described in Section 3.2 with the data sets on the diagnosis of DVT.
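The validation measures of Section 3.3 can be estimated per completed data set along the following lines. This is a sketch of our own with simulated data; note that the intercept here comes from the same model as the slope, whereas some authors instead estimate the intercept with the slope fixed at 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def validation_measures(y, predicted_risk):
    """Calibration slope/intercept and c-index for one completed data set
    (illustrative sketch, not the authors' code)."""
    lp = np.log(predicted_risk / (1 - predicted_risk))  # linear predictor
    cal = LogisticRegression(C=1e10, max_iter=1000).fit(lp.reshape(-1, 1), y)
    slope = float(cal.coef_[0, 0])          # ideally 1
    intercept = float(cal.intercept_[0])    # ideally 0
    c_index = roc_auc_score(y, predicted_risk)  # c-index = AUC for binary y
    return slope, intercept, c_index
```

Under Rubin's rules, this function would be called once per completed data set and the 10 resulting estimates averaged, with their variances pooled.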
We consider a model that may contain continuous predictors, modeled with the MFP algorithm, and a second model with only dichotomous predictors; the continuous predictors (calf circumference, D-dimer, age, and duration of symptoms) are dichotomized for this purpose.

4.1.1. Complete case analysis

The model including continuous predictors, analyzed with the MFP procedure (MFP model), was based on 326 patients and contained four selected predictors, three of which were continuous, with log transformations in two (Table 2). The model with dichotomous predictors (dichotomous model) was based on 934 patients. Seven predictors were selected; yet the model discriminated much less well than the MFP model with only four predictors. For comparison, we also developed a dichotomous model on the 326 patients with completely observed continuous predictor values (Table 2). The two dichotomous models contained mainly the same predictors. Standard errors were smaller for the regression coefficients of the model that was based on the larger group of patients, as

Table 2. Regression coefficients, standard errors, and powers for the predictors selected with backward elimination and the AIC stopping rule on the complete cases. Columns: MFP (N = 326), Beta (SE) and power; Dichotomous (N = 934), Beta (SE); Dichotomous (N = 326), Beta (SE). Some values were lost in transcription.

Female gender 0 92 (0.184) 93 (0.331)
Age, yr 29 (10) 1 NA NA
Age >61 yr NA 97 (0.187) 0
Oral contraceptive use (71)
Duration of symptoms, d 0 NA NA
Duration of symptoms >5 d NA (0.186) 86 (0.305)
Malignancy present
Recent surgery
Vein distension 0 83 (15) 0
Leg trauma present (0.596) NA 12 (70) (94)
Calf circumference difference, cm 32 (a) (0.108) 0 NA NA
Calf circumference difference >3 cm NA (0.188) (0.306)
D-dimer, ng/ml (b) (25) 0 NA NA
D-dimer >500 ng/ml NA (0.595) (21)
Constant (1.303) NA (05) (44)
Performance: c-index (19) 22 (15) 41 (22)
Performance: R² 50 (54) 28 (28) 76 (45)

MFP: continuous predictors are modelled with multivariable fractional polynomials; Dichotomous: continuous predictors are dichotomised; Beta: regression coefficient; SE: standard error; power: selected power transformation in the MFP analysis, where 0 stands for natural logarithm; NA: not applicable.
(a) Predictor is scaled and transformed: log ((calf circumference difference + 1)/10).
(b) Predictor is scaled and transformed: log (D-dimer/100,000).

expected. The performance measures were higher for the model that was based on only 326 patients. This could be a result of greater optimism or bias in the performance measures caused by the selected subgroup with completely observed values.

Table 3. Predictor and power selection with the MFP procedure that allows continuous predictors, in the 10 completed data sets separately and for the majority, WALD, and STACK methods (N = 1,295 in each completed data set). Columns: completed data sets 1–10, then majority, then WALD/STACK. Some entries were lost in transcription.

Female gender x x x x x x x x x x x x
Age, yr
Oral contraceptive use x x x x x x x – x x x x
Duration of symptoms, d 1 3 – – – – 1 1 –
Malignancy present – – x – – – x – – – – –
Recent surgery – – – x – – – – – – – –
Vein distension x – x x x x x x x x x x
Leg trauma present x x x x x x x x x x x x
Calf circumference difference, cm
D-dimer, ng/ml

x = dichotomous variable is selected; – = variable is not selected; number = continuous variable is selected with the given power (0 = natural logarithm; 1 = linear; 3 = cubic).

4.1.2. Analysis with the multiple completed data sets

The three different procedures for selecting predictors and transformations in the model using continuous predictors gave very similar results (Table 3). The majority method selected one additional predictor, that is, duration of symptoms, compared with the STACK and WALD methods. The selection across the 10 completed data sets was very consistent (Table 3). The predictors selected with the majority method were also selected in all or nearly all individual data sets; the other predictors were selected in only one or two individual data sets. When the four continuous predictors were dichotomized, all 10 candidate predictors were selected with all three selection procedures. The majority method selected all predictors in all 10 data sets, except recent surgery in data set 7 and malignancy present in data sets 5 and 6.
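The majority rule just described (keep what is selected in at least half of the completed data sets) can be sketched as follows; the predictor names here are invented for illustration:

```python
from collections import Counter

def majority_selection(selected_sets, m=None):
    """selected_sets: one set of selected predictor names per completed
    data set. Returns predictors chosen in >= 50% of the m sets
    (a sketch of the majority rule, not the authors' code)."""
    m = m or len(selected_sets)
    counts = Counter(p for s in selected_sets for p in s)
    return {p for p, c in counts.items() if c >= m / 2}

# hypothetical selections from m = 4 completed data sets
sets_ = [{"ddimer", "calf", "sex"}, {"ddimer", "calf"},
         {"ddimer", "calf", "age"}, {"ddimer", "calf", "sex"}]
print(sorted(majority_selection(sets_)))  # -> ['calf', 'ddimer', 'sex']
```

The same vote can be taken over (predictor, power) pairs to settle the functional form of a continuous predictor.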
Table 4. Regression coefficients, standard errors, and powers for the predictors selected with backward elimination and the AIC stopping rule on the multiple completed data sets (N = 1,295 in each completed data set). Columns: MFP, WALD/STACK method (a), Beta (SE) and power; Dichotomous, WALD/STACK/majority method (b), Beta (SE). Some values were lost in transcription.

Female gender 88 (0.198) NA 68 (0.167)
Age, yr 17 (06) 1 NA
Age >61 yr NA 16 (0.171)
Oral contraceptive use 53 (0.341) NA (86)
Duration of symptoms, d 0 NA
Duration of symptoms >5 d NA (0.167)
Malignancy present 0 76 (93)
Recent surgery (18)
Vein distension 56 (17) NA 10 (0.181)
Leg trauma present (81) NA 47 (33)
Calf circumference difference, cm (c) (02) 0 NA
Calf circumference difference >3 cm NA (0.164)
D-dimer test, ng/ml (d) (0.133) 0 NA
D-dimer test >500 ng/ml NA (73)
Constant (37) NA (93)
Performance: c-index 75 (15) 20 (14)
Performance: R² (37) 26 (24)

MFP: continuous predictors are modelled with multivariable fractional polynomials; Dichotomous: continuous predictors are dichotomised; Beta: regression coefficient; SE: standard error; power: selected power transformation in the MFP analysis, where 0 stands for natural logarithm; NA: not applicable.
(a) The WALD and STACK methods resulted in identical MFP models. The majority method included one extra predictor, the linear association of duration of symptoms (beta = 14); regression coefficients of the other predictors were very similar.
(b) The WALD, STACK, and majority methods resulted in identical dichotomized models.
(c) Predictor is scaled and transformed: log ((calf circumference difference + 1)/10).
(d) Predictor is scaled and transformed: log (D-dimer/100,000).

The extra selected

predictors could apparently compensate, at least partly, for the information lost by dichotomizing the continuous predictors (Table 4). The predictor duration of symptoms that was additionally selected with the majority method had a weak effect (beta = 14), changed the regression coefficients of the other predictors only slightly, and increased the performance minimally (the c-index did not change and R² increased from to 0.368). The performance of the dichotomous model was lower than that of the MFP models even though more predictors were included. This shows that keeping the continuous predictors continuous (in particular the D-dimer test and calf circumference) is important for the prediction of DVT. Imputation of the missing continuous values, partly with the dichotomous information, was therefore particularly necessary.

In comparison with the models developed with the complete cases (Table 2), the models developed with the completed data sets contained more predictors, possibly as a result of increased power. The standard errors (estimated as the square root of the pooled variances) were in general smaller. Furthermore, in the complete case analysis, nonlinearity was not detected for the predictor calf circumference. The performance of the model with dichotomous predictors is clearly inferior despite the inclusion of additional predictors, as indicated by a lower c-index and a lower R². The performance of the models derived from the completed data sets was less good than that of the models derived from the complete case analysis. This may be the result of optimism or selection bias in the small sample (n = 326), because a similar discrepancy in performance was shown for the two dichotomous models that were developed with two different sample sizes (n = 326 and n = 934).
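The conditional imputation of continuous values from their observed dichotomous versions (e.g. D-dimer recorded only as >500 vs. ≤500 ng/ml) can be mimicked with a simple hot-deck-style draw from donors on the matching side of the cut-off. This stand-in is our own and is cruder than the regression-based imputation actually used (Appendix A):

```python
import numpy as np

def impute_from_dichotomy(cont, dich, cutoff, rng):
    """Fill NaNs in `cont` by resampling observed donor values lying on the
    side of `cutoff` indicated by the dichotomous record `dich` (True means
    above the cut-off). Hot-deck sketch, not the paper's method."""
    cont = np.asarray(cont, dtype=float).copy()
    dich = np.asarray(dich, dtype=bool)
    observed = ~np.isnan(cont)
    for above in (True, False):
        donors = cont[observed & ((cont > cutoff) == above)]
        fill = ~observed & (dich == above)
        if fill.any():
            cont[fill] = rng.choice(donors, size=fill.sum())
    return cont
```

The key point the sketch shares with the paper's approach is that an imputed continuous value is constrained to be consistent with the observed dichotomous record.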
The heuristic shrinkage factors varied between and for the MFP models fitted to the completed data sets, with a mean value of . The shrinkage factor for the dichotomized model varied between and , with a mean of . These values are close to 1 and indicate little optimism in the models and little need to compensate for regression to the mean.

4.2. Model validation in independent data

4.2.1. Complete case analysis

The number of patients on which the analyses were performed depended on the number of completely observed cases for the different predictors in the models (Table 5). The c-index and R² of the MFP models were slightly higher than in the development data (Table 4), whereas the c-index and R² of the dichotomous model were lower.

4.2.2. Analysis with the multiple completed data sets

The estimates after imputation (Table 6) were somewhat closer to the estimates for the development data (Table 4). The c-index of the MFP models was again slightly higher than in the development data (Table 4), whereas the c-index of the dichotomous model and the R² of all three models were lower. Assessment of calibration showed similar results in the complete case analysis (Table 5) and the completed data sets (Table 6), which is confirmed in Fig. 1. The MFP models showed calibration slopes larger than 1, indicating that the regression coefficients may not be extreme enough, whereas the dichotomous model showed a slope smaller than 1. In general, predictions and observed proportions of DVT were in good agreement for the two MFP models. Predictions above 25% were too high for the dichotomous model. The broader range of predicted risks of DVT for the MFP models compared with the predicted risks of the dichotomous model is in agreement with the higher c-index for the MFP models.

5. Discussion

Missing data are commonplace in clinical studies. The main message we wish to bring out here is that good statistical methods are available to enable credible, practical analyses of such data sets.
It is often unclear from reports whether a prediction model was developed or validated in the presence of missing data. Authors usually ignore cases with missing observations and perform complete case analyses. More recently, awareness has been growing of the usefulness of multiple imputation methodology, a powerful

Table 5. Predictive performance of the three developed models estimated with complete case analysis in the validation data. Columns: MFP W/S (N = 418), MFP majority (N = 369), Dichotomous (N = 629). Values are estimates (SE); some values were lost in transcription.

Calibration slope (0.146) (0.148) (0.108)
Calibration intercept (0.163) 72 (0.170) (0.119)
c-index 90 (19) 86 (20) (24)
R² 46 (36) 52 (38) 93 (16)

MFP W/S: continuous predictors are modeled with the MFP procedure, predictors are selected with the WALD or STACK method; MFP majority: continuous predictors are modeled with the MFP procedure, predictors are selected with the majority method; Dichotomous: continuous predictors are dichotomized, predictors are selected with the WALD, STACK, or majority method.

Table 6. Predictive performance of the three developed models estimated in the multiply imputed validation data. Columns: MFP W/S, MFP majority, Dichotomous. Values are means of 10 estimates (SE); some values were lost in transcription.

Calibration slope (0.113) (0.114) 06 (0.103)
Calibration intercept (0.117) (0.117) (0.108)
c-index 79 (17) 82 (17) (21)
R² (41) (41) (25)

MFP W/S: continuous predictors are modeled with the MFP procedure, predictors are selected with the WALD or STACK method; MFP majority: continuous predictors are modeled with the MFP procedure, predictors are selected with the majority method; Dichotomous: continuous predictors are dichotomized, predictors are selected with the WALD, STACK, or majority method.

Fig. 1. Calibration curves corresponding to different models and data sets: MFP WALD/STACK, MFP majority, and dichotomous models applied in the incomplete data set (A, B, and C, respectively) and in one completed data set (D, E, and F). The dotted line indicates perfect calibration; the solid line shows the relation between observed and predicted values. Triangles indicate observed proportions for five quintile-based risk groups with 95% confidence limits. Below the main plot, the vertical lines indicate distributions of the predicted risks by outcome (present or absent).

approach pioneered by Rubin and others, to handle such data efficiently [3,5,13]. Several mainstream statistical packages now offer well-developed software for creating multiple completed data sets and analyzing them. Nevertheless, several important issues remain open [25], including identifying satisfactory methods of model development and validation. A guiding principle is to be found in Rubin's rules: a quantity of interest (be it a regression coefficient or a performance measure) should be estimated in each of m completed data sets, together with its variance, and pooled over the m data sets, using Rubin's rules, to give a single estimate and variance (see Appendix C).

Predictor and transformation selection in multivariable regression analysis are commonly based on likelihood ratio (LR) tests. A previously proposed approximation of the LR test for multiply imputed data [26] showed disappointing performance in the presence of nonlinear correlations [27]. We therefore used three other methods for predictor selection in developing a prediction model: WALD (based on Wald statistics for pooled estimates), a majority method, and STACK (a proposed weighted regression method) [28]. In the present data set, the predictors and transformations selected with the three methods were very similar (Table 3).
However, this is a practical case study, and generalization of the results is not possible. Each of the three methods has its own advantages. Estimation of the Wald statistic in the WALD method follows Rubin's rules and is a sound and well-established approach. However, it was recently shown that the use of Wald statistics to select the power in an FP model can result in biased estimates [27]. The majority method gives much insight into the variability between the completed data sets. Variability may be found not only in the predictors selected, but also in the selection of powers for one particular continuous predictor, which results in different functional forms (Table 3). If predictor and transformation selection is based on the majority method, more than 10 imputations may be necessary to obtain stable results. The big advantage of the STACK method is that only one data set needs to be analyzed. The analysis leads directly to a single set of selected predictors, corresponding regression coefficients, and standard errors.

We used the AIC as the stopping rule in the backward elimination procedure, which corresponds to a P-value of 0.157 for variables with one degree of freedom (i.e., dichotomous variables that are modeled with one regression coefficient). Other P-values are also regularly used in predictor and

transformation selection, either more standard values such as 0.05 and 0.1, or higher values up to 0.5. In large data sets with strong predictors, the 0.05 level suffices. In small data sets, more liberal P-values are advocated to increase the probability that real predictors are selected, at the expense of also selecting more noise variables [29]. An alternative to the traditional non-Bayesian variable selection applied here for multiply imputed data is an approach that draws on the Bayesian frameworks of multiple imputation and variable selection [30]. This approach has been applied for selection of variables in linear regression models. We focused in this paper on the traditional approach, because we believe that this is of greatest practical relevance to many data analysts.

One step in model development that we did not consider in our analysis is internal validation. The development data are resampled, and different samples can be used for development and validation. The most efficient procedure is bootstrapping [31,32]. It is unclear how resampling should be applied in the presence of multiply imputed data. Each completed data set can be bootstrapped, or each bootstrap sample with missing values can be imputed. Further research is necessary on this topic [33].

Few researchers have estimated model performance measures in multiply imputed data. Consistent with Rubin's rules, we applied the model to the patients of each completed data set, which resulted in 10 predicted risks per patient. Predicted risks were based on one single model, but on 10 different predictor values for predictors with a missing value. Accordingly, the 10 estimates of performance measures (i.e., c-index or R²) and variances were pooled. Another approach would be to average the 10 predicted risks of each patient, which would result in one performance estimate [34].
This approach yielded slightly higher estimates in our data. For instance, the estimates of the slope, c-index, and R² were 1.315, 0.901, and for the MFP W/S model in the external validation data (vs , 79, and 0.329). Another possible approach is to report all 10 estimates, or the median and range [35].

As is often the case, the models were dominated by a small number of strongly influential predictors: here, D-dimer test result and difference in calf circumference. Models in which these predictors were dichotomized performed less well, although extra predictors were selected. This confirms the conclusion of an earlier paper that dichotomization of continuous predictors is an unwise strategy [36]. We are aware that the clinical interpretation of variables that enter a model continuously may be less straightforward, particularly when a transformation has been applied. Graphical representations can overcome this problem: the modeled association can easily be plotted with the predictor value on the X-axis against the outcome value on the Y-axis.

If we assume that the missing data were missing at random (MAR), the results after multiple imputation are less biased than the results of a complete case analysis. The MAR assumption states that the probability that a data value is missing depends only on values of variables that were actually observed. In other words, we assumed that the missingness of a variable does not depend on the values of variables on which we did not obtain data. Conceptually, it also excludes dependence of the occurrence of missing values on the true, but unobserved, value of the variable itself (missing not at random, MNAR) [11]. In general, the MAR mechanism is assumed to make imputation possible. Recently, a tool has been developed to investigate the MAR assumption: the index of sensitivity to nonignorability [37]. However, the assumption cannot be formally tested, because the true values cannot be observed.
An important step in imputing the missing data is the specification of the imputation models; this is an explicit attempt to model the MAR process. Imputation models were specified for each candidate predictor with missing data, irrespective of the quantity of missing data. All candidate predictors (10 in total) and the presence of DVT entered the imputation models [13]. We did not consider extra variables for the imputation models, because we did not expect a substantial increase in explained variance beyond the 10 candidate predictors and the outcome variable.

In the development set, some proportions of missing values were high: 55% and 39% for D-dimer test and calf circumference, respectively. We could perform a conditional imputation, because for most missing continuous values a dichotomized value was observed. This rather special situation also occurs in cancer data. Concentrations of markers such as estrogen and progesterone receptors in breast cancer were formerly recorded dichotomously as low vs. high; when assessment of the markers became more common, actual concentrations were recorded. Here, conditional imputation might also be applied to the older data.

In conclusion, this case study illustrated methods to deal with missing values in the development and validation process of a clinical prediction model. We found that multiple imputation and the corresponding Rubin's rules can be used for such analyses. Further experience with these methods in other empirical data sets is needed to formulate general guidelines for prediction modeling in the presence of missing data.

Appendix A. Imputation method

Multiple imputation was performed using the ice program [38] for Stata, an implementation of the MICE regression switching algorithm [39]. MICE requires specification of conditional models for each incomplete variable given all other variables. m = 10 completed data sets were created, each using 10 cycles of regression switching.
Imputations for each continuous variable were drawn from a normal approximation to the posterior distribution of the corresponding conditional model. Logistic models were used for imputing binary predictors. After preliminary investigation, calf difference and D-dimer were log transformed to approximate normality before imputation began, and were included in the conditional imputation models for the other variables as linear and quadratic terms. Continuous values of log((calf difference + 1)/10) and log(D-dimer/100,000) were imputed when only dichotomized values were available, by sampling from normal distributions truncated at the known cut-off values. Otherwise, missing values for these variables were imputed by assuming complete normal distributions. Parameter estimates for all regression models were combined across imputations using the micombine command [38], which applies Rubin's rules.

Appendix B. Methods of predictor and transformation selection with the MFP procedure in multiple completed data sets

Brief outline of the MFP procedure

The MFP approach to building regression models combines selection of predictors with determination of functional relationships for continuous predictors. Predictors are selected by backward elimination, using either a conventional stopping rule such as P < 0.05 for testing the statistical significance of a predictor, or the AIC (see below).

Consider FP modeling of a single continuous predictor, x. Usually one chooses between FP2, FP1, linear, or null functions of x, because functions more complex than FP2 are rarely needed. Sometimes, for example to maximize statistical power or to impose monotonicity on a functional relationship, FP1 may be the most complex function considered. For exponents (powers) p, q in the set {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, with 0 denoting a logarithmic transformation, an FP2 function has the two possible forms β0 + β1x^p + β2x^q or β0 + β1x^p + β2x^p log(x), the former when p ≠ q and the latter when p = q. An FP1 function is β0 + β1x^p, or β0 + β1 log(x) when p = 0. Significance testing for selecting an FP function and for selecting x uses a sequence of LR tests, as follows.
FP2 is compared with the null model (β0 only) on 4 df; if this test is not significant, x is dropped. If significant, FP2 is further compared with a linear function on 3 df. The linear function is chosen if this test is not significant; otherwise, the final test is between the FP2 and FP1 functions, on 2 df. Because more liberal significance levels have been advocated for predictor selection in prognostic model development, we used the AIC, which corresponds to a P-value of 0.157 for 1 df. Using the AIC to select an FP function involves comparing the penalized log likelihoods of the FP2, FP1, linear, and null models for x. The AIC is defined as (-2 log likelihood) + (2 × df). Ignoring β0, the dfs of these four models are 4, 2, 1, and 0, respectively. The model with the lowest AIC is selected.

To choose a final multivariable model from a set of predictors, the MFP algorithm uses a sequence of such tests (using either significance testing or AIC minimization) in an iterative, back-fitting manner. Predictors are first ordered according to their statistical significance in a full linear model. Then each predictor is visited in turn, and the procedure just described for selecting a predictor or FP function is applied. Currently selected predictors and (where necessary) their FP functions are included in the models. The procedure continues until there is no further change in the selected predictors and functions, typically taking two to three cycles to completion.

WALD: Wald tests based on Rubin's rules

If the log likelihood is quadratic, which is exactly true in a normal errors model and approximately true in other models with small coefficients, then Wald and LR tests are equivalent. As an approximation, we used Wald test statistics as if they were LR statistics. For example, let the pooled regression estimate for a single predictor x be β and the pooled standard error be s, obtained by applying Rubin's rules to the estimates from the m completed data sets. The Wald χ² statistic for testing β = 0 is (β/s)².
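The pooled Wald statistic just described can be sketched in a few lines. This is an illustrative helper under Rubin's rules (the function and variable names are ours, not the paper's): the point estimates are averaged, the total variance combines within- and between-imputation components, and the Wald χ² is the squared ratio of the pooled estimate to its pooled standard error.

```python
import math

def pooled_wald_chi2(estimates, variances):
    """Pool one regression coefficient over m imputations with Rubin's
    rules and return the Wald chi-squared statistic for testing beta = 0.
    `estimates` and `variances` hold the per-imputation values."""
    m = len(estimates)
    q_bar = sum(estimates) / m                               # pooled estimate
    w = sum(variances) / m                                   # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = w + (1 + 1 / m) * b                                  # total variance (Rubin's rules)
    return (q_bar / math.sqrt(t)) ** 2                       # (beta / s)^2

# Hypothetical coefficient for one predictor from m = 10 completed data sets.
est = [0.52, 0.49, 0.55, 0.51, 0.48, 0.53, 0.50, 0.54, 0.47, 0.52]
var = [0.010] * 10
print(round(pooled_wald_chi2(est, var), 1))  # → 24.3
```

Treating this χ² as if it were an LR statistic is exactly the approximation the WALD method makes.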
The AIC of a model with df degrees of freedom was then defined as (-Wald χ² statistic) + (2 × df).

STACK: a weighted regression method

We vertically stacked the 10 completed data sets for the 1,295 or 791 patients into one large data set of length 12,950 or 7,910. Fitting models to this single stacked data set, ignoring its special structure, yields valid parameter estimates but standard errors that are too small. To correct the standard error of a regression coefficient for a predictor x, we used the weight

w = (1 - q(x)) / m,

where q(x), the fraction of missing data for x, equals the number of missing values of x divided by n. The weights were used in regression models on x, applied so that the log likelihood was scaled by w. In Stata, such a scheme is known as importance weighting. It provides an approximate adjustment for the multiple data sets and for the missing data. Further justification, results, and discussion of STACK have been given recently [40].

Because STACK provides a type of log likelihood for any given model for x, conditional on other covariates, it also yields an AIC, defined as usual as (-2 log likelihood) + (2 × df). Note that because q(x) will (almost always) vary across the xs, STACK does not impart a meaningful overall likelihood to a multivariable model. However, the MFP procedure requires only a likelihood for models involving different functions of x; contributions to the likelihood from other covariates are irrelevant.
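The STACK weight is a one-line computation. A minimal sketch (our illustration; the missing-value count below is hypothetical, chosen to match the reported 55% missing D-dimer in the development set of n = 1,295):

```python
def stack_weight(n_missing: int, n: int, m: int) -> float:
    """Weight for the stacked analysis of predictor x:
    w = (1 - q(x)) / m, where q(x) = n_missing / n is the
    fraction of missing values of x."""
    q = n_missing / n
    return (1 - q) / m

# Hypothetical count: 712 of 1,295 patients missing D-dimer (q ≈ 0.55),
# stacked over m = 10 imputations.
w_ddimer = stack_weight(n_missing=712, n=1295, m=10)
# A fully observed predictor gets w = 1/m, i.e., 0.1 for m = 10.
w_complete = stack_weight(n_missing=0, n=1295, m=10)
print(round(w_ddimer, 3), w_complete)  # → 0.045 0.1
```

The weight shrinks toward zero as the fraction of missing (hence imputed, partly redundant) values of x grows, which is what inflates the standard errors back toward their correct size.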

Appendix C. Combining the results from the m different completed data sets

The computation of the multiple imputation point estimate and variance given the m (here, m = 10) completed data sets follows Rubin's rules [41]. Let Qi and Wi denote the point estimate and variance, respectively, from the ith (i = 1, ..., 10) completed data set. The multiple imputation point estimate Q* of Q is the arithmetic mean of the 10 completed-data estimates. The estimated variance T of Q* is obtained by a components-of-variance argument, leading to the formula

T = W̄ + (1 + 1/m) B,

where W̄ is the within-imputation variance, W̄ = (1/m) Σ(i=1..m) Wi, and B is the between-imputation variance, B = (1/(m-1)) Σ(i=1..m) (Qi - Q*)².

References

[1] Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med 1999;130:515-24.
[2] Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med 2000;19:453-73.
[3] Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.
[4] Little RA. Regression with missing X's: a review. J Am Stat Assoc 1992;87:1227-37.
[5] Schafer JL. Analysis of incomplete multivariate data. London: Chapman & Hall/CRC Press; 1997.
[6] Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996;15:361-87.
[7] Harrell FE Jr, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modelling strategies for improved prognostic prediction. Stat Med 1984;3:143-52.
[8] Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J R Stat Soc A 1999;162:71-94.
[9] Oudega R, Moons KG, Hoes AW. Ruling out deep venous thrombosis in primary care. A simple diagnostic algorithm including D-dimer testing. Thromb Haemost 2005;94:200-5.
[10] Toll D, Oudega R, Vergouwe Y, Moons K, Hoes A. A new diagnostic rule for deep vein thrombosis: safety and efficiency in clinically relevant subgroups. Fam Pract 2008;25:3-8.
[11] Schafer JL. Multiple imputation: a primer. Stat Methods Med Res 1999;8(1):3-15.
[12] Moons KG, Donders RA, Stijnen T, Harrell FE Jr. Using the outcome for imputation of missing predictor values was preferred. J Clin Epidemiol 2006;59:1092-101.
[13] van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 1999;18:681-94.
[14] Royston P, Altman DG. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling (with discussion). Appl Stat 1994;43:429-67.
[15] Harrell FE Jr. Regression modeling strategies. With applications to linear models, logistic regression, and survival analysis. New York: Springer-Verlag; 2001.
[16] Spiegelhalter DJ. Probabilistic prediction in patient management and clinical trials. Stat Med 1986;5:421-33.
[17] Copas JB. Regression, prediction and shrinkage. J R Stat Soc B 1983;45:311-54.
[18] van Houwelingen HC, Thorogood J. Construction, validation and updating of a prognostic model for kidney graft survival. Stat Med 1995;14:1999-2008.
[19] Atkinson AC. A note on the generalized information criterion for choice of a model. Biometrika 1980;67:413-8.
[20] Sauerbrei W. The use of resampling methods to simplify regression models in medical statistics. Appl Stat 1999;48:313-29.
[21] Royston P, Sauerbrei W. Building multivariable regression models with continuous covariates in clinical epidemiology, with an emphasis on fractional polynomials. Methods Inf Med 2005;44:561-71.
[22] Royston P, Sauerbrei W. Multivariable model-building: a pragmatic approach to regression analysis based on fractional polynomials for continuous variables. Chichester: Wiley; 2008.
[23] Miller ME, Hui SL. Validation techniques for logistic regression models. Stat Med 1991;10:1213-26.
[24] Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29-36.
[25] White IR, Wood AM, Royston P. Multiple imputation in practice. Stat Methods Med Res; in press.
[26] Meng X-L, Rubin DB. Performing likelihood ratio tests with multiply-imputed data sets. Biometrika 1992;79:103-11.
[27] Royston P, White IR, Wood AM. Building multivariable fractional polynomial models in multiply imputed data. Submitted; available on request.
[28] Wood AM, White IR, Royston P. How should variable selection be performed with multiply imputed data? Stat Med 2008;27:3227-46.
[29] Steyerberg EW, Eijkemans MJC, Harrell FE Jr, Habbema JDF. Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets. Stat Med 2000;19:1059-79.
[30] Yang X, Belin TR, Boscardin WJ. Imputation and variable selection in linear regression models with missing covariates. Biometrics 2005;61:498-506.
[31] Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman & Hall; 1993.
[32] Steyerberg EW, Harrell FE Jr, Borsboom GJJ, Eijkemans MJC, Vergouwe Y, Habbema JDF. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001;54:774-81.
[33] Heymans MW, van Buuren S, Knol DL, van Mechelen W, de Vet HC. Variable selection under multiple imputation using the bootstrap in a prognostic study. BMC Med Res Methodol 2007;7:33.
[34] Burd RS, Jang TS, Nair SS. Predicting hospital mortality among injured children using a national trauma database. J Trauma 2006;60:792-801.
[35] Clark TG, Altman DG. Developing a prognostic model in the presence of missing data: an ovarian cancer case study. J Clin Epidemiol 2003;56(1):28-37.
[36] Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006;25:127-41.
[37] Troxel AB, Ma G, Heitjan DF.
An index of sensitivity to nonignorability. Stat Sin 2004;14:1221-37.
[38] Royston P. Multiple imputation of missing values: update of ice. Stata Journal 2005;5:527-36.
[39] van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 1999;18(6):681-94.
[40] Wood AM, White IR, Royston P. How should variable selection be performed with multiply imputed data? Stat Med 2008;27(17):3227-46.
[41] Rubin DB. Multiple imputation for nonresponse in surveys. New York: Wiley; 1987.


More information

THE RISK DISTRIBUTION CURVE AND ITS DERIVATIVES. Ralph Stern Cardiovascular Medicine University of Michigan Ann Arbor, Michigan. stern@umich.

THE RISK DISTRIBUTION CURVE AND ITS DERIVATIVES. Ralph Stern Cardiovascular Medicine University of Michigan Ann Arbor, Michigan. stern@umich. THE RISK DISTRIBUTION CURVE AND ITS DERIVATIVES Ralph Stern Cardiovascular Medicine University of Michigan Ann Arbor, Michigan stern@umich.edu ABSTRACT Risk stratification is most directly and informatively

More information

Problem of Missing Data

Problem of Missing Data VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;

More information

Multivariate Logistic Regression

Multivariate Logistic Regression 1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

More information

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

More information

Comparison of Imputation Methods in the Survey of Income and Program Participation

Comparison of Imputation Methods in the Survey of Income and Program Participation Comparison of Imputation Methods in the Survey of Income and Program Participation Sarah McMillan U.S. Census Bureau, 4600 Silver Hill Rd, Washington, DC 20233 Any views expressed are those of the author

More information

Penalized regression: Introduction

Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

13. Poisson Regression Analysis

13. Poisson Regression Analysis 136 Poisson Regression Analysis 13. Poisson Regression Analysis We have so far considered situations where the outcome variable is numeric and Normally distributed, or binary. In clinical work one often

More information

Applications of R Software in Bayesian Data Analysis

Applications of R Software in Bayesian Data Analysis Article International Journal of Information Science and System, 2012, 1(1): 7-23 International Journal of Information Science and System Journal homepage: www.modernscientificpress.com/journals/ijinfosci.aspx

More information

Sensitivity Analysis in Multiple Imputation for Missing Data

Sensitivity Analysis in Multiple Imputation for Missing Data Paper SAS270-2014 Sensitivity Analysis in Multiple Imputation for Missing Data Yang Yuan, SAS Institute Inc. ABSTRACT Multiple imputation, a popular strategy for dealing with missing values, usually assumes

More information

AP Physics 1 and 2 Lab Investigations

AP Physics 1 and 2 Lab Investigations AP Physics 1 and 2 Lab Investigations Student Guide to Data Analysis New York, NY. College Board, Advanced Placement, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks

More information

Time series analysis as a framework for the characterization of waterborne disease outbreaks

Time series analysis as a framework for the characterization of waterborne disease outbreaks Interdisciplinary Perspectives on Drinking Water Risk Assessment and Management (Proceedings of the Santiago (Chile) Symposium, September 1998). IAHS Publ. no. 260, 2000. 127 Time series analysis as a

More information

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the

More information

5. Multiple regression

5. Multiple regression 5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful

More information

Simple Linear Regression Inference

Simple Linear Regression Inference Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

More information

Guide to Biostatistics

Guide to Biostatistics MedPage Tools Guide to Biostatistics Study Designs Here is a compilation of important epidemiologic and common biostatistical terms used in medical research. You can use it as a reference guide when reading

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

More information

Linear Models and Conjoint Analysis with Nonlinear Spline Transformations

Linear Models and Conjoint Analysis with Nonlinear Spline Transformations Linear Models and Conjoint Analysis with Nonlinear Spline Transformations Warren F. Kuhfeld Mark Garratt Abstract Many common data analysis models are based on the general linear univariate model, including

More information

A Bayesian hierarchical surrogate outcome model for multiple sclerosis

A Bayesian hierarchical surrogate outcome model for multiple sclerosis A Bayesian hierarchical surrogate outcome model for multiple sclerosis 3 rd Annual ASA New Jersey Chapter / Bayer Statistics Workshop David Ohlssen (Novartis), Luca Pozzi and Heinz Schmidli (Novartis)

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Cairo University Faculty of Economics and Political Science Statistics Department English Section Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Prepared

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Power and sample size in multilevel modeling

Power and sample size in multilevel modeling Snijders, Tom A.B. Power and Sample Size in Multilevel Linear Models. In: B.S. Everitt and D.C. Howell (eds.), Encyclopedia of Statistics in Behavioral Science. Volume 3, 1570 1573. Chicester (etc.): Wiley,

More information

Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

More information

Item Imputation Without Specifying Scale Structure

Item Imputation Without Specifying Scale Structure Original Article Item Imputation Without Specifying Scale Structure Stef van Buuren TNO Quality of Life, Leiden, The Netherlands University of Utrecht, The Netherlands Abstract. Imputation of incomplete

More information

Statistics in Retail Finance. Chapter 2: Statistical models of default

Statistics in Retail Finance. Chapter 2: Statistical models of default Statistics in Retail Finance 1 Overview > We consider how to build statistical models of default, or delinquency, and how such models are traditionally used for credit application scoring and decision

More information

Appendix 1: Time series analysis of peak-rate years and synchrony testing.

Appendix 1: Time series analysis of peak-rate years and synchrony testing. Appendix 1: Time series analysis of peak-rate years and synchrony testing. Overview The raw data are accessible at Figshare ( Time series of global resources, DOI 10.6084/m9.figshare.929619), sources are

More information

JUST THE MATHS UNIT NUMBER 1.8. ALGEBRA 8 (Polynomials) A.J.Hobson

JUST THE MATHS UNIT NUMBER 1.8. ALGEBRA 8 (Polynomials) A.J.Hobson JUST THE MATHS UNIT NUMBER 1.8 ALGEBRA 8 (Polynomials) by A.J.Hobson 1.8.1 The factor theorem 1.8.2 Application to quadratic and cubic expressions 1.8.3 Cubic equations 1.8.4 Long division of polynomials

More information

Credit Risk Analysis Using Logistic Regression Modeling

Credit Risk Analysis Using Logistic Regression Modeling Credit Risk Analysis Using Logistic Regression Modeling Introduction A loan officer at a bank wants to be able to identify characteristics that are indicative of people who are likely to default on loans,

More information

BookTOC.txt. 1. Functions, Graphs, and Models. Algebra Toolbox. Sets. The Real Numbers. Inequalities and Intervals on the Real Number Line

BookTOC.txt. 1. Functions, Graphs, and Models. Algebra Toolbox. Sets. The Real Numbers. Inequalities and Intervals on the Real Number Line College Algebra in Context with Applications for the Managerial, Life, and Social Sciences, 3rd Edition Ronald J. Harshbarger, University of South Carolina - Beaufort Lisa S. Yocco, Georgia Southern University

More information

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,

More information

COURSE PLAN BDA: Biomedical Data Analysis Master in Bioinformatics for Health Sciences. 2015-2016 Academic Year Qualification.

COURSE PLAN BDA: Biomedical Data Analysis Master in Bioinformatics for Health Sciences. 2015-2016 Academic Year Qualification. COURSE PLAN BDA: Biomedical Data Analysis Master in Bioinformatics for Health Sciences 2015-2016 Academic Year Qualification. Master's Degree 1. Description of the subject Subject name: Biomedical Data

More information

How To Model The Fate Of An Animal

How To Model The Fate Of An Animal Models Where the Fate of Every Individual is Known This class of models is important because they provide a theory for estimation of survival probability and other parameters from radio-tagged animals.

More information

Survival Analysis of Left Truncated Income Protection Insurance Data. [March 29, 2012]

Survival Analysis of Left Truncated Income Protection Insurance Data. [March 29, 2012] Survival Analysis of Left Truncated Income Protection Insurance Data [March 29, 2012] 1 Qing Liu 2 David Pitt 3 Yan Wang 4 Xueyuan Wu Abstract One of the main characteristics of Income Protection Insurance

More information

Model Validation Techniques

Model Validation Techniques Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

More information

Poisson Models for Count Data

Poisson Models for Count Data Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

More information

Imputation of missing data under missing not at random assumption & sensitivity analysis

Imputation of missing data under missing not at random assumption & sensitivity analysis Imputation of missing data under missing not at random assumption & sensitivity analysis S. Jolani Department of Methodology and Statistics, Utrecht University, the Netherlands Advanced Multiple Imputation,

More information

II. DISTRIBUTIONS distribution normal distribution. standard scores

II. DISTRIBUTIONS distribution normal distribution. standard scores Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

More information

The Internal Rate of Return Model for Life Insurance Policies

The Internal Rate of Return Model for Life Insurance Policies The Internal Rate of Return Model for Life Insurance Policies Abstract Life insurance policies are no longer seen solely as a means of insuring life. Due to many new features introduced by life insurers,

More information

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996)

MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL. by Michael L. Orlov Chemistry Department, Oregon State University (1996) MULTIPLE LINEAR REGRESSION ANALYSIS USING MICROSOFT EXCEL by Michael L. Orlov Chemistry Department, Oregon State University (1996) INTRODUCTION In modern science, regression analysis is a necessary part

More information

Incorporating transportation costs into inventory replenishment decisions

Incorporating transportation costs into inventory replenishment decisions Int. J. Production Economics 77 (2002) 113 130 Incorporating transportation costs into inventory replenishment decisions Scott R. Swenseth a, Michael R. Godfrey b, * a Department of Management, University

More information

Module 4 - Multiple Logistic Regression

Module 4 - Multiple Logistic Regression Module 4 - Multiple Logistic Regression Objectives Understand the principles and theory underlying logistic regression Understand proportions, probabilities, odds, odds ratios, logits and exponents Be

More information

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

More information

Validation, updating and impact of clinical prediction rules: A review

Validation, updating and impact of clinical prediction rules: A review Journal of Clinical Epidemiology 61 (2008) 1085e1094 REVIEW ARTICLE Validation, updating and impact of clinical prediction rules: A review D.B. Toll, K.J.M. Janssen, Y. Vergouwe, K.G.M. Moons* Julius Center

More information

Analyzing Structural Equation Models With Missing Data

Analyzing Structural Equation Models With Missing Data Analyzing Structural Equation Models With Missing Data Craig Enders* Arizona State University cenders@asu.edu based on Enders, C. K. (006). Analyzing structural equation models with missing data. In G.

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13

Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional

More information