Bibliography: Evaluating Predictive Models
Last update: 29 July 2007

General

Alpaydin E. Introduction to Machine Learning. MIT Press, 2004.
An excellent introduction to the field of machine learning. As with most books on machine learning, the emphasis is on classification. Chapter 14 describes the assessment and comparison of classification algorithms.

Altman DG, Royston P. What do we mean by validating a prognostic model? Statistics in Medicine 2000; 19.
This paper examines (i) what is meant by validation of prognostic models, (ii) why validation is necessary, and (iii) how validations should be carried out. The emphasis is on conceptual rather than technical issues. Validating a prognostic model is generally taken to mean showing that it works satisfactorily for patients other than those from whose data the model was derived. The authors suggest distinguishing statistical from clinical validity: statistical validity means that the model is the best that can be found with the available factors, while clinical validity means that the model predicts accurately enough for its purpose; the latter, of course, depends crucially on one's view of the aims of the model. The paper devotes considerable attention to the problem of overfitting: analyses that are not prespecified but data-dependent are known to be liable to overoptimistic conclusions, and the data-dependent aspect of most prognostic models stems from their variable selection and discretization procedures.

Duda RO, Hart PE, Stork DG. Pattern Classification. Wiley, 2nd edition, 2001.
This classic textbook from 1973 was revised and updated in 2001. It covers a broad range of pattern classification techniques. Chapter 2 discusses Bayesian decision theory.

Hand DJ. Construction and Assessment of Classification Rules. Wiley, 1997.
Chapter 6 of this book ("Aspects of evaluation") presents a framework for understanding model evaluation concepts. A distinction is made between measuring accuracy, precision, separability, and resemblance, and a large number of examples are provided; the misclassification (i.e., error) rate and the Brier score, for instance, are accuracy measures. The highly popular misclassification rate is investigated further in Chapter 7, which deals extensively with aspects of classification accuracy and discusses several cross-validation schemes (rotation, leave-one-out, bootstrap) for estimating the actual error rate.

Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.
Chapter 7 of this excellent textbook on statistical learning methods discusses model assessment and selection methods based on loss (accuracy) functions.

Mitchell TM. Machine Learning. McGraw-Hill, 1997.
Another very good textbook on machine learning. Chapter 5 considers classification accuracy, including the statistical estimation of performance.
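Two of the accuracy measures named above, the misclassification (error) rate and the Brier score, can be sketched in a few lines. This is a minimal illustration (not taken from any of the books above); labels are 0/1 and predictions are probabilities of class 1:

```python
def error_rate(y_true, y_prob, threshold=0.5):
    """Misclassification rate at a fixed probability threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    return sum(yp != yt for yp, yt in zip(y_pred, y_true)) / len(y_true)

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - yt) ** 2 for p, yt in zip(y_prob, y_true)) / len(y_true)

# four hypothetical cases: two negatives, two positives
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.7, 0.9]
print(error_rate(y_true, y_prob))   # 0.25 (one of four cases misclassified)
print(brier_score(y_true, y_prob))  # approximately 0.1175
```

Note that the error rate depends on the chosen threshold, whereas the Brier score uses the probabilities directly; this distinction matters throughout the ROC and calibration literature cited below.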
Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press, 2003.

Wasson JH, Sox HC, Neff RK, Goldman L. Clinical prediction rules. New England Journal of Medicine 1985; 313.
In this article, the following methodological standards for creating and validating clinical prediction rules are proposed:
1. The event to be predicted (outcome) should be clearly defined, preferably by biological rather than sociological or behavioral criteria;
2. Predictive findings should be defined precisely and have a similar meaning to anyone who may use them;
3. The list of predictors should not include any criteria that are used in defining the outcome ("blind assessment"; most relevant for diagnostic rules);
4. Characteristics of the patient population used to develop the rule should be clearly described;
5. The study site and type of practice where the data were gathered should be described;
6. An unbiased estimate of the rule's performance should be reported;
7. Effects of using the rule should be prospectively measured; and
8. The statistical technique that was used to derive the rule should be described.
Thirty-three publications of clinical prediction rules in four leading medical journals were reviewed against these standards. Most of the criteria were met by more than 80% of the studies. However, performance statistics were seldom reported (11 publications), and effects of clinical use were almost never prospectively measured (2 publications).

Ch. 1 Introduction: Predictive Models and Evaluation

Abu-Hanna A, Lucas PJF. Prognostic models in medicine. Methods of Information in Medicine 2001; 40: 1-5.

Wyatt JC, Altman DG. Commentary: Prognostic models: clinically useful or quickly forgotten? British Medical Journal 1995; 311.
Few prognostic models are routinely used to inform difficult clinical decisions. Wyatt and Altman believe that the main reasons why doctors reject published prognostic models are lack of clinical credibility and lack of evidence that a prognostic model can support decisions about patient care (that is, evidence of accuracy, generality, and effectiveness).

Ch. 3 Evaluating Probabilities (i) ROC analysis

Bamber D. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology 1975; 12.
This paper shows that the area under the ROC curve (AUC) equals the probability that a randomly chosen positive case receives a higher test value (or a higher prediction from a model) than a randomly chosen negative case. Furthermore, it shows that the estimated AUC is equivalent to the Mann-Whitney U statistic normalized by the number of pairs of negative and positive cases.
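Bamber's equivalence can be checked directly on a toy example: the pairwise probability estimate of the AUC and the Mann-Whitney rank-sum estimate coincide. A minimal sketch (the scores are made up; ties are counted as 1/2, as in the U statistic):

```python
def auc_pairwise(pos, neg):
    """P(randomly chosen positive scores higher than a random negative)."""
    total = sum(1.0 if x > y else 0.5 if x == y else 0.0
                for x in pos for y in neg)
    return total / (len(pos) * len(neg))

def auc_mann_whitney(pos, neg):
    """Same quantity via midranks: U normalized by the number of pairs."""
    pooled = sorted(pos + neg)
    def midrank(v):
        lo = pooled.index(v) + 1              # first 1-based position of v
        return lo + (pooled.count(v) - 1) / 2
    rank_sum = sum(midrank(x) for x in pos)
    u = rank_sum - len(pos) * (len(pos) + 1) / 2
    return u / (len(pos) * len(neg))

pos = [0.8, 0.6, 0.6, 0.9]   # hypothetical model outputs, positive cases
neg = [0.2, 0.6, 0.5]        # hypothetical model outputs, negative cases
c = auc_pairwise(pos, neg)
print(c == auc_mann_whitney(pos, neg))  # True: the two estimates agree
print(2 * (c - 0.5))                    # Somers' D_xy = 2(C - 0.5)
```

The last line applies the relation between the concordance index C and Somers' rank correlation, which is covered by the Somers (1962) entry later in this bibliography.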
DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988; 44.
A nonparametric method for comparing the areas under the ROC curves of two distinct models on the same dataset. The method is based on the theory of generalized U statistics.

Hand DJ, Till RJ. A simple generalization of the area under the ROC curve to multiple class classification problems. Machine Learning 2001; 45(2).

Hanley JA, McNeil BJ. The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology 1982; 143.

Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology 1983; 148.
A parametric method for comparing the areas under the ROC curves of two distinct models on the same dataset, assuming a Normal (Gaussian) distribution of the AUC. Superseded by the nonparametric method of DeLong et al. (1988).

Lasko TA, Bhagwat JG, Zou KH, Ohno-Machado L. The use of receiver operating characteristic curves in biomedical informatics. Journal of Biomedical Informatics 2005; 38(5).
An overview of different methods to estimate and compare areas under the ROC curve, and of software packages available for ROC analysis. The paper does not present new methods but summarizes the existing literature. The approach is practical and aims at readers who need to choose between different methods.

Metz CE. Basic principles of ROC analysis. Seminars in Nuclear Medicine 1978; 8(4).
Charles Metz was one of the people who popularized the use of ROC analysis in medical research.

Provost F, Fawcett T, Kohavi R. The case against accuracy estimation for comparing induction algorithms. Proceedings of the 15th International Conference on Machine Learning (ICML '98), 1998.
Machine learning research has traditionally concentrated on designing algorithms for building classifiers, and the predominant evaluation methodology in this field is classification accuracy (error rate) estimation. In this influential paper, the authors argue that estimating classification accuracy is insufficient for comparing competing classifiers and algorithms, because classification accuracy assumes equal misclassification costs and a known marginal class distribution. However, both the misclassification costs and the marginal class distribution can be unknown at the time the model is built, and may even vary from time to time, place to place, and situation to situation where the model is applied. For these reasons, the authors argue that classifiers and induction algorithms should be evaluated and compared using ROC analysis. Using ten datasets from the UCI repository and several standard machine learning algorithms, they show that high classification accuracy does not imply domination in ROC space. Therefore, comparing accuracies on benchmark datasets says little, if anything, about classifier performance on real-world tasks. (Note: the authors do not mention that a model may be very imprecise (badly calibrated) even though it dominates other models in ROC space. This is an imperfection of ROC analysis.)
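Provost et al.'s point can be illustrated with a toy calculation (made-up scores, not taken from the paper): the error rate of a fixed-threshold classifier changes with the class distribution of the test set, while the AUC, which depends only on the class-conditional score distributions, does not:

```python
def accuracy(cases, threshold=0.5):
    """Fraction correct at a fixed threshold; cases are (score, label) pairs."""
    return sum((s >= threshold) == (y == 1) for s, y in cases) / len(cases)

def auc(pos, neg):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in pos for y in neg)
    return wins / (len(pos) * len(neg))

pos = [0.9, 0.7, 0.4]   # positives: one falls below the 0.5 threshold
neg = [0.3, 0.2, 0.1]   # negatives: all below the threshold

balanced = [(s, 1) for s in pos] + [(s, 0) for s in neg]
skewed = [(s, 1) for s in pos] + [(s, 0) for s in neg] * 9  # 1:9 class ratio

print(auc(pos, neg))       # 1.0: the ranking is perfect in both settings
print(accuracy(balanced))  # 5/6 on the balanced test set
print(accuracy(skewed))    # 29/30 when negatives dominate
```

The same scores yield two different accuracies depending on the prevalence of negatives, yet a perfect AUC throughout; accuracy alone would therefore rank this classifier differently across deployment settings.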
Somers RH. A new asymmetric measure of association for ordinal variables. American Sociological Review 1962; 27.
Somers' rank correlation Dxy is a nonparametric measure of association between ordinal variables, and is related to the concordance index C (the nonparametric AUC) as follows: Dxy = 2(C - 0.5).

Ch. 3 Evaluating Probabilities (ii) Accuracy of probabilities

Ash A, Shwartz M. R2: a useful measure of model performance when predicting a dichotomous outcome. Statistics in Medicine 1999; 18.

Brier GW. Verification of weather forecasts expressed in terms of probability. Monthly Weather Review 1950; 78: 1-3.
The principal reference for the Brier inaccuracy score.

Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ. Probabilistic Networks and Expert Systems. Berlin: Springer-Verlag, 1999.
Chapter 10 of this textbook on probabilistic network models considers the problem of checking models against data. Although some of the methods are specific to Bayesian networks, others are general (Bayesian) statistical tools for evaluating predictive models. Particular attention is paid to the logarithmic score (deviance).

Mittlböck M, Schemper M. Explained variation for logistic regression. Statistics in Medicine 1996; 15.
Mittlböck and Schemper review 12 statistical measures that have been proposed to quantify the explained variation of a binary predictive model (in contrast to what the title suggests, none of the measures is restricted to use in conjunction with logistic regression). Six measures are based on the correlation of estimated probabilities and observed outcomes (e.g., Pearson correlation and Somers' D), four are based on reduction in dispersion of the outcome (e.g., sum-of-squares R2, Gini index, classification error), and two are based on model likelihood (likelihood ratio and Nagelkerke R2).

Nagelkerke NJD. A note on a general definition of the coefficient of determination. Biometrika 1991; 78(3).
Nagelkerke proposes to use the ratio of the log likelihoods of a binary predictive model (e.g., logistic regression) and the 'null' (intercept-only) model as a measure of the variation explained by the model. This performance statistic is reported by some statistical packages (e.g., SAS) as 'Nagelkerke R2'. Although the statistic has some attractive properties (e.g., consistency with classical R2, consistency with maximum likelihood estimation, independence of sample size), there are serious problems with its interpretation (see, e.g., Mittlböck and Schemper, 1996).

Redelmeier DA, Bloch DA, Hickam DH. Assessing predictive accuracy: How to compare Brier scores. Journal of Clinical Epidemiology 1991; 44(11).
This paper presents a statistical method to compare the Brier scores of two different sets of predictive assessments (predicted probabilities) on a single test set. The method is an extension of Spiegelhalter's test of whether a given Brier score is incompatible with the observed outcomes. A problem with the comparison method is that the test statistic depends on the true, unknown probabilities. To solve this problem, the authors suggest using the mean of both predictions. The paper contains a small example based on probability judgements of five medical students who independently reviewed the symptoms and electrocardiograms of 25 patients with recurrent chest pain.

Ch. 6 Assessing the Fit of a Model

le Cessie S, van Houwelingen JC. A goodness-of-fit test for binary regression models, based on smoothing methods. Biometrics 1991; 47.

Hosmer DW, Hosmer T, le Cessie S, Lemeshow S. A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine 1997; 16.
An experimental comparison of several goodness-of-fit tests for the logistic regression model.

Hosmer DW, Lemeshow S. Goodness-of-fit tests for the multiple logistic regression model. Communications in Statistics 1980; A10.
This article presents the original Hosmer-Lemeshow goodness-of-fit test for the logistic regression model. The test is based on a statistic C that sums squared Pearson residuals over g (usually 10) risk groups, where the grouping is based either on fixed values of the estimated probabilities or on percentiles of the estimated probabilities (the latter approach is often preferable). It was experimentally shown that the distribution of the statistic C is well approximated by the χ² distribution with g - λ - 1 degrees of freedom, where λ = 1 if C is computed from the training data set, and λ = 0 otherwise. If C is large, this indicates that, at least in part of the feature space, the estimated probabilities strongly deviate from the true probabilities. The Hosmer-Lemeshow goodness-of-fit test as described here has been implemented in many statistical packages and is routinely applied in epidemiological research, even though it was later shown that the statistic may be unstable and the test therefore unreliable.

Hosmer DW, Lemeshow S.
Applied Logistic Regression. New York: John Wiley & Sons, 2nd edition, 2000.
Chapter 5 of this textbook deals extensively with the evaluation of logistic regression models. The evaluation methods described are partly specific to logistic regression models (e.g., goodness-of-fit tests) and partly generic (e.g., ROC analysis).

Miller ME, Langefeld CD, Tierney WM, Hui SL, McDonald CJ. Validation of probabilistic predictions. Medical Decision Making 1993; 13(1).

Moons E, Aerts M, Wets G. A tree-based lack-of-fit for multiple logistic regression. Statistics in Medicine 2004; 23.

Ch. 7 Model Validation

Bleeker SE, Moll HA, Steyerberg EW, Donders AR, Derksen-Lubsen G, Grobbee DE, Moons KG. External validation is necessary in prediction research: a clinical example. Journal of Clinical Epidemiology 2003; 56(9).
This case study in pediatric diagnostic management (predicting bacterial infections) shows that for relatively small data sets, internal validation of prediction models by bootstrap techniques may not suffice to indicate the model's performance in future patients. External validation is essential before implementing prediction models in clinical practice.

Davison AC, Hinkley DV. Bootstrap Methods and their Application. Cambridge: Cambridge University Press, 1997.

Gant V, Rodway S, Wyatt JC. Artificial neural networks: practical considerations for clinical applications. In: Dybowski R, Gant V, eds. Clinical Applications of Artificial Neural Networks. Cambridge: Cambridge University Press, 2001.

Hadorn DC, Draper D, Rogers WH, Keeler EB, Brook RH. Cross-validation performance of mortality prediction models. Statistics in Medicine 1992; 11(4).
An early study on the performance of different modelling techniques (linear regression, logistic regression, Cox regression, CART) in predicting death after acute myocardial infarction. Similar, but more rigorous, studies were conducted by Steyerberg et al.

Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Annals of Internal Medicine 1999; 130.

Miller ME, Hui SL, Tierney WM. Validation techniques for logistic regression models. Statistics in Medicine 1991; 10(8).
This paper presents a comprehensive approach to the validation of logistic prediction models. It reviews measures of overall goodness-of-fit, and indices of calibration and refinement. Using a model-based approach developed by Cox, logistic regression diagnostic techniques are adapted for use in model validation. This allows identification of problematic predictor variables in the prediction model, as well as of influential observations in the validation data that adversely affect the fit of the model. In appropriate situations, recommendations are made for the correction of models that provide poor fit.
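The grouped goodness-of-fit checks discussed above (the Hosmer-Lemeshow statistic in particular) can be sketched as follows. This is a minimal illustration with percentile-based grouping, assuming 0/1 outcomes, and is not a replacement for a statistical package; the resulting statistic C is referred to a χ² distribution with g - λ - 1 degrees of freedom as described in the Hosmer-Lemeshow entry:

```python
def hosmer_lemeshow_c(y_true, y_prob, g=10):
    """Hosmer-Lemeshow C: cases are split into g groups by percentiles of
    the estimated probability, and squared differences between observed
    and expected event counts are summed, scaled by the binomial variance."""
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i])
    bounds = [round(k * len(order) / g) for k in range(g + 1)]
    c = 0.0
    for k in range(g):
        idx = order[bounds[k]:bounds[k + 1]]
        if not idx:
            continue
        nk = len(idx)
        observed = sum(y_true[i] for i in idx)
        p_bar = sum(y_prob[i] for i in idx) / nk   # mean predicted risk
        if 0.0 < p_bar < 1.0:
            c += (observed - nk * p_bar) ** 2 / (nk * p_bar * (1.0 - p_bar))
    return c

# hypothetical toy data whose event counts match the predicted risks exactly
y_prob = [0.125, 0.25, 0.25, 0.375, 0.625, 0.75, 0.75, 0.875]
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
print(hosmer_lemeshow_c(y_true, y_prob, g=2))  # 0.0: no lack of fit detected
```

With the outcomes flipped, the same probabilities produce a large C, signalling that the estimated probabilities deviate strongly from the observed event rates.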
Peek N, Arts DG, Bosman RJ, Van der Voort PH, De Keizer NF. External validation of prognostic models for critically ill patients required substantial sample sizes. Journal of Clinical Epidemiology 2007; 60(5).
This study considers the behavior of predictive performance measures that are commonly used in the external validation of prognostic models. A resampling scheme was used to investigate the effects of sample size; the domain of application was intensive care. The AUC and Brier score showed large variation with small samples, and it was found that substantial sample sizes are required for performance assessment and model comparison in external validation. Standard errors of AUC values were accurate, but the power to detect differences in performance was low. Calibration statistics and the associated significance tests are extremely sensitive to sample size and should not be used in these settings; instead, Cox's customization method for repairing lack-of-fit problems is recommended. Direct comparison of performance, without statistical analysis, was unreliable with either measure.
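The bootstrap procedure for internal validation evaluated in the Steyerberg et al. studies below can be sketched as follows. This is a deliberately simplified illustration: the "model" is a hypothetical constant predictor (the observed event rate) and performance is measured by the Brier score, so that the optimism-correction logic stands out; a real application would refit the full modelling procedure, including variable selection, on each bootstrap sample:

```python
import random

def fit(y):
    """Hypothetical 'model': predict the event rate of the training data."""
    return sum(y) / len(y)

def brier(p, y):
    """Brier score of a constant prediction p (lower is better)."""
    return sum((p - yi) ** 2 for yi in y) / len(y)

def optimism_corrected_brier(y, n_boot=200, seed=1):
    """Apparent performance plus the bootstrap estimate of its optimism:
    refit on each bootstrap sample, and average the performance difference
    between the original sample and the bootstrap sample."""
    rng = random.Random(seed)
    apparent = brier(fit(y), y)
    optimism = 0.0
    for _ in range(n_boot):
        boot = [rng.choice(y) for _ in y]      # resample with replacement
        p = fit(boot)
        optimism += brier(p, y) - brier(p, boot)
    return apparent + optimism / n_boot

outcomes = [0] * 12 + [1] * 8                  # toy data set, event rate 0.4
apparent = brier(fit(outcomes), outcomes)
corrected = optimism_corrected_brier(outcomes)
print(apparent)    # apparent (resubstitution) Brier score
print(corrected)   # slightly worse, after correcting for optimism
```

The corrected estimate is worse (higher) than the apparent one, reflecting the point made throughout this chapter: performance determined on the development sample itself is optimistic.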
Schwarzer G, Vach W, Schumacher M. On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology. Statistics in Medicine 2000; 19.
Schwarzer et al. present a critical review of applications of artificial neural networks (ANNs) in biomedicine. The flexibility of ANNs is often cited as an advantage, but the authors argue that it must be seen as a major concern. Several common pitfalls are discussed (e.g., fitting implausible functions, incorrect modelling of survival data, and biased estimation of network accuracy), and a review of the literature on ANN applications in oncology is presented. Many of the 43 applications discussed show (severe) methodological weaknesses.

Steyerberg EW, Harrell FE, Borsboom GJJM, Eijkemans MJC, Vergouwe Y, Habbema JDF. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. Journal of Clinical Epidemiology 2001; 54(8): 774-781.
The performance of a predictive model is overestimated when it is simply determined on the sample of subjects that was used to construct the model. Several internal validation methods are available that aim to provide a more accurate estimate of model performance in new subjects. This study evaluated several variants of split-sample, cross-validation, and bootstrapping methods with a logistic regression model that included eight predictors of 30-day mortality after acute myocardial infarction. Random samples of varying size were drawn from a large data set. Split-sample analyses gave overly pessimistic estimates of performance, with large variability. Cross-validation on 10% of the sample had low bias and low variability, but was not suitable for all performance measures. Internal validity could best be estimated with bootstrapping, which provided stable estimates with low bias.

Steyerberg EW, Bleeker SE, Moll HA, Grobbee DE, Moons KG. Internal and external validation of predictive models: a simulation study of bias and precision in small samples. Journal of Clinical Epidemiology 2003; 56(5).
A simulation study investigating the accuracy of bootstrap estimates of optimism (internal validation) and the precision of performance estimates in independent validation samples (external validation). Random samples were drawn from a data set on infectious diseases in children for the development (n=376) and validation (n=179) of logistic regression models. Model development, including the selection of predictors, and validation were repeated in a bootstrapping procedure. The average apparent ROC area was 0.74, which was expected (based on bootstrapping) to decrease by 0.07 to 0.67, whereas the observed decrease in the validation samples was 0.09, to 0.65. Omitting the selection of predictors from the bootstrap procedure led to a severe underestimation of the optimism (estimated decrease 0.006). The standard error of the observed ROC area in the independent validation samples was large (0.05). Hence, for external validation, substantial sample sizes should be used to obtain sufficient power to detect clinically important changes in performance compared with the internally validated estimate.

Steyerberg EW, Borsboom GJ, van Houwelingen HC, Eijkemans MJ, Habbema JD. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Statistics in Medicine 2004; 23(16).
A logistic regression model may be used to provide predictions of outcome for individual patients at a centre other than the one where the model was developed. When empirical data are available from this centre, the validity of predictions can be assessed by comparing observed outcomes and predicted probabilities, and the model may subsequently be updated to improve predictions for future patients. In this study, a previously published model for predicting 30-day mortality after acute myocardial infarction was validated and updated with external validation samples that varied in size. Heuristic shrinkage approaches were applied in the model revision methods, such that regression coefficients were shrunken towards their recalibrated values. Parsimonious updating methods were found preferable to more extensive model revisions, which should only be attempted with relatively large validation samples and in combination with shrinkage.

Terrin N, Schmid CH, Griffith JL, D'Agostino RB, Selker HP. External validity of predictive models: a comparison of logistic regression, classification trees, and neural networks. Journal of Clinical Epidemiology 2003; 56(8).
A simulation study that compared the external validity of standard logistic regression (LR1), logistic regression with piecewise-linear and quadratic terms (LR2), classification trees, and neural networks (NNETs). Predictive models were developed on data simulated from a specified population and on data from perturbed forms of the population not representative of the original distribution. All models were tested on new data generated from the population. The performance of LR2 was superior to that of the other model types both when the models were developed on data sampled from the population and when they were developed on nonrepresentative data. However, when the models developed using nonrepresentative data were compared with models developed from data sampled from the population, LR2 had the greatest loss in performance. These results highlight the necessity of external validation to test the transportability of predictive models.

Vergouwe Y, Steyerberg EW, Eijkemans MJC, Habbema JDF. Validity of prognostic models: When is a model clinically useful? Seminars in Urologic Oncology 2002; 20(2).
Vergouwe et al. distinguish three aspects of the validity of prognostic models: (1) agreement between predicted probabilities and observed probabilities (calibration), (2) the ability of the model to distinguish subjects with different outcomes (discrimination), and (3) the ability of the model to improve the decision-making process (clinical usefulness). Several techniques for visualizing and quantifying calibration and discrimination are discussed. Clinical usefulness is inspected by considering the classification accuracy, sensitivity, and specificity of the model (after choosing a classification threshold), and by estimating the expected decrease in disutility when the model is applied in practice. This is done by comparing the model's classifications with conventional policy, weighing false-positive and false-negative classified patients according to their relative severity.

Vergouwe Y, Steyerberg EW, Eijkemans MJ, Habbema JD. Substantial effective sample sizes were required for external validation studies of predictive logistic regression models. Journal of Clinical Epidemiology 2005; 58(5).
A simulation study in the field of oncology (predicting the probability that residual masses of patients treated for metastatic testicular cancer contain only benign tissue) suggests that a minimum of 100 events and 100 nonevents is required for external validation samples.

Zhu B-P, Lemeshow S, Hosmer DW, Klar J, Avrunin J, Teres D. Factors affecting the performance of the models in the Mortality Probability Model II system and strategies of customization: A simulation study. Critical Care Medicine 1996; 24: 57-63.
An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset
P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang
Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
Guide to Biostatistics
MedPage Tools Guide to Biostatistics Study Designs Here is a compilation of important epidemiologic and common biostatistical terms used in medical research. You can use it as a reference guide when reading
Evaluation of Predictive Models
Evaluation of Predictive Models Assessing calibration and discrimination Examples Decision Systems Group, Brigham and Women s Hospital Harvard Medical School Harvard-MIT Division of Health Sciences and
College Readiness LINKING STUDY
College Readiness LINKING STUDY A Study of the Alignment of the RIT Scales of NWEA s MAP Assessments with the College Readiness Benchmarks of EXPLORE, PLAN, and ACT December 2011 (updated January 17, 2012)
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
Statistics in Retail Finance. Chapter 6: Behavioural models
Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural
Better credit models benefit us all
Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis
COURSE PLAN BDA: Biomedical Data Analysis Master in Bioinformatics for Health Sciences. 2015-2016 Academic Year Qualification.
COURSE PLAN BDA: Biomedical Data Analysis Master in Bioinformatics for Health Sciences 2015-2016 Academic Year Qualification. Master's Degree 1. Description of the subject Subject name: Biomedical Data
203.4770: Introduction to Machine Learning Dr. Rita Osadchy
203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:
REPORT DOCUMENTATION PAGE
REPORT DOCUMENTATION PAGE Form Approved OMB NO. 0704-0188 Public Reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities
Another Look at Sensitivity of Bayesian Networks to Imprecise Probabilities Oscar Kipersztok Mathematics and Computing Technology Phantom Works, The Boeing Company P.O.Box 3707, MC: 7L-44 Seattle, WA 98124
Comparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
Examining a Fitted Logistic Model
STAT 536 Lecture 16 1 Examining a Fitted Logistic Model Deviance Test for Lack of Fit The data below describes the male birth fraction male births/total births over the years 1931 to 1990. A simple logistic
Risk pricing for Australian Motor Insurance
Risk pricing for Australian Motor Insurance Dr Richard Brookes November 2012 Contents 1. Background Scope How many models? 2. Approach Data Variable filtering GLM Interactions Credibility overlay 3. Model
Model Validation Techniques
Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost
Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests
Logistic Regression http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Overview Binary (or binomial) logistic regression is a form of regression which is used when the dependent is a dichotomy
Predicting Bankruptcy of Manufacturing Firms
Predicting Bankruptcy of Manufacturing Firms Martin Grünberg and Oliver Lukason Abstract This paper aims to create prediction models using logistic regression and neural networks based on the data of Estonian
Statistical Rules of Thumb
Statistical Rules of Thumb Second Edition Gerald van Belle University of Washington Department of Biostatistics and Department of Environmental and Occupational Health Sciences Seattle, WA WILEY AJOHN
Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)
Cairo University Faculty of Economics and Political Science Statistics Department English Section Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study) Prepared
Local classification and local likelihoods
Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor
Solving Regression Problems Using Competitive Ensemble Models
Solving Regression Problems Using Competitive Ensemble Models Yakov Frayman, Bernard F. Rolfe, and Geoffrey I. Webb School of Information Technology Deakin University Geelong, VIC, Australia {yfraym,brolfe,webb}@deakin.edu.au
Predictive Modeling Techniques in Insurance
Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics
Direct Marketing Response Models using Genetic Algorithms
From: KDD-98 Proceedings. Copyright 1998, AAAI (www.aaai.org). All rights reserved. Direct Marketing Response Models using Genetic Algorithms Siddhartha Bhattacharyya Information and Decision Sciences
Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs
Why High-Order Polynomials Should Not be Used in Regression Discontinuity Designs Andrew Gelman Guido Imbens 2 Aug 2014 Abstract It is common in regression discontinuity analysis to control for high order
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.
Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are
SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing [email protected]
SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing [email protected] IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way
Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three
S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY
S03-2008 The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY ABSTRACT Predictive modeling includes regression, both logistic and linear,
BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL
The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
Data Mining Techniques for Prognosis in Pancreatic Cancer
Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree
Curriculum for to the PhD Program in Pharmacy Administration
Curriculum for to the PhD Program in Pharmacy Administration Course Hours Course Title BSE 5113 3 Principles of Epidemiology BSE 5163 3 Biostatistics Methods I* BSE 5173 3 Biostatistics Methods II* BSE
Marketing Mix Modelling and Big Data P. M Cain
1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored
Statistical issues in the analysis of microarray data
Statistical issues in the analysis of microarray data Daniel Gerhard Institute of Biostatistics Leibniz University of Hannover ESNATS Summerschool, Zermatt D. Gerhard (LUH) Analysis of microarray data
Validation of measurement procedures
Validation of measurement procedures R. Haeckel and I.Püntmann Zentralkrankenhaus Bremen The new ISO standard 15189 which has already been accepted by most nations will soon become the basis for accreditation
Fairfield Public Schools
Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity
Chapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar
Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar Prepared by Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com [email protected]
International Journal of Computer Trends and Technology (IJCTT) volume 4 Issue 8 August 2013
A Short-Term Traffic Prediction On A Distributed Network Using Multiple Regression Equation Ms.Sharmi.S 1 Research Scholar, MS University,Thirunelvelli Dr.M.Punithavalli Director, SREC,Coimbatore. Abstract:
Least Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
Data Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan
Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg
Building risk prediction models - with a focus on Genome-Wide Association Studies Risk prediction models Based on data: (D i, X i1,..., X ip ) i = 1,..., n we like to fit a model P(D = 1 X 1,..., X p )
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
Data Mining and Machine Learning in Bioinformatics
Data Mining and Machine Learning in Bioinformatics PRINCIPAL METHODS AND SUCCESSFUL APPLICATIONS Ruben Armañanzas http://mason.gmu.edu/~rarmanan Adapted from Iñaki Inza slides http://www.sc.ehu.es/isg
Simple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
Validation, updating and impact of clinical prediction rules: A review
Journal of Clinical Epidemiology 61 (2008) 1085e1094 REVIEW ARTICLE Validation, updating and impact of clinical prediction rules: A review D.B. Toll, K.J.M. Janssen, Y. Vergouwe, K.G.M. Moons* Julius Center
Getting insights about life cycle cost drivers: an approach based on big data inspired statistical modelling
Introduction A Big Data applied to LCC Conclusion, Getting insights about life cycle cost drivers: an approach based on big data inspired statistical modelling Instituto Superior Técnico, Universidade
Regularized Logistic Regression for Mind Reading with Parallel Validation
Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland
Optimization of technical trading strategies and the profitability in security markets
Economics Letters 59 (1998) 249 254 Optimization of technical trading strategies and the profitability in security markets Ramazan Gençay 1, * University of Windsor, Department of Economics, 401 Sunset,
Strategies for Identifying Students at Risk for USMLE Step 1 Failure
Vol. 42, No. 2 105 Medical Student Education Strategies for Identifying Students at Risk for USMLE Step 1 Failure Jira Coumarbatch, MD; Leah Robinson, EdS; Ronald Thomas, PhD; Patrick D. Bridge, PhD Background
USING LOGIT MODEL TO PREDICT CREDIT SCORE
USING LOGIT MODEL TO PREDICT CREDIT SCORE Taiwo Amoo, Associate Professor of Business Statistics and Operation Management, Brooklyn College, City University of New York, (718) 951-5219, [email protected]
CHAPTER THREE COMMON DESCRIPTIVE STATISTICS COMMON DESCRIPTIVE STATISTICS / 13
COMMON DESCRIPTIVE STATISTICS / 13 CHAPTER THREE COMMON DESCRIPTIVE STATISTICS The analysis of data begins with descriptive statistics such as the mean, median, mode, range, standard deviation, variance,
STA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! [email protected]! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
