Evaluation of Predictive Models




Evaluation of Predictive Models: assessing calibration and discrimination, with examples. Decision Systems Group, Brigham and Women's Hospital. Harvard Medical School, Harvard-MIT Division of Health Sciences and Technology. HST.951J: Medical Decision Support

Main Concepts: Example of a medical classification system. Discrimination: sensitivity, specificity, PPV, NPV, accuracy, ROC curves, areas under ROC curves, and related concepts. Calibration: calibration curves, Hosmer-Lemeshow goodness-of-fit.

Example I Modeling the Risk of Major In-Hospital Complications Following Percutaneous Coronary Interventions Frederic S. Resnic, Lucila Ohno-Machado, Gavin J. Blake, Jimmy Pavliska, Andrew Selwyn, Jeffrey J. Popma [Simplified risk score models accurately predict the risk of major in-hospital complications following percutaneous coronary intervention. Am J Cardiol. 2001 Jul 1;88(1):5-9.]

Background: Interventional cardiology has changed substantially since current estimates of the risk of in-hospital complications were developed (coronary stents, glycoprotein IIb/IIIa antagonists). Alternative modeling techniques may offer advantages over multiple logistic regression: prognostic risk score models are simple and applicable at the bedside; artificial neural networks offer potentially superior discrimination.

Objectives: Develop a contemporary dataset for model development, prospectively collected on all consecutive patients at Brigham and Women's Hospital, 1/97 through 2/99, with complete data on 61 historical, clinical, and procedural covariates. Develop and compare models to predict outcomes. Outcomes: death, and combined death, CABG, or MI (MACE). Models: multiple logistic regression, prognostic score models, artificial neural networks. Statistics: c-index (equivalent to area under the ROC curve). Validation of models on an independent dataset: 3/99-12/99.

Dataset: Attributes Collected (Data Sources: Medical Record, Clinician Derived, Other)
History: age, gender, diabetes, IDDM, history of CABG, baseline creatinine, CRI, ESRD, hyperlipidemia
Presentation: acute MI, primary, rescue, CHF class, angina class, cardiogenic shock, failed CABG
Angiographic: occluded, lesion type (A, B1, B2, C), graft lesion, vessel treated, ostial, number of lesions, multivessel
Procedural: number of stents, stent types (8), closure device, gp IIb/IIIa antagonists, dissection post, rotablator, atherectomy, angiojet, max pre stenosis, max post stenosis, no reflow
Operator/Lab: annual volume, device experience, daily volume, lab device experience, unscheduled case

Logistic and Score Models for Death

Risk Factor           Odds Ratio (Logistic Model)   Risk Value (Score Model)
Age > 74 yrs          2.51                          2
B2/C Lesion           2.12                          1
Acute MI              2.06                          1
Class 3/4 CHF         8.41                          4
Left main PCI         5.93                          3
IIb/IIIa Use          0.57                          -1
Stent Use             0.53                          -1
Cardiogenic Shock     7.53                          4
Unstable Angina       1.70                          1
Tachycardic           2.78                          2
Chronic Renal Insuf.  2.58                          2

Artificial Neural Networks: non-linear mathematical models that incorporate a layer of hidden nodes connected to the input layer (covariates) and the output. [Diagram: input layer I1-I4 (all available covariates), hidden layer H1-H3, output layer O1.]

Evaluation Indices

General indices: Brier score (a.k.a. mean squared error): Brier = Σ(e_i - o_i)^2 / n, where e = estimate (e.g., 0.2), o = observation (0 or 1), and n = number of cases.
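As a minimal sketch, the Brier score can be computed directly from paired estimates and outcomes (plain Python, illustrative data only):

```python
def brier_score(estimates, outcomes):
    """Mean squared error between probability estimates (0-1)
    and observed binary outcomes (0 or 1)."""
    n = len(estimates)
    return sum((e - o) ** 2 for e, o in zip(estimates, outcomes)) / n

# Lower is better: 0 for a perfect predictor,
# 0.25 for one that always outputs 0.5.
print(brier_score([0.2, 0.9, 0.4], [0, 1, 0]))  # approximately 0.07
```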

Discrimination Indices

Discrimination The system can somehow differentiate between cases in different categories Binary outcome is a special case: diagnosis (differentiate sick and healthy individuals) prognosis (differentiate poor and good outcomes)

Discrimination of Binary Outcomes: The real outcome (true outcome, also known as the "gold standard") is 0 or 1; the estimated outcome is usually a number between 0 and 1 (e.g., 0.34) or a rank. In practice, classification into category 0 or 1 is based on thresholded results (e.g., if the output or probability > 0.5, then consider the case "positive"). The threshold is arbitrary.

[Figure: overlapping "normal" and "disease" distributions plotted against model output (0 to 1.0); a threshold (e.g., 0.5) divides the outputs into True Negatives (TN), False Negatives (FN), False Positives (FP), and True Positives (TP).]

2 x 2 table (columns: true class; rows: predicted class):

           nl    D
pred nl    45    10
pred D      5    40
           50    50

Sens = TP/(TP+FN) = 40/50 = .8
Spec = TN/(TN+FP) = 45/50 = .9
PPV = TP/(TP+FP) = 40/45 = .89
NPV = TN/(TN+FN) = 45/55 = .81
Accuracy = (TN+TP)/total = 85/100 = .85
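These metrics follow mechanically from the four cells of the 2 x 2 table; a minimal sketch, with cell counts taken from the example above:

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard discrimination metrics from a 2 x 2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Counts from the example: 50 normal, 50 diseased.
m = binary_metrics(tp=40, fn=10, tn=45, fp=5)
# sensitivity 0.8, specificity 0.9, accuracy 0.85
```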

Threshold 0.4: Sensitivity = 50/50 = 1, Specificity = 40/50 = 0.8

           nl    D
pred nl    40     0   (40)
pred D     10    50   (60)
           50    50

[Figure: distributions with threshold at 0.4; regions TN, FP, TP.]

Threshold 0.6: Sensitivity = 40/50 = .8, Specificity = 45/50 = .9

           nl    D
pred nl    45    10   (55)
pred D      5    40   (45)
           50    50

[Figure: distributions with threshold at 0.6; regions TN, FN, FP, TP.]

Threshold 0.7: Sensitivity = 30/50 = .6, Specificity = 50/50 = 1

           nl    D
pred nl    50    20   (70)
pred D      0    30   (30)
           50    50

[Figure: distributions with threshold at 0.7; regions TN, FN, TP.]

Plotting the ROC curve from the three thresholds:

Threshold   Sensitivity   1 - Specificity
0.4         1.0           0.2
0.6         0.8           0.1
0.7         0.6           0.0

[Figure: the three points plotted as an ROC curve, sensitivity vs. 1 - specificity.]

[Figure: ROC curve traced over all thresholds, sensitivity vs. 1 - specificity.]

The 45-degree line indicates no discrimination; its area under the ROC is 0.5. Perfect discrimination corresponds to a curve hugging the top-left corner, with area under the ROC of 1. [Figure: example ROC curve with area = 0.86.]

What is the area under the ROC? An estimate of the discriminatory performance of the system: the real outcome is binary, the system's estimates are continuous (0 to 1), and all thresholds are considered. It is NOT an estimate of how often the system will give the right answer. It is usually a good way to describe discrimination if there is no particular trade-off between false positives and false negatives (unlike in medicine, where such trade-offs often matter); in that case, partial areas can be compared instead.
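The "all thresholds are considered" idea can be sketched by sweeping every distinct estimate as a threshold and collecting one ROC point per threshold (illustrative data, not the lecture's):

```python
def roc_points(estimates, outcomes):
    """One (1 - specificity, sensitivity) point per distinct threshold;
    a case is called positive when its estimate >= threshold."""
    pos = sum(outcomes)
    neg = len(outcomes) - pos
    pts = []
    for t in sorted(set(estimates)):
        tp = sum(1 for e, o in zip(estimates, outcomes) if e >= t and o == 1)
        fp = sum(1 for e, o in zip(estimates, outcomes) if e >= t and o == 0)
        pts.append((fp / neg, tp / pos))
    return pts

# Hypothetical estimates/outcomes:
print(roc_points([0.2, 0.6, 0.8], [0, 1, 1]))
```

Connecting these points (plus the corners (0, 0) and (1, 1)) traces the ROC curve; the area under it is the discrimination index discussed above.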

Simplified Example: the system's estimates of the probability of being sick for 10 patients (5 are healthy, 5 are sick): 0.3, 0.2, 0.5, 0.1, 0.7, 0.8, 0.2, 0.5, 0.7, 0.9.

Interpretation of the Area: divide the groups.
Healthy (real outcome is 0): 0.3, 0.2, 0.5, 0.1, 0.7
Sick (real outcome is 1): 0.8, 0.2, 0.5, 0.7, 0.9

All possible pairs 0-1: compare each healthy patient's estimate with each sick patient's estimate. A pair is concordant when the sick patient's estimate is higher (e.g., 0.3 < 0.8), discordant when the healthy patient's estimate is higher (e.g., healthy 0.7 vs. sick 0.5), and a tie when the estimates are equal (e.g., 0.2 and 0.2).

C-index: Concordant pairs = 18, Discordant = 4, Ties = 3.
C-index = (Concordant + 1/2 Ties) / All pairs = (18 + 1.5) / 25 = 0.78
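The pairwise definition translates directly to code; with the 10-patient example this reproduces the slide's value of 0.78 (plain Python sketch):

```python
def c_index(estimates, outcomes):
    """Concordant pairs plus half of the ties, over all healthy-sick pairs."""
    healthy = [e for e, o in zip(estimates, outcomes) if o == 0]
    sick = [e for e, o in zip(estimates, outcomes) if o == 1]
    score = 0.0
    for h in healthy:
        for s in sick:
            if s > h:
                score += 1.0   # concordant: sick estimate is higher
            elif s == h:
                score += 0.5   # tie
    return score / (len(healthy) * len(sick))

# The lecture's 10 patients: first 5 healthy, last 5 sick.
estimates = [0.3, 0.2, 0.5, 0.1, 0.7, 0.8, 0.2, 0.5, 0.7, 0.9]
outcomes = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(c_index(estimates, outcomes))  # 0.78
```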

[Figure: ROC curve with area = 0.78.]

Calibration Indices

Discrimination and Calibration Discrimination measures how much the system can discriminate between cases with gold standard 1 and gold standard 0 Calibration measures how close the estimates are to a real probability If the system is good in discrimination, calibration can be fixed

Calibration System can reliably estimate probability of a diagnosis a prognosis Probability is close to the real probability

What is the real probability? Binary events are YES/NO (0/1), i.e., probabilities are 0 or 1 for a given individual. Some models produce continuous (or quasi-continuous) estimates for the binary events. Example: a database of patients with spinal cord injury, and a model that predicts whether a patient will ambulate at hospital discharge. The event is 0 (doesn't walk) or 1 (walks); the model produces a probability that the patient will walk: 0.05, 0.10, ...

How close are the estimates to the true probability for a patient? True probability can be interpreted as probability within a set of similar patients What are similar patients? Clones Patients who look the same (in terms of variables measured) Patients who get similar scores from models How to define boundaries for similarity?

Estimates and Outcomes Consider pairs of estimate and true outcome 0.6 and 1 0.2 and 0 0.9 and 0 And so on

Calibration: pairs sorted by the system's estimates, grouped, and summed.

Estimates (sorted)                Real outcomes
0.1, 0.2, 0.2       sum = 0.5    0, 0, 1     sum = 1
0.3, 0.5, 0.5       sum = 1.3    0, 0, 1     sum = 1
0.7, 0.7, 0.8, 0.9  sum = 3.1    0, 1, 1, 1  sum = 3
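The sort-group-sum step can be sketched as follows (group sizes and the assumed pairing of estimates with outcomes taken from the example above):

```python
def calibration_groups(estimates, outcomes, group_sizes):
    """Sort (estimate, outcome) pairs by estimate, cut into groups,
    and return (sum of estimates, sum of outcomes) per group."""
    pairs = sorted(zip(estimates, outcomes))
    sums, start = [], 0
    for size in group_sizes:
        chunk = pairs[start:start + size]
        sums.append((sum(e for e, _ in chunk), sum(o for _, o in chunk)))
        start += size
    return sums

# The slide's pairs, grouped 3 / 3 / 4:
ests = [0.1, 0.2, 0.2, 0.3, 0.5, 0.5, 0.7, 0.7, 0.8, 0.9]
outs = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(calibration_groups(ests, outs, [3, 3, 4]))
# approximately [(0.5, 1), (1.3, 1), (3.1, 3)]
```

Plotting each group's estimated sum against its observed sum gives the calibration curve described next.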

[Figure: calibration curve, sum of system's estimates vs. sum of real outcomes; points above the 45-degree line indicate overestimation.]

[Figure: linear regression line fitted to the calibration points, compared against the 45-degree line; same axes as above.]

Goodness-of-fit: sort the system's estimates, group, sum, and compute a chi-square statistic:

χ² = Σ [(observed - estimated)² / estimated]

Estimated                         Observed
0.1, 0.2, 0.2       sum = 0.5    0, 0, 1     sum = 1
0.3, 0.5, 0.5       sum = 1.3    0, 0, 1     sum = 1
0.7, 0.7, 0.8, 0.9  sum = 3.1    0, 1, 1, 1  sum = 3

Hosmer-Lemeshow C-hat: groups based on n-tiles (e.g., terciles); n-2 d.f. in training, n d.f. in test.

Measured groups                      Mirror groups
Estimated           Observed         Estimated           Observed
0.1, 0.2, 0.2       0, 0, 1          0.9, 0.8, 0.8       1, 1, 0
  sum = 0.5           sum = 1          sum = 2.5           sum = 2
0.3, 0.5, 0.5       0, 0, 1          0.7, 0.5, 0.5       1, 1, 0
  sum = 1.3           sum = 1          sum = 1.7           sum = 2
0.7, 0.7, 0.8, 0.9  0, 1, 1, 1       0.3, 0.3, 0.2, 0.1  1, 0, 0, 0
  sum = 3.1           sum = 3          sum = 0.9           sum = 1

Hosmer-Lemeshow H-hat: groups based on n fixed thresholds (e.g., 0.3, 0.6, 0.9); n-2 d.f.

Measured groups                      Mirror groups
Estimated           Observed         Estimated           Observed
0.1, 0.2, 0.2, 0.3  0, 0, 1, 0       0.9, 0.8, 0.8, 0.7  1, 1, 0, 1
  sum = 0.8           sum = 1          sum = 3.2           sum = 3
0.5, 0.5            0, 1             0.5, 0.5            1, 0
  sum = 1.0           sum = 1          sum = 1.0           sum = 1
0.7, 0.7, 0.8, 0.9  0, 1, 1, 1       0.3, 0.3, 0.2, 0.1  1, 0, 0, 0
  sum = 3.1           sum = 3          sum = 0.9           sum = 1
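Under the mirror-group formulation shown in these slides, the statistic sums (observed - estimated)²/estimated over the measured groups and their mirrors. A sketch using the C-hat slide's terciles (grouping chosen for illustration):

```python
def hosmer_lemeshow(groups):
    """Chi-square over measured groups and their mirror groups.
    groups: list of (estimates, outcomes) tuples, one per group."""
    chi2 = 0.0
    for estimates, outcomes in groups:
        n = len(estimates)
        est, obs = sum(estimates), sum(outcomes)
        chi2 += (obs - est) ** 2 / est                     # measured group
        chi2 += ((n - obs) - (n - est)) ** 2 / (n - est)   # mirror group
    return chi2

# Terciles from the C-hat example:
groups = [
    ([0.1, 0.2, 0.2], [0, 0, 1]),
    ([0.3, 0.5, 0.5], [0, 0, 1]),
    ([0.7, 0.7, 0.8, 0.9], [0, 1, 1, 1]),
]
print(hosmer_lemeshow(groups))  # approximately 0.74
```

The resulting statistic is compared against a chi-square distribution with the degrees of freedom given above; a large value (small p) indicates poor fit.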

Covariance decomposition (Arkes et al., 1995):

Brier = d(1-d) + bias² + d(1-d)·slope·(slope-2) + scatter

where d = prior; bias is a calibration index; slope is a discrimination index; scatter is a variance index.

[Figure: covariance graph, estimated probability (e) vs. outcome index (o), with PS = .2, bias = -0.1, slope = .3, scatter = .1; ê = .6, ê₁ = .7, ê₀ = .4, ô = .7; the slope is the distance between ê₀ and ê₁.]

Logistic and Score Models for MACE

Risk Factor            Odds Ratio (Logistic Model)   Risk Value (Score Model)
Age > 74 yrs           1.42                          0
B2/C Lesion            2.44                          2
Acute MI               2.94                          2
Class 3/4 CHF          3.56                          3
Left main PCI          2.34                          2
IIb/IIIa Use           1.43                          0
Stent Use              0.56                          -1
Cardiogenic Shock      3.68                          3
USA                    2.60                          2
Tachycardic            1.34                          0
No Reflow              2.73                          2
Unscheduled            1.48                          0
Chronic Renal Insuff.  1.64                          1

Model Performance. Development set: 2804 consecutive cases, 1/97-2/99. Validation set: 1460 consecutive cases, 3/99-12/99.

                              c-index (Death)   c-index (MACE)
Multiple Logistic Regression
  Training Set                0.880             0.806
  Test Set                    0.898             0.851
  Validation Set              0.840             0.787
Prognostic Score Model
  Training Set                0.882             0.798
  Test Set                    0.910             0.846
  Validation Set              0.855             0.780
Artificial Neural Network
  Training Set                0.950             0.849
  Test Set                    0.930             0.870
  Validation Set              0.835             0.811

Model Performance. Validation set: 1460 consecutive cases, 3/1/99-12/31/99.

                              Death      MACE
Multiple Logistic Regression
  c-index Validation Set      0.840      0.787
  Hosmer-Lemeshow             16.07*     24.40*
  c-index Test Set            0.898      0.851
Prognostic Score Models
  c-index Validation Set      0.855      0.780
  Hosmer-Lemeshow             11.14*     10.66*
  c-index Test Set            0.910      0.846
Artificial Neural Networks
  c-index Validation Set      0.835      0.811
  Hosmer-Lemeshow             7.17*      20.40*
  c-index Test Set            0.930      0.870

* indicates adequate goodness of fit (prob > 0.5)

Conclusions: In this data set, the use of stents and gp IIb/IIIa antagonists was associated with a decreased risk of in-hospital death. Prognostic risk score models offer advantages over complex modeling systems: they are simple to comprehend and implement, with discriminatory power approaching the full logistic regression and ANN models. Limitations of this investigation include the restricted scope of covariates available and a single high-volume center's experience, limiting generalizability.

Example

Comparison of Practical Prediction Models for Ambulation Following Spinal Cord Injury. Todd Rowland, M.D., Decision Systems Group, Brigham and Women's Hospital

Study Rationale: The patient's most common question is "Will I walk again?" The study was conducted to compare logistic regression, neural network, and rough sets models that predict ambulation at discharge, based upon information available at admission, for individuals with acute spinal cord injury. The goal was to create simple models with good performance. 762 cases formed the training set and 376 cases the test set; univariate statistics were compared to make sure the sets were similar (e.g., means).

SCI Ambulation Classification System. Admission info (9 items): system days, injury days, age, gender, racial/ethnic group, level of neurologic fxn, ASIA impairment index, UEMS, LEMS. Ambulation (1 item): Yes = 1, No = 0.

Thresholded Results

      Sens    Spec    NPV     PPV     Accuracy
LR    0.875   0.853   0.971   0.549   0.856
NN    0.844   0.878   0.965   0.587   0.872
RS    0.875   0.862   0.971   0.566   0.864

Brier Scores

      Brier
LR    0.0804
NN    0.0811
RS    0.0883

[Figure: ROC curves for the LR, NN, and RS models, sensitivity vs. 1-specificity.]

Areas under ROC Curves

Model                ROC Curve Area   Standard Error
Logistic Regression  0.925            0.016
Neural Network       0.923            0.015
Rough Set            0.914            0.016

[Figure: calibration curves for the LR, NN, and RS models, estimated vs. observed.]

Results: Goodness-of-fit. Logistic Regression: H-L p = 0.50. Neural Network: H-L p = 0.21. Rough Sets: H-L p < .01. (p > 0.05 indicates reasonable fit.)

Conclusion: For this example, logistic regression seemed to be the best approach, given its simplicity and good performance. Is it enough to assess discrimination and calibration in one data set?