Displaying and comparing correlated ROC curves with the SAS System

Similar documents
Developing Risk Adjustment Techniques Using the System for Assessing Health Care Quality in the

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541

ABSTRACT INTRODUCTION

PROC LOGISTIC: Traps for the unwary Peter L. Flom, Independent statistical consultant, New York, NY

Modeling Lifetime Value in the Insurance Industry

Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank

11. Analysis of Case-control Studies Logistic Regression

A LOGISTIC REGRESSION MODEL TO PREDICT FRESHMEN ENROLLMENTS Vijayalakshmi Sampath, Andrew Flagel, Carolina Figueroa

Basic Statistical and Modeling Procedures Using SAS

Statistics, Data Analysis & Econometrics

Cool Tools for PROC LOGISTIC

This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form.

Segmentation For Insurance Payments Michael Sherlock, Transcontinental Direct, Warminster, PA

VI. Introduction to Logistic Regression

SAS Software to Fit the Generalized Linear Model

Paper KEYWORDS PROC TRANSPOSE, PROC CORR, PROC MEANS, PROC GPLOT, Macro Language, Mean, Standard Deviation, Vertical Reference.

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Charles Secolsky County College of Morris. Sathasivam 'Kris' Krishnan The Richard Stockton College of New Jersey

Graphing in SAS Software

II. DISTRIBUTIONS distribution normal distribution. standard scores

containing Kendall correlations; and the OUTH = option will create a data set containing Hoeffding statistics.

LOGISTIC REGRESSION ANALYSIS

Predictive Modelling of High Cost Healthcare Users in Ontario

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Lecture 19: Conditional Logistic Regression

Multinomial and ordinal logistic regression using PROC LOGISTIC Peter L. Flom National Development and Research Institutes, Inc

Getting Correct Results from PROC REG

SUGI 29 Statistics and Data Analysis

Two Correlated Proportions (McNemar Test)

Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios

CHAPTER 3 EXAMPLES: REGRESSION AND PATH ANALYSIS

Introduction to mixed model and missing data issues in longitudinal studies

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Health Care and Life Sciences

A Study to Predict No Show Probability for a Scheduled Appointment at Free Health Clinic

Lecture 14: GLM Estimation and Logistic Regression

Sun Li Centre for Academic Computing

Ordinal Regression. Chapter

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Evaluation of Predictive Models

Multiple Graphs on One Page (Step-by-step approach) Yogesh Pande, Schering-Plough Corporation, Summit NJ

Multivariate Logistic Regression

Using SAS Proc Mixed for the Analysis of Clustered Longitudinal Data

Interpretation of Somers D under four simple models

Credit Risk Models. August 24 26, 2010

ABSTRACT INTRODUCTION STUDY DESCRIPTION

Research Methods & Experimental Design

Nominal and ordinal logistic regression

Developing Business Failure Prediction Models Using SAS Software Oki Kim, Statistical Analytics

9.2 User s Guide SAS/STAT. Introduction. (Book Excerpt) SAS Documentation

Tests for Two Survival Curves Using Cox s Proportional Hazards Model

Multinomial and Ordinal Logistic Regression

Generalized Linear Models

Binary Logistic Regression

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups

Imputing Missing Data using SAS

Additional sources Compilation of sources:

SAS/STAT. 9.2 User s Guide. Introduction to. Nonparametric Analysis. (Book Excerpt) SAS Documentation

ln(p/(1-p)) = α +β*age35plus, where p is the probability or odds of drinking

Weight of Evidence Module

Chapter 39 The LOGISTIC Procedure. Chapter Table of Contents

Chapter 12 Nonparametric Tests. Chapter Table of Contents

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

Simulation Models for Business Planning and Economic Forecasting. Donald Erdman, SAS Institute Inc., Cary, NC

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Paper D Ranking Predictors in Logistic Regression. Doug Thompson, Assurant Health, Milwaukee, WI

Abbas S. Tavakoli, DrPH, MPH, ME 1 ; Nikki R. Wooten, PhD, LISW-CP 2,3, Jordan Brittingham, MSPH 4

SAS Syntax and Output for Data Manipulation:

Assessing Model Fit and Finding a Fit Model

Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Sensitivity Analysis in Multiple Imputation for Missing Data

Copyright 2006, SAS Institute Inc. All rights reserved. Predictive Modeling using SAS

Using SAS to Create Graphs with Pop-up Functions Shiqun (Stan) Li, Minimax Information Services, NJ Wei Zhou, Lilly USA LLC, IN

Normality Testing in Excel

Answering Your Research Questions with Descriptive Statistics Diana Suhr, University of Northern Colorado

Using PROC RANK and PROC UNIVARIATE to Rank or Decile Variables

Box-and-Whisker Plots with The SAS System David Shannon, Amadeus Software Limited

Alex Vidras, David Tysinger. Merkle Inc.

Training/Internship Brochure Advanced Clinical SAS Programming Full Time 6 months Program

A survey on click modeling in web search

Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL

Students' Opinion about Universities: The Faculty of Economics and Political Science (Case Study)

Nonparametric Statistics

Ten Tips for Simulating Data with SAS

CC03 PRODUCING SIMPLE AND QUICK GRAPHS WITH PROC GPLOT

Modeling Customer Lifetime Value Using Survival Analysis An Application in the Telecommunications Industry

Introduction to Quantitative Methods

LEGEND OPTIONS USING MULTIPLE PLOT STATEMENTS IN PROC GPLOT

An Application of the Cox Proportional Hazards Model to the Construction of Objective Vintages for Credit in Financial Institutions, Using PROC PHREG

From The Little SAS Book, Fifth Edition. Full book available for purchase here.

SAS Code to Select the Best Multiple Linear Regression Model for Multivariate Data Using Information Criteria

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

SP10 From GLM to GLIMMIX-Which Model to Choose? Patricia B. Cerrito, University of Louisville, Louisville, KY

Transcription:

Displaying and comparing correlated ROC curves with the SAS System Barbara Schneider, University of Vienna, Austria ABSTRACT Diagnosis is an essential part of clinical practice. Much medical research is carried out to improve methods of diagnosis. Medical tests are developed for diagnosis purpose. Thus statistical methods of evaluating and comparing the performance of such tests are of great importance. A test is associated with an observed variable which is able to discriminate between a diseased and non-diseased population. If this variable lies on an ordinal or metric scale an assessment of the overall performance of the test can be made through the use of the Receiver Operating Characteristic (ROC) curve. Correlated ROC curves arise when two or more different tests are performed on the same individuals. The poster presents an approach to display and compare correlated ROC curves with the SAS Software. SAS code using Base SAS, SAS/STAT and SAS/GRAPH is reported. INTRODUCTION The purpose of a medical test is to classify individuals in two groups (eg. diseased or nondiseased). A medical test is associated with an observed variable. Such variables cannot only be used for determining the presence or absence of a medical condition -diagnosis- but also for predicting an event of interest -prognosis-. The evaluation and comparison of medical tests requires knowledge about the true medical condition of an individual. It has to be determined by means independent of the test (Gold Standard) that an individual undergoes the condition or event. For example autopsy, biopsy or surgical inspection can be taken as gold standards. If the variable which is associated with the test is dichotomous that means T + : test positive, T - : test negative, then sensitivity and specificity reflect the quality of a test. It is assumed that T + is associated with the desease (condition) and T - with non-disease (non-condition). Sensitivity is the probability that the test result is positive under the condition that the subject comes from the diseased population. Specificity is the probability that the test result is negative under the condition that the subject comes from the non-diseased population. In practice sensitivity and specificity are estimated as the true positive rate and true negative rate - empirical sensitivity and empirical specificity -. Let D be the event that a subject undergoes the condition and D the event that a subject does not have the condition then: Sensitivity : sens = P(T + / D) Specificity: spez = P(T - / D ) When a test is based on an observed variable that lies on a continuous or graded scale any outcome value z can serve as a cutpoint to dichotomize the problem. For values greater or equal than the cutpoint the outcome of the test is positiv and for smaller values than the cutpoint the test result is taken as being negativ or vice versa. It depends on whether higher or lower values are associated with the condition of interest.

Sens(z) is the empirical sensitivity of a test that is derived by dichotomizing the metric or ordinal variable into positiv or negativ results on the basis of the cutpoint z. spez(z) is the corresponding empirical specificity. The receiver operating characteristic (ROC) curve is theoretically given as a plot of sensitivity (Y-Axis) vs. (1-specificity) (X-Axis) by varying the variable on which the test is based over all possible values.the empirical ROC curve is a graphical display of sens(z) vs. 1-spez(z) as the sample values z vary over all measured values. An interesting property of the ROC curve is that it is invariant under orderpreserving transformations of the test variable; that means if the test variable is transformed by a monotone function the resulting ROC curve will be the same. CONSTRUCTION OF THE ROC CURVE WITH THE SAS SYSTEM There are several possibilities to compute sens(z) and spez(z) for each of the sample values z using the SAS System. Since Version 6.10 an option in PROC LOGISTIC is available that allows computation of sens(z) and spez(z) and output them into a SAS data set. PROC LOGISTIC DATA= input- SAS-data-set ; MODEL status = test / OUTROC = output-sas-data-set ; RUN; The input- SAS-data-set consists of at least the two variables status and test. One record per observational unit ( eg. patient) is used: status : dichotomous event (condition) variable eg. 1 deseased, 2 non deseased test: variable which is associated with the medical test Amongst other variables the output-sas-data-set contains the variables: _SENSIT_ sens(z) _1MSPEC_ 1-spez(z) PROC LOGISTIC uses the estmated probability p of the event from the logistic model to exp( a + bz) compute sensitivity and specificity. For p = is a monotone 1 + exp( a + bz) transformation of the values z of the test variable the ROC curve cunstructed from the probabilities p will be the same as generated from the sample values z. An advantage of this transformation is that higher probabilities reflect the condition of interest. Therefore one need not take care ofwheather higher or lower values of the test variable are associated with the condition of interest. To display the ROC curve PROC GPLOT can be used. PROC GPLOT DATA= output-sas-data-set; PLOT _SENSIT_*_1MSPEC_ ; RUN; DISPLAYING CORRELATED ROC CURVES Correlated ROC curves arise when two or more different tests are performed on the same individuals (observational units). To compare the discriminating abilities of two or more test variables measured on the same individuals graphically, the corresponding ROC curves have to be plotted.

The following SAS macro gives an example how to display several correlated ROC curves within the same graph. The input- SAS-data-set consists of at least the variables status, test1,test2,...,testn and one record per observational unit ( eg. patient): status : dichotomous event (condition) variable eg. 1 deseased, 2 non deseased test1,..,testn: variables associated with the medical test options ls=79 nonumber; title "ROC curve"; goptions cback=white ctext=black ftext=swiss hsize=20cm vsize=20cm target=winprtc; %let indata = input- SAS-data-set ; %let vars = test1 test2 testn; %let pop=status; %let cvars = "&vars"; /* in order to restructure the input data set from multivariate to univariate*/ data trans; set &indata; array w &vars; do over w; wert=w; variable=scan(&cvars,_i_); index=_i_; output; end; proc sort data=trans; by index; /** log-reg **/ proc logistic data=trans /*descending*/; model &pop=wert / outroc=o; output out=p p=_prob_; /* to compute the ROC curve for each test variable separately*/ by index variable; data p; format _PROB_ 8.6; set p; _PROB_=round(_PROB_,.00001); data o; set o; _PROB_=round(_PROB_,.00001); proc sort data=o; by index _PROB_; proc sort data=p; by index _PROB_; /* in order to have the test variables mapped to sensitivity and 1-specificity */ data p1; keep _PROB_ wert _sensit 1mspec_ index variable; merge p o; if wert^=.; by index _PROB_; proc print data=p1 ; symbol1 c=black w=3 v=none i=join l=1 ; symbol2 c=blue w=3 v=none i=join l=2 ; symbol3 c=green w=3 v=none i=join l=3; /*symboln c=red w=3 v=none i=join l=4;*/ proc gplot data=o; plot _sensit_*_1mspec_=variable / ctext=black caxis=black vaxis=0 to 1 by.1; To illustrate how the above SAS program works, the following data set (reported in DeLong et al. 1988) is used: quit; title; ROC data set ALB TP TOTSCORE POPIND POPIND1 PATIENT

3.0 3.2 5.8 6.3 0 5 0 1 2 1 1 2 3.9 6.8 7 1 1 3 2.8 4.8 4 0 2 4 3.2 5.8 7 1 1 5 0.9 4.0 5 0 2 6 2.5 5.7 2 0 2 7 1.6 5.6 5 1 1 8 3.8 5.7 5 1 1 9 3.7 6.7 4 1 1 10.. 4 1 1 11 3.2 5.4 6 1 1 12 3.8 6.6 4 1 1 13 4.1 6.6 5 1 1 14 3.6 5.7 5 1 1 15 4.3 7.0 6 1 1 16 3.6 6.7 6 0 2 17 2.3 4.4 4 1 1 18 4.2 7.6 6 0 2 19 4.0 6.6 4 0 2 20 3.5 5.8 4 1 1 21 3.8 6.8 3 1 1 22 3.0 4.7 2 0 2 23 4.5 7.4 5 1 1 24 3.7 7.4 5 1 1 25 3.1 6.6 4 1 1 26 4.1 8.2 4 1 1 27 4.3 7.0 5 1 1 28 4.3 6.5 6 1 1 29 3.2 5.1 5 1 1 30 2.6 4.7 4 1 1 31 3.3 6.8 4 0 2 32 1.7 4.0 3 0 2 33.. 4 1 1 34 3.7 6.1 5 1 1 35 3.3 6.3 3 1 1 36 4.2 7.7 4 1 1 37 3.5 6.2 5 1 1 38 2.9 5.7 1 0 2 39 2.1 4.8 3 1 1 40.. 2 1 1 41 2.8 6.2 2 0 2 42.. 3 1 1 43.. 3 1 1 44 4.0 7.0 3 1 1 45 3.3 5.7 4 1 1 46 3.7 6.9 5 1 1 47 2.0. 3 1 1 48 3.6 6.6 5 1 1 49 POPIND event variable 1... benefit from surgery 0... no benefit from surgery POPIND1 event variable 1... benefit from surgery 2... no benefit from surgery can be used without the descending OPTION in PROC LOGISTIC ALB albumin, TP total protein, TOTSCORE score variable... medical test variables PATIENT... patient identification number Part of the program with actual parameters: options ls=79 nonumber; title "ROC curve"; goptions cback=white ctext=black ftext=swiss hsize=20cm vsize=20cm target=winprtc; %let indata = ROC ; %let vars = ALB TP TOTSCORE; %let pop=popind1;

SAS - OUTPUT: ROC curve S e n s i t i v i t y 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1 - Specificity VARIABLE alb totscore tp

Printed output is only given for variable ALB ROC curve ---------------------------- INDEX=1 VARIABLE=alb ----------------------------- Data Set: WORK.TRANS Response Variable: POPIND1 Response Levels: 2 Number of Observations: 44 Link Function: Logit The LOGISTIC Procedure Response Profile Ordered Value POPIND1 Count 1 1 32 2 2 12 WARNING: 5 observation(s) were deleted due to missing values for the response or explanatory variables. Model Fitting Information and Testing Global Null Hypothesis BETA=0 Intercept Intercept and Criterion Only Covariates Chi-Square for Covariates AIC 53.564 51.021. SC 55.348 54.590. -2 LOG L 51.564 47.021 4.542 with 1 DF (p=0.0331) Score.. 4.700 with 1 DF (p=0.0302) Analysis of Maximum Likelihood Estimates Parameter Standard Wald Pr > Standardized Odds Variable DF Estimate Error Chi-Square Chi-Square Estimate Ratio INTERCPT 1-1.9418 1.4606 1.7675 0.1837.. WERT 1 0.9099 0.4508 4.0742 0.0435 0.403372 2.484 Association of Predicted Probabilities and Observed Responses Concordant = 71.1% Somers D = 0.438 Discordant = 27.3% Gamma = 0.444 Tied = 1.6% Tau-a = 0.178 (384 pairs) c = 0.719

ROC curve OBS _PROB_ WERT VARIABLE INDEX _SENSIT 1MSPEC_ 1 0.245480 0.9 alb 1 1.00000 1.00000 2 0.380860 1.6 alb 1 1.00000 0.91667 3 0.402540 1.7 alb 1 0.96875 0.91667 4 0.469560 2.0 alb 1 0.96875 0.83333 5 0.492270 2.1 alb 1 0.93750 0.83333 6 0.537700 2.3 alb 1 0.90625 0.83333 7 0.582500 2.5 alb 1 0.87500 0.83333 8 0.604450 2.6 alb 1 0.87500 0.75000 9 0.647040 2.8 alb 1 0.84375 0.75000 10 0.647040 2.8 alb 1 0.84375 0.75000 11 0.667530 2.9 alb 1 0.84375 0.58333 12 0.687410 3.0 alb 1 0.84375 0.50000 13 0.687410 3.0 alb 1 0.84375 0.50000 14 0.706620 3.1 alb 1 0.84375 0.33333 15 0.725120 3.2 alb 1 0.81250 0.33333 16 0.725120 3.2 alb 1 0.81250 0.33333 17 0.725120 3.2 alb 1 0.81250 0.33333 18 0.725120 3.2 alb 1 0.81250 0.33333 19 0.742880 3.3 alb 1 0.68750 0.33333 20 0.742880 3.3 alb 1 0.68750 0.33333 21 0.742880 3.3 alb 1 0.68750 0.33333 22 0.776090 3.5 alb 1 0.62500 0.25000 23 0.776090 3.5 alb 1 0.62500 0.25000 24 0.791500 3.6 alb 1 0.56250 0.25000 25 0.791500 3.6 alb 1 0.56250 0.25000 26 0.791500 3.6 alb 1 0.56250 0.25000 27 0.806120 3.7 alb 1 0.50000 0.16667 28 0.806120 3.7 alb 1 0.50000 0.16667 29 0.806120 3.7 alb 1 0.50000 0.16667 30 0.806120 3.7 alb 1 0.50000 0.16667 31 0.819950 3.8 alb 1 0.37500 0.16667 32 0.819950 3.8 alb 1 0.37500 0.16667 33 0.819950 3.8 alb 1 0.37500 0.16667 34 0.832990 3.9 alb 1 0.28125 0.16667 35 0.845270 4.0 alb 1 0.25000 0.16667 36 0.845270 4.0 alb 1 0.25000 0.16667 37 0.856800 4.1 alb 1 0.21875 0.08333 38 0.856800 4.1 alb 1 0.21875 0.08333 39 0.867610 4.2 alb 1 0.15625 0.08333 40 0.867610 4.2 alb 1 0.15625 0.08333 41 0.877710 4.3 alb 1 0.12500 0.00000 42 0.877710 4.3 alb 1 0.12500 0.00000 43 0.877710 4.3 alb 1 0.12500 0.00000 44 0.895940 4.5 alb 1 0.03125 0.00000 COMPARISON OF CORRELATED ROC CURVES For statistical analysis, a recommended index of accuracy in relation with the ROC curve is the area under the curve. The area under the ROC curve can be interpreted as the probability that a randomly selected individual from the abnormal population has a higher value than a randomly selected individual from the normal population, if higher values are associated with the abnormal population. Vice versa if lower values are associated with the abnormal population. A nonparametric method to compare correlated ROC curves is described in (DeLong,DeLong and Clark-Pearson, 1988). A SAS/IML programm has been written by the authors and is available from the web : http://www.sas.com/techsup/download/stat/roc.sas. PROC LOGISTIC does not compare ROC curves but issues the area under the curve (c =.719, ALB from the above example) in the following portion of the output: Association of Predicted Probabilities and Observed Responses. Concordant = 71.1% Somers D = 0.438 Discordant = 27.3% Gamma = 0.444 Tied = 1.6% Tau-a = 0.178 (384 pairs) c = 0.719

SAS, Base SAS, SAS/STAT, SAS/GRAPH and SAS/IML software are registered trademarks of SAS Institute Inc., Cary, NC, USA Address for correspondence: Dipl.Ing. Dr. Barbara Schneider Institut für Medizinische Statistik Universität Wien Schwarzspanierstr. 17 A - 1090 Wien Austria REFERENCES: Elizabeth R. DeLong, David M. DeLong, Daniel L. Clarke-Pearson (1988) Comparing the Areas Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach, BIOMETRICS 44, 837-845, September 1988 Bamber D. (1975) The area above the ordinal dominance graph and the area below the receiver operating characteristic graph, Journal of Mathematical Psychology 12, 387-415 SAS Language: Reference, Version 6, First Edition SAS Procedures Guide, Version 6, Third Edition SAS/GRAPH Software: Reference, Version 6, First Edition, Volumes 1 and 2 SAS/IML Software: Usage and Reference, Version 6, First Edition SAS/STAT Software: Reference, Version 6, First Edition, Volumes 1 and 2