Knowledge Discovery and Data Mining
Lecture 15 - ROC, AUC & Lift

Tom Kelsey
School of Computer Science, University of St Andrews
http://tom.home.cs.st-andrews.ac.uk
twk@st-andrews.ac.uk

ID5059, 13 March 2015

Testing

A useful tool for investigating model performance is the confusion matrix:

              ŷ = 0   ŷ = 1
      y = 0     a       b
      y = 1     c       d

It contains the counts for correct predictions of class 0 (a), correct predictions of class 1 (d), and the two ways you may have made incorrect predictions (b, the false positives, and c, the false negatives).
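As a quick illustration, base R's table() builds exactly this matrix; a minimal sketch with made-up 0/1 vectors (actual and predicted here are illustrative, not the lecture's data):

    # Toy 0/1 labels: rows = actual y, columns = predicted ŷ
    actual    <- c(0, 0, 0, 1, 1, 1, 1, 0)
    predicted <- c(0, 1, 0, 1, 1, 0, 1, 0)

    cm <- table(actual, predicted)
    cm
    #       predicted
    # actual 0 1
    #      0 3 1    <- a = 3 (TN), b = 1 (FP)
    #      1 1 3    <- c = 1 (FN), d = 3 (TP)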

Performance Measures

Accuracy                        = (a + d) / (a + b + c + d)
Precision                       = d / (b + d)
Recall (TP rate) / Sensitivity  = d / (c + d)
TN rate / Specificity           = a / (a + b)
FP rate                         = b / (a + b)
FN rate                         = c / (c + d)
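Continuing the sketch above, the measures follow directly from the four counts:

    # Performance measures from the confusion-matrix counts
    a <- cm[1, 1]; b <- cm[1, 2]   # TN, FP
    c <- cm[2, 1]; d <- cm[2, 2]   # FN, TP (shadowing base c() is harmless here)

    accuracy    <- (a + d) / (a + b + c + d)
    precision   <- d / (b + d)
    recall      <- d / (c + d)     # sensitivity, TP rate
    specificity <- a / (a + b)     # TN rate
    fpr         <- b / (a + b)     # FP rate
    fnr         <- c / (c + d)     # FN rate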

Receiver Operating Characteristic (ROC) curves

For continuous data with a variable cutoff point for the classification:

- obese Y/N based on BMI, age, etc.
- cancerous based on the percentage of abnormal tissue in a slide

Given a tree, some test data and a confusion matrix, it is easy to generate a point on a ROC chart: the x-axis is the FP rate, the y-axis is the TP rate. This point depends on a probability threshold for the classification. Varying this threshold changes the confusion matrix, giving more points on the chart. Use this to tune the model with respect to FP and TP rates.
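The threshold sweep is easy to code directly; here is a minimal sketch on simulated data (labels and scores are made up so the snippet is self-contained, not the lecture's example):

    # Simulated test data: 0/1 labels and predicted probabilities
    set.seed(1)
    labels <- rbinom(200, 1, 0.4)
    scores <- ifelse(labels == 1, rbeta(200, 4, 2), rbeta(200, 2, 4))

    # One confusion matrix, hence one (FP rate, TP rate) point, per threshold
    thresholds <- seq(0, 1, by = 0.01)
    tp_rate <- sapply(thresholds, function(t)
      sum(scores >= t & labels == 1) / sum(labels == 1))
    fp_rate <- sapply(thresholds, function(t)
      sum(scores >= t & labels == 0) / sum(labels == 0))

    plot(fp_rate, tp_rate, type = "l", xlab = "FP rate", ylab = "TP rate")
    abline(0, 1, lty = 2)   # chance-level diagonal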

Example

[Figures over four slides: worked ROC example from Goldstein and Mushlin, J. Gen. Intern. Med. (1987) 2, 20-24.]

Effect of Thresholding

[Figure: how the balance between TP, TN, FP and FN changes as the threshold varies.]

Area Under Curve

The area measures discrimination: the ability of the test to classify correctly. Useful for comparing ROC curves. The standard academic banding:

0.90 - 1.00 = excellent
0.80 - 0.90 = good (0.86 for the example)
0.70 - 0.80 = fair
0.60 - 0.70 = poor
0.50 - 0.60 = fail

Computed by trapezoidal estimates (or the curve can be smoothed, then integrated).
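The trapezoidal estimate is a few lines over the sweep computed earlier (a sketch, reusing fp_rate and tp_rate from the snippet above):

    # Trapezoidal AUC: order the points by FP rate, then sum the trapezoids
    ord <- order(fp_rate)
    x <- fp_rate[ord]; y <- tp_rate[ord]
    auc_est <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
    auc_est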

Examples

[Figures: ROC curves from Kelsey et al. (two slides), then ROC plots produced with the pROC package for R (six slides).]
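For reference, a minimal sketch of how such plots can be produced with pROC (reusing the simulated labels and scores from earlier; run install.packages("pROC") first):

    library(pROC)

    roc_obj <- roc(labels, scores)   # response, predictor
    auc(roc_obj)                     # trapezoidal AUC
    plot(roc_obj)                    # ROC plot as in the slides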

The Case For

S. Ma & J. Huang, "Regularized ROC method for disease classification and biomarker selection with microarray data", Bioinformatics (2005) 21(24):

"An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease classification. Thus there is a need for developing statistical methods that can efficiently use such high-throughput genomic data, select biomarkers with discriminant power and construct classification rules. The ROC technique has been widely used in disease classification with low-dimensional biomarkers because (1) it does not assume a parametric form of the class probability, as required for example in the logistic regression method; (2) it accommodates case-control designs; and (3) it allows treating false positives and false negatives differently. However, due to computational difficulties, ROC-based classification has not been used with microarray data."

The Case Against

J.M. Lobo et al., "AUC: a misleading measure of the performance of predictive distribution models", Global Ecology and Biogeography (2008) 17(2):

"The ... AUC is currently considered to be the standard method to assess the accuracy of predictive distribution models. It avoids the supposed subjectivity in the threshold selection process, when continuous probability derived scores are converted to a binary presence-absence variable, by summarizing overall model performance over all possible thresholds ... We do not recommend using AUC for five reasons: (1) it ignores the predicted probability values and the goodness-of-fit of the model; (2) it summarises the test performance over regions of the ROC space in which one would rarely operate; (3) it weights omission and commission errors equally; (4) it does not give information about the spatial distribution of model errors; and, most importantly, (5) the total extent to which models are carried out highly influences the rate of well-predicted absences and the AUC scores."

Lift

Measures the degree to which the predictions of a classification model are better than random predictions. In simple terms, lift is the ratio of the correct positive classifications made by the model to the actual positive classifications in the test data.

For example, if 40% of patients have been diagnosed (the positive classification) in the past, and the model accurately predicts 75% of them, the lift would be 0.75 / 0.40 = 1.875.

Lift

Lift charts for a model can be obtained in a similar manner to ROC charts. For threshold value t:

x = (TP(t) + FP(t)) / (P + N)
y = TP(t)

The AUC of a lift chart is no smaller than the AUC of the ROC curve for the same model. As before, we can compare lift charts for competing models, and investigate optimal threshold values.
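In code, the chart coordinates follow the definitions above directly (a sketch, again on the simulated data from earlier):

    # Lift-chart coordinates: one (x, y) pair per threshold t
    P <- sum(labels == 1); N <- sum(labels == 0)
    lift_x <- sapply(thresholds, function(t) sum(scores >= t) / (P + N))       # (TP + FP) / (P + N)
    lift_y <- sapply(thresholds, function(t) sum(scores >= t & labels == 1))   # TP(t)

    plot(lift_x, lift_y, type = "l", xlab = "(TP + FP) / (P + N)", ylab = "TP(t)")
    abline(0, P, lty = 2)   # random targeting captures positives in proportion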

Lift Example

Suppose we have a mailing list of former students, and we want to raise money by mailing an elaborate brochure. We have demographic information that we can relate to the response rate. From similar mail-out campaigns, we estimate the baseline response rate at 8%; sending to everyone would result in a net loss.

We build a model based on the data collected and select the 10% most likely to respond. If among these the response rate is 16%, then the lift due to using the predictive model is 16% / 8% = 2. Analogous lift values can be computed for each percentile of the population. From this we work out the best trade-off between expense and anticipated response.
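The per-percentile arithmetic can be sketched the same way; here per decile, reusing the simulated labels and scores (not real campaign data):

    # Rank customers by score, split into deciles, compare each decile's
    # response rate to the baseline rate
    baseline <- mean(labels)
    decile <- cut(rank(-scores, ties.method = "first") / length(scores),
                  breaks = seq(0, 1, by = 0.1), labels = FALSE)
    lift_by_decile <- tapply(labels, decile, mean) / baseline
    round(lift_by_decile, 2)   # decile 1 = the 10% judged most likely to respond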

General chart structure

You can think of this as a customer database ordered by predicted probability: as we move from left to right we penetrate deeper into the database, from high p̂ observations to low p̂ observations.

[Figure: generic chart structure over the ordered database.]

Lift

Closely associated with the Pareto Principle: 80% of profit comes from 20% of customers. A good model and a lift chart help identify those customers.

Why use these plots?

The utility of these charts is hopefully clear:

- if we had a limited budget, we can see what level of response this would buy by targeting the (modelled) most likely responders
- we can see how much value our model has brought to the problem, compared to a random sample of customers - in direct monetary terms if costs are included
- perhaps we can run a smaller campaign, as the returns diminish beyond some percentage of customers targeted
- we can see where a level of customer targeting becomes unprofitable, if the costs are known.

Summary

Medics and management use ROC, AUC & Lift whenever possible:

- easy to compute
- easy to understand
- a simple 2D graphical expression of how Model A compares to Model B
- plus useful threshold cutoff information
- plus important cost-benefit information

You are expected to be able to produce ROC curves. You are not expected to be able to produce lift charts, but you should be able to explain their design and use.