Evaluation of Classifiers


Evaluation of Classifiers. Topics: ROC curves, reject curves, precision-recall curves, and statistical tests (estimating the error rate of a classifier, comparing two classifiers, estimating the error rate of a learning algorithm, comparing two algorithms).

Cost-Sensitive Learning. In most applications, false positive and false negative errors are not equally important, so we want to adjust the tradeoff between them. Many learning algorithms provide a way to do this: probabilistic classifiers combine a cost matrix with decision theory to make classification decisions; discriminant functions adjust the threshold for classifying into the positive class; ensembles adjust the number of votes required to classify as positive.

Example: 30 decision trees constructed by bagging. Classify as positive if K out of 30 trees predict positive, and vary K.
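
As a rough illustration (not from the original slides), here is a minimal Python sketch of this example, assuming scikit-learn, a synthetic dataset, and 30 trees trained on bootstrap resamples:

```python
# Sketch: vary the vote threshold K of a 30-tree bagged ensemble to trade
# off false positives against false negatives. The dataset and tree settings
# are illustrative assumptions, not part of the original example.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

rng = np.random.RandomState(0)
trees = []
for _ in range(30):                       # bagging: 30 bootstrap resamples
    idx = rng.randint(0, len(X_tr), len(X_tr))
    trees.append(DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]))

votes = np.sum([t.predict(X_te) for t in trees], axis=0)  # positive votes per test example

for K in range(1, 31):                    # classify as positive if >= K trees vote positive
    pred = (votes >= K).astype(int)
    fp = np.sum((pred == 1) & (y_te == 0))
    fn = np.sum((pred == 0) & (y_te == 1))
    print(f"K={K:2d}  FP={fp:4d}  FN={fn:4d}")
```

Each value of K gives one (FP, FN) operating point, which is exactly the tradeoff curve discussed next.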

Directly Visualizing the Tradeoff. We can plot the false positives versus the false negatives directly. If L(0,1) = R · L(1,0) (i.e., a false negative is R times more expensive than a false positive), then the best operating point is where the curve is tangent to a line of slope R. In the bagging example above, if R = 1 we should set the vote threshold to K = 10; if R = 10, the threshold should be K = 29.

Receiver Operating Characteristic (ROC) Curve. It is traditional to plot this same information in a normalized form, with 1 − (false negative rate), i.e. the true positive rate, plotted against the false positive rate. The optimal operating point is again tangent to a line with a slope of R.

Generating ROC Curves. Linear threshold units, sigmoid units, and neural networks: adjust the classification threshold between 0 and 1. K nearest neighbor: adjust the number of votes (between 0 and k) required to classify as positive. Naïve Bayes, logistic regression, etc.: vary the probability threshold for classifying as positive. Support vector machines: require different margins for positive and negative examples.
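
A minimal sketch of the probability-threshold case (not from the original slides), assuming scikit-learn, a synthetic dataset, and logistic regression as the probabilistic classifier:

```python
# Sketch: generate an ROC curve by sweeping the probability threshold of a
# probabilistic classifier (logistic regression here; the data are synthetic).
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, scores)   # one (FPR, TPR) point per threshold

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (1 - False Negative Rate)")
plt.title(f"ROC curve, AUC = {roc_auc_score(y_te, scores):.3f}")
plt.show()
```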

SVM: Asymmetric Margins. Minimize $\|w\|^2 + C \sum_i \xi_i$ subject to $w \cdot x_i + \xi_i \ge R$ (positive examples) and $-w \cdot x_i + \xi_i \ge 1$ (negative examples).
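
Common SVM libraries do not expose this asymmetric-margin formulation directly. A related (not identical) mechanism for shifting the same tradeoff is per-class cost weighting, sketched below under assumed settings with scikit-learn's SVC:

```python
# Sketch: per-class cost weighting in an SVM. This is NOT the asymmetric-margin
# formulation above, but a related mechanism: errors on the positive class are
# penalized R times more heavily, which similarly shifts the decision boundary
# toward fewer false negatives.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
R = 10  # assumed cost ratio: a false negative is R times worse than a false positive
svm = SVC(kernel="linear", C=1.0, class_weight={0: 1.0, 1: R}).fit(X, y)
```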

ROC Convex Hull. If we have two classifiers h1 and h2 with operating points (FP1, FN1) and (FP2, FN2), then we can construct a stochastic classifier that interpolates between them: given a new data point x, use classifier h1 with probability p and h2 with probability (1 − p). The resulting classifier has an expected false positive level of p·FP1 + (1 − p)·FP2 and an expected false negative level of p·FN1 + (1 − p)·FN2. This means that we can create a classifier that matches any point on the convex hull of the ROC curve.

[Figure: the ROC convex hull overlaid on the original ROC curve.]

Maximizing AUC. At learning time, we may not know the cost ratio R. In such cases, we can maximize the Area Under the ROC Curve (AUC). Efficient computation of AUC: assume h(x) returns a real quantity (larger values indicate class 1); sort the x_i according to h(x_i) and number the sorted points from 1 to N, so that r(i) is the rank of data point x_i. AUC = the probability that a randomly chosen example from class 1 ranks above a randomly chosen example from class 0 = the Wilcoxon-Mann-Whitney statistic.

Computing AUC. Let $S_1$ be the sum of r(i) for $y_i = 1$ (the sum of the ranks of the positive examples). Then $\widehat{AUC} = \frac{S_1 - N_1(N_1+1)/2}{N_0 N_1}$, where $N_0$ is the number of negative examples and $N_1$ is the number of positive examples.
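
A minimal sketch of this rank-based computation (not from the original slides), assuming NumPy/SciPy and randomly generated scores, checked against scikit-learn's AUC:

```python
# Sketch: rank-based AUC computation following the formula above, compared
# with scikit-learn's roc_auc_score on randomly generated scores.
import numpy as np
from scipy.stats import rankdata
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=1000)
h = rng.rand(1000) + 0.5 * y           # scores that are higher, on average, for class 1

r = rankdata(h)                        # r(i) = rank of example i under h
S1 = r[y == 1].sum()                   # sum of ranks of the positive examples
N1, N0 = (y == 1).sum(), (y == 0).sum()
auc = (S1 - N1 * (N1 + 1) / 2) / (N0 * N1)

print(auc, roc_auc_score(y, h))        # the two estimates agree
```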

Optimizing AUC. A hot topic in machine learning right now is developing algorithms for optimizing AUC. One example is RankBoost, a modification of AdaBoost: the main idea is to define a ranking loss function and then penalize a training example x by the number of examples of the other class that are misranked relative to x.

Rejection Curves. In most learning algorithms, we can specify a threshold for making a rejection decision. Probabilistic classifiers: adjust the cost of rejecting versus the costs of false positives and false negatives. Decision-boundary method: if a test point x is within θ of the decision boundary, then reject it; this is equivalent to requiring that the activation of the best class exceed that of the second-best class by at least θ.

Rejection Curves (2). Vary θ and plot the fraction correct versus the fraction rejected.
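
A minimal sketch of the decision-boundary method for a probabilistic classifier (not from the original slides), assuming scikit-learn, a synthetic noisy dataset, and rejection when the predicted probability lies within θ of 0.5:

```python
# Sketch: a rejection curve. Predictions whose probability is within theta of
# the 0.5 decision boundary are rejected; we plot accuracy on the accepted
# examples against the fraction rejected. Data and model are illustrative.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

frac_rejected, accuracy = [], []
for theta in np.linspace(0.0, 0.45, 50):
    accept = np.abs(proba - 0.5) >= theta          # reject points near the boundary
    if accept.sum() == 0:
        continue
    frac_rejected.append(1 - accept.mean())
    accuracy.append(((proba[accept] >= 0.5).astype(int) == y_te[accept]).mean())

plt.plot(frac_rejected, accuracy)
plt.xlabel("Fraction rejected")
plt.ylabel("Fraction correct (on accepted examples)")
plt.show()
```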

Precision versus Recall. Information retrieval setting: y = 1 means the document is relevant to the query, y = 0 means it is irrelevant, and K is the number of documents retrieved. Precision: the fraction of the K retrieved documents (ŷ = 1) that are actually relevant (y = 1), i.e. TP / (TP + FP). Recall: the fraction of all relevant documents that are retrieved, i.e. TP / (TP + FN) = the true positive rate.

Precision-Recall Graph. Plot recall on the horizontal axis and precision on the vertical axis, and vary the threshold for making positive predictions (or vary K).

The F1 Measure. A figure of merit that combines precision and recall: $F_1 = \frac{2PR}{P + R}$, where P = precision and R = recall. This is the harmonic mean of P and R. We can plot F1 as a function of the classification threshold θ.
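
A minimal sketch of this plot (not from the original slides), assuming scikit-learn, a synthetic imbalanced dataset, and logistic regression scores:

```python
# Sketch: precision, recall, and F1 as functions of the classification threshold,
# using scikit-learn's precision_recall_curve. Data and model are illustrative.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

P, R, thresholds = precision_recall_curve(y_te, scores)
F1 = 2 * P * R / (P + R + 1e-12)        # harmonic mean; small constant avoids 0/0

plt.plot(thresholds, P[:-1], label="precision")
plt.plot(thresholds, R[:-1], label="recall")
plt.plot(thresholds, F1[:-1], label="F1")
plt.xlabel("Threshold")
plt.legend()
plt.show()
```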

Summarizing a Single Operating Point. WEKA and many other systems normally report various measures for a single operating point (e.g., θ = 0.5). Here is example output from WEKA:

=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.854     0.1       0.899       0.854    0.876       0
0.9       0.146     0.854       0.9      0.876       1

Visualizing ROC and P/R Curves in WEKA. Right-click on the result list and choose "Visualize Threshold Curve", then select 1 from the popup window. For an ROC curve, plot the false positive rate on the X axis and the true positive rate on the Y axis; WEKA will also display the AUC. For a precision/recall curve, plot recall on the X axis and precision on the Y axis. WEKA does not support rejection curves.

Sensitivity and Selectivity. In medical testing, the terms sensitivity and selectivity (more commonly, specificity) are used. Sensitivity = TP/(TP + FN) = the true positive rate = recall. Selectivity = TN/(FP + TN) = the true negative rate = recall for the negative class = 1 − the false positive rate. The sensitivity versus selectivity tradeoff is identical to the ROC curve tradeoff.

Estimating the Error Rate of a Classifier. Compute the error rate on hold-out data: suppose a classifier makes k errors on n hold-out data points; the estimated error rate is $\hat{\epsilon} = k/n$. Then compute a confidence interval on this estimate. The standard error of the estimate is $SE = \sqrt{\hat{\epsilon}(1-\hat{\epsilon})/n}$, and a $1-\alpha$ confidence interval on the true error $\epsilon$ is $\hat{\epsilon} - z_{\alpha/2}\,SE \le \epsilon \le \hat{\epsilon} + z_{\alpha/2}\,SE$. For a 95% confidence interval, $z_{0.025} = 1.96$, so we use $\hat{\epsilon} - 1.96\,SE \le \epsilon \le \hat{\epsilon} + 1.96\,SE$.
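
A minimal worked sketch of this interval (the counts k and n below are made-up numbers, not from the slides):

```python
# Sketch: a 95% confidence interval on a classifier's error rate from k errors
# on n held-out examples.
import math

k, n = 37, 500                      # assumed: 37 errors on 500 hold-out points
err = k / n                         # estimated error rate
se = math.sqrt(err * (1 - err) / n) # standard error of the estimate
lo, hi = err - 1.96 * se, err + 1.96 * se
print(f"error = {err:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```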

Comparing Two Classifiers. Goal: decide which of two classifiers h1 and h2 has the lower error rate. Method: run them both on the same test data set and record the following counts: n00, the number of examples correctly classified by both classifiers; n01, the number of examples correctly classified by h1 but misclassified by h2; n10, the number of examples misclassified by h1 but correctly classified by h2; and n11, the number of examples misclassified by both h1 and h2. These four counts form a 2x2 contingency table.

McNemar's Test. $M = \frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}} > \chi^2_{1,\alpha}$. M is distributed approximately as $\chi^2$ with 1 degree of freedom. For a 95% confidence test, $\chi^2_{1,0.95} = 3.84$, so if M is larger than 3.84, then with 95% confidence we can reject the null hypothesis that the two classifiers have the same error rate.
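
A minimal worked sketch of the statistic (the disagreement counts below are made-up numbers):

```python
# Sketch: McNemar's test from the disagreement counts n01 and n10.
n01, n10 = 30, 14                   # assumed disagreement counts
M = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
print(f"M = {M:.2f}; reject 'same error rate' at 95% confidence if M > 3.84")
```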

Confidence Interval on the Difference Between Two Classifiers. Let $p_{ij} = n_{ij}/n$ be the 2x2 contingency table converted to probabilities, and define $p_A = p_{10} + p_{11}$ (the error rate of h1) and $p_B = p_{01} + p_{11}$ (the error rate of h2). The standard error of the difference is $SE = \sqrt{\frac{p_{01} + p_{10} - (p_{01} - p_{10})^2}{n}}$. A 95% confidence interval on the difference in the true error rates of the two classifiers is $(p_A - p_B) - 1.96\left(SE + \frac{1}{2n}\right) \le \epsilon_A - \epsilon_B \le (p_A - p_B) + 1.96\left(SE + \frac{1}{2n}\right)$.
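
A minimal worked sketch of this interval (the contingency-table counts below are made-up numbers):

```python
# Sketch: confidence interval on the difference in error rates of two classifiers
# from a 2x2 contingency table of counts.
import math

n00, n01, n10, n11 = 400, 30, 14, 56          # assumed counts; n = total test size
n = n00 + n01 + n10 + n11
p01, p10, p11 = n01 / n, n10 / n, n11 / n

pA = p10 + p11                                 # error rate of h1
pB = p01 + p11                                 # error rate of h2
se = math.sqrt((p01 + p10 - (p01 - p10) ** 2) / n)
half = 1.96 * (se + 1 / (2 * n))               # 1/(2n) is a continuity correction
print(f"eA - eB in [{pA - pB - half:.4f}, {pA - pB + half:.4f}] with 95% confidence")
```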

Cost-Sensitive Comparison of Two Classifiers. Suppose we have a non-0/1 loss matrix L(ŷ, y) and two classifiers h1 and h2. Goal: determine which classifier has the lower expected loss. A method that does not work well: for each algorithm a and each test example $(x_i, y_i)$, compute $\ell_{a,i} = L(h_a(x_i), y_i)$ and let $\delta_i = \ell_{1,i} - \ell_{2,i}$; then treat the δ's as normally distributed and compute a normal confidence interval. The problem is that there are only a finite number of different possible values for $\delta_i$; they are not normally distributed, and the resulting confidence intervals are too wide.

A Better Method: BDeltaCost. Let $\Delta = \{\delta_i\}_{i=1}^N$ be the set of $\delta_i$'s computed as above. For b from 1 to 1000: let $T_b$ be a bootstrap replicate of $\Delta$, and let $s_b$ be the average of the δ's in $T_b$. Sort the $s_b$'s and identify the 26th and 975th items; these form a 95% confidence interval on the average difference between the loss from h1 and the loss from h2. The bootstrap confidence interval quantifies the uncertainty due to the size of the test set. It does not allow us to compare algorithms, only classifiers.
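
A minimal sketch of this bootstrap procedure (the per-example loss differences below are randomly generated for illustration, not real results):

```python
# Sketch of the bootstrap procedure: 1000 bootstrap replicates of the per-example
# loss differences delta_i, with the 26th and 975th sorted means bounding a 95%
# confidence interval.
import numpy as np

rng = np.random.RandomState(0)
deltas = rng.normal(loc=0.02, scale=0.3, size=400)   # assumed per-example loss differences

means = []
for _ in range(1000):
    replicate = rng.choice(deltas, size=len(deltas), replace=True)  # bootstrap replicate
    means.append(replicate.mean())

means.sort()
print(f"95% CI on average loss difference: [{means[25]:.4f}, {means[974]:.4f}]")
```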

Estimating the Error Rate of a Learning Algorithm. Under the PAC model, training examples x are drawn from an underlying distribution D and labeled according to an unknown function f to give (x, y) pairs where y = f(x). The error rate of a classifier h is $error(h) = P_D(h(x) \neq f(x))$. Define the error rate of a learning algorithm A for sample size m and distribution D as $error(A, m, D) = E_S[error(A(S))]$; this is the expected error rate of h = A(S) for training sets S of size m drawn according to D. We could estimate this if we had several training sets $S_1, \ldots, S_L$ all drawn from D: we could compute $A(S_1), A(S_2), \ldots, A(S_L)$, measure their error rates, and average them. Unfortunately, we don't have enough data to do this!

Two Practical Methods. k-fold cross-validation: this provides an unbiased estimate of $error(A, (1 - 1/k)m, D)$, i.e. for training sets of size $(1 - 1/k)m$. Bootstrap error estimate (the out-of-bag estimate): construct L bootstrap replicates of the training set $S_{train}$, train A on each of them, evaluate each resulting classifier on the examples that did not appear in its bootstrap replicate, and average the resulting error rates.
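
A minimal sketch of the cross-validation estimate (not from the original slides), assuming scikit-learn, a synthetic dataset, and a decision tree as the learning algorithm:

```python
# Sketch: a 10-fold cross-validation estimate of a learning algorithm's error rate.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
acc = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"estimated error rate: {1 - acc.mean():.3f}")
```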

Estimating the Difference Between Two Algorithms: the 5x2CV F Test. For i from 1 to 5, perform a 2-fold cross-validation: split S evenly and randomly into $S_1$ and $S_2$; for j from 1 to 2, train algorithm A on $S_j$ and measure its error rate $p_A^{(i,j)}$, train algorithm B on $S_j$ and measure its error rate $p_B^{(i,j)}$, and compute the difference in error rates on fold j, $p_i^{(j)} = p_A^{(i,j)} - p_B^{(i,j)}$. The average difference in error rates in iteration i is $\bar{p}_i = \frac{p_i^{(1)} + p_i^{(2)}}{2}$, and the variance of the difference for iteration i is $s_i^2 = \left(p_i^{(1)} - \bar{p}_i\right)^2 + \left(p_i^{(2)} - \bar{p}_i\right)^2$. The test statistic is $F = \frac{\sum_i \sum_j \left(p_i^{(j)}\right)^2}{2 \sum_i s_i^2}$.

[Table: for each iteration i = 1, ..., 5, the measured error rates $p_A^{(i,1)}, p_B^{(i,1)}, p_A^{(i,2)}, p_B^{(i,2)}$, the fold differences $p_i^{(1)}, p_i^{(2)}$, and the per-iteration statistics $\bar{p}_i$ and $s_i^2$.]

5x2CV F Test. If F > 4.74 (the 0.05 critical value of the F distribution with 10 and 5 degrees of freedom), then with 95% confidence we can reject the null hypothesis that algorithms A and B have the same error rate when trained on data sets of size m/2.
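
A minimal sketch of the full procedure (not from the original slides), assuming scikit-learn, a synthetic dataset, and logistic regression versus a decision tree as the two algorithms; each learner is trained on one half of each split and tested on the other half:

```python
# Sketch: the 5x2cv F test comparing two learning algorithms A and B.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
A = lambda: LogisticRegression(max_iter=1000)      # assumed algorithm A
B = lambda: DecisionTreeClassifier(random_state=0) # assumed algorithm B

diffs, variances = [], []
for i in range(5):                                  # five random 2-fold splits
    folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=i)
    p = []                                          # p_i^(j) for j = 1, 2
    for train, test in folds.split(X, y):
        pA = 1 - A().fit(X[train], y[train]).score(X[test], y[test])
        pB = 1 - B().fit(X[train], y[train]).score(X[test], y[test])
        p.append(pA - pB)                           # difference in error rates on this fold
    p_bar = (p[0] + p[1]) / 2
    diffs.extend(p)
    variances.append((p[0] - p_bar) ** 2 + (p[1] - p_bar) ** 2)

F = sum(d ** 2 for d in diffs) / (2 * sum(variances))
print(f"F = {F:.3f}; reject 'same error rate' at 95% confidence if F > 4.74")
```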

Summary. ROC curves, reject curves, precision-recall curves, and statistical tests for estimating the error rate of a classifier, comparing two classifiers, estimating the error rate of a learning algorithm, and comparing two algorithms.