Consistent Binary Classification with Generalized Performance Metrics




Consistent Binary Classification with Generalized Performance Metrics

Nagarajan Natarajan
Joint work with Oluwasanmi Koyejo, Pradeep Ravikumar and Inderjit Dhillon
UT Austin
Nov 4, 2014

Problem and Motivation (1/3)

State-of-the-art understanding of optimal decision making and consistent algorithms for binary classification is limited. It is well known that accuracy is maximized (equivalently, 0-1 loss is minimized) by thresholding P(Y = 1 | x) at 1/2. Such a characterization is lacking for many utility measures used in practice.

Problem and Motivation (2/3)

Most performance measures are based on the four fundamental population quantities: true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). Examples include F_β, the Jaccard coefficient, and other cost-sensitive measures.

Problem and Motivation (3/3)

Goals:
1. Develop a general framework for analyzing performance measures
2. Characterize optimal decision functions for a large family of utility measures
3. Develop efficient, provably consistent algorithms for maximizing the measures in practice

A Family of Generalized Performance Metrics (1/3)

Let θ : X → {0, 1} denote a classifier, and let P be a fixed unknown distribution over labeled data X × {0, 1}. We define the following ratio family of performance metrics:

L(θ, P) = (a_0 + a_11 TP + a_10 FP + a_01 FN + a_00 TN) / (b_0 + b_11 TP + b_10 FP + b_01 FN + b_00 TN)

where a_0, b_0, a_ij, b_ij for i, j ∈ {0, 1} are non-negative constants and:

TP := TP(θ, P) = P(Y = 1, θ = 1),
FP := FP(θ, P) = P(Y = 0, θ = 1),
FN := FN(θ, P) = P(Y = 1, θ = 0),
TN := TN(θ, P) = P(Y = 0, θ = 0).

A Family of Generalized Performance Metrics (2/3)

Example metrics in this family:

AM = ((1 − π) TP + π TN) / (2π(1 − π))
F_β = (1 + β²) TP / ((1 + β²) TP + β² FN + FP)
Jaccard Coefficient = TP / (TP + FN + FP)
Weighted Accuracy = (w_1 TP + w_2 TN) / (w_1 TP + w_2 TN + w_3 FP + w_4 FN)

A Family of Generalized Performance Metrics (3/3)

Let γ(θ) := P(θ = 1) and π := P(Y = 1). Observing that:

FP = γ(θ) − TP,  FN = π − TP,  TN = 1 − γ(θ) − π + TP

we get the following equivalent, simpler representation of the family:

L(θ, P) = (c_0 + c_1 TP + c_2 γ(θ)) / (d_0 + d_1 TP + d_2 γ(θ)),

for certain constants c_0, c_1, c_2, d_0, d_1, d_2.
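The simplified representation makes each metric a function of just TP and γ(θ). As an illustrative sketch (the function name and numeric values are hypothetical, not from the slides): substituting FP = γ − TP and FN = π − TP into F_β with β = 1 gives F_1 = 2·TP / (π + γ), i.e. c = (0, 2, 0) and d = (π, 0, 1).

```python
def ratio_metric(tp, gamma, c, d):
    """Evaluate L = (c0 + c1*TP + c2*gamma) / (d0 + d1*TP + d2*gamma)."""
    c0, c1, c2 = c
    d0, d1, d2 = d
    return (c0 + c1 * tp + c2 * gamma) / (d0 + d1 * tp + d2 * gamma)

# F1 (beta = 1) as a member of the family: c = (0, 2, 0), d = (pi, 0, 1).
pi = 0.3   # class prior P(Y = 1), illustrative value
f1 = ratio_metric(tp=0.25, gamma=0.35, c=(0, 2, 0), d=(pi, 0, 1))
# f1 = 2*0.25 / (0.3 + 0.35) ≈ 0.769
```

As a sanity check, the same TP = 0.25 and γ = 0.35 give FP = 0.10 and FN = 0.05, and the raw-count formula 2·TP / (2·TP + FP + FN) = 0.5 / 0.65 agrees.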

Optimal Classifier (1/2)

The optimal (Bayes) decision function for a given metric L is θ* = arg max_{θ ∈ Θ} L(θ, P).

Main Result 1. Given a performance metric L, or equivalently the constants {c_0, c_1, c_2} and {d_0, d_1, d_2}, let L* := L(θ*) and let:

δ* = (d_2 L* − c_2) / (c_1 − d_1 L*).

The Bayes classifier θ* takes the form θ*(x) = sign(P(Y = 1 | x) − δ*).
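As a worked instance of Main Result 1 (our own sketch, not from the slides): F_1 in the simplified family is 2·TP / (π + γ), i.e. c = (0, 2, 0) and d = (π, 0, 1), so

```latex
\delta^* \;=\; \frac{d_2 L^* - c_2}{c_1 - d_1 L^*}
        \;=\; \frac{1 \cdot L^* - 0}{2 - 0 \cdot L^*}
        \;=\; \frac{L^*}{2}.
```

That is, the optimal threshold for F_1 is half the optimal F_1 value, consistent with the simulated δ* = 0.34 shown later.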

Optimal Classifier (2/2)

Implication: the optimal decision function for a metric in our family can be found among the thresholded classifiers:

θ* ∈ arg max_{δ ∈ (0,1)} L(I(P(Y = 1 | x) ≥ δ), P),

where I(P(Y = 1 | x) ≥ δ) is the classifier that thresholds the conditional at δ.

Recovered and New Results (1/2)

Recovered and New Results (2/2)

Simulated results showing η(x) := P(Y = 1 | x), the optimal threshold δ*, and the Bayes classifier θ*:

- F_1 = 2TP / (2TP + FP + FN): optimal threshold δ* = 0.34
- Weighted Accuracy = (2TP + 2TN) / (2TP + FP + FN + 2TN): optimal threshold δ* = 0.50

[Plots of η(x), δ*, and θ* over x omitted.]

Maximizing L in Practice (1/3)

Given an iid sample (X_i, Y_i), i = 1, 2, ..., n, we would like to maximize the empirical measure:

L_n(θ) = (c_1 TP_n(θ) + c_2 γ_n(θ) + c_0) / (d_1 TP_n(θ) + d_2 γ_n(θ) + d_0),

where TP_n(θ) = (1/n) Σ_{i=1}^n θ(X_i) Y_i and γ_n(θ) = (1/n) Σ_{i=1}^n θ(X_i).

However, maximizing L_n(θ) directly is often NP-hard. Main Result 1 suggests two simple procedures for estimating θ* from training data.

Maximizing L in Practice (2/3)

Algorithm 1: Two-Step EUM
Input: Training examples S = {X_i, Y_i}_{i=1}^n and utility measure L.
1. Split the training data S into two sets S_1 and S_2.
2. Estimate η(x) := P(Y = 1 | x) using S_1; define θ̂_δ = sign(η̂(x) − δ).
3. Compute δ̂ = arg max_{δ ∈ (0,1)} L_n(θ̂_δ) on S_2.
4. Return: θ̂_δ̂

The 1-d optimization in Step 3 can be done efficiently: L_n changes only at O(n) discrete thresholds.
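Step 3 can be sketched as follows (a minimal illustration, not the authors' code; function names are hypothetical, η̂ is assumed already estimated on S_1, and the metric is hard-coded to F_1):

```python
import numpy as np

def empirical_f1(pred, y):
    # F1 = 2*TP / (2*TP + FP + FN), with TP/FP/FN as empirical proportions
    tp = np.mean(pred * y)
    fp = np.mean(pred * (1 - y))
    fn = np.mean((1 - pred) * y)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def tune_threshold(eta_hat, y, metric=empirical_f1):
    """Step 3 of Two-Step EUM: pick delta maximizing the empirical metric.

    eta_hat: estimated P(Y=1|x) on the held-out split S_2 (from Step 2).
    L_n changes only at the observed values of eta_hat, so scanning
    those O(n) candidate thresholds suffices.
    """
    best_delta, best_val = 0.5, -np.inf
    for delta in np.unique(eta_hat):
        pred = (eta_hat >= delta).astype(int)
        val = metric(pred, y)
        if val > best_val:
            best_delta, best_val = delta, val
    return best_delta
```

For a general member of the family one would swap `empirical_f1` for the corresponding ratio of empirical counts.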

Maximizing L in Practice (3/3)

The second method is based on minimizing a surrogate l of the weighted 0-1 loss:

l_δ(t, y) = (1 − δ) 1_{y=1} l(t, 1) + δ 1_{y=0} l(t, 0).

Algorithm 2: Weighted ERM
Input: Training examples S = {X_i, Y_i}_{i=1}^n, prediction function class Φ ⊆ {φ : X → R}, and utility measure L.
1. Split the training data S into two sets S_1 and S_2.
2. Compute δ̂ = arg max_{δ ∈ (0,1)} L_n(θ̂_δ) on S_2.
   Sub-algorithm: θ̂_δ(x) := sign(φ̂_δ(x)), where φ̂_δ = arg min_{φ ∈ Φ} (1/|S_1|) Σ_{i ∈ S_1} l_δ(φ(X_i), Y_i).
3. Return: θ̂_δ̂
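The inner minimization over Φ (here, linear scores with a logistic surrogate) can be sketched with per-sample weights (1 − δ) on positives and δ on negatives. This is a hypothetical minimal gradient-descent illustration, not the authors' implementation; in practice any solver accepting sample weights works.

```python
import numpy as np

def weighted_logistic_erm(X, y, delta, lr=0.1, n_iter=500):
    """Sub-algorithm of Weighted ERM: minimize the delta-weighted logistic loss.

    Positives (y=1) get weight (1 - delta), negatives (y=0) get weight delta,
    matching l_delta(t, y) with l the logistic (margin) loss.
    """
    n, d = X.shape
    w = np.zeros(d)
    sample_w = np.where(y == 1, 1 - delta, delta)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ w)))     # sigmoid of the linear score
        grad = X.T @ (sample_w * (p - y)) / n
        w -= lr * grad
    return w   # classify with sign(x @ w)
```

The outer loop over δ then proceeds exactly as in Algorithm 1, scoring each fitted classifier with L_n on S_2.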

Consistency of Empirical Estimation (1/2)

For consistency w.r.t. the metric L, we need the estimated θ̂ to satisfy L* − L(θ̂) →p 0.

Theorem (Uniform convergence of L_n). Consider the function class of thresholded decisions Θ = {I(φ(x) ≥ δ) : δ ∈ (0, 1)} for a [0, 1]-valued function φ : X → [0, 1]. For sufficiently large n (a function of constants associated with L, ε and ρ), with probability at least 1 − ρ,

sup_{θ ∈ Θ} |L_n(θ) − L(θ)| < ε.

Consistency of Empirical Estimation (2/2)

Main Result 2. If the estimate η̂(x) satisfies η̂(x) →p η(x), Algorithm 1 is L-consistent.

Main Result 3. Let l : R → [0, ∞) be a classification-calibrated convex (margin) loss and let l_δ be the corresponding weighted loss for a given δ used in Algorithm 2. Then Algorithm 2 is L-consistent.

Experimental Results

Evaluate Algorithms 1 and 2 on two metrics: F_1 and Weighted Accuracy = 2(TP + TN) / (2(TP + TN) + FP + FN). Compare the two algorithms with standard ERM (regularized logistic regression) on the datasets listed below:

1. Letters: 26 classes (English alphabet), 20000 instances
2. Scene: 6 classes (scene types), 2230 images
3. Web Page: 2 classes (spam/non-spam), 34780 pages
4. Image: 2 classes, 2068 images
5. Spambase: 2 classes (spam/non-spam), 4601 emails

Experimental Results: F 1

Experimental Results: Weighted Accuracy

Open Problems & Future Directions

- There exist other utility metrics that are not in our family but have similar thresholded optimal classifiers (check out the poster!). This raises the question: identify/characterize the entire family of utility metrics with simple optimal decision functions.
- Develop surrogate theory for L.
- Obtain convergence rates for L(θ̂) →p L(θ*) as θ̂ →p θ*.
- Multi-label classification setting: the definition of L can be extended in more than one way. Do similar results hold in this setting?