Consistent Binary Classification with Generalized Performance Metrics

1 Consistent Binary Classification with Generalized Performance Metrics
Nagarajan Natarajan
Joint work with Oluwasanmi Koyejo, Pradeep Ravikumar and Inderjit Dhillon
UT Austin
Nov 4, 2014

2 Problem and Motivation (1/3)
State-of-the-art understanding of optimal decision making and consistent algorithms for binary classification is limited.
It is well known that accuracy is maximized (equivalently, 0-1 loss is minimized) by thresholding $P(Y = 1 \mid x)$ at 0.5.
Such a characterization is lacking for many utility measures used in practice.

3 Problem and Motivation (2/3)
Most performance measures are based on the four fundamental population quantities: true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN).
Examples include $F_\beta$, the Jaccard coefficient, and other cost-sensitive measures.

4 Problem and Motivation (3/3)
Goals:
1. Develop a general framework for analyzing performance measures
2. Characterize optimal decision functions for a large family of utility measures
3. Develop efficient, and provably consistent, algorithms for maximizing measures in practice

5 A Family of Generalized Performance Metrics (1/3)
Let $\theta : \mathcal{X} \to \{0, 1\}$ denote a classifier, and let $P$ be a fixed unknown distribution over labeled data $\mathcal{X} \times \{0, 1\}$. We define the following ratio family of performance metrics:
$$L(\theta, P) = \frac{a_0 + a_{11} TP + a_{10} FP + a_{01} FN + a_{00} TN}{b_0 + b_{11} TP + b_{10} FP + b_{01} FN + b_{00} TN}$$
where $a_0, b_0, a_{ij}, b_{ij}$, $i, j \in \{0, 1\}$, are non-negative constants and:
$TP := TP(\theta, P) = P(Y = 1, \theta = 1)$,
$FP := FP(\theta, P) = P(Y = 0, \theta = 1)$,
$FN := FN(\theta, P) = P(Y = 1, \theta = 0)$,
$TN := TN(\theta, P) = P(Y = 0, \theta = 0)$.

6 A Family of Generalized Performance Metrics (2/3)
Example metrics in this family:
$$AM = \frac{(1 - \pi) TP + \pi TN}{2\pi(1 - \pi)}$$
$$F_\beta = \frac{(1 + \beta^2) TP}{(1 + \beta^2) TP + \beta^2 FN + FP}$$
$$\text{Jaccard Coefficient} = \frac{TP}{TP + FN + FP}$$
$$\text{Weighted Accuracy} = \frac{w_1 TP + w_2 TN}{w_1 TP + w_2 TN + w_3 FP + w_4 FN}$$
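
The examples above can be checked numerically against the ratio family. The following is a minimal sketch, not code from the talk: the function names and the tuple layout `(x0, x11, x10, x01, x00)` are assumptions made for illustration, following the index convention of the family definition (`x10` multiplies FP, `x01` multiplies FN).

```python
def ratio_metric(a, b, tp, fp, fn, tn):
    """Evaluate a ratio-family metric.

    a and b are coefficient tuples (x0, x11, x10, x01, x00) for the
    numerator and denominator (a layout assumed for this sketch).
    """
    a0, a11, a10, a01, a00 = a
    b0, b11, b10, b01, b00 = b
    num = a0 + a11 * tp + a10 * fp + a01 * fn + a00 * tn
    den = b0 + b11 * tp + b10 * fp + b01 * fn + b00 * tn
    return num / den


def f_beta_coeffs(beta):
    """Coefficients (a, b) that express F_beta as a member of the family."""
    k = 1 + beta ** 2
    # Numerator: (1 + beta^2) TP.  Denominator: (1 + beta^2) TP + FP + beta^2 FN.
    return (0, k, 0, 0, 0), (0, k, 1, beta ** 2, 0)


# Jaccard coefficient: TP / (TP + FP + FN).
JACCARD_COEFFS = ((0, 1, 0, 0, 0), (0, 1, 1, 1, 0))

# Illustrative population quantities (they sum to 1).
tp, fp, fn, tn = 0.3, 0.1, 0.2, 0.4
f1_value = ratio_metric(*f_beta_coeffs(1.0), tp, fp, fn, tn)
jaccard_value = ratio_metric(*JACCARD_COEFFS, tp, fp, fn, tn)
```

With these illustrative quantities, `f1_value` agrees with the direct formula 2TP / (2TP + FP + FN) and `jaccard_value` with TP / (TP + FP + FN).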

7 A Family of Generalized Performance Metrics (3/3)
Let $\gamma(\theta) := P(\theta = 1)$ and $\pi := P(Y = 1)$. Observing that:
$FP = \gamma(\theta) - TP$, $FN = \pi - TP$, $TN = 1 - \gamma(\theta) - \pi + TP$,
we get the following equivalent, simpler representation of the family:
$$L(\theta, P) = \frac{c_0 + c_1 TP + c_2 \gamma(\theta)}{d_0 + d_1 TP + d_2 \gamma(\theta)},$$
for certain constants $c_0, c_1, c_2, d_0, d_1, d_2$.
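
The constants in the simpler representation follow by substituting the three identities into a numerator (or denominator) and collecting terms. The sketch below spells that out; the function name and the tuple layout `(x0, x11, x10, x01, x00)` are assumptions for illustration, not notation from the slides.

```python
def to_tp_gamma_form(coeffs, pi):
    """Rewrite x0 + x11*TP + x10*FP + x01*FN + x00*TN as k0 + k1*TP + k2*gamma.

    Uses FP = gamma - TP, FN = pi - TP, TN = 1 - gamma - pi + TP.
    """
    x0, x11, x10, x01, x00 = coeffs
    k0 = x0 + x01 * pi + x00 * (1 - pi)   # terms free of TP and gamma
    k1 = x11 - x10 - x01 + x00            # coefficient of TP
    k2 = x10 - x00                        # coefficient of gamma
    return k0, k1, k2


# F1 = 2TP / (2TP + FP + FN) with pi = 0.5 (an illustrative value).
pi = 0.5
c = to_tp_gamma_form((0, 2, 0, 0, 0), pi)   # numerator constants
d = to_tp_gamma_form((0, 2, 1, 1, 0), pi)   # denominator constants
```

For F1 this gives `c = (0, 2, 0)` and `d = (pi, 0, 1)`, that is, F1 = 2TP / (pi + gamma).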

8 Optimal Classifier (1/2)
The optimal (Bayes) decision function for a given metric $L$ is:
$$\theta^* = \arg\max_{\theta \in \Theta} L(\theta, P).$$
Main Result 1. Given a performance metric $L$, or equivalently, the constants $\{c_0, c_1, c_2\}$ and $\{d_0, d_1, d_2\}$, let $L^* := L(\theta^*)$ and let:
$$\delta^* = \frac{d_2 L^* - c_2}{c_1 - d_1 L^*}.$$
The Bayes classifier $\theta^*$ takes the form $\theta^*(x) = \operatorname{sign}(P(Y = 1 \mid x) - \delta^*)$.
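
As a quick numerical illustration of the threshold formula, the sketch below evaluates $\delta^*$ for two familiar metrics. The function name is hypothetical, and the check on the denominator is an assumption of this sketch: the sign form stated on the slide corresponds to $c_1 - d_1 L^* > 0$, since maximizing $L$ amounts to predicting 1 where $(c_1 - d_1 L^*)\,\eta(x) > d_2 L^* - c_2$.

```python
def optimal_threshold(c, d, l_star):
    """Threshold delta* = (d2*L* - c2) / (c1 - d1*L*) from Main Result 1.

    c = (c0, c1, c2), d = (d0, d1, d2); l_star is the optimal metric value.
    """
    _, c1, c2 = c
    _, d1, d2 = d
    denom = c1 - d1 * l_star
    if denom <= 0:
        # Assumption of this sketch: only the case covered by the
        # sign(eta - delta*) form on the slide is handled.
        raise ValueError("sketch assumes c1 - d1 * L* > 0")
    return (d2 * l_star - c2) / denom


pi = 0.3  # illustrative class prior

# Accuracy = TP + TN = (1 - pi) + 2*TP - gamma, so c = (1 - pi, 2, -1), d = (1, 0, 0).
delta_accuracy = optimal_threshold((1 - pi, 2, -1), (1, 0, 0), l_star=0.9)

# F1 = 2*TP / (pi + gamma), so c = (0, 2, 0), d = (pi, 0, 1) and delta* = L*/2.
delta_f1 = optimal_threshold((0, 2, 0), (pi, 0, 1), l_star=0.68)
```

The accuracy case recovers the classical threshold 0.5 whatever the value of `l_star`, while for F1 the threshold is half the optimal F1 value; the `l_star` values here are illustrative, not results from the talk.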

9 Optimal Classifier (2/2)
Implication: the optimal decision function for a metric in our family can be found among the thresholded classifiers:
$$\theta^* \in \arg\max_{\delta \in (0,1)} L(I(P(Y = 1 \mid x) \geq \delta), P),$$
where $I(P(Y = 1 \mid x) \geq \delta)$ is the classifier that thresholds the conditional at $\delta$.
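
The implication can be sanity-checked by brute force on a toy problem. The sketch below is an illustration under assumed values, not an experiment from the talk: it takes a five-point domain with a made-up marginal and conditional, enumerates all 32 classifiers, and compares the F1-optimal one with the classifier that thresholds $\eta(x)$ at $\delta^* = L^*/2$ (the F1 special case of Main Result 1).

```python
from itertools import product

# Toy population (assumed values): five points, uniform marginal.
p_x = [0.2, 0.2, 0.2, 0.2, 0.2]       # P(X = x)
eta = [0.1, 0.3, 0.5, 0.7, 0.9]       # eta(x) = P(Y = 1 | X = x)
pi = sum(p * e for p, e in zip(p_x, eta))


def f1(theta):
    """Population F1 = 2*TP / (pi + gamma) for a 0/1 labelling of the domain."""
    tp = sum(p * e * t for p, e, t in zip(p_x, eta, theta))
    gamma = sum(p * t for p, t in zip(p_x, theta))
    return 2 * tp / (pi + gamma)


# Exhaustive search over all 2^5 classifiers.
best = max(product([0, 1], repeat=len(eta)), key=f1)
l_star = f1(best)

# Threshold eta at delta* = L*/2 and compare with the brute-force optimum.
delta_star = l_star / 2
thresholded = tuple(int(e > delta_star) for e in eta)
```

On this toy population the exhaustive optimum and the thresholded classifier coincide, which is what the implication predicts.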

10 Recovered and New Results (1/2)

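11 Recovered and New Results (2/2)
Simulated results showing $\eta(x) := P(Y = 1 \mid x)$, the optimal threshold $\delta^*$ and the Bayes classifier $\theta^*$.
[Figure: two panels plotting $\eta(x)$, $\delta^*$ and $\theta^*$ against $x$. Panel $F_1 = \frac{2TP}{2TP + FP + FN}$: $\delta^* = 0.34$. Panel Weighted Accuracy $= \frac{2TP + 2TN}{2TP + FP + FN + 2TN}$: $\delta^* = 0.50$.]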

12 Maximizing L in Practice (1/3)
Given an i.i.d. sample $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, we would want to maximize the empirical measure:
$$L_n(\theta) = \frac{c_1 TP_n(\theta) + c_2 \gamma_n(\theta) + c_0}{d_1 TP_n(\theta) + d_2 \gamma_n(\theta) + d_0},$$
where $TP_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \theta(x_i) y_i$ and $\gamma_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \theta(x_i)$.
However, maximizing $L_n(\theta)$ is often NP-hard. Main Result 1 suggests two simple procedures for estimating $\theta^*$ from training data.
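
The empirical measure is straightforward to compute from 0/1 predictions and labels. A minimal sketch follows; the function name is hypothetical, and plugging the empirical positive rate in for $\pi$ inside $d_0$ is an assumption of the example.

```python
def empirical_metric(c, d, preds, labels):
    """L_n(theta) from 0/1 predictions theta(x_i) and 0/1 labels y_i."""
    n = len(labels)
    tp_n = sum(t * y for t, y in zip(preds, labels)) / n   # TP_n(theta)
    gamma_n = sum(preds) / n                               # gamma_n(theta)
    c0, c1, c2 = c
    d0, d1, d2 = d
    return (c0 + c1 * tp_n + c2 * gamma_n) / (d0 + d1 * tp_n + d2 * gamma_n)


preds = [1, 1, 0, 0, 1]
labels = [1, 0, 0, 1, 1]
pi_n = sum(labels) / len(labels)   # empirical estimate of pi (assumption)

# F1 in the (TP, gamma) representation: 2*TP / (pi + gamma).
f1_n = empirical_metric((0, 2, 0), (pi_n, 0, 1), preds, labels)
```

Here there are 2 true positives, 1 false positive and 1 false negative, so `f1_n` matches the count-based value 4 / (4 + 1 + 1).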

13 Maximizing L in Practice (2/3)
Algorithm 1: Two-Step EUM
Input: Training examples $S = \{X_i, Y_i\}_{i=1}^{n}$ and utility measure $L$.
1. Split the training data $S$ into two sets $S_1$ and $S_2$.
2. Compute an estimate $\hat{\eta}(x)$ of $\eta(x) := P(Y = 1 \mid x)$ using $S_1$, and define $\hat{\theta}_\delta = \operatorname{sign}(\hat{\eta}(x) - \delta)$.
3. Compute $\hat{\delta} = \arg\max_{\delta \in (0,1)} L_n(\hat{\theta}_\delta)$ on $S_2$.
Return: $\hat{\theta}_{\hat{\delta}}$
The 1-d optimization in Step 3 can be done efficiently: $L_n$ changes only on $O(n)$ discrete thresholds.
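
Step 3 can be sketched as a single sweep over the sorted scores. The code below is an illustration rather than the authors' implementation: it assumes the scores $\hat{\eta}(x_i)$ on $S_2$ were already produced by some estimator fit on $S_1$, uses the convention "predict 1 when the score is at least $\delta$", and only considers the distinct score values as candidate thresholds (so the all-negative classifier is never tried). The name `best_threshold` and the `metric(tp_n, gamma_n)` callback are hypothetical.

```python
def best_threshold(scores, labels, metric):
    """Return (delta_hat, value) maximizing metric(TP_n, gamma_n) over thresholds.

    scores: estimated eta(x_i) on the held-out split; labels: 0/1 values.
    The classifier for threshold delta predicts 1 iff score >= delta.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    best_delta, best_value = None, float("-inf")
    tp_count, k = 0, 0
    while k < n:
        delta = scores[order[k]]
        # Move every example tied at this score to the positive side.
        while k < n and scores[order[k]] == delta:
            tp_count += labels[order[k]]
            k += 1
        value = metric(tp_count / n, k / n)
        if value > best_value:
            best_delta, best_value = delta, value
    return best_delta, best_value


scores = [0.9, 0.8, 0.6, 0.4, 0.2]   # illustrative held-out scores
labels = [1, 1, 0, 1, 0]
pi_n = sum(labels) / len(labels)

# Empirical F1 in the (TP, gamma) representation.
delta_hat, f1_hat = best_threshold(
    scores, labels, lambda tp_n, gamma_n: 2 * tp_n / (pi_n + gamma_n)
)
```

Sorting dominates the cost, so the search runs in O(n log n) time, consistent with the remark that $L_n$ changes only at O(n) thresholds.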

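14 Maximizing L in Practice (3/3)
The second method is based on minimizing a surrogate $\ell$ of the weighted 0-1 loss:
$$\ell_\delta(t, y) = (1 - \delta)\, 1_{\{y = 1\}}\, \ell(t, 1) + \delta\, 1_{\{y = 0\}}\, \ell(t, 0).$$
Algorithm 2: Weighted ERM
Input: Training examples $S = \{X_i, Y_i\}_{i=1}^{n}$, prediction function class $\Phi \subseteq \{\phi : \mathcal{X} \to \mathbb{R}\}$ and utility measure $L$.
1. Split the training data $S$ into two sets $S_1$ and $S_2$.
2. Compute $\hat{\delta} = \arg\max_{\delta \in (0,1)} L_n(\hat{\theta}_\delta)$ on $S_2$.
   Sub-algorithm: $\hat{\theta}_\delta(x) := \operatorname{sign}(\hat{\phi}_\delta(x))$, where $\hat{\phi}_\delta = \arg\min_{\phi \in \Phi} \frac{1}{|S_1|} \sum_{i=1}^{|S_1|} \ell_\delta(\phi(x_i), Y_i)$.
3. Return: $\hat{\theta}_{\hat{\delta}}$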
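
To see why the weighted surrogate leads to a threshold at $\delta$, it helps to look at the pointwise (conditional) risk for one concrete loss. The sketch below assumes the logistic loss as the base surrogate $\ell$, which is a choice made for this illustration and not something fixed by the slides. For that loss the minimizer of $\eta\,\ell_\delta(t, 1) + (1 - \eta)\,\ell_\delta(t, 0)$ has the closed form $t^* = \log\frac{\eta(1 - \delta)}{(1 - \eta)\delta}$, which is positive exactly when $\eta > \delta$.

```python
import math


def logistic_loss(t, y):
    """Logistic surrogate: log(1 + exp(-t)) for y = 1, log(1 + exp(t)) for y = 0."""
    margin = t if y == 1 else -t
    return math.log1p(math.exp(-margin))


def weighted_loss(t, y, delta):
    """l_delta(t, y) = (1 - delta) 1{y=1} l(t, 1) + delta 1{y=0} l(t, 0)."""
    if y == 1:
        return (1 - delta) * logistic_loss(t, 1)
    return delta * logistic_loss(t, 0)


def conditional_risk(t, eta, delta):
    """Expected weighted loss at a point x with P(Y = 1 | x) = eta."""
    return eta * weighted_loss(t, 1, delta) + (1 - eta) * weighted_loss(t, 0, delta)


def risk_minimizer(eta, delta):
    """Closed-form minimizer of the conditional risk for the logistic loss."""
    return math.log(eta * (1 - delta) / ((1 - eta) * delta))


delta = 0.34  # illustrative threshold
t_above = risk_minimizer(0.5, delta)   # eta > delta, so the score is positive
t_below = risk_minimizer(0.3, delta)   # eta < delta, so the score is negative
```

Taking the sign of the minimizing score therefore reproduces the thresholded rule $\operatorname{sign}(\eta(x) - \delta)$, at least at the population level and for this particular loss.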

15 Consistency of Empirical Estimation (1/2)
For consistency w.r.t. the metric $L$, we need the estimated $\hat{\theta}$ to satisfy $L^* - L(\hat{\theta}) \xrightarrow{p} 0$.
Theorem (Uniform convergence of $L_n$). Consider the function class of all thresholded decisions $\Theta = \{I(\phi(x) \geq \delta) : \delta \in (0, 1)\}$ for a $[0, 1]$-valued function $\phi : \mathcal{X} \to [0, 1]$. For sufficiently large $n$ that is a function of constants associated with $L$, $\epsilon$ and $\rho$, with probability at least $1 - \rho$,
$$\sup_{\theta \in \Theta} |L_n(\theta) - L(\theta)| < \epsilon.$$

16 Consistency of Empirical Estimation (2/2)
Main Result 2. If the estimate $\hat{\eta}(x)$ satisfies $\hat{\eta}(x) \xrightarrow{p} \eta(x)$, Algorithm 1 is $L$-consistent.
Main Result 3. Let $\ell : \mathbb{R} \to [0, \infty)$ be a classification-calibrated convex (margin) loss and let $\ell_\delta$ be the corresponding weighted loss for a given $\delta$ used in Algorithm 2. Then, Algorithm 2 is $L$-consistent.

17 Experimental Results
Evaluate Algorithms 1 and 2 on two metrics: $F_1$ and Weighted Accuracy $\frac{2(TP + TN)}{2(TP + TN) + FP + FN}$.
Compare the two algorithms with standard ERM (regularized logistic regression) on the datasets listed below:
1. Letters: 26 classes (English alphabet), 20000 instances
2. Scene: 6 classes (scene types), 2230 images
3. Web Page: 2 classes (spam/non-spam), pages
4. Image: 2 classes, 2068 images
5. Spambase: 2 classes (spam/non-spam)

18 Experimental Results: $F_1$

19 Experimental Results: Weighted Accuracy

20 Open Problems & Future Directions
There exist other utility metrics that are not in our family, but have similar thresholded optimal classifiers (Check out Poster??!). This raises the question: identify/characterize the entire family of utility metrics with simple optimal decision functions.
Develop surrogate theory for $L$.
Obtain convergence rates for $L(\hat{\theta}) \xrightarrow{p} L(\theta^*)$ as $\hat{\theta} \xrightarrow{p} \theta^*$.
Multi-label classification setting: the definition of $L$ can be extended in more than one way! Do similar results hold in this setting?