Consistent Binary Classification with Generalized Performance Metrics Nagarajan Natarajan Joint work with Oluwasanmi Koyejo, Pradeep Ravikumar and Inderjit Dhillon UT Austin Nov 4, 2014
Problem and Motivation (1/3)
State-of-the-art understanding of optimal decision making and consistent algorithms for binary classification is limited. It is well known that accuracy is maximized (equivalently, 0-1 loss is minimized) by thresholding P(Y = 1 | x) at 0.5. Such a characterization is lacking for many utility measures used in practice.
Problem and Motivation (2/3)
Most performance measures are based on the four fundamental population quantities: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). Examples include F_β, the Jaccard coefficient, and other cost-sensitive measures.
Problem and Motivation (3/3)
Goals:
1. Develop a general framework for analyzing performance measures
2. Characterize optimal decision functions for a large family of utility measures
3. Develop efficient and provably consistent algorithms for maximizing these measures in practice
A Family of Generalized Performance Metrics (1/3)
Let θ : X → {0, 1} denote a classifier, and let P be a fixed unknown distribution over labeled data X × {0, 1}. We define the following ratio family of performance metrics:

L(θ, P) = (a_0 + a_11 TP + a_10 FP + a_01 FN + a_00 TN) / (b_0 + b_11 TP + b_10 FP + b_01 FN + b_00 TN),

where a_0, b_0, a_ij, b_ij for i, j ∈ {0, 1} are non-negative constants and:

TP := TP(θ, P) = P(Y = 1, θ = 1),  FP := FP(θ, P) = P(Y = 0, θ = 1),
FN := FN(θ, P) = P(Y = 1, θ = 0),  TN := TN(θ, P) = P(Y = 0, θ = 0).
A Family of Generalized Performance Metrics (2/3)
Example metrics in this family:

AM = ((1 − π) TP + π TN) / (2π(1 − π))
F_β = (1 + β²) TP / ((1 + β²) TP + β² FN + FP)
Jaccard Coefficient = TP / (TP + FN + FP)
Weighted Accuracy = (w_1 TP + w_2 TN) / (w_1 TP + w_2 TN + w_3 FP + w_4 FN)
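The metrics above can be sketched as instances of the ratio family: each one is fixed by choosing the numerator and denominator coefficients. A minimal illustration (the function and coefficient names are ours, not from the paper):

```python
# Each metric in the ratio family is determined by coefficient tuples
# (const, c_TP, c_FP, c_FN, c_TN) for the numerator and denominator.

def ratio_metric(tp, fp, fn, tn, num, den):
    """Generic member of the family; num/den are coefficient tuples."""
    a0, a11, a10, a01, a00 = num
    b0, b11, b10, b01, b00 = den
    return ((a0 + a11 * tp + a10 * fp + a01 * fn + a00 * tn) /
            (b0 + b11 * tp + b10 * fp + b01 * fn + b00 * tn))

def f_beta(tp, fp, fn, tn, beta=1.0):
    b2 = beta ** 2
    return ratio_metric(tp, fp, fn, tn,
                        num=(0, 1 + b2, 0, 0, 0),
                        den=(0, 1 + b2, 1, b2, 0))

def jaccard(tp, fp, fn, tn):
    return ratio_metric(tp, fp, fn, tn,
                        num=(0, 1, 0, 0, 0),
                        den=(0, 1, 1, 1, 0))

# Example with population fractions TP=0.3, FP=0.1, FN=0.2, TN=0.4:
print(f_beta(0.3, 0.1, 0.2, 0.4))   # F1 = 0.6 / 0.9 ≈ 0.667
print(jaccard(0.3, 0.1, 0.2, 0.4))  # 0.3 / 0.6 = 0.5
```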
A Family of Generalized Performance Metrics (3/3)
Let γ(θ) := P(θ = 1) and π := P(Y = 1). Observing that:

FP = γ(θ) − TP,  FN = π − TP,  TN = 1 − γ(θ) − π + TP,

we get the following equivalent, simpler representation of the family:

L(θ, P) = (c_0 + c_1 TP + c_2 γ(θ)) / (d_0 + d_1 TP + d_2 γ(θ)),

for certain constants c_0, c_1, c_2, d_0, d_1, d_2.
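The substitution above can be carried out mechanically: plugging FP = γ − TP, FN = π − TP, TN = 1 − γ − π + TP into a_0 + a_11 TP + a_10 FP + a_01 FN + a_00 TN and collecting terms gives c_0 = a_0 + a_01 π + a_00 (1 − π), c_1 = a_11 − a_10 − a_01 + a_00, and c_2 = a_10 − a_00 (likewise for the b's and d's). A sketch, with a function name of our choosing:

```python
# Reduce numerator coefficients (a0, a11, a10, a01, a00) of the ratio
# family to the (c0, c1, c2) form, given pi = P(Y = 1).

def reduce_constants(a0, a11, a10, a01, a00, pi):
    c0 = a0 + a01 * pi + a00 * (1 - pi)
    c1 = a11 - a10 - a01 + a00
    c2 = a10 - a00
    return c0, c1, c2

# Sanity check on Jaccard's denominator TP + FN + FP
# (coefficients 0, 1, 1, 1, 0): it reduces to pi + gamma - TP,
# i.e. d0 = pi, d1 = -1, d2 = 1.
print(reduce_constants(0, 1, 1, 1, 0, pi=0.3))  # (0.3, -1, 1)
```

Note that even though the a's and b's are non-negative, the reduced constants can be negative, which is why the slide says "certain constants".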
Optimal Classifier (1/2)
The optimal (Bayes) decision function for a given metric L is:

θ* = arg max_{θ ∈ Θ} L(θ, P).

Main Result 1. Given a performance metric L, or equivalently, the constants {c_0, c_1, c_2} and {d_0, d_1, d_2}, let L* := L(θ*) and let:

δ* = (d_2 L* − c_2) / (c_1 − d_1 L*).

The Bayes classifier θ* takes the form θ*(x) = sign(P(Y = 1 | x) − δ*).
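The threshold formula in Main Result 1 is a one-liner. As a sketch (function name ours): for F_1, the reduced form is 2 TP / (π + γ), i.e. c_1 = 2, c_2 = 0, d_1 = 0, d_2 = 1, so δ* = L*/2, i.e. half the optimal F_1 value. The value L* = 0.68 below is only an illustrative assumption:

```python
# delta* = (d2 * L* - c2) / (c1 - d1 * L*), from Main Result 1.

def optimal_threshold(c1, c2, d1, d2, L_star):
    return (d2 * L_star - c2) / (c1 - d1 * L_star)

# F1 in reduced form: c1 = 2, c2 = 0, d1 = 0, d2 = 1 (with d0 = pi).
# If the optimal F1 value were L* = 0.68, the threshold is L*/2:
print(optimal_threshold(c1=2, c2=0, d1=0, d2=1, L_star=0.68))  # 0.34
```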
Optimal Classifier (2/2)
Implication: the optimal decision function for a metric in our family can be found among the thresholded classifiers:

θ* ∈ arg max_{δ ∈ (0, 1)} L(I(P(Y = 1 | x) ≥ δ), P),

where I(P(Y = 1 | x) ≥ δ) is the classifier that thresholds the conditional at δ.
Recovered and New Results (1/2)
Recovered and New Results (2/2)
Simulated results showing η(x) := P(Y = 1 | x), the optimal threshold δ*, and the Bayes classifier θ*.
[Figure: two panels plotting η(x), δ*, and θ* over x = 1, ..., 10; left: F_1 = 2TP / (2TP + FP + FN), with δ* = 0.34; right: Weighted Accuracy = (2TP + 2TN) / (2TP + FP + FN + 2TN), with δ* = 0.50]
Maximizing L in Practice (1/3)
Given an iid sample (X_i, Y_i), i = 1, 2, ..., n, we would like to maximize the empirical measure:

L_n(θ) = (c_1 TP_n(θ) + c_2 γ_n(θ) + c_0) / (d_1 TP_n(θ) + d_2 γ_n(θ) + d_0),

where TP_n(θ) = (1/n) Σ_{i=1}^n θ(X_i) Y_i and γ_n(θ) = (1/n) Σ_{i=1}^n θ(X_i).

However, maximizing L_n(θ) directly is often NP-hard. Main Result 1 suggests two simple procedures for estimating θ* from training data.
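The empirical measure L_n is straightforward to compute for a fixed classifier. A minimal sketch (function and variable names ours), checked against the usual empirical F_1:

```python
# Empirical L_n for given predictions, using the reduced constants
# c = (c0, c1, c2) and d = (d0, d1, d2).

def empirical_L(preds, labels, c, d):
    c0, c1, c2 = c
    d0, d1, d2 = d
    n = len(labels)
    tp_n = sum(p * y for p, y in zip(preds, labels)) / n
    gamma_n = sum(preds) / n
    return (c1 * tp_n + c2 * gamma_n + c0) / (d1 * tp_n + d2 * gamma_n + d0)

# Example: empirical F1 = 2*TP_n / (pi_n + gamma_n), i.e. c = (0, 2, 0),
# d = (pi_n, 0, 1), with pi_n the empirical positive rate.
labels = [1, 0, 1, 1, 0, 0]
preds  = [1, 0, 1, 0, 1, 0]
pi_n = sum(labels) / len(labels)
print(empirical_L(preds, labels, c=(0, 2, 0), d=(pi_n, 0, 1)))  # 2/3
```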
Maximizing L in Practice (2/3)
Algorithm 1: Two-Step EUM
Input: training examples S = {X_i, Y_i}_{i=1}^n and utility measure L.
1. Split the training data S into two sets S_1 and S_2.
2. Estimate η(x) := P(Y = 1 | x) using S_1; define θ̂_δ = sign(η̂(x) − δ).
3. Compute δ̂ = arg max_{δ ∈ (0, 1)} L_n(θ̂_δ) on S_2.
4. Return: θ̂_δ̂.
The 1-d optimization in Step 3 can be done efficiently: L_n changes only at O(n) discrete thresholds.
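Step 3 above can be sketched as a scan over the distinct values of η̂ on S_2, since L_n changes only at those thresholds. This is a toy sketch, not the paper's implementation: `eta_hat` is assumed already fit on S_1 (e.g. by logistic regression), and the hand-made η̂ and labels below are illustrative.

```python
# Two-Step EUM, Step 3: pick the threshold on S2 that maximizes an
# empirical utility, scanning only the O(n) candidate thresholds.

def two_step_threshold(eta_hat, X2, Y2, metric):
    """metric(preds, labels) -> empirical utility; returns best delta."""
    scores = [eta_hat(x) for x in X2]
    best_delta, best_val = 0.5, float("-inf")
    for delta in sorted(set(scores)):
        preds = [1 if s >= delta else 0 for s in scores]
        val = metric(preds, Y2)
        if val > best_val:
            best_delta, best_val = delta, val
    return best_delta

def f1_n(preds, labels):
    tp = sum(p * y for p, y in zip(preds, labels))
    fp = sum(p * (1 - y) for p, y in zip(preds, labels))
    fn = sum((1 - p) * y for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy usage with a hand-made eta_hat:
eta_hat = lambda x: x / 10.0
X2 = list(range(1, 11))                 # x = 1, ..., 10
Y2 = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
delta = two_step_threshold(eta_hat, X2, Y2, f1_n)
print(delta)  # 0.4: thresholding there misclassifies only x = 5
```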
Maximizing L in Practice (3/3)
The second method minimizes a surrogate ℓ of the weighted 0-1 loss:

ℓ_δ(t, y) = (1 − δ) 1{y = 1} ℓ(t, 1) + δ 1{y = 0} ℓ(t, 0).

Algorithm 2: Weighted ERM
Input: training examples S = {X_i, Y_i}_{i=1}^n, prediction function class Φ ⊆ {φ : X → R}, and utility measure L.
1. Split the training data S into two sets S_1 and S_2.
2. Compute δ̂ = arg max_{δ ∈ (0, 1)} L_n(θ̂_δ) on S_2, where the sub-algorithm on S_1 defines θ̂_δ(x) := sign(φ̂_δ(x)) with
   φ̂_δ = arg min_{φ ∈ Φ} (1/|S_1|) Σ_{i ∈ S_1} ℓ_δ(φ(X_i), Y_i).
3. Return: θ̂_δ̂.
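The inner arg-min can be sketched with any weighted convex loss: for a given δ, positives get weight (1 − δ) and negatives weight δ. Below, a tiny weighted logistic regression on 1-d inputs trained by plain gradient descent stands in for the minimization over Φ; the function, learning rate, and data are all our illustrative choices, not the paper's.

```python
import math

# Weighted ERM sub-algorithm (sketch): minimize the delta-weighted
# logistic loss over linear functions phi(x) = w*x + b by gradient descent.

def weighted_logistic_fit(X, Y, delta, lr=0.5, steps=2000):
    w, b = 0.0, 0.0
    n = len(Y)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(X, Y):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            weight = (1 - delta) if y == 1 else delta   # the l_delta weights
            gw += weight * (p - y) * x / n
            gb += weight * (p - y) / n
        w -= lr * gw
        b -= lr * gb
    return lambda x: 1 if w * x + b >= 0 else 0   # sign(phi_delta(x))

# Usage: smaller delta up-weights positives, shifting the boundary;
# delta = 0.5 recovers ordinary (unweighted) logistic regression.
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
Y = [0, 0, 0, 1, 1, 1]
clf = weighted_logistic_fit(X, Y, delta=0.5)
```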
Consistency of Empirical Estimation (1/2)
For consistency w.r.t. the metric L, we need the estimated θ̂ to satisfy L* − L(θ̂) →_p 0.

Theorem (Uniform convergence of L_n). Consider the function class of all thresholded decisions Θ = {I(φ(x) ≥ δ) : δ ∈ (0, 1)} for a [0, 1]-valued function φ : X → [0, 1]. For sufficiently large n (a function of the constants associated with L, ε, and ρ), with probability at least 1 − ρ:

sup_{θ ∈ Θ} |L_n(θ) − L(θ)| < ε.
Consistency of Empirical Estimation (2/2)
Main Result 2. If the estimate η̂(x) satisfies η̂(x) →_p η(x), then Algorithm 1 is L-consistent.
Main Result 3. Let ℓ : R → [0, ∞) be a classification-calibrated convex (margin) loss, and let ℓ_δ be the corresponding weighted loss for a given δ used in Algorithm 2. Then Algorithm 2 is L-consistent.
Experimental Results
Evaluate Algorithms 1 and 2 on two metrics: F_1 and Weighted Accuracy = 2(TP + TN) / (2(TP + TN) + FP + FN). Compare the two algorithms with standard ERM (regularized logistic regression) on the datasets listed below:
1. Letters: 26 classes (English alphabet), 20,000 instances
2. Scene: 6 classes (scene types), 2,230 images
3. Web Page: 2 classes (spam/non-spam), 34,780 pages
4. Image: 2 classes, 2,068 images
5. Spambase: 2 classes (spam/non-spam), 4,601 emails
Experimental Results: F 1
Experimental Results: Weighted Accuracy
Open Problems & Future Directions
There exist other utility metrics that are not in our family but have similar thresholded optimal classifiers (check out the poster!). This raises the question: identify/characterize the entire family of utility metrics with simple optimal decision functions.
Develop a surrogate theory for L.
Obtain convergence rates for L(θ̂) →_p L(θ*) as θ̂ →_p θ*.
Multi-label classification setting: the definition of L can be extended in more than one way. Do similar results hold there?