AUC Optimization vs. Error Rate Minimization

Corinna Cortes and Mehryar Mohri
AT&T Labs Research
180 Park Avenue, Florham Park, NJ 07932, USA
{corinna, mohri}@research.att.com

Abstract

The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost on several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC.

1 Motivation

In many applications, the overall classification error rate is not the most pertinent performance measure; criteria such as ordering or ranking seem more appropriate. Consider for example the list of relevant documents returned by a search engine for a specific query. That list may contain several thousand documents, but, in practice, only the top fifty or so are examined by the user. Thus, a search engine's ranking of the documents is more critical than the accuracy of its classification of all documents as relevant or not.
More generally, for a binary classifier assigning a real-valued score to each object, a better correlation between output scores and the probability of correct classification is highly desirable. A natural criterion or summary statistic often used to measure the ranking quality of a classifier is the area under an ROC curve (AUC) [8].1 However, the objective function optimized by most classification algorithms is the error rate and not the AUC. Recently, several algorithms have been proposed for maximizing the AUC value locally [4] or maximizing some approximations of the global AUC value [9, 15], but, in general, these algorithms do not obtain AUC values significantly better than those obtained by an algorithm designed to minimize the error rates. Thus, it is important to determine the relationship between the AUC values and the error rate.

This author's new address is: Google Labs, 1440 Broadway, New York, NY 10018, corinna@google.com.

1 The AUC value is equivalent to the Wilcoxon-Mann-Whitney statistic [8] and closely related to the Gini index [1]. It has been re-invented under the name of L-measure by [11], as already pointed out by [2], and slightly modified under the name of Linear Ranking by [13, 14].
Figure 1: An example of ROC curve, with AUC = 0.718. The curve plots the true positive rate, (correctly classified positive)/(total positive), against the false positive rate, (incorrectly classified negative)/(total negative). The line connecting (0, 0) and (1, 1), corresponding to random classification, is drawn for reference. The true positive (negative) rate is sometimes referred to as the sensitivity (resp. specificity) in this context.

In the following sections, we give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost on several datasets and demonstrate the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC.

2 Definition and properties of the AUC

The Receiver Operating Characteristics (ROC) curves were originally developed in signal detection theory [3] in connection with radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [12, 10, 4, 9, 15, 2]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. Fig. (1) shows an example of ROC curve.
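The threshold sweep just described can be sketched in a few lines of Python (our own illustration; the function name, scores, and labels are invented for the example):

```python
def roc_points(scores, labels):
    """ROC-curve points obtained by sweeping the classification threshold
    from the most positive score to the most negative."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    # Sort examples by decreasing score; each prefix of this ordering is the
    # set classified as positive for some threshold value.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i in order:
        if labels[i] == 1:
            tp += 1   # one more correctly classified positive
        else:
            fp += 1   # one more incorrectly classified negative
        points.append((fp / neg, tp / pos))  # (false positive rate, true positive rate)
    return points

points = roc_points([0.9, 0.8, 0.7, 0.4, 0.3], [1, 0, 1, 1, 0])
```

The sweep always starts at (0, 0) and ends at (1, 1); a fully random classifier traces the diagonal between these two points.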
The AUC is defined as the area under the ROC curve and is closely related to the ranking quality of the classification, as shown more formally by Lemma 1 below. Consider a binary classification task with m positive examples and n negative examples. We will assume that a classifier outputs a strictly ordered list for these examples and will denote by 1_X the indicator function of a set X.

Lemma 1 ([8]) Let c be a fixed classifier. Let x_1, ..., x_m be the output of c on the positive examples and y_1, ..., y_n its output on the negative examples. Then, the AUC, A, associated to c is given by:

    A = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} 1_{x_i > y_j}}{mn}    (1)

that is, the value of the Wilcoxon-Mann-Whitney statistic [8].

Proof. The proof is based on the observation that the AUC value is exactly the probability P(X > Y) where X is the random variable corresponding to the distribution of the outputs for the positive examples and Y the one corresponding to the negative examples [7]. The Wilcoxon-Mann-Whitney statistic is clearly the expression of that probability in the discrete case, which proves the lemma [8].

Thus, the AUC can be viewed as a measure based on pairwise comparisons between classifications of the two classes. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC.

2 An attempt in that direction was made by [15], but, unfortunately, the authors' analysis and the result are both wrong.
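Lemma 1 translates directly into code. Assuming, as the lemma does, strictly ordered outputs (no ties), the AUC is the fraction of correctly ordered positive/negative pairs (a minimal sketch; the function name is ours):

```python
def auc_wmw(pos_scores, neg_scores):
    """AUC as the Wilcoxon-Mann-Whitney statistic of Eq. (1):
    the fraction of pairs (x_i, y_j) with x_i > y_j."""
    m, n = len(pos_scores), len(neg_scores)
    correct = sum(1 for x in pos_scores for y in neg_scores if x > y)
    return correct / (m * n)

# Perfect ranking: every positive scores above every negative, so A = 1.
assert auc_wmw([0.9, 0.8], [0.3, 0.1]) == 1.0
# One inverted pair out of m*n = 4 pairs, so A = 3/4.
assert auc_wmw([0.9, 0.2], [0.3, 0.1]) == 0.75
```

The quadratic pairwise loop is the literal transcription of Eq. (1); for large samples one would instead sort once and count inversions, but the value computed is the same.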
Figure 2: For a fixed number of errors k, there may be x, 0 ≤ x ≤ k, false positive examples: above the threshold θ, m − (k − x) positive examples and x negative examples; below it, n − x negative examples and k − x positive examples.

3 The Expected Value of the AUC

In this section, we compute exactly the expected value of the AUC over all classifications with a fixed number of errors and compare that to the error rate. Different classifiers may have the same error rate but different AUC values. Indeed, for a given classification threshold θ, an arbitrary reordering of the examples with outputs more than θ clearly does not affect the error rate but leads to different AUC values. Similarly, one may reorder the examples with output less than θ without changing the error rate.

Assume that the number of errors k is fixed. We wish to compute the average value of the AUC over all classifications with k errors. Our model is based on the simple assumption that all classifications or rankings with k errors are equiprobable. One could perhaps argue that errors are not necessarily evenly distributed, e.g., examples with very high or very low ranks are less likely to be errors, but we cannot justify such biases in general.

For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are k − x false negative examples. Figure 2 shows the corresponding configuration. The two regions of examples with classification outputs above and below the threshold are separated by a vertical line. For a given x, the computation of the AUC, A_x, as given by Eq. (1) can be divided into the following three parts:

    A_x = \frac{A_1 + A_2 + A_3}{mn},    (2)

with A_1 the sum over all pairs (x_i, y_j) with x_i and y_j in distinct regions; A_2 the sum over all pairs (x_i, y_j) with x_i and y_j in the region above the threshold; and A_3 the sum over all pairs (x_i, y_j) with x_i and y_j in the region below the threshold. The first term, A_1, is easy to compute.
Since there are m − (k − x) positive examples above the threshold and n − x negative examples below the threshold, A_1 is given by:

    A_1 = (m − (k − x))(n − x)    (3)

To compute A_2, we can assign to each negative example above the threshold a position based on its classification rank. Let position one be the first position above the threshold and let α_1 < ... < α_x denote the positions in increasing order of the x negative examples in the region above the threshold. The total number of examples classified as positive is N = m − (k − x) + x. Thus, by definition of A_2,

    A_2 = \sum_{i=1}^{x} (N − α_i) − (x − i)    (4)

where the first term, N − α_i, represents the number of examples ranked higher than the i-th example and the second term, x − i, discounts the number of negative examples incorrectly ranked higher than the i-th example. Similarly, let α'_1 < ... < α'_{k−x} denote the positions of the k − x positive examples below the threshold, counting positions in reverse by starting from the threshold. Then, A_3 is given by:

    A_3 = \sum_{j=1}^{k−x} (N' − α'_j) − ((k − x) − j)    (5)

with N' = n − x + (k − x). Combining the expressions of A_1, A_2, and A_3 leads to:

    A_x = \frac{A_1 + A_2 + A_3}{mn} = 1 + \frac{((k − 2x)^2 + k)/2 − \sum_{i=1}^{x} α_i − \sum_{j=1}^{k−x} α'_j}{mn}    (6)
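The decomposition of Eqs. (3)-(6) can be checked against direct pair counting. The sketch below (our own test harness, not part of the paper) draws a random configuration with x false positives and k − x false negatives and compares the two computations:

```python
import random

def auc_direct(labels):
    """AUC by direct pair counting over a strictly ordered list of labels
    (index 0 = highest score); 1 marks a positive example."""
    m = sum(labels)
    n = len(labels) - m
    good = sum(1 for i in range(len(labels)) for j in range(i + 1, len(labels))
               if labels[i] == 1 and labels[j] == 0)
    return good / (m * n)

def auc_eq6(m, n, k, x, alphas, alphas_p):
    """AUC from Eq. (6): mn*A_x = mn + ((k-2x)^2 + k)/2 - sum(alpha) - sum(alpha')."""
    D = m * n + ((k - 2 * x) ** 2 + k) / 2
    return (D - sum(alphas) - sum(alphas_p)) / (m * n)

random.seed(0)
m, n, k, x = 5, 4, 3, 2
N = m - (k - x) + x    # number of examples classified as positive
Np = n - x + (k - x)   # number of examples classified as negative
# Positions of the x negatives above the threshold, counted upward from it,
# and of the k - x positives below it, counted downward from it.
alphas = sorted(random.sample(range(1, N + 1), x))
alphas_p = sorted(random.sample(range(1, Np + 1), k - x))
# Build the label list in decreasing order of score implied by the configuration.
above = [0 if p in alphas else 1 for p in range(N, 0, -1)]
below = [1 if p in alphas_p else 0 for p in range(1, Np + 1)]
labels = above + below
assert abs(auc_direct(labels) - auc_eq6(m, n, k, x, alphas, alphas_p)) < 1e-12
```

Note the orientation of the positions: position 1 is adjacent to the threshold in both regions, which is why the above-threshold region is emitted in reverse when the descending-score list is built.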
Lemma 2 For a fixed x, the average value of the AUC, A_x, is given by:

    ⟨A_x⟩ = 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k − x}{m} \right)    (7)

Proof. The proof is based on the computation of the average values of \sum_{i=1}^{x} α_i and \sum_{j=1}^{k−x} α'_j for a given x. We start by computing the average value ⟨α_i⟩ for a given i, 1 ≤ i ≤ x. Consider all the possible positions for α_1, ..., α_{i−1} and α_{i+1}, ..., α_x, when the value of α_i is fixed at, say, α_i = l. We have i ≤ l ≤ N − (x − i), since there need to be at least i − 1 positions before α_i and x − i above. There are \binom{l − 1}{i − 1} possible positions for α_1, ..., α_{i−1} and \binom{N − l}{x − i} possible positions for α_{i+1}, ..., α_x. Since the total number of ways of choosing the positions for α_1, ..., α_x out of N is \binom{N}{x}, the average value ⟨α_i⟩ is:

    ⟨α_i⟩ = \frac{\sum_{l=i}^{N−(x−i)} l \binom{l − 1}{i − 1} \binom{N − l}{x − i}}{\binom{N}{x}}    (8)

Thus,

    \sum_{i=1}^{x} ⟨α_i⟩ = \frac{\sum_{i=1}^{x} \sum_{l=i}^{N−(x−i)} l \binom{l − 1}{i − 1} \binom{N − l}{x − i}}{\binom{N}{x}} = \frac{\sum_{l=1}^{N} l \sum_{i=1}^{x} \binom{l − 1}{i − 1} \binom{N − l}{x − i}}{\binom{N}{x}}    (9)

Using the classical identity \sum_{p_1 + p_2 = p} \binom{u}{p_1} \binom{v}{p_2} = \binom{u + v}{p}, we can write:

    \sum_{i=1}^{x} ⟨α_i⟩ = \frac{\sum_{l=1}^{N} l \binom{N − 1}{x − 1}}{\binom{N}{x}} = \frac{N(N + 1)}{2} \cdot \frac{\binom{N − 1}{x − 1}}{\binom{N}{x}} = \frac{x(N + 1)}{2}    (10)

Similarly, we have:

    \sum_{j=1}^{k−x} ⟨α'_j⟩ = \frac{(k − x)(N' + 1)}{2}    (11)

Replacing \sum_{i=1}^{x} α_i and \sum_{j=1}^{k−x} α'_j in Eq. (6) by the average values given by Eq. (10) and Eq. (11) leads to:

    ⟨A_x⟩ = 1 + \frac{((k − 2x)^2 + k)/2 − x(N + 1)/2 − (k − x)(N' + 1)/2}{mn} = 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k − x}{m} \right)    (12)

which ends the proof of the lemma. Note that Eq. (7) shows that the average AUC value for a given x is simply one minus the average of the error rates for the positive and negative classes.

Proposition 1 Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC, A, over all classifications with k errors is given by:

    ⟨A⟩ = 1 − \frac{k}{m + n} − \frac{(n − m)^2 (m + n + 1)}{4mn} \left( \frac{k}{m + n} − \frac{\sum_{x=0}^{k−1} \binom{m+n}{x}}{\sum_{x=0}^{k} \binom{m+n+1}{x}} \right)    (13)

Proof. Lemma 2 gives the average value of the AUC for a fixed value of x. To compute the average over all possible values of x, we need to weight the expression of Eq. (7) with the total number of possible classifications for a given x. There are \binom{N}{x} possible ways of choosing the positions of the x misclassified negative examples, and similarly \binom{N'}{k−x} possible ways of choosing the positions of the k − x misclassified positive examples.
Thus, in view of Lea, the average AUC is given by: < A > N ( (1 k 0 k 0 n + k (7 (10 (11 (1 (13 (14
Figure 3: Mean (left) and relative standard deviation (right) of the AUC as a function of the error rate, for fixed ratios r = n/(n + m) of 0.01, 0.05, 0.1, 0.25, and 0.5. The average AUC value monotonically increases with the accuracy. For n = m, as for the top curve in the left plot, the average AUC coincides with the accuracy. The standard deviation decreases with the accuracy, and the lowest curve corresponds to n = m.

This expression can be simplified into Eq. (13)3 using the following novel identities:

    \sum_{x=0}^{k} \binom{N}{x} \binom{N'}{k−x} = \sum_{x=0}^{k} \binom{m+n+1}{x}    (15)

    \sum_{x=0}^{k} \binom{N}{x} \binom{N'}{k−x} (xm + (k−x)n) = \frac{k(4mn + (n−m)^2(m+n+1))}{2(m+n)} \sum_{x=0}^{k} \binom{m+n+1}{x} − \frac{(n−m)^2(m+n+1)}{2} \sum_{x=0}^{k−1} \binom{m+n}{x}    (16)

that we obtained by using Zeilberger's algorithm4 and numerous combinatorial tricks.

From the expression of Eq. (13), it is clear that the average AUC value is identical to the accuracy of the classifier only for even distributions (n = m). For n ≠ m, the expected value of the AUC is a monotonic function of the accuracy, see Fig. (3)(left). For a fixed ratio n/(n + m), the curves are obtained by increasing the accuracy from n/(n + m) to 1. The average AUC varies monotonically in the range of accuracy between 0.5 and 1.0. In other words, on average, there seems to be nothing to be gained in designing specific learning algorithms for maximizing the AUC: a classification algorithm minimizing the error rate also optimizes the AUC. However, this only holds for the average AUC. Indeed, we will show in the next section that the variance of the AUC value is not null for any ratio n/(n + m) when k ≠ 0.

4 The Variance of the AUC

Let D = mn + ((k − 2x)^2 + k)/2, a = \sum_{i=1}^{x} α_i, a' = \sum_{j=1}^{k−x} α'_j, and α = a + a'. Then, by Eq. (6), mnA_x = D − α.
Thus, the variance of the AUC, σ^2(A), is given by:

    (mn)^2 σ^2(A) = ⟨(D − α)^2⟩ − (⟨D⟩ − ⟨α⟩)^2
                  = ⟨D^2⟩ − ⟨D⟩^2 + ⟨α^2⟩ − ⟨α⟩^2 − 2(⟨αD⟩ − ⟨α⟩⟨D⟩)    (17)

As before, to compute the average of a term X over all classifications, we can first determine its average ⟨X⟩_x for a fixed x, and then use the function F defined by:

    F(Y) = \frac{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{k−x} Y}{\sum_{x=0}^{k} \binom{N}{x} \binom{N'}{k−x}}    (18)

and ⟨X⟩ = F(⟨X⟩_x). A crucial step in computing the exact value of the variance of the AUC is to determine the value of the terms of the type ⟨a^2⟩_x = ⟨(\sum_{i=1}^{x} α_i)^2⟩_x.

3 An essential difference between Eq. (14) and the expression given by [15] is the weighting by the number of configurations. The authors' analysis leads them to the conclusion that the average AUC is identical to the accuracy for all ratios n/(n + m), which is false.

4 We thank Neil Sloane for having pointed us to Zeilberger's algorithm and Maple package.
Lemma 3 For a fixed x, the average of a^2 is given by:

    ⟨a^2⟩_x = \frac{x(N + 1)}{12} (3Nx + 2x + N)    (19)

Proof. By definition of a, ⟨a^2⟩_x = b + 2c with:

    b = \sum_{i=1}^{x} ⟨α_i^2⟩    and    c = \sum_{1 ≤ i < j ≤ x} ⟨α_i α_j⟩    (20)

Reasoning as in the proof of Lemma 2, we can obtain:

    b = \frac{\sum_{i=1}^{x} \sum_{l=i}^{N−(x−i)} l^2 \binom{l − 1}{i − 1} \binom{N − l}{x − i}}{\binom{N}{x}} = \frac{\sum_{l=1}^{N} l^2 \binom{N − 1}{x − 1}}{\binom{N}{x}} = \frac{x(N + 1)(2N + 1)}{6}    (21)

To compute c, we start by computing the average value of ⟨α_i α_j⟩, for a given pair (i, j) with i < j. As in the proof of Lemma 2, consider all the possible positions of α_1, ..., α_{i−1}, α_{i+1}, ..., α_{j−1}, and α_{j+1}, ..., α_x when α_i is fixed at α_i = l and α_j is fixed at α_j = l'. There are \binom{l − 1}{i − 1} possible positions for α_1, ..., α_{i−1}, \binom{l' − l − 1}{j − i − 1} possible positions for α_{i+1}, ..., α_{j−1}, and \binom{N − l'}{x − j} possible positions for α_{j+1}, ..., α_x. Thus, we have:

    ⟨α_i α_j⟩ = \frac{\sum_{l < l'} l l' \binom{l − 1}{i − 1} \binom{l' − l − 1}{j − i − 1} \binom{N − l'}{x − j}}{\binom{N}{x}}    (22)

and

    c = \frac{\sum_{l < l'} l l' \sum_{p_1 + p_2 + p_3 = x − 2} \binom{l − 1}{p_1} \binom{l' − l − 1}{p_2} \binom{N − l'}{p_3}}{\binom{N}{x}}    (23)

Using the identity \sum_{p_1 + p_2 + p_3 = x − 2} \binom{l − 1}{p_1} \binom{l' − l − 1}{p_2} \binom{N − l'}{p_3} = \binom{N − 2}{x − 2}, we obtain:

    c = \frac{x(x − 1)(N + 1)(3N + 2)}{24}    (24)

Combining Eq. (21) and Eq. (24) leads to Eq. (19).

Proposition 2 Assume that a binary classification task with m positive examples and n negative examples is given. Then, the variance of the AUC, A, over all classifications with k errors is given by:

    σ^2(A) = F\left( \left( 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k−x}{m} \right) \right)^2 \right) − F\left( 1 − \frac{1}{2} \left( \frac{x}{n} + \frac{k−x}{m} \right) \right)^2 + F\left( \frac{mx^2 + n(k−x)^2 + (m(m+1)x + n(n+1)(k−x)) − 2x(k−x)(m+n+1)}{12 m^2 n^2} \right)    (25)

Proof. Eq. (17) can be developed and expressed in terms of F, D, a, and a':

    (mn)^2 σ^2(A) = F([D − ⟨a + a'⟩_x]^2) − F(D − ⟨a + a'⟩_x)^2 + F(⟨a^2⟩_x − ⟨a⟩_x^2) + F(⟨a'^2⟩_x − ⟨a'⟩_x^2)    (26)

The expressions for ⟨a⟩_x and ⟨a'⟩_x were given in the proof of Lemma 2, and that of ⟨a^2⟩_x by Lemma 3. The following formula can be obtained in a similar way: ⟨a'^2⟩_x = \frac{(k−x)(N' + 1)}{12} (3N'(k−x) + 2(k−x) + N'). Replacing these expressions in Eq. (26) and further simplifications give exactly Eq. (25) and prove the proposition.

The expression of the variance is illustrated by Fig. (3)(right), which shows the value of one standard deviation of the AUC divided by the corresponding mean value of the AUC. This figure is parallel to the one showing the mean of the AUC (Fig. (3)(left)).
Each line is obtained by fixing the ratio n/(n + m) and varying the number of errors from 1 to the size of the smallest class. The more uneven class distributions have the highest variance, and the variance increases with the number of errors. These observations contradict the inexact claim of [15] that the variance is zero for all error rates with even distributions n = m. In Fig. (3)(right), the even distribution n = m corresponds to the lowest dashed line.
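The mean and variance formulas can both be sanity-checked by exhaustively enumerating every equiprobable classification with k errors, as in Section 3 (a small test harness of our own; `all_aucs` and `var_auc` are our names):

```python
from itertools import combinations
from math import comb

def all_aucs(m, n, k):
    """AUC of every equiprobable classification with k errors."""
    aucs = []
    for x in range(k + 1):
        if x > n or k - x > m:
            continue
        N, Np = m - (k - x) + x, n - x + (k - x)
        for neg_above in combinations(range(N), x):
            for pos_below in combinations(range(Np), k - x):
                labels = ([0 if p in neg_above else 1 for p in range(N)]
                          + [1 if p in pos_below else 0 for p in range(Np)])
                good = sum(1 for i in range(m + n) for j in range(i + 1, m + n)
                           if labels[i] == 1 and labels[j] == 0)
                aucs.append(good / (m * n))
    return aucs

def var_auc(m, n, k):
    """Variance of the AUC over all classifications with k errors, Eq. (25)."""
    w = [comb(m - (k - x) + x, x) * comb(n - x + (k - x), k - x)
         for x in range(k + 1)]
    F = lambda g: sum(wx * g(x) for x, wx in enumerate(w)) / sum(w)  # Eq. (18)
    mean_ax = lambda x: 1 - x / (2 * n) - (k - x) / (2 * m)          # Eq. (7)
    inner = lambda x: (m * x**2 + n * (k - x)**2
                       + m * (m + 1) * x + n * (n + 1) * (k - x)
                       - 2 * x * (k - x) * (m + n + 1)) / (12 * m**2 * n**2)
    return F(lambda x: mean_ax(x)**2) - F(mean_ax)**2 + F(inner)

for m, n, k in [(2, 1, 1), (3, 5, 2), (4, 4, 3)]:
    aucs = all_aucs(m, n, k)
    mean = sum(aucs) / len(aucs)
    emp_var = sum((a - mean)**2 for a in aucs) / len(aucs)
    assert abs(emp_var - var_auc(m, n, k)) < 1e-12
```

In particular, the case m = n = 4, k = 3 gives a strictly positive variance, matching the observation above that the variance is non-null for even distributions as soon as k ≠ 0.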
                            n/(n+m)   AUCsplit [4]                 RankBoost
Dataset      Size  # Attr.  (%)       Accuracy (%)  AUC (%)        Accuracy (%)  AUC (%)
Breast-Wpbc  194   33       23.7      69.5 ± 10.6   59.3 ± 16.2    65.5 ± 13.8   80.4 ± 8.0
Credit       653   15       45.3      -             -              81.0 ± 7.4    94.5 ± 2.9
Ionosphere   351   34       35.9      89.6 ± 5.0    89.7 ± 6.7     83.6 ± 10.9   98.0 ± 3.3
Pima         768   8        34.9      72.5 ± 5.1    76.7 ± 6.0     69.7 ± 7.6    84.8 ± 6.5
SPECTF       269   43       20.4      -             -              67.3          93.4
Page-blocks  5473  10       10.2      96.8 ± 0.2    95.1 ± 6.9     92.0 ± 2.5    98.5 ± 1.5
Yeast (CYT)  1484  8        31.2      71.1 ± 3.6    73.3 ± 4.0     45.3 ± 3.8    78.5 ± 3.0

Table 1: Accuracy and AUC values for several datasets from the UC Irvine repository. The values for RankBoost are obtained by 10-fold cross-validation. The values for AUCsplit are from [4].

5 Experimental Results

Proposition 2 above demonstrates that, for uneven distributions, classifiers with the same fixed (low) accuracy exhibit noticeably different AUC values. This motivates the use of algorithms directly optimizing the AUC rather than doing so indirectly via minimizing the error rate. Under certain conditions, RankBoost [5] can be viewed exactly as an algorithm optimizing the AUC. In this section, we make the connection between RankBoost and AUC optimization, and compare the performance of RankBoost to two recent algorithms proposed for optimizing an approximation of the AUC [15] or locally optimizing the AUC [4].

The objective of RankBoost is to produce a ranking that minimizes the number of incorrectly ordered pairs of examples, possibly with different costs assigned to the mis-rankings. When the examples to be ranked are simply two disjoint sets, the objective function minimized by RankBoost is

    rloss = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} 1_{x_i ≤ y_j}    (27)

which is exactly one minus the Wilcoxon-Mann-Whitney statistic. Thus, by Lemma 1, the objective function maximized by RankBoost coincides with the AUC.

RankBoost's optimization is based on combining a number of weak rankings. For our experiments, we chose as weak rankings threshold rankers with the range {0, 1}, similar to the boosted stumps often used by AdaBoost [6].
We used the so-called Third Method of RankBoost for selecting the best weak ranker. According to this method, at each step, the weak threshold ranker is selected so as to maximize the AUC of the weighted distribution. Thus, with this method, the global objective of obtaining the best AUC is pursued by selecting the weak ranking with the best AUC at each step. Furthermore, the RankBoost algorithm maintains a perfect 50-50% distribution of the weights on the positive and negative examples. By Proposition 1, for even distributions, the mean of the AUC is identical to the classification accuracy. For threshold rankers like step functions, or stumps, there is no variance of the AUC, so the mean of the AUC is equal to the observed AUC. That is, instead of viewing RankBoost as selecting the weak ranker with the best weighted AUC value, one can view it as selecting the weak ranker with the lowest weighted error rate. This is similar to the choice of the best weak learner for boosted stumps in AdaBoost. So, for stumps, AdaBoost and RankBoost differ only in the updating scheme of the weights: RankBoost updates the positive examples differently from the negative ones, while AdaBoost uses one common scheme for the two groups.

Our experimental results corroborate the observation that RankBoost is an algorithm optimizing the AUC. RankBoost based on boosted stumps obtains AUC values that are substantially better than those reported in the literature for algorithms designed to locally or approximately optimize the AUC. Table 1 compares the results of RankBoost on a number of datasets from the UC Irvine repository to the results reported by [4]. The results for RankBoost are obtained by 10-fold cross-validation. For RankBoost, the accuracy and the best AUC values reported on each line of the table correspond to the same boosting step. RankBoost consistently outperforms AUCsplit in a comparison based on AUC values, even for datasets such as Breast-Wpbc and Pima where the two algorithms obtain similar accuracies.
The table also lists results for the UC Irvine Credit Approval and SPECTF heart datasets, for which the authors of [15] report results corresponding to their AUC optimization algorithms. The AUC values reported by [15] are no better than 92.5% for the Credit
Approval dataset and only 87.5% for the SPECTF dataset, which is substantially lower. From the table, it is also clear that RankBoost is not an error rate minimization algorithm. The accuracy for the Yeast (CYT) dataset is as low as 45%.

6 Conclusion

A statistical analysis of the relationship between the AUC value and the error rate was given, including the first exact expression of the expected value and standard deviation of the AUC for a fixed error rate. The results offer a better understanding of the effect on the AUC value of algorithms designed for error rate minimization. For uneven distributions and relatively high error rates, the standard deviation of the AUC suggests that algorithms designed to optimize the AUC value may lead to substantially better AUC values. Our experimental results using RankBoost corroborate this claim. In separate experiments we have observed that AdaBoost achieves significantly better error rates than RankBoost (as expected) but that it also leads to AUC values close to those achieved by RankBoost. It is a topic for further study to explain and understand this property of AdaBoost. A partial explanation could be that, just like RankBoost, AdaBoost maintains at each boosting round an equal distribution of the weights for positive and negative examples.

References

[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth International, Belmont, CA, 1984.

[2] J-H. Chauchat, R. Rakotomalala, M. Carloz, and C. Pelletier. Targeting customer groups using gain and cost matrix; a marketing application. Technical report, ERIC Laboratory - University of Lyon, 2001.

[3] J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.

[4] C. Ferri, P. Flach, and J. Hernández-Orallo. Learning decision trees using the area under the ROC curve. In ICML-2002. Morgan Kaufmann, 2002.

[5] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. In ICML-98.
Morgan Kaufmann, San Francisco, US, 1998.

[6] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory, volume 2, 1995.

[7] D. M. Green and J. A. Swets. Signal detection theory and psychophysics. New York: Wiley, 1966.

[8] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 1982.

[9] M. C. Mozer, R. Dodier, M. D. Colagrosso, C. Guerra-Salcedo, and R. Wolniewicz. Prodding the ROC curve. In NIPS-2002. MIT Press, 2002.

[10] C. Perlich, F. Provost, and J. Simonoff. Tree induction vs. logistic regression: A learning curve analysis. Journal of Machine Learning Research, 2003.

[11] G. Piatetsky-Shapiro and S. Steingold. Measuring lift quality in database marketing. In SIGKDD Explorations. ACM SIGKDD, 2000.

[12] F. Provost and T. Fawcett. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distribution. In KDD-97. AAAI, 1997.

[13] S. Rosset. Ranking-methods for flexible evaluation and efficient comparison of 2-class models. Master's thesis, Tel-Aviv University, 1999.

[14] S. Rosset, E. Neumann, U. Eick, N. Vatnik, and I. Idan. Evaluation of prediction models for marketing campaigns. In KDD-2001. ACM Press, 2001.

[15] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz. Optimizing Classifier Performance Via the Wilcoxon-Mann-Whitney Statistics. In ICML-2003, 2003.