Optimizing Area Under the ROC Curve using Ranking SVMs


Kaan Ataman
Department of Management Sciences
The University of Iowa

W. Nick Street
Department of Management Sciences
The University of Iowa

ABSTRACT

Area Under the ROC Curve (AUC), often used for comparing classifiers, is a widely accepted performance measure for ranking instances. Many researchers have studied optimization of AUC, usually via optimizing some approximation of a ranking function. Ranking SVMs are among the better performers, but their usage in the literature is typically limited to learning a total ranking from partial rankings. In this paper, we observe that a ranking SVM is in fact a direct optimization of AUC via optimizing the Wilcoxon-Mann-Whitney statistic. We compare a linear ranking SVM with some well-known linear classifiers, such as linear SVM, perceptron and logit, in the context of binary classification. We show that an SVM optimized for ranking not only achieves better AUC than other linear classifiers on average, but also performs comparably in accuracy.

Keywords: AUC, ranking, SVM, optimization

1. INTRODUCTION

Many real-world classification problems require an ordering rather than a simple classification. For example, in direct mail marketing, the marketer may want to know which potential customers to target with new catalogs so as to maximize revenue. If the number of catalogs the marketer can send is limited, then it is not enough to know who is likely to buy a product after receiving a catalog; it is more useful to distinguish the top n percent of potential customers with the highest likelihood of buying. By doing so, the marketer can more intelligently target a subset of her customer base, increasing the expected profit. A slightly different example is collaborative book recommendation. To recommend a book intelligently, an algorithm must exploit the similarities in book preferences among people who have read and submitted a score for the book. Combining these scores, it is possible to return a recommendation list to the user, ideally ranked so that the book at the top of the list is expected to appeal most to the user. There is a subtle difference between these two ranking problems: in the marketing problem the ranking is based on binary input (purchase, non-purchase), whereas in the book recommendation problem the ranking is based on a collection of partial rankings. In this paper we consider the problem of ranking based on binary input.

Ranking is a popular machine learning problem that has been addressed extensively in the literature [3, 4, 19]. The most commonly used measure of a classifier's ability to rank instances in binary classification problems is the area under the ROC curve. ROC curves are widely used for visualization and comparison of the performance of binary classifiers. ROC curves were originally used in signal detection theory; they were introduced to the machine learning community by Spackman in 1989, who showed that ROC curves can be used for evaluation and comparison of algorithms [17]. An ROC curve plots true positive rate against false positive rate as the decision threshold (on the estimated probability of class membership, or score) is varied. In ROC space, the upper left corner represents perfect classification, while the diagonal represents random classification. A point in ROC space that lies to the upper left of another point represents a better classifier.
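To make the construction of these curves concrete, the following minimal sketch (our illustration, not code from this paper) traces ROC points from classifier scores by sweeping the threshold down the sorted score list; tied scores would need a small refinement.

```python
import numpy as np

def roc_points(scores, labels):
    """Trace ROC points for binary labels (1 = positive, 0 = negative):
    sweep the decision threshold down the sorted scores and record the
    (false positive rate, true positive rate) reached at each step."""
    order = np.argsort(-np.asarray(scores, dtype=float))  # descending by score
    y = np.asarray(labels)[order]
    tpr = np.cumsum(y == 1) / max((y == 1).sum(), 1)  # true positive rate
    fpr = np.cumsum(y == 0) / max((y == 0).sum(), 1)  # false positive rate
    # Prepend the (0, 0) corner where nothing is predicted positive.
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

# Example: a perfect ranker's curve passes through the upper left corner (0, 1).
fpr, tpr = roc_points([0.9, 0.8, 0.4, 0.5, 0.2], [1, 1, 0, 1, 0])
```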
Some classifiers, such as neural networks or naive Bayes, naturally provide such probability estimates, while others, such as decision trees, do not. It is still possible to produce ROC curves for any type of classifier using minor adjustments [5]. The area under the ROC curve (AUC) summarizes the curve as a single scalar value for classifier comparison [2, 9]. Statistically speaking, the AUC of a classifier is the probability that it will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Since AUC is a probability, its value varies between 0 and 1, where 1 means all positives are ranked higher than all negatives. Larger AUC values indicate better classifier performance across the full range of possible thresholds. Even though a classifier with high AUC can be outperformed by a lower-AUC classifier in some region of the ROC space, in general the high-AUC classifier is better.
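This pairwise interpretation gives a direct way to estimate AUC from scores: count the correctly ordered positive-negative pairs. A minimal sketch follows; counting tied scores as half a correct pair is a common convention, not something specified here.

```python
import numpy as np

def wmw_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) score pairs in which
    the positive is ranked above the negative; ties count as half."""
    diff = (np.asarray(scores_pos, dtype=float)[:, None]
            - np.asarray(scores_neg, dtype=float)[None, :])
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return correct / diff.size

# Example: 5 of the 6 pairs are ordered correctly, so AUC = 5/6.
print(wmw_auc([0.9, 0.8, 0.4], [0.5, 0.2]))
```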

AUC can also be represented as the ratio of the number of correct pairwise rankings to the number of all possible pairs [8]. This quantity is called the Wilcoxon-Mann-Whitney (WMW) statistic [13, 20]:

$$W = \frac{\sum_{i=1}^{p}\sum_{j=1}^{n} I(x_i, y_j)}{pn}, \qquad I(x_i, y_j) = \begin{cases} 1 & \text{if } x_i > y_j \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

where p is the number of positive points, n is the number of negative points, x_i is the i-th positive point and y_j is the j-th negative point. Note that in the binary classification problem we have no prior knowledge of intra-class rankings.

Mainstream classifiers are usually designed to minimize classification error, so they are not specifically designed to rank well. Support Vector Machines (SVMs) are an exception, since by nature (margin maximization) they turn out to be good rankers. It has been shown that accuracy is not always the best performance measure, especially for datasets with skewed class or cost distributions, as many real-world problems have. The machine learning community has recently explored the question of which performance measure is better in general. Ling, Huang and Zhang showed in a recent study that AUC is a statistically consistent and more discriminating measure than accuracy [12]. Recent research indicates an increase in the popularity of AUC as a performance measure, especially in cases where class priors and/or misclassification costs are unknown.

Previous work has introduced several ways to optimize rankings. Cohen, Schapire and Singer proposed a two-stage approach in which a preference function is first learned and new instances are then ordered to maximize agreement with the learned function [4]. Caruana, Baluja and Mitchell introduced Rankprop, an algorithm that improves rankings by backpropagation with sum-of-squares error on estimated ranks [3]. Vogt and Cottrell introduced d′, an optimization measure for ranking problems, and claim that optimizing it works as well as optimizing exact precision, a popular IR performance measure [19]. Recently, optimizing the area under the ROC curve has become a focal point for researchers. The problem can be framed as optimizing a total ranking either from a given collection of rankings or from 0-1 input, depending on what is available. Freund et al. introduced RankBoost, an algorithm for combining preferences [6]; in the context of collaborative filtering, the algorithm combines users' ranked preferences to create a recommendation list optimally ranked for a similar user. Yan et al. took a direct approach to optimizing AUC: they use a continuous function to approximate the WMW statistic, which is itself not continuous, and optimize this approximation directly [22]. Another recent paper, by Herschtal and Raskutti, introduces an algorithm that optimizes AUC using gradient descent [10]. The authors approximate the WMW statistic with a sigmoid function, making use of its differentiability, and also introduce an efficient way to reduce the number of constraints in the optimization problem, which reduces the overall complexity.
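To illustrate the sigmoid-approximation idea, here is a minimal sketch of a differentiable WMW surrogate in the spirit of [10, 22]; it is our paraphrase of the general approach, not either paper's exact objective, and the sharpness parameter beta is our own choice.

```python
import numpy as np

def smooth_wmw(w, X_pos, X_neg, beta=2.0):
    """Differentiable surrogate for the WMW statistic (1): replace the
    step indicator I(x_i > y_j) with a sigmoid of the pairwise score
    difference, so the objective can be maximized by gradient ascent."""
    diff = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]  # all pairwise score differences
    return np.mean(1.0 / (1.0 + np.exp(-beta * diff)))
```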
Ranking SVMs were introduced by Herbrich et al. [15] and Joachims [11]. To our knowledge, a comparison of ranking SVMs to other baseline classifiers in the binary classification context has not been studied. The closest article we were able to find is the optimization of AUC using SVMs by Rakotomamonjy [16], which shows an application of a quadratic-programming-based algorithm to optimize AUC. Joachims used ranking SVMs in the context of optimizing search engines [11]. Even though his paper does not experiment with binary classification, here we show that his approach can be extended to binary classification problems. In this paper we focus on maximizing the number of correct pairwise rankings, which in turn optimizes the Wilcoxon-Mann-Whitney statistic. In the following section we explain the optimization problem and our experimental setup in detail. This is followed by Section 3, which covers the experimental results and discussion. Finally, we finish the paper with conclusions.

2. AUC OPTIMIZATION

2.1 Primal and Dual Formulations

To directly optimize the WMW statistic, we would like to have all positive points ranked higher than all negative points. We therefore construct a linear program with soft constraints penalizing negatives that are ranked above positives. This requires p (number of positive points) times n (number of negative points) constraints in total. A perfect linear separation of classes is impossible for most real-world datasets, so the LP has to be constructed to minimize some error term corresponding to the number of incorrect orderings. As in the classification case, our formulation avoids combinatorial complexity by minimizing the distance by which such pairs are mis-ordered. The primal form of the LP is:

$$\begin{aligned} \min_{w,v,z}\quad & e^T z + e^T v \\ \text{s.t.}\quad & Cw + z \ge e \\ & -v \le w \le v \\ & v, z \ge 0 \end{aligned} \tag{2}$$

where e is a vector of ones of the appropriate dimension, w represents the attribute weights, v bounds the magnitude of the weights, and z is a vector of error terms corresponding to each pairwise misranking. C represents a matrix of misrankings; more specifically,

$$C_i = A_i - B_i \tag{3}$$

where A_i is a point from the collection of positive points, B_i a point from the collection of negative points, and row C_i constrains the positive point A_i to be ranked higher than the negative point B_i. In the ideal case the C matrix has pn rows.

The result of this LP is a vector w such that the total distance of pairwise mis-rankings is minimized. By finding such a w without an intercept, we are in fact learning an optimal family of parallel planes; this is equivalent to scanning a single separating plane across the dataset parallel to its normal vector and constructing an optimal ROC curve from the series of intercept values.

At this point our LP looks similar to a soft-margin linear SVM [18]. Like an SVM, our LP maximizes the margin by minimizing the magnitude of the weights. The parameter that controls the balance between error and margin is set to one in our case, so that the margin term and the sum of errors are balanced; we observed that, on average, using one was as good as any other value for this coefficient. An SVM optimizes the margin of the relevant points (support vectors) relative to a single separating plane, while the ranking SVM optimizes the margin between each inter-class pair of points along the normal direction of the plane. As such, the number of points influencing the ranking solution is potentially much larger than the number affecting the separating plane.

We use the dual of the LP to reduce the number of constraints, as the soft constraints on w become simple bounds on the dual variables. The dual of the above LP can be arranged as:

$$\begin{aligned} \max_{\alpha,\beta,\gamma}\quad & e^T \alpha \\ \text{s.t.}\quad & C^T \alpha + \beta - \gamma = 0 \\ & \alpha \le 1 \\ & \beta + \gamma \le 1 \\ & \alpha, \beta, \gamma \ge 0 \end{aligned} \tag{4}$$
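For concreteness, here is a minimal dense sketch of the primal LP (2) using scipy.optimize.linprog (our illustration, not the paper's implementation, which uses GLPK). It enumerates all p·n rows of C up front, so it is a toy for small datasets only; Section 2.2 describes the incremental constraint generation used in practice.

```python
import numpy as np
from scipy.optimize import linprog

def rank_lp_weights(X_pos, X_neg):
    """Solve the primal ranking LP (2) over all p*n pairwise rows:
        min  e'z + e'v   s.t.  Cw + z >= e,  -v <= w <= v,  v, z >= 0
    Variable layout: x = [w (d), v (d), z (p*n)]."""
    (p, d), (n, _) = X_pos.shape, X_neg.shape
    m = p * n
    # One row of C per (positive, negative) pair: C_i = A_i - B_i.
    C = (X_pos[:, None, :] - X_neg[None, :, :]).reshape(m, d)
    cost = np.concatenate([np.zeros(d), np.ones(d), np.ones(m)])
    A_ub = np.block([
        [-C,         np.zeros((m, d)), -np.eye(m)],        # Cw + z >= e
        [np.eye(d),  -np.eye(d),       np.zeros((d, m))],  # w - v <= 0
        [-np.eye(d), -np.eye(d),       np.zeros((d, m))],  # -w - v <= 0
    ])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * d)])
    bounds = [(None, None)] * d + [(0, None)] * (d + m)
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d]  # the attribute weights w
```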
2.2 The Algorithm

Ideally we would like to include all possible constraints, so that each positive point is ranked above each negative point. This, however, is computationally expensive, especially for large datasets. For instance, for a balanced dataset with 1000 points, the number of constraints necessary is 500 × 500 = 250,000. This is the worst case for a 1000-point dataset, since the class ratio is 1; for a skewed dataset with a class ratio of 1:100, the required number of constraints would be roughly 10 × 990 ≈ 9,900.

Initializing the weights randomly increases the run time of the algorithm by delaying the convergence of the weights. Instead, we initialize the weights using the solution of a linear SVM. This speeds up the algorithm dramatically, since starting with good weights yields fewer violations at each iteration. Figure 1 outlines the algorithm.

    Initialize the weight vector
    While the number of added violations is greater than k do
        For each positive point ranked below a negative point
            Create a constraint for each misranked positive point
            Create balancing, non-violated constraints
        End for
        Solve the LP and update the weight vector
    End while

Figure 1: Algorithm Outline

Our approach is to adjust the weight vector incrementally. To do this, we create constraints only for the misranked positives, a variation of the well-known column generation technique [7]. At each iteration, each newly misranked pair of points creates a new constraint that is added to the LP. Eventually, all non-slack constraints from the complete problem will be contained in our constraint set, resulting in an optimal solution. In practice, we stop the iteration early if only a few new constraints were added; in the experiments that follow, the parameter k in the algorithm is set to 10. In addition to these violated constraints, we also add constraints to prevent the weights from flipping sign (− to +, or + to −).

Flipping of the sign of the weight vector occurs when we create constraints corresponding only to the violations: with a constraint set consisting of only those pairs of points that were misranked in the previous iteration, an optimal solution can be achieved by simply reversing the signs of the coefficients. This does not minimize the objective for the whole problem, however, since it ranks most of the original points incorrectly. Adding violated constraints together with some balancing, non-violated constraints to avoid this reversal phenomenon dramatically reduces the total number of constraints compared to an LP that lists all possible p × n constraints. It takes only a few iterations for the objective function to converge, and the overall time to run the algorithm is usually much less than that of an LP which solves for all possible constraints.
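The loop of Figure 1 can be sketched roughly as follows. The `solve_lp` callback stands for any routine that solves the ranking LP restricted to the accumulated constraint rows (for example, a GLPK or linprog wrapper such as the sketch above), and the balancing-constraint choice below, pairing each violated positive with the lowest-scoring negative, is an illustrative stand-in; the text does not pin down the exact selection rule.

```python
import numpy as np

def ranking_svm_loop(X_pos, X_neg, w0, solve_lp, k=10, max_iter=50):
    """Constraint-generation loop of Figure 1 (sketch). Starts from
    linear-SVM weights w0; stops when fewer than k new violated
    constraints appear in an iteration."""
    w = np.asarray(w0, dtype=float)
    rows, seen = [], set()
    for _ in range(max_iter):
        s_pos, s_neg = X_pos @ w, X_neg @ w
        # New violations: a positive point scored at or below a negative.
        new = [(i, j) for i in range(len(s_pos)) for j in range(len(s_neg))
               if s_pos[i] <= s_neg[j] and (i, j) not in seen]
        if len(new) < k:                 # early stop: few new violations added
            break
        j_ok = int(np.argmin(s_neg))     # a currently well-ranked negative
        for i, j in new:
            seen.add((i, j))
            rows.append(X_pos[i] - X_neg[j])     # violated constraint
            rows.append(X_pos[i] - X_neg[j_ok])  # balancing, non-violated row
        w = solve_lp(np.array(rows))     # re-solve the LP, update the weights
    return w
```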
3. EXPERIMENTS AND RESULTS

In this study we used Matlab as the programming environment, with the GLPK LP solver [14]. We also used WEKA [21] for the comparison tests. We used 17 datasets from the UCI repository [1]. Some of the datasets are multi-class problems; these were converted to binary classification problems with a one-class-vs.-others split. An overview of the datasets and the splitting criterion for the multi-class datasets are given in Table 1. For our experiments we used 10-fold cross-validation, and for training we limited the number of data points to 500 to keep the number of constraints manageable for the GLPK solver. For the comparisons we ran WEKA using exactly the same training and testing sets in each fold, which makes the significance tests reliable. We present our results in Table 2, which reports averages and standard deviations over five 10-fold cross-validations.

Table 1: Overview of the datasets and modification details. [The point counts, attribute counts, and rare-class percentages did not survive extraction.]

    Dataset        Splitting criterion (multi-class datasets)
    abalone        female = 1, rest = 0
    blocks         text = 1, rest = 0
    boston         (MEDV < 35) = 1, rest = 0
    cancer(wbc)
    cancer(wpbc)
    diabetes
    ecoli          pp = 1, rest = 0
    glass          one class = 1, rest = 0
    heart
    ionosphere
    liver(bupa)
    pendigits      one class = 1, rest = 0
    sonar
    spambase
    spectf
    survival
    vehicle        opel = 1, rest = 0

Table 3 summarizes the number of data sets on which the various algorithms performed best and worst for AUC, and the significance test (at 95%) results are provided in Table 4.

Table 3: Summary of best-worst comparison results for the AUC analysis (number of datasets on which each of RankerSVM, SVM, Logit and Perc was best or worst). [The counts did not survive extraction.]

Table 4: Summary of pairwise significance (t-test) results for AUC

    Algorithm Pair            W-L (significant)
    Ranking vs. SVM           7-4
    Ranking vs. Logit         4-2
    Ranking vs. Perceptron    8-1

The results in Table 3 show that the ranker SVM returns the best AUC among all tested algorithms on 10 of the 17 data sets, while it is the worst in only one case. The significance analysis in Table 4 shows that in general the ranker SVM ranks better than the classifiers. The linear SVM tended to perform either very well or very poorly, while the perceptron did a generally poor job of ranking.

To test the impact of ranking optimization on classification accuracy, we also collected accuracy results from the same experiments. It is fairly simple to get classifications from the ranker: we find the threshold value that yields the best accuracy on the training data, and then apply this threshold to the separating surface on the test set to obtain accuracy (a sketch of this step follows at the end of this section). The accuracies of the other linear classifiers were obtained from WEKA using the same data sets in each fold. The results are provided in Table 5. Table 6 summarizes overall performance, and the 95% pairwise significance tests for the accuracy comparisons are presented in Table 7.

Table 6: Summary of best-worst comparison results for the accuracy analysis (same format as Table 3). [The counts did not survive extraction.]

Table 7: Summary of pairwise significance (t-test) results for accuracy

    Algorithm Pair            W-L (significant)
    Ranking vs. SVM           5-2
    Ranking vs. Logit         2-2
    Ranking vs. Perceptron    4-0

From Table 6 we see that the ranker SVM performs comparably in accuracy with the other classifiers. Even though it was not optimized for classification, by choosing a good threshold it still achieves comparable levels of accuracy. Although the ranker SVM did not score the best accuracies in our tests, it usually was not significantly worse, as can be seen in Table 7. We believe the basic reason for this comparable performance is the margin maximization in the objective function, which seems to help accuracy. While the linear SVM and perceptron were the best overall classifiers on accuracy, the ranker SVM beat both in head-to-head significance results.

The only drawback of the algorithm is the number of constraints created. In our tests we did not train with more than 500 data points because of the computational expense. Table 8 shows the number of iterations needed to converge and the range of the number of constraints created for each dataset, together with the maximum possible number of constraints for comparison. For instance, it is generally much faster to solve five LPs with 5,000 constraints each than a single LP with all possible constraints.
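The threshold-selection step mentioned above can be sketched as follows, assuming 0/1 labels and scores produced by the learned weight vector; this is our minimal rendering of the procedure, not the exact code used in the experiments.

```python
import numpy as np

def best_threshold(scores, labels):
    """Return the score threshold that maximizes training accuracy,
    searching midpoints between consecutive distinct sorted scores."""
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels)
    s = np.unique(scores)
    candidates = np.concatenate([[s[0] - 1.0], (s[:-1] + s[1:]) / 2.0, [s[-1] + 1.0]])
    accs = [np.mean((scores > t) == (labels == 1)) for t in candidates]
    return candidates[int(np.argmax(accs))]

# Usage with a learned w: t = best_threshold(X_train @ w, y_train)
#                         y_pred = (X_test @ w) > t
```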

Table 2: AUC results from five 10-fold cross-validations, reported as mean (standard deviation) for RankerSVM, SVM, Logit and Perc on each of the 17 datasets, with the best result per dataset shown in bold. [The mean values did not survive extraction.]

Table 5: Accuracy results from five 10-fold cross-validations, in the same format as Table 2. [The mean values did not survive extraction.]

Table 8: Constraint size comparison: for each dataset, the number of iterations needed to converge, the range of the number of constraints created, and the maximum possible number of constraints. [The numeric values did not survive extraction.]

4. CONCLUSIONS AND FUTURE WORK

It is common practice in supervised learning to use the output of a classifier as a score for ranking instances. However, it is reasonable to expect that a learner that directly optimizes ranking performance will produce better rankings. In this paper we provided a performance comparison between a linear ranking SVM and a number of linear classifiers. We showed that the ranker SVM, which is optimized for maximizing AUC via the Wilcoxon-Mann-Whitney statistic, has better ranking performance than commonly used linear classifiers. As a side effect, it is also comparable in accuracy to classifiers optimized for accuracy.

This study was limited to linear algorithms. As a future step it would be interesting to see how a nonlinear ranker SVM performs against nonlinear classifiers, and how the ranker SVM compares to other algorithms that are also optimized for AUC maximization. This study considered datasets with binary output, but the method can easily be extended to the problem of learning a total ranking from partial rankings, which remains a popular problem in collaborative filtering and meta-search. Computational complexity on mid-size to large problems is another issue; a number of papers have addressed this and improved efficiency, and we think that reducing the number of constraints in the AUC optimization problem without compromising AUC performance is still an open and interesting problem.

5. REFERENCES

[1] C.L. Blake and C.J. Merz. UCI repository of machine learning databases. mlearn/mlrepository.html.
[2] A.P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30.
[3] R. Caruana, S. Baluja, and T. Mitchell. Using the future to sort out the present: Rankprop and multitask learning for medical risk evaluation. In D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8. The MIT Press.
[4] W.W. Cohen, R.E. Schapire, and Y. Singer. Learning to order things. In M.I. Jordan, M.J. Kearns, and S.A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press.
[5] P. Domingos. MetaCost: A general method for making classifiers cost-sensitive. In Knowledge Discovery and Data Mining.
[6] Y. Freund, R. Iyer, R.E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4.
[7] P.C. Gilmore and R.E. Gomory. A linear programming approach to the cutting-stock problem. Operations Research, 8.
[8] D.J. Hand. Construction and Assessment of Classification Rules. John Wiley & Sons, Bans Lane, England.
[9] J.A. Hanley and B.J. McNeil. The meaning and the use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36.
[10] A. Herschtal and B. Raskutti. Optimising area under the ROC curve using gradient descent. In ICML '04: Twenty-First International Conference on Machine Learning. ACM Press.
[11] T. Joachims. Optimizing search engines using clickthrough data. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press.
[12] C. Ling, J. Huang, and H. Zhang. AUC: A statistically consistent and more discriminating measure than accuracy.
[13] H.B. Mann and D.R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18:50–60.
[14] GLPK Project.
[15] R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers. The MIT Press.
[16] A. Rakotomamonjy. Optimizing area under ROC curve with SVMs. In ROC Analysis in Artificial Intelligence, pages 71–80.
[17] K.A. Spackman. Signal detection theory: Valuable tools for evaluating inductive learning. In Proceedings of the Sixth International Workshop on Machine Learning. Morgan Kaufmann.
[18] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York.
[19] C. Vogt and G. Cottrell. Using d′ to optimize rankings. Technical Report CS98-601, U.C. San Diego, CSE Department.
[20] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics, 1:80–83.
[21] I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
[22] L. Yan, R.H. Dodier, M. Mozer, and R.H. Wolniewicz. Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. In International Conference on Machine Learning, 2003.
