Reducing multiclass to binary by coupling probability estimates
Bianca Zadrozny
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA
zadrozny@cs.ucsd.edu

Abstract

This paper presents a method for obtaining class membership probability estimates for multiclass classification problems by coupling the probability estimates produced by binary classifiers. This is an extension to arbitrary code matrices of a method due to Hastie and Tibshirani for pairwise coupling of probability estimates. Experimental results with boosted naive Bayes show that our method produces calibrated class membership probability estimates, while having classification accuracy similar to that of loss-based decoding, a method for obtaining the most likely class that does not generate probability estimates.

1 Introduction

The two most well-known approaches for reducing a multiclass classification problem to a set of binary classification problems are one-against-all and all-pairs. In the one-against-all approach, we train a classifier for each of the classes, using as positive examples the training examples that belong to that class and as negatives all the other training examples. In the all-pairs approach, we train a classifier for each possible pair of classes, ignoring the examples that do not belong to the two classes in question.

Although these two approaches are the most obvious, Allwein et al. [Allwein et al., 2000] have shown that there are many other ways in which a multiclass problem can be decomposed into a number of binary classification problems. We can represent each such decomposition by a k-by-l code matrix M, where k is the number of classes and l is the number of binary classification problems. If M(c, b) = +1, the examples belonging to class c are considered positive examples for binary problem b. Similarly, if M(c, b) = -1, the examples belonging to c are considered negative examples for b.
Finally, if M(c, b) = 0, the examples belonging to c are not used in training a classifier for b. For example, in the 3-class case, the all-pairs code matrix is

         b1   b2   b3
    c1   +1   +1    0
    c2   -1    0   +1
    c3    0   -1   -1

This approach for representing the decomposition of a multiclass problem into binary problems is a generalization of the Error-Correcting Output Codes (ECOC) scheme proposed by Dietterich and Bakiri [Dietterich and Bakiri, 1995]. The ECOC scheme does not allow zeros in the code matrix, meaning that all examples are used in each binary classification problem.

Orthogonal to the problem of choosing a code matrix for reducing multiclass to binary is the problem of classifying an example given the labels assigned by each binary classifier. Given an example x, Allwein et al. [Allwein et al., 2000] first create a vector v of length l containing the -1/+1 labels assigned to x by each binary classifier. They then compute the Hamming distance between v and each row of M, and find the row c that is closest to v according to this metric. The label c is then assigned to x. This method is called Hamming decoding. For the case in which the binary classifiers output a score whose magnitude is a measure of confidence in the prediction, they use instead a decoding approach that takes the scores into account when calculating the distance between v and each row of M. This method is called loss-based decoding. Allwein et al. present theoretical and experimental results indicating that loss-based decoding is better than Hamming decoding.

However, both of these methods simply assign a class label to each example; they do not output class membership probability estimates P̂(C = c | X = x) for an example x. These probability estimates are important when the classification outputs are not used in isolation and must be combined with other sources of information, such as misclassification costs [Zadrozny and Elkan, 2001a] or the outputs of another classifier.

Given a code matrix M and a binary classification learning algorithm that outputs probability estimates, we would like to couple the estimates given by each binary classifier in order to obtain class membership probability estimates for the multiclass problem.
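To make the two decoding schemes concrete, here is a small sketch (our own illustration, not from the paper) of the 3-class all-pairs code matrix together with Hamming and loss-based decoding; the convention of counting a distance of 1/2 for zero entries and the exponential loss L(y) = e^{-y} follow Allwein et al.:

```python
import math

# All-pairs code matrix for k = 3 classes: rows are classes, columns are
# binary problems; +1 = positive, -1 = negative, 0 = class not used.
M = [
    [+1, +1,  0],   # class 0: positive in (0 vs 1) and (0 vs 2)
    [-1,  0, +1],   # class 1: negative in (0 vs 1), positive in (1 vs 2)
    [ 0, -1, -1],   # class 2: negative in (0 vs 2) and (1 vs 2)
]

def hamming_decode(labels, M):
    """labels: one -1/+1 prediction per binary classifier.
    Picks the row of M closest in Hamming distance, counting 1/2
    for zero entries."""
    def d(c):
        return sum(0.5 if m == 0 else (0.0 if m == y else 1.0)
                   for m, y in zip(M[c], labels))
    return min(range(len(M)), key=d)

def loss_based_decode(scores, M, L=lambda y: math.exp(-y)):
    """scores: one real-valued margin score f_b(x) per binary classifier.
    Picks the row c minimizing sum_b L(M[c][b] * f_b(x))."""
    def d(c):
        return sum(L(m * f) for m, f in zip(M[c], scores))
    return min(range(len(M)), key=d)
```

For example, with labels [+1, +1, -1], Hamming decoding returns class 0, since row 0 disagrees with none of the nonzero entries and pays only 1/2 for its single zero entry.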
Hastie and Tibshirani [Hastie and Tibshirani, 1998] describe a solution for obtaining probability estimates P̂(C = c | X = x) in the all-pairs case by coupling the pairwise probability estimates, which we describe in Section 2. In Section 3, we extend the method to arbitrary code matrices. In Section 4 we discuss the loss-based decoding approach in more detail and compare it mathematically to the method by Hastie and Tibshirani. In Section 5 we present experimental results.

2 Coupling pairwise probability estimates

We are given pairwise probability estimates r_ij(x) for every pair of classes i ≠ j, obtained by training a classifier using the examples belonging to class i as positives and the examples belonging to class j as negatives. We would like to couple these estimates to obtain a set of class membership probabilities p_i(x) = P(C = i | X = x) for each example x. The r_ij are related to the p_i according to

    r_ij(x) = P(C = i | C = i or C = j, X = x) = p_i(x) / (p_i(x) + p_j(x))

Since we additionally require that Σ_i p_i(x) = 1, there are k - 1 free parameters and k(k-1)/2 constraints. This implies that there may not exist p_i satisfying all these constraints.

Let n_ij be the number of training examples used to train the binary classifier that predicts r_ij. In order to find the best approximation r̂_ij(x) = p̂_i(x) / (p̂_i(x) + p̂_j(x)), Hastie and Tibshirani fit the Bradley-Terry model for paired comparisons [Bradley and Terry, 1952] by minimizing the average weighted Kullback-Leibler distance l(x) between r_ij(x) and
r̂_ij(x) for each x, given by

    l(x) = Σ_{i<j} n_ij [ r_ij(x) log( r_ij(x) / r̂_ij(x) ) + (1 - r_ij(x)) log( (1 - r_ij(x)) / (1 - r̂_ij(x)) ) ]

The algorithm is as follows:

1. Start with some guess for the p̂_i(x) and corresponding r̂_ij(x).
2. Repeat until convergence:
   (a) For each i = 1, 2, ..., k:

           p̂_i(x) ← p̂_i(x) · [ Σ_{j≠i} n_ij r_ij(x) ] / [ Σ_{j≠i} n_ij r̂_ij(x) ]

   (b) Renormalize the p̂_i(x).
   (c) Recompute the r̂_ij(x).

Hastie and Tibshirani [Hastie and Tibshirani, 1998] prove that the Kullback-Leibler distance between r_ij(x) and r̂_ij(x) decreases at each step. Since this distance is bounded below by zero, the algorithm converges. At convergence, the r̂_ij are consistent with the p̂_i. The class predicted for each example x is ĉ(x) = argmax_i p̂_i(x).

Hastie and Tibshirani also prove that the p̂_i(x) are in the same order as the non-iterative estimates p̃_i(x) ∝ Σ_{j≠i} r_ij(x) for each x. Thus, the p̃_i(x) are sufficient for predicting the most likely class for each example. However, as shown by Hastie and Tibshirani, they are not accurate probability estimates, because they tend to underestimate the differences between the p̂_i(x) values.

3 Extending the Hastie-Tibshirani method to arbitrary code matrices

For an arbitrary code matrix M, instead of having pairwise probability estimates, we have an estimate r_b(x) for each column b of M, such that

    r_b(x) = P(C ∈ I_b | C ∈ I_b ∪ J_b, X = x) = Σ_{c ∈ I_b} p_c(x) / Σ_{c ∈ I_b ∪ J_b} p_c(x)

where I_b and J_b are the sets of classes for which M(c, b) = +1 and M(c, b) = -1, respectively. We would like to obtain a set of class membership probabilities p_i(x) for each example x compatible with the r_b(x) and subject to Σ_i p_i(x) = 1. In this case, the number of free parameters is k - 1 and the number of constraints is l, where l is the number of columns of the code matrix. Since for most code matrices l is greater than k - 1, in general there is no exact solution to this problem.
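For reference, the pairwise coupling procedure just described can be sketched as follows (the function names and the fixed iteration count are ours, and r̂ is refreshed once per sweep rather than after every single update, a common variant):

```python
def pairwise_couple(r, n, k, iters=200):
    """Couple pairwise estimates r[(i, j)] ~ p_i / (p_i + p_j), i < j,
    into class probabilities. n[(i, j)] is the number of training
    examples of the (i, j) classifier. A fixed iteration count stands
    in for a convergence test in this sketch."""
    def r_ij(i, j):
        return r[(i, j)] if i < j else 1.0 - r[(j, i)]
    def n_ij(i, j):
        return n[(i, j)] if i < j else n[(j, i)]
    p = [1.0 / k] * k                              # initial guess
    for _ in range(iters):
        rhat = {(i, j): p[i] / (p[i] + p[j])
                for i in range(k) for j in range(k) if i != j}
        for i in range(k):
            num = sum(n_ij(i, j) * r_ij(i, j) for j in range(k) if j != i)
            den = sum(n_ij(i, j) * rhat[(i, j)] for j in range(k) if j != i)
            p[i] *= num / den                      # multiplicative update
        s = sum(p)                                 # renormalize
        p = [pi / s for pi in p]
    return p
```

When the r_ij are consistent (i.e., derived from some true p), that p is a fixed point of the update and the iteration converges to it; otherwise it converges to the minimizer of the weighted Kullback-Leibler distance.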
For this reason, we propose an algorithm analogous to the Hastie-Tibshirani method presented in the previous section to find the best approximate probability estimates p̂_i(x) such that

    r̂_b(x) = Σ_{c ∈ I_b} p̂_c(x) / Σ_{c ∈ I_b ∪ J_b} p̂_c(x)

and the Kullback-Leibler distance between r̂_b(x) and r_b(x) is minimized. Let n_b be the number of training examples used to train the binary classifier that corresponds to column b of the code matrix. The algorithm is as follows:

1. Start with some guess for the p̂_i(x) and corresponding r̂_b(x).
2. Repeat until convergence:
   (a) For each i = 1, 2, ..., k:

           p̂_i(x) ← p̂_i(x) · [ Σ_{b: M(i,b)=+1} n_b r_b(x) + Σ_{b: M(i,b)=-1} n_b (1 - r_b(x)) ] / [ Σ_{b: M(i,b)=+1} n_b r̂_b(x) + Σ_{b: M(i,b)=-1} n_b (1 - r̂_b(x)) ]

   (b) Renormalize the p̂_i(x).
   (c) Recompute the r̂_b(x).

If the code matrix is the all-pairs matrix, this algorithm reduces to the original method by Hastie and Tibshirani.

Let B+_i be the set of matrix columns for which M(i, b) = +1 and B-_i be the set of matrix columns for which M(i, b) = -1. By analogy with the non-iterative estimates suggested by Hastie and Tibshirani, we can define non-iterative estimates

    p̃_i(x) = [ Σ_{b ∈ B+_i} r_b(x) + Σ_{b ∈ B-_i} (1 - r_b(x)) ] / ( |B+_i| + |B-_i| )

For the all-pairs code matrix, these estimates are equivalent to the ones suggested by Hastie and Tibshirani. However, for arbitrary matrices, we cannot prove that the non-iterative estimates predict the same class as the iterative estimates.

4 Loss-based decoding

In this section, we discuss how to apply the loss-based decoding method to classifiers that output class membership probability estimates. We also study the conditions under which this method predicts the same class as the Hastie-Tibshirani method in the all-pairs case.

The loss-based decoding method [Allwein et al., 2000] requires that each binary classifier output a margin score satisfying two requirements. First, the score should be positive if the example is classified as positive, and negative if the example is classified as negative. Second, the magnitude of the score should be a measure of confidence in the prediction. The method works as follows. Let f_b(x) be the margin score predicted by the classifier corresponding to column b of the code matrix for example x. For each row c of the code matrix M and for each example x, we compute the distance between the scores and row c as

    d_L(x, c) = Σ_{b=1}^{l} L( M(c, b) · f_b(x) )        (1)

where L is a loss function that depends on the nature of the binary classifier and M(c, b) = -1, 0 or +1. We then label each example x with the label c for which d_L(x, c) is minimized.
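The extended coupling algorithm of Section 3 can be sketched as below (an illustrative implementation under our own naming, with a fixed iteration count standing in for a convergence test):

```python
def couple_code_matrix(r, n, M, iters=200):
    """r[b]: probability estimate of the binary classifier for column b;
    n[b]: its number of training examples; M: k-by-l code matrix with
    entries in {-1, 0, +1}. Returns coupled class probabilities."""
    k, l = len(M), len(M[0])
    p = [1.0 / k] * k
    for _ in range(iters):
        # r-hat_b = (sum of p over positive classes of b) /
        #           (sum of p over all classes used by b)
        rhat = []
        for b in range(l):
            pos = sum(p[c] for c in range(k) if M[c][b] == +1)
            neg = sum(p[c] for c in range(k) if M[c][b] == -1)
            rhat.append(pos / (pos + neg))
        for i in range(k):
            num = sum(n[b] * (r[b] if M[i][b] == +1 else 1.0 - r[b])
                      for b in range(l) if M[i][b] != 0)
            den = sum(n[b] * (rhat[b] if M[i][b] == +1 else 1.0 - rhat[b])
                      for b in range(l) if M[i][b] != 0)
            p[i] *= num / den
        s = sum(p)
        p = [pi / s for pi in p]
    return p
```

On the all-pairs matrix this reduces to the pairwise procedure: for the 3-class all-pairs matrix with column estimates consistent with some underlying class probabilities, it recovers those probabilities.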
If the binary classification learning algorithm outputs scores that are probability estimates, these do not satisfy the first requirement, because probability estimates are all between 0 and 1. However, we can transform the probability estimate r_b(x) output by each classifier into a margin score by subtracting 1/2 from it, so that we consider as positive the examples x for which r_b(x) is above 1/2, and as negative the examples x for which r_b(x) is below 1/2.

We now prove a theorem that relates the loss-based decoding method to the Hastie-Tibshirani method for a particular class of loss functions.

Theorem 1 The loss-based decoding method for all-pairs code matrices predicts the same class label as the iterative estimates p̂_i(x) given by Hastie and Tibshirani, if the loss function is of the form L(y) = -ay, for any a > 0.

Proof: We first show that, if the loss function is of the form L(y) = -ay, the loss-based decoding method predicts the same class label as the non-iterative estimates p̃_i(x), for the all-pairs code matrix.
The non-iterative estimates p̃_c(x) are given by

    p̃_c(x) = [ Σ_{b ∈ B+_c} r_b(x) + Σ_{b ∈ B-_c} (1 - r_b(x)) ] / ( |B+_c| + |B-_c| )

where B+_c and B-_c are the sets of matrix columns for which M(c, b) = +1 and M(c, b) = -1, respectively. Considering that L(y) = -ay and f_b(x) = r_b(x) - 1/2, and eliminating the terms for which M(c, b) = 0, we can rewrite Equation 1 as

    d(x, c) = -a Σ_{b ∈ B+_c} ( r_b(x) - 1/2 ) - a Σ_{b ∈ B-_c} ( 1/2 - r_b(x) )
            = -a [ Σ_{b ∈ B+_c} r_b(x) + Σ_{b ∈ B-_c} ( 1 - r_b(x) ) ] + a ( |B+_c| + |B-_c| ) / 2

For the all-pairs code matrix the following relationship holds: |B+_c| + |B-_c| = k - 1, where k is the number of classes. So the distance d(x, c) is

    d(x, c) = -a [ Σ_{b ∈ B+_c} r_b(x) + Σ_{b ∈ B-_c} ( 1 - r_b(x) ) ] + a (k - 1) / 2

It is now easy to see that the class c that minimizes d(x, c) for example x also maximizes p̃_c(x). Furthermore, if d(x, i) < d(x, j) then p̃_i(x) > p̃_j(x), which means that the ranking of the classes for each example is the same. Since the non-iterative estimates p̃_c(x) are in the same order as the iterative estimates p̂_c(x), we can conclude that the Hastie-Tibshirani method is equivalent to the loss-based decoding method with L(y) = -ay, in terms of class prediction, for the all-pairs code matrix.

Allwein et al. do not consider loss functions of the form L(y) = -ay, and use instead non-linear loss functions such as L(y) = e^{-y}. In this case, the class predicted by loss-based decoding may differ from the one predicted by the method of Hastie and Tibshirani.

This theorem applies only to the all-pairs code matrix. For other matrices in which |B+_c| + |B-_c| is the same for every class c (such as the one-against-all matrix), we can prove that loss-based decoding (with L(y) = -ay) predicts the same class as the non-iterative estimates. However, in this case, the non-iterative estimates do not necessarily predict the same class as the iterative ones.

5 Experiments

We performed experiments using the following multiclass datasets from the UCI Machine Learning Repository [Blake and Merz, 1998]: satimage, pendigits and soybean. Table 1 summarizes the characteristics of each dataset.

    Dataset      #Training Examples   #Test Examples   #Attributes   #Classes
    satimage
    pendigits
    soybean

    Table 1: Characteristics of the datasets used in the experiments.
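The equivalence stated in Theorem 1 can be checked numerically; the sketch below (our own illustration) compares linear-loss decoding against the argmax of the non-iterative estimates on the 3-class all-pairs matrix:

```python
def non_iterative(r, M):
    """p~_c proportional to the sum of r_b over columns where class c is
    positive plus (1 - r_b) over columns where it is negative."""
    k, l = len(M), len(M[0])
    scores = [sum(r[b] if M[c][b] == +1 else 1.0 - r[b]
                  for b in range(l) if M[c][b] != 0)
              for c in range(k)]
    total = sum(scores)
    return [s / total for s in scores]

def linear_loss_decode(r, M, a=1.0):
    """Loss-based decoding with margin f_b = r_b - 1/2 and L(y) = -a*y."""
    k, l = len(M), len(M[0])
    def d(c):
        return sum(-a * M[c][b] * (r[b] - 0.5) for b in range(l))
    return min(range(k), key=d)

M = [[+1, +1, 0], [-1, 0, +1], [0, -1, -1]]   # all-pairs, k = 3
```

For any vector of pairwise probability estimates r, both procedures pick the same class, as the theorem asserts for all-pairs matrices.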
The binary learning algorithm used in the experiments is boosted naive Bayes [Elkan, 1997], since this is a method that cannot be easily extended to handle multiclass problems directly. For all the experiments, we ran 10 rounds of boosting.
    Method                                       Code Matrix        Error Rate   MSE
    Loss-based (L(y) = -y)                       All-pairs
    Loss-based (L(y) = e^{-y})                   All-pairs
    Hastie-Tibshirani (non-iterative)            All-pairs
    Hastie-Tibshirani (iterative)                All-pairs
    Loss-based (L(y) = -y)                       One-against-all
    Loss-based (L(y) = e^{-y})                   One-against-all
    Extended Hastie-Tibshirani (non-iterative)   One-against-all
    Extended Hastie-Tibshirani (iterative)       One-against-all
    Loss-based (L(y) = -y)                       Sparse
    Loss-based (L(y) = e^{-y})                   Sparse
    Extended Hastie-Tibshirani (non-iterative)   Sparse
    Extended Hastie-Tibshirani (iterative)       Sparse
    Multiclass Naive Bayes

    Table 2: Test set results on the satimage dataset.

We use three different code matrices for each dataset: all-pairs, one-against-all and a sparse random matrix. The sparse random matrices have ⌈15 log2(k)⌉ columns, and each element is 0 with probability 1/2 and -1 or +1 with probability 1/4 each. This is the same type of sparse random matrix used by Allwein et al. [Allwein et al., 2000]. In order to have good error-correcting properties, the Hamming distance ρ between each pair of rows in the matrix must be large. We select the matrix by generating 10,000 random matrices and choosing the one for which ρ is maximized, checking that each column has at least one +1 and one -1, and that the matrix does not have two identical columns.

We evaluate the performance of each method using two metrics. The first metric is the error rate obtained when we assign each example to the most likely class predicted by the method. This metric is sufficient if we are only interested in classifying the examples correctly and do not need accurate class membership probability estimates. The second metric is squared error, defined for one example x as

    SE(x) = Σ_j ( t_j(x) - p_j(x) )^2

where p_j(x) is the probability estimated by the method for example x and class j, and t_j(x) is the true probability of class j for x. Since for most real-world datasets true labels are known, but not class probabilities, t_j(x) is defined to be 1 if the label of x is j and 0 otherwise.
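The sparse-matrix selection procedure can be sketched as follows (function names are ours, and we simplify the row distance to a plain entry-wise Hamming count, including zero entries):

```python
import math
import random

def random_sparse_matrix(k, l, rng):
    """Candidate matrix: each entry is 0 with probability 1/2,
    -1 or +1 with probability 1/4 each."""
    return [[rng.choice((0, 0, -1, +1)) for _ in range(l)] for _ in range(k)]

def valid(M):
    """Every column needs at least one +1 and one -1, and no two
    columns may be identical."""
    cols = list(zip(*M))
    return (all(+1 in col and -1 in col for col in cols)
            and len(set(cols)) == len(cols))

def min_row_distance(M):
    """rho: the smallest Hamming distance between any pair of rows."""
    k = len(M)
    return min(sum(a != b for a, b in zip(M[i], M[j]))
               for i in range(k) for j in range(i + 1, k))

def select_sparse_matrix(k, tries=10000, seed=0):
    """Draw `tries` candidates with ceil(15 * log2(k)) columns and keep
    the valid one whose minimum pairwise row distance rho is largest."""
    rng = random.Random(seed)
    l = math.ceil(15 * math.log2(k))
    best, best_rho = None, -1
    for _ in range(tries):
        M = random_sparse_matrix(k, l, rng)
        if valid(M):
            rho = min_row_distance(M)
            if rho > best_rho:
                best, best_rho = M, rho
    return best
```

Note that for small k a candidate in which every column contains both signs is a rare draw, so many candidates are rejected; for k = 19 (soybean) the matrix has ⌈15 log2(19)⌉ = 64 columns, matching the column count reported for the sparse matrix below.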
We average the squared error over the test examples to obtain the mean squared error (MSE). The mean squared error is an adequate metric for assessing the accuracy of probability estimates [Zadrozny and Elkan, 2001]. This metric cannot be applied to the loss-based decoding method, since it does not produce probability estimates.

Table 2 shows the results of the experiments on the satimage dataset for each type of code matrix. As a baseline for comparison, we also show the results of applying multiclass naive Bayes to this dataset. We can see that the iterative Hastie-Tibshirani procedure (and its extension to arbitrary code matrices) lowers the MSE significantly compared to the non-iterative estimates, which indicates that it produces more accurate probability estimates. In terms of error rate, the differences between methods are small. For one-against-all matrices, the iterative method performs consistently worse, while for sparse random matrices it performs consistently better. Figure 1 shows how the MSE is lowered at each iteration of the Hastie-Tibshirani algorithm, for the three types of code matrices.

Table 3 shows the results of the same experiments on the pendigits and soybean datasets. Again, the MSE is significantly lowered by the iterative procedure in all cases. For the soybean dataset, using the sparse random matrix, the iterative method again has a lower error rate than the other methods, which is even lower than the error rate using the all-pairs matrix. This is an interesting result, since in this case the all-pairs matrix has 171 columns (corresponding to 171 classifiers), while the sparse matrix has only 64 columns.
Figure 1: Convergence of the MSE for the satimage dataset (MSE versus iteration, for the all-pairs, one-against-all and sparse code matrices).

                                                          pendigits          soybean
    Method                             Code Matrix        Error Rate   MSE   Error Rate   MSE
    Loss-based (L(y) = -y)             All-pairs
    Loss-based (L(y) = e^{-y})         All-pairs
    Hastie-Tibshirani (non-iterative)  All-pairs
    Hastie-Tibshirani (iterative)      All-pairs
    Loss-based (L(y) = -y)             One-against-all
    Loss-based (L(y) = e^{-y})         One-against-all
    Ext. Hastie-Tibshirani (non-it.)   One-against-all
    Ext. Hastie-Tibshirani (it.)       One-against-all
    Loss-based (L(y) = -y)             Sparse
    Loss-based (L(y) = e^{-y})         Sparse
    Ext. Hastie-Tibshirani (non-it.)   Sparse
    Ext. Hastie-Tibshirani (it.)       Sparse
    Multiclass Naive Bayes

    Table 3: Test set results on the pendigits and soybean datasets.

6 Conclusions

We have presented a method for producing class membership probability estimates for multiclass problems, given probability estimates for a series of binary problems determined by an arbitrary code matrix. Since research in designing optimal code matrices is ongoing [Utschick and Weichselberger, 2001] [Crammer and Singer, 2000], it is important to be able to obtain class membership probability estimates from arbitrary code matrices.

In current research, the effectiveness of a code matrix is determined primarily by classification accuracy. However, since many applications require accurate class membership probability estimates for each of the classes, it is important to also compare the different types of code matrices according to their ability to produce such estimates. Our extension of Hastie and Tibshirani's method is useful for this purpose.

Our method relies on the probability estimates given by the binary classifiers to produce the multiclass probability estimates. However, the probability estimates produced by Boosted
Naive Bayes are not calibrated probability estimates. An interesting direction for future work is to determine whether calibrating the probability estimates given by the binary classifiers improves the calibration of the multiclass probabilities.

References

[Allwein et al., 2000] Allwein, E. L., Schapire, R. E., and Singer, Y. (2000). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1.

[Blake and Merz, 1998] Blake, C. L. and Merz, C. J. (1998). UCI repository of machine learning databases. Department of Information and Computer Sciences, University of California, Irvine. mlearn/mlrepository.html.

[Bradley and Terry, 1952] Bradley, R. and Terry, M. (1952). Rank analysis of incomplete block designs, I: The method of paired comparisons. Biometrika.

[Crammer and Singer, 2000] Crammer, K. and Singer, Y. (2000). On the learnability and design of output codes for multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.

[Dietterich and Bakiri, 1995] Dietterich, T. G. and Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2.

[Elkan, 1997] Elkan, C. (1997). Boosting and naive Bayesian learning. Technical Report CS97-557, University of California, San Diego.

[Hastie and Tibshirani, 1998] Hastie, T. and Tibshirani, R. (1998). Classification by pairwise coupling. In Advances in Neural Information Processing Systems, volume 10. MIT Press.

[Utschick and Weichselberger, 2001] Utschick, W. and Weichselberger, W. (2001). Stochastic organization of output codes in multiclass learning problems. Neural Computation, 13(5).

[Zadrozny and Elkan, 2001a] Zadrozny, B. and Elkan, C. (2001a). Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining. ACM Press.
[Zadrozny and Elkan, 2001] Zadrozny, B. and Elkan, C. (2001). Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proceedings of the Eighteenth International Conference on Machine Learning. Morgan Kaufmann Publishers, Inc.
More information1 Solving LPs: The Simplex Algorithm of George Dantzig
Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.
More informationCounting Primes whose Sum of Digits is Prime
2 3 47 6 23 Journal of Integer Sequences, Vol. 5 (202), Article 2.2.2 Counting Primes whose Sum of Digits is Prime Glyn Harman Department of Mathematics Royal Holloway, University of London Egham Surrey
More informationWhich Is the Best Multiclass SVM Method? An Empirical Study
Which Is the Best Multiclass SVM Method? An Empirical Study Kai-Bo Duan 1 and S. Sathiya Keerthi 2 1 BioInformatics Research Centre, Nanyang Technological University, Nanyang Avenue, Singapore 639798 askbduan@ntu.edu.sg
More informationDATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
More informationThe Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
More informationA General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms
A General Approach to Incorporate Data Quality Matrices into Data Mining Algorithms Ian Davidson 1st author's affiliation 1st line of address 2nd line of address Telephone number, incl country code 1st
More informationSoftware Reliability Measuring using Modified Maximum Likelihood Estimation and SPC
Software Reliaility Measuring using Modified Maximum Likelihood Estimation and SPC Dr. R Satya Prasad Associate Prof, Dept. of CSE Acharya Nagarjuna University Guntur, INDIA K Ramchand H Rao Dept. of CSE
More informationUsing Random Forest to Learn Imbalanced Data
Using Random Forest to Learn Imbalanced Data Chao Chen, chenchao@stat.berkeley.edu Department of Statistics,UC Berkeley Andy Liaw, andy liaw@merck.com Biometrics Research,Merck Research Labs Leo Breiman,
More informationA Direct Numerical Method for Observability Analysis
IEEE TRANSACTIONS ON POWER SYSTEMS, VOL 15, NO 2, MAY 2000 625 A Direct Numerical Method for Observability Analysis Bei Gou and Ali Abur, Senior Member, IEEE Abstract This paper presents an algebraic method
More informationA Binary Recursive Gcd Algorithm
A Binary Recursive Gcd Algorithm Damien Stehlé and Paul Zimmermann LORIA/INRIA Lorraine, 615 rue du jardin otanique, BP 101, F-5460 Villers-lès-Nancy, France, {stehle,zimmerma}@loria.fr Astract. The inary
More information5 Double Integrals over Rectangular Regions
Chapter 7 Section 5 Doule Integrals over Rectangular Regions 569 5 Doule Integrals over Rectangular Regions In Prolems 5 through 53, use the method of Lagrange multipliers to find the indicated maximum
More informationApplied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.
Applied Mathematical Sciences, Vol. 7, 2013, no. 112, 5591-5597 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2013.38457 Accuracy Rate of Predictive Models in Credit Screening Anirut Suebsing
More informationA Novel Feature Selection Method Based on an Integrated Data Envelopment Analysis and Entropy Mode
A Novel Feature Selection Method Based on an Integrated Data Envelopment Analysis and Entropy Mode Seyed Mojtaba Hosseini Bamakan, Peyman Gholami RESEARCH CENTRE OF FICTITIOUS ECONOMY & DATA SCIENCE UNIVERSITY
More informationDomain Adaptation meets Active Learning
Domain Adaptation meets Active Learning Piyush Rai, Avishek Saha, Hal Daumé III, and Suresh Venkatasuramanian School of Computing, University of Utah Salt Lake City, UT 84112 {piyush,avishek,hal,suresh}@cs.utah.edu
More informationLinear Algebra Methods for Data Mining
Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 Lecture 3: QR, least squares, linear regression Linear Algebra Methods for Data Mining, Spring 2007, University
More informationCHAPTER 2 Estimating Probabilities
CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2016. Tom M. Mitchell. All rights reserved. *DRAFT OF January 24, 2016* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is a
More informationCombining SVM classifiers for email anti-spam filtering
Combining SVM classifiers for email anti-spam filtering Ángela Blanco Manuel Martín-Merino Abstract Spam, also known as Unsolicited Commercial Email (UCE) is becoming a nightmare for Internet users and
More informationClassification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
More informationData Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationApplied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
More informationAdvanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
More informationMinimizing Probing Cost and Achieving Identifiability in Network Link Monitoring
Minimizing Proing Cost and Achieving Identifiaility in Network Link Monitoring Qiang Zheng and Guohong Cao Department of Computer Science and Engineering The Pennsylvania State University E-mail: {quz3,
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification
More informationGLM, insurance pricing & big data: paying attention to convergence issues.
GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.
More informationMonotonicity Hints. Abstract
Monotonicity Hints Joseph Sill Computation and Neural Systems program California Institute of Technology email: joe@cs.caltech.edu Yaser S. Abu-Mostafa EE and CS Deptartments California Institute of Technology
More informationSUPPORT VECTOR MACHINE (SVM) is the optimal
130 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 19, NO. 1, JANUARY 2008 Multiclass Posterior Probability Support Vector Machines Mehmet Gönen, Ayşe Gönül Tanuğur, and Ethem Alpaydın, Senior Member, IEEE
More informationDimensionality Reduction: Principal Components Analysis
Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely
More informationTOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
More information*corresponding author
Key Engineering Materials Vol. 588 (2014) pp 249-256 Online availale since 2013/Oct/11 at www.scientific.net (2014) Trans Tech Pulications, Switzerland doi:10.4028/www.scientific.net/kem.588.249 Remote
More informationArtificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier
International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing
More informationTHREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS
THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS O.U. Sezerman 1, R. Islamaj 2, E. Alpaydin 2 1 Laborotory of Computational Biology, Sabancı University, Istanbul, Turkey. 2 Computer Engineering
More informationA Hybrid Approach to Learn with Imbalanced Classes using Evolutionary Algorithms
Proceedings of the International Conference on Computational and Mathematical Methods in Science and Engineering, CMMSE 2009 30 June, 1 3 July 2009. A Hybrid Approach to Learn with Imbalanced Classes using
More informationα = u v. In other words, Orthogonal Projection
Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v
More informationThe Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
More informationCHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression
Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the
More informationClustering Connectionist and Statistical Language Processing
Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised
More informationGeneral Framework for an Iterative Solution of Ax b. Jacobi s Method
2.6 Iterative Solutions of Linear Systems 143 2.6 Iterative Solutions of Linear Systems Consistent linear systems in real life are solved in one of two ways: by direct calculation (using a matrix factorization,
More informationIn Defense of One-Vs-All Classification
Journal of Machine Learning Research 5 (2004) 101-141 Submitted 4/03; Revised 8/03; Published 1/04 In Defense of One-Vs-All Classification Ryan Rifkin Honda Research Institute USA 145 Tremont Street Boston,
More informationA Negative Result Concerning Explicit Matrices With The Restricted Isometry Property
A Negative Result Concerning Explicit Matrices With The Restricted Isometry Property Venkat Chandar March 1, 2008 Abstract In this note, we prove that matrices whose entries are all 0 or 1 cannot achieve
More informationExperiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
More informationCS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationVisualization of large data sets using MDS combined with LVQ.
Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk
More informationData Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
More informationDUOL: A Double Updating Approach for Online Learning
: A Double Updating Approach for Online Learning Peilin Zhao School of Comp. Eng. Nanyang Tech. University Singapore 69798 zhao6@ntu.edu.sg Steven C.H. Hoi School of Comp. Eng. Nanyang Tech. University
More informationInner Product Spaces
Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and
More information[1] Diagonal factorization
8.03 LA.6: Diagonalization and Orthogonal Matrices [ Diagonal factorization [2 Solving systems of first order differential equations [3 Symmetric and Orthonormal Matrices [ Diagonal factorization Recall:
More informationAn Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
More informationK-Means Clustering Tutorial
K-Means Clustering Tutorial By Kardi Teknomo,PhD Preferable reference for this tutorial is Teknomo, Kardi. K-Means Clustering Tutorials. http:\\people.revoledu.com\kardi\ tutorial\kmean\ Last Update: July
More informationSupport Vector Machines
Support Vector Machines Charlie Frogner 1 MIT 2011 1 Slides mostly stolen from Ryan Rifkin (Google). Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric
More informationProgramming Exercise 3: Multi-class Classification and Neural Networks
Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks
More informationActive Learning with Boosting for Spam Detection
Active Learning with Boosting for Spam Detection Nikhila Arkalgud Last update: March 22, 2008 Active Learning with Boosting for Spam Detection Last update: March 22, 2008 1 / 38 Outline 1 Spam Filters
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More information