Fast Training of Support Vector Machines Using Error-Center-Based Optimization
International Journal of Automation and Computing 1 (2005) 6-12

L. Meng, Q. H. Wu
Department of Electrical Engineering and Electronics, The University of Liverpool, Liverpool, L69 3GJ, UK

Abstract: This paper presents a new algorithm for Support Vector Machine (SVM) training, which trains a machine based on the cluster centers of the errors made by the current machine. Experiments with various training sets show that the computation time of this new algorithm scales almost linearly with training set size, so it may be applied to much larger training sets than standard quadratic programming (QP) techniques can handle.

Keywords: Support vector machines, quadratic programming, pattern classification, machine learning.

1 Introduction

Based on recent advances in statistical learning theory, Support Vector Machines (SVMs) form a new class of learning systems for pattern classification. Training an SVM amounts to solving a quadratic programming (QP) problem with a dense matrix. Standard QP solvers require full storage of this matrix, and their efficiency relies on its sparseness, which makes their application to SVM training with large training sets intractable. The SVM, pioneered by Vapnik and his co-workers, is a technique for pattern classification and nonlinear regression (see [1], [2], and [3]). For linearly separable problems, an SVM is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin. Although intuitively simple, the idea of a maximum margin exploits the structural risk minimization (SRM) principle of statistical learning theory [4]. Therefore, the learned machine not only has minimal empirical risk but also good generalization performance.
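To make the maximum-margin idea concrete, here is a small worked example (our own illustration, not from the paper): for a toy linearly separable set, a candidate separating hyperplane w · x + b = 0 is checked against the margin constraints yᵢ(w · xᵢ + b) ≥ 1, and the geometric margin is 1/||w||.

```python
import numpy as np

# Toy linearly separable training set in the plane: two points per class.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

# For this symmetric configuration the maximum-margin separating
# hyperplane is x1 = 1, i.e. w = (1, 0), b = -1 (stated, not solved for).
w = np.array([1.0, 0.0])
b = -1.0

# Functional margins y_i (w . x_i + b): all must be >= 1, with equality
# for the support vectors lying on the margin boundary.
margins = y * (X @ w + b)
print(margins)                     # [1. 1. 1. 1.]

# The width of the margin is measured by 1/||w||.
geometric_margin = 1.0 / np.linalg.norm(w)
print(geometric_margin)            # 1.0
```

Here every example sits exactly on the margin boundary, so all four are support vectors; shrinking ||w|| any further would violate a constraint.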
For nonlinearly separable problems, a nonlinear mapping is introduced before the construction of the separating hyperplane; it transforms the training examples from the input space to a higher-dimensional feature space, and the separating hyperplane is constructed in that feature space. This yields a nonlinear decision boundary in the input space, composed of the points whose images lie on the separating hyperplane in the feature space. The nonlinear mapping is motivated by Cover's theorem on the separability of patterns [5]: a complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space. For an SVM, the decision function for classifying new examples is defined as

sgn(f(x)) = sgn(w · Φ(x) + b)    (1)

where x denotes an example to classify, Φ(x) the corresponding feature vector, and w and b the normal vector and intercept of the separating hyperplane. The vector w and constant b are the parameters to optimize, and their optimization amounts to optimizing an objective function subject to linear constraints. The objective function associated with SVM optimization is a convex quadratic function, so the optimization problem has no local optima. The problem of optimizing a quadratic function of many variables is well understood in optimization theory, and most standard approaches can be directly applied to SVM training. However, most standard QP techniques require full storage of the quadratic term in the objective function: they are either suitable only for small problems or assume that the quadratic term is very sparse, i.e. that most of its elements are zero.

(Manuscript received November 5, 2003; revised June 1. Corresponding author e-mail: q.h.wu@liv.ac.uk)
Unfortunately, this is not true for the SVM optimization problem, where the quadratic term is not only dense but also has a size that grows quadratically with the number of data points in the training set. For training tasks with 10,000 examples or more, the memory requirement exceeds hundreds of megabytes and hence cannot be met. This prohibits the application of standard QP techniques to problems with large training sets. An alternative would be to recompute the quadratic term every time it is needed, but this becomes prohibitively expensive
since QP techniques are iterative and the quadratic term is needed at every iteration. Such considerations have driven the design of a new training algorithm for support vector machines. The algorithm proposed in this paper is conceptually simple, generally fast, and has much better scaling properties than standard QP techniques.

2 The optimization problem in SVM training

Given a training sample {(xᵢ, yᵢ)}, i = 1, …, l, where yᵢ = ±1 is the target response indicating which pattern the input example xᵢ belongs to, the optimization problem associated with training an SVM can be written as follows:

OP1:  min_{w, b, ξ}  (1/2)||w||² + C Σᵢ ξᵢ
      subject to  yᵢ(w · Φ(xᵢ) + b) ≥ 1 − ξᵢ,  i = 1, …, l    (2)

where the margin is bounded by the two hyperplanes w · Φ(x) + b = ±1 and is measured by 1/||w||, the ξᵢ ≥ 0 are slack variables that permit margin failures, and C is a parameter that trades off a wide margin against a small number of margin failures. When ξᵢ = 0 for all i and C = ∞, the machine is called a hard-margin SVM, since all the training examples must lie outside the margin and no margin failure is allowed. Otherwise, the machine is called a soft-margin SVM. By introducing Lagrange multipliers α = {α₁, α₂, …, α_l} and β = {β₁, β₂, …, β_l} and the Lagrangian

L(w, b, ξ, α, β) = (1/2)||w||² + C Σᵢ ξᵢ − Σᵢ αᵢ[yᵢ(w · Φ(xᵢ) + b) − 1 + ξᵢ] − Σᵢ βᵢξᵢ

and then minimising the Lagrangian with respect to w, b, ξ and maximising it with respect to α, β, where αᵢ, βᵢ ≥ 0 for all i, we have

w = Σᵢ yᵢαᵢΦ(xᵢ)    (3)

and the dual form of OP1 as follows:

OP2:  min_α  (1/2) Σᵢ Σⱼ yᵢyⱼαᵢαⱼK(xᵢ, xⱼ) − Σᵢ αᵢ
      subject to  Σᵢ yᵢαᵢ = 0,  0 ≤ αᵢ ≤ C,  i = 1, …, l    (4)

where K(xᵢ, xⱼ) = Φ(xᵢ) · Φ(xⱼ) defines the inner product of two vectors in the feature space and is called a kernel function.
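As an illustration (our own sketch; the function and variable names are assumptions, not the authors' code), the dense quadratic term Q with Qᵢⱼ = yᵢyⱼK(xᵢ, xⱼ) can be built for a Gaussian kernel as follows; note that its storage grows as l² with the number of training points:

```python
import numpy as np

def gaussian_kernel(x1, x2, variance=1.0):
    """Gaussian kernel K(x1, x2) = exp(-||x1 - x2||^2 / (2 * variance))."""
    d = x1 - x2
    return np.exp(-np.dot(d, d) / (2.0 * variance))

def quadratic_term(X, y, variance=1.0):
    """Dense l-by-l matrix Q with Q_ij = y_i y_j K(x_i, x_j)."""
    l = len(X)
    Q = np.empty((l, l))
    for i in range(l):
        for j in range(l):
            Q[i, j] = y[i] * y[j] * gaussian_kernel(X[i], X[j], variance)
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = np.array([1, -1, 1, -1, 1])
Q = quadratic_term(X, y)
print(Q.shape)    # (5, 5) -- an l x l dense matrix
```

Since K(x, x) = 1 for the Gaussian kernel and yᵢ² = 1, the diagonal of Q is all ones, and Q is symmetric; but it has no zero entries, which is exactly the denseness that defeats sparse QP solvers.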
The use of a kernel function allows an SVM to locate a separating hyperplane in the feature space and classify vectors in that space without ever representing the feature space explicitly, so the computational burden of explicitly representing the feature vectors is avoided. OP2 is essentially a QP problem, since it has the form

min_α  (1/2) αᵀQα − 1ᵀα    subject to  αᵀy = 0,  α ≥ 0    (5)

where the matrix Q is the quadratic term; for SVM training it is defined by Qᵢⱼ = yᵢyⱼK(xᵢ, xⱼ). The Karush-Kuhn-Tucker (KKT) conditions, devised by Karush [6] and Kuhn and Tucker [7], are the necessary and sufficient conditions for a set of variables to be optimal for an optimization problem. Applying the KKT conditions to problem OP1, we know that the optimal solution α*, (w*, b*) must satisfy

αᵢ*[yᵢ(w* · Φ(xᵢ) + b*) − 1 + ξᵢ*] = 0,  i = 1, …, l    (6)

and

ξᵢ*(αᵢ* − C) = 0,  i = 1, …, l    (7)

implying that

αᵢ = 0  ⟹  yᵢf(xᵢ) ≥ 1    (8)
0 < αᵢ < C  ⟹  yᵢf(xᵢ) = 1    (9)
αᵢ = C  ⟹  yᵢf(xᵢ) ≤ 1.    (10)

Equation (9), together with equations (8) and (10), shows that only for those examples lying on the margin boundary are the corresponding αᵢ not at the bounds. Equation (8) indicates that all examples for which the corresponding αᵢ equals zero must be correctly classified and lie outside the margin. Equation (10) shows that all margin errors have the corresponding αᵢ equal to the upper bound C. Furthermore, equation (7) indicates that non-zero slack variables can only occur when αᵢ = C, and hence all margin errors are penalized.

3 Error-center-based optimization

The size of a QP problem is determined by the quadratic term Q. In SVM training, the size of matrix Q is l², where l denotes the number of training data points. As stated, standard solving techniques must store Q explicitly, yet the denseness of Q in SVM training prohibits
the application of standard QP solvers to SVM training with large data sets. Considering this, a new technique for SVM training was devised in [8]. The basic idea is to compress the original training set and then train the machine on a working set composed of the centers of the clusters in the current compression. The compression is updated every iteration by splitting each cluster that has a support vector as its center into two sub-clusters. Since this algorithm extracts classification information from a working set composed of cluster centers, it is called the center-based optimization (CO) algorithm. Experiments on various training sets have shown that the training time taken by CO is much less than that of standard techniques; for large training tasks, CO can reduce the training time to less than 1/150 of that of a standard technique. Unfortunately, although an optimal decision boundary may be found by CO, the optimality of the resulting decision boundary is not guaranteed on every run (see Fig.1(a) and Fig.1(b) for a comparison). This is because a k-means algorithm [9] is used to split the clusters in CO, and the hill-climbing nature of that algorithm causes it to become easily trapped in different local optima. Despite the inaccuracy and multiplicity of the resulting decision boundaries, the speed of CO indicates the great potential of center-based algorithms for fast solving of SVM optimization problems with large training sets. Observing Fig.1(b), we can see that the lost support vectors lie either inside or on the wrong side of the margin, and, since they were not involved in the last training round, their corresponding αᵢ are zero. The KKT conditions, however, indicate that examples associated with zero αᵢ must be correctly classified and lie outside the margin. Inspired by this, a modification has been made to CO.
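The error check that this modification relies on can be sketched as follows (our own illustration with assumed names, and with a small numerical tolerance added for floating-point comparisons); it flags each example that violates the KKT conditions (8)-(10):

```python
import numpy as np

def kkt_violations(alpha, y, f_values, C, tol=1e-3):
    """Boolean mask of examples violating the KKT conditions:
    alpha_i = 0     requires y_i f(x_i) >= 1,
    0 < alpha_i < C requires y_i f(x_i) == 1,
    alpha_i = C     requires y_i f(x_i) <= 1."""
    yf = y * f_values
    at_lower = alpha <= tol
    at_upper = alpha >= C - tol
    interior = ~at_lower & ~at_upper
    viol = np.zeros(len(alpha), dtype=bool)
    viol[at_lower] = yf[at_lower] < 1 - tol
    viol[interior] = np.abs(yf[interior] - 1) > tol
    viol[at_upper] = yf[at_upper] > 1 + tol
    return viol

# Example: one correctly placed point, one margin error.
alpha = np.array([0.0, 0.0])
y = np.array([1, -1])
f_values = np.array([1.5, 0.5])   # y_2 f(x_2) = -0.5 < 1: a violator
viol = kkt_violations(alpha, y, f_values, C=10.0)
print(viol)                       # [False  True]
```

Examples flagged True lie inside or on the wrong side of the current margin and go into the error sub-cluster; the rest stay outside or on the margin.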
Now, each cluster is split into two sub-clusters by separating the examples that satisfy the KKT conditions, and thus lie outside or on the current margin, from those that violate the KKT conditions, and thus lie inside or on the wrong side of the current margin. On the one hand, as long as there are examples in the original training set that violate the KKT conditions, at least one cluster will be split. On the other hand, the procedure iterates until no example in the original training set violates the KKT conditions. Since the KKT conditions are the necessary and sufficient conditions for optimal solutions, the optimality of the solutions found by this algorithm is guaranteed. Again, this new algorithm builds SVMs using a set of cluster centers. Here, we refer to examples that violate the KKT conditions as margin errors. To further reduce the size of the QP problem, in each iteration only the clusters of the margin errors are involved in the SVM training; the remaining clusters are represented by the support vectors found in the previous iteration. Moreover, it has been proved in [10] that a large QP problem can be broken down into a series of smaller QP sub-problems. As long as at least one example that violates the KKT conditions is added to the examples of the previous sub-problem, each step reduces the overall objective function and maintains a feasible solution that obeys all of the constraints; therefore, a sequence of QP sub-problems that always adds at least one violator is guaranteed to converge. Taking this into consideration, in order to ensure a strict improvement in the objective function and hence convergence, the new algorithm inserts an error center into the working set only if it violates the KKT conditions. Otherwise, the example in that cluster that most violates the KKT conditions is inserted into the working set as the representative of its cluster. Since most examples of the working set are the centers of error clusters (the support vectors of previous iterations must have been centers of error clusters), this new algorithm is called error-center-based optimization (ECO). The implementation steps of ECO are listed in Table 1.

Fig.1 Two possible decision boundaries found using the CO algorithm. The dots are the positive examples and the stars the negative ones. Cluster centers are plotted as large dots. A solid line denotes the decision boundary. The area between the dotted lines shows the margin. In (b), examples in the cluster containing the lost support vector are marked with boxes.

4 Experiments and results

The ECO algorithm has been implemented in MATLAB. The quadratic programming subroutine provided in the MATLAB Optimization Toolbox was used as the standard technique for comparison; the QP problem in each iteration of ECO is also solved by this subroutine. ECO has been tested on the Iris data set and an image segmentation data set. To allow visualization of the results, the experiments with the Iris data set separated the classes Versicolour and Virginica according to petal length and width (the attributes with the largest correlation with the class labels). Both benchmark sets were trained with a Gaussian SVM, using the standard technique and ECO respectively. For the Iris data set, the variance of the Gaussian kernel is 0.6, and for image segmentation it is 1.0. Fig.2 and Fig.3 show the decision boundaries obtained using the different algorithms when C = ∞, for the Iris data set and the image segmentation data set respectively. As can be observed, on both data sets the results obtained using the different algorithms are exactly the same; the optimality of the solution found by ECO is thus verified. Moreover, since no randomness resides in the ECO procedure, the decision boundary generated by ECO for a particular training set is certain and unique. For an SVM with a soft margin, noisy examples are allowed to remain inside or even on the wrong side of the optimal margin.
On the contrary, by applying the KKT conditions in error checking and involving error centers in training, ECO actually tries to push all training examples outside the final margin. It may happen that, even though all examples lying inside or on the wrong side of the margin are identified by the KKT conditions in the error-checking step, the QP-solving step allows their cluster centers to remain inside or on the wrong side of the margin. Consequently, the decision boundary does not move, the same group of error points is detected, and further iterations bring no improvement; yet the iteration of ECO would not stop until all the training examples are outside the margin. To solve this problem, in the case of soft-margin SVM training, ECO stops when no new error cluster is formed. ECO has been applied to the image segmentation data set for C = 1000, C = 100 and C = 10. The resulting decision boundaries are shown in Fig.4(I(a))-4(III(b)). For the same values of C, the decision boundaries obtained using the different algorithms are almost the same; the small differences exist because under ECO the SVM is trained on, and thus penalizes, cluster centers rather than individual examples.

Table 1 Implementation steps of the error-center-based optimization (ECO) algorithm

Given a training set S, treat each pattern of S as a cluster.
Initialize the working set Ŝ to the centers of these two clusters.
Repeat
    Train the SVM on Ŝ.
    Set Ŝ to the support vectors.
    For each cluster C_r of S:
        Split C_r into two sub-clusters by identifying the margin errors, i.e. those examples that violate the KKT conditions.
        If the center of the error cluster violates the KKT conditions,
            add that center to Ŝ;
        Else
            add the example that most violates the KKT conditions in C_r to Ŝ.
Until no new margin error is found.

S denotes the training set whose two patterns are to be classified by the decision function.
Ŝ denotes the set of examples involved in subsequent SVM training. C_r denotes the rth cluster of S, whose center is defined as c_r = (1/|C_r|) Σ_{x_j ∈ C_r} x_j.
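The cluster-center formula and the KKT-based split used in Table 1 can be sketched as follows (an illustration with assumed names, not the authors' MATLAB code; the SVM-training step itself is omitted, and the KKT test is passed in as a function):

```python
import numpy as np

def cluster_center(cluster):
    """c_r = (1/|C_r|) * sum of x_j over x_j in C_r."""
    return np.mean(cluster, axis=0)

def split_cluster(cluster, satisfies_kkt):
    """Split a cluster into the examples satisfying the KKT conditions
    and the margin errors (those violating them)."""
    mask = np.array([satisfies_kkt(x) for x in cluster])
    return cluster[mask], cluster[~mask]

cluster = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
center = cluster_center(cluster)
print(center)                            # [2. 0.]

# Toy stand-in for the KKT test: treat points with first coordinate >= 1
# as satisfying the conditions.
ok, errors = split_cluster(cluster, lambda x: x[0] >= 1.0)
print(len(ok), len(errors))              # 2 1
```

In the full algorithm the error sub-cluster's center (or its worst violator) is what gets inserted into the working set Ŝ for the next QP sub-problem.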
Fig.2 The decision boundaries found with the two-feature Iris data set where C = ∞, using (a) the standard technique and (b) the ECO algorithm, respectively. Positive examples and negative examples are marked with 'x's and '+'s, respectively. Support vectors are marked with dark circles. A solid line denotes the decision boundary. The area between the dotted lines shows the margin. In (b), different clusters are indicated by different grey levels; each cluster center in the working set is marked with a dot in the same grey level used for the members of that cluster.

Fig.3 The decision boundaries found with the image segmentation data set where C = ∞, using (a) the standard technique and (b) the ECO algorithm, respectively. The same markers as in Fig.2 are used.
Fig.4 The decision boundaries found with the image segmentation data set where (I) C = 1000, (II) C = 100 and (III) C = 10, using (a) the standard technique and (b) the ECO algorithm, respectively. The same markers as in Fig.2 are used.

To investigate how training time grows with the size of the training set, the image segmentation data set was used, and the size of the training set was varied by randomly taking subsets of the full training set. Tables 2 and 3 compare the performance of ECO with the standard QP technique for C = ∞ and C = 100, respectively. CPU times are averaged over 100 independent runs. As shown in the tables, the running time of ECO is dominated by error checking. Fig.5 shows the log-log plot of training time in seconds versus the size of the training set for C = ∞ and C = 100. In both cases ECO is much faster than the standard technique, and, more importantly, the training time of ECO increases much more slowly than that of the standard technique as the size of the data set grows.

Table 2 Performance of the standard QP technique and the ECO algorithm when applied to different image segmentation subsets (C = ∞). All CPU times are in seconds. Columns: problem size; CPU time of the standard algorithm; CPU time of ECO; CPU time for solving only the QP sub-problems involved in ECO; number of ECO iterations.

Table 3 Performance of the standard QP technique and the ECO algorithm when applied to different image segmentation subsets (C = 100). All CPU times are in seconds. Columns as in Table 2.

By fitting a line to the log-log plot and
then working out the gradient of the line, we find that the training time of the standard technique scales as l^3.3 for both C = ∞ and C = 100, while the ECO time scales as l^1.05; i.e., for both hard- and soft-margin SVMs, the training time of ECO grows almost linearly with the size of the training set.

Fig.5 The log-log plot of training time versus the size of the training set for the standard QP technique and the ECO algorithm when applied to image segmentation subsets

5 Conclusion

Standard QP techniques are not suitable for SVM training with large data sets. Considering this, a new center-based algorithm, ECO, has been introduced to speed up the training of SVMs. Under ECO, the full training set is compressed and represented by a set of cluster centers, and in the training process more and more error cluster centers are added to the current working set until the approach converges. For hard-margin SVMs, the optimality of the solution obtained by ECO is guaranteed, since the KKT conditions are used as its stopping criterion. Moreover, the great potential of ECO for large training sets has been demonstrated through experimental results, which show that with ECO the training time scales almost linearly with training set size.

References

[1] B. E. Boser, I. M. Guyon, V. N. Vapnik, A Training Algorithm for Optimal Margin Classifiers, in D. Haussler (ed.), Proceedings of the Fifth Annual ACM Workshop on COLT, Pittsburgh, PA, ACM Press, 1992.
[2] C. Cortes, V. Vapnik, Support Vector Networks, Machine Learning, vol. 20, 1995.
[3] V. Vapnik, S. Golowich, A. Smola, Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing, in M. Mozer, M. Jordan, T. Petsche (eds.), Advances in Neural Information Processing Systems, vol. 9, MIT Press, Cambridge, MA, 1997.
[4] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[5] T. M. Cover, Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition, IEEE Transactions on Electronic Computers, EC-14, 1965.
[6] W. Karush, Minima of Functions of Several Variables with Inequalities as Side Constraints, MSc Thesis, Department of Mathematics, University of Chicago, 1939.
[7] H. Kuhn, A. Tucker, Nonlinear Programming, Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1951.
[8] L. Meng, K. W. Lau, Q. H. Wu, Pattern Classification Using a Support Vector Machine Based on Subclass Centres, in Proceedings of the IEEE Third International Conference on Control Theory and Applications, South Africa.
[9] R. O. Duda, P. E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[10] E. Osuna, R. Freund, F. Girosi, An Improved Training Algorithm for Support Vector Machines, in J. Principe, L. Gile, N. Morgan, E. Wilson (eds.), Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing VII, IEEE Press, 1997.

L. Meng received the B.Sc. in Electrical and Electronic Engineering from Shenzhen University, China, in 1997, and the M.Sc. in Electrical and Electronic Engineering in 1998 and the Ph.D. in Electrical Engineering in 2002, both from The University of Liverpool, U.K. She worked as a Post-Doctoral Research Fellow at London Metropolitan University, U.K. from June 2002 to Feb. Currently she is a lecturer at the University of Hertfordshire, U.K. Her research interests include pattern recognition, kernel machines, fuzzy control, evolutionary computation, wireless networks, and digital video streaming.

Q. H. Wu obtained an M.Sc. degree in Electrical Engineering from Huazhong University of Science and Technology (HUST), China. From 1981 to 1984, he was appointed Lecturer in Electrical Engineering at the University. He obtained a Ph.D. degree from The Queen's University of Belfast (QUB), U.K. He worked as a Research Fellow and Senior Research Fellow at QUB from 1987 to 1991, and as Lecturer and Senior Lecturer in the Department of Mathematical Sciences, Loughborough University, U.K. from 1991 to 1995. Since 1995 he has held the Chair of Electrical Engineering in the Department of Electrical Engineering and Electronics, The University of Liverpool, U.K., acting as Head of the Intelligence Engineering and Automation group. Professor Wu is a Chartered Engineer, a Fellow of the IEE and a Senior Member of the IEEE. His research interests include adaptive control, mathematical morphology, neural networks, learning systems, pattern recognition, evolutionary computation, and power system control and operation.
More informationLecture 6: Logistic Regression
Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,
More informationSupport vector machines based on K-means clustering for real-time business intelligence systems
54 Int. J. Business Intelligence and Data Mining, Vol. 1, No. 1, 2005 Support vector machines based on K-means clustering for real-time business intelligence systems Jiaqi Wang* Faculty of Information
More informationSearch Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
More informationDUOL: A Double Updating Approach for Online Learning
: A Double Updating Approach for Online Learning Peilin Zhao School of Comp. Eng. Nanyang Tech. University Singapore 69798 zhao6@ntu.edu.sg Steven C.H. Hoi School of Comp. Eng. Nanyang Tech. University
More informationArtificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support
More informationEM Clustering Approach for Multi-Dimensional Analysis of Big Data Set
EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
More informationMachine Learning in FX Carry Basket Prediction
Machine Learning in FX Carry Basket Prediction Tristan Fletcher, Fabian Redpath and Joe D Alessandro Abstract Artificial Neural Networks ANN), Support Vector Machines SVM) and Relevance Vector Machines
More informationAnalysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
More informationSUPPORT vector machine (SVM) formulation of pattern
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 17, NO. 3, MAY 2006 671 A Geometric Approach to Support Vector Machine (SVM) Classification Michael E. Mavroforakis Sergios Theodoridis, Senior Member, IEEE Abstract
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationThe Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method
The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method Robert M. Freund February, 004 004 Massachusetts Institute of Technology. 1 1 The Algorithm The problem
More informationSemi-Supervised Support Vector Machines and Application to Spam Filtering
Semi-Supervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery
More informationHYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan
More informationOnline Learning in Biometrics: A Case Study in Face Classifier Update
Online Learning in Biometrics: A Case Study in Face Classifier Update Richa Singh, Mayank Vatsa, Arun Ross, and Afzel Noore Abstract In large scale applications, hundreds of new subjects may be regularly
More informationWE DEFINE spam as an e-mail message that is unwanted basically
1048 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 10, NO. 5, SEPTEMBER 1999 Support Vector Machines for Spam Categorization Harris Drucker, Senior Member, IEEE, Donghui Wu, Student Member, IEEE, and Vladimir
More informationAn Introduction to Machine Learning
An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,
More informationSupport Vector Machines
CS229 Lecture notes Andrew Ng Part V Support Vector Machines This set of notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe are indeed the best)
More informationMUSICAL INSTRUMENT FAMILY CLASSIFICATION
MUSICAL INSTRUMENT FAMILY CLASSIFICATION Ricardo A. Garcia Media Lab, Massachusetts Institute of Technology 0 Ames Street Room E5-40, Cambridge, MA 039 USA PH: 67-53-0 FAX: 67-58-664 e-mail: rago @ media.
More informationLinear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S
Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard
More informationMathematical finance and linear programming (optimization)
Mathematical finance and linear programming (optimization) Geir Dahl September 15, 2009 1 Introduction The purpose of this short note is to explain how linear programming (LP) (=linear optimization) may
More informationA Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM
Journal of Computational Information Systems 10: 17 (2014) 7629 7635 Available at http://www.jofcis.com A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM Tian
More informationE-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce
More informationSAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 3. Symmetrical Components & Faults Calculations
SAMPLE OF THE STUDY MATERIAL PART OF CHAPTER 3 3.0 Introduction Fortescue's work proves that an unbalanced system of 'n' related phasors can be resolved into 'n' systems of balanced phasors called the
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationWalrasian Demand. u(x) where B(p, w) = {x R n + : p x w}.
Walrasian Demand Econ 2100 Fall 2015 Lecture 5, September 16 Outline 1 Walrasian Demand 2 Properties of Walrasian Demand 3 An Optimization Recipe 4 First and Second Order Conditions Definition Walrasian
More informationWhat is Linear Programming?
Chapter 1 What is Linear Programming? An optimization problem usually has three essential ingredients: a variable vector x consisting of a set of unknowns to be determined, an objective function of x to
More informationSURVIVABILITY OF COMPLEX SYSTEM SUPPORT VECTOR MACHINE BASED APPROACH
1 SURVIVABILITY OF COMPLEX SYSTEM SUPPORT VECTOR MACHINE BASED APPROACH Y, HONG, N. GAUTAM, S. R. T. KUMARA, A. SURANA, H. GUPTA, S. LEE, V. NARAYANAN, H. THADAKAMALLA The Dept. of Industrial Engineering,
More informationOnline (and Offline) on an Even Tighter Budget
Online (and Offline) on an Even Tighter Budget Jason Weston NEC Laboratories America, Princeton, NJ, USA jasonw@nec-labs.com Antoine Bordes NEC Laboratories America, Princeton, NJ, USA antoine@nec-labs.com
More informationA Study on SMO-type Decomposition Methods for Support Vector Machines
1 A Study on SMO-type Decomposition Methods for Support Vector Machines Pai-Hsuen Chen, Rong-En Fan, and Chih-Jen Lin Department of Computer Science, National Taiwan University, Taipei 106, Taiwan cjlin@csie.ntu.edu.tw
More information1816 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 7, JULY 2006. Principal Components Null Space Analysis for Image and Video Classification
1816 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 7, JULY 2006 Principal Components Null Space Analysis for Image and Video Classification Namrata Vaswani, Member, IEEE, and Rama Chellappa, Fellow,
More informationCS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing
CS Master Level Courses and Areas The graduate courses offered may change over time, in response to new developments in computer science and the interests of faculty and students; the list of graduate
More informationIntrusion Detection via Machine Learning for SCADA System Protection
Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department
More informationLecture 2: August 29. Linear Programming (part I)
10-725: Convex Optimization Fall 2013 Lecture 2: August 29 Lecturer: Barnabás Póczos Scribes: Samrachana Adhikari, Mattia Ciollaro, Fabrizio Lecci Note: LaTeX template courtesy of UC Berkeley EECS dept.
More informationA Tutorial on Support Vector Machines for Pattern Recognition
c,, 1 43 () Kluwer Academic Publishers, Boston. Manufactured in The Netherlands. A Tutorial on Support Vector Machines for Pattern Recognition CHRISTOPHER J.C. BURGES Bell Laboratories, Lucent Technologies
More informationlarge-scale machine learning revisited Léon Bottou Microsoft Research (NYC)
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven
More informationA Learning Algorithm For Neural Network Ensembles
A Learning Algorithm For Neural Network Ensembles H. D. Navone, P. M. Granitto, P. F. Verdes and H. A. Ceccatto Instituto de Física Rosario (CONICET-UNR) Blvd. 27 de Febrero 210 Bis, 2000 Rosario. República
More informationNumerisches Rechnen. (für Informatiker) M. Grepl J. Berger & J.T. Frings. Institut für Geometrie und Praktische Mathematik RWTH Aachen
(für Informatiker) M. Grepl J. Berger & J.T. Frings Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2010/11 Problem Statement Unconstrained Optimality Conditions Constrained
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationPrincipal components analysis
CS229 Lecture notes Andrew Ng Part XI Principal components analysis In our discussion of factor analysis, we gave a way to model data x R n as approximately lying in some k-dimension subspace, where k
More informationCHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES
CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES Claus Gwiggner, Ecole Polytechnique, LIX, Palaiseau, France Gert Lanckriet, University of Berkeley, EECS,
More informationSubspace Analysis and Optimization for AAM Based Face Alignment
Subspace Analysis and Optimization for AAM Based Face Alignment Ming Zhao Chun Chen College of Computer Science Zhejiang University Hangzhou, 310027, P.R.China zhaoming1999@zju.edu.cn Stan Z. Li Microsoft
More informationFeature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier
Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,
More informationOnline Classification on a Budget
Online Classification on a Budget Koby Crammer Computer Sci. & Eng. Hebrew University Jerusalem 91904, Israel kobics@cs.huji.ac.il Jaz Kandola Royal Holloway, University of London Egham, UK jaz@cs.rhul.ac.uk
More informationDate: April 12, 2001. Contents
2 Lagrange Multipliers Date: April 12, 2001 Contents 2.1. Introduction to Lagrange Multipliers......... p. 2 2.2. Enhanced Fritz John Optimality Conditions...... p. 12 2.3. Informative Lagrange Multipliers...........
More informationAn Overview Of Software For Convex Optimization. Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.
An Overview Of Software For Convex Optimization Brian Borchers Department of Mathematics New Mexico Tech Socorro, NM 87801 borchers@nmt.edu In fact, the great watershed in optimization isn t between linearity
More informationA Learning Based Method for Super-Resolution of Low Resolution Images
A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method
More informationData clustering optimization with visualization
Page 1 Data clustering optimization with visualization Fabien Guillaume MASTER THESIS IN SOFTWARE ENGINEERING DEPARTMENT OF INFORMATICS UNIVERSITY OF BERGEN NORWAY DEPARTMENT OF COMPUTER ENGINEERING BERGEN
More informationElectroencephalography Analysis Using Neural Network and Support Vector Machine during Sleep
Engineering, 23, 5, 88-92 doi:.4236/eng.23.55b8 Published Online May 23 (http://www.scirp.org/journal/eng) Electroencephalography Analysis Using Neural Network and Support Vector Machine during Sleep JeeEun
More informationMaking Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research
More informationClassification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
More informationFleet Assignment Using Collective Intelligence
Fleet Assignment Using Collective Intelligence Nicolas E Antoine, Stefan R Bieniawski, and Ilan M Kroo Stanford University, Stanford, CA 94305 David H Wolpert NASA Ames Research Center, Moffett Field,
More informationK-Means Clustering Tutorial
K-Means Clustering Tutorial By Kardi Teknomo,PhD Preferable reference for this tutorial is Teknomo, Kardi. K-Means Clustering Tutorials. http:\\people.revoledu.com\kardi\ tutorial\kmean\ Last Update: July
More informationThis unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.
Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course
More informationMaximum Margin Clustering
Maximum Margin Clustering Linli Xu James Neufeld Bryce Larson Dale Schuurmans University of Waterloo University of Alberta Abstract We propose a new method for clustering based on finding maximum margin
More informationRecovery of primal solutions from dual subgradient methods for mixed binary linear programming; a branch-and-bound approach
MASTER S THESIS Recovery of primal solutions from dual subgradient methods for mixed binary linear programming; a branch-and-bound approach PAULINE ALDENVIK MIRJAM SCHIERSCHER Department of Mathematical
More informationApplication of Support Vector Machines to Fault Diagnosis and Automated Repair
Application of Support Vector Machines to Fault Diagnosis and Automated Repair C. Saunders and A. Gammerman Royal Holloway, University of London, Egham, Surrey, England {C.Saunders,A.Gammerman}@dcs.rhbnc.ac.uk
More informationA Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster
Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906
More informationNonlinear Optimization: Algorithms 3: Interior-point methods
Nonlinear Optimization: Algorithms 3: Interior-point methods INSEAD, Spring 2006 Jean-Philippe Vert Ecole des Mines de Paris Jean-Philippe.Vert@mines.org Nonlinear optimization c 2006 Jean-Philippe Vert,
More informationInternational Journal of Software and Web Sciences (IJSWS) www.iasir.net
International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research) ISSN (Print): 2279-0063 ISSN (Online): 2279-0071 International
More informationClass-specific Sparse Coding for Learning of Object Representations
Class-specific Sparse Coding for Learning of Object Representations Stephan Hasler, Heiko Wersing, and Edgar Körner Honda Research Institute Europe GmbH Carl-Legien-Str. 30, 63073 Offenbach am Main, Germany
More informationSVM Based License Plate Recognition System
SVM Based License Plate Recognition System Kumar Parasuraman, Member IEEE and Subin P.S Abstract In this paper, we review the use of support vector machine concept in license plate recognition. Support
More information5.1 Bipartite Matching
CS787: Advanced Algorithms Lecture 5: Applications of Network Flow In the last lecture, we looked at the problem of finding the maximum flow in a graph, and how it can be efficiently solved using the Ford-Fulkerson
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More information