Support Vector Machines
Max Welling
Department of Computer Science
University of Toronto
10 King's College Road, Toronto, M5S 3G5 Canada
welling@cs.toronto.edu

Abstract

This is a note to explain support vector machines.

1 Preliminaries

Our task is to predict whether a test sample belongs to one of two classes. We receive training examples of the form: {x_i, y_i}, i = 1, ..., n, with x_i ∈ R^d and y_i ∈ {−1, +1}. We call {x_i} the covariates or input vectors and {y_i} the response variables or labels.

We consider a very simple example where the data are in fact linearly separable: i.e., we can draw a straight line f(x) = w^T x − b such that all cases with y_i = −1 fall on one side and have f(x_i) < 0, while cases with y_i = +1 fall on the other side and have f(x_i) > 0. Given that we have achieved that, we could classify new test cases according to the rule y_test = sign(f(x_test)).

However, typically there are infinitely many such hyperplanes, obtained by small perturbations of a given solution. How do we choose between all these hyperplanes, which all solve the separation problem for our training data but may have different performance on newly arriving test cases? For instance, we could choose to put the line very close to members of one particular class, say y = −1. Intuitively, when test cases arrive we will not make many mistakes on cases that should be classified with y = +1, but we will very easily make mistakes on the cases with y = −1 (for instance, imagine that a new batch of test cases arrives which are small perturbations of the training data). A sensible choice thus seems to be to place the separating line as far away from both the y = −1 and the y = +1 training cases as we can, i.e., right in the middle.

Geometrically, the vector w is directed orthogonal to the line defined by w^T x = b. This can be understood as follows. First take b = 0. Now it is clear that all vectors x with vanishing inner product with w satisfy this equation, i.e., all vectors orthogonal to w satisfy it. Now translate the hyperplane away from the origin over a vector a. The equation for the plane then becomes (x − a)^T w = 0, i.e.,
we find that the offset is b = a^T w, which is the projection of a onto the vector w. Without loss of generality we may thus choose a perpendicular to the plane, in which case the length ‖a‖ = |b|/‖w‖ represents the shortest, orthogonal distance between the origin and the hyperplane.

We now define two more hyperplanes parallel to the separating hyperplane. They represent the planes that cut through the closest training examples on either side. We will call them
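As a quick numerical sanity check of this geometry (a sketch, not part of the note; the particular w and b are illustrative assumptions), the point a = (b/‖w‖²) w lies on the plane w^T x = b and its length equals |b|/‖w‖:

```python
import numpy as np

# Sketch: verify that for the hyperplane w^T x = b, the perpendicular foot
# point a = (b / ||w||^2) * w lies on the plane, and that its length
# |b| / ||w|| is the shortest distance from the origin to the plane.
w = np.array([3.0, 4.0])   # example normal vector (assumption)
b = 10.0                   # example offset (assumption)

a = (b / np.dot(w, w)) * w           # perpendicular choice of a
assert np.isclose(np.dot(w, a), b)   # a satisfies the plane equation

dist = np.linalg.norm(a)             # ||a|| = |b| / ||w|| = 10 / 5 = 2
print(round(float(dist), 6))
```

Here ‖w‖ = 5, so the distance comes out as 2, matching |b|/‖w‖.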
support hyperplanes in the following, because the data-vectors they contain support the plane. We define the distances between these support hyperplanes and the separating hyperplane to be d_+ and d_−, respectively. The margin, γ, is defined to be d_+ + d_−. Our goal is now to find the separating hyperplane so that the margin is largest, while the separating hyperplane is equidistant from both support hyperplanes. We can write the following equations for the support hyperplanes:

    w^T x = b + δ    (1)
    w^T x = b − δ    (2)

We now note that we have over-parameterized the problem: if we scale w, b and δ by a constant factor α, the equations for x are still satisfied. To remove this ambiguity we will require that δ = 1; this sets the scale of the problem, i.e., whether we measure distance in millimeters or meters. We can now also compute the value d_+ = (|b + 1| − |b|)/‖w‖ = 1/‖w‖ (this is only true if b ∉ (−1, 0), since in that case the origin does not fall in between the hyperplanes; if b ∈ (−1, 0) you should use d_+ = (|b + 1| + |b|)/‖w‖ = 1/‖w‖). Hence the margin is equal to twice that value: γ = 2/‖w‖.

With the above definition of the support planes we can write down the following constraints that any solution must satisfy:

    w^T x_i − b ≤ −1   for y_i = −1    (3)
    w^T x_i − b ≥ +1   for y_i = +1    (4)

or, in one equation,

    y_i (w^T x_i − b) − 1 ≥ 0    (5)

We now formulate the primal problem of the SVM:

    minimize   (1/2) ‖w‖²
    subject to y_i (w^T x_i − b) − 1 ≥ 0    (6)

Thus, we maximize the margin, subject to the constraints that all training cases fall on either side of the support hyperplanes. The data-cases that lie on a support hyperplane are called support vectors, since they support the hyperplanes and hence determine the solution to the problem.

The primal problem can be solved by a quadratic program. However, it is not ready to be kernelised, because its dependence is not only on inner products between data-vectors. Hence, we transform to the dual formulation by first writing the problem using a Lagrangian:

    L(w, b, α) = (1/2) ‖w‖² − Σ_{i=1}^N α_i [ y_i (w^T x_i − b) − 1 ]    (7)

The solution that minimizes the primal problem subject to the constraints is given by min_{w,b} max_{α≥0} L(w, b, α), i.e.,
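To make the primal constraints and the margin concrete, here is a small sketch (the toy data and the candidate w, b are illustrative assumptions, not an optimal solution): we check feasibility y_i (w^T x_i − b) ≥ 1 and evaluate γ = 2/‖w‖.

```python
import numpy as np

# Sketch: check the primal constraints y_i (w^T x_i - b) >= 1 for a
# candidate (w, b), and report the margin gamma = 2 / ||w||.
X = np.array([[2.0, 2.0], [3.0, 3.0],    # class +1 (assumed toy data)
              [0.0, 0.0], [-1.0, 0.0]])  # class -1
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.array([1.0, 1.0])   # candidate separating direction (assumption)
b = 2.0                    # candidate offset (assumption)

margins = y * (X @ w - b)          # constraint values, one per training case
assert np.all(margins >= 1.0)      # all cases outside the support planes

gamma = 2.0 / np.linalg.norm(w)    # the margin: gamma = 2 / ||w||
print(round(float(gamma), 4))
```

Note this only verifies feasibility; the SVM picks, among all feasible (w, b), the one with the largest γ.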
a saddle point problem. When the original objective function is convex (and only then), we can interchange the minimization and maximization. Doing that, we find the conditions on w and b that must hold at the saddle point we are solving for. These are obtained by taking derivatives with respect to w and b and setting them to zero:

    ∂L/∂w = 0  ⇒  w − Σ_i α_i y_i x_i = 0  ⇒  w = Σ_i α_i y_i x_i    (8)
    ∂L/∂b = 0  ⇒  Σ_i α_i y_i = 0    (9)
Inserting this back into the Lagrangian we obtain what is known as the dual problem:

    maximize   L_D = Σ_{i=1}^N α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
    subject to Σ_i α_i y_i = 0    (10)
               α_i ≥ 0    (11)

The dual formulation of the problem is also a quadratic program, but note that the number of variables α_i in this problem is equal to the number of data-cases, N. The crucial point is, however, that this problem depends on the x_i only through the inner products x_i^T x_j. These are readily kernelised through the substitution x_i^T x_j → k(x_i, x_j). This is a recurrent theme: the dual problem lends itself to kernelisation, while the primal problem did not.

The theory of duality guarantees that for convex problems the dual problem will be concave, and moreover, that the unique solution of the primal problem corresponds to the unique solution of the dual problem. In fact, we have L_P(w*) = L_D(α*), i.e., the duality gap is zero.

Next we turn to the conditions that must necessarily hold at the saddle point and thus at the solution of the problem. These are called the KKT conditions (which stands for Karush-Kuhn-Tucker). These conditions are necessary in general, and sufficient for convex optimization problems. They can be derived from the primal problem by setting the derivatives with respect to w and b to zero. Also, the constraints themselves are part of these conditions, and we need that for inequality constraints the Lagrange multipliers are non-negative. Finally, an important condition called complementary slackness needs to be satisfied:

    ∂_w L_P = 0:             w − Σ_i α_i y_i x_i = 0    (12)
    ∂_b L_P = 0:             Σ_i α_i y_i = 0    (13)
    constraint-1:            y_i (w^T x_i − b) − 1 ≥ 0    (14)
    multiplier condition:    α_i ≥ 0    (15)
    complementary slackness: α_i [ y_i (w^T x_i − b) − 1 ] = 0    (16)

It is the last equation which may be somewhat surprising. It states that either the inequality constraint is satisfied but not saturated, y_i (w^T x_i − b) − 1 > 0, in which case α_i for that data-case must be zero; or the inequality constraint is saturated, y_i (w^T x_i − b) − 1 = 0, in which case α_i can take any value α_i ≥ 0. Inequality constraints which are saturated are said to be active, while unsaturated constraints are inactive.
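The dual above is small enough to solve numerically. The following sketch (not the note's algorithm; the two-point data set, step size, and iteration count are illustrative assumptions) maximizes L_D by projected gradient ascent, alternately projecting onto Σ_i α_i y_i = 0 and α_i ≥ 0, then recovers w via the stationarity condition (8):

```python
import numpy as np

# Sketch: projected gradient ascent on the dual L_D for a 2-point toy set.
X = np.array([[2.0], [-2.0]])     # one point per class, 4 units apart
y = np.array([1.0, -1.0])

Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j x_i^T x_j
alpha = np.zeros(2)
for _ in range(2000):
    alpha += 0.01 * (1.0 - Q @ alpha)       # gradient of L_D is 1 - Q alpha
    alpha -= (y @ alpha) / (y @ y) * y      # project onto sum_i alpha_i y_i = 0
    alpha = np.maximum(alpha, 0.0)          # project onto alpha_i >= 0

w = (alpha * y) @ X                          # stationarity: w = sum alpha_i y_i x_i
print(np.round(alpha, 3), np.round(w, 3))    # alphas -> 0.125 each, w -> [0.5]
```

Both training cases end up with α_i > 0, i.e., both are support vectors, and the resulting margin 2/‖w‖ = 4 is exactly the gap between the two points.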
One could imagine the process of searching for a solution as a ball rolling down the primal objective function using gradient descent. At some point it will hit a wall, which is a constraint, and although the derivative still points partially towards the wall, the constraint prohibits the ball from going on. This is an active constraint, because the ball is glued to that wall. When the final solution is reached, we could remove some constraints without changing the solution; those are the inactive constraints. One can think of the term ∂_w L_P as the force acting on the ball. We see from equation (12) above that only the data-cases with α_i ≠ 0 exert a force on the ball, balancing the force w from the curved quadratic surface (1/2)‖w‖². The training cases with α_i > 0, representing active constraints on the position of the support hyperplane, are called support vectors. These are the vectors that are situated in the support hyperplanes, and they determine the solution. Typically there are only a few of them, which people call a sparse solution (most α_i's vanish).
What we are really interested in is the function f(·) which can be used to classify future test cases:

    f(x) = w^T x − b = Σ_i α_i y_i x_i^T x − b    (17)

As an application of the KKT conditions we derive a solution for b by using the complementary slackness condition (16): for a support vector i we have y_i (w^T x_i − b) = 1, and multiplying through by y_i gives

    b = Σ_j α_j y_j x_j^T x_i − y_i,   i a support vector    (18)

where we used y_i² = 1. So, using any support vector one can determine b, but for numerical stability it is better to average over all of them (although they should obviously be consistent). The most important conclusion is again that this function f(·) can be expressed solely in terms of inner products x_i^T x, which we can replace with kernel evaluations k(x_i, x) to move to high-dimensional non-linear spaces. Moreover, since α is typically very sparse, we don't need to evaluate many kernel entries in order to predict the class of a new input x.

2 The Non-Separable case

Obviously, not all datasets are linearly separable, so we need to change the formalism to account for that. Clearly, the problem lies in the constraints, which cannot always be satisfied. So, let's relax those constraints by introducing slack variables ξ_i:

    w^T x_i − b ≤ −1 + ξ_i   for y_i = −1    (19)
    w^T x_i − b ≥ +1 − ξ_i   for y_i = +1    (20)
    ξ_i ≥ 0    (21)

The variables ξ_i allow for violations of the constraints. We should penalize the objective function for these violations, otherwise the above constraints become void (simply always pick ξ_i very large). Penalty functions of the form C (Σ_i ξ_i)^k will lead to convex optimization problems for positive integers k. For k = 1, 2 it is still a quadratic program (QP). In the following we will choose k = 1. C controls the tradeoff between the penalty and the margin.

To be on the wrong side of the separating hyperplane, a data-case would need ξ_i > 1. Hence, the sum Σ_i ξ_i can be interpreted as a measure of how bad the violations are, and is an upper bound on the number of violations. The new primal problem thus becomes:

    minimize   L_P = (1/2) ‖w‖² + C Σ_i ξ_i
    subject to y_i (w^T x_i − b) − 1 + ξ_i ≥ 0    (22)
               ξ_i ≥ 0    (23)

leading to the Lagrangian

    L(w, b, ξ, α, µ) = (1/2) ‖w‖² + C Σ_i ξ_i − Σ_{i=1}^N α_i [ y_i (w^T x_i − b) − 1 + ξ_i ] − Σ_{i=1}^N µ_i ξ_i    (24)
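Equations (17) and (18) are easy to exercise in code. This sketch (the toy data and the dual solution α are illustrative assumptions, carried over from a hand-solvable two-point problem) recovers b from the support vectors and classifies new inputs:

```python
import numpy as np

# Sketch: recover b via Eq. (18), averaged over support vectors, then
# classify with f(x) = sum_j alpha_j y_j x_j^T x - b, Eq. (17).
X = np.array([[2.0], [-2.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.125, 0.125])   # assumed dual solution for this toy set

sv = np.where(alpha > 1e-8)[0]     # support vector indices (alpha_i > 0)
# b = sum_j alpha_j y_j x_j^T x_i - y_i for each support vector i, averaged
b = np.mean([(alpha * y) @ (X @ X[i]) - y[i] for i in sv])

def f(x):
    """Decision function f(x) = sum_j alpha_j y_j x_j^T x - b."""
    return (alpha * y) @ (X @ x) - b

print(np.sign(f(np.array([3.0]))), np.sign(f(np.array([-0.5]))))  # 1.0 -1.0
```

Only the inner products X @ x appear, so swapping in a kernel function k(x_j, x) here is all that is needed to classify in a non-linear feature space.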
from which we derive the KKT conditions:

    1. ∂_w L_P = 0:               w − Σ_i α_i y_i x_i = 0    (25)
    2. ∂_b L_P = 0:               Σ_i α_i y_i = 0    (26)
    3. ∂_{ξ_i} L_P = 0:           C − α_i − µ_i = 0    (27)
    4. constraint-1:              y_i (w^T x_i − b) − 1 + ξ_i ≥ 0    (28)
    5. constraint-2:              ξ_i ≥ 0    (29)
    6. multiplier condition-1:    α_i ≥ 0    (30)
    7. multiplier condition-2:    µ_i ≥ 0    (31)
    8. complementary slackness-1: α_i [ y_i (w^T x_i − b) − 1 + ξ_i ] = 0    (32)
    9. complementary slackness-2: µ_i ξ_i = 0    (33)

From here we can deduce the following facts. If we assume that ξ_i > 0, then µ_i = 0 (condition 9), hence α_i = C (condition 3), and thus ξ_i = 1 − y_i (x_i^T w − b) (condition 8). Also, when ξ_i = 0 we have µ_i > 0 (condition 9) and hence α_i < C (condition 3). If in addition to ξ_i = 0 we also have y_i (w^T x_i − b) − 1 = 0, then α_i > 0 (condition 8); otherwise, if y_i (w^T x_i − b) − 1 > 0, then α_i = 0.

In summary: as before, for points not on a support plane and on the correct side we have ξ_i = α_i = 0 (all constraints inactive). On the support plane we still have ξ_i = 0, but now α_i > 0. Finally, for data-cases on the wrong side of their support hyperplane, the α_i max out at α_i = C and the ξ_i balance the violation of the constraint such that y_i (w^T x_i − b) − 1 + ξ_i = 0.

Geometrically, we can calculate the gap between a support hyperplane and a violating data-case to be ξ_i/‖w‖. This can be seen because the plane defined by y_i (w^T x_i − b) − 1 + ξ_i = 0 is parallel to the support plane, at a distance |1 + y_i b − ξ_i|/‖w‖ from the origin. Since the support plane is at a distance |1 + y_i b|/‖w‖, the result follows.

Finally, we need to convert to the dual problem to solve it efficiently and to kernelise it. Again, we use the KKT equations to get rid of w, b and ξ:

    maximize   L_D = Σ_{i=1}^N α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i^T x_j
    subject to Σ_i α_i y_i = 0    (35)
               0 ≤ α_i ≤ C    (36)

Surprisingly, this is almost the same QP as before, but with an extra constraint on the multipliers α_i, which now live in a box. This constraint is derived from the fact that α_i = C − µ_i and µ_i ≥ 0. We also note that the dual again depends only on inner products x_i^T x_j, which are ready to be kernelised.
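The slack variables have a closed form once (w, b) is fixed: at the optimum over ξ, condition (32) gives ξ_i = max(0, 1 − y_i (w^T x_i − b)). The following sketch evaluates the soft-margin primal objective for a candidate (w, b) on a non-separable toy set (the data and the candidate (w, b) are illustrative assumptions, not an optimal solution):

```python
import numpy as np

# Sketch: compute the slacks xi_i = max(0, 1 - y_i (w^T x_i - b)) and the
# soft-margin primal objective L_P = 1/2 ||w||^2 + C sum_i xi_i.
# A case is on the wrong side of the separating hyperplane iff xi_i > 1.
X = np.array([[2.0], [-2.0], [-0.5]])
y = np.array([1.0, -1.0, 1.0])      # third case sits among the -1 class
w = np.array([0.5])                  # candidate solution (assumption)
b = 0.0
C = 1.0

xi = np.maximum(0.0, 1.0 - y * (X @ w - b))   # slack variables
primal = 0.5 * (w @ w) + C * np.sum(xi)       # penalized objective

print(np.round(xi, 3))                # only the outlier needs slack (1.25 > 1)
print(round(float(primal), 3))        # 0.125 + 1.25 = 1.375
```

Here ξ_3 = 1.25 > 1 confirms that the third case is misclassified by this (w, b), and its distance past its support plane is ξ_3/‖w‖ = 2.5, as the geometric argument above predicts.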