An Idiot's Guide to Support Vector Machines (SVMs)
1 An Idiot's Guide to Support Vector Machines (SVMs)
R. Berwick, Village Idiot

SVMs: A New Generation of Learning Algorithms
Pre-1980: Almost all learning methods learned linear decision surfaces. Linear learning methods have nice theoretical properties.
1980's: Decision trees and NNs allowed efficient learning of non-linear decision surfaces. Little theoretical basis, and all suffer from local minima.
1990's: Efficient learning algorithms for non-linear functions based on computational learning theory developed. Nice theoretical properties.
2 Key Ideas
Two independent developments within the last decade:
- New efficient separability of non-linear regions that use "kernel functions": a generalization of 'similarity' to new kinds of similarity measures based on dot products
- Use of a quadratic optimization problem to avoid the 'local minimum' issues that plague neural nets. The resulting learning algorithm is an optimization algorithm rather than a greedy search.

Organization
- Basic idea of support vector machines: just like 1-layer or multi-layer neural nets
- Optimal hyperplane for linearly separable patterns
- Extend to patterns that are not linearly separable by transformations of the original data that map it into a new space: the kernel function
- SVM algorithm for pattern recognition
3 Support Vectors
Support vectors are the data points that lie closest to the decision surface (or hyperplane). They are the data points most difficult to classify, and they have direct bearing on the optimum location of the decision surface. We can show that the optimal hyperplane stems from the function class with the lowest "capacity" (= the number of independent features/parameters we can twiddle). [Note: this is extra material not covered in the lectures; you don't have to know this.]

Recall from 1-layer nets: Which Separating Hyperplane?
In general, there are lots of possible solutions for a, b, c (an infinite number!). The Support Vector Machine (SVM) finds an optimal solution.
4 Support Vector Machine (SVM)
SVMs maximize the margin (Winston terminology: the 'street') around the separating hyperplane. The decision function is fully specified by a (usually very small) subset of training samples, the support vectors. This becomes a quadratic programming problem that is easy to solve by standard methods.

Separation by Hyperplanes
Assume linear separability for now (we will relax this later). In 2 dimensions we can separate by a line; in higher dimensions we need hyperplanes.
5 General input/output for SVMs
Just like for neural nets, but with one important addition.
Input: a set of (input, output) training pair samples; call the input sample features x1, x2 ... xn, and the output result y. Typically, there can be lots of input features xi.
Output: a set of weights w (or wi), one for each feature, whose linear combination predicts the value of y. (So far, just like neural nets.)
Important difference: we use the optimization of maximizing the margin ('street width') to reduce the number of weights that are nonzero to just a few that correspond to the important features that 'matter' in deciding the separating line (hyperplane); these nonzero weights correspond to the support vectors (because they 'support' the separating hyperplane).

2-D Case
Find a, b, c such that
ax + by ≥ c for red points
ax + by < c (or ≤ c) for green points.
6 Which Hyperplane to pick?
Lots of possible solutions for a, b, c. Some methods find a separating hyperplane, but not the optimal one (e.g., neural nets). But: which points should influence optimality?
- All points? Linear regression; neural nets.
- Or only the 'difficult points' close to the decision boundary? Support vector machines.

Support Vectors again, for the linearly separable case
Support vectors are the elements of the training set that would change the position of the dividing hyperplane if removed. Support vectors are the critical elements of the training set. The problem of finding the optimal hyperplane is an optimization problem and can be solved by optimization techniques (we use Lagrange multipliers to get the problem into a form that can be solved analytically).
7 Support Vectors
Input vectors that just touch the boundary of the margin (street), circled in the figure; there are 3 of them (or, rather, the 'tips' of the vectors):
w0·x + b0 = 1 or w0·x + b0 = -1
Here we have shown the actual support vectors, v1, v2, v3, instead of just the 3 circled points at the tail ends of the support vectors. d denotes 1/2 of the street width.
8 Definitions
Define the hyperplanes H such that:
w·xi + b ≥ +1 when yi = +1
w·xi + b ≤ -1 when yi = -1
H1 and H2 are the planes:
H1: w·xi + b = +1
H2: w·xi + b = -1
The points on the planes H1 and H2 are the tips of the support vectors. The plane H0 is the median in between, where w·xi + b = 0.
d+ = the shortest distance to the closest positive point
d- = the shortest distance to the closest negative point
The margin (gutter) of a separating hyperplane is d+ + d-.
Moving a support vector moves the decision boundary; moving the other vectors has no effect. The optimization algorithm to generate the weights proceeds in such a way that only the support vectors determine the weights and thus the boundary.
9 Defining the separating hyperplane
The form of the equation defining the decision surface separating the classes is a hyperplane:
w·x + b = 0
- w is a weight vector
- x is the input vector
- b is the bias
This allows us to write:
w·x + b ≥ 0 for di = +1
w·x + b < 0 for di = -1

Some final definitions
Margin of Separation (d): the separation between the hyperplane and the closest data point for a given weight vector w and bias b.
Optimal Hyperplane (maximal margin): the particular hyperplane for which the margin of separation d is maximized.
10 Maximizing the margin (aka street width)
We want a classifier (linear separator) with as big a margin as possible.
Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A² + B²). So the distance between H0 and H1 is |w·x + b| / ||w|| = 1 / ||w||, and the total distance between H1 and H2 is thus 2 / ||w||.
In order to maximize the margin, we thus need to minimize ||w||, with the condition that there are no data points between H1 and H2:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ -1 when yi = -1
These can be combined into: yi(xi·w + b) ≥ 1.
We now must solve a quadratic programming problem. The problem is: minimize ||w|| subject to the discrimination boundaries being obeyed, i.e., min f(x) s.t. g(x) = 0, which we can rewrite as:
min f: ½||w||² (note this is a quadratic function)
s.t. g: yi(w·xi + b) = 1, i.e., [yi(w·xi + b)] - 1 = 0
This is a constrained optimization problem. It can be solved by the Lagrange multiplier method. Because it is quadratic, the surface is a paraboloid with just a single global minimum (thus avoiding a problem we had with neural nets!).
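The distance and margin formulas above are easy to check numerically. A minimal sketch (the weight vector w = (1, 1), b = 0 and the test point are made up for illustration):

```python
import numpy as np

# Hypothetical separating hyperplane w·x + b = 0 with w = (1, 1), b = 0.
w = np.array([1.0, 1.0])
b = 0.0

def distance_to_hyperplane(x, w, b):
    """Distance from point x to the hyperplane w·x + b = 0:
    |w·x + b| / ||w||  (the |Ax0 + By0 + c| / sqrt(A^2 + B^2) formula)."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# A point on the plane H1: w·x + b = +1, e.g. x = (0.5, 0.5).
x_on_H1 = np.array([0.5, 0.5])
d_plus = distance_to_hyperplane(x_on_H1, w, b)

# The total street width between H1 and H2 should be 2 / ||w||.
margin = 2.0 / np.linalg.norm(w)
assert np.isclose(2 * d_plus, margin)   # d_plus is half the street width
print(d_plus, margin)
```

Shrinking ||w|| widens the street, which is exactly why the optimization minimizes ½||w||².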
11 Example: paraboloid f: 2 + x² + 2y² s.t. g: x + y = 1
Intuition: find the intersection of the two functions f, g at a tangent point (intersection = both constraints satisfied; tangent = derivative is 0); this will be a min (or max) for f s.t. the constraint g is satisfied.
Flattened paraboloid f: 2 + x² + 2y² with the superimposed constraint g: x + y = 1. The minimum occurs where the constraint line g (shown in green) is tangent to the inner ellipse contour lines of f (shown in red); note the direction of the gradient arrows.
12 Flattened paraboloid f: 2 + x² + 2y² with the superimposed constraint g: x + y = 1; at the tangent solution p, the gradient vectors of f and g are parallel (there is no possible move to improve f that also keeps you in the region g). The minimum occurs where the constraint line g is tangent to the inner ellipse contour line of f.
Two constraints:
1. Parallel normal constraint (= gradient constraint on f, g s.t. the solution is a max, or a min)
2. g(x) = 0 (the solution is on the constraint line as well)
We now recast these by combining f, g into the new Lagrangian function, introducing new 'slack' variables denoted a (or, more usually, denoted α in the literature).
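The paraboloid example can be worked through directly from the tangency conditions. A small pure-Python sketch: the gradient condition 2x = λ = 4y gives x = 2y, and with x + y = 1 this yields (x, y) = (2/3, 1/3).

```python
# Lagrange conditions for f(x, y) = 2 + x^2 + 2y^2  s.t.  g: x + y - 1 = 0:
#   df/dx = 2x = lam,  df/dy = 4y = lam   (gradient of f parallel to gradient of g)
#   x + y = 1                             (stay on the constraint line)
# From 2x = 4y we get x = 2y, and with x + y = 1:  y = 1/3, x = 2/3.
x, y = 2/3, 1/3
lam = 2 * x

def f(px, py):
    return 2 + px**2 + 2 * py**2

# Check the gradient condition and the constraint.
assert abs(2*x - lam) < 1e-12 and abs(4*y - lam) < 1e-12
assert abs(x + y - 1) < 1e-12

# Check it is a minimum along the line: parameterize the line as (t, 1 - t)
# and compare f at the solution against nearby points on the line.
for t in (x - 0.1, x + 0.1):
    assert f(x, y) < f(t, 1 - t)
print(f(x, y))  # 8/3
```

The same tangency logic, with f = ½||w||² and g the margin constraints, is what the SVM Lagrangian below encodes.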
13 Redescribing these conditions
We want to look for a solution point p where
∇f(p) = λ∇g(p)
g(x) = 0
Or, combining these two as the Lagrangian L and requiring that the derivative of L be zero:
L(x, a) = f(x) - a g(x)
∇L(x, a) = 0
At a solution p:
- The constraint line g and the contour lines of f must be tangent.
- If they are tangent, their gradient vectors (perpendiculars) are parallel.
- The gradient of g points in the direction of steepest ascent, and so is perpendicular to the constraint line.
- The gradient of f must be in the same direction as the gradient of g.
14 How the Lagrangian solves constrained optimization
L(x, a) = f(x) - a g(x), where ∇L(x, a) = 0
Partial derivatives wrt x recover the parallel normal constraint; the partial derivative wrt a recovers g(x) = 0.
In general,
L(x, a) = f(x) + Σi ai gi(x)
a function of n + m variables: n for the x's, m for the a's. Differentiating gives n + m equations, each set to 0. The n equations differentiated wrt each xi give the gradient conditions; the m equations differentiated wrt each ai recover the constraints gi.
In our case, f(x) is ½||w||² and g(x) is yi(w·xi + b) - 1 = 0, so the Lagrangian is:
min L = ½||w||² - Σ ai [yi(w·xi + b) - 1] wrt w, b
Expanding the last term, we get the following form of L:
min L = ½||w||² - Σ ai yi(w·xi + b) + Σ ai wrt w, b
15 Lagrangian Formulation
So in the SVM problem the Lagrangian is
min LP = ½||w||² - Σi ai yi(xi·w + b) + Σi ai
s.t. ∀i, ai ≥ 0, where l is the number of training points.
From the property that the derivatives at the min are 0 we get:
∂LP/∂w = w - Σ ai yi xi = 0
∂LP/∂b = -Σ ai yi = 0
so w = Σ ai yi xi and Σ ai yi = 0.
What's with this LP business? The subscript indicates that this is the primal form of the optimization problem. We will actually solve the optimization problem by solving for the dual of this original problem. What is this dual formulation?
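The two stationarity conditions can be checked on a tiny hand-built example. The two training points and the multiplier values below are assumptions chosen so the optimum is known in closed form:

```python
import numpy as np

# Hypothetical two-point training set (made up for illustration).
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# For this symmetric problem the optimum is a = (1/4, 1/4), giving
# w = (1/2, 1/2), b = 0 (the hyperplane x1 + x2 = 0).
a = np.array([0.25, 0.25])
w = (a * y) @ X        # w = sum_i a_i y_i x_i   (from dLP/dw = 0)
b = 0.0

assert np.allclose(w, [0.5, 0.5])
assert np.isclose(np.sum(a * y), 0.0)     # from dLP/db = 0: sum_i a_i y_i = 0
# Both points sit exactly on the gutters: y_i (w·x_i + b) = 1.
assert np.allclose(y * (X @ w + b), 1.0)
```

Both points are support vectors here (with only two points, both must touch the street).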
16 The Lagrangian Dual Problem
Instead of minimizing over w, b subject to constraints involving the a's, we can maximize over a (the dual variable) subject to the relations obtained previously for w and b. Our solution must satisfy these two relations:
w = Σ ai yi xi, Σ ai yi = 0
By substituting for w and b back in the original equation we can get rid of the dependence on w and b. Note first that we already have our answer for what the weights w must be: they are a linear combination of the training inputs xi, the training outputs yi, and the values of the a's. We will now solve for the a's by differentiating the dual problem wrt a, and setting it to zero. Most of the a's will turn out to have the value zero; the non-zero a's will correspond to the support vectors.

Primal problem:
min LP = ½||w||² - Σi ai yi(xi·w + b) + Σi ai
s.t. ∀i, ai ≥ 0
w = Σ ai yi xi, Σ ai yi = 0

Dual problem:
max LD(ai) = Σi ai - ½ Σij ai aj yi yj (xi·xj)
s.t. Σ ai yi = 0 and ai ≥ 0
(note that we have removed the dependence on w and b)
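The dual objective can be evaluated directly. A sketch on a hand-picked two-point problem (the data and the optimal a's are illustrative assumptions); at the optimum the dual value matches the primal value ½||w||², as strong duality for this QP promises:

```python
import numpy as np

# A hand-picked two-point toy set (values chosen for illustration).
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

def dual_objective(a, X, y):
    """L_D(a) = sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j (x_i · x_j)."""
    G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j (x_i · x_j)
    return a.sum() - 0.5 * a @ G @ a

a_opt = np.array([0.25, 0.25])     # known optimum for this toy set
w = (a_opt * y) @ X                # w = sum_i a_i y_i x_i

# At the optimum, dual value = primal value 1/2 ||w||^2.
assert np.isclose(dual_objective(a_opt, X, y), 0.5 * np.dot(w, w))
# Moving a away from the optimum (keeping sum_i a_i y_i = 0) only lowers L_D.
for eps in (0.1, -0.1):
    assert dual_objective(a_opt + eps, X, y) <= dual_objective(a_opt, X, y) + 1e-12
print(dual_objective(a_opt, X, y))  # 0.25
```

Note that the data enter the dual only through the Gram matrix of dot products, which is the hook for the kernel trick later.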
17 The Dual Problem
Kuhn-Tucker theorem: the solution we find here will be the same as the solution to the original problem.
Q: But why are we doing this? (Why not just solve the original problem?)
A: Because this will let us solve the problem by computing just the inner products of xi, xj (which will be very important later on when we want to solve non-linearly separable classification problems).

The Dual Problem
max LD(ai) = Σi ai - ½ Σij ai aj yi yj (xi·xj)
s.t. Σ ai yi = 0 and ai ≥ 0
Notice that all we have are the dot products of xi, xj. If we take the derivative wrt a and set it equal to zero, we get the following conditions, under which we can solve for the ai:
Σ ai yi = 0
0 ≤ ai ≤ C
18 Now, knowing the ai, we can find the weights w for the maximal margin separating hyperplane:
w = Σ ai yi xi
And now, after training and finding w by this method, given an unknown point u measured on features xi, we can classify it by looking at the sign of:
f(u) = w·u + b = (Σ ai yi xi·u) + b
Remember: most of the weights wi, i.e., the a's, will be zero. Only the support vectors (on the gutters or margin) will have nonzero weights or a's; this reduces the dimensionality of the solution!

Inner products, similarity, and SVMs
Why should inner product kernels be involved in pattern recognition using SVMs, or at all? The intuition is that inner products provide some measure of 'similarity'. The inner product in 2D between 2 vectors of unit length returns the cosine of the angle between them, i.e., how 'far apart' they are. E.g., for x = [1, 0]ᵀ and y = [1, 0]ᵀ, the vectors are parallel and their inner product is 1 (completely similar):
xᵀy = x·y = 1
For x = [1, 0]ᵀ and y = [0, 1]ᵀ, the vectors are perpendicular (completely unlike) and their inner product is 0 (so they should not contribute to the correct classifier):
xᵀy = x·y = 0
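The decision rule above can be sketched in a few lines. The support vectors, labels, and multipliers below are made-up values for a toy hyperplane, not the output of any real training run:

```python
import numpy as np

# Toy "trained" SVM: two support vectors on the hyperplane x1 + x2 = 0,
# with hand-chosen labels and multipliers (illustrative assumptions).
sv   = np.array([[1.0, 1.0], [-1.0, -1.0]])   # support vectors x_i
sv_y = np.array([1.0, -1.0])                  # their labels y_i
a    = np.array([0.25, 0.25])                 # their nonzero multipliers a_i
b    = 0.0

def classify(u):
    """Sign of f(u) = sum_i a_i y_i (x_i · u) + b.  Only support vectors
    enter the sum; every other training point has a_i = 0."""
    return np.sign(np.sum(a * sv_y * (sv @ u)) + b)

print(classify(np.array([2.0, 0.5])))    # 1.0  (above the line x1 + x2 = 0)
print(classify(np.array([-0.5, -2.0])))  # -1.0 (below it)
```

Classification touches only the support vectors and dot products with them, which is what makes the trained model compact.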
19 Insight into inner products
Consider that we are trying to maximize the form:
LD(ai) = Σ ai - ½ Σ ai aj yi yj (xi·xj)
s.t. Σ ai yi = 0 and ai ≥ 0
The claim is that this function will be maximized if we give nonzero values to a's that correspond to the support vectors, i.e., those that 'matter' in fixing the maximum-width margin ('street'). Well, consider what this looks like. Note first from the constraint condition that all the a's are positive. Now let's think about a few cases.
Case 1. If two features xi, xj are completely dissimilar, their dot product is 0, and they don't contribute to L.
Case 2. If two features xi, xj are completely alike, their dot product is 1. There are 2 subcases.
Subcase 1: both xi and xj predict the same output value yi (either +1 or -1). Then yi yj is always 1, and the value of ai aj yi yj xi·xj will be positive. But this would decrease the value of L (since it would subtract from the first term sum). So the algorithm downgrades similar feature vectors that make the same prediction.
Subcase 2: xi and xj make opposite predictions about the output value (i.e., one is +1, the other -1), but are otherwise very closely similar. Then the product ai aj yi yj xi·xj is negative, and since we are subtracting it, this adds to the sum, maximizing it. These are precisely the examples we are looking for: the critical ones that tell the two classes apart.

Insight into inner products, graphically: 2 very similar xi, xj vectors that predict different classes tend to maximize the margin width.
20 2 vectors that are similar but predict the same class are redundant. 2 dissimilar (orthogonal) vectors don't count at all.
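The three cases above can be made concrete by computing the sign of the term ai aj yi yj (xi·xj) that is subtracted inside LD (the example vectors are arbitrary choices):

```python
import numpy as np

def ld_term(xi, xj, yi, yj, ai=1.0, aj=1.0):
    """The term a_i a_j y_i y_j (x_i · x_j) that is SUBTRACTED inside L_D."""
    return ai * aj * yi * yj * np.dot(xi, xj)

# Case 1: orthogonal (completely dissimilar) vectors contribute nothing.
assert ld_term(np.array([1.0, 0.0]), np.array([0.0, 1.0]), +1, +1) == 0.0

# Case 2, subcase 1: similar vectors, same class -> positive term,
# which decreases L_D (redundant examples get downgraded).
assert ld_term(np.array([1.0, 0.0]), np.array([1.0, 0.1]), +1, +1) > 0

# Case 2, subcase 2: similar vectors, opposite classes -> negative term;
# subtracting it INCREASES L_D, so these are the critical,
# margin-defining examples.
assert ld_term(np.array([1.0, 0.0]), np.array([1.0, 0.1]), +1, -1) < 0
```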
21 But are we done??? Not linearly separable! Find a line that penalizes points on the wrong side.
22 Transformation to separate
The points x and o, not separable in the input space X, are mapped by φ into a feature space F where φ(x) and φ(o) become separable.

Non-Linear SVMs
The idea is to gain linear separation by mapping the data to a higher dimensional space. The following set can't be separated by a linear function, but can be separated by a quadratic one:
(x - a)(x - b) = x² - (a + b)x + ab
So if we map x → {x², x} we gain linear separation.
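The x → {x², x} mapping can be demonstrated numerically. The roots a = 1, b = 3 and the sample points below are made-up values:

```python
import numpy as np

# Hypothetical 1-D example with a = 1, b = 3:  f(x) = (x - 1)(x - 3) is
# negative between the roots and positive outside, so no single linear
# threshold on x separates the two classes.
a_root, b_root = 1.0, 3.0
xs = np.array([-1.0, 0.0, 2.0, 4.0, 5.0])
labels = np.sign((xs - a_root) * (xs - b_root))       # [1, 1, -1, 1, 1]

# Map x -> (x^2, x).  In the coordinates (z1, z2) = (x^2, x) the quadratic
# becomes the LINEAR function  z1 - (a + b) z2 + ab.
Z = np.stack([xs**2, xs], axis=1)
w = np.array([1.0, -(a_root + b_root)])               # linear weights
bias = a_root * b_root

assert np.array_equal(np.sign(Z @ w + bias), labels)  # linearly separated
```

The linear function in the mapped space computes exactly the original quadratic, so the labels agree point for point.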
23 Problems with linear SVMs
Consider a ring of yi = -1 points surrounding a cluster of yi = +1 points. What if the decision function is not linear? What transform would separate these? Answer: polar coordinates!

Non-linear SVMs: the kernel trick
Imagine a function φ that maps the data into another space H (e.g., a radial transform). Remember the function we want to optimize:
LD = Σ ai - ½ Σ ai aj yi yj (xi·xj)
where (xi·xj) is the dot product of the two feature vectors. If we now transform to φ, instead of computing this dot product (xi·xj) we will have to compute (φ(xi)·φ(xj)). But how can we do this? This is expensive and time consuming (suppose φ is a quartic polynomial, or worse, we don't know the function explicitly). Well, here is the neat thing: if there is a 'kernel function' K such that K(xi, xj) = φ(xi)·φ(xj), then we do not need to know or compute φ at all!! That is, the kernel function defines inner products in the transformed space. Or, it defines similarity in the transformed space.
24 Non-linear SVMs
So, the function we end up optimizing is:
LD = Σ ai - ½ Σ ai aj yi yj K(xi, xj)
Kernel example: the polynomial kernel K(xi, xj) = (xi·xj + 1)^p, where p is a tunable parameter. Note: evaluating K only requires one addition and one exponentiation more than the original dot product.

Examples for Non-Linear SVMs
K(x, y) = (x·y + 1)^p
K(x, y) = exp(-||x - y||² / (2σ²))
K(x, y) = tanh(κ x·y - δ)
The 1st is polynomial (and includes x·x as a special case); the 2nd is a radial basis function (Gaussians); the 3rd is a sigmoid (the neural net activation function).
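A sketch of the three kernels, plus a numerical check of the kernel-trick identity K(xi, xj) = φ(xi)·φ(xj) for the degree-2 polynomial kernel. The explicit map φ and the test vectors are standard textbook choices, written out here only as an illustration:

```python
import numpy as np

# The three example kernels (sigma, kappa, delta are tunable parameters).
def poly_kernel(x, y, p=2):
    return (np.dot(x, y) + 1) ** p

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.dot(x - y, x - y) / (2 * sigma**2))

def sigmoid_kernel(x, y, kappa=1.0, delta=0.0):
    return np.tanh(kappa * np.dot(x, y) - delta)

# Kernel trick check in 2-D: (x·y + 1)^2 equals phi(x)·phi(y) for the
# explicit 6-dimensional map
#   phi(x) = (1, sqrt(2) x1, sqrt(2) x2, x1^2, x2^2, sqrt(2) x1 x2),
# so we never need to build phi to compute inner products in that space.
def phi(x):
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1, s*x1, s*x2, x1**2, x2**2, s*x1*x2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 0.25])
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))
```

Evaluating the kernel costs one addition and one squaring; building φ explicitly would cost a 6-dimensional construction per point, and the gap grows quickly with p and the input dimension.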
25 We've already seen such nonlinear transforms. What is tanh(β0 xᵀxi + β1)? It's the sigmoid transform (for neural nets). So SVMs subsume neural nets! (but without their problems...)

Inner Product Kernels (the usual inner product K(x, xi), i = 1, 2, ..., N):
- Polynomial learning machine: (xᵀxi + 1)^p; the power p is specified a priori by the user.
- Radial-basis function (RBF): exp(-||x - xi||² / (2σ²)); the width σ² is specified a priori.
- Two-layer neural net: tanh(β0 xᵀxi + β1); actually works only for some values of β0 and β1.
26 Kernels generalize the notion of 'inner product similarity'. Note that one can define kernels over more than just vectors: strings, trees, structures... in fact, just about anything. A very powerful idea, used in comparing DNA, protein structure, sentence structures, etc.

Examples for Non-Linear SVMs 2: Gaussian Kernel
[Figure: decision boundaries for a linear kernel vs. a Gaussian kernel]
27 Nonlinear RBF kernel
[Figure: the 'Admiral's delight' data set with different kernel functions]
28 Overfitting by SVM
Every point is a support vector: too much freedom to bend to fit the training data, and no generalization. In fact, SVMs have an 'automatic' way to avoid such issues, but we won't cover it here; see the book by Vapnik. (We add a penalty function for mistakes made after training by over-fitting: recall that if one over-fits, then one will tend to make errors on new data. This penalty function can be put into the quadratic programming problem directly. You don't need to know this for this course.)
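The penalty idea can be sketched as the standard soft-margin ('hinge') objective; the exact form here and the data are illustrative assumptions, not this text's formulation:

```python
import numpy as np

# Sketch of a soft-margin objective: 1/2 ||w||^2 + C * sum_i hinge_i, where
# hinge_i = max(0, 1 - y_i (w·x_i + b)) penalizes points on the wrong side
# of (or inside) the margin.  C is a made-up trade-off value.
def soft_margin_objective(w, b, X, y, C=1.0):
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * hinge.sum()

X = np.array([[1.0, 1.0], [-1.0, -1.0], [0.2, 0.2]])  # last point violates
y = np.array([1.0, -1.0, -1.0])                       # the margin
w, b = np.array([0.5, 0.5]), 0.0

# With a large C, margin violations dominate; with C = 0 only the margin
# term 1/2 ||w||^2 remains, so the value drops to 0.25 here.
assert soft_margin_objective(w, b, X, y, C=10.0) > soft_margin_objective(w, b, X, y, C=0.1)
assert np.isclose(soft_margin_objective(w, b, X, y, C=0.0), 0.25)
```

Tuning C trades street width against training mistakes, which is the knob that controls the overfitting described above.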
2013 MBA Jump Start Program
2013 MBA Jump Start Program Module 2: Mathematics Thomas Gilbert Mathematics Module Algebra Review Calculus Permutations and Combinations [Online Appendix: Basic Mathematical Concepts] 2 1 Equation of
Core Maths C1. Revision Notes
Core Maths C Revision Notes November 0 Core Maths C Algebra... Indices... Rules of indices... Surds... 4 Simplifying surds... 4 Rationalising the denominator... 4 Quadratic functions... 4 Completing the
COMPARISON OF DIFFUSION MODELS IN ASTRONOMICAL OBJECT LOCALIZATION
COMPARISON OF DIFFUSION MODELS IN ASTRONOMICAL OBJECT LOCALIZATION Františe Mojžíš Department of Computing and Contro Engineering, ICT Prague, Technicá, 8 Prague [email protected] Abstract This
Chapter 6 Signal Data Mining from Wearable Systems
Chapter 6 Signa Data Mining from Wearabe Systems Francois G. Meyer 6.1 Definition of the Subject 6.1.1 Introduction Sensors from wearabe systems can be anayzed in rea-time on-site, or can be transmitted
Spherical Correlation of Visual Representations for 3D Model Retrieval
Noname manuscript No. (wi be inserted by the editor) Spherica Correation of Visua Representations for 3D Mode Retrieva Ameesh Makadia Kostas Daniiidis the date of receipt and acceptance shoud be inserted
Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
Teaching fractions in elementary school: A manual for teachers
Teaching fractions in eementary schoo: A manua for teachers H. Wu Apri 0, 998 [Added December, 200] I have decided to resurrect this fie of 998 because, as a reativey short summary of the basic eements
NCH Software Bolt PDF Printer
NCH Software Bot PDF Printer This user guide has been created for use with Bot PDF Printer Version 1.xx NCH Software Technica Support If you have difficuties using Bot PDF Printer pease read the appicabe
Mechanical Engineering Drawing Workshop. Sample Drawings. Sample 1. Sample 1. Sample 2. Sample 2 2013/7/19. JIS Drawing Part 1
Mechanica Engineering Drawing Workshop Sampe Drawings JIS Drawing Part Content:.Sampe Drawings 2.Drawing sheets 3. Views and projection methods 4. Line Conventions 5. Scaes 6. Sectiona drawings 7. Detai
This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.
Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course
Support Vector Machines for Classification and Regression
UNIVERSITY OF SOUTHAMPTON Support Vector Machines for Classification and Regression by Steve R. Gunn Technical Report Faculty of Engineering, Science and Mathematics School of Electronics and Computer
Advanced ColdFusion 4.0 Application Development - 3 - Server Clustering Using Bright Tiger
Advanced CodFusion 4.0 Appication Deveopment - CH 3 - Server Custering Using Bri.. Page 1 of 7 [Figures are not incuded in this sampe chapter] Advanced CodFusion 4.0 Appication Deveopment - 3 - Server
Australian Bureau of Statistics Management of Business Providers
Purpose Austraian Bureau of Statistics Management of Business Providers 1 The principa objective of the Austraian Bureau of Statistics (ABS) in respect of business providers is to impose the owest oad
Leakage detection in water pipe networks using a Bayesian probabilistic framework
Probabiistic Engineering Mechanics 18 (2003) 315 327 www.esevier.com/ocate/probengmech Leakage detection in water pipe networks using a Bayesian probabiistic framework Z. Pouakis, D. Vaougeorgis, C. Papadimitriou*
Walrasian Demand. u(x) where B(p, w) = {x R n + : p x w}.
Walrasian Demand Econ 2100 Fall 2015 Lecture 5, September 16 Outline 1 Walrasian Demand 2 Properties of Walrasian Demand 3 An Optimization Recipe 4 First and Second Order Conditions Definition Walrasian
1 3 4 = 8i + 20j 13k. x + w. y + w
) Find the point of intersection of the lines x = t +, y = 3t + 4, z = 4t + 5, and x = 6s + 3, y = 5s +, z = 4s + 9, and then find the plane containing these two lines. Solution. Solve the system of equations
Prot Maximization and Cost Minimization
Simon Fraser University Prof. Karaivanov Department of Economics Econ 0 COST MINIMIZATION Prot Maximization and Cost Minimization Remember that the rm's problem is maximizing prots by choosing the optimal
Multi-variable Calculus and Optimization
Multi-variable Calculus and Optimization Dudley Cooke Trinity College Dublin Dudley Cooke (Trinity College Dublin) Multi-variable Calculus and Optimization 1 / 51 EC2040 Topic 3 - Multi-variable Calculus
CERTIFICATE COURSE ON CLIMATE CHANGE AND SUSTAINABILITY. Course Offered By: Indian Environmental Society
CERTIFICATE COURSE ON CLIMATE CHANGE AND SUSTAINABILITY Course Offered By: Indian Environmenta Society INTRODUCTION The Indian Environmenta Society (IES) a dynamic and fexibe organization with a goba vision
INTRODUCTION TO THE FINITE ELEMENT METHOD
INTRODUCTION TO THE FINITE ELEMENT METHOD Evgen Barkanov Institute of Materias and Structures Facut of Civi Engineering Riga Technica Universit Riga, Preface Toda the finite eement method (FEM) is considered
EFFICIENT CLUSTERING OF VERY LARGE DOCUMENT COLLECTIONS
Chapter 1 EFFICIENT CLUSTERING OF VERY LARGE DOCUMENT COLLECTIONS Inderjit S. Dhion, James Fan and Yuqiang Guan Abstract An invauabe portion of scientific data occurs naturay in text form. Given a arge
Definition 8.1 Two inequalities are equivalent if they have the same solution set. Add or Subtract the same value on both sides of the inequality.
8 Inequalities Concepts: Equivalent Inequalities Linear and Nonlinear Inequalities Absolute Value Inequalities (Sections 4.6 and 1.1) 8.1 Equivalent Inequalities Definition 8.1 Two inequalities are equivalent
Scheduling in Multi-Channel Wireless Networks
Scheduing in Muti-Channe Wireess Networks Vartika Bhandari and Nitin H. Vaidya University of Iinois at Urbana-Champaign, USA [email protected], [email protected] Abstract. The avaiabiity of mutipe orthogona
(a) We have x = 3 + 2t, y = 2 t, z = 6 so solving for t we get the symmetric equations. x 3 2. = 2 y, z = 6. t 2 2t + 1 = 0,
Name: Solutions to Practice Final. Consider the line r(t) = 3 + t, t, 6. (a) Find symmetric equations for this line. (b) Find the point where the first line r(t) intersects the surface z = x + y. (a) We
Week 3: Consumer and Firm Behaviour: The Work-Leisure Decision and Profit Maximization
AROEOOIS 2006 Week 3: onsumer and Firm Behaviour: The Work-Leisure Decision and Profit aximization Questions for Review 1. How are a consumer s preferences over goods represented? By utiity functions:
NCH Software Warp Speed PC Tune-up Software
NCH Software Warp Speed PC Tune-up Software This user guide has been created for use with Warp Speed PC Tune-up Software Version 1.xx NCH Software Technica Support If you have difficuties using Warp Speed
MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS
MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS Systems of Equations and Matrices Representation of a linear system The general system of m equations in n unknowns can be written a x + a 2 x 2 + + a n x n b a
PENALTY TAXES ON CORPORATE ACCUMULATIONS
H Chapter Six H PENALTY TAXES ON CORPORATE ACCUMULATIONS INTRODUCTION AND STUDY OBJECTIVES The accumuated earnings tax and the persona hoding company tax are penaty taxes designed to prevent taxpayers
Duality in General Programs. Ryan Tibshirani Convex Optimization 10-725/36-725
Duality in General Programs Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: duality in linear programs Given c R n, A R m n, b R m, G R r n, h R r : min x R n c T x max u R m, v R r b T
Breakeven analysis and short-term decision making
Chapter 20 Breakeven anaysis and short-term decision making REAL WORLD CASE This case study shows a typica situation in which management accounting can be hepfu. Read the case study now but ony attempt
The width of single glazing. The warmth of double glazing.
Therma Insuation CI/SfB (31) Ro5 (M5) September 2012 The width of singe gazing. The warmth of doube gazing. Pikington Spacia Revoutionary vacuum gazing. Image courtesy of Lumen Roofight Ltd. Pikington
The Use of Cooling-Factor Curves for Coordinating Fuses and Reclosers
he Use of ooing-factor urves for oordinating Fuses and Recosers arey J. ook Senior Member, IEEE S& Eectric ompany hicago, Iinois bstract his paper describes how to precisey coordinate distribution feeder
Lecture 3. Linear Programming. 3B1B Optimization Michaelmas 2015 A. Zisserman. Extreme solutions. Simplex method. Interior point method
Lecture 3 3B1B Optimization Michaelmas 2015 A. Zisserman Linear Programming Extreme solutions Simplex method Interior point method Integer programming and relaxation The Optimization Tree Linear Programming
Comparison of Traditional and Open-Access Appointment Scheduling for Exponentially Distributed Service Time
Journa of Heathcare Engineering Vo. 6 No. 3 Page 34 376 34 Comparison of Traditiona and Open-Access Appointment Scheduing for Exponentiay Distributed Service Chongjun Yan, PhD; Jiafu Tang *, PhD; Bowen
Hybrid Process Algebra
Hybrid Process Agebra P.J.L. Cuijpers M.A. Reniers Eindhoven University of Technoogy (TU/e) Den Doech 2 5600 MB Eindhoven, The Netherands Abstract We deveop an agebraic theory, caed hybrid process agebra
