How to Solve A Support Vector Machine Problem. Hao Zhang Colorado State University

Size: px

Start display at page:

Download "How to Solve A Support Vector Machine Problem. Hao Zhang Colorado State University"

Tobias Malone
7 years ago
Views:

1 How to Solve A Support Vector Machine Problem Hao Zhang Colorado State University

2 Outline Preliminaries Support Vector Machine (SVM) Reference

3 Preliminaries

4 Convex function Concave function Local minimum is global minimum Local maximum is global maximum

5 Primal Problem: mmmmmmmm f 0 (x) sssssss tt f i x 0, i = 1,, m h i x = 0, i = 1,, p We denote p as the optimal value of this problem Lagrange function: m L x, λ, ν = f 0 x + λ i f i (x) p + ν i h i (x) (λ i 0) λ i aaa ν i aaa cccccc Lagrange multipliers

6 Lagrange dual function g λ, ν = inf x = inf x L x, λ, ν m f 0 x + λ i f i x p + ν i h i x Note: g λ, ν can be for certain λ aaa ν Since the dual function g λ, ν is the point-wise infimum of a family of affine functions of (λ, ν), it is concave. (Don t care about the convexity of f i x or h i x )

7 Lagrange dual function g λ, ν = inf x = inf x L x, λ, ν m f 0 x + λ i f i x p + ν i h i x If x is primal feasible (f 0 (x) won t approach infinity), then g(λ, ν) f 0 (x) (quite easy to prove) What does this mean if g(λ, ν) > (called dual feasible)?

8 g(λ, ν) f 0 (x) g(λ, ν) is one lower bound on optimal value p Can we do better?

9 Sure! Just maximize g λ, ν : mmmmmmm g λ, ν sssssss tt λ i 0 This is called Lagrange dual problem, associated with its primal problem. We denote d as the optimal value of this dual problem This is a convex optimization problem, since the objective to be maximized is concave and the constraint is convex. It s still the case even if the primal problem is not convex.

10 g(λ, ν) f 0 (x) d p (weak duality) p d is defined as optimal duality gap

11 If the primal problem is convex, we usually (but not always) have strong duality: d = p One simple constraint qualification is Slater s condition: There exists an x rrrrrr(d) such that f i (x) < 0, i = 1,, m h i (x) = 0, i = 1,, p then we have strong duality * rrrrrr D = {x D: ε > 0, N ε (x) aaa(d) D}

12 Duality implemented in algorithms At kth iteration: compute g(λ (k), ν (k) ) and f 0 x k g λ k, ν k p f 0 x k stop if the bound is tight enough

13 Complementary slackness Suppose that the primal and dual optimal values are attained and equal (strong duality holds). Denote x and λ, ν as a dual optimal point, i.e., f 0 x = g λ, ν = inf x m f 0 x + λ if i x f 0 x + λ if i x f 0 x m p + ν ih i x p + ν ih i x

14 Complementary slackness Suppose that the primal and dual optimal values are attained and equal (strong duality holds). Denote x and λ, ν as a dual optimal point, i.e., f 0 x = g λ, ν = inf x m f 0 x + λ if i x = f 0 x + λ if i x = f 0 x m p + ν ih i x p + ν ih i x So, the two inequalities in this chain hold with equality.

15 Hence, m λ if i x = 0 Since each term is non-positive, we conclude that λ if i x = 0, i = 1,, m This is called complementary slackness

16 Karush-Kuhn-Tucker conditions Assume f 0,, f m, h 1,, h p are differentiable, therefore they have open domains. (Don t care about convexity) m p f 0 x + λ i f i x + ν i h i x = 0

17 Karush-Kuhn-Tucker (KKT) conditions Put all the conditions and conclusions together, we get KKT conditions: f 0 x f i x 0, i = 1,, m h i x = 0, i = 1,, p λ i 0, i = 1,..., m λ if i x = 0, i = 1,, m m p + λ i f i x + ν i h i x = 0 Summary: for any optimization problem with differentiable objective and constraint functions for which strong duality obtains, any pair of primal and dual optimal points must satisfy the KKT conditions

18 The KKT conditions play an important role in optimization. In general, many algorithms for convex optimization are conceived as, or can be interpreted as, methods for solving the KKT conditions. Example: SVM! Fletcher proved that for all SVMs, KKT conditions are necessary and sufficient for a solution.

19 Support Vector Machine (SVM)

20 Linear SVM We are given some training data D, a set of n points of the form D = x i, y i x i R p, y i { 1,1}} Our goal is to find the maximum-margin hyperplane that divides the points from these two classes. H3 (green) doesn't separate the two classes. H1 (blue) does, with a small margin and H2 (red) with the maximum margin.

21 A hyper plane can be written as w x b = 0 We want to choose w and b to maximize the margin, or distance between the parallel hyperplanes that are as far apart as possible while still separating the data. These hyperplanes can be described by the equations: w x b = 1 and w x b = 1 (Uniqueness of the parameters)

22 the distance between these two hyperplanes is: 2 w Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors.

23 As we also have to prevent data points from falling into the margin, we add the following constraint: for each x i either w x i b 1 for y i = 1 or w x i b 1 for y i = 1 This can be written as y i (w x i b) 1 for all i

24 The primal form of this problem is MMMMMMMM 1 2 w 2 subject to y i (w x i b) 1 Its dual form is max min L α, w, b α w,b where L α, w, b = 1 2 w 2 α i [y i w x i b 1] subject to α i 0

25 min w,b L α, w, b L(α, w, b) w (α, w, b) b = w α i y i x i = 0 = α i y i = 0 Geometric view min w,b L α, w, b = α i 1 2 α iα j y i y j x i T x j i,j = L (α)

26 Our goal is max L (α) α Subject to α i 0 and α i y i = 0 How to compute w? w = α i y i x i How to compute b? Complementary slackness: α i y i w x i b 1 = 0 fff ttttt α i > 0 b = w x i y i (In practice, it s safer to compute the mean)

27 Kernel SVM Revisit L (α) L α = 1 2 α iα j y i y j x i T x j i,j k(x i, x j ) Some examples of kernels from Wikipedia:

28 What we have discussed so far focused on separable cases. what if the data is not separable? Relax the constraints!

29 y i (w x i b) 1 y i w x i b 1 ξ i, ξ i 0 Primal problem becomes: min w,ξ,b 1 2 w 2 + C ξ i C is a trade-off between a large margin and a small error penalty. Convert this primal form to dual form. (Same idea!)

30 Reference [1] [2] A Tutorial on Support Vector Machines for Pattern Recognition, CHRISTOPHER J.C. BURGES [3] Convex Optimization, Stephen Boyd, Lieven Vandenberghe [4] ECEN 629 Lecture 5: Duality and KKT Conditions, Shuguang Cui [5] [6]

31 Thanks!

Duality in General Programs. Ryan Tibshirani Convex Optimization 10-725/36-725

Duality in General Programs. Ryan Tibshirani Convex Optimization 10-725/36-725 Duality in General Programs Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: duality in linear programs Given c R n, A R m n, b R m, G R r n, h R r : min x R n c T x max u R m, v R r b T