Support Vector Machines




Support Vector Machines. Charlie Frogner, MIT, 2011. Slides mostly stolen from Ryan Rifkin (Google).

Plan Regularization derivation of SVMs. Analyzing the SVM problem: optimization, duality. Geometric derivation of SVMs. Practical issues.

The Regularization Setting (Again) Given n examples (x_1, y_1), ..., (x_n, y_n), with x_i ∈ R^n and y_i ∈ {−1, 1} for all i. We can find a classification function by solving a regularized learning problem: argmin_{f∈H} (1/n) Σ_{i=1}^n V(y_i, f(x_i)) + λ‖f‖²_H. Note that in this class we are specifically considering binary classification.

The Hinge Loss The classical SVM arises by considering the specific loss function V(f(x), y) = (1 − y f(x))_+, where (k)_+ = max(k, 0).
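
As a concrete illustration (not part of the original slides), the hinge loss is a one-line function; the name hinge_loss is ours:

```python
def hinge_loss(y, fx):
    """Hinge loss V(f(x), y) = (1 - y*f(x))_+ for one example.

    y is a label in {-1, +1}; fx is the real-valued score f(x).
    """
    return max(1.0 - y * fx, 0.0)
```

A correctly classified point beyond the margin (y·f(x) ≥ 1) incurs zero loss; the loss then grows linearly as y·f(x) falls below 1.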

The Hinge Loss [Figure: plot of the hinge loss against y·f(x); the loss is 0 for y·f(x) ≥ 1 and increases linearly as y·f(x) decreases below 1.]

Substituting In The Hinge Loss With the hinge loss, our regularization problem becomes argmin_{f∈H} (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ‖f‖²_H. Note that we don't have a 1/2 multiplier on the regularization term.

Slack Variables This problem is non-differentiable (because of the kink in V). So we rewrite the max function using slack variables ξ_i: argmin_{f∈H} (1/n) Σ_{i=1}^n ξ_i + λ‖f‖²_H subject to: ξ_i ≥ 1 − y_i f(x_i), i = 1,...,n; ξ_i ≥ 0, i = 1,...,n.

Applying The Representer Theorem Substituting in f*(x) = Σ_{i=1}^n c_i K(x, x_i), we get a constrained quadratic programming problem: argmin_{c∈R^n, ξ∈R^n} (1/n) Σ_{i=1}^n ξ_i + λ cᵀKc subject to: ξ_i ≥ 1 − y_i Σ_{j=1}^n c_j K(x_i, x_j), i = 1,...,n; ξ_i ≥ 0, i = 1,...,n.
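
To make the representer-theorem substitution concrete, here is a small Python sketch (the names and the choice of a Gaussian kernel are ours, for illustration) that evaluates f(x) = Σ_i c_i K(x, x_i):

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * np.sum((np.asarray(u) - np.asarray(v)) ** 2))

def f(x, c, X, kernel=rbf_kernel):
    """Evaluate f(x) = sum_i c_i K(x, x_i), per the representer theorem."""
    return sum(c[i] * kernel(x, X[i]) for i in range(len(c)))

X = [[0.0], [1.0]]   # training inputs
c = [1.0, 0.0]       # expansion coefficients
# f([0.0]) = 1*K(0,0) + 0*K(0,1) = 1
```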

Adding A Bias Term Adding an unregularized bias term b (which presents some theoretical difficulties) we get the primal SVM: argmin_{c∈R^n, b∈R, ξ∈R^n} (1/n) Σ_{i=1}^n ξ_i + λ cᵀKc subject to: ξ_i ≥ 1 − y_i (Σ_{j=1}^n c_j K(x_i, x_j) + b), i = 1,...,n; ξ_i ≥ 0, i = 1,...,n.

Standard Notation In most of the SVM literature, instead of λ, a parameter C is used to control regularization: C = 1/(2λn). Using this definition (after multiplying our objective function by the constant 1/(2λ)), the regularization problem becomes argmin_{f∈H} C Σ_{i=1}^n V(y_i, f(x_i)) + (1/2)‖f‖²_H. Like λ, the parameter C also controls the tradeoff between classification accuracy and the norm of the function. The primal problem becomes...
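
The λ-to-C conversion is a one-liner; a quick numeric sketch (our own function name):

```python
def c_from_lambda(lam, n):
    """Convert the regularization parameter lambda to C = 1/(2*lambda*n)."""
    return 1.0 / (2.0 * lam * n)

# e.g. with n = 100 examples and lambda = 0.005, C = 1/(2*0.005*100) = 1.0
```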

The Reparametrized Problem argmin_{c∈R^n, b∈R, ξ∈R^n} C Σ_{i=1}^n ξ_i + (1/2) cᵀKc subject to: ξ_i ≥ 1 − y_i (Σ_{j=1}^n c_j K(x_i, x_j) + b), i = 1,...,n; ξ_i ≥ 0, i = 1,...,n.

How to Solve? argmin_{c∈R^n, b∈R, ξ∈R^n} C Σ_{i=1}^n ξ_i + (1/2) cᵀKc subject to: ξ_i ≥ 1 − y_i (Σ_{j=1}^n c_j K(x_i, x_j) + b), i = 1,...,n; ξ_i ≥ 0, i = 1,...,n. This is a constrained optimization problem. The general approach: form the primal problem (we did this); form the Lagrangian from the primal (just like Lagrange multipliers); form the dual, with one dual variable associated to each primal constraint in the Lagrangian.

Lagrangian We derive the dual from the primal using the Lagrangian: L(c, ξ, b, α, ζ) = C Σ_{i=1}^n ξ_i + (1/2) cᵀKc − Σ_{i=1}^n α_i (y_i {Σ_{j=1}^n c_j K(x_i, x_j) + b} − 1 + ξ_i) − Σ_{i=1}^n ζ_i ξ_i.

Dual I The dual problem is: argmax_{α,ζ≥0} inf_{c,ξ,b} L(c, ξ, b, α, ζ). First, minimize L w.r.t. (c, ξ, b): (1) ∂L/∂c = 0 ⟹ c_i = α_i y_i; (2) ∂L/∂b = 0 ⟹ Σ_{i=1}^n α_i y_i = 0; (3) ∂L/∂ξ_i = 0 ⟹ C − α_i − ζ_i = 0 ⟹ 0 ≤ α_i ≤ C.

Dual II Dual: argmax_{α,ζ≥0} inf_{c,ξ,b} L(c, ξ, b, α, ζ). Optimality conditions: (1) c_i = α_i y_i; (2) Σ_{i=1}^n α_i y_i = 0; (3) α_i ∈ [0, C]. Plug in (2) and (3): argmax_{α≥0} inf_c L(c, α), where L(c, α) = (1/2) cᵀKc + Σ_{i=1}^n α_i (1 − y_i Σ_{j=1}^n K(x_i, x_j) c_j).

Dual II Dual: argmax_{α,ζ≥0} inf_{c,ξ,b} L(c, ξ, b, α, ζ). Optimality conditions: (1) c_i = α_i y_i; (2) Σ_{i=1}^n α_i y_i = 0; (3) α_i ∈ [0, C]. Plug in (1): argmax_{α≥0} L(α) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i y_i K(x_i, x_j) α_j y_j = Σ_{i=1}^n α_i − (1/2) αᵀ (diag y) K (diag y) α.

The Primal and Dual Problems Again Primal: argmin_{c∈R^n, b∈R, ξ∈R^n} C Σ_{i=1}^n ξ_i + (1/2) cᵀKc subject to: ξ_i ≥ 1 − y_i (Σ_{j=1}^n c_j K(x_i, x_j) + b), i = 1,...,n; ξ_i ≥ 0, i = 1,...,n. Dual: max_{α∈R^n} Σ_{i=1}^n α_i − (1/2) αᵀQα, where Q = (diag y) K (diag y), subject to: Σ_{i=1}^n y_i α_i = 0; 0 ≤ α_i ≤ C, i = 1,...,n.
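
To see the dual in action, here is a toy Python sketch of our own (not from the slides): it drops the bias b, which removes the equality constraint Σ y_i α_i = 0, and then maximizes Σ_i α_i − (1/2) αᵀQα over the box 0 ≤ α_i ≤ C by projected gradient ascent.

```python
import numpy as np

def solve_dual(K, y, C=1.0, lr=0.01, steps=2000):
    """Projected gradient ascent on the box-constrained (bias-free) dual:
    max_a sum(a) - 0.5 * a^T Q a, 0 <= a_i <= C, with Q = diag(y) K diag(y)."""
    n = len(y)
    Q = (y[:, None] * y[None, :]) * K
    a = np.zeros(n)
    for _ in range(steps):
        grad = 1.0 - Q @ a                    # gradient of the dual objective
        a = np.clip(a + lr * grad, 0.0, C)    # project back onto the box
    return a

# A tiny linearly separable 1-D problem with a linear kernel.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha = solve_dual(X @ X.T, y, C=10.0)

def f(x):
    """f(x) = sum_i alpha_i y_i K(x, x_i) (no bias in this sketch)."""
    return float(sum(alpha[i] * y[i] * (X[i] @ x) for i in range(len(y))))
```

On this toy problem the inner points x = ±1 end up with α_i > 0 while the outer points get α_i = 0, and the learned scoring function is approximately f(x) = x.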

SVM Training Basic idea: solve the dual problem to find the optimal α's, and use them to find b and c. The dual problem is easier to solve than the primal problem. It has simple box constraints and a single equality constraint, and the problem can be decomposed into a sequence of smaller problems (see appendix).

Interpreting the solution α tells us: c and b, and the identities of the misclassified points. How to analyze? Use the optimality conditions. Already used: the derivative of L w.r.t. (c, ξ, b) is zero at optimality. Haven't used: complementary slackness, primal/dual constraints.

Optimality Conditions: all of them All optimal solutions must satisfy:
Σ_{j=1}^n c_j K(x_i, x_j) − Σ_{j=1}^n y_j α_j K(x_i, x_j) = 0, i = 1,...,n
Σ_{i=1}^n α_i y_i = 0
C − α_i − ζ_i = 0, i = 1,...,n
y_i (Σ_{j=1}^n y_j α_j K(x_i, x_j) + b) − 1 + ξ_i ≥ 0, i = 1,...,n
α_i [y_i (Σ_{j=1}^n y_j α_j K(x_i, x_j) + b) − 1 + ξ_i] = 0, i = 1,...,n
ζ_i ξ_i = 0, i = 1,...,n
ξ_i, α_i, ζ_i ≥ 0, i = 1,...,n

Optimality Conditions II These optimality conditions are both necessary and sufficient for optimality: (c, ξ, b, α, ζ) satisfy all of the conditions if and only if they are optimal for both the primal and the dual. (Also known as the Karush-Kuhn-Tucker (KKT) conditions.)

Interpreting the solution c ∂L/∂c = 0 ⟹ c_i = α_i y_i, ∀i.

Interpreting the solution b Suppose we have the optimal α_i's. Also suppose that there exists an i satisfying 0 < α_i < C. Then: α_i < C ⟹ ζ_i > 0 ⟹ ξ_i = 0 ⟹ y_i (Σ_{j=1}^n y_j α_j K(x_i, x_j) + b) − 1 = 0 ⟹ b = y_i − Σ_{j=1}^n y_j α_j K(x_i, x_j).
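
The recipe above translates directly into code. A sketch (ours), assuming we already have the optimal α's, the kernel matrix K, labels y, and C:

```python
import numpy as np

def bias_from_alphas(alpha, y, K, C, tol=1e-8):
    """Recover b using any i with 0 < alpha_i < C: such a point lies exactly
    on the margin (xi_i = 0), so y_i * (sum_j y_j alpha_j K(x_i, x_j) + b) = 1."""
    i = next(j for j in range(len(y)) if tol < alpha[j] < C - tol)
    return y[i] - sum(y[j] * alpha[j] * K[i, j] for j in range(len(y)))

# Hand-built example: points x = -1, +1 with labels -1, +1 and a linear
# kernel; the optimal solution is alpha = (0.5, 0.5), giving f(x) = x, b = 0.
K = np.array([[1.0, -1.0], [-1.0, 1.0]])
y = np.array([-1.0, 1.0])
alpha = np.array([0.5, 0.5])
b = bias_from_alphas(alpha, y, K, C=1.0)
```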

Interpreting the solution sparsity (Remember we defined f(x) = Σ_{i=1}^n y_i α_i K(x, x_i) + b.) y_i f(x_i) > 1 ⟹ (1 − y_i f(x_i)) < 0 ⟹ ξ_i ≠ (1 − y_i f(x_i)) (since ξ_i ≥ 0) ⟹ α_i = 0 (by complementary slackness).

Interpreting the solution support vectors y_i f(x_i) < 1 ⟹ (1 − y_i f(x_i)) > 0 ⟹ ξ_i > 0 ⟹ ζ_i = 0 ⟹ α_i = C.

Interpreting the solution support vectors So y_i f(x_i) < 1 ⟹ α_i = C. Conversely, suppose α_i = C: α_i = C ⟹ ξ_i = 1 − y_i f(x_i) ⟹ y_i f(x_i) ≤ 1.

Interpreting the solution Here are all of the derived conditions:
α_i = 0 ⟹ y_i f(x_i) ≥ 1
0 < α_i < C ⟹ y_i f(x_i) = 1
α_i = C ⟹ y_i f(x_i) ≤ 1
y_i f(x_i) > 1 ⟹ α_i = 0
y_i f(x_i) < 1 ⟹ α_i = C
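
These derived conditions suggest a simple diagnostic; here is an illustrative helper (the function name and labels are ours) that classifies each training point from its α_i:

```python
def point_type(alpha_i, C, tol=1e-8):
    """Label a training point by its dual variable alpha_i."""
    if alpha_i <= tol:
        return "non-support vector (y f(x) >= 1)"
    if alpha_i >= C - tol:
        return "bound support vector (y f(x) <= 1)"
    return "margin support vector (y f(x) = 1)"
```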

Geometric Interpretation of Reduced Optimality Conditions

Summary so far The SVM is a Tikhonov regularization problem, using the hinge loss: argmin_{f∈H} (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ‖f‖²_H. Solving the SVM means solving a constrained quadratic program. Solutions can be sparse: some coefficients are zero. The nonzero coefficients correspond to points that aren't classified "correctly enough"; this is where the "support vector" in SVM comes from.

The Geometric Approach The traditional approach to developing the mathematics of SVM is to start with the concepts of separating hyperplanes and margin. The theory is usually developed in a linear space, beginning with the idea of a perceptron, a linear hyperplane that separates the positive and the negative examples. Defining the margin as the distance from the hyperplane to the nearest example, the basic observation is that intuitively, we expect a hyperplane with larger margin to generalize better than one with smaller margin.

Large and Small Margin Hyperplanes [Figure: (a) a separating hyperplane with large margin; (b) one with small margin.]

Maximal Margin Classification Classification function: f(x) = sign(⟨w, x⟩). (1) Here w is a normal vector to the hyperplane separating the classes. We define the boundaries of the margin by ⟨w, x⟩ = ±1. What happens as we change w? We push the margin in/out by rescaling w: the margin moves out with 1/‖w‖. So maximizing the margin corresponds to minimizing ‖w‖.


Maximal Margin Classification, Separable case Separable means ∃w s.t. all points are beyond the margin, i.e. y_i ⟨w, x_i⟩ ≥ 1, ∀i. So we solve: argmin_w ‖w‖² s.t. y_i ⟨w, x_i⟩ ≥ 1, ∀i.

Maximal Margin Classification, Non-separable case Non-separable means there are points on the wrong side of the margin, i.e. ∃i s.t. y_i ⟨w, x_i⟩ < 1. We add slack variables to account for the wrongness: argmin_{ξ,w} Σ_{i=1}^n ξ_i + ‖w‖² s.t. y_i ⟨w, x_i⟩ ≥ 1 − ξ_i, ∀i; ξ_i ≥ 0, ∀i.
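
Substituting ξ_i = (1 − y_i⟨w, x_i⟩)_+ turns this constrained problem into the unconstrained objective ‖w‖² + Σ_i (1 − y_i⟨w, x_i⟩)_+, which a plain subgradient method can minimize. A toy Python sketch of our own (a fixed step size only gets close to the optimum):

```python
import numpy as np

def train_soft_margin(X, y, lr=0.01, steps=2000):
    """Subgradient descent on ||w||^2 + sum_i max(0, 1 - y_i * <w, x_i>)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        viol = margins < 1                              # points inside the margin
        grad = 2.0 * w - (y[viol, None] * X[viol]).sum(axis=0)
        w -= lr * grad
    return w

X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
w = train_soft_margin(X, y)   # on this separable toy data, w is close to [1.0]
```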

Historical Perspective Historically, most developments began with the geometric form, derived a dual program which was identical to the dual we derived above, and only then observed that the dual program required only dot products, and that these dot products could be replaced with a kernel function.

More Historical Perspective In the linearly separable case, we can also derive the separating hyperplane as a vector parallel to the vector connecting the closest two points in the positive and negative classes, passing through the perpendicular bisector of this vector. This was the Method of Portraits, derived by Vapnik in the 1970s, and recently rediscovered (with non-separable extensions) by Keerthi.

Summary The SVM is a Tikhonov regularization problem, with the hinge loss: argmin_{f∈H} (1/n) Σ_{i=1}^n (1 − y_i f(x_i))_+ + λ‖f‖²_H. Solving the SVM means solving a constrained quadratic program. It's better to work with the dual program. Solutions can be sparse: few non-zero coefficients. The non-zero coefficients correspond to points not classified "correctly enough", a.k.a. support vectors. There is an alternative, geometric interpretation of the SVM, from the perspective of maximizing the margin.

Practical issues We can also use RLS for classification. What are the tradeoffs? The SVM possesses sparsity: it can have parameters set to zero in the solution. This enables potentially faster training and faster prediction than RLS. SVM QP solvers tend to have many parameters to tune. SVMs can scale to very large datasets, unlike RLS for the moment (active research topic!).

Good Large-Scale SVM Solvers SVM Light: http://svmlight.joachims.org SVM Torch: http://www.torch.ch libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Appendix

SVM Training Our plan will be to solve the dual problem to find the α's, and use that to find b and our function f. The dual problem is easier to solve than the primal problem. It has simple box constraints and a single equality constraint; even better, we will see that the problem can be decomposed into a sequence of smaller problems.

Off-the-shelf QP software We can solve QPs using standard software, and many codes are available. Main problem: the Q matrix is dense, and is n-by-n, so for large n we cannot even write it down. Standard QP software requires the Q matrix, so it is not suitable for large problems.

Decomposition, I Partition the dataset into a working set W and the remaining points R. We can rewrite the dual problem as:
max_{α_W ∈ R^|W|, α_R ∈ R^|R|} Σ_{i∈W} α_i + Σ_{i∈R} α_i − (1/2) [α_W α_R] [Q_WW Q_WR; Q_RW Q_RR] [α_W; α_R]
subject to: Σ_{i∈W} y_i α_i + Σ_{i∈R} y_i α_i = 0; 0 ≤ α_i ≤ C, ∀i.

Decomposition, II Suppose we have a feasible solution α. We can get a better solution by treating α_W as variable and α_R as constant. We can solve the reduced dual problem:
max_{α_W ∈ R^|W|} (1 − Q_WR α_R)ᵀ α_W − (1/2) α_Wᵀ Q_WW α_W
subject to: Σ_{i∈W} y_i α_i = −Σ_{i∈R} y_i α_i; 0 ≤ α_i ≤ C, ∀i ∈ W.
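
A quick numeric sanity check (our own, with a random symmetric Q standing in for (diag y) K (diag y)) that the block partition above reproduces the full quadratic form:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
Q = A @ A.T                                   # symmetric PSD, like diag(y) K diag(y)
a = rng.random(n)

W = [0, 2, 4]                                 # working set indices
R = [1, 3, 5]                                 # remaining indices
aW, aR = a[W], a[R]
QWW, QWR, QRR = Q[np.ix_(W, W)], Q[np.ix_(W, R)], Q[np.ix_(R, R)]

full = a.sum() - 0.5 * (a @ Q @ a)
# Because Q is symmetric, Q_RW = Q_WR^T, so the two cross terms merge into one.
split = (aW.sum() + aR.sum()
         - 0.5 * (aW @ QWW @ aW + 2.0 * (aW @ QWR @ aR) + aR @ QRR @ aR))
```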

Decomposition, III The reduced problems are fixed size, and can be solved using a standard QP code. Convergence proofs are difficult, but this approach seems to always converge to an optimal solution in practice.

Selecting the Working Set There are many different approaches. The basic idea is to examine points not in the working set, find points which violate the reduced optimality conditions, and add them to the working set. Remove points which are in the working set but are far from violating the optimality conditions.