An Idiot's Guide to Support Vector Machines (SVMs)




An Idiot's Guide to Support Vector Machines (SVMs)
R. Berwick, Village Idiot

SVMs: A New Generation of Learning Algorithms
- Pre-1980: Almost all learning methods learned linear decision surfaces. Linear learning methods have nice theoretical properties.
- 1980's: Decision trees and NNs allowed efficient learning of non-linear decision surfaces. Little theoretical basis, and all suffer from local minima.
- 1990's: Efficient learning algorithms for non-linear functions based on computational learning theory developed. Nice theoretical properties.

Key Ideas
Two independent developments within the last decade:
- New efficient separability of non-linear regions that use "kernel functions": generalization of "similarity" to new kinds of similarity measures based on dot products
- Use of a quadratic optimization problem to avoid "local minimum" issues with neural nets
- The resulting learning algorithm is an optimization algorithm rather than a greedy search

Organization
- Basic idea of support vector machines: just like 1-layer or multi-layer neural nets
- Optimal hyperplane for linearly separable patterns
- Extend to patterns that are not linearly separable by transformations of original data to map into a new space: the Kernel function
- SVM algorithm for pattern recognition

Support Vectors
- Support vectors are the data points that lie closest to the decision surface (or hyperplane)
- They are the data points most difficult to classify
- They have direct bearing on the optimum location of the decision surface
- We can show that the optimal hyperplane stems from the function class with the lowest "capacity" = number of independent features/parameters we can twiddle [note: this is extra material not covered in the lectures; you don't have to know this]

Recall from 1-layer nets: Which Separating Hyperplane?
- In general, lots of possible solutions for a, b, c (an infinite number!)
- Support Vector Machine (SVM) finds an optimal solution

Support Vector Machine (SVM)
- SVMs maximize the margin (Winston terminology: the "street") around the separating hyperplane.
- The decision function is fully specified by a (usually very small) subset of training samples, the support vectors.
- This becomes a quadratic programming problem that is easy to solve by standard methods.
[Figure: separating hyperplane with the support vectors marked and the margin maximized]

Separation by Hyperplanes
- Assume linear separability for now (we will relax this later)
- In 2 dimensions, can separate by a line; in higher dimensions, need hyperplanes

General input/output for SVMs: just like for neural nets, but for one important addition...
- Input: set of (input, output) training pair samples; call the input sample features x1, x2, ..., xn, and the output result y. Typically, there can be lots of input features xi.
- Output: set of weights w (or wi), one for each feature, whose linear combination predicts the value of y. (So far, just like neural nets...)
- Important difference: we use the optimization of maximizing the margin ("street width") to reduce the number of weights that are nonzero to just a few that correspond to the important features that "matter" in deciding the separating line (hyperplane). These nonzero weights correspond to the support vectors (because they "support" the separating hyperplane).

2-D Case
Find a, b, c such that
  ax + by ≥ c for red points
  ax + by ≤ (or <) c for green points.
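The 2-D case above can be sketched in a few lines of code. This is a minimal illustration, not part of the original slides: the red/green points and the candidate line coefficients are made up for the example.

```python
# Check whether a candidate line a*x + b*y = c separates two point sets
# (hypothetical data chosen for illustration).
red = [(2.0, 3.0), (3.0, 3.5), (4.0, 4.0)]    # want a*x + b*y >= c
green = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0)]  # want a*x + b*y < c

def separates(a, b, c, red, green):
    """True if a*x + b*y = c puts red on the >= side and green on the < side."""
    return (all(a*x + b*y >= c for x, y in red) and
            all(a*x + b*y < c for x, y in green))

print(separates(1.0, 1.0, 3.0, red, green))  # one of infinitely many separators
```

Note that many (a, b, c) triples pass this check; the rest of the slides are about picking the optimal one.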

Which Hyperplane to pick?
- Lots of possible solutions for a, b, c.
- Some methods find a separating hyperplane, but not the optimal one (e.g., neural net)
- But: which points should influence optimality?
  All points? Linear regression; neural nets.
  Or only "difficult points" close to the decision boundary? Support vector machines.

Support Vectors again, for the linearly separable case
- Support vectors are the elements of the training set that would change the position of the dividing hyperplane if removed.
- Support vectors are the critical elements of the training set.
- The problem of finding the optimal hyperplane is an optimization problem and can be solved by optimization techniques (we use Lagrange multipliers to get this problem into a form that can be solved analytically).

Support Vectors: input vectors that just touch the boundary of the margin (street), circled below; there are 3 of them (or, rather, the "tips" of the vectors):
  w0^T x + b0 = 1  or  w0^T x + b0 = -1
[Figure: points on the two margin boundaries, with d denoting 1/2 of the street width]

Here, we have shown the actual support vectors, v1, v2, v3, instead of just the 3 circled points at the tail ends of the support vectors. d denotes 1/2 of the street width.

Definitions
Define the hyperplanes H such that:
  w·xi + b ≥ +1 when yi = +1
  w·xi + b ≤ -1 when yi = -1
H1 and H2 are the planes:
  H1: w·xi + b = +1
  H2: w·xi + b = -1
The points on the planes H1 and H2 are the tips of the support vectors.
The plane H0 is the median in between, where w·xi + b = 0.
- d+ = the shortest distance to the closest positive point
- d- = the shortest distance to the closest negative point
The margin (gutter) of a separating hyperplane is d+ + d-.
[Figure: H1, H0, H2 with distances d+ and d-]

- Moving a support vector moves the decision boundary
- Moving the other vectors has no effect
- The optimization algorithm to generate the weights proceeds in such a way that only the support vectors determine the weights and thus the boundary
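The definitions above can be checked numerically. A small sketch on hypothetical data (the weight vector, bias, and points below are invented for illustration): verify the constraints yi(w·xi + b) ≥ 1 and compute d+ and d- as the shortest signed distances to the hyperplane w·x + b = 0.

```python
import math

# Hyperplane w.x + b = 0 with w = (1, 0), b = -1, i.e. the line x1 = 1.
w, b = (1.0, 0.0), -1.0
points = [((2.0, 0.0), +1), ((3.0, 1.0), +1),
          ((0.0, 0.0), -1), ((-1.0, 2.0), -1)]

norm_w = math.hypot(*w)

def signed_dist(x):
    return (w[0]*x[0] + w[1]*x[1] + b) / norm_w

# All points satisfy the margin constraint y_i * (w.x_i + b) >= 1.
assert all(y * (w[0]*x[0] + w[1]*x[1] + b) >= 1 for x, y in points)

d_plus  = min(signed_dist(x) for x, y in points if y == +1)
d_minus = min(-signed_dist(x) for x, y in points if y == -1)
print(d_plus, d_minus, d_plus + d_minus)  # margin (gutter) = d+ + d-
```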

Defining the separating Hyperplane
The form of the equation defining the decision surface separating the classes is a hyperplane of the form:
  w^T x + b = 0
- w is a weight vector
- x is the input vector
- b is the bias
This allows us to write:
  w^T x + b ≥ 0 for di = +1
  w^T x + b < 0 for di = -1

Some final definitions
- Margin of Separation (d): the separation between the hyperplane and the closest data point for a given weight vector w and bias b.
- Optimal Hyperplane (maximal margin): the particular hyperplane for which the margin of separation d is maximized.

Maximizing the margin (aka street width)
We want a classifier (linear separator) with as big a margin as possible.
Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is |Ax0 + By0 + c| / sqrt(A² + B²). So the distance between H0 and H1 is |w·x + b| / ||w|| = 1/||w||, and the total distance between H1 and H2 is thus 2/||w||.

In order to maximize the margin, we thus need to minimize ||w||, with the condition that there are no datapoints between H1 and H2:
  xi·w + b ≥ +1 when yi = +1
  xi·w + b ≤ -1 when yi = -1
These can be combined into: yi(xi·w + b) ≥ 1

We now must solve a quadratic programming problem
- The problem is: minimize ||w||, s.t. the discrimination boundary is obeyed, i.e., min f(x) s.t. g(x) = 0, which we can rewrite as:
  min f: ½||w||²  (note this is a quadratic function)
  s.t. g: yi(w·xi + b) = 1  or  [yi(w·xi + b)] - 1 = 0
- This is a constrained optimization problem, and it can be solved by the Lagrangian multiplier method.
- Because it is quadratic, the surface is a paraboloid, with just a single global minimum (thus avoiding a problem we had with neural nets!)
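The distance formula and the resulting street width 2/||w|| are easy to sanity-check in code. A minimal sketch (the line and the weight vector below are arbitrary examples, not taken from the slides):

```python
import math

# Distance from (x0, y0) to the line A*x + B*y + c = 0:
# |A*x0 + B*y0 + c| / sqrt(A^2 + B^2).
def point_line_distance(A, B, c, x0, y0):
    return abs(A*x0 + B*y0 + c) / math.sqrt(A*A + B*B)

# Distance from the origin to x + y - 2 = 0 is 2/sqrt(2) = sqrt(2).
print(point_line_distance(1, 1, -2, 0, 0))

# For margin planes w.x + b = +/-1, the street width is 2/||w||.
w = (3.0, 4.0)                       # hypothetical weight vector, ||w|| = 5
street_width = 2 / math.hypot(*w)    # = 0.4
print(street_width)
```

Shrinking ||w|| widens the street, which is exactly why the optimization minimizes ½||w||².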

Example: paraboloid f = 2 + x² + 2y² s.t. x + y = 1
Intuition: find the intersection of two functions f, g at a tangent point (intersection = both constraints satisfied; tangent = derivative is 0); this will be a min (or max) for f s.t. the constraint g is satisfied.

[Figure: flattened paraboloid f = 2 + x² + 2y² with superimposed constraint g: x + y = 1]
Minimized when the constraint line g (shown in green) is tangent to the inner ellipse contour lines of f (shown in red); note the direction of the gradient arrows.

[Figure: flattened paraboloid f = 2 + x² + 2y² with superimposed constraint g: x + y = 1; at the tangent solution p, the gradient vectors of f, g are parallel (no possible move to increment f that also keeps you in region g)]
Minimized when the constraint line g is tangent to the inner ellipse contour line of f.

Two constraints
1. Parallel normal constraint (= gradient constraint on f, g s.t. the solution is a max, or a min)
2. g(x) = 0 (the solution is on the constraint line as well)
We now recast these by combining f, g as the new Lagrangian function, by introducing new "slack" variables denoted a (or, more usually, denoted α in the literature).
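The paraboloid example above can be worked end to end. Setting the partial derivatives of L = f - a·g to zero gives three equations in (x, y, a); a sketch using exact rational arithmetic (the elimination steps are spelled out by hand):

```python
# Minimize f = 2 + x^2 + 2y^2 subject to g = x + y - 1 = 0
# via the Lagrangian L = f - a*g. Setting partials to zero:
#   dL/dx: 2x - a = 0      -> x = a/2
#   dL/dy: 4y - a = 0      -> y = a/4
#   dL/da: x + y - 1 = 0   (recovers the constraint)
from fractions import Fraction as F

a = F(1) / (F(1, 2) + F(1, 4))   # a/2 + a/4 = 1  =>  a = 4/3
x, y = a / 2, a / 4              # x = 2/3, y = 1/3
f = 2 + x**2 + 2 * y**2
print(x, y, f)                   # 2/3 1/3 8/3
```

Note how differentiating wrt a recovers the constraint g(x, y) = 0, exactly as the next slide explains in general.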

Redescribing these conditions
We want to look for a solution point p where
  ∇f(p) = λ∇g(p)
  g(x) = 0
Or, combining these two as the Lagrangian L and requiring the derivative of L to be zero:
  L(x, a) = f(x) - a g(x)
  ∇L(x, a) = 0

At a solution p
- The constraint line g and the contour lines of f must be tangent
- If they are tangent, their gradient vectors (perpendiculars) are parallel
- The gradient of g is normal to the constraint line (the direction of steepest ascent), and so is perpendicular to the contour lines of f
- The gradient of f must also be in the same direction as the gradient of g

How the Lagrangian solves constrained optimization
  L(x, a) = f(x) - a g(x), where ∇L(x, a) = 0
- Partial derivatives wrt x recover the parallel normal constraint
- Partial derivatives wrt a recover g(x, y) = 0
In general, L(x, a) = f(x) + Σi ai gi(x)

In general
  L(x, a) = f(x) + Σi ai gi(x)   [gradient min of f; constraint condition g]
is a function of n + m variables: n for the x's, m for the a's. Differentiating gives n + m equations, each set to 0. The n eqns differentiated wrt each xi give the gradient conditions; the m eqns differentiated wrt each ai recover the constraints gi.

In our case, f(x): ½||w||²; g(x): yi(w·xi + b) - 1 = 0, so the Lagrangian is:
  min L = ½||w||² - Σ ai [yi(w·xi + b) - 1]  wrt w, b
We expand the last to get the following form of L:
  min L = ½||w||² - Σ ai yi(w·xi + b) + Σ ai  wrt w, b

Lagrangian Formulation
So in the SVM problem the Lagrangian is
  min L_P = ½||w||² - Σ_{i=1..l} ai yi (xi·w + b) + Σ_{i=1..l} ai
  s.t. ∀i, ai ≥ 0, where l is the # of training points
From the property that the derivatives at the min = 0 we get:
  ∂L_P/∂w = w - Σ ai yi xi = 0
  ∂L_P/∂b = -Σ ai yi = 0
so
  w = Σ ai yi xi,  Σ ai yi = 0

What's with this L_P business?
- This indicates that this is the primal form of the optimization problem
- We will actually solve the optimization problem by now solving for the dual of this original problem
- What is this dual formulation?

The Lagrangian Dual Problem: instead of minimizing over w, b, subject to constraints involving a's, we can maximize over a (the dual variable) subject to the relations obtained previously for w and b.
Our solution must satisfy these two relations:
  w = Σ ai yi xi,  Σ ai yi = 0
By substituting for w and b back in the original eqn we can get rid of the dependence on w and b.
Note first that we already now have our answer for what the weights w must be: they are a linear combination of the training inputs and the training outputs, xi and yi, and the values of a.
We will now solve for the a's by differentiating the dual problem wrt a, and setting it to zero. Most of the a's will turn out to have the value zero. The non-zero a's will correspond to the support vectors.

Primal problem:
  min L_P = ½||w||² - Σ ai yi (xi·w + b) + Σ ai
  s.t. ∀i, ai ≥ 0
  w = Σ ai yi xi,  Σ ai yi = 0

Dual problem:
  max L_D(ai) = Σ ai - ½ Σ ai aj yi yj (xi·xj)
  s.t. Σ ai yi = 0 & ai ≥ 0
(note that we have removed the dependence on w and b)
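The dual can be solved by hand on a tiny example. The following sketch uses a hypothetical 2-point training set, x1 = (1, 0) with y1 = +1 and x2 = (-1, 0) with y2 = -1; by symmetry a1 = a2 = a, so L_D reduces to 2a - 2a², maximized at a = 1/2.

```python
# Evaluate L_D(a) = sum_i a_i - 0.5 * sum_ij a_i a_j y_i y_j (x_i . x_j)
# with a1 = a2 = a, then find the maximizer on a fine grid.
def L_D(a):
    xs, ys = [(1.0, 0.0), (-1.0, 0.0)], [+1, -1]
    dot = lambda u, v: u[0]*v[0] + u[1]*v[1]
    quad = sum(a * a * ys[i] * ys[j] * dot(xs[i], xs[j])
               for i in range(2) for j in range(2))
    return 2 * a - 0.5 * quad          # = 2a - 2a^2 for this data

best_a = max((k / 1000 for k in range(0, 2001)), key=L_D)

# Recover the weights: w = sum_i a_i y_i x_i.
w = (best_a * (+1) * 1.0 + best_a * (-1) * (-1.0), 0.0)
print(best_a, w)  # a = 0.5, w = (1.0, 0.0), so margin 2/||w|| = 2
```

Both points end up with nonzero a, as expected: with only two points, both are support vectors.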

The Dual Problem
- Kuhn-Tucker theorem: the solution we find here will be the same as the solution to the original problem
- Q: But why are we doing this? (why not just solve the original problem?)
- Ans: Because this will let us solve the problem by computing just the inner products of xi, xj (which will be very important later on, when we want to solve non-linearly separable classification problems)

The Dual Problem
Dual problem:
  max L_D(ai) = Σ ai - ½ Σ ai aj yi yj (xi·xj)
  s.t. Σ ai yi = 0 & ai ≥ 0
- Notice that all we have are the dot products of xi, xj
- If we take the derivative wrt a and set it equal to zero, we get the following solution, so we can solve for ai:
  Σ ai yi = 0,  0 ≤ ai ≤ C

Now knowing the ai we can find the weights w for the maximal margin separating hyperplane:
  w = Σ ai yi xi
And now, after training and finding w by this method, given an unknown point u measured on features xi, we can classify it by looking at the sign of:
  f(u) = w·u + b = (Σ ai yi xi·u) + b
Remember: most of the weights wi, i.e., the a's, will be zero. Only the support vectors (on the gutters or margin) will have nonzero weights or a's; this reduces the dimensionality of the solution!

Inner products, similarity, and SVMs
- Why should inner product kernels be involved in pattern recognition using SVMs, or at all?
- The intuition is that inner products provide some measure of "similarity"
- The inner product in 2D between 2 vectors of unit length returns the cosine of the angle between them, i.e., how "far apart" they are:
  if x = [1, 0]^T and y = [1, 0]^T are parallel, their inner product is 1 (completely similar): x^T y = x·y = 1
  if x = [1, 0]^T and y = [0, 1]^T are perpendicular (completely unlike), their inner product is 0 (so should not contribute to the correct classifier): x^T y = x·y = 0
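The decision function above is just a sum over support vectors. A sketch using a hypothetical already-trained model (the support vectors, multipliers a = 0.5, and b = 0 below are the ones a symmetric 2-point training set would produce, chosen for illustration):

```python
# f(u) = sum_i a_i y_i (x_i . u) + b, classifying by sign.
support = [((1.0, 0.0), +1, 0.5), ((-1.0, 0.0), -1, 0.5)]  # (x_i, y_i, a_i)
b = 0.0

def f(u):
    return sum(a * y * (x[0]*u[0] + x[1]*u[1]) for x, y, a in support) + b

def classify(u):
    return +1 if f(u) >= 0 else -1

print(classify((3.0, 2.0)), classify((-0.5, 7.0)))  # +1 -1
```

Only the support vectors appear in the sum; every other training point has ai = 0 and drops out entirely.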

Insight into inner products
Consider that we are trying to maximize the form:
  L_D(ai) = Σ ai - ½ Σ ai aj yi yj (xi·xj)
  s.t. Σ ai yi = 0 & ai ≥ 0
The claim is that this function will be maximized if we give nonzero values to a's that correspond to the support vectors, i.e., those that "matter" in fixing the maximum width margin ("street"). Well, consider what this looks like. Note first from the constraint condition that all the a's are positive. Now let's think about a few cases.
Case 1. If two features xi, xj are completely dissimilar, their dot product is 0, and they don't contribute to L.
Case 2. If two features xi, xj are completely alike, their dot product is 1. There are 2 subcases.
  Subcase 1: both xi and xj predict the same output value yi (either +1 or -1). Then yi × yj is always 1, and the value of ai aj yi yj xi·xj will be positive. But this would decrease the value of L (since it would subtract from the first term sum). So, the algorithm downgrades similar feature vectors that make the same prediction.
  Subcase 2: xi and xj make opposite predictions about the output value yi (i.e., one is +1, the other -1), but are otherwise very closely similar: then the product ai aj yi yj xi·xj is negative, and we are subtracting it, so this adds to the sum, maximizing it. These are precisely the examples we are looking for: the critical ones that tell the two classes apart.

Insight into inner products, graphically:
[Figure: 2 very, very similar xi, xj vectors that predict different classes tend to maximize the margin width]

[Figure: 2 vectors that are similar but predict the same class are redundant]
[Figure: 2 dissimilar (orthogonal) vectors don't count at all]

But are we done?

Not Linearly Separable!
Find a line that penalizes points on the wrong side.

Transformation to separate
[Figure: a map ϕ from input space X to feature space F, sending the x's and o's to ϕ(x)'s and ϕ(o)'s, where they become linearly separable]

Non-Linear SVMs
- The idea is to gain linear separation by mapping the data to a higher dimensional space
- The following set can't be separated by a linear function, but can be separated by a quadratic one:
  (x - a)(x - b) = x² - (a + b)x + ab
- So if we map x → {x², x} we gain linear separation
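The 1-D quadratic example above can be demonstrated directly. A sketch with hypothetical roots a = -1, b = 1: points labeled by the sign of (x - a)(x - b) cannot be split by a single threshold on x, but after the map x → (x², x) the same labels are given by a function linear in the new coordinates.

```python
# Label 1-D points by the sign of (x - a)(x - b), then separate them
# linearly in the mapped space (z1, z2) = (x^2, x).
a, b = -1.0, 1.0
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1 if (x - a) * (x - b) > 0 else -1 for x in xs]

def f(x):
    z1, z2 = x * x, x                 # the feature map x -> (x^2, x)
    return z1 - (a + b) * z2 + a * b  # linear in (z1, z2)

# The linear function in mapped space reproduces every label exactly.
assert all((1 if f(x) > 0 else -1) == y for x, y in zip(xs, labels))
print(labels)  # [1, 1, -1, -1, -1, 1, 1]: no single threshold on x works
```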

Problems with linear SVM
[Figure: a ring of y = -1 points surrounding a cluster of y = +1 points]
What if the decision function is not linear? What transform would separate these?
Ans: polar coordinates!

Non-linear SVM: The Kernel trick
Imagine a function φ that maps the data into another space:
[Figure: a radial φ taking the ring of y = -1 points around the y = +1 points to a linearly separable arrangement]
Remember the function we want to optimize:
  L_D = Σ ai - ½ Σ ai aj yi yj (xi·xj)
where (xi·xj) is the dot product of the two feature vectors. If we now transform to φ, instead of computing this dot product (xi·xj) we will have to compute (φ(xi)·φ(xj)). But how can we do this? This is expensive and time consuming (suppose φ is a quartic polynomial, or worse, we don't know the function explicitly).
Well, here is the neat thing: if there is a "kernel function" K such that K(xi, xj) = φ(xi)·φ(xj), then we do not need to know or compute φ at all!! That is, the kernel function defines inner products in the transformed space. Or, it defines similarity in the transformed space.

Non-linear SVMs
So, the function we end up optimizing is:
  L_D = Σ ai - ½ Σ ai aj yi yj K(xi, xj)
Kernel example: the polynomial kernel
  K(xi, xj) = (xi·xj + 1)^p
where p is a tunable parameter.
Note: evaluating K only requires one addition and one exponentiation more than the original dot product.

Examples for Non-Linear SVMs
  K(x, y) = (x·y + 1)^p
  K(x, y) = exp(-||x - y||² / 2σ²)
  K(x, y) = tanh(κ x·y - δ)
- 1st is polynomial (includes x·y as a special case)
- 2nd is radial basis function (Gaussians)
- 3rd is sigmoid (neural net activation function)
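The kernel trick for the polynomial kernel can be verified concretely. For p = 2 in 2-D, (x·y + 1)² equals the dot product of an explicit 6-dimensional feature map; a sketch (the test vectors are arbitrary):

```python
import math

def K(x, y, p=2):
    """Polynomial kernel: one addition and one exponentiation beyond x.y."""
    return (x[0]*y[0] + x[1]*y[1] + 1) ** p

def phi(x):
    """Explicit feature map with phi(x).phi(y) = (x.y + 1)^2 in 2-D."""
    r2 = math.sqrt(2)
    return (1.0, r2*x[0], r2*x[1], x[0]**2, x[1]**2, r2*x[0]*x[1])

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = K(x, y)
rhs = sum(u*v for u, v in zip(phi(x), phi(y)))
print(lhs, rhs)  # both 4.0: (1*3 + 2*(-1) + 1)^2 = 2^2
```

K computes a 6-D inner product at 2-D cost, which is the whole point: the optimization only ever needs K(xi, xj), never φ itself.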

We've already seen such nonlinear transforms
- What is it? tanh(β0 x^T xi + β1)
- It's the sigmoid transform (for neural nets)
- So, SVMs subsume neural nets! (but w/o their problems...)

Inner Product Kernels
  Type of Support Vector Machine | Inner Product Kernel K(x, xi), i = 1, 2, ..., N | Comments
  Polynomial learning machine | (x^T xi + 1)^p | Power p is specified a priori by the user
  Radial-basis function (RBF) | exp(-1/(2σ²) ||x - xi||²) | The width σ² is specified a priori
  Two-layer neural net | tanh(β0 x^T xi + β1) | Actually works only for some values of β0 and β1

Kernels generalize the notion of "inner product similarity"
- Note that one can define kernels over more than just vectors: strings, trees, structures, ... in fact, just about anything
- A very powerful idea: used in comparing DNA, protein structure, sentence structures, etc.

Examples for Non-Linear SVMs 2: Gaussian Kernel
[Figure: decision boundaries with a linear kernel vs. a Gaussian kernel]

Nonlinear RBF kernel
[Figure: decision boundary from a nonlinear RBF kernel]

Admiral's delight w/ different kernel functions
[Figure: the same data classified with different kernel functions]

Overfitting by SVM
Every point is a support vector: too much freedom to bend to fit the training data; no generalization. In fact, SVMs have an "automatic" way to avoid such issues, but we won't cover it here; see the book by Vapnik, 1995. (We add a penalty function for mistakes made after training by over-fitting: recall that if one over-fits, then one will tend to make errors on new data. This penalty fn can be put into the quadratic programming problem directly. You don't need to know this for this course.)