SVM Example. Dan Ventura. March 12, 2009


SVM Example
Dan Ventura
March 2009

Abstract

We try to give a helpful, simple example that demonstrates a linear SVM and then extend the example to a simple non-linear case to illustrate the use of mapping functions and kernels.

1 Introduction

Many learning models make use of the idea that any learning problem can be made easy with the right set of features. The trick, of course, is discovering that "right set of features", which in general is a very difficult thing to do. SVMs are another attempt at a model that does this. The idea behind SVMs is to make use of a (nonlinear) mapping function Φ that transforms data in input space to data in feature space in such a way as to render a problem linearly separable. The SVM then automatically discovers the optimal separating hyperplane (which, when mapped back into input space via Φ⁻¹, can be a complex decision surface). SVMs are rather interesting in that they enjoy both a sound theoretical basis as well as state-of-the-art success in real-world applications.

To illustrate the basic ideas, we will begin with a linear SVM (that is, a model that assumes the data is linearly separable). We will then expand the example to the nonlinear case to demonstrate the role of the mapping function Φ, and finally we will explain the idea of a kernel and how it allows SVMs to make use of high-dimensional feature spaces while remaining tractable.

2 Linear Example when Φ is trivial

Suppose we are given the following positively labeled data points in ℝ²:

{(3, 1), (3, −1), (6, 1), (6, −1)}

and the following negatively labeled data points in ℝ² (see Figure 1):

{(1, 0), (0, 1), (0, −1), (−1, 0)}
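To follow along in code, the short Python sketch below (a convenience of this write-up; NumPy and Matplotlib are assumptions, not tools used in the paper) plots the two classes in the style of Figure 1.

```python
import numpy as np
import matplotlib.pyplot as plt

# Positively and negatively labeled points in R^2, as listed above.
pos = np.array([[3, 1], [3, -1], [6, 1], [6, -1]])
neg = np.array([[1, 0], [0, 1], [0, -1], [-1, 0]])

# Blue diamonds for positive examples, red squares for negative examples,
# matching the description of Figure 1.
plt.scatter(pos[:, 0], pos[:, 1], marker="D", color="blue", label="positive")
plt.scatter(neg[:, 0], neg[:, 1], marker="s", color="red", label="negative")
plt.legend()
plt.show()
```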

Figure 1: Sample data points in ℝ². Blue diamonds are positive examples and red squares are negative examples.

We would like to discover a simple SVM that accurately discriminates the two classes. Since the data is linearly separable, we can use a linear SVM (that is, one whose mapping function Φ() is the identity function). By inspection, it should be obvious that there are three support vectors (see Figure 2):

{s₁ = (1, 0), s₂ = (3, 1), s₃ = (3, −1)}

In what follows we will use vectors augmented with a 1 as a bias input, and for clarity we will differentiate these with an over-tilde. So, if s₁ = (1, 0), then s̃₁ = (1, 0, 1). Figure 3 shows the SVM architecture, and our task is to find values for the αᵢ such that

α₁ Φ(s̃₁)·Φ(s̃₁) + α₂ Φ(s̃₂)·Φ(s̃₁) + α₃ Φ(s̃₃)·Φ(s̃₁) = −1
α₁ Φ(s̃₁)·Φ(s̃₂) + α₂ Φ(s̃₂)·Φ(s̃₂) + α₃ Φ(s̃₃)·Φ(s̃₂) = +1
α₁ Φ(s̃₁)·Φ(s̃₃) + α₂ Φ(s̃₂)·Φ(s̃₃) + α₃ Φ(s̃₃)·Φ(s̃₃) = +1

Since for now we have let Φ() = I, this reduces to

α₁ s̃₁·s̃₁ + α₂ s̃₂·s̃₁ + α₃ s̃₃·s̃₁ = −1
α₁ s̃₁·s̃₂ + α₂ s̃₂·s̃₂ + α₃ s̃₃·s̃₂ = +1
α₁ s̃₁·s̃₃ + α₂ s̃₂·s̃₃ + α₃ s̃₃·s̃₃ = +1

Figure 2: The three support vectors are marked as yellow circles.

Figure 3: The SVM architecture.

Now, computing the dot products results in

2α₁ + 4α₂ + 4α₃ = −1
4α₁ + 11α₂ + 9α₃ = +1
4α₁ + 9α₂ + 11α₃ = +1

A little algebra reveals that the solution to this system of equations is α₁ = −3.5, α₂ = 0.75 and α₃ = 0.75.

Now we can look at how these α values relate to the discriminating hyperplane; or, in other words, now that we have the αᵢ, how do we find the hyperplane that discriminates the positive from the negative examples? It turns out that

w = Σᵢ αᵢ s̃ᵢ = −3.5 (1, 0, 1) + 0.75 (3, 1, 1) + 0.75 (3, −1, 1) = (1, 0, −2)

Finally, remembering that our vectors are augmented with a bias, we can equate the last entry in w with the hyperplane "offset" b and write the separating hyperplane equation y = wx + b with w = (1, 0) and b = −2. Plotting the line gives the expected decision surface (see Figure 4).
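The algebra above is easy to verify numerically. Here is a minimal NumPy sketch (an illustration added to this write-up, not code from the paper) that builds the Gram matrix of the augmented support vectors, solves the linear system for the αᵢ, and recovers w and b.

```python
import numpy as np

# Augmented support vectors s~_i = (s_i, 1).
S = np.array([[1.0, 0.0, 1.0],    # s1 = (1, 0), a negative example
              [3.0, 1.0, 1.0],    # s2 = (3, 1), a positive example
              [3.0, -1.0, 1.0]])  # s3 = (3, -1), a positive example
y = np.array([-1.0, 1.0, 1.0])    # right-hand sides (-1 / +1)

# Gram matrix of dot products s~_i . s~_j and the linear system for alpha.
G = S @ S.T
alpha = np.linalg.solve(G, y)
print(alpha)                      # [-3.5   0.75  0.75]

# The weighted sum of the augmented support vectors; its last entry is b.
w_aug = alpha @ S
print(w_aug[:2], w_aug[2])        # [1. 0.] -2.0
```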

Figure 4: The discriminating hyperplane corresponding to the values α₁ = −3.5, α₂ = 0.75 and α₃ = 0.75.

2.1 Input space vs. Feature space

3 Nonlinear Example when Φ is non-trivial

Now suppose instead that we are given the following positively labeled data points in ℝ²:

{(2, 2), (2, −2), (−2, −2), (−2, 2)}

and the following negatively labeled data points in ℝ² (see Figure 5):

{(1, 1), (1, −1), (−1, −1), (−1, 1)}

Figure 5: Nonlinearly separable sample data points in ℝ². Blue diamonds are positive examples and red squares are negative examples.

Figure 6: The data represented in feature space.

Our goal, again, is to discover a separating hyperplane that accurately discriminates the two classes. Of course, it is obvious that no such hyperplane exists in the input space (that is, in the space in which the original input data live). Therefore, we must use a nonlinear SVM (that is, one whose mapping function Φ is a nonlinear mapping from input space into some feature space). Define

Φ₁(x₁, x₂) = (4 − x₂ + |x₁ − x₂|, 4 − x₁ + |x₁ − x₂|)   if √(x₁² + x₂²) > 2
Φ₁(x₁, x₂) = (x₁, x₂)                                   otherwise          (1)

Referring back to Figure 3, we can see how Φ₁ transforms our data before the dot products are performed. Therefore, we can rewrite the data in feature space as

{(2, 2), (10, 6), (6, 6), (6, 10)}

for the positive examples and

{(1, 1), (1, −1), (−1, −1), (−1, 1)}

for the negative examples (see Figure 6). Now we can once again easily identify the support vectors (see Figure 7):

{s₁ = (1, 1), s₂ = (2, 2)}

We again use vectors augmented with a 1 as a bias input and will differentiate them as before. Now, given the [augmented] support vectors, we must again find values for the αᵢ. This time our constraints are

Figure 7: The two support vectors (in feature space) are marked as yellow circles.

α₁ Φ₁(s̃₁)·Φ₁(s̃₁) + α₂ Φ₁(s̃₂)·Φ₁(s̃₁) = −1
α₁ Φ₁(s̃₁)·Φ₁(s̃₂) + α₂ Φ₁(s̃₂)·Φ₁(s̃₂) = +1

Given Eq. 1, this reduces to

α₁ s̃₁·s̃₁ + α₂ s̃₂·s̃₁ = −1
α₁ s̃₁·s̃₂ + α₂ s̃₂·s̃₂ = +1

(Note that even though Φ₁ is a nontrivial function, both s₁ and s₂ map to themselves under Φ₁. This will not be the case for other inputs, as we will see later.)

Now, computing the dot products results in

3α₁ + 5α₂ = −1
5α₁ + 9α₂ = +1

And the solution to this system of equations is α₁ = −7 and α₂ = 4.
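As a check on the feature-space picture, the NumPy sketch below (again an illustrative aid rather than code from the paper) implements Φ₁ as defined in Eq. 1, maps both classes into feature space, and re-solves the 2×2 system for α₁ and α₂.

```python
import numpy as np

def phi1(x):
    """The mapping of Eq. 1: nontrivial outside the radius-2 disk."""
    x1, x2 = x
    if np.sqrt(x1**2 + x2**2) > 2:
        return np.array([4 - x2 + abs(x1 - x2), 4 - x1 + abs(x1 - x2)], dtype=float)
    return np.array([x1, x2], dtype=float)

pos = [(2, 2), (2, -2), (-2, -2), (-2, 2)]
neg = [(1, 1), (1, -1), (-1, -1), (-1, 1)]
print([tuple(phi1(p)) for p in pos])   # positive class in feature space
print([tuple(phi1(p)) for p in neg])   # negative class maps to itself

# Augmented feature-space support vectors and the 2x2 system for alpha.
S = np.array([[1.0, 1.0, 1.0],         # s1 = (1, 1), a negative example
              [2.0, 2.0, 1.0]])        # s2 = (2, 2), a positive example
alpha = np.linalg.solve(S @ S.T, np.array([-1.0, 1.0]))
print(alpha)                           # [-7.  4.]
```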

Finally, we can again look at the discriminating hyperplane in input space that corresponds to these α:

w = Σᵢ αᵢ s̃ᵢ = −7 (1, 1, 1) + 4 (2, 2, 1) = (1, 1, −3)

giving us the separating hyperplane equation y = wx + b with w = (1, 1) and b = −3. Plotting the line gives the expected decision surface (see Figure 8).

Figure 8: The discriminating hyperplane corresponding to the values α₁ = −7 and α₂ = 4.

3.1 Using the SVM

Let's briefly look at how we would use the SVM model to classify data. Given x, the classification f(x) is given by the equation

f(x) = σ( Σᵢ αᵢ Φ(s̃ᵢ)·Φ(x̃) )        (2)

where σ(z) returns the sign of z. For example, if we wanted to classify the point x = (4, 5) using the mapping function of Eq. 1,

f((4, 5)) = σ( −7 Φ₁((1, 1))·Φ₁((4, 5)) + 4 Φ₁((2, 2))·Φ₁((4, 5)) ) = σ( −7(2) + 4(3) ) = σ(−2)

(here Φ₁((4, 5)) = (0, 1), and the dot products are again taken between augmented vectors), and thus we would classify x = (4, 5) as negative.
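The same computation can be packaged as a small decision function. The sketch below (a self-contained illustration; the helper names phi1, augment and f are mine, not the paper's) evaluates Eq. 2 with the support vectors and α values found above.

```python
import numpy as np

def phi1(x):
    """The mapping of Eq. 1."""
    x1, x2 = x
    if np.sqrt(x1**2 + x2**2) > 2:
        return np.array([4 - x2 + abs(x1 - x2), 4 - x1 + abs(x1 - x2)], dtype=float)
    return np.array([x1, x2], dtype=float)

def augment(v):
    """Append the bias input 1."""
    return np.append(v, 1.0)

# Support vectors (in input space) and their alpha values from this section.
support = [np.array([1.0, 1.0]), np.array([2.0, 2.0])]
alpha = [-7.0, 4.0]

def f(x):
    """Decision function of Eq. 2: sign of the weighted sum of dot products."""
    total = sum(a * (augment(phi1(s)) @ augment(phi1(np.asarray(x, dtype=float))))
                for a, s in zip(alpha, support))
    return np.sign(total)

print(f((4, 5)))   # -1.0, i.e., the point is classified as negative
```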

Looking again at the input space, we might be tempted to think this is not a reasonable classification; however, it is what our model says, and our model is consistent with all the training data. As always, there are no guarantees on generalization accuracy, and if we are not happy with our generalization, the likely culprit is our choice of Φ. Indeed, if we map our discriminating hyperplane (which lives in feature space) back into input space, we can see the effective decision surface of our model (see Figure 9). Of course, we may or may not be able to improve generalization accuracy by choosing a different Φ; however, there is another reason to revisit our choice of mapping function.

Figure 9: The decision surface in input space corresponding to Φ₁. Note the singularity.

4 The Kernel Trick

Our definition of Φ₁ in Eq. 1 preserved the number of dimensions. In other words, our input and feature spaces are the same size. However, it is often the case that in order to effectively separate the data, we must use a feature space that is of (sometimes very much) higher dimension than our input space. Let us now consider an alternative mapping function

Φ₂(x₁, x₂) = (x₁, x₂, (x₁² + x₂² − 5)/3)        (3)

which transforms our data from 2-dimensional input space to 3-dimensional feature space. Using this alternative mapping, the data in the new feature space looks like

{(2, 2, 1), (2, −2, 1), (−2, −2, 1), (−2, 2, 1)}

for the positive examples and

{(1, 1, −1), (1, −1, −1), (−1, −1, −1), (−1, 1, −1)}

for the negative examples.
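The lift of Eq. 3 is easy to verify numerically. The sketch below (an illustration added here, not code from the paper) maps both classes into the 3-dimensional feature space and shows that the third coordinate alone already separates them, which is what makes the simple decision rule discussed next possible.

```python
import numpy as np

def phi2(x):
    """The mapping of Eq. 3: from 2-D input space to 3-D feature space."""
    x1, x2 = x
    return np.array([x1, x2, (x1**2 + x2**2 - 5) / 3], dtype=float)

pos = [(2, 2), (2, -2), (-2, -2), (-2, 2)]
neg = [(1, 1), (1, -1), (-1, -1), (-1, 1)]

# In feature space the third coordinate is +1 for every positive example
# and -1 for every negative example, so x3 alone separates the classes.
print([tuple(phi2(p)) for p in pos])
print([tuple(phi2(p)) for p in neg])
print([np.sign(phi2(p)[2]) for p in pos + neg])   # four +1s, then four -1s
```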

With a little thought, we realize that in this case all 8 of the examples will be support vectors, with a single shared αᵢ value for the positive support vectors and αᵢ = −7/46 for the negative ones. Note that a consequence of this mapping is that we do not need to use augmented vectors (though it wouldn't hurt to do so), because the hyperplane in feature space goes through the origin: y = wx + b, where w = (0, 0, 1) and b = 0. Therefore, the discriminating feature is x₃, and Eq. 2 reduces to f(x) = σ(x₃). Figure 10 shows the decision surface induced in the input space for this new mapping function.

Figure 10: The decision surface in input space corresponding to Φ₂.

Kernel trick.

5 Conclusion

What kernel to use? Slack variables. Theory. Generalization. Dual problem. QP.