Support Vector Machine (SVM)
CE-725: Statistical Pattern Recognition
Sharif University of Technology, Spring 2013
Soleymani
Outline
Margin concept
Hard-margin SVM
Soft-margin SVM
Dual problems of hard-margin SVM and soft-margin SVM
Nonlinear SVM
Kernel trick
Margin
Which line is better to select as the boundary in order to provide better generalization capability? A larger margin provides better generalization to unseen data.
The margin of a hyperplane that separates samples of two linearly separable classes is the smallest distance between the decision boundary and any of the training samples.
Maximum Margin
SVM finds the solution with the maximum margin.
Solution: the hyperplane that is farthest from all training samples.
The hyperplane with the largest margin has equal distances to the nearest samples of both classes.
Hard-Margin SVM: Optimization Problem
Decision boundary: $\mathbf{w}^\top \mathbf{x} + b = 0$. Margin hyperplanes: $\mathbf{w}^\top \mathbf{x} + b = 1$ and $\mathbf{w}^\top \mathbf{x} + b = -1$; the distance between them is $\frac{2}{\|\mathbf{w}\|}$.
Find the separating hyperplane with the maximum margin:
$$\max_{\mathbf{w},\,b}\ \frac{2}{\|\mathbf{w}\|} \qquad \text{s.t.}\quad y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) \ge 1,\quad i = 1, \dots, N$$
Hard-Margin SVM: Optimization Problem
We can fix the scale of $(\mathbf{w}, b)$ so that $y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) = 1$ for the samples closest to the boundary.
Rescaling $(\mathbf{w}, b)$ to $(c\mathbf{w}, cb)$ does not change the places of the boundary and the margin lines $\mathbf{w}^\top \mathbf{x} + b = \pm 1$.
Hard-Margin SVM: Optimization Problem
We can equivalently optimize:
$$\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.}\quad y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) \ge 1,\quad i = 1, \dots, N$$
It is a convex Quadratic Programming (QP) problem:
There are computationally efficient packages to solve it.
It has a unique global minimum (if any solution exists).
When the training samples are not linearly separable, it has no solution. How can we extend it to find a solution even when the classes are not linearly separable?
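As a concrete illustration, the following is a minimal sketch of this primal QP passed to a generic QP solver; the toy 2-D data, the tiny regularization on $b$, and the use of the cvxopt package are assumptions made only for this sketch.

import numpy as np
from cvxopt import matrix, solvers

# Toy 2-D data; labels must be in {-1, +1} and the classes linearly separable.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, d = X.shape

# Decision variables z = [w_1, ..., w_d, b].
# Objective (1/2) w^T w becomes (1/2) z^T P z with q = 0.
P = np.zeros((d + 1, d + 1))
P[:d, :d] = np.eye(d)
P[d, d] = 1e-8                      # tiny regularization on b, only for numerical stability
q = np.zeros(d + 1)

# Constraints y_i (w^T x_i + b) >= 1, rewritten in the solver's form G z <= h.
G = -np.hstack([y[:, None] * X, y[:, None]])
h = -np.ones(N)

solvers.options["show_progress"] = False
z = np.array(solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))["x"]).ravel()
w, b = z[:d], z[d]
print("w =", w, "b =", b)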
Beyond Linear Separability
How can the hard-margin SVM be extended to allow classification errors?
Overlapping classes that can be approximately separated by a linear boundary.
Noise in otherwise linearly separable classes.
Beyond Linear Separability: Soft-Margin SVM
Minimizing the number of misclassified points? NP-complete.
Soft margin: maximize the margin while trying to minimize the distance between misclassified points and their correct margin plane.
Soft-Margin SVM
SVM with slack variables: allows samples to fall within the margin, but penalizes them:
$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i \qquad \text{s.t.}\quad y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \dots, N$$
$\xi_i$: slack variables.
$\xi_i > 1$: the sample is misclassified.
$0 < \xi_i \le 1$: the sample is correctly classified but lies inside the margin.
Soft-Margin SVM
A linear penalty (hinge loss) is paid for a sample if it is misclassified or lies inside the margin.
The objective tries to keep the slack variables $\xi_i$ small while maximizing the margin.
It always finds a solution (as opposed to the hard-margin SVM) and is more robust to outliers.
The soft-margin problem is still a convex QP.
Soft-Margin SVM: Parameter C
$C$ is a tradeoff parameter:
Small $C$ allows the margin constraints to be easily ignored: large margin.
Large $C$ makes the constraints hard to ignore: narrow margin.
$C \to \infty$ enforces all constraints: hard margin.
$C$ can be determined using a technique such as cross-validation.
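A minimal sketch of choosing $C$ by cross-validation; the synthetic dataset, the grid of $C$ values, and the use of scikit-learn are assumptions made only for this sketch.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification data for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold cross-validation over a logarithmic grid of C values.
grid = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["C"], " cv accuracy:", grid.best_score_)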
Soft-Margin SVM: Cost Function
The soft-margin problem is equivalent to the unconstrained optimization problem:
$$\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\max\left(0,\ 1 - y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right)\right)$$
SVM Loss Function
Hinge loss vs. 0-1 loss:
$$\text{hinge loss:}\quad \max\left(0,\ 1 - y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right)\right)$$
[Figure: hinge loss and 0-1 loss plotted as functions of $y\left(\mathbf{w}^\top \mathbf{x} + b\right)$]
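A minimal numpy sketch comparing the two losses on a few illustrative margin values; the numbers are assumptions chosen only to cover the three regimes (outside the margin, inside the margin, misclassified).

import numpy as np

# Values of y * (w^T x + b) for a few hypothetical samples:
# >= 1 means outside the margin, in (0, 1) means inside the margin, <= 0 means misclassified.
margins = np.array([2.0, 1.0, 0.5, 0.0, -0.5, -2.0])

hinge = np.maximum(0.0, 1.0 - margins)      # max(0, 1 - y (w^T x + b))
zero_one = (margins <= 0).astype(float)     # 1 if misclassified, else 0

print("margins :", margins)
print("hinge   :", hinge)
print("0-1     :", zero_one)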
Optimization: Lagrangian Multipliers
Introduce a Lagrange multiplier for each inequality constraint: $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_N]^\top$, with $\alpha_i \ge 0$.
Optimization: Dual Problem
Primal problem: $p^* = \min_{\mathbf{w}} \max_{\boldsymbol{\alpha} \ge 0} L(\mathbf{w}, \boldsymbol{\alpha})$
Dual problem: $d^* = \max_{\boldsymbol{\alpha} \ge 0} \min_{\mathbf{w}} L(\mathbf{w}, \boldsymbol{\alpha})$, obtained by swapping the order of the optimizations.
In general we have weak duality: $\max_{\boldsymbol{\alpha} \ge 0} \min_{\mathbf{w}} L(\mathbf{w}, \boldsymbol{\alpha}) \le \min_{\mathbf{w}} \max_{\boldsymbol{\alpha} \ge 0} L(\mathbf{w}, \boldsymbol{\alpha})$, i.e., $d^* \le p^*$.
When the original problem is convex (the objective and the inequality constraint functions are convex and the equality constraints are affine), we have strong duality: $d^* = p^*$.
Hard-Margin SVM: Dual Problem
By incorporating the constraints through Lagrange multipliers, we have:
$$\min_{\mathbf{w},\,b}\ \max_{\boldsymbol{\alpha} \ge 0}\ \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N}\alpha_i\left(1 - y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right)\right)$$
Dual problem (changing the order of min and max in the above problem):
$$\max_{\boldsymbol{\alpha} \ge 0}\ \min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2 + \sum_{i=1}^{N}\alpha_i\left(1 - y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right)\right)$$
Hard-Margin SVM: Dual Problem
Setting the derivatives of the inner minimization to zero:
$$\nabla_{\mathbf{w}} L = 0 \ \Rightarrow\ \mathbf{w} = \sum_{i=1}^{N}\alpha_i\, y^{(i)}\, \mathbf{x}^{(i)} \qquad \frac{\partial L}{\partial b} = 0 \ \Rightarrow\ \sum_{i=1}^{N}\alpha_i\, y^{(i)} = 0$$
$\mathbf{w}$ and $b$ do not appear in the resulting dual objective; instead, a global constraint on $\boldsymbol{\alpha}$ is created.
Hard-Margin SVM: Dual Problem
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\,\alpha_j\, y^{(i)} y^{(j)}\, \mathbf{x}^{(i)\top}\mathbf{x}^{(j)}$$
Subject to $\sum_{i=1}^{N}\alpha_i\, y^{(i)} = 0$ and $\alpha_i \ge 0,\ i = 1, \dots, N$.
It is a convex QP. By solving the above problem we first find $\boldsymbol{\alpha}$, then $\mathbf{w} = \sum_{i=1}^{N}\alpha_i\, y^{(i)}\, \mathbf{x}^{(i)}$, and then $b$.
Hard-Margin SVM: Dual Problem
In the dual problem, only the dot product of each pair of training samples appears.
This is an important property that helps extend SVM to the non-linear case: the cost function does not depend explicitly on the dimensionality of the feature space.
Hard-Margin SVM: Support Vectors
Support Vectors (SVs) = $\{\mathbf{x}^{(i)} \mid \alpha_i > 0\}$.
The direction of the hyperplane can be found based only on the support vectors: $\mathbf{w} = \sum_{i \in \text{SVs}} \alpha_i\, y^{(i)}\, \mathbf{x}^{(i)}$.
$b$ can be set by making the margin equidistant to the two classes; it can be found from any SV using $y^{(s)}\left(\mathbf{w}^\top \mathbf{x}^{(s)} + b\right) = 1$.
It is numerically safer to find $b$ using the equations of all SVs (averaging).
Hard-Margin SVM: Classifying New Samples Using Only SVs
Classification of a new sample $\mathbf{x}$:
$$\hat{y} = \operatorname{sign}\left(\mathbf{w}^\top \mathbf{x} + b\right) = \operatorname{sign}\left(\sum_{i \in \text{SVs}} \alpha_i\, y^{(i)}\, \mathbf{x}^{(i)\top}\mathbf{x} + b\right)$$
The support vectors are sufficient to predict the labels of new samples.
The classifier is based on an expansion in terms of dot products of $\mathbf{x}$ with the support vectors.
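A minimal sketch of the hard-margin dual QP, with the support vectors identified and $\mathbf{w}$, $b$ recovered from them; the toy data, the numerical threshold on $\alpha_i$, and the use of the cvxopt package are assumptions made only for this sketch.

import numpy as np
from cvxopt import matrix, solvers

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = X.shape[0]

# Dual: max  sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i.x_j
# cvxopt minimizes (1/2) a^T P a + q^T a, so P_ij = y_i y_j x_i.x_j and q = -1.
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(N))
G = matrix(-np.eye(N))            # -alpha_i <= 0   (i.e., alpha_i >= 0)
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, -1))      # sum_i alpha_i y_i = 0
b_eq = matrix(np.zeros(1))

solvers.options["show_progress"] = False
alpha = np.array(solvers.qp(P, q, G, h, A, b_eq)["x"]).ravel()

sv = alpha > 1e-6                               # support vectors: alpha_i > 0
w = (alpha[sv] * y[sv]) @ X[sv]                 # w = sum over SVs of alpha_i y_i x_i
b = np.mean(y[sv] - X[sv] @ w)                  # average of b over all SVs

x_new = np.array([1.0, 0.5])
print("prediction:", np.sign(w @ x_new + b))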
Karush-Kuhn-Tucker (KKT) Conditions
Necessary conditions for the solution:
$$\nabla_{\mathbf{w}} L(\mathbf{w}, b, \boldsymbol{\alpha}) = 0, \qquad \frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial b} = 0$$
$$\alpha_i \ge 0, \qquad y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) - 1 \ge 0, \qquad \alpha_i\left(y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) - 1\right) = 0, \qquad i = 1, \dots, N$$
Hard-Margin SVM: Support Vectors
Inactive constraint: $y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) > 1$, so $\alpha_i = 0$ and thus $\mathbf{x}^{(i)}$ is not a support vector.
Active constraint: $y^{(i)}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) = 1$, so $\alpha_i$ can be greater than 0 and thus $\mathbf{x}^{(i)}$ can be a support vector.
Note that a sample with $\alpha_i = 0$ can still lie on one of the margin hyperplanes.
Soft-Margin SVM: Dual Problem
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\,\alpha_j\, y^{(i)} y^{(j)}\, \mathbf{x}^{(i)\top}\mathbf{x}^{(j)}$$
Subject to $\sum_{i=1}^{N}\alpha_i\, y^{(i)} = 0$ and $0 \le \alpha_i \le C,\ i = 1, \dots, N$.
By solving the above quadratic problem we first find $\boldsymbol{\alpha}$, then find $\mathbf{w} = \sum_{i=1}^{N}\alpha_i\, y^{(i)}\, \mathbf{x}^{(i)}$, and $b$ is computed from the SVs.
For a test sample $\mathbf{x}$ (as before): $\hat{y} = \operatorname{sign}\left(\mathbf{w}^\top \mathbf{x} + b\right) = \operatorname{sign}\left(\sum_{i \in \text{SVs}} \alpha_i\, y^{(i)}\, \mathbf{x}^{(i)\top}\mathbf{x} + b\right)$.
Soft-Margin SVM: Support Vectors
Support vectors: samples with $\alpha_i > 0$.
If $0 < \alpha_i < C$: SVs on the margin, $\xi_i = 0$.
If $\alpha_i = C$: SVs on or over the margin ($\xi_i$ may be greater than 0).
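In the dual, the only difference from the hard-margin case is the box constraint $0 \le \alpha_i \le C$. A minimal sketch of how this changes the constraint matrices of the earlier hard-margin dual sketch; $N$, $C$, and the reuse of cvxopt are assumptions made only for illustration.

import numpy as np
from cvxopt import matrix

N = 4          # number of training samples (as in the earlier toy example)
C = 1.0        # tradeoff parameter, e.g. chosen by cross-validation

# Stack  -alpha_i <= 0  and  alpha_i <= C  into a single system  G alpha <= h;
# P, q, A, and the equality constraint stay exactly as in the hard-margin dual.
G = matrix(np.vstack([-np.eye(N), np.eye(N)]))
h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))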
Primal vs. Dual Soft-Margin SVM Problem
Primal problem of soft-margin SVM: $N$ inequality constraints, $N$ positivity constraints, $d + N + 1$ variables.
Dual problem of soft-margin SVM: one equality constraint, $2N$ positivity constraints, $N$ variables (the Lagrange multipliers), but a more complicated objective function.
The dual problem is helpful and instrumental for using the kernel trick.
Not Linearly Separable Data
Noisy data or overlapping classes (discussed above: soft margin): nearly linearly separable.
Non-linear decision surface: transform to a new feature space.
Nonlinear SVM
For nonlinearly separable classes, map the input to a new feature space:
$\Phi: \mathbf{x} \to \boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})]^\top$
$\{\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})\}$: a set of basis functions (or features), with $\phi_i: \mathbb{R}^d \to \mathbb{R}$.
SVM in a Transformed Feature Space
Assume a transformation $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^m$ of the feature space.
Find a hyperplane in the transformed feature space: $\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + b = 0$.
Basis Functions: Examples
Polynomial: $\phi_j(x) = x^j$
Gaussian: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$
Sigmoid: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$
[Bishop]
Soft-Margin SVM in a Transformed Space: Primal Problem
Primal problem:
$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i \qquad \text{s.t.}\quad y^{(i)}\left(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}^{(i)}) + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, \dots, N$$
$\mathbf{w} \in \mathbb{R}^m$: the weights that must be found.
If $m \gg d$ (a very high-dimensional feature space), then there are many more parameters to learn.
Classifying a new sample:
$$\hat{y} = \operatorname{sign}\left(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) + b\right) = \operatorname{sign}\left(\sum_{i \in \text{SVs}} \alpha_i\, y^{(i)}\, \boldsymbol{\phi}(\mathbf{x}^{(i)})^\top \boldsymbol{\phi}(\mathbf{x}) + b\right)$$
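A minimal sketch of the explicit-feature-map approach: expand the input into polynomial features and fit a linear SVM in the transformed space. The dataset, the degree, and the use of scikit-learn are assumptions made only for this sketch.

from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

# Data that is not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

# Map x -> phi(x) with all monomials up to degree 2, then fit a linear SVM
# in the transformed space; the number of learned weights grows with the map.
model = make_pipeline(PolynomialFeatures(degree=2), LinearSVC(C=1.0, max_iter=10000))
model.fit(X, y)
print("training accuracy:", model.score(X, y))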
Soft-Margin SVM in a Transformed Space: Dual Problem
Optimization problem:
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\,\alpha_j\, y^{(i)} y^{(j)}\, \boldsymbol{\phi}(\mathbf{x}^{(i)})^\top \boldsymbol{\phi}(\mathbf{x}^{(j)})$$
Subject to $\sum_{i=1}^{N}\alpha_i\, y^{(i)} = 0$ and $0 \le \alpha_i \le C,\ i = 1, \dots, N$.
If we have the inner products $\boldsymbol{\phi}(\mathbf{x}^{(i)})^\top \boldsymbol{\phi}(\mathbf{x}^{(j)})$, only $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_N]^\top$ needs to be learned.
It is not necessary to learn the $m$ parameters of $\mathbf{w}$, as opposed to the primal problem.
Kernelized Soft-Margin SVM
Optimization problem:
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\,\alpha_j\, y^{(i)} y^{(j)}\, k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$$
Subject to $\sum_{i=1}^{N}\alpha_i\, y^{(i)} = 0$ and $0 \le \alpha_i \le C,\ i = 1, \dots, N$.
Classifying a new sample:
$$\hat{y} = \operatorname{sign}\left(\sum_{i \in \text{SVs}} \alpha_i\, y^{(i)}\, k(\mathbf{x}^{(i)}, \mathbf{x}) + b\right), \qquad k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}')$$
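A minimal sketch of a kernelized soft-margin SVM; the dataset, the choice of the Gaussian (RBF) kernel, the hyperparameter values, and the use of scikit-learn are assumptions made only for this sketch.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

# The Gaussian (RBF) kernel k(x, x') = exp(-gamma ||x - x'||^2) replaces the
# explicit feature map; only the dual variables (one per sample) are learned.
clf = SVC(kernel="rbf", C=1.0, gamma=1.0)
clf.fit(X, y)
print("number of support vectors:", clf.n_support_.sum())
print("training accuracy:", clf.score(X, y))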
SVM: Summary
Hard margin: maximizing the margin.
Soft margin: handling noisy data and overlapping classes via slack variables in the problem.
Dual problems of hard-margin and soft-margin SVM: they lead easily to the non-linear SVM method by kernel substitution, and they express the classifier decision in terms of support vectors.
Kernel SVMs learn a linear decision boundary in a high-dimensional space without explicitly working on the mapped data.