Support Vector Machine. Tutorial. (and Statistical Learning Theory)

Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com

1 Support Vector Machines: history
SVMs were introduced at COLT-92 by Boser, Guyon & Vapnik and have become rather popular since.
Theoretically well-motivated algorithm: developed from Statistical Learning Theory (Vapnik & Chervonenkis) since the 60s.
Empirically good performance: successful applications in many fields (bioinformatics, text, image recognition, ...).

2 Support Vector Machines: history II
Centralized website: www.kernel-machines.org.
Several textbooks, e.g. "An Introduction to Support Vector Machines" by Cristianini and Shawe-Taylor.
A large and diverse community works on them: from machine learning, optimization, statistics, neural networks, functional analysis, etc.

3 Support Vector Machines: basics
[Boser, Guyon, Vapnik 92], [Cortes & Vapnik 95]
[Figure: points labeled + and - separated by a hyperplane, with the margin marked on either side.]
Nice properties: convex, theoretically motivated, nonlinear with kernels.

4 Preliminaries
Machine learning is about learning structure from data. Although the class of algorithms called SVMs can do more, in this talk we focus on pattern recognition.
So we want to learn the mapping $X \to Y$, where $x \in X$ is some object and $y \in Y$ is a class label.
Let's take the simplest case: 2-class classification. So: $x \in \mathbb{R}^n$, $y \in \{\pm 1\}$.

5 Example
Suppose we have 50 photographs of elephants and 50 photos of tigers.
[Figure: elephant photographs vs. tiger photographs.]
We digitize them into $100 \times 100$ pixel images, so we have $x \in \mathbb{R}^n$ where $n = 10{,}000$.
Now, given a new (different) photograph, we want to answer the question: is it an elephant or a tiger? [We assume it is one or the other.]

6 Training sets and prediction models
Input/output sets $X$, $Y$.
Training set $(x_1, y_1), \ldots, (x_m, y_m)$.
"Generalization": given a previously unseen $x \in X$, find a suitable $y \in Y$,
i.e., we want to learn a classifier $y = f(x, \alpha)$, where $\alpha$ are the parameters of the function.
For example, if we are choosing our model from the set of hyperplanes in $\mathbb{R}^n$, then we have:
$$f(x, \{w, b\}) = \operatorname{sign}(w \cdot x + b).$$

7 Empirical Risk and the true Risk
We can try to learn $f(x, \alpha)$ by choosing a function that performs well on training data:
$$R_{emp}(\alpha) = \frac{1}{m} \sum_{i=1}^{m} l(f(x_i, \alpha), y_i) = \text{Training Error}$$
where $l$ is the zero-one loss function, $l(y, \hat{y}) = 1$ if $y \neq \hat{y}$, and $0$ otherwise. $R_{emp}$ is called the empirical risk.
By doing this we are trying to minimize the overall risk:
$$R(\alpha) = \int l(f(x, \alpha), y) \, dP(x, y) = \text{Test Error}$$
where $P(x, y)$ is the (unknown) joint distribution function of $x$ and $y$.
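To make the empirical risk concrete, here is a minimal sketch that evaluates it for a fixed hyperplane classifier under the zero-one loss. The toy data, the parameter values, and the helper names (`hyperplane_classifier`, `empirical_risk`) are illustrative assumptions, not part of the tutorial.

```python
def sign(z):
    # Assumption: treat z == 0 as +1 so the output is always in {-1, +1}.
    return 1 if z >= 0 else -1

def hyperplane_classifier(w, b):
    """Return f(x) = sign(w . x + b) for fixed parameters (w, b)."""
    def f(x):
        return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return f

def zero_one_loss(y, y_hat):
    """l(y, y_hat) = 1 if the labels differ, 0 otherwise."""
    return 1 if y != y_hat else 0

def empirical_risk(f, xs, ys):
    """R_emp = (1/m) * sum_i l(f(x_i), y_i), i.e. the training error."""
    return sum(zero_one_loss(y, f(x)) for x, y in zip(xs, ys)) / len(xs)

# Toy training set (made up for illustration).
xs = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, 0.5)]
ys = [1, 1, -1, 1]

f = hyperplane_classifier(w=(1.0, 0.0), b=0.0)
risk = empirical_risk(f, xs, ys)
print(risk)  # one of the four points is misclassified
```

The true risk cannot be computed this way because $P(x, y)$ is unknown; in practice it is estimated on held-out data.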

8 Choosing the set of functions
What about $f(x, \alpha)$ allowing all functions from $X$ to $\{\pm 1\}$?
Training set $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \{\pm 1\}$.
Test set $\bar{x}_1, \ldots, \bar{x}_{\bar{m}} \in X$, such that the two sets do not intersect.
For any $f$ there exists $f^*$ such that:
1. $f^*(x_i) = f(x_i)$ for all $i$
2. $f^*(\bar{x}_j) \neq f(\bar{x}_j)$ for all $j$
Based on the training data alone, there is no means of choosing which function is better. On the test set, however, they give different results. So generalization is not guaranteed.
$\Rightarrow$ a restriction must be placed on the functions that we allow.

9 Empirical Risk and the true Risk
Vapnik & Chervonenkis showed that an upper bound on the true risk can be given by the empirical risk plus an additional term. With probability at least $1 - \eta$:
$$R(\alpha) \leq R_{emp}(\alpha) + \sqrt{\frac{h\left(\log\frac{2m}{h} + 1\right) - \log\frac{\eta}{4}}{m}}$$
where $h$ is the VC dimension of the set of functions parameterized by $\alpha$.
The VC dimension of a set of functions is a measure of their capacity or complexity.
If you can describe a lot of different phenomena with a set of functions then the value of $h$ is large.
[VC dim = the maximum number of points that can be separated in all possible ways by that set of functions.]
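As a rough numerical illustration of the additional term in the bound, the sketch below evaluates $\sqrt{(h(\log(2m/h) + 1) - \log(\eta/4))/m}$ for a few values of $h$ and $m$. The function name `vc_confidence` and the specific numbers are assumptions chosen for illustration.

```python
import math

def vc_confidence(h, m, eta):
    """Capacity term of the VC bound: sqrt((h*(log(2m/h) + 1) - log(eta/4)) / m)."""
    return math.sqrt((h * (math.log(2 * m / h) + 1) - math.log(eta / 4)) / m)

# Illustrative values only: the term grows with the VC dimension h
# and shrinks as the number of training examples m grows.
low_capacity = vc_confidence(h=10, m=1000, eta=0.05)
high_capacity = vc_confidence(h=100, m=1000, eta=0.05)
more_data = vc_confidence(h=10, m=10000, eta=0.05)

print(low_capacity, high_capacity, more_data)
```

This is the trade-off the slides go on to describe: a richer function class lowers the training error but raises this term.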

10 VC dimension
The VC dimension of a set of functions is the maximum number of points that can be separated in all possible ways by that set of functions. For hyperplanes in $\mathbb{R}^n$, the VC dimension can be shown to be $n + 1$.
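The $n + 1$ claim can be illustrated for $n = 2$ with a small experiment. The sketch below uses a plain perceptron (a stand-in chosen here for simplicity, not a method from the tutorial) to check that every labeling of three non-collinear points is linearly separable, and that the XOR labeling of four points is not found to be separable within the iteration budget. The point sets and `max_epochs` are assumptions; a failure to converge is only suggestive in general, although XOR is known not to be linearly separable.

```python
from itertools import product

def perceptron_separates(points, labels, max_epochs=1000):
    """Return True if the perceptron finds (w, b) with y * (w . x + b) > 0 for all points."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # Standard perceptron update on a misclassified point.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def is_shattered(points):
    """Check every possible +/-1 labeling of the points."""
    return all(perceptron_separates(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

# Three non-collinear points in R^2: all 8 labelings can be realized.
three_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shattered_three = is_shattered(three_points)

# Four points with the XOR labeling: no separating hyperplane exists.
xor_points = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
xor_separable = perceptron_separates(xor_points, [-1, -1, 1, 1])

print(shattered_three, xor_separable)
```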

11 VC dimension and capacity of functions
Simplification of bound:
$$\text{Test Error} \leq \text{Training Error} + \text{Complexity of set of Models}$$
Actually, a lot of bounds of this form have been proved (with different measures of capacity). The complexity function is often called a regularizer.
If you take a high capacity set of functions (which can explain a lot) you get low training error. But you might "overfit".
If you take a very simple set of models, you have low complexity, but won't get low training error.

12 Capacity of a set of functions (classification)
[Images taken from a talk by B. Schoelkopf.]

13 Capacity of a set of functions (regression)
[Figure: a sine curve fit and a hyperplane fit compared with the true function, plotted as y against x.]

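14 Controlling the risk: model complexity
[Figure: the bound on the risk, the confidence interval, and the empirical risk (training error) plotted against the VC dimension $h_1, \ldots, h^*, \ldots, h_n$ of the function sets $S_1, \ldots, S^*, \ldots, S_n$.]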

15 Capacity of hyperplanes
Vapnik & Chervonenkis also showed the following:
Consider hyperplanes $(w \cdot x) = 0$ where $w$ is normalized w.r.t. a set of points $X$ such that $\min_i |w \cdot x_i| = 1$.
The set of decision functions $f_w(x) = \operatorname{sign}(w \cdot x)$ defined on $X$ such that $\|w\| \leq A$ has a VC dimension satisfying
$$h \leq R^2 A^2$$
where $R$ is the radius of the smallest sphere around the origin containing $X$.
$\Rightarrow$ minimize $\|w\|^2$ and have low capacity
$\Rightarrow$ minimizing $\|w\|^2$ is equivalent to obtaining a large margin classifier

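[Figure: the hyperplane $\{x \mid \langle w, x \rangle + b = 0\}$ with normal vector $w$, separating the region $\langle w, x \rangle + b > 0$ from the region $\langle w, x \rangle + b < 0$.]

[Figure: the margin hyperplanes $\{x \mid \langle w, x \rangle + b = -1\}$ (on the side of the points with $y_i = -1$) and $\{x \mid \langle w, x \rangle + b = +1\}$ (on the side of the points with $y_i = +1$), on either side of $\{x \mid \langle w, x \rangle + b = 0\}$; $x_1$ lies on the $+1$ hyperplane and $x_2$ on the $-1$ hyperplane.]
Note: $\langle w, x_1 \rangle + b = +1$ and $\langle w, x_2 \rangle + b = -1$
$\Rightarrow \langle w, (x_1 - x_2) \rangle = 2$
$\Rightarrow \left\langle \frac{w}{\|w\|}, (x_1 - x_2) \right\rangle = \frac{2}{\|w\|}$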
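The note above says the two margin hyperplanes are a distance $2 / \|w\|$ apart. A quick numerical check, using made-up values for $w$ and $b$ and two points picked along the direction of $w$ (all assumptions for illustration):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin_width(w):
    """Distance between the hyperplanes <w, x> + b = +1 and <w, x> + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))

w = (3.0, 4.0)   # ||w|| = 5
b = -1.0

# Points of the form t * w: <w, t*w> + b = 25*t - 1.
x1 = (2.0 / 25.0 * 3.0, 2.0 / 25.0 * 4.0)  # t = 2/25, so <w, x1> + b = +1
x2 = (0.0, 0.0)                            # t = 0,    so <w, x2> + b = -1

# x1 - x2 is parallel to w, so its length is the distance between the hyperplanes.
distance = math.sqrt(sum((a - c) ** 2 for a, c in zip(x1, x2)))

print(margin_width(w), distance)
```

Because the width is $2 / \|w\|$, making $\|w\|$ smaller makes the margin larger, which is why the following slides minimize $\|w\|^2$.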

16 Linear Support Vector Machines (at last!)
So, we would like to find the function which minimizes an objective like: Training Error + Complexity term.
We write that as:
$$\frac{1}{m} \sum_{i=1}^{m} l(f(x_i, \alpha), y_i) + \text{Complexity term}$$
For now we will choose the set of hyperplanes (we will extend this later), so $f(x) = (w \cdot x) + b$:
$$\frac{1}{m} \sum_{i=1}^{m} l(w \cdot x_i + b, y_i) + \|w\|^2$$
subject to $\min_i |w \cdot x_i| = 1$.

17 Linear Support Vector Machines II
That function before was a little difficult to minimize because of the step function in $l(y, \hat{y})$ (either 1 or 0).
Let's assume we can separate the data perfectly. Then we can optimize the following:
Minimize $\|w\|^2$, subject to:
$(w \cdot x_i + b) \geq 1$, if $y_i = 1$
$(w \cdot x_i + b) \leq -1$, if $y_i = -1$
The last two constraints can be compacted to:
$$y_i (w \cdot x_i + b) \geq 1$$
This is a quadratic program.
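The compact constraint can be checked directly for any candidate hyperplane. The sketch below does not solve the quadratic program; it only tests feasibility and compares the objective $\|w\|^2$ for a few hand-picked candidates on made-up separable data (the data, candidates, and helper names are assumptions).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def is_feasible(w, b, xs, ys):
    """True if y_i * (w . x_i + b) >= 1 holds for every training point."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in zip(xs, ys))

def objective(w):
    """The quantity being minimized: ||w||^2."""
    return dot(w, w)

# Toy separable data (made up for illustration).
xs = [(2.0, 2.0), (3.0, 1.0), (-2.0, -2.0), (-1.0, -3.0)]
ys = [1, 1, -1, -1]

cand_a = ((1.0, 1.0), 0.0)     # feasible, but ||w||^2 = 2
cand_b = ((0.25, 0.25), 0.0)   # feasible with every constraint tight, ||w||^2 = 0.125
cand_c = ((0.1, 0.1), 0.0)     # smaller norm, but violates the constraints

for w, b in (cand_a, cand_b, cand_c):
    print(w, is_feasible(w, b, xs, ys), objective(w))
```

Among the feasible candidates, the one with the smaller $\|w\|^2$ is the one with the larger margin.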

18 SVMs: non-separable case
To deal with the non-separable case, one can rewrite the problem as:
Minimize:
$$\|w\|^2 + C \sum_{i=1}^{m} \xi_i$$
subject to:
$$y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
This is just the same as the original objective:
$$\frac{1}{m} \sum_{i=1}^{m} l(w \cdot x_i + b, y_i) + \|w\|^2$$
except $l$ is no longer the zero-one loss, but is called the "hinge loss": $l(y, \hat{y}) = \max(0, 1 - y\hat{y})$. This is still a quadratic program!
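One way to see the hinge-loss form of the objective at work is to minimize $\|w\|^2 + C \sum_i \max(0, 1 - y_i(w \cdot x_i + b))$ directly. The sketch below uses plain subgradient descent instead of a QP solver, which is a substitution made here for brevity and not the method described in the slides. The data, step size, and number of epochs are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def objective(w, b, xs, ys, C):
    """||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i + b))."""
    hinge = sum(max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in zip(xs, ys))
    return dot(w, w) + C * hinge

def train_soft_margin(xs, ys, C=1.0, lr=0.01, epochs=500):
    """Subgradient descent on the soft-margin objective (not a QP solver)."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [2.0 * wi for wi in w]      # gradient of ||w||^2
        grad_b = 0.0
        for x, y in zip(xs, ys):
            if y * (dot(w, x) + b) < 1.0:    # point violates the margin
                grad_w = [g - C * y * xi for g, xi in zip(grad_w, x)]
                grad_b -= C * y
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy data (made up for illustration).
xs = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
ys = [1, 1, 1, -1, -1, -1]

w, b = train_soft_margin(xs, ys)
predictions = [1 if dot(w, x) + b >= 0 else -1 for x in xs]
print(w, b, predictions)
```

The slack $\xi_i$ of each point is recovered as $\max(0, 1 - y_i(w \cdot x_i + b))$, so it never has to be stored as a separate variable in this formulation.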

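[Figure: + and - points that are not perfectly separable, with the margin marked and the slack $\xi_i$ shown for a point that violates it.]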

19 Support Vector Machines - Primal
Decision function:
$$f(x) = w \cdot x + b$$
Primal formulation:
$$\min P(w, b) = \underbrace{\frac{1}{2} \|w\|^2}_{\text{maximize margin}} + \underbrace{C \sum_i H_1[y_i f(x_i)]}_{\text{minimize training error}}$$
Ideally $H_1$ would count the number of errors; we approximate it with the hinge loss:
$$H_1(z) = \max(0, 1 - z)$$
[Figure: the hinge loss $H_1(z)$ plotted against $z$.]
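The relationship between counting errors and the hinge loss can be tabulated in a few lines. In this sketch $z = y f(x)$, and treating $z = 0$ as an error is an assumption made for the illustration.

```python
def zero_one_error(z):
    """1 if the example is misclassified (z = y * f(x) <= 0), else 0."""
    return 1.0 if z <= 0 else 0.0

def hinge(z):
    """H_1(z) = max(0, 1 - z)."""
    return max(0.0, 1.0 - z)

zs = [-2.0, -0.5, 0.0, 0.5, 1.0, 2.0]
for z in zs:
    # The hinge loss is never below the error count, and it is zero
    # only once the point is on the correct side with margin at least 1.
    print(z, zero_one_error(z), hinge(z))
```

Unlike the step function, the hinge loss is convex in $z$, which is what keeps the optimization problem tractable.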

20 SVMs: non-linear case
Linear classifiers aren't complex enough sometimes. SVM solution:
Map data into a richer feature space including nonlinear features, then construct a hyperplane in that space so all other equations are the same!
Formally, preprocess the data with:
$$x \mapsto \Phi(x)$$
and then learn the map from $\Phi(x)$ to $y$:
$$f(x) = w \cdot \Phi(x) + b.$$

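21 SVMs: polynomial mapping
$$\Phi: \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$$
[Figure: data in the input space with axes $x_1$, $x_2$ and the same data mapped into the feature space with axes $z_1$, $z_2$, $z_3$.]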

22 SVMs: non-linear case II
For example, MNIST handwriting recognition: 60,000 training examples, 10,000 test examples, $28 \times 28$.
Linear SVM has around 8.5% test error. Polynomial SVM has around 1% test error.
[Figure: sample MNIST digits with their labels.]

23 SVMs: full MNIST results

Classifier                   Test Error
linear                       8.4%
3-nearest-neighbor           2.4%
RBF-SVM                      1.4%
Tangent distance             1.1%
LeNet                        1.1%
Boosted LeNet                0.7%
Translation invariant SVM    0.56%

Choosing a good mapping $\Phi(\cdot)$ (encoding prior knowledge + getting the right complexity of function class) for your problem improves results.

24 SVMs: the kernel trick
Problem: the dimensionality of $\Phi(x)$ can be very large, making $w$ hard to represent explicitly in memory, and hard for the QP to solve.
The Representer theorem (Kimeldorf & Wahba, 1971) shows that (for SVMs as a special case):
$$w = \sum_{i=1}^{m} \alpha_i \Phi(x_i)$$
for some variables $\alpha$. Instead of optimizing $w$ directly we can thus optimize $\alpha$.
The decision rule is now:
$$f(x) = \sum_{i=1}^{m} \alpha_i \Phi(x_i) \cdot \Phi(x) + b$$
We call $K(x_i, x) = \Phi(x_i) \cdot \Phi(x)$ the kernel function.

25 Support Vector Machines - kernel trick II
We can rewrite all the SVM equations we saw before, but with the $w = \sum_{i=1}^{m} \alpha_i \Phi(x_i)$ equation:
Decision function:
$$f(x) = \sum_i \alpha_i \Phi(x_i) \cdot \Phi(x) + b = \sum_i \alpha_i K(x_i, x) + b$$
Dual formulation:
$$\min P(w, b) = \underbrace{\frac{1}{2} \left\| \sum_{i=1}^{m} \alpha_i \Phi(x_i) \right\|^2}_{\text{maximize margin}} + \underbrace{C \sum_i H_1[y_i f(x_i)]}_{\text{minimize training error}}$$

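26 Support Vector Machines - Dual
But people normally write it like this:
Dual formulation:
$$\min_{\alpha} D(\alpha) = \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \Phi(x_i) \cdot \Phi(x_j) - \sum_i y_i \alpha_i \quad \text{s.t.} \quad \sum_i \alpha_i = 0, \quad 0 \leq y_i \alpha_i \leq C$$
Dual decision function:
$$f(x) = \sum_i \alpha_i K(x_i, x) + b$$
The kernel function $K(\cdot, \cdot)$ is used to make an (implicit) nonlinear feature map, e.g.
Polynomial kernel: $K(x, x') = (x \cdot x' + 1)^d$.
RBF kernel: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$.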
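A small sketch of how the dual objective and its constraints can be evaluated for a given $\alpha$, with $\Phi(x_i) \cdot \Phi(x_j)$ replaced by a kernel call. It follows the sign convention of this slide, where $\alpha_i$ carries the sign of $y_i$. The two-point data set, the $\alpha$ values, and the helper names are assumptions; nothing is optimized here.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly_kernel(x, x_prime, d=1):
    """Polynomial kernel K(x, x') = (x . x' + 1)^d."""
    return (dot(x, x_prime) + 1) ** d

def dual_objective(alphas, xs, ys, kernel):
    """D(alpha) = 1/2 * sum_ij alpha_i alpha_j K(x_i, x_j) - sum_i y_i alpha_i."""
    quad = sum(ai * aj * kernel(xi, xj)
               for ai, xi in zip(alphas, xs)
               for aj, xj in zip(alphas, xs))
    return 0.5 * quad - sum(y * a for y, a in zip(ys, alphas))

def is_dual_feasible(alphas, ys, C, tol=1e-12):
    """Check sum_i alpha_i = 0 and 0 <= y_i * alpha_i <= C."""
    return (abs(sum(alphas)) <= tol
            and all(-tol <= y * a <= C + tol for y, a in zip(ys, alphas)))

# Hypothetical two-point problem.
xs = [(1.0, 1.0), (-1.0, -1.0)]
ys = [1, -1]
alphas = [0.25, -0.25]

value = dual_objective(alphas, xs, ys, poly_kernel)
feasible = is_dual_feasible(alphas, ys, C=1.0)
print(value, feasible)
```

A QP solver would search over feasible $\alpha$ for the minimum of `dual_objective`; only kernel evaluations between training points are needed, never $\Phi$ itself.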

27 Polynomial-SVMs
The kernel $K(x, x') = (x \cdot x')^d$ gives the same result as the explicit mapping + dot product that we described before:
$$\Phi: \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$$
$$\Phi((x_1, x_2)) \cdot \Phi((x_1', x_2')) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2) \cdot (x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2) = x_1^2 x_1'^2 + 2 x_1 x_1' x_2 x_2' + x_2^2 x_2'^2$$
is the same as:
$$K(x, x') = (x \cdot x')^2 = ((x_1, x_2) \cdot (x_1', x_2'))^2 = (x_1 x_1' + x_2 x_2')^2 = x_1^2 x_1'^2 + x_2^2 x_2'^2 + 2 x_1 x_1' x_2 x_2'$$
Interestingly, if $d$ is large the kernel still only requires $n$ multiplications to compute, whereas the explicit representation may not fit in memory!
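The identity above is easy to confirm numerically for $d = 2$. The sketch below compares the explicit feature map followed by a dot product against the kernel evaluated in the input space; the two test vectors are arbitrary choices.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit map (x1, x2) -> (x1^2, sqrt(2) * x1 * x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel_homogeneous(x, x_prime, d=2):
    """K(x, x') = (x . x')^d, computed without building the feature vectors."""
    return dot(x, x_prime) ** d

x = (1.0, 2.0)
x_prime = (3.0, 4.0)

explicit = dot(phi(x), phi(x_prime))            # dot product in feature space
implicit = poly_kernel_homogeneous(x, x_prime)  # same value from the input space

print(explicit, implicit)
```

The two values agree up to floating-point rounding, since $\sqrt{2}$ is not exactly representable.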

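28 RBF-SVMs
The RBF kernel $K(x, x') = \exp(-\gamma \|x - x'\|^2)$ is one of the most popular kernel functions. It adds a "bump" around each data point:
$$f(x) = \sum_{i=1}^{m} \alpha_i \exp(-\gamma \|x_i - x\|^2) + b$$
[Figure: points $x$ and $x'$ in the input space and their images $\Phi(x)$ and $\Phi(x')$ in the feature space.]
Using this one can get state-of-the-art results.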
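To illustrate the "bump" picture, the sketch below evaluates the RBF decision function for two hypothetical support points with hand-picked coefficients. The values of `centers`, `alphas`, `gamma`, and `b` are assumptions and were not obtained by training.

```python
import math

def rbf_kernel(x, x_prime, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)

def rbf_decision(x, centers, alphas, b, gamma):
    """f(x) = sum_i alpha_i * exp(-gamma * ||x_i - x||^2) + b."""
    return sum(a * rbf_kernel(c, x, gamma) for a, c in zip(alphas, centers)) + b

# One positive bump at (1, 1) and one negative bump at (-1, -1).
centers = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.5, -0.5]

near_positive = rbf_decision((1.0, 0.5), centers, alphas, b=0.0, gamma=0.5)
near_negative = rbf_decision((-1.0, -0.5), centers, alphas, b=0.0, gamma=0.5)
far_away = rbf_decision((10.0, 10.0), centers, alphas, b=0.0, gamma=0.5)

print(near_positive, near_negative, far_away)
```

Far from every data point all bumps vanish and $f(x)$ falls back to $b$; a larger $\gamma$ makes each bump narrower.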

29 SVMs: more results
There is much more in the field of SVMs / kernel machines than we could cover here, including:
Regression, clustering, semi-supervised learning and other domains.
Lots of other kernels, e.g. string kernels to handle text.
Lots of research in modifications, e.g. to improve generalization ability, or tailoring to a particular task.
Lots of research in speeding up training.
Please see textbooks such as the ones by Cristianini & Shawe-Taylor or by Schoelkopf and Smola.

30 SVMs: software
Lots of SVM software: LibSVM (C++), SVMLight (C).
As well as complete machine learning toolboxes that include SVMs: Torch (C++), Spider (Matlab), Weka (Java).
All available through www.kernel-machines.org.