Introduction to Support Vector Machines. Colin Campbell, Bristol University





Outline of talk.

Part 1. An Introduction to SVMs
1.1 SVMs for binary classification.
1.2 Soft margins and multi-class classification.
1.3 SVMs for regression.

Part 2. General kernel methods
2.1 Kernel methods based on linear programming and other approaches.
2.2 Training SVMs and other kernel machines.
2.3 Model selection.
2.4 Different types of kernels.
2.5 SVMs in practice.

Advantages of SVMs
- A principled approach to classification, regression or novelty detection tasks.
- SVMs exhibit good generalisation.
- The hypothesis has an explicit dependence on the data (via the support vectors), hence the model can be readily interpreted.

- Learning involves optimisation of a convex function (no false minima, unlike a neural network).
- Few parameters are required for tuning the learning machine (unlike a neural network, where the architecture and various parameters must be found).
- Confidence measures, etc. can be implemented.

1.1 SVMs for Binary Classification.

Preliminaries: consider a binary classification problem in which the input vectors are x_i and the targets or labels are y_i = ±1. The index i labels the pattern pairs (i = 1, ..., m). The x_i define a space of labelled points called input space.

From the perspective of statistical learning theory, the motivation for considering binary classifier SVMs comes from theoretical bounds on the generalization error. These generalization bounds have two important features:

1. The upper bound on the generalization error does not depend on the dimensionality of the space.
2. The bound is minimized by maximizing the margin γ, i.e. the minimal distance between the hyperplane separating the two classes and the closest datapoints of each class.

[Figure: two classes of datapoints separated by a hyperplane, with the margin indicated.]

In an arbitrary-dimensional space a separating hyperplane can be written:

w · x + b = 0

where b is the bias and w the weights. Thus we will consider a decision function of the form:

D(x) = sign(w · x + b)

We note that the argument in D(x) is invariant under a rescaling w → λw, b → λb. We will implicitly fix a scale with:

w · x + b = +1
w · x + b = −1

for the support vectors (canonical hyperplanes).

Thus:

w · (x_1 − x_2) = 2

for two support vectors x_1 and x_2, one on each side of the separating hyperplane.

The margin is given by the projection of the vector (x_1 − x_2) onto the unit normal to the hyperplane, w/‖w‖. This projection equals 2/‖w‖₂, which is twice the margin, from which we deduce that the margin is given by γ = 1/‖w‖₂.

[Figure: the two canonical hyperplanes through the support vectors x_1 and x_2, the separating hyperplane between them, and the margin.]

Maximisation of the margin is thus equivalent to minimisation of the functional:

Φ(w) = ½ (w · w)

subject to the constraints:

y_i [(w · x_i) + b] ≥ 1

Thus the task is to find an optimum of the primal objective function:

L(w, b) = ½ (w · w) − Σ_{i=1}^{m} α_i [ y_i ((w · x_i) + b) − 1 ]

Solving the saddle point equation ∂L/∂b = 0 gives:

Σ_{i=1}^{m} α_i y_i = 0

and ∂L/∂w = 0 gives:

w = Σ_{i=1}^{m} α_i y_i x_i

which, when substituted back into L(w, b, α), tells us that we should maximise the functional (the Wolfe dual):

W(α) = Σ_{i=1}^{m} α_i − ½ Σ_{i,j=1}^{m} α_i α_j y_i y_j (x_i · x_j)

subject to the constraints:

α_i ≥ 0 (the α_i are Lagrange multipliers)

and:

Σ_{i=1}^{m} α_i y_i = 0

The decision function is then:

D(z) = sign( Σ_{j=1}^{m} α_j y_j (x_j · z) + b )

So far we do not know how to handle non-separable datasets. To answer this problem we will exploit the point mentioned above: the theoretical generalisation bounds do not depend on the dimensionality of the space.

In the dual objective function we notice that the datapoints x_i only appear inside an inner product. To get a better representation of the data we can therefore map the datapoints into an alternative, higher-dimensional space through the replacement:

x_i · x_j → φ(x_i) · φ(x_j)

i.e. we have used a mapping x_i → φ(x_i). This higher-dimensional space is called Feature Space and must be a Hilbert space (so that the concept of an inner product applies).

The function φ(x_i) · φ(x_j) will be called a kernel, so:

φ(x_i) · φ(x_j) = K(x_i, x_j)

The kernel is therefore the inner product between mapped pairs of points in Feature Space.

What is the mapping relation? In fact we do not need to know the form of this mapping, since it is implicitly defined by the functional form of the mapped inner product in Feature Space.
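For example, for the degree-2 polynomial kernel on two-dimensional inputs the mapping can be written out explicitly. A small NumPy sketch (the feature map below is one standard choice, not the only one) verifying that the kernel value equals the inner product of the mapped points:

    import numpy as np

    def phi(x):
        # Explicit feature map for K(x, z) = (x.z + 1)^2 with x in R^2.
        x1, x2 = x
        return np.array([1.0,
                         np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1 ** 2, x2 ** 2,
                         np.sqrt(2) * x1 * x2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    k_implicit = (x @ z + 1.0) ** 2      # kernel evaluated directly
    k_explicit = phi(x) @ phi(z)         # inner product in feature space
    print(k_implicit, k_explicit)        # both print 4.0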

Different choices for the kernel K(x, x′) define different Hilbert spaces to use. A large number of Hilbert spaces are possible, with RBF kernels:

K(x, x′) = exp( −‖x − x′‖² / 2σ² )

and polynomial kernels:

K(x, x′) = (⟨x, x′⟩ + 1)^d

being popular choices.
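A minimal NumPy sketch of how these two kernel matrices might be computed on a batch of data (the parameter values σ and d are arbitrary choices):

    import numpy as np

    def rbf_kernel(X, Z, sigma=1.0):
        # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
        sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    def poly_kernel(X, Z, d=3):
        # K(x, z) = (<x, z> + 1)^d
        return (X @ Z.T + 1.0) ** d

    X = np.random.randn(5, 2)
    K_rbf = rbf_kernel(X, X, sigma=0.5)
    K_poly = poly_kernel(X, X, d=2)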

Legitimate kernels are not restricted to explicit functions; they may, for example, be defined by an algorithm.

(1) String kernels. Consider the following text strings: car, cat, cart, chart. They have certain similarities despite being of unequal length. dug is of equal length to car and cat but differs in all its letters.

A measure of the degree of alignment of two text strings or sequences can be found using algorithmic techniques, e.g. edit codes (dynamic programming). Frequent applications include:
- bioinformatics, e.g. the Smith-Waterman and Waterman-Eggert algorithms (strings over the 4-letter alphabet ... ACCGTATGTAAA ... from genomes),
- text processing (e.g. the WWW), etc.
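As a toy illustration of the dynamic-programming idea, the sketch below computes a plain edit (Levenshtein) distance between strings; this is not the Smith-Waterman algorithm, and turning such alignment scores into a valid positive semi-definite kernel needs further care:

    def edit_distance(s, t):
        # Classic dynamic-programming edit (Levenshtein) distance.
        m, n = len(s), len(t)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            D[i][0] = i
        for j in range(n + 1):
            D[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,         # deletion
                              D[i][j - 1] + 1,         # insertion
                              D[i - 1][j - 1] + cost)  # substitution
        return D[m][n]

    for pair in [("car", "cat"), ("cat", "chart"), ("car", "dug")]:
        print(pair, edit_distance(*pair))   # 1, 2, 3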

(2) Kernels for graphs/networks, e.g. the diffusion kernel (Kondor and Lafferty).
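A brief sketch of the Kondor-Lafferty diffusion kernel K = exp(βH), with H = A − D the negative graph Laplacian, on a small assumed graph (the adjacency matrix and the value of β below are arbitrary):

    import numpy as np
    from scipy.linalg import expm

    # Adjacency matrix of a small undirected graph (4 nodes in a path).
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    D = np.diag(A.sum(axis=1))   # degree matrix
    H = A - D                    # negative graph Laplacian
    beta = 0.5                   # diffusion parameter
    K = expm(beta * H)           # diffusion kernel: symmetric, positive semi-definite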

Mathematically, valid kernel functions should satisfy Mercer's condition:

For any g(x) for which:

∫ g(x)² dx < ∞

it must be the case that:

∫∫ K(x, x′) g(x) g(x′) dx dx′ ≥ 0

A simple criterion is that the kernel should be positive semi-definite.

Theorem: if a kernel is positive semi-definite, i.e.:

Σ_{i,j} K(x_i, x_j) c_i c_j ≥ 0

where {c_1, ..., c_n} are real numbers, then there exists a function φ(x) defining an inner product of possibly higher dimension, i.e.:

K(x, y) = φ(x) · φ(y)
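This condition can be checked numerically on a Gram matrix by inspecting its eigenvalues (a quick sketch; the small tolerance guards against round-off):

    import numpy as np

    def is_psd(K, tol=1e-10):
        # A symmetric matrix is PSD iff all its eigenvalues are >= 0.
        K = 0.5 * (K + K.T)                  # symmetrise against round-off
        return np.all(np.linalg.eigvalsh(K) >= -tol)

    X = np.random.randn(20, 3)
    K_rbf = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # RBF Gram matrix
    print(is_psd(K_rbf))   # True: the RBF kernel satisfies Mercer's condition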

Thus the following steps are used to train an SVM:

1. Choose the kernel function K(x_i, x_j).

2. Maximise:

W(α) = Σ_{i=1}^{m} α_i − ½ Σ_{i,j=1}^{m} α_i α_j y_i y_j K(x_i, x_j)

subject to α_i ≥ 0 and Σ_i α_i y_i = 0.

3. The bias b is found as follows:

b = −½ [ max_{i: y_i = −1} Σ_j α_j y_j K(x_i, x_j) + min_{i: y_i = +1} Σ_j α_j y_j K(x_i, x_j) ]

4. The optimal α_i go into the decision function:

D(z) = sign( Σ_{i=1}^{m} α_i y_i K(x_i, z) + b )
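Putting the four steps together, the following sketch trains a hard-margin SVM on a small, linearly separable toy problem, using a general-purpose optimiser from SciPy rather than a dedicated SVM solver (the data and the linear kernel are arbitrary choices):

    import numpy as np
    from scipy.optimize import minimize

    # Toy separable data: two Gaussian clusters in 2D, labelled +1 / -1.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(+2.0, 0.5, (10, 2)),
                   rng.normal(-2.0, 0.5, (10, 2))])
    y = np.array([1.0] * 10 + [-1.0] * 10)
    m = len(y)

    K = X @ X.T                                   # step 1: kernel (linear here)
    Q = (y[:, None] * y[None, :]) * K             # Q_ij = y_i y_j K(x_i, x_j)

    def neg_W(a):                                 # step 2: maximise W(a) <=> minimise -W(a)
        return -(a.sum() - 0.5 * a @ Q @ a)

    cons = ({'type': 'eq', 'fun': lambda a: a @ y},)   # sum_i alpha_i y_i = 0
    bounds = [(0.0, None)] * m                         # alpha_i >= 0 (hard margin)
    alpha = minimize(neg_W, np.zeros(m), bounds=bounds, constraints=cons).x

    f = (alpha * y) @ K                           # f_i = sum_j alpha_j y_j K(x_j, x_i)
    b = -0.5 * (f[y == -1].max() + f[y == +1].min())   # step 3: bias

    def D(z):                                     # step 4: decision function
        return np.sign((alpha * y) @ (X @ z) + b)

    print(D(np.array([2.0, 2.0])), D(np.array([-2.0, -2.0])))   # expected: +1 and -1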

1.2 Soft Margins and Multi-class Problems

Multi-class classification: many problems involve multi-class classification and a number of schemes have been outlined. One of the simplest schemes is to use a directed acyclic graph (DAG), with the learning task reduced to binary classification at each node:

[Figure: a DAG for a three-class problem, with binary nodes 1/3, 1/2 and 2/3 and leaf nodes 1, 2, 3.]
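A sketch of the idea, assuming scikit-learn is available for the individual binary SVMs: one classifier is trained per pair of classes, and a new point is classified by walking the DAG, eliminating one candidate class at each node.

    from itertools import combinations
    from sklearn.svm import SVC

    def train_pairwise(X, y, classes, **svc_args):
        # One binary SVM per pair of classes (the DAG's internal nodes).
        clfs = {}
        for a, b in combinations(classes, 2):
            mask = (y == a) | (y == b)
            clfs[(a, b)] = SVC(kernel='rbf', **svc_args).fit(X[mask], y[mask])
        return clfs

    def dag_predict(x, classes, clfs):
        # Walk the DAG: each pairwise test eliminates one remaining class.
        candidates = list(classes)
        while len(candidates) > 1:
            a, b = candidates[0], candidates[-1]
            winner = clfs[(a, b)].predict(x.reshape(1, -1))[0]
            candidates.remove(b if winner == a else a)
        return candidates[0]

scikit-learn's SVC already combines pairwise classifiers internally (one-versus-one); the explicit DAG above only fixes the order in which the pairwise tests are applied at prediction time.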

Soft margins: most real-life datasets contain noise, and an SVM can fit this noise, leading to poor generalization. The effect of outliers and noise can be reduced by introducing a soft margin. Two schemes are commonly used:

L1 error norm: 0 ≤ α_i ≤ C

L2 error norm: K(x_i, x_i) → K(x_i, x_i) + λ


The justification for these soft margin techniques comes from statistical learning theory. They can be readily viewed as a relaxation of the hard margin constraint.

For the L1 error norm (prior to introducing kernels) we introduce a positive slack variable ξ_i:

y_i (w · x_i + b) ≥ 1 − ξ_i

and minimize the sum of errors Σ_{i=1}^{m} ξ_i in addition to ‖w‖²:

min [ ½ w · w + C Σ_{i=1}^{m} ξ_i ]

This is readily formulated as a primal objective function:

L(w, b, α, ξ) = ½ w · w + C Σ_{i=1}^{m} ξ_i − Σ_{i=1}^{m} α_i [ y_i (w · x_i + b) − 1 + ξ_i ] − Σ_{i=1}^{m} r_i ξ_i

with Lagrange multipliers α_i ≥ 0 and r_i ≥ 0.

The derivatives with respect to w, b and ξ give:

∂L/∂w = w − Σ_{i=1}^{m} α_i y_i x_i = 0    (1)

∂L/∂b = Σ_{i=1}^{m} α_i y_i = 0    (2)

∂L/∂ξ_i = C − α_i − r_i = 0    (3)

Resubstituting these back into the primal objective function, we obtain the same dual objective function as before. However, r_i ≥ 0 and C − α_i − r_i = 0, hence α_i ≤ C, and the constraint 0 ≤ α_i is replaced by 0 ≤ α_i ≤ C.

Patterns with values 0 < α_i < C will be referred to as non-bound, and those with α_i = 0 or α_i = C will be said to be at bound. The optimal value of C must be found by experimentation using a validation set; it cannot be readily related to the characteristics of the dataset or model.
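In practice this search is usually carried out by cross-validation; a minimal scikit-learn sketch (the grid of C values and the synthetic data are arbitrary):

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV

    X = np.random.randn(200, 5)
    y = np.where(X[:, 0] + 0.5 * np.random.randn(200) > 0, 1, -1)   # noisy synthetic labels

    grid = GridSearchCV(SVC(kernel='rbf'),
                        param_grid={'C': [0.1, 1, 10, 100]},
                        cv=5)
    grid.fit(X, y)
    print(grid.best_params_['C'])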

In an alternative approach, ν-SVM, it can be shown that solutions for an L1 error norm are the same as those obtained from maximizing:

W(α) = −½ Σ_{i,j=1}^{m} α_i α_j y_i y_j K(x_i, x_j)

subject to:

Σ_{i=1}^{m} y_i α_i = 0,    Σ_{i=1}^{m} α_i ≥ ν,    0 ≤ α_i ≤ 1/m    (4)

where ν lies in the range 0 to 1.

The fraction of training errors is upper bounded by ν, and ν also provides a lower bound on the fraction of points which are support vectors. Hence in this formulation the conceptual meaning of the soft margin parameter is more transparent.
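scikit-learn exposes this formulation as NuSVC; a short sketch on synthetic data, checking that ν lower-bounds the fraction of support vectors:

    import numpy as np
    from sklearn.svm import NuSVC

    X = np.random.randn(300, 2)
    y = np.where(X[:, 0] + 0.3 * np.random.randn(300) > 0, 1, -1)

    clf = NuSVC(nu=0.2, kernel='rbf').fit(X, y)
    frac_sv = len(clf.support_) / len(y)
    print(frac_sv)        # nu = 0.2 is a lower bound on this fraction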

For the L2 error norm the primal objective function is:

L(w, b, α, ξ) = ½ w · w + C Σ_{i=1}^{m} ξ_i² − Σ_{i=1}^{m} α_i [ y_i (w · x_i + b) − 1 + ξ_i ] − Σ_{i=1}^{m} r_i ξ_i    (5)

with α_i ≥ 0 and r_i ≥ 0.

After obtaining the derivatives with respect to w, b and ξ, substituting for w and ξ in the primal objective function, and noting that the dual objective function is maximal when r_i = 0, we obtain the following dual objective function after kernel substitution:

W(α) = Σ_{i=1}^{m} α_i − ½ Σ_{i,j=1}^{m} y_i y_j α_i α_j K(x_i, x_j) − (1/4C) Σ_{i=1}^{m} α_i²

With λ = 1/2C this gives the same dual objective function as for hard margin learning, except for the substitution:

K(x_i, x_i) → K(x_i, x_i) + λ
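So the L2 soft margin can be implemented with an unchanged hard-margin solver by modifying the Gram matrix alone (a minimal sketch; the Gram matrix here is just a linear kernel on random data):

    import numpy as np

    X = np.random.randn(50, 3)
    K = X @ X.T                              # any valid Gram matrix
    C = 10.0
    lam = 1.0 / (2.0 * C)                    # lambda = 1 / (2C)
    K_soft = K + lam * np.eye(len(K))        # K(x_i, x_i) -> K(x_i, x_i) + lambda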

For many real-life datasets there is an imbalance between the amount of data in different classes, or the significance of the data in the two classes can be quite different. The relative balance between the detection rates for different classes can be easily shifted by introducing asymmetric soft margin parameters.

Thus for binary classification with an L1 error norm: 0 ≤ α_i ≤ C_+ for y_i = +1, and 0 ≤ α_i ≤ C_− for y_i = −1, etc.
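In scikit-learn this asymmetry can be expressed through per-class weights, which scale C for each class (a sketch on synthetic, imbalanced data; the weight of 5 is an arbitrary choice):

    import numpy as np
    from sklearn.svm import SVC

    X = np.random.randn(400, 2)
    y = np.where(X[:, 0] + 0.3 * np.random.randn(400) > 1.0, 1, -1)   # imbalanced labels

    # Effectively C_+ = 5 * C and C_- = C, pushing up the detection rate for class +1.
    clf = SVC(kernel='rbf', C=1.0, class_weight={1: 5.0, -1: 1.0}).fit(X, y)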

1.3 Solving Regression Tasks with SVMs

ε-SV regression (Vapnik) is not the only approach. Motivated by statistical learning theory, we use the constraints:

y_i − (w · x_i) − b ≤ ε
(w · x_i) + b − y_i ≤ ε

to allow for some deviation ε between the eventual targets y_i and the function f(x) = w · x + b modelling the data.

As before, we minimize ‖w‖² to penalise overcomplexity. To account for training errors we also introduce slack variables ξ_i, ξ̂_i for the two types of training error. These slack variables are zero for points inside the tube and progressively increase for points outside the tube, according to the loss function used.

[Figure 1: a linear ε-insensitive loss function; the loss L is zero for errors in (−ε, ε) and increases linearly outside this interval.]

For a linear ε-insensitive loss function the task is therefore to minimize:

min [ ‖w‖² + C Σ_{i=1}^{m} ( ξ_i + ξ̂_i ) ]

subject to:

y_i − (w · x_i) − b ≤ ε + ξ_i
(w · x_i) + b − y_i ≤ ε + ξ̂_i

where the slack variables are both positive.

Deriving the dual objective function we get:

W(α, α̂) = Σ_{i=1}^{m} y_i (α_i − α̂_i) − ε Σ_{i=1}^{m} (α_i + α̂_i) − ½ Σ_{i,j=1}^{m} (α_i − α̂_i)(α_j − α̂_j) K(x_i, x_j)

which is maximized subject to:

Σ_{i=1}^{m} α̂_i = Σ_{i=1}^{m} α_i

and:

0 ≤ α_i ≤ C,    0 ≤ α̂_i ≤ C

The function modelling the data is then:

f(z) = Σ_{i=1}^{m} (α_i − α̂_i) K(x_i, z) + b
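A short scikit-learn sketch of ε-SV regression on synthetic data (the kernel, C and ε values are arbitrary):

    import numpy as np
    from sklearn.svm import SVR

    X = np.sort(np.random.uniform(0, 2 * np.pi, (200, 1)), axis=0)
    y = np.sin(X).ravel() + 0.1 * np.random.randn(200)

    # epsilon sets the width of the insensitive tube; C penalises points outside it.
    svr = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(X, y)
    y_pred = svr.predict(X)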

Similarly, we can define many other types of loss function. We can also formulate a linear programming approach to regression, or use many other approaches to regression (e.g. kernelising least squares).