Lecture 2: The SVM classifier




Lecture 2: The SVM classifier
C19 Machine Learning, Hilary 2015, A. Zisserman

Review of linear classifiers
Linear separability
Perceptron
Support Vector Machine (SVM) classifier
Wide margin
Cost function
Slack variables
Loss functions revisited
Optimization

Binary Classification
Given training data (x_i, y_i) for i = 1 ... N, with x_i ∈ R^d and y_i ∈ {-1, 1}, learn a classifier f(x) such that
f(x_i) ≥ 0 if y_i = +1, and f(x_i) < 0 if y_i = -1,
i.e. y_i f(x_i) > 0 for a correct classification.

Linear separability
(Figure: two 2D point sets, one linearly separable, the other not linearly separable.)

Linear classifiers
A linear classifier has the form f(x) = w^T x + b.
In 2D the discriminant f(x) = 0 is a line: w, known as the weight vector, is the normal to the line, and b is the bias; f(x) > 0 on one side of the line and f(x) < 0 on the other.

Linear classifiers
A linear classifier has the form f(x) = w^T x + b. In 3D the discriminant is a plane, and in n dimensions it is a hyperplane.
For a K-NN classifier it was necessary to 'carry' the training data. For a linear classifier, the training data is used to learn w and is then discarded; only w (and b) is needed for classifying new data.
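To make the "only w is needed" point concrete, here is a minimal sketch (assuming NumPy; the weights and data below are made up for illustration) of applying a learned linear classifier at test time:

```python
import numpy as np

def linear_classify(X, w, b):
    """Classify rows of X with the linear discriminant f(x) = w^T x + b.

    Returns +1 where f(x) >= 0 and -1 otherwise.
    """
    scores = X @ w + b                     # f(x) for every example
    return np.where(scores >= 0, 1, -1)

# toy usage with a made-up 2D weight vector and bias
w = np.array([1.0, -2.0])
b = 0.5
X_new = np.array([[3.0, 1.0], [-1.0, 2.0]])
print(linear_classify(X_new, w, b))        # -> [ 1 -1]
```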

The Perceptron Classifier
Given linearly separable data x_i labelled into two categories y_i ∈ {-1, 1}, find a weight vector w such that the discriminant function f(x_i) = w^T x_i + b separates the categories for i = 1, ..., N. How can we find this separating hyperplane?

The Perceptron Algorithm
Write the classifier as f(x_i) = w̃^T x̃_i + w_0 = w^T x_i, where w = (w̃, w_0) and x_i = (x̃_i, 1).
Initialize w = 0.
Cycle through the data points {x_i, y_i}: if x_i is misclassified then w ← w + α y_i x_i.
Repeat until all the data is correctly classified.

For example, in 2D:
Initialize w = 0.
Cycle through the data points {x_i, y_i}: if x_i is misclassified then w ← w + α y_i x_i.
Repeat until all the data is correctly classified.
(Figure: the decision boundary and weight vector w before and after an update on a misclassified point x_i.)
NB: after convergence, w = Σ_i α_i y_i x_i, i.e. a weighted combination of the training points.
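A compact sketch of the perceptron algorithm as described above, assuming NumPy; the bias is absorbed via the augmented vectors w = (w̃, w_0) and x_i = (x̃_i, 1):

```python
import numpy as np

def perceptron(X, y, alpha=1.0, max_epochs=1000):
    """Perceptron on linearly separable data.

    X: (N, d) data, y: (N,) labels in {-1, +1}.
    The bias is absorbed by appending a constant 1 to each x_i.
    """
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])   # x_i -> (x_i, 1)
    w = np.zeros(Xa.shape[1])                       # initialise w = 0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:                  # x_i is misclassified
                w = w + alpha * yi * xi             # w <- w + alpha * y_i * x_i
                errors += 1
        if errors == 0:                             # all data correctly classified
            break
    return w[:-1], w[-1]                            # (weights, bias)
```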

Perceptron example
(Figure: a 2D data set with the separating line found by the perceptron.)
If the data is linearly separable, then the algorithm will converge, but convergence can be slow. The separating line ends up close to the training data; we would prefer a larger margin for better generalization.

What is the best w?
The maximum margin solution: it is the most stable under perturbations of the inputs.

Support Vector Machine (linearly separable data)
(Figure: the separating hyperplane w^T x + b = 0, its normal w and bias b, with the support vectors marked on either side.)
f(x) = Σ_i α_i y_i (x_i^T x) + b, where the sum runs over the support vectors.

SVM sketch derivation
Since w^T x + b = 0 and c(w^T x + b) = 0 define the same plane, we have the freedom to choose the normalization of w.
Choose the normalization such that w^T x_+ + b = +1 and w^T x_- + b = -1 for the positive and negative support vectors respectively.
Then the margin is given by the projection of (x_+ - x_-) onto the unit normal w/||w||:
(w/||w||)^T (x_+ - x_-) = (w^T x_+ - w^T x_-) / ||w|| = 2 / ||w||

Support Vector Machine (linearly separable data)
(Figure: the planes w^T x + b = +1 and w^T x + b = -1 on either side of the decision boundary w^T x + b = 0; the support vectors lie on these planes, and the margin between them is 2/||w||.)

SVM Optimization
Learning the SVM can be formulated as an optimization:
max_w 2/||w||  subject to  w^T x_i + b ≥ +1 if y_i = +1, and w^T x_i + b ≤ -1 if y_i = -1, for i = 1 ... N
Or equivalently:
min_w ||w||^2  subject to  y_i (w^T x_i + b) ≥ 1 for i = 1 ... N
This is a quadratic optimization problem subject to linear constraints, and there is a unique minimum.
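Since this is a standard quadratic program, it can be handed to a generic convex solver. A minimal sketch, assuming the cvxpy package and linearly separable data X (N x d) with labels y in {-1, +1}; the lecture does not prescribe a particular solver:

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve  min ||w||^2  s.t.  y_i (w^T x_i + b) >= 1  as a QP."""
    N, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    constraints = [cp.multiply(y, X @ w + b) >= 1]          # one constraint per point
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
    problem.solve()
    return w.value, b.value
```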

Linear separability again: what is the best w?
The points can be linearly separated, but only with a very narrow margin. Possibly the large-margin solution is better, even though one constraint is violated.
In general there is a trade-off between the margin and the number of mistakes on the training data.

Introduce slack variables ξ_i ≥ 0:
for 0 < ξ_i ≤ 1 the point is between the margin and the correct side of the hyperplane (a margin violation);
for ξ_i > 1 the point is misclassified.
(Figure: the planes w^T x + b = +1, 0, -1 with margin 2/||w||; a margin-violating point and a misclassified point, each a distance ξ_i/||w|| beyond its margin plane, plus the support vectors on the margin.)

Soft margin solution
The optimization problem becomes
min_{w ∈ R^d, ξ_i ∈ R^+} ||w||^2 + C Σ_{i=1}^N ξ_i  subject to  y_i (w^T x_i + b) ≥ 1 - ξ_i for i = 1 ... N
Every constraint can be satisfied if ξ_i is sufficiently large.
C is a regularization parameter:
small C allows constraints to be easily ignored: large margin
large C makes constraints hard to ignore: narrow margin
C = ∞ enforces all constraints: hard margin
This is still a quadratic optimization problem and there is a unique minimum. Note that there is only one parameter, C.
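In practice the soft-margin problem is usually solved with a library. A hedged sketch using scikit-learn's SVC with a linear kernel, on toy data standing in for a real training set; its C parameter plays exactly the role described above, and a very large C approximates the hard margin:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data (placeholder for a real training set): two Gaussian blobs.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_train = np.hstack([-np.ones(50), np.ones(50)])

# Soft-margin linear SVM: C controls the margin / violation trade-off.
clf_soft = SVC(kernel="linear", C=10.0).fit(X_train, y_train)   # soft margin (C = 10)
clf_hard = SVC(kernel="linear", C=1e6).fit(X_train, y_train)    # approximates C = infinity

w = clf_soft.coef_.ravel()       # learned weight vector w
b = clf_soft.intercept_[0]       # learned bias b
```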

(Figure: 2D data plotted as feature x vs. feature y; the data is linearly separable, but only with a narrow margin.)

C = ∞: hard margin (figure)

C = 10: soft margin (figure)

Application: pedestrian detection in computer vision
Objective: detect (localize) standing humans in an image (cf. face detection).
A sliding-window classifier reduces object detection to binary classification: does an image window contain a person or not?
Method: the HOG detector.

Training data and features
Positive data: 1208 positive window examples
Negative data: 1218 negative window examples (initially)

Feature: histogram of oriented gradients (HOG)
Tile the detection window into 8 x 8 pixel cells; each cell is represented by a histogram of gradient orientations (frequency vs. orientation), capturing the locally dominant direction.
Feature vector dimension = 16 x 8 (tiling) x 8 (orientations) = 1024.
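As an illustration of the dimension count, here is a sketch using scikit-image's hog function (which differs in detail from the Dalal and Triggs implementation); with a 128 x 64 window, 8 x 8 cells and 8 orientation bins it returns 16 x 8 x 8 = 1024 values:

```python
import numpy as np
from skimage.feature import hog

window = np.random.rand(128, 64)        # stand-in for a 128 x 64 detection window
x = hog(window,
        orientations=8,                 # 8 orientation bins per cell
        pixels_per_cell=(8, 8),         # 8 x 8 pixel cells -> 16 x 8 tiling
        cells_per_block=(1, 1),         # no block grouping, for simplicity
        feature_vector=True)
print(x.shape)                          # (1024,)
```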

Averaged positive examples

Algorithm
Training (learning): represent each example window by a HOG feature vector x_i ∈ R^d, with d = 1024, and train an SVM classifier.
Testing (detection): run a sliding-window classifier f(x) = w^T x + b over the image.
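A schematic of the test-time sliding-window loop; the feature extractor is passed in as a function (for example, the HOG sketch above), and the window size, stride and threshold are illustrative choices, not values from the lecture:

```python
import numpy as np

def detect(image, w, b, feature_fn, window=(128, 64), stride=8, threshold=0.0):
    """Slide a window over a grayscale image and score each with f(x) = w^T x + b."""
    H, W = window
    detections = []
    for r in range(0, image.shape[0] - H + 1, stride):
        for c in range(0, image.shape[1] - W + 1, stride):
            x = feature_fn(image[r:r + H, c:c + W])   # e.g. the HOG sketch above
            score = w @ x + b                         # linear SVM score
            if score > threshold:                     # keep windows scored as "person"
                detections.append((r, c, score))
    return detections
```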

Dalal and Triggs, CVPR 2005

Learned model: f(x) = w^T x + b (slide from Deva Ramanan)

Slide from Deva Ramanan

Optimization
Learning an SVM has been formulated as a constrained optimization problem over w and ξ:
min_{w ∈ R^d, ξ_i ∈ R^+} ||w||^2 + C Σ_{i=1}^N ξ_i  subject to  y_i (w^T x_i + b) ≥ 1 - ξ_i for i = 1 ... N
The constraint y_i (w^T x_i + b) ≥ 1 - ξ_i can be written more concisely as y_i f(x_i) ≥ 1 - ξ_i, which, together with ξ_i ≥ 0, is equivalent (at the optimum) to ξ_i = max(0, 1 - y_i f(x_i)).
Hence the learning problem is equivalent to the unconstrained optimization problem over w:
min_{w ∈ R^d} ||w||^2 + C Σ_{i=1}^N max(0, 1 - y_i f(x_i))
(regularization term + loss function)
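The equivalence is easy to check numerically: at the optimum the slack ξ_i is just the hinge loss of point i. A small NumPy sketch of the unconstrained objective:

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """||w||^2 + C * sum_i max(0, 1 - y_i f(x_i))  with  f(x) = w^T x + b."""
    scores = X @ w + b
    xi = np.maximum(0.0, 1.0 - y * scores)   # optimal slack = hinge loss per point
    return w @ w + C * np.sum(xi)
```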

Loss function
min_{w ∈ R^d} ||w||^2 + C Σ_{i=1}^N max(0, 1 - y_i f(x_i))
Points fall into three categories:
1. y_i f(x_i) > 1: point is outside the margin; no contribution to the loss.
2. y_i f(x_i) = 1: point is on the margin; no contribution to the loss, as in the hard-margin case.
3. y_i f(x_i) < 1: point violates the margin constraint; contributes to the loss.
(Figure: the decision boundary w^T x + b = 0, the weight vector w and the support vectors.)

Loss functions
(Figure: loss curves plotted against y_i f(x_i).)
The SVM uses the hinge loss max(0, 1 - y_i f(x_i)), a convex approximation to the 0-1 loss.

Optimization continued
min_{w ∈ R^d} C Σ_{i=1}^N max(0, 1 - y_i f(x_i)) + ||w||^2
(Figure: a non-convex function with distinct local and global minima.)
Does this cost function have a unique solution? Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?
If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case).

Convex functions

Convex function examples
(Figure: example functions, convex and not convex.)
A sum of convex functions with non-negative weights is convex.

SVM
min_{w ∈ R^d} C Σ_{i=1}^N max(0, 1 - y_i f(x_i)) + ||w||^2
This is convex: it is a non-negative sum of convex functions (the hinge loss terms and ||w||^2).

Gradient (or steepest) descent algorithm for SVM
To minimize a cost function C(w), use the iterative update
w_{t+1} ← w_t - η_t ∇_w C(w_t)
where η is the learning rate.
First, rewrite the optimization problem as an average:
min_w C(w) = (λ/2) ||w||^2 + (1/N) Σ_{i=1}^N max(0, 1 - y_i f(x_i))
           = (1/N) Σ_{i=1}^N [ (λ/2) ||w||^2 + max(0, 1 - y_i f(x_i)) ]
(with λ = 2/(NC), up to an overall scale of the problem), and f(x) = w^T x + b.
Because the hinge loss is not differentiable, a sub-gradient is computed.

Sub-gradient for hinge loss
L(x_i, y_i; w) = max(0, 1 - y_i f(x_i)), with f(x_i) = w^T x_i + b
∂L/∂w = -y_i x_i   if y_i f(x_i) < 1
∂L/∂w = 0          if y_i f(x_i) ≥ 1
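In code (a NumPy sketch) the sub-gradient is a simple case split:

```python
import numpy as np

def hinge_subgradient(w, b, xi, yi):
    """Sub-gradient of max(0, 1 - y_i f(x_i)) w.r.t. w, with f(x) = w^T x + b."""
    if yi * (w @ xi + b) < 1:
        return -yi * xi        # margin violated: gradient of 1 - y_i f(x_i)
    return np.zeros_like(w)    # margin satisfied: loss is flat, sub-gradient is 0
```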

Sub-gradient descent algorithm for SVM
C(w) = (1/N) Σ_{i=1}^N [ (λ/2) ||w||^2 + L(x_i, y_i; w) ]
The iterative update is
w_{t+1} ← w_t - η ∇_{w_t} C(w_t) = w_t - η (1/N) Σ_{i=1}^N (λ w_t + ∇_w L(x_i, y_i; w_t))
where η is the learning rate.
Then each iteration t involves cycling through the training data with the updates:
w_{t+1} ← w_t - η (λ w_t - y_i x_i)   if y_i f(x_i) < 1
w_{t+1} ← w_t - η λ w_t               otherwise
In the Pegasos algorithm the learning rate is set at η_t = 1/(λ t).

Pegasos Stochastic Gradient Descent Algorithm
Randomly sample from the training data.
(Figures: the objective value decreasing over iterations, shown on a log scale, and the resulting separating line on the 2D data.)
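A sketch of the resulting training loop, a simplified Pegasos-style update assuming NumPy: sample the training points in random order, apply the two-case sub-gradient update with η_t = 1/(λt). For simplicity the bias is omitted (it can be absorbed by appending a constant feature), and λ, the epoch count and the seed are illustrative choices:

```python
import numpy as np

def pegasos(X, y, lam=0.01, n_epochs=20, seed=0):
    """Stochastic sub-gradient descent for
       (lambda/2)||w||^2 + (1/N) sum_i max(0, 1 - y_i w^T x_i)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(N):        # randomly sample the training data
            t += 1
            eta = 1.0 / (lam * t)           # Pegasos learning rate eta_t = 1/(lambda t)
            if y[i] * (w @ X[i]) < 1:       # margin violated: include the hinge term
                w = w - eta * (lam * w - y[i] * X[i])
            else:                           # only the regulariser contributes
                w = w - eta * lam * w
    return w
```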

Background reading and more
Next lecture: see that the SVM can be expressed as a sum over the support vectors:
f(x) = Σ_i α_i y_i (x_i^T x) + b, with the sum over the support vectors.
On the web page http://www.robots.ox.ac.uk/~az/lectures/ml :
links to SVM tutorials and video lectures
MATLAB SVM demo