Support Vector and Kernel Machines




Support Vector and Kernel Machines
Nello Cristianini
BIOwulf Technologies
nello@support-vector.net
http:///tutorial.html
ICML 2001

A Little History
SVMs introduced in COLT-92 by Boser, Guyon, Vapnik. Greatly developed ever since. Initially popularized in the NIPS community, now an important and active field of all Machine Learning research. Special issues of Machine Learning Journal, and Journal of Machine Learning Research. Kernel Machines: a large class of learning algorithms, SVMs a particular instance.

A Little History
Annual workshop at NIPS. Centralized website: www.kernel-machines.org. Textbook (2000). Now a large and diverse community: from machine learning, optimization, statistics, neural networks, functional analysis, etc. Successful applications in many fields (bioinformatics, text, handwriting recognition, etc.). Fast expanding field, EVERYBODY WELCOME!

Preliminaries
Task of this class of algorithms: detect and exploit complex patterns in data (e.g. by clustering, classifying, ranking, cleaning, etc. the data). Typical problems: how to represent complex patterns; and how to exclude spurious (unstable) patterns (= overfitting). The first is a computational problem; the second a statistical problem.

Very Informal Reasoning
The class of kernel methods implicitly defines the class of possible patterns by introducing a notion of similarity between data. Example: similarity between documents: by length, by topic, by language. Choice of similarity ⇒ choice of relevant features.

More Formal Reasoning
Kernel methods exploit information about the inner products between data items. Many standard algorithms can be rewritten so that they only require inner products between data (inputs). Kernel functions = inner products in some feature space (potentially very complex). If the kernel is given, no need to specify what features of the data are being used.

Just in Case
Inner product between vectors: ⟨x, z⟩ = Σᵢ xᵢzᵢ
Hyperplane: ⟨w, x⟩ + b = 0

Overview of the Tutorial
Introduce basic concepts with an extended example: the Kernel Perceptron. Derive Support Vector Machines. Other kernel-based algorithms. Properties and limitations of kernels. On kernel alignment. On optimizing kernel alignment.

Parts I and II: Overview
Linear Learning Machines (LLM). Kernel-induced feature spaces. Generalization theory. Optimization theory. Support Vector Machines (SVM).

Modularity — IMPORTANT CONCEPT
Any kernel-based learning algorithm is composed of two modules: a general-purpose learning machine, and a problem-specific kernel function. Any kernel-based algorithm can be fitted with any kernel. Kernels themselves can be constructed in a modular way. Great for software engineering (and for analysis).

1 — Linear Learning Machines
Simplest case: classification. Decision function is a hyperplane in input space. The Perceptron Algorithm (Rosenblatt, 57). Useful to analyze the Perceptron algorithm before looking at SVMs and Kernel Methods in general.

Basic Notation
Input space: x ∈ X
Output space: y ∈ Y = {−1, +1}
Hypothesis: h ∈ H
Real-valued function: f : X → R
Training set: S = {(x₁, y₁), ..., (xᵢ, yᵢ), ...}
Test error: ε
Dot product: ⟨x, z⟩

Perceptron
Linear separation of the input space:
f(x) = ⟨w, x⟩ + b
h(x) = sign(f(x))

Perceptron Algorithm
Update rule (ignoring threshold): if yᵢ ⟨wₖ, xᵢ⟩ ≤ 0 then
wₖ₊₁ ← wₖ + η yᵢ xᵢ
k ← k + 1
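The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the tutorial's own code; the toy data, the `eta` and `epochs` parameters, and the explicit bias update are assumptions for the sake of a runnable example.

```python
import numpy as np

def perceptron(X, y, eta=1.0, epochs=100):
    """Primal perceptron: mistake-driven updates of w (and a bias b)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:      # mistake: point on wrong side
                w += eta * yi * xi          # w_{k+1} <- w_k + eta * y_i * x_i
                b += eta * yi
                mistakes += 1
        if mistakes == 0:                   # converged: all points correct
            break
    return w, b

# linearly separable toy data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
assert all(np.sign(X @ w + b) == y)
```

On separable data the loop terminates with zero mistakes, which is exactly the condition yᵢ f(xᵢ) > 0 for every training point.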

Observations
Solution is a linear combination of training points:
w = Σᵢ αᵢ yᵢ xᵢ, with αᵢ ≥ 0
Only uses informative points (mistake driven). The coefficient of a point in the combination reflects its difficulty.

Observations — 2
Mistake bound: M ≤ (R/γ)²
Coefficients are non-negative. Possible to rewrite the algorithm using this alternative representation.

Dual Representation — IMPORTANT CONCEPT
The decision function can be rewritten as follows:
f(x) = ⟨w, x⟩ + b = Σᵢ αᵢ yᵢ ⟨xᵢ, x⟩ + b
w = Σᵢ αᵢ yᵢ xᵢ

Dual Representation
And also the update rule can be rewritten as follows:
if yᵢ (Σⱼ αⱼ yⱼ ⟨xⱼ, xᵢ⟩ + b) ≤ 0 then αᵢ ← αᵢ + η
Note: in the dual representation, data appears only inside dot products.
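The dual update can also be sketched directly: the data are touched only through inner products, so a kernel can later be substituted for the dot product. Names, data, and the bias handling below are illustrative assumptions, not the tutorial's code.

```python
import numpy as np

def dual_perceptron(X, y, kernel=lambda a, b: a @ b, eta=1.0, epochs=100):
    """Dual perceptron: data enters only through kernel (inner product) calls."""
    m = len(X)
    alpha = np.zeros(m)
    b = 0.0
    # precompute the Gram matrix of pairwise inner products
    K = np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
    for _ in range(epochs):
        mistakes = 0
        for i in range(m):
            if y[i] * (np.sum(alpha * y * K[:, i]) + b) <= 0:
                alpha[i] += eta            # alpha_i <- alpha_i + eta
                b += eta * y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
alpha, b = dual_perceptron(X, y)
w = (alpha * y) @ X                        # recover w = sum_i alpha_i y_i x_i
assert all(np.sign(X @ w + b) == y)
```

With the linear kernel this is exactly the primal perceptron; replacing `kernel` is all it takes to run it in a feature space.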

Duality: First Property of SVMs
DUALITY is the first feature of Support Vector Machines. SVMs are Linear Learning Machines represented in a dual fashion:
f(x) = ⟨w, x⟩ + b = Σᵢ αᵢ yᵢ ⟨xᵢ, x⟩ + b
Data appear only within dot products (in the decision function and in the training algorithm).

Limitations of LLMs
Linear classifiers cannot deal with: non-linearly separable data; noisy data. In addition, this formulation only deals with vectorial data.

Non-Linear Classifiers
One solution: creating a net of simple linear classifiers (neurons): a Neural Network (problems: local minima; many parameters; heuristics needed to train; etc.). Other solution: map data into a richer feature space including non-linear features, then use a linear classifier.

Learning in the Feature Space
Map data into a feature space where they are linearly separable: x ↦ φ(x), from input space X to feature space F.

Problems with Feature Space
Working in high-dimensional feature spaces solves the problem of expressing complex functions. BUT: there is a computational problem (working with very large vectors), and a generalization theory problem (curse of dimensionality).

Implicit Mapping to Feature Space
We will introduce kernels: they solve the computational problem of working with many dimensions; they can make it possible to use infinite dimensions efficiently in time / space; other advantages, both practical and conceptual.

Kernel-Induced Feature Spaces
In the dual representation, the data points only appear inside dot products:
f(x) = Σᵢ αᵢ yᵢ ⟨φ(xᵢ), φ(x)⟩ + b
The dimensionality of space F is not necessarily important. We may not even know the map φ.

Kernels — IMPORTANT CONCEPT
A function that returns the value of the dot product between the images of the two arguments:
K(x₁, x₂) = ⟨φ(x₁), φ(x₂)⟩
Given a function K, it is possible to verify that it is a kernel.

Kernels
One can use LLMs in a feature space by simply rewriting them in dual representation and replacing dot products with kernels:
⟨x₁, x₂⟩ → K(x₁, x₂) = ⟨φ(x₁), φ(x₂)⟩

The Kernel Matrix — IMPORTANT CONCEPT
(aka the Gram matrix):

K = | K(1,1)  K(1,2)  K(1,3)  …  K(1,m) |
    | K(2,1)  K(2,2)  K(2,3)  …  K(2,m) |
    | ⋮                                 |
    | K(m,1)  K(m,2)  K(m,3)  …  K(m,m) |

The Kernel Matrix
The central structure in kernel machines. An "information bottleneck": it contains all necessary information for the learning algorithm. Fuses information about the data AND the kernel. Many interesting properties:

Mercer's Theorem
The kernel matrix is symmetric positive definite. Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space.
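Both properties — symmetry and positive (semi-)definiteness — are easy to check numerically for any Gram matrix. A small sketch with the linear kernel on random data (the data and tolerance are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

# Gram matrix of the linear kernel K(x, z) = <x, z>
K = X @ X.T

assert np.allclose(K, K.T)             # symmetric
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-8           # positive semi-definite (up to rounding)
```

The converse direction of the theorem is what makes kernel design possible: any matrix passing these checks is an inner product matrix in some feature space.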

More Formally: Mercer's Theorem
Every (semi-)positive definite, symmetric function is a kernel: i.e. there exists a mapping φ such that it is possible to write:
K(x₁, x₂) = ⟨φ(x₁), φ(x₂)⟩
Positive definite:
∫ K(x, z) f(x) f(z) dx dz ≥ 0 for all f ∈ L₂

Mercer's Theorem
Eigenvalue expansion of Mercer's kernels:
K(x₁, x₂) = Σᵢ λᵢ φᵢ(x₁) φᵢ(x₂)
That is: the eigenfunctions act as features!

Examples of Kernels
Simple examples of kernels are:
K(x, z) = ⟨x, z⟩^d
K(x, z) = e^(−‖x−z‖² / 2σ²)

Example: Polynomial Kernels
x = (x₁, x₂); z = (z₁, z₂);
⟨x, z⟩² = (x₁z₁ + x₂z₂)²
       = x₁²z₁² + x₂²z₂² + 2x₁z₁x₂z₂
       = ⟨(x₁², x₂², √2 x₁x₂), (z₁², z₂², √2 z₁z₂)⟩
       = ⟨φ(x), φ(z)⟩
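The identity above can be verified numerically: the squared dot product in input space equals the plain dot product of the explicit degree-2 feature maps. The vectors below are arbitrary examples.

```python
import numpy as np

def phi(v):
    # explicit feature map for the degree-2 polynomial kernel in 2D:
    # (v1, v2) -> (v1^2, v2^2, sqrt(2) * v1 * v2)
    return np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = (x @ z) ** 2           # kernel evaluated in input space
rhs = phi(x) @ phi(z)        # dot product in feature space
assert np.isclose(lhs, rhs)  # both sides equal 1.0 for these vectors
```

The computational point of the kernel trick is visible here: the left-hand side never builds the 3-dimensional (in general, O(n^d)-dimensional) feature vectors.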

Example: Polynomial Kernels

Example: The Two Spirals
Separated by a hyperplane in feature space (Gaussian kernels).

Making Kernels — IMPORTANT CONCEPT
The set of kernels is closed under some operations. If K, K′ are kernels, then: K + K′ is a kernel; cK is a kernel, if c > 0; aK + bK′ is a kernel, for a, b > 0; etc. We can make complex kernels from simple ones: modularity!
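These closure rules can be spot-checked by confirming that the combined Gram matrices stay symmetric positive semi-definite. A sketch on random data; `is_psd` is a small helper defined here, not a library function.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))

K1 = X @ X.T                       # linear kernel
K2 = (1.0 + X @ X.T) ** 2          # degree-2 polynomial kernel

def is_psd(K, tol=1e-8):
    """Symmetric and no eigenvalue below -tol (rounding slack)."""
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() > -tol

assert is_psd(K1) and is_psd(K2)
assert is_psd(K1 + K2)             # sum of kernels is a kernel
assert is_psd(3.0 * K1)            # positive scaling preserves the property
```

A numerical check on one dataset is of course not a proof, but it is a cheap sanity test when composing kernels in practice.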

Second Property of SVMs
SVMs are Linear Learning Machines that: use a dual representation AND operate in a kernel-induced feature space (that is:
f(x) = Σᵢ αᵢ yᵢ ⟨φ(xᵢ), φ(x)⟩ + b
is a linear function in the feature space implicitly defined by K).

Kernels over General Structures
Haussler, Watkins, etc.: kernels over sets, over sequences, over trees, etc. Applied in text categorization, bioinformatics, etc.

A Bad Kernel
...would be a kernel whose kernel matrix is mostly diagonal: all points orthogonal to each other, no clusters, no structure:

| 1  0  0  …  0 |
| 0  1  0  …  0 |
| ⋮             |
| 0  0  0  …  1 |

No Free Kernel — IMPORTANT CONCEPT
If mapping into a space with too many irrelevant features, the kernel matrix becomes diagonal. Need some prior knowledge of the target to choose a good kernel.

Other Kernel-Based Algorithms
Note: other algorithms can use kernels, not just LLMs (e.g. clustering; PCA; etc.). Dual representation often possible (in optimization problems, by the Representer theorem).

BREAK

The Generalization Problem — NEW TOPIC
The curse of dimensionality: it is easy to overfit in high-dimensional spaces (= regularities could be found in the training set that are accidental, that is, that would not be found again in a test set). The SVM problem is ill-posed (finding one hyperplane that separates the data: many such hyperplanes exist). Need a principled way to choose the best possible hyperplane.

The Generalization Problem
Many methods exist to choose a good hyperplane (inductive principles): Bayes, statistical learning theory / PAC, MDL, ... Each can be used. We will focus on a simple case motivated by statistical learning theory (it will give the basic SVM).

Statistical (Computational) Learning Theory
Generalization bounds on the risk of overfitting (in a PAC setting: assumption of i.i.d. data; etc.). Standard bounds from VC theory give upper and lower bounds proportional to the VC dimension. The VC dimension of LLMs is proportional to the dimension of the space (can be huge).

Assumptions and Definitions
Distribution D over input space X. Train and test points drawn randomly (i.i.d.) from D. Training error of h: fraction of points in S misclassified by h. Test error of h: probability under D to misclassify a point. VC dimension: size of the largest subset of X shattered by H (every dichotomy implemented).

VC Bounds
ε = Õ(VC / m)
VC = (number of dimensions of X) + 1
Typically VC >> m, so not useful. Does not tell us which hyperplane to choose.

Margin-Based Bounds
ε = Õ((R/γ)² / m)
γ = minᵢ yᵢ f(xᵢ)
Note: compression bounds also exist; and online bounds.

Margin-Based Bounds — IMPORTANT CONCEPT
The worst-case bound still holds, but if we are lucky (the margin is large), the other bound can be applied and better generalization can be achieved:
ε = Õ((R/γ)² / m)
Best hyperplane: the maximal margin one. The margin is large if the kernel is chosen well.

Maximal Margin Classifier
Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space. Third feature of SVMs: maximize the margin. SVMs control capacity by increasing the margin, not by reducing the number of degrees of freedom (dimension-free capacity control).

Two Kinds of Margin
Functional and geometric margin:
functional: γ = minᵢ yᵢ f(xᵢ)
geometric: γ = minᵢ yᵢ f(xᵢ) / ‖w‖

Two Kinds of Margin

Max Margin = Minimal Norm
If we fix the functional margin to 1, the geometric margin equals 1/‖w‖. Hence, maximize the margin by minimizing the norm.

Max Margin = Minimal Norm
Distance between the two convex hulls:
⟨w, x⁺⟩ + b = +1
⟨w, x⁻⟩ + b = −1
⟨w, (x⁺ − x⁻)⟩ = 2
⟨w/‖w‖, (x⁺ − x⁻)⟩ = 2/‖w‖

The Primal Problem — IMPORTANT STEP
Minimize: ⟨w, w⟩
subject to: yᵢ (⟨w, xᵢ⟩ + b) ≥ 1

Optimization Theory
The problem of finding the maximal margin hyperplane: constrained optimization (quadratic programming). Use Lagrange theory (or Kuhn-Tucker theory). Lagrangian:
L = ½ ⟨w, w⟩ − Σᵢ αᵢ [yᵢ (⟨w, xᵢ⟩ + b) − 1], with αᵢ ≥ 0

From Primal to Dual
L(w) = ½ ⟨w, w⟩ − Σᵢ αᵢ [yᵢ (⟨w, xᵢ⟩ + b) − 1], with αᵢ ≥ 0
Differentiate and substitute:
∂L/∂b = 0
∂L/∂w = 0

The Dual Problem — IMPORTANT STEP
Maximize: W(α) = Σᵢ αᵢ − ½ Σᵢⱼ αᵢ αⱼ yᵢ yⱼ ⟨xᵢ, xⱼ⟩
Subject to: αᵢ ≥ 0, Σᵢ αᵢ yᵢ = 0
The duality again! Can use kernels!
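As an illustration, this dual can be solved with a general-purpose constrained optimizer rather than a dedicated QP package. A sketch using SciPy's SLSQP on a tiny separable toy set; all data and names are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [1.0, 2.5], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(X)
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_ij = y_i y_j <x_i, x_j>

# maximize W(alpha) = sum(alpha) - 1/2 alpha' Q alpha == minimize its negative
res = minimize(
    fun=lambda a: 0.5 * a @ Q @ a - a.sum(),
    x0=np.zeros(m),
    jac=lambda a: Q @ a - np.ones(m),
    bounds=[(0, None)] * m,                              # alpha_i >= 0
    constraints={"type": "eq", "fun": lambda a: a @ y},  # sum_i alpha_i y_i = 0
    method="SLSQP",
)
alpha = res.x
w = (alpha * y) @ X                         # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                           # support vectors: alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)              # from y_i (<w, x_i> + b) = 1
assert all(np.sign(X @ w + b) == y)
```

To run in a feature space, replace `X @ X.T` with a kernel matrix; the rest of the solve is unchanged — the duality at work.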

Convexity — IMPORTANT CONCEPT
This is a quadratic optimization problem: convex, no local minima (second effect of Mercer's conditions). Solvable in polynomial time. (Convexity is another fundamental property of SVMs.)

Kuhn-Tucker Theorem
Properties of the solution:
Duality: can use kernels.
KKT conditions: αᵢ [yᵢ (⟨w, xᵢ⟩ + b) − 1] = 0
Sparseness: only the points nearest to the hyperplane (margin = 1) have positive weight.
w = Σᵢ αᵢ yᵢ xᵢ
They are called support vectors.

KKT Conditions Imply Sparseness
Sparseness: another fundamental property of SVMs.

Properties of SVMs — Summary
✓ Duality
✓ Kernels
✓ Margin
✓ Convexity
✓ Sparseness

Dealing with Noise
In the case of non-separable data in feature space, the margin distribution can be optimized:
ε ≤ (1/m) (R + ‖ξ‖₂)² / γ²
yᵢ (⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ

The Soft-Margin Classifier
Minimize: ½ ⟨w, w⟩ + C Σᵢ ξᵢ
or: ½ ⟨w, w⟩ + C Σᵢ ξᵢ²
Subject to: yᵢ (⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ

Slack Variables
ε ≤ (1/m) (R + ‖ξ‖₂)² / γ²
yᵢ (⟨w, xᵢ⟩ + b) ≥ 1 − ξᵢ

Soft Margin — Dual Lagrangian
Box constraints (1-norm case):
W(α) = Σᵢ αᵢ − ½ Σᵢⱼ αᵢ αⱼ yᵢ yⱼ ⟨xᵢ, xⱼ⟩
0 ≤ αᵢ ≤ C, Σᵢ αᵢ yᵢ = 0
Diagonal (2-norm case):
W(α) = Σᵢ αᵢ − ½ Σᵢⱼ αᵢ αⱼ yᵢ yⱼ ⟨xᵢ, xⱼ⟩ − (1/2C) Σᵢ αᵢ²
αᵢ ≥ 0, Σᵢ αᵢ yᵢ = 0

The Regression Case
For regression, all the above properties are retained, introducing the ε-insensitive loss:
L = max(0, |y − ⟨w, x⟩ − b| − ε)

Regression: the ε-Tube

Implementation Techniques
Maximizing a quadratic function, subject to a linear equality constraint (and inequalities as well):
W(α) = Σᵢ αᵢ − ½ Σᵢⱼ αᵢ αⱼ yᵢ yⱼ K(xᵢ, xⱼ)
αᵢ ≥ 0, Σᵢ αᵢ yᵢ = 0

Simple Approximation
Initially, complex QP packages were used. Stochastic Gradient Ascent (sequentially update one weight at a time) gives an excellent approximation in most cases:
αᵢ ← αᵢ + (1 / K(xᵢ, xᵢ)) (1 − yᵢ Σⱼ αⱼ yⱼ K(xⱼ, xᵢ))
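A sketch of this one-coordinate-at-a-time ascent (essentially the kernel Adatron form of the update, with αᵢ clipped at zero and the bias term omitted for simplicity; the toy data and epoch count are assumptions):

```python
import numpy as np

def kernel_sga(K, y, epochs=200):
    """Stochastic gradient ascent on the dual, one alpha at a time."""
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(epochs):
        for i in range(m):
            # alpha_i <- alpha_i + (1/K_ii) * (1 - y_i * sum_j alpha_j y_j K_ji)
            alpha[i] += (1.0 - y[i] * np.sum(alpha * y * K[:, i])) / K[i, i]
            alpha[i] = max(alpha[i], 0.0)   # keep alpha_i >= 0
    return alpha

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T                                 # linear kernel for the demo
alpha = kernel_sga(K, y)
f = (alpha * y) @ K   # f(x_j) = sum_i alpha_i y_i K(x_i, x_j), no bias
assert all(np.sign(f) == y)
```

The step size 1/K(xᵢ, xᵢ) is the natural per-coordinate scaling from the slide's update; on separable data the iterates approach the maximal margin solution.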

Full Solution: S.M.O.
SMO: update two weights simultaneously. Realizes gradient descent without leaving the linear constraint (J. Platt). Online versions exist (Li-Long; Gentile).

Other Kernelized Algorithms
Adatron, nearest neighbour, Fisher discriminant, Bayes classifier, ridge regression, etc. Much work in past years on designing kernel-based algorithms. Now: more work on designing good kernels (for any algorithm).

On Combining Kernels
When is it advantageous to combine kernels? Too many features leads to overfitting also in kernel methods. Kernel combination needs to be based on principles. Alignment.

Kernel Alignment — IMPORTANT CONCEPT
Notion of similarity between kernels: alignment (= similarity between Gram matrices):
A(K₁, K₂) = ⟨K₁, K₂⟩ / √(⟨K₁, K₁⟩ ⟨K₂, K₂⟩)
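The alignment is a normalized Frobenius inner product of Gram matrices, so it takes only a few lines to compute. A sketch (the data are random and purely illustrative):

```python
import numpy as np

def alignment(K1, K2):
    """A(K1, K2) = <K1, K2>_F / sqrt(<K1, K1>_F * <K2, K2>_F)"""
    inner = lambda A, B: np.sum(A * B)       # Frobenius inner product
    return inner(K1, K2) / np.sqrt(inner(K1, K1) * inner(K2, K2))

y = np.array([1.0, 1.0, -1.0, -1.0])
K_ideal = np.outer(y, y)                     # the "ideal" kernel yy'
assert np.isclose(alignment(K_ideal, K_ideal), 1.0)   # perfect self-alignment

rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))
K = X @ X.T                                  # linear kernel on random data
a = alignment(K, K_ideal)
assert -1.0 <= a <= 1.0                      # Cauchy-Schwarz bound
```

Alignment to the ideal kernel yy′ is what the following slides use as a quality score for a kernel on a given task.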

Many Interpretations
As a measure of clustering in data. As a correlation coefficient between "oracles". Basic idea: the ultimate kernel should be YY′, that is, it should be given by the labels vector (after all: the target is the only relevant feature!).

The Ideal Kernel
For labels y = (+1, +1, −1, −1):

      |  1   1  −1  −1 |
YY′ = |  1   1  −1  −1 |
      | −1  −1   1   1 |
      | −1  −1   1   1 |

Combining Kernels
Alignment is increased by combining kernels that are aligned to the target and not aligned to each other.
A(K₁, YY′) = ⟨K₁, YY′⟩ / √(⟨K₁, K₁⟩ ⟨YY′, YY′⟩)

Spectral Machines
Can (approximately) maximize the alignment of a set of labels to a given kernel, by solving this problem:
y = argmax_{y ∈ {−1,+1}^m} (y′Ky) / (y′y)
Approximated by the principal eigenvector (thresholded) (see the Courant-Hilbert theorem).
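A sketch of the eigenvector approximation: take the principal eigenvector of K and threshold its entries to ±1. Here a near-ideal kernel is constructed deliberately so the labels are recoverable; the perturbation and labels are assumptions for the demo.

```python
import numpy as np

# labels that (approximately) maximize y'Ky: threshold the principal eigenvector
y_true = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
K = np.outer(y_true, y_true) + 0.1 * np.eye(6)   # near-ideal kernel, perturbed

eigvals, eigvecs = np.linalg.eigh(K)             # ascending eigenvalues
v = eigvecs[:, -1]                               # principal eigenvector
y_hat = np.sign(v)                               # threshold to {-1, +1}
# labels recovered up to a global sign flip (eigenvectors have arbitrary sign)
assert np.array_equal(y_hat, y_true) or np.array_equal(y_hat, -y_true)
```

The relaxation from y ∈ {−1,+1}^m to a real vector is what makes the problem tractable: over real vectors, the Rayleigh quotient is maximized exactly by the principal eigenvector.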

Courant-Hilbert Theorem
A symmetric and positive definite. Principal eigenvalue / eigenvector characterized by:
λ = max_v (v′Av) / (v′v)

Optimizing Kernel Alignment
One can either adapt the kernel to the labels or vice versa. In the first case: a model selection method. In the second case: a clustering / transduction method.

Applications of SVMs
Bioinformatics. Machine vision. Text categorization. Handwritten character recognition. Time series analysis.

Text Kernels
Joachims (bag of words). Latent semantic kernels (ICML 2001). String matching kernels. See the KerMIT project.

Bioinformatics
Gene expression. Protein sequences. Phylogenetic information. Promoters.

Conclusions
Much more than just a replacement for neural networks. A general and rich class of pattern recognition methods.
Book on SVMs: www.support-vector.net
Kernel machines website: www.kernel-machines.org
www.neurocolt.org