Support Vector and Kernel Machines. Nello Cristianini, BIOwulf Technologies. nello@support-vector.net, http:///tutorial.html. ICML 2001
A Little History. SVMs introduced in COLT-92 by Boser, Guyon, Vapnik. Greatly developed ever since. Initially popularized in the NIPS community, now an important and active field of all Machine Learning research. Special issues of Machine Learning Journal, and Journal of Machine Learning Research. Kernel Machines: large class of learning algorithms, SVMs a particular instance.
A Little History. Annual workshop at NIPS. Centralized website: www.kernel-machines.org. Textbook (2000). Now: a large and diverse community: from machine learning, optimization, statistics, neural networks, functional analysis, etc. Successful applications in many fields (bioinformatics, text, handwriting recognition, etc). Fast expanding field, EVERYBODY WELCOME!
Preliminaries. Task of this class of algorithms: detect and exploit complex patterns in data (e.g. by clustering, classifying, ranking, cleaning, etc. the data). Typical problems: how to represent complex patterns; and how to exclude spurious (unstable) patterns (= overfitting). The first is a computational problem; the second a statistical problem.
Very Informal Reasoning. The class of kernel methods implicitly defines the class of possible patterns by introducing a notion of similarity between data. Example: similarity between documents: by length, by topic, by language. Choice of similarity ⇒ choice of relevant features.
More formal reasoning. Kernel methods exploit information about the inner products between data items. Many standard algorithms can be rewritten so that they only require inner products between data (inputs). Kernel functions = inner products in some feature space (potentially very complex). If the kernel is given, no need to specify what features of the data are being used.
Just in case. Inner product between vectors: ⟨x, z⟩ = Σ_i x_i z_i. Hyperplane: ⟨w, x⟩ + b = 0.
Overview of the Tutorial. Introduce basic concepts with an extended example of the Kernel Perceptron. Derive Support Vector Machines. Other kernel based algorithms. Properties and Limitations of Kernels. On Kernel Alignment. On Optimizing Kernel Alignment.
Parts I and II: overview. Linear Learning Machines (LLM). Kernel Induced Feature Spaces. Generalization Theory. Optimization Theory. Support Vector Machines (SVM).
Modularity. IMPORTANT CONCEPT. Any kernel-based learning algorithm is composed of two modules: a general purpose learning machine; a problem specific kernel function. Any kernel-based algorithm can be fitted with any kernel. Kernels themselves can be constructed in a modular way. Great for software engineering (and for analysis).
1 - Linear Learning Machines. Simplest case: classification. Decision function is a hyperplane in input space. The Perceptron Algorithm (Rosenblatt, 57). Useful to analyze the Perceptron algorithm, before looking at SVMs and Kernel Methods in general.
Basic Notation. Input space x ∈ X. Output space y ∈ Y = {-1, +1}. Hypothesis h ∈ H. Real-valued: f : X → R. Training Set S = {(x_1, y_1), ..., (x_i, y_i), ...}. Test error ε. Dot product ⟨x, z⟩.
Perceptron. Linear separation of the input space: f(x) = ⟨w, x⟩ + b, h(x) = sign(f(x)).
Perceptron Algorithm. Update rule (ignoring threshold): if y_i ⟨w_k, x_i⟩ ≤ 0 then w_{k+1} ← w_k + η y_i x_i, k ← k + 1.
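The update rule above can be sketched in code. A minimal sketch: the toy dataset, learning rate, and epoch cap are illustrative choices, not part of the original slide.

```python
# Sketch of the primal perceptron update (Rosenblatt, 1957).

def perceptron_train(X, y, eta=1.0, epochs=100):
    """Return (w, b) separating the data, if linearly separable."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # functional margin y_i (<w, x_i> + b)
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) <= 0:
                # mistake-driven update: w <- w + eta * y_i * x_i
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
                mistakes += 1
        if mistakes == 0:       # converged: no mistakes in a full pass
            break
    return w, b

# Toy linearly separable data (hypothetical example)
X = [(2.0, 1.0), (1.0, 2.0), (-1.0, -1.0), (-2.0, -1.5)]
y = [+1, +1, -1, -1]
w, b = perceptron_train(X, y)
```

Note how the updates are mistake-driven: a point contributes only when it is misclassified, which is what makes the dual representation on the next slides possible.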
Observations. The solution is a linear combination of training points: w = Σ_i α_i y_i x_i, α_i ≥ 0. Only used informative points (mistake driven). The coefficient of a point in the combination reflects its difficulty.
Observations - 2. Mistake bound: M ≤ (R/γ)². The coefficients are non-negative. Possible to rewrite the algorithm using this alternative representation.
Dual Representation. IMPORTANT CONCEPT. The decision function can be re-written as follows: f(x) = ⟨w, x⟩ + b = Σ_i α_i y_i ⟨x_i, x⟩ + b, with w = Σ_i α_i y_i x_i.
Dual Representation. And also the update rule can be rewritten as follows: if y_i (Σ_j α_j y_j ⟨x_j, x_i⟩ + b) ≤ 0 then α_i ← α_i + η. Note: in the dual representation, data appears only inside dot products.
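The dual update can be sketched the same way: only the coefficients α_i change, and data enter only through the `kernel` argument (here the plain dot product; the toy data are an illustrative choice).

```python
# Sketch of the perceptron in dual representation.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_perceptron_train(X, y, kernel=dot, eta=1.0, epochs=100):
    """Return dual coefficients alpha and bias b."""
    alpha = [0.0] * len(X)
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            # f(x_i) = sum_j alpha_j y_j K(x_j, x_i) + b
            f = sum(a * yj * kernel(xj, xi) for a, yj, xj in zip(alpha, y, X)) + b
            if yi * f <= 0:
                alpha[i] += eta      # the dual update rule from the slide
                b += eta * yi
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, b

X = [(2.0, 1.0), (1.0, 2.0), (-1.0, -1.0), (-2.0, -1.5)]
y = [+1, +1, -1, -1]
alpha, b = dual_perceptron_train(X, y)
```

Swapping `dot` for any kernel function turns this into the kernel perceptron without touching the rest of the code: the modularity discussed earlier.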
Duality: First Property of SVMs. DUALITY is the first feature of Support Vector Machines. SVMs are Linear Learning Machines represented in a dual fashion: f(x) = ⟨w, x⟩ + b = Σ_i α_i y_i ⟨x_i, x⟩ + b. Data appear only within dot products (in the decision function and in the training algorithm).
Limitations of LLMs. Linear classifiers cannot deal with: non-linearly separable data; noisy data. In addition, this formulation only deals with vectorial data.
Non-Linear Classifiers. One solution: creating a net of simple linear classifiers (neurons): a Neural Network (problems: local minima; many parameters; heuristics needed to train; etc). Other solution: map data into a richer feature space including non-linear features, then use a linear classifier.
Learning in the Feature Space. Map data into a feature space where they are linearly separable: x ↦ φ(x). [Figure: points mapped from input space X to feature space F.]
Problems with Feature Space. Working in high dimensional feature spaces solves the problem of expressing complex functions. BUT: there is a computational problem (working with very large vectors); and a generalization theory problem (curse of dimensionality).
Implicit Mapping to Feature Space. We will introduce Kernels: solve the computational problem of working with many dimensions; can make it possible to use infinite dimensions efficiently in time / space; other advantages, both practical and conceptual.
Kernel-Induced Feature Spaces. In the dual representation, the data points only appear inside dot products: f(x) = Σ_i α_i y_i ⟨φ(x_i), φ(x)⟩ + b. The dimensionality of the space F is not necessarily important. We may not even know the map φ.
Kernels. IMPORTANT CONCEPT. A function that returns the value of the dot product between the images of the two arguments: K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩. Given a function K, it is possible to verify that it is a kernel.
Kernels. One can use LLMs in a feature space by simply rewriting them in dual representation and replacing dot products with kernels: ⟨x_1, x_2⟩ → K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩.
The Kernel Matrix. IMPORTANT CONCEPT. (aka the Gram matrix):
K =
| K(x_1,x_1)  K(x_1,x_2)  K(x_1,x_3)  …  K(x_1,x_m) |
| K(x_2,x_1)  K(x_2,x_2)  K(x_2,x_3)  …  K(x_2,x_m) |
| …                                                 |
| K(x_m,x_1)  K(x_m,x_2)  K(x_m,x_3)  …  K(x_m,x_m) |
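Building the Gram matrix is mechanical; a minimal sketch (the linear kernel and toy data are illustrative choices):

```python
# Sketch: build the Gram (kernel) matrix K[i, j] = K(x_i, x_j) for a dataset.
import numpy as np

def gram_matrix(X, kernel):
    m = len(X)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K

linear = lambda x, z: float(np.dot(x, z))

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = gram_matrix(X, linear)
# K is symmetric and positive semi-definite (Mercer's conditions, next slides)
```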
The Kernel Matrix. The central structure in kernel machines. Information bottleneck: contains all necessary information for the learning algorithm. Fuses information about the data AND the kernel. Many interesting properties.
Mercer's Theorem. The kernel matrix is symmetric positive definite. Any symmetric positive definite matrix can be regarded as a kernel matrix, that is, as an inner product matrix in some space.
More Formally: Mercer's Theorem. Every (semi) positive definite, symmetric function is a kernel: i.e. there exists a mapping φ such that it is possible to write: K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩. Positive definite here means: ∫∫ K(x, z) f(x) f(z) dx dz ≥ 0 for all f ∈ L_2.
Mercer's Theorem. Eigenvalue expansion of Mercer's kernels: K(x_1, x_2) = Σ_i λ_i φ_i(x_1) φ_i(x_2). That is: the eigenfunctions act as features!
Examples of Kernels. Simple examples of kernels are: K(x, z) = ⟨x, z⟩^d; K(x, z) = exp(-‖x - z‖² / (2σ²)).
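The two kernels above, written out directly (default degree and width are illustrative choices):

```python
# The polynomial kernel <x, z>^d and the Gaussian (RBF) kernel
# exp(-||x - z||^2 / (2 sigma^2)).
import math

def poly_kernel(x, z, d=2):
    return sum(a * b for a, b in zip(x, z)) ** d

def rbf_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))

# rbf_kernel(x, x) is always 1; poly_kernel grows with the dot product
```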
Example: Polynomial Kernels. Let x = (x_1, x_2), z = (z_1, z_2). Then: ⟨x, z⟩² = (x_1 z_1 + x_2 z_2)² = x_1² z_1² + x_2² z_2² + 2 x_1 x_2 z_1 z_2 = ⟨(x_1², x_2², √2 x_1 x_2), (z_1², z_2², √2 z_1 z_2)⟩ = ⟨φ(x), φ(z)⟩.
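The identity on this slide can be checked numerically: the kernel value equals the dot product of the explicit feature vectors (the test points are arbitrary choices).

```python
# Check: <x, z>^2 == <phi(x), phi(z)> with phi(x) = (x1^2, x2^2, sqrt(2) x1 x2).
import math

def phi(x):
    x1, x2 = x
    return (x1 ** 2, x2 ** 2, math.sqrt(2) * x1 * x2)

def k_poly2(x, z):
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)   # arbitrary test points
lhs = k_poly2(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
```

The kernel computes a 3-dimensional inner product at the cost of a 2-dimensional one; for degree d in n dimensions the saving becomes dramatic.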
Example: Polynomial Kernels. [Figure.]
Example: the two spirals. Separated by a hyperplane in feature space (Gaussian kernels).
Making Kernels. IMPORTANT CONCEPT. The set of kernels is closed under some operations. If K, K′ are kernels, then: K + K′ is a kernel; cK is a kernel, if c > 0; aK + bK′ is a kernel, for a, b > 0; etc. Can make complex kernels from simple ones: modularity!
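A sketch of these closure properties at the Gram-matrix level: sums and positive scalings of positive semi-definite matrices stay positive semi-definite (the toy feature vectors and coefficients are illustrative choices).

```python
# Sketch: closure of kernels under sum and positive scaling, checked via PSD-ness.
import numpy as np

def is_psd(K, tol=1e-9):
    K = np.asarray(K, dtype=float)
    return np.allclose(K, K.T) and np.all(np.linalg.eigvalsh(K) >= -tol)

# Two toy Gram matrices built from explicit feature vectors (PSD by construction)
X1 = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
X2 = np.array([[2.0], [0.0], [1.0]])
K1, K2 = X1 @ X1.T, X2 @ X2.T

combined = 3.0 * K1 + 0.5 * K2   # a*K1 + b*K2 with a, b > 0
```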
Second Property of SVMs: SVMs are Linear Learning Machines that: use a dual representation AND operate in a kernel induced feature space (that is: f(x) = Σ_i α_i y_i ⟨φ(x_i), φ(x)⟩ + b is a linear function in the feature space implicitly defined by K).
Kernels over General Structures. Haussler, Watkins, etc: kernels over sets, over sequences, over trees, etc. Applied in text categorization, bioinformatics, etc.
A bad kernel would be a kernel whose kernel matrix is mostly diagonal: all points orthogonal to each other, no clusters, no structure:
| 1 0 0 … 0 |
| 0 1 0 … 0 |
| …         |
| 0 0 0 … 1 |
No Free Kernel. IMPORTANT CONCEPT. If mapping into a space with too many irrelevant features, the kernel matrix becomes diagonal. Need some prior knowledge of the target to choose a good kernel.
Other Kernel-based algorithms. Note: other algorithms can use kernels, not just LLMs (e.g. clustering; PCA; etc). Dual representation often possible (in optimization problems, by the Representer theorem).
BREAK
The Generalization Problem. NEW TOPIC. The curse of dimensionality: it is easy to overfit in high dimensional spaces (= regularities could be found in the training set that are accidental, that is, that would not be found again in a test set). The SVM problem is ill posed (finding one hyperplane that separates the data: many such hyperplanes exist). Need a principled way to choose the best possible hyperplane.
The Generalization Problem. Many methods exist to choose a good hyperplane (inductive principles): Bayes, statistical learning theory / PAC, MDL, … Each can be used; we will focus on a simple case motivated by statistical learning theory (will give the basic SVM).
Statistical (Computational) Learning Theory. Generalization bounds on the risk of overfitting (in a p.a.c. setting: assumption of i.i.d. data; etc). Standard bounds from VC theory give upper and lower bounds proportional to the VC dimension. VC dimension of LLMs proportional to the dimension of the space (can be huge).
Assumptions and Definitions. Distribution D over input space X. Train and test points drawn randomly (i.i.d.) from D. Training error of h: fraction of points in S misclassified by h. Test error of h: probability under D to misclassify a point. VC dimension: size of the largest subset of X shattered by H (every dichotomy implemented).
VC Bounds. ε = Õ(VC / m). VC = (number of dimensions of X) + 1. Typically VC >> m, so not useful. Does not tell us which hyperplane to choose.
Margin Based Bounds. ε = Õ((R/γ)² / m), where γ = min_i y_i f(x_i) (functional margin, with f normalized). Note: also compression bounds exist; and online bounds.
Margin Based Bounds. IMPORTANT CONCEPT. The worst case bound still holds, but if lucky (margin is large) the other bound can be applied and better generalization can be achieved: ε = Õ((R/γ)² / m). Best hyperplane: the maximal margin one. The margin is large if the kernel is chosen well.
Maximal Margin Classifier. Minimize the risk of overfitting by choosing the maximal margin hyperplane in feature space. Third feature of SVMs: maximize the margin. SVMs control capacity by increasing the margin, not by reducing the number of degrees of freedom (dimension free capacity control).
Two kinds of margin. Functional and geometric margin: γ_funct = min_i y_i f(x_i); γ_geom = min_i y_i f(x_i) / ‖w‖.
Two kinds of margin. [Figure.]
Max Margin = Minimal Norm. If we fix the functional margin to 1, the geometric margin equals 1/‖w‖. Hence, maximize the margin by minimizing the norm.
Max Margin = Minimal Norm. Distance between the two convex hulls: ⟨w, x⁺⟩ + b = +1; ⟨w, x⁻⟩ + b = -1; hence ⟨w, (x⁺ - x⁻)⟩ = 2 and ⟨w/‖w‖, (x⁺ - x⁻)⟩ = 2/‖w‖.
The primal problem. IMPORTANT STEP. Minimize: ⟨w, w⟩, subject to: y_i (⟨w, x_i⟩ + b) ≥ 1.
Optimization Theory. The problem of finding the maximal margin hyperplane: constrained optimization (quadratic programming). Use Lagrange theory (or Kuhn-Tucker theory). Lagrangian: L = ½ ⟨w, w⟩ - Σ_i α_i [y_i (⟨w, x_i⟩ + b) - 1], α_i ≥ 0.
From Primal to Dual. L(w, b, α) = ½ ⟨w, w⟩ - Σ_i α_i [y_i (⟨w, x_i⟩ + b) - 1], α_i ≥ 0. Differentiate and substitute: ∂L/∂b = 0, ∂L/∂w = 0.
The Dual Problem. IMPORTANT STEP. Maximize: W(α) = Σ_i α_i - ½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩. Subject to: α_i ≥ 0, Σ_i α_i y_i = 0. The duality again! Can use kernels!
Convexity. IMPORTANT CONCEPT. This is a Quadratic Optimization problem: convex, no local minima (second effect of Mercer's conditions). Solvable in polynomial time. (Convexity is another fundamental property of SVMs.)
Kuhn-Tucker Theorem. Properties of the solution: Duality: can use kernels. KKT conditions: α_i [y_i (⟨w, x_i⟩ + b) - 1] = 0. Sparseness: only the points nearest to the hyperplane (margin = 1) have positive weight: w = Σ_i α_i y_i x_i. They are called support vectors.
KKT Conditions Imply Sparseness. Sparseness: another fundamental property of SVMs.
Properties of SVMs - Summary: ✓ Duality ✓ Kernels ✓ Margin ✓ Convexity ✓ Sparseness
Dealing with noise. In the case of non-separable data in feature space, the margin distribution can be optimized: ε = Õ((R² + ‖ξ‖²) / (m γ²)), with slack variables ξ_i such that y_i (⟨w, x_i⟩ + b) ≥ 1 - ξ_i.
The Soft-Margin Classifier. Minimize: ½ ⟨w, w⟩ + C Σ_i ξ_i, or: ½ ⟨w, w⟩ + C Σ_i ξ_i². Subject to: y_i (⟨w, x_i⟩ + b) ≥ 1 - ξ_i.
Slack Variables. The same margin-distribution bound applies: ε = Õ((R² + ‖ξ‖²) / (m γ²)), with y_i (⟨w, x_i⟩ + b) ≥ 1 - ξ_i.
Soft Margin - Dual Lagrangian. Box constraints (1-norm slack): W(α) = Σ_i α_i - ½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩, subject to 0 ≤ α_i ≤ C, Σ_i α_i y_i = 0. Diagonal (2-norm slack): W(α) = Σ_i α_i - ½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ - (1/(2C)) Σ_i α_i², subject to α_i ≥ 0, Σ_i α_i y_i = 0.
The regression case. For regression, all the above properties are retained, introducing the epsilon-insensitive loss: L_ε(x, y) = max(0, |y - ⟨w, x⟩ - b| - ε).
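The epsilon-insensitive loss in code, taking the prediction f(x) = ⟨w, x⟩ + b as a single argument (the default ε is an illustrative choice): deviations smaller than ε cost nothing, larger ones are penalized linearly.

```python
# The epsilon-insensitive loss used in SVM regression.

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    return max(0.0, abs(y_true - y_pred) - eps)

# Deviations inside the eps-tube are free; outside it the loss grows linearly.
```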
Regression: the ε-tube. [Figure.]
Implementation Techniques. Maximizing a quadratic function, subject to a linear equality constraint (and inequalities as well): W(α) = Σ_i α_i - ½ Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j), α_i ≥ 0, Σ_i α_i y_i = 0.
Simple Approximation. Initially complex QP packages were used. Stochastic Gradient Ascent (sequentially update one weight at a time) gives an excellent approximation in most cases: α_i ← α_i + (1/K(x_i, x_i)) (1 - y_i Σ_j α_j y_j K(x_i, x_j)).
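A minimal sketch of this stochastic gradient ascent on the dual objective. Assumptions beyond the slide: the bias/equality constraint is ignored for simplicity, α_i is clipped at 0 to respect the dual constraint, and the toy data and epoch count are illustrative.

```python
# Sketch: coordinate-wise stochastic gradient ascent on the dual W(alpha),
# with step size 1 / K(x_i, x_i) as on the slide.
import numpy as np

def sga_dual(K, y, epochs=200):
    """K: Gram matrix, y: labels in {-1,+1}. Returns dual weights alpha."""
    m = len(y)
    alpha = np.zeros(m)
    for _ in range(epochs):
        for i in range(m):
            # dW/dalpha_i = 1 - y_i sum_j alpha_j y_j K(x_i, x_j)
            grad = 1.0 - y[i] * np.sum(alpha * y * K[i])
            alpha[i] = max(0.0, alpha[i] + grad / K[i, i])  # keep alpha_i >= 0
    return alpha

X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T
alpha = sga_dual(K, y)
# at the optimum the margin constraints y_i f(x_i) >= 1 hold
margins = y * (K @ (alpha * y))
```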
Full Solution: S.M.O. SMO: update two weights simultaneously. Realizes gradient ascent without leaving the linear constraint (J. Platt). Online versions exist (Li-Long; Gentile).
Other kernelized Algorithms. Adatron, nearest neighbour, Fisher discriminant, Bayes classifier, ridge regression, etc. Much work in past years into designing kernel based algorithms. Now: more work on designing good kernels (for any algorithm).
On Combining Kernels. When is it advantageous to combine kernels? Too many features leads to overfitting also in kernel methods. Kernel combination needs to be based on principles. Alignment.
Kernel Alignment. IMPORTANT CONCEPT. Notion of similarity between kernels: Alignment (= similarity between Gram matrices): A(K_1, K_2) = ⟨K_1, K_2⟩ / √(⟨K_1, K_1⟩ ⟨K_2, K_2⟩).
Many interpretations. As a measure of clustering in data. As a correlation coefficient between oracles. Basic idea: the ultimate kernel should be YY′, that is, should be given by the labels vector (after all: the target is the only relevant feature!).
The ideal kernel:
YY′ =
|  1  1 -1 -1 |
|  1  1 -1 -1 |
| -1 -1  1  1 |
| -1 -1  1  1 |
Combining Kernels. Alignment is increased by combining kernels that are aligned to the target and not aligned to each other: A(K_1, YY′) = ⟨K_1, YY′⟩ / √(⟨K_1, K_1⟩ ⟨YY′, YY′⟩).
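The alignment is a Frobenius inner product between Gram matrices; a minimal sketch (the 4-point label vector is an illustrative choice):

```python
# Sketch: empirical alignment A(K1, K2) = <K1,K2> / sqrt(<K1,K1> <K2,K2>),
# using the Frobenius inner product between Gram matrices.
import numpy as np

def alignment(K1, K2):
    num = np.sum(K1 * K2)                     # <K1, K2> (Frobenius)
    den = np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))
    return num / den

y = np.array([1.0, 1.0, -1.0, -1.0])
ideal = np.outer(y, y)                        # the "ideal" kernel YY'
# a kernel identical to YY' has alignment 1 with the target;
# a purely diagonal ("bad") kernel is much less aligned
```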
Spectral Machines. Can (approximately) maximize the alignment of a set of labels to a given kernel, by solving this problem: y = argmax_{y ∈ {-1,+1}^m} (y′Ky) / (y′y). Approximated by the principal eigenvector (thresholded) (see Courant-Hilbert theorem).
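A sketch of the spectral approximation: take the principal eigenvector of the kernel matrix and threshold its entries to get candidate labels in {-1, +1}^m. The block-structured toy kernel (roughly YY′ plus noise) is an illustrative choice; note the eigenvector's overall sign is arbitrary, so only the induced partition is meaningful.

```python
# Sketch: approximate the alignment-maximizing labels by thresholding
# the principal eigenvector of the kernel matrix.
import numpy as np

def spectral_labels(K):
    eigvals, eigvecs = np.linalg.eigh(K)      # eigh: eigenvalues in ascending order
    v = eigvecs[:, -1]                        # principal eigenvector
    return np.where(v >= 0, 1, -1)            # threshold to {-1, +1}

# Toy kernel close to an ideal YY' for two clusters of two points each
K = np.array([[ 1.0,  0.9, -0.8, -0.9],
              [ 0.9,  1.0, -0.9, -0.8],
              [-0.8, -0.9,  1.0,  0.9],
              [-0.9, -0.8,  0.9,  1.0]])
labels = spectral_labels(K)
```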
Courant-Hilbert theorem. A: symmetric and positive definite. The principal eigenvalue / eigenvector are characterized by: λ = max_v (v′Av) / (v′v).
Optimizing Kernel Alignment. One can either adapt the kernel to the labels or vice versa. In the first case: a model selection method. In the second case: a clustering / transduction method.
Applications of SVMs. Bioinformatics. Machine Vision. Text Categorization. Handwritten Character Recognition. Time series analysis.
Text Kernels. Joachims (bag of words). Latent semantic kernels (ICML 2001). String matching kernels. See the KerMIT project.
Bioinformatics. Gene Expression. Protein sequences. Phylogenetic Information. Promoters.
Conclusions: Much more than just a replacement for neural networks. A general and rich class of pattern recognition methods. Book on SVMs: www.support-vector.net. Kernel machines website: www.kernel-machines.org; www.neurocolt.org