Lecture 12, October 24th, 2008
Review

Linear SVM seeks to find a linear decision boundary that maximizes the geometric margin. Using the concept of soft margin, we can achieve a tradeoff between maximizing the margin and faithfully fitting all training examples, thus providing better handling of outliers/noisy training examples and (simple) cases that are not linearly separable. By mapping the data from the original input space into a higher dimensional feature space, we obtain non-linear SVM. Let's see a bit more about this.
Non-linear SVM

The basic idea is to map the data onto a new feature space such that the data is linearly separable in this space. However, we don't need to explicitly compute the mapping. Instead, we use what we call the kernel function (this is referred to as the kernel trick).

What is a kernel function? A kernel function is a function $k(a,b) = \langle \phi(a), \phi(b) \rangle$, where $\phi$ is a mapping.

Why use a kernel function? The inner product in the mapped feature space, $\langle \phi(a), \phi(b) \rangle$, is computed by applying the kernel function to the original input vectors, $k(a,b)$. This avoids the computational cost incurred by the mapping. (A small numerical check of this identity is sketched below.)

We have shown previously that using a kernel in the prediction stage is straightforward, but what about learning of w (or, more directly, the α's)? Does this make any difference in the SVM learning step? Let's look at the optimization problem of the original SVM.
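To make the kernel trick concrete, here is a minimal Python/NumPy sketch, not taken from the slides, that uses a degree-2 polynomial kernel on 2-D inputs as an illustrative choice. It checks numerically that $k(a,b) = (a \cdot b)^2$ equals the inner product $\langle \phi(a), \phi(b) \rangle$ for the explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, so the mapping never has to be computed in practice.

```python
# Illustrative sketch: kernel trick for a degree-2 polynomial kernel.
# The feature map phi, the kernel k, and the vectors a, b are all
# illustrative choices, not part of the lecture notes.
import numpy as np

def phi(x):
    """Explicit feature map for k(a, b) = (a . b)^2 on 2-D inputs:
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(a, b):
    """Kernel evaluated directly on the original input vectors."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Both quantities agree: <phi(a), phi(b)> == k(a, b),
# so we never need to compute phi explicitly.
print(np.dot(phi(a), phi(b)))   # 1.0
print(k(a, b))                  # 1.0
```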
Soft margin SVM optimization

$\min_{\mathbf{w},\,b,\,\xi}\ \tfrac{1}{2}\,\mathbf{w}\cdot\mathbf{w} + c\sum_{i=1}^{N}\xi_i$

subject to: $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,N$

Knowing that $\mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i$, we can rewrite the optimization problem in terms of inner products:

$\min_{\alpha,\,b,\,\xi}\ \tfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j \langle\mathbf{x}_i,\mathbf{x}_j\rangle + c\sum_{i=1}^{N}\xi_i$

subject to: $y_i\Big(\sum_{j=1}^{N}\alpha_j y_j \langle\mathbf{x}_j,\mathbf{x}_i\rangle + b\Big) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,N$

In fact, learning for SVM is typically carried out using only the inner products of the $\mathbf{x}_i$, without using the original vectors themselves. Now we just need to replace the inner products with an appropriate kernel function.
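As a hedged illustration of the point that learning needs only inner products, the following sketch (assuming scikit-learn is available; the toy data and the value of C are made up for illustration) trains a soft-margin SVM from a precomputed Gram matrix of inner products $\langle \mathbf{x}_i, \mathbf{x}_j \rangle$, never handing the learner the original vectors again once the matrix is built.

```python
# Illustrative sketch: SVM training from inner products only.
# Data, labels, and C are illustrative; scikit-learn is assumed available.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 2)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Gram matrix of plain inner products <x_i, x_j> (a linear kernel).
K = X @ X.T

# The learner sees only the inner products, not the original vectors.
clf = SVC(kernel='precomputed', C=1.0)
clf.fit(K, y)

# Prediction likewise needs only inner products between test and training points.
X_test = rng.randn(5, 2)
K_test = X_test @ X.T
print(clf.predict(K_test))
```

Replacing `X @ X.T` with any other kernel evaluation (polynomial, RBF, a string kernel, ...) is exactly the substitution the slide describes.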
What we have seen so far

Linear SVM: geometric margin vs. functional margin; problem formulation. Soft margin SVM: include the slack variables in the optimization; c controls the trade-off. Non-linear SVM using kernel functions.
Notes on Applying SVM

Many SVM implementations are available and can be found at www.kernel-machine.org/software.html. Handling a multiple-class problem with SVM requires transforming the multiclass problem into multiple binary-class problems: one-against-rest, pairwise, etc. (A one-against-rest sketch follows below.)
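For example, one-against-rest can be sketched as follows, assuming scikit-learn is available; the iris data set and the RBF kernel are illustrative choices, not prescribed by these notes.

```python
# Illustrative sketch: one-against-rest multiclass SVM.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One binary SVM per class: "class k" vs. "all other classes".
ovr = OneVsRestClassifier(SVC(kernel='rbf', C=1.0))
ovr.fit(X, y)
print(ovr.predict(X[:5]))
```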
Model selection for SVM

There are a number of model selection questions when applying SVM: Which kernel function to use? What c parameter to use (soft margin)? You can choose to use the default options provided by the software, but a more reliable approach is to use cross-validation, as sketched below.
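A minimal sketch of this, assuming scikit-learn's GridSearchCV and an illustrative grid of kernels and c values (the data set, scaling step, and grid are my choices, not the lecture's):

```python
# Illustrative sketch: choosing the kernel and soft-margin parameter C
# by cross-validation instead of relying on software defaults.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Candidate kernels and soft-margin parameters to compare.
param_grid = {
    'svc__kernel': ['linear', 'rbf'],
    'svc__C': [0.1, 1, 10, 100],
}

pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])

# 5-fold cross-validation over every (kernel, C) combination.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```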
Strengths vs. weaknesses

Strengths: The solution is globally optimal. It scales well with high dimensional data. It can handle non-traditional data like strings and trees, instead of the traditional fixed-length feature vectors. Why? Because as long as we can define a kernel function for such input, we can apply SVM.

Weaknesses: Need to specify a good kernel. Training time can be long if you use the wrong software package.
Ensemble Learning
Ensemble Learning

So far we have designed learning algorithms that take a training set and output a classifier. What if we want more accuracy than current algorithms afford? We could develop a new learning algorithm or improve existing algorithms. Another approach is to leverage the algorithms we have via ensemble methods: instead of calling an algorithm just once and using its classifier, call the algorithm multiple times and combine the multiple classifiers.
What is Ensemble Learning?

Traditional: a single training set S is given to one learning algorithm L1, which outputs one classifier h1; a new example (x, ?) is labeled y* = h1(x).

Ensemble method: S different training sets and/or learning algorithms L1, L2, ..., LS produce classifiers h1, h2, ..., hS, which are combined into h* = F(h1, h2, ..., hS); a new example (x, ?) is labeled y* = h*(x).
Ensemble Learning

INTUITION: Combining the predictions of multiple classifiers (an ensemble) is more accurate than a single classifier.

Justification: It is easy to find quite good rules of thumb, but hard to find a single highly accurate prediction rule. If the training set is small and the hypothesis space is large, then there may be many equally accurate classifiers. The hypothesis space may not contain the true function, but it has several good approximations. Exhaustive global search in the hypothesis space is expensive, so we can combine the predictions of several locally accurate classifiers.
How to generate an ensemble?

A variety of methods have been developed. We will look at two of them: Bagging and Boosting (Adaboost: adaptive boosting). Both of these methods take a single learning algorithm (we will call this the base learner) and use it multiple times to generate multiple classifiers.
Bagging: Bootstrap Aggregation (Breiman, 1996)

Generate a random sample from the training set S by a random re-sampling technique called bootstrapping. Repeat this sampling procedure, getting a sequence of T training sets: S1, S2, ..., ST. Learn a sequence of classifiers h1, h2, ..., hT, one for each of these training sets, using the same base learner. To classify an unknown point X, let each classifier predict, e.g. h1(X) = 1, h2(X) = 1, h3(X) = 0, ..., hT(X) = 1, and take a simple majority vote to make the final prediction: predict the class that gets the most votes from all the learned classifiers.
Bootstrapping

S' = {}
For i = 1, ..., N (N is the total number of points in S):
    draw a random point from S and add it to S'
End
Return S'

This is a sampling procedure that samples with replacement: each time a point is drawn, it is not removed, which means we can have multiple copies of the same data point in the sample. The new training set contains the same number of points as the original training set (but may contain repeats). On average, about 63.2% (roughly two thirds) of the original points will appear in a given random sample. (A Python sketch of bagging with bootstrap sampling is given below.)
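Below is a minimal Python sketch of bagging with bootstrap sampling and a simple majority vote, assuming scikit-learn decision trees as the base learner; the toy data set, T, and the random seeds are illustrative choices, not from the lecture.

```python
# Illustrative sketch: bagging (bootstrap aggregation) with decision trees.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=100, seed=0):
    """Learn T trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.RandomState(seed)
    n = len(X)
    classifiers = []
    for _ in range(T):
        idx = rng.randint(0, n, size=n)          # sample n points with replacement
        h = DecisionTreeClassifier().fit(X[idx], y[idx])
        classifiers.append(h)
    return classifiers

def bagging_predict(classifiers, X):
    """Simple majority vote over the learned classifiers."""
    votes = np.array([h.predict(X) for h in classifiers])    # shape (T, n_test)
    # For each test point, pick the class receiving the most votes.
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Usage on a toy problem with a circular decision boundary.
rng = np.random.RandomState(1)
X = rng.randn(200, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)
ensemble = bagging_fit(X, y, T=25)
print(bagging_predict(ensemble, X[:5]))
```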
The true decision boundary
Decision Boundary Learned by the CART Decision Tree Algorithm. Note that the decision tree has trouble representing this decision boundary.
By averaging 100 trees, we achieve a better approximation of the boundary, together with information regarding how confident we are about our prediction.
Empirical Results for Bagging Decision Trees (Freund & Schapire). Each point represents the results of one data set. Why can bagging improve the classification accuracy?
The Concept of Bias and Variance (figure: a target illustration, with labels for the target, bias, and variance)
Bias/Variance for classifiers

Bias arises when the classifier cannot represent the true function, that is, the classifier underfits the data. Variance arises when the classifier overfits the data: minor variations in the training set cause the classifier to overfit differently. Clearly you would like to have a low-bias and low-variance classifier! Typically, low-bias classifiers (overfitting) have high variance, and high-bias classifiers (underfitting) have low variance. We have a trade-off.
Effect of Algorithm Parameters on Bias and Variance

k-nearest neighbor: increasing k typically increases bias and reduces variance. Decision trees of depth D: increasing D typically increases variance and reduces bias. (A small experiment illustrating the k-nearest-neighbor case is sketched below.)
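One rough way to see the k-nearest-neighbor effect is to retrain on bootstrap resamples and measure how often the predictions on fixed test points flip. The sketch below (assuming scikit-learn; the data, test points, and values of k are illustrative) typically shows less disagreement, i.e. lower variance, for larger k.

```python
# Illustrative sketch: prediction variability of k-NN across bootstrap resamples.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = (X[:, 0] + 0.3 * rng.randn(300) > 0).astype(int)
X_test = rng.randn(50, 2)

for k in (1, 15):
    preds = []
    for _ in range(30):                          # 30 bootstrap resamples
        idx = rng.randint(0, len(X), size=len(X))
        h = KNeighborsClassifier(n_neighbors=k).fit(X[idx], y[idx])
        preds.append(h.predict(X_test))
    preds = np.array(preds)
    # Fraction of predictions that disagree with the majority prediction
    # for each test point -- a rough proxy for variance.
    majority = (preds.mean(axis=0) > 0.5).astype(int)
    disagreement = (preds != majority).mean()
    print(f"k={k:2d}  average disagreement={disagreement:.3f}")
```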
Why does bagging work?

Bagging takes the average of multiple models, which reduces the variance. This suggests that bagging works best with low-bias and high-variance classifiers.