Lecture 5,6 Linear Methods for Classification. Summary

Transcription

1 Lecture 5, 6: Linear Methods for Classification. Rice ELEC 697, Farnaz Koushanfar, Fall 2006.
Summary: Bayes classifiers. Linear classifiers: linear regression of an indicator matrix, linear discriminant analysis (LDA), logistic regression, separating hyperplanes. Reading: Ch. 4, ELS.

2 Bayes Classifier: The marginal distribution of G is specified as the PMF $p_G(g)$, $g = 1, 2, \dots, K$. $f_{X|G}(x \mid G=g)$ is the conditional distribution of X given G = g. The training set $(x_i, g_i)$, $i = 1, \dots, N$, consists of independent samples from the joint distribution $f_{X,G}(x, g) = p_G(g)\, f_{X|G}(x \mid G=g)$. The loss of predicting $G^*$ for G is $L(G^*, G)$. Classification goal: minimize the expected loss $E_{X,G}\, L(G(X), G) = E_X\big(E_{G|X}\, L(G(X), G)\big)$.
Bayes Classifier (cont'd): It suffices to minimize $E_{G|X}\, L(G(X), G)$ for each X. The optimal classifier is $G(x) = \arg\min_g E_{G|X=x}\, L(g, G)$. Under 0-1 loss, the Bayes rule is also known as the rule of maximum a posteriori probability: $G(x) = \arg\max_g \Pr(G=g \mid X=x)$. Many classification algorithms estimate $\Pr(G=g \mid X=x)$ and then apply the Bayes rule. [Figure: Bayes classification rule]
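
The maximum a posteriori rule above can be sketched for a case where the class densities are known. This is a minimal illustration, assuming univariate Gaussian class-conditional densities and 0-1 loss; the function names, the 0-based class indices, and the numbers are hypothetical and not from the lecture.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def bayes_classify(x, priors, means, stds):
    """Return (MAP class index, list of posteriors) for a scalar input x."""
    # Joint density p_G(g) * f_{X|G}(x | g) for every class g.
    joint = [p * gaussian_pdf(x, m, s) for p, m, s in zip(priors, means, stds)]
    total = sum(joint)
    posterior = [j / total for j in joint]        # Bayes theorem
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best, posterior

# Hypothetical two-class problem (values chosen only for illustration).
priors = [0.5, 0.5]
means = [0.0, 2.0]
stds = [1.0, 1.0]
label, post = bayes_classify(0.3, priors, means, stds)
```

With equal priors and equal variances the decision boundary sits midway between the two means, so inputs below 1.0 go to the first class.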

3 More About Linear Classification: Since the predictor G(x) takes values in a discrete set $\mathcal{G}$, we can divide the input space into a collection of regions labeled according to the classification. For K classes (1, 2, ..., K), the fitted linear model for the k-th indicator response variable is $\hat f_k(x) = \hat\beta_{k0} + \hat\beta_k^T x$. The decision boundary between classes k and l is $\hat f_k(x) = \hat f_l(x)$, an affine set or hyperplane: $\{x : (\hat\beta_{k0} - \hat\beta_{l0}) + (\hat\beta_k - \hat\beta_l)^T x = 0\}$. Alternatively, model a discriminant function $\delta_k(x)$ for each class, then classify x to the class with the largest value of $\delta_k(x)$.
Linear Decision Boundary: We require that some monotone transformation of $\delta_k$ or $\Pr(G=k \mid X=x)$ be linear. Decision boundaries are the set of points with log-odds = 0. Probability of class 1: $\pi$; probability of class 2: $1-\pi$. Apply a transformation: $\log[\pi/(1-\pi)] = \beta_0 + \beta^T x$. Two popular methods that use log-odds: linear discriminant analysis and linear logistic regression. Another approach is to explicitly model the boundary between two classes as linear. For a two-class problem with a p-dimensional input space, this is modeling the decision boundary as a hyperplane. Two methods using separating hyperplanes: the perceptron (Rosenblatt) and optimally separating hyperplanes (Vapnik).

4 Generalizing Linear Decision Boundaries: Expand the variable set $X_1, \dots, X_p$ by including squares and cross products, adding up to $p(p+1)/2$ additional variables.
Linear Regression of an Indicator Matrix: For K classes, define K indicators $Y_k$, $k = 1, \dots, K$, with $Y_k = 1$ if $G = k$, else 0. Stacked over the observations, these form the indicator response matrix.
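
The count of $p(p+1)/2$ additional variables can be checked with a small expansion routine. This is a sketch only; `quadratic_expand` is a hypothetical helper name, and the ordering of the added terms is an arbitrary choice.

```python
from itertools import combinations_with_replacement

def quadratic_expand(x):
    """Append all squares and pairwise cross products to the feature list x."""
    pairs = combinations_with_replacement(range(len(x)), 2)
    extra = [x[i] * x[j] for i, j in pairs]       # p(p+1)/2 new terms
    return list(x) + extra

x = [2.0, 3.0, 5.0]          # p = 3 original inputs
h = quadratic_expand(x)      # 3 original + 3*4/2 = 6 added features
```

A classifier that is linear in the expanded features has quadratic decision boundaries in the original inputs.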

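5 Linear Regression of an Indicator Matrix (Cont'd): For N training observations, form the $N \times K$ indicator response matrix Y, a matrix of 0's and 1's. The fitted values are $\hat Y = X(X^T X)^{-1} X^T Y$. A new observation is classified as follows: compute the fitted output (a K-vector) $\hat f(x) = [(1, x)\hat B]^T$, where $\hat B = (X^T X)^{-1} X^T Y$; then identify the largest component and classify accordingly: $\hat G(x) = \arg\max_{k \in \mathcal{G}} \hat f_k(x)$.
But how good is the fit? One can verify that $\sum_{k \in \mathcal{G}} \hat f_k(x) = 1$ for any x. However, $\hat f_k(x)$ can be negative or larger than 1. We can extend linear regression to a basis expansion h(x). As the size of the training set increases, adaptively add more basis functions.
Linear Regression - Drawback: For $K \ge 3$, and especially for large K, classes can be masked by others.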
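
The fit-then-argmax procedure above can be sketched with NumPy. This is a minimal illustration, assuming 0-based integer class labels and synthetic, well-separated data; the function names and the data are hypothetical.

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Least-squares fit of the N x K indicator matrix on (1, X)."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                  # Y[i, k] = 1 iff g[i] == k
    X1 = np.hstack([np.ones((N, 1)), X])      # prepend the intercept column
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return B                                  # shape (p + 1, K)

def predict_indicator_regression(B, X):
    """Return the fitted K-vectors and the index of the largest component."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    F = X1 @ B
    return F, F.argmax(axis=1)

# Synthetic three-class data in 2D (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
g = np.repeat(np.arange(3), 30)
X = centers[g] + 0.5 * rng.standard_normal((90, 2))
B = fit_indicator_regression(X, g, K=3)
F, g_hat = predict_indicator_regression(B, X)
```

Because the model includes an intercept, each row of `F` sums to 1, even though individual entries may fall outside [0, 1].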

6 Linear Regression - Drawback: For large K and small p, masking can naturally occur. E.g., vowel recognition data in a 2D subspace, K = 11, p = 10 dimensions.
Linear Regression and Projection*: A linear regression function (here in 2D) projects each point $x = [x_1\ x_2]^T$ onto a line parallel to $w_1$. We can study how well the projected points $\{z_1, z_2, \dots, z_n\}$, viewed as functions of $w_1$, are separated across the classes.
* Slides courtesy of Tommi S. Jaakkola, MIT CSAIL.

7 Projection and Classification: By varying $w_1$ we get different levels of separation between the projected points.

8 Optimizing the Projection: We would like to find the $w_1$ that somehow maximizes the separation of the projected points across classes. We can quantify the separation (overlap) in terms of the means and variances of the resulting 1-D class distributions.
Fisher Linear Discriminant: Preliminaries. Class descriptions in $\mathbb{R}^d$: class 0 has $n_0$ samples, mean $\mu_0$, covariance $\Sigma_0$; class 1 has $n_1$ samples, mean $\mu_1$, covariance $\Sigma_1$. Projected class descriptions in $\mathbb{R}$: class 0 has $n_0$ samples, mean $\mu_0^T w_1$, variance $w_1^T \Sigma_0 w_1$; class 1 has $n_1$ samples, mean $\mu_1^T w_1$, variance $w_1^T \Sigma_1 w_1$.
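
The projected class summaries above can be computed directly and combined into a separation score. The sketch below assumes the score is the unweighted Fisher ratio, $(\mu_1^T w - \mu_0^T w)^2 / (w^T \Sigma_0 w + w^T \Sigma_1 w)$; weighting the two variances by class size is a common variant, and the transcript does not show which form the slides use. Function names and data are hypothetical.

```python
import numpy as np

def projected_stats(X, w):
    """Mean and variance of the 1-D projections z = X @ w."""
    z = X @ w
    return z.mean(), z.var()

def fisher_ratio(X0, X1, w):
    """Squared gap between projected means over summed projected variances."""
    m0, v0 = projected_stats(X0, w)
    m1, v1 = projected_stats(X1, w)
    return (m1 - m0) ** 2 / (v0 + v1)

def fisher_direction(X0, X1):
    """Unit vector proportional to (S0 + S1)^{-1} (mu1 - mu0)."""
    S0 = np.cov(X0, rowvar=False, bias=True)   # same normalization as z.var()
    S1 = np.cov(X1, rowvar=False, bias=True)
    w = np.linalg.solve(S0 + S1, X1.mean(axis=0) - X0.mean(axis=0))
    return w / np.linalg.norm(w)

# Synthetic two-class data with correlated features (illustrative only).
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.6], [0.0, 0.8]])
X0 = rng.standard_normal((50, 2)) @ A
X1 = rng.standard_normal((50, 2)) @ A + np.array([2.0, 1.0])
w_star = fisher_direction(X0, X1)
best = fisher_ratio(X0, X1, w_star)
```

Any other direction gives a ratio no larger than `best`, which is what "optimizing the projection" means under this score.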

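9 Fisher Linear Discriminant: Estimation criterion: find the $w_1$ that maximizes the Fisher criterion, the squared separation of the projected class means relative to the sum of the projected within-class variances. The solution (class separation) is decision-theoretically optimal for two normal populations with equal covariances ($\Sigma_1 = \Sigma_0$).
Linear Discriminant Analysis (LDA): $\pi_k$ is the class prior $\Pr(G=k)$, and $f_k(x)$ is the density of X in class $G = k$. Bayes theorem: $\Pr(G=k \mid X=x) = \dfrac{f_k(x)\,\pi_k}{\sum_{l=1}^{K} f_l(x)\,\pi_l}$. This leads to LDA, QDA, MDA (mixture DA), kernel DA, and naive Bayes. Suppose that we model each class density as a multivariate Gaussian (MVG): $f_k(x) = \dfrac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} \exp\!\big(-\tfrac12 (x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)\big)$. LDA is the case where we assume the classes have a common covariance matrix: $\Sigma_k = \Sigma$ for all k. It is then sufficient to look at the log-odds.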

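10 LDA: The log-odds function $\log\dfrac{\Pr(G=k \mid X=x)}{\Pr(G=l \mid X=x)} = \log\dfrac{\pi_k}{\pi_l} - \tfrac12 (\mu_k+\mu_l)^T \Sigma^{-1} (\mu_k-\mu_l) + x^T \Sigma^{-1} (\mu_k-\mu_l)$ implies that the decision boundary between classes k and l, the set where $\Pr(G=k \mid X=x) = \Pr(G=l \mid X=x)$, is linear in x; in p dimensions it is a hyperplane.
[Figure: LDA example with three classes and p = 2]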

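11 LDA (Cont'd): In practice, we do not know the parameters of the Gaussian distributions. Estimate them from the training set, where $N_k$ is the number of class-k observations: $\hat\pi_k = N_k / N$; $\hat\mu_k = \sum_{g_i = k} x_i / N_k$; $\hat\Sigma = \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K)$. For two classes, LDA is closely related to linear regression: the least-squares coefficient vector is proportional to the LDA direction.
QDA: If the $\Sigma_k$ are not equal, the quadratic terms in x remain, and we get quadratic discriminant functions (QDA): $\delta_k(x) = -\tfrac12 \log|\Sigma_k| - \tfrac12 (x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) + \log \pi_k$.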
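
The plug-in estimates for LDA can be sketched as follows. The prediction step assumes the standard linear discriminant function $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \tfrac12 \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$, which the transcript does not show explicitly; function names, 0-based labels, and the synthetic data are hypothetical.

```python
import numpy as np

def lda_fit(X, g, K):
    """Estimate class priors, class means, and the pooled covariance."""
    N = X.shape[0]
    priors = np.array([(g == k).mean() for k in range(K)])          # N_k / N
    means = np.array([X[g == k].mean(axis=0) for k in range(K)])
    R = X - means[g]                       # residuals from each point's class mean
    Sigma = R.T @ R / (N - K)              # pooled within-class covariance
    return priors, means, Sigma

def lda_predict(X, priors, means, Sigma):
    """Classify rows of X to the class with the largest linear discriminant."""
    A = np.linalg.solve(Sigma, means.T)    # column k is Sigma^{-1} mu_k
    delta = X @ A - 0.5 * np.sum(means.T * A, axis=0) + np.log(priors)
    return delta.argmax(axis=1)

# Synthetic three-class data in 2D (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
g = np.repeat(np.arange(3), 30)
X = centers[g] + 0.5 * rng.standard_normal((90, 2))
priors, means, Sigma = lda_fit(X, g, K=3)
g_hat = lda_predict(X, priors, means, Sigma)
```

Solving against `Sigma` rather than inverting it explicitly is a numerical convenience, not something the slides prescribe.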

12 QDA (Cont'd): The estimates are similar to those for LDA, but each class has a separate covariance matrix. For large p this means a dramatic increase in parameters: LDA has $(K-1)(p+1)$ parameters, while QDA has $(K-1)\{p(p+3)/2 + 1\}$. LDA and QDA both work really well. This is not because the data are Gaussian; rather, for simple decision boundaries, Gaussian estimates are stable (bias-variance trade-off).
Regularized Discriminant Analysis: A compromise between LDA and QDA. Shrink the separate covariances of QDA towards a common covariance (similar to ridge regression).
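
The parameter counts and the shrinkage idea can be made concrete with a short sketch. The covariance formula assumes the convex-combination form $\hat\Sigma_k(\alpha) = \alpha \hat\Sigma_k + (1-\alpha)\hat\Sigma$ with $\alpha \in [0, 1]$; the transcript does not show the slide's equation, so treat that form and the function names as assumptions.

```python
import numpy as np

def lda_param_count(K, p):
    """Number of parameters in the LDA discriminant differences."""
    return (K - 1) * (p + 1)

def qda_param_count(K, p):
    """Number of parameters in the QDA discriminant differences."""
    return (K - 1) * (p * (p + 3) // 2 + 1)

def rda_covariance(Sigma_k, Sigma_pooled, alpha):
    """Shrink a class covariance towards the pooled one (assumed form)."""
    return alpha * np.asarray(Sigma_k) + (1.0 - alpha) * np.asarray(Sigma_pooled)

# Vowel-data dimensions quoted earlier in the lecture: K = 11, p = 10.
n_lda = lda_param_count(11, 10)
n_qda = qda_param_count(11, 10)
```

Setting `alpha = 1` recovers QDA's separate covariances and `alpha = 0` recovers LDA's common covariance.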

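13 [Figure: Example - RDA]
Computations for LDA: Suppose we compute the eigendecomposition of each $\hat\Sigma_k$, i.e. $\hat\Sigma_k = U_k D_k U_k^T$, where $U_k$ is a $p \times p$ orthonormal matrix and $D_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$. Then $(x - \hat\mu_k)^T \hat\Sigma_k^{-1} (x - \hat\mu_k) = [U_k^T (x - \hat\mu_k)]^T D_k^{-1} [U_k^T (x - \hat\mu_k)]$ and $\log|\hat\Sigma_k| = \sum_l \log d_{kl}$. The LDA classifier is implemented as follows: sphere the data, $X^* \leftarrow D^{-1/2} U^T X$, where $\hat\Sigma = U D U^T$, so that the common covariance estimate of $X^*$ is the identity; then classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities $\pi_k$.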
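
The sphering computation can be sketched with an eigendecomposition. This is a minimal illustration on synthetic two-class data; the function names are hypothetical, and "modulo the effect of the priors" is implemented here as adding $\log \pi_k$ to $-\tfrac12$ times the squared distance, which is an assumption about how the adjustment is meant.

```python
import numpy as np

def sphere(X, Sigma):
    """Map each row x to D^{-1/2} U^T x, where Sigma = U D U^T."""
    d, U = np.linalg.eigh(Sigma)           # eigenvalues d, orthonormal columns U
    return (X @ U) / np.sqrt(d)

def nearest_centroid(Xs, Ms, priors):
    """Closest sphered centroid, adjusted by the log prior of each class."""
    d2 = ((Xs[:, None, :] - Ms[None, :, :]) ** 2).sum(axis=2)
    return (-0.5 * d2 + np.log(priors)).argmax(axis=1)

# Synthetic two-class data with a shared covariance (illustrative only).
rng = np.random.default_rng(0)
g = np.repeat(np.arange(2), 40)
true_means = np.array([[0.0, 0.0], [3.0, 1.0]])
L = np.array([[1.0, 0.0], [0.8, 0.5]])
X = true_means[g] + rng.standard_normal((80, 2)) @ L.T

means = np.array([X[g == k].mean(axis=0) for k in range(2)])
R = X - means[g]
Sigma = R.T @ R / (len(X) - 2)             # pooled covariance estimate
Rs = sphere(R, Sigma)                      # sphered residuals
g_hat = nearest_centroid(sphere(X, Sigma), sphere(means, Sigma),
                         np.array([0.5, 0.5]))
```

After sphering, the squared Euclidean distance equals the Mahalanobis distance under the pooled covariance, which is why the nearest-centroid rule reproduces LDA.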

14 Background: Simple Decision Theory*: Suppose we know the class-conditional densities $p(x \mid y)$ for y = 0, 1 as well as the overall class frequencies $P(y)$. How do we decide which class a new example x belongs to so as to minimize the overall probability of error?
* Courtesy of Tommi S. Jaakkola, MIT CSAIL.

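15 2-Class Logistic Regression: The optimal decisions are based on the posterior class probabilities $P(y \mid x)$. For binary classification problems, we can write these decisions as: choose y = 1 if $\log\dfrac{P(y=1 \mid x)}{P(y=0 \mid x)} > 0$, and y = 0 otherwise. We generally don't know $P(y \mid x)$, but we can parameterize the possible decisions according to $\log\dfrac{P(y=1 \mid x)}{P(y=0 \mid x)} = f(x; w) = w_0 + x^T w_1$.
2-Class Logistic Regression (Cont'd): Our log-odds model $\log\dfrac{P(y=1 \mid x)}{P(y=0 \mid x)} = w_0 + x^T w_1$ gives rise to a specific form for the conditional probability over the labels (the logistic model): $P(y=1 \mid x, w) = g(w_0 + x^T w_1)$, where $g(z) = (1 + e^{-z})^{-1}$ is a logistic squashing function that turns linear predictions into probabilities.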
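
The logistic squashing function and its link to the log-odds can be sketched in a few lines. The names `w0` and `w1` for the intercept and weight vector are assumed here for illustration, as are the example numbers.

```python
import math

def logistic(z):
    """Numerically stable g(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)                       # avoids overflow for very negative z
    return e / (1.0 + e)

def predict_proba(x, w0, w1):
    """P(y = 1 | x) for the linear log-odds w0 + x . w1."""
    return logistic(w0 + sum(xi * wi for xi, wi in zip(x, w1)))

# Hypothetical parameters: log-odds = 0.5 + 1*1 + (-1)*2 = -0.5.
p = predict_proba([1.0, 2.0], 0.5, [1.0, -1.0])
log_odds = math.log(p / (1.0 - p))        # recovers the linear predictor
```

Predicting y = 1 exactly when `p > 0.5` is the same as predicting y = 1 when the log-odds are positive.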

16 2-Class Logistic Regression: Decisions. Logistic regression models imply a linear decision boundary, the set of points where the log-odds are zero.
K-Class Logistic Regression: The model is specified in terms of K-1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one). The choice of denominator is arbitrary; typically the last class is used:
$\log\dfrac{\Pr(G=1 \mid X=x)}{\Pr(G=K \mid X=x)} = \beta_{10} + \beta_1^T x$
$\log\dfrac{\Pr(G=2 \mid X=x)}{\Pr(G=K \mid X=x)} = \beta_{20} + \beta_2^T x$
...
$\log\dfrac{\Pr(G=K-1 \mid X=x)}{\Pr(G=K \mid X=x)} = \beta_{(K-1)0} + \beta_{K-1}^T x$
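
The K-1 logits determine all K class probabilities once the reference class is given a logit of zero. A minimal sketch, assuming the last class is the reference; the function name and parameter values are hypothetical.

```python
import math

def class_probabilities(x, betas):
    """betas: list of K-1 pairs (beta_k0, beta_k). Returns K probabilities."""
    eta = [b0 + sum(bj * xj for bj, xj in zip(b, x)) for b0, b in betas]
    eta.append(0.0)                       # reference class K has logit 0
    m = max(eta)                          # subtract the max for stability
    e = [math.exp(v - m) for v in eta]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical three-class model with two inputs.
x = [1.0, 2.0]
betas = [(0.5, [1.0, -1.0]),              # logit of class 1 vs 3: -0.5
         (-1.0, [0.0, 0.5])]              # logit of class 2 vs 3:  0.0
probs = class_probabilities(x, betas)
```

Taking the log of the ratio of any probability to the last one returns the corresponding linear predictor, so the probabilities sum to one by construction.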

17 K-Class Logistic Regression (Cont'd): A simple calculation shows that
$\Pr(G=k \mid X=x) = \dfrac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}$, $k = 1, \dots, K-1$,
$\Pr(G=K \mid X=x) = \dfrac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}$.
To emphasize the dependence on the entire parameter set $\theta = \{\beta_{10}, \beta_1^T, \dots, \beta_{(K-1)0}, \beta_{K-1}^T\}$, we denote the probabilities as $\Pr(G=k \mid X=x) = p_k(x; \theta)$.
Fitting Logistic Regression Models: $\operatorname{logit} P(x) = \log\dfrac{P(x)}{1 - P(x)} = \eta(x) = \beta^T x$. The log-likelihood is $\ell(\beta) = \sum_{i=1}^{N} \{y_i \log p_i + (1 - y_i)\log(1 - p_i)\} = \sum_{i=1}^{N} \{y_i \beta^T x_i - \log(1 + e^{\beta^T x_i})\}$.
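
The two forms of the binary log-likelihood can be checked numerically against each other. This is a small verification sketch with made-up data; the first column of each input is a constant 1 so that $\beta^T x$ includes the intercept.

```python
import math

def linear_predictor(beta, x):
    return sum(b * xj for b, xj in zip(beta, x))

def loglik_from_probs(beta, xs, ys):
    """Sum of y log p + (1 - y) log(1 - p) with p = logistic(beta . x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-linear_predictor(beta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def loglik_closed_form(beta, xs, ys):
    """Sum of y * (beta . x) - log(1 + exp(beta . x))."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = linear_predictor(beta, x)
        total += y * eta - math.log1p(math.exp(eta))
    return total

# Made-up observations (illustrative only).
xs = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]]
ys = [1, 0, 1]
beta = [0.1, 0.8]
a = loglik_from_probs(beta, xs, ys)
b = loglik_closed_form(beta, xs, ys)
```

The second form follows from substituting $\log p_i = \eta_i - \log(1 + e^{\eta_i})$ and $\log(1 - p_i) = -\log(1 + e^{\eta_i})$ into the first.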

18 Fitting Logistic Regression Models: IRLS (iteratively reweighted least squares) is equivalent to the Newton-Raphson procedure applied to the log-likelihood $\ell(\beta) = \sum_{i=1}^{N} \{y_i \beta^T x_i - \log(1 + e^{\beta^T x_i})\}$.
IRLS algorithm: (1) Initialize $\beta$. (2) Form the linearized response $z_i = x_i^T \beta + \dfrac{y_i - p_i}{p_i(1 - p_i)}$. (3) Form the weights $w_i = p_i(1 - p_i)$. (4) Update $\beta$ by weighted least squares of $z_i$ on $x_i$ with weights $w_i$. Steps 2-4 are repeated until convergence.
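
The four IRLS steps can be sketched with NumPy. This is a bare-bones illustration with no safeguards: it assumes the classes overlap (so the weights stay away from zero) and uses a small made-up data set; the function name, stopping tolerance, and iteration cap are arbitrary choices.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-10):
    """Fit logistic regression by IRLS. X must include a column of ones."""
    beta = np.zeros(X.shape[1])                       # step 1: initialize
    for _ in range(n_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))
        w = p * (1.0 - p)                             # step 3: weights
        z = eta + (y - p) / w                         # step 2: linearized response
        XtW = X.T * w                                 # X^T W without forming W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)  # step 4: weighted LS
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Made-up, non-separable data (illustrative only).
x = np.array([-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])
beta_hat = irls_logistic(X, y)
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
score = X.T @ (y - p_hat)                             # gradient of the log-likelihood
```

At convergence the score vector is numerically zero, which is the first-order condition that Newton-Raphson solves.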

19 Example: Logistic Regression. South African Heart Disease: Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas. Subjects: white males between 15 and 64. Response: presence or absence of myocardial infarction.
[Table: maximum likelihood fit]
[Figure: South African heart disease data]

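20 Logistic Regression or LDA? LDA: $\log\dfrac{\Pr(G=k \mid X=x)}{\Pr(G=K \mid X=x)} = \log\dfrac{\pi_k}{\pi_K} - \tfrac12 (\mu_k+\mu_K)^T \Sigma^{-1} (\mu_k-\mu_K) + x^T \Sigma^{-1} (\mu_k-\mu_K) = \alpha_{k0} + \alpha_k^T x$. This linearity is a consequence of the Gaussian assumption for the class densities, as well as the assumption of a common covariance matrix. Logistic model: $\log\dfrac{\Pr(G=k \mid X=x)}{\Pr(G=K \mid X=x)} = \beta_{k0} + \beta_k^T x$. The two models use the same form for the logit function.
Logistic Regression or LDA? Discriminative vs. informative learning: logistic regression uses the conditional distribution of Y given x to estimate the parameters, while LDA uses the full joint distribution (assuming normality). If normality holds, LDA is up to 30% more efficient; otherwise logistic regression can be more robust. In practice, however, the two methods give similar results.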

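21 Separating Hyperplanes: [Figure: separating hyperplanes]
Perceptrons: compute a linear combination of the input features and return the sign. Let L be the hyperplane $\{x : f(x) = \beta_0 + \beta^T x = 0\}$. For $x_1, x_2$ in L, $\beta^T (x_1 - x_2) = 0$, so $\beta^* = \beta / \|\beta\|$ is the unit normal to the surface L. For $x_0$ in L, $\beta^T x_0 = -\beta_0$. The signed distance of any point x to L is given by $\beta^{*T}(x - x_0) = \dfrac{1}{\|\beta\|}(\beta^T x + \beta_0) = \dfrac{1}{\|f'(x)\|} f(x)$.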
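
The signed-distance formula can be checked on a small example. A minimal sketch; the hyperplane coefficients below are arbitrary numbers chosen so the arithmetic is easy to follow.

```python
import math

def signed_distance(x, beta, beta0):
    """Signed distance from x to the hyperplane beta . x + beta0 = 0."""
    norm = math.sqrt(sum(b * b for b in beta))
    return (sum(b * xi for b, xi in zip(beta, x)) + beta0) / norm

# Hyperplane 3*x1 + 4*x2 - 5 = 0, so ||beta|| = 5.
beta, beta0 = [3.0, 4.0], -5.0
d_pos = signed_distance([3.0, 4.0], beta, beta0)    # (9 + 16 - 5) / 5
d_neg = signed_distance([0.0, 0.0], beta, beta0)    # (0 + 0 - 5) / 5
d_on = signed_distance([3.0, -1.0], beta, beta0)    # (9 - 4 - 5) / 5
```

The sign tells which side of the hyperplane a point lies on, which is exactly what a perceptron returns as its prediction.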

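22 Rosenblatt's Perceptron Learning Algorithm: Finds a separating hyperplane by minimizing the distance of misclassified points to the decision boundary. If a response $y_i = 1$ is misclassified, then $x_i^T \beta + \beta_0 < 0$, and the opposite holds for a misclassified point with $y_i = -1$. The goal is to minimize $D(\beta, \beta_0) = -\sum_{i \in \mathcal{M}} y_i (x_i^T \beta + \beta_0)$, where $\mathcal{M}$ indexes the misclassified points.
Rosenblatt's Perceptron Learning Algorithm (Cont'd): Stochastic gradient descent: the misclassified observations are visited in some sequence and the parameters are updated via $\beta \leftarrow \beta + \rho\, y_i x_i$, $\beta_0 \leftarrow \beta_0 + \rho\, y_i$. Here $\rho$ is the learning rate, which can be taken to be 1 without loss of generality. It can be shown that, when the classes are linearly separable, the algorithm converges to a separating hyperplane in a finite number of steps.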
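
The update rule can be sketched as a short loop over the training points. This is a minimal illustration, assuming labels in {-1, +1}, a fixed visiting order, and a cap on the number of passes (the cap is a practical safeguard added here, since the loop would not terminate on non-separable data). The data are made up.

```python
def perceptron(xs, ys, rho=1.0, max_epochs=1000):
    """Rosenblatt-style updates. Returns (beta, beta0, converged)."""
    beta = [0.0] * len(xs[0])
    beta0 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            f = sum(b * xi for b, xi in zip(beta, x)) + beta0
            if y * f <= 0:                # misclassified, or exactly on the boundary
                beta = [b + rho * y * xi for b, xi in zip(beta, x)]
                beta0 += rho * y
                mistakes += 1
        if mistakes == 0:                 # a full pass with no updates
            return beta, beta0, True
    return beta, beta0, False

# Made-up, linearly separable data (illustrative only).
xs = [[2.0, 2.0], [3.0, 1.0], [2.5, 3.0], [-1.0, -1.0], [-2.0, 0.0], [-1.5, -2.5]]
ys = [1, 1, 1, -1, -1, -1]
beta, beta0, converged = perceptron(xs, ys)
margins = [y * (sum(b * xi for b, xi in zip(beta, x)) + beta0)
           for x, y in zip(xs, ys)]
```

The hyperplane found depends on the starting values and the visiting order, so it is one of many separating hyperplanes rather than a unique solution.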

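23 Optimal Separating Hyperplanes. Problem: find the separating hyperplane that maximizes the margin M to the closest point from either class: $\max_{\beta, \beta_0, \|\beta\| = 1} M$ subject to $y_i (x_i^T \beta + \beta_0) \ge M$, $i = 1, \dots, N$.
[Figure: Example - Optimal Separating Hyperplanes]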