L10: Linear discriminant analysis
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU



L10: Linear discriminant analysis
- Linear discriminant analysis, two classes
- Linear discriminant analysis, C classes
- LDA vs. PCA
- Limitations of LDA
- Variants of LDA
- Other dimensionality reduction methods

Linear discriminant analysis, two classes

Objective
- LDA seeks to reduce dimensionality while preserving as much of the class discriminatory information as possible
- Assume we have a set of D-dimensional samples $x^{(1)}, x^{(2)}, \ldots, x^{(N)}$, $N_1$ of which belong to class $\omega_1$ and $N_2$ to class $\omega_2$
- We seek to obtain a scalar $y$ by projecting the samples $x$ onto a line $y = w^T x$
- Of all the possible lines we would like to select the one that maximizes the separability of the scalars

In order to find a good projection vector, we need to define a measure of separation
- The mean vector of each class in $x$-space and $y$-space is
$$\mu_i = \frac{1}{N_i}\sum_{x \in \omega_i} x \qquad\text{and}\qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y \in \omega_i} y = \frac{1}{N_i}\sum_{x \in \omega_i} w^T x = w^T \mu_i$$
- We could then choose the distance between the projected means as our objective function
$$J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|$$
- However, the distance between projected means is not a good measure since it does not account for the standard deviation within classes
[Figure: two candidate axes; one axis has a larger distance between means, but the other axis yields better class separability]

Fisher's solution
- Fisher suggested maximizing the difference between the means, normalized by a measure of the within-class scatter
- For each class we define the scatter, an equivalent of the variance, as
$$\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2$$
where the quantity $(\tilde{s}_1^2 + \tilde{s}_2^2)$ is called the within-class scatter of the projected examples
- The Fisher linear discriminant is defined as the linear function $w^T x$ that maximizes the criterion function
$$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
- Therefore, we are looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible

To find the optimum $w^*$, we must express $J(w)$ as a function of $w$
- First, we define a measure of the scatter in feature space $x$:
$$S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T \qquad S_1 + S_2 = S_W$$
where $S_W$ is called the within-class scatter matrix
- The scatter of the projection $y$ can then be expressed as a function of the scatter matrix in feature space $x$:
$$\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2 = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2 = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w = w^T S_i w$$
$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$
- Similarly, the difference between the projected means can be expressed in terms of the means in the original feature space:
$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2 = w^T (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = w^T S_B w$$
The matrix $S_B$ is called the between-class scatter. Note that, since $S_B$ is the outer product of two vectors, its rank is at most one
- We can finally express the Fisher criterion in terms of $S_W$ and $S_B$ as
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

To find the maximum of $J(w)$ we differentiate and equate to zero:
$$\frac{d}{dw} J(w) = \frac{d}{dw}\left[\frac{w^T S_B w}{w^T S_W w}\right] = 0$$
$$\Rightarrow \left[w^T S_W w\right] \frac{d(w^T S_B w)}{dw} - \left[w^T S_B w\right] \frac{d(w^T S_W w)}{dw} = 0$$
$$\Rightarrow \left[w^T S_W w\right] 2 S_B w - \left[w^T S_B w\right] 2 S_W w = 0$$
- Dividing by $w^T S_W w$:
$$\left[\frac{w^T S_W w}{w^T S_W w}\right] S_B w - \left[\frac{w^T S_B w}{w^T S_W w}\right] S_W w = 0 \;\Rightarrow\; S_B w - J\, S_W w = 0 \;\Rightarrow\; S_W^{-1} S_B w - J\, w = 0$$
- Solving the generalized eigenvalue problem ($S_W^{-1} S_B w = J w$) yields
$$w^* = \arg\max_w \frac{w^T S_B w}{w^T S_W w} = S_W^{-1}(\mu_1 - \mu_2)$$
- This is known as Fisher's linear discriminant (1936), although it is not a discriminant but rather a specific choice of direction for the projection of the data down to one dimension
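
As a quick illustration, here is a minimal NumPy sketch of this closed-form two-class solution (the function name and the unit-norm normalization are choices made here for illustration, not something the slides prescribe):

```python
import numpy as np

def fisher_lda_2class(X1, X2):
    """Fisher's linear discriminant w* = S_W^{-1}(mu_1 - mu_2) for two
    classes given as (N_i x D) sample matrices."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: S_i = sum_{x in w_i} (x - mu_i)(x - mu_i)^T
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    SW = S1 + S2
    # Solve S_W w = (mu_1 - mu_2) rather than forming the inverse explicitly
    w = np.linalg.solve(SW, mu1 - mu2)
    return w / np.linalg.norm(w)  # normalized projection direction
```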

Example
- Compute the LDA projection for the following 2D dataset:
$$X_1 = \{(4,1), (2,4), (2,3), (3,6), (4,4)\} \qquad X_2 = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$$
- SOLUTION (by hand). The class statistics are:
$$S_1 = \begin{bmatrix} 0.8 & -0.4 \\ -0.4 & 2.64 \end{bmatrix} \qquad S_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}$$
$$\mu_1 = [3.0 \;\; 3.6]^T \qquad \mu_2 = [8.4 \;\; 7.6]^T$$
- The within- and between-class scatter are:
$$S_B = \begin{bmatrix} 29.16 & 21.6 \\ 21.6 & 16.0 \end{bmatrix} \qquad S_W = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}$$
- The LDA projection is then obtained as the solution of the generalized eigenvalue problem $S_W^{-1} S_B v = \lambda v$:
$$\left| S_W^{-1} S_B - \lambda I \right| = \begin{vmatrix} 11.89 - \lambda & 8.81 \\ 5.08 & 3.76 - \lambda \end{vmatrix} = 0 \;\Rightarrow\; \lambda = 15.65$$
$$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix} \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = 15.65 \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \;\Rightarrow\; v = [0.92 \;\; 0.39]^T$$
- Or directly by $w^* = S_W^{-1}(\mu_1 - \mu_2)$
[Figure: scatter plot of the two classes in the $(x_1, x_2)$ plane with the LDA direction $w_{LDA}$]
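
The example can be checked numerically; the short script below (a sketch using the formulas from the previous slides) reproduces the class statistics and the LDA direction. Note that the slide reports the class scatters scaled by $1/N_i$; the scaling cancels in $S_W^{-1} S_B$, so the direction is unchanged:

```python
import numpy as np

X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)  # [3.0, 3.6] and [8.4, 7.6]

# Class scatters, scaled by 1/N_i as on the slide
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)     # [[0.8, -0.4], [-0.4, 2.64]]
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)     # [[1.84, -0.04], [-0.04, 2.64]]
SW = S1 + S2                                 # [[2.64, -0.44], [-0.44, 5.28]]
SB = np.outer(mu1 - mu2, mu1 - mu2)          # [[29.16, 21.6], [21.6, 16.0]]

# Generalized eigenvalue problem S_W^{-1} S_B v = lambda v
evals, evecs = np.linalg.eig(np.linalg.solve(SW, SB))
v = evecs[:, np.argmax(evals)]               # ~[0.92, 0.39] (up to sign)
lam = evals.max()                            # ~15.65

# Direct solution w* = S_W^{-1}(mu1 - mu2) gives the same direction
w = np.linalg.solve(SW, mu1 - mu2)
```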

LDA, C classes
- Fisher's LDA generalizes gracefully for C-class problems
- Instead of one projection $y$, we will now seek $(C-1)$ projections $[y_1, y_2, \ldots, y_{C-1}]$ by means of $(C-1)$ projection vectors $w_i$ arranged by columns into a projection matrix $W = [w_1 | w_2 | \ldots | w_{C-1}]$:
$$y_i = w_i^T x \;\Rightarrow\; y = W^T x$$
Derivation
- The within-class scatter generalizes as
$$S_W = \sum_{i=1}^{C} S_i \qquad\text{where } S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T \text{ and } \mu_i = \frac{1}{N_i}\sum_{x \in \omega_i} x$$
- And the between-class scatter becomes
$$S_B = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T \qquad\text{where } \mu = \frac{1}{N}\sum_{\forall x} x = \frac{1}{N}\sum_{i=1}^{C} N_i \mu_i$$
- Matrix $S_T = S_B + S_W$ is called the total scatter
[Figure: three-class example showing the per-class within-class scatters $S_{W_i}$ and between-class scatters $S_{B_i}$ in the $(x_1, x_2)$ plane]

Similarly, we define the mean vector and scatter matrices for the projected samples as
$$\tilde{\mu}_i = \frac{1}{N_i}\sum_{y \in \omega_i} y \qquad \tilde{\mu} = \frac{1}{N}\sum_{\forall y} y$$
$$\tilde{S}_W = \sum_{i=1}^{C} \sum_{y \in \omega_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T \qquad \tilde{S}_B = \sum_{i=1}^{C} N_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$$
- From our derivation for the two-class problem, we can write
$$\tilde{S}_W = W^T S_W W \qquad \tilde{S}_B = W^T S_B W$$
- Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter. Since the projection is no longer a scalar (it has $C-1$ dimensions), we use the determinant of the scatter matrices to obtain a scalar objective function:
$$J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$$
- And we will seek the projection matrix $W^*$ that maximizes this ratio

It can be shown that the optimal projection matrix $W^*$ is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem:
$$W^* = [w_1^* | w_2^* | \ldots | w_{C-1}^*] = \arg\max_W \frac{|W^T S_B W|}{|W^T S_W W|} \;\Rightarrow\; (S_B - \lambda_i S_W)\, w_i^* = 0$$
NOTES
- $S_B$ is the sum of $C$ matrices of rank one or less, and the mean vectors are constrained by $\frac{1}{C}\sum_{i=1}^{C} \mu_i = \mu$. Therefore, $S_B$ will be of rank $(C-1)$ or less, which means that only $(C-1)$ of the eigenvalues $\lambda_i$ will be non-zero
- The projections with maximum class separability information are the eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$
- LDA can be derived as the Maximum Likelihood method for the case of normal class-conditional densities with equal covariance matrices
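
A hedged NumPy/SciPy sketch of the multi-class solution (the function name and the use of scipy.linalg.eigh for the symmetric-definite generalized problem are implementation choices made here; it assumes $S_W$ is non-singular, which in practice may call for regularization or a preliminary PCA step):

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, n_components=None):
    """Columns of the returned W are the eigenvectors of the generalized
    problem S_B w = lambda S_W w with the largest eigenvalues."""
    classes = np.unique(y)
    N, D = X.shape
    mu = X.mean(axis=0)                      # overall mean
    SW, SB = np.zeros((D, D)), np.zeros((D, D))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        SW += (Xc - mu_c).T @ (Xc - mu_c)    # within-class scatter
        SB += len(Xc) * np.outer(mu_c - mu, mu_c - mu)  # between-class
    evals, evecs = eigh(SB, SW)              # generalized symmetric solver
    order = np.argsort(evals)[::-1]          # largest eigenvalues first
    k = n_components or (len(classes) - 1)   # at most C-1 non-zero eigenvalues
    return evecs[:, order[:k]]               # projection matrix W (D x k)
```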

LDA vs. PCA
- This example illustrates the performance of PCA and LDA on an odor recognition problem
- Five types of coffee beans were presented to an array of gas sensors
- For each coffee type, a number of sniffs were performed and the response of the gas sensor array was processed in order to obtain a 60-dimensional feature vector
Results
- From the 3D scatter plots it is clear that LDA outperforms PCA in terms of class discrimination
- This is one example where the discriminatory information is not aligned with the direction of maximum variance
[Figure: normalized sensor responses and 3D scatter plots of the PCA and LDA projections for the five coffee types: Sulawesy, Kenya, Arabian, Sumatra, Colombia]
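
The coffee data themselves are not reproduced here, but the qualitative point is easy to demonstrate on synthetic data (everything below is invented for illustration): two Gaussian classes whose means differ only along a low-variance axis. PCA picks the high-variance but uninformative direction, while LDA picks the discriminative one:

```python
import numpy as np

rng = np.random.default_rng(0)
# Elongated classes: x1 has large variance but no class information;
# the class means differ only along x2.
cov = np.array([[9.0, 0.0], [0.0, 0.25]])
X1 = rng.multivariate_normal([0.0, -1.0], cov, size=200)
X2 = rng.multivariate_normal([0.0, +1.0], cov, size=200)
X = np.vstack([X1, X2])

# PCA: leading eigenvector of the total covariance -> roughly [1, 0]
evals, evecs = np.linalg.eigh(np.cov(X.T))
w_pca = evecs[:, np.argmax(evals)]

# LDA: w* = S_W^{-1}(mu1 - mu2) -> roughly [0, 1]
SW = ((X1 - X1.mean(0)).T @ (X1 - X1.mean(0))
      + (X2 - X2.mean(0)).T @ (X2 - X2.mean(0)))
w_lda = np.linalg.solve(SW, X1.mean(0) - X2.mean(0))
w_lda /= np.linalg.norm(w_lda)

print("PCA direction:", w_pca)   # aligned with maximum variance
print("LDA direction:", w_lda)   # aligned with class discrimination
```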

Limitations of LDA
- LDA produces at most $(C-1)$ feature projections. If the classification error estimates establish that more features are needed, some other method must be employed to provide those additional features
- LDA is a parametric method (it assumes unimodal Gaussian likelihoods). If the distributions are significantly non-Gaussian, the LDA projections may not preserve complex structure in the data needed for classification
- LDA will also fail if the discriminatory information is not in the mean but in the variance of the data
[Figure: examples where LDA fails: multimodal classes with equal means, and classes separated by variance rather than mean, with the PCA and LDA directions shown]

Variants of LDA
- Non-parametric LDA (Fukunaga): NPLDA relaxes the unimodal Gaussian assumption by computing $S_B$ using local information and the kNN rule. As a result, the matrix $S_B$ is full-rank, allowing us to extract more than $(C-1)$ features, and the projections are able to preserve the structure of the data more closely
- Orthonormal LDA (Okada and Tomita): OLDA computes projections that maximize the Fisher criterion and, at the same time, are pairwise orthonormal. The method used in OLDA combines the eigenvalue solution of $S_W^{-1} S_B$ and the Gram-Schmidt orthonormalization procedure. OLDA sequentially finds axes that maximize the Fisher criterion in the subspace orthogonal to all features already extracted, and is also capable of finding more than $(C-1)$ features
- Generalized LDA (Lowe): GLDA generalizes the Fisher criterion by incorporating a cost function similar to the one we used to compute the Bayes Risk. As a result, LDA can produce projections that are biased by the cost function, i.e., classes with a higher cost $C_{ij}$ will be placed further apart in the low-dimensional projection
- Multilayer perceptrons (Webb and Lowe): it has been shown that the hidden layers of multi-layer perceptrons perform non-linear discriminant analysis by maximizing $\mathrm{Tr}[S_B S_T^{-1}]$, where the scatter matrices are measured at the output of the last hidden layer

Other dimensionality reduction methods
Exploratory Projection Pursuit (Friedman and Tukey)
- EPP seeks an M-dimensional (typically M = 2 or 3) linear projection of the data that maximizes a measure of "interestingness"
- Interestingness is measured as departure from multivariate normality. This measure is not the variance and is commonly scale-free. In most implementations it is also affine invariant, so it does not depend on correlations between features. [Ripley, 1996]
- In other words, EPP seeks projections that separate clusters as much as possible and keep these clusters compact, a criterion similar to Fisher's, except that EPP does NOT use class labels
- Once an interesting projection is found, it is important to remove the structure it reveals so that other interesting views can be found more easily
[Figure: an "uninteresting" projection (a single undifferentiated blob) vs. an "interesting" projection (well-separated compact clusters)]

Sammon's non-linear mapping (Sammon)
- This method seeks a mapping onto an M-dimensional space that preserves the inter-point distances in the original N-dimensional space. This is accomplished by minimizing the following objective function:
$$E(d, d') = \sum_{i<j} \frac{\left[d(P_i, P_j) - d'(P_i, P_j)\right]^2}{d(P_i, P_j)}$$
- The original method did not obtain an explicit mapping but only a lookup table for the elements in the training set. Newer implementations based on neural networks do provide an explicit mapping for test data and also consider cost functions (e.g., Neuroscale)
- Sammon's mapping is closely related to Multi-Dimensional Scaling (MDS), a family of multivariate statistical methods commonly used in the social sciences. We will review MDS techniques when we cover manifold learning
[Figure: points $P_i, P_j$ in the original space mapped to $P_i', P_j'$ in the low-dimensional space so that $d'(P_i', P_j') \approx d(P_i, P_j)\ \forall\, i, j$]
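
A minimal gradient-descent sketch of Sammon's objective as written above (the learning rate, iteration count, and random initialization are arbitrary choices for illustration; Sammon's original method used a Newton-style update and included a normalizing constant that is omitted here, as it is on the slide):

```python
import numpy as np

def sammon(X, m=2, n_iter=500, lr=0.1, seed=0):
    """Map X (N x D) to an (N x m) configuration Y by gradient descent on
    E = sum_{i<j} (D_ij - d_ij)^2 / D_ij, with D the original distances."""
    rng = np.random.default_rng(seed)
    N = len(X)
    # Pairwise distances in the original space (epsilon avoids division by 0)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1) + 1e-12
    Y = rng.normal(scale=1e-2, size=(N, m))      # random low-dim start
    for _ in range(n_iter):
        diff = Y[:, None, :] - Y[None, :, :]     # pairwise differences
        d = np.linalg.norm(diff, axis=-1) + 1e-12
        # dE/dY_i = sum_j -2 (D_ij - d_ij) / (D_ij d_ij) * (Y_i - Y_j)
        g = -2.0 * (D - d) / (D * d)
        np.fill_diagonal(g, 0.0)
        Y -= lr * (g[:, :, None] * diff).sum(axis=1)
    return Y
```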