Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis

Journal of Machine Learning Research 8 (2007) Submitted 3/06; Revised 12/06; Published 5/07

Masashi Sugiyama (SUGI@CS.TITECH.AC.JP)
Department of Computer Science, Tokyo Institute of Technology, O-okayama, Meguro-ku, Tokyo, Japan

Editor: Sam Roweis

Abstract

Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in high-dimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. An unsupervised dimensionality reduction method called locality-preserving projection (LPP) can work well with multimodal data due to its locality-preserving property. However, since LPP does not take the label information into account, it is not necessarily useful in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. LFDA has an analytic form of the embedding transformation, and the solution can be easily computed just by solving a generalized eigenvalue problem. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick.

Keywords: dimensionality reduction, supervised learning, Fisher discriminant analysis, locality-preserving projection, affinity matrix

1. Introduction

The goal of dimensionality reduction is to embed high-dimensional data samples in a low-dimensional space so that most of the intrinsic information contained in the data is preserved (e.g., Roweis and Saul, 2000; Tenenbaum et al., 2000; Hinton and Salakhutdinov, 2006).
Once dimensionality reduction is carried out appropriately, the compact representation of the data can be used for various succeeding tasks such as visualization and classification. In this paper, we consider the supervised dimensionality reduction problem, that is, samples are accompanied by class labels.

An efficient MATLAB implementation of local Fisher discriminant analysis is available from the author's website: sugi/software/lfda/.

Footnote 1. FDA may refer to the classification method which first projects data samples onto a one-dimensional subspace and then classifies the samples by thresholding (Fisher, 1936; Duda et al., 2001). The one-dimensional embedding space used here is obtained as the maximizer of the so-called Fisher criterion. This Fisher criterion can be used for dimensionality reduction onto a space with dimension more than one in multi-class problems (Fukunaga, 1990). With some abuse of terminology, we refer to the dimensionality reduction method based on the Fisher criterion as FDA (see Section 2.2 for detail).

© 2007 Masashi Sugiyama.

Fisher discriminant analysis (FDA) (Fisher, 1936; Fukunaga, 1990) is a popular method for linear supervised dimensionality reduction (see footnote 1). FDA seeks an embedding transformation such

that the between-class scatter is maximized and the within-class scatter is minimized. FDA is a traditional but useful method for dimensionality reduction. However, it tends to give undesired results if samples in a class form several separate clusters (i.e., are multimodal) (see, e.g., Fukunaga, 1990).

Within-class multimodality can be observed in many practical applications. For example, in disease diagnosis, the distribution of medical checkup samples of sick patients could be multimodal since there may be several different causes even for a single disease. In a traditional task of handwritten digit recognition, within-class multimodality appears if digits are classified into, for example, even and odd numbers. More generally, solving multi-class classification problems by a set of two-class one-versus-rest problems naturally induces within-class multimodality. For this reason, there is a widespread need for reducing the dimensionality of multimodal data.

In order to reduce the dimensionality of multimodal data appropriately, it is important to preserve the local structure of the data. Locality-preserving projection (LPP) (He and Niyogi, 2004) meets this requirement; LPP seeks an embedding transformation such that nearby data pairs in the original space are kept close in the embedding space. Thus LPP can reduce the dimensionality of multimodal data without losing the local structure. However, LPP is an unsupervised dimensionality reduction method and does not take the label information into account. Therefore, it does not necessarily work appropriately in supervised dimensionality reduction scenarios.

In this paper, we propose a new dimensionality reduction method called local Fisher discriminant analysis (LFDA). LFDA effectively combines the ideas of FDA and LPP, that is, LFDA maximizes between-class separability and preserves within-class local structure at the same time. Thus LFDA is useful for dimensionality reduction of multimodal labeled data.
The original FDA provides a meaningful result only when the dimensionality of the embedding space is smaller than the number of classes because of the rank deficiency of the between-class scatter matrix (Fukunaga, 1990). This is an essential limitation of FDA in dimensionality reduction. On the other hand, the proposed LFDA does not generally suffer from this problem and can be employed for dimensionality reduction into a space of arbitrary dimension. Furthermore, LFDA inherits an excellent property from FDA: it has an analytic form of the embedding matrix, and the solution can be easily computed just by solving a generalized eigenvalue problem. This is an advantage over recently proposed supervised dimensionality reduction methods (e.g., Goldberger et al., 2005; Globerson and Roweis, 2006). Furthermore, LFDA can be naturally extended to non-linear dimensionality reduction scenarios by applying the kernel trick (Schölkopf and Smola, 2002).

The rest of this paper is organized as follows. In Section 2, we formulate the linear dimensionality reduction problem, briefly review FDA and LPP, and illustrate how they typically behave. In Section 3, we define LFDA and show its fundamental properties. In Section 4, we discuss the relation between LFDA and other methods. In Section 5, we numerically evaluate the performance of LFDA and existing methods in visualization and classification tasks using benchmark data sets. Finally, we give concluding remarks and future prospects in Section 6.

2. Linear Dimensionality Reduction

In this section, we formulate the problem of linear dimensionality reduction and review existing methods.

2.1 Formulation

Let $x_i \in \mathbb{R}^d$ ($i = 1, 2, \ldots, n$) be $d$-dimensional samples and $y_i \in \{1, 2, \ldots, c\}$ be associated class labels, where $n$ is the number of samples and $c$ is the number of classes. Let $n_\ell$ be the number of samples in class $\ell$:

$$\sum_{\ell=1}^{c} n_\ell = n.$$

Let $X$ be the matrix of all samples:

$$X \equiv (x_1 | x_2 | \cdots | x_n).$$

Let $z_i \in \mathbb{R}^r$ ($1 \le r \le d$) be low-dimensional representations of $x_i$, where $r$ is the reduced dimension (i.e., the dimension of the embedding space). We typically have in mind the case where $d$ is large and $r$ is small, but the formulation is not limited to such cases.

For the moment, we focus on linear dimensionality reduction, that is, using a $d \times r$ transformation matrix $T$, the embedded samples $z_i$ are given by

$$z_i = T^\top x_i,$$

where $^\top$ denotes the transpose of a matrix or vector. In Section 3.4, we extend our discussion to non-linear dimensionality reduction scenarios where the mapping from $x_i$ to $z_i$ is non-linear.

2.2 Fisher Discriminant Analysis for Dimensionality Reduction

One of the most popular dimensionality reduction techniques is Fisher discriminant analysis (FDA) (Fisher, 1936; Fukunaga, 1990; Duda et al., 2001). Here we briefly describe the definition of FDA.

Let $S^{(w)}$ and $S^{(b)}$ be the within-class scatter matrix and the between-class scatter matrix:

$$S^{(w)} \equiv \sum_{\ell=1}^{c} \sum_{i: y_i = \ell} (x_i - \mu_\ell)(x_i - \mu_\ell)^\top, \tag{1}$$

$$S^{(b)} \equiv \sum_{\ell=1}^{c} n_\ell (\mu_\ell - \mu)(\mu_\ell - \mu)^\top, \tag{2}$$

where $\sum_{i: y_i = \ell}$ denotes the summation over $i$ such that $y_i = \ell$, $\mu_\ell$ is the mean of the samples in class $\ell$, and $\mu$ is the mean of all samples:

$$\mu_\ell \equiv \frac{1}{n_\ell} \sum_{i: y_i = \ell} x_i, \qquad \mu \equiv \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n} \sum_{\ell=1}^{c} n_\ell \mu_\ell.$$

We assume that $S^{(w)}$ has full rank. The FDA transformation matrix $T_{FDA}$ is defined as follows (see footnote 2):

$$T_{FDA} \equiv \operatorname*{argmax}_{T \in \mathbb{R}^{d \times r}} \left[ \operatorname{tr}\left( (T^\top S^{(w)} T)^{-1} T^\top S^{(b)} T \right) \right]. \tag{3}$$

Footnote 2. The following definition is also used in the literature (e.g., Fukunaga, 1990) and yields the same solution:

$$T_{FDA} = \operatorname*{argmax}_{T \in \mathbb{R}^{d \times r}} \frac{\det(T^\top S^{(b)} T)}{\det(T^\top S^{(w)} T)},$$

where $\det(\cdot)$ denotes the determinant of a matrix.
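The definitions in Eqs. (1) to (3) can be checked numerically. The following is a minimal NumPy sketch, not the author's implementation: the function names `scatter_matrices` and `fisher_criterion` are hypothetical, and samples are stored as rows of an `(n, d)` array (the transpose of the matrix $X$ used in the text).

```python
import numpy as np


def scatter_matrices(X, y):
    """Within-class and between-class scatter matrices of Eqs. (1) and (2).

    Assumption for this sketch: X is an (n, d) array with samples as rows,
    and y is an (n,) array of class labels.
    """
    n, d = X.shape
    mu = X.mean(axis=0)                      # mean of all samples
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]                   # samples of one class
        mu_l = Xc.mean(axis=0)               # class mean
        centered = Xc - mu_l
        Sw += centered.T @ centered
        diff = (mu_l - mu)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)
    return Sw, Sb


def fisher_criterion(T, Sw, Sb):
    """Value of tr((T^T Sw T)^{-1} T^T Sb T) in Eq. (3) for a (d, r) matrix T."""
    return float(np.trace(np.linalg.solve(T.T @ Sw @ T, T.T @ Sb @ T)))
```

Two properties stated in the text are easy to observe with this sketch: the rank of the between-class scatter matrix never exceeds $c - 1$, and the criterion value does not change when $T$ is multiplied from the right by an invertible $r \times r$ matrix.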

That is, FDA seeks a transformation matrix $T$ such that the between-class scatter is maximized while the within-class scatter is minimized. In the above formulation, we implicitly assumed that $T^\top S^{(w)} T$ is invertible. This implies that the above optimization is subject to $\operatorname{rank}(T) = r$.

Let $\{\varphi_k\}_{k=1}^{d}$ be the generalized eigenvectors associated with the generalized eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ of the following generalized eigenvalue problem:

$$S^{(b)} \varphi = \lambda S^{(w)} \varphi.$$

Then a solution $T_{FDA}$ of the above maximization problem is analytically given by

$$T_{FDA} = (\varphi_1 | \varphi_2 | \cdots | \varphi_r).$$

Note that the solution is not unique, and the following simple constraint is sometimes imposed additionally (Fukunaga, 1990):

$$T_{FDA}^\top S^{(w)} T_{FDA} = I_r,$$

where $I_r$ is the identity matrix on $\mathbb{R}^r$. This constraint makes the within-class scatter in the embedding space sphered.

The between-class scatter matrix $S^{(b)}$ has rank at most $c - 1$ (Fukunaga, 1990). This implies that the multiplicity of $\lambda = 0$ is at least $d - c + 1$. Therefore, FDA can find at most $c - 1$ meaningful features; the remaining features found by FDA are arbitrary. This is an essential limitation of FDA for dimensionality reduction and is very restrictive in practice.

2.3 Locality-Preserving Projection

Another dimensionality reduction technique that is relevant to the current setting is locality-preserving projection (LPP) (He and Niyogi, 2004). Here we review LPP.

Let $A$ be an affinity matrix, that is, the $n$-dimensional matrix with the $(i, j)$-th element $A_{i,j}$ being the affinity between $x_i$ and $x_j$. We assume that $A_{i,j} \in [0, 1]$; $A_{i,j}$ is large if $x_i$ and $x_j$ are close and $A_{i,j}$ is small if $x_i$ and $x_j$ are far apart. There are several different ways of defining $A$. We briefly describe typical definitions in Appendix D.

The LPP transformation matrix $T_{LPP}$ is defined as follows (see footnote 3):

$$T_{LPP} \equiv \operatorname*{argmin}_{T \in \mathbb{R}^{d \times r}} \left( \frac{1}{2} \sum_{i,j=1}^{n} A_{i,j} \| T^\top x_i - T^\top x_j \|^2 \right) \quad \text{subject to } T^\top X D X^\top T = I_r, \tag{4}$$

where $D$ is the $n$-dimensional diagonal matrix with $i$-th diagonal element being

$$D_{i,i} \equiv \sum_{j=1}^{n} A_{i,j}.$$

Footnote 3. The matrix $D$ in the constraint (4) is motivated by a geometric argument (Belkin and Niyogi, 2003). However, it is sometimes dropped for the sake of simplicity (Ham et al., 2004).

Eq. (4) implies that LPP looks for a transformation matrix $T$ such that nearby data pairs in the original space $\mathbb{R}^d$ are kept close in the embedding space. The constraint in Eq. (4) is imposed to avoid degeneracy.

Let $\{\psi_k\}_{k=1}^{d}$ be the generalized eigenvectors associated with the generalized eigenvalues $\gamma_1 \ge \gamma_2 \ge \cdots \ge \gamma_d$ of the following generalized eigenvalue problem:

$$X L X^\top \psi = \gamma X D X^\top \psi,$$

where

$$L \equiv D - A.$$

$L$ is called the graph-Laplacian matrix in spectral graph theory (Chung, 1997), where $A$ is seen as the adjacency matrix of a graph. He and Niyogi (2004) showed that a solution of Eq. (4) is given by

$$T_{LPP} = (\psi_d | \psi_{d-1} | \cdots | \psi_{d-r+1}).$$

2.4 Typical Behavior of FDA and LPP

Dimensionality reduction results obtained by FDA and LPP are illustrated in Figure 1 (LFDA will be defined and explained in Section 3): two-dimensional two-class data samples are embedded into a one-dimensional space. In LPP, the affinity matrix $A$ is determined by the local scaling method (Zelnik-Manor and Perona, 2005, see also Appendix D.4).

For the simplest data set depicted in Figure 1(a), both FDA and LPP nicely separate the samples in different classes (plotted with different markers) from each other. For the data set depicted in Figure 1(b), FDA still works well, but LPP mixes samples in different classes into a single cluster. This is caused by the unsupervised nature of LPP. On the other hand, for the data set depicted in Figure 1(c), LPP works well but FDA collapses the samples in different classes into a single cluster. The reason for the failure of FDA is that the levels of the between-class scatter and the within-class scatter are not evaluated in an intuitively natural way because one of the classes consists of two separate clusters (see also Fukunaga, 1990).

3. Local Fisher Discriminant Analysis

As illustrated in Figure 1, FDA can perform poorly if samples in a class form several separate clusters (i.e., are multimodal). In other words, the undesired behavior of FDA is caused by the globality when evaluating the within-class scatter and the between-class scatter (e.g., Figure 1(c)). On the other hand, because of the unsupervised nature of LPP, it can make samples in different classes overlap if they are close in the original high-dimensional space $\mathbb{R}^d$ (e.g., Figure 1(b)).

To overcome these problems, we propose combining the ideas of FDA and LPP; more specifically, we evaluate the levels of the between-class scatter and the within-class scatter in a local manner. This allows us to attain between-class separation and within-class local structure preservation at the same time. We call our new method local Fisher discriminant analysis (LFDA).

3.1 Reformulating FDA

In order to introduce LFDA, let us first reformulate FDA in a pairwise manner.

[Figure 1 (image not reproduced): three panels, (a) Toy data set 1, (b) Toy data set 2, (c) Toy data set 3, each showing the embedding lines found by FDA, LPP, and LFDA.]

Figure 1: Examples of dimensionality reduction by FDA, LPP and LFDA. Two-dimensional two-class samples are embedded into a one-dimensional space. The line in the figure denotes the one-dimensional embedding space (onto which the data samples are projected) obtained by each method.

Lemma 1. $S^{(w)}$ and $S^{(b)}$ defined by Eqs. (1) and (2) can be expressed as

$$S^{(w)} = \frac{1}{2} \sum_{i,j=1}^{n} W^{(w)}_{i,j} (x_i - x_j)(x_i - x_j)^\top, \tag{5}$$

$$S^{(b)} = \frac{1}{2} \sum_{i,j=1}^{n} W^{(b)}_{i,j} (x_i - x_j)(x_i - x_j)^\top, \tag{6}$$

where

$$W^{(w)}_{i,j} \equiv \begin{cases} 1/n_\ell & \text{if } y_i = y_j = \ell, \\ 0 & \text{if } y_i \neq y_j, \end{cases} \tag{7}$$

$$W^{(b)}_{i,j} \equiv \begin{cases} 1/n - 1/n_\ell & \text{if } y_i = y_j = \ell, \\ 1/n & \text{if } y_i \neq y_j. \end{cases} \tag{8}$$

A proof of Lemma 1 is given in Appendix A. Note that $1/n - 1/n_\ell$ in Eq. (8) is negative while $1/n_\ell$ and $1/n$ in Eqs. (7) and (8) are positive. This implies that if the data pairs in the same class are made close, the within-class scatter matrix $S^{(w)}$ gets small and the between-class scatter matrix $S^{(b)}$ gets large. On the other hand, if the data pairs in different classes are separated from each other, the between-class scatter matrix $S^{(b)}$ gets large. Therefore, we may interpret FDA as keeping the sample pairs in the same class close and the sample pairs in different classes apart. A more formal discussion of the above interpretation is given in Appendix B.

3.2 Definition and Typical Behavior of LFDA

Based on the above pairwise expression, let us define the local within-class scatter matrix $\tilde{S}^{(w)}$ and the local between-class scatter matrix $\tilde{S}^{(b)}$ as follows:

$$\tilde{S}^{(w)} \equiv \frac{1}{2} \sum_{i,j=1}^{n} \tilde{W}^{(w)}_{i,j} (x_i - x_j)(x_i - x_j)^\top, \tag{9}$$

$$\tilde{S}^{(b)} \equiv \frac{1}{2} \sum_{i,j=1}^{n} \tilde{W}^{(b)}_{i,j} (x_i - x_j)(x_i - x_j)^\top,$$

where

$$\tilde{W}^{(w)}_{i,j} \equiv \begin{cases} A_{i,j}/n_\ell & \text{if } y_i = y_j = \ell, \\ 0 & \text{if } y_i \neq y_j, \end{cases} \tag{10}$$

$$\tilde{W}^{(b)}_{i,j} \equiv \begin{cases} A_{i,j}(1/n - 1/n_\ell) & \text{if } y_i = y_j = \ell, \\ 1/n & \text{if } y_i \neq y_j. \end{cases} \tag{11}$$

Namely, according to the affinity $A_{i,j}$, we weight the values for the sample pairs in the same class. This means that far apart sample pairs in the same class have less influence on $\tilde{S}^{(w)}$ and $\tilde{S}^{(b)}$. Note that we do not weight the values for the sample pairs in different classes since we want to separate them from each other irrespective of the affinity in the original space. From here on, we denote the local counterparts of matrices by symbols with a tilde.
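Lemma 1 and the local weights in Eqs. (10) and (11) can be explored with a small NumPy sketch. It is an illustration under stated assumptions rather than the paper's code: the helper names `pairwise_scatter` and `pairwise_weights` are hypothetical, samples are rows of `X`, and the affinity matrix `A` is assumed to be symmetric. Passing `A=None` uses $A_{i,j} = 1$ for all pairs, which gives the FDA weights of Eqs. (7) and (8).

```python
import numpy as np


def pairwise_scatter(X, W):
    """(1/2) * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T for a symmetric weight matrix W.

    Uses the identity (1/2) sum_ij W_ij (x_i - x_j)(x_i - x_j)^T = X^T (D - W) X,
    where D is the diagonal matrix of the row sums of W and rows of X are samples.
    """
    return X.T @ (np.diag(W.sum(axis=1)) - W) @ X


def pairwise_weights(y, A=None):
    """Weight matrices of Eqs. (10)-(11); with A=None they reduce to Eqs. (7)-(8)."""
    n = y.shape[0]
    same = y[:, None] == y[None, :]            # True where y_i == y_j
    n_own = same.sum(axis=1).astype(float)     # n_{y_i}: size of the class of x_i
    if A is None:
        A = np.ones((n, n))                    # no localization: plain FDA
    Ww = np.where(same, A / n_own[:, None], 0.0)
    Wb = np.where(same, A * (1.0 / n - 1.0 / n_own[:, None]), 1.0 / n)
    return Ww, Wb
```

With `A=None`, `pairwise_scatter(X, Ww)` and `pairwise_scatter(X, Wb)` should agree with the class-mean definitions in Eqs. (1) and (2) up to rounding. With an affinity matrix whose entries lie in $[0, 1]$, the same-class weights can only shrink in magnitude, so far apart pairs in the same class contribute less.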

We define the LFDA transformation matrix $T_{LFDA}$ as

$$T_{LFDA} \equiv \operatorname*{argmax}_{T \in \mathbb{R}^{d \times r}} \left[ \operatorname{tr}\left( (T^\top \tilde{S}^{(w)} T)^{-1} T^\top \tilde{S}^{(b)} T \right) \right]. \tag{12}$$

That is, we look for a transformation matrix $T$ such that nearby data pairs in the same class are made close and the data pairs in different classes are separated from each other; far apart data pairs in the same class are not forced to be close.

Eq. (12) is of the same form as Eq. (3). Therefore, we can similarly compute an analytic form of $T_{LFDA}$ by solving a generalized eigenvalue problem of $\tilde{S}^{(b)}$ and $\tilde{S}^{(w)}$. An efficient implementation of LFDA is summarized as pseudo code in Figure 2 (see Appendix C for detail).

Toy examples of dimensionality reduction by LFDA are illustrated in Figure 1. We used the local scaling method for computing the affinity matrix $A$ (see Appendix D.4). Note that we perform the nearest neighbor search in the local scaling method in a classwise manner since we do not need the affinity values for the sample pairs in different classes (see Eqs. 10 and 11). This contributes greatly to reducing the computational cost (see Appendix C). Figure 1 shows that LFDA gives desirable results for all three data sets, that is, LFDA can compensate for the drawbacks of FDA and LPP by effectively combining the ideas of FDA and LPP.

If the affinity value $A_{i,j}$ is set to 1 for all sample pairs (i.e., all pairs are equally close to each other), $\tilde{S}^{(w)}$ and $\tilde{S}^{(b)}$ agree with $S^{(w)}$ and $S^{(b)}$, respectively, and LFDA is reduced to the original FDA. Therefore, LFDA may be regarded as a natural localized variant of FDA.

3.3 Properties of LFDA

Here we discuss fundamental properties of LFDA. First, we give an interpretation of LFDA in terms of the pointwise scatter. $\tilde{S}^{(w)}$ can be expressed as

$$\tilde{S}^{(w)} = \sum_{i=1}^{n} \frac{1}{n_{y_i}} \tilde{P}^{(w)}_i,$$

where $n_{y_i}$ is the number of samples in the class to which the sample $x_i$ belongs and $\tilde{P}^{(w)}_i$ is the pointwise local within-class scatter matrix around $x_i$:

$$\tilde{P}^{(w)}_i \equiv \frac{1}{2} \sum_{j: y_j = y_i} A_{i,j} (x_j - x_i)(x_j - x_i)^\top.$$

Therefore, minimizing $\tilde{S}^{(w)}$ corresponds to minimizing the weighted sum of the pointwise local within-class scatter matrices over all samples. $\tilde{S}^{(b)}$ can also be expressed in a similar way as

$$\tilde{S}^{(b)} = \sum_{i=1}^{n} \left( \frac{1}{n} - \frac{1}{n_{y_i}} \right) \tilde{P}^{(w)}_i + \frac{1}{n} \sum_{i=1}^{n} P^{(b)}_i, \tag{13}$$

where $P^{(b)}_i$ is the pointwise between-class scatter matrix around $x_i$:

$$P^{(b)}_i \equiv \frac{1}{2} \sum_{j: y_j \neq y_i} (x_j - x_i)(x_j - x_i)^\top.$$

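Input: Labeled samples $\{(x_i, y_i) \mid x_i \in \mathbb{R}^d,\ y_i \in \{1, 2, \ldots, c\}\}_{i=1}^{n}$
       Dimensionality of embedding space $r$ ($1 \le r \le d$)
Output: $d \times r$ transformation matrix $T_{LFDA}$

1: $\tilde{S}^{(b)} \leftarrow 0_{d \times d}$;
2: $\tilde{S}^{(w)} \leftarrow 0_{d \times d}$;
3: for $\ell = 1, 2, \ldots, c$ % Compute scatter matrices in a classwise manner
4:   $\{\underline{x}_i\}_{i=1}^{n_\ell} \leftarrow \{x_j\}_{j: y_j = \ell}$;
5:   for $i = 1, 2, \ldots, n_\ell$ % Determine local scaling
6:     $\underline{x}_i^{(7)} \leftarrow$ 7th nearest neighbor of $\underline{x}_i$ among $\{\underline{x}_j\}_{j=1}^{n_\ell}$;
7:     $\underline{\sigma}_i \leftarrow \|\underline{x}_i - \underline{x}_i^{(7)}\|$;
8:   end
9:   for $i, j = 1, 2, \ldots, n_\ell$ % Define affinity matrix
10:    $\underline{A}_{i,j} \leftarrow \exp(-\|\underline{x}_i - \underline{x}_j\|^2 / (\underline{\sigma}_i \underline{\sigma}_j))$;
11:  end
12:  $\underline{X} \leftarrow (\underline{x}_1 | \underline{x}_2 | \cdots | \underline{x}_{n_\ell})$;
13:  $\underline{G} \leftarrow \underline{X}\,\mathrm{diag}(\underline{A} 1_{n_\ell})\,\underline{X}^\top - \underline{X}\,\underline{A}\,\underline{X}^\top$;
14:  $\tilde{S}^{(b)} \leftarrow \tilde{S}^{(b)} + \underline{G}/n + (1 - n_\ell/n)\,\underline{X}\,\underline{X}^\top + \underline{X} 1_{n_\ell} (\underline{X} 1_{n_\ell})^\top / n$;
15:  $\tilde{S}^{(w)} \leftarrow \tilde{S}^{(w)} + \underline{G}/n_\ell$;
16: end
17: $\tilde{S}^{(b)} \leftarrow \tilde{S}^{(b)} - X 1_n (X 1_n)^\top / n - \tilde{S}^{(w)}$;
18: $\{\tilde{\lambda}_k, \tilde{\varphi}_k\}_{k=1}^{r} \leftarrow$ generalized eigenvalues and normalized eigenvectors of $\tilde{S}^{(b)} \tilde{\varphi} = \tilde{\lambda} \tilde{S}^{(w)} \tilde{\varphi}$; % $\tilde{\lambda}_1 \ge \tilde{\lambda}_2 \ge \cdots \ge \tilde{\lambda}_d$
19: $T_{LFDA} = \left( \sqrt{\tilde{\lambda}_1}\,\tilde{\varphi}_1 \,\middle|\, \sqrt{\tilde{\lambda}_2}\,\tilde{\varphi}_2 \,\middle|\, \cdots \,\middle|\, \sqrt{\tilde{\lambda}_r}\,\tilde{\varphi}_r \right)$;

Figure 2: Efficient implementation of LFDA (see Appendix C for detail). The affinity matrix is computed by the local scaling method (see Appendix D.4). Matrices and vectors denoted with an underline are classwise counterparts of the original ones. $0_{d \times d}$ denotes the $d \times d$ matrix of zeros, $1_{n_\ell}$ denotes the $n_\ell$-dimensional vector of ones, and $\mathrm{diag}(\underline{A} 1_{n_\ell})$ denotes the diagonal matrix with diagonal elements $\underline{A} 1_{n_\ell}$. The generalized eigenvectors in line 18 are normalized by Eq. (14), which is often automatically carried out by an eigensolver. The weighting scheme of the eigenvectors in line 19 is explained in Section 3.3. A possible bottleneck of the above implementation is the nearest neighbor search in line 6. This could be alleviated by incorporating prior knowledge of the data structure or by approximation (see Saul and Roweis, 2003, and references therein). Another possible bottleneck is the computation of $\underline{X}\,\underline{A}\,\underline{X}^\top$ in line 13, which could be eased by defining the affinity matrix sparsely (see Appendix D). A MATLAB implementation is available from sugi/software/lfda/.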
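The pseudo code of Figure 2 translates fairly directly into NumPy. The sketch below is an illustrative port, not the author's MATLAB implementation. Its assumptions: the function names `local_scaling_affinity` and `lfda` are hypothetical, samples are stored as rows of `X`, the generalized eigenvalue problem is solved through a Cholesky factor of $\tilde{S}^{(w)}$ (which requires it to be positive definite and yields the normalization of Eq. (14)), and classes with fewer than eight samples fall back to their farthest neighbor for the local scale.

```python
import numpy as np


def local_scaling_affinity(Xc, k=7):
    """Local-scaling affinity among the rows of Xc (samples of one class)."""
    n = Xc.shape[0]
    sq = np.sum((Xc[:, None, :] - Xc[None, :, :]) ** 2, axis=2)
    # After sorting a row, position 0 is the sample itself, so position k is
    # its k-th nearest neighbour (assumption: small classes use position n-1).
    sigma = np.sqrt(np.sort(sq, axis=1)[:, min(k, n - 1)])
    sigma = np.maximum(sigma, 1e-12)           # guard against duplicated samples
    return np.exp(-sq / np.outer(sigma, sigma))


def lfda(X, y, r, k=7):
    """Sketch of Figure 2. X: (n, d) samples as rows, y: (n,) labels. Returns (d, r)."""
    n, d = X.shape
    Sb = np.zeros((d, d))
    Sw = np.zeros((d, d))
    for label in np.unique(y):                 # lines 3-16: classwise accumulation
        Xc = X[y == label].T                   # d x n_l, columns are samples
        nl = Xc.shape[1]
        A = local_scaling_affinity(Xc.T, k)    # lines 5-11
        G = Xc @ np.diag(A.sum(axis=1)) @ Xc.T - Xc @ A @ Xc.T   # line 13
        s = Xc.sum(axis=1, keepdims=True)      # X 1_{n_l}
        Sb += G / n + (1.0 - nl / n) * (Xc @ Xc.T) + (s @ s.T) / n   # line 14
        Sw += G / nl                           # line 15
    s_all = X.T.sum(axis=1, keepdims=True)     # X 1_n
    Sb = Sb - (s_all @ s_all.T) / n - Sw       # line 17
    # Line 18: reduce Sb v = lam Sw v to a symmetric problem via Sw = L L^T.
    L = np.linalg.cholesky(Sw)
    Linv = np.linalg.inv(L)
    C = Linv @ Sb @ Linv.T
    lam, U = np.linalg.eigh((C + C.T) / 2.0)
    order = np.argsort(lam)[::-1][:r]          # largest eigenvalues first
    lam, U = lam[order], U[:, order]
    Phi = Linv.T @ U                           # satisfies Phi^T Sw Phi = I
    # Line 19: weight by sqrt(eigenvalue); negatives are clipped as a safeguard.
    return Phi * np.sqrt(np.maximum(lam, 0.0))
```

A usage example in the spirit of Toy data set 3: if one class forms two clusters on either side of the other class, `lfda(X, y, r=1)` is expected to return a direction along which the clusters stay apart. Embedded samples are then obtained as `X @ T`.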

Note that $P^{(b)}_i$ does not include the localization factor $A_{i,j}$. Eq. (13) implies that maximizing $\tilde{S}^{(b)}$ corresponds to minimizing the weighted sum of the pointwise local within-class scatter matrices and maximizing the sum of the pointwise between-class scatter matrices.

Next, we discuss the issue of eigenvalue multiplicity in LFDA. The original FDA allows us to extract at most $c - 1$ meaningful features since the between-class scatter matrix $S^{(b)}$ has rank at most $c - 1$ (Fukunaga, 1990). On the other hand, the local between-class scatter matrix $\tilde{S}^{(b)}$ generally has a much higher rank with less eigenvalue multiplicity, thanks to the localization factor $A_{i,j}$ included in $\tilde{W}^{(b)}$ (see Eq. 11). In the simulations shown in Section 5, $\tilde{S}^{(b)}$ is always full rank for various data sets. Therefore, the proposed LFDA can be practically employed for dimensionality reduction into a space of any dimension. This is a significant improvement over the original FDA.

Finally, we discuss the invariance property of LFDA. The value of the LFDA criterion (12) is invariant under linear transformations, that is, for any $r$-dimensional invertible matrix $H$, $T_{LFDA} H$ is also a solution of Eq. (12). Therefore, the solution $T_{LFDA}$ is not unique: the range of the transformation $H^\top T_{LFDA}^\top$ is uniquely determined, but the distance metric (Goldberger et al., 2005; Globerson and Roweis, 2006; Weinberger et al., 2006) in the embedding space can be arbitrary because of the arbitrariness of the matrix $H$.

In practice, we propose determining the LFDA transformation matrix $T_{LFDA}$ as follows. First, we rescale the generalized eigenvectors $\{\tilde{\varphi}_k\}_{k=1}^{d}$ so that

$$\tilde{\varphi}_k^\top \tilde{S}^{(w)} \tilde{\varphi}_{k'} = \begin{cases} 1 & \text{if } k = k', \\ 0 & \text{if } k \neq k'. \end{cases} \tag{14}$$

Note that this rescaling is often automatically carried out by an eigensolver. Then we weight each generalized eigenvector by the square root of its associated generalized eigenvalue, that is,

$$T_{LFDA} = \left( \sqrt{\tilde{\lambda}_1}\,\tilde{\varphi}_1 \,\middle|\, \sqrt{\tilde{\lambda}_2}\,\tilde{\varphi}_2 \,\middle|\, \cdots \,\middle|\, \sqrt{\tilde{\lambda}_r}\,\tilde{\varphi}_r \right), \tag{15}$$

where $\tilde{\lambda}_1 \ge \tilde{\lambda}_2 \ge \cdots \ge \tilde{\lambda}_d$. This weighting scheme weakens the influence of minor eigenvectors and is shown to work well in experiments (see Section 5).
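The rescaling of Eq. (14), the weighting of Eq. (15), and the invariance of the criterion can be seen on generic matrices. The sketch below is one way to do this with NumPy, under the assumptions that the within-class matrix is symmetric positive definite and the between-class matrix is symmetric; the function names `normalized_generalized_eig` and `trace_criterion` are hypothetical.

```python
import numpy as np


def normalized_generalized_eig(Sb, Sw):
    """Solve Sb v = lam Sw v with eigenvectors scaled so that Phi^T Sw Phi = I.

    Assumes Sw is symmetric positive definite and Sb is symmetric.
    Eigenvalues are returned in descending order.
    """
    L = np.linalg.cholesky(Sw)                 # Sw = L L^T
    Linv = np.linalg.inv(L)
    C = Linv @ Sb @ Linv.T                     # symmetric standard eigenproblem
    lam, V = np.linalg.eigh((C + C.T) / 2.0)
    order = np.argsort(lam)[::-1]
    return lam[order], Linv.T @ V[:, order]


def trace_criterion(T, Sb, Sw):
    """tr((T^T Sw T)^{-1} T^T Sb T), the form shared by Eqs. (3) and (12)."""
    return float(np.trace(np.linalg.solve(T.T @ Sw @ T, T.T @ Sb @ T)))


def weighted_embedding(lam, Phi, r):
    """Eq. (15): scale the leading r eigenvectors by the square roots of their eigenvalues."""
    return Phi[:, :r] * np.sqrt(np.clip(lam[:r], 0.0, None))
```

Because the eigenvectors are ordered once, the matrix for a smaller $r$ is simply the leading columns of the matrix for a larger $r$. Weighting by the square roots of positive eigenvalues multiplies the unweighted solution by an invertible diagonal matrix, so the criterion value stays the same while the metric in the embedding space changes.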
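3.4 Kernel LFDA for Non-Linear Dimensionality Reduction

Here we show how LFDA can be extended to non-linear dimensionality reduction scenarios.

As detailed in Appendix C, the generalized eigenvalue problem that needs to be solved in LFDA can be expressed as

$$X \tilde{L}^{(b)} X^\top \tilde{\varphi} = \tilde{\lambda} X \tilde{L}^{(w)} X^\top \tilde{\varphi}, \tag{16}$$

where $\tilde{L}^{(b)} = \tilde{L}^{(m)} - \tilde{L}^{(w)}$, and $\tilde{L}^{(m)}$ and $\tilde{L}^{(w)}$ are defined by Eqs. (33) and (35), respectively. Since $X^\top \tilde{\varphi}$ in Eq. (16) belongs to the range of $X^\top$, it can be expressed by using some vector $\tilde{\alpha} \in \mathbb{R}^n$ as

$$X^\top \tilde{\varphi} = X^\top X \tilde{\alpha} = K \tilde{\alpha},$$

where $K$ is the $n$-dimensional matrix with the $(i, j)$-th element being

$$K_{i,j} \equiv x_i^\top x_j.$$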

Then multiplying Eq. (16) by $X^\top$ from the left-hand side yields

$$K \tilde{L}^{(b)} K \tilde{\alpha} = \tilde{\lambda} K \tilde{L}^{(w)} K \tilde{\alpha}. \tag{17}$$

This implies that $\{x_i\}_{i=1}^{n}$ appear only in terms of their inner products. Therefore, we can obtain a non-linear variant of LFDA by the kernel trick (Vapnik, 1998; Schölkopf et al., 1998), which is explained below.

Let us consider a non-linear mapping $\phi(x)$ from $\mathbb{R}^d$ to a reproducing kernel Hilbert space $\mathcal{H}$ (Aronszajn, 1950). Let $K(x, x')$ be the reproducing kernel of $\mathcal{H}$. A typical choice of the kernel function would be the Gaussian kernel:

$$K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right),$$

with $\sigma > 0$. For other choices, see, for example, Wahba (1990), Vapnik (1998), and Schölkopf and Smola (2002). Because of the reproducing property of $K(x, x')$, $K$ is now the kernel matrix, that is, the $(i, j)$-th element is given by

$$K_{i,j} = \langle \phi(x_i), \phi(x_j) \rangle = K(x_i, x_j),$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product in $\mathcal{H}$.

It can be confirmed that $\tilde{L}^{(w)}$ is always degenerate (since $\tilde{L}^{(w)} (1, 1, \ldots, 1)^\top$ always vanishes; see Eq. 35 for detail). Therefore, $K \tilde{L}^{(w)} K$ is always degenerate and we cannot directly solve the generalized eigenvalue problem (17). To cope with this problem, we propose regularizing $K \tilde{L}^{(w)} K$ and solving the following generalized eigenvalue problem instead (cf. Friedman, 1989):

$$K \tilde{L}^{(b)} K \tilde{\alpha} = \tilde{\lambda} (K \tilde{L}^{(w)} K + \varepsilon I_n) \tilde{\alpha}, \tag{18}$$

where $\varepsilon$ is a small constant. Let $\{\tilde{\alpha}_k\}_{k=1}^{n}$ be the generalized eigenvectors associated with the generalized eigenvalues $\tilde{\lambda}_1 \ge \tilde{\lambda}_2 \ge \cdots \ge \tilde{\lambda}_n$ of Eq. (18). Then the embedded image of $\phi(x')$ in $\mathcal{H}$ is given by

$$\left( \sqrt{\tilde{\lambda}_1}\,\tilde{\alpha}_1 \,\middle|\, \sqrt{\tilde{\lambda}_2}\,\tilde{\alpha}_2 \,\middle|\, \cdots \,\middle|\, \sqrt{\tilde{\lambda}_r}\,\tilde{\alpha}_r \right)^\top \begin{pmatrix} K(x_1, x') \\ K(x_2, x') \\ \vdots \\ K(x_n, x') \end{pmatrix}.$$

We call this kernelized variant of LFDA kernel LFDA (KLFDA).

Recently, kernel functions for non-vectorial structured data such as strings, trees, and graphs have been proposed (see, e.g., Lodhi et al., 2002; Collins and Duffy, 2002; Kashima and Koyanagi, 2002; Kondor and Lafferty, 2002; Kashima et al., 2003; Gärtner et al., 2003; Gärtner, 2003). Since KLFDA uses the samples only via the kernel function $K(x, x')$, it allows us to reduce the dimensionality of such non-vectorial data.

4.
Comparison with Related Methods

In this section, we discuss the relation between the proposed LFDA and other methods.

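4.1 Dimensionality Reduction Using Local Discriminant Information

A discriminant adaptive nearest neighbor (DANN) classifier (Hastie and Tibshirani, 1996a) employs an adapted distance metric at each test point for classification. Based on a similar idea, they also proposed a global supervised dimensionality reduction method using local discriminant information (LDI) in the same paper. We refer to this supervised dimensionality reduction method as LDI. The main idea of LDI is to localize FDA, which is very similar to the proposed LFDA. Here we discuss the relation between LDI and LFDA.

In LDI, the data samples $\{x_i\}_{i=1}^{n}$ are first sphered according to the within-class scatter matrix $S^{(w)}$, that is, for $i = 1, 2, \ldots, n$,

$$\bar{x}_i \equiv (S^{(w)})^{-\frac{1}{2}} x_i.$$

Let $\bar{A}_{i,j}$ be the weight of sample $\bar{x}_j$ around $\bar{x}_i$ defined by

$$\bar{A}_{i,j} \equiv \begin{cases} \left[ 1 - \left( \dfrac{\|\bar{x}_i - \bar{x}_j\|}{\|\bar{x}_i - \bar{x}_i^{(K)}\|} \right)^3 \right]^3 & \text{if } \|\bar{x}_i - \bar{x}_j\| < \|\bar{x}_i - \bar{x}_i^{(K)}\|, \\ 0 & \text{otherwise}, \end{cases}$$

where $\bar{x}_i^{(K)}$ is the $K$-th nearest neighbor of $\bar{x}_i$ in the sphered space. Note that $0 \le \bar{A}_{i,j} \le 1$ and $\bar{A}_{i,j}$ is non-increasing as $\|\bar{x}_i - \bar{x}_j\|$ increases. Thus it has the same meaning as our affinity matrix. $K$ is suggested to be determined by $K = \max(n/5, 50)$.

Let $\bar{\mu}^{[i]}_\ell$ be the local weighted mean of the sphered samples in class $\ell$ around $\bar{x}_i$, and let $\bar{\mu}^{[i]}$ be the local weighted mean of the sphered samples around $\bar{x}_i$:

$$\bar{\mu}^{[i]}_\ell \equiv \frac{1}{n^{[i]}_\ell} \sum_{j: y_j = \ell} \bar{A}_{i,j} \bar{x}_j, \qquad \bar{\mu}^{[i]} \equiv \frac{1}{n^{[i]}} \sum_{j=1}^{n} \bar{A}_{i,j} \bar{x}_j = \frac{1}{n^{[i]}} \sum_{\ell=1}^{c} n^{[i]}_\ell \bar{\mu}^{[i]}_\ell,$$

where

$$n^{[i]}_\ell \equiv \sum_{j: y_j = \ell} \bar{A}_{i,j}, \qquad n^{[i]} \equiv \sum_{j=1}^{n} \bar{A}_{i,j}.$$

Let $\bar{S}^{(b)}$ be the average between sum-of-squares matrix defined as

$$\bar{S}^{(b)} \equiv \sum_{i=1}^{n} \frac{1}{n^{[i]}} \sum_{\ell=1}^{c} n^{[i]}_\ell (\bar{\mu}^{[i]}_\ell - \bar{\mu}^{[i]})(\bar{\mu}^{[i]}_\ell - \bar{\mu}^{[i]})^\top.$$

The LDI transformation matrix $\bar{T}_{LDI}$ is defined as

$$\bar{T}_{LDI} \equiv \operatorname*{argmax}_{T \in \mathbb{R}^{d \times r}} \left[ \operatorname{tr}\left( T^\top \bar{S}^{(b)} T \right) \right] \quad \text{subject to } T^\top T = I_r.$$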

$\bar{T}_{LDI}$ is a transformation matrix for sphered samples; the LDI transformation matrix $T_{LDI}$ for non-sphered samples is given by

$$T_{LDI} = (S^{(w)})^{-\frac{1}{2}} \bar{T}_{LDI}.$$

Similar to FDA (and LFDA), $T_{LDI}$ can be efficiently computed by solving a generalized eigenvalue problem.

The average between sum-of-squares matrix $\bar{S}^{(b)}$ is conceptually very similar to the local between-class scatter matrix $\tilde{S}^{(b)}$ in LFDA. Indeed, as proved in Appendix E, we can express $\bar{S}^{(b)}$ in a pairwise manner as

$$\bar{S}^{(b)} = \frac{1}{2} \sum_{i,j=1}^{n} \bar{W}^{(b)}_{i,j} (\bar{x}_i - \bar{x}_j)(\bar{x}_i - \bar{x}_j)^\top, \tag{19}$$

where

$$\bar{W}^{(b)}_{i,j} \equiv \begin{cases} \displaystyle\sum_{k=1}^{n} \frac{1}{n^{[k]}} \left( \frac{1}{n^{[k]}} - \frac{1}{n^{[k]}_\ell} \right) \bar{A}_{i,k} \bar{A}_{j,k} & \text{if } y_i = y_j = \ell, \\ \displaystyle\sum_{k=1}^{n} \frac{1}{(n^{[k]})^2} \bar{A}_{i,k} \bar{A}_{j,k} & \text{if } y_i \neq y_j. \end{cases} \tag{20}$$

However, there exist critical differences between LDI and LFDA. A significant difference is that the values for the sample pairs in different classes are also localized in LDI (see Eq. 20), while they are kept unlocalized in LFDA (see Eq. 11). This implies that far apart sample pairs in different classes could be made close in LDI, which is not desirable in supervised dimensionality reduction. Furthermore, the computation of $\bar{S}^{(b)}$ is slightly less efficient than that of $\tilde{S}^{(b)}$ since $\bar{W}^{(b)}$ includes the summation over $k$.

Another important difference between LDI and LFDA is that the within-class scatter matrix $S^{(w)}$ is not localized in LDI. However, as we showed in Section 3.1, the within-class scatter matrix $S^{(w)}$ also accounts for collapsing the within-class multimodal structure (i.e., far apart sample pairs in the same class are made close). This phenomenon is experimentally confirmed in Section 5.

4.2 Mixture Discriminant Analysis

FDA can be interpreted as maximum likelihood estimation of Gaussian distributions with common covariance and different means for each class. Based on this view, Hastie and Tibshirani (1996b) proposed mixture discriminant analysis (MDA), which extends FDA to maximum likelihood estimation of Gaussian mixture distributions.

A maximum likelihood solution is obtained by an EM-type algorithm (cf. Dempster et al., 1977). However, this is an iterative algorithm and gives only a locally optimal solution. Therefore, the computation of MDA is rather slow and there is no guarantee that the global solution can be obtained. Furthermore, the number of mixture components (clusters) in each class as well as the initial location of cluster centers should be determined by users. For cluster centers, using standard techniques such as k-means clustering (MacQueen, 1967; Everitt et al., 2001) or learning vector quantization (Kohonen, 1989) is recommended. However, they are also iterative algorithms and have no guarantee that the global solution can be obtained. Furthermore, there seems to be no systematic method for determining the number of clusters.

On the other hand, the proposed LFDA contains no tuning parameters (given that the affinity matrix is determined by the local scaling method, see Appendix D.4) and the global solution can

be obtained analytically. However, it still lacks a probabilistic interpretation, which currently remains an open issue.

4.3 Neighborhood Component Analysis

Goldberger et al. (2005) proposed a supervised dimensionality reduction method called neighborhood component analysis (NCA). The NCA transformation matrix $T_{NCA}$ is defined as follows:

$$T_{NCA} \equiv \operatorname*{argmax}_{T \in \mathbb{R}^{d \times r}} \left( \sum_{i=1}^{n} \sum_{j: y_j = y_i} p_{i,j}(T T^\top) \right), \tag{21}$$

where

$$p_{i,j}(U) \equiv \begin{cases} \dfrac{\exp\{-(x_i - x_j)^\top U (x_i - x_j)\}}{\sum_{k \neq i} \exp\{-(x_i - x_k)^\top U (x_i - x_k)\}} & \text{if } i \neq j, \\ 0 & \text{if } i = j. \end{cases} \tag{22}$$

The above definition corresponds to maximizing the expected number of correctly classified samples by a stochastic variant of nearest neighbor classifiers. Therefore, NCA seeks a transformation matrix $T$ such that the between-class separability is maximized. Eqs. (21) and (22) imply that nearby data pairs in the same class are made close, which is similar to the proposed LFDA. Indeed, the simulation results in Section 5.2 show that NCA tends to preserve the multimodal structure of the data very well.

However, a crucial weakness of NCA is optimization: the optimization problem (21) is non-convex. Therefore, there is no guarantee that the globally optimal solution can be obtained. Goldberger et al. (2005) proposed using a gradient ascent method for optimization:

$$T \leftarrow T + \varepsilon \nabla J_{NCA}(T), \tag{23}$$

where $\varepsilon$ ($> 0$) is the step size and the gradient $\nabla J_{NCA}(T)$, written here as a $d \times r$ matrix so that it can be added to $T$ in Eq. (23), is given by

$$\nabla J_{NCA}(T) = 2 \sum_{i=1}^{n} \left( \Big\{ \sum_{j: y_j = y_i} p_{i,j}(T T^\top) \Big\} \Big\{ \sum_{j=1}^{n} p_{i,j}(T T^\top)(x_i - x_j)(x_i - x_j)^\top \Big\} - \sum_{j: y_j = y_i} p_{i,j}(T T^\top)(x_i - x_j)(x_i - x_j)^\top \right) T.$$

The gradient ascent iteration (23) is computationally rather inefficient. Also, the choice of the step size $\varepsilon$ is troublesome. If the step size is small enough, convergence to one of the local optima is guaranteed, but such a choice makes the convergence very slow; on the other hand, if the step size is too large, the iterates oscillate and convergence may no longer be guaranteed. Furthermore, the choice of the termination condition in the iterative algorithm is often cumbersome in practice.

Because of the non-convexity of the optimization problem, the quality of the obtained solution depends on the initialization of the matrix $T$. A useful heuristic to alleviate the local optimum problem is to employ the FDA (or LFDA) result as an initial matrix for optimization (Goldberger et al., 2005). In the experiments in Section 5, using the LFDA result as an initial matrix appears to be better than random initialization. However, the local optima problem still remains even with the above heuristic.
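To make Eqs. (21) to (23) concrete, the following NumPy sketch evaluates the NCA objective and its gradient with respect to $T$. It is an illustration and not the MATLAB software referred to in the text: the function names are hypothetical, samples are rows of `X`, and the gradient is returned as a $d \times r$ matrix so that it can be added to $T$ directly.

```python
import numpy as np


def nca_probabilities(X, T):
    """p_ij(T T^T) of Eq. (22); rows of X are samples, T is (d, r)."""
    Z = X @ T                                  # embedded samples, (n, r)
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    np.fill_diagonal(sq, np.inf)               # enforces p_ii = 0
    sq = sq - sq.min(axis=1, keepdims=True)    # row shift for numerical stability
    E = np.exp(-sq)
    return E / E.sum(axis=1, keepdims=True)


def nca_objective(X, y, T):
    """Objective of Eq. (21): sum over i of sum_{j: y_j = y_i} p_ij."""
    same = y[:, None] == y[None, :]
    return float(np.sum(nca_probabilities(X, T) * same))


def nca_gradient(X, y, T):
    """Gradient of the objective with respect to T, as a (d, r) matrix."""
    P = nca_probabilities(X, T)
    same = (y[:, None] == y[None, :]).astype(float)
    p_i = np.sum(P * same, axis=1)             # sum_{j: y_j = y_i} p_ij
    W = p_i[:, None] * P - P * same            # weight of (x_i - x_j)(x_i - x_j)^T
    Ws = W + W.T
    # sum_ij W_ij (x_i - x_j)(x_i - x_j)^T = X^T (diag(rowsum(Ws)) - Ws) X
    M = X.T @ (np.diag(Ws.sum(axis=1)) - Ws) @ X
    return 2.0 * M @ T


def nca_ascent_step(X, y, T, eps):
    """One iteration of Eq. (23)."""
    return T + eps * nca_gradient(X, y, T)
```

The step size `eps` and a stopping rule are left to the caller, which reflects the practical difficulty discussed above: both have to be chosen by hand, and the result depends on the initial `T`.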

15 LOCAL FISHER DISCRIMINANT ANALYSIS Whe a dimesioality reductio techique is applied to classificatio tasks, we ofte wat to embed the data samples ito spaces with several differet dimesios the best dimesioality is later chose by, for example, cross-validatio (Stoe, 1974; Wahba, 1990). I such a sceario, NCA requires to optimize the trasformatio matrix idividually for each dimesioality r of the embeddig space. O the other had, LFDA eeds to compute the trasformatio matrix oly oce for the largest r; its sub-matrices become the optimal solutios for smaller dimesios. Therefore, LFDA is computatioally more efficiet tha NCA i this sceario. A simple MATLAB implemetatio of NCA is available. 4 We use this software i Sectio Maximally Collapsig Metric Learig I order to overcome the computatioal problem of NCA, Globerso ad Roweis (2006) proposed a alterative method called maximally collapsig metric learig (MCML). Let p i, j be the ideal value of p i, j(u) defied by Eq. (22): where p i, j is ormalized so that p i, j { 1 if yi = y j, 0 if y i y j, p i, j = 1. j i p i, j ca be attaied if all samples i the same class collapse ito a sigle poit while samples i other classes are mapped to other locatios. I reality, however, ay U may ot be able to attai p i, j (U) = p i, j exactly; istead the optimal approximatio to p i, j uder the Kullback-Leibler divergece (Kullback ad Leibler, 1951) is obtaied. This is formally defied as U MCML argmi U R d d ( ) p p i, j i, j log p i, j (U) subject to U PSD(r), (24) where PSD(r) is the set of all positive semidefiite matrices of rak r (i.e., r eigevalues are positive ad others are zero). Oce U MCML is obtaied, the MCML trasformatio matrix T MCML is computed by T MCML = (φ 1 φ 2 φ r ), (25) where {φ k } r k=1 are the eigevectors associated with the positive eigevalues η 1 η 2 η r > 0 of the followig eigevalue problem: U MCML φ = ηφ. Oe of the motivatios of MCML is to alleviate the difficulty of optimizatio i NCA. 
However, MCML still has a weakness in optimization: the optimization problem (24) is convex only when r = d, that is, when the dimensionality is not reduced and only the distance metric of the original space is changed. This means that if r < d (which is our primary focus in this paper), we may not be able to obtain the globally optimal solution. Globerson and Roweis (2006) proposed the following heuristic algorithm to approximate $T_{\mathrm{MCML}}$. First, the optimization problem (24) with r = d is solved:

$$ \widehat{U}_{\mathrm{MCML}} \equiv \operatorname*{argmin}_{U \in \mathbb{R}^{d \times d}} \left( \sum_{i,j} p^*_{i,j} \log \frac{p^*_{i,j}}{p_{i,j}(U)} \right) \quad \text{subject to } U \in \mathrm{PSD}(d). \qquad (26) $$

Although Eq. (26) is convex, an analytic form of the unique optimal solution $\widehat{U}_{\mathrm{MCML}}$ is not known yet. Globerson and Roweis (2006) proposed using the following alternating iterative procedure for obtaining $\widehat{U}_{\mathrm{MCML}}$:

$$ U \longleftarrow U - \varepsilon \nabla J_{\mathrm{MCML}}(U), \qquad (27) $$

$$ U \longleftarrow \sum_{k=1}^{d} \max(0, \eta_k)\, \varphi_k \varphi_k^\top, \qquad (28) $$

where $\varepsilon$ (> 0) is the step size, $\eta_k$ and $\varphi_k$ are the eigenvalues and eigenvectors of $U$, and the gradient $\nabla J_{\mathrm{MCML}}(U)$ is given by

$$ \nabla J_{\mathrm{MCML}}(U) = \sum_{i,j} \left( p^*_{i,j} - p_{i,j}(U) \right) (x_i - x_j)(x_i - x_j)^\top. $$

Then the eigenvalue decomposition of $\widehat{U}_{\mathrm{MCML}}$ is carried out and the eigenvalues $\widehat{\eta}_1 \ge \widehat{\eta}_2 \ge \cdots \ge \widehat{\eta}_d$ and associated eigenvectors $\{\widehat{\varphi}_k\}_{k=1}^{d}$ are obtained:

$$ \widehat{U}_{\mathrm{MCML}} \widehat{\varphi} = \widehat{\eta} \widehat{\varphi}. $$

Finally, $\{\varphi_k\}_{k=1}^{r}$ in Eq. (25) are replaced by $\{\widehat{\varphi}_k\}_{k=1}^{r}$, which yields

$$ \widehat{T}_{\mathrm{MCML}} \equiv (\widehat{\varphi}_1 \,|\, \widehat{\varphi}_2 \,|\, \cdots \,|\, \widehat{\varphi}_r). \qquad (29) $$

This approximation is shown to be practically useful (Globerson and Roweis, 2006), although there seems to be no theoretical analysis for this approximation.

MCML may have an advantage over NCA in computation: there exists the analytic approximation (29) that can be computed efficiently using the solution of another convex optimization problem (26). However, MCML still relies on the gradient-based alternating iterative algorithm (27)-(28) to solve the convex optimization problem (26), which is computationally very expensive since the eigenvalue decomposition of a d-dimensional matrix must be carried out in each iteration (see Eq. 28). Furthermore, the difficulty of appropriately choosing the step size and the termination condition in the iterative procedure still remains.

4. Implementation available at fowlkes/software/nca/.

Since MCML requires all the samples in the same class to collapse into a single point, it is not necessarily useful in dimensionality reduction of multimodal data samples.
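The alternating procedure of Eqs. (27) and (28) can be sketched in a few lines for the special case d = 2, where the eigenvalue decomposition has a closed form. This is a minimal illustration and not the implementation of Globerson and Roweis (2006): it assumes that $p_{i,j}(U)$ is the softmax over $j \neq i$ of the negative distances $(x_i - x_j)^\top U (x_i - x_j)$, and the toy data, the step size, the fixed iteration count, and all function names are choices made here for illustration only.

```python
import math

def sq_dist(U, a, b):
    """(a - b)^T U (a - b) for 2-D points and a symmetric 2x2 matrix U."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    return U[0][0] * dx * dx + 2.0 * U[0][1] * dx * dy + U[1][1] * dy * dy

def model_probs(X, U):
    """p_{i,j}(U): softmax over j != i of negative distances (assumed form)."""
    n, P = len(X), []
    for i in range(n):
        d = [sq_dist(U, X[i], X[j]) for j in range(n)]
        d_min = min(d[j] for j in range(n) if j != i)  # shift for stability
        w = [0.0 if j == i else math.exp(d_min - d[j]) for j in range(n)]
        total = sum(w)
        P.append([v / total for v in w])
    return P

def ideal_probs(y):
    """p*_{i,j}: uniform over the other samples of the same class."""
    n, P = len(y), []
    for i in range(n):
        same = [1.0 if (j != i and y[j] == y[i]) else 0.0 for j in range(n)]
        total = sum(same)  # requires at least two samples per class
        P.append([v / total for v in same])
    return P

def kl_objective(P_star, P):
    """Sum of p* log(p*/p) over all pairs with p* > 0."""
    return sum(ps * math.log(ps / p)
               for row_s, row in zip(P_star, P)
               for ps, p in zip(row_s, row) if ps > 0.0)

def gradient(X, P_star, P):
    """Sum over i, j of (p* - p)(x_i - x_j)(x_i - x_j)^T as a 2x2 matrix."""
    g = [[0.0, 0.0], [0.0, 0.0]]
    for i in range(len(X)):
        for j in range(len(X)):
            c = P_star[i][j] - P[i][j]
            dx, dy = X[i][0] - X[j][0], X[i][1] - X[j][1]
            g[0][0] += c * dx * dx
            g[0][1] += c * dx * dy
            g[1][1] += c * dy * dy
    g[1][0] = g[0][1]
    return g

def project_psd(U):
    """Eq. (28) for a symmetric 2x2 matrix: clip negative eigenvalues to zero."""
    a, b, c = U[0][0], U[0][1], U[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - c)       # angle of the top eigenvector
    cs, sn = math.cos(theta), math.sin(theta)
    mean, rad = 0.5 * (a + c), math.hypot(0.5 * (a - c), b)
    l1, l2 = max(0.0, mean + rad), max(0.0, mean - rad)
    off = (l1 - l2) * cs * sn
    return [[l1 * cs * cs + l2 * sn * sn, off],
            [off, l1 * sn * sn + l2 * cs * cs]]

def mcml_2d(X, y, step=0.02, iterations=300):
    """Alternate the gradient step (27) and the PSD projection (28)."""
    P_star = ideal_probs(y)
    U = [[1.0, 0.0], [0.0, 1.0]]
    history = []
    for _ in range(iterations):
        P = model_probs(X, U)
        history.append(kl_objective(P_star, P))
        g = gradient(X, P_star, P)
        U = [[U[r][c] - step * g[r][c] for c in range(2)] for r in range(2)]
        U = project_psd(U)
    history.append(kl_objective(P_star, model_probs(X, U)))
    return U, history

# Toy data: the classes differ along the first coordinate only.
X = [(0.0, 0.0), (0.0, 0.5), (0.2, 1.0), (1.0, 0.0), (1.1, 0.5), (0.9, 1.0)]
y = [0, 0, 0, 1, 1, 1]
U_hat, history = mcml_2d(X, y)
```

Even in this toy setting, every iteration performs one eigenvalue decomposition inside `project_psd`, and both `step` and `iterations` have to be chosen by hand, which are the two practical costs discussed above.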
Furthermore, the MCML results can be significantly influenced by outliers since the outliers are also required to collapse into the same single point together with other samples. This phenomenon is illustrated in Figure 3, where a single outlier significantly changes the MCML result.

Globerson and Roweis (2006) showed that the sufficient statistics of the MCML algorithm are pointwise scatter matrices (cf. Section 3.3). Since LFDA also has an interpretation in terms of pointwise scatter matrices, there may be a link between LFDA and MCML; this needs to be investigated in future work.

Figure 3: Toy examples of dimensionality reduction by LFDA and MCML. (a) Toy data set 1. (b) Toy data set 1'. The toy data set 1 is equivalent to the one used in Figure 1(a). The data set 1' includes a single outlier.

4.5 Remark on Rank Constraint

The optimization problem of MCML (see Eq. 24) is not generally convex since the rank constraint is non-convex (Boyd and Vandenberghe, 2004). The non-convexity induced by the rank constraint seems to be a universal problem in dimensionality reduction. NCA eliminates the rank constraint by decomposing $U$ into $T T^\top$ (see Eqs. 21 and 22). However, even with this decomposition, the optimization problem is still non-convex. On the other hand, FDA, LDI, and LFDA cast the optimization problem in the form of the Rayleigh quotient. This is computationally very advantageous since it allows us to analytically determine the range of the embedding space. However, we cannot determine the distance metric in the embedding space since the Rayleigh quotient is invariant under linear transformations. For this reason, an additional criterion is needed to determine the distance metric (see also Section 3.3).

5. Numerical Examples

In this section, we numerically evaluate the performance of LFDA and existing methods.

5.1 Exploratory Data Analysis

Here we use the Thyroid disease data set available from the UCI machine learning repository (Blake and Merz, 1998) and illustrate how LFDA can be used for exploratory data analysis. Each original sample is a 5-dimensional input vector x consisting of the results of the following laboratory tests.

1. T3-resin uptake test.
2. Total Serum thyroxin as measured by the isotopic displacement method.
3. Total Serum triiodothyronine as measured by radioimmuno assay.
4. Basal thyroid-stimulating hormone (TSH) as measured by radioimmuno assay.
5. Maximal absolute difference of TSH value after injection of 200 micro grams of thyrotropin-releasing hormone as compared to the basal value.

The task is to predict whether patients' thyroids are euthyroidism, hypothyroidism, or hyperthyroidism (Coomans et al., 1983), that is, whether patients' thyroids are normal, hypo-functioning, or hyper-functioning (Blake and Merz, 1998). The diagnosis (the class label) is based on a complete medical record, including anamnesis, scan, etc. Here we merge the hypothyroidism class and the hyperthyroidism class into a single class and create binary labeled data (whether thyroids are normal or not). Our goal is to predict whether patients' thyroids are normal, hypo-functioning, or hyper-functioning from the binary labeled data samples.

Figure 4: Histograms of the first feature values obtained by (a) FDA and (b) LFDA for the Thyroid disease data set. The top row corresponds to the sick patients (hyperthyroidism and hypothyroidism) and the bottom row corresponds to the healthy patients (euthyroidism).

Figure 4 depicts the histograms of the first feature values obtained by FDA and LFDA; the top row corresponds to the sick patients and the bottom row corresponds to the healthy patients. This shows that both FDA and LFDA separate the patients with normal thyroids from sick patients reasonably well. In addition to between-class separability, LFDA clearly preserves the multimodal structure among sick patients (i.e., hypo-functioning and hyper-functioning), which is lost by ordinary FDA. Another interesting finding from the figure is that the first feature values obtained by LFDA have a strong negative correlation with the functioning level of thyroids; this could be used for predicting the functioning level of thyroids.
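The relabeling used in this experiment is easy to reproduce. The sketch below merges the two sick classes into one binary label, as done for training, and then inspects a one-dimensional feature against the original three labels, which mirrors how Figure 4 is read. The feature values are made-up toy numbers, not the Thyroid data or any LFDA output, and the helper names `merge_to_binary` and `mean_by_label` are hypothetical.

```python
from collections import defaultdict

SICK = {"hypothyroidism", "hyperthyroidism"}

def merge_to_binary(labels):
    """Map the three diagnoses to binary labels: 'sick' or 'normal'."""
    return ["sick" if lab in SICK else "normal" for lab in labels]

def mean_by_label(values, labels):
    """Average a one-dimensional feature within each label group."""
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[lab].append(v)
    return {lab: sum(vs) / len(vs) for lab, vs in groups.items()}

labels = ["euthyroidism", "hypothyroidism", "hyperthyroidism",
          "euthyroidism", "hypothyroidism", "hyperthyroidism"]
# Toy first-feature values (illustrative only).
first_feature = [0.1, 2.0, -2.1, -0.1, 2.2, -1.9]

binary = merge_to_binary(labels)                      # labels seen in training
by_binary = mean_by_label(first_feature, binary)      # view with merged classes
by_original = mean_by_label(first_feature, labels)    # view with three classes
```

In this toy example the merged "sick" group has a mean close to that of the "normal" group, while the two original sick classes sit on opposite sides of it; a summary computed only on the binary labels would hide exactly the kind of within-class multimodality that the LFDA histogram exposes.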

5.2 Data Visualization

Here we apply the proposed and existing dimensionality reduction methods to benchmark data sets and investigate how they behave in data visualization tasks. We use the Letter recognition data set and the Iris data set available from the UCI machine learning repository (Blake and Merz, 1998). Table 1 describes the specifications of the data sets.

Table 1: Two-class data sets used for visualization experiments (r = 2).

Data set             d    Merged class (two sample types)   Other class
Letter recognition   16   A & C                             B
Iris                 4    Setosa & Virginica                Versicolour

Each data set contains three types of samples, each plotted with its own marker. We merged two of the types into a single class (see Table 1) and created two-class problems. We test LFDA, FDA, LPP, LDI, NCA, and MCML and evaluate the between-class separability (i.e., the two merged types are well separated from the third type) and the within-class multimodality preservation capability (i.e., each of the two merged types is well grouped). For LPP and LFDA, we determined the affinity matrix by the local scaling method (see Appendix D.4). For NCA, we used the LFDA result as an initial matrix since this initialization scheme appears to work better than the random initialization. FDA allows us to extract only one meaningful feature in two-class classification problems (see Section 2.2), so we choose the second feature randomly here.

Figures 5 and 6 depict the samples embedded in the two-dimensional space found by each method. The horizontal axis is the first feature found by each method, while the vertical axis is the second feature.

Figure 5: Visualization of the Letter recognition data set (panels: LFDA, FDA, LPP, LDI, NCA, MCML; sample types: A, C, B).

Figure 6: Visualization of the Iris data set (panels: LFDA, FDA, LPP, LDI, NCA, MCML; sample types: Setosa, Virginica, Versicolour).

First, we compare the embedding results of LFDA with those of FDA and LPP. For the Letter recognition data set (see the top row of Figure 5), LFDA nicely separates samples in different classes from each other, and at the same time, it clearly preserves within-class multimodality. FDA separates A and C from B well, but within-class multimodality is lost, that is, A and C are mixed. LPP gives two separate clusters of samples, but samples in different classes are mixed in one of the clusters. For the Iris data set (see the top row of Figure 6), LFDA simultaneously achieves between-class separation and within-class multimodality preservation. On the other hand, FDA tends to mix samples in different classes, which would be caused by within-class multimodality. LPP also works well for this data set because the three clusters are well separated from each other in the original high-dimensional space. Overall, LFDA is found to be more appropriate for embedding labeled multimodal data samples than FDA and LPP, implying that our primary goal has been successfully achieved.

Next, we compare the results of LFDA with those of LDI, NCA, and MCML. For the Letter recognition data set (see Figure 5), LFDA, LDI, NCA, and MCML separate the samples in different classes from each other very well. However, LDI and MCML collapse A and C into a single cluster, while LFDA and NCA preserve the multimodal structure clearly. The NCA result is almost identical to the LFDA result (i.e., the initial value of the NCA iteration), but the result may vary if the initial value for the gradient ascent algorithm is changed. For the Iris data set (see Figure 6), LFDA, LDI, and NCA work excellently in both between-class separation and within-class multimodality preservation. On the other hand, MCML mixes the samples in different classes. Overall, LDI works fairly well, but the within-class multimodal structure is sometimes lost since LDI only partially takes within-class multimodality into account (see Section 4.1). NCA also works very well, which
implies that the heuristic of using the LFDA result as an initial value is useful. However, NCA does not provide significant performance improvement over LFDA in the above simulations. The MCML results have similar tendencies to FDA. Based on the above simulation results, we conclude that LFDA is a promising method in the visualization of multimodal labeled data.

5.3 Classification

Here we apply the proposed and existing dimensionality reduction techniques to classification tasks, and objectively evaluate the effectiveness of LFDA. There are several measures for quantitatively evaluating separability of data samples in different classes (e.g., Fukunaga, 1990; Globerson et al., 2005). Here we use a simple one: the misclassification rate by a one-nearest-neighbor classifier. As explained in Section 3.3, the LFDA criterion is invariant under linear transformations, while the misclassification rate by a one-nearest-neighbor classifier depends on the distance metric. This means that the following simulation results are highly dependent on the normalization scheme (15).

We employ the IDA data sets (see footnote 5), which are standard binary classification data sets originally used in Rätsch et al. (2001). In addition, we use two binary classification data sets created from the USPS handwritten digit data set. The first task (USPS-eo) is to separate even numbers from odd numbers and the second task (USPS-sl) is to separate small numbers ("0" to "4") from large numbers ("5" to "9"). For training and testing, 100 samples are randomly chosen for each digit. Table 2 summarizes the specifications of the data sets. The ringnorm, twonorm, and waveform data sets contain features with only noise. The thyroid, waveform, USPS-eo, and USPS-sl data sets contain intrinsic within-class multimodal structures since they are converted from multi-class problems by merging some of the classes. The banana data set is also multimodal.

Table 2: List of binary classification data sets. Marked data sets contain intrinsic within-class multimodal structures.
Columns: Data name, Input dimensionality, # of training samples, # of test samples, # of realizations.
Data sets: banana, breast-cancer, diabetes, flare-solar, german, heart, image, ringnorm, splice, thyroid, titanic, twonorm, waveform, USPS-eo, USPS-sl.

5. Data sets available at

Table 3: Means and standard deviations of the misclassification rate when the embedding dimensionality is chosen by cross validation. For each data set, the best method and comparable ones based on the t-test at the significance level 5% are marked. Marked data sets contain the intrinsic within-class multimodal structure. A "-" indicates that the method was not tested on that data set.

Data set        LFDA      LDI   NCA   MCML   LPP   PCA
banana          13.7 ±    ±     ±     ±      ±     ± 0.8
breast-cancer   34.7 ±    ±     ±     ±      ±     ± 5.0
diabetes        32.0 ±    ±     -     ±      ±     ± 3.0
flare-solar     39.2 ±    ±     -     -      ±     ± 5.1
german          29.9 ±    ±     ±     ±      ±     ± 2.4
heart           21.9 ±    ±     ±     ±      ±     ± 3.5
image           3.2 ±     ±     -     ±      ±     ± 0.5
ringnorm        21.1 ±    ±     ±     ±      ±     ± 1.4
splice          16.9 ±    ±     -     ±      ±     ± 1.3
thyroid         4.6 ±     ±     ±     ±      ±     ± 2.6
titanic         33.1 ±    ±     ±     ±      ±     ± 12.0
twonorm         3.5 ±     ±     ±     ±      ±     ± 0.6
waveform        12.5 ±    ±     ±     ±      ±     ± 1.2
USPS-eo         9.0 ±     ±     -     -      ±     ± 0.7
USPS-sl         ±         ±     -     ±      ±     ± 0.8
Computation time (ratio)

We test LFDA, LDI, NCA, MCML, LPP, and principal component analysis (PCA). Note that LPP and PCA are unsupervised dimensionality reduction methods, while the others are supervised methods. NCA is not tested for the diabetes, flare-solar, image, splice, USPS-eo, and USPS-sl data sets and MCML is not tested for the flare-solar and USPS-eo data sets since the execution time is too long.

Figure 7 depicts the mean misclassification rate by a one-nearest-neighbor classifier as functions of the dimensionality r of the reduced space. The error bars are omitted for clear visibility. Instead, we plotted the results of the following significance test: for each dimensionality r, the mean misclassification rate by the best method and comparable ones based on the t-test (Henkel, 1979) at the significance level 5% are marked.
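The evaluation protocol behind Figure 7 can be sketched as follows: embed the data with the first r columns of a transformation matrix, classify each test sample by its nearest training sample, and repeat for every r. This is a minimal sketch rather than the code used in the experiments. The matrix `T` below is a hand-picked hypothetical example (not an LFDA solution), the data are toy values, and the function names are chosen here for illustration. Reusing the leading columns of one matrix for all r reflects the property, noted in Section 4.3, that sub-matrices of the LFDA solution serve the smaller dimensions.

```python
def embed(T, x, r):
    """Project x onto the first r columns of T (T is a list of column vectors)."""
    return [sum(t * v for t, v in zip(T[c], x)) for c in range(r)]

def one_nn_error(train_X, train_y, test_X, test_y):
    """Misclassification rate of a one-nearest-neighbor classifier."""
    errors = 0
    for x, y in zip(test_X, test_y):
        nearest = min(
            range(len(train_X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
        )
        errors += train_y[nearest] != y
    return errors / len(test_X)

def error_curve(T, train_X, train_y, test_X, test_y):
    """Error for every r = 1, ..., r_max, reusing the leading columns of T."""
    curve = {}
    for r in range(1, len(T) + 1):
        tr = [embed(T, x, r) for x in train_X]
        te = [embed(T, x, r) for x in test_X]
        curve[r] = one_nn_error(tr, train_y, te, test_y)
    return curve

# Toy 3-D data: the first coordinate carries the class, the second is noise.
train_X = [[0.0, 5.0, 0.0], [0.0, -5.0, 0.0], [0.2, 5.3, 0.0], [0.2, -5.3, 0.0]]
train_y = [0, 0, 1, 1]
test_X = [[0.05, 5.28, 0.0], [0.18, -5.02, 0.0]]
test_y = [0, 1]

# Hypothetical transformation: column 1 keeps the informative coordinate,
# column 2 keeps the noisy one.
T = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
curve = error_curve(T, train_X, train_y, test_X, test_y)
```

In this constructed example the one-dimensional embedding classifies both test points correctly, while adding the second (noisy) direction makes both fail, which illustrates why the error is examined as a function of r and why the result depends on the distance metric of the embedding space.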
The results show that LFDA works quite well, but overall there is no single best method that consistently outperforms the others.

Table 3 describes the mean and standard deviation of the misclassification rate by each method when the embedding dimensionality r is chosen by 5-fold cross validation (Stone, 1974; Wahba, 1990); for the USPS-eo and USPS-sl data sets, we used 20-fold cross validation since this was more accurate. For each data set, the best method and comparable ones based on the t-test at the significance level 5% are marked. The table shows that overall LFDA has excellent

Figure 7: Mean misclassification rates by a one-nearest-neighbor method as functions of the dimensionality of the embedding space (horizontal axis: reduced dimension r; vertical axis: mean misclassification rate; methods: LFDA, LDI, NCA, MCML, LPP, PCA; panels: banana, breast-cancer, diabetes, flare-solar, german, heart, image, ringnorm, splice, thyroid, titanic, twonorm, waveform, USPS-eo, USPS-sl). For each dimension, the best method and comparable ones based on the t-test at the significance level 5% are marked.