Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis




Journal of Machine Learning Research 8 (2007) 1027-1061. Submitted 3/06; Revised 12/06; Published 5/07.

Dimensionality Reduction of Multimodal Labeled Data by Local Fisher Discriminant Analysis

Masashi Sugiyama (SUGI@CS.TITECH.AC.JP)
Department of Computer Science, Tokyo Institute of Technology
2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552, Japan

Editor: Sam Roweis

Abstract

Reducing the dimensionality of data without losing intrinsic information is an important preprocessing step in high-dimensional data analysis. Fisher discriminant analysis (FDA) is a traditional technique for supervised dimensionality reduction, but it tends to give undesired results if samples in a class are multimodal. An unsupervised dimensionality reduction method called locality-preserving projection (LPP) can work well with multimodal data due to its locality-preserving property. However, since LPP does not take the label information into account, it is not necessarily useful in supervised learning scenarios. In this paper, we propose a new linear supervised dimensionality reduction method called local Fisher discriminant analysis (LFDA), which effectively combines the ideas of FDA and LPP. LFDA has an analytic form of the embedding transformation and the solution can be easily computed just by solving a generalized eigenvalue problem. We demonstrate the practical usefulness and high scalability of the LFDA method in data visualization and classification tasks through extensive simulation studies. We also show that LFDA can be extended to non-linear dimensionality reduction scenarios by applying the kernel trick.

Keywords: dimensionality reduction, supervised learning, Fisher discriminant analysis, locality preserving projection, affinity matrix

1. Introduction

The goal of dimensionality reduction is to embed high-dimensional data samples in a low-dimensional space so that most of the intrinsic information contained in the data is preserved (e.g., Roweis and Saul, 2000; Tenenbaum et al., 2000; Hinton and Salakhutdinov, 2006).
Once dimensionality reduction is carried out appropriately, the compact representation of the data can be used for various succeeding tasks such as visualization, classification, etc. In this paper, we consider the supervised dimensionality reduction problem, that is, samples are accompanied with class labels. Fisher discriminant analysis (FDA) (Fisher, 1936; Fukunaga, 1990) is a popular method for linear supervised dimensionality reduction.¹ FDA seeks an embedding transformation such that the between-class scatter is maximized and the within-class scatter is minimized.

1. FDA may refer to the classification method which first projects data samples onto a one-dimensional subspace and then classifies the samples by thresholding (Fisher, 1936; Duda et al., 2001). The one-dimensional embedding space used here is obtained as the maximizer of the so-called Fisher criterion. This Fisher criterion can be used for dimensionality reduction onto a space with dimension more than one in multi-class problems (Fukunaga, 1990). With some abuse, we refer to the dimensionality reduction method based on the Fisher criterion as FDA (see Section 2.2 for detail).

An efficient MATLAB implementation of local Fisher discriminant analysis is available from the author's website: http://sugiyama-www.cs.titech.ac.jp/~sugi/software/LFDA/.

© 2007 Masashi Sugiyama.

FDA is a traditional but useful method for dimensionality reduction. However, it tends to give undesired results if samples in a class form several separate clusters (i.e., are multimodal) (see, e.g., Fukunaga, 1990). Within-class multimodality can be observed in many practical applications. For example, in disease diagnosis, the distribution of medical checkup samples of sick patients could be multimodal since there may be several different causes even for a single disease. In the traditional task of handwritten digit recognition, within-class multimodality appears if digits are classified into, for example, even and odd numbers. More generally, solving multi-class classification problems by a set of two-class one-versus-rest problems naturally induces within-class multimodality. For this reason, there is a universal need for reducing the dimensionality of multimodal data.

In order to reduce the dimensionality of multimodal data appropriately, it is important to preserve the local structure of the data. Locality-preserving projection (LPP) (He and Niyogi, 2004) meets this requirement; LPP seeks an embedding transformation such that nearby data pairs in the original space remain close in the embedding space. Thus LPP can reduce the dimensionality of multimodal data without losing the local structure. However, LPP is an unsupervised dimensionality reduction method and does not take the label information into account. Therefore, it does not necessarily work appropriately in supervised dimensionality reduction scenarios.

In this paper, we propose a new dimensionality reduction method called local Fisher discriminant analysis (LFDA). LFDA effectively combines the ideas of FDA and LPP, that is, LFDA maximizes between-class separability and preserves within-class local structure at the same time. Thus LFDA is useful for dimensionality reduction of multimodal labeled data.
The original FDA provides a meaningful result only when the dimensionality of the embedding space is smaller than the number of classes because of the rank deficiency of the between-class scatter matrix (Fukunaga, 1990). This is an essential limitation of FDA in dimensionality reduction. On the other hand, the proposed LFDA does not generally suffer from this problem and can be employed for dimensionality reduction into an arbitrary dimensional space. Furthermore, LFDA inherits an excellent property from FDA: it has an analytic form of the embedding matrix and the solution can be easily computed just by solving a generalized eigenvalue problem. This is an advantage over recently proposed supervised dimensionality reduction methods (e.g., Goldberger et al., 2005; Globerson and Roweis, 2006). Furthermore, LFDA can be naturally extended to non-linear dimensionality reduction scenarios by applying the kernel trick (Schölkopf and Smola, 2002).

The rest of this paper is organized as follows. In Section 2, we formulate the linear dimensionality reduction problem, briefly review FDA and LPP, and illustrate how they typically behave. In Section 3, we define LFDA and show its fundamental properties. In Section 4, we discuss the relation between LFDA and other methods. In Section 5, we numerically evaluate the performance of LFDA and existing methods in visualization and classification tasks using benchmark data sets. Finally, we give concluding remarks and future prospects in Section 6.

2. Linear Dimensionality Reduction

In this section, we formulate the problem of linear dimensionality reduction and review existing methods.

2.1 Formulation

Let x_i ∈ R^d (i = 1, 2, ..., n) be d-dimensional samples and y_i ∈ {1, 2, ..., c} be associated class labels, where n is the number of samples and c is the number of classes. Let n_l be the number of samples in class l:

    Σ_{l=1}^{c} n_l = n.

Let X be the matrix of all samples:

    X ≡ (x_1 | x_2 | ... | x_n).

Let z_i ∈ R^r (1 ≤ r ≤ d) be low-dimensional representations of x_i, where r is the reduced dimension (i.e., the dimension of the embedding space). Effectively, we consider d to be large and r to be small, but we are not limited to such cases. For the moment, we focus on linear dimensionality reduction, that is, using a d × r transformation matrix T, the embedded samples z_i are given by

    z_i = T⊤ x_i,

where ⊤ denotes the transpose of a matrix or vector. In Section 3.4, we extend our discussion to non-linear dimensionality reduction scenarios where the mapping from x_i to z_i is non-linear.

2.2 Fisher Discriminant Analysis for Dimensionality Reduction

One of the most popular dimensionality reduction techniques is Fisher discriminant analysis (FDA) (Fisher, 1936; Fukunaga, 1990; Duda et al., 2001). Here we briefly describe the definition of FDA.

Let S^(w) and S^(b) be the within-class scatter matrix and the between-class scatter matrix:

    S^(w) ≡ Σ_{l=1}^{c} Σ_{i : y_i = l} (x_i − μ_l)(x_i − μ_l)⊤,  (1)

    S^(b) ≡ Σ_{l=1}^{c} n_l (μ_l − μ)(μ_l − μ)⊤,  (2)

where Σ_{i : y_i = l} denotes the summation over i such that y_i = l, μ_l is the mean of the samples in class l, and μ is the mean of all samples:

    μ_l ≡ (1/n_l) Σ_{i : y_i = l} x_i,

    μ ≡ (1/n) Σ_{i=1}^{n} x_i = (1/n) Σ_{l=1}^{c} n_l μ_l.

We assume that S^(w) has full rank. The FDA transformation matrix T_FDA is defined as follows:²

    T_FDA ≡ argmax_{T ∈ R^{d×r}} tr( (T⊤ S^(w) T)^{−1} T⊤ S^(b) T ).  (3)

2. The following definition is also used in the literature (e.g., Fukunaga, 1990) and yields the same solution:

    T_FDA = argmax_{T ∈ R^{d×r}} det(T⊤ S^(b) T) / det(T⊤ S^(w) T),

where det(·) denotes the determinant of a matrix.
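As a concrete numerical sketch of Eqs. (1)-(3), the trace criterion can be maximized with a standard generalized eigensolver (the solution structure is detailed in the text below). This is an illustrative implementation, not the paper's code; the function name `fda` and the toy data are invented for the example, and NumPy/SciPy are assumed to be available:

```python
import numpy as np
from scipy.linalg import eigh

def fda(X, y, r):
    """Fisher discriminant analysis: return a d x r transformation matrix.

    X is (n, d) with one sample per row, y holds integer class labels.
    Maximizes tr((T' S_w T)^{-1} T' S_b T) (Eq. 3) via the generalized
    eigenproblem S_b phi = lambda S_w phi.
    """
    n, d = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for l in np.unique(y):
        Xl = X[y == l]
        mu_l = Xl.mean(axis=0)
        Sw += (Xl - mu_l).T @ (Xl - mu_l)               # within-class scatter, Eq. (1)
        Sb += len(Xl) * np.outer(mu_l - mu, mu_l - mu)  # between-class scatter, Eq. (2)
    lam, Phi = eigh(Sb, Sw)        # ascending generalized eigenvalues
    return Phi[:, ::-1][:, :r]     # eigenvectors of the r largest eigenvalues

# Two well-separated classes in the plane, embedded onto one dimension.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])
T = fda(X, y, 1)
z = X @ T                          # one-dimensional embedded samples
```

Since S^(b) has rank at most c − 1 (as discussed in the text), only the first c − 1 returned columns carry meaningful directions.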

That is, FDA seeks a transformation matrix T such that the between-class scatter is maximized while the within-class scatter is minimized. In the above formulation, we implicitly assumed that T⊤ S^(w) T is invertible. This implies that the above optimization is subject to rank(T) = r.

Let {φ_k}_{k=1}^{d} be the generalized eigenvectors associated with the generalized eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_d of the following generalized eigenvalue problem:

    S^(b) φ = λ S^(w) φ.

Then a solution T_FDA of the above maximization problem is analytically given by

    T_FDA = (φ_1 | φ_2 | ... | φ_r).

Note that the solution is not unique, and the following simple constraint is sometimes imposed additionally (Fukunaga, 1990):

    T_FDA⊤ S^(w) T_FDA = I_r,

where I_r is the identity matrix on R^r. This constraint makes the within-class scatter in the embedding space sphered.

The between-class scatter matrix S^(b) has rank at most c − 1 (Fukunaga, 1990). This implies that the multiplicity of λ = 0 is at least d − c + 1. Therefore, FDA can find at most c − 1 meaningful features; the remaining features found by FDA are arbitrary. This is an essential limitation of FDA for dimensionality reduction and is very restrictive in practice.

2.3 Locality-Preserving Projection

Another dimensionality reduction technique that is relevant to the current setting is locality-preserving projection (LPP) (He and Niyogi, 2004). Here we review LPP.

Let A be an affinity matrix, that is, the n × n matrix with the (i, j)-th element A_{i,j} being the affinity between x_i and x_j. We assume that A_{i,j} ∈ [0, 1]; A_{i,j} is large if x_i and x_j are close, and A_{i,j} is small if x_i and x_j are far apart. There are several different manners of defining A; we briefly describe typical definitions in Appendix D. The LPP transformation matrix T_LPP is defined as follows:³

    T_LPP ≡ argmin_{T ∈ R^{d×r}} (1/2) Σ_{i,j=1}^{n} A_{i,j} ‖T⊤ x_i − T⊤ x_j‖²  subject to T⊤ X D X⊤ T = I_r,  (4)

where D is the n × n diagonal matrix with the i-th diagonal element being

    D_{i,i} ≡ Σ_{j=1}^{n} A_{i,j}.

3. The matrix D in the constraint (4) is motivated by a geometric argument (Belkin and Niyogi, 2003).
However, it is sometimes dropped for the sake of simplicity (Ham et al., 2004).

Eq. (4) implies that LPP looks for a transformation matrix T such that nearby data pairs in the original space R^d are kept close in the embedding space. The constraint in Eq. (4) is imposed to avoid degeneracy.

Let {ψ_k}_{k=1}^{d} be the generalized eigenvectors associated with the generalized eigenvalues γ_1 ≥ γ_2 ≥ ... ≥ γ_d of the following generalized eigenvalue problem:

    X L X⊤ ψ = γ X D X⊤ ψ,

where

    L ≡ D − A.

L is called the graph-Laplacian matrix in spectral graph theory (Chung, 1997), where A is seen as the adjacency matrix of a graph. He and Niyogi (2004) showed that a solution of Eq. (4) is given by

    T_LPP = (ψ_d | ψ_{d−1} | ... | ψ_{d−r+1}).

2.4 Typical Behavior of FDA and LPP

Dimensionality reduction results obtained by FDA and LPP are illustrated in Figure 1 (LFDA will be defined and explained in Section 3): two-dimensional two-class data samples are embedded into a one-dimensional space. In LPP, the affinity matrix A is determined by the local scaling method (Zelnik-Manor and Perona, 2005; see also Appendix D.4).

For the simplest data set, depicted in Figure 1(a), both FDA and LPP nicely separate the samples in the two classes from each other. For the data set depicted in Figure 1(b), FDA still works well, but LPP mixes samples in different classes into a single cluster. This is caused by the unsupervised nature of LPP. On the other hand, for the data set depicted in Figure 1(c), LPP works well but FDA collapses the samples in different classes into a single cluster. The reason for the failure of FDA is that the levels of the between-class scatter and the within-class scatter are not evaluated in an intuitively natural way because of the two separate clusters in one of the classes (see also Fukunaga, 1990).

3. Local Fisher Discriminant Analysis

As illustrated in Figure 1, FDA can perform poorly if samples in a class form several separate clusters (i.e., are multimodal). In other words, the undesired behavior of FDA is caused by the globality of the evaluation of the within-class scatter and the between-class scatter (e.g., Figure 1(c)).
On the other hand, because of the unsupervised nature of LPP, it can overlap samples in different classes if they are close in the original high-dimensional space R^d (e.g., Figure 1(b)). To overcome these problems, we propose combining the ideas of FDA and LPP; more specifically, we evaluate the levels of the between-class scatter and the within-class scatter in a local manner. This allows us to attain between-class separation and within-class local structure preservation at the same time. We call our new method local Fisher discriminant analysis (LFDA).

3.1 Reformulating FDA

In order to introduce LFDA, let us first reformulate FDA in a pairwise manner.

[Figure 1 consists of three scatter plots, (a) toy data set 1, (b) toy data set 2, and (c) toy data set 3, each showing the embedding directions found by FDA, LPP, and LFDA.]

Figure 1: Examples of dimensionality reduction by FDA, LPP, and LFDA. Two-dimensional two-class samples are embedded into a one-dimensional space. The lines in the figure denote the one-dimensional embedding space (onto which the data samples are projected) obtained by each method.

Lemma 1  S^(w) and S^(b) defined by Eqs. (1) and (2) can be expressed as

    S^(w) = (1/2) Σ_{i,j=1}^{n} W^(w)_{i,j} (x_i − x_j)(x_i − x_j)⊤,  (5)

    S^(b) = (1/2) Σ_{i,j=1}^{n} W^(b)_{i,j} (x_i − x_j)(x_i − x_j)⊤,  (6)

where

    W^(w)_{i,j} ≡ 1/n_l         if y_i = y_j = l,
                  0             if y_i ≠ y_j,  (7)

    W^(b)_{i,j} ≡ 1/n − 1/n_l   if y_i = y_j = l,
                  1/n           if y_i ≠ y_j.  (8)

A proof of Lemma 1 is given in Appendix A. Note that 1/n − 1/n_l in Eq. (8) is negative while 1/n_l and 1/n in Eqs. (7) and (8) are positive. This implies that if the data pairs in the same class are made close, the within-class scatter matrix S^(w) gets small and the between-class scatter matrix S^(b) gets large. On the other hand, if the data pairs in different classes are separated from each other, the between-class scatter matrix S^(b) gets large. Therefore, we may interpret FDA as keeping the sample pairs in the same class close and the sample pairs in different classes apart. A more formal discussion of this interpretation is given in Appendix B.

3.2 Definition and Typical Behavior of LFDA

Based on the above pairwise expression, let us define the local within-class scatter matrix S̃^(w) and the local between-class scatter matrix S̃^(b) as follows:

    S̃^(w) ≡ (1/2) Σ_{i,j=1}^{n} W̃^(w)_{i,j} (x_i − x_j)(x_i − x_j)⊤,  (9)

    S̃^(b) ≡ (1/2) Σ_{i,j=1}^{n} W̃^(b)_{i,j} (x_i − x_j)(x_i − x_j)⊤,

where

    W̃^(w)_{i,j} ≡ A_{i,j}/n_l              if y_i = y_j = l,
                   0                        if y_i ≠ y_j,  (10)

    W̃^(b)_{i,j} ≡ A_{i,j} (1/n − 1/n_l)    if y_i = y_j = l,
                   1/n                      if y_i ≠ y_j.  (11)

Namely, according to the affinity A_{i,j}, we weight the values for the sample pairs in the same class. This means that far-apart sample pairs in the same class have less influence on S̃^(w) and S̃^(b). Note that we do not weight the values for the sample pairs in different classes since we want to separate them from each other irrespective of the affinity in the original space. From here on, we denote the local counterparts of matrices by symbols with a tilde.
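Lemma 1 is easy to check numerically: build the pairwise weight matrices of Eqs. (7)-(8), form the sums of Eqs. (5)-(6), and compare against the direct definitions of Eqs. (1)-(2). The following sketch assumes NumPy is available; function names and the random test data are illustrative:

```python
import numpy as np

def pairwise_scatter(X, y):
    """Scatter matrices via the pairwise expressions of Lemma 1 (Eqs. 5-8)."""
    n, d = X.shape
    counts = {l: int(np.sum(y == l)) for l in np.unique(y)}
    Ww = np.zeros((n, n))
    Wb = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if y[i] == y[j]:
                Ww[i, j] = 1.0 / counts[y[i]]            # Eq. (7)
                Wb[i, j] = 1.0 / n - 1.0 / counts[y[i]]  # Eq. (8): negative weight
            else:
                Wb[i, j] = 1.0 / n                       # Eq. (8): positive weight
    diff = X[:, None, :] - X[None, :, :]                 # diff[i, j] = x_i - x_j
    Sw = 0.5 * np.einsum('ij,ijk,ijl->kl', Ww, diff, diff)  # Eq. (5)
    Sb = 0.5 * np.einsum('ij,ijk,ijl->kl', Wb, diff, diff)  # Eq. (6)
    return Sw, Sb

def direct_scatter(X, y):
    """Scatter matrices via the direct definitions, Eqs. (1)-(2)."""
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1],) * 2)
    Sb = np.zeros((X.shape[1],) * 2)
    for l in np.unique(y):
        Xl = X[y == l]
        mu_l = Xl.mean(axis=0)
        Sw += (Xl - mu_l).T @ (Xl - mu_l)
        Sb += len(Xl) * np.outer(mu_l - mu, mu_l - mu)
    return Sw, Sb

# The two routes must agree on arbitrary labeled data.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 3))
y = np.array([0] * 5 + [1] * 4 + [2] * 3)
Sw_pair, Sb_pair = pairwise_scatter(X, y)
Sw_dir, Sb_dir = direct_scatter(X, y)
```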

We define the LFDA transformation matrix T_LFDA as

    T_LFDA ≡ argmax_{T ∈ R^{d×r}} tr( (T⊤ S̃^(w) T)^{−1} T⊤ S̃^(b) T ).  (12)

That is, we look for a transformation matrix T such that nearby data pairs in the same class are made close and the data pairs in different classes are separated from each other; far-apart data pairs in the same class are not imposed to be close. Eq. (12) is of the same form as Eq. (3). Therefore, we can similarly compute an analytic form of T_LFDA by solving a generalized eigenvalue problem of S̃^(b) and S̃^(w). An efficient implementation of LFDA is summarized as pseudo code in Figure 2 (see Appendix C for detail).

Toy examples of dimensionality reduction by LFDA are illustrated in Figure 1. We used the local scaling method for computing the affinity matrix A (see Appendix D.4). Note that we perform the nearest neighbor search in the local scaling method in a classwise manner since we do not need the affinity values for the sample pairs in different classes (see Eqs. 10 and 11). This highly contributes to reducing the computational cost (see Appendix C). Figure 1 shows that LFDA gives desirable results for all three data sets, that is, LFDA can compensate for the drawbacks of FDA and LPP by effectively combining the ideas of FDA and LPP.

If the affinity value A_{i,j} is set to 1 for all sample pairs (i.e., all pairs are equally close to each other), S̃^(w) and S̃^(b) agree with S^(w) and S^(b), respectively, and LFDA is reduced to the original FDA. Therefore, LFDA may be regarded as a natural localized variant of FDA.

3.3 Properties of LFDA

Here we discuss fundamental properties of LFDA. First, we give an interpretation of LFDA in terms of the pointwise scatter. S̃^(w) can be expressed as

    S̃^(w) = (1/2) Σ_{i=1}^{n} (1/n_{y_i}) P^(w)_i,

where n_{y_i} is the number of samples in the class to which the sample x_i belongs and P^(w)_i is the pointwise local within-class scatter matrix around x_i:

    P^(w)_i ≡ Σ_{j : y_j = y_i} A_{i,j} (x_j − x_i)(x_j − x_i)⊤.
Therefore, minimizing S̃^(w) corresponds to minimizing the weighted sum of the pointwise local within-class scatter matrices over all samples. S̃^(b) can also be expressed in a similar way as

    S̃^(b) = (1/2) Σ_{i=1}^{n} (1/n − 1/n_{y_i}) P^(w)_i + (1/(2n)) Σ_{i=1}^{n} P^(b)_i,  (13)

where P^(b)_i is the pointwise between-class scatter matrix around x_i:

    P^(b)_i ≡ Σ_{j : y_j ≠ y_i} (x_j − x_i)(x_j − x_i)⊤.

Input:  Labeled samples {(x_i, y_i) | x_i ∈ R^d, y_i ∈ {1, 2, ..., c}}_{i=1}^{n};
        dimensionality of embedding space r (1 ≤ r ≤ d)
Output: d × r transformation matrix T_LFDA

 1: S̃^(b) ← 0_{d×d};
 2: S̃^(w) ← 0_{d×d};
 3: for l = 1, 2, ..., c                  % Compute scatter matrices in a classwise manner
 4:   {x_i}_{i=1}^{n_l} ← {x_j}_{j : y_j = l};
 5:   for i = 1, 2, ..., n_l              % Determine local scaling
 6:     x_i^(7) ← 7th nearest neighbor of x_i among {x_j}_{j=1}^{n_l};
 7:     σ_i ← ‖x_i − x_i^(7)‖;
 8:   end
 9:   for i, j = 1, 2, ..., n_l           % Define affinity matrix
10:     A_{i,j} ← exp(−‖x_i − x_j‖² / (σ_i σ_j));
11:   end
12:   X ← (x_1 | x_2 | ... | x_{n_l});
13:   G ← X diag(A 1_{n_l}) X⊤ − X A X⊤;
14:   S̃^(b) ← S̃^(b) + G/n + (1 − n_l/n) X X⊤ + X 1_{n_l} (X 1_{n_l})⊤ / n;
15:   S̃^(w) ← S̃^(w) + G/n_l;
16: end
17: S̃^(b) ← S̃^(b) − X 1_n (X 1_n)⊤ / n − S̃^(w);   % here X and 1_n refer to all samples
18: {λ̃_k, φ̃_k}_{k=1}^{r} ← generalized eigenvalues and normalized eigenvectors of S̃^(b) φ̃ = λ̃ S̃^(w) φ̃;   % λ̃_1 ≥ λ̃_2 ≥ ... ≥ λ̃_d
19: T_LFDA = (√λ̃_1 φ̃_1 | √λ̃_2 φ̃_2 | ... | √λ̃_r φ̃_r);

Figure 2: Efficient implementation of LFDA (see Appendix C for detail). The affinity matrix is computed by the local scaling method (see Appendix D.4). Within the loop (lines 4-15), the matrices and vectors denote the classwise counterparts of the original ones. 0_{d×d} denotes the d × d zero matrix, 1_{n_l} denotes the n_l-dimensional vector of ones, and diag(A 1_{n_l}) denotes the diagonal matrix with diagonal elements A 1_{n_l}. The generalized eigenvectors in line 18 are normalized by Eq. (14), which is often automatically carried out by an eigensolver. The weighting scheme of the eigenvectors in line 19 is explained in Section 3.3.

A possible bottleneck of the above implementation is the nearest neighbor search in line 6. This could be alleviated by incorporating prior knowledge of the data structure or by approximation (see Saul and Roweis, 2003, and references therein). Another possible bottleneck is the computation of X A X⊤ in line 13, which could be eased by sparsely defining the affinity matrix (see Appendix D). A MATLAB implementation is available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/lfda/.
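The logic of Figure 2 can also be written directly from the pairwise sums of Eqs. (9)-(11), which is slower (O(n²) pairs) but easier to follow than the classwise G-matrix updates. The following is an illustrative sketch, not the paper's implementation: NumPy/SciPy are assumed, the function name `lfda` and the toy data are invented, and the tiny ridge added to S̃^(w) is a numerical-safety assumption that is not part of the algorithm:

```python
import numpy as np
from scipy.linalg import eigh

def lfda(X, y, r, knn=7):
    """Local Fisher discriminant analysis (Eqs. 9-12 and 15): a direct sketch.

    X is (n, d), y holds integer labels, r is the embedding dimension.
    The affinity uses the local-scaling heuristic with the knn-th
    neighbor (7 in Figure 2), computed classwise as in the pseudo code.
    """
    n, d = X.shape
    A = np.zeros((n, n))
    for l in np.unique(y):                        # classwise affinity (lines 3-11)
        idx = np.where(y == l)[0]
        Xl = X[idx]
        dist = np.sqrt(((Xl[:, None] - Xl[None]) ** 2).sum(-1))
        k = min(knn, len(idx) - 1)
        sigma = np.sort(dist, axis=1)[:, k]       # distance to k-th nearest neighbor
        A[np.ix_(idx, idx)] = np.exp(-dist ** 2 / (np.outer(sigma, sigma) + 1e-12))
    counts = {l: int(np.sum(y == l)) for l in np.unique(y)}
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for i in range(n):                            # pairwise sums, Eqs. (9)-(11)
        for j in range(n):
            xij = np.outer(X[i] - X[j], X[i] - X[j])
            if y[i] == y[j]:
                Sw += 0.5 * (A[i, j] / counts[y[i]]) * xij
                Sb += 0.5 * A[i, j] * (1.0 / n - 1.0 / counts[y[i]]) * xij
            else:
                Sb += 0.5 * (1.0 / n) * xij
    lam, Phi = eigh(Sb, Sw + 1e-9 * np.eye(d))    # ridge for numerical safety only
    lam, Phi = lam[::-1][:r], Phi[:, ::-1][:, :r]
    return Phi * np.sqrt(np.maximum(lam, 0.0))    # sqrt-eigenvalue weights, Eq. (15)

# Two compact classes embedded onto one dimension.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.],
              [5., 5.], [5., 6.], [6., 5.], [6., 6.]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
T = lfda(X, y, 1)
z = X @ T
```

With a constant affinity (A_{i,j} = 1) this reduces to plain FDA, matching the remark in Section 3.2.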

Note that P^(b)_i does not include the localization factor A_{i,j}. Eq. (13) implies that maximizing S̃^(b) corresponds to minimizing the weighted sum of the pointwise local within-class scatter matrices and maximizing the sum of the pointwise between-class scatter matrices.

Next, we discuss the issue of eigenvalue multiplicity in LFDA. The original FDA allows us to extract at most c − 1 meaningful features since the between-class scatter matrix S^(b) has rank at most c − 1 (Fukunaga, 1990). On the other hand, the local between-class scatter matrix S̃^(b) generally has a much higher rank with less eigenvalue multiplicity, thanks to the localization factor A_{i,j} included in W̃^(b) (see Eq. 11). In the simulations shown in Section 5, S̃^(b) is always full rank for various data sets. Therefore, the proposed LFDA can be practically employed for dimensionality reduction into any dimensional space. This is a very important and significant improvement over the original FDA.

Finally, we discuss the invariance property of LFDA. The value of the LFDA criterion (12) is invariant under linear transformations, that is, for any r × r invertible matrix H, T_LFDA H is also a solution of Eq. (12). Therefore, the solution T_LFDA is not unique: the range of the transformation is uniquely determined, but the distance metric (Goldberger et al., 2005; Globerson and Roweis, 2006; Weinberger et al., 2006) in the embedding space can be arbitrary because of the arbitrariness of the matrix H.

In practice, we propose determining the LFDA transformation matrix T_LFDA as follows. First, we rescale the generalized eigenvectors {φ̃_k}_{k=1}^{d} so that

    φ̃_k⊤ S̃^(w) φ̃_{k'} = 1 if k = k',
                          0 if k ≠ k'.  (14)

Note that this rescaling is often automatically carried out by an eigensolver. Then we weight each generalized eigenvector by the square root of its associated generalized eigenvalue, that is,

    T_LFDA = (√λ̃_1 φ̃_1 | √λ̃_2 φ̃_2 | ... | √λ̃_r φ̃_r),  (15)

where λ̃_1 ≥ λ̃_2 ≥ ... ≥ λ̃_d. This weighting scheme weakens the influence of minor eigenvectors and is shown to work well in experiments (see Section 5).
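The normalization of Eq. (14) is exactly the B-orthonormalization that standard dense generalized eigensolvers perform automatically, which is easy to verify. In this sketch, `scipy.linalg.eigh` stands in for "an eigensolver" (an assumption of the example, not the paper), and the symmetric matrices are random stand-ins for S̃^(b) and S̃^(w):

```python
import numpy as np
from scipy.linalg import eigh

# Verify that a standard generalized eigensolver returns eigenvectors
# normalized as in Eq. (14): phi_k' S_w phi_k' = 1 if k = k', else 0.
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
Sw = B @ B.T + 4 * np.eye(4)   # generic symmetric positive definite "S~(w)"
C = rng.normal(size=(4, 4))
Sb = C @ C.T                   # generic symmetric positive semi-definite "S~(b)"
lam, Phi = eigh(Sb, Sw)        # solves Sb phi = lambda Sw phi
gram = Phi.T @ Sw @ Phi        # should equal the identity matrix, i.e. Eq. (14)
```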
3.4 Kernel LFDA for Non-Linear Dimensionality Reduction

Here we show how LFDA can be extended to non-linear dimensionality reduction scenarios. As detailed in Appendix C, the generalized eigenvalue problem that needs to be solved in LFDA can be expressed as

    X L̃^(b) X⊤ φ̃ = λ̃ X L̃^(w) X⊤ φ̃,  (16)

where L̃^(b) = L̃^(m) − L̃^(w), and L̃^(m) and L̃^(w) are defined by Eqs. (33) and (35), respectively. Since X⊤ φ̃ in Eq. (16) belongs to the range of X⊤, it can be expressed by using some vector α̃ ∈ R^n as

    X⊤ φ̃ = X⊤ X α̃ = K α̃,

where K is the n × n matrix with the (i, j)-th element being

    K_{i,j} ≡ x_i⊤ x_j.

Then multiplying Eq. (16) by X⊤ from the left-hand side yields

    K L̃^(b) K α̃ = λ̃ K L̃^(w) K α̃.  (17)

This implies that {x_i}_{i=1}^{n} appear only in terms of their inner products. Therefore, we can obtain a non-linear variant of LFDA by the kernel trick (Vapnik, 1998; Schölkopf et al., 1998), which is explained below.

Let us consider a non-linear mapping φ(x) from R^d to a reproducing kernel Hilbert space H (Aronszajn, 1950). Let K(x, x′) be the reproducing kernel of H. A typical choice of the kernel function would be the Gaussian kernel:

    K(x, x′) = exp( −‖x − x′‖² / (2σ²) ),

with σ > 0. For other choices, see, for example, Wahba (1990), Vapnik (1998), and Schölkopf and Smola (2002). Because of the reproducing property of K(x, x′), K is now the kernel matrix, that is, the (i, j)-th element is given by

    K_{i,j} = ⟨φ(x_i), φ(x_j)⟩ = K(x_i, x_j),

where ⟨·, ·⟩ denotes the inner product in H.

It can be confirmed that L̃^(w) is always degenerate (since L̃^(w) (1, 1, ..., 1)⊤ always vanishes; see Eq. 35 for detail). Therefore, K L̃^(w) K is always degenerate and we cannot directly solve the generalized eigenvalue problem (17). To cope with this problem, we propose regularizing K L̃^(w) K and solving the following generalized eigenvalue problem instead (cf. Friedman, 1989):

    K L̃^(b) K α̃ = λ̃ (K L̃^(w) K + ε I_n) α̃,  (18)

where ε is a small constant. Let {α̃_k}_{k=1}^{n} be the generalized eigenvectors associated with the generalized eigenvalues λ̃_1 ≥ λ̃_2 ≥ ... ≥ λ̃_n of Eq. (18). Then the embedded image of φ(x′) in H is given by

    ( √λ̃_1 α̃_1 | √λ̃_2 α̃_2 | ... | √λ̃_r α̃_r )⊤ ( K(x_1, x′), K(x_2, x′), ..., K(x_n, x′) )⊤.

We call this kernelized variant of LFDA kernel LFDA (KLFDA).

Recently, kernel functions for non-vectorial structured data such as strings, trees, and graphs have been proposed (see, e.g., Lodhi et al., 2002; Duffy and Collins, 2002; Kashima and Koyanagi, 2002; Kondor and Lafferty, 2002; Kashima et al., 2003; Gärtner et al., 2003; Gärtner, 2003). Since KLFDA uses the samples only via the kernel function K(x, x′), it allows us to reduce the dimensionality of such non-vectorial data.
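A compact sketch of KLFDA with the Gaussian kernel and the regularization of Eq. (18). Since Appendix C is not reproduced in this excerpt, the Laplacian-style matrices below are built from the pairwise weights of Eqs. (10)-(11) via the standard identity (1/2) Σ_{i,j} W_{i,j} (x_i − x_j)(x_i − x_j)⊤ = X (D − W) X⊤ with D = diag(W1); this construction, the function names, and the toy data are assumptions of the example, and NumPy/SciPy are assumed available:

```python
import numpy as np
from scipy.linalg import eigh

def klfda(X, y, r, sigma=1.0, eps=1e-3, A=None):
    """Kernel LFDA sketch: Gaussian kernel, regularized eigenproblem (Eq. 18).

    Returns a function embedding a new point x' via the kernel vector
    (K(x_1, x'), ..., K(x_n, x')) and sqrt-eigenvalue-weighted coefficients.
    """
    n = len(y)
    if A is None:
        A = np.ones((n, n))   # constant affinity: weights reduce to plain FDA
    counts = {l: int(np.sum(y == l)) for l in np.unique(y)}
    Ww = np.zeros((n, n))
    Wb = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if y[i] == y[j]:
                Ww[i, j] = A[i, j] / counts[y[i]]                    # Eq. (10)
                Wb[i, j] = A[i, j] * (1.0 / n - 1.0 / counts[y[i]])  # Eq. (11)
            else:
                Wb[i, j] = 1.0 / n
    Lw = np.diag(Ww.sum(axis=1)) - Ww     # Laplacian of the within-class weights
    Lb = np.diag(Wb.sum(axis=1)) - Wb     # Laplacian of the between-class weights
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian kernel matrix
    lam, Alpha = eigh(K @ Lb @ K, K @ Lw @ K + eps * np.eye(n))  # Eq. (18)
    lam, Alpha = lam[::-1][:r], Alpha[:, ::-1][:, :r]
    coef = Alpha * np.sqrt(np.maximum(lam, 0.0))

    def embed(x):
        kx = np.exp(-((X - x) ** 2).sum(-1) / (2.0 * sigma ** 2))
        return coef.T @ kx

    return embed

# Embed two small classes onto one dimension and project the training points.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])
embed = klfda(X, y, r=1, sigma=2.0)
Z = np.array([embed(x) for x in X]).ravel()
```

The returned `embed` function works for unseen points as well, since it only evaluates the kernel against the training samples.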
4. Comparison with Related Methods

In this section, we discuss the relation between the proposed LFDA and other methods.

4.1 Dimensionality Reduction Using Local Discriminant Information

A discriminant adaptive nearest neighbor (DANN) classifier (Hastie and Tibshirani, 1996a) employs an adapted distance metric at each test point for classification. Based on a similar idea, the authors also proposed a global supervised dimensionality reduction method using local discriminant information in the same paper. We refer to this supervised dimensionality reduction method as LDI. The main idea of LDI is to localize FDA, which is very similar to the proposed LFDA. Here we discuss the relation between LDI and LFDA.

In LDI, the data samples {x_i}_{i=1}^{n} are first sphered according to the within-class scatter matrix S^(w), that is, for i = 1, 2, ..., n,

    x_i ← (S^(w))^{−1/2} x_i.

Let A_{i,j} be the weight of sample x_j around x_i, defined by

    A_{i,j} ≡ ( 1 − (‖x_i − x_j‖ / ‖x_i − x_i^(K)‖)³ )³   if ‖x_i − x_j‖ < ‖x_i − x_i^(K)‖,
              0                                           otherwise,

where x_i^(K) is the K-th nearest neighbor of x_i in the sphered space. Note that 0 ≤ A_{i,j} ≤ 1 and A_{i,j} is non-increasing as ‖x_i − x_j‖ increases. Thus it has the same meaning as our affinity matrix. K is suggested to be determined by K = max(n/5, 50).

Let μ_l^[i] be the local weighted mean of the sphered samples in class l around x_i, and let μ^[i] be the local weighted mean of all sphered samples around x_i:

    μ_l^[i] ≡ (1/n_l^[i]) Σ_{j : y_j = l} A_{i,j} x_j,

    μ^[i] ≡ (1/n^[i]) Σ_{j=1}^{n} A_{i,j} x_j = (1/n^[i]) Σ_{l=1}^{c} n_l^[i] μ_l^[i],

where

    n_l^[i] ≡ Σ_{j : y_j = l} A_{i,j},

    n^[i] ≡ Σ_{j=1}^{n} A_{i,j}.

Let S̄^(b) be the average between sum-of-squares matrix defined as

    S̄^(b) ≡ Σ_{i=1}^{n} (1/n^[i]) Σ_{l=1}^{c} n_l^[i] (μ_l^[i] − μ^[i])(μ_l^[i] − μ^[i])⊤.

The LDI transformation matrix for the sphered samples, T̄_LDI, is defined as

    T̄_LDI ≡ argmax_{T ∈ R^{d×r}} tr( T⊤ S̄^(b) T )  subject to T⊤ T = I_r.

T̃_LDI is a transformation matrix for sphered samples; the LDI transformation matrix T_LDI for non-sphered samples is given by

    T_LDI = (S^(w))^{−1/2} T̃_LDI.

Similar to FDA (and LFDA), T_LDI can be efficiently computed by solving a generalized eigenvalue problem.

The average between sum-of-squares matrix S̄^(b) is conceptually very similar to the local between-class scatter matrix S̃^(b) in LFDA. Indeed, as proved in Appendix E, we can express S̄^(b) in a pairwise manner as

    S̄^(b) = (1/2) Σ_{i,j=1}^n W̄^(b)_{i,j} (x_i − x_j)(x_i − x_j)^⊤,    (19)

where

    W̄^(b)_{i,j} ← Σ_{k=1}^n (1/ñ^[k]) ( 1/ñ^[k] − 1/ñ_l^[k] ) Ã_{i,k} Ã_{j,k}   if y_i = y_j = l,
    W̄^(b)_{i,j} ← Σ_{k=1}^n (1/(ñ^[k])²) Ã_{i,k} Ã_{j,k}   if y_i ≠ y_j.    (20)

However, there exist critical differences between LDI and LFDA. A significant difference is that the values for the sample pairs in different classes are also localized in LDI (see Eq. 20), while they are kept unlocalized in LFDA (see Eq. 11). This implies that far-apart sample pairs in different classes could be made close in LDI, which is not desirable in supervised dimensionality reduction. Furthermore, the computation of S̄^(b) is slightly less efficient than that of S̃^(b) since W̄^(b) includes the summation over k.

Another important difference between LDI and LFDA is that the within-class scatter matrix S^(w) is not localized in LDI. However, as we showed in Section 3.1, the within-class scatter matrix S^(w) also accounts for collapsing the within-class multimodal structure (i.e., far-apart sample pairs in the same class are made close). This phenomenon is experimentally confirmed in Section 5.2.

4.2 Mixture Discriminant Analysis

FDA can be interpreted as maximum likelihood estimation of Gaussian distributions with common covariance and different means for each class. Based on this view, Hastie and Tibshirani (1996b) proposed mixture discriminant analysis (MDA), which extends FDA to maximum likelihood estimation of Gaussian mixture distributions. A maximum likelihood solution is obtained by an EM-type algorithm (cf. Dempster et al., 1977).
However, this is an iterative algorithm and gives only a locally optimal solution. Therefore, the computation of MDA is rather slow and there is no guarantee that the global solution can be obtained. Furthermore, the number of mixture components (clusters) in each class as well as the initial location of the cluster centers should be determined by users. For the cluster centers, using standard techniques such as k-means clustering (MacQueen, 1967; Everitt et al., 2001) or learning vector quantization (Kohonen, 1989) is recommended. However, these are also iterative algorithms and have no guarantee that the global solution can be obtained. Furthermore, there seems to be no systematic method for determining the number of clusters. On the other hand, the proposed LFDA contains no tuning parameters (given that the affinity matrix is determined by the local scaling method; see Appendix D.4) and the global solution can

be obtained analytically. However, it still lacks a probabilistic interpretation, which remains open currently.

4.3 Neighborhood Component Analysis

Goldberger et al. (2005) proposed a supervised dimensionality reduction method called neighborhood component analysis (NCA). The NCA transformation matrix T_NCA is defined as follows:

    T_NCA ← argmax_{T ∈ R^{d×r}} Σ_{i=1}^n Σ_{j: y_j = y_i} p_{i,j}(T T^⊤),    (21)

where

    p_{i,j}(U) ← exp( −(x_i − x_j)^⊤ U (x_i − x_j) ) / Σ_{k ≠ i} exp( −(x_i − x_k)^⊤ U (x_i − x_k) )   if i ≠ j,
    p_{i,j}(U) ← 0   if i = j.    (22)

The above definition corresponds to maximizing the expected number of correctly classified samples by a stochastic variant of nearest neighbor classifiers. Therefore, NCA seeks a transformation matrix T such that the between-class separability is maximized. Eqs. (21) and (22) imply that nearby data pairs in the same class are made close, which is similar to the proposed LFDA. Indeed, the simulation results in Section 5.2 show that NCA tends to preserve the multimodal structure of the data very well.

However, a crucial weakness of NCA is optimization: the optimization problem (21) is non-convex. Therefore, there is no guarantee that the globally optimal solution can be obtained. Goldberger et al. (2005) proposed using a gradient ascent method for optimization:

    T ← T + ε ∇J_NCA(T),    (23)

where ε (> 0) is the step size and the gradient ∇J_NCA(T) is given by

    ∇J_NCA(T) = 2 Σ_{i=1}^n ( ( Σ_{j: y_j = y_i} p_{i,j}(T T^⊤) ) Σ_{j=1}^n p_{i,j}(T T^⊤)(x_i − x_j)(x_i − x_j)^⊤
                              − Σ_{j: y_j = y_i} p_{i,j}(T T^⊤)(x_i − x_j)(x_i − x_j)^⊤ ) T.

The gradient ascent iteration (23) is computationally rather inefficient. Also, the choice of the step size ε is troublesome. If the step size is small enough, convergence to one of the local optima is guaranteed, but such a choice makes the convergence very slow; on the other hand, if the step size is too large, gradient flows oscillate and proper convergence properties may not be guaranteed anymore. Furthermore, the choice of the termination condition in the iterative algorithm is often cumbersome in practice.
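The stochastic-neighbor objective (21)-(22) can be sketched in a few lines. The following is a minimal illustration (the function name and toy data are ours): it evaluates the expected number of correctly classified samples for a fixed projection T, without performing the gradient ascent.

```python
import numpy as np

def nca_objective(T, X, y):
    """Sketch of the NCA objective (21): expected number of correctly
    classified samples under the stochastic neighbor probabilities (22),
    with U = T T^T realized by projecting the data through T."""
    Z = X @ T                             # project to r dimensions
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)          # enforce p_ii = 0
    E = np.exp(-sq)
    P = E / E.sum(axis=1, keepdims=True)  # softmax over neighbors k != i
    same = (y[:, None] == y[None, :])
    return (P * same).sum()               # sum_i sum_{j: y_j = y_i} p_ij

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
y = np.array([0] * 20 + [1] * 20)
X[y == 1] += 3.0                          # make the classes separable
T = np.eye(5)[:, :2]                      # project onto first two coordinates
obj = nca_objective(T, X, y)
```

With well-separated classes, almost all the neighbor probability mass stays within each class, so the objective approaches its maximum n.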
Because of the non-convexity of the optimization problem, the quality of the obtained solution depends on the initialization of the matrix T. A useful heuristic to alleviate the local optimum problem is to employ the FDA (or LFDA) result as an initial matrix for optimization (Goldberger et al., 2005). In the experiments in Section 5, using the LFDA result as an initial matrix appears to be better than random initialization. However, the local optima problem still remains even with the above heuristic.

When a dimensionality reduction technique is applied to classification tasks, we often want to embed the data samples into spaces with several different dimensionalities; the best dimensionality is later chosen by, for example, cross-validation (Stone, 1974; Wahba, 1990). In such a scenario, NCA requires optimizing the transformation matrix individually for each dimensionality r of the embedding space. On the other hand, LFDA needs to compute the transformation matrix only once for the largest r; its sub-matrices become the optimal solutions for smaller dimensionalities. Therefore, LFDA is computationally more efficient than NCA in this scenario.

A simple MATLAB implementation of NCA is available.4 We use this software in Section 5.

4.4 Maximally Collapsing Metric Learning

In order to overcome the computational problem of NCA, Globerson and Roweis (2006) proposed an alternative method called maximally collapsing metric learning (MCML). Let p*_{i,j} be the ideal value of p_{i,j}(U) defined by Eq. (22):

    p*_{i,j} ∝ 1   if y_i = y_j,
    p*_{i,j} ∝ 0   if y_i ≠ y_j,

where p*_{i,j} is normalized so that Σ_{j ≠ i} p*_{i,j} = 1.

p*_{i,j} can be attained if all samples in the same class collapse into a single point while samples in other classes are mapped to other locations. In reality, however, no U may be able to attain p_{i,j}(U) = p*_{i,j} exactly; instead, the optimal approximation to p*_{i,j} under the Kullback-Leibler divergence (Kullback and Leibler, 1951) is obtained. This is formally defined as

    U_MCML ← argmin_{U ∈ R^{d×d}} Σ_{i,j} p*_{i,j} log( p*_{i,j} / p_{i,j}(U) )   subject to U ∈ PSD(r),    (24)

where PSD(r) is the set of all positive semidefinite matrices of rank r (i.e., r eigenvalues are positive and the others are zero). Once U_MCML is obtained, the MCML transformation matrix T_MCML is computed by

    T_MCML = (φ_1 | φ_2 | ... | φ_r),    (25)

where {φ_k}_{k=1}^r are the eigenvectors associated with the positive eigenvalues η_1 ≥ η_2 ≥ ... ≥ η_r > 0 of the following eigenvalue problem:

    U_MCML φ = η φ.

One of the motivations of MCML is to alleviate the difficulty of optimization in NCA.
However, MCML still has a weakness in optimization: the optimization problem (24) is convex only when r = d, that is, when the dimensionality is not reduced but only the distance metric of the original space is changed. This means that if r < d (which is our primary focus in this paper), we may not be able to

4. Implementation available at http://www.cs.berkeley.edu/~fowlkes/software/nca/.

obtain the globally optimal solution. Globerson and Roweis (2006) proposed the following heuristic algorithm to approximate T_MCML. First, the optimization problem (24) with r = d is solved:

    Û_MCML ← argmin_{U ∈ R^{d×d}} Σ_{i,j} p*_{i,j} log( p*_{i,j} / p_{i,j}(U) )   subject to U ∈ PSD(d).    (26)

Although Eq. (26) is convex, an analytic form of the unique optimal solution Û_MCML is not known yet. Globerson and Roweis (2006) proposed using the following alternate iterative procedure for obtaining Û_MCML:

    U ← U − ε ∇J_MCML(U),    (27)
    U ← Σ_{k=1}^d max(0, η_k) φ_k φ_k^⊤,    (28)

where ε (> 0) is the step size, η_k and φ_k are the eigenvalues and eigenvectors of U, and the gradient ∇J_MCML(U) is given by

    ∇J_MCML(U) = Σ_{i,j} ( p*_{i,j} − p_{i,j}(U) ) (x_i − x_j)(x_i − x_j)^⊤.

Then the eigenvalue decomposition of Û_MCML is carried out and the eigenvalues η̂_1 ≥ η̂_2 ≥ ... ≥ η̂_d and associated eigenvectors {φ̂_k}_{k=1}^d are obtained:

    Û_MCML φ̂ = η̂ φ̂.

Finally, {φ_k}_{k=1}^r in Eq. (25) are replaced by {φ̂_k}_{k=1}^r, which yields

    T̂_MCML ← (φ̂_1 | φ̂_2 | ... | φ̂_r).    (29)

This approximation is shown to be practically useful (Globerson and Roweis, 2006), although there seems to be no theoretical analysis of this approximation.

MCML may have an advantage over NCA in computation: there exists the analytic approximation (29) that can be computed efficiently using the solution of another convex optimization problem (26). However, MCML still relies on the gradient-based alternate iterative algorithm (27)-(28) to solve the convex optimization problem (26), which is computationally very expensive since the eigenvalue decomposition of a d-dimensional matrix must be carried out in each iteration (see Eq. 28). Furthermore, the difficulty of appropriately choosing the step size and the termination condition in the iterative procedure still remains.

Since MCML requires all the samples in the same class to collapse into a single point, it is not necessarily useful in dimensionality reduction of multimodal data samples.
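The alternate iteration (27)-(28) can be sketched as a gradient step followed by a projection onto the PSD cone. The following is a minimal illustration under our own toy setup (function names and data are ours, and a constant step size is used for simplicity):

```python
import numpy as np

def psd_projection(U):
    """Projection step (28): clip negative eigenvalues so U stays PSD."""
    eta, phi = np.linalg.eigh((U + U.T) / 2)
    return (phi * np.maximum(eta, 0.0)) @ phi.T

def mcml_step(U, X, P_star, step=0.01):
    """One sketch iteration of (27)-(28): a gradient step on the KL
    objective (24), then projection back onto the PSD cone."""
    diff = X[:, None, :] - X[None, :, :]             # pairwise differences
    sq = np.einsum('ijk,kl,ijl->ij', diff, U, diff)  # (x_i-x_j)^T U (x_i-x_j)
    np.fill_diagonal(sq, np.inf)                     # enforce p_ii = 0
    E = np.exp(-sq)
    P = E / E.sum(axis=1, keepdims=True)
    grad = np.einsum('ij,ijk,ijl->kl', P_star - P, diff, diff)
    return psd_projection(U - step * grad)

# Toy usage: ideal probabilities p* put uniform mass on same-class pairs.
rng = np.random.default_rng(3)
X = rng.normal(size=(12, 3))
y = np.array([0, 1] * 6)
same = (y[:, None] == y[None, :]).astype(float)
np.fill_diagonal(same, 0.0)
P_star = same / same.sum(axis=1, keepdims=True)
U_new = mcml_step(np.eye(3), X, P_star)
```

The projection requires a full eigendecomposition of the d-dimensional matrix U at every iteration, which is exactly the cost concern raised in the text.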
Furthermore, the MCML results can be significantly influenced by outliers since the outliers are also required to collapse into the same single point together with the other samples. This phenomenon is illustrated in Figure 3, where a single outlier significantly changes the MCML result.

Globerson and Roweis (2006) showed that the sufficient statistics of the MCML algorithm are pointwise scatter matrices (cf. Section 3.3). Since LFDA also has an interpretation in terms of pointwise scatter matrices, there may be a link between LFDA and MCML; this needs to be investigated in future work.

Figure 3: Toy examples of dimensionality reduction. (Each panel plots the LFDA and MCML directions.) The toy data set 1 in panel (a) is equivalent to the one used in Figure 1(a). The data set 1′ in panel (b) includes a single outlier.

4.5 Remark on Rank Constraint

The optimization problem of MCML (see Eq. 24) is not generally convex since the rank constraint is non-convex (Boyd and Vandenberghe, 2004). The non-convexity induced by the rank constraint seems to be a universal problem in dimensionality reduction. NCA eliminates the rank constraint by decomposing U into T T^⊤ (see Eqs. 21 and 22). However, even with this decomposition, the optimization problem is still non-convex. On the other hand, FDA, LDI, and LFDA cast the optimization problem in the form of the Rayleigh quotient. This is computationally very advantageous since it allows us to analytically determine the range of the embedding space. However, we cannot determine the distance metric in the embedding space since the Rayleigh quotient is invariant under linear transformations. For this reason, an additional criterion is needed to determine the distance metric (see also Section 3.3).

5. Numerical Examples

In this section, we numerically evaluate the performance of LFDA and existing methods.

5.1 Exploratory Data Analysis

Here we use the Thyroid disease data set available from the UCI machine learning repository (Blake and Merz, 1998) and illustrate how LFDA can be used for exploratory data analysis. The original data consists of a 5-dimensional input vector x of the following laboratory tests:

1. T3-resin uptake test.
2. Total serum thyroxin as measured by the isotopic displacement method.

Figure 4: Histograms of the first feature values obtained by FDA and LFDA for the Thyroid disease data set. The top row corresponds to the sick patients (hyperthyroidism and hypothyroidism) and the bottom row corresponds to the healthy patients (euthyroidism).

3. Total serum triiodothyronine as measured by radioimmunoassay.
4. Basal thyroid-stimulating hormone (TSH) as measured by radioimmunoassay.
5. Maximal absolute difference of the TSH value after injection of 200 micrograms of thyrotropin-releasing hormone as compared to the basal value.

The task is to predict whether patients' thyroids are euthyroid, hypothyroid, or hyperthyroid (Coomans et al., 1983), that is, whether patients' thyroids are normal, hypo-functioning, or hyper-functioning (Blake and Merz, 1998). The diagnosis (the class label) is based on a complete medical record, including anamnesis, scan, etc. Here we merge the hypothyroidism class and the hyperthyroidism class into a single class and create binary labeled data (whether thyroids are normal or not). Our goal is to predict whether patients' thyroids are normal, hypo-functioning, or hyper-functioning from the binary labeled data samples.

Figure 4 depicts the histograms of the first feature values obtained by FDA and LFDA; the top row corresponds to the sick patients and the bottom row corresponds to the healthy patients. This shows that both FDA and LFDA separate the patients with normal thyroids from the sick patients reasonably well. In addition to between-class separability, LFDA clearly preserves the multimodal structure among the sick patients (i.e., hypo-functioning and hyper-functioning), which is lost by ordinary FDA.
Another interesting finding from the figure is that the first feature values obtained by LFDA have a strong negative correlation with the functioning level of thyroids; this could be used for predicting the functioning level of thyroids.

    Data set             d    '◦'-and-'△' class     '×' class
    Letter recognition   16   'A' & 'C'             'B'
    Iris                 4    Setosa & Virginica    Versicolour

Table 1: Two-class data sets used for visualization experiments (r = 2).

5.2 Data Visualization

Here we apply the proposed and existing dimensionality reduction methods to benchmark data sets and investigate how they behave in data visualization tasks. We use the Letter recognition data set and the Iris data set available from the UCI machine learning repository (Blake and Merz, 1998). Table 1 describes the specifications of the data sets. Each data set contains three types of samples, specified by '◦', '△', and '×'. We merged '◦' and '△' into a single class and created two-class problems.

We test LFDA, FDA, LPP, LDI, NCA, and MCML and evaluate the between-class separability (i.e., '◦' and '△' are well separated from '×') and the within-class multimodality preservation capability (i.e., '◦' and '△' are well grouped). For LPP and LFDA, we determined the affinity matrix by the local scaling method (see Appendix D.4). For NCA, we used the LFDA result as an initial matrix since this initialization scheme appears to work better than random initialization. FDA allows us to extract only one meaningful feature in two-class classification problems (see Section 2.2), so we chose the second feature randomly here.

Figures 5 and 6 depict the samples embedded in the two-dimensional space found by each method. The horizontal axis is the first feature found by each method, while the vertical axis is the second feature.

First, we compare the embedding results of LFDA with those of FDA and LPP. For the Letter recognition data set (see the top row of Figure 5), LFDA nicely separates samples in different classes from each other, and at the same time, it clearly preserves within-class multimodality. FDA separates '◦' and '△' from '×' well, but within-class multimodality is lost, that is, '◦' and '△' are mixed. LPP gives two separate clusters of samples, but samples in different classes are mixed in one of the clusters.
For the Iris data set (see the top row of Figure 6), LFDA simultaneously achieves between-class separation and within-class multimodality preservation. On the other hand, FDA tends to mix samples in different classes, which would be caused by within-class multimodality. LPP also works well for this data set because the three clusters are well separated from each other in the original high-dimensional space. Overall, LFDA is found to be more appropriate for embedding labeled multimodal data samples than FDA and LPP, implying that our primary goal has been successfully achieved.

Next, we compare the results of LFDA with those of LDI, NCA, and MCML. For the Letter recognition data set (see Figure 5), LFDA, LDI, NCA, and MCML separate the samples in different classes from each other very well. However, LDI and MCML collapse '◦' and '△' into a single cluster, while LFDA and NCA preserve the multimodal structure clearly. The NCA result is almost identical to the LFDA result (i.e., the initial value of the NCA iteration), but the result may vary if the initial value for the gradient ascent algorithm is changed. For the Iris data set (see Figure 6), LFDA, LDI, and NCA work excellently in both between-class separation and within-class multimodality preservation. On the other hand, MCML mixes the samples in different classes. Overall, LDI works fairly well, but the within-class multimodal structure is sometimes lost since LDI only partially takes within-class multimodality into account (see Section 4.1). NCA also works very well, which

Figure 5: Visualization of the Letter recognition data set. (Panels show the two-dimensional embeddings found by LFDA, FDA, LPP, LDI, NCA, and MCML.)

Figure 6: Visualization of the Iris data set. (Panels show the two-dimensional embeddings found by LFDA, FDA, LPP, LDI, NCA, and MCML.)

    Data name       Input dim.   # training   # test   # realizations
    banana               2           400        4900        100
    breast-cancer        9           200          77        100
    diabetes             8           468         300        100
    flare-solar          9           666         400        100
    german              20           700         300        100
    heart               13           170         100        100
    image               18          1300        1010         20
    ringnorm            20           400        7000        100
    splice              60          1000        2175         20
    thyroid*             5           140          75        100
    titanic              3           150        2051        100
    twonorm             20           400        7000        100
    waveform*           21           400        4600        100
    USPS-eo*           256          1000        1000         20
    USPS-sl*           256          1000        1000         20

Table 2: List of binary classification data sets. Data sets indicated by '*' contain intrinsic within-class multimodal structures.

implies that the heuristic to use the LFDA result as an initial value is useful. However, NCA does not provide significant performance improvement over LFDA in the above simulations. The MCML results have similar tendencies to FDA. Based on the above simulation results, we conclude that LFDA is a promising method for the visualization of multimodal labeled data.

5.3 Classification

Here we apply the proposed and existing dimensionality reduction techniques to classification tasks, and objectively evaluate the effectiveness of LFDA. There are several measures for quantitatively evaluating the separability of data samples in different classes (e.g., Fukunaga, 1990; Globerson et al., 2005). Here we use a simple one: the misclassification rate by a one-nearest-neighbor classifier. As explained in Section 3.3, the LFDA criterion is invariant under linear transformations, while the misclassification rate by a one-nearest-neighbor classifier depends on the distance metric. This means that the following simulation results are highly dependent on the normalization scheme (15).

We employ the IDA data sets,5 which are standard binary classification data sets originally used in Rätsch et al. (2001). In addition, we use two binary classification data sets created from the USPS handwritten digit data set. The first task (USPS-eo) is to separate the even numbers from the odd numbers, and the second task (USPS-sl) is to separate the small numbers ('0' to '4') from the large numbers ('5' to '9').
For training and testing, 100 samples are randomly chosen for each digit. Table 2 summarizes

5. Data sets available at http://ida.first.fraunhofer.de/projects/bench/benchmarks.htm.

    Data set        LFDA          LDI           NCA           MCML          LPP           PCA
    banana          13.7 ± 0.8    13.6 ± 0.8    14.3 ± 2.0    39.4 ± 6.7    13.6 ± 0.8    13.6 ± 0.8
    breast-cancer   34.7 ± 4.3    36.4 ± 4.9    34.9 ± 5.0    34.0 ± 5.8    33.5 ± 5.4    34.5 ± 5.0
    diabetes        32.0 ± 2.5    30.8 ± 1.9    n/a           31.2 ± 2.1    31.5 ± 2.5    31.2 ± 3.0
    flare-solar     39.2 ± 5.0    39.3 ± 4.8    n/a           n/a           39.2 ± 4.9    39.1 ± 5.1
    german          29.9 ± 2.8    30.7 ± 2.4    29.8 ± 2.6    31.3 ± 2.4    30.7 ± 2.4    30.2 ± 2.4
    heart           21.9 ± 3.7    23.9 ± 3.1    23.0 ± 4.3    23.3 ± 3.8    23.3 ± 3.8    24.3 ± 3.5
    image            3.2 ± 0.8     3.0 ± 0.6    n/a            4.7 ± 0.8     3.6 ± 0.7     3.4 ± 0.5
    ringnorm        21.1 ± 1.3    17.5 ± 1.0    21.8 ± 1.3    22.0 ± 1.2    20.6 ± 1.1    21.6 ± 1.4
    splice          16.9 ± 0.9    17.9 ± 0.8    n/a           17.3 ± 0.9    23.2 ± 1.2    22.6 ± 1.3
    thyroid*         4.6 ± 2.6     8.0 ± 2.9     4.5 ± 2.2    18.5 ± 3.8     4.2 ± 2.9     4.9 ± 2.6
    titanic         33.1 ± 11.9   33.1 ± 11.9   33.0 ± 11.9   33.1 ± 11.9   33.0 ± 11.9   33.0 ± 12.0
    twonorm          3.5 ± 0.4     4.1 ± 0.6     3.7 ± 0.6     3.5 ± 0.4     3.7 ± 0.7     3.6 ± 0.6
    waveform*       12.5 ± 1.0    20.7 ± 2.5    12.6 ± 0.8    17.9 ± 1.5    12.4 ± 1.0    12.7 ± 1.2
    USPS-eo*         9.0 ± 0.8    12.5 ± 0.9    n/a           n/a            7.2 ± 1.0     3.5 ± 0.7
    USPS-sl*        12.9 ± 1.2    25.9 ± 1.7    n/a           11.7 ± 1.3     7.5 ± 0.8     3.9 ± 0.8
    Computation time (ratio)
                     1.00          1.11         97.23         70.61          1.04          0.91

Table 3: Means and standard deviations of the misclassification rate (in percent) when the embedding dimensionality is chosen by cross-validation. For each data set, the best method and comparable ones based on the t-test at the significance level 5% are marked by '◦'. Data sets indicated by '*' contain intrinsic within-class multimodal structures.

the specifications of the data sets. The ringnorm, twonorm, and waveform data sets contain features with only noise. The thyroid, waveform, USPS-eo, and USPS-sl data sets contain intrinsic within-class multimodal structures since they are converted from multi-class problems by merging some of the classes. The banana data set is also multimodal. We test LFDA, LDI, NCA, MCML, LPP, and principal component analysis (PCA). Note that LPP and PCA are unsupervised dimensionality reduction methods, while the others are supervised methods.
NCA is not tested for the diabetes, flare-solar, image, splice, USPS-eo, and USPS-sl data sets, and MCML is not tested for the flare-solar and USPS-eo data sets, since the execution time is too long.

Figure 7 depicts the mean misclassification rate by a one-nearest-neighbor classifier as a function of the dimensionality r of the reduced space. The error bars are omitted for clear visibility. Instead, we plotted the results of the following significance test: for each dimensionality r, the best method in mean misclassification rate and comparable ones based on the t-test (Henkel, 1979) at the significance level 5% are marked by '◦'. The results show that LFDA works quite well, but overall there is no single best method that consistently outperforms the others.

Table 3 describes the mean and standard deviation of the misclassification rate by each method when the embedding dimensionality r is chosen by 5-fold cross-validation (Stone, 1974; Wahba, 1990); for the USPS-eo and USPS-sl data sets, we used 20-fold cross-validation since this was more accurate. For each data set, the best method and comparable ones based on the t-test at the significance level 5% are indicated by '◦'. The table shows that overall LFDA has excellent

Figure 7: Mean misclassification rates by a one-nearest-neighbor method as functions of the dimensionality of the embedding space. For each dimensionality, the best method and comparable ones based on the t-test at the significance level 5% are marked by '◦'. (Panels: banana, breast-cancer, diabetes, flare-solar, german, heart, image, ringnorm, splice, thyroid, titanic, twonorm, waveform, USPS-eo, and USPS-sl; each plots LFDA, LDI, NCA, MCML, LPP, and PCA against the reduced dimensionality r.)

    Data set        LFDA          EUCLID        FDA
    banana          13.7 ± 0.8    13.6 ± 0.8    13.6 ± 0.8
    breast-cancer   34.7 ± 4.3    32.7 ± 4.8    32.9 ± 4.5
    diabetes        32.0 ± 2.5    30.1 ± 2.1    30.6 ± 2.2
    flare-solar     39.2 ± 5.0    39.2 ± 5.1    39.0 ± 4.9
    german          29.9 ± 2.8    29.5 ± 2.5    30.5 ± 2.8
    heart           21.9 ± 3.7    23.2 ± 3.7    24.0 ± 3.7
    image            3.2 ± 0.8     3.4 ± 0.5     6.5 ± 1.7
    ringnorm        21.1 ± 1.3    35.0 ± 1.4    31.2 ± 1.6
    splice          16.9 ± 0.9    28.9 ± 1.5    33.7 ± 1.5
    thyroid          4.6 ± 2.6     4.4 ± 2.2     5.3 ± 2.5
    titanic         33.1 ± 11.9   33.0 ± 11.9   33.1 ± 11.9
    twonorm          3.5 ± 0.4     6.7 ± 0.7     5.0 ± 0.7
    waveform        12.5 ± 1.0    15.8 ± 0.7    17.6 ± 1.4
    USPS-eo          9.0 ± 0.8     3.6 ± 0.7    15.1 ± 2.4
    USPS-sl         12.9 ± 1.2     3.8 ± 0.8    13.3 ± 2.9

Table 4: Means and standard deviations of the misclassification rate. The LFDA results are copied from Table 3. 'EUCLID' denotes naive one-nearest-neighbor classification without dimensionality reduction. 'FDA' denotes naive one-nearest-neighbor classification after the samples are projected onto a one-dimensional FDA subspace.

performance. LDI and MCML also work quite well, but they tend to perform rather poorly for the multimodal data sets specified by '*'. NCA also works well, but it does not compare favorably with LFDA. Note that NCA with random initialization was slightly worse; therefore, our heuristic to use the LFDA results for initialization would be reasonable. LPP and PCA perform well, despite the fact that they are unsupervised dimensionality reduction methods. In particular, PCA has excellent performance for the USPS data sets since the projection onto the two-dimensional PCA subspace already gives a reasonably separate embedding (He and Niyogi, 2004).

The computation time of each method, summed over the 9 data sets for which NCA is tested, is described at the bottom of Table 3. For better comparison, we normalized the values by the computation time of LFDA. This shows that LFDA is much faster than NCA and MCML,6 and is comparable to LDI, LPP, and PCA.

The misclassification rates by naive one-nearest-neighbor classification without dimensionality reduction ('EUCLID') are described in Table 4.
The table shows that, on the whole, the performance of LFDA is comparable to EUCLID. This implies that the use of LFDA is advantageous when the dimensionality of the original data is very high, since the computation time in the test phase can be reduced. Table 4 also includes the misclassification rates by naive one-nearest-neighbor classification after the samples are projected onto a one-dimensional FDA subspace, showing that LFDA tends to outperform FDA.

6. In our implementation of MCML, we used a constant step size for the gradient descent. The computation time could be improved if, for example, an Armijo-like step size rule (Bertsekas, 1976) is employed.
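The dimensionality-selection protocol used above can be sketched compactly: because sub-matrices of one LFDA solution are optimal for every smaller r, a single embedding can be cross-validated over all candidate dimensionalities with a one-nearest-neighbor classifier. The following is a minimal illustration on synthetic data (function name, fold construction, and data are ours):

```python
import numpy as np

def cv_choose_dim(Z_full, y, dims, n_folds=5, seed=0):
    """Sketch of choosing the embedding dimensionality r by cross-validation:
    evaluate a one-nearest-neighbor classifier on the first r features,
    reusing the columns of a single full-dimensional embedding Z_full."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds
    errs = {}
    for r in dims:
        Z = Z_full[:, :r]
        err = 0
        for f in range(n_folds):
            test, train = folds == f, folds != f
            # squared distances from each test point to each training point
            D = ((Z[test][:, None, :] - Z[train][None, :, :]) ** 2).sum(-1)
            pred = y[train][D.argmin(axis=1)]   # one-nearest-neighbor label
            err += (pred != y[test]).sum()
        errs[r] = err / n
    return min(errs, key=errs.get), errs

rng = np.random.default_rng(4)
y = np.array([0] * 25 + [1] * 25)
Z_full = rng.normal(size=(50, 4))
Z_full[:, 0] += 4.0 * y            # only the first feature is informative
best_r, errs = cv_choose_dim(Z_full, y, dims=[1, 2, 3, 4])
```

Since only the informative first feature separates the classes, the cross-validated error at r = 1 is small while the noise features can only add errors; methods like NCA would instead need a separate optimization run per candidate r.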

LOCAL FISHER DISCRIMINANT ANALYSIS Based o the above simulatio results, we coclude that the proposed LFDA is a promisig dimesioality reductio techique also i classificatio scearios. 6. Coclusios We discussed the problem of supervised dimesioality reductio. FDA (Fisher, 1936; Fukuaga, 1990; Duda et al., 2001) works well for this purpose, give that data samples i each class are uimodal Gaussia. However, samples i a class are ofte multimodal i practice, for example, whe multi-class classificatio problems are solved by a set of two-class oe-versus-rest problems. LPP (He ad Niyogi, 2004) ca work well i dimesioality reductio of multimodal data. However, it is a usupervised method ad does ot ecessarily useful i supervised dimesioality reductio scearios. I this paper, we proposed a ew method called LFDA, which effectively combies the ideas of FDA ad LPP. LFDA allows us to reduce the dimesioality of multimodal labeled data appropriately by maximizig betwee-class separability ad preservig the withi-class local structure at the same time. The derivatio of LFDA is based o a ovel pairwise iterpretatio of FDA (see Sectio 3.1). The origial FDA provides a meaigful result oly whe the dimesioality of the embeddig space is smaller tha the umber of classes because of the rak deficiecy of the betwee-class scatter matrix. O the other had, LFDA does ot share this limitatio ad ca be employed for dimesioality reductio ito ay dimesioal spaces (see Sectio 3.3). This is a sigificat improvemet over the origial FDA. As discussed i Sectio 3.3, the LFDA criterio is ivariat uder liear trasformatios. This meas that the rage of the trasformatio matrix ca be uiquely determied, but the distace metric i the embeddig space caot be determied. I this paper, we determied the distace metric i a heuristic maer. Although this ormalizatio scheme is show to be reasoable i experimets, there is still room for further improvemet. 
An important future direction is to develop a computationally efficient method of determining the distance metric of the embedding space, for example, following the lines of Goldberger et al. (2005), Globerson and Roweis (2006), and Weinberger et al. (2006).

We showed in Section 3.4 that a non-linear variant of LFDA can be obtained by employing the kernel trick. FDA, LPP, and MCML can also be kernelized similarly (Baudat and Anouar, 2000; Mika et al., 2003; Belkin and Niyogi, 2003; He and Niyogi, 2004; Globerson and Roweis, 2006). As shown in these papers, the performance of the kernelized methods heavily depends on the choice of the family and parameters of the kernel functions. Therefore, how to optimally determine the kernel function for supervised dimensionality reduction needs to be explored.

The performance of LFDA depends on the choice of the affinity matrix. In this paper, we simply employed a standard definition as it is (see Appendix D.4). Although this standard choice appeared to be reasonable in experiments, it is important to find the optimal way to define the affinity matrix in the context of supervised dimensionality reduction.

MDA (Hastie and Tibshirani, 1996b) provides a solid probabilistic framework for supervised dimensionality reduction with multimodality (see Section 4.2). On the other hand, LFDA still lacks a probabilistic interpretation. An interesting future direction is to analyze the behavior of LFDA in terms of density models.

Acknowledgments

The author would like to thank Klaus-Robert Müller, Hideki Asoh, Motoaki Kawanabe, and Stefan Harmeling for fruitful discussions. His gratitude also goes to the anonymous reviewers for their valuable comments; in particular, the weighting scheme of the eigenvectors in Eq. (15) was suggested by one of the reviewers. He also acknowledges financial support from MEXT (Grant-in-Aid for Young Scientists 17700142 and Grant-in-Aid for Scientific Research (B) 18300057). A part of this work was done while the author was staying at the University of Edinburgh, supported by an EU Erasmus Mundus Scholarship. He would like to thank Sethu Vijayakumar for the warm hospitality.

Appendix A. Proof of Lemma 1

It follows from Eq. (1) that

    S^(w) = Σ_{l=1}^c Σ_{i: y_i = l} ( x_i − (1/n_l) Σ_{j: y_j = l} x_j ) ( x_i − (1/n_l) Σ_{j: y_j = l} x_j )^⊤
          = Σ_{i=1}^n x_i x_i^⊤ − Σ_{l=1}^c (1/n_l) Σ_{i,j: y_i = y_j = l} x_i x_j^⊤
          = Σ_{i=1}^n ( Σ_{j=1}^n W^(w)_{i,j} ) x_i x_i^⊤ − Σ_{i,j=1}^n W^(w)_{i,j} x_i x_j^⊤
          = (1/2) Σ_{i,j=1}^n W^(w)_{i,j} ( x_i x_i^⊤ + x_j x_j^⊤ − x_i x_j^⊤ − x_j x_i^⊤ ),

which yields Eq. (5).

Let S^(m) be the mixture scatter matrix (Fukunaga, 1990):

    S^(m) ≡ S^(w) + S^(b) = Σ_{i=1}^n (x_i − µ)(x_i − µ)^⊤.

Then we have

    S^(b) = S^(m) − S^(w)
          = Σ_{i=1}^n x_i x_i^⊤ − (1/n) Σ_{i,j=1}^n x_i x_j^⊤ − S^(w)
          = (1/2) Σ_{i,j=1}^n ( 1/n − W^(w)_{i,j} ) ( x_i x_i^⊤ + x_j x_j^⊤ − x_i x_j^⊤ − x_j x_i^⊤ ),

which yields Eq. (6).

Appendix B. Interpretation of FDA

In Section 3.1, we claimed that FDA tries to keep data pairs in the same class close and data pairs in different classes apart. Here we show this claim more formally.

For v_{i,j} \equiv T^\top (x_i - x_j), let us investigate the change in the Fisher criterion (3) when v_{i,j} is replaced by \alpha v_{i,j} with \alpha > 0. Note that there does not generally exist a transformation T that keeps all v_{i,j} and changes only a particular pair. For this reason, the following analysis may be regarded as comparing the values of the Fisher criterion for two different data sets. This analysis will give an insight into what kind of transformation matrices the Fisher criterion favors.

Let

W \equiv T^\top S^{(w)} T,
B \equiv T^\top S^{(b)} T,
W_\alpha \equiv W - \beta W^{(w)}_{i,j} v_{i,j} v_{i,j}^\top,
B_\alpha \equiv B - \beta W^{(b)}_{i,j} v_{i,j} v_{i,j}^\top,
\beta \equiv \frac{1 - \alpha^2}{2}.

Note that W_\alpha and B_\alpha correspond to the within-class and between-class scatter matrices for \alpha v_{i,j}, respectively. We assume that W and W_\alpha are positive definite and that B and B_\alpha are positive semi-definite. Then the values of the Fisher criterion (3) for v_{i,j} and \alpha v_{i,j} are expressed as \mathrm{tr}(W^{-1} B) and \mathrm{tr}(W_\alpha^{-1} B_\alpha), respectively. The standard matrix inversion lemma (e.g., Albert, 1972) yields

W^{-1} = \big(W_\alpha + \beta W^{(w)}_{i,j} v_{i,j} v_{i,j}^\top\big)^{-1}
       = W_\alpha^{-1} - \frac{W_\alpha^{-1} v_{i,j} \big(W_\alpha^{-1} v_{i,j}\big)^\top}{\big(\beta W^{(w)}_{i,j}\big)^{-1} + \langle W_\alpha^{-1} v_{i,j}, v_{i,j} \rangle}.

If y_i = y_j, we have W^{(w)}_{i,j} > 0 and W^{(b)}_{i,j} < 0. Then we have

\mathrm{tr}\big(W^{-1} B\big)
= \mathrm{tr}\Big(\big(W_\alpha + \beta W^{(w)}_{i,j} v_{i,j} v_{i,j}^\top\big)^{-1} \big(B_\alpha + \beta W^{(b)}_{i,j} v_{i,j} v_{i,j}^\top\big)\Big)
= \mathrm{tr}\big(W_\alpha^{-1} B_\alpha\big) + \beta W^{(b)}_{i,j} \langle W_\alpha^{-1} v_{i,j}, v_{i,j} \rangle
  - \frac{\langle B_\alpha W_\alpha^{-1} v_{i,j}, W_\alpha^{-1} v_{i,j} \rangle + \beta W^{(b)}_{i,j} \langle W_\alpha^{-1} v_{i,j}, v_{i,j} \rangle^2}{\big(\beta W^{(w)}_{i,j}\big)^{-1} + \langle W_\alpha^{-1} v_{i,j}, v_{i,j} \rangle}
= \mathrm{tr}\big(W_\alpha^{-1} B_\alpha\big)
  - \frac{\langle B_\alpha W_\alpha^{-1} v_{i,j}, W_\alpha^{-1} v_{i,j} \rangle - \dfrac{W^{(b)}_{i,j}}{W^{(w)}_{i,j}} \langle W_\alpha^{-1} v_{i,j}, v_{i,j} \rangle}{\big(\beta W^{(w)}_{i,j}\big)^{-1} + \langle W_\alpha^{-1} v_{i,j}, v_{i,j} \rangle}.   (30)

If \alpha < 1, we have \beta > 0 since \alpha > 0 by definition. Therefore, Eq. (30) yields

\mathrm{tr}\big(W_\alpha^{-1} B_\alpha\big) > \mathrm{tr}\big(W^{-1} B\big),

where we used the facts that W_\alpha is positive definite and B_\alpha is positive semi-definite. This implies that the value of the Fisher criterion increases if a data pair in the same class is made close.
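This same-class case can be illustrated numerically by instantiating the rank-one construction directly. The following is a sketch, not from the paper: W_alpha, B_alpha, the weights w_w > 0, w_b < 0, and alpha < 1 are arbitrary values chosen only to satisfy the stated positivity and sign conditions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5

# Arbitrary positive definite W_alpha and positive semi-definite B_alpha
M = rng.normal(size=(d, d))
W_alpha = M @ M.T + np.eye(d)
N = rng.normal(size=(d, d - 1))
B_alpha = N @ N.T

v = rng.normal(size=d)         # plays the role of v_{i,j} = T^T (x_i - x_j)
w_w, w_b = 0.3, -0.2           # same-class case: W^(w)_{i,j} > 0, W^(b)_{i,j} < 0
alpha = 0.5                    # alpha < 1: the pair is made closer
beta = (1.0 - alpha**2) / 2.0  # beta > 0

# Recover W and B from their definitions:
# W = W_alpha + beta * W^(w)_{i,j} v v^T, B = B_alpha + beta * W^(b)_{i,j} v v^T
W = W_alpha + beta * w_w * np.outer(v, v)
B = B_alpha + beta * w_b * np.outer(v, v)

crit = np.trace(np.linalg.solve(W, B))                    # tr(W^{-1} B) for v
crit_alpha = np.trace(np.linalg.solve(W_alpha, B_alpha))  # criterion for alpha * v

# Making a same-class pair closer raises the Fisher criterion
assert crit_alpha > crit
```

The inequality follows from Eq. (30): both terms in the numerator of the correction are non-negative under the stated conditions, so the criterion for the shrunken pair is strictly larger.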

Similarly, if y_i \neq y_j, we have W^{(w)}_{i,j} = 0 and W^{(b)}_{i,j} > 0, and hence W_\alpha = W. Then we have

\mathrm{tr}\big(W_\alpha^{-1} B_\alpha\big) = \mathrm{tr}\Big(\big(W - \beta W^{(w)}_{i,j} v_{i,j} v_{i,j}^\top\big)^{-1} \big(B - \beta W^{(b)}_{i,j} v_{i,j} v_{i,j}^\top\big)\Big)
= \mathrm{tr}\big(W^{-1} B\big) - \beta W^{(b)}_{i,j} \langle W^{-1} v_{i,j}, v_{i,j} \rangle.   (31)

If \alpha > 1, we have \beta < 0, and hence Eq. (31) yields

\mathrm{tr}\big(W_\alpha^{-1} B_\alpha\big) > \mathrm{tr}\big(W^{-1} B\big).

This implies that the value of the Fisher criterion increases if a pair of samples in different classes is separated from each other.

Appendix C. Efficient Computation of T_{LFDA}

As shown in Eq. (15), the LFDA transformation matrix T_{LFDA} can be computed analytically using the generalized eigenvectors and generalized eigenvalues of the following generalized eigenvalue problem:

S^{(b)} \varphi = \lambda S^{(w)} \varphi.

Given S^{(b)} and S^{(w)}, the computational complexity of calculating T_{LFDA} is O(rd^2). Here, we provide an efficient method of computing S^{(b)} and S^{(w)}. Let S^{(m)} be the local mixture scatter matrix defined by

S^{(m)} \equiv S^{(b)} + S^{(w)}.

From Eqs. (9)-(11), we can immediately show that S^{(m)} is expressed in the following pairwise form:

S^{(m)} = \frac{1}{2} \sum_{i,j=1}^{n} W^{(m)}_{i,j} (x_i - x_j)(x_i - x_j)^\top,

where W^{(m)} is the n-dimensional matrix with (i, j)-th element

W^{(m)}_{i,j} \equiv \begin{cases} A_{i,j}/n & \text{if } y_i = y_j, \\ 1/n & \text{if } y_i \neq y_j. \end{cases}

Since

S^{(m)} = \frac{1}{2} \sum_{i,j=1}^{n} W^{(m)}_{i,j} \big(x_i x_i^\top + x_j x_j^\top - x_i x_j^\top - x_j x_i^\top\big)
        = \sum_{i=1}^{n} \Big(\sum_{j=1}^{n} W^{(m)}_{i,j}\Big) x_i x_i^\top - \sum_{i,j=1}^{n} W^{(m)}_{i,j} x_i x_j^\top,

S^{(m)} can be expressed in a matrix form as

S^{(m)} = X L^{(m)} X^\top,   (32)

where

L^{(m)} \equiv D^{(m)} - W^{(m)},   (33)

and D^{(m)} is the n-dimensional diagonal matrix with i-th diagonal element

D^{(m)}_{i,i} \equiv \sum_{j=1}^{n} W^{(m)}_{i,j}.

Similarly, S^{(w)} can be expressed in a matrix form as

S^{(w)} = X L^{(w)} X^\top,   (34)

where

L^{(w)} \equiv D^{(w)} - W^{(w)},   (35)

and D^{(w)} is the n-dimensional diagonal matrix with i-th diagonal element

D^{(w)}_{i,i} \equiv \sum_{j=1}^{n} W^{(w)}_{i,j}.

L^{(m)} and L^{(w)} are n-dimensional matrices and could be very high-dimensional. However, L^{(w)} can be made block-diagonal if the samples \{x_i\}_{i=1}^{n} are sorted according to the labels \{y_i\}_{i=1}^{n}. Furthermore, the diagonal sub-matrices of L^{(w)} can be sparse if the affinity matrix A is sparsely defined (see Appendix D for detail). Therefore, directly calculating S^{(w)} by Eq. (34) may already be computationally efficient. On the other hand, computing S^{(m)} directly by Eq. (32) is not so efficient since W^{(m)} is dense. This problem can be alleviated as follows. W^{(m)} can be decomposed as

W^{(m)} = \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^\top + W^{(m_1)} + W^{(m_2)},

where \mathbf{1}_n is the n-dimensional vector with all ones, and W^{(m_1)} and W^{(m_2)} are the n-dimensional matrices with (i, j)-th elements

W^{(m_1)}_{i,j} \equiv \begin{cases} A_{i,j}/n & \text{if } y_i = y_j, \\ 0 & \text{if } y_i \neq y_j, \end{cases}
\qquad
W^{(m_2)}_{i,j} \equiv \begin{cases} -1/n & \text{if } y_i = y_j, \\ 0 & \text{if } y_i \neq y_j. \end{cases}

Then S^{(m)} can be expressed as

S^{(m)} = X D^{(m)} X^\top - \frac{1}{n} (X \mathbf{1}_n)(X \mathbf{1}_n)^\top - X W^{(m_1)} X^\top - X W^{(m_2)} X^\top,   (36)

where the diagonal matrix D^{(m)} is expressed in terms of W^{(m_1)} as

D^{(m)}_{i,i} = 1 - \frac{n_{y_i}}{n} + \sum_{j=1}^{n} W^{(m_1)}_{i,j}.

Note that n_{y_i} in the above equation is the number of samples in the class to which the sample x_i belongs. W^{(m_2)} is a constant block-diagonal matrix if the samples \{x_i\}_{i=1}^{n} are sorted according to the labels \{y_i\}_{i=1}^{n}. Therefore, X W^{(m_2)} X^\top in the right-hand side of Eq. (36) can be computed efficiently. Similarly, W^{(m_1)} can also be made block-diagonal, so X W^{(m_1)} X^\top in the right-hand side of Eq. (36) may also be computed efficiently; if the affinity matrix A is sparse, the computational efficiency can be further improved. The first two terms in the right-hand side of Eq. (36) can also be computed efficiently. Therefore, computing S^{(m)} by Eq. (36) may be more efficient than computing it directly by Eq. (32). Finally, we can compute S^{(b)} efficiently by using S^{(m)} as

S^{(b)} = S^{(m)} - S^{(w)}.

To further improve computational efficiency, the affinity matrix A may be computed in a classwise manner since we do not need the affinity values for sample pairs in different classes. This speeds up the nearest neighbor search which is often carried out when defining A (see Appendix D). The nearest neighbor search itself could also be a bottleneck, but this may be eased by incorporating prior knowledge of the data structure or by approximation (see Saul and Roweis, 2003, and references therein). The above efficient implementation of LFDA is summarized as a pseudo code in Figure 2.

Appendix D. Definitions of the Affinity Matrix

Here, we briefly review typical choices of the affinity matrix A.

D.1 Heat Kernel

A standard choice of the affinity matrix A is

A_{i,j} = \exp\Big(-\frac{\|x_i - x_j\|^2}{\sigma^2}\Big),   (37)

where \sigma (> 0) is a tuning parameter which controls the decay of the affinity (e.g., Belkin and Niyogi, 2003).

D.2 Euclidean Neighbor

The heat kernel gives a non-sparse affinity matrix. It would be computationally advantageous if the affinity matrix were made sparse. A sparse affinity matrix can be obtained by assigning positive affinity values only to neighboring samples. More specifically, x_i and x_j are said to be neighbors if

\|x_i - x_j\| \le \varepsilon,

where \varepsilon (> 0) is a tuning parameter.
Then A_{i,j} is defined by Eq. (37) for two neighboring samples and A_{i,j} = 0 for non-neighbors (Tenenbaum et al., 2000). This definition includes two tuning parameters (\varepsilon and \sigma), which are rather troublesome to determine in practice. To ease the problem, we may simply let A_{i,j} = 1 if x_i and x_j are neighbors and A_{i,j} = 0 otherwise. This corresponds to setting \sigma = \infty.
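The definitions above are easy to instantiate. A minimal sketch follows (the synthetic data and the values of sigma and eps are arbitrary, not recommendations from the paper), computing the heat-kernel, Euclidean-neighbor, and zero-one affinity matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))                          # rows are the samples x_i
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # ||x_i - x_j||^2 for all pairs

sigma, eps = 1.0, 1.5                                 # tuning parameters (arbitrary)

# D.1: heat kernel, Eq. (37) -- dense
A_heat = np.exp(-sq / sigma**2)

# D.2: keep the heat-kernel value only for neighbors with ||x_i - x_j|| <= eps
neighbors = sq <= eps**2
A_sparse = np.where(neighbors, A_heat, 0.0)

# D.2, simplified: zero-one affinity (the sigma -> infinity limit on neighbors)
A_01 = neighbors.astype(float)

# All three definitions yield symmetric affinity matrices
for A in (A_heat, A_sparse, A_01):
    assert np.allclose(A, A.T)
```

The zero-one variant is what makes L^{(w)} in Appendix C sparse when most sample pairs are non-neighbors.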

D.3 Nearest Neighbor

Tuning the distance threshold \varepsilon is practically rather cumbersome since the relation between the number of neighbors and the value of \varepsilon is not intuitively clear. Another option to determine the neighbors is to directly specify the number of neighbors (Roweis and Saul, 2000; Tenenbaum et al., 2000). Let NN^{(K)}_i be the set of K nearest neighbor samples of x_i under the Euclidean distance, where K is a tuning parameter. If x_j \in NN^{(K)}_i or x_i \in NN^{(K)}_j, x_i and x_j are regarded as neighbors; otherwise they are regarded as non-neighbors. Then the affinity matrix is defined by the heat kernel or in the simple zero-one manner.

D.4 Local Scaling

A drawback of the above definitions could be that the affinity is computed globally in the same way. The density of data samples may differ depending on the region. Therefore, it would be more appropriate to take the local scaling of the data into account. Following this idea, Zelnik-Manor and Perona (2005) proposed defining the affinity matrix as

A_{i,j} = \exp\Big(-\frac{\|x_i - x_j\|^2}{\sigma_i \sigma_j}\Big).

Here \sigma_i represents the local scaling of the data samples around x_i, which is determined by

\sigma_i = \|x_i - x_i^{(K)}\|,

where x_i^{(K)} is the K-th nearest neighbor of x_i. The parameter K is a tuning parameter, but Zelnik-Manor and Perona (2005) demonstrated that K = 7 works well on the whole. This would be a convenient heuristic for those who do not have any subjective/prior preferences. We employed the local scaling method with this heuristic all through the paper. For computational efficiency, we may further sparsify the above affinity matrix based on, for example, the nearest neighbor idea, although this is not tested in this paper.

Appendix E. Pairwise Expression of S^{(b)}

A pairwise expression of S^{(b)} can be derived as follows.

S^{(b)} = \sum_{k=1}^{n} \frac{1}{n^{[k]}} \sum_{l=1}^{c} n_l^{[k]} \big(\mu_l^{[k]} - \mu^{[k]}\big)\big(\mu_l^{[k]} - \mu^{[k]}\big)^\top
        = \sum_{k=1}^{n} \frac{1}{n^{[k]}} \Big( \sum_{l=1}^{c} n_l^{[k]} \mu_l^{[k]} \mu_l^{[k]\top} - n^{[k]} \mu^{[k]} \mu^{[k]\top} \Big).

Adding and subtracting \sum_{i=1}^{n} A_{i,k} x_i x_i^\top inside the parentheses and collecting the cross terms pairwise, in the same way as in the proof of Lemma 1 (Appendix A), turns the class-wise part into sums of A_{i,k} A_{j,k} (x_i - x_j)(x_i - x_j)^\top over pairs with y_i = y_j = l, and the mixture part into the corresponding sum over all pairs. This gives

S^{(b)} = \frac{1}{2} \sum_{i,j=1}^{n} W^{(b)}_{i,j} (x_i - x_j)(x_i - x_j)^\top,

which yields Eqs. (19) and (20).

References

A. Albert. Regression and the Moore-Penrose Pseudoinverse. Academic Press, New York and London, 1972.

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337-404, 1950.

G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10):2385-2404, 2000.

M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.

D. P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Transactions on Automatic Control, 21(2):174-184, 1976.

C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.

F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, Providence, R.I., 1997.

D. Coomans, M. Broeckaert, M. Jonckheer, and D. L. Massart. Comparison of multivariate discriminant techniques for clinical data - Application to the thyroid functional state. Methods of Information in Medicine, 22:93-101, 1983.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.

R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, 2001.

N. Duffy and M. Collins. Convolution kernels for natural language. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.

B. S. Everitt, S. Landau, and M. Leese. Cluster Analysis. Arnold, London, 2001.

R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179-188, 1936.

J. H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84:165-175, 1989.

K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, Inc., Boston, second edition, 1990.

T. Gärtner. A survey of kernels for structured data. SIGKDD Explorations, 5(1):S268-S275, 2003.

T. Gärtner, P. Flach, and S. Wrobel. On graph kernels: Hardness results and efficient alternatives. In Proceedings of the Sixteenth Annual Conference on Computational Learning Theory, 2003.

A. Globerson, G. Chechik, F. Pereira, and N. Tishby. Euclidean embedding of co-occurrence data. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 497-504. MIT Press, Cambridge, MA, 2005.

A. Globerson and S. Roweis. Metric learning by collapsing classes. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 451-458, Cambridge, MA, 2006. MIT Press.

J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 513-520. MIT Press, Cambridge, MA, 2005.

J. Ham, D. D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality reduction of manifolds.
In Proceedings of the Twenty-First International Conference on Machine Learning, New York, NY, 2004. ACM Press.

T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607-615, 1996a.

T. Hastie and R. Tibshirani. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B, 58(1):155-176, 1996b.

X. He and P. Niyogi. Locality preserving projections. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

R. E. Henkel. Tests of Significance. SAGE Publications, Beverly Hills, 1979.

G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

H. Kashima and T. Koyanagi. Kernels for semi-structured data. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 291-298, San Francisco, CA, 2002. Morgan Kaufmann.

H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the Twentieth International Conference on Machine Learning, San Francisco, CA, 2003. Morgan Kaufmann.

T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, 1989.

R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete input spaces. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 315-322, 2002.

S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79-86, 1951.

H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419-444, 2002.

J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297, Berkeley, CA, USA, 1967. University of California Press.

S. Mika, G. Rätsch, J. Weston, B. Schölkopf, A. Smola, and K.-R. Müller. Constructing descriptive and discriminative nonlinear features: Rayleigh coefficients in kernel feature spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(5):623-628, 2003.

G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287-320, 2001.

S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.

L. K. Saul and S. T. Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4(Jun):119-155, 2003.

B. Schölkopf, A. Smola, and K.-R. Müller.
Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.

B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111-147, 1974.

J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.

V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1990.

K. Weinberger, J. Blitzer, and L. Saul. Distance metric learning for large margin nearest neighbor classification. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1473-1480, Cambridge, MA, 2006. MIT Press.

L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1601-1608. MIT Press, Cambridge, MA, 2005.