Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering




Yoshua Bengio, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas Le Roux and Marie Ouimet
Département d'Informatique et Recherche Opérationnelle, Université de Montréal
Montréal, Québec, Canada, H3C 3J7
{bengioy,vincentp,paiemeje,delallea,lerouxn,ouimema}@iro.umontreal.ca

Abstract

Several unsupervised learning algorithms based on an eigendecomposition provide either an embedding or a clustering only for given training points, with no straightforward extension for out-of-sample examples short of recomputing eigenvectors. This paper provides a unified framework for extending Local Linear Embedding (LLE), Isomap, Laplacian Eigenmaps, Multi-Dimensional Scaling (for dimensionality reduction) as well as for Spectral Clustering. This framework is based on seeing these algorithms as learning eigenfunctions of a data-dependent kernel. Numerical experiments show that the generalizations performed have a level of error comparable to the variability of the embedding algorithms due to the choice of training data.

1 Introduction

Many unsupervised learning algorithms have been recently proposed, all using an eigendecomposition for obtaining a lower-dimensional embedding of data lying on a non-linear manifold: Local Linear Embedding (LLE) (Roweis and Saul, 2000), Isomap (Tenenbaum, de Silva and Langford, 2000) and Laplacian Eigenmaps (Belkin and Niyogi, 2003). There are also many variants of Spectral Clustering (Weiss, 1999; Ng, Jordan and Weiss, 2002), in which such an embedding is an intermediate step before obtaining a clustering of the data that can capture flat, elongated and even curved clusters. The two tasks (manifold learning and clustering) are linked because the clusters found by spectral clustering can be arbitrary curved manifolds (as long as there is enough data to locally capture their curvature).

2 Common Framework

In this paper we consider five types of unsupervised learning algorithms that can be cast in the same framework, based on the computation of an embedding for the training points obtained from the principal eigenvectors of a symmetric matrix.

Algorithm 1

1. Start from a data set D = {x_1, ..., x_n} with n points in R^d. Construct an n × n neighborhood or similarity matrix M. Let us denote K_D(·,·) (or K for shorthand) the data-dependent function which produces M by M_ij = K_D(x_i, x_j).

2. Optionally transform M, yielding a normalized matrix M̃. Equivalently, this corresponds to generating M̃ from a K̃_D by M̃_ij = K̃_D(x_i, x_j).

3. Compute the m largest positive eigenvalues λ_k and eigenvectors v_k of M̃.

4. The embedding of each example x_i is the vector y_i with y_ik the i-th element of the k-th principal eigenvector v_k of M̃. Alternatively (MDS and Isomap), the embedding is e_i, with e_ik = √λ_k y_ik. If the first m eigenvalues are positive, then e_i · e_j is the best approximation of M̃_ij using only m coordinates, in the squared error sense.

In the following, we consider the specializations of Algorithm 1 for different unsupervised learning algorithms. Let S_i be the i-th row sum of the affinity matrix M:

    S_i = \sum_j M_{ij}.   (1)

We say that two points (a, b) are k-nearest-neighbors of each other if a is among the k nearest neighbors of b in D ∪ {a} or vice-versa. We denote by x_ij the j-th coordinate of the vector x_i.

2.1 Multi-Dimensional Scaling

Multi-Dimensional Scaling (MDS) starts from a notion of distance or affinity K that is computed between each pair of training examples. We consider here metric MDS (Cox and Cox, 1994). For the normalization step 2 in Algorithm 1, these distances are converted to equivalent dot products using the double-centering formula:

    \tilde{M}_{ij} = -\frac{1}{2}\left( M_{ij} - \frac{1}{n} S_i - \frac{1}{n} S_j + \frac{1}{n^2} \sum_k S_k \right).   (2)

The embedding e_ik of example x_i is given by √λ_k v_ki.

2.2 Spectral Clustering

Spectral clustering (Weiss, 1999) can yield impressively good results where traditional clustering looking for "round blobs" in the data, such as K-means, would fail miserably. It is based on two main steps: first embedding the data points in a space in which clusters are more "obvious" (using the eigenvectors of a Gram matrix), and then applying a classical clustering algorithm such as K-means, e.g. as in (Ng, Jordan and Weiss, 2002). The affinity matrix M is formed using a kernel such as the Gaussian kernel. Several normalization steps have been proposed. Among the most successful ones, as advocated in (Weiss, 1999; Ng, Jordan and Weiss, 2002), is the following:

    \tilde{M}_{ij} = \frac{M_{ij}}{\sqrt{S_i S_j}}.   (3)

To obtain m clusters, the first m principal eigenvectors of M̃ are computed and K-means is applied on the unit-norm coordinates, obtained from the embedding y_ik = v_ki.

2.3 Laplacian Eigenmaps

Laplacian Eigenmaps is a recently proposed dimensionality reduction procedure (Belkin and Niyogi, 2003) that has also been proposed for semi-supervised learning. The authors use an approximation of the Laplacian operator such as the Gaussian kernel or the matrix whose element (i, j) is 1 if x_i and x_j are k-nearest-neighbors and 0 otherwise. Instead of solving an ordinary eigenproblem, the following generalized eigenproblem is solved:

    (S - M) v_j = \lambda_j S v_j   (4)

with eigenvalues λ_j, eigenvectors v_j and S the diagonal matrix with entries given by eq. (1). The smallest eigenvalue is left out and the eigenvectors corresponding to the other small eigenvalues are used for the embedding. This is the same embedding that is computed with the spectral clustering algorithm from (Shi and Malik, 1997). As noted in (Weiss, 1999) (Normalization Lemma 1), an equivalent result (up to a componentwise scaling of the embedding) can be obtained by considering the principal eigenvectors of the normalized matrix defined in eq. (3).
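To make the common framework concrete, the following minimal sketch (our own illustration in Python/NumPy, not the authors' code; the function names are ours) instantiates Algorithm 1 with a Gaussian affinity and the normalization of eq. (3), i.e. the spectral clustering / Laplacian eigenmaps variant:

    import numpy as np

    def gaussian_affinity(X, sigma):
        """Step 1: affinity matrix M_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    def spectral_embedding(X, m, sigma=1.0):
        """Steps 2-4 of Algorithm 1 with the normalization of eq. (3)."""
        M = gaussian_affinity(X, sigma)
        S = M.sum(axis=1)                              # row sums S_i of eq. (1)
        M_norm = M / np.sqrt(np.outer(S, S))           # step 2: M~_ij = M_ij / sqrt(S_i S_j), eq. (3)
        eigvals, eigvecs = np.linalg.eigh(M_norm)      # symmetric eigendecomposition
        order = np.argsort(eigvals)[::-1][:m]          # step 3: m largest eigenvalues
        lambdas, V = eigvals[order], eigvecs[:, order]
        Y = V                                          # step 4: y_ik = v_ki (row i embeds x_i)
        E = V * np.sqrt(np.clip(lambdas, 0.0, None))   # MDS/Isomap variant: e_ik = sqrt(lambda_k) y_ik
        return lambdas, Y, E

For spectral clustering proper, K-means would then be run on the rows of Y rescaled to unit norm, as described above; for MDS and Isomap, the affinity construction and normalization of step 2 are replaced by the double-centering of eq. (2).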

2.4 Isomap

Isomap (Tenenbaum, de Silva and Langford, 2000) generalizes MDS to non-linear manifolds. It is based on replacing the Euclidean distance by an approximation of the geodesic distance on the manifold. We define the geodesic distance with respect to a data set D, a distance d(u, v) and a neighborhood k as follows:

    D(a, b) = \min_{p} \sum_{i} d(p_i, p_{i+1})   (5)

where p is a sequence of points of length l ≥ 2 with p_1 = a, p_l = b, p_i ∈ D for i ∈ {2, ..., l − 1}, and (p_i, p_{i+1}) are k-nearest-neighbors. The length l is free in the minimization. The Isomap algorithm obtains the normalized matrix M̃ from which the embedding is derived by transforming the raw pairwise distances matrix as follows: first compute the matrix M_ij = D²(x_i, x_j) of squared geodesic distances with respect to the data D, then apply to this matrix the distance-to-dot-product transformation (eq. (2)), as for MDS. As in MDS, the embedding is e_ik = √λ_k v_ki rather than y_ik = v_ki.

2.5 LLE

The Local Linear Embedding (LLE) algorithm (Roweis and Saul, 2000) looks for an embedding that preserves the local geometry in the neighborhood of each data point. First, a sparse matrix of local predictive weights W_ij is computed, such that Σ_j W_ij = 1, W_ij = 0 if x_j is not a k-nearest-neighbor of x_i, and ||Σ_j W_ij x_j − x_i||² is minimized. Then the matrix

    M = (I - W)^\top (I - W)   (6)

is formed. The embedding is obtained from the lowest eigenvectors of M, except for the eigenvector with the smallest eigenvalue, which is uninteresting because it is (1, 1, ..., 1), with eigenvalue 0. Note that the lowest eigenvectors of M are the largest eigenvectors of M_µ = µI − M, to fit Algorithm 1 (the use of µ > 0 will be discussed in section 4.4). The embedding is given by y_ik = v_ki, and is constant with respect to µ.

3 From Eigenvectors to Eigenfunctions

To obtain an embedding for a new data point, we propose to use the Nyström formula (eq. (9)) (Baker, 1977), which has been used successfully to speed up kernel method computations by focusing the heavier computations (the eigendecomposition) on a subset of examples. The use of this formula can be justified by considering the convergence of eigenvectors and eigenvalues, as the number of examples increases (Baker, 1977; Williams and Seeger, 2000; Koltchinskii and Giné, 2000; Shawe-Taylor and Williams, 2003). Intuitively, the extensions to obtain the embedding for a new example require specifying a new column of the Gram matrix M, through a training-set dependent kernel function K_D, in which one of the arguments may be required to be in the training set.

If we start from a data set D, obtain an embedding for its elements, and add more and more data, the embedding for the points in D converges (for eigenvalues that are unique). (Shawe-Taylor and Williams, 2003) give bounds on the convergence error (in the case of kernel PCA). In the limit, we expect each eigenvector to converge to an eigenfunction for the linear operator defined below, in the sense that the i-th element of the k-th eigenvector converges to the application of the k-th eigenfunction to x_i (up to a normalization factor).

Consider a Hilbert space H_p of functions with inner product ⟨f, g⟩_p = ∫ f(x) g(x) p(x) dx, with a density function p(x). Associate with kernel K a linear operator K_p in H_p:

    (K_p f)(x) = \int K(x, y) f(y) p(y) \, dy.   (7)

We don't know the true density p but we can approximate the above inner product and linear operator (and its eigenfunctions) using the empirical distribution p̂. An empirical Hilbert space H_p̂ is thus defined using p̂ instead of p.
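As a small illustration (our own sketch, not from the paper), the empirical inner product and operator simply replace the integral against p by an average over the training sample:

    import numpy as np

    def empirical_inner_product(f, g, X_train):
        """<f, g>_phat = (1/n) * sum_i f(x_i) g(x_i)."""
        return np.mean([f(x) * g(x) for x in X_train])

    def empirical_operator(K, X_train):
        """Return the map f -> K_phat f, where (K_phat f)(x) = (1/n) * sum_i K(x, x_i) f(x_i), cf. eq. (7)."""
        def apply_K(f):
            f_vals = np.array([f(x) for x in X_train])
            return lambda x: np.mean([K(x, xi) * fv for xi, fv in zip(X_train, f_vals)])
        return apply_K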

Note that the proposition below can be applied even if the kernel is not positive semi-definite, although the embedding algorithms we have studied are restricted to using the principal coordinates associated with positive eigenvalues. For a more rigorous mathematical analysis, see (Bengio et al., 2003).

Proposition 1. Let K̃(a, b) be a kernel function, not necessarily positive semi-definite, that gives rise to a symmetric matrix M̃ with entries M̃_ij = K̃(x_i, x_j) upon a data set D = {x_1, ..., x_n}. Let (v_k, λ_k) be an (eigenvector, eigenvalue) pair that solves M̃ v_k = λ_k v_k. Let (f_k, λ'_k) be an (eigenfunction, eigenvalue) pair that solves (K̃_p̂ f_k)(x) = λ'_k f_k(x) for any x, with p̂ the empirical distribution over D. Let e_k(x) = y_k(x) √λ_k or y_k(x) denote the embedding associated with a new point x. Then

    \lambda'_k = \frac{1}{n} \lambda_k   (8)

    f_k(x) = \frac{\sqrt{n}}{\lambda_k} \sum_{i=1}^{n} v_{ki} \tilde{K}(x, x_i)   (9)

    f_k(x_i) = \sqrt{n} \, v_{ki}   (10)

    y_k(x) = \frac{f_k(x)}{\sqrt{n}} = \frac{1}{\lambda_k} \sum_{i=1}^{n} v_{ki} \tilde{K}(x, x_i)   (11)

    y_k(x_i) = y_{ik}, \qquad e_k(x_i) = e_{ik}   (12)

See (Bengio et al., 2003) for a proof and further justifications of the above formulae. The generalized embedding for Isomap and MDS is e_k(x) = √λ_k y_k(x) whereas the one for spectral clustering, Laplacian eigenmaps and LLE is y_k(x).

Proposition 2. In addition, if the data-dependent kernel K̃_D is positive semi-definite, then

    f_k(x) = \frac{\sqrt{n}}{\sqrt{\lambda_k}} \pi_k(x)

where π_k(x) is the k-th component of the kernel PCA projection of x obtained from the kernel K̃_D (up to centering).

This relation with kernel PCA (Schölkopf, Smola and Müller, 1998), already pointed out in (Williams and Seeger, 2000), is further discussed in (Bengio et al., 2003).
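The following sketch (our own hedged implementation, not the authors' code) applies eq. (11) to embed a new point, given the eigendecomposition of the normalized training Gram matrix and a way to evaluate the data-dependent kernel K̃ between a new point and the training points:

    import numpy as np

    def nystrom_embed(x_new, X_train, lambdas, V, K_tilde, sqrt_scaling=False):
        """Out-of-sample embedding of eq. (11):
            y_k(x) = (1 / lambda_k) * sum_i v_ki * K~(x, x_i),
        where V[:, k] is the eigenvector v_k of the training Gram matrix and lambdas[k] = lambda_k.
        With sqrt_scaling=True, returns e_k(x) = sqrt(lambda_k) * y_k(x) (MDS / Isomap convention)."""
        k_col = np.array([K_tilde(x_new, x_i) for x_i in X_train])  # new "column" of the Gram matrix
        y = (V.T @ k_col) / lambdas                                  # y_k(x) for each retained eigenvector
        if sqrt_scaling:
            return y * np.sqrt(np.clip(lambdas, 0.0, None))          # e_k(x)
        return y

Applied to a training point x_i, this reduces to the in-sample embedding y_k(x_i) = y_ik, in agreement with eq. (12).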

4 Extending to new Points

Using Proposition 1, one obtains a natural extension of all the unsupervised learning algorithms mapped to Algorithm 1, provided we can write down a kernel function K̃ that gives rise to the matrix M̃ on D, and can be used in eq. (11) to generalize the embedding. We consider each of them in turn below. In addition to the convergence properties discussed in section 3, another justification for using equation (9) is given by the following proposition:

Proposition 3. If we define the f_k(x_i) by eq. (10) and take a new point x, the value of f_k(x) that minimizes

    \sum_{i=1}^{n} \left( \tilde{K}(x, x_i) - \sum_{t=1}^{m} \lambda'_t f_t(x) f_t(x_i) \right)^2   (13)

is given by eq. (9), for m ≥ 1 and any k ≤ m.

The proof is a direct consequence of the orthogonality of the eigenvectors v_k. This proposition links equations (9) and (10). Indeed, we can obtain eq. (10) when trying to approximate K̃ at the data points by minimizing the cost

    \sum_{i,j=1}^{n} \left( \tilde{K}(x_i, x_j) - \sum_{t=1}^{m} \lambda'_t f_t(x_i) f_t(x_j) \right)^2

for m = 1, 2, ... When we add a new point x, it is thus natural to use the same cost to approximate the K̃(x, x_i), which yields (13). Note that by doing so, we do not seek to approximate K̃(x, x). Future work should investigate embeddings which minimize the empirical reconstruction error of K̃ but ignore the diagonal contributions.

4.1 Extending MDS

For MDS, a normalized kernel can be defined as follows, using a continuous version of the double-centering eq. (2):

    \tilde{K}(a, b) = -\frac{1}{2}\left( d^2(a, b) - E_x[d^2(x, b)] - E_{x'}[d^2(a, x')] + E_{x,x'}[d^2(x, x')] \right)   (14)

where d(a, b) is the original distance and the expectations are taken over the empirical data D. An extension of metric MDS to new points has already been proposed in (Gower, 1968), solving exactly for the embedding of x to be consistent with its distances to training points, which in general requires adding a new dimension.

4.2 Extending Spectral Clustering and Laplacian Eigenmaps

Both the version of Spectral Clustering and Laplacian Eigenmaps described above are based on an initial kernel K, such as the Gaussian or nearest-neighbor kernel. An equivalent normalized kernel is:

    \tilde{K}(a, b) = \frac{1}{n} \frac{K(a, b)}{\sqrt{E_x[K(a, x)] \, E_{x'}[K(b, x')]}}

where the expectations are taken over the empirical data D.

4.3 Extending Isomap

To extend Isomap, the test point is not used in computing the geodesic distance between training points, otherwise we would have to recompute all the geodesic distances. A reasonable solution is to use the definition of D(a, b) in eq. (5), which only uses the training points as the intermediate points on the path from a to b. We obtain a normalized kernel by applying the continuous double-centering of eq. (14) with d = D.

A formula has already been proposed (de Silva and Tenenbaum, 2003) to approximate Isomap using only a subset of the examples (the "landmark" points) to compute the eigenvectors. Using our notations, this formula is

    e'_k(x) = \frac{1}{2\sqrt{\lambda_k}} \sum_i v_{ki} \left( E_{x'}[D^2(x', x_i)] - D^2(x_i, x) \right).   (15)

where E_{x'} is an average over the data set. The formula is applied to obtain an embedding for the non-landmark examples.

Corollary 1. The embedding proposed in Proposition 1 for Isomap (e_k(x)) is equal to formula (15) (Landmark Isomap) when K̃(x, y) is defined as in eq. (14) with d = D.

Proof: the proof relies on a property of the Gram matrix for Isomap: Σ_i M̃_ij = 0, by construction. Therefore (1, 1, ..., 1) is an eigenvector with eigenvalue 0, and all the other eigenvectors v_k have the property Σ_i v_ki = 0 because of the orthogonality with (1, 1, ..., 1). Writing E_{x'}[D²(x', x_i)] − D²(x_i, x) = 2K̃(x, x_i) + E_{x',x''}[D²(x', x'')] − E_{x'}[D²(x, x')] yields

    e'_k(x) = \frac{2}{2\sqrt{\lambda_k}} \sum_i v_{ki} \tilde{K}(x, x_i) + \frac{1}{2\sqrt{\lambda_k}} \left( E_{x',x''}[D^2(x', x'')] - E_{x'}[D^2(x, x')] \right) \sum_i v_{ki} = e_k(x),

since the last sum is 0.
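To illustrate sections 4.1 to 4.3, the two data-dependent kernels can be estimated from the training data as below (a sketch with our own function names; for Isomap, the squared geodesic distances from the new point are assumed to have been computed with eq. (5) using training points only). The resulting vector of values K̃(x, x_i) is what gets plugged into eq. (11):

    import numpy as np

    def mds_isomap_kernel(d2_new, D2_train):
        """Continuous double-centering of eq. (14).
        d2_new[i]      : squared distance d^2(x, x_i) from the new point x to training point x_i
                         (Euclidean for MDS, geodesic for Isomap).
        D2_train[i, j] : squared training distances d^2(x_i, x_j).
        Returns the vector of K~(x, x_i)."""
        col_means = D2_train.mean(axis=0)        # E_x'[d^2(x', x_i)]
        grand_mean = D2_train.mean()             # E_{x',x''}[d^2(x', x'')]
        new_mean = d2_new.mean()                 # E_x'[d^2(x, x')]
        return -0.5 * (d2_new - col_means - new_mean + grand_mean)

    def spectral_clustering_kernel(k_new, K_train):
        """Normalized kernel of section 4.2:
        K~(a, b) = (1/n) K(a, b) / sqrt(E_x[K(a, x)] E_x'[K(b, x')]).
        k_new[i]      : K(x, x_i) for the new point x.
        K_train[i, j] : K(x_i, x_j) on the training set.
        Returns the vector of K~(x, x_i)."""
        n = K_train.shape[0]
        train_means = K_train.mean(axis=1)       # E_x[K(x_i, x)]
        new_mean = k_new.mean()                  # E_x[K(x_new, x)]
        return k_new / (n * np.sqrt(new_mean * train_means))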

4.4 Extending LLE

The extension of LLE is the most challenging one because it does not fit as well the framework of Algorithm 1: the M matrix for LLE does not have a clear interpretation in terms of distance or dot product. An extension has been proposed in (Saul and Roweis, 2002), but unfortunately it cannot be cast directly into the framework of Proposition 1. Their embedding of a new point x is given by

    y_k(x) = \sum_{i=1}^{n} y_k(x_i) \, w(x, x_i)   (16)

where w(x, x_i) is the weight of x_i in the reconstruction of x by its k-nearest-neighbors in the training set (if x = x_j ∈ D, w(x, x_i) = δ_ij). This is very close to eq. (11), but lacks the normalization by λ_k. However, we can see this embedding as a limit case of Proposition 1, as shown below.

We first need to define a kernel K̃_µ such that

    \tilde{K}_\mu(x_i, x_j) = \tilde{M}_{\mu,ij} = (\mu - 1)\delta_{ij} + W_{ij} + W_{ji} - \sum_k W_{ki} W_{kj}   (17)

for x_i, x_j ∈ D. Let us define a kernel K̃' by K̃'(x_i, x) = K̃'(x, x_i) = w(x, x_i) and K̃'(x, y) = 0 when neither x nor y is in the training set D. Let K̃'' be defined by K̃''(x_i, x_j) = W_ij + W_ji − Σ_k W_ki W_kj and K̃''(x, y) = 0 when either x or y isn't in D. Then, by construction, the kernel K̃_µ = (µ − 1)K̃' + K̃'' verifies eq. (17). Thus, we can apply eq. (11) to obtain an embedding of a new point x, which yields

    y_{\mu,k}(x) = \frac{1}{\lambda_k} \sum_i y_{ik} \left( (\mu - 1)\tilde{K}'(x, x_i) + \tilde{K}''(x, x_i) \right)

with λ_k = (µ − λ̂_k), and λ̂_k being the k-th lowest eigenvalue of M. This rewrites into

    y_{\mu,k}(x) = \frac{\mu - 1}{\mu - \hat{\lambda}_k} \sum_i y_{ik} \, w(x, x_i) + \frac{1}{\mu - \hat{\lambda}_k} \sum_i y_{ik} \, \tilde{K}''(x, x_i).

Then when µ → ∞, y_{µ,k}(x) → y_k(x) defined by eq. (16). Since the choice of µ is free, we can thus consider eq. (16) as approximating the use of the kernel K̃_µ with a large µ in Proposition 1. This is what we have done in the experiments described in the next section. Note however that we can find smoother kernels K̃_µ verifying eq. (17), giving other extensions of LLE from Proposition 1. It is out of the scope of this paper to study which kernel is best for generalization, but it seems desirable to use a smooth kernel that would take into account not only the reconstruction of x by its neighbors x_i, but also the reconstruction of the x_i by their neighbors including the new point x.
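A sketch of this extension (our own implementation of eq. (16); the regularization of the local Gram matrix is a standard numerical safeguard and our assumption, not part of the paper):

    import numpy as np

    def reconstruction_weights(x, neighbors, reg=1e-3):
        """Weights w(x, x_i) over the k nearest neighbors of x:
        minimize ||x - sum_j w_j * neighbors[j]||^2 subject to sum_j w_j = 1."""
        Z = neighbors - x                                        # shift neighbors so x is at the origin
        C = Z @ Z.T                                              # local Gram matrix
        C = C + reg * np.trace(C) * np.eye(len(Z)) / len(Z)      # small regularization for stability (our choice)
        w = np.linalg.solve(C, np.ones(len(Z)))
        return w / w.sum()                                       # enforce the sum-to-one constraint

    def extend_lle(x_new, X_train, Y_train, k=10):
        """Out-of-sample LLE embedding of eq. (16): y(x) = sum_i w(x, x_i) y(x_i)."""
        idx = np.argsort(np.linalg.norm(X_train - x_new, axis=1))[:k]  # k nearest training neighbors
        w = reconstruction_weights(x_new, X_train[idx])
        return w @ Y_train[idx]                                  # combine the neighbors' training embeddings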

5 Experiments

We want to evaluate whether the precision of the generalizations suggested in the previous section is comparable to the intrinsic perturbations of the embedding algorithms. The perturbation analysis will be achieved by considering splits of the data in three sets, D = F ∪ R_1 ∪ R_2, and training either with F ∪ R_1 or F ∪ R_2, comparing the embeddings on F. For each algorithm described in section 2, we apply the following procedure:

1. We choose F ⊂ D with m = |F| samples. The remaining n − m samples in D \ F are split into two equal-size subsets R_1 and R_2. We train (obtain the eigenvectors) over F ∪ R_1 and F ∪ R_2. When eigenvalues are close, the estimated eigenvectors are unstable and can rotate in the subspace they span. Thus we estimate an affine alignment between the two embeddings using the points in F (a minimal sketch of this alignment is given after this list), and we calculate the Euclidean distance between the aligned embeddings obtained for each s_i ∈ F.

2. For each sample s_i ∈ F, we also train over {F ∪ R_1} \ {s_i}. We apply the extension to out-of-sample points to find the predicted embedding of s_i and calculate the Euclidean distance between this embedding and the one obtained when training with F ∪ R_1, i.e. with s_i in the training set.

3. We calculate the mean difference (and its standard error, shown in the figure) between the distance obtained in step 1 and the one obtained in step 2 for each sample s_i ∈ F, and we repeat this experiment for various sizes of F.
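The affine alignment and per-point distance used in step 1 can be sketched as follows (a least-squares implementation of our own; the paper does not specify the alignment procedure):

    import numpy as np

    def affine_align(Y_src, Y_tgt):
        """Least-squares affine map (A, b) such that Y_src @ A + b approximates Y_tgt;
        used to align two embeddings of the common points F before comparing them."""
        X = np.hstack([Y_src, np.ones((len(Y_src), 1))])   # append a bias column
        coef, *_ = np.linalg.lstsq(X, Y_tgt, rcond=None)   # solve jointly for A and b
        return Y_src @ coef[:-1] + coef[-1]

    def per_point_distances(Y_a, Y_b):
        """Euclidean distance between the two (aligned) embeddings of each point."""
        return np.linalg.norm(affine_align(Y_a, Y_b) - Y_b, axis=1)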

The results obtained for MDS, Isomap, spectral clustering and LLE are shown in figure 1 for different values of m. Experiments are done over a database of 698 synthetic face images described by 4096 components that is available at http://isomap.stanford.edu. Qualitatively similar results have been obtained over other databases such as Ionosphere (http://www.ics.uci.edu/~mlearn/MLSummary.html) and swissroll (http://www.cs.toronto.edu/~roweis/lle/). Each algorithm generates a two-dimensional embedding of the images, following the experiments reported for Isomap. The number of neighbors is 10 for Isomap and LLE, and a Gaussian kernel with a standard deviation of 0.01 is used for spectral clustering / Laplacian eigenmaps. 95% confidence intervals are drawn beside each mean difference of error on the figure.

Figure 1: Training set variability minus out-of-sample error, with respect to the proportion of training samples substituted. Top left: MDS. Top right: spectral clustering or Laplacian eigenmaps. Bottom left: Isomap. Bottom right: LLE. Error bars are 95% confidence intervals.

As expected, the mean difference between the two distances is almost monotonically increasing as the fraction of substituted examples grows (x-axis in the figure). In most cases, the out-of-sample error is less than or comparable to the training set embedding stability: it corresponds to substituting a fraction of between 1 and 4% of the training examples.

6 Conclusions

In this paper we have presented an extension to five unsupervised learning algorithms based on a spectral embedding of the data: MDS, spectral clustering, Laplacian eigenmaps, Isomap and LLE. This extension allows one to apply a trained model to out-of-sample points without having to recompute eigenvectors. It introduces a notion of function induction and generalization error for these algorithms. The experiments on real high-dimensional data show that the average distance between the out-of-sample and in-sample embeddings is comparable to or lower than the variation in in-sample embedding due to replacing a few points in the training set.

References

Baker, C. (1977). The numerical treatment of integral equations. Clarendon Press, Oxford.

Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396.

Bengio, Y., Vincent, P., Paiement, J., Delalleau, O., Ouimet, M., and Le Roux, N. (2003). Spectral clustering and kernel PCA are learning eigenfunctions. Technical report, Département d'informatique et recherche opérationnelle, Université de Montréal.

Cox, T. and Cox, M. (1994). Multidimensional Scaling. Chapman & Hall, London.

de Silva, V. and Tenenbaum, J. (2003). Global versus local methods in nonlinear dimensionality reduction. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems, volume 15, pages 705-712, Cambridge, MA. The MIT Press.

Gower, J. (1968). Adding a point to vector diagrams in multivariate analysis. Biometrika, 55(3):582-585.

Koltchinskii, V. and Giné, E. (2000). Random matrix approximation of spectra of integral operators. Bernoulli, 6(1):113-167.

Ng, A. Y., Jordan, M. I., and Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In Dietterich, T. G., Becker, S., and Ghahramani, Z., editors, Advances in Neural Information Processing Systems 14, Cambridge, MA. MIT Press.

Roweis, S. and Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326.

Saul, L. and Roweis, S. (2002). Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119-155.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319.

Shawe-Taylor, J. and Williams, C. (2003). The stability of kernel principal components analysis and its relation to the process eigenspectrum. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems, volume 15. The MIT Press.

Shi, J. and Malik, J. (1997). Normalized cuts and image segmentation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 731-737.

Tenenbaum, J., de Silva, V., and Langford, J. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323.

Weiss, Y. (1999). Segmentation using eigenvectors: a unifying view. In Proceedings IEEE International Conference on Computer Vision, pages 975-982.

Williams, C. and Seeger, M. (2000). The effect of the input density distribution on kernel-based classifiers. In Proceedings of the Seventeenth International Conference on Machine Learning. Morgan Kaufmann.