2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies

A Fast Incremental Spectral Clustering for Large Data Sets

Tengteng Kong 1, Ye Tian 1, Hong Shen 1,2
1 School of Computer Science, University of Science and Technology of China
2 School of Computer Science, University of Adelaide, Australia

Abstract—Spectral clustering is an emerging research topic with numerous applications, such as data dimension reduction and image segmentation. In spectral clustering, as new data points are added continuously, dynamic data sets must be processed in an on-line way to avoid costly re-computation. In this paper, we propose a new representative measure to compress the original data sets and maintain a set of representative points by continuously updating the Eigen-system with incidence vectors. Using these extracted points, we generate instant cluster labels as new data points arrive. Our method is effective and able to process large data sets due to its low time complexity. Experimental results over various real evolving data sets show that our method provides fast and relatively accurate results.

Index Terms—Spectral Clustering, Incremental, Eigen-Gap, Representative Point

This paper was partially supported by the "100 Talents" Project of the Chinese Academy of Sciences, NSFC grant #622307, and the Provincial Natural Science Fund of Anhui #11040606Q52. The corresponding author is Hong Shen.

I. INTRODUCTION

Spectral clustering uses the information contained in the spectrum of the data affinity matrix to detect the structure of data distributions. Recently, it has become increasingly popular, both for its fundamental advantages over traditional algorithms [6] and for its simplicity of implementation with standard linear algebra methods [2], [5]. It has been used in applications ranging from data dimension reduction to computer vision, image segmentation and speech recognition. Classical algorithms usually have to make explicit assumptions about the data before running (e.g., the EM algorithm assumes that the data follow a Gaussian mixture model [1]). Consequently, these methods usually fail when the data are arranged in a more complex fashion [3], [4]. Compared with these algorithms, spectral clustering can achieve surprisingly good results by analyzing the spectrum of the data set.

Before spectral clustering can run, we need to construct a similarity matrix and compute its corresponding spectrum. This is computationally expensive, and the situation is more severe for massive data. It is therefore necessary to compress the data sets and apply spectral clustering in an on-line way to avoid costly re-computation as the data evolve. However, almost all existing spectral clustering methods are off-line and make no use of data compression. Hence, it is difficult to apply spectral clustering when data sets are large and evolving.

In response to these problems, there are mainly two kinds of solutions. One relies on simulating the change of the Eigen-system to avoid re-computation as new data points arrive: in [8], an incremental spectral clustering algorithm is proposed to handle the changes among the objects. The method introduces an incidence vector to represent the insertion/deletion of data points and continuously updates the Eigen-system by analyzing the approximate relations between the changes of eigenvalues and eigenvectors. It achieves good accuracy, but suffers from uncertain convergence and works only with a constant number of clusters. The other relies on extracting representative points to compress the data set: in [9], a self-adapting algorithm is proposed to inspect the clusters as new data points are added.
Instead of computing the affinity matrix of all entries, it maintains only a few representative data points and hence works more efficiently. However, using only one representative data point per cluster may introduce significant errors. In general, these methods cluster the data incrementally in different ways, but have not achieved the desired efficiency.

In this paper, we propose an incremental spectral clustering algorithm to deal with evolving large data sets by extending the NJW spectral clustering algorithm [1]. Our algorithm efficiently assigns instant cluster labels to newly arriving data according to the representative sets estimated by our proposed measure, and updates the Eigen-system [6] with incidence vectors [7] to detect changes in the number of clusters. Compared with re-computing the solution with NJW, our algorithm achieves similar accuracy at a much lower computational cost.

The rest of the paper is organized as follows. In Section II, we give some background knowledge used in the NJW algorithm. In Section III, we introduce our incremental spectral clustering algorithm. The experimental results are reported in Section IV, followed by concluding remarks.

II. PRELIMINARIES

First, we state some notation used in this paper. Script letters, such as ξ and φ, represent sets. Capital letters, such as L and W, represent matrices. Lower-case letters in vector form, such as $v_i$ and $u_j$, represent column vectors, and we use subscripts to index the elements of matrices and vectors. In addition, eigenvalues are listed in ascending order, and the first k eigenvectors are the eigenvectors corresponding to the k smallest eigenvalues.
A. NJW Spectral Clustering Algorithm

The NJW algorithm, one of the most common spectral clustering algorithms, introduces a particular way to use the first k eigenvectors and gives conditions under which the algorithm can be expected to do well. It can be outlined as follows, using the notation of [2].

Algorithm 1 NJW algorithm
Input: Affinity matrix $W \in \mathbb{R}^{n \times n}$, number k of clusters to construct.
1) Compute the Laplacian matrix $L = D - W$, where D is the diagonal matrix with $D_{ii} = \sum_{j=1}^{n} W_{ij}$.
2) Compute the first k eigenvectors $u_1, \ldots, u_k$ of the eigenproblem $Lu = \lambda D u$; let $Z \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns.
3) Cluster $y_1, \ldots, y_n$ with the K-means algorithm into clusters $c_1, \ldots, c_k$, where $y_i$ corresponds to the i-th row of Z.
Output: Clusters $A_1, \ldots, A_k$ with $A_i = \{ j \mid y_j \in c_i \}$.

As input to the algorithm, the construction of the affinity matrix is very important. We use the k-nearest neighbor graph to construct the similarity matrix and the Gaussian similarity function to measure the similarity between points [2]:

$$A_{ij} = \exp\left( -\frac{d(s_i, s_j)^2}{2\sigma^2} \right) \qquad (1)$$

It is simple to work with and results in a sparse affinity matrix whose first k eigenvectors can be computed efficiently. However, it is computationally expensive to re-solve the generalized eigenvalue system as new data points arrive. By analyzing the spectrum of the Laplacian matrix constructed from all data entries, the original data can be compressed into a certain number of representative points.

B. Incidence Vector

As new data arrive, it is necessary to represent the dynamic changes in the Laplacian matrix. A solution was proposed in [8], which introduced incidence vectors to update the Eigen-system.

Definition 1. An incidence vector $\sqrt{c_{ij}}\, r_{ij}$ is a column vector with two nonzero elements: the i-th element equal to $\sqrt{c_{ij}}$ and the j-th element equal to $-\sqrt{c_{ij}}$, indicating that data points i and j have similarity $c_{ij}$.

In addition, we let R be the matrix containing all the incidence vectors as columns, in any order. Obviously, there are at most $(n^2 - n)/2$ columns in R if the affinity matrix W is generated by a fully connected graph. Fortunately, the actual number of columns in R is far less than $(n^2 - n)/2$, since W is sparse.

Proposition 2. The Laplacian matrix $L = D - W$ can be decomposed as $L = RR^T$ [10]. If data points $v_i$ and $v_j$ have a similarity change $\Delta c_{ij}$ corresponding to the incidence vector $\sqrt{\Delta c_{ij}}\, r_{ij}$, the new graph Laplacian $\tilde{L}$ can be decomposed as $\tilde{L} = \tilde{R}\tilde{R}^T$, where $\tilde{R} = [R, \sqrt{\Delta c_{ij}}\, r_{ij}]$.

Consider a newly arriving data point $v_l$: it can simply be decomposed into a series of incidence vectors added to R. It is worth noting, however, that after updating R, the matrices W, D and L change as well. According to Proposition 2, the increments of L and D with respect to $\sqrt{\Delta c_{ij}}\, r_{ij}$ can be expressed as:

$$\Delta L = \tilde{L} - L = \tilde{R}\tilde{R}^T - RR^T = \Delta c_{ij}\, r_{ij} r_{ij}^T \qquad (2)$$

$$\Delta D = \Delta c_{ij}\, \mathrm{diag}\{m_{ij}\} \qquad (3)$$

where $m_{ij}$ is a column vector whose i-th and j-th elements equal 1 while all others equal 0. Since the first-order approximation of the eigenvalue change $\Delta\lambda$ can be computed as:

$$\Delta\lambda = \frac{x^T (\Delta L - \lambda\, \Delta D)\, x}{x^T D\, x} \qquad (4)$$

we can further specialize Eq. (4) according to Eq. (2) and Eq. (3) with the incidence vector $\sqrt{\Delta c_{ij}}\, r_{ij}$ as:

$$\Delta\lambda = \Delta c_{ij}\, \frac{x^T \left( r_{ij} r_{ij}^T - \lambda\, \mathrm{diag}\{m_{ij}\} \right) x}{x^T D\, x} \qquad (5)$$

C. Eigen-gap

Choosing the number of clusters is a general problem for all clustering algorithms, and various methods have been devised for it. Here, we adopt the Eigen-gap heuristic [11], which is particularly designed for spectral clustering. It is known that in the case of k completely disconnected clusters, the first k eigenvalues are exactly 0, while there is a gap between $\lambda_k$ and $\lambda_{k+1}$, which is called the Eigen-gap. Similar behavior holds in the general case, according to matrix perturbation theory. Therefore, the number of clusters k can be detected by the Eigen-gap and expressed as follows:

$$k = \arg\max_i\, g_i \qquad (6)$$

where $g_i = \lambda_{i+1} - \lambda_i$ for $i = 1, \ldots, n-1$ and n is the number of data points.
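As a concrete illustration of the heuristic (our sketch, not part of the original paper), the following Python fragment picks k as the position of the largest gap in the ascending eigenvalue sequence, per Eq. (6):

```python
import numpy as np

def eigen_gap_k(eigvals):
    """Eigen-gap heuristic (Eq. 6): with eigenvalues sorted ascending,
    k is the index at which the gap g_i = lambda_{i+1} - lambda_i is
    largest."""
    lam = np.sort(eigvals)
    gaps = np.diff(lam)              # g_i = lambda_{i+1} - lambda_i
    return int(np.argmax(gaps)) + 1  # +1: gaps[0] follows lambda_1

# Example: three well-separated clusters -> first three eigenvalues near 0.
lam = [0.0, 1e-4, 2e-4, 0.9, 1.1, 1.3]
print(eigen_gap_k(lam))  # -> 3
```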
D. Representative Measurement Analysis

There are several methods to compute the central or representative points of a cluster. However, these methods are mostly based on density, distance or proximity, and cannot reflect the complex relationships among points in clusters generated by spectral clustering. Here, we heuristically illustrate the relevance of points. Consider the case of k connected components whose vertices are ordered according to the cluster they belong to. The affinity matrix is then block diagonal, and the same is true for L:

$$L = \begin{pmatrix} L_1 & & 0 \\ & \ddots & \\ 0 & & L_k \end{pmatrix}$$

where each $L_i$ is the Laplacian of a connected graph, which has eigenvalue 0 with the constant-one eigenvector. We know that the first k eigenvectors of L are piecewise constant, with corresponding eigenvalues 0. Hence, 0 is a repeated eigenvalue with multiplicity k, and the eigensolver may return any set of orthogonal vectors spanning the same space as the first k eigenvectors of L.
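This fact is easy to verify numerically. The short Python sketch below (ours, with an arbitrary example graph) builds the unnormalized Laplacian of two disconnected triangles and confirms that 0 is an eigenvalue of multiplicity two whose eigenvectors are constant within each block:

```python
import numpy as np

# Two disconnected triangles: L = D - W is block diagonal, so the
# eigenvalue 0 has multiplicity 2 and every vector in its eigenspace
# is piecewise constant on the two blocks.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
lam, U = np.linalg.eigh(L)       # eigenvalues in ascending order
print(np.round(lam, 6))          # first two eigenvalues are 0
print(np.round(U[:, :2], 3))     # constant within each triangle's rows
```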
In [3], the authors defined a cost function:

$$J = \sum_{i=1}^{n} \sum_{j=1}^{k} \frac{X_{ij}^2}{M_i^2} \qquad (7)$$

where $M_i = \max_j X_{ij}$. By minimizing J over the cluster number k, one recovers the rotation which best aligns the columns of X with the canonical coordinate system. Furthermore, minimizing J means incorporating as few columns as possible to contain the larger data gaps, that is, preserving marked indicators while suppressing inapparent ones. This accords with our clustering target and gives some expression to the label information of the corresponding points. A similar result arises in the general case with perturbed data. Therefore, it is reasonable to measure the representativeness of points with a similar cost function.

III. OUR PROPOSED METHOD

By evaluating the points of every cluster with our proposed measure, we compress the original data into a set of representative points. Instant cluster labels can then be generated from these extracted representative points as new data points are added. However, as new data continuously arrive, the original representative points may no longer represent their clusters well. Hence, we apply incidence vectors to track the change of the data in the form of the Eigen-system and keep the set of representative points up to date. In this section, we discuss these problems in detail.

A. Extracting Representative Points and their Number

1) Representative Measurement: Once we obtain the clusters from the NJW algorithm, it makes sense to analyze the representativeness of each point in its submanifold. There are many general algorithms designed for this problem [12]; however, most of them are based on distance, density or mode estimation and hence cannot reflect the internal and external relations between clusters. For this purpose, we define a new cost function that measures the representative reliability within each cluster according to its eigenvectors. Inspired by (7), we define the representative reliability $R_i$ of point $v_i$ in cluster $C_j$ as:

$$R_i = \sum_{j=1}^{k} \frac{X_{ij}^2}{M_i^2} \qquad (8)$$

where $M_i = \max_j X_{ij}$; a better representative point has a smaller value of $R_i$.

Fig. 1 shows a toy example of a graph evolving from (a) to (b) as a new type of data point D, together with an edge BD, is added. In Fig. 1(a), the representative point should be B, while in Fig. 1(b) it should be A. That is to say, the measure of Eq. (8) prefers points with more similarity inside their cluster and less similarity to points outside it. Hence, connections to other clusters reduce a point's representative reliability.

Figure 1: A toy example of incremental data: (a) before evolution, (b) after evolution. The dashed line is the edge to be added.

2) The Number of Representative Points: The next problem is to select the number of representative points. We want to choose enough points to represent a cluster while at the same time keeping their number as small as possible to avoid redundant computation. We can solve this problem by analyzing the Eigen-gap of each cluster and fixing the number by Eq. (6). Furthermore, if there is a particular demand on time and a certain error is allowed, we can approximate the spectrum of each sub-cluster $C_j$ with the corresponding columns and rows of Z, where Z is the spectrum of the whole data set; we denote the reduced matrix by $Z_{C_j} \in \mathbb{R}^{|C_j| \times |C_j|}$. The approximate eigenvalues of cluster $C_j$ can then be expressed as:

$$\lambda_i^{C_j} = \frac{ x_i^{C_j\,T} L\, x_i^{C_j} }{ x_i^{C_j\,T} D\, x_i^{C_j} } \qquad (9)$$

where $x_i^{C_j}$ corresponds to the i-th column of $Z_{C_j}$. We can then use Eq. (6) to detect the number of representative points.
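To make the selection step concrete, here is a small Python sketch of Eq. (8) (our illustration; the function name and the use of $|X_{ij}|$ for $M_i$ are our assumptions) that ranks the points of one cluster by representative reliability and returns the best ones:

```python
import numpy as np

def representative_points(X, members, n_rep):
    """Rank one cluster's points by the reliability of Eq. (8):
    R_i = sum_j X_ij^2 / M_i^2 with M_i = max_j |X_ij|, where X holds
    the first k eigenvectors as columns and `members` are the row
    indices of this cluster's points. Smaller R_i = more representative."""
    Xc = X[members]                        # rows of this cluster
    M = np.abs(Xc).max(axis=1)             # M_i = max_j |X_ij|
    R = (Xc ** 2).sum(axis=1) / (M ** 2)   # Eq. (8)
    order = np.argsort(R)                  # ascending: best first
    return [members[i] for i in order[:n_rep]]
```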
B. Updating Representative Sets and Re-initializing the Algorithm when the Cluster Number Changes

As new data arrive incrementally, error accumulates; this is also a problem in many other algorithms. Here we re-initialize the NJW algorithm to avoid a collapse. The question is then when to apply the re-initialization step. We could simply apply it once a preset number of points have been added, but a constant threshold is hardly adequate, since the added data may have very different similarity connections to the original data points. Hence, we expect to obtain a better result by continuously detecting the change of the cluster number in an approximate way. The current cluster number can be detected by the Eigen-gap as:

$$\begin{aligned} k' &= \arg\max_i\, (\lambda'_{i+1} - \lambda'_i) \\ &= \arg\max_i\, \big( (\lambda_{i+1} + \Delta\lambda_{i+1}) - (\lambda_i + \Delta\lambda_i) \big) \\ &= \arg\max_i\, \big( g_i + (\Delta\lambda_{i+1} - \Delta\lambda_i) \big) \end{aligned} \qquad (10)$$

Thus, we obtain the current number of clusters k′ from Eq. (10) and Eq. (5), and apply the re-initialization step when k′ ≠ k.
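A minimal sketch of this detection step, under our reading of Eqs. (5) and (10) (the helper names are hypothetical), might look as follows in Python:

```python
import numpy as np

def delta_lambda(x, lam, D_diag, i, j, dc):
    """First-order eigenvalue change (Eq. 5) for a similarity change dc
    between points i and j, given an eigenpair (lam, x) and the diagonal
    of D. r_ij has +1 at i and -1 at j; diag{m_ij} has 1 at i and j."""
    num = dc * ((x[i] - x[j]) ** 2 - lam * (x[i] ** 2 + x[j] ** 2))
    den = x @ (D_diag * x)                 # x^T D x
    return num / den

def detect_k(eigvals, deltas):
    """Eq. (10): re-detect the cluster number from the perturbed
    eigenvalues lambda_i + delta_lambda_i via the largest gap."""
    lam = np.asarray(eigvals) + np.asarray(deltas)
    return int(np.argmax(np.diff(lam))) + 1
```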
In III-A, we chose $k_{C_i}$ representative points by analyzing the Eigen-gap of cluster $C_i$. Consider a newly arriving data point assigned to $C_i$ without changing the magnitude of $k_{C_i}$. In this situation, the previously extracted representative points still work, since no new type of point has been generated. However, when $k_{C_i}$ increases, the previously extracted points can hardly suffice. Therefore, we adopt a strategy similar to the above discussion of the re-initialization step, and solve this problem by simply adding the point that caused the change of $k_{C_i}$ to the representative set.

C. A Fast Incremental Spectral Clustering Algorithm for Large Data Sets

1) The Algorithm: Summarizing Sections III-A and III-B, we propose a new incremental spectral clustering algorithm, described as follows:

Algorithm 2 A Fast Incremental Spectral Clustering Algorithm
Input: Number of clusters k, affinity matrix $W \in \mathbb{R}^{n \times n}$ at time t, newly arriving data points $v_l$ after t.
1) Apply Algorithm 1 with parameters k and W and generate k clusters $C_1, \ldots, C_k$. Denote by X the matrix containing the first k eigenvectors as columns and by Z the matrix containing all of them.
2) For each cluster $C_i$, compute the representative reliability $R_j$ of every point $v_j \in C_i$ according to Eq. (8), and choose the first $k_{C_i}$ points to represent cluster $C_i$, denoted $C'_i$. $k_{C_i}$ is computed by Eq. (6), where the corresponding eigenvalues $\lambda_i^{C_j}$ are given by Eq. (9). Note that the first $k_{C_i}$ points are the points corresponding to the $k_{C_i}$ smallest values of $R_j$.
3) For every newly added point $v_l$, compute the average distance $Ds_j$ from $v_l$ to each cluster $C'_j$ and assign $v_l$ to the cluster $C_m$ giving the smallest value of $Ds_j$:

$$Ds_j = \frac{ \sum_{v_i \in C'_j} d(v_l, v_i) }{ |C'_j| }$$

4) Compute the current cluster number k′ according to Eq. (10), where the change of the eigenvalues is given by Eq. (5) in the form of incidence vectors. If k′ ≠ k, go back to step 1 and re-initialize the algorithm with k = k′; otherwise continue.
5) Compute the current number $k'_{C_m}$ of $C_m$'s representative points, similarly to step 4. If $k'_{C_m} > k_{C_m}$, add $v_l$ to $C'_m$; otherwise continue.
6) Go to step 3.
Output: Instant cluster labels of the points $v_l$.

2) Discussions: It is known that computing the spectrum of a standard matrix needs $O(n^3)$ operations, which can be reduced further to $O(n^{3/2})$ if the Laplacian matrix is sparse. However, the computational cost is still very high. Hence, the NJW algorithm may fail when the data scale is large or new data arrive too frequently, whereas our algorithm remains fast and reasonably accurate. Here, we briefly analyze its time complexity. It needs $O(n)$ operations to compute the representative points of each cluster at initialization and $O(\tilde{n}^{3/2})$ operations to generate cluster labels and update the representative sets as new data arrive, where n and ñ denote the sizes of the data set and the representative set, respectively. Since ñ is usually much smaller than n and relatively stable, our method is efficient and able to process large data sets.

IV. EXPERIMENTS

A. Parameter Settings

As mentioned before, we use the k-nearest neighbor graph to construct the sparse affinity matrix. This may lead to a non-symmetric matrix; fortunately, we can make it symmetric by simply setting both $W_{ij}$ and $W_{ji}$ to the similarity of $v_i$ and $v_j$ if either $W_{ij}$ or $W_{ji}$ is non-zero. In this experiment, we adopt the Gaussian similarity function to measure local neighborhoods between points, and its parameter σ is selected in the self-tuning way suggested in [3]. Moreover, we employ ARPACK (a variant of the Lanczos method) to compute the spectrum of $D^{-1}L$, and choose k = 20 to construct the k-nearest neighbor graph.
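For reference, the symmetrization rule just described can be written in one line of Python (our sketch; it assumes non-negative similarities, which holds for the Gaussian function):

```python
import numpy as np

def symmetrize_knn(W):
    """Make a (possibly non-symmetric) k-nearest-neighbor affinity
    matrix symmetric: if either W_ij or W_ji is non-zero, set both to
    the similarity of v_i and v_j. With non-negative similarities, the
    element-wise maximum keeps the non-zero entry of each pair."""
    return np.maximum(W, W.T)

# Tiny check: an asymmetric kNN relation becomes mutual.
W = np.array([[0.0, 0.8], [0.0, 0.0]])
print(symmetrize_knn(W))  # both W[0,1] and W[1,0] equal 0.8
```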
B. Data Sets

The data set is a collection of about 810,000 documents known as RCV1 (Reuters Corpus Volume I) [14]. It is manually categorized into 350 classes and split into 23,139 training documents and 781,256 test documents. We use the category codes based on the industries vocabulary and preprocess the data by removing multi-labeled documents and categories with fewer than 500 documents. This leaves about 200,000 documents in 103 categories. In this experiment, we extract a subset ϕ from the 200,000 documents to initialize our algorithm and simulate the growth of the data set by adding data points to ϕ from the rest of the 200,000 documents.

C. Quality Measure

We evaluate our algorithm by computing the Clustering Accuracy (CA) and the Normalized Mutual Information (NMI) between the labels generated by our algorithm and the real ones [13]:

$$CA = \max_{map} \frac{ \sum_{i=1}^{n} \delta\big( y_i, map(c_i) \big) }{ n }$$

where n denotes the number of documents, and $y_i$ and $c_i$ denote the real label and the generated label of document $v_i$, respectively. The function $\delta(y, c)$ equals 1 if y = c and 0 otherwise. The permutation function map(·) maps each generated label to a real one, and the optimal mapping function can be found via [15]. The magnitude of CA is between 0 and 1, and a higher CA score means better clustering quality.

$$NMI = \frac{ \sum_{i=1}^{k} \sum_{j=1}^{k} n_{ij} \log\left( \frac{n \cdot n_{ij}}{ n_i\, \hat{n}_j } \right) }{ \sqrt{ \left( \sum_{i} n_i \log \frac{n_i}{n} \right) \left( \sum_{j} \hat{n}_j \log \frac{\hat{n}_j}{n} \right) } }$$

where n denotes the number of documents, $n_i$ and $\hat{n}_j$ denote the numbers of documents in cluster i and category j, and $n_{ij}$ denotes the number of documents shared by cluster i and category j. The magnitude of NMI is between 0 and 1, and a higher NMI score means better clustering quality.
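For completeness, here is a compact Python sketch of both measures (ours; it realizes the optimal label mapping with SciPy's Hungarian solver, linear_sum_assignment, which is one standard way to solve the matching problem of [15]):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """CA: best one-to-one mapping of generated labels to real ones
    (Hungarian method on the contingency table), then fraction correct."""
    lt, lp = np.unique(y_true), np.unique(y_pred)
    C = np.zeros((len(lp), len(lt)))
    for p, t in zip(y_pred, y_true):
        C[np.searchsorted(lp, p), np.searchsorted(lt, t)] += 1
    rows, cols = linear_sum_assignment(-C)   # maximize matched counts
    return C[rows, cols].sum() / len(y_true)

def nmi(y_true, y_pred):
    """Normalized mutual information between the two labelings."""
    n = len(y_true)
    lt, lp = np.unique(y_true), np.unique(y_pred)
    nij = np.array([[np.sum((y_pred == p) & (y_true == t)) for t in lt]
                    for p in lp], dtype=float)
    ni, nj = nij.sum(axis=1), nij.sum(axis=0)   # cluster/category sizes
    mask = nij > 0
    mi = (nij[mask] * np.log(n * nij[mask] / np.outer(ni, nj)[mask])).sum()
    hx = -(ni[ni > 0] * np.log(ni[ni > 0] / n)).sum()
    hy = -(nj[nj > 0] * np.log(nj[nj > 0] / n)).sum()
    return mi / np.sqrt(hx * hy)
```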
Figure 2: Clustering quality and runtime comparison between K-means, Modified NJW and Alg. 2 on the RCV1 data set, plotted against the number of points: (a) NMI, (b) Accuracy, (c) Time. For Alg. 2, we use 3,000 points for initialization and incrementally add another 3,000 points from the rest of the data set. For K-means, each value is the mean of 10 replicates.

D. Results

Fig. 2(a) and Fig. 2(b) show the NMI and CA scores on the RCV1 data set. Both results confirm that our algorithm achieves a clustering quality between those of NJW and K-means. Although the NMI and CA values may drop gradually as more points are added, this is rectified by the automatic re-initialization operation of Alg. 2. Furthermore, the algorithm performs comparatively better as the number of points increases, which is crucial for large data sets. Fig. 2(c) reports the runtime on the RCV1 data set. The runtime of Alg. 2 is close to that of K-means and much less than that of NJW. In addition, the growth of the runtime as new points are added is not as sharp as for NJW; on the contrary, it becomes relatively stable and approaches that of K-means. Hence, compared with re-computation by NJW, Alg. 2 achieves similar accuracy at a much lower computational cost.

V. CONCLUSIONS

A fast incremental spectral clustering algorithm for large data sets is proposed in this paper. It extends the NJW algorithm to handle dynamic data and incorporates a new measurement strategy to compress the original data sets into a certain number of representative points. Instead of evaluating the whole data set, the algorithm incrementally maintains representative sets and generates instant cluster labels as new points arrive. Therefore, the algorithm is fast and can be efficiently applied to large data sets. Moreover, by analyzing the Eigen-gap in the form of incidence vectors, changes in the cluster number can be detected automatically. Experimental results over real evolving data sets illustrate that our method provides fast and relatively accurate results.

REFERENCES

[1] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, 2002, pp. 849-856.
[2] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, 17:395-416, 2007.
[3] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," in L. K. Saul, Y. Weiss, and L. Bottou (Eds.), Advances in Neural Information Processing Systems 17, pp. 1601-1608, MIT Press, Cambridge, MA, 2005.
[4] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[5] F. Bach and M. Jordan, "Learning spectral clustering," in Proc. of NIPS-16, MIT Press, 2004.
[6] F. R. K. Chung, Spectral Graph Theory, American Mathematical Society, 1997.
[7] B. Bollobas, Modern Graph Theory, Springer, New York, 1998.
[8] H. Ning, W. Xu, Y. Chi, Y. Gong, and T. Huang, "Incremental spectral clustering with application to monitoring of evolving blog communities," in SIAM Int. Conf. on Data Mining, 2007.
[9] C. Valgren, T. Duckett, and A. Lilienthal, "Incremental spectral clustering and its application to topological mapping," in Proc. IEEE Int. Conf. on Robotics and Automation, pp. 4283-4288, 2007.
[10] F. R. K. Chung, Spectral Graph Theory, CBMS Regional Conference Series in Mathematics, vol. 92, American Mathematical Society, Providence, RI, 1997.
[11] R. Bhatia, Matrix Analysis, Springer, New York, 1997.
[12] D. Chaudhuri, C. A. Murthy, and B. B. Chaudhuri, "Finding a subset of representative points in a data set," IEEE Trans. Systems, Man, and Cybernetics, vol. 24, no. 9, pp.
1416-1424, 1994.
[13] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, and E. Y. Chang, "Parallel spectral clustering in distributed systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[14] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," Journal of Machine Learning Research, 5:361-397, 2004.
[15] L. Lovasz and M. Plummer, Matching Theory, Akademiai Kiado, North-Holland, Budapest, 1986.