A Fast Incremental Spectral Clustering for Large Data Sets




2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies

Tengteng Kong 1, Ye Tian 1, Hong Shen 1,2
1 School of Computer Science, University of Science and Technology of China
2 School of Computer Science, University of Adelaide, Australia

Abstract: Spectral clustering is an emerging research topic with numerous applications, such as data dimension reduction and image segmentation. In spectral clustering, as new data points are added continuously, dynamic data sets must be processed in an on-line way to avoid costly re-computation. In this paper, we propose a new representative measure to compress the original data sets, and we maintain a set of representative points by continuously updating the Eigen-system with the incidence vector. From these extracted points we generate instant cluster labels as new data points arrive. Our method is effective and able to process large data sets thanks to its low time complexity. Experimental results on real evolving data sets show that our method provides fast and relatively accurate results.

Index Terms: Spectral Clustering, Incremental, Eigen-Gap, Representative Point

I. INTRODUCTION

Spectral clustering uses the information contained in the spectrum of the data affinity matrix to detect the structure of data distributions. Recently, it has become increasingly popular both for its fundamental advantages over traditional algorithms [6] and for its simplicity of implementation with standard linear algebra methods [2], [5]. It has been used in applications ranging from data dimension reduction to computer vision, image segmentation and speech recognition. Classical algorithms usually have to make explicit assumptions about the data before implementation (e.g., the EM algorithm assumes the data follow a Gaussian mixture model [1]); consequently, these methods usually fail when the data are arranged in a more complex fashion [3], [4]. Compared with these algorithms, spectral clustering can achieve surprisingly good results by analyzing the spectrum of the data set.

Before spectral clustering can be applied, we need to construct a similarity matrix and compute its corresponding spectrum. This is obviously computationally expensive, and the situation is more severe for massive data. It is therefore necessary to compress the data sets and apply spectral clustering in an on-line way to avoid costly re-computation as the data evolve. However, almost all existing spectral clustering methods are off-line and make no use of data compression, so it is difficult to apply spectral clustering when data sets are large and evolving.

In response to the above problems, there are mainly two kinds of solutions. One relies on simulating the change of the Eigen-system to avoid re-computation as new data points arrive: in [8], an incremental spectral clustering algorithm is proposed to handle the changes among the objects. It introduces an incidence vector to represent the insertion/deletion of data points and continuously updates the Eigen-system by analyzing the approximate relations between the changes of eigenvalues and eigenvectors. It achieves good accuracy, but its convergence is uncertain and it works only with a constant number of clusters. The other relies on extracting representative points to compress the data set: in [9], a self-adaptive algorithm is proposed to inspect the clusters as new data points are added.

(This paper was partially supported by the "100 Talents" Project of the Chinese Academy of Sciences, NSFC grant #622307, and the Provincial Natural Science Fund of Anhui #11040606Q52. The corresponding author is Hong Shen.)
Instead of computing the affinity matrix over all entries, it maintains only a few representative data points and hence works more efficiently. However, using only one representative data point per cluster may introduce significant errors. In general, these methods cluster the data sets incrementally in different ways, but have not achieved the desired efficiency.

In this paper, we propose an incremental spectral clustering algorithm to deal with evolving large data sets by extending the NJW spectral clustering algorithm [1]. Our algorithm efficiently assigns instant cluster labels to newly arriving data according to the representative sets estimated by our proposed measure, and updates the Eigen-system [6] with the incidence vector [7] to detect changes in the number of clusters. Compared with re-computing the solution with NJW, our algorithm achieves similar accuracy at a much lower computational cost.

The rest of the paper is organized as follows. In Section II, we give some background knowledge used in the NJW algorithm. In Section III, we introduce our incremental spectral clustering algorithm. The experimental results are reported in Section IV, followed by concluding remarks.

II. PRELIMINARIES

First, we state some notation used in this paper. Scripted letters, such as ξ and φ, represent sets. Capital letters, such as L and W, represent matrices. Lower-case letters in vector form, such as v_i and u_j, represent column vectors. We use subscripts to index the elements of matrices and vectors. In addition, eigenvalues are listed in ascending order, and the first k eigenvectors are the eigenvectors corresponding to the k smallest eigenvalues.

A. NJW Spectral Clustering Algorithm

The NJW algorithm, one of the most common spectral clustering algorithms, introduces a particular way of using the first k eigenvectors and gives conditions under which the algorithm can be expected to do well. It can be outlined as follows, using the notation of [2].

Algorithm 1 NJW algorithm
Input: Affinity matrix W ∈ R^{n×n}, number k of clusters to construct.
1) Compute the Laplacian matrix L = D − W, where D is the diagonal matrix with D_ii = Σ_{j=1}^{n} W_ij.
2) Compute the first k eigenvectors u_1, ..., u_k of the eigenproblem Lu = λDu; let Z ∈ R^{n×k} be the matrix containing the vectors u_1, ..., u_k as columns.
3) Cluster y_1, ..., y_n with the K-means algorithm into clusters c_1, ..., c_k, where y_i corresponds to the i-th row of Z.
Output: Clusters A_1, ..., A_k with A_i = {j | y_j ∈ c_i}.

As input to the algorithm, the construction of the affinity matrix is very important. We use the k-nearest-neighbor graph to construct the similarity matrix, and the Gaussian similarity function to measure the similarity between points [2]:

    A_ij = exp( −d(s_i, s_j)^2 / (2σ^2) )    (1)

This construction is simple to work with and results in a sparse affinity matrix whose first k eigenvectors can be computed efficiently. However, it is computationally expensive to re-solve the generalized eigenvalue system as new data points arrive. By analyzing the spectrum of the Laplacian matrix constructed from all data entries, the original data can be compressed into a certain number of representative points.
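To make Algorithm 1 concrete, the following is a minimal sketch in Python using numpy, scipy and scikit-learn. The function name njw_cluster and the parameters n_neighbors and sigma are illustrative assumptions, not from the paper; the sketch also folds in the symmetrization of the k-nearest-neighbor graph that Section IV-A describes.

    # Minimal sketch of Algorithm 1 (NJW); names are illustrative.
    import numpy as np
    from scipy.linalg import eigh
    from scipy.spatial.distance import cdist
    from sklearn.cluster import KMeans

    def njw_cluster(S, k, n_neighbors=20, sigma=1.0):
        """S: (n, d) array of data points; k: number of clusters."""
        n = S.shape[0]
        dist = cdist(S, S)                          # pairwise distances d(s_i, s_j)
        W = np.exp(-dist ** 2 / (2 * sigma ** 2))   # Gaussian similarity, Eq. (1)
        # Sparsify with a k-nearest-neighbor graph: keep only the n_neighbors
        # closest points per row (column 0 of the argsort is the point itself).
        order = np.argsort(dist, axis=1)
        mask = np.zeros_like(W, dtype=bool)
        rows = np.arange(n)[:, None]
        mask[rows, order[:, 1:n_neighbors + 1]] = True
        W = np.where(mask | mask.T, W, 0.0)         # symmetrize as in Section IV-A
        np.fill_diagonal(W, 0.0)
        D = np.diag(W.sum(axis=1))
        L = D - W                                   # step 1: Laplacian L = D - W
        # Step 2: first k eigenvectors of the generalized problem L u = lambda D u.
        vals, vecs = eigh(L, D)                     # eigenvalues in ascending order
        Z = vecs[:, :k]
        # Step 3: K-means on the rows y_i of Z.
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)
        return labels, vals, Z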
B. Incidence Vector

As new data arrive, it is necessary to represent the dynamic changes in the Laplacian matrix. A solution was proposed in [8], which introduced the incidence vector to update the Eigen-system.

Definition 1. An incidence vector √(c_ij)·r_ij is a column vector with two nonzero elements: the i-th element equal to √(c_ij) and the j-th element equal to −√(c_ij), indicating that data points i and j have similarity c_ij. In addition, we let R be the matrix containing all the incidence vectors as columns, in any order.

Obviously, there are at most (n^2 − n)/2 columns in R if the affinity matrix W is generated by a fully connected graph. Fortunately, the actual number of columns in R is far smaller than (n^2 − n)/2, since W is sparse.

Proposition 2. The Laplacian matrix L = D − W can be decomposed as L = RR^T [10]. Moreover, if data points v_i and v_j have a similarity change Δc_ij, corresponding to the incidence vector √(Δc_ij)·r_ij, the new graph Laplacian L̃ can be decomposed as L̃ = R̃R̃^T, where R̃ = [R, √(Δc_ij)·r_ij].

A newly arriving data point v_l can thus simply be decomposed into a series of incidence vectors appended to R. Note, however, that after updating R the matrices W, D, and L change as well. According to Proposition 2, the increments of L and D with respect to √(Δc_ij)·r_ij can be expressed as:

    ΔL = L̃ − L = R̃R̃^T − RR^T = Δc_ij r_ij r_ij^T    (2)
    ΔD = Δc_ij diag{m_ij}    (3)

where m_ij is a column vector whose i-th and j-th elements equal 1 while all others equal 0. Since the first-order approximation of the change of an eigenvalue λ with eigenvector x is:

    Δλ = x^T(ΔL − λΔD)x / (x^T D x)    (4)

we can specialize Eq. (4) to the incidence vector √(Δc_ij)·r_ij using Eq. (2) and Eq. (3):

    Δλ = Δc_ij · x^T(r_ij r_ij^T − λ diag{m_ij})x / (x^T D x)    (5)

C. Eigen-gap

Choosing the number of clusters is a general problem for all clustering algorithms, and various methods have been devised for it. Here, we adopt the Eigen-gap heuristic [11], which is particularly designed for spectral clustering. It is known that with k completely disconnected clusters, the first k eigenvalues are exactly 0, while there is a gap between λ_k and λ_{k+1}, called the Eigen-gap. A similar situation holds in the general case, according to matrix perturbation theory. Therefore, the number of clusters k can be detected from the Eigen-gap as:

    k = argmax_i (g_i)    (6)

where g_i = λ_{i+1} − λ_i for i = 1, ..., n − 1, and n is the number of data points.
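Eq. (6) is straightforward to apply once the eigenvalues are available; a minimal sketch, assuming the eigenvalues are sorted in ascending order as stated in Section II:

    import numpy as np

    def eigengap_k(eigvals):
        """Pick the number of clusters by the Eigen-gap heuristic, Eq. (6).
        eigvals: eigenvalues in ascending order."""
        g = np.diff(eigvals)          # g_i = lambda_{i+1} - lambda_i
        return int(np.argmax(g)) + 1  # +1 converts the 0-based index to k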

D. Representative Measurement Analysis

There are several methods for computing the central or representative points of a cluster. However, these methods are mostly based on density, distance, or propinquity, and they cannot reflect the complex relationships among the points in clusters generated by spectral clustering. Here, we illustrate the relevance of points heuristically. Consider the case of k connected components whose vertices are ordered according to the cluster they belong to. The affinity matrix is then block diagonal, and the same is true for L:

    L = diag(L_1, ..., L_k)

where each L_i is the Laplacian of a connected graph, which has eigenvalue 0 with the constant-one eigenvector. We know that the first k eigenvectors of L are piecewise constant with corresponding eigenvalues 0. Hence, 0 is an eigenvalue of multiplicity k, and the Eigen-solver could return any set of orthogonal vectors spanning the same space as the first k eigenvectors of L. In [3], the authors defined a cost function:

    J = Σ_{i=1}^{n} Σ_{j=1}^{k} X_ij^2 / M_i^2    (7)

where M_i = max_j X_ij. By minimizing J over the cluster number k, they recover the rotation that best aligns the columns of X with the canonical coordinate system. Furthermore, minimizing J means incorporating as few columns as possible that contain a large data gap, that is, preserving marked indicators while suppressing inconspicuous ones. This accords with our clustering target and expresses the label information of the corresponding points. A similar result holds in the general case with perturbed data. It is therefore reasonable to measure the representativeness of points with a similar cost function.

III. OUR PROPOSED METHOD

By estimating the points in every cluster with our proposed measure, we compress the original data into a set of representative points. Instant cluster labels can then be generated from these extracted representative points as new data points are added. However, as new data arrive continuously, the original representative points may no longer represent their clusters well. Hence, we apply the incidence vector to track the change of the data in the form of the Eigen-system and keep the set of representative points up to date. In this section, we discuss these problems in detail.

A. Extracting Representative Points and Their Number

1) Representative Measurement: Once we obtain the clusters from the NJW algorithm, it makes sense to analyze the representativeness of each point within its submanifold. There are many general algorithms designed for this problem [12]; however, most of them are based on distance, density, or mode estimation and hence cannot reflect the internal and external relations between clusters. For this purpose, we define a new cost function that measures the representative reliability of each point in a cluster according to its eigenvectors. Inspired by Eq. (7), we define the representative reliability R_i of point v_i in cluster C_j as:

    R_i = Σ_{j=1}^{k} X_ij^2 / M_i^2    (8)

where M_i = max_j X_ij; a better representative point has a smaller R_i.

[Figure 1: A toy example of incremental data; the dashed line is the edge to be added. (a) Before evolution: points A, B, C. (b) After evolution: a new point D joined to B.]

Fig. 1 shows a toy example of a graph evolving from (a) to (b) as a new-type data point D, together with an edge BD, is added. In Fig. 1(a) the representative point should be B, while in Fig. 1(b) it should be A. That is, the measure of Eq. (8) prefers points with more similarity inside their own cluster and less similarity to other clusters; connections to other clusters reduce a point's representative reliability.

2) The Number of Representative Points: The next problem is to select the number of representative points. We want enough points to represent a cluster while keeping their number as small as possible to avoid redundant computation. We solve this by analyzing the Eigen-gap of each cluster and fixing the number via Eq. (6). Furthermore, if time is critical and some error is tolerable, we can approximate the spectrum of each sub-cluster C_j using the corresponding rows and columns of Z, where Z is the spectrum of the whole data set; denote the reduced matrix by Z_{C_j} ∈ R^{|C_j|×|C_j|}. The approximate eigenvalues of cluster C_j can then be expressed as:

    λ_i^{C_j} = ((x_i^{C_j})^T L x_i^{C_j}) / ((x_i^{C_j})^T D x_i^{C_j})    (9)

where x_i^{C_j} corresponds to the i-th column of Z_{C_j}. We can then use Eq. (6) to detect the number of representative points.
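A sketch of the extraction step: Eq. (8) scores every point of a cluster, and the points with the smallest scores are kept. The names representative_reliability and choose_representatives are illustrative; X is assumed to hold the first k eigenvectors as columns, with row i corresponding to point v_i, and k_c is the count obtained from Eq. (6) and Eq. (9).

    import numpy as np

    def representative_reliability(X):
        """R_i of Eq. (8): R_i = sum_j X_ij^2 / M_i^2 with M_i = max_j X_ij.
        X: (n, k) matrix of the first k eigenvectors as columns."""
        M = np.max(X, axis=1)                  # M_i = max_j X_ij, as in the paper
        return np.sum(X ** 2, axis=1) / M ** 2

    def choose_representatives(X, members, k_c):
        """Return the k_c members of a cluster with the smallest R_i
        (better representatives have smaller R_i)."""
        R = representative_reliability(X[members])
        order = np.argsort(R)
        return [members[i] for i in order[:k_c]]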
B. Updating Representative Sets and Re-initializing the Algorithm when the Cluster Number Changes

As new data arrive incrementally, error accumulates; this is also a problem in many other algorithms. Here we re-initialize the NJW algorithm to avoid a collapse. The question is then when to apply the re-initialization step. We could simply apply it after a preset number of points has been added, but a constant number is hardly adequate, since the added data may have very different similarity connections to the original data points. Hence, we expect a better result by continuously detecting the change of the cluster number in an approximate way. The current cluster number can be detected from the Eigen-gap as:

    k̃ = argmax_i (λ̃_{i+1} − λ̃_i)
       = argmax_i ((λ_{i+1} + Δλ_{i+1}) − (λ_i + Δλ_i))
       = argmax_i (g_i + (Δλ_{i+1} − Δλ_i))    (10)

Thus, we obtain the current cluster number k̃ from Eq. (10) and Eq. (5), and apply the re-initialization step whenever k̃ ≠ k.
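A sketch of this detection test under the paper's first-order approximation, treating a single similarity change Δc_ij. Here r_ij is taken as the ±1 pattern of Definition 1, the perturbed eigenvalues are not re-sorted, and the names are illustrative; this is only the approximation of Eqs. (5) and (10), not an exact update.

    import numpy as np

    def delta_lambda(lam, x, D, i, j, dc):
        """First-order eigenvalue change, Eq. (5), for a similarity change
        dc between points i and j; x is the eigenvector for eigenvalue lam."""
        r = x[i] - x[j]                                 # x^T r_ij with r_ij = e_i - e_j
        quad = r ** 2 - lam * (x[i] ** 2 + x[j] ** 2)   # x^T(r r^T - lam diag{m_ij})x
        return dc * quad / (x @ (D @ x))

    def detect_k(lams, xs, D, i, j, dc):
        """Current cluster number by Eq. (10): argmax of the perturbed gaps.
        lams: ascending eigenvalues; xs: matching eigenvectors as columns."""
        new = np.array([lam + delta_lambda(lam, xs[:, t], D, i, j, dc)
                        for t, lam in enumerate(lams)])
        return int(np.argmax(np.diff(new))) + 1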

In Section III-A we chose k_i representative points by analyzing the Eigen-gap of cluster C_i. Consider a newcomer data point assigned to C_i that does not change the magnitude of k_i. In this situation the previously extracted representative points still work, since no new type of point has been generated. When k_i increases, however, the previously extracted points can hardly cope. We therefore adopt a strategy similar to the re-initialization step discussed above, and solve this problem by simply adding the point that caused the change of k_i to the representative set.

C. A Fast Incremental Spectral Clustering Algorithm for Large Data Sets

1) The Algorithm: Summarizing Sections III-A and III-B, we propose a new incremental spectral clustering algorithm, described as follows:

Algorithm 2 A Fast Incremental Spectral Clustering Algorithm
Input: Number of clusters k, affinity matrix W ∈ R^{n×n} at time t, newly arriving data points v_l after t.
1) Apply Algorithm 1 with parameters k and W and generate k clusters C_1, ..., C_k. Let X be the matrix containing the first k eigenvectors as columns, and Z the matrix containing all of them.
2) For each cluster C_i, compute the representative reliability R_j of every point v_j ∈ C_i according to Eq. (8), and choose the first k_{C_i} points to represent cluster C_i, denoting the set C̃_i. Here k_{C_i} is computed by Eq. (6), with the corresponding eigenvalues λ^{C_i} given by Eq. (9); the first k_{C_i} points are those with the k_{C_i} smallest values of R_j.
3) For every newly added point v_l, compute the average distance Ds_j from v_l to each representative set C̃_j and assign v_l to the cluster C_m giving the smallest value of Ds_j:

    Ds_j = Σ_{v_i ∈ C̃_j} d(v_l, v_i) / |C̃_j|

4) Compute the current cluster number k̃ according to Eq. (10), where the change of each eigenvalue is given by Eq. (5) in the form of the incidence vector. If k̃ ≠ k, go back to step 1 and re-initialize the algorithm with k = k̃; otherwise continue.
5) Compute the current number k̃_{C_m} of C_m's representative points, similarly to step 4. If k̃_{C_m} > k_{C_m}, add v_l to C̃_m; otherwise continue.
6) Go to step 3.
Output: Instant cluster labels of the points v_l.

2) Discussion: It is known that computing the spectrum of a standard matrix needs O(n^3) operations, which can be reduced to O(n^{3/2}) if the Laplacian matrix is sparse. The computational cost is nevertheless still very high, so the NJW algorithm may fail when the data scale is large or new data arrive too frequently. Our algorithm, by contrast, remains applicable: it is fast and relatively accurate. We briefly analyze its time complexity to illustrate this. It needs O(n) operations to compute the representative points of each cluster at initialization, and O(ñ^{3/2}) operations to generate cluster labels and update the representative sets as new data arrive, where n and ñ denote the sizes of the data set and of the representative set, respectively. Since ñ is usually much smaller than n and relatively stable, our method is effective and able to process large data sets.
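Step 3 of Algorithm 2 is a nearest-average-distance rule over the representative sets; a minimal sketch, with assign_label, reps and dist as illustrative names:

    import numpy as np

    def assign_label(v, reps, dist):
        """Step 3 of Algorithm 2: assign a new point v to the cluster C_m
        whose representative set gives the smallest average distance Ds_j.
        reps: list of sequences of representative points, one per cluster;
        dist: a distance function d(., .)."""
        Ds = [np.mean([dist(v, r) for r in c]) for c in reps]
        return int(np.argmin(Ds))

For example, assign_label(x_new, reps, lambda a, b: np.linalg.norm(a - b)) assigns by Euclidean distance.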
IV. EXPERIMENTS

A. Parameter Settings

As mentioned before, we use the k-nearest-neighbor graph to construct the sparse affinity matrix. This may lead to a non-symmetric matrix, but we can make it symmetric by simply setting both W_ij and W_ji to the similarity of v_i and v_j whenever either W_ij or W_ji is non-zero. In this experiment, we adopt the Gaussian similarity function to measure local neighborhoods between points, with its parameter σ selected in the self-tuning way suggested in [3]. Moreover, we employ ARPACK (a variant of the Lanczos method) to compute the spectrum of D^{-1}L, and we choose k = 20 to construct the k-nearest-neighbor graph.

B. Data Sets

The data set is a collection of about 810,000 documents known as RCV1 (Reuters Corpus Volume I) [14]. It is manually categorized into 350 classes and split into 23,139 training documents and 781,256 test documents. We use the category codes based on the industries vocabulary and preprocess the data by removing multi-labeled documents and categories with fewer than 500 documents. This leaves about 200,000 documents in 103 categories. In this experiment, we extract a subset ϕ of the 200,000 documents to initialize our algorithm and simulate the growth of the data set by adding data points to ϕ from the remaining documents.

C. Quality Measure

We evaluate our algorithm by computing the Clustering Accuracy (CA) and the Normalized Mutual Information (NMI) between the labels generated by our algorithm and the true labels [13]:

    CA = max_map ( Σ_{i=1}^{n} δ(y_i, map(c_i)) ) / n

where n denotes the number of documents, and y_i and c_i denote the true label and the generated label of document v_i, respectively. The function δ(y, c) equals 1 if y = c and 0 otherwise. The permutation function map(·) maps each generated label to a true label; the optimal mapping can be found as in [15]. CA lies between 0 and 1, and a higher CA score means better clustering quality.

    NMI = ( Σ_{i=1}^{k} Σ_{j=1}^{k} n_ij log( n·n_ij / (n_i n_j) ) ) / sqrt( (Σ_i n_i log(n_i/n)) (Σ_j n_j log(n_j/n)) )

where n denotes the number of documents, n_i and n_j denote the numbers of documents in cluster i and category j, and n_ij denotes the number of documents in both cluster i and category j. NMI lies between 0 and 1, and a higher NMI score means better clustering quality.
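Both measures can be computed from a cluster-by-category contingency table. The sketch below follows the two formulas above; for the optimal map(·) of CA it substitutes the Hungarian method via scipy's linear_sum_assignment, an assumption standing in for the matching procedure of [15].

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def contingency(y_true, y_pred):
        """n_ij: documents with true label i and generated label j
        (y_true, y_pred: 1-D integer arrays)."""
        cats, clus = np.unique(y_true), np.unique(y_pred)
        T = np.zeros((len(cats), len(clus)))
        for a, c in enumerate(cats):
            for b, d in enumerate(clus):
                T[a, b] = np.sum((y_true == c) & (y_pred == d))
        return T

    def nmi(y_true, y_pred):
        T = contingency(y_true, y_pred)
        n = T.sum()
        ni, nj = T.sum(axis=1), T.sum(axis=0)
        with np.errstate(divide='ignore', invalid='ignore'):
            num = np.nansum(T * np.log(n * T / np.outer(ni, nj)))  # 0 log 0 = 0
        den = np.sqrt(np.sum(ni * np.log(ni / n)) * np.sum(nj * np.log(nj / n)))
        return num / den

    def clustering_accuracy(y_true, y_pred):
        T = contingency(y_true, y_pred)
        row, col = linear_sum_assignment(-T)   # optimal map(.) via max matching
        return T[row, col].sum() / T.sum()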

[Figure 2: A clustering quality and runtime comparison between K-means, Modified NJW and Alg. 2 on the RCV1 data set: (a) NMI, (b) Accuracy, and (c) Time, each plotted against the number of points. For Alg. 2, 3000 points are used for initialization and another 3000 points from the rest of the data set are added incrementally. For K-means, each value is the mean of 10 replicates.]

D. Results

Fig. 2(a) and Fig. 2(b) show the NMI and CA scores on the RCV1 data set. Both results confirm that our algorithm achieves a clustering quality between that of NJW and that of K-means. Although the NMI and CA values may drop gradually as points are added, this is rectified by the automatic re-initialization operation of Alg. 2. Furthermore, the algorithm performs comparatively better as the number of points increases, which is crucial for large data sets.

Fig. 2(c) reports the runtime on the RCV1 data set. The runtime of Alg. 2 is close to that of K-means and much less than that of NJW. In addition, the growth in runtime as new points are added is not as sharp as for NJW; on the contrary, it becomes relatively stable and approaches that of K-means. Hence, compared with re-computation by NJW, our algorithm achieves similar accuracy at a much lower computational cost.

V. CONCLUSIONS

A fast incremental spectral clustering algorithm for large data sets is proposed in this paper. It extends the NJW algorithm to handle dynamic data and incorporates a new measurement strategy to compress the original data sets into a certain number of representative points. Instead of evaluating the whole data set, the algorithm incrementally maintains representative sets and generates instant cluster labels as new points arrive. The algorithm is therefore fast and can be efficiently applied to large data sets. Moreover, by analyzing the Eigen-gap in the form of incidence vectors, changes in the cluster number can be detected automatically. Experimental results on real evolving data sets show that our method provides fast and relatively accurate results.

REFERENCES

[1] A.Y. Ng, M.I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in T.G. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, 2002, pp. 849-856.
[2] U. von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, 17:395-416, 2007.
[3] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," in L. K. Saul, Y. Weiss, and L. Bottou (Eds.), Advances in Neural Information Processing Systems 17, MIT Press, Cambridge, MA, 2005, pp. 1601-1608.
[4] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[5] F. Bach and M. Jordan, "Learning spectral clustering," in Proc. of NIPS-16, MIT Press, 2004.
[6] F. R. K. Chung, Spectral Graph Theory, American Mathematical Society, 1997.
[7] B. Bollobas, Modern Graph Theory, Springer, New York, 1998.
[8] H. Ning, W. Xu, Y. Chi, Y. Gong, and T. Huang, "Incremental spectral clustering with application to monitoring of evolving blog communities," in SIAM Int. Conf. on Data Mining, 2007.
[9] C. Valgren, T. Duckett, and A. Lilienthal, "Incremental spectral clustering and its application to topological mapping," in Proc. IEEE Int. Conf. on Robotics and Automation, 2007, pp. 4283-4288.
[10] F. R. K. Chung, Spectral Graph Theory, CBMS Regional Conference Series in Mathematics, vol. 92, American Mathematical Society, Providence, RI, 1997.
[11] R. Bhatia, Matrix Analysis, Springer, New York, 1997.
[12] D. Chaudhuri, C.A. Murthy, and B.B. Chaudhuri, "Finding a subset of representative points in a data set," IEEE Trans. Systems, Man, and Cybernetics, vol. 24, no. 9, pp. 1416-1424, 1994.
[13] Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang, "Parallel spectral clustering in distributed systems," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
[14] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," Journal of Machine Learning Research, 5:361-397, 2004.
[15] L. Lovasz and M. Plummer, Matching Theory, Akadémiai Kiadó, North-Holland, Budapest, 1986.