Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm

Xiaohui Cui, Thomas E. Potok
Applied Software Engineering Research Group, Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6085, USA
cuix, potokte@ornl.gov

Abstract: There is a tremendous proliferation in the amount of information available on the largest shared information source, the World Wide Web. Fast and high-quality document clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize the information. Recent studies have shown that partitional clustering algorithms are more suitable for clustering large datasets. The K-means algorithm is the most commonly used partitional clustering algorithm because it can be easily implemented and is the most efficient one in terms of execution time. The major problem with this algorithm is that it is sensitive to the selection of the initial partition and may converge to a local optimum. In this paper, we present a hybrid Particle Swarm Optimization (PSO)+K-means document clustering algorithm that performs fast document clustering and can also avoid being trapped in a local optimal solution. For comparison purposes, we applied the PSO+K-means, PSO, K-means, and two other hybrid clustering algorithms to four different text document datasets. The number of documents in the datasets ranges from 204 to over 800, and the number of terms ranges from over 5000 to over 7000. The results illustrate that the PSO+K-means algorithm generates more compact clustering results than the other four algorithms.

Keywords: Particle Swarm Optimization, Text Dataset, Cluster Centroid, Vector Space Model

INTRODUCTION

Document clustering is a fundamental operation used in unsupervised document organization, automatic topic extraction, and information retrieval. Clustering involves dividing a set of objects into a specified number of clusters [21]. The motivation behind clustering a set of data is to find inherent structure in the data and expose this structure as a set of groups.
The data objects within each group should exhibit a large degree of similarity, while the similarity among different clusters should be minimized [3, 9, 18].

There are two major clustering techniques: partitioning and hierarchical [9]. Most document clustering algorithms can be classified into these two groups. Hierarchical techniques produce a nested sequence of partitions, with a single, all-inclusive cluster at the top and singleton clusters of individual points at the bottom. The partitioning clustering method seeks to partition a collection of documents into a set of non-overlapping groups, so as to maximize the evaluation value of the clustering. Although the hierarchical clustering technique is often portrayed as a better quality clustering approach, it does not contain any provision for the reallocation of entities that may have been poorly classified in the early stages of the text analysis [9]. Moreover, the time complexity of this approach is quadratic [18]. In recent years, it has been recognized that the partitional clustering technique is well suited for clustering a large document dataset due to its relatively low computational requirements [18]. The time complexity of the partitioning technique is almost linear, which makes it widely used. The best-known partitioning clustering algorithm is the K-means algorithm and its variants [10]. This algorithm is simple, straightforward, and is based on the firm foundation of analysis of variances. The K-means algorithm clusters a group of data vectors into a predefined number of clusters. It starts with random initial cluster centers and keeps reassigning the data objects in the dataset to cluster centers based on the similarity between each data object and the cluster centers. The reassignment procedure will not stop until
a convergence criterion is met (e.g., a fixed iteration count is reached, or the cluster result does not change after a certain number of iterations). The main drawback of the K-means algorithm is that the cluster result is sensitive to the selection of the initial cluster centroids and may converge to a local optimum [16]. Therefore, the initial selection of the cluster centroids determines the main processing of K-means, and the partition result of the dataset as well. The main processing of K-means is to search for the local optimal solution in the vicinity of the initial solution and to refine the partition result. The same initial cluster centroids in a dataset will always generate the same cluster results. However, if good initial clustering centroids can be obtained using some other technique, K-means works well in refining the clustering centroids to find the optimal clustering centers [2]. It is therefore desirable to employ some other globally optimal searching algorithm to generate these initial cluster centroids. The Particle Swarm Optimization (PSO) algorithm is a population-based stochastic optimization technique that can be used to find an optimal, or near-optimal, solution to a numerical or qualitative problem [4, 11, 17]. The PSO algorithm can be used to generate good initial cluster centroids for K-means. In this paper, we present a hybrid PSO+K-means document clustering algorithm that performs fast document clustering and can avoid being trapped in a local optimal solution. The results from our experiments indicate that the PSO+K-means algorithm can generate the best results in just 50 iterations in comparison with the K-means algorithm and the PSO algorithm.

The remainder of this paper is organized as follows: Section 2 presents the methods of representing documents in clustering algorithms and of computing the similarity between documents. Section 3 provides a general overview of the PSO algorithm. The hybrid PSO+K-means clustering algorithm is described in Section 4.
Section 5 provides the detailed experimental setup and results for comparing the performance of the PSO+K-means algorithm with the K-means, PSO, and other hybrid approaches, along with a discussion of the experimental results. The conclusion is in Section 6.

PRELIMINARIES

Document representation: In most clustering algorithms, the dataset to be clustered is represented as a set of vectors X = {x_1, x_2, ..., x_n}, where the vector x_i corresponds to a single object and is called the feature vector. The feature vector should include proper features to represent the object. Text document objects can be represented using the Vector Space Model (VSM) [8]. In this model, the content of a document is formalized as a point in the multidimensional space and represented by a vector d = {w_1, w_2, ..., w_n}, where w_i (i = 1, 2, ..., n) is the term weight of term t_i in the document. The term weight value represents the significance of this term in the document. To calculate the term weight, the occurrence frequency of the term within a document and in the entire set of documents must be considered. The most widely used weighting scheme combines Term Frequency with Inverse Document Frequency (TF-IDF) [8]. The weight of term i in document j is given by equation 1:

    w_ji = tf_ji × idf_ji = tf_ji × log2(n / df_i)    (1)

where tf_ji is the number of occurrences of term i in document j; df_i is the number of documents in the collection that contain term i; and n is the total number of documents in the collection. This weighting scheme discounts frequent words with little discriminating power.

The similarity metric: The similarity between two documents needs to be measured in a clustering analysis. Over the years, two prominent ways have been proposed to compute the similarity between documents m_p and m_j. The first method is based on Minkowski distances [5], given by:

    D_n(m_p, m_j) = ( Σ_{i=1..d_m} |m_{i,p} − m_{i,j}|^n )^(1/n)    (2)

For n = 2, we obtain the Euclidean distance.
In order to manipulate equivalent threshold distances, and considering that the distance ranges will vary according to the number of dimensions, this algorithm uses the normalized Euclidean distance as the similarity metric of two documents, m_p and m_j, in the vector space. Equation 3 represents the distance measurement formula:

    d(m_p, m_j) = sqrt( Σ_{k=1..d_m} (m_pk − m_jk)^2 / d_m )    (3)

where m_p and m_j are two document vectors; d_m denotes the number of dimensions of the vector space; and m_pk and m_jk stand for the weight values of documents m_p and m_j in dimension k.
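As an illustration of equations 1 and 3, the following Python sketch (our own illustrative code, not from the paper; the function and variable names are our choices) builds TF-IDF weights for a toy two-document collection and evaluates the normalized Euclidean distance between the two weighted vectors:

```python
import math

def tfidf_weights(term_counts, n_docs, doc_freq):
    """Equation 1: w_ji = tf_ji * log2(n / df_i), for one document."""
    return {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in term_counts.items()}

def normalized_euclidean(mp, mj, terms):
    """Equation 3: sqrt(sum_k (m_pk - m_jk)^2 / d_m) over the d_m vocabulary terms."""
    d_m = len(terms)
    s = sum((mp.get(t, 0.0) - mj.get(t, 0.0)) ** 2 for t in terms)
    return math.sqrt(s / d_m)

# Toy collection: raw term frequencies per document.
docs = [{"swarm": 3, "cluster": 1}, {"cluster": 2, "kmeans": 2}]
vocab = {"swarm", "cluster", "kmeans"}
df = {"swarm": 1, "cluster": 2, "kmeans": 1}  # document frequency of each term
weighted = [tfidf_weights(d, len(docs), df) for d in docs]
print(normalized_euclidean(weighted[0], weighted[1], vocab))
```

Note how "cluster", which appears in every document, receives weight log2(2/2) = 0 and thus contributes nothing to the distance, which is exactly the discounting effect described above.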
The other commonly used similarity measure in document clustering is the cosine correlation measure [15], given by:

    cos(m_p, m_j) = m_p^T m_j / (|m_p| |m_j|)    (4)

where m_p^T m_j denotes the dot product of the two document vectors and |·| indicates the length (norm) of a vector. Both similarity metrics are widely used in the text document clustering literature.

BACKGROUND OF THE PSO ALGORITHM

PSO was originally developed by Eberhart and Kennedy in 1995 [11], and was inspired by the social behavior of a bird flock. In the PSO algorithm, the birds in a flock are symbolically represented as particles. These particles can be considered as simple agents flying through a problem space. A particle's location in the multi-dimensional problem space represents one solution for the problem. When a particle moves to a new location, a different problem solution is generated. This solution is evaluated by a fitness function that provides a quantitative value of the solution's utility. The velocity and direction of each particle moving along each dimension of the problem space are altered with each generation of movement. In combination, the particle's personal experience, P_id, and its neighbors' experience, P_gd, influence the movement of each particle through the problem space. The random values rand1 and rand2 are used for the sake of completeness, that is, to make sure that particles explore a wide search space before converging around the optimal solution. The values of c1 and c2 control the weight balance of P_id and P_gd in deciding the particle's next movement velocity. For every generation, the particle's new location is computed by adding the particle's current velocity, the V-vector, to its location, the X-vector.
Mathematically, given a multidimensional problem space, the i-th particle changes its velocity and location according to the following equations [11]:

    v_id = w·v_id + c1·rand1·(p_id − x_id) + c2·rand2·(p_gd − x_id)    (5a)
    x_id = x_id + v_id    (5b)

where w denotes the inertia weight factor; p_id is the location of the particle where it experienced its best fitness value; p_gd is the location of the particle that experienced the global best fitness value; c1 and c2 are constants known as acceleration coefficients; d denotes the dimension of the problem space; and rand1, rand2 are random values in the range (0, 1).

The inertia weight factor w provides the necessary diversity to the swarm by changing the momentum of particles, avoiding the stagnation of particles at local optima. The empirical research conducted by Eberhart and Shi [7] shows improved search efficiency when the value of the inertia weight factor is gradually decreased from a high value during the search. Equation 5a requires each particle to record its current coordinate x_id, its velocity v_id, which indicates the speed of its movement along the dimensions of the problem space, and the coordinates p_id and p_gd where the best fitness values were computed. The best fitness values are updated at each generation, based on equation 6:

    P(t+1) = { P(t),     if f(X(t+1)) > f(X(t))
             { X(t+1),   if f(X(t+1)) ≤ f(X(t))    (6)

where f denotes the fitness function; P(t) stands for the best fitness value and the coordinates where that value was calculated; and t denotes the generation step.

It is possible to view the clustering problem as an optimization problem that locates the optimal centroids of the clusters rather than finding an optimal partition. This view offers us a chance to apply the PSO optimization algorithm to the clustering solution. In [6], we proposed a PSO document clustering algorithm. Contrary to the localized searching of the K-means algorithm, the PSO clustering algorithm performs a globalized search in the entire solution space [4, 17].
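A minimal sketch of update rules 5a, 5b, and 6 (our own illustrative code; the fitness here is treated as the average-distance measure of equation 6's context, so a lower value is better and replaces the stored best):

```python
import random

def pso_step(x, v, p_best, p_gbest, w=0.72, c1=1.49, c2=1.49):
    """Apply equations 5a and 5b to one particle across all dimensions."""
    new_v, new_x = [], []
    for d in range(len(x)):
        r1, r2 = random.random(), random.random()
        vd = w * v[d] + c1 * r1 * (p_best[d] - x[d]) + c2 * r2 * (p_gbest[d] - x[d])  # eq. 5a
        new_v.append(vd)
        new_x.append(x[d] + vd)                                                       # eq. 5b
    return new_x, new_v

def update_best(p_best, x_new, fitness):
    """Equation 6: keep the coordinates with the better (lower) fitness value."""
    return x_new if fitness(x_new) <= fitness(p_best) else p_best
```

For example, with `fitness = lambda pt: sum(c * c for c in pt)` a particle is pulled toward whichever of its personal and global bests lies closer to the origin; the constants 0.72 and 1.49 are the values used later in the experimental setup.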
Utilizing the PSO algorithm's optimization ability, and given enough time, the PSO clustering algorithm we proposed could generate more compact clustering results from the document datasets than the traditional K-means clustering algorithm. However, in order to cluster large document datasets, PSO requires many more iterations (generally more than 500) to converge to the optimum than the K-means algorithm does. Although the PSO algorithm is inherently parallel and can be implemented using parallel hardware, such as a computer cluster, the computation requirement for clustering extremely large document datasets is still high. In terms of execution time, the K-means algorithm is the most efficient for large datasets [1]. The K-means algorithm tends to converge faster than PSO, but it usually only finds a local optimum. Therefore, we face a dilemma in choosing an algorithm for clustering a large
document dataset. For this reason, we propose a hybrid PSO+K-means document clustering algorithm.

HYBRID PSO+K-MEANS ALGORITHM

In the hybrid PSO+K-means algorithm, the multidimensional document vector space is modeled as the problem space. Each term in the document dataset represents one dimension of the problem space, and each document vector can be represented as a point in that space. The whole document dataset can therefore be represented as a multidimensional space containing a large number of points. The hybrid PSO+K-means algorithm includes two modules, the PSO module and the K-means module. At the initial stage, the PSO module is executed for a short period to search for the cluster centroid locations. These locations are then transferred to the K-means module for refining and generating the final optimal clustering solution.

The PSO module: A single particle in the swarm represents one possible solution for clustering the document collection. Therefore, a swarm represents a number of candidate clustering solutions for the document collection. Each particle maintains a matrix X = (C_1, C_2, ..., C_i, ..., C_k), where C_i represents the i-th cluster centroid vector and k is the number of clusters. At each iteration, each particle adjusts the centroid vectors' positions in the vector space according to its own experience and those of its neighbors. The average distance between a cluster centroid and the documents assigned to it is used as the fitness value to evaluate the solution represented by each particle. The fitness value is measured by the equation below:

    f = ( Σ_{i=1..N_c} [ ( Σ_{j=1..P_i} d(o_i, m_ij) ) / P_i ] ) / N_c    (7)

where m_ij denotes the j-th document vector belonging to cluster i; o_i is the centroid vector of the i-th cluster; d(o_i, m_ij) is the distance between document m_ij and the cluster centroid o_i; P_i stands for the number of documents belonging to cluster C_i; and N_c stands for the number of clusters.

The PSO module can be summarized as:
(1) At the initial stage, each particle randomly chooses k document vectors from the document collection as its cluster centroid vectors.
(2) For each particle:
    (a) Assign each document vector in the document set to the closest centroid vector.
    (b) Calculate the fitness value based on equation 7.
    (c) Update the particle's velocity and position using equations 5a and 5b to generate the next solution.
(3) Repeat step (2) until one of the following termination conditions is satisfied:
    (a) the maximum number of iterations is exceeded, or
    (b) the average change in the centroid vectors is less than a predefined value.

The K-means module: The K-means module inherits the PSO module's result as the initial clustering centroids and continues processing to refine the centroids and generate the final result. The K-means module can be summarized as:
(1) Inherit the cluster centroid vectors from the PSO module.
(2) Assign each document vector to the closest cluster centroid.
(3) Recalculate each cluster centroid vector c_j using equation 8:

    c_j = (1/n_j) Σ_{d_j ∈ S_j} d_j    (8)

where d_j denotes the document vectors that belong to cluster S_j; c_j stands for the centroid vector; and n_j is the number of document vectors belonging to cluster S_j.
(4) Repeat steps 2 and 3 until convergence is achieved.

The PSO+K-means algorithm thus combines the globalized searching ability of the PSO algorithm with the fast convergence of the K-means algorithm. The PSO algorithm is used at the initial stage to discover the vicinity of the optimal solution through a global search. The result from PSO is used as the initial seed of the K-means algorithm, which is applied for refining and generating the final result.

EXPERIMENTS AND RESULTS

Datasets: We used four different document collections to compare the performance of the K-
means, PSO, and hybrid PSO+K-means algorithms with different combination models. These document datasets are derived from the TREC-5, TREC-6, and TREC-7 collections [19]. A description of the test datasets is given in Table 1. In these document datasets, the very common words (e.g. function words: "a", "the", "in", "to"; pronouns: "I", "he", "she", "it") are stripped out completely, and different forms of a word are reduced to one canonical form by using Porter's algorithm [13]. In order to reduce the impact of the length variations of different documents, each document vector is normalized so that it is of unit length. The number of documents in each dataset ranges from 204 to 878, and the number of terms in each dataset is over 5000.

Table 1: Summary of text document datasets

              Number of    Number of    Number of
    Data      documents    terms        classes
    Dataset1  414          6429         9
    Dataset2  313          5804         8
    Dataset3  204          5832         6
    Dataset4  878          7454         10

Experimental setup: The K-means, PSO, and PSO+K-means clustering approaches are each applied to the four datasets, using the Euclidean distance measure and the cosine correlation measure as similarity metrics in turn. Some researchers [12, 20] have proposed using the K-means+PSO hybrid algorithm for clustering low-dimension datasets. They argue that the K-means algorithm tends to converge faster than other clustering algorithms, but usually with a less accurate clustering, and that the performance of the clustering algorithm can be improved by seeding the initial swarm with the result of the K-means algorithm. To compare the performance of different kinds of hybrid algorithms, we also applied the K-means+PSO and K-means+PSO+K-means algorithms to the four datasets. We have noticed that K-means clustering algorithms can converge to a stable solution within 20 iterations when applied to most document datasets, while PSO usually needs to run for more than 100 iterations to generate a stable solution. For an easy comparison, the K-means and PSO approaches run for 50 iterations.
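The preprocessing described above might look like the following sketch (our own illustration, not the authors' code; the Porter stemming step is omitted here for brevity, and the stop-word list is a tiny stand-in for a real one):

```python
import math

STOP_WORDS = {"a", "the", "in", "to", "i", "he", "she", "it"}

def preprocess(tokens):
    """Strip very common words; stemming (e.g. Porter's algorithm) would follow here."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def unit_normalize(vec):
    """Scale a term-weight vector to unit length to offset document-length variation."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else vec

tokens = preprocess("the swarm flies to the cluster".split())
print(tokens)
print(unit_normalize({"swarm": 3.0, "cluster": 4.0}))
```

After normalization, every document vector has Euclidean length 1, so the distance and cosine metrics compare directions in term space rather than raw document lengths.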
In the K-means+PSO approach, the K-means algorithm is first executed for 25 iterations. The result of the K-means algorithm is then used as the initial cluster centroids in the PSO algorithm, and the PSO algorithm executes for another 25 iterations to generate the final result. The PSO+K-means approach has the same execution procedure, except that it first executes the PSO algorithm for 25 iterations and uses the PSO result as the initial seed for the K-means algorithm. In the K-means+PSO+K-means approach, the K-means algorithm is first executed for 25 iterations; the result is then used as the initial cluster centroids in the PSO algorithm, which executes for 25 iterations; the global best solution from the PSO algorithm is then used as the initial cluster centroids of the K-means algorithm, which executes for another 25 iterations to generate the final result. Among these five algorithms, the total number of iterations executed by K-means, PSO, K-means+PSO, and PSO+K-means is 50; the total for the K-means+PSO+K-means approach is 75. No parameters need to be set up for the K-means algorithm. In the PSO clustering algorithm, because of the extremely high dimensional solution space of the text document datasets, we choose 50 particles for all the PSO algorithms instead of the 20 to 30 particles recommended in [4, 17]. The inertia weight w is initially set to 0.72 and the acceleration coefficient constants c1 and c2 are set to 1.49; these values are chosen based on the results of [17]. In the standalone PSO algorithm, the inertia weight is reduced by 1% at each iteration to ensure good convergence. However, the inertia weight in all hybrid algorithms is kept constant to ensure a globalized search.

Results: The fitness equation 7 is used not only in the PSO algorithm for the fitness value calculation, but also in the evaluation of cluster quality. It gives the value of the average distance between documents and the cluster centroid to which they belong (ADVDC).
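The ADVDC measure is simply equation 7 applied to a finished clustering. As an illustration (our own sketch, with names of our own choosing), under the Euclidean metric:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def advdc(centroids, clusters, dist=euclidean):
    """Equation 7: mean over the N_c clusters of the average document-to-centroid
    distance (clusters are assumed non-empty)."""
    total = sum(sum(dist(c, doc) for doc in docs) / len(docs)
                for c, docs in zip(centroids, clusters))
    return total / len(centroids)

# Two clusters: the first averages distance 1.0, the second 2.0.
centroids = [(0.0, 0.0), (10.0, 0.0)]
clusters = [[(1.0, 0.0), (0.0, 1.0)], [(10.0, 2.0)]]
print(advdc(centroids, clusters))   # (1.0 + 2.0) / 2 = 1.5
```

A lower ADVDC means documents sit closer to their assigned centroids, i.e. a more compact clustering, which is why it is used as the quality criterion in Table 2.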
The smaller the ADVDC value, the more compact the clustering solution. Table 2 presents the experimental results of the K-means, PSO, K-means+PSO, PSO+K-means, and K-means+PSO+K-means approaches. Ten simulations are performed for each, and the average ADVDC values and standard deviations are recorded in Table 2. To illustrate the convergence behavior of the different clustering algorithms, the clustering ADVDC values at each iteration are recorded when the five algorithms are applied to the datasets separately.

Table 2: Performance comparison (ADVDC value) of K-means, PSO, PSO+K-means, K-means+PSO, and K-means+PSO+K-means

                        K-means        PSO            PSO+K-means    K-means+PSO    K-means+PSO+K-means
  Dataset 1  Euclidean   8.238±0.090    6.759±0.956    4.556±1.405    8.244±0.110    8.138±0.138
             Cosine      8.999±0.150   10.624±0.406    7.690±0.474    9.083±0.118    8.910±0.231
  Dataset 2  Euclidean   7.245±0.166    6.362±1.032    4.824±1.944    7.243±0.114    7.292±0.086
             Cosine      8.074±0.200    9.698±0.435    7.676±0.172    8.126±0.072    8.140±0.152
  Dataset 3  Euclidean   4.788±0.089    4.174±0.207    2.550±0.746    4.662±0.315    4.799±0.247
             Cosine      5.093±0.120    5.750±0.395    4.355±0.252    5.078±0.213    4.985±0.267
  Dataset 4  Euclidean   9.090±0.097    9.311±1.010    6.004±2.666    8.936±0.285    8.861±0.363
             Cosine     10.220±0.402   12.874±0.593    9.547±0.237   10.363±0.132   10.221±0.343

As shown in Table 2, the PSO+K-means hybrid clustering approach generates the clustering result with the lowest ADVDC value for all four datasets using both the Euclidean similarity metric and the cosine correlation similarity metric. The results from the PSO approach show improvements over those of the K-means approach when using the Euclidean similarity metric. However, when the similarity metric is changed to the cosine correlation metric, the K-means algorithm performs better than the PSO algorithm. The K-means+PSO and K-means+PSO+K-means approaches do not show significant improvements over the result of the K-means approach. Figure 1 illustrates the convergence behaviors of these algorithms on document dataset 1 using the Euclidean distance as the similarity metric. In Figure 1, the K-means algorithm converges quickly but prematurely, with a high quantization error: its ADVDC value is sharply reduced from 11.3 to 8.2 within 10 iterations and then remains fixed at 8.2. In Figure 1, it is hard to separate the curves that represent the K-means+PSO and K-means+PSO+K-means approaches from the K-means approach; the three curves nearly overlap, which indicates that these three algorithms have nearly the same convergence behavior. The PSO approach's ADVDC value quickly converges from 11.3 to 6.7 within 30 iterations. The reduction of the ADVDC value in PSO is not as sharp as in K-means and becomes smooth after 30 iterations.
The tendency of the curve indicates that if more iterations were executed, the average distance value might reduce further, although the rate of reduction would be very slow. The PSO+K-means approach's performance improves significantly. In the first 25 iterations, the PSO+K-means algorithm has convergence behavior similar to PSO, because within iterations 1 to 25 the PSO and the PSO+K-means algorithms execute the same PSO optimization code. After 25 iterations, the ADVDC value shows a sharp reduction, falling from 6.7 to 4.7, and reaches a stable value within 10 iterations.

Discussion: Using hybrid algorithms for boosting clustering performance is not a novel idea. However, most hybrid algorithms use the K-means algorithm to generate the initial clustering seeds for other optimization algorithms. To the best of the authors' knowledge, there is no hybrid algorithm that uses the PSO optimization algorithm to generate the initial seed for K-means clustering. In [20], Merwe and Engelbrecht argued that the performance of the PSO clustering algorithm could be improved by seeding the initial swarm with the result of the K-means algorithm. They conducted simulations on some low-dimension datasets with 10 particles and 1000 iterations. However, from the experimental results in Table 2, we noticed that the K-means+PSO algorithm does not show any improvement on the large document datasets. The differences between the experiments in the present research and the experiment in [20] are: (a) the datasets used here are document data with more than 5000 terms, which is also the dimension of the PSO search space; (b) the iterations run in each simulation number no more than 75. In the K-means+PSO approach, the K-means algorithm generates a local optimum, which is used as the initial status of one particle in the PSO algorithm. The good fitness value of this initial particle will directly force the particle to start the optimal refining stage. Although the other particles are randomly deployed in the search space, the good fitness (low average distance) value
[Figure 1: line plot of ADVDC (y-axis, 0 to 12) versus iteration (x-axis, 0 to 75) for the five algorithms.]

Figure 1: The convergence behaviors of the different clustering algorithms (PSO+K: the PSO+K-means algorithm; K+PSO: the K-means+PSO algorithm; K+PSO+K: the K-means+PSO+K-means algorithm)

of the initial solution that the PSO algorithm inherited from the K-means algorithm will attract the other particles to converge quickly into the vicinity of this local optimum. The performance results in Table 2 and the convergence behavior of the K-means+PSO approach in Figure 1 illustrate this. The PSO+K-means algorithm generates the most compact clustering result in the experiments; its average distance value is the lowest. In the PSO+K-means clustering experiment, although 25 iterations is not enough for the PSO to discover the optimal solution, there is a high probability that one particle's solution is located in the vicinity of the global solution. The result of the PSO is used as the initial seed of the K-means algorithm, which can then quickly locate the optimum with a low average distance value. Comparison of the performance of these five approaches in our experiments illustrates that the order in which the algorithms are combined in a hybrid K-means algorithm is very important.

CONCLUSION

In this study, we presented a document clustering algorithm, the PSO+K-means algorithm, which can be regarded as a hybrid of the PSO and K-means algorithms. The general PSO algorithm can conduct a globalized search for the optimal clustering, but requires more iterations and computation than the K-means algorithm does. The K-means algorithm tends to converge faster than the PSO algorithm, but it can usually be trapped in a local optimal area. The PSO+K-means algorithm combines the globalized searching ability of the PSO algorithm with the fast convergence of the K-means algorithm and can avoid the drawbacks of both. The algorithm includes two modules, the PSO module and the K-means module.
The PSO module is executed for a short period at the initial stage to discover the vicinity of the optimal solution by a global search, while avoiding a high computation cost. The result from the PSO module is used as the initial seed of the K-means module, and the K-means algorithm is applied for refining and generating the final result. Our experimental results illustrate that this hybrid PSO+K-means algorithm generates more compact clusterings than either PSO or K-means alone. The results from the three different hybrid K-means algorithms illustrate that performing K-means ahead of the PSO module in a hybrid algorithm reduces the globalized searching ability of the PSO algorithm and lowers the whole algorithm's performance.
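The two-module pipeline summarized above can be sketched end-to-end as follows. This is our own illustrative reconstruction on toy two-dimensional data, not the authors' implementation; `pso_search` and `kmeans_refine` are names we introduce, and the parameter values mirror the experimental setup (w = 0.72, c1 = c2 = 1.49, 25 iterations per module):

```python
import math, random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fitness(centroids, data):
    """Equation 7: average distance between points and their closest centroid."""
    buckets = [[] for _ in centroids]
    for p in data:
        buckets[min(range(len(centroids)), key=lambda i: dist(centroids[i], p))].append(p)
    per = [sum(dist(c, p) for p in ps) / len(ps) for c, ps in zip(centroids, buckets) if ps]
    return sum(per) / len(per)

def pso_search(data, k, particles=10, iters=25, w=0.72, c1=1.49, c2=1.49):
    """PSO module: each particle holds k candidate centroids (eqs. 5a, 5b, 6)."""
    dim = len(data[0])
    swarm = [[list(random.choice(data)) for _ in range(k)] for _ in range(particles)]
    vels = [[[0.0] * dim for _ in range(k)] for _ in range(particles)]
    pbest = [[c[:] for c in s] for s in swarm]
    gbest = [c[:] for c in min(pbest, key=lambda s: fitness(s, data))]
    for _ in range(iters):
        for s, v, pb in zip(swarm, vels, pbest):
            for i in range(k):
                for d in range(dim):
                    r1, r2 = random.random(), random.random()
                    v[i][d] = (w * v[i][d] + c1 * r1 * (pb[i][d] - s[i][d])
                               + c2 * r2 * (gbest[i][d] - s[i][d]))      # eq. 5a
                    s[i][d] += v[i][d]                                    # eq. 5b
        for idx, s in enumerate(swarm):
            if fitness(s, data) <= fitness(pbest[idx], data):             # eq. 6, minimizing
                pbest[idx] = [c[:] for c in s]
                if fitness(s, data) <= fitness(gbest, data):
                    gbest = [c[:] for c in s]
    return gbest

def kmeans_refine(data, centroids, iters=25):
    """K-means module: refine the PSO seed with the equation 8 centroid update."""
    centroids = [c[:] for c in centroids]
    for _ in range(iters):
        buckets = [[] for _ in centroids]
        for p in data:
            buckets[min(range(len(centroids)), key=lambda i: dist(centroids[i], p))].append(p)
        for i, ps in enumerate(buckets):
            if ps:
                centroids[i] = [sum(col) / len(ps) for col in zip(*ps)]   # eq. 8
    return centroids

random.seed(0)
data = ([(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(20)]
        + [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(20)])
seed = pso_search(data, k=2)       # short global search
final = kmeans_refine(data, seed)  # fast local refinement
print(fitness(final, data))
```

The design point the paper makes is visible in this sketch: the PSO stage only needs to land some particle near the basin of the global optimum, after which the cheap K-means iterations finish the descent.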
ACKNOWLEDGMENT

Oak Ridge National Laboratory is managed by UT-Battelle LLC for the U.S. Department of Energy under contract number DE-AC05-00OR22725.

REFERENCES
1. Al-Sultan, K. S. and Khan, M. M., 1996. Computational experience on four algorithms for the hard clustering problem. Pattern Recognition Letters, 17(3), pp. 295-308.
2. Anderberg, M. R., 1973. Cluster Analysis for Applications. Academic Press, Inc., New York, NY.
3. Berkhin, P., 2002. Survey of clustering data mining techniques. Accrue Software Research Paper.
4. Carlisle, A. and Dozier, G., 2001. An Off-The-Shelf PSO. Proceedings of the 2001 Workshop on Particle Swarm Optimization, pp. 1-6, Indianapolis, IN.
5. Cios, K., Pedrycz, W., Swiniarski, R., 1998. Data Mining Methods for Knowledge Discovery. Kluwer Academic Publishers.
6. Cui, X., Potok, T. E., 2005. Document Clustering using Particle Swarm Optimization. IEEE Swarm Intelligence Symposium 2005, Pasadena, California.
7. Eberhart, R. C. and Shi, Y., 2000. Comparing Inertia Weights and Constriction Factors in Particle Swarm Optimization. 2000 Congress on Evolutionary Computation, vol. 1, pp. 84-88.
8. Everitt, B., 1980. Cluster Analysis. 2nd Edition. Halsted Press, New York.
9. Jain, A. K., Murty, M. N., and Flynn, P. J., 1999. Data Clustering: A Review. ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323.
10. Hartigan, J. A., 1975. Clustering Algorithms. John Wiley and Sons, Inc., New York, NY.
11. Kennedy, J., Eberhart, R. C. and Shi, Y., 2001. Swarm Intelligence. Morgan Kaufmann, New York.
12. Omran, M., Salman, A. and Engelbrecht, A. P., 2002. Image classification using particle swarm optimization. Proceedings of the 4th Asia-Pacific Conference on Simulated Evolution and Learning 2002 (SEAL 2002), Singapore, pp. 370-374.
13. Porter, M. F., 1980. An Algorithm for Suffix Stripping. Program, 14(3), pp. 130-137.
14. Salton, G., 1989. Automatic Text Processing. Addison-Wesley.
15. Salton, G. and Buckley, C., 1988. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), pp. 513-523.
16. Selim, S. Z. and Ismail, M. A., 1984.
K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, pp. 81-87.
17. Shi, Y. H., Eberhart, R. C., 1998. Parameter Selection in Particle Swarm Optimization. The 7th Annual Conference on Evolutionary Programming, San Diego, CA.
18. Steinbach, M., Karypis, G., Kumar, V., 2000. A Comparison of Document Clustering Techniques. TextMining Workshop, KDD.
19. TREC, 1999. Text REtrieval Conference. http://trec.nist.gov.
20. Van der Merwe, D. W., Engelbrecht, A. P., 2003. Data clustering using particle swarm optimization. Proceedings of the IEEE Congress on Evolutionary Computation 2003 (CEC 2003), Canberra, Australia, pp. 215-220.
21. Zhao, Y. and Karypis, G., 2004. Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering. Machine Learning, 55(3), pp. 311-331.