Tailoring Fuzzy C-Means Clustering Algorithm for Big Data Using Random Sampling and Particle Swarm Optimization

International Journal of Database Theory and Application, pp. 191-202
http://dx.doi.org/10.14257/ijdta.2015.8.3.16
ISSN: 2005-4270 IJDTA, Copyright © 2015 SERSC

Yang Xianfeng 1 and Liu Pengfei 2
1 School of Information Engineering, Henan Institute of Science and Technology, Henan Xinxiang, China
2 HEBI Colleges of Vocation and Technology, Henan Hebi, China
1 49377535@qq.com, 2 274772453@qq.com

Abstract

Clustering is one of the most common data mining techniques and has been widely applied in many fields. Among clustering approaches, fuzzy clustering reflects the real world from a more objective perspective. As one of the most popular fuzzy clustering algorithms, Fuzzy C-Means (FCM) clustering combines fuzzy theory with the K-Means clustering algorithm. However, FCM has several issues. For example, it is very sensitive to initialization conditions, such as the choice of initial clusters; its convergence speed is limited; and the globally optimal solution is hard to guarantee. In big data scenarios in particular, the overall speed is slow, and it is hard to run the clustering algorithm on the entire original dataset. To address these challenges, we propose a modified FCM based on Particle Swarm Optimization (PSO). We also present a multi-round random sampling method for the big data problem, which simulates clustering on the original big dataset by approximating it with the clustering results on sample datasets. Our experiments show that both the modified FCM using PSO and the multi-round sampling strategy are efficient and effective.

Keywords: Clustering, Fuzzy C-Means (FCM), Particle Swarm Optimization (PSO), Big data

1. Introduction

Clustering [1] is one of the most common data mining techniques. It groups a set of objects in such a way that objects within the same group are similar and objects in different groups are dissimilar. Clustering is an unsupervised learning process and has been widely applied in many fields, such as pattern recognition [2], data analysis [3], image processing [4] and marketing research [5].

Fuzzy clustering (also referred to as soft clustering) allocates objects to clusters in a fuzzy way: a data object can belong to more than one cluster, with a different membership level for each [6]. Fuzzy clustering can therefore reflect the real world from a more objective perspective. As one of the most popular fuzzy clustering algorithms, Fuzzy C-Means (FCM) clustering [7] combines fuzzy theory with the K-Means clustering algorithm. However, FCM clustering has some problems. First, FCM is very sensitive to initialization conditions, such as the choice of initial clusters. Second, its convergence speed is limited, and the globally optimal solution is hard to guarantee, especially in big data scenarios. Third, for big data the overall speed is slow, and it is hard to run the clustering algorithm on the entire original dataset.

To address these challenges, in this paper we propose a modified FCM algorithm designed for big data. First, to deal with the initialization sensitivity, we propose to improve FCM using Particle Swarm Optimization (PSO) [8] by optimizing the initialization procedure of FCM. Second, to address the challenges of big data, we propose a multi-round random sampling method that simulates clustering on the original big dataset by approximating it with the clustering results on sample datasets.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents preliminaries on FCM and PSO. Section 4 discusses the proposed modified FCM based on PSO and the procedure for FCM clustering on big data. Section 5 reports empirical experiments. Finally, Section 6 concludes the paper.

2. Related Work

Existing clustering methods can be grouped into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Typical partitioning methods include K-Means [9], CLARANS [10], and FREM [11]. Common hierarchical clustering algorithms are BIRCH [12] and CURE [13]. Density-based methods include DBSCAN [14], OPTICS [15] and ST-DBSCAN [16]. STING [17], CLIQUE [18] and WAVE-CLUSTER [19] are examples of grid-based methods.

Much work has been done on fuzzy clustering as well. Ruspini [20] first proposed the concept of fuzzy partitioning by introducing fuzzy theory into clustering analysis. Many fuzzy clustering methods were proposed afterwards, such as the transitive closure based on fuzzy equivalence relations [21-22], methods based on similarity relations and fuzzy relations [23], the maximum tree based on fuzzy graph theory [24], and methods based on convex decomposition of the dataset [25] and dynamic programming [26]. Today, one of the most common fuzzy clustering algorithms is Fuzzy C-Means (FCM) clustering [27]. However, FCM has some issues, such as sensitivity to the initialization status and premature convergence caused by its gradient descent based search strategy. Some researchers have introduced the Genetic Algorithm (GA) to solve the local optimum problem [28-29]. However, when the dataset is very large, premature convergence still cannot be avoided.
Compared to GA, PSO not only provides global search but also offers strong local search capability through parameter adjustment, and it is easy to implement [30-31]. Therefore, in this paper, we propose to use PSO to improve FCM.

3. Preliminary

3.1. FCM Basics

Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of data points, where $n$ is the number of data objects. The clustering result can be represented by a fuzzy matrix $U = [u_{ij}]_{n \times c}$, where $c$ is the number of clusters and each element $u_{ij}$ denotes the degree of membership of the $i$-th data point in the $j$-th cluster, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, c$. The memberships satisfy

$$\forall i, j:\ u_{ij} \in [0, 1]; \qquad \forall i:\ \sum_{j=1}^{c} u_{ij} = 1; \qquad \forall j:\ \sum_{i=1}^{n} u_{ij} > 0. \tag{1}$$

FCM is formulated as an optimization problem under the above constraints, with the following objective function:

$$\min J(U, C) = \sum_{i=1}^{n} \sum_{j=1}^{c} u_{ij}^{m}\, d^{2}(x_i, c_j), \tag{2}$$

where $C = (c_1, c_2, \ldots, c_c)$ and $c_j$ is the centroid of the $j$-th cluster; $d(x_i, c_j) = \lVert x_i - c_j \rVert$ is the Euclidean distance between $x_i$ and $c_j$, written $d_{ij}$ for short; $m$ is the fuzzy weighting exponent, which controls the influence of matrix $U$, and a bigger $m$ means that the clustering results are fuzzier (it is established that $m \in [1.5, 2.5]$ [32]); and $J(U, C)$ is the weighted sum of squared distances between the data points and the cluster centroids. In this paper, we set $m = 2$. Applying Lagrange multipliers to Equation (2), we have

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \dfrac{d_{ij}}{d_{ik}} \right)^{\frac{2}{m-1}}}, \tag{3}$$

$$c_j = \frac{\sum_{i=1}^{n} u_{ij}^{m}\, x_i}{\sum_{i=1}^{n} u_{ij}^{m}}. \tag{4}$$

The procedure of FCM can be described as follows:

Step 1: Given the number of clusters $c$, the number of data samples $n$, the termination threshold $\varepsilon$, the fuzzy exponent $m$, the maximum number of iterations, and the randomly generated initial centroid matrix $C^{0}$, set $t = 0$ and initialize $U^{0}$ using Equation (3);
Step 2: Set $t = t + 1$ and update the centroid matrix $C^{t}$ using Equation (4);
Step 3: Recalculate the membership matrix $U^{t}$ using Equation (3);
Step 4: If $\lVert U^{t} - U^{t-1} \rVert < \varepsilon$, the algorithm stops; otherwise, go to Step 2.

3.2. PSO Basics

The PSO algorithm simulates the patterns and behavior of a bird flock, where each bird is represented as a particle. Each particle has two important attributes, i.e., location and speed. Suppose there are $N$ particles in a $D$-dimensional space. For particle $i$, its location is $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$ and its speed is $v_i = (v_{i1}, v_{i2}, \ldots, v_{iD})$. The location and speed of particles are updated during the flying process. Suppose the best location of particle $i$ is $p_i = (p_{i1}, p_{i2}, \ldots, p_{iD})$, and the overall best location of all particles is $p_g = (p_{g1}, p_{g2}, \ldots, p_{gD})$. The speed and location are updated by:

$$v_{id}(t+1) = w\, v_{id}(t) + c_1 r_1 \big( p_{id}(t) - x_{id}(t) \big) + c_2 r_2 \big( p_{gd}(t) - x_{id}(t) \big), \tag{5}$$

$$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1), \tag{6}$$

where $i = 1, 2, \ldots, N$, $d = 1, 2, \ldots, D$, $c_1, c_2$ are learning factors, $r_1, r_2 \in [0, 1]$ are random numbers, and $w$ is an inertia weight (typically $w \in [0.1, 0.9]$). $v_{id}(t)$ is the speed of particle $i$ at iteration $t$, and $x_{id}(t)$ is the location of particle $i$ at iteration $t$.

Equation (5) is the update rule for a particle's speed. The first component is the current speed of the particle; the second component is the influence of self-recognition, i.e., the distance between the particle's current location and its own best location; and the third component is the influence of the whole population, i.e., the distance between the particle's current location and the overall best location of all particles. Equation (6) is the update rule for a particle's location: the location at iteration $t + 1$ is the location at iteration $t$ plus the distance flown during one iteration. Figure 1 gives the pseudo-code of PSO.

4. Proposed Modified FCM

4.1. Improving FCM with PSO

The gradient descent based FCM algorithm is essentially a local search algorithm and therefore easily arrives at a local minimum, especially when the number of data samples is large. Besides, FCM is sensitive to the initialization status, and most FCM implementations simply choose the initial centroids at random, which in turn affects the convergence speed.
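To make the baseline concrete, the FCM iteration of Section 3.1 (Equations (3) and (4), Steps 1 to 4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of the largest absolute membership change as the norm in Step 4, and the handling of a point that coincides with a centroid are all choices made here for illustration.

```python
import math
import random


def memberships(points, centroids, m=2.0):
    """Equation (3): u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))."""
    power = 2.0 / (m - 1.0)
    U = []
    for x in points:
        d = [math.dist(x, c) for c in centroids]
        if min(d) == 0.0:
            # Assumption: a point lying exactly on a centroid belongs fully to it.
            hit = d.index(0.0)
            U.append([1.0 if j == hit else 0.0 for j in range(len(d))])
        else:
            U.append([1.0 / sum((dj / dk) ** power for dk in d) for dj in d])
    return U


def update_centroids(points, U, m=2.0):
    """Equation (4): c_j = sum_i u_ij^m x_i / sum_i u_ij^m."""
    dim = len(points[0])
    centroids = []
    for j in range(len(U[0])):
        weights = [row[j] ** m for row in U]
        total = sum(weights)
        centroids.append(
            [sum(w * x[k] for w, x in zip(weights, points)) / total
             for k in range(dim)]
        )
    return centroids


def fcm(points, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Plain FCM started from c randomly chosen data points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, c)]   # Step 1
    U = memberships(points, centroids, m)
    for _ in range(max_iter):
        centroids = update_centroids(points, U, m)          # Step 2
        new_U = memberships(points, centroids, m)           # Step 3
        # Assumption: max-abs difference stands in for ||U^t - U^(t-1)||.
        change = max(abs(a - b)
                     for row_a, row_b in zip(new_U, U)
                     for a, b in zip(row_a, row_b))
        U = new_U
        if change < eps:                                    # Step 4
            break
    return centroids, U
```

Because the starting centroids are drawn at random, different seeds can lead to different local minima, which is the sensitivity the rest of this section sets out to reduce.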

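Similarly, the PSO update rules of Equations (5) and (6) can be sketched as a generic minimizer. This is a hedged illustration only: the function name `pso_minimize`, the parameter values (`w = 0.7`, `c1 = c2 = 1.5`), the zero initial speeds and the search bounds are assumptions chosen for this sketch, not settings taken from the paper.

```python
import random


def pso_minimize(objective, dim, n_particles=30, iters=100,
                 w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` over R^dim with the basic PSO update rules."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random initial locations; initial speeds are zero (an assumption).
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p_best = [xi[:] for xi in x]                 # best location of each particle
    p_val = [objective(xi) for xi in x]
    g = min(range(n_particles), key=p_val.__getitem__)
    g_best, g_val = p_best[g][:], p_val[g]       # overall best location

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Equation (5): inertia + self-recognition + population influence.
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p_best[i][d] - x[i][d])
                           + c2 * r2 * (g_best[d] - x[i][d]))
                # Equation (6): move by the new speed.
                x[i][d] += v[i][d]
            val = objective(x[i])
            if val < p_val[i]:
                p_best[i], p_val[i] = x[i][:], val
                if val < g_val:
                    g_best, g_val = x[i][:], val
    return g_best, g_val
```

For example, `pso_minimize(lambda p: sum(t * t for t in p), dim=2)` searches for the minimum of a simple quadratic bowl. In the setting of this paper the objective would instead be $J(U, C)$ evaluated at the centroids encoded by a particle.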
Fortunately, the swarm-based PSO algorithm has a strong global search capability and converges quickly. Therefore, we propose to combine FCM and PSO to optimize the objective function in Equation (2). The basic idea is to represent centroids as particles, and to use PSO for global search and FCM for local search.

Figure 1. Algorithm Description of PSO

4.1.1. Particles Encoding: We represent each particle as a feasible solution. That is, the location of each particle consists of the centroids $C = (c_1, c_2, \ldots, c_c)$, where $c_j$ is the centroid of the $j$-th cluster.

4.1.2. Fitness Calculation: Recall that our goal is to minimize $J(U, C)$. Therefore, the fitness function can be defined as:

$$f = \frac{1}{J(U, C) + 1}. \tag{7}$$

The larger $f$ is, the smaller $J(U, C)$ is, and the better the solution is.

4.1.3. Timing of Combination: In general, we use PSO for global search and FCM for local search. The key is to determine when to apply FCM during the PSO process. In this paper, we propose to perform FCM when the particle swarm converges. We define the degree of convergence by the variance of fitness:

$$\sigma^2 = \sum_{i=1}^{N} \left( \frac{f_i - f_{avg}}{f} \right)^2, \tag{8}$$

where $f_i$ is the fitness of the $i$-th particle, $f_{avg}$ is the average fitness of all particles, $f$ is a normalization factor, and $N$ is the number of particles. The smaller $\sigma^2$ is, the more converged the particles are. We use $\sigma^2$ to determine when to apply FCM during the PSO global search. If $\sigma^2$ is close to 0, the particles are almost identical, which means that PSO has reached either premature convergence or global convergence. Otherwise, particles with varied fitness are still searching randomly. If $\sigma^2$ is smaller than a threshold, PSO is about to converge, and applying FCM at that time for local search can improve the performance of the global search and avoid premature convergence.

4.1.4. Algorithm Procedure: Figure 2 illustrates the flow chart of the modified FCM with PSO. Specifically, the procedure can be described as follows:

Step 1: Initialize the particle swarm and parameters. Given the number of clusters $c$, the number of data samples $n$, the fuzzy exponent $m$ and the maximum number of iterations, set $t = 0$. Randomly generate the initial centroid matrix, and initialize $U^{0}$ using Equation (3). Generate a particle from the centroid vector, and initialize its speed using Equation (5). Run the random centroid generation $N$ times to obtain $N$ particles.
Step 2: For each particle, calculate its fitness using Equation (7), and get its best location $p_i$.
Step 3: For each particle, if its best location is better than that of every other particle, update the global best location $p_g$ to $p_i$.
Step 4: Update the speed and location of all particles using Equations (5) and (6).
Step 5: If the variance of fitness satisfies $\sigma^2 < \delta$, where $\delta$ is the threshold on the variance of the fitness values of all particles, go to Step 6; otherwise, return to Step 2.
Step 6: Perform FCM clustering. Calculate the centroids of the clusters using Equation (4). Calculate the membership matrix $U$ using Equation (3). Update the corresponding particles based on the centroids.
Step 7: If the termination condition is satisfied, the algorithm stops. Otherwise, return to Step 2.

Figure 2. Flow Chart of Modified FCM Clustering based on PSO

Figure 3. Flow Chart of Multi-round Random Sampling Method
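The fitness of Equation (7) and the convergence test of Equation (8) and Step 5 can be illustrated with a few lines of Python. This is a sketch under assumptions, not the authors' code: the function names and the default threshold `delta` are hypothetical, and the normalization factor $f$ is taken here as $\max(1, \max_i |f_i - f_{avg}|)$, a common choice that the paper itself does not spell out.

```python
def fitness(J):
    """Equation (7): a smaller objective J gives a larger fitness in (0, 1]."""
    return 1.0 / (J + 1.0)


def fitness_variance(fits):
    """Equation (8): normalized variance of the swarm's fitness values."""
    f_avg = sum(fits) / len(fits)
    # Assumption: normalization factor f = max(1, max_i |f_i - f_avg|).
    f_norm = max(1.0, max(abs(f - f_avg) for f in fits))
    return sum(((f - f_avg) / f_norm) ** 2 for f in fits)


def should_switch_to_fcm(objective_values, delta=1e-5):
    """Step 5: hand over to FCM once the swarm's fitness has nearly converged."""
    fits = [fitness(J) for J in objective_values]
    return fitness_variance(fits) < delta
```

With this test, a swarm whose particles all reach the same objective value triggers the FCM step immediately, while a swarm with widely spread objective values keeps running PSO updates.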

4.2. Procedure for Big Data Clustering

Even though our modified FCM clustering algorithm addresses issues such as initialization sensitivity and low convergence speed, problems remain in big data scenarios. For example, in Step 1 of Section 4.1.4, particles are generated by randomly generating centroids $N$ times on the original dataset. If the dataset size $n$ is not very large, the performance is good. However, if $n$ is very big, this step could be extremely time-consuming, or even infeasible.

Therefore, in this section, we propose a multi-round random sampling method that simulates clustering on the original big dataset by approximating it with the clustering results on sample datasets. The basic assumption here is that, by performing clustering on a smaller subset, we can approximate the results on the original large dataset [33].

The multi-round random sampling can be described as follows. First, randomly select a subset of the original dataset $D$, denoted $D_i$, perform the modified FCM algorithm of Section 4.1 on $D_i$, and get the clustering results $C_i = (c_1^{i}, c_2^{i}, \ldots, c_c^{i})$. After that, randomly select another subset of the same size from the remaining dataset, denoted $D_j$, and combine $D_i$ and $D_j$ into $D_{ij}$. Using $C_i = (c_1^{i}, c_2^{i}, \ldots, c_c^{i})$ as the initial centroids, perform the modified FCM on $D_{ij}$ and get the clustering results $C_{ij} = (c_1^{ij}, c_2^{ij}, \ldots, c_c^{ij})$. Repeat this process: randomly select a subset from the remaining dataset, add it to the existing samples, and perform the modified FCM on the new dataset, until the difference between the clustering results of two consecutive rounds is small enough. That is, $\lVert C_{12 \ldots t} - C_{12 \ldots (t-1)} \rVert < \theta$, where $C_{12 \ldots t}$ is the clustering result of the $t$-th round and $\theta$ is the threshold. The flow chart is shown in Figure 3, where the modified FCM algorithm is the one discussed in Section 4.1.

5. Experiment

Our datasets are originally obtained from the UCI database, i.e., iris, glass and wine. To simulate the big data scenario, we enlarge the original datasets by 1000 times in this experiment. The parameter settings are $c = 3$, $m = 2$, a threshold of $10^{-5}$, $N = 30$, a maximum of 100 iterations, and [...]% = 3%. We define three baselines for comparison.

1) FCM: the basic FCM algorithm on a randomly sampled dataset.
2) FCM-PSO: the modified FCM based on PSO on a randomly sampled dataset.
3) FCM-PSO-MRS: the modified FCM based on PSO using the multi-round random sampling method on the original large dataset.

Table 1 compares the average precision of the three methods on the UCI datasets and on the large-scale datasets. We have the following observations. First, FCM-PSO performs better than FCM, which means that our modification based on PSO does improve the precision of clustering. Second, FCM-PSO-MRS is better than the other two methods, which means that our multi-round random sampling strategy is better than single-round random sampling. Third, in the big data scenario, FCM-PSO-MRS preserves good performance, while FCM and FCM-PSO show some degradation, and thus FCM-PSO-MRS is suitable for big data clustering. Specifically, comparing the smaller UCI datasets with the datasets enlarged 1000 times, the precision of FCM decreases the most when the dataset is large, while the precision of the proposed FCM-PSO-MRS stays almost the same regardless of the dataset size.

Table 1. The Average Precision of Clustering

Dataset       FCM       FCM-PSO   FCM-PSO-MRS
Iris          80.58%    84.37%    89.12%
Glass         71.34%    78.92%    81.05%
Wine          63.41%    69.56%    73.87%
Iris*1000     67.23%    70.84%    88.92%
Glass*1000    60.67%    74.43%    80.85%
Wine*1000     58.50%    67.32%    73.28%

Besides, we evaluate the convergence performance of our modified FCM-PSO algorithm on the big datasets in Figures 4, 5 and 6. Note that in these experiments, to be fair, both FCM and FCM-PSO are based on one round of random sampling, so that their convergence speeds can be compared. For all three datasets we reach the same conclusion: our modified FCM-PSO converges faster than FCM, and with a better fitness value.

Figure 4. Performance of FCM-PSO for Iris*1000 Dataset
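For reference, the multi-round random sampling procedure of Section 4.2 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `fcm_from` is a plain FCM stand-in for the modified FCM of Section 4.1, and the function names, the per-round sample size, the threshold `theta` and the use of the largest centroid shift as the difference between rounds are all hypothetical choices.

```python
import math
import random


def fcm_from(points, centroids, m=2.0, iters=50):
    """Stand-in for the modified FCM: plain FCM updates (Equations (3)
    and (4)) started from the given centroids."""
    power = 2.0 / (m - 1.0)
    for _ in range(iters):
        sums = [[0.0] * len(points[0]) for _ in centroids]
        totals = [0.0] * len(centroids)
        for x in points:
            # Clamp distances to avoid division by zero on exact hits.
            d = [max(math.dist(x, c), 1e-12) for c in centroids]
            for j, dj in enumerate(d):
                u = 1.0 / sum((dj / dk) ** power for dk in d)
                weight = u ** m
                totals[j] += weight
                for k, xk in enumerate(x):
                    sums[j][k] += weight * xk
        centroids = [[s / t for s in row] for row, t in zip(sums, totals)]
    return centroids


def multi_round_sampling(data, c, sample_size, theta=1e-2, seed=0):
    """Grow the sample round by round until the centroids stop moving."""
    rng = random.Random(seed)
    remaining = list(data)
    rng.shuffle(remaining)                      # random order = random subsets
    sample, remaining = remaining[:sample_size], remaining[sample_size:]
    # Round 1: cluster the first subset from randomly chosen starting points.
    centroids = fcm_from(sample, [list(p) for p in rng.sample(sample, c)])
    rounds = 1
    while remaining:
        # Add another subset of the same size to the existing samples.
        sample += remaining[:sample_size]
        remaining = remaining[sample_size:]
        # Reuse the previous centroids as the initial centroids.
        new_centroids = fcm_from(sample, centroids)
        shift = max(math.dist(a, b) for a, b in zip(new_centroids, centroids))
        centroids = new_centroids
        rounds += 1
        if shift < theta:                       # results of two rounds agree
            break
    return centroids, rounds
```

Because every round after the first starts from the previous round's centroids, the cluster order is preserved between rounds, which is what makes the centroid-by-centroid comparison meaningful.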

Figure 5. Performance of FCM-PSO for Glass*1000 Dataset

Figure 6. Performance of FCM-PSO for Wine*1000 Dataset

6. Conclusion

In this paper, we focus on modifying FCM clustering in view of its limitations, such as sensitivity to the initialization status and slow convergence. Specifically, we propose to combine PSO with FCM. Moreover, we present a multi-round random sampling method to deal with the big data challenge.

However, the datasets used in the experiments of this study were obtained by expanding UCI datasets, so they are closer to simulated datasets. In future work, we would like to apply the proposed clustering method to real-world big data scenarios, such as large-scale social networks, and to explore more clustering applications in big data.

References

[1] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: a review", ACM Computing Surveys (CSUR), vol. 31, no. 3, (1999), pp. 264-323.
[2] A. Baraldi and P. Blonda, "A survey of fuzzy clustering algorithms for pattern recognition. I", Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions, vol. 29, no. 6, (1999), pp. 778-785.
[3] P. Berkhin, "A survey of clustering data mining techniques", Grouping multidimensional data, Springer Berlin Heidelberg, (2006), pp. 25-71.
[4] B. N. Li, "Integrating spatial fuzzy clustering with level set methods for automated medical image segmentation", Computers in Biology and Medicine, vol. 41, no. 1, (2011), pp. 1-10.

[5] D. Won, B. M. Song and D. McLeod, "An approach to clustering marketing data", Proceedings of the 2nd International Advanced Database Conference, (2006).
[6] D. Graves and W. Pedrycz, "Kernel-based fuzzy clustering and fuzzy clustering: A comparative experimental study", Fuzzy Sets and Systems, vol. 161, no. 4, (2010), pp. 522-543.
[7] J. C. Bezdek, R. Ehrlich and W. Full, "FCM: The fuzzy c-means clustering algorithm", Computers & Geosciences, vol. 10, no. 2, (1984), pp. 191-203.
[8] J. Kennedy, "Particle swarm optimization", Encyclopedia of Machine Learning, Springer US, (2010), pp. 760-766.
[9] A. K. Jain, "Data clustering: 50 years beyond K-means", Pattern Recognition Letters, vol. 31, no. 8, (2010), pp. 651-666.
[10] R. T. Ng and J. Han, "CLARANS: A method for clustering objects for spatial data mining", Knowledge and Data Engineering, IEEE Transactions, vol. 14, no. 5, (2002), pp. 1003-1016.
[11] C. Ordonez and E. Omiecinski, "FREM: fast and robust EM clustering for large data sets", Proceedings of the eleventh international conference on Information and knowledge management, ACM, (2002).
[12] T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: an efficient data clustering method for very large databases", ACM SIGMOD Record, vol. 25, no. 2, (1996).
[13] S. Guha, R. Rastogi and K. Shim, "CURE: an efficient clustering algorithm for large databases", ACM SIGMOD Record, vol. 27, no. 2, ACM, (1998).
[14] M. Ester, "A density-based algorithm for discovering clusters in large spatial databases with noise", KDD, vol. 96, (1996).
[15] M. Ankerst, "OPTICS: Ordering points to identify the clustering structure", ACM SIGMOD Record, vol. 28, no. 2, ACM, (1999).
[16] D. Birant and A. Kut, "ST-DBSCAN: An algorithm for clustering spatial-temporal data", Data & Knowledge Engineering, vol. 60, no. 1, (2007), pp. 208-221.
[17] W. Wang, J. Yang and R. Muntz, "STING: A statistical information grid approach to spatial data mining", VLDB, vol. 97, (1997).
[18] R. Agrawal, "Automatic subspace clustering of high dimensional data for data mining applications", ACM, vol. 27, no. 2, (1998).
[19] G. Sheikholeslami, S. Chatterjee and A. Zhang, "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases", Proc. 24th Int. Conf. on Very Large Data Bases, (1998), New York.
[20] E. H. Ruspini, "A new approach to clustering", Information and Control, vol. 15, no. 1, (1969), pp. 22-32.
[21] G.-S. Liang, T.-Y. Chou and T.-C. Han, "Cluster analysis based on fuzzy equivalence relation", European Journal of Operational Research, vol. 166, no. 1, (2005), pp. 160-171.
[22] D. Boixader, J. Jacas and J. Recasens, "Fuzzy equivalence relations: advanced material", Fundamentals of Fuzzy Sets, Springer US, (2000), pp. 261-290.
[23] M.-S. Yang and H.-M. Shih, "Cluster analysis based on fuzzy relations", Fuzzy Sets and Systems, vol. 120, no. 2, (2001), pp. 197-212.
[24] Z. Wu and R. Leahy, "An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation", Pattern Analysis and Machine Intelligence, IEEE Transactions, (1993).
[25] J. C. Bezdek and J. D. Harris, "Fuzzy partitions and relations; an axiomatic basis for clustering", Fuzzy Sets and Systems, vol. 1, no. 2, (1978), pp. 111-127.
[26] A. O. Esogbue, "Optimal clustering of fuzzy data via fuzzy dynamic programming", Fuzzy Sets and Systems, vol. 18, no. 3, (1986), pp. 283-298.
[27] D. Graves and W. Pedrycz, "Kernel-based fuzzy clustering and fuzzy clustering: A comparative experimental study", Fuzzy Sets and Systems, vol. 161, no. 4, (2010), pp. 522-543.
[28] K. Krishna and M. N. Murty, "Genetic K-means algorithm", Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions, vol. 29, no. 3, (1999), pp. 433-439.
[29] U. Maulik and S. Bandyopadhyay, "Genetic algorithm-based clustering technique", Pattern Recognition, vol. 33, no. 9, (2000), pp. 1455-1465.
[30] J.-C. Zeng and Z.-H. Cui, "A guaranteed global convergence particle swarm optimizer", Journal of Computer Research and Development, vol. 41, no. 8, (2004), pp. 1333-1338.
[31] L. Xiao-qing and J. Su-n, "PSO spatial clustering with obstacles constraints", Computer Engineering and Design, vol. 28, no. 24, pp. 5924-5926.
[32] N. R. Pal and J. C. Bezdek, "On cluster validity for the fuzzy c-means model", Fuzzy Systems, IEEE Transactions, vol. 3, no. 3, (1995), pp. 370-379.
[33] M.-C. Hung and D.-L. Yang, "An efficient fuzzy c-means clustering algorithm", Data Mining, 2001, ICDM 2001, Proceedings IEEE International Conference, (2001).

Authors

Yang Xianfeng received his bachelor's degree in Computer Application Technology from Henan Normal University, Xinxiang, Henan, China, in 2001, and his master's degree in Computer Application Technology from China University of Petroleum, Dongying, China, in 2007. He is now a lecturer at the School of Information Engineering, Henan Institute of Science and Technology, Xinxiang, China. His current research interests include pattern recognition, image processing, neural networks, and natural language processing.

Liu Pengfei received his bachelor's degree in Computer Science and Technology from Anyang Normal University, Anyang, China, in 2004, and his master's degree in Computer Application from Huazhong University of Science and Technology, Wuhan, China, in 2010. He is now a lecturer at Hebi College of Vocation and Technology. His current research interests include computer networks, web applications, and network operating systems.