Genetic Algorithms applied to Clustering Problem and Data Mining

Proceedings of the 7th WSEAS International Conference on Simulation, Modelling and Optimization, Beijing, China, September 15-17, 2007

J.F. JIMENEZ a, F.J. CUEVAS b, J.M. CARPIO a
a Instituto Tecnológico de León, Av. Tecnológico s/n, Fracc. Julián de Obregón, C.P. 37290 León, Guanajuato, México
b Centro de Investigaciones en Óptica A.C., Loma del Bosque 115, C.P. 37150, León, Guanajuato, México
fcuevas@comx http://wwwcomx and http://smbatleonedumx/prncpalhtml

Abstract: - Clustering techniques have obtained adequate results when applied to data mining problems. However, different runs of the same clustering technique on a specific dataset may result in different solutions. The cause of this difference is the choice of the initial cluster setting and the values of the parameters associated with the technique. Defining good initial settings and optimal parameter values is not an easy task, particularly because both vary largely from one dataset to another. In this paper the authors investigate the use of Genetic Algorithms to determine the best initialization of clusters, as well as the optimization of the initial parameters. The experimental results show the great potential of Genetic Algorithms for the improvement of the clusters, since they not only optimize the clusters but also resolve the problem of the number of clusters, which previously had to be given a priori. Clustering techniques are among the most used in the analysis of information, or Data Mining; this method was applied to data set mining.

Key-words: - Clustering Techniques, Data Mining, k-means, Genetic Algorithms

1 Introduction
Clustering has always been a key task in the process of acquiring knowledge. The complexity and especially the diversity of phenomena have forced society to organize things based on their similarities. The objective of cluster analysis is to sort the observations into clusters such that the degree of natural association is high among members of the same cluster and low between members of different clusters (Berry, 2003; Tou and Gonzalez, 1974; Webb, 2002). The complexity of such a task is easily recognized due to the number of possible arrangements. Sensitivity to initial points and convergence to a local optimum are usually among the problems affecting iterative techniques such as k-means (Bradley and Fayyad, 1998).

Largely used, cluster analysis has called the attention of a very large number of academic disciplines. Most of the work done on the internal spatial and social structure of cities has in some way used classification as a basis for data set analysis using data mining.

There are several established methods for generating a clustering algorithmically (Everitt, 1992; Kaufman and Rousseeuw, 1990; Gersho and Gray, 1992). The most cited and widely used method is the k-means algorithm (McQueen, 1967). It begins with an initial solution, which is iteratively improved using two different optimality criteria in turn until a local minimum has been reached. The algorithm is easy to implement and it gives positive results in most cases.

The clustering problem includes two search and selection sub-problems: (i) the number of clusters to form (k), and (ii) the initial elements of these clusters. Clustering of adequate quality has been obtained by genetic algorithms (GA) (Kivijärvi and Fränti, 2003; Naldi and Carvalho, 2006; Wang et al., 2003), which solve the problem of the initialization of the clusters. However, this does not solve the problem of selecting the number of clusters.

In this paper we propose an adaptive genetic algorithm for the clustering problem. Our aim is to give an effective algorithm which obtains good solutions for the optimization problem without explicit parameter tuning. Each individual of the GA population contains a set of parameter values, and these parameters are used for the generation of clusters.

2 Data mining and Clustering
Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge

representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, or psychology.

A data mining system has the potential to generate thousands or even millions of patterns, or rules. Are all of the patterns interesting? The answer is no: only a small fraction of the patterns potentially generated would actually be of interest to any given user.

Clustering has also been studied in the fields of machine learning and statistical pattern recognition as a type of unsupervised learning, because it does not rely on predefined class-labeled training examples (Duda, Hart, & Stork, 2001).

The kinds of knowledge to be mined: this specifies the data mining functions to be performed, such as characterization, discrimination, association, classification, clustering, or evolution analysis. For instance, if studying the buying habits of customers in Mexico, you may choose to mine associations between customers.

Within data mining, practical pattern-classification and knowledge-discovery problems require the selection of a subset of attributes or features to represent the patterns to be classified. Some works handle this with genetic algorithms, memetic algorithms and, in some cases, cultural algorithms (Sikora and Piramuthu, 2007; Ochoa et al., 2007); in our case we use clustering techniques and optimize them with a GA.

3 Clustering
Clustering methods partition a set of objects into clusters such that objects in the same cluster are more similar to each other than objects in different clusters, according to some defined criteria. The clustering problem is defined as follows. Given a set of N data objects x_i, partition the data set into clusters in such a way that similar objects are clustered together and objects with
dissimilar features belong to different clusters. Given M patterns x_1, x_2, ..., x_M, a clustering process consists of searching for k clusters S_j, j = 1, 2, ..., k. Every cluster is characterized by a centroid (mean), which is the optimal pattern of the cluster and is denoted Z_j, j = 1, 2, ..., k. The functional scheme of clustering techniques is indicated in Figure 1. The general clustering problem includes two subproblems: (i) initialization of centroids and pattern processing, and (ii) deciding the number of clusters.

Fig. 1. Functional scheme of clustering techniques: the patterns X_1, X_2, ..., X_M and the parameters enter the clustering algorithm, which produces the clusters S_1, S_2, ... as sets of patterns.

The most cited and widely used method is the k-means algorithm (McQueen, 1967). It begins with an initial solution, which is iteratively improved using two different optimality criteria in turn until a local minimum has been reached. The algorithm is easy to implement and it gives reasonable results in most cases.

Typically the k-means algorithm starts with an initialization process in which seed positions are defined. This initial step can have a significant impact on the performance of the method and can be done in a number of ways (Bradley and Fayyad, 1998). After the initial seeds have been defined, each data element is assigned to the nearest seed. The next step consists of repositioning the seeds; this can be done after all elements are assigned to the nearest seed or as each one of the elements is assigned. After this, a new assignment step is necessary, and the process goes on until no further improvement can be made, in other words until a local optimum has been found. Considering that the assignments are made on the basis of the distance to the nearest seed, this process implicitly minimizes the sum of the squared distances between each data point and its nearest cluster centroid (Bradley and Fayyad, 1998).

The simplest measure of similarity is the distance. If d is a measure of dissimilarity defined between two patterns, it evidently satisfies:

    d(X_i, X_i) = 0, \qquad d(X_i, X_j) \geq 0    (1)

Different expressions can be found in the scientific literature (Bow, 2002; Tou and Gonzalez, 1974; Webb, 2002). The Euclidean distance is a widely used distance function in the clustering context, and it is calculated as:

    d(x_i, x_j) = \sqrt{\sum_{m} (x_{im} - x_{jm})^2}    (2)

where the sum runs over the components m of the patterns.

The most important choice in the clustering method is the objective function for evaluating the quality of a clustering. A commonly used objective criterion is to minimize the sum of squared distances of the data objects to their cluster representatives. If Z_j is the centroid of cluster S_j, it is calculated as:

    Z_j = \frac{1}{N_j} \sum_{X \in S_j} X    (3)

where N_j is the number of patterns in S_j. The sum of squared errors is:

    J_e = \sum_{j=1}^{k} \sum_{X \in S_j} \lVert X - Z_j \rVert^2    (4)

The specifications of the k-means algorithm are the following:
A. The algorithm is used to group a data set M of information into k clusters.
B. The algorithm converges to a local optimum of J_e.
1. Select at random the centroids Z = {z_1, z_2, ..., z_k}.
2. Repeat until the stopping criterion is satisfied:
   a) Assign every pattern of the data set to the nearest cluster:
        x \in S_j \ \text{if} \ d(x, Z_j) \leq d(x, Z_i), \quad x \in M, \ i = 1, 2, ..., k    (5)
   b) Update the centroids from the newly assigned patterns:
        Z_j = E(x), \quad x \in S_j, \ j = 1, 2, ..., k    (6)

4 Genetic Algorithms
Evolution has proven to be a very powerful mechanism for finding good solutions to difficult problems. One can look at natural selection as an optimization method, which tries to produce adequate solutions to particular environments. In spite of the large number of applications of GA to different types of optimization problems, there is very little research on using this kind of approach for the clustering problem (Kivijärvi and Fränti, 2003; Naldi and Carvalho, 2006; Wang, 2003). In fact, given the quality of the solutions that this technique has shown in different types of fields and problems (Mitchell, 1996), it makes perfect sense to try to use it in clustering problems. The flexibility associated with GA is one important aspect and advantage to consider. With the same genome representation, and just by changing the fitness function, one can have a different algorithm. In the case of spatial analysis this is particularly important, since one can try different fitness functions in an exploratory phase.

5 Solving the Clustering Problem using k-means method
An individual in the
traditional genetic algorithm encodes a solution ω. In this case the solution is given by the individual, which encodes it. The reason for this conceptual distinction between the individual and the solution is that an individual includes the information of the input parameters of the k-means algorithm. Before coding the Genetic Algorithm, we must define the chromosome chain that contains all the genetic information of our system. In this case we analyze a massive repository of information, whose scheme is as follows:

Table 1. Scheme of the data set

          Car 1    Car 2    ...    Car N
    X_1
    X_2
    ...
    X_M

where M is the number of samples and N is the number of characteristics or dimensions of every sample. The chromosome encodes the following parameters of the k-means algorithm: (i) the number of clusters k, and (ii) the characteristics or attributes that will be used in the clustering process, whose number, denominated C, is at most N. The chromosome structure can then be represented as follows:

    Number of Clusters (k) | Car 1 | Car 2 | ... | Car C

With the information previously described, a chromosome chain ω is generated, which is composed of the number of clusters k; this number

is selected at random from a range bounded above by k_max, in order to explore the search space of the clusters. The solution ω is also composed of the numbers of the characteristics to be used by the clustering technique. As an example of a chromosome, take k_max = 7 and C = 3 of 7 possible characteristics, each of which can be represented by 3 bits (see Table 2).

Table 2. Example of the genotype of the chromosome chains

    No.   Number of Clusters (k)   Car 1   Car 2   Car 3   Chromosome
    1     011                      110     ?       101     011 110 ? 101
    2     101                      011     100     ?       101 011 100 ?
    3     111                      100     101     011     111 100 101 011

The chromosome is generated at random, where the number of clusters lies in a range bounded above by k_max and each Car_i indexes a characteristic (column) of the data set, for i = 1, 2 and 3. The following step generates its phenotype, which is the representation in decimal form.

Table 3. Example of the phenotype of the chromosome chains

    No.   Number of Clusters (k)   Car 1   Car 2   Car 3   Chromosome
    1     3                        6       ?       5       3,6,?,5
    2     5                        3       4       ?       5,3,4,?
    3     7                        4       5       3       7,4,5,3

These characteristics are the input parameters of the k-means algorithm.

Scheme of the algorithm proposed for the resolution of clustering problems:
1. Obtain the crossover probability (P_C), the mutation probability (P_M), the population size (G), and the maximum number of generations (T).
2. Generate G random individuals to form the initial generation.
3. Iterate the following for T generations:
   a) Apply k-means to the G individuals.
   b) Obtain the statistics of the clusters of every individual.
   c) Select G_B surviving individuals for the new generation.
   d) Select G - G_B pairs of individuals as the set of parents.
   e) For each pair of parents (a, b) do the following:
      i) Create the solution ω_n of the offspring n by crossing the solutions of the parents.
      ii) Mutate ω_n with probability P_M.
      iii) Evaluate the quality of the solution ω_n with the fitness function.
      iv) Add n to the new generation.
   f) Replace the current generation by the new generation.
4. Output the best solution of the final generation.

In every run of k-means, statistics are obtained for ω, such as the standard deviation and the distances
between clusters, so that they can be evaluated by the fitness function. The chromosome chain ω is evaluated by the fitness function according to the criteria of the clustering technique. Criterion (i) is to maximize the distance between clusters. This function can be written as:

    distc = \sum_{i=1}^{k} \sum_{j=1}^{k} \lVert Z_i - Z_j \rVert    (7)

where distc is the average of the distances between the centroids. Criterion (ii) is to minimize the internal standard deviation of every cluster. This function can be written as:

    desvc = \sum_{i=1}^{k} \sum_{j=1}^{k} (\sigma_i - \sigma_j)^2    (8)

The desvc term minimizes the sum of squared errors of the standard deviations of the clusters. Finally, combining the functions distc and desvc, we obtain the fitness function:

    f(M) = \sum_{i=1}^{k} \sum_{j=1}^{k} \lVert Z_i - Z_j \rVert + \sum_{i=1}^{k} \sum_{j=1}^{k} (\sigma_i - \sigma_j)^2    (9)
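The evaluation of one individual can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `nearest`, `kmeans`, `decode` and `fitness` are hypothetical, the chromosome is taken in its phenotype form (k, Car_1, ..., Car_C) with 1-based column numbers, and the score is an assumed combination (mean distance between centroids minus mean internal standard deviation) chosen only so that larger values mean better-separated, tighter clusters. It is not the paper's Eq. (9).

```python
import numpy as np

def nearest(X, centroids):
    """Index of the nearest centroid (Euclidean distance) for every pattern."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(X, k, rng, n_iter=100):
    """Plain k-means with random initial centroids; returns (centroids, labels)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest(X, centroids)
        # Update step: each centroid becomes the mean of its assigned patterns
        # (an empty cluster keeps its previous centroid).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, nearest(X, centroids)

def decode(chromosome):
    """Split a phenotype chain (k, Car_1, ..., Car_C) into k and 0-based columns."""
    k, *cars = chromosome
    return k, [c - 1 for c in cars]

def fitness(X, chromosome, rng):
    """Assumed score: centroid separation minus internal spread (higher is better)."""
    k, cols = decode(chromosome)
    Xs = X[:, cols]                      # keep only the selected characteristics
    centroids, labels = kmeans(Xs, k, rng)
    if k < 2:
        separation = 0.0                 # no pair of centroids to compare
    else:
        pair = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
        separation = pair[np.triu_indices(k, 1)].mean()
    # Internal standard deviation of each cluster, averaged over its columns.
    sigmas = np.array([Xs[labels == j].std(axis=0).mean() if np.any(labels == j)
                       else 0.0 for j in range(k)])
    return separation - sigmas.mean()
```

With a data matrix `X` of shape (M, N), a call such as `fitness(X, (3, 1, 2), np.random.default_rng(0))` would score an individual that asks for k = 3 clusters on the first two characteristics. Passing the random generator explicitly keeps each evaluation reproducible, which matters because k-means itself depends on its random initial centroids.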

In the fitness function, M is the set of patterns; before this fitness function is applied, M is evaluated by a clustering technique, in this case k-means.

6 Test Results
The data set used in the simulated test had M=000 samples, N= characteristics; only two characteristics were used to generate five clusters. Clusters were generated taking 5 random centroids. The samples were spread using a Gaussian distribution. The other characteristics were generated using a uniform distribution. In Figure 3 the original computer-generated clusters are shown.

Fig. 3. Clusters generated by computer simulation.

Fig. 4. Graphs of the results of k-means and the proposed algorithm, with (a) Gen (b) Gen 6 (c) Gen 0 (d) Gen 35.

The input parameters used in the GA were the following: P_C = 08, P_M = 000, G = 0, T = 30, and a Boltzmann selection method was used. Figure 4 shows the results of the proposed GA technique in Generations , 6, 0 and 35. Finally, in Generation 35 the original clusters are recovered.

7 Conclusion
The good result of a clustering method depends to a great extent on the initial parameters. In this paper we proposed a genetic algorithm that adapts the initial parameters. The GA technique is applied to the k-means clustering method to determine the number of clusters and the characteristics to take into consideration in the clustering process. Future work consists of using different representation schemes for the GA and comparing the qualities and shortcomings of the different representations.

References:
[1] Berry Michael W.: Survey of Text Mining: Clustering, Classification, and Retrieval. John Wiley & Sons (2003)
[2] Bow Sing-Tze: Pattern Recognition and Image Preprocessing. Marcel Dekker Inc. (2002)
[3] Bradley P., Fayyad U.: Refining Initial Points for K-Means Clustering. In J. Shavlik, editor, Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann (1998)
[4] Duda Richard O., Hart Peter E.: Pattern Classification. John Wiley & Sons (2001)
[5] Goldberg David E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Publishing (1989)
[6] Gonzalez Rafael C., Woods Richard E.: Digital Image Processing. Addison Wesley (2002)
[7] Hartigan J.: Clustering Algorithms. Wiley Series in Probability and Mathematical Statistics, John Wiley & Sons (1975)
[8] Haupt Randy L., Haupt Sue Ellen: Practical Genetic Algorithms. John Wiley & Sons (2005)
[9] Jain A.K., Dubes R.C.: Algorithms for Clustering Data. Prentice-Hall (1988)
[10] Kivijärvi Juha, Fränti Pasi: Self-Adaptive Genetic Algorithm for Clustering. Journal of Heuristics, Kluwer Academic Publishers 9: 113-129 (2003)
[11] Marques de Sá J.P.: Pattern Recognition: Concepts, Methods and Applications. Springer (2001)
[12] Mitchell Melanie: An Introduction to Genetic Algorithms. MIT Press, London (1999)
[13] Naldi Murillo C., Carvalho André: Partitional clustering improvement with Genetic Algorithms (2006)
[14] Ochoa Alberto, Ponce Julio, Baltazar Rosario: An approach to Cultural Algorithms from Data Mining. (COMCEV07) Mexican Congress of Evolutionary Computation (2007)
[15] Pedrycz Witold: Knowledge-Based Clustering. John Wiley & Sons (2005)

[16] Sato M., Sato Y., Jain L.: Fuzzy Clustering Models and Applications. Springer-Verlag (1997)
[17] Sikora Riyaz, Piramuthu Selwyn: Framework for efficient feature selection in genetic algorithm based data mining. European Journal of Operational Research 180(2): 723-737 (2007)
[18] Tou Julius T., Gonzalez Rafael C.: Pattern Recognition Principles. Addison-Wesley (1974)
[19] Una-May O'Reilly, Tina Yu: Genetic Programming Theory and Practice II. Springer (2005)
[20] Wang Chang, Zengqiang Chen, Qinlin Sun, Zhuzhi Yuan: Clustering of Amino Acid Sequences based on K-Medoids Method. Journal of Computer Engineering, Vol9 No8 (2003)
[21] Wang Chang, Zengqiang Chen, Zhuzhi Yuan: K-Means Clustering Based on Genetic Algorithm. Journal of Computer Science, Vol30 No (2003)
[22] Webb Andrew R.: Statistical Pattern Recognition Principles. John Wiley & Sons (2002)