A Comparative Study of Data Clustering Techniques


Khaled Hammouda
Prof. Fakhreddine Karray
Department of Systems Design Engineering
University of Waterloo, Waterloo, Ontario, Canada N2L 3G1

Abstract: Data clustering is a process of putting similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than among groups. This paper reviews four of the most representative off-line clustering techniques: K-means clustering, Fuzzy C-means clustering, Mountain clustering, and Subtractive clustering. The techniques are implemented and tested against a medical problem of heart disease diagnosis. Performance and accuracy of the four techniques are presented and compared.

Index Terms: data clustering, k-means, fuzzy c-means, mountain, subtractive.

I. INTRODUCTION

DATA CLUSTERING is considered an interesting approach for finding similarities in data and putting similar data into groups. Clustering partitions a data set into several groups such that the similarity within a group is larger than that among groups [1]. The idea of data grouping, or clustering, is simple in its nature and is close to the human way of thinking; whenever we are presented with a large amount of data, we usually tend to summarize this huge amount of data into a small number of groups or categories in order to further facilitate its analysis. Moreover, most of the data collected in many problems seem to have some inherent properties that lend themselves to natural groupings. Nevertheless, finding these groupings or trying to categorize the data is not a simple task for humans unless the data is of low dimensionality (two or three dimensions at maximum). This is why some methods in soft computing have been proposed to solve this kind of problem. Those methods are called Data Clustering Methods and they are the subject of this paper.

Clustering algorithms are used extensively not only to organize and categorize data, but are also useful for data compression and model construction. By finding similarities in data, one can represent similar data with fewer symbols, for example. Also, if we can find groups of data, we can build a model of the problem based on those groupings. Another reason for clustering is to discover relevance knowledge in data. Francisco Azuaje et al. [2] implemented a Case Based Reasoning (CBR) system based on a Growing Cell Structure (GCS) model. Data can be stored in a knowledge base that is indexed or categorized by cases; this is what is called a Case Base. Each group of cases is assigned to a certain category. Using a Growing Cell Structure (GCS), data can be added or removed based on the learning scheme used. Later, when a query is presented to the model, the system retrieves the most relevant cases from the case base depending on how close those cases are to the query.

In this paper, four of the most representative off-line clustering techniques are reviewed: K-means (or Hard C-means) clustering, Fuzzy C-means clustering, Mountain clustering, and Subtractive clustering. These techniques are usually used in conjunction with radial basis function networks (RBFNs) and fuzzy modeling. The four techniques are implemented and tested against a medical diagnosis problem for heart disease.

The results are presented with a comprehensive comparison of the different techniques and of the effect of different parameters in the process.

The remainder of the paper is organized as follows. Section II presents an overview of data clustering and the underlying concepts. Section III presents each of the four clustering techniques in detail, along with the underlying mathematical foundations. Section IV introduces the implementation of the techniques and goes over the results of each technique, followed by a comparison of the results. A brief conclusion is presented in Section V. The MATLAB code listing of the four clustering techniques can be found in the appendix.

II. DATA CLUSTERING OVERVIEW

As mentioned earlier, data clustering is concerned with the partitioning of a data set into several groups such that the similarity within a group is larger than that among groups. This implies that the data set to be partitioned has to have an inherent grouping to some extent; otherwise, if the data is uniformly distributed, trying to find clusters of data will fail, or will lead to artificially introduced partitions. Another problem that may arise is the overlapping of data groups. Overlapping groupings sometimes reduce the efficiency of the clustering method, and this reduction is proportional to the amount of overlap between groupings.

Usually the techniques presented in this paper are used in conjunction with other sophisticated neural or fuzzy models. In particular, most of these techniques can be used as preprocessors for determining the initial locations for radial basis functions or fuzzy if-then rules.

The common approach of all the clustering techniques presented here is to find cluster centers that will represent each cluster. A cluster center is a way to tell where the heart of each cluster is located, so that later, when presented with an input vector, the system can tell which cluster this vector belongs to by measuring a similarity metric between the input vector and all the cluster centers, and determining which cluster is the nearest or most similar one.

Some of the clustering techniques rely on knowing the number of clusters a priori. In that case the algorithm tries to partition the data into the given number of clusters. K-means and Fuzzy C-means clustering are of that type. In other cases it is not necessary to have the number of clusters known from the beginning; instead, the algorithm starts by finding the first large cluster, and then goes on to find the second, and so on. Mountain and Subtractive clustering are of that type. In both cases a problem with a known number of clusters can be handled; however, if the number of clusters is not known, K-means and Fuzzy C-means clustering cannot be used.

Another aspect of clustering algorithms is their ability to be implemented in on-line or off-line mode. On-line clustering is a process in which each input vector is used to update the cluster centers according to this vector's position. The system in this case learns where the cluster centers are by introducing new input every time. In off-line mode, the system is presented with a training data set, which is used to find the cluster centers by analyzing all the input vectors in the training set. Once the cluster centers are found they are fixed, and they are used later to classify new input vectors. The techniques presented here are of the off-line type.

A brief overview of the four techniques is presented here. A full detailed discussion follows in the next section.

The first technique is K-means clustering [6] (or Hard C-means clustering, as compared to Fuzzy C-means clustering).
This technique has been applied to a variety of areas, including image and speech data compression [3, 4], data preprocessing for system modeling using radial basis function networks, and task decomposition in heterogeneous neural network architectures [5]. This algorithm relies on finding cluster centers by trying to minimize a cost function of dissimilarity (or distance) measure.

The second technique is Fuzzy C-means clustering, which was proposed by Bezdek in 1973 [1] as an improvement over the earlier Hard C-means clustering. In this technique each data point belongs to a cluster to a degree specified by a membership grade. As in K-means clustering, Fuzzy C-means clustering relies on minimizing a cost function of dissimilarity measure.

The third technique is Mountain clustering, proposed by Yager and Filev [1]. This technique calculates a mountain function (a density function) at every possible position in the data space, and chooses the position with the greatest density value as the center of the first cluster. It then destructs the effect of the first cluster's mountain function and finds the second cluster center. This process is repeated until the desired number of clusters has been found.

The fourth technique is Subtractive clustering, proposed by Chiu [1]. This technique is similar to mountain clustering, except that instead of calculating the density function at every possible position in the data space, it uses the positions of the data points to calculate the density function, thus reducing the number of calculations significantly.

III. DATA CLUSTERING TECHNIQUES

In this section a detailed discussion of each technique is presented. Implementation and results are presented in the following sections.

A. K-means Clustering

K-means clustering, or Hard C-means clustering, is an algorithm based on finding data clusters in a data set such that a cost function (or an objective function) of dissimilarity (or distance) measure is minimized [1]. In most cases this dissimilarity measure is chosen as the Euclidean distance.

A set of n vectors $x_j$, $j = 1, \ldots, n$, is to be partitioned into c groups $G_i$, $i = 1, \ldots, c$. The cost function, based on the Euclidean distance between a vector $x_k$ in group $i$ and the corresponding cluster center $c_i$, can be defined by:

$$J = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \Big( \sum_{k,\, x_k \in G_i} \| x_k - c_i \|^2 \Big), \qquad (1)$$

where $J_i = \sum_{k,\, x_k \in G_i} \| x_k - c_i \|^2$ is the cost function within group $i$.

The partitioned groups are defined by a $c \times n$ binary membership matrix $U$, where the element $u_{ij}$ is 1 if the j-th data point $x_j$ belongs to group $i$, and 0 otherwise. Once the cluster centers $c_i$ are fixed, the minimizing $u_{ij}$ for Equation (1) can be derived as follows:

$$u_{ij} = \begin{cases} 1 & \text{if } \| x_j - c_i \|^2 \le \| x_j - c_k \|^2 \text{ for each } k \ne i, \\ 0 & \text{otherwise,} \end{cases} \qquad (2)$$

which means that $x_j$ belongs to group $i$ if $c_i$ is the closest center among all centers. On the other hand, if the membership matrix is fixed, i.e. if $u_{ij}$ is fixed, then the optimal center $c_i$ that minimizes Equation (1) is the mean of all vectors in group $i$:

$$c_i = \frac{1}{|G_i|} \sum_{k,\, x_k \in G_i} x_k, \qquad (3)$$

where $|G_i|$ is the size of $G_i$, or $|G_i| = \sum_{j=1}^{n} u_{ij}$.

The algorithm is presented with a data set $x_i$, $i = 1, \ldots, n$; it then determines the cluster centers $c_i$ and the membership matrix $U$ iteratively using the following steps:

Step 1: Initialize the cluster centers $c_i$, $i = 1, \ldots, c$. This is typically done by randomly selecting c points from among all of the data points.
Step 2: Determine the membership matrix U by Equation (2).
Step 3: Compute the cost function according to Equation (1). Stop if either it is below a certain tolerance value or its improvement over the previous iteration is below a certain threshold.
Step 4: Update the cluster centers according to Equation (3). Go to Step 2.

The performance of the K-means algorithm depends on the initial positions of the cluster centers, thus it is advisable to run the algorithm several times, each with a different set of initial cluster centers.
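As an illustration of the two alternating updates, one K-means iteration can be written compactly in vectorized MATLAB. This is only a sketch of Equations (1)-(3), not the appendix listing; it assumes the data points are the rows of x (m-by-n), the current centers are the rows of c (nc-by-n), and that no cluster loses all of its members.

% One K-means iteration: assign points to nearest centers, then
% move each center to the mean of its members.
for i = 1:nc
    dist2(i,:) = sum((x - repmat(c(i,:),m,1)).^2, 2)';  % squared distances
end
[dmin, nearest] = min(dist2);         % Equation (2): closest center wins
for i = 1:nc
    members = (nearest == i);
    c(i,:) = mean(x(members,:), 1);   % Equation (3): mean of group members
end
J = sum(dmin);                        % Equation (1): cost of this partition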

A discussion of the implementation issues is presented later in this paper.

B. Fuzzy C-means Clustering

Fuzzy C-means clustering (FCM) relies on the basic idea of Hard C-means clustering (HCM), with the difference that in FCM each data point belongs to a cluster to a degree of membership grade, while in HCM every data point either belongs to a certain cluster or not. So FCM employs fuzzy partitioning such that a given data point can belong to several groups, with the degree of belongingness specified by membership grades between 0 and 1. However, FCM still uses a cost function that is to be minimized while trying to partition the data set.

The membership matrix U is allowed to have elements with values between 0 and 1. However, the summation of degrees of belongingness of a data point to all clusters is always equal to unity:

$$\sum_{i=1}^{c} u_{ij} = 1, \qquad j = 1, \ldots, n. \qquad (4)$$

The cost function for FCM is a generalization of Equation (1):

$$J(U, c_1, \ldots, c_c) = \sum_{i=1}^{c} J_i = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} d_{ij}^2, \qquad (5)$$

where $u_{ij}$ is between 0 and 1; $c_i$ is the cluster center of fuzzy group $i$; $d_{ij} = \| c_i - x_j \|$ is the Euclidean distance between the i-th cluster center and the j-th data point; and $m \in [1, \infty)$ is a weighting exponent.

The necessary conditions for Equation (5) to reach its minimum are

$$c_i = \frac{\sum_{j=1}^{n} u_{ij}^{m} x_j}{\sum_{j=1}^{n} u_{ij}^{m}} \qquad (6)$$

and

$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{2/(m-1)}}. \qquad (7)$$

The algorithm works iteratively through the preceding two conditions until no more improvement is noticed. In a batch mode operation, FCM determines the cluster centers $c_i$ and the membership matrix U using the following steps:

Step 1: Initialize the membership matrix U with random values between 0 and 1 such that the constraints in Equation (4) are satisfied.
Step 2: Calculate c fuzzy cluster centers $c_i$, $i = 1, \ldots, c$, using Equation (6).
Step 3: Compute the cost function according to Equation (5). Stop if either it is below a certain tolerance value or its improvement over the previous iteration is below a certain threshold.
Step 4: Compute a new U using Equation (7). Go to Step 2.

As in K-means clustering, the performance of FCM depends on the initial membership matrix values; thereby it is advisable to run the algorithm several times, each time starting with different values of membership grades of the data points.
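Steps 2 and 4 of the FCM loop (Equations (6) and (7)) admit a similarly compact form. Again this is only a sketch under the same assumptions about x and c, with the membership matrix u of size nc-by-m and weighting exponent m_exp; note that when working with squared distances the exponent becomes 1/(m_exp - 1), which is equivalent to the 2/(m - 1) of Equation (7).

% Equation (6): centers as membership-weighted means.
um = u.^m_exp;
for i = 1:nc
    c(i,:) = (um(i,:)*x) ./ sum(um(i,:));
end
% Equation (7): memberships from the new centers.
for i = 1:nc
    d2(i,:) = sum((x - repmat(c(i,:),m,1)).^2, 2)';   % squared distances
end
for i = 1:nc
    for j = 1:m
        u(i,j) = 1 / sum((d2(i,j)./d2(:,j)).^(1/(m_exp-1)));
    end
end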

C. Mountain Clustering

The mountain clustering approach is a simple way to find cluster centers based on a density measure called the mountain function. This method is a simple way to find approximate cluster centers, and can be used as a preprocessor for other sophisticated clustering methods.

The first step in mountain clustering involves forming a grid on the data space, where the intersections of the grid lines constitute the potential cluster centers, denoted as a set V.

The second step entails constructing a mountain function representing a data density measure. The height of the mountain function at a point $v \in V$ is equal to

$$m(v) = \sum_{i=1}^{N} \exp\!\left( -\frac{\| v - x_i \|^2}{2\sigma^2} \right), \qquad (8)$$

where $x_i$ is the i-th data point and $\sigma$ is an application-specific constant. This equation states that the data density measure at a point v is affected by all the points $x_i$ in the data set, and this density measure is inversely proportional to the distance between the data points $x_i$ and the point under consideration v. The constant $\sigma$ determines the height as well as the smoothness of the resultant mountain function.

The third step involves selecting the cluster centers by sequentially destructing the mountain function. The first cluster center $c_1$ is determined by selecting the point with the greatest density measure. Obtaining the next cluster center requires eliminating the effect of the first cluster. This is done by revising the mountain function: a new mountain function is formed by subtracting a scaled Gaussian function centered at $c_1$:

$$m_{new}(v) = m(v) - m(c_1) \exp\!\left( -\frac{\| v - c_1 \|^2}{2\beta^2} \right). \qquad (9)$$

The subtracted amount eliminates the effect of the first cluster. Note that after subtraction, the new mountain function $m_{new}(v)$ reduces to zero at $v = c_1$. After subtraction, the second cluster center is selected as the point having the greatest value of the new mountain function. This process continues until a sufficient number of cluster centers is attained.

D. Subtractive Clustering

The problem with the previous clustering method, mountain clustering, is that its computation grows exponentially with the dimension of the problem, because the mountain function has to be evaluated at each grid point. Subtractive clustering solves this problem by using the data points as the candidates for cluster centers, instead of grid points as in mountain clustering. This means that the computation is now proportional to the problem size instead of the problem dimension. The actual cluster centers are not necessarily located at one of the data points, but in most cases this is a good approximation, especially with the reduced computation this approach introduces.

Since each data point is a candidate for cluster centers, a density measure at data point $x_i$ is defined as

$$D_i = \sum_{j=1}^{n} \exp\!\left( -\frac{\| x_i - x_j \|^2}{(r_a/2)^2} \right), \qquad (10)$$

where $r_a$ is a positive constant representing a neighborhood radius. Hence, a data point will have a high density value if it has many neighboring data points.

The first cluster center $x_{c_1}$ is chosen as the point having the largest density value $D_{c_1}$. Next, the density measure of each data point $x_i$ is revised as follows:

$$D_i = D_i - D_{c_1} \exp\!\left( -\frac{\| x_i - x_{c_1} \|^2}{(r_b/2)^2} \right), \qquad (11)$$

where $r_b$ is a positive constant which defines a neighborhood that has measurable reductions in density measure. Therefore, the data points near the first cluster center $x_{c_1}$ will have a significantly reduced density measure. After revising the density function, the next cluster center is selected as the point having the greatest density value. This process continues until a sufficient number of clusters is attained.
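For the two-cluster problem studied in the next section, the whole subtractive procedure reduces to a few lines. The sketch below assumes a helper density implementing Equation (10) (one possible implementation is given after the appendix) and data points in the rows of x (m-by-n); the radius value is illustrative.

% Subtractive clustering for two clusters.
ra = 0.5;                              % illustrative neighborhood radius
rb = 1.5*ra;
for i = 1:m
    D(i) = density(x(i,:), x, ra);     % Equation (10)
end
[Dc1, i1] = max(D);                    % first cluster center
c(1,:) = x(i1,:);
for i = 1:m
    % Equation (11): remove the influence of the first cluster
    D(i) = D(i) - Dc1*exp(-sum((x(i,:) - c(1,:)).^2) ./ ((rb/2)^2));
end
[Dc2, i2] = max(D);                    % second cluster center
c(2,:) = x(i2,:);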

IV. IMPLEMENTATION AND RESULTS

Having introduced the different clustering techniques and their basic mathematical foundations, we now turn to the discussion of these techniques on the basis of a practical study. This study involves the implementation of each of the four techniques introduced previously, and testing each one of them on a set of medical data related to a heart disease diagnosis problem.

The medical data used consists of 13 input attributes related to the clinical diagnosis of heart disease, and one output attribute which indicates whether the patient is diagnosed with the heart disease or not. The whole data set consists of 300 cases. The data set is partitioned into two sets: two-thirds of the data for training, and one-third for evaluation. The number of clusters into which the data set is to be partitioned is two; i.e. patients diagnosed with the heart disease, and patients not diagnosed with the heart disease.

Because of the high number of dimensions in the problem (13 dimensions), no visual representation of the clusters can be presented; only 2-D or 3-D clustering problems can be visually inspected. We therefore rely heavily on performance measures to evaluate the clustering techniques rather than on visual approaches.

As mentioned earlier, the similarity metric used to calculate the similarity between an input vector and a cluster center is the Euclidean distance. Since most similarity metrics are sensitive to large ranges of elements in the input vectors, each of the input variables must be normalized to within the unit interval [0, 1]; i.e. the data set has to be normalized to be within the unit hypercube.

Each clustering algorithm is presented with the training data set, and as a result two clusters are produced. The data in the evaluation set is then tested against the found clusters and an analysis of the results is conducted. The following sections present the results of each clustering technique, followed by a comparison of the four techniques. MATLAB code for each of the four techniques can be found in the appendix.

A. K-means Clustering

As mentioned in the previous section, K-means clustering works on finding the cluster centers by trying to minimize a cost function J. It alternates between updating the membership matrix and updating the cluster centers using Equations (2) and (3), respectively, until no further improvement in the cost function is noticed. Since the algorithm initializes the cluster centers randomly, its performance is affected by those initial cluster centers, so several runs of the algorithm are advised to get better results.

Evaluating the algorithm is realized by testing the accuracy of the evaluation set. After the cluster centers are determined, the evaluation data vectors are assigned to their respective clusters according to the distance between each vector and each of the cluster centers. An error measure is then calculated; the root mean square error (RMSE) is used for this purpose. Also an accuracy measure is calculated as the percentage of correctly classified vectors.

The algorithm was run 10 times to determine the best performance. Table 1 lists the results of those runs in terms of the number of iterations, RMSE, accuracy, and regression line slope; the accuracy achieved across the 10 runs was 78.0%, 78.0%, 80.0%, 78.0%, 60.0%, 51.0%, 51.0%, 80.0%, 80.0%, and 78.0%. Figure 1 shows a plot of the cost function over time for the best test case.

Table 1. K-means Clustering Performance Results.

Figure 1. K-means clustering cost function history (cost function vs. iteration).
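Both measures are computed the same way for all four techniques. As a minimal sketch, assuming evu holds the 0/1 assignments of the winning labeling and ev the true outputs:

% RMSE and accuracy of a 0/1 cluster assignment against the true labels.
rmse = norm(evu - ev) / sqrt(length(ev));        % root mean square error
accuracy = 100 * sum(evu == ev) / length(ev);    % percent correctly classified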

To further measure how accurately the identified clusters represent the actual classification of the data, a regression analysis is performed of the resultant clustering against the original classification. Performance is considered better if the regression line slope is close to 1. Figure 2 shows the regression analysis of the best test case.

Figure 2. Regression analysis of K-means clustering (data points, best linear fit, and the line A = T).

As seen from the results, the best case achieved 80% accuracy. This relatively moderate performance is related to the high dimensionality of the problem; having too many dimensions tends to disrupt the coupling of the data and introduces overlapping in some of these dimensions, which reduces the accuracy of clustering. It is noticed also that the cost function converges rapidly to a minimum value, as seen from the number of iterations in each test run. However, this has no effect on the accuracy measure.

B. Fuzzy C-means Clustering

FCM allows data points to have different degrees of membership to each of the clusters, thus eliminating the effect of the hard membership introduced by K-means clustering. This approach employs fuzzy measures as the basis for the membership matrix calculation and for the cluster center identification.

As is the case in K-means clustering, FCM starts by assigning random values to the membership matrix U, thus several runs have to be conducted to have a higher probability of getting good performance. However, the results showed no (or insignificant) variation in performance or accuracy when the algorithm was run several times.

For testing the results, every vector in the evaluation data set is assigned to one of the clusters with a certain degree of belongingness (as done in the training set). However, because the output values we have are crisp values (either 1 or 0), the evaluation set degrees of membership are defuzzified to be tested against the actual outputs.

The same performance measures applied in K-means clustering are used here; however, only the effect of the weighting exponent m is analyzed, since the random initial membership grades have an insignificant effect on the final cluster centers. Table 2 lists the results of the tests with the effect of varying the weighting exponent m; the accuracy for the tested values of m, in increasing order of m, was 78.0%, 78.0%, 77.0%, 78.0%, 79.0%, 77.0%, 77.0%, and 77.0%.

Table 2. Fuzzy C-means Clustering Performance Results.

It is noticed that very low or very high values of m reduce the accuracy; moreover, high values tend to increase the time taken by the algorithm to find the clusters.

A value of 2 seems adequate for this problem, since it gives good accuracy and requires a smaller number of iterations. Figure 3 shows the accuracy and number of iterations against the weighting exponent m.

Figure 3. Fuzzy C-means clustering performance (accuracy and number of iterations vs. the weighting exponent m).

In general, the FCM technique showed no improvement over K-means clustering for this problem. Both showed comparable accuracy; moreover, FCM was found to be slower than K-means because of the fuzzy calculations involved.

C. Mountain Clustering

Mountain clustering relies on dividing the data space into grid points and calculating a mountain function at every grid point. This mountain function is a representation of the density of data at this point.

The performance of mountain clustering is severely affected by the dimension of the problem; the computation needed rises exponentially with the dimension of the input data because the mountain function has to be evaluated at each grid point in the data space. For a problem with c clusters, n dimensions, m data points, and a grid size of g per dimension, the required number of calculations is

$$N = m\, g^{n} + (c-1)\, g^{n}, \qquad (12)$$

where the first term accounts for the first cluster and the second term for the remaining clusters. So for the problem at hand, with input data of 13 dimensions, 200 training inputs, and a grid size of 10 per dimension, the required number of mountain function calculations is approximately 2 x 10^15. In addition, the value of the mountain function needs to be stored for every grid point for later use in finding subsequent clusters, which requires g^n storage locations; for our problem this would be 10^13 storage locations. Obviously this is impractical for a problem of this dimension.

In order to be able to test this algorithm, the dimension of the problem has to be reduced to a reasonable number, e.g. 4 dimensions. This is achieved by randomly selecting 4 variables from the input data out of the original 13 and performing the test on those variables. Several tests involving differently selected random variables were conducted in order to have a better understanding of the results. Table 3 lists the results of 10 test runs on randomly selected variables; the accuracy achieved over the runs was 68.0%, 78.0%, 68.0%, 76.0%, 70.0%, 68.0%, 68.0%, 71.0%, 51.0%, and 78.0%.

Table 3. Mountain Clustering Performance Results.

The accuracy achieved ranged between 51% and 78%, with an average of 70%. Those results are quite discouraging compared to the results achieved with K-means and FCM clustering. This is due to the fact that not all of the variables of the input data contribute to the clustering process; only 4 are chosen at random to make it possible to conduct the tests. Moreover, even with only 4 attributes chosen for the tests, mountain clustering required far more time than any other technique during the tests.

This is because the number of computations required is exponentially proportional to the number of dimensions in the problem, as stated in Equation (12). So apparently mountain clustering is not suitable for problems of more than two or three dimensions.

D. Subtractive Clustering

This method is similar to mountain clustering, with the difference that the density function is calculated only at every data point instead of at every grid point, so the data points themselves are the candidates for cluster centers. This has the effect of reducing the number of computations significantly, making it linearly proportional to the number of input data points instead of exponentially proportional to the problem dimension. For a problem of c clusters and m data points, the required number of calculations is

$$N = m^{2} + (c-1)\, m, \qquad (13)$$

where the first term accounts for the first cluster and the second term for the remaining clusters. As seen from the equation, the number of calculations does not depend on the dimension of the problem. For the problem at hand, the number of computations required is only in the range of a few tens of thousands.

Since the algorithm is fixed and does not rely on any randomness, the results are fixed. However, we can test the effect of the two variables $r_a$ and $r_b$ on the accuracy of the algorithm. Those variables represent a radius of neighborhood beyond which the effect (or contribution) of other data points to the density function diminishes. Usually $r_b$ is taken to be $1.5 r_a$. Table 4 shows the results of varying $r_a$; the accuracy for increasing values of $r_a$ was 55.0%, 58.0%, 58.0%, 75.0%, 75.0%, 75.0%, 75.0%, 75.0%, and 58.0%. Figure 4 shows a plot of accuracy and RMSE against $r_a$.

Table 4. Subtractive Clustering Performance Results.

Figure 4. Subtractive clustering performance (accuracy and RMSE vs. r_a).

It is clear from the results that choosing $r_a$ very small or very large results in poor accuracy: if $r_a$ is chosen very small, the density function does not take into account the effect of neighboring data points, while if it is taken very large, the density function is affected by all the data points in the data space. So a value between 0.4 and 0.7 should be adequate for the radius of neighborhood.

As seen from Table 4, the maximum achieved accuracy was 75%, with an RMSE of 0.5. Compared to K-means and FCM, this result is a little behind the accuracy achieved by those techniques.

E. Results Summary and Comparison

According to the preceding discussion of the implementation of the four data clustering techniques and their results, it is useful to summarize the results and present some comparison of performance. A summary of the best achieved results for each of the four techniques is presented in Table 5, which compares the algorithms in terms of RMSE, accuracy, regression line slope, and computation time in seconds; the best accuracy achieved was 80% for K-means, 79% for FCM, 78% for Mountain, and 75% for Subtractive clustering.

Table 5. Performance Results Comparison.

From this comparison we can draw some remarks:

- K-means clustering produces fairly higher accuracy and lower RMSE than the other techniques, and requires less computation time.
- Mountain clustering has a very poor performance regarding its requirement for a huge number of computations and its low accuracy. However, we have to note that the tests conducted on mountain clustering were done using only part of the input variables in order to make it feasible to run the tests. Mountain clustering is suitable only for problems with two or three dimensions.
- FCM produces results close to those of K-means clustering, yet it requires more computation time than K-means because of the fuzzy measure calculations involved in the algorithm.
- In subtractive clustering, care has to be taken when choosing the value of the neighborhood radius $r_a$, since too small a radius results in neglecting the effect of neighboring data points, while too large a radius results in a neighborhood of all the data points, thus canceling the effect of the cluster.

Since none of the algorithms achieved sufficiently high accuracy rates, it is assumed that the problem data itself contains some overlapping in some of the dimensions; the high number of dimensions tends to disrupt the coupling of the data and reduce the accuracy of clustering.

As stated earlier in this paper, clustering algorithms are usually used in conjunction with radial basis function networks and fuzzy models. The techniques described here can be used as preprocessors for RBF networks for determining the centers of the radial basis functions. In such cases, more accuracy can be gained by using gradient descent or other advanced derivative-based optimization schemes for further refinement.

In fuzzy modeling, the cluster centers produced by the clustering techniques can be modeled as if-then rules, in ANFIS for example, where a training set (including inputs and outputs) is used to find cluster centers $(x_i, y_i)$ via clustering first, and a zero-order Sugeno fuzzy model is then formed in which the i-th rule is expressed as

If X is close to $x_i$ then Y is close to $y_i$,

which means that the i-th rule is based on the i-th cluster center identified by the clustering method. Again, after the structure is determined, backpropagation-type gradient descent and other optimization schemes can be applied to proceed with parameter identification.

V. CONCLUSION

Four clustering techniques have been reviewed in this paper, namely: K-means clustering, Fuzzy C-means clustering, Mountain clustering, and Subtractive clustering. These approaches solve the problem of categorizing data by partitioning a data set into a number of clusters based on some similarity measure, so that the similarity within each cluster is larger than among clusters. The four methods have been implemented and tested against a data set for medical diagnosis of heart disease. The comparative study done here is concerned with the accuracy of each algorithm, with care being taken toward the efficiency of calculation and other performance measures.

The medical problem presented has a high number of dimensions, which might involve some complicated relationships between the variables in the input data.

It was obvious that mountain clustering is not a good technique for problems with this high number of dimensions, due to its exponential proportionality to the dimension of the problem. K-means clustering seemed to outperform the other techniques for this type of problem. However, in problems where the number of clusters is not known, K-means and FCM cannot be used, leaving the choice only to mountain or subtractive clustering. Subtractive clustering seems to be a better alternative to mountain clustering, since it is based on the same idea and uses the data points as cluster center candidates instead of grid points; however, mountain clustering can lead to better results if the grid granularity is small enough to capture the potential cluster centers, though with the side effect of increasing the computation needed for the larger number of grid points.

Finally, the clustering techniques discussed here do not have to be used as stand-alone approaches; they can be used in conjunction with other neural or fuzzy systems for further refinement of the overall system performance.

VI. REFERENCES

[1] Jang, J.-S. R., Sun, C.-T., Mizutani, E., Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall, NJ, 1997.
[2] Azuaje, F., Dubitzky, W., Black, N., Adamson, K., "Discovering Relevance Knowledge in Data: A Growing Cell Structures Approach," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 30, No. 3, June 2000, p. 448.
[3] Lin, C., Lee, C., Neural Fuzzy Systems, Prentice Hall, NJ, 1996.
[4] Tsoukalas, L., Uhrig, R., Fuzzy and Neural Approaches in Engineering, John Wiley & Sons, Inc., NY, 1997.
[5] Nauck, D., Kruse, R., Klawonn, F., Foundations of Neuro-Fuzzy Systems, John Wiley & Sons Ltd., NY, 1997.
[6] Hartigan, J. A. and Wong, M. A., "A k-means clustering algorithm," Applied Statistics, Vol. 28, pp. 100-108, 1979.
[7] The MathWorks, Inc., Fuzzy Logic Toolbox: For Use with MATLAB, The MathWorks, Inc., 1999.

Appendix

K-means Clustering (MATLAB script)

% K-means clustering

% CLUSTERING PHASE

% Load the Training Set
TrSet = load('TrainingSet.txt');
[m,n] = size(TrSet);              % (m samples) x (n dimensions)

% the output (last column) values (0,1,2,3) are mapped to (0,1)
for i = 1:m
    if TrSet(i,end) >= 1
        TrSet(i,end) = 1;
    end
end

% find the range of each attribute (for normalization later)
for i = 1:n
    range(1,i) = min(TrSet(:,i));
    range(2,i) = max(TrSet(:,i));
end

x = Normalize(TrSet, range);      % normalize the data set to a hypercube
x(:,end) = [];                    % get rid of the output column
[m,n] = size(x);

nc = 2;                           % number of clusters = 2

% Initialize cluster centers to random points
c = zeros(nc,n);
for i = 1:nc
    rnd = int16(rand*m + 1);      % select a random vector from the input set
    c(i,:) = x(rnd,:);            % assign this vector value to cluster (i)
end

% Clustering Loop
delta = 1e-5;
n_iter = 1000;                    % maximum number of iterations
iter = 1;
while (iter < n_iter)
    % Determine the membership matrix U:
    % u(i,j) = 1 if euc_dist(x(j),c(i)) <= euc_dist(x(j),c(k)) for each k ~= i
    % u(i,j) = 0 otherwise
    for i = 1:nc
        for j = 1:m
            d = euc_dist(x(j,:),c(i,:));
            u(i,j) = 1;
            for k = 1:nc
                if k ~= i
                    if euc_dist(x(j,:),c(k,:)) < d
                        u(i,j) = 0;
                    end
                end
            end
        end
    end

    % Compute the cost function J
    J(iter) = 0;
    for i = 1:nc
        JJ(i) = 0;
        for k = 1:m
            if u(i,k) == 1
                JJ(i) = JJ(i) + euc_dist(x(k,:),c(i,:));
            end
        end
        J(iter) = J(iter) + JJ(i);
    end

    % Stop if either J is below a certain tolerance value,
    % or its improvement over the previous iteration is below a certain threshold
    str = sprintf('Iteration: %.0d, J=%d', iter, J(iter));
    disp(str);
    if (iter ~= 1) & (abs(J(iter-1) - J(iter)) < delta)
        break;
    end

    % Update the cluster centers:
    % c(i) = mean of all vectors belonging to cluster (i)
    for i = 1:nc
        sum_x = 0;
        G(i) = sum(u(i,:));
        for k = 1:m
            if u(i,k) == 1
                sum_x = sum_x + x(k,:);
            end
        end
        c(i,:) = sum_x ./ G(i);
    end

    iter = iter + 1;
end   % while

disp('Clustering Done.');

% TESTING PHASE

% Load the evaluation data set
EvalSet = load('EvaluationSet.txt');
[m,n] = size(EvalSet);
for i = 1:m
    if EvalSet(i,end) >= 1
        EvalSet(i,end) = 1;
    end
end
x = Normalize(EvalSet, range);
x(:,end) = [];
[m,n] = size(x);

% Assign evaluation vectors to their respective clusters according
% to their distance from the cluster centers
for i = 1:nc
    for j = 1:m
        d = euc_dist(x(j,:),c(i,:));
        evu(i,j) = 1;
        for k = 1:nc
            if k ~= i
                if euc_dist(x(j,:),c(k,:)) < d
                    evu(i,j) = 0;
                end
            end
        end
    end
end

% Analyze results
ev = EvalSet(:,end)';
rmse(1) = norm(evu(1,:)-ev)/sqrt(length(evu(1,:)));
rmse(2) = norm(evu(2,:)-ev)/sqrt(length(evu(2,:)));
if rmse(1) < rmse(2)
    r = 1;
else
    r = 2;
end
str = sprintf('Testing Set RMSE: %f', rmse(r));
disp(str);
ctr = 0;
for i = 1:m
    if evu(r,i) == ev(i)

        ctr = ctr + 1;
    end
end
str = sprintf('Testing Set accuracy: %.1f%%', ctr*100/m);
disp(str);
[m,b,r] = postreg(evu(r,:),ev);   % Regression Analysis
disp(sprintf('r = %.3f', r));

Fuzzy C-means Clustering (MATLAB script)

% Fuzzy C-means clustering

% CLUSTERING PHASE

% Load the Training Set
TrSet = load('TrainingSet.txt');
[m,n] = size(TrSet);              % (m samples) x (n dimensions)

% the output (last column) values (0,1,2,3) are mapped to (0,1)
for i = 1:m
    if TrSet(i,end) >= 1
        TrSet(i,end) = 1;
    end
end

% find the range of each attribute (for normalization later)
for i = 1:n
    range(1,i) = min(TrSet(:,i));
    range(2,i) = max(TrSet(:,i));
end

x = Normalize(TrSet, range);      % normalize the data set to a hypercube
x(:,end) = [];                    % get rid of the output column
[m,n] = size(x);

nc = 2;                           % number of clusters = 2

% Initialize the membership matrix with random values between 0 and 1
% such that the summation of membership degrees for each vector equals unity
u = zeros(nc,m);
for i = 1:m
    u(1,i) = rand;
    u(2,i) = 1 - u(1,i);
end

% Clustering Loop
m_exp = 2;                        % weighting exponent m
delta = 1e-5;
n_iter = 1000;
iter = 1;
while (iter < n_iter)
    % Calculate the fuzzy cluster centers
    for i = 1:nc
        sum_ux = 0;
        sum_u = 0;
        for j = 1:m
            sum_ux = sum_ux + (u(i,j)^m_exp)*x(j,:);
            sum_u = sum_u + (u(i,j)^m_exp);
        end
        c(i,:) = sum_ux ./ sum_u;
    end

    % Compute the cost function J
    J(iter) = 0;
    for i = 1:nc
        JJ(i) = 0;
        for j = 1:m
            JJ(i) = JJ(i) + (u(i,j)^m_exp)*euc_dist(x(j,:),c(i,:));
        end

        J(iter) = J(iter) + JJ(i);
    end

    % Stop if either J is below a certain tolerance value,
    % or its improvement over the previous iteration is below a certain threshold
    str = sprintf('Iteration: %.0d, J=%d', iter, J(iter));
    disp(str);
    if (iter ~= 1) & (abs(J(iter-1) - J(iter)) < delta)
        break;
    end

    % Update the membership matrix U
    % (euc_dist returns the squared distance, so the exponent 1/(m_exp-1)
    % is equivalent to the 2/(m-1) of Equation (7))
    for i = 1:nc
        for j = 1:m
            sum_d = 0;
            for k = 1:nc
                sum_d = sum_d + (euc_dist(c(i,:),x(j,:))/euc_dist(c(k,:),x(j,:)))^(1/(m_exp-1));
            end
            u(i,j) = 1/sum_d;
        end
    end

    iter = iter + 1;
end   % while

disp('Clustering Done.');

% TESTING PHASE

% Load the evaluation data set
EvalSet = load('EvaluationSet.txt');
[m,n] = size(EvalSet);
for i = 1:m
    if EvalSet(i,end) >= 1
        EvalSet(i,end) = 1;
    end
end
x = Normalize(EvalSet, range);
x(:,end) = [];
[m,n] = size(x);

% Assign evaluation vectors to their respective clusters according
% to their distance from the cluster centers
for i = 1:nc
    for j = 1:m
        sum_d = 0;
        for k = 1:nc
            sum_d = sum_d + (euc_dist(c(i,:),x(j,:))/euc_dist(c(k,:),x(j,:)))^(1/(m_exp-1));
        end
        evu(i,j) = 1/sum_d;
    end
end

% defuzzify the membership matrix
for j = 1:m
    if evu(1,j) >= evu(2,j)
        evu(1,j) = 1;
        evu(2,j) = 0;
    else
        evu(1,j) = 0;
        evu(2,j) = 1;
    end
end

% Analyze results
ev = EvalSet(:,end)';
rmse(1) = norm(evu(1,:)-ev)/sqrt(length(evu(1,:)));
rmse(2) = norm(evu(2,:)-ev)/sqrt(length(evu(2,:)));
if rmse(1) < rmse(2)
    r = 1;
else
    r = 2;
end

str = sprintf('Testing Set RMSE: %f', rmse(r));
disp(str);
ctr = 0;
for i = 1:m
    if evu(r,i) == ev(i)
        ctr = ctr + 1;
    end
end
str = sprintf('Testing Set accuracy: %.1f%%', ctr*100/m);
disp(str);
[m,b,r] = postreg(evu(r,:),ev);   % Regression Analysis
disp(sprintf('r = %.3f', r));

Mountain Clustering (MATLAB script)

% Mountain Clustering

% Setup the training data

% Load the Training Set
TrSet = load('TrainingSet.txt');
[m,n] = size(TrSet);              % (m samples) x (n dimensions)

% the output (last column) values (0,1,2,3) are mapped to (0,1)
for i = 1:m
    if TrSet(i,end) >= 1
        TrSet(i,end) = 1;
    end
end

% find the range of each attribute (for normalization later)
for i = 1:n
    range(1,i) = min(TrSet(:,i));
    range(2,i) = max(TrSet(:,i));
end

x = Normalize(TrSet, range);      % normalize the data set to a hypercube
x(:,end) = [];                    % get rid of the output column
[m,n] = size(x);

% Due to memory and speed limitations, the number of attributes
% will be set to a maximum of 4 attributes. Extra attributes will
% be dropped at random.
n_dropped = 0;
if n > 4
    for i = 1:(n-4)
        attr = ceil(rand*(n-i+1));
        x(:,attr) = [];
        dropped(i) = attr;        % save dropped attributes' positions
        n_dropped = n_dropped + 1;
    end
    [m,n] = size(x);
end

% First: setup a grid matrix of n dimensions (V)
% (n = the dimension of input data vectors)
% The gridding granularity is 'gr' = # of grid points per dimension
gr = 10;

% setup the dimension vector [d1 d2 d3 ... dn]
v_dim = gr * ones([1 n]);

% setup the mountain matrix
M = zeros(v_dim);
sigma = 0.1;

% Second: calculate the mountain function at every grid point

% setup some aiding variables:
% cur(i) = number of grid points spanned by dimensions 1..i
cur = ones([1 n]);
for i = 1:n
    for j = 1:i
        cur(i) = cur(i)*v_dim(j);
    end
end

max_M = 0;    % greatest density value
max_v = 0;    % cluster center position
disp('Finding Cluster 1...');

% loop over each grid point
for i = 1:cur(1,end)
    % calculate the vector indexes
    idx = i;
    for j = n:-1:2
        dim(j) = ceil(idx/cur(j-1));
        idx = idx - cur(j-1)*(dim(j)-1);
    end
    dim(1) = idx;

    % dim is holding the current point index vector,
    % but needs to be normalized to the range [0,1]
    v = dim ./ gr;

    % calculate the mountain function for the current point
    M(i) = mnt(v,x,sigma);
    if M(i) > max_M
        max_M = M(i);
        max_v = v;
        max_i = i;
    end

    % report progress
    if mod(i,5000) == 0
        str = sprintf('vector %.0d/%.0d; M(v)=%.1f', i, cur(1,end), M(i));
        disp(str);
    end
end

% Third: select the first cluster center by choosing the point
% with the greatest density value
c(1,:) = max_v;
c1 = max_i;
str = sprintf('Cluster 1:');      disp(str);
str = sprintf('%4.1f', c(1,:));   disp(str);
str = sprintf('M=%.3f', max_M);   disp(str);

% CLUSTER 2
Mnew = zeros(v_dim);
max_M = 0;
max_v = 0;
beta = 0.1;
disp('Finding Cluster 2...');
for i = 1:cur(1,end)
    % calculate the vector indexes
    idx = i;
    for j = n:-1:2
        dim(j) = ceil(idx/cur(j-1));
        idx = idx - cur(j-1)*(dim(j)-1);
    end
    dim(1) = idx;

    % dim is holding the current point index vector,

    % but needs to be normalized to the range [0,1]
    v = dim ./ gr;

    % calculate the REVISED mountain function for the current point
    Mnew(i) = M(i) - M(c1)*exp((-euc_dist(v,c(1,:)))./(2*beta^2));
    if Mnew(i) > max_M
        max_M = Mnew(i);
        max_v = v;
        max_i = i;
    end

    % report progress
    if mod(i,5000) == 0
        str = sprintf('vector %.0d/%.0d; Mnew(v)=%.1f', i, cur(1,end), Mnew(i));
        disp(str);
    end
end

c(2,:) = max_v;
str = sprintf('Cluster 2:');      disp(str);
str = sprintf('%4.1f', c(2,:));   disp(str);
str = sprintf('M=%.3f', max_M);   disp(str);

% Evaluation

% Load the evaluation data set
EvalSet = load('EvaluationSet.txt');
[m,n] = size(EvalSet);
for i = 1:m
    if EvalSet(i,end) >= 1
        EvalSet(i,end) = 1;
    end
end
x = Normalize(EvalSet, range);
x(:,end) = [];
[m,n] = size(x);

% drop the attributes corresponding to the ones dropped in the training set
for i = 1:n_dropped
    x(:,dropped(i)) = [];
end
[m,n] = size(x);

% Assign every test vector to its nearest cluster
for i = 1:2
    for j = 1:m
        d = euc_dist(x(j,:),c(i,:));
        evu(i,j) = 1;
        for k = 1:2
            if k ~= i
                if euc_dist(x(j,:),c(k,:)) < d
                    evu(i,j) = 0;
                end
            end
        end
    end
end

% Analyze results
ev = EvalSet(:,end)';
rmse(1) = norm(evu(1,:)-ev)/sqrt(length(evu(1,:)));
rmse(2) = norm(evu(2,:)-ev)/sqrt(length(evu(2,:)));
if rmse(1) < rmse(2)
    r = 1;
else
    r = 2;
end

str = sprintf('Testing Set RMSE: %f', rmse(r));
disp(str);
ctr = 0;
for i = 1:m
    if evu(r,i) == ev(i)
        ctr = ctr + 1;
    end
end
str = sprintf('Testing Set accuracy: %.1f%%', ctr*100/m);
disp(str);
[m,b,r] = postreg(evu(r,:),ev);   % Regression Analysis
disp(sprintf('r = %.3f', r));

Subtractive Clustering (MATLAB script)

% Subtractive Clustering

% Setup the training data

% Load the Training Set
TrSet = load('TrainingSet.txt');
[m,n] = size(TrSet);              % (m samples) x (n dimensions)

% the output (last column) values (0,1,2,3) are mapped to (0,1)
for i = 1:m
    if TrSet(i,end) >= 1
        TrSet(i,end) = 1;
    end
end

% find the range of each attribute (for normalization later)
for i = 1:n
    range(1,i) = min(TrSet(:,i));
    range(2,i) = max(TrSet(:,i));
end

x = Normalize(TrSet, range);      % normalize the data set to a hypercube
x(:,end) = [];                    % get rid of the output column
[m,n] = size(x);

% First: Initialize the density matrix and some variables
D = zeros([m 1]);
ra = 1.0;

% Second: calculate the density function at every data point
% setup some aiding variables
max_D = 0;    % greatest density value
max_x = 0;    % cluster center position
disp('Finding Cluster 1...');

% loop over each data point
for i = 1:m
    % calculate the density function for the current point
    D(i) = density(x(i,:),x,ra);
    if D(i) > max_D
        max_D = D(i);
        max_x = x(i,:);
        max_i = i;
    end

    % report progress
    if mod(i,50) == 0
        str = sprintf('vector %.0d/%.0d; D(v)=%.1f', i, m, D(i));
        disp(str);
    end
end

% Third: select the first cluster center by choosing the point
% with the greatest density value
c(1,:) = max_x;
c1 = max_i;
str = sprintf('Cluster 1:');      disp(str);
str = sprintf('%4.1f', c(1,:));   disp(str);
str = sprintf('D=%.3f', max_D);   disp(str);

% CLUSTER 2
Dnew = zeros([m 1]);
max_D = 0;
max_x = 0;
rb = 1.5*ra;
disp('Finding Cluster 2...');
for i = 1:m
    % calculate the REVISED density function for the current point
    Dnew(i) = D(i) - D(c1)*exp((-euc_dist(x(i,:),c(1,:)))./((rb/2)^2));
    if Dnew(i) > max_D
        max_D = Dnew(i);
        max_x = x(i,:);
        max_i = i;
    end

    % report progress
    if mod(i,50) == 0
        str = sprintf('vector %.0d/%.0d; Dnew(v)=%.1f', i, m, Dnew(i));
        disp(str);
    end
end

c(2,:) = max_x;
str = sprintf('Cluster 2:');      disp(str);
str = sprintf('%4.1f', c(2,:));   disp(str);
str = sprintf('D=%.3f', max_D);   disp(str);

% Evaluation

% Load the evaluation data set
EvalSet = load('EvaluationSet.txt');
[m,n] = size(EvalSet);
for i = 1:m
    if EvalSet(i,end) >= 1
        EvalSet(i,end) = 1;
    end
end
x = Normalize(EvalSet, range);
x(:,end) = [];
[m,n] = size(x);

% Assign every test vector to its nearest cluster
for i = 1:2
    for j = 1:m

        dd = euc_dist(x(j,:),c(i,:));
        evu(i,j) = 1;
        for k = 1:2
            if k ~= i
                if euc_dist(x(j,:),c(k,:)) < dd
                    evu(i,j) = 0;
                end
            end
        end
    end
end

% Analyze results
ev = EvalSet(:,end)';
rmse(1) = norm(evu(1,:)-ev)/sqrt(length(evu(1,:)));
rmse(2) = norm(evu(2,:)-ev)/sqrt(length(evu(2,:)));
if rmse(1) < rmse(2)
    r = 1;
else
    r = 2;
end
str = sprintf('Testing Set RMSE: %f', rmse(r));
disp(str);
ctr = 0;
for i = 1:m
    if evu(r,i) == ev(i)
        ctr = ctr + 1;
    end
end
str = sprintf('Testing Set accuracy: %.1f%%', ctr*100/m);
disp(str);
[m,b,r] = postreg(evu(r,:),ev);   % Regression Analysis
disp(sprintf('r = %.3f', r));
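The scripts above rely on four helper routines (Normalize, euc_dist, mnt, and density) whose listings did not survive in this copy of the appendix. The following are minimal sketches of what they plausibly look like, reconstructed from how they are called and from Equations (8) and (10); they are assumptions, not the original listings. Each function goes in its own M-file. euc_dist is assumed to return the squared Euclidean distance, which is consistent with its use in the cost functions and in the Gaussian terms of Equations (9) and (11).

function y = Normalize(A, range)
% Assumed helper: scale each column of A to [0,1] using the per-column
% minima (range(1,:)) and maxima (range(2,:)) found on the training set.
[m,n] = size(A);
y = zeros(m,n);
for i = 1:n
    y(:,i) = (A(:,i) - range(1,i)) ./ (range(2,i) - range(1,i));
end

function d = euc_dist(a, b)
% Assumed helper: squared Euclidean distance between row vectors a and b.
d = sum((a - b).^2);

function h = mnt(v, x, sigma)
% Assumed helper: mountain function of Equation (8) at grid point v,
% summed over all data points (the rows of x).
h = sum(exp(-sum((x - repmat(v,size(x,1),1)).^2, 2) ./ (2*sigma^2)));

function D = density(xi, x, ra)
% Assumed helper: subtractive-clustering density measure of Equation (10)
% at data point xi, summed over all data points (the rows of x).
D = sum(exp(-sum((x - repmat(xi,size(x,1),1)).^2, 2) ./ ((ra/2)^2)));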


More information

Statistical Methods to Develop Rating Models

Statistical Methods to Develop Rating Models Statstcal Methods to Develop Ratng Models [Evelyn Hayden and Danel Porath, Österrechsche Natonalbank and Unversty of Appled Scences at Manz] Source: The Basel II Rsk Parameters Estmaton, Valdaton, and

More information

Implementation of Deutsch's Algorithm Using Mathcad

Implementation of Deutsch's Algorithm Using Mathcad Implementaton of Deutsch's Algorthm Usng Mathcad Frank Roux The followng s a Mathcad mplementaton of Davd Deutsch's quantum computer prototype as presented on pages - n "Machnes, Logc and Quantum Physcs"

More information

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements Lecture 3 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Next lecture: Matlab tutoral Announcements Rules for attendng the class: Regstered for credt Regstered for audt (only f there

More information

Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting

Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting Causal, Explanatory Forecastng Assumes cause-and-effect relatonshp between system nputs and ts output Forecastng wth Regresson Analyss Rchard S. Barr Inputs System Cause + Effect Relatonshp The job of

More information

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12 14 The Ch-squared dstrbuton PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 1 If a normal varable X, havng mean µ and varance σ, s standardsed, the new varable Z has a mean 0 and varance 1. When ths standardsed

More information

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation Exhaustve Regresson An Exploraton of Regresson-Based Data Mnng Technques Usng Super Computaton Antony Daves, Ph.D. Assocate Professor of Economcs Duquesne Unversty Pttsburgh, PA 58 Research Fellow The

More information

How To Understand The Results Of The German Meris Cloud And Water Vapour Product

How To Understand The Results Of The German Meris Cloud And Water Vapour Product Ttel: Project: Doc. No.: MERIS level 3 cloud and water vapour products MAPP MAPP-ATBD-ClWVL3 Issue: 1 Revson: 0 Date: 9.12.1998 Functon Name Organsaton Sgnature Date Author: Bennartz FUB Preusker FUB Schüller

More information

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic Lagrange Multplers as Quanttatve Indcators n Economcs Ivan Mezník Insttute of Informatcs, Faculty of Busness and Management, Brno Unversty of TechnologCzech Republc Abstract The quanttatve role of Lagrange

More information

An Enhanced Super-Resolution System with Improved Image Registration, Automatic Image Selection, and Image Enhancement

An Enhanced Super-Resolution System with Improved Image Registration, Automatic Image Selection, and Image Enhancement An Enhanced Super-Resoluton System wth Improved Image Regstraton, Automatc Image Selecton, and Image Enhancement Yu-Chuan Kuo ( ), Chen-Yu Chen ( ), and Chou-Shann Fuh ( ) Department of Computer Scence

More information

L10: Linear discriminants analysis

L10: Linear discriminants analysis L0: Lnear dscrmnants analyss Lnear dscrmnant analyss, two classes Lnear dscrmnant analyss, C classes LDA vs. PCA Lmtatons of LDA Varants of LDA Other dmensonalty reducton methods CSCE 666 Pattern Analyss

More information

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College Feature selecton for ntruson detecton Slobodan Petrovć NISlab, Gjøvk Unversty College Contents The feature selecton problem Intruson detecton Traffc features relevant for IDS The CFS measure The mrmr measure

More information

Fixed income risk attribution

Fixed income risk attribution 5 Fxed ncome rsk attrbuton Chthra Krshnamurth RskMetrcs Group chthra.krshnamurth@rskmetrcs.com We compare the rsk of the actve portfolo wth that of the benchmark and segment the dfference between the two

More information

On the Optimal Control of a Cascade of Hydro-Electric Power Stations

On the Optimal Control of a Cascade of Hydro-Electric Power Stations On the Optmal Control of a Cascade of Hydro-Electrc Power Statons M.C.M. Guedes a, A.F. Rbero a, G.V. Smrnov b and S. Vlela c a Department of Mathematcs, School of Scences, Unversty of Porto, Portugal;

More information

Latent Class Regression. Statistics for Psychosocial Research II: Structural Models December 4 and 6, 2006

Latent Class Regression. Statistics for Psychosocial Research II: Structural Models December 4 and 6, 2006 Latent Class Regresson Statstcs for Psychosocal Research II: Structural Models December 4 and 6, 2006 Latent Class Regresson (LCR) What s t and when do we use t? Recall the standard latent class model

More information

Multiple-Period Attribution: Residuals and Compounding

Multiple-Period Attribution: Residuals and Compounding Multple-Perod Attrbuton: Resduals and Compoundng Our revewer gave these authors full marks for dealng wth an ssue that performance measurers and vendors often regard as propretary nformaton. In 1994, Dens

More information

CHAPTER 14 MORE ABOUT REGRESSION

CHAPTER 14 MORE ABOUT REGRESSION CHAPTER 14 MORE ABOUT REGRESSION We learned n Chapter 5 that often a straght lne descrbes the pattern of a relatonshp between two quanttatve varables. For nstance, n Example 5.1 we explored the relatonshp

More information

Traffic State Estimation in the Traffic Management Center of Berlin

Traffic State Estimation in the Traffic Management Center of Berlin Traffc State Estmaton n the Traffc Management Center of Berln Authors: Peter Vortsch, PTV AG, Stumpfstrasse, D-763 Karlsruhe, Germany phone ++49/72/965/35, emal peter.vortsch@ptv.de Peter Möhl, PTV AG,

More information

Gender Classification for Real-Time Audience Analysis System

Gender Classification for Real-Time Audience Analysis System Gender Classfcaton for Real-Tme Audence Analyss System Vladmr Khryashchev, Lev Shmaglt, Andrey Shemyakov, Anton Lebedev Yaroslavl State Unversty Yaroslavl, Russa vhr@yandex.ru, shmaglt_lev@yahoo.com, andrey.shemakov@gmal.com,

More information

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING Matthew J. Lberatore, Department of Management and Operatons, Vllanova Unversty, Vllanova, PA 19085, 610-519-4390,

More information

Period and Deadline Selection for Schedulability in Real-Time Systems

Period and Deadline Selection for Schedulability in Real-Time Systems Perod and Deadlne Selecton for Schedulablty n Real-Tme Systems Thdapat Chantem, Xaofeng Wang, M.D. Lemmon, and X. Sharon Hu Department of Computer Scence and Engneerng, Department of Electrcal Engneerng

More information

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence 1 st Internatonal Symposum on Imprecse Probabltes and Ther Applcatons, Ghent, Belgum, 29 June 2 July 1999 How Sets of Coherent Probabltes May Serve as Models for Degrees of Incoherence Mar J. Schervsh

More information

SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background:

SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background: SPEE Recommended Evaluaton Practce #6 efnton of eclne Curve Parameters Background: The producton hstores of ol and gas wells can be analyzed to estmate reserves and future ol and gas producton rates and

More information

Data Visualization by Pairwise Distortion Minimization

Data Visualization by Pairwise Distortion Minimization Communcatons n Statstcs, Theory and Methods 34 (6), 005 Data Vsualzaton by Parwse Dstorton Mnmzaton By Marc Sobel, and Longn Jan Lateck* Department of Statstcs and Department of Computer and Informaton

More information

The Greedy Method. Introduction. 0/1 Knapsack Problem

The Greedy Method. Introduction. 0/1 Knapsack Problem The Greedy Method Introducton We have completed data structures. We now are gong to look at algorthm desgn methods. Often we are lookng at optmzaton problems whose performance s exponental. For an optmzaton

More information

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña Proceedngs of the 2008 Wnter Smulaton Conference S. J. Mason, R. R. Hll, L. Mönch, O. Rose, T. Jefferson, J. W. Fowler eds. A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION

More information

A Secure Password-Authenticated Key Agreement Using Smart Cards

A Secure Password-Authenticated Key Agreement Using Smart Cards A Secure Password-Authentcated Key Agreement Usng Smart Cards Ka Chan 1, Wen-Chung Kuo 2 and Jn-Chou Cheng 3 1 Department of Computer and Informaton Scence, R.O.C. Mltary Academy, Kaohsung 83059, Tawan,

More information

Cluster Analysis. Cluster Analysis

Cluster Analysis. Cluster Analysis Cluster Analyss Cluster Analyss What s Cluster Analyss? Types of Data n Cluster Analyss A Categorzaton of Maor Clusterng Methos Parttonng Methos Herarchcal Methos Densty-Base Methos Gr-Base Methos Moel-Base

More information

Tools for Privacy Preserving Distributed Data Mining

Tools for Privacy Preserving Distributed Data Mining Tools for Prvacy Preservng Dstrbuted Data Mnng hrs lfton, Murat Kantarcoglu, Jadeep Vadya Purdue Unversty Department of omputer Scences 250 N Unversty St West Lafayette, IN 47907-2066 USA (clfton, kanmurat,

More information

Loop Parallelization

Loop Parallelization - - Loop Parallelzaton C-52 Complaton steps: nested loops operatng on arrays, sequentell executon of teraton space DECLARE B[..,..+] FOR I :=.. FOR J :=.. I B[I,J] := B[I-,J]+B[I-,J-] ED FOR ED FOR analyze

More information

Performance Analysis and Coding Strategy of ECOC SVMs

Performance Analysis and Coding Strategy of ECOC SVMs Internatonal Journal of Grd and Dstrbuted Computng Vol.7, No. (04), pp.67-76 http://dx.do.org/0.457/jgdc.04.7..07 Performance Analyss and Codng Strategy of ECOC SVMs Zhgang Yan, and Yuanxuan Yang, School

More information

Project Networks With Mixed-Time Constraints

Project Networks With Mixed-Time Constraints Project Networs Wth Mxed-Tme Constrants L Caccetta and B Wattananon Western Australan Centre of Excellence n Industral Optmsaton (WACEIO) Curtn Unversty of Technology GPO Box U1987 Perth Western Australa

More information

AN APPOINTMENT ORDER OUTPATIENT SCHEDULING SYSTEM THAT IMPROVES OUTPATIENT EXPERIENCE

AN APPOINTMENT ORDER OUTPATIENT SCHEDULING SYSTEM THAT IMPROVES OUTPATIENT EXPERIENCE AN APPOINTMENT ORDER OUTPATIENT SCHEDULING SYSTEM THAT IMPROVES OUTPATIENT EXPERIENCE Yu-L Huang Industral Engneerng Department New Mexco State Unversty Las Cruces, New Mexco 88003, U.S.A. Abstract Patent

More information

Fuzzy Regression and the Term Structure of Interest Rates Revisited

Fuzzy Regression and the Term Structure of Interest Rates Revisited Fuzzy Regresson and the Term Structure of Interest Rates Revsted Arnold F. Shapro Penn State Unversty Smeal College of Busness, Unversty Park, PA 68, USA Phone: -84-865-396, Fax: -84-865-684, E-mal: afs@psu.edu

More information

The Journal of Systems and Software

The Journal of Systems and Software The Journal of Systems and Software 82 (2009) 241 252 Contents lsts avalable at ScenceDrect The Journal of Systems and Software journal homepage: www. elsever. com/ locate/ jss A study of project selecton

More information

Inter-Ing 2007. INTERDISCIPLINARITY IN ENGINEERING SCIENTIFIC INTERNATIONAL CONFERENCE, TG. MUREŞ ROMÂNIA, 15-16 November 2007.

Inter-Ing 2007. INTERDISCIPLINARITY IN ENGINEERING SCIENTIFIC INTERNATIONAL CONFERENCE, TG. MUREŞ ROMÂNIA, 15-16 November 2007. Inter-Ing 2007 INTERDISCIPLINARITY IN ENGINEERING SCIENTIFIC INTERNATIONAL CONFERENCE, TG. MUREŞ ROMÂNIA, 15-16 November 2007. UNCERTAINTY REGION SIMULATION FOR A SERIAL ROBOT STRUCTURE MARIUS SEBASTIAN

More information

Sensor placement for leak detection and location in water distribution networks

Sensor placement for leak detection and location in water distribution networks Sensor placement for leak detecton and locaton n water dstrbuton networks ABSTRACT R. Sarrate*, J. Blesa, F. Near, J. Quevedo Automatc Control Department, Unverstat Poltècnca de Catalunya, Rambla de Sant

More information

Implementations of Web-based Recommender Systems Using Hybrid Methods

Implementations of Web-based Recommender Systems Using Hybrid Methods Internatonal Journal of Computer Scence & Applcatons Vol. 3 Issue 3, pp 52-64 2006 Technomathematcs Research Foundaton Implementatons of Web-based Recommender Systems Usng Hybrd Methods Janusz Sobeck Insttute

More information

v a 1 b 1 i, a 2 b 2 i,..., a n b n i.

v a 1 b 1 i, a 2 b 2 i,..., a n b n i. SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 455 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces we have studed thus far n the text are real vector spaces snce the scalars are

More information

PRACTICE 1: MUTUAL FUNDS EVALUATION USING MATLAB.

PRACTICE 1: MUTUAL FUNDS EVALUATION USING MATLAB. PRACTICE 1: MUTUAL FUNDS EVALUATION USING MATLAB. INDEX 1. Load data usng the Edtor wndow and m-fle 2. Learnng to save results from the Edtor wndow. 3. Computng the Sharpe Rato 4. Obtanng the Treynor Rato

More information

Number of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000

Number of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000 Problem Set 5 Solutons 1 MIT s consderng buldng a new car park near Kendall Square. o unversty funds are avalable (overhead rates are under pressure and the new faclty would have to pay for tself from

More information

Efficient Striping Techniques for Variable Bit Rate Continuous Media File Servers æ

Efficient Striping Techniques for Variable Bit Rate Continuous Media File Servers æ Effcent Strpng Technques for Varable Bt Rate Contnuous Meda Fle Servers æ Prashant J. Shenoy Harrck M. Vn Department of Computer Scence, Department of Computer Scences, Unversty of Massachusetts at Amherst

More information

Characterization of Assembly. Variation Analysis Methods. A Thesis. Presented to the. Department of Mechanical Engineering. Brigham Young University

Characterization of Assembly. Variation Analysis Methods. A Thesis. Presented to the. Department of Mechanical Engineering. Brigham Young University Characterzaton of Assembly Varaton Analyss Methods A Thess Presented to the Department of Mechancal Engneerng Brgham Young Unversty In Partal Fulfllment of the Requrements for the Degree Master of Scence

More information

A Replication-Based and Fault Tolerant Allocation Algorithm for Cloud Computing

A Replication-Based and Fault Tolerant Allocation Algorithm for Cloud Computing A Replcaton-Based and Fault Tolerant Allocaton Algorthm for Cloud Computng Tork Altameem Dept of Computer Scence, RCC, Kng Saud Unversty, PO Box: 28095 11437 Ryadh-Saud Araba Abstract The very large nfrastructure

More information

Cluster Analysis of Data Points using Partitioning and Probabilistic Model-based Algorithms

Cluster Analysis of Data Points using Partitioning and Probabilistic Model-based Algorithms Internatonal Journal of Appled Informaton Systems (IJAIS) ISSN : 2249-0868 Foundaton of Computer Scence FCS, New York, USA Volume 7 No.7, August 2014 www.jas.org Cluster Analyss of Data Ponts usng Parttonng

More information

THE METHOD OF LEAST SQUARES THE METHOD OF LEAST SQUARES

THE METHOD OF LEAST SQUARES THE METHOD OF LEAST SQUARES The goal: to measure (determne) an unknown quantty x (the value of a RV X) Realsaton: n results: y 1, y 2,..., y j,..., y n, (the measured values of Y 1, Y 2,..., Y j,..., Y n ) every result s encumbered

More information

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION NEURO-FUZZY INFERENE SYSTEM FOR E-OMMERE WEBSITE EVALUATION Huan Lu, School of Software, Harbn Unversty of Scence and Technology, Harbn, hna Faculty of Appled Mathematcs and omputer Scence, Belarusan State

More information

Enterprise Master Patient Index

Enterprise Master Patient Index Enterprse Master Patent Index Healthcare data are captured n many dfferent settngs such as hosptals, clncs, labs, and physcan offces. Accordng to a report by the CDC, patents n the Unted States made an

More information

A machine vision approach for detecting and inspecting circular parts

A machine vision approach for detecting and inspecting circular parts A machne vson approach for detectng and nspectng crcular parts Du-Mng Tsa Machne Vson Lab. Department of Industral Engneerng and Management Yuan-Ze Unversty, Chung-L, Tawan, R.O.C. E-mal: edmtsa@saturn.yzu.edu.tw

More information

Activity Scheduling for Cost-Time Investment Optimization in Project Management

Activity Scheduling for Cost-Time Investment Optimization in Project Management PROJECT MANAGEMENT 4 th Internatonal Conference on Industral Engneerng and Industral Management XIV Congreso de Ingenería de Organzacón Donosta- San Sebastán, September 8 th -10 th 010 Actvty Schedulng

More information

An Adaptive and Distributed Clustering Scheme for Wireless Sensor Networks

An Adaptive and Distributed Clustering Scheme for Wireless Sensor Networks 2007 Internatonal Conference on Convergence Informaton Technology An Adaptve and Dstrbuted Clusterng Scheme for Wreless Sensor Networs Xnguo Wang, Xnmng Zhang, Guolang Chen, Shuang Tan Department of Computer

More information

How To Calculate The Accountng Perod Of Nequalty

How To Calculate The Accountng Perod Of Nequalty Inequalty and The Accountng Perod Quentn Wodon and Shlomo Ytzha World Ban and Hebrew Unversty September Abstract Income nequalty typcally declnes wth the length of tme taen nto account for measurement.

More information

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending Proceedngs of 2012 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 25 (2012) (2012) IACSIT Press, Sngapore Bayesan Network Based Causal Relatonshp Identfcaton and Fundng Success

More information

Brigid Mullany, Ph.D University of North Carolina, Charlotte

Brigid Mullany, Ph.D University of North Carolina, Charlotte Evaluaton And Comparson Of The Dfferent Standards Used To Defne The Postonal Accuracy And Repeatablty Of Numercally Controlled Machnng Center Axes Brgd Mullany, Ph.D Unversty of North Carolna, Charlotte

More information

Lecture 3: Force of Interest, Real Interest Rate, Annuity

Lecture 3: Force of Interest, Real Interest Rate, Annuity Lecture 3: Force of Interest, Real Interest Rate, Annuty Goals: Study contnuous compoundng and force of nterest Dscuss real nterest rate Learn annuty-mmedate, and ts present value Study annuty-due, and

More information