A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining




Zhexue Huang*
Cooperative Research Centre for Advanced Computational Systems
CSIRO Mathematical and Information Sciences
GPO Box 664, Canberra 2601, AUSTRALIA
email: zhexue.huang@cmis.csiro.au

Abstract

Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well known soybean disease data set the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of both the number of clusters and the number of records.

1 Introduction

Partitioning a set of objects into homogeneous clusters is a fundamental operation in data mining. The operation is needed in a number of data mining tasks, such as unsupervised classification and data summation, as well as segmentation of large heterogeneous data sets into smaller homogeneous subsets that can be easily managed, separately modelled and analysed. Clustering is a popular approach used to implement this operation. Clustering methods partition a set of objects into clusters such that objects in the same cluster are more similar to each other than objects in different clusters according to some defined criteria. Statistical clustering methods (Anderberg 1973, Jain and Dubes 1988) use similarity measures to partition objects whereas conceptual clustering methods cluster objects according to the concepts objects carry (Michalski and Stepp 1983, Fisher 1987).

The most distinct characteristic of data mining is that it deals with very large data sets (gigabytes or even terabytes).
This requires the algorithms used in data mining to be scalable. However, most algorithms currently used in data mining do not scale well when applied to very large data sets because they were initially developed for other applications than data mining which involve small data sets. The study of scalable data mining algorithms has recently become a data mining research focus (Shafer et al. 1996).

In this paper we present a fast clustering algorithm used to cluster categorical data. The algorithm, called k-modes, is an extension to the well known k-means algorithm (MacQueen 1967). Compared to other clustering methods the k-means algorithm and its variants (Anderberg 1973) are efficient in clustering large data sets, thus very suitable for data mining. However, their use is often limited to numeric data because these algorithms minimise a cost function by calculating the means of clusters. Data mining applications frequently involve categorical data. The traditional approach to converting categorical data into numeric values does not necessarily produce meaningful results in the case where categorical domains are not ordered. The k-modes algorithm in this paper removes this limitation and extends the k-means paradigm to categorical domains whilst preserving the efficiency of the k-means algorithm.

In (Huang 1997) we have proposed an algorithm, called k-prototypes, to cluster large data sets with mixed numeric and categorical values. In the k-prototypes algorithm we define a dissimilarity measure that takes into account both numeric and categorical attributes. Assume $s_n$ is the dissimilarity measure on numeric attributes defined by the squared Euclidean distance and $s_c$ is the dissimilarity measure on categorical attributes defined as the number of mismatches of categories between two objects. We define the dissimilarity measure between two objects as $s_n + \gamma s_c$, where $\gamma$ is a weight to balance the two parts to avoid favouring either type of attribute.
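The mixed dissimilarity measure just defined can be illustrated with a short Python sketch. It assumes an object is given as a list of numeric values plus a list of categorical values; the function name `mixed_dissimilarity` and its argument layout are hypothetical and are not taken from (Huang 1997).

```python
def mixed_dissimilarity(x_num, y_num, x_cat, y_cat, gamma):
    """Sketch of s_n + gamma * s_c for two objects with mixed attributes."""
    # s_n: squared Euclidean distance over the numeric attributes.
    s_n = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    # s_c: number of categorical attributes on which the objects disagree.
    s_c = sum(a != b for a, b in zip(x_cat, y_cat))
    return s_n + gamma * s_c


# Two objects with two numeric and two categorical attributes.
example = mixed_dissimilarity([1.0, 2.0], [2.0, 4.0], ["a", "b"], ["a", "c"], gamma=0.5)
```

Here the numeric part contributes 1 + 4 = 5 and the single categorical mismatch contributes 0.5, so a larger `gamma` shifts the balance towards the categorical attributes.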
The clustering process of the k-prototypes algorithm is similar to the k-means algorithm except that a new method is used to update the categorical attribute values of cluster prototypes. A problem in using that algorithm is to choose a proper weight. We have suggested the use of the average standard deviation of numeric attributes as a guide in choosing the weight.

* The author wishes to acknowledge that this work was carried out within the Cooperative Research Centre for Advanced Computational Systems (ACSys) established under the Australian Government's Cooperative Research Centres Program.

The k-modes algorithm presented in this paper is a simplification of the k-prototypes algorithm by only taking categorical attributes into account. Therefore, weight $\gamma$ is no longer necessary in the algorithm because of the disappearance of $s_n$. If numeric attributes are involved in a data set, we categorise them using a method as described in (Anderberg 1973). The biggest advantage of this algorithm is that it is scalable to very large data sets. Tested with a health insurance data set consisting of half a million records and 34 categorical attributes, this algorithm has shown a capability of clustering the data set into 100 clusters in about an hour using a single processor of a Sun Enterprise 4000 computer.

Ralambondrainy (1995) presented another approach to using the k-means algorithm to cluster categorical data. Ralambondrainy's approach needs to convert multiple category attributes into binary attributes (using 0 and 1 to represent either a category absent or present) and to treat the binary attributes as numeric in the k-means algorithm. If it is used in data mining, this approach requires handling a large number of binary attributes because data sets in data mining often have categorical attributes with hundreds or thousands of categories. This will inevitably increase both computational and space costs of the k-means algorithm. The other drawback is that the cluster means, given by real values between 0 and 1, do not indicate the characteristics of the clusters. Comparatively, the k-modes algorithm directly works on categorical attributes and produces the cluster modes, which describe the clusters, thus very useful to the user in interpreting the clustering results.

Using Gower's similarity coefficient (Gower 1971) and other dissimilarity measures (Gowda and Diday 1991) one can use a hierarchical clustering method to cluster categorical or mixed data. However, the hierarchical clustering methods are not efficient in processing large data sets. Their use is limited to small data sets.

The rest of the paper is organised as follows. Categorical data and its representation are described in Section 2.
In Section 3 we briefly review the k-means algorithm and its important properties. In Section 4 we discuss the k-modes algorithm. In Section 5 we present some experimental results on two real data sets to show the classification performance and computational efficiency of the k-modes algorithm. We summarise our discussions and describe our future work plan in Section 6.

2 Categorical Data

Categorical data as referred to in this paper is the data describing objects which have only categorical attributes. The objects, called categorical objects, are a simplified version of the symbolic objects defined in (Gowda and Diday 1991). We consider all numeric (quantitative) attributes are categorised and do not consider categorical attributes that have combinational values, e.g., Languages-spoken (Chinese, English). The following two subsections define the categorical attributes and objects accepted by the algorithm.

2.1 Categorical Domains and Attributes

Let $A_1, A_2, \ldots, A_m$ be $m$ attributes describing a space $\Omega$ and $DOM(A_1), DOM(A_2), \ldots, DOM(A_m)$ the domains of the attributes. A domain $DOM(A_j)$ is defined as categorical if it is finite and unordered, e.g., for any $a, b \in DOM(A_j)$, either $a = b$ or $a \neq b$. $A_j$ is called a categorical attribute. $\Omega$ is a categorical space if all $A_1, A_2, \ldots, A_m$ are categorical.

A categorical domain defined here contains only singletons. Combinational values like in (Gowda and Diday 1991) are not allowed. A special value, denoted by $\epsilon$, is defined on all categorical domains and used to represent missing values. To simplify the dissimilarity measure we do not consider the conceptual inclusion relationships among values in a categorical domain like in (Kodratoff and Tecuci 1988) such that car and vehicle are two categorical values in a domain and conceptually a car is also a vehicle. However, such relationships may exist in real world databases.

2.2 Categorical Objects

Like in (Gowda and Diday 1991) a categorical object $X \in \Omega$ is logically represented as a conjunction of attribute-value pairs $[A_1 = x_1] \wedge [A_2 = x_2] \wedge \cdots \wedge [A_m = x_m]$, where $x_j \in DOM(A_j)$ for $1 \le j \le m$. An attribute-value pair $[A_j = x_j]$ is called a selector in (Michalski and Stepp 1983). Without ambiguity we represent $X$ as a vector $[x_1, x_2, \ldots, x_m]$. We consider every object in $\Omega$ has exactly $m$ attribute values. If the value of attribute $A_j$ is not available for an object $X$, then $A_j = \epsilon$.
Let $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$ be a set of $n$ categorical objects and $\mathbf{X} \subseteq \Omega$. Object $X_i$ is represented as $[x_{i,1}, x_{i,2}, \ldots, x_{i,m}]$. We write $X_i = X_k$ if $x_{i,j} = x_{k,j}$ for $1 \le j \le m$. The relation $X_i = X_k$ does not mean that $X_i$, $X_k$ are the same object in the real world database. It means the two objects have equal categorical values in attributes $A_1, A_2, \ldots, A_m$. For example, two patients in a data set may have equal values in attributes Sex, Disease and Treatment. However, they are distinguished in the hospital database by other attributes such as ID and Address which were not selected for clustering.
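As a small illustration of this representation, the following Python sketch stores each categorical object as a tuple of attribute values, so that the relation $X_i = X_k$ becomes plain tuple equality. The records, the attribute order (Sex, Disease, Treatment) and the `MISSING` marker standing in for the special missing value are all made up for the example.

```python
from collections import Counter

MISSING = "ε"  # hypothetical stand-in for the special missing value

# Each object is a vector of attribute values (Sex, Disease, Treatment).
patients = [
    ("F", "flu", "rest"),
    ("M", "flu", "rest"),
    ("F", "flu", "rest"),      # equal to the first object on the selected attributes
    ("M", MISSING, "surgery"),
]

counts = Counter(patients)
n = len(patients)              # number of objects in the set
p = len(counts)                # number of distinct objects
duplicates = {obj: c for obj, c in counts.items() if c > 1}
```

Two patients with equal tuples remain different people in the source database; they are only indistinguishable on the attributes selected for clustering.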

Assume $\mathbf{X}$ consists of $n$ objects in which $p$ objects are distinct. Let $N$ be the cardinality of the Cartesian product $DOM(A_1) \times DOM(A_2) \times \cdots \times DOM(A_m)$. We have $p \le N$. However, $n$ may be larger than $N$, which means there are duplicates in $\mathbf{X}$.

3 The K-means Algorithm

The k-means algorithm (MacQueen 1967, Anderberg 1973) is built upon four basic operations: (1) selection of the initial k means for k clusters, (2) calculation of the dissimilarity between an object and the mean of a cluster, (3) allocation of an object to the cluster whose mean is nearest to the object, (4) re-calculation of the mean of a cluster from the objects allocated to it so that the intra cluster dissimilarity is minimised. Except for the first operation, the other three operations are repeatedly performed in the algorithm until the algorithm converges. The essence of the algorithm is to minimise the cost function

$$E = \sum_{l=1}^{k} \sum_{i=1}^{n} y_{i,l}\, d(X_i, Q_l) \qquad (1)$$

where $n$ is the number of objects in a data set $\mathbf{X}$, $X_i \in \mathbf{X}$, $Q_l$ is the mean of cluster $l$, and $y_{i,l}$ is an element of a partition matrix $Y_{n \times l}$ as in (Hand 1981). $d$ is a dissimilarity measure usually defined by the squared Euclidean distance.

There exist a few variants of the k-means algorithm which differ in selection of the initial k means, dissimilarity calculations and strategies to calculate cluster means (Anderberg 1973, Bobrowski and Bezdek 1991). The sophisticated variants of the k-means algorithm include the well-known ISODATA algorithm (Ball and Hall 1967) and the fuzzy k-means algorithms (Ruspini 1969, 1973). Most k-means type algorithms have been proved convergent (MacQueen 1967, Bezdek 1980, Selim and Ismail 1984).

The k-means algorithm has the following important properties.

1. It is efficient in processing large data sets. The computational complexity of the algorithm is $O(tkmn)$, where $m$ is the number of attributes, $n$ is the number of objects, $k$ is the number of clusters, and $t$ is the number of iterations over the whole data set. Usually, $k, m, t \ll n$. In clustering large data sets the k-means algorithm is much faster than the hierarchical clustering algorithms whose general computational complexity is $O(n^2)$ (Murtagh 1992).
2. It often terminates at a local optimum (MacQueen 1967, Selim and Ismail 1984). To find out the global optimum, techniques such as deterministic annealing (Kirkpatrick et al. 1983, Rose et al. 1990) and genetic algorithms (Goldberg 1989, Murthy and Chowdhury 1996) can be incorporated with the k-means algorithm.
3. It works only on numeric values because it minimises a cost function by calculating the means of clusters.
4. The clusters have convex shapes (Anderberg 1973). Therefore, it is difficult to use the k-means algorithm to discover clusters with non-convex shapes.

One difficulty in using the k-means algorithm is to specify the number of clusters. Some variants like ISODATA include a procedure to search for the best k at the cost of some performance.

The k-means algorithm is best suited for data mining because of its efficiency in processing large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often have categorical values. Development of the k-modes algorithm to be discussed in the next section was motivated by the desire to remove this limitation and extend its use to categorical domains.

4 The K-modes Algorithm

The k-modes algorithm is a simplified version of the k-prototypes algorithm described in (Huang 1997). In this algorithm we have made three major modifications to the k-means algorithm, i.e., using different dissimilarity measures, replacing k means with k modes, and using a frequency based method to update modes. These modifications are discussed below.

4.1 Dissimilarity Measures

Let $X$, $Y$ be two categorical objects described by $m$ categorical attributes. The dissimilarity measure between $X$ and $Y$ can be defined by the total mismatches of the corresponding attribute categories of the two objects. The smaller the number of mismatches is, the more similar the two objects. Formally,

$$d(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j) \qquad (2)$$

where

$$\delta(x_j, y_j) = \begin{cases} 0 & (x_j = y_j) \\ 1 & (x_j \neq y_j) \end{cases} \qquad (3)$$

$d(X, Y)$ gives equal importance to each category of an attribute. If we take into account the frequencies of categories in a data set, we can define the dissimilarity measure as

$$d_{\chi^2}(X, Y) = \sum_{j=1}^{m} \frac{n_{x_j} + n_{y_j}}{n_{x_j}\, n_{y_j}}\, \delta(x_j, y_j) \qquad (4)$$

where $n_{x_j}$, $n_{y_j}$ are the numbers of objects in the data set that have categories $x_j$ and $y_j$ for attribute $j$. Because $d_{\chi^2}(X, Y)$ is similar to the chi-square distance in (Greenacre 1984), we call it chi-square distance. This dissimilarity measure gives more importance to rare categories than frequent ones. Eq. (4) is useful in discovering under-represented object clusters such as fraudulent claims in insurance databases.

4.2 Mode of a Set

Let $\mathbf{X}$ be a set of categorical objects described by categorical attributes $A_1, A_2, \ldots, A_m$.

Definition: A mode of $\mathbf{X}$ is a vector $Q = [q_1, q_2, \ldots, q_m] \in \Omega$ that minimises

$$D(Q, \mathbf{X}) = \sum_{i=1}^{n} d(X_i, Q) \qquad (5)$$

where $\mathbf{X} = \{X_1, X_2, \ldots, X_n\}$ and $d$ can be either defined as Eq. (2) or Eq. (4). Here, $Q$ is not necessarily an element of $\mathbf{X}$.

4.3 Find a Mode for a Set

Let $n_{c_{k,j}}$ be the number of objects having category $c_{k,j}$ in attribute $A_j$ and $f_r(A_j = c_{k,j} \mid \mathbf{X}) = n_{c_{k,j}} / n$ the relative frequency of category $c_{k,j}$ in $\mathbf{X}$.

Theorem: The function $D(Q, \mathbf{X})$ is minimised iff $f_r(A_j = q_j \mid \mathbf{X}) \ge f_r(A_j = c_{k,j} \mid \mathbf{X})$ for $q_j \neq c_{k,j}$ for all $j = 1, \ldots, m$.

The proof of the theorem is given in the Appendix.

The theorem defines a way to find $Q$ from a given $\mathbf{X}$, and therefore is important because it allows the k-means paradigm to be used to cluster categorical data without losing its efficiency. The theorem implies that the mode of a data set $\mathbf{X}$ is not unique. For example, the mode of set {[a, b], [a, c], [c, b], [b, c]} can be either [a, b] or [a, c].

4.4 The k-modes Algorithm

Let $\{S_1, S_2, \ldots, S_k\}$ be a partition of $\mathbf{X}$, where $S_l \neq \emptyset$ for $1 \le l \le k$, and $\{Q_1, Q_2, \ldots, Q_k\}$ the modes of $\{S_1, S_2, \ldots, S_k\}$. The total cost of the partition is defined by

$$E = \sum_{l=1}^{k} \sum_{i=1}^{n} y_{i,l}\, d(X_i, Q_l) \qquad (6)$$

where $y_{i,l}$ is an element of a partition matrix $Y_{n \times l}$ as in (Hand 1981) and $d$ can be either defined as Eq. (2) or Eq. (4).

Similar to the k-means algorithm, the objective of clustering $\mathbf{X}$ is to find a set $\{Q_1, Q_2, \ldots, Q_k\}$ that can minimise $E$. Although the form of this cost function is the same as Eq. (1), $d$ is different. Eq. (6) can be minimised by the k-modes algorithm below.

The k-modes algorithm consists of the following steps (refer to (Huang 1997) for the detailed description of the algorithm):

1. Select k initial modes, one for each cluster.
2. Allocate an object to the cluster whose mode is the nearest to it according to $d$. Update the mode of the cluster after each allocation according to the Theorem.
3. After all objects have been allocated to clusters, retest the dissimilarity of objects against the current modes. If an object is found such that its nearest mode belongs to another cluster rather than its current one, reallocate the object to that cluster and update the modes of both clusters.
4. Repeat 3 until no object has changed clusters after a full cycle test of the whole data set.

Like the k-means algorithm the k-modes algorithm also produces locally optimal solutions that are dependent on the initial modes and the order of objects in the data set. In Section 5 we use a real example to show how appropriate initial mode selection methods can improve the clustering results.

In our current implementation of the k-modes algorithm we include two initial mode selection methods. The first method selects the first k distinct records from the data set as the initial k modes. The second method is implemented in the following steps.

1. Calculate the frequencies of all categories for all attributes and store them in a category array in the descending order of frequency as shown in Figure 1. Here, $c_{i,j}$ denotes category $i$ of attribute $j$ and $f(c_{i,j}) \ge f(c_{i+1,j})$, where $f(c_{i,j})$ is the frequency of category $c_{i,j}$.

$$\begin{array}{cccc} c_{1,1} & c_{1,2} & c_{1,3} & c_{1,4} \\ c_{2,1} & c_{2,2} & c_{2,3} & c_{2,4} \\ c_{3,1} & & c_{3,3} & c_{3,4} \\ c_{4,1} & & c_{4,3} & \\ & & c_{5,3} & \end{array}$$

Figure 1. The category array of a data set with 4 attributes having 4, 2, 5, 3 categories respectively.

2. Assign the most frequent categories equally to the initial k modes. For example in Figure 1, assume k = 3. We assign $Q_1 = [q_{1,1} = c_{1,1}, q_{1,2} = c_{2,2}, q_{1,3} = c_{3,3}, q_{1,4} = c_{1,4}]$, $Q_2 = [q_{2,1} = c_{2,1}, q_{2,2} = c_{1,2}, q_{2,3} = c_{4,3}, q_{2,4} = c_{2,4}]$ and $Q_3 = [q_{3,1} = c_{3,1}, q_{3,2} = c_{2,2}, q_{3,3} = c_{1,3}, q_{3,4} = c_{3,4}]$.
3. Start with $Q_1$. Select the record most similar to $Q_1$ and substitute $Q_1$ with the record as the first initial mode. Then select the record most similar to $Q_2$ and substitute $Q_2$ with the record as the second initial mode. Continue this process until $Q_k$ is substituted. In these selections $Q_l \neq Q_t$ for $l \neq t$.

Step 3 is taken to avoid the occurrence of empty clusters. The purpose of this selection method is to make the initial modes diverse, which can result in better clustering results (see Section 5.1.3).

5 Experimental Results

We used the well known soybean disease data to test classification performance of the algorithm and another large data set selected from a health insurance database to test computational efficiency of the algorithm. The second data set consists of half a million records, each being described by 34 categorical attributes.

5.1 Tests on Soybean Disease Data

5.1.1 Test Data Sets

The soybean disease data is one of the standard test data sets used in the machine learning community. It has often been used to test conceptual clustering algorithms (Michalski and Stepp 1983, Fisher 1987). We chose this data set to test our algorithm because of its publicity and because all its attributes can be treated as categorical without categorisation.

The soybean data set has 47 observations, each being described by 35 attributes. Each observation is identified by one of the 4 diseases: Diaporthe Stem Canker, Charcoal Rot, Rhizoctonia Root Rot, and Phytophthora Rot. Except for Phytophthora Rot which has 17 observations, all other diseases have 10 observations each. Eq. (2) was used in the tests because all disease classes are almost equally distributed. Of the 35 attributes we only selected 21 because the other 14 have only one category.

To study the effect of record order, we created 100 test data sets by randomly reordering the 47 observations. By doing this we were also selecting different records for the initial modes using the first selection method. All disease identifications were removed from the test data sets.

5.1.2 Clustering Results

We used the k-modes algorithm to cluster each test data set into 4 clusters with the two initial mode selection methods and produced 200 clustering results.
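The clustering step used in these tests can be sketched in Python as follows. This is only a minimal illustration of the four steps listed in Section 4.4, using the simple matching dissimilarity of Eq. (2) and the first initial mode selection method; it is not the implementation used in the experiments. The names `matching_dissimilarity`, `mode_of` and `k_modes` are hypothetical, and for brevity the sketch recomputes a cluster's mode from its current members after each move instead of maintaining category frequencies incrementally.

```python
from collections import Counter


def matching_dissimilarity(x, y):
    """Eq. (2): number of attributes on which two objects disagree."""
    return sum(a != b for a, b in zip(x, y))


def mode_of(objects):
    """Most frequent category per attribute; ties go to the first one seen."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))


def k_modes(objects, k, max_iter=100):
    # Step 1: take the first k distinct records as the initial modes.
    modes = list(dict.fromkeys(objects))[:k]
    if len(modes) < k:
        raise ValueError("fewer than k distinct objects")
    labels = [None] * len(objects)
    for _ in range(max_iter):
        moved = False
        # Step 2 on the first pass, step 3 on later passes.
        for i, obj in enumerate(objects):
            nearest = min(range(k), key=lambda l: matching_dissimilarity(obj, modes[l]))
            if nearest != labels[i]:
                old, labels[i], moved = labels[i], nearest, True
                # Update the modes of the clusters that lost or gained the object.
                for l in (old, nearest):
                    if l is None:
                        continue
                    members = [o for o, lab in zip(objects, labels) if lab == l]
                    if members:
                        modes[l] = mode_of(members)
        # Step 4: stop after a full cycle without any reallocation.
        if not moved:
            break
    cost = sum(matching_dissimilarity(o, modes[l]) for o, l in zip(objects, labels))
    return labels, modes, cost


data = [("a", "b", "c"), ("d", "e", "f"), ("a", "b", "c"),
        ("a", "b", "x"), ("d", "e", "f"), ("d", "y", "f")]
labels, modes, cost = k_modes(data, k=2)
```

On this toy data the two groups are separated after one pass, and the returned cost is the value of Eq. (6) for the final partition. Because the result depends on the initial modes and the record order, several runs with different orders would normally be compared.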
For each clustering result we used a misclassification matrix to analyse the correspondence between clusters and the disease classes of the observations. Two misclassification matrices for the test data sets 1 and 9 are shown in Figure 2. The capital letters D, C, R, P in the first column of the matrices represent the 4 disease classes. In Figure 2(a) there is one to one correspondence between clusters and disease classes, which means the observations in the same disease classes were clustered into the same clusters. This represents a complete recovery of the 4 disease classes from the test data set. In Figure 2(b) two observations of the disease class P were misclassified into cluster 1 which was dominated by the observations of the disease class R. However, the observations in the other two disease classes were correctly clustered into clusters 3 and 4. This clustering result can also be considered good.

(a) Columns Cluster 1 to Cluster 4; row entries D: 10, C: 10, R: 10, P: 17.
(b) Columns Cluster 1 to Cluster 4; row entries D: 10, C: 10, R: 10, P: 2 and 15.

Figure 2. Two misclassification matrices. (a) Correspondence between clusters of test data set 1 and disease classes. (b) Correspondence between clusters of test data set 9 and disease classes.

If we use the number of misclassified observations as a measure of a clustering result, we can summarise the 200 clustering results in Table 1. The first column in the table gives the number of misclassified observations. The second and third columns show the numbers of clustering results.

Table 1.

Misclassified    First Selection    Second Selection
Observations     Method             Method
0                13                 14
1                7                  8
2                12                 26
3                4                  9
4                7                  6
5                2                  1
>5               55                 36

If we consider the number of misclassified observations less than 6 as a good clustering result, then 45 good results were produced with the first selection method and 64 good results with the second selection method. Both selection methods produced more than 10 complete recovery results (0 misclassification). These results indicate that if we randomly choose one test data set, we have a 45% chance to obtain a good clustering result with the first selection method and a 64% chance with the second selection method.
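The bookkeeping behind a misclassification matrix can be sketched in Python as below. The sketch assumes that an observation counts as misclassified when it does not belong to the dominant class of its cluster, which matches the reading of Figure 2(b) in the text but is an interpretation rather than a definition taken from the paper. The function names are hypothetical, and the cluster numbers given to classes D and C in the example are assumed, since the text only states that they fall into clusters 3 and 4.

```python
from collections import Counter, defaultdict


def misclassification_matrix(classes, clusters):
    """Cross-tabulate class labels (rows) against cluster numbers (columns)."""
    matrix = defaultdict(Counter)
    for cls, clu in zip(classes, clusters):
        matrix[cls][clu] += 1
    return matrix


def count_misclassified(classes, clusters):
    """Count observations outside the dominant class of their cluster."""
    per_cluster = defaultdict(Counter)
    for cls, clu in zip(classes, clusters):
        per_cluster[clu][cls] += 1
    return sum(sum(c.values()) - max(c.values()) for c in per_cluster.values())


# Counts as described for Figure 2(b); the D/C cluster numbers are assumed.
classes = ["D"] * 10 + ["C"] * 10 + ["R"] * 10 + ["P"] * 17
clusters = [3] * 10 + [4] * 10 + [1] * 10 + [1] * 2 + [2] * 15

matrix = misclassification_matrix(classes, clusters)
misclassified = count_misclassified(classes, clusters)
```

With these counts the two P observations placed in the cluster dominated by R are the only ones counted, so the result would fall in the row for 2 misclassified observations in Table 1.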

Table 2 shows the relationships between the clustering results and the clustering costs (values of Eq. (6)). The numbers in brackets are the numbers of clustering results having the corresponding clustering cost values. All total mismatches of bad clustering results are greater than those of good clustering results. The minimal total mismatch number in these tests is 194 which is likely the global minimum. These relationships indicate that we can use the clustering cost values from several runs to choose a good clustering result if the original classification of data is unknown.

Table 2.

Misclassified    Total mismatches            Total mismatches
Observations     for method 1                for method 2
0                194(13)                     194(14)
1                194(7)                      194(7), 197(1)
2                194(12)                     194(25), 195(1)
3                195(2), 197(1), 201(1)      195(6), 196(2), 197(1)
4                195(2), 196(3), 197(2)      195(4), 196(1), 197(1)
5                197(2)                      197(1)
>5               203-261                     209-254

We did the same tests using a k-means algorithm which is based on the versions 3 and 5 of subroutine KMEAN in (Anderberg 1973). In these tests we simply treated all attributes as numeric and used the squared Euclidean distance as the dissimilarity measure. The initial means were selected by the first method. Of 100 clustering results we only got 4 good ones of which 2 had a complete recovery. Comparing the cost values of the 4 good clustering results with other clustering results, we found that the clustering results and the cost values are not related. Therefore, a good clustering result cannot be selected according to its cost value.

The effect of initial modes on clustering results is shown in Table 3. The first column is the number of disease classes the initial modes have and the second is the corresponding number of runs with the number of disease classes in the initial modes. This table indicates that the more diverse the disease classes are in the initial modes, the better the clustering results. The initial modes selected by the second method have 3 disease types, therefore more good cluster results were produced than by the first method.

Table 3.

No. of classes    No. of runs    Mean cost    Std Dev
1                 1              247          -
2                 28             222.3        24.94
3                 66             211.9        19.28
4                 5              194.6        1.34

From the modes and category distributions of different attributes in different clusters the algorithm can also produce discriminative characteristics of clusters similar to those in (Michalski and Stepp 1983).

5.2 Tests on a Large Data Set

The purpose of this experiment was to test the scalability of the k-modes algorithm in clustering very large real world data sets. We selected a large data set from a health insurance database. The data set consists of 500000 records, each being described by 34 categorical attributes of which 4 have more than 1000 categories each.

We tested two scalabilities of the algorithm using this large data set. The first one is the scalability of the algorithm against the number of clusters for a given number of objects and the second is the scalability against the number of objects for a given number of clusters. Figures 3 and 4 show the results produced using a single processor of a Sun Enterprise 4000 computer. The plots in the figures represent the average time performance of 5 independent runs.

Figure 3. Scalability to the number of clusters in clustering 500000 records. (Plot of real run time in seconds against number of clusters.)

Figure 4. Scalability to the number of records clustered into 100 clusters. (Plot of real run time in seconds against number of records in 1000s.)

These results are very encouraging because they show clearly a linear increase in time as both the number of clusters and number of records increase. Clustering half a million objects into 100 clusters took about an hour, which is quite acceptable. Compared with the results of clustering data with mixed values (Huang 1997), this algorithm is much faster than its previous version because it needs many fewer iterations to converge.
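The soybean tests above suggest comparing the clustering costs of several runs over differently ordered copies of the same data and keeping the cheapest one. A minimal Python sketch of that selection loop is given below. The helper `best_of_runs` and the stand-in `toy_cluster` are hypothetical: `toy_cluster` is not the k-modes algorithm, it simply uses the first two distinct records as fixed modes so that the cost visibly depends on record order.

```python
import random


def best_of_runs(cluster_fn, objects, n_runs=10, seed=0):
    """Run cluster_fn on shuffled copies of the data and keep the lowest cost."""
    rng = random.Random(seed)
    best_cost, best_labels = None, None
    for _ in range(n_runs):
        order = list(range(len(objects)))
        rng.shuffle(order)
        labels, cost = cluster_fn([objects[i] for i in order])
        if best_cost is None or cost < best_cost:
            # Map the labels back to the original record positions.
            restored = [None] * len(objects)
            for position, original_index in enumerate(order):
                restored[original_index] = labels[position]
            best_cost, best_labels = cost, restored
    return best_cost, best_labels


def toy_cluster(objects):
    """Hypothetical stand-in: the first two distinct records act as fixed modes."""
    modes = list(dict.fromkeys(objects))[:2]

    def dist(x, q):
        return sum(a != b for a, b in zip(x, q))

    labels = [min(range(len(modes)), key=lambda l: dist(o, modes[l])) for o in objects]
    cost = sum(dist(o, modes[l]) for o, l in zip(objects, labels))
    return labels, cost


records = [("a", "b"), ("a", "b"), ("a", "c"), ("d", "e"), ("d", "e")]
best_cost, best_labels = best_of_runs(toy_cluster, records, n_runs=20, seed=0)
```

Any function returning a `(labels, cost)` pair could be passed as `cluster_fn`. Since the runs are independent, they could also be distributed over several processors.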

The above soybean disease data tests indicate that a good clustering result should be selected from multiple runs of the algorithm over the same data set with different record orders and/or different initial modes. This can be done in practice by running the algorithm in parallel on a parallel computing system. Other parts of the algorithm such as the operation to allocate an object to a cluster can also be parallelised to improve the performance.

6 Summary and Future Work

The biggest advantage of the k-means algorithm in data mining applications is its efficiency in clustering large data sets. However, its use is limited to numeric values. The k-modes algorithm presented in this paper has removed this limitation whilst preserving its efficiency. The k-modes algorithm has made the following extensions to the k-means algorithm:

1. replacing means of clusters with modes,
2. using new dissimilarity measures to deal with categorical objects, and
3. using a frequency based method to update modes of clusters.

These extensions allow us to use the k-means paradigm directly to cluster categorical data without need of data conversion.

Another advantage of the k-modes algorithm is that the modes give characteristic descriptions of clusters. These descriptions are very important to the user in interpreting clustering results.

Because data mining deals with very large data sets, scalability is a basic requirement to the data mining algorithms. Our experimental results have demonstrated that the k-modes algorithm is indeed scalable to very large and complex data sets in terms of both the number of records and the number of clusters. In fact the k-modes algorithm is faster than the k-means algorithm because our experiments have shown that the former often needs fewer iterations to converge than the latter.

Our future work plan is to develop and implement a parallel k-modes algorithm to cluster data sets with millions of objects. Such an algorithm is required in a number of data mining applications, such as partitioning very large heterogeneous sets of objects into a number of smaller and more manageable homogeneous subsets that can be more easily modelled and analysed, and detecting under-represented concepts, e.g., fraud in a very large number of insurance claims.
Acknowledgments

The author is grateful to Dr Markus Hegland at The Australian National University, Mr Peter Milne and Dr Graham Williams at CSIRO for their comments on the paper.

References

Anderberg, M. R. (1973) Cluster Analysis for Applications, Academic Press.
Ball, G. H. and Hall, D. J. (1967) A Clustering Technique for Summarizing Multivariate Data, Behavioral Science, 12, pp. 153-155.
Bezdek, J. C. (1980) A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(8), pp. 1-8.
Bobrowski, L. and Bezdek, J. C. (1991) c-Means Clustering with the $l_1$ and $l_\infty$ Norms, IEEE Transactions on Systems, Man and Cybernetics, 21(3), pp. 545-554.
Fisher, D. H. (1987) Knowledge Acquisition Via Incremental Conceptual Clustering, Machine Learning, 2(2), pp. 139-172.
Goldberg, D. E. (1989) Genetic Algorithms in Search, Optimisation, and Machine Learning, Addison-Wesley.
Gowda, K. C. and Diday, E. (1991) Symbolic Clustering Using a New Dissimilarity Measure, Pattern Recognition, 24(6), pp. 567-578.
Gower, J. C. (1971) A General Coefficient of Similarity and Some of its Properties, BioMetrics, 27, pp. 857-874.
Greenacre, M. J. (1984) Theory and Applications of Correspondence Analysis, Academic Press.
Hand, D. J. (1981) Discrimination and Classification, John Wiley & Sons.
Huang, Z. (1997) Clustering Large Data Sets with Mixed Numeric and Categorical Values, In Proceedings of The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, World Scientific.
Jain, A. K. and Dubes, R. C. (1988) Algorithms for Clustering Data, Prentice Hall.
Kirkpatrick, S., Gelatt, C. D. and Vecchi, M. P. (1983) Optimisation by Simulated Annealing, Science, 220(4598), pp. 671-680.
Kodratoff, Y. and Tecuci, G. (1988) Learning Based on Conceptual Distance, IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6), pp. 897-909.
MacQueen, J. B. (1967) Some Methods for Classification and Analysis of Multivariate Observations, In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.
Michalski, R. S. and Stepp, R. E. (1983) Automated Construction of Classifications: Conceptual Clustering Versus Numerical Taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(4), pp. 396-410.
Murtagh, F. (1992) Comments on Parallel Algorithms for Hierarchical Clustering and Cluster Validity, IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(10), pp. 1056-1057.

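Murthy, C. A. and Chowdhury, N. (1996) In Search of Optimal Clusters Using Genetic Algorithms, Pattern Recognition Letters, 17, pp. 825-832.
Ralambondrainy, H. (1995) A Conceptual Version of the k-Means Algorithm, Pattern Recognition Letters, 16, pp. 1147-1157.
Rose, K., Gurewitz, E. and Fox, G. (1990) A Deterministic Annealing Approach to Clustering, Pattern Recognition Letters, 11, pp. 589-594.
Ruspini, E. R. (1969) A New Approach to Clustering, Information Control, 19, pp. 22-32.
Ruspini, E. R. (1973) New Experimental Results in Fuzzy Clustering, Information Sciences, 6, pp. 273-284.
Selim, S. Z. and Ismail, M. A. (1984) K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality, IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(1), pp. 81-87.
Shafer, J., Agrawal, R. and Mehta, M. (1996) SPRINT: A Scalable Parallel Classifier for Data Mining, In Proceedings of the 22nd VLDB Conference, Bombay, India, pp. 544-555.

Appendix

The theorem in Section 4.3 can be proved as follows ($A_j$ stands for $DOM(A_j)$ here):

Let $f_r(A_j = c_{k,j} \mid \mathbf{X}) = n_{c_{k,j}} / n$ be the relative frequency of category $c_{k,j}$ of attribute $A_j$, where $n$ is the total number of objects in $\mathbf{X}$ and $n_{c_{k,j}}$ the number of objects having category $c_{k,j}$.

For the dissimilarity measure $d(x, y) = \sum_{j=1}^{m} \delta(x_j, y_j)$, we write

$$\sum_{i=1}^{n} d(X_i, Q) = \sum_{i=1}^{n} \sum_{j=1}^{m} \delta(x_{i,j}, q_j) = \sum_{j=1}^{m} \left( \sum_{i=1}^{n} \delta(x_{i,j}, q_j) \right) = \sum_{j=1}^{m} n \left( 1 - \frac{n_{q_j}}{n} \right) = \sum_{j=1}^{m} n \left( 1 - f_r(A_j = q_j \mid \mathbf{X}) \right)$$

Because $n (1 - f_r(A_j = q_j \mid \mathbf{X})) \ge 0$ for $1 \le j \le m$, $\sum_{i=1}^{n} d(X_i, Q)$ is minimised iff every $n (1 - f_r(A_j = q_j \mid \mathbf{X}))$ is minimal. Thus, $f_r(A_j = q_j \mid \mathbf{X})$ must be maximal.

For the dissimilarity measure $d_{\chi^2}(x, y) = \sum_{j=1}^{m} \frac{n_{x_j} + n_{y_j}}{n_{x_j}\, n_{y_j}}\, \delta(x_j, y_j)$, we write

$$\sum_{i=1}^{n} d_{\chi^2}(X_i, Q) = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{n_{x_{i,j}} + n_{q_j}}{n_{x_{i,j}}\, n_{q_j}}\, \delta(x_{i,j}, q_j) = \sum_{j=1}^{m} \sum_{i=1}^{n} \frac{1}{n_{q_j}}\, \delta(x_{i,j}, q_j) + \sum_{j=1}^{m} \sum_{i=1}^{n} \frac{1}{n_{x_{i,j}}}\, \delta(x_{i,j}, q_j)$$

Now we have

$$\sum_{i=1}^{n} \frac{1}{n_{x_{i,j}}}\, \delta(x_{i,j}, q_j) = \sum_{k=1}^{c_j} \frac{n\, f_r(A_j = c_{k,j} \mid \mathbf{X})}{n_{c_{k,j}}} - \frac{n\, f_r(A_j = q_j \mid \mathbf{X})}{n_{q_j}} = c_j - 1$$

where $c_j$ is the number of categories in $A_j$ and $n_{c_{k,j}}$ the number of objects having category $c_{k,j}$. Consequently, we get

$$\sum_{i=1}^{n} d_{\chi^2}(X_i, Q) = \sum_{j=1}^{m} \frac{n \left( 1 - f_r(A_j = q_j \mid \mathbf{X}) \right)}{n_{q_j}} + \sum_{j=1}^{m} (c_j - 1)$$

Because $n (1 - f_r(A_j = q_j \mid \mathbf{X})) / n_{q_j} \ge 0$ and $\sum_{j=1}^{m} (c_j - 1)$ is a constant for a given $\mathbf{X}$, $\sum_{i=1}^{n} d_{\chi^2}(X_i, Q)$ is minimised iff every $n (1 - f_r(A_j = q_j \mid \mathbf{X})) / n_{q_j}$ is minimal. Thus, $f_r(A_j = q_j \mid \mathbf{X})$ must be maximal.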
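The statement proved above for the simple matching measure can be checked numerically on the small example set from Section 4.3. The Python sketch below enumerates every candidate vector $Q$ built from the observed categories, keeps the minimisers of $D(Q, \mathbf{X})$, and compares them with the vectors obtained by taking a most frequent category for each attribute. Restricting the candidates to observed categories is an assumption made to keep the enumeration finite, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product


def total_dissimilarity(q, objects):
    """D(Q, X) under the simple matching measure of Eq. (2)."""
    return sum(sum(a != b for a, b in zip(x, q)) for x in objects)


def brute_force_modes(objects):
    """All candidate vectors over observed categories that minimise D(Q, X)."""
    domains = [sorted(set(col)) for col in zip(*objects)]
    costs = {q: total_dissimilarity(q, objects) for q in product(*domains)}
    best = min(costs.values())
    return {q for q, c in costs.items() if c == best}


def frequency_modes(objects):
    """All vectors whose every component is a most frequent category."""
    choices = []
    for col in zip(*objects):
        counts = Counter(col)
        top = max(counts.values())
        choices.append([c for c, n in counts.items() if n == top])
    return set(product(*choices))


example_set = [("a", "b"), ("a", "c"), ("c", "b"), ("b", "c")]
by_search = brute_force_modes(example_set)
by_frequency = frequency_modes(example_set)
```

Both approaches return the two modes [a, b] and [a, c] named in Section 4.3, each with a total dissimilarity of 4.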