Clustering Ensembles: Models of Consensus and Weak Partitions*

Alexander Topchy, Anil K. Jain, and William Punch
Department of Computer Science and Engineering, Michigan State University,
East Lansing, Michigan, 48824, USA

Abstract. Clustering ensembles have emerged as a powerful method for improving both the robustness as well as the stability of unsupervised classification solutions. However, finding a consensus clustering from multiple partitions is a difficult problem that can be approached from graph-based, combinatorial or statistical perspectives. This study extends previous research on clustering ensembles in several respects. First, we introduce a unified representation for multiple clusterings and formulate the corresponding categorical clustering problem. Second, we propose a probabilistic model of consensus using a finite mixture of multinomial distributions in a space of clusterings. A combined partition is found as a solution to the corresponding maximum likelihood problem using the EM algorithm. Third, we define a new consensus function that is related to the classical intra-class variance criterion using the generalized mutual information definition. Finally, we demonstrate the efficacy of combining partitions generated by weak clustering algorithms that use data projections and random data splits. A simple explanatory model is offered for the behavior of combinations of such weak clustering components. Combination accuracy is analyzed as a function of several parameters that control the power and resolution of component partitions as well as the number of partitions. We also analyze clustering ensembles with incomplete information and the effect of missing cluster labels on the quality of overall consensus. Experimental results demonstrate the effectiveness of the proposed methods on several real-world datasets.
KEYWORDS: clustering, ensembles, multiple classifier systems, consensus function, mutual information

* This research was supported by ONR grant N. Parts of this work have been presented at the IEEE International Conference on Data Mining, ICDM'03, Melbourne, Florida, November 2003, and the SIAM International Conference on Data Mining, SDM'04, Florida, April 2004.

1 Introduction

In contrast to supervised classification, clustering is inherently an ill-posed problem, whose solution violates at least one of the common assumptions about scale-invariance, richness, and cluster consistency [33]. Different clustering solutions may seem equally plausible without a priori knowledge about the underlying data distributions. Every clustering algorithm implicitly or explicitly assumes a certain data model, and it may produce erroneous or meaningless results when these assumptions are not satisfied by the sample data. Thus the availability of prior information about the data domain is crucial for successful clustering, though such information can be hard to obtain, even from experts. Identification of relevant subspaces [2] or visualization [24] may help to establish the sample data's conformity to the underlying distributions or, at least, to the proper number of clusters.

The exploratory nature of clustering tasks demands efficient methods that would benefit from combining the strengths of many individual clustering algorithms. This is the focus of research on clustering ensembles, seeking a combination of multiple partitions that provides improved overall clustering of the given data. Clustering ensembles can go beyond what is typically achieved by a single clustering algorithm in several respects:

Robustness. Better average performance across the domains and datasets.

Novelty. Finding a combined solution unattainable by any single clustering algorithm.

Stability and confidence estimation. Clustering solutions with lower sensitivity to noise, outliers or sampling variations. Clustering uncertainty can be assessed from ensemble distributions.

Parallelization and scalability. Parallel clustering of data subsets with subsequent combination of results, and the ability to integrate solutions from multiple distributed sources of data or attributes (features).

Clustering ensembles can also be used in multiobjective clustering as a compromise between individual clusterings with conflicting objective functions. Fusion of clusterings using multiple

sources of data or features becomes increasingly important in distributed data mining, e.g., see the review in [4]. Several recent independent studies [10, 12, 14, 15, 43, 47] have pioneered clustering ensembles as a new branch in the conventional taxonomy of clustering algorithms [26, 27]. Please see the Appendix for a detailed review of the related work, including [7, 11, 16, 19, 28, 31, 35].

The problem of clustering combination can be defined generally as follows: given multiple clusterings of the data set, find a combined clustering with better quality. While the problem of clustering combination bears some traits of a classical clustering problem, it also has three major issues which are specific to combination design:

1. Consensus function: How to combine different clusterings? How to resolve the label correspondence problem? How to ensure a symmetrical and unbiased consensus with respect to all the component partitions?

2. Diversity of clustering: How to generate different partitions? What is the source of diversity in the components?

3. Strength of constituents/components: How weak could each input partition be? What is the minimal complexity of component clusterings to ensure a successful combination?

Similar questions have already been addressed in the framework of multiple classifier systems. Combining results from many supervised classifiers is an active research area (Quinlan 1996, Breiman 1998) and it provides the main motivation for clustering combination. However, it is not possible to mechanically apply the combination algorithms from the classification (supervised) domain to the clustering (unsupervised) domain. Indeed, no labeled training data is available in clustering; therefore the ground truth feedback necessary for boosting the overall accuracy cannot be used. In addition, different clusterings may produce incompatible data labelings, resulting in intractable correspondence problems, especially when the numbers of clusters are different.
Still, the supervised classifier combination demonstrates, in principle, how multiple solutions reduce the variance component of the expected error rate and increase the robustness of the solution.

From the supervised case we also learn that the proper combination of weak classifiers [32, 25, 8, 6] may achieve arbitrarily low error rates on training data, as well as reduce the predictive error. One can expect that using many simple, but computationally inexpensive, components will be preferable to combining clusterings obtained by sophisticated, but computationally involved, algorithms. This paper further advances ensemble methods in several aspects, namely the design of new effective consensus functions, the development of new partition generation mechanisms, and the study of the resulting clustering accuracy.

1.1 Our Contribution

We offer a representation of multiple clusterings as a set of new attributes characterizing the data items. Such a view directly leads to a formulation of the combination problem as a categorical clustering problem in the space of these attributes, or, in other terms, a median partition problem. The median partition can be viewed as the best summary of the given input partitions. As an optimization problem, median partition is NP-complete [3], with a continuum of heuristics for an approximate solution.

This work focuses on the primary problem of clustering ensembles, namely the consensus function, which creates the combined clustering. We show how the median partition is related to the classical intra-class variance criterion when generalized mutual information is used as the evaluation function. A consensus function based on quadratic mutual information (QMI) is proposed and reduced to k-means clustering in the space of specially transformed cluster labels.

We also propose a new fusion method for unsupervised decisions that is based on a probability model of the consensus partition in the space of contributing clusters. The consensus partition is found as a solution to the maximum likelihood problem for a given clustering ensemble. The likelihood function of an ensemble is optimized with respect to the parameters of a finite mixture distribution. Each component in this distribution corresponds to a cluster in the target consensus

partition, and is assumed to be a multivariate multinomial distribution. The maximum likelihood problem is solved using the EM algorithm [8].

There are several advantages to the QMI and EM consensus functions. These include: (i) complete avoidance of solving the label correspondence problem, (ii) low computational complexity, and (iii) the ability to handle missing data, i.e., missing cluster labels for certain patterns in the ensemble (for example, when a bootstrap method is used to generate the ensemble).

Another goal of our work is to adopt weak clustering algorithms and combine their outputs. Vaguely defined, a weak clustering algorithm produces a partition which is only slightly better than a random partition of the data. We propose two different weak clustering algorithms as the component generation mechanisms:

1. Clustering of random 1-dimensional projections of multidimensional data. This can be generalized to clustering in any random subspace of the original data space.

2. Clustering by splitting the data using a number of random hyperplanes. For example, if only one hyperplane is used then the data is split into two groups.

Finally, this paper compares the performance of different consensus functions. We have investigated the performance of a family of consensus functions based on categorical clustering, including the co-association-based hierarchical methods [15, 16, 17], hypergraph algorithms [47, 29, 30] and our new consensus functions. Combination accuracy is analyzed as a function of the number and the resolution of the clustering components. In addition, we study clustering performance when some cluster labels are missing, which is often encountered in distributed data or re-sampling scenarios.

2 Representation of Multiple Partitions

Combination of multiple partitions can be viewed as a partitioning task itself. Typically, each partition in the combination is represented as a set of labels assigned by a clustering algorithm. The combined partition is obtained as a result of yet another clustering algorithm whose inputs are the

cluster labels of the contributing partitions. We will assume that the labels are nominal values. In general, the clusterings can be soft, i.e., described by real values indicating the degree of pattern membership in each cluster of a partition. We consider only hard partitions below, noting, however, that combination of soft partitions can be solved by numerous clustering algorithms and does not appear to be more complex.

Suppose we are given a set of N data points X = {x_1, ..., x_N} and a set of H partitions Π = {π_1, ..., π_H} of the objects in X. Different partitions of X return a set of labels for each point x_i, i = 1, ..., N:

x_i → {π_1(x_i), π_2(x_i), ..., π_H(x_i)}.   (1)

Here, H different clusterings are indicated and π_j(x_i) denotes a label assigned to x_i by the j-th algorithm. No assumption is made about the correspondence between the labels produced by different clustering algorithms. Also, no assumptions are needed at the moment about the data input: it could be represented in a non-metric space or as an N × N dissimilarity matrix. For simplicity, we use the notation y_ij = π_j(x_i) or y_i = π(x_i). The problem of clustering combination is to find a new partition π_C of the data X that summarizes the information from the gathered partitions Π. Our main goal is to construct a consensus partition without the assistance of the original patterns in X, but only from their labels Y delivered by the contributing clustering algorithms. Thus, such potentially important issues as the underlying structure of both the partitions and the data are ignored for the sake of a solution to the unsupervised consensus problem.

We emphasize that a space of new features is induced by the set Π. One can view each component partition π_i as a new feature with categorical values, i.e., cluster labels. The values assumed by the i-th new feature are simply the cluster labels from partition π_i. Therefore, membership of an object x in different partitions is treated as a new feature vector y = π(x), an H-tuple. In this case, one can consider partition π_j(x) as a feature extraction function.
Combination of clusterings becomes equivalent to the problem of clustering of H-tuples if we use only the existing clusterings {π_1, ..., π_H}, without the original features of the data X.
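For illustration, the H-tuple representation of Eq. (1) can be built with a few lines of code (a minimal sketch in Python; the helper name and the toy label alphabets are ours, not from the paper):

```python
# Build the H-tuple representation y_i = (pi_1(x_i), ..., pi_H(x_i)) of Eq. (1).
# Each partition is simply a list of arbitrary, mutually incomparable labels.

def label_matrix(partitions):
    """Stack H partitions of the same N objects into N rows of H-tuples."""
    n = len(partitions[0])
    assert all(len(p) == n for p in partitions), "all partitions must label the same N objects"
    return [tuple(p[i] for p in partitions) for i in range(n)]

# Three partitions of N = 4 objects; the label alphabets need not match.
pi_1 = [1, 1, 2, 2]
pi_2 = ['A', 'A', 'B', 'B']
pi_3 = ['X', 'Y', 'Y', 'Y']

Y = label_matrix([pi_1, pi_2, pi_3])
print(Y)  # [(1, 'A', 'X'), (1, 'A', 'Y'), (2, 'B', 'Y'), (2, 'B', 'Y')]
```

Objects 3 and 4 receive identical H-tuples, so any categorical clustering of Y will keep them together regardless of the incompatible label names used by the individual partitions.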

Hence the problem of combining partitions can be transformed to a categorical clustering problem. Such a view gives insight into the properties of the expected combination, which can be inferred through various statistical and information-theoretic techniques. In particular, one can estimate the sensitivity of the combination to the correlation of components (features) as well as analyze various sample size issues. Perhaps the main advantage of this representation is that it facilitates the use of known algorithms for categorical clustering [37, 48] and allows one to design new consensus heuristics in a transparent way. The extended representation of data X can be illustrated by a table with N rows and (d + H) columns:

          x_1   ...  x_d   |  π_1       ...  π_H
  x_1  |  x_11  ...  x_1d  |  π_1(x_1)  ...  π_H(x_1)
  x_2  |  x_21  ...  x_2d  |  π_1(x_2)  ...  π_H(x_2)
  ...
  x_N  |  x_N1  ...  x_Nd  |  π_1(x_N)  ...  π_H(x_N)
          (original features)   ("new" features)

The consensus clustering is found as a partition π_C of the set of vectors Y = {y_i} that directly translates to the partition of the underlying data points {x_i}.

3 A Mixture Model of Consensus

Our approach to the consensus problem is based on a finite mixture model for the probability of the cluster labels y = π(x) of the pattern/object x. The main assumption is that the labels y_i are modeled as random variables drawn from a probability distribution described as a mixture of multivariate component densities:

P(y_i | Θ) = ∑_{m=1}^{M} α_m P_m(y_i | θ_m),   (2)

where each component is parametrized by θ_m. The M components in the mixture are identified with the clusters of the consensus partition π_C. The mixing coefficients α_m correspond to the prior probabilities of the clusters. In this model, the data points {y_i} are presumed to be generated in two

steps: first, by drawing a component according to the probability mass function α_m, and then sampling a point from the distribution P_m(y | θ_m). All the data Y = {y_i}, i = 1, ..., N, are assumed to be independent and identically distributed. This allows one to represent the log-likelihood function for the parameters Θ = {α_1, ..., α_M, θ_1, ..., θ_M} given the data set Y as:

log L(Θ | Y) = ∑_{i=1}^{N} log P(y_i | Θ) = ∑_{i=1}^{N} log ∑_{m=1}^{M} α_m P_m(y_i | θ_m).   (3)

The objective of consensus clustering is now formulated as a maximum likelihood estimation problem. To find the best fitting mixture density for the given data Y, we must maximize the likelihood function with respect to the unknown parameters Θ:

Θ* = arg max_Θ log L(Θ | Y).   (4)

The next important step is to specify the model of component-conditional densities P_m(y | θ_m). Note that the original problem of clustering in the space of data X has been transformed, with the help of multiple clustering algorithms, to a space of new multivariate features y = π(x). To make the problem more tractable, a conditional independence assumption is made for the components of the vector y_i, namely that the conditional probability of y_i can be represented as the following product:

P_m(y_i | θ_m) = ∏_{j=1}^{H} P_m^(j)(y_ij | θ_m^(j)).   (5)

To motivate this, one can note that even if the different clustering algorithms (indexed by j) are not truly independent, the approximation by the product in Eq. (5) can be justified by the excellent performance of naive Bayes classifiers in discrete domains [34]. Our ultimate goal is to make a discrete label assignment to the data in X through an indirect route of density estimation of Y. The assignments of patterns to the clusters in π_C are much less sensitive to the conditional independence approximation than the estimated values of the probabilities P(y_i | Θ), as supported by the analysis of the naive Bayes classifier in [9].
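Under the independence assumption, the log-likelihood of Eq. (3) combined with the product of Eq. (5) is straightforward to evaluate. The sketch below is ours, not the authors' code; in particular, the nested-dictionary layout of the parameters θ is an arbitrary choice for illustration:

```python
import math

# Log-likelihood of Eq. (3) with the conditional-independence product of Eq. (5).
# theta[m][j] maps the labels of partition j to their probabilities under
# mixture component m (the multinomial parameters of Eq. (6)).

def log_likelihood(Y, alphas, theta):
    total = 0.0
    for y in Y:                                # sum over data points y_i
        mix = 0.0
        for m, alpha in enumerate(alphas):     # sum over mixture components
            p = alpha
            for j, label in enumerate(y):      # product over partitions, Eq. (5)
                p *= theta[m][j][label]
            mix += p
        total += math.log(mix)
    return total

Y = [(0, 0), (0, 0), (1, 1), (1, 1)]           # H = 2 binary partitions, N = 4
alphas = [0.5, 0.5]
theta = [
    [{0: 0.9, 1: 0.1}, {0: 0.9, 1: 0.1}],     # component 1 favors label 0
    [{0: 0.1, 1: 0.9}, {0: 0.1, 1: 0.9}],     # component 2 favors label 1
]
print(round(log_likelihood(Y, alphas, theta), 4))
```

Each point contributes log(0.5 · 0.9² + 0.5 · 0.1²) = log 0.41, so the total is 4 log 0.41; this is the quantity the EM algorithm increases at every iteration.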

The last ingredient of the mixture model is the choice of a probability density P_m^(j)(y_ij | θ_m^(j)) for the components of the vectors y_i. Since the variables y_ij take on nominal values from the set of cluster labels in the partition π_j, it is natural to view them as the outcome of a multinomial trial:

P_m^(j)(y_ij | θ_m^(j)) = ∏_{k=1}^{K(j)} ϑ_jm(k)^δ(y_ij, k).   (6)

Here, without loss of generality, the labels of the clusters in π_j are chosen to be the integers in {1, ..., K(j)}. To clarify the notation, note that the probabilities of the outcomes are defined as ϑ_jm(k) and the product is over all the possible values of y_ij, the labels of the partition π_j. Also, the probabilities sum up to one:

∑_{k=1}^{K(j)} ϑ_jm(k) = 1, for all j ∈ {1, ..., H}, m ∈ {1, ..., M}.   (7)

For example, if the j-th partition has only two clusters, and the possible labels are 0 and 1, then Eq. (6) can be simplified as:

P_m^(j)(y_ij | θ_m^(j)) = ϑ_jm^{y_ij} (1 - ϑ_jm)^{1 - y_ij}.   (8)

The maximum likelihood problem in Eq. (3) generally cannot be solved in closed form when all the parameters Θ = {α_1, ..., α_M, θ_1, ..., θ_M} are unknown. However, the likelihood function in Eq. (2) can be optimized using the EM algorithm. In order to adopt the EM algorithm, we hypothesize the existence of hidden data Z and the likelihood of the complete data (Y, Z). If the value of z_i is known, then one could immediately tell which of the M mixture components was used to generate the point y_i. The detailed derivation of the EM solution to the mixture model with multivariate, multinomial components is given in the Appendix. Here we give only the equations for the E- and M-steps, which are repeated at each iteration of the algorithm:

E[z_im] = ( α_m ∏_{j=1}^{H} ∏_{k=1}^{K(j)} ϑ_jm(k)^δ(y_ij, k) ) / ( ∑_{n=1}^{M} α_n ∏_{j=1}^{H} ∏_{k=1}^{K(j)} ϑ_jn(k)^δ(y_ij, k) ).   (9)

α_m = ∑_{i=1}^{N} E[z_im] / N.   (10)

ϑ_jm(k) = ∑_{i=1}^{N} δ(y_ij, k) E[z_im] / ∑_{i=1}^{N} ∑_{k=1}^{K(j)} δ(y_ij, k) E[z_im].   (11)

The solution to the consensus clustering problem is obtained by a simple inspection of the expected values of the variables E[z_im], due to the fact that E[z_im] represents the probability that the pattern y_i was generated by the m-th mixture component. Once convergence is achieved, a pattern y_i is assigned to the component which has the largest value for the hidden label z_i.

It is instructive to consider a simple example of an ensemble. Figure 1 shows four 2-cluster partitions of 12 two-dimensional data points. The correspondence problem is emphasized by the different label systems used by the partitions. Table 1 shows the expected values of the latent variables after 6 iterations of the EM algorithm and the resulting consensus clustering. In fact, a stable combination appears as early as the third iteration, and it corresponds to the true underlying structure of the data.

Our mixture model of consensus admits generalization for clustering ensembles with incomplete partitions. Such partitions can appear as a result of clustering of subsamples or resampling of a dataset. For example, a partition of a bootstrap sample only provides labels for the selected points. Therefore, the ensemble of such partitions is represented by a set of vectors of cluster labels with potentially missing components. Moreover, different vectors of cluster labels are likely to miss different components. Incomplete information can also arise when some clustering algorithms do not assign outliers to any of the clusters. Different clusterings in a diverse ensemble can consider the same point x an outlier or not, which results in missing components in the vector y.
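The E- and M-steps of Eqs. (9)-(11) are compact enough to sketch directly. The following Python illustration is ours, not the authors' implementation; in particular, the deterministic seeding of ϑ from the first M distinct label vectors is our own choice, since the paper leaves initialization open:

```python
import numpy as np

# EM consensus for a clustering ensemble (Eqs. 9-11).
# Y: (N, H) integer label matrix, labels 0..K(j)-1 per column; M: target clusters.

def em_consensus(Y, M, n_iter=30):
    Y = np.asarray(Y)
    N, H = Y.shape
    K = [int(Y[:, j].max()) + 1 for j in range(H)]
    alphas = np.full(M, 1.0 / M)
    # Deterministic seeding (our choice): bias component m toward the labels
    # of the m-th distinct row of Y.
    seeds = []
    for row in map(tuple, Y):
        if row not in seeds:
            seeds.append(row)
        if len(seeds) == M:
            break
    theta = [[np.ones(K[j]) for j in range(H)] for _ in range(M)]
    for m, seed in enumerate(seeds):
        for j in range(H):
            theta[m][j][seed[j]] += 1.0
    for m in range(M):
        for j in range(H):
            theta[m][j] /= theta[m][j].sum()

    for _ in range(n_iter):
        # E-step (Eq. 9): responsibilities E[z_im]
        resp = np.zeros((N, M))
        for m in range(M):
            p = np.full(N, alphas[m])
            for j in range(H):
                p = p * theta[m][j][Y[:, j]]
            resp[:, m] = p
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step (Eqs. 10-11): mixing weights and multinomial parameters
        alphas = resp.mean(axis=0)
        for m in range(M):
            for j in range(H):
                for k in range(K[j]):
                    theta[m][j][k] = resp[Y[:, j] == k, m].sum()
                theta[m][j] /= resp[:, m].sum()
    return resp.argmax(axis=1)

# Four 2-cluster partitions of 8 points; the label names are permuted across
# partitions, so no label correspondence is ever established.
Y = [[0, 1, 0, 1]] * 4 + [[1, 0, 1, 0]] * 4
labels = em_consensus(Y, M=2)
print(labels.tolist())  # one consensus label for the first four points, another for the last four
```

Note that the algorithm never matches labels across partitions: the correspondence problem is bypassed entirely, exactly as claimed for the EM consensus function.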

Figure 1: Four possible partitions of 12 data points into 2 clusters. Different partitions use different sets of labels.

Table 1: Clustering ensemble and consensus solution, listing for each point y_1, ..., y_12 its labels under π_1, π_2, π_3, π_4 (with label alphabets 1/2, A/B, X/Y and α/β respectively), the expected values E[z_i1] and E[z_i2], and the consensus label.

Yet another scenario leading to missing information can occur in clustering combination of distributed data or an ensemble of clusterings of non-identical replicas of a dataset. It is possible to apply the EM algorithm in the case of missing data [20], namely missing cluster labels for some of the data points. In these situations, each vector y_i in Y can be split into observed and missing components y_i = (y_i^obs, y_i^mis). Incorporation of the missing data leads to a slight

modification of the computation of the E and M steps. First, the expected values E[z_im | y_i^obs, Θ] are now inferred from the observed components of the vector y_i, i.e., the products in Eq. (9) are taken over the known labels only, j : y_ij ∈ y_i^obs. Additionally, one must compute the expected values E[z_im y_ij^mis | y_i^obs, Θ] and substitute them, as well as E[z_im | y_i^obs, Θ], in the M-step for re-estimation of the parameters ϑ_jm(k). More details on handling missing data can be found in [20].

Though data with missing cluster labels can be obtained in different ways, we analyze only the case when the components of y are missing completely at random [46]. This means that the probability of a component being missing does not depend on other observed or unobserved variables. Note that the outcome of clustering of data subsamples (e.g., bootstrap) is different from clustering the entire data set and then deleting a random subset of labels. However, our goal is to present a consensus function for general settings. We expect that the experimental results for ensembles with missing labels are applicable, at least qualitatively, even for a combination of bootstrap clusterings.

The proposed ensemble clustering based on the mixture model consensus algorithm is summarized below. Note that any clustering algorithm can be used to generate the ensemble instead of the k-means algorithm shown in this pseudocode:

begin
  for i = 1 to H                      // H - number of clusterings
    cluster the dataset X: π_i ← k-means(X)
    add partition π_i to the ensemble Π = {Π, π_i}
  end
  initialize model parameters Θ = {α_1, ..., α_M, θ_1, ..., θ_M}
  do until convergence criterion is satisfied
    compute expected values E[z_im], i = 1..N, m = 1..M
    compute E[z_im y_ij^mis] for missing data (if any)
    re-estimate parameters ϑ_jm(k), j = 1..H, m = 1..M, all k
  end
  π_C(x_i) = index of the component of z_i with the largest expected value, i = 1..N

  return π_C                          // consensus partition
end

The value of M, the number of components in the mixture, deserves a separate discussion that is beyond the scope of this paper. Here, we assume that the target number of clusters is predetermined. It should be noted, however, that the mixture model in unsupervised classification greatly facilitates estimation of the true number of clusters [3]. The maximum likelihood formulation of the problem specifically allows us to estimate M by using additional objective functions during the inference, such as the minimum description length of the model. In addition, the proposed consensus algorithm can be viewed as a version of Latent Class Analysis (e.g., see [4]), which has rigorous statistical means for quantifying the plausibility of a candidate mixture model.

Whereas the finite mixture model may not be valid for the patterns in the original space (the initial representation), this model more naturally explains the separation of groups of patterns in the space of extracted features (the labels generated by the partitions). It is somewhat reminiscent of classification approaches based on kernel methods, which rely on linear discriminant functions in the transformed space. For example, Support Vector Clustering [5] seeks spherical clusters after a kernel transformation, which correspond to more complex cluster shapes in the original pattern space.

4 Information-Theoretic Consensus of Clusterings

Another candidate consensus function is based on the notion of median partition. A median partition σ is the best summary of the existing partitions in Π. In contrast to the co-association approach, the median partition is derived from estimates of similarities between attributes (i.e., partitions in Π), rather than from similarities between objects. A well-known example of this approach is implemented in the COBWEB algorithm in the context of conceptual clustering [48]. The COBWEB clustering criterion estimates the partition utility, which is the sum of the category utility functions introduced by Gluck and Corter [21].
(Here attributes (features) refer to the partitions of an ensemble, while the objects refer to the original data points.) In our terms, the category utility function U(π_C, π_i) evaluates the quality of a

candidate median partition π_C = {C_1, ..., C_K} against some other partition π_i = {L_1^i, ..., L_{K(i)}^i}, with labels L_j^i for the j-th cluster:

U(π_C, π_i) = ∑_{r=1}^{K} p(C_r) ∑_{j=1}^{K(i)} p(L_j^i | C_r)^2 - ∑_{j=1}^{K(i)} p(L_j^i)^2,   (12)

with the following notation: p(C_r) = |C_r| / N, p(L_j^i) = |L_j^i| / N, and p(L_j^i | C_r) = |L_j^i ∩ C_r| / |C_r|.

The function U(π_C, π_i) assesses the agreement between two partitions as the difference between the expected number of labels of partition π_i that can be correctly predicted with the knowledge of clustering π_C and without it. The category utility function can also be written as the Goodman-Kruskal index for the contingency table between the two partitions [22, 39]. The overall utility of the partition π_C with respect to all the partitions in Π can be measured as the sum of the pairwise agreements:

U(π_C, Π) = ∑_{i=1}^{H} U(π_C, π_i).   (13)

Therefore, the best median partition should maximize the value of the overall utility:

π_C^best = arg max_{π_C} U(π_C, Π).   (14)

Importantly, Mirkin [39] has proved that maximization of the partition utility in Eq. (13) is equivalent to minimization of the square-error clustering criterion if the number of clusters K in the target partition π_C is fixed. This is somewhat surprising in that the partition utility function in Eq. (14) uses only the between-attribute similarity measure of Eq. (12), while the square-error criterion makes use of distances between objects and prototypes. A simple standardization of the categorical labels in {π_1, ..., π_H} effectively transforms them to quantitative features [39]. This allows us to compute real-valued distances and cluster centers. The transformation replaces the i-th partition π_i, which assumes K(i) values, by K(i) binary features, and standardizes each binary feature to zero mean. In other words, for each object x we can compute the values of the new features ỹ_ij(x) as follows:

ỹ_ij(x) = δ(L_j^i, π_i(x)) - p(L_j^i), for j = 1, ..., K(i), i = 1, ..., H.   (15)
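The transformation of Eq. (15) is simple to implement (our sketch; the function and variable names are illustrative). Every partition becomes K(i) zero-mean binary columns, after which ordinary Euclidean distances, and hence k-means, can be applied:

```python
# Eq. (15): replace a partition by K(i) binary indicator features,
# each shifted to zero mean over the data set.

def standardize_labels(partition):
    n = len(partition)
    clusters = sorted(set(partition))
    cols = []
    for label in clusters:
        p = partition.count(label) / n                        # p(L_j)
        cols.append([(1.0 if x == label else 0.0) - p for x in partition])
    # transpose: one row of K(i) real-valued features per object
    return [list(row) for row in zip(*cols)]

pi = ['a', 'a', 'b', 'b', 'b', 'c']
features = standardize_labels(pi)
for col in zip(*features):                                    # every column has zero mean
    assert abs(sum(col)) < 1e-12
print(features[0])  # object 1: delta('a', pi(x)) - p(L_j) for clusters a, b, c
```

Concatenating these blocks over all H partitions yields the quantitative feature space in which the k-means heuristic for the median partition operates.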

Hence, the solution of the median partition problem in Eq. (14) can be approached by the k-means clustering algorithm operating in the space of the features ỹ_ij if the number of target clusters is predetermined. We use this heuristic as a part of the empirical study of consensus functions.

Let us consider the information-theoretic approach to the median partition problem. In this framework, the quality of the consensus partition π_C is determined by the amount of information I(π_C, Π) it shares with the given partitions in Π. Strehl and Ghosh [47] suggest an objective function that is based on the classical Shannon definition of mutual information:

π_C^best = arg max_{π_C} I(π_C, Π), where I(π_C, Π) = ∑_{i=1}^{H} I(π_C, π_i),   (16)

I(π_C, π_i) = ∑_{r=1}^{K} ∑_{j=1}^{K(i)} p(C_r, L_j^i) log ( p(C_r, L_j^i) / ( p(C_r) p(L_j^i) ) ).   (17)

Again, an optimal median partition can be found by solving this optimization problem. However, it is not clear how to use these equations directly in a search for consensus. We show that another information-theoretic definition of entropy will reduce the mutual information criterion to the category utility function discussed before. We proceed from the generalized entropy of degree s for a discrete probability distribution P = (p_1, ..., p_n) [23]:

H^s(P) = (2^{1-s} - 1)^{-1} ( ∑_{i=1}^{n} p_i^s - 1 ), s > 0, s ≠ 1.   (18)

Shannon's entropy is the limit form of Eq. (18):

lim_{s→1} H^s(P) = - ∑_{i=1}^{n} p_i log p_i.   (19)

Generalized mutual information between π_C and π_i can be defined as:

I^s(π_C, π_i) = H^s(π_i) - H^s(π_i | π_C).   (20)

Quadratic entropy (s = 2) is of particular interest, since it is known to be closely related to classification error when used in the probabilistic measure of inter-class distance. When s = 2, the generalized mutual information I^s(π_C, π_i) becomes:

I^2(π_C, π_i) = -2 ∑_{j=1}^{K(i)} p(L_j^i)^2 + 2 ∑_{r=1}^{K} p(C_r) ∑_{j=1}^{K(i)} p(L_j^i | C_r)^2 = 2 U(π_C, π_i).   (21)

Therefore, generalized mutual information gives the same consensus clustering criterion as the category utility function in Eq. (13). Moreover, the traditional Gini-index measure for attribute selection also follows from Eqs. (12) and (21). In light of Mirkin's result, all these criteria are equivalent to within-cluster variance minimization after the simple label transformation. Quadratic mutual information, the mixture model and other interesting consensus functions have been used in our comparative empirical study.

5 Combination of Weak Clusterings

The previous sections addressed the problem of clusterings combination, namely how to formulate the consensus function regardless of the nature of the individual partitions in the combination. We now turn to the issue of generating different clusterings for the combination. There are several principal questions. Do we use the partitions produced by the numerous clustering algorithms available in the literature? Can we relax the requirements for the clustering components? There are several existing methods to provide diverse partitions:

1. Use different clustering algorithms, e.g., k-means, mixture of Gaussians, spectral, single-link, etc. [47].

2. Exploit the built-in randomness or different parameters of some algorithms, e.g., initializations and various values of k in the k-means algorithm [35, 15, 16].

3. Use many subsamples of the data set, such as bootstrap samples [10, 38].

These methods rely on clustering algorithms which are powerful on their own, and as such are computationally involved. We argue that it is possible to generate the partitions using weak, but less expensive, clustering algorithms and still achieve comparable or better performance. Certainly, the

key motivation is that the synergy of many such components will compensate for their weaknesses. We consider two simple clustering algorithms:

1. Clustering of the data projected to a random subspace. In the simplest case, the data is projected on a 1-dimensional subspace, a random line. The k-means algorithm clusters the projected data and gives a partition for the combination.

2. Random splitting of the data by hyperplanes. For example, a single random hyperplane would create a rather trivial clustering of d-dimensional data by cutting the hypervolume into two regions.

We will show that both approaches are capable of producing high-quality consensus clusterings in conjunction with a proper consensus function.

5.1 Splitting by Random Hyperplanes

Direct clustering by use of a random hyperplane illustrates how a reliable consensus emerges from low-informative components. The random splits approach pushes the notion of weak clustering almost to an extreme. The data set is cut by random hyperplanes dissecting the original volume of the d-dimensional space containing the points. Points separated by the hyperplanes are declared to be in different clusters. Hence, the output clusters are convex. In this situation, a co-association consensus function is appropriate, since the only information needed is whether the patterns are in the same cluster or not. Thus the contribution of a hyperplane partition to the co-association value for any pair of objects can be either 0 or 1. Finer resolutions of distance are possible by counting the number of hyperplanes separating the objects, but for simplicity we do not use them here.

Consider a random line dissecting the classic 2-spiral data shown in Fig. 2(a). While any one such partition does little to reveal the true underlying clusters, analysis of the hyperplane generating mechanism shows how multiple such partitions can discover the true clusters.

Figure 2. Clustering by a random hyperplane: (a) an example of splitting the 2-spiral data set by a random line; points on the same side of the line are in the same cluster. (b) Probability of splitting two one-dimensional objects for different numbers of random thresholds (1-4 planes), as a function of the distance between the objects.

Consider first the case of one-dimensional data. Splitting of objects in 1-dimensional space is done by a random threshold in R. In general, if r thresholds are randomly selected, then (r + 1) clusters are formed. It is easy to derive that, in 1-dimensional space, the probability of separating two objects whose inter-point distance is x is exactly:

P(split) = 1 - (1 - x/L)^r,   (22)

where L is the length of the interval containing the objects, and the r threshold points are drawn at random from a uniform distribution on this interval. Fig. 2(b) illustrates the dependence for L = 1 and r = 1, 2, 3, 4. If a co-association matrix is used to combine H different partitions, then the expected value of the co-association between two objects is H(1 - P(split)), which follows from the binomial distribution of the number of splits in H attempts. Therefore, the co-association values found after combining many random split partitions are generally expected to be a non-linear and monotonic function of the respective distances.

The situation is similar for multidimensional data; however, the generation of random hyperplanes is a bit more complex. To generate a random hyperplane in d dimensions, we should first draw a random point in the multidimensional region that will serve as a point of origin. Then we randomly choose a unit normal vector u that defines the hyperplane. Two objects characterized by vectors p and q will be in the same cluster if (u·p)(u·q) > 0 and will be separated otherwise (here a·b denotes the scalar product of a and b). If r hyperplanes are generated, then the total probability that two objects remain in the same cluster is just the product of the

probabilities that each of the hyperplanes does not split the objects. Thus we can expect that the law governing the co-association values is close to what is obtained in 1-dimensional space in Eq. (22).

Let us compare the actual dependence of co-association values with the function in Eq. (22). Fig. 3 shows the results of experiments with 1000 different partitions by random splits of the Iris data set. The Iris data is 4-dimensional and contains 150 points; there are 11,175 pair-wise distances between the data items. For all possible pairs of points, each plot in Fig. 3 shows the number of times a pair was split. The observed dependence of the inter-point distances derived from the co-association values vs. the true Euclidean distance can indeed be described by the function in Eq. (22). Clearly, the inter-point distances dictate the behavior of the respective co-association values. The probability of a cut between any two given objects does not depend on the other objects in the data set. Therefore, we can conclude that any clustering algorithm that works well with the original inter-point distances is also expected to work well with co-association values obtained from a combination of multiple partitions by random splits. However, this result is mostly of theoretical value when true distances are available, since they can then be used directly instead of co-association values. It illustrates the main idea of the approach, namely that the synergy of multiple weak clusterings can be very effective. We present an empirical study of the clustering quality of this algorithm in the experimental section.

5.2 Combination of Clusterings in Random Subspaces

Random subspaces are an excellent source of clustering diversity that provides different views of the data. Projective clustering is an active topic in data mining. For example, algorithms such as CLIQUE [2] and DOC [42] can discover both useful projections as well as data clusters. Here, however, we are only concerned with the use of random projections for the purpose of clustering combination.
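Returning briefly to the 1-dimensional splitting law: Eq. (22) is easy to check numerically. A minimal Monte Carlo sketch, where the interval, object positions, and seed are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
L, r, trials = 1.0, 3, 200_000
p, q = 0.3, 0.7                  # two 1-d objects at distance x = 0.4
x = q - p
# r random thresholds per trial, uniform on [0, L]; the pair is split
# when at least one threshold falls between the two objects
t = rng.uniform(0.0, L, size=(trials, r))
split = ((t > p) & (t < q)).any(axis=1)
empirical = split.mean()
theoretical = 1.0 - (1.0 - x / L) ** r   # Eq. (22)
```

With this many trials the empirical split frequency agrees closely with 1 - (1 - x/L)^r.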

Figure 3. Dependence of distances derived from the co-association values vs. the actual Euclidean distance x for each possible pair of objects in the Iris data. Co-association matrices were computed for different numbers of hyperplanes r = 1,2,3,4.

Each random subspace can be of very low dimension and is by itself somewhat uninformative. On the other hand, clustering in 1-dimensional space is computationally cheap and can be effectively performed by the k-means algorithm. The main subroutine of the k-means algorithm, distance computation, becomes d times faster in 1-dimensional space. The cost of projection is linear with respect to the sample size and number of dimensions, O(Nd), and is less than the cost of one k-means iteration. The main idea of our approach is to generate multiple partitions by projecting the data on a random line. A fast and simple algorithm such as k-means clusters the projected data, and the resulting partition becomes a component in the combination. Afterwards, a chosen consensus function is applied to the components. We discuss and compare several consensus functions in the experimental section.

It is instructive to consider a simple 2-dimensional data set and one of its projections, as illustrated in Fig. 4(a). There are two natural clusters in the data. This data looks the same in any 1-dimensional projection, but the actual distribution of points is different in different clusters in the projected subspace.

Figure 4. Projecting data on a random line: (a) A sample data set with two identifiable natural clusters and a line randomly selected for projection. (b) Histogram of the distribution of points resulting from projection of the data onto a random line.

For example, Fig. 4(b) shows one possible histogram distribution of points in a 1-dimensional projection of this data. There are three identifiable modes, each having a clear majority of points from one of the two classes. One can expect that clustering by the k-means algorithm will reliably separate at least a portion of the points from the outer ring cluster. It is easy to imagine that projection of the data in Fig. 4(a) onto another random line would result in a different distribution of points and different label assignments, but for this particular data set it will always appear as a mixture of three bell-shaped components. Most probably, these modes will be identified as clusters by the k-means algorithm. Thus each new 1-dimensional view correctly helps to group some data points, and accumulation of multiple views eventually should result in a correct combined clustering. The major steps for combining the clusterings using random 1-d projections are described by the following procedure:

begin
  for i = 1 to H    // H is the number of clusterings in the combination
    generate a random vector u, s.t. |u| = 1
    project all data points {x_j}: {y_j} <- {u x_j}, j = 1..N
    cluster projections {y_j}: pi(i) <- k-means({y_j})
  end
  combine clusterings via a consensus function: sigma <- consensus({pi(i)}, i = 1..H)

  return sigma    // consensus partition
end

The important parameter is the number of clusters in the component partitions pi returned by the k-means algorithm at each iteration, i.e. the value of k. If the value of k is too large, then the partitions {pi} will overfit the data set, which in turn may cause unreliability of the co-association values. Too small a number of clusters in {pi} may not be enough to capture the true structure of the data set. In addition, if the number of clusterings in the combination is too small, then the effective sample size for the estimates of distances from co-association values is also insufficient, resulting in a larger variance of the estimates. That is why consensus functions based on the co-association values are more sensitive to the number of partitions in the combination (the value of H) than consensus functions based on hypergraph algorithms.

6 Empirical Study

The experiments were conducted with artificial and real-world datasets, where the true natural clusters are known, to validate both the accuracy and the robustness of consensus via the mixture model. We explored the datasets using five different consensus functions.

6.1 Datasets. Table 2 summarizes the details of the datasets. Five datasets of different nature have been used in the experiments. The Biochemical and Galaxy data sets are described in [] and [40], respectively.

Table 2: Characteristics of the datasets.
Dataset      No. of features   No. of classes   No. of points/class   Total no. of points   Av. k-means error (%)
Biochem
Galaxy
2 spirals
Half-rings
Iris
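The procedure of Section 5.2 — random 1-d projections, k-means on each projection, and an average-link consensus over the co-association matrix — can be sketched end-to-end. The toy blob data, the tiny 1-d k-means, and all names below are our own assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def kmeans_1d(y, k, rng, iters=20):
    """Minimal Lloyd's k-means for 1-d data (illustration only)."""
    centers = rng.choice(y, size=k, replace=False)
    labels = np.zeros(len(y), dtype=int)
    for _ in range(iters):
        labels = np.abs(y[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = y[labels == j].mean()
    return labels

rng = np.random.default_rng(2)
# toy data: two well-separated Gaussian blobs in 5 dimensions
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(8, 1, (50, 5))])
truth = np.repeat([0, 1], 50)

H, k = 50, 3                       # ensemble size, component resolution
co = np.zeros((len(X), len(X)))
for _ in range(H):
    u = rng.normal(size=X.shape[1])
    u /= np.linalg.norm(u)         # random projection line
    labels = kmeans_1d(X @ u, k, rng)
    co += labels[:, None] == labels[None, :]   # co-association counts
dist = 1.0 - co / H                # co-association -> distance
Z = linkage(squareform(dist), method='average')  # average-link consensus
consensus = fcluster(Z, t=2, criterion='maxclust') - 1
# agreement with the true blobs, up to a 2-cluster label permutation
agree = max((consensus == truth).mean(), (consensus != truth).mean())
```

On well-separated toy blobs the consensus typically recovers the two groups even though each 1-d component partition is weak; on real data the choice of k and H matters, as discussed above.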

We evaluated the performance of the evidence accumulation clustering algorithms by matching the detected and the known partitions of the datasets. The best possible matching of clusters provides a measure of performance expressed as the misassignment rate. To determine the clustering error, one needs to solve the correspondence problem between the labels of the known and derived clusters. The optimal correspondence can be obtained using the Hungarian method for the minimal-weight bipartite matching problem, with O(k^3) complexity for k clusters.

6.2 Selection of Parameters and Algorithms. The accuracy of the QMI and EM consensus algorithms has been compared to six other consensus functions:

1. CSPA for partitioning of hypergraphs induced from the co-association values. Its complexity is O(N^2), which leads to severe computational limitations. We did not apply this algorithm to the Galaxy [40] and Biochemical [] data. For the same reason, we did not use other co-association methods, such as single-link clustering. The performance of these methods was already analyzed in [4,5].
2. HGPA for hypergraph partitioning.
3. MCLA, which modifies HGPA via an extended set of hyperedge operations and additional heuristics.
4. Consensus functions operating on the co-association matrix, but with three different hierarchical clustering algorithms for obtaining the final partition, namely single-linkage, average-linkage, and complete-linkage.

The first three methods (CSPA, HGPA and MCLA) were introduced in [47] and their code is available online. The k-means algorithm was used as the method of generating the partitions for the combination. Diversity of the partitions is ensured by the solutions obtained after random initialization of the algorithm. The following parameters of the clustering ensemble are especially important:

1. H, the number of combined clusterings. We varied this value in the range [5..50].

2. k, the number of clusters in the component clusterings {pi_1,...,pi_H} produced by the k-means algorithm, was taken in the range [2..10].
3. r, the number of hyperplanes used for obtaining clusterings {pi_1,...,pi_H} by the random splitting algorithm.

Figure 5: The 2 spirals and Half-rings datasets are difficult for any centroid-based clustering algorithm.

Both the EM and QMI algorithms are susceptible to the presence of local minima of their objective functions. To reduce the risk of convergence to a lower quality solution, we used a simple heuristic afforded by the low computational complexities of these algorithms: the final partition was picked from the results of three runs (with random initializations) according to the value of the objective function. The highest value of the likelihood function served as the criterion for the EM algorithm, and within-cluster variance as the criterion for the QMI algorithm.

6.3 Experiments with Complete Partitions. Only the main results for each of the datasets are presented in Tables 3-7, due to space limitations. The tables report the mean error rate (%) of clustering combination from 10 independent runs for the relatively large biochemical and astronomical data sets, and from 20 runs for the other, smaller datasets. The first observation is that none of the consensus functions is the absolute winner. Good performance was achieved by different combination algorithms across the values of the parameters k and H. The EM algorithm slightly outperforms the other algorithms for ensembles of smaller size, while MCLA is superior when the number of clusterings H > 20. However, ensembles of very large size are less important in practice. All co-association methods are usually unreliable with the number of

clusterings H < 50, and this is where we position the proposed EM algorithm. Both the EM and QMI consensus functions need to estimate at least kHM parameters. Therefore, accuracy degradation will inevitably occur with an increase in the number of partitions when the sample size is fixed. However, there was no noticeable decrease in the accuracy of the EM algorithm in the current experiments. The EM algorithm also should benefit from datasets of large size, due to the improved reliability of model parameter estimation. A valuable property of the EM consensus algorithm is its fast convergence rate: mixture model parameter estimates nearly always converged in fewer than 10 iterations for all the datasets, and pattern assignments were typically settled in 4-6 iterations.

Clustering combination accuracy also depends on the number of clusters M in the ensemble partitions, or more precisely, on its ratio to the target number of clusters, i.e. k/M. For example, the EM algorithm worked best with k=3 for the Iris dataset, k=3,4 for the Galaxy dataset and k=2 for the Half-rings data. These values of k are equal to or slightly greater than the number of clusters in the combined partition. In contrast, the accuracy of MCLA slightly improves with an increase in the number of clusters in the ensemble. Figure 7 shows the error as a function of k for different consensus functions on the Galaxy data. It is also interesting to note that, as expected, the average error of consensus clustering was lower than the average error of the k-means clusterings in the ensemble (Table 2) when k is chosen to be equal to the true number of clusters. Moreover, the clustering error obtained by the EM and MCLA algorithms with k=4 for the Biochemistry data [] was the same as that found by supervised classifiers applied to this dataset [45].

6.4 Experiments with Incomplete Partitions. This set of experiments focused on the dependence of clustering accuracy on the number of patterns with missing cluster labels. As before, an ensemble of partitions was generated using the k-means algorithm.
Then, we randomly deleted cluster labels for a fixed number of patterns in each of the partitions. The EM consensus algorithm was used on

such an ensemble. The number of missing labels in each partition was varied between 10% and 50% of the total number of patterns. The main results, averaged over 10 independent runs, are reported in Table 8 for the Galaxy and Biochemistry datasets for various values of H and k. Also, a typical dependence of the error on the number of patterns with missing data is shown for the Iris data in Figure 6 (H=5, k=3). One can note that the combination accuracy decreases only insignificantly for the Biochemistry data when up to 50% of the labels are missing. This can be explained by the low inherent accuracy for this data, leaving little room for further degradation. For the Galaxy data, the accuracy drops by almost 10% when k=3,4. However, when just 10-20% of the cluster labels are missing, there is only a small change in accuracy. Also, with different values of k, we see different sensitivity of the results to the missing labels. For example, with k=2, the accuracy drops by only slightly more than 1%. Ensembles of larger size H=10 suffered less from missing data than ensembles of size H=5.

Table 3: Mean error rate (%) for the Galaxy dataset, by type of consensus function (EM, QMI, HGPA, MCLA) and values of H and k.

Figure 6: Consensus clustering error rate as a function of the number of missing labels in the ensemble for the Iris dataset, H=5, k=3.

6.5 Results of the Random Subspaces Algorithm

Let us start by demonstrating how the combination of clusterings in projected 1-dimensional subspaces outperforms the combination of clusterings in the original multidimensional space. Fig. 8(a) shows the learning dynamics for the Iris data and k=4, using the average-link consensus function based

on co-association values. Note that the number of clusters in each of the components {pi_1,...,pi_H} is set to k=4, and is different from the true number of clusters (3). Clearly, each individual clustering in the full multidimensional space is much stronger than any 1-dim partition, and therefore with only a small number of partitions (H<50) the combination of weaker partitions is not yet effective. However, for larger numbers of combined partitions (H>50), the 1-dim projections together better reveal the true structure of the data. This is quite unexpected, since the k-means algorithm with k=3 makes, on average, 9 mistakes in the original 4-dim space and 25 mistakes in a 1-dim random subspace. Moreover, clustering in the projected subspace is d times faster than in the multidimensional space, although the cost of computing a consensus partition sigma is the same in both cases.

The results regarding the impact of the value of k are reported in Fig. 8(b), which shows that there is a critical value of k for the Iris data set when the average-linkage of co-association distances is used as the consensus function: the value k=2 is not adequate to separate the true clusters. The role of the consensus function is illustrated in Fig. 9, where three consensus functions are compared on the Iris data set. They all use similarities from the co-association matrix but cluster the objects using three different criterion functions, namely single link, average link and complete link. It is clear that the combination using single-link performs significantly worse than the other two consensus functions. This is expected, since the three classes in the Iris data have hyperellipsoidal shape. More results were obtained on the half-rings and 2 spirals data sets shown in Fig. 5, which are traditionally difficult for any partitional centroid-based algorithm. Table 9 reports the error rates for the 2 spirals data using seven different consensus functions, different numbers of component partitions H = [5..500] and different numbers of clusters in each component, k = 2,4,10.
We omit similar results for the half-rings data set, obtained under the same experimental conditions and some intermediate values of k, due to space limitations. As we see, the single-link consensus function performed the best and was able to identify both the half-rings clusters as well as the spirals. In contrast to the results for the Iris data, average-link and complete-link consensus were not suitable for these data sets.
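All the error rates reported above rely on the optimal label matching described at the beginning of Section 6. A minimal sketch of that misassignment-rate computation using SciPy's Hungarian solver — the function name and toy labels are ours; the paper does not specify an implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def misassignment_rate(true_labels, found_labels):
    """Error after optimally matching found clusters to true clusters
    (Hungarian method on the negated contingency table)."""
    t, f = np.unique(true_labels), np.unique(found_labels)
    cont = np.array([[np.sum((true_labels == a) & (found_labels == b))
                      for b in f] for a in t])
    rows, cols = linear_sum_assignment(-cont)  # maximize matched mass
    return 1.0 - cont[rows, cols].sum() / len(true_labels)

truth = np.array([0, 0, 0, 1, 1, 1, 2, 2])
found = np.array([2, 2, 2, 0, 0, 1, 1, 1])
err = misassignment_rate(truth, found)   # one point of eight mismatched
```

The O(k^3) cost quoted in the text refers to this assignment step; linear_sum_assignment solves the same minimal-weight bipartite matching problem.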

Table 4: Mean error rate (%) for the Biochemistry dataset, by type of consensus function (EM, QMI, MCLA) and values of H and k.

Table 5: Mean error rate (%) for the Half-rings dataset, by type of consensus function (EM, QMI, CSPA, HGPA, MCLA) and values of H and k.

Table 6: Mean error rate (%) for the 2-spirals dataset, by type of consensus function (EM, QMI, CSPA, HGPA, MCLA) and values of H and k.

Figure 7: Consensus error as a function of the number of clusters in the contributing partitions for the Galaxy data and ensemble size H=20.

Table 7: Mean error rate (%) for the Iris dataset, by type of consensus function (EM, QMI, CSPA, HGPA, MCLA) and values of H and k.

Table 8: Clustering error rate (%) of the EM algorithm as a function of the percentage of missing labels for the large datasets ("Galaxy" and "Biochem."), for various values of H and k.


Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting

Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting Causal, Explanatory Forecastng Assumes cause-and-effect relatonshp between system nputs and ts output Forecastng wth Regresson Analyss Rchard S. Barr Inputs System Cause + Effect Relatonshp The job of

More information

BERNSTEIN POLYNOMIALS

BERNSTEIN POLYNOMIALS On-Lne Geometrc Modelng Notes BERNSTEIN POLYNOMIALS Kenneth I. Joy Vsualzaton and Graphcs Research Group Department of Computer Scence Unversty of Calforna, Davs Overvew Polynomals are ncredbly useful

More information

Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification

Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification Lecture 4: More classfers and classes C4B Machne Learnng Hlary 20 A. Zsserman Logstc regresson Loss functons revsted Adaboost Loss functons revsted Optmzaton Multple class classfcaton Logstc Regresson

More information

Performance Analysis and Coding Strategy of ECOC SVMs

Performance Analysis and Coding Strategy of ECOC SVMs Internatonal Journal of Grd and Dstrbuted Computng Vol.7, No. (04), pp.67-76 http://dx.do.org/0.457/jgdc.04.7..07 Performance Analyss and Codng Strategy of ECOC SVMs Zhgang Yan, and Yuanxuan Yang, School

More information

Can Auto Liability Insurance Purchases Signal Risk Attitude?

Can Auto Liability Insurance Purchases Signal Risk Attitude? Internatonal Journal of Busness and Economcs, 2011, Vol. 10, No. 2, 159-164 Can Auto Lablty Insurance Purchases Sgnal Rsk Atttude? Chu-Shu L Department of Internatonal Busness, Asa Unversty, Tawan Sheng-Chang

More information

Enterprise Master Patient Index

Enterprise Master Patient Index Enterprse Master Patent Index Healthcare data are captured n many dfferent settngs such as hosptals, clncs, labs, and physcan offces. Accordng to a report by the CDC, patents n the Unted States made an

More information

Enabling P2P One-view Multi-party Video Conferencing

Enabling P2P One-view Multi-party Video Conferencing Enablng P2P One-vew Mult-party Vdeo Conferencng Yongxang Zhao, Yong Lu, Changja Chen, and JanYn Zhang Abstract Mult-Party Vdeo Conferencng (MPVC) facltates realtme group nteracton between users. Whle P2P

More information

Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm

Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm Document Clusterng Analyss Based on Hybrd PSO+K-means Algorthm Xaohu Cu, Thomas E. Potok Appled Software Engneerng Research Group, Computatonal Scences and Engneerng Dvson, Oak Rdge Natonal Laboratory,

More information

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S S C H E D A E I N F O R M A T I C A E VOLUME 0 0 On Mean Squared Error of Herarchcal Estmator Stans law Brodowsk Faculty of Physcs, Astronomy, and Appled Computer Scence, Jagellonan Unversty, Reymonta

More information

Credit Limit Optimization (CLO) for Credit Cards

Credit Limit Optimization (CLO) for Credit Cards Credt Lmt Optmzaton (CLO) for Credt Cards Vay S. Desa CSCC IX, Ednburgh September 8, 2005 Copyrght 2003, SAS Insttute Inc. All rghts reserved. SAS Propretary Agenda Background Tradtonal approaches to credt

More information

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation Exhaustve Regresson An Exploraton of Regresson-Based Data Mnng Technques Usng Super Computaton Antony Daves, Ph.D. Assocate Professor of Economcs Duquesne Unversty Pttsburgh, PA 58 Research Fellow The

More information

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression Novel Methodology of Workng Captal Management for Large Publc Constructons by Usng Fuzzy S-curve Regresson Cheng-Wu Chen, Morrs H. L. Wang and Tng-Ya Hseh Department of Cvl Engneerng, Natonal Central Unversty,

More information

Minimal Coding Network With Combinatorial Structure For Instantaneous Recovery From Edge Failures

Minimal Coding Network With Combinatorial Structure For Instantaneous Recovery From Edge Failures Mnmal Codng Network Wth Combnatoral Structure For Instantaneous Recovery From Edge Falures Ashly Joseph 1, Mr.M.Sadsh Sendl 2, Dr.S.Karthk 3 1 Fnal Year ME CSE Student Department of Computer Scence Engneerng

More information

Joint Scheduling of Processing and Shuffle Phases in MapReduce Systems

Joint Scheduling of Processing and Shuffle Phases in MapReduce Systems Jont Schedulng of Processng and Shuffle Phases n MapReduce Systems Fangfe Chen, Mural Kodalam, T. V. Lakshman Department of Computer Scence and Engneerng, The Penn State Unversty Bell Laboratores, Alcatel-Lucent

More information

Brigid Mullany, Ph.D University of North Carolina, Charlotte

Brigid Mullany, Ph.D University of North Carolina, Charlotte Evaluaton And Comparson Of The Dfferent Standards Used To Defne The Postonal Accuracy And Repeatablty Of Numercally Controlled Machnng Center Axes Brgd Mullany, Ph.D Unversty of North Carolna, Charlotte

More information

Project Networks With Mixed-Time Constraints

Project Networks With Mixed-Time Constraints Project Networs Wth Mxed-Tme Constrants L Caccetta and B Wattananon Western Australan Centre of Excellence n Industral Optmsaton (WACEIO) Curtn Unversty of Technology GPO Box U1987 Perth Western Australa

More information

CHAPTER 14 MORE ABOUT REGRESSION

CHAPTER 14 MORE ABOUT REGRESSION CHAPTER 14 MORE ABOUT REGRESSION We learned n Chapter 5 that often a straght lne descrbes the pattern of a relatonshp between two quanttatve varables. For nstance, n Example 5.1 we explored the relatonshp

More information

+ + + - - This circuit than can be reduced to a planar circuit

+ + + - - This circuit than can be reduced to a planar circuit MeshCurrent Method The meshcurrent s analog of the nodeoltage method. We sole for a new set of arables, mesh currents, that automatcally satsfy KCLs. As such, meshcurrent method reduces crcut soluton to

More information

Traffic State Estimation in the Traffic Management Center of Berlin

Traffic State Estimation in the Traffic Management Center of Berlin Traffc State Estmaton n the Traffc Management Center of Berln Authors: Peter Vortsch, PTV AG, Stumpfstrasse, D-763 Karlsruhe, Germany phone ++49/72/965/35, emal peter.vortsch@ptv.de Peter Möhl, PTV AG,

More information

Risk Model of Long-Term Production Scheduling in Open Pit Gold Mining

Risk Model of Long-Term Production Scheduling in Open Pit Gold Mining Rsk Model of Long-Term Producton Schedulng n Open Pt Gold Mnng R Halatchev 1 and P Lever 2 ABSTRACT Open pt gold mnng s an mportant sector of the Australan mnng ndustry. It uses large amounts of nvestments,

More information

DEFINING %COMPLETE IN MICROSOFT PROJECT

DEFINING %COMPLETE IN MICROSOFT PROJECT CelersSystems DEFINING %COMPLETE IN MICROSOFT PROJECT PREPARED BY James E Aksel, PMP, PMI-SP, MVP For Addtonal Informaton about Earned Value Management Systems and reportng, please contact: CelersSystems,

More information

Learning from Multiple Outlooks

Learning from Multiple Outlooks Learnng from Multple Outlooks Maayan Harel Department of Electrcal Engneerng, Technon, Hafa, Israel She Mannor Department of Electrcal Engneerng, Technon, Hafa, Israel maayanga@tx.technon.ac.l she@ee.technon.ac.l

More information

A DATA MINING APPLICATION IN A STUDENT DATABASE

A DATA MINING APPLICATION IN A STUDENT DATABASE JOURNAL OF AERONAUTICS AND SPACE TECHNOLOGIES JULY 005 VOLUME NUMBER (53-57) A DATA MINING APPLICATION IN A STUDENT DATABASE Şenol Zafer ERDOĞAN Maltepe Ünversty Faculty of Engneerng Büyükbakkalköy-Istanbul

More information

Evaluating credit risk models: A critique and a new proposal

Evaluating credit risk models: A critique and a new proposal Evaluatng credt rsk models: A crtque and a new proposal Hergen Frerchs* Gunter Löffler Unversty of Frankfurt (Man) February 14, 2001 Abstract Evaluatng the qualty of credt portfolo rsk models s an mportant

More information

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(7):1884-1889 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A hybrd global optmzaton algorthm based on parallel

More information

THE METHOD OF LEAST SQUARES THE METHOD OF LEAST SQUARES

THE METHOD OF LEAST SQUARES THE METHOD OF LEAST SQUARES The goal: to measure (determne) an unknown quantty x (the value of a RV X) Realsaton: n results: y 1, y 2,..., y j,..., y n, (the measured values of Y 1, Y 2,..., Y j,..., Y n ) every result s encumbered

More information

Solving Factored MDPs with Continuous and Discrete Variables

Solving Factored MDPs with Continuous and Discrete Variables Solvng Factored MPs wth Contnuous and screte Varables Carlos Guestrn Berkeley Research Center Intel Corporaton Mlos Hauskrecht epartment of Computer Scence Unversty of Pttsburgh Branslav Kveton Intellgent

More information

Calculating the high frequency transmission line parameters of power cables

Calculating the high frequency transmission line parameters of power cables < ' Calculatng the hgh frequency transmsson lne parameters of power cables Authors: Dr. John Dcknson, Laboratory Servces Manager, N 0 RW E B Communcatons Mr. Peter J. Ncholson, Project Assgnment Manager,

More information

Inter-Ing 2007. INTERDISCIPLINARITY IN ENGINEERING SCIENTIFIC INTERNATIONAL CONFERENCE, TG. MUREŞ ROMÂNIA, 15-16 November 2007.

Inter-Ing 2007. INTERDISCIPLINARITY IN ENGINEERING SCIENTIFIC INTERNATIONAL CONFERENCE, TG. MUREŞ ROMÂNIA, 15-16 November 2007. Inter-Ing 2007 INTERDISCIPLINARITY IN ENGINEERING SCIENTIFIC INTERNATIONAL CONFERENCE, TG. MUREŞ ROMÂNIA, 15-16 November 2007. UNCERTAINTY REGION SIMULATION FOR A SERIAL ROBOT STRUCTURE MARIUS SEBASTIAN

More information

Combinatorial Agency of Threshold Functions

Combinatorial Agency of Threshold Functions Combnatoral Agency of Threshold Functons Shal Jan Computer Scence Department Yale Unversty New Haven, CT 06520 shal.jan@yale.edu Davd C. Parkes School of Engneerng and Appled Scences Harvard Unversty Cambrdge,

More information

INVESTIGATION OF VEHICULAR USERS FAIRNESS IN CDMA-HDR NETWORKS

INVESTIGATION OF VEHICULAR USERS FAIRNESS IN CDMA-HDR NETWORKS 21 22 September 2007, BULGARIA 119 Proceedngs of the Internatonal Conference on Informaton Technologes (InfoTech-2007) 21 st 22 nd September 2007, Bulgara vol. 2 INVESTIGATION OF VEHICULAR USERS FAIRNESS

More information

AN APPOINTMENT ORDER OUTPATIENT SCHEDULING SYSTEM THAT IMPROVES OUTPATIENT EXPERIENCE

AN APPOINTMENT ORDER OUTPATIENT SCHEDULING SYSTEM THAT IMPROVES OUTPATIENT EXPERIENCE AN APPOINTMENT ORDER OUTPATIENT SCHEDULING SYSTEM THAT IMPROVES OUTPATIENT EXPERIENCE Yu-L Huang Industral Engneerng Department New Mexco State Unversty Las Cruces, New Mexco 88003, U.S.A. Abstract Patent

More information

HowHow to Find the Best Online Stock Broker

HowHow to Find the Best Online Stock Broker A GENERAL APPROACH FOR SECURITY MONITORING AND PREVENTIVE CONTROL OF NETWORKS WITH LARGE WIND POWER PRODUCTION Helena Vasconcelos INESC Porto hvasconcelos@nescportopt J N Fdalgo INESC Porto and FEUP jfdalgo@nescportopt

More information

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION

NEURO-FUZZY INFERENCE SYSTEM FOR E-COMMERCE WEBSITE EVALUATION NEURO-FUZZY INFERENE SYSTEM FOR E-OMMERE WEBSITE EVALUATION Huan Lu, School of Software, Harbn Unversty of Scence and Technology, Harbn, hna Faculty of Appled Mathematcs and omputer Scence, Belarusan State

More information

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6 PAR TESTS If a WEIGHT varable s specfed, t s used to replcate a case as many tmes as ndcated by the weght value rounded to the nearest nteger. If the workspace requrements are exceeded and samplng has

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Dstrb. Comput. 71 (2011) 62 76 Contents lsts avalable at ScenceDrect J. Parallel Dstrb. Comput. journal homepage: www.elsever.com/locate/jpdc Optmzng server placement n dstrbuted systems n

More information

Adaptive Fractal Image Coding in the Frequency Domain

Adaptive Fractal Image Coding in the Frequency Domain PROCEEDINGS OF INTERNATIONAL WORKSHOP ON IMAGE PROCESSING: THEORY, METHODOLOGY, SYSTEMS AND APPLICATIONS 2-22 JUNE,1994 BUDAPEST,HUNGARY Adaptve Fractal Image Codng n the Frequency Doman K AI UWE BARTHEL

More information

An MILP model for planning of batch plants operating in a campaign-mode

An MILP model for planning of batch plants operating in a campaign-mode An MILP model for plannng of batch plants operatng n a campagn-mode Yanna Fumero Insttuto de Desarrollo y Dseño CONICET UTN yfumero@santafe-concet.gov.ar Gabrela Corsano Insttuto de Desarrollo y Dseño

More information

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña Proceedngs of the 2008 Wnter Smulaton Conference S. J. Mason, R. R. Hll, L. Mönch, O. Rose, T. Jefferson, J. W. Fowler eds. A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION

More information

Lecture 5,6 Linear Methods for Classification. Summary

Lecture 5,6 Linear Methods for Classification. Summary Lecture 5,6 Lnear Methods for Classfcaton Rce ELEC 697 Farnaz Koushanfar Fall 2006 Summary Bayes Classfers Lnear Classfers Lnear regresson of an ndcator matrx Lnear dscrmnant analyss (LDA) Logstc regresson

More information

1. Measuring association using correlation and regression

1. Measuring association using correlation and regression How to measure assocaton I: Correlaton. 1. Measurng assocaton usng correlaton and regresson We often would lke to know how one varable, such as a mother's weght, s related to another varable, such as a

More information

Course outline. Financial Time Series Analysis. Overview. Data analysis. Predictive signal. Trading strategy

Course outline. Financial Time Series Analysis. Overview. Data analysis. Predictive signal. Trading strategy Fnancal Tme Seres Analyss Patrck McSharry patrck@mcsharry.net www.mcsharry.net Trnty Term 2014 Mathematcal Insttute Unversty of Oxford Course outlne 1. Data analyss, probablty, correlatons, vsualsaton

More information

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP)

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP) 6.3 / -- Communcaton Networks II (Görg) SS20 -- www.comnets.un-bremen.de Communcaton Networks II Contents. Fundamentals of probablty theory 2. Emergence of communcaton traffc 3. Stochastc & Markovan Processes

More information

POLYSA: A Polynomial Algorithm for Non-binary Constraint Satisfaction Problems with and

POLYSA: A Polynomial Algorithm for Non-binary Constraint Satisfaction Problems with and POLYSA: A Polynomal Algorthm for Non-bnary Constrant Satsfacton Problems wth and Mguel A. Saldo, Federco Barber Dpto. Sstemas Informátcos y Computacón Unversdad Poltécnca de Valenca, Camno de Vera s/n

More information

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending Proceedngs of 2012 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 25 (2012) (2012) IACSIT Press, Sngapore Bayesan Network Based Causal Relatonshp Identfcaton and Fundng Success

More information

Efficient Project Portfolio as a tool for Enterprise Risk Management

Efficient Project Portfolio as a tool for Enterprise Risk Management Effcent Proect Portfolo as a tool for Enterprse Rsk Management Valentn O. Nkonov Ural State Techncal Unversty Growth Traectory Consultng Company January 5, 27 Effcent Proect Portfolo as a tool for Enterprse

More information

Damage detection in composite laminates using coin-tap method

Damage detection in composite laminates using coin-tap method Damage detecton n composte lamnates usng con-tap method S.J. Km Korea Aerospace Research Insttute, 45 Eoeun-Dong, Youseong-Gu, 35-333 Daejeon, Republc of Korea yaeln@kar.re.kr 45 The con-tap test has the

More information

320 The Internatonal Arab Journal of Informaton Technology, Vol. 5, No. 3, July 2008 Comparsons Between Data Clusterng Algorthms Osama Abu Abbas Computer Scence Department, Yarmouk Unversty, Jordan Abstract:

More information

Study on Model of Risks Assessment of Standard Operation in Rural Power Network

Study on Model of Risks Assessment of Standard Operation in Rural Power Network Study on Model of Rsks Assessment of Standard Operaton n Rural Power Network Qngj L 1, Tao Yang 2 1 Qngj L, College of Informaton and Electrcal Engneerng, Shenyang Agrculture Unversty, Shenyang 110866,

More information

Multiple-Period Attribution: Residuals and Compounding

Multiple-Period Attribution: Residuals and Compounding Multple-Perod Attrbuton: Resduals and Compoundng Our revewer gave these authors full marks for dealng wth an ssue that performance measurers and vendors often regard as propretary nformaton. In 1994, Dens

More information

Lecture 2: Single Layer Perceptrons Kevin Swingler

Lecture 2: Single Layer Perceptrons Kevin Swingler Lecture 2: Sngle Layer Perceptrons Kevn Sngler kms@cs.str.ac.uk Recap: McCulloch-Ptts Neuron Ths vastly smplfed model of real neurons s also knon as a Threshold Logc Unt: W 2 A Y 3 n W n. A set of synapses

More information