Bayesian Cluster Ensembles




Hongjun Wang 1, Hanhuai Shan 2 and Arindam Banerjee 2
1 Information Research Institute, Southwest Jiaotong University, Chengdu, Sichuan, 610031, China
2 Department of Computer Science & Engineering, University of Minnesota, Twin Cities, Minneapolis, MN 55455
Correspondence to: Hanhuai Shan (shan@cs.umn.edu)

Received 14 November 2009; revised 19 October 2010; accepted 27 October 2010
DOI: 10.1002/sam.10098
Published online in Wiley Online Library (wileyonlinelibrary.com).

Abstract: Cluster ensembles provide a framework for combining multiple base clusterings of a dataset to generate a stable and robust consensus clustering. There are important variants of the basic cluster ensemble problem, notably including cluster ensembles with missing values, as well as row-distributed and column-distributed cluster ensembles. Existing cluster ensemble algorithms are applicable only to a small subset of these variants. In this paper, we propose Bayesian cluster ensembles (BCE), a mixed-membership model for learning cluster ensembles that is applicable to all the primary variants of the problem. We propose a variational approximation based algorithm for learning Bayesian cluster ensembles. BCE is further generalized to deal with the case where features of the original data points are available, referred to as generalized BCE (GBCE). We compare BCE extensively with several other cluster ensemble algorithms, and demonstrate that BCE is not only versatile in terms of its applicability but also outperforms the other algorithms in terms of stability and accuracy. Moreover, GBCE can achieve higher accuracy than BCE, especially with only a small number of available base clusterings. 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2011

Keywords: cluster ensembles; Bayesian models

1. INTRODUCTION

Cluster ensembles provide a framework for combining multiple base clusterings of a dataset into a single consolidated clustering. Compared to individual clustering algorithms, cluster ensembles generate more robust and stable clustering results [1]. In principle, cluster ensembles can leverage distributed computing by calculating the base clusterings in an entirely distributed manner [2]. In addition, since cluster ensembles only need access to the base clustering results instead of the original data points, they provide a convenient approach to privacy preservation and knowledge reuse [2]. Such desirable aspects have made the study of cluster ensembles increasingly important in the context of data mining.

In addition to generating a consensus clustering from a complete set of base clusterings, it is highly desirable for cluster ensemble algorithms to have several additional properties suitable for real-life applications. First, there may be missing values in the base clusterings. For example, in a customer segmentation application, while there are legacy clusterings on old customers, there will be no clusterings on the new customers. Cluster ensemble algorithms should be able to build consensus clusters with such missing information on base clusterings. Second, there may be restrictions on bringing all the base clusterings to one place to run the cluster ensemble algorithm. Such restrictions may be due to the fact that the base clusterings are with different organizations and cannot be shared with each other. Cluster ensemble algorithms should be able to work with such column-distributed base clusterings. Third, the data points themselves may be distributed over multiple locations; while it is possible to get a base clustering across the entire dataset by message passing, base clusterings for different parts of the data will be in different locations, and there may be restrictions on bringing them together at one place.
For example, for a customer segmentation application, different vendors may have different subsets of customers, and a base clustering on all the customers can be performed using privacy-preserving clustering algorithms; however, the cluster assignments of the customer subsets for each vendor are private information which they will be unwilling to share directly for the purposes of forming a consensus clustering. Again, it would be desirable to have cluster ensemble algorithms handle such row-distributed base clusterings.

Finally, in many real-world scenarios, features of the original data points are available. These features could be the ones used for generating the base clusterings in the ensemble, or they could be new information currently becoming available, such as some new purchasing records of a customer. In such a situation, a cluster ensemble algorithm which is able to combine both the base clustering results and the data points' features is expected to generate a better consensus clustering compared to using base clustering results alone.

Current cluster ensemble algorithms, such as the cluster-based similarity partitioning algorithm (CSPA) [2], the hypergraph partitioning algorithm (HGPA) [2], or k-means based algorithms [3], are applicable to accomplish one or two of the above variants of the problem. However, none of them was designed to address all of the variants. In principle, the recently proposed mixture modeling approach to learning cluster ensembles [1] is applicable to the variants, but the details have not been reported in the literature.

In this paper, we propose Bayesian cluster ensembles (BCE), which can solve the basic cluster ensemble problem using a Bayesian approach, that is, by effectively maintaining a distribution over all possible consensus clusterings. It also seamlessly generalizes to all the important variants discussed above. Similar to the mixture modeling approach, BCE treats all base clustering results for each data point as a vector with a discrete value on each dimension, and learns a mixed-membership model from such a representation. In addition, we extend BCE to generalized BCE (GBCE), which learns a consensus clustering from both the base clusterings and the feature vectors of the original data points. Extensive empirical evaluation demonstrates that BCE is not only versatile in terms of its applicability but also mostly outperforms the other cluster ensemble algorithms in terms of stability and accuracy. Moreover, GBCE can achieve higher accuracy than BCE, especially when there are only a small number of available base clusterings.

The rest of the paper is organized as follows. In Section 2, we give a problem definition. Section 3 presents the related work on cluster ensembles. The model for BCE is proposed in Section 4, and a variational inference algorithm is discussed in Section 5. Section 6 proposes GBCE. We report experimental results in Section 7, and conclude in Section 8.

2. PROBLEM FORMULATION

Given N data points O = {o_i, [i]_1^N} ([i]_1^N denotes i = 1, ..., N) and M base clustering algorithms C = {c_j, [j]_1^M}, we get M base clusterings of the data points, one from each algorithm. The only requirement from a base clustering algorithm is that it generates a cluster assignment or id for each of the N data points {o_i, [i]_1^N}. The number of clusters generated by different base clustering algorithms may be different. We denote the number of clusters generated from c_j by k_j, so that the cluster ids assigned by c_j range from 1 to k_j. If \lambda_i^j \in \{1, ..., k_j\} denotes the cluster id assigned to o_i by c_j, the base clustering algorithm c_j gives a clustering of the entire dataset, given by

\lambda^j = \{\lambda_i^j, [i]_1^N\} = \{c_j(o_i), [i]_1^N\}.

The results from the M base clustering algorithms can be stacked together to form an (N x M) matrix B, whose jth column is \lambda^j, as shown in panel (a) of Fig. 1. The matrix can be viewed from another perspective: each row x_i of the matrix, that is, all base clustering results for o_i, gives a new vector representation for the data point o_i (panel (b) of Fig. 1). In particular, x_i = {x_ij, [j]_1^M} = {c_j(o_i), [j]_1^M}.

Fig. 1  Two ways of processing base clustering results for cluster ensemble.

Given the base clustering matrix B, the cluster ensemble problem is to combine the M base clustering results for the N data points to generate a consensus clustering, which should be more accurate, robust, and stable than the individual base clusterings.
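As an illustration of this construction, the following minimal sketch stacks M base clusterings into the N x M matrix B whose ith row is x_i. It uses scikit-learn's k-means with different seeds purely as example base clustering algorithms; the library, dataset, and variable names are illustrative assumptions, not part of the paper.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Feature matrix used only to produce example base clusterings.
X = load_iris().data
N, M = X.shape[0], 5

# Column j of B is the label vector lambda^j of the jth base clustering;
# row i of B is the new representation x_i of data point o_i.
B = np.column_stack([
    KMeans(n_clusters=3, n_init=10, random_state=j).fit_predict(X)
    for j in range(M)
])
print(B.shape)  # (150, 5)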
The traditional approach to processing the base clustering results is column-wise (panel (a) of Fig. 1); that is, we consider B as a set of M columns of base clustering results {\lambda^j, [j]_1^M}, and we try to find the consensus clustering \lambda. The disadvantage of the column-wise perspective is that it needs to find the correspondence between different base clusters generated by different algorithms. For example, in panel (a) of Fig. 1, we need to know whether cluster 1 in the first column corresponds to cluster 1, 2, or 3 in the second column. The cluster correspondence problem is hard to solve efficiently, and the complexity increases especially when different base clustering algorithms generate different numbers of clusters [1].

A simpler approach to the cluster ensemble problem, which is what we use in this paper, is to read the matrix B in a row-wise (panel (b) of Fig. 1) way. All base clustering results for a data point o_i can be considered as a vector x_i with discrete values on each dimension [1], and we consider the base clustering matrix B as a set of N rows of M-dimensional vectors {x_i, [i]_1^N}. From this perspective, the cluster ensemble problem becomes finding a clustering \lambda for {x_i, [i]_1^N}, where \lambda is a consensus clustering over all base clusterings. Further, by considering the cluster ensemble problem from this perspective, we naturally avoid the cluster correspondence problem, because for each x_i, \lambda_i^1 and \lambda_i^2 are just two features; they are conditionally independent in the naive Bayes setting for clustering.

While the basic cluster ensemble framework assumes that all base clustering results for all data points are available in one place to perform the analysis, real-life applications often need variants of the basic setting. In this paper, we discuss four important variants: missing value cluster ensembles, row- and column-distributed cluster ensembles, and cluster ensembles with original data points.

2.1. Missing Value Cluster Ensembles

When several base clustering results are missing for several data points, we have a missing value cluster ensemble problem. Such a problem appears for various reasons. For example, if there are new data points added to the dataset after running clustering algorithm c_j, these new data points will not have base clustering results corresponding to c_j. In a missing value cluster ensemble, instead of dealing with a full base clustering matrix B, we are dealing with a matrix with missing entries.

2.2. Row-Distributed Cluster Ensembles

For row-distributed cluster ensembles, base clustering results of different data points (rows) are at different locations. The corresponding real-life scenario is that different subsets of the original dataset are owned by different organizations, or cannot be put together in one place due to size, communication, or privacy constraints. While distributed base clustering algorithms, such as distributed privacy-preserving k-means [4], can be run on the subsets to generate base clustering results, due to the restrictions on sharing, the results on different subsets cannot be transmitted to a central location for analysis. Therefore, it is desirable to learn a consensus clustering in a row-distributed manner.

2.3. Column-Distributed Cluster Ensembles

For column-distributed cluster ensembles, different base clustering results of all data points are at different locations. The corresponding real-life scenario is that separate organizations have different base clusterings on the same set of data points, for example, different e-commerce vendors having customer segmentations on the same customer base. The base clusterings cannot be shared with others due to privacy concerns, but each organization has an incentive to get a more robust consensus clustering. In such a case, the cluster ensemble problem has to be solved in a column-distributed way.

2.4. Cluster Ensemble with Original Data Points

In many real-life scenarios, not only the base clustering results but also the features of the original data points are available. For example, a company may have both the customer segmentations and their purchasing records. The features of the original data points could be the ones used to generate the base clustering results, for example, the purchasing records used to generate the existing customer segmentations. The features could also be new information that has recently become available, for example, new purchasing records of customers.
In such cases, we may lose useful information by running cluster ensemble algorithms on base clustering results only. Meanwhile, if the base clustering algorithms do not perform very well, a combination of them usually fails to yield a good consensus clustering. Therefore, a cluster ensemble algorithm which can take both the base clustering results and the original data points is expected to generate a better consensus clustering.

3. RELATED WORK

In this section, we give a brief overview of cluster ensemble algorithms. There are three main classes of algorithms: graph-based models, matrix-based models, and probabilistic models.

3.1. Graph-Based Models

The most popular algorithms for cluster ensembles are graph-based models [2,5-7]. The main idea of this class of algorithms is to convert the results of the base clusterings to a hypergraph or a graph and then use graph partitioning algorithms to obtain the ensemble clusters. Strehl and Ghosh [2] present three graph-based cluster ensemble algorithms: CSPA [2] induces a graph from a co-association matrix, and the graph is partitioned by the METIS algorithm [8] to obtain the final clusters. In addition, HGPA [2] represents each cluster and the corresponding objects by a hyperedge and nodes, respectively, and then uses the minimal cut algorithm HMETIS [9] for partitioning.

Further, hyperedge collapsing operations are used in a meta-clustering algorithm (MCLA) [2] which determines a soft cluster membership for each object. Fern and Brodley [5] propose a bipartite graph partitioning algorithm. It solves the cluster ensemble problem by reducing it to a graph partitioning problem and introduces a new reduction method that constructs a bipartite graph from the base clusterings. The graph models consider both the objects and the clusters of the ensemble as vertices simultaneously. Al-Razgan and Domeniconi [6] propose a weighted bipartite partitioning algorithm (WBPA), which maps the problem of finding a consensus partition to bipartite graph partitioning.

3.2. Matrix-Based Models

The second class of algorithms are matrix-based models [10-13]. The main idea of this category is to convert the base clustering matrix into another matrix, such as a co-association matrix, a consensus matrix, or a non-negative matrix, and to use matrix operations to obtain the cluster ensemble results. Fred and Jain [10] map various base clustering results to a co-association matrix, where each entry represents the strength of association between objects, based on the co-occurrence of two objects in the same cluster. A voting algorithm is applied to the co-association matrix to obtain the final result. Clusters are formed from the co-association matrix by collecting the objects whose co-association values exceed a threshold. Kellam et al. [12] combine the results of base clusterings through a co-association matrix, which is an agreement matrix with each cell containing the number of agreements among the base clustering methods. The co-association matrix is used to find the clusters with the highest value of support based on object co-occurrences. As a result, only a set of so-called robust clusters are produced. Monti et al. [13] define a consensus matrix for representing and quantifying the agreement among the results of base clusterings. For each pair of objects, the matrix stores the proportion of clustering runs in which the two objects are clustered together. Li et al. [11] illustrate that the cluster ensemble problem can be formulated under the framework of non-negative matrix factorization (NMF), which refers to the problem of factorizing a given non-negative data matrix X into two matrix factors, that is, X ≈ AB, under the constraint that A and B are non-negative matrices.

3.3. Probabilistic Models

The third class of cluster ensemble algorithms are based on probabilistic models [1]. These algorithms take advantage of statistical properties of the base clustering results to achieve a consensus clustering. Topchy et al. [1] consider a representation of multiple clusterings as a set of new attributes characterizing the data items, and a mixture model (MM) offers a probabilistic model of consensus using a finite mixture of multinomial distributions in the space of base clusterings. A consensus result is found as a solution to the corresponding maximum likelihood problem using the expectation maximization (EM) algorithm.

4. BAYESIAN CLUSTER ENSEMBLES

In this section, we propose a novel Bayesian cluster ensemble model. The main idea is as follows: given a base clustering matrix B = {x_i, [i]_1^N} for N data points, we assume there exists a Bayesian graphical model generating B. In particular, we assume that each vector x_i has an underlying mixed membership to different consensus clusters. Let \theta_i denote the latent mixed-membership vector for x_i; if there are k consensus clusters, \theta_i is a discrete distribution over the k clusters. From the generative model perspective, we assume that \theta_i is sampled from a Dirichlet distribution with parameter \alpha, and the consensus cluster h, [h]_1^k, for each x_ij, [j]_1^M, is sampled from \theta_i separately.
Further, each latent consensus cluster h has a discrete distribution \beta_hj over the cluster ids {1, ..., k_j} for the jth base clustering result of each x_i. Thus, if x_ij truly belongs to consensus cluster h, x_ij = r \in {1, ..., k_j} will be determined by the discrete probability distribution \beta_hj(r) = p(x_ij | \beta_hj), where \beta_hj(r) \geq 0 and \sum_{r=1}^{k_j} \beta_hj(r) = 1. The full generative process for each x_i is assumed to be as follows (Fig. 2):

1. Choose \theta_i ~ Dirichlet(\alpha).
2. For the jth base clustering:
   (a) Choose a component z_ij = h ~ discrete(\theta_i);
   (b) Choose the base clustering result x_ij ~ discrete(\beta_hj).

Thus, the model contains the model parameters (\alpha, \beta), where \beta = {\beta_hj, [h]_1^k, [j]_1^M}, the latent variables (\theta_i, z_ij), and the actual observations {x_ij, [i]_1^N, [j]_1^M}. BCE can be viewed as a special case of mixed-membership naive Bayes models [14,15] obtained by choosing a discrete distribution as the generative model. Further, BCE is closely related to LDA [16], although the two models are applicable to different types of data.
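To make the generative process concrete, the following minimal sketch samples a base clustering matrix B from the model above, assuming illustrative values for N, M, k, k_j, alpha, and beta (none of these values come from the paper).

import numpy as np

rng = np.random.default_rng(0)
N, M, k = 100, 5, 3                 # points, base clusterings, consensus clusters
k_j = [3, 3, 4, 2, 3]               # number of cluster ids per base clustering

alpha = np.ones(k)                                         # Dirichlet parameter
beta = [rng.dirichlet(np.ones(kj), size=k) for kj in k_j]  # beta[j][h]: distribution over the k_j ids

B = np.zeros((N, M), dtype=int)
for i in range(N):
    theta_i = rng.dirichlet(alpha)                  # 1. theta_i ~ Dirichlet(alpha)
    for j in range(M):
        h = rng.choice(k, p=theta_i)                # 2(a). z_ij = h ~ discrete(theta_i)
        B[i, j] = rng.choice(k_j[j], p=beta[j][h])  # 2(b). x_ij ~ discrete(beta_hj)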

Fig. 2  Graphical model for BCE.

Given the model parameters \alpha and \beta, the joint distribution of the latent and observed variables {x_i, z_i, \theta_i} is given by:

p(x_i, \theta_i, z_i | \alpha, \beta) = p(\theta_i | \alpha) \prod_{j=1, \exists x_{ij}}^{M} p(z_{ij} = h | \theta_i)\, p(x_{ij} | \beta_{hj}),   (1)

where \exists x_{ij} denotes that there exists a jth base clustering result for x_i, so the product is only over the existing base clustering results. By integrating over the latent variables {z_i, \theta_i}, the marginal probability for each x_i is given by:

p(x_i | \alpha, \beta) = \int_{\theta_i} p(\theta_i | \alpha) \prod_{j=1, \exists x_{ij}}^{M} \sum_h p(z_{ij} = h | \theta_i)\, p(x_{ij} | \beta_{hj})\, d\theta_i.

BCE can be considered as a generalization of mixture models [1] to Bayesian models for the cluster ensemble problem. In MM, for each x_i we pick a z_i, which may take the value [0, 0, 1], [0, 1, 0], or [1, 0, 0] in a three-cluster problem; that is, there are only three possible values for the consensus clustering in the generative process. In comparison, in BCE, for each x_i we pick a \theta_i, which could be any valid discrete distribution, that is, in this case, any three-dimensional vector with each dimension larger than 0 and the summation of the three dimensions equal to 1. Also, we keep a Dirichlet distribution over all possible \theta_i's. Such a scheme is better than that of MM for two reasons: (i) the membership vector of BCE has a much larger set of choices than MM's; (ii) BCE allows mixed membership (a membership to multiple consensus clusters) in the generative process, while MM only allows a sole membership (a membership to only one consensus cluster). Therefore, BCE is more flexible than mixture-model based cluster ensembles.

5. VARIATIONAL INFERENCE FOR BCE

We have assumed a generative process for the base clustering matrix B = {x_i, [i]_1^N} in Section 4. Given the observable matrix B, our final goal is to estimate the mixed membership {\theta_i, [i]_1^N} of each object to the consensus clusters. Since the model parameters \alpha and \beta are unknown, we also have to estimate the model parameters such that the log-likelihood of observing the base clustering matrix B is maximized. EM algorithms are typically used for such parameter estimation problems by alternating between calculating the posterior over the latent variables and updating the model parameters until convergence. However, the posterior distribution

p(\theta_i, z_i | x_i, \alpha, \beta) = \frac{p(\theta_i, z_i, x_i | \alpha, \beta)}{p(x_i | \alpha, \beta)}   (2)

cannot be calculated in a closed form, since the denominator (partition function) p(x_i | \alpha, \beta), as an expansion of Eq. (1), is given by

p(x_i | \alpha, \beta) = \frac{\Gamma(\sum_h \alpha_h)}{\prod_h \Gamma(\alpha_h)} \int \left( \prod_{h=1}^{k} \theta_{ih}^{\alpha_h - 1} \right) \prod_{j=1}^{M} \sum_{h=1}^{k} \prod_{r=1}^{k_j} \left( \theta_{ih}\, \beta_{hj}(r) \right)^{\delta_i(r,j)} d\theta_i,

where \delta_i(r, j) is an indicator taking value 1 if the jth base clustering assigns o_i to base cluster r and 0 otherwise, and \beta_{hj}(r) is the rth component of the discrete distribution \beta_{hj} for the hth consensus cluster and the jth base clustering. The coupling between \theta and \beta in the summation over the latent variable z makes the computation intractable [16]. There are two main classes of approximation algorithms to address such problems: one is variational inference, and the other is Gibbs sampling. In this paper, we present the variational inference method.

5.1. Variational Inference

Since it is intractable to calculate the true posterior in Eq. (2) directly, in variational inference we introduce a family of distributions as an approximation of the posterior distribution over the latent variables, to get a tractable lower bound on the log-likelihood log p(x_i | \alpha, \beta). We maximize this lower bound to update the parameter estimates. In particular, following [14,16], we introduce a family of variational distributions

q(\theta_i, z_i | \gamma_i, \phi_i) = q(\theta_i | \gamma_i) \prod_{j=1}^{M} q(z_{ij} | \phi_{ij})   (3)

as an approximation of p(\theta_i, z_i | \alpha, \beta, x_i) in Eq. (2), where \gamma_i is a Dirichlet distribution parameter, and \phi_i = {\phi_{ij}, [j]_1^M} are discrete distribution parameters.
We introduce such an approximating distribution for each x_i, [i]_1^N. Now, using Jensen's inequality [17], we can obtain a lower bound

L(\alpha, \beta; \phi_i, \gamma_i) to log p(x_i | \alpha, \beta), given by:

L(\alpha, \beta; \phi_i, \gamma_i) = E_q[\log p(\theta_i, z_i, x_i | \alpha, \beta)] + H(q(\theta_i, z_i | \gamma_i, \phi_i)),

where H(·) denotes the Shannon entropy. Assuming each row x_i of the matrix B to be statistically independent given the parameters (\alpha, \beta), the log-likelihood of observing the matrix B is simply

\log p(B | \alpha, \beta) = \sum_{i=1}^{N} \log p(x_i | \alpha, \beta) \geq \sum_{i=1}^{N} L(\alpha, \beta; \phi_i, \gamma_i).   (4)

For a fixed set of model parameters (\alpha, \beta), maximizing the lower bound with respect to the free variational parameters (\gamma_i, \phi_i) for each x_i, [i]_1^N, gives us the best lower bound from this family of approximations. A direct calculation leads to the following set of update equations for the variational maximization:

\phi_{ijh} \propto \exp\left( \Psi(\gamma_{ih}) - \Psi\left(\sum_{h'=1}^{k} \gamma_{ih'}\right) + \sum_{r=1}^{k_j} \delta_i(r, j) \log \beta_{hj}(r) \right),   (5)

\gamma_{ih} = \alpha_h + \sum_{j=1, \exists x_{ij}}^{M} \phi_{ijh},   (6)

where [i]_1^N, [j]_1^M, [h]_1^k, \phi_{ijh} is the hth component of the variational discrete distribution \phi_{ij} for z_{ij}, and \gamma_{ih} is the hth component of the variational Dirichlet distribution \gamma_i for \theta_i. For a given set of variational parameters (\gamma_i, \phi_i), [i]_1^N, the lower bound given in Eq. (4) is maximized by the point estimate for \beta:

\beta_{hj}(r) \propto \sum_{i=1}^{N} \phi_{ijh}\, \delta_i(r, j),   (7)

where [h]_1^k, [j]_1^M, [r]_1^{k_j}. The Dirichlet parameter \alpha can be estimated via Newton-Raphson updates as in LDA [16]. In particular, the update equation for \alpha_h is given by

\alpha_h = \alpha_h - \frac{g_h - c}{l_h},   (8)

with

g_h = N\left( \Psi\left(\sum_{h'=1}^{k} \alpha_{h'}\right) - \Psi(\alpha_h) \right) + \sum_{i=1}^{N} \left( \Psi(\gamma_{ih}) - \Psi\left(\sum_{h'=1}^{k} \gamma_{ih'}\right) \right),
l_h = -N\, \Psi'(\alpha_h),
c = \frac{\sum_{h=1}^{k} g_h / l_h}{v^{-1} + \sum_{h=1}^{k} l_h^{-1}},
v = N\, \Psi'\left(\sum_{h=1}^{k} \alpha_h\right),

where \Psi is the digamma function, that is, the first derivative of the log Gamma function, and \Psi' is its derivative.

5.2. Variational EM Algorithm

Given the updating equations for the variational parameters and the model parameters, we can use a variational EM algorithm to find the best-fit model (\alpha*, \beta*). Starting from an initial guess (\alpha^(0), \beta^(0)), the EM algorithm alternates between two steps until convergence:

1. E-Step: Given (\alpha^(t-1), \beta^(t-1)), for each x_i, find the best variational parameters:
   (\phi_i^(t), \gamma_i^(t)) = argmax_{(\phi_i, \gamma_i)} L(\alpha^(t-1), \beta^(t-1); \phi_i, \gamma_i).
   L(\alpha, \beta; \phi_i^(t), \gamma_i^(t)) serves as a lower bound function to log p(x_i | \alpha, \beta).

2. M-Step: Maximize the aggregate lower bound with respect to (\alpha, \beta) to obtain an improved parameter estimate:
   (\alpha^(t), \beta^(t)) = argmax_{(\alpha, \beta)} \sum_{i=1}^{N} L(\alpha, \beta; \phi_i^(t), \gamma_i^(t)).
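The two steps above can be sketched compactly in code. The following is a minimal sketch of the E-step updates (5)-(6) and the M-step point estimate (7) for a fully observed B; variable names (B, alpha, beta, n_iter) and the fixed number of inner iterations are illustrative choices, not taken from the authors' implementation.

import numpy as np
from scipy.special import digamma

def e_step(B, alpha, beta, n_iter=20):
    """Variational E-step: iterate Eqs. (5) and (6) for every row x_i of B."""
    N, M = B.shape
    k = len(alpha)
    gamma = np.tile(alpha, (N, 1)) + float(M) / k
    phi = np.full((N, M, k), 1.0 / k)
    log_beta = [np.log(b) for b in beta]              # beta[j] has shape (k, k_j)
    for _ in range(n_iter):
        for i in range(N):
            dig = digamma(gamma[i]) - digamma(gamma[i].sum())
            for j in range(M):
                log_phi = dig + log_beta[j][:, B[i, j]]   # Eq. (5), unnormalized
                phi[i, j] = np.exp(log_phi - log_phi.max())
                phi[i, j] /= phi[i, j].sum()
            gamma[i] = alpha + phi[i].sum(axis=0)         # Eq. (6)
    return phi, gamma

def m_step_beta(B, phi, k_j):
    """M-step point estimate for beta, Eq. (7)."""
    N, M, k = phi.shape
    beta = [np.full((k, kj), 1e-10) for kj in k_j]
    for j in range(M):
        for r in range(k_j[j]):
            beta[j][:, r] += phi[B[:, j] == r, j].sum(axis=0)
        beta[j] /= beta[j].sum(axis=1, keepdims=True)
    return beta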

After (t-1) iterations, the value of the lower bound function is \sum_{i=1}^{N} L(\alpha^(t-1), \beta^(t-1), \phi_i^(t-1), \gamma_i^(t-1)). In the tth iteration,

\sum_{i=1}^{N} L(\alpha^(t-1), \beta^(t-1), \phi_i^(t-1), \gamma_i^(t-1)) \leq \sum_{i=1}^{N} L(\alpha^(t-1), \beta^(t-1), \phi_i^(t), \gamma_i^(t))   (9)
\leq \sum_{i=1}^{N} L(\alpha^(t), \beta^(t), \phi_i^(t), \gamma_i^(t)).   (10)

The first inequality holds because, in the E-step, Eq. (9) is the maximum of L(\alpha^(t-1), \beta^(t-1), \phi_i, \gamma_i), and the second inequality holds because, in the M-step, Eq. (10) is the maximum of L(\alpha, \beta, \phi_i^(t), \gamma_i^(t)). Therefore, the objective function is nondecreasing until convergence [17].

Regarding computational complexity, since we only need to calculate \Psi(\gamma_{ih}) - \Psi(\sum_{h'=1}^{k} \gamma_{ih'}) once for all \phi_{ijh}, [j]_1^M, the complexity of updating \phi in each E-step is O((Nk^2 + NMku) t_E), where u = max{k_j, [j]_1^M} and t_E is the number of iterations inside each E-step. Also, the time for updating \gamma is O(NMk t_E). In the M-step, the complexity of updating \beta is O(NMku). \alpha is updated using Newton updates, and the time needed is O(kN t_\alpha), where t_\alpha is the number of iterations in the Newton updates. Compared to the MM-based cluster ensemble algorithm [1], BCE is computationally more expensive, since it iterates over Eqs. (5) and (6) inside the E-step, while MM [1] uses a direct EM algorithm whose E-step has a closed form. However, as we show in the experiments, BCE achieves significantly better performance than MM.

5.2.1. Row-distributed EM algorithm

In a row-distributed cluster ensemble, the object set O is partitioned into P parts {O^(1), O^(2), ..., O^(P)} and different parts are assumed to be at different locations. We further assume that a set of distributed base clustering algorithms have been used to obtain the base clustering results {B^(1), B^(2), ..., B^(P)}. We now outline a row-distributed variant of the variational inference algorithm. At each iteration t, given the initialization of the model parameters (\alpha^(t-1), \beta^(t-1)), row-distributed variational EM for BCE proceeds as follows:

1. For each partition {B^(p), [p]_1^P}, we obtain variational parameters (\phi^(p), \gamma^(p)) following Eqs. (5) and (6), where \phi^(p) = {\phi_i : x_i \in B^(p)} and \gamma^(p) = {\gamma_i : x_i \in B^(p)}.

2. To update \beta following Eq. (7), we can write the right-hand side of Eq. (7) as
   \sum_{x_i \in B^(1)} \phi_{ijh}\, \delta_i(r, j) + ... + \sum_{x_i \in B^(P)} \phi_{ijh}\, \delta_i(r, j).
   Each part of the summation corresponds to one partition of B. To update \beta_{hj}(r), first, the partial sum \Delta^(p) = \sum_{x_i \in B^(p)} \phi_{ijh}\, \delta_i(r, j) is calculated for each B^(p). Second, for each B^(p) (p \in [2, P]), we take \sum_{q=1}^{p-1} \Delta^(q) from B^(p-1), generate \sum_{q=1}^{p} \Delta^(q) by adding \Delta^(p) to the summation, and pass it to B^(p+1). Finally, after passing through all partitions, we have the summation on the right-hand side of Eq. (7), and \beta_{hj}(r) is updated after normalization.

3. Updating \alpha is a little tricky since it does not have a closed-form solution. However, we notice that the update Eq. (8) for \alpha_h only depends on two quantities: \alpha_h and {\gamma_i, [i]_1^N}. \alpha_h can be obtained from the last iteration of the Newton-Raphson algorithm. Regarding \gamma, we only need to know \sum_{i=1}^{N} \Psi(\gamma_{ih}) and \sum_{i=1}^{N} \Psi(\sum_h \gamma_{ih}) for g_h in Eq. (8). We use the same strategy as for updating \beta: first we calculate the partial sums s^(p) = \sum_{x_i \in B^(p)} \Psi(\gamma_{ih}) and w^(p) = \sum_{x_i \in B^(p)} \Psi(\sum_h \gamma_{ih}) on each partition. Second, for each B^(p) (p \in [2, P]), we take \sum_{q=1}^{p-1} s^(q) and \sum_{q=1}^{p-1} w^(q) from B^(p-1), generate \sum_{q=1}^{p} s^(q) and \sum_{q=1}^{p} w^(q) by adding s^(p) and w^(p) to the summations respectively, and pass them to B^(p+1). Finally, after going through all partitions, we have the result for \sum_{i=1}^{N} (\Psi(\gamma_{ih}) - \Psi(\sum_h \gamma_{ih})), so we can update \alpha_h following Eq. (8). For each iteration of the Newton-Raphson algorithm, we need to pass the summations through all partitions once.

By the end of the tth iteration, we have the updated model parameters (\alpha^(t), \beta^(t)), which are used as the initialization for the (t+1)th iteration. The algorithm is guaranteed to converge since it is essentially the same as the EM for the general case, except that it works in a row-distributed manner.
By running EM in this distributed fashion, neither {O^(p), [p]_1^P} nor {B^(p), [p]_1^P} is passed around among the different parties, but only the intermediate summations; in this sense, we achieve privacy preservation. As we have noted, updating \alpha is very expensive because it needs to pass the summations over all partitions for each Newton-Raphson iteration, which is practically infeasible for a dataset with a large number of partitions. Therefore, we next give a heuristic row-distributed EM, which does not have a theoretical guarantee of convergence, but worked well in practice in our experiments. At each iteration t, given the initialization of the model parameters (\alpha_(1)^(t-1), \beta_(1)^(t-1)), heuristic row-distributed variational EM for BCE proceeds as follows:

1. For the first partition B^(1), given (\alpha_(1)^(t-1), \beta_(1)^(t-1)), we obtain variational parameters (\phi^(1), \gamma^(1)) following Eqs. (5) and (6). Also, we update (\alpha_(1), \beta_(1)) to get (\alpha_(1)^(t), \beta_(1)^(t)) following Eqs. (8) and (7), respectively.

2. For the pth partition B^(p), we initialize (\alpha_(p), \beta_(p)) with (\alpha_(p-1)^(t), \beta_(p-1)^(t)) and obtain (\phi^(p), \gamma^(p)) following Eqs. (5) and (6). We update (\alpha_(p)^(t), \beta_(p)^(t)) and pass them to the (p+1)th partition.

After going over all partitions, we are done with the tth iteration; the iterations are repeated until convergence. The initialization for (\alpha_(1)^(1), \beta_(1)^(1)) in the first iteration could be picked at random or by using some heuristics, and the initialization for (\alpha_(1), \beta_(1)) in the tth iteration comes from (\alpha_(P)^(t-1), \beta_(P)^(t-1)). The iterations run until the net change in the lower bound value is below a threshold, or until a pre-fixed number of iterations is reached.
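The key point of both row-distributed variants is that only partition-level summations are exchanged. The minimal sketch below illustrates step 2 of Section 5.2.1 under that assumption (the local statistic \Delta^(p) for the \beta update); the function and variable names are hypothetical, not from the paper.

import numpy as np

def partition_stats(B_p, phi_p, k, k_j):
    """Local contribution of one partition: sum over its own rows of phi_ijh * delta_i(r, j)."""
    stats = [np.zeros((k, kj)) for kj in k_j]
    for j, kj in enumerate(k_j):
        for r in range(kj):
            stats[j][:, r] = phi_p[B_p[:, j] == r, j].sum(axis=0)
    return stats

def combine_and_normalize(all_stats):
    """Accumulate the per-partition summations and normalize into beta (Eq. (7));
    only these summations, never the raw base clusterings, are passed between partitions."""
    M = len(all_stats[0])
    beta = []
    for j in range(M):
        total = sum(s[j] for s in all_stats) + 1e-10
        beta.append(total / total.sum(axis=1, keepdims=True))
    return beta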

5.2.2. Column-distributed EM algorithm

For column-distributed cluster ensembles, we design a client-server style algorithm, where each client maintains one base clustering, and the server gathers partial results from the clients and performs further processing. While we assume that there are M different clients, one can always work with a smaller number of clients by splitting the columns among the available clients. Given the initialization for the model parameters (\alpha^(t), \beta^(t)), where (\alpha^(t), \beta_j^(t)) is made available to the jth client, the column-distributed cluster ensemble at iteration t proceeds as follows:

1. E-step, jth client: Given x_ij and \beta_j^(t) for [i]_1^N, the jth client calculates \sum_{r=1}^{k_j} \delta_i(r, j) \log \beta_{hj}^(t)(r) for [i]_1^N, [h]_1^k, and passes the results to the E-step server.

2. E-step, server: Given \sum_{r=1}^{k_j} \delta_i(r, j) \log \beta_{hj}^(t)(r) from the clients, for [i]_1^N, [j]_1^M, [h]_1^k, the server calculates the variational parameters {\phi_{ijh}, [i]_1^N, [j]_1^M, [h]_1^k} following Eq. (5). Given \alpha^(t) and {\phi_{ijh}, [i]_1^N, [j]_1^M, [h]_1^k}, the server updates {\gamma_{ih}, [i]_1^N, [h]_1^k} following Eq. (6). The parameters {\phi_{ijh}, [i]_1^N, [h]_1^k} are passed to the M-step jth client, and {\gamma_{ih}, [i]_1^N, [h]_1^k} are passed to the M-step server.

3. M-step, jth client: Given x_ij and \phi_{ijh} for [i]_1^N, [h]_1^k, \beta_j^(t+1) is updated following Eq. (7) and passed to the E-step server for the (t+1)th iteration.

4. M-step, server: Given \alpha^(t) and \gamma_{ih} for [i]_1^N, [h]_1^k, \alpha^(t+1) is updated following Eq. (8) and passed to the E-step server for the next step.

The initialization (\alpha^(0), \beta^(0)) is chosen at the beginning of the first iteration. In iteration t, (\alpha^(t), \beta^(t)) are initialized by (\alpha^(t-1), \beta^(t-1)), that is, the results of the (t-1)th iteration. The algorithm is guaranteed to converge because it is essentially the same as the EM algorithm for general cluster ensembles, except that it is run in a column-distributed way. The algorithm is expected to be more efficient than the general cluster ensemble if we ignore the communication overhead. In addition, the jth client only has access to the jth base clustering results. The communication involves only parameters and intermediate results, instead of base clusterings. Therefore, privacy preservation is also achieved. In BCE, the most computationally expensive part of the E-step is the update for \phi. By running column-distributed EM, we parallelize most of the computation in updating \phi; the time complexity of updating \phi in each E-step hence decreases from O((Nk^2 + NMku) t_E) to O((Nk^2 + Nku) t_E). In the M-step, the cost of updating \beta decreases from O(NMku) to O(Nku) through parallelization.
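A minimal sketch of the client/server message pattern in this E-step is given below; it assumes the jth client holds only its own column of B and \beta_j, and the function and variable names are illustrative rather than from the paper.

import numpy as np
from scipy.special import digamma

def client_terms(col_j, log_beta_j):
    """jth client: sum_r delta_i(r, j) * log beta_hj(r) for every point i (an N x k matrix)."""
    return log_beta_j[:, col_j].T

def server_e_step(client_msgs, alpha, gamma):
    """Server side of Eqs. (5)-(6), given the per-client term matrices (a list of N x k arrays)."""
    dig = digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))
    log_phi = np.stack([msg + dig for msg in client_msgs], axis=1)   # shape (N, M, k)
    phi = np.exp(log_phi - log_phi.max(axis=2, keepdims=True))
    phi /= phi.sum(axis=2, keepdims=True)
    gamma_new = alpha + phi.sum(axis=1)
    return phi, gamma_new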
6. GENERALIZED BCE

Most cluster ensemble algorithms only combine the base clustering results to generate a consensus clustering, which might not be a good use of the data when features of the original data points are available; meanwhile, the performance of the ensemble algorithm is highly restricted by the base clustering algorithms, that is, the chances of obtaining a good consensus clustering from a set of very poor base clustering algorithms are low.

In this section, we propose the GBCE algorithm, which overcomes these two drawbacks by combining both the base clustering results and the feature vectors of the original data points to yield a consensus clustering. The main idea of GBCE is as follows: for each data point, we concatenate its D-dimensional feature vector o_i after the M-dimensional base clustering vector x_i to get an (M + D)-dimensional vector y_i. Following Shan and Banerjee [15], the generative process for y_i is given as follows:

1. Choose \theta_i ~ Dirichlet(\alpha).
2. For each nonmissing base clustering, that is, y_ij, [j]_1^M:
   (a) Choose a component z_ij = h ~ discrete(\theta_i);
   (b) Choose the base clustering result y_ij ~ discrete(\beta_hj).
3. For each nonmissing feature of the data point, that is, y_ij, [j]_{M+1}^{M+D}:
   (a) Choose a component z_ij = h ~ discrete(\theta_i);
   (b) Choose the feature value y_ij ~ p_{\psi_j}(y_ij | \zeta_hj).

The first two steps are the same as in BCE. The difference is the new step 3, where each feature of the data point is generated from p_{\psi_j}(y_ij | \zeta_hj), an exponential family distribution for feature j and consensus cluster h [15]. \psi_j in p_{\psi_j}(y_ij | \zeta_hj) determines a particular family for feature j, such as Gaussian, Poisson, etc., and \zeta_hj determines a particular parameter for the distribution in that family. For ease of exposition, we assume that all original features have real values generated from Gaussian distributions; then p_{\psi_j}(y_ij | \zeta_hj) can be written as [15]

p(y_{ij} | \mu_{hj}, \sigma_{hj}^2) = \frac{1}{\sqrt{2\pi \sigma_{hj}^2}} \exp\left( -\frac{(y_{ij} - \mu_{hj})^2}{2\sigma_{hj}^2} \right),

where \mu_{hj} and \sigma_{hj}^2 are the mean and variance of the Gaussian distribution for cluster h and feature j.

The proposed generative model gives an intuitive way to combine the feature vectors with the base clustering results, but it suffers from the limitation that it treats each feature of the original data point as being as important as each base clustering result. In such cases, for high-dimensional data points with D >> M, the feature vectors will dominate the consensus clustering, i.e., the consensus clustering result is almost the same as running the algorithm on only the data points without the base clustering results. Therefore, we further generalize the algorithm to allow different weights for different base clustering results and different features, yielding generalized BCE. Given non-negative integral weights u = {u_j, [j]_1^{M+D}} for y_ij, [j]_1^{M+D}, the generative process of GBCE for y_i with weights u is given as follows:

1. Choose \theta_i ~ Dirichlet(\alpha).
2. For each nonmissing base clustering, that is, y_ij, [j]_1^M, repeat 2(a) and 2(b) u_j times:
   (a) Choose a component z_ij = h ~ discrete(\theta_i);
   (b) Choose the base clustering result y_ij ~ discrete(\beta_hj).
3. For each nonmissing original feature, that is, y_ij, [j]_{M+1}^{M+D}, repeat 3(a) and 3(b) u_j times:
   (a) Choose a component z_ij = h ~ discrete(\theta_i);
   (b) Choose the feature value y_ij ~ N(\mu_hj, \sigma_hj^2).

Therefore, BCE is a special case of GBCE obtained by setting u_j = 1 for [j]_1^M and u_j = 0 for [j]_{M+1}^{M+D}. The marginal probability for a weighted y_i is given by

p(y_i | \alpha, \beta, \mu, \sigma^2, u) = \int_{\theta_i} p(\theta_i | \alpha) \prod_{j=1, \exists y_{ij}}^{M} \left( \sum_h p(z_{ij} = h | \theta_i)\, p(y_{ij} | \beta_{hj}) \right)^{u_j} \prod_{j=M+1, \exists y_{ij}}^{M+D} \left( \sum_h p(z_{ij} = h | \theta_i)\, p(y_{ij} | \mu_{hj}, \sigma_{hj}^2) \right)^{u_j} d\theta_i.   (11)

In GBCE, if we set u_j = 1 for [j]_{M+1}^{M+D} and u_j = D for [j]_1^M, we treat each base clustering as being as important as the whole feature vector of the data point, instead of a single feature. We can also set different weights for different y_ij based on the confidence in the clustering accuracy, the importance of the feature, etc. In addition, in the generative process, the weights have been assumed to be non-negative integers since they denote repetition counts, but the learning algorithm we discuss below still holds even when u_j is generalized to positive real numbers, yielding a very flexible model.

From the generative process, GBCE actually does not generate y_i, but generates a new vector \tilde{y}_i with y_ij repeated u_j times to incorporate the weights. However, we do not need to create \tilde{y}_i explicitly to learn the model. For inference and parameter estimation, similarly to Section 5, we introduce a family of variational distributions

q(\theta_i, z_i | \gamma_i, \phi_i, u) = q(\theta_i | \gamma_i) \prod_{j=1}^{M+D} q(z_{ij} | \phi_{ij})^{u_j}

to approximate p(\theta_i, z_i | \alpha, \beta, u, y_i). The update equations for the variational parameters are given by

\phi_{ijh} \propto \exp\left( \Psi(\gamma_{ih}) - \Psi\left(\sum_{h'} \gamma_{ih'}\right) + \sum_{r=1}^{k_j} \delta_i(r, j) \log \beta_{hj}(r) \right), \quad [j]_1^M,   (12)

\phi_{ijh} \propto \exp\left( \Psi(\gamma_{ih}) - \Psi\left(\sum_{h'} \gamma_{ih'}\right) - \log \sigma_{hj} - \frac{(y_{ij} - \mu_{hj})^2}{2\sigma_{hj}^2} \right), \quad [j]_{M+1}^{M+D},   (13)

\gamma_{ih} = \alpha_h + \sum_{j=1, \exists y_{ij}}^{M+D} u_j \phi_{ijh},   (14)

where [i]_1^N and [h]_1^k.
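For the Gaussian case, a single-point sketch of the weighted updates (12)-(14) may help. The array shapes assumed here (beta[j] of shape k x k_j, mu and sigma2 of shape k x D, u of length M + D, all NumPy arrays) are illustrative conventions, not the authors' code.

import numpy as np
from scipy.special import digamma

def gbce_point_update(x_i, f_i, u, alpha, beta, mu, sigma2, gamma_i):
    """Weighted updates (12)-(14) for one point: x_i holds the M base cluster ids,
    f_i holds the D Gaussian feature values, u holds the M + D weights."""
    M, D, k = len(x_i), len(f_i), len(alpha)
    dig = digamma(gamma_i) - digamma(gamma_i.sum())
    phi_i = np.zeros((M + D, k))
    for j in range(M):                                   # Eq. (12): base clusterings
        lp = dig + np.log(beta[j][:, x_i[j]])
        phi_i[j] = np.exp(lp - lp.max())
        phi_i[j] /= phi_i[j].sum()
    for j in range(D):                                   # Eq. (13): Gaussian features
        lp = dig - 0.5 * np.log(sigma2[:, j]) - (f_i[j] - mu[:, j]) ** 2 / (2.0 * sigma2[:, j])
        phi_i[M + j] = np.exp(lp - lp.max())
        phi_i[M + j] /= phi_i[M + j].sum()
    gamma_i_new = alpha + (u[:, None] * phi_i).sum(axis=0)   # Eq. (14), weighted by u_j
    return phi_i, gamma_i_new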

For the model parameters, the update equation for \alpha is the same as in Eq. (8), and the equations for the rest of the parameters are given by

\beta_{hj}(r) \propto u_j \sum_{i=1}^{N} \phi_{ijh}\, \delta_i(r, j),   (15)

\mu_{hj} = \frac{\sum_{i=1, \exists y_{ij}}^{N} u_j \phi_{ijh}\, y_{ij}}{\sum_{i=1, \exists y_{ij}}^{N} u_j \phi_{ijh}},   (16)

\sigma_{hj}^2 = \frac{\sum_{i=1, \exists y_{ij}}^{N} u_j \phi_{ijh}\, (y_{ij} - \mu_{hj})^2}{\sum_{i=1, \exists y_{ij}}^{N} u_j \phi_{ijh}},   (17)

where [h]_1^k, [j]_1^M, and [r]_1^{k_j}.

7. EXPERIMENTAL RESULTS

In this section, we run experiments on datasets from the UCI machine learning repository and KDD Cup 1999. In particular, for the UCI data, we pick 12 datasets which are relatively small. (For wine quality we only keep the data points in the three main classes, so the classes with very few data points are removed.) For the KDD Cup data, there are four main classes among 37 classes in total. We randomly pick 2,000,000 data points from these four main classes and divide them into two parts, so we have two relatively large datasets with one million data points each. The number of objects, features, and classes in each dataset are listed in Table 1, where kdd99-1 and kdd99-2 are from KDD Cup 1999 and the rest are from the UCI machine learning repository.

Table 1. The number of instances, features, and classes in each dataset.

Dataset        Instances   Features   Classes
pima                 768          8         2
iris                 150          4         3
wdbc                 569         30         2
balance              625          4         3
glass                214          9         6
bupa                 345          6         2
wine                 178         13         3
magic04           19 020         10         2
ionosphere           351         34         2
segmentation        2100         19         7
kdd99-1        1 000 000         41         4
kdd99-2        1 000 000         41         4
chess               3196         36         2
wine quality        4535         11         3

For all reported results, there are two steps leading to the final consensus clustering. First, we run base clustering algorithms to get a set of base clustering results. Second, various cluster ensemble algorithms, including MM [1], CSPA, HGPA, MCLA [2], and k-means, are applied to the base clustering results to generate a consensus clustering. We compare their results with BCE. The comparison between BCE and the other cluster ensemble algorithms is divided into five categories as follows:

1. General cluster ensemble (general).
2. Cluster ensemble with missing values (miss-v).
3. Cluster ensemble with increasing number of columns (increase-c), that is, additional base clusterings.
4. Column-distributed cluster ensemble (column-d).
5. Row-distributed cluster ensemble (row-d).

Table 2 shows the five categories of experiments and the six cluster ensemble algorithms we use. We can see that most of the algorithms can only accomplish a few tasks among the five. In principle, MM can be generalized to deal with all five scenarios; however, the literature does not have an explicit algorithm for column- or row-distributed cluster ensembles using MM. As we can see from Table 2, BCE is the most flexible and versatile among the six algorithms.

Table 2. The applicability of algorithms to different experimental settings (general, miss-v, increase-c, column-d, row-d): a check mark indicates that the algorithm is applicable, and a cross indicates otherwise.

For evaluation, we use micro-precision [18] to measure the accuracy of the consensus clustering with respect to the true labels. The micro-precision is defined as

MP = \sum_{h=1}^{k} a_h / n,

where k is the number of clusters, n is the number of objects, and a_h denotes the number of objects in consensus cluster h that are correctly assigned to the corresponding class. We identify the corresponding class for consensus cluster h as the true class with the largest overlap with the cluster, and assign all objects in cluster h to that class. Note that 0 \leq MP \leq 1, with 1 indicating the best possible consensus clustering, which has to be in full agreement with the class labels.
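A minimal sketch of the micro-precision computation is given below, assuming integer class labels; the tiny example at the end is made up for illustration.

import numpy as np

def micro_precision(consensus, truth):
    """MP = sum_h a_h / n, where a_h counts the objects in consensus cluster h
    that fall in its best-matching true class (integer labels assumed)."""
    consensus, truth = np.asarray(consensus), np.asarray(truth)
    hits = 0
    for h in np.unique(consensus):
        members = truth[consensus == h]
        hits += np.bincount(members).max()
    return hits / len(truth)

# Tiny example: clusters {0,0} and {1,1,1} against classes {0,0} and {1,1,0} give MP = 4/5.
print(micro_precision([0, 0, 1, 1, 1], [0, 0, 1, 1, 0]))   # 0.8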

In the following subsections, we present the experimental results for the five categories of experiments listed in Table 2, starting with general cluster ensembles.

7.1. General Cluster Ensembles

In this subsection, we run two types of experiments: one only uses k-means as the base clustering algorithm, and the other uses multiple algorithms as the base clustering algorithms.

Given N objects, we first use k-means as the base clustering algorithm on 12 datasets. For the ten UCI datasets, we run k-means 2000 times with different initializations to obtain 2000 base clustering results, which are divided evenly into 100 subsets, with 20 base clustering results in each of them. For the two large datasets kdd99-1 and kdd99-2, we run the experiments following the same strategy, but we keep three subsets with five base clustering results in each of them. Cluster ensemble algorithms are then applied on each subset. The maximum and average MPs over all subsets are reported in Tables 3(a) and 3(b).

Table 3. k-means with different initializations is used as the base clustering algorithm: (a) maximum MP for different cluster ensemble algorithms; (b) average MP for different cluster ensemble algorithms (magic04, kdd99-1, and kdd99-2 are too large, so one of the algorithms could not finish its run). The highest MP among the different algorithms on each dataset is bolded.

We also use k-means, fuzzy c-means (FCM) [19], METIS [8], and affinity propagation (AP) [20] as the base clustering algorithms on 11 datasets for the cluster ensemble. (We run this set of experiments on relatively small datasets since METIS and AP cannot run on large ones such as kdd99-1 and kdd99-2.) By running k-means 500 times, FCM 800 times, METIS 200 times, and AP 500 times with different initializations, we again obtain 2000 base clustering results. Following the same strategy as above to run the cluster ensemble algorithms, we obtain the maximum and average MPs in Tables 5(a) and 5(b).

The key observations from Tables 3 and 5 can be summarized as follows: (i) BCE almost always has a higher maximum and average MP than the base clustering results, which means the consensus clustering from BCE is indeed better in quality than the original base clusterings. (ii) BCE outperforms the other cluster ensemble algorithms most of the time in terms of both maximum and average MP, no matter which base clustering algorithms are used.

Since the results of BCE and MM are rather close to each other, to make a careful comparison, we run a paired t-test under the hypotheses H_0: MP(BCE) = MP(MM) and H_a: MP(MM) < MP(BCE).

The test is designed to assess the strength of the evidence against H_0 and supporting H_a. This strength is measured by the p-value; a lower p-value indicates stronger evidence. In our case, a lower p-value indicates that the performance improvement of BCE over MM is statistically significant. Usually a p-value less than 0.05 is considered strong evidence. The results are shown in Tables 4 and 6, respectively. BCE outperforms MM with a low p-value (< 0.05) most of the time, indicating that MP(BCE) is significantly better than MP(MM) on these datasets. In addition, the smaller standard deviation of BCE shows that it is more stable than MM.

Table 4. k-means with different initializations is used as the base clustering algorithm: paired t-test for BCE and MM, where Mean-D is the mean of the MP differences (BCE - MM), and sd-BCE (sd-MM) is the standard deviation of the MPs from BCE (MM). For Mean-D, the datasets where BCE performs better are bolded. For sd-BCE and sd-MM, the one with the smaller standard deviation is bolded.

Table 5. k-means, FCM, AP, and METIS are used as the base clustering algorithms: (a) maximum MP for different cluster ensemble algorithms; (b) average MP for different cluster ensemble algorithms. The highest MP among the different algorithms on each dataset is bolded.

Table 6. k-means, FCM, AP, and METIS are used as the base clustering algorithms: paired t-test for BCE and MM, where Mean-D is the mean of the MP differences (BCE - MM), and sd-BCE (sd-MM) is the standard deviation of the MPs from BCE (MM). For Mean-D, the datasets where BCE performs better are bolded. For sd-BCE and sd-MM, the one with the smaller standard deviation is bolded.
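For reference, the paired one-sided t-test used in Tables 4 and 6 can be computed directly from per-subset MP scores, as in the following sketch; the MP values shown are made up for illustration and are not taken from the tables.

import numpy as np
from scipy import stats

# Hypothetical per-subset MP scores for BCE and MM (illustrative numbers only).
mp_bce = np.array([0.91, 0.89, 0.93, 0.90, 0.92])
mp_mm = np.array([0.88, 0.89, 0.90, 0.88, 0.91])

d = mp_bce - mp_mm
t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
# One-sided p-value for H_a: MP(MM) < MP(BCE), i.e. the mean difference is positive.
p = 1.0 - stats.t.cdf(t, df=len(d) - 1)
print(f"mean difference = {d.mean():.4f}, t = {t:.3f}, p = {p:.4f}")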

Fig. 3  Average MP with increasing percentage of missing values (panels: Iris, Wdbc, Ionosphere, Pima, Glass, Bupa, Wine, Balance, Segmentation; x-axis: percentage of missing entries).

7.2. Cluster Ensembles with Missing Values

Given 20 base clustering results for N objects, we randomly hold out p percent of the data as missing values, with p increasing from 0 to 90 in steps of 4.5. (Starting from this subsection, we only use k-means as the base clustering algorithm.) We compare the performance of the different algorithms except k-means, because k-means cannot handle missing values. Each time we run the algorithms ten times and report the MP on nine datasets in Fig. 3. Surprisingly, before the missing value percentage reaches 70%, most algorithms have a stable MP with an increasing number of missing entries, without a distinct decrease in accuracy. BCE is always among the top one or two in terms of accuracy across different percentages of missing values, indicating that BCE is one of the best algorithms for dealing with missing value cluster ensembles. Comparatively, … seems to have the worst performance in terms of both accuracy and stability.

7.3. Cluster Ensembles with Increasing Columns

In order to find out the effect on cluster ensemble accuracy of an increasing number of base clusterings, we perform experiments for cluster ensembles with the number of columns (base clusterings) increasing from 1 to 20 in steps of 1. We first generate 20 base clusterings as a pool. At each step s, we randomly pick s base clusterings from the pool, which is repeated 50 times to generate 50 (N x s) base clustering matrices (note that there are repetitions among these 50 matrices). We then run the cluster ensemble algorithms on each of them. The average MP over the 50 runs at each step is reported in Fig. 4 for nine datasets.

First, we can see that BCE is again among the top one or two on all the datasets in our experiments. Second, the MPs of most of the algorithms increase dramatically when the number of base clusterings increases from 1 to 5. After that, no distinct increase is observed. On pima, the accuracy even decreases when the number of base clusterings is larger than 10, which is possibly due to the poor performance of the base clusterings. The trends of the curves might be related to the diversity of the base clusterings. In our experiments, we only use k-means for all base clusterings, so the cluster information may become redundant after a certain number of base clusterings have been used, and the accuracy does not increase any more. The accuracy may keep on increasing with more columns if the base clusterings are generated by different algorithms.

Fig. 4  Average MP comparison with increasing number of available base clusterings (panels: Iris, Wdbc, Ionosphere, Pima, Glass, Bupa, Wine, Balance, Segmentation; x-axis: number of available base clusterings).

7.4. Row-Distributed Cluster Ensembles

For the experiments on row-distributed cluster ensembles, we divide our 20 base clustering results by rows (approximately) evenly into P partitions, with P increasing from 1 to 10 in steps of 1. We compare the performance of row-distributed BCE with distributed k-means [4]. Note that in our experiments, we use the heuristic row-distributed EM of Section 5.2.1. Although no theoretical guarantee of convergence is provided, in our observation the algorithm stops when the model parameters do not change any more within ten iterations. The comparative results on nine datasets are presented in Fig. 5. It is clear that row-distributed BCE always has a higher accuracy than distributed k-means, except on balance. For most datasets, the performance of row-distributed BCE is also more stable across a varying number of partitions, indicating its robustness.

7.5. Column-Distributed Cluster Ensembles

We run experiments for column-distributed cluster ensembles with an increasing number of base clusterings (20, 60, 120, 240, 480, 960, 1440, 1920), which are picked randomly from a pool of 3000 base clustering results. We run the client-server style algorithm of Section 5.2.2 with one client maintaining one base clustering, such that multiple clients can run in parallel. The accuracy in the column-distributed case is the same as for the general cluster ensemble using BCE, since they use exactly the same algorithm except that the column-distributed variant runs it in a distributed manner. If we ignore the communication overhead between the clients and the server, the comparison of running time between the column-distributed and general cluster ensemble is presented in Fig. 6. We can see that the column-distributed cluster ensemble is much more efficient than the general case; especially when the number of base clusterings is large, the column-distributed variant is several orders of magnitude faster. Therefore, column-distributed BCE is readily applicable to real-life settings with large datasets.

7.4. Row-Distributed Cluster Ensembles

For experiments on row-distributed cluster ensembles, we divide our 20 base clustering results by rows (approximately) evenly into P partitions, with P increasing from 1 to 10 in steps of 1. We compare the performance of row-distributed BCE with distributed k-means [4]. Note that in our experiments we use the heuristic row-distributed EM as in Section 5.2.1. Although no theoretical guarantee of convergence is provided, in our observation the algorithm stops once the model parameters no longer change within ten iterations. The comparative results on the nine datasets are presented in Fig. 5. It is clear that row-distributed BCE always has a higher accuracy than distributed k-means, except on Balance. For most datasets, the performance of row-distributed BCE is also more stable across a varying number of partitions, indicating its robustness.

Fig. 5 Average MP with increasing number of distributed partitions.

7.5. Column-Distributed Cluster Ensembles

We run experiments for column-distributed cluster ensembles with an increasing number of base clusterings (20, 60, 120, 240, 480, 960, 1440, 1920), picked randomly from a pool of 3000 base clustering results. We run the client-server style algorithm as in Section 5.2.2, with one client maintaining one base clustering, so that multiple clients can run in parallel. The accuracy in the column-distributed case is the same as for the general cluster ensemble using BCE, since they use exactly the same algorithm; the column-distributed variant merely runs it in a distributed manner. If we ignore the communication overhead between the clients and the server, the comparison of running time between the column-distributed and the general cluster ensemble is presented in Fig. 6. We can see that the column-distributed cluster ensemble is much more efficient than the general case; when the number of base clusterings is large, the column-distributed variant is several orders of magnitude faster. Therefore, column-distributed BCE is readily applicable to real-life settings with large datasets.

Fig. 6 The comparison of running time between column-distributed and general cluster ensemble (WineQuality dataset).
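The client-server pattern can be sketched with Python's multiprocessing. The per-column statistic computed in client_update below is only a placeholder and does not reproduce the actual variational updates of Section 5.2.2; all names are illustrative.

from multiprocessing import Pool

import numpy as np

def client_update(args):
    """One client: compute statistics from its single base clustering
    (one column of B), given the current global parameters."""
    column, global_params = args
    labels, counts = np.unique(column, return_counts=True)  # placeholder statistic
    return {"labels": labels, "counts": counts}

def server_round(B, global_params, n_workers=None):
    """Server: send the current parameters to every client (one per column),
    collect their replies in parallel, and return them for aggregation."""
    jobs = [(B[:, j], global_params) for j in range(B.shape[1])]
    with Pool(processes=n_workers) as pool:
        return pool.map(client_update, jobs)

# Example: one communication round with 8 worker processes; the server would
# combine the returned statistics into new global parameters and iterate.
# if __name__ == "__main__":
#     stats = server_round(B, global_params={}, n_workers=8)

Since each client only ever sees its own base clustering, the base clustering results themselves never need to be exchanged, which is what makes the column-distributed variant attractive for both speed and privacy.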

7.6. Generalized Cluster Ensembles

To compare GBCE with BCE, we set the weight of each original feature to 1 and the weight of each base clustering result to D in GBCE, where D is the number of features in the original data points. Similar to Section 7.3, we first generate 50 base clustering results as a pool; then, at each step s, we randomly pick s base clusterings from the pool to run GBCE and BCE, where s increases from 1 to 10. Fig. 7 shows the final result averaged over 50 runs at each step s.³ We do not show results on Bupa, since the accuracies on Bupa from GBCE and BCE are exactly the same.

³ The base clusterings we use are different from those in Section 7.3, so the results for BCE differ from those in Fig. 4.

Fig. 7 Average MP comparison between BCE and GBCE with increasing number of available base clusterings.

Overall, Fig. 7 contains two cases. In the first case (on Wdbc, Ionosphere, Pima, Magic04, and Balance), the whole curve of GBCE is (mostly) above that of BCE, which shows a clear improvement from combining the original data points with the base clustering results. In the second case (on Iris, Glass, Wine, and Segmentation), GBCE has higher accuracy when only a small number of base clustering results is available, and there is no clear winner as the number of base clustering results increases. This is probably because the base clustering algorithms generate good clusterings, so when more base clustering results are combined, BCE performs as well as GBCE.

In addition, we also run experiments for GBCE with different weights on the base clustering results, varying from D/8 to 8D in multiplicative steps of 2. There is no clear trend with increasing weights. Generally, the results with weights larger than D are quite similar to each other, while the results with weights smaller than D can differ; the difference decreases as the number of base clustering results increases. We show two examples with weights {D/8, D/4, D/2, D} in Fig. 8.

Fig. 8 GBCE with different weights on base clustering results (Pima and Glass).
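How the inputs and weights for GBCE are assembled can be sketched as follows, assuming NumPy. The sketch only builds the combined matrix and a per-column weight vector and does not reproduce how the weights enter the model's likelihood; the function name is illustrative.

import numpy as np

def gbce_input(X, B, base_weight=None):
    """Stack the original features X (N x D) and the base clustering labels
    B (N x M) column-wise, and return a per-column weight vector: weight 1
    for every original feature and base_weight for every base clustering
    (default D, the number of original features, as in the experiments)."""
    _, n_features = X.shape
    _, n_base = B.shape
    if base_weight is None:
        base_weight = n_features
    combined = np.hstack([X, B])
    weights = np.concatenate([np.ones(n_features),
                              np.full(n_base, float(base_weight))])
    return combined, weights

# Example: the weight sweep behind Fig. 8, from D/8 to 8D in multiplicative steps of 2.
# D = X.shape[1]
# for w in [D / 8, D / 4, D / 2, D, 2 * D, 4 * D, 8 * D]:
#     combined, weights = gbce_input(X, B, base_weight=w)
#     ...  # fit GBCE using these column weights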

8. CONCLUSION

In this paper, we have proposed Bayesian cluster ensembles (BCE), a mixed-membership generative model for obtaining a consensus clustering by combining multiple base clustering results. BCE provides a Bayesian way to combine clusterings and entirely avoids the cluster-label correspondence problems encountered in graph-based approaches to the cluster ensemble problem. A variational approximation based algorithm is proposed for learning a Bayesian cluster ensemble. In addition, we have also proposed GBCE, which generates a consensus clustering by taking both the base clustering results and the original data points as input. Compared with existing algorithms, BCE is the most versatile because of its applicability to several variants of the cluster ensemble problem, including missing-value cluster ensembles and row-distributed and column-distributed cluster ensembles. Moreover, extensive experimental results show that BCE outperforms other algorithms in terms of accuracy and stability, and that it can be run in a distributed manner without exchanging base clustering results, thereby preserving privacy and/or yielding substantial speedups. Finally, the comparison between GBCE and BCE shows that GBCE can achieve higher accuracy than BCE, especially when only a small number of base clustering results is available.

ACKNOWLEDGMENTS

This research was supported by NSF grants IIS-0812183, IIS-0916750, CNS-1017647, IIS-1029711, NSF CAREER grant IIS-0953274, NASA grant NNX08AC36A, and NSFC grant 61003142.

REFERENCES

[1] A. Topchy, A. Jain, and W. Punch, A mixture model for clustering ensembles, In SDM, 2004, 379-390.
[2] A. Strehl and J. Ghosh, Cluster ensembles - a knowledge reuse framework for combining multiple partitions, JMLR 3 (2002), 583-617.
[3] L. Kuncheva and D. Vetrov, Evaluation of stability of k-means cluster ensembles with respect to random initialization, PAMI 28 (2006), 1798-1808.
[4] G. Jagannathan and R. Wright, Privacy-preserving distributed k-means clustering over arbitrarily partitioned data, In KDD, 2005, 593-599.
[5] X. Fern and C. Brodley, Solving cluster ensemble problems by bipartite graph partitioning, In ICML, 2004, 281-288.
[6] M. Al-Razgan and C. Domeniconi, Weighted cluster ensemble, In SDM, 2006, 258-269.
[7] X. Fern and C. Brodley, Random projection for high dimensional data clustering: a cluster ensemble approach, In ICML, 2003, 186-193.
[8] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J Sci Comput 20 (1999), 359-392.
[9] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, Multilevel hypergraph partitioning: applications in VLSI design, In ACM/IEEE Design Automation Conference, 1997, 526-529.
[10] A. Fred and A. Jain, Data clustering using evidence accumulation, In ICPR, 2002, 276-280.
[11] T. Li, C. Ding, and M. Jordan, Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization, In ICDM, 2007.
[12] P. Kellam, X. Liu, N. Martin, C. Orengo, S. Swift, and A. Tucker, Comparing, contrasting and combining clusters in viral gene expression data, In Workshop on Intelligent Data Analysis in Medicine and Pharmacology, 2001, 56-62.
[13] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data, Mach Learn J 52 (2003), 91-118.
[14] A. Banerjee and H. Shan, Latent Dirichlet conditional naive-Bayes models, In ICDM, 2007, 421-426.
[15] H. Shan and A. Banerjee, Mixed-membership naive Bayes models, Data Mining and Knowledge Discovery, 2010.
[16] D. Blei, A. Ng, and M. Jordan, Latent Dirichlet allocation, JMLR 3 (2003), 993-1022.
[17] R. Neal and G. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, In Learning in Graphical Models, 1998, 355-368.
[18] Z. Zhou and W. Tang, Clusterer ensemble, Knowl Based Syst 1 (2006), 77-83.
[19] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[20] B. Frey and D. Dueck, Clustering by passing messages between data points, Science 315 (2007), 972-976.