Bayesian Cluster Ensembles


Hongjun Wang¹, Hanhuai Shan², and Arindam Banerjee²

¹ Information Research Institute, Southwest Jiaotong University, Chengdu, Sichuan, China
² Department of Computer Science & Engineering, University of Minnesota, Twin Cities, Minneapolis, MN

Received 14 November 2009; revised 19 October 2010; accepted 27 October 2010
DOI: /sam
Published online in Wiley Online Library (wileyonlinelibrary.com).

Abstract: Cluster ensembles provide a framework for combining multiple base clusterings of a dataset to generate a stable and robust consensus clustering. There are important variants of the basic cluster ensemble problem, notably including cluster ensembles with missing values, as well as row-distributed or column-distributed cluster ensembles. Existing cluster ensemble algorithms are applicable only to a small subset of these variants. In this paper, we propose Bayesian cluster ensembles (BCE), a mixed-membership model for learning cluster ensembles that is applicable to all the primary variants of the problem. We propose a variational approximation based algorithm for learning Bayesian cluster ensembles. BCE is further generalized to deal with the case where the features of the original data points are available, referred to as generalized BCE (GBCE). We compare BCE extensively with several other cluster ensemble algorithms, and demonstrate that BCE is not only versatile in terms of its applicability but also outperforms the other algorithms in terms of stability and accuracy. Moreover, GBCE can have higher accuracy than BCE, especially with only a small number of available base clusterings. 2011 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2011

Keywords: cluster ensembles; Bayesian models

1. INTRODUCTION

Cluster ensembles provide a framework for combining multiple base clusterings of a dataset into a single consolidated clustering. Compared to individual clustering algorithms, cluster ensembles generate more robust and stable clustering results [1]. In principle, cluster ensembles can leverage distributed computing by calculating the base clusterings in an entirely distributed manner [2]. In addition, since cluster ensembles only need access to the base clustering results instead of the original data points, they provide a convenient approach to privacy preservation and knowledge reuse [2]. Such desirable aspects have made the study of cluster ensembles increasingly important in the context of data mining.

In addition to generating a consensus clustering from a complete set of base clusterings, it is highly desirable for cluster ensemble algorithms to have several additional properties suitable for real-life applications. First, there may be missing values in the base clusterings. For example, in a customer segmentation application, while there are legacy clusterings on old customers, there will be no clusterings on new customers. Cluster ensemble algorithms should be able to build consensus clusters with such missing information on base clusterings. Second, there may be restrictions on bringing all the base clusterings to one place to run the cluster ensemble algorithm. Such restrictions may be due to the fact that the base clusterings belong to different organizations and cannot be shared with each other. Cluster ensemble algorithms should be able to work with such column-distributed base clusterings. Third, the data points themselves may be distributed over multiple locations; while it is possible to get a base clustering across the entire dataset by message passing, base clusterings for different parts of the data will be in different locations, and there may be restrictions on bringing them together at one place.

Correspondence to: Hanhuai Shan ([email protected])
For example, in a customer segmentation application, different vendors may have different subsets of customers, and a base clustering of all the customers can be performed using privacy-preserving clustering algorithms; however, the cluster assignments of the customer subsets are private information for each vendor, which they will be unwilling to share directly for the purposes of forming a consensus clustering. Again, it is desirable to have cluster ensemble algorithms handle such row-distributed base clusterings. Finally, in many real-world scenarios, features of the original data points are available. These features could be the ones used for generating the base clusterings in the ensemble, or they could be new information currently becoming available, such as new purchasing records of a customer. In such a situation, a cluster ensemble algorithm which is able to combine both the base clustering results and the data points' features is expected to generate a better consensus clustering than one using the base clustering results alone.

Current cluster ensemble algorithms, such as the cluster-based similarity partitioning algorithm (CSPA) [2], the hypergraph partitioning algorithm (HGPA) [2], or k-means based algorithms [3], are applicable to one or two of the above variants of the problem. However, none of them was designed to address all of the variants. In principle, the recently proposed mixture modeling approach to learning cluster ensembles [1] is applicable to the variants, but the details have not been reported in the literature.

In this paper, we propose Bayesian cluster ensembles (BCE), which solve the basic cluster ensemble problem using a Bayesian approach, that is, by effectively maintaining a distribution over all possible consensus clusterings. BCE also seamlessly generalizes to all the important variants discussed above. Similar to the mixture modeling approach, BCE treats all base clustering results for each data point as a vector with a discrete value on each dimension, and learns a mixed-membership model from such a representation. In addition, we extend BCE to generalized BCE (GBCE), which learns a consensus clustering from both the base clusterings and the feature vectors of the original data points. Extensive empirical evaluation demonstrates that BCE is not only versatile in terms of its applicability but also mostly outperforms the other cluster ensemble algorithms in terms of stability and accuracy. Moreover, GBCE can have higher accuracy than BCE, especially when there are only a small number of available base clusterings.

The rest of the paper is organized as follows. In Section 2, we give a problem definition. Section 3 presents the related work on cluster ensembles. The model for BCE is proposed in Section 4, and a variational inference algorithm is discussed in Section 5. Section 6 proposes GBCE. We report experimental results in Section 7, and conclude in Section 8.

2. PROBLEM FORMULATION

Given N data points O = {o_i, [i]_1^N} (where [i]_1^N denotes i = 1, ..., N) and M base clustering algorithms C = {c_j, [j]_1^M}, we get M base clusterings of the data points, one from each algorithm. The only requirement on a base clustering algorithm is that it generates a cluster assignment or id for each of the N data points {o_i, [i]_1^N}. The number of clusters generated by different base clustering algorithms may differ. We denote the number of clusters generated by c_j by k_j, so that the cluster ids assigned by c_j range from 1 to k_j. If λ_ij ∈ {1, ..., k_j} denotes the cluster id assigned to o_i by c_j, the base clustering algorithm c_j gives a clustering of the entire dataset, given by

$$\lambda_j = \{\lambda_{ij}, [i]_1^N\} = \{c_j(o_i), [i]_1^N\}.$$

The results from the M base clustering algorithms can be stacked together to form an (N × M) matrix B, whose jth column is λ_j, as shown in panel (a) of Fig. 1. The matrix can be viewed from another perspective: each row x_i of the matrix, that is, all base clustering results for o_i, gives a new vector representation of the data point o_i (panel (b) of Fig. 1). In particular, x_i = {x_ij, [j]_1^M} = {c_j(o_i), [j]_1^M}. Given the base clustering matrix B, the cluster ensemble problem is to combine the M base clustering results for the N data points to generate a consensus clustering, which should be more accurate, robust, and stable than the individual base clusterings.

Fig. 1 Two ways of processing base clustering results for cluster ensembles.

The traditional approach to processing the base clustering results is column-wise (panel (a) of Fig. 1): we consider B as a set of M columns of base clustering results {λ_j, [j]_1^M}, and we try to find the consensus clustering λ. The disadvantage of the column-wise perspective is that it needs to find the correspondence between the different base clusters generated by different algorithms. For example, in panel (a) of Fig. 1, we need to know whether cluster 1 in the first column corresponds to cluster 1, 2, or 3 in the second column. The cluster correspondence problem is hard to solve efficiently, and the complexity increases especially when different base clustering algorithms generate different numbers of clusters [1]. A simpler approach to the cluster ensemble problem, which is what we use in this paper, is to read the matrix B in a row-wise (panel (b) of Fig. 1) way.
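To make the row-wise representation concrete, the following sketch (our illustration, not part of the original paper; it assumes scikit-learn and a feature matrix X) stacks M k-means runs with different random initializations into the (N × M) base clustering matrix B:

    import numpy as np
    from sklearn.cluster import KMeans

    def base_clustering_matrix(X, n_clusters=3, M=20, seed=0):
        """Stack M base clusterings column-wise into an (N, M) matrix B."""
        N = X.shape[0]
        B = np.empty((N, M), dtype=int)
        for j in range(M):
            # each column lambda_j holds the cluster ids from one k-means run
            B[:, j] = KMeans(n_clusters=n_clusters, n_init=1,
                             random_state=seed + j).fit_predict(X)
        return B

Each row B[i] then plays the role of x_i; since cluster ids are only meaningful within a column, the row-wise view sidesteps the correspondence problem entirely.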

All base clustering results for a data point o_i can be considered as a vector x_i with discrete values on each dimension [1], and we consider the base clustering matrix B as a set of N rows of M-dimensional vectors {x_i, [i]_1^N}. From this perspective, the cluster ensemble problem becomes finding a clustering λ for {x_i, [i]_1^N}, where λ is a consensus clustering over all base clusterings. Further, by considering the cluster ensemble problem from this perspective, we naturally avoid the cluster correspondence problem, because for each x_i, λ_i1 and λ_i2 are just two features; they are conditionally independent in the naive Bayes setting for clustering.

While the basic cluster ensemble framework assumes that all base clustering results for all data points are available in one place to perform the analysis, real-life applications often need variants of the basic setting. In this paper, we discuss four important variants: missing value cluster ensembles, row- and column-distributed cluster ensembles, and cluster ensembles with original data points.

2.1 Missing Value Cluster Ensembles

When several base clustering results are missing for several data points, we have a missing value cluster ensemble problem. Such a problem appears for various reasons. For example, if there are new data points added to the dataset after running clustering algorithm c_j, these new data points will not have base clustering results corresponding to c_j. In missing value cluster ensembles, instead of dealing with a full base clustering matrix B, we are dealing with a matrix with missing entries.

2.2 Row-Distributed Cluster Ensembles

For row-distributed cluster ensembles, base clustering results of different data points (rows) are at different locations. The corresponding real-life scenario is that different subsets of the original dataset are owned by different organizations, or cannot be put together in one place due to size, communication, or privacy constraints. While distributed base clustering algorithms, such as distributed privacy-preserving k-means [4], can be run on the subsets to generate base clustering results, due to the restrictions on sharing, the results on different subsets cannot be transmitted to a central location for analysis. Therefore, it is desirable to learn a consensus clustering in a row-distributed manner.

2.3 Column-Distributed Cluster Ensembles

For column-distributed cluster ensembles, different base clustering results of all data points are at different locations. The corresponding real-life scenario is that separate organizations have different base clusterings on the same set of data points, for example, different e-commerce vendors having customer segmentations on the same customer base. The base clusterings cannot be shared with others due to privacy concerns, but each organization has an incentive to get a more robust consensus clustering. In such a case, the cluster ensemble problem has to be solved in a column-distributed way.

2.4 Cluster Ensembles with Original Data Points

In many real-life scenarios, not only the base clustering results but also the features of the original data points are available. For example, a company may have both the customer segmentations and the purchasing records. The features of the original data points could be the ones used to generate the base clustering results, for example, the purchasing records used to generate the existing customer segmentations. The features could also be new information that has recently become available, for example, new purchasing records of customers. In such cases, we may lose useful information by running cluster ensemble algorithms on the base clustering results only.
Meanwhile, if the base clustering algorithms do not perform very well, a combination of them usually fails to yield a good consensus clustering. Therefore, a cluster ensemble algorithm which can take both the base clustering results and the original data points is expected to generate a better consensus clustering.

3. RELATED WORK

In this section, we give a brief overview of cluster ensemble algorithms. There are three main classes of algorithms: graph-based models, matrix-based models, and probabilistic models.

3.1 Graph-Based Models

The most popular algorithms for cluster ensembles are graph-based models [2,5-7]. The main idea of this class of algorithms is to convert the results of the base clusterings into a hypergraph or a graph and then use graph partitioning algorithms to obtain the ensemble clusters. Strehl and Ghosh [2] present three graph-based cluster ensemble algorithms: CSPA [2] induces a graph from a co-association matrix, and the graph is partitioned by the METIS algorithm [8] to obtain the final clusters. In addition, HGPA [2] represents each cluster and its corresponding objects by a hyperedge and nodes, respectively, and then uses the minimal-cut algorithm HMETIS [9] for partitioning.

Further, hyperedge collapsing operations are used in a meta-clustering algorithm (MCLA) [2], which determines a soft cluster membership for each object. Fern and Brodley [5] propose a bipartite graph partitioning algorithm. It solves the cluster ensemble problem by reducing it to graph partitioning, and introduces a new reduction method that constructs a bipartite graph from the base clusterings. The graph model considers both objects and clusters of the ensemble as vertices simultaneously. Al-Razgan and Domeniconi [6] propose a weighted bipartite partitioning algorithm (WBPA), which maps the problem of finding a consensus partition to bipartite graph partitioning.

3.2 Matrix-Based Models

The second class of algorithms are matrix-based models [10-13]. The main idea of this category is to convert the base clustering matrix into another matrix, such as a co-association matrix, consensus matrix, or non-negative matrix, and to use matrix operations to get the cluster ensemble results. Fred and Jain [10] map various base clustering results to a co-association matrix, where each entry represents the strength of association between objects, based on the co-occurrence of two objects in the same cluster. A voting algorithm is applied to the co-association matrix to obtain the final result. Clusters are formed from the co-association matrix by collecting the objects whose co-association values exceed a threshold. Kellam et al. [12] combine results of base clusterings through a co-association matrix, which is an agreement matrix with each cell containing the number of agreements among the base clustering methods. The co-association matrix is used to find the clusters with the highest value of support based on object co-occurrences. As a result, only a set of so-called robust clusters are produced. Monti et al. [13] define a consensus matrix for representing and quantifying the agreement among the results of base clusterings. For each pair of objects, the matrix stores the proportion of clustering runs in which the two objects are clustered together. Li et al. [11] illustrate that the cluster ensemble problem can be formulated under the framework of non-negative matrix factorization (NMF), which refers to the problem of factorizing a given non-negative data matrix X into two matrix factors, that is, X ≈ AB, under the constraint that A and B are non-negative matrices.

3.3 Probabilistic Models

The third class of cluster ensemble algorithms are based on probabilistic models [1]. These algorithms take advantage of statistical properties of the base clustering results to achieve a consensus clustering. Topchy et al. [1] consider a representation of multiple clusterings as a set of new attributes characterizing the data items, and a mixture model (MM) offers a probabilistic model of consensus using a finite mixture of multinomial distributions in the space of base clusterings. A consensus result is found as a solution to the corresponding maximum likelihood problem using the expectation maximization (EM) algorithm.

4. BAYESIAN CLUSTER ENSEMBLES

In this section, we propose a novel Bayesian cluster ensemble model. The main idea is as follows: given a base clustering matrix B = {x_i, [i]_1^N} for N data points, we assume there exists a Bayesian graphical model generating B. In particular, we assume that each vector x_i has an underlying mixed membership in the different consensus clusters. Let θ_i denote the latent mixed-membership vector for x_i; if there are k consensus clusters, θ_i is a discrete distribution over the k clusters. From the generative model perspective, we assume that θ_i is sampled from a Dirichlet distribution with parameter α, and the consensus cluster h, [h]_1^k, for each x_ij, [j]_1^M, is sampled from θ_i separately. Further, each latent consensus cluster h has a discrete distribution β_hj over the cluster ids {1, ..., k_j} of the jth base clustering result of each x_i. Thus, if x_ij truly belongs to consensus cluster h, x_ij = r ∈ {1, ..., k_j} will be determined by the discrete probability distribution β_hj(r) = p(x_ij | β_hj), where β_hj(r) ≥ 0 and Σ_{r=1}^{k_j} β_hj(r) = 1. The full generative process for each x_i is assumed to be as follows (Fig. 2):

1. Choose θ_i ~ Dirichlet(α).
2. For the jth base clustering:
   (a) Choose a component z_ij = h ~ discrete(θ_i);
   (b) Choose the base clustering result x_ij ~ discrete(β_hj).

Thus, the model contains the model parameters (α, β), where β = {β_hj, [h]_1^k, [j]_1^M}, the latent variables (θ_i, z_ij), and the actual observations {x_ij, [i]_1^N, [j]_1^M}. BCE can be viewed as a special case of mixed-membership naive Bayes models [14,15] obtained by choosing a discrete distribution as the generative model. Further, BCE is closely related to LDA [16], although the two models are applicable to different types of data.
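For intuition, the generative process can be simulated directly; the sketch below (illustrative only, with assumed inputs alpha and beta, not the authors' code) samples a base clustering matrix from the BCE model:

    import numpy as np

    def sample_base_clusterings(alpha, beta, N, rng=None):
        """Sample an (N, M) base clustering matrix from the BCE generative model.

        alpha: (k,) Dirichlet parameter over consensus clusters.
        beta:  list of M arrays; beta[j] has shape (k, k_j), rows summing to 1.
        """
        rng = np.random.default_rng(rng)
        M, k = len(beta), len(alpha)
        B = np.empty((N, M), dtype=int)
        for i in range(N):
            theta = rng.dirichlet(alpha)               # step 1: mixed membership
            for j in range(M):
                h = rng.choice(k, p=theta)             # step 2(a): consensus cluster
                B[i, j] = rng.choice(len(beta[j][h]), p=beta[j][h])  # step 2(b)
        return B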

Fig. 2 Graphical model for BCE.

Given the model parameters α and β, the joint distribution of the latent and observed variables {x_i, z_i, θ_i} is given by:

$$p(x_i, \theta_i, z_i \mid \alpha, \beta) = p(\theta_i \mid \alpha) \prod_{j=1,\, \exists x_{ij}}^{M} p(z_{ij} = h \mid \theta_i)\, p(x_{ij} \mid \beta_{hj}),$$

where ∃x_ij denotes that there exists a jth base clustering result for x_i, so the product is only over the existing base clustering results. By integrating over the latent variables {z_i, θ_i}, the marginal probability of each x_i is given by:

$$p(x_i \mid \alpha, \beta) = \int_{\theta_i} p(\theta_i \mid \alpha) \prod_{j=1,\, \exists x_{ij}}^{M} \sum_h p(z_{ij} = h \mid \theta_i)\, p(x_{ij} \mid \beta_{hj})\, d\theta_i. \quad (1)$$

BCE can be considered as a generalization of the mixture model [1] to a Bayesian model for the cluster ensemble problem. In MM, for each x_i we pick a z_i, which may take the values [0, 0, 1], [0, 1, 0], or [1, 0, 0] in a three-cluster problem; that is, there are only three possible values for the consensus clustering in the generative process. In comparison, in BCE, for each x_i we pick a θ_i, which can be any valid discrete distribution, that is, in this case, any three-dimensional vector with each dimension larger than 0 and the three dimensions summing to 1. Also, we keep a Dirichlet distribution over all possible θ_i's. Such a scheme makes BCE better than MM for two reasons: (i) the membership vector of BCE has a much larger set of choices than MM's; (ii) BCE allows mixed membership (a membership to multiple consensus clusters) in the generative process, while MM only allows a sole membership (a membership to only one consensus cluster). Therefore, BCE is more flexible than mixture-model based cluster ensembles.

5. VARIATIONAL INFERENCE FOR BCE

We have assumed a generative process for the base clustering matrix B = {x_i, [i]_1^N} in Section 4. Given the observed matrix B, our final goal is to estimate the mixed membership {θ_i, [i]_1^N} of each object in the consensus clusters. Since the model parameters α and β are unknown, we also have to estimate the model parameters such that the log-likelihood of observing the base clustering matrix B is maximized. EM algorithms are typically used for such parameter estimation problems, alternating between calculating the posterior over the latent variables and updating the model parameters until convergence. However, the posterior distribution

$$p(\theta, z \mid x, \alpha, \beta) = \frac{p(\theta, z, x \mid \alpha, \beta)}{p(x \mid \alpha, \beta)} \quad (2)$$

cannot be calculated in closed form, since the denominator (partition function) p(x | α, β), as an expansion of Eq. (1), is given by

$$p(x_i \mid \alpha, \beta) = \frac{\Gamma(\sum_h \alpha_h)}{\prod_h \Gamma(\alpha_h)} \int \left( \prod_{h=1}^{k} \theta_{ih}^{\alpha_h - 1} \right) \prod_{j=1}^{M} \sum_{h=1}^{k} \prod_{r=1}^{k_j} \left( \theta_{ih}\, \beta_{hj}(r) \right)^{\delta_i(r,j)} d\theta_i,$$

where δ_i(r, j) is an indicator taking value 1 if the jth base clustering assigns o_i to base cluster r and 0 otherwise, and β_hj(r) is the rth component of the discrete distribution β_hj for the hth consensus cluster and the jth base clustering. The coupling between θ and β in the summation over the latent variable z makes the computation intractable [16]. There are two main classes of approximation algorithms to address such problems: one is variational inference, and the other is Gibbs sampling. In this paper, we present the variational inference method.

5.1 Variational Inference

Since it is intractable to calculate the true posterior in Eq. (2) directly, in variational inference we introduce a family of distributions as an approximation of the posterior distribution over the latent variables, yielding a tractable lower bound on the log-likelihood log p(x | α, β). We maximize this lower bound to update the parameter estimates. In particular, following [14,16], we introduce a family of variational distributions

$$q(\theta_i, z_i \mid \gamma_i, \phi_i) = q(\theta_i \mid \gamma_i) \prod_{j=1}^{M} q(z_{ij} \mid \phi_{ij}) \quad (3)$$

as an approximation of p(θ_i, z_i | α, β, x_i) in Eq. (2), where γ_i is a Dirichlet distribution parameter and φ_i = {φ_ij, [j]_1^M} are discrete distribution parameters. We introduce such an approximating distribution for each x_i, [i]_1^N. Now, using Jensen's inequality [17], we can obtain a lower bound L(α, β; φ_i, γ_i) on log p(x_i | α, β).

The lower bound is given by:

$$L(\alpha, \beta; \phi_i, \gamma_i) = E_q[\log p(\theta_i, z_i, x_i \mid \alpha, \beta)] + H(q(\theta_i, z_i \mid \gamma_i, \phi_i)), \quad (4)$$

where H(·) denotes the Shannon entropy. Assuming each row x_i of the matrix B to be statistically independent given the parameters (α, β), the log-likelihood of observing the matrix B is simply

$$\log p(B \mid \alpha, \beta) = \sum_{i=1}^{N} \log p(x_i \mid \alpha, \beta) \geq \sum_{i=1}^{N} L(\alpha, \beta; \phi_i, \gamma_i).$$

For a fixed set of model parameters (α, β), maximizing the lower bound with respect to the free variational parameters (γ_i, φ_i) for each x_i, [i]_1^N, gives us the best lower bound from this family of approximations. A direct calculation leads to the following set of update equations for the variational maximization:

$$\phi_{ijh} \propto \exp\left( \Psi(\gamma_{ih}) - \Psi\Big(\sum_{h'=1}^{k} \gamma_{ih'}\Big) + \sum_{r=1}^{k_j} \delta_i(r,j) \log \beta_{hj}(r) \right), \quad (5)$$

$$\gamma_{ih} = \alpha_h + \sum_{j=1,\, \exists x_{ij}}^{M} \phi_{ijh}, \quad (6)$$

where [i]_1^N, [j]_1^M, [h]_1^k, φ_ijh is the hth component of the variational discrete distribution φ_ij for z_ij, and γ_ih is the hth component of the variational Dirichlet distribution γ_i for θ_i. For a given set of variational parameters (γ_i, φ_i), [i]_1^N, the lower bound in Eq. (4) is maximized by the point estimate for β:

$$\beta_{hj}(r) \propto \sum_{i=1}^{N} \phi_{ijh}\, \delta_i(r,j), \quad (7)$$

where [h]_1^k, [j]_1^M, [r]_1^{k_j}. The Dirichlet parameter α can be estimated via Newton-Raphson updates as in LDA [16]. In particular, the update equation for α_h is given by

$$\alpha_h = \alpha_h - \frac{g_h - c}{l_h}, \quad (8)$$

with

$$g_h = N\left( \Psi\Big(\sum_{h'=1}^{k} \alpha_{h'}\Big) - \Psi(\alpha_h) \right) + \sum_{i=1}^{N} \left( \Psi(\gamma_{ih}) - \Psi\Big(\sum_{h'=1}^{k} \gamma_{ih'}\Big) \right),$$

$$l_h = N\, \Psi'(\alpha_h), \qquad c = \frac{\sum_{h=1}^{k} g_h / l_h}{v^{-1} + \sum_{h=1}^{k} l_h^{-1}}, \qquad v = N\, \Psi'\Big(\sum_{h=1}^{k} \alpha_h\Big),$$

where Ψ is the digamma function, that is, the first derivative of the log Gamma function, and Ψ' is its derivative.

5.2 Variational EM Algorithms

Given the update equations for the variational parameters and the model parameters, we can use a variational EM algorithm to find the best-fit model (α*, β*). Starting from an initial guess (α^(0), β^(0)), the EM algorithm alternates between two steps until convergence:

1. E-step: Given (α^(t-1), β^(t-1)), for each x_i, find the best variational parameters:

$$(\phi_i^{(t)}, \gamma_i^{(t)}) = \arg\max_{(\phi_i, \gamma_i)} L(\alpha^{(t-1)}, \beta^{(t-1)}; \phi_i, \gamma_i).$$

L(α, β; φ_i^(t), γ_i^(t)) then serves as a lower bound function on log p(x_i | α, β).

2. M-step: Maximize the aggregate lower bound with respect to (α, β) to obtain an improved parameter estimate:

$$(\alpha^{(t)}, \beta^{(t)}) = \arg\max_{(\alpha, \beta)} \sum_{i=1}^{N} L(\alpha, \beta; \phi_i^{(t)}, \gamma_i^{(t)}).$$

After (t − 1) iterations, the value of the lower bound function is Σ_{i=1}^N L(α^(t-1), β^(t-1); φ_i^(t-1), γ_i^(t-1)). In the tth iteration,

$$\sum_{i=1}^{N} L(\alpha^{(t-1)}, \beta^{(t-1)}; \phi_i^{(t-1)}, \gamma_i^{(t-1)}) \leq \sum_{i=1}^{N} L(\alpha^{(t-1)}, \beta^{(t-1)}; \phi_i^{(t)}, \gamma_i^{(t)}) \quad (9)$$

$$\leq \sum_{i=1}^{N} L(\alpha^{(t)}, \beta^{(t)}; \phi_i^{(t)}, \gamma_i^{(t)}). \quad (10)$$
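As a minimal illustration of the E-step, the following sketch iterates Eqs. (5) and (6) for a single data point; the 0-based cluster ids and the use of −1 for a missing base clustering are assumptions of the example, not part of the paper:

    import numpy as np
    from scipy.special import digamma

    def e_step(x, alpha, beta, n_iter=20):
        """Variational E-step for one row x of B (Eqs. (5) and (6)).

        x:     length-M array of 0-based base cluster ids, -1 where missing.
        alpha: (k,) Dirichlet parameter; beta: list of M arrays of shape (k, k_j).
        """
        M, k = len(x), len(alpha)
        phi = np.full((M, k), 1.0 / k)       # q(z_ij = h)
        gamma = alpha + M / k                # simple LDA-style initialization
        for _ in range(n_iter):
            for j in range(M):
                if x[j] < 0:                 # skip missing base clusterings
                    continue
                log_phi = digamma(gamma) - digamma(gamma.sum()) \
                          + np.log(beta[j][:, x[j]])          # Eq. (5)
                log_phi -= log_phi.max()                      # numerical stability
                phi[j] = np.exp(log_phi)
                phi[j] /= phi[j].sum()
            observed = [j for j in range(M) if x[j] >= 0]
            gamma = alpha + phi[observed].sum(axis=0)         # Eq. (6)
        return phi, gamma

Iterating these two coupled updates until φ and γ stabilize gives the inner loop of the E-step; the M-step then re-estimates β via Eq. (7) and α via Eq. (8).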

The first inequality holds because, in the E-step, Eq. (9) is the maximum of L(α^(t-1), β^(t-1); φ_i, γ_i), and the second inequality holds because, in the M-step, Eq. (10) is the maximum of L(α, β; φ_i^(t), γ_i^(t)). Therefore, the objective function is nondecreasing until convergence [17].

Regarding computational complexity, since we only need to calculate Ψ(γ_ih) − Ψ(Σ_{h'} γ_ih') once for all φ_ijh, [j]_1^M, the complexity of updating φ in each E-step is O((Nk² + NMku)t_E), where u = max{k_j, [j]_1^M} and t_E is the number of iterations inside each E-step. Also, the time for updating γ is O(NMkt_E). In the M-step, the complexity of updating β is O(NMku). α is updated using Newton updates, and the time needed is O(kNt_α), where t_α is the number of iterations in the Newton updates. Compared to the MM-based cluster ensemble algorithm [1], BCE is computationally more expensive, since it iterates over Eqs. (5) and (6) inside the E-step, while MM [1] uses a direct EM algorithm whose E-step has a closed form. However, as we show in the experiments, BCE achieves significantly better performance than MM.

5.2.1 Row-distributed EM algorithm

In row-distributed cluster ensembles, the object set O is partitioned into P parts {O^(1), O^(2), ..., O^(P)}, and different parts are assumed to be at different locations. We further assume that a set of distributed base clustering algorithms have been used to obtain the base clustering results {B^(1), B^(2), ..., B^(P)}. We now outline a row-distributed variant of the variational inference algorithm. At each iteration t, given the initialization of the model parameters (α^(t-1), β^(t-1)), row-distributed variational EM for BCE proceeds as follows:

1. For each partition B^(p), [p]_1^P, we obtain the variational parameters (φ^(p), γ^(p)) following Eqs. (5) and (6), where φ^(p) = {φ_i : x_i ∈ B^(p)} and γ^(p) = {γ_i : x_i ∈ B^(p)}.

2. To update β following Eq. (7), we can write the right-hand side of Eq. (7) as

$$\sum_{x_i \in B^{(1)}} \phi_{ijh}\, \delta_i(r,j) + \cdots + \sum_{x_i \in B^{(P)}} \phi_{ijh}\, \delta_i(r,j).$$

Each part of the summation corresponds to one partition of B. To update β_hj(r), first, Δ^(p) = Σ_{x_i ∈ B^(p)} φ_ijh δ_i(r,j) is calculated for each B^(p). Second, for each B^(p) (p ∈ [2, P]), we take Σ_{q=1}^{p-1} Δ^(q) from B^(p-1), generate Σ_{q=1}^{p} Δ^(q) by adding Δ^(p) to the summation, and pass it to B^(p+1). Finally, after passing through all partitions, we have the full summation on the right-hand side of Eq. (7), which updates β_hj(r) after normalization.

3. Updating α is a little tricky since it does not have a closed-form solution. However, we notice that the update Eq. (8) for α_h only depends on two variables: α_h and {γ_i, [i]_1^N}. α_h can be obtained from the last iteration of the Newton-Raphson algorithm. Regarding γ, we only need to know Σ_{i=1}^N Ψ(γ_ih) and Σ_{i=1}^N Ψ(Σ_h γ_ih) for g_h in Eq. (8). We use the same strategy as for updating β: first we calculate Δ_p = Σ_{x_i ∈ B^(p)} Ψ(γ_ih) and Δ'_p = Σ_{x_i ∈ B^(p)} Ψ(Σ_h γ_ih) on each partition. Second, for each B^(p) (p ∈ [2, P]), we take Σ_{q=1}^{p-1} Δ_q and Σ_{q=1}^{p-1} Δ'_q from B^(p-1), generate Σ_{q=1}^{p} Δ_q and Σ_{q=1}^{p} Δ'_q by adding Δ_p and Δ'_p to the summations respectively, and pass them to B^(p+1). Finally, after going through all partitions, we have the result for Σ_{i=1}^N (Ψ(γ_ih) − Ψ(Σ_h γ_ih)), so we can update α_h following Eq. (8). For each iteration of the Newton-Raphson algorithm, we need to pass the summations through all partitions once.

By the end of the tth iteration, we have the updated model parameters (α^(t), β^(t)), which are used as the initialization for the (t+1)th iteration. The algorithm is guaranteed to converge since it is essentially the same as the EM algorithm for the general case, except that it works in a row-distributed way.
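A schematic of the pass-through aggregation in steps 2 and 3 (hypothetical interfaces; in a real deployment each partition would be a separate site and only the running total would be transmitted):

    import numpy as np

    def pass_through_sum(local_stats):
        """Sequentially accumulate partition-level sufficient statistics.

        local_stats: list over partitions p of arrays Delta^(p), e.g. the local
        sums of phi_ijh * delta_i(r, j) needed for the beta update in Eq. (7).
        Only the running total is passed from partition p to partition p + 1.
        """
        running = np.zeros_like(local_stats[0])
        for delta_p in local_stats:       # visit partitions in order 1..P
            running = running + delta_p   # site p adds its own local sum
        return running                    # final total; normalize to update beta

The key property is that `running` is the only message exchanged between sites, which is what gives the row-distributed algorithm its privacy-preservation behavior.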
By running EM in this distributed fashion, neither {O^(p), [p]_1^P} nor {B^(p), [p]_1^P} is passed around between the different parties, only the intermediate summations; in this sense, we achieve privacy preservation. As we have noted, however, updating α is very expensive because it needs to pass the summations over all partitions for each Newton-Raphson iteration, which is practically infeasible for a dataset with a large number of partitions. Therefore, we next give a heuristic row-distributed EM, which does not have a theoretical guarantee of convergence, but worked well in practice in our experiments. At each iteration t, given the initialization of the model parameters (α_(1)^(t-1), β_(1)^(t-1)), heuristic row-distributed variational EM for BCE proceeds as follows:

1. For the first partition B^(1), given (α_(1)^(t-1), β_(1)^(t-1)), we obtain the variational parameters (φ^(1), γ^(1)) following Eqs. (5) and (6). Also, we update (α_(1), β_(1)) to get (α_(1)^(t), β_(1)^(t)) following Eqs. (8) and (7), respectively.

2. For the pth partition B^(p), we initialize (α_(p), β_(p)) with (α_(p-1)^(t), β_(p-1)^(t)) and obtain (φ^(p), γ^(p)) following Eqs. (5) and (6). We update (α_(p)^(t), β_(p)^(t)) and pass them to the (p+1)th partition.

After going over all partitions, we are done with the tth iteration; the iterations are repeated until convergence. The initialization for (α_(1)^(1), β_(1)^(1)) in the first iteration can be picked at random or by using some heuristic, and the initialization for (α_(1), β_(1)) in the tth iteration comes from (α_(P)^(t-1), β_(P)^(t-1)). The iterations run until the net change in the lower bound value is below a threshold, or until a pre-fixed number of iterations is reached.

5.2.2 Column-distributed EM algorithm

For column-distributed cluster ensembles, we design a client-server style algorithm, where each client maintains one base clustering, and the server gathers partial results from the clients and performs further processing. While we assume that there are M different clients, one can always work with a smaller number of clients by splitting the columns among the available clients. Given the initialization of the model parameters (α^(t), β^(t)), where (α^(t), β_j^(t)) is made available to the jth client, the column-distributed cluster ensemble at iteration t proceeds as follows:

1. E-step, jth client: Given x_ij and β_j^(t) for [i]_1^N, the jth client calculates Σ_{r=1}^{k_j} δ_i(r,j) log β_hj^(t)(r) for [i]_1^N, [h]_1^k, and passes the results to the E-step server.

2. E-step, server: Given Σ_{r=1}^{k_j} δ_i(r,j) log β_hj^(t)(r) from the clients, for [i]_1^N, [j]_1^M, [h]_1^k, the server calculates the variational parameters {φ_ijh, [i]_1^N, [j]_1^M, [h]_1^k} following Eq. (5). Given α^(t) and {φ_ijh, [i]_1^N, [j]_1^M, [h]_1^k}, the server updates {γ_ih, [i]_1^N, [h]_1^k} following Eq. (6). The parameters {φ_ijh, [i]_1^N, [h]_1^k} are passed to the M-step jth client, and {γ_ih, [i]_1^N, [h]_1^k} are passed to the M-step server.

3. M-step, jth client: Given x_ij and φ_ijh for [i]_1^N, [h]_1^k, β_j^(t+1)(·) is updated following Eq. (7) and passed to the E-step server for the (t+1)th iteration.

4. M-step, server: Given α^(t) and γ_ih for [i]_1^N, [h]_1^k, α^(t+1) is updated following Eq. (8) and passed to the E-step server for the next step.

The initialization (α^(0), β^(0)) is chosen at the beginning of the first iteration. In iteration t, (α^(t), β^(t)) are initialized with (α^(t-1), β^(t-1)), that is, the results of the (t−1)th iteration. The algorithm is guaranteed to converge because it is essentially the same as the EM algorithm for general cluster ensembles, except that it runs in a column-distributed way. The algorithm is expected to be more efficient than the general cluster ensemble if we ignore the communication overhead. In addition, the jth client/server only has access to the jth base clustering results. The communication is only for the parameters and intermediate results, instead of the base clusterings; therefore, privacy preservation is also achieved. In BCE, the most computationally expensive part of the E-step is the update for φ. By running column-distributed EM, we parallelize most of the computation in updating φ, so the time complexity of updating φ in each E-step decreases from O((Nk² + NMku)t_E) to O((Nk² + Nku)t_E). In the M-step, the cost of updating β decreases from O(NMku) to O(Nku) through parallelization.
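For illustration, the jth client's E-step contribution can be as small as the following sketch (assumed data layout, not the authors' implementation):

    import numpy as np

    def client_estep_contribution(x_col, log_beta_j):
        """jth client: holds only column j of B and its own beta_j.

        x_col:      (N,) 0-based cluster ids assigned by base clustering j.
        log_beta_j: (k, k_j) array of log beta_hj(r).
        Returns an (N, k) array of log beta_hj(x_ij) terms sent to the server,
        which combines them with the digamma(gamma) terms to form phi (Eq. (5)).
        """
        return log_beta_j[:, x_col].T   # select the column per object, shape (N, k)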
6. GENERALIZED BCE

Most cluster ensemble algorithms only combine the base clustering results to generate a consensus clustering, which may not make good use of the data when features of the original data points are available; moreover, the performance of the ensemble algorithm is highly restricted by the base clustering algorithms, that is, the chances of obtaining a good consensus clustering from a set of very poor base clusterings are low. In this section, we propose the GBCE algorithm, which overcomes these two drawbacks by combining both the base clustering results and the feature vectors of the original data points to yield a consensus clustering.

The main idea of GBCE is as follows: for each data point, we concatenate its D-dimensional feature vector o_i after the M-dimensional base clustering vector x_i to get an (M + D)-dimensional vector y_i. Following Shan and Banerjee [15], the generative process for y_i is as follows:

1. Choose θ_i ~ Dirichlet(α).
2. For each nonmissing base clustering, that is, y_ij, [j]_1^M:
   (a) Choose a component z_ij = h ~ discrete(θ_i);
   (b) Choose the base clustering result y_ij ~ discrete(β_hj).
3. For each nonmissing feature of the data point, that is, y_ij, [j]_{M+1}^{M+D}:
   (a) Choose a component z_ij = h ~ discrete(θ_i);
   (b) Choose the feature value y_ij ~ p_ψj(y_ij | ζ_hj).

The first two steps are the same as in BCE. The difference is the new step 3, where each feature of the data point is generated from p_ψj(y_ij | ζ_hj), an exponential family distribution for feature j and consensus cluster h [15]. ψ_j in p_ψj(y_ij | ζ_hj) determines a particular family for feature j, such as Gaussian, Poisson, etc., and ζ_hj determines a particular parameter for the distribution in that family. For ease of exposition, we assume that all original features have real values generated from Gaussian distributions; then p_ψj(y_ij | ζ_hj) can be denoted by [15]

$$p(y_{ij} \mid \mu_{hj}, \sigma_{hj}^2) = \frac{1}{\sqrt{2\pi \sigma_{hj}^2}} \exp\left( -\frac{(y_{ij} - \mu_{hj})^2}{2\sigma_{hj}^2} \right),$$

where μ_hj and σ²_hj are the mean and variance of the Gaussian distribution for cluster h and feature j.

The proposed generative model gives an intuitive way to combine the feature vectors with the base clustering results, but it suffers from the limitation that it treats each feature of the original data point as being as important as each base clustering result. In such cases, for high-dimensional data points with D ≫ M, the feature vectors will dominate the consensus clustering, i.e., the consensus clustering result is almost the same as that obtained by running the algorithm on only the data points, without base clustering results. Therefore, we further generalize the algorithm to allow different weights for different base clustering results and different features, yielding generalized BCE. Given non-negative integral weights u = {u_j, [j]_1^{M+D}} for y_ij, [j]_1^{M+D}, the generative process of GBCE for y_i with weights u is as follows:

1. Choose θ_i ~ Dirichlet(α).
2. For each nonmissing base clustering, that is, y_ij, [j]_1^M, repeat 2(a) and 2(b) u_j times:
   (a) Choose a component z_ij = h ~ discrete(θ_i);
   (b) Choose the base clustering result y_ij ~ discrete(β_hj).
3. For each nonmissing original feature, that is, y_ij, [j]_{M+1}^{M+D}, repeat 3(a) and 3(b) u_j times:
   (a) Choose a component z_ij = h ~ discrete(θ_i);
   (b) Choose the feature value y_ij ~ N(μ_hj, σ²_hj).

Therefore, BCE is a special case of GBCE obtained by setting u_j = 1 for [j]_1^M and u_j = 0 for [j]_{M+1}^{M+D}. The marginal probability of a weighted y_i is given by

$$p(y_i \mid \alpha, \beta, \mu, \sigma^2, u) = \int_{\theta_i} p(\theta_i \mid \alpha) \prod_{j=1,\, \exists y_{ij}}^{M} \left( \sum_h p(z_{ij} = h \mid \theta_i)\, p(y_{ij} \mid \beta_{hj}) \right)^{u_j} \prod_{j=M+1,\, \exists y_{ij}}^{M+D} \left( \sum_h p(z_{ij} = h \mid \theta_i)\, p(y_{ij} \mid \mu_{hj}, \sigma_{hj}^2) \right)^{u_j} d\theta_i. \quad (11)$$

In GBCE, if we set u_j = 1 for [j]_{M+1}^{M+D} and u_j = D for [j]_1^M, we are treating each base clustering as being as important as the whole feature vector of the data point, instead of a single feature. We can also set different weights for different y_ij based on the confidence in the clustering accuracy, the importance of the feature, etc. In addition, in the generative process the weights have been assumed to be non-negative integers, since they denote repetition counts, but the learning algorithm we discuss below still holds when the u_j are generalized to positive real numbers, yielding a very flexible model. From the generative process, GBCE actually does not generate y_i, but generates a new vector ỹ_i with y_ij repeated u_j times to incorporate the weights. However, we do not need to create ỹ_i explicitly to learn the model.

For inference and parameter estimation, similar to Section 5, we introduce a family of variational distributions

$$q(\theta_i, z_i \mid \gamma_i, \phi_i, u) = q(\theta_i \mid \gamma_i) \prod_{j=1}^{M+D} q(z_{ij} \mid \phi_{ij})^{u_j}$$

to approximate p(θ_i, z_i | α, β, u, y_i). The update equations for the variational parameters are given by

$$\phi_{ijh} \propto \exp\left( \Psi(\gamma_{ih}) - \Psi\Big(\sum_{h'} \gamma_{ih'}\Big) + \sum_{r=1}^{k_j} \delta_i(r,j) \log \beta_{hj}(r) \right), \quad [j]_1^M, \quad (12)$$

$$\phi_{ijh} \propto \exp\left( \Psi(\gamma_{ih}) - \Psi\Big(\sum_{h'} \gamma_{ih'}\Big) - \log \sigma_{hj} - \frac{(y_{ij} - \mu_{hj})^2}{2\sigma_{hj}^2} \right), \quad [j]_{M+1}^{M+D}, \quad (13)$$

$$\gamma_{ih} = \alpha_h + \sum_{j=1,\, \exists y_{ij}}^{M+D} u_j\, \phi_{ijh}, \quad (14)$$

where [i]_1^N and [h]_1^k. For the model parameters, the update equation for α is the same as in Eq. (8), and the equations for the rest of the parameters are given by

$$\beta_{hj}(r) \propto u_j \sum_{i=1}^{N} \phi_{ijh}\, \delta_i(r,j), \quad (15)$$

$$\mu_{hj} = \frac{\sum_{i=1,\, \exists y_{ij}}^{N} u_j\, \phi_{ijh}\, y_{ij}}{\sum_{i=1,\, \exists y_{ij}}^{N} u_j\, \phi_{ijh}}, \quad (16)$$

$$\sigma_{hj}^2 = \frac{\sum_{i=1,\, \exists y_{ij}}^{N} u_j\, \phi_{ijh}\, (y_{ij} - \mu_{hj})^2}{\sum_{i=1,\, \exists y_{ij}}^{N} u_j\, \phi_{ijh}}, \quad (17)$$

where [h]_1^k, with [j]_1^M and [r]_1^{k_j} for Eq. (15), and [j]_{M+1}^{M+D} for Eqs. (16) and (17).
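As a quick illustration of Eqs. (16) and (17), a weighted Gaussian M-step for a single original feature might look like the following sketch (assumed array shapes, not the paper's code; a constant weight u_j cancels in these ratios but still matters in the γ update of Eq. (14)):

    import numpy as np

    def gaussian_m_step(y_col, phi_col, u_j, present):
        """Update (mu_hj, sigma2_hj) for one feature j and all clusters h.

        y_col:   (N,) values of feature j; phi_col: (N, k) responsibilities;
        u_j:     scalar weight for this feature; present: (N,) bool mask,
                 True where y_ij is nonmissing.
        """
        w = u_j * phi_col[present]                        # (N_obs, k) weights
        y = y_col[present][:, None]                       # (N_obs, 1)
        mu = (w * y).sum(axis=0) / w.sum(axis=0)          # Eq. (16), shape (k,)
        sigma2 = (w * (y - mu) ** 2).sum(axis=0) / w.sum(axis=0)   # Eq. (17)
        return mu, sigma2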

7. EXPERIMENTAL RESULTS

In this section, we run experiments on datasets from the UCI machine learning repository and KDD Cup 1999. In particular, for the UCI data, we pick 12 datasets which are relatively small. (For wine quality we only keep the data points in the three main classes, so the classes with very few data points are removed.) For the KDD Cup data, there are four main classes among 37 classes in total. We randomly pick data points from these four main classes and divide them into two parts, so we have two relatively large datasets with one million data points each. The number of objects, features, and classes in each dataset is listed in Table 1, where kdd99-1 and kdd99-2 are from KDD Cup 1999 and the rest are from the UCI machine learning repository.

Table 1. The number of instances, features, and classes in each dataset (pima, iris, wdbc, balance, glass, bupa, wine, magic04, ionosphere, segmentation, kdd99-1, kdd99-2, chess, and wine quality).

For all reported results, there are two steps leading to the final consensus clustering. First, we run base clustering algorithms to get a set of base clustering results. Second, various cluster ensemble algorithms, including MM [1], CSPA, HGPA, MCLA [2], and k-means, are applied to the base clustering results to generate a consensus clustering. We compare their results with BCE. The comparison between BCE and the other cluster ensemble algorithms is divided into five categories as follows:

1. General cluster ensembles (general).
2. Cluster ensembles with missing values (miss-v).
3. Cluster ensembles with an increasing number of columns (increase-c), that is, additional base clusterings.
4. Column-distributed cluster ensembles (column-d).
5. Row-distributed cluster ensembles (row-d).

Table 2 shows the five categories of experiments and the six cluster ensemble algorithms we use. We can see that most of the algorithms can only accomplish a few tasks among the five. In principle, MM can be generalized to deal with all five scenarios; however, the literature does not have an explicit algorithm for column- or row-distributed cluster ensembles using MM. As we can see from Table 2, BCE is the most flexible and versatile among the six algorithms.

Table 2. The applicability of algorithms to different experimental settings: √ indicates that the algorithm is applicable, and × indicates otherwise.

Algorithm | General | Miss-v | Increase-c | Column-d | Row-d
BCE       |    √    |   √    |     √      |    √     |   √
MM        |    √    |   √    |     √      |    ×     |   ×
CSPA      |    √    |   √    |     √      |    ×     |   ×
HGPA      |    √    |   √    |     √      |    ×     |   ×
MCLA      |    √    |   √    |     √      |    ×     |   ×
k-means   |    √    |   ×    |     √      |    ×     |   ×

For evaluation, we use micro-precision [18] to measure the accuracy of the consensus clustering with respect to the true labels. The micro-precision is defined as

$$MP = \sum_{h=1}^{k} a_h / n,$$

where k is the number of clusters, n is the number of objects, and a_h denotes the number of objects in consensus cluster h that are correctly assigned to the corresponding class. We identify the corresponding class for consensus cluster h as the true class with the largest overlap with the cluster, and assign all objects in cluster h to that class. Note that 0 ≤ MP ≤ 1, with 1 indicating the best possible consensus clustering, which has to be in full agreement with the class labels.
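For concreteness, micro-precision can be computed as in the following sketch (our illustration, not the paper's code):

    import numpy as np

    def micro_precision(consensus, labels):
        """MP = sum_h a_h / n, where a_h is the number of objects in consensus
        cluster h that fall in the true class with the largest overlap."""
        consensus, labels = np.asarray(consensus), np.asarray(labels)
        total = 0
        for h in np.unique(consensus):
            members = labels[consensus == h]
            _, counts = np.unique(members, return_counts=True)
            total += counts.max()          # a_h: largest overlap with any class
        return total / len(labels)

Note that a consensus clustering that splits one true class into two clusters still scores each cluster against its best-matching class, so MP rewards cluster purity rather than an exact one-to-one matching.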

Table 3. k-means with different initializations used as the base clustering algorithms: (a) maximum MP for the different cluster ensemble algorithms; (b) average MP for the different cluster ensemble algorithms (magic04, kdd99-1, and kdd99-2 are too large, so CSPA could not finish its run). The highest MP among the different algorithms on each dataset is bolded. The datasets are iris, wdbc, ionosphere, glass, bupa, pima, wine, magic04, balance, segmentation, kdd99-1, and kdd99-2; the columns are the base clusterings (k-means) and the cluster ensembles (BCE, MM, CSPA, HGPA, MCLA, k-means).

In the following subsections, we present the experimental results for the five categories of experiments in Table 2, starting with general cluster ensembles.

7.1 General Cluster Ensembles

In this subsection, we run two types of experiments: one only uses k-means as the base clustering algorithm, and the other uses multiple algorithms as the base clustering algorithms.

Given N objects, we first use k-means as the base clustering algorithm on 12 datasets. For the ten UCI datasets, we run k-means 2000 times with different initializations to obtain 2000 base clustering results, which are divided evenly into 100 subsets, with 20 base clustering results in each of them. For the two large datasets kdd99-1 and kdd99-2, we run the experiments following the same strategy, but we keep three subsets with five base clustering results in each of them. The cluster ensemble algorithms are then applied to each subset. The maximum and average MPs over all subsets are reported in Tables 3(a) and 3(b).

We also use k-means, fuzzy c-means (FCM) [19], METIS [8], and affinity propagation (AP) [20] as the base clustering algorithms on 11 datasets for the cluster ensemble.¹ By running k-means 500 times, FCM 800 times, METIS 200 times, and AP 500 times with different initializations, we again obtain 2000 base clustering results. Following the same strategy as above to run the cluster ensemble algorithms, we obtain the maximum and average MPs in Tables 5(a) and 5(b).

The key observations from Tables 3 and 5 can be summarized as follows: (i) BCE almost always has a higher maximum and average MP than the base clustering results, which means the consensus clustering from BCE is indeed better in quality than the original base clusterings. (ii) BCE outperforms the other cluster ensemble algorithms most of the time in terms of both maximum and average MP, no matter which base clustering algorithms are used.

Since the results of BCE and MM are rather close to each other, to make a careful comparison we run a paired t-test under the hypotheses H_0: MP(BCE) = MP(MM) versus H_a: MP(MM) < MP(BCE).

¹ We run this set of experiments on relatively small datasets since METIS and AP cannot run on large ones such as kdd99-1 and kdd99-2.

The test is designed to assess the strength of the evidence against H_0 and in support of H_a. This strength is measured by the p-value; a lower p-value indicates stronger evidence. In our case, a lower p-value indicates that the performance improvement of BCE over MM is statistically significant. Usually a p-value less than 0.05 is considered strong evidence. The results are shown in Tables 4 and 6, respectively. BCE outperforms MM with a low p-value (<0.05) most of the time, indicating that MP(BCE) is significantly better than MP(MM) on these datasets. In addition, the smaller standard deviation of BCE shows that it is more stable than MM.

Table 4. k-means with different initializations used as the base clustering algorithms: paired t-test for BCE and MM, where Mean-D is the mean of the MP differences (BCE − MM), and sd-BCE (sd-MM) is the standard deviation of the MPs from BCE (MM). For Mean-D, the datasets where BCE performs better are bolded. For sd-BCE and sd-MM, the one with the smaller standard deviation is bolded. The datasets are iris, wdbc, ionosphere, glass, bupa, pima, wine, magic04, balance, segmentation, kdd99-1, and kdd99-2.

Table 5. k-means, FCM, AP, and METIS used as the base clustering algorithms: (a) maximum MP for the different cluster ensemble algorithms; (b) average MP for the different cluster ensemble algorithms. The highest MP among the different algorithms on each dataset is bolded. The datasets are iris, wdbc, ionosphere, glass, bupa, pima, wine, balance, segmentation, chess, and wine quality; the columns are the base clusterings (k-means, FCM, AP, METIS) and the cluster ensembles (BCE, MM, CSPA, HGPA, MCLA, k-means).

Table 6. k-means, FCM, AP, and METIS used as the base clustering algorithms: paired t-test for BCE and MM, with Mean-D, sd-BCE, sd-MM, and the p-value defined as in Table 4. The datasets are iris, wdbc, ionosphere, glass, bupa, pima, wine, balance, segmentation, chess, and wine quality.
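The paired t-test reported in Tables 4 and 6 can be reproduced along these lines (a sketch with made-up scores; mp_bce and mp_mm would hold the MPs of BCE and MM over the same base clustering subsets):

    import numpy as np
    from scipy.stats import ttest_rel

    # hypothetical paired MP scores over the same base clustering subsets
    mp_bce = np.array([0.82, 0.79, 0.85, 0.81, 0.80])
    mp_mm  = np.array([0.80, 0.78, 0.83, 0.82, 0.77])

    t_stat, p_two_sided = ttest_rel(mp_bce, mp_mm)
    # one-sided p-value for H_a: MP(MM) < MP(BCE)
    p_value = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    print(f"mean difference = {np.mean(mp_bce - mp_mm):.3f}, p = {p_value:.3f}")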

7.2 Cluster Ensembles with Missing Values

Given 20 base clustering results for N objects, we randomly hold out p percent of the data as missing values, with p increasing from 0 to 90 in steps of 10.² We compare the performance of the different algorithms except k-means, because k-means cannot handle missing values. Each time we run the algorithms ten times, and we report the MP on nine datasets in Fig. 3. Surprisingly, before the missing value percentage reaches 70%, most algorithms maintain a stable MP with an increasing number of missing entries, without a distinct decrease in accuracy. BCE is always among the top one or two in terms of accuracy across the different percentages of missing values, indicating that BCE is one of the best algorithms for dealing with missing value cluster ensembles. Comparatively, one of the baselines consistently shows the worst performance in terms of both accuracy and stability.

Fig. 3 Average MP with increasing percentage of missing values (panels for the iris, pima, wine, wdbc, glass, balance, ionosphere, bupa, and segmentation datasets).

² Starting from this subsection, we only use k-means as the base clustering algorithm.

7.3 Cluster Ensembles with Increasing Columns

In order to find out the effect of an increasing number of base clusterings on the cluster ensemble accuracy, we perform experiments for cluster ensembles with the number of columns (base clusterings) increasing from 1 to 20 in steps of 1. We first generate 20 base clusterings as a pool. At each step s, we randomly pick s base clusterings from the pool, which is repeated 50 times to generate 50 (N × s) base clustering matrices (note there are repetitions among these 50 matrices). We then run the cluster ensemble on each of them. The average MP over the 50 runs at each step is reported in Fig. 4 for nine datasets.

Fig. 4 Average MP comparison with increasing number of available base clusterings (panels for the iris, wdbc, ionosphere, pima, glass, bupa, wine, balance, and segmentation datasets).

First, we can see that BCE is again among the top one or two on all the datasets in our experiments. Second, the MPs of most algorithms increase dramatically when the number of base clusterings increases from 1 to 5. After that, no distinct increase is observed. On pima, the accuracy even decreases when the number of base clusterings is larger than 10, which is possibly due to the poor performance of the base clusterings. The trends of the curves might be related to the diversity of the base clusterings. In our experiments, we only use k-means for all base clusterings, so the cluster information may become redundant after a certain number of base clusterings have been used, and the accuracy does not increase anymore. The accuracy may keep increasing with more columns if the base clusterings are generated by different algorithms.

7.4 Row-Distributed Cluster Ensembles

For the experiments on row-distributed cluster ensembles, we divide our 20 base clustering results by rows (approximately) evenly into P partitions, with P increasing from 1 to 10 in steps of 1. We compare the performance of row-distributed BCE with distributed k-means [4]. Note that in our experiments we use the heuristic row-distributed EM of Section 5.2.1. Although no theoretical guarantee of convergence is provided, in our observation the algorithm stops when the model parameters no longer change within ten iterations. The comparative results on nine datasets are presented in Fig. 5. It is clear that row-distributed BCE always has a higher accuracy than distributed k-means, except on balance. For most datasets, the performance of row-distributed BCE is also more stable across a varying number of partitions, indicating its robustness.

7.5 Column-Distributed Cluster Ensembles

We run experiments for column-distributed cluster ensembles with an increasing number of base clusterings (20, 60, 120, 240, 480, 960, 1440, 1920), which are picked randomly from a pool of 3000 base clustering results. We run the client-server style algorithm of Section 5.2.2, with one client maintaining one base clustering, such that multiple clients can run in parallel. The accuracy in the column-distributed case is the same as for the general cluster ensemble using BCE, since they use exactly the same algorithm, except that the column-distributed variant runs it in a distributed manner. If we ignore the communication overhead between the clients and the server, the comparison of running time between the column-distributed and general cluster ensembles is presented in Fig. 6. We can see that the column-distributed cluster ensemble is much more efficient than the general case; especially when the number of base clusterings is large, the column-distributed variant is several orders of magnitude faster. Therefore, column-distributed BCE is readily applicable to real-life settings with large datasets.

Fig. 5 Average MP with increasing number of distributed partitions (panels for the iris, ionosphere, wdbc, glass, bupa, pima, balance, wine, and segmentation datasets).

Fig. 6 The comparison of running time (seconds) between column-distributed and general cluster ensembles on the wine quality dataset, as the number of base clusterings increases.

7.6 Generalized Cluster Ensembles

To compare GBCE with BCE, we set the weight of each original feature to 1 and the weight of each base clustering result to D in GBCE, where D is the number of features of the original data points. Similar to Section 7.3, we first generate 50 base clustering results as a pool; then, at each step s, we randomly pick s base clusterings from the pool to run GBCE and BCE, where s increases from 1 to 10. Fig. 7 shows the final result averaged over 50 runs at each step s.³ We do not show results on bupa, since the accuracies on bupa from GBCE and BCE are exactly the same.

³ The base clusterings we use are different from those in Section 7.3, so the result of BCE is different from Fig. 4.

Overall, Fig. 7 contains two cases of the comparison. In the first case (on wdbc, ionosphere, pima, magic04, and balance), the whole curve of GBCE is (mostly) above BCE's, which shows a clear improvement from combining the original data points with the base clustering results. In the second case (on iris, glass, wine, and segmentation), GBCE has higher accuracy when there are only a small number of base clustering results, and there is no clear winner when the number of base clustering results increases. This is probably because the base clustering algorithms generate good clusterings, so when more base clustering results are combined together, BCE performs as well as GBCE.

Fig. 7 Average MP comparison between BCE and GBCE with increasing number of available base clusterings (panels for the iris, wdbc, ionosphere, pima, glass, magic04, wine, balance, and segmentation datasets).

In addition, we also run experiments for GBCE with different weights for the base clustering results, varying from D/8 to 8D in multiplicative steps of 2. There is no clear trend with increasing weights. Generally, the results with weights larger than D are quite similar to each other, while the results with weights smaller than D can differ, but the difference decreases as the number of base clustering results increases. We show two examples with weights {D/8, D/4, D/2, D} in Fig. 8.

8. CONCLUSION

In this paper, we have proposed Bayesian cluster ensembles, a mixed-membership generative model for obtaining a consensus clustering by combining multiple base clustering results. BCE provides a Bayesian way to combine clusterings, and entirely avoids the cluster label correspondence problems encountered in graph-based approaches to the cluster ensemble problem. A variational approximation based algorithm is proposed for learning a Bayesian cluster ensemble. In addition, we have also proposed GBCE, which generates a consensus clustering by taking in both the base clustering results and the original data points. Compared with existing algorithms, BCE is the most versatile because of its applicability to several variants of the cluster ensemble problem, including missing value cluster ensembles and row-distributed and column-distributed cluster ensembles. In addition, extensive experimental results show that BCE outperforms other algorithms in terms of accuracy and stability, and that it can be run in a distributed manner without exchanging base clustering results, thereby achieving privacy preservation and/or substantial speedups. Finally, the comparison between GBCE and BCE shows that GBCE can achieve higher accuracy than BCE, especially when only a small number of base clustering results are available.

Fig. 8 GBCE with different weights (D/8, D/4, D/2, D) on the base clustering results (panels for the pima and glass datasets).

ACKNOWLEDGMENTS

This research was supported by NSF grants IIS, IIS, CNS, IIS, NSF CAREER grant IIS, NASA grant NNX08AC36A, and an NSFC grant.

REFERENCES

[1] A. Topchy, A. Jain, and W. Punch, A mixture model for clustering ensembles, In SDM, 2004.
[2] A. Strehl and J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining multiple partitions, JMLR 3 (2002).
[3] L. Kuncheva and D. Vetrov, Evaluation of stability of k-means cluster ensembles with respect to random initialization, PAMI 28 (2006).
[4] G. Jagannathan and R. Wright, Privacy-preserving distributed k-means clustering over arbitrarily partitioned data, In KDD, 2005.
[5] X. Fern and C. Brodley, Solving cluster ensemble problems by bipartite graph partitioning, In ICML, 2004.
[6] M. Al-Razgan and C. Domeniconi, Weighted cluster ensemble, In SDM, 2006.
[7] X. Fern and C. Brodley, Random projection for high dimensional data clustering: a cluster ensemble approach, In ICML, 2003.
[8] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J Sci Comput 20 (1999).
[9] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, Multilevel hypergraph partitioning: applications in VLSI design, In ACM/IEEE Design Automation Conference, 1997.
[10] A. Fred and A. Jain, Data clustering using evidence accumulation, In ICPR, 2002.
[11] T. Li, C. Ding, and M. Jordan, Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization, In ICDM, 2007.
[12] P. Kellam, X. Liu, N. Martin, C. Orengo, S. Swift, and A. Tucker, Comparing, contrasting and combining clusters in viral gene expression data, In Workshop on Intelligent Data Analysis in Medicine and Pharmacology, 2001.
[13] S. Monti, P. Tamayo, J. Mesirov, and T. Golub, Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data, Mach Learn J 52 (2003).
[14] A. Banerjee and H. Shan, Latent Dirichlet conditional naive-Bayes models, In ICDM, 2007.
[15] H. Shan and A. Banerjee, Mixed-membership naive Bayes models, Data Mining and Knowledge Discovery.
[16] D. Blei, A. Ng, and M. Jordan, Latent Dirichlet allocation, JMLR 3 (2003).
[17] R. Neal and G. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, In Learning in Graphical Models, 1998.
[18] Z. Zhou and W. Tang, Clusterer ensemble, Knowl Based Syst 1 (2006).
[19] J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[20] B. Frey and D. Dueck, Clustering by passing messages between data points, Science 315 (2007).


L10: Linear discriminants analysis

L10: Linear discriminants analysis L0: Lnear dscrmnants analyss Lnear dscrmnant analyss, two classes Lnear dscrmnant analyss, C classes LDA vs. PCA Lmtatons of LDA Varants of LDA Other dmensonalty reducton methods CSCE 666 Pattern Analyss

More information

Sketching Sampled Data Streams

Sketching Sampled Data Streams Sketchng Sampled Data Streams Florn Rusu, Aln Dobra CISE Department Unversty of Florda Ganesvlle, FL, USA [email protected] [email protected] Abstract Samplng s used as a unversal method to reduce the

More information

DEFINING %COMPLETE IN MICROSOFT PROJECT

DEFINING %COMPLETE IN MICROSOFT PROJECT CelersSystems DEFINING %COMPLETE IN MICROSOFT PROJECT PREPARED BY James E Aksel, PMP, PMI-SP, MVP For Addtonal Informaton about Earned Value Management Systems and reportng, please contact: CelersSystems,

More information

Joint Scheduling of Processing and Shuffle Phases in MapReduce Systems

Joint Scheduling of Processing and Shuffle Phases in MapReduce Systems Jont Schedulng of Processng and Shuffle Phases n MapReduce Systems Fangfe Chen, Mural Kodalam, T. V. Lakshman Department of Computer Scence and Engneerng, The Penn State Unversty Bell Laboratores, Alcatel-Lucent

More information

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek HE DISRIBUION OF LOAN PORFOLIO VALUE * Oldrch Alfons Vascek he amount of captal necessary to support a portfolo of debt securtes depends on the probablty dstrbuton of the portfolo loss. Consder a portfolo

More information

A Fast Incremental Spectral Clustering for Large Data Sets

A Fast Incremental Spectral Clustering for Large Data Sets 2011 12th Internatonal Conference on Parallel and Dstrbuted Computng, Applcatons and Technologes A Fast Incremental Spectral Clusterng for Large Data Sets Tengteng Kong 1,YeTan 1, Hong Shen 1,2 1 School

More information

Latent Class Regression. Statistics for Psychosocial Research II: Structural Models December 4 and 6, 2006

Latent Class Regression. Statistics for Psychosocial Research II: Structural Models December 4 and 6, 2006 Latent Class Regresson Statstcs for Psychosocal Research II: Structural Models December 4 and 6, 2006 Latent Class Regresson (LCR) What s t and when do we use t? Recall the standard latent class model

More information

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic Lagrange Multplers as Quanttatve Indcators n Economcs Ivan Mezník Insttute of Informatcs, Faculty of Busness and Management, Brno Unversty of TechnologCzech Republc Abstract The quanttatve role of Lagrange

More information

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies 1O i 6. Expected Frequencies 1EXP i 6 PAR TESTS If a WEIGHT varable s specfed, t s used to replcate a case as many tmes as ndcated by the weght value rounded to the nearest nteger. If the workspace requrements are exceeded and samplng has

More information

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence 1 st Internatonal Symposum on Imprecse Probabltes and Ther Applcatons, Ghent, Belgum, 29 June 2 July 1999 How Sets of Coherent Probabltes May Serve as Models for Degrees of Incoherence Mar J. Schervsh

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Max Wellng Department of Computer Scence Unversty of Toronto 10 Kng s College Road Toronto, M5S 3G5 Canada [email protected] Abstract Ths s a note to explan support vector machnes.

More information

BERNSTEIN POLYNOMIALS

BERNSTEIN POLYNOMIALS On-Lne Geometrc Modelng Notes BERNSTEIN POLYNOMIALS Kenneth I. Joy Vsualzaton and Graphcs Research Group Department of Computer Scence Unversty of Calforna, Davs Overvew Polynomals are ncredbly useful

More information

Calculating the high frequency transmission line parameters of power cables

Calculating the high frequency transmission line parameters of power cables < ' Calculatng the hgh frequency transmsson lne parameters of power cables Authors: Dr. John Dcknson, Laboratory Servces Manager, N 0 RW E B Communcatons Mr. Peter J. Ncholson, Project Assgnment Manager,

More information

Can Auto Liability Insurance Purchases Signal Risk Attitude?

Can Auto Liability Insurance Purchases Signal Risk Attitude? Internatonal Journal of Busness and Economcs, 2011, Vol. 10, No. 2, 159-164 Can Auto Lablty Insurance Purchases Sgnal Rsk Atttude? Chu-Shu L Department of Internatonal Busness, Asa Unversty, Tawan Sheng-Chang

More information

How To Understand The Results Of The German Meris Cloud And Water Vapour Product

How To Understand The Results Of The German Meris Cloud And Water Vapour Product Ttel: Project: Doc. No.: MERIS level 3 cloud and water vapour products MAPP MAPP-ATBD-ClWVL3 Issue: 1 Revson: 0 Date: 9.12.1998 Functon Name Organsaton Sgnature Date Author: Bennartz FUB Preusker FUB Schüller

More information

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008

Risk-based Fatigue Estimate of Deep Water Risers -- Course Project for EM388F: Fracture Mechanics, Spring 2008 Rsk-based Fatgue Estmate of Deep Water Rsers -- Course Project for EM388F: Fracture Mechancs, Sprng 2008 Chen Sh Department of Cvl, Archtectural, and Envronmental Engneerng The Unversty of Texas at Austn

More information

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by 6 CHAPTER 8 COMPLEX VECTOR SPACES 5. Fnd the kernel of the lnear transformaton gven n Exercse 5. In Exercses 55 and 56, fnd the mage of v, for the ndcated composton, where and are gven by the followng

More information

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network *

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 24, 819-840 (2008) Data Broadcast on a Mult-System Heterogeneous Overlayed Wreless Network * Department of Computer Scence Natonal Chao Tung Unversty Hsnchu,

More information

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK Sample Stablty Protocol Background The Cholesterol Reference Method Laboratory Network (CRMLN) developed certfcaton protocols for total cholesterol, HDL

More information

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications CMSC828G Prncples of Data Mnng Lecture #9 Today s Readng: HMS, chapter 9 Today s Lecture: Descrptve Modelng Clusterng Algorthms Descrptve Models model presents the man features of the data, a global summary

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Dstrb. Comput. 71 (2011) 62 76 Contents lsts avalable at ScenceDrect J. Parallel Dstrb. Comput. journal homepage: www.elsever.com/locate/jpdc Optmzng server placement n dstrbuted systems n

More information

Realistic Image Synthesis

Realistic Image Synthesis Realstc Image Synthess - Combned Samplng and Path Tracng - Phlpp Slusallek Karol Myszkowsk Vncent Pegoraro Overvew: Today Combned Samplng (Multple Importance Samplng) Renderng and Measurng Equaton Random

More information

Logistic Regression. Steve Kroon

Logistic Regression. Steve Kroon Logstc Regresson Steve Kroon Course notes sectons: 24.3-24.4 Dsclamer: these notes do not explctly ndcate whether values are vectors or scalars, but expects the reader to dscern ths from the context. Scenaro

More information

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP)

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP) 6.3 / -- Communcaton Networks II (Görg) SS20 -- www.comnets.un-bremen.de Communcaton Networks II Contents. Fundamentals of probablty theory 2. Emergence of communcaton traffc 3. Stochastc & Markovan Processes

More information

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College Feature selecton for ntruson detecton Slobodan Petrovć NISlab, Gjøvk Unversty College Contents The feature selecton problem Intruson detecton Traffc features relevant for IDS The CFS measure The mrmr measure

More information

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation

Exhaustive Regression. An Exploration of Regression-Based Data Mining Techniques Using Super Computation Exhaustve Regresson An Exploraton of Regresson-Based Data Mnng Technques Usng Super Computaton Antony Daves, Ph.D. Assocate Professor of Economcs Duquesne Unversty Pttsburgh, PA 58 Research Fellow The

More information

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression Novel Methodology of Workng Captal Management for Large Publc Constructons by Usng Fuzzy S-curve Regresson Cheng-Wu Chen, Morrs H. L. Wang and Tng-Ya Hseh Department of Cvl Engneerng, Natonal Central Unversty,

More information

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING Matthew J. Lberatore, Department of Management and Operatons, Vllanova Unversty, Vllanova, PA 19085, 610-519-4390,

More information

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12 14 The Ch-squared dstrbuton PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 1 If a normal varable X, havng mean µ and varance σ, s standardsed, the new varable Z has a mean 0 and varance 1. When ths standardsed

More information

Traffic State Estimation in the Traffic Management Center of Berlin

Traffic State Estimation in the Traffic Management Center of Berlin Traffc State Estmaton n the Traffc Management Center of Berln Authors: Peter Vortsch, PTV AG, Stumpfstrasse, D-763 Karlsruhe, Germany phone ++49/72/965/35, emal [email protected] Peter Möhl, PTV AG,

More information

Calculation of Sampling Weights

Calculation of Sampling Weights Perre Foy Statstcs Canada 4 Calculaton of Samplng Weghts 4.1 OVERVIEW The basc sample desgn used n TIMSS Populatons 1 and 2 was a two-stage stratfed cluster desgn. 1 The frst stage conssted of a sample

More information

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION Vson Mouse Saurabh Sarkar a* a Unversty of Cncnnat, Cncnnat, USA ABSTRACT The report dscusses a vson based approach towards trackng of eyes and fngers. The report descrbes the process of locatng the possble

More information

Logical Development Of Vogel s Approximation Method (LD-VAM): An Approach To Find Basic Feasible Solution Of Transportation Problem

Logical Development Of Vogel s Approximation Method (LD-VAM): An Approach To Find Basic Feasible Solution Of Transportation Problem INTERNATIONAL JOURNAL OF SCIENTIFIC & TECHNOLOGY RESEARCH VOLUME, ISSUE, FEBRUARY ISSN 77-866 Logcal Development Of Vogel s Approxmaton Method (LD- An Approach To Fnd Basc Feasble Soluton Of Transportaton

More information

On the Optimal Control of a Cascade of Hydro-Electric Power Stations

On the Optimal Control of a Cascade of Hydro-Electric Power Stations On the Optmal Control of a Cascade of Hydro-Electrc Power Statons M.C.M. Guedes a, A.F. Rbero a, G.V. Smrnov b and S. Vlela c a Department of Mathematcs, School of Scences, Unversty of Porto, Portugal;

More information

Brigid Mullany, Ph.D University of North Carolina, Charlotte

Brigid Mullany, Ph.D University of North Carolina, Charlotte Evaluaton And Comparson Of The Dfferent Standards Used To Defne The Postonal Accuracy And Repeatablty Of Numercally Controlled Machnng Center Axes Brgd Mullany, Ph.D Unversty of North Carolna, Charlotte

More information

ECE544NA Final Project: Robust Machine Learning Hardware via Classifier Ensemble

ECE544NA Final Project: Robust Machine Learning Hardware via Classifier Ensemble 1 ECE544NA Fnal Project: Robust Machne Learnng Hardware va Classfer Ensemble Sa Zhang, [email protected] Dept. of Electr. & Comput. Eng., Unv. of Illnos at Urbana-Champagn, Urbana, IL, USA Abstract In

More information

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(7):1884-1889 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A hybrd global optmzaton algorthm based on parallel

More information

8 Algorithm for Binary Searching in Trees

8 Algorithm for Binary Searching in Trees 8 Algorthm for Bnary Searchng n Trees In ths secton we present our algorthm for bnary searchng n trees. A crucal observaton employed by the algorthm s that ths problem can be effcently solved when the

More information

Fast Fuzzy Clustering of Web Page Collections

Fast Fuzzy Clustering of Web Page Collections Fast Fuzzy Clusterng of Web Page Collectons Chrstan Borgelt and Andreas Nürnberger Dept. of Knowledge Processng and Language Engneerng Otto-von-Guercke-Unversty of Magdeburg Unverstätsplatz, D-396 Magdeburg,

More information

Project Networks With Mixed-Time Constraints

Project Networks With Mixed-Time Constraints Project Networs Wth Mxed-Tme Constrants L Caccetta and B Wattananon Western Australan Centre of Excellence n Industral Optmsaton (WACEIO) Curtn Unversty of Technology GPO Box U1987 Perth Western Australa

More information

Conversion between the vector and raster data structures using Fuzzy Geographical Entities

Conversion between the vector and raster data structures using Fuzzy Geographical Entities Converson between the vector and raster data structures usng Fuzzy Geographcal Enttes Cdála Fonte Department of Mathematcs Faculty of Scences and Technology Unversty of Combra, Apartado 38, 3 454 Combra,

More information

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending Proceedngs of 2012 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 25 (2012) (2012) IACSIT Press, Sngapore Bayesan Network Based Causal Relatonshp Identfcaton and Fundng Success

More information

Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification

Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification Lecture 4: More classfers and classes C4B Machne Learnng Hlary 20 A. Zsserman Logstc regresson Loss functons revsted Adaboost Loss functons revsted Optmzaton Multple class classfcaton Logstc Regresson

More information

Mining Multiple Large Data Sources

Mining Multiple Large Data Sources The Internatonal Arab Journal of Informaton Technology, Vol. 7, No. 3, July 2 24 Mnng Multple Large Data Sources Anmesh Adhkar, Pralhad Ramachandrarao 2, Bhanu Prasad 3, and Jhml Adhkar 4 Department of

More information

Analysis of Premium Liabilities for Australian Lines of Business

Analysis of Premium Liabilities for Australian Lines of Business Summary of Analyss of Premum Labltes for Australan Lnes of Busness Emly Tao Honours Research Paper, The Unversty of Melbourne Emly Tao Acknowledgements I am grateful to the Australan Prudental Regulaton

More information

Staff Paper. Farm Savings Accounts: Examining Income Variability, Eligibility, and Benefits. Brent Gloy, Eddy LaDue, and Charles Cuykendall

Staff Paper. Farm Savings Accounts: Examining Income Variability, Eligibility, and Benefits. Brent Gloy, Eddy LaDue, and Charles Cuykendall SP 2005-02 August 2005 Staff Paper Department of Appled Economcs and Management Cornell Unversty, Ithaca, New York 14853-7801 USA Farm Savngs Accounts: Examnng Income Varablty, Elgblty, and Benefts Brent

More information

Availability-Based Path Selection and Network Vulnerability Assessment

Availability-Based Path Selection and Network Vulnerability Assessment Avalablty-Based Path Selecton and Network Vulnerablty Assessment Song Yang, Stojan Trajanovsk and Fernando A. Kupers Delft Unversty of Technology, The Netherlands {S.Yang, S.Trajanovsk, F.A.Kupers}@tudelft.nl

More information

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S S C H E D A E I N F O R M A T I C A E VOLUME 0 0 On Mean Squared Error of Herarchcal Estmator Stans law Brodowsk Faculty of Physcs, Astronomy, and Appled Computer Scence, Jagellonan Unversty, Reymonta

More information

GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM

GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM GRAVITY DATA VALIDATION AND OUTLIER DETECTION USING L 1 -NORM BARRIOT Jean-Perre, SARRAILH Mchel BGI/CNES 18.av.E.Beln 31401 TOULOUSE Cedex 4 (France) Emal: [email protected] 1/Introducton The

More information

Support vector domain description

Support vector domain description Pattern Recognton Letters 20 (1999) 1191±1199 www.elsever.nl/locate/patrec Support vector doman descrpton Davd M.J. Tax *,1, Robert P.W. Dun Pattern Recognton Group, Faculty of Appled Scence, Delft Unversty

More information

Robust Design of Public Storage Warehouses. Yeming (Yale) Gong EMLYON Business School

Robust Design of Public Storage Warehouses. Yeming (Yale) Gong EMLYON Business School Robust Desgn of Publc Storage Warehouses Yemng (Yale) Gong EMLYON Busness School Rene de Koster Rotterdam school of management, Erasmus Unversty Abstract We apply robust optmzaton and revenue management

More information

The OC Curve of Attribute Acceptance Plans

The OC Curve of Attribute Acceptance Plans The OC Curve of Attrbute Acceptance Plans The Operatng Characterstc (OC) curve descrbes the probablty of acceptng a lot as a functon of the lot s qualty. Fgure 1 shows a typcal OC Curve. 10 8 6 4 1 3 4

More information

General Auction Mechanism for Search Advertising

General Auction Mechanism for Search Advertising General Aucton Mechansm for Search Advertsng Gagan Aggarwal S. Muthukrshnan Dávd Pál Martn Pál Keywords game theory, onlne auctons, stable matchngs ABSTRACT Internet search advertsng s often sold by an

More information

Improved SVM in Cloud Computing Information Mining

Improved SVM in Cloud Computing Information Mining Internatonal Journal of Grd Dstrbuton Computng Vol.8, No.1 (015), pp.33-40 http://dx.do.org/10.1457/jgdc.015.8.1.04 Improved n Cloud Computng Informaton Mnng Lvshuhong (ZhengDe polytechnc college JangSu

More information

A Performance Analysis of View Maintenance Techniques for Data Warehouses

A Performance Analysis of View Maintenance Techniques for Data Warehouses A Performance Analyss of Vew Mantenance Technques for Data Warehouses Xng Wang Dell Computer Corporaton Round Roc, Texas Le Gruenwald The nversty of Olahoma School of Computer Scence orman, OK 739 Guangtao

More information

Loop Parallelization

Loop Parallelization - - Loop Parallelzaton C-52 Complaton steps: nested loops operatng on arrays, sequentell executon of teraton space DECLARE B[..,..+] FOR I :=.. FOR J :=.. I B[I,J] := B[I-,J]+B[I-,J-] ED FOR ED FOR analyze

More information

Multi-Resource Fair Allocation in Heterogeneous Cloud Computing Systems

Multi-Resource Fair Allocation in Heterogeneous Cloud Computing Systems 1 Mult-Resource Far Allocaton n Heterogeneous Cloud Computng Systems We Wang, Student Member, IEEE, Ben Lang, Senor Member, IEEE, Baochun L, Senor Member, IEEE Abstract We study the mult-resource allocaton

More information

Enterprise Master Patient Index

Enterprise Master Patient Index Enterprse Master Patent Index Healthcare data are captured n many dfferent settngs such as hosptals, clncs, labs, and physcan offces. Accordng to a report by the CDC, patents n the Unted States made an

More information

A Secure Password-Authenticated Key Agreement Using Smart Cards

A Secure Password-Authenticated Key Agreement Using Smart Cards A Secure Password-Authentcated Key Agreement Usng Smart Cards Ka Chan 1, Wen-Chung Kuo 2 and Jn-Chou Cheng 3 1 Department of Computer and Informaton Scence, R.O.C. Mltary Academy, Kaohsung 83059, Tawan,

More information

v a 1 b 1 i, a 2 b 2 i,..., a n b n i.

v a 1 b 1 i, a 2 b 2 i,..., a n b n i. SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 455 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces we have studed thus far n the text are real vector spaces snce the scalars are

More information

A DATA MINING APPLICATION IN A STUDENT DATABASE

A DATA MINING APPLICATION IN A STUDENT DATABASE JOURNAL OF AERONAUTICS AND SPACE TECHNOLOGIES JULY 005 VOLUME NUMBER (53-57) A DATA MINING APPLICATION IN A STUDENT DATABASE Şenol Zafer ERDOĞAN Maltepe Ünversty Faculty of Engneerng Büyükbakkalköy-Istanbul

More information

A New Task Scheduling Algorithm Based on Improved Genetic Algorithm

A New Task Scheduling Algorithm Based on Improved Genetic Algorithm A New Task Schedulng Algorthm Based on Improved Genetc Algorthm n Cloud Computng Envronment Congcong Xong, Long Feng, Lxan Chen A New Task Schedulng Algorthm Based on Improved Genetc Algorthm n Cloud Computng

More information

PRACTICE 1: MUTUAL FUNDS EVALUATION USING MATLAB.

PRACTICE 1: MUTUAL FUNDS EVALUATION USING MATLAB. PRACTICE 1: MUTUAL FUNDS EVALUATION USING MATLAB. INDEX 1. Load data usng the Edtor wndow and m-fle 2. Learnng to save results from the Edtor wndow. 3. Computng the Sharpe Rato 4. Obtanng the Treynor Rato

More information

Solving Factored MDPs with Continuous and Discrete Variables

Solving Factored MDPs with Continuous and Discrete Variables Solvng Factored MPs wth Contnuous and screte Varables Carlos Guestrn Berkeley Research Center Intel Corporaton Mlos Hauskrecht epartment of Computer Scence Unversty of Pttsburgh Branslav Kveton Intellgent

More information

Properties of Indoor Received Signal Strength for WLAN Location Fingerprinting

Properties of Indoor Received Signal Strength for WLAN Location Fingerprinting Propertes of Indoor Receved Sgnal Strength for WLAN Locaton Fngerprntng Kamol Kaemarungs and Prashant Krshnamurthy Telecommuncatons Program, School of Informaton Scences, Unversty of Pttsburgh E-mal: kakst2,[email protected]

More information

Open Access A Load Balancing Strategy with Bandwidth Constraint in Cloud Computing. Jing Deng 1,*, Ping Guo 2, Qi Li 3, Haizhu Chen 1

Open Access A Load Balancing Strategy with Bandwidth Constraint in Cloud Computing. Jing Deng 1,*, Ping Guo 2, Qi Li 3, Haizhu Chen 1 Send Orders for Reprnts to [email protected] The Open Cybernetcs & Systemcs Journal, 2014, 8, 115-121 115 Open Access A Load Balancng Strategy wth Bandwdth Constrant n Cloud Computng Jng Deng 1,*,

More information

Characterization of Assembly. Variation Analysis Methods. A Thesis. Presented to the. Department of Mechanical Engineering. Brigham Young University

Characterization of Assembly. Variation Analysis Methods. A Thesis. Presented to the. Department of Mechanical Engineering. Brigham Young University Characterzaton of Assembly Varaton Analyss Methods A Thess Presented to the Department of Mechancal Engneerng Brgham Young Unversty In Partal Fulfllment of the Requrements for the Degree Master of Scence

More information

1. Math 210 Finite Mathematics

1. Math 210 Finite Mathematics 1. ath 210 Fnte athematcs Chapter 5.2 and 5.3 Annutes ortgages Amortzaton Professor Rchard Blecksmth Dept. of athematcal Scences Northern Illnos Unversty ath 210 Webste: http://math.nu.edu/courses/math210

More information

Face Verification Problem. Face Recognition Problem. Application: Access Control. Biometric Authentication. Face Verification (1:1 matching)

Face Verification Problem. Face Recognition Problem. Application: Access Control. Biometric Authentication. Face Verification (1:1 matching) Face Recognton Problem Face Verfcaton Problem Face Verfcaton (1:1 matchng) Querymage face query Face Recognton (1:N matchng) database Applcaton: Access Control www.vsage.com www.vsoncs.com Bometrc Authentcaton

More information

A Comparative Study of Data Clustering Techniques

A Comparative Study of Data Clustering Techniques A COMPARATIVE STUDY OF DATA CLUSTERING TECHNIQUES A Comparatve Study of Data Clusterng Technques Khaled Hammouda Prof. Fakhreddne Karray Unversty of Waterloo, Ontaro, Canada Abstract Data clusterng s a

More information

Binomial Link Functions. Lori Murray, Phil Munz

Binomial Link Functions. Lori Murray, Phil Munz Bnomal Lnk Functons Lor Murray, Phl Munz Bnomal Lnk Functons Logt Lnk functon: ( p) p ln 1 p Probt Lnk functon: ( p) 1 ( p) Complentary Log Log functon: ( p) ln( ln(1 p)) Motvatng Example A researcher

More information

Software project management with GAs

Software project management with GAs Informaton Scences 177 (27) 238 241 www.elsever.com/locate/ns Software project management wth GAs Enrque Alba *, J. Francsco Chcano Unversty of Málaga, Grupo GISUM, Departamento de Lenguajes y Cencas de

More information

Risk Model of Long-Term Production Scheduling in Open Pit Gold Mining

Risk Model of Long-Term Production Scheduling in Open Pit Gold Mining Rsk Model of Long-Term Producton Schedulng n Open Pt Gold Mnng R Halatchev 1 and P Lever 2 ABSTRACT Open pt gold mnng s an mportant sector of the Australan mnng ndustry. It uses large amounts of nvestments,

More information

Linear Circuits Analysis. Superposition, Thevenin /Norton Equivalent circuits

Linear Circuits Analysis. Superposition, Thevenin /Norton Equivalent circuits Lnear Crcuts Analyss. Superposton, Theenn /Norton Equalent crcuts So far we hae explored tmendependent (resste) elements that are also lnear. A tmendependent elements s one for whch we can plot an / cure.

More information

Returns to Experience in Mozambique: A Nonparametric Regression Approach

Returns to Experience in Mozambique: A Nonparametric Regression Approach Returns to Experence n Mozambque: A Nonparametrc Regresson Approach Joel Muzma Conference Paper nº 27 Conferênca Inaugural do IESE Desafos para a nvestgação socal e económca em Moçambque 19 de Setembro

More information

Estimation of Dispersion Parameters in GLMs with and without Random Effects

Estimation of Dispersion Parameters in GLMs with and without Random Effects Mathematcal Statstcs Stockholm Unversty Estmaton of Dsperson Parameters n GLMs wth and wthout Random Effects Meng Ruoyan Examensarbete 2004:5 Postal address: Mathematcal Statstcs Dept. of Mathematcs Stockholm

More information

Chapter 4 ECONOMIC DISPATCH AND UNIT COMMITMENT

Chapter 4 ECONOMIC DISPATCH AND UNIT COMMITMENT Chapter 4 ECOOMIC DISATCH AD UIT COMMITMET ITRODUCTIO A power system has several power plants. Each power plant has several generatng unts. At any pont of tme, the total load n the system s met by the

More information

Answer: A). There is a flatter IS curve in the high MPC economy. Original LM LM after increase in M. IS curve for low MPC economy

Answer: A). There is a flatter IS curve in the high MPC economy. Original LM LM after increase in M. IS curve for low MPC economy 4.02 Quz Solutons Fall 2004 Multple-Choce Questons (30/00 ponts) Please, crcle the correct answer for each of the followng 0 multple-choce questons. For each queston, only one of the answers s correct.

More information

INVESTIGATION OF VEHICULAR USERS FAIRNESS IN CDMA-HDR NETWORKS

INVESTIGATION OF VEHICULAR USERS FAIRNESS IN CDMA-HDR NETWORKS 21 22 September 2007, BULGARIA 119 Proceedngs of the Internatonal Conference on Informaton Technologes (InfoTech-2007) 21 st 22 nd September 2007, Bulgara vol. 2 INVESTIGATION OF VEHICULAR USERS FAIRNESS

More information

Joe Pimbley, unpublished, 2005. Yield Curve Calculations

Joe Pimbley, unpublished, 2005. Yield Curve Calculations Joe Pmbley, unpublshed, 005. Yeld Curve Calculatons Background: Everythng s dscount factors Yeld curve calculatons nclude valuaton of forward rate agreements (FRAs), swaps, nterest rate optons, and forward

More information

THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION

THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION Internatonal Journal of Electronc Busness Management, Vol. 3, No. 4, pp. 30-30 (2005) 30 THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION Yu-Mn Chang *, Yu-Cheh

More information

Proceedings of the Annual Meeting of the American Statistical Association, August 5-9, 2001

Proceedings of the Annual Meeting of the American Statistical Association, August 5-9, 2001 Proceedngs of the Annual Meetng of the Amercan Statstcal Assocaton, August 5-9, 2001 LIST-ASSISTED SAMPLING: THE EFFECT OF TELEPHONE SYSTEM CHANGES ON DESIGN 1 Clyde Tucker, Bureau of Labor Statstcs James

More information

Dominant Resource Fairness in Cloud Computing Systems with Heterogeneous Servers

Dominant Resource Fairness in Cloud Computing Systems with Heterogeneous Servers 1 Domnant Resource Farness n Cloud Computng Systems wth Heterogeneous Servers We Wang, Baochun L, Ben Lang Department of Electrcal and Computer Engneerng Unversty of Toronto arxv:138.83v1 [cs.dc] 1 Aug

More information