A Probabilistic Approach to Latent Cluster Analysis




Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

Zhipeng Xie, Rui Dong, Zhengheng Deng, Zhenying He, Weidong Yang
School of Computer Science, Fudan University, Shanghai, China
{xzp, 11210240011, 11210240082, zhenying, wdyang}@fudan.edu.cn

Abstract

Faced with a large number of clustering solutions, cluster ensemble methods provide an effective way to aggregate them into a better one. In this paper, we propose a novel cluster ensemble method from a probabilistic perspective. It assumes that each clustering solution is generated from a latent cluster model, under the control of two probabilistic parameters. Thus, the cluster ensemble problem is reformulated as a maximum-likelihood optimization problem. An EM-style algorithm is designed to solve this problem. It can determine the number of clusters automatically. Experimental results show that the proposed algorithm outperforms state-of-the-art methods including EAC-AL, CSPA, HGPA, and MCLA. Furthermore, our algorithm is shown to be stable in the predicted number of clusters.

1 Introduction

The goal of cluster analysis is to discover the underlying structure of a dataset (Jain et al., 1999; Jain, 2010). It normally partitions a set of objects so that the objects within the same group are similar while those from different groups are dissimilar. A large number of clustering algorithms have been proposed, e.g. k-means, Spectral Clustering, Hierarchical Clustering, and Self-Organizing Maps, to name but a few, yet no single one is able to achieve this goal for all datasets. On the same data, different algorithms, or even multiple runs of the same algorithm with different parameters, often lead to clustering solutions that are distinct from each other.

Confronted with a large number of clustering solutions, cluster ensemble (or clustering aggregation) methods have emerged, which try to combine different clustering solutions into a consensus one in order to improve on the quality of the component clustering solutions (Vega-Pons and Ruiz-Shulcloper, 2011). Cluster ensemble methods usually consist of two or three phases: the ensemble generation phase, which produces a variety of clustering solutions; the optional ensemble selection phase, which selects a subset of these clustering solutions; and finally the consensus phase, which induces a unified partition by combining the component ones.

In the generation phase, different clustering solutions can be generated by different clustering algorithms, by the same algorithm with different parameter settings or initializations, or by injecting random disturbance into the data set, such as data resampling (Minaei-Bidgoli et al., 2004), random projection (Fern and Brodley, 2003), and random feature selection (Strehl and Ghosh, 2002). Following the generation phase, an optional ensemble selection phase selects or prunes these clustering solutions according to their qualities and diversities (Fern and Lin, 2008; Azimi and Fern, 2009).

In this paper, we focus on the final phase: clustering combination. There are many algorithms for the combination, which can be categorized according to the kind of information exploited. The algorithm proposed here falls into the category that makes use of the pairwise similarities between objects, which form a co-association matrix in the context of cluster ensembles. Any clustering algorithm can be applied to this new similarity matrix to find a consensus partition. Evidence Accumulation Clustering (Fred and Jain, 2005), or EAC for short, performs a hierarchical clustering with average linkage (AL) or single linkage (SL) on the co-association matrix, where a maximum lifetime criterion is proposed to determine the number of clusters. The Cluster-based Similarity Partitioning Algorithm (CSPA) (Strehl and Ghosh, 2002) uses a graph-partitioning algorithm instead, but requires the number of clusters to be specified manually. Another algorithm, HGPA (Strehl and Ghosh, 2002), can be thought of as an approximation to CSPA. Outside this category, the MCLA algorithm clusters the clusters themselves, based on the similarities between clusters, and then assigns each object to its closest meta-cluster. For a thorough list of related algorithms, please refer to the survey paper by Vega-Pons and Ruiz-Shulcloper (2011).

Although these methods have achieved some success, they are still deficient in several aspects: first, they lack theoretical underpinning; second, they treat all clustering solutions as being of the same quality, and thus assign the same weight to each clustering solution; last but not least, most of them (except EAC) require the number of clusters to be specified manually. As to the maximum lifetime criterion adopted by the EAC algorithm, it is more or less a rule of thumb that lacks justification. As we shall see in the experiments, the maximum lifetime criterion is unstable in determining the number of clusters.

To tackle these problems, we propose a probabilistic method called LAtent Cluster Analysis, or LACA for short. It assumes that there is a latent cluster model which is unobservable. All the observed clustering solutions are generated from the latent model under the control of two probabilistic parameters. Our objective is to seek the latent cluster model with the maximum likelihood.

This paper is organized as follows. In Section 2, we introduce the latent cluster model and build its connection with the observed clustering solutions. We devote Section 3 to an EM-style algorithm for inferring the latent cluster model from the observed clustering solutions. In Section 4, we present the experimental results of the proposed method compared with several state-of-the-art cluster ensemble algorithms. Finally, we conclude in Section 5.

2 Latent Cluster Model

Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ objects, where each object may be represented as a multidimensional vector, a string, or in any other form. Taking $X$ as input, a clustering algorithm (called a clusterer) produces a clustering solution that partitions the $n$ objects into groups. By running clustering algorithms multiple times, we can observe an ensemble of clustering solutions, $E = \{C_1, C_2, \ldots, C_{|E|}\}$, where each clustering solution $C_i$ ($1 \le i \le |E|$) partitions the $n$ objects into groups $c^i_1, c^i_2, \ldots, c^i_{|C_i|}$.

With respect to a given clustering solution $C_i$, a co-association relationship between two objects $x_u$ and $x_v$ is defined according to whether they are assigned to the same group:

$$\delta_i(x_u, x_v) = \begin{cases} 1, & \text{if } \exists\, c^i_k \in C_i \text{ such that } \{x_u, x_v\} \subseteq c^i_k \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

Two objects $x_u$ and $x_v$ are said to be co-associated in the clustering solution $C_i$ if $\delta_i(x_u, x_v) = 1$; otherwise, they are not co-associated.

Given the ensemble of observed clustering solutions, what we would like to explore is the latent cluster model. We denote the latent cluster model as $\Lambda = \{\pi_1, \pi_2, \ldots, \pi_s\}$, where $\pi_l$ ($1 \le l \le s$) is a cluster represented as a subset of objects, and $s$ is the real number of clusters. In this paper, we assume that these $s$ latent clusters are non-overlapping, i.e., $\pi_a \cap \pi_b = \emptyset$ for $1 \le a < b \le s$. Based on the latent cluster model, we define the co-cluster function for a given pair of objects $\{x_u, x_v\}$:

$$\delta_\Lambda(x_u, x_v) = \begin{cases} 1, & \text{if } \exists\, \pi_k \in \Lambda \text{ such that } \{x_u, x_v\} \subseteq \pi_k \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

This latent cluster model serves as the major factor in determining what clustering solutions can be observed, while other factors, such as the bias of the applied clustering algorithm, influence the observed results and lead to some false positives and false negatives.
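The co-association relationship of equation (1) can be illustrated with a short sketch. It assumes each clustering solution is given as a list of integer group labels, one per object; the function name `co_association` is hypothetical and not taken from the paper.

```python
import numpy as np


def co_association(labels):
    """Return the n x n 0/1 matrix whose (u, v) entry is 1 exactly when
    objects u and v share a group in this clustering solution."""
    labels = np.asarray(labels)
    # Broadcasting compares every label with every other label.
    return (labels[:, None] == labels[None, :]).astype(int)


# Three objects: the first two share a group, the third is alone.
delta = co_association([0, 0, 1])
```

Only the entries with u < v carry information, since the matrix is symmetric and its diagonal is trivially 1.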
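To build the connection between a clustering solution $C_i$ and the latent cluster model $\Lambda$, we introduce two probabilistic parameters, written in terms of the co-association function $\delta_i$ of $C_i$ and the co-cluster function $\delta_\Lambda$ of the latent model:

The parameter $\lambda_i$ denotes the conditional probability that two objects are co-associated in $C_i$ given that they are co-cluster in the hidden model, that is, $\lambda_i = \Pr(\delta_i(x_u, x_v) = 1 \mid \delta_\Lambda(x_u, x_v) = 1)$; and

The parameter $r_i$ denotes the conditional probability that two objects are co-associated in $C_i$ given that they are not co-cluster in the hidden model, that is, $r_i = \Pr(\delta_i(x_u, x_v) = 1 \mid \delta_\Lambda(x_u, x_v) = 0)$.

Intuitively, each observed clustering solution provides some evidence about the latent (unobservable) co-cluster relationship between objects. Given a set $E$ of clustering solutions, our objective is to maximize the posterior probability of the latent cluster model, as follows:

$$\Lambda^* = \arg\max_\Lambda \Pr(\Lambda \mid E) = \arg\max_\Lambda \frac{l(\Lambda \mid E)\,\Pr(\Lambda)}{\Pr(E)} \qquad (3)$$

where $l(\Lambda \mid E) = \Pr(E \mid \Lambda)$ is the likelihood function. Assuming that these clustering solutions are independent of each other and that the prior probabilities of all the possible latent cluster models are distributed uniformly, it can be written as the following maximum-likelihood problem:

$$\Lambda^* = \arg\max_\Lambda l(\Lambda \mid E) = \arg\max_\Lambda \prod_{C_i \in E} l(\Lambda \mid C_i) \qquad (4)$$

where $l(\Lambda \mid C_i) = \Pr(C_i \mid \Lambda)$ is the likelihood of the latent model $\Lambda$ given the observed clustering solution $C_i$ (or the probability of observing $C_i$ given the latent model $\Lambda$).

The evidence of observing a clustering solution $C_i$ can be decomposed into the evidence set of co-association relationships of all object pairs, i.e. whether a pair of objects $\{x_u, x_v\}$ is assigned to the same group. Then, we have

$$l(\Lambda \mid C_i) = \prod_{\substack{\delta_\Lambda = 1 \\ \delta_i = 1}} \lambda_i \prod_{\substack{\delta_\Lambda = 1 \\ \delta_i = 0}} (1 - \lambda_i) \prod_{\substack{\delta_\Lambda = 0 \\ \delta_i = 1}} r_i \prod_{\substack{\delta_\Lambda = 0 \\ \delta_i = 0}} (1 - r_i) \qquad (5)$$

where each product runs over the object pairs $\{x_u, x_v\}$ satisfying the stated conditions on $\delta_\Lambda(x_u, x_v)$ and $\delta_i(x_u, x_v)$. Taking the logarithm on both sides, we get the log-likelihood:

$$L(\Lambda \mid C_i) = \log l(\Lambda \mid C_i) = n_{11}^{\Lambda,i} \log \lambda_i + n_{10}^{\Lambda,i} \log(1 - \lambda_i) + n_{01}^{\Lambda,i} \log r_i + n_{00}^{\Lambda,i} \log(1 - r_i) \qquad (6)$$

where

$n_{11}^{\Lambda,i} = |\{\{x_u, x_v\} : \delta_\Lambda(x_u, x_v) = 1 \text{ and } \delta_i(x_u, x_v) = 1\}|$ represents the number of object pairs that are co-cluster in the hidden model and also co-associated in $C_i$;

$n_{10}^{\Lambda,i} = |\{\{x_u, x_v\} : \delta_\Lambda(x_u, x_v) = 1 \text{ and } \delta_i(x_u, x_v) = 0\}|$ represents the number of object pairs that are co-cluster in the hidden model, but not co-associated in the observed clustering result $C_i$;

$n_{01}^{\Lambda,i} = |\{\{x_u, x_v\} : \delta_\Lambda(x_u, x_v) = 0 \text{ and } \delta_i(x_u, x_v) = 1\}|$ represents the number of object pairs that are not co-cluster in the hidden model, but co-associated in the observed clustering result $C_i$; and

$n_{00}^{\Lambda,i} = |\{\{x_u, x_v\} : \delta_\Lambda(x_u, x_v) = 0 \text{ and } \delta_i(x_u, x_v) = 0\}|$ represents the number of object pairs that are not co-cluster in the hidden model and also not co-associated in the clustering result $C_i$.

By substituting equation (5) or (6) into (4), we can reformulate the optimization problem as:

$$\Lambda^* = \arg\max_\Lambda \prod_{C_i \in E} l(\Lambda \mid C_i) = \arg\max_\Lambda \sum_{C_i \in E} L(\Lambda \mid C_i) \qquad (7)$$

3 Algorithm Design

Unfortunately, the latent cluster model and the probabilistic parameters of these observed clustering results are all unknown, which makes it impossible to seek the solution directly. Here, we propose an EM-style algorithm to deal with the problem, depicted in Figure 1.

Figure 1. The flowchart of the LACA algorithm (Clustering 1, Clustering 2, ..., Clustering |E| -> Parameter Initialization -> Latent Model Generation -> Parameter Re-estimation -> Convergence? No: repeat from Latent Model Generation; Yes: Output the Solution)

The algorithm consists of four major steps:

Step 1 (Parameter Initialization): initialize the probabilistic parameters for each clustering solution;
Step 2 (Latent Model Generation): fixing the probabilistic parameters, look for a near-optimal solution (a latent model) to the maximum-likelihood problem with a hill-climbing strategy;
Step 3 (Parameter Estimation): fixing the latent cluster model, estimate the probabilistic parameters for each clustering solution;
Step 4 (Convergence Test): repeat Step 2 and Step 3 until convergence.

3.1 Parameter Initialization

Given $|E|$ observed clustering solutions $C_1, C_2, \ldots, C_{|E|}$, we use $count(x_u, x_v)$ to denote the number of clustering solutions where the objects $x_u$ and $x_v$ are co-associated, and $rcount(x_u, x_v)$ to denote the number of clustering solutions where $x_u$ and $x_v$ are not co-associated, that is:

$$count(x_u, x_v) = \sum_{i=1}^{|E|} \delta_i(x_u, x_v) \qquad (8)$$

and

$$rcount(x_u, x_v) = \sum_{i=1}^{|E|} \bigl(1 - \delta_i(x_u, x_v)\bigr). \qquad (9)$$

It is evident that $count(x_u, x_v) + rcount(x_u, x_v) = |E|$ for $1 \le u, v \le n$.

So far as parameter initialization is concerned, it is taken for granted that different clustering solutions are equally plausible. The higher the value of $count(x_u, x_v)$, the more probable it is that the two objects $x_u$ and $x_v$ are co-clustered. Hence, the two probabilistic parameters for each clustering solution $C_i$ are initialized as the following M-estimates:

$$\lambda_i = \frac{\sum_{\{x_u, x_v\} : \delta_i(x_u, x_v) = 1} count(x_u, x_v) + 0.5 \cdot ESS}{\sum_{\{x_u, x_v\}} count(x_u, x_v) + ESS}, \qquad (10)$$

and

$$r_i = 1 - \frac{\sum_{\{x_u, x_v\} : \delta_i(x_u, x_v) = 0} rcount(x_u, x_v) + 0.5 \cdot ESS}{\sum_{\{x_u, x_v\}} rcount(x_u, x_v) + ESS}, \qquad (11)$$

where ESS, standing for equivalent sample size, is set to 30 by default.

3.2 Latent Model Generation

Once the probabilistic parameters are initialized or re-estimated, the latent cluster model can be determined in a hill-climbing manner with respect to the value of the log-likelihood function.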
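The pair counts and the log-likelihood of equation (6) can be made concrete with a minimal sketch. It assumes the latent model and the observed clustering are both given as label lists over the same objects; the names `pair_counts`, `log_likelihood`, `lam` (the probability of co-association given co-cluster) and `r` (the probability of co-association given not co-cluster) are illustrative choices, not identifiers from the paper.

```python
import math
from itertools import combinations


def pair_counts(latent, observed):
    """Count object pairs by (co-cluster in latent model, co-associated in
    observed clustering). Returns (n11, n10, n01, n00)."""
    n11 = n10 = n01 = n00 = 0
    for u, v in combinations(range(len(latent)), 2):
        same_latent = latent[u] == latent[v]
        same_observed = observed[u] == observed[v]
        if same_latent and same_observed:
            n11 += 1
        elif same_latent:
            n10 += 1
        elif same_observed:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def log_likelihood(latent, observed, lam, r):
    """Log-likelihood of the latent model given one observed clustering,
    following the four-term form of equation (6); needs 0 < lam, r < 1."""
    n11, n10, n01, n00 = pair_counts(latent, observed)
    return (n11 * math.log(lam) + n10 * math.log(1 - lam)
            + n01 * math.log(r) + n00 * math.log(1 - r))


counts = pair_counts([0, 0, 1, 1], [0, 0, 0, 1])
ll = log_likelihood([0, 0, 1, 1], [0, 0, 0, 1], lam=0.8, r=0.2)
```

Summing `log_likelihood` over all clusterings in the ensemble gives the objective of equation (7) for a candidate latent model.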
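Let us start with an empty latent model where each object corresponds to a singleton cluster, so that no two objects are co-cluster. Then, we iteratively merge two clusters into a larger one. The selection criterion for merging at each step is derived below, step by step.

Let $\Lambda = \{\pi_1, \pi_2, \ldots, \pi_t\}$ be the latent cluster model at the current step, and let $\Lambda^{(a,b)}$ be produced by merging $\pi_a$ and $\pi_b$ in $\Lambda$. Merging only changes the status of the pairs with one object in $\pi_a$ and the other in $\pi_b$, which move from not co-cluster to co-cluster. Due to the fact that $(n_{11}^{\Lambda^{(a,b)},i} - n_{11}^{\Lambda,i}) = -(n_{01}^{\Lambda^{(a,b)},i} - n_{01}^{\Lambda,i})$ and $(n_{10}^{\Lambda^{(a,b)},i} - n_{10}^{\Lambda,i}) = -(n_{00}^{\Lambda^{(a,b)},i} - n_{00}^{\Lambda,i})$, it can be derived that:

$$L(\Lambda^{(a,b)} \mid E) - L(\Lambda \mid E) = \sum_{C_i \in E} \left[ (n_{11}^{\Lambda^{(a,b)},i} - n_{11}^{\Lambda,i}) \log \frac{\lambda_i}{r_i} + (n_{10}^{\Lambda^{(a,b)},i} - n_{10}^{\Lambda,i}) \log \frac{1 - \lambda_i}{1 - r_i} \right] \qquad (12)$$

Further, it is evident that:

$$n_{11}^{\Lambda^{(a,b)},i} - n_{11}^{\Lambda,i} = \sum_{x_u \in \pi_a,\, x_v \in \pi_b} \delta_i(x_u, x_v) \qquad (13)$$

and

$$n_{10}^{\Lambda^{(a,b)},i} - n_{10}^{\Lambda,i} = \sum_{x_u \in \pi_a,\, x_v \in \pi_b} \bigl(1 - \delta_i(x_u, x_v)\bigr). \qquad (14)$$

Substituting (13) and (14) into (12), we get:

$$L(\Lambda^{(a,b)} \mid E) - L(\Lambda \mid E) = \sum_{C_i \in E} \left[ \sum_{x_u \in \pi_a,\, x_v \in \pi_b} \delta_i(x_u, x_v) \log \frac{\lambda_i}{r_i} + \sum_{x_u \in \pi_a,\, x_v \in \pi_b} \bigl(1 - \delta_i(x_u, x_v)\bigr) \log \frac{1 - \lambda_i}{1 - r_i} \right] \qquad (15)$$

By interchanging the order of summation, it can be derived that:

$$L(\Lambda^{(a,b)} \mid E) - L(\Lambda \mid E) = \sum_{x_u \in \pi_a,\, x_v \in \pi_b} \sum_{C_i \in E} \left[ \delta_i(x_u, x_v) \log \frac{\lambda_i}{r_i} + \bigl(1 - \delta_i(x_u, x_v)\bigr) \log \frac{1 - \lambda_i}{1 - r_i} \right] \qquad (16)$$

If we define the affinity score between two objects $x_u$ and $x_v$ voted by a clustering solution $C_i$ as:

$$score_i(x_u, x_v) = \delta_i(x_u, x_v) \log \frac{\lambda_i}{r_i} + \bigl(1 - \delta_i(x_u, x_v)\bigr) \log \frac{1 - \lambda_i}{1 - r_i}, \qquad (17)$$

we can sum up all the scores voted by $C_1, C_2, \ldots, C_{|E|}$ into the corresponding entry $M[u, v]$ of a score matrix $M$, that is,

$$M[u, v] = \sum_{i=1}^{|E|} score_i(x_u, x_v). \qquad (18)$$

Substituting (17) and (18) into (16), we have:

$$L(\Lambda^{(a,b)} \mid E) - L(\Lambda \mid E) = \sum_{x_u \in \pi_a,\, x_v \in \pi_b} M[u, v] \qquad (19)$$

However, using equation (19) directly as the selection criterion may favor merging larger clusters over smaller ones. Thus, we choose to merge two clusters $\pi_s$ and $\pi_t$ such that

$$(\pi_s, \pi_t) = \arg\max_{\pi_a, \pi_b} \frac{L(\Lambda^{(a,b)} \mid E) - L(\Lambda \mid E)}{|\pi_a| \cdot |\pi_b|} = \arg\max_{\pi_a, \pi_b} \frac{1}{|\pi_a| \cdot |\pi_b|} \sum_{x_u \in \pi_a,\, x_v \in \pi_b} M[u, v]. \qquad (20)$$

If we think of the score matrix $M$ as a similarity matrix, the selection criterion is actually the average linkage (AL). As a result, a hierarchical clustering with average linkage can be applied to the matrix $M$. Importantly, the elements of the score matrix have a clear probabilistic meaning: each element is the log-likelihood ratio of the corresponding two objects being co-cluster to not being co-cluster.

Hidden Model Inference via Hierarchical Clustering

By fixing all the parameters $\lambda_i$ and $r_i$ ($1 \le i \le |E|$), the score matrix $M$ can be constructed according to (17) and (18). The hidden model can be generated by applying the following agglomerative hierarchical clustering to $M$:

Step 1: We start from the simplest singleton model $\Lambda_0$, where each cluster consists of one and only one object.
Step 2: Let $t$ denote the current iteration, and set $t = 1$.
Step 3: At each iteration $t$, let $\Lambda$ be the cluster model from the previous iteration, i.e., $\Lambda = \Lambda_{t-1}$. Without loss of generality, we denote $\Lambda = \{\pi_1, \pi_2, \ldots, \pi_{|\Lambda|}\}$. Two clusters $\pi_s$ and $\pi_t$ in $\Lambda$ are selected according to equation (20).
Step 4: If the average link between $\pi_s$ and $\pi_t$ is negative, then terminate the loop and output $\Lambda$ as the generated hidden cluster model; otherwise, continue to Step 5.
Step 5: The clusters selected at Step 3 are merged into a new cluster $\pi_{new} = \pi_s \cup \pi_t$. We update $\Lambda$ by removing $\pi_s$ and $\pi_t$ and inserting $\pi_{new}$. The updated $\Lambda$ then serves as the cluster model of the current iteration, that is, $\Lambda_t = \Lambda$.
Step 6: If only two clusters are left, terminate and output $\Lambda$; otherwise, set $t = t + 1$ and go to Step 3.

Stopping Criterion: This process is repeated until we cannot find a pair of clusters with a positive average link (Step 4), or there are only two clusters left (Step 6).

3.3 Parameter Re-estimation

Once a latent cluster model $\Lambda$ is generated, it can be used to estimate the probabilistic parameters $\lambda_i$ and $r_i$ for each clustering solution $C_i$.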
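The score matrix of equations (17) and (18) and the average-linkage merging loop of Section 3.2 can be sketched as follows. This is a naive illustration rather than the authors' implementation: it assumes clusterings are given as label lists, that `lam` and `r` hold one parameter pair per clustering with values strictly between 0 and 1, and it recomputes every average link at each iteration. The names `score_matrix` and `latent_model` are hypothetical.

```python
import numpy as np


def score_matrix(ensemble, lam, r):
    """Sum the per-clustering affinity scores into an n x n matrix M."""
    n = len(ensemble[0])
    M = np.zeros((n, n))
    for labels, lam_i, r_i in zip(ensemble, lam, r):
        labels = np.asarray(labels)
        delta = labels[:, None] == labels[None, :]
        # Co-associated pairs vote log(lam/r); the others vote log((1-lam)/(1-r)).
        M += np.where(delta, np.log(lam_i / r_i), np.log((1 - lam_i) / (1 - r_i)))
    return M


def latent_model(M, min_clusters=2):
    """Agglomerative merging with average linkage on M. Stops when the best
    average link is negative or when only `min_clusters` clusters remain."""
    clusters = [[u] for u in range(M.shape[0])]
    while len(clusters) > min_clusters:
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = M[np.ix_(clusters[a], clusters[b])].mean()
                if link > best:
                    best, best_pair = link, (a, b)
        if best < 0:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


ensemble = [[0, 0, 1, 1, 2], [0, 0, 1, 1, 2], [0, 0, 1, 1, 2]]
M = score_matrix(ensemble, lam=[0.9] * 3, r=[0.1] * 3)
model = latent_model(M)
```

In this toy ensemble all three clusterings agree, so merging stops with three clusters once every remaining average link is negative.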
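Since the parameter $\lambda_i$ represents the probability that two objects are co-associated in $C_i$ on condition that they are co-cluster in the latent model $\Lambda$, that is, $\lambda_i = \Pr(\delta_i(x_u, x_v) = 1 \mid \delta_\Lambda(x_u, x_v) = 1)$, it can be estimated as:

$$\lambda_i = \frac{|\{(x_u, x_v) : \delta_\Lambda(x_u, x_v) = 1 \text{ and } \delta_i(x_u, x_v) = 1\}| + 0.5 \cdot ESS}{|\{(x_u, x_v) : \delta_\Lambda(x_u, x_v) = 1\}| + ESS}, \qquad (21)$$

where ESS is again set to 30 by default. Because the parameter $r_i$ denotes the probability that two objects are co-associated in $C_i$ on condition that they are not co-cluster in $\Lambda$, that is, $r_i = \Pr(\delta_i(x_u, x_v) = 1 \mid \delta_\Lambda(x_u, x_v) = 0)$, it can be estimated as:

$$r_i = \frac{|\{(x_u, x_v) : \delta_\Lambda(x_u, x_v) = 0 \text{ and } \delta_i(x_u, x_v) = 1\}| + 0.5 \cdot ESS}{|\{(x_u, x_v) : \delta_\Lambda(x_u, x_v) = 0\}| + ESS}. \qquad (22)$$

If a clustering solution assigns each object to a distinct group, the corresponding $\lambda_i$ and $r_i$ will both be close to 0. At the other extreme, if it assigns all objects to a single group, the corresponding $\lambda_i$ and $r_i$ will both be close to 1.

3.4 Convergence Test

Once the probabilistic parameters $\lambda_i$ and $r_i$ for each clustering solution $C_i$ are re-estimated, we compute the difference between the re-estimated values and the previous ones. If the sum of absolute differences over all clustering solutions is less than a user-specified threshold value, we consider the algorithm converged and output the latent model.

4 Experiments

We have conducted extensive experiments to compare LACA with several state-of-the-art cluster ensemble methods. Our experiments are designed to demonstrate that: 1) LACA is more stable than EAC-AL in determining the number of clusters; 2) LACA outperforms EAC-AL, which is also able to determine the number of clusters automatically; 3) a variant of LACA that accepts a user-specified cluster number outperforms CSPA, HGPA, and MCLA.

4.1 Experimental Settings

DataSet        #Object   #Feature   #Class
Iris           150       4          3
Glass          214       9          6
Ecoli          336       7          8
Libras         360       90         15
Segmentation   210       19         7
Seeds          210       7          3
Pima           768       8          2
Pendigits      1000      16         10

Table 1: Descriptions of the datasets.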
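Returning briefly to Sections 3.3 and 3.4, the re-estimation and the convergence test can be sketched as below. The sketch assumes label-list inputs and reads the M-estimate as (count + prior * ESS) / (total + ESS) with a prior of 0.5; that prior value, the default threshold, and the names `reestimate` and `converged` are assumptions made for illustration.

```python
import numpy as np


def reestimate(latent, observed, ess=30, prior=0.5):
    """M-estimates of the two parameters for one observed clustering.

    lam: fraction of latent co-cluster pairs that are co-associated.
    r:   fraction of latent non-co-cluster pairs that are co-associated.
    Both are smoothed towards `prior` with equivalent sample size `ess`."""
    latent = np.asarray(latent)
    observed = np.asarray(observed)
    upper = np.triu_indices(len(latent), k=1)  # each unordered pair once
    co_cluster = (latent[:, None] == latent[None, :])[upper]
    co_assoc = (observed[:, None] == observed[None, :])[upper]
    lam = (np.sum(co_cluster & co_assoc) + prior * ess) / (np.sum(co_cluster) + ess)
    r = (np.sum(~co_cluster & co_assoc) + prior * ess) / (np.sum(~co_cluster) + ess)
    return float(lam), float(r)


def converged(old_params, new_params, threshold=1e-4):
    """True when the summed absolute change of all (lam, r) pairs is small."""
    change = sum(abs(lo - ln) + abs(ro - rn)
                 for (lo, ro), (ln, rn) in zip(old_params, new_params))
    return change < threshold


lam, r = reestimate([0, 0, 1, 1], [0, 0, 0, 1])
```

With only four objects the smoothing term dominates, which is why both estimates land on the prior here; on realistic dataset sizes the pair counts dominate instead.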

Data sets. We use eight data sets from the UCI machine learning repository (Frank and Asuncion, 2010) in our experiments. The characteristics of the data sets are summarized in Table 1. Note that, for Pendigits, we randomly select 100 objects from each class.

Cluster Ensemble Generation. We choose the K-means algorithm (MacQueen, 1967) as our base clusterer because of its popularity in many previous cluster ensemble studies. At each run, we generate a cluster ensemble of 200 clustering solutions for a given data set. To be more specific, for a dataset of $n$ objects and $m$ features, each clustering solution is produced as follows:

(1) The size $s$ of the feature subset is first determined by randomly drawing an integer value from the range [mins, maxs], where mins is set to 3 and maxs is set to $m$.
(2) A random feature subset FS of size $s$ is generated by drawing $s$ different features from the original $m$ features.
(3) A random integer value $K$ is drawn from [mink, maxk], where mink is set to 2 and maxk is set to $n/15$.
(4) A clustering solution is obtained by applying the K-means algorithm to the dataset, with access to all the objects, but only the $s$ features in FS.

Evaluation Criterion. As all the datasets are labeled, we use the class labels as a surrogate for the true underlying structure of the data. Two commonly used measures, Normalized Mutual Information (NMI) and F-measure, are chosen to evaluate our approach against others.

NMI (Strehl and Ghosh, 2002) treats cluster labels $X$ and class labels $Y$ as random variables and makes a tradeoff between the mutual information and the number of clusters:

$$NMI(X, Y) = \frac{I(X, Y)}{\sqrt{H(X)\,H(Y)}},$$

where $I(\cdot)$ is the mutual information metric and $H(\cdot)$ is the entropy metric.

F-measure (Manning et al., 2008) views a clustering solution (on a dataset with $n$ objects) as a series of $n(n-1)/2$ decisions, one for each pair of objects. It makes a compromise between the precision and the recall of these decisions:

$$F = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}.$$

4.2 Stability of Predicted Cluster Numbers

To the best of our knowledge, most cluster ensemble methods rely on a user-specified number of clusters. The only exception is the maximum lifetime criterion used in the EAC-AL method. In order to study the stability of our algorithm and EAC-AL in the predicted number of clusters, we generate 30 cluster ensembles for each dataset in the way described in Section 4.1.
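The pairwise F-measure described in Section 4.1 can be sketched as follows: a pair is a positive decision when the clustering puts both objects in the same group, and it is correct when the class labels agree. The sketch assumes the balanced form F = 2PR/(P+R), and the function name `pairwise_f_measure` is hypothetical.

```python
from itertools import combinations


def pairwise_f_measure(predicted, truth):
    """F-measure over the n(n-1)/2 same-group decisions of a clustering."""
    tp = fp = fn = 0
    for u, v in combinations(range(len(predicted)), 2):
        same_pred = predicted[u] == predicted[v]
        same_true = truth[u] == truth[v]
        if same_pred and same_true:
            tp += 1      # pair correctly placed together
        elif same_pred:
            fp += 1      # pair placed together but from different classes
        elif same_true:
            fn += 1      # pair from one class but split apart
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


f = pairwise_f_measure([0, 0, 1, 1], [0, 0, 0, 1])
```

Here one of the two predicted pairs is correct (precision 1/2) and one of the three true pairs is recovered (recall 1/3), which gives F = 0.4.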
Our algorithm and EAC-AL are applied to these cluster ensembles to obtain their predicted cluster numbers. The statistics of these numbers are presented in Table 2. It can be seen from the table that the range [Min, Max] of EAC-AL is much wider than that of LACA on each dataset, suggesting that the cluster numbers predicted by EAC-AL fluctuate a lot for each data set, especially for Pendigits and Pima. Similar observations can be made from the standard deviation of cluster numbers on each dataset. We conjecture that this is because lifetime is not always effective in predicting cluster numbers: the maximum lifetime strategy is more or less a rule of thumb that lacks theoretical justification. As to the average value of the predicted cluster numbers, LACA is larger than EAC-AL on 3 datasets and smaller on 5, showing some degree of inconsistency.

Dataset        Method   Min   Max   Average   Std Dev
Iris           LACA     3     4     3.73
               EAC-AL   2     4     2.73      0.98
Glass          LACA     5     6     5.03      0.18
               EAC-AL   2     6     4.40      1.65
Ecoli          LACA     3     5     3.73      3
               EAC-AL   3     12    4.30      2.37
Libras         LACA     7     9     7.70      0
               EAC-AL   6     21    11.50     5.54
Segmentation   LACA     5     7     5.23      7
               EAC-AL   2     13    2.53      2.08
Seeds          LACA     4     5     4.40      0
               EAC-AL   2     7     4.53      0.97
Pima           LACA     4     10    8.70      1.06
               EAC-AL   4     30    11.80     8.04
Pendigits      LACA     10    16    14.03     1.75
               EAC-AL   4     48    23        12.27

Table 2: Statistics of predicted cluster numbers

4.3 Comparison with EAC-AL

Table 3 reports the NMI and F-measure values of our algorithm and EAC-AL on the same cluster ensembles. Each value reported here is obtained by averaging across 30 runs. We can see that our algorithm performs better than EAC-AL on 7 out of 8 datasets. The only exception is the Libras dataset. We conjecture that this is because the average cluster number predicted by EAC-AL is closer to the real number of classes in Libras.

               F-measure            NMI
Dataset        LACA     EAC-AL      LACA     EAC-AL
Iris           533      995         535      528
Glass          502      395         869      614
Ecoli          693      545         790      712
Libras         267      470         014      333
Segmentation   091      091         052      617
Seeds          423      333         680      613
Pima           751      597         0.0674   0.0665
Pendigits      712      809         721      389

Table 3: F-measure and NMI values of LACA and EAC-AL

4.4 Comparison with CSPA, HGPA, and MCLA

Figure 2. NMI comparison of the fixed-k variant of LACA, CSPA, HGPA, and MCLA (panels: iris, glass, ecoli, libras, segmentation, seeds, pima, pendigits; x-axis: cluster number k)

Figure 3. F-measure comparison of the fixed-k variant of LACA, CSPA, HGPA, and MCLA (panels: iris, glass, ecoli, libras, segmentation, seeds, pima, pendigits; x-axis: cluster number k; y-axis: F value)

Because most existing cluster ensemble methods require a user-specified number of clusters, to make a fair comparison with them we make a small modification to LACA so that it accepts a user-specified cluster number $k$, resulting in a fixed-k variant. In this variant, once the probabilistic parameters have converged, we force the hierarchical clustering to stop merging only when there are exactly $k$ clusters left, which are then used as the consensus clustering.

On each dataset of $l$ classes, we compare this variant with CSPA, HGPA, and MCLA by varying the cluster number $k$ from $l$ to $3l$ with step size 1. Figure 2 and Figure 3 depict the NMIs and the F-measures of the four methods, respectively, again averaged over 30 runs, for the different user-specified cluster numbers. The curve of our method is better or at least competitive on almost all the datasets. The only exception is observed on the Pima dataset, where the NMI of our method is lower than the others. Besides, we also find that the curve of our method is smoother and more stable across these different $k$ values. This suggests that our method achieves consistently high quality at these levels of hierarchical clustering.

5 Conclusions

In this paper, we proposed a novel cluster ensemble approach by assuming that the observed clustering solutions are generated from a latent cluster model. An EM-style algorithm, called LACA, was designed and implemented to maximize the likelihood function.
It has exhibited satisfactory performance on the experimental datasets, for two reasons: first, it can make a stable and reliable prediction of the cluster numbers; second, the vote of each base clustering solution is weighted to reflect the quality of that base solution.

Acknowledgments

This work is supported by National Airplane Research Program (MJ-Y-2011-39), National Natural Science Fund of China (No. 61170007), Shanghai High-Tech Project (11-43) and Shanghai Leading Academic Discipline Project (No. B114). We are grateful to the anonymous reviewers for their valuable comments.

References

[Azimi and Fern, 2009] Javad Azimi and Xiaoli Z. Fern. Adaptive cluster ensemble selection. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 993-997, 2009.

[Fern and Brodley, 2003] Xiaoli Z. Fern and Carla E. Brodley. Random projection for high dimensional data clustering: a cluster ensemble approach. In Proceedings of the 20th International Conference on Machine Learning, pages 186-193, 2003.

[Fern and Lin, 2008] Xiaoli Z. Fern and Wei Lin. Cluster ensemble selection. Statistical Analysis and Data Mining, 1(3): 379-390, 2008.

[Frank and Asuncion, 2010] A. Frank and A. Asuncion. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science, 2010. [http://archive.ics.uci.edu/ml]

[Fred and Jain, 2005] Ana L.N. Fred and Anil K. Jain. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6): 835-850, 2005.

[Jain et al., 1999] Anil K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3): 264-323, 1999.

[Jain, 2010] Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8): 651-666, 2010.

[MacQueen, 1967] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, pages 281-297, 1967.

[Manning et al., 2008] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval, Cambridge University Press, 2008.

[Minaei-Bidgoli et al., 2004] Behrouz Minaei-Bidgoli, Alexander Topchy, and William F. Punch. Ensembles of partitions via data resampling. In Proceedings of International Conference on Information Technology: Coding and Computing (ITCC 2004), pages 188-192, 2004.

[Strehl and Ghosh, 2002] Alexander Strehl and Joydeep Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3: 583-617, 2002.

[Vega-Pons and Ruiz-Shulcloper, 2011] Sandro Vega-Pons and José Ruiz-Shulcloper. A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence, 25(3): 337-372, 2011.