A Latent Variable Pairwise Classification Model of a Clustering Ensemble




Vladimir Berikov
Sobolev Institute of Mathematics, Novosibirsk State University, Russia
berikov@math.nsc.ru
http://www.math.nsc.ru

Abstract. This paper addresses some theoretical properties of clustering ensembles. We consider the problem of cluster analysis from a pattern recognition point of view. A latent variable pairwise classification model is proposed for studying the efficiency (in terms of error probability) of the ensemble. The notions of stability, homogeneity and correlation between ensemble elements are introduced. An upper bound for the misclassification probability is obtained. A numerical experiment confirms the potential usefulness of the suggested ensemble characteristics.

Keywords: clustering ensemble, latent variable model, misclassification probability, error bound, ensemble's homogeneity and correlation.

1 Introduction

Collective decision-making based on a combination of simple algorithms is actively used in modern pattern recognition and machine learning. In the last decade, there has been growing interest in clustering ensemble algorithms [1,2]. In the ensemble design process, the results obtained by different algorithms, or by one algorithm with various parameter settings, are used. After the construction of partial clustering solutions, a final collective decision is built.

The modern literature on clustering ensembles can be roughly divided into several main categories. There are a great many works in which the ensemble methodology is adapted to new application areas such as magnetic resonance imaging, satellite image analysis, analysis of genetic sequences, etc. (see, for example, [3,4,5]). Another direction aims to develop clustering ensemble methods of general usage and to elaborate efficient algorithms using various optimization techniques (e.g., [6]). Other categories of works are of a more theoretical nature; their purpose is to study the properties of clustering ensembles, improve measures of ensemble quality, and suggest ways to achieve the best quality (e.g., [7,8,9]).
There is a large body of experimental evidence confirming a significant rise in the stability of clustering decisions for ensemble algorithms (see, for example, [2,10]). At the same time, the theoretical grounds of clustering ensemble algorithms, as opposed to the theory of pattern classifier ensembles (e.g., [11]), are still in the early development stage. Existing works consider mainly the asymptotic properties of clustering ensembles (e.g., [7]).

Cluster analysis problems are characterized by the complexity of formalization caused by the substantially subjective nature of the grouping process. To define clustering quality, it is necessary to apply additional a priori information in terms of a natural classification, to use data generation models, etc. In this work we attempt to corroborate the clustering ensemble methodology by utilizing a pattern recognition model with latent class labels. To avoid problems with class renumbering, a pairwise classification approach is used.

The rest of the paper is organized as follows. In the next section we give basic definitions and introduce the model of ensemble cluster analysis. In the third section we derive an upper bound for the error probability (in classifying a pair of arbitrary objects according to the latent variable labels) and give some qualitative consequences of the result. In the fourth section we introduce estimates of the ensemble characteristics. The next section describes a numerical experiment that demonstrates the usage of these notions. The conclusion summarizes the work and gives possible future directions.

C. Sansone, J. Kittler, and F. Roli (Eds.): MCS 2011, LNCS 6713, pp. 279–288, 2011. © Springer-Verlag Berlin Heidelberg 2011

2 Ensemble Model

Let us consider a sample $s = \{o^{(1)}, \ldots, o^{(N)}\}$ of objects independently and randomly selected from a general population. The purpose of the analysis is to group the objects into $K \ge 2$ classes in accordance with some clustering criterion; the number of classes may be either given beforehand or not (in the latter case an optimal number of classes should be determined automatically). Let each of the objects be characterized by variables $X_1, \ldots, X_n$. Denote by $x = (x_1, \ldots, x_n)$ the vector of these variables for an object $o$, where $x_j = X_j(o)$, $j = 1, \ldots, n$. In many clustering tasks it is allowable to assume that there exists a ground truth (latent, directly unobserved) variable $Y \in \{1, \ldots, K\}$ that determines to which class an object belongs.
Suppose that the observations of the $k$-th class are distributed according to the conditional density function $p_k(x) = p(x \mid Y = k)$, $k = 1, \ldots, K$. Consider the following model of data generation. Let each object be assigned to class $k$ in accordance with the a priori probabilities $P_k = P(Y = k)$, $k = 1, \ldots, K$, where $\sum_{k=1}^{K} P_k = 1$. After the assignment, an observable value of $x$ is determined with the use of $p_k(x)$. This procedure is repeated independently for each object.

For an arbitrary pair of different objects $a, b \in s$, the corresponding observations are denoted by $x(a)$ and $x(b)$. Let

\[ Z = I(Y(a) \ne Y(b)), \]

where $I(\cdot)$ is the indicator function. Denote by $P_Z = P[Z = 1 \mid x(a), x(b)]$ the probability of the event "a and b belong to different classes", given $x(a)$ and $x(b)$:

\[ P_Z = 1 - P[Y(a) = 1 \mid x(a)]\, P[Y(b) = 1 \mid x(b)] - \ldots - P[Y(a) = K \mid x(a)]\, P[Y(b) = K \mid x(b)] = 1 - \frac{\sum_{k=1}^{K} p_k(x(a))\, p_k(x(b))\, P_k^2}{p(x(a))\, p(x(b))}, \]

where $p(x(o)) = \sum_{k=1}^{K} p_k(x(o)) P_k$, $o = a, b$.

Let a clustering algorithm $\mu$ be run to partition $s$ into $K$ subsets (clusters). Because the numbering of clusters does not matter, it is convenient to consider the equivalence relation, i.e. to indicate whether the algorithm $\mu$ assigns each pair of objects to the same class or to different classes. Let

\[ h_\mu(a, b) = I[\mu(a) \ne \mu(b)]. \]

Let us consider the following model of ensemble clustering. Suppose that algorithm $\mu$ is randomized, i.e. it depends on a random value $\Omega$ from a given set of allowable values (parameters or, more generally, learning settings such as bootstrap samples, order of input objects, etc.). In addition to $\Omega$, the algorithm's decisions depend on the true status of the pair $a, b$ (i.e., on $Z$):

\[ h_\mu(a, b) = h_{\mu(\Omega)}(Z, a, b). \]

Hereinafter we will denote $h_{\mu(\Omega)}(Z, a, b) = h(\Omega, Z)$. Suppose that

\[ P[h(\Omega, Z) = 1 \mid Z = 1] = P[h(\Omega, Z) = 0 \mid Z = 0] = q, \]

i.e. the conditional probabilities of a correct decision (either partition or union of objects $a$, $b$) coincide. One can say that $q$ reflects the stability of the algorithm under various learning settings. We shall suppose that $q > 1/2$; this means that algorithm $\mu$ provides better clustering quality than just random guessing. In machine learning theory, such a condition is known as the condition of weak learnability.

Denote $P_h = P[h(\Omega, Z) = 1]$. This quantity shows the homogeneity of the algorithm's decisions: $P_h$ close to 0 or 1, or equivalently a homogeneity index $I_h = 1 - P_h(1 - P_h)$ close to 1, means high agreement between the solutions. Note that

\[ P_h = P[h(\Omega, Z) = 1 \mid Z = 1]\, P_Z + P[h(\Omega, Z) = 1 \mid Z = 0]\,(1 - P_Z) = q P_Z + (1 - q)(1 - P_Z). \]

Suppose that algorithm $\mu$ is run $L$ times under randomly and independently chosen settings. As a result, we get random decisions $h(\Omega_1, Z), \ldots, h(\Omega_L, Z)$. By $\Omega_1, \ldots, \Omega_L$ we denote independent statistical copies of the random vector $\Omega$. For every $\Omega_l$, algorithm $\mu$ runs independently (it does not use the results obtained with other $\Omega_{l'}$, $l' \ne l$). Suppose that the decisions are conditionally independent:

\[ P[h(\Omega_{i_1}, Z) = h_{i_1}, \ldots, h(\Omega_{i_j}, Z) = h_{i_j} \mid Z = z] = P[h(\Omega_{i_1}, Z) = h_{i_1} \mid Z = z] \cdots P[h(\Omega_{i_j}, Z) = h_{i_j} \mid Z = z], \]

where $\Omega_{i_1}, \ldots, \Omega_{i_j}$ are arbitrary learning settings and $h_{i_1}, \ldots, h_{i_j}, z \in \{0, 1\}$ (we shall assume that $L$ is odd).

Let $P_{h,h'} = P[h(\Omega', Z) = 1, h(\Omega'', Z) = 1]$, where $\Omega'$, $\Omega''$ have the same distribution as $\Omega$, and $\Omega' \ne \Omega''$. It follows from the assumptions of independence and stability that

\[ P_{h,h'} = P[h(\Omega', Z) = 1, h(\Omega'', Z) = 1 \mid Z = 1]\, P_Z + P[h(\Omega', Z) = 1, h(\Omega'', Z) = 1 \mid Z = 0]\,(1 - P_Z) = q^2 P_Z + (1 - q)^2 (1 - P_Z). \qquad (1) \]

Denote $\bar{H} = \frac{1}{L} \sum_{l=1}^{L} h(\Omega_l, Z)$. The function

\[ c(h(\Omega_1, Z), \ldots, h(\Omega_L, Z)) = I[\bar{H} > 1/2] \]

shall be called the ensemble solution for $a$ and $b$, based on majority voting.

For constructing a final ensemble clustering decision, various approaches can be utilized [2]. For example, it is possible to use a methodology based on the pairwise dissimilarity matrix $\bar{H} = (\bar{H}(o^{(i_1)}, o^{(i_2)}))$, where $o^{(i_1)}, o^{(i_2)} \in s$, $o^{(i_1)} \ne o^{(i_2)}$. This matrix can be considered as a matrix of pairwise distances between objects and used as input for a dendrogram construction algorithm that partitions the sample into a desired number of clusters.

3 An Upper Bound for Misclassification Probability

Let us consider the margin [11] of the ensemble:

\[ mg = \frac{1}{L} \{\text{number of votes for } Z - \text{number of votes against } Z\}, \]

where $Z = 0, 1$. It is easy to show that the margin equals

\[ mg = mg(\bar{H}, Z) = (2Z - 1)(2\bar{H} - 1). \]

Using the notion of margin, one can represent the probability of a wrong prediction of the true value of $Z$:

\[ P_{err} = P_{\Omega_1, \ldots, \Omega_L, Z}[mg(\bar{H}, Z) < 0]. \]
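The quantities introduced so far ($P_Z$, the pairwise decisions $h$, the ensemble solution and the margin) can be illustrated with a short Python sketch. It is not part of the original paper: the function names, the assumption of Gaussian classes with a common covariance `var * I`, and the toy partitions are all hypothetical choices made for illustration.

```python
import numpy as np

def prob_different(xa, xb, means, var, priors):
    """P_Z = P[Z = 1 | x(a), x(b)] for Gaussian classes with covariance var * I (assumed)."""
    means = np.asarray(means, dtype=float)      # shape (K, n)
    priors = np.asarray(priors, dtype=float)    # shape (K,)

    def weighted_density(x):
        # p_k(x) * P_k up to a factor common to all classes (it cancels in the ratio)
        d2 = ((np.asarray(x, dtype=float) - means) ** 2).sum(axis=1)
        return priors * np.exp(-0.5 * d2 / var)

    wa, wb = weighted_density(xa), weighted_density(xb)
    return 1.0 - (wa * wb).sum() / (wa.sum() * wb.sum())

def pairwise_decisions(labels):
    """h(a, b) = 1 if a and b receive different cluster labels, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(int)

def ensemble_solution(partitions):
    """Return the matrix H_bar and the majority-vote decision I[H_bar > 1/2]."""
    h = np.stack([pairwise_decisions(p) for p in partitions])   # shape (L, N, N)
    h_bar = h.mean(axis=0)
    return h_bar, (h_bar > 0.5).astype(int)

def margin(h_bar, z):
    """mg = (2Z - 1)(2 H_bar - 1), evaluated for every pair."""
    return (2 * z - 1) * (2 * h_bar - 1)

# Toy ensemble: L = 3 runs on N = 4 objects; the second run renumbers the clusters,
# which does not affect the pairwise decisions.
partitions = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]]
h_bar, vote = ensemble_solution(partitions)
z = pairwise_decisions([0, 0, 1, 1])   # hypothetical latent status of each pair
mg = margin(h_bar, z)
```

Working with pairwise decisions rather than raw labels is what removes the class renumbering problem: the first two partitions above produce identical matrices $h$.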

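It follows from Chebyshev's inequality that

\[ P[U < 0] < \frac{\mathrm{Var}\, U}{(EU)^2}, \]

where $EU$ is the population mean and $\mathrm{Var}\, U$ is the variance of a random value $U$ (it is required that $EU > 0$). Thus,

\[ P_{\Omega_1, \ldots, \Omega_L, Z}[mg(\bar{H}, Z) < 0] < \frac{\mathrm{Var}\, mg(\bar{H}, Z)}{(E\, mg(\bar{H}, Z))^2}, \]

provided that $E\, mg(\bar{H}, Z) > 0$.

Theorem. The expected value and variance of the margin are:

\[ E\, mg(\bar{H}, Z) = 2q - 1, \qquad \mathrm{Var}\, mg(\bar{H}, Z) = \frac{4}{L}(P_h - P_{h,h'}). \]

Proof. We have:

\[ E\, mg(\bar{H}, Z) = E(2Z - 1)\Big(\frac{2}{L} \sum_{l=1}^{L} h(\Omega_l, Z) - 1\Big) = \frac{4}{L} \sum_{l=1}^{L} E\, Z h(\Omega_l, Z) - 2EZ - \frac{2}{L} \sum_{l=1}^{L} E\, h(\Omega_l, Z) + 1. \]

Because all $h(\Omega_l, Z)$ are distributed in the same way as $h(\Omega, Z)$, we get:

\[ E\, mg(\bar{H}, Z) = 4 E\, Z h(\Omega, Z) - 2P_Z - 2 E\, h(\Omega, Z) + 1 = 4 P[Z = 1, h(\Omega, Z) = 1] - 2P_Z - 2 P[h(\Omega, Z) = 1] + 1. \]

As

\[ P[h(\Omega, Z) = 1] = P[Z = 1, h(\Omega, Z) = 1] + P[Z = 0, h(\Omega, Z) = 1] = q P_Z + (1 - q)(1 - P_Z) = 2q P_Z + 1 - q - P_Z, \]

we obtain:

\[ E\, mg(\bar{H}, Z) = 4q P_Z - 2P_Z - 2(2q P_Z + 1 - q - P_Z) + 1 = 2q - 1. \]

Consider the variance of the margin:

\[ \mathrm{Var}\, mg(\bar{H}, Z) = \mathrm{Var}(4Z\bar{H} - 2\bar{H} - 2Z) = E(4Z\bar{H} - 2\bar{H} - 2Z)^2 - (E(4Z\bar{H} - 2\bar{H} - 2Z))^2 \]
\[ = E(16 Z^2 \bar{H}^2 + 4\bar{H}^2 + 4Z^2 - 16 Z \bar{H}^2 - 16 Z^2 \bar{H} + 8 Z \bar{H}) - (E\, mg(\bar{H}, Z) - 1)^2 = E(4\bar{H}^2 + 4Z - 8Z\bar{H}) - 4(1 - q)^2 \]

(we apply $Z^2 = Z$). Next, since $h^2 = h$, we have

\[ E\bar{H}^2 = \frac{1}{L^2} E\Big(\sum_{l=1}^{L} h(\Omega_l, Z)\Big)^2 = \frac{1}{L^2} E \sum_{l=1}^{L} h^2(\Omega_l, Z) + \frac{1}{L^2} \sum_{l', l'':\, l' \ne l''} E\big(h(\Omega_{l'}, Z)\, h(\Omega_{l''}, Z)\big) = \frac{1}{L} P_h + \frac{L - 1}{L} P_{h,h'}. \]

From this, and from $E\, Z\bar{H} = q P_Z$, we obtain:

\[ \mathrm{Var}\, mg(\bar{H}, Z) = \frac{4}{L} P_h + \frac{4(L - 1)}{L} P_{h,h'} + 4P_Z - 8q P_Z - 4(1 - q)^2. \]

Using (1), which gives $4P_Z - 8qP_Z - 4(1 - q)^2 = -4P_{h,h'}$, we finally get:

\[ \mathrm{Var}\, mg(\bar{H}, Z) = \frac{4}{L} P_h + \frac{4(L - 1)}{L} P_{h,h'} - 4P_{h,h'} = \frac{4}{L}(P_h - P_{h,h'}). \]

The theorem is proved.

Evidently, the requirement $E\, mg(\bar{H}, Z) > 0$ is fulfilled if $q > 1/2$.

Let us consider the correlation coefficient $\rho$ between $h = h(\Omega', Z)$ and $h' = h(\Omega'', Z)$, where $\Omega' \ne \Omega''$. We have

\[ \rho = \rho_{h,h'} = \frac{P_{h,h'} - P_h^2}{P_h(1 - P_h)}. \]

Because $P_h - P_{h,h'} = P_h - P_h^2 + P_h^2 - P_{h,h'}$, we obtain

\[ \mathrm{Var}\, mg(\bar{H}, Z) = \frac{4}{L}(1 - \rho) P_h (1 - P_h). \]

Note that $P_h - P_{h,h'} = q(1 - q)$, so that $(2q - 1)^2 = 1 - 4q(1 - q) = 1 - 4(1 - \rho) P_h (1 - P_h)$, and after the necessary transformations we get the following upper bound for the error probability:

\[ P_{err} < \frac{1}{L} \left( \frac{1}{1 - 4(1 - \rho) P_h (1 - P_h)} - 1 \right). \]

The obtained expression allows us to draw some qualitative conclusions. Namely, if the model assumptions are fulfilled and $q > 1/2$, then, other conditions being equal, the following statements are valid:

- the probability of error decreases with an increase in the number of ensemble elements;
- an increase in homogeneity of the ensemble and a rise of correlation between its outputs reduce the probability of error (note that a signed value of the correlation coefficient is meant).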
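As a sanity check on the theorem and the bound, the following Python sketch evaluates the bound in both of its forms and compares the theoretical mean and variance of the margin with a Monte Carlo simulation of conditionally independent votes. It is only an illustration: the function names and the values $q = 0.7$, $P_Z = 0.4$, $L = 15$ are assumptions, not taken from the paper, and $P_Z$ is treated as a fixed number.

```python
import numpy as np

def error_bound(q, L):
    """Var(mg) / (E mg)^2 = 4 q (1 - q) / (L (2q - 1)^2); requires q > 1/2."""
    if q <= 0.5:
        raise ValueError("the bound requires q > 1/2")
    return 4.0 * q * (1.0 - q) / (L * (2.0 * q - 1.0) ** 2)

def error_bound_from_rho(rho, p_h, L):
    """The same bound written through the correlation rho and P_h."""
    a = (1.0 - rho) * p_h * (1.0 - p_h)
    return (1.0 / (1.0 - 4.0 * a) - 1.0) / L

def simulate(q, p_z, L, trials, seed=0):
    """Draw Z and L conditionally independent votes; return mean, variance and error rate."""
    rng = np.random.default_rng(seed)
    z = (rng.random(trials) < p_z).astype(int)             # latent status of the pair
    correct = rng.random((trials, L)) < q                   # each vote is correct w.p. q
    h = np.where(correct, z[:, None], 1 - z[:, None])       # votes h(Omega_l, Z)
    h_bar = h.mean(axis=1)
    mg = (2 * z - 1) * (2 * h_bar - 1)                      # margin of the ensemble
    return mg.mean(), mg.var(), (mg < 0).mean()

q, p_z, L = 0.7, 0.4, 15                                    # illustrative values
p_h = q * p_z + (1 - q) * (1 - p_z)                         # P_h
p_hh = q ** 2 * p_z + (1 - q) ** 2 * (1 - p_z)              # P_{h,h'}, equation (1)
rho = (p_hh - p_h ** 2) / (p_h * (1 - p_h))                 # correlation coefficient

bound_q = error_bound(q, L)
bound_rho = error_bound_from_rho(rho, p_h, L)
mean_mg, var_mg, err = simulate(q, p_z, L, trials=100_000)
```

For these values both forms of the bound give about 0.35, while the theorem predicts a mean margin of $2q - 1 = 0.4$ and a variance of $4q(1 - q)/L = 0.056$. The simulated error rate should fall well below the bound, which is consistent with the closing remark of the paper that the tightness of the bound could be improved.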

4 Estimating Characteristics of a Clustering Ensemble

To evaluate the quality of a clustering ensemble, it is necessary to estimate the ensemble's characteristics (in our model, homogeneity and correlation) from a finite number of ensemble elements. For an arbitrary pair of different objects $a$ and $b$, the estimate of the ensemble's homogeneity can be found as follows:

\[ \hat{I}_h(a, b) = 1 - \hat{P}_h(a, b)(1 - \hat{P}_h(a, b)), \]

where

\[ \hat{P}_h(a, b) = \frac{1}{L} \sum_{l=1}^{L} h_l(a, b) \]

and $h_l(a, b)$ is the decision of the $l$-th ensemble element for the pair $a, b$.

Unfortunately, the straightforward estimation of the correlation coefficient $\rho(a, b)$ is impossible: under a fixed sample, every pair of clustering algorithms gives conditionally independent decisions. Let us introduce a similar notion: the averaged correlation coefficient, where the averaging is done over all pairs of different objects:

\[ \bar{\rho} = \frac{\mathrm{cov}(h, h')}{\sigma^2(h)}, \]

where the covariance is $\mathrm{cov}(h, h') = \overline{h h'} - \bar{h}^2$ with

\[ \overline{h h'} = \frac{2}{N(N - 1)} \cdot \frac{2}{L(L - 1)} \sum_{a, b:\, a \ne b} \; \sum_{l', l'':\, l' \ne l''} h_{l'}(a, b)\, h_{l''}(a, b), \]

\[ \bar{h} = \frac{2}{N(N - 1)} \cdot \frac{1}{L} \sum_{a, b:\, a \ne b} \; \sum_{l=1}^{L} h_l(a, b) = \frac{2}{N(N - 1)} \sum_{a, b:\, a \ne b} \hat{P}_h(a, b), \]

and the variance is

\[ \sigma^2(h) = \overline{h^2} - \bar{h}^2 = \bar{h} - \bar{h}^2. \]

Here the sums are understood to run over unordered pairs (each pair of objects and each pair of ensemble elements is counted once), which is what the normalizing factors $2/(N(N - 1))$ and $2/(L(L - 1))$ imply.

Similarly, it is possible to introduce the averaged homogeneity index:

\[ \bar{I}_h = \frac{2}{N(N - 1)} \sum_{a, b:\, a \ne b} \hat{I}_h(a, b). \]

5 Numerical Experiment

To verify the applicability of the suggested methodology for the analysis of clustering ensemble behavior, the statistical modeling approach was used. In the modeling, artificial data sets are repeatedly generated according to a certain distribution class (a type of clustering task). Each data set is classified by the ensemble algorithm. The correct classification rate, averaged over a given number of trials, determines the algorithm's performance for the given type of task.
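The estimates of Section 4 can be computed directly from the label vectors produced by the ensemble. The sketch below is one possible reading, not the author's code: the function name and the toy partitions are hypothetical, every unordered pair of objects is counted once, and the average of $h_{l'} h_{l''}$ over different ensemble elements is obtained from the per-pair vote count $s$ as $(s^2 - s)/(L(L - 1))$, which is valid because the decisions are 0/1.

```python
import numpy as np

def ensemble_characteristics(partitions):
    """Averaged homogeneity index and averaged correlation for L partitions of N objects."""
    labels = np.asarray(partitions)                   # shape (L, N)
    L, N = labels.shape
    rows, cols = np.triu_indices(N, k=1)              # unordered pairs a < b
    # h[l, p] = 1 if the l-th partition puts the objects of pair p in different clusters
    h = (labels[:, :, None] != labels[:, None, :])[:, rows, cols].astype(float)

    p_hat = h.mean(axis=0)                            # estimate of P_h for each pair
    i_hat = 1.0 - p_hat * (1.0 - p_hat)               # estimate of I_h for each pair

    h_mean = p_hat.mean()                             # averaged decision
    s = h.sum(axis=0)                                 # votes "different" per pair
    hh_mean = ((s * s - s) / (L * (L - 1))).mean()    # average of h_l' * h_l'' over l' != l''

    variance = h_mean - h_mean ** 2                   # since h^2 = h for 0/1 decisions
    covariance = hh_mean - h_mean ** 2
    rho_bar = covariance / variance if variance > 0 else float("nan")
    return i_hat.mean(), rho_bar

# Toy ensemble: L = 3 partitions of N = 4 objects (hypothetical labels)
i_bar, rho_bar = ensemble_characteristics([[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 1, 1]])
```

For this toy ensemble the per-pair estimates of $P_h$ are $1/3, 1, 1, 2/3, 2/3, 0$, which gives $\bar{I}_h = 8/9$ and $\bar{\rho} = 23/77$. When all partitions coincide up to renumbering, the same function returns $\bar{\rho} = 1$.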

The following experiment was performed. In each trial, two classes are independently sampled according to the multivariate Gaussian distributions

\[ \mathcal{N}(m_1, \Sigma), \quad \mathcal{N}(m_2, \Sigma), \]

where $m_1$, $m_2$ are vectors of means, $\Sigma = \sigma I = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$ is a diagonal $n$-dimensional covariance matrix, and $\sigma = (\sigma_1, \ldots, \sigma_n)$ is a vector of variances. A variable $X_i$, $i \in \{1, \ldots, n\}$, shall be called noisy if $\sigma_i = \sigma_{noise} \gg 1$. In our experiment, the set of noisy variables $\{X_{i_1}, \ldots, X_{i_{n_{noise}}}\}$ is chosen at random. For those variables that are not noisy, we set $\sigma_i = \sigma_0 = \mathrm{const}$. Both classes have the same sample size. The mixture of samples is classified by an ensemble of k-means clustering algorithms. Each algorithm performs clustering in a randomly chosen variable subspace of dimensionality $n_{ens}$. The ensemble decision for each pair of objects (i.e., either unite them or assign them to different classes) is made by the majority voting procedure. The true overall performance $P_{cor}$ of the ensemble is determined as the proportion of correctly classified pairs.

[Figure 1: curves of $P_{cor}$, $P_{ind}$, averaged $\rho$ and averaged $I_h$ (vertical axis, from 0 to 1) plotted against the number of noisy variables $n_{noise}$ (horizontal axis, from 1 to 6).]

Fig. 1. Example of experiment results (averaged over 100 trials). Experiment settings: $N = 60$, $n = 10$, $m_1 = 0$, $m_2 = 1$, $\sigma_{noise} = 10$, $\sigma_0 = 0.2$, $n_{ens} = 2$, ensemble size $L = 15$.
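A single trial of an experiment of this kind might be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the author's implementation: the k-means routine is a plain Lloyd iteration with random initial centers, the $\sigma_i$ are treated as variances (so the noise is scaled by their square roots), $N$ is assumed even, and all function names are hypothetical. The default settings are those listed in the caption of Fig. 1.

```python
import numpy as np

def kmeans(x, k, rng, n_iter=50):
    """Plain Lloyd's k-means with random initial centers; returns one label per row."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def pair_matrix(labels):
    """True where two objects receive different labels."""
    return labels[:, None] != labels[None, :]

def one_trial(rng, N=60, n=10, n_noise=2, var_noise=10.0, var0=0.2, n_ens=2, L=15):
    """Return (P_cor of the ensemble, P_ind of a single k-means) for one data set."""
    y = np.repeat([0, 1], N // 2)                          # two classes of equal size
    variances = np.full(n, var0)
    variances[rng.choice(n, size=n_noise, replace=False)] = var_noise   # noisy variables
    # class means m1 = (0, ..., 0) and m2 = (1, ..., 1)
    x = y[:, None] * np.ones(n) + rng.normal(size=(N, n)) * np.sqrt(variances)

    votes = np.zeros((N, N))
    for _ in range(L):
        cols = rng.choice(n, size=n_ens, replace=False)    # random variable subspace
        votes += pair_matrix(kmeans(x[:, cols], 2, rng))
    ensemble = votes / L > 0.5                             # majority vote for each pair
    single = pair_matrix(kmeans(x, 2, rng))                # k-means on all n variables

    truth = pair_matrix(y)
    pairs = np.triu_indices(N, k=1)                        # each pair counted once
    p_cor = (ensemble[pairs] == truth[pairs]).mean()
    p_ind = (single[pairs] == truth[pairs]).mean()
    return p_cor, p_ind

rng = np.random.default_rng(0)
results = np.array([one_trial(rng, n_noise=2) for _ in range(5)])
p_cor_mean, p_ind_mean = results.mean(axis=0)
```

Averaging `one_trial` over many trials for each value of `n_noise` would give curves of the same kind as those in Fig. 1; the exact numbers depend on the random seed and on the k-means initialization, so they are not expected to match the published figure.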

An example of experiment results is shown in Figure 1. The graphs display the dependence of the averaged values of $P_{cor}$, $\bar{\rho}$ and $\bar{I}_h$ on the number of noisy variables $n_{noise}$. To demonstrate the effectiveness of the ensemble solution in comparison with individual clustering, the averaged performance rate $P_{ind}$ of a single k-means clustering algorithm is given (this algorithm performs clustering in the whole feature space of dimensionality $n$). From this example, one can conclude that a) on average, the ensemble algorithm performs better than a single clustering algorithm (when noise is present), and b) the dynamics of the estimated ensemble characteristics (averaged homogeneity and correlation) reproduce well the behavior of the correct classification rate (note that this rate is not directly observable in real clustering problems). When the averaged correlation and homogeneity index are sufficiently large, one can expect good classification quality.

Conclusion

A latent variable pairwise classification model is proposed for studying non-asymptotic properties of clustering ensembles. In this model, the notions of stability, homogeneity and correlation between ensemble elements are utilized. An upper bound for the probability of error is obtained. Theoretical analysis of the suggested model allows us to conclude that the probability of a correct decision increases with an increase in the number of ensemble elements. It is also found that a large degree of agreement between partial clustering solutions (expressed in our model in terms of homogeneity and correlation between ensemble elements), under the condition of independence of the base clustering algorithms, indicates good classification performance. The numerical experiment also confirms this conclusion.

The following possible future directions can be indicated.
It would be interesting to study intensional connections between the notions used in the suggested model (conditional independence, stability, homogeneity and correlation) and other known concepts such as mutual information [2] and diversity in clustering ensembles (e.g., [8,9]). Another direction could aim to improve the tightness of the obtained error bound.

Acknowledgements

This work was partially supported by the Russian Foundation for Basic Research, projects 11-07-00346a, 10-01-00113a.

References

1. Jain, A.K.: Data Clustering: 50 Years Beyond K-Means. Pattern Recognition Letters 31(8), 651–666 (2010)
2. Strehl, A., Ghosh, J.: Cluster ensembles - a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research 3, 583–617 (2002)
3. Kuncheva, L.I., Rodriguez, J.J., Plumpton, C.O., Linden, D.E.J., Johnston, S.J.: Random Subspace Ensembles for fMRI Classification. IEEE Transactions on Medical Imaging 29(2), 531–542 (2010)
4. Pestunov, I.A., Berikov, V.B., Kulikova, E.A.: Grid-based ensemble clustering algorithm using sequence of fixed grids. In: Proc. of the 3rd IASTED Intern. Conf. on Automation, Control, and Information Technology, pp. 103–110. ACTA Press, Calgary (2010)
5. Iam-on, N., Boongoen, T., Garrett, S.: LCE: a link-based cluster ensemble method for improved gene expression data analysis. Bioinformatics 26(12), 1513–1519 (2010)
6. Hong, Y., Kwong, S.: To combine steady-state genetic algorithm and ensemble learning for data clustering. Pattern Recognition Letters 29(9), 1416–1423 (2008)
7. Topchy, A., Law, M., Jain, A., Fred, A.: Analysis of Consensus Partition in Cluster Ensemble. In: Fourth IEEE International Conference on Data Mining, pp. 225–232. IEEE Press, New York (2004)
8. Hadjitodorov, S.T., Kuncheva, L.I., Todorova, L.P.: Moderate diversity for better cluster ensembles. Information Fusion 7(3), 264–275 (2006)
9. Azimi, J., Fern, X.: Adaptive Cluster Ensemble Selection. In: Proceedings of International Joint Conference on Artificial Intelligence, pp. 992–997 (2009)
10. Kuncheva, L.: Combining Pattern Classifiers. Methods and Algorithms. John Wiley & Sons, Hoboken (2004)
11. Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001)