Neural Networks
Lesson 5 - Cluster Analysis
Prof. Michele Scarpiniti
INFOCOM Dpt. - Sapienza University of Rome
http://ispac.ing.uniroma1.it/scarpiniti/index.htm
michele.scarpiniti@uniroma1.it
Rome, 29 October 2009
Cluster Analysis
Intelligent circuits and neural network applications can, in principle, be divided into two main categories:
1 Static data processing: pattern recognition, cluster analysis, associative memory;
2 Dynamic data processing: non-linear filtering, prediction, functional and operator approximation, dynamic pattern recognition, etc.
Let us consider cluster analysis, one of the main problems of interest to computer scientists and engineers.
Cluster Analysis is the collective term used to describe the methods for grouping unlabeled data X_i into subsets that are believed to reflect the underlying structure of the data generator.
The techniques for clustering are many and diverse, and are summarized in the following scheme.
Hierarchical algorithms find successive clusters using previously established clusters. These algorithms can be either agglomerative (bottom-up) or divisive (top-down):
1 Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters;
2 Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms within hierarchical clustering.
Bayesian algorithms try to generate a posterior distribution over the collection of all partitions of the data.
Many clustering algorithms, such as the partitional ones, require the number of clusters to be produced to be specified prior to execution of the algorithm.
In clustering, also known as unsupervised pattern classification, there are no training data with known class labels. A clustering algorithm explores the similarity between the patterns and places similar patterns in the same cluster.
Clustering is an unsupervised procedure that uses unlabeled samples. Unsupervised procedures are used for several reasons:
1 collecting and labeling a large set of sample patterns can be costly;
2 one can train with large amounts of unlabeled data, and then use supervision to label the groupings found;
3 exploratory data analysis can provide insight into the nature or structure of the data;
4 well-known clustering applications include data mining and data compression.
A cluster is composed of a number of similar objects collected or grouped together. Patterns within a cluster are more similar to each other than are patterns belonging to different clusters.
Clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by regions containing a relatively low density of points.
The number of clusters in the data often depends on the resolution (fine vs. coarse) with which we view the data. How many clusters do you see in this figure? 5, 8, 10, more?
Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes.
Most clustering algorithms are based on two popular techniques: iterative squared-error partitioning and agglomerative hierarchical clustering.
One of the main challenges is to select an appropriate measure of similarity to define clusters; this choice is often both data (cluster shape) and context dependent.
We define the feature vector $[x_1\ x_2]^T \in \mathbb{R}^{2 \times 1}$, defined in a two-dimensional feature space. We also define a scatter plot as a graphical representation of the feature-space variables versus the true classified species. As an example, we show the scatter plot of the weight and width features of salmon and sea-bass.
Considering the training data set as a subset of correctly classified fishes, we can draw the classification function (or decision boundary) that divides the feature space into two regions.
The decision boundary should be determined according to a criterion that evaluates the trade-off between the complexity of the decision rules and their performance on unknown samples.
Let $\mathbf{x} = [x_1\ x_2\ \ldots\ x_N]^T \in \mathbb{R}^{N \times 1}$ be a point of an N-dimensional feature space; two (or more) data classes are said to be linearly separable if there exists a (usually unknown) linear classification function, or decision boundary, able to separate these classes.
Considering the following figure as an example, with N = 2, classes A and B are not linearly separable while, on the contrary, classes C and D are linearly separable.
Taking into consideration the problem of fish species classification, different criteria lead to different decision boundaries; moreover, more complex models result in more complex boundaries. The following figure shows other possible non-linear decision boundaries that can be obtained using different criteria.
We may distinguish the training samples perfectly, but we cannot predict how well the classifier will generalize to unknown samples.
Consider the classification of two classes of patterns that are linearly separable. The linear classifier hyperplane can be estimated from the training data set using various distance measures. Obviously, different criteria produce different optimal solutions.
For example, in the problem shown in the following figure, the hyperplanes H_1, H_2 and H_3 are all optimal solutions relative to the choice of the distance measure minimized during the learning phase. However, as we can observe, only hyperplane H_2 presents no misclassification error.
From these considerations, it follows that the choice of a distance measure, for the identification of the optimum separation hyperplane, is one of the central points in pattern recognition and is strictly dependent on the specific problem.
Hierarchical clustering
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
1 Agglomerative: this is a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
2 Divisive: this is a top-down approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
In order to decide which clusters should be combined (for agglomerative clustering), or where a cluster should be split (for divisive clustering), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by the use of an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.
Hierarchical clustering: metric
The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. Some commonly used metrics for hierarchical clustering are:
1 Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$;
2 Squared Euclidean distance: $d(x, y) = \sum_i (x_i - y_i)^2$;
3 Manhattan (city-block) distance: $d(x, y) = \sum_i |x_i - y_i|$;
4 Chebychev (maximum) distance: $d(x, y) = \max_i |x_i - y_i|$;
5 Power distance: $d(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/r}$;
6 Percent disagreement: $d(x, y) = \big(\text{number of } x_i \neq y_i\big) / N$;
7 Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^T R^{-1} (x - y)}$;
8 Cosine similarity: $d(x, y) = \cos^{-1} \dfrac{x \cdot y}{\|x\| \, \|y\|}$.
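As a purely illustrative sketch (not part of the original slides), a few of these metrics can be implemented with NumPy as follows; the function names and the toy vectors are my own.

```python
import numpy as np

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # sum of absolute coordinate differences (city-block)
    return np.sum(np.abs(x - y))

def chebychev(x, y):
    # maximum absolute coordinate difference
    return np.max(np.abs(x - y))

def mahalanobis(x, y, R):
    # distance weighted by the inverse of the covariance matrix R
    d = x - y
    return np.sqrt(d @ np.linalg.inv(R) @ d)

def cosine_distance(x, y):
    # angle between the two vectors
    c = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(c, -1.0, 1.0))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])
R = np.eye(3)   # identity covariance: Mahalanobis reduces to Euclidean
print(euclidean(x, y), manhattan(x, y), chebychev(x, y),
      mahalanobis(x, y, R), cosine_distance(x, y))
```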
Hierarchical clustering: linkage criteria
The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. Some commonly used linkage criteria between two sets of observations X and Y (where d is the chosen metric) are:
1 Maximum or complete-linkage clustering: $\max \{ d(x, y) : x \in X, y \in Y \}$;
2 Minimum or single-linkage clustering: $\min \{ d(x, y) : x \in X, y \in Y \}$;
3 Mean or average-linkage clustering (UPGMA): $\dfrac{1}{|X| \, |Y|} \sum_{x \in X} \sum_{y \in Y} d(x, y)$;
4 The sum of all intra-cluster variance;
5 The increase in variance for the cluster being merged (Ward's criterion);
6 The probability that candidate clusters are spawned from the same distribution function (V-linkage).
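The first three criteria only need the matrix of pairwise distances; the following minimal NumPy sketch (my own illustration, assuming the Euclidean distance as the metric d) computes them for two small point sets.

```python
import numpy as np

def pairwise_distances(X, Y):
    # matrix of Euclidean distances between every x in X and every y in Y
    return np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)

def complete_linkage(X, Y):
    # maximum pairwise distance between the two sets
    return pairwise_distances(X, Y).max()

def single_linkage(X, Y):
    # minimum pairwise distance between the two sets
    return pairwise_distances(X, Y).min()

def average_linkage(X, Y):
    # mean pairwise distance (UPGMA)
    return pairwise_distances(X, Y).mean()

X = np.array([[0.0, 0.0], [0.0, 1.0]])
Y = np.array([[3.0, 0.0], [4.0, 1.0]])
print(single_linkage(X, Y), complete_linkage(X, Y), average_linkage(X, Y))
```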
Dendrogram
A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in computational biology to illustrate the clustering of genes.
As a clustering example, suppose these data are to be clustered using the Euclidean distance as the distance metric. The resulting hierarchical clustering dendrogram is shown in the figure: the top row of nodes represents the data, the remaining nodes represent the clusters to which the data belong, and the arrows represent the distances.
The hierarchical cluster tree created by clustering is most easily understood when viewed graphically. The Statistics Toolbox of MATLAB® includes the dendrogram function, which plots this hierarchical tree information as a graph, as in the following example.
In the figure, the numbers along the horizontal axis represent the indices of the objects in the original data set. The links between objects are represented as upside-down U-shaped lines, and the height of each U indicates the distance between the objects it joins. For example, the link representing the cluster containing objects 1 and 3 has a height of 1. The link representing the cluster that groups object 2 together with objects 1, 3, 4, and 5 (which are already clustered as object 8) has a height of 2.5; this height is the distance that the linkage function computes between objects 2 and 8.
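For readers working in Python rather than MATLAB, a roughly equivalent sketch (my own, assuming SciPy and Matplotlib are installed) uses scipy.cluster.hierarchy; the random toy data stand in for the five objects of the slide's example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))              # five 2-D objects

# agglomerative clustering with single linkage and Euclidean metric
Z = linkage(X, method='single', metric='euclidean')

dendrogram(Z)                            # leaves on the horizontal axis are object indices
plt.ylabel('linkage distance')           # the height of each U is the merge distance
plt.show()
```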
K-means algorithm
The k-means clustering algorithm is a method of cluster analysis which aims to partition N observations into K clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data.
Given a set of observations $(x_1, x_2, \ldots, x_N)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the N observations into K sets ($K < N$) $S = \{S_1, S_2, \ldots, S_K\}$ so as to minimize the within-cluster sum of squares (WCSS):
$$\arg\min_S \sum_{i=1}^{K} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2$$
where $\mu_i$ is the mean of $S_i$.
The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community.
Given an initial set of K means $m_1^{(1)}, \ldots, m_K^{(1)}$, which may be specified randomly or by some heuristic, the algorithm proceeds by alternating between two steps:
1 Assignment step: assign each observation to the cluster with the closest mean
$$S_i^{(t)} = \left\{ x_j : \| x_j - m_i^{(t)} \| \le \| x_j - m_{i^*}^{(t)} \| \ \text{for all } i^* = 1, \ldots, K \right\}$$
2 Update step: calculate the new means as the centroids of the observations in the clusters
$$m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$$
The algorithm is deemed to have converged when the assignments no longer change.
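A minimal NumPy sketch of this iteration (my own illustration, assuming Euclidean distance and a random initialization from the data; not an optimized implementation):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    # initial means: K observations drawn at random from the data set
    means = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # assignment step: index of the closest mean for every observation
        dist = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                       # assignments no longer change: converged
        labels = new_labels
        # update step: each mean becomes the centroid of its cluster
        for i in range(K):
            if np.any(labels == i):
                means[i] = X[labels == i].mean(axis=0)
    return means, labels

# toy data: three Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 0], [0, 3])])
means, labels = kmeans(X, K=3)
print(means)
```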
Example of K-means algorithm
1) K initial means (in this case K = 3) are randomly selected from the data set (shown in color).
2) K clusters are created by associating every observation with the nearest mean.
3) The centroid of each of the K clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence is reached.
As k-means is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions.
The two key features of k-means which make it efficient are often also regarded as its biggest drawbacks:
1 The number of clusters K is an input parameter: an inappropriate choice of K may yield poor results;
2 Euclidean distance is used as the metric and variance is used as the measure of cluster scatter.
The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are used to aid border detection and object recognition. In this context, the standard Euclidean distance is usually insufficient for forming the clusters.
K-medoids algorithm
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm. Both the k-means and k-medoids algorithms are partitional (breaking the data set up into groups) and both attempt to minimize the squared error, i.e. the distance between the points labeled to be in a cluster and the point designated as the center of that cluster.
K-medoids is a classical partitioning technique that clusters a data set of N objects into a number K of clusters known a priori. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars). A medoid can be defined as the object of a cluster whose average dissimilarity to all the other objects in the cluster is minimal. The method is more robust to noise and outliers than k-means.
The most common realization of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm, which proceeds as follows:
1 Initialize: randomly select K of the N data points as the medoids;
2 Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly the Euclidean, Manhattan or Minkowski distance);
3 For each medoid m and for each non-medoid data point p: swap m and p and compute the total cost of the configuration;
4 Select the configuration with the lowest cost;
5 Repeat steps 2 to 4 until there is no change in the medoids.
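A naive NumPy sketch of the PAM loop described above (my own illustration; Euclidean distance is assumed and no efficiency tricks are used, so it is only suitable for small data sets):

```python
import numpy as np

def total_cost(X, medoid_idx):
    # cost = sum over all points of the distance to the closest medoid
    dist = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return dist.min(axis=1).sum()

def pam(X, K, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=K, replace=False))  # step 1: random medoids
    improved = True
    while improved:                                 # step 5: repeat until no change
        improved = False
        for mi in range(K):                         # step 3: for each medoid...
            for p in range(len(X)):                 # ...and each non-medoid point
                if p in medoids:
                    continue
                candidate = medoids.copy()
                candidate[mi] = p                   # swap the medoid with the point
                if total_cost(X, candidate) < total_cost(X, medoids):
                    medoids = candidate             # step 4: keep the cheaper configuration
                    improved = True
    # step 2: final assignment of every point to its closest medoid
    dist = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return medoids, dist.argmin(axis=1)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in ([0, 0], [3, 3])])
print(pam(X, K=2)[0])
```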
K-medians algorithm
The k-medians algorithm is similar to the k-means one but, in this case, the squared error is replaced by the absolute error, and the mean is replaced with a median-like object arising from the optimization. The objective function is
$$\arg\min_S \sum_{i=1}^{K} \sum_{x_j \in S_i} \| x_j - \mu_i \|_1$$
There is some evidence that this procedure is more resistant to outliers or strong non-normality than regular k-means. However, like all centroid-based methods, it works best when the clusters are convex.
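A sketch of one common variant (my own illustration, assuming L1-distance assignment and a component-wise median as the "median-like object"; other choices of the median step exist):

```python
import numpy as np

def kmedians(X, K, max_iter=100, seed=0):
    """k-medians: L1 assignment step, component-wise median update step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # assignment with the L1 (Manhattan) distance
        dist = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # update: component-wise median of each cluster
        for i in range(K):
            if np.any(labels == i):
                centers[i] = np.median(X[labels == i], axis=0)
    return centers, labels
```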
Silhouette
Silhouette refers to a method of interpretation and validation of clusters of data. Assume the data have been clustered. For each datum i, let a(i) be the average dissimilarity of i with all other data within the same cluster. Any measure of dissimilarity can be used, but distance measures are the most common. We can interpret a(i) as a measure of how well matched i is to the cluster it is assigned to (the smaller the value, the better the matching).
Then find the average dissimilarity of i with the data of another single cluster, and repeat this for every cluster that i is not a member of. Denote the lowest of these average dissimilarities by b(i). The corresponding cluster is said to be the neighboring cluster of i, as it is, aside from the cluster i is assigned to, the cluster in which i fits best.
We now define
$$s(i) = \begin{cases} 1 - a(i)/b(i), & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1, & \text{if } a(i) > b(i) \end{cases}$$
From the above definition it is clear that $-1 \le s(i) \le 1$.
For s(i) to be close to one we require $a(i) \ll b(i)$. As a(i) is a measure of how dissimilar i is to its own cluster, a small value means it is well matched. Furthermore, a large b(i) implies that i is badly matched to its neighboring cluster. Thus an s(i) close to one means that the datum is appropriately clustered. If s(i) is close to minus one, then by the same logic we see that it would be more appropriate if i were clustered in its neighboring cluster. An s(i) near zero means that the datum is on the border of two natural clusters.
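An illustrative NumPy sketch of the per-point silhouette (my own; Euclidean distance is assumed as the dissimilarity, and s(i) is computed in the equivalent compact form (b - a)/max(a, b)):

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette values s(i) for a given clustering of X."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise distances
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():
            continue                              # singleton cluster: leave s(i) = 0
        a = dist[i, same].mean()                  # mean dissimilarity to own cluster
        b = min(dist[i, labels == c].mean()       # lowest mean dissimilarity to another cluster
                for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)                # compact form of the piecewise definition
    return s

X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
labels = np.array([0, 0, 1, 1])
print(silhouette(X, labels))
```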
QT clustering algorithm
QT (quality threshold) clustering is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times. The algorithm is:
1 The user chooses a maximum diameter for clusters;
2 Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold;
3 Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration (what happens when more than one candidate cluster has the maximum number of points must be clarified by a tie-breaking rule);
4 Recurse with the reduced set of points.
The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group.
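An illustrative sketch of these steps (my own; Euclidean distance is assumed, and ties between equally large candidate clusters are broken by keeping the first one found):

```python
import numpy as np

def qt_cluster(X, max_diameter):
    """QT clustering with complete linkage as the point-to-group distance."""
    points = list(range(len(X)))
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = []
    while points:
        best = []
        for seed in points:                        # build a candidate cluster around every point
            cand = [seed]
            rest = [p for p in points if p != seed]
            while rest:
                # next point: the one whose maximum distance to the candidate is smallest
                j = min(rest, key=lambda p: dist[p, cand].max())
                if dist[j, cand].max() > max_diameter:
                    break                          # adding j would exceed the diameter threshold
                cand.append(j)
                rest.remove(j)
            if len(cand) > len(best):              # keep the largest candidate (first found wins ties)
                best = cand
        clusters.append(best)                      # save it as a true cluster
        points = [p for p in points if p not in best]   # recurse on the remaining points
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, size=(20, 2)) for c in ([0, 0], [3, 3])])
print([len(c) for c in qt_cluster(X, max_diameter=1.0)])
```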
Spectral clustering
Given a set of data points A, the similarity matrix may be defined as a matrix S where $s_{ij}$ represents a measure of the similarity between points $i, j \in A$. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions.
One such technique is the Shi-Malik algorithm, commonly used for image segmentation. It partitions points into two sets $(S_1, S_2)$ based on the eigenvector v corresponding to the second-smallest eigenvalue of the Laplacian matrix
$$L = I - D^{-1/2} S D^{-1/2}$$
of S, where D is the diagonal matrix with $d_{ii} = \sum_j s_{ij}$.
This partitioning may be done in various ways, such as by taking the median m of the components of v and placing all points whose component in v is greater than m in $S_1$, and the rest in $S_2$.
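A compact sketch of this bipartition (my own illustration; it uses the normalized Laplacian exactly as written above and the median split, leaving out the refinements of the full Shi-Malik normalized-cut procedure):

```python
import numpy as np

def spectral_bipartition(S):
    """Split the data into two sets using the second eigenvector of L = I - D^{-1/2} S D^{-1/2}."""
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(L)        # eigh returns eigenvalues in ascending order
    v = eigvecs[:, 1]                           # eigenvector of the second-smallest eigenvalue
    return v > np.median(v)                     # points above the median go to S1, the rest to S2

# similarity matrix from a Gaussian kernel on toy 1-D data
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
S = np.exp(-(x[:, None] - x[None, :]) ** 2)
print(spectral_bipartition(S))
```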
Gaussian mixture models clustering
There is another way to deal with clustering problems: a model-based approach, which consists in using certain models for the clusters and attempting to optimize the fit between the data and the model.
In practice, each cluster can be mathematically represented by a parametric distribution, like a Gaussian (continuous) or a Poisson (discrete). The entire data set is therefore modeled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution.
A mixture model with high likelihood tends to have the following traits:
1 component distributions have high peaks (data in one cluster are tight);
2 the mixture model covers the data well (dominant patterns in the data are captured by the component distributions).
The main advantages of model-based clustering are:
1 well-studied statistical inference techniques are available;
2 flexibility in choosing the component distribution;
3 a density estimate is obtained for each cluster;
4 a soft classification is available.
The most widely used clustering method of this kind is the one based on learning a mixture of Gaussians: we can actually consider clusters as Gaussian distributions centered on their barycentres, as we can see in this picture, where the grey circle represents the first variance of the distribution.
Clusters are assigned by selecting the component that maximizes the posterior probability. Like k-means clustering, Gaussian mixture modeling uses an iterative algorithm that converges to a local optimum. Gaussian mixture modeling may be more appropriate than k-means clustering when clusters have different sizes and correlations within them.
The algorithm works in this way:
1 it chooses the parameters $\theta_i$ (for $i = 1, \ldots, K$) at random, with prior probability $p(\theta_i)$ and Gaussian components $N(\mu_i, \sigma)$;
2 it obtains the posterior distribution (likelihood), where $\theta_i = \{\mu_i, \sigma\}$;
3 the likelihood function is
$$L = \sum_{i=1}^{K} p(\theta_i) \, p(x \mid \theta_i)$$
4 The likelihood function should now be maximized by solving $\partial L / \partial \theta_i = 0$, but this would be too difficult to do directly. That is why a simplified iterative algorithm, known as EM (Expectation-Maximization), is used instead.
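A minimal 1-D sketch of the EM iteration (my own illustration; unlike the shared σ written above, this sketch lets each component keep its own standard deviation, and all names are illustrative):

```python
import numpy as np

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture: alternate responsibilities (E) and parameter updates (M)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    mu = rng.choice(X, size=K, replace=False)   # initial means picked from the data
    sigma = np.full(K, X.std())                 # initial spreads
    pi = np.full(K, 1.0 / K)                    # initial mixing weights p(theta_i)
    for _ in range(n_iter):
        # E-step: posterior probability (responsibility) of each component for each point
        dens = np.exp(-0.5 * ((X[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and spreads from the responsibilities
        Nk = resp.sum(axis=0)
        pi = Nk / n
        mu = (resp * X[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (X[:, None] - mu) ** 2).sum(axis=0) / Nk)
    # hard cluster assignment: the component maximizing the posterior probability
    return mu, sigma, pi, resp.argmax(axis=1)

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 0.5, 200)])
mu, sigma, pi, labels = em_gmm(X, K=2)
print(mu, sigma, pi)
```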
Bayesian clustering
Fundamentally, Bayesian clustering aims to obtain a posterior distribution over the partitions of the data set D, denoted by $C = \{C_1, \ldots, C_K\}$, with or without specifying K.
Several methods have been proposed for how to do this. Usually they come down to specifying a hierarchical model mimicking the partial order on the class of partitions, so that the procedure is also hierarchical, usually agglomerative.
The first effort towards Bayesian clustering was the hierarchical technique due to Makato and Tokunaga. Starting with the data $D = \{x_1, \ldots, x_N\}$ as N clusters of size 1, the idea is to merge clusters when the probability of the merged cluster $p(C_k \cup C_j)$ is greater than the probability of the individual clusters $p(C_k) p(C_j)$. Thus the clusters themselves are treated as random variables.
References
B. Clarke, E. Fokoué, H.H. Zhang. Principles and Theory for Data Mining and Machine Learning. Springer, 2009.
S. Theodoridis and K. Koutroumbas. Pattern Recognition. Elsevier, 2003.
J.P. Marques de Sá. Pattern Recognition. Springer, 2001.
J.S. Liu, J.L. Zhang, M.J. Palumbo, C.E. Lawrence. Bayesian Clustering with Variable and Transformation Selections. In Bayesian Statistics (Bernardo et al., Eds.), Oxford University Press, 2003.