Neural Networks Lesson 5 - Cluster Analysis

Transcription

1 Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome michele.scarpiniti@uniroma1.it Rome, 29 October 2009 M. Scarpiniti Neural Networks Lesson 5 - Cluster Analysis 1 / 35



4 Cluster Analysis
Intelligent Circuits and Neural Network applications can, in principle, be divided into two main categories:
1. Static data processing: pattern recognition, cluster analysis, associative memory;
2. Dynamic data processing: non-linear filtering, prediction, functional and operator approximation, dynamic pattern recognition, etc.
Let us consider cluster analysis, one of the main problems of interest to computer scientists and engineers.

5 Cluster Analysis
Cluster Analysis is the collection of procedures used for grouping unlabeled data $X_i$ into subsets that are believed to reflect the underlying structure of the data generator. The techniques for clustering are many and diverse, and are summarized in the following scheme. [Figure: scheme of clustering techniques]

6 Cluster Analysis
Hierarchical algorithms find successive clusters using previously established clusters. These algorithms can be either agglomerative (bottom-up) or divisive (top-down):
1. Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters;
2. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in hierarchical clustering.
Bayesian algorithms try to generate a posterior distribution over the collection of all partitions of the data.
Many clustering algorithms, partitional ones in particular, require the number of clusters to be specified before the algorithm is run.

7 Cluster Analysis
In clustering, also known as unsupervised pattern classification, there are no training data with known class labels. A clustering algorithm explores the similarity between the patterns and places similar patterns in the same cluster. Clustering is an unsupervised procedure that uses unlabeled samples. Unsupervised procedures are used for several reasons:
1. collecting and labeling a large set of sample patterns can be costly;
2. one can train with large amounts of unlabeled data, and then use supervision to label the groupings found;
3. exploratory data analysis can provide insight into the nature or structure of the data;
4. well-known clustering applications include data mining and data compression.

8 Cluster Analysis
A cluster comprises a number of similar objects collected or grouped together. Patterns within a cluster are more similar to each other than are patterns in different clusters. Clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points. The number of clusters in the data often depends on the resolution (fine vs. coarse) with which we view the data. How many clusters do you see in this figure? 5, 8, 10, more? [Figure: scatter of points]

9 Cluster Analysis
Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes. Most clustering algorithms are based on one of two popular techniques: iterative squared-error partitioning and agglomerative hierarchical clustering. One of the main challenges is to select an appropriate measure of similarity to define clusters; this choice often depends both on the data (cluster shape) and on the context.

10 Cluster Analysis
We define the feature vector $\mathbf{x} = [x_1\ x_2]^T \in \mathbb{R}^{2 \times 1}$ in a two-dimensional feature space. We also define the scatter plot as a graphical representation of the feature-space variables against the true (classified) species. As an example we show the scatter plot of the weight and width features of salmon and sea bass. [Figure: scatter plot of weight vs. width for salmon and sea bass] Considering the training data set as a subset of correctly classified fish, we can draw the classification function (or decision boundary) that divides the feature space into two regions. The decision boundary should be determined using a criterion that evaluates the tradeoff between the complexity of the decision rule and its performance on unknown samples.

11 Cluster Analysis
Let $\mathbf{x} = [x_1\ x_2\ \cdots\ x_N]^T \in \mathbb{R}^{N \times 1}$ be a data point in an $N$-dimensional feature space. Two (or more) data classes are said to be linearly separable if there exists a (usually unknown) linear classification function, or decision boundary, able to separate these classes. In the following figure, as an example with $N = 2$, classes A and B are not linearly separable while classes C and D are linearly separable. [Figure: classes A, B and classes C, D in a two-dimensional feature space]

12 Cluster Analysis
Returning to the problem of fish species classification, different criteria lead to different decision boundaries, and more complex models result in more complex boundaries. The following figure shows other possible non-linear decision boundaries that can be determined using different criteria. [Figure: non-linear decision boundaries] We may separate the training samples perfectly, but we cannot predict how well the boundary will generalize to unknown samples.

13 Cluster Analysis
Consider the classification of two classes of patterns that are linearly separable. The hyperplane of the linear classifier can be estimated from the training data set using various distance measures. Different criteria produce different optimal solutions. For example, in the problem shown in the figure, hyperplanes $H_1$, $H_2$ and $H_3$ are all optimal with respect to the distance measure that each one minimizes during the learning phase. [Figure: hyperplanes $H_1$, $H_2$, $H_3$] However, as we can observe, only hyperplane $H_2$ produces no misclassification error. It follows that the choice of the distance measure used to identify the optimal separating hyperplane is one of the central points in pattern recognition, and it depends strictly on the specific problem.


15 Hierarchical clustering
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two types:
1. Agglomerative: a bottom-up approach in which each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
2. Divisive: a top-down approach in which all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
In general, the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
In order to decide which clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by using an appropriate metric (a measure of distance between pairs of observations) and a linkage criterion, which specifies the dissimilarity of sets as a function of the pairwise distances of the observations in the sets.
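
The agglomerative strategy can be illustrated with a minimal sketch in plain Python. It is a naive implementation written for this transcript, not the lecturer's code: the function name `agglomerative`, the toy data and the choice of single linkage are all assumptions made for illustration. Each merge is recorded together with its height, which is the information a dendrogram displays.

```python
from itertools import combinations

def agglomerative(points, dist, linkage):
    """Naive bottom-up clustering (illustrative helper, not from the slides).

    Returns the merges as (cluster_a, cluster_b, height), in the order performed.
    Clusters are tuples of point indices.
    """
    clusters = [(i,) for i in range(len(points))]  # every point starts alone
    merges = []

    def dissimilarity(a, b):
        # Linkage criterion applied to all pairwise distances between a and b.
        return linkage([dist(points[i], points[j]) for i in a for j in b])

    while len(clusters) > 1:
        # Greedy step: merge the pair of clusters with the smallest dissimilarity.
        a, b = min(combinations(clusters, 2), key=lambda pair: dissimilarity(*pair))
        merges.append((a, b, dissimilarity(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

def euclid(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

# Hypothetical toy data: two tight pairs far from each other.
data = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = agglomerative(data, euclid, linkage=min)  # min = single linkage
for a, b, height in merges:
    print(a, b, round(height, 3))
```

Passing `max` instead of `min` as `linkage` gives complete linkage. The sketch recomputes all pairwise distances at every step, so it is only suitable for very small data sets.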

16 Hierarchical clustering: metric
The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another. Some commonly used metrics for hierarchical clustering are:
1. Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$;
2. Squared Euclidean distance: $d(x, y) = \sum_i (x_i - y_i)^2$;
3. Manhattan (city-block) distance: $d(x, y) = \sum_i |x_i - y_i|$;
4. Chebyshev (maximum) distance: $d(x, y) = \max_i |x_i - y_i|$;
5. Power distance: $d(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/r}$;
6. Percent disagreement: $d(x, y) = (\text{number of } x_i \neq y_i) / n$, where $n$ is the number of components;
7. Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^T R^{-1} (x - y)}$, where $R$ is the covariance matrix;
8. Cosine similarity: $d(x, y) = \cos^{-1} \left( \frac{x \cdot y}{\|x\| \, \|y\|} \right)$.
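
As an illustration, most of these metrics can be written in a few lines of plain Python. The function names below are chosen for this sketch only, and the Mahalanobis distance is left out because it needs a matrix inverse. The example at the end shows the point made above: which of two points is "closer" to the origin depends on the metric.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

def power_distance(x, y, p, r):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)

def percent_disagreement(x, y):
    return sum(a != b for a, b in zip(x, y)) / len(x)

def angular(x, y):
    # Inverse cosine of the cosine similarity; clamped to guard against rounding.
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / norms)))

# Which of q1, q2 is closer to the origin depends on the metric.
origin, q1, q2 = (0.0, 0.0), (3.0, 3.0), (0.0, 4.0)
closer_euclidean = "q1" if euclidean(origin, q1) < euclidean(origin, q2) else "q2"
closer_chebyshev = "q1" if chebyshev(origin, q1) < chebyshev(origin, q2) else "q2"
print(closer_euclidean, closer_chebyshev)
```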

17 Hierarchical clustering: linkage criteria
The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between observations. Some commonly used linkage criteria between two sets of observations $X$ and $Y$ (where $d$ is the chosen metric) are:
1. Maximum or complete linkage clustering: $\max \{ d(x, y) : x \in X, y \in Y \}$;
2. Minimum or single-linkage clustering: $\min \{ d(x, y) : x \in X, y \in Y \}$;
3. Mean or average linkage clustering, or UPGMA: $\frac{1}{|X| \, |Y|} \sum_{x \in X} \sum_{y \in Y} d(x, y)$;
4. The sum of all intra-cluster variance;
5. The increase in variance for the cluster being merged (Ward's criterion);
6. The probability that candidate clusters are spawned from the same distribution function (V-linkage).
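
The first three criteria can be sketched directly from their definitions. The function names and the one-dimensional toy sets are assumptions made for this illustration.

```python
def pairwise(X, Y, d):
    """All distances d(x, y) with x in X and y in Y."""
    return [d(x, y) for x in X for y in Y]

def complete_linkage(X, Y, d):
    return max(pairwise(X, Y, d))

def single_linkage(X, Y, d):
    return min(pairwise(X, Y, d))

def average_linkage(X, Y, d):
    return sum(pairwise(X, Y, d)) / (len(X) * len(Y))

# Hypothetical one-dimensional sets with the absolute difference as metric.
X, Y = [0.0, 1.0], [3.0, 5.0]
d = lambda x, y: abs(x - y)
print(complete_linkage(X, Y, d), single_linkage(X, Y, d), average_linkage(X, Y, d))
```

For these sets the pairwise distances are 3, 5, 2 and 4, so the three criteria give 5, 2 and 3.5 respectively.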

18 Dendrogram
A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. Dendrograms are often used in computational biology to illustrate the clustering of genes. As a clustering example, suppose the data in the figure are to be clustered using Euclidean distance as the metric. [Figure: example data] The resulting hierarchical clustering dendrogram is shown in the figure. [Figure: dendrogram] Here the top row of nodes represents the data, the remaining nodes represent the clusters to which the data belong, and the arrows represent the distance.

19 Dendrogram
The hierarchical cluster tree created by clustering is most easily understood when viewed graphically. The Statistics Toolbox of MATLAB® includes the dendrogram function, which plots this hierarchical tree information as a graph, as in the following example. [Figure: dendrogram plot] In the figure, the numbers along the horizontal axis represent the indexes of the objects in the original data set. The links between objects are represented as upside-down U-shaped lines. The height of the U indicates the distance between the objects. For example, the link representing the cluster containing objects 1 and 3 has a height of 1. The link representing the cluster that groups object 2 together with objects 1, 3, 4, and 5 (which are already clustered as object 8) has a height of 2.5. This height is the distance that the linkage function computes between objects 2 and 8.

20 K-means algorithm
The k-means clustering algorithm is a method of cluster analysis which aims to partition $N$ observations into $K$ clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data.
Given a set of observations $(x_1, x_2, \dots, x_N)$, where each observation is a $d$-dimensional real vector, k-means clustering aims to partition the $N$ observations into $K$ sets ($K < N$) $S = \{S_1, S_2, \dots, S_K\}$ so as to minimize the within-cluster sum of squares (WCSS):
$\arg\min_S \sum_{i=1}^{K} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2$
where $\mu_i$ is the mean of $S_i$.

21 K-means algorithm
The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in the computer science community.
Given an initial set of $K$ means $m_1^{(1)}, \dots, m_K^{(1)}$, which may be specified randomly or by some heuristic, the algorithm proceeds by alternating between two steps:
1. Assignment step: assign each observation to the cluster with the closest mean, $S_i^{(t)} = \{ x_j : \| x_j - m_i^{(t)} \| \le \| x_j - m_{i^*}^{(t)} \| \text{ for all } i^* = 1, \dots, K \}$;
2. Update step: calculate the new means as the centroids of the observations in each cluster, $m_i^{(t+1)} = \frac{1}{|S_i^{(t)}|} \sum_{x_j \in S_i^{(t)}} x_j$.
The algorithm is deemed to have converged when the assignments no longer change.
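
The two alternating steps can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names, the toy data and the initial means are assumptions, and keeping the old mean when a cluster becomes empty is a choice made here that the slides do not discuss.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, initial_means, max_iter=100):
    """Lloyd-style iteration; returns (means, assignment)."""
    means = list(initial_means)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the closest mean for every observation.
        new_assignment = [
            min(range(len(means)), key=lambda i: sq_dist(p, means[i]))
            for p in points
        ]
        if new_assignment == assignment:  # converged: assignments unchanged
            break
        assignment = new_assignment
        # Update step: each mean becomes the centroid of its cluster.
        for i in range(len(means)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its previous mean
                means[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return means, assignment

def wcss(points, means, assignment):
    """Within-cluster sum of squares of a given clustering."""
    return sum(sq_dist(p, means[a]) for p, a in zip(points, assignment))

# Hypothetical toy data: two well-separated groups of three points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
means, assignment = kmeans(points, initial_means=[(1, 1), (9, 8)])
print(means, assignment, wcss(points, means, assignment))
```

With a different initialization the result may differ, which is why the procedure is commonly restarted several times and the clustering with the lowest WCSS is kept.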

22 Example of K-means algorithm
1) K initial means (in this case K = 3) are randomly selected from the data set (shown in color).
2) K clusters are created by associating every observation with the nearest mean.
3) The centroid of each of the K clusters becomes the new mean.
4) Steps 2 and 3 are repeated until convergence has been reached.

23 K-means algorithm
As it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters. As the algorithm is usually very fast, it is common to run it multiple times with different starting conditions.
The two key features of k-means which make it efficient are often regarded as its biggest drawbacks:
1. The number of clusters K is an input parameter: an inappropriate choice of K may yield poor results;
2. Euclidean distance is used as a metric and variance is used as a measure of cluster scatter.
The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are used to aid border detection and object recognition. In this context, the standard Euclidean distance is usually insufficient for forming the clusters.

24 K-medoids algorithm
The k-medoids algorithm is a clustering algorithm related to the k-means algorithm. Both the k-means and k-medoids algorithms are partitional (breaking the data set up into groups) and both attempt to minimize squared error, the distance between points labeled to be in a cluster and a point designated as the center of that cluster.
K-medoids is a classical partitioning technique that clusters a data set of N objects into K clusters, with K known a priori. In contrast to the k-means algorithm, k-medoids chooses data points as centers (medoids or exemplars). A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal. K-medoids is more robust to noise and outliers than k-means.

25 K-medoids algorithm
The most common realization of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm, which proceeds as follows:
1. Initialize: randomly select K of the N data points as the medoids;
2. Associate each data point with the closest medoid ("closest" here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance);
3. For each medoid m and for each non-medoid data point p: swap m and p and compute the total cost of the configuration;
4. Select the configuration with the lowest cost;
5. Repeat steps 2 to 4 until there is no change in the medoids.
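
The swap-based search can be sketched as follows. This is an illustrative reading of the steps above rather than an optimized PAM implementation: the function names are assumptions, and the initial medoids are passed in explicitly (instead of being drawn at random) so that the example is reproducible.

```python
def total_cost(points, medoids, dist):
    """Sum over all points of the distance to the closest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def pam(points, initial_medoids, dist):
    """Medoids are indices into `points`; returns (medoids, labels, cost)."""
    medoids = list(initial_medoids)
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        best_config = medoids
        # Try every (medoid, non-medoid) swap and remember the cheapest one.
        for idx in range(len(medoids)):
            for p in range(len(points)):
                if p in medoids:
                    continue
                candidate = medoids[:idx] + [p] + medoids[idx + 1:]
                cost = total_cost(points, candidate, dist)
                if cost < best:
                    best, best_config, improved = cost, candidate, True
        medoids = best_config
    # Label every point with the index of its closest medoid.
    labels = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return medoids, labels, best

# Hypothetical one-dimensional data with a deliberately poor initialization.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
medoids, labels, cost = pam(points, initial_medoids=[0, 1], dist=lambda a, b: abs(a - b))
print(sorted(medoids), labels, cost)
```

On this data the search moves the medoids from the points 1 and 2 to the points 2 and 11, reducing the total cost from 28 to 4.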

26 K-medians algorithm
The k-medians algorithm is similar to k-means, but the squared error is replaced by the absolute error, and the mean is replaced with a median-like object arising from the optimization. The objective function is
$\arg\min_S \sum_{i=1}^{K} \sum_{x_j \in S_i} \| x_j - \mu_i \|_1$
where $\mu_i$ now denotes the median-like center of $S_i$.
There is some evidence that this procedure is more resistant to outliers or strong non-normality than regular k-means. However, like all centroid-based methods, it works best when the clusters are convex.
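
A small numerical check illustrates why the median is the natural center for the absolute error and why it resists outliers. The one-dimensional cluster below is a made-up example; in several dimensions a component-wise median is a common choice, which is an assumption not stated in the slides.

```python
from statistics import mean, median

# Hypothetical cluster with one strong outlier.
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]

def abs_error(center):
    return sum(abs(x - center) for x in cluster)

def sq_error(center):
    return sum((x - center) ** 2 for x in cluster)

m, med = mean(cluster), median(cluster)
print("mean:", m, "median:", med)
print("absolute error  at mean / median:", abs_error(m), abs_error(med))
print("squared error   at mean / median:", sq_error(m), sq_error(med))
```

The outlier drags the mean far from the bulk of the data, while the median stays among the typical values; the median gives the lower absolute error and the mean gives the lower squared error.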

27 Silhouette
Silhouette refers to a method of interpretation and validation of clusters of data. Assume the data have been clustered. For each datum $i$, let $a(i)$ be the average dissimilarity of $i$ with all other data within the same cluster. Any measure of dissimilarity can be used, but distance measures are the most common. We can interpret $a(i)$ as how well matched $i$ is to the cluster it is assigned to (the smaller the value, the better the match). Then find the average dissimilarity of $i$ with the data of another single cluster, and repeat this for every cluster of which $i$ is not a member. Denote the lowest of these average dissimilarities by $b(i)$. The cluster that attains it is said to be the neighboring cluster of $i$ since, aside from the cluster $i$ is assigned to, it is the cluster $i$ fits best in.

28 Silhouette
We now define:
$s(i) = 1 - a(i)/b(i)$, if $a(i) < b(i)$;
$s(i) = 0$, if $a(i) = b(i)$;
$s(i) = b(i)/a(i) - 1$, if $a(i) > b(i)$.
From the above definition it is clear that $-1 \le s(i) \le 1$.
For $s(i)$ to be close to one we require $a(i) \ll b(i)$. As $a(i)$ is a measure of how dissimilar $i$ is to its own cluster, a small value means it is well matched. Furthermore, a large $b(i)$ implies that $i$ is badly matched to its neighboring cluster. Thus an $s(i)$ close to one means that the datum is appropriately clustered. If $s(i)$ is close to negative one, then by the same logic $i$ would be more appropriately placed in its neighboring cluster. An $s(i)$ near zero means that the datum is on the border of two natural clusters.
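
The definition of $s(i)$ translates almost line by line into code. The sketch below assumes at least two clusters and, as a convention chosen here and not stated in the slides, gives a score of 0 to a datum that is alone in its cluster.

```python
def silhouette(points, labels, dist):
    """Silhouette value s(i) for every datum; requires at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # assumption: a singleton cluster gets s(i) = 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        # b(i): lowest average dissimilarity to any other cluster.
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        if a < b:
            scores.append(1 - a / b)
        elif a > b:
            scores.append(b / a - 1)
        else:
            scores.append(0.0)
    return scores

# Hypothetical one-dimensional data, first clustered sensibly, then badly.
points = [0.0, 1.0, 10.0, 11.0]
d = lambda x, y: abs(x - y)
good = silhouette(points, [0, 0, 1, 1], d)
bad = silhouette(points, [0, 1, 1, 1], d)
print([round(s, 3) for s in good])
print([round(s, 3) for s in bad])
```

In the second labeling the point at 1.0 is grouped with 10.0 and 11.0, and its silhouette value becomes strongly negative, signalling that it belongs to the neighboring cluster.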

29 QT clustering algorithm
QT (quality threshold) clustering is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times. The algorithm is:
1. The user chooses a maximum diameter for clusters;
2. Build a candidate cluster for each point by including the closest point, the next closest, and so on, until the diameter of the cluster surpasses the threshold;
3. Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration (the rule for breaking ties between candidate clusters of equal size is left unspecified);
4. Recurse with the reduced set of points.
The distance between a point and a group of points is computed using complete linkage, i.e. as the maximum distance from the point to any member of the group.
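
A minimal sketch of these steps is given below, written iteratively rather than recursively. The function name and the toy data are assumptions, and ties between equally large candidate clusters are broken in favor of the first one found, which is a choice made for this example only.

```python
def qt_cluster(points, max_diameter, dist):
    """Quality-threshold clustering; returns clusters as sorted lists of indices."""
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        best = []
        for seed in remaining:
            # Grow a candidate cluster around `seed`.
            candidate = [seed]
            others = [i for i in remaining if i != seed]
            while others:
                # Complete linkage: distance from a point to the candidate is
                # its maximum distance to any member of the candidate.
                def link(i):
                    return max(dist(points[i], points[j]) for j in candidate)
                nearest = min(others, key=link)
                if link(nearest) > max_diameter:
                    break  # adding any further point would exceed the diameter
                candidate.append(nearest)
                others.remove(nearest)
            if len(candidate) > len(best):  # assumption: first largest wins ties
                best = candidate
        clusters.append(sorted(best))
        remaining = [i for i in remaining if i not in best]
    return clusters

# Hypothetical one-dimensional data with a maximum diameter of 2.5.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 30.0]
clusters = qt_cluster(points, max_diameter=2.5, dist=lambda a, b: abs(a - b))
print(clusters)
```

Because no random initialization is involved, repeated runs on the same data give the same clusters, at the price of rebuilding every candidate cluster in each round.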

30 Spectral clustering
Given a set of data points $A$, the similarity matrix may be defined as a matrix $S$ where $s_{ij}$ represents a measure of the similarity between points $i, j \in A$. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions.
One such technique is the Shi-Malik algorithm, commonly used for image segmentation. It partitions points into two sets $(S_1, S_2)$ based on the eigenvector $v$ corresponding to the second-smallest eigenvalue of the Laplacian matrix $L = I - D^{-1/2} S D^{-1/2}$ of $S$, where $D$ is the diagonal matrix with $d_{ii} = \sum_j s_{ij}$.
This partitioning may be done in various ways, such as by taking the median $m$ of the components of $v$, and placing all points whose component in $v$ is greater than $m$ in $S_1$, and the rest in $S_2$.
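
The median split can be sketched without a linear-algebra library by exploiting two facts: the eigenvector of $L$ for eigenvalue 0 is proportional to $\sqrt{d_{ii}}$, and the second-smallest eigenvalue of $L$ corresponds to the second-largest eigenvalue of $D^{-1/2} S D^{-1/2}$. The sketch therefore runs a power iteration on $D^{-1/2} S D^{-1/2} + I$ after projecting out the known eigenvector. The power iteration, the Gaussian similarity, the bandwidth and the toy data are all choices made for this illustration; in practice an eigenvalue routine from a numerical library would normally be used.

```python
import math
from statistics import median

def shi_malik_bipartition(S, iters=200):
    """Split points in two using the eigenvector v of the second-smallest
    eigenvalue of L = I - D^{-1/2} S D^{-1/2} (S symmetric, non-negative)."""
    n = len(S)
    d = [sum(row) for row in S]  # d_ii = sum_j s_ij
    M = [[S[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    # Known eigenvector of L for eigenvalue 0: u proportional to sqrt(d).
    norm_u = math.sqrt(sum(d))
    u = [math.sqrt(di) / norm_u for di in d]
    v = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iters):
        # Project out u, then apply (M + I); the shift keeps eigenvalues >= 0.
        dot = sum(vi * ui for vi, ui in zip(v, u))
        v = [vi - dot * ui for vi, ui in zip(v, u)]
        w = [sum(M[i][j] * v[j] for j in range(n)) + v[i] for i in range(n)]
        norm_w = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm_w for wi in w]
    m = median(v)
    S1 = [i for i in range(n) if v[i] > m]
    S2 = [i for i in range(n) if v[i] <= m]
    return S1, S2

# Hypothetical data: two groups on a line, Gaussian similarity (bandwidth 2).
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
S = [[math.exp(-((a - b) ** 2) / 8.0) for b in xs] for a in xs]
S1, S2 = shi_malik_bipartition(S)
print(sorted(S1), sorted(S2))
```

The sign of an eigenvector is arbitrary, so which of the two groups ends up in $S_1$ is not meaningful; only the split itself is.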

31 Gaussian mixture models clustering
There is another way to deal with clustering problems: a model-based approach, which consists of using certain models for the clusters and attempting to optimize the fit between the data and the model. In practice, each cluster can be mathematically represented by a parametric distribution, like a Gaussian (continuous) or a Poisson (discrete). The entire data set is therefore modeled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution.
A mixture model with high likelihood tends to have the following traits:
1. component distributions have high peaks (data in one cluster are tight);
2. the mixture model covers the data well (dominant patterns in the data are captured by component distributions).
Main advantages of model-based clustering:
1. well-studied statistical inference techniques are available;
2. flexibility in choosing the component distribution;
3. a density estimate is obtained for each cluster;
4. a soft classification is available.

32 Gaussian mixture models clustering
The most widely used clustering method of this kind is the one based on learning a mixture of Gaussians: we can consider clusters as Gaussian distributions centered on their barycentres, as shown in the figure, where the grey circle represents the first variance of the distribution. [Figure: Gaussian clusters centered on their barycentres]
Clusters are assigned by selecting the component that maximizes the posterior probability. Like k-means clustering, Gaussian mixture modeling uses an iterative algorithm that converges to a local optimum. Gaussian mixture modeling may be more appropriate than k-means clustering when clusters have different sizes and correlation within them.

33 Gaussian mixture models clustering
The algorithm works in this way:
1. it chooses the parameter $\theta_i$ (for $i = 1, \dots, K$) at random with prior probability $p(\theta_i) \sim N\{\mu_i, \sigma\}$;
2. it obtains the posterior distribution (likelihood), where $\theta_i = \{\mu_i, \sigma\}$;
3. the likelihood function is $L = \sum_{i=1}^{K} p(\theta_i) \, p(x \mid \theta_i)$;
4. the likelihood function should now be maximized by solving $\partial L / \partial \theta_i = 0$, but this would be too difficult. For this reason a simplified algorithm, known as EM (Expectation-Maximization), is used.
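
The EM iteration itself is not spelled out in the slide. The sketch below uses the standard textbook updates for a one-dimensional mixture of Gaussians with a fixed, shared standard deviation, matching the parameterization $\theta_i = \{\mu_i, \sigma\}$ above. The function names, the starting means and the toy data are assumptions made for illustration.

```python
import math

def gauss(x, mu, sigma):
    """Density of a one-dimensional Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_gmm_1d(xs, initial_mus, sigma=1.0, iters=50):
    """EM for a 1-D Gaussian mixture with fixed shared sigma.
    Returns (mixing weights, means)."""
    k = len(initial_mus)
    mus = list(initial_mus)
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for x in xs:
            joint = [w * gauss(x, mu, sigma) for w, mu in zip(weights, mus)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: re-estimate mixing weights and means from the posteriors.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            weights[i] = n_i / len(xs)
            mus[i] = sum(r[i] * x for r, x in zip(resp, xs)) / n_i
    return weights, mus

# Hypothetical data: two groups centered near 0 and near 10.
xs = [-0.5, 0.0, 0.5, 9.5, 10.0, 10.5]
weights, mus = em_gmm_1d(xs, initial_mus=[1.0, 8.0])
print([round(w, 3) for w in weights], [round(mu, 3) for mu in mus])
```

The posteriors computed in the E-step are the soft classification mentioned earlier: a hard cluster label is obtained by choosing, for each sample, the component with the largest posterior.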

34 Bayesian clustering
Fundamentally, Bayesian clustering aims to obtain a posterior distribution over partitions of the data set $D$, denoted by $C = \{C_1, \dots, C_K\}$, with or without specifying $K$. Several methods have been proposed for doing this. Usually they come down to specifying a hierarchical model mimicking the partial order on the class of partitions, so that the procedure is also hierarchical, usually agglomerative.
The first effort at Bayesian clustering was the hierarchical technique due to Makato and Tokunaga. Starting with the data $D = \{x_1, \dots, x_N\}$ as $N$ clusters of size 1, the idea is to merge clusters when the probability of the merged cluster $p(C_k \cup C_j)$ is greater than the probability of the individual clusters $p(C_k) \, p(C_j)$. Thus the clusters themselves are treated as random variables.
