2 Clustering Algorithms. Contents: K-means; Hierarchical algorithms; Linkage functions; Vector quantization

3 Clustering: Formulation. Find groups of similar points (observations) in multidimensional space. No target variable (unsupervised learning). [Figure labels: Objects, Attributes, Model]

4 Methods of Clustering - Overview. Variety of methods: hierarchical clustering creates a hierarchy of clusters (one cluster entirely contained within another cluster); non-hierarchical methods create disjoint clusters; overlapping clusters (objects can belong to more than one cluster simultaneously); fuzzy clusters (defined by the probability (grade) of membership of each object in each cluster). Useful data preprocessing prior to clustering: PCA (Principal Component Analysis) to reduce the dimensionality of the data; data standardization (transform the data to reduce the large influence of variables with larger variance on the results of clustering).

5 Introductory Example 97 countries described by 3 attributes: Birth, Death, InfantDeath rate (given as number per 1000, data from year 1995)

6 Example (continued). Analysis I: clustering raw data with the K-means algorithm. Result: 3 clusters (number of observations in each cluster: 13, 32, 52).

8 Example Profiles of Clusters

9 Example Profiles of Clusters Notice: data clustered based on InfantDeath Rate only!

10 Example: Standardization of Data. Analysis II: data standardized prior to clustering (variables divided by their standard deviation). Result: 3 clusters (with 35, 46, 16 observations). Data clustered based on InfantDeath and Death. [Figure panels: Analysis II, Analysis I] Observe that the variables with the largest variance have the largest influence on the results of clustering.

11 Example Profiles of Clusters Analysis II: profiles of clusters

12 Methods of Clustering. Non-hierarchical methods: K-means clustering; non-deterministic; $O(n)$, where $n$ is the number of observations. Hierarchical methods: agglomerative (join small clusters) or divisive (split big clusters); deterministic; $O(n^2)$ to $O(n^3)$, depending on the clustering method (i.e. the definition of inter-cluster distance).

13 Methods of Clustering - Remarks. Clustering large datasets: use K-means. If results of hierarchical clustering are needed, first use K-means to obtain e.g. 50 clusters, then run hierarchical clustering on the results of K-means. Consensus clustering: to discover the real clusters in the data, analyze the stability of the results with noise injected.

14 K-means Algorithm. K-means clustering: (1) Select k points (centroids of the initial clusters; select randomly). (2) Assign each observation to the nearest centroid (nearest cluster). (3) For each cluster find the new centroid. (4) Repeat steps 2 and 3 until no change occurs in cluster assignments.
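The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the toy data are assumptions made for the example, and squared Euclidean distance is assumed for "nearest".

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # step 1: random initial centroids
    labels = None
    for _ in range(max_iter):
        # step 2: assign each observation to the nearest centroid
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:           # step 4: stop when assignments are stable
            break
        labels = new_labels
        # step 3: recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                    # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Because the initial centroids are drawn at random, different seeds can give different label numbering (and, on harder data, different clusters), which is what "non-deterministic" refers to.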

15 K-means Algorithm. Result: k separate clusters. The algorithm requires that the correct number of clusters k is specified in advance (a difficult problem: how to know the real number of clusters in the data?).

16 Hierarchical Clustering. Notation: $x_i$ observations, $i = 1..n$; $C_k$ clusters; $G$ current number of clusters; $D_{KL}$ distance between clusters $C_K$ and $C_L$. The between-cluster distance $D_{KL}$ is the linkage function (various definitions available; results of clustering depend on $D_{KL}$). [Figure: clusters $C_K$ and $C_L$ with the distance $D_{KL}$ between them]

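17 Hierarchical Clustering. Algorithm (agglomerative hierarchical clustering): (1) $C_k = \{x_k\}$, $k = 1..n$, $G = n$. (2) Find $K$, $L$ such that $D_{KL} = \min D_{IJ}$, $1 \le I, J \le G$, $I \ne J$. (3) Replace clusters $C_K$ and $C_L$ by cluster $C_K \cup C_L$, $G = G - 1$. (4) Repeat steps 2 and 3 while $G > 1$. Result: hierarchy of clusters (dendrogram). [Figure: clusters $C_K$ and $C_L$ with the distance $D_{KL}$ between them]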
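The agglomerative loop can be sketched as follows. This is a deliberately naive version (it recomputes every inter-cluster distance on each pass) meant only to mirror the four steps; the names `agglomerate` and `single_link` and the toy data are assumptions for the example, and single linkage is used only as one possible choice of $D_{KL}$.

```python
import math

def single_link(ck, cl):
    # One possible D_KL: the smallest pairwise Euclidean distance.
    return min(math.dist(x, y) for x in ck for y in cl)

def agglomerate(points, linkage=single_link):
    """Return the merge history [(cluster_K, cluster_L, D_KL), ...]."""
    clusters = [[p] for p in points]       # step 1: C_k = {x_k}, G = n
    merges = []
    while len(clusters) > 1:               # step 4: repeat while G > 1
        # step 2: find the pair (K, L) with the smallest D_KL
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        k, l = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[k], clusters[l], linkage(clusters[k], clusters[l])))
        # step 3: replace C_K and C_L by their union, so G decreases by 1
        merged = clusters[k] + clusters[l]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (k, l)]
        clusters.append(merged)
    return merges

# Hypothetical one-dimensional data.
history = agglomerate([(0.0,), (1.0,), (5.0,), (7.0,)])
for ck, cl, d in history:
    print(ck, cl, d)
```

The merge history, read in order with its distances, is the information a dendrogram displays.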

18 Hierarchy of Clusters - Dendrogram

19 Definitions of Distance Between Clusters. Different definitions of distance between clusters: average linkage; single linkage; complete linkage; density linkage; Ward's minimum variance method. (The SAS CLUSTER procedure accepts 11 different definitions of inter-cluster distance.)

20 Average Linkage. Notation: $x_i$ observations, $i = 1..n$; $d(x, y)$ distance between observations (Euclidean distance assumed from now on); $C_k$ clusters; $N_K$ number of observations in cluster $C_K$; $D_{KL}$ distance between clusters $C_K$ and $C_L$; $\mathrm{mean}_{C_K}$ mean observation in cluster $C_K$; $W_K = \sum_{x_i \in C_K} \lVert x_i - \mathrm{mean}_{C_K} \rVert^2$ variance in cluster. Average linkage: tends to join clusters with small variance; resulting clusters tend to have similar variance.

21 Complete Linkage. Notation: $x_i$ observations, $i = 1..n$; $d(x, y)$ distance between observations; $C_k$ clusters; $N_K$ number of observations in cluster $C_K$; $D_{KL}$ distance between clusters $C_K$ and $C_L$; $\mathrm{mean}_{C_K}$ mean observation in cluster $C_K$; $W_K = \sum_{x_i \in C_K} \lVert x_i - \mathrm{mean}_{C_K} \rVert^2$ variance in cluster. Complete linkage: resulting clusters tend to have similar diameter.

22 Single Linkage. Notation: $x_i$ observations, $i = 1..n$; $d(x, y)$ distance between observations; $C_k$ clusters; $N_K$ number of observations in cluster $C_K$; $D_{KL}$ distance between clusters $C_K$ and $C_L$; $\mathrm{mean}_{C_K}$ mean observation in cluster $C_K$; $W_K = \sum_{x_i \in C_K} \lVert x_i - \mathrm{mean}_{C_K} \rVert^2$ variance in cluster. Single linkage: tends to produce elongated clusters, irregular in shape.
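The formulas for $D_{KL}$ on slides 20 to 22 did not survive in this text, so the sketch below uses the textbook definitions as an assumption: average linkage is the mean of all pairwise distances between the two clusters, complete linkage the largest, and single linkage the smallest. Unsquared Euclidean distance is assumed, and the function names and toy clusters are made up for the illustration.

```python
import math

def pairwise(ck, cl):
    # All distances d(x, y) with x in C_K and y in C_L (Euclidean, unsquared).
    return [math.dist(x, y) for x in ck for y in cl]

def average_linkage(ck, cl):
    # Mean pairwise distance: sum of d(x, y) divided by N_K * N_L.
    return sum(pairwise(ck, cl)) / (len(ck) * len(cl))

def complete_linkage(ck, cl):
    # Distance between the farthest pair of observations.
    return max(pairwise(ck, cl))

def single_linkage(ck, cl):
    # Distance between the closest pair of observations.
    return min(pairwise(ck, cl))

# Hypothetical clusters: two vertical pairs of points, 3 units apart.
c_k = [(0.0, 0.0), (0.0, 2.0)]
c_l = [(3.0, 0.0), (3.0, 2.0)]
print(single_linkage(c_k, c_l), average_linkage(c_k, c_l), complete_linkage(c_k, c_l))
```

For any pair of clusters the three values are ordered single <= average <= complete, which is one way to see why single linkage merges chains of nearby points so readily.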

23 Ward's Minimum Variance Method. Notation: $x_i$ observations, $i = 1..n$; $d(x, y)$ distance between observations; $C_k$ clusters; $N_K$ number of observations in cluster $C_K$; $D_{KL}$ distance between clusters $C_K$ and $C_L$; $\mathrm{mean}_{C_K}$ mean observation in cluster $C_K$; $W_K = \sum_{x_i \in C_K} \lVert x_i - \mathrm{mean}_{C_K} \rVert^2$ variance in cluster; $B_{KL} = W_M - W_K - W_L$ where $C_M = C_K \cup C_L$. Ward's minimum variance method: tends to join small clusters; tends to produce clusters with a similar number of observations.
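A small numeric check of $B_{KL} = W_M - W_K - W_L$ may help. The sketch assumes that Ward's method uses $B_{KL}$ itself as the inter-cluster distance (the slide's formula for $D_{KL}$ is not preserved here), and the helper names and data are hypothetical.

```python
def mean(cluster):
    # Mean observation of a cluster, coordinate by coordinate.
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def within_ss(cluster):
    # W_K: sum of squared Euclidean distances from each point to the cluster mean.
    m = mean(cluster)
    return sum(sum((a - b) ** 2 for a, b in zip(x, m)) for x in cluster)

def ward_distance(ck, cl):
    # B_KL = W_M - W_K - W_L, with C_M the union of C_K and C_L.
    return within_ss(ck + cl) - within_ss(ck) - within_ss(cl)

# Hypothetical clusters on a line.
c_k = [(0.0, 0.0), (2.0, 0.0)]
c_l = [(6.0, 0.0)]
b_kl = ward_distance(c_k, c_l)
print(b_kl)
```

The same value can be obtained from the cluster means alone, as the squared distance between the means divided by $(1/N_K + 1/N_L)$; here that is $25 / 1.5$. The division by $(1/N_K + 1/N_L)$ makes merges involving small clusters cheap, which matches the remark that the method tends to join small clusters.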

24 Density Linkage. Notation: $x_i$ observations, $i = 1..n$; $d(x, y)$ distance between observations; $r$ a fixed constant; $f(x)$ proportion of observations within the sphere centered at $x$ with radius $r$, divided by the volume of the sphere (a measure of the density of points near observation $x$). Density linkage: single linkage is applied using the density-based dissimilarity measure $d^*$; capable of discovering clusters of irregular shape.
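The definition of $d^*$ is not preserved in this text. The sketch below assumes the uniform-kernel form, in which two observations closer than $r$ get the average of their inverse densities and all other pairs get an infinite dissimilarity; treat that form, the function names, and the toy data as assumptions rather than the slide's exact formula.

```python
import math

def density(x, points, r):
    # f(x): share of observations within radius r of x, divided by the ball's volume.
    dim = len(x)
    volume = math.pi ** (dim / 2) * r ** dim / math.gamma(dim / 2 + 1)
    inside = sum(1 for p in points if math.dist(x, p) <= r)
    return inside / len(points) / volume

def d_star(x, y, points, r):
    # ASSUMED form of d*: mean inverse density for neighbours, infinity otherwise.
    if math.dist(x, y) > r:
        return math.inf
    return 0.5 * (1 / density(x, points, r) + 1 / density(y, points, r))

# Hypothetical data: three close points and one distant point.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 0.0)]
near = d_star(points[0], points[1], points, r=1.0)
far = d_star(points[0], points[3], points, r=1.0)
print(near, far)
```

Feeding `d_star` into a single-linkage procedure means that points in dense regions are joined first and that clusters can only grow through chains of neighbours within radius $r$, which is what allows irregular shapes to be followed.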

25 Example Average Linkage Elongated clusters in data

26 Example K-means Elongated clusters in data

27 Example Density Linkage Elongated clusters in data

28 Example K-means Nonconvex clusters in data

29 Example Centroid Linkage Nonconvex clusters in data

30 Example Density Linkage Nonconvex clusters in data

31 Example True Clusters Clusters of unequal size

32 Example K-means Clusters of unequal size

33 Example Ward's Method Clusters of unequal size

35 Example Centroid Linkage Clusters of unequal size

36 Example Single Linkage Clusters of unequal size

37 Example Well Separated Data Any method will work

38 Example Poorly Separated Data True clusters

39 Example Poorly Separated Data Method: K-means

40 Example Poorly Separated Data Method: Ward's method

41 Clustering Methods - Final Remarks. Standardization of variables prior to clustering: often necessary, otherwise variables with large variance tend to have a large influence on clustering. Often the standardized measurement $z_{ij}$ is computed as the z-score $z_{ij} = (x_{ij} - \mu_j) / s_j$, where $x_{ij}$ is the original measurement in observation $i$ and variable $j$, $\mu_j$ is the mean value of variable $j$, and $s_j$ is the mean absolute deviation of variable $j$ (or its standard deviation). Other ideas: divide the variable by its range, maximum value or standard deviation.
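The z-score transformation can be illustrated with the standard library alone. The sketch takes $s_j$ to be the sample standard deviation (the mean absolute deviation mentioned above would be a drop-in alternative); the function name `standardize` and the two-variable data are hypothetical.

```python
import statistics

def standardize(rows):
    """Return z-scores z_ij = (x_ij - mu_j) / s_j, computed column by column."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]   # sample std. dev. as s_j
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stdevs))
        for row in rows
    ]

# Hypothetical data: the second variable has a much larger spread than the first.
rows = [(10.0, 100.0), (20.0, 300.0), (30.0, 500.0)]
z = standardize(rows)
print(z)
```

After the transformation both variables have the same spread, so neither dominates a Euclidean distance simply because of its scale.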

42 Clustering Methods - Final Remarks. The number of clusters: there is no satisfactory theory to determine the right number of clusters in data. Various criteria can be observed to help determine the right number of clusters, e.g. criteria based on the variance accounted for by clusters, $R^2 = 1 - P_G / T$, or semipartial $R^2 = B_{KL} / T$, where $T$ is the total variance of observations, $P_G = \sum W_K$ over the $G$ clusters, and $B_{KL} = W_M - W_K - W_L$ where $C_M = C_K \cup C_L$; Cubic Clustering Criterion (CCC). Data visualization is often useful for determining the number of clusters: scatterplot for 2-3 dimensional data; in high dimensions apply the PCA transformation (or similar) and visualize the data in the 2-3 dimensional space of the first principal components.
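The two variance-based criteria can be computed directly from their definitions. In this sketch $T$ and $W_K$ are both taken as sums of squared distances to the respective mean, which is an assumption about how "variance" is meant here; the function names and the one-dimensional data are made up for the example.

```python
def within_ss(cluster):
    # Sum of squared Euclidean distances from each point to the cluster mean.
    m = tuple(sum(col) / len(cluster) for col in zip(*cluster))
    return sum(sum((a - b) ** 2 for a, b in zip(x, m)) for x in cluster)

def r_squared(clusters):
    # R^2 = 1 - P_G / T, with P_G the sum of W_K over the G clusters.
    all_points = [x for c in clusters for x in c]
    p_g = sum(within_ss(c) for c in clusters)
    return 1 - p_g / within_ss(all_points)

def semipartial_r_squared(ck, cl, all_points):
    # Semipartial R^2 = B_KL / T, with B_KL = W_M - W_K - W_L.
    b_kl = within_ss(ck + cl) - within_ss(ck) - within_ss(cl)
    return b_kl / within_ss(all_points)

# Hypothetical data: two tight clusters far apart on a line.
c1 = [(0.0,), (2.0,)]
c2 = [(10.0,), (12.0,)]
r2 = r_squared([c1, c2])
spr2 = semipartial_r_squared(c1, c2, c1 + c2)
print(r2, spr2)
```

The semipartial $R^2$ of a merge is the drop in $R^2$ that the merge causes, so a large value signals that two genuinely distinct clusters are being joined.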

43 Example: $R^2$, semipartial $R^2$

44 Example: Number of Clusters. Useful checks: PST2: 3 or 6 or 9 (one before peak in value); PSF: 9 (peak in value); CCC: 18 (CCC around 3).

45 Kohonen VQ (Vector Quantization). Algorithm similar to k-means. Idea of the VQ algorithm: (1) Select k points (initial cluster centroids). (2) For observation $x_i$ find the nearest centroid (winning seed), denoted by $S_n$. (3) Modify $S_n$ according to the formula $S_n = S_n (1 - L) + x_i L$, where $L$ is the learning constant (decreasing during the learning process). (4) Repeat steps 2 and 3 over all training observations. (5) Repeat steps 2-4 for a given number of iterations.
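The VQ update can be sketched as below. The decay schedule for $L$ (here `l0 / (1 + epoch)`) is an assumption, since the slide only says that $L$ decreases; the function name, parameters and data are likewise hypothetical.

```python
import math

def vq_train(data, seeds, epochs=10, l0=0.5):
    """Online vector quantization; returns the trained seeds as lists."""
    seeds = [list(s) for s in seeds]                 # step 1: initial centroids
    for epoch in range(epochs):                      # step 5: fixed number of passes
        rate = l0 / (1 + epoch)                      # ASSUMED decay schedule for L
        for x in data:                               # step 4: loop over observations
            # step 2: find the winning seed S_n (nearest centroid)
            n = min(range(len(seeds)), key=lambda j: math.dist(seeds[j], x))
            # step 3: S_n = S_n * (1 - L) + x_i * L
            seeds[n] = [s * (1 - rate) + xi * rate for s, xi in zip(seeds[n], x)]
    return seeds

# Hypothetical data: two groups, with one initial seed placed near each.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
trained = vq_train(data, seeds=[(1.0, 1.0), (9.0, 9.0)])
print(trained)
```

Unlike k-means, which recomputes a centroid from all of its members at once, this update nudges only the winning seed after every single observation.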

46 Kohonen SOM (Self Organizing Maps). Idea of the SOM algorithm: (1) Select k initial points (cluster centroids) and represent them on a 2D map. (2) For observation $x_i$ find the winning seed $S_n$. (3) Modify all centroids: $S_j = S_j (1 - K(j,n) L) + x_i K(j,n) L$, where $L$ is the learning constant (decreasing during training) and $K(j,n)$ is a function decreasing with increasing distance on the 2D map between the centroids $S_j$ and $S_n$ ($K(j,j) = 1$). (4) Repeat steps 2 and 3 over all training observations.
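A minimal sketch of the SOM update follows. The Gaussian form of $K(j,n)$, the decay schedule for $L$, the use of several passes over the data, and all names and data are assumptions for the illustration; the slide only requires that $K$ decreases with map distance and equals 1 for the winner itself.

```python
import math

def som_train(data, seeds, grid, epochs=10, l0=0.5, sigma=1.0):
    """Online SOM; grid[j] is the fixed 2D map position of seed j."""
    seeds = [list(s) for s in seeds]                 # step 1: initial centroids
    for epoch in range(epochs):                      # ASSUMED: several passes
        rate = l0 / (1 + epoch)                      # ASSUMED decay schedule for L
        for x in data:                               # step 4: loop over observations
            # step 2: find the winning seed S_n in data space
            n = min(range(len(seeds)), key=lambda j: math.dist(seeds[j], x))
            # step 3: move every seed, weighted by its map distance to the winner
            for j in range(len(seeds)):
                k = math.exp(-math.dist(grid[j], grid[n]) ** 2 / (2 * sigma ** 2))
                seeds[j] = [s * (1 - k * rate) + xi * k * rate
                            for s, xi in zip(seeds[j], x)]
    return seeds

# Hypothetical setup: three seeds laid out in a row on the 2D map.
grid = [(0, 0), (0, 1), (0, 2)]
data = [(0.0,), (10.0,)]
trained = som_train(data, seeds=[(0.0,), (5.0,), (10.0,)], grid=grid)
print(trained)
```

With $K(j,n)$ equal to 1 for the winner and 0 for every other seed, the update reduces to the VQ rule, so the map neighbourhood is the only thing that distinguishes the two algorithms.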

