Unsupervised learning: Clustering

Transcription

1 Unsupervised learning: Clustering
Salissou Moutari, Centre for Statistical Science and Operational Research (CenSSOR)
17th September 2013
Unsupervised learning: Clustering 1/52

2 Outline
1. Introduction
   - What is Unsupervised learning?
   - Fundamental aspects of clustering
2. Clustering algorithms
   - Hierarchical clustering
   - Partitional clustering
3. Clustering evaluation metrics

3 What is Unsupervised learning?
Problem: Given a set of records (e.g. observations or variables) with no target attribute, organise them into groups, without advance knowledge of the definitions of the groups.
Unsupervised learning: Unsupervised learning consists of approaches that attempt to address the above problem by exploring the unlabelled data to find intrinsic natural structures within them.

5 What is Unsupervised learning?
Examples of unsupervised learning approaches:
- Clustering
- Self-organising maps
- Association rules
- Blind signal separation
- etc.
This session will focus on clustering.
Why clustering? Clustering is one of the most widely used unsupervised learning techniques.

9 Fundamental aspects of clustering
Definition: Clustering, also termed cluster analysis, is the collection of methods for grouping unlabelled data into subsets (called clusters) that are believed to reflect the underlying structure of the data, based on similarity groups within the data.
What is clustering for?
- Identification of new tumor classes using gene expression profiles;
- Identification of groups of co-regulated genes, e.g. using a large number of yeast experiments;
- Grouping similar proteins together with respect to their chemical structure and/or functionality;
- Detection of experimental artifacts.

12 Fundamental aspects of clustering
Basic concepts: Clustering deals with data for which the groups are unknown and undefined. Thus we need to conceptualise the groups, using two notions:
- Intra-cluster distance: how close data points within the same cluster are to each other.
- Inter-cluster distance: how close two different clusters are to each other.
[Figure: intra-cluster distance and inter-cluster distance]

13 Challenges
The notion of a cluster can be ambiguous. The main challenges are:
1. Definition of the inter-cluster and intra-cluster distances.
2. The number of clusters.
3. The type of clusters.
4. The quality of the clusters.
[Figure: How many clusters for these data?]

15 Challenges
How many clusters? Two clusters? Why not six clusters?
[Figure panels: Two Clusters; Six Clusters. Source: Tan, Steinbach, Kumar, Introduction to Data Mining]

16 Challenges
Definition of intra-cluster distance: The type of distance measurement used to determine how close two data points are to each other. It is commonly called the distance, similarity or dissimilarity measure.
Definition of inter-cluster distance: The type of distance measurement used to determine how close two clusters are to each other. It is commonly called the linkage function or linkage criterion. It is often both data (cluster shape) and context dependent, and may depend on the distance measure.

17 Distance measures
Fundamental axioms: Assume that the data are in an n-dimensional Euclidean space, and let $x = [x_1, x_2, \ldots, x_n]$, $y = [y_1, y_2, \ldots, y_n]$ and $z = [z_1, z_2, \ldots, z_n]$ define three data points. The fundamental axioms of a distance measure $d$ are:
1. $d(x, x) = 0$
2. $d(x, y) = d(y, x)$
3. $d(x, y) \le d(x, z) + d(z, y)$
Remark: The choice of a distance measure will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another.

19 Distance measures
Examples of distance metrics. Some commonly used metrics for clustering include:
- Euclidean distance ($L_2$ norm): $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Manhattan distance ($L_1$ norm): $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
- Chebyshev maximum distance ($L_\infty$ norm): $d(x, y) = \max_{i=1,\ldots,n} |x_i - y_i|$
- Minkowski distance ($L_p$ norm): $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
- Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^T R^{-1} (x - y)}$, where $R$ denotes the covariance matrix associated with the data.
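
The metrics listed above can be tried out on a pair of points. The following is a minimal sketch in R, assuming two made-up points and, for the Mahalanobis case, an assumed diagonal covariance matrix R; none of these values come from the slides. It relies on dist() and mahalanobis() from the stats package.

```r
# Two illustrative points in R^3 (made-up values, not from the slides).
x <- c(1, 2, 3)
y <- c(4, 0, 3)
pts <- rbind(x, y)

# dist() returns the pairwise distances; with two rows it holds a single value.
d_euclidean <- as.numeric(dist(pts, method = "euclidean"))         # sqrt(9 + 4 + 0)
d_manhattan <- as.numeric(dist(pts, method = "manhattan"))         # 3 + 2 + 0
d_chebyshev <- as.numeric(dist(pts, method = "maximum"))           # max(3, 2, 0)
d_minkowski <- as.numeric(dist(pts, method = "minkowski", p = 3))  # (27 + 8)^(1/3)

# Mahalanobis distance with an ASSUMED covariance matrix R (diagonal here).
# stats::mahalanobis() returns the squared distance, hence the sqrt().
R <- diag(c(1, 4, 9))
d_mahalanobis <- sqrt(mahalanobis(x, center = y, cov = R))

print(c(euclidean = d_euclidean, manhattan = d_manhattan,
        chebyshev = d_chebyshev, minkowski3 = d_minkowski,
        mahalanobis = d_mahalanobis))
```

The same pair of points is at distance 5 under the Manhattan metric but only 3 under the Chebyshev metric, which illustrates why the choice of metric can change which points look close.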

20 Linkage criteria
Examples of linkage criteria or linkage functions. Let $C_1$ and $C_2$ be two candidate clusters and let $d$ be the chosen distance metric. Commonly used linkage functions between $C_1$ and $C_2$ include:
- Single linkage: $f(C_1, C_2) = \min\{d(x, y) : x \in C_1, y \in C_2\}$
- Complete linkage: $f(C_1, C_2) = \max\{d(x, y) : x \in C_1, y \in C_2\}$
- Average linkage: $f(C_1, C_2) = \frac{1}{|C_1| |C_2|} \sum_{x \in C_1} \sum_{y \in C_2} d(x, y)$
- Ward's criterion: the distance between $C_1$ and $C_2$ is given by $f(C_1, C_2) = \frac{|C_1| |C_2|}{|C_1| + |C_2|} \|\mu_1 - \mu_2\|^2$, where $\mu_i$ is the centre of cluster $i$.
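
To make the linkage functions concrete, here is a small sketch in R that evaluates each of them by hand on two tiny clusters. The cluster coordinates are made up for illustration, and the Euclidean distance is assumed as the metric d.

```r
# Two small illustrative clusters in the plane (made-up values).
C1 <- rbind(c(0, 0), c(1, 0))
C2 <- rbind(c(4, 0), c(6, 0))
n1 <- nrow(C1)
n2 <- nrow(C2)

# Cross-cluster Euclidean distances: rows index C1, columns index C2.
D <- as.matrix(dist(rbind(C1, C2)))[1:n1, (n1 + 1):(n1 + n2)]

single_link   <- min(D)             # closest pair across the two clusters
complete_link <- max(D)             # farthest pair across the two clusters
average_link  <- sum(D) / (n1 * n2) # mean over all cross-cluster pairs

# Ward's criterion uses the cluster centres (means) rather than pairwise distances.
mu1 <- colMeans(C1)
mu2 <- colMeans(C2)
ward <- (n1 * n2) / (n1 + n2) * sum((mu1 - mu2)^2)

print(c(single = single_link, complete = complete_link,
        average = average_link, ward = ward))
```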

21 Clustering algorithms
Hierarchical clustering: Creates a hierarchical decomposition of a data set by finding successive clusters using previously established clusters. Hierarchical clustering methods produce a tree diagram known as a dendrogram or phenogram, which can be built in two distinct ways:
- bottom-up, known as agglomerative clustering;
- top-down, called divisive clustering.
Partitional clustering: Decomposes the data set into a set of disjoint (i.e. non-overlapping) clusters, such that each data point is in exactly one cluster.

23 Hierarchical clustering
Agglomerative clustering:
- Start with the points as individual clusters;
- At each step, merge the closest pair of clusters until all the data points are in a single cluster or until certain termination conditions are satisfied.
Divisive clustering:
- Start with one, all-inclusive cluster;
- At each step, split a cluster until each cluster contains a single data point or until certain termination conditions are satisfied.

25 Agglomerative clustering
Algorithm: The algorithm forms clusters in a bottom-up manner, as follows:
1. Initially, put each data point in its own cluster.
2. Among all current clusters, pick the two clusters which optimise the chosen linkage function.
3. Replace these two clusters with a new cluster, formed by merging the two original ones.
4. Repeat steps 2 and 3 until there is only one remaining cluster in the pool, or until certain termination conditions are satisfied.
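
The four steps above can be sketched directly in R. This is a deliberately naive illustration, not how hclust() is implemented: the function name naive_agglomerative, the stopping rule (stop at k clusters) and the toy data are all assumptions made for this example. The linkage argument takes min for single linkage, max for complete linkage or mean for average linkage.

```r
# Naive agglomerative clustering (hypothetical helper, for illustration only).
naive_agglomerative <- function(X, k, linkage = min) {
  D <- as.matrix(dist(X))                   # Euclidean distances between points
  clusters <- as.list(seq_len(nrow(X)))     # step 1: one cluster per point
  while (length(clusters) > k) {            # termination condition: k clusters left
    best <- c(1, 2)
    best_val <- Inf
    # step 2: find the pair of clusters with the smallest linkage value
    for (a in seq_len(length(clusters) - 1)) {
      for (b in (a + 1):length(clusters)) {
        val <- linkage(D[clusters[[a]], clusters[[b]]])
        if (val < best_val) {
          best_val <- val
          best <- c(a, b)
        }
      }
    }
    # step 3: replace the two clusters by their union
    merged <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters <- c(clusters[-best], list(merged))
  }
  labels <- integer(nrow(X))
  for (j in seq_along(clusters)) labels[clusters[[j]]] <- j
  labels
}

# Toy data (made-up values): two tight pairs and one isolated point.
X <- rbind(c(0, 0), c(0, 1), c(5, 5), c(5, 6), c(10, 0))
labels <- naive_agglomerative(X, k = 3, linkage = min)
print(labels)
```

Each pass rescans every pair of clusters, so this sketch is only suitable for small data sets; it is meant to mirror the steps, not to be efficient.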

26 Agglomerative clustering: Illustration with R
Distance measure: The function dist(x, method="metric") returns the distance matrix of a numerical matrix x using a specified metric, which must be one of the following: "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski".
Clustering: The function hclust(d, method="linkage") performs hierarchical agglomerative clustering using a given distance matrix d and a specified linkage function, which must be one of the following: "single", "complete", "average", "mcquitty", "median" or "centroid".

27 Agglomerative clustering: Illustration with R
Let us consider the following data set X:
[Figure: data set X]

28 Agglomerative clustering: Illustration with R
R script:
    library(stats)
    library(ggdendro)                       # provides ggdendrogram()
    d <- dist(x, method = "euclidean")      # pairwise Euclidean distances
    hc <- hclust(d, method = "single")      # single-linkage agglomerative clustering
    ggdendrogram(hc, theme_dendro = FALSE)  # plot the dendrogram
[Figure: agglomerative clustering using the Euclidean distance measure and single linkage]

29 Agglomerative clustering: Illustration with R
[Figure: single linkage, impact of the choice of the distance measure. Panels: Euclidean distance; Chebyshev distance]

30 Agglomerative clustering: Illustration with R
[Figure: complete linkage, impact of the choice of the distance measure. Panels: Euclidean distance; Chebyshev distance]

31 Agglomerative clustering: Illustration with R
[Figure: average linkage, impact of the choice of the distance measure. Panels: Euclidean distance; Chebyshev distance]

32 Agglomerative clustering: Illustration with R
[Figure: Euclidean distance, impact of the choice of the linkage function. Panels: single linkage; complete linkage; average linkage]

33 Agglomerative clustering: Illustration with R
[Figure: Chebyshev distance, impact of the choice of the linkage function. Panels: single linkage; complete linkage; average linkage]
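
A comparison of this kind can be reproduced numerically by cutting each dendrogram into a fixed number of clusters with cutree() and comparing the cluster sizes. The sketch below uses synthetic data, since the data set X of the slides is not available in this transcript; the choice of k = 2 is also an assumption. In dist(), the Chebyshev distance is called "maximum".

```r
# Synthetic stand-in for the data set X (assumption: two Gaussian groups).
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2))

results <- list()
for (metric in c("euclidean", "maximum")) {
  for (linkage in c("single", "complete", "average")) {
    hc <- hclust(dist(X, method = metric), method = linkage)
    # Cut the tree so that exactly two clusters remain.
    results[[paste(metric, linkage)]] <- cutree(hc, k = 2)
  }
}

# Cluster sizes for each (distance, linkage) combination.
print(sapply(results, table))
```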

34 Agglomerative clustering
Advantages:
- No a priori information about the number of clusters is required;
- Easy to implement;
- The obtained results may correspond to meaningful taxonomies.
Limitations:
- The algorithm cannot undo what was done previously;
- Interpretation of the hierarchy can be complex or even confusing;
- Depending on the type of distance matrix used, the algorithm
  1. can be sensitive to noise and outliers,
  2. tends to break large clusters,
  3. can hardly handle clusters of different sizes.

36 Divisive clustering
Algorithm: The algorithm forms clusters in a top-down manner, as follows:
1. Initially, put all objects in one cluster.
2. Among all current clusters, pick the one which satisfies a specified criterion and split it using a specified method.
3. Replace this cluster with the new clusters, formed by splitting the original one.
4. Repeat steps 2 and 3 until all clusters are singletons or until certain termination conditions are satisfied.

37 Divisive clustering: Illustration with R
Clustering: The function diana(x, diss = inherits(x, "dist"), metric = "metric") performs hierarchical divisive clustering of a numerical matrix x using a specified distance metric, which must be one of the following: "euclidean" or "manhattan".
Let us consider the following data set X:
[Figure: data set X]

38 Divisive clustering: Illustration with R
R script:
    library(cluster)
    dc <- diana(x, diss = inherits(x, "dist"), metric = "euclidean")  # divisive clustering
    plot(dc)                                                          # plot the result
[Figure: divisive clustering using the Euclidean distance measure]
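
Beyond plotting, the result of diana() can be converted to a tree and cut into flat clusters. The following sketch assumes synthetic data (two well-separated groups of ten points each), since the slides' data set X is not reproduced here; as.hclust() and cutree() are used to extract two clusters.

```r
library(cluster)

# Synthetic stand-in for the data set X (assumption: two tight, distant groups).
set.seed(1)
X <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))

dc <- diana(X, metric = "euclidean")

# Divisive coefficient: a summary of the amount of clustering structure found.
print(dc$dc)

# Convert to an hclust tree and keep the first split only (two clusters).
labels <- cutree(as.hclust(dc), k = 2)
print(table(labels))
```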

39 Divisive clustering: Illustration with R
[Figure: impact of the choice of the distance measure. Panels: Euclidean distance; Manhattan distance]

40 Divisive clustering
Advantages:
- No a priori information about the number of clusters is required;
- The obtained result may correspond to meaningful taxonomies.
Limitations:
- The algorithm cannot undo what was done previously;
- Computational difficulties when considering all possible divisions into two groups;
- Depending on the type of distance matrix used, the algorithm
  1. can be sensitive to noise and outliers,
  2. tends to break large clusters.

41 Partitional clustering
Basic concept: Given k, the number of clusters, partitional clustering algorithms construct a partition of a data set into k clusters that optimises the chosen partitioning criterion.
Partitioning techniques:
1. Globally optimal method: exhaustive enumeration of all partitions (an NP-hard problem).
2. Heuristic methods, e.g.
   - k-means clustering: each cluster is represented by its centre;
   - k-medoids clustering or PAM (Partitioning Around Medoids): each cluster is represented by one of its components.

43 k-means clustering
Basic concept: Given an integer $k$ and a set $X$ of $n$ points ($n \ge k$) in an $m$-dimensional Euclidean space, denoted by $X = \{x_i = (x_{i1}, \ldots, x_{im})^T \in \mathbb{R}^m, i = 1, \ldots, n\}$, find an assignment of the $n$ points into $k$ disjoint clusters $C = (C_1, \ldots, C_k)$ centred at cluster means $\mu_j$ ($j = 1, \ldots, k$), based on a certain criterion, e.g. by minimising
$f(X, C) = \sum_{j=1}^{k} \sum_{i=1}^{|C_j|} \| x_i^{(j)} - \mu_j \|^2$,
where $|C_j|$ is the number of points in the cluster $C_j$, and $x_i^{(j)}$ is the point $i$ in $C_j$.

44 k-means clustering
Algorithm: The k-means clustering algorithm can be summarised as follows:
1. Select k data points randomly in a domain containing all the points in the data set. These k points represent the centres of the initial clusters.
2. Assign each point to the cluster that has the closest centre.
3. Recompute the cluster centres (means) using the current cluster memberships.
4. Repeat steps 2 and 3 until the centres no longer change, or until certain termination conditions are satisfied.
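
The steps above can be written out as a short R function. This is a sketch under stated assumptions, not the implementation behind kmeans(): the names lloyd_kmeans and kmeans_objective are hypothetical, the initial centres are drawn from the data points themselves (one common way to pick points inside the domain of the data), and the data are synthetic.

```r
# Within-cluster sum of squared distances to the cluster centres.
kmeans_objective <- function(X, labels, centres) {
  sum((X - centres[labels, , drop = FALSE])^2)
}

# Hypothetical helper illustrating steps 1 to 4.
lloyd_kmeans <- function(X, k, max_iter = 100) {
  # Step 1 (assumption): use k randomly chosen data points as initial centres.
  centres <- X[sample(nrow(X), k), , drop = FALSE]
  labels <- integer(nrow(X))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared distance from every point to every centre, then nearest centre.
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centres[j, ])^2))
    new_labels <- max.col(-d2, ties.method = "first")
    # Step 4: stop when memberships (and hence centres) no longer change.
    if (all(new_labels == labels)) break
    labels <- new_labels
    # Step 3: recompute each centre as the mean of its current members.
    for (j in seq_len(k)) {
      if (any(labels == j)) {
        centres[j, ] <- colMeans(X[labels == j, , drop = FALSE])
      }
    }
  }
  list(cluster = labels, centers = centres,
       objective = kmeans_objective(X, labels, centres))
}

# Synthetic data (assumption): two tight, well-separated groups of ten points.
set.seed(7)
X <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
res <- lloyd_kmeans(X, k = 2)
print(res$cluster)
print(res$objective)
```

Like any heuristic of this kind, the result depends on the random initial centres, which is why several random starts are normally compared in practice.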

45 k-means clustering: Illustration with R
Clustering: The function kmeans(x, centers, iter.max = 1000, nstart = 10) performs k-means clustering given a numerical matrix of data x, the number of clusters (or a set of initial centres) centers, the maximum number of iterations iter.max, and the number nstart of random initial sets to be chosen when centers is given as a number.
Let us consider the following data set X:
[Figure: data set X]

46 k-means clustering: Illustration with R
R script:
    library(stats)
    kc <- kmeans(x, centers = 4, iter.max = 1000, nstart = 10000)  # k-means with four clusters
[Figure: k-means clustering using four clusters]

47 k-means clustering: Illustration with R
[Figure: impact of the choice of the number of clusters. Panels: three clusters; four clusters]

48 k-means clustering: Illustration with R
[Figure: impact of the choice of the number of clusters. Panels: five clusters; six clusters]

49 k-means clustering: Illustration with R
[Figure: impact of the choice of the number of clusters. Number of clusters vs. within-cluster sum of squares]
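
A plot of the number of clusters against the within-cluster sum of squares can be produced along the following lines. This is a sketch on synthetic data with three groups (an assumption, since the slides' data set X is not available here); it reads the tot.withinss component returned by kmeans().

```r
# Synthetic stand-in for the data set X (assumption: three Gaussian groups).
set.seed(42)
X <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2),
           matrix(rnorm(60, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 8.
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(X, centers = k, iter.max = 100, nstart = 20)$tot.withinss
})

plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Within-cluster sum of squares")
```

The within-cluster sum of squares keeps decreasing as k grows, so a common reading of such a plot is to look for the value of k after which the decrease flattens out.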

50 k-means clustering
Advantages:
- Relatively easy to implement.
- A simple iterative algorithm that works quite well in practice.
Limitations:
- Need to specify k, the number of clusters, in advance.
- Applicable only when the mean is defined, hence it cannot handle categorical data.
- Not suitable for discovering clusters with non-convex shapes.
- Unable to handle noisy data and outliers.

51 k-medoids clustering
Basic concept: Given an integer $k$ and a set $X$ of $n$ points ($n \ge k$) in an $m$-dimensional Euclidean space, denoted by $X = \{x_i = (x_{i1}, \ldots, x_{im})^T \in \mathbb{R}^m, i = 1, \ldots, n\}$, find an assignment of the $n$ points into $k$ disjoint clusters $C = (C_1, \ldots, C_k)$ centred at cluster points $m_j$ ($j = 1, \ldots, k$) called medoids, based on a certain criterion, e.g. by minimising
$f(X, C) = \sum_{j=1}^{k} \sum_{i=1}^{|C_j|} \| x_i^{(j)} - m_j \|$,
where $|C_j|$ is the number of points in the cluster $C_j$, and $x_i^{(j)}$ is the point $i$ in $C_j$.

52 k-medoids clustering
PAM (Partitioning Around Medoids) algorithm: PAM is a k-medoids clustering algorithm, which is similar to the k-means algorithm. It can be summarised as follows:
1. Randomly select k data points from the given data set. These k points represent the medoids of the initial clusters.
2. Assign each point to the cluster that has the closest medoid.
3. Iteratively replace one of the medoids by one of the non-medoids if the swap improves the chosen criterion.
4. Repeat steps 2 and 3 until the medoids no longer change, or until certain termination conditions are satisfied.
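
For experimentation, the cluster package provides pam(). The sketch below runs it on a handful of made-up points, including one far-away point, to show that the returned medoids are always rows of the data. The data values and k = 2 are assumptions for illustration, and the internal initialisation used by pam() may differ from the random selection described in step 1.

```r
library(cluster)

# Made-up data: two small groups plus one distant point.
X <- rbind(c(1, 1), c(1.2, 0.8), c(0.9, 1.1),
           c(8, 8), c(8.1, 7.9), c(7.8, 8.2),
           c(20, 20))

fit <- pam(X, k = 2, metric = "euclidean")

print(fit$id.med)      # row indices of the medoids in X
print(fit$medoids)     # coordinates of the medoids (actual data points)
print(fit$clustering)  # cluster label of every row of X
```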

53 k-medoids clustering
PAM algorithm:
- Advantages: works effectively for small data sets.
- Limitations: does not scale well to large data sets.
CLARA (Clustering Large Applications): Based on multiple samples drawn from the data set and the application of PAM to each sample, it returns the best clustering as the output.
- Advantages: deals with larger data sets than PAM.
- Limitations: efficiency depends on the sample size.

54 k-medoids clustering: Illustration with R
CLARA: The function clara(x, k, metric = "metric", samples = r) performs CLARA clustering given a numerical matrix of data x, the number of clusters k, the distance metric, and the number r of samples to be drawn from the data set X.
Let us consider the following data set X:
[Figure: data set X]

55 k-medoids clustering: Illustration with R
R script:
    library(cluster)
    km <- clara(x, k = 5, metric = "euclidean", samples = 10)  # CLARA with 5 clusters, 10 samples
[Figure: CLARA clustering using 5 clusters and 10 samples]

56 k-medoids clustering: Illustration with R
[Figure: CLARA, impact of the choice of the distance metric. Panels: Euclidean distance; Manhattan distance]

57 Clustering evaluation metrics
So, which method should be used for the data set X?
- Hierarchical clustering? If yes, agglomerative or divisive? For either method:
  1. which distance metric and/or linkage function?
  2. where to cut the dendrogram?
- Partitional clustering? If yes, k-means or CLARA? For either method:
  1. which distance metric?
  2. how many clusters?

62 Clustering evaluation metrics
Silhouette coefficient: Provides a graphical representation of how well each object lies within its cluster. The silhouette coefficient of a data point $i$ is defined as
$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$,
where $a_i$ denotes the average distance between the data point $i$ and all other data points in its cluster, and $b_i$ denotes the minimum average distance between $i$ and the data points in other clusters. Data points with a large silhouette coefficient $s_i$ are well-clustered; those with a small $s_i$ tend to lie between clusters.
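
The definition above can be checked numerically. The following sketch computes the silhouette coefficient by hand for six made-up points with assumed cluster labels, and compares the result with silhouette() from the cluster package.

```r
library(cluster)

# Made-up points and an assumed clustering into two groups.
X <- rbind(c(0, 0), c(0, 1), c(1, 0),
           c(5, 5), c(5, 6), c(6, 5))
labels <- c(1L, 1L, 1L, 2L, 2L, 2L)
D <- as.matrix(dist(X))

silhouette_manual <- sapply(seq_len(nrow(X)), function(i) {
  own <- labels == labels[i]
  own[i] <- FALSE                      # exclude the point itself
  a_i <- mean(D[i, own])               # average distance within its own cluster
  others <- setdiff(unique(labels), labels[i])
  # smallest average distance to the points of any other cluster
  b_i <- min(sapply(others, function(cl) mean(D[i, labels == cl])))
  (b_i - a_i) / max(a_i, b_i)
})

sil <- silhouette(labels, dist(X))
print(cbind(manual = silhouette_manual, cluster_pkg = sil[, "sil_width"]))
```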

63 Clustering evaluation metrics
Classification-oriented measures: Use the classification approach to compare clustering techniques with the ground truth. Some of these measures are:
1. Entropy
2. Purity
3. Recall
4. F-measure

64 Clustering evaluation metrics
Entropy: Measures the degree to which each cluster consists of data points from a single class. The entropy of a cluster $i$ is given by
$E_i = -\sum_{j=1}^{l} \frac{n_{ij}}{n_i} \log \frac{n_{ij}}{n_i}$,
where $n_{ij}$ is the number of data points of class $j$ in cluster $i$, $n_i$ is the number of data points in cluster $i$ and $l$ is the number of classes. The total entropy for a set of clusters is given by
$E = \sum_{i=1}^{k} \frac{n_i}{n} E_i$,
where $k$ is the number of clusters and $n$ is the total number of data points.

65 Clustering evaluation metrics
Purity: Measures the extent to which a cluster contains data points of a single class. Using the previous notation, the purity of a cluster $i$ is given by
$Pur_i = \max_j \frac{n_{ij}}{n_i}$,
whereas the overall purity of a clustering is given by
$Pur = \sum_{i=1}^{k} \frac{n_i}{n} Pur_i$.
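
Entropy and purity can both be computed from the table of counts $n_{ij}$. The sketch below assumes eight made-up data points with known class labels and an assumed clustering; the natural logarithm is used, which is an assumption since the slides do not state the base, and the convention $0 \log 0 = 0$ is applied.

```r
# Assumed cluster assignments and ground-truth classes (made-up values).
cluster_id <- c(1, 1, 1, 1, 2, 2, 2, 2)
class_id   <- c("a", "a", "a", "b", "b", "b", "b", "b")

n_ij <- table(cluster_id, class_id)  # rows: clusters i, columns: classes j
n_i  <- rowSums(n_ij)                # number of points in each cluster
n    <- sum(n_ij)                    # total number of points
p    <- n_ij / n_i                   # p[i, j] = n_ij / n_i

# Entropy per cluster (natural log, with 0 * log(0) treated as 0) and in total.
E_i <- -rowSums(ifelse(p > 0, p * log(p), 0))
E   <- sum(n_i / n * E_i)

# Purity per cluster and overall.
Pur_i <- apply(p, 1, max)
Pur   <- sum(n_i / n * Pur_i)

print(c(entropy = E, purity = Pur))
```

Here cluster 2 contains a single class, so its entropy is 0 and its purity is 1, while the mixed cluster 1 contributes all of the total entropy.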

66 Clustering evaluation metrics
Precision: Measures the fraction of a cluster that consists of objects of a specified class. Using the previous notation, the precision of cluster $i$ with respect to class $j$ is given by
$Pre(i, j) = \frac{n_{ij}}{n_i}$.
Recall: Measures the extent to which a cluster contains all objects of a specified class. The recall of cluster $i$ with respect to class $j$ is given by
$Rec(i, j) = \frac{n_{ij}}{n_j}$,
where $n_{ij}$ is the number of data points of class $j$ in cluster $i$ and $n_j$ is the number of data points in class $j$.

67 Clustering evaluation metrics
F-measure: Combines the precision and the recall to measure the extent to which a cluster contains only data points of a particular class and all points of that class. The F-measure of cluster $i$ with respect to class $j$ is given by
$F(i, j) = \frac{2 \, Pre(i, j) \, Rec(i, j)}{Pre(i, j) + Rec(i, j)}$.
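
Precision, recall and the F-measure can be computed for every (cluster, class) pair from the same table of counts. The sketch below assumes made-up cluster assignments and class labels; pairs with no points in common are given an F-measure of 0, which is a convention chosen here to avoid dividing by zero.

```r
# Assumed cluster assignments and ground-truth classes (made-up values).
cluster_id <- c(1, 1, 1, 1, 2, 2, 2, 2)
class_id   <- c("a", "a", "a", "b", "b", "b", "b", "b")

n_ij <- table(cluster_id, class_id)        # rows: clusters i, columns: classes j

Pre <- n_ij / rowSums(n_ij)                # Pre(i, j) = n_ij / n_i
Rec <- sweep(n_ij, 2, colSums(n_ij), "/")  # Rec(i, j) = n_ij / n_j

# F(i, j), set to 0 when precision and recall are both 0.
F_measure <- ifelse(Pre + Rec > 0, 2 * Pre * Rec / (Pre + Rec), 0)

print(round(F_measure, 3))
```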

68 End
Thank you for your attention!