1 Unsupervised Learning: Clustering, Partitioning, K-Means
Dr. Ammar Mohammed, Associate Professor of Computer Science, ISSR, Cairo University
PhD in CS (Uni. Koblenz-Landau, Germany)
Spring 2019
Contact: mailto:

2 Supervised versus Unsupervised Learning
Supervised learning (classification/regression)
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
- New data is classified based on the training set.
Unsupervised learning (clustering)
- The class labels of the training data are unknown.
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

3 Introduction: What is Clustering?
Clustering is the task of dividing a population of data points into groups (clusters) such that points in the same group are more similar to each other than to points in other groups.
- The data in each group share common traits according to a defined similarity measure or distance measure.
- A similarity measure quantifies how alike two data instances are.
- In a machine learning context, similarity is derived from a distance whose dimensions represent the features of the objects. A small distance means a high degree of similarity; a large distance means a low degree of similarity.
- Similarity is usually measured in the range [0, 1]:
  Similarity = 1 if X = Y (where X and Y are two objects)
  Similarity = 0 if X ≠ Y
- Note: Similarity(x, y) = 1 means distance(x, y) = 0.

4 Example Applications of Clustering
Clustering is appropriate when there is no a priori knowledge about the data, i.e. when class labels are absent.
- Group people of similar sizes together to make small, medium and large T-shirts.
- In marketing, segment customers according to their similarities in order to do targeted marketing.
- Given a collection of text documents, organize them according to their content similarities in order to produce a topic hierarchy.

5 Distance Metric Properties
A distance metric d is a function that takes as arguments two points x and y in an n-dimensional space $\mathbb{R}^n$ and has the following properties:
- Symmetry: $d(x, y) = d(y, x)$. The distance from x to y is the same as the distance from y to x.
- Positivity: $d(x, y) \ge 0$ for any x and y. The distance between any two points is a real number greater than or equal to zero, and equality holds if and only if x = y, i.e. $d(x, x) = 0$.
- Triangle inequality: $d(x, y) \le d(x, z) + d(z, y)$. The distance between x and y is shorter than or equal to the sum of the distances from x to a third point z and from z to y.
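The three properties can be checked numerically on a handful of points. The sketch below is only an illustration: the helper name `check_metric`, the tolerance and the sample points are assumptions, not part of the slides, and passing the check on a few points does not prove that a function is a metric.

```python
import math
from itertools import product


def euclidean(x, y):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Squared Euclidean distance (violates the triangle inequality)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def check_metric(d, points, tol=1e-12):
    """Return True if d satisfies symmetry, positivity and the triangle
    inequality on every triple drawn from `points` (hypothetical helper)."""
    for x, y, z in product(points, repeat=3):
        symmetric = abs(d(x, y) - d(y, x)) <= tol
        # d(x, y) >= 0, and d(x, y) == 0 exactly when x == y
        positive = d(x, y) >= 0 and ((d(x, y) <= tol) == (x == y))
        triangle = d(x, y) <= d(x, z) + d(z, y) + tol
        if not (symmetric and positive and triangle):
            return False
    return True


sample = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]
print(check_metric(euclidean, sample))                               # True
print(check_metric(squared_euclidean, [(0.0,), (1.0,), (2.0,)]))     # False
```

The second call fails because the squared distance from 0 to 2 is 4, which exceeds 1 + 1, the sum of the squared distances via the midpoint.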

6 Most Common Distance Measure
1- The Euclidean Distance
- Euclidean distance is often referred to simply as "distance". When data is dense or continuous, it is often the best proximity measure.
- The Euclidean distance takes into account both the direction and the magnitude of the vectors.
- The Euclidean distance between two n-dimensional vectors $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ is:
  $d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2} = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
- Other versions of the Euclidean distance include the squared Euclidean and the weighted Euclidean distance.

7 Most Common Distance Measure
2- Manhattan Distance
- Manhattan distance is measured along directions that are parallel to the coordinate axes.
- The Manhattan distance between two n-dimensional vectors $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ is:
  $d_M(x, y) = |x_1 - y_1| + |x_2 - y_2| + \dots + |x_n - y_n| = \sum_{i=1}^{n} |x_i - y_i|$
  where $|x_i - y_i|$ is the absolute value of the difference between $x_i$ and $y_i$.
[Figure: distance between the two blue points = 3 + 4 = 7]

8 Most Common Distance Measure
3- Minkowski Distance
- Minkowski distance is a generalization of the Euclidean and Manhattan distances.
- The Minkowski distance between two n-dimensional vectors $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ is:
  $d(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \dots + |x_n - y_n|^p \right)^{1/p} = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
- When p = 1, the distance reduces to the Manhattan distance.
- When p = 2, the distance reduces to the Euclidean distance.
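The relationship between the three distances can be sketched with a single function. The two points below are an assumption, chosen so that the coordinate differences are 3 and 4 as in the Manhattan figure caption; the function name `minkowski` is likewise illustrative.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


a, b = (0, 0), (3, 4)      # assumed points with coordinate differences 3 and 4
print(minkowski(a, b, 1))  # Manhattan: 3 + 4 = 7.0
print(minkowski(a, b, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```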

9 Most Common Distance Measure
4- Cosine Similarity
- The cosine similarity takes into account only the angle between the vectors and discards their magnitude.
- The cosine similarity between two n-dimensional vectors $x = (x_1, x_2, \dots, x_n)$ and $y = (y_1, y_2, \dots, y_n)$ is:
  $\cos(\theta) = \frac{x \cdot y}{\|x\| \, \|y\|}$
  where $x \cdot y = x_1 y_1 + x_2 y_2 + \dots + x_n y_n = \sum_{i=1}^{n} x_i y_i$ and $\|x\| = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2} = \sqrt{\sum_{i=1}^{n} x_i^2}$.
- Two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two diametrically opposed vectors have a similarity of -1, independent of their magnitude.
- The corresponding cosine distance is $d(x, y) = 1 - \cos(\theta)$, with $0 \le d(x, y) \le 2$.

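10 Most Common Distance Measure
4- Cosine Similarity: Example
$x_1 = (3, 2, 0, 5, 2, 0, 0)$, $x_2 = (1, 0, 0, 0, 1, 0, 2)$
$x_1 \cdot x_2 = 3 \cdot 1 + 2 \cdot 0 + 0 \cdot 0 + 5 \cdot 0 + 2 \cdot 1 + 0 \cdot 0 + 0 \cdot 2 = 5$
$\|x_1\| = \sqrt{3^2 + 2^2 + 0^2 + 5^2 + 2^2 + 0^2 + 0^2} = \sqrt{42} \approx 6.48$
$\|x_2\| = \sqrt{1^2 + 0^2 + 0^2 + 0^2 + 1^2 + 0^2 + 2^2} = \sqrt{6} \approx 2.45$
$\cos(x_1, x_2) = \frac{5}{6.48 \times 2.45} \approx 0.315$
$d(x_1, x_2) = 1 - \cos(x_1, x_2) \approx 1 - 0.315 = 0.685$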
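The example can be reproduced with a few lines of code. This is a minimal sketch using the two vectors from the slide; the function names are illustrative, and raising an error for zero vectors is a design assumption, since the angle is undefined in that case.

```python
import math


def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (||x|| * ||y||)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """d(x, y) = 1 - cos(theta), which lies in [0, 2]."""
    return 1.0 - cosine_similarity(x, y)


x1 = (3, 2, 0, 5, 2, 0, 0)
x2 = (1, 0, 0, 0, 1, 0, 2)
print(round(cosine_similarity(x1, x2), 3))  # 0.315
print(round(cosine_distance(x1, x2), 3))    # 0.685
```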

11 Distance Measure (Binary Features)
Binary attribute: has two values or states but no ordering relationship, e.g. Gender: male and female.
We use a contingency table (laid out like a confusion matrix) to introduce the distance measures. Let the i-th and j-th data points be the vectors $x_i$ and $x_j$:
- a: number of features that equal 1 for both $x_i$ and $x_j$
- b: number of features that equal 1 for $x_i$ but 0 for $x_j$
- c: number of features that equal 0 for $x_i$ but 1 for $x_j$
- d: number of features that equal 0 for both $x_i$ and $x_j$

12 Distance Measure (Binary Features)
A binary attribute is symmetric if both of its states (0 and 1) have equal importance and carry the same weight, e.g. male and female for the attribute Gender.
Distance function: based on the Simple Matching Coefficient, it is the proportion of mismatching values:
  $dist(x_i, x_j) = \frac{b + c}{a + b + c + d}$
[Example: dist(x_1, x_2) for two binary vectors; the values are not recoverable from the transcription.]

13 Distance Measure (Binary Features)
A binary attribute is asymmetric if one of its states is more important or more valuable than the other. By convention, state 1 represents the more important state, which is typically the rare or infrequent one.
The Jaccard coefficient is a popular measure; the corresponding distance ignores the 0/0 matches counted by d:
  $dist(x_i, x_j) = \frac{b + c}{a + b + c}$
If the data mixes symmetric and asymmetric features, apply the distance measure suited to the dominant feature type.
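Both binary distances follow directly from the counts a, b, c and d. The sketch below assumes 0/1 encoded vectors; the two vectors `x` and `y` are hypothetical values made up for illustration, and returning 0.0 when a + b + c = 0 is a design assumption for the otherwise undefined case.

```python
def binary_counts(x, y):
    """Return (a, b, c, d) for two equal-length 0/1 vectors."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


def simple_matching_distance(x, y):
    """Symmetric binary features: (b + c) / (a + b + c + d)."""
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)


def jaccard_distance(x, y):
    """Asymmetric binary features: (b + c) / (a + b + c)."""
    a, b, c, _ = binary_counts(x, y)
    if a + b + c == 0:
        return 0.0  # assumption: two all-zero vectors are treated as identical
    return (b + c) / (a + b + c)


# Hypothetical vectors, not taken from the slides.
x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 1, 0, 1, 0]
print(binary_counts(x, y))                       # (2, 0, 1, 3)
print(round(simple_matching_distance(x, y), 3))  # 0.167
print(round(jaccard_distance(x, y), 3))          # 0.333
```

The Jaccard distance is larger here because the three 0/0 matches no longer count as agreement.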

14 Distance Measure (Binary Features)
Example:
  Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4
  Jack | M      | ...
  Mary | F      | ...
  Jim  | M      | ...
  (Y: yes, P: positive, N: negative; the Y/P/N cell values are not recoverable from the transcription.)
- Gender is a symmetric feature (less important); the remaining features are asymmetric binary.
- Set the values Y and P to 1, and the value N to 0.
Jaccard distances (only the b + c numerators survive in the transcription):
  d(Jack, Mary) = (0 + 1) / ...
  d(Jack, Jim) = (1 + 1) / ...
  d(Jim, Mary) = (1 + 2) / ...

15 Distance Measure (Nominal Features)
A nominal feature is a generalization of the binary feature in that it can take more than two states/values, e.g. red, yellow, blue, green.
There are two methods to handle such features:
- Simple mismatching:
  $d(x, y) = \frac{\text{number of mismatching features between } x \text{ and } y}{\text{total number of features}}$
- Conversion into binary variables: create a new binary feature for each nominal state. For example, if a feature has three possible nominal states (red, yellow and blue), it is expanded into three binary features. The distance measures for binary features then become applicable.

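16 Distance Measure (Nominal Features)
Example:
     | Outlook | Temperature | Humidity | Wind
  D1 | ...
  D2 | ...
  (The cell values are not recoverable from the transcription; D1 and D2 differ in 2 of the 4 features.)
- Simple mismatching: $d(D_1, D_2) = \frac{2}{4} = 0.5$
- Creating new binary features, using one bit for each value a feature can take:
  Outlook = {Sunny, Overcast, Rain} → (100, 010, 001)
  Temperature = {High, Mild, Cool} → (100, 010, 001)
  Humidity = {High, Normal} → (10, 01)
  Wind = {Strong, Weak} → (10, 01)
  Each mismatching feature flips two bits, so 4 of the 10 bits differ: $d(D_1, D_2) = \frac{4}{10} = 0.4$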
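Both methods can be sketched in code. The feature domains come from the slide, but the two records `d1` and `d2` are assumptions: the original table values are missing, so these were picked only because they differ in two of the four features and therefore reproduce the distances 0.5 and 0.4.

```python
# Feature domains as listed on the slide.
DOMAINS = {
    "Outlook": ["Sunny", "Overcast", "Rain"],
    "Temperature": ["High", "Mild", "Cool"],
    "Humidity": ["High", "Normal"],
    "Wind": ["Strong", "Weak"],
}


def simple_mismatch(r1, r2):
    """Fraction of nominal features on which the two records disagree."""
    mismatches = sum(1 for feature in DOMAINS if r1[feature] != r2[feature])
    return mismatches / len(DOMAINS)


def one_hot(record):
    """Expand a record into one bit per possible nominal state."""
    bits = []
    for feature, states in DOMAINS.items():
        bits.extend(1 if record[feature] == state else 0 for state in states)
    return bits


def binary_mismatch(r1, r2):
    """Simple matching distance (b + c) / (a + b + c + d) on the one-hot bits."""
    b1, b2 = one_hot(r1), one_hot(r2)
    return sum(1 for p, q in zip(b1, b2) if p != q) / len(b1)


# Hypothetical records (assumed values, not from the slides).
d1 = {"Outlook": "Sunny", "Temperature": "High", "Humidity": "High", "Wind": "Weak"}
d2 = {"Outlook": "Sunny", "Temperature": "High", "Humidity": "Normal", "Wind": "Strong"}

print(simple_mismatch(d1, d2))  # 0.5
print(one_hot(d1))              # [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
print(binary_mismatch(d1, d2))  # 0.4
```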

17 Clustering Methods
- Partitioning clustering (our focus)
- Hierarchical clustering
- Density-based methods
- Model-based methods
More than 100 clustering algorithms are known, including several partitioning algorithms. K-Means is one of the best-known partitioning clustering algorithms.

18 K-Means Clustering
- Let the set of data points (or instances) D be $\{x_1, x_2, \dots, x_m\}$, where $x_i = (x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(n)})$ is a vector in n dimensions and n is the number of attributes in the data.
- The k-means algorithm partitions the given data into k clusters.
- Each cluster has a cluster center, called the centroid.
- k is specified by the user.

19 K-Means Clustering: Steps
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If the convergence criterion is not met, go to 2.

20 K-Means Clustering Algorithm
Algorithm: k-means(k, D)
  Randomly initialize k cluster centroids μ_1, μ_2, ..., μ_k
  repeat until convergence {
    for i = 1 to m
      c_i := index (from 1 to k) of the cluster centroid closest to x_i
    for j = 1 to k
      μ_j := average (mean) of the points assigned to cluster j
  }

21 K-Means Clustering Algorithm (informally)
Algorithm: k-means(k, D)
  choose k data points as the initial centroids (cluster centers)
  repeat
    for each data point x_i in D do
      compute the distance from x_i to each centroid
      assign x_i to the closest centroid
    endfor
    re-compute the centroids using the cluster memberships
  until the stopping criterion is met
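The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names, the `max_iter` and `seed` parameters, the "no re-assignments" stopping rule and the handling of empty clusters (the old centroid is kept) are all assumptions, and the six sample points are made up.

```python
import math
import random


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_point(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def k_means(data, k, max_iter=100, seed=0):
    """Return (centroids, assignment); assignment[i] is the cluster of data[i]."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # step 1: k data points as seeds
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: euclidean(x, centroids[j])) for x in data
        ]
        # Step 4: stop when no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: re-compute each centroid from its current members.
        for j in range(k):
            members = [x for x, c in zip(data, assignment) if c == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    return centroids, assignment


# Hypothetical data: two well-separated groups of three points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = k_means(data, k=2)
print(assignment)
print(centroids)
```

The cluster labels depend on which seeds are drawn, so only the grouping is meaningful, not the label values themselves.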

22 Common Distance Measures
- For each cluster $C_j$, $|C_j|$ denotes the number of data points in the cluster.
- The centroid of the cluster is computed with:
  $\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$
- The distance between a data point x and the centroid $\mu_i$ of a cluster $C_i$ is denoted $d(x, \mu_i)$.
Stopping / convergence criterion:
- no (or minimum) re-assignments of data points to different clusters
- no (or minimum) change of centroids
- minimum decrease in the cost function, the sum of squared error (SSE), which is the optimization objective function:
  $SSE = \sum_{i=1}^{k} \sum_{x \in C_i} dist(x, \mu_i)^2$

23 Example: K=2

24 Example: Randomly select k centroids

25 Example: Iteration 1, cluster assignment

26 Example: Re-compute the centroids

28 Example: Re-compute the centroids

30 A simple practical Example: K=2
Use Euclidean distance to find the two clusters.

31 A simple practical Example: K=2
Step 1: We randomly choose the following 2 centroids: $\mu_1 = (1.0, 1.0)$ and $\mu_2 = (5.0, 7.0)$.
             | Cluster 1  | Cluster 2
  Individual | 1          | 4
  μ          | (1.0, 1.0) | (5.0, 7.0)

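32 Step 2: We obtain two clusters containing {1,2,3} (green) and {4,5,6,7} (red). Their new centroids are:
  $\mu_1 = \frac{1}{3}(x_1 + x_2 + x_3) = (1.83, 2.33)$
  $\mu_2 = \frac{1}{4}(x_4 + x_5 + x_6 + x_7) = (4.12, 5.38)$
[Table: distance of each individual to Centroid 1 and Centroid 2; values not recoverable from the transcription.]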

33 Step 3: Using the new centroids, we compute the Euclidean distance of each object to both centroids, as shown in the table. The new clusters are {1,2} and {3,4,5,6,7}. The next centroids are $\mu_1 = (1.25, 1.5)$ and $\mu_2 = (3.9, 5.1)$.
[Table: distance of each individual to Centroid 1 and Centroid 2; values not recoverable from the transcription.]

34 Step 4: With the new centroids, the obtained clusters are again {1,2} and {3,4,5,6,7}, so the cluster memberships do not change. The algorithm therefore halts here, and the final result consists of the 2 clusters {1,2} and {3,4,5,6,7}.
[Table: distance of each individual to Centroid 1 and Centroid 2; values not recoverable from the transcription.]
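The four steps can be traced in code. One caveat up front: the data table of this example is missing from the transcription, so the seven coordinates below are an assumption, chosen because they reproduce every centroid quoted in steps 1 to 3. With these values individual 3 is exactly equidistant from both initial centroids; the sketch breaks the tie in favour of the first cluster, which matches the {1,2,3} grouping on the slide.

```python
import math

# ASSUMED data: not in the transcription, chosen to match the quoted centroids.
points = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def assign(centroids):
    """Group individual labels by nearest centroid (ties go to the first)."""
    clusters = [[] for _ in centroids]
    for label, p in points.items():
        nearest = min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        clusters[nearest].append(label)
    return clusters


def recompute(clusters):
    """Mean of the member coordinates of each (non-empty) cluster."""
    return [
        tuple(sum(points[i][axis] for i in members) / len(members) for axis in range(2))
        for members in clusters
    ]


def trace(centroids):
    """Repeat assign/recompute until the memberships stop changing."""
    history, previous = [], None
    while True:
        clusters = assign(centroids)
        if clusters == previous:
            return history
        centroids = recompute(clusters)
        history.append((clusters, centroids))
        previous = clusters


for clusters, centroids in trace([(1.0, 1.0), (5.0, 7.0)]):
    print(clusters, [tuple(round(v, 2) for v in c) for c in centroids])
```

Under these assumptions the trace records two rounds: {1,2,3} and {4,5,6,7} first, then {1,2} and {3,4,5,6,7}, after which the memberships no longer change.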

35 Plotting the answer

36 The same problem with K=

37 Strengths of K-Means
- Easy and simple to implement.
- Efficient: the time complexity is O(t k m), where m is the number of data points, k is the number of clusters, and t is the number of iterations.
- K-means is the most popular clustering algorithm.
- Note: when SSE is used as the objective, it terminates at a local optimum. The global optimum is hard to find due to complexity.

38 Weakness of K-Means: The Random Initialization
The random initialization (seeds) of the centroids can lead to a local optimum.

39 Weakness of K-Means: The Random Initialization
[Figure: a possible selection of the random centroids]
There are some methods that help to choose good seeds.

40 Weakness of K-Means: Choosing the Number of Clusters
What is the right value of K?

43 Cluster Validity
- For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
- For cluster analysis, the analogous question is how to evaluate the goodness of the resulting clusters.
- Why do we want to evaluate them?
  - To avoid finding patterns in noise
  - To compare clustering algorithms
  - To compare two sets of clusters
  - To compare two clusters

44 Measures of Cluster Validity
Measures of validity are classified as follows:
- External index: used to measure the extent to which cluster labels match externally supplied class labels. Example: entropy.
- Internal index: used to measure the goodness of a clustering structure without respect to external information. Example: sum of squared error (SSE).
- Other index: used to compare two different clusterings or clusters. Often an external or internal index is used for this purpose, e.g. SSE or entropy.

45 Internal Measure: Cohesion and Separation
- Cluster cohesion: measures how closely related the objects in a cluster are. Cohesion is measured by the within-cluster sum of squares (WSS), i.e. the sum of squared error (SSE):
  $WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$
- Cluster separation: measures how distinct or well-separated a cluster is from the other clusters. Separation is measured by the between-cluster sum of squares (BSS):
  $BSS = \sum_{i} |C_i| \, (m - m_i)^2$
  where $|C_i|$ is the size of cluster i, $m_i$ is the centroid of cluster $C_i$, and $m$ is the centroid of all the data.
- Objective: minimize WSS or maximize BSS.

46 Choosing the Number of Clusters
- A heuristic called the elbow (knee) method might help.
- Plot the cluster cohesion (WSS) against the number of clusters K.
- Select the K at which the curve bends abruptly (the elbow).
- This method does not work well if the curve is smooth, with no clear bend.
- Sometimes K is chosen by a human for special-purpose applications.

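47 Internal: Cohesion and Separation together
Example (SSE): BSS + WSS = constant
Data points 1, 2, 4, 5 on a line, with overall centroid m = 3 and cluster centroids $m_1 = 1.5$ and $m_2 = 4.5$.
K = 1 cluster:
  $WSS = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10$
  $BSS = 4 \times (3 - 3)^2 = 0$
  Total = 10 + 0 = 10
K = 2 clusters:
  $WSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1$
  $BSS = 2 \times (3 - 1.5)^2 + 2 \times (4.5 - 3)^2 = 9$
  Total = 1 + 9 = 10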
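The arithmetic of this example can be checked with a short sketch for one-dimensional data. The function names are illustrative; the data points 1, 2, 4, 5 are the ones used in the slide's formulas.

```python
def mean(values):
    return sum(values) / len(values)


def wss(clusters):
    """Within-cluster sum of squares: squared distance of each point to its centroid."""
    return sum((x - mean(cluster)) ** 2 for cluster in clusters for x in cluster)


def bss(clusters):
    """Between-cluster sum of squares, weighted by cluster size."""
    overall = mean([x for cluster in clusters for x in cluster])
    return sum(len(cluster) * (overall - mean(cluster)) ** 2 for cluster in clusters)


for clusters in ([[1, 2, 4, 5]], [[1, 2], [4, 5]]):
    w, b = wss(clusters), bss(clusters)
    print(f"K={len(clusters)}: WSS={w}, BSS={b}, total={w + b}")
# K=1: WSS=10.0, BSS=0.0, total=10.0
# K=2: WSS=1.0, BSS=9.0, total=10.0
```

The total stays at 10 for both clusterings, which is why minimizing WSS and maximizing BSS are equivalent objectives for a fixed data set.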

48 K-Means Summary
- Despite its weaknesses, k-means is still the most popular clustering algorithm because of its simplicity and efficiency, and because other clustering algorithms have their own lists of weaknesses.
- There is no clear evidence that any other clustering algorithm performs better in general, although some may be more suitable for specific types of data or applications.
- Comparing different clustering algorithms is a difficult task: no one knows the correct clusters!