Implementation and Use of The K-Means Algorithm


Thilina Hewaragamunige
September 11, 2014

Contents

1 K-Means Algorithm
  1.1 Introduction
  1.2 Performance measure of K-Means
2 Python Implementation of K-Means
  2.1 Implementation
  2.2 Demonstrations
      2.2.1 Impact of k on the J value
3 Experiments
  3.1 Data Set
  3.2 Objective
  3.3 Results
4 Discussion
Appendices
A Complete Python Code for the K-Means Algorithm

Abstract

A brief introduction to the K-Means algorithm is provided, followed by an implementation in Python. The implementation is demonstrated using a synthetic data set. Finally, it is applied to the Wisconsin breast cancer data set to see whether the K-Means algorithm can cluster benign and malignant entries separately. The results of this experiment are discussed along with the performance measure of the K-Means algorithm.

1 K-Means Algorithm

This section briefly introduces the K-Means algorithm and its underpinning principle.

1.1 Introduction

The K-Means algorithm is used to find a given number of clusters (i.e. K clusters) in a multidimensional data set. A cluster is a group of data points located in close proximity to each other in terms of Euclidean distance. Each cluster has a central point, called the centroid. K-Means uses an iterative approach to find a locally optimal solution. It starts by choosing an initial set of centroids; these may simply be chosen randomly from the data points. Each data point is then assigned to the closest centroid, which forms the initial set of clusters. A new centroid is calculated for each cluster by taking the mean, along each dimension, of the data points belonging to the cluster.

This results in a new set of centroids. The data points are then reassigned to the new centroids using the same criterion, and the procedure continues until the centroids no longer change. In certain cases, a fixed number of iterations may be used instead, irrespective of whether the centroids have stabilized.

1.2 Performance measure of K-Means

The underlying principle of K-Means is to find a set of clusters such that the total distance from the data points to the centroids of their corresponding clusters is minimal. The sum of squared distances from the data points to the centroids of their clusters is called the performance measure of K-Means, or the J value [1, p. 424]. Formally,

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2

where N is the number of data points and K is the number of clusters, x_n is the n-th data point, and \mu_k is the centroid of the k-th cluster. r_{nk} is a binary indicator variable defined as

r_{nk} = 1 if x_n belongs to cluster k with centroid \mu_k, and r_{nk} = 0 otherwise.

Usually the K-Means algorithm is executed until the J value reaches a minimum and remains unchanged, which happens once the centroids have stabilized after a number of iterations.
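For concreteness, a minimal sketch of how this J value could be computed with NumPy is shown below (illustrative only; the names j_value, x, mu and r are not part of the implementation described in Section 2, where the r matrix appears as min_location).

import numpy as np

# x: N x m data points, mu: K x m centroids, r: N x K binary indicator matrix
def j_value(x, mu, r):
    # squared Euclidean distance of every point to every centroid, shape N x K
    sq_dist = np.sum((x[:, np.newaxis, :] - mu) ** 2, axis=2)
    # r picks out, for each point, the distance to its assigned centroid
    return np.sum(r * sq_dist)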

2 Python Implementation of K-Means

This section discusses the Python implementation of K-Means in detail. The latter half of the section demonstrates the implementation by applying it to a synthetic data set.

2.1 Implementation

The K-Means algorithm was implemented in a generic way so that it can be used with a data set of any dimension and with any number of clusters. The kmeans function in this implementation accepts two parameters: the data object, which is a NumPy array, and the value of k. In this implementation the algorithm runs until the centroids are stabilized. In the following discussion the number of data points is referred to as n, the number of clusters as k, and the number of dimensions in the data set as m.

1. The first step is to select the initial set of centroids. As per the terminology outlined above, the input data array is of size n x m. NumPy's random choice function is used, without replacement, to pick k random numbers between 0 and n. These random numbers are used as indexes into the data array to retrieve the initial k centroids. The following code snippet implements this functionality.

# get k random numbers between 0 and the number of rows in the data set
centroid_indexes = np.random.choice(range(data.shape[0]), k, replace=False)
# get the corresponding data points
centroids = data[centroid_indexes, :]

2. The code now enters a while loop that runs until the centroids are stabilized. The next step is to calculate the (squared) Euclidean distance between each data point and each centroid. The centroids array is of shape k x m, whereas the data array is of shape n x m. Since NumPy is used to calculate the distances directly, it must be possible to broadcast these multi-dimensional arrays to a common shape. To do so, a new axis is introduced into the data array to change its shape to n x 1 x m.

data_ex = data[:, np.newaxis, :]

The multi-dimensional arrays data_ex and centroids are now compatible for broadcasting, and the squared differences can be calculated using per-element operations in NumPy. The resulting array is of size n x k x m.

euclidean_dist = (data_ex - centroids) ** 2

As the final part of this step, the individual squared differences along each dimension are summed to obtain the distance between a data point and a centroid. The summation is done along the third axis, which holds the per-dimension differences for a given data point and a given centroid.

distance_arr = np.sum(euclidean_dist, axis=2)

The shape of distance_arr is n x k.

3. The next step is to identify the closest centroid for each data point. This is achieved by taking the minimum distance between the data point and each of the k centroids, i.e. the minimum is taken along the second axis of distance_arr for each data point.

np.argmin(distance_arr, axis=1)

To represent the cluster assignments, an n x k binary array is used: if the i-th data point belongs to the cluster corresponding to the j-th column, then element (i, j) is set to 1; otherwise it is set to 0. Combining this with the previous snippet results in the following code segment.

min_location = np.zeros(distance_arr.shape)
min_location[range(distance_arr.shape[0]), np.argmin(distance_arr, axis=1)] = 1

4. It is now possible to calculate the J value for this iteration. It is stored in a list for later use.

j_val = np.sum(distance_arr[min_location == True])

5. Next, the new centroids are calculated by taking the mean of the data points that are clustered together. The indexes of the data points belonging to a particular cluster are identified using the corresponding column of the min_location matrix. The mean is then taken along each dimension of those data points to obtain the new centroid for that cluster. In the Python implementation a for loop is used to iterate through the clusters.

new_centroids = np.empty(centroids.shape)
for col in range(0, k):
    new_centroids[col] = np.mean(data[min_location[:, col] == True, :], axis=0)

6. The terminating condition is then tested. If the centroids have not changed, it is safe to assume that the J value has stabilized. To check whether the new and old centroids are equal, they are sorted and NumPy's array_equal method is invoked. If the centroids differ, the process is repeated from step 2.

np.array_equal(np.sort(new_centroids, axis=0), np.sort(centroids, axis=0))

7. In this implementation, even after the terminating condition is met, three more iterations are run just to collect J values for plotting. Finally, the clusters are plotted as scatter plots. Each cluster is plotted in a different color using a random color palette; k + 1 colors are used for the k clusters plus the centroids.

colors = iter(cm.rainbow(np.linspace(0, 1, k + 1)))

The cluster plotting code snippet is shown below.

for col in range(0, k):
    plt.scatter(data[min_location[:, col] == True, :][:, 0],
                data[min_location[:, col] == True, :][:, 1],
                color=next(colors))
centroid_leg = plt.scatter(new_centroids[:, 0], new_centroids[:, 1],
                           color=next(colors), marker='x')
plt.legend([centroid_leg], ['Centroids'], scatterpoints=1, loc='best')
plt.savefig('cluster.png')

Finally, the J value is plotted against the iterations. The complete code of the K-Means algorithm is available in Appendix A.

2.2 Demonstrations

To demonstrate the correct functionality of the implementation, the following synthetic data set is used:

[[1.1, 2], [1, 2], [0.9, 1.9], [1, 2.1], [9, 9], [8.9, 9], [8.7, 9.2], [9.1, 9]]

As can be seen, there are two clearly visible clusters in this data set: the first four data points form one cluster and the remaining four form the second. A scatter plot of these data points is shown in Figure 1.

Figure 1: Demonstration: Scatter plot of the data points (before clusters are identified)

The K-Means algorithm is executed over this data set with k = 2 to identify the two clusters. Figure 2 depicts the clustered data set; the markers of each cluster appear in different colors and the centroids are drawn with a distinct marker. Figure 3 shows the J value plot. The J value stabilized at 0.1575 in this particular instance.

Figure 2: Demonstration: Scatter plot of the clustered data points

Figure 3: Demonstration: J value plot
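A minimal sketch of how this demonstration can be reproduced is shown below (it assumes the kmeans function from Appendix A, together with its NumPy and Matplotlib imports, is available; the variable names are illustrative).

import numpy as np

data = np.array([[1.1, 2], [1, 2], [0.9, 1.9], [1, 2.1],
                 [9, 9], [8.9, 9], [8.7, 9.2], [9.1, 9]])
# two well-separated groups, so k = 2 should recover them
min_location, j_val = kmeans(data, k=2)
print(j_val)  # 0.1575 for the clustering reported above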

2.2.1 Impact of k on the J value

To observe the effect of the value of k on the J value, a set of random data points was used. The motivation behind using random data points is to keep the input to the K-Means algorithm as general as possible. The stabilized J value was recorded and plotted for k values ranging from 1 up to the number of data points in the random sample. Figure 4 depicts the results of this test with 100 random data points.

Figure 4: Demonstration: J vs. k plot

When k is set to 1, every data point belongs to a single cluster and the final centroid is the mean of all data points. This can result in a high J value, as the majority of the data points may lie a significant distance from the centroid. As k increases, data points get a chance to cluster with a centroid in closer proximity, so the J value is expected to decrease. When k is set to 100, i.e. the number of data points, every data point becomes its own centroid and the J value stabilizes at 0. The shape of this plot can be different for real data sets that do contain clusters; in such cases an optimal J value may appear in the middle, before the curve drops to zero as k approaches the sample size. A plot like this is therefore useful for identifying a good k value, but it should not be relied on with high confidence, because it depends on the randomness of the initial choice of centroids.
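The sweep just described could be reproduced roughly as follows (a sketch; it assumes the kmeans function from Appendix A, uses uniform random 2-D points, and for brevity sweeps only a subset of the k values, whereas the text sweeps k all the way up to the number of data points). Note that the simple implementation in Appendix A assumes no cluster ever becomes empty during an update, which is not guaranteed for large k.

import numpy as np

np.random.seed(0)                 # illustrative seed, not part of the original experiment
points = np.random.rand(100, 2)   # 100 random data points
stabilized_j = []
ks = range(1, 11)                 # a subset of the k values swept in the text
for k in ks:
    _, j_val = kmeans(points, k)  # stabilized J value for this k
    stabilized_j.append(j_val)
# plotting stabilized_j against ks gives a curve of the kind shown in Figure 4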

3 Experiments

This section discusses the use of the K-Means implementation outlined in the previous section on a real data set, the Wisconsin breast cancer data set [2].

3.1 Data Set

The Wisconsin breast cancer data set [2] contains a set of features computed from digitized images of fine needle aspirates (FNA) of breast masses. It contains 699 entries, each with 11 attributes. One of the attributes is the sample code number, which uniquely identifies a sample, and the Class attribute identifies whether an entry is benign or malignant. The table below provides a brief description of each attribute.

Attribute No.   Attribute                      Domain
1               Sample code number             id number
2               Clump Thickness                1-10
3               Uniformity of Cell Size        1-10
4               Uniformity of Cell Shape       1-10
5               Marginal Adhesion              1-10
6               Single Epithelial Cell Size    1-10
7               Bare Nuclei                    1-10
8               Bland Chromatin                1-10
9               Normal Nucleoli                1-10
10              Mitoses                        1-10
11              Class                          2 for benign, 4 for malignant

Sixteen records had missing attribute values, denoted by "?" signs; those records were removed from the processing set, leaving 683 records. Since the data started from the first row of the file, the file was manually modified to carry the abbreviated attribute names as a header row. The first and last columns were dropped before running the K-Means algorithm. The following code listing shows how Pandas was used to import the data into a NumPy array.

d = pandas.read_csv(open('data/breast-cancer-wisconsin.data'), na_values='?')
d_clean = d[d.isnull().any(axis=1) == False]
data = d_clean.iloc[:, 1:10].values

This results in 683 data points, each with 9 dimensions. Plotting the data to see how it varies along each dimension produces Figure 5. Since this plot is not very clear to the naked eye, some of the attributes were also plotted separately for the first 100 data entries; Figure 6 depicts the resulting set of plots. As can be seen, the data oscillates along each axis.

Figure 5: Experiment: Plot of all dimensions

Figure 6: Experiment: Plot of 4 dimensions for the first 100 data points
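The per-attribute plots behind Figures 5 and 6 could be generated roughly as follows (a sketch; data is the 683 x 9 array produced by the Pandas snippet above, and the four attribute indices chosen here are arbitrary).

import matplotlib.pyplot as plt

fig, axes = plt.subplots(4, 1, figsize=(8, 10))
for ax, col in zip(axes, [0, 1, 2, 3]):
    ax.plot(data[:100, col])                   # first 100 records of one attribute
    ax.set_ylabel('Attribute %d' % (col + 2))  # +2 because the id column was dropped
plt.savefig('attributes.png')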

3.2 Objective

Since the original data set is labeled as benign or malignant, the aim was to test whether, with k set to 2, the resulting two clusters would resemble those two categories.

3.3 Results

In the data set of 683 entries, the composition of benign and malignant entries based on the Class attribute (label) is shown below.

Class        No. of Entries    Percentage
Benign       444               65%
Malignant    239               35%

The numbers of benign and malignant entries in the clusters produced by the K-Means algorithm were counted. If the majority of a cluster's entries were benign, it was labeled the benign cluster; similarly, if malignant entries formed the majority, it was labeled the malignant cluster. The percentage of true positives in each cluster was then calculated, i.e. how many genuinely benign records ended up in the benign cluster and how many malignant records ended up in the malignant cluster. Tabulated results of multiple executions of the K-Means algorithm are given below, showing the total number of entries, the number of benign entries and the number of malignant entries in each cluster, together with the overall positive predictive value [3]. The positive predictive value (PPV) is the proportion of true positive results:

PPV = \frac{\text{number of true positives}}{\text{number of true positives} + \text{number of false positives}}

Execution    Cluster 1                     Cluster 2                     PPV
             Size   Benign   Malignant     Size   Benign   Malignant
1            231    9        222           452    435      17            0.96
2            452    435      17            231    9        222           0.96
3            452    435      17            231    9        222           0.96
4            452    435      17            231    9        222           0.96
5            230    9        221           453    435      18            0.96

For instance, in execution 1 the PPV is calculated as follows:

PPV = \frac{\text{true positives in cluster 1} + \text{true positives in cluster 2}}{(\text{true positives} + \text{false positives}) \text{ in both clusters}} = \frac{222 + 435}{(222 + 9) + (435 + 17)} = \frac{657}{683} = 0.96

Based on these observations, there is sufficient evidence to conclude that the clusters generated by the K-Means algorithm closely resemble the two categories of entries.

4 Discussion

A manual scan of the data set suggests that, for a given data point, the values of the majority of features are either low (around 1-2) or comparatively high. If the feature values of a record are low, the record is generally categorized as benign, whereas comparatively high values make it a malignant entry. This characteristic of the data set is likely the main reason for the observed results: data points with low values for each feature are clustered together by the K-Means algorithm, which creates the cluster of benign entries, and the same applies to entries with comparatively high feature values.

The J value plot for one execution is depicted in Figure 7. The J value stabilized at 19323.1738171 in that execution.

Figure 7: Experiment: J value plot over multiple iterations in a single execution of K-Means
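As a concrete illustration of the evaluation procedure described in Section 3.3, the majority labeling and PPV could be computed along these lines (a sketch; it assumes min_location is the assignment matrix returned by kmeans(data, k=2) and that the Class column is taken from the cleaned data frame d_clean, neither of which is spelled out as code in the original text).

import numpy as np

labels = d_clean.iloc[:, 10].values        # Class column: 2 = benign, 4 = malignant
true_positives = 0
for col in range(min_location.shape[1]):
    cluster_labels = labels[min_location[:, col] == 1]
    benign = np.sum(cluster_labels == 2)
    malignant = np.sum(cluster_labels == 4)
    # the majority class names the cluster; its members count as true positives
    true_positives += max(benign, malignant)
ppv = true_positives / len(labels)         # about 0.96 for the runs reported above
print(ppv)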

Figure 8: Experiment: Stabilized J values in 5 executions of K-Means

Figure 9: Experiment: Final centroids resulting from 5 executions of K-Means

Figure 10: Experiment: Scatter plot of data points before and after being clustered (with Clump Thickness and Uniformity of Cell Size)

Figure 11: Experiment: Scatter plot of data points before and after being clustered (with Clump Thickness and Uniformity of Cell Shape)

Figure 8 shows the stabilized J value of multiple executions of the K-Means algorithm; these J values have very little variance across executions. Figure 9 is a plot of the final centroids of 5 executions of the K-Means algorithm. Adjacent centroid pairs belong to a single execution; for instance, the first two centroids belong to the first execution and the last two belong to the fifth execution. Each feature is plotted in a different color. Observing the distribution of the values of a particular feature across these centroids shows that it varies between two distinct, narrow ranges of values; for instance, the feature plotted in green varies between 1.2 and 2.6. It can also be seen that, of the two centroids from a single execution, one lies in each of these two ranges rather than both falling in the same range: continuing the previous example, the first centroid is approximately 1.2 and the second centroid of the same execution is approximately 2.6. Another observation is that, considering the entire set of features, each centroid shows either comparatively high values or comparatively low values across all of them. As mentioned previously, if the feature values are low there is a high chance that the record is benign, and vice versa. This graphical evidence again explains the high accuracy of the results.

The observations from Figure 8 are supported by those from Figure 9. The final sets of centroids produced by each execution are very close to each other in terms of Euclidean distance; hence the stabilized J values are approximately equal to each other.

Figures 10 and 11 show the results of the K-Means algorithm when only two features are considered. As discussed before, two clusters are formed, one containing the smaller values in each dimension and the other containing the larger values.

One of the biggest challenges was becoming familiar with the NumPy library. It takes some practice to be comfortable handling multidimensional arrays and some of NumPy's advanced features such as broadcasting. The IPython notebooks provided in class and the book Python for Data Analysis [4] were very useful in this assignment.

The K-Means algorithm proves to be effective on certain data sets, and the Wisconsin breast cancer data set is one such example. If the feature values of the data are inherently clustered, K-Means can generate very accurate results. Moreover, if the appropriate K value is known, as in this particular example, the challenge of identifying the correct k value does not arise.

Appendices

A Complete Python Code for the K-Means Algorithm

The complete code for the K-Means algorithm is given below. It does not contain the code used to prepare the data.

# imports required by the function below
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

def kmeans(data, k=2):
    # get the initial set of centroids:
    # get k random numbers between 0 and the number of rows in the data set
    centroid_indexes = np.random.choice(range(data.shape[0]), k, replace=False)
    # get the corresponding data points
    centroids = data[centroid_indexes, :]
    print('Initial Centroids:', centroids)

    # color map for coloring clusters; one extra color for the centroids
    colors = iter(cm.rainbow(np.linspace(0, 1, k + 1)))

    j_values = []
    extra_iterations = 3
    while extra_iterations > 0:
        # find the euclidean distance between a center and a data point
        # centroids array shape = k x m
        # data array shape = n x m
        # to broadcast, introduce a third dimension into data (n x 1 x m);
        # broadcasting then gives arrays of size n x k x m
        data_ex = data[:, np.newaxis, :]
        euclidean_dist = (data_ex - centroids) ** 2
        # sum the distances along the 3rd axis (length of the dimension is m);
        # this gives the total distance from each centroid for each data point,
        # so the resulting array is of size n x k
        distance_arr = np.sum(euclidean_dist, axis=2)

        # find out to which cluster each data point belongs:
        # use an n x k matrix where [i, j] = 1 if the i-th data point belongs to cluster j
        min_location = np.zeros(distance_arr.shape)
        min_location[range(distance_arr.shape[0]), np.argmin(distance_arr, axis=1)] = 1

        # calculate J
        j_val = np.sum(distance_arr[min_location == True])
        print('J Value:', j_val)
        j_values.append(j_val)

        # calculate the new centroids
        new_centroids = np.empty(centroids.shape)
        for col in range(0, k):
            new_centroids[col] = np.mean(data[min_location[:, col] == True, :], axis=0)
        print(new_centroids)

        # compare centroids to see if they are equal or not
        if np.array_equal(np.sort(new_centroids, axis=0), np.sort(centroids, axis=0)):
            # the centroids have not changed;
            # run a few extra iterations just to collect J values for the plot
            print('Centroids are stabilized. Going for an extra iteration.')
            extra_iterations = extra_iterations - 1
            if extra_iterations == 0:
                # plot the centroids and the assigned data points using a scatter plot
                for col in range(0, k):
                    plt.scatter(data[min_location[:, col] == True, :][:, 0],
                                data[min_location[:, col] == True, :][:, 1],
                                color=next(colors))
                centroid_leg = plt.scatter(new_centroids[:, 0], new_centroids[:, 1],
                                           color=next(colors), marker='x')
                plt.legend([centroid_leg], ['Centroids'], scatterpoints=1, loc='best')
                plt.savefig('cluster.png')

                # plot the J values
                fig = plt.figure()
                j_plot = fig.add_subplot(1, 1, 1)
                j_plot.plot(range(len(j_values)), np.array(j_values))
                j_plot.set_title('J Value Variation With No. of Iterations')
                j_plot.set_xlabel('Iterations')
                j_plot.set_ylabel('J Value')
                plt.savefig('jvals.png')
                return min_location, j_val
        centroids = new_centroids

References

[1] Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[2] Center for Machine Learning and Intelligent Systems, UC Irvine. Breast Cancer Wisconsin (Diagnostic) Data Set. http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29, 1995.

[3] Wikipedia. Positive and negative predictive values. http://en.wikipedia.org/wiki/positive_and_negative_predictive_values.

[4] Wes McKinney. Python for Data Analysis. O'Reilly, 2013.