Cluster Validation
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall. For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters? But clusters are in the eye of the beholder! Then why do we want to evaluate them?
- To avoid finding patterns in noise
- To compare clustering algorithms
- To compare two sets of clusters
- To compare two clusters
Clusters found in Random Data
[Figure: random points and the clusterings found in them by DBSCAN, K-means, and Complete Link.]
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine which is better.
5. Determining the "correct" number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters.
Measures of Cluster Validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types.
- External Index: used to measure the extent to which cluster labels match externally supplied class labels. Example: entropy.
- Internal Index: used to measure the goodness of a clustering structure without respect to external information. Example: Sum of Squared Error (SSE).
- Relative Index: used to compare two different clusterings or clusters. Often an external or internal index is used for this function, e.g., SSE or entropy.
Sometimes these are referred to as criteria instead of indices. However, sometimes "criterion" is the general strategy and "index" is the numerical measure that implements the criterion.
Measuring Cluster Validity Via Correlation
Two matrices:
- Proximity Matrix
- Incidence Matrix: one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster, and 0 if the pair belongs to different clusters.
Compute the correlation between the two matrices. Since the matrices are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated. High correlation indicates that points that belong to the same cluster are close to each other. Not a good measure for some density- or contiguity-based clusters.
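A rough sketch of this measure (not the slides' own data): use pairwise distances as the proximity matrix, build the incidence matrix from a K-means labelling, and correlate the upper-triangular entries. All data sizes and parameters below are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 2))                      # illustrative 2-D data

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

prox = squareform(pdist(X))                   # proximity matrix (Euclidean distances)
incid = (labels[:, None] == labels[None, :]).astype(float)   # 1 = same cluster, 0 = different

# Both matrices are symmetric, so correlate only the n(n-1)/2 upper-triangle entries.
iu = np.triu_indices_from(prox, k=1)
corr = np.corrcoef(prox[iu], incid[iu])[0, 1]
print(f"correlation = {corr:.3f}")
```

Because the proximity matrix here holds distances rather than similarities, a good clustering yields a strongly negative correlation (same-cluster pairs have small distances), which is why the values quoted on the next slide are negative.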
Measuring Cluster Validity Via Correlation
Correlation of the incidence and proximity matrices for the K-means clusterings of two data sets: Corr = -0.9235 for the data with well-separated clusters and Corr = -0.5810 for the random data.
[Figure: scatter plots of the two data sets.]
Using Similarity Matrix for Cluster Validation
Order the similarity matrix with respect to cluster labels and inspect visually.
[Figure: well-separated clusters and the corresponding reordered similarity matrix, which shows sharp high-similarity blocks along the diagonal.]
Using Similarity Matrix for Cluster Validation
Clusters in random data are not so crisp.
[Figures: random data clustered by DBSCAN, K-means, and Complete Link, each shown with its reordered similarity matrix; the diagonal blocks are much less distinct than for well-separated data.]
Using Similarity Matrix for Cluster Validation
[Figure: a more complicated data set, its DBSCAN clustering, and the corresponding reordered similarity matrix.]
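A minimal sketch of this visual check, assuming a simple similarity of one minus normalized distance and a K-means labelling; the random data here are made up, not the slides' examples.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.random((100, 2))                      # random data: expect a fuzzy picture

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
order = np.argsort(labels)                    # group rows/columns by cluster label

dist = squareform(pdist(X))
sim = 1.0 - dist / dist.max()                 # turn distances into similarities

plt.imshow(sim[order][:, order], cmap="viridis")
plt.colorbar(label="similarity")
plt.title("Similarity matrix ordered by cluster labels")
plt.show()
```

For well-separated clusters the reordered matrix shows crisp diagonal blocks; for random data like this, the blocks are weak, matching the pictures above.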
Internal Measures: SSE
Clusters in more complicated figures aren't well separated.
Internal Index: used to measure the goodness of a clustering structure without respect to external information; SSE is an example.
SSE is good for comparing two clusterings or two clusters (average SSE). It can also be used to estimate the number of clusters.
[Figure: SSE as a function of the number of clusters K; the knee in the curve suggests the natural number of clusters.]
Internal Measures: SSE
SSE curve for a more complicated data set.
[Figure: the more complicated data set and the SSE of the clusters found using K-means.]
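A short sketch of estimating the number of clusters from SSE, using scikit-learn's inertia_ attribute (which is exactly the within-cluster SSE); the ten-blob data set is an assumption, not the slides' data.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with ten Gaussian blobs (illustrative).
X, _ = make_blobs(n_samples=1000, centers=10, random_state=0)

ks = range(1, 16)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()   # look for the "knee" of the curve near the true number of clusters
```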
Framework for Cluster Validity
Need a framework to interpret any measure. For example, if our measure of evaluation yields some particular value, is that good, fair, or poor? Statistics provide a framework for cluster validity: the more "atypical" a clustering result is, the more likely it represents valid structure in the data. We can compare the values of an index that result from random data or random clusterings to those of a clustering result; if the observed value of the index is unlikely under randomness, then the cluster results are valid. These approaches are more complicated and harder to understand.
For comparing the results of two different sets of cluster analyses, a framework is less necessary. However, there is still the question of whether the difference between two index values is significant.
Statistical Framework for SSE
Example: compare an SSE of 0.005 against the SSE of three clusters found in random data. The histogram shows the SSE of three clusters in 500 sets of random data points of size 100, distributed over the range 0.2-0.8 for x and y values. The random SSE values fall roughly between 0.016 and 0.034, so an SSE as low as 0.005 is very unlikely to arise from random data.
[Figure: scatter plot of the clustered data and histogram of SSE values for the random data sets.]
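A hedged sketch of this randomization test: cluster many uniformly random data sets with K = 3 and see where an observed SSE falls in the resulting distribution. The 0.005 value, 500 trials, 100 points, and 0.2-0.8 range echo the example above as reconstructed; everything else is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def random_sse(n_points=100, k=3, low=0.2, high=0.8):
    # SSE of a K-means clustering of one uniformly random data set.
    X = rng.uniform(low, high, size=(n_points, 2))
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

null_sse = np.array([random_sse() for _ in range(500)])

observed_sse = 0.005                          # the value being tested
p_value = np.mean(null_sse <= observed_sse)   # fraction of random runs at least this good
print(f"null SSE range: [{null_sse.min():.3f}, {null_sse.max():.3f}], p = {p_value:.3f}")
```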
Statistical Framework for Correlation
Correlation of the incidence and proximity matrices for the K-means clusterings of the two data sets shown earlier: Corr = -0.9235 (well-separated clusters) and Corr = -0.5810 (random data).
[Figure: scatter plots of the two data sets.]
Internal Measures: Cohesion and Separation
Cluster Cohesion: measures how closely related the objects in a cluster are. Example: SSE.
Cluster Separation: measures how distinct or well-separated a cluster is from other clusters. Example: squared error.
Cohesion is measured by the within-cluster sum of squares (SSE):
WSS = \sum_i \sum_{x \in C_i} (x - m_i)^2
Separation is measured by the between-cluster sum of squares:
BSS = \sum_i |C_i| (m - m_i)^2
where |C_i| is the size of cluster i, m_i is the mean of cluster i, and m is the overall mean.
Internal Measures: Cohesion and Separation
Example (SSE): points 1, 2, 4, 5 on a line, with overall mean m = 3 and cluster means m_1 = 1.5, m_2 = 4.5. Note that BSS + WSS = constant.
K=1 cluster:
WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10
BSS = 4 \times (3-3)^2 = 0
Total = 10 + 0 = 10
K=2 clusters:
WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1
BSS = 2 \times (3-1.5)^2 + 2 \times (4.5-3)^2 = 9
Total = 1 + 9 = 10
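A tiny check of this worked example (illustrative code, not from the slides), computing WSS and BSS for the K=1 and K=2 groupings of the four points:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                          # overall mean = 3.0

def wss_bss(clusters):
    # WSS: squared distances to each cluster's own mean.
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    # BSS: cluster sizes times squared distance of cluster mean to overall mean.
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([x]))                                   # K=1: (10.0, 0.0)
print(wss_bss([x[:2], x[2:]]))                        # K=2: (1.0, 9.0); totals are both 10
```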
Internal Measures: Cohesion and Separation
A proximity-graph-based approach can also be used for cohesion and separation:
- Cluster cohesion is the sum of the weights of all links within a cluster.
- Cluster separation is the sum of the weights of links between nodes in the cluster and nodes outside the cluster.
[Figure: a proximity graph illustrating within-cluster links (cohesion) and between-cluster links (separation).]
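A small sketch of this graph-based view, using a hypothetical symmetric similarity matrix as the edge weights (the numbers are made up for illustration):

```python
import numpy as np

def cohesion_separation(sim, labels, c):
    """Graph-based cohesion and separation of cluster c, given a symmetric
    edge-weight (similarity) matrix and a label per node."""
    inside = labels == c
    outside = ~inside
    # Count each within-cluster edge once (upper triangle of the sub-matrix).
    cohesion = np.triu(sim[np.ix_(inside, inside)], k=1).sum()
    # Sum the weights of edges crossing the cluster boundary.
    separation = sim[np.ix_(inside, outside)].sum()
    return cohesion, separation

sim = np.array([[1.0, 0.9, 0.2, 0.1],
                [0.9, 1.0, 0.3, 0.2],
                [0.2, 0.3, 1.0, 0.8],
                [0.1, 0.2, 0.8, 1.0]])
labels = np.array([0, 0, 1, 1])
print(cohesion_separation(sim, labels, 0))    # (0.9, 0.8)
```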
Internal Measures: Silhouette Coefficient
The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as for clusters and clusterings.
For an individual point i:
- Calculate a = average distance of i to the points in its cluster.
- Calculate b = min (average distance of i to the points in another cluster).
- The silhouette coefficient for the point is then s = 1 - a/b if a < b (or s = b/a - 1 if a >= b, which is not the usual case).
- s is typically between 0 and 1; the closer to 1 the better.
The average silhouette width can be calculated for a cluster or for a clustering.
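A brief sketch using scikit-learn, whose silhouette is the equivalent form s = (b - a) / max(a, b) (identical to 1 - a/b in the usual case a < b); the blob data are illustrative, not the slides' example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s = silhouette_samples(X, labels)             # one coefficient per point
print("average silhouette width:", silhouette_score(X, labels))
print("per-cluster averages:", [s[labels == c].mean() for c in range(3)])
```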
External Measures of Cluster Validity: Entropy and Purity
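As a generic sketch of these two external measures (the class and cluster labels below are made up; this is not the slide's example), both can be computed from a cluster-versus-class contingency table:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def purity_and_entropy(classes, clusters):
    """Purity and entropy of a clustering against known class labels."""
    cm = confusion_matrix(classes, clusters)       # rows = classes, cols = clusters
    n = cm.sum()
    cluster_sizes = cm.sum(axis=0)

    # Purity: fraction of points belonging to the majority class of their cluster.
    purity = cm.max(axis=0).sum() / n

    # Entropy: size-weighted entropy of the class distribution inside each cluster.
    p = cm / np.maximum(cluster_sizes, 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        logp = np.where(p > 0, np.log2(p), 0.0)
    entropy = (-(p * logp).sum(axis=0)) @ (cluster_sizes / n)
    return purity, entropy

classes = np.array([0, 0, 0, 1, 1, 2, 2, 2])     # hypothetical true class labels
clusters = np.array([0, 0, 1, 1, 1, 2, 2, 2])    # hypothetical cluster labels
print(purity_and_entropy(classes, clusters))     # (0.875, ~0.344)
```

Higher purity and lower entropy indicate clusters that align better with the externally supplied class labels.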
Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
- Algorithms for Clustering Data, Jain and Dubes