Data Mining Techniques: Cluster Analysis


Mirek Riedewald
Many slides based on presentations by Han/Kamber, Tan/Steinbach/Kumar, and Andrew Moore

Cluster Analysis Overview
- Introduction
- Foundations: Measuring Distance (Similarity)
- Partitioning Methods: K-Means
- Hierarchical Methods
- Density-Based Methods
- Clustering High-Dimensional Data
- Cluster Evaluation

What is Cluster Analysis?
- Cluster: a collection of data objects
  - Similar to one another within the same cluster
  - Dissimilar to the objects in other clusters
- Unsupervised learning: usually no training set with known classes
- Typical applications
  - As a stand-alone tool to get insight into data properties
  - As a preprocessing step for other algorithms
- Intra-cluster distances are minimized; inter-cluster distances are maximized

Rich Applications, Multidisciplinary Efforts
- Pattern recognition
- Spatial data analysis
- Image processing
- Data reduction
- Economic science: market research
- WWW: document classification; weblogs: discover groups of similar access patterns
[Figure: clustering precipitation in Australia]

Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: identification of areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults

Quality: What Is Good Clustering?
- Cluster membership = objects in same class
- High intra-class similarity, low inter-class similarity
  - Choice of similarity measure is important
- Ability to discover some or all of the hidden patterns
  - Difficult to measure without ground truth

Notion of a Cluster can be Ambiguous
- How many clusters?
[Figure: the same points grouped into two clusters, four clusters, and six clusters]

Distinctions Between Sets of Clusters
- Exclusive versus non-exclusive
  - Non-exclusive clustering: points may belong to multiple clusters
- Fuzzy versus non-fuzzy
  - Fuzzy clustering: a point belongs to every cluster with some weight between 0 and 1
  - Weights must sum to 1
- Partial versus complete
  - Cluster some or all of the data
- Heterogeneous versus homogeneous
  - Clusters of widely different sizes, shapes, densities

Foundations: Measuring Distance (Similarity)

Distance
- Clustering is inherently connected to the question of (dis-)similarity of objects
- How can we define similarity between objects?

Similarity Between Objects
- Usually measured by some notion of distance
- Popular choice: Minkowski distance
  $dist(x(i), x(j)) = \left( |x_1(i) - x_1(j)|^q + |x_2(i) - x_2(j)|^q + \dots + |x_d(i) - x_d(j)|^q \right)^{1/q}$
  - q is a positive integer
- q = 1: Manhattan distance
  $dist(x(i), x(j)) = |x_1(i) - x_1(j)| + |x_2(i) - x_2(j)| + \dots + |x_d(i) - x_d(j)|$
- q = 2: Euclidean distance
  $dist(x(i), x(j)) = \sqrt{ |x_1(i) - x_1(j)|^2 + |x_2(i) - x_2(j)|^2 + \dots + |x_d(i) - x_d(j)|^2 }$
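
The Minkowski distance above can be sketched in a few lines. This is a minimal illustration only; the function name `minkowski` and the sample points are assumptions made for the example, not part of the slides.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    # Sum the q-th powers of the per-attribute differences, then take the q-th root.
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Hypothetical sample points, chosen so the results are easy to check by hand.
p1, p2 = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(p1, p2, 1)  # q = 1: |3| + |4|
euclidean = minkowski(p1, p2, 2)  # q = 2: sqrt(9 + 16)
```

For these two points the Manhattan distance is 7 and the Euclidean distance is 5, which shows how the choice of q changes the notion of "close".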

Metrics
- Properties of a metric
  - $d(i,j) \geq 0$
  - $d(i,j) = 0$ if and only if $i = j$
  - $d(i,j) = d(j,i)$
  - $d(i,j) \leq d(i,k) + d(k,j)$
- Examples: Euclidean distance, Manhattan distance
- Many other non-metric similarity measures exist
- After selecting the distance function, is it now clear how to compute similarity between objects?

Challenges
- How to compute a distance for categorical attributes
- An attribute with a large domain often dominates the overall distance
  - Weight and scale the attributes like for k-NN
- Curse of dimensionality

Curse of Dimensionality
- Best solution: remove any attribute that is known to be very noisy or not interesting
- Try different subsets of the attributes and determine where good clusters are found

Nominal Attributes
- Method 1: work with original values
  - Difference = 0 if same value, difference = 1 otherwise
- Method 2: transform to binary attributes
  - New binary attribute for each domain value
  - Encode specific domain value by setting corresponding binary attribute to 1 and all others to 0

Ordinal Attributes
- Method 1: treat as nominal
  - Problem: loses ordering information
- Method 2: map to [0,1]
  - Problem: to which values should the original values be mapped?
  - Default: equidistant mapping to [0,1]

Scaling and Transforming Attributes
- Sometimes it might be necessary to transform numerical attributes to [0,1] or use another normalizing transformation, maybe even a nonlinear one (e.g., logarithm)
- Might need to weight attributes differently
- Often requires expert knowledge or trial and error
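
The attribute transformations described above could look as follows. This is a sketch under stated assumptions: the helper names (`one_hot`, `ordinal_to_unit`, `min_max_scale`) and the sample domains are hypothetical, and min-max scaling is only one of several possible normalizations to [0,1].

```python
def one_hot(value, domain):
    """Nominal attribute: one binary attribute per domain value (1 for the match, 0 otherwise)."""
    return [1 if value == v else 0 for v in domain]


def ordinal_to_unit(value, ordered_domain):
    """Ordinal attribute: equidistant mapping of the ordered domain to [0, 1]."""
    rank = ordered_domain.index(value)
    return rank / (len(ordered_domain) - 1)  # assumes at least two domain values


def min_max_scale(values):
    """Numerical attribute: linear rescaling of the observed values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute carries no distance information
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


color = one_hot("red", ["red", "green", "blue"])
size = ordinal_to_unit("medium", ["low", "medium", "high"])
income = min_max_scale([10, 20, 30])
```

After such transformations every attribute lives on a comparable scale, so no single attribute with a large domain dominates the overall distance.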

Other Similarity Measures
- Special distance or similarity measures for many applications
  - Might be a non-metric function
- Information retrieval
  - Document similarity based on keywords
- Bioinformatics
  - Gene features in micro-arrays

Calculating Cluster Distances
- Single link = smallest distance between an element in one cluster and an element in the other: $dist(K_i, K_j) = \min_{p,q} dist(x_{ip}, x_{jq})$
- Complete link = largest distance between an element in one cluster and an element in the other: $dist(K_i, K_j) = \max_{p,q} dist(x_{ip}, x_{jq})$
- Average distance between an element in one cluster and an element in the other: $dist(K_i, K_j) = \mathrm{avg}_{p,q}\, dist(x_{ip}, x_{jq})$
- Distance between cluster centroids: $dist(K_i, K_j) = dist(m_i, m_j)$
- Distance between cluster medoids: $dist(K_i, K_j) = dist(x_{mi}, x_{mj})$
  - Medoid: one chosen, centrally located object in the cluster

Cluster Centroid, Radius, and Diameter
- Centroid: the "middle" of a cluster C
  $m = \frac{1}{|C|} \sum_{x \in C} x$
- Radius: square root of the average squared distance from any point of the cluster to its centroid
  $R = \sqrt{ \frac{\sum_{x \in C} (x - m)^2}{|C|} }$
- Diameter: square root of the average squared distance between all pairs of points in the cluster
  $D = \sqrt{ \frac{\sum_{x \in C} \sum_{y \in C,\, y \neq x} (x - y)^2}{|C| \, (|C| - 1)} }$

Partitioning Methods: K-Means

Partitioning Algorithms: Basic Concept
- Construct a partition of a database D of n objects into a set of K clusters, such that the sum of squared distances to the cluster representative m is minimized:
  $\sum_{i=1}^{K} \sum_{x \in C_i} (x - m_i)^2$
- Given a K, find the partition into K clusters that optimizes the chosen partitioning criterion
  - Globally optimal: enumerate all partitions
  - Heuristic methods
    - K-means: each cluster represented by its centroid
    - K-medoids: each cluster represented by one of the objects in the cluster

K-means Clustering
- Each cluster is associated with a centroid
- Each object is assigned to the cluster with the closest centroid
- Algorithm:
  1. Given K, select K random objects as initial centroids
  2. Repeat until centroids do not change:
     a. Form K clusters by assigning every object to its nearest centroid
     b. Recompute the centroid of each cluster
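
The K-means steps above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names, the `seed` and `max_iter` parameters, the rule of keeping the old centroid for an empty cluster, and the sample points are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def centroid(points):
    d = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(d))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K random objects as initial centroids
    clusters = []
    for _ in range(max_iter):
        # Form K clusters by assigning every object to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute the centroid of each cluster (assumption: an empty
        # cluster simply keeps its previous centroid).
        new_centroids = [centroid(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:  # centroids did not change: done
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two well-separated groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centroids, final_clusters = k_means(pts, 2, seed=1)
```

On data this well separated the two groups are recovered regardless of which objects are drawn as initial centroids; on harder data the result depends on the initialization.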

K-Means Example
[Figure: cluster assignments and centroids over successive iterations]

Overview of K-Means Convergence
[Figure: one panel per iteration]

K-means Questions
- What is it trying to optimize?
- Will it always terminate?
- Will it find an optimal clustering?
- How should we start it?
- How could we automatically choose the number of centers?
...we'll deal with these questions next

K-means Clustering Details
- Initial centroids often chosen randomly
  - Clusters produced vary from one run to another
- Distance usually measured by Euclidean distance, cosine similarity, correlation, etc.
- Comparably fast algorithm: O(n * K * I * d)
  - n = number of objects
  - K = number of clusters
  - I = number of iterations
  - d = number of attributes

Evaluating K-means Clusters
- Most common measure: Sum of Squared Error (SSE)
  - For each point, the error is the distance to the nearest centroid
    $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist(m_i, x)^2$
  - $m_i$ = centroid of cluster $C_i$
- Given two clusterings, choose the one with the smallest error
- Easy way to reduce SSE: increase K
  - In practice, large K not interesting

K-means Convergence
- (1) Assign each x to its nearest center (minimizes SSE for fixed centers)
- (2) Choose the centroid of all points in the same cluster as cluster center (minimizes SSE for fixed clusters)
- Cycle through steps (1) and (2) = K-means algorithm
- Algorithm terminates when neither (1) nor (2) results in a change of configuration
  - Finite number of ways of partitioning n records into K groups
  - If the configuration changes on an iteration, it must have improved SSE
  - So each time the configuration changes it must go to a configuration it has never been to before
  - So if it tried to go on forever, it would eventually run out of configurations
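
The SSE measure can be computed directly from a clustering. The sketch below assumes squared Euclidean distance to each cluster's centroid; the function name `sse` and the sample clusterings are hypothetical.

```python
def sse(clusters):
    """Sum of squared Euclidean distances from every point to its cluster centroid."""
    total = 0.0
    for cluster in clusters:
        d = len(cluster[0])
        m = [sum(p[k] for p in cluster) / len(cluster) for k in range(d)]
        total += sum((p[k] - m[k]) ** 2 for p in cluster for k in range(d))
    return total


# Hypothetical data: the same four points as two clusters and as one cluster.
two_clusters = [[(0, 0), (2, 0)], [(10, 0), (10, 4)]]
one_cluster = [[(0, 0), (2, 0), (10, 0), (10, 4)]]
sse_two = sse(two_clusters)
sse_one = sse(one_cluster)
```

Here the two-cluster solution has SSE 10 and the single cluster has SSE 95, illustrating that a larger K lowers SSE, which is why SSE alone cannot be used to pick K.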

Will it Find the Optimal Clustering?
[Figure: original points, an optimal clustering, and a sub-optimal clustering]

Importance of Initial Centroids
[Figure: iterations of a K-means run from one choice of initial centroids]

Will It Find The Optimal Clustering Now?
[Figure: iterations of a K-means run from a different choice of initial centroids]

Problems with Selecting Initial Centroids
- Probability of starting with exactly one initial centroid per real cluster is very low
  - K selected for the algorithm might be different from the inherent K of the data
  - Might randomly select multiple initial objects from the same cluster
- Sometimes initial centroids will readjust themselves in the right way, and sometimes they don't

Clusters Example
[Figure: starting with two initial centroids in one cluster of each pair of clusters]

Clusters Example
[Figure: iterations when starting with two initial centroids in one cluster of each pair of clusters]
[Figure: iterations when starting with some pairs of clusters having three initial centroids, while others have only one]

Solutions to Initial Centroids Problem
- Multiple runs
  - Helps, but probability is not on our side
- Sample and use hierarchical clustering to determine initial centroids
- Select more than k initial centroids and then select among these the initial centroids
  - Select those that are most widely separated
- Postprocessing
  - Eliminate small clusters that may represent outliers
  - Split clusters with high SSE
  - Merge clusters that are close and have low SSE

Limitations of K-means
- K-means has problems when clusters are of differing
  - Sizes
  - Densities
  - Non-globular shapes
- K-means has problems when the data contains outliers

Limitations of K-means: Differing Sizes
[Figure: original points versus K-means clusters]
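
One way to "select those that are most widely separated" from a larger pool of candidate centroids is a greedy farthest-first pass. This particular heuristic is an assumption of the sketch (the slides do not name a selection rule), as are the function name `pick_separated`, the choice to start from the first candidate, and the sample candidates.

```python
import math


def pick_separated(candidates, k):
    """Greedily pick k candidates, each as far as possible from those already chosen."""
    chosen = [candidates[0]]  # assumption: start from the first candidate
    while len(chosen) < k:
        remaining = [c for c in candidates if c not in chosen]
        # The next pick maximizes the distance to its closest already-chosen point.
        best = max(remaining, key=lambda c: min(math.dist(c, s) for s in chosen))
        chosen.append(best)
    return chosen


# Hypothetical pool of five candidate centroids, from which three are kept.
pool = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
initial_centroids = pick_separated(pool, 3)
```

The selected centroids end up in three different regions of the data, which lowers the chance of starting with several centroids in the same real cluster. The heuristic is sensitive to outliers, since an outlier is by definition far from everything else.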

Limitations of K-means: Differing Density
[Figure: original points versus K-means clusters]

Limitations of K-means: Non-globular Shapes
[Figure: original points versus K-means clusters]

Overcoming K-means Limitations
- One solution is to use many clusters
- Find parts of clusters, then put them together
[Figures: original points versus K-means clusters for the differing-size, differing-density, and non-globular examples]

K-Means and Outliers
- K-means algorithm is sensitive to outliers
  - Centroid is average of cluster members
  - Outlier can dominate average computation
- Solution: K-medoids
  - Medoid = most centrally located real object in a cluster
  - Algorithm similar to K-means, but finding the medoid is much more expensive
    - Try all objects in the cluster to find the one that minimizes SSE, or just try a few randomly to reduce cost
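
The exhaustive medoid search ("try all objects in the cluster") can be sketched as follows. The function names and the one-dimensional sample cluster with a single outlier are assumptions made for illustration.

```python
import math


def centroid(cluster):
    d = len(cluster[0])
    return tuple(sum(p[k] for p in cluster) / len(cluster) for k in range(d))


def medoid(cluster):
    """Return the cluster member that minimizes the sum of squared distances to all members."""
    return min(cluster, key=lambda c: sum(math.dist(c, p) ** 2 for p in cluster))


# Hypothetical cluster: three nearby points and one outlier at 100.
cluster = [(1,), (2,), (3,), (100,)]
c = centroid(cluster)  # pulled far toward the outlier
m = medoid(cluster)    # stays on a real, centrally located object
```

The centroid lands at 26.5, far from every regular point, while the medoid remains at 3. The search costs a quadratic number of distance computations per cluster, which is why sampling a few candidates is suggested as a cheaper alternative.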

Hierarchical Methods

Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree
- Visualized as a dendrogram
  - Tree-like diagram that records the sequences of merges or splits

Strengths of Hierarchical Clustering
- Do not have to assume any particular number of clusters
  - Any number of clusters can be obtained by cutting the dendrogram at the proper level
- May correspond to meaningful taxonomies
  - Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, ...)

Hierarchical Clustering
- Two main types of hierarchical clustering
  - Agglomerative:
    - Start with the given objects as individual clusters
    - At each step, merge the closest pair of clusters until only one cluster (or K clusters) is left
  - Divisive:
    - Start with one, all-inclusive cluster
    - At each step, split a cluster until each cluster contains a single object (or there are K clusters)

Agglomerative Clustering Algorithm
- More popular hierarchical clustering technique
- Basic algorithm is straightforward
  1. Compute the proximity matrix
  2. Let each data object be a cluster
  3. Repeat until only a single cluster remains:
     a. Merge the two closest clusters
     b. Update the proximity matrix
- Key operation: computation of the proximity of two clusters
  - Different approaches to defining the distance between clusters distinguish the different algorithms

Starting Situation
- Clusters of individual objects, proximity matrix
[Figure: individual points and their proximity matrix]
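
The basic agglomerative loop can be sketched as below. To stay short, this illustration recomputes cluster distances on every step instead of maintaining and updating a proximity matrix, and it stops at K clusters; the function names and sample points are assumptions made for the example.

```python
import math
from itertools import combinations


def single_link(a, b):
    """Smallest distance between an element of a and an element of b."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Largest distance between an element of a and an element of b."""
    return max(math.dist(p, q) for p in a for q in b)


def agglomerative(points, k, cluster_distance=single_link):
    clusters = [[p] for p in points]  # every object starts as its own cluster
    while len(clusters) > k:
        # Find the closest pair of clusters under the chosen cluster distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # j > i, so index i stays valid
    return clusters


# Hypothetical one-dimensional data with two obvious groups.
data = [(0,), (1,), (2,), (10,), (11,)]
by_single = agglomerative(data, 2, single_link)
by_complete = agglomerative(data, 2, complete_link)
```

Swapping the `cluster_distance` argument is all that distinguishes single link from complete link here, which mirrors the point that the definition of cluster distance is what separates the different algorithms.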

Intermediate Situation
- Some clusters are merged
[Figure: partially merged clusters and their proximity matrix]

Intermediate Situation
- Merge the two closest clusters and update the proximity matrix
[Figure: the two closest clusters highlighted, with the proximity matrix]

After Merging
- How do we update the proximity matrix?
[Figure: proximity matrix with unknown entries in the row and column of the merged cluster]

Defining Cluster Distance
- Min: clusters near each other
- Max: low diameter
- Avg: more robust against outliers
- Distance between centroids

Strength of MIN
- Can handle non-elliptical shapes
[Figure: original points versus two clusters]

Limitations of MIN
- Sensitive to noise and outliers
[Figure: original points versus two clusters]

Strength of MAX
- Less susceptible to noise and outliers
[Figure: original points versus two clusters]

Limitations of MAX
- Tends to break large clusters
- Biased towards globular clusters
[Figure: original points versus two clusters]

Hierarchical Clustering: Average
- Compromise between single and complete link
- Strengths
  - Less susceptible to noise and outliers
- Limitations
  - Biased towards globular clusters

Cluster Similarity: Ward's Method
- Distance of two clusters is based on the increase in squared error when two clusters are merged
  - Similar to group average if distance between objects is distance squared
- Less susceptible to noise and outliers
- Biased towards globular clusters
- Hierarchical analogue of K-means
  - Can be used to initialize K-means

Hierarchical Clustering: Comparison
[Figure: results of MIN, MAX, Ward's method, and group average on the same data]

Time and Space Requirements
- O(n^2) space for the proximity matrix
  - n = number of objects
- O(n^3) time in many cases
  - There are n steps and at each step the proximity matrix must be updated and searched
  - Complexity can be reduced to O(n^2 log(n)) time for some approaches

Hierarchical Clustering: Problems and Limitations
- Once a decision is made to combine two clusters, it cannot be undone
- No objective function is directly minimized
- Different schemes have problems with one or more of the following:
  - Sensitivity to noise and outliers
  - Difficulty handling different sized clusters and convex shapes
  - Breaking large clusters

Density-Based Methods

Density-Based Clustering Methods
- Clustering based on density of data objects in a neighborhood
  - Local clustering criterion
- Major features:
  - Discover clusters of arbitrary shape
  - Handle noise
  - Need density parameters as termination condition

DBSCAN: Basic Concepts
- Two parameters:
  - Eps: maximum radius of the neighborhood
    $N_{Eps}(q) = \{ p \in D \mid dist(q,p) \leq Eps \}$
  - MinPts: minimum number of points in an Eps-neighborhood of that point
- A point p is directly density-reachable from a point q w.r.t. Eps and MinPts if
  - p belongs to $N_{Eps}(q)$
  - Core point condition: $|N_{Eps}(q)| \geq MinPts$
[Figure: points p and q with the Eps-neighborhood of q]

Density-Reachable, Density-Connected
- A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points $q = p_1, p_2, \dots, p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$
- A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
- Cluster = set of density-connected points
[Figure: chains of points illustrating density-reachability and density-connectivity]

DBSCAN: Classes of Points
- A point is a core point if it has at least a specified number of points (MinPts) within Eps
  - At the interior of a cluster
- A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point
  - At the outer surface of a cluster
- A noise point is any point that is not a core point or a border point
  - Not part of any cluster
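
The three classes of points can be computed directly from the definitions. In this sketch the Eps-neighborhood of a point includes the point itself and a core point needs at least MinPts points in it; both conventions, the function name `classify_points`, and the sample data are assumptions made for illustration.

```python
import math


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    # Eps-neighborhood of each point (assumption: it contains the point itself).
    neighborhoods = [[j for j, q in enumerate(points) if math.dist(p, q) <= eps]
                     for p in points]
    core = {i for i, n in enumerate(neighborhoods) if len(n) >= min_pts}
    kinds = []
    for i, n in enumerate(neighborhoods):
        if i in core:
            kinds.append("core")
        elif any(j in core for j in n):  # inside the neighborhood of a core point
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds


# Hypothetical one-dimensional data: a short chain of points plus one far-away point.
sample = [(0,), (1,), (2,), (3,), (10,)]
kinds = classify_points(sample, eps=1, min_pts=3)
```

The two interior points of the chain are core points, its two end points are border points, and the isolated point is noise.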

DBSCAN: Core, Border, and Noise Points
[Figure: original points and their point types (core, border, and noise)]

DBSCAN Algorithm
- Repeat until all points have been processed:
  - Select a point p
  - If p is a core point, then retrieve and remove all points density-reachable from p w.r.t. Eps and MinPts; output them as a cluster
- Discards all noise points (how?)
- Discovers clusters of arbitrary shape
- Fairly robust against noise
- Runtime: O(n^2), space: O(n)
  - O(n * time to find the points in a neighborhood)
  - Can be O(n log(n)) with a spatial index

When DBSCAN Works Well
[Figure: original points and the clusters DBSCAN finds]

When DBSCAN Does NOT Work Well
- Varying densities
- High-dimensional data
[Figure: original points and the DBSCAN results for a large Eps and for a small Eps]

DBSCAN: Determining Eps and MinPts
- Idea: for points in a cluster, their k-th nearest neighbors are at roughly the same distance
  - Noise points have the k-th nearest neighbor at a farther distance
- Plot the sorted distance of every point to its k-th nearest neighbor
  - Choose Eps where a sharp change occurs
  - MinPts = k
    - k too large: small clusters labeled as noise
    - k too small: small groups of outliers labeled as cluster
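
The DBSCAN loop can be sketched as follows. This is a plain O(n^2) illustration without a spatial index; the neighborhood of a point is assumed to include the point itself, and the function names, the label convention (-1 for noise), and the sample points are assumptions of the sketch.

```python
import math


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster_id  # i is a core point: grow a new cluster from it
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


# Hypothetical data: two dense squares of four points and one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Points that are never reached from a core point keep the noise label, which is one answer to the "how?" in the algorithm description: noise is whatever remains unassigned once all clusters have been expanded.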

DBSCAN: Sensitive to Parameters
[Figure: clusterings of the same data under different parameter settings]

Clustering High-Dimensional Data

Clustering High-Dimensional Data
- Many applications: text documents, DNA micro-array data
- Major challenges:
  - Irrelevant dimensions may mask clusters
  - Curse of dimensionality for distance computation
  - Clusters may exist only in some subspaces
- Methods
  - Feature transformation, e.g., PCA and SVD
    - Some useful only when features are highly correlated/redundant
  - Feature selection: wrapper or filter approaches
  - Subspace clustering: find clusters in all subspaces
    - CLIQUE

Curse of Dimensionality
- Graphs on the right adapted from Parsons et al., KDD Explorations
- Data in only one dimension is relatively packed
- Adding a dimension stretches the objects across that dimension, moving them further apart
  - High-dimensional data is very sparse
- Distance measure becomes meaningless
  - For many distributions, distances between objects become more similar in high dimensions

Why Subspace Clustering?
[Figure: adapted from Parsons et al., SIGKDD Explorations]

CLIQUE (Clustering In QUEst)
- Automatically identifies clusters in subspaces
- Exploits monotonicity property
  - If a set of points forms a dense cluster in d dimensions, they also form a cluster in any subset of these dimensions
  - A region is dense if the fraction of data points in the region exceeds the input model parameter
  - Sound familiar? Apriori algorithm...
- Algorithm is both density-based and grid-based
  - Partitions each dimension into the same number of equal-length intervals
  - Partitions an m-dimensional data space into non-overlapping rectangular units
  - Cluster = maximal set of connected dense units within a subspace
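
The claim that distances become more similar in high dimensions can be checked with a small experiment. The sketch below assumes uniformly distributed random points in the unit cube and measures the relative contrast (max - min) / min of the distances from one query point to all others; the function name, the sample size, and the seed are arbitrary choices for the illustration.

```python
import math
import random


def relative_contrast(dim, n_points=100, seed=0):
    """(max - min) / min of the distances from one random point to all the others."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)


low_dim = relative_contrast(2)     # nearest neighbor is much closer than the farthest
high_dim = relative_contrast(500)  # nearest and farthest are almost equally far away
```

In two dimensions the farthest point is many times farther away than the nearest one, while in 500 dimensions the contrast shrinks to a small fraction, so "nearest" carries little information.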

[Figure: dense regions in the age-salary plane and in the age-vacation (weeks) plane]

CLIQUE Algorithm
- Find all dense regions in 1-dim space for each attribute. This is the set of dense 1-dim cells. Let k = 1.
- Repeat until there are no dense k-dim cells:
  - k = k + 1
  - Generate all candidate k-dim cells from dense (k-1)-dim cells
  - Eliminate cells with fewer points than the density threshold
- Find clusters by taking the union of all adjacent, high-density cells of the same dimensionality
- Summarize each cluster using a small set of inequalities that describe the attribute ranges of the cells in the cluster
- Example: compute the intersection of the dense age-salary and age-vacation regions

Strengths and Weaknesses of CLIQUE
- Strengths
  - Automatically finds subspaces of the highest dimensionality that contain high-density clusters
  - Insensitive to the order of objects in the input and does not presume some canonical data distribution
  - Scales linearly with input size and has good scalability with the number of dimensions
- Weaknesses
  - Need to tune grid size and density threshold
  - Each point can be a member of many clusters
  - Can still have high mining cost (inherent problem for subspace clustering)
  - Same density threshold for low and high dimensionality

Cluster Evaluation

Cluster Validity on Test Data
[Figure]

Cluster Validity
- Clustering: usually no ground truth available
- Problem: "clusters are in the eye of the beholder"
- Then why do we want to evaluate them?
  - To avoid finding patterns in noise
  - To compare clustering algorithms
  - To compare two sets of clusters
  - To compare two clusters
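
The first two levels of the CLIQUE search can be sketched as below. This simplified illustration uses an absolute point count as the density threshold, the same value range for every attribute, and stops at 2-dim cells; the function names, parameters, and sample data are assumptions, and the clustering and summarization steps are left out.

```python
from collections import Counter
from itertools import combinations


def cell_index(value, lo, hi, n_intervals):
    """Index of the equal-length interval that contains value."""
    width = (hi - lo) / n_intervals
    return min(int((value - lo) / width), n_intervals - 1)


def dense_cells(points, lo, hi, n_intervals, min_points):
    """Return the dense 1-dim cells and the dense 2-dim cells built from them."""
    # Each point falls into one (dimension, interval) cell per attribute.
    units = [[(dim, cell_index(v, lo, hi, n_intervals)) for dim, v in enumerate(p)]
             for p in points]
    counts_1d = Counter(u for point_units in units for u in point_units)
    dense_1d = {u for u, c in counts_1d.items() if c >= min_points}
    # Monotonicity: a 2-dim cell can only be dense if both of its 1-dim
    # projections are dense, so only those combinations are counted.
    counts_2d = Counter()
    for point_units in units:
        kept = [u for u in point_units if u in dense_1d]
        counts_2d.update(combinations(kept, 2))
    dense_2d = {u for u, c in counts_2d.items() if c >= min_points}
    return dense_1d, dense_2d


# Hypothetical two-attribute data in the range [0, 10], five intervals per attribute.
data = [(1, 1), (1.5, 1.2), (1.2, 1.8), (8, 8.5), (8.5, 8.2), (1.1, 8.9)]
d1, d2 = dense_cells(data, lo=0, hi=10, n_intervals=5, min_points=2)
```

All four occupied 1-dim cells are dense, but only two of the three occupied 2-dim cells survive the threshold. This is the same pruning idea as in Apriori: candidates are built only from lower-dimensional cells that are already known to be dense.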

Clusters found in Random Data
[Figure: random points and the clusters found in them by K-means, DBSCAN, and complete link]

Measuring Cluster Validity Via Correlation
- Two matrices
  - Similarity matrix
  - Incidence matrix
    - One row and one column for each object
    - Entry is 1 if the associated pair of objects belongs to the same cluster, otherwise 0
- Compute the correlation between the two matrices
  - Since the matrices are symmetric, only the correlation between n(n-1)/2 entries needs to be calculated
- High correlation: objects close to each other tend to be in the same cluster
- Not a good measure when clusters can be non-globular and intertwined

Measuring Cluster Validity Via Correlation
[Figure: two data sets with their correlation values]

Similarity Matrix for Cluster Validation
- Order the similarity matrix with respect to cluster labels and inspect visually
- Block-diagonal matrix for well-separated clusters
[Figure: ordered similarity matrix for well-separated clusters]

Similarity Matrix for Cluster Validation
- Clusters in random data are not so crisp
[Figure: ordered similarity matrices for DBSCAN and K-means clusters found in random data]
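
The correlation between the incidence matrix and the similarity matrix can be sketched as follows. The sketch assumes that similarity is derived from Euclidean distance by linear rescaling to [0, 1] and uses the Pearson correlation over the n(n-1)/2 object pairs; the function names, sample points, and label vectors are hypothetical.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def incidence_similarity_correlation(points, labels):
    """Correlation between pairwise similarity and same-cluster incidence."""
    pairs = list(combinations(range(len(points)), 2))  # n(n-1)/2 entries
    dists = [math.dist(points[i], points[j]) for i, j in pairs]
    d_min, d_max = min(dists), max(dists)
    # Assumption: similarity = 1 for the closest pair, 0 for the farthest pair.
    similarity = [1 - (d - d_min) / (d_max - d_min) for d in dists]
    incidence = [1 if labels[i] == labels[j] else 0 for i, j in pairs]
    return pearson(similarity, incidence)


# Hypothetical data: two tight groups, labeled once sensibly and once arbitrarily.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = incidence_similarity_correlation(pts, [0, 0, 0, 1, 1, 1])
bad = incidence_similarity_correlation(pts, [0, 1, 0, 1, 0, 1])
```

The sensible labeling yields a correlation close to 1, while the arbitrary labeling does not. If a distance (proximity) matrix is used instead of a similarity matrix, the sign flips and a good clustering shows a strongly negative correlation.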

Similarity Matrix for Cluster Validation
- Clusters in random data are not so crisp
[Figure: ordered similarity matrix for complete link clusters found in random data]

Similarity Matrix for Cluster Validation
[Figure: ordered similarity matrix for a DBSCAN clustering]

Sum of Squared Error
- For a fixed number of clusters, lower SSE indicates better clustering
  - Not necessarily true for non-globular, intertwined clusters
- Can also be used to estimate the number of clusters
  - Run K-means for different K, compare SSE
[Figure: SSE of clusters found using K-means, plotted against K]

When SSE Is Not So Great
[Figure]

Comparison to Random Data or Clustering
- Need a framework to interpret any measure
  - E.g., if the measure has a certain value, is that good or bad?
- Statistical framework for cluster validity
  - Compare the cluster quality measure on random data or random clustering to the one on real data
    - If the value for the random setting is unlikely, then the cluster results are valid (cluster = non-random structure)
- For comparing the results of two different sets of cluster analyses, a framework is less necessary
  - But: need to know whether the difference between two index values is significant

Statistical Framework for SSE
- Example: found a number of clusters and got a certain SSE for the given data set
- Compare to the SSE of the same number of clusters in random data
  - Histogram: SSE of the clusters in many sets of random data points (points drawn from a fixed range for x and y)
  - Estimate mean and standard deviation for SSE on random data
  - Check how many standard deviations away from the mean the real-data SSE is
[Figure: random data and histogram of SSE (count versus SSE)]

Statistical Framework for Correlation
- Compare the correlation of incidence and proximity matrices for well-separated data versus random data
[Figure: well-separated data and random data with their (negative) correlation values]

Cluster Cohesion and Separation
- Cohesion: how closely related are objects in a cluster
  - Can be measured by SSE ($m_i$ = centroid of cluster i):
    $SSE = \sum_{i} \sum_{x \in C_i} (x - m_i)^2 = \sum_{i} \frac{1}{2 |C_i|} \sum_{x \in C_i} \sum_{y \in C_i} (x - y)^2$
- Separation: how well-separated are clusters
  - Can be measured by the between-cluster sum of squares (m = overall mean):
    $BSS = \sum_{i} |C_i| \, (m - m_i)^2$
[Figure: cohesion within a cluster versus separation between clusters]

Cohesion and Separation Example
- Note: BSS + SSE = constant
  - Minimize SSE <=> get maximal BSS
- [Worked example: a one-dimensional data set evaluated for K = 1 cluster and for K = 2 clusters; SSE + BSS gives the same total in both cases, and for K = 2 the BSS is 9]

Silhouette Coefficient
- Combines ideas of both cohesion and separation
- For an individual object i
  - Calculate $a_i$ = average distance of i to the objects in its cluster
  - Calculate $b_i$ = average distance of i to the objects in another cluster C, choosing the C that minimizes $b_i$
  - Silhouette coefficient of i = $(b_i - a_i) / \max\{a_i, b_i\}$
    - Range: [-1, 1], but typically between 0 and 1
    - The closer to 1, the better
- Can calculate the average silhouette width over all objects

Final Comment on Cluster Validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
Algorithms for Clustering Data, Jain and Dubes
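
Cohesion, separation, and the silhouette coefficient can be tried out on a tiny one-dimensional data set. The points 1, 2, 4, 5 below are an assumption chosen for illustration (they are not taken from the slides), as are the function names; for the silhouette, $a_i$ is averaged over the other members of the object's own cluster, which is also an assumption of this sketch.

```python
def mean(values):
    return sum(values) / len(values)


def sse_and_bss(clusters):
    """Cohesion (SSE) and separation (BSS) for clusters of one-dimensional values."""
    overall = mean([x for c in clusters for x in c])
    sse = sum((x - mean(c)) ** 2 for c in clusters for x in c)
    bss = sum(len(c) * (overall - mean(c)) ** 2 for c in clusters)
    return sse, bss


def silhouette(x, own_cluster, other_clusters):
    """Silhouette coefficient of x; own_cluster must contain at least one other object."""
    rest = list(own_cluster)
    rest.remove(x)  # a averages over the other members of x's own cluster
    a = mean([abs(x - y) for y in rest])
    b = min(mean([abs(x - y) for y in c]) for c in other_clusters)
    return (b - a) / max(a, b)


# Hypothetical data: the same four values as one cluster and as two clusters.
sse_1, bss_1 = sse_and_bss([[1, 2, 4, 5]])    # K = 1
sse_2, bss_2 = sse_and_bss([[1, 2], [4, 5]])  # K = 2
s = silhouette(1, [1, 2], [[4, 5]])
```

With K = 1 the SSE is 10 and the BSS is 0; with K = 2 the SSE is 1 and the BSS is 9. The total stays at 10 in both cases, so lowering SSE necessarily raises BSS. The silhouette coefficient of the object 1 is (3.5 - 1) / 3.5, roughly 0.71.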

Summary
- Cluster analysis groups objects based on their similarity (or distance) and has wide applications
- Measure of similarity (or distance) can be computed for all types of data
- Many different types of clustering algorithms
  - Discover different types of clusters
- Many measures of clustering quality, but absence of ground truth always a challenge