Clustering DHS 10.6-10.7, 10.9-10.10, 10.4.3-10.4.4



Clustering Definition
A form of unsupervised learning in which we identify groups in feature space from an unlabeled sample set, i.e. we define class regions in feature space using unlabeled data. Note that the classes identified are abstract, in the sense that we obtain cluster 0 ... cluster n as our classes, and these need not map onto known categories (e.g. clustering MNIST digits may not yield exactly 10 clusters).

Applications
Clustering applications include:
Data reduction: represent samples by their associated cluster.
Hypothesis generation: discover possible patterns in the data, then validate them on other data sets.
Hypothesis testing: test assumed patterns in the data.
Prediction based on groups: e.g. selecting medication for a patient using clusters of previous patients with the same disease and their reactions to medication.

Kuncheva: Supervised vs. Unsupervised Classification

A Simple Example
Assume the class distributions are known to be normal: we can then define clusters by their mean and covariance matrix. However, we may need more information to cluster well, because many different distributions can share a mean and covariance matrix. And what is the number of clusters?

FIGURE 10.6. These four data sets have identical statistics up to second order, that is, the same mean and covariance. In such cases it is important to include in the model more parameters to represent the structure more completely. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Steps for Clustering
1. Feature selection. Ideal: a small number of features with little redundancy.
2. Similarity (or proximity) measure. A measure of similarity or dissimilarity between samples.
3. Clustering criterion. Determines how distance patterns define cluster likelihood (e.g. preferring circular to elongated clusters).
4. Clustering algorithm. The search method used with the clustering criterion to identify clusters.
5. Validation of results, using appropriate (e.g. statistical) tests.
6. Interpretation of results. A domain expert interprets the clusters (clusters are subjective).

Choosing a Similarity Measure
Most common: Euclidean distance. Roughly speaking, we want the distance between samples within a cluster to be smaller than the distance between samples in different clusters. Example (next slide): define clusters by a maximum distance d0 between a point and some point in a cluster. Rescaling features can be useful (it transforms the space); unfortunately, normalizing the data (e.g. setting features to zero mean, unit variance) may eliminate subclasses. One might also rotate the axes so they coincide with the eigenvectors of the covariance matrix (i.e. apply PCA).

FIGURE 10.7. The distance threshold affects the number and size of clusters in similarity-based clustering methods. For three different values of distance d0 (here 0.3, 0.1, and 0.03), lines are drawn between points closer than d0: the smaller the value of d0, the smaller and more numerous the clusters. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
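
To make the thresholding idea in Fig. 10.7 concrete, here is a minimal Python/NumPy/SciPy sketch (threshold_clusters is a hypothetical helper, not DHS code) that links any two points closer than d0 and reads off the connected components:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import connected_components

def threshold_clusters(X, d0):
    """Group points into clusters by linking any pair closer than d0,
    then taking connected components (the grouping shown in Fig. 10.7)."""
    D = squareform(pdist(X))       # pairwise Euclidean distances
    adjacency = D < d0             # edge wherever points are closer than d0
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels

# Smaller d0 -> smaller and more numerous clusters:
X = np.random.default_rng(0).random((50, 2))
for d0 in (0.3, 0.1, 0.03):
    print(d0, threshold_clusters(X, d0)[0])
```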

FIGURE 10.8. Scaling axes affects the clusters in a minimum-distance cluster method. The original data and minimum-distance clusters are shown in the upper left; points in one cluster are shown in red, while the others are shown in gray. When the vertical axis is expanded by a factor of 2.0 and the horizontal axis shrunk by a factor of 0.5, the clustering is altered (as shown at the right). Alternatively, if the vertical axis is shrunk by a factor of 0.5 and the horizontal axis is expanded by a factor of 2.0, smaller and more numerous clusters result (shown at the bottom). In both these scaled cases, the assignment of points to clusters differs from that in the original space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

FIGURE 10.9. If the data fall into well-separated clusters (left), normalization by scaling for unit variance for the full data may reduce the separation, and hence be undesirable (right). Such a normalization may in fact be appropriate if the full data set arises from a single fundamental process (with noise), but inappropriate if there are several different processes, as shown here. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Other Similarity Measures
Minkowski metric (dissimilarity):
d(x, x') = \left( \sum_{k=1}^{d} |x_k - x'_k|^q \right)^{1/q}
Changing the exponent q: q = 1 gives the Manhattan (city-block) distance; q = 2 gives the Euclidean distance, the only form invariant to translation and rotation in feature space.
Cosine similarity:
s(x, x') = \frac{x^T x'}{\|x\| \, \|x'\|}
Characterizes similarity by the cosine of the angle between two feature vectors (in [0, 1] for non-negative features): the ratio of the inner product to the product of the vector magnitudes. Invariant to rotations and dilation (not translation).
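
A minimal sketch of these two measures in Python/NumPy (the function names are ours, for illustration):

```python
import numpy as np

def minkowski(x, y, q=2.0):
    """Minkowski dissimilarity: q = 1 is Manhattan, q = 2 is Euclidean."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors."""
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
```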

More on Cosine Similarity
If the features are binary-valued, the inner product x^T x' is the number of shared attributes, and the product of magnitudes is the geometric mean of the numbers of attributes in the two vectors. Frequently used for information retrieval.
Variations:
Ratio of shared attributes (identical lengths):
s(x, x') = \frac{x^T x'}{d}
Tanimoto distance, the ratio of shared attributes to the attributes possessed by x or x':
s(x, x') = \frac{x^T x'}{x^T x + x'^T x' - x^T x'}

Cosine Similarity: Tag Sets for YouTube Videos (Example by K. Kluever)
Cosine similarity is a standard metric in information retrieval for comparing text documents, and can be used to rank a video's related videos by the similarity of their tag sets. Let A and B be binary tag-occurrence vectors over the union of the tags in two videos (they contain only 1s and 0s). The dot product A · B then reduces to the size of the intersection of the two tag sets, and the product of the magnitudes reduces to the square root of the number of tags in the first set times the square root of the number of tags in the second set:
SIM(A, B) = cos θ = (A · B) / (|A| |B|)
Example: A = {dog, puppy, funny} and B = {dog, puppy, cat}. Building a tag occurrence table over the union of both tag sets:

Tag Set | dog | puppy | funny | cat
A       | 1   | 1     | 1     | 0
B       | 1   | 1     | 0     | 1

Reading row-wise, the occurrence vectors are A = (1, 1, 1, 0) and B = (1, 1, 0, 1), so SIM(A, B) = 2 / (√3 · √3) = 2/3.
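
The 2/3 figure can be checked directly; a small NumPy sketch of the tag-vector computation above:

```python
import numpy as np

# Tag occurrence vectors over the union {dog, puppy, funny, cat}
A = np.array([1, 1, 1, 0])  # {dog, puppy, funny}
B = np.array([1, 1, 0, 1])  # {dog, puppy, cat}

# Dot product = size of the tag intersection (2 shared tags);
# each norm = square root of the number of tags in that set (sqrt(3))
sim = (A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))
print(sim)  # 2 / (sqrt(3) * sqrt(3)) = 2/3 ≈ 0.667
```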

Additional Similarity Metrics
The Theodoridis text defines a large number of alternative distance metrics, including:
Hamming distance: the number of locations where two vectors (usually bit vectors) disagree
Correlation coefficient
Weighted distances, and others

Criterion Functions for Clustering
A criterion function quantifies the quality of a set of clusters. The clustering task: partition the data set D into c disjoint sets D1 ... Dc, choosing the partition that optimizes the criterion function.

Criterion: Sum of Squared Error
J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - \mu_i\|^2
Measures the total squared error incurred by the choice of cluster centers (the cluster means); the optimal clustering minimizes this quantity.
Issues: well suited when clusters are compact and well separated; different numbers of points in each cluster can lead to large clusters being split unnaturally (next slide); sensitive to outliers.
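
A direct transcription of J_e into Python/NumPy (assuming a data matrix X, integer cluster labels, and an array of cluster means; the helper name sse is ours):

```python
import numpy as np

def sse(X, labels, centers):
    """Sum-of-squared-error criterion J_e for a given partition."""
    return sum(np.sum((X[labels == i] - centers[i]) ** 2)
               for i in range(len(centers)))
```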

FIGURE 10.10. When two natural groupings have very different numbers of points, the clusters minimizing a sum-squared-error criterion J_e (Eq. 54) may not reveal the true underlying structure. Here the criterion is smaller for the two clusters at the bottom (J_e small) than for the more natural clustering at the top (J_e large). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Related Criteria: Minimum Variance
An equivalent formulation of the SSE criterion:
J_e = \frac{1}{2} \sum_{i=1}^{c} n_i \bar{s}_i, \qquad \bar{s}_i = \frac{1}{n_i^2} \sum_{x \in D_i} \sum_{x' \in D_i} \|x - x'\|^2
where \bar{s}_i is the mean squared distance between points in cluster i (a variance).
Alternative criteria: use the median, maximum, or another descriptive statistic of the within-cluster distances for \bar{s}_i.
Variation, using similarity (e.g. Tanimoto):
\bar{s}_i = \frac{1}{n_i^2} \sum_{x \in D_i} \sum_{x' \in D_i} s(x, x') \quad \text{or} \quad \bar{s}_i = \min_{x, x' \in D_i} s(x, x')
where s may be any similarity function (in this case, maximize).

Criterion: Scatter Matrix-Based
Minimize the trace of the within-class scatter matrix S_w:
S_w = \sum_{i=1}^{c} \sum_{x \in D_i} (x - \mu_i)(x - \mu_i)^T
trace[S_w] = \sum_{i=1}^{c} \sum_{x \in D_i} \|x - \mu_i\|^2 = J_e
Equivalent to SSE! Recall that the total scatter is the sum of the within-class and between-class scatter (S_m = S_w + S_b). This means that by minimizing the trace of S_w, we also maximize
trace[S_b] = \sum_{i=1}^{c} n_i \|\mu_i - \mu_0\|^2
(as S_m is fixed).
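
A sketch of S_w in Python/NumPy, under the same assumptions as the sse helper above, illustrating that trace[S_w] equals J_e:

```python
import numpy as np

def within_scatter(X, labels, centers):
    """Within-class scatter matrix S_w; its trace equals J_e."""
    d = X.shape[1]
    Sw = np.zeros((d, d))
    for i in range(len(centers)):
        diff = X[labels == i] - centers[i]
        Sw += diff.T @ diff
    return Sw

# Sanity check: np.trace(within_scatter(X, labels, centers))
# should equal sse(X, labels, centers) from the earlier sketch.
```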

Scatter-Based Criteria, Cont'd
Determinant criterion:
J_d = |S_w|
Roughly measures the square of the scattering volume; proportional to the product of the variances in the principal axes (minimize!). The minimum-error partition will not change with axis scaling, unlike SSE.

Scatter-Based: Invariant Criteria
Invariant criteria (eigenvalue-based): the eigenvalues λ_1, ..., λ_d of S_w^{-1} S_b measure the ratio of between- to within-cluster scatter in the direction of the corresponding eigenvectors (maximize!). The trace of a matrix is the sum of its eigenvalues (here d is the length of the feature vector):
trace[S_w^{-1} S_b] = \sum_{i=1}^{d} \lambda_i
J_f = trace[S_m^{-1} S_w] = \sum_{i=1}^{d} \frac{1}{1 + \lambda_i}
Eigenvalues are invariant under non-singular linear transformations (rotations, translations, scaling, etc.).

Clustering with a Criterion
Choosing a criterion creates a well-defined problem: define clusters so as to optimize the criterion function. This is a search problem; the brute-force solution is to enumerate all partitions of the training set and select the partition with the best criterion value (infeasible in practice, as the number of partitions grows exponentially with n).

Comparison: Scatter-Based Criteria

Hierarchical Clustering
Motivation: capture similarity/distance relationships between sub-groups and samples within the chosen clusters, as is common in scientific taxonomies (e.g. biology). Can operate bottom-up (from individual samples to clusters: agglomerative clustering) or top-down (from a single cluster to individual samples: divisive clustering).

Agglomerative Hierarchical Clustering
Problem: given n samples, we want c clusters.
One solution: create a sequence of partitions (clusterings).
First partition, k = 1: n clusters (one cluster per sample).
Second partition, k = 2: n − 1 clusters.
Continue reducing the number of clusters by one, merging the 2 closest clusters (a cluster may be a single sample) at each step k, until the goal partition, k = n − c + 1: c clusters. Done; but if we're curious, we can continue on until the final partition, k = n: one cluster.
Result: all samples and sub-clusters are organized into a tree (a dendrogram); dendrogram diagrams often show cluster similarity on the y-axis. If, as stated above, whenever two samples share a cluster they remain in a cluster at higher levels, we have a hierarchical clustering. (See the SciPy sketch after the figures below.)

FIGURE 10.11. A dendrogram can represent the results of hierarchical clustering algorithms. The vertical axis shows a generalized measure of similarity among clusters. Here, at level 1 all eight points lie in singleton clusters; each point in a cluster is highly similar to itself, of course. Points x6 and x7 happen to be the most similar, and are merged at level 2, and so forth. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

FIGURE 10.12. A set or Venn diagram representation of two-dimensional data (which was used in the dendrogram of Fig. 10.11) reveals the hierarchical structure but not the quantitative distances between clusters. The levels are numbered by k, in red. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.
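
As a practical aside (not from DHS), SciPy's scipy.cluster.hierarchy module implements this procedure. A minimal sketch on hypothetical data, where method='single' corresponds to the nearest-neighbor algorithm below and method='complete' to the farthest-neighbor variant:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

X = np.random.default_rng(0).normal(size=(8, 2))  # 8 samples, as in Fig. 10.11
Z = linkage(X, method='single')     # 'single' = nearest-neighbor merging
labels = fcluster(Z, t=3, criterion='maxclust')   # stop at c = 3 clusters
dendrogram(Z)                       # draws the tree (requires matplotlib)
```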

Distance Measures
d_min(D_i, D_j) = \min_{x \in D_i, \, x' \in D_j} \|x - x'\|
d_max(D_i, D_j) = \max_{x \in D_i, \, x' \in D_j} \|x - x'\|
d_avg(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \|x - x'\|
d_mean(D_i, D_j) = \|m_i - m_j\|
Listed above: the minimum, maximum, and average inter-sample distance (samples for clusters i, j: D_i, D_j), and the difference in cluster means (m_i, m_j).

Nearest-Neighbor Algorithm
Also known as the single-linkage algorithm: agglomerative hierarchical clustering using d_min. The two nearest neighbors in separate clusters determine which clusters are merged at each step. If we continue until k = n (c = 1), we produce a minimum spanning tree (similar to Kruskal's algorithm). MST: a path exists between all node (sample) pairs, and the sum of edge costs is minimum over all spanning trees.
Issues: sensitive to noise and to slight changes in the position of data points (the chaining effect). Example: next slide.

FIGURE 10.13. Two Gaussians were used to generate two-dimensional samples, shown in pink and black. The nearest-neighbor clustering algorithm gives two clusters that well approximate the generating Gaussians (left). If, however, another particular sample is generated (circled red point at the right) and the procedure is restarted, the clusters do not well approximate the Gaussians. This illustrates how the algorithm is sensitive to the details of the samples. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Farthest-Neighbor Algorithm
Agglomerative hierarchical clustering using d_max: the clusters with the smallest maximum distance between their points are merged at each step. Goal: a minimal increase in the largest cluster diameter at each iteration (discourages elongated clusters). Known as the complete-linkage algorithm if terminated when the distance between the nearest clusters exceeds a given threshold.
Issues: works well for clusters that are compact and roughly equal in size; with elongated clusters, the result can be meaningless.

FIGURE 10.14. The farthest-neighbor clustering algorithm uses the separation between the most distant points as a criterion for cluster membership. If this distance is set very large, then all points lie in the same cluster. In the case shown at the left, a fairly large d_max leads to three clusters; a smaller d_max gives four clusters, as shown at the right. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Using Mean and Average Distances
d_mean and d_avg reduce sensitivity to outliers. The mean is also less expensive to compute than the average, minimum, or maximum inter-sample distance, each of which requires computing n_i · n_j distances.

Stepwise Optimal Hierarchical Clustering
Problem: none of the agglomerative methods discussed so far directly minimizes a specific criterion function.
Modified agglomerative algorithm: for k = 1 to (n − c + 1), find the pair of clusters D_i and D_j whose merger changes the criterion least, and merge them.
Example: minimal increase in SSE (J_e), using
d_e(D_i, D_j) = \sqrt{\frac{n_i n_j}{n_i + n_j}} \, \|m_i - m_j\|
d_e identifies the cluster pair that increases J_e as little as possible. The result may not minimize SSE, but it is often a good starting point; it prefers merging single elements or small clusters with large clusters over merging medium-sized clusters. (A small code sketch follows.)
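
A small sketch of the d_e merge cost in Python/NumPy (merge_cost is a hypothetical helper; Xi and Xj hold the samples of the two candidate clusters):

```python
import numpy as np

def merge_cost(Xi, Xj):
    """d_e(D_i, D_j): the merge cost whose square equals the
    increase in J_e caused by merging the two clusters."""
    ni, nj = len(Xi), len(Xj)
    mi, mj = Xi.mean(axis=0), Xj.mean(axis=0)
    return np.sqrt(ni * nj / (ni + nj)) * np.linalg.norm(mi - mj)
```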

k-means Clustering
k-means algorithm, for a number of clusters k:
1. Choose k data points at random as initial cluster centers.
2. Assign all data points to the closest of the k cluster centers.
3. Re-compute the k cluster centers as the mean vector of each cluster.
If the cluster centers do not change, stop; else, go to 2.
Complexity: O(ndcT) for T iterations, d features, n points, c clusters; in practice, usually T << n (far fewer than n iterations). Note: the means tend to move so as to minimize the squared-error criterion.
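
A minimal NumPy implementation of the algorithm above (a sketch, not an optimized or production k-means; the helper name kmeans is ours):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means on data matrix X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    # 1. Choose k data points at random as initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each point to the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each center as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):   # centers unchanged: stop
            break
        centers = new_centers
    return centers, labels
```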

FIGURE 10.2. The k-means clustering procedure is a form of stochastic hill climbing in the log-likelihood function. The contours represent equal log-likelihood values for the one-dimensional data in Fig. 10.1. The dots indicate parameter values after different iterations of the k-means algorithm. Six of the starting points shown lead to local maxima, whereas two (i.e., µ1(0) = µ2(0)) lead to a saddle point near µ = 0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

FIGURE 10.1. (Above) The source mixture density used to generate sample data, and two maximum-likelihood estimates based on the data in the table. (Bottom) Log-likelihood of a mixture model consisting of two univariate Gaussians as a function of their means, for the data in the table. Trajectories for the iterative maximum-likelihood estimation of the means of a two-Gaussian mixture model based on the data are shown as red lines. Two local optima (with log-likelihoods −52.2 and −56.7) correspond to the two density estimates shown above. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

FIGURE 10.3. Trajectories for the means of the k-means clustering procedure applied to two-dimensional data. The final Voronoi tessellation (for classification) is also shown; the means correspond to the centers of the Voronoi cells. In this case, convergence is obtained in three iterations. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Fuzzy k-means
Basic idea: allow every point to have a probability of membership in every cluster. The criterion (cost function) minimized is:
J_{fuz} = \sum_{i=1}^{c} \sum_{j=1}^{n} [\hat{P}(\omega_i \mid x_j, \hat{\Theta})]^b \, \|x_j - \mu_i\|^2
Here Θ̂ is the membership function parameter set, and b ("blending") is a free parameter: with b = 0 this reduces to the sum-of-squared-error criterion (one cluster per data point), while for b > 1 each pattern may belong to multiple clusters.

Fuzzy k-means Clustering Algorithm
Update equations:
\mu_i = \frac{\sum_{j=1}^{n} [\hat{P}(\omega_i \mid x_j)]^b \, x_j}{\sum_{j=1}^{n} [\hat{P}(\omega_i \mid x_j)]^b}
\hat{P}(\omega_i \mid x_j) = \frac{(1/d_{ij})^{1/(b-1)}}{\sum_{r=1}^{c} (1/d_{rj})^{1/(b-1)}}, \qquad d_{ij} = \|x_j - \mu_i\|^2
Algorithm:
1. Compute the probability of each class for every point in the training set (initialize with uniform probability: equal likelihood in each cluster).
2. Recompute the means using the first expression above.
3. Recompute the probability of each class for each point using the second expression.
If the change in the means and membership probabilities is small, stop; else go to 2.
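
A sketch of the update loop in Python/NumPy, following the two update equations above. The name fuzzy_kmeans is a hypothetical helper; we also initialize memberships randomly rather than uniformly (our assumption), since exactly uniform memberships would make all means coincide:

```python
import numpy as np

def fuzzy_kmeans(X, c, b=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy k-means sketch; X is (n, d), b > 1. Returns (means, memberships)."""
    rng = np.random.default_rng(seed)
    P = rng.random((c, n := len(X)))           # memberships, shape (c, n)
    P /= P.sum(axis=0, keepdims=True)          # columns sum to 1 over clusters
    mu = None
    for _ in range(max_iter):
        W = P ** b
        # Mean update: membership-weighted average of the points
        mu_new = (W @ X) / W.sum(axis=1, keepdims=True)
        # Membership update from squared distances d_ij
        d = ((X[None, :, :] - mu_new[:, None, :]) ** 2).sum(axis=2)  # (c, n)
        inv = (1.0 / np.maximum(d, 1e-12)) ** (1.0 / (b - 1.0))
        P_new = inv / inv.sum(axis=0, keepdims=True)
        if mu is not None and np.allclose(mu_new, mu, atol=tol) \
                and np.allclose(P_new, P, atol=tol):
            break                               # small change: stop
        mu, P = mu_new, P_new
    return mu, P
```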

FIGURE 10.4. At each iteration of the fuzzy k-means clustering algorithm, the probability of category memberships for each point are adjusted according to Eqs. 32 and 33 (here b = 2). While most points have non-negligible memberships in two or three clusters, we nevertheless draw the boundary of a Voronoi tessellation to illustrate the progress of the algorithm. After four iterations, the algorithm has converged to the red cluster centers and associated Voronoi tessellation. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Fuzzy k-means, Cont'd
Convergence properties: fuzzy k-means sometimes improves convergence over classical k-means. However, the probability of cluster membership depends on the number of clusters, which can lead to problems if a poor choice of k is made.

Cluster Validity
So far we've assumed that we know the number of clusters. When the number of clusters isn't known, we can try a clustering procedure with c = 1, c = 2, etc., making note of sudden decreases in the error criterion (e.g. SSE). More formally, statistical tests can be used; however, the problem of testing cluster validity remains unsolved. DHS Section 10.10 presents a statistical test built around the null hypothesis of having c clusters, comparing against c + 1 clusters.
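
The informal "sudden decrease" check can be sketched with scikit-learn (an aside of this transcript, not part of DHS), whose KMeans exposes the SSE criterion as inertia_; X here is hypothetical data:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))
for c in range(1, 8):
    km = KMeans(n_clusters=c, n_init=10, random_state=0).fit(X)
    print(c, km.inertia_)   # plot SSE vs. c and look for a sharp drop
```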