Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009
Overview
1) What is cluster analysis?
2) Types of clusters
3) Distance functions
4) Clustering methods: agglomerative, K-means, density-based
5) Interpretation of results
Relationship to PCA/Factor Analysis
Given M variables with N observations per variable: PCA/factor analysis attempts to reduce the number of variables by finding directions of maximum variance. Cluster analysis attempts to reduce the number of observations by finding groups of observations with minimum within-group variability and maximum between-group variability.
What is a cluster?
1. We need to define what we mean by a cluster for our specific application.
2. We need to define membership of a cluster:
a) Exclusive: each object belongs to one and only one cluster.
b) Overlapping: an object can belong simultaneously to more than one cluster.
c) Fuzzy: every object belongs to every cluster with a membership weighting (probability) between zero and one.
Types of Cluster
Distance Function
We need to define some measure of distance between our data points. Example with 3 variables X, Y, Z: data points $D_1 = (X_1, Y_1, Z_1)$ and $D_2 = (X_2, Y_2, Z_2)$. Distance is also known as proximity.
Types of Distance Functions
Euclidean distance: $D_E = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$
Squared Euclidean distance: $D_S = (x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2$
City block distance: $D_C = |x_1 - x_2| + |y_1 - y_2| + |z_1 - z_2|$
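As a rough illustration, all three functions take a few lines of base R; the point coordinates below are made up for the example.

```r
# Illustrative sketch in base R: the three distance functions above,
# for two made-up points with variables x, y, z.
d1 <- c(x = 1, y = 2, z = 3)
d2 <- c(x = 4, y = 0, z = 3)

D_E <- sqrt(sum((d1 - d2)^2))   # Euclidean
D_S <- sum((d1 - d2)^2)         # squared Euclidean
D_C <- sum(abs(d1 - d2))        # city block

# Base R's dist() computes the same measures over a whole data matrix:
m <- rbind(d1, d2)
dist(m, method = "euclidean")
dist(m, method = "manhattan")   # city block
```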
Distance functions for categorical variables
Example data (X = Marital Status, Y = Sex):
d1: Single, Male
d2: Married, Male
d3: Single, Female
d4: Married, Male
d5: Married, Female
One measure of distance is: $d = 1 - \frac{\#\,\text{matched variables}}{\#\,\text{variables}}$
In our example, # of variables = 2, so:
$d(d_1, d_3) = 1 - \frac{1}{2} = \frac{1}{2}$
$d(d_2, d_3) = 1 - \frac{0}{2} = 1$
$d(d_2, d_4) = 1 - \frac{2}{2} = 0$
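A minimal R sketch of this matching distance on the table above; the helper match_dist() is illustrative, not a library function.

```r
# Simple matching distance for categorical variables, as defined above.
people <- data.frame(
  marital   = c("Single", "Married", "Single", "Married", "Married"),
  sex       = c("Male", "Male", "Female", "Male", "Female"),
  row.names = paste0("d", 1:5)
)

# match_dist() is an illustrative helper: 1 minus the fraction of
# variables on which the two observations agree.
match_dist <- function(a, b) 1 - mean(unlist(a) == unlist(b))

match_dist(people["d1", ], people["d3", ])  # 0.5
match_dist(people["d2", ], people["d3", ])  # 1
match_dist(people["d2", ], people["d4", ])  # 0
```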
Clustering Methods
Finally, you need to choose an algorithm for finding the clusters. Here we will look at three:
1) Agglomerative hierarchical
2) K-means
3) Density-based
Hierarchical Agglomerative Clustering
1. Begin with one cluster for each observation.
2. Repeat: merge the two nearest clusters, storing the merged clusters and their distance, until there is only one cluster left.
3. Stop.
Dendrogram Result
[Figure: dendrogram of observations d1–d5, with distance on the vertical axis]
Defining distance
To implement this algorithm, we need to define the distance between two clusters. Three common definitions (an R sketch using them follows the illustrations below):
a) The distance between the nearest points of the two clusters
b) The distance between the furthest points in the two clusters
c) The average distance between all pairs of points in the two clusters
Defining by nearest points
Defining by furthest points
Defining by average distance
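A minimal sketch of the full procedure with base R's hclust(); the data are random and the choice of method and cut point are illustrative. The method argument corresponds to the three definitions above: "single" (nearest points), "complete" (furthest points), and "average".

```r
# Agglomerative hierarchical clustering in base R; data are illustrative.
set.seed(1)
x <- matrix(rnorm(20), ncol = 2)     # 10 observations, 2 variables

d  <- dist(x, method = "euclidean")  # pairwise distances between points
hc <- hclust(d, method = "average")  # "single" = a), "complete" = b),
                                     # "average" = c) above
plot(hc)                             # dendrogram
cutree(hc, k = 3)                    # cut the tree into 3 clusters
```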
Assessment of hierarchical clustering
Advantages:
1) You do not need to define the number of clusters.
Disadvantages:
1) The answer can depend on the definition of the inter-cluster distance.
2) Computationally intensive, so it can only be used for relatively small datasets.
K-means clustering A centroid of a group of points is usually defined as the point whose co-ordinates are the mean of the co-ordinates of the group. Note that the centroid does not, in general, correspond to an actual observation.
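For instance, a toy sketch in base R:

```r
# Centroid of a group of points: the mean of each coordinate.
group <- rbind(c(1, 2), c(3, 4), c(5, 0))
colMeans(group)   # centroid (3, 2), which is not one of the observations
```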
K-means clustering algorithm
1. Select K initial points, where K is the number of clusters required.
2. Repeat: assign each point to its nearest centroid, then re-calculate the centroids, until the centroids no longer change.
3. Stop.
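A minimal sketch with base R's kmeans(); the data and K are illustrative. Its nstart argument re-runs the algorithm from several random initial centroids and keeps the best result, one common guard against the sensitivity to initial conditions discussed below.

```r
# K-means in base R; data and K are illustrative.
set.seed(1)
x <- matrix(rnorm(100), ncol = 2)          # 50 observations, 2 variables

km <- kmeans(x, centers = 3, nstart = 25)  # K = 3; best of 25 random starts
km$cluster   # cluster assignment for each observation
km$centers   # final centroid coordinates
```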
K-means iterations
We need to choose:
1) The number of clusters we require.
2) The positions of the initial centroids.
Different choices will lead to different answers:
a) We need to be especially careful about 2), where a poor choice can lead to bad clustering.
b) There are some techniques for optimizing the choice, but none are perfect.
Bad clustering via K-means
Assessment of K-means
Advantages:
1) Computationally efficient, basically linear in the number of data points.
2) Can be used for many types of data.
Disadvantages:
1) Need to specify the number of clusters in advance.
2) Potentially sensitive to initial conditions.
3) Not good when clusters are of very different sizes or are very non-spherical.
Density-based clustering
Locates regions of high density that are separated by regions of low density. We need to define density. Center-based density: the density of a point is the number of points within a specified radius R, so the density depends on our choice of R.
Classification of points
Choose some minimum number of points, N_min.
Core point: has more than N_min points within a radius R.
Border point: has fewer than N_min points within a radius R, but does have a core point within this radius.
Noise point: a point which is neither a core point nor a border point.
Classification of points
[Figure: example classification with N_min = 7]
Density-based algorithm
First choose R and N_min.
1. Classify all points.
2. Remove noise points.
3. Connect all core points that are within a distance R of each other.
4. Make each group of connected points into a cluster.
5. Assign each border point to one of the clusters of its neighboring core points.
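This procedure is essentially DBSCAN. A minimal sketch, assuming the third-party dbscan package is installed; the data and the R and N_min values are illustrative.

```r
# Density-based clustering via the 'dbscan' package (assumed installed).
library(dbscan)

set.seed(1)
x <- matrix(rnorm(200), ncol = 2)        # 100 observations, 2 variables

db <- dbscan(x, eps = 0.5, minPts = 5)   # eps = R, minPts = N_min
db$cluster                               # cluster labels; 0 marks noise points
```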
Assessment of density-based algorithm
Advantages:
1) Handles noise and outliers well.
2) Can handle clusters of different shapes and sizes.
Disadvantages:
1) Has difficulty with clusters of very different densities.
2) Has trouble with high-dimensional data.
3) We need to choose R and N_min.
Choice of inputs
The results of cluster analysis depend on:
1) The algorithm you choose.
2) The parameters and initial conditions you choose.
User-defined choices
1) Clustering algorithm
2) Distance function between data points
Agglomerative clustering: a) distance between clusters; b) cut-off point on the dendrogram
K-means clustering: a) number of clusters; b) initial positions of centroids
Density-based clustering: a) R; b) N_min
Evaluating your results
Almost any algorithm will find clusters in almost any dataset, whether or not any real structure is present.
Some hints on sanity-checking
1) How tight are the clusters compared with the inter-cluster distance?
2) How well do the clusters match your hypothesis (if you have one)?
3) How sensitive is the answer to different choices of algorithm/parameters/initial conditions/number of clusters/etc.?
There are techniques for checking these.
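One common technique for point 1) is the silhouette width. A minimal sketch, assuming the third-party cluster package is installed; the data and K are illustrative.

```r
# Silhouette check via the 'cluster' package (assumed installed).
library(cluster)

set.seed(1)
x  <- matrix(rnorm(100), ncol = 2)
km <- kmeans(x, centers = 3, nstart = 25)

# Per-point silhouette width: near +1 = tight and well separated,
# near 0 = on a cluster boundary, negative = probably misassigned.
sil <- silhouette(km$cluster, dist(x))
mean(sil[, "sil_width"])   # overall average silhouette width
```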
Take-Home Message
1) Clustering is, to some extent, in the eye of the beholder.
2) Choose your algorithm and parameters carefully in the light of your particular application.
3) Evaluate your results, particularly:
a) Does the clustering make sense?
b) How sensitive is the solution to the input parameters?
Data for software example
Hierarchical R script
Hierarchical results
K-means R script
K-means result
Lipkovich et al. Defining good and poor outcomes in patients with schizophrenia or schizoaffective disorder: a multidimensional data-driven approach. Psychiatry Research, in press.
Further reading
P. Tryfos, Methods for Business Analysis and Forecasting: Text & Cases, Chapter 15, Cluster Analysis (http://www.yorku.ca/ptryfos/f1500.pdf)
Figures from P. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Chapter 8, Cluster Analysis: Basic Concepts and Algorithms (http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)