Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009



2 Overview
What is cluster analysis?
Types of cluster
Distance functions
Clustering methods: agglomerative, K-means, density-based
Interpretation of results

3 Relationship to PCA/Factor Analysis
Data: M variables, N observations per variable.
PCA/Factor Analysis attempts to reduce the number of variables by finding directions of maximum variance.
Cluster analysis attempts to reduce the number of observations by finding groups of observations with minimum within-group variability and maximum between-group variability.

4 What is a cluster?
1. We need to define what we mean by a cluster for our specific application.
2. We need to define membership of a cluster:
   a) Exclusive: each object belongs to one and only one cluster.
   b) Overlapping: an object can belong simultaneously to more than one cluster.
   c) Fuzzy: every object belongs to every cluster with a membership weighting (probability) between zero and one.

6 Types of Cluster

7 Distance Function
We need to define some measure of distance between our data points.
Example: 3 variables: X, Y, Z
Data points: D_1 = (X_1, Y_1, Z_1), D_2 = (X_2, Y_2, Z_2)
Distance is also known as proximity.

8 Types of Distance Functions
For data points D_1 and D_2 with coordinates (x_1, y_1, z_1) and (x_2, y_2, z_2):

Euclidean Distance:
$$D_E = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$

Squared Euclidean Distance:
$$D_S = (x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2$$

City Block:
$$D_C = |x_1 - x_2| + |y_1 - y_2| + |z_1 - z_2|$$
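The three distance functions can be sketched in a few lines of Python. This is only an illustration: the function names and the two sample points are assumptions made for the example and do not come from the seminar.

```python
import math

def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def squared_euclidean(p, q):
    """Summed squared coordinate differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def city_block(p, q):
    """Summed absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical data points (x, y, z), chosen only for illustration.
d1 = (1.0, 2.0, 3.0)
d2 = (4.0, 6.0, 3.0)

print(euclidean(d1, d2))          # 5.0
print(squared_euclidean(d1, d2))  # 25.0
print(city_block(d1, d2))         # 7.0
```

The squared Euclidean distance weights large coordinate differences more heavily than the city block distance does, which is one reason the choice of distance function can change the clusters found.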

9 Distance functions for categorical variables

       Marital Status   Sex
d_1    Single           Male
d_2    Married          Male
d_3    Single           Female
d_4    Married          Male
d_5    Married          Female

X = Marital Status, Y = Sex
One measure of distance is:
$$d = 1 - \frac{\#\,\text{matched variables}}{\#\,\text{variables}}$$

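10 In our example: # of variables = 2
So the distance between two observations is 1 - (# matched variables)/2, which is 0 when both variables match, 1/2 when one matches and 1 when neither matches.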
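As a worked illustration, the matching distance can be computed for every pair of observations in the marital status and sex table. The function name `matching_distance` and the dictionary layout are assumptions made for this sketch.

```python
# Observations from the marital status / sex table.
observations = {
    "d1": ("Single", "Male"),
    "d2": ("Married", "Male"),
    "d3": ("Single", "Female"),
    "d4": ("Married", "Male"),
    "d5": ("Married", "Female"),
}

def matching_distance(a, b):
    """1 - (# matched variables) / (# variables)."""
    matched = sum(1 for u, v in zip(a, b) if u == v)
    return 1 - matched / len(a)

# Print the distance for every pair of observations.
names = sorted(observations)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = matching_distance(observations[first], observations[second])
        print(first, second, d)
```

For example, d1 and d2 share only Sex, so their distance is 0.5; d2 and d4 agree on both variables, so their distance is 0.0; d1 and d5 agree on neither, so their distance is 1.0.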

11 Clustering Methods
Finally, you need to choose an algorithm for finding the clusters. Here we will look at three algorithms:
1) Agglomerative Hierarchical
2) K-means
3) Density-based

12 Hierarchical Agglomerative Clustering
1. Begin with one cluster for each observation.
2. Repeat: merge the two nearest clusters, storing the merged clusters and the distance between them, until there is only one cluster left.
3. Stop.
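The loop above can be sketched in plain Python. This is a minimal illustration, not an efficient implementation: it assumes that the distance between two clusters is the Euclidean distance between their nearest points, and the function names and sample points are hypothetical.

```python
import math
from itertools import combinations

def nearest_point_distance(c1, c2, points):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    # Step 1: one cluster (a tuple of point indices) per observation.
    clusters = [(i,) for i in range(len(points))]
    merges = []
    # Step 2: repeat until only one cluster is left.
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: nearest_point_distance(pair[0], pair[1], points),
        )
        d = nearest_point_distance(a, b, points)
        merges.append((a, b, d))  # store the clusters and their distance
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical 2-D observations, chosen only for illustration.
sample = [(0, 0), (0, 1), (5, 0), (5, 2), (10, 10)]
history = agglomerate(sample)
for a, b, d in history:
    print(a, b, round(d, 2))
```

The stored merge distances are what a dendrogram plots, so the merge history is enough to draw the tree or to cut it at a chosen distance.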

13 Dendrogram Result
[Figure: dendrogram of observations d_1 to d_5. Axes: Observation, Distance.]

14 Defining distance
To implement this algorithm, we need to define the distance between two clusters. 3 common definitions:
a) The distance between the nearest points of the two clusters
b) The distance between the furthest points in the two clusters
c) The average distance between all pairs of points in the two clusters
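These three definitions are commonly called single, complete and average linkage. A small sketch, assuming Euclidean distance between points and using hypothetical function names and clusters:

```python
import math
from itertools import product

def single_linkage(c1, c2):
    """a) Distance between the nearest points of the two clusters."""
    return min(math.dist(p, q) for p, q in product(c1, c2))

def complete_linkage(c1, c2):
    """b) Distance between the furthest points of the two clusters."""
    return max(math.dist(p, q) for p, q in product(c1, c2))

def average_linkage(c1, c2):
    """c) Average distance over all pairs, one point from each cluster."""
    dists = [math.dist(p, q) for p, q in product(c1, c2)]
    return sum(dists) / len(dists)

# Hypothetical clusters lying on a line, chosen so the values are easy to check.
cluster_a = [(0, 0), (1, 0)]
cluster_b = [(3, 0), (5, 0)]

print(single_linkage(cluster_a, cluster_b))    # 2.0
print(complete_linkage(cluster_a, cluster_b))  # 5.0
print(average_linkage(cluster_a, cluster_b))   # 3.5
```

The same pair of clusters is 2.0, 5.0 or 3.5 apart depending on the definition, so the order in which clusters merge can differ between the three.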

15 Defining by nearest points

16 Defining by furthest points

17 Defining by average distance

18 Assessment of hierarchical clustering
Advantages
1) You do not need to define the number of clusters
Disadvantages
1) Answer can depend on the definition of the inter-cluster distance
2) Computationally intensive, so it can only be used for relatively small datasets

19 K-means clustering A centroid of a group of points is usually defined as the point whose co-ordinates are the mean of the co-ordinates of the group. Note that the centroid does not, in general, correspond to an actual observation.
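A short sketch of this definition, with a hypothetical helper name and sample points:

```python
def centroid(points):
    """Point whose co-ordinates are the means of the group's co-ordinates."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Hypothetical group of 2-D points, chosen only for illustration.
group = [(1.0, 1.0), (2.0, 4.0), (6.0, 1.0)]
print(centroid(group))  # (3.0, 2.0)
```

Here the centroid (3.0, 2.0) is not one of the three observations, which illustrates the note above.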

20 K-means clustering algorithm
1. Select K initial points as centroids, where K is the number of clusters required.
2. Repeat: assign each point to its nearest centroid, then re-calculate the centroid of each cluster, until the centroids do not change.
3. Stop.
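The algorithm can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance, initial centroids passed in explicitly by the caller, an iteration cap `max_iter` as a safety net, and an empty cluster simply keeping its old centroid. None of these details come from the seminar.

```python
import math

def centroid(points):
    """Mean of the co-ordinates of a group of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, initial_centroids, max_iter=100):
    """Return (centroids, clusters) once the centroids stop changing."""
    centroids = list(initial_centroids)  # step 1: K initial points
    clusters = []
    for _ in range(max_iter):            # step 2: repeat
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Re-calculate the centroids (assumption: empty clusters keep theirs).
        new_centroids = [centroid(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # centroids did not change: stop
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two well-separated squares of four points each.
data = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0),
        (10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)]
final_centroids, final_clusters = k_means(data, [(0.0, 0.0), (12.0, 12.0)])
print(final_centroids)  # [(1.0, 1.0), (11.0, 11.0)]
```

Passing the initial centroids explicitly makes it easy to re-run the function with different starting positions and compare the results.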

21 K-means iterations

22 We need to choose:
1) The number of clusters we require.
2) The positions of the initial centroids.
Different choices will lead to different answers:
a) We need to be especially careful about 2), where a poor choice can lead to bad clustering.
b) There are some techniques for optimizing the choice, but none are perfect.

23 Bad clustering via K-means

24 Assessment of K-means
Advantages
1) Computationally efficient, basically linear in the number of data points.
2) Can be used for many types of data.
Disadvantages
1) Need to specify the number of clusters in advance.
2) Potentially sensitive to initial conditions.
3) Not good when clusters are of very different sizes or very non-spherical.

25 Density-based clustering
Locates regions of high density that are separated by regions of low density.
We need to define density.
Center-based density: the density of a point is the number of points within a specified radius R of that point. The density then depends on our choice of R.

26 Classification of points
Choose some minimum number of points, N_min.
Core point: has more than N_min points within a radius R.
Border point: is not a core point, but has a core point within a radius R.
Noise point: a point that is neither a core point nor a border point.
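The classification can be sketched as below. The sketch assumes Euclidean distance and assumes that a point counts itself among the points within radius R; the function name and sample data are hypothetical.

```python
import math

def classify_points(points, radius, n_min):
    """Label each point 'core', 'border' or 'noise'."""
    def neighbours(i):
        # Assumption: the point itself is included in its own neighbourhood.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= radius]

    core = {i for i in range(len(points)) if len(neighbours(i)) > n_min}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbours(i)):
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical points: a tight square, one point on its edge, one far away.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (8, 8)]
print(classify_points(pts, radius=1.5, n_min=3))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

Changing `radius` or `n_min` changes the labels, which is why both must be chosen with the application in mind.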

27 Classification of points
[Figure: core, border and noise points for N_min = 7]

28 Density-based algorithm
First choose R and N_min.
1. Classify all points.
2. Remove noise points.
3. Connect all core points that are within a distance R of each other.
4. Make each group of connected points into a cluster.
5. Assign each border point to one of the clusters of its neighboring core points.
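The five steps can be sketched in plain Python. This is an illustration under assumptions, not a reference implementation: it uses Euclidean distance, counts a point as its own neighbour, marks noise with `None` rather than deleting it, and assigns each border point to the cluster of the first neighbouring core point it finds. The function name and data are hypothetical.

```python
import math

def density_cluster(points, radius, n_min):
    """Return one cluster label per point; None marks a noise point."""
    n = len(points)
    near = [[j for j in range(n) if math.dist(points[i], points[j]) <= radius]
            for i in range(n)]
    # Step 1: classify. Core points have more than n_min points within radius.
    core = {i for i in range(n) if len(near[i]) > n_min}
    # Step 2: noise points are never labelled, so they stay None.
    labels = [None] * n
    cluster_id = 0
    # Steps 3 and 4: each connected group of core points becomes a cluster.
    for start in sorted(core):
        if labels[start] is not None:
            continue
        labels[start] = cluster_id
        stack = [start]
        while stack:
            i = stack.pop()
            for j in near[i]:
                if j in core and labels[j] is None:
                    labels[j] = cluster_id
                    stack.append(j)
        cluster_id += 1
    # Step 5: a border point joins the cluster of a neighbouring core point.
    for i in range(n):
        if labels[i] is None:
            for j in near[i]:
                if j in core:
                    labels[i] = labels[j]
                    break
    return labels

# Hypothetical data: two dense groups, one border point and one noise point.
cloud = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1),
         (10, 10), (10, 11), (11, 10), (11, 11), (5, 20)]
print(density_cluster(cloud, radius=1.5, n_min=3))
# [0, 0, 0, 0, 0, 1, 1, 1, 1, None]
```

Unlike K-means, the number of clusters is not specified in advance; it falls out of the choice of `radius` and `n_min`.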

29 Assessment of density-based algorithm
Advantages
1) Handles noise and outliers well.
2) Can handle clusters of different shapes and sizes.
Disadvantages
1) Has difficulty with clusters of very different densities.
2) Has trouble with high-dimensional data.
3) We need to choose R and N_min.

30 Choice of inputs
The results of cluster analysis depend on:
1) The algorithm you choose.
2) The parameters and initial conditions you choose.

31 User-defined choices
1) Clustering algorithm
2) Distance function between data points
Agglomerative clustering
a) Distance between clusters
b) Cut-off point on dendrogram
K-means clustering
a) Number of clusters
b) Initial positions of centroids
Density-based clustering
a) R
b) N_min

32 Evaluating your results
Almost any clustering algorithm will find clusters in any dataset, even one with no real group structure.

34 Some hints on sanity-checking
1) How tight are the clusters compared with the inter-cluster distance?
2) How well do the clusters match your hypothesis (if you have one)?
3) How sensitive is the answer to different choices of algorithms/parameters/initial conditions/number of clusters/etc.?
There are techniques for checking these.
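One simple way to look at hint 1) is to compare the mean distance between points in the same cluster with the mean distance between points in different clusters. The sketch below is only one of many possible checks; the function name, the use of Euclidean distance and the sample data are assumptions, and it expects at least one pair of points within a cluster and one pair across clusters.

```python
import math
from itertools import combinations

def tightness_report(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (within if lp == lq else between).append(math.dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical data: two pairs of nearby points, labelled two different ways.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_within, good_between = tightness_report(pts, [0, 0, 1, 1])
bad_within, bad_between = tightness_report(pts, [0, 1, 0, 1])

print(good_within, good_between)  # within is much smaller than between
print(bad_within, bad_between)    # within is larger than between
```

Running the same comparison after changing the algorithm, its parameters or the initial conditions gives a rough feel for hint 3) as well.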

35 Take-Home Message
1) Clustering is, to some extent, in the eye of the beholder.
2) Choose your algorithm/parameters carefully in the light of your particular application.
3) Evaluate your results, particularly:
   a) Does the clustering make sense?
   b) How sensitive is the solution to the input parameters?

36 Data for software example

37 Hierarchical R script

38 Hierarchical results

40 K-means R script

41 K-means result

45 Lipkovich, et al. Defining good and poor outcomes in patients with schizophrenia or schizoaffective disorder: A multidimensional data-driven approach. Psychiatry Research. In press.

46 Further reading
P. Tryfos, Methods for Business Analysis and Forecasting: Text & Cases, Chapter 15, Cluster Analysis.
Figures from P. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Chapter 8, Cluster Analysis: Basic Concepts and Algorithms.
