
CLUSTER ANALYSIS

Introduction

Cluster analysis is a technique for grouping individuals or objects hierarchically into unknown groups suggested by the data.

Cluster analysis can be considered an alternative to Factor Analysis.

Cluster analysis differs from discriminant analysis.
o In cluster analysis the group membership is unknown prior to the analysis.

In the biological sciences, an area where cluster analysis has been widely used is taxonomy.
o In taxonomy individuals are classified into arbitrary groups based on measurements of the individuals.
o The classification moves from the most general to the most specific:
  Kingdom, Phylum, Subphylum, Class, Order, Family, Genus, Species

In economics, cluster analysis can be used for data mining.
o For example, in a market survey you could classify patrons into groups based on their answers to many questions.

Warnings for cluster analysis.
o Groupings from cluster analysis can differ depending on the method of analysis used.
o Since the groups are not known a priori, it can be difficult to determine if the results make sense in the context of the research being conducted.

o Knowledge of the population you are sampling and common sense are two important tools when it comes to interpreting results from cluster analysis.

Basic Concepts of Cluster Analysis

Cluster analysis can be divided into two basic steps:
1. Initial analysis of data.
2. Analytical clustering using one of many methods of amalgamation.

Initial analysis
o It is always a good idea before any statistical analysis to plot a scatter diagram of your data to see if there are any irregularities that need to be addressed using a transformation.
o A common transformation in multivariate analyses is to standardize your data so that each variable has a mean of 0 and a variance of 1.0:

  Standardized $Y_i = \dfrac{Y_i - \bar{Y}}{S_Y}$

  where $\bar{Y}$ is the sample mean and $S_Y$ is the sample standard deviation of the variable.
o If in visualizing your data you seem to see clusters that are elliptical in shape, you want to use a transformation method that will make the resultant pooled within-cluster covariance matrix spherical.
  v The ACECLUS (Approximate Covariance Estimation for Clustering) procedure in SAS, PROC ACECLUS, will perform the transformation.
  v Neither cluster membership nor the number of clusters needs to be known.

Analytical clustering

Distance Measures
o Distance measures can be studied in large data sets to determine similarities or clusters.
o The opposite of similarity is distance.
o Distance values can be calculated for each pair of observations.
o Statistical methods to calculate distance are very sensitive to outliers, so you are encouraged to run diagnostics on your data to identify outliers and remove them if necessary.

o The most commonly used distance measure is the Euclidean distance:

  $\text{Distance}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

o Different methods to determine distance will provide different results.

Cluster Analysis Process
o At the start of the cluster analysis, each individual is in its own cluster.
o In subsequent rounds of analysis, the closest clusters are joined, so the entries are placed into fewer and fewer clusters.
o At the end of the cluster analysis, all individuals are in the same cluster.
o During the various rounds of cluster analysis, the distances between new clusters must be determined, and we need to be able to determine when two clusters are sufficiently close to be linked together.
o Two of the most common methods of cluster analysis are:
  Unweighted Pair-Group Method with Arithmetic Mean (UPGMA): the distance between any two clusters is the average distance between all pairs of individuals in the two different clusters.
  Ward's Method: a minimum variance method that uses an ANOVA approach. The method tries to minimize the sum of squares of any two clusters that are formed at each step of the cluster analysis.

Estimating the Number of Clusters
o Three methods that can be used to estimate the number of clusters are:
  1. Cubic clustering criterion (CCC): the estimated number of clusters occurs at the start of a peak on the graph. There may be more than one peak per plot.
  2. Pseudo F: the estimated number of clusters occurs at the start of a peak on the graph. There may be more than one peak per plot.
  3. Pseudo t²: the graph is read right to left. The estimated number of clusters occurs at the start of a peak. There may be more than one peak per plot.

Precautions When Using Cluster Analysis

Unless there is considerable separation between inherent groups when you view the scatter plots, it is not realistic to expect Cluster Analysis to provide clear results.

Cluster Analysis is very sensitive to outliers.

The different Cluster Analysis methods may give you very different results.

If you have large amounts of data, one simple method of validating the results from Cluster Analysis is to conduct the analysis separately on two halves of your data. It is preferable to assign individuals to the two halves at random.

Example of Cluster Analysis

In this example, I am using data from the PhD dissertation of one of my students, Sintayehu Daba. Sintayehu is evaluating barley lines from three sources: Ethiopia and Kenya, ICARDA, and North Dakota, USA. Sintayehu collected data on many different plant characters, agronomic traits, and disease resistance. In the analysis, I am trying to determine if cluster analysis will successfully separate the lines into distinct clusters based on the data collected.

SAS Commands

    options pageno=1;

    data all;
      input Entry Source Color Hull_cover Row Orrow DH DM NB SC NKS NSP NTP
            PLH SL TKW HLW GYH PC Plump LOD;
      datalines;
    [data lines not reproduced]
    ;;

    /* Keep only the entries with row=2 */
    data two;
      set all;
      if row=2;
    run;

    ods graphics on;
    ods rtf file='cluster.rtf';

    /* UPGMA (average linkage); print the last 15 generations of the cluster history */
    proc cluster data=two method=ave print=15 ccc pseudo;
      var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
      copy orrow;
      title 'Cluster Analysis Using the UPGMA Method';
    run;

    /* Cut the tree at 3 clusters and save the cluster assignments */
    proc tree noprint ncl=3 out=out;
      copy row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD orrow;
    run;

    /* Cross-tabulate cluster membership against Orrow */
    proc freq data=out;
      tables cluster*orrow / nopercent norow nocol plot=none;
    run;

    /* Canonical discriminant analysis of the clusters, used only for plotting */
    proc candisc data=out noprint out=can;
      class cluster;
      var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
    run;

    proc sgplot data=can;
      scatter y=can2 x=can1 / group=cluster;
    run;

    /* Same sequence using Ward's minimum variance method */
    proc cluster data=two method=ward print=15 ccc pseudo;
      var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
      copy orrow;
      title 'Cluster analysis Using Wards Method';
    run;

    proc tree noprint ncl=3 out=out;
      copy row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD orrow;
    run;

    proc freq data=out;
      tables cluster*orrow / nopercent norow nocol plot=none;
    run;

    proc candisc data=out noprint out=can;
      class cluster;
      var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
    run;

    proc sgplot data=can;
      scatter y=can2 x=can1 / group=cluster;
    run;

    ods rtf close;
    ods graphics off;
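The standardization, the Euclidean distance, and the two linkage rules described earlier can be made concrete with a small sketch. It is written in Python rather than SAS purely for illustration and uses only the standard library. The function names and the toy data are assumptions and are not part of the barley analysis. Ward's criterion is shown in its common form, the increase in within-cluster sum of squares caused by a merge.

```python
import math
from statistics import mean, stdev

def standardize(column):
    """Return (Y_i - mean) / S for one variable, with S the sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [(y - m) / s for y in column]

def euclidean(x, y):
    """Euclidean distance between two observations of equal length."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def upgma_distance(cluster_a, cluster_b):
    """UPGMA: average of all pairwise distances between members of two clusters."""
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(euclidean(a, b) for a, b in pairs) / len(pairs)

def within_ss(cluster):
    """Sum of squared distances of the members from the cluster centroid."""
    centroid = [mean(col) for col in zip(*cluster)]
    return sum(euclidean(obs, centroid) ** 2 for obs in cluster)

def ward_cost(cluster_a, cluster_b):
    """Ward: increase in within-cluster sum of squares if the two clusters are merged."""
    merged = list(cluster_a) + list(cluster_b)
    return within_ss(merged) - within_ss(cluster_a) - within_ss(cluster_b)

# Hypothetical toy data: two variables measured on four individuals.
height = [1.0, 2.0, 9.0, 10.0]
weight = [10.0, 12.0, 30.0, 32.0]

# Standardize each variable, then rebuild one tuple per individual.
observations = list(zip(standardize(height), standardize(weight)))
cluster_a, cluster_b = observations[:2], observations[2:]

print("UPGMA distance:", upgma_distance(cluster_a, cluster_b))
print("Ward merge cost:", ward_cost(cluster_a, cluster_b))
```

At each step of an agglomerative analysis, the pair of clusters with the smallest value of the chosen criterion would be joined, which is why the two methods can produce different groupings from the same data.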

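The split-half validation suggested under the precautions starts by assigning individuals to two halves at random, after which each half would be clustered separately and the groupings compared. A minimal sketch of the random assignment only, in Python for illustration; the function name, the seed, and the entry numbers are hypothetical.

```python
import random

def split_halves(entries, seed=None):
    """Randomly split the entries into two halves of (nearly) equal size."""
    rng = random.Random(seed)      # a fixed seed makes the split reproducible
    shuffled = list(entries)       # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

# Hypothetical entry numbers for 20 individuals.
entries = list(range(1, 21))
half_a, half_b = split_halves(entries, seed=1)

print("Half A:", sorted(half_a))
print("Half B:", sorted(half_b))
```

If the two halves lead to similar cluster structures, that is some evidence that the clusters reflect the data rather than the particular sample.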
Cluster Analysis Using the UPGMA Method

The CLUSTER Procedure
Average Linkage Cluster Analysis

[Table: Eigenvalues of the Covariance Matrix. Columns: Eigenvalue, Difference, Proportion, Cumulative. Values not shown.]

[Root-Mean-Square Total-Sample Standard Deviation and Root-Mean-Square Distance Between Observations: values not shown.]

[Table: Cluster History, rows for 15 clusters down to 8 clusters. Columns: Number of Clusters, Clusters Joined, Freq, Semipartial R-Square, R-Square, Approximate Expected R-Square, Cubic Clustering Criterion, Pseudo F Statistic, Pseudo t-squared, Norm RMS Distance, Tie. Values not shown.]

[Table: Cluster History, continued, rows for 7 clusters down to 1 cluster. Values not shown.]

The semipartial R² measures the loss of homogeneity when two clusters are merged. This value reflects the decreasing homogeneity of members in a cluster as clusters are combined to make new clusters.

R² reflects the differences between clusters, so you want this value to be high.
o At the start of the clustering process all entries are their own cluster; thus, the R² is 1.
o As more clusters are combined, the R² value should decrease.
o At the end of the analysis, when all observations are in the same cluster, the R² value should theoretically be 0.

The approximate expected R² value is part of the output presented when the CCC value is requested. It reflects the value expected under a uniform null hypothesis.

Ties
o At each level of the clustering process, PROC CLUSTER identifies the pair of clusters with the minimum distance between them. Sometimes there can be two or more pairs of clusters with the same minimum distance. This often occurs with discrete data. In such cases the tie must be broken in some arbitrary way.
o If there are ties, then the results of the cluster analysis depend on the order of the observations in the data set.
o A tie means that at a particular step in the cluster analysis, two pairs of clusters had the same minimum distance and, possibly, that some of the clusters in later steps are not uniquely determined.
o Ties that occur early in the cluster analysis usually have little effect on the later stages.
o Ties that occur in the middle parts of the cluster analysis should be investigated.
o Ties that occur late in the cluster analysis are a sign that a solid or concrete solution may not be possible.
o There are routines you can run to determine if ties are affecting the outcome of your analyses.
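The behavior of R² described earlier in this section (1 when every entry is its own cluster, 0 when all entries are in one cluster) can be checked by hand on a small case. The sketch below is in Python for illustration and assumes the usual definitions rather than reading them from the SAS output: R² is 1 minus the pooled within-cluster sum of squares divided by the total sum of squares, and the semipartial R² of a merge is the resulting drop in R². The function names and the data are hypothetical.

```python
def sum_of_squares(points):
    """Sum of squared deviations of the points from their centroid."""
    n = len(points)
    centroid = [sum(col) / n for col in zip(*points)]
    return sum((v - c) ** 2 for p in points for v, c in zip(p, centroid))

def r_square(clusters):
    """R-square of a partition: 1 - (pooled within-cluster SS / total SS)."""
    everyone = [p for cluster in clusters for p in cluster]
    within = sum(sum_of_squares(cluster) for cluster in clusters)
    return 1 - within / sum_of_squares(everyone)

# Hypothetical data: one variable measured on four individuals.
singletons = [[(1.0,)], [(2.0,)], [(9.0,)], [(10.0,)]]
two_clusters = [[(1.0,), (2.0,)], [(9.0,), (10.0,)]]
one_cluster = [[(1.0,), (2.0,), (9.0,), (10.0,)]]

for label, partition in [("4 clusters", singletons),
                         ("2 clusters", two_clusters),
                         ("1 cluster", one_cluster)]:
    print(label, "R-square =", round(r_square(partition), 4))

# Semipartial R-square of the final merge: the drop in R-square it causes.
semipartial_last_merge = r_square(two_clusters) - r_square(one_cluster)
print("Semipartial R-square of last merge =", round(semipartial_last_merge, 4))
```

A large semipartial R² at a given step indicates that the merge joined two clusters that were quite different from each other.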

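One simple way to see whether a tie is present at a given step is to list every pair of clusters whose distance equals the current minimum. This is only an illustrative check in Python, not one of the SAS routines referred to above; the function name, the tolerance, and the distance values are hypothetical.

```python
import math

def tied_minimum_pairs(distances, tol=1e-9):
    """Return every pair of clusters whose distance equals the minimum (within tol).

    `distances` maps a pair of cluster labels to the distance between them.
    More than one pair in the result means the next merge is a tie.
    """
    smallest = min(distances.values())
    return sorted(pair for pair, d in distances.items()
                  if math.isclose(d, smallest, abs_tol=tol))

# Hypothetical distances between three clusters at one step of the analysis.
distances = {
    ("CL1", "CL2"): 2.0,
    ("CL1", "CL3"): 3.5,
    ("CL2", "CL3"): 2.0,
}

ties = tied_minimum_pairs(distances)
if len(ties) > 1:
    print("Tie between pairs:", ties)
else:
    print("Unique closest pair:", ties[0])
```

When a tie is reported, rerunning the analysis with the observations in a different order is a direct way to see whether the final clusters change.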

[Table of CLUSTER by Orrow (Using Non-standardized Data), UPGMA method. Frequency counts not shown. Frequency Missing = 1]

[Table of CLUSTER by Orrow (Using Standardized Data), UPGMA method. Frequency counts not shown. Frequency Missing = 1]

[Figure: UPGMA method, non-standardized data]

[Figure: UPGMA method, standardized data]

Cluster analysis Using Wards Method

The CLUSTER Procedure
Ward's Minimum Variance Cluster Analysis

[Table: Eigenvalues of the Covariance Matrix. Columns: Eigenvalue, Difference, Proportion, Cumulative. Values not shown.]

[Root-Mean-Square Total-Sample Standard Deviation and Root-Mean-Square Distance Between Observations: values not shown.]

[Table: Cluster History, rows for 15 clusters down to 1 cluster. Columns: Number of Clusters, Clusters Joined, Freq, Semipartial R-Square, R-Square, Approximate Expected R-Square, Cubic Clustering Criterion, Pseudo F Statistic, Pseudo t-squared, Tie. Values not shown.]

The FREQ Procedure

[Table of CLUSTER by Orrow (Non-standardized Data), Ward's method. Frequency counts not shown. Frequency Missing = 1]

[Table of CLUSTER by Orrow (Using Standardized Data), Ward's method. Frequency counts not shown. Frequency Missing = 1]

[Figure: Ward's method, non-standardized data]

[Figure: Ward's method, standardized data]