CLUSTER ANALYSIS

Introduction

Cluster analysis is a technique for grouping individuals or objects hierarchically into unknown groups suggested by the data.
Cluster analysis can be considered an alternative to Factor Analysis.
Cluster analysis differs from discriminant analysis.
  o In cluster analysis the group membership is unknown prior to the analysis.
In the biological sciences, an area where cluster analysis has been widely used is taxonomy.
  o In taxonomy individuals are classified into arbitrary groups based on measurements of the individuals.
  o The classification moves from the most general to the most specific:
    Kingdom, Phylum, Subphylum, Class, Order, Family, Genus, Species
In economics, cluster analysis can be used for data mining.
  o For example, in a market survey you could classify patrons into groups based on their answers to many questions.
Warnings for cluster analysis.
  o Groupings from cluster analysis can differ depending on the method of analysis used.
  o Since the groups are not known a priori, it can be difficult to determine if the results make sense in the context of the research being conducted.

  o Knowledge of the population you are sampling and common sense are two important tools when it comes to interpreting results from cluster analysis.

Basic Concepts of Cluster Analysis

Cluster analysis can be divided into two basic steps:
  1. Initial analysis of data.
  2. Analytical clustering using one of many methods of amalgamation.

Initial analysis
  o Before any statistical analysis, it is always a good idea to plot a scatter diagram of your data to see whether there are any irregularities that need to be addressed using a transformation.
  o A common transformation in multivariate analyses is to standardize your data so that each variable has a mean of 0 and a variance of 1.0:

    $$\text{Standardized } Y_i = \frac{Y_i - \bar{Y}}{S_Y}$$

    where $\bar{Y}$ is the sample mean and $S_Y$ is the sample standard deviation of the variable.
  o If, in visualizing your data, you seem to see clusters that are elliptical in shape, you want to use a transformation that makes the resulting pooled within-cluster covariance matrix spherical.

Analytical clustering
    - The ACECLUS procedure (PROC ACECLUS, Approximate Covariance Estimation for Clustering) in SAS will perform the transformation.
    - Neither cluster membership nor the number of clusters needs to be known.

Distance Measures
  o Distance measures can be studied in large data sets to determine similarities or clusters.
  o The opposite of similarity is distance.
  o Distance values can be calculated for each pair of observations.
  o Statistical methods to calculate distance are very sensitive to outliers, so you are encouraged to run diagnostics on your data to identify outliers and remove them if necessary.

  o The most commonly used distance measurement is the Euclidean distance:

    $$\text{Distance}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$$

  o Different methods to determine distance will provide different results.

Cluster Analysis Process
  o At the start of the cluster analysis, each individual is in its own cluster.
  o In subsequent rounds of analysis, the closest clusters are joined, so the entries are placed into fewer and fewer clusters.
  o At the end of the cluster analysis, all individuals are in a single cluster.
  o During the various rounds of cluster analysis, the distances between new clusters must be determined, and we need to be able to determine when two clusters are sufficiently close to be linked together.
  o Two of the most common methods of cluster analysis are:
    - Unweighted Pair Group Method with Arithmetic Mean (UPGMA): the distance between any two clusters is the average distance between all pairs of individuals, one from each cluster.
    - Ward's Method: a minimum variance method that uses an ANOVA approach. At each step, the method joins the two clusters whose merger gives the smallest increase in the within-cluster sum of squares.

Estimating the Number of Clusters
  o Three methods that can be used to estimate the number of clusters are:
    1. Cubic clustering criterion (CCC): the estimated number of clusters occurs at the start of a peak on the graph. There may be more than one peak per plot.
    2. Pseudo F: the estimated number of clusters occurs at the start of a peak on the graph. There may be more than one peak per plot.
    3. Pseudo t²: the graph is read right to left. The estimated number of clusters occurs at the start of a peak. There may be more than one peak per plot.

Precautions When Using Cluster Analysis

Unless there is considerable separation between inherent groups when you view the scatter plots, it is not realistic to expect cluster analysis to provide clear results.

Cluster analysis is very sensitive to outliers.
Different cluster analysis methods may give you very different results.
If you have large amounts of data, one simple way to validate the results is to conduct the analysis separately on two halves of your data. It is preferable to assign the individuals to the two halves at random.

Example of Cluster Analysis

In this example, I am using data from the PhD dissertation of one of my students (Sintayehu Daba). Sintayehu is evaluating barley lines from three regions: Ethiopia and Kenya, ICARDA, and North Dakota, USA. Sintayehu collected data on many different plant characters, agronomic traits, and disease resistance. In the analysis, I am trying to determine whether cluster analysis will successfully separate the lines into distinct clusters based on the data collected.

SAS Commands

options pageno=1;

/* Read the barley data: one record (21 values) per entry; a period is a missing value. */
data all;
  input Entry Source Color Hull_cover Row Orrow DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
  datalines;
11 1 1 1 2 12 81.6 131.5 2.8 4.1 48.6 7.5 7.1 109.9 7.1 42.8 59.3 4.6 14.1 71.5 8
12 1 1 1 2 12 68.4 115.9 1.2 8.1 16.4 3.4 4.6 103.9 5.5 33.3 54.8 2.6 10.7 35.7 20
13 1 2 1 2 12 86.8 136.2 3.2 4.7 25.7 7.3 7.4 126.3 7.5 51.9 59.3 3.7 12.4 88.1 13
30 1 1 1 2 12 80.8 132.4 3.4 3.2 27.9 8.1 8.1 105.6 8.0 51.6 61.8 4.4 12.1 93.0 10
39 1 1 1 2 12 80.0 123.4 1.3 8.7 26.2 6.8 6.8 108.7 6.2 53.8 61.1 4.8 11.9 90.9 25
56 1 1 1 2 12 81.0 125.4 4.3 4.7 39.2 9.8 8.8 103.7 5.2 56.4 61.0 4.6 12.3 94.1 20
72 1 1 1 2 12 85.8 134.0 1.0 5.8 33.0 9.1 10.1 96.5 6.4 37.5 60.3 3.1 10.8 70.0 10
231 1 1 1 2 12 84.9 133.8 3.1 3.4 48.2 8.4 8.7 107.1 7.2 45.0 57.4 4.9 11.8 80.6 3
232 1 1 1 2 12 82.1 133.9 2.7 4.7 43.6 4.1 6.7 108.9 7.6 63.5 61.4 4.9 14.2 92.8 5
233 1 1 1 2 12 84.1 132.9 6.7 2.7 40.6 6.1 7.7 117.9 7.6 57.3 60.9 5.2 14.0 86.2 10
241 1 1 1 2 12 84.1 131.9 6.7 2.7 36.6 8.1 8.7 130.9 6.6 59.7 63.5 5.6 13.4 86.6 10
1 1 2 1 6 16 87.9 133.4 4.5 2.0 48.8 7.5 7.7 119.2 5.6 36.4 59.0 5.0 . 64.9 0
2 1 1 1 6 16 83.4 129.3 1.3 6.1 49.0 8.0 7.9 116.2 7.4 36.5 57.1 3.9 11.0 50.6 18
3 1 2 1 6 16 80.8 123.4 3.5 3.9 44.4 3.3 3.4 105.1 7.0 39.0 59.0 3.4 9.7 74.9 15
4 1 1 1 6 16 85.7 132.9 2.9 2.8 49.1 6.1 6.1 105.6 5.2 40.2 59.9 3.6 10.7 72.2 8
5 1 1 1 6 16 83.2 128.7 2.4 7.3 47.0 5.7 5.4 104.7 5.2 35.0 54.8 3.3 11.5 65.5 15
6 1 1 1 6 16 85.4 132.4 4.7 2.2 34.0 10.0 11.0 130.2 8.0 45.8 57.0 5.5 11.7 59.6 5
7 1 1 1 6 16 83.0 126.4 2.3 3.7 44.2 6.8 5.8 107.7 9.2 44.6 60.2 4.8 13.1 84.2 5
....  89.5 7.6 36.1 59.0 2.5 9.6 64.5 0
82 5 1 1 2 52 79.9 131.3 1.1 7.9 23.9 11.6 12.1 80.6 6.9 35.2 60.3 2.1 10.3 66.5 0
227 5 1 1 2 52 86.2 138.9 1.1 8.1 27.6 7.2 7.7 78.3 6.8 35.4 58.9 2.1 9.3 50.8 1
228 5 1 1 2 52 85.1 145.5 1.3 6.9 26.9 11.1 11.1 89.4 7.2 48.6 64.2 2.3 9.5 95.2 0
263 5 1 1 2 52 80.2 132.3 0.9 7.7 27.1 8.8 8.8 87.9 6.9 32.9 59.0 2.0 9.5 58.8 0
210 5 1 1 6 56 84.9 135.4 0.9 7.4 53.5 4.7 4.5 79.4 7.4 36.3 57.4 2.6 12.3 93.7 0
211 5 1 1 6 56 83.7 137.5 1.0 7.7 45.3 8.2 8.3 84.7 6.0 29.9 60.8 1.6 12.1 58.0 0
212 5 1 1 6 56 91.2 137.4 1.3 7.2 47.8 5.3 5.3 81.4 6.7 29.6 58.6 1.4 11.1 48.2 0
213 5 1 1 6 56 86.4 140.4 0.8 7.6 49.5 7.3 6.9 86.9 6.8 30.5 58.6 2.5 10.6 65.9 0
214 5 1 1 6 56 85.1 139.3 1.1 7.7 53.5 6.3 6.6 87.8 7.5 34.4 59.3 2.4 10.6 71.4 0
215 5 1 1 6 56 88.4 141.3 1.0 6.6 52.9 7.3 6.6 90.4 7.2 30.8 60.4 2.9 10.4 64.2 0
216 5 1 1 6 56 80.4 133.7 1.0 7.4 42.3 6.1 6.3 84.7 6.6 31.7 58.6 2.0 11.8 76.8 0
217 5 1 1 6 56 83.6 137.2 1.3 7.7 50.5 5.8 5.9 85.7 7.0 30.2 58.2 1.8 10.7 60.5 0
218 5 1 1 6 56 86.0 140.6 1.2 7.4 48.9 7.2 7.4 85.8 7.0 33.1 57.7 1.9 12.1 90.4 0
219 5 1 1 6 56 84.0 142.0 1.0 7.6 47.6 7.9 8.3 84.7 7.3 31.4 59.6 1.4 13.3 88.9 0
220 5 1 1 6 56 82.9 140.0 1.0 6.8 56.6 6.1 6.4 104.9 7.5 31.9 62.0 2.4 10.5 62.0 0
221 5 1 1 6 56 82.7 136.1 1.0 7.2 53.4 9.1 9.7 101.9 7.2 32.6 58.6 2.2 11.7 55.8 0
229 5 1 1 6 56 85.0 140.8 0.6 7.6 52.4 6.4 6.5 103.5 7.5 33.3 60.5 2.1 11.9 64.0 0
237 5 1 1 6 56 83.5 133.7 1.0 7.6 45.8 6.3 6.4 87.6 7.0 33.5 59.2 2.5 11.1 68.0 0
;

/* Keep only the entries with Row = 2. */
data two;
  set all;
  if row=2;
run;

ods graphics on;
ods rtf file='cluster.rtf';

/* UPGMA (average linkage); CCC and PSEUDO request the criteria for the number of clusters. */
proc cluster data=two method=ave print=15 ccc pseudo;
  var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
  copy orrow;
  title 'Cluster Analysis Using the UPGMA Method';
run;

/* Cut the tree at three clusters and save the cluster membership. */
proc tree noprint ncl=3 out=out;
  copy row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD orrow;
run;

/* Cross-tabulate cluster membership against Orrow. */
proc freq;
  tables cluster*orrow / nopercent norow nocol plot=none;
run;

/* Canonical discriminant analysis, used only to plot the clusters in two dimensions. */
proc candisc noprint out=can;
  class cluster;
  var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
run;

proc sgplot data=can;
  scatter y=can2 x=can1 / group=cluster;
run;

/* Ward's minimum variance method. Row is listed twice in this VAR statement, as in the
   original program, which is why the Ward output reports 19 eigenvalues instead of 18. */
proc cluster data=two method=ward print=15 ccc pseudo;
  var row Color Hull_cover Row DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
  copy orrow;
  title 'Cluster analysis Using Wards Method';
run;

proc tree noprint ncl=3 out=out;
  copy row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD orrow;
run;

proc freq;
  tables cluster*orrow / nopercent norow nocol plot=none;
run;

proc candisc noprint out=can;
  class cluster;
  var row Color Hull_cover DH DM NB SC NKS NSP NTP PLH SL TKW HLW GYH PC Plump LOD;
run;

proc sgplot data=can;
  scatter y=can2 x=can1 / group=cluster;
run;

ods rtf close;
ods graphics off;
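To make the two methods requested in the program (method=ave and method=ward) more concrete, here is a minimal sketch in Python of the quantity each method looks at when deciding which clusters to join. It follows the definitions given in the notes (average pairwise distance for UPGMA, increase in the within-cluster sum of squares for Ward's method) and is not meant to reproduce the internal computations of PROC CLUSTER. The function names and the toy clusters are hypothetical and are not taken from the barley data.

```python
import math
from itertools import product


def euclidean(x, y):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(cluster_a, cluster_b):
    """UPGMA: mean of all pairwise distances, one point taken from each cluster."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(euclidean(p, q) for p, q in pairs) / len(pairs)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def ward_cost(cluster_a, cluster_b):
    """Ward: increase in within-cluster sum of squares if the two clusters merge."""
    na, nb = len(cluster_a), len(cluster_b)
    d = euclidean(centroid(cluster_a), centroid(cluster_b))
    return (na * nb) / (na + nb) * d ** 2


# Hypothetical toy clusters in two dimensions (illustration only).
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]

upgma = average_linkage(a, b)  # average of the four between-cluster distances
ward = ward_cost(a, b)         # sum of squares added by merging a and b

print(f"UPGMA distance: {upgma:.4f}")
print(f"Ward merge cost: {ward:.4f}")
```

At each step, UPGMA would join the pair of clusters with the smallest average distance, while Ward's method would join the pair with the smallest merge cost, which is one reason the two methods can produce different groupings from the same data.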

Cluster Analysis Using the UPGMA Method
The CLUSTER Procedure
Average Linkage Cluster Analysis

Eigenvalues of the Covariance Matrix

      Eigenvalue    Difference    Proportion    Cumulative
 1    295.143472    165.626898      0.5182        0.5182
 2    129.516574     62.636260      0.2274        0.7456
 3     66.880314     39.119795      0.1174        0.8631
 4     27.760519      8.322771      0.0487        0.9118
 5     19.437749      4.612745      0.0341        0.9460
 6     14.825003      9.040107      0.0260        0.9720
 7      5.784896      2.344945      0.0102        0.9821
 8      3.439951      1.163995      0.0060        0.9882
 9      2.275956      0.180191      0.0040        0.9922
10      2.095765      1.271980      0.0037        0.9959
11      0.823785      0.268282      0.0014        0.9973
12      0.555504      0.023777      0.0010        0.9983
13      0.531727      0.300812      0.0009        0.9992
14      0.230914      0.039611      0.0004        0.9996
15      0.191304      0.166683      0.0003        1.0000
16      0.024621      0.024621      0.0000        1.0000
17      0.000000      0.000000      0.0000        1.0000
18      0.000000                    0.0000        1.0000

Root-Mean-Square Total-Sample Standard Deviation    5.624935
Root-Mean-Square Distance Between Observations      33.74961

Cluster History

Number of   Clusters Joined   Freq   Semipartial   R-Square   Approximate   Cubic        Pseudo F    Pseudo      Norm RMS   Tie
Clusters                             R-Square                 Expected      Clustering   Statistic   t-Squared   Distance
                                                              R-Square      Criterion
   15       CL27    CL41       10      0.0086        .799        .812         -1.2         24.1         4.7       0.6206
   14       CL52    CL39        4      0.0064        .792        .803         -.95         25.3         4.6       0.6224
   13       CL19    CL23       25      0.0108        .782        .793         -1.0         26.0         4.3       0.6322
   12       CL42    CL22        5      0.0089        .773        .783         -.83         27.2         3.9       0.6993
   11       CL21    CL16       42      0.0371        .736        .771         -2.8         24.8        16.2       0.7249
   10       CL12    OB45        6      0.0064        .729        .758         -1.9         26.9         1.6       0.7305
    9       CL13    CL18       28      0.0201        .709        .743         -2.1         27.7         7.0       0.7772
    8       CL11    CL14       46      0.0262        .683        .725         -2.6         28.3         8.3       0.788
    7       CL9     CL10       34      0.0282        .655        .704         -2.9         29.4         7.7       0.7945
    6       CL15    CL7        44      0.0619        .593        .678         -4.6         27.4        15.5       0.8501
    5       CL6     CL8        90      0.2228        .370        .645         -12          14.0        49.7       0.9624
    4       CL26    OB94        7      0.0218        .348        .600         -11          17.1        10.1       1.1609
    3       CL5     CL33       92      0.0461        .302        .533         -8.5         21.0         6.7       1.2466
    2       OB2     CL4         8      0.0305        .272        .397         -4.0         36.6         5.6       1.3979
    1       CL3     CL2       100      0.2717        .000        .000         0.00          .          36.6       1.6047

The semipartial R² measures the loss of homogeneity that results from merging two clusters. This value reflects the decreasing homogeneity of the members in a cluster as clusters are combined to make new clusters.

R² reflects the differences between clusters, so you want this value to be high.
  o At the start of the clustering process each entry is its own cluster; thus, R² is 1.
  o As more clusters are combined, the R² value should decrease.
  o At the end of the analysis, when all observations are in the same cluster, the R² value should theoretically be 0.

The approximate expected R² value is part of the output presented when the CCC value is requested. It reflects the R² value expected under a uniform null hypothesis.

Ties
  o At each level of the clustering process, PROC CLUSTER identifies the pair of clusters with the minimum distance between them. Sometimes two or more pairs of clusters have the same minimum distance. This often occurs with discrete data. In such cases the tie must be broken in some arbitrary way.
  o If there are ties, then the results of the cluster analysis depend on the order of the observations in the data set.
  o A tie means that, at a particular step in the cluster analysis, two or more pairs of clusters had the same minimum distance, so the clusters formed at that step, and possibly at some later steps, are not uniquely determined.
  o Ties that occur early in the cluster analysis usually have little effect on the later stages.
  o Ties that occur in the middle parts of the cluster analysis should be investigated.
  o Ties that occur late in the cluster analysis are a sign that a solid or concrete solution may not be possible.
  o There are routines you can run to determine if ties are affecting the outcome of your analyses.
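As an illustration of how ties arise with discrete data, the following Python sketch finds the minimum pairwise distance in a small data set and lists every pair of observations that attains it. It is only a sketch of the idea, not one of the SAS routines referred to above; the function name `tied_minimum_pairs`, the tolerance, and the scores are hypothetical.

```python
import math
from itertools import combinations


def tied_minimum_pairs(points, tol=1e-12):
    """Return the minimum pairwise distance and all index pairs that attain it."""
    dists = {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }
    d_min = min(dists.values())
    tied = [pair for pair, d in dists.items() if abs(d - d_min) <= tol]
    return d_min, tied


# Hypothetical integer-valued scores; discrete data like these make ties likely.
scores = [(1, 1), (1, 2), (2, 2), (5, 5)]
d_min, tied = tied_minimum_pairs(scores)

print("minimum distance:", d_min)
print("pairs at the minimum:", tied)
if len(tied) > 1:
    print("Tie: the first merge depends on the order of the observations.")
```

When more than one pair is returned, whichever pair is joined first is an arbitrary choice, so reordering the observations and rerunning the analysis is one simple way to see whether a tie changes the final clusters.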


Cluster Analysis Using the UPGMA Method

Table of CLUSTER by Orrow (Using Non-standardized Data)

                       Orrow
CLUSTER      12     22     32     42     52    Total
1            10     37      2     39      4       92
2             0      0      0      0      7        7
3             1      0      0      0      0        1
Total        11     37      2     39     11      100

Frequency Missing = 1

Table of CLUSTER by Orrow (Using Standardized Data)

                       Orrow
CLUSTER      12     22     32     42     52    Total
1             5     26      2     39     11       83
2             5     11      0      0      0       16
3             1      0      0      0      0        1
Total        11     37      2     39     11      100

Frequency Missing = 1
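The two tables differ only in whether the variables were standardized before clustering. The Python sketch below suggests why that can matter: with raw data, a variable measured on a large scale dominates the Euclidean distance, whereas after standardization each variable contributes on an equal footing. The trait values are hypothetical, and the program above does not show how the standardized run was produced, so this is only an illustration of the principle.

```python
import math
import statistics


def standardize(values):
    """Center a variable to mean 0 and scale it to sample standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return [(v - mean) / sd for v in values]


# Hypothetical traits on very different scales (not the barley data).
trait_a = [100.0, 140.0, 100.0, 140.0]  # large-scale variable
trait_b = [1.0, 1.0, 3.0, 3.0]          # small-scale variable

raw = list(zip(trait_a, trait_b))
std = list(zip(standardize(trait_a), standardize(trait_b)))

# Observation 0 differs from observation 1 only in trait_a,
# and from observation 2 only in trait_b.
raw_d01 = math.dist(raw[0], raw[1])
raw_d02 = math.dist(raw[0], raw[2])
std_d01 = math.dist(std[0], std[1])
std_d02 = math.dist(std[0], std[2])

print(f"raw:          d(0,1) = {raw_d01:.3f}, d(0,2) = {raw_d02:.3f}")
print(f"standardized: d(0,1) = {std_d01:.3f}, d(0,2) = {std_d02:.3f}")
```

On the raw scale the first distance is twenty times the second; after standardization the two are equal, so a clustering method working from these distances can group the observations differently.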

[Figure: Cluster Analysis Using the UPGMA Method, non-standardized data]

[Figure: Cluster Analysis Using the UPGMA Method, standardized data]

Cluster analysis Using Wards Method
The CLUSTER Procedure
Ward's Minimum Variance Cluster Analysis

Eigenvalues of the Covariance Matrix

      Eigenvalue    Difference    Proportion    Cumulative
 1    295.143472    165.626898      0.5182        0.5182
 2    129.516574     62.636260      0.2274        0.7456
 3     66.880314     39.119795      0.1174        0.8631
 4     27.760519      8.322771      0.0487        0.9118
 5     19.437749      4.612745      0.0341        0.9460
 6     14.825003      9.040107      0.0260        0.9720
 7      5.784896      2.344945      0.0102        0.9821
 8      3.439951      1.163995      0.0060        0.9882
 9      2.275956      0.180191      0.0040        0.9922
10      2.095765      1.271980      0.0037        0.9959
11      0.823785      0.268282      0.0014        0.9973
12      0.555504      0.023777      0.0010        0.9983
13      0.531727      0.300812      0.0009        0.9992
14      0.230914      0.039611      0.0004        0.9996
15      0.191304      0.166683      0.0003        1.0000
16      0.024621      0.024621      0.0000        1.0000
17      0.000000      0.000000      0.0000        1.0000
18      0.000000      0.000000      0.0000        1.0000
19      0.000000                    0.0000        1.0000

Root-Mean-Square Total-Sample Standard Deviation    5.47491
Root-Mean-Square Distance Between Observations      33.74961

Cluster History

Number of   Clusters Joined   Freq   Semipartial   R-Square   Approximate   Cubic        Pseudo F    Pseudo      Tie
Clusters                             R-Square                 Expected      Clustering   Statistic   t-Squared
                                                              R-Square      Criterion
   15       CL22    CL39       16      0.0090        .828        .812         1.63         29.2         6.4
   14       CL26    CL18       28      0.0091        .819        .803         1.55         29.9         5.9
   13       CL15    CL33       19      0.0120        .807        .793         1.28         30.3         6.1
   12       CL28    CL43        5      0.0143        .793        .783         0.89         30.6         5.5
   11       CL17    CL16       14      0.0170        .776        .771         0.40         30.8         5.2
   10       OB2     OB94        2      0.0177        .758        .758         0.01         31.3          .
    9       CL31    CL19        9      0.0185        .739        .743         -.22         32.3         7.4
    8       CL9     CL23       15      0.0242        .715        .725         -.64         33.0         6.7
    7       CL34    CL14       37      0.0244        .691        .704         -.82         34.6        14.5
    6       CL13    CL12       24      0.0246        .666        .678         -.72         37.5         8.0
    5       CL11    CL6        38      0.0392        .627        .645         -1.0         39.9         9.5
    4       CL10    CL8        17      0.0603        .567        .600         -1.8         41.8        10.2
    3       CL21    CL5        46      0.0624        .504        .533         -1.3         49.3        13.7
    2       CL4     CL7        54      0.1888        .315        .397         -2.7         45.1        42.3
    1       CL3     CL2       100      0.3154        .000        .000         0.00          .          45.1
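The columns of the cluster history are related to one another, and a short Python sketch can check this against the Ward's method output. It assumes the usual definition of the pseudo F statistic, (R² / (g − 1)) / ((1 − R²) / (n − g)) for g clusters and n observations, and it uses the fact that the semipartial R² of a merge equals the drop in R² caused by that merge. The R² values are copied from the cluster history; the function and variable names are hypothetical.

```python
def pseudo_f(r_square, n_obs, n_clusters):
    """Pseudo F = (R^2 / (g - 1)) / ((1 - R^2) / (n - g)) for g clusters."""
    g = n_clusters
    return (r_square / (g - 1)) / ((1.0 - r_square) / (n_obs - g))


# R-Square by number of clusters, copied from the Ward's method cluster history.
ward_r_square = {4: 0.567, 3: 0.504, 2: 0.315, 1: 0.000}
n = 100  # number of observations clustered (Freq of the final merge)

f3 = pseudo_f(ward_r_square[3], n, 3)
f2 = pseudo_f(ward_r_square[2], n, 2)

# The drop in R-Square from 3 to 2 clusters should match the semipartial
# R-Square reported for the merge that forms 2 clusters (0.1888),
# up to rounding of the printed R-Square values.
drop_3_to_2 = ward_r_square[3] - ward_r_square[2]

print(f"pseudo F at 3 clusters: {f3:.1f}")
print(f"pseudo F at 2 clusters: {f2:.1f}")
print(f"R-Square drop from 3 to 2 clusters: {drop_3_to_2:.3f}")
```

The computed pseudo F values agree with the 49.3 and 45.1 printed in the table, and the same formula applied to the UPGMA history (R² = .272 at two clusters) gives 36.6, as reported there.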


Cluster analysis Using Wards Method
The FREQ Procedure

Table of CLUSTER by Orrow (Non-standardized Data)

                       Orrow
CLUSTER      12     22     32     42     52    Total
1             0      4      0     32      1       37
2             9     31      2      4      0       46
3             2      2      0      3     10       17
Total        11     37      2     39     11      100

Frequency Missing = 1

Table of CLUSTER by Orrow (Using Standardized Data)

                       Orrow
CLUSTER      12     22     32     42     52    Total
1             1     22      2     38      0       63
2             9     15      0      0      0       24
3             1      0      0      1     11       13
Total        11     37      2     39     11      100

Frequency Missing = 1
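Tables like these are cross-tabulations of the cluster assigned to each entry against its Orrow code, which is what the `tables cluster*orrow` statement requests from PROC FREQ. As a rough Python sketch of the same bookkeeping, the function below counts how many entries fall into each combination. The function name `crosstab` and the six example entries are hypothetical and are not taken from the barley data.

```python
from collections import Counter


def crosstab(row_labels, col_labels):
    """Count co-occurrences of row and column labels as a nested dictionary."""
    counts = Counter(zip(row_labels, col_labels))
    rows = sorted(set(row_labels))
    cols = sorted(set(col_labels))
    return {r: {c: counts.get((r, c), 0) for c in cols} for r in rows}


# Hypothetical cluster assignments and origin codes for six entries.
cluster = [1, 1, 2, 2, 2, 3]
orrow = [12, 22, 22, 22, 52, 52]

table = crosstab(cluster, orrow)

# Print one line per cluster with its counts and row total.
for r, row in table.items():
    print(r, row, "Total:", sum(row.values()))
```

A cluster solution that separates the origins well would show each Orrow column concentrated in a single row of the table.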

[Figure: Cluster analysis Using Wards Method, non-standardized data]

[Figure: Cluster analysis Using Wards Method, standardized data]
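Returning to the validation step suggested under the precautions, the sketch below shows one way to assign individuals to two halves at random before running the cluster analysis on each half. It is written in Python only to illustrate the idea; the function name `split_halves`, the seed, and the entry numbers are hypothetical.

```python
import random


def split_halves(items, seed=0):
    """Randomly split items into two halves (the second gets the extra item, if any)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


# Hypothetical entry numbers 1..100.
entries = list(range(1, 101))
half_a, half_b = split_halves(entries, seed=2024)

print(len(half_a), len(half_b))
```

If the same general groupings appear in both halves, that gives some reassurance that the clusters are not an artifact of a few individuals.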