Cluster analysis with SPSS: K-Means Cluster Analysis



Similar documents
STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Machine Learning using MapReduce

Mixed 2 x 3 ANOVA. Notes

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Setting Up Outlook on Workstation to Capture s

K-Means Clustering Tutorial

Social Media Mining. Data Mining Essentials

STATISTICA Formula Guide: Logistic Regression. Table of Contents

SPSS Guide: Regression Analysis

The Science and Art of Market Segmentation Using PROC FASTCLUS Mark E. Thompson, Forefront Economics Inc, Beaverton, Oregon

Tutorial Segmentation and Classification

Cluster Analysis: Basic Concepts and Algorithms

Tutorial for proteome data analysis using the Perseus software platform

Regression Clustering

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

A Basic Guide to Analyzing Individual Scores Data with SPSS

Chapter 5 Analysis of variance SPSS Analysis of variance

COSC 6397 Big Data Analytics. Mahout and 3 rd homework assignment. Edgar Gabriel Spring Mahout

Binary Logistic Regression

HES-SO Master of Science in Engineering. Clustering. Prof. Laura Elena Raileanu HES-SO Yverdon-les-Bains (HEIG-VD)

Building Data Cubes and Mining Them. Jelena Jovanovic

SAS Software to Fit the Generalized Linear Model

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

University of Arkansas Libraries ArcGIS Desktop Tutorial. Section 2: Manipulating Display Parameters in ArcMap. Symbolizing Features and Rasters:

The Chi-Square Test. STAT E-50 Introduction to Statistics

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Introduction to Clustering

Analysis of categorical data: Course quiz instructions for SPSS

CLUSTERING LARGE DATA SETS WITH MIXED NUMERIC AND CATEGORICAL VALUES *

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear.

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems

IBM SPSS Statistics for Beginners for Windows

Clustering UE 141 Spring 2013

POST-HOC SEGMENTATION USING MARKETING RESEARCH

8. Comparing Means Using One Way ANOVA

Multivariate Analysis

Chapter 7. Cluster Analysis

PUBLIC Model Manager User Guide

Evolutionary Detection of Rules for Text Categorization. Application to Spam Filtering

Instructions for SPSS 21

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

Comparables Sales Price

Chapter ML:XI (continued)

Using Data Mining for Mobile Communication Clustering and Characterization

One-Way ANOVA using SPSS SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate

Clustering Connectionist and Statistical Language Processing

IBM SPSS Statistics 20 Part 1: Descriptive Statistics

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Standardization and Its Effects on K-Means Clustering Algorithm

They can be obtained in HQJHQH format directly from the home page at:

Cluster Analysis: Basic Concepts and Algorithms

Lesson 9. Reports. 1. Create a Visual Report. Create a visual report. Customize a visual report. Create a visual report template.

To do a factor analysis, we need to select an extraction method and a rotation method. Hit the Extraction button to specify your extraction method.

SPSS Resources. 1. See website (readings) for SPSS tutorial & Stats handout

Setting Preferences in QuickBooks

Introduction to Machine Learning Using Python. Vikram Kamath

Payroll. 4. Print Checks. Table of Contents Print Checks...2 All...3 Department...4 Print Single Posting...5

Neural Networks Lesson 5 - Cluster Analysis

CLUSTER ANALYSIS FOR SEGMENTATION

IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA

Package MixGHD. June 26, 2015

Ira J. Haimowitz Henry Schwarz

Message Classification user guide

Didacticiel - Études de cas

Multidimensional Scaling (MDS) Presentation

Doing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices:

Testing Group Differences using T-tests, ANOVA, and Nonparametric Measures

Comparison of K-means and Backpropagation Data Mining Algorithms

MICROSOFT OUTLOOK 2010 WORK WITH CONTACTS

SPSS: Getting Started. For Windows

One-Way Analysis of Variance (ANOVA) Example Problem

Oracle Data Miner (Extension of SQL Developer 4.0)

Rational Team Concert. Scrum Project Management Tutorial

Visualizing non-hierarchical and hierarchical cluster analyses with clustergrams

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

IBM SPSS Direct Marketing 23

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS

SPSS INSTRUCTION CHAPTER 1

How to Get More Value from Your Survey Data

IBM SPSS Direct Marketing 22

Support Vector Machines with Clustering for Training with Very Large Datasets

Rational Quality Manager. Quick Start Tutorial

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Cluster Analysis: Basic Concepts and Methods

Using Microsoft Excel to Analyze Data

Simple Predictive Analytics Curtis Seare

Multivariate Analysis of Variance (MANOVA)

A DATA ANALYSIS TOOL THAT ORGANIZES ANALYSIS BY VARIABLE TYPES. Rodney Carr Deakin University Australia

Two Related Samples t Test

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

Can SAS Enterprise Guide do all of that, with no programming required? Yes, it can.

Data Mining Part 5. Prediction

Bill Burton Albert Einstein College of Medicine April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

Transcription:

analysis with SPSS: K-Means Analysis analysis is a type of data classification carried out by separating the data into groups. The aim of cluster analysis is to categorize n objects in k (k>1) groups, called clusters, by using p (p>0) variables. As with many other types of statistical, cluster analysis has several variants, each with its own clustering procedure. There are two main sub-divisions of clustering procedures. In the first procedure the number of clusters is pre-defined. This is known as the K-Means ing method. When the number of the clusters is not predefined we use Hierarchical analysis. The great variety of clustering procedures results from the metrics which are used between different objects. The most commonly used metrics are the Euclidean metric, Manhattan metric, Chebyshev metric and others. There are also different rules used for creating the clusters. Some allow members to share different clusters, whilst others only allow exclusive membership. K-Means Analysis From the main menu of SPSS consecutively click Analyze Classify K- Means. Figure 1.

Mark the variables on the basis of which clustering will be done and send them to Variables - the box for entering the variables. The Label Cases by box is used for entering a string variable which marks the units. After that we determine the number of desired clusters in the Number of s box. In our case in the Method box we mark Iterate and Classify. Unlike the alternative method Classify only, which, defines fixed cluster centers, this one defines the successive iterations and determines how the final clustering is to be carried out. Figure 2. In the Centers box we specify the file (if there is one), which contains the initial cluster centers and the file (if required), which contains the final cluster centers. In Read initial from we specify the file which contains the initial cluster centers, and in Write final as we specify the file which contains the final cluster centers. With the Iterate button we can determine the criteria for updating the cluster centers, in Maximum Iterations we can determine the maximum number of the iterations (no more than 999), and in Convergence Criterion we decide which rule halts the iteration process. By default 10 iterations and convergence criterion 0 are given. Furthermore, it is possible to mark the option Use running means. If this is selected, the cluster centers change after the addition of each object. If this option is not selected, cluster centers are calculated after all objects have been allocated to a given cluster. In

both cases we receive different results and therefore the way in which the clustering is achieved must be specified. We proceed by clicking on Continue. Figure 3. Using the Save button we can save new variables in a data file which indicates the cluster membership of each object ( Membership) and distance from cluster center for each object (Distance from Center). Fgure 4. The Options button gives the option of displaying additional statistics initial cluster centers (Initial cluster centers), dispersion analysis table (ANOVA table) and information for cluster membership of each object ( information in each case?). It is desirable that all three options are selected. We obtain the final result by clicking OK. Figure 5.

Let us briefly go through the different stages of K-Means Analysis using the data from the example with UniCredit Bulbank (Table 1 from the chapter First Steps in SPSS). We determine the number of clusters to be 4, and the initial cluster centers are evaluated based on the data. We use squared Euclidean Distance for the divergence measure between units. Also we choose the cluster centers to be calculated after all objects have been assigned to a given cluster, i.e. we don t put a tick in Use running means box. The initial cluster centers are given in Table 1 (Initial Centers). They are vectors with their values based on the five variables, which refer to 2000 (first cluster), 2005 (second cluster), 2006 (third cluster) and 2003 (fourth cluster). These 4 years are at maximum index distance from each other. Table 1. Initial Centers 1 2 3 4 Net profit 160,065 96,116 120,654 89,752 Own funds 602,776 609,609 630,781 550,026 Assets 2559,476 3474,829 4346,594 2825,439 Client deposits 1692,270 2618,771 3336,875 2177,781 Loans 316,380 1706,858 2131,577 916,634 In Table 2 we can see the number of the iterations and the changes in the cluster centers. In the first iteration year 2001 joins year 2000 and the cluster center is updated. The year 2004 joins the second cluster the year 2005, and year 2002 joins the fourth cluster the year 2003. The third cluster does not change. In the second iteration the process of redistribution of the units stops because there are no changes in the cluster centers. Table 2. Iteration History (a) Change in Centers Iteration 1 2 3 4 1 200,730 227,959,000 195,515 2,000,000,000,000 (a) Convergence achieved due to no or small change in cluster centers. The maximum absolute coordinate change for any center is,000. The current iteration is 2. The minimum distance between initial centers is 821,273.

The results are summarized in Table 3, i.e. which cluster each unit belongs to and the new cluster centers. The first cluster is formed by the years 2000 and 2001, the second by 2004 and 2005, the third only by 2006 and the fourth by the years 2002 and 2003. In Table 4 we can see the final cluster centers, and in Table 5 - the distance between the final cluster centers. Table 3. Membership Case Number Distance 1: 2000 1 200,730 2: 2001 1 200,730 3: 2002 4 195,515 4: 2003 4 195,515 5: 2004 2 227,959 6: 2005 2 227,959 7: 2006 3,000 Table 4. Final Centers 1 2 3 4 Net profit 114,489 91,198 120,654 84,441 Own funds 546,628 591,861 630,781 531,638 Assets 2645,581 3544,763 4346,594 2773,710 Client deposits 1856,952 2767,970 3336,875 2113,869 Loans 339,367 1550,413 2131,577 740,285 Table 5. Distances between Final Centers 1 2 3 4 1 1762,868 2881,450 494,253 2 1762,868 1143,119 1297,055 3 2881,450 1143,119 2432,395 4 494,253 1297,055 2432,395 If we compare the results from Table 1 and Table 4 we will see that the cluster center of the third cluster does not change. Since in our case the groups are formed deliberately in accordance with the distance between them in the multidimensional space, i.e. the condition for randomness

of the observations in the different groups is not met, the results from the dispersion analysis are purely descriptive. In other words, we cannot use the significance level (Sign. column in ANOVA Table dispersion analysis of clustering results) to check the hypothesis about the mean variables. Nevertheless, the differences between the F-ratios (F column in the ANOVA Table) makes it possible to draw general conclusions about the role of the different mean variables in the forming of the clusters. In Table 6 are given the results from the dispersion analysis. They show that assets have the greatest influence in the forming of the clusters and net profit has the least influence. Table 6. ANOVA Error Mean Square df Mean Square df F Sig. Net profit 495,145 3 1419,744 3,349,795 Own funds 2878,202 3 2537,200 3 1,134,460 Assets 842788,443 3 9987,138 3 84,387,002 Client deposits 634017,636 3 35643,498 3 17,788,021 Loans 957411,333 3 37401,709 3 25,598,012 The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal. Table 7. Number of Cases in each 1 2,000 2 2,000 3 1,000 4 2,000 Valid 7,000 Missing,000 Table 7 presents data for the number of units in each cluster as well as their total number and missing units (if there are any). Now we will present the results from the same clustering procedure with the difference that we choose the cluster centers to be changed after the joining of each object to a given cluster and we select the option Use running means.

Table 8. Table 9. Iteration History(a) Membership Change in Centers Iteration 1 2 3 4 1 215,142 151,973,000,000 2 53,786 50,658,000,000 3 13,446 16,886,000,000 4 3,362 5,629,000,000 5,840 1,876,000,000 6,210,625,000,000 7,053,208,000,000 8,013,069,000,000 9,003,023,000,000 10,001,008,000,000 Case Number Distance 1: 2000 1 286,856 2: 2001 1 140,021 3: 2002 1 206,434 4: 2003 4,000 5: 2004 2 227,963 6: 2005 2 227,955 7: 2006 3,000 (a) Iterations stopped because the maximum number of iterations was performed. Iterations failed to converge. The maximum absolute coordinate change for any center is,005. The current iteration is 10. The minimum distance between initial centers is 821,273. Table 10. Final Centers 1 2 3 4 Net profit 102,702 91,198 120,654 89,752 Own funds 535,501 591,861 630,781 550,026 Assets 2671,047 3544,763 4346,594 2825,439 Client deposits 1921,287 2767,970 3336,875 2177,781 Loans 414,223 1550,413 2131,577 916,634 Table 11. Distances between Final Centers 1 2 3 4 1 1665,679 2787,481 585,168 2 1665,679 1143,119 1126,578 3 2787,481 1143,119 2267,372 4 585,168 1126,578 2267,372 Table 12. ANOVA Error Mean Square df Mean Square df F Sig. Net profit 236,122 3 1678,767 3,141,929 Own funds 2856,043 3 2559,359 3 1,116,465 Assets 843275,336 3 9500,245 3 88,764,002 Client deposits 628462,814 3 41198,320 3 15,255,025 Loans 966937,206 3 27875,836 3 34,687,008 The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences between cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.

Table 13. Number of Cases in each 1 3,000 2 2,000 3 1,000 4 1,000 Valid 7,000 Missing,000 From the presented data (Table 9) we see that now the first cluster is formed by years 2000, 2001 and 2002, the second by 2004 and 2005, the third by 2006 and the fourth only by the year 2003. According to the data presented in the ANOVA table, assets once again have maximum influence in forming the clusters and net profit the least. Author: Dessislava Vojnikova, Plovdiv University, 4 th year Bachelor program in Applied Mathematics Supervisor: Snezhana Gocheva-Ilieva