Hierarchical Cluster Analysis Some Basics and Algorithms




Nethra Sambamoorthi
CRMportals Inc., 11 Bartram Road, Englishtown, NJ 07726

(NOTE: Please always use the latest copy of this document. If you are connected to the internet, the latest version can be downloaded from CRMportals.)

1. Introduction

1.1 What is cluster analysis?

Cluster analysis is a collection of statistical methods that identifies groups of samples that behave similarly or share similar characteristics. In common parlance, these are also called look-alike groups. The simplest mechanism is to partition the samples using measurements that capture similarity or distance between samples; in this sense, "clusters" and "groups" are interchangeable words. In market research studies, cluster analysis is often referred to as a segmentation method. In neural network terminology, clustering is called unsupervised learning: it refers to discovery rather than prediction (though in a loose sense discovery may also be called prediction), because there are no predefined learning sets against which to validate the knowledge. Typically in clustering methods, every sample within a cluster is considered to belong to that cluster equally, as opposed to belonging with a certain probability. If each observation has its own probability of belonging to a group (cluster), and the application is more interested in those probabilities, then we have to use (binomial or multinomial) probability models instead.

1.2 What is a segmentation method, and how does it differ from cluster analysis?

In general terms, a segmentation method can be interpreted as a collection of methods that identifies groups of entities or statistical samples (consumers/customers, markets, organizations) that share certain common characteristics such as attitudes, purchase propensities, media habits, and lifestyle. Such entities generally lack a good application definition that would classify each of them into one of many groups without significant error. The sample characteristics are used to group the samples, and the grouping can be arrived at either by hierarchically partitioning the sample or by non-hierarchically partitioning it. Segmentation methods thus include both probability-based grouping of observations and cluster-based grouping of observations.

Clustering itself includes hierarchical methods (tree-based, whether divisive, splitting the sample top-down, or agglomerative, merging observations bottom-up) and non-hierarchical (partitioning) methods. Segmentation is therefore a very general category of methodology, one which includes clustering methods.

There are three stages to cluster analysis: partitioning/similarity (what defines the groups), interpretation of clusters (how to use the groups), and profiling the characteristics of the partitioned groups (what explains the groups).

1.3 The natural questions that need to be attended to are:

- What defines inter-object or inter-sample similarity? What defines a distance between two samples, or two clusters?
- How do we find groups, and what are the different methods to identify them?
- What is the difference between hierarchical and non-hierarchical clustering methods?
- When do we stop further partitioning or identifying groups (in a divisive approach), and when do we stop further joining samples (in agglomerative methods)?
- What are the right data elements to use in finding clusters in a business problem where we may have hundreds of variables (variable selection)?
- What are the limitations of cluster analysis?

As with many methods shared among different areas of study, cluster analysis goes by different names in different specializations. It is called Q-analysis (finding distinct ethnic groups using data about beliefs and feelings [1]), numerical taxonomy (biology), classification analysis (sociology, business, psychology), typology [2], and so on. In marketing it is a very useful technique, as will be shown at the end.

[1] http://www.dodccrp.org/1999ccrts/pdf_files/track_3/woodc.pdf
[2] The science of classifying stone tools by form, technique, and technological traits: "Must include duplication of the technique by first observing the intentional form, then reconstructing or replicating the tool in the exact order of the aboriginal workman. Shows elements of culture. Typology cannot be based on function." (Crabtree 1982:57)

2. Hierarchical and non-hierarchical clustering methods

Clustering algorithms are broadly classified into two families, namely hierarchical and non-hierarchical algorithms. In hierarchical procedures, we construct a hierarchy or tree-like structure to see the relationships among entities (observations or individuals). In non-hierarchical methods, a position in the measurement space is taken as a central point (seed) and distances are measured from it. Identifying the right central position is a big challenge, and hence non-hierarchical methods are less popular.

[Figure: a dendrogram relating the number of clusters (13, 9, 5, 3, 2, 1) to the linkage distance measure (0 through 11); as the linkage distance grows, the number of clusters shrinks.]

2.1 Hierarchical Clustering

There is a concept of ordering involved in this approach. The ordering is driven by how many observations can be combined at a time, or by what determines that the distance between two observations or two clusters is not statistically different from 0. The clusters can be arrived at either by weeding out dissimilar observations (divisive method) or by joining together similar observations (agglomerative method). Most statistical packages use the agglomerative method, and the most popular agglomerative methods are (1) single linkage (nearest-neighbor approach), (2) complete linkage (furthest neighbor), (3) average linkage, (4) Ward's method, and (5) the centroid method. These differ in the definition of distance and in what defines the largest distance that still counts as statistically zero distance. Most of the time, the distance is based on Euclidean distance along the sample axes (the Mahalanobis distance is used when the variables are correlated, i.e., the axes are not orthogonal).

2.2 Points to articulate

a) How could clustering methods be used for identifying outliers? Note that an outlier by itself will form a cluster. Think of a tree diagram that points out a few outliers: how the grouping pattern and the stem will be represented by a cluster of outliers.
b) The relationship between the linkage distance measure and the number of clusters.
c) Why the number of clusters may not be a simple continuously increasing sequence, and why, for example, there may not be a one-to-one relationship between the linkage distance and the number of clusters.
d) How does variable selection play a role in cluster analysis, and what method should be used?
e) How would the diagram above, including the measurements and clusters, look different if the complete linkage measure were used?
f) Why is the linkage distance, in general, inversely related to the number of clusters?
g) What happens if a similarity measure is used instead of a distance measure?
h) What is meant by similarity or distance measures when we have qualitative data?
i) What is the major problem with the non-hierarchical method? (Hint: the starting point of the seed, or center, of the cluster.)
j) Why do we have to standardize the data when we do cluster analysis?
k) How do we use the dendrogram (the tree structure for identifying the distance between clusters and which observations belong to which cluster; a graphical representation issue)? Hint: the length of the stem, and the coding of the records at the leftmost stems.
l) Various factors affect the stability of a clustering solution: variable selection, the distance/similarity measure used, different significance levels, and the type of method (divisive vs. agglomerative). Articulate a method that converges to the right solution in the midst of the above-mentioned parameters.
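Point (j) above, on standardization, can be made concrete with a small sketch. The example below assumes NumPy and SciPy are available, and the four "customers" with their age and income values are invented purely for illustration. On the raw data, the nearest pair of observations crosses the two age groups, because the large-scale income variable dominates the Euclidean distance; after standardizing each variable to mean 0 and standard deviation 1, the nearest pair falls within an age group.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Four hypothetical customers: (age in years, income in dollars).
# Rows 0 and 1 are "young", rows 2 and 3 are "old".
X = np.array([
    [25.0, 50000.0],
    [26.0, 51000.0],
    [60.0, 50010.0],
    [61.0, 51020.0],
])

# pdist returns condensed distances in this pair order for 4 points:
pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

# Raw Euclidean distances: income (scale ~1000s) swamps age (scale ~10s),
# so the nearest pair crosses the two age groups.
raw = pdist(X)
print("nearest raw pair:", pairs[int(np.argmin(raw))])

# Standardize each column to mean 0, standard deviation 1; now both
# variables contribute comparably, and the nearest pair respects age.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
std = pdist(Xz)
print("nearest standardized pair:", pairs[int(np.argmin(std))])
```

The same reasoning explains why most packages offer a standardization option before computing the distance matrix.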

2.3 Simple algorithmic concepts

Single linkage: At each step we agglomerate one additional observation into a cluster. The first cluster is the pair of observations with the shortest distance. A third observation, with the next smallest distance, is either added to the two-observation cluster or forms a new two-observation cluster. The algorithm continues until all the observations are in one cluster. The distance between any two clusters is the shortest distance from any point in one cluster to any point in the other cluster; two clusters are merged at any stage by the single shortest or strongest link between them [3]. This procedure is also known as the nearest-neighbor method in the neural network methodologies. (Explain: what is meant by a snake-like chain of clusters, and why does it happen in single linkage methods?)

Complete linkage: This is similar to single linkage, except that it is based on the maximum distance rather than the minimum distance. The maximum distance between any two individuals in a cluster represents the smallest (minimum-diameter) sphere that can enclose the cluster [3]. The advantage here is that it does not chain observations into one long cluster. That chaining happens with the single linkage distance, where the whole collection of data can end up as one cluster even though the first and last observations are at the maximum distance for the entire sample space.

Average linkage: Here we use the average distance from samples in one cluster to samples in the other cluster.

Ward's distance: This method is distinct from all the other methods because it uses an analysis-of-variance approach to evaluate the distances between clusters. In short, it attempts to minimize the sum of squares (SS) of any two (hypothetical) clusters that can be formed at each step [4]. As is typical of variance-based statistical decision-making, this tends to create too many clusters, or clusters of small size, because the more scattered the observations, the more the sum of squares inflates the distance.

Centroid (mean) method: Here the Euclidean distance is measured between the centroids of two clusters.

The agglomerative procedure can be described through a distance matrix. Measure each observation's distance from every other observation; for, say, 25 observations this gives a 25 x 25 matrix in which s_i holds the distances between the i-th observation and the rest, the diagonal elements are zero, and the matrix is symmetric by the properties of distance. The least distance in the whole matrix identifies the first cluster of two elements, if that distance qualifies as significantly different from zero. Next consider the next smallest distance: join it with the first cluster (the minimum of the two distances to the new third observation) if its distance is statistically the same as zero, or else it may form a new two-individual cluster. Thus, given a distance, we are able to form clusters. Now increase the distance measure and see how the different clusters change. Since increasing the distance means fewer and fewer clusters, the algorithm ends when all the observations are in one cluster. In the process we create the dendrogram (the tree-based graphic representing the clusters).

[3] J.F. Hair, Jr., Rolph E. Anderson, and Ronald L. Tatham, Multivariate Data Analysis with Readings, 2nd Edition, MacMillan Publishing Co., New York, p. 303.
[4] http://www.statsoftinc.com/textbook/stcluan.html
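The step-by-step agglomeration just described can be sketched directly in code. The function below is a deliberately naive sketch for illustration (production work would use a library routine such as scipy.cluster.hierarchy.linkage): it builds the symmetric, zero-diagonal distance matrix and then repeatedly merges the two clusters joined by the single shortest link, recording the merge distances that a dendrogram would display.

```python
import numpy as np

def single_linkage_merges(X):
    """Naive agglomerative single-linkage clustering, mirroring the text:
    start with every observation as its own cluster, repeatedly merge the
    two clusters joined by the single shortest link, and stop when one
    cluster remains. Returns the merge history as (a, b, distance) tuples."""
    # Symmetric distance matrix with a zero diagonal, as described above.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = {i: [i] for i in range(len(X))}
    history = []
    while len(clusters) > 1:
        best = None
        for a in clusters:
            for b in clusters:
                if a < b:
                    # Single linkage: shortest distance over all member pairs.
                    d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[2]:
                        best = (a, b, d)
        a, b, d = best
        clusters[a].extend(clusters.pop(b))  # merge cluster b into cluster a
        history.append((a, b, d))
    return history

# Tiny invented 1-D example: a tight chain {0, 1, 2} and a pair {3, 4}.
X = np.array([[0.0], [0.4], [0.9], [5.0], [5.3]])
for a, b, d in single_linkage_merges(X):
    print(f"merge clusters {a} and {b} at distance {d:.1f}")
```

Note how the last merge happens at a much larger distance than the earlier ones; cutting the tree just below that jump is exactly the "stop increasing the distance" decision described above, and the small early merge distances within the chain {0, 1, 2} illustrate how single linkage can string loosely related observations into one snake-like cluster.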

2.4 An Example: Suppose a credit card company wants to introduce different incentive schemes for card usage for different groups of merchants, whose marketing strategies and positioning of the card can play a significant role in promoting it. This is a non-trivial problem, especially since many of the merchants sell many kinds of things and are not well correlated with their SIC codes. Moreover, what matters is the outcome variable of profitability and revenue rather than the type of product produced or sold. Grouping merchants by SIC code therefore reveals little and may not serve the purpose of the incentives, which is to increase usage of the card products across a store's whole collection of merchandise. So the decision variable is increased card usage, and the data elements are store-related information, of which there could be hundreds of variables. Now cluster (group) the stores so that all these merchants are grouped into, say, around 10 actionable groups for which a new incentive system can be developed.

Think and articulate:
1. Redefine the problem so that we group the card users, not the merchants.
2. How is this grouping different from a traditional scoring model for the merchants or consumers, taking share of card or share of wallet, respectively, as the decision variable (as against a yes-or-no type decision variable)?
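A sketch of this workflow follows. The merchant features here (monthly card volume, average ticket size, share of repeat customers) are entirely invented stand-ins for the hundreds of real store-related variables; the sketch assumes NumPy and SciPy, standardizes the variables, builds the hierarchy with Ward linkage, and cuts the tree into about 10 actionable groups.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)

# Invented merchant features, standing in for the hundreds of real
# store-related variables a card issuer would have on file.
n_merchants = 200
features = np.column_stack([
    rng.lognormal(10, 1, n_merchants),    # monthly card volume ($)
    rng.lognormal(3, 0.5, n_merchants),   # average ticket size ($)
    rng.beta(2, 5, n_merchants),          # share of repeat customers
])

# Standardize so that no single variable dominates the distances.
Z = (features - features.mean(axis=0)) / features.std(axis=0)

# Build the hierarchy with Ward linkage, then cut the tree into roughly
# 10 actionable merchant groups for the incentive program.
tree = linkage(Z, method="ward")
groups = fcluster(tree, t=10, criterion="maxclust")

print("merchants:", len(groups), "groups:", len(np.unique(groups)))
```

In practice, each resulting group would then be profiled (stage three of the analysis) against the decision variable, card usage, before designing an incentive per group.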

Appendix

Consider the graphic at [5]. Do we know how to group these people? Do similar groups for an outcome exist? How many groups are there? We need observations to group these people. Typically we collect observations about these people that do not change over time. (Think about why we might collect, and under what conditions it is good to use, data that does change over time.) A picture is worth a thousand words, and our translation of these pictures into data should not only capture all the information about these people; such data elements should also be correlated with a relevant outcome variable.

[5] http://obelia.jde.aca.mmu.ac.uk/multivar/ca.htm
