Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009




Overview
What is cluster analysis?
Types of cluster
Distance functions
Clustering methods
  Agglomerative
  K-means
  Density-based
Interpretation of results

Relationship to PCA/Factor Analysis
Data: M variables, N observations per variable.
PCA/Factor Analysis attempts to reduce the number of variables by finding directions of maximum variance.
Cluster analysis attempts to reduce the number of observations by finding groups of observations with minimum within-group variability and maximum between-group variability.

What is a cluster?
1. We need to define what we mean by a cluster, for our specific application.
2. We need to define membership of a cluster:
   a) Exclusive: each object belongs to one and only one cluster.
   b) Overlapping: an object can belong simultaneously to more than one cluster.
   c) Fuzzy: every object belongs to every cluster with a membership weighting (probability) between zero and one.

Types of Cluster

Distance Function
We need to define some measure of distance between our data points.
Example: 3 variables: X, Y, Z
Data points: D_1 = (X_1, Y_1, Z_1), D_2 = (X_2, Y_2, Z_2)
Distance is also known as proximity.

Types of Distance Functions
For two data points with coordinates (x_1, y_1, z_1) and (x_2, y_2, z_2):

Euclidean Distance:
$D_E = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$

Squared Euclidean Distance:
$D_S = (x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2$

City Block:
$D_C = |x_1 - x_2| + |y_1 - y_2| + |z_1 - z_2|$
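As a minimal sketch, the three distance functions can be written in Python as follows. The function names and the two sample points are illustrative and not from the slides.

```python
import math

def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def squared_euclidean(p, q):
    """Summed squared coordinate differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def city_block(p, q):
    """Summed absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Two illustrative points (x, y, z); the values are made up for the example.
d1 = (1.0, 2.0, 3.0)
d2 = (4.0, 6.0, 3.0)

print(euclidean(d1, d2))          # 5.0
print(squared_euclidean(d1, d2))  # 25.0
print(city_block(d1, d2))         # 7.0
```

The coordinate differences here are 3, 4 and 0, so the three measures rank the same pair of points on quite different scales.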

Distance functions for categorical variables

Observation   Marital Status   Sex
d_1           Single           Male
d_2           Married          Male
d_3           Single           Female
d_4           Married          Male
d_5           Married          Female

X = Marital Status, Y = Sex

One measure of distance is:
$d = 1 - \frac{\#\,\text{matched variables}}{\#\,\text{variables}}$

In our example: # of variables = 2. So,
$d(d_1, d_3) = 1 - \frac{1}{2} = \frac{1}{2}$
$d(d_2, d_3) = 1 - \frac{0}{2} = 1$
$d(d_2, d_4) = 1 - \frac{2}{2} = 0$
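The matching-based distance for categorical variables can be sketched in Python as below, using the five records from the example. The function name `matching_distance` and the dictionary layout are illustrative choices, not part of the slides.

```python
def matching_distance(a, b):
    """Return 1 - (# matched variables / # variables) for two records."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 1 - matches / len(a)

# (Marital Status, Sex) for the five observations in the example.
records = {
    "d1": ("Single", "Male"),
    "d2": ("Married", "Male"),
    "d3": ("Single", "Female"),
    "d4": ("Married", "Male"),
    "d5": ("Married", "Female"),
}

print(matching_distance(records["d1"], records["d3"]))  # 0.5
print(matching_distance(records["d2"], records["d3"]))  # 1.0
print(matching_distance(records["d2"], records["d4"]))  # 0.0
```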

Clustering Methods
Finally, you need to choose an algorithm for finding the clusters. Here we will look at three algorithms:
1) Agglomerative Hierarchical
2) K-means
3) Density-based

Hierarchical Agglomerative Clustering
1. Begin with one cluster for each observation.
2. Repeat: merge the two nearest clusters, storing the merged clusters and the distance between them, until there is only one cluster left.
3. Stop.
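The agglomerative procedure can be sketched in Python as follows. This is a naive illustration, not an efficient implementation: it assumes one-dimensional data and, for concreteness, measures the distance between two clusters by their nearest points. The names `agglomerate` and `nearest_points` and the sample values are hypothetical.

```python
def agglomerate(points, cluster_distance):
    """Merge the two nearest clusters until one remains.

    Returns the merge history as (left_cluster, right_cluster, distance).
    """
    clusters = [[p] for p in points]  # step 1: one cluster per observation
    history = []
    while len(clusters) > 1:          # step 2: repeat until one cluster is left
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))  # store clusters and distance
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

def nearest_points(a, b):
    """Assumed inter-cluster distance: gap between the closest members."""
    return min(abs(x - y) for x in a for y in b)

history = agglomerate([1.0, 2.0, 6.0, 7.0, 15.0], nearest_points)
for left, right, d in history:
    print(left, "+", right, "merged at distance", d)
```

The stored merge distances are what a dendrogram plots: cutting the history at a chosen distance yields the clusters that exist at that level.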

Dendrogram Result
[Figure: dendrogram of observations d_1 to d_5; axes: Observation and Distance]

Defining distance
To implement this algorithm, we need to define the distance between two clusters. 3 common definitions:
a) The distance between the nearest points of the two clusters
b) The distance between the furthest points in the two clusters
c) The average distance between all pairs of points in the two clusters
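The three inter-cluster distance definitions can be compared on a small example. The sketch below assumes Euclidean distance between points. The names `single_link`, `complete_link` and `average_link` follow common usage for these definitions but do not appear in the slides, and the two clusters are made up.

```python
import math
from itertools import product

def single_link(a, b):
    """a) Distance between the nearest points of the two clusters."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_link(a, b):
    """b) Distance between the furthest points of the two clusters."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_link(a, b):
    """c) Average distance over all pairs of points, one from each cluster."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

# Two illustrative clusters on a line.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

print(single_link(cluster_a, cluster_b))    # 3.0
print(complete_link(cluster_a, cluster_b))  # 6.0
print(average_link(cluster_a, cluster_b))   # 4.5
```

The same pair of clusters is 3, 6 or 4.5 units apart depending on the definition, which is why the choice can change which clusters get merged first.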

Defining by nearest points

Defining by furthest points

Defining by average distance

Assessment of hierarchical clustering
Advantages
1) You do not need to define the number of clusters in advance.
Disadvantages
1) The answer can depend on the definition of the inter-cluster distance.
2) Computationally intensive, so it can only be used for relatively small datasets.

K-means clustering A centroid of a group of points is usually defined as the point whose co-ordinates are the mean of the co-ordinates of the group. Note that the centroid does not, in general, correspond to an actual observation.

K-means clustering algorithm
1. Select K initial centroids, where K is the number of clusters required.
2. Repeat: assign each point to its nearest centroid, then re-calculate the centroid of each cluster, until the centroids do not change.
3. Stop.
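A minimal Python sketch of these steps is shown below. It assumes Euclidean distance and takes the initial centroids as an argument. Keeping the old centroid when a cluster ends up empty is an assumption of this sketch, since the slides do not say how to handle that case. The names `centroid` and `k_means` and the sample points are illustrative.

```python
import math

def centroid(points):
    """Point whose coordinates are the mean of the group's coordinates."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, initial_centroids, max_iter=100):
    centroids = [tuple(c) for c in initial_centroids]  # step 1
    for _ in range(max_iter):                          # step 2: repeat
        groups = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Re-calculate centroids; an empty cluster keeps its old centroid
        # (an assumption of this sketch).
        updated = [centroid(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:                       # centroids unchanged
            break
        centroids = updated
    return centroids, groups                           # step 3: stop

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, groups = k_means(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(groups)
```

Note that the final centroids here are means of three points each and do not coincide with any observation.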

K-means iterations

We need to choose:
1) The number of clusters we require.
2) The positions of the initial centroids.
Different choices will lead to different answers:
a) We need to be especially careful about 2), where a poor choice can lead to bad clustering.
b) There are some techniques for optimizing the choice, but none are perfect.
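The effect of the initial centroids can be illustrated with a small one-dimensional sketch. The data, the two sets of starting centroids, and the use of the within-cluster sum of squared distances (SSE) as a quality score are assumptions made for this example, and `k_means_1d` is a hypothetical helper.

```python
def k_means_1d(values, centroids, max_iter=100):
    """One-dimensional K-means; returns the final groups and their SSE."""
    centroids = list(centroids)
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # An empty cluster keeps its old centroid (assumption of this sketch).
        updated = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    # Within-cluster sum of squared distances to the centroid.
    sse = sum((v - c) ** 2 for g, c in zip(groups, centroids) for v in g)
    return groups, sse

values = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

good_groups, good_sse = k_means_1d(values, [0.0, 10.0, 20.0])
poor_groups, poor_sse = k_means_1d(values, [0.0, 1.0, 10.0])

print(good_groups, good_sse)  # [[0.0, 1.0], [10.0, 11.0], [20.0, 21.0]] 1.5
print(poor_groups, poor_sse)  # [[0.0], [1.0], [10.0, 11.0, 20.0, 21.0]] 101.0
```

Both runs converge, but the second start places two centroids inside one natural group and never recovers. Running several starts and keeping the result with the lowest SSE is one common mitigation, although it does not guarantee the best solution.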

Bad clustering via K-means

Assessment of K-means
Advantages
1) Computationally efficient, basically linear in the number of data points.
2) Can be used for many types of data.
Disadvantages
1) Need to specify the number of clusters in advance.
2) Potentially sensitive to initial conditions.
3) Not good when clusters are of very different sizes or very non-spherical.

Density-based clustering
Locates regions of high density that are separated by regions of low density.
We need to define density.
Center-based density: the density of a point is the number of points within a specified radius R. The density then depends on our choice of R.

Classification of points
Choose some minimum number of points, N_min.
Core point: has more than N_min points within a radius R.
Border point: has fewer than N_min points within a radius R, but does have a core point within this radius.
Noise point: a point which is neither a core point nor a border point.

Classification of points (N_min = 7)

Density-based algorithm
First choose R and N_min.
1. Classify all points.
2. Remove noise points.
3. Connect all core points that are within a distance R of each other.
4. Make each group of connected points into a cluster.
5. Assign each border point to one of the clusters of its neighboring core points.
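The steps above can be sketched in Python as follows. This is a naive illustration that compares every pair of points, and it makes three assumptions: Euclidean distance is used, a point is not counted among its own neighbours, and any point that is not core counts as a border point if a core point lies within R (otherwise it is noise). Border points join the cluster of the first neighbouring core point found. The function name and sample data are hypothetical.

```python
import math

def density_cluster(points, radius, n_min):
    """Return one label per point: a cluster number, or None for noise."""
    n = len(points)
    # Indices of the other points within `radius` of each point.
    neighbours = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]
    # Step 1: a core point has more than n_min points within the radius.
    core = [len(nb) > n_min for nb in neighbours]

    labels = [None] * n  # step 2: noise points simply stay unlabelled
    cluster_id = 0
    for i in range(n):
        if not core[i] or labels[i] is not None:
            continue
        # Steps 3 and 4: collect all core points connected to point i.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            current = stack.pop()
            for j in neighbours[current]:
                if core[j] and labels[j] is None:
                    labels[j] = cluster_id
                    stack.append(j)
        cluster_id += 1

    # Step 5: a border point joins the cluster of a neighbouring core point.
    for i in range(n):
        if not core[i]:
            for j in neighbours[i]:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),        # dense group + border
          (5, 5),                                         # isolated point
          (10, 10), (10, 11), (11, 10), (11, 11)]         # second dense group
labels = density_cluster(points, radius=1.5, n_min=2)
print(labels)  # [0, 0, 0, 0, 0, None, 1, 1, 1, 1]
```

In this example the point (2, 1) has only two neighbours within the radius, so it is not core, but it is attached to the first cluster as a border point. The point (5, 5) has no neighbours and is left as noise.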

Assessment of density-based algorithm
Advantages
1) Handles noise and outliers well.
2) Can handle clusters of different shapes and sizes.
Disadvantages
1) Has difficulty with clusters of very different densities.
2) Has trouble with high-dimensional data.
3) We need to choose R and N_min.

Choice of inputs
The results of cluster analysis depend on:
1) The algorithm you choose.
2) The parameters and initial conditions you choose.

User-defined choices
1) Clustering algorithm
2) Distance function between data points
Agglomerative clustering
a) Distance between clusters
b) Cut-off point on dendrogram
K-means clustering
a) Number of clusters
b) Initial positions of centroids
Density-based clustering
a) R
b) N_min

Evaluating your results
Almost any algorithm will find clusters in any dataset, so the presence of clusters in the output is not by itself evidence of real structure.

Some hints on sanity-checking
1) How tight are the clusters compared with the inter-cluster distance?
2) How well do the clusters match your hypothesis (if you have one)?
3) How sensitive is the answer to different choices of algorithms/parameters/initial conditions/number of clusters/etc.?
There are techniques for checking these.
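One widely used technique for the first check is the silhouette coefficient. It is not named in the slides and is offered here only as an illustration. For each point it compares the mean distance to the other members of its own cluster (a) with the mean distance to the members of the nearest other cluster (b), giving (b - a) / max(a, b). The sketch below assumes Euclidean distance, at least two clusters, and a score of 0 for points in single-member clusters. The two example partitions are made up.

```python
import math

def silhouette(clusters):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for pi, p in enumerate(cluster):
            if len(cluster) == 1:
                scores.append(0.0)  # convention for single-member clusters
                continue
            # a: mean distance to the other members of the same cluster.
            a = sum(math.dist(p, q)
                    for qi, q in enumerate(cluster) if qi != pi)
            a /= len(cluster) - 1
            # b: smallest mean distance to the members of another cluster.
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            largest = max(a, b)
            scores.append((b - a) / largest if largest > 0 else 0.0)
    return sum(scores) / len(scores)

# The same four points, grouped sensibly and grouped badly.
tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
loose = [[(0, 0), (10, 1)], [(10, 0), (0, 1)]]

print(silhouette(tight))  # close to 1: tight, well-separated clusters
print(silhouette(loose))  # negative: points are nearer to the other cluster
```

Values near 1 indicate clusters that are tight relative to their separation. Values near 0 or below suggest the grouping does not reflect the distances in the data.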

Take-Home Message
1) Clustering is, to some extent, in the eye of the beholder.
2) Choose your algorithm/parameters carefully in the light of your particular application.
3) Evaluate your results, particularly:
   a) Does the clustering make sense?
   b) How sensitive is the solution to the input parameters?

Data for software example

Hierarchical R script

Hierarchical results

K-means R script

K-means result

Lipkovich, et al. Defining good and poor outcomes in patients with schizophrenia or schizoaffective disorder: A multidimensional data-driven approach. Psychiatry Research. In press.

Further reading
P. Tryfos, Methods for Business Analysis and Forecasting: Text & Cases, Chapter 15, Cluster Analysis (http://www.yorku.ca/ptryfos/f1500.pdf)
Figures from P. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Chapter 8, Cluster Analysis: Basic Concepts and Algorithms (http://www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf)