Data Clustering: K-means and Hierarchical Clustering


Data Clustering: K-means and Hierarchical Clustering
Piyush Rai
CS5350/6350: Machine Learning
October 4, 2011

What is Data Clustering?
- Data clustering is an unsupervised learning problem.
- Given: N unlabeled examples {x_1, ..., x_N}; the number of partitions K.
- Goal: group the examples into K partitions.
- The only information clustering uses is the similarity between examples: clustering groups examples based on their mutual similarities.
- A good clustering is one that achieves:
  - High within-cluster similarity
  - Low inter-cluster similarity
Picture courtesy: Data Clustering: 50 Years Beyond K-Means, A. K. Jain (2008)

Notions of Similarity
- The choice of similarity measure is very important for clustering.
- Similarity is inversely related to distance.
- Different ways exist to measure distances. Some examples:
  - Euclidean distance: d(x, z) = \|x - z\| = \sqrt{\sum_{d=1}^{D} (x_d - z_d)^2}
  - Manhattan distance: d(x, z) = \sum_{d=1}^{D} |x_d - z_d|
  - Kernelized (non-linear) distance: d(x, z) = \|\phi(x) - \phi(z)\|
- For the left figure above, Euclidean distance may be reasonable; for the right figure, kernelized distance seems more reasonable.
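A minimal sketch of these three distance measures in Python. The kernelized distance is computed via the kernel trick, \|\phi(x) - \phi(z)\|^2 = k(x,x) - 2k(x,z) + k(z,z); the RBF kernel and its `gamma` parameter below are illustrative choices, not from the slides:

```python
import math

def euclidean(x, z):
    # d(x, z) = sqrt(sum_d (x_d - z_d)^2)
    return math.sqrt(sum((xd - zd) ** 2 for xd, zd in zip(x, z)))

def manhattan(x, z):
    # d(x, z) = sum_d |x_d - z_d|
    return sum(abs(xd - zd) for xd, zd in zip(x, z))

def kernelized(x, z, k):
    # ||phi(x) - phi(z)|| without ever computing phi explicitly:
    # ||phi(x) - phi(z)||^2 = k(x, x) - 2 k(x, z) + k(z, z)
    return math.sqrt(max(k(x, x) - 2 * k(x, z) + k(z, z), 0.0))

def rbf(x, z, gamma=1.0):
    # One example kernel (RBF); gamma is a free parameter.
    return math.exp(-gamma * sum((xd - zd) ** 2 for xd, zd in zip(x, z)))

x, z = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, z))         # 5.0
print(manhattan(x, z))         # 7.0
print(kernelized(x, z, rbf))   # ~1.41421 (≈ √2, since k(x, z) ≈ 0 for these far-apart points)
```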

Similarity is Subjective
- Similarity is often hard to define.
- Different similarity criteria can lead to different clusterings.

Data Clustering: Some Real-World Examples
- Clustering images based on their perceptual similarities
- Image segmentation (clustering pixels)
- Clustering webpages based on their content
- Clustering web-search results
- Clustering people in social networks based on user properties/preferences
- ... and many more
Picture courtesy: http://people.cs.uchicago.edu/~pff/segment/

Types of Clustering
1. Flat or partitional clustering (K-means, Gaussian mixture models, etc.)
   - Partitions are independent of each other.
2. Hierarchical clustering (e.g., agglomerative clustering, divisive clustering)
   - Partitions can be visualized using a tree structure (a dendrogram).
   - Does not need the number of clusters as input.
   - Possible to view partitions at different levels of granularity (i.e., clusters can be refined/coarsened) using different K.

Flat Clustering: The K-means Algorithm (Lloyd, 1957)
Input: N examples {x_1, ..., x_N} (x_n ∈ R^D); the number of partitions K
Initialize: K cluster centers µ_1, ..., µ_K. Several initialization options:
- Randomly initialize anywhere in R^D
- Choose any K examples as the cluster centers
Iterate:
- Assign each example x_n to its closest cluster center:
  C_k = {n : k = argmin_k \|x_n - µ_k\|^2}
  (C_k is the set of examples closest to µ_k)
- Recompute the new cluster centers µ_k (the mean/centroid of the set C_k):
  µ_k = \frac{1}{|C_k|} \sum_{n ∈ C_k} x_n
- Repeat while not converged
A possible convergence criterion: the cluster centers do not change anymore.
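The iteration above can be sketched as a short self-contained implementation. It uses the "choose any K examples as centers" initialization from the slide and the "assignments no longer change" convergence test; function and parameter names are illustrative:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Lloyd's algorithm: alternate the assignment and mean-update steps.
    `points` is a list of equal-length tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Initialize: choose any K examples as the cluster centers.
    centers = list(rng.sample(points, k))
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest center (squared Euclidean).
        new_labels = [
            min(range(k),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(pt, centers[j])))
            for pt in points
        ]
        if new_labels == labels:
            break  # converged: assignments (hence centers) no longer change
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            members = [pt for pt, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, labels

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(data, 2)
print(labels)  # the first three points share one label, the last three the other
```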

K-means: Initialization (assume K = 2)

K-means iteration 1: Assigning points

K-means iteration 1: Recomputing the cluster centers

K-means iteration 2: Assigning points

K-means iteration 2: Recomputing the cluster centers

K-means iteration 3: Assigning points

K-means iteration 3: Recomputing the cluster centers

K-means iteration 4: Assigning points

K-means iteration 4: Recomputing the cluster centers

K-means: The Objective Function
- Let µ_1, ..., µ_K be the K cluster centroids (means).
- Let r_{nk} ∈ {0, 1} be an indicator denoting whether point x_n belongs to cluster k.
- The K-means objective minimizes the total distortion (the sum of squared distances of points from their cluster centers):
  J(µ, r) = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \|x_n - µ_k\|^2
- Note: exact optimization of the K-means objective is NP-hard; the K-means algorithm is a heuristic that converges to a local optimum.
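The objective J(µ, r) is straightforward to evaluate for a given clustering. In the sketch below (names are illustrative), `labels[n]` plays the role of the indicator r_{nk}, with r_{nk} = 1 exactly when labels[n] == k:

```python
def distortion(points, centers, labels):
    # J(mu, r) = sum_n sum_k r_nk ||x_n - mu_k||^2
    # Since each point belongs to exactly one cluster, the inner sum
    # reduces to the squared distance to that point's own center.
    return sum(
        sum((p - c) ** 2 for p, c in zip(pt, centers[l]))
        for pt, l in zip(points, labels)
    )

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centers = [(1.0, 0.0), (10.0, 0.0)]
labels = [0, 0, 1]
print(distortion(points, centers, labels))  # 1 + 1 + 0 = 2.0
```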

K-means: Choosing the Number of Clusters K
- One way to select K for the K-means algorithm is to try different values of K, plot the K-means objective versus K, and look at the elbow point in the plot.
- For the above plot, K = 2 is the elbow point.
Picture courtesy: Pattern Recognition and Machine Learning, Chris Bishop (2006)

K-means: Initialization Issues
- K-means is extremely sensitive to cluster-center initialization.
- Bad initialization can lead to:
  - poor convergence speed
  - a bad overall clustering
- Safeguarding measures:
  - Choose the first center as one of the examples, the second as the example farthest from the first, the third as the example farthest from both, and so on.
  - Try multiple initializations and choose the best result.
  - Use other, smarter initialization schemes (e.g., see the K-means++ algorithm by Arthur and Vassilvitskii).
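The first safeguard above (farthest-first traversal) can be sketched as follows; "farthest from both" is interpreted, as is standard, as maximizing the distance to the nearest center chosen so far:

```python
def farthest_first_centers(points, k):
    """Pick the first center as one of the examples (here: the first),
    then repeatedly add the point farthest from all centers chosen so far."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    centers = [points[0]]
    while len(centers) < k:
        # For each candidate, compute its distance to the nearest existing
        # center; pick the candidate for which that distance is largest.
        nxt = max(points, key=lambda p: min(sqdist(p, c) for c in centers))
        centers.append(nxt)
    return centers

pts = [(0, 0), (1, 0), (10, 10), (0, 1), (20, 0)]
print(farthest_first_centers(pts, 3))  # [(0, 0), (20, 0), (10, 10)]
```

K-means++ softens this deterministic rule by sampling each new center with probability proportional to its squared distance from the nearest chosen center, which is less vulnerable to outliers.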

K-means: Limitations
- Makes hard assignments of points to clusters:
  - A point either completely belongs to a cluster or does not belong at all.
  - There is no notion of a soft assignment (i.e., a probability of being assigned to each cluster: say K = 3 and, for some point x_n, p_1 = 0.7, p_2 = 0.2, p_3 = 0.1).
  - Gaussian mixture models and fuzzy K-means allow soft assignments.
- Sensitive to outlier examples (such examples can affect the mean by a lot):
  - The K-medians algorithm is a more robust alternative for data with outliers (reason: the median is more robust than the mean in the presence of outliers).
- Works well only for clusters that are roughly round-shaped and of roughly equal sizes/densities:
  - Does badly if the clusters have non-convex shapes.
  - Spectral clustering or kernelized K-means can be an alternative.

K-means Limitations Illustrated
- Non-convex/non-round-shaped clusters: standard K-means fails!
- Clusters with different densities
Picture courtesy: Christof Monz (Queen Mary, Univ. of London)

Hierarchical Clustering
Agglomerative (bottom-up) clustering:
1. Start with each example in its own singleton cluster.
2. At each time-step, greedily merge the 2 most similar clusters.
3. Stop when there is a single cluster containing all examples; else go to 2.
Divisive (top-down) clustering:
1. Start with all examples in the same cluster.
2. At each time-step, remove the outsiders from the least cohesive cluster.
3. Stop when each example is in its own singleton cluster; else go to 2.
Agglomerative clustering is more popular and simpler than divisive clustering (but less accurate).
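The agglomerative procedure above can be sketched as a short loop. The slides merge all the way down to a single cluster; the sketch below stops once `target_k` clusters remain, a common practical variant (names and the default single-linkage distance are illustrative choices):

```python
def agglomerative(points, target_k, linkage=None):
    """Bottom-up clustering: start with singletons, repeatedly merge
    the two closest clusters until `target_k` clusters remain."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    if linkage is None:
        # Default cluster-to-cluster dissimilarity: single (min) linkage.
        linkage = lambda R, S: min(sqdist(a, b) for a in R for b in S)

    clusters = [[p] for p in points]  # each example in its own singleton cluster
    while len(clusters) > target_k:
        # Find the pair of clusters with the smallest linkage distance...
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # ...and greedily merge them.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], 2))
```

Recording the sequence of merges (instead of stopping early) yields the dendrogram, which can then be cut at any level.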

Hierarchical Clustering: (Dis)similarity Between Clusters
- We know how to compute the dissimilarity d(x_i, x_j) between two examples. How do we compute the dissimilarity between two clusters R and S?
- Min-link or single-link (results in chaining; clusters can get very large):
  d(R, S) = \min_{x_R ∈ R, x_S ∈ S} d(x_R, x_S)
- Max-link or complete-link (results in small, round-shaped clusters):
  d(R, S) = \max_{x_R ∈ R, x_S ∈ S} d(x_R, x_S)
- Average-link (a compromise between single and complete linkage):
  d(R, S) = \frac{1}{|R| |S|} \sum_{x_R ∈ R, x_S ∈ S} d(x_R, x_S)
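The three linkage rules translate directly into code. The sketch below takes any pairwise distance `d` and works over all |R|·|S| cross-cluster pairs; the 1-D absolute-difference distance in the example is just for illustration:

```python
def single_link(R, S, d):
    # d(R, S) = min over all cross-cluster pairs
    return min(d(a, b) for a in R for b in S)

def complete_link(R, S, d):
    # d(R, S) = max over all cross-cluster pairs
    return max(d(a, b) for a in R for b in S)

def average_link(R, S, d):
    # d(R, S) = (1 / (|R| |S|)) * sum over all cross-cluster pairs
    return sum(d(a, b) for a in R for b in S) / (len(R) * len(S))

dist = lambda a, b: abs(a - b)  # simple 1-D distance for the example
R, S = [0, 1], [4, 6]
print(single_link(R, S, dist))    # 3   (closest pair: 1 and 4)
print(complete_link(R, S, dist))  # 6   (farthest pair: 0 and 6)
print(average_link(R, S, dist))   # 4.5 ((4 + 6 + 3 + 5) / 4)
```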

Flat vs Hierarchical Clustering
- Flat clustering produces a single partitioning; hierarchical clustering can give different partitionings depending on the level of resolution we are looking at.
- Flat clustering needs the number of clusters to be specified; hierarchical clustering doesn't.
- Flat clustering is usually more efficient run-time-wise; hierarchical clustering can be slow (it has to make several merge/split decisions).
- There is no clear consensus on which of the two produces better clusterings.