Unsupervised learning: Clustering




Unsupervised learning: Clustering
Salissou Moutari
Centre for Statistical Science and Operational Research (CenSSOR)
17th September 2013

Outline
1. Introduction: What is unsupervised learning? Fundamental aspects of clustering.
2. Clustering algorithms: hierarchical clustering; partitional clustering.
3. Clustering evaluation metrics.

Introduction: What is unsupervised learning?
Problem: Given a set of records (e.g. observations or variables) with no target attribute, organise them into groups without advance knowledge of how the groups are defined.
Unsupervised learning: Unsupervised learning consists of approaches that attempt to address the above problem by exploring the unlabelled data to find intrinsic natural structures within them.


Examples of unsupervised learning approaches: clustering, self-organising maps, association rules, blind signal separation, etc. This session will focus on clustering.
Why clustering? Clustering is one of the most widely used unsupervised learning techniques.


Fundamental aspects of clustering
Definition: Clustering, also termed cluster analysis, is the collection of methods for grouping unlabelled data into subsets (called clusters) that are believed to reflect the underlying structure of the data, based on similarity within the data.
What is clustering for? Identification of new tumour classes using gene expression profiles; identification of groups of co-regulated genes, e.g. using a large number of yeast experiments; grouping similar proteins together with respect to their chemical structure and/or functionality; detection of experimental artifacts.


Basic concepts: Clustering deals with data for which the groups are unknown and undefined, so we need to conceptualise the groups. Two notions are central: the intra-cluster distance (how close the points within a cluster are to each other) and the inter-cluster distance (how far apart the clusters are from each other).

Challenges:
1. Definition of the inter-cluster and intra-cluster distances.
2. The number of clusters.
3. The type of clusters.
4. Cluster quality.
How many clusters are there for a given data set?


The notion of a cluster can be ambiguous: for the same data, two clusters and six clusters may both be plausible answers to the question "How many clusters?" (example after Tan, Steinbach, and Kumar, Introduction to Data Mining).

Definition of the intra-cluster distance: the type of distance measurement used to determine how close two data points are to each other. It is commonly called the distance, similarity or dissimilarity measure.
Definition of the inter-cluster distance: the type of distance measurement used to determine how close two clusters are to each other. It is commonly called the linkage function or linkage criterion. It is often both data (cluster shape) and context dependent, and may depend on the distance measure.

Distance measures: fundamental axioms
Assume that the data lie in an n-dimensional Euclidean space, and let $x = [x_1, x_2, \ldots, x_n]$, $y = [y_1, y_2, \ldots, y_n]$ and $z = [z_1, z_2, \ldots, z_n]$ define three data points. The fundamental axioms of a distance measure $d$ are:
1. $d(x, x) = 0$
2. $d(x, y) = d(y, x)$
3. $d(x, y) \le d(x, z) + d(z, y)$
Remark: The choice of a distance measure will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another.


Distance measures: examples of distance metrics
Some commonly used metrics for clustering include:
Euclidean distance ($L_2$ norm): $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Manhattan distance ($L_1$ norm): $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
Chebychev distance ($L_\infty$ norm): $d(x, y) = \max_{i=1,\ldots,n} |x_i - y_i|$
Minkowski distance ($L_p$ norm): $d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^T R^{-1} (x - y)}$, where $R$ denotes the covariance matrix associated with the data.
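The $L_p$-family metrics above translate directly into code. The sketch below (in plain Python rather than the R used for the illustrations later in these slides; all function names are our own) implements them and checks the triangle-inequality axiom on a small example; the Mahalanobis distance is omitted since it needs a matrix inverse.

```python
import math

def euclidean(x, y):
    # L2 norm: square root of the summed squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # L1 norm: summed absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    # L-infinity norm: largest absolute coordinate difference
    return max(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # Lp norm; p = 1 gives Manhattan, p = 2 gives Euclidean
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y, z = (0.0, 0.0), (3.0, 4.0), (1.0, 1.0)
print(euclidean(x, y))   # 5.0
print(manhattan(x, y))   # 7.0
print(chebychev(x, y))   # 4.0
# Axiom 3 (triangle inequality) holds for each metric:
for d in (euclidean, manhattan, chebychev):
    assert d(x, y) <= d(x, z) + d(z, y)
```

As the Remark notes, the metrics can disagree about which points are close: with $x=(0,0)$, the point $(3,3)$ is farther than $(4,0)$ under the Euclidean distance but nearer under the Chebychev distance.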

Linkage criteria: examples of linkage functions
Let $C_1$ and $C_2$ be two candidate clusters and let $d$ be the chosen distance metric. Commonly used linkage functions between $C_1$ and $C_2$ include:
Single linkage: $f(C_1, C_2) = \min \{ d(x, y) : x \in C_1, y \in C_2 \}$
Complete linkage: $f(C_1, C_2) = \max \{ d(x, y) : x \in C_1, y \in C_2 \}$
Average linkage: $f(C_1, C_2) = \frac{1}{|C_1|\,|C_2|} \sum_{x \in C_1} \sum_{y \in C_2} d(x, y)$
Ward's criterion: $f(C_1, C_2) = \frac{|C_1|\,|C_2|}{|C_1| + |C_2|} \, \| \mu_1 - \mu_2 \|^2$, where $\mu_i$ is the centre of cluster $i$.
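The four linkage functions above can be computed directly from their definitions. The following sketch (plain Python with illustrative names, using the Euclidean distance as $d$) evaluates them for two small clusters:

```python
import math

def dist(x, y):
    # Euclidean distance, the chosen metric d
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single(c1, c2):
    # distance between the closest pair of points across the clusters
    return min(dist(x, y) for x in c1 for y in c2)

def complete(c1, c2):
    # distance between the farthest pair of points across the clusters
    return max(dist(x, y) for x in c1 for y in c2)

def average(c1, c2):
    # mean of all pairwise cross-cluster distances
    return sum(dist(x, y) for x in c1 for y in c2) / (len(c1) * len(c2))

def ward(c1, c2):
    # |C1||C2| / (|C1|+|C2|) * squared distance between cluster centres
    mu1 = [sum(p[i] for p in c1) / len(c1) for i in range(len(c1[0]))]
    mu2 = [sum(p[i] for p in c2) / len(c2) for i in range(len(c2[0]))]
    return len(c1) * len(c2) / (len(c1) + len(c2)) * dist(mu1, mu2) ** 2

c1 = [(0.0, 0.0), (0.0, 1.0)]
c2 = [(0.0, 3.0), (0.0, 4.0)]
print(single(c1, c2), complete(c1, c2), average(c1, c2), ward(c1, c2))
# single = 2.0 (closest pair), complete = 4.0 (farthest pair),
# average = 3.0, ward = 9.0 (centres (0, 0.5) and (0, 3.5))
```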

Clustering algorithms
Hierarchical clustering: creates a hierarchical decomposition of a data set by finding successive clusters using previously established clusters. Hierarchical clustering methods produce a tree diagram, known as a dendrogram or phenogram, which can be built in two distinct ways: bottom-up, known as agglomerative clustering, and top-down, called divisive clustering.
Partitional clustering: decomposes the data set into a set of disjoint clusters, i.e. a set of non-overlapping clusters such that each data point is in exactly one cluster.


Hierarchical clustering
Agglomerative clustering: start with the points as individual clusters; at each step, merge the closest pair of clusters, until all the data points are in a single cluster or certain termination conditions are satisfied.
Divisive clustering: start with one, all-inclusive cluster; at each step, split a cluster, until each cluster contains a single data point or certain termination conditions are satisfied.


Agglomerative clustering: algorithm
The algorithm forms clusters in a bottom-up manner, as follows:
1. Initially, put each data point in its own cluster.
2. Among all current clusters, pick the two clusters that optimise the chosen linkage function.
3. Replace these two clusters with a new cluster, formed by merging the two original ones.
4. Repeat steps 2 and 3 until only one cluster remains in the pool, or certain termination conditions are satisfied.
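The bottom-up loop above can be sketched in a few lines. This is a minimal illustration (plain Python, our own function names), using single linkage and "stop at k clusters" as the termination condition; production use would rely on a library such as R's hclust, shown next in the slides.

```python
import math

def dist(x, y):
    # Euclidean distance between two points
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(c1, c2):
    # distance between the closest pair of points across the two clusters
    return min(dist(x, y) for x in c1 for y in c2)

def agglomerate(points, k):
    # Step 1: each point starts in its own cluster
    clusters = [[p] for p in points]
    # Steps 2-4: repeatedly merge the closest pair of clusters
    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(agglomerate(pts, 2))  # recovers the two well-separated groups
```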

Agglomerative clustering: illustration with R
Distance measure: the function dist(x, method="metric") returns the distance matrix of a numerical matrix x using a specified metric, which must be one of the following: "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski".
Clustering: the function hclust(d, method="linkage") performs hierarchical agglomerative clustering using a given distance matrix d and a specified linkage function, which must be one of the following: "single", "complete", "average", "mcquitty", "median" or "centroid".

Let us consider the following data set X (shown as a scatter plot in the slides).

R script:
library(stats)
library(ggdendro)   # provides ggdendrogram()
d <- dist(X, method = "euclidean")
hc <- hclust(d, method = "single")
ggdendrogram(hc, theme_dendro = FALSE)
Agglomerative clustering using the Euclidean distance measure and single linkage.

Single linkage: impact of the choice of the distance measure (dendrograms for the Euclidean distance vs. the Chebychev distance).

Complete linkage: impact of the choice of the distance measure (dendrograms for the Euclidean distance vs. the Chebychev distance).

Average linkage: impact of the choice of the distance measure (dendrograms for the Euclidean distance vs. the Chebychev distance).

Euclidean distance: impact of the choice of the linkage function (dendrograms for single, complete and average linkage).

Chebychev distance: impact of the choice of the linkage function (dendrograms for single, complete and average linkage).

Agglomerative clustering: advantages and limitations
Advantages: no a priori information about the number of clusters is required; easy to implement; the obtained results may correspond to meaningful taxonomies.
Limitations: the algorithm does not allow undoing what was done previously; interpretation of the hierarchy can be complex or even confusing; depending on the type of distance matrix used, the algorithm (1) can be sensitive to noise and outliers, (2) tends to break large clusters, and (3) can hardly handle different-sized clusters.


Divisive clustering: algorithm
The algorithm forms clusters in a top-down manner, as follows:
1. Initially, put all objects in one cluster.
2. Among all current clusters, pick the one that satisfies a specified criterion and split it using a specified method.
3. Replace this cluster with the new clusters formed by splitting the original one.
4. Repeat steps 2 and 3 until all clusters are singletons, or certain termination conditions are satisfied.

Divisive clustering: illustration with R
Clustering: the function diana(x, diss = inherits(x, "dist"), metric = "metric") performs hierarchical divisive clustering of a numerical matrix x using a specified distance metric, which must be one of the following: "euclidean" or "manhattan". Let us consider the following data set X (shown as a scatter plot in the slides).

R script:
library(cluster)
dc <- diana(X, diss = inherits(X, "dist"), metric = "euclidean")
plot(dc)
Divisive clustering using the Euclidean distance measure.

Divisive clustering: impact of the choice of the distance measure (Euclidean distance vs. Manhattan distance).

Divisive clustering: advantages and limitations
Advantages: no a priori information about the number of clusters is required; the obtained result may correspond to meaningful taxonomies.
Limitations: the algorithm does not allow undoing what was done previously; there are computational difficulties when considering all possible divisions into two groups; depending on the type of distance matrix used, the algorithm (1) can be sensitive to noise and outliers, and (2) tends to break large clusters.

Partitional clustering
Basic concept: given k, the number of clusters, partitional clustering algorithms construct a partition of a data set into k clusters that optimises the chosen partitioning criterion.
Partitioning techniques:
1. Global optimal method: exhaustive enumeration of all partitions (an NP-hard problem).
2. Heuristic methods: e.g. k-means clustering, where each cluster is represented by its centre, and k-medoids clustering or PAM (Partitioning Around Medoids), where each cluster is represented by one of its members.


k-means clustering: basic concept
Given an integer k and a set X of n points (n ≥ k) in an m-dimensional Euclidean space, denoted by $X = \{ x_i = (x_{i1}, \ldots, x_{im})^T \in \mathbb{R}^m,\ i = 1, \ldots, n \}$, find an assignment of the n points into k disjoint clusters $C = (C_1, \ldots, C_k)$, centred at cluster means $\mu_j$ ($j = 1, \ldots, k$), based on a certain criterion, e.g. by minimising
$$f(X, C) = \sum_{j=1}^{k} \sum_{i=1}^{|C_j|} \left\| x_i^{(j)} - \mu_j \right\|^2,$$
where $|C_j|$ is the number of points in cluster $C_j$, and $x_i^{(j)}$ is the point $i$ in $C_j$.

k-means clustering: algorithm
The k-means clustering algorithm can be summarised as follows:
1. Select k data points randomly in a domain containing all the points in the data set. These k points represent the centres of the initial clusters.
2. Assign each point to the cluster that has the closest centre.
3. Recompute the cluster centres (means) using the current cluster memberships.
4. Repeat steps 2 and 3 until the centres no longer change, or certain termination conditions are satisfied.
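Steps 2-4 above can be sketched directly. This minimal illustration (plain Python, our own names) takes the initial centres as an argument instead of choosing them at random (step 1), so that its behaviour is reproducible:

```python
def kmeans(points, centres, max_iter=100):
    """Iterate the k-means update; `centres` are the initial cluster centres."""
    clusters = [[] for _ in centres]
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centre
        clusters = [[] for _ in centres]
        for p in points:
            j = min(range(len(centres)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centres[j])))
            clusters[j].append(p)
        # Step 3: recompute each centre as the mean of its cluster
        new_centres = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centres[j]
            for j, c in enumerate(clusters)
        ]
        # Step 4: stop when the centres no longer change
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centres, clusters = kmeans(pts, [(0.0, 0.0), (10.0, 10.0)])
print(centres)  # [(0.0, 0.5), (10.0, 10.5)]
```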

k-means clustering: illustration with R
Clustering: the function kmeans(x, centers, iter.max = 1000, nstart = 10) performs k-means clustering given a numerical matrix of data x, the number of centres (or a set of initial centres), the maximum number of iterations, and the number of random initial sets to be chosen when centers is given as a number. Let us consider the following data set X (shown as a scatter plot in the slides).

R script:
library(stats)
kc <- kmeans(X, centers = 4, iter.max = 1000, nstart = 10000)
k-means clustering using four clusters.

k-means clustering: impact of the choice of the number of clusters (three clusters vs. four clusters).

Impact of the choice of the number of clusters (five clusters vs. six clusters).

Impact of the choice of the number of clusters: plot of the number of clusters vs. the within-cluster sum of squares (the "elbow" of this curve is often used to choose a suitable number of clusters).

k-means clustering: advantages and limitations
Advantages: relatively easy to implement; a simple iterative algorithm works quite well in practice.
Limitations: the number of clusters k must be specified in advance; applicable only when the mean is defined, hence it cannot handle categorical data; not suitable for discovering clusters with non-convex shapes; unable to handle noisy data and outliers.

k-medoids clustering: basic concept
Given an integer k and a set X of n points (n ≥ k) in an m-dimensional Euclidean space, denoted by $X = \{ x_i = (x_{i1}, \ldots, x_{im})^T \in \mathbb{R}^m,\ i = 1, \ldots, n \}$, find an assignment of the n points into k disjoint clusters $C = (C_1, \ldots, C_k)$, centred at cluster points $m_j$ ($j = 1, \ldots, k$) called medoids, based on a certain criterion, e.g. by minimising
$$f(X, C) = \sum_{j=1}^{k} \sum_{i=1}^{|C_j|} \left\| x_i^{(j)} - m_j \right\|,$$
where $|C_j|$ is the number of points in cluster $C_j$, and $x_i^{(j)}$ is the point $i$ in $C_j$.

k-medoids clustering: the PAM (Partitioning Around Medoids) algorithm
PAM is a k-medoids clustering algorithm similar to the k-means algorithm. It can be summarised as follows:
1. Select randomly k data points from the given data set. These k points represent the medoids of the initial clusters.
2. Assign each point to the cluster that has the closest medoid.
3. Iteratively replace one of the medoids by one of the non-medoids whenever this improves the chosen criterion.
4. Repeat steps 2 and 3 until the medoids no longer change, or certain termination conditions are satisfied.
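The medoid-swapping loop of step 3 can be sketched as follows. This is a deliberately minimal illustration (plain Python, our own names, one-dimensional points with absolute difference as the distance, and initial medoids passed in rather than sampled); a full implementation is provided by pam() in R's cluster package, used via CLARA below.

```python
def cost(points, medoids):
    # Total distance from each point to its closest medoid (the PAM criterion)
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, medoids):
    """Greedy medoid swapping; `medoids` are the initial medoids (step 1)."""
    medoids = list(medoids)
    improved = True
    while improved:
        improved = False
        best = medoids
        # Step 3: consider swapping each medoid with each non-medoid point
        for i in range(len(medoids)):
            for p in points:
                if p in medoids:
                    continue
                candidate = medoids[:i] + [p] + medoids[i + 1:]
                if cost(points, candidate) < cost(points, best):
                    best = candidate
                    improved = True
        medoids = best  # Step 4: repeat until the medoids no longer change
    return sorted(medoids)

pts = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
print(pam(pts, [0.0, 10.0]))  # [1.0, 11.0]: the middle point of each group
```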

PAM and CLARA
PAM: advantages: works effectively for small data sets. Limitations: does not scale well to large data sets.
CLARA (Clustering Large Applications): based on drawing multiple samples from the data set and applying PAM to each sample, it returns the best resulting clustering as the output. Advantages: deals with larger data sets than PAM. Limitations: its efficiency depends on the sample size.

k-medoids clustering: illustration with R
CLARA: the function clara(x, k, metric = "metric", samples = r) performs CLARA clustering given a numerical matrix of data x, the number of clusters, the distance metric, and the number of samples to be drawn from the data set. Let us consider the following data set X (shown as a scatter plot in the slides).

R script:
library(cluster)
km <- clara(X, 5, metric = "euclidean", samples = 10)
CLARA clustering using 5 clusters and 10 samples.

CLARA: impact of the choice of the distance metric (Euclidean distance vs. Manhattan distance).

Clustering evaluation metrics
So... which method should we use for the data set X?
Hierarchical clustering? If yes, agglomerative or divisive? For either method: (1) which distance metric and/or linkage function? (2) where to cut the dendrogram?
Partitional clustering? If yes, k-means or CLARA? For either method: (1) which distance metric? (2) how many clusters?


Clustering evaluation metrics Silhouette coefficient Provides a graphical representation of how well each object lies within its cluster. The silhouette coefficient of a data point i is defined as s_i = (b_i - a_i) / max(a_i, b_i), where a_i denotes the average distance between the data point i and all other data points in its cluster, and b_i denotes the minimum average distance between i and the data points of another cluster. Data points with a large silhouette coefficient s_i are well clustered; those with a small s_i tend to lie between clusters.
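The definition above can be computed directly. The following Python sketch (not from the slides) evaluates s_i = (b_i - a_i) / max(a_i, b_i) for a made-up one-dimensional data set with a made-up cluster assignment:

```python
# Toy data: two compact 1-D clusters and their (assumed) labels.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
labels = [0, 0, 0, 1, 1, 1]

def dist(p, q):
    return abs(p - q)

def silhouette(i):
    # a_i: average distance to the other points of i's own cluster.
    own = [p for j, p in enumerate(points) if labels[j] == labels[i] and j != i]
    a = sum(dist(points[i], p) for p in own) / len(own)
    # b_i: minimum over the other clusters of the average distance to that cluster.
    others = set(labels) - {labels[i]}
    b = min(
        sum(dist(points[i], p) for j, p in enumerate(points) if labels[j] == c)
        / sum(1 for l in labels if l == c)
        for c in others
    )
    return (b - a) / max(a, b)

scores = [silhouette(i) for i in range(len(points))]
```

Since the two toy clusters are tight and far apart, every score is close to 1, i.e. every point is well clustered.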

Clustering evaluation metrics Classification-oriented measures These measures use a classification approach to compare a clustering against ground-truth class labels. Some of these measures are: 1 Entropy 2 Purity 3 Precision 4 Recall 5 F-measure

Clustering evaluation metrics Entropy Measures the degree to which each cluster consists of data points of a single class. The entropy of a cluster i is given by E_i = - sum_{j=1}^{l} (n_ij / n_i) log(n_ij / n_i), where n_ij is the number of data points of class j in cluster i, n_i is the number of data points in cluster i, and l is the number of classes. The total entropy for a set of clusters is given by E = sum_{i=1}^{k} (n_i / n) E_i, where k is the number of clusters and n is the total number of data points.
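The two sums above can be sketched in a few lines of Python (not from the slides), using a made-up cluster-by-class contingency table:

```python
import math

# counts[i][j] = number of data points of class j that ended up in cluster i.
counts = [
    [9, 1],   # cluster 0: mostly class 0
    [2, 8],   # cluster 1: mostly class 1
]

def cluster_entropy(row):
    # E_i = -sum_j (n_ij / n_i) * log(n_ij / n_i); empty classes contribute 0.
    n_i = sum(row)
    return -sum((c / n_i) * math.log(c / n_i) for c in row if c > 0)

# Total entropy E = sum_i (n_i / n) * E_i.
n = sum(sum(row) for row in counts)
total_entropy = sum((sum(row) / n) * cluster_entropy(row) for row in counts)
```

Lower values are better: a cluster containing a single class has entropy 0, while a 50/50 mix reaches log 2.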

Clustering evaluation metrics Purity Measures the extent to which a cluster contains data points of a single class. Using the previous notation, the purity of a cluster i is given by Pur_i = max_j (n_ij / n_i), whereas the overall purity of a clustering is given by Pur = sum_{i=1}^{k} (n_i / n) Pur_i.
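A short Python sketch of the purity formulas (not from the slides), again on a made-up cluster-by-class contingency table:

```python
# counts[i][j] = number of data points of class j in cluster i.
counts = [
    [9, 1],   # cluster 0 is 90% class 0
    [2, 8],   # cluster 1 is 80% class 1
]

n = sum(sum(row) for row in counts)

# Pur_i = max_j (n_ij / n_i) for each cluster i.
purity_per_cluster = [max(row) / sum(row) for row in counts]

# Pur = sum_i (n_i / n) * Pur_i; the n_i factors cancel, leaving
# the sum of the per-cluster maxima divided by n.
overall_purity = sum(max(row) for row in counts) / n
```

Higher values are better; a clustering that puts each class in its own cluster has purity 1.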

Clustering evaluation metrics Precision Measures the fraction of a cluster that consists of objects of a specified class. Using the previous notation, the precision of cluster i with respect to class j is given by Pre(i, j) = n_ij / n_i. Recall Measures the extent to which a cluster contains all objects of a specified class. The recall of cluster i with respect to class j is given by Rec(i, j) = n_ij / n_j, where n_ij is the number of data points of class j in cluster i and n_j is the number of data points in class j.
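Both quantities read directly off the contingency table. A Python sketch (not from the slides), with made-up counts:

```python
# counts[i][j] = number of data points of class j in cluster i.
counts = [
    [9, 1],   # cluster 0
    [2, 8],   # cluster 1
]

def precision(i, j):
    # Pre(i, j) = n_ij / n_i: row entry over the row (cluster) total.
    return counts[i][j] / sum(counts[i])

def recall(i, j):
    # Rec(i, j) = n_ij / n_j: row entry over the column (class) total.
    n_j = sum(row[j] for row in counts)
    return counts[i][j] / n_j
```

For example, cluster 0 with respect to class 0 has precision 9/10 (90% of the cluster is class 0) but recall only 9/11 (two class-0 points landed in cluster 1).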

Clustering evaluation metrics F-measure Combines the precision and the recall to measure the extent to which a cluster contains only data points of a particular class and all data points of that class. The F-measure of cluster i with respect to class j is given by F(i, j) = 2 Pre(i, j) Rec(i, j) / (Pre(i, j) + Rec(i, j)).
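The F-measure is the harmonic mean of precision and recall. A minimal Python sketch (not from the slides), plugging in made-up values Pre = 9/10 and Rec = 9/11:

```python
def f_measure(pre, rec):
    # F(i, j) = 2 * Pre(i, j) * Rec(i, j) / (Pre(i, j) + Rec(i, j)).
    return 2 * pre * rec / (pre + rec)

# Example: Pre(0, 0) = 9/10 and Rec(0, 0) = 9/11 give F = 6/7.
f = f_measure(9 / 10, 9 / 11)
```

Being a harmonic mean, F is high only when precision and recall are both high; a cluster that is pure but captures few points of the class (or vice versa) scores poorly.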

End Thank you for your attention! Unsupervised learning: Clustering 52 / 52