Clustering. Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods. CS 5751 Machine Learning


Clustering
- Unsupervised learning
- Generating classes
- Distance/similarity measures
- Agglomerative methods
- Divisive methods

Data Clustering 1

What is Clustering?
- A form of unsupervised learning: no information from a teacher
- The process of partitioning a set of data into a set of (hopefully) meaningful sub-classes, called clusters
- Cluster: a collection of data points that
  - are similar to one another and collectively should be treated as a group
  - as a collection, are sufficiently different from other groups

Clusters
[Figure not reproduced in the transcript]

Characterizing Cluster Methods
- Class: the label applied by the clustering algorithm
  - hard versus fuzzy
    - hard: a point either is or is not a member of a cluster
    - fuzzy: a point is a member of a cluster with some probability
- Distance (similarity) measure: a value indicating how similar data points are
- Deterministic versus stochastic
  - deterministic: the same clusters are produced every time
  - stochastic: different clusters may result
- Hierarchical: points are connected into clusters using a hierarchical structure

Basic Clustering Methodology
Two approaches:
- Agglomerative: pairs of items/clusters are successively linked to produce larger clusters
- Divisive (partitioning): items are initially placed in one cluster and successively divided into separate groups

Cluster Validity
- One difficult question: how good are the clusters produced by a particular algorithm?
- It is difficult to develop an objective measure
- Some approaches:
  - external assessment: compare the clustering to an a priori clustering
  - internal assessment: determine whether the clustering is intrinsically appropriate for the data
  - relative assessment: compare one clustering method's results to another method's
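
As one possible illustration of external assessment, the sketch below scores the agreement between a clustering and an a priori labelling with the Rand index. The slides do not name a specific measure, so the Rand index, the function name `rand_index`, and the toy labels are all assumptions made for this example.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two points
    in the same cluster, or both put them in different clusters.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Hypothetical labels: an a priori grouping and two candidate clusterings.
reference = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
crossed = [0, 1, 0, 1]

print(rand_index(reference, same_up_to_renaming))
print(rand_index(reference, crossed))
```

The score depends only on which points are grouped together, not on the label names, which is why the relabelled clustering still matches the reference perfectly.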

Basic Questions
- Data preparation: getting/setting up data for clustering
  - extraction
  - normalization
- Similarity/distance measure: how is the distance between points defined?
- Use of domain knowledge (prior knowledge)
  - can influence preparation and the similarity/distance measure
- Efficiency: how to construct clusters in a reasonable amount of time

Distance/Similarity Measures
- Key to grouping points
  - distance = inverse of similarity
- Often based on a representation of objects as feature vectors

An Employee DB

ID  Gender  Age   Salary
1   F       27    19,000
2   M       51    64,000
3   M       52   100,000
4   F       33    55,000
5   M       45    45,000

Term Frequencies for Documents

      T1  T2  T3  T4  T5  T6
Doc1   0   4   0   0   0   2
Doc2   3   1   4   3   1   2
Doc3   3   0   0   0   3   0
Doc4   0   1   0   3   0   0
Doc5   2   2   2   3   1   4

Which objects are more similar?
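
The Employee DB hints at why normalization matters: Salary spans tens of thousands while Age spans a few decades, so an unscaled distance is driven almost entirely by Salary. The sketch below is only an illustration under stated assumptions: it uses the Age and Salary columns, ignores Gender, and picks min-max scaling and Euclidean distance, none of which the slides prescribe. The helper names are hypothetical.

```python
import math

# (Age, Salary) per employee ID, copied from the Employee DB table.
EMPLOYEES = {
    1: (27, 19000),
    2: (51, 64000),
    3: (52, 100000),
    4: (33, 55000),
    5: (45, 45000),
}

def min_max_scale(table):
    """Rescale every feature to the range [0, 1] (an assumed choice)."""
    columns = list(zip(*table.values()))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return {
        key: tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for key, row in table.items()
    }

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(table, key):
    """ID of the closest other row under Euclidean distance."""
    others = [k for k in table if k != key]
    return min(others, key=lambda k: euclidean(table[key], table[k]))

print(nearest(EMPLOYEES, 2))                 # neighbour of employee 2 on raw values
print(nearest(min_max_scale(EMPLOYEES), 2))  # neighbour after scaling
```

On the raw values employee 2 is closest to employee 4 (similar salary, very different age); after scaling, employee 5 becomes the nearest neighbour, so the answer to "which objects are more similar?" depends on the preparation step.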

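Distance/Similarity Measures
Properties of measures:
- based on feature values $x_{\text{instance\#},\,\text{feature\#}}$
- for all objects $x_i, x_j$: $\mathrm{dist}(x_i, x_j) \ge 0$ and $\mathrm{dist}(x_i, x_j) = \mathrm{dist}(x_j, x_i)$
- for any object $x_i$: $\mathrm{dist}(x_i, x_i) = 0$
- $\mathrm{dist}(x_i, x_j) \le \mathrm{dist}(x_i, x_k) + \mathrm{dist}(x_k, x_j)$

Manhattan distance:
$$\mathrm{dist}(x_i, x_j) = \sum_{f=1}^{\text{features}} \left| x_{i,f} - x_{j,f} \right|$$

Euclidean distance:
$$\mathrm{dist}(x_i, x_j) = \sqrt{\sum_{f=1}^{\text{features}} \left( x_{i,f} - x_{j,f} \right)^2}$$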
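
A minimal sketch of the two distances, applied to the Doc1 and Doc2 term-frequency vectors from the earlier table. The function names are illustrative choices, not part of the slides.

```python
import math

def manhattan(a, b):
    """Sum of absolute per-feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared per-feature differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Term frequencies T1..T6 for three documents, from the example table.
doc1 = [0, 4, 0, 0, 0, 2]
doc2 = [3, 1, 4, 3, 1, 2]
doc3 = [3, 0, 0, 0, 3, 0]

print(manhattan(doc1, doc2))   # 3 + 3 + 4 + 3 + 1 + 0
print(euclidean(doc1, doc2))   # sqrt(9 + 9 + 16 + 9 + 1 + 0)

# The listed properties can be spot-checked on these vectors.
print(euclidean(doc1, doc2) == euclidean(doc2, doc1))
print(euclidean(doc1, doc2) <= euclidean(doc1, doc3) + euclidean(doc3, doc2))
```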

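Distance/Similarity Measures
Minkowski distance (p):
$$\mathrm{dist}_p(x_i, x_j) = \left( \sum_{f=1}^{\text{features}} \left| x_{i,f} - x_{j,f} \right|^p \right)^{1/p}$$

Mahalanobis distance:
$$\mathrm{dist}(x_i, x_j) = (x_i - x_j)\, \Sigma^{-1}\, (x_i - x_j)^T$$
where $\Sigma$ is the covariance matrix of the patterns and $\Sigma^{-1}$ is its inverse.

More complex measures:
- Mutual Neighbor Distance (MND): based on a count of the number of neighbors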
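
The sketch below shows how the Minkowski distance generalizes the Manhattan (p = 1) and Euclidean (p = 2) cases, and evaluates the Mahalanobis form for 2-D points. The 2-D restriction, the hand-written 2x2 inverse, and the covariance values are assumptions made to keep the example free of external libraries.

```python
def minkowski(a, b, p):
    """(sum_f |a_f - b_f|^p)^(1/p); p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def mahalanobis_2d(a, b, cov):
    """(a - b) * inverse(cov) * (a - b)^T for 2-D points.

    `cov` is a 2x2 covariance matrix given as nested lists; it is assumed
    to be invertible.
    """
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    inv = [[s22 / det, -s12 / det],
           [-s21 / det, s11 / det]]
    d0, d1 = a[0] - b[0], a[1] - b[1]
    return (d0 * (inv[0][0] * d0 + inv[0][1] * d1)
            + d1 * (inv[1][0] * d0 + inv[1][1] * d1))

doc1 = [0, 4, 0, 0, 0, 2]
doc2 = [3, 1, 4, 3, 1, 2]
print(minkowski(doc1, doc2, 1))
print(minkowski(doc1, doc2, 2))

# Hypothetical 2-D points and covariance matrices.
a, b = (1.0, 2.0), (4.0, 6.0)
print(mahalanobis_2d(a, b, [[1.0, 0.0], [0.0, 1.0]]))  # identity: squared Euclidean
print(mahalanobis_2d(a, b, [[4.0, 0.0], [0.0, 1.0]]))  # first feature down-weighted
```

With the identity covariance the Mahalanobis form reduces to the squared Euclidean distance; a larger variance on a feature shrinks that feature's contribution.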

Distance (Similarity) Matrix
- Similarity (distance) matrix
  - based on the distance or similarity measure we can construct a symmetric matrix of distance (or similarity) values
  - the (i, j) entry in the matrix is the distance (similarity) between items i and j

      I1    I2    ...   In
I1    -     d12   ...   d1n
I2    d21   -     ...   d2n
...   ...   ...   ...   ...
In    dn1   dn2   ...   -

$d_{ij}$ = similarity (or distance) of $D_i$ to $D_j$

Note that $d_{ij} = d_{ji}$ (i.e., the matrix is symmetric), so we only need the lower triangle part of the matrix. The diagonal is all 1's (similarity) or all 0's (distance).

Example: Term Similarities in Documents

      T1  T2  T3  T4  T5  T6  T7  T8
Doc1   0   4   0   0   0   2   1   3
Doc2   3   1   4   3   1   2   0   1
Doc3   3   0   0   0   3   0   3   0
Doc4   0   1   0   3   0   0   2   0
Doc5   2   2   2   3   1   4   0   2

$$\mathrm{sim}(T_i, T_j) = \sum_{k=1}^{N} \left( w_{ik} \times w_{jk} \right)$$

Term-Term Similarity Matrix

     T1  T2  T3  T4  T5  T6  T7
T2    7
T3   16   8
T4   15  12  18
T5   14   3   6   6
T6   14  18  16  18   6
T7    9   6   0   6   9   2
T8    7  17   8   9   3  16   3
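
The term-term matrix can be reproduced from the document table by summing, over the N = 5 documents, the product of the two terms' frequencies. A minimal sketch, with `term_similarity` as an illustrative name and the reading of $w_{ik}$ as the frequency of term $T_i$ in document k taken as an assumption:

```python
# Rows are Doc1..Doc5, columns are T1..T8, copied from the table above.
DOCS = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]

def term_similarity(docs):
    """sim[i][j] = sum over documents k of w_ik * w_jk."""
    columns = list(zip(*docs))  # one frequency vector per term
    return [
        [sum(a * b for a, b in zip(ci, cj)) for cj in columns]
        for ci in columns
    ]

sim = term_similarity(DOCS)

# Print the lower triangle in the same layout as the slide.
for i in range(1, len(sim)):
    print(f"T{i + 1}", sim[i][:i])
```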

Similarity (Distance) Thresholds
A similarity (distance) threshold may be used to mark pairs that are sufficiently similar.

Term-term similarity matrix:

     T1  T2  T3  T4  T5  T6  T7
T2    7
T3   16   8
T4   15  12  18
T5   14   3   6   6
T6   14  18  16  18   6
T7    9   6   0   6   9   2
T8    7  17   8   9   3  16   3

Using a threshold value of 10 in the previous example:

     T1  T2  T3  T4  T5  T6  T7
T2    0
T3    1   0
T4    1   1   1
T5    1   0   0   0
T6    1   1   1   1   0
T7    0   0   0   0   0   0
T8    0   1   0   0   0   1   0
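
Thresholding is a per-entry comparison. The sketch below marks an entry with 1 when its similarity exceeds the threshold; whether the comparison is strict is an assumption (no entry in this example equals 10, so either reading gives the same result).

```python
# Lower triangle of the term-term similarity matrix from the slide:
# row r holds the similarities of T(r+2) to T1..T(r+1).
LOWER = [
    [7],
    [16, 8],
    [15, 12, 18],
    [14, 3, 6, 6],
    [14, 18, 16, 18, 6],
    [9, 6, 0, 6, 9, 2],
    [7, 17, 8, 9, 3, 16, 3],
]

def apply_threshold(lower, threshold):
    """Replace each similarity by 1 if it exceeds the threshold, else 0."""
    return [[1 if value > threshold else 0 for value in row] for row in lower]

binary = apply_threshold(LOWER, 10)
for r, row in enumerate(binary):
    print(f"T{r + 2}", row)
```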

Graph Representation
- The similarity matrix can be visualized as an undirected graph
  - each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix)

     T1  T2  T3  T4  T5  T6  T7
T2    0
T3    1   0
T4    1   1   1
T5    1   0   0   0
T6    1   1   1   1   0
T7    0   0   0   0   0   0
T8    0   1   0   0   0   1   0

[Figure: undirected graph over the nodes T1 to T8 with one edge per 1 in the threshold matrix]

If no threshold is used, then the matrix can be represented as a weighted graph.

Agglomerative Single-Link
- Single-link: connect all points together that are within a threshold distance
- Algorithm:
  1. place all points in a cluster
  2. pick a point to start a cluster
  3. for each point in the current cluster
     - add all points within the threshold that are not already in the cluster
     - repeat until no more items are added to the cluster
  4. remove the points in the current cluster from the graph
  5. repeat step 2 until no more points are in the graph

Example

Term-term similarity matrix:

     T1  T2  T3  T4  T5  T6  T7
T2    7
T3   16   8
T4   15  12  18
T5   14   3   6   6
T6   14  18  16  18   6
T7    9   6   0   6   9   2
T8    7  17   8   9   3  16   3

Threshold matrix (threshold 10):

     T1  T2  T3  T4  T5  T6  T7
T2    0
T3    1   0
T4    1   1   1
T5    1   0   0   0
T6    1   1   1   1   0
T7    0   0   0   0   0   0
T8    0   1   0   0   0   1   0

[Figure: threshold graph over the nodes T1 to T8 with the single-link cluster marked]

All points except T7 end up in one cluster.
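
The single-link procedure amounts to collecting connected components of the threshold graph. The following is a sketch under that reading, with the edge list transcribed from the 1 entries of the threshold matrix; the function name and the breadth-first expansion are illustrative choices.

```python
from collections import deque

NODES = [f"T{i}" for i in range(1, 9)]

# One edge per 1 in the threshold matrix (threshold 10).
EDGES = [
    ("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
    ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
    ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8"),
]

def single_link(nodes, edges):
    """Group nodes into clusters of points reachable through edges."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    remaining = list(nodes)          # points still in the graph
    clusters = []
    while remaining:
        seed = remaining[0]          # pick a point to start a cluster
        cluster = {seed}
        frontier = deque([seed])
        while frontier:              # expand until nothing new is added
            point = frontier.popleft()
            for other in neighbors[point]:
                if other not in cluster:
                    cluster.add(other)
                    frontier.append(other)
        clusters.append(sorted(cluster))
        remaining = [n for n in remaining if n not in cluster]
    return clusters

print(single_link(NODES, EDGES))
```

Running it on the example graph yields one cluster holding every term except T7, plus T7 on its own, which matches the slide.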

Agglomerative Complete-Link (Clique)
- Complete-link (clique): all of the points in a cluster must be within the threshold distance
- In the threshold distance matrix, a clique is a complete graph
- Algorithms are based on finding maximal cliques (once a point is chosen, pick the largest clique it is part of)
  - not an easy problem

Example

Term-term similarity matrix:

     T1  T2  T3  T4  T5  T6  T7
T2    7
T3   16   8
T4   15  12  18
T5   14   3   6   6
T6   14  18  16  18   6
T7    9   6   0   6   9   2
T8    7  17   8   9   3  16   3

Threshold matrix (threshold 10):

     T1  T2  T3  T4  T5  T6  T7
T2    0
T3    1   0
T4    1   1   1
T5    1   0   0   0
T6    1   1   1   1   0
T7    0   0   0   0   0   0
T8    0   1   0   0   0   1   0

[Figure: threshold graph over the nodes T1 to T8 with cliques marked]

Different clusters are possible based on where cliques start.
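
To see how the starting point changes the outcome, the sketch below grows a clique greedily from a seed: a node is added only if it is adjacent to every current member. This greedy pass is a simplification chosen for brevity; it returns a maximal clique containing the seed but is not guaranteed to find the largest one, which is the hard problem the slides mention.

```python
NODES = [f"T{i}" for i in range(1, 9)]

# One edge per 1 in the threshold matrix (threshold 10).
EDGES = {
    ("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
    ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
    ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8"),
}

def adjacent(a, b):
    return (a, b) in EDGES or (b, a) in EDGES

def greedy_clique(seed, nodes=NODES):
    """Grow a clique from `seed`, scanning candidates in list order."""
    clique = [seed]
    for candidate in nodes:
        if candidate not in clique and all(adjacent(candidate, m) for m in clique):
            clique.append(candidate)
    return sorted(clique)

for start in ("T1", "T5", "T8"):
    print(start, greedy_clique(start))
```

Starting from T1 gives a four-node clique, while starting from T5 or T8 gives smaller, different groups, so the final clustering depends on where the cliques start.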

Hierarchical Methods
- Based on some method of representing a hierarchy of data points
- One idea: hierarchical dendrogram (connects points based on similarity)

     T1  T2  T3  T4  T5  T6  T7
T2    7
T3   16   8
T4   15  12  18
T5   14   3   6   6
T6   14  18  16  18   6
T7    9   6   0   6   9   2
T8    7  17   8   9   3  16   3

[Figure: dendrogram with leaves in the order T5, T1, T3, T4, T2, T6, T8, T7]

Hierarchical Agglomerative
- Compute the distance matrix
- Put each data point in its own cluster
- Find the most similar pair of clusters
  - merge the pair of clusters (show the merger in the dendrogram)
  - update the proximity matrix
  - repeat until all patterns are in one cluster
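
The loop above can be sketched as follows. The slides do not fix how the distance between two clusters is measured, so this example assumes single linkage (the minimum pairwise distance); for brevity it also recomputes cluster distances on each pass instead of maintaining an updated proximity matrix. The data points are hypothetical.

```python
def agglomerative(points, dist):
    """Merge the closest pair of clusters until one cluster remains.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]  # each point on its own

    def linkage(c1, c2):
        # Single linkage (assumed): closest pair across the two clusters.
        return min(dist(points[i], points[j]) for i in c1 for j in c2)

    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical 1-D data with absolute difference as the distance.
POINTS = [1.0, 2.0, 6.0, 7.0, 15.0]
history = agglomerative(POINTS, lambda x, y: abs(x - y))
for left, right, d in history:
    print(left, right, d)
```

Cutting the merge history at a chosen distance gives a flat clustering; for instance, stopping before the merge at distance 4.0 leaves the groups {1, 2}, {6, 7} and {15}.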

Partitional Methods
- Divide data points into a number of clusters
- Difficult questions
  - how many clusters?
  - how to divide the points?
  - how to represent a cluster?
- Representing a cluster: often done in terms of a centroid for the cluster
  - the centroid of a cluster minimizes the squared distance between the centroid and all points in the cluster
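
The claim about the centroid can be checked with a short derivation. The symbols below ($C$ for a cluster with $n$ points, $c$ for a candidate representative, $J$ for the summed squared Euclidean distance) are notation introduced here for illustration.

```latex
J(c) = \sum_{x_i \in C} \lVert x_i - c \rVert^2
\qquad\Longrightarrow\qquad
\nabla_c J(c) = -2 \sum_{x_i \in C} (x_i - c) = 0
\qquad\Longrightarrow\qquad
c = \frac{1}{n} \sum_{x_i \in C} x_i
```

Since $J$ is a convex quadratic in $c$, this stationary point is the minimum, so the mean of the cluster's points is the representative with the smallest total squared distance.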

k-means Clustering
1. Choose k cluster centers (randomly pick k data points as centers, or randomly distribute them in space)
2. Assign each pattern to the closest cluster center
3. Recompute the cluster centers using the current cluster memberships (moving centers may change memberships)
4. If a convergence criterion is not met, go to step 2

Convergence criterion:
- no reassignment of patterns
- minimal change in cluster centers
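
The four steps can be sketched in plain Python as below. This is a minimal illustration rather than a tuned implementation: it assumes Euclidean distance, initializes by sampling k data points, stops when no pattern is reassigned, and leaves an empty cluster's center where it is. The data set and the `seed` parameter are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: pick k data points
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest center.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:     # step 4: no reassignment
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                      # keep old center if empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment

# Hypothetical 2-D data forming two well-separated groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centers, labels = kmeans(POINTS, k=2)
print(centers)
print(labels)
```

On data this cleanly separated the two groups are recovered from any choice of starting points; on harder data, different initializations can converge to different clusterings.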

k-means Clustering
[Figure not reproduced in the transcript]

k-means Variations
- What if there are too many/not enough clusters?
- After some convergence:
  - any cluster with too large a distance between members is split
  - any clusters too close together are combined
  - any cluster not corresponding to any points is moved
  - thresholds are decided empirically

An Incremental Clustering Algorithm
1. Assign the first data point to a cluster
2. Consider the next data point. Either assign the data point to an existing cluster or create a new cluster. Assignment to a cluster is based on a threshold
3. Repeat step 2 until all points are clustered

Useful for efficient clustering
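
A one-pass sketch of the steps above. The slides only say that assignment is based on a threshold, so the rule used here is an assumption: a point joins the nearest cluster if its distance to that cluster's current mean is within the threshold, and otherwise starts a new cluster. The 1-D data and the threshold value are hypothetical.

```python
def incremental_cluster(points, threshold):
    """Cluster 1-D points in a single pass over the data."""
    clusters = []                            # each cluster is a list of points
    for p in points:
        best_index, best_distance = None, None
        for index, members in enumerate(clusters):
            mean = sum(members) / len(members)
            d = abs(p - mean)
            if best_distance is None or d < best_distance:
                best_index, best_distance = index, d
        if best_index is not None and best_distance <= threshold:
            clusters[best_index].append(p)   # join the nearest cluster
        else:
            clusters.append([p])             # start a new cluster
    return clusters

DATA = [1.0, 1.2, 5.0, 0.8, 5.3, 9.0]
print(incremental_cluster(DATA, threshold=1.0))
```

Each point is examined once, which is what makes the method efficient; the trade-off is that the result can depend on the order in which points arrive.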

Clustering Summary
- Unsupervised learning method
  - generation of classes
- Based on a similarity/distance measure
  - Manhattan, Euclidean, Minkowski, Mahalanobis, etc.
  - distance matrix
  - threshold distance matrix
- Hierarchical representation
  - hierarchical dendrogram
- Agglomerative methods
  - single link
  - complete link (clique)

Clustering Summary
- Partitional method
  - representing clusters
    - centroids and error
  - k-means clustering
    - combining/splitting k-means
- Incremental clustering
  - one pass clustering