Clustering UE 141 Spring 2013

Similar documents
Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Cluster Analysis: Basic Concepts and Algorithms

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Information Retrieval and Web Search Engines

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Unsupervised learning: Clustering

BIRCH: An Efficient Data Clustering Method For Very Large Databases

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

SoSe 2014: M-TANI: Big Data Analytics

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Cluster Analysis: Advanced Concepts

Chapter 7. Cluster Analysis

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Distances, Clustering, and Classification. Heatmaps

Neural Networks Lesson 5 - Cluster Analysis

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Constrained Clustering of Territories in the Context of Car Insurance

Cluster Analysis using R

Cluster Analysis: Basic Concepts and Algorithms

Introduction to Clustering

Hierarchical Cluster Analysis Some Basics and Algorithms

Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Map-Reduce for Machine Learning on Multicore

A comparison of various clustering methods and algorithms in data mining

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Statistical Databases and Registers with some datamining

International Journal of Advance Research in Computer Science and Management Studies

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Machine Learning using MapReduce

Visualizing non-hierarchical and hierarchical cluster analyses with clustergrams

COC131 Data Mining - Clustering

A Survey of Clustering Techniques

How To Cluster

DATA CLUSTERING USING MAPREDUCE

Chapter ML:XI (continued)

ANALYSIS & PREDICTION OF SALES DATA IN SAP- ERP SYSTEM USING CLUSTERING ALGORITHMS

K-Means Clustering Tutorial

A Comparative Study of clustering algorithms Using weka tools

Clustering Techniques: A Brief Survey of Different Clustering Algorithms

Cluster Analysis: Basic Concepts and Methods

Social Media Mining. Data Mining Essentials

DISTRIBUTED ANOMALY DETECTION IN WIRELESS SENSOR NETWORKS

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

Clustering. Chapter Introduction to Clustering Techniques Points, Spaces, and Distances

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Lecture 10: Regression Trees

Vector Quantization and Clustering

Branch-and-Price Approach to the Vehicle Routing Problem with Time Windows

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

CLUSTERING FOR FORENSIC ANALYSIS

SPSS Tutorial. AEB 37 / AE 802 Marketing Research Methods Week 7

Machine Learning Big Data using Map Reduce

Clustering Very Large Data Sets with Principal Direction Divisive Partitioning

The SPSS TwoStep Cluster Component

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

1 Choosing the right data mining techniques for the job (8 minutes,

An Introduction to Cluster Analysis for Data Mining

Parallel Data Mining Algorithms for Association Rules and Clustering

Clustering Connectionist and Statistical Language Processing

Personalized Hierarchical Clustering

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry

Comparison of K-means and Backpropagation Data Mining Algorithms

Forschungskolleg Data Analytics Methods and Techniques

Territorial Analysis for Ratemaking. Philip Begher, Dario Biasini, Filip Branitchev, David Graham, Erik McCracken, Rachel Rogers and Alex Takacs

Clustering Via Decision Tree Construction

Data Clustering Techniques Qualifying Oral Examination Paper

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

Standardization and Its Effects on K-Means Clustering Algorithm

IE 680 Special Topics in Production Systems: Networks, Routing and Logistics*

They can be obtained in HQJHQH format directly from the home page at:

IMPROVISATION OF STUDYING COMPUTER BY CLUSTER STRATEGIES

Clustering methods for Big data analysis

Cluster Analysis Overview. Data Mining Techniques: Cluster Analysis. What is Cluster Analysis? What is Cluster Analysis?

Transcription:

Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1

Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or unrelated to) the obects in other groups Intra-cluster similarity are maximized Inter-cluster similarity are minimized

K-means Partition {x 1,,x n } into K clusters K is predefined Initialization Specify the initial cluster centers (centroids) Iteration until no change For each obect x i Calculate the distances between x i and the K centroids (Re)assign x i to the cluster whose centroid is the closest to x i Update the cluster centroids based on current assignment 3

5 K-means: Initialization Initialization: Determine the three cluster centers 4 m 1 3 m 1 0 0 1 3 4 5 m 3 4

K-means Clustering: Cluster Assignment Assign each obect to the cluster which has the closet distance from the centroid to the obect 5 4 m 1 3 m 1 0 0 1 3 4 5 m 3 5

K-means Clustering: Update Cluster Centroid Compute cluster centroid as the center of the points in the cluster 5 4 m 1 3 m 1 0 0 1 3 4 5 m 3 6

K-means Clustering: Update Cluster Centroid Compute cluster centroid as the center of the points in the cluster 5 4 m 1 3 1 m m 3 0 0 1 3 4 5 7

K-means Clustering: Cluster Assignment Assign each obect to the cluster which has the closet distance from the centroid to the obect 5 4 m 1 3 1 m m 3 0 0 1 3 4 5 8

K-means Clustering: Update Cluster Centroid Compute cluster centroid as the center of the points in the cluster 5 4 m 1 3 1 m m 3 0 0 1 3 4 5 9

m m 3 K-means Clustering: Update Cluster Centroid Compute cluster centroid as the center of the points in the cluster 5 4 m 1 3 1 0 0 1 3 4 5 10

Sum of Squared Error (SSE) Suppose the centroid of cluster C is m For each obect x in C, compute the squared error between x and the centroid m Sum up the error of all the obects SSE = x C ( x - m ) 1 4 5 m 1= 1.5 m = 4.5 SSE= (1-1.5) + ( -1.5) + (4-4.5) + (5-4.5) = 1 11

How to Minimize SSE min x C ( x - m Two sets of variables to minimize Each obect x belongs to which cluster? ) x C What s the cluster centroid? m Minimize the error wrt each set of variable alternatively Fix the cluster centroid find cluster assignment that minimizes the current error Fix the cluster assignment compute the cluster centroids that minimize the current error 1

Cluster Assignment Step min Cluster centroids (m ) are known For each obect x C ( x - m Choose C among all the clusters for x such that the distance between x and m is the minimum Choose another cluster will incur a bigger error Minimize error on each obect will minimize the SSE ) 13

Example Cluster Assignment Given m 1, m, which cluster each of the five points belongs to? Assign points to the closet centroid minimize SSE x 1, x, x C 3 1 x, x C 4 5 SSE= ( x - m 1 - m1 ) + ( x - m1 ) + ( x3 1) + ( x - m 4 - m) + ( x5 ) 14

Cluster Centroid Computation Step min For each cluster x C ( x - m Choose cluster centroid m as the center of the points x m = Minimize error on each cluster will minimize the SSE x C C ) 15

Example Cluster Centroid Computation Given the cluster assignment, compute the centers of the two clusters 16

Hierarchical Clustering Agglomerative approach a a b b a b c d e c c d e d d e e Initialization: Each obect is a cluster Iteration: Merge two clusters which are most similar to each other; Until all obects are merged into a single cluster Step 0 Step 1 Step Step 3 Step 4 bottom-up 17

Hierarchical Clustering Divisive Approaches a a b b a b c d e c c d e d d e e Initialization: All obects stay in one cluster Iteration: Select a cluster and split it into two sub clusters Until each leaf cluster contains only one obect Step 4 Step 3 Step Step 1 Step 0 Top-down 18

Dendrogram A tree that shows how clusters are merged/split hierarchically Each node on the tree is a cluster; each leaf node is a singleton cluster 19

Dendrogram A clustering of the data obects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster 0

Agglomerative Clustering Algorithm More popular hierarchical clustering technique Basic algorithm is straightforward 1. Compute the distance matrix. Let each data point be a cluster 3. Repeat 4. Merge the two closest clusters 5. Update the distance matrix 6. Until only a single cluster remains Key operation is the computation of the distance between two clusters Different approaches to defining the distance between clusters distinguish the different algorithms 1

Starting Situation Start with clusters of individual points and a distance matrix p1 p p3 p4 p5... p1 p p3 p4 p5... Distance Matrix... p1 p p3 p4 p9 p10 p11 p1

Intermediate Situation After some merging steps, we have some clusters Choose two clusters that has the smallest distance (largest similarity) to merge C1 C1 C C3 C4 C5 C1 C3 C4 C C3 C4 C5 Distance Matrix C C5... p1 p p3 p4 p9 p10 p11 p1 3

Intermediate Situation We want to merge the two closest clusters (C and C5) and update the distance matrix. C1 C C3 C3 C4 C4 C5 C1 C1 C Distance Matrix C3 C4 C5 C C5... p1 p p3 p4 p9 p10 p11 p1 4

After Merging The question is How do we update the distance matrix? C1 C3 C4 C1 C U C5 C3 C4 C1 C U C5??????? C3 Distance Matrix C4 C U C5... p1 p p3 p4 p9 p10 p11 p1 5

How to Define Inter-Cluster Distance MIN MAX Group Average Distance? Distance Between Centroids p1 p p3 p4 p5... p1 p p3 p4 p5... Distance Matrix 6

Question Talk about big data What s the definition of big data? What applications generate big data? What are the challenges? What are the technologies for mining big data? 7