Chapter 7. Cluster Analysis




1. What is Cluster Analysis?
2. A Categorization of Major Clustering Methods
3. Partitioning Methods
4. Hierarchical Methods
5. Density-Based Methods
6. Grid-Based Methods
7. Model-Based Methods
8. Clustering High-Dimensional Data
9. Constraint-Based Clustering
10. Link-based clustering
11. Outlier Analysis
12. Summary

What is Cluster Analysis?
- Cluster: a collection of data objects that are
  - similar (or related) to one another within the same group
  - dissimilar (or unrelated) to the objects in other groups
- Cluster analysis: finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
- Unsupervised learning: no predefined classes
- Typical applications
  - As a stand-alone tool to get insight into data distribution
  - As a preprocessing step for other algorithms

Clustering for Data Understanding and Applications
- Biology: taxonomy of living things (kingdom, phylum, class, order, family, genus, and species)
- Information retrieval: document clustering
- Land use: identification of areas of similar land use in an earth observation database
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- City planning: identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults
- Climate: understanding earth climate, finding patterns in the atmosphere and ocean
- Economic science: market research

Clustering as Preprocessing Tools (Utility)
- Summarization: preprocessing for regression, PCA, classification, and association analysis
- Compression: image processing (vector quantization)
- Finding K-nearest neighbors: localizing search to one or a small number of clusters

Quality: What Is Good Clustering?
- A good clustering method will produce high quality clusters
  - high intra-class similarity: cohesive within clusters
  - low inter-class similarity: distinctive between clusters
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns

October 5, 0 Data Mining: Concepts and Techniques

Measure the Quality of Clustering
- Dissimilarity/similarity metric
  - Similarity is expressed in terms of a distance function, typically a metric: d(i, j)
  - The definitions of distance functions usually differ for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
  - Weights should be associated with different variables based on applications and data semantics
- Quality of clustering
  - There is usually a separate quality function that measures the goodness of a cluster
  - It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective

Distance Measures for Different Kinds of Data
Discussed in the chapter on Data Preprocessing:
- Numerical (interval)-based
  - Minkowski distance
  - Special cases: Euclidean (L2-norm), Manhattan (L1-norm)
- Binary variables: symmetric vs. asymmetric (Jaccard coefficient)
- Nominal variables: number of mismatches
- Ordinal variables: treated like interval-based
- Ratio-scaled variables: apply a log transformation first
- Vectors: cosine measure
- Mixed variables: weighted combinations
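The measures listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, records are assumed to be equal-length tuples, and the nominal measure is normalized by the number of attributes, which the slide does not specify.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h (h=1: Manhattan, h=2: Euclidean)."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def jaccard_distance(x, y):
    """Asymmetric binary dissimilarity: positions where both values are 0 are ignored."""
    both_one = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    differ = sum(1 for a, b in zip(x, y) if a != b)
    if both_one + differ == 0:
        return 0.0  # assumption: two all-zero records count as identical
    return differ / (both_one + differ)

def nominal_mismatch(x, y):
    """Fraction of nominal attributes on which the two records disagree."""
    return sum(1 for a, b in zip(x, y) if a != b) / len(x)

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(minkowski((0, 0), (3, 4), 2))   # Euclidean
print(minkowski((0, 0), (3, 4), 1))   # Manhattan
print(jaccard_distance([1, 0, 1, 0], [1, 1, 0, 0]))
print(cosine_similarity([1, 0], [0, 1]))
```

For mixed variables, one option is a weighted sum of these per-type measures, with the weights chosen from the application as the earlier slide suggests.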

Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Ability to handle dynamic data
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Major Clustering Approaches (I)
- Partitioning approach
  - Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
  - Typical methods: k-means, k-medoids, CLARANS
- Hierarchical approach
  - Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
- Density-based approach
  - Based on connectivity and density functions
  - Typical methods: DBSCAN, OPTICS, DenClue
- Grid-based approach
  - Based on a multiple-level granularity structure
  - Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches (II)
- Model-based
  - A model is hypothesized for each cluster, and the method tries to find the best fit of the data to that model
  - Typical methods: EM, SOM, COBWEB
- Frequent pattern-based
  - Based on the analysis of frequent patterns
  - Typical methods: p-cluster
- User-guided or constraint-based
  - Clustering by considering user-specified or application-specific constraints
  - Typical methods: COD (obstacles), constrained clustering
- Link-based clustering
  - Objects are often linked together in various ways
  - Massive links can be used to cluster objects: SimRank, LinkClus

Calculation of Distance between Clusters
Here $t_{ip}$ denotes an element of cluster $K_i$ and $t_{jq}$ an element of cluster $K_j$.
- Single link: smallest distance between an element in one cluster and an element in the other, i.e., $\mathrm{dist}(K_i, K_j) = \min_{p,q} \mathrm{dist}(t_{ip}, t_{jq})$
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., $\mathrm{dist}(K_i, K_j) = \max_{p,q} \mathrm{dist}(t_{ip}, t_{jq})$
- Average: average distance between an element in one cluster and an element in the other, i.e., $\mathrm{dist}(K_i, K_j) = \mathrm{avg}_{p,q}\, \mathrm{dist}(t_{ip}, t_{jq})$
- Centroid: distance between the centroids of two clusters, i.e., $\mathrm{dist}(K_i, K_j) = \mathrm{dist}(C_i, C_j)$
- Medoid: distance between the medoids of two clusters, i.e., $\mathrm{dist}(K_i, K_j) = \mathrm{dist}(M_i, M_j)$
  - Medoid: one chosen, centrally located object in the cluster
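As a rough illustration of the linkage definitions above, the following sketch computes four of them for two small clusters. It assumes Euclidean distance between points and represents each cluster as a list of numeric tuples; the function names are hypothetical.

```python
import math
from itertools import product

def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(A, B):
    """Smallest pairwise distance between the two clusters."""
    return min(euclidean(p, q) for p, q in product(A, B))

def complete_link(A, B):
    """Largest pairwise distance between the two clusters."""
    return max(euclidean(p, q) for p, q in product(A, B))

def average_link(A, B):
    """Mean of all pairwise distances between the two clusters."""
    return sum(euclidean(p, q) for p, q in product(A, B)) / (len(A) * len(B))

def centroid(C):
    """Coordinate-wise mean of a cluster."""
    return tuple(sum(col) / len(C) for col in zip(*C))

def centroid_link(A, B):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(A), centroid(B))

A = [(0, 0), (1, 0)]
B = [(3, 0), (5, 0)]
print(single_link(A, B), complete_link(A, B), average_link(A, B), centroid_link(A, B))
```

The medoid variant follows the same pattern once a representative object has been chosen for each cluster.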

Centroid, Radius and Diameter of a Cluster (for numerical data sets)
For a cluster $m$ of $N$ points, where $t_{ip}$ denotes the $i$-th point:
- Centroid: the "middle" of a cluster
  $C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$
- Radius: square root of the average squared distance from any point of the cluster to its centroid
  $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$
- Diameter: square root of the average squared distance between all pairs of points in the cluster
  $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jp})^2}{N(N-1)}}$
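A small sketch of the three quantities, assuming points are numeric tuples and squared Euclidean distance is used inside the sums. The diameter sums over all ordered pairs and divides by N(N-1), following the definition above; the function names are hypothetical.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of the cluster."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def radius(points):
    """Square root of the average squared distance to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(sq_dist(p, c) for p in points) / len(points))

def diameter(points):
    """Square root of the average squared distance over all ordered pairs.

    Requires at least two points, since the denominator is N(N-1).
    """
    n = len(points)
    total = sum(sq_dist(p, q) for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))

cluster = [(0, 0), (2, 0)]
print(centroid(cluster), radius(cluster), diameter(cluster))
```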

Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized:
  $E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - m_i)^2$
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
  - k-means (MacQueen'67): each cluster is represented by the center of the cluster
  - k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster
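To make the criterion concrete, the sketch below computes E for a given partition and finds the global optimum by brute force on a tiny data set. It is only practical for a handful of points, since the number of label assignments grows as k^n, which is why the heuristic methods are used instead. The function names are hypothetical and points are assumed to be numeric tuples.

```python
from itertools import product

def sse(clusters):
    """Sum of squared distances of every point to its cluster mean (the criterion E)."""
    total = 0.0
    for members in clusters:
        dims = len(members[0])
        mean = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        total += sum((p[d] - mean[d]) ** 2 for p in members for d in range(dims))
    return total

def best_partition(points, k):
    """Enumerate every assignment of points to k non-empty clusters and keep the one with minimum E."""
    best = None
    for labels in product(range(k), repeat=len(points)):
        if len(set(labels)) < k:
            continue  # skip assignments that leave a cluster empty
        clusters = [[p for p, l in zip(points, labels) if l == c] for c in range(k)]
        e = sse(clusters)
        if best is None or e < best[0]:
            best = (e, clusters)
    return best

error, clusters = best_partition([(0,), (1,), (10,), (11,)], k=2)
print(error, clusters)
```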

The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when no assignment changes
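The four steps can be sketched as follows. This is a minimal illustration, not a production implementation: the initial partition is a simple round-robin split (an assumption, since the slide does not say how to form it), squared Euclidean distance is used, and a cluster that becomes empty keeps its previous centroid.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(members) for col in zip(*members))

def kmeans(points, k, max_iter=100):
    # Step 1: partition the objects into k non-empty subsets (round-robin, an assumption).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster in the current partition.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

labels, centroids = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(labels, centroids)
```

Because the result depends on the initial partition, running the sketch with a different starting split can end in a different local optimum.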

Example: The K-Means Clustering Method
[Figure: sequence of scatter plots. Panels: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]

Comments on the K-Means Method
- Strength: relatively efficient, O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
  - For comparison: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
- Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
- Weakness
  - Applicable only when a mean is defined, which leaves open how to handle categorical data
  - Need to specify k, the number of clusters, in advance
  - Unable to handle noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes

Variations of the K-Means Method
- A few variants of k-means differ in
  - Selection of the initial k means
  - Dissimilarity calculations
  - Strategies to calculate cluster means
- Handling categorical data: k-modes (Huang'98)
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
- A mixture of categorical and numerical data: k-prototype method
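A rough sketch of the k-modes idea described above, assuming records are tuples of categorical values, the dissimilarity is the number of mismatching attributes, and the initial modes are supplied by the caller. This is a simplified illustration and not Huang's exact procedure; the function names are hypothetical.

```python
from collections import Counter

def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def mode_of(members):
    """Frequency-based update: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))

def kmodes(records, initial_modes, max_iter=100):
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        # Assign each record to the cluster whose mode it mismatches least.
        new_labels = [min(range(len(modes)), key=lambda c: mismatch(r, modes[c]))
                      for r in records]
        if new_labels == labels:
            break  # no record changed cluster
        labels = new_labels
        # Replace each cluster's mode using attribute value frequencies.
        for c in range(len(modes)):
            members = [r for r, l in zip(records, labels) if l == c]
            if members:
                modes[c] = mode_of(members)
    return labels, modes

records = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("c", "z")]
labels, modes = kmodes(records, initial_modes=[records[0], records[3]])
print(labels, modes)
```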

What Is the Problem of the K-Means Method?
- The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
- K-Medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.
[Figure: two scatter plots; axis residue only, no caption recoverable.]
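The effect of a single outlier on the two reference points can be seen in a short sketch. It assumes Euclidean distance and takes the medoid to be the object with the smallest total distance to all other objects in the cluster; the data values are made up for illustration.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Coordinate-wise mean; need not coincide with any actual object."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def medoid(points):
    """The object with the smallest total distance to all other objects."""
    return min(points, key=lambda p: sum(euclidean(p, q) for q in points))

# Four nearby objects plus one extreme outlier (hypothetical data).
cluster = [(1, 1), (2, 2), (3, 3), (4, 4), (100, 100)]
print(mean_point(cluster))  # pulled far toward the outlier
print(medoid(cluster))      # stays among the bulk of the objects
```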

Hierarchical Clustering
- Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition.
[Figure: objects a, b, c, d, e. Agglomerative clustering (AGNES) proceeds from Step 0 onward, merging through the intermediate clusters {a, b}, {d, e}, and {c, d, e} into {a, b, c, d, e}; divisive clustering (DIANA) runs through the same steps in the opposite direction.]

AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical packages, e.g., S-Plus
- Uses the single-link method and the dissimilarity matrix
- Merges the nodes that have the least dissimilarity
- Goes on in a non-descending fashion
- Eventually all nodes belong to the same cluster
[Figure: three scatter plots showing successive merges; axis residue only.]
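The merging procedure described above can be sketched with a naive single-link loop. This is an illustrative sketch assuming Euclidean distance and recomputing pairwise distances at every step, which is far slower than what a statistical package would do; the function names are hypothetical.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agnes_single_link(points):
    """Return the list of merges as (members_a, members_b, distance), in merge order."""
    clusters = [[i] for i in range(len(points))]  # every object starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the least single-link dissimilarity.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return merges

for left, right, d in agnes_single_link([(0,), (1,), (3,), (7,)]):
    print(left, right, d)
```

With single link the merge distances never decrease from one step to the next, which matches the "non-descending" remark on the slide.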

Dendrogram: Shows How the Clusters are Merged
- Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
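Cutting a dendrogram can be sketched as replaying only the merges that occur at or below the chosen height and then reading off the connected components. The merge-list format (members of the left cluster, members of the right cluster, merge height) and the example values are assumptions made for this illustration.

```python
def cut_dendrogram(n, merges, height):
    """Return the clusters obtained by cutting at `height`.

    n      -- number of objects, indexed 0..n-1
    merges -- list of (left_members, right_members, merge_height)
    """
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right, h in merges:
        if h <= height:  # merges above the cut are ignored
            parent[find(right[0])] = find(left[0])

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical dendrogram over four objects.
merges = [([0], [1], 1.0), ([0, 1], [2], 2.0), ([0, 1, 2], [3], 4.0)]
print(cut_dendrogram(4, merges, height=2.5))
```

Lowering the cut height yields more, smaller clusters; raising it to the height of the final merge yields a single cluster.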