Data Mining. Cluster Analysis: Advanced Concepts and Algorithms




Data Mining Cluster Analysis: Advanced Concepts and Algorithms (Tan, Steinbach, Kumar, Introduction to Data Mining)

More Clustering Methods
- Prototype-based clustering
- Density-based clustering
- Graph-based clustering
- Scalable clustering

Prototype-based Clustering
- Fuzzy c-means: objects are allowed to belong to more than one cluster.
- EM (Expectation-Maximization): a cluster is modeled as a statistical distribution.
- SOM (Self-Organizing Maps): clusters are constrained to have fixed relationships to one another.

Fuzzy c-means
- A collection of fuzzy clusters is a subset of all possible fuzzy subsets of the set of data points.
- Each point is assigned a weight for each cluster, and the weights for each point sum to 1.
- Each cluster contains at least one point, but not all points.
- Difference from k-means: instead of updating centroids and reassigning points, fuzzy c-means alternates between updating the centroids and updating the weights (see the sketch below).
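A minimal sketch of this alternating update, assuming the standard fuzzy c-means objective with fuzzifier m (m = 2 is a common default; the slides do not fix a value). All names and parameter choices are illustrative:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """X: (n, d) data matrix, c: number of clusters, m: fuzzifier > 1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random membership weights; each row (one per point) sums to 1.
    W = rng.random((n, c))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Update centroids as weighted means of the points.
        Wm = W ** m
        centroids = (Wm.T @ X) / Wm.sum(axis=0)[:, None]
        # Update weights from the distances to the new centroids.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                  # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        W = inv / inv.sum(axis=1, keepdims=True)
    return centroids, W
```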

Clustering Using Mixture Models
- Mixture model: use statistical distributions to model the data, with each distribution corresponding to a cluster.
- Maximum likelihood principle: the parameters of the distributions are fixed but unknown; the best parameters are those that maximize the probability of observing the samples we actually have.

Likelihood Function
- The likelihood function for a data set $D$ that contains $n$ points is
  $P(D \mid \theta) = \prod_{k=1}^{n} P(x_k \mid \theta)$,
  where $P(D \mid \theta)$ is called the likelihood of $\theta$ with respect to the set of samples $D$.
- The log-likelihood function is
  $\ell(\theta) = \log P(D \mid \theta) = \sum_{k=1}^{n} \log P(x_k \mid \theta)$.
- The distribution parameters that maximize the log-likelihood are
  $\hat{\theta} = \arg\max_{\theta} \ell(\theta)$,
  obtained by solving $\nabla_{\theta}\, \ell(\theta) = 0$.
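To make the formula concrete, here is the log-likelihood of a sample under a single 1-D Gaussian; the data values and parameters are invented for illustration:

```python
import numpy as np

def gaussian_log_likelihood(x, mu, sigma):
    # Sum of log N(x_k | mu, sigma^2), i.e. the log-likelihood above.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

x = np.array([0.1, -0.4, 0.3, 0.0])
print(gaussian_log_likelihood(x, mu=0.0, sigma=0.5))
# For a Gaussian, maximizing this over mu and sigma recovers the
# sample mean and the (biased) sample standard deviation.
```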

The EM Algorithm
- Select an initial set of model parameters.
- Repeat:
  - Expectation step: for each object, calculate the probability that it belongs to each distribution.
  - Maximization step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood.
- Until the parameters do not change.
- K-means is a special case of EM for Gaussian distributions with equal covariance but different means.
- (Figure: the left cluster corresponds to a Gaussian distribution centered at (-4, 1) with std 2; the right cluster to a Gaussian centered at (0, 0) with std 0.5.)
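A minimal sketch of the two EM steps for a one-dimensional Gaussian mixture (the slide's example is two-dimensional; one dimension keeps the code short). Initialization, iteration count, and the numerical floors are assumptions:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initialize: random means, unit variances, uniform mixing weights.
    mu = rng.choice(x, size=k, replace=False)
    sigma = np.ones(k)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        r = pi * normal_pdf(x[:, None], mu, sigma)      # shape (n, k)
        r = np.fmax(r, 1e-300)                          # numerical floor
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and std deviations.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.fmax(np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk), 1e-6)
    return pi, mu, sigma

# Data resembling the slide's example, reduced to one dimension:
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-4, 2, 500), rng.normal(0, 0.5, 500)])
print(em_gmm_1d(x))
```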

SOM
- Centroids have a pre-determined topographic ordering relationship.
- For each data point, its closest centroid and the centroids in that centroid's neighborhood are updated.
- The output is a set of centroids, with clusters implicitly defined.
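One way to make the neighborhood update concrete is a one-dimensional SOM grid, where a centroid's topographic neighbors are simply the adjacent grid positions; the decay schedules below are simple assumptions, not part of the slides:

```python
import numpy as np

def som_1d(X, n_units=10, n_iter=1000, lr=0.5, radius=2, seed=0):
    rng = np.random.default_rng(seed)
    units = rng.standard_normal((n_units, X.shape[1]))
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                 # pick a random data point
        bmu = np.argmin(np.linalg.norm(units - x, axis=1))  # best-matching unit
        # Linearly decay the learning rate and neighborhood radius.
        a = lr * (1 - t / n_iter)
        r = max(1, int(radius * (1 - t / n_iter)))
        # Move the BMU and its grid neighbors toward the data point.
        for j in range(max(0, bmu - r), min(n_units, bmu + r + 1)):
            units[j] += a * (x - units[j])
    return units
```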

Density-based Clustering
- Grid-based clustering (sketched below):
  - define a set of grid cells
  - assign data points to the appropriate cells and compute the density of each cell
  - eliminate cells with low density
  - merge adjacent dense cells to form clusters
- CLIQUE: cluster data points using a subset of the attributes.
- DENCLUE: use a kernel density function.
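A sketch of the four grid-based steps for two-dimensional data; the cell size and density threshold are illustrative:

```python
import numpy as np
from collections import defaultdict

def grid_clusters(X, cell_size=1.0, min_pts=5):
    """Bin 2-D points into cells, keep dense cells, and merge adjacent
    dense cells into clusters via a flood fill."""
    cells = defaultdict(list)
    for i, p in enumerate(X):
        cells[tuple((p // cell_size).astype(int))].append(i)
    dense = {c for c, pts in cells.items() if len(pts) >= min_pts}
    clusters, seen = [], set()
    for c in dense:                     # flood fill over adjacent dense cells
        if c in seen:
            continue
        stack, comp = [c], []
        seen.add(c)
        while stack:
            cur = stack.pop()
            comp.extend(cells[cur])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cur[0] + dx, cur[1] + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(comp)
    return clusters
```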

CLIQUE
- Partition each dimension into the same number of equal-width intervals, which partitions the m-dimensional data space into non-overlapping rectangular units.
- Repeatedly generate candidate k-dimensional subspace cells from the (k-1)-dimensional subspace cells and eliminate the non-dense cells.
- Merge adjacent dense cells to form clusters.

DENCLUE
- Influence (kernel) function: each data point has an influence that extends over a range.
- Density function: the sum of the kernel functions associated with all of the data points.
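A sketch of DENCLUE's density function using a Gaussian influence function; the bandwidth h is a guess, and the hill-climbing search for attractors (next slide) is omitted:

```python
import numpy as np

def gaussian_kernel_density(x, X, h=1.0):
    """Density at point x = sum of Gaussian influence functions
    centered at the data points in X. This is the surface whose
    local maxima DENCLUE treats as density attractors."""
    sq_dists = np.sum((X - x) ** 2, axis=1)
    return np.sum(np.exp(-sq_dists / (2 * h ** 2)))
```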

DENCLUE
- Find the density attractors x*, the local maxima of the density function.
- Associate each point with the density attractor x* it reaches by repeatedly moving in the direction that maximizes the increase in density.
- Each density attractor, together with the data points associated with it, represents a cluster.
- Discard a cluster if its density attractor has a density less than ξ; points attracted to such smaller maxima are considered outliers.
- Combine clusters that are connected.

Graph-based Clustering
- Graph-based clustering uses the proximity graph:
  - start with the proximity matrix
  - consider each data point as a node in a graph
  - each edge between two nodes has a weight, which is the proximity between the two data points
- In the simplest case, clusters are the connected components of the graph (see the sketch below).
- Graph-based clustering algorithms: Chameleon, and Shared Nearest Neighbor (SNN) with SNN density.
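For that simplest case, a sketch that thresholds a similarity matrix and returns the connected components as clusters; tau is an illustrative cutoff:

```python
import numpy as np

def threshold_graph_clusters(S, tau):
    """S: (n, n) similarity matrix. Keep edges with similarity > tau,
    then label each connected component with a cluster id."""
    n = len(S)
    adj = S > tau
    labels = np.full(n, -1)
    cid = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        stack = [start]          # depth-first search over the thresholded graph
        labels[start] = cid
        while stack:
            u = stack.pop()
            for v in np.nonzero(adj[u])[0]:
                if labels[v] == -1:
                    labels[v] = cid
                    stack.append(v)
        cid += 1
    return labels
```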

Chameleon
- Adapts to the characteristics of the data set to find the natural clusters.
- Uses a dynamic model to measure the similarity between clusters:
  - its main properties are the relative closeness and relative interconnectivity of the clusters
  - two clusters are combined if the resulting cluster shares certain properties with the constituent clusters
  - the merging scheme preserves self-similarity
- One of its areas of application is spatial data.

Characteristics of Spatial Data Sets
- Clusters are defined as densely populated regions of the space.
- Clusters have arbitrary shapes and orientations, and non-uniform sizes.
- Densities differ across clusters, and density varies within a cluster.
- Special artifacts (streaks) and noise exist.
A clustering algorithm must address the above characteristics while requiring minimal supervision.

Chameleon: Preprocessing Step
- Represent the data by a graph: given a set of points, construct the k-nearest-neighbor (k-nn) graph to capture the relationship between each point and its k nearest neighbors.
- The concept of neighborhood is captured dynamically (even if a region is sparse).

Chameleon: Phase 1
- Use a multilevel graph-partitioning algorithm on the graph to find a large number of clusters of well-connected vertices.
- Each cluster should contain mostly points from one true cluster, i.e., be a sub-cluster of a real cluster.

Chameleon: Phase 2
- Use hierarchical agglomerative clustering to merge the sub-clusters.
- Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters.
- Two key properties are used to model cluster similarity (formalized below):
  - Relative interconnectivity: the absolute interconnectivity of two clusters, normalized by the internal connectivity of the clusters.
  - Relative closeness: the absolute closeness of two clusters, normalized by the internal closeness of the clusters.
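The slide states these two properties only in words. Written out as in the original Chameleon paper (Karypis, Han, and Kumar, 1999), with $EC(C_i, C_j)$ the total weight of the edges connecting clusters $C_i$ and $C_j$, $EC(C_i)$ the weight of the min-cut bisector of $C_i$, and $\bar{S}_{EC}$ the corresponding average edge weights, they take the form:

$$
RI(C_i, C_j) = \frac{\lvert EC(C_i, C_j) \rvert}{\tfrac{1}{2}\left(\lvert EC(C_i) \rvert + \lvert EC(C_j) \rvert\right)}
$$

$$
RC(C_i, C_j) = \frac{\bar{S}_{EC}(C_i, C_j)}{\dfrac{\lvert C_i \rvert}{\lvert C_i \rvert + \lvert C_j \rvert}\,\bar{S}_{EC}(C_i) + \dfrac{\lvert C_j \rvert}{\lvert C_i \rvert + \lvert C_j \rvert}\,\bar{S}_{EC}(C_j)}
$$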

Shared Nearest Neighbor (SNN)
- SNN graph: the weight of an edge is the number of neighbors shared by the two vertices, given that the vertices are connected.
- (Figure: two connected vertices i and j that share four neighbors, so the SNN edge between them has weight 4.)

SNN Density Clustering
1. Compute the similarity matrix. This corresponds to a similarity graph with the data points as nodes and edges whose weights are the similarities between data points.
2. Sparsify the similarity matrix by keeping only the k most similar neighbors. This corresponds to keeping only the k strongest links of the similarity graph.
3. Construct the shared nearest neighbor graph from the sparsified similarity matrix. At this point, we could apply a similarity threshold and find the connected components to obtain the clusters (the Jarvis-Patrick algorithm).
4. Find the SNN density of each point: using a user-specified parameter Eps, count the points that have an SNN similarity of Eps or greater to the point. This count is the SNN density of the point.

SNN Density Clustering (continued)
5. Find the core points: using a user-specified parameter MinPts, find the core points, i.e., all points that have an SNN density greater than MinPts.
6. Form clusters from the core points: if two core points are within a radius Eps of each other, they are placed in the same cluster.
7. Discard all noise points: all non-core points that are not within a radius Eps of a core point are discarded.
8. Assign all non-noise, non-core points to clusters, e.g., by assigning each such point to the nearest core point.
(Note that steps 4-8 are DBSCAN.) A sketch of the whole pipeline appears below.
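A compact sketch of steps 1-8. Note that Eps here is a count of shared neighbors (an SNN similarity), not a geometric radius, and all parameter defaults are illustrative:

```python
import numpy as np

def snn_density_clustering(X, k=7, eps=3, min_pts=5):
    n = len(X)
    # Steps 1-2: similarity graph sparsified to the k nearest neighbors.
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    knn = [set(np.argsort(d[i])[1:k + 1]) for i in range(n)]
    # Step 3: SNN graph; edge weight = number of shared neighbors,
    # defined only between points that are in each other's k-nn lists.
    snn = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if j in knn[i] and i in knn[j]:
                snn[i, j] = snn[j, i] = len(knn[i] & knn[j])
    # Step 4: SNN density = number of points with SNN similarity >= eps.
    density = (snn >= eps).sum(axis=1)
    # Step 5: core points.
    core = np.nonzero(density >= min_pts)[0]
    # Step 6: connect core points whose SNN similarity is >= eps.
    labels = np.full(n, -1)
    cid = 0
    for c in core:
        if labels[c] != -1:
            continue
        stack = [c]
        labels[c] = cid
        while stack:
            u = stack.pop()
            for v in core:
                if labels[v] == -1 and snn[u, v] >= eps:
                    labels[v] = cid
                    stack.append(v)
        cid += 1
    # Steps 7-8: attach each non-core point to its most similar core
    # point if similar enough; otherwise leave it labeled -1 (noise).
    for i in range(n):
        if labels[i] == -1 and len(core) > 0:
            best = core[np.argmax(snn[i, core])]
            if snn[i, best] >= eps:
                labels[i] = labels[best]
    return labels
```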

Scalable Clustering Algorithms
- Designed to handle large amounts of data:
  - BIRCH
  - CURE

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- Minimizes I/O cost (one or two scans of the data).
- Clustering Feature (CF), a triple summarizing a subcluster:
  - N: the number of points in the cluster
  - LS: the linear sum of the points in the cluster
  - SS: the square sum of the points in the cluster
- CF-tree:
  - similar to a B+-tree; each node corresponds to one page
  - parameters: branching factor B and threshold T
  - a non-leaf node contains at most B CF entries, one for each of its children
  - in a leaf node, each entry's diameter (or radius) has to be less than T; the number of entries per node is determined by the page size and the entry size
  - a leaf node represents a cluster made up of all the subclusters represented by its entries
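The CF triple is additive, which is what makes a single scan sufficient: merging two subclusters just adds their triples. A minimal sketch (names are illustrative):

```python
import numpy as np

class CF:
    """Clustering Feature (N, LS, SS) for one subcluster."""
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.n, self.ls, self.ss = 1, p.copy(), np.sum(p * p)

    def merge(self, other):
        # CFs are additive: absorbing a subcluster is three additions.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of member points from the centroid,
        # computable from (N, LS, SS) alone: R^2 = SS/N - ||centroid||^2.
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - np.dot(c, c), 0.0))
```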

(Figures: the structure of a CF-tree, and an overview of the BIRCH algorithm.)

BIRCH Algorithm
- Phase 1: Start with an initial threshold and insert points into the tree. If memory runs out, increase the threshold value and rebuild a smaller tree by reinserting the values from the old tree, followed by the remaining values. Remove outliers while rebuilding the tree.
- Phase 2 (optional): Prepare the tree for Phase 3 by removing outliers and grouping subclusters.
- Phase 3: Cluster the leaf entries on their CF values with an existing algorithm (e.g., agglomerative hierarchical clustering).
- Phase 4 (optional): Refine the clusters. Perform additional passes over the dataset, reassigning data points to the closest centroids and recalculating the centroids.

An Example of the CF-tree
- Initially, the data points are in one cluster. (Figure: a root with a single entry A.)
- As data arrives, a check is made that the size of cluster A does not exceed the threshold T. (Figure: the root's entry A with its size compared against T.)
- If the cluster grows too big, it is split into two clusters, A and B, and the points are redistributed. (Figure: the root now holds entries A and B.)
- At each node of the tree, the CF-tree keeps information about the mean of each subcluster and the mean of its sum of squares, so that the size of the clusters can be computed efficiently.

CURE
- Uses a number of representative points to describe each cluster.
- Representative points are found by selecting a constant number of well-scattered points from a cluster and then shrinking them toward the center of the cluster.
- Cluster similarity is the similarity of the closest pair of representative points from different clusters.
- Shrinking the representative points toward the center helps avoid problems with noise and outliers.
- CURE is better able to handle clusters of arbitrary shapes and sizes.
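A sketch of the two distinctive pieces of CURE: representative-point selection with shrinking, and closest-pair cluster distance. The values of n_rep and alpha are typical choices assumed here, not taken from the slides:

```python
import numpy as np

def cure_representatives(cluster_pts, n_rep=5, alpha=0.3):
    """Pick well-scattered points, then shrink them toward the
    centroid by a factor alpha to dampen the influence of outliers."""
    center = cluster_pts.mean(axis=0)
    # Greedy scatter: farthest point from the center first, then the
    # point farthest from all points chosen so far.
    reps = [cluster_pts[np.argmax(np.linalg.norm(cluster_pts - center, axis=1))]]
    while len(reps) < min(n_rep, len(cluster_pts)):
        dists = np.min(
            [np.linalg.norm(cluster_pts - r, axis=1) for r in reps], axis=0)
        reps.append(cluster_pts[np.argmax(dists)])
    reps = np.array(reps)
    return reps + alpha * (center - reps)

def cure_cluster_distance(reps_a, reps_b):
    """Cluster distance = distance between the closest pair of
    representative points from the two clusters."""
    d = np.linalg.norm(reps_a[:, None] - reps_b[None, :], axis=2)
    return d.min()
```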