BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies


T-61.6020 Popular Algorithms in Data Mining and Machine Learning BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies Sami Virpioja Adaptive Informatics Research Centre Helsinki University of Technology April 23, 2008

Outline 1 Introduction 2 Clustering Features 3 CF Tree 4 BIRCH Clustering 5 Performance and Experiments 6 Conclusions

Task Data clustering under computational constraints: Given a number of clusters K, a dataset of N points, and a distance-based measurement function, find a partition of the dataset that minimizes the function. Constraints: a limited amount of memory (less than the dataset size), and minimal time spent on I/O operations, i.e., do not read the data from disk multiple times.

Properties of BIRCH On-line algorithm: a single scan of the dataset is enough; optional refining with additional scans. I/O cost minimization: organize the data in an in-memory, height-balanced tree. Local decisions: each clustering decision is made without scanning all the points or clusters. Outlier detection (optional): points in sparse regions are treated separately as outliers.

Algorithm Phases (1) BIRCH overview [Zhang et al. 1997]

Algorithm Phases (2) Two main phases: Phase 1: construction of the CF tree using local clustering; incremental, one scan of the data is enough. Phase 3: global clustering of the subclusters from the CF tree, using some standard clustering algorithm (e.g. hierarchical clustering, K-means). The remaining phases (2 and 4) are optional.

Outline 1 Introduction 2 Clustering Features 3 CF Tree 4 BIRCH Clustering 5 Performance and Experiments 6 Conclusions

Clustering Features Clustering Feature: a triple CF = (N, LS, SS), where N is the number of data points in the cluster (0th order moment), LS is the linear sum of the data points, $LS = \sum_{i=1}^{N} X_i$ (1st order moment), and SS is the square sum of the data points, $SS = \sum_{i=1}^{N} \|X_i\|^2$ (2nd order moment). CF Additivity Theorem: Assume two disjoint clusters with CF_1 = (N_1, LS_1, SS_1) and CF_2 = (N_2, LS_2, SS_2). For the merged cluster, CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2).

Representativity of Clustering Features (1) CF Representativity Theorem: Given the CF entries of subclusters, the following measurements can be computed accurately. For a single cluster X: centroid $X_0 = \frac{1}{N}\sum_{i=1}^{N} X_i$, radius $R = \left(\frac{1}{N}\sum_{i=1}^{N} \|X_i - X_0\|^2\right)^{1/2}$, and diameter $D = \left(\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N} \|X_i - X_j\|^2\right)^{1/2}$.

Representativity of Clustering Features (2) For two clusters X and Y: Euclidean centroid distance $D0 = \|X_0 - Y_0\|$, Manhattan centroid distance $D1 = \|X_0 - Y_0\|_1$, average inter-cluster distance $D2 = \left(\frac{1}{N_X N_Y}\sum_{i=1}^{N_X}\sum_{j=1}^{N_Y} \|X_i - Y_j\|^2\right)^{1/2}$, average intra-cluster distance D3 = ..., and variance increase distance D4 = .... For example, $\sum_{i=1}^{N_X}\sum_{j=1}^{N_Y} \|X_i - Y_j\|^2 = \sum_{i=1}^{N_X}\sum_{j=1}^{N_Y} (X_i^T X_i - 2 X_i^T Y_j + Y_j^T Y_j) = N_Y\, SS_X + N_X\, SS_Y - 2\, LS_X^T LS_Y$, so D2 can be computed from the two CF entries alone.

CF Summary Thus, instead of storing a cluster as the set of its data point vectors, it can be summarized by a single CF entry. Such summaries are sometimes called micro-clusters. Additivity: CF entries can be stored and updated incrementally. Representativity: some distances can be calculated without the individual data points.
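To make the above concrete, here is a minimal sketch (in Python with NumPy, not the authors' code) of a CF entry with the additivity and representativity properties: the radius, diameter, and the D0/D2 distances are computed from (N, LS, SS) alone, using the expansions shown on the previous slides. The class and helper names are hypothetical.

import numpy as np

class CF:
    """Clustering Feature: the (N, LS, SS) summary of a set of points."""
    def __init__(self, n=0, ls=None, ss=0.0, dim=2):
        self.n = n                                     # number of points (0th moment)
        self.ls = np.zeros(dim) if ls is None else ls  # linear sum of points (1st moment)
        self.ss = ss                                   # sum of squared norms (2nd moment)

    def add_point(self, x):
        # Incremental update when a single point x is absorbed into the entry.
        self.n += 1
        self.ls = self.ls + x
        self.ss += float(x @ x)

    def merge(self, other):
        # CF Additivity Theorem: CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2).
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R^2 = SS/N - ||X_0||^2 follows from expanding R = (1/N sum ||X_i - X_0||^2)^(1/2).
        x0 = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - float(x0 @ x0), 0.0)))

    def diameter(self):
        # D^2 = (2*N*SS - 2*||LS||^2) / (N*(N-1)), defined for N > 1.
        if self.n < 2:
            return 0.0
        num = 2 * self.n * self.ss - 2 * float(self.ls @ self.ls)
        return float(np.sqrt(max(num / (self.n * (self.n - 1)), 0.0)))

def d0(a, b):
    # Euclidean distance between the centroids of two CF entries.
    return float(np.linalg.norm(a.centroid() - b.centroid()))

def d2(a, b):
    # Average inter-cluster distance, computed from the CF entries alone
    # via the expansion N_Y*SS_X + N_X*SS_Y - 2*LS_X^T LS_Y.
    cross = b.n * a.ss + a.n * b.ss - 2 * float(a.ls @ b.ls)
    return float(np.sqrt(max(cross / (a.n * b.n), 0.0)))

# Usage example: two points absorbed into one entry.
cf = CF(dim=2)
cf.add_point(np.array([1.0, 0.0]))
cf.add_point(np.array([3.0, 0.0]))
print(cf.centroid(), cf.radius(), cf.diameter())   # [2. 0.], 1.0, 2.0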

Outline 1 Introduction 2 Clustering Features 3 CF Tree 4 BIRCH Clustering 5 Performance and Experiments 6 Conclusions

CF Tree A height-balanced tree with branching factor B, leaf size L, and leaf threshold T. Non-leaf nodes contain at most B entries of the form [CF_i, child_i]; each node represents the cluster made up of all the subclusters of its entries. Leaf nodes contain at most L entries of the form [CF_i], plus pointers to the previous and next leaf nodes. All entries in a leaf node must satisfy the threshold requirement, i.e., the diameter (or radius) of the subcluster must be less than T.

CF Tree Example: a CF tree with B = 3 and L = 4, showing the root, non-leaf nodes with [CF_i, child_i] entries, and leaf nodes with CF entries and prev/next pointers [figure].

Size of CF Tree In BIRCH, each node is required to fit into a page of size P. Given the dimensionality (and type) of the data, B and L can be determined from P. Thus the size of the tree is a function of T: a larger T allows more points to be absorbed into the same entries.
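As an illustration only: assuming a particular byte layout (a 4-byte count, 8-byte floats for LS and SS, and 8-byte child and prev/next pointers; the actual layout is an implementation choice, not fixed by the paper), B and L could be derived from P roughly as in this hypothetical helper.

def branching_factors(page_size_bytes, dim):
    # Assumed sizes: N (4-byte int) + LS (dim 8-byte floats) + SS (8-byte float).
    cf_bytes = 4 + 8 * dim + 8
    nonleaf_entry = cf_bytes + 8                  # [CF_i, child_i] with an 8-byte pointer
    leaf_entry = cf_bytes                         # [CF_i] only
    B = page_size_bytes // nonleaf_entry
    L = (page_size_bytes - 2 * 8) // leaf_entry   # reserve prev/next leaf pointers
    return B, L

print(branching_factors(1024, dim=2))             # e.g. P = 1 KB, 2-d data -> (28, 36)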

Operations of CF Tree The structure of the CF tree is similar to a B+-tree (used for representing sorted data). Basic operation: insert a new CF entry (a single point or a subcluster) into the tree. Additional operation: rebuild the tree with an increased threshold T.

Insertion Algorithm
1. Identifying the appropriate leaf: descend the tree by always choosing the closest child node according to the chosen distance metric (D0-D4).
2. Modifying the leaf: find the closest entry. If the threshold is not exceeded, absorb the new entry into it. Otherwise, if L is not exceeded, add a new entry to the leaf. Otherwise, split the node: choose the two farthest entries as seeds and redistribute the rest, each to the closest seed.
3. Modifying the path to the leaf: if the child was not split, just update its CF. If there was a split and B is not exceeded, add the new node to the parent; otherwise split the parent and recurse upwards.
4. A merging refinement: when the split propagates up to some non-leaf node, test whether the split pair are the closest of its entries. If not, merge the closest entries and re-split if needed.
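A simplified sketch of step 2 (modifying the leaf), reusing the hypothetical CF class and d0 helper from the earlier sketch and using the diameter as the threshold criterion; node splitting is left to the caller.

def insert_into_leaf(leaf_entries, new_cf, T, L):
    """Return the updated entry list, or None to signal that the leaf must split."""
    if leaf_entries:
        # Find the closest existing entry by centroid distance (D0).
        closest = min(leaf_entries, key=lambda e: d0(e, new_cf))
        merged = closest.merge(new_cf)
        if merged.diameter() <= T:                  # threshold not violated: absorb
            leaf_entries[leaf_entries.index(closest)] = merged
            return leaf_entries
    if len(leaf_entries) < L:                       # room left: add a new entry
        return leaf_entries + [new_cf]
    return None                                     # leaf overflow: caller splits the node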

Insertion Algorithm example (B = 3, L = 4): the initial CF tree before a new data point is inserted [figure].

Insertion Algorithm example: identifying the appropriate leaf for the new data point [figure].

Insertion Algorithm example: modifying the leaf [figure].

Insertion Algorithm example: modifying the path to the leaf [figure].

Insertion Algorithm example: modifying the path to the leaf, continued [figure].

Insertion Algorithm example: the merging refinement [figure].

Rebuilding Algorithm If the selected T led to a tree that is too large for the memory, the tree must be rebuilt with a larger T. This can be done with little extra memory and without scanning the data again. Assume the original tree t_i has threshold T_i, size S_i, and height h_i. Proceed one path from a leaf to the root at a time:
1. Copy the path to the new tree.
2. Insert the leaf entries into the new tree: if they can be absorbed into, or fit in, a closer path of the new tree, insert them there; otherwise insert them into the leaf of the copied path.
3. Remove empty nodes from both the old and the new tree.
For the new tree t_{i+1}: if T_{i+1} >= T_i, then S_{i+1} <= S_i, and the transformation from t_i to t_{i+1} needs at most h_i extra pages of memory.
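A drastically simplified analogue of rebuilding, assuming the flat list of leaf CF entries and the hypothetical insert_into_leaf helper from the sketches above rather than a real path-by-path CF-tree rebuild: only the existing summaries (never the raw data) are re-inserted under the larger threshold, so nearby entries absorb each other.

def rebuild(old_entries, T, max_entries):
    # Re-insert the existing CF summaries under the larger threshold T;
    # return None if they still do not fit, so the caller can grow T further.
    new = []
    for e in old_entries:
        placed = insert_into_leaf(new, e, T, max_entries)
        if placed is None:
            return None
        new = placed
    return new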

Rebuilding Sketch Rebuilding CF tree [Zhang et al. 1997]

Anomalies The limited number of entries per node and a skewed input order can produce anomalies: two subclusters that should be in one cluster are split apart; two subclusters that should not be in one cluster are kept in the same node; two instances of the same data point might end up in distinct leaf entries. Global clustering (BIRCH Phase 3) solves the first two anomalies; a further pass over the data is needed for the last one (optional Phase 4).

Outline 1 Introduction 2 Clustering Features 3 CF Tree 4 BIRCH Clustering 5 Performance and Experiments 6 Conclusions

BIRCH Phases BIRCH overview [Zhang et al. 1997]

BIRCH Phase 1
1. Start with an initial in-memory CF tree t_0 using a small threshold T_0.
2. Scan the data and insert the points into the current tree t_i.
3. If memory runs out before all the data is inserted, rebuild a new tree t_{i+1} with T_{i+1} > T_i. Optionally detect outliers while rebuilding: if a leaf entry has far fewer data points than the average, write it to disk instead of adding it to the tree; if disk space runs out, go through all outliers and re-absorb those that do not increase the tree size.
4. Otherwise, proceed to the next phase (and, optionally, recheck whether some of the potential outliers can be absorbed).
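A heavily simplified sketch of this Phase 1 control loop, again using a flat list of CF entries rather than a real CF tree and omitting outlier handling: it shows only the insert-until-memory-runs-out, then rebuild-with-a-larger-T behaviour, reusing the hypothetical CF, insert_into_leaf, and rebuild helpers from the earlier sketches. The threshold growth rule is an assumption, not the heuristic from the paper.

def phase1(points, max_entries, T0=0.0, grow=2.0):
    T, entries = T0, []
    for x in points:
        cf = CF(dim=len(x))
        cf.add_point(np.asarray(x, dtype=float))
        while True:
            placed = insert_into_leaf(entries, cf, T, max_entries)
            if placed is not None:
                entries = placed
                break
            # "Memory" exhausted (step 3): grow the threshold, rebuild from the
            # current summaries only, and retry the insertion.
            T = grow * T if T > 0 else 0.1
            rebuilt = rebuild(entries, T, max_entries)
            while rebuilt is None:
                T *= grow
                rebuilt = rebuild(entries, T, max_entries)
            entries = rebuilt
    return entries, T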

BIRCH Phases 1-2 After Phase 1, the clustering should be easier in several ways: fast, because no I/O operations are needed and the problem of clustering N points is reduced to clustering M subclusters (with, hopefully, M << N); accurate, because outliers are eliminated and the remaining data is reflected at the finest granularity the memory limit allows; less order-sensitive, because the order of the leaf entries provides better data locality than an arbitrary input order. Phase 2: if the global clustering algorithm in Phase 3 is known to work best with some range of input sizes M, continue reducing the tree until that range is reached.

BIRCH Phase 3 Use a global or semi-global clustering algorithm for the M subclusters (the CF entries of the leaf nodes). This fixes the anomalies caused by the limited page size in Phase 1. Several possibilities for handling the subclusters: naive: use the centroid to represent the subcluster; better: treat the centroid as N data points; accurate: as seen, most pairwise distance and quality metrics can be calculated directly from the CF vectors. The authors of BIRCH used agglomerative hierarchical clustering with distance metrics D2 and D4.
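One possible sketch of Phase 3: cluster the leaf-entry centroids with a standard algorithm, weighting each centroid by its point count N (the "better" option above). The original authors used agglomerative hierarchical clustering with D2/D4; weighted k-means from scikit-learn is used here only for brevity, and the phase3 function name is hypothetical.

import numpy as np
from sklearn.cluster import KMeans

def phase3(leaf_entries, K):
    centroids = np.vstack([e.centroid() for e in leaf_entries])
    weights = np.array([e.n for e in leaf_entries], dtype=float)
    km = KMeans(n_clusters=K, n_init=10, random_state=0)
    labels = km.fit_predict(centroids, sample_weight=weights)
    return labels, km.cluster_centers_   # subcluster labels and seeds for Phase 4

The returned cluster centers can serve directly as the seeds for the Phase 4 redistribution below.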

BIRCH Phase 4 Use the centroids of the clusters from Phase 3 as seeds. Make an additional scan of the data and redistribute each point to the cluster of its nearest seed. This fixes the misplacement problems of the earlier phases. It can be extended with additional passes (cf. K-means). Optionally discard outliers (i.e., points too far from the closest seed).
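A sketch of Phase 4: one extra scan over the raw data assigning each point to its nearest Phase 3 seed, with an optional distance cutoff for discarding outliers. The phase4 function name and the cutoff parameter are illustrative assumptions.

import numpy as np

def phase4(data, seeds, outlier_dist=None):
    data = np.asarray(data, dtype=float)
    seeds = np.asarray(seeds, dtype=float)
    # Squared distance from every point to every seed: shape (n_points, n_seeds).
    sqdist = ((data[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=-1)
    labels = sqdist.argmin(axis=1)
    if outlier_dist is not None:
        # Mark points farther than outlier_dist from every seed as outliers (-1).
        labels = np.where(sqdist.min(axis=1) > outlier_dist ** 2, -1, labels)
    return labels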

Outline 1 Introduction 2 Clustering Features 3 CF Tree 4 BIRCH Clustering 5 Performance and Experiments 6 Conclusions

CPU Cost N points in $R^d$, memory of U bytes, page size P bytes. Maximum size of the tree: U/P nodes. Phases 1 and 2: inserting all data points costs $O(N \cdot d \cdot B \cdot \log_B(U/P))$, plus something less for the rebuildings, plus some I/O if outliers are written to disk. Phase 3: depends on the algorithm and the number of subclusters M; if M is limited in Phase 2, the cost does not depend on N. Phase 4: one iteration is $O(N \cdot K)$. So the algorithm scales roughly linearly with N.

Experiments In [Zhang et al. 1996] and [Zhang et al. 1997], the authors compare the performance to K-means and CLARANS (which uses a graph of partitionings and searches it locally to find a good one). The data are random 2-d datasets with K = 100 clusters, the centers placed either on a grid, on a sine curve, or at random, with about 1000 points per cluster. Unsurprisingly, BIRCH uses less memory and is faster, less order-sensitive, and more accurate. Scalability seems quite good: linear w.r.t. the number of clusters and the points per cluster, and slightly worse than linear w.r.t. the data dimensionality.

Application: Filtering Trees from Background Images taken in NIR and VIS [Zhang et al. 1997]

Application: Filtering Trees from Background Separate branches, shadows and sunlit leaves [Zhang et al. 1997]

Outline 1 Introduction 2 Clustering Features 3 CF Tree 4 BIRCH Clustering 5 Performance and Experiments 6 Conclusions

Conclusions "Single-pass, sort-of-linear time algorithm that results in a sort-of-clustering of large number of data points, with most outliers sort-of-thrown-out." [Fox & Gribble 1997] There are lots of heuristics in the overall algorithm. The novelty was in dealing with large datasets by using summary information (CF entries) instead of the data itself. Several extensions of the idea exist: parallel computation, different data types, different summary schemes, and different clustering algorithms.

References
Tian Zhang, Raghu Ramakrishnan and Miron Livny (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proc. 1996 ACM SIGMOD International Conference on Management of Data, pp. 103-114. ACM Press, New York, NY.
Tian Zhang, Raghu Ramakrishnan and Miron Livny (1997). BIRCH: A New Data Clustering Algorithm and Its Applications. Data Mining and Knowledge Discovery, 1(2):141-182. Springer Netherlands.
Armando Fox and Steve Gribble (1997). Databases Paper Summaries. [Online, cited 23 April 2008.] Available at: http://www.cs.berkeley.edu/~fox/summaries/database/