BIRCH: An Efficient Data Clustering Method For Very Large Databases




Tian Zhang, Raghu Ramakrishnan, Miron Livny
CPSC 504
Presenter: Discussion Leader: Sophia (Xueyao) Liang
Image credit: HelenJr, Birches. Online Image. Flickr. 07 Nov 2009.

Outline
What is Data Clustering?
Advantages of BIRCH Algorithm
Clustering Feature (CF) and CF Tree
BIRCH Clustering Algorithm
Applications of BIRCH
Conclusion

What is Data Clustering?
Given a large set of multi-dimensional data points
  - the data space is usually not uniformly occupied
Can group closely related points into a cluster
  - points are similar according to some distance-based measurement function
  - choose the desired number of clusters, K
Discover distribution patterns in the dataset
Help visualize data and guide data analysis

What is Data Clustering?
Popular data mining problem in
  - Machine Learning: probability-based
  - Statistics: distance-based
  - Database: limited memory
Define the problem as:
  - Partition the dataset so as to minimize the size of the clusters
  - The dataset may be larger than memory
  - Minimize I/O costs

Advantages of BIRCH vs. Other Clustering Algorithms
BIRCH is local
  - clusters a point without having to check it against all other data points or clusters
Can remove outliers ("noise")
Produces good clusters with a single scan of the dataset
Linearly scalable
  - minimizes running time
  - adjusts quality of result with regard to available memory

Clustering Feature (CF)
Compact
  - no need to store the individual points belonging to a cluster
Three parts: CFi = (Ni, LSi, SSi), i = 1, 2, ..., M (no. of clusters)
  - N: number of data points in the cluster
  - LS: linear sum of the N data points
  - SS: square sum of the N data points
Sufficient to compute the distance between two clusters
  - centroid, radius, and diameter can all be derived from (N, LS, SS)
When merging two clusters, add the CFs component-wise
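
The CF summary above can be illustrated with a minimal sketch. It assumes SS is kept as a single scalar (the sum of squared norms) and uses Euclidean distance between centroids; the class name `CF` and the function `centroid_distance` are hypothetical names chosen for this example, not taken from the slides. The radius uses the identity R^2 = SS/N - ||LS/N||^2.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List


@dataclass
class CF:
    """Clustering feature (N, LS, SS) of one subcluster (illustrative sketch)."""
    n: int            # N: number of points
    ls: List[float]   # LS: per-dimension linear sum of the points
    ss: float         # SS: sum of squared norms of the points (assumed scalar)

    @classmethod
    def from_point(cls, x):
        return cls(1, list(x), sum(v * v for v in x))

    def merge(self, other):
        # Additivity: the CF of a merged cluster is the component-wise sum.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        # RMS distance of the points to the centroid: sqrt(SS/N - ||LS/N||^2).
        c2 = sum(v * v for v in self.centroid())
        return sqrt(max(self.ss / self.n - c2, 0.0))


def centroid_distance(a, b):
    """Euclidean distance between the centroids of two CFs."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(a.centroid(), b.centroid())))


a = CF.from_point((0.0, 0.0))
b = CF.from_point((2.0, 0.0))
ab = a.merge(b)   # summarizes both points without storing either of them
```

Note that `ab` carries only three values, yet its centroid and radius match those of the two original points.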

CF Tree
[Figure: CF tree with a root node, non-leaf nodes, and a chain of leaf nodes]
Entry: [CF, child]
Root and non-leaf nodes: entries CF1, CF2, ..., CFB with child pointers child1, child2, ..., childB
Leaf nodes: prev pointer, entries CF1, CF2, ..., CFL, next pointer
Branching factor B = max no. of CF entries in a non-leaf node
L = max no. of CF entries in a leaf node
Threshold requirement: T = max radius/diameter of a CF (in a leaf)

CF Tree
Tree size is a function of T
  - larger T, more points in each cluster, smaller tree
A good choice of T reduces the number of rebuilds
  - if T is too low, it can be increased dynamically
  - if T is too high, the CF tree is less detailed
A heuristic approach is used to estimate the next threshold value
The CF tree is built dynamically as data is scanned and inserted

CF Tree Insertion
Identify the appropriate leaf:
  - Start with the CF list at the root node and find the closest cluster (using the CF values)
  - Look at all the children of that cluster and find the closest
  - And so on, until you reach a leaf node
Modify the leaf:
  - Find the closest leaf entry and test whether it can absorb the new entry without violating the threshold condition
  - If not, add a new entry to the leaf
  - Leaves have a max size; they may need to be split
Modify the path:
  - Once the point has been added, update the CF of all ancestors
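
The insertion steps above can be sketched as follows. This is a simplified illustration, not the full procedure: node splitting is omitted, the threshold is applied to the radius, closeness is measured as Euclidean distance between centroids, and CFs are plain tuples `(N, LS, SS)`. The `Node` class and all function names are hypothetical.

```python
from math import sqrt


def cf_of(x):
    return (1, list(x), sum(v * v for v in x))


def cf_add(a, b):
    return (a[0] + b[0], [p + q for p, q in zip(a[1], b[1])], a[2] + b[2])


def centroid(cf):
    return [v / cf[0] for v in cf[1]]


def radius(cf):
    c2 = sum(v * v for v in centroid(cf))
    return sqrt(max(cf[2] / cf[0] - c2, 0.0))


def dist(a, b):
    return sqrt(sum((p - q) ** 2 for p, q in zip(centroid(a), centroid(b))))


class Node:
    def __init__(self, entries=None, children=None):
        self.entries = entries or []   # list of CF tuples
        self.children = children       # None for a leaf, else one Node per entry


def closest(entries, new):
    return min(range(len(entries)), key=lambda k: dist(entries[k], new))


def insert(node, x, threshold):
    new = cf_of(x)
    if node.children is None:
        # Leaf: absorb into the closest entry if the threshold still holds.
        if node.entries:
            i = closest(node.entries, new)
            merged = cf_add(node.entries[i], new)
            if radius(merged) <= threshold:
                node.entries[i] = merged
                return
        # Otherwise start a new entry (a full tree would split an overfull leaf).
        node.entries.append(new)
        return
    # Non-leaf: descend into the closest child, then update this ancestor's CF.
    i = closest(node.entries, new)
    insert(node.children[i], x, threshold)
    node.entries[i] = cf_add(node.entries[i], new)


# A hand-built two-level tree, since this sketch never splits nodes.
leaf1 = Node(entries=[cf_of((0.0, 0.0))])
leaf2 = Node(entries=[cf_of((10.0, 10.0))])
root = Node(entries=[cf_of((0.0, 0.0)), cf_of((10.0, 10.0))],
            children=[leaf1, leaf2])
insert(root, (0.5, 0.0), threshold=1.0)   # absorbed by the entry in leaf1
insert(root, (9.0, 3.0), threshold=1.0)   # too far: becomes a new entry in leaf2
```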

BIRCH Clustering Algorithm
Phase 1: Load data into memory by building a CF tree
Phase 2 (optional): Condense into a desirable range by building smaller CF trees
Phase 3: Global clustering
Phase 4 (optional): Cluster refining

Phase 1 BIRCH
Start with an initial threshold T and insert points into the tree
If we run out of memory, increase T and rebuild
  - re-insert leaf entries from the old tree into the new tree
  - remove outliers
Methods for initializing and adjusting T are ad hoc
After Phase 1:
  - data is reduced to fit in memory
  - subsequent processing occurs entirely in memory (no I/O)
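
A minimal sketch of the Phase 1 rebuild loop, under strong simplifying assumptions: the tree is flattened to a single list of leaf entries, "out of memory" is modeled by a hypothetical `max_entries` budget, T is simply multiplied by a hypothetical `growth` factor (the slides only say the adjustment is ad hoc), and outlier removal is left out. What it does show is that a rebuild re-inserts the old leaf CFs, not the raw points.

```python
from math import sqrt


def cf_of(x):
    return (1, list(x), sum(v * v for v in x))


def cf_add(a, b):
    return (a[0] + b[0], [p + q for p, q in zip(a[1], b[1])], a[2] + b[2])


def centroid(cf):
    return [v / cf[0] for v in cf[1]]


def radius(cf):
    c2 = sum(v * v for v in centroid(cf))
    return sqrt(max(cf[2] / cf[0] - c2, 0.0))


def dist(a, b):
    return sqrt(sum((p - q) ** 2 for p, q in zip(centroid(a), centroid(b))))


def absorb_or_append(entries, new, threshold):
    """Merge `new` into the closest entry if the merged radius stays within T."""
    if entries:
        i = min(range(len(entries)), key=lambda k: dist(entries[k], new))
        merged = cf_add(entries[i], new)
        if radius(merged) <= threshold:
            entries[i] = merged
            return
    entries.append(new)


def phase1(points, threshold, max_entries, growth=2.0):
    if threshold <= 0 or max_entries < 1 or growth <= 1:
        raise ValueError("need threshold > 0, max_entries >= 1, growth > 1")
    entries = []
    for x in points:
        absorb_or_append(entries, cf_of(x), threshold)
        while len(entries) > max_entries:
            # Budget exceeded: raise T and rebuild from the old leaf CFs only.
            threshold *= growth
            old, entries = entries, []
            for cf in old:
                absorb_or_append(entries, cf, threshold)
    return entries, threshold


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (20.0, 0.0), (21.0, 0.0)]
entries, final_t = phase1(points, threshold=0.1, max_entries=3)
```

With these inputs the initial threshold is too small to merge anything, so the loop raises T until neighbouring points collapse into shared entries.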

Phase 2 BIRCH
Optional
The number of clusters produced in Phase 1 may not be suitable for the algorithms used in Phase 3
Shrink the tree as necessary
  - remove more outliers
  - merge crowded subclusters

Phase 3 BIRCH
Problems after Phase 1:
  - input order affects results
  - node splits are triggered by node capacity rather than by the data's cluster structure
Use the leaf entries of the CF tree as input to a standard ("global") clustering algorithm
  - k-means, hierarchical clustering (HC)
Phase 1 has reduced the size of the input dataset enough that the standard algorithm can work entirely in memory
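
As an illustration of global clustering over leaf entries, the sketch below runs a simple agglomerative (HC-style) merge directly on CF tuples `(N, LS, SS)`: it repeatedly merges the two entries with the closest centroids until K clusters remain. The centroid-distance criterion and the function name `global_cluster` are assumptions made for this example; the slides do not specify which variant is used.

```python
from math import sqrt


def cf_add(a, b):
    return (a[0] + b[0], [p + q for p, q in zip(a[1], b[1])], a[2] + b[2])


def centroid(cf):
    return [v / cf[0] for v in cf[1]]


def dist(a, b):
    return sqrt(sum((p - q) ** 2 for p, q in zip(centroid(a), centroid(b))))


def global_cluster(leaf_cfs, k):
    """Merge the closest pair of CFs until only k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = list(leaf_cfs)
    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        merged = cf_add(clusters[i], clusters[j])   # CF additivity again
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Four leaf entries summarizing six points on a line (made-up toy values).
leaves = [
    (2, [1.0, 0.0], 1.0),      # points 0 and 1
    (1, [2.0, 0.0], 4.0),      # point 2
    (2, [21.0, 0.0], 221.0),   # points 10 and 11
    (1, [12.0, 0.0], 144.0),   # point 12
]
final = global_cluster(leaves, k=2)
```

Because the input is a handful of summaries rather than the raw points, the quadratic pair search stays cheap.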

Phase 4 BIRCH
Optional
Scan through the data again and assign each data point to a cluster
  - choose the cluster whose centroid is closest
This redistributes data points among clusters more accurately than the original CF clusters
Can be repeated for further refinement of the clusters
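
The refinement pass can be sketched as a nearest-centroid reassignment. In this sketch the seed centroids are assumed to come from Phase 3, each pass recomputes the centroids from the new assignment, and the function name `refine` and the `passes` parameter are hypothetical.

```python
def sqdist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))


def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def refine(points, centroids, passes=1):
    """Assign each point to its closest centroid, then update the centroids.

    Returns the labels from the last pass and the centroids recomputed
    from those labels.
    """
    if passes < 1:
        raise ValueError("passes must be at least 1")
    centroids = [list(c) for c in centroids]
    for _ in range(passes):
        labels = [min(range(len(centroids)), key=lambda k: sqdist(x, centroids[k]))
                  for x in points]
        updated = []
        for k, c in enumerate(centroids):
            members = [x for x, lab in zip(points, labels) if lab == k]
            # Keep the old centroid if no point was assigned to this cluster.
            updated.append(mean(members) if members else c)
        centroids = updated
    return labels, centroids


points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
seeds = [(3.0, 0.0), (10.0, 0.0)]     # e.g. centroids produced by Phase 3
labels, centroids = refine(points, seeds, passes=1)
```

Running with a larger `passes` value corresponds to repeating the scan for further refinement.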

Applications of Data Clustering
Helps identify natural groupings that exist within a dataset
Image processing
  - separate similar properties in an image

Applications of Data Clustering
Bioinformatics
  - identify genes that are regulated by common mechanisms
Market analysis
  - distinguish groups of consumers with similar tastes

Conclusion
According to the paper, BIRCH performs better than the existing algorithms it was compared with on large datasets
  - reduces I/O
  - accounts for memory constraints
Produces a good clustering from only one scan of the entire dataset: O(n)
Handles outliers