Unsupervised Data Mining (Clustering)

Similar documents
Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Cluster Analysis: Advanced Concepts

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Chapter 7. Cluster Analysis

Clustering & Visualization

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

A Survey on Pre-processing and Post-processing Techniques in Data Mining

Going Big in Data Dimensionality:

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

How To Cluster

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Forschungskolleg Data Analytics Methods and Techniques

Clustering Data Streams

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

A comparison of various clustering methods and algorithms in data mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

An Analysis on Density Based Clustering of Multi Dimensional Spatial Data

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Principles of Data Mining by Hand&Mannila&Smyth

Social Media Mining. Data Mining Essentials

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets

Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: Basic Concepts and Algorithms

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC Politecnico di Milano)

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

The Scientific Data Mining Process

An Overview of Knowledge Discovery Database and Data mining Techniques

Clustering methods for Big data analysis

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

Data Mining Analytics for Business Intelligence and Decision Support

Data Clustering Techniques Qualifying Oral Examination Paper

ADVANCED MACHINE LEARNING. Introduction

Cluster Analysis: Basic Concepts and Algorithms

CS Introduction to Data Mining Instructor: Abdullah Mueen

Data, Measurements, Features

How To Perform An Ensemble Analysis

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Information Management course

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

On Clustering Validation Techniques

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining: Overview. What is Data Mining?

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

Authors. Data Clustering: Algorithms and Applications

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

The SPSS TwoStep Cluster Component

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Support Vector Machines with Clustering for Training with Very Large Datasets

Grid Density Clustering Algorithm

Clustering Techniques: A Brief Survey of Different Clustering Algorithms

Sanjeev Kumar. contribute

Chapter ML:XI (continued)

Clustering Very Large Data Sets with Principal Direction Divisive Partitioning

1. What are the uses of statistics in data mining? Statistics is used to Estimate the complexity of a data mining problem. Suggest which data mining

Classification algorithm in Data mining: An Overview

Clustering Via Decision Tree Construction

Building Data Cubes and Mining Them. Jelena Jovanovic

OUTLIER ANALYSIS. Authored by CHARU C. AGGARWAL IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

Multidimensional data analysis

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Echidna: Efficient Clustering of Hierarchical Data for Network Traffic Analysis

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

ANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

The basic data mining algorithms introduced may be enhanced in a number of ways.

GE-INTERNATIONAL JOURNAL OF ENGINEERING RESEARCH VOLUME -3, ISSUE-6 (June 2015) IF ISSN: ( ) EMERGING CLUSTERING TECHNIQUES ON BIG DATA

Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Data Preprocessing. Week 2

An Introduction to Cluster Analysis for Data Mining

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Hadoop SNS. renren.com. Saturday, December 3, 11

Azure Machine Learning, SQL Data Mining and R

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Smart-Sample: An Efficient Algorithm for Clustering Large High-Dimensional Datasets

Data Warehousing und Data Mining

Geospatial Data Integration Open Service Bus Architecture

Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach

Cluster analysis Cosmin Lazar. COMO Lab VUB

Research Article Traffic Analyzing and Controlling using Supervised Parametric Clustering in Heterogeneous Network. Coimbatore, Tamil Nadu, India

2 Basic Concepts and Techniques of Cluster Analysis

Web Document Clustering

Transcription:

Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51

Introduction Clustering in KDD One of the main tasks in the KDD process is the analysis of data when we do not know its structure This task is very different from the task of prediction in which we know the true answer and we try to approximate it A large number of KDD projects involve unsupervised problems (KDNuggets poll, -3 most frequent task) Problems: Scalability, Arbitrary cluster shapes, New types of data, Parameters fitting,... Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 / 51

Clustering in DM Clustering in Data Mining Introduction Structured Data Data Streams Feature Selection Feature Transformation Data Dimensionality Reduction Unstructured Data Paralelization Clustering in DM Model Based Data Compression Density Based Strategies Clustering Algorithms Approximation Sampling Other Grid Based One pass Agglomerative Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 3 / 51

All kinds of data Clustering in Data Mining The data Zoo Structured Data Unstructured Data Data Data Streams Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 4 / 51

Unstructured data Clustering in Data Mining The data Zoo Only one table of observations Each example represents an instance of the problem Each instance is represented by a set of attributes (Discrete, continuous) A B C 1 3.1 a 1 5.7 b 0 -. b 1-9.0 c 0 0.3 d 1.1 a...... Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 5 / 51

Structured data Clustering in Data Mining The data Zoo One sequential relation among instances (Time, Strings) Several instances with internal structure (eg: sequences of events) Subsequences of unstructured instances (eg: sequences of complex transactions) One big instance (eg: time series) Several relations among instances (graphs, trees) Several instances with internal structure (eg: XML documents) One big instance (eg: social network) Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 6 / 51

The data Zoo Data Streams Endless sequence of data (eg: sensor data) Several streams synchronized Unstructured instances Structured instances Static/Dynamic model Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 7 / 51

I am in Hyperspace Clustering in Data Mining Dimensionality Reduction Feature Transformation Feature Selection Dimensionality Reduction Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 8 / 51

Dimensionality Reduction The Curse of dimensionality Two problems arise from the dimensionality of a dataset The computational cost of processing the data (scalability of the algorithms) The quality of the data (more probability of bad data) Two elements define the dimensionality of a dataset The number of examples The number of attributes Having too many examples can sometimes be solved by sampling Attribute reduction is a more complex problem Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 9 / 51

Reducing attributes Clustering in Data Mining Dimensionality Reduction Usually the number of attributes of the dataset has an impact on the performance of the algorithms: Because their poor scalability (cost is a function of the number of attributes) Because the inability to cope with irrelevant/noisy/redundant attributes Because the low ratio instances/dimensions (sparsity of space) Two main methodologies to reduce attributes from a dataset Transforming the data to a space of less dimensions preserving the structure of the data (dimensionality reduction - feature extraction) Eliminating the attributes that are not relevant for the goal task (feature subset selection) Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 10 / 51

Dimensionality Reduction Dimensionality reduction Many techniques have been developed for this purpose Projection to a space that preserve the statistical model of the data (Linear: PCA, ICA; Non linear: Kernel-PCA) Projection to a space that preserves distances among the data (Linear: Multidimensional scaling, random projection, SVD, NMF; Non linear: ISOMAP, LLE, t-sne) Not all this techniques are scalable per se (O(N ) to O(N 3 ) time complexity), but some can be approximated Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 11 / 51

Dimensionality Reduction Original (60 Dim) PCA Kernel PCA ISOMAP Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51

Dimensionality Reduction Unsupervised Attribute Selection The goal is to eliminate from the dataset all the redundant or irrelevant attributes The original attributes are preserved The methods for Unsupervised Attribute Selection are less developed than in Supervised Attribute Selection The problem is that an attribute can be relevant or not depending on the goal of the discovery process There are mainly two techniques for attribute selection: Wrapping and Filtering Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 13 / 51

All kinds of algorithms Clustering in Data Mining Clustering Algorithms Grid Based Density Based Agglomerative Clustering Algorithms Model Based Other Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 14 / 51

Clustering Algorithms Model/Center Based Algorithms The algorithm is usually biased by a preferred model (hyperspheres, hyper elipsoids,...) Clustering defined as an optimization problem An objective function is optimized (Distorsion, loglikelihood,...) Based on gradient descent-like algorithms k-means and variants Probabilistic mixture models/em algorithms (eg: Gaussian Mixture) Scalability depends on the number of parameters, size of the dataset and iterations until convergence Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 15 / 51

Clustering Algorithms Model Based Algorithms 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 16 / 51

Clustering Algorithms Agglomerative Algorithms Based on the hierarchical aggregation of examples From spherical to arbitrary shaped clusters Defined over the affinity/distance matrix Algorithms based on graphs/matrix algebra Several criteria for the aggregation Scalability is cubic on the number of examples (O(n log(n)) using heaps, O(n ) in special cases) Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 17 / 51

Clustering Algorithms Agglomerative Algorithms x 0 4 6 8 9 7 4 1 13 5 3 6 1 8 11 14 10 0 16 1 18 4 19 3 17 15 1 0 1 3 4 5 1 10 8 6 4 0 x1 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 18 / 51

Clustering Algorithms Density Based Algorithms Based on the discovery of areas of larger density Arbitrary shaped clusters Noise/outliers tolerant Algorithms based on: Topology (DBSCAN-like: core points, reachability) Kernel density estimation (DENCLUE, mean shift) Scalability depends on the size of the dataset and the computation of the density (KDE, k-nearest neighbours) (best complexity O(n log(n))) Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 19 / 51

Clustering Algorithms Density Based Algorithms - DBSCAN Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 0 / 51

Clustering Algorithms Density Based Algorithms - DENCLUE Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51

Clustering Algorithms Grid Based Algorithms Based on the discovery of areas of larger density by space partitioning Arbitrary shaped clusters Noise/outliers tolerant Algorithms based on: Fixed/adaptative grid partitioning of features Histogram mode discovery and dimension merging Scalability depends on the number of features, grid granularity and number of examples Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 / 51

Grid Based Algorithms Clustering in Data Mining Clustering Algorithms Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 3 / 51

Clustering Algorithms Other Algorithms Based on graph theory K-way graph partitioning Max-flow/min-flow algorithms Spectral clustering Message passing/belief propagation (affinity clustering) Evolutionaty algorithms SVM clustering/maximum margin clustering Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 4 / 51

Clustering Algorithms Other Clustering Related Areas Consensus clustering Clustering by consensuating several partitions Semisupervised clustering/clustering with constraints Clustering with additional information (must and cannot links) Subspace clustering Clusters with reduced dimensionality Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 5 / 51

Size matters :-) Clustering in Data Mining Approximation One pass Data compression Strategies Sampling Paralelization Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 6 / 51

Strategies for cluster scalability One-pass Process data as a stream Summarization/data compression Compress examples to fit more data in memory Sampling/Batch algorithms Process a subset of the data and maintain/compute a global model Approximation Avoid expensive computations by approximate estimation Paralelization/Distribution Divide the task in several parts and merge models Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 7 / 51

One pass Clustering in Data Mining This strategy is based on incremental clustering algorithms They are cheap but order of processing affects greatly their quality Although can be used as a preprocessing step Two steps algorithms 1 A large number of clusters is generated using the one-pass algorithm A more accurate algorithm clusters the preprocessed data Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 8 / 51

Data Compression/Summarization Discard sets of examples and summarize by: Sufficient statistics Density approximations Discard data irrelevant for the model (do not affect the result) Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 9 / 51

Approximation Clustering in Data Mining Not using all the information available to make decisions Using K-neighbours (data structures for computing k-neighbours) Preprocessing the data using a cheaper algorithm Generate batches using approximate distances (eg: canopy clustering) Use approximate data structures Use of hashing or approximate counts for distances and frequency computation Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 30 / 51

Batches/Sampling Clustering in Data Mining Process only data that fits in memory Obtain from the data set: Samples (process only a subset of the dataset) Determine the size of the sample so all the clusters are represented Batches (process all the dataset) Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 31 / 51

Paralelization/Distribution/Divide&Conquer Paralelization of clustering usually depends on the specific algorithm Some are difficult to parallelize (eg: hierarchical clustering) Some have specific parts that can be solved in parallel or by Divide&Conquer Distance computations in k-means Parameter estimation in EM algorithms Grid density estimations Space partitioning Batches and sampling are more general approaches The problem is how to merge all the different partitions Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 3 / 51

Examples - One pass + Summarization One pass + hierarchical single link One pass summarization using Leader algorithm (many clusters) Single link hierarchical clustering of the summaries Theoretical results about equivalence to SL at top levels (from O(N ) to O(c )) Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 33 / 51

One pass + Single Link nd Phase Hierarchical Clustering 1st Phase Leader Algorithm Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 34 / 51

Examples - One pass + Summarization One pass + model based algorithms BIRCH algorithm One-pass Leader algorithm using an adaptive hierarchical data structure (CF-tree) Postprocess for noise filtering Clustering of cluster centers from CF-tree Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 35 / 51

One pass + CFTREE (BIRCH) 1st Phase - CFTree nd Phase - Kmeans Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 36 / 51

Examples - Batch + (Divide&Conquer / One-pass) Batch + K-means-like SaveSpace algorithm (hierarchy of centers) Clustering using a randomized algorithm based on k-facilities problem (LSEARCH algorithm) Data is divided on M batches Each batch is clustered in k weighted centers Recursivelly M k centers are divided on new batches for the next level Last level is clustered in k centers Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 37 / 51

Batch + Divide&Conquer M Batches - K Clusters each M*K new examples Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 38 / 51

Examples - Sampling + Summarization Random Samples + Data Compression Scalable K-means Retrieve a sample from the database that fits in memory Cluster with the current model Discard data close to the centers Compress high density areas using sufficient statistics Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 39 / 51

Sampling + Summarization Original Data Sampling Updating model Compression More data is added Updating model Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 40 / 51

Examples - Sampling + Distribution CURE Draws a random samples from the dataset Partitions the sample in p groups Executes the clustering algorithm on each partition Deletes outliers Runs the clustering algorithm on the union of all groups until it obtains k groups Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 41 / 51

Sampling + Distribution DATA Sampling+Partition Clustering partition 1 Clustering partition Join partitions Labelling data Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 4 / 51

Examples - Approximation + Distribution Canopy Clustering Approximate dense areas by probing the space of examples Batches are formed by extracting spherical volumes (canopies) using a cheap distance Inner radius/outer radius Each canopy is clustered separately (disjoint clusters) Only distances inside the canopy are computed Several strategies for classical algorithms Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 43 / 51

Approximation + Distribution 1st Canopy nd Canopy 3rd Canopy Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 44 / 51

Examples - Divide&Conquer K-means + indexing Build a kd-tree structure for the examples Determine the example-center assignment in parallel Each branch of the kd-tree only keeps a subset of centers If only one candidate in a branch no distances computed Distances only are computed at leaves Effective for low dimensionality Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 45 / 51

Divide&Conquer - K-means + KD-tree KD-TREE Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 46 / 51

Examples - Divide&Conquer Optigrid Find projections of the data in lower dimensionality Determine the best cuts of the projections Build a grid with the cuts Recurse for dense cells (paralellization) Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 47 / 51

Divide&Conquer - Grid Clustering Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 48 / 51

Examples - Approximation Quantization Cluster (k-means) Quantize the space of examples and keep dense cells Approximate cluster assignments using cells as examples Recompute centers with examples in the cells Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 49 / 51

Divide&Conquer + approximation Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 50 / 51

The END Clustering in Data Mining THANKS (Questions, Comments,...) Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 51 / 51