SoSe 2014: M-TANI: Big Data Analytics

Similar documents
Clustering UE 141 Spring 2013

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

Standardization and Its Effects on K-Means Clustering Algorithm

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

CS Data Science and Visualization Spring 2016

USING THE AGGLOMERATIVE METHOD OF HIERARCHICAL CLUSTERING AS A DATA MINING TOOL IN CAPITAL MARKET 1. Vera Marinova Boncheva

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Introduction to Clustering

Cluster analysis Cosmin Lazar. COMO Lab VUB

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Neural Networks Lesson 5 - Cluster Analysis

Cluster Analysis: Basic Concepts and Algorithms

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015

Chapter ML:XI (continued)

Introduction to Data Science: CptS Syllabus First Offering: Fall 2015

Machine Learning using MapReduce

Chapter 7. Cluster Analysis

Cluster Analysis using R

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

How To Learn Data Analytics

Big Data Analytics. Genoveva Vargas-Solar French Council of Scientific Research, LIG & LAFMIA Labs

Unsupervised learning: Clustering

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Time series clustering and the analysis of film style

ANALYSIS OF VARIOUS CLUSTERING ALGORITHMS OF DATA MINING ON HEALTH INFORMATICS

Statistical Databases and Registers with some datamining

Information Retrieval and Web Search Engines

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

How To Cluster

Big Data Management and Analytics

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Mining Social Network Graphs

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Cluster Analysis: Advanced Concepts

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Part 2: Community Detection

Big Data Analytics. Lucas Rego Drumond

A Comparative Analysis of Various Clustering Techniques used for Very Large Datasets

A Comparative Study of clustering algorithms Using weka tools

Clustering Connectionist and Statistical Language Processing

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

Unsupervised Data Mining (Clustering)

Estimating PageRank Values of Wikipedia Articles using MapReduce

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

Concept of Cluster Analysis

A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Principles of Data Mining by Hand&Mannila&Smyth

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Clustering Marketing Datasets with Data Mining Techniques

BIRCH: An Efficient Data Clustering Method For Very Large Databases

Clustering Techniques: A Brief Survey of Different Clustering Algorithms

processed parallely over the cluster nodes. Mapreduce thus provides a distributed approach to solve complex and lengthy problems

Teaching Scheme Credits Assigned Course Code Course Hrs./Week. BEITC802 Big Data Analytics. Theory Marks

A Relevant Document Information Clustering Algorithm for Web Search Engine

A Study on the Hierarchical Data Clustering Algorithm Based on Gravity Theory

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

HES-SO Master of Science in Engineering. Clustering. Prof. Laura Elena Raileanu HES-SO Yverdon-les-Bains (HEIG-VD)

Segmentation & Clustering

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

Final Project Report

The SPSS TwoStep Cluster Component

Efficient and Fast Initialization Algorithm for K- means Clustering

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

Distance Degree Sequences for Network Analysis

Comparative Analysis of EM Clustering Algorithm and Density Based Clustering Algorithm Using WEKA tool.

Clustering Data Streams

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Personalized Hierarchical Clustering

COC131 Data Mining - Clustering

Chapter 5: Stream Processing. Big Data Management and Analytics 193

Forschungskolleg Data Analytics Methods and Techniques

CLUSTERING FOR FORENSIC ANALYSIS

DHL Data Mining Project. Customer Segmentation with Clustering

Cluster Analysis: Basic Concepts and Algorithms

CSE 427 CLOUD COMPUTING WITH BIG DATA APPLICATIONS

Adaptive Framework for Network Traffic Classification using Dimensionality Reduction and Clustering

Big Data and Analytics (Fall 2015)

Clustering. Chapter Introduction to Clustering Techniques Points, Spaces, and Distances

! E6893 Big Data Analytics Lecture 10:! Linked Big Data Graph Computing (II)

An Introduction to Cluster Analysis for Data Mining

Distances, Clustering, and Classification. Heatmaps

Transcription:

SoSe 2014: M-TANI: Big Data Analytics Lecture 4 21/05/2014 Sead Izberovic Dr. Nikolaos Korfiatis

Agenda Recap from the previous session Clustering Introduction Distance mesures Hierarchical Clustering Partitional Clustering

Introduction Given a set of objects, with a notion of distance between those objects. The task of a clustering algorithm is to group those objects into some number of clusters, so that: Members of a cluster are similar to each other Members of different clusters are dissimilar High dimensionality may be hard to interpret from [2]

Introduction Clustering Classification Classification Assigning objects to predifined classes Requires supervised learning Clustering No predifined classes Assigning objects to clusters (based on distance)

Introduction Clustering applications Clustering DNA sequences Image segmentation Customer segmentation...

Distance mesures Calculate similarity Large distance = low similarity Small distance = high similarity Conditions 1. dddd x, y = d with d: X X R and x, y, z ϵ X 2. dddd x, y 0 3. dddd x, y = 0 if x = y 4. dddd x, y = dddd y, x 5. dddd x, z dddd x, y + dddd(y, z) from [1] and [3]

Distance mesures Euclidian distance: dddd eeeeee X, Y = n i=1 (x i y i ) 2 Jaccard distance: dddd jjjjjjj x, y = 1 SSS x, y Cosine distance: dddd ccc X, Y = n i=1 x i y i n i=1 (x i ) 2 n i=1(x i ) 2 Edit distance: smallest number of insertions and deletions of single characters that will convert string x to string y from [2] and [3]

Distance between clusters Centroid Medoid/Clustroid

Distance between clusters Single-linkage

Distance between clusters Complete-linkage

Distance between clusters Average-linkage

Clustering approches Clustering Hierarchical Partitional adapted from [4]

Hierarchical Agglomerative 1. Every Object is a cluster 2. Calculate the minimal distance between to clusters 3. Merge clusters with minimal distance 4. Repeat 2. and 3. until we have one big cluster Single Link

Hierarchical Agglomerative 1. Every Object is a cluster 2. Calculate the minimal distance between to clusters 3. Merge clusters with minimal distance Single Link 4. Repeat 2. and 3. until we have one big cluster

Hierarchical Agglomerative 1. Every Object is a cluster 2. Calculate the minimal distance between to clusters 3. Merge clusters with minimal distance 4. Repeat 2. and 3. until we have one big cluster Single Link

Hierarchical Agglomerative 1. Every Object is a cluster 2. Calculate the minimal distance between to clusters 3. Merge clusters with minimal distance 4. Repeat 2. and 3. until we have one big cluster Single Link

Hierarchical Agglomerative 1. Every Object is a cluster 2. Calculate the minimal distance between to clusters 3. Merge clusters with minimal distance 4. Repeat 2. and 3. until we have one big cluster Single Link

Hierarchical Agglomerative 1. Every Object is a cluster 2. Calculate the minimal distance between to clusters 3. Merge clusters with minimal distance 4. Repeat 2. and 3. until we have one big cluster Single Link

Hierarchical Agglomerative 1. Every Object is a cluster 2. Calculate the minimal distance between to clusters 3. Merge clusters with minimal distance 4. Repeat 2. and 3. until we have one big cluster Single Link

Hierarchical Agglomerative 1. Every Object is a cluster 2. Calculate the minimal distance between to clusters 3. Merge clusters with minimal distance 4. Repeat 2. and 3. until we have one big cluster Single Link

Hierarchical Agglomerative 1. Every Object is a cluster 2. Calculate the minimal distance between to clusters 3. Merge clusters with minimal distance 4. Repeat 2. and 3. until we have one big cluster Single Link

Hierarchical Agglomerative 1. Every Object is a cluster 2. Calculate the minimal distance between to clusters 3. Merge clusters with minimal distance 4. Repeat 2. and 3. until we have one big cluster Single Link

Hierarchical Agglomerative 1. Every Object is a cluster 2. Calculate the minimal distance between to clusters 3. Merge clusters with minimal distance 4. Repeat 2. and 3. until we have one big cluster Single Link

Hierarchical Divisive 1. Find max. distance between objects 2. Split those clusters 3. Repeat 1. and 2. until we have clusters which contain just one object Single Link

Hierarchical Divisive 1. Find max. distance between objects 2. Split those clusters 3. Repeat 1. and 2. until we have clusters which contain just one object Single Link

Hierarchical Divisive 1. Find max. distance between objects 2. Split those clusters 3. Repeat 1. and 2. until we have clusters which contain just one object Single Link

Hierarchical Divisive 1. Find max. distance between objects 2. Split those clusters 3. Repeat 1. and 2. until we have clusters which contain just one object Single Link

Hierarchical Divisive 1. Find max. distance between objects 2. Split those clusters 3. Repeat 1. and 2. until we have clusters which contain just one object Single Link

Hierarchical Divisive 1. Find max. distance between objects 2. Split those clusters 3. Repeat 1. and 2. until we have clusters which contain just one object Single Link

Hierarchical Divisive 1. Find max. distance between objects 2. Split those clusters 3. Repeat 1. and 2. until we have clusters which contain just one object Single Link

Hierarchical Divisive 1. Find max. distance between objects 2. Split those clusters 3. Repeat 1. and 2. until we have clusters which contain just one object Single Link

Cluster Visualization Dendrogram Distance adopted from [3]

Cluster Visualization Dendrogram Distance adopted from [3]

Cluster Visualization Dendrogram Distance adopted from [3]

Cluster Visualization Dendrogram Distance adopted from [3]

Cluster Visualization Dendrogram Distance adopted from [3]

Cluster Visualization Dendrogram Distance adopted from [3]

Cluster Visualization Dendrogram Distance adopted from [3]

Cluster Visualization Dendrogram Distance adopted from [3]

Clustering approches Clustering Hierarchical Partitional adapted from [4]

K-Means 1. Place k centroids randomly 2. If distance to centroid min. merge to cluster 3. Move the k centroids to the new cluster center 4. Repeat 2. and 3. until we fulfil the stop criteria k=3

K-Means 1. Place k centroids randomly 2. If distance to centroid min. merge to cluster 3. Move the k centroids to the new cluster center 4. Repeat 2. and 3. until we fulfil the stop criteria k=3

K-Means 1. Place k centroids randomly 2. If distance to centroid min. merge to cluster 3. Move the k centroids to the new cluster center 4. Repeat 2. and 3. until we fulfil the stop criteria k=3

K-Means 1. Place k centroids randomly 2. If distance to centroid min. merge to cluster 3. Move the k centroids to the new cluster center 4. Repeat 2. and 3. until we fulfil the stop criteria k=3

K-Means 1. Place k centroids randomly 2. If distance to centroid min. merge to cluster 3. Move the k centroids to the new cluster center 4. Repeat 2. and 3. until we fulfil the stop criteria k=3

K-Means How to choose the right k? Try different k, looking at the change in the average distance to centroid as k increases Average falls rapidly until right k, then changes little Average distance to centroid k from [2]

Clustering on graphs Create the MST Remove k-1 edges with the highest weights Create the clusters k=3

Clustering on graphs 3 Create the MST 6 Remove k-1 edges with the highest weights 2 4 5 Create the clusters 1 k=3

Clustering on graphs 3 Create the MST Remove k-1 edges with the highest weights 2 4 Create the clusters 1 k=3

Clustering on graphs 3 Create the MST Remove k-1 edges with the highest weights 2 4 Create the clusters 1 k=3

Literature 1. Anand Rajaraman, Jeffrey D. Ullman, Jure Leskovec. 2014 Mining of Massive Datasets Cambridge University Press 2. Jure Leskovec. 2014 Slides: Mining Massive Data Sets URL: http://www.stanford.edu/class/cs246/slides/05-clustering.pdf 3. Martin Ester, Jörg Sander. 2000 Knowledge Discovery in Databases Springer Verlag Berlin Heidelberg 4. A. K. Jain, M. N. Murty, and P. J. Flynn. 1999 Data clustering: a review ACM Comput. Surv., 31(3):264-323