Forschungskolleg Data Analytics Methods and Techniques




Forschungskolleg Data Analytics Methods and Techniques
Martin Hahmann, Gunnar Schröder, Phillip Grosse
Prof. Dr.-Ing. Wolfgang Lehner

Why do we need it?
  "We are drowning in data, but starving for knowledge!" (John Naisbitt)
  23 petabytes/second of raw data
  1 petabyte/year

Data Analytics: The Big Four
  Classification
  Association Rules
  Prediction
  23.11. Clustering

Clustering
  automated grouping of objects into so-called clusters
    objects of the same group are similar
    objects of different groups are dissimilar
  Problems
    applicability
    versatility
    support for non-expert users

Clustering Workflow
  algorithm selection (>500)
  algorithm setup
  result interpretation
  adjustment

Clustering Workflow
  algorithm selection: Which algorithm fits my data?
  algorithm setup: Which parameters fit my data?
  result interpretation: How good is the obtained result?
  adjustment: How to improve result quality?
  Algorithmics / Feedback

Traditional Clustering
  traditional clustering: one clustering
  partitional, density-based, hierarchical

Traditional Clustering: Partitional Clustering
  clusters are represented by prototypes
  objects are assigned to the most similar prototype
  similarity via a distance function
  k-means [Lloyd, 1957]
    partitions the dataset into k disjoint subsets
    minimizes the sum-of-squares criterion
    parameters: k, seed

Traditional Clustering: Partitional Clustering
  ISODATA [Ball & Hall, 1965]
    adjusts the number of clusters
    merges, splits, and deletes clusters according to thresholds
    5 additional parameters
  (Figure: k-means results for k=4 and k=7)

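The k-means procedure outlined above can be sketched in a few lines of Python. This is only an illustrative sketch under simple assumptions (squared Euclidean distance, initial prototypes sampled from the data, points given as tuples); the names `kmeans` and `sq_dist` are hypothetical and not taken from the slides.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd-style k-means; returns (assignment, centroids, sum_of_squares)."""
    rng = random.Random(seed)              # parameter "seed" from the slide
    centroids = rng.sample(points, k)      # k initial prototypes
    assignment = None
    for _ in range(max_iter):
        # Assignment step: every object goes to its most similar prototype.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # converged
            break
        assignment = new_assignment
        # Update step: each prototype becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    sse = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, sse


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids, sse = kmeans(points, k=2)
print(assignment, round(sse, 3))
```

The result depends on both parameters: a different k changes the number of subsets, and a different seed can lead to a different local minimum of the sum-of-squares criterion.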

Traditional Clustering: Density-Based Clustering
  clusters are modelled as dense areas separated by less dense areas
  density of an object = number of objects satisfying a similarity threshold
  DBSCAN [Ester & Kriegel et al., 1996]
    dense areas: core objects
    object count in a defined ε-neighbourhood
    connection via overlapping neighbourhoods
    parameters: ε, minpts
  DENCLUE [Hinneburg, 1998]
    density via Gaussian kernels
    cluster extraction by hill climbing
  (Figure: DBSCAN result for ε=1.0, minpts=5)

(Figure: DBSCAN result for ε=0.5, minpts=5)

(Figure: DBSCAN result for ε=0.5, minpts=20)
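To make the roles of ε and minpts concrete, the following is a minimal DBSCAN-style sketch in Python. It assumes Euclidean distance, counts an object as part of its own ε-neighbourhood, and uses a quadratic neighbour search; the function name `dbscan` and the sample data are illustrative assumptions, not material from the slides.

```python
from math import dist


def dbscan(points, eps, minpts):
    """Return one label per point; -1 marks noise."""
    n = len(points)
    # eps-neighbourhood of every object (the object itself is included).
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Core objects have at least minpts objects in their neighbourhood.
    core = [len(nb) >= minpts for nb in neighbours]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None or not core[i]:
            continue
        # Grow a cluster from an unvisited core object by following
        # overlapping neighbourhoods of core objects.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            for q in neighbours[p]:
                if labels[q] is None:
                    labels[q] = cluster
                    if core[q]:
                        frontier.append(q)
        cluster += 1
    return [-1 if label is None else label for label in labels]


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (20, 20)]
print(dbscan(points, eps=1.5, minpts=3))
print(dbscan(points, eps=1.5, minpts=5))
```

With minpts=3 the two dense groups become clusters and the isolated point is noise; raising minpts to 5 leaves no core objects in this toy dataset, so every point is reported as noise.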

Traditional Clustering: Hierarchical Clustering
  builds a hierarchy by progressively merging / splitting clusters
  clusterings are extracted by cutting the dendrogram
  bottom-up: AGNES [Ward, 1963]
  top-down: DIANA [Kaufman & Rousseeuw, 1990]
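The bottom-up variant can be illustrated with a short Python sketch. It assumes single linkage (distance between the closest members of two clusters) and stops merging once a requested number of clusters is reached, which corresponds to cutting the dendrogram at that level. The linkage choice and the name `agglomerative` are assumptions made for this example only.

```python
from math import dist


def agglomerative(points, num_clusters):
    """Bottom-up merging; returns (clusters, merge_history)."""
    clusters = [[i] for i in range(len(points))]   # start with singletons
    merges = []

    def linkage(a, b):
        # Single linkage: distance of the closest pair of members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Find the closest pair of clusters and merge it.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters], merges


points = [(0,), (1,), (5,), (6,), (20,)]
clusters, merges = agglomerative(points, num_clusters=2)
print(clusters)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The merge history is the dendrogram in list form: each entry records which two clusters were joined and at which distance.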

Multiple Clusterings
  traditional clustering
  multiple clusterings: multi-solution clustering, ensemble clustering

Multi-Solution Clustering
  multi-solution clustering: multiple clusterings
  alternative clustering, subspace clustering

Multi-Solution Clustering: Alternative Clustering
  generate an initial clustering with a traditional algorithm
  utilize the given knowledge to find alternative clusterings
  generate a dissimilar clustering
  (Figure: initial clustering, alternative 1, alternative 2)

Multi-Solution Clustering: Alternative Clustering
  COALA [Bae & Bailey, 2006]
    generates an alternative with the same number of clusters
    high dissimilarity to the initial clustering (cannot-link constraints)
    high-quality alternative (objects in a cluster are still similar)
    trade-off between the two goals controlled by parameter w
    if d_quality < w * d_dissimilarity, choose quality, else dissimilarity
  CAMI [Dang & Bailey, 2010]
    clusters as Gaussian mixture
    quality by likelihood
    dissimilarity by mutual information
  Orthogonal Clustering [Davidson & Qi, 2008]
    iterative transformation of the database
    use of constraints
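The COALA trade-off rule quoted above is small enough to write down directly. The sketch below only encodes that one comparison; the interpretation of the two distances (best merge candidate regardless of constraints versus best candidate that respects the cannot-link constraints) and the function name are assumptions for illustration.

```python
def coala_choice(d_quality, d_dissimilarity, w):
    """Decide which candidate merge to take for trade-off parameter w.

    d_quality:        distance of the best merge judged by quality alone
    d_dissimilarity:  distance of the best merge that keeps the result
                      dissimilar to the initial clustering
    """
    if d_quality < w * d_dissimilarity:
        return "quality"
    return "dissimilarity"


# Same candidate distances, different settings of w.
for w in (0.2, 0.5, 1.0):
    print(w, coala_choice(d_quality=1.0, d_dissimilarity=3.0, w=w))
```

Under this rule a larger w favours quality merges, while a smaller w pushes the result towards dissimilarity from the initial clustering.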

Multi-Solution Clustering: Subspace Clustering
  high-dimensional data
  clusters can be observed in arbitrary attribute combinations (subspaces)
  detect multiple clusterings in different subspaces
  CLIQUE [Agrawal et al., 1998]
    based on grid cells
    finds dense cells in all subspaces
  FIRES [Kriegel et al., 2005]
    stepwise merge of base clusters (1D)
    max. dimensional subspace clusters
  DENSEST [Müller et al., 2009]
    estimation of dense subspaces
    correlation based (2D histograms)
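The grid idea behind CLIQUE ("find dense cells in all subspaces") can be sketched as follows. This is a brute-force illustration that enumerates every attribute combination; it does not include the bottom-up pruning a real implementation would use, and the function name, equal-width grid, and density threshold are assumptions for the example.

```python
from collections import Counter
from itertools import combinations


def dense_cells(points, bins, lo, hi, min_count):
    """Map each subspace (tuple of attribute indices) to its dense grid cells."""
    dims = len(points[0])
    width = (hi - lo) / bins

    def cell(x):
        # Index of the equal-width interval containing x (hi falls in the last).
        return min(int((x - lo) / width), bins - 1)

    result = {}
    for size in range(1, dims + 1):
        for subspace in combinations(range(dims), size):
            counts = Counter(
                tuple(cell(p[d]) for d in subspace) for p in points
            )
            dense = {c for c, n in counts.items() if n >= min_count}
            if dense:
                result[subspace] = dense
    return result


points = [(0.1, 0.1), (0.15, 0.9), (0.12, 0.5), (0.9, 0.2)]
print(dense_cells(points, bins=2, lo=0.0, hi=1.0, min_count=3))
```

In this toy dataset three objects share a cell only when projected onto attribute 0, so the grouping is visible in that one-dimensional subspace but not in the full two-dimensional space.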

Ensemble Clustering
  ensemble clustering: multiple base clusterings, one consensus clustering
  pairwise similarity, cluster ids

Ensemble Clustering: General Idea
  generate multiple traditional clusterings
  employ different algorithms / parameters: the ensemble
  combine the ensemble into a single robust consensus clustering
  different consensus mechanisms [Strehl & Ghosh, 2002], [Gionis et al., 2007]

Ensemble Clustering: Pairwise Similarity
  a pair of objects belongs to the same or to different cluster(s)
  pairwise similarities are represented as co-association matrices
        x1 x2 x3 x4 x5 x6 x7 x8 x9
    x1  -  1  1  0  0  0  0  0  0
    x2  -  -  1  0  0  0  0  0  0
    x3  -  -  -  0  0  0  0  0  0
    x4  -  -  -  -  1  1  0  0  0
    x5  -  -  -  -  -  1  0  0  0
    x6  -  -  -  -  -  -  0  0  0
    x7  -  -  -  -  -  -  -  1  1
    x8  -  -  -  -  -  -  -  -  1
    x9  -  -  -  -  -  -  -  -  -
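A co-association matrix like the one on this slide can be derived from a plain list of cluster labels. The sketch below assumes the clustering {x1,x2,x3}, {x4,x5,x6}, {x7,x8,x9} encoded as integer labels and fills only the upper triangle, as on the slide; the function name is hypothetical.

```python
def coassociation(labels):
    """Upper-triangular co-association matrix; unused entries are None."""
    n = len(labels)
    return [
        [(1 if labels[i] == labels[j] else 0) if i < j else None
         for j in range(n)]
        for i in range(n)
    ]


labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]      # x1..x9
matrix = coassociation(labels)
for i, row in enumerate(matrix, start=1):
    cells = " ".join("-" if v is None else str(v) for v in row)
    print(f"x{i}  {cells}")
```

The representation is independent of the actual cluster ids: renaming the labels leaves the matrix unchanged, which is what makes it convenient for comparing clusterings from different algorithms.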

Ensemble Clustering: Pairwise Similarity
  simple example based on [Gionis et al., 2007]
  goal: generate a consensus clustering with maximal similarity to the ensemble
  Clustering 1
        x1 x2 x3 x4 x5 x6 x7 x8 x9
    x1  -  1  1  0  0  0  0  0  0
    x2  -  -  1  0  0  0  0  0  0
    x3  -  -  -  0  0  0  0  0  0
    x4  -  -  -  -  1  1  0  0  0
    x5  -  -  -  -  -  1  0  0  0
    x6  -  -  -  -  -  -  0  0  0
    x7  -  -  -  -  -  -  -  1  1
    x8  -  -  -  -  -  -  -  -  1
    x9  -  -  -  -  -  -  -  -  -
  Clustering 2
        x1 x2 x3 x4 x5 x6 x7 x8 x9
    x1  -  1  1  0  0  0  0  0  0
    x2  -  -  1  0  0  0  0  0  0
    x3  -  -  -  0  0  0  0  0  0
    x4  -  -  -  -  1  0  1  0  1
    x5  -  -  -  -  -  0  1  0  1
    x6  -  -  -  -  -  -  0  1  0
    x7  -  -  -  -  -  -  -  0  1
    x8  -  -  -  -  -  -  -  -  0
    x9  -  -  -  -  -  -  -  -  -
  Clustering 3
        x1 x2 x3 x4 x5 x6 x7 x8 x9
    x1  -  0  1  1  1  0  0  0  0
    x2  -  -  0  0  0  0  0  0  0
    x3  -  -  -  1  1  0  0  0  0
    x4  -  -  -  -  1  0  0  0  0
    x5  -  -  -  -  -  0  0  0  0
    x6  -  -  -  -  -  -  0  0  0
    x7  -  -  -  -  -  -  -  1  1
    x8  -  -  -  -  -  -  -  -  1
    x9  -  -  -  -  -  -  -  -  -
  Clustering 4
        x1 x2 x3 x4 x5 x6 x7 x8 x9
    x1  -  1  1  0  0  0  0  0  0
    x2  -  -  1  0  0  0  0  0  0
    x3  -  -  -  0  0  0  0  0  0
    x4  -  -  -  -  1  0  0  0  0
    x5  -  -  -  -  -  0  0  0  0
    x6  -  -  -  -  -  -  0  1  0
    x7  -  -  -  -  -  -  -  0  1
    x8  -  -  -  -  -  -  -  -  0
    x9  -  -  -  -  -  -  -  -  -

Ensemble Clustering: Consensus Mechanism
  majority decision
    pairs located in the same cluster in at least 50% of the ensemble are also part of the same cluster in the consensus clustering
  share of the four ensemble clusterings in which each pair is in the same cluster:
        x1  x2  x3  x4  x5  x6  x7  x8  x9
    x1  -   3/4 4/4 1/4 1/4 0   0   0   0
    x2  -   -   3/4 0   0   0   0   0   0
    x3  -   -   -   1/4 1/4 0   0   0   0
    x4  -   -   -   -   4/4 1/4 1/4 0   1/4
    x5  -   -   -   -   -   1/4 1/4 0   1/4
    x6  -   -   -   -   -   -   0   2/4 0
    x7  -   -   -   -   -   -   -   2/4 4/4
    x8  -   -   -   -   -   -   -   -   2/4
    x9  -   -   -   -   -   -   -   -   -

Ensemble Clustering: Consensus Mechanism
  more consensus mechanisms exist, e.g. graph partitioning via METIS [Karypis et al., 1997]
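The majority decision can be sketched in Python as follows. The four label vectors are one possible encoding of the four example co-association matrices (x1..x9 mapped to positions 0..8). The slides do not say how pairs that pass the 50% threshold are turned into clusters; this sketch assumes that all such pairs are linked transitively (connected components via union-find), which is only one possible reading.

```python
from itertools import combinations

ensemble = [
    [0, 0, 0, 1, 1, 1, 2, 2, 2],
    [0, 0, 0, 1, 1, 2, 1, 2, 1],
    [0, 1, 0, 0, 0, 2, 3, 3, 3],
    [0, 0, 0, 1, 1, 2, 3, 2, 3],
]


def majority_consensus(ensemble, threshold=0.5):
    """Return (share per pair, consensus clusters as lists of positions)."""
    n = len(ensemble[0])
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    share = {}
    for i, j in combinations(range(n), 2):
        # Fraction of base clusterings that put i and j together.
        share[(i, j)] = sum(l[i] == l[j] for l in ensemble) / len(ensemble)
        if share[(i, j)] >= threshold:
            parent[find(j)] = find(i)      # assumed: link transitively

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return share, sorted(groups.values())


share, consensus = majority_consensus(ensemble)
print(share[(0, 1)], share[(5, 7)], share[(6, 8)])
print(consensus)
```

Note one consequence of the transitivity assumption: x6 and x7 never share a cluster in the ensemble, yet both are linked to x8 by a 2/4 majority and therefore end up in the same consensus cluster. Other consensus mechanisms resolve such conflicts differently.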

Our Research Area: Algorithm Integration in Databases
  Dipl.-Inf. Phillip Grosse, philipp.grosse@sap.com
  the R Project for Statistical Computing
    free software environment for statistical computing and graphics
    includes lots of standard math functions/libraries
    function- and object-oriented
    flexible/extensible

Our Research Area
  Vector: basic data structure in R
    > x <- 1          # vector of size 1
    > x <- c(1,2,3)   # vector of size 3
    > x <- 1:10       # vector of size 10
    > "abc" -> y      # vector of size 1
  Vector output
    > x
     [1]  1  2  3  4  5  6  7  8  9 10
    > sprintf("y is %s", y)
    [1] "y is abc"

Our Research Area
  Vector arithmetic
  element-wise basic operators (+ - * / ^)
    > x <- c(1,2,3,4)
    > y <- c(2,3,4,5)
    > x*y
    [1]  2  6 12 20
  vector recycling (short vectors)
    > y <- c(2,7)
    > x*y
    [1]  2 14  6 28
  arithmetic functions
    log, exp, sin, cos, tan, sqrt, min, max, range, length, sum, prod, mean, var

R as HANA operator (R-OP)
  (Figure: architecture diagram with numbered steps 1-7. Labels: database with R client; TCP/IP connection; pass R script; Rserve [Urbanek03]; fork R process; RICE; SHM managers on both sides that write data to and access data in shared memory (SHM).)

R scripts in databases
    -- Creates some dummy data
    CREATE INSERT ONLY COLUMN TABLE Test1(A INTEGER, B DOUBLE);
    INSERT INTO Test1 VALUES(0, 2.5);
    INSERT INTO Test1 VALUES(1, 1.5);
    INSERT INTO Test1 VALUES(2, 2.0);
    CREATE COLUMN TABLE "ResultTable" AS TABLE "TEST1" WITH NO DATA;
    ALTER TABLE "ResultTable" ADD ("C" DOUBLE);
    -- Creates an SQL-script function including the R script
    CREATE PROCEDURE ROP(IN input1 "Test1", OUT result "ResultTable")
    LANGUAGE RLANG AS
    BEGIN
      A <- input1$A ;
      B <- input1$B ;
      C <- A * B;
      result <- cbind(A, B, C) ;
    END;
    -- Execute SQL-script function and retrieve result
    CALL ROP(Test1, "ResultTable");
    SELECT * FROM "ResultTable";
  Result
    A  B    C
    0  2.5  0.0
    1  1.5  1.5
    2  2.0  4.0
  (Figure: calcmodel with Dataset and R-Op)

parallel R operators
    -- call ROP in parallel
    DROP PROCEDURE testpartition1;
    CREATE PROCEDURE testpartition1 (OUT result "ResultTable") AS SQLSCRIPT
    BEGIN
      input1 = select * from Test1;
      p = CE_PARTITION([@input1@],[],'''HASH 3 A''');
      CALL ROP(@p$$input1@, r);
      s = CE_MERGE([@r@]);
      result = select * from @s$$r@;
    END;
    CALL testpartition1(?);
  (Figure: calcmodel with Dataset, Split-Op, several parallel R-Ops, and Merge)
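The split / parallel operator / merge pattern of this slide can be mimicked in plain Python. This is only a conceptual analogy: it says nothing about how CE_PARTITION or CE_MERGE work internally, and every name in it (`r_op`, `partitioned_call`, the thread pool) is an assumption made for the illustration. The operator computes C = A * B, as in the ROP procedure.

```python
from concurrent.futures import ThreadPoolExecutor


def r_op(rows):
    """Stand-in for the R operator: append C = A * B to every row."""
    return [(a, b, a * b) for a, b in rows]


def partitioned_call(rows, parts=3):
    # Split: hash-partition the rows on column A into `parts` partitions.
    partitions = [[] for _ in range(parts)]
    for row in rows:
        partitions[hash(row[0]) % parts].append(row)
    # Apply the operator to every partition in parallel.
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = list(pool.map(r_op, partitions))
    # Merge: concatenate the partial results into one table.
    return sorted(row for part in results for row in part)


test1 = [(0, 2.5), (1, 1.5), (2, 2.0)]
print(partitioned_call(test1))
```

Because the operator works row by row, the merged output is the same as a single call on the whole table; operators that need to see all rows at once would not be safe to partition this way.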

Result Interpretation
  result interpretation: quality measures, visualization

Result Interpretation: Quality Measures
  How good is the obtained result?
  Problem: there is no universally valid definition of clustering quality
  multiple approaches and metrics
  Rand index [Rand, 1971]
    comparison to a known solution
  Dunn index [Dunn, 1974]
    ratio of intercluster to intracluster distances
    higher score is better
  Davies-Bouldin index [Davies & Bouldin, 1979]
    ratio of intracluster to intercluster distances
    lower score is better
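Two of the listed measures are easy to state in code. The sketch below assumes the common pair-counting form of the Rand index and a Dunn index defined as the smallest distance between objects of different clusters divided by the largest distance between objects of the same cluster; other variants of the Dunn index exist, and the function names are chosen for this example only.

```python
from itertools import combinations
from math import dist


def rand_index(labels_a, labels_b):
    """Share of object pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def dunn_index(points, labels):
    """Min intercluster distance / max intracluster distance.

    Assumes at least two clusters and one cluster with two or more objects.
    """
    pairs = list(combinations(range(len(points)), 2))
    inter = min(dist(points[i], points[j])
                for i, j in pairs if labels[i] != labels[j])
    intra = max(dist(points[i], points[j])
                for i, j in pairs if labels[i] == labels[j])
    return inter / intra


known = [0, 0, 1, 1]
print(rand_index(known, [1, 1, 0, 0]))    # same grouping, different ids
print(rand_index(known, [0, 1, 0, 1]))

points = [(0, 0), (0, 1), (5, 0), (5, 1)]
print(dunn_index(points, [0, 0, 1, 1]))
print(dunn_index(points, [0, 1, 0, 1]))
```

The Rand index needs a known solution to compare against, whereas the Dunn index is computed from the data and the clustering alone, which is why the two are used in different situations.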

Result Interpretation: Visualization
  use visual representations to evaluate clustering quality
  utilize the perception capacity of humans
  result interpretation is left to the user
    subjective quality evaluation
  communicate information about structures
  data-driven
    visualize the whole dataset

Result Interpretation: Visualization Examples
  all objects and dimensions are visualized
  scalability issues for large-scale data
  display multiple dimensions on a 2D screen
    scatterplot matrix
    parallel coordinates [Inselberg, 1985]

Result Interpretation: Visualization
  Heidi Matrix [Vadapalli et al., 2009]
    visualizes clusters in high-dimensional space
    shows cluster overlap in subspaces
    pixel technique based on k-nearest-neighbour relations
    pixel = object pair, color = subspace

Feedback
  result interpretation: quality measures, visualization
  adjustment

Feedback: Result Adjustment
  best practice: adjustment via re-parameterization or algorithm swap
    modifications must be inferred from result interpretation
    adjusting low-level parameters acts only indirectly
  Clustering with Interactive Feedback [Balcan & Blum, 2008]
    theoretical proof of concept
    user adjustments via high-level feedback: split and merge
    limitations: 1D data, target solution available to the user

Our Research Area
  algorithmic platform
  visual-interactive interface

Augur
  generalized process based on Shneiderman's information seeking mantra
  visualization, interaction
  algorithmic platform based on multiple clustering solutions

Our Research Area: A Complex Use-Case Scenario
  self-optimising recommender systems
  Dipl.-Inf. Gunnar Schröder, gunnar.schroeder@tu-dresden.de

Our Research Area
  predict user ratings
  products for cross selling & product classification

Our Research Area
  similar items & similar users

Challenges
  Huge amounts of data
    building recommender models is expensive!
  Dynamic
    models age
    user preferences change
    new users & items
  Parameterization
    data and domain dependent
  support automatic model maintenance
  (Figure: cycle with the labels Calculate Recommendations, Personalize User Interface, User, Monitor Actions, (Re-)Build Model, Evaluate Model, Optimize Model Parameters)

Questions?