Chapter ML:XI (continued)




XI. Cluster Analysis
- Data Mining Overview
- Cluster Analysis Basics
- Hierarchical Cluster Analysis
- Iterative Cluster Analysis
- Density-Based Cluster Analysis
- Cluster Evaluation
- Constrained Cluster Analysis

ML:XI Cluster Analysis (STEIN 2002-2016)

Cluster analysis is the unsupervised classification of a set of objects into groups, pursuing the following objectives:
1. maximize the similarities within the groups (intra-group)
2. minimize the similarities between the groups (inter-group)
(A short code sketch illustrating these two objectives follows the list of applications below.)

Applications:
- identification of similar groups of buyers
- higher-level image processing: object recognition
- search for similar gene profiles
- specification of syndromes
- analysis of traffic data in computer networks
- visualization of complex graphs
- text categorization in information retrieval
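To make the two objectives concrete, here is a minimal Python sketch (an illustration, not part of the original slides) that computes the average intra-group and inter-group similarity of a given clustering; cosine similarity is an assumed choice of similarity function.

```python
# Toy sketch (not from the slides): average intra- vs. inter-group similarity.
# Cosine similarity is assumed here; any other similarity measure works as well.
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def intra_inter_similarity(X, labels):
    """X: (n, p) data matrix; labels: cluster index per object."""
    n = len(X)
    intra, inter = [], []
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine_sim(X[i], X[j])
            (intra if labels[i] == labels[j] else inter).append(s)
    return np.mean(intra), np.mean(inter)

X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
labels = [0, 0, 1, 1]
print(intra_inter_similarity(X, labels))  # intra-group similarity should exceed inter-group
```

A good clustering drives the first value up and the second one down.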

Remarks:
- The setting of a cluster analysis is the reverse of the setting of a variance analysis: a variance analysis verifies whether a nominal feature defines groups such that the members of the different groups differ significantly with regard to a numerical feature. I.e., the nominal feature is in the role of the independent variable, while the numerical feature(s) are in the role of dependent variable(s).
  Example: The type of a product packaging (the independent variable) may determine the number of customers (the dependent variable) in a supermarket who look at the product.
- A cluster analysis, in turn, can be used to identify such a nominal feature, namely by constructing a suited feature domain for the nominal variable: each cluster corresponds implicitly to a value of the domain.
  Example: Equivalent but differently presented products in a supermarket are clustered with regard to the number of customers who buy the products (= the impact of product packaging is identified).
- Cluster analysis is a tool for structure generation: nearly nothing is known about the nominal variable that is to be identified. In particular, there is no knowledge about the number of domain values (the number of clusters). Variance analysis, in contrast, is a tool for structure verification.

Let x_1, ..., x_n denote the p-dimensional feature vectors of n objects:

       Feature 1   Feature 2   ...   Feature p
x_1    x_11        x_12        ...   x_1p
x_2    x_21        x_22        ...   x_2p
...
x_n    x_n1        x_n2        ...   x_np

In contrast to supervised learning, no target concept c_1, c_2, ..., c_n is given for the objects.

Example: 30 two-dimensional feature vectors (n = 30, p = 2), shown as a scatter plot over Feature 1 and Feature 2. [Figure]
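For concreteness, a small sketch (illustrative; the three Gaussian blobs are an assumption, not taken from the slides) that generates such a data matrix with n = 30 and p = 2:

```python
# Sketch: build a data matrix of 30 two-dimensional feature vectors (n = 30, p = 2).
# The three Gaussian blobs are assumed for illustration; no labels are kept, since
# cluster analysis starts from the feature vectors alone (no target concept).
import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(10, 2)) for c in centers])

print(X.shape)  # (30, 2): rows x_1, ..., x_30, columns Feature 1 and Feature 2
```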

Definition 3 (Exclusive Clustering [splitting])
Let X be a set of feature vectors. An exclusive clustering C of X, C = {C_1, C_2, ..., C_k}, C_i ⊆ X, is a partitioning of X into non-empty, mutually exclusive subsets C_i with ⋃_{C_i ∈ C} C_i = X.

Algorithms for cluster analysis are unsupervised learning methods:
- the learning process is self-organized
- there is no (external) teacher
- the optimization criterion is task- and domain-independent

Supervised learning, in contrast:
- a learning objective such as the target concept is provided
- the optimization criterion depends on the task or the domain
- information is provided about how the optimization criterion can be maximized (keyword: instructive feedback)
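Definition 3 can be checked mechanically. The following sketch (illustrative; objects are represented by hashable ids, which is an assumption of this example) verifies that a candidate clustering consists of non-empty, mutually exclusive subsets whose union is X:

```python
# Sketch: verify the conditions of Definition 3 for a candidate clustering.
def is_exclusive_clustering(clusters, X):
    """clusters: list of sets C_i; X: set of all objects."""
    if any(len(c) == 0 for c in clusters):   # non-empty subsets
        return False
    union = set()
    for c in clusters:
        if union & c:                        # mutually exclusive
            return False
        union |= c
    return union == X                        # the union covers X

X = {1, 2, 3, 4, 5}
print(is_exclusive_clustering([{1, 2}, {3}, {4, 5}], X))     # True
print(is_exclusive_clustering([{1, 2}, {2, 3}, {4, 5}], X))  # False: object 2 appears twice
```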

Main Stages of a Cluster Analysis

Objects → Feature extraction & preprocessing → Similarity computation → Merging → Clusters

Feature Extraction and Preprocessing

Required are (possibly new) features of high variance. Approaches:
- analysis of dispersion parameters
- dimension reduction: PCA, factor analysis, MDS
- visual inspection: scatter plots, box plots [Webis 2012, VDM tool]

Note that feature standardization can dampen the structure and make things worse.
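To see why standardization can hurt, consider z-scoring the columns of the data matrix: all features end up with unit variance, so a low-variance noise feature gains as much weight as the feature that actually separates the clusters. A minimal sketch with assumed example data:

```python
# Sketch: z-score standardization of the columns of a data matrix.
# Whether this helps or "dampens the structure" depends on whether the
# high-variance feature carries the cluster structure or merely noise.
import numpy as np

def standardize(X):
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[0.0, 10.0], [0.2, 11.0], [5.0, 10.5], [5.2, 11.5]])
print(standardize(X))  # Feature 1 separates two groups; after scaling, the noisy Feature 2 carries equal weight
```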

Computation of Distances or Similarities

From the n × p data matrix (rows x_1, ..., x_n, columns Feature 1, ..., Feature p) an n × n distance matrix is derived:

       x_1    x_2            ...   x_n
x_1    0      d(x_1, x_2)    ...   d(x_1, x_n)
x_2    -      0              ...   d(x_2, x_n)
...
x_n    -      -              ...   0
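A minimal sketch of this step (illustrative; Euclidean distance is assumed here, but any of the metrics on the next slide could be plugged in):

```python
# Sketch: pairwise Euclidean distance matrix for the rows of a data matrix X.
import numpy as np

def distance_matrix(X):
    diff = X[:, None, :] - X[None, :, :]       # shape (n, n, p)
    return np.sqrt((diff ** 2).sum(axis=-1))   # D[i, j] = d(x_i, x_j)

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = distance_matrix(X)
print(D)  # symmetric, zero diagonal: d(x_1, x_2) = 5.0, d(x_1, x_3) = 10.0
```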

Remarks:
- Usually, the distance matrix is defined implicitly by a metric on the feature space.
- The distance matrix can be understood as the adjacency matrix of a weighted, undirected graph G = ⟨V, E, w⟩: the set X of feature vectors is mapped one-to-one (bijection) onto a set of nodes V, and the distance d(x_i, x_j) corresponds to the weight w({u, v}) of the edge {u, v} ∈ E between those nodes u and v that are associated with x_i and x_j respectively.
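The graph view can be materialized directly, e.g. with networkx (the use of networkx and the node numbering are assumptions of this sketch, not prescribed by the slides):

```python
# Sketch: interpret the distance matrix D as a weighted, undirected graph G = <V, E, w>.
import networkx as nx
import numpy as np

def distance_graph(D):
    n = len(D)
    G = nx.Graph()
    G.add_nodes_from(range(n))                       # node i represents feature vector x_(i+1)
    for i in range(n):
        for j in range(i + 1, n):
            G.add_edge(i, j, weight=float(D[i, j]))  # w({u, v}) = d(x_i, x_j)
    return G

D = np.array([[0.0, 5.0, 10.0], [5.0, 0.0, 5.0], [10.0, 5.0, 0.0]])
G = distance_graph(D)
print(G[0][1]["weight"])  # 5.0
```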

Computation of Distances or Similarities (continued)

Properties of a distance function:
1. d(x_1, x_2) ≥ 0
2. d(x_1, x_1) = 0
3. d(x_1, x_2) = d(x_2, x_1)
4. d(x_1, x_3) ≤ d(x_1, x_2) + d(x_2, x_3)   (triangle inequality)

Minkowski metric for features with interval-based measurement scales:

  d(x_1, x_2) = ( Σ_{i=1}^{p} |x_{1i} − x_{2i}|^r )^{1/r}

where
  r = 1: Manhattan or Hamming distance, L_1 norm
  r = 2: Euclidean distance, L_2 norm
  r = ∞: maximum distance, L_∞ norm or L_max norm
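A direct transcription of the Minkowski metric into code (a minimal sketch; the function name and the use of NumPy are illustrative choices):

```python
# Sketch: Minkowski metric for interval-scaled features; r is the order parameter.
# r = 1 (Manhattan), r = 2 (Euclidean), r = inf (maximum distance).
import numpy as np

def minkowski(x1, x2, r=2):
    diff = np.abs(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))
    if np.isinf(r):
        return float(diff.max())                  # L_max norm
    return float((diff ** r).sum() ** (1.0 / r))

x1, x2 = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x1, x2, r=1))       # 7.0  (Manhattan)
print(minkowski(x1, x2, r=2))       # 5.0  (Euclidean)
print(minkowski(x1, x2, r=np.inf))  # 4.0  (maximum distance)
```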

Computation of Distances or Similarities (continued)

Cluster analysis does not presume a particular measurement scale:
- generalization of the distance function towards a (dis)similarity function by omitting the triangle inequality
- (dis)similarities can be quantified between all kinds of features, irrespective of the given levels of measurement

Similarity coefficients for two feature vectors x_1, x_2 with binary features:

  Simple Matching Coefficient:  SMC = (f_11 + f_00) / (f_11 + f_00 + f_01 + f_10)
  Jaccard Coefficient:            J = f_11 / (f_11 + f_01 + f_10)

where
  f_11 = number of features with a value of 1 in both x_1 and x_2
  f_00 = number of features with a value of 0 in both x_1 and x_2
  f_01 = number of features with value 0 in x_1 and value 1 in x_2
  f_10 = number of features with value 1 in x_1 and value 0 in x_2
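The two coefficients differ only in whether the 0/0 matches count; SMC is dominated by shared zeros, which is why the Jaccard coefficient is typically preferred for sparse binary data. A small sketch with assumed example vectors makes the difference visible:

```python
# Sketch: Simple Matching Coefficient and Jaccard coefficient for binary feature vectors.
def smc_and_jaccard(x1, x2):
    f11 = sum(a == 1 and b == 1 for a, b in zip(x1, x2))
    f00 = sum(a == 0 and b == 0 for a, b in zip(x1, x2))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x1, x2))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x1, x2))
    smc = (f11 + f00) / (f11 + f00 + f01 + f10)
    jaccard = f11 / (f11 + f01 + f10) if (f11 + f01 + f10) else 0.0
    return smc, jaccard

x1 = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
x2 = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc_and_jaccard(x1, x2))  # (0.9, 0.666...): SMC counts the shared zeros, Jaccard ignores them
```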

Remarks:
- The definitions of the above similarity coefficients can be extended towards features with a nominal measurement scale.
- Particular heterogeneous metrics have been developed, such as HEOM and HVDM, which allow the combined computation of feature values from different measurement scales.
- The computation of the correlation between all features of two feature vectors (not: between two features over all feature vectors) allows one to compare feature profiles. Example: Q correlation coefficient.
- The development of a suited, realistic, and expressive similarity measure may pose the biggest challenge within a cluster analysis task. Typical problems:
  - (unwanted) structure damping due to normalization
  - (unwanted) sensitivity concerning outliers
  - (unrecognized) feature correlations
  - (neglected) varying feature importance
- Similarity measures can be transformed straightforwardly into dissimilarity measures and vice versa.
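The last remark can be illustrated with two standard conversions (the particular formulas are common choices, not prescribed by the slides):

```python
# Sketch: two common (assumed) transformations between similarity s and dissimilarity d.
def sim_to_dissim(s):      # for similarities normalized to [0, 1]
    return 1.0 - s

def dissim_to_sim(d):      # maps any non-negative distance into (0, 1]
    return 1.0 / (1.0 + d)

print(sim_to_dissim(0.9))  # 0.1
print(dissim_to_sim(5.0))  # ~0.167
```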

Merging Principles  [Analysis stages]

Cluster analysis approaches:
- hierarchical
  - agglomerative: single link, group average
  - divisive: MinCut
- iterative
  - exemplar-based: k-means, k-medoid
  - exchange-based: Kernighan-Lin
- density-based
  - point-density-based: DBSCAN
  - attraction-based: MajorClust
- meta-search-controlled: gradient-based, competitive, stochastic (simulated annealing, evolutionary strategies, Gaussian mixtures, ...)