
Transcription:

Outline

Part 1: Data clustering
Non-Supervised Learning and Clustering
- Problem formulation: cluster analysis
- Taxonomies of Clustering Techniques
- Data types and Proximity Measures
- Difficulties and open problems

Part 2: Clustering Algorithms
Hierarchical methods
- Single-link
- Complete-link
- Clustering Based on Dissimilarity Increments Criteria

-- Ana Fred. From Single Clustering to Ensemble Methods - April 2009

Pattern Recognition, Decision Making
- Supervised learning: training samples, labeled by their category membership, are used to design a classifier. Labeled training patterns; the labels represent the true categories of the patterns.
- Non-supervised learning: based on a collection of samples without being told their categories. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns.
Data mining

Clustering
- Learn the structure of multidimensional patterns.
- Mixture Densities, Gaussian Mixture Decomposition: the probability structure is known, with the exception of the values of the parameters.
- Clustering Procedures: find subclasses. Data description in terms of clusters or groups of data points that possess strong internal similarities.

Typical applications:
- As a stand-alone tool to get insight into the data distribution.
- As a preprocessing step for other algorithms.

Cluster Analysis
Organize data into sensible groupings (either as a grouping of patterns or a hierarchy of groups).
- Clustering: the process of grouping a set of objects into classes of similar objects (extracting hidden structure from data).
- Cluster: a collection of objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.

Shape Clustering
Figure panels: Right Ventricle from MR brain images; Cistern from MR brain images.
The main cluster is drawn using multicolor dots; secondary clusters are drawn in red, green and magenta.
Duta, Jain and Jolly, Automatic Construction of 2-D Shape Models, IEEE PAMI, May 2001

Identification of Writing Styles
- 122,000 online characters written by 100 writers.
- Lexemes are identified by clustering the data within each character class into subclasses; a string matching measure is used to calculate the distance between two characters.
Connell and Jain, Writer Adaptation for Online Handwriting Recognition, IEEE PAMI, Mar 2002

Segmentation of Natural Scenes
Hermes, Zoller, Buhmann, Parametric Distributional Clustering for Image Segmentation, ECCV 2002

What is a Cluster?
- A set of entities which are alike; entities from different clusters are not alike.
- An aggregation of points such that the distance between any two points in the cluster is less than the distance between any point in the cluster and any point not in it.
- A relatively high density of points, surrounded by a relatively low density of points.
Ideal cluster: compact and isolated.

Taxonomy of Clustering Approaches
Two main strategies:
- Hierarchical Methods: propose a sequence of nested data partitions in a hierarchical structure.
  - Single-link
  - Complete-link
- Partitional Methods: organize patterns into a small number of clusters.
  - K-means
  - Spectral clustering
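The second definition of a cluster listed above (every within-cluster distance is smaller than every distance to an outside point) can be checked directly on small data. The following is a minimal sketch, assuming Euclidean distance and points given as coordinate tuples; the function name `is_compact_and_isolated` and the sample points are illustrative and not part of the slides.

```python
from itertools import combinations
from math import dist


def is_compact_and_isolated(cluster, others):
    """Return True if every distance inside `cluster` is smaller than
    every distance from a cluster point to a point in `others`."""
    # Largest distance between two members of the cluster (0 for a singleton).
    max_inside = max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)
    # Smallest distance from a cluster member to an outside point.
    min_outside = min(dist(p, q) for p in cluster for q in others)
    return max_inside < min_outside


tight = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
far = [(10.0, 10.0), (11.0, 10.0)]
near = [(1.5, 0.0)]

print(is_compact_and_isolated(tight, far))   # True
print(is_compact_and_isolated(tight, near))  # False
```

Real data rarely satisfies this strict criterion, which is one reason the slides list several alternative definitions.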

Taxonomy of Clustering Approaches
Clustering Principles:
- Compactness
  - K-means
  - Complete-link
  - Histogram clustering
  - Pairwise data clustering
- Connectedness
  - Single-linkage
  - Dissimilarity Increments
  - Mean Shift clustering
- Separation
  - Normalized Cut
  - Spectral clustering

Taxonomy of Clustering Approaches
Approaches:
- Model-based: patterns can be given a simple and compact description in terms of
  - a parametrical distribution: parametric density approaches (mixture models);
  - a representative element, such as a centroid or median (central clustering, square-error clustering, k-means, k-medoids), or multiple prototypes per cluster (CURE): prototype-based methods;
  - some geometrical primitives (lines, planes, circles, curves, surfaces): shape fitting approaches.
  These approaches assume particular cluster shapes, partitions being in general obtained as the result of an optimization process using a global criterion.

Taxonomy of Clustering Approaches
- Graph-theoretical
  - Mostly explored in hierarchical methods that can be represented graphically as a tree or dendrogram:
    - agglomerative methods (single-link, complete-link);
    - divisive approaches (e.g., based on the Minimum Spanning Tree).
  - View clustering as a graph partitioning problem.
- Non-parametric density-based
  - Attempt to identify high-density clusters separated by low-density regions (local cluster criterion, such as density-connected points); valley-seeking clustering algorithms.
  - DBSCAN, OPTICS, DENCLUE, CLIQUE.
  - Discover clusters of arbitrary shape.

Data Types in Clustering Problems
Data representations:
- Vector data: n vectors in R^d
- Proximity data: n x n pairwise proximity matrix
All types of data may be addressed by choosing adequate proximity measures.
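To illustrate how an agglomerative method can work directly on proximity data, here is a minimal sketch of single-link clustering on an n x n dissimilarity matrix. It is a naive version written for readability, not the algorithm as presented in the course; the function name `single_link` and the stopping rule (merge until k clusters remain) are assumptions made for the example.

```python
def single_link(D, k):
    """Naive agglomerative single-link clustering.

    D is an n x n dissimilarity matrix (list of lists); clusters are
    merged until only k remain. Returns lists of point indices.
    """
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-link: distance between clusters is that of their closest pair.
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # b > a, so index a stays valid
    return [sorted(c) for c in clusters]


# Proximity data built from five points on a line.
points = [0.0, 1.0, 2.0, 10.0, 11.0]
D = [[abs(p - q) for q in points] for p in points]
print(single_link(D, 2))  # [[0, 1, 2], [3, 4]]
```

Because only the matrix D is used, the same function applies to any data type for which a proximity measure is available, for instance strings compared with an edit distance.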

Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects.
Metrics:
- Positivity: $d(a, b) \ge 0$, and $d(a, b) = 0$ if and only if $a = b$.
- Symmetry: $d(a, b) = d(b, a)$.
- Triangle inequality: $d(a, c) \le d(a, b) + d(b, c)$.

Metric Models in Feature Spaces
- Minkowski distance: $d_r(a, b) = \left( \sum_{i=1}^{d} |b_i - a_i|^r \right)^{1/r}$ (Euclidean distance corresponds to r = 2).
- Maximum Value Metric: $d_\infty(a, b) = \max_{1 \le i \le d} |b_i - a_i|$. Considers only the most distant features.
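The Minkowski family and the maximum value metric can be sketched in a few lines of plain Python. The function names `minkowski` and `max_value` are illustrative, and vectors are assumed to be equal-length sequences of numbers.

```python
def minkowski(a, b, r):
    """Minkowski distance of order r between two equal-length vectors."""
    return sum(abs(y - x) ** r for x, y in zip(a, b)) ** (1.0 / r)


def max_value(a, b):
    """Maximum value metric: largest absolute difference over the features."""
    return max(abs(y - x) for x, y in zip(a, b))


a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, 1))  # 7.0 (Manhattan, r = 1)
print(minkowski(a, b, 2))  # 5.0 (Euclidean, r = 2)
print(max_value(a, b))     # 4.0
```

As r grows, the Minkowski distance approaches the maximum value metric, since the largest feature difference dominates the sum.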

Metric Models in Feature Spaces
Absolute Value Metric, Manhattan Distance or City-block (r = 1):
$d_M(a, b) = d_1(a, b) = \sum_{i=1}^{d} |b_i - a_i|$
- Reduced computational time; does not penalize much the features with higher dissimilarity.
- In R^2: $dist_1((x_1, y_1), (x_2, y_2)) = |x_2 - x_1| + |y_2 - y_1|$.
- City-block: it is not possible to take short-cuts through corners; the distance counts the number of blocks that must be traversed to move from one corner to another.
Figure: constant Manhattan distance curves.

Metric Models in Feature Spaces
Euclidean Distance:
$d_e(a, b) = d_2(a, b) = \left( \sum_{i=1}^{d} (b_i - a_i)^2 \right)^{1/2}$
- In R^2: $dist_2((x_1, y_1), (x_2, y_2)) = \left( (x_2 - x_1)^2 + (y_2 - y_1)^2 \right)^{1/2}$.
- Emphasizes more the features with higher dissimilarity.
Mahalanobis Distance:
$d_{Mahalanobis}(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$, where $\Sigma$ is the covariance matrix of the data.
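The effect of the covariance matrix in the Mahalanobis distance is easiest to see on a small case. The sketch below is restricted to two dimensions so that the matrix inverse can be written out by hand; the function name `mahalanobis_2d` and the example covariance are assumptions made for illustration.

```python
from math import sqrt


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points x and y for a 2 x 2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2 x 2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form (x - y)^T inv (x - y).
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return sqrt(q)


# Variance 4 along the first axis, 1 along the second, no correlation.
cov = ((4.0, 0.0), (0.0, 1.0))
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), cov))  # 1.0
print(mahalanobis_2d((0.0, 2.0), (0.0, 0.0), cov))  # 2.0
```

Both points are at Euclidean distance 2 from the origin, but the displacement along the high-variance axis counts for less. With the identity matrix as covariance, the result coincides with the Euclidean distance.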

Dissimilarity based on String Editing Operations
The Levenshtein distance between two strings $s_1, s_2 \in \Sigma^*$, $D_L(s_1, s_2)$, is defined as the minimum number of editing operations (substitutions, insertions and deletions) needed to transform $s_1$ into $s_2$.

The Weighted Levenshtein distance between two strings $s_1, s_2 \in \Sigma^*$ is defined by [equation not preserved in the transcript].
Normalized Weighted Levenshtein distance: [equation not preserved in the transcript].
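The unweighted Levenshtein distance is commonly computed by dynamic programming. The sketch below assumes unit cost for each substitution, insertion and deletion; the weighted variant would replace the three `+ 1` style costs by operation-specific weights, which are not given in this transcript.

```python
def levenshtein(s1, s2):
    """Minimum number of substitutions, insertions and deletions turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, free if symbols match
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Keeping only two rows of the table reduces memory to the length of the second string; recovering the editing path itself would require storing the full table.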

String Editing Operations and String Matching
Figure: (a) String matching using editing operations. (b) Editing path.
In (b), diagonal path elements represent substitutions, vertical segments correspond to insertions, and horizontal segments correspond to deletions.

Normalized Edit Distance
Marzal and Vidal, Computation of normalized edit distance and applications, IEEE PAMI, 1993

Dissimilarity based on Error-Correcting Parsing [Fu]
- Distance between strings based on the modelling of string structure by means of grammars and on the concept of error-correcting parsing.
- The distance between a string and a reference string is given by the error-correcting parser as the weighted Levenshtein distance between the string and the nearest (in terms of edit operations) string generated by the grammar inferred from the reference string (thus exhibiting a similar structure): [equation not preserved in the transcript]

ECP distance
[Figure: ECP distance]

Dissimilarity based on Error-Correcting Parsing [Fu]
- In order to preserve symmetry: [equation not preserved in the transcript]

Grammar Complexity-based Similarity
The basic idea is that, if two sentences are structurally similar, then their joint description will be more compact than their isolated descriptions, due to the sharing of rules of symbol composition. The compactness of the representation is quantified by the grammar complexity, and the similarity is measured by the ratio of decrease in grammar complexity: [equation not preserved in the transcript], where $C(G_{s_i})$ denotes grammar complexity.
Fred, Clustering of Sequences using a Minimum Grammar Complexity Criterion, ICGI 1996.
Fred, Similarity measures and clustering of string patterns. In Dechang Chen and Xiuzhen Cheng, editors, Pattern Recognition and String Matching, Kluwer Academic, 2002.

RDGC Similarity

Grammar Complexity-based Similarity: RDGC
Let $G = (V_N, \Sigma, R, \sigma)$ be a context-free grammar, where $V_N$ and $\Sigma$ are the sets of nonterminal and terminal symbols, respectively, $\sigma$ is the grammar's start symbol and $R$ is the set of productions, written in the form: [equation not preserved in the transcript]
Let $\alpha \in (V_N \cup \Sigma)^*$ be a grammatical sentence of length n, in which the symbols $a_1, a_2, \dots, a_m$ appear $k_1, k_2, \dots, k_m$ times, respectively. The complexity of the sentence, $C(\alpha)$, is given by [Fu]: [equation not preserved in the transcript]
The complexity of the grammar G is defined as: [equation not preserved in the transcript]

Minimum Code Length-based Similarity
- Based on Solomonoff's code: a string is represented by a triplet [not preserved in the transcript], where a coded string is obtained in an iterative procedure. In each step, intermediate codes are produced by defining sequences of two symbols, which are represented by special symbols, and rewriting the sequences using them.
  - Compact codes are produced when sequences exhibit local or distant inter-symbol interactions.
  - Code length: sum of the lengths of the descriptions of the three-part code above.
- Extension to sets of strings.
Fred and Leitão, A Minimum Code Length Technique for Clustering of Syntactic Patterns, ICPR 1996.
Fred, Similarity measures and clustering of string patterns. In Dechang Chen and Xiuzhen Cheng, editors, Pattern Recognition and String Matching, Kluwer Academic, 2002.

Minimum Code Length-based Similarity
- The basic idea is that global compact codes are produced by considering the inter-symbol dependencies on the ensemble of the strings. The quantification of this reduction in code length forms the basis of the similarity measure designated Normalized Ratio of Decrease in Code Length (NRDCL): [equation not preserved in the transcript]

Requirements of Clustering in Data Mining
- Discovery of clusters with arbitrary shape
- Ability to deal with different types of attributes
- Scalability
- Minimal requirements for domain knowledge to determine input parameters
- Insensitivity to the order of input records
- Ability to deal with noisy data
- High dimensionality

Issues in Clustering
- Which similarity measure and features to use?
- How many clusters?
- Which is the best clustering method?
- Are the individual clusters and the partition valid?
- How to choose algorithmic parameters?
Figures: K-means clustering of uniform data (k=4); K-means using Euclidean (blue) and Mahalanobis (red) distance (k=2).
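The first figure mentioned above (K-means on uniform data with k=4) points at the cluster validity issue: K-means returns k groups whether or not the data has any cluster structure. The following is a minimal sketch of that behaviour, assuming Euclidean distance, random initial centroids drawn from the data, and a fixed number of iterations; the function `kmeans` and its parameters are illustrative, not the implementation used for the slides.

```python
import random
from math import dist


def kmeans(points, k, iterations=20, seed=0):
    """Basic K-means: returns (labels, centroids) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids picked from the data

    def assign():
        # Label each point with the index of its nearest centroid.
        return [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]

    for _ in range(iterations):
        labels = assign()
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assign(), centroids


# Uniformly distributed points in the unit square: no real cluster structure.
rng = random.Random(1)
uniform = [(rng.random(), rng.random()) for _ in range(400)]
labels, centroids = kmeans(uniform, k=4)
print(sorted(set(labels)))
```

The algorithm still partitions the square into k regions, so the output alone says nothing about whether the clusters are meaningful; that has to be assessed separately.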