Hierarchical Clustering

Hierarchical Clustering Basics

Please read the introduction to principal component analysis first. There, we explain how spectra can be treated as data points in a multi-dimensional space; this is required knowledge for this presentation.

Hierarchical Clustering

We have a number of data points in an n-dimensional space and want to evaluate which data points cluster together. This can be done with a hierarchical clustering approach, which works as follows:
1) Find the two elements with the smallest distance (that means the most similar elements).
2) Cluster these two elements together; the cluster becomes a new element.
3) Repeat until all elements are clustered.
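The three steps above can be written as a minimal sketch in Python. The data points, function names, and the use of Euclidean distance with single linkage are illustrative assumptions for this sketch, not part of ClinProTools:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def agglomerate(points):
    """Repeatedly merge the two closest clusters until one remains.
    Returns the merge history as a list of (cluster_a, cluster_b) pairs,
    where clusters are lists of point indices."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        # 1) find the two clusters with the smallest distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # 2) merge them; the merged cluster becomes a new element
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    # 3) the loop repeats until all elements are in one cluster
    return history

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
history = agglomerate(points)
# the two nearest points (indices 0 and 1) are merged first
```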


Important parameters in hierarchical clustering are:

The distance method: defines how the distance between two data points is measured. Available options: Euclidean (default), Minkowski, Cosine, Correlation, Chebychev, Spearman.

The linkage method: defines how the distance between two clusters is measured. Available options: Average (default) and Ward.

Use PCA data: determines whether the data are pretreated with a PCA. Only reasonable if used with the Euclidean metric and with a reduction of explained variance.

Which parameters should be selected for MALDI imaging data? Recommended settings:
Use PCA data; reduce dimensions to 70%-95% of explained variance.
Distance method: Euclidean.
Linkage method: Ward.
(Do not use PCA data with non-Euclidean distances!)

If you are interested in what the distance and linkage methods are, read on; otherwise you can stop here.

Distance - Euclidean

If there are two points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) in n-dimensional space, then the Euclidean distance is defined as:

d(P, Q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

When we speak of distance in everyday language, the Euclidean distance is implied. Euclidean distance is invariant under transformations of the coordinates. The clustering results will improve if PCA data are used and the data are reduced to a certain percentage of explained variance.
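The formula above translates directly into code. This is a minimal sketch with an illustrative example; the function name is an assumption:

```python
import math

def euclidean(p, q):
    # d(P, Q) = sqrt((p1-q1)^2 + ... + (pn-qn)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d = euclidean((1.0, 2.0, 3.0), (4.0, 6.0, 3.0))
# sqrt(9 + 16 + 0) = 5.0
```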

Distance - Cosine 1

In the cosine distance, the distance between two data points is defined as 1 minus the cosine of the angle between the vectors from the origin to the data points. The angle between the green and the blue line is very small, so the two data points have a small cosine distance.
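A minimal sketch of this definition (function name and example vectors are illustrative): two vectors pointing in the same direction have cosine distance 0 regardless of their lengths, and orthogonal vectors have cosine distance 1.

```python
import math

def cosine_distance(p, q):
    # 1 - cos(angle between the vectors from the origin to p and q)
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

# parallel vectors: angle 0, distance ~0 (independent of vector length)
d_parallel = cosine_distance((1.0, 2.0), (2.0, 4.0))
# orthogonal vectors: angle 90 degrees, distance 1
d_orthogonal = cosine_distance((1.0, 0.0), (0.0, 1.0))
```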

Distance - Cosine 2

[Figure: intensity plots (Int 1, Int 2) of spectra A, B, and C.] The Euclidean distance between spectra A and B is equal to the Euclidean distance between A and C, yet spectra A and B are similar in cosine distance. Note: spectra A and B differ only in absolute intensity; cosine distance performs an intrinsic normalization. Since spectra in ClinProTools are always normalized, cosine and Euclidean distance behave rather similarly. Cosine is not reasonable with PCA data.

Distance - Minkowski

If there are two points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) in n-dimensional space, then the Minkowski distance is defined as:

d(P, Q) = (|p1 - q1|^a + |p2 - q2|^a + ... + |pn - qn|^a)^(1/a)

The Euclidean distance is a special case of the Minkowski metric (a = 2). Another special case is the so-called city-block metric (a = 1):

d(P, Q) = |p1 - q1| + |p2 - q2| + ... + |pn - qn|

Clustering results will be different with unprocessed and with PCA data.
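Both special cases fall out of one small function. A minimal sketch (names and example points are illustrative):

```python
def minkowski(p, q, a):
    # d(P, Q) = (|p1-q1|^a + ... + |pn-qn|^a)^(1/a)
    return sum(abs(x - y) ** a for x, y in zip(p, q)) ** (1.0 / a)

# a = 2 reproduces the Euclidean distance
d2 = minkowski((0.0, 0.0), (3.0, 4.0), 2)  # 5.0
# a = 1 is the city-block metric
d1 = minkowski((0.0, 0.0), (3.0, 4.0), 1)  # 3 + 4 = 7.0
```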

Distance - Chebychev

If there are two points P = (p1, p2, ..., pn) and Q = (q1, q2, ..., qn) in n-dimensional space, then the Chebychev distance is defined as:

d(P, Q) = max(|p1 - q1|, |p2 - q2|, ..., |pn - qn|)

The Chebychev distance is also a special case of the Minkowski distance (a → ∞): the distance is defined by the maximum difference in any single coordinate. Clustering results will be different with unprocessed and with PCA data.
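A minimal sketch of the Chebychev distance (names and example points are illustrative); only the largest coordinate difference matters:

```python
def chebyshev(p, q):
    # limit of the Minkowski distance as a -> infinity:
    # the maximum difference in any single coordinate
    return max(abs(x - y) for x, y in zip(p, q))

d = chebyshev((1.0, 5.0, 2.0), (2.0, 1.0, 2.0))
# max(1, 4, 0) = 4.0
```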

Distance - Spearman

In the Spearman distance the absolute intensities of the peaks are unimportant; two spectra will be close together if their relative peak patterns are similar. Each peak is assigned a rank in order of its intensity, and the ranks are compared.

[Figure: rank patterns of three spectra A, B, and C.] In this example, spectra A and B are identical in the Spearman distance, while spectrum C has a big distance to A and B.
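The ranking step can be sketched as follows. The peak intensities here are hypothetical (spectrum B is a scaled copy of A, so their rank patterns agree; C has a different pattern); ties are not handled in this sketch:

```python
def ranks(values):
    # rank of each peak in order of intensity (1 = smallest); no ties assumed
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

a = [3.0, 2.0, 4.0, 1.0]     # hypothetical peak intensities
b = [30.0, 20.0, 40.0, 10.0]  # same pattern, 10x the intensity
c = [1.0, 2.0, 3.0, 4.0]     # a different pattern
# a and b get identical ranks, so their Spearman distance is zero;
# c gets a different rank pattern and is far from both
```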

Distance - Correlation

[Figure: scatter plot of Int 1 vs. Int 2.] In the correlation distance method, the blue data point is closer to the green cluster than the red one is, because the blue one correlates better with the data in the cluster.
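Correlation distance is commonly computed as 1 minus the Pearson correlation coefficient of the two intensity vectors; this sketch follows that convention (function name and example data are illustrative). A spectrum and a linearly scaled copy of it correlate perfectly, so their distance is about 0:

```python
import math

def correlation_distance(p, q):
    # 1 - Pearson correlation coefficient between the two vectors
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return 1.0 - cov / (sp * sq)

# a linearly scaled spectrum correlates perfectly: distance ~0
d = correlation_distance([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```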

Linkage - Single

The single linkage is the easiest to understand: the distance between two clusters is equal to the distance between the closest elements of the two clusters. The single linkage has practical disadvantages and is therefore not selectable in ClinProTools.
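A minimal sketch of single linkage (names and example clusters are illustrative), using the Euclidean distance between points:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(cluster_a, cluster_b):
    # distance between the closest pair of elements, one from each cluster
    return min(euclidean(p, q) for p in cluster_a for q in cluster_b)

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (9.0, 0.0)]
d = single_linkage(a, b)
# closest pair: (1, 0) and (3, 0), so the distance is 2.0
```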

Linkage - Average

In the average linkage, the distance between two clusters is calculated as the average of the distances of each element of one cluster to each element of the other cluster. In this example the distance between the green and the blue cluster is the average length of the red lines. Average linkage is the default setting in ClinProTools; however, on imaging data the Ward linkage usually gives better results.
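A minimal sketch of average linkage (names and example clusters are illustrative); the list of pairwise distances corresponds to the "red lines" in the slide:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(cluster_a, cluster_b):
    # mean of all pairwise distances between the two clusters
    dists = [euclidean(p, q) for p in cluster_a for q in cluster_b]
    return sum(dists) / len(dists)

a = [(0.0, 0.0), (2.0, 0.0)]
b = [(4.0, 0.0)]
d = average_linkage(a, b)
# pairwise distances 4.0 and 2.0, average 3.0
```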

Linkage - Ward

In the Ward linkage, an error function is defined for each cluster. This error function is the average (RMS) distance of each data point in the cluster to the cluster's center of gravity. The distance D between two clusters is defined as the error function of the unified cluster minus the error functions of the individual clusters:

D = E(unified cluster) - E(cluster 1) - E(cluster 2)

The Ward linkage usually gives the nicest clustering results with MALDI imaging data.
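A minimal sketch that follows the slide's description literally (RMS distance to the center of gravity as the error function; note that textbook Ward linkage is often stated in terms of summed squared distances instead). Names and example clusters are illustrative:

```python
import math

def centroid(cluster):
    # center of gravity of the cluster
    n, dim = len(cluster), len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / n for d in range(dim))

def error(cluster):
    # RMS distance of each data point to the cluster's center of gravity
    c = centroid(cluster)
    return math.sqrt(sum(sum((x - y) ** 2 for x, y in zip(p, c))
                         for p in cluster) / len(cluster))

def ward_distance(cluster_a, cluster_b):
    # D = error of the unified cluster minus the errors of the parts
    return error(cluster_a + cluster_b) - error(cluster_a) - error(cluster_b)

# two single-point clusters each have error 0;
# their union has centroid (1, 0) and RMS error 1.0, so D = 1.0
d = ward_distance([(0.0, 0.0)], [(2.0, 0.0)])
```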

www.bdal.com