Alessandro Laio, Maria d Errico and Alex Rodriguez SISSA (Trieste)

Similar documents

Cluster Analysis: Advanced Concepts

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye MSRC

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

ANALYTICAL TECHNIQUES FOR DATA VISUALIZATION

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

GE Medical Systems Training in Partnership. Module 8: IQ: Acquisition Time

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Character Image Patterns as Big Data

Clustering UE 141 Spring 2013

Data Mining + Business Intelligence. Integration, Design and Implementation

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Ensemble Methods. Adapted from slides by Todd Holloway h8p://abeau<fulwww.com/2007/11/23/ ensemble- machine- learning- tutorial/

Segmentation & Clustering

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Alignment and Preprocessing for Data Analysis

Fast Matching of Binary Features

A comparison of various clustering methods and algorithms in data mining

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

R-trees. R-Trees: A Dynamic Index Structure For Spatial Searching. R-Tree. Invariants

Virtual Landmarks for the Internet

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Neural Networks Lesson 5 - Cluster Analysis

Linear Threshold Units

MHI3000 Big Data Analytics for Health Care Final Project Report

Musculoskeletal MRI Technical Considerations

Data Mining and Visualization

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Probabilistic Latent Semantic Analysis (plsa)

Component Ordering in Independent Component Analysis Based on Data Power

A discussion of Statistical Mechanics of Complex Networks P. Part I

How To Understand The Network Of A Network

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

How To Solve The Cluster Algorithm

Load balancing in a heterogeneous computer system by self-organizing Kohonen network

Nodes, Ties and Influence

FlowMergeCluster Documentation

Part-Based Recognition

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

BIRCH: An Efficient Data Clustering Method For Very Large Databases

Chapter ML:XI (continued)

Distances, Clustering, and Classification. Heatmaps

Classification/Decision Trees (II)

Introduction to k Nearest Neighbour Classification and Condensed Nearest Neighbour Data Reduction

Lecture 4 Online and streaming algorithms for clustering

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

Document Image Retrieval using Signatures as Queries

Standardization and Its Effects on K-Means Clustering Algorithm

Unsupervised learning: Clustering

Anatomic Surface Reconstruc1on from Sampled Point Cloud Data and Prior Models

Comparison and Analysis of Various Clustering Methods in Data mining On Education data set Using the weak tool

So which is the best?

STOCK MARKET TRENDS USING CLUSTER ANALYSIS AND ARIMA MODEL

Biometric Authentication using Online Signatures

Randomized Trees for Real-Time Keypoint Recognition

Clustering & Visualization

Introduction to Clustering

SITE IMAGING MANUAL ACRIN 6698

Cluster Analysis: Basic Concepts and Algorithms

Automated Stellar Classification for Large Surveys with EKF and RBF Neural Networks

Statistical Models in Data Mining

The Scientific Data Mining Process

CAB TRAVEL TIME PREDICTI - BASED ON HISTORICAL TRIP OBSERVATION

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

CLUSTER ANALYSIS FOR SEGMENTATION

Environmental Remote Sensing GEOG 2021

OUTLIER ANALYSIS. Data Mining 1

Local features and matching. Image classification & object localization

Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

Grid Density Clustering Algorithm

Clustering Very Large Data Sets with Principal Direction Divisive Partitioning

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Protein Protein Interaction Networks

Image Segmentation and Registration

NETZCOPE - a tool to analyze and display complex R&D collaboration networks

TV commercial effectiveness predicted by functional MRI

Parametric Comparison of H.264 with Existing Video Standards

COC131 Data Mining - Clustering

A New Image Edge Detection Method using Quality-based Clustering. Bijay Neupane Zeyar Aung Wei Lee Woon. Technical Report DNA #

Part 2: Community Detection

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

CS Data Science and Visualization Spring 2016

There are a number of different methods that can be used to carry out a cluster analysis; these methods can be classified as follows:

Models of Cortical Maps II

Data Management in the Cloud: Limitations and Opportunities. Annies Ductan

Transcription:

Clustering by fast search- and- find of density peaks Alessandro Laio, Maria d Errico and Alex Rodriguez SISSA (Trieste)

What is a cluster? clus ter [kluhs- ter], noun 1.a number of things of the same kind, growing or held together; a bunch: a cluster of grapes. 2.a group of things or persons close together: There was a cluster of tourists at the gate. 3.U.S. Army. a small metal design placed on a ribbon represen=ng an awarded medal to ind icate that the same medal has been awarded again: oak- leaf cluster. 4.Phone<cs. a succession of two or more con=guous consonants in an u?erance: cluster of strap. 5.Astronomy. a group of neighboring stars, held together by mutual gravita=on, that have essen=ally the same age and composi=on and thus supposedly a common origin.

What is a cluster? Visually it is a region with a high density of points. How many clusters do you see? separated from other dense regions Computers easily give the same answer

What is a cluster? And now? How many clusters? Computers have big problems here!

Inefficiency of standard algorithms K-means algorithm (11748 citations!!!!) It assigns each point to the closest cluster center. Variables: the number of centers and their location By construction it is unable to recognize non-spherical clusters

What is a cluster? Clusters = peaks in the density of points = peaks in the mother probability distribu=on

Our approach: fast search- and- find of density peaks SCIENCE, 1492, vol 322 (2014) EXAMPLE: Clustering in a 2- dimensional space 1) Compute the local density around each point ρ(1)=7 ρ(8)=5 ρ(10)=4 2) For each point compute the distance with all the points with higher density. Take the minimum value.

Our approach: fast search- and- find of density peaks 3) For each point, plot the minimum distance as a func=on of the density. SCIENCE, 1492, vol 322 (2014)

Our approach: fast search- and- find of density peaks SCIENCE, 1492, vol 322 (2014) 4) the outliers in this graph are the cluster centers

Our approach: fast search- and- find of density peaks 4) ) the outliers in this graph are the cluster centers 5) Assign each point to the same cluster of its nearest neighbor of higher density

Not even an algorithm Given i δ a distance matrix i d ij, for each data point i compute: d c α ρ i = j χ (d ij d c ) (number of data points within a distance d c ) δ i = min (d ij ) j:ρ j >ρ i χ (x) =0 d c dc i (distance of the closest data point of higher density) Plot δ as a func=on of ρ {j : ρ j > ρ i } δ One free parameter: 1 the cutoff distance d c ρ i = j d α ij However: the method is only sensi=ve to the rela=ve density of two data points, not to the ρ i δ i i absolute value of ρ ρ i d c α

The clustering approach at work

The clustering approach at work

The clustering approach at work

The clustering approach at work

The clustering approach at work

The clustering approach at work SCIENCE, 1492, vol 322 (2014)

Robust with respect to changes in the metric SCIENCE, 1492, vol 322 (2014)

What about noise? The true density 3000 points sampled from this density The density reconstructed from the 3000 points (Gaussian es=mator)

What about noise? Density maxima )( )( )( For each maximum, find the highest saddle point towards a higher peak )( How can we dis=nguish genuine density peaks from noise? Build a tree- like structure, where each peak is a leaf, and nodes are assigned according to the density values at the saddles

What about noise? Density maxima )( )( )( For each maximum, find the highest saddle point towards a highest peak ç )( Build a tree- like structure, where each peak is a leaf, and nodes are assigned according to the density values at the saddles

Possible applications This algorithm can be applied any time it is possible to define a DISTANCE function between each pair of elements in a set:

Possible applications This algorithm can be applied any time it is possible to define a DISTANCE function between each pair of elements in a set: classification of living organisms marketing strategies libraries (book sorting) google search even FACE recognition!!!

Possible applications This algorithm can be applied any time it is possible to define a DISTANCE function between each pair of elements in a set: classification of living organisms marketing strategies libraries (book sorting) google search even FACE recognition!!!

Possible applications This algorithm can be applied any time it is possible to define a DISTANCE function between each pair of elements in a set: classification of living organisms marketing strategies libraries (book sorting) google search even FACE recognition!!!

Possible applications This algorithm can be applied any time it is possible to define a DISTANCE function between each pair of elements in a set: classification of living organisms marketing strategies libraries (book sorting) google search even FACE recognition!!!

Possible applications This algorithm can be applied any time it is possible to define a DISTANCE function between each pair of elements in a set: classification of living organisms marketing strategies libraries (book sorting) google search even FACE recognition!!!

Clustering Algorithm applied to faces

Clustering Algorithm applied to faces It is possible to define a distance between faces, based on some stable features. *Sampat, M. P., Wang, Z., Gupta, S., Bovik, A. C., & Markey, M. K. (2009). Complex wavelet structural similarity: A new image similarity index. Image Processing, IEEE Transac<ons on, 18(11), 2385-2401.

DECISION GRAPH

DECISION GRAPH

The clustering approach at work: Analysis 1 of Introduction a fmri experiment (D. Ama\, M. Maieron, F. Pizzagalli) T Outcome of a fmri experiment: signal intensity for ~100,000 v i (!) = X 1 v i (t) e i! T t voxels covering densely the brain. t=0 The signal is measured every ~2 seconds for a total =me of a few minutes. Intensity Voxel index: i=1,2,,# of voxels P (!) = 1 N v i (t) NX v i (!) vi (!) General idea: if the subject is performing a task, the voxels in Using the transformed voxel signal u the brain region involved in this i (k) we then defined the distance between task must have a similar v(t). voxel i and j as follows: We look for large and connected d ij = regions with voxels with a! similar v(t), namely with a similar =me evolu=on. Similarity measure: esult). i=1 ṽ i (!) =v i (!) / p P (!) u i (!) =ṽ i (!) / max kṽ i (!)k! s X ku i (!) u j (!)k 2 (1) v ux d ij = t T (v i (t) v j (t)) 2 (2) t=1

The clustering approach at work: Analysis of a fmri experiment (D. Ama\, M. Maieron, F. Pizzagalli) The subject was scanned while moving the right or lej hand. They saw the words "move lej", "move right" or "stop" in a random fashion through the glasses. 1.2 1 0.8 0.6 0.4 0.2 0 HAND MOVEMENT 3 13 23 33 43 53 63 73 83 93 103 113 123 133 143 153 163 173 183 193 203 213 223 233 243 253 dx sx 3T Achieva Philips T2* BOLD sensi=ve gradient- recalled EPI sequence standard Head Coil 8 channels TR/TE = 2500/32 ms matrix 128X128, in- plane resolu=on 1.8 X 1.8 #slices 34, thickness = 3mm, no gap 102 scans

The clustering approach at work: Analysis of a fmri experiment (D. Ama\, M. Maieron, F. Pizzagalli) Time window 24-36: decision graph Time window 24-36: clusters Overlap between the cluster of all the =me windows

The clustering approach at work: molecular dynamics 3000 ns of molecular dynamics of 3- Ala in water solu=on, at 300 K 7 slow relaxa=on =mes

The clustering approach at work: molecular dynamics

The clustering approach at work: molecular dynamics Dihedral distance RMSD SCIENCE, 1492, vol 322 (2014)

The clustering approach at work: Test on molecular dynamics Dihedral distance RMSD SCIENCE, 1492, vol 322 (2014)

Conclusions The approach allows detec=ng non- spherical clusters It allows detec=ng clusters with different densi=es Robust with respect to changes in the metric No op=miza=on, no varia=onal parameters: clusters are found determinis=cally from the data. The number of clusters is found automa=cally Outliers and background noise are automa=cally recognized and excluded from the analysis. Alex Rodriguez Maria d Errico Thank- you: Daniele Ama=, Erio Tosat, Jessica Nasica, Francesca Rizzato