1 / Computational Genomics Clustering expression data
2 What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among objects.
3 Clustering gene expression data Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally, finding natural groupings among objects.
4 Goal Data organization (for further study) Functional assignment Determine different patterns Genes Classification Relations between experimental conditions Experiments Subsets of genes related to subset of experiments Both
5 Example: clustering genes Clustering genes helps determine new functions for unknown genes An early killer application in this area The most cited paper in PNAS! (13,844, Google Scholar)
6 What is Similarity? The quality or state of being similar; likeness; resemblance; as, a similarity of features. Webster's Dictionary Similarity is hard to define, but We know it when we see it The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
7 Defining Distance Measures Definition: Let O 1 and O 2 be two objects from the universe of possible objects. The distance (dissimilarity) between O 1 and O 2 is a real number denoted by D(O 1,O 2 ) gene1 gene
8 gene1 gene2 d('', '') = 0 d(s, '') = d('', s) = s -- i.e. length of s d(s1+ch1, s2+ch2) = min( d(s1, s2) + if ch1=ch2 then 0 else 1 fi, d(s1+ch1, s2) + 1, d(s1, s2+ch2) + 1 ) Inside these black boxes: some function on two variables (might be simple or very complex) 3 A few examples: Euclidian distance Correlation coefficient s ( x, y ) d ( x, y ) ( x i y i ) 2 i i ( x i x )( y i y ) Similarity rather than distance Can determine similar trends x y
9 Clustering algorithms We can divide clustering methods into roughly three types: 1. Hierarchical agglomerative clustering - eg. Ward hierarchical clustering 2. Model based - eg. k-means, k-medioids, Gaussian mixtures, dimensionality reduction (PCA [eigengene*], ICA,..) 3. Iterative partitioning (top down) - eg. graph based algorithms (spectral), local-linear embedding (LLE), isomap,.. Proc Natl Acad Sci U S A Aug 29;97(18): Singular value decomposition for genome-wide expression data processing and modeling. Alter O, Brown PO, Botstein D. Proc Natl Acad Sci U S A Dec 8;95(25): Cluster analysis and display of genome-wide expression patterns.
10 Hierarchical clustering Probably the most popular clustering algorithm in this area First presented in this context by Eisen in 1998 dendrogram Agglomerative (bottom-up) Algorithm: 1. Initialize: each item a cluster 2. Iterate: select two most similar clusters merge them 3. Halt: when there is only one cluster left Proc Natl Acad Sci U S A Dec 8;95(25): Cluster analysis and display of genome-wide expression patterns.
11 Cluster results Combining several time series yeast datasets
12 One potential use of a dendrogram is to detect outliers The single isolated branch is suggestive of a data point that is very different to all others Outlier
13 But what are the clusters? In some cases we can determine the correct number of clusters. However, things are rarely this clear cut, unfortunately.
14 Hierarchical clustering As we mentioned, its one of the most popular methods for clustering gene expression data One of its main advantages is the global overview of the entire experiment in one figure. Biologists often omit the tree and use the figure to determine functional assignments
15 Partitional Clustering Nonhierarchical, each instance is placed in exactly one of K non-overlapping clusters. Since the output is only one set of clusters the user has to specify the desired number of clusters K.
16 Algorithm k-means 1. Decide on a value for K, the number of clusters. 2. Initialize the K cluster centers (randomly, if necessary). 3. Decide the class memberships of the N objects by assigning them to the nearest cluster center. 4. Re-estimate the K cluster centers, by assuming the memberships found above are correct. 5. Repeat 3 and 4 until none of the N objects changed membership in the last iteration.
17 K-means Clustering: Finished! Re-assign and move centers, until no objects changed membership. expression in condition k 2 k expression in condition 1 k 1
18 Decide the number of clusters, K Initialize parameters (randomly) E-step: assign probabilistic membership to all input samples One for each cluster Gaussian Mixture Models M-step: re-estimate parameters based on probabilistic membership p i, j p ( C i x j ) i i p i j j j p p p i, j i, j p i, j i x x p Repeat until change in parameters is smaller than a threshold i j j x T j k p ( x j C i ) p ( C i ) p ( x j C k ) p ( C k ) w i j p i p j
19 Iteration 25
20 The importance of initializations
21 The importance of initializations: Step 1
22 The importance of initializations: Step 2
23 The importance of initializations: Step 5
24 The importance of initializations; Convergence
25 Example of clusters for the cell cycle expression dataset
26 Number of clusters How do we find the right number of clusters? The overall log-likelihood of the profiles implied by the cluster models goes up as we add clusters One way is to use cross validation
27 How can we tell the right number of clusters? In general, this is a unsolved problem. However there are many approximate methods. In the next few slides we will see an example
28 When k = 1, the objective function is
29 When k = 2, the objective function is
30 When k = 3, the objective function is
31 Objective Function We can plot the objective function values for k equals 1 to 6 The abrupt change at k = 2, is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as knee finding or elbow finding. 1.00E E E E E E E E E E E+00 k Note that the results are not always as clear cut as in this toy example
32 Cross validation We can also use cross validation to determine the correct number of classes Recall that GMMs is a generative model. We can compute the likelihood of the left out data to determine which model (number of clusters) is more accurate n p ( x 1 x n ) p ( x j C i ) w i j 1 k i 1
33 Cross validation
34 Top down: Graph based clustering Many top down clustering algorithms work by first constructing a neighborhood graph and then trying to infer some sort of connected components in that graph
35 Graph based clustering We need to clarify how to perform the following three steps: 1. construct the neighborhood graph 2. assign weights to the edges (similarity) 3. partition the nodes using the graph structure
37 Clustering methods: Comparison Bottom up Model based Top down Running time naively, O(n 3 ) fast (each iteration is linear) Assumptions requires a similarity / distance measure strong assumptions Input parameters none k (number of clusters) could be slow (matrix transformation) general (except for graph structure) either k or distance threshold Clusters subjective (only a tree is returned) exactly k clusters depends on the input format
38 Bi-clustering Find subsets of genes and experiments such that the genes in the subset behave similarly across the subset of the experiments
39 Cluster validation We wish to determine whether the clusters are real - internal validation (stability, coherence) - external validation (match to known categories)
40 Internal validation: Coherence A simple method is to compare clustering algorithm based on the coherence of their results We compute the average inter-cluster similarity and the average intra-cluster similarity Requires the definition of the similarity / distance metric
41 Internal validation: Stability If the clusters capture real structure in the data they should be stable to minor perturbation (e.g., subsampling) of the data. To characterize stability we need a measure of similarity between any two k-clusterings. For any set of clusters C we define L(C) as the matrix of 0/1 labels such that L(C) ij =1 if genes i and j belong to the same cluster and zero otherwise. We can compare any two k clusterings C and C' by comparing the corresponding label matrices L(C) and L(C').
42 Validation by subsampling C is the set of k clusters based on all the gene profiles C' denotes the set of k clusters resulting from a randomly chosen subset (80-90%) of genes We have high confidence in the original clustering if Sim(L(C),L(C')) approaches 1 with high probability, where the comparison is done over the genes common to both
43 External validation More common (why?). Suppose we have generated k clusters (sets of gene profiles) C 1,,C k. How do we assess the significance of their relation to m known (potentially overlapping) categories G 1,,G m? Let's start by comparing a single cluster C with a single category G j. The p-value for such a match is based on the hyper-geometric distribution. Board. This is the probability that a randomly chosen C i elements out of n would have l elements in common with G j.
44 P-value (cont.) If the observed overlap between the sets (cluster and category) is l elements (genes), then the p-value is p prob ( l l ˆ ) min( c, m ) j l prob ( exactly j matches ) Since the categories G 1,,G m typically overlap we cannot assume that each cluster-category pair represents an independent comparison In addition, we have to account for the multiple hypothesis we are testing. Solution?
45 External validation: Example P-value comparison 7 6 -Log Pval Profiles cell death Response to stimulus transerase activity Ratio Log Pval Kmeans
46 What you should know Why is clustering useful What are the different types of clustering algorithms What are the assumptions we are making for each, and what can we get from them Cluster validation: Internal and external
Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is
Unsupervised Learning c: Artificial Intelligence The University of Iowa Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or
Robotics 2 Clustering & EM Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard 1 Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
Clustering Hierarchical clustering and k-mean clustering Genome 373 Genomic Informatics Elhanan Borenstein The clustering problem: A quick review partition genes into distinct sets with high homogeneity
Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig
Clustering Clustering is an unsupervised learning method: there is no target value (class label) to be predicted, the goal is finding common patterns or grouping similar examples. Differences between models/algorithms
Machine Learning and Data Mining Clustering (adapted from) Prof. Alexander Ihler Unsupervised learning Supervised learning Predict target value ( y ) given features ( x ) Unsupervised learning Understand
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will
CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based
Data Mining Cluster Analysis: Advanced Concepts and Algorithms ref. Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Outline Prototype-based Fuzzy c-means Mixture Model Clustering Density-based
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 email@example.com What is Learning? "Learning denotes changes in a system that enable
Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What
Clustering - Overview What is cluster analysis? Grouping data objects based only on information found in the data describing these objects and their relationships Maximize the similarity within objects
Text Clustering 1 Clustering Partition unlabeled examples into disoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different Discover
Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,
Question 1 Cluster hypothesis states the idea of closely associating documents that tend to be relevant to the same requests. This hypothesis does make sense. The size of documents for information retrieval
Clustering and Data Mining in R Workshop Supplement Thomas Girke December 10, 2011 Introduction Data Preprocessing Data Transformations Distance Methods Cluster Linkage Hierarchical Clustering Approaches
Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be
Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from
Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining /8/ What is Cluster
K-Means Cluster Analsis Chapter 3 PPDM Class Tan,Steinbach, Kumar Introduction to Data Mining 4/18/4 1 What is Cluster Analsis? Finding groups of objects such that the objects in a group will be similar
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
CHAPTER 3 DATA MINING AND CLUSTERING 3.1 Introduction Nowadays, large quantities of data are being accumulated. The amount of data collected is said to be almost doubled every 9 months. Seeking knowledge
Data Mining and Data Warehousing Henryk Maciejewski Data Mining Clustering Clustering Algorithms Contents K-means Hierarchical algorithms Linkage functions Vector quantization Clustering Formulation Objects.................................
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical
Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering
Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?
Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means
Clustering and Cluster Evaluation Josh Stuart Tuesday, Feb 24, 2004 Read chap 4 in Causton Clustering Methods Agglomerative Start with all separate, end with some connected Partitioning / Divisive Start
Statistics: Rosie Cornish. 2007. 3.1 Cluster Analysis 1 Introduction This handout is designed to provide only a brief introduction to cluster analysis and how it is done. Books giving further details are
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
International Journal of Engineering and Technical Research (IJETR) ISSN: 2321-0869, Volume-2, Issue-7, July 2014 An Enhanced Clustering Algorithm to Analyze Spatial Data Dr. Mahesh Kumar, Mr. Sachin Yadav
AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations
Entropy based Graph Clustering: Application to Biological and Social Networks Edward C Kenley Young-Rae Cho Department of Computer Science Baylor University Complex Systems Definition Dynamically evolving
Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering
CLUSTER ANALYSIS FOR SEGMENTATION Introduction We all understand that consumers are not all alike. This provides a challenge for the development and marketing of profitable products and services. Not every
Integrating DNA Motif Discovery and Genome-Wide Expression Analysis Department of Mathematics and Statistics University of Massachusetts Amherst Statistics in Functional Genomics Workshop Ascona, Switzerland
Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering K-means Intuition Algorithm Choosing initial centroids Bisecting K-means Post-processing Strengths
Chapter 4: Non-Parametric Classification Introduction Density Estimation Parzen Windows Kn-Nearest Neighbor Density Estimation K-Nearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination
Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
Data visualization and clustering Genomics is to no small extend a data science [www.data2discovery.org] Data visualization and clustering Genomics is to no small extend a data science [Andersson et al.,
CSE 494 CSE/CBS 598 Fall 2007: Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye 1 Introduction One important method for data compression and classification is to organize
Forschungskolleg Data Analytics Methods and Techniques Martin Hahmann, Gunnar Schröder, Phillip Grosse Prof. Dr.-Ing. Wolfgang Lehner Why do we need it? We are drowning in data, but starving for knowledge!
Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar
EFFICIENT K-MEANS CLUSTERING ALGORITHM USING RANKING METHOD IN DATA MINING Navjot Kaur, Jaspreet Kaur Sahiwal, Navneet Kaur Lovely Professional University Phagwara- Punjab Abstract Clustering is an essential
10-701 Final Exam, Spring 2007 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 16 numbered pages in this exam (including this cover sheet). 3. You can use any material you brought:
MODULE 15 Clustering Large Datasets LESSON 34 Incremental Clustering Keywords: Single Database Scan, Leader, BIRCH, Tree 1 Clustering Large Datasets Pattern matrix It is convenient to view the input data
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
DRTC Workshop on Semantic Web 8 th 10 th December, 2003 DRTC, Bangalore Paper: K Data Mining and Clustering Techniques I. K. Ravichandra Rao Professor and Head Documentation Research and Training Center
Volume :2, Issue :5, 32-36 May 2015 www.allsubjectjournal.com e-issn: 2349-4182 p-issn: 2349-5979 Impact Factor: 3.762 R.Tamilselvi B.Sivasakthi R.Kavitha Assistant Professor A comparison of various clustering
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181 The Global Kernel k-means Algorithm for Clustering in Feature Space Grigorios F. Tzortzis and Aristidis C. Likas, Senior Member, IEEE
Data Mining Clustering Toon Calders Sheets are based on the those provided b Tan, Steinbach, and Kumar. Introduction to Data Mining What is Cluster Analsis? Finding groups of objects such that the objects
Decision Support System Methodology Using a Visual Approach for Cluster Analysis Problems Ran M. Bittmann School of Business Administration Ph.D. Thesis Submitted to the Senate of Bar-Ilan University Ramat-Gan,
SCAN: A Structural Clustering Algorithm for Networks Xiaowei Xu, Nurcan Yuruk, Zhidan Feng (University of Arkansas at Little Rock) Thomas A. J. Schweiger (Acxiom Corporation) Networks scaling: #edges connected
Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics
A Computational Framework for Exploratory Data Analysis Axel Wismüller Depts. of Radiology and Biomedical Engineering, University of Rochester, New York 601 Elmwood Avenue, Rochester, NY 14642-8648, U.S.A.
Medical Information Management & Mining You Chen Jan,15, 2013 You.firstname.lastname@example.org 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology
7-1 Chapter 7 Hierarchical cluster analysis In Part 2 (Chapters 4 to 6) we defined several different ways of measuring distance (or dissimilarity as the case may be) between the rows or between the columns
Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/4 What is
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
Computer Science and Data Analysis Series Exploratory Data Analysis with MATLAB Second Edition Wendy L Martinez Angel R. Martinez Jeffrey L. Solka ( r ec) CRC Press VV J Taylor & Francis Group Boca Raton
Logistic Regression Department of Statistics The Pennsylvania State University Email: email@example.com Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
Multiple regression Introduction Multiple regression is a logical extension of the principles of simple linear regression to situations in which there are several predictor variables. For instance if we
Flat Clustering K-Means Algorithm 1. Purpose. Clustering algorithms group a set of documents into subsets or clusters. The cluster algorithms goal is to create clusters that are coherent internally, but
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling