The data. Introducción al análisis de datos en microarrays ... Characteristics of the data: Universidad Complutense de Madrid ESCUELA DE VERANO 2007

Similar documents
Course on Functional Analysis. ::: Gene Set Enrichment Analysis - GSEA -

COMPUTATIONAL ANALYSIS OF MICROARRAY DATA

Statistical issues in the analysis of microarray data

AIMS ESSAY 2006 Overview of Tools for Microarray Data Analysis and Comparison Analysis

Distances, Clustering, and Classification. Heatmaps

Gene expression analysis. Ulf Leser and Karin Zimmermann

Software and Methods for the Analysis of Affymetrix GeneChip Data. Rafael A Irizarry Department of Biostatistics Johns Hopkins University

Data Mining Techniques for DNA Microarray Data

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

They can be obtained in HQJHQH format directly from the home page at:

SELDI-TOF Mass Spectrometry Protein Data By Huong Thi Dieu La

Microarray Data Analysis and Mining Tools

Exploratory data analysis approaches unsupervised approaches. Steven Kiddle With thanks to Richard Dobson and Emanuele de Rinaldis

Software reviews. Expression Pro ler: A suite of web-based tools for the analysis of microarray gene expression data

Pattern Recognition Techniques in Microarray Data Analysis: A Survey

Tutorial for proteome data analysis using the Perseus software platform

MultiExperiment Viewer Quickstart Guide

Neural Networks Lesson 5 - Cluster Analysis

A Study of Web Log Analysis Using Clustering Techniques

Data Mining in Bioinformatics Day 8: Clustering in Bioinformatics Clustering Gene Expression Data

Cluster 3.0 Manual. for Windows, Mac OS X, Linux, Unix. Michael Eisen; updated by Michiel de Hoon

Using Data Mining for Mobile Communication Clustering and Characterization

Advanced Web Usage Mining Algorithm using Neural Network and Principal Component Analysis

diagnosis through Random

Unsupervised learning: Clustering

An Introduction to Microarray Data Analysis

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

How To Cluster

Methods for assessing reproducibility of clustering patterns

Data Mining in Bioinformatics - A Review

A Review of Anomaly Detection Techniques in Network Intrusion Detection System

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Analysis of the colorectal tumor microenvironment using integrative bioinformatic tools

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

An unsupervised fuzzy ensemble algorithmic scheme for gene expression data analysis

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Fecha: 04/12/2010. New standard treatment for breast cancer at early stages established

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

A truly robust Expression analyzer

Machine Learning using MapReduce

Comparison of K-means and Backpropagation Data Mining Algorithms

Gene Expression Analysis

Computer program review

Method of Combining the Degrees of Similarity in Handwritten Signature Authentication Using Neural Networks

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Hierarchical Clustering Analysis

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Chapter ML:XI (continued)

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Molecular Genetics: Challenges for Statistical Practice. J.K. Lindsey

Time series experiments

Capturing Best Practice for Microarray Gene Expression Data Analysis

MIC - Detecting Novel Associations in Large Data Sets. by Nico Güttler, Andreas Ströhlein and Matt Huska

CITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學. Self-Organizing Map: Visualization and Data Handling 自 組 織 神 經 網 絡 : 可 視 化 和 數 據 處 理

STATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and

Quality Assessment in Spatial Clustering of Data Mining

Data Mining and Machine Learning in Bioinformatics

Analysis of gene expression data. Ulf Leser and Philippe Thomas

SoSe 2014: M-TANI: Big Data Analytics

MD - Data Mining

An Overview of Knowledge Discovery Database and Data mining Techniques

Statistics Graduate Courses

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

Data Mining and Neural Networks in Stata

Protein Protein Interaction Networks

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Quality Assessment of Exon and Gene Arrays

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

Manejo Basico del Servidor de Aplicaciones WebSphere Application Server 6.0

Analyzing microrna Data and Integrating mirna with Gene Expression Data in Partek Genomics Suite 6.6

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

IT services for analyses of various data samples

INTELIGENCIA DE NEGOCIO CON SQL SERVER

Data Mining Analysis of HIV-1 Protease Crystal Structures

life science data mining

DeCyder Extended Data Analysis module Version 1.0

Visualization and Clustering with High-dimensional Genomic Data

Introduction to Pattern Recognition

Cluster Analysis: Basic Concepts and Algorithms

Applications of Neural Network and Genetic Algorithm Data Mining Techniques in Bioinformatics Knowledge Discovery - A Preliminary Study

HIGH DIMENSIONAL UNSUPERVISED CLUSTERING BASED FEATURE SELECTION ALGORITHM

Machine Learning Introduction

Microarray Data Mining: Puce a ADN

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

A comparison of various clustering methods and algorithms in data mining

Cluster analysis Cosmin Lazar. COMO Lab VUB

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Data Acquisition. DNA microarrays. The functional genomics pipeline. Experimental design affects outcome data analysis

6.2.8 Neural networks for data mining

Tables of significant values of Jaccard's index of similarity

PERFORMANCE ANALYSIS OF CLUSTERING ALGORITHMS IN DATA MINING IN WEKA

LINIO COLOMBIA. Starting-Up & Leading E-Commerce. Luca Ranaldi, CEO. Pedro Freire, VP Marketing and Business Development

Biomedicine The background. The main interest. The tools

Segmentation of stock trading customers according to potential value

Introduction to Clustering

Clustering and Visualising Microarray Data Patterns

Transcription:

Universidad Complutense de Madrid ESCUELA DE VERANO 2007 Universidad Complutense de Madrid ESCUELA DE VERANO 2007 Introducción al análisis de datos en microarrays 1. Introducción 2. Microarrays (tipos, tratamiento de las muestras e hibridación, bases de datos, aplicaciones ) 3. Normalizacion y análisis exploratorio (JC Oliveros) 4. Análisis de datos (Análisis supervisado y no supervisado, algoritmos ) 5. Práctica Gonzalo Gómez López INB-CNIO ggomez@cnio.es Madrid, 26 de Julio, 2007 Gonzalo Gómez López ggomez@cnio.es Genes (thousands)... A B C Experimental conditions (from tens up to no more than a few hundreds) The data Different classes of experimental conditions, e.g. Cancer types, tissues, drug treatments, time survival, etc. Expression profile of all the genes for a experimental condition (1 array) Expression profile of a gene across the experimental conditions Characteristics of the data: Low signal to noise ratio High redundancy and intra-gene correlations Most of the genes are not informative with respect to the trait we are studying (physiological conditions, etc.) Many genes have no annotations!! Source: J. Dopazo Unsupervised methods Brief history in microarray data analysis: Supervised methods (predictors) 1. Clustering methods Neural networks (Khan et al. 2001) UPGMA (Sneath and Sokal, 1973) SOMs (Kohonen et al. 1984) kmeans (Hartigan and Wong 1979) knn (Ripley 1996; Hastie et al 2001) kmedians (Hartigan and Wong 1979) PAM (Tibshirani et al. 2002) SOTA (Herrero et al. 2001) DLDA (Dudoit et al. 2002) SOM (Kohonen 1979, Tamayo et al 1999) SVMs (Furey et al, 2000) Gene Shaving (Hastie et al, 2000) Fuzzy methods (Dougherty ET AL. 2002) Probabilistic (Bhattacharjee et al. 2001) Metagenes (Pittman et al, 2004) 2. Exploratory analysis parametric: ttest, SAM (Tusher et al, 2001), ANOVA non parametric: Welch ttest, Wilcoxon, Kruskal Wallis Sumarizing datasets: PCA 3. Blocks of genes Kolmogorov-Smirnof: GSEA (Subramanian et al. 2005) FatiScan (Al-Shahrour et al. 2005) SAM 3.0 (Jan 07) Gonzalo Gómez GeneTrail López (Backes et al. 2007)

Overexpression EJEMPLO or No Change Underexpression Image analysis comparison (normalization and filtering) Data analysis (supervised, unsupervised ) Ratios (or not ) Log 2 transform Underexpression Overexpression x1/2 x2 Why Log 2 transform?? QUESTIONS & METHODOLOGICAL APPROACH Ratio scale Log scale (lineal) -2 0.25-1 0.5 1 2 0 1 2 4 Can we find groups of experiments with similar gene expression profiles? Different phenotypes... Unsupervised analysis Supervised analysis Reverse engineering Molecular classification of samples What genes are responsible for? Co-expressing genes... What do they have in common? No Log transform After Log transform What profile(s) do they display? and... Genes interacting in a network (A,B,C..)... Genes of a class Are there more genes? How is the network? A B E C D Source: J. Dopazo

Algorithms & Reminding Examples Algorithms Molecular classification of cancer Known class prediction (Supervised data analysis).? Hierarchical Neural networks PCA SVM A set of instructions for solving a problem. When the instructions are followed, it must eventually stop with an answer. Computer Science Computational geometry Image processing Design Real-time simulations Economy Ecomomy models Prediction & simulation Applications A.I. Discovering SOTA class SOMs (Unsupervised data analysis).? k-means UPGMA Engineering Brief history in microarray data analysis: Unsupervised methods Supervised methods (predictors) Neural networks (Khan et al. 2001) 1. Clustering methods SOMs (Kohonen et al. 1984) UPGMA (Sneath and Sokal, 1973) knn (Ripley 1996; Hastie et al 2001) kmeans (Hartigan and Wong 1979) PAM (Tibshirani et al. 2002) kmedians (Hartigan and Wong 1979) DLDA (Dudoit et al. 2002) SOTA (Herrero et al. 2001) SVMs (Furey et al, 2000) SOM (Kohonen 1979, Tamayo et al 1999) Gene Shaving (Hastie et al, 2000) Fuzzy methods (Dougherty ET AL. 2002) Probabilistic (Bhattacharjee et al. 2001) Metagenes (Pittman et al, 2004) 2. Exploratory analysis parametric: ttest, SAM (Tusher et al, 2001), ANOVA non parametric: Welch ttest, Wilcoxon, Kruskal Wallis Sumarizing datasets: PCA 3. Blocks of genes Kolmogorov-Smirnof: GSEA (Subramanian et al. 2005) FatiScan (Al-Shahrour et al. 2005) SAM 3.0 (Jan 07) GeneTrail Gonzalo Gómez López (Backes et al. 2007) Science Cosmology & radio astronomy Condensed matter physics 3D proteins design Mathematics N items Cryptography Cipher & decipher codes Patterns recognition Artificial vision Applications and tasks Stochastic systems evolution Proteins folding prediction Biological evolution Atomistic resolution Nanomaterials Communications Optimizations Meteorology Simulation & prediction CLUSTERING Clustering??? K groups

Algorithms & Algorithms & Agglomerative Hierarchical algorithms Different methods UPGMA Divisive SOTA Non-hierarchical algorithms Clustering algorithm 26 items 4 groups Why different algorithms? Y C1 C2....... C4 K-means SOMs Data analysis in microarrays Hierarchical & dendrograms I) Cartesian algorithm (x, y):...c3... a) (X>0,Y>0)! C3, 1/4 C2? b) (X<0,Y>0)! C1, 1/4 C2? c) (X<0,Y<0)! C4?, 1/4 C2? d) (X>0,Y<0)! C4?, 1/4 C2? aubucud! C1, C2, C3, C4. x II) Polar algorithm (", m): There s no a perfect method e) ("=cte, m)! C1... f) (", m=cte)! C2, C3?... g) (", m)! C1, C2, C3, C4. Different methods Hierarchical algorithms Non-hierarchical algorithms K-means SOMs Etc Others Agglomerative UPGMA Divisive SOTA Artificial Neural Networks (ANNs) SVM PCA

UPGMA Unweighted Pair Group Method with Arithmetic mean Distances between genes?? Metrics Example Pearson Correlation Coefficient Gene 1 1) B y C son los dos genes más próximos (en sus medias) 2) B y C se unen y se recalcula la matriz de distancias empleando el nuevo cluster en vez de B y C. A y D son ahora los mas próximos. 3) A y D tambien se unen. Se reordenan los elementos para ajustar la topología del árbol. Se recalcula la matriz de distancias entre los grupos existentes 4) El proceso termina cuando el dendrograma esta construído. Exp 1 Exp 2 Exp 3 Euclidean distance Pearson correlation coefficient Spearman! correlation coefficient RNA express Gene 1 0.7 1.2 1.1 RNA express Gene 2 0.3 1.9 0.9 RNA express Gene 3 7.3 6.5 8.9 Gene 2 Gene 3 0.88-0.19-0.62 Gene 1 Gene 2 Gene 3 DISTANCIA B C A Dendrograms A EUCLIDEANA Linkage methods Niveles de Expresión B C t CORRELACIÓN CITY BLOCK (Manhattan or taxi cab) OTRAS (Canberra, Bray-Curtis ) PEARSON B A C CLUSTER 1 A C B D E F G CLUSTER 3 CLUSTER 2 Average-linkage method Single-linkage method Complete-linkage method A B C D E F G E D F G A B C G F E D C B A Others SPEARMAN (Weighted pair-group average, Within-groups Ward s method )

Ejemplos Hierarchical - UPGMA En nuestra práctica: GEPAS? Garber M. E. et al. PNAS, vol.98, nº24; 2001. SOTA Data analysis in microarrays Self-Organizing Tree Algorithm Hierarchical & dendrograms Different methods Hierarchical algorithms Non-hierarchical algorithms K-means SOMs Etc Others - Artificial Neural Network - Growth from the root of the tree, toward the leaves (from lower to higher resolution ) -Threshold of resource value should be fixed -Generation of a hierarchical cluster structure at the desired level of resolution Agglomerative UPGMA Divisive SOTA Artificial Neural Networks (ANNs) SVM PCA Herrero et al. Bioinformatics. 17: 126-136. 2001.

SOTA Self-Organizing Tree Algorithm SOTA Self-Organizing Tree Algorithm Number of genes 23 15 21 6 12 33 2 20 Average profiles Un ejemplo En nuestra práctica: GEPAS K-means Data analysis in microarrays K-means & Self organizing-maps (SOMs) Gene expression Time Gene expression 1 2 Time3 Gene expression Different methods Hierarchical algorithms Non-hierarchical algorithms K-means SOMs Etc Others Agglomerative UPGMA SOTA Divisive Artificial Neural Networks (ANNs) SVM PCA 4 Time5 6

K-means K-means Example More information: http://biodiver.bio.ub.es K-means Example 2 K= 12 www.isrec.isb-sib.ch. d te s a l Re ene g Self-organizing maps (SOMs) or Kohonen Maps SOMs Geometric dependence, i.e. 4 x 3 A B C D exp t Kohonen T. Proc IEEE 78(9):1464-1480,1990 Tamayo P. et al. PNAS, 96: 2907-2912; 1999.

Self-organizing maps (SOMs) Unsupervised methods Brief history in microarray data analysis: Supervised methods (predictors) EJEMPLO Indices de bienestar mundiales http://biodiver.bio.ub.es 1. Clustering methods Neural networks (Khan et al. 2001) UPGMA (Sneath and Sokal, 1973) SOMs (Kohonen et al. 1984) kmeans (Hartigan and Wong 1979) knn (Ripley 1996; Hastie et al 2001) kmedians (Hartigan and Wong 1979) PAM (Tibshirani et al. 2002) SOTA (Herrero et al. 2001) DLDA (Dudoit et al. 2002) SOM (Kohonen 1979, Tamayo et al 1999) SVMs (Furey et al, 2000) Gene Shaving (Hastie et al, 2000) Fuzzy methods (Dougherty ET AL. 2002) Probabilistic (Bhattacharjee et al. 2001) Metagenes (Pittman et al, 2004) 2. Exploratory analysis parametric: ttest, SAM (Tusher et al, 2001), ANOVA non parametric: Welch ttest, Wilcoxon, Kruskal Wallis Sumarizing datasets: PCA 3. Blocks of genes Kolmogorov-Smirnof: GSEA (Subramanian et al. 2005) FatiScan (Al-Shahrour et al. 2005) SAM 3.0 (Jan 07) Gonzalo Gómez GeneTrail López (Backes et al. 2007) Class1 Class2 Exploratory analysis Currently testing block of genes ~ biological pathways FDR<0.05...testing genes independently... GSEA (Subramanian et al. PNAS. 2005.) Gene Set Enrichment Analysis Broad Institute v 2.0 available since Jan 2007 New version includes Biocarta, Broad Institute, GeneMAPP, KEGG annotations and more... Platforms: Affymetrix, Agilent, CodeLink, 2-color... ttest cut-off GSEA applies Kolmogorov-Smirnof test to find assymmetrical distributions for defined blocks of genes in datasets whole distribution. FDR<0.05 Biological meaning?

- Class1 Class2 Gene Set 1 Gene Set 2 Gene Set 3 Now analizing block of genes ~ biological pathways FatiScan Gene set 3 enriched in Class 2 Al-Shahrour et al. Bioinformatics. 2005 Jul 1;21(13):2988-93. Available in Babelomics suite: www.babelomics.org ES/NES statistic ttest cut-off Segmentation test for searching asymmetries More statistically sensitive than GSEA Few annotations implemented En nuestra practica: Gene set 2 enriched in Class 1 + Datamining Mathematics methods for biology A di sc us si on www.babelomics.org? Tengo mis datos... He analizado su estructura y relevancia estadística Qué es, que hace y que sabemos de este gen? Que hay en este cluster?? Cell cycle... DBs Information Current math s methods are useful to explain real and meaningful processes from a biological point of view? mmm actually we don t know sometimes they are sometimes not New integrative methods should be developed Clustering,estadística, etc Source: J. Dopazo Datamining Links

Datamining Para profundizar Data Analysis Tools for DNA Microarrays by Sorin Draghic DNA Microarrays and Gene Expression : From Experiments to Data Analysis and Modeling by Pierre Baldi, G. Wesley Hatfield, Wesley G. Hatfield Exploration and Analysis of DNA Microarray and Protein Array Data (Wiley Series in Probability and Statistics) by Dhammika Amaratunga, Javier Cabrera And good luck NetAffx ANSWER Your genes, hypothesis, questions Statistics for Microarrays : Design, Analysis and Inference by Ernst Wit, John McClure Analyzing Microarray Gene Expression Data (Wiley Series in Probability and Statistics) by Geoffrey J. McLachlan, Kim-Anh Do, Christophe Ambroise En nuestra práctica: Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health) by Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry, Sandrine Dudoit Microarray Gene Expression Data Analysis: A Beginner's Guide by Helen Causton www.r-project.org Windows Mac OS X LINUX/UNIX

Gracias!! Universidad Complutense de Madrid ESCUELA DE VERANO 2007 1. 2. 3. 4. 5. Introducción Microarrays (tipos, tratamiento de las muestras e hibridación, bases de datos, aplicaciones ) Normalización y análisis exploratorio (JC Oliveros) Análisis de datos (Análisis supervisado y no supervisado, algoritmos ) Práctica Gonzalo Gómez López ggomez@cnio.es Gonzalo Gómez López ggomez@cnio.es Otras herramientas gratuitas: -BASE http://base.thep.lu.se/ http://www.gepas.org http://www.babelomics.org -BRBTools http://linus.nci.nih.gov/brb-arraytools.html -TM4(MeV) http://www.tm4.org/mev.html -Otras SNOMAD, DAVID, MIDAW

www.gepas.org INPUT FILE FORMAT (preprocessing)

Inf.cluster/genes Expandir arbol

SOTA Inf.cluster/genes SOMs FatiScan Inf.cluster/genes

FDR!!! FatiScan Output FatiScan output Enriched in A Enriched in B % Genes with the specific GO annotation for each partition