Introduction to microarray data analysis (Introducción al análisis de datos en microarrays)
Universidad Complutense de Madrid, Escuela de Verano 2007

Outline
1. Introduction
2. Microarrays (types, sample treatment and hybridization, databases, applications)
3. Normalization and exploratory analysis (JC Oliveros)
4. Data analysis (supervised and unsupervised analysis, algorithms)
5. Practical session

Gonzalo Gómez López, INB-CNIO, ggomez@cnio.es
Madrid, 26 July 2007

The data
- Genes (thousands) measured across experimental conditions (from tens up to no more than a few hundred).
- Different classes of experimental conditions (A, B, C), e.g. cancer types, tissues, drug treatments, survival time, etc.
- Expression profile of all the genes for one experimental condition (1 array).
- Expression profile of one gene across the experimental conditions.

Characteristics of the data:
- Low signal-to-noise ratio
- High redundancy and intra-gene correlations
- Most of the genes are not informative with respect to the trait under study (physiological conditions, etc.)
- Many genes have no annotations!
Source: J. Dopazo

Brief history in microarray data analysis

Supervised methods (predictors)
- Neural networks (Khan et al. 2001)
- SOMs (Kohonen et al. 1984)
- knn (Ripley 1996; Hastie et al. 2001)
- PAM (Tibshirani et al. 2002)
- DLDA (Dudoit et al. 2002)
- SVMs (Furey et al. 2000)

Unsupervised methods
1. Clustering methods
- UPGMA (Sneath and Sokal 1973)
- k-means (Hartigan and Wong 1979)
- k-medians (Hartigan and Wong 1979)
- SOTA (Herrero et al. 2001)
- SOM (Kohonen 1979; Tamayo et al. 1999)
- Gene Shaving (Hastie et al. 2000)
- Fuzzy methods (Dougherty et al. 2002)
- Probabilistic (Bhattacharjee et al. 2001)
- Metagenes (Pittman et al. 2004)
2. Exploratory analysis
- Parametric: t-test (including Welch's t-test), SAM (Tusher et al. 2001), ANOVA
- Non-parametric: Wilcoxon, Kruskal-Wallis
- Summarizing datasets: PCA
3. Blocks of genes
- Kolmogorov-Smirnov: GSEA (Subramanian et al. 2005)
- FatiScan (Al-Shahrour et al. 2005)
- SAM 3.0 (Jan 07)
- GeneTrail (Backes et al. 2007)

Example
Image analysis -> comparison (normalization and filtering) -> ratios (or not) -> log2 transform -> data analysis (supervised, unsupervised)
Possible outcomes per gene: overexpression, no change, or underexpression.

Why log2 transform?
Underexpression (x1/2) and overexpression (x2) become symmetric around zero:

Ratio scale:          0.25   0.5   1   2   4
Log scale (linear):    -2    -1    0   1   2

[Figure: the same data with no log transform and after log transform.]

Questions & methodological approach
- Unsupervised analysis: Can we find groups of experiments with similar gene expression profiles? Co-expressing genes... What do they have in common? What profile(s) do they display?
- Supervised analysis: Different phenotypes... Which genes are responsible? Molecular classification of samples. Genes of a class... Are there more genes?
- Reverse engineering: Genes interacting in a network (A, B, C, D, E)... What does the network look like?
Source: J. Dopazo
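As a small illustration of the log2 argument, the sketch below converts the ratio values shown on the slide to the log scale. It uses only the Python standard library; the variable names are illustrative and not taken from the slides.

```python
from math import log2

# Expression ratios from the slide: 0.25 and 0.5 are underexpression,
# 1 is no change, 2 and 4 are overexpression.
ratios = [0.25, 0.5, 1, 2, 4]

# After the log2 transform a two-fold decrease and a two-fold increase
# sit at the same distance from zero, on opposite sides.
log_ratios = [log2(r) for r in ratios]

for ratio, log_ratio in zip(ratios, log_ratios):
    print(f"ratio {ratio:>4} -> log2 ratio {log_ratio:+.0f}")
```

On the ratio scale, underexpression is squeezed into the interval (0, 1) while overexpression is unbounded; on the log scale both directions span the same range.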

Algorithms & examples

Reminder: molecular classification of cancer
- Known class prediction (supervised data analysis)?
- Class discovery (unsupervised data analysis)?
- Methods: hierarchical clustering, neural networks, PCA, SVM, SOTA, SOMs, k-means, UPGMA

Algorithms
An algorithm is a set of instructions for solving a problem. When the instructions are followed, it must eventually stop with an answer.

Applications
- Fields: computer science, economy, A.I., engineering, science, mathematics
- Tasks: computational geometry, image processing, design, real-time simulations, economic models, prediction & simulation, cosmology & radio astronomy, condensed matter physics, 3D protein design, cryptography (ciphering and deciphering codes), pattern recognition, artificial vision, evolution of stochastic systems, protein folding prediction, biological evolution, atomistic resolution, nanomaterials, communications, optimization, meteorology (simulation & prediction)

Clustering
Clustering? N items -> K groups

Clustering algorithm: 26 items -> 4 groups (C1, C2, ..., C4)

Why different algorithms?
I) Cartesian algorithm (x, y):
   a) (x > 0, y > 0) -> C3, 1/4 of C2?
   b) (x < 0, y > 0) -> C1, 1/4 of C2?
   c) (x < 0, y < 0) -> C4?, 1/4 of C2?
   d) (x > 0, y < 0) -> C4?, 1/4 of C2?
   a ∪ b ∪ c ∪ d -> C1, C2, C3, C4
II) Polar algorithm (θ, m):
   e) (θ = const, m) -> C1...
   f) (θ, m = const) -> C2, C3?...
   g) (θ, m) -> C1, C2, C3, C4
There is no perfect method.

Data analysis in microarrays: hierarchical algorithms & dendrograms

Different methods
- Hierarchical algorithms: agglomerative (UPGMA), divisive (SOTA)
- Non-hierarchical algorithms: K-means, SOMs, etc.
- Others: Artificial Neural Networks (ANNs), SVM, PCA

UPGMA: Unweighted Pair Group Method with Arithmetic mean
1) B and C are the two closest genes (in their means).
2) B and C are joined, and the distance matrix is recalculated using the new cluster in place of B and C. A and D are now the closest.
3) A and D are joined as well. The elements are reordered to fit the tree topology, and the distance matrix between the existing groups is recalculated.
4) The process ends when the dendrogram has been built.

Distances between genes? Metrics
- Euclidean distance
- Pearson correlation coefficient
- Spearman correlation coefficient

Example: Pearson correlation coefficient

RNA expression    Exp 1   Exp 2   Exp 3
Gene 1             0.7     1.2     1.1
Gene 2             0.3     1.9     0.9
Gene 3             7.3     6.5     8.9

Pairwise coefficients between Gene 1, Gene 2 and Gene 3, as printed on the slide: 0.88, -0.19, -0.62

Distance
- Euclidean
- Correlation: Pearson, Spearman
- City block (Manhattan or taxi cab)
- Others (Canberra, Bray-Curtis, ...)
[Figure: expression levels of genes A, B and C over time t, with the dendrograms obtained under Euclidean distance and under Pearson correlation.]

Dendrograms: linkage methods
- Average-linkage method
- Single-linkage method
- Complete-linkage method
- Others (weighted pair-group average, within-groups, Ward's method, ...)
[Figure: items A to G grouped into clusters 1, 2 and 3, with the dendrogram produced by each linkage method.]
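The UPGMA steps and the Pearson metric can be sketched together as follows. This is a minimal illustration using only the Python standard library, not the implementation used by GEPAS or any other tool named in these slides. The function names (`pearson`, `correlation_distance`, `upgma`) and the choice of 1 - r as the distance are assumptions made for the example. The expression values are taken from the table as extracted, so the coefficients the code computes need not coincide with the ones printed on the slide.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """Assumed distance for this sketch: 0 when r = 1, 2 when r = -1."""
    return 1.0 - pearson(x, y)


def upgma(profiles, distance):
    """Agglomerate clusters by average linkage; return the merge history."""
    clusters = [(name,) for name in profiles]  # start with one gene per cluster

    def average_distance(c1, c2):
        # UPGMA: arithmetic mean of all pairwise distances between members.
        total = sum(distance(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Join the two closest clusters, then recompute on the next pass.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        merges.append((c1, c2, average_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


profiles = {
    "Gene 1": [0.7, 1.2, 1.1],
    "Gene 2": [0.3, 1.9, 0.9],
    "Gene 3": [7.3, 6.5, 8.9],
}

merges = upgma(profiles, correlation_distance)
for left, right, dist in merges:
    print(f"join {left} + {right} at distance {dist:.3f}")
```

Passing a different `distance` function (for example Euclidean distance via `math.dist`) changes which genes are joined first, which is the point made by the Euclidean versus Pearson figure.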

Examples: hierarchical - UPGMA
Garber M. E. et al. PNAS, vol. 98, no. 24; 2001.
In our practical session: GEPAS

SOTA: Self-Organizing Tree Algorithm
- Artificial neural network
- Grows from the root of the tree toward the leaves (from lower to higher resolution)
- A threshold for the resource value must be fixed
- Generates a hierarchical cluster structure at the desired level of resolution
Herrero et al. Bioinformatics 17: 126-136, 2001.

SOTA: Self-Organizing Tree Algorithm. An example
[Figure: SOTA tree with the average profile of each cluster. Number of genes per cluster: 23, 15, 21, 6, 12, 33, 2, 20.]
In our practical session: GEPAS

Data analysis in microarrays: K-means & self-organizing maps (SOMs)

K-means
[Figure: six panels (1 to 6), each showing gene expression against time.]
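The slides name K-means without spelling out the procedure, so the following is only a minimal sketch of the standard Lloyd iteration (assign each profile to its nearest centroid, then move each centroid to the mean of its members). It uses the Python standard library; the function name `kmeans`, the naive initialisation from the first k profiles, and the toy time-course profiles are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, max_iterations=100):
    """Cluster points into k groups; return (labels, centroids)."""
    # Naive deterministic initialisation: the first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical time-course profiles: two rising genes and two falling genes.
profiles = [
    [0.0, 1.0, 2.0],   # rising
    [2.0, 1.0, 0.0],   # falling
    [0.1, 1.1, 2.1],   # rising
    [2.1, 0.9, 0.1],   # falling
]

labels, centroids = kmeans(profiles, k=2)
print(labels)
```

The number of groups K has to be chosen in advance, and real implementations usually try several random initialisations, because the result depends on the starting centroids.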

K-means: example
More information: http://biodiver.bio.ub.es

K-means: example 2
K = 12 (www.isrec.isb-sib.ch)
[Figure label: related genes.]

Self-organizing maps (SOMs) or Kohonen maps
Geometric dependence, e.g. a 4 x 3 map
[Figure: map nodes A, B, C and D; expression against t.]
Kohonen T. Proc IEEE 78(9):1464-1480, 1990.
Tamayo P. et al. PNAS 96: 2907-2912; 1999.

Self-organizing maps (SOMs)
Example: world welfare indices (http://biodiver.bio.ub.es)

Exploratory analysis: Class 1 vs. Class 2
- Testing genes independently: t-test with a cut-off at FDR < 0.05. Biological meaning?
- Currently: testing blocks of genes ~ biological pathways

GSEA: Gene Set Enrichment Analysis (Subramanian et al. PNAS, 2005)
- Broad Institute; v2.0 available since Jan 2007
- The new version includes BioCarta, Broad Institute, GenMAPP and KEGG annotations, and more
- Platforms: Affymetrix, Agilent, CodeLink, 2-color...
- GSEA applies a Kolmogorov-Smirnov test to find asymmetrical distributions of defined blocks of genes within the whole distribution of the dataset.
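The idea of a Kolmogorov-Smirnov style test on a block of genes can be sketched as a running sum over the ranked gene list. This is a simplified, unweighted illustration and not the GSEA software itself: the published method weights each step by the gene's correlation with the phenotype and assesses significance by permutation, both of which are omitted here. The function name `enrichment_score` and the gene identifiers are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum score of a gene set over a ranked gene list.

    The running sum goes up at every gene that belongs to the set and down
    at every gene that does not; the score is the largest deviation from 0.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running, score = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(score):
            score = running
    return score


# Hypothetical ranking: g1 is most associated with Class 1, g8 with Class 2.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]

es_top = enrichment_score(ranked, {"g1", "g2", "g3"})      # piled up at the top
es_bottom = enrichment_score(ranked, {"g6", "g7", "g8"})   # piled up at the bottom
print(es_top, es_bottom)
```

A set concentrated at one end of the ranking gets a score close to +1 or -1, while a set scattered evenly through the list stays near zero; no single gene has to pass a t-test cut-off.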

Now analyzing blocks of genes ~ biological pathways
[Figure: genes ranked from Class 1 (+) to Class 2 (-), with the t-test cut-off and the positions of Gene Set 1, Gene Set 2 and Gene Set 3; ES/NES statistic. Gene set 2 is enriched in Class 1; gene set 3 is enriched in Class 2.]

FatiScan
Al-Shahrour et al. Bioinformatics. 2005 Jul 1;21(13):2988-93.
- Available in the Babelomics suite: www.babelomics.org
- Segmentation test for searching asymmetries
- More statistically sensitive than GSEA
- Few annotations implemented
In our practical session: www.babelomics.org

Datamining
I have my data... I have analyzed their structure and statistical relevance (clustering, statistics, etc.).
- What is this gene, what does it do, and what do we know about it?
- What is in this cluster? Cell cycle...
DBs -> information
Source: J. Dopazo

Mathematical methods for biology: a discussion
Are current mathematical methods useful to explain real and meaningful processes from a biological point of view?
Actually, we don't know: sometimes they are, sometimes not.
New integrative methods should be developed.

Datamining: links

Datamining
Your genes, hypotheses, questions -> links (NetAffx, ...) -> ANSWER
And good luck!

Further reading
- Data Analysis Tools for DNA Microarrays, by Sorin Draghici
- DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modeling, by Pierre Baldi, G. Wesley Hatfield
- Exploration and Analysis of DNA Microarray and Protein Array Data (Wiley Series in Probability and Statistics), by Dhammika Amaratunga, Javier Cabrera
- Statistics for Microarrays: Design, Analysis and Inference, by Ernst Wit, John McClure
- Analyzing Microarray Gene Expression Data (Wiley Series in Probability and Statistics), by Geoffrey J. McLachlan, Kim-Anh Do, Christophe Ambroise
- Bioinformatics and Computational Biology Solutions Using R and Bioconductor (Statistics for Biology and Health), by Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael Irizarry, Sandrine Dudoit
- Microarray Gene Expression Data Analysis: A Beginner's Guide, by Helen Causton

In our practical session: www.r-project.org (Windows, Mac OS X, Linux/UNIX)

Thank you!
Gonzalo Gómez López, ggomez@cnio.es

Other free tools:
- BASE: http://base.thep.lu.se/
- GEPAS: http://www.gepas.org
- Babelomics: http://www.babelomics.org
- BRB-ArrayTools: http://linus.nci.nih.gov/brb-arraytools.html
- TM4 (MeV): http://www.tm4.org/mev.html
- Others: SNOMAD, DAVID, MIDAW

www.gepas.org: input file format (preprocessing)

Cluster/gene information. Expand tree.

SOTA: cluster/gene information. SOMs: cluster/gene information. FatiScan.

FatiScan output
Enriched in A vs. enriched in B: % of genes with the specific GO annotation for each partition. Check the FDR!