Course on Functional Analysis. ::: Gene Set Enrichment Analysis - GSEA -




Course on Functional Analysis ::: Madrid, June 31st, 2007. Gonzalo Gómez, PhD. ggomez@cnio.es Bioinformatics Unit CNIO

::: Contents. 1. Introduction. 2. GSEA Software 3. Data Formats 4. Using GSEA 5. GSEA Output 6. GSEA Results 7. Leading Edge Analysis

::: Introduction. GSEA (MIT, Broad Institute). v2.0 available since Jan 2007; v2.0.1 available since Feb 16th 2007. Version 2.0 includes Biocarta, Broad Institute, GenMAPP, KEGG annotations and more... Platforms: Affymetrix, Agilent, CodeLink, custom... (Subramanian et al. PNAS. 2005.)

::: Introduction. ::: How does GSEA work? GSEA applies a Kolmogorov-Smirnov test to find asymmetrical distributions of defined blocks of genes within the whole distribution of a dataset. Is this particular gene set enriched in my experiment? Gene sets can be genes selected by the researcher, Biocarta pathways, GenMAPP sets, genes sharing a cytoband, genes targeted by common miRNAs... it is up to you.

::: Introduction. ::: K-S test The Kolmogorov-Smirnov test is used to determine whether two underlying one-dimensional probability distributions differ, or whether an underlying probability distribution differs from a hypothesized distribution, in either case based on finite samples. The one-sample KS test compares the empirical distribution function with the cumulative distribution function specified by the null hypothesis. The main applications are testing goodness of fit with the normal and uniform distributions. The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples.
[Figure: dataset distribution compared with the gene set 1 and gene set 2 distributions; axes: number of genes, gene expression level.]
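
The two-sample comparison described above can be sketched in a few lines. This is a minimal illustration that only computes the KS statistic D (the largest gap between the two empirical cumulative distribution functions), not a p-value; the function name `ks_two_sample_statistic` and the example values are hypothetical and do not come from the course material.

```python
from bisect import bisect_right


def ks_two_sample_statistic(sample_a, sample_b):
    """Return D = max |F_a(x) - F_b(x)| over all observed values x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    for x in a + b:
        # Empirical CDF: fraction of observations less than or equal to x.
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical expression levels for a whole dataset and for one gene set.
dataset = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
gene_set = [6.5, 7.5, 8.5]
print(ks_two_sample_statistic(dataset, gene_set))
```

A large D suggests that the gene set is not spread evenly over the dataset distribution, which is the kind of asymmetry the test looks for.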

::: Introduction. ::: How does GSEA work? ClassA vs ClassB: testing genes independently... t-test cut-off at FDR<0.05. Biological meaning?

::: Introduction. ::: How does GSEA work? ClassA vs ClassB; Gene Set 1, Gene Set 2, Gene Set 3; t-test cut-off vs ES/NES statistic. Gene set 2 enriched in Class A; gene set 3 enriched in Class B. (Figure labels: +, -.)

::: Introduction. ES examples :::

::: Introduction. The Enrichment Score ::: NES pval FDR Benjamini-Hochberg

::: GSEA software. Download ::: http://www.broad.mit.edu/gsea/

::: GSEA software. Main Window :::

::: GSEA software. Loading data :::!!!

::: GSEA software. Running GSEA :::

::: GSEA software. Leading Edge Analysis :::

::: GSEA software. MSigDB ::: Chip to Chip Mapping :::

::: Data Formats.

::: Data Formats. Expression datasets ::: *.gct
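
As a rough illustration of the *.gct layout, the sketch below builds the text of a small file: a `#1.2` version line, a line with the number of rows and samples, a header starting with `NAME` and `Description`, and one tab-separated row per probe or gene. The function name `format_gct` and the sample data are hypothetical; check the GSEA documentation before relying on the details.

```python
def format_gct(sample_names, rows):
    """Build GCT text.

    rows: list of (name, description, values) tuples, one per probe or gene.
    """
    lines = ["#1.2", f"{len(rows)}\t{len(sample_names)}"]
    lines.append("\t".join(["NAME", "Description"] + list(sample_names)))
    for name, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{name}: expected {len(sample_names)} values")
        lines.append("\t".join([name, description] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


# Hypothetical two-sample dataset with two probes.
print(format_gct(["tumor_1", "control_1"],
                 [("probe_a", "na", [7.1, 3.2]),
                  ("probe_b", "na", [2.4, 2.6])]))
```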

::: Data Formats. Expression datasets ::: *.res

::: Data Formats. Expression datasets ::: *.pcl

::: Data Formats. Expression datasets ::: *.txt

::: Data Formats. Phenotype datasets ::: *.cls For categorical phenotypes (e.g. Tumor vs Control)
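
For a categorical phenotype, a *.cls file can be sketched as three lines: the number of samples, the number of classes and a `1`; a `#` line naming the classes; and one label per sample in the same order as the expression columns. The helper `format_cls` below is a hypothetical illustration of that layout, not part of GSEA.

```python
def format_cls(labels):
    """Build CLS text from one class label per sample (in column order)."""
    classes = []
    for label in labels:
        if label not in classes:
            classes.append(label)  # keep classes in order of first appearance
    header = f"{len(labels)} {len(classes)} 1"
    names = "# " + " ".join(classes)
    return "\n".join([header, names, " ".join(labels)]) + "\n"


# Hypothetical experiment: three tumor samples followed by two controls.
print(format_cls(["Tumor", "Tumor", "Tumor", "Control", "Control"]))
```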

::: Data Formats. Phenotype datasets ::: *.cls For continuous phenotypes (e.g. a gene correlated with a gene set, or genes vs a time series). Example: a time series sampled every 30 minutes, with the wanted peak profile.

::: Data Formats. Gene Set Database ::: *.gmx

::: Data Formats. Gene Set Database ::: *.gmt
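
In a *.gmt file each row describes one gene set: the set name, a description, and then the member genes, all tab-separated (the *.gmx format holds the same information with one gene set per column). A minimal parser, with the hypothetical name `parse_gmt` and made-up set names, might look like this:

```python
def parse_gmt(text):
    """Return {gene_set_name: [genes]} from GMT-formatted text."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"malformed GMT row: {line!r}")
        name, genes = fields[0], fields[2:]  # fields[1] is the description
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


# Hypothetical gene sets using HUGO gene symbols.
example = "SET_A\tdemo set\tTP53\tMYC\nSET_B\tna\tEGFR\n"
print(parse_gmt(example))
```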

::: Data Formats. Other formats::: *.chip *.grp

::: Data Formats. Ranked list format ::: *.rnk

::: Using GSEA. Loading data :::

::: Using GSEA. Running GSEA :::

::: Using GSEA. ::: MSigDB. gsea_home

::: Using GSEA. Running GSEA ::: 1. Choose true (default) to have GSEA collapse each probe set in your expression dataset into a single gene vector, which is identified by its HUGO gene symbol. In this case, you are using HUGO gene symbols for the analysis. The gene sets that you use for the analysis must use HUGO gene symbols to identify the genes in the gene sets. 2. Choose false to use your expression dataset "as is." In this case, you are using the probe identifiers that are in your expression dataset for the analysis. The gene sets that you use for the analysis must also use these probe identifiers to identify the genes in the gene sets.
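
The collapse step can be pictured as follows. This sketch assumes that, when several probes map to the same gene symbol, the per-sample maximum is kept; that rule, the function name `collapse_to_symbols`, and the probe identifiers are assumptions for illustration, and GSEA's own collapsing options may differ.

```python
def collapse_to_symbols(expression, probe_to_symbol):
    """Collapse probe-level rows into one row per gene symbol.

    expression: {probe_id: [value per sample]}
    probe_to_symbol: {probe_id: gene_symbol}
    """
    collapsed = {}
    for probe, values in expression.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is None:
            continue  # probes without a symbol are dropped in this sketch
        if symbol not in collapsed:
            collapsed[symbol] = list(values)
        else:
            # Assumed rule: keep the per-sample maximum across probes.
            collapsed[symbol] = [max(a, b) for a, b in zip(collapsed[symbol], values)]
    return collapsed


# Hypothetical probes: two map to TP53, one has no gene symbol.
expression = {"probe_1": [1.0, 5.0], "probe_2": [3.0, 2.0], "probe_3": [9.0, 9.0]}
mapping = {"probe_1": "TP53", "probe_2": "TP53"}
print(collapse_to_symbols(expression, mapping))
```

With the collapse set to false, no such mapping is applied and the gene sets must already use the same probe identifiers as the dataset.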

::: Using GSEA. Running GSEA ::: Permutation type: Phenotype, or Gene Sets (when there are few samples)

::: Using GSEA. Running GSEA :::

::: Using GSEA. Chip2Chip mapping ::: Chip2Chip translates the gene identifiers in a gene set from HUGO gene symbols to the probe identifiers for a selected DNA chip.
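
Conceptually, this translation is a reverse lookup in the chip annotation. The sketch below assumes the annotation is available as a probe-to-symbol dictionary; the function name `symbols_to_probes` and the identifiers are hypothetical and are not the Chip2Chip implementation.

```python
def symbols_to_probes(gene_set, probe_to_symbol):
    """Translate gene symbols into the probe identifiers of one chip."""
    symbol_to_probes = {}
    for probe, symbol in probe_to_symbol.items():
        symbol_to_probes.setdefault(symbol, []).append(probe)
    probes = []
    for symbol in gene_set:
        # Symbols that are not on the chip contribute no probes.
        probes.extend(sorted(symbol_to_probes.get(symbol, [])))
    return probes


# Hypothetical chip annotation with two probes for TP53.
annotation = {"probe_2": "TP53", "probe_1": "TP53", "probe_3": "MYC"}
print(symbols_to_probes(["TP53", "MYC", "NOT_ON_CHIP"], annotation))
```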

::: Using GSEA. Enrichment statistic ::: To calculate the enrichment score, GSEA first walks down the ranked list of genes increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not. The enrichment score is the maximum deviation from zero encountered during that walk. This parameter affects the running-sum statistic used for the analysis.
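
The walk down the ranked list can be sketched as below. The weight exponent `p` stands in for the enrichment statistic parameter, under the assumption that p = 0 gives the classic unweighted walk and p = 1 weights each hit by the absolute value of its ranking score; the function name `enrichment_score` and the toy data are hypothetical, and this is a sketch rather than the GSEA implementation.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, running_sum) for genes already sorted by ranking score."""
    members = set(gene_set)
    hits = [gene in members for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("the ranked list needs both members and non-members")
    weights = [abs(s) ** p if hit else 0.0 for s, hit in zip(scores, hits)]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("all member genes have zero weight")
    running, walk = 0.0, []
    for weight, hit in zip(weights, hits):
        # Step up on a gene-set member, step down otherwise.
        running += weight / total_weight if hit else -1.0 / n_miss
        walk.append(running)
    # ES is the maximum deviation from zero, keeping its sign.
    return max(walk, key=abs), walk


# Toy ranked list: the gene set sits at the top, so ES is positive.
genes = ["g1", "g2", "g3", "g4"]
scores = [4.0, 3.0, -1.0, -2.0]
es, walk = enrichment_score(genes, scores, {"g1", "g2"}, p=0)
print(es, walk)
```

A gene set concentrated at the bottom of the list gives a negative ES by the same walk.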

::: Using GSEA. Ranking Metric ::: Categorical phenotypes: Signal2Noise, tTest, Ratio_of_Classes, Diff_of_Classes, Log2_Ratio_of_Classes. Continuous phenotypes: Pearson (time series), Cosine, Euclidean, Manhattan.
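
As an example of a ranking metric for categorical phenotypes, the signal-to-noise ratio of a gene is the difference of the class means divided by the sum of the class standard deviations. The sketch below implements only this basic formula; any adjustment GSEA applies to the standard deviations is left out, and the function names and values are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """(mean_A - mean_B) / (sd_A + sd_B) for one gene."""
    noise = stdev(values_a) + stdev(values_b)
    if noise == 0:
        raise ValueError("both classes have zero variance")
    return (mean(values_a) - mean(values_b)) / noise


def rank_genes(expression_a, expression_b):
    """Return (gene, score) pairs sorted from most ClassA-like to most ClassB-like."""
    scored = [(gene, signal_to_noise(expression_a[gene], expression_b[gene]))
              for gene in expression_a]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical expression values for two genes in ClassA and ClassB.
class_a = {"gene_up": [2.0, 4.0, 6.0], "gene_down": [1.0, 2.0, 3.0]}
class_b = {"gene_up": [1.0, 2.0, 3.0], "gene_down": [2.0, 4.0, 6.0]}
print(rank_genes(class_a, class_b))
```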

::: Using GSEA. Ranking Metric :::

::: Using GSEA. More parameters ::: Sorting by real values: 8.2, 8.1, 8.0, -7.5, -7.7, -7.9. Sorting by absolute values (abs): 8.2, 8.1, 8.0, 7.9, 7.7, 7.5. A further parameter determines whether to sort the genes in descending (default) or ascending order.
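
The effect of the two sorting parameters can be reproduced with the values from the slide. The function `sort_ranked_list`, its argument names, and the gene labels are hypothetical; the sketch only shows how real versus absolute sorting changes the order.

```python
def sort_ranked_list(scored_genes, mode="real", order="descending"):
    """Sort (gene, score) pairs by real or absolute score."""
    if mode not in ("real", "abs"):
        raise ValueError("mode must be 'real' or 'abs'")
    if order not in ("descending", "ascending"):
        raise ValueError("order must be 'descending' or 'ascending'")
    if mode == "abs":
        # Absolute mode replaces each score by its magnitude before sorting.
        scored_genes = [(gene, abs(score)) for gene, score in scored_genes]
    return sorted(scored_genes, key=lambda item: item[1],
                  reverse=(order == "descending"))


# Scores taken from the slide, attached to made-up gene labels.
genes = [("a", 8.2), ("b", -7.9), ("c", 8.0), ("d", -7.5), ("e", 8.1), ("f", -7.7)]
print([score for _, score in sort_ranked_list(genes, mode="real")])
print([score for _, score in sort_ranked_list(genes, mode="abs")])
```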

::: Using GSEA. Launching Analysis :::

::: GSEA output. Accessing results ::: By default, results are written to gsea_home: C:\Documents and settings\username\gsea_home or /Users/yourhome/gsea_home

::: GSEA results. Index.html ::: Heat map of the top 50 features for each phenotype and a plot showing the correlation between the ranked genes and the phenotypes. In a heat map, expression values are represented as colors, where the range of colors (red, pink, light blue, dark blue) shows the range of expression values (high, moderate, low, lowest).

::: GSEA results. Enrichment results in html :::

::: GSEA results. Enrichment results in html ::: How do I judge my results? FDR < 0.25, NOM p-val < 0.05
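
These cut-offs can be applied to a results table with a simple filter. The sketch assumes each gene set is a dictionary with hypothetical keys `name`, `nes`, `nom_pval` and `fdr`, and that both thresholds are required together; the slide does not say whether they are meant to be combined, so treat that as an assumption.

```python
def significant_sets(results, fdr_cutoff=0.25, pval_cutoff=0.05):
    """Keep gene sets passing both cut-offs, strongest |NES| first."""
    kept = [row for row in results
            if row["fdr"] < fdr_cutoff and row["nom_pval"] < pval_cutoff]
    return sorted(kept, key=lambda row: abs(row["nes"]), reverse=True)


# Made-up rows, only to show the filter; these are not real GSEA results.
rows = [
    {"name": "SET_A", "nes": 1.9, "nom_pval": 0.01, "fdr": 0.10},
    {"name": "SET_B", "nes": -2.3, "nom_pval": 0.00, "fdr": 0.02},
    {"name": "SET_C", "nes": 1.2, "nom_pval": 0.20, "fdr": 0.60},
]
print([row["name"] for row in significant_sets(rows)])
```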

::: GSEA results. Leading Edge Analysis :::

::: GSEA results. Leading Edge Analysis ::: Heat Map, Set-to-Set, Gene in Subsets, Histogram

::: GSEA results. Leading Edge Analysis ::: Heat Map The heat map shows the (clustered) genes in the leading edge subsets. In a heat map, expression values are represented as colors, where the range of colors (red, pink, light blue, dark blue) shows the range of expression values (high, moderate, low, lowest).

::: GSEA results. Leading Edge Analysis ::: Set-to-Set The graph uses color intensity to show the overlap between subsets: the darker the color, the greater the overlap between the subsets. When you compare a leading edge subset to itself, its members completely overlap, so the corresponding cell is dark green. When you compare two subsets that have no overlapping members, the corresponding cell is white.

::: GSEA results. Leading Edge Analysis ::: Gene in Subsets The graph shows each gene and the number of subsets in which it appears.

::: GSEA results. Leading Edge Analysis ::: Histogram The last plot is a histogram of Jaccard values, where the Jaccard is the intersection divided by the union for a pair of leading edge subsets. Number of Occurrences is the number of leading edge subset pairs in a particular bin. In this example, most subset pairs have no overlap (Jaccard = 0).
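
The overlap measure behind this histogram is easy to compute directly. The sketch below calculates the intersection-over-union for every pair of leading edge subsets and bins the values; the function names, the bin count, and the subsets are hypothetical and only illustrate the calculation.

```python
from itertools import combinations


def jaccard(subset_a, subset_b):
    """Intersection size divided by union size for two gene subsets."""
    a, b = set(subset_a), set(subset_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(subsets):
    """Return {(name_1, name_2): jaccard} for every pair of subsets."""
    return {(n1, n2): jaccard(subsets[n1], subsets[n2])
            for n1, n2 in combinations(sorted(subsets), 2)}


def histogram(values, n_bins=10):
    """Count values in equal-width bins over [0, 1]; 1.0 goes in the last bin."""
    counts = [0] * n_bins
    for value in values:
        counts[min(int(value * n_bins), n_bins - 1)] += 1
    return counts


# Hypothetical leading edge subsets for three gene sets.
subsets = {"SET_A": {"TP53", "MYC"}, "SET_B": {"MYC", "EGFR"}, "SET_C": {"KRAS"}}
pairs = pairwise_jaccard(subsets)
print(pairs)
print(histogram(pairs.values()))
```

Pairs with no shared genes fall in the first bin, which matches the situation described above where most subset pairs have no overlap.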

::: GSEA & FatiScan. FatiScan detects significant functions, using Gene Ontology, InterPro motifs, Swissprot keywords and KEGG pathways, in lists of genes ordered according to different characteristics.

ggomez@cnio.es T H A N K S