Course on Functional Analysis. ::: Gene Set Enrichment Analysis - GSEA -

1 Course on Functional Analysis ::: Madrid, June 31st, Gonzalo Gómez, PhD. Bioinformatics Unit CNIO

2 ::: Contents. 1. Introduction. 2. GSEA Software 3. Data Formats 4. Using GSEA 5. GSEA Output 6. GSEA Results 7. Leading Edge Analysis

4 ::: Introduction. GSEA, MIT Broad Institute. v 2.0 available since Jan 2007; v available since Feb 16th 2007. Version 2.0 includes Biocarta, Broad Institute, GenMAPP, KEGG annotations and more... Platforms: Affymetrix, Agilent, CodeLink, custom... (Subramanian et al. PNAS )

5 ::: Introduction. ::: How does GSEA work? GSEA applies a Kolmogorov-Smirnov-like test to find asymmetrical distributions of defined blocks of genes within the dataset's whole distribution. Is this particular gene set enriched in my experiment? A gene set can be genes selected by the researcher, Biocarta pathways, GenMAPP sets, genes sharing a cytoband, genes targeted by common miRNAs... it is up to you.

6 ::: Introduction. ::: K-S test The Kolmogorov-Smirnov test is used to determine whether two underlying one-dimensional probability distributions differ, or whether an underlying probability distribution differs from a hypothesized distribution, in either case based on finite samples. The one-sample KS test compares the empirical distribution function with the cumulative distribution function specified by the null hypothesis. The main applications are testing goodness of fit with the normal and uniform distributions. The two-sample KS test is one of the most useful and general nonparametric methods for comparing two samples, as it is sensitive to differences in both location and shape of the empirical cumulative distribution functions of the two samples. [Figure: dataset distribution compared with the Gene set 1 and Gene set 2 distributions; axes: Gene Expression Level, Number of genes.]
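
The two-sample comparison described above can be sketched in a few lines. This is only an illustration of the KS statistic (the largest gap between two empirical cumulative distribution functions), not GSEA's own code; the function name and the sample values are made up for the example.

```python
from bisect import bisect_right


def ks_two_sample_statistic(sample_a, sample_b):
    """Return the largest vertical gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of sample_a <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of sample_b <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical expression levels for two groups of genes.
shifted = ks_two_sample_statistic([1, 2, 3, 4], [3, 4, 5, 6])
identical = ks_two_sample_statistic([1, 2, 3], [1, 2, 3])
```

A statistic near 0 means the two distributions look alike; a statistic near 1 means they barely overlap. Turning the statistic into a p-value is a separate step that this sketch leaves out.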

7 ::: Introduction. ::: How does GSEA work? Class A vs Class B: testing genes independently... t-test cut-off FDR<0.05. Biological meaning?

8 ::: Introduction. ::: How does GSEA work? [Figure: ranked gene list (+ to -) for Class A vs Class B, with Gene Set 1, Gene Set 2 and Gene Set 3 marked along the ranking; labels: t-test cut-off, ES/NES statistic.] Gene set 2 enriched in Class A. Gene set 3 enriched in Class B.

9 ::: Introduction. ES examples :::

10 ::: Introduction. The Enrichment Score ::: NES pval FDR Benjamini-Hochberg

12 ::: GSEA software. Download :::

13 ::: GSEA software. Main Window :::

14 ::: GSEA software. Loading data :::!!!

15 ::: GSEA software. Running GSEA :::

16 ::: GSEA software. Leading Edge Analysis :::

17 ::: GSEA software. MSigDB ::: Chip to Chip Mapping :::

19 ::: Data Formats.

20 ::: Data Formats.

21 ::: Data Formats. Expression datasets ::: *.gct

22 ::: Data Formats. Expression datasets ::: *.res

23 ::: Data Formats. Expression datasets ::: *.pcl

24 ::: Data Formats. Expression datasets ::: *.txt

25 ::: Data Formats. Phenotype datasets ::: *.cls For categorical phenotypes (e.g. Tumor vs Control)
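
For orientation, a categorical *.cls file is a three-line text file. The layout used below (a counts line, a "#" line naming the classes, then one label per sample) is my understanding of the format and should be checked against the GSEA data format documentation; the helper name and the sample labels are invented for this sketch.

```python
def make_cls(labels):
    """Build the text of a categorical .cls file from one label per sample.

    Assumed layout:
      line 1: <number of samples> <number of classes> 1
      line 2: # <class names, in order of first appearance>
      line 3: one numeric class index per sample, in sample order
    """
    class_names = list(dict.fromkeys(labels))  # unique labels, order preserved
    index_of = {name: i for i, name in enumerate(class_names)}
    header = f"{len(labels)} {len(class_names)} 1"
    names_line = "# " + " ".join(class_names)
    labels_line = " ".join(str(index_of[label]) for label in labels)
    return "\n".join([header, names_line, labels_line]) + "\n"


# Hypothetical experiment: two tumor samples followed by one control.
cls_text = make_cls(["Tumor", "Tumor", "Control"])
```

The order of labels on the third line has to match the order of the sample columns in the expression dataset.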

26 ::: Data Formats. Phenotype datasets ::: For continuous phenotypes (e.g. gene correlated to gene set, or gene vs time series). Example: time series (sampled every 30 minutes), peak profile wanted.

27 ::: Data Formats. Gene Set Database ::: *.gmx

28 ::: Data Formats. Gene Set Database ::: *.gmt
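
As a rough guide to what a *.gmt file contains: each line describes one gene set as tab-separated fields, with the set name first, a description second, and the member genes after that. The parser below is a minimal sketch under that assumption; the set names and descriptions in the example are hypothetical.

```python
def parse_gmt(text):
    """Parse GMT-style text into {gene_set_name: [gene, ...]}.

    Assumes one gene set per line: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    """
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        name = fields[0]
        genes = [gene for gene in fields[2:] if gene]  # fields[1] is the description
        gene_sets[name] = genes
    return gene_sets


# Hypothetical two-set database.
example = "SET_A\tmade-up description\tTP53\tMDM2\tCDKN1A\nSET_B\tanother one\tMYC\tMAX\n"
sets = parse_gmt(example)
```

A *.gmx file carries the same information with gene sets in columns instead of rows.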

29 ::: Data Formats. Other formats::: *.chip *.grp

30 ::: Data Formats. Ranked list format ::: *.rnk

32 ::: Using GSEA. Loading data :::

33 ::: Using GSEA. Loading data :::

34 ::: Using GSEA. Running GSEA :::

35 ::: Using GSEA. ::: MSigDB. gsea_home

36 ::: Using GSEA. Running GSEA ::: 1. Choose true (default) to have GSEA collapse each probe set in your expression dataset into a single gene vector, which is identified by its HUGO gene symbol. In this case, you are using HUGO gene symbols for the analysis. The gene sets that you use for the analysis must use HUGO gene symbols to identify the genes in the gene sets. 2. Choose false to use your expression dataset "as is." In this case, you are using the probe identifiers that are in your expression dataset for the analysis. The gene sets that you use for the analysis must also use these probe identifiers to identify the genes in the gene sets.
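
To make the "collapse" option concrete, here is a small sketch of merging several probe sets into one vector per gene symbol. Taking the per-sample maximum across probes is an assumption made for this example; the rule GSEA actually applies depends on its own settings, which the slide does not describe. Probe identifiers and values are hypothetical.

```python
def collapse_to_symbols(expression, probe_to_symbol):
    """Collapse probe-level rows into one row per gene symbol.

    expression: {probe_id: [value per sample]}
    probe_to_symbol: {probe_id: gene symbol}
    Probes without a symbol are dropped; probes sharing a symbol are merged
    by keeping the largest value in each sample (an illustrative choice).
    """
    collapsed = {}
    for probe, values in expression.items():
        symbol = probe_to_symbol.get(probe)
        if symbol is None:
            continue
        if symbol not in collapsed:
            collapsed[symbol] = list(values)
        else:
            collapsed[symbol] = [max(old, new) for old, new in zip(collapsed[symbol], values)]
    return collapsed


# Hypothetical dataset: two probes for TP53, one for MYC, one unannotated.
expression = {
    "probe_1": [5.0, 2.0],
    "probe_2": [3.0, 4.0],
    "probe_3": [1.0, 1.5],
    "probe_4": [9.0, 9.0],
}
probe_to_symbol = {"probe_1": "TP53", "probe_2": "TP53", "probe_3": "MYC"}
by_symbol = collapse_to_symbols(expression, probe_to_symbol)
```

After collapsing, the row identifiers are gene symbols, which is why the gene sets must also use gene symbols when this option is true.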

37 ::: Using GSEA. Running GSEA ::: Phenotype Gene Sets (few samples)

38 ::: Using GSEA. Running GSEA :::

39 ::: Using GSEA. Chip2Chip mapping ::: Chip2Chip translates the gene identifiers in a gene set from HUGO gene symbols to the probe identifiers for a selected DNA chip.

40 ::: Using GSEA. Enrichment statistic ::: To calculate the enrichment score, GSEA first walks down the ranked list of genes increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not. The enrichment score is the maximum deviation from zero encountered during that walk. This parameter affects the running-sum statistic used for the analysis.
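
The walk described above can be sketched as follows. This is an illustrative implementation, not GSEA's code: hits are weighted by the absolute ranking score raised to `weight`, misses are penalised equally, and reading `weight` as the slide's "this parameter" is my assumption (`weight=0` gives an unweighted, classic KS-style walk). Gene names and scores are invented.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Walk down the ranked list and return the maximum deviation from zero.

    ranked_genes: gene identifiers, best-ranked first
    scores: ranking metric for each gene, in the same order
    gene_set: members of the gene set being tested
    """
    members = set(gene_set)
    hits = [gene in members for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_total = sum(abs(s) ** weight for s, hit in zip(scores, hits) if hit)
    if n_miss == 0 or hit_total == 0:
        raise ValueError("need at least one gene inside and one outside the set")

    running_sum = 0.0
    max_deviation = 0.0
    for score, hit in zip(scores, hits):
        if hit:
            running_sum += abs(score) ** weight / hit_total  # step up on a hit
        else:
            running_sum -= 1.0 / n_miss  # step down on a miss
        if abs(running_sum) > abs(max_deviation):
            max_deviation = running_sum
    return max_deviation


# Hypothetical ranked list of four genes.
genes = ["g1", "g2", "g3", "g4"]
metric = [2.0, 1.0, -1.0, -2.0]
es_top = enrichment_score(genes, metric, {"g1", "g2"}, weight=0)
es_bottom = enrichment_score(genes, metric, {"g3", "g4"}, weight=0)
```

A set concentrated at the top of the list gives a positive score and a set concentrated at the bottom gives a negative one. Normalisation (NES) and permutation-based significance are not covered by this sketch.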

41 ::: Using GSEA. Ranking Metric ::: Categorical phenotypes: Signal2Noise, tTest, Ratio_of_Classes, Diff_of_Classes, Log2_Ratio_of_Classes. Continuous phenotypes: Pearson (time series), Cosine, Euclidean, Manhattan.
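
As an example of a ranking metric for a categorical phenotype, the signal-to-noise ratio is commonly written as the difference of class means divided by the sum of class standard deviations. The sketch below uses that plain form with the sample standard deviation (an assumption) and omits any adjustment GSEA may apply to very small standard deviations. Gene names and values are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(values_a, values_b):
    """(mean_A - mean_B) / (sd_A + sd_B); needs at least two samples per class.

    Raises ZeroDivisionError if both classes have zero spread.
    """
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def rank_genes(expression, class_a_columns, class_b_columns):
    """Return (gene, score) pairs sorted from most A-like to most B-like."""
    scored = []
    for gene, values in expression.items():
        in_a = [values[i] for i in class_a_columns]
        in_b = [values[i] for i in class_b_columns]
        scored.append((gene, signal_to_noise(in_a, in_b)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical dataset: columns 0-1 are Class A, columns 2-3 are Class B.
expression = {
    "gene_up_in_A": [5.0, 7.0, 1.0, 3.0],
    "gene_up_in_B": [1.0, 3.0, 5.0, 7.0],
    "gene_flat": [2.0, 4.0, 2.0, 4.0],
}
ranking = rank_genes(expression, [0, 1], [2, 3])
```

The resulting ordered list is the input to the enrichment walk: genes at the top are most associated with Class A and genes at the bottom with Class B.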

42 ::: Using GSEA. Ranking Metric :::

43 ::: Using GSEA. Ranking Metric :::

44 ::: Using GSEA. More parameters ::: Sorting mode: real or abs (rank by the real or the absolute value of the ranking metric). Ordering mode: parameter to determine whether to sort the genes in descending (default) or ascending order.

45 ::: Using GSEA. Launching Analysis :::

47 ::: GSEA output. Results Accession ::: By default in gsea_home: C:\Documents and settings\username\gsea_home (Windows) or /Users/yourhome/gsea_home (Mac).

49 ::: GSEA results. Index.html ::: Heat map of the top 50 features for each phenotype and a plot showing the correlation between the ranked genes and the phenotypes. In a heat map, expression values are represented as colors, where the range of colors (red, pink, light blue, dark blue) shows the range of expression values (high, moderate, low, lowest).

50 ::: GSEA results. Enrichment results in html :::

51 ::: GSEA results. Enrichment results in html :::

52 ::: GSEA results. Enrichment results in html ::: How can I decide about my results? Thresholds shown: FDR 0.25, NOM p-val 0.05.

54 ::: GSEA results. Leading Edge Analysis :::

55 ::: GSEA results. Leading Edge Analysis ::: HeatMap Set-to-Set Histogram Gene in Subsets

56 ::: GSEA results. Leading Edge Analysis ::: Heat Map The heat map shows the (clustered) genes in the leading edge subsets. In a heat map, expression values are represented as colors, where the range of colors (red, pink, light blue, dark blue) shows the range of expression values (high, moderate, low, lowest).

57 ::: GSEA results. Leading Edge Analysis ::: Set-to-Set The graph uses color intensity to show the overlap between subsets: the darker the color, the greater the overlap between the subsets. When you compare a leading edge subset to itself, its members completely overlap, so the corresponding cell is dark green. When you compare two subsets that have no overlapping members, the corresponding cell is white.

58 ::: GSEA results. Leading Edge Analysis ::: Gene in Subsets The graph shows each gene and the number of subsets in which it appears.

59 ::: GSEA results. Leading Edge Analysis ::: Histogram The last plot is a histogram of the Jaccard index, which is the intersection divided by the union for a pair of leading edge subsets. Number of Occurrences is the number of leading edge subset pairs in a particular bin. In this example, most subset pairs have no overlap (Jaccard = 0).
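
The overlap measure behind this histogram (and behind the Set-to-Set colors) is easy to reproduce. The sketch below computes intersection over union for every pair of leading edge subsets; the subset names and gene lists are hypothetical, and treating two empty subsets as having no overlap is a choice made for this example.

```python
from itertools import combinations


def jaccard(subset_a, subset_b):
    """Size of the intersection divided by size of the union."""
    a, b = set(subset_a), set(subset_b)
    union = a | b
    if not union:
        return 0.0  # two empty subsets: treated here as no overlap
    return len(a & b) / len(union)


def pairwise_jaccard(subsets):
    """Return {(name_1, name_2): overlap} for every pair of subsets."""
    return {
        (name_1, name_2): jaccard(subsets[name_1], subsets[name_2])
        for name_1, name_2 in combinations(sorted(subsets), 2)
    }


# Hypothetical leading edge subsets from three enriched gene sets.
leading_edges = {
    "SET_A": ["TP53", "MDM2", "CDKN1A"],
    "SET_B": ["TP53", "MDM2", "BAX"],
    "SET_C": ["MYC", "MAX"],
}
overlaps = pairwise_jaccard(leading_edges)
```

Binning the values of `overlaps` gives the histogram: pairs with no shared genes land in the bin at 0, and a subset compared with itself would score 1.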

60 ::: GSEA & FatiScan. FatiScan detects significant functions (Gene Ontology, InterPro motifs, Swiss-Prot keywords and KEGG pathways) in lists of genes ordered according to different characteristics.

61 T H A N K S
