Microarray Data Analysis. Statistical methods to detect differentially expressed genes

Similar documents
Statistical issues in the analysis of microarray data

Design and Analysis of Comparative Microarray Experiments

Gene Expression Analysis

From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data

False Discovery Rates

Tutorial for proteome data analysis using the Perseus software platform

Gene expression analysis. Ulf Leser and Karin Zimmermann

Comparative genomic hybridization Because arrays are more than just a tool for expression analysis

Two-Way ANOVA tests. I. Definition and Applications...2. II. Two-Way ANOVA prerequisites...2. III. How to use the Two-Way ANOVA tool?...

Tutorial 5: Hypothesis Testing

Additional sources Compilation of sources:

Descriptive Statistics

Inference for two Population Means

Principles of Hypothesis Testing for Public Health

MAANOVA: A Software Package for the Analysis of Spotted cdna Microarray Experiments

Chapter 5 Analysis of variance SPSS Analysis of variance

Analysis of gene expression data. Ulf Leser and Philippe Thomas

The Variability of P-Values. Summary

AP: LAB 8: THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics

Simple Linear Regression Inference

Cancer Biostatistics Workshop Science of Doing Science - Biostatistics

THE FIRST SET OF EXAMPLES USE SUMMARY DATA... EXAMPLE 7.2, PAGE 227 DESCRIBES A PROBLEM AND A HYPOTHESIS TEST IS PERFORMED IN EXAMPLE 7.

Measuring gene expression (Microarrays) Ulf Leser

Two-sample inference: Continuous data

Analysing Questionnaires using Minitab (for SPSS queries contact -)

Section 13, Part 1 ANOVA. Analysis Of Variance

Molecular Genetics: Challenges for Statistical Practice. J.K. Lindsey

Lesson 1: Comparison of Population Means Part c: Comparison of Two- Means

Statistical Analysis. NBAF-B Metabolomics Masterclass. Mark Viant

Package empiricalfdr.deseq2

Analysis of Illumina Gene Expression Microarray Data

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

Package ERP. December 14, 2015

Microarray Data Analysis. A step by step analysis using BRB-Array Tools

MTH 140 Statistics Videos

Introduction to SAGEnhaft

Sample Size and Power in Clinical Trials

LAB : THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics

Quantitative proteomics background

Chapter 7. One-way ANOVA

Permutation & Non-Parametric Tests

Stat 411/511 THE RANDOMIZATION TEST. Charlotte Wickham. stat511.cwick.co.nz. Oct

II. DISTRIBUTIONS distribution normal distribution. standard scores

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Comparing Means in Two Populations

Evaluation & Validation: Credibility: Evaluating what has been learned

Organizing Your Approach to a Data Analysis

Two-sample hypothesis testing, II /16/2004

HYPOTHESIS TESTING: POWER OF THE TEST

individualdifferences

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

Deterministic and Stochastic Modeling of Insulin Sensitivity

Package dunn.test. January 6, 2016

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Fairfield Public Schools

Analysis of Variance. MINITAB User s Guide 2 3-1

Introduction to Regression and Data Analysis

Thresholding of Statistical Maps in Functional Neuroimaging Using the False Discovery Rate 1

Permutation Tests for Comparing Two Populations

Correlation of microarray and quantitative real-time PCR results. Elisa Wurmbach Mount Sinai School of Medicine New York

Version 4.0. Statistics Guide. Statistical analyses for laboratory and clinical researchers. Harvey Motulsky

Validation parameters: An introduction to measures of

A Better Statistical Method for A/B Testing in Marketing Campaigns

Example: Boats and Manatees

Controlling the number of false discoveries: application to high-dimensional genomic data

Minería de Datos ANALISIS DE UN SET DE DATOS.! Visualization Techniques! Combined Graph! Charts and Pies! Search for specific functions

NCSS Statistical Software

MEASURES OF LOCATION AND SPREAD

UNDERSTANDING THE TWO-WAY ANOVA

Non-Inferiority Tests for One Mean

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

False discovery rate and permutation test: An evaluation in ERP data analysis

MONT 107N Understanding Randomness Solutions For Final Examination May 11, 2010

Some Essential Statistics The Lure of Statistics

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

Introduction to Statistical Methods for Microarray Data Analysis

SIMPLE LINEAR CORRELATION. r can range from -1 to 1, and is independent of units of measurement. Correlation can be done on two dependent variables.

2.500 Threshold e Threshold. Exponential phase. Cycle Number

Introduction to Hypothesis Testing. Hypothesis Testing. Step 1: State the Hypotheses

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Applying Statistics Recommended by Regulatory Documents

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

P(every one of the seven intervals covers the true mean yield at its location) = 3.

QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS

Semi-parametric Differential Expression Analysis via Partial Mixture Estimation

Analysis of Data. Organizing Data Files in SPSS. Descriptive Statistics

Statistics Review PSY379

Statistical Models in R

SAMPLING & INFERENTIAL STATISTICS. Sampling is necessary to make inferences about a population.

Research Methods & Experimental Design

p-values and significance levels (false positive or false alarm rates)

Psychology 60 Fall 2013 Practice Exam Actual Exam: Next Monday. Good luck!

CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

Validation and Calibration. Definitions and Terminology

MIC - Detecting Novel Associations in Large Data Sets. by Nico Güttler, Andreas Ströhlein and Matt Huska

Step-by-Step Guide to Basic Expression Analysis and Normalization

Hypothesis testing. c 2014, Jeffrey S. Simonoff 1

Statistiek II. John Nerbonne. October 1, Dept of Information Science

Transcription:

Microarray Data Analysis Statistical methods to detect differentially expressed genes

Outline The class comparison problem Statistical tests Calculation of p-values Permutations tests The volcano plot Multiple testing Extensions Examples 2

Class comparison: Identifying differentially expressed genes Identify genes differentially expressed between different conditions such as Treatment, cell type,... (qualitative covariates) Dose, time,... (quantitative covariate) Survival, infection time,...! Estimate effects/differences between groups probably using log-ratios, i.e. the difference on log scale log(x)-log(y) [=log(x/y)] 3

What is a significant change? Depends on the variability within groups, which may be different from gene to gene. To assess the statistical significance of differences, conduct a statistical test for each gene. 4

Different settings for statistical tests Indirect comparisons: 2 groups, 2 samples, unpaired E.g. 10 individuals: 5 suffer diabetes, 5 healthy One sample fro each individual Typically: Two sample t-test or similar Direct comparisons: Two groups, two samples, paired E.g. 6 individuals with brain stroke. Two samples from each: one from healthy (region 1) and one from affected (region 2). Typically: One sample t-test (also called paired t-test) or similar based on the individual differences between conditions. 5

Different ways to do the experiment An experiment use cdna arrays ( two-colour ) or affy ( one-colour). Depending on the technology used allocation of conditions to slides changes. Type of chip Experiment 10 indiv. Diab (5) Heal (5) 6 indiv. Region 1 Region 2 cdna (2-col) Reference design. (5) Diab/Ref (5) Heal/Ref 6 slides 1 individual per slide (6) reg1/reg2 Affy (1-col) Comparison design. (5) Diab vs (5) Heal 12 slides (6) Paired differences 6

Natural measures of discrepancy For Direct comparisons in two colour or paired-one colour. 1 Mean (log) ratio = n T n T i= 1 i, (R or M used indistinctly) Classical t-test = t = ( R) SE, ( SE estimates standard error of R) Robust t-test = Use robust estimates of location &scale For Indirect comparisons in two colour or Direct comparisons in one colour. R n 1 T n 1 C Mean difference = T Ci = T C n T i i= 1 nc i= 1 Classical t-test = t = ( T C) s 1/ n + 1/ n p T C Robust t-test = Use robust estimates of location &scale 7

Some Issues Can we trust average effect sizes (average difference of means) alone? Can we trust the t statistic alone? Here is evidence that the answer is no. Gene M1 M2 M3 M4 M5 M6 Mean SD t A 2.5 2.7 2.5 2.8 3.2 2 2.61 0.40 16.10 B 0.01 0.05-0.05 0.01 0 0 0.003 0.03 0.25 C 2.5 2.7 2.5 1.8 20 1 5.08 7.34 1.69 D 0.5 0 0.2 0.1-0.3 0.3 0.13 0.27 1.19 E 0.1 0.11 0.1 0.1 0.11 0.09 0.10 0.01 33.09 Courtesy of Y.H. Yang 8

Some Issues Can we trust average effect sizes (average difference of means) alone? Can we trust the t statistic alone? Here is evidence that the answer is no. Gene M1 M2 M3 M4 M5 M6 Mean SD t A 2.5 2.7 2.5 2.8 3.2 2 2.61 0.40 16.10 B 0.01 0.05-0.05 0.01 0 0 0.003 0.03 0.25 C 2.5 2.7 2.5 1.8 20 1 5.08 7.34 1.69 D 0.5 0 0.2 0.1-0.3 0.3 0.13 0.27 1.19 E 0.1 0.11 0.1 0.1 0.11 0.09 0.10 0.01 33.09 Averages can be driven by outliers. Courtesy of Y.H. Yang 9

Some Issues Can we trust average effect sizes (average difference of means) alone? Can we trust the t statistic alone? Here is evidence that the answer is no. Gene M1 M2 M3 M4 M5 M6 Mean SD t A 2.5 2.7 2.5 2.8 3.2 2 2.61 0.40 16.10 B 0.01 0.05-0.05 0.01 0 0 0.003 0.03 0.25 C 2.5 2.7 2.5 1.8 20 1 5.08 7.34 1.69 D 0.5 0 0.2 0.1-0.3 0.3 0.13 0.27 1.19 E 0.1 0.11 0.1 0.1 0.11 0.09 0.10 0.01 33.09 t s can be driven by tiny variances. Courtesy of Y.H. Yang 10

Variations in t-tests (1) Let R g mean observed log ratio SE g standard error of R g estimated from data on gene g. SE standard error of R g estimated from data across all genes. Global t-test: t=r g /SE Gene-specific t-test t=r g /SE g 11

Some pro s and con s of t-test Test Pro s Con s Global t-test: t=r g /SE Gene-specific: t=r g /SE g Yields stable variance estimate Robust to variance heterogeneity Assumes variance homogeneity biased if false Low power Yields unstable variance estimates (due to few data) 12

T-tests extensions SAM (Tibshirani, 2001) Regularized-t (Baldi, 2001) EB-moderated t (Smyth, 2003) S t = t = Rg = c + SE 0 g vse + ( n 1) SE 2 2 g v 0 R g + n R 2 d SE + d SE 2 2 0 0 g d 0 g + d 13

Up to here : Can we generate a list of candidate genes? With the tools we have, the reasonable steps to generate a list of candidate genes may be: Gene 1: M 11, M 12,., M 1k Gene 2: M 21, M 22,., M 2k. Gene G: M G1, M G2,., M Gk For every gene, calculate S i =t(m i1, M i2,., M ik ), e.g. t-statistics, S, B,? Statistics of interest S 1, S 2,., S G A list of candidate DE genes We need an idea of how significant are these values We d like to assign them p-values 14

Significance testing

Nominal p-values After a test statistic is computed, it is convenient to convert it to a p-value: The probability that a test statistic, say S(X), takes values equal or greater than that taken on the observed sample, say S(X 0 ), under the assumption that the null hypothesis is true p=p{s(x)>=s(x 0 ) H 0 true} 16

Significance testing Test of significance at the α level: Reject the null hypothesis if your p-value is smaller than the significance level It has advantages but not free from criticisms Genes with p-values falling below a prescribed level may be regarded as significant 17

Hypothesis testing overview for a single gene Reported decision H 0 is Rejected (gene is Selected) H 0 is Accepted (gene not Selected) State of the nature ("Truth") H 0 is false (Affected) H 0 is true (Not Affected) TP, prob: 1-α FP, P[Rej H 0 H 0 ]<= α Type I error FN, prob: 1-β Type II error TN, prob: β Sensitiviy TP/[TP+FN] Specificity TN/[TN+FP] Positive predictive value Negative predictive value TP/[TP+FP] TN/[TN+FN] 18

Calculation of p-values Standard methods for calculating p- values: (i) Refer to a statistical distribution table (Normal, t, F, ) or (ii) Perform a permutation analysis 19

(i) Tabulated p-values Tabulated p-values can be obtained for standard test statistics (e.g.the t-test) They often rely on the assumption of normally distributed errors in the data This assumption can be checked (approximately) using a Histogram Q-Q plot 20

Example Golub data, 27 ALL vs 11 AML samples, 3051 genes A t-test yields 1045 genes with p< 0.05 21

(ii) Permutations tests Based on data shuffling. No assumptions Random interchange of labels between samples Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics Repeat for every possible permutation, b=1 B Permute the n data points for the gene (x). The first n1 are referred to as treatments, the second n2 as controls For each gene, calculate the corresponding two sample t-statistic, tb After all the B permutations are done put p = #{b: tb tobserved }/B 22

Permutation tests (2) 23

The volcano plot: fold change vs log(odds) 1 Significant change detected No change detected 24

Multiple testing

How far can we trust the decision? The test: "Reject H 0 if p-val α" issaidtocontrol the type I error because, under a certain set of assumptions, the probability of falsely rejecting H 0 is less than a fixed small threshold P[Reject H 0 H 0 true]=p[fp] α Nothing is warranted about P[FN] Optimal tests are built trying to minimize this probability In practical situations it is often high 26

What if we wish to test more than one gene at once? (1) Consider more than one test at once Two tests each at 5% level. Now probability of getting a false positive is 1 0.95*0.95 = 0.0975 Three tests 1 0.95 3 =0.1426 n tests 1 0.95 n Converge towards 1 as n increases Small p-values don t necessarily imply significance!!! We are not controlling the probability of type I error anymore 27

What if we wish to test more than one gene at once? (2): a simulation Simulation of this process for 6,000 genes with 8 treatments and 8 controls All the gene expression values were simulated i.i.d from a N (0,1) distribution, i.e. NOTHING is differentially expressed in our simulation The number of genes falsely rejected will be on the average of (6000 α), i.e. if we wanted to reject all genes with a p-value of less than 1% we would falsely reject around 60 genes See example 28

Multiple testing: Counting errors State of the nature ("Truth") H 0 is false (Affected) H 0 is true (Not Affected) H 0 is Rejected (Genes Selected) Decision reported H 0 is accepted (Genes not Selected) Total (m-m m α αm 0 (S) o )- (T) m-m (m α αm o 0) αm 0 (V) m o -αm0 (U) m o Total M α (R) m-m α (m-r) m V = # Type I errors [false positives] T = # Type II errors [false negatives] All these quantities could be known if m 0 was known 29

How does type I error control extend to multiple testing situations? Selecting genes with a p-value less than α doesn t control for P[FP] anymore What can be done? Extend the idea of type I error FWER and FDR are two such extensions Look for procedures that control the probability for these extended error types Mainly adjust raw p-values 30

Two main error rate extensions Family Wise Error Rate (FWER) FWER is probability of at least one false positive FWER= Pr(# of false discoveries >0) = Pr(V>0) False Discovery Rate (FDR) FDR is expected value of proportion of false positives among rejected null hypotheses FDR = E[V/R; R>0] = E[V/R R>0] P[R>0] 31

FWER FDR and FWER controlling procedures Bonferroni (adj Pvalue = min{n*pvalue,1}) Holm (1979) Hochberg (1986) Westfall & Young (1993) maxt and minp FDR Benjamini & Hochberg (1995) Benjamini & Yekutieli (2001) 32

Difference between controlling FWER or FDR FWER Controls for no (0) false positives gives many fewer genes (false positives), but you are likely to miss many adequate if goal is to identify few genes that differ between two groups FDR Controls the proportion of false positives if you can tolerate more false positives you will get many fewer false negatives adequate if goal is to pursue the study e.g. to determine functional relationships among genes 33

Steps to generate a list of candidate genes revisited (2) Gene 1: M 11, M 12,., M 1k Gene 2: M 21, M 22,., M 2k. Gene G: M G1, M G2,., M Gk For every gene, calculate S i =t(m i1, M i2,., M ik ), e.g. t-statistics, S, B, Statistics of interest S 1, S 2,., S G Assumption on the null distribution: data normality Nominal p-values P 1, P 2,, P G Adjusted p-values ap 1, ap 2,, ap G A list of candidate DE genes Select genes with adjusted P-values smaller than α 34

Example Golub data, 27 ALL vs 11 AML samples, 3051 genes Bonferroni adjustment: 98 genes with p adj < 0.05 (p raw < 0.000016) 35

Extensions Some issues we have not dealt with Replicates within and between slides Several effects: use a linear model ANOVA: are the effects equal? Time series: selecting genes for trends Different solutions have been suggested for each problem Still many open questions 36

Examples

Ex. 1- Swirl zebrafish experiment Swirl is a point mutation causing defects in the organization of the developing embryo along its ventral-dorsal axis As a result some cell types are reduced and others are expanded A goal of this experiment was to identify genes with altered expression in the swirl mutant compared to the wild zebrafish 38

Example 1: Experimental design Each microarray contained 8848 cdna probes (either genes or EST sequences) 4 replicate slides: 2 sets of dye-swap pairs For each pair, target cdna of the swirl mutant was labeled using one of Cy5 or Cy3 and the target cdna of the wild type mutant was labeled using the other dye Wild type 2 2 Swirl 39

Example 1. Data analysis Gene expression data on 8848 genes for 4 samples (slides): Each hybridixed with Mutant and Wild type On a gene-per-gene basis this is a onesample problem Hypothesis to be tested for each gene: H 0 : log 2 (R/G)=0 The decision will be based on average log-ratios 40

Example 2. Scanvenger receptor BI (SR-BI) experiment Callow et al. (2000). A study of lipid metabolism and atherosclerosis susceptibility in mice. Transgenic mice with SR-BI gene overexpressed have low HDL cholesterol levels. Goal: To identify genes with altered expression in the livers of transgenic mice with SR-BI gene overexpressed mice (T) compared to normal FVB control mice (C). 41

Example 2. Experimental design 8 treatment mice (T i ) and 8 control ones (C i ). 16 hybridizations: liver mrna from each of the 16 mice (T i, C i ) is labelled with Cy5, while pooled liver mrna from the control mice (C*) is labelled with Cy3. Probes: ~ 6,000 cdnas (genes), including 200 related to pathogenicity. T C 8 8 C* 42

Example 2. Data analysis Gene expression data on 6348 genes for 16 samples: 8 for treatment (log T/C*) and 8 for control (log (C/C*)) On a gene-per-gene basis this is a 2 sample problem Hypothesis to be tested for each gene: H 0 : [log (R 1 /G)-log (R 2 /G)]=0 Decision will be based on average difference of log ratios 43