From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data


Outline: experimental design; data collection; modeling; statistical testing

Outline: experimental design (biological heterogeneity; replicated vs unreplicated experimental design; biological vs technical replicates; pooling; multiplexing); data collection; modeling; statistical testing

Experimental Design: Unreplicated
Definition: One biological replicate per treatment group.
Pros: Cheap, and can be informative.
Cons: We can only make inferences about the particular biological individuals, not the treatment groups.
Applications: Pilot studies (although not to assess variation!), non-model organism runs focused on reference transcriptome assembly.

Experimental Design: Replicated
Definition: Multiple biological replicates per treatment group.
Pros: We can make inferences about the treatment groups, and we can be more confident about our inferences.
Cons: More expensive.
Applications: Differential expression (and alternative splicing) analysis to make inferences about treatment groups; reliable network inference.

Experimental Design: Biological vs Technical Replicates
Biological replicates come from multiple individuals; technical replicates come from one individual, with some technical steps repeated. Usually biological variance > technical variance, so biological replicates are more useful. Again, they also allow us to make inferences about the treatment groups.

Experimental Design: Pooling
Definition: Combining multiple samples (individuals, tissues, etc.) during preparation into a single sample, assayed together.
Pros: Necessary when there isn't enough sample per individual for sequencing. Unreplicated, pooled samples could also decrease bias.
Cons: All ability to measure variability between individuals is lost. A single outlier could bias an entire sample.
Applications: Small or difficult-to-collect samples; possibly to reduce bias in unreplicated designs.

Experimental Design: Multiplexing
Definition: Attach a unique nucleotide sequence to each sample/replicate group and combine into one pooled sample. Spread the pooled sample across multiple lanes and sequence.
Pros: Removes technical variation as a source of confounding.
Cons: Shorter reads, slightly higher cost.
Applications: Generally recommended in all differential expression studies.

Outline: experimental design; data collection (multireads; genomic vs transcriptomic mapping); modeling; statistical testing

Multireads
A multiread is a read that maps equally well to many reference sequences. By default, BWA maps these randomly and uniformly across all equally good reference positions.
Example: the read AGTCGACTAGCTATTAGCATG matches both transcript 1 and transcript 2, each of which contains AGTCGACTAGCTATTAGCATG.

Genomic Mapping
[Figure: reads from the mut and wt samples mapped to the genome.]

Genomic Mapping
Advantages:
- Less likely to have multireads across different isoforms.
- One can get a sense of the coverage across exons.
Disadvantages:
- It's a bit involved to estimate isoform expression.
- Needs an (annotated) genome! (i.e. not great for non-model organisms)

Transcriptomic Mapping
[Figure: reads from the mut and wt samples mapped to isoform 1 and isoform 2.]

Transcriptomic Mapping
Advantages:
- Transcript-level expression.
- Slightly easier to do.
Disadvantages:
- Multiple isoforms can share an exon. Thus, we can get multireads.
- Requires annotation to roll up to gene-level counts.

Where does RNA-seq data come from?
[Figure: reads counted over a gene, giving mut = 12 and wt = 21. Could the difference reflect differential isoform expression?]


Count data (unreplicated)
gene   wt     mut
24     3203   2304
12     23     14
5      2      0
2      1      0
34     0      0
21     13     14
56     54     32
3      12     12

Count data (replicated)
gene   wt1    wt2    mut1   mut2
24     3203   3215   2304   2220
12     23     30     14     5
5      2      3      0      5
2      1      5      0      6
34     0      3      0      2
21     13     14     14     0
56     54     59     32     31
3      12     155    12     16

Normalization: Why normalize?
Suppose there are two lanes of data, with twice as many sequences in lane A as in lane B. Without normalization, every gene will appear to be upregulated in lane A.

Normalization Techniques: RPKM
Reads Per Kilobase per Million mapped reads (RPKM) is a common normalization procedure.
RPKM = (reads mapped to gene) / [(total reads mapped in lane, in millions) x (gene length, in kilobases)]
However, a few highly expressed genes can dominate total lane counts. Consequently, changes in highly expressed genes can disproportionately affect the scaling factor.
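As a minimal sketch of the RPKM formula above (the function name, argument names, and example numbers are illustrative assumptions, not taken from the slides):

```python
def rpkm(gene_reads, total_mapped_reads, gene_length_bp):
    """Reads per kilobase of gene length per million mapped reads."""
    millions_mapped = total_mapped_reads / 1e6   # lane total, in millions
    kilobases = gene_length_bp / 1e3             # gene length, in kilobases
    return gene_reads / (millions_mapped * kilobases)

# Hypothetical gene: 500 reads, 2,000 bp long, in a lane with 10 million mapped reads.
print(rpkm(gene_reads=500, total_mapped_reads=10_000_000, gene_length_bp=2_000))  # 25.0
```

Because the lane total sits in the denominator, a handful of very highly expressed genes can shift the RPKM of every other gene in the lane.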

Normalization Techniques: RPKM
For example, in one lane of data, the top 2% of genes make up 30% of total lane counts. These 411 genes (out of 20,545) dominate the lane. A constant scaling factor based on total lane count over-emphasizes the expression of these genes.

Normalization Techniques: Quantile-based techniques
Idea: rescale an empirical distribution to a theoretical one by ordering both and setting the nth smallest value of the empirical distribution equal to the nth smallest value of the theoretical distribution. Bullard et al. (2010) showed that these methods lead to more accurate differential expression results when verified with qPCR.
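The rank-matching idea can be sketched in a few lines. This is an illustration under stated assumptions, not the specific method evaluated by Bullard et al.: the reference distribution is taken to be the mean of the sorted samples (a common choice, assumed here), ties are broken by sort order, and all names are hypothetical.

```python
def quantile_map(values, reference):
    """Replace the nth smallest entry of `values` with the nth smallest of `reference`."""
    if len(values) != len(reference):
        raise ValueError("values and reference must have the same length")
    ref_sorted = sorted(reference)
    # Indices of `values`, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = ref_sorted[rank]
    return out

def quantile_normalize(samples):
    """Map every sample onto the mean of the sorted samples (assumed reference)."""
    sorted_cols = [sorted(s) for s in samples]
    n_genes = len(samples[0])
    reference = [sum(col[r] for col in sorted_cols) / len(samples) for r in range(n_genes)]
    return [quantile_map(s, reference) for s in samples]

lane_a = [10, 40, 20, 30]
lane_b = [5, 20, 10, 15]   # same gene ordering, half the sequencing depth
norm_a, norm_b = quantile_normalize([lane_a, lane_b])
print(norm_a)  # [7.5, 30.0, 15.0, 22.5]
print(norm_b)  # [7.5, 30.0, 15.0, 22.5]
```

After normalization the two lanes share the same distribution, so the depth difference no longer looks like upregulation.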

Normalization Techniques: DESeq's Approach
Size factors are estimated for each column (sample) of the data. Size factors are then used directly in the model fitting step. First, a pseudoreference is created by taking the geometric mean across each row (gene). Then, for each sample, the size factor is the median, over genes, of the ratio of that sample's count to the pseudoreference value.
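A rough sketch of the median-of-ratios calculation described above, in plain Python rather than the DESeq package itself. Skipping genes that have a zero count in any sample (their geometric mean is zero) is an assumption made here to keep the ratios defined; the function name and toy counts are hypothetical.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: list of rows (genes), each a list of per-sample counts."""
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue  # geometric mean would be zero; skip this gene (assumption)
        # Pseudoreference for this gene: geometric mean across samples.
        geo_mean = math.exp(sum(math.log(c) for c in row) / n_samples)
        for j, c in enumerate(row):
            ratios[j].append(c / geo_mean)
    # Size factor per sample: median ratio to the pseudoreference.
    # Raises statistics.StatisticsError if every gene was skipped.
    return [median(r) for r in ratios]

# Toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 3]]
sf = size_factors(toy_counts)
print(sf[1] / sf[0])  # close to 2
```

Using a median of ratios, rather than a lane total, keeps a few dominant genes from setting the scale for everything else.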


Outline: experimental design; data collection; modeling (Poisson vs Negative Binomial models; assessing model assumptions); statistical testing

Modeling RNA-Seq data: Example: Poisson models
[Image of human brain from Anne Brogdon, http://annebrogdonportfolio.blogspot.com/]


Modeling RNA-Seq data: Models for Overdispersion
DESeq and edgeR, both from Bioconductor, use a Negative Binomial model, which models the mean and variance separately. Both packages have ways of assessing model fit. Use them!
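To see why a separate variance parameter matters, one can compare the sample variance of replicate counts with the Poisson expectation (variance equal to the mean). The sketch below assumes the common Negative Binomial parameterization var = mu + alpha * mu^2 and a crude per-gene method-of-moments estimate of alpha on raw, unnormalized counts. DESeq and edgeR use more careful estimators that share information across genes, so this is only an illustration; the function name is hypothetical.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Return (mean, sample variance, crude dispersion) for one gene's replicate counts."""
    mu = mean(counts)
    var = variance(counts)  # sample variance, n - 1 denominator
    if mu == 0:
        return mu, var, 0.0
    # Under var = mu + alpha * mu**2, solve for alpha; clamp at 0 (the Poisson case).
    alpha = max(0.0, (var - mu) / mu ** 2)
    return mu, var, alpha

# wt1/wt2 counts for two genes from the replicated count table.
print(moment_dispersion([12, 155]))  # variance far above the mean: overdispersed
print(moment_dispersion([54, 59]))   # variance below the mean: alpha clamps to 0.0
```

With only two replicates these per-gene estimates are extremely noisy, which is one reason the packages pool information across genes.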

Modeling RNA-Seq data: Consistency between edgeR and DESeq
Using data from Mariano, et al, 2008

Modeling RNA-Seq data: Models for Overdispersion
Why the difference? DESeq allows for a more local dispersion parameter for similar genes, whereas edgeR has a fixed dispersion parameter.* (Anders and Huber, 2010)
[Figure: the orange dashed line is the variance estimated by edgeR, the purple line is the variance from Poisson, and the solid orange line is the variance estimated by DESeq.]
*New versions of edgeR actually allow local fits, and new versions of DESeq have a fixed dispersion parameter! I am simplifying because this is how it is presented in the DESeq paper.

Outline: experimental design; data collection; modeling; statistical testing (approaches to testing; p-values; FWER; FDR; q-values)

Testing Hypotheses


Why does this matter?

Multiple Testing
[Figure: a data matrix of n samples by p genes, with one hypothesis per gene: H1, H2, H3, ..., Hp.]
We're doing p simultaneous tests!

Multiple Testing
20,000 simultaneous t-tests on random normal data drawn from the same distribution. There are 1,009 green points (false positives), about 5% of the comparisons, as expected at α = 0.05.
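A simulation along these lines is easy to reproduce. The sketch below is not the code behind the slide: to stay within the Python standard library it uses a two-sided z-test with known unit variance instead of a t-test, and the group size, seed, and function name are assumptions. The point is only that roughly α of the null tests come out "significant".

```python
import random
from statistics import NormalDist

def simulate_null_tests(n_tests=20_000, n_per_group=10, alpha=0.05, seed=0):
    """Count false positives when both groups come from the same N(0, 1) distribution."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of a difference of two group means when sigma = 1 is known.
    se = (2 / n_per_group) ** 0.5
    false_positives = 0
    for _ in range(n_tests):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (sum(a) / n_per_group - sum(b) / n_per_group) / se
        p = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided p-value
        if p < alpha:
            false_positives += 1
    return false_positives

fp = simulate_null_tests()
print(fp, fp / 20_000)  # the fraction should be near 0.05
```

Every one of these rejections is a false positive, since no gene in the simulation is truly different between groups.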

Multiple Testing: Familywise Error Rate
                         declared          declared
                         non-significant   significant   total
true null hypotheses     U                 V             m0
false null hypotheses    T                 S             m - m0
total                    m - R             R             m
FWER = P(V ≥ 1) = 1 - P(V = 0)

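Multiple Testing: Bonferroni Correction
One way of controlling FWER: test each hypothesis at level α/n instead of α, where n here is the number of tests performed (not the number of samples).
Problem: very conservative.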
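A short numerical sketch of why the correction is needed and what it costs. The FWER formula used here, 1 - (1 - α)^m, holds only under the added assumption that the m tests are independent and all null hypotheses are true; the function names are illustrative.

```python
def fwer_independent(alpha, m):
    """P(at least one false positive) for m independent tests of true nulls at level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_threshold(alpha, m):
    """Per-test significance level that keeps FWER at or below alpha."""
    return alpha / m

print(round(fwer_independent(0.05, 10), 3))                            # 0.401
print(bonferroni_threshold(0.05, 20_000))                              # per-test level for 20,000 genes
print(fwer_independent(bonferroni_threshold(0.05, 10), 10) <= 0.05)    # True
```

With 20,000 genes the per-test threshold becomes so small that many real differences will be missed, which is the sense in which the correction is conservative.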


Multiple Testing: False Discovery
Using the same table as for the familywise error rate, V is the number of true null hypotheses declared significant and R is the total number of hypotheses declared significant.
Control this: FDR = E[V/R] (Benjamini and Hochberg, 1995)
Not this: FWER = P(V ≥ 1)

Multiple Testing: False Discovery Procedure (Benjamini and Hochberg, 1995)
[Figure parameters: δ = 0.05, n = 10.]
Imagine 100 genes were tested at δ = 0.1. If 40 were found significant, we'd expect 4 of them (10% of 40) to be false discoveries.
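The Benjamini-Hochberg step-up procedure itself is short: sort the m p-values, find the largest rank k such that the kth smallest p-value is at most (k/m) * δ, and reject the k hypotheses with the smallest p-values. The sketch below follows that rule; the function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values, delta=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at FDR level delta."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k with p_(k) <= (k / m) * delta.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * delta:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for 10 tests.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, delta=0.05))
# [True, True, False, False, False, False, False, False, False, False]
```

Note the step-up behaviour: a p-value that misses its own threshold is still rejected if a larger p-value further down the sorted list meets its threshold.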


Multiple Testing: Storey's q-value (Storey, 2002; Storey and Tibshirani, 2003)
When a given test is called significant, its q-value is the expected proportion of false discoveries among all tests with p-values as or more extreme. For example, a q-value of 0.023 says that 2.3% of genes with p-values as or more extreme (less likely under the null) are expected to be false positives.
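A simplified sketch of how q-values can be computed from p-values. It estimates the proportion of true nulls, pi0, from the fraction of p-values above a single fixed cutoff lam = 0.5; that fixed cutoff is an assumption made here for brevity (Storey and Tibshirani smooth over a range of cutoffs), and the function name is hypothetical. With very few tests the estimate can be 0, which makes every q-value 0, so this is only meaningful for large numbers of tests.

```python
def q_values(p_values, lam=0.5):
    """Simplified q-values: pi0 from a single cutoff, then a monotone step-up pass."""
    m = len(p_values)
    # Estimated proportion of true null hypotheses, capped at 1.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1 - lam)))
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values never decrease with p.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[idx] / rank)
        q[idx] = running_min
    return q

p = [0.01, 0.02, 0.03, 0.04, 0.6, 0.7, 0.8, 0.9]
print([round(x, 2) for x in q_values(p)])
# [0.08, 0.08, 0.08, 0.08, 0.9, 0.9, 0.9, 0.9]
```

When pi0 is estimated as 1, these values coincide with Benjamini-Hochberg adjusted p-values; a smaller pi0 makes the q-values proportionally smaller.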

Multiple Testing: Storey's q-value
Practical Example: You have funds to test the top 100 differentially expressed gene candidates. How should you pick them? One way: order by absolute value of log fold change and take the top 100 genes. Then order these 100 by q-value; the product of 100 and the last (largest) q-value is the expected number of false positives among them.
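The arithmetic in this example can be sketched as follows. The record layout (dictionaries with "gene", "log_fc" and "q" keys) and the function name are assumptions for illustration, and a list of 3 candidates with n_top = 2 stands in for the top-100 selection.

```python
def expected_false_positives(candidates, n_top=100):
    """Pick the n_top candidates by |log fold change|; estimate false positives among them."""
    top = sorted(candidates, key=lambda c: abs(c["log_fc"]), reverse=True)[:n_top]
    worst_q = max(c["q"] for c in top)   # the last q-value after ordering by q
    return top, len(top) * worst_q

# Hypothetical candidates.
candidates = [
    {"gene": "g1", "log_fc": 3.0, "q": 0.01},
    {"gene": "g2", "log_fc": -2.5, "q": 0.04},
    {"gene": "g3", "log_fc": 0.5, "q": 0.001},
]
top, expected_fp = expected_false_positives(candidates, n_top=2)
print([c["gene"] for c in top], expected_fp)
```

If the expected number of false positives is too high for the validation budget, the q-value cutoff can be tightened before ranking by fold change.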

Reading Top Tables

Practical: Reading Top Tables
Recall: it's not just about significance, but effect size.
Sorting options:
- absolute value of log FC (decreasing)
- absolute value of adjusted log FC (decreasing)
- p-value (increasing)
Combinations: absolute value of adjusted log FC (decreasing), subset by adjusted p-value less than some threshold.
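The combined approach, subsetting by adjusted p-value and then sorting by effect size, might look like this in plain Python. The field names ("gene", "log_fc", "padj"), the default cutoff, and the function name are hypothetical and do not correspond to the column names of any particular package's top table.

```python
def top_table(results, padj_cutoff=0.05, n_top=100):
    """Keep rows with adjusted p-value below the cutoff, sorted by decreasing |log FC|."""
    kept = [r for r in results if r["padj"] < padj_cutoff]
    kept.sort(key=lambda r: abs(r["log_fc"]), reverse=True)
    return kept[:n_top]

# Hypothetical results.
results = [
    {"gene": "g1", "log_fc": 0.4, "padj": 0.001},
    {"gene": "g2", "log_fc": -3.1, "padj": 0.020},
    {"gene": "g3", "log_fc": 5.0, "padj": 0.300},   # large effect, but not significant
    {"gene": "g4", "log_fc": 1.8, "padj": 0.049},
]
print([r["gene"] for r in top_table(results)])  # ['g2', 'g4', 'g1']
```

Filtering first and ranking second keeps large but unreliable fold changes (such as g3 here) out of the list while still prioritizing effect size among the significant genes.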


Beyond Differential Expression: Differentially Expressed Gene Combinations (Dettling et al., 2005)

Acknowledgements
The Bioinformatics Core: Dr. Dawei Lin, Ph.D. (Director)
Data Analysis: Dr. Joe Fass, Ph.D. (Lead); Dr. Monica Britton; Mr. Nikhil Joshi
Statistical Programming: Mr. Vince Buffalo (Lead)
Application Development (Web/DB): Mr. Jose Boveda (Lead)
System Admin & HPC: Dr. Zhi-Wei Lu (Lead)
Visiting members: Ms. Xinran Dong
Campus Scientific Advisory Board
Chair: Dr. Craig Benham, Ph.D. (Mathematics)
Members: Dr. Gino Cortopassi, Ph.D. (Molecular Sciences); Dr. Vladimir Filkov, Ph.D. (Computer Sciences); Dr. Fredric Gorin, Ph.D. (Neurosciences); Dr. Juan Medrano, Ph.D. (Animal Sciences); Dr. Jie Peng, Ph.D. (Statistics); Dr. David Rocke, Ph.D. (Biostatistics)
Genome Center Director: Dr. Richard Michelmore, Ph.D.
Associate Directors for Bioinformatics: Dr. Ian Korf, Ph.D.; Dr. Patrice Koehl, Ph.D.