From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data


1 From Reads to Differentially Expressed Genes The statistics of differential gene expression analysis using RNA-seq data

2 experimental design data collection modeling statistical testing

3 biological heterogeneity replicated vs unreplicated experimental design biological vs technical replicates pooling data collection multiplexing modeling statistical testing

4 Experimental Design Unreplicated
Definition: One biological replicate per treatment group.
Pros: Cheap, and can be informative.
Cons: We can only make inferences about the particular biological individuals, not the treatment groups.
Applications: Pilot studies (although not to assess variation!) and non-model organism runs focused on reference transcriptome assembly.

5 Experimental Design Replicated
Definition: Multiple biological replicates per treatment group.
Pros: We can make inferences about the treatment groups, and we can be more confident in those inferences.
Cons: More expensive.
Applications: Differential expression (and alternative splicing) analysis to make inferences about treatment groups; reliable network inference.

6 Experimental Design Biological vs Technical Replicates Biological replicates come from multiple individuals; technical replicates come from one individual with some technical steps repeated. Biological variance usually exceeds technical variance, so biological replicates are more useful. Again, they also allow us to make inferences about the treatment groups.

7 Experimental Design Pooling
Definition: Combining multiple samples (individuals, tissues, etc.) during preparation into a single sample that is assayed together.
Pros: Necessary when there isn't enough sample per individual for sequencing. Unreplicated, pooled samples could also decrease bias.
Cons: All ability to measure variability between individuals is lost. A single outlier could bias an entire sample.
Applications: Small or difficult-to-collect samples; possibly to reduce bias in unreplicated designs.

8 Experimental Design Multiplexing
Definition: Attach a unique nucleotide sequence (barcode) to each sample/replicate group and combine into one pooled sample. Spread the pooled sample across multiple lanes and sequence.
Pros: Removes technical variation as a source of confounding.
Cons: Shorter reads, slightly higher cost.
Applications: Generally recommended in all differential expression studies.

9 experimental design data collection multireads genomic vs transcriptomic mapping modeling statistical testing

10 Multireads
A multiread is a read that maps equally well to multiple reference sequences. By default, BWA assigns such a read uniformly at random among the equally good reference positions.
read:         AGTCGACTAGCTATTAGCATG
transcript 1: AGTCGACTAGCTATTAGCATG
transcript 2: AGTCGACTAGCTATTAGCATG
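The random assignment described above can be sketched in a few lines. This is only an illustration of "uniformly at random among equally good positions", not BWA's actual code; the function name `assign_multiread` and the two transcript hits are hypothetical.

```python
import random

def assign_multiread(candidate_positions, rng=random):
    """Pick one of several equally good alignment positions uniformly at random."""
    if not candidate_positions:
        raise ValueError("read has no candidate alignments")
    return rng.choice(candidate_positions)

# Hypothetical example: one read that aligns equally well to two transcripts.
rng = random.Random(0)
hits = [("transcript_1", 100), ("transcript_2", 250)]
counts = {"transcript_1": 0, "transcript_2": 0}
for _ in range(10_000):
    name, _position = assign_multiread(hits, rng)
    counts[name] += 1
print(counts)  # roughly a 50/50 split between the two transcripts
```

Over many such reads the two transcripts receive about equal counts, regardless of which one the reads truly came from.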

11 Genomic Mapping [Figure: mut and wt reads mapped to the genome]

12 Genomic Mapping
Advantages:
- Less likely to have multireads across different isoforms.
- One can get a sense of the coverage across exons.
Disadvantages:
- It's a bit involved to estimate isoform expression.
- Needs an (annotated) genome! (i.e. not great for non-model organisms)

13 Transcriptomic Mapping [Figure: mut and wt reads mapped to isoform 1 and to isoform 2]

14 Transcriptomic Mapping
Advantages:
- Transcript-level expression.
- Slightly easier to do.
Disadvantages:
- Multiple isoforms can share an exon. Thus, we can get multireads.
- Requires annotation to roll transcript-level counts up to gene-level counts.

15 Where does RNA-seq data come from? mut 12 wt 21

16 Where does RNA-seq data come from? differential isoform expression? mut 12 wt 21

17 Genomic Mapping

18 Count data (unreplicated) gene wt mut

19 Count data (replicated) gene wt1 wt2 mut1 mut

20 Normalization Why normalize? Suppose there are two lanes of data, with twice as many sequences in lane A as in lane B. Without normalization, every gene will appear to be upregulated in lane A.

21 Normalization Techniques RPKM
Reads Per Kilobase per Million mapped reads (RPKM) is a common normalization procedure.
RPKM = (reads mapped to gene) / [(total reads mapped in lane, in millions) × (gene length, in kilobases)]
However, a few highly expressed genes can dominate total lane counts. Consequently, changes in highly expressed genes can disproportionately affect the scaling factor.
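As a minimal sketch of the RPKM calculation, the example below uses a made-up gene with the same underlying expression in two lanes, where lane A has twice the sequencing depth of lane B. The function name `rpkm` and all numbers are illustrative assumptions.

```python
def rpkm(gene_reads, total_mapped_reads, gene_length_bp):
    """Reads per kilobase of gene per million mapped reads."""
    millions_mapped = total_mapped_reads / 1e6
    kilobases = gene_length_bp / 1e3
    return gene_reads / (millions_mapped * kilobases)

# Hypothetical 2,000 bp gene; lane A was sequenced twice as deeply as lane B.
lane_a = rpkm(gene_reads=400, total_mapped_reads=20_000_000, gene_length_bp=2_000)
lane_b = rpkm(gene_reads=200, total_mapped_reads=10_000_000, gene_length_bp=2_000)
print(lane_a, lane_b)  # the raw counts differ twofold, the RPKM values agree
```

The raw counts (400 vs 200) would suggest upregulation in lane A; after scaling by lane depth the two values coincide.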

22 Normalization Techniques RPKM For example, in one lane of data, the top 2% of genes make up 30% of total lane counts. These 411 genes (out of 20,545) dominate the lane. A constant scaling factor based on total lane count over-emphasizes the expression of these genes.

23 Normalization Techniques Quantile-based techniques Idea: rescale the empirical distribution to a theoretical one by ordering both and setting the nth smallest value of the empirical distribution equal to the nth smallest value of the theoretical distribution. Bullard et al. (2010) have shown that these methods lead to more accurate differential expression results when verified with qPCR.
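The "nth smallest to nth smallest" idea can be sketched as follows. This is a bare-bones illustration of the rescaling step only, assuming the reference distribution is supplied as a list of the same length as the sample; it is not the specific procedure evaluated by Bullard et al., and ties are broken by original order.

```python
def quantile_normalize(sample, reference):
    """Replace the nth smallest value of `sample` with the nth smallest of `reference`."""
    if len(sample) != len(reference):
        raise ValueError("sample and reference must have the same length")
    ref_sorted = sorted(reference)
    # Indices of `sample`, ordered from its smallest to its largest value.
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    normalized = [0.0] * len(sample)
    for rank, idx in enumerate(order):
        normalized[idx] = ref_sorted[rank]
    return normalized

counts = [5, 120, 33, 0]                # hypothetical counts for four genes
reference = [1.0, 10.0, 50.0, 100.0]    # hypothetical reference distribution
print(quantile_normalize(counts, reference))  # [10.0, 100.0, 50.0, 1.0]
```

Each gene keeps its rank within the sample; only the values attached to those ranks change.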

24 Normalization Techniques DESeq's Approach Size factors are estimated for each column (sample) of the data and are then used directly in the model-fitting step. First, a pseudoreference is created by taking the geometric mean of each row (gene) across samples. Then, for each sample, the size factor is the median of the ratios of its counts to the pseudoreference values.
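The median-of-ratios calculation can be sketched in plain Python. This is an illustration of the two steps described on the slide, not DESeq's own code; the function name `size_factors`, the toy count matrix, and the choice to skip genes containing a zero count (whose geometric mean is zero) are assumptions of this sketch.

```python
import math
from statistics import median

def size_factors(counts):
    """counts: list of rows (genes), each a list of per-sample counts."""
    n_samples = len(counts[0])
    # Genes with any zero count have a geometric mean of zero, so skip them.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Pseudoreference on the log scale: log geometric mean of each gene.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        # Size factor: median ratio of this sample's counts to the pseudoreference.
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical data: sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
]
print(size_factors(counts))  # the second factor is twice the first
```

Because the median is used, a handful of very highly expressed genes has little influence on the size factor, unlike a scaling factor based on total lane count.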

25 Normalization Techniques

26 experimental design data collection modeling Poisson vs Negative Binomial models assessing model assumptions statistical testing

27 Modeling RNA-Seq data Example: Poisson models Image of human brain from Anne Brogdon,

28 Modeling RNA-Seq data Models for Overdispersion DESeq and edgeR from Bioconductor both use a Negative Binomial model, which models the mean and variance separately.

29 Modeling RNA-Seq data Models for Overdispersion Both packages have ways of assessing model fit. Use them!
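To see why a Negative Binomial model helps with overdispersion, the sketch below compares its variance with the Poisson variance at the same mean. It assumes the common parameterization variance = mean + dispersion × mean², and the dispersion value 0.1 is an arbitrary illustration, not an estimate from any dataset.

```python
def poisson_variance(mean):
    """Under a Poisson model the variance equals the mean."""
    return mean

def negative_binomial_variance(mean, dispersion):
    """Variance under the parameterization var = mean + dispersion * mean**2."""
    return mean + dispersion * mean ** 2

for mu in (10, 100, 1000):
    print(mu, poisson_variance(mu), negative_binomial_variance(mu, dispersion=0.1))
```

With a dispersion of zero the two models coincide; with any positive dispersion the extra variance grows quadratically with the mean, which is where a Poisson model most badly understates variability.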

30 Modeling RNA-Seq data Consistency between edgeR and DESeq Using data from Marioni et al., 2008

31 Modeling RNA-Seq data Models for Overdispersion Why the difference? DESeq allows for a more local dispersion parameter for similar genes, whereas edgeR has a fixed dispersion parameter.* (Anders and Huber) The orange dashed line is the variance estimated by edgeR, the purple line is the Poisson variance, and the solid orange line is the variance estimated by DESeq. *New versions of edgeR actually allow local fits, and new versions of DESeq have a fixed dispersion parameter! I am simplifying because this is how it is presented in the DESeq paper.

32 experimental design data collection modeling statistical testing approaches to testing p-values FWER FDR q-values

33 Testing Hypotheses

37 Why does this matter?

38 Multiple Testing With n samples and p genes, we're doing p simultaneous tests: H1, H2, H3, ..., Hp

39 Multiple Testing 20,000 simultaneous t-tests on random normal data drawn from the same distribution. There are 1,009 green points (false positives), making up about 5% of the comparisons (at α = 0.05).
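A simulation along these lines can be sketched with the standard library alone. Note the swap: this sketch uses a two-sample z-test with known variance instead of a t-test, so it needs no external packages. The group size of 10, the seed, and the function names are assumptions, and the exact count will differ from the 1,009 on the slide.

```python
import random
from statistics import NormalDist

def two_sample_z_pvalue(x, y, sigma=1.0):
    """Two-sided p-value for equal means, assuming a known common sigma."""
    se = sigma * (1 / len(x) + 1 / len(y)) ** 0.5
    z = (sum(x) / len(x) - sum(y) / len(y)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def count_false_positives(n_tests=20_000, n_per_group=10, alpha=0.05, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Both groups come from the same N(0, 1), so every null is true.
        x = [rng.gauss(0, 1) for _ in range(n_per_group)]
        y = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sample_z_pvalue(x, y) < alpha:
            hits += 1
    return hits

hits = count_false_positives()
print(hits, hits / 20_000)  # close to 5% of tests, all of them false positives
```

Every "significant" result here is a false positive by construction, which is what testing thousands of genes at α = 0.05 without correction produces.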

40 Multiple Testing Familywise Error Rate
                          declared non-significant   declared significant   total
true null hypotheses      U                          V                      m0
false null hypotheses     T                          S                      m - m0
total                     m - R                      R                      m
FWER = P(V ≥ 1) = 1 - P(V = 0)
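To get a feel for how quickly the FWER grows, the sketch below evaluates 1 - P(V = 0) under two simplifying assumptions that are not stated on the slide: all m null hypotheses are true, and the tests are independent. In that case P(V = 0) = (1 - α)^m.

```python
def fwer_independent(alpha, m):
    """P(V >= 1) = 1 - P(V = 0) for m independent tests of true nulls at level alpha."""
    return 1 - (1 - alpha) ** m

for m in (1, 10, 100, 20_000):
    print(m, round(fwer_independent(0.05, m), 4))
```

Even at 100 tests, at least one false positive is almost guaranteed at α = 0.05.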

41 Multiple Testing Bonferroni Correction One way of controlling FWER: test each hypothesis at level α/m instead of α, where m is the number of tests. Problem: very conservative.
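A minimal sketch of the Bonferroni rule, using four made-up p-values; the function name `bonferroni_reject` is hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject a hypothesis when its p-value is at most alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

p_values = [0.0001, 0.004, 0.03, 0.2]  # hypothetical p-values
print(bonferroni_reject(p_values))     # per-test threshold is 0.05 / 4 = 0.0125
```

With 20,000 genes the per-test threshold becomes 0.05 / 20,000 = 2.5e-6, which shows why the correction is considered very conservative.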

42 Multiple Testing False Discovery FDR = E[V/R] (Benjamini and Hochberg, 1995), where V is the number of true null hypotheses declared significant and R is the total number of hypotheses declared significant.

43 Multiple Testing False Discovery Control this: FDR = E[V/R] (Benjamini and Hochberg, 1995). Not this: FWER = P(V ≥ 1).

44 Multiple Testing False Discovery Procedure (Benjamini and Hochberg, 1995) [Figure: example with δ = 0.05, n = 10] Imagine 100 genes were tested with the FDR controlled at δ = 0.1. If 40 were found significant, we'd expect 4 of them to be false discoveries.
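The slide does not spell out the procedure itself, so here is a sketch of the standard Benjamini-Hochberg step-up rule: sort the m p-values, find the largest rank k with p(k) ≤ (k/m)δ, and reject the k smallest. The ten p-values are made up, chosen only to match the δ = 0.05, n = 10 setting.

```python
def benjamini_hochberg(p_values, delta=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * delta.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * delta:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_values, delta=0.05))  # only the two smallest are rejected
```

Because this is a step-up rule, a p-value that misses its own threshold is still rejected if a larger p-value further down the list meets its threshold.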

45 Multiple Testing Storey's q-value (Storey 2002; Storey and Tibshirani, 2003) When a given test is called significant, its q-value is the expected proportion of false discoveries among all tests with p-values as or more extreme.

46 Multiple Testing Storey's q-value (Storey 2002; Storey and Tibshirani, 2003) For example, a q-value of 0.023 says that 2.3% of genes with p-values as or more extreme (less likely under the null) are expected to be false positives.
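A simplified sketch of how q-values can be computed from a list of p-values follows. It assumes a single fixed tuning value λ = 0.5 to estimate π0 (the proportion of true nulls), whereas the published method chooses or smooths over λ; the function name and example p-values are hypothetical.

```python
def q_values(p_values, lam=0.5):
    """Simplified q-values: pi0 estimated at a single fixed lambda."""
    m = len(p_values)
    # p-values above lam are assumed to come mostly from true nulls.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1 - lam)))
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values never decrease with p.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[idx] / rank)
        q[idx] = running_min
    return q

p_values = [0.001, 0.01, 0.02, 0.6, 0.9]  # hypothetical p-values
print([round(q, 4) for q in q_values(p_values)])
```

With π0 fixed at 1 this reduces to Benjamini-Hochberg adjusted p-values; estimating π0 below 1 makes the q-values somewhat less conservative.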

47 Multiple Testing Storey's q-value Practical Example: You have funds to test the top 100 differentially expressed gene candidates. How should you pick them? One way: order by absolute value of log fold change and take the top 100 genes. Then order those genes by q-value; the product of 100 and the last (largest) q-value is the expected number of false positives.
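The rule of thumb above amounts to a short calculation. The sketch below follows the slide's recipe literally on four made-up genes, selecting three instead of 100; the tuple layout (name, log fold change, q-value) and function names are assumptions.

```python
def pick_candidates(genes, n):
    """genes: list of (name, log_fc, q_value); return the n largest by |log FC|."""
    return sorted(genes, key=lambda g: abs(g[1]), reverse=True)[:n]

def expected_false_positives(selected):
    """Number of selected genes times the largest q-value among them."""
    largest_q = max(q for _name, _log_fc, q in selected)
    return len(selected) * largest_q

genes = [("g1", 3.2, 0.01), ("g2", -2.9, 0.04), ("g3", 0.4, 0.50), ("g4", 2.1, 0.02)]
top = pick_candidates(genes, n=3)
print([name for name, _log_fc, _q in top], expected_false_positives(top))
```

Since the genes were selected by fold change, not by p-value, the result is best read as a rough estimate.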

48 Reading Top Tables

49 Practical: Reading Top Tables
Recall: it's not just about significance, but effect size.
Sorting options:
- absolute value of log FC (decreasing)
- absolute value of adjusted log FC (decreasing)
- p-value (increasing)
Combinations: Absolute value of adjusted log FC (decreasing), subset by adjusted p-value less than some threshold.
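The combined approach (subset by adjusted p-value, then sort by effect size) can be sketched on a toy table. The field names `log_fc` and `padj` and the rows are hypothetical, not the column names of any particular package's output, and plain log FC stands in for the adjusted log FC.

```python
def top_table(rows, padj_threshold=0.05):
    """Keep rows with adjusted p-value below the threshold, sorted by |log FC| decreasing."""
    kept = [r for r in rows if r["padj"] < padj_threshold]
    return sorted(kept, key=lambda r: abs(r["log_fc"]), reverse=True)

rows = [
    {"gene": "g1", "log_fc": 0.3, "padj": 0.001},   # significant, small effect
    {"gene": "g2", "log_fc": -4.1, "padj": 0.010},  # significant, large effect
    {"gene": "g3", "log_fc": 5.0, "padj": 0.400},   # large effect, not significant
    {"gene": "g4", "log_fc": 2.2, "padj": 0.030},
]
for row in top_table(rows):
    print(row["gene"], row["log_fc"], row["padj"])
```

Sorting by p-value alone would put g1 first despite its small fold change, while sorting by fold change alone would put the non-significant g3 first.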


51 Beyond Differential Expression Differentially Expressed Gene Combinations (Dettling, et al, 2005)

52 Acknowledgements
The Bioinformatics Core: Dr. Dawei Lin, Ph.D. (Director)
Data Analysis: Dr. Joe Fass, Ph.D. (Lead), Dr. Monica Britton, Mr. Nikhil Joshi
Statistical Programming: Mr. Vince Buffalo (Lead)
Application Development (Web/DB): Mr. Jose Boveda (Lead)
System Admin & HPC: Dr. Zhi-Wei Lu (Lead)
Visiting members: Ms. Xinran Dong
Campus Scientific Advisory Board
Chair: Dr. Craig Benham, Ph.D. (Mathematics)
Members: Dr. Gino Cortopassi, Ph.D. (Molecular Sciences), Dr. Vladimir Filkov, Ph.D. (Computer Sciences), Dr. Fredric Gorin, Ph.D. (Neurosciences), Dr. Juan Medrano, Ph.D. (Animal Sciences), Dr. Jie Peng, Ph.D. (Statistics), Dr. David Rocke, Ph.D. (Biostatistics)
Genome Center Director: Dr. Richard Michelmore, Ph.D.
Associate Directors for Bioinformatics: Dr. Ian Korf, Ph.D., Dr. Patrice Koehl, Ph.D.