Gene Expression Analysis



Jie Peng, Department of Statistics, University of California, Davis
May 2012

RNA expression technologies
High-throughput technologies measure the expression levels of thousands of genes simultaneously: microarray, RNA-seq.
Platforms: Affymetrix GeneChip arrays; Genome Analyzer II, HiSeq 1000/2000.
Goal: study the effects of treatments, developmental stages, tissues, etc. on gene expression.
Experimental design issues:
  pooling, replication
  multiplexing: include multiple bar-coded samples in the same sequencing reaction
  lane, flow cell, run, batch
Library preparation.
Extract data: image analysis; read mapping.

Analyzing data
Data structure: microarray: intensity value for each probe on the array; RNA-seq: mapped read count for each gene.
Data exploration, filtering
Normalization
Fitting differential expression (DE) models
Calling significant genes

Data exploration
Plots: MA plots, histograms, etc.
Summaries: mean/median, variance/MAD, missing rate, library size, etc.
Filtering:
  Microarray: low intensity, low variation
  RNA-seq: low count
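
As a rough illustration of the MA plot listed above, the following base-R sketch computes M (log ratio) and A (average log expression) for two samples. The helper name `ma.values`, the pseudo-count of 0.5, and the toy values are assumptions made for this example; they are not taken from the slides.

```r
## Hypothetical helper: M and A values for two samples.
## y1, y2: non-negative expression values (counts or intensities) for the same genes.
## pseudo: small constant added to avoid log(0); the value 0.5 is an assumption.
ma.values <- function(y1, y2, pseudo = 0.5) {
  l1 <- log2(y1 + pseudo)
  l2 <- log2(y2 + pseudo)
  data.frame(M = l1 - l2,        # log2 fold change
             A = (l1 + l2) / 2)  # average log2 expression
}

## Toy data (made up for illustration)
y1 <- c(0, 109, 3, 50, 800)
y2 <- c(0, 71, 10, 55, 400)
ma <- ma.values(y1, y2)
print(ma)
## plot(ma$A, ma$M, pch = 19); abline(h = 0, lty = 2)  # uncomment to draw the MA plot
```

If two samples are comparable, the cloud of points should be roughly centered on the horizontal line M = 0.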

Normalization
Remove systematic biases due to library preparation, RNA composition, etc., so that samples are comparable.
The appropriate method depends on the technology and platform.
Basic assumption: the majority of genes are not differentially expressed across samples.
Global normalization: match certain global features of the samples. For example, make all samples have the same median and MAD, or make all samples have the same 75% quantile.
  Does not change the data much (often only up to a scaling factor); may not remove all systematic biases.
Quantile normalization: impose the same empirical distribution on every sample.
  May change the data a lot; may reduce signal while removing bias.
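
A minimal sketch of the "same median and MAD" form of global normalization, in base R. The helper name `global.norm`, the choice of the across-sample mean as the common target, and the simulated data are assumptions for illustration only.

```r
## Hypothetical helper: give every column (sample) the same median and MAD.
## x: p by n matrix of log intensities (rows = genes, columns = samples).
global.norm <- function(x) {
  med <- apply(x, 2, median)
  md  <- apply(x, 2, mad)
  target.med <- mean(med)  # common median (choice of target is an assumption)
  target.mad <- mean(md)   # common MAD (choice of target is an assumption)
  centered <- sweep(x, 2, med, "-")                     # median 0 in every sample
  scaled   <- sweep(centered, 2, md / target.mad, "/")  # MAD = target.mad in every sample
  scaled + target.med
}

## Simulated log intensities (made up): three samples with different location and scale.
set.seed(1)
x <- cbind(rnorm(200, 8, 1), rnorm(200, 9, 2), rnorm(200, 7.5, 1.5))
xn <- global.norm(x)
print(apply(xn, 2, median))
print(apply(xn, 2, mad))
```

Each sample is only shifted and rescaled, so the ordering of genes within a sample is unchanged.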

Quantile normalization: an R implementation

    quan.norm <- function(x, quan = 0.5) {
      ## x: p by n data matrix, where columns are the samples.
      norm <- x
      p <- nrow(x)
      n <- ncol(x)
      x.sort <- apply(x, 2, sort)  ## sort genes within a sample
      ## rank genes within a sample; ties.method = "first" keeps ranks integer
      ## so that they can be used as row indices
      x.rank <- apply(x, 2, rank, ties.method = "first")
      ## find the common distribution to be matched to:
      qant.sort <- matrix(apply(x.sort, 1, quantile, probs = quan),
                          p, n, byrow = FALSE)
      ## match each sample to the common distribution:
      for (i in 1:n) {
        norm[, i] <- qant.sort[x.rank[, i], i]
      }
      return(norm)
    }

Normalization of RNA-seq data
Global normalization by scaling.
Library size normalization:
  choose a reference sample, e.g., the sample with the median library size.
  for a target sample: multiply its counts by the ratio between the library size of the reference and that of the target.
TMM normalization: takes into account RNA composition differences.
  Ref: Mark D Robinson and Alicia Oshlack. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3):R25, 2010.
Quantile-matched normalization: match a certain quantile across samples, e.g., make the 75% quantile of counts the same for all samples.
  Ref: Bullard JH, Purdom E, Hansen KD, Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics 11, 94.
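
The two scaling schemes above can be sketched in base R as follows. The toy count matrix is made up, and using the mean of the per-sample 75% quantiles as the common target is an assumption (any common value would do).

```r
## Toy count matrix (made up): rows = genes, columns = samples.
counts <- matrix(c(10, 20, 30, 40, 100,
                   20, 40, 60, 80, 200,
                   5, 10, 15, 20, 50),
                 nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("s", 1:3)))

## Library size normalization: scale every sample to a reference library size.
lib.size <- colSums(counts)
ref.size <- median(lib.size)   # reference: the median library size
norm.lib <- sweep(counts, 2, ref.size / lib.size, "*")

## Quantile-matched normalization: make the 75% quantile equal across samples.
q75 <- apply(counts, 2, quantile, probs = 0.75)
norm.q75 <- sweep(counts, 2, mean(q75) / q75, "*")  # target = mean of the quantiles (assumption)

print(colSums(norm.lib))
print(apply(norm.q75, 2, quantile, probs = 0.75))
```

Both schemes multiply each sample by a single factor; they differ only in which summary of the sample is equalized.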

RNA composition
Observed quantities:
  counts: $Y_{gk}$, the number of reads mapped to gene $g$ in sample $k$.
  library size: $N_k := \sum_g Y_{gk}$, the total number of mapped reads in sample $k$.
  gene length: $L_g$, the length of gene $g$.
Unobserved quantities:
  abundance: $A_{gk}$, the number of RNA transcripts of gene $g$ in sample $k$.
  total abundance: $A_k := \sum_g A_{gk}$, the total amount of RNA transcripts in sample $k$.
  $S_k := \sum_g A_{gk} L_g$.
  relative abundance: $\lambda_{gk} := A_{gk} / A_k$.
For each gene $g$, we would like to compare the relative abundance across samples, e.g., by testing $H_{0g}: \lambda_{g1} = \lambda_{g2}$.

The expected value of $Y_{gk}$ can be modeled as
$$E(Y_{gk}) = \frac{A_{gk} L_g}{\sum_s A_{sk} L_s} N_k = (\lambda_{gk} L_g)\left(\frac{A_k}{S_k} N_k\right) =: \mu_{gk}.$$
Effective library size: $\tilde{N}_k := \frac{A_k}{S_k} N_k$.
If $\tilde{N}_1 = \tilde{N}_2$, then comparing $\lambda_{g1}$ and $\lambda_{g2}$ is equivalent to comparing $\mu_{g1}$ and $\mu_{g2}$, which can be done with a test based on the observed counts $Y_{gk}$.
The goal is therefore to equalize the effective library size across samples.

Note that $E(Y_{gk}/N_k) = (\lambda_{gk} L_g)(A_k/S_k)$.
Assuming that most genes are not DE, i.e., $\lambda_{g1} = \lambda_{g2}$ for most genes, the trimmed mean of the log ratios
$$\left\{ M_g := \log \frac{Y_{g1}/N_1}{Y_{g2}/N_2} \right\}_g$$
can be used to estimate
$$\log \frac{A_1/S_1}{A_2/S_2}.$$
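
The idea can be tried out with a simplified, unweighted trimmed mean in base R. This is only a sketch and not edgeR's implementation: the published TMM method also trims on absolute expression and weights genes by precision, both of which are omitted here. The helper name `trimmed.log.ratio`, the trim fraction of 0.3, and the toy abundances and gene lengths are assumptions.

```r
## Simplified, unweighted trimmed mean of log ratios (not edgeR's implementation).
## y1, y2: counts for the same genes in samples 1 and 2.
## trim: fraction trimmed from each tail of the M values (0.3 is an assumption).
trimmed.log.ratio <- function(y1, y2, trim = 0.3) {
  keep <- y1 > 0 & y2 > 0          # log ratios need positive counts
  N1 <- sum(y1); N2 <- sum(y2)     # library sizes
  M <- log2((y1[keep] / N1) / (y2[keep] / N2))
  mean(M, trim = trim)             # estimates log2((A1/S1) / (A2/S2))
}

## Toy example (made up): 98 genes with equal relative abundance in both samples;
## genes 99 and 100 swap abundance, and gene 100 is much longer than the others.
L       <- c(rep(1, 99), 50)
lambda1 <- c(rep(0.01, 98), 0.019, 0.001)
lambda2 <- c(rep(0.01, 98), 0.001, 0.019)
N <- 1e6
## Expected counts under E(Y_gk) = lambda_gk * L_g * (A_k / S_k) * N_k, rounded, no noise.
y1 <- round(N * lambda1 * L / sum(lambda1 * L))
y2 <- round(N * lambda2 * L / sum(lambda2 * L))

est   <- trimmed.log.ratio(y1, y2)
truth <- log2(sum(lambda2 * L) / sum(lambda1 * L))  # log2((A1/S1) / (A2/S2))
print(c(estimate = est, truth = truth))
```

In this noise-free toy example the two differentially expressed genes fall in the trimmed tails, so the estimate is driven entirely by the non-DE genes.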

Model expression data
Microarray data: assume a multiplicative noise model and model the log intensities as normal random variables.
RNA-seq data:
  Within a sample, it is reasonable to model the counts as Poisson random variables with means proportional to the relative RNA abundance.
  When comparing two samples:
    the R function glm() with family="poisson" can be used to fit the data.
    findings are restricted to these two samples and cannot be generalized to general populations.
  To account for biological variation across samples, various overdispersion models are considered.
    overdispersion: variance > mean. Note that for Poisson random variables, variance = mean.
    commonly used overdispersion models: negative binomial, quasi-Poisson, quasi-binomial.
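
To make the effect of overdispersion concrete, the sketch below fits a Poisson and a quasi-Poisson GLM to simulated counts for a single gene. All numbers are made up, and library sizes are assumed equal, so no offset term is included.

```r
set.seed(123)
## Simulated counts for one gene (made up): two groups of 10 samples,
## negative binomial with mean 100 vs 200 and size = 2 (strong overdispersion).
group <- factor(rep(c("grp1", "grp2"), each = 10))
y <- rnbinom(20, mu = rep(c(100, 200), each = 10), size = 2)

fit.pois  <- glm(y ~ group, family = poisson)
fit.quasi <- glm(y ~ group, family = quasipoisson)

## Same coefficient estimates; only the standard errors differ.
print(coef(fit.pois))
## Estimated dispersion: values well above 1 indicate overdispersion.
disp <- summary(fit.quasi)$dispersion
print(disp)
## P-values for the group effect under each model.
print(summary(fit.pois)$coefficients["groupgrp2", "Pr(>|z|)"])
print(summary(fit.quasi)$coefficients["groupgrp2", "Pr(>|t|)"])
```

The Poisson fit ignores the extra variability between replicates, so its standard errors are too small and its p-value too optimistic; the quasi-Poisson fit inflates the standard errors by the square root of the estimated dispersion.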

Cautions
The Poisson model is based on the assumption that reads are randomly and independently distributed. This may not hold, for reasons such as random hexamer priming and GC content bias.
  Ref: Kasper D. Hansen, Steven E. Brenner, Sandrine Dudoit. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Research, Vol. 38, No. 12 (01 July 2010), e131.
  Ref: Davide Risso, Katja Schwartz, Gavin Sherlock and Sandrine Dudoit. GC-Content Normalization for RNA-Seq Data. BMC Bioinformatics 2011, 12:480.
Corrections and normalizations may be necessary depending on the goal of the study.
Underdispersion is sometimes observed.
  The quasi-Poisson model can deal with both overdispersion and underdispersion.
  The negative binomial model can only model overdispersion.

Differentially expressed genes
Microarray: (moderated) t-tests based on log intensities.
RNA-seq: likelihood ratio tests or exact tests based on counts.
Permutation tests, rank tests, empirical Bayes methods, etc.
Multiple comparison adjustment, based on p-values:
  Control the familywise error rate (FWER): Bonferroni, Holm, etc.
  Control the false discovery rate (FDR): Benjamini & Hochberg (BH), Benjamini & Yekutieli (2001) (BY), etc.
  R function p.adjust.
  Other variants of FDR: R package locfdr, R package qvalue.
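
A short illustration of `p.adjust` with the four methods named above. The p-values are made up for the example.

```r
## Toy p-values (made up for illustration).
p <- c(0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298, 0.0344,
       0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.0000)

## One column of adjusted p-values per method.
adj <- sapply(c("bonferroni", "holm", "BH", "BY"),
              function(m) p.adjust(p, method = m))
print(round(adj, 4))

## Number of rejections at level 0.05 under each method.
print(colSums(adj <= 0.05))
```

For any set of p-values, the Bonferroni-adjusted values are at least as large as the Holm-adjusted ones, which in turn are at least as large as the BH-adjusted ones; BY is always at least as conservative as BH.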

R packages
Microarray: affy, limma, etc.
RNA-seq: DESeq, edgeR, glm, etc.
Bioconductor package edgeR
Based on negative binomial models: $Y \sim \mathrm{NB}(\mu, \phi)$, $E(Y) = \mu$, $\mathrm{Var}(Y) = \mu(1 + \mu\phi)$ ($\mu > 0$, $\phi > 0$).
To account for the small sample sizes typical of RNA-seq studies, edgeR also uses empirical Bayes ideas to pool information across genes.
  Ref: Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139-140, 2010.
  Ref: M. D Robinson and G. K Smyth. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics, 23(21):2881-2887, 2007.
  Ref: M. D Robinson and G. K Smyth. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9(2):321-332, 2008.
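
The mean-variance relation of the negative binomial model can be checked numerically in base R. In R's `dnbinom`, the `(mu, size)` parameterization corresponds to `size = 1/phi`; the values of `mu` and `phi` below are illustrative assumptions.

```r
mu  <- 50    # mean (illustrative value)
phi <- 0.2   # dispersion (illustrative value)

## Probability mass function on a grid wide enough that the tail is negligible.
y <- 0:10000
prob <- dnbinom(y, mu = mu, size = 1 / phi)

m <- sum(y * prob)            # numerical mean
v <- sum((y - m)^2 * prob)    # numerical variance
print(c(mean = m, variance = v, formula = mu * (1 + mu * phi)))

## Biological coefficient of variation: square root of the dispersion.
print(sqrt(phi))
```

As phi approaches 0 the variance approaches mu, which recovers the Poisson model.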

A case study
An RNA-seq data set with two groups: grp1 has eight replicates, grp2 has seven replicates.
Data exploration.
Data matrix: row = gene, column = sample.

    dim(counts)

    geneid   grp1 sample1   grp1 sample2   grp1 sample3   ...
    gene1    0              0              0              ...
    gene2    109            71             128            ...
    gene3    3              2              10             ...
    ...      ...            ...            ...            ...

Library size:

    barplot(colSums(counts))

Filtering:

    allzero = (rowSums(counts) == 0); counts = counts[!allzero, ]; dim(counts)

Clustering of samples: are samples from the same group clustered together?

    > library(edgeR)
    > group=factor(c(rep(1,8), rep(2,7)))
    > d=DGEList(counts,group)
    > d$samples$lib.size
    > plotMDS(d)

[Figure: MDS plot of the 15 samples (grp1 samples 1 to 8, grp2 samples 1 to 7); x-axis: Dimension 1, y-axis: Dimension 2.]

Normalization and MA plots.

    > d=calcNormFactors(d,method="TMM")
    > samp1="grp1-sample 7"; samp2="grp2-sample 5"
    > maPlot(d$counts[,samp1],d$counts[,samp2],normalize=TRUE,
    +   lowess=TRUE, ylim=c(-8,8),pch=19, cex=0.4)
    > abline(h=0, lty=2)
    > eff.libsize=d$samples$lib.size*d$samples$norm.factors
    > names(eff.libsize)=colnames(d$counts)
    > maPlot(d$counts[,samp1]/eff.libsize[samp1],
    +   d$counts[,samp2]/eff.libsize[samp2],normalize=FALSE,
    +   lowess=TRUE, ylim=c(-8,8),pch=19, cex=0.4)
    > abline(h=0, lty=2)

Two-group comparison and gene calling.
Estimate the dispersion parameters and plot the genewise biological coefficient of variation (square root of the dispersion) against gene abundance (in log2 counts per million).

    > d=estimateCommonDisp(d, verbose=TRUE)
    > d$common.dispersion
    > d=estimateTagwiseDisp(d,prior.n=getPriorN(d))
    > plotBCV(d)

Exact test and gene calling.

    > et=exactTest(d,pair=1:2,dispersion="tagwise",
    +   rejection.region="doubletail",big.count=900)
    > topTags(et,n=100, adjust.method="BY")
    > de=decideTestsDGE(et, adjust.method="BY",
    +   p.value=0.05)
    > summary(de)

FDR method BY takes dependency among the tests into account and is more conservative than method BH.
Draw a smear plot of log concentration vs. log fold-change to find DE genes that are both statistically and practically significant.

    > plotSmear(et,
    +   de.tags=rownames(et$table)[as.logical(de)])

Look at the p-value distribution
Histogram:

    > hist(et$table$PValue, breaks=50,xlab="pvalue")

Observe an unusually high bar at p-values close to one.
Examine log p-value vs. log concentration (log-cpm): this bar comes primarily from genes with a small number of counts.
Use a threshold (e.g., 10) on the total number of counts across samples to filter out low-count genes.
A similar phenomenon occurs when analyzing exon sequence data in GWAS studies.
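
The low-count filter can be sketched in base R as follows. The simulated count matrix and the helper name `filter.low` are assumptions for illustration; the percentages printed here are unrelated to those of the case study.

```r
set.seed(42)
## Simulated count matrix (made up): 1000 genes, 15 samples, many low-count genes.
mu <- rep(c(0.2, 5, 100), times = c(300, 300, 400))   # per-gene mean counts
counts <- matrix(rnbinom(1000 * 15, mu = mu, size = 2), nrow = 1000)

## Hypothetical helper: keep genes whose total count across samples
## reaches the threshold.
filter.low <- function(counts, min.total = 10) {
  counts[rowSums(counts) >= min.total, , drop = FALSE]
}

for (thr in c(10, 20, 40)) {
  kept <- filter.low(counts, thr)
  cat(sprintf("threshold %d: %.0f%% of genes pass\n",
              thr, 100 * nrow(kept) / nrow(counts)))
}
```

The filter uses only the total count per gene, not the group labels, so it does not depend on which genes turn out to be differentially expressed.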

[Figure: histogram of p-values for all genes; x-axis: p-value (0 to 1), y-axis: frequency.]

[Figure: four histograms of p-values; x-axis: p-value (0 to 1), y-axis: frequency. Panels: all genes; genes with at least 10 total counts (84% of genes pass); genes with at least 20 total counts (79% of genes pass); genes with at least 40 total counts (73% of genes pass).]

Summary
Explore data by graphs and numerical summaries.
Examine normalization by MA plots.
Filter out genes with small counts.
Look at both p-values and fold changes for significant genes.