Normalization of RNA-Seq

Davide Risso
Modified: April 27, 2012. Compiled: April 27, 2012

1 Retrieving the data

Usually, an RNA-Seq data analysis from scratch starts with a set of FASTQ files (see e.g. http://en.wikipedia.org/wiki/FASTQ_format), which contain information on both the quality and the sequence of the short reads. There are several tools to align the reads to the reference genome (e.g. Bowtie, TopHat, GSNAP, Stampy, ...). A common output file format is the SAM/BAM format (described at http://samtools.sourceforge.net/).

You just saw how to align reads when you don't have a genome, and how to summarize them. When you do have a genome, a standard approach is to align the reads with Bowtie or TopHat, and then summarize them in regions of interest, such as genes, exons, non-coding RNAs, etc. To do this, you need your aligned reads and an annotation for your reference genome.

There are tools and packages to summarize the aligned reads in gene counts. One of them is HTSeq (http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html). The simple command:

$ htseq-count example.sam Saccharomyces_cerevisiae.EF2.60.gtf

will produce a table of counts, i.e.,

YAL002W 1
YAL003W 19
YAL005C 8
YAL007C 2
YAL008W 2
YAL012W 9
YAL014C 1
YAL016W 3
YAL017W 2
YAL019W 1

By doing this for every sample in your study you end up with a table with m rows (genes) and n columns (samples). This is what you have in the file genelevelcounts.txt.
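Combining the per-sample tables into a single m x n table can be sketched in base R as follows. This is only an illustration under stated assumptions: the two small tables and the names sample_text, read_counts and count_matrix are hypothetical, and a gene that is missing from a sample is assumed to have a count of 0.

```r
# Hypothetical per-sample tables in the two-column layout shown above
# (gene ID, tab, count). In practice each one would be read from its own file.
sample_text <- list(
  s1 = "YAL002W\t1\nYAL003W\t19\nYAL005C\t8",
  s2 = "YAL002W\t0\nYAL003W\t25\nYAL005C\t3"
)

# Read one two-column table and return a named vector of counts.
read_counts <- function(txt) {
  tab <- read.table(text = txt, sep = "\t",
                    col.names = c("gene", "count"),
                    stringsAsFactors = FALSE)
  setNames(tab$count, tab$gene)
}

per_sample <- lapply(sample_text, read_counts)

# Union of all gene IDs; genes absent from a sample are given a count of 0
# (an assumption of this sketch).
genes <- sort(unique(unlist(lapply(per_sample, names))))
count_matrix <- sapply(per_sample, function(x) {
  out <- setNames(integer(length(genes)), genes)
  out[names(x)] <- x
  out
})

# Rows are genes, columns are samples.
print(count_matrix)
```

The resulting matrix can be saved with write.table to obtain a file with the same layout as the one shown next.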

$ head genelevelcounts.txt
YAL067W-A 0 0 0 0 0 0 0 0 0 0 0 0 0 0
YAL067C 0 0 0 2 2 1 9 7 20 11 13 44 12 13
YAL066W 0 0 0 0 0 0 0 0 0 0 0 0 0 0
YAL065C 0 0 0 0 0 0 1 0 0 0 0 0 0 0
YAL064W-B 0 0 0 0 0 0 2 1 0 0 0 0 0 0
YAL064C-A 0 0 0 0 0 0 1 0 0 0 0 0 0 0

Today, we will consider an example based on the data analyzed in Risso et al. [7]. The Sherlock Lab at Stanford sequenced 10 strains of Saccharomyces cerevisiae grown in three media, namely YPD, Delft and Glycerol, each with 3-4 biological replicates. Illumina's standard Genome Analyzer pre-processing pipeline was used to yield 36 bp-long single-end reads. Reads were mapped to the reference genome (SGD release 64) using Bowtie [4], considering only unique mappings and allowing up to two mismatches. The read count for a given gene is defined as the number of reads with 5'-end falling within the corresponding region. The gene-level counts for this example are provided in the yeastRNASeqRisso2011 R package.

For Exploratory Data Analysis (EDA) and normalization purposes, it is useful to consider some features of the genes, such as GC-content and gene length. To obtain this information, we need the gene sequences, which can be retrieved from different sources (e.g., Ensembl, UCSC, FlyBase, ...). In the yeast community, a standard resource is the SGD website (http://www.yeastgenome.org). In general, a good resource is Ensembl (http://www.ensembl.org). In any case, you need to download the sequences of your regions of interest (e.g., protein coding genes, non-coding RNAs, ...), usually in FASTA format.

Example of FASTA format:

$ head Scer.fasta
>YAL001C
ATGGTACTGACGATTTATCCTGACGAACTCGTACAAA...
>YAL002W
ATGGAGCAAAATGGCCTTGACCACGA...

Once you have your FASTA file, it is easy to compute the length and GC-content of each gene using the ShortRead Bioconductor package. Bioconductor (http://bioconductor.org) is an open source project based on the R statistical programming language (http://r-project.org). Enter a terminal and type R.
This will open an R console.

> library(ShortRead)
> filename <- "Scer.fasta"
> fa <- readFasta(filename)
> abc <- alphabetFrequency(sread(fa), baseOnly=TRUE)
> rownames(abc) <- sapply(strsplit(as.character(id(fa))," "),function(x) x[1])

> # Columns 1-4 hold the A, C, G, T counts; GC-content is (C + G) / (A + C + G + T)
> alphabet <- abc[,1:4]
> gc <- rowSums(alphabet[,2:3])/rowSums(alphabet)
> length <- width(sread(fa))
> head(gc)
  YAL001C   YAL002W   YAL003W   YAL004W   YAL005C   YAL007C
0.3712317 0.3717647 0.4460548 0.4490741 0.4406428 0.3703704
> head(length)
[1] 3483 3825  621  648 1929  648

We can store this information in a data.frame, which we will call geneinfo.

> geneinfo <- data.frame(length=length, gc=gc)
> head(geneinfo)
        length        gc
YAL001C   3483 0.3712317
YAL002W   3825 0.3717647
YAL003W    621 0.4460548
YAL004W    648 0.4490741
YAL005C   1929 0.4406428
YAL007C    648 0.3703704

2 Exploratory Data Analysis

We will use the EDASeq [6] R package for the EDA and the normalization. This package provides a class of objects named SeqExpressionSet, useful to store gene counts along with gene and lane information.

First of all, we need to read the counts into R. This is done with the read.table function:

> genelevelcounts <- read.table("genelevelcounts.txt", header=TRUE, row.names=1)
> laneinfo <- read.table("laneinfo.txt", header=TRUE, row.names=1)

We want to filter out the non-expressed genes. For simplicity, we treat a gene as expressed if its average read count across all lanes is 10 or more.

> means <- rowMeans(genelevelcounts)
> filter <- means >= 10
> table(filter)
filter
FALSE  TRUE
 1041  5534

> genelevelcounts <- genelevelcounts[filter,]

This leaves us with 5534 genes. Now we can store this information (gene and lane info along with gene counts) in one single object.

> library(EDASeq)
> data <- newSeqExpressionSet(exprs = as.matrix(genelevelcounts),
+                             featureData = geneinfo[rownames(genelevelcounts), ],
+                             phenoData = laneinfo)
> data
SeqExpressionSet (storageMode: lockedEnvironment)
assayData: 5534 features, 14 samples
  element names: exprs, offset
protocolData: none
phenoData
  sampleNames: Y1_1 Y1_2 ... G3 (14 total)
  varLabels: lib_prep conditions flow_cell lib_prep_proto
  varMetadata: labelDescription
featureData
  featureNames: YAL062W YAL061W ... YIR042C (5534 total)
  fvarLabels: length gc
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
> head(exprs(data))
          Y1_1 Y1_2 Y2_1 Y2_2 Y7_1 Y7_2 Y4_1 Y4_2   D1   D2
YAL062W     11    4    6    8   12    9   41   43   54   38
YAL061W     33   17   50   20   77   51  177  166  311  338
YAL060W    209  129  216  181  387  286 1328 1386 3316 1262
YAL059W     78   55   82   73  187  121  658  686  176   46
YAL058W     95   56  101   87  232  163  618  581  305  117
YAL056C-A   27   17    8    5   11    7    5    1   19    2
            D7    G1   G2   G3
YAL062W     44  1628   57  256
YAL061W    301 29951 1310 2208
YAL060W   1130 16548 5222 3482
YAL059W     51   226   75  127
YAL058W     97   681  226  216
YAL056C-A    6     1   34   12
> pData(data)

     lib_prep conditions flow_cell lib_prep_proto
Y1_1       Y1        YPD     428R1      Protocol1
Y1_2       Y1        YPD     4328B      Protocol1
Y2_1       Y2        YPD     428R1      Protocol1
Y2_2       Y2        YPD     4328B      Protocol1
Y7_1       Y7        YPD     428R1      Protocol1
Y7_2       Y7        YPD     4328B      Protocol1
Y4_1       Y4        YPD     61MKN      Protocol2
Y4_2       Y4        YPD     61MKN      Protocol2
D1         D1        Del     428R1      Protocol1
D2         D2        Del     428R1      Protocol1
D7         D7        Del     428R1      Protocol1
G1         G1        Gly     6247L      Protocol2
G2         G2        Gly     62OAY      Protocol1
G3         G3        Gly     62OAY      Protocol1
> head(fData(data))
          length        gc
YAL062W     1374 0.4868996
YAL061W     1254 0.4840510
YAL060W     1149 0.4499565
YAL059W      639 0.4037559
YAL058W     1509 0.4340623
YAL056C-A    351 0.4131054

We can look at some graphical summaries of the data. This will help us discover biases and artifacts in the data.

Between-lane distribution of gene-level counts. One of the main considerations when dealing with gene-level counts is the difference in count distributions between lanes. The boxplot method provides an easy way to produce boxplots of the logarithms of the gene counts in each lane.

> colors <- as.numeric(pData(data)[, 2]) + 1
> boxplot(data, col=colors)

[Figure: boxplots of the log gene-level counts, one box per lane, colored by growth condition.]

Over-dispersion. The function meanVarPlot can be used to check whether the count data are over-dispersed (for the Poisson distribution, one would expect the points to be evenly scattered around the black line).

> meanVarPlot(data[, 1:8], log=TRUE)

[Figure: mean-variance plot on the log scale for the eight YPD lanes; x-axis: mean, y-axis: variance.]
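To see why points lying above the mean = variance line indicate over-dispersion, the following base R sketch compares simulated Poisson counts with simulated negative binomial counts. It is not part of EDASeq and does not use the yeast data: the gene means, the dispersion parameter (size = 5) and the helper var_to_mean are assumptions made only for illustration.

```r
set.seed(1)
n_genes <- 2000
n_lanes <- 8

# Hypothetical true mean count for each gene.
mu <- rexp(n_genes, rate = 1 / 100) + 1

# Poisson counts: the variance equals the mean.
# The matrix is filled column by column, so row g always uses mu[g].
pois <- matrix(rpois(n_genes * n_lanes, lambda = mu), nrow = n_genes)

# Negative binomial counts: variance = mu + mu^2 / size, larger than the mean.
nb <- matrix(rnbinom(n_genes * n_lanes, mu = mu, size = 5), nrow = n_genes)

# Median of the per-gene ratio (sample variance) / (sample mean).
var_to_mean <- function(counts) {
  m <- rowMeans(counts)
  v <- apply(counts, 1, var)
  keep <- m > 0
  median(v[keep] / m[keep])
}

ratio_pois <- var_to_mean(pois)
ratio_nb <- var_to_mean(nb)

# A ratio near 1 is consistent with Poisson; a ratio well above 1
# signals over-dispersion.
cat("Poisson:", ratio_pois, " negative binomial:", ratio_nb, "\n")
```

On real data the same ratio can be computed from the count matrix returned by exprs, restricted to lanes from a single condition so that biological differences between conditions do not inflate the variance.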

Gene-specific effects on read counts. Several authors have reported selection biases related to sequence features such as gene length, GC-content, and mappability [2, 3, 5, 7]. Using biasPlot, one can see the dependence of gene-level counts on GC-content. The same plot could be created for gene length or mappability instead of GC-content.

> biasPlot(data[,1:8], "gc", log=TRUE, ylim=c(0, 8), col=1)

[Figure: lowess fits of gene counts (log) against gc for the YPD lanes, labeled Y1, Y2, Y7, Y4.]

3 Normalization

Following Risso et al. [7], we consider two main types of effects on gene-level counts: (1) within-lane gene-specific (and possibly lane-specific) effects, e.g., related to gene length or GC-content, and (2) effects related to between-lane distributional differences, e.g., sequencing depth. Accordingly, withinLaneNormalization and betweenLaneNormalization adjust for the first and second type of effects, respectively. We recommend normalizing for within-lane effects prior to between-lane normalization.

EDASeq implements four within-lane normalization methods, namely: loess robust local regression of read counts (log) on a gene feature such as GC-content (loess), global-scaling between feature strata using the median (median), global-scaling between feature strata using the upper-quartile (upper), and full-quantile normalization between feature strata (full). For a discussion of these methods in the context of GC-content normalization see Risso et al. [7].

Regarding between-lane normalization, the package implements three of the methods introduced in Bullard et al. [2]: global-scaling using the median (median), global-scaling using the upper-quartile (upper), and full-quantile normalization (full).
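To make the between-lane methods listed above more concrete, here is a minimal base R sketch of upper-quartile global scaling and of full-quantile normalization. It illustrates the general ideas only and is not EDASeq's implementation: the simulated counts, the choice of the mean upper quartile as the common target, the tie handling, and the function names upper_scale and full_quantile are all assumptions.

```r
set.seed(2)

# Hypothetical counts: 1000 genes in three lanes with different depths.
depth <- c(lane1 = 1, lane2 = 2, lane3 = 0.5)
counts <- sapply(depth, function(d) rnbinom(1000, mu = 50 * d, size = 2))

# Global scaling: rescale each lane so that its upper quartile equals
# the mean of the upper quartiles across lanes.
upper_scale <- function(x) {
  uq <- apply(x, 2, quantile, probs = 0.75)
  sweep(x, 2, uq / mean(uq), "/")
}

# Full-quantile: force all lanes to share one distribution, namely the
# average of the sorted lanes. Each value is replaced by the average
# found at its rank (ties are broken by order of appearance).
full_quantile <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

scaled <- upper_scale(counts)
fq <- full_quantile(counts)

# After scaling, the upper quartiles agree across lanes.
print(apply(scaled, 2, quantile, probs = 0.75))
# After full-quantile normalization, all quantiles agree across lanes.
print(apply(fq, 2, quantile, probs = c(0.25, 0.5, 0.75)))
```

Global scaling matches a single summary of each lane's distribution, whereas full-quantile normalization matches the whole distribution, which is a stronger adjustment.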

> datawithin <- withinLaneNormalization(data, "gc", which="full")
> datanorm <- betweenLaneNormalization(datawithin, which="median")

After normalization the GC-content bias is reduced, and the gene-level counts are comparable across lanes.

> biasPlot(datanorm[,1:8], "gc", log=TRUE, ylim=c(0, 8), col=1)

[Figure: lowess fits of normalized gene counts (log) against gc for the YPD lanes, labeled Y1, Y2, Y7, Y4.]

> boxplot(datanorm, col=colors)

[Figure: boxplots of the log normalized gene-level counts, one box per lane.]

Moreover, the overdispersion is reduced in the normalized counts, even though the Poisson assumption still does not hold.

> meanVarPlot(datanorm[, 1:8], log=TRUE)

[Figure: mean-variance plot on the log scale for the normalized YPD lanes; x-axis: mean, y-axis: variance.]

You can write your normalized counts to a file with the write.table function:

> write.table(datanorm, file="normalizedcounts.txt", sep="\t", quote=FALSE)

4 Differential expression (DE) analysis

One of the main applications of RNA-Seq is differential expression analysis. The normalized counts (or the original counts and the offset) obtained using the EDASeq package can be supplied to packages such as edgeR [8] or DESeq [1] to find differentially expressed genes.

Some authors have argued that it is better to leave the count data unchanged to preserve their sampling properties and instead use an offset for normalization purposes in the context of DE analysis [1, 3, 8]. This can be achieved easily using the argument offset in both normalization functions.

> dataoffset <- withinLaneNormalization(data, "gc",
+                                       which="full", offset=TRUE)
> dataoffset <- betweenLaneNormalization(dataoffset,
+                                        which="full", offset=TRUE)

4.1 DESeq

If one wants to use the normalized data to perform a DE analysis with DESeq, there is a simple way to transform the data into the format needed by DESeq.

> library(DESeq)
> counts <- as(datanorm,"CountDataSet")
> counts

CountDataSet (storageMode: environment)
assayData: 5534 features, 14 samples
  element names: counts
protocolData: none
phenoData
  sampleNames: Y1_1 Y1_2 ... G3 (14 total)
  varLabels: sizeFactor lib_prep ... lib_prep_proto (5 total)
  varMetadata: labelDescription
featureData
  featureNames: YAL062W YAL061W ... YIR042C (5534 total)
  fvarLabels: length gc
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

5 SessionInfo

> toLatex(sessionInfo())

R version 2.15.0 (2012-03-30), x86_64-apple-darwin10.8.0

Locale: en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

Base packages: base, datasets, graphics, grDevices, methods, stats, utils

Other packages: aroma.light 1.24.0, Biobase 2.16.0, BiocGenerics 0.2.0, Biostrings 2.24.1, DESeq 1.8.1, EDASeq 1.2.0, GenomicRanges 1.8.3, IRanges 1.14.2, lattice 0.20-6, latticeExtra 0.6-19, locfit 1.5-7, R.methodsS3 1.2.2, R.oo 1.9.3, RColorBrewer 1.0-5, Rsamtools 1.8.1, ShortRead 1.14.1

Loaded via a namespace (and not attached): annotate 1.34.0, AnnotationDbi 1.18.0, bitops 1.0-4.1, BSgenome 1.24.0, DBI 0.2-5, genefilter 1.38.0, geneplotter 1.34.0, grid 2.15.0, hwriter 1.3, KernSmooth 2.23-7, RCurl 1.91-1, RSQLite 0.11.1, rtracklayer 1.16.1, splines 2.15.0, stats4 2.15.0, survival 2.36-12, tools 2.15.0, XML 3.9-4, xtable 1.7-0, zlibbioc 1.2.0

References

[1] Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10), R106.

[2] Bullard, J., Purdom, E., Hansen, K., and Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11(1), 94.

[3] Hansen, K., Irizarry, R., and Wu, Z. (2012). Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics.

[4] Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.

[5] Oshlack, A. and Wakefield, M. (2009). Transcript length bias in RNA-seq data confounds systems biology. Biology Direct, 4(1), 14.

[6] Risso, D. and Dudoit, S. (2011). EDASeq: Exploratory Data Analysis and Normalization for RNA-Seq. R package version 1.2.0. http://www.bioconductor.org/packages/release/bioc/html/EDASeq.html.

[7] Risso, D., Schwartz, K., Sherlock, G., and Dudoit, S. (2011). GC-content normalization for RNA-Seq data. BMC Bioinformatics, 12, 480.

[8] Robinson, M., McCarthy, D., and Smyth, G. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139.