Normalization of RNA-Seq




Davide Risso
Modified: April 27, 2012. Compiled: April 27, 2012

1 Retrieving the data

Usually, an RNA-Seq data analysis from scratch starts with a set of FASTQ files (see e.g. http://en.wikipedia.org/wiki/fastq_format), which contain information on both the quality and the sequence of the short reads. There are several tools to align the reads to the reference genome (e.g. Bowtie, TopHat, GSNAP, Stampy, ...). A common output file format is the SAM/BAM format (described at http://samtools.sourceforge.net/).

You just saw how to align reads when you don't have a genome, and how to summarize them. When you do have a genome, a standard approach is to align the reads with Bowtie or TopHat, and then summarize them over regions of interest, such as genes, exons, non-coding RNAs, etc. To do this, you need your aligned reads and an annotation for your reference genome.

There are tools and packages to summarize the aligned reads into gene counts. One of them is HTSeq (http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html). The simple command

$ htseq-count example.sam Saccharomyces_cerevisiae.EF2.60.gtf

will produce a table of counts, i.e.,

YAL002W  1
YAL003W 19
YAL005C  8
YAL007C  2
YAL008W  2
YAL012W  9
YAL014C  1
YAL016W  3
YAL017W  2
YAL019W  1

By doing this for every sample in your study you end up with a table with m rows (genes) and n columns (samples). This is what you have in the file geneLevelCounts.txt.
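As a rough illustration of how the per-sample tables combine into the m x n matrix, the following base-R sketch reads two toy count tables and binds them column by column. The file names (sampleA.counts, sampleB.counts), the helper readCounts, and the counts themselves are made up for this example; they are not part of the course material.

```r
# Toy stand-ins for per-sample count tables (hypothetical sample names and values).
tmp <- tempdir()
writeLines(c("YAL002W\t1", "YAL003W\t19", "YAL005C\t8"),
           file.path(tmp, "sampleA.counts"))
writeLines(c("YAL002W\t0", "YAL003W\t25", "YAL005C\t3"),
           file.path(tmp, "sampleB.counts"))

files <- file.path(tmp, c("sampleA.counts", "sampleB.counts"))
names(files) <- c("sampleA", "sampleB")

# Read one two-column table (gene ID, count) into a named vector of counts.
readCounts <- function(f) {
  tab <- read.table(f, header = FALSE, sep = "\t", row.names = 1,
                    col.names = c("gene", "count"), stringsAsFactors = FALSE)
  setNames(tab$count, rownames(tab))
}
perSample <- lapply(files, readCounts)

# All samples must list the same genes in the same order before binding.
genes <- names(perSample[[1]])
stopifnot(all(sapply(perSample, function(x) identical(names(x), genes))))

# m x n matrix: genes in rows, samples in columns.
countMatrix <- do.call(cbind, perSample)
print(countMatrix)
```

With real htseq-count output, the summary rows that the tool appends after the gene rows would need to be dropped before binding the columns.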

$ head geneLevelCounts.txt
YAL067W-A 0 0 0 0 0 0 0 0  0  0  0  0  0  0
YAL067C   0 0 0 2 2 1 9 7 20 11 13 44 12 13
YAL066W   0 0 0 0 0 0 0 0  0  0  0  0  0  0
YAL065C   0 0 0 0 0 0 1 0  0  0  0  0  0  0
YAL064W-B 0 0 0 0 0 0 2 1  0  0  0  0  0  0
YAL064C-A 0 0 0 0 0 0 1 0  0  0  0  0  0  0

Today, we will consider an example based on the data analyzed in Risso et al. [7]. The Sherlock Lab at Stanford sequenced 10 strains of Saccharomyces cerevisiae grown in three media, namely YPD, Delft and Glycerol, each with 3-4 biological replicates. Illumina's standard Genome Analyzer pre-processing pipeline was used to yield 36 bp-long single-end reads. Reads were mapped to the reference genome (SGD release 64) using Bowtie [4], considering only unique mappings and allowing up to two mismatches. The read count for a given gene is defined as the number of reads with 5'-end falling within the corresponding region. The gene-level counts for this example are provided in the yeastRNASeqRisso2011 R package.

For Exploratory Data Analysis (EDA) and normalization purposes, it is useful to consider some features of the genes, such as GC-content and gene length. To obtain this information, we need the gene sequences, which can be retrieved from different sources (e.g., Ensembl, UCSC, FlyBase, ...). In the yeast community, a standard resource is the SGD website (http://www.yeastgenome.org). In general, a good resource is Ensembl (http://www.ensembl.org). In any case, you need to download the sequences of your regions of interest (e.g., protein coding genes, non-coding RNAs, ...), usually in FASTA format.

Example of FASTA format:

$ head Scer.fasta
>YAL001C
ATGGTACTGACGATTTATCCTGACGAACTCGTACAAA...
>YAL002W
ATGGAGCAAAATGGCCTTGACCACGA...

Once you have your FASTA file, it is easy to compute the length and GC-content of each gene using the ShortRead Bioconductor package. Bioconductor (http://bioconductor.org) is an open source project based on the R statistical programming language (http://r-project.org). Open a terminal and type R. This will open an R console.

> library(ShortRead)
> filename <- "Scer.fasta"
> fa <- readFasta(filename)
> # Per-sequence base counts; the columns are A, C, G, T and "other"
> abc <- alphabetFrequency(sread(fa), baseOnly=TRUE)
> # Use the first word of each FASTA header as the gene name
> rownames(abc) <- sapply(strsplit(as.character(id(fa)), " "), function(x) x[1])
> alphabet <- abc[,1:4]
> # GC-content: (C + G) / (A + C + G + T)
> gc <- rowSums(alphabet[,2:3])/rowSums(alphabet)
> length <- width(sread(fa))
> head(gc)
  YAL001C   YAL002W   YAL003W   YAL004W   YAL005C   YAL007C
0.3712317 0.3717647 0.4460548 0.4490741 0.4406428 0.3703704
> head(length)
[1] 3483 3825  621  648 1929  648

We can create a data.frame to store this information; we will call it geneInfo.

> geneInfo <- data.frame(length=length, gc=gc)
> head(geneInfo)
        length        gc
YAL001C   3483 0.3712317
YAL002W   3825 0.3717647
YAL003W    621 0.4460548
YAL004W    648 0.4490741
YAL005C   1929 0.4406428
YAL007C    648 0.3703704

2 Exploratory Data Analysis

We will use the EDASeq [6] R package for the EDA and the normalization. This package provides a class of objects named SeqExpressionSet, useful to store gene counts along with gene and lane information.

First of all, we need to read the counts into R. This is done with the read.table function:

> geneLevelCounts <- read.table("geneLevelCounts.txt", header=TRUE, row.names=1)
> laneInfo <- read.table("laneInfo.txt", header=TRUE, row.names=1)

We want to filter out the non-expressed genes. For simplicity, we consider only the genes expressed in all growth conditions, i.e., genes with an average read count of 10 or more.

> means <- rowMeans(geneLevelCounts)
> filter <- means >= 10
> table(filter)
filter
FALSE  TRUE
 1041  5534

> geneLevelCounts <- geneLevelCounts[filter,]

This leaves us with 5534 genes. Now we can store this information (gene and lane info along with gene counts) in one single object.

> library(EDASeq)
> data <- newSeqExpressionSet(exprs = as.matrix(geneLevelCounts),
+                             featureData = geneInfo[rownames(geneLevelCounts), ],
+                             phenoData = laneInfo)
> data
SeqExpressionSet (storageMode: lockedEnvironment)
assayData: 5534 features, 14 samples
  element names: exprs, offset
protocolData: none
phenoData
  sampleNames: Y1_1 Y1_2 ... G3 (14 total)
  varLabels: lib_prep conditions flow_cell lib_prep_proto
  varMetadata: labelDescription
featureData
  featureNames: YAL062W YAL061W ... YIR042C (5534 total)
  fvarLabels: length gc
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
> head(exprs(data))
          Y1_1 Y1_2 Y2_1 Y2_2 Y7_1 Y7_2 Y4_1 Y4_2   D1   D2
YAL062W     11    4    6    8   12    9   41   43   54   38
YAL061W     33   17   50   20   77   51  177  166  311  338
YAL060W    209  129  216  181  387  286 1328 1386 3316 1262
YAL059W     78   55   82   73  187  121  658  686  176   46
YAL058W     95   56  101   87  232  163  618  581  305  117
YAL056C-A   27   17    8    5   11    7    5    1   19    2
            D7    G1   G2   G3
YAL062W     44  1628   57  256
YAL061W    301 29951 1310 2208
YAL060W   1130 16548 5222 3482
YAL059W     51   226   75  127
YAL058W     97   681  226  216
YAL056C-A    6     1   34   12
> pData(data)
     lib_prep conditions flow_cell lib_prep_proto
Y1_1       Y1        YPD     428R1      Protocol1
Y1_2       Y1        YPD     4328B      Protocol1
Y2_1       Y2        YPD     428R1      Protocol1
Y2_2       Y2        YPD     4328B      Protocol1
Y7_1       Y7        YPD     428R1      Protocol1
Y7_2       Y7        YPD     4328B      Protocol1
Y4_1       Y4        YPD     61MKN      Protocol2
Y4_2       Y4        YPD     61MKN      Protocol2
D1         D1        Del     428R1      Protocol1
D2         D2        Del     428R1      Protocol1
D7         D7        Del     428R1      Protocol1
G1         G1        Gly     6247L      Protocol2
G2         G2        Gly     62OAY      Protocol1
G3         G3        Gly     62OAY      Protocol1
> head(fData(data))
          length        gc
YAL062W     1374 0.4868996
YAL061W     1254 0.4840510
YAL060W     1149 0.4499565
YAL059W      639 0.4037559
YAL058W     1509 0.4340623
YAL056C-A    351 0.4131054

We can look at some graphical summaries of the data. This will help us discover biases and artifacts in the data.

Between-lane distribution of gene-level counts. One of the main considerations when dealing with gene-level counts is the difference in count distributions between lanes. The boxplot method provides an easy way to produce boxplots of the logarithms of the gene counts in each lane.

> colors <- as.numeric(pData(data)[, 2]) + 1
> boxplot(data, col=colors)

[Figure: boxplots of the log gene-level counts, one per lane (Y1_1 ... G3), colored by growth condition.]

Over-dispersion. The function meanVarPlot can be used to check whether the count data are over-dispersed (for the Poisson distribution, one would expect the points to be evenly scattered around the black line).

> meanVarPlot(data[, 1:8], log=TRUE)

[Figure: variance versus mean of the gene-level counts (log scale) for the first eight lanes.]
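To see why the mean-variance relationship is informative, the base-R sketch below simulates counts under a Poisson model and under an over-dispersed (negative binomial) model and compares the per-gene variance-to-mean ratio. It is only an illustration on simulated data: the gene means, the negative binomial size parameter, and the helper varMeanRatio are assumptions made for this example, not quantities estimated from the yeast data.

```r
set.seed(1)
nGenes <- 2000
nReps <- 8
mu <- rexp(nGenes, rate = 1 / 100) + 1  # hypothetical true per-gene means

# Poisson counts: the variance equals the mean.
pois <- matrix(rpois(nGenes * nReps, lambda = mu), nrow = nGenes)

# Negative binomial counts: variance = mu + mu^2 / size, i.e. over-dispersed.
nb <- matrix(rnbinom(nGenes * nReps, mu = mu, size = 5), nrow = nGenes)

# Median over genes of (sample variance / sample mean).
varMeanRatio <- function(counts) {
  m <- rowMeans(counts)
  v <- apply(counts, 1, var)
  keep <- m > 0
  median(v[keep] / m[keep])
}

ratioPois <- varMeanRatio(pois)  # close to 1 under the Poisson model
ratioNB <- varMeanRatio(nb)      # well above 1 when counts are over-dispersed
print(c(poisson = ratioPois, negbin = ratioNB))
```

A ratio well above 1 across genes corresponds to points lying above the Poisson line in a mean-variance plot.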

Gene-specific effects on read counts. Several authors have reported selection biases related to sequence features such as gene length, GC-content, and mappability [2, 3, 5, 7]. Using biasPlot, one can see the dependence of gene-level counts on GC-content. The same plot could be created for gene length or mappability instead of GC-content.

> biasPlot(data[,1:8], "gc", log=TRUE, ylim=c(0, 8), col=1)

[Figure: gene counts (log) plotted against gc; legend: Y1, Y2, Y7, Y4.]

3 Normalization

Following Risso et al. [7], we consider two main types of effects on gene-level counts: (1) within-lane gene-specific (and possibly lane-specific) effects, e.g., related to gene length or GC-content, and (2) effects related to between-lane distributional differences, e.g., sequencing depth. Accordingly, withinLaneNormalization and betweenLaneNormalization adjust for the first and second type of effects, respectively. We recommend normalizing for within-lane effects prior to between-lane normalization.

EDASeq implements four within-lane normalization methods, namely: loess robust local regression of read counts (log) on a gene feature such as GC-content (loess), global-scaling between feature strata using the median (median), global-scaling between feature strata using the upper-quartile (upper), and full-quantile normalization between feature strata (full). For a discussion of these methods in the context of GC-content normalization, see Risso et al. [7].

Regarding between-lane normalization, the package implements three of the methods introduced in Bullard et al. [2]: global-scaling using the median (median), global-scaling using the upper-quartile (upper), and full-quantile normalization (full).
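To make the between-lane methods more concrete, here is a minimal base-R sketch of upper-quartile global scaling and full-quantile normalization on a toy matrix. It only illustrates the idea and is not EDASeq's implementation; the functions upperQuartileScale and fullQuantile and the toy counts are hypothetical.

```r
# Toy count matrix: 6 genes x 3 lanes; lane2 is sequenced roughly twice as deeply.
counts <- cbind(lane1 = c(10, 20, 30, 40, 50, 600),
                lane2 = c(22, 38, 61, 85, 95, 1300),
                lane3 = c(8, 25, 28, 45, 55, 500))
rownames(counts) <- paste0("gene", 1:6)

# Global scaling: rescale each lane so that all lanes share the same
# upper quartile (here, the mean of the lane-specific upper quartiles).
upperQuartileScale <- function(x) {
  uq <- apply(x, 2, quantile, probs = 0.75)
  sweep(x, 2, uq / mean(uq), "/")
}

# Full-quantile: give every lane the same distribution by replacing the
# k-th smallest value of each lane with the mean of the k-th smallest
# values across lanes.
fullQuantile <- function(x) {
  target <- rowMeans(apply(x, 2, function(col) sort(unname(col))))
  out <- apply(x, 2, function(col) target[rank(col, ties.method = "first")])
  dimnames(out) <- dimnames(x)
  out
}

uqNorm <- upperQuartileScale(counts)
fqNorm <- fullQuantile(counts)
print(round(uqNorm, 1))
print(round(fqNorm, 1))
```

Global scaling matches a single summary of each lane and leaves the shape of the distributions untouched, whereas full-quantile normalization forces the whole distributions to coincide.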

> dataWithin <- withinLaneNormalization(data, "gc", which="full")
> dataNorm <- betweenLaneNormalization(dataWithin, which="median")

After normalization, the GC-content bias is reduced and the gene-level counts are comparable across lanes.

> biasPlot(dataNorm[,1:8], "gc", log=TRUE, ylim=c(0, 8), col=1)

[Figure: gene counts (log) plotted against gc after normalization; legend: Y1, Y2, Y7, Y4.]

> boxplot(dataNorm, col=colors)

[Figure: boxplots of the log normalized gene-level counts, one per lane (Y1_1 ... G3).]

Moreover, the over-dispersion is reduced in the normalized counts, even though the Poisson assumption still does not hold.

> meanVarPlot(dataNorm[, 1:8], log=TRUE)

[Figure: variance versus mean of the normalized gene-level counts (log scale) for the first eight lanes.]

You can write your normalized counts to a file with the write.table function.

> write.table(dataNorm, file="normalizedCounts.txt", sep="\t", quote=FALSE)

4 Differential expression (DE) analysis

One of the main applications of RNA-Seq is differential expression analysis. The normalized counts (or the original counts and the offset) obtained using the EDASeq package can be supplied to packages such as edgeR [8] or DESeq [1] to find differentially expressed genes.

Some authors have argued that it is better to leave the count data unchanged to preserve their sampling properties and instead use an offset for normalization purposes in the context of DE analysis [1, 3, 8]. This can be achieved easily using the argument offset in both normalization functions.

> dataOffset <- withinLaneNormalization(data, "gc",
+                                       which="full", offset=TRUE)
> dataOffset <- betweenLaneNormalization(dataOffset,
+                                        which="full", offset=TRUE)

4.1 DESeq

If one wants to use the normalized data to perform a DE analysis with DESeq, there is a simple way to transform the data into the format needed by DESeq.

> library(DESeq)
> counts <- as(dataNorm, "CountDataSet")
> counts

CountDataSet (storageMode: environment)
assayData: 5534 features, 14 samples
  element names: counts
protocolData: none
phenoData
  sampleNames: Y1_1 Y1_2 ... G3 (14 total)
  varLabels: sizeFactor lib_prep ... lib_prep_proto (5 total)
  varMetadata: labelDescription
featureData
  featureNames: YAL062W YAL061W ... YIR042C (5534 total)
  fvarLabels: length gc
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

5 SessionInfo

> toLatex(sessionInfo())

R version 2.15.0 (2012-03-30), x86_64-apple-darwin10.8.0
Locale: en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
Base packages: base, datasets, graphics, grDevices, methods, stats, utils
Other packages: aroma.light 1.24.0, Biobase 2.16.0, BiocGenerics 0.2.0, Biostrings 2.24.1, DESeq 1.8.1, EDASeq 1.2.0, GenomicRanges 1.8.3, IRanges 1.14.2, lattice 0.20-6, latticeExtra 0.6-19, locfit 1.5-7, R.methodsS3 1.2.2, R.oo 1.9.3, RColorBrewer 1.0-5, Rsamtools 1.8.1, ShortRead 1.14.1
Loaded via a namespace (and not attached): annotate 1.34.0, AnnotationDbi 1.18.0, bitops 1.0-4.1, BSgenome 1.24.0, DBI 0.2-5, genefilter 1.38.0, geneplotter 1.34.0, grid 2.15.0, hwriter 1.3, KernSmooth 2.23-7, RCurl 1.91-1, RSQLite 0.11.1, rtracklayer 1.16.1, splines 2.15.0, stats4 2.15.0, survival 2.36-12, tools 2.15.0, XML 3.9-4, xtable 1.7-0, zlibbioc 1.2.0

References

[1] Anders, S. and Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11(10), R106.

[2] Bullard, J., Purdom, E., Hansen, K., and Dudoit, S. (2010). Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11(1), 94.

[3] Hansen, K., Irizarry, R., and Wu, Z. (2012). Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics.

[4] Langmead, B., Trapnell, C., Pop, M., and Salzberg, S. (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3), R25.

[5] Oshlack, A. and Wakefield, M. (2009). Transcript length bias in RNA-seq data confounds systems biology. Biology Direct, 4(1), 14.

[6] Risso, D. and Dudoit, S. (2011). EDASeq: Exploratory Data Analysis and Normalization for RNA-Seq. R package version 1.2.0. http://www.bioconductor.org/packages/release/bioc/html/EDASeq.html.

[7] Risso, D., Schwartz, K., Sherlock, G., and Dudoit, S. (2011). GC-content normalization for RNA-Seq data. BMC Bioinformatics, 12, 480.

[8] Robinson, M., McCarthy, D., and Smyth, G. (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139.