FAQs of Differential Gene Expression using RNA-Seq A collection of questions about RNA-Seq

July 18, 2013
Jyothi Thimmapuram (jyothit@purdue.edu)
Bioinformatics Core (bioinformatics@purdue.edu)

Strategies for RNA-Seq Haas and Zody, Nature Biotechnology, 2010, 28:421

RNA-Seq issues
- Coverage across the transcriptome may not be random
- Some reads map to multiple locations
- Some reads do not map
- Some reads map outside exons: new genes or new gene models?

- What platform to use: HiSeq/MiSeq
- How many lanes/flowcell
- How many reads/lane
- What is PE and SE sequencing
- What is the format of the sequencing file

Illumina HiSeq 2500: 2 independent flow cells

[Figure: library layout for single reads and paired-end reads, showing Adaptor A, sequencing primers SP1 and SP2, the barcode and barcode primer, and Adaptor B. Single reads: s_1_sequence.txt. Paired-end reads: s_1_1_sequence.txt and s_1_2_sequence.txt.]

                          HiSeq 2500          MiSeq
No. of lanes              8                   1
Length of run             10 days             1 day
Single reads (per lane)   180-200 million     12-15 million
Paired-end reads          360-400 million     24-30 million
Read length               50, 100 bp          2 x 250 bp
Bases > Q30               >85% (2 x 50 bp)    >85% (2 x 100 bp)
                          >80% (2 x 100 bp)   >80% (2 x 150 bp)
                                              >70% (2 x 250 bp)

HiSeq 2500, Rapid Run chemistry: 2 lanes, 120 million reads per lane, 50, 100, 150 bp, two days

Illumina files: one fastq file per sample
Each record has four lines: sequence ID, sequence, "+", quality scores.

Seq 1
@HWI-ST330_0106:4:1:2643:2862#CTTGTA/1
CTTGACAAAGGGTGCAAGGCAGTTAGTGGTGCAAGATGCATTGCTGATGATGGGTTCATCAGGGCTGTAATCATA
+
ggggggggggggdgggeggfgggggfaffdgdgdgeggggggggggdggdggggceeedeegggg_eydc_dcac

Seq 2
@HWI-ST330_0106:4:1:2613:2891#CTTGTA/1
CGTGTCTTAAGGAGGCACCAAACAATATAAAGCTACAGATGGCGTCCTTGGTTTTTAATTTTAAGTTGGGGGACT
+
ggggggggggggegggdgggggegggffdgggggggggeggggggggggggegfggeegggfgcgggefgge^eg
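The four-line record layout can be read with a few lines of Python. This is a minimal sketch, not part of the original slides: the names `FastqRecord` and `parse_fastq` are made up for illustration, and the two toy records are shortened, hypothetical data rather than real reads.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str   # header line without the leading "@"
    sequence: str  # base calls
    quality: str   # encoded quality string, one character per base


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ block (blank lines are skipped)."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


# Hypothetical toy records (shortened; not real data).
example = [
    "@read_1#CTTGTA/1", "ACGTACGT", "+", "gggggggg",
    "@read_2#CTTGTA/1", "TTGCAAGC", "+", "ggggfgge",
]
records = list(parse_fastq(example))
print(len(records), records[0].read_id)
```

For real files, an established parser (for example the one in Biopython) is usually a safer choice than hand-written code.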

How many reads are needed (depth of sequencing)
- Number of reads/lane
- Number of samples/lane
- Read length
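These quantities are linked by simple arithmetic: reads per sample is roughly reads per lane divided by samples per lane. The sketch below uses figures quoted elsewhere in this deck (180 million single reads per HiSeq lane, a target of 30 million reads per sample) and assumes, as a simplification, that reads are split evenly across the samples in a lane. The function names are hypothetical.

```python
def reads_per_sample(reads_per_lane: int, samples_per_lane: int) -> float:
    """Expected reads per sample if a lane is shared evenly (an assumption)."""
    if samples_per_lane < 1:
        raise ValueError("samples_per_lane must be at least 1")
    return reads_per_lane / samples_per_lane


def max_samples_per_lane(reads_per_lane: int, min_reads_per_sample: int) -> int:
    """Largest number of samples that still meets the per-sample depth target."""
    if min_reads_per_sample < 1:
        raise ValueError("min_reads_per_sample must be at least 1")
    return reads_per_lane // min_reads_per_sample


lane_yield = 180_000_000    # lower bound for HiSeq 2500 single reads per lane
depth_target = 30_000_000   # lower bound of the 30-40 million/sample guideline

print(max_samples_per_lane(lane_yield, depth_target))
print(reads_per_sample(lane_yield, 6))
```

In practice the split between barcoded samples is never perfectly even, so leaving some headroom below the computed maximum is prudent.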

Number of reads/coverage
- Number of genes in the species
- Number of genes expressed under the treatment/tissue
- Rare transcripts

Number of reads/coverage [figure from Trapnell et al., Nature Biotechnology, 2010, 28:511]

Number of differentially expressed (DE) genes found by each method in pairwise sample comparisons

Comparison        edgeR   voom   Cufflinks
Cond1_vs_Cond2    220     0      5
Cond1_vs_Cond3    93      24     7
Cond1_vs_Cond4    43      0      0
Cond2_vs_Cond3    175     0      2
Cond2_vs_Cond4    162     0      1
Cond3_vs_Cond4    119     0      2

Each of these samples had at least 50 million PE reads.

Standards, Guidelines and Best Practices for RNA-Seq (The ENCODE Consortium)
"The ability to detect reliably low copy number transcripts/isoforms depends upon the depth of sequencing and on a sufficiently complex library. For experiments from a typical mammalian tissue or in which sensitivity of detection is important, a minimum depth of 100-200 M 2 x 76 bp or longer reads is currently recommended."
http://encodeproject.org/encode/protocols/datastandards/encode_rnaseq_standards_v1.0.pdf

Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. Liu et al., PLoS One 8:e66883

Experimental Design Number of biological replicates

Multiplexed
- We can sequence multiple samples on one lane by indexing (barcoding, tagging) each sample
- The index is usually a 6-7 bp sequence that is used to separate the sequences for each sample
[Figure: paired-end library showing Adaptor A, SP1, SP2, the barcode primer, the barcode, and Adaptor B]
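Separating reads by index can be sketched as follows. This assumes the older Illumina header layout seen in the FASTQ example in this deck, where the index sits between `#` and `/` (as in `#CTTGTA/1`); other header formats store it elsewhere. The function names, the second barcode, and the sample names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def barcode_from_header(header: str) -> Optional[str]:
    """Return the index between '#' and '/' in a read header, or None if absent."""
    if "#" not in header:
        return None
    tail = header.split("#", 1)[1]
    return tail.split("/", 1)[0]


def demultiplex(headers: List[str],
                sample_by_barcode: Dict[str, str]) -> Dict[str, List[str]]:
    """Group read headers by sample; unknown barcodes go to 'undetermined'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for header in headers:
        barcode = barcode_from_header(header)
        sample = "undetermined"
        if barcode is not None and barcode in sample_by_barcode:
            sample = sample_by_barcode[barcode]
        groups[sample].append(header)
    return dict(groups)


# Hypothetical barcode-to-sample sheet and read headers.
sheet = {"CTTGTA": "control_rep1", "ACAGTG": "treated_rep1"}
headers = [
    "@HWI-ST330_0106:4:1:2643:2862#CTTGTA/1",
    "@HWI-ST330_0106:4:1:2613:2891#CTTGTA/1",
    "@toy_read_3#ACAGTG/1",
    "@toy_read_4#NNNNNN/1",
]
by_sample = demultiplex(headers, sheet)
print({sample: len(reads) for sample, reads in by_sample.items()})
```

Exact matching is the simplest policy; production demultiplexers typically also tolerate a mismatch in the index.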

Balanced Blocked Design [figure from Auer P L and Doerge R W, Genetics 2010;185:405-416]

Quality Control
- How do you check the quality of reads
- How do you trim and filter low quality bases
- Do we need to trim and filter low quality bases

Sequence quality check
- FastQC: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
- FASTX-Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/
- Quality score > 30
- Min. length: 50% of the read
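The two thresholds above can be illustrated with a small sketch. It is not the FASTX-Toolkit algorithm: it simply trims bases from the 3' end until it reaches one scoring above the threshold, then discards the read if less than half of it remains. The Phred+64 offset is an assumption based on the quality characters in the FASTQ example in this deck; newer Illumina files use an offset of 33. All function names are hypothetical.

```python
from typing import List, Optional, Tuple


def phred_scores(quality: str, offset: int = 64) -> List[int]:
    """Decode a quality string into integer scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(sequence: str, quality: str,
                threshold: int = 30, offset: int = 64) -> Tuple[str, str]:
    """Drop trailing bases whose score is not above the threshold."""
    scores = phred_scores(quality, offset)
    keep = len(scores)
    while keep > 0 and scores[keep - 1] <= threshold:
        keep -= 1
    return sequence[:keep], quality[:keep]


def trim_and_filter(sequence: str, quality: str, threshold: int = 30,
                    min_fraction: float = 0.5,
                    offset: int = 64) -> Optional[Tuple[str, str]]:
    """Trim, then return None if less than min_fraction of the read is left."""
    trimmed_seq, trimmed_qual = trim_3prime(sequence, quality, threshold, offset)
    if len(trimmed_seq) < min_fraction * len(sequence):
        return None
    return trimmed_seq, trimmed_qual


# Toy reads: 'g' decodes to 39 and 'B' to 2 under the assumed offset of 64.
print(trim_and_filter("ACGTACGT", "ggggggBB"))
print(trim_and_filter("ACGTACGT", "ggBBBBBB"))
```

The first toy read keeps six of its eight bases; the second keeps only two and is dropped.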

[Figure: per-base quality plots for a bad run and a good run. x-axis: position in read; y-axis: quality scores]

[Figure: FastQC per-base quality plots before trimming and after trimming. x-axis: position in read; y-axis: quality scores]

Mapping to reference genome or transcriptome
- No reference available
- Draft genome available
- Not well annotated reference

Mapping of RNA-Seq reads

Align to genome
- Can detect novel exons or un-annotated genes
- Aligners should be able to map reads across splice sites
- Reads from non-genic regions influence expression values, SNP detection etc.

Align to transcriptome
- Information about splice junctions is not required
- PE distance and junction reads: isoforms


To map RNA-Seq reads
- Number of mismatches allowed
- Number of hits allowed
- Exon-exon/exon-intron junctions
- Expected distance in PE reads

- What alignment program to use
- Unique or multiple mapping
- What % of reads usually map to the reference
- How to generate read counts

TopHat alignment

TopHat-Cufflinks-Cuffcompare-CuffDiff
http://tophat.cbcb.umd.edu/
- Uses Bowtie
- Splits each read into segments, maps the segments independently, and glues them together to produce an end-to-end read alignment
- Currently does not support short indels
- Can align reads up to 1024 bp
- Do not mix PE and single reads
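The split-and-map idea can be illustrated with a toy example. This is not TopHat's implementation: it uses exact string matching instead of Bowtie, a made-up reference, and a read whose segments happen to end exactly at the exon boundary. The function names and sequences are hypothetical.

```python
from typing import List, Tuple


def split_read(read: str, seg_len: int) -> List[str]:
    """Cut a read into consecutive segments of at most seg_len bases."""
    return [read[i:i + seg_len] for i in range(0, len(read), seg_len)]


def infer_introns(reference: str, read: str, seg_len: int) -> List[Tuple[int, int]]:
    """Map segments by exact match and report gaps between neighbours.

    Returns half-open (start, end) reference intervals that separate two
    consecutive segments, which is where a spliced read skips an intron.
    """
    segments = split_read(read, seg_len)
    starts = [reference.find(seg) for seg in segments]  # -1 means no exact match
    introns = []
    for i in range(len(segments) - 1):
        if starts[i] == -1 or starts[i + 1] == -1:
            continue
        left_end = starts[i] + len(segments[i])
        if starts[i + 1] > left_end:
            introns.append((left_end, starts[i + 1]))
    return introns


# Made-up reference: exon, intron (starts with GT, ends with AG), exon.
exon1 = "ATGGCCATTGTA"
intron = "GTAAGTCCCCCCAG"
exon2 = "TGCGTGAACTGA"
reference = exon1 + intron + exon2

# A spliced read covering the end of exon 1 and the start of exon 2.
read = exon1[4:] + exon2[:8]
print(infer_introns(reference, read, seg_len=8))
```

A real aligner also has to handle segments that straddle a junction, mismatches, and repeated sequence, none of which this sketch attempts.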

TopHat contd.
- Junctions from a GFF or other list file
- Without reference: neighboring coverage islands joined with an intron
- PE reads: genomic coordinates and expected distance
- Two segments of the same read mapped apart
- Reports alignments across GT-AG introns

Cufflinks, Cuffcompare and CuffDiff

Cufflinks
- Assembles transcripts and estimates abundances
- Input: alignments in SAM format

Cuffcompare/Cuffmerge
- Compares assembled sequences to a ref. annotation
- Compares Cufflinks transcripts across experiments
- Input: GTF file from Cufflinks

CuffDiff
- Input: GTF file & SAM files
- Finds significant changes in transcript expression, splicing and promoter use
- Output files: genes.fpkm_tracking, gene_exp.diff

Some popular aligners
- BWA: slow for long reads and reads with higher error rate; suboptimal alignment pairs; allows gapped alignment
- TopHat: uses Bowtie; maps reads to genome, builds a database of possible splice junctions, and maps the reads against these junctions to confirm
- Novoalign: most accurate, slow
- Others: SpliceMap, MapSplice, SOAP, MAQ, CLC Bio

Overlap multireads can cause inaccurate expression estimates Van Verk et al., 2013. Trends Plant Sci. 18:175-179.

Counting reads with HTSeq
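The logic of htseq-count's "union" mode can be sketched in plain Python: a read is assigned to a gene only if the set of genes it overlaps contains exactly one gene; otherwise it is tallied as ambiguous or as having no feature. This sketch does not use the HTSeq library. It assumes one chromosome, unstranded data, and ungapped reads given as half-open intervals, and the names, labels, and coordinates are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def genes_overlapping(read: Interval,
                      exons_by_gene: Dict[str, List[Interval]]) -> Set[str]:
    """Return every gene with at least one exon overlapping the read."""
    start, end = read
    hits = set()
    for gene, exons in exons_by_gene.items():
        if any(start < ex_end and ex_start < end for ex_start, ex_end in exons):
            hits.add(gene)
    return hits


def count_union(reads: List[Interval],
                exons_by_gene: Dict[str, List[Interval]]) -> Counter:
    """Count reads per gene; reads hitting zero or several genes are set aside."""
    counts: Counter = Counter()
    for read in reads:
        hits = genes_overlapping(read, exons_by_gene)
        if not hits:
            counts["no_feature"] += 1
        elif len(hits) == 1:
            counts[next(iter(hits))] += 1
        else:
            counts["ambiguous"] += 1
    return counts


# Hypothetical annotation: geneA has two exons, geneB overlaps the end of geneA.
annotation = {"geneA": [(100, 200), (300, 400)], "geneB": [(380, 500)]}
reads = [(120, 170), (190, 310), (390, 440), (450, 500), (600, 650)]
print(count_union(reads, annotation))
```

Setting ambiguous reads aside rather than splitting them between genes keeps the counts conservative, at the cost of undercounting genes that overlap.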

- What are the different methods for DGE analysis
- What is RPKM/FPKM
- Why do we use more than one method
- How to validate and verify the RNA-Seq results
- How to select genes for qRT-PCR

Differential Gene Expression
- Length of genes
- Sequencing depth
- RPKM: Reads Per Kilobase of exon model per Million mapped reads (Mortazavi et al., Nature Methods, 2008, 5:621)
- FPKM: Fragments Per Kilobase of exon model per Million mapped reads
- Normalization: gene counts should be adjusted to minimize the bias
- Statistical model should account for length and depth
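Written out, the RPKM definition is the read count divided by the gene length in kilobases and by the number of mapped reads in millions, which is the same as count x 10^9 / (length in bp x total mapped reads). A minimal sketch with hypothetical numbers follows; FPKM has the same form with fragments (read pairs) in place of reads.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical example: two genes with the same raw count but different lengths,
# in a library with 10 million mapped reads.
total = 10_000_000
print(rpkm(500, 1000, total))  # shorter gene
print(rpkm(500, 2000, total))  # longer gene gets half the value
```

The example shows why raw counts alone are not comparable between genes: the same count on a gene twice as long gives half the RPKM.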

Differential expression methods
- Fisher's exact test or similar tests for RPKM/FPKM
- R packages for RNA-Seq analysis:
  - DESeq: small number of replicates or no replicates; negative binomial (NB) distribution
  - edgeR: NB distribution; similar to Fisher's exact test using NB (instead of hypergeometric probabilities)
  - baySeq: more complex; empirical Bayesian methods
  - DEGseq: based on MA-plots
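To make the hypergeometric idea behind Fisher's exact test concrete, here is a self-contained sketch of the two-sided test on a 2x2 table (reads from one gene versus all other reads, in two conditions). It is an illustration only: it ignores biological variability between replicates, which is what the NB-based packages above are designed to model. The function name and the read counts are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x with the margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * tolerance)


# Hypothetical counts: 12 of 1000 reads from the gene in condition 1,
# 30 of 1000 reads from the same gene in condition 2.
print(fisher_exact_two_sided(12, 988, 30, 970))
```

With real library sizes in the millions, a statistics library is a better choice than this direct enumeration.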

DESeq-edgeR-Cufflinks

Method(s)                    Count
DESeq                        10,939
edgeR                        11,770
Cufflinks                    6,263
DESeq + edgeR                10,219
DESeq + Cufflinks            6,070
edgeR + Cufflinks            6,077
DESeq + edgeR + Cufflinks    6,045

DESeq and edgeR: Novoalign mapping

DESeq-edgeR-baySeq

Method(s)                 Count
DESeq                     888
edgeR                     895
baySeq                    1,115
DESeq + edgeR             591
DESeq + baySeq            488
edgeR + baySeq            465
DESeq + edgeR + baySeq    338

http://davetang.org/muse/
Soneson and Delorenzi, 2013. BMC Bioinformatics. 14:91
Rapaport et al., 2013. http://arxiv.org/pdf/1301.5277.pdf
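Overlap counts of this kind can be computed from the gene lists each method reports. The sketch below uses Python sets; the function name and the gene identifiers are hypothetical placeholders, not the data behind the numbers above.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def overlap_sizes(calls: Dict[str, Set[str]]) -> Dict[Tuple[str, ...], int]:
    """Size of the shared gene set for every combination of methods."""
    methods = list(calls)  # keeps the order in which methods were given
    sizes = {}
    for k in range(1, len(methods) + 1):
        for combo in combinations(methods, k):
            shared = set.intersection(*(calls[m] for m in combo))
            sizes[combo] = len(shared)
    return sizes


# Hypothetical DE gene calls from three methods.
de_calls = {
    "DESeq": {"g1", "g2", "g3"},
    "edgeR": {"g2", "g3", "g4"},
    "baySeq": {"g3", "g5"},
}
for combo, size in overlap_sizes(de_calls).items():
    print(" + ".join(combo), size)
```

Genes called by all methods are the most conservative candidates for follow-up.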

Comparing RNA-Seq with qRT-PCR [figure from Yendrek et al., 2012. BMC Research Notes. 5:506]

FAQs of RNA-Seq
- Sequencing platform to use: Illumina HiSeq, 8 lanes/flowcell, fastq files
- Sequencing depth (number of reads): at least 30-40 million/sample
- Paired-end or single reads: PE
- Length of reads: at least 50 bp, usually 100 bp
- Number of biological replicates: at least 3 (more if you can afford)
- Experimental design for sequencing: Balanced Block Design
- How to analyze RNA-Seq data

FAQs of RNA-Seq cont.
How to analyze RNA-Seq data
- How to check quality and trim/filter low quality: FastQC and FASTX-Toolkit
- Reference genome or transcriptome: depends on the purpose of the experiment; build a reference transcriptome if not available (Trinity, Trans-ABySS, Velvet/Oases)
- What alignment program to use: TopHat, Bowtie2, BWA
- Unique or multiple mapping: unique
- A good % mapping: 70-90%

FAQs of RNA-Seq cont.
How to analyze RNA-Seq data
- How to get read counts: HTSeq with option union
- What statistical methods to use: limma and edgeR packages (voom, rpkm), DESeq
- Why do we use more than one method: different normalization methods and assumptions
- Validation and verification: how to select genes: FDR, FC, pathway
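Selecting genes by FDR and fold change (FC) can be sketched as below. The Benjamini-Hochberg procedure is used here as one common way to obtain FDR-adjusted p-values; the slides do not say which adjustment was used, and the cutoffs (FDR at most 0.05, absolute log2 FC at least 1), function names, and gene values are all hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


def select_genes(genes: Sequence[str], pvalues: Sequence[float],
                 log2_fold_changes: Sequence[float],
                 fdr_cutoff: float = 0.05,
                 min_abs_log2fc: float = 1.0) -> List[str]:
    """Keep genes passing both the FDR cutoff and the fold-change cutoff."""
    fdr = benjamini_hochberg(pvalues)
    return [gene for gene, q, lfc in zip(genes, fdr, log2_fold_changes)
            if q <= fdr_cutoff and abs(lfc) >= min_abs_log2fc]


# Hypothetical results for four genes.
genes = ["g1", "g2", "g3", "g4"]
pvals = [0.001, 0.004, 0.03, 0.20]
lfcs = [2.0, 0.3, -1.5, 3.0]
print(select_genes(genes, pvals, lfcs))
```

Pathway membership, the third criterion on the slide, needs an annotation source and is not covered by this sketch.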

Applications using RNA-Seq data
- Differential gene expression
- Structural annotation of a genome
- Alternative splicing
- Fusion transcripts
- de novo transcriptome assembly
- SNPs/Indels
- Phylogenomics

Resources for RNA-Seq analysis RNA-Seq Blog Transcriptome Analysis: Sequencing and Profiling

Additional references for RNA-Seq analysis:
- Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Trapnell et al., 2012. Nature Protocols. 7:562-578.
- Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. Liu et al., 2013. PLoS One. 8:e66883.
- http://string-db.org
- Counting reads in features with htseq-count. http://www-huber.embl.de/users/anders/htseq/doc/count.html

References for statistical analysis of DGE:
- Design and validation issues in RNA-seq experiments. Fang and Cui, 2011. Brief Bioinform. 12:280-287.
- A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Dillies et al., 2012. Brief Bioinform. doi: 10.1093/bib/bbs046
- A comparison of statistical methods for detecting differentially expressed genes from RNA-Seq data. Kvam et al., 2012. Am. J. Botany. 99:248-256.
- Comprehensive evaluation of differential expression analysis methods for RNA-Seq data. Rapaport et al., 2013. http://arxiv.org/pdf/1301.5277.pdf
- A comparison of methods for differential expression analysis of RNA-seq data. Soneson and Delorenzi, 2013. BMC Bioinformatics. 14:91.

BIOINFORMATICS CORE IS SUPPORTED BY:
Financial: OVPR, College of Agriculture, Ag. Research Programs, College of Technology, College of Veterinary Medicine, Cancer Center, Cyber Center - Discovery Park (College of Science)
Technical: Rosen Center for Advanced Computing (RCAC), ITaP, AgIT

Genomics Facility Information: Phillip San Miguel, Ph.D. Genomics Facility Director pmiguel@purdue.edu (765)-496-6328
