FAQs of Differential Gene Expression using RNA-Seq A collection of questions about RNA-Seq

FAQs of Differential Gene Expression using RNA-Seq A collection of questions about RNA-Seq July 18, 2013 Jyothi Thimmapuram jyothit@purdue.edu Bioinformatics Core bioinformatics@purdue.edu

Strategies for RNA-Seq Haas and Zody, Nature Biotechnology, 2010, 28:421

RNA-Seq issues: Coverage across the transcriptome may not be random. Some reads map to multiple locations. Some reads do not map. Some reads map outside exons: new genes or new gene models?

What platform to use: HiSeq or MiSeq? How many lanes per flow cell? How many reads per lane? What is PE and SE sequencing? What is the format of the sequencing file?

Illumina HiSeq 2500: 2 independent flow cells

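[Figure: library structure for single reads and paired-end reads, showing Adaptor A, Adaptor B, the barcode, the barcode primer, and sequencing primers SP1 and SP2. Single reads are written to s_1_sequence.txt; paired-end reads are written to s_1_1_sequence.txt and s_1_2_sequence.txt.]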

                          HiSeq 2500         MiSeq
No. of lanes              8                  1
Length of run             10 days            1 day
Single reads (per lane)   180-200 million    12-15 million
Paired-end reads          360-400 million    24-30 million
Read length               50, 100 bp         2 x 250 bp
Bases > Q30: >85% (2x50bp), >85% (2x100bp), >80% (2x100bp), >80% (2x150bp), >70% (2x250bp)
HiSeq 2500, Rapid Run chemistry: 2 lanes, 120 million reads per lane, 50, 100, 150 bp, two days

Illumina files: one fastq file per sample. Each record holds a sequence ID, the sequence, a "+" line, and the quality scores.

Seq 1:
@HWI-ST330_0106:4:1:2643:2862#CTTGTA/1
CTTGACAAAGGGTGCAAGGCAGTTAGTGGTGCAAGATGCATTGCTGATGATGGGTTCATCAGGGCTGTAATCATA
+
ggggggggggggdgggeggfgggggfaffdgdgdgeggggggggggdggdggggceeedeegggg_eydc_dcac

Seq 2:
@HWI-ST330_0106:4:1:2613:2891#CTTGTA/1
CGTGTCTTAAGGAGGCACCAAACAATATAAAGCTACAGATGGCGTCCTTGGTTTTTAATTTTAAGTTGGGGGACT
+
ggggggggggggegggdgggggegggffdgggggggggeggggggggggggegfggeegggfgcgggefgge^eg
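The four-line record layout described on this slide can be read with a few lines of Python. The following is a minimal sketch, not a replacement for an established parser: the names `FastqRecord` and `read_fastq` are made up for illustration, the two records in `EXAMPLE` are toy data, and the code assumes unwrapped records (exactly one sequence line and one quality line per read).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One FASTQ entry: ID line (without '@'), bases, quality string."""
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from an unwrapped four-line-per-read FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        header = header.rstrip("\n")
        if not header:          # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in: " + header)
        yield FastqRecord(header[1:], sequence, quality)


# Toy input (hypothetical reads, not taken from a real run).
EXAMPLE = "@read_1#CTTGTA/1\nACGTACGT\n+\nhhhhhhhh\n@read_2#CTTGTA/1\nTTGCA\n+\nhhgfe\n"
records = list(read_fastq(io.StringIO(EXAMPLE)))
for rec in records:
    print(rec.read_id, len(rec.sequence))
```

For real files, opening the file with `open(path)` and passing the handle to `read_fastq` works the same way.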

How many reads are needed (depth of sequencing)? Number of reads per lane. Number of samples per lane. Read length.
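The relationship between reads per lane, samples per lane, and depth per sample is simple arithmetic. The sketch below assumes reads are split evenly across the samples in a lane, which real runs only approximate; the function names are illustrative. The inputs are figures quoted elsewhere in this deck (180 million single reads per HiSeq 2500 lane, at least 30 million reads per sample).

```python
def reads_per_sample(reads_per_lane: int, samples_per_lane: int) -> float:
    """Expected reads per sample, assuming an even split across the lane."""
    if samples_per_lane < 1:
        raise ValueError("samples_per_lane must be at least 1")
    return reads_per_lane / samples_per_lane


def max_samples_per_lane(reads_per_lane: int, min_reads_per_sample: int) -> int:
    """Largest number of samples that still meets the per-sample depth target."""
    if min_reads_per_sample < 1:
        raise ValueError("min_reads_per_sample must be at least 1")
    return reads_per_lane // min_reads_per_sample


lane_reads = 180_000_000      # lower bound for single reads per HiSeq 2500 lane
target_depth = 30_000_000     # minimum reads per sample
n_samples = max_samples_per_lane(lane_reads, target_depth)
print(n_samples, reads_per_sample(lane_reads, n_samples))
```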

Number of reads/coverage depends on: the number of genes in the species; the number of genes expressed under the treatment/tissue; rare transcripts.

Number of reads/coverage Trapnell et al., Nature Biotechnology, 2010, 28:311

Number of differentially expressed (DE) genes found by each method in sample pairwise comparisons

Comparison        edgeR   voom   Cufflinks
Cond1_vs_Cond2    220     0      5
Cond1_vs_Cond3    93      24     7
Cond1_vs_Cond4    43      0      0
Cond2_vs_Cond3    175     0      2
Cond2_vs_Cond4    162     0      1
Cond3_vs_Cond4    119     0      2

Each of these samples had at least 50 million PE reads.

Standards, Guidelines and Best Practices for RNA-Seq, The ENCODE Consortium: "The ability to detect reliably low copy number transcripts/isoforms depends upon the depth of sequencing and on a sufficiently complex library. For experiments from a typical mammalian tissue or in which sensitivity of detection is important, a minimum depth of 100-200 M 2 x 76 bp or longer reads is currently recommended." http://encodeproject.org/encode/protocols/datastandards/encode_rnaseq_standards_v1.0.pdf
Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. Liu et al., 2013. PLoS One 8:e66883

Experimental Design Number of biological replicates

Multiplexing: we can sequence multiple samples on one lane by indexing (barcoding, tagging) each sample. The index is usually a 6-7 bp sequence that is used to separate the reads for each sample. [Figure: paired-end library showing Adaptor A, Adaptor B, SP1, SP2, the barcode primer, and the barcode.]
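Separating reads by index can be sketched as below. This assumes the older Illumina read-ID layout used in this deck's example reads, where the index sits between `#` and `/` (for example `...#CTTGTA/1`). The function names, the sample sheet, and the read IDs other than the first are hypothetical, and only exact index matches are accepted, whereas production demultiplexers usually tolerate a mismatch.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def index_from_read_id(read_id: str) -> str:
    """Return the index (barcode) found between '#' and '/' in a read ID."""
    _, _, tail = read_id.partition("#")
    if not tail:
        raise ValueError("no '#<index>' field in read ID: " + read_id)
    return tail.split("/")[0]


def demultiplex(read_ids: Iterable[str],
                sample_by_index: Dict[str, str]) -> Dict[str, List[str]]:
    """Group read IDs by sample; unknown indexes go to 'undetermined'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for read_id in read_ids:
        index = index_from_read_id(read_id)
        groups[sample_by_index.get(index, "undetermined")].append(read_id)
    return dict(groups)


# Hypothetical sample sheet: index sequence -> sample name.
sample_sheet = {"CTTGTA": "sample_A", "ACAGTG": "sample_B"}
ids = [
    "HWI-ST330_0106:4:1:2643:2862#CTTGTA/1",
    "toy_read_2#ACAGTG/1",
    "toy_read_3#TTTTTT/1",
]
by_sample = demultiplex(ids, sample_sheet)
print({sample: len(reads) for sample, reads in by_sample.items()})
```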

Balanced Blocked Design (Auer and Doerge, 2010. Genetics. 185:405-416)
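One way to picture a balanced blocked layout is that every barcoded sample is pooled and the pool is run on every lane, so lane effects are shared equally by all conditions. The sketch below encodes that reading of the design; it is an illustration under that assumption, and the function names and sample names are hypothetical.

```python
from typing import Dict, List


def balanced_block_layout(samples: List[str], n_lanes: int) -> Dict[int, List[str]]:
    """Place every (barcoded) sample on every lane."""
    if n_lanes < 1:
        raise ValueError("n_lanes must be at least 1")
    return {lane: list(samples) for lane in range(1, n_lanes + 1)}


def lanes_per_sample(layout: Dict[int, List[str]]) -> Dict[str, int]:
    """Count on how many lanes each sample appears."""
    counts: Dict[str, int] = {}
    for members in layout.values():
        for sample in members:
            counts[sample] = counts.get(sample, 0) + 1
    return counts


# Hypothetical experiment: 3 biological replicates per condition, 2 lanes.
samples = ["ctrl_1", "ctrl_2", "ctrl_3", "trt_1", "trt_2", "trt_3"]
layout = balanced_block_layout(samples, n_lanes=2)
print(lanes_per_sample(layout))
```

Because each sample's reads come from all lanes, a bad lane lowers depth for every sample equally instead of being confounded with one condition.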

Quality Control: How do you check the quality of reads? How do you trim and filter low quality bases? Do we need to trim and filter low quality bases?

Sequence quality check. FastQC: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ FASTX-Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/ Filtering criteria: quality score > 30; minimum length 50% of the read.
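The two criteria on this slide (a quality threshold of 30 and a minimum length of 50% of the read) can be illustrated with a simple 3'-end trimmer. This is a sketch of the idea, not the algorithm used by FastQC or the FASTX-Toolkit. It assumes Phred+64 quality encoding (older Illumina pipelines, consistent with the lowercase quality characters in this deck's example reads; newer files use an offset of 33), and it treats bases scoring below 30 as low quality. The function names are illustrative.

```python
from typing import Tuple


def trim_3prime(sequence: str, quality: str,
                min_quality: int = 30, offset: int = 64) -> Tuple[str, str]:
    """Remove bases from the 3' end until one reaches min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def passes_length_filter(trimmed_length: int, original_length: int,
                         min_fraction: float = 0.5) -> bool:
    """Keep a read only if enough of its original length survives trimming."""
    return trimmed_length >= min_fraction * original_length


# Toy read: 'h' encodes Phred 40 and 'B' encodes Phred 2 under Phred+64.
seq, qual = "ACGTACGT", "hhhhhhBB"
trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
keep = passes_length_filter(len(trimmed_seq), len(seq))
print(trimmed_seq, keep)
```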

[Figure: per-base quality plots for a bad run and a good run. x-axis: position in read; y-axis: quality scores.]

[Figure: FastQC per-base quality plots before trimming and after trimming. x-axis: position in read; y-axis: quality scores.]

Mapping to reference genome or transcriptome: No reference available? Draft genome available? Reference not well annotated?

Mapping of RNA-Seq reads. Align to genome: can detect novel exons or un-annotated genes; aligners should be able to map reads across splice sites; reads from non-genic regions influence expression values, SNP detection, etc. Align to transcriptome: information about splice junctions is not required; PE distance and junction reads help resolve isoforms.

Strategies for RNA-Seq Haas and Zody, Nature Biotechnology, 2010, 28:421

To map RNA-Seq reads, consider: number of mismatches allowed; number of hits allowed; exon-exon/exon-intron junctions; expected distance in PE reads.

What alignment program to use? Unique or multiple mapping? What % of reads usually map to the reference? How to generate read counts?

TopHat alignment

TopHat-Cufflinks-Cuffcompare-CuffDiff: http://tophat.cbcb.umd.edu/ TopHat uses Bowtie. It splits each read into segments, maps the segments independently, and glues them together to produce end-to-end read alignments. Currently does not support short indels. Can align reads up to 1024 bp. Do not mix PE and single reads.

TopHat contd. Sources of junctions: junctions from a GFF or other list file. Without reference: neighboring coverage islands joined with an intron; PE reads (genomic coordinates and expected distance); two segments of the same read mapped apart. Reports alignments across GT-AG introns.

Cufflinks, Cuffcompare and CuffDiff. Cufflinks: assembles transcripts and estimates abundances; takes alignments in SAM format as input. Cuffcompare/Cuffmerge: compares assembled transcripts to a reference annotation; compares Cufflinks transcripts across experiments; input is the GTF file from Cufflinks. CuffDiff: takes a GTF file and SAM files; finds significant changes in transcript expression, splicing and promoter use. Output files: genes.fpkm_tracking, gene_exp.diff.

Some popular aligners. BWA: slow for long reads and reads with higher error rate; suboptimal alignment pairs; allows gapped alignment. TopHat: uses Bowtie; maps reads to the genome, builds a database of possible splice junctions, and maps the reads against these junctions to confirm them. Novoalign: most accurate, slow. Others: SpliceMap, MapSplice, SOAP, MAQ, CLC Bio.

Overlap multireads can cause inaccurate expression estimates Van Verk et al., 2013. Trends Plant Sci. 18:175-179.

Counting reads with HTSeq
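The counting rule behind htseq-count's "union" mode can be shown with a toy model: collect every gene that overlaps any base of the read, count the read if exactly one gene is found, and otherwise record it as having no feature or as ambiguous. The sketch below is a simplified illustration, not HTSeq code. It ignores chromosomes, strand, and spliced reads, uses half-open coordinates, and the gene names, coordinates, and the labels `no_feature` and `ambiguous` (modeled on htseq-count's special counters) are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def assign_read_union(read: Interval, genes: Dict[str, List[Interval]]) -> str:
    """Return the single gene a read overlaps, or a special label."""
    read_start, read_end = read
    hits = set()
    for gene, exons in genes.items():
        if any(start < read_end and read_start < end for start, end in exons):
            hits.add(gene)
    if not hits:
        return "no_feature"
    if len(hits) > 1:
        return "ambiguous"
    return next(iter(hits))


def count_reads(reads: List[Interval],
                genes: Dict[str, List[Interval]]) -> Counter:
    """Tally reads per gene (plus the two special categories)."""
    return Counter(assign_read_union(read, genes) for read in reads)


# Hypothetical annotation: geneB overlaps the second exon of geneA.
annotation = {
    "geneA": [(100, 200), (300, 400)],
    "geneB": [(380, 500)],
}
toy_reads = [(120, 170), (310, 360), (370, 420), (450, 490), (600, 650)]
counts = count_reads(toy_reads, annotation)
print(dict(counts))
```

The third toy read touches both genes and is set aside as ambiguous, which is one way overlapping features reduce the reads available for expression estimates.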

What are the different methods for DGE analysis? What is RPKM/FPKM? Why do we use more than one method? How to validate and verify the RNA-Seq results? How to select genes for qRT-PCR?

Differential Gene Expression. Sources of bias: length of genes; sequencing depth. RPKM: Reads Per Kilobase of exon model per Million mapped reads (Mortazavi et al., Nature Methods, 2008, 5:621). FPKM: Fragments Per Kilobase of exon model per Million mapped reads. Normalization: gene counts should be adjusted to minimize the bias. The statistical model should account for length and depth.
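Written out, the RPKM definition is: reads mapped to the gene, divided by the exon model length in kilobases, divided by the total mapped reads in millions. The sketch below computes that directly; FPKM uses the same formula with fragments (read pairs) in place of reads. The function name and the example numbers are illustrative.

```python
def rpkm(gene_reads: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of exon model per Million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    kilobases = exon_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return gene_reads / kilobases / millions


# Toy example: 500 reads on a 2 kb exon model, 10 million mapped reads.
value = rpkm(gene_reads=500, exon_length_bp=2_000, total_mapped_reads=10_000_000)
print(value)
```

Doubling either the gene length or the sequencing depth while holding the read count fixed halves the value, which is how the measure adjusts for both sources of bias.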

Differential expression methods. Fisher's exact test or similar tests for RPKM/FPKM. R packages for RNA-Seq analysis: DESeq (small number of replicates or no replicates; negative binomial (NB) distribution); edgeR (NB distribution; similar to Fisher's exact test using NB instead of hypergeometric probabilities); baySeq (more complex; empirical Bayesian methods); DEGseq (based on MA-plots).
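As a point of reference for the hypergeometric probabilities mentioned above, here is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of counts (reads on the gene versus all other reads, in two conditions). It is not the test implemented in edgeR or DESeq, which use the negative binomial distribution and replicate information; the function names and the example table are illustrative.

```python
from math import exp, lgamma


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided p-value for the table [[a, b], [c, d]].

    a, c: reads on the gene in conditions 1 and 2;
    b, d: all other reads in conditions 1 and 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_total = log_comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x gene reads in condition 1.
        return exp(log_comb(row1, x) + log_comb(row2, col1 - x) - log_total)

    observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as unlikely as the observed one.
    p_value = sum(p for p in (prob(x) for x in range(low, high + 1))
                  if p <= observed * (1 + 1e-7))
    return min(1.0, p_value)


p_small = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_small, 4))
```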

DESeq-edgeR-Cufflinks

DESeq                      10,939
edgeR                      11,770
Cufflinks                   6,263
DESeq + edgeR              10,219
DESeq + Cufflinks           6,070
edgeR + Cufflinks           6,077
DESeq + edgeR + Cufflinks   6,045

DESeq and edgeR: Novoalign mapping

DESeq-edgeR-baySeq

DESeq                     888
edgeR                     895
baySeq                  1,115
DESeq + edgeR             591
DESeq + baySeq            488
edgeR + baySeq            465
DESeq + edgeR + baySeq    338

http://davetang.org/muse/
Soneson and Delorenzi, 2013. BMC Bioinformatics. 14:91
Rapaport et al., 2013. http://arxiv.org/pdf/1301.5277.pdf

Comparing RNA-Seq with qRT-PCR (Yendrek et al., 2012. BMC Research Notes. 5:506)

FAQs of RNA-Seq
Sequencing platform to use: Illumina HiSeq, 8 lanes/flowcell, fastq files
Sequencing depth (number of reads): at least 30-40 million/sample
Paired-end or single reads: PE
Length of reads: at least 50 bp, usually 100 bp
Number of biological replicates: at least 3 (more if you can afford)
Experimental design for sequencing: Balanced Block Design
How to analyze RNA-Seq data

FAQs of RNA-Seq cont. How to analyze RNA-Seq data
How to check quality and trim/filter low quality: FastQC and FASTX-Toolkit
Reference genome or transcriptome: depends on the purpose of the experiment; build a reference transcriptome if not available (Trinity, Trans-ABySS, Velvet/Oases)
What alignment program to use: TopHat, Bowtie2, BWA
Unique or multiple mapping: unique
A good % mapping: 70-90%

FAQs of RNA-Seq cont. How to analyze RNA-Seq data
How to get read counts: HTSeq with option union
What statistical methods to use: edgeR/limma packages (edgeR, voom, rpkm), DESeq
Why do we use more than one method: different normalization methods and assumptions
Validation and verification, how to select genes: FDR, FC, pathway
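Selecting genes by FDR and fold change (FC) can be sketched as follows. The slide does not say which FDR procedure is meant; this example assumes the Benjamini-Hochberg adjustment, and the cutoffs (FDR 0.05, absolute log2 fold change of 1), gene names, and function names are hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(results: Dict[str, Tuple[float, float]],
                 max_fdr: float = 0.05, min_abs_log2fc: float = 1.0) -> List[str]:
    """Keep genes passing both cutoffs; results maps gene -> (p, log2FC)."""
    genes = list(results)
    fdr = benjamini_hochberg([results[g][0] for g in genes])
    return [g for g, q in zip(genes, fdr)
            if q <= max_fdr and abs(results[g][1]) >= min_abs_log2fc]


# Hypothetical test results: gene -> (raw p-value, log2 fold change).
toy_results = {
    "gene1": (0.010, 2.5),
    "gene2": (0.040, -1.8),
    "gene3": (0.030, 0.4),
    "gene4": (0.005, -3.1),
}
adjusted = benjamini_hochberg([p for p, _ in toy_results.values()])
selected = select_genes(toy_results)
print(selected)
```

In this toy set, gene3 passes the FDR cutoff but is dropped for its small fold change, which is the kind of filtering that helps narrow a list before picking candidates for validation.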

Applications using RNA-Seq data: differential gene expression; structural annotation of a genome; alternative splicing; fusion transcripts; de novo transcriptome assembly; SNPs/indels; phylogenomics.

Resources for RNA-Seq analysis RNA-Seq Blog Transcriptome Analysis: Sequencing and Profiling

Additional references for RNA-Seq analysis:
Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Trapnell et al., 2012. Nature Protocols. 7:562-578.
Evaluating the impact of sequencing depth on transcriptome profiling in human adipose. Liu et al., 2013. PLoS One 8:e66883.
http://string-db.org
Counting reads in features with htseq-count. http://www-huber.embl.de/users/anders/htseq/doc/count.html

References for statistical analysis of DGE:
Design and validation issues in RNA-seq experiments. Fang and Cui, 2011. Brief Bioinform. 12:280-287.
A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Dillies et al., 2012. Brief Bioinform. doi: 10.1093/bib/bbs046
A comparison of statistical methods for detecting differentially expressed genes from RNA-Seq data. Kvam et al., 2012. Am. J. Botany. 99:248-256.
Comprehensive evaluation of differential expression analysis methods for RNA-Seq data. Rapaport et al., 2013. http://arxiv.org/pdf/1301.5277.pdf
A comparison of methods for differential expression analysis of RNA-seq data. Soneson and Delorenzi, 2013. BMC Bioinformatics. 14:91.

BIOINFORMATICS CORE IS SUPPORTED BY:
Financial: OVPR, College of Agriculture, Ag. Research Programs, College of Technology, College of Veterinary Medicine, Cancer Center, Cyber Center - Discovery Park (College of Science)
Technical: Rosen Center for Advanced Computing (RCAC), ITaP, AgIT

Genomics Facility Information: Phillip San Miguel, Ph.D. Genomics Facility Director pmiguel@purdue.edu (765)-496-6328

THANK YOU!