RNA-Seq Tutorial 1. John Garbe Research Informatics Support Systems, MSI March 19, 2012



Similar documents
Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

Challenges associated with analysis and storage of NGS data

Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

Standards, Guidelines and Best Practices for RNA-Seq V1.0 (June 2011) The ENCODE Consortium

PreciseTM Whitepaper

Frequently Asked Questions Next Generation Sequencing

Introduction to NGS data analysis

Basic processing of next-generation sequencing (NGS) data

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

NGS Data Analysis: An Intro to RNA-Seq

G E N OM I C S S E RV I C ES

Expression Quantification (I)

Using Galaxy for NGS Analysis. Daniel Blankenberg Postdoctoral Research Associate The Galaxy Team

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

BIOL 3200 Spring 2015 DNA Subway and RNA-Seq Data Analysis

Introduction to next-generation sequencing data

New Technologies for Sensitive, Low-Input RNA-Seq. Clontech Laboratories, Inc.

Comparing Methods for Identifying Transcription Factor Target Genes

8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

Next generation sequencing (NGS)

Deep Sequencing Data Analysis

Services. Updated 05/31/2016

Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

17 July 2014 WEB-SERVER MANUAL. Contact: Michael Hackenberg

Next Generation Sequencing

CRAC: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data.

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

PrimePCR Assay Validation Report

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

How Sequencing Experiments Fail

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

PrimePCR Assay Validation Report

RNA Express. Introduction 3 Run RNA Express 4 RNA Express App Output 6 RNA Express Workflow 12 Technical Assistance

Next Generation Sequencing: Technology, Mapping, and Analysis

Analysis of ChIP-seq data in Galaxy

Gene Expression Analysis

TGC AT YOUR SERVICE. Taking your research to the next generation

RNAseq / ChipSeq / Methylseq and personalized genomics

Next generation DNA sequencing technologies. theory & prac-ce

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

How-To: SNP and INDEL detection

mrna NGS Data Analysis Report

Advances in RainDance Sequence Enrichment Technology and Applications in Cancer Research. March 17, 2011 Rendez-Vous Séquençage

SEQUENCING. From Sample to Sequence-Ready

Single-Cell DNA Sequencing with the C 1. Single-Cell Auto Prep System. Reveal hidden populations and genetic diversity within complex samples

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

Bioinformatics Unit Department of Biological Services. Get to know us

Discovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG)

NGS data analysis. Bernardo J. Clavijo

A Tutorial in Genetic Sequence Classification Tools and Techniques

Databases and mapping BWA. Samtools

Computational Genomics. Next generation sequencing (NGS)

Welcome to the Plant Breeding and Genomics Webinar Series

July 7th 2009 DNA sequencing

A survey of best practices for RNA-seq data analysis

ncounter Leukemia Fusion Gene Expression Assay Molecules That Count Product Highlights ncounter Leukemia Fusion Gene Expression Assay Overview

From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

Introduction to Bioinformatics 3. DNA editing and contig assembly

LifeScope Genomic Analysis Software 2.5

Bioruptor NGS: Unbiased DNA shearing for Next-Generation Sequencing

Practical Guideline for Whole Genome Sequencing

Module 1. Sequence Formats and Retrieval. Charles Steward

An Overview of DNA Sequencing

High Throughput Sequencing Data Analysis using Cloud Computing

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

Lectures 1 and February 7, Genomics 2012: Repetitorium. Peter N Robinson. VL1: Next- Generation Sequencing. VL8 9: Variant Calling

New generation sequencing: current limits and future perspectives. Giorgio Valle CRIBI - Università di Padova

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa

NEXT GENERATION SEQUENCING

Bioinformatics Resources at a Glance

Methods, tools, and pipelines for analysis of Ion PGM Sequencer mirna and gene expression data

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Introduction To Epigenetic Regulation: How Can The Epigenomics Core Services Help Your Research? Maria (Ken) Figueroa, M.D. Core Scientific Director

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Overview of Next Generation Sequencing platform technologies

Research Article Stormbow: A Cloud-Based Tool for Reads Mapping and Expression Quantification in Large-Scale RNA-Seq Studies

The Power of Next-Generation Sequencing in Your Hands On the Path towards Diagnostics

New solutions for Big Data Analysis and Visualization

BioHPC Web Computing Resources at CBSU

Text file One header line meta information lines One line : variant/position

Introduction. Overview of Bioconductor packages for short read analysis

MiSeq: Imaging and Base Calling

Normalization of RNA-Seq

GenBank, Entrez, & FASTA

Next Generation Sequencing

TruSeq Custom Amplicon v1.5

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Copy Number Variation: available tools

Delivering the power of the world s most successful genomics platform

A Complete Example of Next- Gen DNA Sequencing Read Alignment. Presentation Title Goes Here

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Analysis of gene expression data. Ulf Leser and Philippe Thomas

Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals

Core Facility Genomics

University of Glasgow - Programme Structure Summary C1G MSc Bioinformatics, Polyomics and Systems Biology

Transcription:

RNA-Seq Tutorial 1 John Garbe Research Informatics Support Systems, MSI March 19, 2012

Tutorial 1 RNA-Seq Tutorials RNA-Seq experiment design and analysis Instruction on individual software will be provided in other tutorials Tutorial 2 Hands-on using TopHat and Cufflinks in Galaxy Tutorial 3 Advanced RNA-Seq Analysis topics

Galaxy.msi.umn.edu Web-base platform for bioinformatic analysis

Introduction Experimental Design RNA Outline Sequencing fastq Data Quality Control fastq Read mapping SAM/BAM Reference Genome fasta Reference Transcriptome GFF/GTF Differential Expression Analysis Transcriptome Assembly

Introduction Experimental Design RNA Sequencing fastq Data Quality Control fastq Read mapping SAM/BAM Reference Genome fasta Introduction Gene expression RNA-Seq Platform characteristics Microarray comparison Reference Transcriptome GFF/GTF Differential expression analysis Transcriptome Assembly

Central dogma of molecular biology Nucleotide Base Pair Genome Gene Gene Gene A C C T T G G A Exon Gene Alternative Transcriptionsplicing Intron A A G G C A Gene Expression pre-mrna A A G G C A mrna exon 1 exon 2 exon 3 Splicing Translation G C A gene Alternative Splicing transcription & splicing Translation alternatively spliced mrna G C A Protein K K V I V R Q isoform 1 isoform 2 R K Q 92 94% of human genes undergo alternative splicing, 86% with a minor isoform frequency of 15% or more E.T. Wang, et al, Nature 456, 470-476 (2008) K K V Q R K Q

RNA-Seq Introduction High-throughput sequencing of RNA Transcriptome assembly Qualitative identification of expressed sequence Differential expression analysis Quantitative measurement of transcript expression

Sample 1 mrna isolation Calculate transcript abundance Gene A Gene B Sample 1 4 4 # of Reads Fragmentation RNA -> cdna Gene A Gene B Sample 1 4 2 Reads per kilobase of exon Sequence fragment end(s) Gene A Gene B Total Sample 1 4 2 6 Sample 2 7 5 12 Reads per kilobase of exon Map reads Gene A Gene B Total Genome Sample 1.7.3 6 Reference Transcriptome A B Sample 2.6.3 12 Reads per kilobase of exon per million mapped reads RPKM

Ali Mortazavi et al., Nature Methods - 5, 621-628 (2008)

Introduction RNA-Seq (vs Microarray) Strong concordance between platforms Higher sensitivity and dynamic range Lower technical variation Available for all species Novel transcribed regions Alternative splicing Allele-specific expression Fusion genes Higher informatics cost

Introduction Experimental Design RNA Sequencing fastq Data Quality Control fastq Read mapping SAM/BAM GFF/GTF Reference Genome fasta Reference Transcriptome Experimental Design Biological comparison(s) Paired-end vs single end reads Read length Read depth Replicates Pooling Differential expression analysis Transcriptome Assembly

Experimental design Simple designs (Pairwise comparisons) Two group Drug effect Control Experimental (drug applied) Complex designs Consult a statistician Cancer sub-type 1 Cancer sub-type 1 With drug Normal Cancer Two factor Cancer type X drug Matched-pair Cancer sub-type 2 Cancer sub-type 2 With drug

Experimental design What are my goals? Transcriptome assembly? Differential expression analysis? Identify rare transcripts? What are the characteristics of my system? Large, complex genome? Introns and high degree of alternative splicing? No reference genome or transcriptome?

Experimental design BMGC RNA-Seq Price list (Jan 2012)

Experimental design 10 million reads per sample, 50bp single-end reads Small genomes with no alternative splicing

Experimental design 20 million reads per sample, 50bp paired-end reads Mammalian genomes (large transcriptome, alternative splicing, gene duplication)

Experimental design 50-200 million reads per sample, 100bp paired-end reads Transcriptome Assembly (100X coverage of transcriptome) 50bp Paired-end >> 100bp Single-end

Experimental design Technical replicates Not needed: low technical variation Minimize batch effects Randomize sample order Biological replicates Not needed for transcriptome assembly Essential for differential expression analysis Difficult to estimate 3+ for cell lines 5+ for inbred lines 20+ for human samples

Experimental design Pooling samples Limited RNA obtainable Multiple pools per group required Transcriptome assembly

Experimental design RNA-seq: technical variability and sampling Lauren M McIntyre, Kenneth K Lopiano, Alison M Morse, Victor Amin, Ann L Oberg, Linda J Young and Sergey V Nuzhdin BMC Genomics 2011, 12:293 Statistical Design and Analysis of RNA Sequencing Data Paul L. Auer and R. W. Doerge Genetics. 2010 June; 185(2): 405 416. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries Daniel Aird, Michael G Ross, Wei-Sheng Chen, Maxwell Danielsson, Timothy Fennell, Carsten Russ, David B Jaffe, Chad Nusbaum and Andreas Gnirke Genome Biology 2011, 12:R18 ENCODE RNA-Seq guidelines http://www.encodeproject.org/encode/experiment_guidelines.html

Introduction Experimental Design RNA Sequencing fastq Data Quality Control fastq Read mapping SAM/BAM Reference Genome fasta Sequencing Platforms Library preparation Multiplexing Sequence reads Reference Transcriptome GFF/GTF Differential expression analysis Transcriptome Assembly

Sequencing Illumina sequencing by synthesis GAIIx replaced by HiSeq HiSeq2000 MiSeq low throughput, fast turnaround SOLiD (not available at BMGC) Color-space reads (require special mapping software) Low error rate 454 pyrosequencing Longer reads, lower throughput

Sequencing Library preparation (Illumina TruSeq protocol for HiSeq) RNA isolation Ploy-A purification Fragmentation cdna synthesis using random primers Adapter ligation Size selection PCR amplification

Flowcell 8 lanes Sequencing 200 Million reads per lane Multiplex up to 24 samples on one lane using barcodes

Adapter Sequencing cdna insert Adaptor Barcode Read 1 left read Index read Read 2 right read (optional) Adaptor contamination

Library types Sequencing Polyadenylated RNA > 200bp (standard method) Total RNA Small RNA Strand-specific Gene-dense genomes (bacteria, archaea, lower eukaryotes) Antisense transcription (higher eukaryotes) Low input Library capture

Sequence Data Format Data delivery /project/pi-groupname/120318_sn261_0348_a81jumabxx fastq_flt/ Bad reads removed by Illumina software, for use in data analysis fastq/ Raw sequence output for submission to public archives, contains bad reads Upload to Galaxy File names L1_R1_CCAAT_cancer1.fastq L1_R2_CCAAT_cancer1.fastq Fastq format (Illumina Casava 1.8.0) 4 lines per read Machine ID Formats vary Don t use in analysis QC Filter flag Y=bad N=good barcode Read ID Sequence + Quality score Phred+33 @HWI-M00262:4:000000000-A0ABC:1:1:18376:2027 1:N:0:AGATC TTCAGAGAGAATGAATTGTACGTGCTTTTTTTGT + Read pair # =1:?7A7+?77+<<@AC<3<,33@A;<A?A=:4=

Introduction Experimental Design RNA Sequencing fastq Data Quality Control Data Quality Control Quality assessment Trimming and filtering fastq Read mapping SAM/BAM Reference Genome fasta Reference Transcriptome GFF/GTF Differential expression analysis Transcriptome Assembly

Data Quality Assessment Evaluate read library quality Identify contaminants Identify poor/bad samples Software FastQC (recommended) Command-line, Java GUI, or Galaxy SolexaQC Command-line Supports quality-based read trimming and filtering SAMStat Command-line Also works with bam alignment files

Data Quality Assessment Trimming: remove bad bases from (end of) read Adaptor sequence Low quality bases Filtering: remove bad reads from library Low quality reads Contaminating sequence Low complexity reads (repeats) Short reads Short (< 20bp) reads slow down mapping software Only needed if trimming was performed Software Galaxy, many options (NGS: QC and manipulation) Tagdust Many others: http://seqanswers.com/wiki/software/list

Data Quality Assessment - FastQC Good Quality scores across bases Bad Trimming needed Phred 30 = 1 error / 1000 bases Phred 20 = 1 error / 100 bases

Data Quality Assessment - FastQC Good Quality scores across reads Bad Filtering needed

Data Quality Assessment - FastQC Good GC Distribution Bad

Data Quality Assessment - FastQC High level of sequencing adapter contamination, trimming needed

Data Quality Assessment - FastQC Normal level of sequence duplication in 20 million read mammalian sample

Data Quality Assessment - FastQC Normal sequence bias at beginning of reads due to non-random hybridization of random primers

Data Quality Assessment Recommendations Generate quality plots for all read libraries Trim and/or filter data if needed Always trim and filter for de novo transcriptome assembly Regenerate quality plots after trimming and filtering to determine effectiveness

Introduction Experimental Design RNA Sequencing fastq Data Quality Control fastq Read mapping SAM/BAM Reference Genome fasta Read Mapping Pipeline Software Input Output Reference Transcriptome GFF/GTF Differential expression analysis Transcriptome Assembly

Mapping with reference genome Reference Genome Millions of short reads Fastq Spliced aligner Reads aligned to genome SAM/BAM Abundance estimation Differential expression analysis

Mapping with reference genome Reference Genome Millions of short reads Fastq Spliced aligner aligner Unmapped reads aligner Reference splice Junction library or De novo splice junction library Fasta/GTF Fasta/GTF Reads aligned to genome SAM/BAM Abundance estimation Differential expression analysis 3 exon gene junction library

Alignment algorithm must be Mapping Fast Able to handle SNPs, indels, and sequencing errors Allow for introns for reference genome alignment (spliced alignment) Burrows Wheeler Transform (BWT) mappers Faster Few mismatches allowed (< 3) Limited indel detection Spliced: Tophat, MapSplice Unspliced: BWA, Bowtie Hash table mappers Slower More mismatches allowed Indel detection Spliced: GSNAP, MapSplice Unspliced: SHRiMP, Stampy

Input Fastq read libraries Mapping Reference genome index (software-specific: /project/db/genomes) Insert size mean and stddev (for paired-end libraries) Map library (or a subset) using estimated mean and stddev Calculate empirical mean and stddev Galaxy: NGS Picard: insertion size metrics Cufflinks standard error Re-map library using empirical mean and stddev

Mapping Output SAM (text) / BAM (binary) alignment files SAMtools SAM/BAM file manipulation Summary statistics (per read library) % reads with unique alignment % reads with multiple alignments % reads with no alignment % reads properly paired (for paired-end libraries)

Introduction Experimental Design RNA Sequencing fastq Differential Expression Discrete vs continuous data Cuffdiff and EdgeR Data Quality Control fastq Read mapping SAM/BAM Reference Genome fasta Reference Transcriptome GFF/GTF Differential Expression Analysis Transcriptome Assembly

Differential Expression Discrete vs Continuous data Microarray florescence intensity data: continuous Modeled using normal distribution RNA-Seq read count data: discrete Modeled using negative binomial distribution Microarray software cannot be used to analyze RNA-Seq data

Differential Expression Cuffdiff (Cufflinks package) Pairwise comparisons Differential gene, transcript, and primary transcript expression; differential splicing and promoter use Easy to use, well documented Input: transcriptome, SAM/BAM read alignments (abundance estimation built-in) EdgeR Complex experimental designs using generalized linear model Information sharing among genes (Bayesian gene-wise dispersion estimation) Difficult to use R package Consult a statistician Input: raw gene/transcript read counts (calculate abundance using separate software)

Introduction Introduction Experimental Design RNA Sequencing fastq Data Quality Control fastq Read mapping SAM/BAM Suggested Implementation (reference genome available) Two group design Reference Transcriptome RNA Sequencing fastq FastQC fastq Tophat SAM/BAM Transcript abundance estimation Cuffdiff (Cufflinks package) Differential expression analysis

Introduction Experimental Design RNA Sequencing fastq Data Quality Control fastq Read mapping SAM/BAM Reference Genome fasta Transcriptome Assembly Pipeline Software Input Output Reference Transcriptome GFF/GTF Differential expression analysis Transcriptome Assembly

RNA-Seq Reference genome Reference transcriptome RNA-Seq Reference genome No reference transcriptome RNA-Seq No reference genome No reference transcriptome Experimental Design RNA Sequencing fastq Data Quality Control fastq Read mapping SAM/BAM Differential Expression Analysis GFF/GTF Reference Genome fasta Reference Transcriptome Experimental Design RNA Sequencing fastq Data Quality Control fastq Read mapping SAM/BAM Transcriptome assembly Differential Expression Analysis Reference Genome fasta GFF/GTF Experimental Design RNA Sequencing fastq Data Quality Control fastq Read mapping SAM/BAM Differential Expression Analysis Transcriptome assembly fasta

Transcriptome Assembly -with reference genome Reference transcriptome available No/poor reference transcriptome available Reads aligned to genome SAM/BAM Assembler Reads aligned to genome With de novo junction library SAM/BAM Reference transcriptome Novel transcript identification De novo transcriptome GFF/GTF Abundance estimation GFF/GTF Annotate Abundance estimation

Transcriptome Assembly -with reference genome Reference genome based assembly Cufflinks, Scripture Reference annotation based assembly Cufflinks Transcriptome comparison Cuffcompare Transcriptome Annotation Generate cdna fasta from annotation (Cufflinks gffread program) Align to library of known cdna (RefSeq, GenBank)

Transcriptome Assembly no reference genome Millions of short reads Unspliced Aligner Reads aligned to transcriptome Fastq SAM/BAM Assembler Trinity Trans-ABySS Velvet/oasis De novo Transcriptome Computationally intensive Abundance estimation Differential expression analysis

Further Reading Bioinformatics for High Throughput Sequencing Rodríguez-Ezpeleta, Naiara.; Hackenberg, Michael.; Aransay, Ana M.; SpringerLink New York, NY : Springer c2012 Online access through U library RNA sequencing: advances, challenges and opportunities Fatih Ozsolak1 & Patrice M. Milos1 Nature Reviews Genetics 12, 87-98 (February 2011) Computational methods for transcriptome annotation and quantification using RNA-seq Manuel Garber, Manfred G Grabherr, Mitchell Guttman & Cole Trapnell Nature Methods 8, 469 477 (2011) Next-generation transcriptome assembly Jeffrey A. Martin & Zhong Wang Nature Reviews Genetics 12, 671-682 (October 2011) Table of RNA-Seq software Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks Cole Trapnell, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David Kelley, Harold Pimentel, Steven Salzberg, John L Rinn & Lior Pachter Nature Protocols 7, 562 578 (2012) SEQanswers.com biostar.stackexchange.com Popular bioinformatics forums

Questions / Discussion