RNAseq Introduction Ian Misner, Ph.D. Bioinformatics Crash Course
Many types of RNA
  rRNA, tRNA, mRNA, miRNA, ncRNA, etc.
  ~2% is mRNA
Why sequence RNA
  Functional studies
    Drug treated vs. untreated cell line
    Wild type vs. knockout
  SNP finding
  Transcriptome assembly
  Novel gene finding
  Splice variant analysis
Challenges
  Sampling
    Purity? Quantity? Quality?
  Exons can be problematic
    Mapping reads can become difficult
  RNA abundances vary by orders of magnitude
    Highly expressed genes can overpower genes of interest
    Organellar RNA can block overall signal
  RNA is fragile and must be properly handled
  RNA population turns over quickly within a cell
General workflows
  Obtain raw data
  Align/assemble reads
  Process alignment with a tool specific to the goal
    e.g. Cufflinks, Sailfish
  Post process
    Import into downstream software (R, MATLAB, Cytoscape, etc.)
  Summarize and visualize
    Create gene lists, prioritize candidates for validation, etc.
Experimental Design Questions
  What is my biological question?
  How much sequencing do I need?
  What type of sequencing should I do?
    Read length?
    Which platform?
    SE or PE?
  How much multiplexing can I do?
  Should I pool samples?
  How many replicates do I need?
  What about duplicates?
What are you working with?
  Novel: little or no data
  Some data: ESTs or Unigenes
  Basic Draft Genome
    Few thousand contigs
    Some annotation, mostly ab initio
  Good Draft Genome
    Few thousand scaffolds to chromosome arms
    Better annotations with human verification
  Model Organism
    Fully sequenced genome
    High confidence annotations
    Genetic maps and markers
    Mutant data available
Number of Reads/Replicates
  Increasing biological replication significantly increases the number of DE (differentially expressed) genes identified.
  Liu Y et al. Bioinformatics 2014;30:301-304
Read Type and Platform
  Read Type   Platform                             Uses
  50 SE       Illumina                             Gene expression quantification; SNP finding (good reference)
  50 PE       Illumina                             Above plus splice variants
  100+ PE     Illumina                             Above plus transcriptome assembly; DE within gene families
  200+        Ion Torrent, Sanger, 454, Nanopore   Splice variants; transcriptome assembly; haplotypes; too large for DE
Read Platform
  Purdue University Discovery Park
Multiplexing
  6-8 nt barcodes added to samples during library prep
  Allows for pooling of samples into the same lane
    Mitigate lane effects
    Maximize sequencing efficiency
  Dual barcoding allows for up to 96 samples per lane
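To make the barcode pooling above concrete, here is a minimal sketch of how reads from a pooled lane might be assigned back to samples. The sample names, the 6-nt barcodes, and the one-mismatch tolerance are illustrative assumptions, not values from the slides; production demultiplexers handle quality scores and dual indexes as well.

```python
from typing import Optional


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(
    read_barcode: str, barcodes: dict[str, str], max_mismatches: int = 1
) -> Optional[str]:
    """Return the sample if exactly one barcode is within tolerance, else None."""
    hits = [
        sample
        for sample, barcode in barcodes.items()
        if hamming(read_barcode, barcode) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical 6-nt barcodes for three samples pooled in one lane.
BARCODES = {"control_1": "ATCACG", "control_2": "CGATGT", "treated_1": "TTAGGC"}

print(assign_sample("ATCACG", BARCODES))  # exact match -> control_1
print(assign_sample("ATCACC", BARCODES))  # one sequencing error -> control_1
print(assign_sample("GGGGGG", BARCODES))  # nothing within tolerance -> None
```

Reads that match no barcode, or more than one, are left unassigned rather than guessed, which is why barcode sets are usually chosen to differ at several positions.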
Replicates: Biological
  Measurement of variation between samples
  More are better
  Can detect genetic variation between samples
  Pooling with barcodes: each sample is a replicate
  Pooling without barcodes: each pool is a replicate
Replicates: Technical
  Can determine variation within sample preparation
  Can be cost prohibitive
    More biological replicates are better
  Useful across lanes to mitigate lane effects
Should I remove duplicates?
  Maybe
  Duplicates may correspond to biased PCR amplification of particular fragments
  For highly expressed, short genes, duplicates are expected even if there is no amplification bias
  Removing them may reduce the dynamic range of expression estimates
  Assess library complexity and decide
  If you do remove them, assess duplicates at the level of paired-end reads, not single-end reads
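The difference between assessing duplicates at the pair level and at the single-read level can be sketched as follows. The record layout (a dict with `chrom`, `strand`, `start1`, `start2`) and the toy coordinates are assumptions made for illustration; real tools work from aligned BAM records.

```python
from collections import Counter


def pair_key(pair):
    """Key a read pair by chromosome, strand, and BOTH mates' start positions."""
    return (pair["chrom"], pair["strand"], pair["start1"], pair["start2"])


def single_key(pair):
    """Key a read pair by the first mate only, as a single-end analysis would."""
    return (pair["chrom"], pair["strand"], pair["start1"])


def duplicate_fraction(pairs, key):
    """Fraction of records that repeat an earlier record under the given key."""
    counts = Counter(key(p) for p in pairs)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(pairs)


# Toy alignments: the first two pairs are identical fragments; the third shares
# only the first mate's start, so it comes from a different fragment.
pairs = [
    {"chrom": "chr1", "strand": "+", "start1": 100, "start2": 250},
    {"chrom": "chr1", "strand": "+", "start1": 100, "start2": 250},
    {"chrom": "chr1", "strand": "+", "start1": 100, "start2": 310},
    {"chrom": "chr1", "strand": "+", "start1": 500, "start2": 640},
]

print(duplicate_fraction(pairs, pair_key))    # 0.25
print(duplicate_fraction(pairs, single_key))  # 0.5
```

Using only the first mate flags a distinct fragment as a duplicate, which is the overcounting the slide warns about. One minus the pair-level fraction gives a rough view of library complexity.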
Processing RNA for Sequencing
  Depends upon what you're looking to achieve
  mRNA is the main target
  PolyA selection
    Oligo-dT beads
    Highly efficient at getting mRNA and depleting the rRNA
    Can't be used with non-polyA RNA
  miRNA kits as well
Strand Specific Sequencing
  Illumina prep that ligates adaptors to the 5' and 3' ends of RNA prior to cDNA reverse transcription
  Having strand information makes mapping more straightforward
  Can identify antisense transcripts
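As a small illustration of how strand information exposes antisense transcription, the sketch below tallies reads overlapping a gene by whether they agree with the annotated strand. It assumes the read strand reported by the aligner is the strand of the original RNA; some library protocols report the opposite strand, in which case the labels flip. The read list is made up.

```python
from collections import Counter

VALID_STRANDS = {"+", "-"}


def classify_read(read_strand: str, gene_strand: str) -> str:
    """Label a read overlapping a gene as sense or antisense to the annotation."""
    if read_strand not in VALID_STRANDS or gene_strand not in VALID_STRANDS:
        raise ValueError("strand must be '+' or '-'")
    return "sense" if read_strand == gene_strand else "antisense"


# Hypothetical strands of five reads overlapping a gene annotated on '+'.
read_strands = ["+", "+", "-", "+", "-"]
tally = Counter(classify_read(s, "+") for s in read_strands)

print(tally["sense"], tally["antisense"])  # 3 2
```

With an unstranded library the two classes cannot be separated, so all five reads would simply count toward the gene.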
Insert Sizes
Alignment Options
  No Genome?! No Problem!
    Transcriptome assembly
      There will be redundancy
    NCBI Unigene Set
      Not necessarily complete
      Good to identify highly expressed genes
    Valid transcripts from your organism
      Easy to use but may miss novel genes
  Fully Sequenced and Annotated Genome
    No excuses: this better be a Nature paper!
Mapping RNAseq Reads
  How many mismatches will you allow?
    Depends on what you're mapping and what you're using for a reference
  Number of hits allowed?
    How many times can a read match in different locations?
  Splice junctions?
    Is your mapping tool splice aware?
  Expected distance for PE reads?
    This is important to know so that read pairs can map properly
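The "expected distance for PE reads" question comes down to simple arithmetic on fragment and read lengths. The sketch below assumes both mates have the same length and that positions are leftmost mapped coordinates on an unspliced reference such as a transcriptome; on a genome, pairs spanning an intron will look farther apart than the fragment really is. The 300 bp fragment and 50 bp tolerance are illustrative values only.

```python
def fragment_length(start1: int, start2: int, read_length: int) -> int:
    """Length spanned by two equal-length mates, from leftmost start to rightmost end."""
    return abs(start2 - start1) + read_length


def mate_inner_distance(fragment: int, read_length: int) -> int:
    """Gap between the inner ends of the two mates (negative if they overlap)."""
    return fragment - 2 * read_length


def is_properly_spaced(
    start1: int, start2: int, read_length: int, expected_fragment: int, tolerance: int
) -> bool:
    """True if the observed fragment length is within tolerance of the expected one."""
    observed = fragment_length(start1, start2, read_length)
    return abs(observed - expected_fragment) <= tolerance


print(mate_inner_distance(300, 100))                  # 100
print(is_properly_spaced(1000, 1200, 100, 300, 50))   # True, fragment is 300 bp
print(is_properly_spaced(1000, 6200, 100, 300, 50))   # False, fragment is 5300 bp
```

Mappers differ in whether they ask for the full fragment length or the inner distance between mates, so it is worth checking which quantity a given tool expects.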
Why PE reads are great
  [Figure: read alignments labeled "2 Mismatches" and "Exact Match"]
Purdue University Discovery Park
RNAseq Pipeline
  TopHat
  Cufflinks
  Cuffcompare
  Cuffdiff
  cummeRbund
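A rough sketch of how the command-line stages of this pipeline chain together is shown below. It only builds the argument lists and does not run anything. All file names, directory names, the thread count, and the inner distance are hypothetical, and it assumes one paired-end library per condition; check each tool's manual for the options supported by your installed version. cummeRbund is an R package that reads the Cuffdiff output directory, so it is not part of the command list.

```python
def tuxedo_commands(index, annotation_gtf, samples, threads=4, inner_dist=100):
    """Build argument lists for TopHat, Cufflinks, Cuffcompare, and Cuffdiff.

    samples maps a condition label to a (reads_1, reads_2) FASTQ pair.
    """
    commands, bams, assemblies = [], [], []
    for label, (reads_1, reads_2) in samples.items():
        tophat_dir = f"tophat_{label}"        # hypothetical output directory
        cufflinks_dir = f"cufflinks_{label}"  # hypothetical output directory
        bam = f"{tophat_dir}/accepted_hits.bam"
        # Splice-aware alignment of the paired reads.
        commands.append(["tophat", "-p", str(threads), "-r", str(inner_dist),
                         "-o", tophat_dir, index, reads_1, reads_2])
        # Transcript assembly from the alignments.
        commands.append(["cufflinks", "-p", str(threads), "-o", cufflinks_dir, bam])
        bams.append(bam)
        assemblies.append(f"{cufflinks_dir}/transcripts.gtf")
    # Compare assemblies against the reference annotation.
    commands.append(["cuffcompare", "-r", annotation_gtf, "-o", "cuffcmp", *assemblies])
    # Differential expression between conditions.
    commands.append(["cuffdiff", "-p", str(threads), "-o", "cuffdiff_out",
                     "-L", ",".join(samples), "cuffcmp.combined.gtf", *bams])
    return commands


# Hypothetical inputs: a Bowtie index prefix, an annotation, and two conditions.
samples = {
    "control": ("control_1.fastq", "control_2.fastq"),
    "treated": ("treated_1.fastq", "treated_2.fastq"),
}
for command in tuxedo_commands("genome_index", "genes.gtf", samples):
    print(" ".join(command))
```

Printing the commands before running them makes it easy to review the whole plan; passing each list to `subprocess.run` would execute it.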
There are other options
Not all software is created equal
RNAseq Best Practices
  Platform: Illumina HiSeq
  Read length: minimum of 50 bp; 100 bp is better
  Paired-end or single: PE
  Read depth: 30-40 million/sample
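The read-depth target above translates directly into how many samples can share a lane. The sketch below does that arithmetic; the lane yield of 200 million reads and the 12-sample experiment are assumed figures for illustration, so substitute the numbers quoted by your sequencing facility.

```python
import math


def samples_per_lane(lane_yield_reads: int, reads_per_sample: int) -> int:
    """How many samples fit in one lane at the target depth."""
    return lane_yield_reads // reads_per_sample


def lanes_needed(n_samples: int, lane_yield_reads: int, reads_per_sample: int) -> int:
    """Number of lanes required to sequence every sample at the target depth."""
    per_lane = samples_per_lane(lane_yield_reads, reads_per_sample)
    if per_lane == 0:
        raise ValueError("target depth exceeds the yield of a single lane")
    return math.ceil(n_samples / per_lane)


LANE_YIELD = 200_000_000  # assumed reads per lane, not a figure from the slides

print(samples_per_lane(LANE_YIELD, 30_000_000))   # 6
print(lanes_needed(12, LANE_YIELD, 30_000_000))   # 2
print(lanes_needed(12, LANE_YIELD, 40_000_000))   # 3
```

Moving from 30 to 40 million reads per sample costs an extra lane in this example, which is the kind of trade-off against adding replicates that the design questions earlier in the deck raise.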
RNAseq Best Practices
  Number of biological replicates: 3 or more, as cost allows
  Experimental design: balanced block
  What type of alignment: TopHat (highly confident and splice aware)
  Unique or multiple mapping: unique (70-90% mapping rate)
  Analysis method: use more than one approach
  Know the limits of the experiment
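One common reading of a balanced block design is that every lane carries one barcoded replicate of every condition, so lane effects are not confounded with the treatment. The sketch below lays out such a design; the condition names and lane labels are hypothetical.

```python
def balanced_block_design(conditions, n_replicates):
    """Assign replicate i of every condition to lane i."""
    return {
        f"lane_{i}": [f"{condition}_rep{i}" for condition in conditions]
        for i in range(1, n_replicates + 1)
    }


design = balanced_block_design(["control", "treated"], n_replicates=3)
for lane, libraries in design.items():
    print(lane, libraries)
# lane_1 ['control_rep1', 'treated_rep1']
# lane_2 ['control_rep2', 'treated_rep2']
# lane_3 ['control_rep3', 'treated_rep3']
```

The design to avoid is the opposite layout, with all controls in one lane and all treated samples in another, where a lane effect would be indistinguishable from a treatment effect.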