Quantitative RNA Sequencing (RNA-seq) and Exome Analysis

Richard A. Radcliffe, Ph.D.
Professor of Pharmacology
School of Pharmacy, Department of Pharmaceutical Sciences
Room V20-3124
(303) 724-3362
richard.radcliffe@ucdenver.edu

Why RNA-seq?
  [Figure labels: genetic architecture, developmental stage, environmental influences, tissue type, disease state, phenotype]
  Crick (1970) Nature 227:561-563

Why RNA-seq?
  Understanding the transcriptome is essential for interpreting the functional elements of the genome and revealing the molecular constituents of cells and tissues, and also for understanding development and disease.
  Catalogue all species of transcript, including mRNAs, non-coding RNAs and small RNAs
  Determine the transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing patterns and other post-transcriptional modifications
  Quantify the changing expression levels of each transcript during development and under different conditions
  Pathway/network/ontology analysis
  Massively parallel expression analysis
  Wang et al. (2009) Nat Rev Genetics 10:57-63

RNA-seq Overview
  Select fraction of interest
  Library prep
  Sequence and map to reference genome
  Analysis (QC, quantitation, transcript annotation)
  Adapted from: Pepke et al. (2009) Nat Methods 6:S22-S32

Library Prep
  Corney (2013) Mater Methods 3:203

Library Prep: Some Considerations
  RNA fraction
    Many different RNA species
    Poly(A)
    Size (<200 nt vs. >200 nt)
  Strandedness
  Read length
  Single- vs. paired-end
  Multiplexing

RNA Fraction
  [Figure: transcribed genomic distribution and total RNA distribution; labels: ~80%, ~15%, transcribed, both strands transcribed]
  Mattick & Makunin (2006) Hum Mol Genet 1:R17-29
  Genomes, 2nd Edition, Oxford: Wiley-Liss, 2002

Library Prep: Some Considerations
  Strandedness
    Overlapping transcripts
    Annotation of novel transcripts

Slide 7 note (RR34): The area of the box represents the genome. The area of the large green circle is equivalent to the documented extent of transcription, with the darker green area corresponding to transcription on both strands. CDSs are protein-coding sequences, and UTRs are 5′- and 3′-untranslated sequences in mRNAs. The dots indicate (and in fact overstate) the proportion of the genome occupied by known snoRNAs and miRNAs. Richard Radcliffe, 1/26/2015

Strandedness
  [Figure: overlapping genes Ncstn (-) and Copa (+); transcription, then DS library prep vs. SS library prep, then alignment]
  DS library prep: Which strand (gene) did the fragment come from?
  SS library prep: No question about which strand (gene) the fragment came from.
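
The strandedness question can be sketched in a few lines of Python. This is only an illustration, not a real read counter: the gene coordinates are invented so that the two genes overlap, the function name `candidate_genes` is hypothetical, and the assumption that a read reports the strand of its original transcript depends on the library protocol (some stranded kits report the opposite strand).

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


# Hypothetical coordinates, chosen only so that the two genes overlap.
GENES = [Gene("Ncstn", 100, 500, "-"), Gene("Copa", 400, 900, "+")]


def candidate_genes(pos, read_strand, stranded):
    """Return the names of genes that could have produced a read at `pos`."""
    hits = [g for g in GENES if g.start <= pos <= g.end]
    if stranded:
        # Assumption: the read reports the strand of the original transcript.
        hits = [g for g in hits if g.strand == read_strand]
    return [g.name for g in hits]


# A read falling in the overlap region, aligned to the minus strand.
unstranded_hits = candidate_genes(450, "-", stranded=False)
stranded_hits = candidate_genes(450, "-", stranded=True)
print(unstranded_hits)  # both genes remain candidates
print(stranded_hits)    # only the minus-strand gene remains
```

Without strand information the read is ambiguous between the two genes; with it, the ambiguity disappears.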

Read Length
  Read length is related to:
    Sequencing accuracy: quality declines as a function of the length of a read
    Mapping accuracy: the longer the read, the more accurately it maps

Single- vs. Paired-end
  Zhernakova et al. (2013) PLoS Genet e1003594

Mapping to the Reference Genome
  Read:
    @HWUSI-EA541_0032:1:2:0:325#0
    CCATCTTTTTGATGTCCGCAATGATTT
    +
    WTORTSOQXTVVYXRXXXVPTXXXWUUL
  Alignment:
    @HWUSI-EA541_0032:1:2:0:325#0 - chr7 13619194 CCATCTTT
  Aligners: Bowtie, BWA
  Computational considerations
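
A read like the one above is a four-line FASTQ record: name, sequence, separator, and one quality character per base. The sketch below shows one way to parse such records and turn quality characters into Phred scores. The record used here is made up, the function names are hypothetical, and the ASCII offset is an assumption: current Illumina output uses 33, while older pipelines used 64, so check which applies to your data.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality string) from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert quality characters to Phred scores (offset is an assumption)."""
    return [ord(c) - offset for c in qual]


def error_probability(q):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Made-up record for illustration only.
example = io.StringIO("@read1\nACGT\n+\nIIH5\n")
name, seq, qual = next(read_fastq(example))
scores = phred_scores(qual)
print(name, seq, scores)
print(error_probability(scores[-1]))
```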

Mapping to the Genome: Some Considerations
  Non-unique reads
    Gene families
    Repeat sequences (simple repeats, transposons)
  Depth
    Probability of representation & limits of detection
    Transcript isoform quantification
    Variant calling (SNPs, small indels)
  Reference genome effects

Non-unique Reads
  [Figure: fraction of reads suppressed (%, 0 to 20) and number of alignments (10^6, 0 to 250) versus number of multiple alignment reads allowed (bowtie option -m, 10^0 to 10^5)]
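
The trade-off in the figure can be mimicked with a toy filter. In Bowtie, `-m` suppresses all alignments for a read that has more than the given number of reportable alignments; the sketch below applies the same rule to an invented list of (read, location) pairs. The function name and the data are hypothetical, and this is not how Bowtie is implemented internally.

```python
from collections import Counter


def suppress_multireads(alignments, m):
    """Keep alignments only for reads with at most `m` alignments.

    Returns the kept alignments and the fraction of reads suppressed.
    """
    counts = Counter(read for read, _ in alignments)
    kept = [(read, loc) for read, loc in alignments if counts[read] <= m]
    suppressed = sum(1 for c in counts.values() if c > m)
    return kept, suppressed / len(counts)


# Invented alignments: r1 maps once, r2 twice, r3 three times.
alignments = [
    ("r1", "chr1:100"),
    ("r2", "chr2:50"), ("r2", "chr5:70"),
    ("r3", "chr3:10"), ("r3", "chr3:900"), ("r3", "chr8:20"),
]

kept_m1, frac_m1 = suppress_multireads(alignments, 1)
kept_m2, frac_m2 = suppress_multireads(alignments, 2)
print(len(kept_m1), frac_m1)
print(len(kept_m2), frac_m2)
```

Raising `m` lowers the fraction of reads suppressed but increases the number of alignments that must be handled downstream.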

Non-unique Reads: Gene Families

Non-unique Reads: Repeats

Mapping to the Genome: Some Considerations
  Non-unique reads
    Gene families
    Repeat sequences (simple, SINEs, LINEs, etc.)
  Depth
    Probability of representation & limits of detection
    Transcript isoform quantification
    Variant calling (SNPs, small indels)
  Reference genome effects

Depth: Transcript Quantification
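
One rough way to think about "probability of representation" is a Poisson model: if a transcript contributes a fraction f of all reads, a run of N reads is expected to yield N × f reads from it, and the chance of seeing at least k of them follows from the Poisson distribution. The Poisson model is a simplifying assumption introduced here for illustration (it ignores length bias, mappability and overdispersion), and the abundance value is made up.

```python
import math


def expected_reads(total_reads, fraction):
    """Expected number of reads from a transcript contributing `fraction` of reads."""
    return total_reads * fraction


def prob_at_least(k, expected):
    """P(X >= k) for X ~ Poisson(expected)."""
    below = sum(
        math.exp(-expected) * expected ** i / math.factorial(i) for i in range(k)
    )
    return 1.0 - below


# Hypothetical rare transcript: one read in ten million comes from it.
fraction = 1e-7
for depth in (1_000_000, 10_000_000, 50_000_000):
    lam = expected_reads(depth, fraction)
    print(depth, round(prob_at_least(1, lam), 3))
```

Under this model, detection probability rises steeply with depth for rare transcripts, which is why the limit of detection is tied to sequencing depth.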

Depth: Variant Calling

Reference Genome Effects
  [Figure tracks: RNA-seq: ISS (ISS genome); RNA-seq: ISS (mm10 genome); ILS DNA sequencing; ISS DNA sequencing; gene annotations]

Analysis
  QC
  Assembly/Quantification
    Reads Per Kilobase of Exon per Million Mapped Reads (RPKM)
  Differential expression
  Pathway/network functional analysis
  Annotation
    Novel exons, novel splice junctions, novel genes

Quality Control
  Pre-library construction:
    RNA quality
  Pre-alignment:
    Per base quality
    Per read quality
    Nucleotide distribution per position
    GC content
    Sequence over-representation
  Post-alignment:
    Mean coverage, 5′ to 3′ and 3′ to 5′
    Ribosomal RNA contamination
    Percent mapped reads

Quality Control: RNA Degradation
  [Figure labels: 28S, 18S]

Quality Control
  [Figure panels: quality per position, quality per read, nucleotide distribution]
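
Several of the pre-alignment checks reduce to simple per-position summaries. The following is a minimal sketch, assuming Phred+33 quality encoding and a handful of invented reads; dedicated QC tools compute these and many more metrics, and the function names here are hypothetical.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)


def mean_quality_per_position(quals, offset=33):
    """Mean Phred score at each read position (offset is an assumption)."""
    length = max(len(q) for q in quals)
    means = []
    for i in range(length):
        column = [ord(q[i]) - offset for q in quals if len(q) > i]
        means.append(sum(column) / len(column))
    return means


def base_counts_per_position(seqs):
    """Nucleotide counts at each read position."""
    length = max(len(s) for s in seqs)
    return [Counter(s[i] for s in seqs if len(s) > i) for i in range(length)]


# Invented reads and quality strings for illustration.
seqs = ["ACGT", "AGGT", "ACGA"]
quals = ["IIII", "II55", "I5I5"]

means = mean_quality_per_position(quals)
counts = base_counts_per_position(seqs)
print([round(m, 1) for m in means])
print([gc_content(s) for s in seqs])
print(counts[1])
```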

Assembly/Quantification: RPKM
  [Figure label: 3.18]
  RPKM = C / (L × N)
  where C is the number of reads mapped to the gene's exons, L is the total exon length in kilobases, and N is the total number of mapped reads in millions.
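
As a worked example of the RPKM calculation, the sketch below takes the exon length in base pairs and the total mapped reads as a raw count, so the kilobase and per-million scaling appear as a single factor of 10^9. The function name and the input numbers are made up for illustration and are not taken from the slide.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads.

    Equivalent to C / (L * N) with L in kilobases and N in millions.
    """
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on 2,000 bp of exon, 20 million mapped reads.
value = rpkm(500, 2_000, 20_000_000)
print(value)  # 500 / (2 kb * 20 M)
```

Because both length and library size are divided out, RPKM values can be compared between genes of different lengths and between libraries of different depths, within the limits of the normalization.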

Differential Expression
  [Figure: example gene Hddc3]

Pathway/Network Functional Analysis
  Weighted Gene Co-expression Network Analysis (WGCNA)
  Gene Ontology (GO)
  Cluster Analysis
  Darlington et al. (2013) Genes Brain Behav 12:263-274
  Bennett et al. (2015) Alcohol Clin Exp Res NIHMS658870

Annotation

Exome Sequencing
  Why
    Identification of variants (SNPs, CNVs, small indels)
    Linkage/association/pedigree studies
    Clinical diagnostics
  How
    Isolate, fragment DNA
    Build library
    Exome enrichment
    Sequence
    Align to reference genome
    Variant calling
    Higher order genetic analysis

Exome Enrichment
  www.genomics.agilent.com

Variant Calling
  Altmann et al. (2012) Hum Genetics 131:1541-1554

Slide 40 note (RR1): Examples of intragenic deletion and duplication detected by WES and confirmed by exome aCGH. Each bar in graphs (a)-(c) and (e)-(g) represents an exon. (a-c) WES data from a family trio in which the (a) proband has inherited a whole-gene duplication of KRT34 from the (b) father, whereas the (c) mother shows normal copy number at that gene. (e-g) WES data from a family trio in which the (e) proband has inherited a partial-gene heterozygous deletion in the SYCP2L gene from the (g) mother, whereas the (f) father shows normal copy number at those exons. Each dot in panels d and h represents an oligonucleotide probe in the gene of interest on the exome array, with a duplication shown by probes deviating to a positive log2 ratio (marked in red) and a deletion shown by probes deviating to a negative log2 ratio (marked in green). Panels d and h show confirmation of the KRT34 duplication and the SYCP2L deletion, respectively, by exome aCGH. aCGH, array comparative genomic hybridization; WES, whole-exome sequencing. Radcliffe, Richard, 2/1/2015

Variant Calling: CNVs/Indels
  [Figure panels: child, father, mother]
  Retterer et al. (2014) Genetics Med doi:10.1038/gim.2014

Genetic Analysis: Mendelian Inheritance
  Assumptions:
    Only consider small indels and SNPs
    Causal variants are coding
    Causal variants alter protein sequence
    Near complete penetrance
  Rabbani et al. (2012) J Hum Genetics 57:621-632
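
The assumptions above translate naturally into a filter over trio variant calls. The sketch below is illustrative only: the variant records, field names, effect categories and the two inheritance models (autosomal recessive and de novo dominant) are assumptions made for the example, not part of the cited work, and real pipelines work from annotated VCF files with many additional quality and frequency filters.

```python
# Effect labels treated as coding and protein-altering (hypothetical categories).
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift"}


def fits_recessive(v):
    """Child homozygous for the variant, both parents heterozygous carriers."""
    return v["child"] == "1/1" and v["father"] == "0/1" and v["mother"] == "0/1"


def fits_de_novo(v):
    """Child heterozygous, variant absent from both parents."""
    return v["child"] == "0/1" and v["father"] == "0/0" and v["mother"] == "0/0"


def candidate_variants(variants, model):
    """Apply the slide's assumptions plus an inheritance model to trio calls."""
    return [
        v["id"]
        for v in variants
        if v["type"] in {"SNP", "indel"}        # small variants only
        and v["effect"] in PROTEIN_ALTERING     # coding and protein-altering
        and model(v)                            # genotype pattern fits the model
    ]


# Invented trio calls for illustration.
variants = [
    {"id": "v1", "type": "SNP", "effect": "missense",
     "child": "1/1", "father": "0/1", "mother": "0/1"},
    {"id": "v2", "type": "SNP", "effect": "synonymous",
     "child": "1/1", "father": "0/1", "mother": "0/1"},
    {"id": "v3", "type": "indel", "effect": "frameshift",
     "child": "0/1", "father": "0/0", "mother": "0/0"},
    {"id": "v4", "type": "SNP", "effect": "missense",
     "child": "0/1", "father": "0/1", "mother": "0/0"},
]

recessive_hits = candidate_variants(variants, fits_recessive)
de_novo_hits = candidate_variants(variants, fits_de_novo)
print(recessive_hits)
print(de_novo_hits)
```

The near-complete penetrance assumption is what justifies treating unaffected parents as informative: a heterozygous parent rules out a dominant model for that variant.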

Genetic Analysis
  Ku et al. (2012) Ann Neurology 71:5-14

A Few References

RNA-seq:
  Griffith M, Griffith OL, Mwenifumbo J, Goya R, Morrissy AS, Morin RD, Corbett R, Tang MJ, Hou YC, Pugh TJ, Robertson G, Chittaranjan S, Ally A, Asano JK, Chan SY, Li HI, McDonald H, Teague K, Zhao Y, Zeng T, Delaney A, Hirst M, Morin GB, Jones SJ, Tai IT, Marra MA (2010) Alternative expression analysis by RNA sequencing. Nat Methods 7:843-847.
  Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5:621-628.
  Munger SC, Raghupathy N, Choi K, Simons AK, Gatti DM, Hinerfeld DA, Svenson KL, Keller MP, Attie AD, Hibbs MA, Graber JH, Chesler EJ, Churchill GA (2014) RNA-Seq alignment to individualized genomes improves transcript abundance estimates in multiparent populations. Genetics 198:59-73.
  Oshlack A, Robinson MD, Young MD (2010) From RNA-seq reads to differential expression results. Genome Biol 11:220.
  Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57-63.

Exome sequencing:
  Altmann A, Weber P, Bader D, Preuß M, Binder E, Müller-Myhsok B (2012) A beginners guide to SNP calling from high-throughput DNA-sequencing data. Hum Genet 131:1541-1554.
  Biesecker LG, Green RC (2014) Diagnostic clinical genome and exome sequencing. N Engl J Med 370:2418-2425.
  Krumm N, Sudmant PH, Ko A, O'Roak BJ, Malig M, Coe BP, Quinlan AR, Nickerson DA, Eichler EE (2012) Copy number variation detection and genotyping from exome sequence data. Genome Res 22:1525-1532.
  Majewski J, Schwartzentruber J, Lalonde E, Montpetit A, Jabado N (2011) What can exome sequencing do for you? J Med Genet 48:580-589.
  Singleton AB (2011) Exome sequencing: a transformative technology. Lancet Neurol 10:942-946.