Precise™ Whitepaper
Introduction

LIMITATIONS OF EXISTING RNA-SEQ METHODS

Correctly designed gene expression studies require large numbers of samples, accurate results, and low analysis costs. Analysis tools have evolved from hybridization array-based technologies to quantitative real-time reverse transcription PCR (qRT-PCR) and, most recently, RNA-Seq. Though RNA-Seq has become the gold standard and can be used to determine relative transcript abundance, it is costly, time-intensive, and comparatively inaccurate, and it requires considerable expertise for assay design, performance, and data analysis.

Generating libraries for mRNA sequencing is a laborious process involving many steps, with loss of precious sample at every step. The presence of high-abundance RNAs (rRNA, etc.) has required the inclusion of methods to reduce this background RNA and/or enrich for poly-A mRNAs. Although these methods improve data quality, they add to the labor and time required as well as to the loss of original mRNA sample. The efficiency of the mRNA-Seq library process has been shown to be universally low due to RNA loss, inefficient reverse transcription, a multitude of intermediate cleanup steps, and inhibition of the enzymatic steps due to carryover. This low efficiency has led to the requirement for restrictively high amounts of initial sample and/or many rounds of PCR amplification, which have been shown to distort abundance measurements because amplification efficiency differs between transcripts. Taken together, these limitations challenge the use of standard RNA-Seq library prep methods for precise, accurate, and reliable mRNA quantitation.

These challenges are compounded in situations with limited sample, such as needle biopsies, or when studying rare transcripts. While many methods have been described for RNA-Seq from limited quantities, they all require substantial pre-amplification steps prior to library construction that limit their sensitivity and reproducibility and further exacerbate the previously mentioned amplification bias. Because of the count-based nature of RNA-Seq data, sufficient sequencing reads must be collected from each transcript studied to accurately measure its abundance in the library. To accurately measure rare transcripts, many resort to deeper sequencing to increase the number of reads in hopes of finding and sequencing a rare molecule, which presents challenges to data analysis, reduces sample throughput, and can quickly become cost prohibitive.

RNA-Seq is especially inefficient if gene markers have already been identified. While qRT-PCR-based methods can be used in these situations, they are severely limited in sample throughput and in the number of targets that can be assayed. Although some current instrumentation allows for multiplexed, multi-analyte assays, these assays require optimization, with the result that most users examine 1-4 genes at a time in any given reaction. These limitations have driven the need for higher-throughput and higher-content targeted sequencing approaches.
What are Precise™ Assays

To address these problems, Cellular Research has designed a novel product specifically for targeted mRNA sequencing. The product, called Precise assays, is designed to examine hundreds to thousands of genes in a high-throughput manner using as little as 100 pg of total RNA, or less, enabling users to make the most of their precious samples. Based on Cellular Research's patented Molecular Indexing technology, Precise assays provide absolute quantitation of target transcripts in an easy-to-follow workflow. Precise assays combine molecular and sample indexing in 96-sample and 384-sample formats, enabling customers to sequence up to 4,608 samples at once without new equipment or extensive training. The assays focus on specific genes and pathways of interest, delivering unprecedented accuracy and sensitivity for low-expression targets in rare and limiting samples. Combined with robust design and analysis pipelines, Precise assays deliver a simple turnkey solution for customers looking for a targeted gene expression solution.

Cellular Research, Inc. Precise™ Assays Whitepaper
Workflow

Precise assays utilize a pool of 6,561 unique molecular indexes to label all poly-A mRNA in a human sample prior to the RT step. These pools of molecular indexes come pre-aliquoted in a 96-well plate. This pre-RT labelling step allows individual RNA molecules to be tracked and counted independent of any bias introduced by PCR amplification. The assays are designed for quantifying 100-1,000 targets in purified total RNA, and can be optimized to analyze cell extracts, including single-cell extracts, without the need for nucleic acid purification, poly-A enrichment, or rRNA depletion. The schematic below shows the steps involved.

The poly-T tails of the molecular indexes are annealed to 96 samples in 96-well (or 384-well) plate format immediately prior to RT. The molecular indexes also include a universal PCR sequence on the 5' end to provide a template for subsequent PCR amplification steps. The pool of molecular indexes also includes a sample/well barcode, and plate barcodes can be added to de-multiplex complex data sets. This level of barcoding allows all of the samples to be combined into a single tube for all subsequent steps, which eases handling and reduces reagent costs.

After pooling (96 or 384 samples into a single tube), the resulting single-stranded cDNAs undergo second-strand synthesis with a single amplification step using gene-specific primers. The resulting PCR products are size selected and purified with SPRI beads, and serve as the template for a second, nested PCR step with a multiplex set of nested PCR primers that include Illumina primer sequences. Only a fraction of the original cDNA is required for the specific amplification; the remaining cDNA is effectively archived, as it contains all of the transcribed mRNA material and can be used for testing additional genes or targets with no need to return to the original samples. After a final size selection with beads, the amplicons are ready for sequencing.
Up to twelve 384-well plates can be combined and run on a single Illumina HiSeq run using 2 x 150 bp reagents. An analysis of 96 genes in 96 samples can be combined and run on a single Illumina MiSeq run using either 2 x 150 bp or 2 x 75 bp reagents. The entire library preparation process takes ~5 hours in total, with ~1 hour of hands-on time, for a 96-well plate. The MiSeq run takes ~20 hours, with an additional 3 hours for data analysis.

1. REVERSE TRANSCRIPTION
The Reverse Transcription process encodes each mRNA molecule with a unique Molecular and Sample Index, enabling all 96 or 384 samples to be combined into a single tube.
[Schematic: poly-A tail of total RNA (AAAAAAA) annealed to the index oligo TTTTTTTTNNNNNNNNXXXXXXXXXXXXXXXX; labels: Sample ID, Molecular Index, Universal PCR primer]

2. MULTIPLEX PCR AMPLIFICATION
Because the reverse transcription reaction encodes each individual mRNA molecule with a unique Molecular and Sample Index, each molecule retains its unique representation and identity regardless of any amplification and preparation bias.
[Schematic label: Primers to 10-100 genes]

3. NESTED, GENE-SPECIFIC PCR AMPLIFICATION
The second amplification step incorporates sequencing adaptors into the final PCR product, utilizing nested primers for additional target specificity.
[Schematic label: Primers to 10-100 genes]
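The index layout described in the workflow can be illustrated with a minimal Python sketch. It assumes an oligo ordered 5' to 3' as universal PCR sequence, sample index, molecular index, poly-T tail. The field order, the 8-base field lengths, the example primer sequence, and the names `parse_index_read` and `IndexRead` are all hypothetical; none of them is taken from the Precise kit specifications.

```python
from typing import NamedTuple, Optional

# Hypothetical field lengths: the whitepaper does not state them.
SAMPLE_INDEX_LEN = 8
MOLECULAR_INDEX_LEN = 8
MIN_POLY_T = 6  # minimum run of T's required after the molecular index


class IndexRead(NamedTuple):
    sample_index: str
    molecular_index: str


def parse_index_read(read: str, universal_primer: str) -> Optional[IndexRead]:
    """Split a read laid out as primer + sample index + molecular index + poly-T.

    Returns None when the read does not match the assumed layout.
    """
    if not read.startswith(universal_primer):
        return None
    body = read[len(universal_primer):]
    end_sample = SAMPLE_INDEX_LEN
    end_molecular = end_sample + MOLECULAR_INDEX_LEN
    if len(body) < end_molecular + MIN_POLY_T:
        return None
    # The bases right after the molecular index must all be T (the poly-T tail).
    if set(body[end_molecular:end_molecular + MIN_POLY_T]) != {"T"}:
        return None
    return IndexRead(body[:end_sample], body[end_sample:end_molecular])


if __name__ == "__main__":
    primer = "ACGTACGT"  # placeholder sequence, not the real universal primer
    print(parse_index_read(primer + "AAAACCCC" + "GGGGTTTT" + "TTTTTTTT", primer))
```

Once every read carries a sample index and a molecular index, samples can be pooled into one tube and separated again in software, which is the property the workflow relies on.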
In Depth

Designing primer sets for Precise™ assays

Cellular Research designs assays on a custom basis. Users specify the gene panels they are interested in, and Cellular Research performs primer design, selection, initial QC, and further method development (if needed). Multiplex assays targeting 100-1,000 genes are readily designed with a 90% success rate using standard defined primer parameters. Additional optimization can raise the success rate and allow more difficult genes to be included. Primer stability, size, Tm, GC content, and amplicon size are all taken into account, and primers are designed so that they do not overlap genomic repeats, low-complexity sequences, or hairpins. Assays are also QC'd to prevent self-dimers and mis-priming, and are cross-checked against all relevant assay components as part of the primer design process.

Data analysis provided via Seven Bridges analysis platform

Precise assays are designed for simple analysis with the Seven Bridges cloud-based analysis pipeline. MiSeq FASTQ files are uploaded directly to the Seven Bridges Genomics (SB Genomics) data analysis platform. The SB Genomics platform de-multiplexes the sequences by sample plate (using Illumina plate barcodes) and by sample/well ID (using Cellular Research's sample-specific barcodes). A proprietary algorithm performs barcode error correction and data polishing (Illumina trimming, corrections for primer-dimers, etc.). An internal Bowtie tool within the SB Genomics platform performs sequence alignments and calculates the percentage of mapped reads. Following alignment, an additional algorithm interrogates the molecular indexes and de-convolutes any bias introduced by the PCR steps, allowing for absolute transcript quantification.
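The difference between counting reads and counting molecular indexes can be sketched in a few lines of Python. This is a simplified illustration, not the proprietary Seven Bridges algorithm: it assumes reads have already been de-multiplexed, aligned, and reduced to hypothetical `(sample_id, gene, molecular_index)` records, and it omits barcode error correction entirely.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record layout: (sample_id, gene, molecular_index).
Record = Tuple[str, str, str]


def count_reads_and_molecules(
    records: Iterable[Record],
) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """Return {(sample_id, gene): (read_count, unique_molecular_index_count)}."""
    reads: Dict[Tuple[str, str], int] = defaultdict(int)
    indexes: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for sample_id, gene, molecular_index in records:
        key = (sample_id, gene)
        reads[key] += 1                      # every read counts once
        indexes[key].add(molecular_index)    # PCR copies share one index
    return {key: (reads[key], len(indexes[key])) for key in reads}


if __name__ == "__main__":
    example = [
        ("S1", "GeneA", "AAAA"),
        ("S1", "GeneA", "AAAA"),  # PCR duplicate of the first molecule
        ("S1", "GeneA", "CCCC"),
        ("S2", "GeneA", "AAAA"),
    ]
    for key, (n_reads, n_molecules) in count_reads_and_molecules(example).items():
        print(key, "reads:", n_reads, "molecules:", n_molecules)
```

In this toy input, sample S1 has three reads for GeneA but only two distinct molecular indexes, so the molecule count is two regardless of how many times each molecule was amplified.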
The SB Genomics data output from the analysis of a Precise experiment includes a Read report (raw read counts for multiple genes in multiple samples) and a Molecular report (barcode-corrected, or absolute, copy numbers for multiple genes in multiple samples). Examples of these reports are provided below.
Read counts

The following is an example of a Read report generated via the Seven Bridges Genomics platform. The data set shown is de-multiplexed gene expression data showing the raw read count; it does not take the molecular indexes into account. The data set shown is from the analysis of 12 genes in 12 samples. The size of this data set was limited for demonstration purposes. An actual experiment can quantify 96 genes in 96 samples on a single MiSeq run.

Sample        1      2      3      4      5      6      7      8      9     10     11     12
Gene 1     1670   1675   1805   1649   1825   1734   1932   1972   1915   2068   1532   1632
Gene 2    22856  21766  23930  22261  23478  23883  26572  25145  23766  27305  19954  21670
Gene 3     3631   3382   3738   3336   3490   3707   3974   3794   3697   4169   3057   3342
Gene 4    12973  12705  13760  12614  13597  13857  15030  14432  13627  16041  11671  12489
Gene 5     1410   1448   1506   1392   1463   1532   1618   1601   1483   1717   1261   1394
Gene 6      802    709    837    796    790    856    920    860    792    956    701    761
Gene 7     9903   9714  10550   9885  10354  10583  11514  11282  10536  12213   8773   9747
Gene 8      771    764    812    776    765    782    896    868    779    969    664    761
Gene 9      592    624    649    620    665    683    705    665    637    747    534    570
Gene 10     444    425    498    455    472    500    570    542    479    571    401    437
Gene 11     834    863    891    853    932    962    995    974    870   1053    758    850
Gene 12     364    317    332    316    358    350    426    393    375    413    305    318
Read counts corrected with molecular indexes

Seven Bridges also generates a Molecular report with a molecular-index-corrected data set. The data set is de-multiplexed gene expression data showing the absolute quantitation of 12 genes in 12 samples. The size of this data set was also limited for demonstration purposes.

Sample        1      2      3      4      5      6      7      8      9     10     11     12
Gene 1     1238   1269   1382   1286   1400   1325   1482   1519   1453   1557   1185   1284
Gene 2     5173   5374   5585   5428   5694   5466   5834   5557   4447   5828   5531   5559
Gene 3     2112   2106   2298   2138   2209   2258   2468   2356   2289   2488   1978   2123
Gene 4     4260   4560   4656   4501   4710   4591   4934   4912   4599   5001   4512   4618
Gene 5     1103   1167   1211   1152   1190   1208   1328   1315   1192   1388   1022   1147
Gene 6      626    586    649    628    625    703    770    707    632    770    544    528
Gene 7     3876   4214   4344   4180   4393   4286   4571   4571   4295   4650   4083   4330
Gene 8      592    654    662    626    649    639    734    724    641    785    536    622
Gene 9      461    496    525    492    550    557    584    554    517    596    427    461
Gene 10     373    365    428    384    393    431    484    468    398    491    346    370
Gene 11     659    688    705    693    739    757    785    767    688    834    614    676
Gene 12     272    243    273    243    264    261    321    288    284    310    226    234
Example of a Precise™ Experiment

The data provided in the Read and Molecular reports illustrate a single Precise experiment analyzing 12 genes in 12 samples. In this experiment, sample barcodes were incorporated into the primers for sample tracking. The entire process was performed manually using purified and quality-controlled total RNA samples (Agilent RIN of >7 and OD 260/280 ratio of >1.7). The entire library preparation process was performed with standard laboratory equipment (e.g., multichannel pipettes, PCR machine). Automation was not required. The entire analysis was performed on a single Illumina MiSeq run. Time and cost estimates for the library prep process and sequence analysis are outlined below.

Time and Cost Estimates for Experiment

Step                                              Hands-on   Hands-off   Cost (US$)
Library preparation for 96 genes in 96 samples    1 hour     5 hours     $960
MiSeq sequencing                                  20 min     20 hours    $800
Bioinformatics / data analysis                    30 min     3 hours     ~$5
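The cost estimates above can be turned into per-sample figures with simple arithmetic. The Python sketch below assumes the three listed costs cover one full run of 96 genes in 96 samples and treats the approximate "~$5" analysis cost as exactly $5; the function name `cost_summary` is illustrative only.

```python
from typing import Dict

# Costs in US dollars, taken from the time and cost table.
# The data analysis figure is approximate in the table (~$5).
COSTS_USD: Dict[str, float] = {
    "library_preparation": 960.0,
    "miseq_sequencing": 800.0,
    "data_analysis": 5.0,
}
SAMPLES = 96
GENES = 96


def cost_summary(costs: Dict[str, float], samples: int, genes: int) -> Dict[str, float]:
    """Total cost, cost per sample, and cost per gene measurement."""
    total = sum(costs.values())
    return {
        "total": total,
        "per_sample": total / samples,
        "per_measurement": total / (samples * genes),
    }


if __name__ == "__main__":
    summary = cost_summary(COSTS_USD, SAMPLES, GENES)
    print(f"Total:           ${summary['total']:.2f}")
    print(f"Per sample:      ${summary['per_sample']:.2f}")
    print(f"Per measurement: ${summary['per_measurement']:.2f}")
```

Under these assumptions the run totals $1,765, or roughly $18 per sample and about $0.19 per gene measurement.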
BENEFITS OF THE MOLECULAR INDEXING IN PRECISE ASSAYS

Precise is unique in incorporating molecular indexes to label mRNAs prior to reverse transcription (see workflow diagram). Labelling the mRNA at this early step allows more accurate and precise quantitation by providing a tool to correct for bias introduced by RT and PCR. The initial annealing of molecular barcodes to the poly-A RNA labels the actual number of poly-A transcripts present. Though subsequent PCR efficiencies may result in differential amplification yields, the presence of the barcodes provides a novel means to correct for this potential bias. Additionally, molecular indexing enables users to quickly and easily understand the statistical quality of the results by comparing the starting number of mRNA molecules with the number of reads. While a large number of reads may seem to indicate a statistically accurate answer, if the reads are generated from a small number of starting mRNA molecules, the accuracy may in fact be very low.

Quantitative precision

Take, for example, the data for sample 1 in the Read and Molecular reports. In this example, there are more read counts than molecular counts. For gene 2, there were 22,856 reads but only 5,173 actual unique molecules. Most RNA-Seq methods go no further than counting reads. By also counting the molecular indexes, Precise adjusts for PCR biases to more precisely quantify the transcripts initially present. This is made possible by the incorporation of molecular indexes prior to reverse transcription in Precise assays.
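The comparison between reads and molecules can be worked through with the sample 1 values for Genes 2, 8, and 12 from the Read and Molecular reports. The Python sketch below is only illustrative arithmetic; the helper names `reads_per_molecule` and `fold_difference` are hypothetical.

```python
from typing import Dict

# Sample 1 values from the Read report and the Molecular report.
READ_COUNTS: Dict[str, int] = {"Gene 2": 22856, "Gene 8": 771, "Gene 12": 364}
MOLECULE_COUNTS: Dict[str, int] = {"Gene 2": 5173, "Gene 8": 592, "Gene 12": 272}


def reads_per_molecule(reads: Dict[str, int], molecules: Dict[str, int]) -> Dict[str, float]:
    """Average number of reads observed per unique molecule, per gene."""
    return {gene: reads[gene] / molecules[gene] for gene in reads}


def fold_difference(counts: Dict[str, int], gene_a: str, gene_b: str) -> float:
    """Apparent abundance of gene_a relative to gene_b under one count type."""
    return counts[gene_a] / counts[gene_b]


if __name__ == "__main__":
    for gene, ratio in reads_per_molecule(READ_COUNTS, MOLECULE_COUNTS).items():
        print(f"{gene}: {ratio:.2f} reads per molecule")
    by_reads = fold_difference(READ_COUNTS, "Gene 2", "Gene 8")
    by_molecules = fold_difference(MOLECULE_COUNTS, "Gene 2", "Gene 8")
    print(f"Gene 2 vs Gene 8 by reads:     {by_reads:.1f}x")
    print(f"Gene 2 vs Gene 8 by molecules: {by_molecules:.1f}x")
```

Gene 2 averages about 4.4 reads per molecule, while Genes 8 and 12 average about 1.3. Judged by reads alone, Gene 2 appears roughly 30 times more abundant than Gene 8; judged by molecules, the difference is closer to 9-fold.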
Assessing read depth saturation

The incorporation of molecular indexes prior to reverse transcription also provides a means of assessing whether a library has been completely sequenced, by comparing the data in a Read report with the data in a Molecular report. For example, when running different amounts of a library, if the counts in the Read report and the Molecular report are both increasing, the molecular indexes have not been saturated and the library has not been completely sequenced. In this case, there is more information to be gained. If, however, the molecular counts remain constant while the read counts keep increasing, all of the original molecules in the library have been sequenced and there is no additional information to be gained from additional sequencing.

Benefits for rare transcript analysis

Standard RNA-Seq library methods, which involve a number of enzymatic, purification, depletion, and/or enrichment steps, suffer from sample loss and an overall low efficiency of mRNA library preparation. The problem of signal/sample loss is even more pronounced when analyzing low-copy-number transcripts, which typically require additional manipulations to enrich the signal (e.g., poly-A enrichment) and/or decrease the background (e.g., rRNA depletion), further adding to the sample/signal loss. In many cases, the only option for capturing and sequencing rare transcripts was previously deeper sequencing. For applications involving rare transcripts, Precise offers a more streamlined workflow with fewer chances of sample loss. Precise assays are also targeted and perform in the presence of complex samples without interference from the background; rRNA and other common contaminants are not issues. The Molecular Indexing step further benefits applications involving rare transcripts by providing accurate quantification and drastically reducing PCR bias that could otherwise overshadow the signal from rare transcripts. Most importantly, Precise provides a means for assessing when reading deeper is useful and when it becomes a game of diminishing returns.
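The saturation check described above can be sketched in Python by sub-sampling reads and counting unique molecular indexes at each depth. This is an illustrative sketch, not part of the Precise analysis pipeline: reads are represented simply as a list of molecular index labels for one gene in one sample, and the names `saturation_curve` and `is_saturated`, as well as the 1% tolerance, are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def saturation_curve(
    read_labels: Sequence[str], fractions: Sequence[float], seed: int = 0
) -> List[Tuple[int, int]]:
    """Return (read_depth, unique_molecule_count) for each sub-sampling fraction.

    Reads are shuffled once and prefixes are taken, so the molecule count
    can only stay flat or rise as depth increases.
    """
    shuffled = list(read_labels)
    random.Random(seed).shuffle(shuffled)
    curve = []
    for fraction in sorted(fractions):
        depth = int(len(shuffled) * fraction)
        curve.append((depth, len(set(shuffled[:depth]))))
    return curve


def is_saturated(curve: Sequence[Tuple[int, int]], tolerance: float = 0.01) -> bool:
    """True when the last depth increase added at most `tolerance` new molecules."""
    if len(curve) < 2:
        return False
    previous, latest = curve[-2][1], curve[-1][1]
    return previous > 0 and (latest - previous) / previous <= tolerance


if __name__ == "__main__":
    # Toy library: 50 molecules, each amplified into 40 reads.
    deep = [f"m{i}" for i in range(50) for _ in range(40)]
    # Toy library: 1,000 molecules, each seen only once.
    shallow = [f"u{i}" for i in range(1000)]
    print(saturation_curve(deep, [0.25, 0.5, 1.0]), is_saturated(saturation_curve(deep, [0.5, 1.0])))
    print(saturation_curve(shallow, [0.5, 1.0]), is_saturated(saturation_curve(shallow, [0.5, 1.0])))
```

In the first toy library the molecule count stops rising while reads keep accumulating, so deeper sequencing adds nothing; in the second, reads and molecules rise together, which signals that more sequencing would still recover new molecules.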
Assessing the efficiency of the mRNA library prep process

Precise assays also provide a novel means for assessing the efficiency of the mRNA library preparation process. By default, each 96-well Precise plate contains a known amount of synthetic RNA molecules (~20 copies per well) engineered to contain a pool of 960 molecular barcodes. The barcodes are flanked on both sides by control gene-specific primer annealing sites for the KAN, DAP, and PHE genes. These engineered RNA molecules are subjected to the reverse transcription and subsequent PCRs in the Precise protocol, and provide a tool to track the percentage of the barcodes that make it through the library prep process and are represented in the libraries. The results can be read via direct read-out on Cellular Research's complementary Pixel system without the need to sequence.

The KAN, DAP, and PHE amplicons provide mock transcripts of 405 bp, 481 bp, and 1,092 bp, respectively. Although these controls come pre-aliquoted in Precise kits, they can also be spiked into other mRNA-Seq library generation processes to assess efficiency (i.e., the percentage of poly-A RNA that makes it through the library prep process and is represented in the final library). By gauging the efficiency of the process, users can determine in advance whether lower-copy-number genes are likely to be present in the final library preparation.

The following is a schematic of the RNA-Seq QC construct included in all Precise assays.

[Figure: RNA-Seq library efficiency QC construct; labels: F1, Molecular Index, poly-A tail, R1, R2, Cy3PCR004, PCR Round #1, PCR Round #2]

MiSeq is a registered trademark of Illumina, Inc.