SOLiD System accuracy with the Exact Call Chemistry module


 Coral Fitzgerald
 1 years ago
 Views:
Transcription
1 WHITE PPER 55 Series SOLiD System SOLiD System accuracy with the Exact all hemistry module ONTENTS Principles of Exact all hemistry Introduction Encoding of base sequences with Exact all hemistry Demonstration of increased accuracy with Exact all hemistry onclusion Bioinformatics of Exact all hemistry ppendix : Determination of base calls and quality values ppendix B: The forwardbackward algorithm 5 ppendix : Probe set design 7 Principles of Exact all hemistry Introduction The 55 Series SOLiD System is designed to provide industryleading accuracy through a unique, ligationbased sequencing methodology, Exact all hemistry (E). This distinct sequencing approach builds on SOLiD 4 System chemistry, employing sequential ligation of oligonucleotide probes labeled with one of four fluorescent dyes, whereby each probe assays multiple base positions at a time. While SOLiD 4 System chemistry interrogates every base of the DN template twice using twobase encoded probes, Exact all hemistry performs an additional inspection of the template using a new, threebase encoded probe set. This new probe set is carefully designed to complement twobase encoding and jointly form a redundant errorcorrection code. The set of all dye color measurements, each carrying information about multiple bases, is then used by a specialized decoding algorithm to establish the correct base sequence without knowledge of the reference, even in the presence of measurement errors. This strategy allows for error detection and correction, providing highly accurate results for resequencing, de novo sequencing, and rare variant detection. Encoding of base sequences with Exact all hemistry 55 Series SOLiD System Exact all hemistry is illustrated in Figure. Beads containing singlestranded copies of DN library fragments are attached to the surface of the Flowhip. These sequences are interrogated by fluorescently labeled probes that hybridize and ligate at the boundary of singlestranded and doublestranded DN. The sequencing process is organized into phases called primer rounds. Each primer round consists of hybridization of a sequencing primer with a specific offset, followed by multiple cycles of probe ligation and detection. The primer round is concluded by a reset that melts the primer and extended sequence off the template, preparing the template for the next round. SOLiD 4 System chemistry relies on performing five primer rounds using a twobase encoded probe set. In addition, E follows these five primer rounds with one additional round. This sixth round starts with the same sequencing primer as the fifth round, but uses a new, independent probe set with a specific threebase labeling. Together, the six primer rounds enable unrivaled system accuracy by forming a redundant errorcorrection code. The labeling of both probe sets, shown in Figure, is inspired by a convolutional code, a type of errorcorrection code used in digital communication systems where the encoded symbols are derived from a short subset of consecutive information symbols [. The algebraic formula used to generate them is described in ppendix.
2 Probe Set Probe Set Figure. Ligationbased sequencing with 55 Series SOLiD System Exact all hemistry.
3 Demonstration of increased accuracy with Exact all hemistry To validate this sequencing strategy, we performed SOLiD System sequencing using Exact all hemistry on templated beads containing synthetic DN templates. The color measurements were collected by the instrument and processed with the E base calling algorithms (described in detail in ppendices and B). Because the precise sequence of the DN template was known, this approach specifically allowed for direct assessment of the sequencing accuracy of Exact all hemistry, presented in Figure as a histogram of calibrated Phred scores. s the results demonstrate, Exact all hemistry determines the template sequence with extremely high accuracy, the majority of base calls achieving accuracy in excess of %. Since each base position is interrogated by multiple colors, consistent agreements between multiple color calls significantly increase the confidence in base calls. t the same time, some color errors can be corrected as a result of the singleread redundancy provided by the sixth primer round. 7 Percentage of all base calls Phred score ccuracy 9% 99% 99.9% % % % ccuracy of base calls in Phred scale 6 or above Figure. Histogram of base call accuracies expressed in calibrated Phred scale. Base calls were derived from color measurement through algorithms described in ppendices B and. onclusion Sequencing with Exact all hemistry and the 55 Series SOLiD System clearly demonstrates increased performance and accuracy, which is imperative to highthroughput detection of rare genetic variants in heterogeneous or pooled samples, as well as de novo sequencing. The multibase encoding functionality contributes to lower error rate and reduced systematic noise, resulting in unsurpassed sequencing accuracy even at low coverage. Bioinformatics of Exact all hemistry ppendix : Determination of base calls and quality values Symbol u = [u, u, u,, u K x = [x,, x,,, x,k x B = [x B,, x B,,, x B,K/5 Definition Sequence of bases in a DN fragment. Base u is the last base of the adapter ligated to the 5 end of the fragment, which is known ahead of time. Expected sequence of colors resulting from interrogating u using probe set, assuming no color errors. Expected sequence of colors resulting from interrogating u using probe set, assuming no color errors. y = [y,, y,,, y,k Noisy color measurements of x. y B = [y B,, y B,,, y B,K/5 Noisy color measurements of x B. Table. Mathematical notation for ligationbased sequencing. Symbols and corresponding definitions are indicated.
4 The 55 Series SOLiD System sequencing process can be understood as a transformation of the template base sequence u into a collection of color measurements (see Table for notation). If the colors always could be read out without errors, this conversion would be a deterministic, injective function u (x,x B ), where every possible base sequence translates to a unique color sequence. For example, a sequence u = [TTTTT (last base T of the adapter followed by 5 template bases) is expected to translate into the set of colors in Figure. In the example shown in Figure, 5 unknown template bases (excluding u ) have been converted into 8 color reads. In general, there are 4 8 (~64 billion) possible color sequences of length 8, but only 4 5 (~ billion) possible base sequences of length 5. This means that only one in every 64 possible color sequences corresponds to a base sequence. Such valid color sequences are rare and, due to the design of the probe sets, dissimilar from each other. This makes them easy to distinguish, even in the presence of some measurement noise and errors in color calls.. Primer n Primer n Primer n Primer n Primer n4 Primer n4 x, = x, = bridge probe bridge probe bridge probe bridge probe x,5 = x,4 = x, = x B, = x,7 = x,6 = x, = x,9 = x,8 = x B, = x, = x, = x,5 = x,4 = x, = x B, = Probe Set Probe Set P adapter T T T T T... u u u u u 4 u 5 u 6 u 7 u 8 u 9 u u u u u 4 u 5 B. Probe Set x = [ Base Sequence u = [ T T T T T Probe Set x B = [ Figure. Example of color encoding for base sequence u = [TTTTT. olors are arranged by () order of data acquisition; (B) base position and probe set. The optics, imaging, and image processing subsystems of the 55 Series SOLiD System produce color measurements for each DN fragment and each probe ligation cycle. Measurements for a specific bead (y and y B, see Table ) carry information about the color sequence x, x B, and indirectly about base sequence u. The unknown sequence u, hidden variables x and x B, and observed variables y and y B are linked together through a causal, statistical relation that is key to base calling (illustrated by a Bayesian network in Figure 4). u,u,u,u x, u,u,u,u 4 u,u,u 4,u 5 u,u 4,u 5,u 6 u 4,u 5,u 6,u 7 x, x, x B, x,4 x,5 The 55 Series SOLiD Sequencer performs three steps to determine individual bases and assign quality values to the calls, as illustrated in Figure 5. t the core of this process is the Bayesian inference operation that derives four conditional (posterior) probabilities, P(u i =,,,T y ), for each base position i. The general formula for such derivation is a marginalization of P(u y ) derived through the Bayes theorem from conditional probability P(y u): y, y, y, y B, Figure 4. Bayesian network linking the unknown bases and the observed color measurements. y,4 y,5
5 practical way to efficiently perform the above computation is to use dynamic programming. The 55 Series SOLiD System uses a type of dynamic programming called the forwardbackward algorithm [, which is detailed in ppendix B. The steps preceding and following the Bayesian inference are much simpler. The measurement model, evaluated before Bayesian inference, converts each color measurement y i into four individual color likelihoods P(y i x i =,,, ), which are later used within the forwardbackward algorithm to calculate P(y u). Distributions P(y i x i ) are well characterized within the 55 Series SOLiD Sequencer image processing system. Finally, the product of Bayesian inference, probabilities P(u i =,,,T y ), are converted to base calls a i through straightforward probability x x x maximization: x x x Once the base call is established, its quality value b i is directly derived from the base probability as: g gg The quality values are determined from a lookup table using the four base likelihoods as feature vectors. Measurement model Bayesian Inference Base calling olor measurements olor likelihoods Base posterior probabilities Base calls & Quality values y P(y,i x,i =,,, ) P(y B,i x B,i =,,, ) for i =,,,K for i =,,,K/5 P(u i =,,,T y ) for i =,,,K Figure 5. Base calling process performed by the 55 Series SOLiD System for Exact all hemistry. a i,b i for i =,,,K g gg ppendix B: The forwardbackward algorithm!" #$ %!" /& #$ /& %. Edge label (probe set color) x,i+ Edge label (probe set color)!" #$ %!" /& #$ /& % x B,(i+)/5 ' ' ( u i u i+ u i+ u i u i+ u i+ u i+ u i+ u i+ u i+ B. Node (base triplet) x, x, x, x B, Edge (base quadruplet) ' ' ( x,4 Node (base triplet) x,k x,k T T T T T T T T T T T T T u u u u u u u u u 4 u u 4 u 5 u 4 u 5 u 6 u K u K u K u K u K u K+ u K u K+ u K+ Figure 6. raphical representation of one base sequence u = [u,u,u,,u K as a path through a graph. () Notation for variables associated with nodes and edges. (B) Example for base sequence u = [TTTTT. The forwardbackward algorithm [ solves the Bayesian inference problem of computing all conditional probabilities P(u i =,,,T y ) of bases from measurements, which is the basis for performing base calls and calculating base quality values for each base call. The forwardbackward algorithm represents any possible sequence of bases u = [u,u,u,,u K with a path through a graph that starts with node [u,u,u and passes through nodes [u,u,u, [u,u,u 4, and so on, to node [u K,u K,u K. Figure 6B shows a path corresponding to a sequence from the example in Figure. Every edge connecting a pair of nodes [u i,u i+,u i+ and [u i+,u i+,u i+ carries information about four bases [u i,u i+,u i+,u i+, which is enough to uniquely associate it with the expected color of the probes interrogating bases [u i,u i+,u i+,u i+,u i+4 from either probe set (x,i+ ) or probe set (x B,(i+)/5 ) according to Figure. The graphical notation for values associated with nodes and edges is shown in Figure 6. The set of paths corresponding to all 4 K possible Kbase sequences can be combined into a single compact graph called a trellis (Figure 7). The trellis guarantees onetoone correspondence between any path from the left most nodes to the right most nodes and all possible base sequences, which makes it a blueprint for dynamic programming computations.
6 . T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T [u u u [u u u [u u u 4 [u u 4 u 5 T T T T T T T T T T T T T T T T T T T T T [u 4 u 5 u 6 T T T T T T T T T T T T T T T T T T T T T [u K u K u K T T T T T T [u K u K u K+ [u K u K+ u K+ B. T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T T Figure 7. eneral structure of the trellis. [ The trellis is divided into K stages, each consisting of 56 edges connecting x two x groups x of 64 nodes. Each group of 64 nodes corresponds to all possible base triplets. t the beginning and end of each section there are 64 nodes, corresponding to all possible base triplets. Each edge corresponds to some quadruplet of bases [u i, u i+, u i+, u i+, connecting node [u i, u i+, u i+ to node [u i, u i+, u i+, u i+. [B n example subset of nodes and edges. n edge that corresponds to a quadruplet of bases [u i, u i+, u i+, u i+, has an expected color x,i+ in probe set and an expected color x B,(i+)/5 in probe set determined from Figure. Note that measurements for probe set are only available in every fifth trellis section. The algorithm determines base calls and quality values by performing the following steps: Step. For each edge in the trellis, establish edge metric γ(u i,u i+,u i+,u i+ ). Page 6, equation in Step : For each edge in the graph associated with bases u i,u i+,u i+,u i+ and colors x,i+, x B,(i+)/5, determine the edge weight: g gg Step. alculate forward node metric α(u i,u i+,u i+ ) for every node in the graph:. Initialize α(u,u,u ) to!" for nodes #$ where %!" u /& agrees #$ with /& last % base of the primer and otherwise. B. Iteratively compute all node metrics α(u i+,u i+,u i+ ) from node metrics α(u i,u i+,u i+ ) according to the formula: In the above formula, P(y,i+ x,i+ =,,, ) and P(y B,(i+)/5 x B,(i+)/5 =,,, ) are color likelihoods derived from color measurements y,i+ and y B,(i+)/5. P(u i+ ) is always assumed to be.5. Note that for any base sequence, the product of all edge weights along the corresponding path results in P(y x ) P(y B x B ) P(u) = P(y,u), is proportional to the conditional sequence probability P(u y ). The goal of the forwardbackward algorithm is the efficient marginalization of P(u y ) to obtain P(u i y ) for individual base positions. Page 7, equation in Step 5: ' ' ( Notice that summands in the above formula correspond to four edges arriving at the node [u i+,u i+,u i+ from the left.
7 Step. Each value is associated with one node in the graph. alculate backward node metric β(u i,u i+,u i+ ) for every node in Step 5. the graph: Page 7, equation alculate the in Step posterior 5: base probabilities P(u i =,,,T y ) from. Initialize β(u K,u K+,u K+ ) to for all nodes. partially marginalized probabilities from step 4. B. Iteratively compute all node metrics β(u i,u i+,u i+ ) from node metrics β(u i+,u i+,u i+ ) according to the formula: ) ) ( ) Notice that summands in the ) above formula correspond to four edges arriving at the node [u i,u i+,u i+ from the right. ( ll the steps of the decoding process have low complexity and are well suited for highperformance implementation. Step 4. alculate partially marginalized ' probabilities ) P(y,u i,u i+,u i+ ) for all base triplets as: ' ) ppendix : Probe set design + + The design of the labeling for the two probe sets is inspired by a convolutional code, a type of errorcorrecting code used in digital communication systems [. onvolutional codes have two key properties. First, they have a sliding window property, where the encoded symbols are derived from a short subset of consecutive information symbols a property that is naturally satisfied by SOLiD System chemistry. Second, convolutional codes are linear in a finite field; therefore, encoded symbols, when treated as elements from a finite field, can be computed as weighted sums of input symbols. This second property can be directly applied to probe set design. Each probe consists of five nucleotides v = [v,v,v,v 4,v 5 that specifically hybridize to complementary DN sequence, three inosines that hybridize to any sequence, and a fluorescent dye c. With four possible nucleotides (,,, and T),,4 possible probe sequences exist, each having one of four unique dyes assigned to it. Linearity of the probe set means that each of the,4 probes is assigned a dye according to the formula ) Math in finite field F(4) B + B B B) F(4) assignment to bases Template Base T Probe Base T F(4) assignment ) F(4) assignment Dyes Dye Label olor Notation F(4) assignment FM y TXR y5 c = v g + v g + v g + v 4 g 4 + v 5 g 5, where g = [g,g,g,g 4,g 5 is a vector of weights that defines the probe sets. Bases v i, dye c, and weights g i are considered to be elements of a finite (alois) field of size 4, denoted F(4) [. The correspondence between nucleotides v i, dyes c and elements of F(4) are presented in Figure 8. Figures 8B and 8 further detail the mechanics of performing multiplication and addition of elements of F(4). Probe sets presented in Figure for Exact all hemistry have been generated by formula () by assigning weights g = [,,,, to Probe Set and weights g = [,,,, to Probe Set. Figure 8. Finite field F(4). [ addition and multiplication in F(4); [B association between elements of F(4) and nucleotides; [ association between elements of F(4) and dyes. References. Lin, ostello, Error ontrol oding (nd ed.), Prentice Hall, 4, ISBN Lang, lgebra (rd ed.), Springer,, ISBN X.
8 Life Technologies offers a breadth of products DN RN PROTEIN ELL ULTURE INSTRUMENTS For Research Use Only. Not intended for any animal or human therapeutic or diagnostic use. Life Technologies orporation. ll rights reserved. The trademarks mentioned herein are the property of Life Technologies orporation or their respective owners. Printed in the US. O85 Headquarters 579 Van llen Way arlsbad, 98 US Phone Toll Free in the US
OPRE 6201 : 2. Simplex Method
OPRE 6201 : 2. Simplex Method 1 The Graphical Method: An Example Consider the following linear program: Max 4x 1 +3x 2 Subject to: 2x 1 +3x 2 6 (1) 3x 1 +2x 2 3 (2) 2x 2 5 (3) 2x 1 +x 2 4 (4) x 1, x 2
More informationHighDimensional Image Warping
Chapter 4 HighDimensional Image Warping John Ashburner & Karl J. Friston The Wellcome Dept. of Imaging Neuroscience, 12 Queen Square, London WC1N 3BG, UK. Contents 4.1 Introduction.................................
More informationMiSeq: Imaging and Base Calling
MiSeq: Imaging and Page Welcome Navigation Presenter Introduction MiSeq Sequencing Workflow Narration Welcome to MiSeq: Imaging and. This course takes 35 minutes to complete. Click Next to continue. Please
More informationThe Backpropagation Algorithm
7 The Backpropagation Algorithm 7. Learning as gradient descent We saw in the last chapter that multilayered networks are capable of computing a wider range of Boolean functions than networks with a single
More informationUniversità degli Studi di Bologna
Università degli Studi di Bologna DEIS Biometric System Laboratory Incremental Learning by Message Passing in Hierarchical Temporal Memory Davide Maltoni Biometric System Laboratory DEIS  University of
More informationRealtime PCR handbook
Realtime PCR handbook Singletube assays 96 and 384well plates 384well TaqMan Array cards OpenArray plates The image on this cover is of an OpenArray plate which is primarily used for middensity realtime
More informationAnomalous System Call Detection
Anomalous System Call Detection Darren Mutz, Fredrik Valeur, Christopher Kruegel, and Giovanni Vigna Reliable Software Group, University of California, Santa Barbara Secure Systems Lab, Technical University
More informationWhat are plausible values and why are they useful?
What are plausible values and why are they useful? Matthias von Davier Educational Testing Service, Princeton, New Jersey, United States Eugenio Gonzalez Educational Testing Service, Princeton, New Jersey,
More informationFactor Graphs and the SumProduct Algorithm
498 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 47, NO. 2, FEBRUARY 2001 Factor Graphs and the SumProduct Algorithm Frank R. Kschischang, Senior Member, IEEE, Brendan J. Frey, Member, IEEE, and HansAndrea
More informationMINITAB ASSISTANT WHITE PAPER
MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. OneWay
More informationUnbiased Groupwise Alignment by Iterative Central Tendency Estimation
Math. Model. Nat. Phenom. Vol. 3, No. 6, 2008, pp. 232 Unbiased Groupwise Alignment by Iterative Central Tendency Estimation M.S. De Craene a1, B. Macq b, F. Marques c, P. Salembier c, and S.K. Warfield
More informationA Googlelike Model of Road Network Dynamics and its Application to Regulation and Control
A Googlelike Model of Road Network Dynamics and its Application to Regulation and Control Emanuele Crisostomi, Steve Kirkland, Robert Shorten August, 2010 Abstract Inspired by the ability of Markov chains
More informationMODEL SELECTION FOR SOCIAL NETWORKS USING GRAPHLETS
MODEL SELECTION FOR SOCIAL NETWORKS USING GRAPHLETS JEANNETTE JANSSEN, MATT HURSHMAN, AND NAUZER KALYANIWALLA Abstract. Several network models have been proposed to explain the link structure observed
More informationWhat Every Computer Scientist Should Know About FloatingPoint Arithmetic
What Every Computer Scientist Should Know About FloatingPoint Arithmetic D Note This document is an edited reprint of the paper What Every Computer Scientist Should Know About FloatingPoint Arithmetic,
More informationRobust Set Reconciliation
Robust Set Reconciliation Di Chen 1 Christian Konrad 2 Ke Yi 1 Wei Yu 3 Qin Zhang 4 1 Hong Kong University of Science and Technology, Hong Kong, China 2 Reykjavik University, Reykjavik, Iceland 3 Aarhus
More informationPRINCIPAL COMPONENT ANALYSIS
1 Chapter 1 PRINCIPAL COMPONENT ANALYSIS Introduction: The Basics of Principal Component Analysis........................... 2 A Variable Reduction Procedure.......................................... 2
More informationTHE development of methods for automatic detection
Learning to Detect Objects in Images via a Sparse, PartBased Representation Shivani Agarwal, Aatif Awan and Dan Roth, Member, IEEE Computer Society 1 Abstract We study the problem of detecting objects
More informationSample design for educational survey research
Quantitative research methods in educational planning Series editor: Kenneth N.Ross Module Kenneth N. Ross 3 Sample design for educational survey research UNESCO International Institute for Educational
More informationData Quality Assessment: A Reviewer s Guide EPA QA/G9R
United States Office of Environmental EPA/240/B06/002 Environmental Protection Information Agency Washington, DC 20460 Data Quality Assessment: A Reviewer s Guide EPA QA/G9R FOREWORD This document is
More information1 An Introduction to Conditional Random Fields for Relational Learning
1 An Introduction to Conditional Random Fields for Relational Learning Charles Sutton Department of Computer Science University of Massachusetts, USA casutton@cs.umass.edu http://www.cs.umass.edu/ casutton
More informationOrthogonal Bases and the QR Algorithm
Orthogonal Bases and the QR Algorithm Orthogonal Bases by Peter J Olver University of Minnesota Throughout, we work in the Euclidean vector space V = R n, the space of column vectors with n real entries
More informationVARIATIONAL ALGORITHMS FOR APPROXIMATE BAYESIAN INFERENCE
VARIATIONAL ALGORITHMS FOR APPROXIMATE BAYESIAN INFERENCE by Matthew J. Beal M.A., M.Sci., Physics, University of Cambridge, UK (1998) The Gatsby Computational Neuroscience Unit University College London
More informationMining Templates from Search Result Records of Search Engines
Mining Templates from Search Result Records of Search Engines Hongkun Zhao, Weiyi Meng State University of New York at Binghamton Binghamton, NY 13902, USA {hkzhao, meng}@cs.binghamton.edu Clement Yu University
More informationFast Solution of l 1 norm Minimization Problems When the Solution May be Sparse
Fast Solution of l 1 norm Minimization Problems When the Solution May be Sparse David L. Donoho and Yaakov Tsaig October 6 Abstract The minimum l 1 norm solution to an underdetermined system of linear
More informationTHE PROBLEM OF finding localized energy solutions
600 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 45, NO. 3, MARCH 1997 Sparse Signal Reconstruction from Limited Data Using FOCUSS: A Reweighted Minimum Norm Algorithm Irina F. Gorodnitsky, Member, IEEE,
More informationimap: Discovering Complex Semantic Matches between Database Schemas
imap: Discovering Complex Semantic Matches between Database Schemas Robin Dhamankar, Yoonkyong Lee, AnHai Doan Department of Computer Science University of Illinois, UrbanaChampaign, IL, USA {dhamanka,ylee11,anhai}@cs.uiuc.edu
More informationObject Detection with Discriminatively Trained Part Based Models
1 Object Detection with Discriminatively Trained Part Based Models Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester and Deva Ramanan Abstract We describe an object detection system based on mixtures
More informationThe Haplotyping Problem: An Overview of Computational Models and Solutions
The Haplotyping Problem: An Overview of Computational Models and Solutions Paola Bonizzoni Gianluca Della Vedova Riccardo Dondi Jing Li June 16, 2008 Abstract The investigation of genetic differences among
More informationgprof: a Call Graph Execution Profiler 1
gprof A Call Graph Execution Profiler PSD:181 gprof: a Call Graph Execution Profiler 1 by Susan L. Graham Peter B. Kessler Marshall K. McKusick Computer Science Division Electrical Engineering and Computer
More informationFrom Few to Many: Illumination Cone Models for Face Recognition Under Variable Lighting and Pose. Abstract
To Appear in the IEEE Trans. on Pattern Analysis and Machine Intelligence From Few to Many: Illumination Cone Models for Face Recognition Under Variable Lighting and Pose Athinodoros S. Georghiades Peter
More information