SOLiD System accuracy with the Exact Call Chemistry module




WHITE PAPER
5500 Series SOLiD System

SOLiD System accuracy with the Exact Call Chemistry module

CONTENTS
Principles of Exact Call Chemistry
  Introduction
  Encoding of base sequences with Exact Call Chemistry
  Demonstration of increased accuracy with Exact Call Chemistry
  Conclusion
Bioinformatics of Exact Call Chemistry
  Appendix A: Determination of base calls and quality values
  Appendix B: The forward-backward algorithm
  Appendix C: Probe set design

Principles of Exact Call Chemistry

Introduction

The 5500 Series SOLiD System is designed to provide industry-leading accuracy through a unique, ligation-based sequencing methodology, Exact Call Chemistry (ECC). This sequencing approach builds on SOLiD 4 System chemistry, employing sequential ligation of oligonucleotide probes labeled with one of four fluorescent dyes, whereby each probe assays multiple base positions at a time. While SOLiD 4 System chemistry interrogates every base of the DNA template twice using two-base encoded probes, Exact Call Chemistry performs an additional inspection of the template using a new, three-base encoded probe set. This new probe set is designed to complement two-base encoding so that the two sets jointly form a redundant error-correction code. The set of all dye color measurements, each carrying information about multiple bases, is then used by a specialized decoding algorithm to establish the correct base sequence without knowledge of the reference, even in the presence of measurement errors. This strategy allows for error detection and correction, providing highly accurate results for resequencing, de novo sequencing, and rare variant detection.

Encoding of base sequences with Exact Call Chemistry

5500 Series SOLiD System Exact Call Chemistry is illustrated in Figure 1. Beads containing single-stranded copies of DNA library fragments are attached to the surface of the FlowChip. These sequences are interrogated by fluorescently labeled probes that hybridize and ligate at the boundary of single-stranded and double-stranded DNA.

The sequencing process is organized into phases called primer rounds. Each primer round consists of hybridization of a sequencing primer with a specific offset, followed by multiple cycles of probe ligation and detection. The primer round is concluded by a reset that melts the primer and extended sequence off the template, preparing the template for the next round.

SOLiD 4 System chemistry relies on performing five primer rounds using a two-base encoded probe set. ECC follows these five primer rounds with one additional round. This sixth round starts with the same sequencing primer as the fifth round, but uses a new, independent probe set with a specific three-base labeling. Together, the six primer rounds enable unrivaled system accuracy by forming a redundant error-correction code.

The labeling of both probe sets, shown in Figure 1, is inspired by a convolutional code, a type of error-correction code used in digital communication systems where the encoded symbols are derived from a short subset of consecutive information symbols [1]. The algebraic formula used to generate them is described in Appendix C.

Figure 1. Ligation-based sequencing with 5500 Series SOLiD System Exact Call Chemistry (panels: Probe Set 1 and Probe Set 2).

Demonstration of increased accuracy with Exact Call Chemistry

To validate this sequencing strategy, we performed SOLiD System sequencing using Exact Call Chemistry on templated beads containing synthetic DNA templates. The color measurements were collected by the instrument and processed with the ECC base calling algorithms (described in detail in Appendices A and B). Because the precise sequence of each DNA template was known, this approach allowed direct assessment of the sequencing accuracy of Exact Call Chemistry, presented in Figure 2 as a histogram of calibrated Phred scores.

As the results demonstrate, Exact Call Chemistry determines the template sequence with extremely high accuracy, with the majority of base calls achieving accuracy in excess of 99.9999%. Since each base position is interrogated by multiple colors, consistent agreement between multiple color calls significantly increases the confidence in base calls. At the same time, some color errors can be corrected as a result of the single-read redundancy provided by the sixth primer round.

Figure 2. Histogram of base call accuracies expressed in calibrated Phred scale (vertical axis: percentage of all base calls; horizontal axis: accuracy of base calls in Phred scale, with the last bin covering 60 or above). Base calls were derived from color measurements through the algorithms described in Appendices A and B.

Phred score | Accuracy
10 | 90%
20 | 99%
30 | 99.9%
40 | 99.99%
50 | 99.999%
60 | 99.9999%

Conclusion

Sequencing with Exact Call Chemistry and the 5500 Series SOLiD System demonstrates increased performance and accuracy, which is imperative for high-throughput detection of rare genetic variants in heterogeneous or pooled samples, as well as for de novo sequencing. The multibase encoding contributes to a lower error rate and reduced systematic noise, resulting in unsurpassed sequencing accuracy even at low coverage.
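The correspondence between Phred scores and accuracies used in the histogram of calibrated Phred scores follows the standard Phred definition, Q = -10 log10(error probability). A minimal sketch of that conversion is shown here; the function names are illustrative and are not taken from the SOLiD software.

```python
import math


def phred_to_accuracy(q: float) -> float:
    """Probability that a base call with Phred score q is correct."""
    return 1.0 - 10.0 ** (-q / 10.0)


def accuracy_to_phred(accuracy: float) -> float:
    """Phred score for a given accuracy; requires accuracy < 1."""
    return -10.0 * math.log10(1.0 - accuracy)


# Print the accuracy that corresponds to each decade of the Phred scale.
for q in (10, 20, 30, 40, 50, 60):
    print(f"Q{q}: {phred_to_accuracy(q) * 100:.4f}% accurate")
```

Each additional 10 Phred units corresponds to a tenfold reduction in the error probability.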
Bioinformatics of Exact Call Chemistry

Appendix A: Determination of base calls and quality values

Table 1. Mathematical notation for ligation-based sequencing. Symbols and corresponding definitions are indicated.

Symbol | Definition
$u = [u_0, u_1, u_2, \ldots, u_K]$ | Sequence of bases in a DNA fragment. Base $u_0$ is the last base of the adapter ligated to the 5' end of the fragment, which is known ahead of time.
$x_A = [x_{A,1}, x_{A,2}, \ldots, x_{A,K}]$ | Expected sequence of colors resulting from interrogating $u$ using probe set 1, assuming no color errors.
$x_B = [x_{B,1}, x_{B,2}, \ldots, x_{B,K/5}]$ | Expected sequence of colors resulting from interrogating $u$ using probe set 2, assuming no color errors.
$y_A = [y_{A,1}, y_{A,2}, \ldots, y_{A,K}]$ | Noisy color measurements of $x_A$.
$y_B = [y_{B,1}, y_{B,2}, \ldots, y_{B,K/5}]$ | Noisy color measurements of $x_B$.

The 5500 Series SOLiD System sequencing process can be understood as a transformation of the template base sequence $u$ into a collection of color measurements (see Table 1 for notation). If the colors could always be read out without errors, this conversion would be a deterministic, injective function $u \to (x_A, x_B)$, where every possible base sequence translates to a unique color sequence. For example, a sequence $u$ consisting of the last base T of the adapter followed by 15 template bases is expected to translate into the set of colors in Figure 3.

In the example shown in Figure 3, 15 unknown template bases (excluding $u_0$) have been converted into 18 color reads. In general, there are $4^{18}$ (about 69 billion) possible color sequences of length 18, but only $4^{15}$ (about 1 billion) possible base sequences of length 15. This means that only one in every 64 possible color sequences corresponds to a base sequence. Such valid color sequences are rare and, due to the design of the probe sets, dissimilar from each other. This makes them easy to distinguish, even in the presence of some measurement noise and errors in color calls.

Figure 3. Example of color encoding for the base sequence $u$. Colors are arranged by (A) order of data acquisition (primer rounds with Probe Set 1 and Probe Set 2); (B) base position and probe set.

The optics, imaging, and image processing subsystems of the 5500 Series SOLiD System produce color measurements for each DNA fragment and each probe ligation cycle. Measurements for a specific bead ($y_A$ and $y_B$, see Table 1) carry information about the color sequences $x_A$ and $x_B$, and indirectly about the base sequence $u$. The unknown sequence $u$, hidden variables $x_A$ and $x_B$, and observed variables $y_A$ and $y_B$ are linked together through a causal, statistical relation that is key to base calling (illustrated by a Bayesian network in Figure 4).

Figure 4. Bayesian network linking the unknown bases and the observed color measurements.

The 5500 Series SOLiD Sequencer performs three steps to determine individual bases and assign quality values to the calls, as illustrated in Figure 5. At the core of this process is the Bayesian inference operation that derives four conditional (posterior) probabilities, $P(u_i = A, C, G, T \mid y)$, for each base position $i$, where $y = (y_A, y_B)$ denotes all color measurements. The general formula for such a derivation is a marginalization of $P(u \mid y)$, which is obtained through Bayes' theorem from the conditional probability $P(y \mid u)$:

$$P(u_i = n \mid y) = \sum_{u \,:\, u_i = n} P(u \mid y) = \frac{\sum_{u \,:\, u_i = n} P(y \mid u)\, P(u)}{\sum_{u'} P(y \mid u')\, P(u')}, \qquad n \in \{A, C, G, T\}.$$
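The marginalization described above can be illustrated by brute force on a very short sequence. The sketch below enumerates every candidate base sequence, weights it by P(y | u) P(u), and sums the weights per base position. The two-base XOR color encoding, the symmetric color-error model, and all function names are simplifying assumptions made for illustration; they stand in for the actual probe sets and measurement model, which are not reproduced here.

```python
from itertools import product

BASES = "ACGT"


def expected_colors(u):
    """Toy two-base encoding (assumption): color = XOR of adjacent base indices."""
    idx = [BASES.index(b) for b in u]
    return [idx[i - 1] ^ idx[i] for i in range(1, len(idx))]


def likelihood(y, x, err=0.05):
    """P(y | x) under an assumed symmetric color-error channel."""
    p = 1.0
    for yi, xi in zip(y, x):
        p *= (1.0 - err) if yi == xi else err / 3.0
    return p


def posteriors(y, u0="T"):
    """Per-position posteriors P(u_i = n | y), enumerating all 4**K sequences."""
    k = len(y)
    totals = [dict.fromkeys(BASES, 0.0) for _ in range(k)]
    evidence = 0.0
    for tail in product(BASES, repeat=k):
        u = u0 + "".join(tail)
        joint = likelihood(y, expected_colors(u)) * 0.25 ** k  # P(y|u) * P(u)
        evidence += joint
        for i, base in enumerate(tail):
            totals[i][base] += joint
    return [{b: t[b] / evidence for b in BASES} for t in totals]


y = expected_colors("TACGT")  # noise-free colors of a known toy sequence
post = posteriors(y)
print([max(p, key=p.get) for p in post])
```

The cost of this enumeration grows as 4^K, which is why it is only workable for toy lengths.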

A practical way to perform the above computation efficiently is to use dynamic programming. The 5500 Series SOLiD System uses a type of dynamic programming called the forward-backward algorithm [1], which is detailed in Appendix B.

The steps preceding and following the Bayesian inference are much simpler. The measurement model, evaluated before Bayesian inference, converts each color measurement $y_i$ into four individual color likelihoods $P(y_i \mid x_i = 0, 1, 2, 3)$, which are later used within the forward-backward algorithm to calculate $P(y \mid u)$. The distributions $P(y_i \mid x_i)$ are well characterized within the 5500 Series SOLiD Sequencer image processing system. Finally, the product of Bayesian inference, the probabilities $P(u_i = A, C, G, T \mid y)$, are converted to base calls $a_i$ through straightforward probability maximization:

$$a_i = \arg\max_{n \in \{A, C, G, T\}} P(u_i = n \mid y).$$

Once the base call is established, its quality value $b_i$ is derived from the base probabilities: the quality values are determined from a lookup table using the four base likelihoods as feature vectors.

Figure 5. Base calling process performed by the 5500 Series SOLiD System for Exact Call Chemistry. Color measurements $y$ -> (measurement model) -> color likelihoods $P(y_{A,i} \mid x_{A,i} = 0, 1, 2, 3)$ for $i = 1, 2, \ldots, K$ and $P(y_{B,i} \mid x_{B,i} = 0, 1, 2, 3)$ for $i = 1, 2, \ldots, K/5$ -> (Bayesian inference) -> base posterior probabilities $P(u_i = A, C, G, T \mid y)$ for $i = 1, 2, \ldots, K$ -> (base calling) -> base calls and quality values $a_i, b_i$ for $i = 1, 2, \ldots, K$.

Appendix B: The forward-backward algorithm

Figure 6. Graphical representation of one base sequence $u = [u_0, u_1, u_2, \ldots, u_K]$ as a path through a graph. (A) Notation for variables associated with nodes (base triplets) and edges (base quadruplets, labeled with the probe set 1 color $x_{A,i+1}$ and the probe set 2 color $x_{B,(i+1)/5}$). (B) Example path for the base sequence $u$ from Figure 3.

The forward-backward algorithm [1] solves the Bayesian inference problem of computing all conditional probabilities $P(u_i = A, C, G, T \mid y)$ of bases from measurements, which is the basis for performing base calls and calculating a quality value for each base call.

The forward-backward algorithm represents any possible sequence of bases $u = [u_0, u_1, u_2, \ldots, u_K]$ as a path through a graph that starts with node $[u_0, u_1, u_2]$ and passes through nodes $[u_1, u_2, u_3]$, $[u_2, u_3, u_4]$, and so on, to node $[u_{K-2}, u_{K-1}, u_K]$. Figure 6B shows a path corresponding to the sequence from the example in Figure 3. Every edge connecting a pair of nodes $[u_i, u_{i+1}, u_{i+2}]$ and $[u_{i+1}, u_{i+2}, u_{i+3}]$ carries information about four bases $[u_i, u_{i+1}, u_{i+2}, u_{i+3}]$, which is enough to uniquely associate it with the expected color of the probes interrogating bases $[u_i, u_{i+1}, u_{i+2}, u_{i+3}, u_{i+4}]$ from either probe set 1 ($x_{A,i+1}$) or probe set 2 ($x_{B,(i+1)/5}$) according to Figure 1. The graphical notation for values associated with nodes and edges is shown in Figure 6A.

The set of paths corresponding to all $4^K$ possible K-base sequences can be combined into a single compact graph called a trellis (Figure 7). The trellis guarantees a one-to-one correspondence between the paths from the leftmost nodes to the rightmost nodes and all possible base sequences, which makes it a blueprint for dynamic programming computations.
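The dynamic programming that the trellis supports can be sketched in a few dozen lines. The example below is a minimal illustration, not the instrument's implementation: nodes are base triplets and edges are base quadruplets as described above, but the edge color function (XOR of the first two bases), the symmetric color-error model, the omission of probe set 2, and the plain Phred transform used for the quality value (in place of the lookup-table calibration) are all assumptions made to keep the example self-contained.

```python
import math
from itertools import product

BASES = "ACGT"
TRIPLETS = list(product(range(4), repeat=3))  # 64 nodes per trellis stage


def edge_color(quad):
    """Hypothetical color of a base quadruplet: XOR of its first two bases."""
    return quad[0] ^ quad[1]


def color_likelihood(y, x, err=0.05):
    """Assumed measurement model: symmetric color-error channel P(y | x)."""
    return 1.0 - err if y == x else err / 3.0


def forward_backward(colors, first_base):
    """Return P(u_i = base | colors) for i = 1..K as a list of 4-element lists."""
    k = len(colors)

    def gamma(i, quad):
        # Edge weight: color likelihood times the uniform prior of the new base.
        return color_likelihood(colors[i], edge_color(quad)) * 0.25

    # Forward metrics: alpha[i][node] sums over all paths reaching the node.
    alpha = [{t: float(t[0] == first_base) for t in TRIPLETS}]
    for i in range(k):
        nxt = dict.fromkeys(TRIPLETS, 0.0)
        for t in TRIPLETS:
            for b in range(4):
                nxt[t[1:] + (b,)] += alpha[i][t] * gamma(i, t + (b,))
        alpha.append(nxt)

    # Backward metrics: beta[i][node] sums over all paths leaving the node.
    beta = [None] * k + [dict.fromkeys(TRIPLETS, 1.0)]
    for i in range(k - 1, -1, -1):
        beta[i] = {
            t: sum(beta[i + 1][t[1:] + (b,)] * gamma(i, t + (b,)) for b in range(4))
            for t in TRIPLETS
        }

    # Combine, marginalize each node over its last two bases, and normalize.
    posteriors = []
    for i in range(1, k + 1):
        marg = [0.0] * 4
        for t in TRIPLETS:
            marg[t[0]] += alpha[i][t] * beta[i][t]
        total = sum(marg)
        posteriors.append([m / total for m in marg])
    return posteriors


def call_bases(posteriors):
    """Pick the most probable base and a Phred-style quality per position."""
    calls = []
    for p in posteriors:
        best = max(range(4), key=p.__getitem__)
        quality = -10.0 * math.log10(max(1.0 - p[best], 1e-12))
        calls.append((BASES[best], quality))
    return calls


colors = [3, 1, 3, 1]  # noise-free toy colors for T followed by A, C, G, T
posts = forward_backward(colors, first_base=BASES.index("T"))
calls = call_bases(posts)
print("".join(base for base, _ in calls))
```

Each stage touches 64 nodes and 256 edges, so the work grows linearly with the read length K rather than as 4^K.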

Figure 7. General structure of the trellis. [A] The trellis is divided into K stages, each consisting of 256 edges connecting two groups of 64 nodes. Each group of 64 nodes corresponds to all possible base triplets. Each edge corresponds to some quadruplet of bases $[u_i, u_{i+1}, u_{i+2}, u_{i+3}]$, connecting node $[u_i, u_{i+1}, u_{i+2}]$ to node $[u_{i+1}, u_{i+2}, u_{i+3}]$. [B] An example subset of nodes and edges. An edge that corresponds to a quadruplet of bases $[u_i, u_{i+1}, u_{i+2}, u_{i+3}]$ has an expected color $x_{A,i+1}$ in probe set 1 and an expected color $x_{B,(i+1)/5}$ in probe set 2, determined from Figure 1. Note that measurements for probe set 2 are only available in every fifth trellis section.

The algorithm determines base calls and quality values by performing the following steps:

Step 1. For each edge in the trellis, associated with bases $u_i, u_{i+1}, u_{i+2}, u_{i+3}$ and colors $x_{A,i+1}$, $x_{B,(i+1)/5}$, establish the edge metric (edge weight):

$$\gamma(u_i, u_{i+1}, u_{i+2}, u_{i+3}) = P(y_{A,i+1} \mid x_{A,i+1})\; P(y_{B,(i+1)/5} \mid x_{B,(i+1)/5})\; P(u_{i+3}).$$

In the above formula, $P(y_{A,i+1} \mid x_{A,i+1} = 0, 1, 2, 3)$ and $P(y_{B,(i+1)/5} \mid x_{B,(i+1)/5} = 0, 1, 2, 3)$ are color likelihoods derived from color measurements $y_{A,i+1}$ and $y_{B,(i+1)/5}$; the probe set 2 factor applies only in the trellis sections for which a probe set 2 measurement is available. $P(u_{i+3})$ is always assumed to be 0.25. Note that for any base sequence, the product of all edge weights along the corresponding path results in $P(y_A \mid x_A)\, P(y_B \mid x_B)\, P(u) = P(y, u)$, which is proportional to the conditional sequence probability $P(u \mid y)$. The goal of the forward-backward algorithm is the efficient marginalization of $P(u \mid y)$ to obtain $P(u_i \mid y)$ for individual base positions.

Step 2. Calculate the forward node metric $\alpha(u_i, u_{i+1}, u_{i+2})$ for every node in the graph:

A. Initialize $\alpha(u_0, u_1, u_2)$ to 1 for nodes where $u_0$ agrees with the last base of the primer, and to 0 otherwise.

B. Iteratively compute all node metrics $\alpha(u_{i+1}, u_{i+2}, u_{i+3})$ from node metrics $\alpha(u_i, u_{i+1}, u_{i+2})$ according to the formula:

$$\alpha(u_{i+1}, u_{i+2}, u_{i+3}) = \sum_{u_i \in \{A, C, G, T\}} \alpha(u_i, u_{i+1}, u_{i+2})\; \gamma(u_i, u_{i+1}, u_{i+2}, u_{i+3}).$$

Notice that the summands in the above formula correspond to the four edges arriving at the node $[u_{i+1}, u_{i+2}, u_{i+3}]$ from the left.

Step 3. Calculate the backward node metric $\beta(u_i, u_{i+1}, u_{i+2})$ for every node in the graph:

A. Initialize $\beta(u_K, u_{K+1}, u_{K+2})$ to 1 for all nodes.

B. Iteratively compute all node metrics $\beta(u_i, u_{i+1}, u_{i+2})$ from node metrics $\beta(u_{i+1}, u_{i+2}, u_{i+3})$ according to the formula:

$$\beta(u_i, u_{i+1}, u_{i+2}) = \sum_{u_{i+3} \in \{A, C, G, T\}} \beta(u_{i+1}, u_{i+2}, u_{i+3})\; \gamma(u_i, u_{i+1}, u_{i+2}, u_{i+3}).$$

Notice that the summands in the above formula correspond to the four edges arriving at the node $[u_i, u_{i+1}, u_{i+2}]$ from the right.

Step 4. Calculate the partially marginalized probabilities $P(y, u_i, u_{i+1}, u_{i+2})$ for all base triplets as:

$$P(y, u_i, u_{i+1}, u_{i+2}) = \alpha(u_i, u_{i+1}, u_{i+2})\; \beta(u_i, u_{i+1}, u_{i+2}).$$

Each value is associated with one node in the graph.

Step 5. Calculate the posterior base probabilities $P(u_i = A, C, G, T \mid y)$ from the partially marginalized probabilities from Step 4:

$$P(u_i = n \mid y) = \frac{\sum_{u_{i+1}, u_{i+2}} P(y, u_i = n, u_{i+1}, u_{i+2})}{\sum_{u_i, u_{i+1}, u_{i+2}} P(y, u_i, u_{i+1}, u_{i+2})}, \qquad n \in \{A, C, G, T\}.$$

All the steps of the decoding process have low complexity and are well suited for high-performance implementation.

Appendix C: Probe set design

The design of the labeling for the two probe sets is inspired by a convolutional code, a type of error-correcting code used in digital communication systems [1]. Convolutional codes have two key properties. First, they have a sliding window property, where the encoded symbols are derived from a short subset of consecutive information symbols, a property that is naturally satisfied by SOLiD System chemistry. Second, convolutional codes are linear in a finite field; therefore, encoded symbols, when treated as elements from a finite field, can be computed as weighted sums of input symbols. This second property can be directly applied to probe set design.

Each probe consists of five nucleotides $v = [v_1, v_2, v_3, v_4, v_5]$ that specifically hybridize to the complementary DNA sequence, three inosines that hybridize to any sequence, and a fluorescent dye $c$. With four possible nucleotides (A, C, G, and T), 1,024 possible probe sequences exist, each having one of four unique dyes assigned to it.
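The weighted-sum property described above can be sketched with explicit arithmetic in the finite field of size 4. In the example below, field elements are written 0 to 3 (addition is bitwise XOR; multiplication follows the table for the field built on x^2 + x + 1). The base-to-element mapping and the weight vector are illustrative assumptions chosen to mimic a simple two-base encoding; they are not the assignments or weights used for the actual probe sets.

```python
from collections import Counter
from itertools import product

# Multiplication table of the 4-element field; element 2 is a root of x^2 + x + 1.
GF4_MUL = [
    [0, 0, 0, 0],
    [0, 1, 2, 3],
    [0, 2, 3, 1],
    [0, 3, 1, 2],
]

BASE_TO_GF4 = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping, for illustration


def gf4_add(a, b):
    """Addition in the 4-element field is bitwise XOR of the 2-bit labels."""
    return a ^ b


def gf4_mul(a, b):
    """Multiplication in the 4-element field via table lookup."""
    return GF4_MUL[a][b]


def probe_dye(probe, weights):
    """Dye index as the weighted sum of the five probe bases in the field."""
    c = 0
    for base, g in zip(probe, weights):
        c = gf4_add(c, gf4_mul(BASE_TO_GF4[base], g))
    return c


TWO_BASE_WEIGHTS = [1, 1, 0, 0, 0]  # hypothetical weights: dye depends on bases 1-2

# Assign a dye to every one of the 4**5 = 1,024 possible five-base probes.
counts = Counter(probe_dye(p, TWO_BASE_WEIGHTS) for p in product("ACGT", repeat=5))
print(dict(counts))
```

Because the assignment is linear, any nonzero weight vector spreads the 1,024 probes evenly across the four dyes.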
Linearity of the probe set means that each of the 1,024 probes is assigned a dye according to the formula

$$c = v_1 g_1 + v_2 g_2 + v_3 g_3 + v_4 g_4 + v_5 g_5, \qquad (1)$$

where $g = [g_1, g_2, g_3, g_4, g_5]$ is a vector of weights that defines the probe set. Bases $v_i$, dye $c$, and weights $g_i$ are considered to be elements of a finite (Galois) field of size 4, denoted GF(4) [2]. The correspondence between nucleotides $v_i$, dyes $c$, and elements of GF(4) is presented in Figures 8B and 8C; Figure 8A details the mechanics of performing addition and multiplication of elements of GF(4). The probe sets presented in Figure 1 for Exact Call Chemistry have been generated by formula (1) by assigning weights g = [ , , , , ] to Probe Set 1 and weights g = [ , , , , ] to Probe Set 2.

Figure 8. Finite field GF(4). [A] Addition and multiplication in GF(4); [B] association between elements of GF(4) and nucleotides (template base and probe base); [C] association between elements of GF(4) and dyes (dye labels FAM, Cy3, TXR, and Cy5, with their color notation).

References

1. Lin, Costello, Error Control Coding (2nd ed.), Prentice Hall, 2004, ISBN 0130426725.
2. Lang, Algebra (3rd ed.), Springer, 2002, ISBN 038795385X.

For Research Use Only. Not intended for any animal or human therapeutic or diagnostic use. Life Technologies Corporation. All rights reserved. The trademarks mentioned herein are the property of Life Technologies Corporation or their respective owners. www.lifetechnologies.com