Concepts and methods in sequencing and genome assembly




BCM-2004 Concepts and methods in sequencing and genome assembly
B. Franz LANG, Département de Biochimie
Office: H307-15
Email: Franz.Lang@Umontreal.ca

Outline
1. Concepts in DNA and RNA sequencing
2. Sequencing technologies
3. Random genome sequencing, with/without cloning
4. Data formats of results: autoradiograms, traces, FASTQ and base call qualities
5. Sequencing and assembly artifacts

1. Concepts in DNA and RNA sequencing
Reminder
DNA and RNA are polar (5' P; 3' OH), charged biopolymers made up of nucleotides. By convention, sequences are always written from 5' (left) to 3' (right); otherwise, the polarity has to be indicated. DNA usually occurs in double-stranded, antiparallel, perfectly base-paired form:
5' AGCTATTGATTTCCTTGG 3'
3' TCGATAACTAAAGGAACC 5'
RNAs are most often single-stranded and may form secondary and tertiary base pairs (intramolecular, or with other molecules). Single-stranded DNA does the same. For sequencing, DNAs and RNAs have to be denatured and single-stranded, without structure.
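The antiparallel pairing in the duplex above can be checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and it assumes upper-case DNA with the four standard bases.

```python
# Watson-Crick pairing table for DNA (assumes upper-case A, C, G, T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq: str) -> str:
    """Base-by-base complement, kept in the same left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complementary strand written in the conventional 5'->3' direction."""
    return complement(seq)[::-1]


top = "AGCTATTGATTTCCTTGG"
print("5' " + top + " 3'")
print("3' " + complement(top) + " 5'")          # bottom strand as drawn in the duplex
print("5' " + reverse_complement(top) + " 3'")  # same strand, written 5'->3'
```

The second printed line reproduces the lower strand of the slide; the third shows why the polarity must be stated whenever a sequence is not written 5' to 3'.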

1. Concepts in DNA and RNA sequencing
Principles; see also Maniatis (a popular biochemistry cook-book):
The first two sequencing techniques were the enzymatic synthesis method of Sanger et al. (1977) and the chemical degradation method of Maxam and Gilbert (1977). Note that the Maxam and Gilbert method is slow and no longer used, except for special applications such as mapping of protein binding to DNA. Next-generation sequencing (NGS) techniques have taken over for genome projects (see below). They do not require electrophoretic techniques but instead use various nanotechnological approaches.

1. Concepts in DNA and RNA sequencing
Principle:
Although very different in principle, both Maxam/Gilbert and Sanger produce populations of (radio- or fluorochrome-) labeled oligonucleotides that all start at the same site of a given DNA/RNA and end at a given nucleotide (G, A, T/U, C), as determined by the sequencing biochemistry (nucleotide-specific termination of DNA synthesis, nucleotide-specific cleavage, etc.).
[Figure: cleavage at random meG sites yields visible radioactive fragments]
Note that in any sequencing technology, only separate, labeled single-stranded DNAs or RNAs are sequenced; unlabeled material does not matter. When several molecules carry the same label, they need to be separated.

1. Concepts in DNA and RNA sequencing
Electrophoretic separation, and detection principles:
These populations of oligonucleotides are then resolved by electrophoresis under conditions that discriminate size differences at the single-nucleotide level (PAGE). When the four reactions are loaded into four adjacent lanes of a sequencing gel, the order of nucleotides can be read directly from an image after visualizing the radioactive or other label (see below). When sequence reactions are marked with four different fluorescent dyes, they can be loaded on a single lane (or capillary) and read automatically and continuously as light emission of different wavelengths, generated by laser excitation.
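Reading a four-lane gel amounts to sorting all bands by fragment length and noting the lane in which each band appears. The sketch below illustrates this under simplifying assumptions: band positions are already converted to fragment lengths, every length occurs in exactly one lane, and the function name and the toy data are hypothetical.

```python
def read_ladder(lanes: dict[str, list[int]]) -> str:
    """Reconstruct a sequence from four lanes of fragment lengths.

    `lanes` maps each base (lane label) to the lengths of the bands seen in
    that lane. Lengths are counted from the common start of the ladder.
    """
    base_at_length: dict[int, str] = {}
    for base, lengths in lanes.items():
        for n in lengths:
            if n in base_at_length:
                # Two lanes showing a band at the same position is an artifact.
                raise ValueError(f"band of length {n} appears in two lanes")
            base_at_length[n] = base
    # The shortest fragments migrate farthest, so reading the gel from the
    # bottom up corresponds to sorting by increasing fragment length.
    return "".join(base_at_length[n] for n in sorted(base_at_length))


# Toy gel: the lane labelled "A" shows bands at lengths 1 and 5, and so on.
gel = {"A": [1, 5], "C": [3], "G": [2, 6], "T": [4]}
print(read_ladder(gel))  # AGCTAG
```

With four fluorescent dyes in a single lane or capillary, the same logic applies, except that the dye color replaces the lane label.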

1. Concepts in DNA and RNA sequencing
Principles of RNA sequencing:
RNA is sequenced similarly to DNA: either directly by chemical methods (inefficient and slow), by a Sanger-like synthesis protocol with reverse transcriptase (to produce cDNA sequence ladders), or after conversion to cDNA by regular DNA sequencing procedures. RNA classes may be separated by size (microRNAs, tRNAs, rRNAs) or by enrichment of eukaryotic mRNAs carrying a 3' poly-A tail, by binding to an oligo-dT column. RNA sequencing may also provide more information than just the primary sequence. Most RNAs have distinct start and processing sites. High-volume RNA sequencing (NGS, called RNA-seq) allows precise identification of starts and stops, and measurement of relative quantities.

2. Sequencing technologies
2.1. Maxam and Gilbert (chemical)
- Requires a high amount of highly purified DNA fragments (e.g., restriction fragments).
- Single radioactive label, can be on double- or single-stranded DNA.
- Nucleotide-specific, partial chemical modification (random along DNA).
- Chemical cleavage at modified nucleotides.
- Denaturation (heat, formamide), to allow uniform electrophoresis of single-stranded DNA molecules that are perfectly linear and without secondary structure (otherwise sequencing artifacts result).
- High-resolution slab gel PAGE, followed by autoradiography.
- Reading (up to a few hundred nt/reaction), usually by a human expert.
- Several days of labor with a few gel runs provide ~10 kbp of sequence.

2.1. Maxam-Gilbert sequencing summary
Slow, many DNA purification steps, requires lots of DNA, toxic reagents, no automation available, relatively short reads (up to a few hundred nt).

2. Sequencing technologies
2.2. Sanger (enzymatic synthesis)
- Unique start of the sequencing ladder is determined by a sequencing primer, hybridized to DNA or RNA. Purity of template is not an issue (!).
- DNA polymerase (or reverse transcriptase) is used for primer elongation.
- Nucleotide-specific termination (random) with one of four dideoxynucleotides that are mixed with the four regular nucleotides.

2. Sequencing technologies
2.2. Sanger (enzymatic synthesis)
- Label may be radioactive or a fluorescent dye, on:
  - the primer itself (e.g., 5' P32; dye label added during primer synthesis);
  - nucleotides incorporated during synthesis (e.g., P32, S35);
  - dideoxynucleotides (different dyes emitting different colors; single lane or capillary sequencing is possible).
- High-resolution slab gel or capillary electrophoresis.
- Autoradiography or automated reading of migrating fragments (laser, with camera or diodes).
- Several days of labor may produce ~100 kbp of sequence.
- Robotic procedures for template purification and sequence reactions allow scale-up.
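The random, nucleotide-specific termination described for the Sanger method can be illustrated with a small simulation. This is a hedged sketch, not a model of real reaction kinetics: the termination probability, the number of molecules, and all function names are assumptions chosen for illustration.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def extend_once(template_3_to_5, dd_base, p_terminate, rng):
    """Extend one primer along the template (given 3'->5', as the polymerase reads it).

    Each time the incorporated base equals `dd_base`, the chain terminates with
    probability `p_terminate` (a dideoxynucleotide was used instead of the
    regular nucleotide). Returns the terminated fragment, or None if the
    polymerase reached the end of the template without termination.
    """
    product = []
    for template_base in template_3_to_5:
        base = COMPLEMENT[template_base]
        product.append(base)
        if base == dd_base and rng.random() < p_terminate:
            return "".join(product)
    return None


def ladder_lengths(template_3_to_5, dd_base, n_molecules=2000, p_terminate=0.2, seed=0):
    """Fragment lengths observed in the reaction containing one dideoxynucleotide."""
    rng = random.Random(seed)
    lengths = set()
    for _ in range(n_molecules):
        fragment = extend_once(template_3_to_5, dd_base, p_terminate, rng)
        if fragment is not None:
            lengths.add(len(fragment))
    return sorted(lengths)


template = "TCGATC"  # template strand, 3'->5'; the synthesized strand is AGCTAG
for dd in "ACGT":
    print("dd" + dd, ladder_lengths(template, dd))
```

Each of the four reactions yields only fragments ending in its own nucleotide, so the four lists of lengths together cover every position of the synthesized strand exactly once. A termination probability that is too high would deplete the long fragments, which corresponds to the "too elevated terminator concentration" artifact.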

2.2. Sanger (enzymatic synthesis), summary

2. Sequencing technologies
2.3. 454 Technology Roche GS FLX (several hundred MB per run; advantage: reads up to 1,000 nt)
Pyrosequencing is a method of DNA sequencing (determining the order of nucleotides in DNA) based on the "sequencing by synthesis" principle. It differs from Sanger sequencing in that it relies on the detection of pyrophosphate release on nucleotide incorporation, rather than chain termination with dideoxynucleotides.
ssDNA template is hybridized to a sequencing primer and incubated with the enzymes DNA polymerase, ATP sulfurylase, luciferase and apyrase, and with the substrates adenosine 5' phosphosulfate (APS) and luciferin. The addition of one of the four deoxynucleoside triphosphates (dNTPs) initiates the second step (dATPαS, which is not a substrate for luciferase, is added instead of dATP). DNA polymerase incorporates the correct, complementary dNTPs onto the template. This incorporation releases pyrophosphate (PPi) stoichiometrically. ATP sulfurylase quantitatively converts PPi to ATP in the presence of adenosine 5' phosphosulfate. This ATP acts as fuel for the luciferase-mediated conversion of luciferin to oxyluciferin, which generates visible light in amounts proportional to the amount of ATP. The light produced in the luciferase-catalyzed reaction is detected by a camera and analyzed by a program. Unincorporated nucleotides and ATP are degraded by the apyrase, and the reaction can restart with another nucleotide.
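Because the light signal is proportional to the number of nucleotides incorporated in one flow, a read can be reconstructed from the series of signal intensities. The following is a minimal sketch under stated assumptions: the cyclic flow order "TACG", the normalization of signals to 1.0 per incorporated nucleotide, and the simple rounding rule are illustrative choices, not the instrument's actual base caller.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Translate normalized light intensities into a base sequence.

    signals[i] is the intensity measured after adding nucleotide
    flow_order[i % len(flow_order)]; a value near n means that n identical
    nucleotides were incorporated in that flow.
    """
    bases = []
    for i, signal in enumerate(signals):
        n_incorporated = max(0, round(signal))  # naive call: nearest integer
        bases.append(flow_order[i % len(flow_order)] * n_incorporated)
    return "".join(bases)


# Eight flows (two cycles of T, A, C, G) with a little measurement noise.
signals = [1.04, 0.02, 0.97, 2.10, 0.00, 1.00, 0.05, 0.98]
print(decode_flowgram(signals))  # TCGGAG
```

The rounding step is where homopolymer errors arise: a signal of 6.4 versus 6.6 decides between six and seven identical nucleotides, and the relative difference between such signals shrinks as the run gets longer.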


2. Sequencing technologies 2.4. Illumina (several GB per run; reads up to 300 nt)


2. Sequencing technologies
2.5. ABI SOLiD sequencing by ligation (2,000 MB per run; but only 35 nt/read)
A library of DNA fragments, ligated with universal sequence adaptors, is attached to the surface of magnetic beads (one fragment per bead). Emulsion PCR taking place in microreactors amplifies the fragments, which are then covalently bound to a glass slide. SOLiD technology applies a rather complicated ligation/cleavage procedure. Partially degenerate, fluorescently labeled DNA octamers with dinucleotide sequence recognition cores are hybridized to the template, and perfectly annealing sequences are ligated to the primer. After imaging, unextended strands are capped and fluorophores are cleaved. Repetitions of new priming, primer removal, and ligation cycles will in the end cover a stretch of 35 nt twice (redundantly), which improves the accuracy of base calling. Yet the value of a 35 nt read is dwindling in the face of other NGS technologies producing longer reads almost every year (e.g., Illumina promising 300 nt for 2014).
[Figure: first cycle, cleavage]


2. Sequencing technologies
2.6. Ion Torrent (100 MB + per run; up to 200 nt/read)
Incorporation of a deoxyribonucleotide triphosphate (dNTP) into a primed, growing DNA strand involves the release of pyrophosphate and of a hydrogen ion, which is measured on a semiconductor chip. Microwells, each containing one single-stranded template DNA molecule plus a DNA polymerase, are sequentially flooded with A, C, G or T. An introduced dNTP is incorporated into the growing complementary strand only if it is complementary to the next unpaired nucleotide on the template strand. If several identical nucleotides follow each other, the signal strength correlates with the number of identical nucleotides incorporated. The series of electrical pulses is translated into a DNA sequence, without intermediate signal conversion, the use of labeled nucleotides, or error-prone intermediate amplification steps. However, the signal precision is lower than with 454, Illumina, and SOLiD technologies.
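The sequential flooding with single nucleotides can be sketched as follows: for a given strand to be synthesized, each flow incorporates zero, one, or several identical nucleotides, and the idealized signal is that count. The cyclic flow order "TACG", the function name, and the flow limit are assumptions for illustration; real instruments use their own flow orders and noisy signals.

```python
def simulate_flows(read, flow_order="TACG", max_flows=400):
    """Ideal number of incorporated nucleotides per flow.

    `read` is the strand being synthesized (5'->3'). In each flow, all
    consecutive nucleotides matching the flooded base are incorporated at once.
    """
    counts = []
    position = 0
    flow = 0
    while position < len(read) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        incorporated = 0
        while position < len(read) and read[position] == base:
            incorporated += 1
            position += 1
        counts.append(incorporated)  # proportional to the measured pulse
        flow += 1
    return counts


print(simulate_flows("TTTACGG"))  # [3, 1, 1, 2]
print(simulate_flows("GA"))       # [0, 0, 0, 1, 0, 1]
```

The first example shows a homopolymer (TTT) giving a single pulse of triple height; telling a pulse of height 7 from one of height 8 is much harder than telling 1 from 2, which is the source of homopolymer length errors.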

2. Sequencing technologies
2.7. Pacific Biosciences (+/- 3,000 nt/read, up to 15,000)
The PacBio RS II is a single-molecule, real-time DNA sequencing system that provides the longest read lengths of any available sequencing technology; however, it has the lowest precision of all NGS technologies. Sequencing occurs on SMRT Cells, each containing thousands of Zero-Mode Waveguides (ZMWs) in which polymerases are immobilized. The ZMWs provide a way of directly watching DNA polymerase with a high-resolution camera as it performs sequencing by synthesis (fluorescence measurement; four different fluorochrome-labeled nucleotides). The long read length is valuable for the assembly of genomes, in particular in regions containing long sequence repeats that otherwise cause problems in genome assembly. In addition, the system detects DNA base modifications using the kinetics of the polymerization reaction during sequencing.


2. Sequencing technologies comparison from 2012 Quail et al. BMC Genomics 2012, 13:341

3. Random genome sequencing (Sanger, Maxam-Gilbert)
3.1. with cloning
DNA is not amplified in vitro; each clone receives an original piece of DNA in a plasmid that is multiplied by E. coli (for artifacts, see below).

3. Random genome sequencing
3.2. without cloning
NGS procedures use DNAs either attached to nanochips (microwells) or in oil-drop emulsion.
454, Illumina, SOLiD: DNA is highly PCR-amplified. Errors may therefore come from PCR amplification artifacts.
Pacific Biosciences and Ion Torrent technologies both read single molecules directly without prior PCR amplification. In contrast, their relatively high error rate is due to the signal precision itself.

4. Data formats of results: autoradiograms, traces, FASTQ and base call qualities
Trace file typical for Sanger sequencing, with base call qualities indicated by the height of blue bars and Q numbers. The advantage of this format is easy spotting of artifacts by a human expert. The typical NGS format (FASTQ) only reports the sequence plus the quality, encoded in machine-readable format.

4. Data formats of results: quality scores in FASTQ format
The typical NGS format (FASTQ) only reports the sequence plus the quality, encoded in machine-readable format.
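As a concrete illustration of the machine-readable quality encoding, the sketch below parses a four-line FASTQ record and converts the quality string into Phred scores Q and error probabilities P = 10^(-Q/10). It assumes the common Phred+33 ASCII offset (older Illumina files used other offsets) and strictly four lines per record; the read name and function names are made up for the example.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        if not header.startswith("@"):
            raise ValueError("FASTQ header must start with '@'")
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if len(quality) != len(sequence):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode one ASCII character per base into an integer Q score."""
    return [ord(char) - offset for char in quality]


def error_probability(q):
    """Probability that the base call is wrong, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


example = io.StringIO("@read1\nACGT\n+\nI5+!\n")
for name, sequence, quality in parse_fastq(example):
    for base, q in zip(sequence, phred_scores(quality)):
        print(name, base, q, error_probability(q))
```

In this toy record the four quality characters decode to Q40, Q20, Q10 and Q0, that is, error probabilities of 0.0001, 0.01, 0.1 and 1.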


4. Data formats of results quality scores (Illumina example)

5. Artifacts in sequencing and sequence assembly
The denatured DNA is not linear, as it folds back on itself and then migrates differently on the sequencing gel (Sanger)
- reason: secondary structures, mainly in G+C rich regions
- effect: compression zones in the sequencing ladder
- solutions: (i) sequence DNA in both directions (complementary strands); sequencing artifacts due to folding are not symmetric; (ii) for Sanger sequencing, use nucleotide analogs that minimize secondary structure folding, like deaza-NTP, deaza-dITP, or ITP (instead of NTPs or dGTP, respectively)
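Since compressions occur mainly in G+C rich regions, a simple sliding-window scan can flag stretches of a sequence that deserve a second look (for example, by sequencing the opposite strand). This is a rough sketch only: the window size and threshold are arbitrary assumptions, and high G+C content is merely a proxy for secondary structure, not a prediction of folding.

```python
def gc_rich_windows(seq, window=10, threshold=0.8):
    """Return (start, gc_fraction) for every window at or above the threshold."""
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        gc_fraction = (segment.count("G") + segment.count("C")) / window
        if gc_fraction >= threshold:
            hits.append((start, gc_fraction))
    return hits


sequence = "ATATATATAT" + "GCGCGCGCGC" + "ATATATATAT"
for start, gc_fraction in gc_rich_windows(sequence):
    print(start, gc_fraction)
```

In this toy sequence only the windows overlapping the central GC stretch are reported (window starts 8 to 12).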

Artifacts in sequencing and assembly
Sequencing ladders terminate prematurely or contain holes
Reasons: (i) sequencing reactions over-modified (M&G), or terminator concentrations too high (Sanger); (ii) strong nucleotide bias, like long runs of A or T that cause many polymerases to fall off the template (Sanger)

Artifacts in sequencing and assembly
Uncertain number of identical nucleotides in a row (homopolymers; > 6)
Reasons:
- Amplification errors by DNA polymerase (Illumina, 454)
- Signal ambiguity when estimating the number of identical nucleotides from the height of a single signal (Illumina, 454)
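Homopolymer runs longer than six nucleotides can be located in an assembled sequence so that these positions are checked with extra care. A minimal sketch, assuming upper-case DNA and a hypothetical function name; the default cutoff of seven follows the "> 6" stated above.

```python
import re


def homopolymer_runs(seq, min_len=7):
    """Return (start, base, length) for each run of at least min_len identical bases."""
    # One captured base followed by at least (min_len - 1) repeats of it.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0))) for m in pattern.finditer(seq)]


sequence = "ACG" + "A" * 7 + "T" * 6 + "GC"
print(homopolymer_runs(sequence))  # [(3, 'A', 7)]
```

The run of six T is not reported because it does not exceed the cutoff; lowering `min_len` makes the scan more sensitive.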

Artifacts in sequencing and assembly
Reads that only partially fit the genome sequence (one of the worst artifacts)
Reasons:
- Ligation of separate pieces into one fragment during primer ligation (all technologies using primer ligation)
- Partial deletion of sequence during PCR at repeat sequences and folded structures (all technologies using PCR amplification)

This is it, folks!