An Overview of DNA Sequencing




Prokaryotic DNA Plasmid http://en.wikipedia.org/wiki/image:prokaryote_cell_diagram.svg

Eukaryotic DNA http://en.wikipedia.org/wiki/image:plant_cell_structure_svg.svg

DNA Structure
The two strands of a DNA molecule are held together by weak bonds (hydrogen bonds) between the nitrogenous bases, which are paired in the interior of the double helix. The two strands of DNA are antiparallel; they run in opposite directions. The carbon atoms of the deoxyribose sugars are numbered for orientation.
http://en.wikipedia.org/wiki/image:dna_chemical_structure.png

Sequencing DNA
The goal of sequencing DNA is to determine the order of the bases, or nucleotides, that form the inside of the double-helix molecule. We can do this in one of two ways:
- Directed sequencing
- Shotgun sequencing

Directed Sequencing = Primer Walking
- Start with a genome, gene, clone, or PCR product.
- Design a primer and sequence a certain segment of the genome, usually the beginning.
- From that sequence, design the next primer and sequence the next segment of the genome.
- Continue designing primers and sequencing until the genome is completed.

Shotgun Sequencing
- Start with a whole genome or a large piece of DNA (such as a BAC).
- Shear the DNA into many different, random segments.
- Sequence each of the random segments.
- Then put the pieces back together in their original order.

Theory Behind Shotgun Sequencing
Haemophilus influenzae, 1.83 Mb

Coverage    Unsequenced (%)
1X          37
2X          13
5X          0.67
6X          0.25
7X          0.09

[Figure: number of gaps plotted against number of sequences]

For a 1.83 Mb genome, 6X coverage is 10.98 Mb of sequence, or about 22,000 sequencing reactions from 11,000 clones (1.5-2.0 kb inserts) at a 500 bp average read length.
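The percentages in the table are consistent with a simple Poisson model of random read placement, in which the expected unsequenced fraction at coverage c is e^(-c). The slide does not name the model, so treat that as an assumption; the function names below are illustrative. The sketch also reproduces the read-count arithmetic for 6X coverage.

```python
import math


def unsequenced_fraction(coverage):
    """Expected fraction of bases hit by no read, assuming read starts
    are Poisson-distributed along the genome (an assumption here)."""
    return math.exp(-coverage)


def reads_needed(genome_bp, coverage, read_len_bp):
    """Number of reads required to reach the requested coverage."""
    return math.ceil(genome_bp * coverage / read_len_bp)


for c in (1, 2, 5, 6, 7):
    print(f"{c}X: {unsequenced_fraction(c) * 100:.2f}% unsequenced")

# 1.83 Mb genome, 6X coverage, 500 bp average read
print(reads_needed(1_830_000, 6, 500))  # 21960, i.e. roughly 22,000 reactions
```

Under this model the 2X value comes out at about 13.5%, which the slide lists as 13%.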

BAC-Based Projects
BACs are bacterial artificial chromosomes. They are large cloning vectors that can hold pieces of DNA 50 to 300 kilobases long. To make copies of the DNA, we insert the BAC into an E. coli cell.
Workflow (from the figure): chromosome; break into large pieces; clone into BAC vector; transform into E. coli and process each BAC as an individual project.

Shotgun vs. Directed: Which is the better, more efficient method?
Directed sequencing:
- Sequences are done in order, so there's no puzzle to put back together.
- Minimal computing power is required.
- Primers must be continually designed and purchased.
- If an area is difficult to sequence, you could get stuck.
Shotgun sequencing:
- Takes less time; all the sequences can be done at about the same time.
- Minimal cost.
- But the pieces have to be assembled, a time-consuming process requiring extensive computational power.

Whole Genome Shotgun Sequencing
1. Library construction
   a. isolate DNA
   b. fragment DNA
   c. clone DNA
2. Random sequencing phase
   a. sequence DNA (15,000 sequences/Mb)
3. Closure phase
   a. assemble sequences
   b. close gaps
   c. edit
   d. annotation
4. Complete genome sequence

Library Construction
- Construct shotgun libraries of the genome or target DNA; insert sizes are varied and can be small (2-3 kb), medium (8-12 kb), or fosmid (30-40 kb).
- Start with multiple copies of purified DNA.
- Shear the DNA using mechanical force, breaking it into smaller pieces.
- DNA fragments are cloned into a plasmid vector to replicate the DNA; these fragments are called inserts.

Library Construction
- Insert the fragmented DNA into a vector such as pBR322.
- Transform into E. coli cells and, using an antibiotic, select for cells that have a plasmid.
- The plasmids carry antibiotic resistance genes. When plated in the presence of an antibiotic, the cells without a plasmid die.

Isolation of the Plasmid DNA: Template Preparation
- Transformed E. coli is plated onto an agar plate. All cells in a given colony contain plasmids with the same insert.
- Colonies are picked and transferred to liquid media, where they multiply; 384-well high-throughput plates are used.
- Plasmids are isolated and suspended in a buffer.

Template Production Laboratory Current Capacity: 22,000,000 plasmids/year

Sanger Sequencing
- Uses the dideoxy chain-termination sequencing method (Sanger).
- Each plasmid is reacted with a forward and a reverse primer (2 reactions for each piece of DNA).
- Done in a high-throughput manner in 384-well plates.

Sequencing reactions
- Initial dideoxy sequencing used radioactive dATP and 4 separate reactions (ddATP, ddTTP, ddCTP, ddGTP), with separation in 4 separate lanes on an acrylamide gel and detection by autoradiogram.
- New technologies use 4 fluorescently labeled bases, separation in capillaries, and detection through a CCD camera.

DNA sequencing

Plasmid Structure
[Figure labels: F primer binds, synthesis; R primer binds, synthesis; antibiotic resistance]

Sequencing Machines The DNA fragments are loaded into capillaries in the sequencing machines. Polymer in the capillaries provides a matrix for separating the DNA fragments based on size. Separation of the fragments through a matrix is called electrophoresis.

Sequence Production Laboratory Current Capacity: 40,000,000 sequences/year

Data Collection A laser excites the fluorescent dyes. A camera detects the fluorescence. Data collection software collects the data. Capillary array view

Sequencing Machine Output
AACTCATCGAATCCGTACGGG
AACTCATCGAATCCGTACGG
AACTCATCGAATCCGTACG
AACTCATCGAATCCGTAC
AACTCATCGAATCCGTA
AACTCATCGAATCCGT
AACTCATCGAATCCG
AACTCATCGAATCC
AACTCATCGAATC
AACTCATCGAAT
AACTCATCGAA
AACTCATCGA
AACTCATCG
AACTCATC
AACTCAT
AACTCA
AACTC
AACT
AAC
AA
A
This is a diagram of just one lane. Reading from the bottom, where the fragment is only one base long, the fluorescent dye is an A. This is the first base in the sequence.
Fluorescent sequencing gel: four colors, one lane per sample. Each fragment differs in length from the next by one nucleotide.
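Reading the ladder from the shortest fragment upward can be sketched in a few lines. This is an idealized illustration, assuming exactly one fragment of each length and that the last base of each fragment is the dye-labelled terminator; the function name is hypothetical.

```python
def read_ladder(fragments):
    """Recover the sequence from chain-terminated fragments: sort them by
    length (shortest first, as they migrate) and read each one's final base."""
    ordered = sorted(fragments, key=len)
    return "".join(fragment[-1] for fragment in ordered)


full = "AACTCATCGAATCCGTACGGG"
# Build the same ladder as in the diagram: every prefix, longest first.
ladder = [full[:n] for n in range(len(full), 0, -1)]

print(read_ladder(ladder))  # AACTCATCGAATCCGTACGGG
```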

Data Analysis
A chromatogram is produced and the bases are called. Software assigns a quality value to each base.
Phred & TraceTuner:
- Read DNA sequencer traces
- Call bases
- Assign base quality values
- Write basecalls and quality values to output files

Assemble Fragments
Sequencer output (fragments from the sequencing process):
AGCTAGGCTC
CTAGCTAGCTAGGCTC
AGCTCGCTAGCTA
TAGCTAGC
GCTAGCTAGCT
AGCTAGC
GCTAGCTAGC
AGCTCGCTA
TAGCTAGCTA
CTCGCTAGCTAG
Assembled fragments (consensus sequence):
AGCTCGCTAGCTAGCTAGCTAGCTAGGCTC
Next step: closure & annotation

Closure
- Assemble the sequence files, relate them to each other, and close gaps.
- Involves many computational programs to identify overlapping sequences and linkages between sequences.
- Back to the lab for hard-to-close gaps.

Complicating Factors
- A procedure that works well in one species may not produce the same results in even a closely related species.
- Each genome is unique in its size and in how it sequences and assembles.
- New technologies must be developed to tackle the unique characteristics and properties of difficult genomes.

Whole Genome Shotgun Sequencing: Modifying for Eukaryotes
Shotgun sequencing is not restricted to bacterial organisms. Approaches for sequencing eukaryotes:
- Whole genome draft sequence: same approach as with bacteria.
- Chromosome by chromosome: sequence the genome using large-insert bacterial artificial chromosome (BAC) clones anchored to the chromosomes.
- A combination of whole genome and chromosome by chromosome.

Whole Genome Draft Sequencing
1. Library construction
   a. isolate DNA
   b. fragment DNA
   c. clone DNA
2. Random sequencing phase
   a. sequence DNA (15,000 sequences/Mb)
3. Closure phase
   a. assemble sequences
   b. close gaps
   c. edit
   d. annotation
4. Complete genome sequence
Advantages: saves time and money (~50%)
Disadvantages: incomplete sequence, contains errors

454 Genome Sequencing System
- Library prep, amplification, and sequencing: 2-4 days
- Single sample preparation from bacterial to human genomic DNA
- Single amplification per genome with no cloning or cloning artifacts
- Picoliter-volume molecular biology
- 100 Mb per run (4-5 hr); less than $20,000 per run
- Read lengths of 200-230 bases
- Massively parallel imaging, fluidics, and data analysis
- Requires high genome coverage for good assembly
- Error rate of 1-2%

454-Pyrosequencing
- Construct single-stranded, adaptor-ligated DNA
- Perform emulsion PCR
- Deposit DNA beads into the PicoTiter Plate
- Sequencing by synthesis: simultaneous sequencing of the entire genome in hundreds of thousands of picoliter-size wells
- Pyrophosphate signal generation

Expressed Sequence Tags (ESTs): Sampling the Transcriptome and Genic Regions
What is an EST? A single-pass sequence from cDNA of a specific tissue, stage, environment, etc.
Workflow: cDNA library in E. coli; pick individual clones; template prep.
[Figure labels: T7, insert in pBluescript, T3]
Multiple tissues, states, and so on: with enough sequences, one can ask quantitative questions.

Uses of EST sequencing:
- Gene discovery
- Digital northerns/insights into the transcriptome
- Genome analyses, especially annotation of genomic DNA
Issues with EST sequencing:
- Inherently low quality due to single-pass nature
- Not 100% full-length cDNA clones
- Redundant sequencing of abundant transcripts
Address these through clustering/assembly to build consensus sequences = Gene Index, Unigene Set, Transcript Assembly

EST Clustering
- Input: all ESTs and mRNAs from an organism (single-pass transcript sequences)
- Cluster and assemble
- Set of clustered, assembled sequences = contigs, Transcript Assembly, Tentative Consensus, Unigene (a longer, more accurate sequence of the transcript)
- Sequences which do not cluster or assemble = singletons, singlets

Web Links for Animation on Genome Sequencing
http://www.jgi.doe.gov/education/how/how30minflash.html
http://www.illumina.com/media.ilmn?title=sequencing-by-Synthesis%20Demo&Cap=&PageName=solexa%20technology&pageurl=203&media=1

From Fragments to Finished Genome Overview of Sequencing, Assembly, and Closure Processes

Genome Sequencing Process
- Library construction (rDNA molecules)
- Clone picking
- Template preparation
- Sequencing reactions
- Electrophoresis and base calling
- Genome assembly
- Genome closure: order contigs, close gaps, identify repeats, finish the genome
- Annotation

Sequence Requirements
1. Free vector should be at a low or undetectable level.
2. No chimeric clones. Chimeras occur when two or more random fragments from separate parts of the genome recombine and end up next to each other.
3. The majority of the inserts should be of relatively uniform size.
4. Libraries need to be random and cover the whole genome.

Basecalling & Quality Assignments
Phred & TraceTuner:
- Read DNA sequencer traces
- Call bases
- Assign base quality values
- Write basecalls and quality values to output files

What are phred quality values?
The quality value q assigned to a base call is defined as
$q = -10 \log_{10}(p)$
where p is the estimated error probability for that base call.

Put another way, a base call with a probability of 1/1000 of being incorrect is assigned a quality value of 30.

Probability    Quality value
1/1000         30
1/100          20
1/10           10
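The conversion between error probability and quality value is easy to check numerically. The sketch below applies the definition directly; the function names are illustrative and are not part of Phred or TraceTuner.

```python
import math


def phred_quality(p_error):
    """Quality value q = -10 * log10(p) for an estimated error probability p."""
    return -10 * math.log10(p_error)


def error_probability(q):
    """Inverse conversion: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


for p in (1 / 10, 1 / 100, 1 / 1000):
    print(f"p = {p:g} -> q = {phred_quality(p):.0f}")
```

Each factor of ten in the error probability therefore corresponds to ten quality units.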

Assembling the fragments

Merging two sequences
AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
      CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT
- overlap (19 bases): region of similarity between the two sequences
- overhang (6 bases in the example): unaligned ends of the sequences
- % identity = 18/19 = 94.7%
The assembler screens merges based on:
- length of overlap
- % identity in the overlap region
- maximum overhang size
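The identity figure for this example can be recomputed directly. In the sketch below the slice coordinates were located by hand from the two sequences on the slide, and the thresholds in `accept_merge` are hypothetical placeholders, not values used by any particular assembler.

```python
def percent_identity(a, b):
    """Percentage of matching positions between two equal-length aligned regions."""
    if len(a) != len(b):
        raise ValueError("aligned regions must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def accept_merge(overlap_len, identity, overhang,
                 min_overlap=15, min_identity=90.0, max_overhang=10):
    """Screen a candidate merge on the three criteria listed above.
    The default thresholds are illustrative assumptions."""
    return (overlap_len >= min_overlap
            and identity >= min_identity
            and overhang <= max_overhang)


seq1 = "AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC"
seq2 = "CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT"

region1 = seq1[14:33]  # 19-base overlap region in the first sequence
region2 = seq2[8:27]   # matching 19-base region in the second sequence

identity = percent_identity(region1, region2)
print(len(region1), round(identity, 1))  # 19 94.7
```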

TIGR Assembler
Greedy:
- Build a rough map of fragment overlaps
- Pick the highest-scoring overlap
- Merge the two fragments
- Repeat until no more merges can be done
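The greedy loop can be illustrated with a toy sketch. This is not the TIGR Assembler itself: it assumes error-free reads, scores an overlap simply by its length, considers only exact suffix-prefix matches on one strand, and uses a hypothetical `min_len` cutoff.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`
    (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    while True:
        # Discard fragments wholly contained in another fragment.
        frags = [f for f in frags if not any(f != g and f in g for g in frags)]
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = suffix_prefix_overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return frags  # no more merges can be done
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))  # ['ATGCGTACGTTAGC']
```

A greedy strategy like this is fast but can be misled by repeats, which is one reason the later closure steps check assemblies against linking information.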

Forward-reverse constraints
- The sequenced ends are facing towards each other.
- The distance between the two fragments is known (within certain experimental error).
[Figure: a clone of known length with its forward (F) and reverse (R) sequenced ends]
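Both constraints can be expressed as a simple check on where two mated reads land in an assembly. The sketch below is one possible formulation; the `PlacedRead` type, the coordinate convention, and the tolerance parameter are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    start: int     # leftmost coordinate of the read on the contig
    end: int       # rightmost coordinate (exclusive)
    forward: bool  # True if the read lies on the forward strand


def mates_consistent(a, b, clone_len, tolerance):
    """True if the two reads face each other and span roughly one clone length."""
    left, right = (a, b) if a.start <= b.start else (b, a)
    facing = left.forward and not right.forward
    span = right.end - left.start
    return facing and abs(span - clone_len) <= tolerance


f_read = PlacedRead(start=100, end=600, forward=True)
r_read = PlacedRead(start=1600, end=2100, forward=False)
print(mates_consistent(f_read, r_read, clone_len=2000, tolerance=300))  # True
```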

Scaffolding
Given a set of non-overlapping contigs, order and orient them along a chromosome.
[Figure: contigs I, II, III, and IV being ordered and oriented]

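Clone-mates
[Figure: an insert in a vector with forward (F) and reverse (R) reads, and the possible relative orientations of the mated reads on contigs I and II]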

Linking information
- Overlaps
- Mate-pair links
- Similarity links
- Physical markers
- Gene synteny
[Figure labels: reference genome, physical map]

Grouping the contigs

Assembly gaps
- Sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap.
- Physical gap: no information is known about the adjacent contigs, nor about the DNA spanning the gap.
[Figure: groups A and B separated by a physical gap; sequencing gaps between contigs]

Unifying view of assembly Assembly Scaffolding

Why Completeness is Important
- Improves characterization of genome features: gene order, replication origins
- Better comparative genomics: genome duplications, inversions
- Determination of the presence and absence of particular genes and features is less subjective
- Missing sequence might be important (e.g., centromere)
- Allows researchers to focus on biology, not sequencing
- Facilitates large-scale correlation studies
- Controls for contamination

What Is Closure?
- Obtaining sequence that was not obtained during random sequencing, which resulted in:
  - Sequencing gaps
  - Physical ends
- Confirming the integrity of assemblies
  - Repeats and misassemblies
  - Verification of clone coverage
- Confirming the base sequence of the consensus
  - Editing
  - Verification of sequence coverage

Sequence Validation: Sequence coverage
[Figure: regions of an assembly at 1X, 2X, and 3X sequence coverage]
Sequence coverage rule: Every base in an assembly must be covered by at least two sequences of high quality.
Why? Validating sequence coverage provides a high degree of confidence in the consensus base calls.
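The coverage rule can be checked mechanically once reads are placed on the consensus. In the sketch below, the input layout (start offset plus per-base quality values) and the quality cutoff of 20 are assumptions; the slide does not say what counts as "high quality".

```python
def low_coverage_positions(contig_len, reads, min_quality=20, min_depth=2):
    """Return consensus positions covered by fewer than min_depth
    high-quality bases.

    reads: list of (start, qualities) pairs, where start is the 0-based
    offset of the read on the contig and qualities holds one quality
    value per base of the read.
    """
    depth = [0] * contig_len
    for start, qualities in reads:
        for offset, q in enumerate(qualities):
            pos = start + offset
            if 0 <= pos < contig_len and q >= min_quality:
                depth[pos] += 1
    return [pos for pos, d in enumerate(depth) if d < min_depth]


placed_reads = [
    (0, [30, 30, 30, 30]),
    (2, [30, 30, 10, 30]),
]
print(low_coverage_positions(6, placed_reads))  # [0, 1, 4, 5]
```

Positions returned by such a check are candidates for additional sequencing during closure.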

In this example there is an obvious discrepancy between the base calls of several of the underlying clones in this region. Sequence editor

Causes for gaps
- Non-random shotgun library
  - Toxicity of genes or promoters in E. coli
  - Genomic DNA difficult to clone (capsular polysaccharides)
  - Unstable regions (low complexity)
- Sequencing problems
  - Hard stops: secondary structures, very high or low GC content, small-unit tandem repeats
  - Loss of signal: homopolymeric tracts, very high or low GC content

Closure Challenges: Sequencing Through Secondary Structures

Hairpin structure

Homopolymeric tracts

Solutions
- Apply different sequencing chemistries
  - Big-Dye terminator (default)
  - Dye-primer
  - dGTP mix (GC-rich regions)
- Denature structures with additives
  - Betaine
  - DMSO
- Break structure
  - Restriction digest
  - Transposon insertion
  - Micro-libraries

Repetitive Areas
Repetitive areas are regions of high similarity within the genome/BAC. Sequences in these areas may be misassembled by the assembler.
Verification of the sequence of repetitive areas:
A. Identify potential repetitive areas, using repeatfinder and other tools.
B. Classify repeats based on length, copy number, % similarity, structure, and complexity.
C. If repeats are misassembled, transpose spanning clones or obtain PCR products and sequence to verify assembly.

Mis-assembled repeat Clones link different repeat flanks

Resolved Repeat
- Unique flank order is correct: use linking information across the repeat (large-insert clones or PCR)
- Consensus sequence is correct: use linked clones that have one mate in the repeat and the other anchored in unique sequence
- Transposon-mediated libraries

Fasta Format
>GDRFE25TF
CATTGAACACTAGGAGCCATAGAC
GTTCAACCGTTTAAGGCAAAACTTA
AATTTTGGGCAGACTCTAGATCATG
GGTAATACATACTCTGGGATTACGA
(Sequence lines may hold up to 60 bases per line.)

Multifasta Format
>GDRFE25TF
CATTGAACACTAGGAGCCATAGACT
GTTCAACCGTTTAAGGCAAAACTTA
AATTTTGGGCAGACTCTAGATCATG
GGTAATACATACTCTGGGATTACGA
>GDRFE45TF
ACTGGTTCACATGGAGGGATAGTAC
GACACTCCGTAGCTGGCAATCCTTA
GGCTCTCAATCGAGACTCTAGTTAC
TCCAATATGGGCTCATGGAACAAGA
>GDRFE67TF
CATTGAACACTAGGAGCCATAGATC
AATGTGGCGTAGCTGCCACTTGGTA
TACCGTCAATCGTATTGTCTAGTTAC
GGGAGATAATATGGGCTCATATGGT
(Sequence lines may hold up to 60 bases per line.)
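A multifasta file like the one above can be read with a few lines of code. This is a minimal sketch that keeps everything in memory and treats the whole header line after ">" as the record name; the function name is illustrative, and production pipelines would more likely use an established parser.

```python
def parse_fasta(text):
    """Parse FASTA or multifasta text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()  # header line starts a new record
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


example = """>GDRFE25TF
CATTGAACACTAGGAGCCATAGACT
GTTCAACCGTTTAAGGCAAAACTTA
>GDRFE45TF
ACTGGTTCACATGGAGGGATAGTAC
"""

for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```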