Next generation DNA sequencing technologies: theory & practice
Outline
Next-Generation sequencing (NGS) technologies overview
NGS applications
NGS workflow: data collection and processing, the exome sequencing pipeline
PART I: NGS technologies
Next-Generation sequencing (NGS) technologies overview
Landmarks in DNA sequencing
1953 Discovery of DNA double helix structure
1977 A Maxam and W Gilbert, "DNA seq by chemical degradation"; F Sanger, "DNA sequencing with chain-terminating inhibitors"
1984 DNA sequence of the Epstein-Barr virus, 170 kb
1987 Applied Biosystems: first automated sequencer
1991 Sequencing of human genome in Venter's lab
1996 P. Nyrén and M Ronaghi: pyrosequencing
2001 A draft sequence of the human genome
2003 Human genome completed
2004 454 Life Sciences markets first NGS machine
Massively parallel sequencing 11/10/13
DNA Sequencing the next generation
Commercially available technologies
Roche 454: GS FLX Titanium, Junior
Illumina: HiSeq 2000/2500, MiSeq
Life: SOLiD 5500, Ion Torrent/Proton
(Helicos BioSciences: HeliScope)
Pacific Biosciences: PacBio RS
DNA Sequencing the next generation
The newer technologies constitute various strategies that rely on a combination of:
  Library/template preparation
  Parallel sequencing
Template preparation: STEP1
Template preparation
Produce a non-biased source of nucleic acid material from the genome.
Current methods:
  Randomly break genomic DNA into smaller fragments
  Ligate adaptors
  Attach or immobilize the template to a solid surface or support
The spatially separated template sites allow thousands to billions of sequencing reactions to be performed simultaneously.
Template preparation
Clonal amplification: Roche 454, Illumina HiSeq, Life SOLiD, Life Ion Torrent
Single molecule sequencing: Helicos BioSciences HeliScope, Pacific Biosciences PacBio RS
Template preparation: Clonal amplification
In solution, emulsion PCR (emPCR): Roche 454, Life SOLiD
Solid phase, bridge PCR: Illumina HiSeq, Life SOLiD - Wildfire
Template preparation: Clonal amplification, emPCR
Template preparation: SOLiD, 454, Ion Torrent
Jeroen Van Houdt - Genomics Core - UZ Leuven - KU Leuven
Template preparation: Clonal amplification, Bridge PCR
Template preparation: Single molecule templates (HeliScope, PacBio)
Sequencing
Sequencing By Synthesis (SBS): Roche 454, Illumina HiSeq, Life Ion Torrent (label-free), Helicos BioSciences HeliScope, Pacific Biosciences PacBio RS
Sequencing By Ligation: Life SOLiD
454 - Pyrosequencing: Picotitre plate, pyrosequencing
454 - Pyrosequencing
Ion Torrent: label-free sequencing
HiSeq, HeliScope
Sequencing: PacBio single molecule
Sequencing by ligation
DNA Sequencing
The major advance offered by NGS is the ability to cheaply produce an enormous volume of data.
The arrival of NGS technologies in the marketplace has changed the way we think about scientific approaches in basic, applied and clinical research.
NGS makes it possible to study different aspects of the genetic architecture at the whole genome scale.
Whole-Genome SEQUENCING DNA SEQUENCING
Whole-Genome SEQUENCING
WGS - Copy number variation analysis
WGS - Structural variation analysis
Whole-Genome Sequencing (WGS)
Copy number variation analysis
  Sequencing a genome at 0.1-0.3x
  Sequencing a genome at 1-3x
Structural variation analysis
  Sequencing a genome at 5-10x
Whole genome re-sequencing
  Sequencing a genome at >30x
  yeast, fruit fly, bacterial genomes, human
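The depth figures on this slide translate into read counts through the usual relation depth = reads x read length / genome size. The following is a minimal sketch of that arithmetic, assuming a 3 Gb human genome and 100 bp reads; the read length and the function name `reads_needed` are illustrative assumptions, not values from the slides.

```python
import math

def reads_needed(depth: float, genome_size_bp: int, read_length_bp: int) -> int:
    """Number of reads required to reach a mean depth, from depth = N * L / G."""
    return math.ceil(depth * genome_size_bp / read_length_bp)

HUMAN_GENOME_BP = 3_000_000_000  # assumed haploid genome size (3 Gb)
READ_LENGTH_BP = 100             # assumed read length, for illustration only

# One representative depth per application listed on the slide
applications = {
    "copy number variation (1x)": 1,
    "structural variation (5x)": 5,
    "whole genome re-sequencing (30x)": 30,
}

for name, depth in applications.items():
    n = reads_needed(depth, HUMAN_GENOME_BP, READ_LENGTH_BP)
    print(f"{name}: {n:,} reads")
```

Under these assumptions, moving from low-pass copy number analysis to full re-sequencing raises the required read count by more than an order of magnitude, which is why the application dictates the sequencing budget.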
Targeted re-sequencing DNA SEQUENCING
The beginning
Random genome sequencing
Sanger sequencing: targeted, 700-1000 bp
DNA Sequencing the next generation
Library/template preparation
Library enrichment for target
Sequencing and imaging
Target enrichment strategies Random genome sequencing Hybrid Capture PCR based Sanger sequencing
Target enrichment strategies
Hybrid Capture
In solution: Agilent, Nimblegen...
Solid phase: Agilent, Nimblegen, Febit...
Hybrid Capture
In solution: relatively cheap; high throughput is possible; small amounts of DNA sufficient
Solid phase: straightforward method; flexible; higher amounts of DNA
PCR based approaches Uniplex Multiplex Fluidigm Raindance Multiplicon Long-range PCR products Raindance
RNA Sequencing
Rapid expression profiling, transcriptome sequencing and small RNAs
RNA-seq
RNAseq: Gene Expression through sequencing
Supports discovery, screening, and profiling
Does not require prior gene knowledge or annotation
Unique combination of qualitative and quantitative measurement
Digital counts vs analog intensities
Increased dynamic range and sensitivity
No probes or primers
Any species, even when a reference genome is not available
Analyze gene expression
RNAseq: summary
Counting or profiling: 10 million total reads of 35 bp length from poly-A selected RNA will give performance better than any microarray.
Studying alternative splicing or quantifying cSNPs for most transcripts: deeper profiling of 50 to 100 million reads, with read lengths of 50 to 100 bp, from poly-A selected RNA using the mRNA-Seq assay.
Complete annotation of an entirely new transcriptome: ~500 million reads of 100 bp read length from multiple tissues; normalized stranded mRNA-Seq & ncRNAs; small RNA-Seq for microRNAs.
PART III: NGS workflow
Data collection and processing, the exome sequencing pipeline
Whole Exome Sequencing
The human genome: Genome = 3 Gb, Exome = 30 Mb, 180 000 exons
Protein coding genes constitute only approximately 1% of the human genome.
It is estimated that 85% of the mutations with large effects on disease-related traits can be found in exons or splice sites.
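The "approximately 1%" figure follows directly from the sizes quoted on this slide. A short sketch of the arithmetic, using only the slide's own numbers (the variable names are illustrative):

```python
GENOME_BP = 3_000_000_000  # 3 Gb, from the slide
EXOME_BP = 30_000_000      # 30 Mb, from the slide
N_EXONS = 180_000          # from the slide

# Fraction of the genome covered by the exome
exome_fraction = EXOME_BP / GENOME_BP

# Average exon size implied by the two exome figures
mean_exon_length = EXOME_BP / N_EXONS

print(f"exome fraction of genome: {exome_fraction:.1%}")
print(f"mean exon length: {mean_exon_length:.0f} bp")
```

The implied mean exon length of well under 200 bp is one reason exome capture relies on many short targets rather than a few long ones.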
Exome sequencing: gDNA (3 Gb), exome (38 Mb), NGS
The past, present & future
[Figure: exome capture cost, sequencing cost (2.5 Gbases) and total cost at 1/01/2010, 1/08/2010 and 1/01/2011; plotted values: 7000, 5900, 3460, 2600, 1100, 860, 300, 1000, 1300]
Exome sequencing capacity
HiSeq specifications:
  2 flow cells
  16 lanes (8 per flow cell)
  200-300 Gbases per flow cell
  10 days for a single run
Exome throughput:
  96 @ 60x coverage per run
  3000 @ 60x coverage per year
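The throughput figures can be cross-checked against the instrument specifications. The sketch below derives exomes per lane and an upper bound on exomes per year; the assumption of back-to-back runs with no downtime is mine, not the slide's.

```python
FLOW_CELLS_PER_RUN = 2
LANES_PER_FLOW_CELL = 8
RUN_DAYS = 10
EXOMES_PER_RUN = 96  # at 60x coverage, from the slide

lanes_per_run = FLOW_CELLS_PER_RUN * LANES_PER_FLOW_CELL
exomes_per_lane = EXOMES_PER_RUN // lanes_per_run

# Upper bound: assumes runs follow each other with no downtime (an assumption)
max_runs_per_year = 365 // RUN_DAYS
max_exomes_per_year = max_runs_per_year * EXOMES_PER_RUN

print(f"lanes per run: {lanes_per_run}")
print(f"exomes per lane: {exomes_per_lane}")
print(f"upper bound on exomes per year: {max_exomes_per_year}")
```

The quoted 3000 exomes per year sits below this upper bound and corresponds to roughly 31 full runs, which leaves room for maintenance and failed runs.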
Exome sequencing
Data processing workflow
Data formatting & QC
Mapping & QC
Variant calling
Variant annotation
Variant filtering/comparison
DATA GENERATION DATA PROCESSING DATA STORAGE INTERPRETATION RESULTS REPORTING & VALIDATION
DATA GENERATION
Prepare sample library
Perform exome capture
Perform sequencing
DATA GENERATION DATA PROCESSING DATA STORAGE Image processing Base calling Sequence Data 10-15 Gb / exome
NGS data processing: overview
1 Mapping
2 Duplicate marking
3 Local realignment
4 Base quality recalibration
5 Analysis-ready mapped reads
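To make the duplicate marking step concrete, here is a toy sketch. It assumes a deliberately simplified rule: reads mapped to the same chromosome, position and strand form one duplicate set, and only the read with the highest mean base quality is kept. The `Read` type and `mark_duplicates` function are hypothetical names for illustration; production tools apply more elaborate criteria.

```python
from collections import defaultdict
from typing import NamedTuple

class Read(NamedTuple):
    name: str
    chrom: str
    pos: int          # leftmost mapped position
    strand: str       # "+" or "-"
    mean_qual: float  # mean base quality of the read

def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    Reads sharing (chrom, pos, strand) are treated as one duplicate set;
    the read with the highest mean quality is kept, the rest are flagged.
    """
    groups = defaultdict(list)
    for r in reads:
        groups[(r.chrom, r.pos, r.strand)].append(r)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r.mean_qual)
        duplicates.update(r.name for r in group if r is not best)
    return duplicates

# Made-up example reads
reads = [
    Read("r1", "chr1", 1000, "+", 35.0),
    Read("r2", "chr1", 1000, "+", 30.0),  # same start and strand as r1
    Read("r3", "chr1", 1000, "-", 32.0),  # opposite strand, not a duplicate
    Read("r4", "chr2", 500, "+", 28.0),
]
dups = mark_duplicates(reads)
print(sorted(dups))
```

Flagging rather than deleting duplicates keeps the mapped data complete while letting downstream variant calling ignore PCR copies of the same fragment.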
DATA GENERATION DATA PROCESSING DATA STORAGE Image processing Base calling Sequence Data 10-15 Gb / exome QC sequencing Mapping sequences QC capture exp
DATA PROCESSING QC NGS Mapping QC HC
DATA GENERATION DATA PROCESSING DATA STORAGE Image processing Base calling Sequence Data 10-15 Gb / exome QC sequencing Mapping sequences QC capture exp Mapping results 5 Gb / exome Variant Calling Variant Annotation
DATA GENERATION DATA PROCESSING DATA STORAGE Image processing Base calling Sequence Data 10-15 Gb / exome QC sequencing Mapping sequences QC capture exp Mapping results 5 Gb / exome Variant Calling Variant Annotation Variant Calls 100 Mb / exome
SNPs vs Indels
[Figure: counts of INDEL and SNP variants; axis from 0 to 1,200,000]
Exonic vs non-exonic
[Figure: variant counts by class (stopgain SNV, nonsynonymous SNV, nonframeshift insertion, nonframeshift deletion, non-coding, frameshift insertion, frameshift deletion); axis from 0 to 1,000,000]
Exonic
[Figure: exonic variant counts by class (synonymous SNV, stoploss SNV, stopgain SNV, nonsynonymous SNV, nonframeshift insertion, nonframeshift deletion, frameshift insertion, frameshift deletion); axis from 0 to 20,000]
Exonic
[Figure: exonic variant counts by class (stoploss SNV, stopgain SNV, nonframeshift insertion, nonframeshift deletion, frameshift insertion, frameshift deletion); axis from 0 to 500]
DATA GENERATION DATA PROCESSING DATA STORAGE Image processing Base calling Sequence Data 10-15 Gb / exome QC sequencing Mapping sequences QC capture exp Mapping results 5 Gb / exome Variant Calling Variant Annotation Variant Calls 100 Mb / exome Variant Filtering Database known Variants Public & Private
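The variant filtering step compares the calls from one exome against a database of known variants. A minimal sketch, assuming variants are identified by chromosome, position, reference allele and alternate allele; the `Variant` type, the `filter_known` function and the example data are hypothetical and do not describe any particular database or tool.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str

def filter_known(calls, known):
    """Keep only the calls that are absent from the set of known variants."""
    known_set = set(known)
    return [v for v in calls if v not in known_set]

# Made-up example data, for illustration only
calls = [
    Variant("chr1", 12345, "A", "G"),
    Variant("chr2", 67890, "C", "T"),
    Variant("chr7", 555, "G", "GA"),
]
known = [Variant("chr2", 67890, "C", "T")]  # stands in for a public or private database

novel = filter_known(calls, known)
for v in novel:
    print(v.chrom, v.pos, v.ref, v.alt)
```

Storing the known variants in a set keeps each lookup constant time, which matters when the calls from every exome are checked against a large database.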
DATA GENERATION DATA PROCESSING DATA STORAGE Image processing Base calling Sequence Data 10-15 Gb / exome INTERPRETATION QC sequencing Mapping sequences QC capture exp Mapping results 5 Gb / exome RESULTS Validated variants in candidate genes Variant Calling Variant Annotation Variant Calls 100 Mb / exome REPORTING & VALIDATION Variant Filtering Database known Variants Public & Private
DNA Sequencing the next generation