New generation sequencing: current limits and future perspectives
Giorgio Valle, CRIBI, Università di Padova
Around 2004 the race for the 1000$ genome started. A few questions... When? How? Why?
Standard strategy for shotgun sequencing
Why genomic sequencing is (was) difficult...
Three main problems make it difficult to sequence long regions of DNA:
1. The current Sanger technology cannot read more than 900-1000 bases per run. Longer pieces of DNA must therefore be sequenced in parts that are then assembled (typically by a shotgun approach).
2. The current Sanger technology is not sensitive enough to read the signal from a single DNA molecule. The DNA fragment to be sequenced must therefore be physically amplified, and the signal is obtained from many identical fragments.
3. The Sanger technology requires each sequencing reaction to be separated by an individual electrophoresis run.
Sequencing by DNA synthesis (Sanger)
Next (now) generation sequencing
1. Cloning in bacteria should not be required.
2. Electrophoresis step should not be required.
Three main technologies are currently available:
- Roche/454 (pyrosequencing)
- Illumina/Solexa (modified Sanger)
- Applied Biosystems SOLiD (sequencing by ligation)
How can we avoid the bacterial cloning process?
1. Single molecule sequencing
2. Molecular cloning without bacteria
There are two main strategies for molecular cloning without bacteria. Both use PCR colonies, often called polymerase colonies or polonies:
- Emulsion PCR (used by Roche and SOLiD)
- PCR bridge amplification (used by Illumina)
Sequencing without electrophoresis
Pyrosequencing
Illumina chemistry
SOLiD chemistry: ligation probes
- Ligation probes are octamers of the form XXnnnzzz (n = degenerate bases, z = universal bases)
- The 3′ ligation site, the cleavage site and the dye are spatially separated
- The fluorescent dye interrogates the bases in the 1st + 2nd position
- 4^5 = 1024 probes (256 probes per color)
[Figure: grid of 1st base (A, C, G, T) against 2nd base (A, C, G, T) giving the probe color]
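The probe count on this slide can be checked with a short enumeration. This is only a sketch: it assumes the commonly published SOLiD color code, in which the color is the XOR of 2-bit base codes (A=0, C=1, G=2, T=3). That mapping is not stated on the slide, and the names `CODE`, `color_of` and `enumerate_probes` are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# Assumed 2-bit base codes (not given on the slide).
CODE = {base: index for index, base in enumerate(BASES)}


def color_of(first, second):
    """Color index (0-3) of a dinucleotide under the assumed XOR code."""
    return CODE[first] ^ CODE[second]


def enumerate_probes():
    """All combinations of the 5 specified positions (XXnnn).

    The three universal z positions pair with any base, so they do not
    multiply the number of distinct probes.
    """
    return ["".join(p) for p in product(BASES, repeat=5)]


probes = enumerate_probes()
# The dye depends only on the first two (interrogating) bases.
per_color = Counter(color_of(p[0], p[1]) for p in probes)

print(len(probes))                        # number of probes
print(sorted(per_color.items()))          # probes per color index
```

Each color covers 4 of the 16 possible dinucleotides, and each dinucleotide is combined with 4^3 degenerate triplets, which is where the 256 probes per color come from.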
SOLiD 4-color ligation: ligation reaction [Figure: 1 µm bead carrying the P1 adapter and the template sequence, annealed universal sequencing primer, ligase, and the four dye-labelled probes (Y, B, G, R), each of the form XXnnnzzz]
SOLiD 4-color ligation: ligation reaction [Figure: ligase joins the matching probe to the universal sequencing primer]
SOLiD 4-color ligation: visualization [Figure: color Y read for template positions 1-2]
SOLiD ligation-based sequencing chemistry (2): image, cap unextended strands, cleave off the fluor
SOLiD 4-color ligation: cleavage [Figure: dye cleaved from the ligated probe; color Y recorded for positions 1-2]
SOLiD 4-color ligation: ligation (2nd cycle) [Figure: a second probe is ligated to the extended primer]
SOLiD 4-color ligation: visualization (2nd cycle) [Figure: colors Y and R read for positions 1-2 and 6-7]
SOLiD 4-color ligation: cleavage (2nd cycle) [Figure: colors Y and R recorded for positions 1-2 and 6-7]
SOLiD 4-color ligation interrogates every 4th-5th base [Figure: colors read for positions 1-2, 6-7, 11-12, 16-17 and 21-22]
SOLiD 4-color ligation: reset [Figure: bead with adapter oligo sequence and template sequence only]
SOLiD 4-color ligation (1st cycle after reset) [Figure: universal sequencing primer n-1 annealed; ligation with the four probes as in the first round]
SOLiD 4-color ligation (1st cycle after reset) [Figure: color R read for positions 0-1]
SOLiD 4-color ligation (2nd round) [Figure: colors read for positions 0-1, 5-6, 10-11, 15-16 and 20-21]
Sequential rounds of sequencing: multiple cycles per round, with a reset between rounds
Primer                               Positions interrogated
universal seq primer                 1-2, 6-7, 11-12, 16-17, 21-22
universal seq primer n-1             0-1, 5-6, 10-11, 15-16, 20-21
universal seq primer n+1 (spacer)    2-3, 7-8, 12-13, 17-18, 22-23
universal seq primer n+2 (spacer)    3-4, 8-9, 13-14, 18-19, 23-24
universal seq primer n+3 (spacer)    4-5, 9-10, 14-15, 19-20, 24-25
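The position pattern of the five rounds can be reproduced with a few lines of Python. This is a minimal sketch, assuming five cycles per round and a step of five bases between ligations, as the positions on the slide suggest. The function names and the `ROUND_OFFSETS` table are illustrative; the "n" label for the first primer is an assumption.

```python
from collections import Counter

# First interrogated base for each round, read off the slide.
ROUND_OFFSETS = {"n": 1, "n-1": 0, "n+1": 2, "n+2": 3, "n+3": 4}


def interrogated_positions(offset, cycles=5, step=5):
    """Dinucleotide positions read in one round starting at `offset`."""
    return [(offset + step * c, offset + step * c + 1) for c in range(cycles)]


def interrogations_per_base(round_offsets, cycles=5):
    """Count how many times each template position is interrogated."""
    counts = Counter()
    for offset in round_offsets.values():
        for first, second in interrogated_positions(offset, cycles):
            counts[first] += 1
            counts[second] += 1
    return counts


counts = interrogations_per_base(ROUND_OFFSETS)
for primer, offset in ROUND_OFFSETS.items():
    print(primer, interrogated_positions(offset))
```

With these offsets every position from 1 to 24 is covered by two overlapping dinucleotides, so each base is interrogated twice.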
2 base pair encoding: reference alignment in color space
reference: A C G G T C G T C G T G T G C G T
observed:  A C G G T C G C C G T G T G C G T
[Figure: expected and observed color-space rows for the two sequences]
To be real, a SNP must be encoded by two color changes.
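The two-color-change rule can be illustrated on the two sequences shown on the slide. The sketch below assumes the commonly published SOLiD encoding, where the color of a dinucleotide is the XOR of 2-bit base codes (A=0, C=1, G=2, T=3). The color code itself is not given on the slide, and the function names are illustrative.

```python
# Assumed 2-bit base codes (not given on the slide).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(seq):
    """Translate a base sequence into color space, one color per dinucleotide."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def color_mismatches(reference, observed):
    """Indices of the color positions where the two sequences differ."""
    pairs = zip(to_colors(reference), to_colors(observed))
    return [i for i, (ref_color, obs_color) in enumerate(pairs) if ref_color != obs_color]


reference = "ACGGTCGTCGTGTGCGT"
observed = "ACGGTCGCCGTGTGCGT"  # T -> C at base index 7 (0-based)

print(color_mismatches(reference, observed))  # -> [6, 7]
```

The substituted base belongs to two overlapping dinucleotides, so a true SNP changes two adjacent colors. A single isolated color change is more likely a measurement error than a variant.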
3rd generation sequencing: Single Molecule Sequencing
Pacific Biosciences
Michael R. Stratton, Peter J. Campbell & P. Andrew Futreal Nature 458, 719-724 (9 April 2009)
~2004: Start of the race to the 1000$ genome
- Target 1: 1000x cost reduction in 5 years
- Target 2: further 100x cost reduction in another 5 years
Genomics at CRIBI [Figure: timeline 1990-2010]
Projects: Olive, Yeast genome, Wheat, Yeast FAN, Tomato, Arabidopsis, Grape genome, Telethon: muscle functional genomics, P. profundum, AGER CHROMUS, Nannochloropsis... more...
DNA seq service; BMR Genomics
SOLiD 5500
Main applications:
- Whole genome sequencing
- Chromatin immunoprecipitation (ChIP)
- Microbial and eukaryotic resequencing
- Digital karyotyping
- Structural variations
- Genotyping
- Gene expression
- Small RNA discovery
Bioinformatics at CRIBI
- RNA-seq, resequencing, miRNA, de novo genomic sequencing, DNA methylation, ChIP
- De novo assembly, mapping reads, gene prediction, SNPs & structural variations, expression analysis, gene annotation
- Data management and analysis
Mapping reads on the genome: developmentally regulated alternative splicing
Reads alignment
- Svetlin A Manavski and Giorgio Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics 2008, 9: S10
- Davide Campagna, Alessandro Albiero, Alessandra Bilardi, Elisa Caniato, Claudio Forcato, Svetlin Manavski, Nicola Vitulo and Giorgio Valle. PASS: a Program to Align Short Sequences. Bioinformatics 2009
http://pass.cribi.unipd.it
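For readers unfamiliar with the Smith-Waterman algorithm named in the first reference, a plain CPU sketch of the local alignment score follows. It is not the CUDA implementation from the paper and not the PASS algorithm. The scoring values (match 2, mismatch -1, linear gap -1) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, since only the score is returned.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                # start a new local alignment
                previous[j - 1] + substitution,   # align a[i-1] with b[j-1]
                previous[j] + gap,                # gap in b
                current[j - 1] + gap,             # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


print(smith_waterman_score("TTTACGTTTT", "CCACGTCC"))  # -> 8
```

Every cell depends only on its upper, left and upper-left neighbours, so cells on the same anti-diagonal are independent of each other. That independence is the general property that makes the algorithm a candidate for hardware acceleration.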
Mate pair signatures
Mate pair libraries
CRIBI APPROACH
STEP 1: Use of insert length statistics for the identification of structural variations.
STEP 2: Use of sequence alignment (splice-like alignment) for the identification of the precise points of insertion/deletion.
FIRST STEP
4 indexes are created (MP = mate pair):
- unique MP, right distance, right orientation: useful for short indels (difference between observed and expected, after filtering low coverage)
- unique MP, wrong distance, right orientation: useful for long deletions (physical coverage)
- unique MP, wrong orientation: useful for inversions (physical coverage)
- unique reads lacking the partner: useful for long insertions (number of reads)
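The four indexes above amount to a simple classification of each uniquely mapped mate pair. The sketch below shows one way to express it. The `MatePair` record, the `tolerance` threshold and the index names are hypothetical, and how "right orientation" is decided depends on the library protocol, so it is passed in as a precomputed flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatePair:
    """Hypothetical record for one uniquely mapped read and its partner."""
    distance: Optional[int]        # observed distance between the mates; None if the partner is missing
    orientation_ok: bool = True    # True if the mates have the expected relative orientation


def classify(mate_pair, expected_distance, tolerance):
    """Assign a mate pair to one of the four indexes described on the slide."""
    if mate_pair.distance is None:
        return "partner_missing"       # candidate long insertion (count reads)
    if not mate_pair.orientation_ok:
        return "wrong_orientation"     # candidate inversion (physical coverage)
    if abs(mate_pair.distance - expected_distance) > tolerance:
        return "wrong_distance"        # candidate long deletion (physical coverage)
    return "concordant"                # used for short indels via insert-size shifts


pairs = [MatePair(1500), MatePair(4200), MatePair(1480, orientation_ok=False), MatePair(None)]
print([classify(p, expected_distance=1500, tolerance=300) for p in pairs])
```

In practice the expected distance and the tolerance would come from the insert length statistics of the library rather than from fixed numbers.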
SECOND STEP
A read spanning a structural variation aligns like a read spanning a splice site. Reads that cover a breakpoint can be splice-aligned, showing an alignment pattern compatible with that specific structural variation. By analysing these patterns, it is possible to locate the breakpoint with single-base precision.
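The idea of a splice-like alignment across a deletion breakpoint can be shown with a toy example. This is a deliberately simplified sketch that uses exact matching only; it is not the PASS implementation, and the function name and toy sequences are made up for illustration.

```python
def split_align_deletion(read, reference, start):
    """Find a deletion breakpoint from a read anchored at `start`.

    The read prefix is extended along the reference until the first
    mismatch; the remaining suffix is then searched further downstream.
    Returns (left, right), meaning reference[left:right] is missing from
    the read, or None if the read aligns contiguously or the suffix is
    not found. Exact matching only: sequencing errors and repeated
    bases at the breakpoint are not handled.
    """
    matched = 0
    while (matched < len(read)
           and start + matched < len(reference)
           and read[matched] == reference[start + matched]):
        matched += 1
    if matched == len(read):
        return None  # contiguous alignment, no breakpoint
    resume = reference.find(read[matched:], start + matched)
    if resume == -1:
        return None
    return (start + matched, resume)


left_flank, deleted, right_flank = "GATTACAGGC", "TTTTTTTT", "CATGCCGTAA"
reference = left_flank + deleted + right_flank
read = left_flank[-6:] + right_flank[:6]   # read sampled from a genome lacking `deleted`

print(split_align_deletion(read, reference, start=4))  # -> (10, 18)
```

The two halves of the read align on either side of the deleted segment, in the same way that the two halves of a spliced read align on either side of an intron.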
Long deletions
Short deletions
Long insertions
Inversions and more
Results on random SV
- random genome
- structural variations randomly added (type, position, length, hetero/homozygosity)
- SNPs randomly added (type, position, nucleotide)

SV \ Coverage    5X     10X    20X    40X
DELETIONS        58%    59%    72%    90%
INSERTIONS       43%    76%    78%    85%
INVERSIONS       56%    58%    74%    88%
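A benchmark of this kind starts from a random genome with randomly placed structural variations. The sketch below shows only that generation step, under simplifying assumptions: one variation at a time, uniform choice of type, position and length, and no modelling of heterozygosity or SNPs. All names and parameters are illustrative, and the code makes no attempt to reproduce the percentages in the table.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def random_genome(length, rng):
    """Random sequence of A, C, G, T drawn uniformly."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def add_random_sv(genome, rng, max_length=50):
    """Apply one random deletion, insertion or inversion.

    Returns the mutated genome and a record of what was done, so that a
    caller can later compare detected variations against the truth.
    """
    sv_type = rng.choice(["deletion", "insertion", "inversion"])
    length = rng.randint(1, max_length)
    position = rng.randint(0, len(genome) - length)
    if sv_type == "deletion":
        mutated = genome[:position] + genome[position + length:]
    elif sv_type == "insertion":
        mutated = genome[:position] + random_genome(length, rng) + genome[position:]
    else:  # inversion: replace the segment with its reverse complement
        segment = genome[position:position + length]
        mutated = genome[:position] + segment.translate(COMPLEMENT)[::-1] + genome[position + length:]
    return mutated, {"type": sv_type, "position": position, "length": length}


rng = random.Random(0)  # fixed seed so the run is repeatable
genome = random_genome(1000, rng)
mutated, truth = add_random_sv(genome, rng)
print(truth, len(genome), len(mutated))
```

Keeping the truth record next to the mutated genome is what allows the fraction of recovered variations to be computed at each coverage level.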
How can we make sense of the data?
Advanced query platform
Acknowledgement
Genomics / Bioinformatics: Stefano Campanaro, Alessandro Vezzi, Michela D'Angelo, Rosanna Zimbello, Riccardo Schiavon, Chiara Rigobello, Elisa Corteggiani Carpinelli, Riccardo Rosselli, Fabio De Pascale, Nicola Vitulo, Davide Campagna, Erika Feltrin, Claudio Forcato, Alessandro Albiero, Elisa Caniato, Alessandro Maccagnan, Gianpiero Zamperin, Andrea Telatin, Georgine Faulkner, Rusha Guha, Lisa Marchioretto