Next Generation Sequencing I: Technologies. Jim Noonan Department of Genetics

Next Generation Sequencing I: Technologies Jim Noonan Department of Genetics

Sequence as the readout for biological processes Determining the biological state of cells, tissues and organisms requires the quantification of sequence information Gene expression Protein-DNA interactions (ChIP) DNA-DNA interactions (3C/4C/5C) Chromatin state DNA methylation Genetic variation (SNPs/CNVs) Indirect measures Direct measures CTATGATCAGTC... TCAATCTGATCTG... GGACTTCGAGATC... AAGTCGCTGACGT... microarrays, PCR, etc. Sequencing

Outline First-generation sequencing technology Sanger sequencing Parallelization in human genome project Current massively parallel sequencing platforms 454 Illumina SOLiD Helicos Applications, advantages, issues Third-generation sequencing Pacific Biosciences Complete Genomics Ion Torrent

Metrics for evaluating sequencing methods Throughput Number of high quality bases per unit time Number of independent samples run in parallel - multiplexing Difficulty of sample prep Yield Number of useful/mappable reads per sample Read length Cost Per run and per base Equipment Reagents Infrastructure Labor Analysis The goal of all new sequencing technologies is to increase throughput and yield while reducing cost

Sanger sequencing (1975-1977) 1980 Nobel Prize in chemistry gels read by hand phi X 174 ~5300 bp radiolabeled dideoxyntps one lane per nucleotide 800 bp reads low throughput (several kb/gel)

Parallelization of Sanger sequencing: Technology Semi-automated gel electrophoresis four color ddntp labeling 800 bp reads 96 samples/gel 70,000 bp/gel automated readout metrics for basecalling and quality scoring Capillary electrophoresis 800-1000 bp reads 384 samples/cap 300,000 bp/cap

Parallelization of Sanger sequencing: Infrastructure enormous increase in sequencing production capacity throughout the HGP industrialization of Sanger sequencing, library construction, sample preparation, analysis, etc. $3 billion total cost 1 billion bp/month at largest centers (2005)

Second-generation sequencing Democratizing sequencing production Massive parallelization Reduction in per-base cost Eliminate need for huge infrastructure Millions of reads - >1Gb sequence per run Novel sequencing applications RNA-seq ChIP-seq Methyl-seq Counting applications Whole-genome and targeted resequencing Challenges Read length Quality Data analysis

1 cycle: T-A-G-C flowed in sequence across plate Intensity of signal determines how many nt (i.e. A vs. AAAAA) are incorporated 454 pyrosequencing

454 pyrosequencing Throughput & Yield 1 million 400 bp reads/10 hour run >8 samples/run (more with barcoding) Cost Machine: $500k; reagents ~$8000k/run Issues High indel rate in homopolymers Longer reads but fewer than other systems

454 sequencing applications 106.5 million reads 24.5 Gb seq 7.4 x coverage >10,000 amino acid replacements CNVs up to 1.5 Mb Nature 452:872 (2008)

Illumina Short read technologies (currently dominant) Sequencing by synthesis 250 million 36-100 bp reads/run ~32-40 Gb alignable sequence $6500-$12000 in reagent cost/run 3-10 day run time SOLiD Sequencing by ligation ~400 million 35-50 bp reads/run 100 Gb/run (SOLiD 4) ~$5000 in reagent cost/run 3-6 day run time Helicos Sequencing by synthesis No amplification 750 million reads/run $18k run cost 8 day run time

Illumina Cluster PCR on ﬂowcell (8 lanes) Sequencing by synthesis with reversible dye terminators 1 cycle Scan ﬂowcell Reverse termination Add next base

Error rates increase with read length in Illumina sequencing

Up to 100 bp paired end reads Paired end sequencing

Multiplex sequencing Up to 96 samples on a flowcell

Yoruba male from Nigeria Nature 456:53 (2008)

ChIP-seq: enhancer identification in vivo Visel et al. Nature 457:854 (2009) p300 = enhancer-associated factor p300 binding = ~90% predictive of enhancer activity

30 ng Gene expression profiling by massively parallel RNA sequencing (RNA-seq)

Array Sequence Capture NimbleGen whole exome arrays: 2.1 M features; > 60 bp probes

HiSeq 2000 2 flowcells/run 60x coverage of 1 human genome SINGLE READ Reads/flowcell* Reads/lane GB per lane Length of run in bp 50 75 100 GAIIe 100,000,000 12,500,000 0.63 0.94 1.25 GAIIx 250,000,000 31,250,000 1.56 2.34 3.13 HiSeq2000 1,000,000,000 62,500,000 3.13 4.69 6.25 Big data: Tb-scale datasets YCGA : ~1 Pb storage for 12 Illumina GAIIx PAIRED- END READ Reads/flowcell* Reads/lane GB per lane Length of run in bp 50 75 100 GAIIe 200,000,000 25,000,000 1.25 1.88 2.50 GAIIx 500,000,000 62,500,000 3.13 4.69 6.25 HiSeq2000 2,000,000,000 125,000,000 6.25 9.38 12.50

SOLiD Two base encoding in color space

Single molecule sequencing: Helicos

Single molecule sequencing: Helicos No PCR 800 million reads, 50 samples high indel rate (3%)

Third-generation sequencing Extremely high-throughput sequencing at very low cost Pacific Biosciences Sequence in real time with fluorescent NTPs Rate limited by processivity of polymerase Very long reads (>10 kb) Not well parallelized (few reads) Complete Genomics Sequencing hybridization & ligation of DNA nanoballs Short reads (70 bp) Very low cost ($5,000/genome) Ion Torrent Sequencing on semiconductors

Sequencing in real time: Pacific Biosciences SMRT cells Zero Mode Waveguides

PacBio Specs Autumn 2010 160,000 reads per SMRT cell; 96 SMRT cells per run (~15 M reads) ~1 kb reads 5% error rate (deletions due to dark bases ) ~15 b/second $695k per instrument; $99 per SMRT cell 2 Gb data footprint per SMRT cell Targets 320,000 reads per SMRT cell 50,000-70,000 bp reads Methylation studies and direct RNA seq

Complete Genomics

Complete Genomics Sequencing chemistry Sequencing service model Provide summary stats, variant calls (SNP and CNV), reads, alignments Proof of principle: sequenced 3 human genomes (2 HapMap and George Church) Drmanac et al. Science 327:78 (2010)

Ion Torrent Sequencing on semiconductor chip $45,000 per instrument; $300-$500/run 750,000 100 bp reads; 75 Mb 2 hour run time

Conclusions High-throughput sequencing has become democratized - moved out of industrial-scale genome centers Sequence is no longer limiting - next generation of sequencers will make sequencing very inexpensive Earlier methods for counting / resequencing applications are largely obsolete Scale of data production outstripping our ability to store and analyze it Next: dealing with the data and personal genomics