Introduction to next-generation sequencing data David Simpson Centre for Experimental Medicine Queen's University Belfast http://www.qub.ac.uk/research-centres/cem/
Outline History of DNA sequencing NGS or massively parallel sequencing How it works: Illumina sequencing by synthesis Library preparation Clonal amplification future single molecule Characteristics of the data: Quality control Base calling and quality (FastQ format) Phasing and homopolymers Trimming Implications of PCR Duplicates and bias Contamination
Sequencing time-line 2014 : Illumina HiSeq X10 - $1,000 Genome? Andy Vierstraete
Conventional DNA sequencing Dideoxy terminator Sanger method Fluorescent dyes Gel electrophoresis 1 lane = 1 sequence Capillary electrophoresis Primer Electropherogram "G" tube: all four dNTPs, ddGTP and DNA polymerase "A" tube: all four dNTPs, ddATP and DNA polymerase "T" tube: all four dNTPs, ddTTP and DNA polymerase "C" tube: all four dNTPs, ddCTP and DNA polymerase http://www.bio.davidson.edu/courses/molbio/molstudents/spring2003/obenrader/sanger_method_page.htm
Next Generation Sequencing (NGS) Process millions of sequencing reads in parallel Common concept is the analysis of millions of sequences associated with a solid surface (or in wells) Contrast with traditional gel electrophoresis Range of platforms available Illustrate with Illumina Ion Torrent (Life Technologies/Thermo Fisher)
NGS workflow Library preparation RNA DNA Fragmentation/size selection Addition of adaptors Template preparation: Single molecule clonal amplification Bridge PCR on a slide (cluster generation) Emulsion PCR Sequencing Reversible terminator (Illumina) Semiconductor (Ion Torrent) Single molecule (Nanopore)
Overview of DNA-Seq and RNA-Seq Genomic DNA cDNA library AAAAAAA Extract RNA Fragmented DNA Library TACATTTGGGAAAAGTAAATTTGCTGAAAATAATCCCGGT AAGAAAGAAACACTTTTCATGTAATTAGCTTTTTTACATC AAACTTCAGAACCCAAAGTCATTGAGAATATTAGGGATCA CAGAACCACATGAGTCAGAATCATCAGAATATCCCACCAA AGGAGAAGGAAGGAGCAGAGGATTCAAAAGGAAATGGAAT GATGAATATGAAGAAATGTCAGAAATGAAAGAAGGGAAAG GAAATTGAATTCGATGAAATAAATGATACTTGCTTATCTG...... >10 million reads Exon 1 Exon 2 Reference sequence Massively parallel sequencing Align to reference sequence
Library preparation http://res.illumina.com/documents/products/research_reviews/sequencing-methods-review.pdf
Illumina: Cluster generation Clonal amplification achieved by generating clusters on the surface of a flow cell (slide) See SBS technology video at www.illumina.com/
Massively parallel sequencing Glowing dots on a glass slide mark cloned DNA being sequenced
Reading the sequence Wash over all 4 nucleotides each with a fluorescent dye Only one complementary nucleotide incorporated
Illumina: Sequencing by synthesis: Prepare libraries with different index sequences Pool and sequence together multiplexing
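Multiplexing means each read has to be assigned back to its library after the run. A minimal sketch of that step, assuming one index read per cluster and tolerating one mismatch; the sample names, the 6-base index sequences and the `assign_sample` helper are illustrative, not part of any Illumina software:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read, sample_indexes, max_mismatches=1):
    """Return the sample whose index matches the read, or None.

    A read is assigned only if exactly one sample index lies within
    max_mismatches of the observed index read.
    """
    hits = [
        name
        for name, index in sample_indexes.items()
        if len(index) == len(index_read)
        and hamming(index, index_read) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet: sample name -> index sequence
SAMPLES = {"sampleA": "ATCACG", "sampleB": "CGATGT"}

perfect = assign_sample("ATCACG", SAMPLES)     # exact match
one_error = assign_sample("ATCACT", SAMPLES)   # one sequencing error in the index
unknown = assign_sample("GGGGGG", SAMPLES)     # matches no sample
```

Reads that match no index, or more than one, are left unassigned rather than guessed.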
Platforms Illumina has several instruments: desktop-sized MiSeq that can complete smaller runs in under a day; NextSeq 500; high-throughput HiSeq 2500. Ion Torrent semiconductor sequencing (Life Technologies): fast, cheap entry level, output increasing rapidly; Personal Genome Machine (PGM); Proton
                     HiSeq 2500         PGM 314 chip    Proton P1 chip
Total output         600/120 Gb         up to 100 Mb    10 Gb
Run time             11 days/27 hrs     2-4 hrs         2-4 hrs
Output/day           55 Gb              up to 200 Mb    ~20 Gb
Read length          2 x 100/150 bp     up to 400 bp    up to 200 bp
# of single reads    3/0.6 billion      up to 0.6 M     up to 82 million
Ion Torrent Semiconductor sequencing Incorporation of a nucleotide changes pH Beads with template attached (prepared by emulsion PCR) No optics required! Detected on a semiconductor sequencing chip
Signal processing to optimise base calling Signal decay Phase correction: phasing is the rate at which single molecules within a cluster lose sync with each other. Incomplete extension Limits read length Further discussion Ion Torrent: http://biolektures.wordpress.com/2011/08/10/fundamentals-of-base-calling-part-1/ Illumina: http://pathogenomics.bham.ac.uk/blog/2013/11/diagnosing-problems-with-phasing-and-pre-phasing-on-illumina-platforms/
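Why phasing limits read length can be seen with a deliberately simplified model. It assumes each molecule in a cluster independently falls out of sync with a fixed probability per cycle and never recovers; the rate used below is an illustrative number, not a platform specification, and real base callers model phasing and pre-phasing separately.

```python
def in_phase_fraction(rate_per_cycle, cycles):
    """Fraction of molecules in a cluster still in sync after `cycles`.

    Simplifying assumption: every molecule has the same, independent
    probability `rate_per_cycle` of losing sync at each cycle.
    """
    if not 0 <= rate_per_cycle <= 1:
        raise ValueError("rate_per_cycle must be between 0 and 1")
    return (1 - rate_per_cycle) ** cycles


# Illustrative only: 0.5% loss of sync per cycle
for n in (1, 50, 100, 200):
    print(n, round(in_phase_fraction(0.005, n), 3))
```

Under these assumptions the in-phase signal decays geometrically, so longer reads end with a progressively noisier signal.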
Read length and quality Per base sequence quality Phred quality score: Q is an integer mapping of p, the probability that the corresponding base call is incorrect Damien Gregory: http://www.somewhereville.com/?p=1508
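The mapping between Q and p is the standard Phred definition, Q = -10 log10(p), so Q20 corresponds to a 1 in 100 chance of an incorrect call and Q30 to 1 in 1,000. A minimal sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for error probability p, rounded to an integer."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return round(-10 * math.log10(p))


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```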
FASTQ format Nucleotide sequence and associated quality score (represented by ASCII characters) Illumina: Flowcell lane & tile X and Y coordinates of the cluster Index of multiplex sample
@PSI179204_0007:4:1:1025:10482#0/1
GAGCAAAATTGTAGAAGAATTCAGGATCTCGTATGCCGTC
+PSI179204_0007:4:1:1025:10482#0/1
C-:AC:?5:C-AAA-5>-,A5A>5:A?-DD?5A::>;><B
P. J. A. Cock, C. J. Fields, N. Goto, M. L. Heuer and P. M. Rice, The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research, 2010, Vol. 38, No. 6, 1767-1771. doi:10.1093/nar/gkp1137
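The record above can be decoded with a few lines of code. This is a sketch, assuming Sanger-style Phred+33 quality encoding (older Illumina pipelines used a +64 offset, so the offset should be checked for real data) and the older Illumina read-name layout instrument_run:lane:tile:x:y#index/pair; the function names are illustrative.

```python
def parse_fastq_record(lines, offset=33):
    """Parse one 4-line FASTQ record into (name, sequence, scores)."""
    header, seq, plus, qual = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    # Assumed encoding: Phred score = ASCII code minus offset
    scores = [ord(char) - offset for char in qual]
    return header[1:], seq, scores


def parse_read_name(name):
    """Split an older-style Illumina read name into its fields."""
    coords, _, tail = name.partition("#")
    index, _, pair = tail.partition("/")
    instrument_run, lane, tile, x, y = coords.split(":")
    return {
        "instrument_run": instrument_run,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "index": index,
        "pair": pair,
    }


record = [
    "@PSI179204_0007:4:1:1025:10482#0/1",
    "GAGCAAAATTGTAGAAGAATTCAGGATCTCGTATGCCGTC",
    "+PSI179204_0007:4:1:1025:10482#0/1",
    "C-:AC:?5:C-AAA-5>-,A5A>5:A?-DD?5A::>;><B",
]
name, seq, scores = parse_fastq_record(record)
fields = parse_read_name(name)
print(fields["lane"], fields["tile"], fields["x"], fields["y"])
print(min(scores), max(scores))
```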
Homopolymers (runs of the same nucleotide) Illumina: Flow all 4 nucleotides, incorporate a single one Ion Torrent: Sequential flows of individual unmodified nucleotides Ionogram (Ion Torrent) EBI
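Because each Ion Torrent flow contains a single unmodified nucleotide, a homopolymer is incorporated in one flow and its length has to be inferred from signal size. The sketch below computes the ideal, noise-free signal per flow for a template; the repeating T, A, C, G flow order and the function name are assumptions for illustration.

```python
def ideal_flowgram(template, flow_order="TACG"):
    """Return (base, run_length) per flow for a noise-free read.

    Each flow extends the strand by every consecutive template base
    matching the flowed nucleotide, so a homopolymer of length n
    gives a single signal of size n rather than n separate signals.
    """
    if any(base not in flow_order for base in template):
        raise ValueError("template contains a base missing from flow_order")
    signals = []
    position = 0
    flow = 0
    while position < len(template):
        base = flow_order[flow % len(flow_order)]
        run = 0
        while position < len(template) and template[position] == base:
            run += 1
            position += 1
        signals.append((base, run))
        flow += 1
    return signals


flows = ideal_flowgram("TTAGGG")
print(flows)
```

Distinguishing a signal of 3 from 4 is harder than 1 from 2, which is why homopolymer length errors are characteristic of flow-based platforms.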
Trimming Quality Ends Adaptors Clip adaptors (fastx_clipper) Adaptor A Insert Adaptor B Adaptor A Adaptor B FASTX-toolkit by Assaf Gordon
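The two trimming steps can be sketched as follows. This is a simplified illustration, not how fastx_clipper is implemented: it assumes exact adaptor matches only, Phred+33 qualities, and a made-up adaptor sequence.

```python
def clip_adaptor(read, adaptor, min_overlap=3):
    """Remove a 3' adaptor (exact matches only) from a read."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    # Case 1: the whole adaptor occurs inside the read
    start = read.find(adaptor)
    if start != -1:
        return read[:start]
    # Case 2: the read ends part-way through the adaptor
    longest = min(len(adaptor) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    return read


def trim_3prime_quality(seq, qual, min_q=20, offset=33):
    """Trim low-quality bases from the 3' end (assumes Phred+33)."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


ADAPTOR = "AGGCTT"  # hypothetical adaptor sequence for illustration

full = clip_adaptor("ACGTAGGCTTGG", ADAPTOR)       # adaptor fully present
partial = clip_adaptor("ACGTACGTTTAGGC", ADAPTOR)  # read ends inside adaptor
trimmed = trim_3prime_quality("ACGTACGT", "IIIII###")
```

Real tools also tolerate mismatches in the adaptor, which matters because adaptor bases are often read at the low-quality 3' end.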
Implications of PCR Duplicate reads Erroneous quantification or variant detection Uneven coverage Additional sequencing required to achieve minimal coverage
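One common heuristic treats aligned reads with the same chromosome, start position and strand as PCR duplicates. A minimal sketch of estimating the duplicate rate this way; the tuple representation of an alignment and the example coordinates are assumptions for illustration, and real tools work on BAM records and handle paired ends.

```python
from collections import Counter


def duplicate_rate(alignments):
    """Fraction of reads that repeat an already-seen alignment.

    `alignments` is an iterable of (chromosome, start, strand) tuples;
    the first read at each position counts as the original.
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # same start and strand: counted as a duplicate
    ("chr1", 100, "-"),  # opposite strand: kept
    ("chr2", 5, "+"),
]
rate = duplicate_rate(reads)
print(rate)
```

In very deep or amplicon-based sequencing, independent fragments can share a start position, so this heuristic overestimates duplication there.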
Single nucleotide resolution High specificity Show ZEB1 mutation ZEB1 exon 7 Mutation: c.1920G>T p.Gln640His CAG = Gln CAT = His
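The step from the coding change to the protein change can be worked through in code: position 1920 is the third base of codon 640, and changing CAG to CAT changes Gln (Q) to His (H). The sketch below uses the standard genetic code; the function name is illustrative and amino acids are returned as one-letter codes.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def substitution_effect(cdna_pos, ref_codon, alt_base):
    """Return (codon number, new codon, ref amino acid, alt amino acid).

    cdna_pos is the 1-based coding position of the substituted base.
    """
    codon_number = (cdna_pos - 1) // 3 + 1
    offset = (cdna_pos - 1) % 3  # 0, 1 or 2 within the codon
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    return codon_number, alt_codon, CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]


effect = substitution_effect(1920, "CAG", "T")
print(effect)
```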
Contamination Sample mix-ups (!) - indexing Carry-over from previous run FastQ Screen
Single molecule sequencing: Nanopore Single-stranded DNA polymer is passed through a protein nanopore Individual DNA bases on the strand are identified in sequence as the DNA molecule passes through Oxford Nanopore https://www.nanoporetech.com/news/movies#movie-24-nanopore-dna-sequencing
Summary NGS works by sequencing millions of reads in parallel Library preparation Add adaptors to DNA of interest Requires clonal amplification (template preparation) Sequence data presented in FastQ format Quality control critical Errors inherent in the technology, e.g. phasing and homopolymers, PCR Trimming Contamination
To analyze NGS data effectively you need to understand the technology