Genotyping by sequencing and data analysis Ross Whetten North Carolina State University
Stein (2010) Genome Biology 11:207
More New Technology on the Horizon
Genotyping By Sequencing Timeline
2007: Complexity Reduction of Polymorphic Sequences. van Orsouw et al., PLoS ONE 2(11): e1172. SNP discovery using 454 sequencing, genotyping via Keygene SNPWave -> patent application
2008: Rapid SNP Discovery and Genetic Mapping Using Sequenced RAD Markers. Baird et al., PLoS ONE 3(10): e3376. Direct SNP genotyping by Illumina sequencing
2009: High-throughput genotyping by whole-genome resequencing. Huang et al., Genome Res 19:1068-1076. Low-coverage whole-genome resequencing of rice RILs
2011: Multiplex shotgun genotyping for rapid and efficient genetic mapping. Andolfatto et al., Genome Res 21(4):610-617. Single restriction enzyme digest, HMM model for data analysis
2011: A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species. Elshire et al., PLoS ONE 6(5): e19379. Simplified protocol for high throughput
2012: Development of High-Density Genetic Maps for Barley and Wheat Using a Novel Two-Enzyme Genotyping-by-Sequencing Approach. Poland et al., PLoS ONE 7(2): e32253. Two-enzyme version of the Elshire et al. protocol
2012: Double-digest RAD-seq. Peterson et al., PLoS ONE 7(5): e37135. Two-enzyme method similar to Poland et al., with size selection to increase reproducibility of genotyping
2013: RESTseq - Efficient Benchtop Population Genomics with RESTriction Fragment SEQuencing. Stolle & Moritz, PLoS ONE 8(5): e63960. Complexity reduction for fewer markers, higher multiplexing, reduced costs
Multiplexing Strategies
Multiplexing = pooling multiple different samples, each identified by a unique tag, and sequencing the mixture.
First approach: barcodes in the sequence reads. User-designed; no modification to the sequencing protocol.
Second approach: Illumina index adapters. A variable sequence in the middle of one sequencing adapter; requires an independent sequence read.
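The first approach (barcodes at the start of each read) can be sketched in a few lines of Python. This is only an illustration of the idea, not the method used by any particular GBS pipeline: the function name `demultiplex`, the sample names, and the exact-match rule (no tolerance for sequencing errors in the barcode) are all assumptions made for the example.

```python
def demultiplex(reads, barcode_to_sample):
    """Sort reads into per-sample bins by an exact match to an inline barcode.

    reads: iterable of read sequences (strings)
    barcode_to_sample: dict mapping barcode sequence -> sample name (hypothetical names)
    """
    bins = {sample: [] for sample in barcode_to_sample.values()}
    bins["unassigned"] = []
    # Try longer barcodes first, so a short barcode that happens to be a
    # prefix of a longer one cannot claim the longer barcode's reads.
    ordered = sorted(barcode_to_sample, key=len, reverse=True)
    for read in reads:
        for barcode in ordered:
            if read.startswith(barcode):
                # Store the read with the barcode trimmed off.
                bins[barcode_to_sample[barcode]].append(read[len(barcode):])
                break
        else:
            bins["unassigned"].append(read)
    return bins


example_bins = demultiplex(
    ["ATCGAATTCGG", "CGTAGAATTCAA", "GGGGAATTC"],
    {"ATCG": "sample1", "CGTAG": "sample2"},
)
```

Real data usually need some tolerance for base-calling errors within the barcode, which this sketch deliberately leaves out.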
Sequencing technology overview: Illumina
Glass flowcell with 8 separate lanes.
GAIIx: ~30-50 million clusters per lane, 150-nt reads
HiSeq: ~200 million clusters per lane, 100-nt reads
MiSeq: 25 million clusters, 300-nt reads
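Cluster counts translate directly into a rough read budget per multiplexed sample. The sketch below does that arithmetic using the approximate figures quoted above (the lower bound is used for the GAIIx). The function name, the `usable_fraction` parameter, and the choice of 96 samples are assumptions for illustration, and an even split among samples is an idealization that real libraries rarely achieve.

```python
# Approximate cluster counts taken from the slide text (GAIIx: lower bound).
CLUSTERS = {
    "GAIIx": 30_000_000,   # per lane
    "HiSeq": 200_000_000,  # per lane
    "MiSeq": 25_000_000,   # as quoted on the slide
}


def reads_per_sample(clusters, n_samples, usable_fraction=1.0):
    """Expected reads per sample if reads were split evenly among samples.

    usable_fraction is a hypothetical knob for reads lost to filtering or
    unreadable barcodes; 1.0 means every cluster yields a usable read.
    """
    return int(clusters * usable_fraction / n_samples)


hiseq_96plex = reads_per_sample(CLUSTERS["HiSeq"], 96)
```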
Sequencing technology overview: Illumina
Fragment DNA, ligate adaptor oligo.
Single-stranded DNA binds to the surface.
Sequencing technology overview: Illumina
Extend the surface-bound primer, denature, wash away the template.
The new strand anneals to a complementary primer, which is extended to copy it.
Many cycles create ~1000 molecules in a cluster. After PCR, free ends are blocked.
Sequencing technology overview: Illumina
Another perspective of the amplification process, showing the clusters of products.
Sequencing technology overview: Illumina
[Figure: base calls at example clusters over cycles 1 to 5; sequences shown: GCTGA, CTTAG, AGCCG, TAAGT]
Although four different colors are used for the fluorescent nucleotides, only two lasers are used to excite the fluorescence. The fluorescent labels are grouped in pairs: labels on A and C are excited by a red laser, and labels on G and T are excited by a green laser. The software assumes that signal from both lasers will be balanced at each cycle. This means that distinguishing between the A signal and the C signal is more difficult for the instrument than A versus G or A versus T. Base substitution errors are the most common type of sequencing error for Illumina instruments.
Sequencing technology overview: Illumina
The software assumes that signal from both lasers will be balanced at each cycle. It also uses patterns of base addition in adjacent clusters to determine which clusters are a single template and which are a mixture of templates, and rejects data from clusters that are mixed templates.
Fixed-length barcodes (barcode, then RE site): poor yield of sequence
ATCG AATTC
CGTA AATTC
GCAT AATTC
TAGC AATTC
Variable-length barcodes (barcode, then RE site, then the first genomic bases u to z): good yield of sequence
ATCG AATTC uvw
CGTAG AATTC xy
GCATCT AATTC z
TAGCTGT AATTC
With fixed-length barcodes, every cluster reads the same restriction-site base in the same cycle, so the two laser channels are unbalanced during those cycles; variable-length barcodes stagger the restriction site across cycles.
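The effect of barcode length on laser balance can be checked with a small calculation. The sketch below takes the two barcode sets shown on the slide, appends the restriction-site remnant AATTC, and reports for each of the first nine cycles the fraction of reads whose base falls in the red-laser pair (A or C). The function and constant names are assumptions for illustration, and the calculation models only the channel balance, not the instrument's actual base-calling software.

```python
RE_SITE = "AATTC"  # restriction-site remnant that follows each barcode on the slide
FIXED_BARCODES = ["ATCG", "CGTA", "GCAT", "TAGC"]
VARIABLE_BARCODES = ["ATCG", "CGTAG", "GCATCT", "TAGCTGT"]


def red_fraction_per_cycle(barcodes, site=RE_SITE, n_cycles=9):
    """Fraction of reads showing A or C (the red-laser pair) at each cycle.

    Reads are modeled as barcode + site, truncated to n_cycles so that no
    genomic bases (unknown here) are needed. n_cycles must not exceed the
    shortest barcode + site length.
    """
    reads = [(barcode + site)[:n_cycles] for barcode in barcodes]
    fractions = []
    for cycle in range(n_cycles):
        red = sum(read[cycle] in "AC" for read in reads)
        fractions.append(red / len(reads))
    return fractions


fixed = red_fraction_per_cycle(FIXED_BARCODES)
variable = red_fraction_per_cycle(VARIABLE_BARCODES)
```

With the fixed-length set, cycles 5 to 9 come out as all-red or all-green (1.0 or 0.0), whereas the variable-length set stays at 0.5 through all nine cycles.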
Experimental design
Why is this important? It determines the value of the data for analysis.
Doesn't the statistical software take care of that? No amount of statistical sophistication can separate confounded factors after data have been collected (Auer and Doerge, 2010).
Sources of variation:
  Nuisance factors: technical problems, random noise
  Experimental factors: variation among individuals
Experimental design
What are the sources of variation in the experiment?
  Among individuals
  Among treatments
  Among sequencing runs
  Among lanes/sectors within a run
  Among library preparations
Which sources of variation are of interest?
  Avoid confounding effects of interest with nuisance effects
  Allocate effort to estimate effects of greatest interest
  Block to replicate measurements across nuisance effects
  Exploit barcoding/multiplexing tools for blocking
  Balanced or partially balanced designs are possible, using complete or incomplete blocking
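One way to use multiplexing for blocking is to treat each lane as a block and spread every treatment evenly across lanes. The sketch below is a minimal randomized complete block assignment under that assumption. The function name, treatment labels, and sample names are hypothetical, and it assumes the number of samples per treatment is a multiple of the number of lanes.

```python
import random


def blocked_lane_assignment(samples_by_treatment, n_lanes, seed=0):
    """Assign samples to lanes so each lane holds an equal share of every treatment.

    samples_by_treatment: dict mapping treatment name -> list of sample names
    Returns a list of lanes; each lane is a list of (treatment, sample) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the design can be reproduced
    lanes = [[] for _ in range(n_lanes)]
    for treatment, samples in samples_by_treatment.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # randomize which samples land in which lane
        for i, sample in enumerate(shuffled):
            lanes[i % n_lanes].append((treatment, sample))
    return lanes


# Hypothetical experiment: 3 treatments x 8 individuals, pooled into 4 lanes.
design = blocked_lane_assignment(
    {t: [f"{t}_{i}" for i in range(8)] for t in ("control", "drought", "heat")},
    n_lanes=4,
)
```

Because every lane contains two samples of each treatment, lane-to-lane differences can be estimated separately from treatment differences instead of being confounded with them.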
Experimental design (Genetics 185(2):405-16, 2010)
Suppose 21 libraries (each containing a different pool of multiplexed samples) are each sequenced in a separate lane.
Nuisance effects (library and lane) are then confounded with sample differences.
Auer & Doerge, Genetics 185:405, 2010 Experimental design
DNA sequencing provides options
Just the facts
Sense from sequence reads: methods for alignment and assembly. Flicek & Birney, Nat Methods 6(11 Suppl):S6-S12, 2009
... the individual outputs of the sequence machines are essentially worthless by themselves ... once analyzed collectively DNA sequencing reads have tremendous versatility...
Biologists interested in sequencing to answer their experimental questions should prepare themselves to join a fast-moving field and embrace the tools being developed specifically for it.
As more sequence is generated, effective use of computational resources will be more and more important.
STACKS software data analysis workflow
Catchen et al., Mol Ecol 22:3124-3140, 2013
Analysis of Deep Sequencing Data: Summary
Data are not information
Information is not knowledge
Knowledge is not wisdom
(Anon.)
Conclusions
Sequencing technology continues to develop rapidly.
Data analysis methods are also developing, but almost all use Linux, a potential barrier for biologists.
The costs of DNA sequencing are dropping; experiments are likely to be less expensive next year than this year.
Skills for computational data analysis are a key component of successful GBS experiments.
Thank you!
Computational thinking: four general principles
Decompose a complex problem into simple steps. Linux is based on simple tools that do one thing well; these tools require problems to be framed in simple terms.
Look for patterns. Recognizing similarities among different types of problems allows re-use of the same tools in new contexts.
Generalize patterns to create abstract versions. A tool is most powerful when it can be applied to a variety of problems that all share common features.
Combine simple tools into more complex pipelines. Repetitive tasks are what computers are good at; our job is to build the algorithms, or sequences of simple steps, that allow the computer to do those repetitive tasks so we don't have to.
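The same principles carry over to scripting languages. As a hedged illustration, the sketch below builds a tiny pipeline from three single-purpose Python functions, in the spirit of chaining simple Linux tools. All function names and the example records are invented for this example, and the FASTQ handling assumes strict four-line records.

```python
from collections import Counter


def read_sequences(lines):
    """Yield the sequence line (second of every four lines) of FASTQ-formatted text."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def keep_min_length(seqs, min_len):
    """Pass through only the sequences that are at least min_len long."""
    for seq in seqs:
        if len(seq) >= min_len:
            yield seq


def count_prefixes(seqs, k):
    """Count how often each k-base prefix (for example, a barcode) occurs."""
    return Counter(seq[:k] for seq in seqs)


# Made-up records; each stage does one thing and feeds the next.
fastq_lines = [
    "@r1", "ATCGAATTC", "+", "IIIIIIIII",
    "@r2", "ATCGAA", "+", "IIIIII",
    "@r3", "CGTAAATTC", "+", "IIIIIIIII",
]
prefix_counts = count_prefixes(keep_min_length(read_sequences(fastq_lines), 8), 4)
```

Each function can be reused in a different pipeline, which is the point of generalizing a pattern into a tool.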
Key principles from the Eric Raymond book chapter
Clarity is better than cleverness. Document everything you do, because you won't remember what you did, or why.
Programmer time is more expensive than machine time. Don't worry about optimizing things unless it is necessary.
Prototype before polishing: get it working before you optimize it. It is often easiest to start with something very simple, then add complexity and capability in steps.
Design for simplicity; add complexity only where you must. "Make things as simple as possible, but no simpler" (paraphrased from Albert Einstein).
Illumina flowcell geometry (HiSeq)
[Figure: flowcell with lanes numbered 1 to 8]
Tiles within a HiSeq lane are numbered using a four-digit system with three separate components. The first digit denotes the surface (1 = lower, 2 = upper), the second denotes a vertical swath (1 = left, 2 = middle, 3 = right), and the last two digits denote a tile within that swath (01 is closest to the outflow end of the lane; 08 or 16 is closest to the inflow end of the lane).
Example tile numbers on surface 1, one column per swath:
1101 1201 1301
1102 1202 1302
...
1107 1207 1307
1108 1208 1308
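The three components of a tile number can be pulled apart with simple string slicing. This is a minimal sketch of the numbering scheme as described above; the function name `decode_tile` and the returned dictionary keys are assumptions for illustration.

```python
def decode_tile(tile):
    """Split a four-digit HiSeq tile number into surface, swath, and tile index."""
    text = str(tile)
    if len(text) != 4 or not text.isdigit():
        raise ValueError(f"expected a four-digit tile number, got {tile!r}")
    return {
        "surface": int(text[0]),  # first digit
        "swath": int(text[1]),    # second digit: 1 = left, 2 = middle, 3 = right
        "tile": int(text[2:]),    # last two digits: position within the swath
    }


example_tile = decode_tile(1208)
```

Per-tile summaries built this way can help spot problems confined to one region of a lane.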
Understanding FASTQ format, or "what do all these symbols mean?"
[Figure: an annotated FASTQ record showing header lines, sequence, and quality scores; header fields labeled instrument ID, lane, tile, X, Y, barcode, read#, flowcell]
Quality scores are numbers that represent the probability that the given base call is an error. These probabilities are always less than 1, so the value is given as -10 times the base-10 logarithm of the probability: Q = -10 x log10(P). For example, an error probability of 0.001 (1 x 10^-3) is represented as a quality score of 30. The numbers are converted into text characters so they occupy less space: a single character carries as much information as two digits plus a space between adjacent values.
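The conversion between error probabilities, quality scores, and text characters is short enough to write out. The sketch below assumes the Sanger-style encoding (ASCII code = score + 33), used by Illumina pipeline version 1.8 and later. Older Illumina data used an offset of 64, so the `offset` argument has to match the data in hand. Function names are chosen for this example.

```python
import math


def phred_from_error_prob(p):
    """Quality score Q = -10 * log10(P) for an error probability P."""
    return -10 * math.log10(p)


def error_prob_from_phred(q):
    """Invert the formula: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(qual_string, offset=33):
    """Turn a FASTQ quality string into a list of integer quality scores.

    offset=33 is an assumption about the encoding (Sanger / Illumina 1.8+).
    """
    return [ord(ch) - offset for ch in qual_string]


scores = decode_quality("?I5")
```

Here the characters `?`, `I`, and `5` have ASCII codes 63, 73, and 53, so with an offset of 33 they decode to scores of 30, 40, and 20.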
Understanding FASTQ format
Illumina v1.8 header version:
@HWI-EAS209:06:FC706VJ:5:1105:5894:21141 1:N:ATCACG
[Figure labels: header fields are instrument, flowcell ID, lane, tile, X, Y, index, read#; record parts are header lines, sequence, quality scores]
Unfortunately, at least four different ways of converting numbers to characters have been used, and header line formats have also changed, so one aspect of data analysis is knowing what you have.
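Splitting a header of this style into named fields takes only a few lines. The sketch below follows the colon-separated layout of the example header shown above. The second field is the run number in the Illumina v1.8 layout. The function name and dictionary keys are assumptions, and the part after the space is handled loosely (first item taken as the read number, last item as the index) because its exact contents vary between data sets.

```python
def parse_v18_header(header):
    """Parse an Illumina v1.8-style FASTQ header line into a dictionary."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    left, _, right = header[1:].strip().partition(" ")
    fields = left.split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 colon-separated fields, got {len(fields)}")
    instrument, run, flowcell, lane, tile, x, y = fields
    info = {
        "instrument": instrument,
        "run": run,
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
    }
    if right:
        parts = right.split(":")
        info["read"] = int(parts[0])  # 1 or 2 for paired-end data
        info["index"] = parts[-1]     # index (barcode) sequence, assumed to be last
    return info


parsed = parse_v18_header("@HWI-EAS209:06:FC706VJ:5:1105:5894:21141 1:N:ATCACG")
```

A header that fails this parse is a useful early warning that the file came from a different pipeline version.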