Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University


Stein (2010) Genome Biology 11:207

More New Technology on the Horizon

Genotyping By Sequencing Timeline
2007: Complexity Reduction of Polymorphic Sequences. van Orsouw et al., PLoS ONE 2(11): e1172.
  SNP discovery using 454 sequencing, genotyping via Keygene SNPWave -> patent application
2008: Rapid SNP Discovery and Genetic Mapping Using Sequenced RAD Markers. Baird et al., PLoS ONE 3(10): e3376.
  Direct SNP genotyping by Illumina sequencing
2009: High-throughput genotyping by whole-genome resequencing. Huang et al., Genome Res 19:1068-1076.
  Low-coverage whole-genome resequencing of rice RILs

Genotyping By Sequencing Timeline
2011: Multiplex shotgun genotyping for rapid and efficient genetic mapping. Andolfatto et al., Genome Res 21(4):610-617.
  Single restriction enzyme digest, HMM model for data analysis
2011: A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species. Elshire et al., PLoS ONE 6(5): e19379.
  Simplified protocol for high throughput
2012: Development of High-Density Genetic Maps for Barley and Wheat Using a Novel Two-Enzyme Genotyping-by-Sequencing Approach. Poland et al., PLoS ONE 7(2): e32253.
  Two-enzyme version of the Elshire et al. protocol

Genotyping By Sequencing Timeline
2012: Double-digest RAD-seq. Peterson et al., PLoS ONE 7(5): e37135.
  Two-enzyme method similar to Poland et al., with size selection to increase reproducibility of genotyping
2013: RESTseq: Efficient Benchtop Population Genomics with RESTriction Fragment SEQuencing. Stolle & Moritz, PLoS ONE 8(5): e63960.
  Complexity reduction for fewer markers, higher multiplexing, reduced costs

Multiplexing Strategies
Multiplexing = pooling multiple different samples, each identified by a unique tag, and sequencing the mixture.
First approach: barcodes in the sequence reads. User-designed; no modification to the sequencing protocol.
Second approach: Illumina index adapters. A variable sequence in the middle of one sequencing adapter; requires an independent sequence read.

Sequencing technology overview: Illumina
Glass flowcell with 8 separate lanes.
  GAIIx: ~30-50 million clusters per lane, 150-nt reads
  HiSeq: ~200 million clusters per lane, 100-nt reads
  MiSeq: 25 million clusters, 300-nt reads

Sequencing technology overview: Illumina
Fragment the DNA and ligate adaptor oligos.
Single-stranded DNA binds to the surface.

Sequencing technology overview: Illumina
Extend the surface-bound primer, denature, and wash away the template.
The new strand anneals to a complementary primer, which is extended to copy it.
Many cycles create ~1000 molecules in a cluster. After PCR, free ends are blocked.

Sequencing technology overview: Illumina
Another perspective on the amplification process, showing the clusters of products.

Sequencing technology overview: Illumina
[Figure: bases read over cycles 1 to 5; example sequences GCTGA, CTTAG, AGCCG, TAAGT]
Although four different colors are used for the fluorescent nucleotides, only two lasers are used to excite the fluorescence. The fluorescent labels are grouped in pairs: labels on A and C are excited by a red laser, and labels on G and T are excited by a green laser. The software assumes that signal from both lasers will be balanced at each cycle.
Because A and C share a laser, distinguishing the A signal from the C signal is more difficult for the instrument than distinguishing A from G or A from T. Base substitution errors are the most common type of sequencing error for Illumina instruments.

Sequencing technology overview: Illumina
The software assumes that signal from both lasers will be balanced at each cycle. It also uses patterns of base addition in adjacent clusters to determine which clusters are a single template and which are a mixture of templates, and rejects data from clusters that are mixed templates.

Fixed-length barcodes: poor yield of sequence
  Barcode   RE site
  ATCG      AATTC
  CGTA      AATTC
  GCAT      AATTC
  TAGC      AATTC

Variable-length barcodes: good yield of sequence
  Barcode   RE site   Following bases
  ATCG      AATTC     u v w
  CGTAG     AATTC     x y
  GCATCT    AATTC     z
  TAGCTGT   AATTC

With fixed-length barcodes, every cluster shows the same restriction-site base at the same cycle, so the signal is unbalanced at those cycles. Variable-length barcodes stagger the restriction site across cycles.
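The effect of barcode length on base balance can be checked with a small simulation. This is a minimal sketch, not part of the original slides: it uses the barcodes and the AATTC site shown above, while the random insert sequence, the read counts, and the function names are assumptions made for illustration.

```python
from collections import Counter
import random

SITE = "AATTC"  # restriction-site remnant that follows each barcode (from the slide)
FIXED = ["ATCG", "CGTA", "GCAT", "TAGC"]
VARIABLE = ["ATCG", "CGTAG", "GCATCT", "TAGCTGT"]


def make_reads(barcodes, n_per_barcode=250, insert_len=10, seed=0):
    """Build toy reads: barcode + RE site + random insert (hypothetical data)."""
    rng = random.Random(seed)
    reads = []
    for barcode in barcodes:
        for _ in range(n_per_barcode):
            insert = "".join(rng.choice("ACGT") for _ in range(insert_len))
            reads.append(barcode + SITE + insert)
    return reads


def max_base_fraction(reads, cycle):
    """Fraction of reads carrying the most common base at a 0-based cycle."""
    counts = Counter(read[cycle] for read in reads if len(read) > cycle)
    return max(counts.values()) / sum(counts.values())


def monomorphic_cycles(reads, n_cycles):
    """Return 1-based cycles at which every read shows the same base."""
    return [c + 1 for c in range(n_cycles) if max_base_fraction(reads, c) == 1.0]


fixed_reads = make_reads(FIXED)
variable_reads = make_reads(VARIABLE)
print("fixed-length barcodes, monomorphic cycles:", monomorphic_cycles(fixed_reads, 9))
print("variable-length barcodes, monomorphic cycles:", monomorphic_cycles(variable_reads, 9))
```

In this toy setup the fixed-length design shows a single base at every cycle covering the restriction site, while the variable-length design has no cycle where all reads agree.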

Experimental design
Why is this important? It determines the value of the data for analysis.
Doesn't the statistical software take care of that? No amount of statistical sophistication can separate confounded factors after data have been collected (Auer and Doerge, 2010).
Sources of variation
  Nuisance factors: technical problems, random noise
  Experimental factors: variation among individuals

Experimental design
What are the sources of variation in the experiment?
  Among individuals
  Among treatments
  Among sequencing runs
  Among lanes/sectors within a run
  Among library preparations
Which sources of variation are of interest?
  Avoid confounding effects of interest with nuisance effects
  Allocate effort to estimate effects of greatest interest
  Block to replicate measurements across nuisance effects
  Exploit barcoding/multiplexing tools for blocking
  Balanced or partially balanced designs are possible, using complete or incomplete blocking
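One way to use multiplexing for blocking is to treat each lane as a block and spread every group of samples evenly across lanes. The sketch below assumes a balanced design in which each group size is a multiple of the lane count; the group names, sample labels, and the `blocked_layout` function are hypothetical and not taken from Auer and Doerge.

```python
import random
from collections import Counter, defaultdict


def blocked_layout(samples_by_group, n_lanes, seed=1):
    """Assign samples to lanes so each lane (block) holds every group equally."""
    rng = random.Random(seed)
    layout = defaultdict(list)
    for group, samples in samples_by_group.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # randomize which sample lands in which lane
        offset = rng.randrange(n_lanes)  # random starting lane for this group
        for i, sample in enumerate(shuffled):
            lane = (i + offset) % n_lanes + 1
            layout[lane].append((group, sample))
    return dict(layout)


# Hypothetical experiment: three groups of eight barcoded samples, four lanes.
groups = {
    name: [f"{name}_{i}" for i in range(1, 9)]
    for name in ("control", "treated", "stressed")
}
layout = blocked_layout(groups, n_lanes=4)
for lane in sorted(layout):
    per_group = Counter(group for group, _ in layout[lane])
    print(f"lane {lane}: {dict(per_group)}")
```

Because every lane contains the same number of samples from each group, lane-to-lane differences can be estimated separately from group differences instead of being confounded with them.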

Experimental design
Auer & Doerge, Genetics 185(2):405-16, 2010
Suppose 21 libraries (each containing a different pool of multiplexed samples) are each sequenced in a separate lane. Nuisance effects are then confounded with sample differences.

Experimental design
Auer & Doerge, Genetics 185:405, 2010

DNA sequencing provides options

Just the facts

Sense from sequence reads: methods for alignment and assembly. Flicek & Birney, Nat Methods 6(11 Suppl):S6-S12, 2009
the individual outputs of the sequence machines are essentially worthless by themselves once analyzed collectively DNA sequencing reads have tremendous versatility... Biologists interested in sequencing to answer their experimental questions should prepare themselves to join a fast-moving field and embrace the tools being developed specifically for it. As more sequence is generated, effective use of computational resources will be more and more important.

STACKS software data analysis workflow
Catchen et al., Mol Ecol 22:3124-3140, 2013

Analysis of Deep Sequencing Data: Summary
Data are not information
Information is not knowledge
Knowledge is not wisdom
(Anon.)

Conclusions
Sequencing technology continues to develop rapidly.
Data analysis methods are also developing, but almost all use Linux, a potential barrier for biologists.
The costs of DNA sequencing are dropping; experiments are likely to be less expensive next year than this year.
Skills for computational data analysis are a key component of successful GBS experiments.

Thank you!

Computational thinking: four general principles
Decompose a complex problem into simple steps. Linux is based on simple tools that do one thing well; these tools require problems to be framed in simple terms.
Look for patterns. Recognizing similarities among different types of problems allows re-use of the same tools in new contexts.
Generalize patterns to create abstract versions. A tool is most powerful when it can be applied to a variety of problems that all share common features.
Combine simple tools into more complex pipelines. Repetitive tasks are what computers are good at; our job is to build the algorithms, or sequences of simple steps, that allow the computer to do those repetitive tasks so we don't have to.
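The same decompose-and-combine idea can be illustrated outside the shell. The sketch below is an assumption-laden toy example rather than anything from the slides: three small functions, each doing one thing, are chained to count reads per barcode in FASTQ-style input, much as a Linux pipeline chains simple tools. The records and function names are hypothetical.

```python
from collections import Counter
from itertools import islice


def sequence_lines(lines):
    """Yield the second line of every 4-line FASTQ record (the sequence)."""
    return islice(lines, 1, None, 4)


def prefixes(sequences, length):
    """Yield the first `length` bases of each sequence."""
    return (seq.strip()[:length] for seq in sequences)


def tally(items):
    """Count how often each item occurs."""
    return Counter(items)


# Hypothetical three-read FASTQ input, given here as a list of lines.
fastq_lines = [
    "@read1", "ATCGAATTCAAAC", "+", "IIIIIIIIIIIII",
    "@read2", "ATCGAATTCGGGT", "+", "IIIIIIIIIIIII",
    "@read3", "CGTAAATTCTTTA", "+", "IIIIIIIIIIIII",
]

# Each step consumes the output of the previous one, like tools joined by pipes.
barcode_counts = tally(prefixes(sequence_lines(fastq_lines), 4))
print(barcode_counts)
```

Each function can be reused on its own, for example `sequence_lines` to compute read lengths, which is the point of building pipelines from simple pieces.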

Key principles from the Eric Raymond book chapter
Clarity is better than cleverness. Document everything you do, because you won't remember what you did, or why.
Programmer time is more expensive than machine time. Don't worry about optimizing things unless it is necessary.
Prototype before polishing: get it working before you optimize it. It is often easiest to start with something very simple, then add complexity and capability in steps.
Design for simplicity; add complexity only where you must. Make things as simple as possible, but no simpler (paraphrased from Albert Einstein).

Illumina flowcell geometry (HiSeq)
Lanes: 1 2 3 4 5 6 7 8
Tiles within a HiSeq lane are numbered using a system with three separate numbers. The first digit denotes which surface (1 = lower, 2 = upper), the second denotes a vertical swath (1 = left, 2 = middle, 3 = right), and the last two digits denote a tile within that swath (01 means closest to the outflow end of the lane; 08 or 16 means closest to the inflow end of the lane).
Tile layout within a lane:
  1101  1201  1301
  1102  1202  1302
   ...   ...   ...
  1107  1207  1307
  1108  1208  1308
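The tile numbering scheme can be decoded mechanically. This is a minimal sketch assuming the four-digit layout described above; the function name and the returned field names are hypothetical, and the meaning of each number is as given in the text.

```python
def decode_tile(tile):
    """Split a four-digit HiSeq tile number into surface, swath, and position."""
    text = str(tile)
    if len(text) != 4 or not text.isdigit():
        raise ValueError(f"expected a four-digit tile number, got {tile!r}")
    return {
        "surface": int(text[0]),    # first digit: flowcell surface
        "swath": int(text[1]),      # second digit: vertical swath within the lane
        "position": int(text[2:]),  # last two digits: tile within the swath
    }


print(decode_tile(1207))
print(decode_tile("2316"))
```

Grouping reads by these three values is a simple way to look for quality problems tied to one surface, swath, or end of a lane.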

Understanding FASTQ format, or what do all these symbols mean?
[Figure: annotated FASTQ record. Labels: instrument ID, lane, tile, X, Y, barcode, read number, flowcell; header lines, sequence, quality scores]
Quality scores are numbers that represent the probability that the given base call is an error. These probabilities are always less than 1, so the score is given as minus 10 times the base-10 logarithm of the probability: Q = -10 × log10(P). For example, an error probability of 0.001 (1 × 10^-3) is represented as a quality score of 30.
The numbers are converted into text characters so they occupy less space: a single character carries as much information as two digits plus the space between adjacent values.
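The conversion between error probabilities, quality scores, and characters can be sketched in a few lines. The example below assumes the common encoding in which the character code is the score plus 33; other offsets have been used, so the `offset` parameter and the function names are illustrative assumptions.

```python
import math


def phred_from_probability(p_error):
    """Quality score Q = -10 * log10(P) for an error probability P."""
    return -10 * math.log10(p_error)


def probability_from_phred(q):
    """Error probability P = 10 ** (-Q / 10) for a quality score Q."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to integer scores (offset is assumed)."""
    return [ord(char) - offset for char in quality_string]


def encode_quality(scores, offset=33):
    """Convert integer scores to a FASTQ quality string (offset is assumed)."""
    return "".join(chr(score + offset) for score in scores)


print(round(phred_from_probability(0.001)))
print(decode_quality("I?5+!"))
print(encode_quality([40, 30, 20, 10, 0]))
```

Decoding with the wrong offset shifts every score by a constant, which is why it matters to know which encoding a file uses before filtering on quality.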

Understanding FASTQ format
Illumina v1.8 header version:
@HWI-EAS209:06:FC706VJ:5:1105:5894:21141 1:N:ATCACG
[Figure labels: instrument/flowcell ID, lane, tile, X, Y, index, read number; header lines, sequence, quality scores]
Unfortunately, at least four different ways of converting numbers to characters have been used, and header line formats have also changed, so one aspect of data analysis is knowing what you have.
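A header in this style can be split into named fields. This is a minimal sketch based only on the example header above: the field names follow the slide labels, calling the second colon-separated field a run number is an assumption, and the function name is hypothetical. Headers from other software versions will need different handling.

```python
def parse_header(line):
    """Parse an Illumina v1.8-style FASTQ header line into a dictionary."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    location, _, read_info = line[1:].strip().partition(" ")
    # Seven colon-separated fields before the space.
    instrument, run, flowcell, lane, tile, x, y = location.split(":")
    extra = read_info.split(":")
    return {
        "instrument": instrument,
        "run": run,  # assumed meaning of the second field
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(extra[0]),
        "filter_flag": extra[1],
        "index": extra[-1],  # index (barcode) sequence is the last field
    }


header = "@HWI-EAS209:06:FC706VJ:5:1105:5894:21141 1:N:ATCACG"
fields = parse_header(header)
print(fields["lane"], fields["tile"], fields["index"])
```

Checking the first header of a file against a parser like this is a quick way to find out which header format the data use.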