How long is long enough?



Similar documents
Next Generation Sequencing

NGS data analysis. Bernardo J. Clavijo

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Computational Genomics. Next generation sequencing (NGS)

Support Vector Machine (SVM)

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

A Simple Introduction to Support Vector Machines

Welcome to Pacific Biosciences' Introduction to SMRTbell Template Preparation.

Support Vector Machines Explained

NGS Technologies for Genomics and Transcriptomics

Support Vector Machines with Clustering for Training with Very Large Datasets

Next generation DNA sequencing technologies. theory & prac-ce

Support Vector Machine. Tutorial. (and Statistical Learning Theory)

Searching Nucleotide Databases

SMRT Analysis v2.2.0 Overview. 1. SMRT Analysis v SMRT Analysis v2.2.0 Overview. Notes:

A study on machine learning and regression based models for performance estimation of LTE HetNets

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Making Sense of the Mayhem: Machine Learning and March Madness

July 7th 2009 DNA sequencing

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Support Vector Machines for Classification and Regression

Typing in the NGS era: The way forward!

An Introduction to Machine Learning

E-commerce Transaction Anomaly Classification

Introduction: Overview of Kernel Methods

Early defect identification of semiconductor processes using machine learning

Statistical Models in Data Mining

THE SVM APPROACH FOR BOX JENKINS MODELS

Solar Irradiance Forecasting Using Multi-layer Cloud Tracking and Numerical Weather Prediction

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Keeping up with DNA technologies

Drugs store sales forecast using Machine Learning

An Overview of DNA Sequencing

Environmental Remote Sensing GEOG 2021

Example: Boats and Manatees

Two scientific papers (1, 2) recently appeared reporting

Why do statisticians "hate" us?

Microbial Oceanomics using High-Throughput DNA Sequencing

HT2015: SC4 Statistical Data Mining and Machine Learning

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Statistical Machine Learning

Gene Expression Assays

Reading DNA Sequences:

Data Mining - Evaluation of Classifiers

Mathematical Models of Supervised Learning and their Application to Medical Diagnosis

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Linear Threshold Units

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Galaxy Morphological Classification

Analyze It use cases in telecom & healthcare

Intrusion Detection via Machine Learning for SCADA System Protection

EMPIRICAL RISK MINIMIZATION FOR CAR INSURANCE DATA

Data Mining: An Overview. David Madigan

A Basic Guide to Modeling Techniques for All Direct Marketing Challenges

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Network analysis: P.E.R.T,C.P.M & Resource Allocation Some important definition:

Master's projects at ITMO University. Daniil Chivilikhin PhD ITMO University

Automated DNA sequencing 20/12/2009. Next Generation Sequencing

Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company

8.1 Summary and conclusions 8.2 Implications

Linear Programming for Optimization. Mark A. Schulze, Ph.D. Perceptive Scientific Instruments, Inc.

EDUCATION AND VOCABULARY MULTIPLE REGRESSION IN ACTION

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Modified Genetic Algorithm for DNA Sequence Assembly by Shotgun and Hybridization Sequencing Techniques

Boolean Network Models

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

A PRIMAL-DUAL APPROACH TO NONPARAMETRIC PRODUCTIVITY ANALYSIS: THE CASE OF U.S. AGRICULTURE. Jean-Paul Chavas and Thomas L. Cox *

Predicting Flight Delays

Machine Learning in Statistical Arbitrage

Semi-Supervised Support Vector Machines and Application to Spam Filtering

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University

Biology in the Clouds

Discrete Optimization

A Study on the Comparison of Electricity Forecasting Models: Korea and China

Software Getting Started Guide

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

Protein Protein Interaction Networks

Data Mining and Neural Networks in Stata

First generation" sequencing technologies and genome assembly. Roger Bumgarner Associate Professor, Microbiology, UW

The Operational Value of Social Media Information. Social Media and Customer Interaction

Average Redistributional Effects. IFAI/IZA Conference on Labor Market Policy Evaluation

Local classification and local likelihoods

The prevailing method of determining

Introduction to NGS data analysis

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

On the Interaction and Competition among Internet Service Providers

How To Cluster Of Complex Systems

Lean Six Sigma Analyze Phase Introduction. TECH QUALITY and PRODUCTIVITY in INDUSTRY and TECHNOLOGY

Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes

On Correlating Performance Metrics

Lecture 6: Logistic Regression

Innovations and Value Creation in Predictive Modeling. David Cummings Vice President - Research

Transcription:

How long is long enough? - Modeling to predict genome assembly performance - Hayan Lee@Schatz Lab Feb 26, 2014 Quantitative Biology Seminar 1

Outline Background Assembly history Recent sequencing technology + Algorithm Motivation Lander-Waterman statistics Economical meaning (ROI) Our approach Genome assembly challenges Support Vector Regression (SVR) Feature engineering Model fitting Prediction : Genome assembly performance Contribution 2

Genome Assembly 1977. Sanger et al. 1 st Complete Organism X5375 bp 1995. Fleischmann et al. 1 st Free Living Organism TIGR Assembler. 1.8Mbp 1998. C.elegans SC 1 st Multicellular Organism BAC-by-BAC Phrap. 97Mbp 2000. Myers et al. 1 st Large WGS Assembly. Celera Assembler. 116 Mbp 2001. Venter et al., IHGSC Human Genome Celera Assembler/GigaAssembler. 2.9 Gbp 2010. Li et al. 1 st Large SGS Assembly. SOAPdenovo 2.2 Gbp 3

Assembling a Genome 1. Shear & Sequence DNA 2. Construct assembly graph from overlapping reads AGCCTAGGGATGCGCGACACGT GGATGCGCGACACGTCGCATATCCGGTTTGGTCAACCTCGGACGGAC 3. Simplify assembly graph CAACCTCGGACGGACCTCAGCGAA 4. Detangle graph with long reads, mates, and other links 4

Assembly Complexity A R B R C R B A R Long Reads is the solution!!! C

Long Read Sequencing Technology PacBio RS II Moleculo Oxford Nanopore CSHL/PacBio (Voskoboynik et al. 2013) Broad/OxNano @ AGBT

Read Length (bp) PacBio SMRT Sequencing Road Map October, 2013 The new P5 polymerase and C3 chemistry combination (P5-C3) extends the industry-leading sequence read lengths to an average of approximately 8,500 bases, with the longest reads exceeding 30,000 bases, doubling throughput P4 C2 P5 C3 8,500 bp C2 C2 Early chemistries 453 1012 1734 LPR FCR ECR2 2008 2009 2010 2011 2012 2013

S. cerevisiae W303 PacBio RS II sequencing at CSHL by Dick McCombie Size selection using an 7 Kb elution window on a BluePippin device from Sage Science Over 175x coverage in 2 days using P5-C3 Mean: 5910 83x over 10kbp 8.7x over 20kb Max: 36,861bp

PacBio Assembly Algorithms PBJelly PacBioToCA & ECTools HGAP & Quiver Gap Filling and Assembly Upgrade English et al (2012) PLOS One. 7(11): e47768 Hybrid/PB-only Error Correction Koren, Schatz, et al (2012) Nature Biotechnology. 30:693 700 PB-only Correction & Polishing Chin et al (2013) Nature Methods. 10:563 569 < 5x PacBio Coverage > 50x

Many Genomes Are Sequenced Many Questions Are Raised But How long should the read length be? What coverage should be used? Given the read length and coverage, How long are the contigs? How many contigs? How many reads are in each contigs? How big are the gaps? 10

Previous Works 11

Lander-Waterman Statistics G : Genome size L : read Length N : Number of reads C : Coverage = NL G Note : Poisson distribution is assumed! P x; λ = λx e λ x! P(No reads start in a given position) = C0 e C 0! =e C P(At least 1 read starts in a given position) = 1 e C 12

Lander-Waterman Statistics P(No reads start in a given position) = C0 e C 0! =e C P(At least 1 reads starts in a given position) = 1 e C Expected # of bases in Gaps = e C G Expected # of bases in Contigs = (1 e C )G Expected # of contigs = e C N Mean of contig size = Mean of contig size = (derivation) (1 e C )G e C N expected # of bases in contigs expected # of contigs = (ec 1)G N = (ec 1)L C expeated # of bases in contigs expected # of contigs = (ec 1)L c = (e(1 θ)c 1)L c 13

HG19 Genome Assembly Performance by Lander-Waterman Statistics Two key observations 1. Contig over genome size 2. Read Length vs. Coverage Technology vs. Money = (e(1 θ)c 1)L C 14

Empirical Data-driven Approach We selected 26 species across tree of life and exhaustively analyzed their assemblies using simulated reads for 4 different length (6 for HG19) and 4 different coverage per species For the extra long reads, we fixed the Celera Assembler(CA) to support reads up to 0.5Mbp Data (X, Y) Machine Learning Algorithm (SVR) Learned Model Data (X,?) Learned Model Predicted Result 15

26 Species Across Tree of Life 16

HG19 Genome Assembly Performance by Our Simulation Read Length has stronger impact than coverage 17

Why? Lander-Waterman Statistics Assumptions!!! If genome is a random sequence, it will work Our Approach Stop assuming that we cannot guarantee!!! We tried to assume as least as possible. Instead of building on top of assumptions, we let the model learn from the data Empirical data-driven approach 18

Repeats 19

Repeats in Rice 20

Our Goal To predict genome assembly performance Performance(%) N50 from assembly N50 of chromosome segments 100 Target N50 21

Assembly Challenge Read Length Coverage Repeats Genome Size 22

Assembly Challenge (1) Read Length Read length is very important A matter of technology The longer is the better Quality was important but can be corrected PacBio produces long reads, but low quality (~15% error rate) Error correction pipeline are developed Errors are corrected very accurately up to 99% 23

- Assembly Challenge (1) - Read Length 24

Assembly Challenge (2) Coverage A matter of money Using perfect reads, assembly performance increased for most genomes : Lower bound Using real reads, overall performance line will shift to the higher coverage The higher is the better (?) But still it suggests that there would be a threshold that can maximize your return on investment (ROI) 25

Assembly Challenge (2) Coverage 26

A. thaliana Ler-0 http://blog.pacificbiosciences.com/2013/08/new-data-release-arabidopsis-assembly.html Genome size: Chromosome N50: Corrected coverage: 124.6 Mbp 23.0 Mbp 20x over 10kb A. thaliana Ler-0 sequenced at PacBio Sequenced using the previous P4 enzyme and C2 chemistry Size selection using an 8 Kb to 50 Kb elution window on a BluePippin device from Sage Science Total coverage >119x Sum of Contig Lengths: 149.5Mb N50 Contig Length: 8.4 Mb Number of Contigs: 1788 High quality assembly of chromosome arms Assembly Performance: 8.4Mbp/23Mbp = 36% MiSeq assembly: 63kbp/23Mbp =.2%

Assembly Challenge (2) Coverage 28

Assembly Challenge (3) Repeats Genome is not a random sequence Repeat hurts genome assembly performance Isolating the impact of repeats is not trivial Quantifying repeat characteristics is not trivial as well The longest repeat size # of repeats > read length 29

Assembly Challenge (3) Repeats Arabidopsis vs. Fruit fly Arabidopsis (120M) Longest repeat: 44kbp Fruit fly (130M) Longest repeat: 30kbp Mean Read Length # of repeats > read length # of repeats > read length 3,650 210 5564 7,400 112 394 15,000 44 8 30,000 14 2 30

Assembly Challenge (3) Repeats 31

$ $ 32

Assembly Challenge (4) Genome Size Increase the assembly complexity Make a hard problem harder. 33

Assembly Challenge (4) Genome Size 34

S. cerevisiae W303 S288C Reference sequence 12.1Mbp; 16 chromo + mitochondria; N50: 924kbp PacBio assembly using HGAP + Celera Assembler 12.4Mbp; 21 non-redundant contigs; N50: 811kbp; >99.8% id

36

Assembly Challenge (4) Genome Size 37

Our Goal To predict genome assembly performance Performance(%) N50 from assembly N50 of chromosome segments 100 f Read Length Coverage Repeats Genome Size 38

Challenges for Prediction Sample size is small Quality is not guaranteed Predictive Power Overfitting Support Vector Regression (SVR) Cross Validation 39

Support Vector Regression (SVR) Epsilon insensitive loss function L(y, f x, w ) = 0, if y f x, w ε y f x, w ε, otherwise Benefits 1. Simplest fit 2. Robust to outliers (ex) Example of onedimensional linear regression function with epsilon intensive band. 40

Optimization(1) Minimize 1 w 2 2 + C n i=0 (ξ i + ξ i ) Subject to y i w i x il y i + w i x il b ε + ξ i + b ε + ξ i ξ i, ξ i 0 for i = 1,2,, l (ex) Linear case Generalized Lagrange Multiplier (Karush Kuhn Tucker conditions) 41

Optimization (2) Primal min 1 w 2 2 + C n i=0 (ξ i + ξ i ) s.t y i w i x il b ε + ξ i y i + w i x il + b ε + ξ i ξ i, ξ i for i = 1,2,, l Karush Kuhn Tucker(KKT) conditions Dual max 1 2 l i,j=0 α i α i α j α j l < x i, x j > ε i=0 α i + α i l + i=0 y i α i α i s.t. i=0 l α i α i = 0 0 α i, α i C Kernel 42

Optimization (3) Parameters C } SVM ε Kernel Kernel type Linear K(x i, x j )= < x i, x j > Non-Linear meta-parameters as user-defined inputs. Polynomial K(x i, x j )=(x i T x j + 1) d usually based on applicationdomain knowledge and also should reflect distribution of input (x) values of the training data. Radial Basis Function(RBF) K(x i, x j )=exp( 1 2σ 2 x i x j 2 ) Kernel Parameters Depends on kernel type 43

SVR Model Fitting Using four features Read Length Coverage Genome Size # of Repeats > Read Length 44

Feature Engineering (1) Correlation Coefficient Performance vs. genome size R = -0.38 Performance vs. Read Length R = 0.2 45

Feature Engineering (2) Correlation Coefficient Performance and log (genome size) R = -0.49 Performance and log (read length) R = 0.32 Performance and log (genome size)/ log (read length) R = 0.6 Performance and log (coverage ) R = 0.58 Performance and log (# of repeats longer than read length) R = -0.44 46

C2 2012 How long is long enough? 47 Lee, H, Gurtowski, J, Yoo, S, Marcus, S, McCombie, WR, Schatz MC et al. Simons (2014) Center In preparation for Quantitative Biology

Assembly N50 / Chromosome N50 C3 C3 2013 2013 C2 C2 2012 2012 How long is long enough? Lee, H, Gurtowski, J, Yoo, S, Marcus, S, McCombie, WR, Schatz MC et al. Simons (2014) Center In preparation for Quantitative Biology

C5???? C4???? C3 2013 C2 2012 How long is long enough? Lee, H, Gurtowski, J, Yoo, S, Marcus, S, McCombie, WR, Schatz MC et al. Simons (2014) Center In preparation for Quantitative Biology

Our Goal To predict genome assembly performance Performance(%) N50 from assembly N50 of chromosome segments 100 Read Length Coverage Repeats Genome Size 1. How do we measure predictive power? 2. How do we avoid overfitting? 50

Cross Validation K-fold Cross Validation A variation of Leave-One-Out Cross Validation (LOOCV) Leave one species out approach (LOSO) <- Our approach A variation of Leave-One-Out Cross Validation (LOOCV) Use 25 species as training data, test 1 species to measure predictive power Avoid overfitting Model selection by predictive power 51

How long is long enough? 52 Lee, H, Gurtowski, J, Yoo, S, Marcus, S, McCombie, WR, Schatz MC et al. Simons (2014) Center In preparation for Quantitative Biology

Prediction Average of residual is 10% We can predict the new genome assembly performance in 10% of error residual boundary Genome size, read length and coverage used explicitly Repeats are included implicitly 53

Web Service How long is long enough? 54 Lee, H, Gurtowski, J, Yoo, S, Marcus, S, McCombie, WR, Schatz MC et al. Simons (2014) Center In preparation for Quantitative Biology

Contribution Empirical data driven approach We selected 26 species across tree of life and exhaustively analyzed their For the extra long reads, we fixed the Celera Assembler(CA) to support reads up to 0.5Mbp We made a new model that predicts genome assembly performance Prediction in 10% of residue boundary. Our prediction is independent from any error model so that our model will provide upper bound and can be used for the general purpose The Marginal coverage forms around 20x-40x assuming no errors in reads Read Length is more important given that we have enough coverage 55

Contribution Recommendations < 100 Mbp: HGAP/PacBio2CA @ 100x PB C3-P5 expect near perfect chromosome arms < 1GB: HGAP/PacBio2CA @ 100x PB C3-P5 expect high quality assembly: contig N50 over 1Mbp > 1GB: hybrid/gap filling expect contig N50 to be 100kbp 1Mbp > 5GB: Email mschatz@cshl.edu Machine Learning & Big Data Contact hlee@cshl.edu 56

Acknowledgements CSHL Schatz Lab Mike Schatz James Gurtowski Shoshana Marcus BNL Shinjae Yoo McCombie Lab Dick McCombie Eric Antoniou Elena Ghiban Melissa Kramer Panchajanya Deshpande Senem Mavruk Eskipehlivan Scott Ethe Sayers Sara Goodwin 57

Thank You Q & A 58