Metagenomic Assembly

Similar documents

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Master's projects at ITMO University. Daniil Chivilikhin PhD ITMO University

Building Bioinformatics Capacity in Africa. Nicky Mulder CBIO Group, UCT

CHALLENGES IN NEXT-GENERATION SEQUENCING

De Novo Assembly Using Illumina Reads

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

MK Gibson et al. Training and Benchmarking of Resfams profile HMMs

Introduction to NGS data analysis

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow. Barry Bolding. Cray Inc Seattle, WA

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

MiSeq: Imaging and Base Calling

How Sequencing Experiments Fail

From Data to Insight: Big Data and Analytics for Smart Manufacturing Systems

DNA Sequencing Data Compression. Michael Chung

Standards, Guidelines and Best Practices for RNA-Seq V1.0 (June 2011) The ENCODE Consortium

Overview of Next Generation Sequencing platform technologies

Typing in the NGS era: The way forward!

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

G E N OM I C S S E RV I C ES

UCHIME in practice Single-region sequencing Reference database mode

nuts and bolts of DNA sequencing approaches and bioinformatic tools

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

UF EDGE brings the classroom to you with online, worldwide course delivery!

A Primer of Genome Science THIRD

The NGS IT notes. George Magklaras PhD RHCE

Integrated Rule-based Data Management System for Genome Sequencing Data

Use of Whole Genome Sequencing (WGS) of food-borne pathogens for public health protection

Keeping up with DNA technologies

An Overview of DNA Sequencing

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

GC3 Use cases for the Cloud

MAD2: A Scalable High-Throughput Exact Deduplication Approach for Network Backup Services

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production

SMRT Analysis v2.2.0 Overview. 1. SMRT Analysis v SMRT Analysis v2.2.0 Overview. Notes:

Protein Protein Interaction Networks

Data Registry Workshop Report

Microbial Oceanomics using High-Throughput DNA Sequencing

A Tutorial in Genetic Sequence Classification Tools and Techniques

Overview sequence projects

MoBEDAC -- Integrated data and analysis for the indoor and built environment. Folker Meyer Argonne National Laboratory GSC 13 Shenzhen, China

Monte Carlo testing with Big Data

NGS data analysis. Bernardo J. Clavijo

Introduction to Bioinformatics 3. DNA editing and contig assembly

Module 1. Sequence Formats and Retrieval. Charles Steward

University of Glasgow - Programme Structure Summary C1G MSc Bioinformatics, Polyomics and Systems Biology

Alison Yao, Ph.D. July 2014

Scaling up to Production

Statistics Output Guide

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Alternative Deployment Models for Cloud Computing in HPC Applications. Society of HPC Professionals November 9, 2011 Steve Hebert, Nimbix

Bioinformatics and its applications

E. coli plasmid and gene profiling using Next Generation Sequencing

Genomic Testing: Actionability, Validation, and Standard of Lab Reports

Geospiza s Finch-Server: A Complete Data Management System for DNA Sequencing

NECC History. Karl V. Steiner 2011 Annual NECC Meeting, Orono, Maine March 15, 2011

Core Facility Genomics

An Application of Machine Learning to Network Intrusion Detection

LifeScope Genomic Analysis Software 2.5

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach

Storage Solutions for Bioinformatics

Introduction to next-generation sequencing data

Cloud Computing for Scientific Research

Survey of clinical data mining applications on big data in health informatics

How To Cluster Of Complex Systems

July 7th 2009 DNA sequencing

Three data delivery cases for EMBL- EBI s Embassy. Guy Cochrane

NORTH PACIFIC RESEARCH BOARD SEMIANNUAL PROGRESS REPORT

Updated CellTracker software manual

Databases and platforms for data analysis from NGS of MTB

Deep Sequencing Data Analysis

HEALTH DATA LEADERSHIP SUMMIT

What is a contig? What are the contig assembly programs?

Tutorial for proteome data analysis using the Perseus software platform

Resource Models: Batch Scheduling

Genome Analysis in a Dynamically Scaled Hybrid Cloud

KNIME TUTORIAL. Anna Monreale KDD-Lab, University of Pisa

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Budapest University of Technology and Economics Department of Measurement and Information Systems. Business Process Modeling

Next Generation Sequencing

Intro to Bioinformatics

Molecular typing of VTEC: from PFGE to NGS-based phylogeny

Transcription:

Metagenomic Assembly Successes, Validation, And Challenges Matthew Scholz Los Alamos National Laboratory Genome Sciences Division

Metagenomic Assembly is Complex Complexity dictates method of assembly Simple communities assemble easily Complex communities.do not Complexity dictates sequencing requirement More sequence means more complexity and more difficulty

Assembly Successes High throughput pipeline Improved assemblies JGI/LANL has successfully assembled 123 metagenomes in last year Average time ~ 1 week/sample HMP metagenome assemblies LANL assembled 223 metagenomes from whole genome shotgun sequencing of HMP Assembly of 10 site specific samples (multiple samples from same site) Validation of several metagenome assemblies

Assembly differences HMP shotgun metagenome assembly Optimization Tool Selection Kmer Selection Selection of # Cores Volume production 1 Kmer Metrics JGI Metagenome Assembly Multiple tool selection Range of Kmers utilized Many different sample types

Draft Many Samples Many metrics Max contig Total Assembly size Etc. HMP assemblies How to interpret metrics?

JGI/LANL Metagenome Assembly Pipeline

JGI/LANL Assembly Process Improved merging of Multiple assemblies Statistical Metrics are better Validation supports these improved assemblies

Validation How do you validate metagenomes? Contig Statistics Read Mapping Annotation Similarity Searches Phylogenetic distribution Reference genomes

Read Mapping Are contigs correctly assembled? Do read data confirm contigs? Edge effects Generate information about coverage Average Fold Coverage Percent of Contig Covered SNP/INDEL information Indication of diversity within sample

Finding Potentially Erroneous Contigs with Read mapping Read coverage of each base from pipeline. Lines delineate regions where mean coverage deviates past thresholds Is this a good contig? Where did contig come from? Can use to automatically break or discard contigs that fail read-mapping

Reference Genomes Map reads to reference genomes Determine which/how many sequenced genomes present in sample Allow partitioning of reads SNP/INDEL analysis Post-Assembly Map contigs to genomes SNP/INDEL Validation

Ongoing Challenges Hardware/Software Kmer assembly Speed/RAM tradeoff Algorithms for metagenomes Too much software How to Determine Correct assembly? Metadata Too much data Not enough coverage

Read Incorporation % of Reads Incorporated 100 90 80 70 60 50 40 30 20 10 0 0.00E+00 2.00E+08 4.00E+08 6.00E+08 8.00E+08 # of Reads in Sample Biofilm Compost Sediment Soil Water

Not Enough Coverage Coverage varies by sample 1 Lane HiSeq is maximum for current hardware/assemblers (complex samples) ~6000X coverage ~ 70X coverage ~600X coverage ~28X coverage >95% of Reads <10% of Reads

Concluding Thoughts We need metadata (standards) Tiers 1. Assembly Statistics 2. Assembly Level 3. Contig analysis Scores

Classification of Metagenomes 1. Assembly Statistics % Read Incorporation Total Assembled Bases (post filtering?) G+C Content Coverage histograms 2. Assembly Level Draft 1 assembler, 1 set of parameters Improved Draft Current JGI/LANL assembly/merge method Read based validation High Quality (Theoretical) Binning strategies pre-assembly Read based correction/trimming Single cell genomes from site as references

Classification of Metagenomes 1. Assembly Statistics 2. Assembly Level 3. Contig analysis Read mapping based validation Clustering Gene analysis

Many Thanks: Metagenomics and Data Analysis Team Patrick Chain Tracey Freitas Ron Croonenberg Bin Hu Chien-Chi Lo Shawn Starkenburg Gary Xie Shannon Steinfadt Others Metagenome work Jim Tiedje Titus Brown Adina Howe HMP consortium Mihai Pop Joe Zhou Kostas Konstantinidis Informatics Team Ben Allen Andy Seirp Criag Blackhart Yan Xu Todd Yilk Single cell work Roger Lasken Ramunas Stepanaskus Steve Hallam Management Team Chris Detter David Bruce Tracy Erkkila Lance Green Shunsheng Han And many others Wet-lab Team Cheryl Gleasner Kim McMurry Krista Reitenga Xiaohong Shen Others Project Management Shannon Johnson Lynne Goodwin Others Kmer team Joel Berendzen Nick Hengartner Ben McMahon Judith Cohn Finishing and SCG Olga Chertov Karen Davenport Armand Dichosa Michael Fitzsimons Ahmet Zeytun Others

Metagenomic Assembly Strategies Bigger computers 1 Lane HiSeq PE reads = 400 M reads 1TB RAM (complex communities) Better Assemblers MetaIDBA MetaVelvet Ray ABySS AllPaths Binning Reads Unsupervised/Heuristic Machine Learning Statistical Reference Based To Infinity and Beyond