nuts and bolts of DNA sequencing approaches and bioinformatic tools




Dionysios A. Antonopoulos
Institute for Genomics and Systems Biology, Biosciences Division, Argonne National Laboratory
August 7, 2012

Metagenomics
Basic concept: an approach for obtaining the total gene content of a microbial community in order to understand its function.
Day-to-day reality: "metagenomics" has been used interchangeably to describe 16S rRNA-based inventories and whole shotgun approaches.
Different questions can be addressed (sometimes exclusively), and drastically different hardware is needed (in terms of both wet lab and computation).

How we study microbial communities
The mode for the past 25 years has been a single gene as proxy for the whole organism (inferred from the 16S rRNA gene).
New insights on the extent of the uncultivable world.
Wide community adoption (i.e., accessible technology and analytical tools).
Fox (2005) ASM News 71:6-7

Tools
Hardware equations for handling 16S rRNA data
[Figure: combinations of sequencing platforms: Sanger, 454, Illumina]

Tools
Database and software options for handling 16S rRNA data:
DATABASES FOR rRNA SEQUENCES
ANALYSIS PLATFORMS FOR rRNA SEQUENCES

The concept: bookkeeping and accounting

CHALLENGE #1: WORKING IN A DE-CENTRALIZED/EMERGING FIELD
DATABASES FOR rRNA SEQUENCES
ANALYSIS PLATFORMS FOR rRNA SEQUENCES

The concept: bookkeeping and accounting

COGs nr subsystems

The M5 Initiative
An initiative by the Genomic Standards Consortium (GSC); the five M's in M5 stand for the intersection of:
Metagenomics
Metadata
MetaAnalysis
Models
MetaInfrastructure
In brief: bring together a set of strategic partners investing in various aspects of new and emerging technologies (workflows, grids, clouds, turn-key desktop solutions, etc.) to build a next-generation computing landscape, i.e. a community-shared platform more akin to those built by the physics (e.g. colliders) and astronomy (radiotelescopes) communities.

It's all about standardization

CHALLENGE #2: SCALE OF DATA PRODUCTION HAS CHANGED; IMPACT ON COMPUTATION

Computing cost dominates cost
[Figure: bioinformatics cost versus sequencing cost (in dollars) by DNA output (Gb) across DNA sequencing platforms: 454, GAIIx, HiSeq2000]
The solution is to combine novel search algorithms, data reduction strategies, and cloud computing capabilities to process data at this scale.
Cost is purely BLASTX on Amazon EC2.
Source: Wilkening et al., Proceedings IEEE Cluster09, 2009

CHALLENGE #3: BIOINFORMATICS IN A TRADITIONALLY DATA (COMPUTATION)-DEFICIENT FIELD

What is MG-RAST?
Basic concept: a web-based resource used to annotate metagenomic datasets and organize the data for queries.
Diagram labels: data, quality control, analysis, reduction, shotgun, rRNA
http://metagenomics.anl.gov

What is MG-RAST?
Basic concept: a web-based resource used to annotate metagenomic datasets and organize the data for queries.
Diagram labels: data, quality control, analysis, reduction
Shotgun pipeline: Upload, Preprocessing, Dereplication, DRISEE, Screening, Gene Calling, AA Clustering 90%, Protein Identification, Annotation Mapping, Abundance Profiles, Done
rRNA pipeline: RNA detection, RNA Clustering 97%, RNA Identification
http://metagenomics.anl.gov
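The Dereplication step named in the pipeline above can be illustrated with a minimal sketch. It assumes exact-match dereplication over reads held in memory; the function name `dereplicate` and the sample reads are hypothetical, and MG-RAST's own implementation may well use a different criterion.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse exact duplicate reads, keeping first-seen order.

    Returns a list of (sequence, count) pairs. Exact, case-insensitive
    matching is an assumption made for illustration only.
    """
    counts = Counter(seq.upper() for seq in reads)
    seen = set()
    unique = []
    for seq in reads:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((key, counts[key]))
    return unique


# Hypothetical reads; three of them are the same sequence.
reads = ["ACGTAC", "acgtac", "TTGACA", "ACGTAC", "GGCATC"]
result = dereplicate(reads)
```

Keeping the count next to each unique sequence means later abundance estimates can still account for the reads that were collapsed.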

MG-RAST user community contribution
Rapid growth: 15.11 Tb (as of last night), >54,400 datasets from thousands of groups.
Many labs (not just centers) produce data.
http://metagenomics.anl.gov

The concept: bookkeeping and accounting

Metadata
sample_name, cur_land_use, cur_vegetation, cur_vegetation_meth, previous_land_use, previous_land_use_meth, crop_rotation, agrochem_addition, tillage, fire, flooding, link_climate_info, annual_season_temp, annual_season_precpt, link_class_info, fao_class, local_class, local_class_meth, soil_type, soil_type_meth, slope_gradient, slope_aspect, tot_n_meth, microbial_biomass, microbial_biomass_meth, extreme_salinity, salinity_meth, heavy_metals, heavy_metals_meth, al_sat, al_sat_meth, misc_param, extreme_event, profile_position, horizon, drainage_class, horizon_meth, texture, sieving, texture_meth, water_content_soil, ph, water_content_soil_meth, ph_meth, samp_weight_dna_ext, tot_org_carb, pool_dna_extracts, tot_org_c_meth, store_cond, tot_n
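A sample's metadata can be treated as a simple key-value record and checked before submission. The sketch below uses field names from the slide (`sample_name`, `cur_land_use`, `soil_type`, `ph`); which fields count as required is a hypothetical choice made here for illustration, not something the slide states, and the record values are invented.

```python
# Hypothetical set of required fields; the slide does not say which are mandatory.
REQUIRED_FIELDS = {"sample_name", "soil_type", "ph"}


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the sorted required fields that are absent or empty in `record`."""
    return sorted(f for f in required if record.get(f) in (None, ""))


# Invented example record using field names from the slide.
record = {
    "sample_name": "plot_A1",
    "cur_land_use": "cropland",
    "ph": 6.8,
}
missing = missing_fields(record)
```

Here `missing` lists `soil_type`, the one required field the example record lacks.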

Part of an emerging digital biology community
In the current version of MG-RAST, metadata is viewed as the key to providing the intersection between datasets.
[Figure: users (dots) sharing pre-publication metagenomes (edges)]
Source: MG-RAST, 800+ shared metagenomes
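One way to picture metadata as the intersection between datasets is to link datasets that report the same value for a given field. This is a minimal sketch under that assumption; the function `shared_metadata_links`, the dataset IDs, and the values are all hypothetical, and it is not a description of how MG-RAST computes such links.

```python
from collections import defaultdict
from itertools import combinations


def shared_metadata_links(datasets, field):
    """Return sorted pairs of dataset IDs that share the same value for `field`."""
    by_value = defaultdict(list)
    for dataset_id, metadata in datasets.items():
        value = metadata.get(field)
        if value is not None:  # datasets lacking the field cannot be linked
            by_value[value].append(dataset_id)
    links = []
    for ids in by_value.values():
        links.extend(combinations(sorted(ids), 2))
    return sorted(links)


# Invented datasets keyed by hypothetical IDs.
datasets = {
    "mg1": {"soil_type": "loam", "ph": 6.5},
    "mg2": {"soil_type": "clay", "ph": 7.1},
    "mg3": {"soil_type": "loam", "ph": 5.9},
    "mg4": {"ph": 6.5},
}
links = shared_metadata_links(datasets, "soil_type")
```

A dataset that omits the field (like `mg4` for `soil_type`) drops out of the comparison entirely, which is one reason standardized, complete metadata matters for cross-dataset queries.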

Summary
Sequencing strategies: single-gene targeted (proxy for organism) or indiscriminate sequencing (proxy for community function).
Challenges related to: a de-centralized/emerging field, increased scale of data production by next-gen DNA sequencing platforms, and a traditionally data-deficient field.
Solutions in the form of: standardization/exchange of data types, novel data handling strategies, and web-based repositories of organized data.
Metadata, metadata, metadata: standardization and community sharing.

Acknowledgements
MG-RAST Team: Folker Meyer, Narayan Desai, Mark Domanus, Mark D'Souza, Elizabeth M. Glass, Travis Harrison, Kevin Keegan, Hunter Matthews, Sarah Owens, Tobias Paczian, Will Trimble, Andreas Wilke, Jared Wilkening
Collaborators: A. Arkin (Berkeley), E. Chang (UChicago), Dawn Field (Oxford), F.-O. Glöckner (MPI Bremen), Jack Gilbert (Argonne), Jeff Grethe (CalIT2, CAMERA), Sarah Hunter / Guy Cochrane (EBI), Rob Knight (Colorado), Nikos Kyrpides (DOE JGI), Owen White (UMaryland, HMP DACC)