Building the Systems Biology Knowledgebase. Tom Brettin, Oak Ridge National Laboratory. brettints@ornl.gov, outreach@kbase.us, kbase-users@lists.kbase.us, kbase-devel@lists.kbase.us
Integrate science and the science community: JGI Sequencing; Genome Annotation; Carbon Cycling Processes; Bioenergy Research; Integrate Science Across Activities; Metabolic Modeling; Plant Feedstocks for Bioenergy; Computational Biology; Foundational Research. There is a tremendous wealth of data and information in the Genomic Sciences program. The Knowledgebase (KBase) is an opportunity to integrate this data and information both within individual activities and across different activities.
Everyone should be a contributor! KBase: A. Professional Computational Biologists; B. Data generators and basic analysts; C. Knowledge Seekers; D. Knowledge Generators. Therefore we aim to: Create a powerful framework for programmatic access to the data and functions of KBase (Users A, B), ultimately providing stubs for use in Perl, Python, R, MATLAB, Galaxy, etc. Create a set of packaged widgets that make it easy to place KBase functions on web pages (or perhaps within other apps) and to display them in a recognizable, identifiable way (Users B). Create a simplified portal for search and aggregation of data for data consumers and Knowledge Seekers (Users C, D). Create an innovative platform for knowledge creation, evolution and sharing. [Figure text: "instances of minimum inventory/maximum diversity systems, a term coined by Peter Pearce in his book, Structure in Nature Is a Strategy for Design (MIT Press, 1978)."] DOE Office of Science, Office of Biological and Environmental Research
An Integrated View of Modeling, Simulation, Experiment, and Bioinformatics: Problem Specification; Modeling and Simulation; Analysis & Visualization; Bioinformatics Analysis Tools; Integrated Biological Databases; Experimental Design; High-throughput Experiments; Analysis & Visualization
Systems Biology Knowledge Base: Knowledgebase enabling predictive systems biology. Powerful modeling framework. Community-driven, extensible and scalable open source software and application system. Infrastructure for integration and reconciliation of algorithms and data sources. Framework for standardization, search, and association of data. Enable model-based experimental design and interpretation of results. Microbes; Communities; Plants
Engineering a Microbe for Biofuel Production: Annotated Genome; Annotation algorithms; Metabolic reconstruction; Feed Stock; Stresses (Hydrolysate, pH, Salt, End product, intermediates); Metabolic model generation; Model optimization algorithms; Biomass; Regulatory network inference; Isoprene; Other functional modeling; Fitting kinetic model parameters; DNA replication, transcription, protein folding, translation; Regulation; Predicting pathway fluxes; KBase Tool Integration; Proposing strain optimizations; Genome Sequence; Comparative Genomics; KEGG; BRENDA; BioCyc; Published models; Gene KO Phenotypes; Transcriptomics; Metabolomics; Proteomics; Growth curves; Flux tracing experiments; KBase Data Integration
Modifying Lignin Biosynthesis: S, G, H (figure labels); PolyPhen-2; Genome annotation algorithms; Comparative genomics; Genome-wide correlative analysis; SNP-influenced changes in protein structure and function; Pathway predictions; Network inference; Pathway reconstruction; Omics & SNP overlay; Model optimization, validation; Phylogenomics; Modeling phase I; Plant systems modification; Phenotype; Mutant population; Resequencing data; Transcriptomics; Proteomics; Metabolomics
Culturing Recalcitrant Microbes from Communities: Covariation Analysis, Phylogenetically and Functionally Interesting Keystone Species; Phylogenetic Inference; Gene Functional Annotation; Trp N; Differential Gene Expression; Population Statistics; Comparative Metagenomics; Isolate Genomes and Models; Genome Assembly from Metagenomics; Annotation and Metabolic Reconstruction; Regulation and Functional Modeling; Predict Syntrophic Interactions; Predict Culturing Conditions; Isolate vs. Community Phenotype; Species Abundance; Functional Gene Abundance; Phylo binning and scaffolding; Transcriptomics; Metabolomics; Proteomics; Temp; pH; Salinity; Amino Acids; Cofactors; Syntrophies
What Does the KBase Need to Provide? Scalable compute and data capabilities beyond those available locally. Distributed infrastructure available 24x7 worldwide. Integration with local bioinformatics systems for seamless computing and data management. Leverage of remote systems administration and support via service providers. Access to state-of-the-art facilities at a fraction of the cost (service providers just add more servers). Centralized support of tools and data. Bottom line: enable biologists to focus on biology.
Leverage Existing Investments: We leverage the considerable investments in existing integrated databases and analysis environments. Key challenge: how do we build on these systems yet provide the community with an integrated view for future development?
Microbes Online; Model SEED; MG-RAST; 1000s Data Sets; 300+ Daily Users; Meta Microbes Online; 6532 Models; 1000+ Users; 41,000 Metagenomes; 500+ Daily Users; Phytozome; 153 Metagenomes; 100+ Daily Users; RegFam; 1000s Papers; 100+ Daily Users; 20,000+ users; The SEED; 1166 Subsystems; 5859 Users; 25 Plant Genomes; 300 Daily Users; RAST; 39,000 Genomes; 6000+ Users
Infrastructure Goals: Our vision is to put users in the driver's seat.
DOE Systems Biology Knowledgebase (KBase): Data and modeling for predictive biology. Overview of Infrastructure. Tom Brettin and Rick Stevens, Oak Ridge and Argonne National Laboratories
Working As One Team: Plant CDM Design and Build (Jan 2012, ORNL); Communities Hackathon (Jan 2012, LBL); First Internal KBase Build (Feb 2012, ANL)
Scientific Software Technical Reviews (May 2-3, 2012)
Energy Sciences Network (ESnet): KBase leverages ESnet for 10+ Gb/s data transfer between all nodes. The ESnet backbone (ESnet4) is a national 10 Gbps optical circuit infrastructure. ESnet shares its optical network with Internet2. ESnet's IP network functions as a Tier 1 internet service provider. Figure label: BNL.
The DOE KBase Cloud: Built on the DOE ASCR investment in the Magellan cloud infrastructure. Current configuration of 700 nodes, homed at ANL, optimized for heterogeneous applications. OpenStack Cloud @ Argonne; OpenStack Cloud @ Oak Ridge; Cluster system @ Berkeley; Cluster system @ Brookhaven
The KBase Cloud Architecture: Data Intensive Science; KBase Application Development; Large Scale Computation; Method Development; HPC Cluster Image; MapReduce Image; Ubuntu Image; KBase Image; OpenStack IaaS Cloud Software Stack (EC2/S3 APIs); Commodity Compute Cluster Hardware
The KBase Services. Services Oriented Architecture (SOA): The KBase Unified API gives access to a highly diverse set of services, ranging from quick retrieval of simple data to massive computations on the KBase Cloud. In an SOA the system is functionally decomposed into many services, each of which is implemented as one or more servers. Our long-term goal includes community-developed and contributed services. Our initial set of services will be backed by the following example servers: Genomic Servers; Protein Family Servers; Phenotype Servers; Polymorphism Servers; Compound and Reaction Data Servers; Metabolic Modeling Servers; Expression Data Servers; Regulatory Models Servers
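The decomposition described on this slide (one unified entry point, many services, each backed by one or more servers) can be illustrated with a minimal Python sketch. It is not the KBase API: every name here (`UnifiedAPI`, `Service`, `genomic_server_a`, the request fields) is hypothetical, servers are modeled as plain functions, and round-robin selection is an assumed policy chosen only to show that a service may have several interchangeable servers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical model: a "server" is just a function from request dict to response dict.
Handler = Callable[[dict], dict]


@dataclass
class Service:
    """One named service, implemented by one or more interchangeable servers."""
    name: str
    servers: List[Handler] = field(default_factory=list)
    next_index: int = 0

    def call(self, request: dict) -> dict:
        if not self.servers:
            raise RuntimeError(f"no servers registered for service {self.name!r}")
        # Assumed policy: rotate through the servers backing this service.
        handler = self.servers[self.next_index % len(self.servers)]
        self.next_index += 1
        return handler(request)


class UnifiedAPI:
    """Single entry point that routes each call to the named service."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def register(self, service_name: str, server: Handler) -> None:
        # Create the service on first use, then attach the server to it.
        service = self._services.setdefault(service_name, Service(service_name))
        service.servers.append(server)

    def services(self) -> List[str]:
        return sorted(self._services)

    def call(self, service_name: str, request: dict) -> dict:
        if service_name not in self._services:
            raise KeyError(f"unknown service: {service_name!r}")
        return self._services[service_name].call(request)


# Hypothetical example servers, loosely named after the slide's server list.
def genomic_server_a(request: dict) -> dict:
    return {"served_by": "genomic-a", "genome_id": request["genome_id"]}


def genomic_server_b(request: dict) -> dict:
    return {"served_by": "genomic-b", "genome_id": request["genome_id"]}


def expression_server(request: dict) -> dict:
    return {"served_by": "expression-1", "gene": request["gene"]}


def build_demo_api() -> UnifiedAPI:
    api = UnifiedAPI()
    api.register("genomic", genomic_server_a)
    api.register("genomic", genomic_server_b)
    api.register("expression", expression_server)
    return api


if __name__ == "__main__":
    api = build_demo_api()
    print(api.services())
    print(api.call("genomic", {"genome_id": "example-genome"}))
    print(api.call("expression", {"gene": "example-gene"}))
```

In this sketch a community-contributed service would be added with another `register` call, without changing how callers reach the existing services.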
Concept: KBase User Experience
Development Schedule: A series of system builds occurring every quarter will enable a graded process. Successive builds will expand community involvement.