Cloud Computing for Scientific Research The NIH Nephele Project for Microbiome Analysis On behalf of: Yentram Huyen, Ph.D., Chief Nick Weber, Scientific Computing Project Manager Bioinformatics and Computational Biosciences Branch (BCBB) Office of Cyber Infrastructure and Computational Biology (OCICB) National Institute of Allergy and Infectious Diseases (NIAID) National Institutes of Health (NIH)
BLUF (Bottom Line, Up Front): NIH is leveraging the Amazon Web Services (AWS) cloud to help solve scientific research problems and overcome common barriers that researchers face when conducting high-throughput analyses.
What is the Microbiome? Scanning electron micrograph of a clump of Staphylococcus epidermidis bacteria (green) in the extracellular matrix, which connects cells and tissue. Credit: NIAID
Microbiome Data and its Uses
Data & Analyses
  Determine what microbes exist in a given body site and what their functions are
  Compare new/unknown microbial sequences to known microbial sequences to assign sample microbes to known microbial taxonomic families (16S rRNA)
  Assemble whole metagenome shotgun (WGS) sequence data to provide the full-length sequence of a given microbe (gives an overall picture of the microbe within the system to help deduce function)
Common Research Questions
  Diseased versus healthy
  Diet A versus Diet B
  Comparison over time or at different life stages (e.g., pregnancy)
  Comparison of individuals living in different regions of the world (e.g., climate differences, evolutionary differences)
  Core microbiome in healthy people that can be augmented in sick people (e.g., via diet, fecal transplants, etc.)
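The idea of comparing an unknown sequence to known sequences can be illustrated with a toy sketch. This is not the project's method: it swaps in a simple k-mer overlap (Jaccard similarity) as a stand-in for the dedicated tools and curated reference databases that real 16S analyses rely on. The function names, taxon labels, and sequences below are all made up for illustration.

```python
def kmers(sequence, k=4):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_taxon(query, references, k=4):
    """Return (label, score) for the reference most similar to the query."""
    query_kmers = kmers(query, k)
    scores = {
        label: jaccard(query_kmers, kmers(seq, k))
        for label, seq in references.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up toy sequences and labels; not real 16S rRNA data.
REFERENCES = {
    "taxon_A": "ACGTACGTTAGCTAGCTAGGATCC",
    "taxon_B": "TTGACCGGTTAACCGGTTGACCAA",
}

# The query differs from taxon_A only in its final base.
label, score = assign_taxon("ACGTACGTTAGCTAGCTAGGATCA", REFERENCES)
print(label, round(score, 2))
```

The sketch picks the single best-scoring reference. A real workflow would also need a minimum-similarity threshold so that sequences with no close match are left unassigned.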
NIH-Amazon Microbiome Cloud Pilot
Seizing the opportunity for scientific computing in the cloud to democratize access to scientific computing
DATA + TOOLS + COLLABORATION
https://commonfund.nih.gov/hmp/
Data Analysis Challenges
  Data are stored in multiple locations (NCBI, HMP DACC, etc.)
  Some tools are stored at the HMP DACC
BUT:
  Most analyses cannot be performed on NCBI or DACC compute clusters
  Users must download data and analyze on local systems
  Large data downloads can take time
  Keeping in sync with data and tool updates can be difficult
  Many users don't have the compute infrastructure, knowledge, or both to store data and perform analyses
How can the Cloud Help?
Co-locating microbiome data and tools in a single, cloud-based compute environment:
  Reduces the need to continually transfer large data sets
  Promotes data analysis directly on the cloud compute servers
  Facilitates sharing of data, methods, analysis tools, and results
  Benefits researchers without access to sufficient compute power
  Serves as a potential model for other high-throughput data projects
Primary Obstacles (Opportunities)
  Our team started with limited knowledge of the Amazon cloud environment
  Cloud is typically seen as the realm of the bioinformatics-savvy or specialized biology user
  How do we simplify and abstract the nuances of the cloud for non-specialists?
  Many components of a useful resource were lacking: user interfaces to data and tools, documentation, and glue to connect the pieces
The Pilot Phase
A Proof-of-Concept for Computation and Collaboration in the AWS Cloud Environment
DATA + TOOLS + COLLABORATION
Microbiome Data on AWS
Public dataset of Human Microbiome Project data accessible for this project
https://aws.amazon.com/datasets/1903160021374413
Credit: NIH Medical Arts and Printing
We've Built Tools to Visualize those Data: Nephele Data Explorer
And Capabilities to Run Analysis Pipelines: Nephele Analysis Engine
Technical Architecture
  Presentation Layer: the web front end. Future plans include making system calls through web services in addition to the UI.
  Analysis Pipeline Templates and Scripts: JSON or shell/Python files that define pipelines, parameters, and analyses.
  Input/Output Handling Layer: acts as a proxy for AWS services, interpreting user requests and forwarding them to AWS.
  Elastic Beanstalk: facilitates AMI and instance type selection based on logic specified in the input/output handling layer, as well as communication with other AWS services (SQS, EC2, S3).
  Analysis Environment: performs the computational analysis.
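The layering above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Nephele implementation: the template fields, the instance catalogue, the bucket path, and the function names are all hypothetical, and no AWS service is actually called. It shows a JSON pipeline template being read by an input/output handling step that merges user parameters, picks a compute size from the template's stated requirements, and produces a job description of the kind a queue such as SQS could carry to the analysis environment.

```python
import json

# Hypothetical pipeline template. Field names are illustrative assumptions,
# not the project's actual schema.
TEMPLATE_JSON = """
{
  "name": "example_16s_pipeline",
  "parameters": {"min_read_length": 200},
  "resources": {"min_memory_gb": 16},
  "steps": ["quality_filter", "assign_taxonomy", "summarize"]
}
"""

# Hypothetical catalogue of compute options as (name, memory in GB),
# ordered from smallest to largest.
INSTANCE_OPTIONS = [("small", 4), ("medium", 16), ("large", 64)]


def choose_instance(min_memory_gb):
    """Return the smallest option that satisfies the memory requirement."""
    for name, memory_gb in INSTANCE_OPTIONS:
        if memory_gb >= min_memory_gb:
            return name
    raise ValueError(f"no instance option offers {min_memory_gb} GB")


def build_job_message(template_text, input_location, overrides=None):
    """Merge user overrides into the template defaults and describe the job."""
    template = json.loads(template_text)
    parameters = dict(template["parameters"])
    parameters.update(overrides or {})
    return {
        "pipeline": template["name"],
        "steps": template["steps"],
        "parameters": parameters,
        "instance": choose_instance(template["resources"]["min_memory_gb"]),
        "input": input_location,
    }


# Hypothetical input location; the bucket name is a placeholder.
job = build_job_message(
    TEMPLATE_JSON,
    "s3://example-bucket/sample.fastq",
    overrides={"min_read_length": 250},
)
print(json.dumps(job, indent=2))
```

Keeping the pipeline definition in a data file and the selection logic in one handling step means a new pipeline can be added by writing a template rather than changing the front end.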
Collaboration: Participant Affiliations
Next Steps Approach (3-6 Months)
  Campaign: Share what the project is and what we've been working on
  Get Feedback: Understand where primary user needs are, both for our collaborators and the broader community
  Update / Test: Integrate new features into our framework (visualizations, new pipelines), and test iteratively
UI and UX Updates
Next Steps Approach (3-6 Months)
  Campaign: Share what the project is and what we've been working on
  Get Feedback: Understand where primary user needs are, both for our collaborators and the broader community
  Update / Test: Integrate new features into our framework (visualizations, new pipelines), and test iteratively
  Socialize: Share outcomes, lessons learned, and potential policy considerations
Key Considerations Moving Forward
Questions we're wrestling with:
  How to establish a robust security model and manage risk (especially given recent "black eyes" with government websites)?
  How to appropriately deal with sensitive/clinical data sets?
  What standards to adopt so we don't have to reinvent the wheel if needs change or if we change providers?
  How can this type of approach facilitate research reproducibility?
  How can we build a scalable cost model for this and similar resources?
    Coordination with grant-funders
    B.Y.O. credit card
    Hybrid
So What?
Neutrophil and Methicillin-resistant Staphylococcus aureus (MRSA) Bacteria: Scanning electron micrograph of neutrophil ingesting methicillin-resistant Staphylococcus aureus bacteria. Credit: NIAID
Methicillin-resistant Staphylococcus aureus Bacteria: MRSA (yellow) and a dead human neutrophil. Credit: NIAID
Credit: NIST.gov
Potential Broader Administrative Impact
Improved Efficiency for Grantees, IT administrators, and the NIH
  Establish a new culture between IT and research domains, and reskill people appropriately
  Facilitate methods for setting up customized infrastructure quickly when scientists request it
  Inform policies across NIH/HHS for when/how to best leverage cloud, which type(s) of clouds to use for what (e.g., public, private, community), etc.
  Streamline processes for working with outside providers, procuring services, establishing SLAs, establishing security documentation
Potential Broader Scientific Impact
Promoting Open Source, Open Access, and a Data Commons
  Help to break down research silos and facilitate collaboration and reuse
  Level the playing field for the have-nots
  Increase speed-to-research
  Enhance scalability and extensibility through loosely coupled, virtualized applications and services
  Potential integration with the Associate Director for Data Science's (ADDS's) Data Commons platform
Questions? Scanning electron micrograph of Mycobacterium tuberculosis bacteria, which cause TB. Credit: NIAID
:: Shameless Plug :: 3dprint.nih.gov
Project Background
Collaborative proof-of-concept project with AWS and multiple research institutions studying the role of the microbiome in human health and disease
How can cloud computing simplify the analysis of microbiome sequence data?
By bringing together data sets and analysis tools on the cloud, the global microbiome research community can benefit from a more consistent, centralized, and collaborative environment that is fine-tuned for performing data analysis at low cost
Democratization of research
  Benefits researchers without access to sufficient compute power or storage capacity
  Reduces the need to continually transfer large data sets
  Facilitates sharing of data, methods, analysis tools, and results
  Serves as a potential model for other projects
https://commonfund.nih.gov/hmp/