Cloud Computing for Scientific Research




The NIH Nephele Project for Microbiome Analysis
On behalf of:
Yentram Huyen, Ph.D., Chief
Nick Weber, Scientific Computing Project Manager
Bioinformatics and Computational Biosciences Branch (BCBB)
Office of Cyber Infrastructure and Computational Biology (OCICB)
National Institute of Allergy and Infectious Diseases (NIAID)
National Institutes of Health (NIH)

BLUF (Bottom Line, Up Front): NIH is leveraging the Amazon Web Services (AWS) cloud to help solve scientific research problems and overcome common barriers that researchers face when conducting high-throughput analyses.

What is the Microbiome?
Image: Scanning electron micrograph of a clump of Staphylococcus epidermidis bacteria (green) in the extracellular matrix, which connects cells and tissue. Credit: NIAID

Microbiome Data and Its Uses
Data & Analyses
  Determine what microbes exist in a given body site and what their functions are
  Compare new or unknown microbial sequences to known microbial sequences to assign sample microbes to known taxonomic families (16S rRNA)
  Assemble whole-metagenome shotgun (WGS) sequence data to provide the full-length sequence of a given microbe (gives an overall picture of the microbe within the system to help deduce function)
Common Research Questions
  Diseased versus healthy
  Diet A versus Diet B
  Comparison over time or at different life stages (e.g., pregnancy)
  Comparison of individuals living in different regions of the world (e.g., climate differences, evolutionary differences)
  Core microbiome in healthy people that can be augmented in sick people (e.g., via diet, fecal transplants, etc.)
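To make the idea of taxonomic assignment concrete, here is a minimal sketch that assigns a query sequence to whichever reference sequence shares the most k-mers with it. This is a toy illustration, not the method used by the project: the function names, the reference fragments, and the choice of Jaccard similarity over 4-mers are all assumptions made for the example, and production pipelines rely on curated reference databases and dedicated classifiers.

```python
def kmers(seq, k=4):
    """Return the set of all length-k substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_taxon(query, references, k=4):
    """Return (taxon, score) for the reference most similar to the query."""
    query_kmers = kmers(query, k)
    scores = {
        taxon: jaccard(query_kmers, kmers(ref_seq, k))
        for taxon, ref_seq in references.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up reference fragments, for illustration only (not real 16S sequences).
REFERENCES = {
    "taxon_A": "ACGTACGTTAGCCGATAGGCTA",
    "taxon_B": "TTGACCGGTTAACCGGTTGACA",
}

# A query that differs from the taxon_A fragment in its final base.
taxon, score = assign_taxon("ACGTACGTTAGCCGATAGGCTT", REFERENCES)
print(taxon, round(score, 3))
```

The same compare-against-known-references pattern is what makes co-locating reference data and compute attractive: the reference set is large and shared, while each query is small.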

NIH-Amazon Microbiome Cloud Pilot
Seizing the opportunity offered by the cloud to democratize access to scientific computing
Data + Tools + Collaboration
https://commonfund.nih.gov/hmp/

Data Analysis Challenges
Data are stored in multiple locations (NCBI, HMP DACC, etc.)
Some tools are stored at the HMP DACC
But most analyses cannot be performed on NCBI or DACC compute clusters:
  Users must download data and analyze them on local systems
  Large data downloads can take time
  Keeping in sync with data and tool updates can be difficult
Many users don't have the compute infrastructure, the knowledge, or both to store data and perform analyses
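A rough back-of-the-envelope calculation shows why large downloads are a practical barrier. The sketch below estimates transfer time from data size and link speed; the function name, the 80% link-efficiency default, and the example figures are assumptions chosen for illustration, not measurements from the project.

```python
def transfer_hours(size_gb, bandwidth_mbps, efficiency=0.8):
    """Estimate hours needed to move size_gb gigabytes over a link.

    bandwidth_mbps is the nominal link speed in megabits per second;
    efficiency is the assumed fraction of that speed actually achieved.
    """
    if size_gb < 0 or bandwidth_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, bandwidth > 0, 0 < efficiency <= 1")
    size_megabits = size_gb * 8 * 1000  # 1 GB = 1000 MB = 8000 megabits
    seconds = size_megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


# Hypothetical example: a 1 TB data set over a 100 Mbit/s connection.
hours = transfer_hours(1000, 100)
print(f"about {hours:.1f} hours")
```

Every tool or data update that forces a fresh download repeats this cost, which is the motivation for running analyses where the data already live.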

How Can the Cloud Help?
Co-locating microbiome data and tools in a single, cloud-based compute environment:
  Reduces the need to continually transfer large data sets
  Promotes data analysis directly on the cloud compute servers
  Facilitates sharing of data, methods, analysis tools, and results
  Benefits researchers without access to sufficient compute power
  Serves as a potential model for other high-throughput data projects

Primary Obstacles (Opportunities)
Our team started with limited knowledge of the Amazon cloud environment
The cloud is typically seen as the realm of the bioinformatics-savvy or specialized biology user
  How can we simplify and abstract the nuances of the cloud for the non-specialist?
Many components of a useful resource were lacking:
  User interfaces to data and tools
  Documentation
  Glue to connect the pieces

The Pilot Phase
A Proof-of-Concept for Computation and Collaboration in the AWS Cloud Environment
Data + Tools + Collaboration

Microbiome Data on AWS
Public dataset of Human Microbiome Project data accessible for this project
https://aws.amazon.com/datasets/1903160021374413
Credit: NIH Medical Arts and Printing

We've Built Tools to Visualize Those Data: Nephele Data Explorer

And Capabilities to Run Analysis Pipelines: Nephele Analysis Engine

Technical Architecture
Presentation Layer: the web front-end. Future plans include making system calls through web services in addition to the UI.
Analysis Pipeline Templates and Scripts: JSON or shell/Python files that define pipelines, parameters, and analyses.
Input/Output Handling Layer: acts as a proxy for AWS services, interpreting user requests and forwarding them to AWS.
Elastic Beanstalk: facilitates AMI and instance type selection based on logic specified in the input/output handling layer, as well as communication with other AWS services (SQS, EC2, S3).
Analysis Environment: performs the computational analysis.
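The interplay between pipeline templates and the input/output handling layer can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the template schema, the field names, the tier labels, the memory rule of thumb, and the s3://example-bucket URI are all hypothetical. The sketch loads a JSON template, picks an instance tier from simple selection logic, and builds the message a handling layer might place on a queue for the analysis environment. It deliberately makes no AWS calls, so it runs with the standard library alone.

```python
import json

# Hypothetical pipeline template; the schema is an assumption for illustration.
TEMPLATE_JSON = """
{
  "name": "example_16s_pipeline",
  "script": "run_16s.sh",
  "parameters": {"min_read_length": 200, "threads": 4},
  "resources": {"min_memory_gb": 8}
}
"""

# Hypothetical tiers as (memory in GB, label), smallest first.
INSTANCE_TIERS = [(4, "tier-small"), (16, "tier-medium"), (64, "tier-large")]


def load_template(text):
    """Parse a JSON template and check that the required fields are present."""
    template = json.loads(text)
    missing = {"name", "script", "parameters"} - template.keys()
    if missing:
        raise ValueError(f"template is missing fields: {sorted(missing)}")
    return template


def choose_instance_tier(template, input_size_gb):
    """Pick the smallest tier whose memory covers the estimated need."""
    declared = template.get("resources", {}).get("min_memory_gb", 0)
    # Assumed rule of thumb: allow twice the input size in memory.
    needed = max(declared, 2 * input_size_gb)
    for memory_gb, tier in INSTANCE_TIERS:
        if needed <= memory_gb:
            return tier
    raise ValueError(f"no tier offers {needed} GB of memory")


def build_job_message(template, input_uri, input_size_gb, overrides=None):
    """Merge user overrides into the template defaults and serialize the job."""
    parameters = {**template["parameters"], **(overrides or {})}
    job = {
        "pipeline": template["name"],
        "script": template["script"],
        "parameters": parameters,
        "input_uri": input_uri,
        "instance_tier": choose_instance_tier(template, input_size_gb),
    }
    return json.dumps(job, sort_keys=True)


template = load_template(TEMPLATE_JSON)
message = build_job_message(
    template, "s3://example-bucket/sample.fastq", 1.0, {"threads": 8}
)
print(message)
```

Keeping the selection logic in the handling layer, as the slide describes, means a new pipeline only needs a new template and script; the front-end and the queueing code stay unchanged.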

Collaboration: Participant Affiliations

Next Steps Approach (3-6 Months)
Campaign: Share what the project is and what we've been working on
Get Feedback: Understand where primary user needs are, both for our collaborators and the broader community
Update / Test: Integrate new features into our framework (visualizations, new pipelines), and test iteratively

UI and UX Updates

Next Steps Approach (3-6 Months)
Campaign: Share what the project is and what we've been working on
Get Feedback: Understand where primary user needs are, both for our collaborators and the broader community
Update / Test: Integrate new features into our framework (visualizations, new pipelines), and test iteratively
Socialize: Share outcomes, lessons learned, and potential policy considerations

Key Considerations Moving Forward
Questions we're wrestling with:
How to establish a robust security model and manage risk (especially with recent "black eyes" with government websites)?
How to appropriately deal with sensitive/clinical data sets?
What standards to adopt so we don't have to reinvent the wheel if needs change or if we change providers?
How can this type of approach facilitate research reproducibility?
How can we build a scalable cost model for this and similar resources?
  Coordination with grant-funders
  B.Y.O. credit card
  Hybrid
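One way to reason about the cost-model question is to estimate the cost of a single analysis run and then decide who pays for it. The sketch below is illustrative only: the rates are made-up placeholders rather than actual AWS prices, and reading "Hybrid" as a project subsidy up to a cap with the user covering the remainder is an assumption, not something the slides specify.

```python
def run_cost(compute_hours, hourly_rate, storage_gb, storage_rate_gb_month,
             months=1.0):
    """Cost of one run: compute time plus storage held for a number of months."""
    compute = compute_hours * hourly_rate
    storage = storage_gb * storage_rate_gb_month * months
    return compute + storage


def split_hybrid(total_cost, subsidy_cap):
    """Split a cost into (project share, user share) under an assumed cap."""
    project_share = min(total_cost, subsidy_cap)
    return project_share, total_cost - project_share


# Placeholder figures: 10 compute hours at 0.50 per hour, 100 GB at 0.03 per GB-month.
total = run_cost(10, 0.50, 100, 0.03)
project, user = split_hybrid(total, subsidy_cap=5.0)
print(f"total={total:.2f} project={project:.2f} user={user:.2f}")
```

Setting the cap to zero gives the "B.Y.O. credit card" case, and setting it above any expected run cost gives the fully funder-covered case, so the same two functions span all three options on the slide.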

So What?
Image: Neutrophil and methicillin-resistant Staphylococcus aureus (MRSA) bacteria. Scanning electron micrograph of a neutrophil ingesting methicillin-resistant Staphylococcus aureus bacteria. Credit: NIAID
Image credit: NIST.gov
Image: Methicillin-resistant Staphylococcus aureus bacteria. MRSA (yellow) and a dead human neutrophil. Credit: NIAID

Potential Broader Administrative Impact
Improved Efficiency for Grantees, IT Administrators, and the NIH
Establish a new culture between IT and research domains, and reskill people appropriately
Facilitate methods for setting up customized infrastructure quickly when scientists request it
Inform policies across NIH/HHS on when and how to best leverage the cloud, which types of clouds to use for what (e.g., public, private, community), etc.
Streamline processes for working with outside providers, procuring services, establishing SLAs, and establishing security documentation

Potential Broader Scientific Impact
Promoting Open Source, Open Access, and a Data Commons
Help to break down research silos and facilitate collaboration and reuse
Level the playing field for the have-nots
Increase speed-to-research
Enhance scalability and extensibility through loosely coupled, virtualized applications and services
Potential integration with the Associate Director for Data Science's (ADDS's) Data Commons platform

Questions?
Image: Scanning electron micrograph of Mycobacterium tuberculosis bacteria, which cause TB. Credit: NIAID

:: Shameless Plug :: 3dprint.nih.gov


Project Background
Collaborative proof-of-concept project with AWS and multiple research institutions studying the role of the microbiome in human health and disease
How can cloud computing simplify the analysis of microbiome sequence data?
  By bringing together data sets and analysis tools on the cloud, the global microbiome research community can benefit from a more consistent, centralized, and collaborative environment that is fine-tuned for performing data analysis at low cost
Democratization of research
  Benefits researchers without access to sufficient compute power or storage capacity
  Reduces the need to continually transfer large data sets
  Facilitates sharing of data, methods, analysis tools, and results
  Serves as a potential model for other projects
https://commonfund.nih.gov/hmp/