Managing and Conducting Biomedical Research on the Cloud
Prasad Patil
Laboratory for Personalized Medicine, Center for Biomedical Informatics, Harvard Medical School
What is Cloud Computing?
SaaS & PaaS: Gmail, Google Docs, App Engine
SaaS: MS Office online
IaaS: virtualized hardware
Definition: Clouds are a large pool of easily usable and accessible virtualized resources (such as hardware, development platforms, and/or services). These resources can be dynamically reconfigured to adjust to a variable load (scale), which also allows for optimum resource utilization.
Why Cloud Computing?
1. Ability to scale: a job that takes 10 hours on a single server can be done in 1 hour on 10 servers
2. Pay-per-use: only pay for what you use, when you use it; reduces the need to purchase hardware
3. Increased flexibility: variety of server types and operating systems
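The scaling claim in point 1 holds when the job divides evenly across servers with no coordination overhead. A minimal sketch of that arithmetic follows; the hourly rate is a hypothetical value chosen for illustration, not a figure from the slides. Under that assumption, pay-per-use billing makes the faster run cost the same as the slow one.

```python
def ideal_wall_time(single_server_hours: float, servers: int) -> float:
    """Wall-clock hours assuming perfect, overhead-free parallelism."""
    if servers < 1:
        raise ValueError("servers must be at least 1")
    return single_server_hours / servers


def pay_per_use_cost(wall_hours: float, servers: int, hourly_rate: float) -> float:
    """Total charge when every server is billed for each hour it runs."""
    return wall_hours * servers * hourly_rate


HOURLY_RATE = 0.50  # hypothetical price per server-hour, for illustration only

one_server = pay_per_use_cost(ideal_wall_time(10, 1), 1, HOURLY_RATE)
ten_servers = pay_per_use_cost(ideal_wall_time(10, 10), 10, HOURLY_RATE)
print(ideal_wall_time(10, 10), one_server, ten_servers)  # 1.0 5.0 5.0
```

Real jobs rarely scale this cleanly, since data transfer and instance startup add overhead, so the one-hour figure is best read as a lower bound.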
Cloud Computing at Harvard?
Diagram labels: Datacenter, CBMI, Countway, hardware and software
Goal is to go beyond the hype and explore the utility for novice users
Armbrust et al., Above the Clouds (2009)
LPM Question
Can we implement a systematic strategy and best-practice process to efficiently manage cloud computing resources and clouded translational science projects?
Goals:
1. Facilitate global, multi-institutional, multidisciplinary research collaborations
2. Significantly reduce overall administrative and management requirements
3. Establish a repeatable process at other research centers or labs to enhance scientific progress at reduced cost
Challenge for Scientific Research
Objective: Create a low-cost, low-administrative-footprint computational center under typical academic and current technical and resource constraints:
Tremendous project diversity (scientific, complexity, and computational)
Multiple project leads
Multiple project teams at several US and foreign locations
Diversity of experience with the cloud
Varying levels of project team access and resource control
One AWS account
Single administrator (20% effort)
Limited resources: time, AWS services, coordination
Overall focus: the science objective of the project. Do not allow resource management and administration (configuration, software downloads, upgrades, version control, etc.) to distract from or impede the scientific objectives.
Clouded Translational Science Seminar
Participants in the Clouded Translational Science seminar will conduct a series of exercises in biomedical discovery and translational science using cloud computing technology. Participants represent Harvard, Children's Hospital of Boston, Brigham and Women's Hospital, Beth Israel Hospital, Mass General, the Broad Institute, two University of Wisconsin campuses (Madison and Milwaukee), and the Tokyo Medical and Dental University. They will learn about and implement databases, analysis tools, and web application development environments using Amazon cloud computing.
http://lpm.hms.harvard.edu/palaver/
Types of Projects
Network analysis for disease genetics (Roundup)
Translational Variome
Next generation sequence analysis (DNA & RNA)
i2b2 (www.i2b2.org)
Pharmacogenetics using clinical avatars
Cloud computing center
LPM Project Breakdown
Inelastic: Clinical Avatars (project development); i2b2 (AMI development); Clinical Variome
Managed Elastic: Clinical Avatars (web deployment); i2b2 (federated queries); NGS RNA (algorithm testing)
Elastic: RoundUp; Crossbow; NGS DNA (whole genome mapping)
Resource Access Management Option Advantage Disadvantage RightScale sub-accounts AWS Identity Access Management Secure server access Individual control Easy to implement Customizable Elastic Free educational license Free for AWS accounts Control service usage Minimal restrictions Steep learning curve Trust users RightScale hooks No UI: code-intensive Beta SSH Keys/Passwords Easy to implement We control user access Requires more mgmt. Not elastic
Cloud Management Strategy
Projects pass through decision criteria that route them to one of two paths:
AWS/RightScale main account: project deployment, server configuration, SSH key pair (Inelastic)
RightScale sub-account: project deployment, server configuration, RightScale SSH (Managed Elastic, Elastic)
RightScale Cloud Management Inelastic Elastic Managed Elastic www.rightscale.com
RightScale Instance Resource Usage
Best Practices
1. Analyze the details of the project to take full advantage of the cloud: type of OS required, size of data, CPU, memory
2. Create a backup strategy before you launch an instance
3. Only launch an instance when you are ready to start working, and shut down the instance when the work is complete
4. Access an instance using a secure connection
5. Actively monitor your account
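Practice 3 can be enforced in code rather than left to memory. The sketch below is one possible pattern, not the lab's actual tooling: a context manager launches an instance on entry and always terminates it on exit. The `launch` and `terminate` callables are hypothetical stand-ins for whatever cloud API calls a project uses.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List


@contextmanager
def running_instance(launch: Callable[[], str],
                     terminate: Callable[[str], None]) -> Iterator[str]:
    """Launch an instance on entry and always terminate it on exit."""
    instance_id = launch()
    try:
        yield instance_id
    finally:
        # Runs even if the work raises, so no instance is left billing.
        terminate(instance_id)


# Stand-ins for real cloud API calls (hypothetical, for illustration only).
events: List[str] = []


def fake_launch() -> str:
    events.append("launch")
    return "i-example"


def fake_terminate(instance_id: str) -> None:
    events.append(f"terminate {instance_id}")


with running_instance(fake_launch, fake_terminate) as instance_id:
    events.append(f"work on {instance_id}")

print(events)
```

Because termination sits in a `finally` clause, a failed analysis still shuts the instance down, which keeps pay-per-use charges tied to actual work.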
Whole Genome Mapping
Strategy:
MAQ: sequence alignment and assembly; uses short reads from NGS technology; publicly available
Apply to the African genome
Using AWS: take advantage of the elasticity of the cloud
Goal: flexible framework for any application
MAQ developed by Heng Li at Sanger Institute
Bentley et al., Nature, 2008
Where to Start?
Launch AWS Linux instance
Install: Maq, NCBI reference genome, Apache web server
Package into an AMI
Launch identical copies of our instance
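Launching identical copies of a packaged AMI comes down to one launch request with a count. The sketch below only builds that request; the AMI ID, instance type, and key name are hypothetical placeholders, and the slides do not say which client library was used. The parameter names follow the EC2 `RunInstances` API as exposed by boto3.

```python
def identical_copies_request(ami_id: str, copies: int,
                             instance_type: str, key_name: str) -> dict:
    """Build the keyword arguments for launching `copies` instances of one AMI."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        # Equal Min/Max asks for exactly this many instances, or none at all.
        "MinCount": copies,
        "MaxCount": copies,
    }


# Placeholder values, not taken from the slides.
request = identical_copies_request("ami-EXAMPLE", 3, "m1.large", "example-key")
print(request)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("ec2").run_instances(**request)
```

Keeping the request as plain data makes it easy to review or log before any instance starts billing.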
Job Distribution Architecture
Local terminal: create the cluster
Master AMI: copy files and scripts to slaves (Slave 1 AMI, Slave 2 AMI, Slave 3 AMI)
Slaves: copy output files to master
Storage: EBS volume or S3
Web monitoring
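The slide does not say how work is divided among the slaves. One simple possibility, offered here as an assumption rather than the method actually used, is round-robin assignment of read files, which keeps slave workloads within one file of each other. The file and slave names are hypothetical.

```python
from typing import Dict, List


def distribute_jobs(read_files: List[str], slaves: List[str]) -> Dict[str, List[str]]:
    """Assign read files to slaves round-robin so loads differ by at most one."""
    if not slaves:
        raise ValueError("at least one slave is required")
    plan: Dict[str, List[str]] = {slave: [] for slave in slaves}
    for index, read_file in enumerate(read_files):
        plan[slaves[index % len(slaves)]].append(read_file)
    return plan


# Hypothetical file and slave names, for illustration only.
files = [f"reads_{i}.fastq" for i in range(7)]
plan = distribute_jobs(files, ["slave1", "slave2", "slave3"])
for slave, assigned in plan.items():
    print(slave, assigned)
```

The master would then copy each slave's assigned files and scripts to it and collect the outputs when the jobs finish, as in the architecture above.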
Web Monitoring
Costs and Results
African genome NGS data: 370 GB
Compute nodes: 25 (5 x 5-core)
Total cost: $2,600
Cost can be spread over more instances to decrease computation time.
New mapping software significantly reduces the cost (<$100)
Resulting annotated variant file: 22 MB