Cloud Computing For Bioinformatics
Cloud Computing: what is it?
Cloud Computing is a distributed infrastructure where resources, software, and data are provided in an on-demand fashion. Cloud Computing abstracts infrastructure from application. Cloud Computing should save you time the way software packages save you time.
Cloud Computing
Before:
- Purchase hardware & ensure it's all compatible
- Allocate appropriate resources for hardware (power, cooling, rack space, etc.)
- Set up & configure hardware
- Install baseline software (OS, packages)
- Develop & deploy your application
With the Cloud:
- Request resources
- Develop & deploy your application
Cloud Computing
Advantages:
- Reliability: Decoupling applications from hardware removes hardware-failure concerns
- Scalability: Many cloud services have built-in linear scaling, allowing more resources to be brought online on demand
- Turnaround: Greatly reduced time to procure hardware resources
- Cost: Limited upfront cost compared to a hardware purchase
- Pay as you go: Pay for what you use. Don't pay for servers sitting idle, sucking power & cooling
- Experimentation: Because of the above, the opportunity costs of experimentation are tiny
- Sharing & Collaboration: Share resources such as machine images & data without worry
Cloud Computing
Disadvantages:
- Learning Curve: One must learn how to leverage the cloud & its advantages, not just work the way one is used to
- Data Transfer: Getting data into & out of the cloud happens at internet speed, not local-network speed
- Opacity: The underlying infrastructure is hidden from view
Cloud Computing: Components
Cloud Computing: Components
Glossary:
- AWS: Amazon Web Services
- EC2 / Elastic Compute Cloud: Compute resources in the cloud. Essentially virtual computers with varying CPU & memory resources.
- EBS / Elastic Block Store: Block-level storage for data. EBS volumes are virtual hard drives for EC2 instances.
- S3 / Simple Storage Service: An object store allowing you to save data in the cloud in a highly redundant fashion
- EMR / Elastic MapReduce: Auto-managed MapReduce infrastructure for running highly parallel computation problems against a farm of computers.
- SDB / SimpleDB: Run queries against structured data in real time. A very simple version of:
- RDS / Relational Database Service: Web service that lets you place a relational database in the cloud.
- AWS Import/Export: Load your data onto a device and mail it to Amazon, and let them load your data for you!
There's plenty more, but these are the most important for bioinformatics.
Cloud Computing: Components
OK, here are some others:
- CloudWatch: Monitor AWS cloud resources, such as EC2 instances.
- Elastic Load Balancing: Amazon-hosted load balancers distributing incoming traffic among EC2 nodes.
- SQS / Simple Queue Service: Hosted queue for storing messages as they pass between computers, enabling disparate programs to communicate with each other.
- VPC / Virtual Private Cloud: Fence off AWS services over an IP range via VPN, allowing cloud services to fit in with legacy security protocols.
- CloudFront: Content delivery network (CDN) running on Amazon's collection of edge servers.
- SNS / Simple Notification Service: Set up, operate, and send notifications from the cloud to a variety of destinations such as web pages, email, SMS, etc.
- Amazon Mechanical Turk: As the name implies, you create Human Intelligence Tasks (HITs) that a human can do easily, then pay a modest fee each time a human performs one. Examples include rating quality between items, filling out forms, or solving CAPTCHAs.
Cloud Computing: Components
Let's learn more about those important services.
Cloud Computing: Services
EC2: Virtual computers offered with varying memory / CPU power
How is CPU power measured in a virtual world?
- ECU / EC2 Compute Unit: a measure of computing power on AWS, equivalent to a 1.0 GHz 2007 Xeon processor
4 classes of instances:
- Standard Instances: inexpensive instances used for testing, web services, and many less intensive jobs
- High-Memory Instances: large-RAM instances for high-throughput applications, e.g. databases, caches
- High-CPU Instances: high-ECU instances for compute-intensive applications
- Cluster Compute Instances: increased network performance for HPC applications, e.g. map-reduce
Cloud Computing: Services

Instance Type              ECU    RAM (GB)   Local Storage (GB)
Standard
  Small                    1      1.7        160
  Large                    4      7.5        850
  XL                       8      15         1690
High-Memory
  XL                       6.5    17.1       420
  Double XL                13     34.2       850
  Quadruple XL             26     68.4       1690
High-CPU
  Medium                   5      1.7        350
  XL                       20     7          1690
Cluster Compute
  Quadruple XL             33.5   23         1690
Cloud Computing: Services
Pricing: lots of factors affect pricing
- Prices commensurate with the class of instance used (Standard, High-Memory, etc.)
- Prices adjusted by OS: Linux (cheaper) and Windows (pricier)
- Prices adjusted by instance type:
  - On-Demand Instances: Always available to start. Priciest option. No commitment, no contract.
  - Reserved Instances: Pre-pay upfront for the ability to run an instance at a reduced hourly rate.
  - Spot Instances: eBay-style! Bid a maximum price for compute instances, and procure them when the demand price meets your top bid. You cannot get a price reliably, but you can save money on instances.
- Prices adjusted by availability zone. 4 available:
  - US East (cheapest across the board)
  - US West
  - EU Ireland
  - APAC Singapore (new!)
Estimating costs is hard, even with Amazon-provided calculators, as YMMV.
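The on-demand vs. reserved trade-off above boils down to simple break-even arithmetic. Here's a minimal sketch in Python; all the dollar figures are hypothetical placeholders, not actual AWS rates, so plug in the current prices for your instance class and region.

```python
# Break-even math between On-Demand and Reserved Instances.
# All prices below are HYPOTHETICAL placeholders, not real AWS rates.

ON_DEMAND_RATE = 0.085      # $/hour, on-demand (hypothetical)
RESERVED_UPFRONT = 227.50   # $ one-time fee for a 1-year reservation (hypothetical)
RESERVED_RATE = 0.03        # $/hour while the reserved instance runs (hypothetical)

def on_demand_cost(hours):
    """Total cost if you simply pay as you go."""
    return ON_DEMAND_RATE * hours

def reserved_cost(hours):
    """Total cost: upfront fee plus the discounted hourly rate."""
    return RESERVED_UPFRONT + RESERVED_RATE * hours

def break_even_hours():
    """Hours of usage at which reserving becomes cheaper than on-demand."""
    return RESERVED_UPFRONT / (ON_DEMAND_RATE - RESERVED_RATE)

hours = 24 * 365  # one year of continuous usage
print(f"On-demand for a year: ${on_demand_cost(hours):,.2f}")
print(f"Reserved for a year:  ${reserved_cost(hours):,.2f}")
print(f"Break-even at ~{break_even_hours():.0f} hours")
```

The takeaway: if your instance runs most of the year, reserving wins; for bursty, experimental workloads, on-demand or spot is usually cheaper.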
Cloud Computing: Services
Availability Zone? What's that?
- Amazon data centers are located around the globe. This ensures protection from data-center-wide failure.
- The problem is that many services are independent between zones, making this moot in most cases.
- Proximity to your work environment reduces latency (the time information takes to travel from you to Amazon and back).
- Choose the one closest to you, or the one with the cheapest price, or somewhere in between.
This will trip you up, trust me.
Cloud Computing: Services
EBS:
- Create disks that can be mounted onto your EC2 AMIs
- Disks are also placed in Availability Zones, and priced accordingly
- Can create new volumes based on public data sets
- Can create "snapshots": user-initiated copies of all the data, stored in super-durable Amazon S3
Cloud Computing: Services
S3:
- Stores objects in a "bucket" and allows retrieval based on a unique key (URI)
- Can store objects ranging from 1 byte to 5 GB. Unlimited objects can be stored.
- RESTful interface (Representational State Transfer)
- Extreme durability of data, with an option for cheaper service (but reduced durability)
- Backed by the Amazon S3 SLA (service level agreement)
Unlimited objects and extreme durability? What's the catch?
- Simple object stores are bad when disk I/O operations are needed
- 5 GB may be too small for some data sets
At the end of the day you can save data to S3, but you'll be transferring it to EBS for any operations you're going to do with it.
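The bucket/key model above can be sketched with an in-memory toy. This is not the real S3 API (boto or REST); the class and method names here are purely illustrative, meant only to show why an object store gives you whole-object put/get and nothing like random disk I/O.

```python
# Toy in-memory sketch of S3's bucket/key/object model.
# Names are illustrative only, NOT the real S3 or boto API.

class ObjectStore:
    def __init__(self):
        self.buckets = {}   # bucket name -> {key: bytes}

    def create_bucket(self, name):
        self.buckets.setdefault(name, {})

    def put_object(self, bucket, key, data):
        # Objects are opaque blobs addressed only by (bucket, key).
        # There is no seek or partial write, which is why S3 is a
        # poor fit for workloads needing random disk I/O.
        self.buckets[bucket][key] = bytes(data)

    def get_object(self, bucket, key):
        # Retrieval is whole-object, by exact key.
        return self.buckets[bucket][key]

store = ObjectStore()
store.create_bucket("my-reads")
store.put_object("my-reads", "run1/lane3.fastq", b"@SEQ1\nGATTACA\n")
print(store.get_object("my-reads", "run1/lane3.fastq"))
```

Note that "run1/lane3.fastq" looks like a file path but is just a flat key string; S3 has no real directories, only key prefixes.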
Cloud Computing: Services
EMR: Hosted Hadoop infrastructure for using the MapReduce paradigm in the cloud. Allows... wait, do you know what MapReduce is? No? Then let's back up a moment.
Cloud Computing: MapReduce Super-Quick Primer
MapReduce: Inspired by functional programming, and introduced by Google. A way to process large amounts of data by farming out work to a cluster.
Works by using two functions:
- Mapper: Takes huge input data and chunks it into smaller sub-problems, applying one or more functions to each, resulting in key/value pairs
- Reducer: Takes the key/value pairs and combines them into useful data
This is just a way of thinking about a problem. You need to code everything by hand. (Think of this not as a solution, but as a way to think about creating one.)
Hadoop is software that handles distribution and collection of the data through your Map and Reduce functions, abstracting away the bookkeeping.
If this still seems obtuse, Vince & Daniel have great talks on this. For more information, Google has the answer: check out Google's "MapReduce in a Week" (http://code.google.com/edu/submissions/mapreduce/listing.html)
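The mapper/reducer flow above can be shown with the classic word-count example in plain Python. On Hadoop/EMR the mapper and reducer would run across a cluster and the framework would do the shuffle; here we simulate the shuffle locally in a few lines just to make the data flow concrete.

```python
# Word count, the classic MapReduce example, in plain single-machine Python.
# Hadoop would distribute mapper/reducer and do the shuffle; we fake it here.

from collections import defaultdict

def mapper(chunk):
    """Map: turn one chunk of input into (key, value) pairs."""
    for word in chunk.split():
        yield (word.lower(), 1)

def reducer(key, values):
    """Reduce: combine all values seen for one key into a result."""
    return (key, sum(values))

def map_reduce(chunks):
    # Shuffle: group mapper output by key (Hadoop does this step for you).
    groups = defaultdict(list)
    for chunk in chunks:
        for key, value in mapper(chunk):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

counts = map_reduce(["GATTACA GATTACA", "gattaca TAG"])
print(counts)  # {'gattaca': 3, 'tag': 1}
```

The same shape fits many bioinformatics tasks: map each read to (reference position, 1) and reduce to per-position coverage, for example.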
Cloud Computing: Services
EMR: With that out of the way...
- Hosted Hadoop infrastructure for using the MapReduce paradigm in the cloud
- Allows processing of vast amounts of data
- Built to take advantage of other services such as S3 to process data & store results (respectively)
- Most bioinformatics tools cannot make good use of EMR at this time
Cloud Computing: Services
SDB:
- Non-relational data store (more like Excel than MySQL)
- Think of it as S3 for data instead of files
- Primarily for index & query capabilities
- Comes with a free tier for testing, making this service easy to approach:
  - First 25 machine hours & 1 GB storage / month free
  - After that, pricing is per machine hour used
Syntax:
- Domains: Think of this as your spreadsheet name
- Attributes: These would be the data in a column. Attributes have a name (header) and a value.
- Limit of 10 GB per domain
Comes in two flavors:
- Consistent: Your read reflects the data previously written
- Eventually Consistent: Higher read throughput, but reads are not guaranteed to reflect everything written before them; there is latency between writing and reading updated information.
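The domain/attribute model above maps neatly onto plain dictionaries. Here's a toy sketch of the spreadsheet analogy; the class and method names are made up for illustration and are not the real SimpleDB API, which you'd reach over HTTP or via a library like boto.

```python
# Toy sketch of SimpleDB's domain / item / attribute model.
# Names are illustrative only, NOT the real SDB API.

class Domain:                       # a domain is like one spreadsheet
    def __init__(self, name):
        self.name = name
        self.items = {}             # item name (row) -> {attribute: value}

    def put_attributes(self, item, **attrs):
        # Attributes are name/value pairs, like cells under column headers.
        self.items.setdefault(item, {}).update(attrs)

    def select(self, attribute, value):
        """Return names of items whose attribute equals value."""
        return [name for name, attrs in self.items.items()
                if attrs.get(attribute) == value]

runs = Domain("sequencing_runs")
runs.put_attributes("run1", organism="human", lane="3")
runs.put_attributes("run2", organism="mouse", lane="1")
print(runs.select("organism", "human"))  # ['run1']
```

What this toy cannot show is the consistency trade-off: in real SDB an eventually-consistent read issued right after `put_attributes` might still return the old value.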
Cloud Computing: Services
RDS: Literally a hosted relational database (like MySQL)
- Features reserved & on-demand pricing
- Patches the software and handles backups for a user-defined retention period
- Designed for use with other services (as you can imagine), so an EC2 instance will have low latency to an RDS instance and vice versa
- Can create snapshots (sound familiar?): user-initiated backups with indefinite retention (they last until you delete them)
- Multi-zone deployment: allows replication of data across availability zones for durability
RDS instances come in various sizes which will look familiar to anyone who knows EC2 instance sizes.
Cloud Computing: Components Questions?
Cloud Computing: Components
Oh yeah, here's some free money! (Weren't expecting that, were ya?)