Mapping Informatics To the Cloud
2012 AIRI Petabyte Challenge
Chris Dagdigian, chris@bioteam.net
I'm Chris. I'm an infrastructure geek. I work for the BioTeam.
The C Word.
When I say cloud, I'm talking IaaS.
Amazon AWS is the IaaS cloud. Most others are fooling themselves. (Has-beens, also-rans & delusional marketing zombies)
A message for the pretenders
No APIs? Not a cloud.
No self-service? Not a cloud.
I have to email a human? Not a cloud.
~50% failure rate when provisioning new servers? Stupid cloud.
Block storage and virtual servers only? (Barely) a cloud.
Private Clouds: My $.02
Private Clouds in 2012:
Hype vs. reality ratio still wacky
Sensible only for certain shops
Have you seen what you have to do to your networks & gear?
There are easier ways
Private Clouds: My Advice for '12
Remain cynical (test vendor claims)
Due diligence still essential
I personally would not deploy/buy anything that does not explicitly provide Amazon API compatibility
Private Clouds: My Advice for '12
Most people are better off:
Adding VM platforms to existing HPC clusters & environments
Extending enterprise VM platforms to allow user self-service & server catalogs
Enough Bloviating. Advice time.
Tip #1
HPC & Clouds: Whole New World
We have spent decades learning to tune research HPC systems for shared access & many users. The cloud upends this model.
Far more common to see:
Dedicated cloud resources spun up for each app or use case
Each system gets individually tuned & optimized
Tip #2
Hybrid Clouds & Cloud Bursting
Lots of aggressive marketing
Lots of carefully constructed case studies and prototypes
The truth? Less usable than you've been told
Possible? Heck yeah. Practical? Only sometimes.
Advice:
Be cynical
Demand proof
Test carefully
Still want to do it? Buy it, don't build it:
Cycle Computing
Univa
Bright Computing
Follow the crowd. In the real world we see:
Separation between local and cloud HPC resources
Send your work to the system most suitable
Tip #3
You can t rewrite EVERYTHING.
Salesfolk will just glibly tell you to rewrite your apps so you can use whatever big data analysis framework they happen to be selling today
They have no clue.
In life science informatics we have hundreds of codes that will never be rewritten. We'll be needing them for years to come.
Advice:
MapReduce-ish methods are the future for big-data informatics
It will take years to get there
We still have to deal with legacy algorithms and codes
You will need:
A process for figuring out when it's worthwhile to rewrite/re-architect
Tested cloud strategies for handling three use cases
You need 3 cloud architectures:
1. Legacy HPC
2. Cloudy HPC
3. Big Data HPC (Hadoop)
Legacy HPC on the cloud
MIT StarCluster: http://web.mit.edu/star/cluster/
This is your baseline
Extend as needed
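For flavor, a minimal StarCluster setup might look like the sketch below. The key name, cluster size and instance type are placeholder values, and a real config also needs an [aws info] credentials section not shown here.

```shell
# Minimal ~/.starcluster/config sketch (placeholder values, partial file)
cat >> ~/.starcluster/config <<'EOF'
[cluster smallcluster]
KEYNAME = mykey
CLUSTER_SIZE = 4
NODE_INSTANCE_TYPE = c1.xlarge
EOF

# Spin the cluster up, then log into the SGE master node
starcluster start smallcluster
starcluster sshmaster smallcluster
```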
Cloudy HPC
Use this method when it makes sense to rewrite or re-architect an HPC workflow to better leverage modern cloud capabilities
Cloudy HPC, continued
Ditch the legacy compute farm model
Leverage elastic scale-out tools (***)
Spot Instances for elastic & cheap compute
SimpleDB for job statekeeping
SQS for job queues & workflow glue
SNS for message passing & monitoring
S3 for input & output data
Etc.
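The queue-driven pattern above can be sketched with Python's stdlib `queue.Queue` as a local stand-in for SQS, and a plain dict standing in for SimpleDB's job statekeeping; the real thing would swap in AWS API calls and write outputs to S3.

```python
import queue

# Local stand-ins: queue.Queue for SQS, a dict for SimpleDB statekeeping.
job_queue = queue.Queue()
job_state = {}

for job_id in range(5):
    job_queue.put(job_id)
    job_state[job_id] = "queued"

def worker():
    """One elastic worker: drain the queue, record completion state.
    A real worker would poll SQS and write results to S3."""
    while True:
        try:
            job_id = job_queue.get_nowait()
        except queue.Empty:
            return  # queue drained: a real worker could now terminate itself
        job_state[job_id] = "done"

worker()
print(job_state)
```

Because state lives in the queue and the statekeeping store rather than on any one node, you can run as many (or as few) copies of `worker` as the backlog warrants.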
Big Data HPC
It's gonna be a MapReduce world
Little need to roll your own
Ecosystem already healthy
Multiple providers today
Often a slam-dunk cloud use case
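To make the MapReduce shape concrete, here is a toy in-process version: the mapper emits (key, value) pairs, the shuffle groups them by key, the reducer collapses each group. Hadoop and the hosted offerings do exactly this, just distributed across many nodes.

```python
from collections import defaultdict

def mapreduce(records, mapper, reducer):
    """Minimal in-process MapReduce: shuffle mapper output by key,
    then reduce each key's values."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count, here over sequence-like text lines
lines = ["gattaca gattaca", "gattaca tagc"]
counts = mapreduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts)  # {'gattaca': 3, 'tagc': 1}
```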
Tip #4
The Cloud was not designed for us
HPC is an edge case for the hyperscale IaaS clouds. We need to deal with this and engineer around it.
Many examples:
Eventual consistency
Networking & subnets
Latency
Node placement
Advice:
Manage expectations
Benchmark & test
Evangelize (pester the cloud sales reps)
Tip #5
Data Movement Is Still Hard
Consistently getting easier
Amazon is not a bottleneck:
AWS Import/Export
AWS Direct Connect
Aspera has some amazing stuff out right now
Advice:
AWS Import/Export works well
Size of pipe is not everything
Sweat the small stuff: tracking, checksums, disk speed
Dedicated workstations
Secure media storage
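"Sweat the checksums" in practice means hashing before the data leaves and again after ingestion, then diffing the two manifests. A minimal streaming hash like the sketch below (chunked so terabyte files never need to fit in RAM) is enough for that; MD5 is used here only as an integrity check, not for security.

```python
import hashlib
import io

def stream_md5(fh, chunk_size=1 << 20):
    """MD5 a file-like object in 1 MB chunks. Hash before shipping
    and after ingestion; a mismatch means silent corruption en route."""
    digest = hashlib.md5()
    for chunk in iter(lambda: fh.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# Demo on an in-memory "file"; real use would open(path, "rb")
sample = stream_md5(io.BytesIO(b"hello"))
print(sample)  # 5d41402abc4b2a76b9719d911017c592
```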
Dedicated data movement station
naked Terabyte-scale data movement
Don t overlook media storage
Advice for 2012
BioTeam is dialing down our advocacy of physical data ingestion into the cloud
Why? Operationally hard, expensive and no longer strictly needed
Real world cross-country internet-based data movement March 2012
700 Mb/sec into Amazon, stress-free & zero tuning March 2012
People trying to move data via physical media quickly realize the operational difficulties
Bandwidth is cheaper than hiring another body to manage physical data ingestion & movement
In 2012 we strongly recommend network-based data movement when at all possible
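As a back-of-envelope check on the bandwidth-vs-body tradeoff, here is a rough transfer-time estimate. The 80% efficiency factor is an assumption standing in for protocol overhead and retransmits, not a measurement.

```python
def transfer_hours(terabytes, megabits_per_sec, efficiency=0.8):
    """Rough wall-clock estimate for a sustained network transfer.
    efficiency discounts protocol overhead (assumed, not measured)."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_sec * 1e6 * efficiency)
    return seconds / 3600.0

# At the ~700 Mb/sec observed above, 10 TB moves in under two days
print(round(transfer_hours(10, 700), 1))  # 39.7
```

Compare that against the days of labor (staging, shipping, tracking, secure media handling) a physical transfer of the same 10 TB tends to consume.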
u r doing it wrong
cool data movement, bro!
Tips #6 & 7
Cloud storage. Still slow.
Big shared storage. Still hard.
Not much we can do except engineer around it AWS compute cluster instances are a huge step forward AWS competitors take note
We are not database nerds
We care about more than just random I/O performance
We need it all:
Random I/O
Long sequential read/write
Faster Storage Options
Software RAID on EBS
Various GlusterFS options
Even if you optimize everything, the virtual NICs are still a bottleneck
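The software-RAID-on-EBS trick is a few commands once the volumes are attached. The device names below (/dev/xvdf through /dev/xvdi) are placeholders; striping (RAID-0) aggregates per-volume throughput, but as noted above the virtual NIC stays the ceiling.

```shell
# Stripe four attached EBS volumes into one md device (names assumed)
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi

# Filesystem and mount; XFS handles large sequential I/O well
mkfs.xfs /dev/md0
mkdir -p /data && mount /dev/md0 /data
```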
Big Shared Storage
10GbE nodes and NFS
Software RAID sets
GlusterFS or similar
2012: pNFS finally?
Tip #8
Things fail differently in the cloud.
Stuff breaks
It breaks in weird ways
Transient/temporary issues more common than what we see at home
Advice:
Pessimism is good
Design for failure
Think hard about:
How will you detect?
How will you respond?
Advice:
Remove humans from the loop
Automate recovery
Automate your backups
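"Remove humans from the loop" usually starts with retry logic. A common sketch: exponential backoff with jitter around any flaky call, so transient cloud glitches heal themselves instead of paging someone.

```python
import random
import time

def with_retries(operation, attempts=5, base_delay=1.0):
    """Retry a flaky call with exponential backoff plus jitter.
    Transient cloud failures are normal; recover without a human."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)

# Demo: a call that throws transient errors twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient glitch")
    return "ok"

result = with_retries(flaky, base_delay=0.0)  # no real sleeping in the demo
print(result)  # ok
```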
Tip #9
Serial/batch computing at-scale
Loosely coupled workflows are ideal
Break the pipeline into discrete components
Components should be able to scale up/down independently
Component = opportunity to:
Make a scaling decision (# of nodes in use)
Make a sizing decision (instance type in use)
Nirvana is
independent loosely connected components that can self-scale and communicate asynchronously
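One way each component makes its own scaling decision: size its worker pool off its own queue depth so the backlog drains within a target window. The throughput numbers below are purely illustrative.

```python
def workers_needed(queue_depth, jobs_per_worker_hour, target_hours,
                   max_workers=64):
    """Per-component scaling decision: enough workers to drain the
    backlog within target_hours, capped at max_workers."""
    if queue_depth == 0:
        return 0  # idle component scales to zero
    jobs_per_worker = int(jobs_per_worker_hour * target_hours)
    needed = -(-queue_depth // jobs_per_worker)  # ceiling division
    return min(needed, max_workers)

# Each stage scales independently off its own queue depth
print(workers_needed(1000, jobs_per_worker_hour=10, target_hours=2))  # 50
```

Run this periodically per component and you get the "self-scaling, asynchronous" nirvana above: alignment stage at 50 nodes while the downstream stats stage idles at zero.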
Advice: Many people already doing this Best practices are well known Steal from the best: RightScale, Opscode & Cycle Computing
Phew. Think I'm done now.
Questions? Slides available at http://slideshare.net/chrisdag/
End;
Backup Slides
Private Clouds: Pick Your Poison
OpenStack - http://openstack.org
Pro: Super smart developers; significant mindshare; true open source
Con: Commitment to AWS API compatibility (?) & stability
Private Clouds: Pick Your Poison
CloudStack - http://cloudstack.org
Pro: Explicit AWS API support; very recent move away from open-core model; usability
Con: Developer mindshare? Sudden switch to Apache
Private Clouds: Pick Your Poison
Eucalyptus - http://eucalyptus.com
Pro: Direct AWS API compatibility; lots of hypervisor support
Con: Open-core model; mindshare; recent resurrection