Cloud Computing
Lecture 20: Cloud Platform Comparison & Load (2011-2012)

Up until now:
- Introduction, definition of Cloud Computing
- Pre-Cloud large-scale computing: Grid Computing, Content Distribution Networks, Cycle Sharing, Distributed Scheduling
- Cloud: MapReduce, Storage, Execution, Monitoring, Programming
Outline:
- Comparison of cloud platforms:
  - Google / Google App Engine / Hadoop
  - Amazon Web Services / Eucalyptus
  - Microsoft Azure
- Load (flash crowds)
Three visions for Cloud Computing: who will win?

              | Amazon Web Services    | Microsoft Azure                       | Google App Engine
Computation   | x86                    | CLR (VM)                              | Application framework (Python, Java)
Storage       | Disk blocks            | SQL Server API                        | BigTable
Network       | Blocks of IP addresses | Declarative but automatic (endpoints) | 3-level application topology

Each is presented as the ideal model, but in practice the overlap is much larger!

Comparison: Storage

               | AWS / Eucalyptus           | Microsoft Azure | Google / Hadoop
SQL            | RDS                        | SQL Azure       | X (none)
Tables         | SimpleDB                   | Tables          | Datastore [BigTable] / HBase
Objects/Blocks | S3                         | Blobs           | GFS / HDFS
Queues         | Simple Queue Service (SQS) | Queues          | Task Queue
Comparison: Storage
There are two general complaints:
- Performance (latency).
- Strict coherency models do not scale.
The bottom line is that the storage scalability problem is not solved. There are no reliable metrics available; the market is still too dynamic. Google services are not accessible remotely, but it is always possible to build an intermediary bridge service.

Comparison: Programming Model
Programming languages:
- Amazon: language not relevant; the program is a VM.
- Google: Java and Python.
- Azure: any .NET language: C#, J#, VB.NET, etc.
Google (servlet/JSP) has the most restrictive model. It is the simplest choice and will tend to be the first one used until its limitations are found.
Comparison: Remote Interaction Model
There are few differences. All systems are based on Web Services, and most services support both REST and SOAP protocols. In most cases, applications/machines/services/stores have their own DNS names. Stored objects are identified by typeless strings.

Comparison: Integration
- The Amazon VM model permits normal interactions between servers.
- Google requires that other servers be accessible via Web Services.
- Azure supports richer integration mechanisms with external servers: AppFabric, Access Control and Queues. DryadLINQ transparently integrates local and remote applications.
Comparison: Price

Resource             | Unit          | Amazon                         | Google | Microsoft
Bandwidth (outgoing) | GB            | $0.03 - $0.085                 | $0.12  | $0.15
Bandwidth (incoming) | GB            | $0.10                          | $0.10  | $0.10
Computation          | Instance-hour | $0.10 - $1.20                  | $0.10  | $0.12
Storage              | GB per month  | $0.05 (>5 PB) to $0.14 (<1 TB) | $0.15  | $0.15
Storage calls        | Per 10k calls | $0.01 (GET), $0.10 (others)    | n/a    | $0.01

Prices are very similar. AWS, because it uses system VMs, has a larger granularity.

Platform/application match by scenario:

1. Application ported to the cloud (monolithic application in Java or .NET):
   - Amazon: normal EC2 instance; system configuration needed.
   - Google: may require porting and requires data and logic refactoring.
   - Microsoft: if .NET, refactor data; otherwise more complex.
2. Web application (web app with load balancer, logic layer and database):
   - Amazon: normal EC2 instance + RDS; requires system configuration and AutoScale; if RDS does not scale, requires a port to S3.
   - Google: very good match with Google App Engine; automatic scalability; requires a DB rewrite; no support for larger-scale applications.
   - Microsoft: well adapted to the Web Role model.
3. Parallel processing (long-lasting computations without a GUI):
   - Amazon: many pre-built instances with infrastructure, e.g. MPI; MapReduce instances may be used.
   - Google: no direct support.
   - Microsoft: worker roles + blobs and queues provide adequate support.
4. Mixed application (cloud application integrated with external servers):
   - Amazon: an EC2 instance may access external servers.
   - Google: some integration possible using a bridge app to the Datastore.
   - Microsoft: AppFabric ServiceBus supports integration with external applications.
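As a worked example of reading the price table, the sketch below estimates a monthly bill for a small deployment on each provider. All rates are the lecture's 2011-era figures, not current pricing, and the workload numbers are invented for illustration:

```python
# Illustrative cost estimate using the (2011-era) prices from the table above.
# Rates are the lecture's listed figures, not current pricing.

PRICES = {
    "amazon":    {"compute_hour": 0.10, "storage_gb_month": 0.14, "egress_gb": 0.085},
    "google":    {"compute_hour": 0.10, "storage_gb_month": 0.15, "egress_gb": 0.12},
    "microsoft": {"compute_hour": 0.12, "storage_gb_month": 0.15, "egress_gb": 0.15},
}

def monthly_cost(provider, instances, storage_gb, egress_gb, hours=730):
    """Rough monthly bill: compute + storage + outgoing bandwidth."""
    p = PRICES[provider]
    return (instances * hours * p["compute_hour"]
            + storage_gb * p["storage_gb_month"]
            + egress_gb * p["egress_gb"])

# Hypothetical workload: 2 instances, 100 GB stored, 50 GB out per month.
for name in PRICES:
    print(f"{name}: ${monthly_cost(name, instances=2, storage_gb=100, egress_gb=50):.2f}")
```

As the slide notes, the totals come out very close; the differences are dominated by the compute rate and bandwidth, which is why granularity (whole VMs on AWS) matters more than the headline prices.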
Hurdles to Cloud Computing on the 3 Main Platforms
1. Availability: depends on the SLA and the provider's track record.
2. Lock-in: stronger with Google App Engine, then Azure; weaker with AWS.
3. Confidentiality and auditing: in general confidentiality is guaranteed, but no open auditing is available. Regarding applications, EC2 provides higher isolation.
4. Data transfer costs: similar prices. AWS now has bulk transfer services (you can send them your disks). The cost/benefit is application dependent and must be analyzed.
5. Reliable performance: for general applications the situation is similar: there are recovery and retry mechanisms for most services. In the case of MapReduce there is a skipping mode to recover tasks.
6. Scalable storage.
7. Large-scale software errors.
8. Speed of scale-up: clearer feedback with EC2 instances.
9. Reputation propagation: similar situation on all 3 major platforms; not solved. Less relevant for Google App Engine.
10. Compatible licensing: only relevant at AWS (solved!).
Conclusions
The main difference between the main providers is the application model: Google has the most restrictive one. The cost of an easy-to-program system is more lock-in rather than lack of functionality. I can do whatever I want on EC2, but a scalable application will require distributed scalable services.

Scalability: what is the best approach for cloud computing clients?
"Handling Flash Crowds from your Garage", USENIX '08
Flash Crowds!
We have seen several examples of scalability in a cloud platform. What about the clients? What if we have a server running an application and need to scale quickly? How do I adapt the front-ends?
Three main requirements:
- The system must scale to a very large size.
- The system must scale quickly.
- Off-peak operation must be cheap.

Available Tools (i)
Data storage services:
- Pros: they are cheap and they scale transparently for the user.
- Cons: they only solve the problem of static content.
Virtual servers:
- Before the cloud it was already possible to rent virtual servers at ISPs (even at different geographical locations).
- Cons: this only solves the bandwidth problem. Mostly, the computation of the distributed application doesn't really scale.
Available Tools (ii)
- Cloud computing services.
- External DNS services: prevent the service from facing a bottleneck on DNS requests.
- MISSING: a scalable relational database service. As we have seen, it's not trivial to scale a classical relational database. There are many similar services, but they always sacrifice some aspect: the transactional model, features of the query language, or scalability.

Scalable Architectures (i)
What is the best approach to matching a large set of clients with a multi-server service?
Hyp. 1: Use only a storage service. Good for servers with a large percentage of static content.
Scalable Architectures (ii)
Hyp. 2: Cluster with DNS load balancing. Rent several machines (e.g. on EC2) and add them to the DNS record. By default, addresses are used in round-robin fashion. This causes delays for clients who cached the DNS record, but in general the issue is a large number of clients, not a large number of requests from the same client. There are commercial implementations (e.g. RightScale).

Scalable Architectures (iii)
Hyp. 3: Redirection. A server redirects the initial client request to one of a set of back-end servers; subsequent requests don't go through the redirection. Example: Amazon Elastic Load Balancing.
Hyp. 4: L4 or L7 rerouting. A front-end server analyzes the request source (OSI level 4, e.g. TCP) or its content (OSI level 7) and reroutes the request to the corresponding back-end server. This requires a high-performance server or switch, but the client does not see the redirection. There are commercial implementations (e.g. Flexiscale).
Hyp. 5: Hybrids of the 4 previous hypotheses.
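Hyp. 3 can be sketched in a few lines: a front-end that answers only the first request of each session with an HTTP 302 redirect, choosing back-ends round-robin (as DNS load balancing would do by default in Hyp. 2). The back-end hostnames are made up for illustration:

```python
# Sketch of Hyp. 3 (redirection): a minimal front-end that answers the
# initial request with a 302 redirect to a back-end chosen round-robin.
# Subsequent requests go straight to the chosen back-end, so the front-end
# only pays the cost of one tiny response per session.

import itertools

# Hypothetical back-end pool.
BACKENDS = ["http://be1.example.com", "http://be2.example.com", "http://be3.example.com"]

class RoundRobinRedirector:
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def redirect(self, path):
        """Return the HTTP status and headers for the redirect response."""
        target = next(self._cycle)
        return 302, {"Location": target + path}

r = RoundRobinRedirector(BACKENDS)
status, headers = r.redirect("/index.html")
```

A real deployment would replace the round-robin choice with a least-loaded choice fed by back-end status reports; as the comparison below notes, even that remains very cheap because only session-initial requests pass through the front-end.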
Comparison of the Architectures
[Slide-table residue: the original slides build up, over several animation steps, a table comparing the architectures (static content on a storage service, DNS load balancing over m replicated front-ends, redirection, L4/L7 rerouting) along these dimensions: scaling limit (from "scales very well"/"unlimited" down to limits set by the client arrival rate or the request arrival rate), coherence, scale-up speed (immediate, or immediate + DNS TTL), scale-down time (session duration, session duration + DNS TTL, or days), and the effect of front-end and back-end faults (from "has no effect" and "rare effect" through "1/m sessions fail" or "long delay for 1/m (or 1/n) sessions" up to "significant fault" and "total failure"). The exact cell assignments are not recoverable from the extraction.]

Notes accompanying the table:
- Redirecting clients (especially if done only when a session begins) is very cheap, even if the front-end server is receiving back-end status reports and running a load-balancing algorithm.
- A UDP-based DNS response holds only 512 bytes (up to about 25 back-end servers). Most ISPs complete the request over TCP if there are more than 25; however, some DNS clients only use the first reply.
- In the case of L4 rerouting there are growing hurdles to success: NAT, proxies, ...
- It is difficult to identify when sessions finish (e.g. webmail). Some DNS clients ignore DNS record TTLs and take days to invalidate their caches.
- Scale-up speed is bounded by front-end VM start-up time; with a storage service there is no web server to start.
- If the redirection servers are themselves load balanced, a front-end failure means the client has to wait for a timeout and try another server. This should take at most 2.5 s, but in some Linux implementations it takes up to 3 min!
- Back-end faults are often user-recoverable: e.g., in S3, 1% of first write attempts fail, but immediate retries succeed.
- DNS front-end faults are especially significant when using low TTLs (which are otherwise good for scaling).
Example: MapCruncher
A map conversion site, loaded with 25 GB of interactive demo maps. It suffered a flash crowd when Microsoft publicized it. The server had the theoretical capacity to handle the traffic (100 images/sec.), but the lack of reference locality (each client looking at different parts of the maps) caused unbearable thrashing. All the static content was moved to S3: they pay $4/month if there is no traffic.
Example 2: Asirra
A CAPTCHA web service based on distinguishing cats from dogs: EC2 servers + 100 GB of images placed on S3. Database of image metadata: SQL Server was slow, so a key-indexed, read-only structure of image metadata is transferred nightly to each of the application servers.
How can the session state be maintained?
- Hyp. 1: inside S3. It's slow.
- Hyp. 2: on the application servers' disks. Since DNS load balancing is used, it is not guaranteed that the question and the answer to the captcha go to the same server.
Solution: forward all session requests to the same server, with the server id stored in the session id. It's very cheap because it requires no disk accesses and only 10% of clients change servers between request and response.
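The session-affinity trick described above can be sketched as follows: embed the id of the server that created a session inside the session id, so any server that receives a later request can recognize and forward a misrouted one. The id format and names are illustrative, not Asirra's actual scheme:

```python
# Sketch of session affinity via the session id, as in the Asirra example:
# the session id carries the id of the server that owns the session, so no
# disk access or shared store is needed to route a request correctly.

import secrets

SERVER_ID = 7  # this server's id (would be assigned at deployment)

def new_session_id(server_id):
    """Session id = owning server id + random token."""
    return f"{server_id:03d}-{secrets.token_hex(8)}"

def owner_of(session_id):
    return int(session_id.split("-", 1)[0])

def handle(session_id, my_id=SERVER_ID):
    # DNS round-robin may deliver the request to the wrong server
    # (about 10% of the time in the example above); forward it if so.
    if owner_of(session_id) != my_id:
        return ("forward", owner_of(session_id))
    return ("handle-locally", my_id)

sid = new_session_id(SERVER_ID)
```

The design choice is the same one the slide highlights: routing information lives entirely in the id the client already sends back, so the common case costs one string comparison.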
Example 2: Asirra (cont.)
Again, a flash crowd after a trade fair appearance: 75,000 requests in 24h. Two interesting observations:
- 30,000 requests came from a DoS attack.
- Using more instances was cheap. The attacker gave up, but it would have been cheap to keep the extra instances running until a filter was set up.

Example 3: InkBlotPassword.com
A website for associating mnemonic images (Rorschach inkblots) with passwords. After the two previous experiences, the authors simplified the development process. Is it worth optimizing code? If optimizations are only needed for peak periods, it's better to pay for more machines.
The website was mentioned on Slashdot (a tech news site) without the authors knowing. They detected a flash crowd (request queue = 130!) and started 12 new nodes; 20 min. later, the website was stable. Three days later they were back to only 3 servers. Total cost of the flash crowd: $150.
Next Time... Cloud Data Centers