Chatty Tenants and the Cloud Network Sharing Problem Hitesh Ballani, Keon Jang, Thomas Karagiannis, Changhoon Kim, Dinan Gunawardena, Greg O'Shea MSR Cambridge, Windows Azure
This talk is about... How to share the network in multi-tenant datacenters? Multi-tenant datacenters: public cloud datacenters (e.g., Windows Azure, Amazon EC2, Rackspace), where tenants are users renting virtual machines, and private cloud datacenters 2
A use-case of cloud datacenters [diagram: a tenant's MapReduce virtual machines spread across two physical hosts, with intra-tenant traffic between them and inter-tenant traffic to the cloud provider's storage services] The network is shared! 3
Requirements for network sharing [FairCloud] Req 1. Minimum bandwidth guarantee (tenants want predictable performance / cost) Req 2. Proportionality (not all flows are equal: some tenants pay more) Req 3. High utilization (utilize spare resources as much as possible) 4
Existing solutions for network sharing: Today, Seawall, Oktopus, FairCloud (PS-L), FairCloud (PS-P). With inter-tenant traffic, min-guarantee becomes harder and proportionality becomes problematic. Prior work focuses on intra-tenant traffic 5
Chatty tenants in the cloud Typical cloud applications have many dependencies: provider services (File Storage, SQL DB), a tenant's own services (Web Frontend, Cache), and third-party services in the marketplace (WAN Optimizer, Firewall) 6
Prevalence of inter-tenant traffic Measurement from 8 datacenters of a public cloud service provider [chart: inter-tenant traffic (%) per datacenter, for datacenters 1-8] Inter-tenant traffic accounts for 10-35% of traffic! 7
Min bandwidth guarantee is harder On which links is a bandwidth guarantee required? [diagram: intra-tenant and inter-tenant traffic across two physical hosts] Inter-tenant traffic leads to a richer communication pattern and makes minimum bandwidth guarantees harder! 8
How to define proportionality? P and Q are paying the same amount [diagram: p and q share a 1000 Mbps link to r1-r4] Allocation (P / Q, in Mbps): Per flow 250 / 750; Seawall 250 / 750; FairCloud 333 / 666 Q: Whose payment should dictate the flow bandwidth? 9
Hadrian Overview What semantics should we provide to tenants? Virtual network abstraction How to allocate bandwidth? Hose-compliant bandwidth allocation This talk How to place virtual machines? Greedy heuristic that guarantees min bandwidth 10
State of the art: Hose model Tenant Request: <N, B> Each VM is guaranteed to send/receive at a minimum of B bps [diagram: Tenant P's VMs 1..p, each connected at B_P to a Virtual Switch] Minimum bandwidth guarantee High utilization 12
State of the art: Hose model Tenant Request: <N, B> Each VM is guaranteed to send/receive at a minimum of B bps [diagram: Tenant P's, Q's, and R's VMs (1..p, 1..q, 1..r), each tenant's VMs connected at B_P, B_Q, B_R to that tenant's own Virtual Switch] 13
Multi-hose model Tenant Request: <N, B> VMs in different tenants communicate with each other at a rate of min(B_P, B_Q) [diagram: the per-tenant switches of P, Q, and R connected through a Multi-Hose switch] Assumes the same inter- and intra-tenant bandwidth Allows inter-tenant communication 14
Hierarchical hose model Tenant Request: <N, B, B_inter> [diagram: each tenant's Virtual Switch connects to a Virtual Inter-tenant Switch at B^P_inter, B^Q_inter, B^R_inter] Assumes all-to-all communication Separate inter-tenant bandwidth requirement 15
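As a minimal sketch (the function names and signatures are mine, not the paper's), the guaranteed rate between VMs in different tenants under the two models above can be written as follows. In the multi-hose model an inter-tenant VM pair communicates at the minimum of the two VMs' hose bandwidths; the hierarchical hose model additionally introduces a separate inter-tenant bandwidth that caps inter-tenant traffic.

```python
def multi_hose_rate(b_p: float, b_q: float) -> float:
    """Multi-hose model: an inter-tenant VM pair communicates at the
    minimum of the two VMs' hose bandwidths."""
    return min(b_p, b_q)

def hierarchical_hose_rate(b_p: float, b_q: float,
                           b_p_inter: float, b_q_inter: float) -> float:
    """Hierarchical hose model, simplified to a single VM pair: the rate is
    further capped by both tenants' inter-tenant hose bandwidths. (In the
    full model, B_inter caps a tenant's aggregate inter-tenant traffic,
    which this per-pair view elides.)"""
    return min(b_p, b_q, b_p_inter, b_q_inter)

print(multi_hose_rate(100, 300))                   # 100
print(hierarchical_hose_rate(100, 300, 50, 200))   # 50
```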
Communication dependency Most tenants communicate with only a few other tenants Tenant Request: <N, B, B_inter, list of dependent tenants> Reduces possible communication patterns; helps place dependent tenants closer Q: How about service tenants (e.g., storage)? Tenant Request: <N, B, B_inter, *> 16
Hadrian Overview What semantics should we provide to tenants? Virtual network abstraction How to allocate bandwidth? Hose-compliant bandwidth allocation How to place virtual machines? Greedy heuristic that guarantees min bandwidth 17
Hose-compliant bandwidth allocation Whose payment should dictate the bandwidth of the flow? [diagram: p's 100 Mbps hose split across 5 flows (20 Mbps each); q's 200 Mbps hose split across 5 flows (40 Mbps each)] Our approach: take the minimum from the two sides 18
Hose-compliant bandwidth allocation Weighted fair-share at the bottleneck: a flow between VMs p and q gets weight W_pq = min(B_p / N_p, B_q / N_q), where B_p is p's hose bandwidth and N_p its number of flows 19
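The weight formula above can be sketched directly; this is an illustrative snippet, not the paper's implementation, using the numbers from the previous slide (p: 100 Mbps hose, 5 flows; q: 200 Mbps hose, 5 flows).

```python
def flow_weight(b_p: float, n_p: int, b_q: float, n_q: int) -> float:
    """Hose-compliant weight of a flow between VMs p and q:
    W_pq = min(B_p / N_p, B_q / N_q)."""
    return min(b_p / n_p, b_q / n_q)

# Slide example: p's side offers 100/5 = 20 Mbps per flow,
# q's side offers 200/5 = 40 Mbps per flow; the flow gets weight 20.
w = flow_weight(100, 5, 200, 5)
print(w)  # 20.0

# Upper-bound proportionality: each of p's flows has weight at most
# B_p / N_p, so the aggregate weight of p's flows never exceeds B_p.
total = sum(flow_weight(100, 5, 200, 5) for _ in range(5))
print(total)  # 100.0
```

Summing over a VM's flows shows why the aggregate weight is bounded by the VM's hose bandwidth, which is the basis of the proportionality argument on the next slide.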
Upper-bound proportionality [diagram: p's 100 Mbps hose with 5 flows; each flow's weight is min(20, ·) ≤ 20, so 5 flows sum to at most 100] Max aggregate weight for p's flows = 100 The upper bound on the total weight of a tenant's flows is proportional to the tenant's payment 20
Minimum bandwidth guarantee The total weight of all flows of a given VM is bounded [diagram: p and q, each with a 500 Mbps hose, send N flows over a 1000 Mbps link to r1..rk; total weight ≤ 1000] The placement algorithm enforces Total Weight ≤ Capacity on every link in the network, which yields the min bandwidth guarantee The verification can be formulated as a max-flow network problem 21
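The admission condition on this slide can be sketched as a simple per-link capacity check; this is an illustrative simplification with invented data structures, not the paper's max-flow formulation, and it assumes we already know how much weight a new tenant would add on each link.

```python
def admissible(link_capacity: dict, committed_weight: dict,
               new_weights: dict) -> bool:
    """Accept a new tenant only if, on every link it would use, the
    already-committed weight plus its additional weight stays within
    the link capacity (Total Weight <= Capacity on every link)."""
    for link, extra in new_weights.items():
        if committed_weight.get(link, 0.0) + extra > link_capacity[link]:
            return False
    return True

caps = {"core1": 1000.0}       # link capacities in Mbps
used = {"core1": 600.0}        # weight already committed

print(admissible(caps, used, {"core1": 300.0}))  # True: 900 <= 1000
print(admissible(caps, used, {"core1": 500.0}))  # False: 1100 > 1000
```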
Evaluation Synthetic cloud workload benchmark Tenants submit requests for VMs and execute jobs A job has CPU processing, inter-tenant traffic, and intra-tenant traffic Inter-tenant traffic ratio: 10-40% Fraction of tenants with inter-tenant traffic: 20% Environments Testbed: 16 end hosts Large-scale simulation: 16,000 end hosts 22
Evaluation criteria Network sharing requirements Minimum bandwidth guarantee Upper-bound proportionality High utilization Benefits of Hadrian Metric: acceptance ratio Comparison with baseline (per-flow sharing) and existing approaches (Oktopus, FairCloud) 23
Job completion time Relative job completion time against estimate, where estimate = amount of data / minimum bandwidth guarantee Jobs always finish within the estimated time and utilize spare resources: 3.4x faster at the 95th percentile High utilization 24
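The metric on this slide is simple arithmetic; as a hedged sketch (function names are mine), the estimate is data volume divided by the minimum bandwidth guarantee, and results are reported as actual time relative to that estimate.

```python
def estimate_seconds(data_mbits: float, min_guarantee_mbps: float) -> float:
    """Completion-time estimate: data volume / minimum bandwidth guarantee."""
    return data_mbits / min_guarantee_mbps

def relative_completion(actual_s: float, data_mbits: float,
                        min_guarantee_mbps: float) -> float:
    """Actual completion time relative to the estimate; values below 1.0
    mean the job beat its guarantee by using spare bandwidth."""
    return actual_s / estimate_seconds(data_mbits, min_guarantee_mbps)

# e.g. 8000 Mbit of data with a 100 Mbps guarantee -> 80 s estimate;
# finishing in 40 s gives a relative completion time of 0.5.
print(estimate_seconds(8000, 100))         # 80.0
print(relative_completion(40, 8000, 100))  # 0.5
```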
Bandwidth allocation [chart: throughput for p (Mbps) versus number of flows for q (1-15), comparing Per flow, FairCloud (PS-L), and Hose Compliant; scenario: p and q share a 1000 Mbps link to r1..rk, q sends N flows] Upper-bound proportionality 25
Request acceptance ratio (testbed) [chart: acceptance ratio (%) for Baseline vs Hadrian, in simulation and testbed — Baseline: 35 (simulation), 37 (testbed); Hadrian: 72, 72; annotation: 37% more jobs] Similar benefits in large-scale simulations; testbed and simulation show consistent results 26
Summary We show that inter-tenant traffic is prevalent: 10-35% of traffic at a major public cloud provider We propose Hadrian Virtual network abstraction: inter-tenant bandwidth, dependencies Bandwidth allocation strategy: upper-bound proportionality Placement algorithm: greedy dependency-aware packing Our evaluation shows that Hadrian meets the three network sharing requirements Hadrian delivers predictability and higher efficiency 27
Thank you