CONGA: Distributed Congestion-Aware Load Balancing for Datacenters. Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam, Francis Matus, Rong Pan, Navindra Yadav, George Varghese 1
Motivation: DC networks need large bisection bandwidth for distributed apps (big data, HPC, web services, etc.). Single-rooted tree (Core / Agg / Access, 1000s of server ports): high oversubscription 2
Motivation: DC networks need large bisection bandwidth for distributed apps (big data, HPC, web services, etc.). Multi-rooted tree [Fat-tree, Leaf-Spine, …] (Spine / Leaf, 1000s of server ports): full bisection bandwidth, achieved via multipathing 3
Multi-rooted ≠ Ideal DC Network. Ideal DC network: a big output-queued switch (1000s of server ports): no internal bottlenecks → predictable; simplifies BW management [EyeQ, FairCloud, pfabric, Varys, …] 5
Multi-rooted tree: can't build the big switch; possible bottlenecks inside the fabric → need precise load balancing 6-7
Today: ECMP Load Balancing. Pick among equal-cost paths by a hash of the 5-tuple (H(f) % 3 = 0): approximates Valiant load balancing, preserves packet order 8
Problems: hash collisions (coarse granularity); local & stateless (very bad with asymmetry due to link failures) 9-10
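A minimal sketch of the hashing step described above (Python; CRC32 stands in for whatever hash function a real switch ASIC uses, and all names are illustrative):

```python
import zlib

def ecmp_path(src_ip, dst_ip, proto, src_port, dst_port, n_paths):
    """Hash the flow's 5-tuple to one of n_paths equal-cost uplinks."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_paths

# Every packet of a flow takes the same path (order preserved), but two
# elephant flows can collide on one uplink while another uplink sits idle,
# and the choice ignores congestion entirely.
path = ecmp_path("10.0.0.1", "10.0.1.9", 6, 5000, 80, 3)
```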
Dealing with Asymmetry Handling asymmetry needs non-local knowledge 11
Dealing with Asymmetry: throughput of each scheme in the asymmetric example (a 30G UDP flow competing with TCP flows; per-link labels from the diagram omitted):
  ECMP (local, stateless): 60G
  Local congestion-aware: 50G (interacts poorly with TCP's control loop)
  Global congestion-aware: 70G
Global CA > ECMP > Local CA; local congestion-awareness can be worse than ECMP 14-20
Global Congestion-Awareness (in Datacenters). Opportunities: microsecond latencies; simple, regular topology. Challenge: volatile, bursty traffic. Need a scheme that is simple & stable yet responsive. Key insight: use extremely fast, low-latency distributed control 21-22
CONGA in 1 Slide: 1. Leaf switches (top-of-rack) track congestion to other leaves on different paths in near real-time. 2. Use greedy decisions to minimize bottleneck utilization. Fast feedback loops between leaf switches (L0, L1, L2), directly in the dataplane 23
CONGA'S DESIGN 24
Design: CONGA operates over a standard DC overlay (VXLAN), already deployed to virtualize the physical network. VXLAN encap: L0→L2 carries H1→H9 (leaves L0, L1, L2; hosts H1-H9) 25
Design: Leaf-to-Leaf Feedback. Track path-wise congestion metrics (3 bits) between each pair of leaf switches in a Congestion-To-Leaf table @L0 (rows: dest leaves L1, L2; columns: paths 0-3). A Rate Measurement Module measures link utilization 26
Design: Leaf-to-Leaf Feedback (cont.). A packet L0→L2 on path 2 starts with CE=0; each hop updates pkt.ce ← max(pkt.ce, link.util), so it arrives with CE=5. The destination leaf records this in its Congestion-From-Leaf table @L2 (src leaf L0, path 2, metric 5) 27
Design: Leaf-to-Leaf Feedback (cont.). On reverse traffic L2→L0, the metric is piggybacked (FB-Path=2, FB-Metric=5) and installed in the Congestion-To-Leaf table @L0 28
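A toy model of this feedback loop (Python; the class and table names are mine, not from the CONGA hardware, and the link utilizations are chosen so the packet arrives with CE=5 as on the slide):

```python
def quantize(util):
    """Map link utilization in [0, 1] to a 3-bit CE value (0..7)."""
    return min(7, int(util * 8))

class Packet:
    def __init__(self, path):
        self.path = path
        self.ce = 0  # congestion extent, updated hop by hop

def traverse(pkt, link_utils):
    """Every link on the path raises CE: pkt.ce <- max(pkt.ce, link.util)."""
    for u in link_utils:
        pkt.ce = max(pkt.ce, quantize(u))
    return pkt

# A packet from L0 to L2 on path 2 crosses links at 10%, 65%, 30% utilization.
pkt = traverse(Packet(path=2), [0.10, 0.65, 0.30])

# The destination leaf (L2) records the metric per (source leaf, path) ...
congestion_from_leaf = {("L0", pkt.path): pkt.ce}
# ... and piggybacks it on reverse traffic so the source leaf (L0) can
# update its own Congestion-To-Leaf table.
congestion_to_leaf = {("L2", pkt.path): congestion_from_leaf[("L0", pkt.path)]}
```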
Design: LB Decisions. Send each flowlet [Kandula et al. 2007] on the least congested path. Congestion-To-Leaf table @L0: to L1, paths 0-3 have metrics 5, 3, 7, 2 → p* = 3; to L2, metrics 1, 1, 5, 4 → p* = 0 or 1 29
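The decision itself is just an argmin over that table. A sketch using the metric values from this slide (the table layout and function names are mine):

```python
# Congestion-To-Leaf table at L0: one metric per (dest leaf, path),
# with the values shown on the slide.
to_leaf_table = {
    "L1": [5, 3, 7, 2],   # paths 0..3
    "L2": [1, 1, 5, 4],
}

def best_paths(dst_leaf, table):
    """Return all paths whose congestion metric is minimal (ties allowed)."""
    metrics = table[dst_leaf]
    lo = min(metrics)
    return [p for p, m in enumerate(metrics) if m == lo]

# New flowlets to L1 go on path 3; flowlets to L2 pick among paths 0 and 1.
```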
Why is this Stable? Stability usually requires a sophisticated control law (e.g., TeXCP, MPTCP, etc.) 30
Why is this Stable? Feedback latency between source leaf and dest leaf is a few microseconds; adjustments happen at flowlet arrivals. Near-zero latency + flowlets → stable 31
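Flowlet switching (Kandula et al.) can be sketched as a table keyed by flow that re-picks a path only after an idle gap; the gap value, class, and method names below are illustrative, not CONGA's actual parameters:

```python
import time

# Idle gap after which a flow may safely change paths. It must exceed the
# latency skew between paths so packets cannot be reordered; 500 us here
# is an illustrative value, not CONGA's configured threshold.
FLOWLET_GAP = 500e-6

class FlowletTable:
    def __init__(self):
        self.last_seen = {}  # flow id -> (last packet timestamp, path)

    def path_for(self, flow, pick_path, now=None):
        """Reuse the flowlet's current path; after an idle gap, pick afresh."""
        now = time.monotonic() if now is None else now
        entry = self.last_seen.get(flow)
        if entry is None or now - entry[0] > FLOWLET_GAP:
            path = pick_path()   # new flowlet: e.g. least congested path now
        else:
            path = entry[1]      # mid-flowlet: stay put, preserve order
        self.last_seen[flow] = (now, path)
        return path
```

Because a new flowlet only starts after the gap, switching it to a different path cannot reorder packets within a flow, which is what lets CONGA adjust at flowlet granularity without hurting TCP.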
How Far is this from Optimal? Given traffic demands [λij], the optimal routing minimizes the maximum link utilization; CONGA's greedy decisions play a bottleneck routing game (Banner & Orda, 2007), and the Price of Anarchy compares its bottleneck utilization to that optimum 32
Worst case. Theorem: PoA of CONGA = 2 33
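One way to write the objective behind this bound (notation mine, following the slide's min/max structure):

```latex
% Given demands [\lambda_{ij}], the optimal routing minimizes the
% utilization of the most-congested (bottleneck) link:
\[
  U^{*} \;=\; \min_{\text{routing}} \; \max_{\ell \in \text{links}} U_{\ell},
  \qquad
  \mathrm{PoA} \;=\; \frac{\max_{\ell} U_{\ell}\ \text{(CONGA)}}{U^{*}} \;\le\; 2 .
\]
```

So even in the worst case, CONGA's bottleneck link is at most twice as utilized as under an optimal centralized routing.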
Implementation: implemented in silicon for Cisco's new flagship ACI datacenter fabric. Scales to over 25,000 non-blocking 10G ports (2-tier Leaf-Spine). Die area: <2% of chip 34
Evaluation. Testbed experiments: 64 servers, 10G switches (32x10G fabric links); realistic traffic patterns (enterprise, data-mining); HDFS benchmark; link failure scenarios. Large-scale simulations: OMNET++, Linux 2.6.26 TCP; varying fabric size, link speed, asymmetry; up to 384-port fabric 35
HDFS Benchmark: 1TB write test, 40 runs (Cloudera hadoop-0.20.2-cdh3u5, 1 NameNode, 63 DataNodes). With no link failure, CONGA is ~2x better than ECMP; link failure has almost no impact with CONGA 36
Decouple DC LB & Transport: the network provides a Big Switch abstraction (TX hosts H1-H9 in, RX hosts H9-H1 out); ingress & egress are managed by transport 37
Conclusion. CONGA: globally congestion-aware LB for DCs, implemented in the Cisco ACI datacenter fabric. Key takeaways: 1. In-network LB is right for DCs. 2. Low latency is your friend; it makes feedback control easy 38
Thank You! 39