Load Balancing in Data Center Networks
Henry Xu, Department of Computer Science, City University of Hong Kong
ShanghaiTech Symposium on Data Science, June 23, 2015
Today's plan: an overview of research in DCN, background on load balancing, data plane solutions, and control plane solutions.
Overview: how can we send data across the boundaries of servers in a data center in a better way? DCN papers: 13/40 at Sigcomm 2015, 12/43 at Sigcomm 2014. Universities: Berkeley, UIUC, UW-Madison, USC, Princeton. Industry/labs: Microsoft Research, Google, Facebook.
Overview: what makes DCN different? 1. It's entirely controlled by one operator: we can change everything from apps to switch ASICs. 2. It's large-scale: the huge design space makes the problems intellectually challenging. 3. It's highly demanding: there are many problems to work on, a huge problem space.
Topology design. The first question: what should a large-scale DCN look like, built from commodity switches? DCell (2008), PortLand, BCube (2008), Jellyfish (2012). Hotness:
Traffic measurement. What are the traffic characteristics in production DCNs? Elephant vs. mice: most flows are mice (<100KB), while most bytes come from elephants (>10MB). Hotness:
Transport design. How to make TCP better for DCN? Many extensions are possible, but be aware of the tons of prior work (and the experts). Hotness:
Bandwidth guarantees. How to fairly share bandwidth in a multi-tenant DCN? Hotness:
Centralized flow scheduling. How to better coordinate the transmission of elephant flows? The goal is mainly high throughput. Hotness:
Low latency networking. How to reduce (tail) latency for partition-aggregation workloads? Multi-faceted; it draws interest from both the systems and queueing theory communities. Hotness:
Inter-DC TE. How to better perform traffic engineering on the inter-DC WAN? Hotness:
Application-aware. How can application-layer semantics help the DCN? Hotness:
Network management. How to better manage (configure/debug) a large-scale DCN? Hotness:
Energy. How to reduce the energy consumption of the DCN? Hotness:
Background on LB
General 3-tier Clos topology. [Figure: a general 3-tier Clos topology with p pods, each containing r ToR switches and n = 4 aggregation switches, and n core planes with m core switches each. Source: A. Andreyev, Introducing data center fabric, the next-generation Facebook data center network, Nov. 2014.] It can quickly scale capacity in any dimension.
General 3-tier Clos topology. Facebook's latest Altoona data center uses this topology, with r = 48 ToR switches in each pod and m = 12 out of the 48 possible core switches in each plane, resulting in a 4:1 oversubscription ratio. Fat-tree is a special case: a k-pod fat-tree fixes p = k and r = n = m = k/2, providing full bisection bandwidth. Fat-tree is also widely used: Amazon's EC2 cloud is deploying a 10Gbps fat-tree.
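To make the special case concrete, here is a minimal Python sketch that maps a k-pod fat-tree onto the (p, r, n, m) notation of the Clos figure above; the helper name and the derived host/switch counts are standard fat-tree facts, not taken from these slides.

def fat_tree_params(k):
    """A k-pod fat-tree in the general Clos notation: p = k, r = n = m = k/2."""
    assert k % 2 == 0, "k must be even"
    p = k                      # pods
    r = n = m = k // 2         # ToRs per pod, aggrs per pod (= core planes), cores per plane
    return {
        "pods": p,
        "tors_per_pod": r,
        "aggrs_per_pod": n,
        "cores_per_plane": m,
        "core_switches": (k // 2) ** 2,   # n * m core switches in total
        "hosts": k ** 3 // 4,             # full bisection bandwidth for this many hosts
    }

print(fat_tree_params(12))     # the 12-pod fat-tree used in the simulations later in this talk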
Today's LB practice. There are multiple equal-cost paths between a pair of hosts; how do we load balance across them? Today's practice is ECMP: hash the five-tuple and pick a path. Simple and stateless, but it's purely local and prone to hash collisions.
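A minimal Python sketch of the idea, assuming a software model of a switch's next-hop choice; real switches use hardware hash functions, and ecmp_next_hop is an illustrative name only.

import hashlib

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    # Hash the five-tuple; every packet of the same flow maps to the same path,
    # so no per-flow state is needed, but the choice ignores congestion.
    five_tuple = f"{src_ip},{dst_ip},{proto},{src_port},{dst_port}".encode()
    h = int.from_bytes(hashlib.md5(five_tuple).digest()[:4], "big")
    return next_hops[h % len(next_hops)]

print(ecmp_next_hop("10.0.0.1", "10.0.1.2", 6, 12345, 80,
                    ["uplink0", "uplink1", "uplink2", "uplink3"]))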
Hash collisions. [Figure: a small topology with hosts H1-H4 and switches S1-S4, where an elephant flow and a mice flow are hashed onto the same links.] The result: long tail latency for mice flows, and low throughput for elephants if they collide. A problem widely recognized in the community.
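A back-of-the-envelope, birthday-style calculation shows why collisions are common even with many equal-cost paths; it assumes hashes are uniform and independent, which is an idealization rather than anything measured on these slides.

def prob_elephant_collision(num_elephants, num_paths):
    # Probability that at least two elephants are hashed onto the same path.
    if num_elephants > num_paths:
        return 1.0
    p_no_collision = 1.0
    for i in range(num_elephants):
        p_no_collision *= (num_paths - i) / num_paths
    return 1.0 - p_no_collision

# e.g. just 8 concurrent elephants over 48 equal-cost paths collide ~46% of the time
print(prob_elephant_collision(8, 48))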
Data plane solutions
First approach: finer granularity. Per-packet: packet spraying, DRB; flowlets/flowcells: Flare, Presto [Sigcomm 15]. Reordering is a big problem for TCP. Second approach: congestion-aware LB. CONGA [Sigcomm 14], Expeditus.
CONGA. Only for leaf-spine networks. Information collection: per-packet feedback. [Figure: the CONGA system diagram. The sender leaf keeps per-uplink DREs, a Congestion-To-Leaf table, and a flowlet table for its LB decisions; forward-path packets carry an LBTag and a CE field that per-link DREs in the spine update to the maximum congestion along the path; the receiver leaf records this in a Congestion-From-Leaf table and feeds it back to the sender in reverse-path packets (FB_LBTag, FB_Metric). Source: the CONGA paper, Sigcomm 2014.]
CONGA path selection: for a new flowlet, pick the uplink port that minimizes the maximum load over the two links of the path. Limitation: it only works for leaf-spine networks. Can we extend it to 3-tier Clos? Recall that CONGA relies on per-path feedback.
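A hedged Python sketch of this decision at a sender leaf, combining flowlet detection with the min-max rule; the table contents, the 500 us flowlet gap, and the names (dre_local, congestion_to_leaf) are illustrative assumptions, not CONGA's actual code.

import time

FLOWLET_GAP = 0.0005   # 500 microseconds of inactivity starts a new flowlet (illustrative)

flowlet_table = {}     # flow five-tuple -> (chosen uplink, time of last packet)
dre_local = {0: 0.2, 1: 0.7, 2: 0.1, 3: 0.4}           # local per-uplink load (DRE output)
congestion_to_leaf = {                                  # remote metric fed back per (dest leaf, uplink)
    ("leafB", 0): 0.6, ("leafB", 1): 0.1,
    ("leafB", 2): 0.8, ("leafB", 3): 0.3,
}

def pick_uplink(flow, dest_leaf, now=None):
    now = time.monotonic() if now is None else now
    uplink, last = flowlet_table.get(flow, (None, 0.0))
    if uplink is None or now - last > FLOWLET_GAP:
        # New flowlet: choose the uplink minimizing the max of local and remote congestion.
        uplink = min(dre_local, key=lambda u: max(dre_local[u],
                                                  congestion_to_leaf[(dest_leaf, u)]))
    flowlet_table[flow] = (uplink, now)
    return uplink

print(pick_uplink(("10.0.0.1", "10.0.1.2", 6, 12345, 80), "leafB"))   # -> uplink 3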
The answer is no. [Figure: the general 3-tier Clos topology again.] There are nm paths between two ToRs in different pods, so each ToR would need to maintain state for O(nmpr) paths: 221,184 paths for FB's Altoona DC.
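A quick check of that number, using the parameters quoted earlier (n = 4 core planes, m = 12 core switches per plane, r = 48 ToRs per pod); the pod count is not stated on these slides, and p = 96 is simply the value consistent with the 221,184 figure.

n, m, r = 4, 12, 48
p = 96                                    # assumption, back-derived from the quoted total
paths_per_tor_pair = n * m                # 48 core-level paths between ToRs in different pods
path_state_per_tor = n * m * p * r        # O(nmpr) entries if a ToR tracked every path
print(paths_per_tor_pair, path_state_per_tor)   # -> 48 221184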
Expeditus. Information collection: do local congestion monitoring at each switch. Two-stage path selection: select the path stage by stage in the 3-tier Clos network, choosing aggregation switches in the first stage and core switches in the second.
Information collection. Each switch performs local congestion monitoring for all uplinks to the upstream tier. [Figure: ToR and aggregation switches keep a congestion table with per-port ingress and egress loads for their uplinks.]
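A hedged sketch of one way to keep such a per-port load estimate, modeled on the DRE (discounting rate estimator) element that appears in the Click implementation later in this talk; the constants and the class name are illustrative assumptions.

class DRE:
    """Per-port load estimator: bytes counted per packet, decayed periodically."""
    def __init__(self, link_capacity_bps, alpha=0.1, period_s=0.0001, bits=3):
        self.x = 0.0
        self.alpha = alpha
        self.period = period_s
        self.bits = bits
        # bytes the link can carry over the estimator's time constant (period / alpha)
        self.scale = link_capacity_bps / 8.0 * period_s / alpha

    def on_packet(self, size_bytes):      # data-plane update, on every packet
        self.x += size_bytes

    def on_timer(self):                   # called every `period` seconds
        self.x *= (1.0 - self.alpha)

    def quantized_load(self):             # the few-bit value stored in congestion tables
        frac = min(self.x / self.scale, 1.0)
        return round(frac * (2 ** self.bits - 1))

dre = DRE(link_capacity_bps=10e9)
for _ in range(100):
    dre.on_packet(1500)
print(dre.quantized_load())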
Two-stage path selection. Selection starts when the first packet of a flow (the Exp-ping) reaches its ToR. A dedicated Ethernet tag carries three fields: SF (stage flag: 0 for the first stage, 1 for the second), NoE (number of entries), and CD (congestion data). Each switch keeps a Path Selection Table (PST) with entries of the form (flow ID, egress port, PSS, valid), where PSS is 1 if path selection has completed (0 otherwise) and valid indicates whether the entry is valid (1) or not (0).
Two-stage path selection, first stage: the Exp-ping (the flow's SYN, tagged with SF=0) reaches the source ToR switch, which copies its egress loads for the two uplinks (port 1: 2, port 2: 5) into the CD field, sets NoE=2, and creates a PST entry for flow A with PSS=0.
First stage: the Exp-ping reaches the destination ToR switch, which compares, for each uplink, the maximum of the carried CD value and its local ingress load: max(2, 6) = 6 for port 1 versus max(5, 0) = 5 for port 2, so port 2 is chosen.
First stage: at the destination ToR switch, the tag is removed from the Exp-ping (which continues as a normal SYN), and an Exp-pong is generated by copying the Exp-ping's header, reversing the source and destination addresses, and setting SF=1, NoE=0.
Second stage: the Exp-pong reaches the aggregation switch in the destination pod, which writes its ingress loads from the core tier (port 1: 3, port 2: 7) into the CD field and sets NoE=2.
Second stage: the Exp-pong reaches the aggregation switch in the source pod, which compares, for each core-facing port, the maximum of the carried CD value and its local egress load: max(3, 3) = 3 for port 1 versus max(7, 5) = 7 for port 2, so port 1 is chosen.
Second stage: the source-pod aggregation switch inserts a new PST entry for flow A: egress port 1, PSS=1, valid=1.
Second stage: the Exp-pong reaches the source ToR switch, which inserts a PST entry for flow A with egress port 2, PSS=1, valid=1, and then discards the Exp-pong.
Two-stage path selection: the end-to-end path is now decided; the source ToR forwards flow A on egress port 2 and the source-pod aggregation switch forwards it on egress port 1.
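A minimal Python sketch of the comparison rule the walkthrough uses at both stages: take the per-port maximum of the carried congestion data and the local load, then pick the smallest. The function name and the example numbers (which match the slides above) are for illustration only.

def choose_port(carried_cd, local_loads):
    """carried_cd[port] and local_loads[port] are the loads for each candidate uplink."""
    scores = {port: max(cd, local_loads[port]) for port, cd in carried_cd.items()}
    return min(scores, key=scores.get)

# Stage 1 (at the destination ToR): CD = source ToR egress loads, local = ingress loads.
aggr_port = choose_port(carried_cd={1: 2, 2: 5}, local_loads={1: 6, 2: 0})   # -> 2

# Stage 2 (at the source-pod aggregation switch): CD = destination-pod aggregation
# switch's ingress loads from the cores, local = its own egress loads to the cores.
core_port = choose_port(carried_cd={1: 3, 2: 7}, local_loads={1: 3, 2: 5})   # -> 1

print(aggr_port, core_port)   # PST entries for these ports are installed on the reverse path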
Handling failures. [Figure: aggregation switch f_1^1 loses its link to core switch s_1^1, so each uplink to f_1^1 can achieve at most 50% of its capacity toward the core.] Expeditus needs to avoid selecting f_1^1 in the first stage of path selection.
Handling failures. To make ToRs less likely to choose uplinks to f_1^1, each egress load is scaled by a per-port link-load multiplier before being written into the CD of the Exp-ping: here port 1 (toward f_1^1) has multiplier 2 and port 2 has multiplier 1, so the advertised loads become 3 x 2 = 6 and 5 x 1 = 5.
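A tiny sketch of that multiplier, reusing the numbers on this slide; the helper and variable names are illustrative assumptions.

def advertised_loads(egress_load, multiplier):
    # Scale each uplink's measured load by its failure-aware multiplier before
    # it is carried in the Exp-ping's CD field.
    return {port: egress_load[port] * multiplier[port] for port in egress_load}

egress_load = {1: 3, 2: 5}
multiplier = {1: 2, 2: 1}          # port 1 leads to the aggr switch with the failed core link
cd = advertised_loads(egress_load, multiplier)
print(cd, min(cd, key=cd.get))     # {1: 6, 2: 5}; port 2 now looks less congested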
Implementation: Click router. We prototype Expeditus in Click, a modular software router for fast prototyping of protocols, adding two new Click elements: DRE and EXPRoute. DRE measures ingress/egress link load, with a processing overhead of ~151 ns/packet; EXPRoute conducts the two-stage path selection, with a processing overhead of ~473 ns/packet. [Figure: the packet processing pipeline for a 4-pod fat-tree: FromDevice elements feed DREs, then LookupIPRoute/EXPRoute, then per-port output queues and ToDevice elements.]
Testbed experiments. Small-scale 3-tier Clos network on Emulab, web search traffic workload, 5 runs for each data point. [Figure: topology with a hot spot on an aggr-core link.] Flows sent from t_1^3 and t_1^4 to t_3^2 and t_2^4 encounter hot spots at the aggr-core link.
Testbed experiments. Same setup: small-scale 3-tier Clos network on Emulab, web search traffic workload, 5 runs for each data point. [Figure: topology with a hot spot on a ToR-aggr link.] Flows sent from t_1^3 to t_3^2 encounter hot spots at the ToR-aggr link.
Large-scale simulations. Topology and traffic workloads: a 12-pod 10G fat-tree with 36 equal-cost paths and 864 hosts; a 10G leaf-spine fabric with 128 hosts (8 leaf switches, 8 spine switches); realistic traffic workloads (web search and data mining). Schemes compared: Expeditus; Ideal, an idealized scheme that uses complete global congestion information to load balance flows; ECMP, the baseline scheme; and CONGA-Flow, per-flow CONGA.
Performance in the 3-tier Clos network: 12-pod fat-tree with 2:1 oversubscription at the ToR tier, web search workload. [Plots: results for small flows (<100KB) and large flows (>1MB); annotated figures: 50% and 16%~34%.] The performance of Expeditus approaches the Ideal scheme.
Performance in the 3-tier Clos network: 12-pod fat-tree with 2:1 oversubscription at the ToR tier, data mining workload. [Plots: results for small flows (<100KB) and large flows (>1MB); annotated figures: 46% and 26%~32%.] The performance of Expeditus again approaches the Ideal scheme.
Comparison with CONGA-Flow: leaf-spine with 2:1 oversubscription at the leaf tier. [Plots: averages over all flows for the web search and data mining workloads.] Expeditus performs better than the state-of-the-art scheme CONGA-Flow.
Impact of link failure: reduction by Ideal and Expeditus over ECMP (web search workload, load 0.5). [Plots: aggr-core link failure and ToR-aggr link failure.] Expeditus still provides moderate performance gains.
Control plane solutions
Centralized optimization. Usually done in an SDN environment with per-flow control. Advantage: global network state improves efficiency. Disadvantages: slow (minutes), and it can only handle a small number of flows. The common approach: detect elephant flows, compute an LB solution via some optimization, and dispatch the resulting rules to OpenFlow switches. Hedera (NSDI 2010), ElasticTree (NSDI 2010).
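A hedged sketch of the placement step of that workflow, written as a simple greedy fit of detected elephants onto candidate paths against residual link capacities; this is a generic illustration under made-up names, not Hedera's or ElasticTree's actual algorithm.

def place_elephants(elephants, paths, link_capacity):
    """elephants: {flow: demand}; paths: {flow: [list of links, ...]}; same units throughout."""
    residual = dict(link_capacity)
    placement = {}
    for flow, demand in sorted(elephants.items(), key=lambda kv: -kv[1]):
        for path in paths[flow]:
            if all(residual[link] >= demand for link in path):
                for link in path:
                    residual[link] -= demand
                placement[flow] = path       # would be pushed to switches as OpenFlow rules
                break
    return placement

flows = {"f1": 4, "f2": 3}                               # detected elephants and their demands
paths = {"f1": [["a", "c1", "b"], ["a", "c2", "b"]],     # two equal-cost paths per flow
         "f2": [["a", "c1", "b"], ["a", "c2", "b"]]}
caps = {"a": 10, "b": 10, "c1": 5, "c2": 5}
print(place_elephants(flows, paths, caps))               # f1 via c1, f2 via c2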
Coflow LB. Applications like Spark launch a group of (elephant) flows for a computation task, and usually all of them need to finish before the task can proceed: such a group is a coflow. It makes sense to do LB at the granularity of coflows rather than individual flows. Here LB includes both routing and rate control, just like traffic engineering for the WAN.
A toy example. [Figure: allocation of ingress port capacities over time (ports P1-P3) for two coflows C1 and C2 under four mechanisms: (a) per-flow fairness, with average FCT 3.4 and CCT 4 time units; (b) per-flow prioritization, with average FCT 2.8 and CCT 3.5; (c) Weighted Shuffle Scheduling (WSS), with average FCT 3.6 and CCT 4; (d) the optimal schedule, with average FCT 3 and CCT 3.] Source: M. Chowdhury et al., Efficient Coflow Scheduling with Varys, Proc. ACM Sigcomm, 2014.
Coflow scheduling. First studied in Chowdhury's Sigcomm 2014 Varys paper. Assume coflow information, including flow sizes, endpoints, and arrival times, is known, and abstract the network as one huge non-blocking switch. The problem is to decide when to start the flows and at what rate to serve them so as to minimize the average CCT of the cluster. It is NP-hard even in the offline case where all coflows arrive at the same time.
Coflow scheduling in Varys. When to start a coflow: the smallest-effective-bottleneck-first heuristic, mimicking smallest-first scheduling for minimizing average FCT. Calculate each coflow's CCT using the remaining bandwidth at the first and last hops, and start the coflow with the smallest CCT first. At what rate: allocate just enough bandwidth that all flows of the coflow finish at the same time.
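A hedged Python sketch of these two steps under the non-blocking-switch abstraction; the data structures and the small example are simplified illustrations (for instance, coflows arriving over time are ignored), not Varys's actual implementation.

def effective_bottleneck(flows, ingress_bw, egress_bw):
    """flows: list of (src_port, dst_port, bytes_remaining); bandwidths in bytes/sec.
    Returns the coflow's CCT if it could use all remaining first/last-hop bandwidth."""
    send, recv = {}, {}
    for s, d, size in flows:
        send[s] = send.get(s, 0) + size
        recv[d] = recv.get(d, 0) + size
    return max(max(v / ingress_bw[p] for p, v in send.items()),
               max(v / egress_bw[p] for p, v in recv.items()))

def schedule_next(coflows, ingress_bw, egress_bw):
    # Smallest effective bottleneck first: the coflow analogue of smallest-first.
    name, flows = min(coflows.items(),
                      key=lambda kv: effective_bottleneck(kv[1], ingress_bw, egress_bw))
    cct = effective_bottleneck(flows, ingress_bw, egress_bw)
    # Give each flow just enough rate to finish exactly at the coflow's CCT.
    rates = {(s, d): size / cct for s, d, size in flows}
    return name, cct, rates

coflows = {"C1": [("P1", "P2", 4), ("P1", "P3", 1)],
           "C2": [("P2", "P3", 2), ("P3", "P2", 2)]}
bw = {"P1": 1.0, "P2": 1.0, "P3": 1.0}
print(schedule_next(coflows, bw, bw))   # C2 (bottleneck 2) starts before C1 (bottleneck 5)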
Coflow scheduling and LB. Varys assumes the network bottleneck is always at the edge, but if LB is not done right, the bottleneck may instead appear in the core. To achieve optimal CCT, we have to jointly consider scheduling and LB. We can largely reuse Varys and update the CCT calculation to account for LB. For a single coflow, the joint problem can be formulated as an optimization and solved using heuristics: RAPIER, INFOCOM 2015.
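One plausible way to write that single-coflow joint routing and rate-allocation problem, assuming flow k has size V_k, a candidate path set P_k, and each link l has capacity c_l; this is a generic formulation for illustration, not necessarily RAPIER's exact model.

\begin{align*}
\min_{T,\; r_k,\; x_{k,p}} \quad & T \\
\text{s.t.} \quad & V_k \le r_k T, && \forall k && \text{(every flow finishes by the CCT } T\text{)} \\
& \sum_{p \in P_k} x_{k,p} = 1, && \forall k && \text{(each flow picks one path)} \\
& \sum_{k} \sum_{p \in P_k:\, l \in p} x_{k,p}\, r_k \le c_l, && \forall l && \text{(link capacities)} \\
& x_{k,p} \in \{0,1\}, \quad r_k \ge 0.
\end{align*}

The product of the path variable x_{k,p} and the rate r_k makes the problem non-convex, which is one reason heuristics are used in practice.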
Interesting open questions. What if we don't know the coflow information at all? What about mice coflows? We can't use centralized optimization for them, but is there a way to exploit application semantics to improve their performance?
To wrap up. This talk gave an overview of research on data center networks and a more in-depth, but still high-level, introduction to load balancing. Many interesting problems are still under-explored, but the expectations are high: an intuitive and exciting idea, a practical solution, and a solid evaluation using prototypes.
Thank you! Henry Xu, City University of Hong Kong