Load Balancing in Data Center Networks
Henry Xu, Department of Computer Science, City University of Hong Kong
ShanghaiTech Symposium on Data Science, June 23, 2015
Today's plan: an overview of research in DCN, background on load balancing, data plane solutions, and control plane solutions.
Overview: how can we send data across the boundaries of servers in a data center in a better way? DCN papers: 13/40 at Sigcomm 2015, 12/43 at Sigcomm 2014. Universities: Berkeley, UIUC, UW-Madison, USC, Princeton. Industry/labs: Microsoft Research, Google, Facebook.
Overview: what makes DCN different? 1. It's entirely controlled by one operator: we can change everything from apps to switch ASICs. 2. It's large-scale: the huge design space makes the problems intellectually challenging. 3. It's highly demanding: there are many problems to work on, a huge problem space.
Topology design. The first question: what should a large-scale DCN look like, built from commodity switches? DCell (2008), PortLand, BCube (2008), Jellyfish (2012). Hotness:
Traffic measurement. What are the traffic characteristics in production DCNs? Elephant vs. mice: most flows are mice (<100KB), while most bytes come from elephants (>10MB). Hotness:
Transport design. How to make TCP better for DCN? Many extensions are possible, but be aware of the tons of prior work (and the experts). Hotness:
Bandwidth guarantees. How to fairly share bandwidth in a multi-tenant DCN? Hotness:
Centralized flow scheduling. How to better coordinate the transmission of elephant flows? The goal is mainly high throughput. Hotness:
Low latency networking. How to reduce (tail) latency for partition-aggregation workloads? Multi-faceted; it draws interest from both the systems and queueing theory communities. Hotness:
Inter-DC TE. How to better perform traffic engineering on the inter-DC WAN? Hotness:
Application-aware. How can application-layer semantics help the DCN? Hotness:
Network management. How to better manage (configure/debug) a large-scale DCN? Hotness:
Energy. How to reduce the energy consumption of the DCN? Hotness:
Background on LB
General 3-tier Clos topology. [Figure: a general 3-tier Clos topology with p pods, each containing r ToR switches and n = 4 aggregation switches, and n core planes with m core switches each. Source: A. Andreyev, Introducing data center fabric, the next-generation Facebook data center network, Nov. 2014.] It can quickly scale capacity in any dimension.
General 3-tier Clos topology. Facebook's latest Altoona data center uses this topology, with r = 48 ToR switches in each pod and m = 12 out of the 48 possible core switches in each plane, resulting in a 4:1 oversubscription ratio. Fat-tree is a special case: a k-pod fat-tree fixes p = k and r = n = m = k/2, providing full bisection bandwidth. Fat-tree is also widely used: Amazon's EC2 cloud is deploying a 10Gbps fat-tree.
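To make the special case concrete, here is a minimal Python sketch that maps a k-pod fat-tree onto the (p, r, n, m) notation of the Clos figure above; the helper name and the derived host/switch counts are standard fat-tree facts, not taken from these slides.

def fat_tree_params(k):
    """A k-pod fat-tree in the general Clos notation: p = k, r = n = m = k/2."""
    assert k % 2 == 0, "k must be even"
    p = k                      # pods
    r = n = m = k // 2         # ToRs per pod, aggrs per pod (= core planes), cores per plane
    return {
        "pods": p,
        "tors_per_pod": r,
        "aggrs_per_pod": n,
        "cores_per_plane": m,
        "core_switches": (k // 2) ** 2,   # n * m core switches in total
        "hosts": k ** 3 // 4,             # full bisection bandwidth for this many hosts
    }

print(fat_tree_params(12))     # the 12-pod fat-tree used in the simulations later in this talk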
Today's LB practice. There are multiple equal-cost paths between a pair of hosts; how do we load balance across them? Today's practice is ECMP: hash the five-tuple and pick a path. Simple and stateless, but it's purely local and prone to hash collisions.
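A minimal Python sketch of the idea, assuming a software model of a switch's next-hop choice; real switches use hardware hash functions, and ecmp_next_hop is an illustrative name only.

import hashlib

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    # Hash the five-tuple; every packet of the same flow maps to the same path,
    # so no per-flow state is needed, but the choice ignores congestion.
    five_tuple = f"{src_ip},{dst_ip},{proto},{src_port},{dst_port}".encode()
    h = int.from_bytes(hashlib.md5(five_tuple).digest()[:4], "big")
    return next_hops[h % len(next_hops)]

print(ecmp_next_hop("10.0.0.1", "10.0.1.2", 6, 12345, 80,
                    ["uplink0", "uplink1", "uplink2", "uplink3"]))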
Hash collisions. [Figure: a small topology with hosts H1-H4 and switches S1-S4, where an elephant flow and a mice flow are hashed onto the same links.] The result: long tail latency for mice flows, and low throughput for elephants if they collide. A problem widely recognized in the community.
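A back-of-the-envelope, birthday-style calculation shows why collisions are common even with many equal-cost paths; it assumes hashes are uniform and independent, which is an idealization rather than anything measured on these slides.

def prob_elephant_collision(num_elephants, num_paths):
    # Probability that at least two elephants are hashed onto the same path.
    if num_elephants > num_paths:
        return 1.0
    p_no_collision = 1.0
    for i in range(num_elephants):
        p_no_collision *= (num_paths - i) / num_paths
    return 1.0 - p_no_collision

# e.g. just 8 concurrent elephants over 48 equal-cost paths collide ~46% of the time
print(prob_elephant_collision(8, 48))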
Data plane solutions
First approach: finer granularity. Per-packet: packet spraying, DRB; flowlets/flowcells: Flare, Presto [Sigcomm 15]. Reordering is a big problem for TCP. Second approach: congestion-aware LB. CONGA [Sigcomm 14], Expeditus.
CONGA. Only for leaf-spine networks. Information collection: per-packet feedback. [Figure: the CONGA system diagram. The sender leaf keeps per-uplink DREs, a Congestion-To-Leaf table, and a flowlet table for its LB decisions; forward-path packets carry an LBTag and a CE field that per-link DREs in the spine update to the maximum congestion along the path; the receiver leaf records this in a Congestion-From-Leaf table and feeds it back to the sender in reverse-path packets (FB_LBTag, FB_Metric). Source: the CONGA paper, Sigcomm 2014.]
CONGA path selection: for a new flowlet, pick the uplink port that minimizes the maximum load over the two links of the path. Limitation: it only works for leaf-spine networks. Can we extend it to 3-tier Clos? Recall that CONGA relies on per-path feedback.
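A hedged Python sketch of this decision at a sender leaf, combining flowlet detection with the min-max rule; the table contents, the 500 us flowlet gap, and the names (dre_local, congestion_to_leaf) are illustrative assumptions, not CONGA's actual code.

import time

FLOWLET_GAP = 0.0005   # 500 microseconds of inactivity starts a new flowlet (illustrative)

flowlet_table = {}     # flow five-tuple -> (chosen uplink, time of last packet)
dre_local = {0: 0.2, 1: 0.7, 2: 0.1, 3: 0.4}           # local per-uplink load (DRE output)
congestion_to_leaf = {                                  # remote metric fed back per (dest leaf, uplink)
    ("leafB", 0): 0.6, ("leafB", 1): 0.1,
    ("leafB", 2): 0.8, ("leafB", 3): 0.3,
}

def pick_uplink(flow, dest_leaf, now=None):
    now = time.monotonic() if now is None else now
    uplink, last = flowlet_table.get(flow, (None, 0.0))
    if uplink is None or now - last > FLOWLET_GAP:
        # New flowlet: choose the uplink minimizing the max of local and remote congestion.
        uplink = min(dre_local, key=lambda u: max(dre_local[u],
                                                  congestion_to_leaf[(dest_leaf, u)]))
    flowlet_table[flow] = (uplink, now)
    return uplink

print(pick_uplink(("10.0.0.1", "10.0.1.2", 6, 12345, 80), "leafB"))   # -> uplink 3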
The answer is no. [Figure: the general 3-tier Clos topology again.] There are nm paths between two ToRs in different pods, so each ToR would need to maintain state for O(nmpr) paths: 221,184 paths for FB's Altoona DC.
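A quick check of that number, using the parameters quoted earlier (n = 4 core planes, m = 12 core switches per plane, r = 48 ToRs per pod); the pod count is not stated on these slides, and p = 96 is simply the value consistent with the 221,184 figure.

n, m, r = 4, 12, 48
p = 96                                    # assumption, back-derived from the quoted total
paths_per_tor_pair = n * m                # 48 core-level paths between ToRs in different pods
path_state_per_tor = n * m * p * r        # O(nmpr) entries if a ToR tracked every path
print(paths_per_tor_pair, path_state_per_tor)   # -> 48 221184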
Expeditus. Information collection: do local congestion monitoring at each switch. Two-stage path selection: select the path stage by stage in the 3-tier Clos network, choosing aggregation switches in the first stage and core switches in the second.
Information collection. Each switch performs local congestion monitoring for all uplinks to the upstream tier. [Figure: ToR and aggregation switches keep a congestion table with per-port ingress and egress loads for their uplinks.]
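A hedged sketch of one way to keep such a per-port load estimate, modeled on the DRE (discounting rate estimator) element that appears in the Click implementation later in this talk; the constants and the class name are illustrative assumptions.

class DRE:
    """Per-port load estimator: bytes counted per packet, decayed periodically."""
    def __init__(self, link_capacity_bps, alpha=0.1, period_s=0.0001, bits=3):
        self.x = 0.0
        self.alpha = alpha
        self.period = period_s
        self.bits = bits
        # bytes the link can carry over the estimator's time constant (period / alpha)
        self.scale = link_capacity_bps / 8.0 * period_s / alpha

    def on_packet(self, size_bytes):      # data-plane update, on every packet
        self.x += size_bytes

    def on_timer(self):                   # called every `period` seconds
        self.x *= (1.0 - self.alpha)

    def quantized_load(self):             # the few-bit value stored in congestion tables
        frac = min(self.x / self.scale, 1.0)
        return round(frac * (2 ** self.bits - 1))

dre = DRE(link_capacity_bps=10e9)
for _ in range(100):
    dre.on_packet(1500)
print(dre.quantized_load())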
Two-stage path selection. Selection starts when the first packet of a flow (the Exp-ping) reaches its ToR. A dedicated Ethernet tag carries three fields: SF (stage flag: 0 for the first stage, 1 for the second), NoE (number of entries), and CD (congestion data). Each switch keeps a Path Selection Table (PST) with entries of the form (flow ID, egress port, PSS, valid), where PSS is 1 if path selection has completed (0 otherwise) and valid indicates whether the entry is valid (1) or not (0).
Two-stage path selection, first stage: the Exp-ping (the flow's SYN, tagged with SF=0) reaches the source ToR switch, which copies its egress loads for the two uplinks (port 1: 2, port 2: 5) into the CD field, sets NoE=2, and creates a PST entry for flow A with PSS=0.
First stage: the Exp-ping reaches the destination ToR switch, which compares, for each uplink, the maximum of the carried CD value and its local ingress load: max(2, 6) = 6 for port 1 versus max(5, 0) = 5 for port 2, so port 2 is chosen.
First stage: at the destination ToR switch, the tag is removed from the Exp-ping (which continues as a normal SYN), and an Exp-pong is generated by copying the Exp-ping's header, reversing the source and destination addresses, and setting SF=1, NoE=0.
Second stage: the Exp-pong reaches the aggregation switch in the destination pod, which writes its ingress loads from the core tier (port 1: 3, port 2: 7) into the CD field and sets NoE=2.
Second stage: the Exp-pong reaches the aggregation switch in the source pod, which compares, for each core-facing port, the maximum of the carried CD value and its local egress load: max(3, 3) = 3 for port 1 versus max(7, 5) = 7 for port 2, so port 1 is chosen.
Second stage: the source-pod aggregation switch inserts a new PST entry for flow A: egress port 1, PSS=1, valid=1.
Second stage: the Exp-pong reaches the source ToR switch, which inserts a PST entry for flow A with egress port 2, PSS=1, valid=1, and then discards the Exp-pong.
Two-stage path selection: the end-to-end path is now decided; the source ToR forwards flow A on egress port 2 and the source-pod aggregation switch forwards it on egress port 1.
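A minimal Python sketch of the comparison rule the walkthrough uses at both stages: take the per-port maximum of the carried congestion data and the local load, then pick the smallest. The function name and the example numbers (which match the slides above) are for illustration only.

def choose_port(carried_cd, local_loads):
    """carried_cd[port] and local_loads[port] are the loads for each candidate uplink."""
    scores = {port: max(cd, local_loads[port]) for port, cd in carried_cd.items()}
    return min(scores, key=scores.get)

# Stage 1 (at the destination ToR): CD = source ToR egress loads, local = ingress loads.
aggr_port = choose_port(carried_cd={1: 2, 2: 5}, local_loads={1: 6, 2: 0})   # -> 2

# Stage 2 (at the source-pod aggregation switch): CD = destination-pod aggregation
# switch's ingress loads from the cores, local = its own egress loads to the cores.
core_port = choose_port(carried_cd={1: 3, 2: 7}, local_loads={1: 3, 2: 5})   # -> 1

print(aggr_port, core_port)   # PST entries for these ports are installed on the reverse path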
Handling failures. [Figure: aggregation switch f_1^1 loses its link to core switch s_1^1, so each uplink to f_1^1 can achieve at most 50% of its capacity toward the core.] Expeditus needs to avoid selecting f_1^1 in the first stage of path selection.
Handling failures. To make ToRs less likely to choose uplinks to f_1^1, each egress load is scaled by a per-port link-load multiplier before being written into the CD of the Exp-ping: here port 1 (toward f_1^1) has multiplier 2 and port 2 has multiplier 1, so the advertised loads become 3 x 2 = 6 and 5 x 1 = 5.
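A tiny sketch of that multiplier, reusing the numbers on this slide; the helper and variable names are illustrative assumptions.

def advertised_loads(egress_load, multiplier):
    # Scale each uplink's measured load by its failure-aware multiplier before
    # it is carried in the Exp-ping's CD field.
    return {port: egress_load[port] * multiplier[port] for port in egress_load}

egress_load = {1: 3, 2: 5}
multiplier = {1: 2, 2: 1}          # port 1 leads to the aggr switch with the failed core link
cd = advertised_loads(egress_load, multiplier)
print(cd, min(cd, key=cd.get))     # {1: 6, 2: 5}; port 2 now looks less congested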
Implementation: Click router. We prototype Expeditus in Click, a modular software router for fast prototyping of protocols, adding two new Click elements: DRE and EXPRoute. DRE measures ingress/egress link load, with a processing overhead of ~151 ns/packet; EXPRoute conducts the two-stage path selection, with a processing overhead of ~473 ns/packet. [Figure: the packet processing pipeline for a 4-pod fat-tree: FromDevice elements feed DREs, then LookupIPRoute/EXPRoute, then per-port output queues and ToDevice elements.]
Testbed experiments. Small-scale 3-tier Clos network on Emulab, web search traffic workload, 5 runs for each data point. [Figure: topology with a hot spot on an aggr-core link.] Flows sent from t_1^3 and t_1^4 to t_3^2 and t_2^4 encounter hot spots at the aggr-core link.
Testbed experiments. Same setup: small-scale 3-tier Clos network on Emulab, web search traffic workload, 5 runs for each data point. [Figure: topology with a hot spot on a ToR-aggr link.] Flows sent from t_1^3 to t_3^2 encounter hot spots at the ToR-aggr link.
Large-scale simulations. Topology and traffic workloads: a 12-pod 10G fat-tree with 36 equal-cost paths and 864 hosts; a 10G leaf-spine fabric with 128 hosts (8 leaf switches, 8 spine switches); realistic traffic workloads (web search and data mining). Schemes compared: Expeditus; Ideal, an idealized scheme that uses complete global congestion information to load balance flows; ECMP, the baseline scheme; and CONGA-Flow, per-flow CONGA.
Performance in the 3-tier Clos network: 12-pod fat-tree with 2:1 oversubscription at the ToR tier, web search workload. [Plots: results for small flows (<100KB) and large flows (>1MB); annotated figures: 50% and 16%~34%.] The performance of Expeditus approaches the Ideal scheme.
Performance in the 3-tier Clos network: 12-pod fat-tree with 2:1 oversubscription at the ToR tier, data mining workload. [Plots: results for small flows (<100KB) and large flows (>1MB); annotated figures: 46% and 26%~32%.] The performance of Expeditus again approaches the Ideal scheme.
Comparison with CONGA-Flow: leaf-spine with 2:1 oversubscription at the leaf tier. [Plots: averages over all flows for the web search and data mining workloads.] Expeditus performs better than the state-of-the-art scheme CONGA-Flow.
Impact of link failure: reduction by Ideal and Expeditus over ECMP (web search workload, load 0.5). [Plots: aggr-core link failure and ToR-aggr link failure.] Expeditus still provides moderate performance gains.
Control plane solutions
Centralized optimization. Usually done in an SDN environment with per-flow control. Advantage: global network state improves efficiency. Disadvantages: slow (minutes), and it can only handle a small number of flows. The common approach: detect elephant flows, compute an LB solution via some optimization, and dispatch the resulting rules to OpenFlow switches. Hedera (NSDI 2010), ElasticTree (NSDI 2010).
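A hedged sketch of the placement step of that workflow, written as a simple greedy fit of detected elephants onto candidate paths against residual link capacities; this is a generic illustration under made-up names, not Hedera's or ElasticTree's actual algorithm.

def place_elephants(elephants, paths, link_capacity):
    """elephants: {flow: demand}; paths: {flow: [list of links, ...]}; same units throughout."""
    residual = dict(link_capacity)
    placement = {}
    for flow, demand in sorted(elephants.items(), key=lambda kv: -kv[1]):
        for path in paths[flow]:
            if all(residual[link] >= demand for link in path):
                for link in path:
                    residual[link] -= demand
                placement[flow] = path       # would be pushed to switches as OpenFlow rules
                break
    return placement

flows = {"f1": 4, "f2": 3}                               # detected elephants and their demands
paths = {"f1": [["a", "c1", "b"], ["a", "c2", "b"]],     # two equal-cost paths per flow
         "f2": [["a", "c1", "b"], ["a", "c2", "b"]]}
caps = {"a": 10, "b": 10, "c1": 5, "c2": 5}
print(place_elephants(flows, paths, caps))               # f1 via c1, f2 via c2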
Coflow LB. Applications like Spark launch a group of (elephant) flows for a computation task, and usually all of them need to finish before the task can proceed: such a group is a coflow. It makes sense to do LB at the granularity of coflows rather than individual flows. Here LB includes both routing and rate control, just like traffic engineering for the WAN.
A toy example. [Figure: allocation of ingress port capacities over time (ports P1-P3) for two coflows C1 and C2 under four mechanisms: (a) per-flow fairness, with average FCT 3.4 and CCT 4 time units; (b) per-flow prioritization, with average FCT 2.8 and CCT 3.5; (c) Weighted Shuffle Scheduling (WSS), with average FCT 3.6 and CCT 4; (d) the optimal schedule, with average FCT 3 and CCT 3.] Source: M. Chowdhury et al., Efficient Coflow Scheduling with Varys, Proc. ACM Sigcomm, 2014.
Coflow scheduling. First studied in Chowdhury's Sigcomm 2014 Varys paper. Assume coflow information, including flow sizes, endpoints, and arrival times, is known, and abstract the network as one huge non-blocking switch. The problem is to decide when to start the flows and at what rate to serve them so as to minimize the average CCT of the cluster. It is NP-hard even in the offline case where all coflows arrive at the same time.
Coflow scheduling in Varys. When to start a coflow: the smallest-effective-bottleneck-first heuristic, mimicking smallest-first scheduling for minimizing average FCT. Calculate each coflow's CCT using the remaining bandwidth at the first and last hops, and start the coflow with the smallest CCT first. At what rate: allocate just enough bandwidth that all flows of the coflow finish at the same time.
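A hedged Python sketch of these two steps under the non-blocking-switch abstraction; the data structures and the small example are simplified illustrations (for instance, coflows arriving over time are ignored), not Varys's actual implementation.

def effective_bottleneck(flows, ingress_bw, egress_bw):
    """flows: list of (src_port, dst_port, bytes_remaining); bandwidths in bytes/sec.
    Returns the coflow's CCT if it could use all remaining first/last-hop bandwidth."""
    send, recv = {}, {}
    for s, d, size in flows:
        send[s] = send.get(s, 0) + size
        recv[d] = recv.get(d, 0) + size
    return max(max(v / ingress_bw[p] for p, v in send.items()),
               max(v / egress_bw[p] for p, v in recv.items()))

def schedule_next(coflows, ingress_bw, egress_bw):
    # Smallest effective bottleneck first: the coflow analogue of smallest-first.
    name, flows = min(coflows.items(),
                      key=lambda kv: effective_bottleneck(kv[1], ingress_bw, egress_bw))
    cct = effective_bottleneck(flows, ingress_bw, egress_bw)
    # Give each flow just enough rate to finish exactly at the coflow's CCT.
    rates = {(s, d): size / cct for s, d, size in flows}
    return name, cct, rates

coflows = {"C1": [("P1", "P2", 4), ("P1", "P3", 1)],
           "C2": [("P2", "P3", 2), ("P3", "P2", 2)]}
bw = {"P1": 1.0, "P2": 1.0, "P3": 1.0}
print(schedule_next(coflows, bw, bw))   # C2 (bottleneck 2) starts before C1 (bottleneck 5)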
Coflow scheduling and LB. Varys assumes the network bottleneck is always at the edge, but if LB is not done right, the bottleneck may instead appear in the core. To achieve optimal CCT, we have to jointly consider scheduling and LB. We can largely reuse Varys and update the CCT calculation to account for LB. For a single coflow, the joint problem can be formulated as an optimization and solved using heuristics: RAPIER, INFOCOM 2015.
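One plausible way to write that single-coflow joint routing and rate-allocation problem, assuming flow k has size V_k, a candidate path set P_k, and each link l has capacity c_l; this is a generic formulation for illustration, not necessarily RAPIER's exact model.

\begin{align*}
\min_{T,\; r_k,\; x_{k,p}} \quad & T \\
\text{s.t.} \quad & V_k \le r_k T, && \forall k && \text{(every flow finishes by the CCT } T\text{)} \\
& \sum_{p \in P_k} x_{k,p} = 1, && \forall k && \text{(each flow picks one path)} \\
& \sum_{k} \sum_{p \in P_k:\, l \in p} x_{k,p}\, r_k \le c_l, && \forall l && \text{(link capacities)} \\
& x_{k,p} \in \{0,1\}, \quad r_k \ge 0.
\end{align*}

The product of the path variable x_{k,p} and the rate r_k makes the problem non-convex, which is one reason heuristics are used in practice.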
Interesting open questions. What if we don't know the coflow information at all? What about mice coflows? We can't use centralized optimization for them, but is there a way to exploit application semantics to improve their performance?
To wrap up. This talk gave an overview of research on data center networks and a more in-depth, but still high-level, introduction to load balancing. Many interesting problems are still under-explored, but the expectations are high: an intuitive and exciting idea, a practical solution, and a solid evaluation using prototypes.
Thank you! Henry Xu, City University of Hong Kong