Load Balancing in Data Center Networks



Load Balancing in Data Center Networks. Henry Xu, Computer Science, City University of Hong Kong. HKUST, March 2, 2015.

Background: (Figure: partition-aggregate structure with aggregators and workers.) We need low latency for a huge number of short query/response flows, and especially low tail latency (e.g., at the 99th percentile).

Background: Data centers use multi-stage Clos topologies such as fat-tree and spine-leaf, which provide multiple equal-cost paths between any pair of hosts. How should we load balance across them? Today's practice is ECMP, which is local and random, and it causes lots of problems.
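
For reference, ECMP picks a path by hashing each flow's five-tuple and indexing into the list of equal-cost next hops, so every packet of a flow follows the same path regardless of congestion. A minimal sketch of that behavior (the hash function and field names are illustrative, not any vendor's implementation):

```javascript
// Minimal sketch of per-flow ECMP: hash the five-tuple, pick one next hop.
// Hash choice and field names are illustrative, not a switch implementation.
const crypto = require('crypto');

function ecmpNextHop(flow, nextHops) {
  // flow = { srcIP, dstIP, srcPort, dstPort, proto }
  const key = `${flow.srcIP}|${flow.dstIP}|${flow.srcPort}|${flow.dstPort}|${flow.proto}`;
  const digest = crypto.createHash('md5').update(key).digest();
  const index = digest.readUInt32BE(0) % nextHops.length;
  return nextHops[index];   // same flow -> same path, regardless of congestion
}

// Two elephant flows can hash to the same next hop and collide on one link.
const path = ecmpNextHop(
  { srcIP: '10.0.1.2', dstIP: '10.0.3.4', srcPort: 40123, dstPort: 80, proto: 'tcp' },
  ['aggr1', 'aggr2', 'aggr3', 'aggr4']
);
console.log(path);
```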

Background: Current data center transport is ill-suited to the task. An RTT measurement in EC2 us-west-2c with 100K samples shows a mean RTT of 0.5 ms but a 99th-percentile RTT of 17 ms, corroborated by measurements from many existing papers.

Background: The culprit is that ECMP is static and agnostic to congestion. (Figure: an elephant flow and a mice flow hashed onto the same path.) Tail latency becomes even worse when elephants collide on the same path under ECMP.

Our quest: How can we improve load balancing in data center networks? The solution must be scalable enough to handle millions of mice flows traversing numerous links, and smart enough to avoid congestion in the network dynamically.

Our answer: A patch solution, RepNet, an application-layer transport that can be implemented today (INFOCOM 2014; under submission), and a fundamental solution, Expeditus, a distributed, congestion-aware load balancing protocol to replace ECMP (CoNEXT 2014 student workshop best paper; ongoing work).

Chapter I: RepNet

RepNet in a nutshell: Replicate each mice flow to exploit multipath diversity. (Figure: an elephant flow, a mice flow, and its replicated copy taking different paths.) No two paths are loaded exactly the same; this is the power of two choices (M. Mitzenmacher), and Clos-based topologies provide many equal-cost paths to choose from.

RepNet's design: Which flows? Those smaller than 100KB, consistent with many existing papers. When? Always (we will come back to the overhead issue). How? RepFlow replicates every byte of the flow; RepSYN replicates only the SYN packets and keeps the quicker connection.
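
As a rough illustration of the RepSYN idea, here is a sketch over Node's standard net module (not the RepNet module itself; repSynConnect is a hypothetical helper): open two TCP connections, whose different source ports usually hash onto different ECMP paths, and keep whichever completes its handshake first.

```javascript
// Sketch of RepSYN using Node's standard 'net' module: race two connections
// and keep whichever completes its handshake first.
const net = require('net');

function repSynConnect(port, host, callback) {
  let done = false;
  let errors = 0;
  const sockets = [net.connect(port, host), net.connect(port, host)];

  sockets.forEach((sock, i) => {
    sock.once('connect', () => {
      if (done) { sock.destroy(); return; }  // lost the race: drop this copy
      done = true;
      sockets[1 - i].destroy();              // keep only the quicker connection
      callback(null, sock);
    });
    sock.once('error', (err) => {
      errors += 1;
      if (!done && errors === sockets.length) { done = true; callback(err); }
    });
  });
}

// RepFlow, by contrast, would keep both connections open and write every byte
// of the flow to both, letting the receiver take whichever copy finishes first.
```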

Is RepNet effective?

Simplified queueing analysis: Let ρ be the effective load on each of paths 1 through n, and let ε be the fraction of total bytes contributed by mice flows (ε < 0.1). Choosing 1 path, a flow sees an effective load of ρ. Choosing 2 paths, each path's load grows to (1+ε)ρ, but the replicated flow experiences the less congested of the two, i.e., an effective load of roughly ((1+ε)ρ)².
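
A quick numeric check of the expressions above (the numbers are illustrative, not taken from the paper):

$$\rho = 0.6,\ \varepsilon = 0.1:\qquad \text{one path: } \rho = 0.6;\qquad \text{two paths: } \big((1+\varepsilon)\rho\big)^2 = 0.66^2 \approx 0.44.$$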

Packet-level NS-3 simulations. Topology: 16-pod, 1Gbps fat-tree with 1,024 hosts. Traffic pattern: Poisson arrivals, random source/destination pairs, 0.5 seconds' worth of traffic. Flow size distributions: the web search cluster from the DCTCP paper (>95% of bytes come from the 30% of flows larger than 1MB) and the data mining cluster from the VL2 paper (not shown here; >95% of bytes come from the 3.6% of flows larger than 35MB).

Benchmarks: TCP (NewReno, 12KB initial window, DropTail queues with 100-packet buffers); RepFlow; DCTCP (source code from the authors of D2TCP); RepFlow-DCTCP (RepFlow on top of DCTCP); and pFabric (state of the art, near-optimal FCT with priority queueing; source code obtained from the authors).

Results [1/4]: Mean FCT of mice flows (<100KB). (Figure: replication reduces mean FCT by roughly 40%-45%.)

Results [2/4]: 99th-percentile FCT of mice flows (<100KB). (Figure: replication reduces the tail FCT by more than 60%.)

Results [4/4]: Mean FCT of elephant flows (>=100KB): the replication overhead has no impact. Measured replication overhead by load: 0.1: 3.45%; 0.2: 2.78%; 0.3: 3.13%; 0.4: 3.38%; 0.5: 3.29%; 0.6: 3.47%; 0.7: 3.22%; 0.8: 3.27%.

Is RepNet really effective?

Implementation: Based on Node.js, a highly scalable platform for real-time server-side networked applications: single-threaded, non-blocking sockets, event-driven. It is widely used in industry for both front-end and back-end services.

Implementation: RepNet is a module built on net, Node's standard library for non-blocking sockets. Applications only need to change one line: require('net') becomes require('repnet'). RepNet.Socket gives applications a single socket abstraction while maintaining two underlying TCP sockets; RepNet.Server provides functions for listening for and managing both replicated and regular TCP connections.
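
A minimal usage sketch of the drop-in change. The createServer/connect calls are assumed to mirror Node's net API, as the slide implies; the host, port, and echo logic are placeholders:

```javascript
// Drop-in use of RepNet in place of Node's 'net' module: application code is
// unchanged except for the require() line. (API assumed to mirror 'net'.)
const net = require('repnet');   // was: require('net')

// Server side: accepts both replicated and regular TCP connections.
const server = net.createServer((socket) => {
  socket.on('data', (chunk) => socket.write(chunk));   // simple echo
});
server.listen(9000);

// Client side: one socket abstraction backed by two underlying TCP sockets.
const client = net.connect(9000, '10.0.1.2', () => {
  client.write('query');
});
client.on('data', (resp) => {
  console.log('response:', resp.toString());
  client.end();
});
```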

Testbed evaluation: (Figure: testbed topology with 3 core switches, 2 ToR switches, and 2 racks.) Pronto 3295 switches, 1Gbps links, oversubscribed at 2:1; ping RTT of 178us across racks; flow size distribution from the DCTCP paper.

Testbed evaluation: RepFlow and RepSYN significantly improve tail FCT when load is high in an oversubscribed network, with RepFlow being the more beneficial of the two.

More results: application-level performance using RepNet, and Mininet emulation with a 6-pod fat-tree. All source code and experiment scripts are online: https://bitbucket.org/shuhaoliu/repnet and https://bitbucket.org/shuhaoliu/repnet_experiment

Recap. Takeaway: RepNet is a practical and effective application-layer, low-latency transport, with an open-source implementation and experimental evaluation. It is a patch solution for the short term.

Chapter II: Expeditus

How do we build a distributed, congestion-aware load balancing protocol for a large-scale data center network? The naive solution is to track congestion information for all possible paths, but this simply cannot scale.

Per-path state isn't scalable: in a k-pod fat-tree there are k²/4 paths between edge switches of distinct pods, and an edge switch talks to k²/2 - k/2 edge switches in distinct pods, so each edge switch would need to track O(k⁴) paths!
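
Plugging in k = 16 (the pod count used later in the evaluation) makes the blow-up concrete:

$$\frac{k^2}{4}\times\left(\frac{k^2}{2}-\frac{k}{2}\right)\Big|_{k=16} = 64 \times 120 = 7{,}680 = \Theta(k^4)\ \text{paths tracked per edge switch.}$$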

Design: (1) one-hop congestion information collection, where each edge and aggregation switch maintains congestion information for only k ports in a k-pod fat-tree, and (2) two-stage path selection.

One-hop info collection: Northbound congestion information can be obtained locally by polling the buffer occupancy of egress ports.

One-hop info collection: Southbound congestion information must be carried by piggybacking on packets: aggregation switches collect congestion information coming from core switches, and edge switches collect congestion information coming from aggregation switches.
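
A sketch of the per-switch state this implies (an illustrative data structure, not the actual switch implementation): one congestion value per upstream port, refreshed either from the switch's own egress queues (northbound) or from fields piggybacked on packets arriving from the layer above (southbound).

```javascript
// Illustrative per-switch congestion table for one-hop state: k entries in a
// k-pod fat-tree, refreshed from local egress-queue polling (northbound) or
// from values piggybacked on packets coming down from above (southbound).
class CongestionTable {
  constructor(numPorts) {
    this.north = new Array(numPorts).fill(0);  // local egress buffer occupancy
    this.south = new Array(numPorts).fill(0);  // piggybacked from the layer above
  }
  pollEgress(port, queueBytes) {    // northbound: poll own egress buffer
    this.north[port] = queueBytes;
  }
  onPiggyback(port, congestion) {   // southbound: carried in a packet header field
    this.south[port] = congestion;
  }
}
```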

Two-stage path selection: The SYN packet carries the congestion information held at the source edge switch to the destination edge switch; the destination edge switch then chooses the aggregation switch with the least combined congestion on the first and last hops. (Figure: example per-port congestion values along the path from src to dst.)

Two-stage path selection: The SYN-ACK packet carries the congestion information held at the destination aggregation switch back to the source aggregation switch; the source aggregation switch then chooses the core switch with the least combined congestion on the second and third hops. (Figure: example per-port congestion values along the path.)

Two-stage path selection: A complete path is assembled from the selected aggregation and core switches and stored in the host's flow routing table; IP-in-IP encapsulation enforces the source routing.
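
Putting the two stages together, here is a sketch of the selection logic. All helper and variable names are hypothetical; the actual protocol carries these values in the SYN and SYN-ACK packets as described above.

```javascript
// Sketch of the two-stage path selection for a new flow (illustrative only).
function argmin(costs) {
  let best = 0;
  for (let i = 1; i < costs.length; i++) if (costs[i] < costs[best]) best = i;
  return best;
}

// Stage 1, at the destination edge switch: srcEdgeNorth[] arrived piggybacked
// in the SYN (hop-1 congestion); dstEdgeSouth[] is local southbound state
// (hop-4 congestion). Pick the aggregation switch with the smallest sum.
function pickAggr(srcEdgeNorth, dstEdgeSouth) {
  return argmin(srcEdgeNorth.map((c, i) => c + dstEdgeSouth[i]));
}

// Stage 2, at the source aggregation switch: dstAggrSouth[] arrived piggybacked
// in the SYN-ACK (hop-3 congestion); srcAggrNorth[] is local northbound state
// (hop-2 congestion). Pick the core switch with the smallest sum.
function pickCore(srcAggrNorth, dstAggrSouth) {
  return argmin(srcAggrNorth.map((c, j) => c + dstAggrSouth[j]));
}

// Example with hypothetical congestion values: combined costs are [5, 13],
// so the first aggregation switch is chosen.
console.log(pickAggr([3, 5], [2, 8]));   // -> 0

// The chosen (aggr, core) pair is assembled into a full path, stored in the
// host's flow routing table, and enforced with IP-in-IP encapsulation.
```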

Preliminary evaluation: NS-3 simulation with a 16-pod fat-tree (1,024 hosts), oversubscribed at 4:1, using the DCTCP flow size distribution. (Figures: mean FCT and 99th-percentile FCT.)

Implementation is ongoing: a Click software router implementation (alongside CONGA), with experiments on a fat-tree on Emulab using 20 PCs with 5 NICs each as Expeditus switches and 16 PCs with 1 NIC each as hosts. https://www.emulab.net/showproject.php3?pid=expeditus

Related work: Reducing (tail) latency in data center networks is an important problem. Approaches that reduce queue length include DCTCP (2010) and HULL (2012); approaches that prioritize scheduling include D3 (2011), PDQ (2012), DeTail (2012), and pFabric (2013). They all require modifications to end hosts and/or switches, making them difficult to deploy in practice.

Thank you! Henry Xu, City University of Hong Kong.