CONGA: Distributed Congestion-Aware Load Balancing for Datacenters. Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam, Francis Matus, Rong Pan, Navindra Yadav, George Varghese 1
Motivation: DC networks need large bisection bandwidth for distributed apps (big data, HPC, web services, etc.). Single-rooted tree (Core / Agg / Access, 1000s of server ports): high oversubscription 2
Motivation: DC networks need large bisection bandwidth for distributed apps (big data, HPC, web services, etc.). Multi-rooted tree [Fat-tree, Leaf-Spine, …] (Spine / Leaf, 1000s of server ports): full bisection bandwidth, achieved via multipathing 3
Multi-rooted ≠ Ideal DC Network. Ideal DC network: a big output-queued switch (1000s of server ports): no internal bottlenecks → predictable; simplifies BW management [EyeQ, FairCloud, pfabric, Varys, …] 5
Multi-rooted tree: can't build the big switch; possible bottlenecks inside the fabric → need precise load balancing 6-7
Today: ECMP Load Balancing. Pick among equal-cost paths by a hash of the 5-tuple (H(f) % 3 = 0): approximates Valiant load balancing, preserves packet order 8
Problems: hash collisions (coarse granularity); local & stateless (very bad with asymmetry due to link failures) 9-10
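A minimal sketch of the hashing step described above (Python; CRC32 stands in for whatever hash function a real switch ASIC uses, and all names are illustrative):

```python
import zlib

def ecmp_path(src_ip, dst_ip, proto, src_port, dst_port, n_paths):
    """Hash the flow's 5-tuple to one of n_paths equal-cost uplinks."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_paths

# Every packet of a flow takes the same path (order preserved), but two
# elephant flows can collide on one uplink while another uplink sits idle,
# and the choice ignores congestion entirely.
path = ecmp_path("10.0.0.1", "10.0.1.9", 6, 5000, 80, 3)
```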
Dealing with Asymmetry Handling asymmetry needs non-local knowledge 11
Dealing with Asymmetry: throughput of each scheme in the asymmetric example (a 30G UDP flow competing with TCP flows; per-link labels from the diagram omitted):
  ECMP (local, stateless): 60G
  Local congestion-aware: 50G (interacts poorly with TCP's control loop)
  Global congestion-aware: 70G
Global CA > ECMP > Local CA; local congestion-awareness can be worse than ECMP 14-20
Global Congestion-Awareness (in Datacenters). Opportunities: microsecond latencies; simple, regular topology. Challenge: volatile, bursty traffic. Need a scheme that is simple & stable yet responsive. Key insight: use extremely fast, low-latency distributed control 21-22
CONGA in 1 Slide: 1. Leaf switches (top-of-rack) track congestion to other leaves on different paths in near real-time. 2. Use greedy decisions to minimize bottleneck utilization. Fast feedback loops between leaf switches (L0, L1, L2), directly in the dataplane 23
CONGA'S DESIGN 24
Design: CONGA operates over a standard DC overlay (VXLAN), already deployed to virtualize the physical network. VXLAN encap: L0→L2 carries H1→H9 (leaves L0, L1, L2; hosts H1-H9) 25
Design: Leaf-to-Leaf Feedback. Track path-wise congestion metrics (3 bits) between each pair of leaf switches in a Congestion-To-Leaf table @L0 (rows: dest leaves L1, L2; columns: paths 0-3). A Rate Measurement Module measures link utilization 26
Design: Leaf-to-Leaf Feedback (cont.). A packet L0→L2 on path 2 starts with CE=0; each hop updates pkt.ce ← max(pkt.ce, link.util), so it arrives with CE=5. The destination leaf records this in its Congestion-From-Leaf table @L2 (src leaf L0, path 2, metric 5) 27
Design: Leaf-to-Leaf Feedback (cont.). On reverse traffic L2→L0, the metric is piggybacked (FB-Path=2, FB-Metric=5) and installed in the Congestion-To-Leaf table @L0 28
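A toy model of this feedback loop (Python; the class and table names are mine, not from the CONGA hardware, and the link utilizations are chosen so the packet arrives with CE=5 as on the slide):

```python
def quantize(util):
    """Map link utilization in [0, 1] to a 3-bit CE value (0..7)."""
    return min(7, int(util * 8))

class Packet:
    def __init__(self, path):
        self.path = path
        self.ce = 0  # congestion extent, updated hop by hop

def traverse(pkt, link_utils):
    """Every link on the path raises CE: pkt.ce <- max(pkt.ce, link.util)."""
    for u in link_utils:
        pkt.ce = max(pkt.ce, quantize(u))
    return pkt

# A packet from L0 to L2 on path 2 crosses links at 10%, 65%, 30% utilization.
pkt = traverse(Packet(path=2), [0.10, 0.65, 0.30])

# The destination leaf (L2) records the metric per (source leaf, path) ...
congestion_from_leaf = {("L0", pkt.path): pkt.ce}
# ... and piggybacks it on reverse traffic so the source leaf (L0) can
# update its own Congestion-To-Leaf table.
congestion_to_leaf = {("L2", pkt.path): congestion_from_leaf[("L0", pkt.path)]}
```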
Design: LB Decisions. Send each flowlet [Kandula et al. 2007] on the least congested path. Congestion-To-Leaf table @L0: to L1, paths 0-3 have metrics 5, 3, 7, 2 → p* = 3; to L2, metrics 1, 1, 5, 4 → p* = 0 or 1 29
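The decision itself is just an argmin over that table. A sketch using the metric values from this slide (the table layout and function names are mine):

```python
# Congestion-To-Leaf table at L0: one metric per (dest leaf, path),
# with the values shown on the slide.
to_leaf_table = {
    "L1": [5, 3, 7, 2],   # paths 0..3
    "L2": [1, 1, 5, 4],
}

def best_paths(dst_leaf, table):
    """Return all paths whose congestion metric is minimal (ties allowed)."""
    metrics = table[dst_leaf]
    lo = min(metrics)
    return [p for p, m in enumerate(metrics) if m == lo]

# New flowlets to L1 go on path 3; flowlets to L2 pick among paths 0 and 1.
```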
Why is this Stable? Stability usually requires a sophisticated control law (e.g., TeXCP, MPTCP, etc.) 30
Why is this Stable? Feedback latency between source leaf and dest leaf is a few microseconds; adjustments happen at flowlet arrivals. Near-zero latency + flowlets → stable 31
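Flowlet switching (Kandula et al.) can be sketched as a table keyed by flow that re-picks a path only after an idle gap; the gap value, class, and method names below are illustrative, not CONGA's actual parameters:

```python
import time

# Idle gap after which a flow may safely change paths. It must exceed the
# latency skew between paths so packets cannot be reordered; 500 us here
# is an illustrative value, not CONGA's configured threshold.
FLOWLET_GAP = 500e-6

class FlowletTable:
    def __init__(self):
        self.last_seen = {}  # flow id -> (last packet timestamp, path)

    def path_for(self, flow, pick_path, now=None):
        """Reuse the flowlet's current path; after an idle gap, pick afresh."""
        now = time.monotonic() if now is None else now
        entry = self.last_seen.get(flow)
        if entry is None or now - entry[0] > FLOWLET_GAP:
            path = pick_path()   # new flowlet: e.g. least congested path now
        else:
            path = entry[1]      # mid-flowlet: stay put, preserve order
        self.last_seen[flow] = (now, path)
        return path
```

Because a new flowlet only starts after the gap, switching it to a different path cannot reorder packets within a flow, which is what lets CONGA adjust at flowlet granularity without hurting TCP.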
How Far is this from Optimal? Given traffic demands [λij], the optimal routing minimizes the maximum link utilization; CONGA's greedy decisions play a bottleneck routing game (Banner & Orda, 2007), and the Price of Anarchy compares its bottleneck utilization to that optimum 32
Worst case. Theorem: PoA of CONGA = 2 33
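One way to write the objective behind this bound (notation mine, following the slide's min/max structure):

```latex
% Given demands [\lambda_{ij}], the optimal routing minimizes the
% utilization of the most-congested (bottleneck) link:
\[
  U^{*} \;=\; \min_{\text{routing}} \; \max_{\ell \in \text{links}} U_{\ell},
  \qquad
  \mathrm{PoA} \;=\; \frac{\max_{\ell} U_{\ell}\ \text{(CONGA)}}{U^{*}} \;\le\; 2 .
\]
```

So even in the worst case, CONGA's bottleneck link is at most twice as utilized as under an optimal centralized routing.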
Implementation: implemented in silicon for Cisco's new flagship ACI datacenter fabric. Scales to over 25,000 non-blocking 10G ports (2-tier Leaf-Spine). Die area: <2% of chip 34
Evaluation. Testbed experiments: 64 servers, 10G switches (32x10G fabric links); realistic traffic patterns (enterprise, data-mining); HDFS benchmark; link failure scenarios. Large-scale simulations: OMNET++, Linux 2.6.26 TCP; varying fabric size, link speed, asymmetry; up to 384-port fabric 35
HDFS Benchmark: 1TB write test, 40 runs (Cloudera hadoop-0.20.2-cdh3u5, 1 NameNode, 63 DataNodes). With no link failure, CONGA is ~2x better than ECMP; link failure has almost no impact with CONGA 36
Decouple DC LB & Transport: the network provides a Big Switch abstraction (TX hosts H1-H9 in, RX hosts H9-H1 out); ingress & egress are managed by transport 37
Conclusion. CONGA: globally congestion-aware LB for DCs, implemented in the Cisco ACI datacenter fabric. Key takeaways: 1. In-network LB is right for DCs. 2. Low latency is your friend; it makes feedback control easy 38
Thank You! 39