Hedera: Dynamic Flow Scheduling for Data Center Networks




Transcription:

Hedera: Dynamic Flow Scheduling for Data Center Networks. Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan*, Nelson Huang, Amin Vahdat. UC San Diego, *Williams College. USENIX NSDI 2010.

Motivation [Figure: multi-rooted data center topology with core, aggregation, and edge layers] Current data center networks support tens of thousands of machines. Limited port-densities at the core routers. Horizontal expansion = increasingly relying on multipathing.

Motivation [Figure: MapReduce workflow: Input Dataset - Map - Shuffle - Reduce - Output Dataset] MapReduce / Hadoop-style workloads have substantial bandwidth requirements. The shuffle phase stresses the network interconnect. Jobs are often bottlenecked by the network (oversubscription / bad forwarding).

Contributions Integration + working implementation of: 1. Centralized data center routing control 2. Flow demand estimation 3. Efficient + fast scheduling heuristics. Enables more efficient utilization of network infrastructure: up to 96% of optimal bisection bandwidth, > 2X better than standard techniques.

Background [Figure: fat-tree with core switches 0-3 and aggregation switches 0-2; Flows A-D illustrating local and downstream collisions] Current industry standard: Equal-Cost Multi-Path (ECMP). Given a packet destined to a subnet with multiple paths, forward it based on a hash of the packet's headers. Originally developed as a wide-area / backbone traffic-engineering tool. Implemented in Cisco / Juniper / HP... etc.

Background [Figure: same topology; hash collisions cause local and downstream flow collisions] ECMP drawback: static + oblivious to link utilization! Causes long-term local/downstream flow collisions. On a 27K-host fat-tree with a randomized traffic matrix, ECMP wastes an average of 61% of bisection bandwidth!

Problem Statement Problem: Given a dynamic traffic matrix of flow demands, how do you find paths that maximize network bisection bandwidth? Constraint: Commodity Ethernet switches + No end-host mods

Problem Statement MULTI-COMMODITY FLOW problem, for $k$ flows with demands $d_i$, sources $s_i$, and destinations $t_i$:
1. Capacity constraint: $\sum_{i=1}^{k} f_i(u,v) \le c(u,v)$
2. Flow conservation: $\sum_{w \in V} f_i(u,w) = 0$ for $u \neq s_i, t_i$, and $f_i(u,v) = -f_i(v,u)$ for all $u, v$
3. Demand satisfaction: $\sum_{w \in V} f_i(s_i,w) = \sum_{w \in V} f_i(w,t_i) = d_i$
Single-path forwarding (no flow splitting), so it is expressed as a Binary Integer Program (BIP): combinatorial, NP-complete. Exact solvers (CPLEX/GLPK) are impractical for realistic networks.

Problem Statement (same constraints as above) Polynomial-time algorithms are known for 3-stage Clos networks (based on bipartite edge-coloring), but none for 5-stage Clos networks (3-tier fat-trees). Need to target arbitrary/general data center topologies!

Architecture 1. Detect Large Flows 2. Estimate Flow Demands 3. Schedule Flows. Hedera: dynamic flow scheduling. Optimizes achievable bisection bandwidth by assigning flows to non-conflicting paths. Uses flow demand estimation + placement heuristics to find good flow-to-core mappings.

Architecture 1. Detect Large Flows 2. Estimate Flow Demands 3. Schedule Flows. The scheduler operates a tight control loop: 1. Detect large flows; 2. Estimate their bandwidth demands; 3. Compute good paths and insert flow entries into the switches.
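A minimal Python skeleton of such a control loop is sketched below. The three step functions are placeholders for the components described on the following slides, and the polling interval is an illustrative assumption, not a value from the paper.

```python
# Skeleton of the scheduler's control loop. The step functions are placeholders
# for the detection, estimation, and placement components described later; the
# polling interval is an illustrative value, not the one from the paper.
import time

def detect_large_flows():
    """Placeholder: poll edge switches and return flows above the threshold."""
    return []

def estimate_demands(large_flows):
    """Placeholder: compute max-min fair demands for the large flows."""
    return {}

def schedule_flows(demands):
    """Placeholder: pick paths and push flow entries to the switches."""
    return {}

def control_loop(interval_s=5.0):
    while True:
        large_flows = detect_large_flows()        # 1. detect large flows
        demands = estimate_demands(large_flows)   # 2. estimate their bandwidth demands
        schedule_flows(demands)                   # 3. compute paths, program switches
        time.sleep(interval_s)
```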

Elephant Detection

Elephant Detection The scheduler continually polls edge switches for flow byte-counts. Flows exceeding a B/s threshold are considered large: > 10% of a host's link capacity in our implementation (i.e., > 100 Mbps). What if there are only small flows? Default ECMP load-balancing is efficient.
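The sketch below illustrates this threshold test in Python, assuming a hypothetical poll_byte_counts(switch) helper that returns cumulative per-flow byte counters; the names and the polling interval are assumptions, not Hedera's actual API.

```python
# Minimal sketch of edge-switch elephant detection. poll_byte_counts(sw) is an
# assumed helper returning {flow_id: cumulative_bytes} for one edge switch.
import time

LINK_CAPACITY_BPS = 1_000_000_000                   # 1 Gbps host links
ELEPHANT_THRESHOLD_BPS = 0.1 * LINK_CAPACITY_BPS    # >10% of host link capacity (100 Mbps)

def detect_elephants(edge_switches, poll_byte_counts, interval_s=0.1):
    """Poll edge switches twice and return flows whose rate exceeds the threshold."""
    before = {sw: poll_byte_counts(sw) for sw in edge_switches}
    time.sleep(interval_s)
    elephants = []
    for sw in edge_switches:
        after = poll_byte_counts(sw)
        for flow_id, bytes_now in after.items():
            delta = bytes_now - before[sw].get(flow_id, 0)
            rate_bps = 8 * delta / interval_s
            if rate_bps > ELEPHANT_THRESHOLD_BPS:
                elephants.append((sw, flow_id, rate_bps))
    return elephants
```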

Elephant Detection Hedera complements ECMP! Default forwarding uses ECMP Hedera schedules large flows that cause bisection bandwidth problems

Demand Estimation

Demand Estimation Motivation: empirical measurements of flow rates are not suitable / sufficient for flow scheduling. Current TCP flow rates may be constrained by inefficient forwarding. Need to find the flows' overall fair bandwidth allocation, to better inform the placement algorithms.

Demand Estimation TCP's AIMD + fair queueing try to achieve max-min fairness in steady state. When routing is a degree of freedom, establishing max-min fair demands is hard. Ideal case: find the max-min fair bandwidth allocation as if flows were constrained only by the host NICs.

Demand Estimation Given the traffic matrix of large flows, modify each flow's size at the source + destination iteratively: 1. Each sender equally distributes unconverged bandwidth among its outgoing flows 2. NIC-limited receivers decrease exceeded capacity equally between incoming flows 3. Repeat until all flows converge. Guaranteed to converge in O(|F|) time.

Demand Estimation (worked example; flows A→X, A→Y, B→Y, C→Y). Step 1, senders distribute bandwidth:

  Flow    Estimate   Conv.?
  A→X     -          -
  A→Y     -          -
  B→Y     -          -
  C→Y     -          -

  Sender   Available BW   Unconv. Flows   Share
  A        1              2               1/2
  B        1              1               1
  C        1              1               1

Demand Estimation Step 2, receivers limit oversubscribed links:

  Flow    Estimate   Conv.?
  A→X     1/2        -
  A→Y     1/2        -
  B→Y     1          -
  C→Y     1          -

  Receiver   RL? (recv-limited)   Non-SL Flows   Share
  X          No                   -              -
  Y          Yes                  3              1/3

Demand Estimation Step 3, senders redistribute the remaining bandwidth:

  Flow    Estimate   Conv.?
  A→X     1/2        -
  A→Y     1/3        Yes
  B→Y     1/3        Yes
  C→Y     1/3        Yes

  Sender   Available BW   Unconv. Flows   Share
  A        2/3            1               2/3
  B        0              0               0
  C        0              0               0

Demand Estimation Step 4, no receiver is oversubscribed; all flows have converged:

  Flow    Estimate   Conv.?
  A→X     2/3        Yes
  A→Y     1/3        Yes
  B→Y     1/3        Yes
  C→Y     1/3        Yes

  Receiver   RL? (recv-limited)   Non-SL Flows   Share
  X          No                   -              -
  Y          No                   -              -
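A minimal Python sketch of this iterative estimator follows, assuming unit-capacity host NICs; the function and field names are illustrative, not taken from the Hedera implementation. On the example above (A→X, A→Y, B→Y, C→Y) it yields the same demands: 2/3, 1/3, 1/3, 1/3.

```python
# Illustrative sketch of iterative demand estimation (unit-capacity host NICs).
# Flow demands are fractions of NIC capacity; all names are assumptions.

def estimate_demands(flows, eps=1e-6):
    """flows: list of dicts {"src", "dst", "demand", "converged"}; mutated in place."""
    changed = True
    while changed:
        changed = False
        # Sender pass: split each sender's unreserved bandwidth equally
        # among its not-yet-converged outgoing flows.
        for src in {f["src"] for f in flows}:
            out = [f for f in flows if f["src"] == src]
            unconv = [f for f in out if not f["converged"]]
            if not unconv:
                continue
            share = (1.0 - sum(f["demand"] for f in out if f["converged"])) / len(unconv)
            for f in unconv:
                if abs(f["demand"] - share) > eps:
                    f["demand"], changed = share, True
        # Receiver pass: an oversubscribed receiver caps its receiver-limited
        # flows at a fair share (sender-limited flows keep their smaller demand)
        # and marks the capped flows as converged.
        for dst in {f["dst"] for f in flows}:
            incoming = [f for f in flows if f["dst"] == dst]
            if sum(f["demand"] for f in incoming) <= 1.0 + eps:
                continue
            limited, sender_limited_bw = list(incoming), 0.0
            while True:
                share = (1.0 - sender_limited_bw) / len(limited)
                small = [f for f in limited if f["demand"] < share - eps]
                if not small:
                    break
                sender_limited_bw += sum(f["demand"] for f in small)
                small_ids = {id(f) for f in small}
                limited = [f for f in limited if id(f) not in small_ids]
            for f in limited:
                if not f["converged"] or abs(f["demand"] - share) > eps:
                    f["demand"], f["converged"], changed = share, True, True
    return flows

# Worked example from the slides: flows A->X, A->Y, B->Y, C->Y.
example = [{"src": s, "dst": d, "demand": 0.0, "converged": False}
           for s, d in [("A", "X"), ("A", "Y"), ("B", "Y"), ("C", "Y")]]
for f in estimate_demands(example):
    print(f"{f['src']}->{f['dst']}: {f['demand']:.3f}")   # 0.667, 0.333, 0.333, 0.333
```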

Placement Heuristics

Global First-Fit [Figure: scheduler searching core switches 0-3 for a path for a newly detected flow] When a new large flow is detected, linearly search all possible paths from S to D. Place the flow on the first path whose component links can fit that flow.

Global First-Fit [Figure: Flows A-C placed across cores 0-3] Flows are placed upon detection and are not moved afterwards. Once a flow ends, its entries + reservations time out.
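The Python sketch below shows this first-fit placement over a set of precomputed candidate paths; the path representation, link capacities, and reservation bookkeeping are simplified assumptions for illustration only.

```python
# Minimal sketch of Global First-Fit placement. A "path" is a tuple of link
# identifiers; `reserved` tracks bandwidth already reserved on each link.
# Candidate-path enumeration and link capacities are simplified assumptions.
from collections import defaultdict

def global_first_fit(flow_demand_bps, candidate_paths, reserved,
                     link_capacity_bps=1_000_000_000):
    """Return the first path that can fit the flow, reserving its bandwidth."""
    for path in candidate_paths:                      # linear search over paths
        if all(reserved[link] + flow_demand_bps <= link_capacity_bps
               for link in path):
            for link in path:                         # reserve on every hop
                reserved[link] += flow_demand_bps
            return path
    return None                                       # no fit: fall back to ECMP default

# Example: two equal-cost paths between a source and destination edge switch.
reserved = defaultdict(float)
paths = [("e0-a0", "a0-c0", "c0-a2", "a2-e2"),
         ("e0-a1", "a1-c2", "c2-a3", "a3-e2")]
print(global_first_fit(400e6, paths, reserved))       # first path fits a 400 Mbps flow
```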

Simulated Annealing Probabilistic search for good flow-to-core mappings. Goal: maximize achievable bisection bandwidth. The current flow-to-core mapping generates a neighbor state; calculate the total exceeded bandwidth capacity; accept the move to the neighbor state if it gains bisection bandwidth. A few thousand iterations for each scheduling round. To avoid local minima, there is a non-zero probability of moving to a worse state.

Simulated Annealing Implemented several optimizations that reduce the search space significantly: assign a single core switch to each destination host; incremental calculation of exceeded capacity; among others.
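A compact Python sketch of the annealing loop over destination-host-to-core assignments (the search-space reduction mentioned above) follows; the energy function, neighbor move, and cooling schedule are illustrative assumptions rather than the exact ones used by Hedera.

```python
# Illustrative sketch of annealing over destination-to-core assignments.
# exceeded_capacity(mapping) is an assumed callable returning the total
# bandwidth by which the mapping exceeds link capacities (lower is better).
import math
import random

def anneal(dest_hosts, cores, exceeded_capacity, iterations=10_000, t0=1.0):
    """dest_hosts and cores are lists; returns the best mapping found."""
    state = {d: random.choice(cores) for d in dest_hosts}   # initial mapping
    energy = exceeded_capacity(state)
    best, best_energy = dict(state), energy
    for i in range(iterations):
        temp = t0 * (1 - i / iterations)                    # linear cooling schedule
        neighbor = dict(state)
        neighbor[random.choice(dest_hosts)] = random.choice(cores)  # move one host
        e2 = exceeded_capacity(neighbor)
        # Accept improvements; accept worse states with probability exp(-dE/T)
        if e2 < energy or (temp > 0 and random.random() < math.exp((energy - e2) / temp)):
            state, energy = neighbor, e2
            if energy < best_energy:
                best, best_energy = dict(state), energy
    return best
```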

Simulated Annealing [Figure: scheduler assigning Flows A-C to core switches 0-3 over an example run] Example run: 3 flows, 3 iterations.

Simulated Annealing [Figure: final flow-to-core assignment after the example run] The final state is published to the switches and used as the initial state for the next round.

Fault-Tolerance

Fault-Tolerance [Figure: scheduler re-routing Flows A-C around a failed component] Link / switch failure: use PortLand's fault notification protocol; Hedera routes around failed components.

Fault-Tolerance [Figure: switches falling back to ECMP when the scheduler is unreachable] Scheduler failure: the scheduler keeps only soft state, which is not required for correctness (connectivity); switches fall back to ECMP.

Implementation

Implementation 16-host testbed: k=4 fat-tree data plane; parallel 48-port non-blocking Quanta switch; 20 machines with 4-port NetFPGAs / OpenFlow; 1 scheduler machine. Dynamic traffic monitoring + OpenFlow routing control.

Evaluation - Testbed [Figure: bisection bandwidth (Gbps) achieved by ECMP, Global First-Fit, Simulated Annealing, and a non-blocking switch for the communication patterns Stag1(.2,.3), Stag2(.2,.3), Stag1(.5,.3), Stag2(.5,.3), Stride(2), Stride(4), Stride(8)]

Evaluation - Testbed [Figure: bisection bandwidth (Gbps) achieved by ECMP, Global First-Fit, Simulated Annealing, and a non-blocking switch for the communication patterns Rand0, Rand1, RandBij0, RandBij1, RandX2, RandX3, Hotspot]

Data Shuffle

                             ECMP    GFF     SA      Control
  Total Shuffle Time (s)     438.4   335.5   336.0   306.4
  Avg. Completion Time (s)   358.1   258.7   262.0   226.6
  Avg. Bisection BW (Gbps)   2.81    3.89    3.84    4.44
  Avg. host goodput (MB/s)   20.9    29.0    28.6    33.1

16 hosts: 120 GB all-to-all in-memory shuffle. Hedera achieves 39% better bisection bandwidth than ECMP, 88% of the ideal non-blocking switch.

Evaluation - Simulator For larger topologies: models TCP's AIMD behavior when constrained by the topology; stochastic flow arrival times / byte counts. What about ns2 / OMNeT++? Packet-level simulators are impractical at these network scales; the flow-level simulator's performance was calibrated against the testbed.

Simulator - 8,192 hosts (k=32) [Figure: bisection bandwidth (Gbps) achieved by ECMP, Global First-Fit, Simulated Annealing, and a non-blocking network for the communication patterns Stag(.5,0), Stag(.2,.3), Stride(1), Stride(16), Stride(256), RandBij, Rand]

Reactiveness Demand estimation: 27K hosts, 250K flows, converges in < 200 ms. Simulated annealing: runtime asymptotically depends on the number of flows and iterations; 50K flows and 10K iterations take 11 ms, and most of the final bisection bandwidth is achieved within the first few hundred iterations. Scheduler control loop: polling + estimation + SA = 145 ms for 27K hosts.

Limitations [Figure: regimes where ECMP vs. Hedera wins, as a function of traffic-matrix stability (stable vs. unstable) and flow size] For dynamic workloads where large-flow turnover is faster than the control loop, the scheduler will be continually chasing the traffic matrix. Need to include a penalty term for unnecessary SA flow re-assignments.

Future Work Improve the utility function of simulated annealing: SA movement penalties (TCP), flow priorities (QoS), and other metrics, e.g. power. Release the combined system: PortLand + Hedera (6/1). Perfect, non-centralized, per-packet Valiant Load Balancing.

Conclusions Simulated annealing delivers significant bisection bandwidth gains over standard ECMP. Hedera complements ECMP: RPC-like traffic is fine with ECMP. If you're running MapReduce/Hadoop jobs on your network, you stand to benefit greatly from Hedera; a tiny investment!

Questions? http://cseweb.ucsd.edu/~malfares/

Traffic Overhead 27K-host network: polling 72 B/flow * 5 flows/host * 27K hosts every 0.1 s ≈ 97 MB/s, i.e. < 100 MB/s for the entire data center. Could also use the data plane.