MMPTCP: A Novel Transport Protocol for Data Centre Networks




MMPTCP: A Novel Transport Protocol for Data Centre Networks
Morteza Kheirkhah, FoSS, Department of Informatics, University of Sussex

Modern Data Centre Networks: FatTree
- Provides full bisection bandwidth between all pairs of servers.
- Provides dense interconnectivity in the network.
- Relies on ECMP routing to exploit its path diversity.
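For reference (my own illustration, not from the slides), the sizing of a canonical k-ary FatTree follows directly from the switch port count k; the sketch below assumes the standard construction with k pods and (k/2)^2 core switches:

```cpp
#include <iostream>

// Canonical k-ary FatTree sizing (k must be even). This is an
// illustrative sketch, not part of the MMPTCP work itself.
struct FatTree {
    int k;  // ports per switch
    int pods() const          { return k; }
    int coreSwitches() const  { return (k / 2) * (k / 2); }
    int aggPerPod() const     { return k / 2; }
    int torPerPod() const     { return k / 2; }
    int hosts() const         { return k * k * k / 4; }
    // Equal-cost paths between hosts in different pods (one per core):
    int interPodPaths() const { return coreSwitches(); }
};

int main() {
    FatTree ft{8};  // k = 8 yields the 128-host topology used later
    std::cout << "hosts: " << ft.hosts()
              << ", core switches: " << ft.coreSwitches()
              << ", inter-pod paths: " << ft.interPodPaths() << "\n";
}
```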

Data Centre Network Properties
1. Short flow dominance: 99% of flows are short (flow size < 100MB). The majority of short flows are query flows with deadlines on their flow completion time (flow size < 1MB, e.g. 50KB). 90% of total bytes come from long flows (size > 100MB).
2. The traffic pattern is very bursty, volatile and unpredictable; the burstiness originates from short flows.
3. Low-latency, high-bandwidth communication: latency is on the order of microseconds (100 to 250 us) and minimum link capacity is 1 Gbps.

Congestion in FatTree: the Equal-Cost Multi-Path (ECMP) Issue
A key limitation of ECMP is that two or more long-lived TCP flows can collide on their hashes and end up on the same output port, creating avoidable congestion.
[Diagram: FatTree with Core 1-4 and AGG 1-8 connected by 10 Gbps links, TOR 1-16 below, each attaching 20 servers at 1 Gbps]
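To make the collision concrete, here is a minimal sketch (my illustration, not the authors' code) of flow-level ECMP: the switch hashes the flow's 5-tuple to pick an uplink, so two distinct long flows can land on the same port while other ports stay idle:

```cpp
#include <functional>
#include <iostream>
#include <string>

// Flow-level ECMP: choose an uplink by hashing the 5-tuple.
// Every packet of a flow takes the same path, so two long-lived
// flows can hash onto the same port and share its capacity.
int ecmpPort(const std::string& fiveTuple, int numUplinks) {
    return static_cast<int>(std::hash<std::string>{}(fiveTuple) % numUplinks);
}

int main() {
    const int uplinks = 4;  // e.g. four core-facing ports
    std::cout << ecmpPort("10.0.1.2:5001->10.0.2.3:80/tcp", uplinks) << "\n"
              << ecmpPort("10.0.1.4:5002->10.0.2.5:80/tcp", uplinks) << "\n";
    // If both lines print the same port, the flows collide: this is
    // the avoidable congestion described above.
}
```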

Possible Approaches to the ECMP Issue
1. MPTCP
2. MMPTCP (our solution)

Possible Approaches to the ECMP Issue: Multi-Path TCP (MPTCP)
MPTCP with four subflows: each subflow looks similar to single-path TCP.
[Diagram: the same FatTree, with one MPTCP connection's four subflows spread across Core 1-4]
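For context, MPTCP couples the congestion windows of its subflows; below is a minimal sketch of the per-ACK increase rule from the Linked Increases Algorithm (RFC 6356). This is my sketch for illustration, with the computation of alpha omitted:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Per-ACK window increase under MPTCP's coupled congestion control
// (Linked Increases Algorithm, RFC 6356), with windows in MSS units.
// Illustrative sketch only; alpha's computation is omitted here.
void onAck(std::vector<double>& cwnd, std::size_t r, double alpha) {
    double total = 0;
    for (double w : cwnd) total += w;
    // The increase is capped at single-path TCP's 1/w_r, so the
    // connection as a whole is no more aggressive than one TCP flow.
    cwnd[r] += std::min(alpha / total, 1.0 / cwnd[r]);
}
```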

Possible Approaches to the ECMP Issue: Multi-Path TCP (cont.)
Pros:
- It can handle hotspots in the network core gracefully.
- It significantly increases overall network throughput.
Cons:
- It is not good at handling short flows.
- It is not good at competing with TCP short flows.

Possible Approaches to the ECMP Issue: Multi-Path TCP (cont.)
Simulation setup 1: a 4:1 oversubscribed FatTree topology with 512 nodes running a permutation traffic matrix; 33% of nodes send continuous traffic (long MPTCP flows with 8 subflows) and the remainder send short MPTCP flows (70KB) with Poisson arrivals (250 arrivals/sec on average, assigned by a central short-flow scheduler).
Result: MPTCP is not good for short flows, due to small windows and the timeout problem in subflows.
Simulation setup 2: a 1:1 oversubscribed FatTree topology with 128 nodes running a permutation traffic matrix of long MPTCP flows.
Result: MPTCP is extremely good for long flows; 8 subflows appears to be the right number for achieving high overall network throughput in a full bisection bandwidth topology.
[Figures: mean flow completion time and standard deviation (ms) vs. number of MPTCP subflows in the 512-node FatTree; average goodput (Mbps) vs. number of MPTCP subflows in the 128-node FatTree]

Possible Approaches to the ECMP Issue
1. MPTCP
2. MMPTCP (our solution)

Possible Approaches to the ECMP Issue: MMPTCP Goals
- High bandwidth for long flows
- Low latency for short flows
- High burst tolerance

Possible Approaches to the ECMP Issue: MMPTCP (our idea)
An MMPTCP connection starts with a single subflow (i.e. TCP) and randomises its traffic on a per-packet basis (packet scattering) until it has delivered a certain data volume, e.g. 1 MB (to cover short/query flows). It then opens several subflows (e.g. 8) for the rest of the connection, during which MPTCP congestion control governs data transmission (the initial subflow is deactivated at this point).
[Diagram: the same FatTree, with the initial subflow's packets scattered across Core 1-4]
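A minimal sketch of this phase switch (my illustration using the slide's parameters, a 1 MB scatter volume and 8 subflows; everything else is an assumption, not the authors' ns-3 code):

```cpp
#include <cstdint>
#include <random>

// Sketch of MMPTCP's two phases: per-packet scattering for the first
// ~1 MB (covering short/query flows), then a switch to MPTCP subflows.
class MmptcpSender {
public:
    explicit MmptcpSender(int numPaths) : pick_(0, numPaths - 1) {}

    // Returns the path index for the next outgoing packet.
    int nextPath(uint32_t packetBytes) {
        bytesSent_ += packetBytes;
        if (bytesSent_ < kScatterLimit) {
            // Phase 1: scatter each packet onto a random path, e.g.
            // by randomising the source port so ECMP spreads packets.
            return pick_(rng_);
        }
        // Phase 2: open several subflows (e.g. 8) and let MPTCP
        // congestion control take over; the initial subflow is
        // deactivated at this point.
        if (!subflowsOpen_) { openSubflows(8); subflowsOpen_ = true; }
        return scheduleOnSubflows();
    }

private:
    static constexpr uint64_t kScatterLimit = 1 << 20;  // 1 MB
    void openSubflows(int) { /* establish MPTCP subflows */ }
    int scheduleOnSubflows() { return 0; /* MPTCP scheduler */ }

    uint64_t bytesSent_ = 0;
    bool subflowsOpen_ = false;
    std::mt19937 rng_{std::random_device{}()};
    std::uniform_int_distribution<int> pick_;
};
```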

Possible Approaches to the ECMP Issue: MMPTCP (our idea)
Pros:
- It handles bursty traffic patterns gracefully by diffusing them throughout the network.
- It decreases flow completion time for short flows.
- It increases network throughput for large flows.
Cons:
- Packets get reordered in the initial phase.

MMPTCP and Packet Reordering
Three aspects need to be considered: preventing, detecting and mitigating spurious retransmissions due to out-of-order packets.
Solutions: RR-TCP (needs DSACK), Eifel (detection and mitigation only), and so on.
A quick solution with TCP NewReno: simply prevent spurious retransmissions by adjusting the TCP dup-ACK threshold based on topology-specific information, as sketched below.
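One plausible way to derive that threshold (my assumption; the slides only say "topology-specific information"): bound how many packets can overtake a given packet by the worst-case delay spread across the equal-cost paths, and raise the dup-ACK threshold above the classic value of 3 accordingly:

```cpp
#include <algorithm>
#include <cstdint>

// Sketch: raise TCP NewReno's duplicate-ACK threshold when packets
// are deliberately scattered across equal-cost paths, so reordering
// is not mistaken for loss. The scaling rule here is an assumption,
// not the authors' exact formula.
uint32_t dupAckThreshold(double maxPathDelayUs,
                         double minPathDelayUs,
                         double packetIntervalUs) {
    // Packets sent during the worst-case delay spread bound how many
    // can arrive ahead of an earlier packet on a slower path.
    double spreadUs = maxPathDelayUs - minPathDelayUs;
    auto overtakers = static_cast<uint32_t>(spreadUs / packetIntervalUs);
    return std::max(3u, overtakers + 1);  // never below the default 3
}

// Example: path delays of 100-250 us and one 1500 B packet every
// 12 us at 1 Gbps give a threshold of about 13 instead of 3.
```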

Evaluation: FatTree Topology
Simulation setup: a 4:1 oversubscribed FatTree topology with 512 nodes running a permutation traffic matrix; 33% of nodes send continuous traffic (long MPTCP flows with 8 subflows) and the remainder send short flows (70KB) with Poisson arrivals (250 arrivals/sec on average, assigned by a central short-flow scheduler).

Evaluation: MMPTCP vs. MPTCP with 8 subflows
MPTCP (8 subflows): mean flow completion time 125 ms, standard deviation 425 ms.
MMPTCP: mean flow completion time 116 ms, standard deviation 101 ms.

Evaluation: MMPTCP vs. SFTCP
SFTCP: mean flow completion time 89.2 ms, standard deviation 108.9 ms.
MMPTCP: mean flow completion time 116 ms, standard deviation 101 ms.

Evaluation: MMPTCP vs. MMPTCP LT (in a 2:1 FatTree)
MMPTCP: mean flow completion time 98.9 ms, standard deviation 74.8 ms.
MMPTCP LT: mean flow completion time 89.1 ms, standard deviation 67.2 ms.

Future Directions
- We realised that employing TCP congestion control during the initial phase of MMPTCP is overkill. We believe the right congestion control (e.g. DCTCP-like) could significantly improve MMPTCP's performance for short flows.
- MMPTCP is capable of utilising multi-homed network topologies. In this way, TCP Incast could potentially be eliminated, because MMPTCP can deliver flows via all available network interfaces.
- Advanced QoS features have become increasingly available in data centre switches. Our hypothesis is that if packets of MMPTCP's initial phase are marked high priority and routed through different queues, MMPTCP can effectively and seamlessly help latency-sensitive short flows complete their data delivery faster and potentially meet their deadlines.

MPTCP implementation in ns-3: https://github.com/mkheirkhah/mptcp
Publication: Kheirkhah, M., Wakeman, I. and Parisis, G. Short vs. Long Flows: A Battle That Both Can Win. In Proceedings of ACM SIGCOMM 2015, London, UK.
Thank you! Questions?