TCP Adaptation for MPI on Long-and-Fat Networks
Motohiko Matsuda, Tomohiro Kudoh, Yuetsu Kodama, Ryousei Takano (Grid Technology Research Center)
Yutaka Ishikawa (The University of Tokyo)
Outline
- Background
- Review of TCP Basics
- TCP Behavior versus MPI Traffic
- Modifications to TCP and Implementation
- Evaluation: Simple Traffic and NPB Benchmarks
- Related Work
- Conclusion
Background
- Demands on high-performance MPI using TCP/IP
  - Commodity clusters, computing in the Grid, ...
- GridMPI Project
  - Run unmodified HPC applications in the Grid
  - Single (monolithic) MPI application across the Grid
- Focus on metropolitan-area, high-bandwidth environments: 10 Gbps, 500 miles (10 ms one-way)
  [Figure: two computing-resource sites, site A and site B, connected over a wide-area network]
Background (cont.)
- MPI traffic is sub-optimal for TCP, which is designed for streams
- Typical MPI applications:
  - Repeat computation/communication phases, with no communication during computation phases
  - Change traffic with communication patterns, e.g., from one-to-one to all-to-all and vice versa
- A known, old issue, but not fixed in actual implementations
- Goals: adapt TCP behavior to non-contiguous traffic, and show the impact on MPI applications
Review of TCP (Slow-Start)
- Basic behavior of TCP
  - cwnd (bounded by the send buffer size) determines the number of packets in flight, ideally the estimated bandwidth-delay product
  - A low cwnd value means low bandwidth utilization
- cwnd is low during Slow-Start
- cwnd adapts slowly during Congestion Avoidance
  [Figure: cwnd over time, growing through the Slow-Start phase up to the Slow-Start threshold (ssthresh), then through the Congestion Avoidance phase]
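The two growth regimes can be made concrete with a toy simulation. This is an illustrative sketch only, not the paper's code; the initial cwnd and ssthresh values are arbitrary:

```c
/* Toy simulation of cwnd growth (units: packets; values arbitrary). */
#include <stdio.h>

int main(void) {
    double cwnd = 1.0;       /* congestion window                */
    double ssthresh = 64.0;  /* slow-start threshold             */

    for (int rtt = 1; rtt <= 20; rtt++) {
        if (cwnd < ssthresh)
            cwnd *= 2.0;     /* Slow-Start: doubles every RTT    */
        else
            cwnd += 1.0;     /* Congestion Avoidance: +1 per RTT */
        printf("RTT %2d: cwnd = %.0f\n", rtt, cwnd);
    }
    return 0;
}
```

The exponential phase reaches ssthresh in a few RTTs, after which growth is linear, which is why a mis-set cwnd takes a long time to recover.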
Review of TCP (Fast-Retransmit)
- Fast-Retransmit
  - The sender resends a lost packet upon receiving duplicate ACKs
  - Recovery by Fast-Retransmit is fast
- Retransmit-Timeout (RTO)
  - The original retransmission mechanism
  - The timeout is very large (>= 200 ms in the Linux implementation)
  - Recovery by RTO is slow
  [Sequence diagram: the sender sends Seq#1..Seq#6 with Seq#3 lost; the receiver returns Ack#1, Ack#2, then duplicate Ack#2s for each later segment; the sender retransmits Seq#3 on receiving the duplicate Ack#2s]
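A minimal sketch of the duplicate-ACK rule, for illustration only; the threshold of three duplicates is the standard TCP value, and the function and state names here are made up:

```c
/* Illustrative duplicate-ACK counter; real stacks keep this per connection. */
#include <stdio.h>

#define DUPACK_THRESHOLD 3   /* standard fast-retransmit trigger */

static unsigned last_ack = 0;
static unsigned dup_count = 0;

/* Called for every ACK the sender receives. */
static void on_ack(unsigned ack_seq) {
    if (ack_seq == last_ack) {
        if (++dup_count == DUPACK_THRESHOLD)
            printf("fast-retransmit of segment %u\n", ack_seq + 1);
    } else {
        last_ack = ack_seq;
        dup_count = 0;
    }
}

int main(void) {
    /* ACK stream from the diagram above: Seq#3 lost, Ack#2 repeats. */
    unsigned acks[] = {1, 2, 2, 2, 2};
    for (int i = 0; i < 5; i++)
        on_ack(acks[i]);
    return 0;
}
```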
TCP Behavior versus MPI Traffic
- Quick changes of communication patterns -> cwnd mismatch
  - From all-to-all to one-to-one, and vice versa
  - Adapting cwnd is slow, taking several round-trip times
- Phases of computation/communication -> frequent Slow-Start
  - A long pause forces TCP to re-enter the Slow-Start state (after 200 ms in the Linux implementation)
- Non-contiguous flow of packets -> long pauses waiting for Retransmit-Timeout
  - A drop at the tail of a message cannot be recovered by Fast-Retransmit, since no further packets follow to generate duplicate ACKs; it falls back to Retransmit-Timeout
Three Modifications to TCP Behavior
- Modifications and the problems they address:
  - Pacing at Start-up -> frequent Slow-Start and loss of ACK-clocking
  - Reducing RTO (Retransmit-Timeout) time -> long pauses waiting for Retransmit-Timeout
  - TCP Parameter Switching -> mismatch of the cwnd value at communication-phase changes
- Actual implementations diverge from the definition of TCP, but behave essentially the same at non-contiguous points
- All of these problems happen frequently in MPI applications
Implementation of Pacing at Start-up
- Pacing keeps a constant pace during start-up
  - Avoids burst traffic, which would overflow router buffers
  - There is no ACK-clocking at start-up, so pacing is needed
- Target bandwidth: target bandwidth = (ssthresh * MSS) / RTT, where ssthresh holds the saved cwnd value
- The target is checked every 10 usec; a packet is sent when the current pace is lower than the target (see the sketch below)
- A 10 usec timer interrupt, using the APIC timer of the IA32 CPU, generates the clock
  - The timer stops on receiving an ACK (within one RTT)
  - Small overhead: 1% to 2% slowdown
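A minimal sketch of this pacing rule, under the assumptions that a 10 usec timer tick is available and that the "send one segment" hook exists; none of the names below come from the paper:

```c
/* Sketch of timer-driven pacing: send only when behind the target rate. */
#include <stdint.h>
#include <stdio.h>

struct pace_state {
    uint64_t ssthresh;    /* saved cwnd, in packets           */
    uint64_t mss;         /* maximum segment size, in bytes   */
    uint64_t rtt_usec;    /* measured round-trip time         */
    uint64_t sent_bytes;  /* bytes sent since pacing started  */
    uint64_t start_usec;  /* time pacing started              */
};

/* Called every 10 usec by a (hypothetical) APIC-timer interrupt. */
static void pacing_tick(struct pace_state *p, uint64_t now_usec)
{
    /* target bandwidth = ssthresh * MSS / RTT, in bytes per usec */
    double target = (double)(p->ssthresh * p->mss) / p->rtt_usec;
    double elapsed = (double)(now_usec - p->start_usec);

    /* Send one segment only if the actual pace lags the target. */
    if ((double)p->sent_bytes < target * elapsed) {
        /* here a real stack would transmit one segment */
        p->sent_bytes += p->mss;
    }
}

int main(void) {
    /* Example values matching the evaluation (cwnd 582, RTT 20 ms). */
    struct pace_state p = { .ssthresh = 582, .mss = 1460,
                            .rtt_usec = 20000, .sent_bytes = 0,
                            .start_usec = 0 };
    for (uint64_t t = 10; t <= 100; t += 10)   /* ten 10-usec ticks */
        pacing_tick(&p, t);
    printf("sent %llu bytes in 100 usec\n",
           (unsigned long long)p.sent_bytes);
    return 0;
}
```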
Implementation of TCP Parameter Switching
- Save/restore ssthresh
  - ssthresh holds the last cwnd value; it is stable, so it is easy to save/restore
- Communication starts at the rate given by ssthresh, using Pacing at Start-up
- Parameter Switching is activated at pattern changes
  - The MPI library notifies TCP of pattern changes, before/after collective communication calls
  - Two ssthresh values, one for point-to-point and one for collectives, are saved in a communicator (a sketch follows this list)
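A sketch of how the MPI library side might drive Parameter Switching. The socket options TCP_GET_SSTHRESH and TCP_SET_SSTHRESH are hypothetical stand-ins for the paper's kernel interface; only the per-communicator bookkeeping of two ssthresh values follows the slide:

```c
/* Sketch: MPI library saves/restores ssthresh around collectives. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#define TCP_GET_SSTHRESH 1001   /* hypothetical socket option */
#define TCP_SET_SSTHRESH 1002   /* hypothetical socket option */

struct comm_ssthresh {
    int p2p;         /* ssthresh saved for point-to-point phases */
    int collective;  /* ssthresh saved for collective phases     */
};

/* Before a collective: save the point-to-point ssthresh, restore
 * the one recorded during the previous collective phase. */
void enter_collective(int sock, struct comm_ssthresh *c)
{
    socklen_t len = sizeof(int);
    getsockopt(sock, IPPROTO_TCP, TCP_GET_SSTHRESH, &c->p2p, &len);
    setsockopt(sock, IPPROTO_TCP, TCP_SET_SSTHRESH,
               &c->collective, sizeof(int));
    /* Pacing at Start-up then ramps traffic toward ssthresh*MSS/RTT. */
}

/* After the collective: the mirror-image swap. */
void leave_collective(int sock, struct comm_ssthresh *c)
{
    socklen_t len = sizeof(int);
    getsockopt(sock, IPPROTO_TCP, TCP_GET_SSTHRESH,
               &c->collective, &len);
    setsockopt(sock, IPPROTO_TCP, TCP_SET_SSTHRESH,
               &c->p2p, sizeof(int));
}
```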
Implementation of Reducing RTO Time
- Use the definition in the RFC, without rounding up to 200 ms:
  RTO = SRTT + 4 * RTTVAR  (SRTT: smoothed RTT, RTTVAR: RTT variance)
- Linux implementation:
  RTO = SRTT + MAX(200 ms, 4 * RTTVAR), i.e., roughly 200 ms + RTT
- NOTE: the RTO time is used for two purposes
  - Time to retransmission
  - Time to entering the Slow-Start state
  - Reducing RTO thus has the side effect of entering the Slow-Start state more frequently
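For illustration, plugging example numbers into the two formulas shows the gap; SRTT = 20 ms and RTTVAR = 2 ms are assumed values here, roughly matching the testbed's 10 ms one-way delay:

```c
/* Worked comparison of the RFC and Linux RTO formulas above. */
#include <stdio.h>

int main(void) {
    double srtt = 20.0, rttvar = 2.0;   /* milliseconds, illustrative */

    double rto_rfc   = srtt + 4.0 * rttvar;           /* RFC rule   */
    double rto_linux = srtt + (4.0 * rttvar > 200.0   /* Linux rule */
                               ? 4.0 * rttvar : 200.0);

    printf("RFC RTO:   %.0f ms\n", rto_rfc);    /* prints 28 ms  */
    printf("Linux RTO: %.0f ms\n", rto_linux);  /* prints 220 ms */
    return 0;
}
```

With a small RTT variance the Linux lower bound dominates, so every tail-drop recovery stalls for roughly 200 ms plus the RTT.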
Evaluation Setting
[Testbed diagram: two clusters of 8 PCs (Node0-Node7 and Node8-Node15), each behind a Catalyst 3750 switch, connected through the GtrcNET-1 WAN emulator (delay: 10 ms, bandwidth: 500 Mbps)]
- CPU: Pentium 4/2.4 GHz, Memory: DDR400 512 MB
- NIC: Intel PRO/1000 (82547EI)
- OS: Linux 2.6.9-1.6 (Fedora Core 2)
- Socket buffer size: 20 MB
Effect of TCP Parameter Switching
- Start one-to-one communication at time 0, right after an all-to-all; cwnd is reduced after the all-to-all communication
- Graphs show sending 50 MB of data
  - Without TCP Parameter Switching: traffic starts with cwnd = 49
  - With Switching: traffic starts with cwnd = 582
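As a rough sanity check (assuming MSS = 1460 bytes and RTT = 20 ms, per the testbed's 10 ms one-way delay): throughput is about cwnd * MSS / RTT, so cwnd = 49 caps the flow at roughly 49 * 1460 B / 20 ms, about 3.6 MB/s, while cwnd = 582 allows about 42 MB/s, close to the emulated 500 Mbps (62.5 MB/s) path capacity.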
Effect of Pacing at Start-up
- Repeat sending 10 MB of data at 2-second intervals
- Graphs show sending one chunk of 10 MB data, without and with Pacing at Start-up
Effect of Reducing RTO Time (Bandwidth)
- Repeat MPI_Alltoall with varying data size
  [Plot: bandwidth (MB/s, 0-30) versus data size (10 KB to 1 MB), with and without Reducing RTO]
- The setting is a cluster environment: no delay, no bandwidth restriction
- Packets are dropped at the switch between clusters, which causes Retransmit-Timeouts
Effect of Reducing RTO Time (Traffic)
- Repeat all-to-all communication with data size 32 KB
  [Traffic graphs: without Reducing RTO Time (upper) and with Reducing RTO Time (lower)]
- The upper graph has long pauses despite repeated calls of MPI_Alltoall
NPB Benchmarks
- NAS Parallel Benchmarks 2.3
- Performance relative to standard TCP
  - STD: standard TCP
  - SWT: Parameter Switching
  - RTO: Reducing RTO
  - PCE: Pacing at Start-up
NPB Traffic Samples
- Traffic samples taken from NPB
- Sampling point: the traffic between the two clusters, i.e., the sum of the traffic from 8 nodes
- One benchmark loop is extracted; there is some deviation at the time-0 point, due to triggering
Traffic of BT Benchmark [Traffic plots: Standard TCP vs. Modified TCP]
Traffic of CG Benchmark [Traffic plots: Standard TCP vs. Modified TCP]
Traffic of FT Benchmark [Traffic plots: Standard TCP vs. Modified TCP]
Traffic of IS Benchmark [Traffic plots: Standard TCP vs. Modified TCP]
Traffic of LU Benchmark [Traffic plots: Standard TCP vs. Modified TCP]
Traffic of MG Benchmark [Traffic plots: Standard TCP vs. Modified TCP]
Traffic of SP Benchmark [Traffic plots: Standard TCP vs. Modified TCP]
Related Work
- The issues are old: old networks and OSes [Aron, Druschel]
  - Evaluation with 10 Kbps-100 Kbps networks
  - Coarse timers, improved from 500 ms to 10 ms
- Focus on HTTP traffic [Hughes, Touch, Heidemann]
- MPI is much more sensitive to these issues, and most prior evaluations are simulations
- Our approach:
  - Cooperation with the MPI library
  - Implemented in the Linux TCP stack
  - Fast timer for 1 Gbps to 10 Gbps environments
Conclusion
- Modifications to TCP behavior at start-ups
  - Pacing at Start-up avoids frequent Slow-Start
  - Reducing RTO Time avoids long pauses at packet losses
- Cooperation with the MPI library
  - TCP Parameter Switching avoids cwnd mismatch
- NO CHANGES in the Congestion Avoidance phase
  - Start-up behavior is mostly ignored, but it significantly affects the performance of some MPI applications
  - Fairness is retained, because there is no modification to the Congestion Avoidance phase
GridMPI Project Page http://www.gridmpi.org/