Scalable TCP: Improving Performance in Highspeed Wide Area Networks



Tom Kelly
CERN - IT Division, Geneva 23, Switzerland
ctk21@cam.ac.uk

ABSTRACT

TCP congestion control can perform badly in highspeed wide area networks because of its slow response with large congestion windows. The challenge for any alternative protocol is to better utilize networks with high bandwidth-delay products in a simple and robust manner without interacting badly with existing traffic. Scalable TCP is a simple sender-side alteration to the TCP congestion window update algorithm. It offers a robust mechanism to improve performance in highspeed wide area networks using traditional TCP receivers. Scalable TCP is designed to be incrementally deployable and behaves identically to traditional TCP stacks when small windows are sufficient. The performance of the scheme is evaluated through experimental results gathered using a Scalable TCP implementation for the Linux operating system and a gigabit transatlantic network. The results gathered suggest that the deployment of Scalable TCP would have negligible impact on existing network traffic at the same time as improving bulk transfer performance in highspeed wide area networks.

1. INTRODUCTION

A communication network can experience periods where the traffic offered to it exceeds the available transmission capacity; during such periods the network is said to be congested. TCP congestion control [9] was introduced to relieve congestion collapse that had occurred in the Internet. A result of congestion control is that resources are shared between flows during periods of congestion. This sharing leads to similar throughput for flows with similar round trip times and avoids starving individual flows. TCP has proved to be remarkably successful at sharing bandwidth while aggressively utilizing available capacity under a range of dynamic traffic loads.
Author note: Tom Kelly is a member of the Laboratory for Communication Engineering, Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, United Kingdom.

The TCP flow control algorithm uses a window and end-to-end acknowledgment scheme to provide reliable data transfer across a network; a brief description is given here and a more complete reference is [5]. The sending host maintains a congestion window, cwnd, which places an upper bound on the number of segments that may be sent into the network awaiting acknowledgment by the receiver. Upon receiving a data packet the receiver schedules a cumulative acknowledgment, which covers all received packets, to be sent to the sender. The receiver also advertises to the sender a receive window, rwnd, which is the size of the available socket receive buffer for this connection. The sender is allowed to have at most the minimum of cwnd and rwnd packets in the network awaiting acknowledgment. The receive window provides flow control for the receiving application; if the receiving application cannot process data at the speed it is being sent, the window advertisements from the receiver, rwnd, will shrink as the socket receive buffer fills. The congestion window is intended to provide flow control during periods in which the network is congested.

Packet loss is detected either through the timeout of an unacknowledged packet, the receipt of several duplicate acknowledgments, or through selective acknowledgment (SACK) reports [] sent by the receiver. Packet loss is used as a signal of congestion; it is assumed to be caused by a buffer overflow due to offered traffic exceeding available capacity on the end-to-end path of a connection. TCP senders update the congestion window in response to acknowledgments of received packets and the detection of congestion.
For each acknowledgment received in a round trip time in which congestion has not been detected

$$cwnd \leftarrow cwnd + \frac{1}{cwnd}$$

and on the first detection of congestion in a given round trip time

$$cwnd \leftarrow cwnd - \frac{1}{2}\, cwnd$$

This process of increasing and decreasing cwnd allows TCP to aggressively utilize the available bandwidth on a given end-to-end path.

Footnote: The use of explicit congestion notification (ECN) [8] by routers allows congestion to be signaled to the sender (via acknowledgments from the receiver) without the loss of packets.

The agility of this congestion window adjustment algorithm can be studied by considering the time taken to reach the same sending rate following the detection of a transient congestion event. Suppose a connection has a round trip time of 200ms and a packet size of 1500 bytes. An available bandwidth of 1Gbps corresponds to a congestion window of about 16000. Immediately after the detection of a congestion event cwnd will be set to 8000, which is equivalent to sending at 500Mbps. To reach the sending rate of 1Gbps again will take 8000 round trip times or about 27 minutes! In many highspeed wide area networks this recovery time is much longer than the time between transient congestion periods. This can lead to low utilization even when the network is uncongested for extended periods.

However, by altering the congestion window adjustment algorithm, the agility with large windows can be dramatically improved. This paper will consider the use of the following congestion control algorithm. For each acknowledgment received in a round trip time in which congestion has not been detected

$$cwnd \leftarrow cwnd + 0.01$$

and on the first detection of congestion in a given round trip time

$$cwnd \leftarrow cwnd - 0.125 \cdot cwnd$$

The time taken for a source using this algorithm to double its sending rate is about 70 round trip times for any rate; the window update algorithm is scalable and a TCP implementing it is termed Scalable TCP. In the previous case of a 1Gbps connection with a round trip time of 200ms, the scalable algorithm will recover its original rate after a transient in under 3 seconds. This suggests that this algorithm could better utilize the bandwidth of a highspeed wide area network that experiences transient congestion.

This paper studies the design and implementation of the Scalable TCP modification to TCP congestion control and presents early results on its performance. Section 2 describes the problems associated with TCP congestion control in highspeed wide area networks and presents a context within which Scalable TCP would be beneficial. Section 3 considers the analytical properties of the generalized Scalable TCP algorithm and motivates the choice of the parameters 0.01 and 0.125. Section 4 presents results of experiments performed using a Scalable TCP implementation in the Linux operating system over the DataTAG highspeed transatlantic testbed. Section 5 considers how this scheme differs from the related work on improving the performance of congestion control in high speed networks. Section 6 summarizes what has been achieved and gives directions for future work.
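The recovery times quoted in the introduction can be checked with a short numerical sketch. It is an illustration under stated assumptions rather than the paper's implementation: the window is advanced one round trip at a time, so traditional TCP gains about one segment per round trip (cwnd acknowledgments, each adding 1/cwnd) and the scalable rule gains about a fraction a of the window per round trip (cwnd acknowledgments, each adding a). The function and constant names are hypothetical.

```python
def rtts_to_recover(cwnd_before_loss, step_per_rtt, decrease_fraction):
    """Count round trips until the window regains its pre-loss size."""
    cwnd = cwnd_before_loss * (1.0 - decrease_fraction)
    rtts = 0
    while cwnd < cwnd_before_loss:
        cwnd = step_per_rtt(cwnd)
        rtts += 1
    return rtts


def traditional_rtt_step(cwnd):
    # cwnd ACKs per round trip, each adding 1/cwnd: roughly +1 segment.
    return cwnd + 1.0


def scalable_rtt_step(cwnd, a=0.01):
    # cwnd ACKs per round trip, each adding a: roughly +a*cwnd segments.
    return cwnd + a * cwnd


RTT_SECONDS = 0.2    # 200ms round trip time from the example
CWND_1GBPS = 16000   # approximate window for 1Gbps with 1500 byte packets

traditional = rtts_to_recover(CWND_1GBPS, traditional_rtt_step, 0.5)
scalable = rtts_to_recover(CWND_1GBPS, scalable_rtt_step, 0.125)

print(f"traditional: {traditional} RTTs, {traditional * RTT_SECONDS / 60:.1f} minutes")
print(f"scalable:    {scalable} RTTs, {scalable * RTT_SECONDS:.1f} seconds")
```

Under these assumptions the traditional rule needs 8000 round trips and the scalable rule 14, which matches the "about 27 minutes" and "under 3 seconds" figures in the text.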
2. MOTIVATION AND CONTEXT

This work is motivated by the poor performance of TCP when used for bulk transfers in highspeed wide area networks. These networks have speeds greater than 100Mbps and round trip times above 50ms. Several communities use such networks and need to distribute substantial amounts of data over them. For example, the large datasets collected by the High Energy Physics, Bioinformatics and Radioastronomy communities require global distribution for the data to be analyzed effectively.

Define the supporting loss rate for a connection to be the maximum packet loss rate that a congestion control algorithm will tolerate to sustain a given level of throughput. Let the packet loss recovery time for a given rate and connection be the length of time required by a congestion control algorithm to return to its initial sending rate following the detection of a packet loss. Traditional TCP connections are unable to achieve high throughput in highspeed wide area networks due to the long packet loss recovery times and the need for low supporting loss rates.

Table 1 shows the properties of a traditional TCP connection with a round trip time of 200ms and a segment size of 1500 bytes. A packet loss rate of $10^{-7}$ is comparable with those that can occur on long haul fiber links, within network devices, and in end-systems; this places a limit on throughput before any transient congestion due to load fluctuations is considered. This constraint on the loss rate becomes problematic for a connection with a round trip time of 200ms at around 100Mbps. Furthermore the packet loss recovery time for a 10Mbps connection with a round trip time of 200ms becomes comparable with inter-page think times for a user's Web requests. A recovery time of more than a few minutes could be detrimental to efficient utilization of a network with periods of transient congestion; at a round trip time of 200ms this effect would occur at rates of more than 100Mbps.
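The quantities defined above can be computed directly for a traditional TCP connection. The sketch below is a hedged illustration, not the author's method: it assumes the window is throughput times round trip time divided by segment size, that recovery after halving takes half a window of round trips, and that the supporting loss rate follows the square-root response curve $cwnd \approx \sqrt{1.5/p}$ that the paper quotes for traditional TCP. The function name is hypothetical.

```python
def traditional_tcp_properties(throughput_bps, rtt_s=0.2, segment_bytes=1500):
    """Return (window in packets, recovery time in seconds, supporting loss rate)."""
    window = throughput_bps * rtt_s / (8 * segment_bytes)
    # After halving, the window grows by about one segment per round trip.
    recovery_s = (window / 2.0) * rtt_s
    # Invert cwnd ~ sqrt(1.5 / p) to get the loss rate that sustains this window.
    supporting_loss = 1.5 / window ** 2
    return window, recovery_s, supporting_loss


for rate in (1e6, 10e6, 100e6, 1e9, 10e9):
    window, recovery_s, loss = traditional_tcp_properties(rate)
    print(f"{rate / 1e6:>7.0f} Mbps  {window:>9.0f} pkts  {recovery_s:>8.1f} s  {loss:.1e}")
```

The printed values differ slightly from a table that rounds the window first (for example 17 packets rather than 16.7 at 1Mbps), but the orders of magnitude are the same: minutes of recovery time and loss rates near $10^{-7}$ or below once the rate reaches 100Mbps.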
This paper considers whether a simple change to the congestion control algorithm is sufficient to improve highspeed wide area network operation. Scalable TCP is an evolution of the existing congestion control algorithm that improves performance when there is a high available bandwidth on long haul routes. It is designed to be easily implemented in current TCP stacks and incrementally deployable without needing modifications to network devices. Scalable TCP builds on the HighSpeed TCP proposal [6] and previous work on engineering stable congestion controls [].

3. ANALYSIS AND DESIGN

The analysis will make use of standard fluid limit approximations and the following notation conventions. Let each source and destination pair in the network be identified with a route, r, and the end-to-end dropping probability on a route be denoted by $P_r(t)$. Let $cwnd_r$ and $T_r$ denote the sender's congestion window and the round trip time of a connection on route r. The generalized Scalable TCP window update algorithm responds to each acknowledgment received in a round trip time in which congestion has not been detected with the update

$$cwnd_r \leftarrow cwnd_r + a$$

where a is a constant with $0 < a < 1$. Further, on the first detection of congestion in a given round trip time, the congestion window is altered by

$$cwnd_r \leftarrow cwnd_r - b \cdot cwnd_r$$

where b is a constant with $0 < b < 1$.

Figures 1 and 2 illustrate the congestion window dynamics of a single connection using traditional TCP or Scalable TCP over a dedicated link of capacity c or C (c < C). Packet loss recovery times for a traditional TCP connection are proportional to the connection's window size and round trip time. A Scalable TCP connection has packet loss recovery times that are proportional to the connection's round trip time only; this invariance to link sizes allows Scalable TCP to outperform traditional TCP in highspeed wide area networks. The scaling property applies for any choice of the constants a and b; implementation and deployment constraints determine these constants.
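The generalized update rule can be written as a small piece of sender state. The following is a minimal sketch, not the Linux patch described later: the class and method names are hypothetical, and "a round trip time in which congestion has been detected" is modelled, as an assumption, by the interval of one round trip following the most recent window reduction.

```python
from dataclasses import dataclass


@dataclass
class ScalableWindow:
    """Sender-side congestion window under the generalized update rule."""
    cwnd: float
    rtt: float
    a: float = 0.01
    b: float = 0.125
    last_cut: float = float("-inf")  # time of the most recent reduction

    def in_congested_round_trip(self, now: float) -> bool:
        # Assumption: a round trip counts as congested for one RTT after a cut.
        return now - self.last_cut < self.rtt

    def on_ack(self, now: float) -> None:
        # Additive per-ACK increase, skipped while congestion is being handled.
        if not self.in_congested_round_trip(now):
            self.cwnd += self.a

    def on_congestion(self, now: float) -> None:
        # Multiplicative decrease, applied at most once per round trip.
        if not self.in_congested_round_trip(now):
            self.cwnd -= self.b * self.cwnd
            self.last_cut = now


w = ScalableWindow(cwnd=100.0, rtt=0.2)
w.on_ack(0.00)         # 100.0 -> 100.01
w.on_congestion(0.05)  # first loss in this round trip: window cut by 12.5%
w.on_congestion(0.10)  # second loss in the same round trip: ignored
w.on_ack(0.30)         # next round trip: additive increase resumes
print(w.cwnd)
```

Because the increase per round trip is proportional to the window (about a times cwnd), the number of round trips needed to undo one reduction does not depend on the window size, which is the scaling property described above.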
The use of a = 0.01 and b = 0.125 will be motivated by considering Scalable TCP's impact on legacy traffic, bandwidth allocation properties, flow rate variance, convergence properties, and control theoretic stability.

3.1 Response curve and bandwidth allocation

A congestion window update algorithm relates the congestion window size to the end-to-end signaling rate through a response curve. The generalized Scalable TCP algorithm has a response curve that can be approximated for small end-to-end drop rates by

$$cwnd_r \approx \frac{a}{b\, P_r}$$

This can be derived by considering the congestion window size at equilibrium through a differential equation model of cwnd or the expectation of a stochastic model of cwnd.
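One way to see where this approximation comes from is to balance the expected increase and decrease per packet. The sketch below is an assumption-laden outline rather than the paper's derivation: it treats every lost packet as triggering a reduction, which ignores the once-per-round-trip rule and is only reasonable when losses are rare.

```latex
% Expected change in cwnd_r per packet sent, with loss probability P_r:
% an acknowledged packet adds a, a lost packet removes b * cwnd_r.
\begin{align*}
\mathbb{E}[\Delta cwnd_r] &= a\,(1 - P_r) \;-\; b\, cwnd_r\, P_r \\
% At equilibrium the expected change is zero:
0 &= a\,(1 - P_r) \;-\; b\, cwnd_r\, P_r \\
cwnd_r &= \frac{a\,(1 - P_r)}{b\, P_r} \;\approx\; \frac{a}{b\, P_r}
  \qquad \text{for small } P_r
\end{align*}
```

The window is therefore inversely proportional to the loss rate, in contrast to the inverse square-root dependence of traditional TCP.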

Throughput   Window        Packet loss recovery time   Supporting loss rate
1Mbps        17pkts        1.7s                        5.2 x 10^-3
10Mbps       170pkts       17s                         5.2 x 10^-5
100Mbps      1700pkts      2mins 50s                   5.2 x 10^-7
1Gbps        17000pkts     28mins                      5.4 x 10^-9
10Gbps       170000pkts    4hrs 43mins                 5.4 x 10^-11

Table 1: Characteristics of a 200ms TCP connection using traditional congestion control.

[Figure 1: Traditional TCP scaling properties.]

[Figure 2: Scalable TCP scaling properties.]

The traditional TCP response curve [4] can be approximated for small end-to-end drop rates by

$$cwnd_r \approx \sqrt{\frac{1.5}{P_r}}$$

The two response curves have different forms for the multiplicative function of $P_r$; the two schemes cannot have average windows of the same value for all end-to-end loss rates $P_r$. However, all that is needed is a suitable evolutionary approach that allows connections to better use bandwidth in wide area networks when it is available. The argument that follows was first introduced in [6].

Traditional TCP connections cannot effectively use large windows and in practice have a limited amount of socket receive and send buffer memory available, so they will tend not to have windows greater than a certain size; call this the legacy window size lwnd. Associate with this window size the legacy loss rate, $p_l$, which is the maximum packet loss rate needed to support windows larger than lwnd. Suppose Scalable TCP uses the traditional congestion window update algorithm when $cwnd \le lwnd$ and the Scalable TCP congestion window update algorithm for $cwnd > lwnd$. The sharing properties of Scalable TCP can then be considered in two states. For levels of congestion with drop rates higher than $p_l$ the Scalable TCP connections use the traditional TCP algorithm and receive the same share as a traditional TCP stack (see footnote 3). For levels of congestion with drop rates less than $p_l$ legacy connections will have a window of at least lwnd. Scalable TCP connections will receive larger windows than legacy connections but the legacy connections are never starved of bandwidth. The choice of the value lwnd is a policy decision.
If lwnd = 16, it is only when traditional TCP connections have a window of about 40 that Scalable TCP connections of the same round trip time will receive twice the bandwidth. This suggests that concerns about Scalable TCP receiving a higher bandwidth than traditional TCP connections with windows greater than lwnd should not arise until the window size is already large enough for there to be concerns about TCP packet loss recovery times. For the purposes of this paper we will assume that lwnd is 16 packets; this corresponds to 24KB with 1500 byte segments and a legacy loss rate, $p_l$, of $5.86 \times 10^{-3}$. The response curves for traditional TCP and Scalable TCP are plotted for an lwnd of 16 in Figure 3. To ensure a continuous and decreasing response curve, the Scalable TCP response curve must pass through the point $(p_l, lwnd)$, giving the following constraint on a and b

$$\frac{a}{b} = p_l \, lwnd = \sqrt{1.5\, p_l} \qquad (1)$$

The number of free variables is now reduced to one; choosing b fixes a.

[Figure 3: Response curves for traditional TCP and Scalable TCP. Window size (pkts) plotted against loss rate for Standard TCP and Scalable TCP.]

Footnote 3: There is not an intrinsic problem with using the Scalable TCP algorithm in a small window regime; previous studies [, 3] suggest that there may be benefits to doing so in the context of ECN IP networks. However Scalable TCP connections would receive a smaller share of the bandwidth, would react more slowly to congestion, and may alter the dynamics of existing traffic. These effects could make evolution through incremental deployment more difficult and so are avoided in the design presented here.

3.2 Instantaneous rate variation

The instantaneous rate of a TCP connection probes around a mean value giving it a share of the available capacity. The size of this stochastic rate variation for the Scalable TCP congestion window update algorithm has been studied previously [3] (see footnote 4). The coefficient of variation for the instantaneous sending rate is

$$\mathrm{CoV}(x_r) = \mathrm{CoV}\!\left(\frac{cwnd_r}{T_r}\right) = \sqrt{\frac{b}{2}}$$

provided $P_r \approx 0$. This suggests that b should be chosen as small as possible to reduce instantaneous rate variation, a conclusion that agrees with intuitive arguments based on the packet loss recovery times shown in Figure 2. It appears sensible not to make the algorithm have a rate variation larger than traditional TCP, so b should satisfy $b \le \frac{1}{2}$.

The Scalable TCP algorithm responds to congestion events at most once per round trip time. Therefore it is necessary that the window expansion and contraction cycle lasts longer than a round trip time (see footnote 5). Using the packet loss recovery time of Scalable TCP and noting that b is the only free variable, this constraint becomes

$$b > 1 - \frac{1}{\sqrt{1.5\, p_l}} \qquad (2)$$

This constraint is often trivially satisfied; with a lwnd of 16, it becomes satisfied for any b > 0 because $1 - 1/\sqrt{1.5\, p_l} = 1 - \frac{32}{3} < 0$.

Footnote 4: The responses were considered in the context of an ECN implementation. However these results provide a good approximation with large windows, low drop probabilities, and the constraint in Equation 3.

Footnote 5: ECN implementations reacting to each congestion notification do not necessarily suffer from this limitation.
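The parameter constraints of Sections 3.1 and 3.2 can be evaluated numerically. This is a sketch under stated assumptions: the legacy loss rate is taken from inverting the traditional response curve $cwnd \approx \sqrt{1.5/p}$ at lwnd, a is obtained from b by requiring the scalable response curve $a/(b\,p)$ to pass through $(p_l, lwnd)$, and the cycle length is taken to be $-\log(1-b)/\log(1+a)$ round trips. All function names are hypothetical.

```python
import math


def legacy_loss_rate(lwnd):
    """Loss rate at which traditional TCP sustains a window of lwnd packets."""
    return 1.5 / lwnd ** 2


def a_for(b, lwnd=16):
    """Increase constant a implied by b when a / b = p_l * lwnd."""
    return b * legacy_loss_rate(lwnd) * lwnd


def rate_cov(b):
    """Coefficient of variation of the sending rate, sqrt(b / 2)."""
    return math.sqrt(b / 2.0)


def cycle_rtts(b, lwnd=16):
    """Round trips needed to regain the window after one reduction."""
    return -math.log(1.0 - b) / math.log(1.0 + a_for(b, lwnd))


for b in (1 / 2, 1 / 4, 1 / 8, 1 / 16):
    print(f"b={b:<7} a={a_for(b):.5f}  CoV={rate_cov(b):.2f}  cycle={cycle_rtts(b):.1f} RTTs")
```

With lwnd = 16 this gives a legacy loss rate of about $5.86 \times 10^{-3}$ and a cycle well above one round trip for every b shown. Read this way, the constraint yields a of roughly 0.0117 for b = 0.125, slightly above the a = 0.01 that the paper adopts; the sketch does not attempt to explain that rounding.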
3.3 Convergence

Convergence speed is of significance to an elastic rate protocol that must adapt to changing network conditions on reasonable timescales. Ideally convergence should happen instantaneously. However the use of packet loss as a signaling channel, the need to provide compatibility with legacy traffic and to use minimal cost network devices, can make this goal difficult.

Suppose that at time $t_0$ a sudden overload shock occurs and $P_r$ increases. Then a source will reduce its sending rate upon receiving feedback by a factor of 2 in less than

$$\frac{\log(1/2)}{\log(1 - b)}\, T_r$$

In fact this is an overestimate of the time needed. Any overload that causes loss and delay will lead to a lower sending rate because acknowledgments from the receiver are needed to release packets into the network; this self-clocking is a robust mechanism that reacts within a round trip time to overload events. Traditional TCP congestion control corresponds to a choice of $b = \frac{1}{2}$; a fairly rapid convergence speed in the face of overload.

In response to a sudden increase in the available capacity on a route, $P_r \approx 0$, and the time taken for the source to increase its sending rate by a factor of 2 is

$$\frac{\log(2)}{\log(1 + a)}\, T_r$$

By contrast a traditional TCP connection would require $cwnd_r(t_0)$ round trip times to respond to the increase in available capacity. The Scalable TCP algorithm responds more effectively to changes in available capacity when window sizes are large.

These convergence properties suggest that b (and also a; see footnote 6) should be chosen as large as possible for fastest convergence. This conflicts with the desire to keep the instantaneous rate variation small, which requires b to be small (Section 3.2). Table 2 shows the properties of a Scalable TCP connection for a general round trip time and when it is equal to 200ms; these choices of a and b are compatible with a legacy window of 16 packets. The setting of a and b is a policy choice determined by which system properties are deemed to be most important. It would appear that the variability of choosing $b = \frac{1}{2}$ is too large.
However the slow convergence times of $b = \frac{1}{16}$ would suggest that choosing b between $\frac{1}{4}$ and $\frac{1}{8}$ is desirable. In this paper $b = \frac{1}{8}$ is selected because it offers a good balance between rate fluctuation and convergence time. The choice of the optimal parameter in this range appears to make only a marginal difference to the theoretical dynamics of the algorithm; further experimentation using more implementations and real workloads will help to refine this choice.

Footnote 6: a is proportional to b by Equation 1; so a large b gives a large a.

b      a       Coefficient of variation for rate   Packet loss recovery time   Time to halve rate     Time to double rate
1/2    2/50    0.50                                17.7 T_r or 3.54s           T_r or 0.20s           17.7 T_r or 3.54s
1/4    1/50    0.35                                14.5 T_r or 2.91s           2.41 T_r or 0.48s      35 T_r or 7.00s
1/8    1/100   0.25                                13.4 T_r or 2.68s           5.19 T_r or 1.04s      69.7 T_r or 13.9s
1/16   1/200   0.18                                12.9 T_r or 2.59s           10.7 T_r or 2.15s      139 T_r or 27.8s

Table 2: Properties of a Scalable TCP connection with a variety of parameter settings for a general round trip time or at 200ms.

3.4 Stability

It has been shown [6] that for heterogeneous round trip times and arbitrary network topologies, the generalised Scalable TCP algorithm is locally stable (see footnote 7) about its equilibrium provided

$$a < \frac{p_j(\hat{y}_j)}{\hat{y}_j\, p_j'(\hat{y}_j)} \qquad \forall j \in J \qquad (3)$$

where $\hat{y}_j$ is the equilibrium rate at each link, $p_j(y)$ is the probability of loss at link j for an arrival rate y, and J is the set of all links. For example, assuming Poisson packet arrivals (see footnote 8), the scheme is stable if FIFO network buffers are provisioned to be of size at most $\frac{1}{a}$. Hence if the network buffers can be configured the system can be made stable in a control theoretic sense. A control theoretic approach to the design of a stable and scalable TCP using ECN is given in []. Further improvements and enhancements are possible with the use of adaptive queue management (AQM) schemes at network devices but are beyond the scope of this paper.

4. EXPERIMENTS

Scalable TCP was implemented in the network stack of the Linux 2.4.19 operating system. This kernel implements a sophisticated TCP stack supporting the following relevant standards: TCP extensions for high performance (see footnote 9) [0], SACK [7], and D-SACK [8]. The stack also implements packet retransmission timeout checking to detect lost packets (see footnote 10), reordering detection using D-SACK, rate halving, and burst limiting. The Scalable TCP patch (see footnote 11) adds the congestion window algorithm changes, scalings to kernel buffers, the removal of special case small packet handling in the SysKonnect driver, and debug counters. The scaling of kernel buffers increases the send and receive queues that lie between the kernel and device driver (see footnote 12). This is needed because scheduling timeslices have remained constant while interface speeds have increased.
The SysKonnect device driver for Linux 2.4.19 copies small packets into their own buffer to conserve memory. In order to optimize for speed rather than space efficiency, the driver's interrupt handling routine was changed to not make this extra copy. Both of these changes were simple and significantly improved TCP throughput; they will be termed the gigabit kernel modifications. In order to adjust for the effect of delayed acknowledgments a was set to 0.02. The implementation of byte counting [], which updates the congestion window in proportion to the exact number of bytes acknowledged, would remove the need to adjust for delayed acknowledgments.

Footnote 7: This is in the sense that the differential equations for all the sending rates are locally stable with respect to the feedback loop controlling them.

Footnote 8: Other traffic models can also be considered and the results are qualitatively similar; see [] for some examples.

Footnote 9: This provides the following enhancements: window scaling, timestamping, and protection against wrapped sequence numbers.

Footnote 10: This is similar to that used in TCP Vegas [] to quickly detect losses with duplicate acknowledgements.

Footnote 11: The Scalable TCP patch used for the experiments in this paper can be downloaded from: http://www-lce.eng.cam.ac.uk/~ctk21/scalable/.

Footnote 12: The transmit side queue is limited by a device's txqueue variable. The receive side queue size is set by the sysctl variable net.core.netdev_max_backlog. These were increased to 000 and 3000 respectively to hold the increased number of packets that can arrive during a period where the operating system cannot process them immediately.

[Figure 4: Testbed topology used for experiments.]

The DataTAG testbed consists of high performance PCs that have Supermicro P4DP8-G2 motherboards with dual 2.4GHz Xeon processors and gigabytes of memory. SysKonnect SK-9843 Gigabit Ethernet cards on a 33MHz/64bit PCI bus provided connectivity to the testbed network. 6 servers are located at CERN, Geneva, and 6 servers at StarLight, Chicago.
The clusters are connected through two Cisco 76xx routers with a 2.4Gbps packet over SONET link between Geneva and Chicago. The PCs are connected to each router through gigabit Ethernet ports. This topology is shown in Figure 4. The round trip time for a ping from Geneva to Chicago was 120ms. In the experiments that follow the interface between Geneva and Chicago had a FIFO queue of 2048 packets. All the other gigabit Ethernet interfaces on the routers had the factory default setting of a 40 packet FIFO queue. At most 9% of the bandwidth-delay product is available as buffers on the path; this trend towards a decrease in available buffering delay is likely to continue due to the cost of implementing highspeed memory systems in network devices.

Three sender side test cases are compared: TCP in an unaltered Linux 2.4.19 kernel, TCP in a Linux 2.4.19 kernel with the gigabit kernel modifications, and Scalable TCP in a Linux 2.4.19 kernel with the gigabit kernel modifications. The receivers used an unaltered Linux 2.4.19 kernel in all cases. The experiments were designed to explore the performance of Scalable TCP for bulk data transfer as could be found in wide area scientific networks.

4.1 Basic performance

In these tests 4 server and receiver pairs were used with TCP flows distributed evenly across the 4 machines. Each receiver in Chicago requested a file of size 2 Gigabytes from its associated server in Geneva. The server responded by transferring 2 Gigabytes of data (from memory) back to the receiver in Chicago. Upon completion of the 2 Gigabyte transfer the connection was completed and another request was initiated. This was intended to capture some slow-start and termination dynamics. In all cases each TCP socket had send and receive buffers set to 64MB; this allowed a single flow to make full use of any bandwidth available to it. Table 3 shows the results of these experiments.

Number of flows   2.4.19 TCP   2.4.19 TCP with gigabit kernel modifications   Scalable TCP
1                 7            16                                             44
2                 14           39                                             93
4                 27           60                                             135
8                 47           86                                             140
16                66           106                                            142

Table 3: Number of 2 Gigabyte transfers completed in 1200 seconds.

A significant throughput improvement of 60% to 180% was observed simply by scaling the internal Linux kernel buffers and removing the copying of small packets in the receive path of the SysKonnect device driver. The Scalable TCP congestion control algorithm further increased throughput by 34% to 175% over that observed with traditional TCP using the gigabit kernel modifications. Using 16 Scalable TCP flows across four machines achieved just over 80% of the maximal performance possible over a saturated 2.4Gbps link after accounting for the required IP and TCP header overhead incurred with packets of size 1500 bytes (see footnote 13). The Linux 2.4.19 kernel with gigabit kernel modifications could get just over 60% of the maximal 2.4Gbps performance with 16 flows. A standard Linux 2.4.19 kernel achieved at most 38% of the maximal performance with 16 flows.

4.2 Performance with Web traffic

These tests attempted to measure the impact on Web traffic of large bulk transfer users. In particular they assessed whether Scalable TCP has a detrimental effect on existing TCP users. In these tests, three receiver and server pairs each generated traffic equivalent to 400 active Web users (see footnote 14). Two machine pairs generated transfer requests of 2 Gigabytes in size, in the same way as the basic throughput test, with eight transfers in progress across the two machines at any one time. The parameters used for the Web traffic model are given in Table 4; these parameters are the same as those measured in [5] to generate self-similar traffic.
The Web traffic was made repeatable in the sense that the sample paths of user think times, embedded pages, inter-object times and page sizes were the same for a given user across each test. This repeatability allowed the Web traffic to be run in isolation and then with additional traffic to measure the impact of the bulk traffic on the Web transfers.

Table 5 displays the results of the experiments on mixing the traffic types. In none of the tests did the Web traffic experience any noticeable change in throughput. This offers evidence to suggest that the design of Scalable TCP has indeed provided a solution with negligible impact on existing traffic. The standard Linux 2.4.19 kernel with no modifications achieved 40% of the maximal possible system throughput over the time period. Applying the gigabit kernel modifications improved traditional TCP performance and achieved 52% of the maximal possible throughput. The bulk transfers using the Scalable TCP algorithm boosted the total traffic transferred to 75% of the maximum possible throughput.

³ For a combined IP and TCP header using the timestamp option the maximal capacity is about 96% of the stated interface capacity.
⁴ This traffic is not completely representative of Web traffic observed in real networks because only one round trip time was available for experimental purposes.

5. RELATED WORK

Several authors have made the case for using TCP Vegas [2, 17] and similar variants [3] in high-speed networks. The argument proceeds by observing that TCP Vegas uses network buffer delay as an implicit congestion signal as opposed to drops. Hence if network buffer delay can be controlled and used as a signaling mechanism, it should be possible to run the network at very high utilizations. This approach may prove to be successful but is challenging to implement.
To succeed, TCP Vegas implementations are needed that can run robustly in environments where noise affects delay estimates; noise could arise from heterogeneous network buffering schemes, operating system scheduling, network firewall processing, and cross traffic which does not control buffer delay, such as traditional TCP or UDP streams.

Others have used mechanisms to make one logical connection behave like multiple TCP connections to improve performance in high bandwidth wide area networks; this can be achieved either at the transport layer [4] or at the application layer by opening multiple connections. The results displayed in Table 3 show that this can be a pragmatic solution to improve throughput. However, it can be difficult to tune in a way that consistently provides good performance without causing a detrimental effect on existing network traffic when congestion occurs.

This work builds on the Highspeed TCP proposal [6] and uses the same arguments to achieve good sharing with legacy applications. Scalable TCP is simpler to implement than the parameterized Highspeed TCP algorithm due to its use of constants in the window update algorithm. The work also shares the analysis and design methods used to engineer other ECN TCP variants [11, 13].

6. CONCLUSION

Scalable TCP presents a simple change to the congestion window update algorithm which improves throughput in highspeed wide area networks. The performance improvement can be dramatic for senders using the Scalable TCP algorithm in bulk transfer networks; the improvement attributable to the algorithm can sometimes be over 100%. The scheme also promises to interoperate well with legacy traffic; results from the experiments conducted with Web traffic using traditional TCP stacks in parallel with several Scalable TCP flows performing bulk transfers showed negligible impact on the Web traffic transferred. A surprising result of the experiments performed is that simple optimizations to kernel device drivers can improve traditional TCP performance by over 100% when compared to a standard kernel.
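The change to the congestion window update can be illustrated with a simplified model. The sketch below is not the kernel implementation: it assumes the constants a = 0.01 (increase per acknowledged packet) and b = 0.125 (fractional decrease per loss event) proposed for Scalable TCP, which are not stated in this section, and it omits slow start, timeouts and any special handling of small windows. All function names are hypothetical.

```python
A = 0.01   # assumed Scalable TCP increase per acknowledged packet
B = 0.125  # assumed Scalable TCP fractional decrease per loss event


def traditional_on_ack(cwnd):
    """Traditional congestion avoidance: roughly one packet per round trip."""
    return cwnd + 1.0 / cwnd


def traditional_on_loss(cwnd):
    """Traditional TCP halves the window on a loss event."""
    return cwnd / 2.0


def scalable_on_ack(cwnd):
    """Scalable TCP adds a constant per acknowledgement."""
    return cwnd + A


def scalable_on_loss(cwnd):
    """Scalable TCP removes a constant fraction of the window."""
    return cwnd - B * cwnd


def rtts_to_recover(on_ack, on_loss, cwnd):
    """Round trips needed to regain `cwnd` after one loss event.

    Assumes one acknowledgement per packet in flight in each round trip.
    """
    target = cwnd
    w = on_loss(cwnd)
    rtts = 0
    while w < target:
        for _ in range(int(w)):
            w = on_ack(w)
        rtts += 1
    return rtts
```

For example, `rtts_to_recover(scalable_on_ack, scalable_on_loss, 1000.0)` and the same call with a window of 10000 packets can be compared against the traditional rules. In this model the Scalable TCP recovery time stays roughly constant as the window grows, whereas the traditional recovery time grows roughly in proportion to the window.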
Future work is needed to consider the impact of heterogeneous round trip times. There may be a requirement to correct the bias TCP has towards connections with smaller round trip times; the methods used for scalable ECN variants [11] could provide a good starting point for such modifications. Additional work could also consider more complex workload models which capture the needs of the applications that may be run on highspeed wide area networks.

Component                     Distribution   Parameters            Mean
Think times (sec)             Pareto         k = 10.0, α = 2.0     20.0
Objects per page              Pareto         k = 3.0, α = 1.5      9.0
Request file sizes (bytes)    Pareto         k = 12000, α = 1.2    72000
Inter object times (sec)      Pareto         k = 0.5, α = 1.5      1.5

Each component has the Pareto probability density function $p(x) = \alpha k^{\alpha} x^{-(\alpha+1)}$, $x > k$.

Table 4: Summary of distributions and parameters used in the Web user TCP connection model.

Type of bulk transfer users          Web traffic transferred   2 Gigabyte transfers completed
No bulk transfers                    65GB                      n/a
TCP in 2.4.19                        65GB                      36
TCP in 2.4.19 with buffer scaling    65GB                      58
Scalable TCP                         65GB                      96

Table 5: Performance with 4200 concurrent Web users and 8 bulk transfer users over 1200 seconds.

7. ACKNOWLEDGMENTS

Jean-Philippe Martin-Flatin offered valuable comments on early drafts of this paper and useful coding advice. Helpful discussions on the design of Scalable TCP were had with Glenn Vinnicombe, Sally Floyd, and Frank Kelly. Thanks also go to the DataTAG testbed support teams at CERN, Caltech and StarLight. This work was funded by the IST Programme of the European Union (grant IST-2001-32459, DataTAG project), the Royal Commission for the Exhibition of 1851, and AT&T Labs - Research.

8. REFERENCES

[1] M. Allman. TCP Byte Counting Refinements. ACM Computer Communication Review, 29(3), July 1999.
[2] L. S. Brakmo and L. L. Peterson. TCP Vegas: End to End Congestion Avoidance on a Global Internet. IEEE Journal on Selected Areas in Communications, 13(8):1465-1480, October 1995.
[3] D. H. Choe and S. H. Low. Stabilized Vegas. In Proc. of the 39th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, October 2001.
[4] J. Crowcroft and P. Oechslin. Differentiated End-to-End Internet Services using a Weighted Proportional Fair Sharing TCP. ACM Computer Communication Review, 28(3), July 1998.
[5] A. Feldmann, A. Gilbert, P. Huang, and W. Willinger. Dynamics of IP Traffic: A Study of the Role of Variability and the Impact of Control. In SIGCOMM 1999, Boston, MA, August 1999.
[6] S. Floyd. HighSpeed TCP for Large Congestion Windows. Internet Draft <draft-floyd-tcp-highspeed-01.txt>, August 2002. Work in progress.
[7] S. Floyd, J. Mahdavi, M. Mathis, and M. Podolsky. An Extension to the Selective Acknowledgement (SACK) Option for TCP. Internet RFC 2883, July 2000.
[8] S. Floyd, K. K. Ramakrishnan, and D. Black. The addition of explicit congestion notification (ECN) to IP. Internet RFC 3168, September 2001.
[9] V. Jacobson. Congestion Avoidance and Control. In SIGCOMM 1988. An updated version is available via ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.
[10] V. Jacobson, R. Braden, and D. Borman. TCP Extensions for High Performance. Internet RFC 1323, May 1992.
[11] T. Kelly. On Engineering a Stable and Scalable TCP Variant. Technical Report CUED/F-INFENG/TR.435, Laboratory for Communication Engineering, Cambridge University, June 2002.
[12] M. Mathis, J. Mahdavi, S. Floyd, and A. Romanow. TCP Selective Acknowledgment Options. Internet RFC 2018, October 1996.
[13] A. Misra and T. J. Ott. Performance Sensitivity and Fairness of ECN-Aware Modified TCP. In Networking 2002: Networking Technologies, Services, and Protocols; Performance of Computer and Communication Networks; and Mobile and Wireless Communications, Second International IFIP-TC6 Networking Conference Proceedings.
[14] J. Padhye, V. Firoiu, D. Towsley, and J. Kurose. Modeling TCP Reno Performance: A Simple Model and its Empirical Validation. IEEE/ACM Transactions on Networking, 8(2):133-145, April 2000.
[15] W. R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, 1994.
[16] G. Vinnicombe. On the stability of networks operating TCP-like congestion control. In Proc. of the 15th IFAC World Congress on Automatic Control, Barcelona, Spain, July 2002.
[17] E. Weigle and W. Feng. A Case for TCP Vegas in High-Performance Computational Grids. In Proc. of the 9th IEEE International Symposium on High Performance Distributed Computing (HPDC 10), San Francisco, CA, August 2001.