TCP enhancements
M. Veeraraghavan, April 3, 2004

In this writeup, we summarize the extensions made to TCP (relative to what I teach in the Internet architecture/protocols course). The list includes:

1. Larger window sizes, accommodated through a window scale option, are proposed for LFNs (Long Fat Networks, i.e., networks with long fat pipes) [1], which are networks that have large values of the delay bandwidth product (DBP). TCP performance depends upon this product. The example cited is satellite networks, in which round-trip times are at least 558 ms [8]. In optical networks, the bandwidth is high; therefore, even if the propagation delay is not that high, the total DBP can be high. Many of the extensions made to TCP for satellite networks are therefore applicable to optical networks. Reference [1] additionally proposes an RTTM (Round Trip Time Measurement) option and PAWS (Protect Against Wrapped Sequence numbers) for LFNs.

2. Even with Fast Retransmit/Fast Recovery, if multiple packets are dropped within one window, the sender will go into Slow Start [1]. This is explained in [10]. It says that, in the absence of SACKs, when multiple packets are lost in one window, the key difference between retransmissions that occur after an RTO (retransmission timeout) and those that occur after a triple duplicate (TD) ACK comes into play. After an RTO, all packets following the lost one are retransmitted, whereas after a TD loss detection, only the lost packet is retransmitted. The Fast Recovery scheme increases cwnd by three because it assumes that three packets were successfully received, which led to the three duplicate ACKs. This is called inflating the window. After retransmitting the lost packet, cwnd is inflated by 1 for every further duplicate ACK, on the assumption that each duplicate ACK was generated when a new packet was successfully received. Now, if multiple packets were lost in the window, the sender will not recognize this until it receives the ACK for the retransmitted packet.
When this ACK arrives, the sender will see that it does not cover all the packets sent after the lost packet; instead, it asks for some other packet. I assume that by the time this happens, the RTO for the other lost packet will expire, causing cwnd to drop to 1 and Slow Start recovery to begin. Reference [10] calls this a partial ACK and proposes a modification to Fast Recovery that prevents this fall back to Slow Start recovery. The modified algorithm is called NewReno. It is specified in an experimental RFC, not a standard. For details of how the recovery should proceed if a partial ACK is received after a Fast Retransmit, see [10].
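The window inflation and the partial ACK described in item 2 can be illustrated with a small sketch. This is a toy model under stated assumptions, not an implementation of [10]: it counts whole segments rather than bytes, it does not model new segments sent while the window is inflated, the class RenoSender and its fields (recover, in_fast_recovery) are hypothetical names introduced for illustration, and it assumes ssthresh is set to the larger of half the outstanding data and 2 segments, as in [9]. It only classifies the ACK that ends plain Reno Fast Recovery as full or partial; what NewReno then does with a partial ACK is specified in [10].

```python
class RenoSender:
    """Toy Reno sender, counted in segments (all names are hypothetical)."""

    def __init__(self, cwnd, next_seq, una):
        self.cwnd = cwnd            # congestion window, in segments
        self.ssthresh = None
        self.next_seq = next_seq    # number of the next new segment to send
        self.una = una              # oldest unacknowledged segment
        self.dup_acks = 0
        self.in_fast_recovery = False
        self.recover = None         # highest segment sent when loss was detected

    def on_ack(self, ack):
        """Process an ACK; ack is the next segment the receiver expects."""
        if ack == self.una:                        # duplicate ACK
            self.dup_acks += 1
            if self.dup_acks == 3 and not self.in_fast_recovery:
                flight = self.next_seq - self.una  # segments sent but unacked
                self.ssthresh = max(flight // 2, 2)
                self.cwnd = self.ssthresh + 3      # inflate by the 3 dup ACKs
                self.recover = self.next_seq - 1
                self.in_fast_recovery = True
                return "fast retransmit of segment %d" % self.una
            if self.in_fast_recovery:
                self.cwnd += 1                     # inflate by 1 per extra dup ACK
                return "inflate"
            return "dup"
        # New ACK: data beyond una has been acknowledged.
        self.una = ack
        self.dup_acks = 0
        if self.in_fast_recovery:
            self.in_fast_recovery = False
            self.cwnd = self.ssthresh              # deflate the window
            if ack <= self.recover:
                return "partial ACK: segment %d also lost" % ack
            return "full ACK"
        return "new ACK"


# Segments 1..8 are outstanding; segments 1 and 4 are lost.
sender = RenoSender(cwnd=8, next_seq=9, una=1)
for _ in range(6):              # segments 2, 3, 5, 6, 7, 8 each trigger a dup ACK
    print(sender.on_ack(1))
print(sender.on_ack(4))         # ACK for the retransmitted segment 1
```

In this run the third duplicate ACK triggers the retransmission of segment 1, the next three duplicate ACKs inflate cwnd, and the ACK that finally arrives asks for segment 4 rather than segment 9, which is the partial ACK case that plain Reno does not handle.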

3. How the Fast Retransmit and Fast Recovery algorithms work: Fast Retransmit is simple. When the sender receives three duplicate ACKs, it realizes that the network is telling it something, i.e., that one packet got lost but the remaining ones are being delivered. This is because a duplicate ACK is generated only upon receipt of a new packet. With Fast Retransmit, the sender simply retransmits the lost packet (unlike after an RTO, where all packets following the lost packet are retransmitted). Fast Recovery works as follows. It drops ssthresh to half of cwnd (more correctly, to the larger of two numbers: half of FlightSize and 2 segments [9], where FlightSize is the number of bytes sent but not yet acknowledged); this is the same as after an RTO. Furthermore, it sets cwnd to ssthresh+3. The reason is that since three duplicate ACKs were received, it assumes that three segments got through, hence this inflation. It then increases cwnd by 1 for each further duplicate ACK received, because a duplicate ACK is presumably generated when another data packet is received successfully. If permitted by the window, it keeps sending packets. Reference [7] mentions Reno TCP's wait of roughly half a round-trip time during Fast Recovery. A possible explanation is that, after the lost packet is retransmitted, packets continue to be sent only when the next duplicate ACK is received; is that half a round-trip time later? When the retransmitted packet is ACKed, Fast Recovery ends by dropping cwnd to ssthresh; this deflation brings the sender quickly into CA (Congestion Avoidance) instead of SS (Slow Start).

4. Difference between RFC 2001 and RFC 2581: In RFC 2001, the value to which ssthresh drops after an RTO or a TD is half of the current window (the minimum of cwnd and AW, the advertised window), but at least two segments. In RFC 2581, it is half of FlightSize (the amount of data sent but not yet acknowledged), but at least two segments. A second difference is that in RFC 2581, IW (initial window) and RW (restart window) MUST be less than or equal to 2 segments. In other words, cwnd can start at 2 segments instead of 1.
A third difference is that RFC 2001 says that if cwnd = ssthresh, the sender is in SS, whereas RFC 2581 says it could be in either SS or CA when this happens.

5. Difference between CA and SS: In CA mode, cwnd increases by at most 1 segment per RTT no matter how many ACKs are received, but in SS, cwnd increases by 1 segment for each ACK received. In Allman's paper [20], he talks of byte counting, which means that if an ACK acknowledges two segments, cwnd will increase by 2 segments, while in ordinary SS it will increase by only 1 segment. In CA, the increase is MSS*MSS/cwnd each time an ACK is received.

6. Restart window
In [9] (RFC 2581), three types of windows are described: the initial congestion window (IW), the restart congestion window (RW) and the loss congestion window. The initial window can be as high as 2 segments. The restart window is the same as the initial window, but the loss window, the starting point in a Slow Start recovery, is always 1 segment. The restart window is used to reset cwnd after an idle period. The problem is that during an idle period, the TCP sender cannot use the arrival of ACKs to determine when to send new segments into the network. Therefore Slow Start is used after an idle period, which is defined as follows: if a segment is not received for one retransmission timeout period, then cwnd is reduced to the size of RW. RW is set equal to the initial window size. But with this rule, in HTTP/1.1, where a persistent TCP connection is used, the server always receives a segment (with the URL) before it sends data. Therefore, there is a possibility of sending a burst, because cwnd may not get reset to the RW value before the sender sends. The rule to determine an idle period is therefore changed from the last received segment to the last sent segment. In other words, if a segment was not sent within an RTO value, cwnd is reset to the RW value. My take: with a long think time, even with the last-received rule, cwnd will get reset before the URL for the new request is received. Therefore cwnd will already have been reset to the RW value before the response is sent. So I don't really see the need for this change from received to sent.

7. Initial window size
In [5], which is an experimental RFC, the proposal is made to start with an initial window size of up to 4 segments. It is stated that after a timeout (TO) loss, when the sender re-enters Slow Start, the window size will always be restricted to 1 segment. This is referred to as the loss window in [9]. The advantage of starting with a larger window is that for small file sizes, the transfer delay can be reduced from 3 RTTs down to 1. This is especially important for LFNs.
The disadvantage of starting with a larger initial window is that a router may not be able to handle a burst of 4 segments. This will lead to dropped packets, retransmissions, more delay and overall worse network behavior. The actual formula stated in [5] is

Initial window size = min(4*MSS, max(2*MSS, 4380 bytes))    (1)

In [9], the initial window was limited to 2 segments, and [9] (RFC 2581) is a standards-track RFC (not experimental). My conclusion is that a larger initial window size is good for LFNs but not for highly congested networks, where there will be too much loss and higher retransmission timeouts, which result in idle time with the sender waiting for an RTO, and hence lower throughput. Reference [6] describes a simple experiment with only 3 buffers leading into a 9600 baud modem at the receiver. It claims that there is no significant degradation of performance even when the initial window size is 4.
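As a quick check of formula (1), the following sketch evaluates the initial window of [5] for a few MSS values and compares it with the 2-segment limit of [9]. The function names and the sample MSS values (536, 1460 and 4096 bytes) are illustrative assumptions chosen here, not values taken from the RFCs.

```python
def initial_window_rfc2414(mss):
    """Upper bound on the initial window, in bytes, per formula (1)."""
    return min(4 * mss, max(2 * mss, 4380))


def initial_window_rfc2581(mss):
    """Upper bound of 2 segments, in bytes, as limited in [9]."""
    return 2 * mss


for mss in (536, 1460, 4096):   # sample MSS values, chosen for illustration
    iw = initial_window_rfc2414(mss)
    # Columns: MSS, formula (1) in bytes, the same in whole segments,
    # and the 2-segment limit in bytes.
    print(mss, iw, iw // mss, initial_window_rfc2581(mss))
```

With these inputs, the formula allows 4 segments at a 536-byte MSS, 3 segments at 1460 bytes and only 2 segments at 4096 bytes, so the gain over the 2-segment limit shrinks as the MSS grows.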

8. SACK
Reference [3] describes two SACK options. The first is a SACK-permitted option that is sent in the SYN segment. The second, the SACK option itself, specifies blocks of accepted segments. Given the limited space for TCP header options, a maximum of 4 blocks can be specified. The receiver sends SACKs and the sender does selective repeat. This option is especially well suited for LFNs. Other RFCs state that with this option the Fast Recovery procedure works well, but that without the SACK option NewReno is needed [10].

9. Differences between RFCs 2001 & 2581. In RFC 2581, the IW is increased to 2 segments. When cwnd = ssthresh, it states that either SS or CA can be used, while RFC 2001 states that when cwnd = ssthresh, the sender is in SS. RFC 2001 states that when congestion occurs, ssthresh is set to half of the minimum of cwnd and the advertised window, but at least two segments. RFC 2581 instead states that when congestion occurs (detected with a TO or a TD), then

ssthresh = max(FlightSize / 2, 2*SMSS)    (2)

where FlightSize is the amount of data that has been sent but not yet acknowledged. This is clearly different from cwnd. If cwnd < AW, then FlightSize will be cwnd, which is what can be sent without an ACK. If a loss occurs in the middle of sending a cwnd's worth of data, then FlightSize could be less than cwnd, since the whole cwnd has not yet been sent. If AW < cwnd, then only AW can be sent. Again, at the time of loss, FlightSize could be smaller than AW. Finally, RFC 2581 clarifies some of the procedures related to generating ACKs. It basically says that a delayed ACK must be generated within at most 500 ms of receiving a segment. It also talks about a difference between RMSS (the MSS at the receiver) and the MSS decided by path MTU discovery, the sender, etc. The rule to ACK every other segment is only a SHOULD, not a MUST.

10. ECN + RED

References
[1] V. Jacobson, R. Braden, D. Borman, TCP extensions for high performance, IETF RFC 1323, May 1992.
[2] W. Stevens, TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms, IETF RFC 2001, January 1997.
[3] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, TCP Selective Acknowledgement Options, IETF RFC 2018, Oct. 1996.
[4] D. Borman, TCP and UDP over IPv6 Jumbograms, IETF RFC 2147, May 1997.
[5] M. Allman, S. Floyd, C. Partridge, Increasing TCP's Initial Window, IETF RFC 2414, September 1998.
[6] T. Shepard, C. Partridge, When TCP Starts Up With Four Packets Into Only Three Buffers, IETF RFC 2416, September 1998.
[7] K. Ramakrishnan, S. Floyd, A Proposal to add Explicit Congestion Notification (ECN) to IP, IETF RFC 2481, Jan. 1999.
[8] M. Allman, D. Glover, L. Sanchez, Enhancing TCP Over Satellite Channels using Standard Mechanisms, IETF RFC 2488, January 1999.
[9] M. Allman, V. Paxson, W. Stevens, TCP Congestion Control, IETF RFC 2581, Apr. 1999.
[10] S. Floyd, T. Henderson, The NewReno Modification to TCP's Fast Recovery Algorithm, IETF RFC 2582, April 1999.
[11] S. Floyd, J. Mahdavi, M. Mathis, M. Podolsky, An Extension to the Selective Acknowledgement (SACK) Option for TCP, IETF RFC 2883, July 2000.
[12] M. Allman, H. Balakrishnan, S. Floyd, Enhancing TCP's Loss Recovery Using Limited Transmit, IETF RFC 3042, January 2001.
[13] W. Doeringer and others, A survey of light-weight transport protocols for high-speed networks, IEEE Trans. Comm., 38(11):2025-39, Nov. 1990.
[14] R. Gupta and others, A receiver-driven transport protocol for the web, Proc. Informs, 2000.
[15] T. R. Henderson, Design principles and performance analysis of SSCOP: A new ATM Adaptation Layer protocol, Comp. Comm. Review, 25(2):47-59, Apr. 1995.
[16] R. R. Stewart and others, Stream Control Transmission Protocol, IETF, Internet Draft draft-ietf-sigtran-sctp-09.txt, 19 Apr. 2000.
[17] V. Jacobson, Congestion avoidance and control, Proc. ACM SIGCOMM '88, pp. 314-29, Aug. 1988.
[18] S. Iren, P. D. Amer, P. T. Conrad, The Transport Layer: Tutorial and Survey, ACM Computing Surveys, Vol. 31, No. 4, Dec. 1999.
[19] S. Floyd, V. Jacobson, Random Early Detection Gateways for Congestion Avoidance, IEEE/ACM Transactions on Networking, 1993.
[20] M. Allman, On the Generation and Use of TCP Acknowledgments, ACM Computer Communication Review, vol. 28, no. 5, Oct. 1998.
[21] M. Mathis, J. Semke, J. Mahdavi, T. Ott, The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm, ACM Computer Communication Review, vol. 27, no. 3, July 1997.