Part 1: Lecture 2, TCP congestion control
Summary of last time
- TCP headers and details of the flags
- Flow control
- TCP sequence numbers
- TCP connection establishment and termination
- End-to-end principle, layering and functionalities
Performance
Flow control
Sliding window:
[Figure: packets 1 to 10 with the initial window; as packets are acknowledged, the window slides to the right.]
Link capacity
In TCP you are limited by the receive window (your upper bound). Imagine you don't have such buffer constraints:
- How fast can you put packets on the link? Your bandwidth, in bits/sec.
- How long do you have to wait for an ACK? Your RTT, in seconds.
What about when you have links with different bandwidths and different RTTs?
BDP
The BDP (Bandwidth Delay Product) = bandwidth (bits per second) * round trip time (seconds).
A network with a large BDP (> 10^5 bits, i.e. > 12.5 kbytes) is called an LFN: long fat network.
Example: your node in Amsterdam (1 Gbps) talking to a node in San Diego (1 Gbps), with an RTT of 162 msec:
BDP = 1 Gbps * 162 msec
    = 1,000,000,000 bits/sec * 0.162 sec
    = 162,000,000 bits
    = 162,000,000 / 8 bytes
    = 20,250,000 bytes
    = 20.25 MB
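The arithmetic on this slide can be checked with a small sketch. The function names are hypothetical, and the RTT is passed in whole milliseconds (an assumption made here so that the integer arithmetic stays exact).

```python
LFN_THRESHOLD_BITS = 10**5  # "large BDP" threshold quoted on the slide


def bdp_bits(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth delay product in bits (bandwidth in bits/sec, RTT in ms)."""
    return bandwidth_bps * rtt_ms // 1000


def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth delay product in bytes."""
    return bdp_bits(bandwidth_bps, rtt_ms) // 8


def is_lfn(bandwidth_bps: int, rtt_ms: int) -> bool:
    """True when the path counts as a long fat network."""
    return bdp_bits(bandwidth_bps, rtt_ms) > LFN_THRESHOLD_BITS


# Amsterdam to San Diego: 1 Gbps, 162 ms RTT
print(bdp_bits(10**9, 162))   # bits in flight needed to fill the pipe
print(bdp_bytes(10**9, 162))  # the same quantity in bytes
print(is_lfn(10**9, 162))
```

The byte value is the amount of unacknowledged data the sender must be allowed to have in flight to keep the link full.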
Problems with LFN
- Receive window size (wasting bandwidth)
- Need better RTT measurements (used for timeout calculation)
- Wrapping of sequence numbers (32 bits)
- Packet loss dramatically reduces throughput
More information can be found at: Enabling High Performance Data Transfers
Error recovery
Positive acknowledgements with retransmission
TCP uses a positive acknowledgement scheme: the ACKNOWLEDGEMENT NUMBER in the header specifies the sequence number of the next missing octet (for the stream flowing in the opposite direction of the segment).
Events in time order:
- Sender: send Packet 1
- Receiver: receive Packet 1, send ACK 1
- Sender: receive ACK 1, send Packet 2
- Receiver: receive Packet 2, send ACK 2
- Sender: receive ACK 2
Error recovery
How does TCP handle problems in the transmission? What should it do when some segments are lost? And when can you actually say in TCP that a segment is lost?
Retransmission
TCP uses an adaptive retransmission algorithm to determine the timeout value before retransmission.
Events in time order:
- Sender: send Packet 1, start timer
- Receiver: the packet should arrive and an ACK should be sent (the packet is lost)
- Sender: the ACK would normally arrive
- Sender: timer expires; retransmit Packet 1, start timer
- Receiver: receive Packet 1, send ACK 1
- Sender: receive ACK 1, cancel timer
How do you determine the ideal RTO (retransmission timeout)?
RTT
Round trip time (RTT): the time taken by the signal to travel from sender to receiver, plus the time for the acknowledgement of receipt to travel from receiver to sender.
Speed of light in fiber: 200 km/ms
RTT estimation
Know more: Computing TCP's Retransmission Timer, RFC 6298, June 2011
- SampleRTT is measured once per RTT, for packets that have been transmitted only once. With the timestamp option ON there is one RTT measurement per ACK.
- SmoothedRTT (SRTT) is the weighted average of the SampleRTT values collected: an exponentially weighted moving average (EWMA).
  SRTT = (1 - α) * SRTT + α * SampleRTT
  With α = 1/8 = 0.125:
  SRTT = 0.875 * SRTT + 0.125 * SampleRTT
Timeout interval
[Figure: SampleRTT and SRTT over time.]
RTTVAR is the variation of the RTT: the EWMA of the difference between SampleRTT and SRTT.
  RTTVAR = (1 - β) * RTTVAR + β * |SampleRTT - SRTT|
  β = 1/4 = 0.25
  RTO = SRTT + max(clock granularity, 4 * RTTVAR)
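The estimator can be sketched as follows. This is a minimal illustration following RFC 6298: the first sample initialises SRTT to the sample and RTTVAR to half of it, RTTVAR is updated before SRTT, and the RTO is rounded up to a minimum of 1 second. The class name, the 1 ms clock granularity and the `min_rto` parameter are assumptions made for this sketch.

```python
class RtoEstimator:
    """Sketch of the RFC 6298 retransmission timeout computation."""

    def __init__(self, alpha=1 / 8, beta=1 / 4, granularity=0.001, min_rto=1.0):
        self.alpha = alpha
        self.beta = beta
        self.granularity = granularity  # clock granularity G, in seconds (assumed)
        self.min_rto = min_rto          # RFC 6298 rounds the RTO up to 1 second
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0                  # initial RTO before any measurement

    def update(self, sample_rtt: float) -> float:
        """Feed one SampleRTT (seconds) and return the new RTO."""
        if self.srtt is None:
            # First measurement
            self.srtt = sample_rtt
            self.rttvar = sample_rtt / 2
        else:
            # RTTVAR uses the old SRTT, so it is updated first
            self.rttvar = ((1 - self.beta) * self.rttvar
                           + self.beta * abs(self.srtt - sample_rtt))
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * sample_rtt
        rto = self.srtt + max(self.granularity, 4 * self.rttvar)
        self.rto = max(self.min_rto, rto)
        return self.rto


est = RtoEstimator(min_rto=0.0)  # disable the 1 s floor to see the raw values
for sample in (0.100, 0.200, 0.120):
    print(round(est.update(sample), 4))
```

Setting `min_rto=0.0` is only for illustration; with the default floor every RTO in this example would be 1 second.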
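Complex TCP retransmission
[Figure: two timelines between Host A and Host B.]
- Premature timeout: Host A sends Seq=92 (8 bytes of data) and Seq=100 (20 bytes of data); the Seq=92 timer expires before the ACKs arrive, so Host A resends Seq=92 (8 bytes of data).
- Cumulative ACKs: Host A sends Seq=92 (8 bytes of data) and Seq=100 (20 bytes of data); ACK=100 is lost, but ACK=120 arrives before the Seq=92 timeout, so nothing is retransmitted.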
TCP ACK generation
Event at receiver -> TCP receiver action
- Arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed -> Delayed ACK: wait up to 500 ms for the next segment; if no next segment arrives, send ACK.
- Arrival of in-order segment with expected seq #; one other segment has an ACK pending -> Immediately send a single cumulative ACK, ACKing both in-order segments.
- Arrival of out-of-order segment with higher-than-expected seq #; gap detected -> Immediately send a duplicate ACK, indicating the seq # of the next expected byte.
- Arrival of a segment that partially or completely fills the gap -> Immediately send an ACK, provided that the segment starts at the lower end of the gap.
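The four rules can be sketched as a small receiver model. This is an illustration, not a full receiver: the class and method names are hypothetical, segments are assumed never to overlap partially, and the 500 ms delayed-ACK timer is left out (only the decision is returned).

```python
class AckPolicy:
    """Decide which kind of ACK a receiver sends for each arriving segment."""

    def __init__(self, rcv_nxt: int):
        self.rcv_nxt = rcv_nxt      # next expected byte
        self.out_of_order = {}      # seq -> length of buffered segments
        self.ack_pending = False    # a delayed ACK is waiting to be sent

    def on_segment(self, seq: int, length: int):
        """Return (action, ack_number) for one arriving segment."""
        if seq != self.rcv_nxt:
            # Gap detected (or old data): duplicate ACK for the expected byte
            if seq > self.rcv_nxt:
                self.out_of_order[seq] = length
            self.ack_pending = False
            return ("duplicate_ack", self.rcv_nxt)

        had_gap = bool(self.out_of_order)
        self.rcv_nxt += length
        # Deliver any buffered segments that are now contiguous
        while self.rcv_nxt in self.out_of_order:
            self.rcv_nxt += self.out_of_order.pop(self.rcv_nxt)

        if had_gap or self.ack_pending:
            # Gap (partially) filled, or a second in-order segment: ACK now
            self.ack_pending = False
            return ("immediate_ack", self.rcv_nxt)

        # First in-order segment with nothing pending: delay the ACK
        self.ack_pending = True
        return ("delayed_ack", self.rcv_nxt)


rx = AckPolicy(rcv_nxt=0)
for seq in (0, 100, 300, 200):
    print(seq, rx.on_segment(seq, 100))
```

In the example, the segment at 300 arrives before the one at 200, which triggers a duplicate ACK; the late segment then fills the gap and is ACKed immediately.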
Refinements through options
TCP options
Option                  Kind  Len      Content
End of option list      0     -        -
No operation            1     -        -
Maximum segment size    2     4        MSS
Window scale factor     3     3        Shift count
Timestamp               8     10       Timestamp value, timestamp echo reply
SACK                    5     8N + 2   Left edge of 1st block, right edge of 1st block, ..., left edge of Nth block, right edge of Nth block (Len = 10 for a single block)
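A sketch of how the kind/length layout is walked when parsing an options field. Kinds 0 and 1 are single bytes; every other option carries a length byte that counts the kind and length bytes themselves. The function name and the sample bytes are made up for illustration; the kind numbers follow the IANA registry (for example, Timestamps is kind 8 and SACK-permitted is kind 4).

```python
def parse_tcp_options(raw: bytes):
    """Return a list of (kind, data) tuples from a raw TCP options field."""
    options = []
    i = 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:                 # End of option list: stop parsing
            options.append((0, b""))
            break
        if kind == 1:                 # No operation: one padding byte
            options.append((1, b""))
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]           # covers kind + length + data
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        options.append((kind, raw[i + 2:i + length]))
        i += length
    return options


# MSS 1460 (kind 2), NOP (kind 1), window scale 3 (kind 3)
raw = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 3])
for kind, data in parse_tcp_options(raw):
    print(kind, int.from_bytes(data, "big") if data else None)
```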
MTU
MTU (Maximum Transmission Unit): the largest packet size, in bytes, that can travel through the network.
- Ethernet: 1500 bytes
- Ethernet with jumbo frames: 9000 bytes
Path MTU: the smallest MTU on an IP path, as discovered by Path MTU Discovery; equivalently, the largest packet size that will traverse the network without fragmentation.
Fragmentation
IP packets are encapsulated in frames:
  [ FRAME HEADER | FRAME DATA = DATAGRAM HEADER + DATAGRAM DATA ]
IP packets are fragmented to fit within the Path MTU:
  [ FRAGMENT 1 HEADER | DATA 1 ]  [ FRAGMENT 2 HEADER | DATA 2 ]
MSS
Know more: Path MTU Discovery, RFC 1191, Nov. 1990
MSS (Maximum Segment Size): the largest amount of data, in bytes, that a device can handle in a single, unfragmented piece. It is announced at the start of the TCP connection, in the SYN packet. The resulting IP datagram will be MSS + 40 bytes (20 bytes TCP header and 20 bytes IP header).
  [ Frame header | IP header | TCP header | TCP data ]
  The MTU covers IP header + TCP header + TCP data; the MSS covers the TCP data only.
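As a quick worked example of the MSS + 40 bytes relation, assuming IP and TCP headers without options (20 bytes each); the function name is hypothetical.

```python
IP_HEADER_BYTES = 20   # IPv4 header without options
TCP_HEADER_BYTES = 20  # TCP header without options


def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IP datagram."""
    return mtu - IP_HEADER_BYTES - TCP_HEADER_BYTES


print(mss_for_mtu(1500))  # standard Ethernet
print(mss_for_mtu(9000))  # Ethernet with jumbo frames
```

The Ethernet value matches the `mss 1460` that hosts on Ethernet typically announce in their SYN.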
Window scaling option
The standard receive window on TCP systems is at most 65,535 bytes (64 KB), the limit of the 16-bit window field.
RFC 1323 (TCP Extensions for High Performance) introduced the WSCALE option:
- A scale factor for the receive window
- Negotiated at start-up (in a SYN packet); it cannot be renegotiated
- The window cannot exceed the maximum buffer size permitted by the system
The receive window should be equal to the BDP, or better: BDP < window < BDP + B (B = buffer size at intermediate routers)
11:44:45.679928 IP u019857.1x.uva.nl.65295 > rembrandt0.uva.netherlight.nl.ssh: Flags [S], seq 3977286301, win 65535, options [mss 1460,nop,wscale 3,nop,nop,TS val 629245282 ecr 0,sackOK,eol], length 0
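A sketch of what the shift count does. The advertised 16-bit window value is shifted left by the negotiated shift count, which may be at most 14. The function names are hypothetical; the 20,250,000 byte target is the BDP of the Amsterdam to San Diego example.

```python
MAX_WINDOW_FIELD = 65535  # largest value of the 16-bit window field
MAX_SHIFT = 14            # largest shift count the option allows


def effective_window(window_field: int, shift: int) -> int:
    """Receive window in bytes after applying the scale factor."""
    return window_field << shift


def required_shift(target_bytes: int) -> int:
    """Smallest shift count whose maximum window covers target_bytes."""
    for shift in range(MAX_SHIFT + 1):
        if effective_window(MAX_WINDOW_FIELD, shift) >= target_bytes:
            return shift
    raise ValueError("target exceeds the largest scalable window")


print(effective_window(65535, 3))   # a window field of 65535 with wscale 3
print(required_shift(20_250_000))   # shift needed to cover the example BDP
```

With `wscale 3`, as in the capture, the largest window is about 512 KB, well short of the example BDP.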
Timestamp option
A timestamp is placed in every segment and used for a more accurate RTT calculation, based on each received ACK.
- The receiver echoes back the timestamp it receives.
- No need for clock synchronization!
- Provides Protection Against Wrapped Sequence Numbers (PAWS)
15:10:01.802654 IP u019857.1x.uva.nl.55721 > rembrandt0.uva.netherlight.nl.ssh: Flags [P.], seq 1094:1110, ack 1609, win 65535, options [nop,nop,ts val 758648946 ecr 325477188], length 16
15:10:01.841480 IP rembrandt0.uva.netherlight.nl.ssh > u019857.1x.uva.nl.55721: Flags [.], ack 1110, win 283, options [nop,nop,ts val 325477199 ecr 758648946], length 0
SACKs
Know more:
- TCP Selective Acknowledgment Options, RFC 2018, Oct. 1996
- An Extension to the Selective Acknowledgement (SACK) Option for TCP, RFC 2883, Jul. 2000
SACK allows the receiver to acknowledge out-of-order segments selectively. It can be combined with selective retransmission.
DSACK: acknowledges duplicate packets using the SACK field; the duplicate is reported in the first block.
Transmitted Segment   Received Segment        ACK Sent (Including SACK Blocks)
3500-3999             3500-3999               4000
4000-4499             (data packet dropped)
4500-4999             4500-4999               4000, SACK=4500-5000
5000-5499             5000-5499               4000, SACK=4500-5500
(duplicated packet)   5000-5499               4000, SACK=5000-5500, 4500-5500
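The ACK column of the table can be reproduced with a short sketch that merges received byte ranges. Ranges are half-open (start, end), so segment 4500-4999 is written (4500, 5000). The function name is hypothetical; DSACK reporting and the most-recent-first ordering of SACK blocks are not modelled.

```python
def ack_with_sack(rcv_nxt: int, received):
    """Return (cumulative_ack, sack_blocks) for a set of received ranges.

    received: iterable of (start, end) half-open byte ranges.
    """
    blocks = []
    for start, end in sorted(received):
        if start <= rcv_nxt:
            # Contiguous with the in-order data: advance the cumulative ACK
            rcv_nxt = max(rcv_nxt, end)
        elif blocks and start <= blocks[-1][1]:
            # Touches or overlaps the previous block: extend it
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            # New isolated block above a gap
            blocks.append((start, end))
    return rcv_nxt, blocks


received = [(3500, 4000), (4500, 5000)]
print(ack_with_sack(3500, received))      # segment 4000-4499 was dropped
received.append((5000, 5500))
print(ack_with_sack(3500, received))      # the SACK block grows
received.append((4000, 4500))
print(ack_with_sack(3500, received))      # the retransmission fills the gap
```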
Congestion control
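One source of congestion
[Figure: hosts with receive windows rwnd1, rwnd2 and rwnd3, connected by 1 Gbps links.]
What happens if the buffer is bloated?
[Figure: the same hosts with receive windows rwnd1, rwnd2 and rwnd3, connected by 1 Gbps links.]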
Router with infinite buffer
- Maximum per-connection throughput: R/2
- Large delays as the arrival rate λ_in approaches capacity
[Figure: throughput λ_out against λ_in, both up to R/2, and delay against λ_in, growing sharply near R/2.]
Router with finite buffer
- The sender retransmits timed-out packets
- Application-layer input = application-layer output: λ_in = λ_out
- Transport-layer input includes retransmissions: λ'_in ≥ λ_in
  λ_in: original data
  λ'_in: original data, plus retransmitted data
[Figure: Host A sends to Host B through a router with finite shared output link buffers.]
Retransmissions
When sending at R/2, some packets are retransmissions, including duplicates that are delivered!
The cost of congestion:
- more work (retransmissions) for a given goodput
- unneeded retransmissions: the link carries multiple copies of a packet, decreasing goodput
[Figure: λ_out against λ_in, both axes up to R/2.]
Congestion: a problem in the network?
Congestion indicates a problem in the network!
- Long delays due to queueing in router buffers
- Lost packets and retransmissions due to buffer overflows
- Unneeded retransmissions by the sender when delays are large, which can lead to congestion collapse
Is congestion bad?
Not really, if you know how to manage it. TCP has a mechanism to handle it. Congestion is unavoidable given that we want to use the network capacity as efficiently as possible.
Two possible approaches
End-end congestion control:
- no explicit feedback from the network
- congestion inferred from end-system observed loss and delay
- approach taken by TCP
Network-assisted congestion control:
- routers provide feedback to end systems
- single bit indicating congestion (Explicit Congestion Notification)
- explicit rate the sender should send at
TCP congestion control
- How does a TCP sender limit the rate at which it sends traffic into its connection?
- How does a TCP sender perceive that there is congestion on the path to the destination?
- What algorithm should the sender use to change its sending rate as a function of the perceived congestion?
Test Time
Pause
Congestion Control algorithm
The TCP congestion control algorithm has four components:
- Slow start
- Congestion avoidance
- Fast retransmit
- Fast recovery
Devised in 1988 by Van Jacobson: "Congestion Avoidance and Control", Proceedings of SIGCOMM '88, Stanford, CA, Aug. 1988, ACM
Continued evolution: RFC 5681, TCP Congestion Control, 2009
TCP stack evolution
- TCP Tahoe was the original implementation;
- TCP Reno implemented fast recovery;
- TCP New Reno improves retransmission during the fast recovery phase of TCP Reno.
Learn more: The NewReno Modification to TCP's Fast Recovery Algorithm, RFC 3782, Apr. 2004
Congestion window
On the sender side, TCP also maintains a congestion window (cwnd):
- Used to restrict the data flow to less than the receiver's buffer size when congestion occurs
- Allowed window = min(rwnd, cwnd)
[Figure: packets 1 to 13 with rwnd = 6 and cwnd = 8, marked as "sent; acked", "sent; not acked" and "OK to send".]
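A minimal sketch of the allowed-window rule, counting in packets as the slide does. The function name and the number of packets in flight are assumptions for illustration.

```python
def usable_window(rwnd: int, cwnd: int, in_flight: int) -> int:
    """How much more the sender may transmit right now."""
    allowed = min(rwnd, cwnd)          # the tighter of the two limits wins
    return max(0, allowed - in_flight)


# rwnd = 6 and cwnd = 8 as on the slide; 4 packets in flight is assumed
print(min(6, 8))               # allowed window
print(usable_window(6, 8, 4))  # packets that are still OK to send
```

Here the receiver is the bottleneck; once congestion shrinks cwnd below rwnd, the network becomes the limit instead.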
Congestion detection
Two mechanisms indicate congestion:
- Timeouts
- Duplicate ACKs
The congestion window is _not_ static. It increases and decreases based on the arrival of ACKs: it grows slowly if the link has low bandwidth or high delay, and quickly otherwise. This is called self-clocking.
Slow start
Know more: Increasing TCP's Initial Window, RFC 3390, October 2002
- At start, cwnd is equal to min(4*MSS, max(2*MSS, 4380 bytes)), roughly 4 Kbytes.
- To avoid wasting bandwidth, the initial increase is exponential: cwnd doubles every RTT.
  cwnd = cwnd + MSS (when an ACK arrives)
[Figure: Host A sends one segment to Host B, then two segments one RTT later, then four segments.]
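A sketch of the initial window formula and of why adding one MSS per ACK doubles cwnd every RTT. It assumes one ACK per segment (no delayed ACKs) and no loss; the function names are hypothetical.

```python
def initial_window(mss: int) -> int:
    """Initial cwnd in bytes, following the RFC 3390 formula."""
    return min(4 * mss, max(2 * mss, 4380))


def slow_start(cwnd: int, mss: int, rtts: int):
    """Return cwnd at the start of each RTT during slow start."""
    history = [cwnd]
    for _ in range(rtts):
        acks = cwnd // mss      # one ACK per segment sent in this RTT
        for _ in range(acks):
            cwnd += mss         # cwnd = cwnd + MSS on every ACK
        history.append(cwnd)
    return history


print(initial_window(1460))
print(slow_start(1460, 1460, 3))  # starting from one segment, as in the figure
```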
Slow start threshold
The slow start threshold (ssthresh) determines whether cwnd follows slow start or congestion avoidance.
- Initially very high (equal to rwnd)
- Decreases after congestion
[Figure: cwnd over time, with the slow start phase below ssthresh and congestion avoidance above it.]
Congestion avoidance
Know more: Congestion Control Principles, RFC 2914, September 2000
AIMD: Additive Increase, Multiplicative Decrease.
- Multiplicative decrease: halve the congestion window for every lost segment.
  cwnd -= 0.5 * cwnd (cannot decrease below 1 MSS)
- Additive increase: every time an ACK arrives:
  cwnd += MSS * MSS / cwnd
  so every RTT the congestion window increases by about 1 MSS.
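A sketch of the two AIMD rules, showing that one RTT worth of ACKs adds roughly one MSS. It assumes one ACK per segment; the function names are hypothetical.

```python
def additive_increase(cwnd: float, mss: int) -> float:
    """Per-ACK growth during congestion avoidance."""
    return cwnd + mss * mss / cwnd


def multiplicative_decrease(cwnd: float, mss: int) -> float:
    """Halve cwnd on loss, never going below 1 MSS."""
    return max(cwnd / 2, mss)


def one_rtt_of_acks(cwnd: float, mss: int) -> float:
    """Apply additive increase for every ACK received in one RTT."""
    acks = int(cwnd // mss)
    for _ in range(acks):
        cwnd = additive_increase(cwnd, mss)
    return cwnd


mss = 1460
cwnd = 10.0 * mss
after = one_rtt_of_acks(cwnd, mss)
print(round(after - cwnd))                 # growth in bytes over one RTT
print(multiplicative_decrease(after, mss)) # window after a loss
```

The growth is slightly below one MSS because cwnd increases a little with every ACK, which makes each later increment smaller.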
Which rate can you achieve?
AIMD sawtooth behavior: probing for bandwidth. The sender additively increases the window size until loss occurs, then cuts the window in half.
[Figure: cwnd (TCP sender congestion window size) against time.]
In an animation: http://guido.appenzeller.net/anims/
Courtesy of Guido Appenzeller and Nick McKeown (Stanford University)
TCP sending rate
rate ≈ cwnd / RTT bytes/sec
Or better said, given that cwnd and RTT vary with time:
rate ≈ cwnd(t) / RTT(t) bytes/sec
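A short worked example of the rate formula, reusing the 162 ms RTT of the Amsterdam to San Diego example. The function name is hypothetical, and the two window sizes are chosen here for illustration.

```python
def sending_rate(cwnd_bytes: float, rtt_seconds: float) -> float:
    """Approximate TCP sending rate in bytes/sec."""
    return cwnd_bytes / rtt_seconds


rtt = 0.162
small = sending_rate(65535, rtt)        # unscaled 64 KB window
large = sending_rate(20_250_000, rtt)   # window equal to the BDP

print(round(small * 8 / 1e6, 2), "Mbit/s")
print(round(large * 8 / 1e9, 2), "Gbit/s")
```

With an unscaled window the sender reaches only a few Mbit/s on a 1 Gbps path, which is the "wasting bandwidth" problem of long fat networks.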
Reaction to timeouts
TCP reacts differently depending on the type of loss detected. In both cases ssthresh = cwnd/2 at the loss event.
1. After a timeout: slow start until cwnd > ssthresh (cwnd at loss / 2), then congestion avoidance.
2. After three duplicate ACKs: sawtooth behavior of congestion avoidance. This is fast recovery, first implemented in TCP Reno.
Fast Retransmit
If the sender receives 3 duplicate ACKs for the same data, it:
- resends the segment before the timer expires;
- waits for an acknowledgment of the entire transmit window before returning to congestion avoidance.
[Figure: Host A and Host B timelines; a segment is lost (X) and the 2nd segment is resent before the timeout.]
TCP New Reno
Remember SACK?
- What is used in systems today.
- Able to detect multiple losses.
- Same as Tahoe/Reno on timeouts.
- Improves further on the fast retransmit phase:
  - Keep track of the last un-ACKed packet when entering fast recovery
  - On every ACK, increase cwnd by one MSS
  - When the last ACK arrives, return to congestion avoidance and set cwnd to the value from when fast recovery was entered
Evolution of algorithm Fast recovery When receiving duplicate ACKs
Summary
Initial state (Λ): cwnd = 4 Kbytes, ssthresh = rwnd, dupackcount = 0; enter slow start.
Slow start:
- new ACK: cwnd = cwnd + MSS; dupackcount = 0; transmit new segment(s), as allowed
- duplicate ACK: dupackcount++
- cwnd > ssthresh (Λ): go to congestion avoidance
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupackcount = 0; retransmit missing segment
- dupackcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
Congestion avoidance:
- new ACK: cwnd = cwnd + MSS * (MSS/cwnd); dupackcount = 0; transmit new segment(s), as allowed
- duplicate ACK: dupackcount++
- timeout: ssthresh = cwnd/2; cwnd = 4 KBytes; dupackcount = 0; retransmit missing segment; go to slow start
- dupackcount == 3: ssthresh = cwnd/2; cwnd = ssthresh + 3 MSS; retransmit missing segment; go to fast recovery
Fast recovery:
- duplicate ACK: cwnd = cwnd + MSS; transmit new segment(s), as allowed
- new ACK: cwnd = ssthresh; dupackcount = 0; go to congestion avoidance
- timeout: ssthresh = cwnd/2; cwnd = 1 MSS; dupackcount = 0; retransmit missing segment; go to slow start
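The state machine can be sketched in a few lines. This is an illustration under stated assumptions, not a TCP implementation: the class and state names are hypothetical, MSS is set to 1000 bytes for readable numbers, cwnd starts at 1 MSS and drops to 1 MSS on every timeout, integer arithmetic is used throughout, and actual (re)transmission of segments is not modelled.

```python
MSS = 1000  # bytes; a round number chosen for readability


class RenoSender:
    """Tracks cwnd, ssthresh and state for the three-state summary."""

    def __init__(self, initial_cwnd=MSS, ssthresh=64_000):
        self.state = "slow_start"
        self.cwnd = initial_cwnd
        self.ssthresh = ssthresh
        self.dupacks = 0

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh              # deflate the window
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS                       # exponential growth
            if self.cwnd > self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += max(1, MSS * MSS // self.cwnd)  # about 1 MSS per RTT
        self.dupacks = 0

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += MSS                       # inflate per duplicate ACK
            return
        self.dupacks += 1
        if self.dupacks == 3:                      # fast retransmit point
            self.ssthresh = self.cwnd // 2
            self.cwnd = self.ssthresh + 3 * MSS
            self.state = "fast_recovery"

    def on_timeout(self):
        self.ssthresh = self.cwnd // 2
        self.cwnd = MSS
        self.dupacks = 0
        self.state = "slow_start"


s = RenoSender(ssthresh=4 * MSS)
events = ["ack"] * 5 + ["dup"] * 4 + ["ack", "timeout"]
for ev in events:
    {"ack": s.on_new_ack, "dup": s.on_dup_ack, "timeout": s.on_timeout}[ev]()
    print(ev, s.state, s.cwnd, s.ssthresh)
```

The trace walks through slow start, the switch to congestion avoidance, a fast retransmit into fast recovery, the return to congestion avoidance, and finally a timeout back to slow start.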
Congestion control on LFN
General form:
  cwnd = cwnd - a * cwnd (when loss is detected)
  cwnd = cwnd + b / cwnd (when an ACK arrives)
- Scalable TCP: a = 0.125 and b = 0.01. Result: the congestion window does not oscillate, and throughput increases slightly.
- High-speed TCP (HSTCP): a(w) and b(w) depend on the window size w. Particularly suitable for large-BDP networks.
- TCP BIC: used by default in Linux kernels 2.6.8 through 2.6.18.
- CUBIC: a less aggressive derivative of BIC. Default in Linux kernels since version 2.6.19.
- Fast TCP, Westwood TCP, H-TCP, TCP Vegas, ...
New TCP flavors
Want to know more?
- Scalable TCP: http://www.deneholme.net/tom/scalable/
- High-speed TCP: RFC 3649, HighSpeed TCP for Large Congestion Windows
- TCP BIC/CUBIC: http://netsrv.csc.ncsu.edu/twiki/bin/view/main/bic
- Fast TCP: http://netlab.caltech.edu/fast/
- Westwood TCP: http://www.cs.ucla.edu/nrl/hpi/tcpw/
- TCP Vegas: http://www.cs.arizona.edu/projects/protocols/
TCP fairness
TCP Fairness
Fairness goal: if K TCP sessions share the same bottleneck link of bandwidth R, each should have an average rate of R/K.
[Figure: TCP connection 1 and TCP connection 2 sharing a bottleneck router of capacity R.]
Why is TCP fair?
Two competing sessions:
- additive increase gives a slope of 1 as throughput increases
- multiplicative decrease decreases throughput proportionally
[Figure: Connection 2 throughput against Connection 1 throughput, both axes up to R, with the equal bandwidth share line; the trajectory alternates between "congestion avoidance: additive increase" and "loss: decrease window by factor of 2".]
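The convergence argument can be illustrated with a toy simulation. It is a sketch under strong assumptions: both flows have the same RTT, both see every loss at the same time, and rates are in arbitrary units. The function name and the starting rates are made up.

```python
def simulate(x1: float, x2: float, capacity: float, steps: int):
    """Two AIMD flows sharing one bottleneck; returns their final rates."""
    for _ in range(steps):
        if x1 + x2 > capacity:
            # Loss: both flows halve, which also halves their difference
            x1, x2 = x1 / 2, x2 / 2
        else:
            # Additive increase: both grow by 1, the difference is unchanged
            x1, x2 = x1 + 1, x2 + 1
    return x1, x2


a, b = simulate(10.0, 80.0, capacity=100.0, steps=1000)
print(round(a, 3), round(b, 3))
```

Additive increase moves the operating point parallel to the equal-share line, while each multiplicative decrease halves the gap between the flows, so the two rates end up equal.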
Fairness
Fairness and UDP:
- multimedia apps often do not use TCP: they do not want their rate throttled by congestion control
- instead they use UDP: pump audio/video at a constant rate, tolerate packet loss
Fairness and parallel TCP connections:
- nothing prevents an app from opening parallel connections between 2 hosts.
Home reading
For the test on Apr. 08 read: A Comparison of SIP and H.323 for Internet Telephony, by H. Schulzrinne and J. Rosenberg
URL: http://www.cs.columbia.edu/~hgs/papers/Schu9807_Comparison.pdf
Literature
- Chapter 20: TCP Bulk Data Flow
- Chapter 21: TCP Timeout and Retransmission
- Chapter 24: TCP Future and Performance
- Chapter 3: Transport Layer
- A few slides were adapted from: Computer Networking: A Top-Down Approach, 5th edition. Jim Kurose, Keith Ross. Addison-Wesley, April 2009
- Chapter 7: Transport Over IP