User Datagram Protocol (UDP)
- Thin wrapper around IP services
- Service model: unreliable, unordered datagram service
- Addresses multiplexing of multiple connections
- Multiplexing: 16-bit port numbers (some are well-known)
- Checksum: validates the segment
  - Optional in IPv4, mandatory in IPv6

UDP Header Format
- Source Port (16 bits), Destination Port (16 bits)
- UDP Length (16 bits): includes the 8-byte header and the data
- Checksum (16 bits): uses the IP checksum algorithm
  - Computed on header, data, and a pseudo header

UDP Pseudo Header (for the checksum)
- Source IP Address
- Destination IP Address
- Zero byte, Protocol = 17 (UDP), UDP Length

Transmission Control Protocol (TCP)
- Guaranteed delivery:
  - Messages delivered in the order they were sent
  - Messages delivered at most once
- No limit on message size
- Synchronization between sender and receiver
- Multiple connections per host
- Flow control
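The checksum computation over the pseudo header can be sketched as follows. This is a minimal illustration of the one's-complement algorithm described above, not a production implementation; the function names are our own.

```python
import struct

def ones_complement_sum16(data: bytes) -> int:
    """Sum 16-bit words with end-around carry (the IP checksum algorithm)."""
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length data with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
    return total

def udp_checksum(src_ip: bytes, dst_ip: bytes, udp_segment: bytes) -> int:
    """Checksum over pseudo header + UDP header + data (checksum field zero)."""
    # Pseudo header: source IP, destination IP, zero byte, protocol 17, length.
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, 17, len(udp_segment))
    return (~ones_complement_sum16(pseudo + udp_segment)) & 0xFFFF
```

The receiver sums the same data with the checksum field filled in; a valid segment sums to 0xFFFF.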
TCP vs. Direct Link
- Connection oriented: explicit setup and teardown required
- Byte stream abstraction: no boundaries in the data
  - App writes bytes, TCP sends segments, app receives bytes
- Full duplex: data flows in both directions simultaneously
- Point-to-point connection
- Implements flow control and congestion control
  - Flow control: receiver controls sender rate
  - Congestion control: network indirectly controls sender rate
- Explicit connection setup is required
- RTT varies, depending on destination and network conditions
  => adaptive approach to retransmission
- Packets may be delayed, reordered, or lost
- Peer capabilities vary
  - Minimum link speed on the route
  - Buffering capacity at the destination
  => adaptive approach to window sizes
- Network capacity varies
  - Other traffic competes for most links
  => requires a global congestion control strategy

TCP: Connection Stages
1. Connection setup: 3-way handshake
2. Data transport: the sender writes data, and TCP
   - Breaks data into segments
   - Sends segments in IP packets
   - Retransmits, reorders, and removes duplicates as necessary
   - Delivers data to the receiver
3. Teardown: 4-step exchange
TCP Segment Header
- 16-bit source and destination ports
- 32-bit sequence and acknowledgment numbers
- 4-bit header length (unit = 32 bits)
  - Minimum 5 (20 bytes)
  - Used as offset to the first data byte
- 1-bit flags:
  - URG: segment contains urgent data
  - ACK: acknowledgment number is valid
  - PSH: do not delay delivery of data
  - RST: reset connection (reject or abnormal termination)
  - SYN: synchronize segment for setup
  - FIN: final segment for teardown
- 16-bit advertised window: space remaining in the receive window
- 16-bit checksum: uses the IP checksum algorithm
  - Computed on header, data, and pseudo header
- 16-bit urgent pointer (valid if URG = 1): index of the last byte of urgent data in the segment
- Options

TCP Pseudo Header (for the checksum)
- Source IP Address, Destination IP Address, zero byte, Protocol = 6 (TCP), TCP Segment Length

TCP Options
- Negotiate maximum segment size (MSS)
  - Each host suggests a value; the minimum of the two values is chosen
  - Prevents IP fragmentation over the first and last hops
- Packet timestamp
  - Allows RTT calculation for retransmitted packets
  - Extends the sequence number space for identification of stray packets
- Negotiate advertised window scaling factor
  - Allows larger windows: 64 KB is too small for routes with large bandwidth-delay products
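The fixed 20-byte header layout above can be decoded mechanically. A minimal sketch (field names are our own; only the six flag bits listed above are extracted):

```python
import struct

def parse_tcp_header(segment: bytes) -> dict:
    """Unpack the fixed 20-byte TCP header (network byte order)."""
    (src, dst, seq, ack,
     off_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    return {
        "src_port": src, "dst_port": dst,
        "seq": seq, "ack": ack,
        "header_len": (off_flags >> 12) * 4,   # 4-bit length in 32-bit words
        "URG": bool(off_flags & 0x20), "ACK": bool(off_flags & 0x10),
        "PSH": bool(off_flags & 0x08), "RST": bool(off_flags & 0x04),
        "SYN": bool(off_flags & 0x02), "FIN": bool(off_flags & 0x01),
        "window": window, "checksum": checksum, "urgent": urgent,
    }
```

For example, a SYN+ACK with header length 5 carries 0x12 in the flag bits, so `header_len` comes out as 20 and both SYN and ACK decode as true.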
TCP: Data Transport
- Data is broken into segments
  - Limited by the maximum segment size (MSS)
  - Negotiable during connection setup
  - Typically set to the MTU of the directly connected network minus the size of the TCP and IP headers
- Three events cause a segment to be sent:
  - At least MSS bytes of data ready to be sent
  - Explicit PUSH operation by the application
  - Periodic timeout
[Figure: the sending application writes bytes into the TCP send buffer; TCP carries the byte stream as segments; the receiving application reads bytes from the TCP receive buffer]

TCP Sequence Numbers and ACKs
- TCP rules for sequence numbers:
  - Count bytes, not packets
  - The first SN is chosen to avoid segment insertion
- ACKs carry the SN of the next byte expected from the other side: cumulative, as in Go-Back-N
- Unlike GBN, the TCP spec doesn't say what to do with premature (out-of-order) segments; that is up to the implementation
[Figure: simple telnet scenario; the user types 'C' on Host A; Host B ACKs receipt of 'C' and echoes it back; Host A ACKs receipt of the echoed 'C'; time flows down]

TCP Receiver ACK Generation
- In-order segment arrival, no gaps, everything else already ACKed:
  delayed ACK; wait up to 500 ms for the next segment; if no next segment, send the ACK
- In-order segment arrival, no gaps, one delayed ACK pending:
  immediately send a single cumulative ACK
- Out-of-order segment arrival, higher-than-expected seq. # (gap detected):
  send a duplicate ACK, indicating the seq. # of the next expected byte
- Arrival of a segment that partially or completely fills a gap:
  immediate ACK if the segment starts at the lower end of the gap
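The four ACK-generation rules above can be modeled in a few lines. This is a toy sketch (class and attribute names are our own): the 500 ms delayed-ACK timer is reduced to a pending flag, and sequence numbers count bytes as in TCP.

```python
class TcpReceiver:
    """Toy model of the cumulative-ACK rules (byte SNs, no real timer)."""

    def __init__(self, isn: int = 0):
        self.next_expected = isn    # SN of the next byte we want
        self.ooo = {}               # out-of-order segments: seq -> length
        self.delayed = False        # a delayed ACK is pending

    def on_segment(self, seq: int, length: int) -> str:
        if seq == self.next_expected:
            had_gap = bool(self.ooo)
            self.next_expected += length
            # absorb any buffered segments that are now in order
            while self.next_expected in self.ooo:
                self.next_expected += self.ooo.pop(self.next_expected)
            if had_gap:                        # segment filled (part of) a gap
                self.delayed = False
                return f"immediate ACK {self.next_expected}"
            if self.delayed:                   # one delayed ACK already pending
                self.delayed = False
                return f"cumulative ACK {self.next_expected}"
            self.delayed = True                # wait up to 500 ms for more data
            return "delay ACK"
        if seq > self.next_expected:           # gap detected: buffer it
            self.ooo[seq] = length
        return f"duplicate ACK {self.next_expected}"
```

Running two in-order segments, one out-of-order segment, and the gap filler reproduces the four rows of the table in order.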
TCP: Retransmission Scenarios
[Figures: Host A / Host B timelines showing (1) a lost-segment scenario, where Seq=100 is lost and retransmitted after the timeout, and (2) a premature timeout, where Seq=92 is retransmitted unnecessarily but cumulative ACKs recover; the RTO is the round-trip time (RTT) estimate plus a guard band]
- TCP uses an adaptive retransmission timeout (RTO) value
  - The network is dynamic (congestion, changes in routing) => the RTT cannot be static

TCP: Retransmission and Timeouts
- The RTO value is important:
  - Too big: wait too long to retransmit a packet
  - Too small: unnecessarily retransmit packets
- Original algorithm for picking the RTO:
  1. EstimatedRTT = α × EstimatedRTT + (1 - α) × SampleRTT
  2. RTO = 2 × EstimatedRTT
- Characteristics of the original algorithm:
  - The std. dev. is implicitly assumed to be bounded by the RTT
  - But if utilization = 75%, there could be a factor of 16 between typical (mean ± 2 stdev) short and long RTTs

TCP: Retransmission and Timeouts (Jacobson/Karels algorithm)
- The newer algorithm also estimates the std. dev. of the RTT:
  1. Diff = SampleRTT - EstimatedRTT
  2. EstimatedRTT = EstimatedRTT + δ × Diff (for some 0 < δ < 1)
  3. Deviation = Deviation + δ × (|Diff| - Deviation)
  4. RTO = μ × EstimatedRTT + φ × Deviation, typically μ = 1, φ = 4
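The Jacobson/Karels update can be sketched directly from the four steps above. This is a simplified model, not the kernel implementation: the slides use a single gain δ for both the mean and the deviation (RFC 6298 uses 1/8 and 1/4), and the class and attribute names are our own. Karn-style exponential backoff is included so the next slide's rule has a home.

```python
class RtoEstimator:
    """Jacobson/Karels RTO estimate with Karn-style exponential backoff."""

    def __init__(self, first_sample, delta=0.125, mu=1.0, phi=4.0):
        self.estimated = first_sample
        self.deviation = first_sample / 2     # conventional initial guess
        self.delta, self.mu, self.phi = delta, mu, phi
        self.backoff = 1                      # doubled on each timeout

    def sample(self, rtt):
        """Feed one RTT measurement (never from a retransmitted segment!)."""
        diff = rtt - self.estimated
        self.estimated += self.delta * diff
        self.deviation += self.delta * (abs(diff) - self.deviation)
        self.backoff = 1                      # a fresh sample ends the backoff

    def on_timeout(self):
        """Karn's rule: don't sample retransmissions; just double the RTO."""
        self.backoff *= 2

    @property
    def rto(self):
        return self.backoff * (self.mu * self.estimated + self.phi * self.deviation)
```

With a first sample of 100 ms the RTO starts at 100 + 4 × 50 = 300 ms; steady samples shrink the deviation term, and each timeout doubles the whole value.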
TCP: Retransmission and Timeouts (Karn's Algorithm)
[Figures: two Host A / Host B timelines showing why a retransmission makes the RTT sample ambiguous; the ACK may belong to either the original or the retransmitted segment, so either pairing gives a wrong RTT sample]
- Problem: how to estimate the RTT of retransmitted packets?
- Solution: don't! Also: double the RTO on each timeout

TCP Sliding Window Protocol: Sender Side
- Invariants: LastByteAcked <= LastByteSent <= LastByteWritten
- Buffer bytes between LastByteAcked and LastByteWritten
[Figure: send buffer of maximum size; the advertised window starts at the first unacknowledged byte; data beyond the last byte sent is available, but outside the window]

TCP Sliding Window Protocol: Receiver Side
- Invariants: LastByteRead < NextByteExpected <= LastByteRcvd + 1
- Buffer bytes between LastByteRead and LastByteRcvd
[Figure: receive buffer of maximum size; buffered, out-of-order data sits above the next byte expected (the ACK value); the next byte to be read by the application marks the low end; the advertised window shrinks as data arrives and grows as the application consumes data]

TCP Flow Control
- Receiving side:
  - Receive buffer size = MaxRcvBuffer
  - LastByteRcvd - LastByteRead <= MaxRcvBuffer
  - AdvertisedWindow = MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead)
  - Shrinks as data arrives; grows as the application consumes data
- Sending side:
  - Send buffer size = MaxSendBuffer
  - LastByteSent - LastByteAcked <= AdvertisedWindow
  - EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)
  - EffectiveWindow > 0 is required to send data
  - LastByteWritten - LastByteAcked <= MaxSendBuffer
  - Block the sender if (LastByteWritten - LastByteAcked) + y > MaxSendBuffer, where y is the number of bytes the application wants to write
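The two window formulas above translate directly into code. A minimal sketch (function names are our own; byte pointers follow the variable names in the slides):

```python
def advertised_window(max_rcv_buffer: int,
                      next_byte_expected: int,
                      last_byte_read: int) -> int:
    """Space the receiver can still absorb."""
    buffered = (next_byte_expected - 1) - last_byte_read   # held for the app
    return max_rcv_buffer - buffered

def effective_window(advertised: int,
                     last_byte_sent: int,
                     last_byte_acked: int) -> int:
    """Bytes the sender may still put on the wire (0 means: must wait)."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, advertised - in_flight)
```

For instance, a 4096-byte receive buffer with 500 bytes awaiting the application advertises 3596 bytes; a sender with 2000 bytes in flight may then send at most 1596 more.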
TCP Flow Control: Zero Window
- Problem: slow receiver application
  - The advertised window goes to 0; the sender cannot send more data
  - The receiver may not spontaneously generate a window update, or the update may be lost
  - The sender gets stuck
- Solution:
  - The sender periodically sends a 1-byte segment, ignoring the advertised window of 0
  - Eventually the window opens
  - The sender learns of the opening from the next ACK of a 1-byte segment

TCP Flow Control: Tiny Writes
- Problem: the application delivers tiny pieces of data to TCP
  - Example: telnet in character mode
  - Each piece is sent as a segment and ACKed separately; very inefficient
- Solution: delay transmission to accumulate more data (Nagle's algorithm)
  - Send the first piece of data
  - Accumulate data until the first piece is ACKed
  - Send the accumulated data and restart accumulation
  - Not ideal for some traffic (e.g. mouse motion)

TCP Flow Control: Silly Window Syndrome
- Problem: a slow application reads data in tiny pieces
  - The receiver advertises a tiny window; the sender fills the tiny window
  - Known as silly window syndrome
- Solution:
  - Advertise a window opening only when an MSS or ½ of the buffer is available
  - The sender delays sending until the window is an MSS or ½ of the receiver's buffer (estimated)

TCP Bit Allocation Limitations
- Sequence numbers vs. packet lifetime
  - Assumed that IP packets live less than 60 seconds
  - Can we send 2^32 bytes in 60 seconds? That is approx. 573 Mbps: less than an STS-12 line
- Advertised window vs. delay × bandwidth
  - Only 16 bits for the advertised window
  - With a coast-to-coast RTT of 100 ms, adequate for only 5.24 Mbps!
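Both bit-allocation limits above are one-line computations. A sketch that reproduces the two numbers quoted (function names are our own):

```python
def wraparound_rate_bps(seq_bits: int = 32, lifetime_s: float = 60.0) -> float:
    """Sending rate at which the sequence space wraps within one packet lifetime."""
    return (2 ** seq_bits) * 8 / lifetime_s        # bytes -> bits, per second

def window_limited_rate_bps(window_bytes: int = 2 ** 16 - 1,
                            rtt_s: float = 0.1) -> float:
    """Max throughput when at most one advertised window is in flight per RTT."""
    return window_bytes * 8 / rtt_s
```

2^32 bytes in 60 s comes to about 573 Mbps, and a full 16-bit window every 100 ms comes to about 5.24 Mbps, matching the slide.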
TCP Sequence Numbers (32-bit): Time Until Wrap-Around

  Bandwidth   Speed       Time until wrap-around
  T1          1.5 Mbps    6.4 hours
  Ethernet    10 Mbps     57 minutes
  T3          45 Mbps     13 minutes
  FDDI        100 Mbps    6 minutes
  STS-3       155 Mbps    4 minutes
  STS-12      622 Mbps    55 seconds
  STS-24      1.2 Gbps    28 seconds

TCP Connection Establishment: 3-Way Handshake
- Exchange initial sequence numbers (j, k)
- Message types: Synchronize (SYN), Acknowledge (ACK): cumulative!
- Passive open: the server listens for a connection from a client
- Active open: the client initiates a connection to the server
[Figure: client sends SYN with seq = j; server replies SYN + ACK with seq = k, ack = j + 1; client completes with ACK, ack = k + 1; time flows down]

TCP: Connection Termination
- Message types: Finished (FIN), Acknowledge (ACK)
- Active close: sends no more data
- Passive close: accepts no more data
- The connection can be half closed (one-way)
[Figure: client sends FIN, server ACKs; later the server sends its own FIN and the client ACKs; time flows down]

TCP State Descriptions
- CLOSED: disconnected
- LISTEN: waiting for an incoming connection
- SYN_RCVD: connection request received
- SYN_SENT: connection request sent
- ESTABLISHED: connection ready for data transport
- CLOSE_WAIT: connection closed by peer
- LAST_ACK: connection closed by peer and closed locally, awaiting ACK
- FIN_WAIT_1: connection closed locally
- FIN_WAIT_2: connection closed locally and ACKed
- CLOSING: connection closed by both sides simultaneously
- TIME_WAIT: wait for the network to discard related packets
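The sequence-number bookkeeping of the 3-way handshake is easy to get wrong, so a sketch may help: a SYN consumes one sequence number, which is why each ACK acknowledges the peer's ISN plus one. The function name and message tuples are our own illustration.

```python
def three_way_handshake(j: int, k: int):
    """Messages exchanged when a client (ISN j) connects to a server (ISN k)."""
    return [
        ("client->server", "SYN",     {"seq": j}),
        ("server->client", "SYN+ACK", {"seq": k, "ack": j + 1}),  # ACKs the SYN
        ("client->server", "ACK",     {"seq": j + 1, "ack": k + 1}),
    ]
```

With j = 100 and k = 300, the server acknowledges 101 and the client acknowledges 301.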
TCP State Transition Diagram
[Figure: the full diagram; its transitions, written as state --event / response--> state, are:]
- CLOSED --active open / SYN--> SYN_SENT
- CLOSED --passive open--> LISTEN
- LISTEN --SYN / SYN + ACK--> SYN_RCVD
- LISTEN --send / SYN--> SYN_SENT
- SYN_SENT --SYN + ACK / ACK--> ESTABLISHED
- SYN_RCVD --ACK--> ESTABLISHED
- SYN_RCVD --close / FIN--> FIN_WAIT_1
- ESTABLISHED --close / FIN--> FIN_WAIT_1
- ESTABLISHED --FIN / ACK--> CLOSE_WAIT
- FIN_WAIT_1 --ACK--> FIN_WAIT_2
- FIN_WAIT_1 --FIN / ACK--> CLOSING
- FIN_WAIT_1 --FIN + ACK / ACK--> TIME_WAIT
- FIN_WAIT_2 --FIN / ACK--> TIME_WAIT
- CLOSING --ACK--> TIME_WAIT
- CLOSE_WAIT --close / FIN--> LAST_ACK
- LAST_ACK --ACK--> CLOSED
- TIME_WAIT --timeout--> CLOSED

Questions
- State transitions:
  - Describe the path taken by a server under normal conditions
  - Describe the path taken by a client under normal conditions
  - Describe the path taken assuming the client closes the connection first
- TIME_WAIT state:
  - What purpose does this state serve?
  - Prove that at least one side of a connection enters this state
  - Explain how both sides might enter this state
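The normal-case paths asked about above can be traced mechanically. A sketch of a subset of the diagram as a transition table (event names are our own shorthand): the client actively opens and closes first, so it passes through TIME_WAIT; the server passively opens and ends in CLOSED via LAST_ACK.

```python
# Transition table: (state, event) -> next state (normal paths only).
TRANSITIONS = {
    ("CLOSED", "active_open"): "SYN_SENT",
    ("CLOSED", "passive_open"): "LISTEN",
    ("LISTEN", "rcv_SYN"): "SYN_RCVD",
    ("SYN_SENT", "rcv_SYN+ACK"): "ESTABLISHED",
    ("SYN_RCVD", "rcv_ACK"): "ESTABLISHED",
    ("ESTABLISHED", "close"): "FIN_WAIT_1",
    ("ESTABLISHED", "rcv_FIN"): "CLOSE_WAIT",
    ("FIN_WAIT_1", "rcv_ACK"): "FIN_WAIT_2",
    ("FIN_WAIT_2", "rcv_FIN"): "TIME_WAIT",
    ("CLOSE_WAIT", "close"): "LAST_ACK",
    ("LAST_ACK", "rcv_ACK"): "CLOSED",
    ("TIME_WAIT", "timeout"): "CLOSED",
}

def run(events, state="CLOSED"):
    """Replay a sequence of events and return the final state."""
    for e in events:
        state = TRANSITIONS[(state, e)]
    return state

# Client that closes first: active open, then initiates teardown.
client = ["active_open", "rcv_SYN+ACK", "close", "rcv_ACK", "rcv_FIN", "timeout"]
# Server: passive open; the peer closes first, then we close locally.
server = ["passive_open", "rcv_SYN", "rcv_ACK", "rcv_FIN", "close", "rcv_ACK"]
```

Replaying `client` visits SYN_SENT, ESTABLISHED, FIN_WAIT_1, FIN_WAIT_2, and TIME_WAIT; replaying `server` visits LISTEN, SYN_RCVD, ESTABLISHED, CLOSE_WAIT, and LAST_ACK. Both end back in CLOSED.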
Congestion Control & Avoidance
[Figure: hosts H1 (10 Mb/s) and H2 (100 Mb/s) feed arrival processes A1(t) and A2(t) into router R1, which drains at 1.5 Mb/s toward H3; the cumulative-bytes plot of A1(t) + A2(t) against the departure curve D(t) shows the backlog X(t) growing whenever arrivals exceed the outgoing rate]

TCP Congestion Control
- Ideal steady state: self-clocking (each returning ACK releases new data)
- Basic idea: control the rate by the window size
  - Average rate ≈ (window) / RTT; a crude form of rate control
- Add the notion of a congestion window
- The effective window is the minimum of:
  - the advertised window (flow control), and
  - the congestion window (congestion control)
TCP Congestion Control: Phases
- Start-up phase: quickly find the correct rate — Slow Start
- Steady state: gently try to increase the rate, back off quickly when congestion is detected — Congestion Avoidance
- The phases are determined by the value of the variable ssthresh

Slow Start
- Objective: determine the available capacity
- Idea:
  - Begin with cwnd = 1 packet
  - Increment cwnd by 1 packet for each ACK
- Meaning: cwnd doubles every RTT!
[Figure: source/destination timeline; each ACK releases two segments, so each round trip doubles the number in flight]

Slow Start Implementation
- When starting, or restarting after a timeout, cwnd = 1
- On each ACK for a new segment, cwnd += segsize

Slow Start Trace
[Figure: each dot is a 512 B packet sent; the y-axis is the sequence number, the x-axis is time; the straight line is 20 KBps of available bandwidth. Without slow start: ~7 KBps; with slow start: ~19 KBps]
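The "doubles every RTT" claim follows because every packet in flight earns one ACK per round trip, and each ACK adds one packet to cwnd. A small sketch of that growth (function name is our own):

```python
def slow_start_rtts(cwnd_pkts: int, target_pkts: int) -> int:
    """Round trips of slow start needed before cwnd reaches the target.

    Each in-flight packet is ACKed once per RTT, and each ACK grows cwnd
    by one packet, so cwnd doubles every round trip.
    """
    rtts = 0
    while cwnd_pkts < target_pkts:
        cwnd_pkts += cwnd_pkts      # one ACK per outstanding packet
        rtts += 1
    return rtts
```

Starting from one packet, reaching an 8-packet window takes 3 RTTs and a 1024-packet window takes 10, i.e. the ramp-up is logarithmic in the target window.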
Congestion Is Good?
- Empty buffers => low delay, low utilization
- Full buffers => good utilization, but high delay and potential loss
- Real question: how much congestion is too much?

Host Solutions
- Q: How does the source determine whether or not the network is congested?
- A: A timeout signals packet loss
  - Packet loss is rarely due to transmission errors (on wired networks)
  - A lost packet implies congestion!

Congestion Avoidance
- Control vs. avoidance:
  - Control: minimize the impact of congestion when it occurs
  - Avoidance: avoid producing congestion
[Figure: power vs. load; avoidance operates at the optimal load at the knee of the idealized power curve, control operates at its limits]
- How to get to steady state?
  - If we overuse the link => packet loss => decrease the rate
  - Why increase at all? We must probe all the time in order not to leave dead bandwidth; the only indication is dropped packets
- Slow start: multiplicative increase
- Timeout: decrease to 1!
- Symmetric multiplicative increase and decrease: strong oscillation, poor throughput. The rush-hour effect.
Rush Hour Effect
- Easy to drive the network into saturation, but difficult for the network to recover
- Analogy to rush-hour traffic
[Figure: arrival and departure rates over time; the queue size grows much faster than it drains]

Additive Increase / Multiplicative Decrease (AIMD)
- Algorithm:
  - Increment cwnd by one packet per RTT: linear increase
  - Divide CongestionWindow by two whenever a timeout occurs: multiplicative decrease
[Figure: source/destination timeline; one extra segment is added per round trip]

Why AIMD?
- AIMD: increase the window by 1 per RTT; decrease the window by a factor of 2 on a loss event
- Fairness goal: if N TCP sessions share the same bottleneck link of capacity R, each should get 1/N of the link capacity
- Model: two sessions compete for bandwidth R at the bottleneck router
[Figure: connection 1 vs. connection 2 throughput plane; the full-utilization line x + y = R and the fairness line divide the plane into regions that are under- or over-utilized and unfair to one connection or the other]
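The fairness argument can be checked numerically under the two-session model above. A sketch (function name is our own) in which both sessions add one packet per RTT and both halve when the link is overused; whatever the starting split, the difference between the two rates is halved on every loss event, so the rates converge toward the fair share.

```python
def aimd_converge(x: float, y: float, capacity: float, rounds: int):
    """Two synchronized AIMD sessions sharing one bottleneck of given capacity."""
    for _ in range(rounds):
        if x + y > capacity:     # overuse => both detect loss and halve
            x, y = x / 2, y / 2
        else:                    # additive increase: both add one packet
            x, y = x + 1, y + 1
    return x, y
```

Starting from a very unfair split such as (1, 60) on a link of capacity 100, a few hundred rounds bring the two rates within a packet of each other.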
AIMD Convergence: Model Assumptions
- Sessions know if the link is overused (losses)
- Sessions don't know their relative rates
- Simplification: sessions respond simultaneously, and in the same direction (both increase or both decrease)
[Figure: in the throughput plane, additive increase moves up at a 45° angle (both connections add 1); multiplicative decrease moves toward the origin; the trajectory converges to the point where the full-utilization line meets the fairness line]

TCP Congestion Avoidance
- When a new segment is ACKed, the sender does the following:
  - If (cwnd < ssthresh): cwnd += segsize
  - Else: cwnd += segsize × segsize / cwnd (roughly one segment per window's worth of ACKs)
  - (What happens when an ACK arrives for x new segments?)
- On timeout:
  - ssthresh := cwnd / 2
  - cwnd := 1 segment (i.e., slow start)

Typical Trace
[Figure: sawtooth behavior; cwnd in KB (10-70) over time (0-10 seconds), climbing linearly and halving on each loss]
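The per-ACK rules above combine slow start, congestion avoidance, and the timeout reaction into one small state machine. A sketch in bytes (class name is our own; integer division stands in for the kernel's arithmetic):

```python
class CongestionWindow:
    """Slow start + congestion avoidance, tracking cwnd in bytes."""

    def __init__(self, segsize: int, ssthresh: int):
        self.segsize = segsize
        self.ssthresh = ssthresh
        self.cwnd = segsize                  # start in slow start

    def on_ack(self) -> None:
        if self.cwnd < self.ssthresh:
            self.cwnd += self.segsize        # slow start: +1 segment per ACK
        else:
            # congestion avoidance: +1 segment per window's worth of ACKs
            self.cwnd += self.segsize * self.segsize // self.cwnd

    def on_timeout(self) -> None:
        self.ssthresh = self.cwnd // 2       # remember half the last window
        self.cwnd = self.segsize             # back to slow start
```

With a 1000-byte segment and ssthresh = 4000, three ACKs double-then-double cwnd up to the threshold; the fourth ACK adds only 1000²/4000 = 250 bytes, showing the switch to linear growth.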
Fast Retransmit and Fast Recovery
- Problem: crude TCP timeouts lead to idle periods, and slow start afterwards is not fast
- Fast retransmit: use duplicate ACKs to trigger retransmission
- Fast recovery: skip slow start; go directly to half the last successful cwnd (called ssthresh)
[Figure: the sender transmits packets 1-6; packet 3 is lost, so the receiver keeps ACKing 2; after three duplicate ACKs the sender retransmits packet 3 before the timeout fires, and the receiver then ACKs 6]

TCP Congestion Control: Summary
- Maintain a threshold window size ("last good estimate"), ssthresh:
  - Initially set to the maximum window size
  - Set to 1/2 of the current window on a timeout or 3 duplicate ACKs
- The congestion window drops to 1 on a timeout, drops by half on 3 duplicate ACKs
- When the congestion window is smaller than the threshold:
  - Double the window for each window ACKed (multiplicative increase)
- When the congestion window is larger than the threshold:
  - Increase the window by one MSS for each window ACKed (additive increase)
- Try to avoid timeouts via fast retransmit

TCP Congestion Window Trace
[Figure: congestion window (0-70) over time (0-60), annotated with slow-start periods, the additive-increase sawtooth, the threshold, timeouts, and fast retransmissions]

TCP Dynamics: Rate (TCP Reno)
- Sending rate: cwnd × MSS / RTT (assume a fixed RTT)
- Actual sending rate oscillates between W × MSS / RTT and (1/2) W × MSS / RTT
- Average: (3/4) W × MSS / RTT
[Figure: window sawtooth between W/2 and W]
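The duplicate-ACK trigger can be sketched as a counter: the receiver re-ACKs the same byte for every out-of-order segment, and the third duplicate fires the retransmission (class name and return strings are our own illustration).

```python
class FastRetransmit:
    """Count duplicate ACKs; retransmit after the third one (sketch)."""

    DUP_THRESHOLD = 3

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack: int) -> str:
        if ack == self.last_ack:
            self.dup_count += 1
            if self.dup_count == self.DUP_THRESHOLD:
                return f"fast retransmit from {ack}"   # resend the missing segment
            return "duplicate"
        self.last_ack, self.dup_count = ack, 0         # new data ACKed; reset
        return "new ack"
```

In the figure's scenario, the ACKs 1, 2, 2, 2, 2 produce two "new ack" events, two "duplicate" events, and then the fast retransmit of the data starting at 2, well before the RTO would have expired.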
TCP Dynamics: Loss Rate (TCP Reno)
- Consider one sawtooth cycle, in which the window grows linearly from W/2 to W:
  - Total packets sent per cycle: the average window is (3/4) W and the cycle lasts W/2 RTTs, so about (3/8) W² = O(W²) packets
  - One packet loss per cycle
  - Loss probability: p = O(1/W²), or W = O(1/√p)

Congestion Avoidance
- TCP's strategy: increase the load until congestion occurs, then back off
- Alternative strategy: predict when congestion is about to happen and reduce the rate just before packets start being discarded
- Two possibilities:
  - Some help from the network: DECbit, RED
  - Host-centric: TCP Vegas
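The loss-rate relation above can be inverted to estimate the window, and hence the average rate, from the loss probability. A sketch under the simple one-loss-per-cycle model (function names are our own; this is the back-of-the-envelope version of the well-known "inverse square-root of p" throughput relation):

```python
import math

def window_from_loss(p: float) -> float:
    """Peak window W implied by loss rate p, from p = 1 / ((3/8) W^2)."""
    return math.sqrt(8.0 / (3.0 * p))

def reno_avg_rate(mss_bytes: int, rtt_s: float, p: float) -> float:
    """Average rate (3/4) W x MSS / RTT, in bytes per second."""
    return 0.75 * window_from_loss(p) * mss_bytes / rtt_s
```

For example, sustaining a peak window of 100 packets requires a loss rate of 8/(3 × 100²) ≈ 0.027%; with a 1000-byte MSS and a 100 ms RTT that corresponds to an average rate of 750 KB/s.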