TCP loss sensitivity analysis
Adam Krajewski, IT-CS-CE

Original problem

IT-DB was backing up data to the Wigner computer centre. A high transfer rate was required to avoid service degradation, but on the 10G Meyrin-Wigner link the achieved rate was:
- initially: ~1G
- after some tweaking: ~4G out of 10G
The problem lies within TCP internals!

TCP introduction

TCP (Transmission Control Protocol) is a protocol providing reliable, connection-oriented transport over a network. It operates at the Transport layer (4th) of the OSI reference model. It was defined in RFC 793, from 1981 [10 Mb/s links at that time!].
- Reliable: it guarantees that no data is lost during communication (lost segments are detected and retransmitted).
- Connection-oriented: it creates and maintains a logical connection between sender and receiver.
[Diagram: OSI layers — Application, Presentation, Session, Transport, Network, Data Link, Physical]

TCP operation basics: connection init

Connection initiation is a 3-way handshake that sets up the sequence counters:
- the client sends a random SEQ number (ISN, Initial Sequence Number)
- the server confirms it (ACK = SEQ + 1) and sends its own SEQ number
- the client confirms and syncs to the server's SEQ number
[Diagram: SYN / SYN-ACK / ACK exchange between client and server]

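To watch a handshake (and the ISNs) on a live system, a minimal tcpdump sketch — the interface name and port are assumptions, not from the slides:

    # Print SYN and SYN-ACK segments with absolute sequence numbers
    tcpdump -nSi eth0 'tcp port 5201 and tcp[tcpflags] & tcp-syn != 0'
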
TCP is reliable

TCP uses the SEQ and ACK numbers to track which segments have been successfully transmitted. Lost packets are easily detected and retransmitted. This provides reliability...
...but do we really need to wait for every ACK?
[Diagram: a lost segment is retransmitted after a timeout]

TCP window (1)

Let us assume a client and a server connected by a 1G Ethernet link with 1 ms one-way delay, i.e. RTT = 2 ms, B = 1 Gbps.
If we send packets of 1460 bytes each and wait for the acknowledgment before sending the next one, the effective transfer rate is 1460 B / 2 ms ≈ 5.84 Mbps. So we end up using about 0.5% of the available bandwidth.

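Making the unit conversion explicit (a worked version of the numbers above):

    \[
    \frac{1460\,\mathrm{B} \times 8\,\mathrm{bit/B}}{2\,\mathrm{ms}}
    = \frac{11\,680\,\mathrm{bit}}{0.002\,\mathrm{s}}
    = 5.84\,\mathrm{Mbit/s}
    \approx 0.58\,\%\ \text{of}\ 1\,\mathrm{Gbit/s}
    \]
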
TCP window (2)

In order to use the full bandwidth we need to keep at least B × RTT unacknowledged bytes in flight. This way we make sure we do not just sleep waiting for ACKs, but keep sending traffic in the meantime.
The B × RTT factor is referred to as the bandwidth-delay product (BDP).
[Diagram: segments 1-5 sent back-to-back; ACK-1 ... ACK-5 returning one RTT later]

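For the 1 Gbps / 2 ms example above, the bandwidth-delay product is:

    \[
    B \times RTT = 10^{9}\,\mathrm{bit/s} \times 0.002\,\mathrm{s}
    = 2\times 10^{6}\,\mathrm{bit} = 250\,\mathrm{KB}
    \]

so the sender must keep roughly 250 KB in flight to fill the link.
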
TCP window (3)

The client sends a certain amount of segments one after another. As soon as the first ACK is received, it assumes the transmission is fine and continues to send segments.
Basically, at each time t_0 there are W unacknowledged bytes on the wire. W is referred to as the TCP window size, and W ≠ const.
[Diagram: sliding window of W bytes; SEQ numbers 1, 1461, 2921, 4381, 5841, 7301, 8761; ACK=1461, ACK=2921 returning]

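Tying the last three slides together — the standard steady-state relation (my formulation; the slides imply it rather than state it):

    \[
    \text{throughput} \approx \frac{W}{RTT},
    \qquad \text{e.g.}\ \frac{250\,\mathrm{KB}}{2\,\mathrm{ms}} = 1\,\mathrm{Gbit/s}
    \]
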
Window growth

The window starts at a default value: wnd ← W_def.
Then it grows by 1 every RTT: wnd ← wnd + 1.
But if a packet loss is detected, the window is cut in half: wnd ← wnd / 2.
This is an AIMD algorithm (Additive Increase, Multiplicative Decrease).
[Plot: window size over time — sawtooth dropping from W_1 to 0.5·W_1 on loss, then linear regrowth from W_def]

Congestion control

Network congestion appears when there is more traffic than the switching devices can handle. TCP prevents congestion:
- it starts with a small window, thus a low transfer rate
- then the window grows linearly, increasing the transfer rate
- TCP assumes congestion is happening when packets are lost, so on a packet loss it cuts the window in half, reducing the transfer rate
[Plot: transfer rate over time — sawtooth oscillating between 0.5·B_c and B_c]

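A side note on the sawtooth (my arithmetic, not from the slide): a rate oscillating linearly between 0.5·B_c and B_c averages out to about three quarters of capacity:

    \[
    \bar{B} \approx \frac{0.5\,B_c + B_c}{2} = 0.75\,B_c
    \]
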
Long Fat Network problem

Long Fat Networks (LFNs) are networks with a high bandwidth-delay product (B × RTT). Example: the CERN link to Wigner. To get full bandwidth on the 10G link with 25 ms RTT we need a window of
    W_25 = 10 Gbps × 25 ms ≈ 31.2 MB
If a drop occurs at peak throughput, the window is halved to W_drop ≈ 15.6 MB, and growing it back one byte per RTT takes
    T ≈ 15.6 × 10^6 RTTs × 25 ms ≈ 108 h

TCP algorithms (1)

The default TCP algorithm (reno) behaves poorly on such links. Other congestion-control algorithms have been developed, such as:
- BIC
- CUBIC
- Scalable
- Compound TCP
Each of them behaves differently, and we want to see how they perform in our case...

Testbed

Two servers connected back-to-back. Question: how does congestion-control performance change with packet loss and delay?
Throughput was measured with iperf3. Packet loss and delay were emulated using NetEm in Linux:
    tc qdisc add dev eth2 root netem loss 0.1% delay 12.5ms
(12.5 ms added in each direction yields a 25 ms RTT; a sketch of a full test run follows below.)

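A minimal sketch of one measurement, assuming the eth2 interface from the slide and an example server address (192.0.2.1):

    # on both hosts: emulate loss and half of the target RTT
    tc qdisc add dev eth2 root netem loss 0.1% delay 12.5ms

    # server side
    iperf3 -s

    # client side: 30 s run, congestion-control algorithm chosen per test
    iperf3 -c 192.0.2.1 -t 30 -C cubic
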
[Plot: Bandwidth [Gbps] vs. packet loss rate [%] (1e-4 to 10), 0 ms delay, default settings; curves: cubic, reno, bic, scalable]

[Plot: Bandwidth [Gbps] vs. packet loss rate [%] (1e-4 to 10), 25 ms delay, default settings; curves: cubic, reno, bic, scalable; y-axis tops out at 1.2 Gbps — marked PROBLEM]

Growing the window (1)

We need W_25 = B × RTT = 10 Gbps × 25 ms ≈ 31.2 MB.
Default settings:
    net.core.rmem_max = 124K
    net.core.wmem_max = 124K
    net.ipv4.tcp_rmem = 4K 87K 4M
    net.ipv4.tcp_wmem = 4K 16K 4M
    net.core.netdev_max_backlog = 1K
The window can't grow bigger than the maximum TCP send and receive buffers, so
    B_max = W_max / RTT = 4 MB / 25 ms ≈ 1.28 Gbps
WARNING: the real OS values are plain byte counts, so less readable.

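To check what a host currently allows (standard sysctl invocation; key names as on the slide):

    sysctl net.core.rmem_max net.core.wmem_max \
           net.ipv4.tcp_rmem net.ipv4.tcp_wmem \
           net.core.netdev_max_backlog
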
Growing the window (2)

We tune the TCP settings on both server and client:
    net.core.rmem_max = 67M
    net.core.wmem_max = 67M
    net.ipv4.tcp_rmem = 4K 87K 67M
    net.ipv4.tcp_wmem = 4K 65K 67M
    net.core.netdev_max_backlog = 30K
Now it's fine.
WARNING: TCP buffer size doesn't translate directly to window size, because TCP uses a portion of its buffers for operational data structures. The effective window will be smaller than the maximum buffer size set.
Helpful link: https://fasterdata.es.net/host-tuning/linux/

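One way to make these persistent — a sketch using the byte-count equivalents of the slide's values (67M = 67108864, 87K = 87380, 65K = 65536, 30K = 30000) and a hypothetical file name:

    # /etc/sysctl.d/90-tcp-tuning.conf
    net.core.rmem_max = 67108864
    net.core.wmem_max = 67108864
    net.ipv4.tcp_rmem = 4096 87380 67108864
    net.ipv4.tcp_wmem = 4096 65536 67108864
    net.core.netdev_max_backlog = 30000

    # apply without reboot:
    #   sysctl --system
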
[Plot: Bandwidth [Gbps] (0-10) vs. packet loss rate [%] (1e-4 to 10), 25 ms delay, tuned settings; curves: cubic, reno, scalable, bic]

[Plot: Bandwidth [Gbps] (0-10) vs. packet loss rate [%] (1e-4 to 10), 0 ms delay, tuned settings; curves: cubic, scalable, reno, bic; the default-settings cubic curve marked "DEFAULT CUBIC" for comparison]

Bandwidth fairness

Goal: each TCP flow should get a fair share of the available bandwidth.
[Diagram: a single flow on a 10G link gets 10G; two concurrent flows converge to 5G each over time]

Bandwidth competition

How do congestion-control algorithms compete? Tests were done for:
- Reno, BIC, CUBIC, Scalable
- 0 ms and 25 ms delay
- two cases: 2 flows starting at the same time, and 1 flow delayed by 30 s (a sketch of this setup follows below)
[Diagram: two 10G senders into a shared 10G link — the congestion point]

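A minimal sketch of the delayed-start case with iperf3, assuming two server instances on example ports and an example address:

    # on the server:  iperf3 -s -p 5201 &   and   iperf3 -s -p 5202 &
    iperf3 -c 192.0.2.1 -p 5201 -t 60 -C cubic &
    sleep 30     # second flow enters 30 s later
    iperf3 -c 192.0.2.1 -p 5202 -t 30 -C bic
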
cubic vs cubic + offset
[Plot: two cubic flows, the second starting with a time offset]

cubic vs bic
[Plot: per-flow bandwidth [Gbps] vs. packet loss rate [%] (1e-4 to 10), 0 ms delay; curves: cubic, bic]

cubic vs scalable
[Plot: per-flow bandwidth [Gbps] vs. packet loss rate [%] (1e-4 to 10), 0 ms delay; curves: cubic, scalable]

TCP (reno) friendliness

cubic vs reno
[Plot: per-flow bandwidth [Gbps] vs. packet loss rate [%] (1e-4 to 10), 25 ms delay; curves: cubic, reno]

Why default cubic? (1)

BIC was the default Linux TCP congestion-control algorithm until kernel version 2.6.19, when the default was changed to CUBIC with the commit message:
"Change default congestion control used from BIC to the newer CUBIC which it [is] the successor to BIC but has better properties over long delay links."
Why?

Why default cubic? (2)

It is friendlier to other congestion-control algorithms — to Reno...

Why default cubic? (2)

...and to CTCP (the Windows default)...

Why default cubic? (3)

It has better RTT-fairness properties.
[Diagram: flows with different RTTs — 0 ms (Meyrin) and 25 ms (Wigner) — sharing a link]

Why default cubic? (4)

RTT fairness test:
[Plots: two competing flows per algorithm — one with 0 ms RTT, one with 25 ms RTT]

The real problem

Lots of network paths!
[Diagram: Meyrin ToR — 100G LAG — 100G LAG — Wigner ToR]
Packet losses occur on:
- links (dirty fibers, ...)
- NICs (hardware limitations, bugs, ...)
Difficult to troubleshoot! (One starting point is sketched below.)

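For NIC-side losses, the standard Linux counters are a first stop — eth0 is an example name, and ethtool counter names vary by driver:

    # per-interface RX/TX statistics, including drops
    ip -s link show dev eth0

    # driver-level counters
    ethtool -S eth0 | grep -iE 'drop|discard|err'
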
Conclusions

To achieve a high transfer rate to Wigner:
User part:
- Tune TCP settings! Follow https://fasterdata.es.net/host-tuning/
- Make sure you use at least CUBIC on Linux and CTCP on Windows
- Avoid BIC: better performance on lossy links, but unfair to other congestion algorithms and unfair to itself (RTT fairness)
- Submit a ticket if you observe low network performance for your data transfer
Networking team part:
- minimizing packet loss throughout the network
- fiber cleaning
Collaboration is essential!

Setting TCP algorithms (1)

OS-wide setting (sysctl):
    net.ipv4.tcp_congestion_control = reno
Currently loaded algorithms:
    net.ipv4.tcp_available_congestion_control = cubic reno bic
Add a new one:
    # modprobe tcp_bic
Algorithms selectable by unprivileged applications:
    net.ipv4.tcp_allowed_congestion_control = cubic reno
Per-socket setting: setsockopt() with the TCP_CONGESTION optname.

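Putting the Linux knobs together, a short sketch (bic as the example algorithm):

    # load the module, confirm it registered, make it the OS-wide default
    modprobe tcp_bic
    sysctl net.ipv4.tcp_available_congestion_control
    sysctl -w net.ipv4.tcp_congestion_control=bic
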
Setting TCP algorithms (2)

OS-wide settings on Windows:
    C:\> netsh interface tcp show global
    Add-On Congestion Control Provider: ctcp
CTCP is enabled by default in Windows Server 2008 and newer, but needs to be enabled manually on client systems:
Windows 7:
    C:\> netsh interface tcp set global congestionprovider=ctcp
Windows 8/8.1 (PowerShell):
    Set-NetTCPSetting -CongestionProvider ctcp
