Oriana Riva, Department of Computer Science, ETH Zürich. Advanced Computer Networks 263-3501-00: Layer-7 Switching and Load Balancing. Patrick Stuedi, Qin Yin and Timothy Roscoe. Spring Semester 2015
Outline
- Last time: virtual machine networking, para-virtualization, SR-IOV, IOMMU
- Today: load balancing, TCP splicing, distributed load balancing
Challenge: accessing services
- Datacenters are designed to be scalable: datacenters are replicated, and each has lots of machines
- Services span (and share) datacenters
- So: what address does, e.g., www.search.ch resolve to? What entity does this address refer to? What does this entity do?
Requirements
- A close-by datacenter
- Load balance across machines in a datacenter
- Target machines where the user's state is kept
- Accessed using TCP (HTTP, SSL, ...)
Option 1: IP Anycast
- One IP address refers to multiple destinations: BGP advertises multiple destinations, and packets end up at the destination nearest to the source
- Problems: the IP layer is only reliable for stateless protocols (UDP); all packets of a TCP flow must go to the same machine; and pushing service location into BGP couples routing with end-system provision
- Used today for DNS root server location
Requirements (recap)
- A close-by datacenter
- Load balance across machines in a datacenter
- Target machines where the user's state is kept
- Accessed using TCP (HTTP, SSL, ...), so all packets of a TCP flow must go to the same machine
Recall: DNS lookup
Option 2: DNS
- Insight: who says the answer is always the same?
- Idea: a smart DNS server authoritative for the service
- A query for, e.g., www.google.com or www.bing.com returns a different A record depending on: the source address of the browser machine; the current state of the service (load, failures); or a random number
DNS tricks
- One level of indirection: a single DNS server returns different A records
- Additional level of indirection: the first service resolver returns a CNAME, and a regional service resolver can be more specific
- Used for finding the nearest datacenter for a service
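The first trick above can be sketched as the selection logic inside an authoritative server: return a different A record depending on the resolver's source address and current service health. This is a minimal illustration only; the region names, VIP addresses, and the prefix-based "geolocation" are all made up, and a real server would consult a proper geo/anycast database.

```python
import random

# Hypothetical mapping from client regions to each datacenter's
# frontend VIPs (illustrative documentation addresses, not real ones).
REGION_VIPS = {
    "region-eu": ["192.0.2.10", "192.0.2.11"],
    "region-us": ["198.51.100.10", "198.51.100.11"],
}

def region_of(resolver_ip: str) -> str:
    # Stand-in for a real geolocation lookup: classify the resolver
    # by its address prefix (purely illustrative).
    return "region-eu" if resolver_ip.startswith("192.") else "region-us"

def pick_a_record(resolver_ip: str, healthy: dict) -> str:
    """Return a different A record depending on the source address,
    the current health of the service, and a random choice."""
    region = region_of(resolver_ip)
    candidates = [vip for vip in REGION_VIPS[region] if healthy.get(vip, True)]
    return random.choice(candidates)
```

Note that the server only sees the resolver's address, not the client's, which is exactly the limitation discussed on the next slide.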
Using CNAMEs: timeouts
DNS does not solve the problem
- Need an IP address for every instance of the service: 100,000 machines means 100,000 globally routable IP addresses, which is expensive!
- A machine failure means updating DNS state; rapidly changing DNS state means short TTLs on queries, and hence even higher load on DNS servers
- Slow to react to hot spots or other load skews
- Selection of a machine can only be made based on the address of the client's primary resolver: we don't know which client this is
Next step: use one IP address
- Use Network Address Translation
- Hash source addresses to server machines
TCP three-way handshake
Stateless hashing
- Hash(source IP): completely static, no dynamic load balancing
- Hash(source IP, source TCP port): better, but still static, and limited to 64k destinations per client machine
- Known as a Layer-4 load balancer
- Basic problem: nothing else is known by the end of the handshake!
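The Layer-4 scheme above can be sketched in a few lines: hash the (source IP, source port) pair so every packet of a flow maps to the same server. The server addresses are invented for illustration; the sketch also shows why the scheme is static, since changing the server set silently remaps existing flows.

```python
import hashlib

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]  # hypothetical DIPs

def l4_pick(src_ip: str, src_port: int, servers=SERVERS) -> str:
    # Hash the flow identifier so that all packets of one TCP flow
    # deterministically map to the same server.
    key = f"{src_ip}:{src_port}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    # Static mapping: if len(servers) changes (failure, upgrade,
    # provisioning), almost every existing flow lands on a new server.
    return servers[digest % len(servers)]
```

A quick experiment over a few hundred flows shows that removing a single server from the list remaps far more flows than the fraction that server actually carried.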
Why is static hashing bad?
- Machine failure/upgrade/provisioning: can't update the hash function efficiently in the switch
- Load balancing: can't avoid a heavily-loaded machine
- Lack of locality, both of the resource being accessed and of the client accessing the resource
What else might we want to hash on?
HTTP Host: header
- Introduced in HTTP/1.1, where it is mandatory
- Hosting providers need to switch based on the virtual host, not the physical host: different services have different virtual hosts
- Avoids replicating all service state everywhere
Switching on URL
- Locality: allows state to be partitioned across machines
- Isolation: rare, computationally intensive URLs can be sequestered, and sensitive data can be kept on more expensive, audited machines
Hashing on cookies
- Enables partitioning of servers by user state or session state
- Critical for scaling online services to billions of users: no need to share state, no need to synchronize state
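Cookie-based partitioning can be sketched the same way as Layer-4 hashing, but keyed on a session cookie instead of the flow tuple, so every request carrying the same session lands on the server that already holds that session's state. The cookie name `session-id` is hypothetical.

```python
import hashlib

def pick_by_cookie(cookies: dict, servers: list) -> str:
    # Key the server choice on a (hypothetical) session cookie rather
    # than on the source address: requests for the same session stick
    # to one server, so session state never needs to be shared.
    session = cookies.get("session-id", "")
    h = int.from_bytes(hashlib.sha256(session.encode()).digest()[:8], "big")
    return servers[h % len(servers)]
```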
How to do it?
- Problem: we don't know the hash key until after the HTTP request, typically the first segment after the three-way handshake
- Solution: don't establish the connection to the server until the client has sent the HTTP request
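The solution above can be sketched as a userspace proxy: complete the handshake with the client, read the whole request head (only now is the hash key available), and only then open the server-side connection. The Host-to-backend mapping and all addresses are illustrative assumptions.

```python
import socket

def backend_for(request_head: bytes) -> tuple:
    # Choose a backend from the Host: header (mapping is made up).
    host = b""
    for line in request_head.split(b"\r\n"):
        if line.lower().startswith(b"host:"):
            host = line.split(b":", 1)[1].strip()
    return ("10.0.1.1", 8080) if host == b"www.search.ch" else ("10.0.2.1", 8080)

def handle(client: socket.socket) -> None:
    # 1. The 3WH with the client is already done; read the full request
    #    head, since the hash key (Host, URL, cookies) arrives only now.
    head = b""
    while b"\r\n\r\n" not in head:
        head += client.recv(4096)
    # 2. Only now set up the server-side connection and replay the request.
    server = socket.create_connection(backend_for(head))
    server.sendall(head)
    # (A real switch would now splice the two connections; a naive
    #  proxy keeps copying bytes between them in both directions.)
```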
Late-binding of TCP connection
- The client completes TCP connection setup with the switch (e.g. from port 3620) and sends its HTTP GET
- Only then does the switch perform TCP connection setup with the server and forward the HTTP GET
- The HTTP response flows back from the server via the switch to the client (acks not shown)
Late-binding: naïve implementation (SOCKS protocol)
- Inefficient: the switch needs to copy data between the two connections!
TCP Splicing
- Proposed around 1997 by Maltz & Bhagwat at IBM
- Key idea: take two established TCP connections and splice them, transferring segments unmodified between them while remapping port numbers and sequence numbers on the fly
- Advantages: a very simple calculation per packet, not much state to maintain per spliced connection, no segmentation/reassembly, no buffering/copying
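The remapping can be sketched as follows: at splice time, both connections agree on "the next byte to transfer" in each direction, so the difference between their sequence spaces is a fixed per-direction offset that holds for every later packet, modulo 2^32. This is a simplified sketch of the idea described above, not the authors' exact algorithm.

```python
MOD = 1 << 32  # TCP sequence numbers wrap at 2^32

def compute_deltas(next_seq_a: int, next_seq_b: int,
                   next_ack_a: int, next_ack_b: int):
    # a = client-side connection, b = server-side connection.
    # Computed once, from existing connection state, when the splice occurs.
    d_seq = (next_seq_b - next_seq_a) % MOD  # client->server data direction
    d_ack = (next_ack_b - next_ack_a) % MOD  # acks carried in that direction
    return d_seq, d_ack

def remap_client_packet(seq: int, ack: int, d_seq: int, d_ack: int):
    # Per-packet work is just two modular additions.
    return (seq + d_seq) % MOD, (ack + d_ack) % MOD
```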
Splicing pseudocode (from Maltz & Bhagwat)
[Figure: the switch queues packets received from the server and splices the connections, but allows the final 'n' bytes to be transmitted to the client before splicing; an 'n'-bytes message signals the completion of the splicing operation.]
What state is needed?
For each packet, the switch does the following:
- IP header operations: rewrite source and destination IP addresses, and update the IP header checksum
- TCP header operations: rewrite source and destination port numbers, apply a fixed offset to the sequence number, apply a fixed offset to the acknowledgement number, and update the TCP header checksum
- The offsets are calculated from existing connection state when the splice occurs
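The checksum updates above do not require recomputing over the whole packet: because the Internet checksum is a one's complement sum, changing one 16-bit word can be patched incrementally (this is the RFC 1624 update method, not something specific to splicing). A sketch:

```python
def csum16_update(old_csum: int, old_word: int, new_word: int) -> int:
    # Incremental Internet-checksum update (RFC 1624):
    #   checksum' = ~(~checksum + ~old_word + new_word)
    # in one's complement arithmetic, i.e. with end-around carry folds.
    s = (~old_csum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    s = (s & 0xFFFF) + (s >> 16)   # fold carry back in
    s = (s & 0xFFFF) + (s >> 16)   # a second fold covers any new carry
    return ~s & 0xFFFF
```

This is why the per-packet cost stays a "very simple calculation": rewriting an address or applying a sequence offset costs a handful of additions, not a pass over the payload.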
It's easy to do in hardware
Example: A10 AX Application Delivery Controller
- Advanced layer 4 / layer 7 server load balancing
- HTTP proxy; layer 7 URL and URL-hash switching
- Comprehensive layer 7 application persistence support
- Load balancing methods: round robin, least connections, weighted round robin, weighted least connections, fastest response
- Aggregated throughput: up to 115 Gbps
Problems of single-box load balancing
- Expensive!
- Scale-up only: buy a bigger (more expensive) load balancer when reaching capacity
Ananta: load balancing in Windows Azure
- Windows Azure is Microsoft's cloud computing platform: compute, storage, databases, etc. in the cloud
- Ananta is distributed, scalable load balancing running on hosts in a datacenter: lower cost, and it scales on demand
Background: Windows Azure load balancing
- Clients connect to a service using a virtual IP (VIP)
- The load balancer (LB) balances traffic to specific server machines using their direct IPs (DIPs)
Background: Windows Azure load balancing (2)
- The load balancer is also used when two services communicate within the same datacenter
Ananta: inbound traffic [figure: Ananta Manager]
Ananta: inbound traffic
1. Spread packets to MUXes using ECMP
2. Look up the VIP-to-DIP mapping
3. Tunnel the packet to the DIP
4-5. Decapsulate and forward to the DIP
6-7. Encapsulate the response
8. Forward to the router (bypassing the MUX)
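The MUX steps above can be sketched as follows. The key property is that every MUX computes the same flow hash over the same VIP-to-DIP table, so whichever MUX the router's ECMP spreads a packet to picks the same DIP for that flow. Addresses, the table, and the dict-based "encapsulation" are all illustrative assumptions, not Ananta's actual data structures.

```python
import hashlib

# Hypothetical VIP-to-DIP mapping, distributed to every MUX by the manager.
VIP_TO_DIPS = {"20.0.0.1": ["10.1.0.4", "10.1.0.5", "10.1.0.6"]}

def mux_forward(vip: str, flow: tuple) -> dict:
    # Step 2: look up the DIPs behind this VIP.
    dips = VIP_TO_DIPS[vip]
    # Deterministic flow hash: any MUX receiving this flow picks the
    # same DIP, so ECMP spreading across MUXes is safe.
    h = int.from_bytes(hashlib.sha256(repr(flow).encode()).digest()[:8], "big")
    dip = dips[h % len(dips)]
    # Step 3: tunnel the unmodified packet toward the chosen DIP; the
    # host agent decapsulates (4-5), and the response returns to the
    # router directly, bypassing the MUX (6-8).
    return {"outer_dst": dip, "inner_dst": vip}
```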
Summary
- IP Anycast: select a DNS root server
- Dynamic DNS: locate nearby datacenters
- Layer-4 switching: balance connections across machines
- TCP splicing: seamlessly join two connections
- Layer-7 switching: use splicing to late-bind servers to HTTP connections
- Ananta: distributed load balancing
References
- C. Partridge, T. Mendez, W. Milliken, "Host Anycasting Service", Internet RFC 1546, November 1993.
- David A. Maltz and Pravin Bhagwat, "TCP Splicing for Application Layer Proxy Performance", IBM Research Report 21139 (Computer Science/Mathematics), IBM Research Division, 1998.
- "Ananta: Cloud Scale Load Balancing", ACM SIGCOMM 2013.