Small is Better: Avoiding Latency Traps in Virtualized Data Centers
Yunjing Xu, Michael Bailey, Brian Noble, Farnam Jahanian (University of Michigan)
SOCC 2013
Outline
- Introduction
- Related Work
- Sources of Latency
- Design and Implementation
- Evaluation
- Conclusion
Introduction
- Public clouds have become a popular platform for building Internet-scale applications.
- Applications built on public clouds are often highly sensitive to response time.
- Large data centers have become the cornerstone of modern, Internet-scale Web applications.
- Unlike dedicated data centers, a public cloud relies on virtualization both to hide the details of the underlying host infrastructure and to support multi-tenancy.
Related Work
- Kernel-centric: modify the operating system (OS) kernel to deploy new TCP congestion control algorithms (e.g., DCTCP, HULL).
- Application-centric: applications must be modified to tag the packets they generate with scheduling hints (e.g., D3, D2TCP, DeTail, PDQ, and pFabric).
- Operator-centric: require operators to change their application deployment (e.g., Bobtail).
Host-centric
- Does not require or trust guest cooperation; it only modifies the host infrastructure controlled by cloud providers.
- The existing host-centric system, EyeQ:
  - mainly focuses on bandwidth sharing in the cloud
  - requires feedback loops between hypervisors
  - needs explicit bandwidth headroom to reduce latency
  - solves only one of the three latency problems addressed by our solution
Sources of Latency
- VM scheduling delay: the server VM cannot process packets until it is scheduled by the hypervisor.
- Host network queueing delay: response packets first go through the host network stack, which processes I/O requests on behalf of all guest VMs.
- Switch queueing delay: response packets on the wire may experience queueing delay at congested switch links.
Sources of Latency
[Figure slide: where the three delays arise along the request path]
EC2 and the Xen Hypervisor
- Prior studies find that virtualization and multi-tenancy are key causes of EC2's performance variation.
- Xen runs on bare-metal hardware to manage guest VMs.
- A privileged guest VM called dom0 fulfills I/O requests for non-privileged guest VMs.
EC2 Measurements: VM Scheduling Delay and Switch Queueing Delay
[Figure slides: EC2 measurement results]
Testbed Experiments: Host Network Queueing Delay
- Kernel network stack: packets queue at the NIC transmission queue.
- Byte Queue Limits (BQL): a new device-driver interface that bounds the bytes queued for the NIC.
- CoDel: if queued packets have already spent too much time in the queue, the upper layer of the network stack is notified to slow down.
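CoDel's core rule, judging a queue by packet sojourn time rather than queue length, can be sketched as a toy Python model. The 5 ms target and 100 ms interval are CoDel's well-known defaults; the class and method names are ours, and the real kernel implementation additionally tightens the drop interval on successive drops, which is omitted here.

```python
TARGET_MS = 5      # CoDel default: acceptable sojourn time
INTERVAL_MS = 100  # CoDel default: how long the target may be exceeded

class CoDelQueue:
    """Toy model of CoDel: signal the sender to slow down once packets
    have waited longer than TARGET_MS for at least INTERVAL_MS."""

    def __init__(self):
        self.q = []              # FIFO of (enqueue_time_ms, packet)
        self.first_above = None  # when sojourn time first exceeded TARGET_MS

    def enqueue(self, now_ms, pkt):
        self.q.append((now_ms, pkt))

    def dequeue(self, now_ms):
        if not self.q:
            self.first_above = None
            return None
        t_in, pkt = self.q.pop(0)
        sojourn = now_ms - t_in
        if sojourn < TARGET_MS:
            self.first_above = None          # queue is draining fine
        elif self.first_above is None:
            self.first_above = now_ms        # start the grace interval
        elif now_ms - self.first_above >= INTERVAL_MS:
            return ("drop", pkt)             # tell the upper layer to back off
        return ("ok", pkt)
```

A brief standing queue is tolerated; only a queue that stays above the target for a full interval triggers the slow-down signal.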
Testbed Experiments: Host Network Queueing Delay (cont.)
[Topology figure: three physical machines (hosting A1/A2, B1, and C1) connected by a switch; bulk traffic causes congestion while ping measures latency]
Testbed Experiments: Host Network Queueing Delay (cont.)
- Congestion Free: A1 and B1 are left idle.
- Congestion Enabled: A1 sends bulk traffic to B1, without BQL or CoDel.
- Congestion Managed: A1 sends bulk traffic to B1, with BQL and CoDel enabled.
Testbed Experiments: Host Network Queueing Delay (cont.)
- While BQL and CoDel significantly reduce latency, the result is still four to six times the congestion-free baseline.
Summary
- Switch queueing delay increases network tail latency by over 10x; combined with VM scheduling delay, it becomes more than 20x as bad.
- Host network queueing delay also worsens the flow completion time (FCT) tail by four to six times.
Design and Implementation
- Principle I: Do not trust guest VMs.
- Principle II: Shortest remaining time first.
- Principle III: No undue harm to throughput.
VM Scheduling Delay
- The Credit Scheduler is currently Xen's default VM scheduler.
- Each guest VM has at least one VCPU.
- Credits are redistributed at 30 ms intervals.
- A VCPU is in one of three states: OVER, UNDER, or BOOST.
VM Scheduling Delay (cont.)
- The BOOST mechanism only prioritizes VMs over others in the UNDER or OVER states.
- BOOSTed VMs cannot preempt each other; they are round-robin scheduled.
- We deploy a more aggressive VM scheduling policy that allows BOOSTed VMs to preempt each other.
- Xen's rate-limit mechanism maintains overall system throughput by preventing preemption when the running VM has run for less than 1 ms (its default setting).
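The policy change described above can be sketched as a single preemption decision. The states and the 1 ms rate limit come from the slides; the function itself is an illustrative simplification of Xen's scheduler, not its actual code.

```python
BOOST, UNDER, OVER = "BOOST", "UNDER", "OVER"
RATELIMIT_US = 1000  # Xen default: no preemption before 1 ms of runtime

def should_preempt(waking_state, running_state, running_runtime_us,
                   boost_preempts_boost=True):
    """Decide whether a waking VCPU may preempt the running one.
    boost_preempts_boost=False models stock Xen; True models the
    more aggressive policy the paper deploys."""
    if running_runtime_us < RATELIMIT_US:
        return False                 # rate limit protects throughput
    if waking_state != BOOST:
        return False                 # only BOOSTed VCPUs preempt
    if running_state in (UNDER, OVER):
        return True                  # stock Xen already allows this
    # Running VCPU is also BOOSTed: stock Xen says no; the new policy says yes.
    return boost_preempts_boost
```

The rate-limit check comes first in both policies, which is why the paper's change costs little throughput.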
Host Network Queueing Delay
- From Table 2, even when BQL and CoDel are both enabled, queueing delay remains four to six times the baseline.
- The root cause: transmit requests are often too large and hard to preempt.
- Our solution is to break large requests into smaller ones, allowing CoDel to conduct fine-grained packet scheduling.
Host Network Queueing Delay (cont.)
- Path: guest VM frontend -> dom0 backend -> NIC.
- A packet sent by a guest VM first goes to the frontend, which copies it to the backend.
- Xen's backend announces to guest VMs that hardware segmentation offload is not supported, so guests must segment large packets themselves before copying them over.
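The effect of withholding segmentation offload is that one large send becomes many MTU-sized packets before entering the host queue, giving CoDel smaller units to schedule between. A minimal sketch, assuming a standard 1500-byte Ethernet MTU (the function name and interface are ours):

```python
MTU = 1500  # bytes; standard Ethernet payload size (an assumption here)

def segment(payload_len, mtu=MTU):
    """Split one large transmit request into MTU-sized chunks so the
    host queue can interleave traffic at fine granularity.
    Returns the list of chunk sizes."""
    full, rest = divmod(payload_len, mtu)
    return [mtu] * full + ([rest] if rest else [])
```

For example, a 64 KB send becomes dozens of 1500-byte packets, each of which CoDel can time and, if necessary, drop independently.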
Switch Queueing Delay
- Idea: let switches favor small flows when scheduling packets.
- We define a flow as a collection of IP packets, and flow size as the instantaneous size of the message carried by the flow.
- In reality, message boundaries are difficult to define, so we measure flow rate as an approximation of the message semantics.
- We classify flows into two classes: small and large.
Switch Queueing Delay (cont.)
- First, we build a monitoring and tagging module in the host that sets the priority of outgoing packets; small flows are tagged as high priority.
- Second, we need switches that support basic priority queueing.
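The host-side monitoring and tagging module can be sketched as a per-flow rate check: a flow stays high priority until its recent byte count crosses a threshold. The 100 KB-per-10 ms cutoff below is an assumed illustrative parameter, not the paper's actual setting, and the class is ours.

```python
import collections

THRESHOLD_BYTES = 100 * 1024  # assumed rate cutoff per window, for illustration
WINDOW_MS = 10                # assumed monitoring window

class FlowTagger:
    """Toy host-side monitor: tag packets high priority while their
    flow's byte count in the current window stays under the threshold."""

    def __init__(self):
        self.bytes_in_window = collections.defaultdict(int)
        self.window_start = collections.defaultdict(int)

    def tag(self, flow_id, pkt_len, now_ms):
        if now_ms - self.window_start[flow_id] >= WINDOW_MS:
            self.window_start[flow_id] = now_ms   # start a fresh window
            self.bytes_in_window[flow_id] = 0
        self.bytes_in_window[flow_id] += pkt_len
        return "high" if self.bytes_in_window[flow_id] <= THRESHOLD_BYTES else "low"
```

A switch with basic priority queueing then only has to honor the tag; all the per-flow state lives in the host, consistent with the host-centric design.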
Putting It All Together
- We change a single line in the credit scheduler of Xen 4.2.1 to enable the new scheduling policy.
- We modify the CoDel kernel module in Linux 3.6.6 (about 20 lines) to segment large packets in the host.
- We augment Xen's network backend (about 200 lines of changes) to do flow monitoring and tagging.
Evaluation
- The testbed consists of five four-core physical machines running Linux 3.6.6 and Xen 4.2.1, connected to a Cisco Catalyst 2970 switch.
Evaluation (cont.)
- A1 serves small responses to E1.
- A2 sends bulk traffic to B1 (host network queueing delay).
- A3 and A4 run CPU-bound tasks (VM scheduling delay).
- C1 and D1 respond to E2's queries with large flows, congesting E's access link (switch queueing delay).
Evaluation (cont.)
- Our solution achieves about a 40% reduction in mean latency, over 56% at the 99th percentile, and almost 90% at the 99.9th percentile.
VM Scheduling Delay: Microbenchmark
- Setup: A1 and A3 share a host; E1 connects through the switch.
- E1 queries A1 for small responses.
- A3 keeps running a CPU-bound workload.
VM Scheduling Delay: Results
- Our new policy reduces latency at the 99.9th percentile by 95%, at the cost of a 3.8% reduction in CPU throughput.
Host Network Queueing Delay: Microbenchmark
- Setup: A1 and A2 share a host; B1 and E1 connect through the switch.
- A1 pings E1 once every 10 ms to measure round-trip time.
- A2 saturates B1's access link with iperf.
Host Network Queueing Delay: Results
- Our solution yields an additional 50% improvement at both the body and the tail of the distribution.
Switch Queueing Delay: Microbenchmark
- Setup: A1, B1, C1, and D1 connect through the switch to E1 and E2.
- E1 queries A1 and B1 in parallel for small flows.
- E2 queries C1 and D1 for large flows to congest the access link.
Switch Queueing Delay: Results
- With QoS support on the switch enabled to recognize our tags, all small flows enjoy low latency, with an order-of-magnitude improvement at both the 99th and 99.9th percentiles.
- The average throughput loss for large flows is less than 3%.
Conclusion
- We design a host-centric solution that extends the classic shortest-remaining-time-first scheduling policy from the virtualization layer, through the host network stack, to the network switches.