COLO: COarse-grain LOck-stepping Virtual Machine for Non-stop Service
Eddie Dong, Yunhong Jiang
Legal Disclaimer
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.
Intel may make changes to specifications and product descriptions at any time, without notice. All products, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice. Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. *Other names and brands may be claimed as the property of others.
Copyright © 2012 Intel Corporation.
Agenda
- Background
- COarse-grain LOck-stepping
- Performance Optimization
- Evaluation
- Summary
Non-Stop Service with VM Replication
A typical non-stop service requires:
- Expensive hardware for redundancy
- Extensive software customization
VM replication: a cheap, application-agnostic solution
Existing VM Replication Approaches
Replication per instruction: lock-stepping
- Execute in parallel for deterministic instructions
- Lock and step for non-deterministic instructions
Replication per epoch: continuous checkpoint
- Secondary VM is synchronized with the primary VM once per epoch
- Output is buffered within an epoch
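The per-epoch scheme above can be sketched as a toy loop. This is a minimal illustration of Remus-style continuous checkpointing, not the real Xen/Remus API: the `VM` class and `epoch()` function are made-up stand-ins whose only purpose is to show the required ordering "run, buffer output, sync secondary, then release output".

```python
# Toy sketch of Remus-style continuous checkpointing with output buffering.
# VM and epoch() are illustrative stand-ins, not the real Xen/Remus API.

class VM:
    def __init__(self):
        self.state = 0      # stands in for guest memory
        self.outbox = []    # packets produced since the last drain

    def run_for(self, steps):
        for _ in range(steps):
            self.state += 1
            self.outbox.append(f"pkt-{self.state}")

    def drain_output(self):
        out, self.outbox = self.outbox, []
        return out

def epoch(primary, secondary, wire):
    primary.run_for(3)                 # execute the guest for one epoch
    held = primary.drain_output()      # buffer output; do NOT release yet
    secondary.state = primary.state    # "checkpoint": copy dirty state over
    wire.extend(held)                  # release only after the sync
    return held

wire = []
primary, secondary = VM(), VM()
epoch(primary, secondary, wire)
# The client only ever sees packets the secondary can reproduce after
# failover, at the price of delaying every packet until the checkpoint.
```

The buffering step is exactly where the extra network latency of this approach comes from: no packet leaves before its epoch's checkpoint completes.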
Problems
Lock-stepping:
- Excessive replication overhead: memory access in an MP-guest is non-deterministic
Continuous checkpoint:
- Extra network latency
- Excessive VM checkpoint overhead
Why COarse-grain LOck-stepping (COLO)?
Full VM state replication is an overly strong condition:
- Why do we care about the VM state? The client cares only about the responses
- Can control fail over without precise VM state replication?
Coarse-grain lock-stepping VMs:
- The secondary VM is a valid replica as long as it could have generated the same responses as the primary so far
- Able to fail over without stopping the service
Non-stop service depends on server responses, not internal machine state!
How COLO Works
Response model for a client/server system:
- Each response packet is determined by the sequence of client requests and by the execution results of the non-deterministic instructions executed so far
- Each response packet produced by this model is a semantic response
- Failover at the k-th packet succeeds if the secondary can generate a response stream whose first k packets match C, the packet series the client has already received
Architecture of COLO
[Figure: COLO architecture diagram]
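The failover condition above drives COLO's central loop: compare the two VMs' outbound packet streams, release matching packets immediately, and checkpoint only on divergence. The sketch below is a toy illustration of that comparison, not the real implementation; the function and callback names are invented.

```python
# Toy sketch of COLO's per-packet output comparison (illustrative only).
# Both VMs run in parallel; their outbound packets are compared pairwise.
# Matching packets are released at once, with no buffering delay; the
# first mismatch triggers an on-demand checkpoint of the secondary.

def colo_compare(primary_pkts, secondary_pkts, release, checkpoint):
    for p, s in zip(primary_pkts, secondary_pkts):
        if p == s:
            release(p)       # responses agree: secondary is a valid replica
        else:
            checkpoint()     # divergence: re-synchronize the secondary
            release(p)       # primary's response is sent after the sync
            return

released, checkpoints = [], []
colo_compare(["a", "b", "x"], ["a", "b", "y"],
             released.append, lambda: checkpoints.append("sync"))
```

The key contrast with continuous checkpointing: packets are held only as long as the comparison takes, and checkpoints happen on demand rather than every epoch.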
Why Better?
Compared with continuous VM checkpoint:
- No buffering-introduced latency
- Lower checkpoint frequency: on-demand vs. periodic
Compared with lock-stepping:
- Eliminates the excessive overhead of non-deterministic instruction execution caused by MP-guest memory access
Performance Challenges
Frequency of checkpoint:
- Highly dependent on the output similarity, i.e. response similarity
- Key focus is TCP packets!
Cost of checkpoint:
- Xen/Remus uses passive checkpointing: the secondary VM is not resumed until failover (slow path)
- COLO implements active checkpointing: the secondary VM resumes frequently
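The passive/active distinction can be made concrete with a small sketch. The `Standby` class and both functions are made up for illustration; the point is only the end state of the secondary after each style of checkpoint.

```python
# Sketch contrasting passive vs. active checkpointing (names are
# illustrative, not the real Xen API). Passive: the secondary only absorbs
# state and stays paused, so failover must first bring it up (slow path).
# Active: the secondary resumes after every checkpoint, so it is a hot
# standby that can take over instantly.

class Standby:
    def __init__(self):
        self.state = None
        self.running = False

    def apply(self, state):
        self.state = state

def passive_checkpoint(standby, state):
    standby.apply(state)         # secondary stays paused until failover

def active_checkpoint(standby, state):
    standby.running = False      # brief pause while state is applied
    standby.apply(state)
    standby.running = True       # resumed: executes alongside the primary

passive, active = Standby(), Standby()
passive_checkpoint(passive, "epoch-1")
active_checkpoint(active, "epoch-1")
```

Active checkpointing is what makes COLO's output comparison possible at all, since the secondary must actually run to produce responses; the price is a suspend/resume cycle on every checkpoint, which the following optimizations attack.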
Improving Response Similarity
Minor modifications to the guest TCP/IP stack:
- Coarse-grain timestamp
- Highly deterministic ACK mechanism
- Coarse-grain notification window size
- Per-connection comparison
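As one example of these modifications, the coarse-grain timestamp idea can be sketched as follows. If both replicas round the clock value placed in TCP timestamp fields down to a coarse granularity, small scheduling skew between primary and secondary no longer makes their packet headers differ. The 100 ms granularity below is an assumption chosen for illustration, not a value taken from COLO.

```python
# Illustrative sketch of "coarse-grain time stamp": round the clock value
# used in TCP timestamps down to a coarse granularity so that slightly
# skewed clocks on the two replicas still produce identical header fields.
# The 100 ms granularity is an assumption, not COLO's actual setting.

def coarse_timestamp(clock_ms, granularity_ms=100):
    return clock_ms - (clock_ms % granularity_ms)

# Clocks read 54 ms apart still yield the same timestamp field:
primary_ts = coarse_timestamp(12_345)
secondary_ts = coarse_timestamp(12_399)
```

The other items in the list follow the same principle: remove incidental non-determinism from packet contents so that semantically identical responses also compare byte-identical.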
Similarities after Optimization
[Charts: packet counts and run durations for a web server (1-32 threads, Web Bench run in client) and for an FTP server (PUT and GET)]
For more complete information about performance and benchmark results, visit Performance Test Disclosure
Reducing the Cost of Active Checkpointing
- Lazy device state update
  - Lazy network interface up/down
  - Lazy event channel up/down
- Fast-path communication
Checkpoint Cost with Optimizations
[Chart: time spent per checkpoint (suspend, resume, netif up/down, memory transmit) for Baseline; Lazy Netif (leave net up); Lazy Event Channel (leave event channel up); and Efficient Comm. (replace XenStore access with event channel)]
Final cost: 74 ms/checkpoint (1/3 on page transmission, 2/3 on suspend/resume)
Hardware Configuration
- Intel Core i7 platform: 2.8 GHz quad-core processor, 2 GB RAM
- Intel 82576 1 Gbps NIC x 2 (internal and external)
Software
- Xen 4.1; Domain 0: RHEL5U5
- Guest: 32-bit BusyBox 1.20.0, Linux kernel 2.6.32, 256 MB RAM, ramdisk for storage
Bandwidth of NetPerf
[Charts: TCP and UDP bandwidth (Mb/s) vs. message size (54 B, 1500 B, 64 KB) for Native, Remus-20ms, Remus-40ms, and COLO]
FTP Server
[Chart: PUT and GET completion time (s, log scale) for Native, Remus-20ms, Remus-40ms, and COLO]
Web Server - Concurrency
[Chart: throughput (Mbps, log scale) vs. number of threads (1-32) for Native, Remus-20ms, Remus-40ms, and COLO; Web Bench run in client]
Web Server - Throughput
[Chart: throughput (responses/s) vs. request rate (100-1500 requests/s) for Native, Remus-20ms, Remus-40ms, and COLO; httperf run in client]
Latency in Netperf/Ping
[Chart: average latency (ms) for Netperf-TCP-RR, Netperf-UDP-RR, and Ping under Native, Remus-20ms, Remus-40ms, and COLO; annotated values: 0.28, 0.40, 0.28, 0.4, 0.38, 0.55 ms]
Web Server - Latency
[Chart: response latency (ms) vs. request rate (100-1500 requests/s) for Native, Remus-20ms, Remus-40ms, and COLO; httperf run in client]
Summary
COLO is an ideal application-agnostic solution for non-stop service:
- Web server: 67% of native performance
- CPU, memory, and netperf: near-native performance
Next steps:
- Merge into Xen
- More optimizations