Virtual Machine Synchronization for High Availability Clusters
Yoshiaki Tamura, Koji Sato, Seiji Kihara, Satoshi Moriai
NTT Cyber Space Labs.
2007/4/17
Consolidating servers using VM
- Internet services are growing
- Costs for running servers aren't small
- Consolidating servers using VM is popular
- However, more services may fail in a consolidated environment
- High availability is needed!
Copyright 2007 Nippon Telegraph and Telephone Corporation
Our goal
- A new high availability clustering independent of applications or OS
- Keep running transparently through hardware failure
- High Availability Clustering using Xen
What needs to be done?
- Virtual Machine Synchronization: Primary node and Secondary node must be identical
- Detection of failure
- Failover mechanism
- Extension of current techniques
[Diagram: Primary and Secondary nodes, each running Apps / Guest OS / VMM / Hardware, connected by network and shared SAN; on hardware failure, VM synchronization enables failover from Primary to Secondary]
How to synchronize VMs?
1. Pause Primary, and sync with Secondary
2. Resume Primary after sync
[Diagram: Primary VM runs for t_interval, pauses for t_sync to synchronize with the Secondary VM, then resumes]
- Need to make the overhead of sync smaller
  - Make sync time shorter: only transfer updated data
  - Sync VMs less often
- Secondary must be able to continue transparently
  - Sync VMs before sending or receiving events
  - Events: storage, network, timer, console
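The pause/transfer/resume cycle above can be sketched as a toy simulation (hypothetical names; the real system operates on Xen domains and page frames, not Python dicts). The primary logs dirtied pages and, just before an externally visible event, pauses, copies only the dirty pages to the secondary, and resumes:

```python
# Toy model of event-triggered VM synchronization: only pages dirtied
# since the last sync are transferred, keeping t_sync short.

class PrimaryVM:
    def __init__(self, num_pages):
        self.pages = {i: 0 for i in range(num_pages)}
        self.dirty = set()          # dirty-page log since the last sync
        self.paused = False

    def write(self, page, value):   # guest workload dirties memory
        self.pages[page] = value
        self.dirty.add(page)

    def sync(self, secondary):      # 1. pause, transfer dirty pages; 2. resume
        self.paused = True
        for page in self.dirty:
            secondary.pages[page] = self.pages[page]
        self.dirty.clear()
        self.paused = False

class SecondaryVM:
    def __init__(self, num_pages):
        self.pages = {i: 0 for i in range(num_pages)}

primary = PrimaryVM(8)
secondary = SecondaryVM(8)
primary.write(3, 42)
primary.write(5, 7)
primary.sync(secondary)             # sync just before an external event
assert secondary.pages == primary.pages
```

After each sync the secondary is an exact copy of the primary, so it can take over transparently at any event boundary.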
Sync VMs on every event
- Types of events in a server environment: timer, network, and storage are the majority
- Network and file I/O performance may degrade
- Comparison of performance between no-sync and sync on every event (Netperf, IOzone):

                                             No-sync   Sync on event
    Network [Mbps]                             93.39           22.81   (76% down)
    File I/O: O_SYNC write [MB/s]               5.67            0.58   (90% down)
    File I/O: buffered write + fsync [MB/s]    53.95           16.54   (69% down)

- Sync protocol needs to be improved: keep the degradation to 50% at most
Types of events to sync VM
[Diagram: a guest VM surrounded by its event sources]
- Network
- Timer
- Storage
- Console
Sync on events from VM to storage
1. Read/write request from Primary VM (state V_i) to storage (state S_j)
2. Sync VM and event
3. Update Secondary VM (V_i-1 -> V_i)
4. Sync completed
5. Resume read/write (storage moves to S_j+1)
6. Reply (Primary moves to V_i+1)
- Resume point: just before operating storage
- V_i: VM's state, S_j: storage's state
- Secondary may redo the same operation as Primary
- Secondary will receive the same reply as Primary
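The storage path above can be caricatured in a few lines (hypothetical names, a Python stand-in for the VMM logic). The key point is that the checkpoint is taken before the write reaches the disk, so after a failover the secondary can redo the same sector write and observe the same reply:

```python
# Toy model of "sync before operating storage": the secondary's checkpoint
# includes the pending I/O, and sector writes are idempotent, so redoing
# the operation after failover is safe.

storage = [None] * 4                 # shared SAN, addressed by sector

def sync(primary_vm):                # steps 2-4: checkpoint VM + event
    return dict(primary_vm)          # secondary now holds the pending I/O

def do_write(vm):                    # step 5: resume the storage operation
    sector, data = vm["pending_io"]
    storage[sector] = data
    return "done"                    # step 6: reply delivered to the VM

primary = {"pending_io": (2, "xyz")}
secondary = sync(primary)            # checkpoint taken before the write
do_write(primary)

# Crash after the write but before the reply: the secondary resumes just
# before the storage operation, redoes it, and gets the same reply.
assert do_write(secondary) == "done"
assert storage[2] == "xyz"
```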
Sync on events from network to VM
1. Client (state N_j) starts transfer to Primary VM (state V_i)
2. Sync VM and event
3. Update Secondary VM (V_i-1 -> V_i)
4. Sync completed
5. Resume transfer (client moves to N_j+1)
6. Reply (Primary moves to V_i+1)
- Resume point: just before receiving the data
- V_i: VM's state, N_j: client's state
- Secondary may resend the same data the Primary has already sent
  - Reliable protocols (e.g. TCP) can handle this by themselves
  - Apps should handle it in the case of unreliable protocols (e.g. UDP)
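Why a reliable protocol absorbs the duplicate send can be shown with a minimal TCP-like sketch (hypothetical names, not the slides' implementation): the peer tracks the next expected byte offset and silently discards data it has already delivered, so a resend from the Secondary after failover is harmless:

```python
# Toy model of duplicate suppression in a reliable, sequence-numbered
# protocol: data resent by the Secondary after failover is dropped by
# the peer because its sequence number is below the expected offset.

class Client:
    def __init__(self):
        self.buf = b""
        self.expected = 0            # next byte offset we will accept

    def receive(self, seq, data):
        if seq < self.expected:      # duplicate of already-delivered data
            return
        self.buf += data
        self.expected = seq + len(data)

client = Client()
client.receive(0, b"hello ")         # sent by the Primary before failover
client.receive(0, b"hello ")         # resent by the Secondary: dropped
client.receive(6, b"world")
assert client.buf == b"hello world"
```

With an unreliable protocol such as UDP there is no sequence tracking at this layer, which is why the slide leaves deduplication to the application.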
Prototype
- Implemented the protocol using Xen's live migration (2006/12)
1. Trap target events and pause Primary DomU
2. Transfer dirtied pages and VCPU context of Primary DomU
3. Overwrite Secondary DomU with the updates
[Diagram: Primary and Secondary nodes each run Dom0 and DomU on Xen; front-end/back-end drivers communicate over the event channel; updates are transferred over the network; storage is shared via SAN]
Implementation
- xen/common/event_channel.c
  - Hooks at evtchn_send()
  - Device -> Domain:
    - Calls domain_pause_by_systemcontroller(rd)
    - Notifies the user-land process to start transfer via VIRQ (evtchn_set_pending)
  - Domain -> Device:
    - Sets the port number of the user-land process's event channel
    - Calls domain_pause_for_debugger()
    - The user-land process sends the event to the device after the transfer
- tools/libxc/xc_linux_save.c
  - Waits for a signal from the user process
  - Sends dirtied pages and VCPU context to the Secondary
  - Continuously waits and sends until receiving the signal to finalize
- tools/libxc/xc_linux_restore.c
  - Continuously receives dirtied pages and VCPU context, and updates the domain
  - Starts building the domain after receiving notification from the Primary
  - Known issue: fails pinning page tables
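The Device -> Domain path above can be caricatured as follows (a Python stand-in for the C hook in xen/common/event_channel.c; all names here are illustrative, not the actual Xen code). The hooked send pauses the domain, triggers the user-land transfer, and only then lets the event reach the guest:

```python
# Caricature of the evtchn_send() hook: an event bound for the domain is
# held back until the domain is paused and the dirty-state transfer to the
# Secondary has completed.

log = []

def transfer_to_secondary(domain):
    log.append("transfer")           # user-land process sends dirty pages

def evtchn_send_hooked(domain, event):
    log.append("pause")              # domain_pause_by_systemcontroller(rd)
    transfer_to_secondary(domain)    # kicked via VIRQ in the real code
    log.append("resume")
    log.append(f"deliver:{event}")   # event reaches the domain only now

evtchn_send_hooked("domU", "net-rx")
assert log == ["pause", "transfer", "resume", "deliver:net-rx"]
```

The ordering is the whole point: because the transfer finishes before delivery, the Secondary's checkpoint always precedes any externally visible effect of the event.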
Evaluation
- Evaluation items
  - Performance of the Primary VM (network and file I/O)
  - Number of events to sync
  - Average time to transfer dirtied pages
- Test machines
  - Hardware spec
    - CPU: Intel Xeon 3GHz x 2
    - Memory: 4GB
    - Network: Gigabit Ethernet
    - SAN: FC disk
  - VM spec
    - VMM: Xen unstable (2006/12)
    - Guest OS: Debian Etch
    - Memory: 1GB
Performance of Primary VM
[Charts: throughput of the Primary VM under sync on every event vs. the proposed protocol, for Network [Mbps], O_SYNC write [MByte/s], and buffered write + fsync [MByte/s]; improvements of 39%, 57%, and 84%]
- Performance improved by applying the proposed protocol
- Degradation of buffered write + fsync compared to no-sync was 44%
Num of events, average time to transfer
[Charts: number of events to sync and average time to transfer dirtied pages, both relative to sync on every event, for Network, O_SYNC write, and buffered write + fsync]
- Number of events to sync reduced 49% - 96%
- Average time to transfer dirtied pages multiplied by 2 - 18
  - Optimize transfer function, compress pages?
Related Work
- Hypervisor-based fault tolerance: T. C. Bressoud et al., ACM TOCS '96
  - Protocols to implement a fault-tolerant system using a VMM
- SecondSite: B. Cully et al., HotDep '06
  - Continuously checkpoints a domain to persistent storage
- ExtraVirt: D. Lucchetti et al., SOSP '05
  - Runs multiple replicas on a single machine to recover from processor faults
Conclusion
- VM synchronization for high availability clusters
  - Applications and OS can continue without modifications
  - Proposed protocol makes the overhead smaller
- Implemented the prototype using Xen
  - Showed the effectiveness of the protocol
  - Performance improved 39% - 84%
  - Number of events to sync reduced 49% - 96%
Future work
- Function for Xen to stop domains instantly and resume on the Secondary
  - Currently using pause to stop domains
- Improve the performance of the Primary VM
  - Optimize transfer function
- Functions to implement for practical use
  - Detection of failure
  - Failover mechanism