The Journal of Systems and Software 85 (2012)

Contents lists available at SciVerse ScienceDirect: The Journal of Systems and Software

Optimizing virtual machines using hybrid virtualization

Qian Lin a, Zhengwei Qi a, Jiewei Wu a, Yaozu Dong b, Haibing Guan a
a Shanghai Key Laboratory of Scalable Computing and Systems, Shanghai Jiao Tong University, Shanghai, PR China
b Intel Open Source Technology Center, PR China

Article history: Received 10 October 2011; received in revised form 31 May 2012; accepted 31 May 2012; available online 9 June 2012.

Keywords: Hybrid virtualization; Hardware-assisted virtualization; Paravirtualization

Abstract

Minimizing virtualization overhead and improving the reliability of virtual machines are challenging when establishing a virtual machine cluster. Paravirtualization and hardware-assisted virtualization are two mainstream solutions for modern system virtualization. Hardware-assisted virtualization is superior in CPU and memory virtualization and is becoming the leading solution, yet paravirtualization is still valuable in some aspects as it is capable of shortening the disposal path of I/O virtualization. Thus we propose hybrid virtualization, which runs the paravirtualized guest in the hardware-assisted virtual machine container to take advantage of both. Experiment results indicate that our hybrid solution outperforms the original paravirtualization by nearly 30% in memory intensive tests and 50% in microbenchmarks. Meanwhile, compared with the original hardware-assisted virtual machine, the hybrid guest achieves over 16% improvement in I/O intensive workloads. (c) 2012 Elsevier Inc. All rights reserved.

1. Introduction

System virtualization is becoming ubiquitous in contemporary datacenters. Consolidating physical servers by building virtual machine clusters is universally adopted to maximize the utilization of hardware resources for computing. Two fundamental but challenging requirements are to minimize virtualization overhead (Mergen et al., 2006) and to guarantee reliability when building virtualized infrastructure. Therefore, the low level design of the VM architecture is of great significance.

The conventional x86 architecture is incapable of classical trap-and-emulate virtualization, which made paravirtualization the optimal virtualization strategy in the past (Barham et al., 2003; Adams and Agesen, 2006). Recently, hardware-assisted virtualization on the x86 architecture has become a competitive alternative. Adams and Agesen (2006) compared the performance between a software-only VMM and a hardware-assisted VMM, and the statistics showed that the HVM suffered from much higher overhead than the PVM owing to frequent context switching, which had to perform an extra host/guest round trip in the early HVM solution. However, the latest hardware-assisted virtualization improvements mitigate such overhead. Hardware-assisted paging (Neiger et al., 2006) allows the hardware to handle guest MMU operations and translate guest physical addresses to real machine addresses dynamically, accelerating memory relevant operations and improving the overall performance of the HVM.

Although hardware-assisted virtualization performs well with CPU intensive workloads, it manifests low efficiency when processing I/O events. Our experiment shows that the PVM achieves up to 20% lower CPU utilization than the HVM with a 10 Gbps network workload.
The interrupt controller of the HVM originates in the native environment with fast memory-mapped I/O access but is suboptimal in the virtual environment due to the requirement of trap-and-emulate. Frequent interrupts lead to frequent context switches and a high round trip penalty, particularly for multiple virtual machines (Menon et al., 2005). Consequently, hardware-assisted virtualization is superior in CPU and memory virtualization, while software-only virtualization owns optimized features for I/O virtualization. In practice, the performance issue is very workload-dependent because most real world applications are a mix of CPU and I/O intensive tasks. Therefore, hybrid virtualization techniques (Adams and Agesen, 2006) become promising. Nevertheless, the previous Hybrid VMM prototype (Adams and Agesen, 2006) leveraged guest behavior-driven heuristics to improve performance, but its performance gain heavily depended on the prediction accuracy and became marginal for modern workloads.

The contribution of this paper is a practical one. We propose a novel hybrid solution which combines the superior features of PVM and HVM, and we implement the prototype on the Xen platform. The principal idea of our hybrid virtualization is to run the paravirtualized guest in the HVM container to reach maximum optimization. The Hybrid VM primarily features lower MMU operation latency, benefiting from the hardware-assisted paging technique, and lower interrupt disposal overhead, benefiting from the paravirtualized event channel.
Besides, the original hardware-assisted virtualization environment suffers from the issue of timer synchronization, which makes it hard for different timer resources to keep their relative timing pace and to guarantee the timing correctness of VMs. We propose a feasible solution within the hybrid virtualization to solve this problem.

The rest of this paper is organized as follows. Section 2 introduces the background of virtualization as well as the software and hardware approaches with their advantages and disadvantages. Section 3 presents the hybrid virtualization architecture and design details. Section 4 analyzes in depth the key factor affecting the efficiency of system calls in the guest OS, which can be treated as a performance indicator of the VM. Section 5 specifically discusses the issue of timer synchronization under the virtualized environment and the solution adopted by the hybrid virtualization. Section 6 analyzes the performance evaluation to demonstrate the performance improvements of the hybrid virtualization. Section 7 summarizes related work and Section 8 concludes.

2. Pros and cons in different virtualization mechanisms

With the promotion of virtualization technology, software-only and hardware-assisted virtualization approaches exhibit different strengths in various fields. In this section, we first introduce the background of the two mainstream virtualization techniques, and then present the details of the advantages and disadvantages between them.

2.1. Paravirtualization

Xen (Barham et al., 2003; Clark et al., 2004) is famous for supporting paravirtualization (Whitaker et al., 2002). The Xen hypervisor is located between the physical hardware layer and the guest OS layer, as shown in Fig. 1 (Liu et al., 2006). The Xen hypervisor runs at the lowest level and owns the most privileged access to hardware. Among the various VMs, Domain0 plays an administrator role and provides service for the DomainU VMs. Domain0 also extends part of the functionalities of the hypervisor. For example, Domain0 hosts back-end device drivers to manage the device multi-access from VMs, which utilize front-end device drivers and the device channel to communicate with the back-end foundation (Xen.org, 2008).

Fig. 1. Xen architecture. The Xen hypervisor manages three types of VM. Domain0 plays an administrator role and supplies service for DomainU, involving PVM and HVM. The front-end device drivers in DomainU communicate with the back-end drivers in Domain0 through the device channel.

The PVM guest kernel requires purposive modifications to adapt to efficient software-only virtualization (Barham et al., 2003). Generally, the x86 CPU privilege level is distinguished by different rings, where Ring0 is the most privileged and Ring3 the least. As the hypervisor requires a higher privilege level than the VMs, the PVM guest kernel yields Ring0 to the hypervisor. Since paravirtualization does not change the application interfaces, user software can run in the Xen environment without any modification. Besides, paravirtualization uses the direct page table (DPT) (Barham et al., 2003) as its memory virtualization strategy. In order to avoid a page table switch at the time of hypervisor/guest boundary crossing, DPT modifies the guest page table to be suitable for hardware processor usage as well as guest OS access. By modifying the guest kernel, DPT partitions the address space between the guest OS and the hypervisor, utilizing the segment limit check to protect the hypervisor from guest access. It reserves a certain area of address space from each guest kernel to be dedicated to hypervisor usage.
Consequently, each PVM shares its page table with the hypervisor so that the hypervisor can paravirtualize the guest paging mechanism.

2.2. Hardware-assisted virtualization

The hardware-assisted virtualization technique simplifies the design of the virtualization management layer, i.e. the hypervisor, and enhances general performance with the help of processor virtualization. The conventional virtualization technique of dynamic binary translation was a compromise solution for system virtualization without guest OS modification. The critical issue of dynamic binary translation is its low performance efficiency and design complexity due to the incapability of classical trap-and-emulate virtualization with the previous generation of the x86 architecture. Nevertheless, the modern x86 architecture with the hardware-assisted virtualization extension has fixed the trap-and-emulate virtualization hole at the architecture level, which greatly reduces the design complexity of the hypervisor. Hardware-assisted virtualization thus becomes an alternative and improved solution replacing dynamic binary translation.
Furthermore, unlike paravirtualization, hardware-assisted virtualization requires no paravirtualized modifications to the guest OS kernel to guarantee the trap-and-emulate efficiency, because most of the VM state transitions are handled by hardware. However, unmodified guest OSes using processor virtualization alone require frequent VM traps and incur even higher overhead, since a VM exit, the state transition from VM to hypervisor, is more costly than a fast system call (e.g., implemented as a hypercall in Xen (Barham et al., 2003)). More than 6000 CPU cycles are required to process each VM exit and its return (Barham et al., 2003). So in the old days paravirtualization dominated the performance against hardware-assisted virtualization (Adams and Agesen, 2006). But as advanced hardware-assisted virtualization technology continues to be developed, such as the enhancement of hardware-assisted paging (Neiger et al., 2006), the overall performance of the HVM has caught up with the PVM and currently exceeds it.

Prior to hardware-assisted paging, software-only virtualization of the paging mechanism, such as the shadow page table (SPT), was widely adopted in the HVM. SPT entitles the guest OS to maintain its page table independently, i.e., the mapping from guest virtual addresses to guest physical addresses. The guest physical address is not really physical, but pseudo-physical. The hypervisor maintains another mapping from guest physical addresses to host physical addresses, the real machine addresses through which physical memory is accessed. To accelerate memory accesses of the VM, the SPT is adopted to store the direct mapping from guest virtual addresses to host physical addresses. Alternatively, hardware-assisted paging offers an additional dimension of addressing and translates guest physical addresses to machine addresses. Whether with SPT or hardware-assisted paging, the HVM guest OS is enabled to use its original paging mechanism without cooperating with the hypervisor. But hardware-assisted paging helps to avoid VM exits within page table update operations, which are the major source of virtualization overhead when SPT is in use.
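To make the two addressing schemes above concrete, the following user-space C sketch simulates the two mapping layers described in this subsection: the guest page table (guest virtual to guest physical) and the hypervisor's translation (guest physical to host physical). The table sizes, page size, and function names are illustrative assumptions, not Xen data structures; the sketch only shows why a shadow page table caches the combined mapping while nested paging walks both layers.

/* Illustrative sketch of the two translation layers described above.
 * All names and table sizes are assumptions for demonstration only. */
#include <stdio.h>
#include <stdint.h>

#define PAGES     16
#define PAGE_SIZE 4096u

static uint32_t guest_pt[PAGES];   /* guest virtual page  -> guest physical page */
static uint32_t p2m[PAGES];        /* guest physical page -> host physical page  */
static uint32_t shadow_pt[PAGES];  /* guest virtual page  -> host physical page  */

/* Nested-paging style lookup: walk both layers on every translation. */
static uint32_t translate_nested(uint32_t gva)
{
    uint32_t gpa = guest_pt[gva / PAGE_SIZE] * PAGE_SIZE + gva % PAGE_SIZE;
    return p2m[gpa / PAGE_SIZE] * PAGE_SIZE + gpa % PAGE_SIZE;
}

/* Shadow-paging style lookup: the hypervisor pre-combines the two layers,
 * so a single lookup suffices, but every guest page table update must be
 * intercepted (a VM exit) to keep shadow_pt consistent. */
static uint32_t translate_shadow(uint32_t gva)
{
    return shadow_pt[gva / PAGE_SIZE] * PAGE_SIZE + gva % PAGE_SIZE;
}

int main(void)
{
    for (uint32_t i = 0; i < PAGES; i++) {
        guest_pt[i]  = (i + 3) % PAGES;      /* arbitrary guest mapping */
        p2m[i]       = (i * 7 + 1) % PAGES;  /* arbitrary host mapping  */
        shadow_pt[i] = p2m[guest_pt[i]];     /* combined, kept in sync  */
    }
    uint32_t gva = 5 * PAGE_SIZE + 42;
    printf("nested: gva 0x%x -> hpa 0x%x\n", gva, translate_nested(gva));
    printf("shadow: gva 0x%x -> hpa 0x%x\n", gva, translate_shadow(gva));
    return 0;
}

Both lookups return the same host address; the trade-off is where the cost lands: nested paging pays a longer hardware walk per miss, while shadow paging pays a hypervisor trap on every guest page table update.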
2.3. Comparison between paravirtualization and hardware-assisted virtualization

Table 1 summarizes the main technical advances and defects of paravirtualization and hardware-assisted virtualization. The comparison falls into three categories: CPU, memory, and I/O virtualization.

Table 1. Comparison summary between paravirtualization and hardware-assisted virtualization.

CPU virtualization
  Paravirtualization, pros: Hypercall rather than hardware trapping. No time injection.
  Paravirtualization, cons: Ring compression. Kernel modification.
  Hardware-assisted virtualization, pros: No modification for the guest OS. Facilitating hypervisor design.
  Hardware-assisted virtualization, cons: Boundary switching between hypervisor and VM consumes remarkable extra CPU cycles.

Memory virtualization
  Paravirtualization, pros: Less translation. Memory share.
  Paravirtualization, cons: Page table updating requires hypervisor intervention. Causing considerable TLB flushes.
  Hardware-assisted virtualization, pros: Guest is able to maintain its unmodified paging mechanism. Acceleration of hardware-assisted paging.
  Hardware-assisted virtualization, cons: Software-only shadow page table introduces much overhead.

I/O virtualization
  Paravirtualization, pros: Performance boost. Optimal for CPU utilization.
  Paravirtualization, cons: Requiring specialized drivers in the guest. Isolation and stability depend on implementation.
  Hardware-assisted virtualization, pros: Close to native performance. Exclusive device access. Using original driver from guest OS. Good isolation.
  Hardware-assisted virtualization, cons: Scalability drawback with PCI slot limitation in the system.

CPU. Putting the guest kernel in Ring1 or Ring3 makes the PVM suffer from remarkable overheads introduced by the boundary switching within system calls and hypercalls. Hardware-assisted virtualization eliminates these overheads by putting the guest kernel back in Ring0, taking on the nature of the native OS. But its crucial weakness is the expensive VM exit overhead, even though it has been alleviated by hardware improvements.

Memory. DPT plays an essential role in the virtualization of the paging mechanism, making it the unique memory virtualization strategy in paravirtualization. SPT holds the great advantage of requiring no modification of the guest paging mechanism, yet its performance fails to compete with DPT. However, with the emergence of hardware-assisted paging, its efficiency and compatibility outweigh those of both DPT and SPT. It accelerates HVM paging by simplifying the MMU address translation and reducing the number of VM exits, especially with CPU intensive workloads.

I/O. The I/O virtualization solution can be flexible in both PVM and HVM, as virtual device sharing and direct I/O are both available. The critical difference between PVM and HVM regarding asynchronous events lies in the interrupt handling mechanism. Using the event channel and virtual IRQ strategies, the PVM can save CPU utilization compared with the HVM, which mainly adopts the native APIC.

In brief, it is sensible to merge the strengths of both paravirtualization and hardware-assisted virtualization for performance maximization. This is the primary motivation of our hybrid virtualization approach. Besides, reliability issues such as the legacy timer synchronization problem are also meant to be solved by the hybrid architecture.

3. Hybrid virtualization design

3.1. Overview

The performance issue with the x86_64 PVM derives from the compromised architecture in which the kernel space and user space reside in the same privilege ring (Ring3), yet use different page directories to maintain the space isolation. Consequently, when boundary switching occurs between kernel mode and user mode, the necessary TLB flushes cause overhead, and much more system call overhead is also introduced. Thus, the primary motivation of hybrid virtualization is to eliminate these overheads by locating the guest kernel back in Ring0. There are two probable architecture types of hybrid virtualization, termed hybrid PVM and hybrid HVM.
Fig. 2. Hybrid virtualization architecture. (a) Hybrid PVM starts from paravirtualization. It puts the paravirtualized Linux kernel back in Ring0 and introduces hardware-assisted paging (HAP) support. (b) Hybrid HVM starts from hardware-assisted virtualization. It reuses most hardware-assisted virtualization features and imports several paravirtualized components to obtain performance improvement.

Hybrid PVM

The hybrid PVM constructs the hybrid architecture based on the PVM, as shown in Fig. 2(a). In order to eliminate the overheads due to the boundary switching between kernel space and user space, the PVM guest kernel should be moved back to Ring0. Meanwhile, some hardware-assisted virtualization features such as hardware-assisted paging should be introduced to improve the PVM performance via hardware-assisted virtualization technology. By the nature of the paravirtualization mechanism, the hybrid PVM guest does not need any QEMU device model and the system booting can be very quick. Nevertheless, this also contributes to its disadvantage that the hybrid PVM guest OS is incapable of native booting, i.e., the hypervisor has to furnish a customized boot loader for the PVM guest.

Hybrid HVM

The alternative implementation of the hybrid architecture is hybrid HVM, as shown in Fig. 2(b), which enhances the current HVM with several paravirtualized components. Hybrid HVM leverages native code and extends it to its superset. As an incremental approach, hybrid HVM can reuse most hardware-assisted virtualization features and requires less modification when compared with the hybrid PVM. Recent Linux kernel versions merge optional paravirtualization support into the mainline, termed pv_ops. The Linux kernel can apply pv_ops to self-patch its binary code, converting the sensitive instructions into non-sensitive ones for the hypervisor to do the paravirtualization work. Meanwhile, pv_ops is similar to a kernel hook, providing a paravirtualized interface to the hypervisor and facilitating the paravirtualization behavior changes. The hybrid HVM imports some Xen paravirtualization APIs and utilizes pv_ops to build up the hybrid virtualization guest.

Although there exist many differences between the hybrid PVM and hybrid HVM approaches, their ultimate goals share the same intrinsic property: adopt the superior features of both the paravirtualization and hardware-assisted virtualization solutions in the HVM container.

Xen modification

Hybrid PVM and hybrid HVM are both feasible for the implementation of hybrid virtualization and they can reach the same point eventually. But the latter is more natural and practical with the hardware-assisted virtualization inheritance and less code modification to Linux. Therefore, our hybrid extension starts from the HVM, importing a component based paravirtualization feature selection such as the paravirtualized halt, paravirtualized timer, event channel and paravirtualized drivers. Consequently, the guest with the hybrid extension can take advantage of both paravirtualization and hardware-assisted virtualization.
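The pv_ops mechanism described above is, at its core, a table of function pointers that the kernel fills in once at boot according to the environment it detects. The following user-space C sketch illustrates that idea only; the structure, field and function names are simplified assumptions and do not reproduce the actual Linux paravirt_ops definitions or the Xen hypercall interface.

/* Minimal user-space sketch of a pv_ops-style hook table. The names are
 * illustrative assumptions, not the real Linux paravirt_ops interface. */
#include <stdio.h>

struct pv_ops_sketch {
    void (*safe_halt)(void);   /* idle the CPU        */
    void (*flush_tlb)(void);   /* flush the local TLB */
};

/* "Native" implementations: what an unvirtualized kernel would do. */
static void native_safe_halt(void) { puts("native: sti; hlt"); }
static void native_flush_tlb(void) { puts("native: reload CR3"); }

/* Hypervisor-aware implementations a hybrid/PV guest would install,
 * e.g. blocking via the hypervisor instead of halting the CPU. */
static void xen_safe_halt(void) { puts("pv: SCHEDOP_block-style hypercall"); }
static void xen_flush_tlb(void) { puts("pv: MMUEXT-style flush hypercall"); }

static struct pv_ops_sketch pv_ops = {
    .safe_halt = native_safe_halt,
    .flush_tlb = native_flush_tlb,
};

/* At boot the guest detects the hypervisor and patches the table once;
 * the rest of the kernel keeps calling through pv_ops unchanged. */
static void patch_for_hypervisor(void)
{
    pv_ops.safe_halt = xen_safe_halt;
    pv_ops.flush_tlb = xen_flush_tlb;
}

int main(void)
{
    pv_ops.safe_halt();        /* runs the native path               */
    patch_for_hypervisor();    /* what a hybrid guest does at boot   */
    pv_ops.safe_halt();        /* now runs the paravirtualized path  */
    pv_ops.flush_tlb();
    return 0;
}

The real pv_ops covers many more operation groups (CPU, MMU, IRQ, time), but the patch-once-at-boot pattern is the same one the hybrid HVM guest relies on to swap in paravirtualized halt, timer and interrupt handling while keeping a single kernel binary.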
Hardware-assisted paging

Hardware-assisted paging (e.g., Intel Extended Page Table technology and AMD Nested Page Table technology) plays a vital accelerating role in hardware-assisted virtualization (Bhargava et al., 2008). Without hardware-assisted paging support, the HVM utilizes the shadow page table (Barham et al., 2003; Adams and Agesen, 2006) and shadow TLBs for the accuracy of guest memory mapping and accessing. The crucial defect of the shadow series strategy is that each MMU address translation needs to be trapped into the hypervisor (such behavior is called a VM exit) and travel another long execution path to fetch the real address. During such a procedure, an inevitable and considerable round trip overhead is introduced, which takes more than 10 times the CPU cycles of a native MMU address translation. But with hardware-assisted paging support, the shadow series can be discarded and the MMU address translation in the HVM avoids triggering a great deal of VM exits.

Interrupt disposal changes

Redundant VM exits also exist in the interrupt disposal of the HVM, yet no hardware-assisted solution can fix this currently. This can act as the bottleneck of I/O efficiency. Fortunately, paravirtualization owns a sound strategy in this situation, which offers an opportunity to import it into our hybrid solution. Event channel and QEMU device support are enabled for the hybrid guest. Each QEMU emulated I/O APIC pin is mapped to a virtual interrupt request so that one virtual interrupt request instead of an I/O APIC interrupt is delivered to the guest if the device asserts the pin. The event channel is a signaling mechanism for inter-domain communication. One domain can send a signal to another domain through the event channel, and each domain can receive signals by registering an event channel handler.
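As a rough illustration of the event-channel idea above, the sketch below simulates a per-guest pending-event bitmap and a table of handlers keyed by event port, with one port standing in for each emulated I/O APIC pin. The port numbers, data layout, and function names are invented for the example and are not the Xen event channel ABI.

/* Toy model of event-channel style interrupt delivery: the "hypervisor"
 * sets a pending bit, the guest scans the bitmap and runs its handler.
 * Names and layout are illustrative assumptions, not the Xen ABI. */
#include <stdio.h>
#include <stdint.h>

#define MAX_PORTS 64

static uint64_t pending_mask;                   /* one bit per event port    */
static void (*handlers[MAX_PORTS])(int port);   /* guest-registered handlers */

/* Guest side: bind a handler to a port (one port per emulated device pin). */
static void bind_handler(int port, void (*fn)(int)) { handlers[port] = fn; }

/* "Hypervisor" side: assert an event instead of injecting an APIC interrupt. */
static void send_event(int port) { pending_mask |= (1ull << port); }

/* Guest side: drain pending events without any trap-and-emulate round trip. */
static void dispatch_events(void)
{
    while (pending_mask) {
        int port = __builtin_ctzll(pending_mask);   /* lowest pending bit */
        pending_mask &= ~(1ull << port);
        if (handlers[port])
            handlers[port](port);
    }
}

static void nic_handler(int port) { printf("virtual IRQ on port %d: NIC rx\n", port); }

int main(void)
{
    bind_handler(3, nic_handler);   /* port 3 stands in for a NIC's I/O APIC pin */
    send_event(3);                  /* device asserts the pin                    */
    dispatch_events();              /* guest handles it via the event channel    */
    return 0;
}

The point of the design is visible even in this toy form: delivery is a shared-memory bit flip plus a guest-side scan, so the emulated APIC's trap-and-emulate path, and the VM exits it implies, never enter the picture.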
Besides, the disposal path of Message Signaled Interrupts and its extension (MSI/MSI-X), which conventionally relies on the local APIC, is also altered for optimization. The hybrid solution paravirtualizes the MSI/MSI-X handling so that MSI/MSI-X do not cause VM exits.

Implementation

The current hybrid virtualization approach supports x86 uniprocessor guests as well as x86_64 uniprocessor and symmetric multiprocessor guests with MSI/MSI-X support. A single kernel binary with hybrid feature support can run in the PVM, HVM, Hybrid VM and native environments. The user can turn on hybrid virtualization support for a DomainU by setting a special Hybrid Feature CPUID entry in the VM configuration file. As long as the Hybrid Feature CPUID is identified during the DomainU creation process, the HVMOP_enable_hybrid hypercall is triggered to invoke the hybrid capability of the hypervisor, i.e., a hybrid virtualized guest is built up. Our hybrid virtualization approach is built on the vanilla Xen hypervisor and guest Linux kernel. The modifications are added to the Xen hypervisor and the guest kernel with 267 and 1003 source lines of code, respectively. We also pack the design changes as patches covering the hypervisor part and the guest kernel part. These patches have been released to the Xen open source community.
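The CPUID-based switch described above relies on the guest recognizing the hypervisor during boot. The sketch below shows the generally known part of that handshake: x86 guests probe the hypervisor CPUID leaves starting at 0x40000000, where Xen answers with the "XenVMMXenVMM" signature. The hybrid feature check itself uses a hypothetical placeholder leaf and bit (HYBRID_FEATURE_LEAF/HYBRID_FEATURE_BIT), since the exact values used by the prototype are not given in the paper.

/* Detect the Xen hypervisor via CPUID, then probe a *hypothetical* hybrid
 * feature bit. The 0x40000000 signature leaf is standard; the hybrid
 * leaf/bit below are placeholders, not values from the paper. */
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

#define HYPERVISOR_LEAF     0x40000000u
#define HYBRID_FEATURE_LEAF 0x40000002u   /* assumption for illustration */
#define HYBRID_FEATURE_BIT  (1u << 0)     /* assumption for illustration */

static int xen_present(void)
{
    unsigned int eax, ebx, ecx, edx;
    char sig[13];

    /* Hypervisor leaves sit outside the basic CPUID range, so query the
     * leaf directly rather than going through __get_cpuid's range check. */
    __cpuid(HYPERVISOR_LEAF, eax, ebx, ecx, edx);
    memcpy(sig + 0, &ebx, 4);
    memcpy(sig + 4, &ecx, 4);
    memcpy(sig + 8, &edx, 4);
    sig[12] = '\0';
    return strcmp(sig, "XenVMMXenVMM") == 0;
}

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!xen_present()) {
        puts("not running on Xen: keep the native code paths");
        return 0;
    }
    __cpuid(HYBRID_FEATURE_LEAF, eax, ebx, ecx, edx);
    if (eax & HYBRID_FEATURE_BIT)
        puts("hybrid feature advertised: guest would enable the hybrid path");
    else
        puts("running on Xen without the hybrid feature");
    return 0;
}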
4. Efficiency of system call

The key point in paravirtualized CPU control is ring compression (also called de-privileging), putting the guest kernel at an upper, less privileged ring level. This idea furnished a boosted software-only virtualization solution in the past. Nevertheless, things have changed with the hardware-assisted virtualization improvements. Paravirtualization now suffers from several bottlenecks, one of which is the poor performance of system calls caused by the legacy ring compression and the isolation it requires. Both 32-bit and 64-bit PVM guests are subject to this issue, while the latter is more serious. Although the 64-bit architecture (e.g., x86_64) still has four privilege rings, only Ring0 and Ring3 are currently usable to separate the system kernel and user applications. Thus, in the 64-bit PVM case, the kernel space and user space have to stay in the same privilege ring (Ring3) while the host side occupies Ring0.

Fig. 3. The paths of system call. The outer line indicates the system call execution path in the 64-bit PVM whose kernel lies in Ring3. Similarly, the inner line demonstrates that in the Hybrid VM which puts the kernel in Ring0. The sequences of boundary switches are numbered in both cases.

The outer line in Fig. 3 illustrates the system call path in the 64-bit PVM. When a system call occurs, it is first trapped from the guest user space to the host, and then the hypervisor injects the system call event into the guest kernel; once the guest kernel completes the service, the execution flow jumps into the hypervisor again and finally returns to the guest user space. Such hypervisor intervention introduces the considerable overhead of those round trips within the system call. Additionally, the code path overhead should also be taken into account. The same bouncing mechanism is employed when handling exceptions such as page faults (Nakajima and Mallick, 2007), which are also first intercepted by the hypervisor even if generated purely by user processes. Meanwhile, extra TLB flushes within this procedure further the slowdown.

Normally, the transition between user and kernel space expects no TLB flush except when switching to a process with another page table, i.e., a system call does not usually need a TLB flush. But in the 64-bit PVM case, guest user space and kernel space are both in Ring3 and have to be separated from each other, so they do not share the same page table as they normally would. The hypervisor is located in the high memory region and marks all its pages as global. Generally, the principal event leading to a TLB flush is address space switching. Therefore, when the guest executes a system call, it is trapped by the hypervisor and then the hypervisor injects the system call event into the guest kernel, which requires a TLB flush due to the different page table; when execution comes back to the user space, another TLB flush is needed.

In the hybrid virtualization, the paravirtualized guest runs in the HVM container so that the guest kernel can be back in Ring0. Thus, a guest system call can avoid hypervisor intervention and curtail the code path, as demonstrated by the inner line of Fig. 3. The system call just bounces within the DomainU so that the TLB can be maintained. In the meantime, with hardware-assisted paging acceleration, the overhead of secure page table modification disappears. Hence, hybrid virtualization can show close-to-native performance with memory intensive workloads, going much beyond pure paravirtualization.

5. Timer synchronization among a high volume VM cluster

Modern computer systems use a variety of timer resources to count clock ticks. The PIT, TSC and HPET are more often than not applied for time keeping in commodity OSes such as Windows and Linux. Although these diverse timer resources may tick at different frequencies or trigger interrupts at different intervals, they all walk forward at a fixed and related pace reflecting the external time elapsed. For example, an OS may rely on either the TSC or the HPET as its timing base. A 2 GHz TSC ticks 200 million times in a 100 ms real time interval, but a 10 MHz HPET only needs to tick 1 million times to reach the same duration. Similarly, an interrupt based interval timer such as the PIT has to trigger 100 interrupts in the same period when programmed at a 1 kHz interrupt frequency. None of these timer resources, however, is absolutely reliable, while timing for operating systems is essential. Consequently, through cross-referencing timer resources, an OS such as Linux is capable of correcting the potential time drift due to software or hardware turbulence. For example, since the PIT uses a crystal which is more precise than the TSC oscillator, Linux uses the PIT to calibrate the TSC frequency.
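The relationship between the timer resources in the example above is simple arithmetic: each source's tick count divided by its frequency must yield the same elapsed wall-clock time. The short C program below reproduces the numbers quoted in the text (2 GHz TSC, 10 MHz HPET, 1 kHz PIT over 100 ms); the frequencies are taken from the example, not probed from any machine.

/* Reproduce the tick counts quoted in the text for a 100 ms interval.
 * Frequencies are the example values from the paper, not real hardware. */
#include <stdio.h>
#include <stdint.h>

static uint64_t ticks_for_interval(uint64_t freq_hz, double seconds)
{
    return (uint64_t)(freq_hz * seconds);
}

int main(void)
{
    const double interval = 0.100;               /* 100 ms */

    printf("TSC  @ 2 GHz  : %llu ticks\n",
           (unsigned long long)ticks_for_interval(2000000000ull, interval));
    printf("HPET @ 10 MHz : %llu ticks\n",
           (unsigned long long)ticks_for_interval(10000000ull, interval));
    printf("PIT  @ 1 kHz  : %llu interrupts\n",
           (unsigned long long)ticks_for_interval(1000ull, interval));
    /* All three counts describe the same 100 ms of external time; an OS
     * cross-references them to detect drift or lost interrupts. */
    return 0;
}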
Fig. 4. Chaos of timer resources.

However, owing to the chance of interrupt delay or loss, it is probable that two PIT interrupts may be merged if the previous one has not been serviced yet when a new one arrives. Therefore, some versions of Linux may conversely use the TSC to detect a potential missed PIT interrupt if the time between two PIT interrupts exceeds a tick period. In other words, all timer resources in the real world are synchronized to indicate the external time elapsed.

However, virtualization breaks this relationship, so the conventional HVM suffers from the timer synchronization issue. Each timer resource in the virtualization environment is emulated by the hypervisor independently. The guest TSC is pinned to the host TSC with a constant offset. PIT interrupts may be stacked because of VCPU scheduling. For example, when a VCPU is switched out for 30 ms (Fig. 4a), the hypervisor may inject 30 interrupts (if the PIT is programmed at 1 kHz) immediately to reflect the elapsed time when the VCPU is switched back (Fig. 4b-d). If the PIT interrupt service routine in the guest OS compares the PIT interrupt count (e.g., jiffies in Linux) with its TSC, it will immediately sense the tremendous number of lost interrupts and pick them up, since the guest TSC has already advanced but the jiffies value is still at the point when the VCPU was switched out. However, the hypervisor does not know whether the guest will pick up the lost interrupts, and is consequently stuck in a dilemma over whether it should inject all the missing PIT interrupts. An OS such as Linux calculates lost ticks on each clock interrupt according to the current TSC and the TSC of the last PIT interrupt, and then adds the lost ticks to jiffies in order to fix the inaccuracy (Fig. 4(1-3)); but the hypervisor also injects the lost ticks into the guest (Fig. 4b-d). Consequently, the redundant compensation accounting causes the chaos of guest timer resources, as shown in Fig. 4d, (4). Furthermore, it is not wise to just depend on the guests themselves without the hypervisor's tick compensation, because other guest OSes may not support this self-compensation ability, such as Microsoft Windows and older Linux kernels.

Although temporarily drifting the TSC within the PIT interrupt delivery may solve the problem to a certain degree, it introduces another problem for SMP guests. Each VCPU in an SMP guest has its own TSC, which in the real world is synchronized with the others as well as with other timers such as the PIT and HPET. Given that each VCPU is scheduled independently in Xen, if all timer resources are synchronized, a single VCPU whose TSC is blocked due to being scheduled out will block the platform timer resources as well. Consequently, the TSCs of other VCPUs will be frozen for the sake of synchronization. In other words, such forcible timer synchronization paradoxically prohibits VCPUs from being scheduled.

In order to achieve a comprehensive solution for timer synchronization, the hybrid approach in this paper modifies the HVM by importing the paravirtualized timer component to establish uniform timer management (UTM). All guest timer resources are paravirtualized and redirected to the hypervisor-aware field. The hypervisor does the whole synchronization work and prepares accurate time values for the guests, which fetch them via shared memory. Hence, UTM can eliminate the legacy time drift and guarantee precise timer synchronization. In the meantime, UTM also saves a great amount of unnecessary interrupt injection in the hypervisor and tick counting in the guest.
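To see why the two compensation paths above conflict, the following sketch models a tick-counting guest that derives lost ticks from the TSC delta, in the spirit of the Linux cross-check described earlier, while the hypervisor separately replays the interrupts it withheld. The constants and function names are illustrative; the only point is that applying both corrections counts the same gap twice.

/* Toy model of the double tick compensation problem described above.
 * All constants are illustrative; no hypervisor interfaces are used. */
#include <stdio.h>
#include <stdint.h>

#define TSC_HZ       2000000000ull          /* 2 GHz guest TSC  */
#define PIT_HZ       1000ull                /* 1 kHz tick       */
#define TSC_PER_TICK (TSC_HZ / PIT_HZ)

static uint64_t jiffies;                    /* guest tick counter            */
static uint64_t last_tick_tsc;              /* TSC at the last serviced tick */

/* Linux-style tick handler: count this interrupt, then add any ticks that
 * look lost according to the TSC elapsed since the last interrupt. */
static void guest_tick(uint64_t now_tsc)
{
    uint64_t elapsed = (now_tsc - last_tick_tsc) / TSC_PER_TICK;
    uint64_t lost    = elapsed > 0 ? elapsed - 1 : 0;

    jiffies      += 1 + lost;
    last_tick_tsc = now_tsc;
}

int main(void)
{
    /* The VCPU is descheduled for 30 ms (30 ticks) while its TSC keeps
     * running; the hypervisor then replays all 30 withheld PIT interrupts. */
    uint64_t now_tsc = 30 * TSC_PER_TICK;

    for (int i = 0; i < 30; i++)
        guest_tick(now_tsc);

    /* The guest's own TSC-based compensation already covered the gap on the
     * first replayed interrupt, so the other 29 injections are counted a
     * second time: jiffies ends at 59 instead of 30 for a 30-tick gap. */
    printf("jiffies = %llu (expected 30)\n", (unsigned long long)jiffies);
    return 0;
}

Under UTM this double accounting disappears because the guest no longer counts injected ticks at all; it reads the hypervisor-prepared time values from shared memory.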
Apart from the traditional tick based kernel, Linux can be configured as a tickless kernel by setting CONFIG_NO_HZ in the kernel compilation options. The feature of the tickless kernel is replacing the old periodic timer interrupts with on-demand interrupts, whose timers are reprogrammed to calculate the time interval via per-CPU high resolution timers. Consequently, unlike the conventional mechanism of an OS heartbeat driven by a periodic tick, the tickless kernel allows idle CPUs to remain idle until a new task is queued for processing. However, similar to the issue of losing periodic timer interrupts due to the delayed VM schedule, the tickless kernel may also suffer from losing asynchronous interrupts in the virtualized environment. Unlike the strategy of cross-referencing among timers in the tick based kernel, the tickless kernel uses a one-shot way to generate asynchronous interrupts and is not capable of detecting the interrupt loss problem.
Consequently, by the nature of the tickless kernel, no interrupt compensation is available if a VM is not scheduled in a timely manner, resulting in the loss of asynchronous interrupts. To address this issue, the hypervisor is able to detect and collect the missing interrupts of a VM and perform the compensation. Although this is not included in the current implementation of hybrid virtualization in this paper, it is a feasible addition for further improvement of our hybrid virtualization prototype.

6. Evaluation

All experiments were conducted on a platform configured with a 3.20 GHz Intel Core i7-965 processor. The hypervisor was Xen. The host (Domain0) and the guest VMs ran CentOS 5.5 for x86_64. The native environment and the DomainU guests, including the PVM, HVM and hybrid guests, shared the same Linux kernel binary with the hybrid virtualization extension, whereas Domain0 adopted the XenLinux kernel. All the OSes were set with the same clock source (time stamp counter, TSC), the same kernel tick frequency (250 Hz) and the same memory size (2 GB).

6.1. Overall performance

OS basic operations in a variety of conditions can be a true reflection of the overall system performance. UnixBench is applied to evaluate the overall performance of the Hybrid VM, HVM and PVM. The test suites of UnixBench cover local operation only, not network performance. Fig. 5 illustrates the benchmark results of running UnixBench as a single process, with all data normalized to the non-virtualized (i.e., native) performance; higher is better.

Fig. 5. UnixBench performance comparison among Hybrid VM, HVM and PVM. Higher is better.

Due to the overhead introduced by virtualization, most of the data in Fig. 5 indicate that the test programs perform worse in the virtual machines than in the native environment. Most performance results of the Hybrid VM are very close to those of the HVM, and the gap is basically within the measurement error (less than 2%), except for the pipe-based context switching test. On the ground of the optimized system call path and memory virtualization, the Hybrid VM demonstrates remarkable performance improvement against the PVM. The Hybrid VM performing 7% lower than the HVM in the pipe-based context switching test is due to a slight drawback of hybrid virtualization when handling reads/writes of a virtual block device in memory, e.g. a pipe. In the hybrid solution in this paper, the performance of reads/writes against block devices on disk (except DMA) is enhanced due to the changed strategy of I/O handling based on virtual interrupts. However, with respect to I/O with a virtual block device in memory, the read() and write() system calls also trigger virtual interrupt requests which are later treated as invalid ones, since the destination media is memory. Such invalid virtual interrupts are inevitable because the kernel does not know the media type of the destination before executing read() and write(), until the later translation of the path via the file system.
The overhead resulting from such invalid virtual interrupts, albeit extremely slight, is magnified by the highly frequent iteration of accesses to the virtual block device in memory. The pipe-based context switching test measures the number of times two processes can exchange an increasing integer through a pipe. This simple test triggers tremendous reads/writes against the pipe, which exists as a kind of virtual block device in memory, leading to the performance drop of the Hybrid VM caused by invalid virtual interrupt handling.

Additionally, the Shell script test should be treated as a special case, which shows the performance of the HVM and Hybrid VM exceeding the native environment by more than 200%. The Shell script test program of UnixBench counts the number of execution loops of a certain Shell script within one minute. The function of this Shell script is a series of character conversions in a data file. Xen optimizes virtual machine disk operations by leveraging the file system cache mechanism.
Consequently, read/write operations against the virtual disk image hold the superiority of lazily accessing the physical disk, especially when accompanied by operations with high locality such as the UnixBench Shell script test program. Although the PVM also benefits from such optimization, the overhead caused by its memory operations using the DPT technique drags down its overall performance, so it does not exceed that of the native environment.

6.2. Microbenchmarks

LMBench is used to evaluate the system call performance in different execution environments. All the experimental objects, involving the native environment and the VMs, are equipped with 2 CPUs. In particular, each virtualized CPU is pinned to a unique physical CPU in order to get rid of the turbulence brought by the VCPU scheduling of the hypervisor. All the benchmark running times are normalized to the native execution. Figs. 6-8 present the measurement results, with a higher percentage being better. The benchmarks exhibit a low runtime variability, i.e., a standard deviation within 2.05%.

Fig. 6. LMBench processor benchmarks. They highlight the overhead contributed by fundamental facilities of the OS like system call and signal dispatch. Higher is better.

Fig. 7. LMBench context switch benchmarks. They focus on the overhead added to process context switch. 2p/16k stands for 2 processes handling data of 16 kilobytes. Higher is better.

Fig. 8. LMBench local communication latency benchmarks. They measure the response time of I/O requests. Higher is better.

The majority of these bars illustrate a notable performance improvement with the hybrid virtualization. The traditional direct page table used by the PVM needs more hypercalls when guest page faults occur and thus introduces more TLB flushes and performance regression. The hybrid approach wipes out the costly trampoline in the PVM and obtains general performance enhancements. Overall, most advancements of the Hybrid VM are brought by hardware-assisted paging, and also by avoiding unnecessarily frequent boundary switches between Ring0 and Ring3 in x86_64 mode, e.g., when executing system calls. We witness a performance gain of about 30% on average on benchmarks that highlight the overhead of process creation (like fork and exec) in Fig. 6 and context switch in Fig. 7. Besides, Fig. 8 indicates that hardware-assisted paging also benefits local communication bandwidth.

Fig. 7 presents the comparison of the micro performance of context switching between the Hybrid VM and PVM. Note that np/mK stands for n processes handling m KB of data in parallel. In the PVM case, the execution of context switching from user mode to kernel mode is similar to that of a system call. Context switching in the PVM causes multiple TLB flushes due to the intervention of the hypervisor, which makes user space and kernel space use different page tables to distinguish themselves from each other. A similar process also happens in context switching caused by interrupts. In the case of process switching, a TLB flush is inevitable because the page directory requires refreshing. But in the PVM case, a TLB flush is generated when a user-mode process requests process switching via a system call, and another TLB flush is generated when the kernel-mode service completes the response.
Although the context switch that accompanies the semantics of process switching is merged with that of the system call return, the total amount of TLB flushes in the PVM is still one more than in the non-virtualized environment. Because the Hybrid VM inherits the HVM container, the execution of the guest OS inside the Hybrid VM is identical to that in the non-virtualized environment. Therefore, no extra TLB flush is introduced when exchanging privilege levels, so the micro performance of context switching in the Hybrid VM goes beyond that in the PVM.

At best, context switching in the Hybrid VM reaches about 70% of the performance in the non-virtualized environment. This performance gap is mainly due to the overhead of address translation within the virtualization mechanism. In the non-virtualized environment, a virtual address of the program needs only one translation by the MMU. But in the virtualized environment, a virtual address in the guest OS needs one translation through the virtualized page tables (i.e., SPT, DPT or EPT/NPT) and another translation through the MMU. Although the hardware-assisted paging strategy reduces the memory virtualization bottleneck to a minimum, its inherent overhead is inevitable compared with the non-virtualized environment. Furthermore, the execution of context switching itself requires few CPU cycles. Consequently, the slight overhead of accessing memory in the Hybrid VM becomes the major portion of the overall overhead within the procedure of context switching. Such an execution style can be treated as an exception, since most operations in the system require significantly more CPU cycles than address translation.

Several results, e.g., sig hndl in Fig. 6 and TCP in Fig. 8, indicate that the PVM of XenLinux goes beyond the hybrid one. XenLinux is specially optimized for Domain0 usage, which is not representative of a generic VM guest. Therefore, such individual comparisons are not adequate to deduce the prominence of the XenLinux PVM guest.
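The microbenchmarks above boil down to timing very short kernel-entry paths. The sketch below is a minimal user-space timing loop in the spirit of those measurements, not LMBench itself: it times a cheap system call (getppid via syscall()) and a pipe ping-pong between two processes, the two kinds of operation whose virtualization cost is discussed in this section.

/* Minimal timing loop in the spirit of the LMBench tests discussed above
 * (not LMBench itself): measure a cheap system call and a pipe round trip. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/wait.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
    enum { ITERS = 100000 };
    uint64_t t0, t1;

    /* 1. Null-ish system call: each iteration crosses the user/kernel
     *    boundary once, the path whose cost Section 4 analyzes. */
    t0 = now_ns();
    for (int i = 0; i < ITERS; i++)
        syscall(SYS_getppid);
    t1 = now_ns();
    printf("getppid syscall: %.1f ns/call\n", (double)(t1 - t0) / ITERS);

    /* 2. Pipe-based ping-pong between parent and child, similar in spirit
     *    to the pipe-based context switching test. */
    int up[2], down[2];
    char c = 'x';
    pipe(up);
    pipe(down);
    if (fork() == 0) {                      /* child: echo everything back */
        for (int i = 0; i < ITERS; i++) {
            read(up[0], &c, 1);
            write(down[1], &c, 1);
        }
        _exit(0);
    }
    t0 = now_ns();
    for (int i = 0; i < ITERS; i++) {
        write(up[1], &c, 1);
        read(down[0], &c, 1);
    }
    t1 = now_ns();
    printf("pipe round trip: %.1f ns/iter\n", (double)(t1 - t0) / ITERS);
    wait(NULL);
    return 0;
}

Run natively, in a PVM, and in a Hybrid VM, the per-call difference reported by such a loop is exactly the boundary-switch and TLB-flush overhead the section attributes to ring compression.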
6.3. CPU intensive workloads

CPU intensive performance is evaluated via the kernel compile benchmark (KCBench), as illustrated in Fig. 9. Vanilla Linux kernel build times are measured on each object with 1, 2, 4 and 8 CPUs respectively, with the number of compilation threads equal to the CPU count. The performance of the corresponding measurement in the native environment is used as the normalization reference.

Fig. 9. Kernel compile benchmark (KCBench). All the measurement results are normalized to native execution performance. Higher is better. (The 8 cpus data benefit from Intel Hyper-Threading technology.)

KCBench is mainly a memory manipulation intensive test which relies heavily on the MMU. Taking advantage of hardware-assisted paging acceleration, the efficiency of memory access and address translation in the HVM outweighs that of the paravirtualization solution, which adopts the direct page table. Having inherited this superiority from the pure HVM, the hybrid solution delivers equivalent performance with respect to the CPU intensive workload. KCBench frequently issues various types of system call to process the kernel building tasks. The result of this CPU intensive workload verifies the theoretical analysis in Section 4.

6.4. I/O efficiency

The I/O efficiency experiment measures the CPU utilization when processing I/O intensive workloads, rather than the bandwidth capability. As the typical I/O intensive workload is accompanied by network communication, different NIC bandwidth configurations are used to evaluate the I/O efficiency of the Hybrid VM versus the pure HVM. Netperf is utilized to generate the network data and is configured to saturate the available bandwidth. The guests are set to adopt a VT-d pass-through NIC rather than the Xen virtual network so that the comparison is fairer. Note that the experiment focuses on the measurement and comparison of CPU utilization, while the network throughput of both types of VM is equivalent.

Table 2 records the utilization (in percentage) of the virtualized CPUs when running Netperf against the Hybrid VM and the pure HVM; lower is better.

Table 2. I/O efficiency with Ethernet workload. Each guest is configured with 4 virtualized CPUs and the data reflect their total utilization. Lower is better.

NIC         Pure HVM    Hybrid VM
10 Mbps     4.0%        3.0%
100 Mbps    9.7%        7.1%
1 Gbps      21.2%       17.3%
10 Gbps     79.0%       62.0%

The Hybrid VM manifests better I/O efficiency than the pure HVM due to the reduction in the number of VM exits. Using the event channel and virtual interrupt request mechanism rather than the emulated APIC, the Hybrid VM saves more than 60% of the CPU cycles compared with the pure HVM guest when handling a single interrupt. The CPU utilization results depend on the interrupt density. In the 10 Gigabit Ethernet workload case, about 8000 interrupts per second per virtualized CPU are triggered, where the end of interrupt (EOI) and MSI/MSI-X take up more than 60% of the interrupt handling. As a result, compared with the pure HVM, the Hybrid VM is capable of saving about 3-4% CPU utilization on each 3.20 GHz processor core, i.e., the total CPU utilization can be reduced by 12-16% with the 4 virtualized CPU guest. Table 2 shows that the gap between the Hybrid VM and the pure HVM is magnified as the interrupt workload increases, especially under circumstances of interrupt saturation.
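The savings claimed above follow directly from Table 2 and the quoted per-core figure. The snippet below just redoes that arithmetic on the reported numbers: the absolute and relative CPU-utilization reduction at each NIC speed, and the 4-vCPU total implied by a 3-4 point saving per core.

/* Rework the arithmetic behind Table 2: absolute and relative CPU
 * utilization savings of the Hybrid VM versus the pure HVM. */
#include <stdio.h>

int main(void)
{
    const char  *nic[]    = { "10 Mbps", "100 Mbps", "1 Gbps", "10 Gbps" };
    const double hvm[]    = { 4.0, 9.7, 21.2, 79.0 };   /* from Table 2 */
    const double hybrid[] = { 3.0, 7.1, 17.3, 62.0 };   /* from Table 2 */

    for (int i = 0; i < 4; i++) {
        double abs_saving = hvm[i] - hybrid[i];
        double rel_saving = 100.0 * abs_saving / hvm[i];
        printf("%-9s absolute %4.1f points, relative %4.1f%%\n",
               nic[i], abs_saving, rel_saving);
    }

    /* Per-core saving of 3-4 points scaled to the 4-vCPU guest. */
    printf("4 vCPUs: total saving %.0f-%.0f points\n", 4 * 3.0, 4 * 4.0);
    return 0;
}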
7. Related work

To enhance the performance of applications running in virtualization environments, previous research has cut down the overhead by various means of optimization.

7.1. Memory virtualization optimization

Data copying and frequent remapping incur a high performance penalty. Menon et al. (2005) implemented a system-wide statistical profiling toolkit for the Xen virtual machine environment, and analyzed each domain's overhead for network applications. Their experiments showed that Xen Domain0 degraded throughput because of its high TLB miss rate, while the guest domain suffered the instruction cost of the hypervisor and the driver domain. To avoid these costly overheads, a data copy in the TCP packet receive path has been implemented to replace the data transfer mechanism between guests and driver domains. Santos et al.'s (2008) work reduced the inter-domain round trip latency by sharing a static circular memory buffer between domains instead of the Xen page-flipping mechanism, and sped up the second data copy between domains by using the hardware support of modern NICs to copy data directly into guest memory. Our hybrid solution saves the CPU cycles of MMU operations with the help of hardware-assisted paging support. Adams and Agesen (2006) argued that nested paging hardware should easily repay the costs of slower TLB misses, as the process of filling the TLB is more complicated than that of typical virtual memory systems. Our evaluation indicates that the hybrid solution stands out in the overall performance of memory operations compared with paravirtualization. Implemented with the direct page table, the PVM suffers from the expensive overhead of TLB flushes each time it returns from the guest kernel to the application, for hypervisor address space protection.

7.2. I/O virtualization improvement

I/O performance is a popular research issue in the recent virtualization world (Raj and Schwan, 2007; Liu et al., 2006; Eiraku et al., 2009). For reliability, security sensitive instructions must be trapped and handled by the VMM. Frequent interrupts affect the system efficiency seriously, as the context switch is the primary cause of overhead. Sugerman et al. (2001) analyzed the major overhead of the virtual network with the VMware VMM and described a mechanism to reduce the number of world switches to improve the network performance.
King et al. (2003) optimized the performance of Type-II virtual machines by avoiding context switches. Their implementation can support multiple address spaces within a single process by using different segment bounds. Our hybrid solution simplifies the interrupt handling process by employing the event channel, which can transfer interrupt information from the guest domain to Xen Domain0 directly. High-end network devices introduce new techniques, including scatter-gather DMA, TCP checksum offloading, and TCP segmentation offloading, which move functionality from the host to the NIC (Menon and Zwaenepoel, 2008). Establishing selective functionality on a programmable network device to provide higher throughput and lower latency for virtual network interfaces could also enhance system performance (Raj and Schwan, 2007). We believe that the hybrid solution would achieve better performance if we developed new functions for the virtual network interface to take advantage of these new techniques.

7.3. High performance computing virtual machines

Virtualization technology has been gaining acceptance in the scientific community due to its overall flexibility in running HPC applications (Tikotekar et al., 2008; Mergen et al., 2006). While extensive research has been targeted at the optimization of virtualization architecture and devices for HPC (Raj and Schwan, 2007; Liu et al., 2006), some studies focus on performance analysis by investigating the behavior and identifying patterns of various overheads for HPC benchmark applications (Tikotekar et al., 2008). The problem of predicting performance for applications is difficult, and becomes even more difficult in virtual environments due to their complexity. One of the few tools that can be used as a system wide profiler on Xen is Xenoprof (Menon et al., 2005). Our work also utilizes Xenoprof for the experimental data analysis. Besides, some studies have investigated whether novel virtual machine usage scenarios could lead to a better flexibility versus performance trade-off. Tikotekar et al.'s (2009) study showed that different VM configurations could exert diverse performance impact on HPC virtual machines. Our future work also intends to develop utilities well suited to hybrid virtualization applications, as well as to construct an optimal VM configuration for hybrid virtualized HPC virtual machines.

8. Conclusion

The HVM benefits from hardware-assisted paging so that it shows outstanding performance with memory intensive workloads, whereas the PVM is more efficient for I/O intensive applications. Hybrid virtualization is capable of merging the superiorities of both paravirtualization and hardware-assisted virtualization. Our hybrid approach reuses most of the hardware-assisted virtualization features, as well as importing several paravirtualized components into the Hybrid VM. The experiment results demonstrate that the overall performance of our hybrid solution is much better than that of the PVM and close to the pure HVM, while the I/O efficiency of the Hybrid VM outweighs that of the pure HVM.

Acknowledgements

This work is supported by the Program for PCSIRT and NCET of MOE, the National Natural Science Foundation of China, the 863 Program (Grant Nos. 2011AA01A202, 2012AA010905), the 973 Program (Grant No. 2012CB723401), the International Cooperation Program of China (Grant No. 2011DFA10850), and the Ministry of Education and Intel joint research foundation (Grant No. MOE-INTEL-11-05).

References
Adams, K., Agesen, O., 2006. A comparison of software and hardware techniques for x86 virtualization. In: ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, USA, pp. 2-13.
Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Neugebauer, R., Pratt, I., Warfield, A., 2003. Xen and the art of virtualization. In: SOSP '03: Proceedings of the 19th ACM Symposium on Operating Systems Principles. ACM, New York, NY, USA.
Bhargava, R., Serebrin, B., Spadini, F., Manne, S., 2008. Accelerating two-dimensional page walks for virtualized systems. In: ASPLOS XIII: Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, USA.
Clark, B., Deshane, T., Dow, E., Evanchik, S., Finlayson, M., Herne, J., Matthews, J.N., 2004. Xen and the art of repeated research. In: ATC '04: Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, USA.
Eiraku, H., Shinjo, Y., Pu, C., Koh, Y., Kato, K., 2009. Fast networking with socket-outsourcing in hosted virtual machine environments. In: SAC.
King, S.T., Dunlap, G.W., Chen, P.M., 2003. Operating system support for virtual machines. In: ATC '03: Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, USA.
Liu, J., Huang, W., Abali, B., Panda, D.K., 2006. High performance VMM-bypass I/O in virtual machines. In: ATC '06: Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, USA.
Menon, A., Zwaenepoel, W., 2008. Optimizing TCP receive performance. In: ATC '08: Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, USA.
Menon, A., Santos, J.R., Turner, Y., Janakiraman, G.J., Zwaenepoel, W., 2005. Diagnosing performance overheads in the Xen virtual machine environment. In: VEE '05: Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments. ACM, New York, NY, USA.
Mergen, M.F., Uhlig, V., Krieger, O., Xenidis, J., 2006. Virtualization for high-performance computing. ACM SIGOPS Operating Systems Review 40 (2), 8-11.
Nakajima, J., Mallick, A.K., 2007. Hybrid-virtualization: enhanced virtualization for Linux. In: Proceedings of the Linux Symposium.
Neiger, G., Santoni, A., Leung, F., Rodgers, D., Uhlig, R., 2006. Intel virtualization technology: hardware support for efficient processor virtualization. Intel Technology Journal 10 (3).
Raj, H., Schwan, K., 2007. High performance and scalable I/O virtualization via self-virtualized devices. In: HPDC '07: Proceedings of the 16th International Symposium on High Performance Distributed Computing. ACM, New York, NY, USA.
Santos, J.R., Turner, Y., Janakiraman, G., Pratt, I., 2008. Bridging the gap between software and hardware techniques for I/O virtualization. In: ATC '08: Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, USA.
Sugerman, J., Venkitachalam, G., Lim, B.-H., 2001. Virtualizing I/O devices on VMware Workstation's hosted virtual machine monitor. In: ATC '01: Proceedings of the Annual Conference on USENIX Annual Technical Conference. USENIX Association, Berkeley, CA, USA.
Tikotekar, A., Vallée, G., Naughton, T., Ong, H., Engelmann, C., Scott, S.L., 2008. An analysis of HPC benchmarks in virtual machine environments. In: Euro-Par Workshops.
Tikotekar, A., Ong, H., Alam, S., Vallée, G., Naughton, T., Engelmann, C., Scott, S.L., 2009. Performance comparison of two virtual machine scenarios using an HPC application: a case study using molecular dynamics simulations. In: HPCVirt '09: Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing. ACM, New York, NY, USA.
Whitaker, A., Shaw, M., Gribble, S.D., 2002. Scale and performance in the Denali isolation kernel. In: OSDI '02: Proceedings of the 5th Symposium on Operating Systems Design and Implementation. ACM, New York, NY, USA.
Xen.org, 2008. Xen Architecture Overview.

Qian Lin received his M.Eng. degree from Shanghai Jiao Tong University in 2011 and his B.Sc. degree from South China University of Technology. Currently he is a Ph.D. candidate in the School of Computing, National University of Singapore. His research interests include operating systems, cloud systems, trusted computing and distributed databases.
Zhengwei Qi received his B.Eng. and M.Eng. degrees from Northwestern Polytechnical University in 1999 and 2002, and his Ph.D. degree from Shanghai Jiao Tong University. He is an Associate Professor at the School of Software, Shanghai Jiao Tong University. His research interests include static/dynamic program analysis, model checking, virtual machines, and distributed systems.

Yaozu Dong is a senior staff engineer at the Intel Open Source Technology Center. His research focuses on architecture and systems, including virtualization, operating systems, and distributed and parallel computing.

Jiewei Wu received his bachelor's degree from Shanghai Jiao Tong University. He is now a graduate student in the Shanghai Key Laboratory of Scalable Computing and Systems. His research interests include operating systems and virtualization technology.

Haibing Guan received his Ph.D. degree from Tongji University. He is a Professor in the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, and the director of the Shanghai Key Laboratory of Scalable Computing and Systems. His research interests include distributed computing, network security, network storage, green IT and cloud computing.