Xen and the Art of Virtualization Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield University of Cambridge Computer Laboratory, SOSP 2003 Presenter: Dhirendra Singh Kholia
Outline What is Xen? Xen: Goals, Challenges and Approach Detailed Design Benchmarks (skip?) Xen Today Conclusion Discussion
What is Xen? Xen is a virtual machine monitor (VMM) for x86, x86-64, Itanium and PowerPC architectures. Xen can securely execute multiple virtual machines, each running its own OS, on a single physical system with close-to-native performance. It is a Type-1 (native, bare-metal) hypervisor. It runs directly on the host's hardware as a hardware control and guest operating system monitor.
Xen Goals Performance isolation between guests (resource control for some guarantee of QoS) Minimal performance overhead Support for different operating systems Maintain the application binary interface (ABI), thus allowing existing applications to run unmodified Need to support full multi-application operating systems
x86 CPU virtualization x86: the most successful architecture ever! Easy: x86 has built-in privilege levels/protection rings (Ring 0, Ring 1, Ring 2, Ring 3); Ring 1 and Ring 2 are normally unused. Hard: the VMM needs to run at the highest privilege level (Ring 0) to provide isolation, resource scheduling and performance, BUT guest kernels are also designed to run in Ring 0. Running certain sensitive instructions (aka non-virtualizable instructions) without sufficient privilege causes silent failures instead of generating a convenient trap to the VMM. Thus, the VMM never gets an opportunity to simulate the effect of the instruction. Source: Ring Diagrams: http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection
x86 CPU virtualization approaches 1 Full Virtualization (VMware Workstation; presents virtual resources) Doesn't require Guest OS modifications. Uses binary translation: a technique to dynamically rewrite Guest OS kernel code in order to catch non-trapping privileged instructions. Relatively lower performance (translation overhead, page table synchronization and update overhead). Time synchronization can be problematic (lost ticks, backlog truncation), frequently requiring a guest tool to maintain synchronization.
x86 CPU virtualization approaches 2 Paravirtualization (Xen; presents virtual + real resources) Requires modifications to the Guest OS's kernel. Improved performance (due to exposure of real hardware and a one-time guest modification). Exposing real time allows correct handling of time-critical tasks like TCP timeouts and RTT estimates. Hardware-Assisted Virtualization Conceptually, it can be understood as adding a Ring -1 above Ring 0, in which the hypervisor executes and can trap and emulate privileged instructions. Allows for a much cleaner implementation of full virtualization.
Full Virtualization vs. Paravirtualization [Diagram: in full virtualization, user applications run in Ring 3, the guest OS in Ring 1, and the VMM with binary translation in Ring 0; in paravirtualization, user applications and the control plane run in Ring 3, Dom0 and the guest OSes in Ring 1, and Xen in Ring 0.] Source: http://www.cs.uiuc.edu/homes/kingst/spring2007/cs598stk/slides/20070201-kelm-thompson-xen.ppt
Cost of Porting/Paravirtualizing an OS x86-dependent code (privileged instructions + page table access), virtual network driver, virtual block device driver, Xen-specific code (schedulers, hypercall implementation, etc.) For Linux 2.4, < 1.5% (around 3000 lines) of the x86 code base size was modified/added. How much modification of the Guest OS is too much? Is several thousand lines of code per operating system actually minimal effort? - Considering the Linux kernel is around 11.5 million lines of code (Source: Linux Foundation, August 2009), I think a few thousand lines of code is minimal.
Paravirtualization: Xen's approach 1 Xen runs in Ring 0; the modified guest kernel runs in Ring 1 and guest applications run unmodified in Ring 3 (hence the guest OS remains protected from its applications). The Guest OS kernel must be modified to use a special hypercall ABI instead of executing privileged and sensitive instructions directly. A hypercall (int 0x82) is a software trap from a domain to the hypervisor, just as a syscall (int 0x80) is a software trap from user space to the kernel. E.g., when the system is idle, Linux issues the HLT instruction, which requires Ring 0 privilege to execute. In XenoLinux this is replaced by a hypercall which transfers control from Ring 1 to Xen in Ring 0.
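The HLT-to-hypercall substitution above can be sketched as a toy simulation (all class and operation names here are hypothetical illustrations, not Xen's real interfaces; a real hypercall is a software trap via int 0x82):

```python
# Illustrative simulation: a paravirtualized guest kernel (Ring 1) cannot
# execute privileged instructions directly, so it asks the hypervisor
# (Ring 0) to perform the operation on its behalf via a hypercall.

class Hypervisor:
    def __init__(self):
        self.halted_vcpus = set()

    def hypercall(self, domain_id, op):
        # The hypervisor validates every request before acting on it.
        if op == "sched_op_block":
            # XenoLinux replaces the privileged HLT instruction with a
            # hypercall that blocks the virtual CPU until an event arrives.
            self.halted_vcpus.add(domain_id)
            return 0
        raise ValueError(f"unknown hypercall {op!r}")

class GuestKernel:
    def __init__(self, domain_id, hypervisor):
        self.domain_id = domain_id
        self.hv = hypervisor

    def idle(self):
        # Instead of executing HLT (which would fail in Ring 1),
        # trap into the hypervisor.
        return self.hv.hypercall(self.domain_id, "sched_op_block")

hv = Hypervisor()
guest = GuestKernel(1, hv)
guest.idle()
print(1 in hv.halted_vcpus)  # the vCPU is now blocked in the hypervisor
```

The point of the indirection is that the hypervisor gets to validate and account for every privileged operation, which a silently-failing HLT in Ring 1 would never give it.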
Paravirtualization: Xen's approach 2 Xen is mapped into the top 64MB (on x86) of every OS's address space. This is done to save a TLB flush when going from Ring 1 to Ring 0 (the VMM). Xen itself is protected by segmentation. Trap/exception handlers (system call, page fault) are registered with Xen for validation. A guest OS may install a fast exception handler for system calls, allowing direct calls from an application into its guest OS and avoiding indirection through Xen on every call.
Paravirtualization: Xen's approach Source: http://www.linuxjournal.com/article/8540
Control Transfer: Hypercalls and Events Events: notifications from Xen to the guest OS, e.g. data arrival on the network or a virtual disk transfer completing. Events replace device interrupts! Hypercalls: synchronous calls from the guest OS to Xen (similar to system calls), e.g. a batch of page table updates.
I/O Rings: Data Transfer A message-passing abstraction built on top of Xen's shared-memory IPC. Networking example: a domain (request producer) can supply buffers using requests, and Xen (response producer) provides responses to signal the arrival of packets into those buffers. To do this efficiently (avoiding a copy of packet data from Xen to domain pages), Xen exchanges the packet's buffer page with an unused page frame which must be supplied by the domain!
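The producer/consumer structure of an I/O ring can be sketched as a minimal toy model (plain Python in place of the real shared-memory structure; field names are simplified guesses at the scheme, not Xen's actual layout):

```python
# Minimal sketch of the Xen I/O ring idea: a fixed-size circular buffer
# shared between a domain (request producer) and Xen (response producer).
# Each side advances its own producer/consumer indices independently;
# responses reuse the slots of already-consumed requests.

RING_SIZE = 8

class IORing:
    def __init__(self):
        self.slots = [None] * RING_SIZE
        self.req_prod = 0   # advanced by the domain when posting requests
        self.req_cons = 0   # advanced by Xen when consuming requests
        self.rsp_prod = 0   # advanced by Xen when posting responses
        self.rsp_cons = 0   # advanced by the domain when reaping responses

    def post_request(self, req):                 # domain side
        assert self.req_prod - self.rsp_cons < RING_SIZE, "ring full"
        self.slots[self.req_prod % RING_SIZE] = req
        self.req_prod += 1

    def consume_request(self):                   # Xen side
        assert self.req_cons < self.req_prod, "no pending requests"
        req = self.slots[self.req_cons % RING_SIZE]
        self.req_cons += 1
        return req

    def post_response(self, rsp):                # Xen side
        self.slots[self.rsp_prod % RING_SIZE] = rsp
        self.rsp_prod += 1

    def reap_response(self):                     # domain side
        assert self.rsp_cons < self.rsp_prod, "no pending responses"
        rsp = self.slots[self.rsp_cons % RING_SIZE]
        self.rsp_cons += 1
        return rsp

ring = IORing()
ring.post_request({"op": "read", "buffer_page": 0x1000})
req = ring.consume_request()
ring.post_response({"status": "ok", "for": req["op"]})
print(ring.reap_response())
```

Because each side only writes its own indices, requests and responses can be produced and consumed asynchronously, and notifications (events) can be batched rather than raised per entry.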
MMU virtualization VMware solution (shadow page tables, slow): Two sets of page tables are maintained. The guest's virtual page tables aren't visible to the MMU. The hypervisor traps virtual page table updates and is responsible for validating them and propagating changes to the MMU shadow page table. Xen solution (direct page table access): The guest OS is allowed read-only access to the real page tables. Page table updates must still go through the hypervisor, which validates them. Guest OSes allocate and manage their own PTs using hypercalls. The OS must not give itself unrestricted PT access, access to hypervisor space, or access to other VMs.
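The validation step in Xen's direct approach can be sketched as a toy model (the structures below are hypothetical simplifications; real checks also cover page-table-type invariants, hypervisor mappings, and reference counts):

```python
# Toy sketch of validated page-table updates: the guest holds its page
# tables read-only and submits updates through a "hypercall"; the
# hypervisor refuses mappings of machine frames the domain does not own.

class MMUHypervisor:
    def __init__(self, frame_owner):
        self.frame_owner = frame_owner   # machine frame -> owning domain
        self.page_tables = {}            # (domain, vaddr) -> machine frame

    def mmu_update(self, domain, vaddr, frame):
        # Validation: a domain may only map frames it owns.
        if self.frame_owner.get(frame) != domain:
            return -1   # update refused
        self.page_tables[(domain, vaddr)] = frame
        return 0

hv = MMUHypervisor(frame_owner={0x100: "domA", 0x200: "domB"})
print(hv.mmu_update("domA", 0x4000, 0x100))  # 0: allowed
print(hv.mmu_update("domA", 0x5000, 0x200))  # -1: frame belongs to domB
```

Because only validated updates ever reach the live page tables, the guest gets near-native MMU behaviour without being able to map another domain's memory.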
Networking Xen provides a Virtual Firewall-Router (VFR). Each domain has one or more VIFs attached to the VFR. Two I/O buffer descriptor rings (one each for transmit and receive). Transmit: the domain updates the transmit descriptor ring; Xen copies the descriptor and the packet header; the header is inspected by the VFR; payload copying is avoided by using the gather DMA capability of the NIC driver. Receive: copying is avoided by using the page-flipping technique.
Disk Only Domain0 has direct access to disks. Other domains use virtual block devices (VBDs) via the I/O ring. The guest's I/O scheduler reorders requests prior to enqueuing them on the ring; Xen can also reorder requests to improve performance. Zero-copy data transfer is achieved using DMA into pinned memory pages.
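The request reordering mentioned above can be illustrated with a one-directional elevator sweep (a hypothetical, much-simplified stand-in for a real disk scheduler, shown only to convey why reordering before enqueuing helps):

```python
# Illustrative sketch of reordering pending block requests by sector
# number before enqueuing them on the I/O ring: servicing everything at
# or beyond the head position in ascending order, then wrapping to the
# remainder, reduces seek distance on a rotational disk.

def reorder(requests, head_pos):
    ahead = sorted(r for r in requests if r >= head_pos)   # sweep forward
    behind = sorted(r for r in requests if r < head_pos)   # then wrap
    return ahead + behind

pending = [900, 10, 400, 37, 750]
print(reorder(pending, head_pos=100))  # [400, 750, 900, 10, 37]
```

Either the guest (which knows its own request semantics) or Xen (which sees requests from all domains) can apply this kind of reordering, which is why the paper allows both.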
Xen Architecture Source: http://www.arunviswanathan.com/content/ppts/xen_virt.pdf
Domain 0: Control and Management Separation of mechanism and policy. Domain0 hosts the application-level management software, which uses control interfaces provided by Xen: creating/terminating other domains, controlling scheduling and CPU/memory allocation, and creating VIFs and VBDs, whose parameters include access control (for I/O devices), the amount of physical memory per domain, VFR rules, etc.
I/O Handling dom0 runs the backend of each device, which is exported to each domain via a frontend: netback/netfront for network devices (NICs), blockback/blockfront for block devices. PCI pass-through exists for other kinds of devices (e.g. sound).
Driver Architecture Source: http://www.linuxjournal.com/article/8909
Benchmarks (all taken from Ian's presentation in 2006) In short, Xen provides close-to-native performance!
MMU Micro-Benchmarks
TCP Benchmarks
Xen Today (Xen 3.x) Xen 3.x supports running unmodified guest OSes by using hardware-assisted virtualization (Intel VT, AMD-V). Supports NetBSD, OpenSolaris, Linux 2.4/2.6 as both guest and host. Runs FreeBSD and Windows (using HVM) as guests. Live migration of VMs between Xen hosts. x86/x86-64/Itanium/PowerPC, SMP (64-way!) guest support, enhanced power management, XenCenter for management. Awesome hardware support! (The ESX HCL is very limited.) DomU (paravirtualization) patches were merged in Linux 2.6.23; Dom0 patches are still struggling to get merged upstream. (KVM is gaining support!)
Xen 3.0 Architecture
Questions - Security What is the chance of the hypervisor and other guest OSes getting affected by a compromised guest OS running on top of Xen? Game over; protection of Domain 0 is critical! Can't we get rid of the Domain 0 guest OS? I think if we can do that we can reduce the vulnerable surface of Xen (in one of their security presentations they admit they should minimize the TCB). What are the other implications of removing the Dom0 guest OS? Where will the management code go? Xen relies on Dom0 drivers.
Questions Security 2 The hypervisor takes up the upper 64MB of the address space. Will this cause problems if we no longer want to modify the operating system, by using Intel VT? - With Intel VT, Xen isn't mapped into the guest OS address space. If an attacker manages to place a VM co-resident with the target, as a next step he can extract confidential information via a cross-VM attack. There are a number of avenues for such an attack, e.g. side channels: cross-VM information leakage due to the sharing of physical resources (e.g., the CPU's data caches). In a multi-process environment, such attacks have been shown to enable extraction of RSA and AES secret keys. How can this problem be avoided in Xen? - ???
Questions Security 3 The Dom0 domain accesses the hardware directly, while all other domains see virtual abstractions of devices. Does that mean that all drivers, regardless of domain, run in the same address space, i.e. that of Dom0? If so, how does it prevent a driver from doing a DMA write to the memory of an arbitrary domain? Drivers can be pushed out from Domain 0 (Ring 1) to driver domains (also Ring 1). This makes the system more robust. However, the fundamental problem of unsafe DMA access is solved by IOMMU hardware.
Questions Resource Management In Xen each guest OS has its own memory reservation and disk allocation. Is this a way of statically allocating hardware resources, which is often considered a waste of resources? - Yes, resource management is complicated. Xen can do memory overcommitment and then use ballooning for dynamic memory management. Parallax handles the space management problem (using COW?). Memory and disk are cheap these days though; I would focus more on the isolation, QoS and security problems. In the section about physical memory, they talk about either using a balloon driver or modifying the kernel memory management routines to adjust the memory usage of a domain. Both these approaches seem to require modification of the OS. With hardware-supported virtualization now allowing OSes to run unmodified, how is this problem solved? The balloon driver works with HVM guests.
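The ballooning mechanism referred to above can be sketched as a toy model (names, units and the `set_target` interface are hypothetical; the real driver allocates and returns actual page frames to Xen):

```python
# Toy model of the balloon driver idea: to shrink a domain's memory, the
# hypervisor raises the balloon target; the in-guest driver "inflates" by
# allocating pages the guest can no longer use and handing them back to
# Xen, and "deflates" to return memory to the guest.

class BalloonedGuest:
    def __init__(self, reservation_mb):
        self.reservation_mb = reservation_mb  # administrator-set maximum
        self.balloon_mb = 0                   # pages held by the driver

    def usable_mb(self):
        return self.reservation_mb - self.balloon_mb

    def set_target(self, target_mb):
        # Inflate or deflate until the guest's usable memory hits target.
        self.balloon_mb = max(0, self.reservation_mb - target_mb)

g = BalloonedGuest(reservation_mb=512)
g.set_target(384)           # hypervisor reclaims 128 MB
print(g.usable_mb())        # 384
g.set_target(512)           # memory handed back to the guest
print(g.usable_mb())        # 512
```

Because inflation happens through the guest's own memory allocator, the guest, not the hypervisor, decides which of its pages to give up, which is why ballooning works even without a paravirtualized MMU.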
Questions Resource Management In Xen, what strategy does the hypervisor use to schedule the domains fairly (to balance the load across domains)? What if some domains always have a heavier average load than others? The new CREDIT scheduler assigns a weight and a cap to each domain. A domain with weight 2X gets twice as much CPU as a domain with weight X; the cap decides how much CPU the domain may use. You can always assign (even at runtime) a higher weight to a domain which requires more CPU time. I don't see why the paper says delegating the task of building a new domain to Domain0 is better than building a domain entirely within Xen. Isn't Domain0 a part of Xen? How can the complexity be reduced? By Xen the authors mean the VMM part running in Ring 0; Domain 0 runs in Ring 1. Management code has to live somewhere, and Domain 0 is the logical place to put it!
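The weight/cap arithmetic of the credit scheduler can be sketched numerically (the function below is an illustrative simplification, not the scheduler's actual accounting, which works in per-vCPU credits debited every tick):

```python
# Numeric sketch of the credit scheduler's proportional sharing: each
# domain's CPU share is weight / sum(weights), optionally clipped by a
# cap (the fraction of CPU the domain is allowed to consume).

def cpu_shares(weights, caps=None, total_cpu=1.0):
    total_w = sum(weights.values())
    shares = {d: total_cpu * w / total_w for d, w in weights.items()}
    if caps:
        for d, cap in caps.items():
            shares[d] = min(shares[d], cap)
    return shares

# A domain with weight 512 gets twice the CPU of one with weight 256.
print(cpu_shares({"dom1": 512, "dom2": 256, "dom3": 256}))
# {'dom1': 0.5, 'dom2': 0.25, 'dom3': 0.25}
```

The answer to the fairness question follows directly: a persistently heavier domain simply gets a larger weight, and the hypervisor's proportional-share accounting does the rest.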
Questions - Isolation How can this paper prove that it allows multiple commodity operating systems to share hardware in a safe and resource-managed fashion, when the Xen prototype only supported the XenoLinux guest OS when the paper was written? Xen today handles many different guest OSes. Even in 2003 they had a working XP prototype (it could run Notepad and Minesweeper). Is it impossible to run a guest OS on Xen if the hardware only supports 2 privilege levels? Yes, I think so; with 2 privilege levels the guest OS wouldn't be able to protect itself from its applications. If the Xen VMM is not used on an x86 processor with four privilege levels, will the whole architecture break? I mean, how then do we separate the guest OS kernel and guest applications in a safe fashion? 3 rings are good, 2 are NOT!
Questions Performance If we can modify the memory management subsystem, why can't we modify the I/O system to transfer directly from/to the disk? It seems I/O performance could be improved this way. Is it hard? - Xen already does zero-copy transfer (using DMA) for disk I/O. Did I understand the question correctly? DomU gets its resources from Dom0, except for CPU and memory which come from the Xen VMM; won't the communication involved cause a lot of overhead? How is it reduced in the next version of Xen? Zero-copy transfers, a fast underlying IPC (shared memory), batching of updates and events, PCI pass-through. The 64MB of address space reserved by Xen to avoid a TLB flush per address-space switch seems like a great consumption if 100 OSes run on the VMM. Does this paper mean that Xen needs to use 64MB for each process on each OS running on it? If so, it seems to be a disaster. - No! Xen is mapped into the top 64MB of every guest address space; it doesn't physically consume 64MB of RAM for every guest OS.
Questions Utility In what kinds of scenarios in practice do we need multiple different operating systems running on the same machine, especially when applications nowadays are becoming more and more portable across platforms? To test those very same portable applications, virtual machines are an excellent solution! You can run Windows, Linux and OS X on the same box and test your applications.
Questions Future Work In the future work section they talk about a shared universal buffer cache. Is this similar to the shared memory mentioned in Disco? Was this ever implemented? Yes, I think it is similar. And yes, the XenFS project seems to be active.
Questions Although the paper claims that minimal modification is required to port a guest OS, the port of Windows XP was still incomplete in their experiments. So do you think it is really easy to achieve? - It ran into licensing problems (M$!). With HVM, such a port is not required. I leave the answering of the last part to the audience. The authors refer a number of times to a paravirtual port of Windows XP. A quick Web search reveals that licensing issues prevent this port from ever being published; thus, today, Windows XP can only be run under Xen using hardware-assisted virtualization (added in Xen 3). Why do the authors bother describing the paravirtualization of Windows XP, when no researcher can replicate their results and no user can take advantage of this port (due to unavailability of the code)? Simply to illustrate that different OSes could potentially be ported to run on top of Xen with minimal changes; that would be my guess!
More Questions From this paper, it seems VMware loses a lot to Xen in performance, so I'm wondering: is there any scenario where we might prefer binary translation (as in VMware) over paravirtualization (as in Xen)? BT is required in order to run an unmodified guest OS on top of plain x86. BT is not required if the processor supports hardware virtualization; however, BT is still used because it gives better performance than VT in some scenarios.
Even More Questions! Would there be a heavy performance loss on the guest OSes, given that every privileged instruction has to be validated by Xen? How does VMware handle this problem? - ??? The authors chose not to implement paging in the VMM, but to allow each OS to perform paging itself. They state that this decision was made to help achieve performance isolation, by preventing one domain from performing thrashing-inciting memory access patterns and thus reducing the performance of other domains. Is there any paging policy that would allow the VMM to perform paging, with all the attendant benefits (better resource sharing in asymmetric-load situations, etc.), while not suffering substantially from a breakdown in performance isolation? - ???
Even More Questions! A minor question: what is the "QoS crosstalk" problem referred to in Section 1? Xen can provide three types of time: real, virtual and wall-clock time. Virtual time is used by the guest OS to make proper scheduling decisions, but nowadays Intel VT enables us to run unmodified guests. If the guest OS does not know its virtual time, how can it make good scheduling decisions? When using Intel VT, how could we provide the guest OS with virtual time while at the same time giving it real time?
References Ring diagrams: http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection J. S. Robin and C. E. Irvine. Analysis of the Intel Pentium's Ability to Support a Secure Virtual Machine Monitor. Introduction to the Xen Virtual Machine: http://www.linuxjournal.com/article/8540
Conclusions High performance, strong isolation and effective scaling. Commercially successful (Citrix) and widely used in industry (it is the VMM driving cloud computing; at least Amazon EC2 uses it!) Xen is awesome