Virtual Machine Monitors: Disco and Xen Robert Grimm New York University
The Three Questions What is the problem? What is new or different? What are the contributions and limitations?
Disco Overview
Background: ccNUMA Cache-coherent non-uniform memory architecture Multi-processor with high-performance interconnect Non-uniform memory Global address space But memory distributed amongst processing elements Cache coherence How to ensure consistency between processor caches? Solutions: bus snooping, directory Targeted system: FLASH, Stanford's own ccNUMA machine
The Challenge Commodity OS's not well-suited for ccNUMA Do not scale Lock contention, memory architecture Do not isolate/contain faults More processors, more failures Customized operating systems Take time to build, lag behind hardware Cost a lot of money
The Disco Solution Add a virtual machine monitor (VMM) Commodity OS's run in their own virtual machines (VMs) Communicate through distributed protocols VMM uses global policies to manage resources Moves memory between VMs to avoid paging Schedules virtual processors to balance load [Figure: processing elements (PEs) connected by an interconnect in a ccNUMA multiprocessor]
Virtual Machine Challenges Overheads Instruction execution, exception processing, I/O Memory Code and data of hosted operating systems Replicated buffer caches Resource management Lack of information Idle loop, lock busy-waiting Page usage Communication and sharing Not really a problem anymore b/c of distributed protocols
Disco in Detail
Interface MIPS R10000 processor All instructions, the MMU, trap architecture Memory-based interface to VMM Enabling/disabling interrupts, accessing privileged registers Physical memory Contiguous, starting at address 0 I/O devices Virtual devices exclusive to VM Physical devices multiplexed by Disco Idealized interface to SCSI disks and network Virtual subnet across all virtual machines
Virtual CPUs Three modes Kernel mode: Disco Provides full access to hardware Supervisor mode: Guest operating system Provides access to protected memory segment User mode: Applications Emulation by direct execution Not for privileged instructions, access to physical memory, and I/O devices Emulated in VMM with per-VM data structures (registers, TLB) Syscalls, page faults handled by guest OS's trap handlers
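The direct-execution scheme above can be sketched in a few lines: unprivileged instructions run straight on the hardware, while privileged ones trap into the monitor, which applies them to per-VM state. This is a minimal illustration with invented names, not Disco's actual code.

```python
# Hypothetical sketch of Disco-style trap-and-emulate: unprivileged
# instructions execute directly; privileged ones trap to the monitor,
# which emulates them against per-VM data structures.

PRIVILEGED = {"mtc0", "tlbwr", "eret"}  # illustrative MIPS-ish opcodes

class VCPU:
    def __init__(self):
        # Per-VM copies of privileged state kept by the monitor
        self.priv_regs = {"status": 0, "epc": 0}
        self.trap_count = 0

    def execute(self, opcode, operand=None):
        if opcode in PRIVILEGED:
            return self.trap_to_monitor(opcode, operand)
        return f"direct:{opcode}"          # runs on the real CPU

    def trap_to_monitor(self, opcode, operand):
        self.trap_count += 1
        if opcode == "mtc0":               # move to coprocessor-0 register
            reg, value = operand
            self.priv_regs[reg] = value    # update the *virtual* register
        return f"emulated:{opcode}"

vcpu = VCPU()
assert vcpu.execute("add") == "direct:add"
assert vcpu.execute("mtc0", ("status", 1)) == "emulated:mtc0"
assert vcpu.priv_regs["status"] == 1 and vcpu.trap_count == 1
```

The point of the sketch is the asymmetry: common-case user code pays nothing, while every privileged operation pays a trap plus emulation, which is exactly where Disco's overhead shows up later in the evaluation.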
Virtual Physical Memory Adds level of translation: physical-to-machine Performed in software-reloaded TLB Based on pmap data structure: entry per physical page Requires changes in IRIX memory layout Flushes TLB when scheduling different virtual CPUs MIPS TLB is tagged with address space ID Increases number of misses, but adds second-level software TLB Guest operating system now mapped through TLB TLB is flushed on virtual CPU switches Virtualization introduces overhead
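The extra translation level can be made concrete with a small sketch (all names invented): the guest maps virtual to "physical" pages, the monitor's pmap maps physical to machine pages, and the software-reloaded TLB caches the combined virtual-to-machine mapping so the common case pays only one lookup.

```python
# Minimal sketch of Disco's two-level translation. The TLB holds the
# *composed* mapping, so pmap is consulted only on a TLB miss.

class VirtualMemory:
    def __init__(self, pmap):
        self.pmap = pmap        # physical page -> machine page (per VM)
        self.guest_pt = {}      # guest's own page table: vpn -> ppn
        self.tlb = {}           # software TLB: vpn -> mpn

    def map(self, vpn, ppn):
        self.guest_pt[vpn] = ppn

    def translate(self, vpn):
        if vpn in self.tlb:                 # TLB hit: one lookup
            return self.tlb[vpn]
        ppn = self.guest_pt[vpn]            # guest-level translation
        mpn = self.pmap[ppn]                # monitor-level translation
        self.tlb[vpn] = mpn                 # refill with combined entry
        return mpn

vm = VirtualMemory(pmap={0: 42, 1: 7})
vm.map(vpn=0x10, ppn=1)
assert vm.translate(0x10) == 7        # miss: walks both levels
assert vm.translate(0x10) == 7        # hit: served from software TLB
```

Flushing `tlb` on a virtual CPU switch is what makes misses more frequent on Disco, which is why a larger second-level software TLB helps.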
NUMA Memory Management Two optimizations Heavily accessed pages relocated to using node Read-only shared pages replicated across nodes Based on cache miss counting facility of FLASH Supported by memmap data structure For each machine page, points to physical addresses, VMs, copies [Figure: virtual, physical, and machine pages mapped across nodes]
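The migrate-vs-replicate decision can be sketched as follows. FLASH's miss counters tell the monitor which node keeps missing on which page; a hot remote page is migrated if it is writable, replicated if it is read-only. The threshold and all names here are invented for illustration.

```python
# Hedged sketch of Disco's NUMA policy: count remote cache misses per
# page per node; once a node is "hot", migrate writable pages to it,
# or replicate read-only pages onto it.

HOT = 3  # misses from one node before acting (made-up threshold)

class Page:
    def __init__(self, home, writable):
        self.home, self.writable = home, writable
        self.misses = {}        # node -> remote miss count
        self.replicas = {home}  # nodes holding a copy

    def record_miss(self, node):
        if node == self.home:
            return "local"
        self.misses[node] = self.misses.get(node, 0) + 1
        if self.misses[node] < HOT:
            return "remote"
        if self.writable:       # only one copy may exist: move it
            self.home = node
            self.replicas = {node}
            return "migrated"
        self.replicas.add(node) # read-only: safe to duplicate
        return "replicated"

rw = Page(home=0, writable=True)
for _ in range(HOT - 1):
    assert rw.record_miss(1) == "remote"
assert rw.record_miss(1) == "migrated" and rw.home == 1

ro = Page(home=0, writable=False)
for _ in range(HOT - 1):
    ro.record_miss(1)
assert ro.record_miss(1) == "replicated" and ro.replicas == {0, 1}
```

The memmap structure in the slide is what makes the "migrate" step possible: it records, for each machine page, which physical addresses and VMs point at it, so all mappings can be updated when the page moves.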
Virtual I/O Devices Specialized interface for common devices Special drivers for guest OS's: one trap per operation DMA requests are modified From physical to machine memory Copy-on-write disks Pages with same contents require only one copy [Figure: physical memory of VM 1 and VM 2 (code, data, buffer cache) mapped onto machine memory with private, shared, and free pages]
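Copy-on-write disk sharing can be sketched in a few lines (an invented structure, not Disco's code): when two VMs read the same disk block, the monitor maps both to one machine page; a write breaks the sharing by giving the writer a private copy.

```python
# Illustrative sketch of copy-on-write disk sharing across VMs.

class CowDisk:
    def __init__(self):
        self.cache = {}     # disk block -> shared machine page id
        self.next_page = 0

    def _alloc(self):
        self.next_page += 1
        return self.next_page

    def read(self, vm, block):
        # Map the block once; later readers share the same page.
        if block not in self.cache:
            self.cache[block] = self._alloc()
        return self.cache[block]

    def write(self, vm, block):
        # Break sharing: the writer gets a fresh private page.
        return self._alloc()

disk = CowDisk()
p1 = disk.read("vm1", block=99)
p2 = disk.read("vm2", block=99)
assert p1 == p2                     # one machine page for both VMs
assert disk.write("vm2", 99) != p1  # write gets a private copy
```

This is why the replicated buffer caches from the "Virtual Machine Challenges" slide mostly collapse into shared pages: identical disk blocks need only one machine copy.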
Virtual Network Interface Issue: Different VMs communicate through standard distributed protocols (here, NFS) May lead to page duplication in memory Solution: virtual subnet Ethernet-like addresses, no maximum transfer unit Read-only mapping instead of copying Supports scatter/gather What about NUMA? [Figure 4: Transparent Sharing of Pages Over NFS, for the case when the NFS reply to a read request is a data page. (1) The monitor's networking device remaps the data page from the source's machine address space to the destination. (2) The monitor remaps the data page from the driver's mbuf to the client's buffer cache; this remap is initiated by the operating system. Shows NFS server and client virtual, physical, and machine pages]
IRIX 5.3 on Disco Changed memory layout to make all pages mapped Added device drivers for special I/O devices Disco's drivers are the same as those in IRIX Patched HAL to use memory loads/stores instead of privileged instructions Added new calls Request zeroed-out memory pages, inform about freed pages Changed mbuf management to be page-aligned Changed bcopy to use remap (with copy-on-write)
Disco Evaluation
Experimental Methodology FLASH machine "unfortunately not yet available" Use SimOS Models hardware in enough detail to run unmodified code Supports different levels of accuracy, checkpoint/restore Workloads pmake, engineering, scientific computing, database
Execution Overhead Uniprocessor configuration comparing IRIX and Disco Disco overhead between 3% and 16% (!) Mostly due to trap emulation and TLB reload misses [Figure: relative execution time (idle, Disco, kernel, user) for pmake, engineering, raytrace, and database workloads on IRIX and Disco]
Diggin' Deeper From the paper, 5.2 Execution Overheads: "To evaluate the overheads of running on Disco, we ran each workload on a uniprocessor, once using IRIX directly on the simulated hardware, and once using Disco running IRIX in a single virtual machine on the same hardware. Figure 5 shows this comparison. Overall, the overhead of virtualization ranges from a modest 3% ... These slowdowns can be explained by the common path to enter and leave the kernel for all page faults, system calls and interrupts. This path includes many privileged instructions that must be individually emulated by Disco. A restructuring of the HAL of IRIX could remove most of this overhead. For example, IRIX uses the same TLB wired entry for different purposes in user mode and in the kernel."

Service      | % of System Time | Avg Time per Invocation | Slowdown | Relative Execution Time on Disco (sums to Slowdown)
DEMAND-ZERO  | 30%              | 21 µs                   | 1.42     | 0.43  0.21  0.16  0.47  0.16
QUICK-FAULT  | 10%              | 5 µs                    | 3.17     | 1.27  0.80  0.56  0.00  0.53
open         | 9%               | 42 µs                   | 1.63     | 1.16  0.08  0.06  0.02  0.30
UTLB-MISS    | 7%               | 0.035 µs                | 1.35     | 0.07  1.22  0.05  0.00  0.02
write        | 6%               | 12 µs                   | 2.14     | 1.01  0.24  0.21  0.31  0.17
read         | 6%               | 23 µs                   | 1.53     | 1.10  0.13  0.09  0.01  0.20
execve       | 6%               | 437 µs                  | 1.60     | 0.97  0.03  0.05  0.17  0.40

Table 2. Service Breakdown for the Pmake workload. This table breaks down the overheads of virtualization for the seven top kernel services of the pmake workload. DEMAND-ZERO is a demand-zero page fault, QUICK-FAULT is a slow TLB refill, UTLB-MISS is a fast TLB refill. Other than the UTLB-MISS service, the IRIX and IRIX-on-Disco configurations request the same number of services of each category. For each service, the execution time is expressed as a fraction of the IRIX time and separates the time spent in the kernel, emulating TLB writes and privileged instructions, performing monitor calls, and emulating the unmapped segments. The slowdown column is the sum of the relative execution times and measures the average slowdown for each service.

What does this table tell us? What is the problem with entering/exiting the kernel? What is the problem with placing the OS in mapped memory?
Memory Overhead Workload: 8 instances of pmake Memory partitioned across virtual machines NFS configuration uses more memory than available [Figure: memory footprint (buffer cache, IRIX data) for IRIX and for 1, 2, 4, and 8 VMs, with and without NFS]
Scalability IRIX: High synchronization and memory overheads memlock: spinlock for memory management data structures Disco: Partitioning reduces overheads What about RADIX experiment? [Figure: relative execution time (idle, Disco, sync, kernel, user) for pmake on IRIX, 8 VMs, 8 VMs + NFS, and SplashOS RADIX]
Page Migration and Replication [Figure 8: Performance Benefits of Page Migration and Replication. For each workload (Engineering, Raytrace), the figure compares the execution time of IRIX on NUMA, IRIX on Disco on NUMA with page migration and replication, and IRIX on a bus-based UMA] What does this figure tell us?
Xen in (Some) Detail
Actually, the Big Picture First [Figure 1: The structure of a machine running the Xen hypervisor, hosting a number of different guest operating systems, including Domain0 running control-plane software in a XenoLinux environment. Domain0 uses the control interface; guest OS's (XenoLinux, XenoBSD, XenoXP) with Xeno-aware device drivers run user software on top of Xen's virtual x86 CPU, virtual physical memory, virtual network, and virtual block devices, all above the hardware (SMP x86, physical memory, Ethernet, SCSI/IDE)] How does this differ from Disco?
Xen by Comparison Three main differences Less complete virtualization Domain0 to initialize/manage VMs, incl. to set policies Strong performance isolation Primary contribution Interface that is pretty close to hardware and enables low-overhead virtualization Need to change more OS code than in Disco Yet still not too much: 3000 lines for Linux
CPU Interface Xen runs in ring 0 Guest OS's run in the otherwise unused ring 1 Privileged instructions need to be processed by Xen Though, x86 makes life a little difficult how? Fast exception handlers do not require Xen interaction Must execute outside ring 0 what does Xen need to do?
Memory Interface x86 has hardware-walked page tables and TLB Guest OS's responsible for managing page tables Provide machine memory to Xen (from guest's reservation) Have direct read-only access Defer to Xen for performing (batched) updates
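The batched-update scheme above can be sketched as a queue of page-table writes flushed in one hypercall, with the hypervisor validating each entry before applying it. Names and the validation rule (guest may only map frames from its own reservation) are simplified assumptions, not Xen's exact interface.

```python
# Rough sketch of Xen-style batched page-table updates: the guest
# queues writes to its read-only page tables and flushes them with a
# single "hypercall"; the monitor validates every update.

class Hypervisor:
    def __init__(self, guest_frames):
        self.guest_frames = set(guest_frames)  # frames the guest owns
        self.page_table = {}

    def mmu_update(self, batch):
        applied = 0
        for vpn, frame in batch:
            # Validate: guest may only map frames from its reservation.
            if frame in self.guest_frames:
                self.page_table[vpn] = frame
                applied += 1
        return applied

class Guest:
    def __init__(self, xen):
        self.xen, self.pending = xen, []

    def queue_update(self, vpn, frame):
        self.pending.append((vpn, frame))

    def flush(self):  # one hypercall for the whole batch
        n = self.xen.mmu_update(self.pending)
        self.pending = []
        return n

xen = Hypervisor(guest_frames={10, 11})
g = Guest(xen)
g.queue_update(0, 10)
g.queue_update(1, 11)
g.queue_update(2, 99)        # not in the guest's reservation
assert g.flush() == 2        # two validated updates, one refused
assert xen.page_table == {0: 10, 1: 11}
```

Batching amortizes the trap cost: one crossing into the hypervisor covers many updates, while validation preserves isolation without shadow page tables.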
Device Driver nterface Shockingly, idealized hardware abstraction Virtual firewall-router Virtual network interfaces Virtual block devices
VM/VMM Communication Control transfer Hypercalls: synchronous software traps to VMM Events: asynchronous, possibly batched upcalls to VMs Data transfer through I/O rings Separate descriptors from actual data Zero-copy transfer during device I/O Support batching and re-ordering [Figure 2: The structure of asynchronous I/O rings. Request consumer: private pointer in Xen. Response producer: shared pointer updated by Xen. Request producer: shared pointer updated by guest OS. Response consumer: private pointer in guest OS. Ring regions: request queue (descriptors queued by the VM but not yet accepted by Xen), outstanding descriptors (slots awaiting a response from Xen), response queue (descriptors returned by Xen in response to serviced requests), unused descriptors]
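The four-pointer discipline in the figure can be modeled with a toy ring (all names invented): the guest produces requests and consumes responses, Xen consumes requests and produces responses, and descriptors carry ids so responses may come back out of order.

```python
# Toy model of an asynchronous I/O ring: one shared slot array, four
# pointers. Two are advanced by the guest (request producer, response
# consumer) and two by Xen (request consumer, response producer).

class IORing:
    def __init__(self, size=8):
        self.slots = [None] * size
        self.size = size
        self.req_prod = self.req_cons = 0    # guest enqueues, Xen dequeues
        self.rsp_prod = self.rsp_cons = 0    # Xen enqueues, guest dequeues

    def post_request(self, desc):            # guest side
        self.slots[self.req_prod % self.size] = desc
        self.req_prod += 1

    def take_request(self):                  # Xen side
        desc = self.slots[self.req_cons % self.size]
        self.req_cons += 1
        return desc

    def post_response(self, desc):           # Xen side, reuses freed slots
        self.slots[self.rsp_prod % self.size] = desc
        self.rsp_prod += 1

    def take_response(self):                 # guest side
        desc = self.slots[self.rsp_cons % self.size]
        self.rsp_cons += 1
        return desc

ring = IORing()
ring.post_request({"id": 1, "op": "read"})
ring.post_request({"id": 2, "op": "read"})
assert ring.take_request()["id"] == 1      # Xen may batch both
assert ring.take_request()["id"] == 2
ring.post_response({"id": 2})              # serviced out of order
assert ring.take_response()["id"] == 2
```

Because descriptors only name the data (which is transferred zero-copy elsewhere), the ring itself stays small, and both sides can batch: the guest posts several requests before trapping, and Xen returns several responses per event.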
Memory Management Virtual memory Guest OS's manage page tables (not shadows) Expose names and allocation Validated by types and reference counts Page directory/table, local/global descriptor table, writable Page directory and tables pinned i.e., they cannot be swapped why? Physical memory Controlled through balloon driver in guest OS Requests and pins pages, which are then returned to VMM May be mapped into hardware memory why? Xen publishes machine-to-physical mapping
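The balloon-driver idea can be sketched as follows (all names are invented): to reclaim memory, the monitor asks the in-guest balloon to inflate; the driver allocates and pins guest frames and surrenders them to the monitor, which may hand them to other VMs. Deflating returns frames to the guest.

```python
# Hedged sketch of a balloon driver: inflation transfers frames from
# the guest's free pool to the monitor; deflation gives them back.

class BalloonDriver:
    def __init__(self, guest_free_frames):
        self.guest_free = set(guest_free_frames)  # frames guest can allocate
        self.pinned = set()                       # frames held by balloon

    def inflate(self, n):
        """Pin n guest frames and surrender them to the monitor."""
        taken = set(list(self.guest_free)[:n])
        self.guest_free -= taken
        self.pinned |= taken
        return taken            # monitor may now give these to other VMs

    def deflate(self, frames):
        """Monitor returns frames; the guest can use them again."""
        self.pinned -= frames
        self.guest_free |= frames

b = BalloonDriver(guest_free_frames={1, 2, 3, 4})
given = b.inflate(2)
assert len(given) == 2 and given <= {1, 2, 3, 4}
assert len(b.guest_free) == 2 and b.pinned == given
b.deflate(given)
assert b.guest_free == {1, 2, 3, 4} and not b.pinned
```

The trick is that the guest's own allocator, not the monitor, decides which pages to give up, so the guest never sees pages disappear from under it.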
Performance Isolation Four "domains" PostgreSQL SPECweb99 Disk bandwidth hog (sustained dd) Fork bomb Results: Xen sees only 2-4% loss, Linux locks up What is the key ingredient?
What Do You Think?