OS Virtualization
Frank Hofmann
OP/N1 Released Products Engineering
Sun Microsystems UK
Overview
Different approaches to virtualization
  > Compartmentalization
  > System Personalities
  > Virtual Machines
Efficiency considerations
Hypervisors/VMMs
  > OS-level requirements for running as guest
  > CPU support for efficient hypervisor implementation
Multiple Machines
[Diagram: Machine 1 and Machine 2, each with its own Operating System that has direct access to its own System Hardware. Applications (App 1, App 2, App 3) enter their OS via syscalls and communicate via IPC and RPC.]
Compartments
[Diagram: Compartment 1 and Compartment 2 on top of a single Operating System that has direct access to the System Hardware. Applications in each compartment enter the OS via syscalls and communicate via IPC and RPC.]
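System Personalities / Branding
[Diagram: Compartment 1 and Compartment 2 on top of a single Operating System that has direct access to the System Hardware. Applications in Compartment 1 issue syscalls of type 1, handled by Personality type 1; applications in Compartment 2 issue syscalls of type 2, handled by Personality type 2. Applications communicate via IPC and RPC.]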
Virtual Machines
[Diagram: VM 1 runs Operating System 1 and VM 2 runs Operating System 2. Applications enter their guest OS via syscalls; each guest OS enters the Hypervisor / Virtual Machine Monitor via hypercalls. Only the hypervisor has direct access to the System Hardware.]
Full Virtualization
[Diagram: VM 1 runs Operating System 1 and VM 2 runs Operating System 2. Applications enter their guest OS via syscalls. Direct hardware access attempted by a guest OS fails and lands in a trap handler inside the Hypervisor / Virtual Machine Monitor, which interposes on privileged access attempts. Only the hypervisor has direct access to the System Hardware.]
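The trap-and-interpose flow in this slide can be sketched as a toy model in C. This is only an illustration under stated assumptions: the three-instruction set, `struct vcpu`, `vmm_trap` and `run_guest` are hypothetical names, and the "trap" is modelled as an ordinary function call rather than a hardware exception.

```c
#include <stdbool.h>

/* Toy instruction set (hypothetical). CLI/STI stand in for instructions
 * that touch privileged state; NOP stands in for everything else. */
enum op { OP_NOP, OP_CLI, OP_STI };

/* Per-guest virtual CPU state kept by the VMM (hypothetical layout). */
struct vcpu {
    bool virt_if;     /* the guest's virtual interrupt-enable flag */
    unsigned traps;   /* how many times the guest trapped into the VMM */
};

static bool is_privileged(enum op o)
{
    return o == OP_CLI || o == OP_STI;
}

/* VMM trap handler: apply the effect to the virtual state only,
 * never to the real hardware. */
static void vmm_trap(struct vcpu *v, enum op o)
{
    v->traps++;
    if (o == OP_CLI)
        v->virt_if = false;
    else if (o == OP_STI)
        v->virt_if = true;
}

/* Run guest code: privileged instructions fail to execute directly and
 * are handed to the VMM; all others would run natively (not modelled). */
void run_guest(struct vcpu *v, const enum op *code, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        if (is_privileged(code[i]))
            vmm_trap(v, code[i]);
    }
}
```

The sketch only works because every instruction that touches the interrupt flag is assumed to trap; the cost of the scheme is one VMM entry per such instruction.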
System Personalities
Simulate the presence of a foreign OS to an application
Application code runs natively
Foreign system calls cause exceptions
Exception handler implements foreign OS behaviour
Issues: what if
  > two OSs use the same system call mechanism?
  > foreign and native applications can't tolerate each other?
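One way to picture the exception handler is a dispatch table per personality, so that the same system call number can mean different things depending on who issued it. The following is a minimal sketch, not any real kernel's code: the personality names, syscall numbers, handlers and return values are all made up for illustration.

```c
#include <stddef.h>

/* Hypothetical personalities; not taken from any real kernel. */
enum personality { PERS_NATIVE = 0, PERS_FOREIGN = 1, PERS_COUNT };

typedef long (*syscall_fn)(long arg);

#define MAX_SYSCALL 4
#define ERR_NOSYS   (-1L)   /* illustrative "no such system call" value */

/* Illustrative handlers; each returns a distinct value so a caller can
 * tell which implementation ran. */
static long native_getpid(long arg)  { (void)arg; return 100; }
static long foreign_getpid(long arg) { (void)arg; return 200; }
static long foreign_uname(long arg)  { return 300 + arg; }

/* One table per personality. Number 1 exists in both tables but is
 * bound to a different implementation in each. */
static const syscall_fn tables[PERS_COUNT][MAX_SYSCALL] = {
    [PERS_NATIVE]  = { [1] = native_getpid },
    [PERS_FOREIGN] = { [1] = foreign_getpid, [2] = foreign_uname },
};

/* Entry point reached when an application's system call traps. */
long syscall_trap(enum personality p, int nr, long arg)
{
    if ((int)p < 0 || p >= PERS_COUNT || nr < 0 || nr >= MAX_SYSCALL)
        return ERR_NOSYS;
    syscall_fn fn = tables[p][nr];
    return fn ? fn(arg) : ERR_NOSYS;
}
```

Selecting the table by the caller's personality rather than by the trap mechanism sidesteps the first issue on the slide: two OSs may share one mechanism as long as the kernel knows which personality the calling process has.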
System Personalities Implementation
Implementation options:
Foreign syscall handled by kernel (module):
  > FreeBSD LKP
  > Linux ibcs2
Foreign syscall handled by user interposition library:
  > Wine
  > Lxrun
Foreign syscall handled by user/kernel hybrid:
  > BrandZ
System Personalities Issues
Implementation problems:
Hosted application expects a certain filesystem structure
  > Simple for Wine: it needs drive letter emulation anyway
  > Pathname mapping: foreign installation under e.g. /compat
  > Simpler: run the hosted environment as a compartment
Foreign OS and host may use the same syscall method (definitely the case on AMD64!)
  > Use heuristics to determine which implementation to call
  > Per-compartment system call interface. Simpler: no heuristics needed!
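The pathname mapping option can be sketched as a small C helper that prefixes every absolute path requested by a hosted application with the root of the foreign installation. The slide only says "e.g. /compat"; the `/compat/foreign` subdirectory and the function name `map_foreign_path` are assumptions made for this example.

```c
#include <stdio.h>

/* Assumed location of the foreign installation. The slide only gives
 * "/compat" as an example; the subdirectory name is hypothetical. */
#define COMPAT_ROOT "/compat/foreign"

/*
 * Rewrite an absolute path so that it resolves inside the foreign
 * installation tree. Returns 0 on success, -1 if the path is not
 * absolute or the result does not fit into `out`.
 */
int map_foreign_path(const char *path, char *out, size_t outlen)
{
    if (path == NULL || path[0] != '/')
        return -1;
    int n = snprintf(out, outlen, "%s%s", COMPAT_ROOT, path);
    if (n < 0 || (size_t)n >= outlen)
        return -1;   /* formatting error or truncated result */
    return 0;
}
```

A real implementation would also have to handle relative paths, ".." components and symbolic links that lead out of the tree, which this sketch does not attempt. Running the hosted environment as a compartment with its own root avoids that bookkeeping, which is the "simpler" option on the slide.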
Branded Zones I
Paravirtualization
Old technique, used as far back as the 1960s
Advantages:
  > Lightweight and fast
  > Allows virtualizing architectures that do not support full virtualization
Disadvantages:
  > Requires porting the guest OS to use hypercalls instead of sensitive instructions
  > No VM nesting (hypervisor running hypervisor)
  > No support for legacy operating systems
Paravirtualization: Xen Hypervisor
Xen hypercalls (excerpt):
<xen/include/public/xen.h>

    /*
     * XEN "SYSTEM CALLS" (a.k.a. HYPERCALLS).
     */

    /*
     * x86_32: EAX = vector; EBX, ECX, EDX, ESI, EDI = args 1, 2, 3, 4, 5.
     *         EAX = return value
     *         (argument registers may be clobbered on return)
     * x86_64: RAX = vector; RDI, RSI, RDX, R10, R8, R9 = args [1 ... 6]
     *         RAX = return value
     *         (argument registers not clobbered on return; RCX, R11 are)
     */
    #define HYPERVISOR_set_trap_table  0
    #define HYPERVISOR_mmu_update      1
    #define HYPERVISOR_set_gdt         2
    #define HYPERVISOR_stack_switch    3
    #define HYPERVISOR_set_callbacks   4
    #define HYPERVISOR_fpu_taskswitch  5
    #define HYPERVISOR_sched_op        6
    #define HYPERVISOR_dom0_op         7
    [...]
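To make the register convention in the excerpt concrete, here is a toy model in C of how a hypervisor might dispatch on the vector and pass the result back in the same register. It is a sketch under stated assumptions, not Xen code: `struct hcall_regs`, the `toy_*` handlers, `toy_hypercall_entry` and the error value are hypothetical; only the two vector numbers are copied from the excerpt.

```c
#include <stddef.h>

/* Vector numbers copied from the xen.h excerpt. */
#define HYPERVISOR_stack_switch 3
#define HYPERVISOR_sched_op     6

/* Toy model of the x86_64 convention: vector in RAX, args in
 * RDI, RSI, RDX, R10, R8, R9, return value back in RAX. */
struct hcall_regs {
    unsigned long rax;
    unsigned long rdi, rsi, rdx, r10, r8, r9;
};

typedef long (*hcall_fn)(const struct hcall_regs *r);

#define TOY_ENOSYS 38   /* illustrative "not implemented" error code */

/* Hypothetical handlers that do no real work. */
static long toy_stack_switch(const struct hcall_regs *r) { (void)r; return 0; }
static long toy_sched_op(const struct hcall_regs *r)     { return (long)r->rdi; }

/* Table indexed by vector; unlisted slots are NULL. */
static const hcall_fn hcall_table[] = {
    [HYPERVISOR_stack_switch] = toy_stack_switch,
    [HYPERVISOR_sched_op]     = toy_sched_op,
};

/* Entry point: look up the vector, run the handler, store the result. */
void toy_hypercall_entry(struct hcall_regs *r)
{
    size_t n = sizeof hcall_table / sizeof hcall_table[0];
    hcall_fn fn = (r->rax < n) ? hcall_table[r->rax] : NULL;
    r->rax = (unsigned long)(fn ? fn(r) : -(long)TOY_ENOSYS);
}
```

The structure is deliberately the same as a system call table, which is why the header itself describes hypercalls as Xen "system calls".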
Paravirtualization: sun4v hypervisor
sun4v hypercalls (excerpt):
<usr/src/uts/sun4v/sys/hypervisor_api.h>

    /*
     * Trap types
     */
    #define FAST_TRAP        0x80    /* Function # in %o5 */
    [...]
    /*
     * Function numbers for FAST_TRAP.
     */
    #define HV_MACH_EXIT     0x00
    #define HV_MACH_DESC     0x01
    #define HV_CPU_YIELD     0x12
    #define CPU_QCONF        0x14
    #define HV_CPU_STATE     0x17
    #define MMU_TSB_CTX0     0x20
    #define MMU_TSB_CTXNON0  0x21
    #define MMU_DEMAP_PAGE   0x22
    #define MMU_DEMAP_CTX    0x23
    #define MMU_DEMAP_ALL    0x24
    #define MAP_PERM_ADDR    0x25
    [...]
Full Virtualization
Transparent virtualization: possible if all sensitive instructions are privileged.
  > Sensitive instruction: modifies/accesses critical state
  > Privileged instruction: causes a trap when run unprivileged
Formal proof exists (Popek/Goldberg, 1974)
Classical x86 is not fully virtualizable: it has sensitive instructions that don't trap in user mode.
Some of VMware's overhead comes from trying to detect these instructions!
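The condition on this slide (every sensitive instruction must also be privileged) amounts to a subset check, sketched below in C. The `struct insn` type, the function name and the table are illustrative; the table is a small, incomplete sample of classical x86 instructions, not a full classification.

```c
#include <stdbool.h>
#include <stddef.h>

struct insn {
    const char *name;
    bool sensitive;    /* modifies or accesses critical state */
    bool privileged;   /* traps when executed unprivileged */
};

/*
 * Return the first instruction that is sensitive but not privileged,
 * or NULL if the set satisfies the condition for transparent
 * virtualization.
 */
const struct insn *first_violation(const struct insn *set, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (set[i].sensitive && !set[i].privileged)
            return &set[i];
    }
    return NULL;
}

/* Small, incomplete sample of classical x86 (pre-VT/SVM) instructions. */
static const struct insn x86_sample[] = {
    { "ADD",  false, false },
    { "HLT",  true,  true  },
    { "LGDT", true,  true  },
    { "POPF", true,  false },  /* interrupt-flag change is silently dropped in user mode */
    { "SGDT", true,  false },  /* reveals the real GDT location without trapping */
};
```

Instructions such as POPF and SGDT are why a hypervisor on classical x86 cannot simply wait for traps and instead has to find these instructions in guest code by other means.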
Pacifica and Vanderpool
Transparent virtualization support for x86 by AMD/Intel
Instruction set extension
New operating mode: hyperprivileged
Allows intercept on access to (hypervisor redirect):
  > Control register / MSR / IO ports
  > Interrupts, Traps, Exceptions
  > Sensitive instructions
  > Pagetables, DMA (Pacifica only)
Hypercall/return instructions
Virtual Machine state management
References
BrandZ/Xen OpenSolaris communities:
  > http://www.opensolaris.org/os/community/brandz
  > http://www.opensolaris.org/os/community/xen
Xen Hypervisor:
  > XenSource (binaries): http://xensource.com
  > Source code: http://xen.sf.net
Collection of documents on Virtualization:
  > http://www.cis.upenn.edu/~cis700-6/04f/
Popek/Goldberg, Formal Requirements for Virtualizable Third Generation Architectures:
  > http://portal.acm.org/citation.cfm?id=361011.361073
We live by what we believe, not by what we see. St. Paul 2 Corinthians 5:7
OS Virtualization
Frank Hofmann
Frank.Hofmann@sun.com