Virtual Machines

Virtual Machine (VM)
  Layered model of computation
    Software and hardware divided into logical layers
    Layer n
      Receives services from server layer n - 1
      Provides services to client layer n + 1
    Layers interact through well-defined programming interface
  Virtual layer
    Software emulation of hardware or software layer n
    Transparent to layer n + 1
      Provides service to layer n + 1 as expected from real layer n
    Virtual layer n can run at some layer m of the real system (m need not equal n)
  [Figure: virtual system layers n + 1, n, n - 1 beside real system layers n + 1, virtual n = m, m - 1]

Examples of Virtual Systems
  Web browser exchanges data with server
    [Figure: client stack (browser, local OS, protocol stack) and server stack (web server, server OS, protocol stack); the browser-to-server link is virtual, the link through the network is real]
  Cloud computing
    Virtual
      Service level agreement (SLA) specifies infrastructure requirements
      User sees hardware / software configuration / performance
    Real
      Provider assembles virtual configuration
      Meets SLA requirements
      May be implemented in any way

Types of Virtual Machine
  Process virtual machine
    VM provides application interpretation above OS
  Hosted virtual machine
    Virtual machine monitor (VMM)
    Runs above primary OS / below guest OS
    Provides guest OS with software emulation of real hardware system
  System virtual machine
    Emulation of system-level hardware environment
    Runs above physical hardware and below one or more OSs
  [Figure: layer stacks for basic system, system VM, hosted VM, process VM]
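
The transparency property of a virtual layer can be illustrated with a minimal sketch. All names here (BlockStore, RealStore, VirtualStore, client) are hypothetical and chosen only for illustration; the point is that the layer n + 1 code is identical whether layer n is real or emulated on top of another layer.

```python
from typing import Protocol


class BlockStore(Protocol):
    """Interface that layer n offers to its client, layer n + 1."""

    def read(self, block: int) -> bytes: ...
    def write(self, block: int, data: bytes) -> None: ...


class RealStore:
    """Stands in for a real layer n (for example, a disk)."""

    def __init__(self) -> None:
        self._blocks = {}  # block number -> bytes

    def read(self, block: int) -> bytes:
        return self._blocks.get(block, b"")

    def write(self, block: int, data: bytes) -> None:
        self._blocks[block] = data


class VirtualStore:
    """Virtual layer n: same interface, implemented on top of another layer."""

    def __init__(self, backing: BlockStore, offset: int) -> None:
        self._backing = backing
        self._offset = offset  # remaps virtual block numbers to backing blocks

    def read(self, block: int) -> bytes:
        return self._backing.read(block + self._offset)

    def write(self, block: int, data: bytes) -> None:
        self._backing.write(block + self._offset, data)


def client(store: BlockStore) -> bytes:
    """Layer n + 1: identical code whether the store is real or virtual."""
    store.write(0, b"hello")
    return store.read(0)
```

The client sees block 0 in both cases; only the virtual layer knows that the data actually lands at a different block of the backing store.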
Process VM Example
  Java
    Designed for program portability between platforms
    Provides standard interface to software
    Java VM located above a standard OS
      Interface to hardware is implementation dependent
      I/O operations performed by calls to OS
    Java compiled to bytecode
    Bytecode usually run (interpreted) in Java VM
  Java without VM
    Java bytecode processor in IBM mainframes
    Native machine language (ISA) is Java bytecode
    Execute Java bytecode without interpretation
  http://java.sun.com/docs/books/tutorial/getstarted/intro/definition.html

Hosted VM Example
  Guest OS over host OS
    DOS command line interface over Windows
    Windows allocates 1 MB virtual memory space
    Copies DOS kernel into low memory
    System calls handled by guest DOS kernel
    DOS accesses to hardware
      Trapped and served by Windows host OS
      Responses returned to DOS
  Concurrent DOS windows
    Multiple allocations of 1 MB virtual memory spaces
  DEBUG running in virtual DOS machine
    Sees 1 MB memory space allocated by Windows
    Register values
      Windows emulates real values to DOS
      Debug emulates DOS values to user
  Parallels, VirtualBox, VMware, DOSBox, ...
    Host Windows, Linux, DOS as guest OSs over host OS
  [Figure: Windows and debug running in virtual 8086 mode]

Virtual Machine in IBM z/990
  Mainframe
    CPUs, I/O system, internal communication network
  VMM (hypervisor)
    Operator console for partitioning/configuring CPUs and I/O
    Provides hardware emulation as abstraction to OS layer
  Logical partition (LPAR) runs separate instance of operating system
    Run z/OS, MVS, VM, Unix, Linux, Windows instances in parallel
    Non-Windows OS versions expect to see hypervisor (not hardware)
  User
    User sees single-user interface provided by one OS
  [Figure: groups of users above OS instances, one OS per LPAR, above Systems Manager and Hypervisor]

VM as System Management Tool
  Isolate user environments on single hardware platform
    Multiple copies of single operating system running independently
    Multiple operating systems running concurrently
    Maintain higher security
  Resource management
    Hardware redundancy
    High availability
    Recovery management
  Hardware pooling
    Assemble hardware cluster
    Map applications to hardware efficiently
    Load balancing
      Remap applications to hardware
  [Figure: applications App1, App2, App3 mapped onto servers]
z/990 Parallel Sysplex Model
  Parallel Sysplex
    Merge 2 to 32 instances of z/OS into a single system
    OSs divide work and data among LPARs
    High capacity for very large workloads
    Resource sharing
    Dynamic workload balancing
  Geographical diversity
    Coupled LPARs on remote physical systems
    Physical backup
    Automatic failure recovery
    Continuous availability
  [Figure: two physical systems joined by a Coupling Facility; each has users, OS instances, LPARs, and a Systems Manager (processors, RAM, I/O)]

Virtualization for Server Systems
  Old file server model
    Run one application per physical server
    Server specified for worst case load
    Large number of typically underutilized servers
    Huge aggregate space capacity
  Competition from mainframes
    Dynamic load balancing
    Centralized power, cooling, monitoring, backup
    High SAR: scalability, availability, reliability
    Lower cost per served client than server farm
  Virtualization in server
    Partition hardware resources to run independent applications
    Intel virtualization
      IA-32 and IA-64 ISA support
      I/O chipset support

HP Virtual Partitions (vPars)
  [Figure: vPars boot order; diagram not recovered]
  Source: Hewlett-Packard, "Installing and Managing HP-UX Virtual Partitions (vPars)"

System VM Organization
  Hypervisor
    Virtual machine monitor (VMM)
    Lowest layer above physical hardware (host)
      Uniprocessor or multiprocessor system
    Creates virtual machine (VM) environments for guest OSs
    Allocates physical host resources to virtual resources
  VM overhead
    Processor-intensive applications: low overhead
      Infrequent use of OS calls
      Most instructions run directly on hardware
    I/O-intensive applications: high overhead
      Frequent use of OS calls
      OS calls for I/O services run in emulation
    I/O-limited applications
      Program throughput limited by I/O latency
      Emulation adds relatively small overhead
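
The three VM overhead cases can be made concrete with a simple cost model. This is only a sketch under stated assumptions: the function names, the trap fraction, and the trap cost are illustrative and do not come from the slides, and I/O waiting is assumed not to overlap with computation.

```python
def slowdown(trap_fraction: float, trap_cost: float) -> float:
    """Average cost per instruction relative to native execution.

    trap_fraction: share of instructions that need VMM emulation (assumed).
    trap_cost: cost of one emulated instruction, measured in native
               instructions (assumed).
    """
    return (1.0 - trap_fraction) + trap_fraction * trap_cost


def run_time(cpu_seconds: float, io_wait_seconds: float,
             trap_fraction: float, trap_cost: float) -> float:
    """Total time when I/O waiting is not overlapped with computation."""
    return cpu_seconds * slowdown(trap_fraction, trap_cost) + io_wait_seconds
```

With an assumed trap cost of 1000 native instructions, a trap fraction of 0.0001 gives a slowdown of about 1.1 (processor-intensive case), and a fraction of 0.01 gives about 11 (I/O-intensive case). The same 0.01 fraction in a job that spends 99 of every 100 native seconds waiting for I/O lengthens the run by only about 10 percent (I/O-limited case).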
VMM Requirements
  Hardware abstraction
    Guest environment must replicate hardware
    VMM must present well-defined software interface to OS
  Protection
    Isolate guests from one another
    Protect VMM from guest OS and application software
    Guest software cannot change allocation of physical resources
  Privilege
    VMM runs in kernel mode
    Guest OSs and applications run in user mode
  Hardware support for virtualization
    Primitives built into mainframe ISA
    Any OS or application access to hardware causes trap to VMM
    VMM catches every access to hardware abstraction layer (HAL)

Virtualization Awareness
  Virtualization-aware guest OS
    OS written to run above VMM/hypervisor
      Expects to interact with virtual host
      Does not expect full or direct control of physical hardware
    OS code interfaces with hypervisor code
      No need to remap (bluff) pointers intended for real hardware
      May be presented with view of real system for limited operations
    Example: mainframe OS
      Writes I/O outputs to hypervisor interface
      Does not attempt to configure I/O hardware devices
      Particular OS may be given direct control of particular I/O device
  Virtualization-unaware guest OS
    OS written to run above physical hardware
    Expects full and direct control of real hardware
    Requires extensive intervention and remapping by VMM

Emulation Activities
  OS sees hardware through operations
    OS instructions cause CPU to initiate memory and I/O operations
    I/O devices initiate DMA operations and interrupts
  CPU memory access
    Real: read data or instruction; write data
    Emulation: translate data/instruction from guest to host format; remap address space; read/write to real host memory
  CPU I/O device access
    Real: read data or instruction; write data
    Emulation: translate data/instruction from guest to host format; remap I/O port space; read/write to real host I/O device
  I/O device actions
    Real: DMA or IRQ
    Emulation: VMM manages I/O device DMA; translate interrupt handlers from guest format to host format

Full/Partial System Emulation
  Full system emulation
    VMM intervenes in every access to hardware
      CPU
      Chipset
      I/O devices
    Translates guest ISA to host ISA
    Translates memory size and organization
    Translates guest configuration instructions to host
    Translates guest driver to host driver
    CPU emulation example
      Run Nintendo game on PC
      Translate each Nintendo instruction to IA-32 instruction set
  Partial system emulation
    Part of host hardware presented to OS unchanged
    VMM passes guest operations to host with minimal intervention
    Most system VMs emulate subset/superset of real host hardware
    CPU emulation only in special cases
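
The remapping steps listed under Emulation Activities can be sketched as follows. This is a toy model, not any real VMM's interface: the class and method names are hypothetical, guest memory is assumed to be one contiguous region of host memory, and host devices are stood in for by a dictionary.

```python
class GuestAccessError(Exception):
    """Raised when a guest touches an address or port outside its allocation."""


class VmmSketch:
    """Toy VMM that remaps guest memory and I/O ports onto host resources."""

    def __init__(self, host_memory: bytearray, base: int, size: int,
                 port_map: dict) -> None:
        self.host_memory = host_memory
        self.base = base          # host offset of guest physical address 0
        self.size = size          # bytes allocated to this guest
        self.port_map = port_map  # guest I/O port -> host I/O port
        self.host_ports = {}      # stand-in for real device registers

    def _remap(self, guest_addr: int) -> int:
        if not 0 <= guest_addr < self.size:
            raise GuestAccessError(hex(guest_addr))
        return self.base + guest_addr

    def read_mem(self, guest_addr: int) -> int:
        return self.host_memory[self._remap(guest_addr)]

    def write_mem(self, guest_addr: int, value: int) -> None:
        self.host_memory[self._remap(guest_addr)] = value & 0xFF

    def out_port(self, guest_port: int, value: int) -> None:
        if guest_port not in self.port_map:
            raise GuestAccessError(f"port {guest_port:#x}")
        self.host_ports[self.port_map[guest_port]] = value & 0xFF
```

The bounds check in `_remap` is where the protection requirement shows up: a guest can only reach the slice of host memory that the VMM allocated to it.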
Software Emulation of I/O
  Advantages
    VMM provides emulation of widely supported device hardware
    Guest OS runs available device drivers without modification
  Difficulties
    Requires very accurate device emulation
    Includes hardware revisions and "bug emulation"
  Performance issues
    VMM intervention on every guest access to I/O device
      Context switch from guest to VMM
      VMM emulates I/O access and accesses real I/O device
      Context switch back to guest with response
      Adds considerable overhead
    Emulation is compute-intensive: increases CPU utilization
  Least-bad case
    Virtual device = real device
    Remap I/O ports: no change to driver operation

Bootstrap Process in System VM
  Workstation without VMM
    System boot
      CPU loads initial system loader (ISL)
      ISL points to system boot device
      Boot device contains OS loader
    Device discovery
      OS writes to host I/O space
      Chipset and I/O devices respond
      OS loads drivers for host devices
      OS provides user interface
  Workstation with VMM
    System boot
      CPU loads initial system loader (ISL)
      ISL points to system boot device
      Boot device contains VMM loader
    Device discovery
      VMM writes to host I/O space
      Chipset and I/O devices respond
      VMM loads drivers for host devices
      VMM provides administrator interface
    Secondary boot
      Administrator configures VM partitions
      Administrator points to device containing OS boot image
      OS boots into partition
    Device discovery
      OS loader writes to virtual I/O space
      VMM responds for I/O devices
      OS loads drivers for virtual devices
      OS provides user interface

Virtualization Difficulties for IA-32
  IA-32 designed to provide hardware support to OS
    Memory segmentation
    Virtual memory and paging
    Task management
    Interrupt management
    Protection and privilege for segmentation, paging, interrupts
  Workaround virtualization
    Treat OS like user application
    Can create a kludge on IA-32 systems
  IA-32 operating systems
    Expect to have highest privilege
    Can easily discover their lower privilege
  [Figure: virtual user/kernel levels mapped onto real privilege levels]

Resource Compression
  OS manages resources using IA-32 system tables
    Assigns pointer to page table root (directory)
    Manages page table entries
    Manages memory segmentation with descriptor tables limited to 8 K entries
      Global descriptor table (GDT)
        Map segment pointer to virtual address
        Define segment type (code, data, system) and privilege level
      Interrupt descriptor table (IDT)
        Map interrupts and traps to service routines
  Address-space compression
    VMM must reserve part of guest virtual memory for management
    OS expects to see the full virtual memory space
  Table resource compression
    VMM requires entries in GDT and IDT for management of OS
    VMM must prevent OS access to its descriptors
    OS expects full control of all 8 K table entries
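
The 8 K limit on descriptor table entries follows from the layout of the 16-bit IA-32 segment selector: a 13-bit table index, a 1-bit table indicator (GDT or LDT), and a 2-bit requested privilege level. The sketch below decodes a selector; the function and type names are hypothetical.

```python
from typing import NamedTuple


class Selector(NamedTuple):
    index: int   # descriptor table entry number (13 bits)
    table: str   # "GDT" or "LDT"
    rpl: int     # requested privilege level, 0 (highest) to 3 (lowest)


def decode_selector(value: int) -> Selector:
    """Split a 16-bit segment selector into its three fields."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("selector is a 16-bit value")
    return Selector(index=value >> 3,
                    table="LDT" if value & 0b100 else "GDT",
                    rpl=value & 0b11)


MAX_ENTRIES = 1 << 13  # 13 index bits allow 8192 descriptors per table
```

Because the VMM and the guest OS share the same 13-bit index space, every descriptor the VMM reserves for itself is one the guest can no longer use, which is the table resource compression problem.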
Ring Aliasing
  Privilege rings
    Memory segments assigned privilege from 0 (highest) to 3 (lowest)
      Stored in segment descriptor (table entry defining segment)
    Access rights for code limited to segments of same or lower privilege
      Copied into code segment selector (pointer to segment via descriptor)
    User mode ~ ring 3, kernel mode ~ ring 0
    CPL: privilege level of code segment
    DPL: privilege level for data access or branch target
  Ring aliasing
    Deprivileging
      Run VMM at ring 0 and OS at ring 1
    Issues
      Paging restricted to two privilege levels
      4-level privilege not supported in 64-bit systems
      OS can read its CPL from code segment selector
      Guest OS can determine that it does not control the CPU
  [Figure: rings 0 to 3; access granted when CPL <= DPL, denied when CPL > DPL]

Non-Faulting Access to Privileged State
  Privileged registers
    Control configuration of hardware systems
      GDTR: pointer to GDT
      IDTR: pointer to IDT
      LDTR: pointer to LDT
      TR: pointer to current task segment
    VMM must
      Intercept OS access to privileged registers
      Provide virtual values determined for guest environment
  Access to privileged registers in IA-32
    Access by unprivileged software usually prevented
      Causes protection fault
      VMM emulates response to guest instruction
    Some unprivileged instructions access privileged state and do not fault
      On user access to system state
        Protection fault on write
        No fault on read

System Calls and Interrupts
  System calls
    Code in ring 3 invokes OS in ring 0
      Requires indirect mechanism (call gate)
      Redirects to hidden ring 0 address
      VMM must emulate call gates
    SYSENTER instruction provides fast calls to ring 0
      Will call VMM instead of guest OS
    SYSEXIT instruction ends SYSENTER routine
      Faults to ring 0 if executed from lower privilege
    VMM must emulate response to SYSENTER/SYSEXIT
  Interrupts
    Interrupts can be masked by controlling interrupt flag (IF)
    VMM must mask interrupts and handle interrupts by emulation
    Some OSs toggle IF frequently, requiring many VMM interventions

Intel Virtualization Technology (VT)
  Virtual machine monitor
    (3rd party) VMM software boots instead of OS
    VMM configures hardware resources among guest systems
    Remaps hardware locations to virtual pointers for guests
    OSs boot within guest partitions
  Hardware support for virtualization
    VT-enabled processors alternate between operating modes
      Root mode grants full hardware control to VMM
      Non-root mode presents virtual pointers to guest OS
    VT-enabled chipset
      Grants control of I/O to root mode
      Remaps I/O channels for non-root mode
  Operating system
    Sees virtual machine as real system
    Operates in ring 0 for maximum privilege
    Sends instructions to hardware pointers in usual way
  [Figure: VMX non-root holds user code (ring 3, user privilege) and OS (ring 0, virtual full privilege); VMX root holds VMM (real full privilege)]
  http://www.intel.com/technology/itj/2006/v10i3/index.htm
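
The non-faulting access problem described earlier for deprivileged guests can be modelled in a few lines. This is a toy model with hypothetical class names: it assumes only that loading GDTR (LGDT) faults outside ring 0 while storing GDTR (SGDT) does not, so a VMM that relies on faults never sees the read.

```python
class ProtectionFault(Exception):
    """General-protection fault raised by the modelled CPU."""


class Ia32Model:
    """Toy CPU: GDTR writes are privileged, GDTR reads are not."""

    def __init__(self, gdtr: int) -> None:
        self.gdtr = gdtr
        self.cpl = 0

    def load_gdtr(self, value: int) -> None:   # models LGDT
        if self.cpl != 0:
            raise ProtectionFault("LGDT outside ring 0")
        self.gdtr = value

    def store_gdtr(self) -> int:               # models SGDT
        return self.gdtr  # no privilege check, so no fault and no VMM trap


class DeprivilegingVmm:
    """Runs the guest OS at ring 1 and emulates faulting instructions."""

    def __init__(self, cpu: Ia32Model) -> None:
        self.cpu = cpu
        self.virtual_gdtr = 0

    def guest_load_gdtr(self, value: int) -> None:
        self.cpu.cpl = 1
        try:
            self.cpu.load_gdtr(value)
        except ProtectionFault:
            self.virtual_gdtr = value  # VMM records the guest's intended value

    def guest_store_gdtr(self) -> int:
        self.cpu.cpl = 1
        return self.cpu.store_gdtr()   # real value leaks to the guest
```

A guest that writes one GDTR value and reads back another can conclude that it does not control the CPU. Running the guest in a separate hardware mode, as VT does, removes the need to rely on faults for this case.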
System Issues in Virtualization
  CPU virtualization support
    Handles operations initiated by CPU
    Memory access by guest software
      VMM assigns virtual address space to guest
    I/O access by guest software
      VMM translates driver output for host device
  Chipset virtualization support
    Handles operations initiated by I/O device
    Interrupts and DMA accesses by I/O device
      Intercepted by VMM and remapped
  [Figure: CPU, ROM, and RAM attached to PCI host-to-bus bridge (bus controller); PCI (expansion) bus with graphics, disk, and I/O devices; ISA bridge to ISA/EISA bus with I/O devices]

VT-x for IA-32 Processor Virtualization
  Virtual machine extensions (VMX)
    VMX root operation
      Operating mode designed for VMM
      Grants highest privilege access to host CPU hardware state
    VMX non-root operation
      Operating mode designed for guest OS
      Presents OS with virtual host configured by VMM
      OS sees standard ring 0 access to virtual IA-32 resources
      OS access to privileged state trapped by VMM
  Mode transitions
    VM entry: VMX root operation -> VMX non-root operation
    VM exit: VMX non-root operation -> VMX root operation
  [Figure: VM entry and VM exit arrows between VMX root (VMM) and VMX non-root (user, OS) on the host]

Virtual Machine Control Structure
  Virtual-machine control structure (VMCS)
    Used for mode transition management
    VM entry
      Saves processor state to VMCS host-state area
      Loads processor state from VMCS guest-state area
    VM exit
      Saves processor state to VMCS guest-state area
      Loads processor state from VMCS host-state area
  VMCS host-state area
    Segment register selectors for VMM operations
    Privileged system table pointers (GDTR, IDTR, TR, page table root)
  VMCS guest-state area
    Segment register selectors for OS operations
    Virtual system table pointers determined by VMM
      VMM physical address space not mapped to guest virtual address space
    Interrupt flag (IF)

VMCS Details
  Referenced by physical address
    No page table entry in any guest address space
    Location determined by VMM software
  VMCS structure
    Not determined by architecture
    Defined as set of VMCS access host instructions
    VMM author chooses implementation
  VM entry
    Loads table pointers from VMCS
    Pointer updates cause context shift to VM process
    VMM can optionally inject virtual event (interrupt) to cause VM response
  VM exit
    VM saves context to memory
    All VMs exit to common entry point in VMM
    VM exit records details of reason for exit in VMCS
    VMM provides detailed response to VM exit
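
The VM entry and VM exit state swaps described for the VMCS can be sketched as below. The field and function names are hypothetical, the state is reduced to four fields, and the model follows the save/load description in these slides rather than the exact hardware behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class CpuState:
    gdtr: int = 0
    idtr: int = 0
    page_table_root: int = 0
    interrupt_flag: bool = False


@dataclass
class Vmcs:
    host_state: CpuState = field(default_factory=CpuState)
    guest_state: CpuState = field(default_factory=CpuState)
    exit_reason: str = ""


@dataclass
class Cpu:
    state: CpuState = field(default_factory=CpuState)
    root_mode: bool = True


def vm_entry(cpu: Cpu, vmcs: Vmcs) -> None:
    """VMX root -> non-root: save VMM context, load guest context."""
    vmcs.host_state = cpu.state
    cpu.state = vmcs.guest_state
    cpu.root_mode = False


def vm_exit(cpu: Cpu, vmcs: Vmcs, reason: str) -> None:
    """VMX non-root -> root: save guest context, record why, load VMM context."""
    vmcs.guest_state = cpu.state
    vmcs.exit_reason = reason
    cpu.state = vmcs.host_state
    cpu.root_mode = True
```

Since every exit records its reason in the same structure, the VMM can use a single common entry point and dispatch on `exit_reason`.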
VMCS Control Fields
  Settable options for interrupt virtualization
    External-interrupt exiting
      VM exit on external interrupt
      External interrupts not maskable by guest
    Interrupt-window exiting
      VM exit if guest allows interrupts
  Guest/host mask for control register virtualization
    Status flags in control registers determine processor options
    VMM masks selected flags to prevent write by guest
      Guest write to masked flag causes VM exit
      Guest reads flag value specified by VMM in VMCS
  VM exit bitmaps
    VMM chooses subset of guest actions that cause VM exit
    Exception bitmap: 32 exceptions that optionally cause VM exit
    I/O bitmap: each 16-bit I/O port can be set to VM exit on guest access
    Instruction bitmap: selects privileged instructions that cause VM exit

VT-x Solves Virtualization Problems
  Ring aliasing and compression
    Guest software runs at intended privilege level
  Address-space compression
    Guest/VMM transitions can change virtual address space
    Guest software has full use of its own address space
    VMCS resides in physical address space
      Does not use linear address space
  Non-faulting access to privileged state
    VMCS controls interrupts
    VMM allows guest access to privileged registers
    Accesses cause transition to VMM
  System calls
    Guest OS runs at ring 0 as intended
  Interrupts
    Response to interrupts controlled through VMCS

VT-x Exception Handling
  Exception not set in bitmap
    OS handles exception and continues
  Exception set in bitmap
    VM exit to VMM
    VMM services exception
    Updates system tables
    Event injection
    VM entry
  Event injection replicates exception
    Possible updates: page tables, system registers, I/O configuration, ...

Interrupt Virtualization
  Set option external-interrupt exiting
  Interrupt
    VM exit to VMM
    VMM prepares system tables
    Event injection
    VM entry
  Event injection replicates interrupt
    Possible updates: interrupt tables, system registers, I/O configuration, ...
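
The I/O bitmap and the guest/host mask can be sketched as simple bit operations. The function names are hypothetical. The write rule shown (exit only when a masked bit would differ from the VMM-chosen shadow value) is an assumption of this sketch; the slides say only that a guest write to a masked flag causes a VM exit.

```python
def make_io_bitmap() -> bytearray:
    """One bit per 16-bit I/O port: 65536 bits = 8192 bytes."""
    return bytearray(65536 // 8)


def set_exit_on_port(bitmap: bytearray, port: int) -> None:
    """Mark a port so that guest access to it causes a VM exit."""
    bitmap[port >> 3] |= 1 << (port & 7)


def io_access_exits(bitmap: bytearray, port: int) -> bool:
    """True if guest access to this port should cause a VM exit."""
    return bool(bitmap[port >> 3] & (1 << (port & 7)))


def guest_read_cr(real: int, mask: int, shadow: int) -> int:
    """Masked bits come from the VMM-chosen shadow, the rest from the real register."""
    return (real & ~mask) | (shadow & mask)


def guest_write_cr_exits(new_value: int, mask: int, shadow: int) -> bool:
    """Assumption of this sketch: exit only if a masked bit would change."""
    return bool((new_value ^ shadow) & mask)
```

Ports left clear in the bitmap are accessed by the guest directly, which is how a VMM can give a particular guest direct control of a particular device while trapping everything else.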
VT-d for PCI Chipset Virtualization
  VMM allocates resources to guest OSs
    Virtual address space
    Virtual I/O devices mapped to real I/O devices
    OS accesses real I/O device through VMM mapping
  DMA remapping
    VMM configures virtual I/O devices
    Enables device-initiated DMA operations to guest address space
    Real I/O device must write to guest through emulation mapping
  Interrupt remapping
    Real I/O devices may interrupt CPU
    Interrupt intended for one guest
    Real I/O device must deliver interrupt to guest through emulation mapping
  [Figure: CPU and RAM attached to bridge; I/O devices below bridge]

DMA Protection Domains
  Protection domain
    Subset of physical memory allocated for device-initiated DMA
    Protection domains may be allocated to
      Guest OS
      Driver process running under guest OS
  I/O device
    May be assigned to a protection domain
    Can only perform DMA to assigned protection domain
  DMA address translation
    I/O device DMA request to bridge contains memory address
    VT-d treats request address as DMA virtual address (DVA)
      Guest physical address (GPA) of guest OS
      General software-generated virtual I/O address
    DVA translated to host physical address (HPA)
  http://www.intel.com/technology/itj/2006/v10i3/index.htm

Mapping I/O Devices to Protection Domains
  PCI device requester ID
    Identifies DMA device and request
    Fields: PCI bus, device, function
    Assigned by PCI configuration software during device discovery
  Root-entry table
    Index: 8-bit bus number from requester ID
    Entry: pointer to context-entry table
  Context-entry table
    Index: 8-bit device/function number from requester ID
    Entry: pointer to page structure used to translate DVA
  Page structure
    Multilevel table structure
    Similar to IA-32 page tables
  [Figure: I/O device DMA request ID (PCI bus, device, function) indexes root-entry table, then context-entry table, then page structure]

Address Space Overview
  [Figure: guest virtual address (GVA) translated by page tables to guest physical address (GPA, emulated physical), then to host physical address (HPA); DMA virtual address (DVA) translated by page structures to HPA]
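
The lookup chain from requester ID to host physical address can be sketched as follows. The function names are hypothetical, nested dictionaries stand in for the root-entry and context-entry tables, and a flat page map with an assumed 4 KiB page size replaces the multilevel page structure.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def split_requester_id(rid: int) -> tuple:
    """Return (bus, devfn) from a 16-bit PCI requester ID."""
    return (rid >> 8) & 0xFF, rid & 0xFF


def translate_dma(rid: int, dva: int, root_table: dict) -> int:
    """Walk root entry -> context entry -> page map and return the HPA.

    root_table[bus][devfn] is a flat {DVA page: HPA page} map standing in
    for the multilevel page structure of the device's protection domain.
    """
    bus, devfn = split_requester_id(rid)
    try:
        page_map = root_table[bus][devfn]
        host_page = page_map[dva // PAGE_SIZE]
    except KeyError:
        raise PermissionError(f"DMA from {rid:#06x} to {dva:#x} blocked") from None
    return host_page * PAGE_SIZE + dva % PAGE_SIZE
```

A device with no entry, or an address outside its page map, is refused, which models the rule that a device can only perform DMA to its assigned protection domain.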
IA-32 Interrupt Handling
  Legacy interrupts
    Interrupt controller in chipset handles device interrupts
      Programmable Interrupt Controller (PIC) integrated into ISA chipset
      APIC (Advanced PIC) integrated into PCI chipset
    I/O device assigned interrupt request (IRQ) connection to APIC
    APIC
      Translates device IRQ to 8-bit CPU interrupt number n
      Sends hardware interrupt signal (INTR) to processor
    CPU
      Loads 64-bit entry n from Interrupt Descriptor Table (IDT)
      Entry points to Interrupt Service Routine (ISR)
  Message signaled interrupts (MSI)
    I/O APIC in PCI chipset formats IRQ signal into structured message
    Message transferred on PCI bus as device-initiated DMA operation
    Local APIC in CPU receives and decodes message

Message Interrupt Handling
  Local APIC
    CPU interrupt controller
    Receives/decodes local interrupt signals
    Receives interrupt messages from I/O APIC
  I/O APIC
    PCI chipset interrupt controller
    Receives/decodes device IRQ signals
    Sends/receives interrupt messages
  Source: IA-32 Intel Architecture Software Developer's Manual

Interprocessor Interrupts
  Interprocessor interrupt (IPI)
    Subset of APIC interrupt message table
    CPU writes to interrupt command register (ICR) in local APIC
    Local APIC issues IPI message on system bus
    Used to boot and spawn threads in multiprocessor system

Interrupt Remapping
  Message signaled interrupt (MSI)
    Encodes interrupt vector and destination processor
    Real I/O device not aware of guest view of emulated I/O device
    VMM must intercept MSI
  VMM redefines interrupt message format
    Provides substitute MSI
    DMA write request contains
      Message identifier
      No interrupt attributes (vector and destination processor)
      Requester ID of real I/O device generating interrupt
    Requester ID mapped through table structure (root/context tables)
      Points to interrupt remapping table (IRT)
      Entry provides vector and destination processor
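
The interrupt remapping lookup can be sketched in the same style. The names are hypothetical, and a nested dictionary stands in for the root/context lookup and the interrupt remapping table; the message itself carries only an identifier and the requester ID, never the vector or destination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IrtEntry:
    vector: int        # 8-bit interrupt number delivered to the CPU
    destination: int   # identifier of the destination processor


def remap_interrupt(requester_id: int, message_id: int,
                    irt_for_device: dict) -> IrtEntry:
    """Resolve a remappable interrupt message to its vector and destination.

    irt_for_device[requester_id] stands in for the root/context lookup;
    message_id indexes the resulting interrupt remapping table.
    """
    try:
        return irt_for_device[requester_id][message_id]
    except KeyError:
        raise LookupError("interrupt blocked: no remapping entry") from None
```

Because the VMM owns the table, it can retarget a device's interrupts to the guest that owns the device without the device being aware of the change.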
Caching of Remapping Structures
  VT-d supports hardware caching of remapping tables
    Root/context tables
    Paging structures: IOTLB
    Interrupt remapping table entries
  VMM responsible for maintaining remapping cache
    Must invalidate stale cache entries
  Remapping errors
    DMA access request returns error message
    Device response to error is implementation dependent
    Errors logged to VMM
    VMM may reset cache or I/O device configuration tables

VirtualBox
  Open source VMM hosted by Oracle (Sun Microsystems)
    Runs on Intel and AMD x86 hardware
    Runs above Windows, Linux, Mac OS X (Intel), Solaris
    Provides VM with guest OS
      Standard DOS, Windows, Linux, OS/2, FreeBSD, Solaris
    Uses hardware virtualization support if available (not required)
  Scheduling
    Host OS grants timeslice to VM
    VM sub-processes scheduled by guest OS
  [Figure: guest OS above VirtualBox hypervisor above host OS above x86]

VirtualBox Architecture
  Front-end (client)
    VirtualBox hypervisor
      Runs above host OS
    Without Intel VT performs workaround virtualization
      Hypervisor runs in ring 0 of guest context
      Guest OS runs as user program in ring 1 of guest context
    Limited use of Intel VT if available
  Back-end (server)
    Ring 0 driver in host OS
    VirtualBox driver
      Copes with "gory details of x86 architecture"
      Allocates physical memory for VM (guest OS)
      Saves/restores guest CPU context during host interrupt
        Registers and descriptor tables
      No intervention in guest process management
  [Figure: guest OS and hypervisor (front-end) above VirtualBox driver in host OS (back-end)]

CPU Operating Modes
  Runs native (not emulated) on CPU
    Host applications at ring 3
    Host OS code at ring 0
    Guest "safe" application code at ring 3
      Non-system activities
      Makes system calls to guest OS
  Runs emulated on CPU at ring 3
    Guest application code that causes guest OS interventions
      Disables interrupts
      Trap of prohibited accesses
      Executes real mode code
    Each instruction interpreted by VirtualBox driver
    Interpreted code run in CPU instead of native code
  Runs native on CPU at ring 1
    Guest OS ring 0 code
    VirtualBox driver handles "gory details" of workaround
Xen
  Open source VMM system
    Runs on Intel and AMD x86 hardware
    Runs directly above hardware
    Linux required to build and install Xen
    Provides VMs with guest OSs
      Linux, Solaris, Windows XP, 2003 Server
      Hardware virtualization support required for Windows guest
  Para-virtualization for Linux/Unix guest
    OS kernel modified to support Xen explicitly
    Operating systems ported to run on Xen
      Similar effort to porting to new hardware platform
      Para-virtual machine architecture very similar to native hardware
    User space applications and libraries not modified
  Source: Xen Architecture Overview, http://wiki.xensource.com/xenwiki

Xen Architecture
  Xen hypervisor
    Directly above hardware
    Boots system on start-up
  Domain 0
    Initialized by hypervisor on boot
    Runs XenLinux: modified Linux kernel
    Provides Domain Management and Control (DMC)
  Domain U
    VM running guest OS
  [Figure: Domain 0 (DMC, XenLinux) and several Domain U guests above Xen hypervisor above x86]

Hypervisor
  Full privilege
    Operates directly on hardware in ring 0
  Functions
    CPU scheduling for virtual machines
    Memory partitioning
    Provides hardware abstraction to virtual machines
  No awareness of
    Networking
    External storage devices
    Video
    Common I/O
  [Figure: Xen hypervisor on x86 with CPU scheduler (Domain U process list) and partitioner (Domain U page tables)]

Domain 0 I/O
  XenLinux
    Modified Linux kernel running in unique VM over hypervisor
    Direct privileged access rights to physical I/O resources
    Provides I/O virtualization to Domain U guest VMs
  Generic I/O drivers
    Network backend driver
      Manages local networking hardware
      Processes all VM networking requests from Domain U guests
    Block backend driver
      Manages local storage disk
      Processes all read/write data requests from Domain U guests
  [Figure: Domain 0 I/O drivers beside hypervisor scheduler (Domain U process list) and partitioner (Domain U page tables) above CPU and I/O hardware]
Domain U PV
  Domain U PV guests
    Paravirtualized VM running modified Linux/UNIX kernels
  OS expectations
    No direct access to host hardware
    Shares host hardware with other VMs
  Guest drivers provide I/O access
    Access backend drivers in Domain 0
    PV network driver
    PV block driver
  [Figure: PV drivers in Domain U connected to backend drivers in Domain 0]

Domain U HVM
  Domain U HVM guests
    Fully virtualized machines
    Run standard Windows or other unmodified OS
    OS runs as VMX non-root operation with VT-x
  OS expectations
    No hardware virtualization
    Not sharing with other VMs
    Normal hardware access for boot
  Xen virtual firmware
    Runs as VMX root operation with VT-x
    Simulates BIOS expected by OS on initial startup
  I/O support
    No special drivers
    Domain 0 runs Qemu-dm daemon for each HVM guest
    Supports Domain U HVM guest for networking and disk access requests
  [Figure: standard drivers in Domain U HVM served by Qemu-dm daemon in Domain 0]

Domain Management
  Xend daemon
    Python application running in Domain 0
    System manager for Xen environment
    Processes requests as XML remote procedure call (RPC)
  Qemu-dm
    Daemon handles networking and disk requests from Domain U HVM
    Provides full emulation of hardware for standard I/O drivers
  Virtual firmware
    Provides full emulation of BIOS for Domain U HVM guest
  [Figure: Domain 0 (Xend, Qemu, XenLinux); Domain U PV guests (Unix on XenUnix, Linux on XenLinux); Domain U HVM guest (standard Windows); all above Xen hypervisor on x86]

Domain U PV to Domain 0 Communication
  Domain U PV guest requests I/O from Domain 0 via hypervisor
    No direct support in hypervisor for I/O
  Inter-domain event channel
    Domain 0 and each Domain U have shared memory area
    Asynchronous inter-domain interrupts implemented in hypervisor
  Example: Domain U PV guest data write to hard disk
    Guest OS sends write request to PV block driver
    Guest PV block driver
      Writes data to shared memory through hypervisor
      Sends inter-domain interrupt to Domain 0 through hypervisor
    Domain 0 receives interrupt from hypervisor
      Triggers PV block backend driver access to shared memory
    Backend driver
      Reads blocks from Domain U PV guest shared memory
      Writes data to hard disk
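
The Domain U PV disk write example can be sketched as a small simulation. The class names are hypothetical and do not correspond to Xen's actual interfaces: a queue stands in for the shared memory area, a callback list for the hypervisor's event channel, and a dictionary for the disk.

```python
from collections import deque


class SharedRing:
    """Stand-in for the memory area shared by Domain U and Domain 0."""

    def __init__(self) -> None:
        self.requests = deque()  # (sector, data) pairs


class EventChannel:
    """Stand-in for hypervisor-delivered inter-domain interrupts."""

    def __init__(self) -> None:
        self.handlers = []

    def bind(self, handler) -> None:
        self.handlers.append(handler)

    def notify(self) -> None:
        for handler in self.handlers:
            handler()


class BlockBackend:
    """Domain 0 side: drains the ring and writes to the (simulated) disk."""

    def __init__(self, ring: SharedRing, channel: EventChannel) -> None:
        self.ring = ring
        self.disk = {}  # sector -> bytes
        channel.bind(self.on_interrupt)

    def on_interrupt(self) -> None:
        while self.ring.requests:
            sector, data = self.ring.requests.popleft()
            self.disk[sector] = data


class BlockFrontend:
    """Domain U side: the PV block driver used by the guest OS."""

    def __init__(self, ring: SharedRing, channel: EventChannel) -> None:
        self.ring = ring
        self.channel = channel

    def write(self, sector: int, data: bytes) -> None:
        self.ring.requests.append((sector, data))  # write to shared memory
        self.channel.notify()                      # inter-domain interrupt
```

The hypervisor's role in this path is limited to sharing the memory and delivering the interrupt; the disk itself is touched only by the Domain 0 backend driver.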
I/O Driver Communication
  [Figure: Unix write request reaches PV block driver in Domain U PV (XenLinux); driver writes shared memory and raises interrupt through Xen hypervisor on x86; backend block driver in Domain 0 (DMC, XenLinux) receives interrupt, reads shared memory, writes disk]

Xen PV and HVM Performance
  Test configuration
    Intel Xeon @ 2.3 GHz
    4 GB DDR2 533 MHz memory
    160 GB Seagate SATA disk
    Intel E100 Ethernet controller
  Source: Dong et al., "Extending Xen with Intel Virtualization Technology", Intel Technology Journal

I/O Bottleneck
  Bottleneck
    Single Ethernet controller
    Guest tasks waiting for I/O access hide performance degradation caused by virtualization
  Web server running over native Linux without Xen
    Threads compete above 2.5 Gbps
  Web server running over XenLinux in Domain 0
    Threads compete above 1.9 Gbps
  Web server running over XenLinux in Domain U PV
    Threads compete above 0.9 Gbps