Advanced Operating Systems and Virtualization MS degree in Computer Engineering Sapienza University of Rome Lecturer: Francesco Quaglia Topics 1. Basic support for guest systems execution
System virtualization concepts It targets how to show a view of the resources rthat is different form the real one It allows to run more than one operating system on the same hardware. This is base don the concept of virtual machine, whichis a software implemenaoitn of a real one It relies on an additonal software layer: Hypervisor or VMM (Virtual Machine Monitor). Advantages: Isolation of differnet execution environemtns (ont eh sam hardware) Reduction of hardware and administration costs
TERMINOLOGY: Hypervisor Host system: the real system where software implemented virtual machines run. Guest system: the system that runs on top of a software implemented virtual machine. Hypervisor: Manages the hardware reosurces that are availabel to the host system, so as to allow corret usage by the guets operaying system. It makes virtualized resources available to the guest system Hypervisor categories: 1. Native ii runs on the native host hardware. It looks like a lightweight virtualization kenrel operating on top of the harware. 2. Hosted it runs as an applicaiton, which accesses the actual host services via system calls
Pure software x86 virtualization Instructions are executed by the native phisical CPU within the host platofrm There is the need for emulation/interpretatin of a subset of the instruction set No particualrl hardware component playes a role in the virtualiztion task (as isntead for the case fo Intel VT-x o AMD-V). The core issue What of RING 0 is required for the guest system tasks? Risk to bypass the VMM resourc management policy in case of actual RING 0 access The solution: ring deprivileging.
Ring deprivileging This allows the guest kernel to run at priviledge level that simulates 0 The actual envisioned strategies: 1. 0 / 1 / 3 Model: o VMM runs at ring0. o Kernel guest runs at ring 1 (whic is typoically unused by native kernels) o Applications still run at ring 3. o This is the most used approach. 2. 0 / 3 / 3 Model : o VMM runs at ring 0. o Kernel guest and applications run at ring 3. o Too close to emulation, too high costs.
The 0 / 1 / 3 model The application layer (running at ring 3) cannot damage the guest operating system state (whihc runs at ring 1). The guest system cannot access to the hadware priviledged facilities bypassinbg VMM, so we still guarantee the isolation of guest systems execution Any exception needs to b trapped by VMM (operating at ring 0) and need to be somehow handled (e.g. by reflecting them into Ring 1 tasks) Issues to cope with: Ring aliasing Virtualizzazione of the interrupts Freqent access to privileged resources
Ring aliasing Isseus connected with the fact that (kernel) software sesigned to be boud to specific ring vaues (say 0) and need to run at a different elevel (ring 1 for guest systems). CPL= Current Privilege Level -> this information we know is kept by the 2 lsb within the CS register (Code Segment). Proviledged instructions: we knwo they genrate a trap is not executed at ring 0 (CPL > 0). Exampes for x86 machines are: HALT LIDT LGDT INVD MOV CR x I/0 sensistive instructions: we know they generate a tra if executed when CPL > IOPL (I/O Privilege Level). Classical examples are: CLI (IF=0) STI (IF=1) The actual generated trap (general protection fault) needs to be handled by VMM, so as to finally dtrmine how to handle it (emulation vs interpretation)
The VirtualBox example Base on hosted hyperisor with kenrnel augmentation, via classical special devices. Pure software virtualization is supported for X86: Fast Binary Translation (code patching): the kernel code is annalysed and modified before being executed so as to substitute privileged instructions (e.g. CLI) with blocks of code that are functionally equivalenty Based on the 0/1/3.
Execution modes and contexts Guest context (GC) execution context for the guest system. It is bsed on two modes 1. Raw mode: native guest code runs at ring level 3 or ring level 1 2. Hypervisor: VirtualBox code is run at the magximm privilege level (ring 0) Host context (HC) execution context for user space portion of VirtualBox (ring 3): o o The running thread implelting the VM lives in this context upon a mode change REM mode: execution mode for emulating critical/privilegd instructions
Raw Mode: the actual strategy The native kernel cannot run at ring 1 We need pathces for host and guest systems,so as to support the 0/1/3 model Host side: o Uasge of special devices (and drivers) for access to ring 0 Guest system wa have wrappers for: o GDT (Global Descriptor Table) o TSS (Task State Segment) o IDT (Interrupt Descriptor Table)
VirtualBox GDT Introduction of gate descriptors for kernel code/data segments with DPL=1. Hence these segments are accessible with CPL=1 New TSSD pointing ot the TSS wrapper which keeps info on stack positioning at ring 1 (ss1,esp1) and ring 0 (ss0,esp0). 2 new segments for the Hypervisor are addedd with DPL=0 DECRIPTION OFFSET DPL Entry 0 (0000) H dc null... KERNEL CODE SEGMENT KERNEL DATA SEGMENT (0060) H 1 (0068) H 1 VIRTUALBOX TSSD (FFE0) H 0 BASE ORIGINAL TSS VBOXTSS unused unused HYPERVISOR DATA SEGMENT HYPERVISOR CODE SEGMENT (FFF0) H 0 (FFF8) H 0 CPL = Current Privilege Level ss1=ss0 1 DPL = Descriptor Privilege Level
Virtual trap Genuine trap VBOXIDT: interrupt gate Any interrupt neew to me managed by VMM. IDT ORIGINALE To achieve this, a wrapper for the IDT is generated Proper handlers are instantiated which get executed by the Hypervisor upin traps. VMM can take control thanks to the ad-hoc segment selector (at GDT proper offset for the hypervisor code segment). IN case of a «genuine» trap, then control goes to the native kernel, otherwise the virtual handler is executed 0xD 14 (0060) H 0 0xE 14 (0060) H 0 VBOXIDT 0x0 0xD 14 (FFF8) H 0 0xE 14 (FFF8) H 0 VMM handler
VBOXIDT: gate 0x80 INT 0x80 has an ad-hoc management ORIGINAL IDT Ithe syscall gate is modified soa st to provide a segment selector with RPL = 1 It indicates the GDT offset for the code segment (at ring 1). 0x80 15 (0060) H 3 system_call handler Hence calling a system call deos not require interaction with the Hypervisor The trampolina handler is then iused to louch the actual syscall handler VBOXIDT 0x80 15 (0061) H 3 Ring 1 handler Handler trampolino
Access to raw mode This is used for privileged insturctons LIDT -> idtr punterà a VBOXIDT LGDT -> gdtr punterà a VBOXGDT LTR -> tr punterà a nuovo VBOXTSS The gest system canb then take back control by returning from the trap (IRET), with the following registers saved oin the stack SS ESP EFLAGS CS EIP
Privileged instructoins: patching Privileged instructions may hamper performane given that the Hypervisor needs to take back control for handling any of them A way to cope with this is staich patching of these instrucitons An example: the CLI instruction Trap if CPL<=IOPL VMM sets IOPL=0 upon entering raw mode Problem: if IF=0, then VMM cannot handle interrupts anymore. The solution: the code block CLI STI is substituted with a functionallt equivalient one. Unterrupt is disabled only for the guest system The Hypevisor will take care of finally delivering it.
REM mode I does not use runtime patching -> because of issues in the efficiency for managing, e.g., code storage Actually executed in HC mode ring 3. It relies on QEMU. Quite usefull for privileged instruction management The issue: the emulation process can be slow, since we need to keep track of processor state changes to be restored upon reentring raw mode Typically, at each emulation step it is checked whether native code execution can be restored
Privileged instruction srategies Upon adding kernel code, privileged instructions must be emulated/interpreted byvmm. VBOXIDT Emulation is used for more probelmatic instrucitons (e.g. CLI STI). #GP fault! Interpretation is used for privileged instructions that cause traps, which are not problematic (e.g. MOV from/to CR X)