I/O in Linux Hypervisors and Virtual Machines Lecture for the Embedded Systems Course CSD, University of Crete (May 12 & 14, 2015)




I/O in Linux Hypervisors and Virtual Machines
Lecture for the Embedded Systems Course, CSD, University of Crete (May 12 & 14, 2015)
Manolis Marazakis (maraz@ics.forth.gr)
Institute of Computer Science (ICS), Foundation for Research and Technology Hellas (FORTH)

Virtualization Use-cases
- Server (workload) consolidation
- Legacy software systems
- Virtual desktop infrastructure (VDI)
- End-user virtualization (e.g. S/W testing & QA, OS research)
- Compute clouds
- Embedded (e.g. smartphones)
How does virtualization work, in detail? The emphasis here is on I/O.

Virtualization is hard!

Outline
- Elements of virtualization: machine emulator, hypervisor, transport
- Alternative designs for the device I/O path:
  - Device emulation (fully virtualized)
  - Para-virtualized devices
  - Direct device assignment (pass-through access)
- The I/O path of hypervisor technologies commonly used on Linux servers (x86 platform): Xen, KVM
- A glimpse of hypervisor internals:
  - KVM: virtio
  - Xen: device channels, grant tables
(VMware is the market leader, but out of scope for this lecture.)

Why is I/O hard to virtualize?
- Multiplexing/de-multiplexing of I/O for guests
- Programmed I/O (PIO), Memory-mapped I/O (MMIO)
  - PIO: privileged CPU instructions specifically for performing I/O; I/O devices have a separate address space from general memory
  - MMIO: the CPU instructions used to access memory are also used for accessing devices; the memory-mapped I/O region is protected
- Direct Memory Access (DMA): allows certain hardware subsystems within the computer to access system memory for reading and/or writing, independently of the CPU
  - Synchronous: triggered by software
  - Asynchronous: triggered by devices (e.g. NIC)
- Implementation layers:
  - system call (application to Guest OS): trap to VMM
  - driver call (Guest OS): paravirtualization, i.e. a hypercall between a modified driver in the Guest OS and the VMM
  - I/O operation (Guest OS driver to VMM)

Processing an I/O Request from a VM
Source: Mendel Rosenblum, Carl Waldspurger: "I/O Virtualization", ACM Queue, Volume 9, Issue 11, November 2011

I/O device types
- Dedicated: e.g. display
- Partitioned: e.g. disk
- Shared: e.g. NIC
Discovery and emulation:
- Devices that can be enumerated (PCI, PCI-Express): the VMM needs to emulate the discovery method, over a bus/interconnect
- Devices with hard-wired addresses (e.g. PS/2): the VMM should maintain status information on virtual device ports
- Emulated devices (e.g. experimental hardware): the VMM must define & emulate all H/W functionality, and the Guest OS needs to load the corresponding device drivers

Device Front-Ends & Back-Ends
Source: Mendel Rosenblum, Carl Waldspurger: "I/O Virtualization", ACM Queue, Volume 9, Issue 11, November 2011

VM I/O acceleration

Device emulation: full virtualization vs para-virtualization
[Diagram: two stacks. Full virtualization: Guest OS → traps → device emulation in the VMM → hardware. Para-virtualization: Guest OS with PV drivers → device emulation in the VMM → hardware.]

Device model (with full-system virtualization)
- The VMM intercepts I/O operations from the Guest OS and passes them to the device model at the host
- The device model emulates the I/O operation interfaces: PIO, MMIO, DMA, ...
- Two different implementations:
  - Part of the VMM
  - User-space standalone service

xen history
- Developed at the Systems Research Group, Cambridge University (UK)
- Creators: Keir Fraser, Steven Hand, Ian Pratt, et al. (2003)
- Broad scope, for both host and guest
- Merged in Linux mainline: 2.4.22
- Company: XenSource, acquired by Citrix Systems (2007)

Xen VMM: Paravirtualization (PV) OSes
- Dedicated control domain: Dom0
- Modified Guest OS
- Devices:
  - Front-end (net-front) for the Guest OS to communicate with Dom0
  - I/O channel (zero copy)
  - Back-end (net-back) for Dom0 to communicate with the underlying systems

kvm history
- Prime creator: Avi Kivity, Qumranet (IL), circa 2005
- Company acquired by Red Hat (2008)
- Narrow focus: x86 platform, Linux host
- Assumes Intel VT-x or AMD SVM
- Merged in Linux kernel mainline: 2.6.20, less than 4 months after the first announcement!

kvm VMM: tightly integrated in the Linux kernel
- Hypervisor: kernel module
- Guest OS: user-space process (QEMU), one per guest
- Requires H/W virtualization extensions

xen vs kvm
Xen:
- Strong support for paravirtualization with a modified guest OS; near-native performance for I/O
- Separate code base for Dom0 and device drivers
- Security model: relies on Dom0
- Maintainability: hard to keep up with all versions of possible guests, due to PV
KVM:
- Requires H/W virtualization extensions: Intel VT, AMD Pacifica (AMD-V)
- Limited support for paravirtualization
- Code base integrated into the Linux kernel source tree
- Security model: relies on commodity Linux systems
- Maintainability: easy; well integrated into the infrastructure and code base

I/O virtualization in xen
- Bridge in the driver domain: multiplex/demultiplex network I/O from guests
- I/O channel: zero-copy transfer with grant-copy; enables the driver domain to access I/O buffers in guest memory
- Xen network I/O extension schemes: multiple RX queues, SR-IOV
Source: "Bridging the gap between software and hardware techniques for I/O virtualization", USENIX Annual Technical Conference, 2008

I/O virtualization in kvm
- Native KVM I/O model:
  - PIO: trap
  - MMIO: the machine emulator executes the faulting instruction
  - Slow, due to mode switches!
- Extensions to support PV:
  - virtio: an API for virtual I/O, which aims to support many hypervisors of all types

Virtualization components
- Machine emulation: CPU, memory, I/O; hardware-assisted
- Hypervisor: hyper-call interface, page mapper, I/O, interrupts, scheduler
- Transport: messaging, and bulk-mode

Emulated Platform
- System bus(es) and I/O devices
- Interrupt controller(s) (APIC)
- Instruction set
- MMU
- CPU
- Memory
- Misc. resources

QEMU machine emulator
- Creator: Fabrice Bellard (circa 2003)
- Machine emulator using a dynamic translator: run-time conversion of target CPU instructions to the host instruction set, with a translation cache
- Emulated Machine := { CPU emulator, emulated devices, generic devices } + machine description
- Link between emulated devices & underlying host devices:
  - Alternative storage back-ends, e.g. POSIX/AIO (thread pool)
  - Caching modes for devices:
    - cache=none: O_DIRECT
    - cache=writeback: buffered I/O
    - cache=writethrough: buffered I/O, O_SYNC

Virtual disks
- Exported by the host: physical device/partition, logical device, or file
- Raw format, or image format (e.g. .qcow2, .vmdk, raw .img)
- Features: compression, encryption, copy-on-write snapshots

I/O stack: Host
[Diagram: User-Level Applications → System Calls → Virtual File System (VFS) → File System → Buffer Cache → Block-level Device Drivers → SCSI Layer → Storage Controller; raw I/O bypasses the file-system layers.]

I/O stack: Host + Guest
[Diagram: the guest runs its own stack (System Calls → VFS → File System → Page Cache → Block-level Device Drivers → SCSI Layer) on top of the host's stack, down to the Storage Controller. Two file systems are involved, possibly one more inside the image file.]

Device Emulation
- The hypervisor emulates real devices
- Traps I/O accesses (PIO, MMIO)
- Emulation of DMA and interrupts
[Diagram: the guest device driver (1) traps into the hypervisor's device emulation (2), which drives the host device driver (3), and results are returned to the guest (4).]

Para-virtualized drivers
- Hypervisor-specific virtual drivers (front-ends) in the guest
- Involvement of the hypervisor (back-end)
[Diagram: the guest's front-end device driver (1) communicates with the back-end device driver in the hypervisor (2), which drives the host device driver (3).]

Direct device assignment
- Bypass the hypervisor, to directly access devices
- Security & safety concerns!
- IOMMU for address translation & isolation (DMA restrictions)
- SR-IOV for shared access

kvm guest initialization (command-line)

qemu-system-x86_64 -L /usr/local/kvm/share/qemu \
  -hda /mnt/scalusmar2013/01/scalusvm.img \
  -drive file=/mnt/scalusmar2013/01/datavol.img,if=virtio,index=0,cache=writethrough \
  -net nic -net user,hostfwd=tcp::12301-:22 \
  -nographic -m 4096 -smp 2

Virtualized I/O flow

kvm run-time environment
[Diagram: guests run inside QEMU processes in user space, managed by libvirtd; the kvm core module and kvm-intel/kvm-amd modules live in kernel space, alongside the Linux scheduler.]

kvm execution model
- Processes create virtual machines; a process together with /dev/kvm is in essence the hypervisor
- VMs contain memory, virtual CPUs, and (in-kernel) devices
- Guest ("physical") memory is part of the (virtual) address space of the creating process
- Virtual MMU+TLB, and APIC/IO-APIC (in-kernel)
- Machine instruction interpreter (in-kernel)
- vCPUs run in process context (i.e. as threads)
- To the host, the process that started the guest _is_ the guest!

Run-time view of a kvm guest
- The hypervisor process contains one thread per vCPU, plus an I/O thread; guest memory is part of its address space
- Guest ↔ Host switch via the scheduler and the /dev/kvm module
- Use of Linux subsystems: scheduler, memory management, ...
- Re-use of user-space tools: VM images, network configuration

Virtualization of the (x86) instruction set
- Trap-and-emulate; alternatives:
  - Binary translation (e.g. early VMware), with code substitution
  - CPU para-virtualization (e.g. Xen), with hyper-calls
- Hardware-assisted virtualization
  - Objective: reduce the need for binary translation and emulation
  - Host + guest state, per core; "world switch"
  - Intel VT-x, AMD SVM
  - Intel EPT (extended page tables), AMD NPT (nested page tables): add a level of translation for guest physical memory; R/W/X bits can generate faults for physical memory accesses
  - vCPU-tagged address spaces avoid TLB flushes

Virtualized CPU

Trap-and-Emulate: VMM and Guest OS
- System call (ring 3): the CPU traps to the interrupt handler vector of the VMM (ring 0); the VMM jumps back into the guest OS (ring 1).
- Hardware interrupt: the hardware makes the CPU trap to the interrupt handler of the VMM; the VMM jumps to the corresponding interrupt handler of the guest OS.
- Privileged instruction: running privileged instructions in the guest OS traps to the VMM for instruction emulation; after emulation, the VMM jumps back to the guest OS.

Switching between VMs
1. Timer interrupt in the running VM.
2. Context switch to the VMM.
3. The VMM saves the state of the running VM.
4. The VMM determines the next VM to execute.
5. The VMM sets the timer interrupt.
6. The VMM restores the state of the next VM.
7. The VMM sets the PC to the timer interrupt handler of the next VM.
8. The next VM is active.

VM state management
- The VMM holds the system state of all VMs in memory
- When the VMM context-switches from one VM to another:
  - Write the register values back to memory
  - Copy the register values of the next guest OS to the CPU registers

Intel VT-x
- VMX Root Operation (Root Mode): all instruction behaviors in this mode are no different from traditional ones, so all legacy software runs in this mode correctly. The VMM should run in this mode and control all system resources.
- VMX Non-Root Operation (Non-Root Mode): all sensitive instruction behaviors in this mode are redefined; the sensitive instructions trap to Root Mode. The guest OS should run in this mode, fully virtualized through the typical trap-and-emulate model.

Intel VT-x
- Without VT-x: the VMM de-privileges the guest OS into ring 1 and takes up ring 0; the OS is unaware it is not running at the traditional ring-0 privilege; compute-intensive S/W translation is required to mitigate this.
- With VT-x: the VMM has its own privileged level where it executes; there is no need to de-privilege the guest OS, and OSes run directly on the hardware.

Context Switch
The VMM switches between virtual machines with Intel VT-x:
- VMXON/VMXOFF: these two instructions turn CPU Root Mode on/off.
- VM Entry: usually caused by the execution of the VMLAUNCH/VMRESUME instructions, which switch the CPU from Root Mode to Non-Root Mode.
- VM Exit: may be caused by many reasons, such as hardware interrupts or sensitive instruction executions; switches the CPU from Non-Root Mode to Root Mode.

Intel vmx
- VMXON, then VMLAUNCH/VMRESUME to enter guest mode; VMEXIT returns to host mode; VMXOFF leaves VMX operation
- Expensive transitions: VM entry/exit, mode switch ("world switch")
- Some exits are heavier than others

VMCS: Virtual Machine Control Structure
- State Area: stores the host OS system state on VM-Entry, and the guest OS system state on VM-Exit.
- Control Area: controls instruction behaviors in Non-Root Mode, and the VM-Entry and VM-Exit process.
- Exit Information: provides the VM-Exit reason and some hardware information.
On VM Entry or VM Exit, the CPU automatically reads/writes the corresponding information in the VMCS.

System state management
- Binding a virtual machine to a virtual CPU
- A VCPU (virtual CPU) contains two parts:
  - VMCS: maintains the virtual system state that is accessed by hardware
  - Non-VMCS: maintains other, non-essential system information, accessed by software
- The VMM needs to handle the non-VMCS part.

EPT: Extended Page Table
Instead of walking a single page-table hierarchy, EPT adds one additional hierarchy:
- One page table is maintained by the guest OS, and is used to generate the guest physical address.
- Another page table is maintained by the VMM, and is used to map guest physical addresses to host physical addresses.
For each memory access, the MMU obtains the guest physical address from the guest page table, and then obtains the host physical address via the VMM's mapping table, automatically.

Cost of VM exits [source: Landau et al., WIOV 2011]

Exit Type           Number of Exits   Cycle Cost/Exit
External interrupt  8,961             363,000
I/O instruction     10,042            85,000
APIC access         691,249           18,000
EPT violation       645               12,000

netperf client run on 1 GbE, with a para-virtualized NIC. Total run: ~7.1x10^10 cycles, vs ~5.2x10^10 cycles for bare metal: a ~35% slow-down, due to the guest and hypervisor sharing the same core.

kvm components
- Hypervisor module (/dev/kvm character device)
  - heavily relies on Linux kernel subsystems
  - ioctl() API calls for requests
  - 1 file descriptor per resource:
    - System: VM creation, capabilities
    - VM: CPU initialization, memory management, interrupts
    - Virtual CPU (vCPU): access to execution state
- Platform emulator (QEMU)
- Misc. modules & tools: KSM (memory de-duplication for VM images), libvirt.org tools (virsh, virt-manager)

kvm Hypervisor API
- ioctl() system calls to /dev/kvm:
  - Create a new VM
  - Provision memory to a VM
  - Read/write vCPU registers
  - Inject an interrupt into a vCPU
  - Run a vCPU
- qemu-kvm in user space: configures VMs and I/O devices, executes guest code via /dev/kvm, performs I/O emulation
- /dev/kvm is the communication channel between kvm and QEMU, and between kvm and VM instances
- kvm assumes an x86 architecture with virtualization extensions [Intel VT-x / AMD-V]: VMCS + vmx instructions
  - VMXON, VMXOFF
  - VMLAUNCH, VMRESUME [root → non-root]: VM-Entry
  - [non-root → root]: VM-Exit
  - VMCS access: VMREAD, VMWRITE

QEMU main event loop

fd_kvm = open("/dev/kvm", O_RDWR);
fd_vm = ioctl(fd_kvm, KVM_CREATE_VM, 0);
fd_vcpu = ioctl(fd_vm, KVM_CREATE_VCPU, 0);
for (;;) {
    ioctl(fd_vcpu, KVM_RUN, 0);
    switch (exit_reason) {
    case EXIT_REASON_IO_INSTRUCTION:   /* ... */ break;
    case EXIT_REASON_TASK_SWITCH:      /* ... */ break;
    case EXIT_REASON_PENDING_INTERRUPT:/* ... */ break;
    }
}

Event handling (timers, I/O, monitor commands) via the select/poll system calls, waiting on multiple file descriptors.

VM creation with kvm

int fd_kvm = open("/dev/kvm", O_RDWR);
int fd_vm = ioctl(fd_kvm, KVM_CREATE_VM, 0);
ioctl(fd_vm, KVM_SET_TSS_ADDR, 0xfffbd000);
ioctl(fd_vm, KVM_CREATE_IRQCHIP, 0);

1 process per guest; 1 thread per vCPU, plus 1 thread for the main event loop, plus offload threads

Adding physical memory to a VM with kvm

void *addr = mmap(NULL, 10 * MB, PROT_READ | PROT_WRITE,
                  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
struct kvm_userspace_memory_region region = {
    .slot = 0,
    .flags = 0, /* can be read-only */
    .guest_phys_addr = 0x100000,
    .memory_size = 10 * MB,
    .userspace_addr = (__u64)addr,
};
ioctl(fd_vm, KVM_SET_USER_MEMORY_REGION, &region);

vCPU initialization with kvm

int fd_vcpu = ioctl(fd_vm, KVM_CREATE_VCPU, 0);
struct kvm_regs regs;
ioctl(fd_vcpu, KVM_GET_REGS, &regs);
regs.rflags = 0x02;
regs.rip = 0x0100f000;
ioctl(fd_vcpu, KVM_SET_REGS, &regs);

Running a vCPU in kvm

int kvm_run_size = ioctl(fd_kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
/* access to the arguments of ioctl(KVM_RUN) */
struct kvm_run *run_state = mmap(NULL, kvm_run_size,
    PROT_READ | PROT_WRITE, MAP_SHARED, fd_vcpu, 0);
for (;;) {
    int res = ioctl(fd_vcpu, KVM_RUN, 0);
    switch (run_state->exit_reason) {
        /* use run_state to gather information about the exit */
    }
}

Safely execute guest code directly on the host CPU (relying on virtualization support in the CPU).

Programmed I/O (PIO) in kvm

struct kvm_ioeventfd {
    __u64 datamatch;
    __u64 addr;   /* legal pio/mmio address */
    __u32 len;    /* 1, 2, 4, or 8 bytes */
    __s32 fd;
    __u32 flags;
    __u8  pad[36];
};

A guest write to the registered address signals the provided eventfd, instead of triggering an exit.

Memory-mapped I/O (MMIO) in kvm

Exit reason: KVM_EXIT_MMIO

struct {
    __u64 phys_addr;
    __u8  data[8];
    __u32 len;
    __u8  is_write;
} mmio;

kvm guest execution flow
[Diagram: user mode issues ioctl(KVM_RUN); kernel mode enters guest mode, where the guest executes natively until a VM exit. The kernel handles the exit: special I/O instructions (PIO, MMIO, interrupts) are trapped & emulated in kernel mode; other I/O is handled back in user mode; pending signals (external events) also return control to user mode. The loop then re-enters the guest.]

Device I/O in kvm
- Configuration via MMIO/PIO
- eventfd for events between host and guest:
  - irqfd: host → guest
  - ioeventfd: guest → host
- virtio: abstraction for virtualized devices
  - Device types: PCI, MMIO
  - Configuration
  - Queues

virtio
- A family of drivers which can be adapted for various hypervisors, by porting a shim layer
- Related: VMware tools, Xen para-virtualized drivers
- Explicit separation of drivers, transport, and configuration
- Front-end driver in the guest; back-end driver (device emulation) in the hypervisor (kvm, lguest, ...)
- Configuration: register for (device-id, vendor-id)

virtio architecture
- Front-end: a kernel module in the guest OS (virtio driver, over a virtio PCI controller); accepts I/O requests from user processes and transfers them to the back-end
- Transport: vring
- Back-end: a device in QEMU on the host (virtio PCI controller, virtio device); accepts I/O requests from the front-end and performs the I/O operation via the physical device
- Virtqueues (per device), vring (per virtqueue): queue requests

kvm with virtio
[Diagram: left, kvm with default device emulation: guest driver → QEMU I/O controller & device → KVM → host device driver → hardware. Right, kvm with virtio device handling: guest virtio driver → transport → QEMU virtio controller & device → KVM → host device driver → hardware.]

vring & virtqueue
- vring: transport implementation (ring buffer)
  - shared (memory-mapped) between guest and QEMU
  - reduces the number of MMIOs
  - "published" & "used" buffer descriptors
- virtqueue API:
  - add_buf: expose a buffer to the other end
  - get_buf: get the next used buffer
  - kick: (after add_buf) notify QEMU to handle buffers
  - disable_cb, enable_cb: disable/enable callbacks
- buffer := scatter/gather list of (address, length) pairs
- QEMU side: virtqueue_pop, virtqueue_push
- virtio-blk: 1 queue; virtio-net: 2 queues

virtio processing flow
[Diagram: the guest virtio driver calls add_buf to place a request (IN/OUT) in the virtqueue, writes it to the vring, and kicks the device; the virtio device pops the request from the vring, processes it, and pushes the result back; the driver retrieves it with get_buf.]

Virtual interrupts
- At least 2 exits per virtual interrupt:
  - Delivery
  - Completion (End-of-Interrupt) signal
[Diagram: a host interrupt from the device reaches the hypervisor, which signals End-of-Interrupt to the I/O APIC and injects a virtual interrupt into the guest; the guest's own End-of-Interrupt causes another exit.]

I/O path via virtio
- Host → Guest: interrupt injection
- Guest → Host: hyper-call (instruction emulation)
[Diagram: the front-end virtual queue of the emulated device in the guest kernel connects to the back-end virtual queue in the QEMU system emulator (host user space); notifications flow through the KVM module, and the native driver in the host kernel performs the actual I/O.]

vhost-net
- Guest networking based on in-kernel assists: virtio + ioeventfd + irqfd
- Avoids heavy VM exits, and reduces packet copying

Acceleration of VMs (SplitX)
Cost of a VM exit: 3 components
- Direct (CPU "world switch")
- Synchronous (due to exit processing in the hypervisor)
- Indirect (slowdown from having 2 contexts on the same core: cache pollution)

Acceleration of VMs (ELI)
Instead of running the guest with its own IDT, run the guest with a shadow IDT prepared by the host.

SR-IOV (1/2)
- PCI-SIG Single Root I/O Virtualization and Sharing specification
- Physical Functions (PF): full configuration & management capability
- Virtual Functions (VF): lightweight functions that contain data-movement resources, with a reduced set of configuration resources
- An SR-IOV-capable device can be configured to appear (to the VMM) in the PCI configuration space as multiple functions
- The VMM assigns VF(s) to a VM, by mapping the configuration space of the VF(s) to the configuration space presented to the VM

SR-IOV: NIC example

IOMMU: Faster I/O via Pass-Through Devices
- Exclusively-used devices can be directly exposed to a guest VM, without introducing device virtualization code
- However, erroneous/malicious DMA operations are capable of corrupting/attacking memory spaces
- Options:
  - Remapping of DMA operations by hardware (e.g. Intel VT-d)
  - Virtualizable devices (e.g. PCI-Express SR-IOV)
- An IOMMU allows a guest OS running under a VMM to have direct control of a device
  - fine-grain control of device access to system memory
  - transparent to device drivers
- Software approaches: swiotlb (bounce buffers), Xen grant tables
- Hardware: AMD GART, IBM Calgary, AMD Pacifica

IOMMU concept
- Indirection between the addresses used for DMA and physical addresses
- Similar to the MMU, but for address translation in the device context
- Devices can only access memory addresses present in their protection domain
- Enables I/O virtualization through direct device assignment

Intel VT-d
[Diagram: platform block diagram. The VT-d DMA-remapping hardware sits in the North Bridge, between the CPUs/system bus and DRAM, covering integrated devices and the PCIe root ports; the South Bridge attaches PCI, LPC, and legacy devices.]

Intel VT-d: DMA remapping
[Diagram: a DMA request carries a device ID (bus 0..255, device, function), a virtual address, and a length. The DMA remapping engine uses device-assignment structures (context cache) to select per-device address translation structures (4KB page tables, with a translation cache), producing either a memory access with a system physical address or a fault. The partitioning and translation structures are memory-resident.]

AMD's IOMMU architecture
[source: AMD IOMMU specification (revision 2)]

Xen device channels
- Asynchronous shared-memory transport; event ring (for interrupts)
- Xen peer domains: inter-guest communication, mapping one guest's buffers to another
- Grant tables for DMA (bulk transfers)
- Xen dom0 (privileged domain) can access all devices
  - Exports a subset to other domains
  - Runs the back-end of device drivers (e.g. net, block)

Xen grant tables
- Share & transfer pages between domains: a software implementation of certain IOMMU functionality
- Shared pages:
  - A driver in the local domain advertises a buffer, notifying the hypervisor that this page can be accessed by other domains
  - Use case: block drivers receive their data synchronously, i.e. they know which domain requested data to be transferred via DMA
- Transferred pages:
  - A driver in the local domain advertises a buffer, notifying the hypervisor; the driver then transfers the page to the remote domain _and_ takes a free page from a producer/consumer ring ("page flip")
  - Use case: network drivers receive their data asynchronously, i.e. they may not know the origin domain (they need to inspect the network packet before the actual transfer between domains)
- With RDMA NICs, we can transfer (DMA) directly into domains

IOMMU Emulation
- Shortcomings of device assignment for unmodified guests:
  - Requires pinning all of the guest's pages, thereby disallowing memory over-commitment
  - Exposes the guest's memory to buggy device drivers
- A single physical IOMMU can emulate multiple IOMMUs (for multiple guests). Why?
  - Allows memory over-commitment (pin/unpin during map/unmap of I/O buffers)
  - Intra-guest protection, and redirection of DMA transactions, without compromising inter-guest protection

I/O Emulation

Sources (1)
- www.linux-kvm.org
- www.xen.org
- Mendel Rosenblum, Carl Waldspurger: "I/O Virtualization", ACM Queue, Volume 9, Issue 11, November 2011. URL: http://queue.acm.org/detail.cfm?id=2071256
- Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield: "Xen and the Art of Virtualization", SOSP '03
- Xen (v3.0 for x86) Interface Manual: http://pdub.net/proj/usenix08boston/xen_drive/resources/developer_manuals/interface.pdf
- Jun Nakajima, Asit Mallick, Ian Pratt, Keir Fraser: "X86-64 XenLinux: Architecture, Implementation, and Optimizations", OLS 2006
- "Xen: finishing the job", lwn.net, 2009

Sources (2)
- Avi Kivity, et al.: "kvm: The Linux Virtual Machine Monitor", Proceedings of the Linux Symposium, 2007. http://www.linux-kvm.com/sites/default/files/kivity-reprint.pdf , http://kerneltrap.org/node/8088
- Muli Ben-Yehuda, Eran Borovik, Michael Factor, Eran Rom, Avishay Traeger, Ben-Ami Yassour: "Adding Advanced Storage Controller Functionality via Low-Overhead Virtualization", FAST '12
- Abel Gordon, Nadav Amit, Nadav Har'El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster, Dan Tsafrir: "ELI: Bare-Metal Performance for I/O Virtualization", ASPLOS '12
- Alex Landau, Muli Ben-Yehuda, Abel Gordon: "SplitX: Split Guest/Hypervisor Execution on Multi-Core", WIOV '11
- Abel Gordon, Muli Ben-Yehuda, Dennis Filimonov, Maor Dahan: "VAMOS: Virtualization Aware Middleware", WIOV '11

Sources (3)
- Muli Ben-Yehuda, Michael D. Day, Zvi Dubitzky, Michael Factor, Nadav Har'El, Abel Gordon, Anthony Liguori, Orit Wasserman, Ben-Ami Yassour: "The Turtles Project: Design and Implementation of Nested Virtualization", OSDI '10
- Ben-Ami Yassour, Muli Ben-Yehuda, Orit Wasserman: "On the DMA Mapping Problem in Direct Device Assignment", SYSTOR '10
- Alex Landau, David Hadas, Muli Ben-Yehuda: "Plugging the Hypervisor Abstraction Leaks Caused by Virtual Networking", SYSTOR '10
- Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir, Assaf Schuster: "vIOMMU: Efficient IOMMU Emulation", USENIX ATC 2011
- KVM ioctl() API: https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt