novm: Hypervisor Rebooted Adin Scannell
What is this talk about? 1. Rethinking the hypervisor 2. A new VMM for Linux (novm)
Who am I? Adin Scannell Systems software developer Where do I work? Formerly CTO @ Gridcentric Inc. Now Software Engineer @ Google How can you reach me? ascannell@google.com
Virtualization is amazing! Powers massive compute infrastructures Makes maintaining legacy systems easier (and developing and testing on new systems) Enables high-availability, backup, live-migration, etc.
Why is everyone excited about containers?
Some people, when confronted with a problem managing their server, think "I know, I'll use virtualization." Now they have $(virsh list | wc -l) problems.
Virtualization pain points Legacy devices, legacy BIOS, etc. Performance problems Dealing with disk images
DOCKERMANIA Lightweight runtime (Linux) App store distribution (registry) Simple software stack (tarballs and files)
Containers are amazing!
Containers aren't perfect
Host kernel dependency limits...
  Portability: SO_REUSEPORT? Everything must be >= 3.9!
  Isolation: security is tough (CVE-2013-1858); shared kernel state is complex and difficult to isolate
Migration, suspend & resume are much harder
How can we make containers more like VMs? How can we make VMs more like containers?
What do I want? (usage)
Support docker-style deployment:
  novm run --docker_image ubuntu:14.04 grep -v '^#' /etc/apt/sources.list
Map in different filesystem trees easily:
  novm run --read /var/log=>/prod/foo/log log_analyzer.py
Support different kernels per "container":
  novm run --kernel linux-3.9 nodejs so_reuseport.js
Also: live migration, suspend & resume, etc.
What's novm? A lightweight VMM, written in Go. Designed to run applications, not systems.
Containers: app | app | app, all on one OS, on Hardware
Containers (detail): apps are grouped into containers via cgroups + namespaces; every app makes syscalls into the same shared OS, which runs on Hardware
Virtual machines: each app runs on its own OS, all on the Hypervisor, on Hardware
Virtual machines (detail): each app runs on its own guest OS; the guest OSes talk to the Hypervisor via x86 + vmcalls (vmx / svm), which runs on Hardware
Virtual machines on Linux: apps run on guest OSes; each guest is backed by a userspace VMM (qemu); the VMMs run on the Linux Kernel with KVM, on Hardware
Dimensions, each a spectrum from Virtualization to Containers: Lifecycle, Performance, Isolation & Security, Portability
Containers: the untrusted app (Ring 3, on the Host) makes syscalls directly into the Host Kernel (Ring 0)
Virtual machines: applications (Ring 3, in the Guest) make syscalls into the Guest Kernel (Ring 0); privileged guest operations cause vmexits into the Host Kernel (KVM), which hands device emulation to the VMM's user code and devices (Ring 3, on the Host)
novm process interactions: novm forwards stdin, stdout, signals, etc. to a proxy inside the guest ([1]) over a virtio RPC channel; the application makes syscalls into the Guest Kernel, which vmexits into the Host Kernel (KVM) and back out to novm's devices on the host
Creating a novm (< 1s)
1. Create a KVM VM
   a. (management layer creates tap devices, etc.)
2. Lay out kernel and initrd payload
   a. (build page tables and use the protected entry point)
3. Run the guest kernel
   a. initrd mounts two 9p filesystems: sysroot & noguest
   b. switch_root to noguest as init, / is sysroot
   c. noguest opens the virtio console, starts an RPC server
   d. noguest sets up IP configuration, etc.
4. Talk to noguest to run the process
Dimensions: on Lifecycle, novm sits between Virtualization and Containers (process-like); Performance, Isolation & Security, and Portability remain Virtualization-to-Containers spectrums
Go is great for a VMM!
Built-in scalability and async tasks
Better error protection: garbage collection, bounds checking, type checking
Built-in serialization and reflection: eliminates bookkeeping for suspend & resume (S&R)
VirtIO Channels == Go Channels?

for buf := range vchannel.incoming {
    header := buf.map(0, VirtioNetHeaderSize)
    pktstart := VirtioNetHeaderSize - device.vnet
    pktend := buf.length() - pktstart
    // Read a packet from the tap device.
    buf.read(device.fd, pktstart, pktend)
    vchannel.outgoing <- buf
}
Asynchronous I/O

func (fs *VirtioFsDevice) process(buf *VirtioBuffer) {
    fs.handle(buf)
    fs.virtiodevice.channels[0].outgoing <- buf
}

func (fs *VirtioFsDevice) run() error {
    for {
        buf := <-fs.virtiodevice.channels[0].incoming
        go fs.process(buf)
    }
}
Closures

efd := vm.newboundeventfd(addr, ioevent.size(), ioevent.data())
go func(ioevent IoEvent) {
    for {
        // Wait for the next event.
        efd.wait()
        // Resubmit the ioevent; no need to look up the device.
        handler.submit(ioevent, offset)
    }
}(ioevent)
Dimensions: novm now sits between Virtualization and Containers on Lifecycle (process-like), Performance, and Isolation & Security; Portability remains a Virtualization-to-Containers spectrum (novm: virtio only)
File mapper
read: { "/": "/" }
write: { "/": "/tmp/vm", "/var/mysql": "/proddb" }
The application's filesystem syscalls go through the guest Linux kernel's virtio-9p client to novm's 9p filesystem devices in the host VMM: the mapping logic is not in kernel space!
Dimensions: novm sits between Virtualization and Containers on all four axes: Lifecycle (process-like), Performance, Isolation & Security, and Portability (virtio only; file-based, not disk-based)
Status
What works? Legacy devices: ACPI, UART, PCI, RTC, PIT, etc. Virtio devices: Net, Block, FS, Console. 100% zero-copy backends. Zero-downtime restart and upgrades.
TBD: live migration, suspend & resume, performance.
What was great?
Working with KVM!

int kvm_fd = open("/dev/kvm", O_RDWR);
int kvm_vm = ioctl(kvm_fd, KVM_CREATE_VM, 0);
int kvm_vcpu = ioctl(kvm_vm, KVM_CREATE_VCPU, 0);
int r = ioctl(kvm_vcpu, KVM_RUN, 0);

Go is amazing!
What was tricky?
Legacy-free? Hardly. Device trees? Nope. Virtio-mmio? Nope.
Virtio devices: PCI with MSI-X interrupts (& eventfds)
VCPUs are goroutines: how do you interrupt a goroutine?
Performance analysis will be tricky
Thanks! Questions? Code available: https://github.com/google/novm Email: ascannell@google.com
How does a traditional VMM work? (build-up) The VMM presents a BIOS and emulated hardware, backed by host resources such as tap devices and disk images. The BIOS hands off to a boot loader; the boot loader starts the OS in real mode; the OS brings the system up on the emulated hardware; and finally the apps run on top of the OS.
How do you build a VMM? (part 1)

int kvm_fd = open("/dev/kvm", O_RDWR);            /* (1) */
int kvm_vm = ioctl(kvm_fd, KVM_CREATE_VM, 0);     /* (2) */
int kvm_vcpu = ioctl(kvm_vm, KVM_CREATE_VCPU, 0); /* (3) */
int r = ioctl(kvm_vcpu, KVM_RUN, 0);              /* crash! */
How do you build a VMM? (part 2)

void *memory_alloc = malloc(100 * 1024 * 1024);
struct kvm_userspace_memory_region m = {
    .slot = 0,
    .flags = 0,
    .guest_phys_addr = 0,
    .memory_size = 100 * 1024 * 1024,
    .userspace_addr = (__u64)memory_alloc,
};
/* Memory regions are set on the VM fd, not the VCPU fd. */
int r = ioctl(kvm_vm, KVM_SET_USER_MEMORY_REGION, &m); /* (4) */
int r2 = ioctl(kvm_vcpu, KVM_RUN, 0);                  /* still crash! */
How do you build a VMM? (part 3)

/* The kvm_run structure is shared with the kernel by mmap'ing the
 * VCPU fd (size comes from KVM_GET_VCPU_MMAP_SIZE). */
struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, kvm_vcpu, 0);
int r = ioctl(kvm_vcpu, KVM_RUN, 0);
if (run->exit_reason == KVM_EXIT_IO && run->io.port == 0xCF8) {
    /* Pretend to be a PCI bus! */
}