Thomas Fahrig
Senior Developer, Hypervisor Team

Agenda
- Hypervisor Architecture
- Terminology
- Goals
- Basics
- Details
- Scheduling Interval
- External Interrupt Handling
- Reserves, Weights and Caps
- Context Switch
- Waiting & Signaling
- Best Processor Selection
- Load Balancing
- Idling
- High-Level Performance Data
- Q&A
Hypervisor Architecture
- The hypervisor runs just above the hardware and below the VMs.
- Microkernel architecture: the kernel-provided abstractions (in our case the hypervisor's) are reduced to the minimum: address spaces, CPU, other hardware abstractions (exceptions, traps, interrupts, etc.), and IPC (Inter-Partition Communication).
- Other services are provided as higher-level services, e.g. the virtualization stack in the root partition.

[Diagram: the Windows hypervisor sits on the hardware (processors, memory, devices), beneath the parent partition (Windows Server 2008 64-bit) and child partitions running Windows Server 2003/2008, Windows XP, or Linux.]
Terminology
- Hypervisor: a layer of software that sits just above the hardware and beneath one or more operating systems. Its primary job is to provide isolated execution environments called partitions. The hypervisor is to Hyper-V what the kernel is to Windows. It does not contain device drivers and is launched by a system driver in V1.
- Virtualization stack: everything else that makes up Hyper-V, i.e. the user interface, management services, virtual machine processes, emulated devices, etc.
- Partition: the unit of isolation within the hypervisor. A partition is a lighter-weight concept than a virtual machine and could be used outside the context of virtual machines to provide a highly isolated execution environment.
- Virtual machine: the combination of a partition and the services provided to that partition via the virtualization stack. Also called the guest.
- Root partition: the controlling partition in which the virtualization stack runs and which owns the hardware devices. This is the first partition on the computer; specifically, it is the partition responsible for initially starting the hypervisor.
- Parent partition: manages resources for a set of child partitions. In V1, only the root is a parent partition.
- Child partition: created by the parent partition. Guest operating systems and applications run in these partitions.
- Virtual processor (VP): each partition has one or more virtual processors associated with it. This allows us to run more (virtual) processors than are physically in the system!
- Intercept: a transition from guest mode to monitor mode, e.g. because a VM generated a page fault or an interrupt arrived while executing in guest mode.
- Enlightenments:
  - Kernel enlightenments are built into the NT kernel and make it run faster on top of the hypervisor.
  - Device enlightenments are ICs (Integration Components) that provide synthetic devices for better performance. They are installed inside the guest, on top of the OS, as drivers and services.
- Logical processor (LP): a single execution pipeline on the physical processor/core. Examples: a physical core with Hyperthreading has 2 LPs (assuming only 2 hardware threads); a quad-core processor has 4 LPs.

[Diagram: Hyper-V architecture, with components labeled by provider (OS, ISV/IHV/OEM, Microsoft, Microsoft/XenSource). The Windows hypervisor runs in "ring -1" on hardware designed for Windows Server. The parent partition (Windows Server 2008) hosts the VM service, WMI provider, VM worker processes, IHV drivers, and VMBus/VSPs. Child partitions run Windows Server 2003/2008 with VMBus/VSCs, a Xen-enabled Linux kernel with Linux VSCs and a hypercall adapter, or a non-hypervisor-aware OS via emulation.]
Goals
- Throughput
- Fairness: the hypervisor scheduler implements weights, reserves and caps, which all contribute to the actual share any given virtual processor (VP) in a VM will get.
- Latency: response time to external events such as interrupts.
- CPU utilization: no CPU should be idle while there is actual work to do.
- Ready time: no thread should stay unscheduled for an unbounded time if it is unblocked and ready to run; in other words, limit the amount of time a ready thread is not scheduled.
- Logical processor affinity: the rate of thread migrations between processors should be kept low.
- Power saving: the basic goal for V1 is to honor the power manager's policies from the root on the schedulable LPs.
- Multiprocessor scheduling: a single-processor scheduler only needs to know which thread to run next; a multiprocessor scheduler needs to know which thread to run next and where to run it.
- The scheduler is preemptive and time-sharing, and uses proportional-share scheduling (it is not a priority-based scheduler).
Basics
- The hypervisor thread is the entity that is subject to scheduling.
- There is one hypervisor thread per VP and one idle thread per LP.
- Each thread either runs in guest mode, to execute guest instructions, or in monitor mode, to execute instructions inside the hypervisor in response to intercepts generated by guest code execution (or to do work on behalf of a work request).
Details
- Each VP can have at most one CPU's worth of processing power at any given time.
- The general ratio of LPs to VPs is relatively low, about 1:1 to 1:8.
- Timers: the local APIC hardware timer is used and there is no periodic clock tick! This means no periodic noise and much finer granularity (100 ns), and the system can potentially stay idle for longer periods.
- Process (one per partition; does not have its own address space like in NT) consists of:
  - Threads
  - Scheduler controls (weights, reserves & caps)
  - Compartment, the memory pool
  - Ideal node and scheduled CPU set
  - ID
- Thread (one per VP, for root and non-root partitions, plus the idle threads) consists of:
  - Stack
  - A collection of flags (current, blocking, ...) and properties (affinity, priority, ...)
  - Scheduling controls (weights, reserves & caps), inherited from the process (they are the same for all threads within a process)
  - Ideal node and CPU
  - Rank
  - Timeslice
  - Workqueue (any thread can send work to any other thread)
  - ID and stats
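The fields above suggest roughly the following layout. This is a hypothetical C sketch for illustration, not the actual Hyper-V structures; all names are invented.

    #include <stdint.h>

    /* Scheduling controls shared by a process and its threads. */
    typedef struct hv_sched_controls {
        uint32_t weight;     /* 1..10000, relative priority          */
        uint32_t reserve;    /* 0..100%, guaranteed CPU share        */
        uint32_t cap;        /* 0..100%, maximum CPU share           */
    } hv_sched_controls_t;

    typedef struct hv_process {          /* one per partition        */
        hv_sched_controls_t controls;    /* inherited by all threads */
        void    *compartment;            /* the memory pool          */
        uint32_t ideal_node;             /* preferred NUMA node      */
        uint64_t scheduled_cpu_set;      /* LPs this partition uses  */
        uint32_t id;
    } hv_process_t;

    typedef struct hv_thread {     /* one per VP, plus idle threads  */
        void    *stack;
        uint32_t flags;            /* current, blocking, ...         */
        uint64_t affinity;         /* properties: affinity, priority */
        hv_sched_controls_t controls;   /* copied from the process   */
        uint32_t ideal_node, ideal_cpu;
        uint64_t rank;             /* position in the run list       */
        uint64_t timeslice;        /* share of the interval, 100 ns  */
        void    *workqueue;        /* cross-thread work requests     */
        uint32_t id;
        uint64_t guest_time, hv_time, total_time;  /* stats, 100 ns  */
    } hv_thread_t;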
- Rank: each thread has a rank inside a queue, which determines when it is scheduled. It is set when the thread's timeslice is calculated (current time + scheduling interval), BUT it is not a deadline.
- Boosts: a boosted thread has its rank set to 0 (the head of the queue). Anti-starvation is achieved by limiting the boost (50 us), avoiding double boosts, and carefully selecting the places where boosts can be made.
- Time accounting: each thread accumulates its guest, hypervisor and total runtime, updated on hypervisor entry/exit (due to an intercept) and on thread switch. There is support for group runtime. It is pretty accurate, with a granularity of 100 ns.
- Per-processor local run list: ordered by thread rank in increasing order. Rank is time based, but this is not really earliest-deadline scheduling. The list is lock-free, as it is only accessed by its own CPU.
- Per-processor deferred ready list: used for migrating threads from one processor to another in a lock-free fashion.
- Per-processor time slice timer: hypervisor timers are mostly lockless for the most frequent timers in the system. It is a one-shot timer; there is no periodic hardware timer.

[Diagram: IdleSummary, thread and process structures.]
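Because the run list is ordered by rank and touched only by its own CPU, insertion needs no lock. A minimal sketch of that ordered insert, with hypothetical names and a singly linked list for brevity:

    #include <stdint.h>

    typedef struct run_node {
        uint64_t rank;              /* now + scheduling interval     */
        struct run_node *next;
    } run_node_t;

    /* Insert into the per-CPU run list in increasing rank order.
     * A boosted thread has rank 0 and lands at the head.            */
    static void runlist_insert(run_node_t **head, run_node_t *t)
    {
        run_node_t **p = head;
        while (*p && (*p)->rank <= t->rank)   /* FIFO among equals   */
            p = &(*p)->next;
        t->next = *p;
        *p = t;
    }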
Thread states and transitions
- Init -> Ready: thread start.
- Running -> Waiting (SchWaitForEvent): the currently running thread waits, e.g. on HLT, MWAIT, sending synchronous work, an intercept, explicit suspend, cap suspend, or thread start/terminate.
- Waiting -> Running (SchSignalEvent): if an idle CPU was selected or a lower-rank thread was unblocked.
- Waiting -> Ready (SchSignalEvent): otherwise.
- Running -> Ready: preemption (a thread with a lower rank becomes ready), an explicit yield, or timeslice end.
- Running -> Terminated.
- The timeslice represents the given share in this proportional-share scheduling.
- A timeslice is calculated when the previous timeslice ends or when a thread arrives from another CPU; in the latter case it is only calculated for the arriving thread!
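As a rough illustration, the states and the arrival rule could be sketched as follows; the names are hypothetical, not the hypervisor's actual definitions:

    #include <stdint.h>

    typedef enum {
        THREAD_INIT,        /* created, not yet readied              */
        THREAD_READY,       /* on a run list, waiting for an LP      */
        THREAD_RUNNING,     /* executing in guest or monitor mode    */
        THREAD_WAITING,     /* blocked in SchWaitForEvent            */
        THREAD_TERMINATED
    } thread_state_t;

    struct thread {
        thread_state_t state;
        uint64_t timeslice;             /* 100 ns units              */
    };

    /* Stub; the interval/weight formula is on the next slide.       */
    static uint64_t compute_timeslice(const struct thread *t)
    {
        (void)t;
        return 0;
    }

    /* Arrival rule: only a thread migrating in from another
     * processor gets a freshly calculated timeslice.                */
    static void sch_ready(struct thread *t, int arrived_from_other_cpu)
    {
        if (arrived_from_other_cpu)
            t->timeslice = compute_timeslice(t);
        t->state = THREAD_READY;  /* then insert into the run list   */
    }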
Scheduling Interval
- The scheduling interval is the inverse of the hardware interrupt rate.
- It is re-calculated every 100 ms, with a maximum of 10 ms and a minimum of 500 us.
- Each LP has its own individual scheduling interval.
- The goal is to limit interrupt latency: all ready threads of a particular LP will be scheduled within one such scheduling interval. That is the scheduling plan.
- A thread receives a timeslice based on the current scheduling interval and the current total weight (relative to its own weight); further adjustment is done according to the thread's reserve and cap.

External Interrupt Handling
- Interrupts are turned off while running code inside the hypervisor (except when idling).
- Hardware interrupts are only acknowledged while in guest mode or idle.
- The root VP thread will be signaled, but does not necessarily run immediately after the interrupt has been received (see scheduling interval).
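The basic weight-proportional computation can be sketched in a few lines of C; reserve and cap adjustments (next slides) are deliberately omitted and the names are hypothetical:

    #include <stdint.h>

    /* A thread's timeslice is its weight's share of the current
     * scheduling interval. For example, a 10 ms interval (100000
     * units of 100 ns) split among threads with total weight 500
     * gives a weight-100 thread a 2 ms slice.                       */
    static uint64_t timeslice_100ns(uint64_t interval_100ns,
                                    uint32_t weight,
                                    uint32_t total_weight)
    {
        return interval_100ns * weight / total_weight;
    }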
Reserves
- Reserve a certain amount of CPU time for a VM, whether or not it is actually using it.
- Range: 0-100%. By default, virtual machines have a CPU reserve of 0%.
- Threads still get minimal timeslices even when running on a CPU with a 100% reserve.
- Reserves take precedence over weights.
Weights
- An easy way to assign priorities to your VMs. Range: 1 (lowest) to 10000 (highest).
- The effective weight is calculated from the remaining unreserved scheduling interval.
- Threads are scheduled such that each thread's actual share is proportional to its weight. For all threads in the system, the perceived or actual share should be as close to the ideal share as possible; inaccuracy arises from differing thread activities, time slicing, and context switch overhead.

Caps
- A cap is the maximum capacity of CPU time that a VM may use. By default the maximum capacity is set to 100% for all VMs.
- Caps are periodically monitored, and the timeslice is adjusted accordingly.
- If a thread exceeds its share, it is suspended (CapSuspend), i.e. it waits on an internal event until a cap timer expires.
- Caps are pretty accurate due to the monitoring, but also somewhat expensive due to the timer usage.
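A minimal sketch of the cap monitoring, assuming the scheduler periodically compares a thread's accumulated runtime against its allowed share of the monitoring period (hypothetical names):

    #include <stdint.h>
    #include <stdbool.h>

    /* True if the thread consumed more than cap% of the monitoring
     * period; such a thread is suspended (CapSuspend) until the cap
     * timer expires and signals the internal event it waits on.     */
    static bool cap_exceeded(uint64_t runtime_100ns,
                             uint64_t period_100ns,
                             uint32_t cap_percent)
    {
        return runtime_100ns * 100 > period_100ns * (uint64_t)cap_percent;
    }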
Example: reserves and weights over a scheduling interval of, e.g., 10 ms.

Thread       Reserve   Weight   Resulting timeslice
Thread I     40%       100      5 ms
Thread II    10%       100      2 ms
Thread III   0%        200      2 ms
Thread IV    0%        100      1 ms
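These figures follow directly from the previous slides: 40% + 10% of the 10 ms interval is reserved (4 ms + 1 ms), and the remaining 5 ms is split by weight (total weight 500). A small standalone C program that reproduces the split:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        const double interval_ms = 10.0;
        const double reserve[] = { 0.40, 0.10, 0.00, 0.00 };
        const uint32_t weight[] = { 100, 100, 200, 100 };

        double reserved = 0.0, total_weight = 0.0;
        for (int i = 0; i < 4; i++) {
            reserved += reserve[i];
            total_weight += weight[i];
        }
        double unreserved_ms = interval_ms * (1.0 - reserved); /* 5 ms */

        /* Each thread gets its reserved share plus a weight-
         * proportional slice of the unreserved remainder.           */
        for (int i = 0; i < 4; i++) {
            double slice = interval_ms * reserve[i]
                         + unreserved_ms * weight[i] / total_weight;
            printf("Thread %d: %.0f ms\n", i + 1, slice);
        }
        return 0;   /* prints 5, 2, 2 and 1 ms */
    }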
Context Switch
- A context switch moves an LP from one thread A to another thread B.
- Save/restore the thread's monitor-mode context, including the stack pointer update.
- The VP state needs to be saved and reloaded (hardware specific, TSC drift mitigations, ...).
- Synchronization with concurrent unblocks.
- Also responsible for signaling an event on thread termination, if requested.
- Clears the time slice timer when going into idle.
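Roughly what the switch path might look like in C, with the hardware-specific pieces reduced to stubs; every name here is hypothetical:

    struct hv_thread {
        void *stack_pointer;        /* monitor-mode stack            */
        int   terminating;          /* termination event requested?  */
    };

    /* Stubs for the hardware-specific parts.                        */
    static void save_vp_state(struct hv_thread *t) { (void)t; }
    static void load_vp_state(struct hv_thread *t) { (void)t; /* incl. TSC drift mitigations */ }
    static void signal_termination_event(struct hv_thread *t) { (void)t; }
    static void clear_timeslice_timer(void) { }

    /* Outline of switching an LP from thread a to thread b.         */
    static void context_switch(struct hv_thread *a, struct hv_thread *b,
                               int b_is_idle)
    {
        save_vp_state(a);               /* save a's VP state         */
        /* ... save monitor-mode registers, update stack pointers,
         *     synchronize with concurrent unblocks ...              */
        if (a->terminating)
            signal_termination_event(a);   /* wake a waiting thread  */
        if (b_is_idle)
            clear_timeslice_timer();    /* no slice to enforce       */
        load_vp_state(b);               /* reload b's VP state       */
    }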
Waiting & Signaling
- Events have only single waiters. Waiting and signaling are frequent operations.
- Threads block for various reasons: VP halts, explicit suspends, intercept handling, a hypercall running on one VP that waits for remote execution on another VP, or cap suspends.
- A wait removes the thread from the run list and connects it to the waited-on event.
- On signal, the thread is unblocked and readied on the best possible processor (temporary affinity, an idle processor, or a balancing decision made by the scheduler).
- Event flags are updated and queried under a spinlock (mostly uncontended).

Deferring a thread
- Happens on: unblock from a wait; timeslice and temporary affinity ending for a running thread; eviction due to exceeded reserves; affinity changes (never happens in V1).
- 2-step best processor selection: (1) an idle processor, if available; (2) otherwise, the least busy non-idle processor.
- Single-affinity cases are easy: just run it!
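A minimal C sketch of the single-waiter event described above. SchWaitForEvent and SchSignalEvent are the entry points named on these slides, but the types, helpers and locking details here are all hypothetical:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct hv_thread { int id; };

    /* Stubs for the scheduler hooks used below.                     */
    static void remove_from_runlist(struct hv_thread *t)     { (void)t; }
    static void ready_on_best_processor(struct hv_thread *t) { (void)t; }

    /* Single-waiter event; its fields are guarded by a spinlock
     * that is rarely contended.                                     */
    struct hv_event {
        atomic_flag lock;
        bool signaled;
        struct hv_thread *waiter;
    };

    /* SchWaitForEvent-style: consume a pending signal or park self. */
    static void event_wait(struct hv_event *e, struct hv_thread *self)
    {
        while (atomic_flag_test_and_set(&e->lock)) { /* spin */ }
        if (e->signaled) {
            e->signaled = false;        /* signal already pending    */
        } else {
            e->waiter = self;           /* single waiter only        */
            remove_from_runlist(self);  /* block this thread         */
        }
        atomic_flag_clear(&e->lock);
    }

    /* SchSignalEvent-style: unblock the waiter, if any, and ready
     * it on the best possible processor.                            */
    static void event_signal(struct hv_event *e)
    {
        struct hv_thread *w = NULL;
        while (atomic_flag_test_and_set(&e->lock)) { /* spin */ }
        if (e->waiter) { w = e->waiter; e->waiter = NULL; }
        else           { e->signaled = true; }
        atomic_flag_clear(&e->lock);
        if (w)
            ready_on_best_processor(w); /* see the next slide        */
    }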
Best Processor Selection
- A single idle mask covers the entire system (which limits it to 64 processors).
- The reduction process is as follows, selecting from highest to lowest preference:
  1. The last-ran CPU (if already in the ideal node)
  2. The last-ran CPU's package, sharing a cache
  3. The ideal CPU
  4. All other CPUs in the ideal node
  5. All the rest
- Two processors are chosen:
  - The 1st processor is either the current, the ideal, or the highest-numbered CPU from the remaining set.
  - The 2nd processor is the next leftmost CPU after the last victim CPU (round robin).
  - Between the two CPUs, the one with the lowest total weight wins.
- Care is taken to exclude processors that already have threads from the same partition/process running.
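Both steps can be sketched in C, assuming a 64-bit idle mask and a hypothetical per-LP total-weight array; the real reduction and exclusion logic is more involved:

    #include <stdint.h>

    #define MAX_LPS 64

    static uint64_t idle_mask;            /* 1 bit per LP, hence the
                                             64-processor limit      */
    static uint32_t lp_total_weight[MAX_LPS];

    /* Step 1: scan candidate groups from most to least preferred
     * (last-ran CPU, shared-cache package, ideal CPU, ideal node,
     * all the rest) and take an idle LP from the first group that
     * has one.                                                      */
    static int pick_idle(const uint64_t *groups, int ngroups)
    {
        for (int g = 0; g < ngroups; g++) {
            uint64_t hit = groups[g] & idle_mask;
            if (hit)
                return __builtin_ctzll(hit);  /* lowest set bit      */
        }
        return -1;                            /* no idle LP          */
    }

    /* Step 2: no idle LP, so compare the two chosen candidates and
     * keep the one with the lowest total weight. The round-robin
     * choice of the second candidate and the same-partition
     * exclusion are elided.                                         */
    static int pick_least_busy(int first, int second)
    {
        return lp_total_weight[second] < lp_total_weight[first]
               ? second : first;
    }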
Load Balancing
- Balancing is important for the hypervisor: it provides a fair share for each VM (proportional fair share).
- It is commonly triggered by unblocking, or through temporary affinity expiration at timeslice end.
- It is done via best processor selection, aka deferring a ready thread.

Idling
- When there is no work, the scheduler runs the idle thread, which arms the timer and enables/disables interrupts on entry/exit.
- The idle thread is woken up by:
  - External interrupts (e.g. the root partition timer tick, a 15.6 ms periodic timer)
  - Timer expiration inside the hypervisor (int 0xFF)
  - Signaling by another processor (int 0xFE), e.g. on the arrival of a new thread
  - Other external events
- Each processor has its own idle thread, with an infinite run list rank.
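A schematic, x86-specific sketch of such an idle loop (hypothetical names, GCC-style inline assembly): the one-shot timer is armed, interrupts are re-enabled only here, and any of the listed events wakes the LP:

    #include <stdbool.h>

    static void arm_next_timer(void)   { /* program one-shot timer   */ }
    static bool runlist_has_work(void) { return true; /* placeholder */ }
    static void enable_interrupts(void)  { __asm__ volatile("sti"); }
    static void disable_interrupts(void) { __asm__ volatile("cli"); }
    static void halt(void)               { __asm__ volatile("hlt"); }

    /* Per-LP idle loop: interrupts are normally masked inside the
     * hypervisor, so idling is the exception where they are open.   */
    static void idle_loop(void)
    {
        for (;;) {
            arm_next_timer();     /* no periodic tick to fall back on */
            enable_interrupts();
            halt();               /* woken by an external interrupt,
                                     a timer (int 0xFF), or an IPI
                                     from another LP (int 0xFE)       */
            disable_interrupts();
            if (runlist_has_work())
                break;            /* switch to the new ready thread   */
        }
    }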
Power Management
- Into which power management state does the logical processor go? That is determined by the root partition.
- The root partition owns the power management policy and determines the appropriate power state based on the busyness of a particular CPU, which it gets from the hypervisor (total utilization including guest activity, ...).

High-Level Performance Data
[Charts: "VM scale for Win2k3 on an 8p host", comparing 1x1P, 1x2P, 2x2P, 4x2P and 8x2P Win2k3 configurations (WF, dynamic), and "Win2k3 versus Win2k8 on an 8p host", comparing Win2k3 WF (dynamic) against Win2k8 WF (dynamic).]
Resources
- Virtualization in general: http://www.microsoft.com/virtualization/default.mspx
- Public APIs, hypervisor hypercalls and MSRs: http://www.microsoft.com/downloads/details.aspx?familyid=91e2e518-c62c-4ff2-8e50-3a37ea4100f5&displaylang=en