10.04.2008. Thomas Fahrig, Senior Developer, Hypervisor Team. Hypervisor Architecture: Terminology, Goals, Basics, Details



Agenda
- Hypervisor Architecture
- Terminology
- Goals
- Basics
- Details
- Scheduling Interval
- External Interrupt Handling
- Reserves, Weights and Caps
- Context Switch
- Waiting & Signaling
- Best Processor Selection
- Load Balancing
- Idling
- High-Level Performance Data
- Q&A

Hypervisor Architecture
- The hypervisor runs just above the hardware and below the VMs.
- Microkernel architecture: the kernel (in our case the hypervisor) provides only a minimal set of abstractions: address spaces, CPUs, other hardware abstractions (exceptions, traps, interrupts, etc.), and IPC (inter-partition communication).
- Other services are provided as higher-level services, e.g. the virtualization stack in the root partition.
- Diagram: the Windows hypervisor sits on the hardware (processors, memory, devices); above it run the parent partition (Windows Server 2008 64-bit) and child partitions (Windows Server 2003/2008, Windows XP, Linux).

Terminology
- Hypervisor: a layer of software that sits just above the hardware and beneath one or more operating systems. Its primary job is to provide isolated execution environments called partitions. The hypervisor is to Hyper-V what the kernel is to Windows. It contains no device drivers and is launched by a system driver in V1.
- Virtualization stack: everything else that makes up Hyper-V: the user interface, management services, virtual machine processes, emulated devices, etc.
- Partition: the unit of isolation within the hypervisor. A partition is a lighter-weight concept than a virtual machine and could be used outside the context of virtual machines to provide a highly isolated execution environment.
- Virtual machine: the combination of a partition and the services provided to that partition via the virtualization stack. Also called the guest.
- Root partition: the controlling partition in which the virtualization stack runs and which owns hardware devices. This is the first partition on the computer; specifically, it is the partition responsible for initially starting the hypervisor.
- Parent partition: manages resources for a set of child partitions. In V1, only the root is a parent partition.
- Child partition: created by the parent partition. Guest operating systems and applications run in these partitions.
- Virtual processor (VP): each partition has one or more virtual processors associated with it. This allows running more (virtual) processors than exist physically in the system!

Terminology (continued)
- Intercept: a transition from guest mode to monitor mode, e.g. a VM generated a page fault or an interrupt arrived while executing in guest mode.
- Enlightenments: kernel enlightenments built into the NT kernel make the kernel run faster on top of the hypervisor. Device enlightenments are ICs (integration components) that provide synthetic devices for better performance; they are installed inside the guest on top of the OS as drivers and services.
- Logical processor (LP): a single execution pipeline on a physical processor/core. Examples: a physical core with Hyper-Threading has 2 LPs (assuming only 2 hardware threads); a quad-core processor has 4 LPs.
- Diagram: the Windows hypervisor runs at ring -1 on hardware designed for Windows Server. The parent partition (Windows Server 2008) hosts the VM service, WMI provider, VM worker processes, IHV drivers, and VMBus with VSPs. Child partitions run either Windows Server 2003/2008 with a VSC over VMBus, a Xen-enabled Linux kernel with a Linux VSC and hypercall adapter, or a non-hypervisor-aware OS via emulation.

Scheduler Goals
- Throughput/fairness: the hypervisor scheduler implements weights, reserves and caps, which together determine the actual share any given virtual processor (VP) in a VM will get.
- Latency: response time to external events such as interrupts.
- CPU utilization: no CPU should be idle when there is actual work to do.

Scheduler Goals (continued)
- Ready time: no thread should stay unscheduled for an unbounded time if it is unblocked and ready to run; in other words, limit the amount of time a ready thread is not scheduled.
- Logical processor affinity: keep the rate of thread migrations between processors low.
- Power saving: the basic goal for V1 is to honor the power manager's policies from the root on the schedulable LPs.
- Multiprocessor scheduler: a single-processor scheduler only needs to decide which thread to run next; a multiprocessor scheduler also needs to decide where to run it.
- The scheduler is preemptive, time-sharing, and proportional-share (not priority-based).

Basics
- The hypervisor thread is the entity subject to scheduling: one hypervisor thread per VP, plus one idle thread per LP.
- Each thread runs either in guest mode, executing guest instructions, or in monitor mode, executing instructions inside the hypervisor in response to intercepts generated by guest code (or doing work on behalf of a work request).

Basics (continued)
- Each VP can have at most one CPU's worth of processing power at any given time.
- The general LP-to-VP ratio is relatively low, roughly 1:1 to 1:8.
- Timers: the local APIC hardware timer is used; there is no periodic clock tick! That means no periodic noise and much finer granularity (100 ns), and the system can potentially stay idle for longer periods.
- Process: one per partition. It does not have its own address space as in NT. It consists of threads, scheduler controls (weights, reserves and caps), the compartment (the memory pool), an ideal node and scheduled CPU set, and an ID.
- Thread: one per VP (root and non-root partitions), plus the idle threads. It consists of a stack, a collection of flags (current, blocking, ...) and properties (affinity, priority, ...), scheduling controls (weights, reserves and caps) inherited from the process (all threads within a process are the same), ideal node and CPU, rank, timeslice, a workqueue (any thread can send work to any other thread), and an ID and statistics.
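The process and thread structures above can be sketched as data types. This is a minimal Python sketch with hypothetical names (SchedControls, HvProcess, HvThread) and only a few of the listed fields; the real hypervisor structures hold far more state:

```python
from dataclasses import dataclass, field

@dataclass
class SchedControls:
    weight: int = 100    # 1 (lowest) .. 10000 (highest)
    reserve: int = 0     # percent of one LP reserved, 0..100
    cap: int = 100       # maximum percent of one LP, 0..100

@dataclass
class HvProcess:
    """One per partition; no address space of its own."""
    partition_id: int
    controls: SchedControls = field(default_factory=SchedControls)
    threads: list = field(default_factory=list)

@dataclass
class HvThread:
    """One per virtual processor (plus idle threads, omitted here)."""
    thread_id: int
    process: HvProcess
    rank: int = 0
    timeslice_100ns: int = 0

    def __post_init__(self):
        # Scheduling controls are inherited from the owning process:
        # all threads within a process share the same weights,
        # reserves and caps.
        self.controls = self.process.controls
        self.process.threads.append(self)
```

Sharing the controls object by reference mirrors the slide's point that all threads within a process carry identical scheduling controls.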

Details
- Rank: each thread has a rank inside a queue which determines when it is scheduled. It is set when the thread's timeslice is calculated (now + scheduling interval), BUT it is not a deadline.
- Boosts: anti-starvation is done by limiting the boost (50 us), avoiding double boosts, and carefully selecting the places where boosts can be made. A boosted thread's rank is set to 0 (head of the queue).
- Time accounting: each thread accumulates its guest, hypervisor, and total runtime, updated on hypervisor entry/exit (due to an intercept) and on thread switch. Group runtime is supported. Accounting is quite accurate, with a granularity of 100 ns.
- Per-processor local run list: ordered by thread rank in increasing order. Rank is time-unit based, but not a true earliest deadline. Lock-free: only accessed by the same CPU.
- Per-processor deferred ready list: used for migrating threads from one processor to another in a lock-free fashion.
- Per-processor timeslice timer: hypervisor timers are mostly lockless for the most frequent timers in the system. One-shot timer; there is no periodic hardware timer.
- IdleSummary, thread and process structures.
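The rank-ordered run list and the rank-0 boost can be illustrated in a few lines. A Python sketch with hypothetical helpers (insert_by_rank, boost); the real list is lock-free and touched only by its own CPU:

```python
import bisect

def insert_by_rank(run_list, rank, thread_id):
    """Insert into the per-processor run list, kept sorted by
    increasing rank, so the head is the next thread to schedule."""
    bisect.insort(run_list, (rank, thread_id))

def boost(run_list, thread_id):
    """Anti-starvation boost: re-queue the thread at rank 0,
    i.e. at the head of the queue."""
    for i, (rank, tid) in enumerate(run_list):
        if tid == thread_id:
            del run_list[i]
            bisect.insort(run_list, (0, tid))
            return
```

The scheduler always picks the head of the list; boosting a starving thread to rank 0 makes it the next candidate without reordering anything else.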

Thread States
- Init -> Ready: at thread start.
- Ready -> Running: selected by the scheduler.
- Running -> Waiting: SchWaitForEvent, e.g. HLT, MWAIT, sending synchronous work, an intercept, explicit suspend, cap suspend, thread start/terminate, ...
- Waiting -> Ready: SchSignalEvent; preemption occurs if an idle CPU was selected or a lower-rank thread unblocked.
- Running -> Ready: preemption (a thread with a lower rank becomes ready), explicit yield, or timeslice end.
- Running -> Terminated: at thread exit.
- The timeslice represents the given share in this proportional-share scheduling. A timeslice is calculated when a previous timeslice ends or when a thread arrives from another CPU; it is only calculated for the arriving thread!
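The transitions above form a small state machine. A Python sketch of the legal transitions, using the state names from the slide; ALLOWED and transition are illustrative names, not hypervisor symbols:

```python
# Legal thread state transitions as described on the slide.
ALLOWED = {
    "Init":    {"Ready"},                            # thread start
    "Ready":   {"Running"},                          # picked by scheduler
    "Running": {"Waiting",                           # SchWaitForEvent
                "Ready",                             # preempt/yield/timeslice end
                "Terminated"},                       # thread exit
    "Waiting": {"Ready"},                            # SchSignalEvent
}

def transition(state, new_state):
    """Validate a thread state change against the state machine."""
    if new_state not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```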

Scheduling Interval
- The scheduling interval is the inverse of the hardware interrupt rate. It is recalculated every 100 ms, with a maximum of 10 ms and a minimum of 500 us. Each LP has its individual scheduling interval.
- The goal is to limit interrupt latency: all ready threads of a particular LP are scheduled within one scheduling interval; that is the scheduling plan.
- A thread receives its timeslice based on the current scheduling interval and the current total weight (relative to its own weight). Further adjustment is done according to the thread's reserve and cap.
- External interrupt handling: interrupts are turned off while running code inside the hypervisor (except when idling). Hardware interrupts are only acknowledged while in guest mode or idle. The root VP thread is signaled, but does not necessarily run immediately after the interrupt has been received (see scheduling interval).
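The interval bounds and the weight-proportional timeslice can be written out directly. A sketch under the numbers stated above (500 us minimum, 10 ms maximum); the function names are hypothetical:

```python
INTERVAL_MIN_US = 500       # slide: minimum scheduling interval
INTERVAL_MAX_US = 10_000    # slide: maximum scheduling interval

def clamp_interval(us):
    """Clamp a recalculated per-LP scheduling interval to
    [500 us, 10 ms]."""
    return max(INTERVAL_MIN_US, min(INTERVAL_MAX_US, us))

def weight_timeslice_us(interval_us, weight, total_weight):
    """A thread's timeslice from its weight relative to the total
    weight on this LP (before reserve/cap adjustment)."""
    return interval_us * weight // total_weight
```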

Reserves
- Reserve a certain amount of CPU time for a VM, whether or not it actually uses it.
- Range: 0-100%. By default, virtual machines have a CPU reserve of 0%.
- Threads still get minimal timeslices even when running on a CPU with a 100% reserve.
- Reserves take precedence over weights.

Weights
- An easy way to assign priorities to your VMs. Range: 1 (lowest) to 10000 (highest).
- The effective weight is calculated from the remaining unreserved scheduling interval.
- Threads are scheduled such that their actual share is proportional to their weights. Inaccuracy results from different thread activities, time slicing, and context-switch overhead; for all threads in the system, the perceived or actual share should be as close to the ideal share as possible.

Caps
- The maximum capacity of CPU time that a VM may use. By default the maximum capacity is set to 100% for all VMs.
- Caps are periodically monitored and the timeslice is adjusted accordingly.
- If a thread exceeds its share it is suspended (CapSuspend), i.e. it waits on an internal event until a cap timer expires.
- Quite accurate due to the monitoring, but also somewhat expensive due to the timer usage.
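Cap monitoring boils down to comparing accumulated runtime against the capped fraction of the monitoring window. A minimal sketch with a hypothetical cap_exceeded check; the actual CapSuspend path parks the thread on an internal event until a cap timer expires:

```python
def cap_exceeded(used_us, window_us, cap_percent):
    """True when a VP has consumed more than its cap's share of the
    monitoring window, in which case the scheduler would CapSuspend
    it until the cap timer expires.  Integer math avoids float error."""
    return used_us * 100 > window_us * cap_percent
```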

Example (scheduling interval: 10 ms)
- Thread I: reserve 40%, weight 100 -> 5 ms
- Thread II: reserve 10%, weight 100 -> 2 ms
- Thread III: reserve 0%, weight 200 -> 2 ms
- Thread IV: reserve 0%, weight 100 -> 1 ms
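One reading of this example that reproduces the numbers: grant the reserves first, then split the unreserved remainder of the interval in proportion to the weights. shares_us is a hypothetical helper, not the hypervisor's actual algorithm:

```python
def shares_us(interval_us, threads):
    """threads: list of (reserve_percent, weight) per virtual processor.
    Reserves are granted first (they take precedence over weights);
    the unreserved remainder is split in proportion to the weights."""
    reserved = [interval_us * r // 100 for r, _ in threads]
    remainder = interval_us - sum(reserved)
    total_weight = sum(w for _, w in threads)
    return [res + remainder * w // total_weight
            for res, (_, w) in zip(reserved, threads)]
```

Reserves of 40% and 10% take 5 ms of the 10 ms interval; the remaining 5 ms splits 1:1:2:1 by weight, giving exactly the 5/2/2/1 ms on the slide.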

Context Switch
- Switch from one thread A to another thread B.
- Save/restore the thread's monitor-mode context; update the stack pointer.
- The VP state needs to be saved and reloaded (hardware specific, TSC drift mitigations, ...).
- Synchronize with concurrent unblocks.
- Also responsible for signaling an event for thread termination if requested.
- Clear the timeslice timer when going idle.

Waiting & Signaling
- Only single waiters; a frequent operation.
- Threads block for various reasons: VP halts, explicit suspend, intercept handling, a hypercall running on one VP waiting for remote execution on another VP, cap suspend.
- Wait removes the thread from the run list and connects it to the awaited event.
- On signal, the thread is unblocked and readied on the best possible processor (temporary affinity, an idle processor, or based on a balancing decision the scheduler makes).
- Event flags are updated and queried under a spinlock (mostly uncontended).
- A thread is deferred when it unblocks from a wait, when timeslice and temporal affinity end for a running thread, on eviction due to exceeded reserves, or on affinity changes (never happens in V1).
- 2-step best processor selection: 1. an idle processor if available; 2. otherwise the least busy non-idle processor. Single-affinity cases are easy: just run it!
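The single-waiter wait/signal pairing can be sketched as a tiny event object. All names are hypothetical and no locking is shown; in the hypervisor the flags are updated under a spinlock, and the readied thread goes through best-processor selection rather than onto a plain list:

```python
class HvEvent:
    """Single-waiter event sketch: at most one thread parks on it."""

    def __init__(self):
        self.signaled = False   # latched signal, no waiter present
        self.waiter = None      # the single blocked thread, if any

    def wait(self, thread, run_list):
        """Return True if satisfied immediately; otherwise remove the
        thread from its run list and park it on the event."""
        if self.signaled:
            self.signaled = False
            return True
        if thread in run_list:
            run_list.remove(thread)
        self.waiter = thread
        return False

    def signal(self, ready):
        """Unblock the waiter (readied here onto a caller-chosen list,
        standing in for best-processor selection), or latch the signal
        for the next wait."""
        if self.waiter is not None:
            ready.append(self.waiter)
            self.waiter = None
        else:
            self.signaled = True
```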

Best Processor Selection
- A single idle mask for the entire system (limits it to 64 processors).
- The reduction proceeds from highest to lowest preference: the last-ran CPU (if already in the ideal node), the last-ran CPU's package with shared cache, the ideal CPU, all other CPUs in the ideal node, then all the rest.
- Among non-idle processors, two candidates are chosen: the 1st is either the current, ideal, or highest-numbered CPU from the remaining set; the 2nd is the next leftmost CPU from the last victim CPU (round robin). Between the two, the one with the lowest total weight wins.
- Care is taken to exclude processors that already have threads from the same partition/process running.
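The idle-mask reduction can be sketched with 64-bit arithmetic. pick_idle_cpu and its mask parameters are hypothetical stand-ins for the preference categories on the slide:

```python
def pick_idle_cpu(idle_mask, last_cpu, package_mask, ideal_cpu, node_mask):
    """Reduce the 64-bit system idle mask by decreasing preference:
    last-ran CPU, its shared-cache package, the ideal CPU, the rest of
    the ideal node, then any idle CPU.  Returns a CPU index or None."""
    if idle_mask & (1 << last_cpu):
        return last_cpu
    for candidates in (package_mask,        # shared-cache package
                       1 << ideal_cpu,      # ideal CPU
                       node_mask,           # rest of the ideal node
                       (1 << 64) - 1):      # all the rest
        hit = idle_mask & candidates
        if hit:
            return (hit & -hit).bit_length() - 1  # lowest set bit's index
    return None  # nothing idle: fall back to least-busy selection
```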

Load Balancing & Idling
- Balancing is important for the hypervisor: it provides a fair share for each VM (proportional fair share).
- Commonly triggered by unblocking or by temporary-affinity expiration at timeslice end; done via best processor selection, aka deferring a ready thread.
- When there is no work, the scheduler runs the idle thread, which arms the timer and enables/disables interrupts on entry/exit.
- The idle thread is woken up by external interrupts (e.g. the root partition timer tick, a 15.6 ms periodic timer), timer expiration inside the hypervisor (int 0xFF), signaling by another processor (int 0xFE, e.g. arrival of a new thread), and other external events.
- Each processor has its own idle thread with an infinite run-list rank.

Power Management & Performance
- Which power management state does a logical processor go to? Determined by the root partition: the root owns the power management policy and determines the appropriate power state based on the busyness of a particular CPU, which it gets from the hypervisor (total utilization including guest activity, ...).
- [Charts: "VM scale for Win2k3 on an 8p host" (1x1P through 8x2P Win2k3 configurations) and "Win2k3 versus Win2k8 on an 8p host" (WF dynamic workload); numeric axis data not recoverable from the transcription.]

References
- Virtualization in general: http://www.microsoft.com/virtualization/default.mspx
- Public APIs - Hypervisor hypercalls and MSRs: http://www.microsoft.com/downloads/details.aspx?familyid=91e2e518-c62c-4ff2-8e50-3a37ea4100f5&displaylang=en