EE282 Lecture 11: Virtualization & Datacenter Introduction
Christos Kozyrakis
http://ee282.stanford.edu
EE282, Spring 2013, Lecture 11
Announcements
Project 1 is due on 5/8
HW2 is due on 5/20
Project 2 is pushed to next week
Virtualization Summary
With VMs, multiple OSes share hardware resources.
[Figure: VM 0 ... VM n, each running apps over a guest OS, stacked on a Virtual Machine Monitor (VMM) over the physical host hardware]
The VMM creates virtual copies of a complete HW system
Key properties: partitioning, encapsulation
Uses: server consolidation/scaledown, client consolidation, security, ...
Options: hypervisor vs. hosted architecture, full vs. para-virtualization
Essential properties: safety, equivalency, efficiency
Reminder: Is my ISA Virtualizable?
Basic requirement: at least two execution modes (user & kernel)
Extra requirement: all sensitive instructions must be privileged
Sensitive instructions: those that change the HW configuration (allocations, mappings, ...) or whose outcome depends on the HW configuration
E.g., write the TLB or read the processor mode
Privileged instructions: if executed in user mode, they trap into kernel mode
Notes:
There can be privileged instructions that are not sensitive
Memory accesses must go through a privileged translation stage (e.g., paging)
An architecture may provide further support for VMs
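The requirement above is the classic Popek-Goldberg criterion, which reduces to a simple set check. A minimal sketch (the instruction names are illustrative, not from any real ISA, except POPF/CLI/STI from the next slide):

```python
# Popek-Goldberg: a classic trap-and-emulate VMM is possible iff every
# sensitive instruction is also privileged (i.e., traps in user mode).

def virtualizable(sensitive, privileged):
    """True iff all sensitive instructions trap when run in user mode."""
    return set(sensitive) <= set(privileged)

# A toy RISC-like ISA where TLB writes and mode reads all trap:
ok_isa = virtualizable(
    sensitive={"write_tlb", "read_mode"},
    privileged={"write_tlb", "read_mode", "halt"})

# Classic 32-bit x86: POPF silently drops the interrupt-enable flag in
# user mode instead of trapping, so it is sensitive but not privileged.
x86 = virtualizable(
    sensitive={"popf", "cli", "sti"},
    privileged={"cli", "sti"})
```

The x86 case is exactly the "virtualization hole" the next slide discusses: one non-trapping sensitive instruction is enough to break trap-and-emulate.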
x86 Virtualization Challenges
Ring deprivileging: run the guest OS above ring 0; control privileged state access
Virtualization holes: ring compression; non-trapping operations (e.g., CPUID, CLI/STI, POPF); excessive trapping
Software solutions: binary translation (enables optimization!) and paravirtualization
[Figure: ring deprivileging places apps in ring 3 and a legacy OS in ring 1; a binary translator with a translation cache handles sensitive instructions; paravirtualization runs a modified OS over the VMM]
HW Support for CPU Virtualization
New CPU operating mode: guest OSes run at their intended rings
VT root operation (for the VMM), non-root operation (for the guest)
Eliminates ring compression
New transitions: VM entry and VM exit swap registers and address space in one atomic operation
VM control structure (VMCS): configured by VMM software; specifies guest OS state; controls when VM exits occur
[Figure: guest VMs (e.g., Windows, Linux) run apps in ring 3 over a guest OS in ring 0; VM entry/exit transitions to the VMM in VT root mode, with state held in the VMCS, on processors with VT-x (or VT-i)]
Memory Virtualization Challenges
Address translation: the guest OS expects contiguous, zero-based physical memory; the VMM must preserve this illusion
Page-table shadowing: the VMM intercepts paging operations and constructs a copy of the page tables (shadow page tables)
Overheads: induced VM exits add to execution time; shadow page tables consume significant host memory
[Figure: guest page tables in each VM are remapped by the VMM into shadow page tables used by the TLB and CPU]
HW Support for Memory Virtualization
Extended page tables (EPT): map guest-physical to host-physical addresses; new hardware page-table walker
Performance benefit: the guest OS can modify its own page tables freely, with no VM exits
Memory savings: without EPT, shadow page tables are required for each guest user process; a single EPT supports an entire VM
[Figure: VMs run over a VMM on VT-x hardware with an EPT walker and extended page tables]
Extended Page Tables
[Figure: CR3 points to the guest page tables, which map a guest linear address to a guest physical address; the EPT base pointer (EPTP) points to the extended page tables, which map the guest physical address to a host physical address]
Regular page tables: map guest-linear to guest-physical (translated again); can be read and written by the guest
New EPT page tables under VMM control: map guest-physical to host-physical (accesses memory); referenced by a new EPT base pointer
No VM exits due to page faults, INVLPG, or CR3 accesses
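The two-stage mapping above can be sketched with dicts standing in for the two sets of page tables (single-level and page-granular for brevity; the page numbers are made up):

```python
# Minimal sketch of the two-stage lookup EPT performs.

PAGE = 4096

guest_pt = {0: 7, 1: 3}    # guest-virtual page -> guest-physical page (guest-owned)
ept      = {7: 42, 3: 13}  # guest-physical page -> host-physical page (VMM-owned)

def translate(guest_vaddr):
    vpn, offset = divmod(guest_vaddr, PAGE)
    gpn = guest_pt[vpn]   # stage 1: guest page tables (guest can edit freely)
    hpn = ept[gpn]        # stage 2: EPT, invisible to the guest
    return hpn * PAGE + offset

print(hex(translate(0x123)))  # guest page 0 -> guest-physical 7 -> host-physical 42
```

The key point: the guest only ever sees and edits `guest_pt`; the VMM changes the VM's placement in host memory by editing `ept` alone, with no guest involvement.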
EPT Details
All guest-physical addresses are translated by the EPT: CR3, PDE, PTE, and the guest-physical page base address
Includes PDPTRs and 64-bit paging structures
Page faults take priority over VM exits
[Figure: the guest linear address is resolved through the page directory and page table, with each step translated through the EPT tables on the way to a host physical address]
Watch out: Higher TLB Miss Cost
With 32-bit addressing: 2-level walk (CR3 → PDE → PTE → physical address)
Watch out: Higher TLB Miss Cost
With 64-bit addressing: 4-level walk (CR3 → PML4 → PDP → PDE → PTE → physical address)
Watch out: Higher TLB Miss Cost
With virtualization: 24 steps in the walk!
[Figure: each step of the guest walk (CR3, PML4, PDP, PDE, PTE) produces a guest-physical address that must itself be translated through a 4-level EPT walk (L4 → L3 → L2 → L1)]
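A quick way to see where 24 comes from, assuming every guest-physical address used in the walk (CR3 plus each table entry read) needs a full EPT walk of its own:

```python
# Rough count of memory references on a TLB miss under nested paging.

def walk_refs(guest_levels, ept_levels):
    # (guest_levels + 1) guest-physical addresses to translate: CR3 plus
    # one per guest table entry; each costs ept_levels EPT references.
    # Add the guest_levels actual guest-table reads themselves.
    return (guest_levels + 1) * ept_levels + guest_levels

print(walk_refs(4, 0))   # native 4-level walk: 4 references
print(walk_refs(4, 4))   # virtualized 4-level over 4-level EPT: 24 references
```

This ignores the page-walk caches that real hardware uses to short-circuit most of these references, which is one answer to the discussion question that follows.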
Discussion
How do we solve the problem of expensive translation?
I/O Virtualization Challenges
Software methods: (1) I/O device emulation, (2) paravirtualized device interfaces
The virtual device interface traps device commands, translates DMA operations, and injects virtual interrupts
Challenges: overheads of copying I/O buffers; controlling DMA and interrupts
[Figure: guest device drivers in each VM talk to a device model in the VMM, which drives the physical device driver for storage and network devices]
Virtual and Physical Device Interfaces
The guest device driver programs the virtual device interface: device configuration accesses, I/O-port accesses, memory-mapped device registers
The virtual device model proxies accesses to the physical device driver: possible translation of commands, translation of DMA addresses
The device driver programs the actual physical I/O device: device configuration, I/O-port and MMIO accesses
The physical device responds to commands: DMA transactions to host physical memory, physical device interrupts
The virtual device model proxies device activity back to the guest OS: copying (or translation) of DMA buffers, injection of virtual interrupts
Motivation for HW Support
Example: DMA incoming network packets to guest space (a guest OS buffer or a guest user buffer)
Requires DMAs that understand virtual addresses
With virtualization, must offer security in the presence of multiple apps/VMs
HW Support for I/O Virtualization
Common challenges to I/O virtualization: controlling device access to memory (DMA remapping); controlling device interrupts (interrupt remapping)
Applications for DMA remapping: memory protection and isolation (for reliability/security); direct assignment of I/O devices to VMs; controlling DMA to I/O buffers within a VM
Applications for interrupt remapping: isolate interrupt requests to the proper VMs; enable the VMM to efficiently route interrupt requests; complement CPU support for interrupt virtualization
HW Support for I/O Virtualization
Defines an architecture for DMA and interrupt remapping
Implemented as part of the core logic chipset; most functionality is now integrated in CPU chips
[Figure: CPUs connect over the system bus to an I/O controller with VT-d (IOMMU) support, DRAM, integrated devices, PCIe root ports, and a south bridge for PCI, LPC, and legacy devices]
IOMMU Architecture
DMA requests carry a device ID, a virtual address, and a length
The DMA remapping engine uses memory-resident device-assignment structures (indexed by bus, device, and function) to select per-device address translation structures, then issues the memory access with the system physical address
Translation and context caches avoid walking the memory-resident partitioning and translation structures; bad requests are reported via fault generation
[Figure: devices D1 and D2 are assigned separate 4KB page tables through the device assignment structures]
Page Tables
The requestor ID (bus: bits 15-8, device: bits 7-3, function: bits 2-0) indexes the device assignment tables; the matching entry specifies a 4-level page table
The DMA virtual address is split into level-4 (bits 47-39), level-3 (bits 38-30), level-2 (bits 29-21), and level-1 (bits 20-12) table offsets plus a page offset (bits 11-0); bits 63-48 must be zero
The walk proceeds through the four page-table levels to a 4KB page
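A sketch of the field extraction this slide describes, assuming the same 9-bit table indices as x86-64 4-level paging (the helper names are mine, not from any spec):

```python
# Slice the requestor ID and a 48-bit DMA virtual address into the fields
# the remapping hardware uses to index its tables.

def split_requestor_id(rid):
    # 16-bit requestor ID: bus[15:8], device[7:3], function[2:0]
    return (rid >> 8) & 0xFF, (rid >> 3) & 0x1F, rid & 0x7

def split_dma_addr(addr):
    # Returns [L4 index, L3 index, L2 index, L1 index, page offset]
    fields = [(addr >> shift) & 0x1FF for shift in (39, 30, 21, 12)]
    return fields + [addr & 0xFFF]

addr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_dma_addr(addr))   # [1, 2, 3, 4, 5]
```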
Translation Caching
H/W caches frequently used remapping structures, avoiding the overhead of accessing structures in memory; caches support tagging by a software-assigned ID
IOTLB: caches translations for recently accessed pages
IOTLB scaling through PCIe address translation services: allows devices to locally cache translations
Context cache: caches device-assignment entries
Interrupt Remapping
An interrupt request specifies request & originator IDs; remap hardware transforms the request into a physical interrupt
Interrupt-remapping hardware enforces isolation through use of the originator ID and caches frequently used remap structures
Software may modify the remapping for efficient interrupt redirection
Applicable to all interrupt sources: legacy interrupts delivered through I/O APICs; message signaled interrupts (MSI, MSI-X)
Works with existing device hardware
Discussion
How do NICs with multiple queues help with virtualization?
Deployment Models for I/O Virtualization
Hypervisor model: I/O services and device drivers live in the hypervisor; shared devices
Pros: high performance, I/O device sharing, VM migration. Con: large hypervisor
Service VM model: I/O services and device drivers run in dedicated service VMs over the hypervisor; shared devices
Pros: higher security, I/O device sharing, VM migration. Con: lower performance
Pass-through model: each guest OS runs its own device drivers against directly assigned devices
Pros: higher performance, rich device features. Cons: limited sharing, VM migration limits
IOMMU & Hypervisor Model
Improved reliability and protection: the hypervisor owns the remap tables; errant DMA is detected & reported to the VMM
Bounce buffer support: limited DMA addressability in some I/O devices prevents access to high memory
Bounce buffers are a software technique that relays I/O buffers in high memory through DMA-reachable low memory, at the cost of extra copies
The IOMMU eliminates the need for bounce buffers
IOMMU & Service VM Model
Device driver deprivileging: device drivers run in a service OS and program devices in a DMA-virtual address space
The service VM forwards DMA API calls to the hypervisor, which sets up the DMA-virtual to host-physical translation
Further improvement in protection: a guest device driver is unable to compromise hypervisor code or data, either through DMA or through CPU-initiated memory accesses
IOMMU & Pass-through Model
Direct device assignment to the guest OS: the hypervisor sets up the guest-physical to host-physical DMA mapping tables; the guest OS directly programs the physical device
Multi-queue or multi-interface devices: the hypervisor can assign device interfaces directly to VMs (see the PCI SIG I/O virtualization standards)
VMMs Before & After HW Support
Higher-level VMM functions: resource discovery / provisioning / scheduling / user interface
Processor virtualization: ring deprivileging and binary translation before; VT-x or VT-i configuration after
Memory virtualization: page-table shadowing before; EPT configuration after
I/O device virtualization: I/O device emulation plus DMA and interrupt remapping in software before; VT-d (DMA and interrupt remap) configuration after
[Figure: VMs run over the VMM on the physical platform resources: processors with VT-x/VT-i and EPT, memory, and I/O devices with VT-d and PCI SIG I/O virtualization]
Performance Implications of Virtualization
Layered resource management
Frequent context switching
Pressure on the TLB and address translation
Longer I/O code paths, extra data copies
Increased memory hierarchy contention
[Figure: VMs with guest OSes over the VMM on the physical host hardware: processors, memory & cache, I/O]
HW Support Vs. Binary Translation
Binary translation VMM: converts traps to callouts (callouts are faster than trapping); faster emulation routines (the VMM does not need to reconstruct state); can avoid callouts entirely
Hardware-supported VMM: preserves code density; no precise-exception overhead; faster system calls
Compute-bound Benchmarks
Bottom line: little difference for SPEC
Other Benchmarks
Explanation for the mixed results?
Reminder: Uses of VMs
Memory compression
Memory deduplication
Virtual machine migration
Thin provisioning
Enable an old OS on modern hardware
Archiving applications
Sandboxing
Take your apps with you on a memory stick
Introduction to Data Centers
Readings for today: Barroso & Holzle textbook, chapters 1 & 2
Slide credits: James Hamilton, Jeff Dean, Facebook
Hamilton's blog is a great resource for DC technology: http://perspectives.mvdirona.com
Facebook: http://opencompute.org/
What is a Datacenter (DC)?
The compute infrastructure for internet-scale services & cloud computing
Examples: Google, Facebook, Yahoo, Amazon Web Services, Microsoft, Baidu, ...
Both consumer and enterprise services: Windows Live, Gmail, Hotmail, Dropbox, Bing, Google, adCenter, Exchange Hosted Services, web apps, Exchange Online, salesforce.com, the Azure platform, Google Apps, ...
What is a Datacenter (DC)?
A simplistic view: a scaled-up version of machine rooms for enterprise computing
A large collection of commodity components: PC-based servers (CPUs, DRAM, disks), Ethernet networking, commodity OS and software stack
10s to 100s of thousands of nodes
System software for DC management (centralized)
Software that implements internet services
A More Complete View of a DC
Apart from computers & network switches, you need:
Power infrastructure: voltage converters and regulators, generators and UPSs, ...
Cooling infrastructure: A/C, cooling towers, heat exchangers, air impellers, ...
Co-designed!
Example: MS Quincy Datacenter
470K sq. feet (10 football fields)
Next to a hydro-electric generation plant
At up to 40 megawatts, $0.02/kWh is better than $0.15/kWh
That's equal to the power consumption of 30,000 homes
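Back-of-the-envelope arithmetic for the two electricity rates quoted above:

```python
# Annual electricity bill at a 40 MW draw under the two quoted rates.
mw, hours_per_year = 40, 24 * 365
cheap = mw * 1000 * hours_per_year * 0.02    # $/kWh next to the hydro plant
typical = mw * 1000 * hours_per_year * 0.15  # a typical grid rate
print(f"${cheap/1e6:.1f}M vs ${typical/1e6:.1f}M per year")
```

The gap between the two rates is tens of millions of dollars per year, which is why siting next to cheap power is worth it.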
Example: MS Chicago Datacenter
Microsoft's Chicago Data Center [K. Vaid, Microsoft Global Foundation Services, HotPower '10, Oct 3, 2010]
Motivation for Internet-scale Services & Datacenters
Some applications need big machines; examples: search, language translation, etc.
User experience: ubiquitous access; ease of management (no backups, no config)
Vendor benefits (all translate to lower costs): faster application development; tight control of system configuration; ease of (re)deployment for upgrades and fixes; single-system view for storage and other resources; lower cost by sharing HW resources across many users; lower cost by amortizing HW/storage management costs
Data Center Cost
Business models based on low costs: high volume needs low costs (how much do you pay for Gmail?)
Key competitive advantage
Key metric: total cost of ownership (TCO)
Two components to cost: capital expenses (CAPEX) and operational expenses (OPEX)
Cost Model: Facilities CAPEX
Size of facility: 8MW; cost of facility: $11/W
See table below for typical construction costs
Total facility CAPEX: $88M
Facility costs for power and cooling: 82% = $72.16M
Other infrastructure: 18% = $15.84M
US accounting rules convert CAPEX to OPEX
Facilities amortization time: 10-15 years; annual cost of money: 5%
OPEX = Pmt(interest_rate, number_of_payments, principal)
Monthly OPEX: power and cooling: $765K; other infrastructure: $168K
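The slide's monthly figures can be reproduced with the standard annuity-payment (Pmt) formula, assuming monthly compounding and the 10-year end of the stated amortization range:

```python
# Amortize facility CAPEX into a monthly OPEX payment (annuity formula).

def pmt(annual_rate, years, principal):
    r, n = annual_rate / 12, years * 12  # monthly rate, number of payments
    return principal * r / (1 - (1 + r) ** -n)

facility = 8e6 * 11              # 8 MW at $11/W = $88M
power_cooling = 0.82 * facility  # $72.16M
other_infra = 0.18 * facility    # $15.84M

print(round(pmt(0.05, 10, power_cooling)))  # ~$765K / month
print(round(pmt(0.05, 10, other_infra)))    # ~$168K / month
```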
Cost Model: Systems CAPEX
Servers: 45,978 servers x $1,450 per server = $66.7M CAPEX
Depreciation: 3 years; cost of money: 5%; monthly CAPEX: $2,000K
Networking: rack switches: 1,150 x $4,800; array switches: 22 x $300K; layer-3 switches: 2 x $500K; border routers: 2 x $144.8K = $13.41M CAPEX
Depreciation: 4 years; cost of money: 5%; monthly CAPEX: $309K
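The same Pmt formula reproduces the systems numbers (the helper is redefined here so the sketch stays self-contained):

```python
# Amortize server and networking CAPEX over their depreciation periods.

def pmt(annual_rate, years, principal):
    r, n = annual_rate / 12, years * 12
    return principal * r / (1 - (1 + r) ** -n)

servers = 45_978 * 1_450                        # $66.7M server CAPEX
network = (1_150 * 4_800 + 22 * 300_000         # rack + array switches
           + 2 * 500_000 + 2 * 144_800)         # layer-3 + border routers

print(round(pmt(0.05, 3, servers)))   # ~$2,000K / month
print(round(pmt(0.05, 4, network)))   # ~$309K / month
```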
Cost Model: OPEX Costs
Power [= MegaWattsCriticalLoad * 1000 * AveragePowerUsage * PUE * PowerCost * 24 * 365 / 12]
$0.07/kWh; PUE = 1.45; average power use: 80%
$474K monthly OPEX
People: security guards: 3 x 24 x 365 x $20 + facilities: 1 x 24 x 365 x $30; benefits multiplier: 1.3
$85K monthly OPEX
Network bandwidth costs to the internet: varies by application and usage
Vendor maintenance fees + sysadmins: varies by equipment and negotiations
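Spelling out the power formula with the slide's inputs (note the MW-to-kW conversion, since the rate is quoted per kWh):

```python
# Monthly power bill: critical IT load x average utilization, scaled up
# by PUE, times hours per month and the electricity rate.

mw_critical, util, pue, rate = 8, 0.80, 1.45, 0.07  # rate in $/kWh
kwh_per_month = mw_critical * 1000 * util * pue * 24 * 365 / 12
print(round(kwh_per_month * rate))  # ~$474K / month
```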
Cost Model: TCO
Servers (amortized): $1,998,103
Power & cooling infrastructure (amortized): $765,369
Power: $474,208
Other infrastructure (amortized): $168,008
Network (amortized): $308,814
People: $85,410
Observations:
34% of costs are related to power (trending up, while server costs trend down)
Networking is high at 8% of overall costs, 15% of server costs
How is this different from traditional enterprise computing?
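A quick sanity check of the observations against the table's own numbers (the power-related share computes to roughly a third; the slide rounds it to 34%):

```python
# Monthly TCO components from the slide, in dollars.
monthly = {
    "servers": 1_998_103, "power_cooling_infra": 765_369, "power": 474_208,
    "other_infra": 168_008, "network": 308_814, "people": 85_410,
}
total = sum(monthly.values())
power_related = monthly["power_cooling_infra"] + monthly["power"]

print(f"power-related share: {power_related / total:.0%}")               # ~33%
print(f"network share:       {monthly['network'] / total:.0%}")          # ~8%
print(f"network vs servers:  {monthly['network'] / monthly['servers']:.0%}")  # ~15%
```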
Enterprise Vs. Internet-scale Computing: a Cost Perspective
Enterprise computing approach: the largest cost is people, which scales roughly with servers (~100:1 is common)
Enterprise interests focus on consolidation & utilization: consolidate workload onto fewer, larger systems; large SANs for storage & large routers for networking
Internet-scale services approach: the largest cost is server H/W, typically followed by cooling, power distribution, and power
Networking varies from very low to dominant depending upon the service
People costs are under 10% & often under 5% (1000+:1 server:admin)
Services interests center around work-done-per-$ (or per watt)
Observation: people costs shift from the top to nearly irrelevant; focus instead on work done per $ & work done per watt
TCO Discussion
Anything else missing? Tip: what can go wrong in a data center?
Using the cost analysis
The cost model is a powerful tool for design tradeoffs
E.g., can we reduce power cost with a different disk?
Burdened cost of a Watt per year: what does this mean?
($765K + $475K) * 12 / 8MW = $1.86/Watt/year
A 1TB disk uses 10W of power and costs $90. An alternate disk consumes only 5W but costs $150. If you were the data center architect, what would you do?
Answer
A 1TB disk uses 10W of power and costs $90. An alternate disk consumes only 5W but costs $150. If you were the data center architect, what would you do?
At ~$2/Watt/year, even if we saved the entire 10W of disk power, we would save $20 per year. We are paying $60 more for the disk, so it is probably not worth it.
What is this analysis missing?
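The arithmetic behind the answer, using the burdened $/Watt/year figure derived on the previous slide and an assumed 3-year disk depreciation window (the window is my assumption, matching the server depreciation period):

```python
# Burdened cost of a Watt: monthly power OPEX plus amortized power &
# cooling infrastructure, annualized, divided by the critical load.
burdened = (765_369 + 474_208) * 12 / 8e6   # ~$1.86 per Watt per year

watts_saved = 10 - 5          # the alternate disk's actual savings
extra_cost = 150 - 90         # its price premium
disk_life_years = 3           # assumed depreciation window

savings = watts_saved * burdened * disk_life_years
print(f"save ~${savings:.0f} over {disk_life_years} years vs ${extra_cost} extra")
```

Even crediting power savings over the disk's whole life, roughly $28 of savings does not cover the $60 premium, so the cheaper, hotter disk wins on this metric alone.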