KVM & Memory Management Updates
KVM Forum 2012
Rik van Riel, Red Hat, Inc.
KVM & Memory Management Updates
- EPT Accessed & Dirty Bits
- 1GB hugepages
- Balloon vs. Transparent Huge Pages
- Automatic NUMA Placement
- Automatic Ballooning
EPT Accessed & Dirty Bits
- Extended Page Tables (EPT)
  - Second set of page tables, translating guest physical (virtual) addresses to machine physical
  - Removes the need for shadow page tables
- Originally, EPT only supported permissions and translations
  - Accessed & Dirty bits emulated in software
  - Extra page faults taken to track A & D information
- Newer CPUs support hardware Accessed & Dirty bit tracking in EPT
  - Already upstream
  - Eliminates the extra page faults
1GB hugepages
- Interface to hugetlbfs
  - In -mm
  - Allows 1GB size huge pages
  - Desired page size can be specified at mmap time
  - Statically allocated; not evictable, always pinned
  - In hugetlbfs only, not for transparent huge pages
- Use cases
  - HPC on bare metal
  - Use KVM for partitioning, with large guests
  - Want the last bit of performance
  - Does not need memory overcommit
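A minimal sketch of requesting a 1GB page at mmap time. It assumes the -mm interface that encodes the desired page size in the mmap flags (later merged as MAP_HUGE_1GB); the exact flag names in the patch series discussed here may differ, and the mapping only succeeds if 1GB pages were reserved for hugetlbfs at boot.

```c
/* Sketch: map one statically allocated 1GB huge page via hugetlbfs.
 * Assumes the MAP_HUGE_SHIFT-encoded page size flags from -mm. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)   /* log2(1GB) = 30 */
#endif

int main(void)
{
    size_t len = 1UL << 30;   /* one 1GB huge page */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");   /* fails unless 1GB pages were reserved at boot */
        return 1;
    }

    /* Touching the mapping faults in the pinned, non-evictable huge page. */
    ((char *)p)[0] = 1;
    munmap(p, len);
    return 0;
}
```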
Balloon vs. Transparent Huge Pages
- Transparent Huge Pages (THP)
  - Use 2MB pages for userspace when possible
  - Typical 5-15% performance improvement
  - Memory defragmented through compaction: move data around to free up 2MB areas
- Balloon driver
  - Allocate memory in a guest and return it to the host
  - Guest does not use memory while it is in the balloon
  - Effectively shrinks the guest memory size
- Proposed patch series by Rafael Aquini addresses the conflict between the two (see below)
THP & Compaction
- Transparent Huge Pages (THP) need 2MB blocks of memory
- Normal (4kB) allocations can fragment free memory
- Compaction can move page cache & anonymous memory data around
  - Frees up 2MB areas for THP use
Compaction vs. Balloon Pages
- Balloon pages are not anonymous or page cache memory
  - The compaction code does not know how to move them
- Balloon pages can take up a lot of memory
  - When compaction fails, the THP performance benefits are not realized
- Solution: teach compaction how to move balloon pages
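Conceptually, the patch series has to let compaction swap a ballooned page for a fresh one and update the balloon's bookkeeping, so the old page can be freed into a contiguous 2MB block. The sketch below only illustrates that idea; the helper names are made up and this is not the actual virtio_balloon or compaction code.

```c
/* Illustration of "making balloon pages movable": replace the old page
 * with a new one in the balloon and tell the host about the change.
 * Helper names are invented for the example. */
#include <stdbool.h>
#include <stdio.h>

struct page { unsigned long pfn; bool in_balloon; };

/* Pretend hypercall: tell the host which guest page now backs the balloon. */
static void tell_host_balloon_page(unsigned long old_pfn, unsigned long new_pfn)
{
        printf("host: balloon page moved %lu -> %lu\n", old_pfn, new_pfn);
}

/* Called by compaction when it wants to relocate a ballooned page. */
static bool balloon_migrate_page(struct page *newpage, struct page *oldpage)
{
        if (!oldpage->in_balloon)
                return false;            /* not a balloon page after all */

        /* Nothing to copy: ballooned pages hold no guest data. */
        newpage->in_balloon = true;
        oldpage->in_balloon = false;
        tell_host_balloon_page(oldpage->pfn, newpage->pfn);
        return true;                     /* oldpage is now free for a 2MB block */
}

int main(void)
{
        struct page old = { .pfn = 1000, .in_balloon = true };
        struct page new = { .pfn = 2000, .in_balloon = false };

        if (balloon_migrate_page(&new, &old))
                printf("page %lu freed for compaction\n", old.pfn);
        return 0;
}
```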
NUMA: Non Uniform Memory Access
- Each CPU has its own memory (fast)
- Other memory can be accessed indirectly (slower)
Unlucky NUMA Placement
- Without NUMA placement code, this can happen:
- [Diagram: VM1 and VM2 each have one vcpu on node 0 and one on node 1, so both VMs straddle the two nodes]
Optimized NUMA Placement
- With numad, autonuma, sched/numa, numa/core, ...
- 3-15% performance improvement typical
- [Diagram: both of VM1's vcpus run on node 0, both of VM2's vcpus run on node 1]
Automatic NUMA Placement
- Obvious what better NUMA placement looks like... but how do we get there?
- Numad
  - Userspace NUMA placement daemon, for long lived tasks
  - Checks how much memory and CPU each task uses
  - Bin packing moves tasks to NUMA nodes where they fit
  - Works right now
  - Not as good with dynamically changing workloads
  - More overhead, higher latency than a kernel side solution
- Kernel side solution desired
  - Several competing kernel side solutions proposed
Automatic NUMA Placement (Kernel)
- Three codebases, with some similarities and some differences
  - Autonuma (Andrea Arcangeli)
  - Sched/numa & numa/core (Peter Zijlstra & Ingo Molnar)
  - Merged simple codebase (Mel Gorman)
- Strategies are mostly the same
  - NUMA page faults & migration
  - Node scheduler affinity driven by fault statistics
  - Differences are mostly implementation details
NUMA Faults & Migration
- Page fault driven NUMA migration: memory follows CPU
- Periodically, mark process memory as inaccessible
  - Clear the present bit in the page table entries, mark them as NUMA
  - Rate limited to some number of MB/second
  - 3-15% NUMA gain typical, overhead limited to less than that
  - Only done for older processes; short lived processes are not affected
- When the process tries to access that memory
  - The page fault code recognizes a NUMA fault and calls the NUMA fault handling code
  - If the page is on the wrong NUMA node, try to migrate it to where the task is running now
  - If migration fails (no free memory on the target node, page locked, ...), leave the page on the old node
  - Increment the per-task NUMA fault statistics for the node where the faulted-on page is now
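A userspace toy version of that fault path, to make "memory follows CPU" concrete. This is purely illustrative: the real logic lives in the kernel page fault handler, and the names here are invented for the example.

```c
/* Toy simulation of a NUMA hinting fault: on access, try to migrate the
 * page to the node the task currently runs on, then record the fault
 * against the node where the page ended up. Illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 2

struct page { int node; };
struct task { int cpu_node; long numa_faults[NR_NODES]; };

/* Migration can fail: target node full, page locked, ... */
static bool migrate_page(struct page *p, int target, bool target_has_room)
{
        if (!target_has_room)
                return false;
        p->node = target;
        return true;
}

static void numa_hinting_fault(struct task *t, struct page *p, bool room)
{
        if (p->node != t->cpu_node)
                migrate_page(p, t->cpu_node, room);   /* best effort */
        /* Count the fault against the node the page is on now. */
        t->numa_faults[p->node]++;
}

int main(void)
{
        struct task t = { .cpu_node = 1 };
        struct page p = { .node = 0 };

        numa_hinting_fault(&t, &p, true);    /* page migrates to node 1 */
        numa_hinting_fault(&t, &p, true);    /* already local */

        printf("page on node %d, faults: node0=%ld node1=%ld\n",
               p.node, t.numa_faults[0], t.numa_faults[1]);
        return 0;
}
```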
NUMA Fault Statistics
- NUMA page faults keep an array per task
  - Recent NUMA faults incurred on each node
- Periodically the fault stats are divided by 2
  - Ages the stats: new faults count more than old ones
- Answers the question: where is the memory I am currently using?
- Example faults per node:
    Node0   100
    Node1  3200
    Node2     5
    Node3     0
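A minimal sketch of that bookkeeping: count faults per node, halve the counters periodically so recent faults dominate, and read off the node with the most faults as the answer to "where is my memory?". Names and structure are illustrative, not kernel code.

```c
/* Sketch of per-task NUMA fault statistics with periodic decay. */
#include <stdio.h>

#define NR_NODES 4

struct numa_stats { long faults[NR_NODES]; };

/* Called from the NUMA hinting fault handler. */
static void record_fault(struct numa_stats *s, int node)
{
        s->faults[node]++;
}

/* Called periodically: age the statistics so old faults fade away. */
static void decay_stats(struct numa_stats *s)
{
        for (int n = 0; n < NR_NODES; n++)
                s->faults[n] /= 2;
}

/* "Where is the memory I am currently using?" */
static int preferred_node(const struct numa_stats *s)
{
        int best = 0;
        for (int n = 1; n < NR_NODES; n++)
                if (s->faults[n] > s->faults[best])
                        best = n;
        return best;
}

int main(void)
{
        struct numa_stats s = { .faults = { 100, 3200, 5, 0 } };  /* slide example */
        record_fault(&s, 1);
        decay_stats(&s);
        printf("preferred node: %d\n", preferred_node(&s));      /* node 1 */
        return 0;
}
```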
Fault Driven Scheduler Affinity
- CPU follows memory: the scheduler tries to run the task where its memory is
- Sched/numa & numa/core
  - Set the desired task NUMA node
  - Hope the load balancer moves the task there later
- Autonuma
  - Searches other NUMA nodes for one with more memory accesses than the current node
  - The current task's NUMA locality must improve
  - Finds a task to trade places with; overall cross-node accesses must go down
  - Tasks trade places immediately, not via the scheduler load balancer
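A conceptual sketch of the autonuma-style swap decision: only trade places with a task on another node if our own locality improves and total locality does not get worse (i.e. cross-node accesses go down). This is illustrative only; the real code also uses per-mm statistics and many other details.

```c
/* Sketch: decide whether two tasks on different nodes should trade places,
 * based on their per-node fault statistics. Illustrative only. */
#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 2

struct task_stats {
        int node;                    /* node the task currently runs on */
        long faults[NR_NODES];       /* recent NUMA faults per node */
};

static long local_faults(const struct task_stats *t, int node)
{
        return t->faults[node];
}

/* Would swapping task a with task b reduce cross-node accesses? */
static bool should_swap(const struct task_stats *a, const struct task_stats *b)
{
        long before = local_faults(a, a->node) + local_faults(b, b->node);
        long after  = local_faults(a, b->node) + local_faults(b, a->node);

        /* a's locality must improve, and total locality must improve too. */
        return local_faults(a, b->node) > local_faults(a, a->node) &&
               after > before;
}

int main(void)
{
        struct task_stats a = { .node = 0, .faults = { 10, 500 } };
        struct task_stats b = { .node = 1, .faults = { 300, 20 } };

        printf("swap? %s\n", should_swap(&a, &b) ? "yes" : "no");   /* yes */
        return 0;
}
```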
Autonuma vs. Sched/numa vs. ...
- Autonuma (Andrea Arcangeli)
  - Performs better, but unknown exactly why
  - Actively groups threads within a task together
  - Fault stats kept not just per task, but also per mm
  - More complex than sched/numa or Mel's merged tree
  - Not known which complexity could be removed
- Sched/numa & numa/core
  - Simpler than autonuma
  - Does not perform as well; large regressions on some tests, against mainline
  - Unclear what needs to be changed to make it work better
- Mel's merged tree
  - Clean, but extremely simplistic (MORON policy)
  - Basic infrastructure from autonuma and sched/numa
  - Meant as a basis for a better placement policy
  - Clear way forward from this codebase
Dealing With NUMA Overflow
- Some workloads do not fit nicely on NUMA nodes
- Memory overflow: a process with more memory than fits on one NUMA node
  - Not all of the process's memory fits on one node, so some page migrations will fail
  - Fault statistics tell the scheduler where the memory is
  - Hopefully threads will migrate to their memory
- CPU overflow: a process with more threads than there are CPUs in a node
  - Not all threads can run on one node, so memory gets referenced from multiple nodes
  - Only migrate a page on sequential faults from the same node
  - Threads that cannot be migrated to their #1 preferred node migrate to a nearby node instead
    - A node where the process has many NUMA faults
- Unclear if this is enough...
Memory Overcommit
- Cloud computing is a race to the bottom
  - Cheaper and cheaper virtual machines expected
  - Need to squeeze more virtual machines onto each computer
- CPU & bandwidth are easy
  - More CPU time and bandwidth become available each second
- The amount of memory stays constant (a non-renewable resource)
  - Swapping to disk is too slow
  - SSD is an option, but could be too expensive
  - Need something fast & free
- Frontswap + zram are an option
  - Compressing unused data is slower than tossing it out
- Ballooning guests is an option
  - The balloon driver returns memory from the guest to the host
Automatic Ballooning
- Goal: squeeze more guests into a host, with minimal performance hit
  - Avoid swapping to disk
- The balloon driver is used to return memory to the host
  - For one guest to grow, others must give memory back
  - Could be done automatically
- Balloon guests down when the host runs low on free memory
  - Put pressure on every guest
  - Guests return a little bit of memory to the host
  - Memory can be used by the host itself, or by other guests
- Memory pressure inside guests deflates the balloons
  - Guests get memory back from the host in small increments
- Ideally, the pressures will balance out
- Some of the pieces are in place, some proposed, some not developed yet; a longer term project
Vmevents & Guest Shrinking
- Proposed patch series by Anton Vorontsov, Pekka Enberg, ...
- Vmevents file descriptor
  - Program opens the fd using a special syscall
  - Can be used with blocking read() or poll/select()
  - Kernel tells userspace when memory needs to be freed
    - Minimal pressure: free garbage collected memory?
    - Medium pressure: free some caches, shrink some guests?
    - OOM pressure: something unimportant should exit now
- Qemu-kvm, libvirtd or something else in the virt stack could register for such vm events
  - When required, tell one or more guests to shrink
  - Guests will balloon some memory, freeing it to the host
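A rough sketch of how something in the virt stack might consume such an fd. The vmevent_fd() call, the event structure, and the pressure levels below are placeholders for the proposed (and still changing) interface, not a merged kernel API; shrink_some_guest() stands in for whatever mechanism actually asks a guest to balloon.

```c
/* Hypothetical consumer of a vmevents-style pressure fd. All names are
 * placeholders for the proposed interface, not a merged kernel API. */
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

enum pressure { PRESSURE_MIN, PRESSURE_MEDIUM, PRESSURE_OOM };

struct vmevent { int pressure; };

/* Stub for the special syscall that would open the event fd. */
static int vmevent_fd(void)
{
        return -1;   /* a kernel with the patches would return a real fd */
}

/* Stub: ask the virt stack to balloon some guest down by `mb` megabytes. */
static void shrink_some_guest(int mb)
{
        printf("ballooning a guest down by %d MB\n", mb);
}

int main(void)
{
        int fd = vmevent_fd();
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        struct vmevent ev;

        if (fd < 0)
                return 1;                       /* kernel without vmevents support */

        for (;;) {
                poll(&pfd, 1, -1);              /* block until a pressure event */
                if (read(fd, &ev, sizeof(ev)) != (ssize_t)sizeof(ev))
                        break;

                switch (ev.pressure) {
                case PRESSURE_MEDIUM:           /* free caches, shrink some guests */
                        shrink_some_guest(64);
                        break;
                case PRESSURE_OOM:              /* shrink aggressively, right now */
                        shrink_some_guest(512);
                        break;
                default:                        /* minimal pressure: do nothing */
                        break;
                }
        }
        return 0;
}
```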
Ballooning & Guest Growing
- Automatically shrinking guests is great... but they may need their memory back at some point
- The Linux pageout code has various pressure balancing concepts
  - #scanned / #rotated shows how heavily used the pages in each LRU set are (the use ratio)
  - Each cgroup has its own LRU sets, with its own use ratios
    - Pressure between cgroups is independent of use ratios...
  - ... seek cost for objects in slab caches
    - Slab cache pressure is independent of the use ratio of LRU pages...
  - ... this could use some unification
- The balloon can be called from the pageout code inside a guest
  - When we reclaim lots of pages... also request some pages back from the balloon
  - Avoid doing this for streaming file I/O
    - Adding pressure in the host will just result in being shrunk again later
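A rough illustration of that last idea, not the actual pageout code: when reclaim inside the guest has to free a lot of pages for reasons other than streaming file I/O, also ask the balloon for some pages back, so guest memory grows under genuine pressure.

```c
/* Illustrative only: tie balloon deflation to reclaim pressure inside
 * the guest. Not the actual Linux pageout code; thresholds are made up. */
#include <stdbool.h>
#include <stdio.h>

/* Placeholder for the balloon driver's "give me pages back" operation. */
static void balloon_deflate(unsigned long pages)
{
        printf("asking the host for %lu pages back\n", pages);
}

static void reclaim_pages(unsigned long nr_reclaimed, bool streaming_io)
{
        /* ... normal page reclaim would happen here ... */

        /*
         * If we had to reclaim a lot, also take some pages out of the
         * balloon. Skip this for streaming file I/O: that pressure is
         * self-inflicted, and growing the guest for it would only get
         * the guest shrunk again later by the host.
         */
        if (!streaming_io && nr_reclaimed > 256)
                balloon_deflate(nr_reclaimed / 4);
}

int main(void)
{
        reclaim_pages(1024, false);   /* real memory pressure: deflate a bit */
        reclaim_pages(4096, true);    /* streaming I/O: leave the balloon alone */
        return 0;
}
```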
Ballooning & QOS
- Ballooning can help avoid host swapping, which helps latencies
- But...
  - Shrinking a guest too small could also impact latencies...
  - ... or even cause a guest to run out of memory
- Never shrink a guest below a certain size
  - How big? No way to tell automatically...
  - A reasonable minimum size needs to be specified in the configuration
  - Minimums enforced by whatever listens for VM notifications (qemu-kvm, libvirtd, vdsm, ...?)
- When a host gets overloaded, other actions need to be taken
  - Live migrate a guest away
  - Kill an unimportant guest?
- Automatic ballooning is part of a larger solution
Conclusions
- KVM performance continues to improve
  - No large gains left, but many small ones
- Uncontroversial changes (nearly) upstream
  - EPT A/D bits & 1GB huge pages
- Larger changes take time to get upstream
  - Complex problems, unresolved questions...
- Two fiercely competing NUMA placement projects
  - Mel Gorman heroically smashed the two together
  - The new tree looks like a good basis for moving forward
  - Typical 3-15% performance gains expected
- Squeeze more guests onto each system
  - Shrink and grow guest memory as needed
  - Extensive use of ballooning makes THP allocations harder
  - Patches proposed to make balloon pages movable
Questions?