Memory Management under Linux: Issues in Linux VM development




Memory Management under Linux: Issues in Linux VM development
Christoph Lameter, Ph.D.
Technical Lead, Linux Kernel Software, Silicon Graphics Inc.
clameter@sgi.com
2008-03-12, SGI, Sunnyvale, California

Introduction
- Short overview of Linux memory management
- Projects that were merged in the last year
- Problems with solutions
- Unsolved problems
- Future directions

Brief Introduction to Memory Management
- Virtual memory management
  - Provides memory to the processes and to operating system services.
  - Has a significant influence on performance; includes various forms of synchronization on multiprocessor systems.
- Resource management
  - Allocate memory so that the processes competing for it make the best progress.
- Linux challenges
  - Increasing complexity of the VM as processor counts and memory sizes grow.
  - Management of memory in 4K chunks in a world of machines with gigabytes of memory.

Memory Management Components
[Slide diagram: the main memory management components and their relationships]
- Page allocator: 4K chunks of memory, higher order allocations
- Anonymous memory and file backed memory, with swapping
- Page cache: read()/write(), mmapped I/O, buffers
- Slab allocator, serving the kernel core and device drivers
- Vmalloc
- Device DMA allocation via the PCI subsystem

Philosophy of Linux Memory Management
- All memory must be put to use. Unused memory is a waste of memory.
- Memory is freed when another use is found for it. It is not freed without reason.
- Memory is managed in zones. A zone is a range of memory suitable for a certain purpose:
  - memory suitable for legacy I/O devices,
  - memory that is easily reclaimable,
  - memory that requires explicit handling (HIGHMEM),
  - general memory.
- NUMA is implemented by adding more zones. A major problem of memory management under NUMA is selecting the zone from which to get memory for a process.

Projects that were merged
- Anti-fragmentation logic for the page allocator (Mel Gorman)
- Memoryless node support
- Device based dirty throttling
- SPARSEMEM virtually mapped memmap
- Quicklists
- 64K page support on IA64 and in the ext2/3/4 filesystems
- Cgroups, a framework for resource control
- Support for more processors on x86_64:
  - reduced per-cpu allocation overhead (support for up to 4k/16k cpus on x86_64)
  - reduced stack use for cpumasks
- SLUB

Device based dirty throttling
- Current waste of memory: a device must have enough pages buffered up to work at optimal speed, and pages should be kept hot.
- Devices may not run at full speed because the number of dirty pages may be limited due to another, slower device creating dirty pages.
- The new dirty throttling algorithm calculates the dirtying rate of a process and the writeout rate of the device, then makes the process produce dirty pages at the rate of disk I/O.
- More memory available; higher device speed. Some report a doubling of speed in certain situations.
- Addresses most of our dirty page issues, though it is not clear that it addresses cases where a node ends up full of dirty pages.

Cpuset based dirty throttling
- Dirty pages are only limited for the system as a whole, so a small cpuset can have all of its memory dirty.
- No memory is then reclaimable, so we get out-of-memory errors (and thoughts of adding memory come up).
- Changes the calculation of dirty ratios to be based on the number of pages in a cpuset.
- Provides an upper limit in addition to the limits imposed by device based throttling.

SLAB vs. SLUB
- Old allocator, SLAB:
  - Exponentially increasing memory use with more nodes and processors.
  - Per-object NUMA control, which led to inefficiencies in the alloc and free paths.
  - Bugs still surface.
  - No ability to defragment memory.
- New allocator, SLUB:
  - Memory efficient (low cache footprint, hence more speed).
  - Speed through the use of atomic operations rather than queuing objects.
  - Framework for slab reclaim.
  - Support for higher order allocations to increase speed.
  - Simpler and easier to understand.
  - cmpxchg_local fastpath that does not require disabling interrupts; cycle counts are reduced by 30%-60%.
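The flavor of a compare-and-swap allocation fastpath can be sketched with a lock-free freelist pop. This is a heavily simplified stand-in for SLUB's cmpxchg_local path, using C11 atomics instead of the kernel primitive, and it omits per-cpu slabs, ABA protection, and the slow path; the names are invented for the sketch.

```c
/* Sketch: SLUB-style lock-free freelist fastpath via compare-and-swap.
 * Simplified illustration, not the kernel's implementation. */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct object {
    struct object *next;   /* freelist link stored inside the free object */
};

/* Pop one object without taking a lock or disabling interrupts:
 * retry if another cpu changed the freelist head meanwhile. */
static struct object *slab_alloc(_Atomic(struct object *) *freelist)
{
    struct object *head = atomic_load(freelist);

    while (head) {
        if (atomic_compare_exchange_weak(freelist, &head, head->next))
            return head;   /* we owned the head; hand it out */
        /* head was reloaded by the failed CAS; try again */
    }
    return NULL;           /* freelist empty: fall back to slow path */
}

/* Push an object back with the same retry loop. */
static void slab_free(_Atomic(struct object *) *freelist, struct object *obj)
{
    obj->next = atomic_load(freelist);
    while (!atomic_compare_exchange_weak(freelist, &obj->next, obj))
        ;                  /* obj->next reloaded on failure; retry */
}
```

The point of the technique is that the common case is a single atomic operation on a hot cacheline, which is why the fastpath can avoid interrupt disabling entirely.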

SLUB issues
- Page allocator dependency
- Variable order slabs
- Tunable slabs
- cmpxchg_local fastpath: 30-60% performance increases; can be used in combination with tunable orders
- Slab defragmentation

64K page size support
- Addresses the high TLB fault cost on IA64.
- Allows larger file system buffers and also scales the page allocator.
- Increases the addressable memory for 3-level page tables, enabling very large systems.
- A 64K page size seems to be a requirement for systems with 2k or 4k processors and about 1k nodes.

Memoryless nodes
- Included in 2.6.24.
- Per-node status information; SLAB issues. Per-node structures are needed for nodes that have no memory.
- Introduction of a node status:
  - Node is possible
  - Node is online
  - Node can accept high memory allocations
  - Node has normal memory
  - Node has processors

Cgroup
- SMP mechanism to control memory.
- Adds a per-cgroup LRU which establishes page ownership by the cgroup.
- Potential issue if it is to replace cpusets: no per-node LRUs!
- Cpuset/cgroup kill on swap?

Virtual Memmap
- Some performance improvement vs. DISCONTIG and regular SPARSEMEM.
- Simplifies code in the kernel in general; allows SPARSEMEM to avoid using page flags for section ids.
- x86_64 has VMEMMAP as the only memory model; DISCONTIG / FLATMEM were removed. The same is to be done for IA64; Tony Luck indicated that he wants to do this.
- i386: the key obstacle to freeing page flags is NUMAQ. The problem is that SPARSEMEM_STATIC leaves no available page flags.
- Future: movable 2M pages, with fallback to 4K?
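The simplification a virtually mapped memmap buys can be sketched in a few lines: with all struct pages laid out at a fixed virtual base, pfn/page conversion is plain pointer arithmetic with no section lookup, and no page flag bits are consumed for section ids. The base address and the one-field struct page below are purely illustrative.

```c
/* Sketch: pfn <-> struct page conversion with a virtually mapped
 * memmap. Simplified illustration; VMEMMAP_BASE is an invented value
 * and struct page is reduced to a single field. */
#include <assert.h>
#include <stdint.h>

struct page { unsigned long flags; };

#define VMEMMAP_BASE ((struct page *)0xffffea0000000000UL) /* illustrative */

/* One addition, no memory section table walk. */
static inline struct page *pfn_to_page(unsigned long pfn)
{
    return VMEMMAP_BASE + pfn;
}

/* One subtraction in the other direction. */
static inline unsigned long page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - VMEMMAP_BASE);
}
```

Holes in physical memory are handled by simply not populating the corresponding parts of the virtual memmap range, which is why the arithmetic can stay this simple.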

Problems for which patch sets and solutions exist
- Offlining memory and nodes: memory unplug
- Performance and scaling problems due to the 4K page size in large systems, or due to page pinning:
  - LRU optimizations (Rik van Riel)
  - Not scanning pinned, mlocked or anonymous pages (Rik)
  - Large blocksize patchset (64K blocksize)
- Too many dirty pages accumulate on some nodes: cpuset/cgroup based dirty throttling
- Swap is not acceptable for HPC applications: cpuset/cgroup kill on swap (Paul Jackson)
- Support for more processors (4k, 16k, 64k): cpu_alloc, cpumask size reduction work (Mike Travis)

Problems with solutions, continued
- Almost empty slabs can consume lots of memory: slab defragmentation
- Unreliability of higher order allocations:
  - fallback to order 0 in SLUB
  - virtual compound pages (for stacks etc.)
- Avoiding cacheline contention in per-cpu allocation: cpu_alloc
- Effective per-cpu counters and other operations: cpu_ops
- DMA zone problems (OOM issues, NUMA memory balancing): DMA zone allocator (Andi Kleen); getting rid of ZONE_DMA(32?)
- Lack of page flags: removal of SPARSEMEM_STATIC

Unresolved Problems
- Page allocator performance for order-0 pages.
- Page allocator performance for higher order pages.
- Devices or subsystems pinning memory (the MMU notifier and EMM notifier exist but are unable to sleep without additional modifications).
- Support for I/O from vmalloc'ed memory.
- Removal of SPARSEMEM_STATIC from i386 to free up page flags.

Avoiding scans of pinned pages
- Pinned pages are not reclaimable (RDMA, XPMEM etc.).
- Too many per node may lead to out-of-memory errors.
- Too large a percentage will lead the VM to continually scan through pages that are not reclaimable: live locking under load.
- The solution is to remove the pages from the reclaim list and go to other nodes if no memory is available. Work in progress by Rik van Riel.

Virtually Mapped Compound Pages
- Avoids vmalloc: virtually mapped memory requires page table lookups and 4K ptes, so the virtual mapping is only used if no contiguous memory is available. This reduces TLB pressure.
- Allows the use of anti-fragmentation measures to speed up the system.
- Allows large stack sizes and avoids failures on order-1 stack allocations.
- Provides the basis for the higher order page cache fallback feature.
- Allows fallback for various higher order allocations in the kernel (useful, for example, for node plug-in).
- Allows reliable use of higher order allocations.
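The fallback idea can be sketched in userspace. This is a stand-in model, not kernel code: a contiguous allocation is attempted first, and only when it fails (simulated here by a flag standing in for fragmentation) are individual pages allocated one by one, as vmalloc would before mapping them into a contiguous virtual range. All names (`struct vcompound`, `vcompound_alloc`) are invented for the sketch.

```c
/* Sketch: prefer a physically contiguous higher-order allocation,
 * fall back to per-page allocation. Userspace model with malloc
 * standing in for the page allocator. */
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

struct vcompound {
    void *base;        /* contiguous block, or NULL if fallback was used */
    void **pages;      /* the individual pages when virtually mapped */
    unsigned int nr;
};

/* 'fail_contig' simulates fragmentation defeating the higher-order
 * allocation; order-0 pages are assumed to always be available. */
static int vcompound_alloc(struct vcompound *vc, unsigned int nr,
                           int fail_contig)
{
    vc->nr = nr;
    vc->pages = NULL;
    vc->base = fail_contig ? NULL : malloc((size_t)nr * PAGE_SIZE);
    if (vc->base)
        return 0;      /* fast path: physically contiguous, no ptes */

    /* Fallback: gather order-0 pages; the kernel would now map them
     * into one contiguous virtual range. */
    vc->pages = malloc(nr * sizeof(void *));
    if (!vc->pages)
        return -1;
    for (unsigned int i = 0; i < nr; i++)
        vc->pages[i] = malloc(PAGE_SIZE);
    return 0;
}
```

The design point is that callers see one "compound" allocation either way; only the fallback path pays the page table and TLB cost, which is why the contiguous attempt comes first.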

Large Buffer / Higher order page cache
- Needed for scaling VM and I/O:
  - avoids the LRU scaling problem,
  - scales I/O: reduces scatter/gather list sizes; linear I/O for large amounts of memory becomes possible.
- Potential to make huge page use transparent; may avoid TLB pressure issues.

Page allocator
- The fastpath is not fast: 8x slower than the slab allocator fastpaths.
- May need to assign blocks to various higher orders.
- Memory defragmentation is missing. Mel did a draft about a year ago, but so far there has been no demand; more higher order page cache use may make the defrag logic necessary.
- Easily fragments.

Patch sequence for compound page support in the page cache
1. Page flag freeing
2. Page cache functions
3. Virtual compound support
4. Support for larger buffers in the I/O layer
5. VM support for compound pages on the LRU
6. Support for DMA to virtual compound pages

Pinned pages issues
- A driver needs to directly map user space pages for DMA transfers or other special needs.
- The VM thinks the pinned pages are only temporarily unavailable and continues attempting to reclaim memory.
- This works fine for small numbers of pinned pages; the more pages are pinned, the more the VM will uselessly attempt to reclaim memory.
- One solution: take pinned pages off the LRU so that they are no longer scanned.
- The other solution: send a notification to the device driver so that the pages are unmapped and memory can be reclaimed.

Conclusion
Questions?