
Understanding Linux Memory Management
SHARE 102, Session 9241
Dr. Ulrich Weigand
Linux on zSeries Development, IBM Lab Böblingen
Ulrich.Weigand@de.ibm.com

Linux Memory Management - A Mystery?
What does this mean:

    $ free
                 total       used       free     shared    buffers     cached
    Mem:        512832     473256      39576          0      52300     236776
    -/+ buffers/cache:     184180     328652
    Swap:       524280        912     523368

Or this:

    $ cat /proc/meminfo
    MemTotal:     512832 kB   Dirty:            28 kB
    MemFree:       39512 kB   Writeback:         0 kB
    Buffers:       52308 kB   Mapped:         5492 kB
    Cached:       236768 kB   Slab:         158608 kB
    SwapCached:      532 kB   Committed_AS:   7656 kB
    Active:       246328 kB   PageTables:      208 kB
    Inactive:      61920 kB   VmallocTotal: 1564671 kB
    HighTotal:         0 kB   VmallocUsed:     724 kB
    HighFree:          0 kB   VmallocChunk: 1563947 kB
    LowTotal:     512832 kB
    LowFree:       39512 kB
    SwapTotal:    524280 kB
    SwapFree:     523368 kB

Agenda
- Physical Memory
- Dynamic Address Translation
- Process Address Spaces and Page Cache
- Kernel Memory Allocators
- Page Replacement and Swapping
- Virtualization Considerations

Physical Memory
- Basic allocation unit: page (4096 bytes)
- Use of each page is described by a 'struct page': allocation status, backing store, LRU lists, dirty flag, ...
- Pages are aggregated into memory zones; zones may be aggregated into nodes (NUMA systems)
- Boot process
  - Kernel is loaded at the bottom of memory
  - Determine the size of memory, create the 'struct page' array
  - Kernel pages and boot memory are marked as 'reserved'
  - All other pages form the master pool for page allocation
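
As a rough illustration of the bookkeeping just described, here is a toy user-space model: one descriptor per 4 KB page frame, kept in an array indexed by page frame number. This is not the kernel's real 'struct page', just a sketch of the idea.

    #include <stdlib.h>

    #define PAGE_SIZE 4096UL

    /* Simplified model of a page descriptor; the real 'struct page'
     * carries much more state (mapping, LRU links, reference counts, ...). */
    struct page_desc {
        unsigned int reserved : 1;   /* kernel/boot memory, never handed out */
        unsigned int dirty    : 1;   /* must be written back before reuse    */
        unsigned int free     : 1;   /* currently in the allocator's pool    */
    };

    int main(void)
    {
        unsigned long mem_bytes = 512UL << 20;            /* a 512 MB machine   */
        unsigned long nr_pages  = mem_bytes / PAGE_SIZE;  /* one desc per frame */

        /* One descriptor per physical page frame, indexed by frame number. */
        struct page_desc *mem_map = calloc(nr_pages, sizeof(*mem_map));
        if (!mem_map)
            return 1;

        /* Pretend the kernel occupies the first 4 MB at the bottom of memory. */
        for (unsigned long pfn = 0; pfn < (4UL << 20) / PAGE_SIZE; pfn++)
            mem_map[pfn].reserved = 1;

        /* Everything that is not reserved forms the master pool. */
        for (unsigned long pfn = 0; pfn < nr_pages; pfn++)
            if (!mem_map[pfn].reserved)
                mem_map[pfn].free = 1;

        free(mem_map);
        return 0;
    }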

Memory Zones
- GFP_HIGHMEM ("high memory zone")
  - Memory not directly addressable from kernel space
  - On Intel machines: all > 4/2/1 GB
  - On zSeries: empty
- GFP_NORMAL ("normal zone")
  - Memory generally usable by the kernel, except possibly for I/O
  - On 31-bit zSeries: empty
  - On 64-bit zSeries: all > 2 GB
- GFP_DMA ("DMA zone")
  - Memory usable without restrictions
  - On 31-bit zSeries: all memory
  - On 64-bit zSeries: all < 2 GB (on Intel: all < 16 MB)
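
As a hedged sketch of how the zones look from kernel code on a 2.6-era kernel (alloc_pages, __get_free_page and kmap are standard kernel interfaces, but this is illustrative, not drop-in driver code):

    #include <linux/gfp.h>
    #include <linux/highmem.h>
    #include <linux/mm.h>
    #include <linux/string.h>

    /* Sketch only: assumes a kernel-module context. */
    static void zone_examples(void)
    {
        /* DMA zone: memory usable without restriction
         * (below 2 GB on 64-bit zSeries, below 16 MB on Intel). */
        struct page *dma_page = alloc_pages(GFP_KERNEL | GFP_DMA, 0);

        /* Normal zone: directly mapped kernel memory. */
        unsigned long addr = __get_free_page(GFP_KERNEL);

        /* High memory: not directly addressable by the kernel, so it must be
         * mapped temporarily before use (this zone is empty on zSeries). */
        struct page *high_page = alloc_pages(GFP_HIGHUSER, 0);
        if (high_page) {
            void *mapped = kmap(high_page);
            memset(mapped, 0, PAGE_SIZE);   /* use the page through the mapping */
            kunmap(high_page);
            __free_page(high_page);
        }

        if (addr)
            free_page(addr);
        if (dma_page)
            __free_page(dma_page);
    }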

Kernel Memory Allocators
- Low-level page allocator
  - Buddy system for contiguous multi-page allocations
  - Provides pages for: in-kernel allocations (slab cache), vmalloc areas (kernel modules, multi-page data areas), page cache, anonymous user pages, misc. other users
- Slab cache
  - Manages allocations of objects of the same type
  - Large-scale users: inodes, dentries, block I/O, network, ...
  - kmalloc (the generic allocator) is implemented on top of it
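
For illustration, a hedged kernel-code sketch of the allocation interfaces built on these layers (NULL checks omitted; kfree, vfree and free_pages tolerate failed allocations):

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    static void allocator_examples(void)
    {
        /* Low-level page allocator: 2^2 = 4 physically contiguous pages. */
        unsigned long pages = __get_free_pages(GFP_KERNEL, 2);

        /* Slab-backed generic allocator: small, physically contiguous object. */
        void *obj = kmalloc(256, GFP_KERNEL);

        /* vmalloc: a large area that is only virtually contiguous,
         * stitched together from individual pages via the page tables. */
        void *big = vmalloc(1 << 20);

        vfree(big);
        kfree(obj);
        free_pages(pages, 2);
    }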

Buddy Allocator
- Order-n free lists (per zone), for orders 0 .. MAX_ORDER; an order-n block covers 2^n contiguous pages
- Allocating an order-n block: if none is free, an order-(n+1) block is split
- Freeing an order-n block: if its 'buddy' is also free, merge them into an order-(n+1) block
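
A minimal user-space model of the split/merge arithmetic (illustrative only): the buddy of an order-n block that starts at page frame number pfn is found by flipping bit n of the pfn.

    #include <stdio.h>

    /* Buddy of an order-n block starting at 'pfn': flip bit n.
     * The two buddies are the halves of the order-(n+1) block they came from. */
    static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
    {
        return pfn ^ (1UL << order);
    }

    /* Start of the merged order-(n+1) block: clear bit n. */
    static unsigned long parent_pfn(unsigned long pfn, unsigned int order)
    {
        return pfn & ~(1UL << order);
    }

    int main(void)
    {
        /* Splitting the order-3 block at pfn 8 yields order-2 blocks at 8 and 12. */
        printf("buddy of order-2 block at pfn 8 -> %lu\n", buddy_pfn(8, 2));   /* 12 */
        /* Freeing the order-2 block at pfn 12 while 8 is free merges them back. */
        printf("merged order-3 block starts at  -> %lu\n", parent_pfn(12, 2)); /* 8 */
        return 0;
    }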

Dynamic Address Translation: 31-bit
[Figure] The 31-bit virtual address is split into a Segment Index (SX, 11 bits), a Page Index (PX, 8 bits) and a Byte Index (BX, 12 bits). The Segment Table Origin (STO) plus SX selects a segment-table entry, which yields the Page Table Origin (PTO); PTO plus PX selects a page-table entry, which yields the Page Frame Real Address (PFRA); PFRA concatenated with BX forms the real address.

Dynamic Address Translation: 64-bit
[Figure] The 64-bit virtual address is split into a Region-First Index (RFX), Region-Second Index (RSX), Region-Third Index (RTX) and Segment Index (SX) of 11 bits each, a Page Index (PX, 8 bits) and a Byte Index (BX, 12 bits). Starting from the Region-First-Table Origin (RFTO), each index in turn selects an entry in the region-first, region-second, region-third, segment and page tables (RFTO -> RSTO -> RTTO -> STO -> PTO); the final page-table entry yields the PFRA, which concatenated with BX forms the real address.

DAT: 64-bit Three-Level Translation
[Figure] If the upper address bits are zero, translation can start lower in the table hierarchy: with a Region-Third-Table Origin (RTTO) loaded instead of an RFTO, the RFX and RSX fields are unused (zero) and the walk is region-third table -> segment table -> page table, again ending in the PFRA plus Byte Index.
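
The index widths above can be checked with a small user-space calculation; this models only the field split (11+11+11+11+8+12 = 64 bits), not the table walk itself.

    #include <stdio.h>
    #include <stdint.h>

    /* Split a 64-bit virtual address into the DAT indexes described above:
     * RFX/RSX/RTX/SX of 11 bits each, PX of 8 bits, BX of 12 bits. */
    struct dat_index {
        unsigned rfx, rsx, rtx, sx, px, bx;
    };

    static struct dat_index split_vaddr(uint64_t va)
    {
        struct dat_index ix;
        ix.bx  =  va        & 0xfff;    /* byte index, 12 bits     */
        ix.px  = (va >> 12) & 0xff;     /* page index, 8 bits      */
        ix.sx  = (va >> 20) & 0x7ff;    /* segment index, 11 bits  */
        ix.rtx = (va >> 31) & 0x7ff;    /* region-third index      */
        ix.rsx = (va >> 42) & 0x7ff;    /* region-second index     */
        ix.rfx = (va >> 53) & 0x7ff;    /* region-first index      */
        return ix;
    }

    int main(void)
    {
        struct dat_index ix = split_vaddr(0x0000000080001234ULL);
        printf("rfx=%u rsx=%u rtx=%u sx=%u px=%u bx=0x%x\n",
               ix.rfx, ix.rsx, ix.rtx, ix.sx, ix.px, ix.bx);
        return 0;
    }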

zSeries Address Translation Modes
- Directly accessible address spaces
  - Primary space: STO/RTO in control register 1
  - Secondary space: STO/RTO in control register 7
  - Home space: STO/RTO in control register 13
- Access-register-specified spaces
  - The base register used in a memory access identifies an access register (AR)
  - The AR specifies the STO/RTO via an Access List Entry Token (ALET)
  - The operating system manages ALETs and grants privileges
  - ALET 0 is the primary space, ALET 1 is the secondary space

Address Translation Modes (cont.)
The translation mode is specified in the PSW:
- Primary space mode: instructions fetched from the primary space, data in the primary space
- Secondary space mode: instructions in primary, data in secondary
- Home space mode: instructions and data in the home space
- Access register mode: instructions in primary, data in the AR-specified address space

zSeries Program Status Word
[Figure] Bit layout of the 128-bit PSW (bits 64-127 hold the instruction address). Field legend:
- R: Program Event Recording mask
- T: Dynamic Address Translation mode
- IO: I/O interruption mask
- EX: external interruption mask
- Key: PSW key (storage protection)
- M: machine check mask
- W: wait state
- P: problem state
- AS: address space control
- CC: condition code
- PM: program mask
- EA: extended addressing mode
- BA: basic addressing mode

Linux on zSeries: Use of Address Spaces
[Figure] The primary space (= kernel space), used in primary space mode, contains the kernel code, the physical memory mapping and the vmalloc area. The home space (= user space, also installed as the secondary space), used in home space mode, contains the user program and heap starting at 0x00000000, with shared libraries (around 0x40000000) and the stack (up to 0x7FFFFFFF) above them.
To copy between spaces, the kernel switches into access register mode (sacf 512); the access register belonging to each base register then selects the address space (%ar4 = 0 for the primary space, %ar4 = 1 for the secondary space), so the same "mvc 0(8,%r2),0(%r4)" can move data across spaces:

    la   4,<source>
    la   2,<destination>
    sacf 512
    mvc  0(8,%r2),0(%r4)
    sacf 0

Memory Management Data Structures
- 'Address spaces'
  - Represent some page-addressed data
  - Examples: inodes (files), shared memory segments, swap
  - Contents are cached in the 'page cache'
- 'Memory map'
  - Describes a process' user address space
  - A list of 'virtual memory areas' (VMAs); a VMA maps part of an address space into a process
- Page tables
  - Hardware-defined structure: region, segment and page tables
  - Linux uses a platform-independent abstraction
  - Contents are filled on demand, as defined by the memory map
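
A rough user-space model of how these structures relate; the names (address_space_model, vma_model, mm_model) are invented for the sketch and deliberately simpler than the kernel's declarations.

    #include <stddef.h>

    /* Page-addressed data (a file's inode, a shared memory segment, ...);
     * its resident pages live in the page cache. */
    struct address_space_model {
        const char *backing;          /* e.g. "file:/bin/sh" or "shmem"        */
        /* ... a tree of cached pages would hang off here ... */
    };

    /* One virtual memory area: maps part of an address space into a process. */
    struct vma_model {
        unsigned long start, end;     /* virtual address range [start, end)    */
        unsigned long pgoff;          /* offset into the backing address space */
        int shared;                   /* shared vs. private (copy-on-write)    */
        struct address_space_model *backing;  /* NULL for anonymous memory     */
        struct vma_model *next;       /* VMAs kept sorted by address           */
    };

    /* The memory map: describes one process' user address space. */
    struct mm_model {
        struct vma_model *vma_list;   /* list of VMAs                          */
        void *page_table_root;        /* arch-specific region/segment tables   */
    };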

Virtual Memory Areas
[Figure] A process' memory map points to its VMAs and page tables. A shared VMA maps pages of an address space straight from the page cache; a private VMA may additionally contain anonymous pages, which are backed by swap entries and pass through the swap cache when paged out.

Page Table Entries
[Figure] A page-table entry contains the Page Frame Real Address (PFRA) plus the hardware Invalid (I) and Page Protection (P) bits; Linux encodes additional software states in the remaining bits. The states shown:
- Empty PTE slot
- Read-write page: PFRA, I=0, P=0
- Read-only page: PFRA, I=0, P=1
- No-access page: PFRA, I=1 (plus a software bit marking the page as present but inaccessible)
- Swapped-out page: I=1 and P=1, with the swap area (SA: file/device) and the offset within that area (SO) encoded in place of the PFRA
- Paged-out remapped page: I=1 and P=1, with the file offset stored in two parts (FOH: high bits, FOL: low bits) and distinguished by the lowest bit
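
The idea that a single word encodes both hardware and software states can be sketched with a toy decoder. The bit layout below is invented for the example and does not match the real z/Architecture PTE format; only the set of states mirrors the slide.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy PTE layout (NOT the real hardware format):
     * bit 0: INVALID, bit 1: PROTECT, bit 2: SWAPPED, upper bits: frame/offset. */
    #define PTE_INVALID  (1u << 0)
    #define PTE_PROTECT  (1u << 1)
    #define PTE_SWAPPED  (1u << 2)

    static const char *describe_pte(uint32_t pte)
    {
        if (pte == 0)
            return "empty PTE slot";
        if (pte & PTE_SWAPPED)
            return "swapped-out page (swap area + offset stored in the entry)";
        if (pte & PTE_INVALID)
            return "no-access / not-present page";
        if (pte & PTE_PROTECT)
            return "read-only page";
        return "read-write page";
    }

    int main(void)
    {
        uint32_t examples[] = { 0, 0x5000, 0x5000 | PTE_PROTECT,
                                0x5000 | PTE_INVALID,
                                PTE_SWAPPED | PTE_INVALID | 0x30 };
        for (unsigned i = 0; i < sizeof(examples) / sizeof(examples[0]); i++)
            printf("0x%08x -> %s\n", examples[i], describe_pte(examples[i]));
        return 0;
    }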

Reverse Mappings
[Figure] A single physical page may be mapped by several virtual memory areas (VM A, VM B, VM C); the reverse mapping leads from the physical page back to all of its mappings.

Reverse Mappings (cont.)
- Advantages of reverse mappings
  - Easy to unmap a page from all address spaces
  - Page replacement scans based on physical pages
  - Less CPU spent inside the memory manager
  - Less fragile behaviour under extreme load
- Challenges with reverse mappings
  - Overhead to set up the rmap structures
  - What if we run out of memory while allocating an rmap entry?
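
A toy model of the idea (all names invented for the example): each physical page keeps a list of its mappers, so unmapping it everywhere is a single walk over that list instead of a scan of every process' page tables.

    #include <stdlib.h>

    struct process;                   /* opaque: stands in for a process/mm     */

    /* One (process, virtual address) mapping of a physical page. */
    struct mapping {
        struct process *owner;
        unsigned long   vaddr;
        struct mapping *next;         /* next mapping of the same physical page */
    };

    /* Per-physical-page reverse-mapping chain. */
    struct phys_page {
        struct mapping *rmap_head;
    };

    /* Stub: a real kernel would invalidate the PTE in that process' page table. */
    static void clear_pte(struct process *p, unsigned long vaddr)
    {
        (void)p; (void)vaddr;
    }

    /* Unmap a physical page from every address space that maps it. */
    static void unmap_everywhere(struct phys_page *page)
    {
        struct mapping *m = page->rmap_head;
        while (m) {
            struct mapping *next = m->next;
            clear_pte(m->owner, m->vaddr);
            free(m);
            m = next;
        }
        page->rmap_head = NULL;
    }

    int main(void)
    {
        struct phys_page page = { 0 };
        unmap_everywhere(&page);      /* nothing mapped yet: trivially a no-op */
        return 0;
    }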

Page Fault Handling
- Hardware support
  - Accessing an invalid page causes a 'page translation' check
  - Writing to a protected page causes a 'protection exception'
  - The translation-exception identification provides the faulting address
  - The 'suppression on protection' facility is essential!
- Linux kernel page fault handler
  - Determines address/access validity according to the VMA
  - Invalid accesses cause SIGSEGV delivery
  - Valid accesses trigger: page-in, swap-in, copy-on-write
  - Extra support for the stack VMA: it grows automatically
  - Out of memory when overcommitted causes SIGBUS
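
A heavily simplified user-space model of the handler's decision logic described above; the structures and helpers are invented for the example.

    #include <stdio.h>

    struct vma {                      /* minimal stand-in for a Linux VMA */
        unsigned long start, end;     /* [start, end)                      */
        int writable;
        int grows_down;               /* stack-style VMA                   */
        struct vma *next;             /* list sorted by address            */
    };

    enum fault_result { FAULT_OK, FAULT_SIGSEGV };

    static enum fault_result handle_fault(struct vma *vmas, unsigned long addr, int is_write)
    {
        for (struct vma *v = vmas; v; v = v->next) {
            if (addr >= v->end)
                continue;                          /* VMA lies below the address */
            if (addr < v->start) {
                /* Gap below the VMA: only a stack VMA may grow down to cover it. */
                if (!v->grows_down)
                    return FAULT_SIGSEGV;
                v->start = addr & ~0xfffUL;        /* grow to the faulting page */
            }
            if (is_write && !v->writable)
                return FAULT_SIGSEGV;              /* (or copy-on-write for private mappings) */
            /* Valid access: page-in, swap-in or COW would be triggered here. */
            return FAULT_OK;
        }
        return FAULT_SIGSEGV;                      /* no VMA covers the address */
    }

    int main(void)
    {
        struct vma stack = { 0x7ff00000, 0x80000000, 1, 1, NULL };
        struct vma text  = { 0x00400000, 0x00500000, 0, 0, &stack };
        printf("%d\n", handle_fault(&text, 0x00400123, 1));  /* write to read-only text: SIGSEGV */
        printf("%d\n", handle_fault(&text, 0x7fefffff, 0));  /* just below the stack: grows, OK  */
        return 0;
    }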

Page Replacement
- Page replacement strategy
  - Applies to all page cache and anonymous pages
  - Second-chance LRU using active/inactive page lists
  - Scans physical memory zones, finds PTEs via the reverse map
- Page replacement actions
  - Allocate a swap slot for anonymous pages
  - Unmap the page from all process page tables
  - Asynchronously write back dirty pages to the backing store
  - Remove clean pages from the page cache
- Shrinking in-kernel slab caches
  - Call-backs release inode, dentry, ... cache entries
  - Balanced against page cache replacement
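
A compact user-space sketch of the second-chance idea on two lists. This is purely illustrative; the kernel's implementation differs in many details (true queues, per-zone lists, writeback, locking).

    #include <stdio.h>
    #include <stdlib.h>

    struct lpage {
        int id;
        int referenced;               /* "accessed since last scan" bit */
        int dirty;
        struct lpage *next;
    };

    static struct lpage *active, *inactive;

    static void push(struct lpage **list, struct lpage *p) { p->next = *list; *list = p; }

    /* The lists here are LIFO for brevity; the kernel works from list tails. */
    static struct lpage *pop(struct lpage **list)
    {
        struct lpage *p = *list;
        if (p) *list = p->next;
        return p;
    }

    /* Move unreferenced pages from the active to the inactive list;
     * referenced pages get a second chance and stay active. */
    static void refill_inactive(int count)
    {
        while (count-- > 0) {
            struct lpage *p = pop(&active);
            if (!p) return;
            if (p->referenced) { p->referenced = 0; push(&active, p); }
            else               push(&inactive, p);
        }
    }

    /* Reclaim from the inactive list: clean pages are freed,
     * dirty pages would be queued for writeback first. */
    static void shrink_inactive(int count)
    {
        while (count-- > 0) {
            struct lpage *p = pop(&inactive);
            if (!p) return;
            if (p->referenced)  push(&active, p);                      /* back to active  */
            else if (p->dirty) { p->dirty = 0; push(&inactive, p); }   /* "writeback"     */
            else               { printf("reclaimed page %d\n", p->id); free(p); }
        }
    }

    int main(void)
    {
        for (int i = 0; i < 4; i++) {
            struct lpage *p = calloc(1, sizeof(*p));
            p->id = i; p->referenced = (i % 2); push(&active, p);
        }
        refill_inactive(4);
        shrink_inactive(4);
        return 0;
    }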

Page Replacement Life Cycle
[Figure] Anonymous pages (PTE: valid), swap cache pages (PTE: valid or swap entry) and page cache pages (PTE: valid or empty) are all subject to page replacement via the active/inactive lists; swap cache pages are written to the swap file, page cache pages to their backing store.

Page Replacement "LRU" Lists
[Figure] Two lists: pages referenced while on the inactive list move back to the head of the active list; unreferenced pages at the tail of the active list move to the head of the inactive list; at the tail of the inactive list, writeback is started for dirty pages and clean pages are reclaimed.

The Mystery Solved
What does this mean:

    $ free
                 total       used       free     shared    buffers     cached
    Mem:        512832     473256      39576          0      52300     236776
    -/+ buffers/cache:     184180     328652
    Swap:       524280        912     523368

Or this:

    $ cat /proc/meminfo
    MemTotal:     512832 kB   Dirty:            28 kB
    MemFree:       39512 kB   Writeback:         0 kB
    Buffers:       52308 kB   Mapped:         5492 kB
    Cached:       236768 kB   Slab:         158608 kB
    SwapCached:      532 kB   Committed_AS:   7656 kB
    Active:       246328 kB   PageTables:      208 kB
    Inactive:      61920 kB   VmallocTotal: 1564671 kB
    HighTotal:         0 kB   VmallocUsed:     724 kB
    HighFree:          0 kB   VmallocChunk: 1563947 kB
    LowTotal:     512832 kB
    LowFree:       39512 kB
    SwapTotal:    524280 kB
    SwapFree:     523368 kB

Slide annotations group the counters: 'Fixed' (MemTotal), 'Page Cache' (Buffers, Cached, SwapCached), 'User' (Mapped), and 'Total LRU (PC + Anon)' (Active + Inactive, i.e. page cache plus anonymous pages).
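
The '-/+ buffers/cache' line of free can be verified from the numbers above: used minus buffers minus cached = 473256 - 52300 - 236776 = 184180 kB is the memory "really" in use, and free plus buffers plus cached = 39576 + 52300 + 236776 = 328652 kB is what could be made available by dropping the caches.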

Buffers - What's That?
- Original 'buffer cache' (up to Linux 2.0)
  - Caches block device contents
  - All file access went through the buffer cache; the page cache was separate
- Gradual elimination of the buffer cache
  - Linux 2.2: page cache reads bypass the buffer cache
  - Linux 2.4: buffer cache completely merged into the page cache
  - Linux 2.6: 'buffer heads' removed from the block I/O layer
- Meaning of the 'buffers' field
  - Linux 2.4: page cache pages with buffer heads attached
  - Linux 2.6: page cache pages for block devices
  - Approximates the size of cached file system metadata

Some Tunable Parameters
sysctl or /proc interface (a small example of setting one of these follows below):
- /proc/sys/vm/overcommit_memory, overcommit_ratio: control the relation of committed address space to total memory + swap
- /proc/sys/vm/swappiness: influences the page-out decision between mapped and unmapped pages
- /proc/sys/vm/page-cluster: controls swap-in read-ahead
- /proc/sys/vm/dirty_background_ratio, dirty_ratio: percentage of memory allowed to fill with dirty pages
- /proc/sys/vm/dirty_writeback_centisecs, dirty_expire_centisecs: average/maximum time a page is allowed to remain dirty
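
For instance, a minimal C sketch of adjusting one of these knobs through the /proc interface (equivalent to writing the value with sysctl or echo; needs root, and the value 30 is chosen arbitrarily here):

    #include <stdio.h>

    int main(void)
    {
        /* Write a new value to /proc/sys/vm/swappiness; 0..100, higher values
         * make the kernel more willing to page out mapped (anonymous) pages. */
        FILE *f = fopen("/proc/sys/vm/swappiness", "w");
        if (!f) {
            perror("open /proc/sys/vm/swappiness (are you root?)");
            return 1;
        }
        fprintf(f, "30\n");
        fclose(f);
        return 0;
    }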

Virtualization Considerations
- Two-level dynamic address translation
  - Linux DAT: Linux virtual address -> Linux 'real' address
  - VM DAT: guest 'real' address -> host real address
- Two-level page replacement / swapping
  - The Linux LRU prefers to touch exactly those pages that VM has swapped out
- Linux / VM cooperative memory management
- Exploit VM shared memory features
  - Kernel in Named Saved Systems
  - Block device on Discontiguous Saved Segments

Two-level Page Fault Handling
Two different scenarios are possible:
- Guest page fault
  - The Linux page fault handler is invoked
  - It initiates a page-in operation from the backing store
  - The user process is suspended until the page-in completes
  - Other user processes continue to run
- Host page fault
  - The VM page fault handler is invoked
  - It initiates a page-in operation from the backing store
  - The guest is suspended until the page-in completes
  - No other user processes can run

Two-level Page Fault Handling (cont.)
Solution: pseudo page faults
- The VM page fault handler is invoked
  - It initiates a page-in operation from the backing store
  - It triggers a guest 'pseudo page fault'
  - The Linux pseudo page fault handler suspends the user process
  - VM does not suspend the guest
- On completion of the page-in operation
  - VM calls the guest pseudo page fault handler again
  - The Linux handler wakes up the blocked user process
- Caveats
  - Access to kernel pages
  - Access to a user page from kernel code

Cooperative Memory Management
- Problem: a large guest size hurts performance
  - Linux will use all memory; the LRU tends to reuse cold pages
  - Recommendation: make the guest size as small as possible. But how to determine that size?
- Cooperative Memory Management
  - New feature released on developerWorks 01/2004
  - Allows a certain number of pages to be reserved: a kernel module allocates the pages, so Linux can no longer use them
  - Pages are (if possible) reported as free to VM
  - Changes the effective available memory size without a reboot!
  - An IUCV special message interface allows a central instance to manage the total memory consumption of a server farm

Cooperative Memory Management (cont.)
sysctl or /proc user interface (a small write example follows below):
- /proc/sys/vm/cmm_pages
  - Read to query the number of pages permanently reserved
  - Write to set a new target (which will be reached over time)
- /proc/sys/vm/cmm_timed_pages
  - Read to query the number of pages temporarily reserved
  - Write an increment to add to the target
- /proc/sys/vm/cmm_timeout
  - Holds a pair of N pages / X seconds (read/write)
  - Every time X seconds have passed, N temporary pages are released
IUCV special message interface:
- CMMSHRINK / CMMRELEASE / CMMREUSE: same as writing cmm_pages / cmm_timed_pages / cmm_timeout
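
As a hedged example (assuming the cmm module is loaded so that the proc files described above exist), asking CMM to reserve 1024 pages could look like:

    #include <stdio.h>

    int main(void)
    {
        /* Ask the cmm module to (gradually) take 1024 pages away from Linux
         * and report them as free to VM, as described above. */
        FILE *f = fopen("/proc/sys/vm/cmm_pages", "w");
        if (!f) {
            perror("open /proc/sys/vm/cmm_pages (cmm module loaded?)");
            return 1;
        }
        fprintf(f, "1024\n");
        fclose(f);
        return 0;
    }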

Resources
- Mel Gorman's Linux VM documentation: http://www.csn.ul.ie/~mel/projects/vm/
- Linux on zSeries developerWorks page: http://www.software.ibm.com/developerworks/opensource/linux390/index.html
- Linux on zSeries technical contact address: linux390@de.ibm.com