Operating Systems & Memory Systems: Address Translation
Computer Science 220 / ECE 252
Professor Alvin R. Lebeck
Fall 2006

Outline
- Finish main memory
- Address translation basics
- 64-bit address space
- Managing memory
- OS performance
- Throughout: review of computer architecture and interaction with architectural decisions
Fast Memory Systems: DRAM Specific
- Multiple RAS accesses: several names (page mode)
  - 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns
- New DRAMs to address the gap; what will they cost, will they survive?
  - Synchronous DRAM: provide a clock signal to the DRAM, transfer synchronous to the system clock
  - RAMBUS: reinvents the DRAM interface (Intel will use it)
    - Each chip is a module vs. a slice of memory
    - Short bus between CPU and chips
    - Does its own refresh
    - Variable amount of data returned
    - 1 byte / 2 ns (500 MB/s per chip)
  - Cached DRAM (CDRAM): keep an entire row in SRAM

Main Memory Summary
- Big DRAM + small SRAM = cost effective
  - Cray C-90 uses all SRAM (how many sold?)
- Wider memory
- Interleaved memory: for sequential or independent accesses
- Avoiding bank conflicts: SW & HW
- DRAM-specific optimizations: page mode & specialty DRAM, CDRAM
- Niche memory or main memory?
  - e.g., video RAM for frame buffers: DRAM + fast serial output
- IRAM: do you know what it is?
Review: Reducing Miss Penalty Summary
- Five techniques
  - Read priority over write on miss
  - Subblock placement
  - Early restart and critical word first on miss
  - Non-blocking caches (hit under miss)
  - Second-level cache
- Can be applied recursively to multilevel caches
- Danger is that time to DRAM will grow with multiple levels in between

Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache
Review: Cache Optimization Summary

Technique                          MR  MP  HT  Complexity
Larger Block Size                  +            0
Higher Associativity               +            1
Victim Caches                      +            2
Pseudo-Associative Caches          +            2
HW Prefetching of Instr/Data       +            2
Compiler-Controlled Prefetching    +            3
Compiler Reduce Misses             +            0
Priority to Read Misses                +        1
Subblock Placement                     +   +    1
Early Restart & Critical Word 1st      +        2
Non-Blocking Caches                    +        3
Second-Level Caches                    +        2
Small & Simple Caches                      +    0
Avoiding Address Translation               +    2
Pipelining Writes                          +    1

System Organization
[Diagram: the processor (with interrupts) and its cache connect through the core chip set to main memory and to an I/O bus; on the I/O bus sit a disk controller (disks), a graphics controller (graphics), and a network interface (network).]
Computer Architecture
- Interface between hardware and software
- Software: applications, compiler, operating system ("This is IT")
- Hardware: CPU, memory, I/O, multiprocessors, networks

Memory Hierarchy 101
- P (processor): very fast, <1 ns clock, multiple instructions per cycle
- $ (cache): SRAM; fast, small, expensive
- Memory: DRAM; slow, big, cheap (called physical or main memory)
- Disk: magnetic; really slow, really big, really cheap
- => Cost-effective memory system (price/performance)
Virtual Memory: Motivation
- Process = address space + thread(s) of control
- If the address space = PA (physical addresses):
  - programmer controls movement from disk
  - protection? relocation?
- Linear address space larger than the physical address space
  - 32 or 64 bits vs. a 28-bit physical address (256 MB)
- Automatic management

Virtual Memory
- Process = virtual address space + thread(s) of control
- Translation (VA -> PA)
  - What physical address does virtual address A map to?
  - Is VA in physical memory?
- Protection (access control)
  - Do you have permission to access it?
Virtual Memory: Questions
- How is data found if it is in physical memory?
- Where can data be placed in physical memory?
  - Fully associative, set associative, direct mapped
- What data should be replaced on a miss? (Take CompSci 210)

Segmented Virtual Memory
- Virtual address (2^32, 2^64) to physical address (2^30) mapping
- Variable size, base + offset, contiguous in both VA and PA (a translation sketch follows below)
- [Diagram: virtual segments at 0x1000, 0x6000, 0x9000, and 0x11000 map to physical addresses 0x0000, 0x1000, and 0x2000]
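The base + offset scheme above is simple enough to sketch in a few lines. A minimal, illustrative version in C, assuming a segment descriptor holds just a base and a limit (the names `seg_desc` and `translate_segment` are mine, not from the slides):

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t base;   /* physical base address of the segment */
    uint64_t limit;  /* size of the segment in bytes */
} seg_desc;

/* Returns true and writes *pa on success; false on a segment violation. */
bool translate_segment(const seg_desc *seg, uint64_t offset, uint64_t *pa)
{
    if (offset >= seg->limit)
        return false;          /* offset past the end of the segment: fault */
    *pa = seg->base + offset;  /* contiguous in both VA and PA */
    return true;
}
```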
Intel Pentium Segmentation
[Diagram: a logical address (segment selector + offset) selects a segment descriptor in the Global Descriptor Table (GDT); the descriptor's segment base address is added to the offset to form the address in the physical address space.]

Pentium Segmentation (Continued)
- Segment descriptors
  - local and global
  - base, limit, access rights
  - can define many
- Segment registers contain segment descriptors (faster than loading from memory)
  - only 6
- Must load a segment register with a valid entry before the segment can be accessed
  - generally managed by the compiler and linker, not the programmer
Paged Virtual Memory
- Virtual address (2^32, 2^64) to physical address (2^28) mapping
  - virtual page to physical page frame: virtual page number + offset
- Fixed-size units for access control & translation
- [Diagram: virtual pages at 0x1000, 0x6000, 0x9000, and 0x11000 map to physical page frames at 0x0000, 0x1000, and 0x2000]

Page Table
- Kernel data structure (per process)
- Page Table Entry (PTE)
  - VA -> PA translation (if none, page fault)
  - access rights (read, write, execute, user/kernel, cached/uncached)
  - reference, dirty bits
- Many designs
  - linear, forward-mapped, inverted, hashed, clustered
- Design issues
  - support for aliasing (multiple VAs to a single PA)
  - large virtual address space
  - time to obtain a translation
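To make the PTE contents concrete, here is a minimal sketch of a PTE and a lookup in a linear page table, assuming 4 KB pages and a flat table indexed by virtual page number; the field widths and names are illustrative, not any particular machine's layout:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                       /* assumed 4 KB pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

typedef struct {
    uint64_t pfn   : 40;                    /* physical page frame number */
    uint64_t valid : 1;                     /* is the translation present? */
    uint64_t read  : 1, write : 1, exec : 1;/* access rights */
    uint64_t user  : 1;                     /* user vs. kernel access */
    uint64_t ref   : 1, dirty : 1;          /* reference and dirty bits */
} pte_t;

/* Linear table indexed by virtual page number; false means page fault. */
bool translate(const pte_t *page_table, uint64_t va, uint64_t *pa)
{
    pte_t pte = page_table[va >> PAGE_SHIFT];
    if (!pte.valid)
        return false;                       /* no VA -> PA mapping: fault */
    *pa = ((uint64_t)pte.pfn << PAGE_SHIFT) | (va & PAGE_MASK);
    return true;
}
```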
Alpha VM Mapping (Forward Mapped)
- 64-bit address divided into 3 segments
  - seg0 (bit 63 = 0): user code/heap
  - seg1 (bit 63 = 1, bit 62 = 1): user stack
  - kseg (bit 63 = 1, bit 62 = 0): kernel segment for the OS
- Three-level page table, each level one page (a walk sketch follows after the next slide)
  - [Diagram: the page table base register plus three 10-bit indices (L1, L2, L3) and a 13-bit page offset yield the physical page frame number]
- Alpha 21064: only 43 unique bits of VA
  - (future minimum page size up to 64 KB => 55 bits of VA)
- PTE bits: valid, kernel & user read & write enable
  - no reference, use, or dirty bit; what do you do for replacement?

Inverted Page Table (HP, IBM)
- [Diagram: the virtual page number hashes through a Hash Anchor Table (HAT) into the Inverted Page Table (IPT); each IPT entry holds a VA tag and PA/status]
- One PTE per page frame
  - only one VA per physical frame
- Must search for the virtual address
- More difficult to support aliasing
  - force all sharing to use the same VA
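Here is a sketch of the Alpha-style three-level forward-mapped walk, assuming 8 KB pages (13-bit offset) and 10-bit indices at each level as in the diagram above; `frame_to_table` is a hypothetical helper that maps a frame number to the table stored in that frame:

```c
#include <stdint.h>
#include <stdbool.h>

#define PO_BITS  13                 /* page offset bits (8 KB pages) */
#define LVL_BITS 10                 /* index bits per level */
#define LVL_MASK ((1u << LVL_BITS) - 1)

/* A valid PTE at L1/L2 holds the PFN of the next level's table;
 * at L3 it holds the data page's PFN. */
typedef struct { uint64_t pfn : 50; uint64_t valid : 1; } pte_t;

extern pte_t *frame_to_table(uint64_t pfn);  /* hypothetical helper */

bool walk(pte_t *l1_table, uint64_t va, uint64_t *pa)
{
    unsigned i1 = (va >> (PO_BITS + 2 * LVL_BITS)) & LVL_MASK;
    unsigned i2 = (va >> (PO_BITS + LVL_BITS)) & LVL_MASK;
    unsigned i3 = (va >> PO_BITS) & LVL_MASK;

    pte_t p1 = l1_table[i1];
    if (!p1.valid) return false;               /* fault at level 1 */
    pte_t p2 = frame_to_table(p1.pfn)[i2];
    if (!p2.valid) return false;               /* fault at level 2 */
    pte_t p3 = frame_to_table(p2.pfn)[i3];
    if (!p3.valid) return false;               /* fault at level 3 */

    *pa = ((uint64_t)p3.pfn << PO_BITS) | (va & ((1u << PO_BITS) - 1));
    return true;
}
```

Note the cost implied by this structure: a TLB miss takes one memory reference per level, which is the trade-off revisited on the large-address-space slides below.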
Intel Pentium Segmentation + Paging
[Diagram: the segment selector and offset form a logical address; the segment base address from the GDT descriptor converts it to a linear address (dir, table, offset), which the page directory and page table translate into the physical address space.]

The Memory Management Unit (MMU)
- Input: virtual address
- Output:
  - physical address
  - access violation (exception, interrupts the processor)
- Access violations:
  - not present
  - user vs. kernel
  - write
  - read
  - execute
Translation Lookaside Buffers (TLB)
- Need to perform address translation on every memory reference
  - 30% of instructions are memory references
  - a 4-way superscalar processor makes at least one memory reference per cycle
- Make the common case fast, the others correct
- Throw HW at the problem: cache the PTEs

Fast Translation: Translation Buffer
- Cache of translated addresses
- Alpha 21164 TLB: 48-entry, fully associative
- [Diagram: (1) the page number is compared against the v/r/w/tag fields of all 48 entries; (2) on a match, (3) the entry's physical frame is selected through a 48:1 mux and (4) concatenated with the page offset]
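A minimal sketch of the fully associative lookup just described, with 48 entries as on the 21164. In hardware all 48 comparisons happen in parallel; the loop below is the software analogue, and the names and the 8 KB page assumption are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 48
#define PAGE_SHIFT  13  /* assumed 8 KB pages */

typedef struct {
    uint64_t vpn;       /* tag: virtual page number */
    uint64_t pfn;       /* physical page frame number */
    bool valid, read, write;
} tlb_entry;

tlb_entry tlb[TLB_ENTRIES];

/* True on a permitted hit; false on a miss or a write violation
 * (a real design would distinguish the two outcomes). */
bool tlb_lookup(uint64_t va, bool is_write, uint64_t *pa)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    for (int i = 0; i < TLB_ENTRIES; i++) {           /* 48:1 match */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            if (is_write && !tlb[i].write)
                return false;                          /* access violation */
            *pa = (tlb[i].pfn << PAGE_SHIFT)
                | (va & ((1u << PAGE_SHIFT) - 1));
            return true;                               /* TLB hit */
        }
    }
    return false;                                      /* miss: walk the table */
}
```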
TLB Design
- Must be fast, must not increase the critical path
- Must achieve a high hit ratio
- Generally small and highly associative
- Mapping changes
  - page removed from physical memory => processor must invalidate the TLB entry
- A PTE is a per-process entity
  - multiple processes may use the same virtual addresses
  - context switches? Flush the TLB, or
  - add an ASID (PID): part of processor state, must be set on a context switch

Hardware-Managed TLBs
- [Diagram: CPU <-> TLB <-> control <-> memory]
- Hardware handles the TLB miss
- Dictates the page table organization
- Complicated state machine to walk the page table
  - multiple levels for forward-mapped
  - linked list for inverted
- Exception only on an access violation
Software-Managed TLBs
- [Diagram: CPU <-> TLB <-> control <-> memory]
- Software handles the TLB miss
- Flexible page table organization
- Simple hardware to detect hit or miss
- Exception on a TLB miss or an access violation
- Should you check for an access violation on a TLB miss?

Mapping the Kernel: Digital Unix kseg
- kseg (bit 63 = 1, bit 62 = 0)
- Kernel has direct access to physical memory
- One VA -> PA mapping for the entire kernel
  - lock (pin) the TLB entry, or use special HW detection
- [Diagram: in the 2^64 address space, the user stack, kernel, and user code/data regions are shown; the kernel region maps directly onto physical memory]
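The kseg mapping is so regular that translation is just arithmetic. A sketch under the assumptions above (the kernel segment is selected by bits 63 = 1, 62 = 0 and maps linearly onto physical memory; the constants and names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define KSEG_BASE 0x8000000000000000ull  /* bit 63 = 1, bit 62 = 0 */
#define KSEG_TOP  0xC000000000000000ull  /* start of the next segment */

/* Direct translation: no page table walk, no per-page TLB entry needed. */
bool kseg_translate(uint64_t va, uint64_t *pa)
{
    if (va < KSEG_BASE || va >= KSEG_TOP)
        return false;        /* not a kseg address */
    *pa = va - KSEG_BASE;    /* one linear VA -> PA mapping for the kernel */
    return true;
}
```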
Considerations for Address Translation
- Large virtual address space
  - can map more things: files, frame buffers, network interfaces, memory from another workstation
- Sparse use of the address space
  - page table design: space
  - less locality => more TLB misses
- OS structure
  - microkernel => more TLB misses

Address Translation for Large Address Spaces
- Forward-mapped page table
  - grows with the virtual address space
    - worst case 100% overhead (not likely)
  - TLB miss time: one memory reference per level
- Inverted page table
  - grows with the physical address space
    - independent of virtual address space usage
  - TLB miss time: memory references to the HAT and the IPT, plus the list search
Hashed Page Table (HP)
- [Diagram: the virtual page number hashes directly into the Hashed Page Table (HPT); entries hold a VA tag and PA/status]
- Combines the hash table and the IPT [Huck96]
  - can have more entries than physical page frames
- Must search for the virtual address (a lookup sketch follows below)
- Easier to support aliasing than the IPT
- Space grows with physical space
- TLB miss: one less memory reference than the IPT

Clustered Page Table (SUN)
- Combines the benefits of the HPT and linear tables [Talluri95]
- Stores one base virtual page block number (VPBN, the tag) and several PPN values
  - the block offset selects among the PA0..PA3 entries
- [Diagram: the VPBN hashes to a chain of entries; each entry holds a VPBN tag, a next pointer, and several PAx/attribute pairs]
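A sketch of the hashed-page-table lookup: the VPN hashes straight into the table (no HAT indirection, hence one less memory reference than the IPT), then the chain is searched for a matching tag. The layout and names below are illustrative, not the real HP structure:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define HPT_BUCKETS 4096    /* assumed table size */
#define PAGE_SHIFT  12

typedef struct hpt_entry {
    uint64_t vpn;              /* tag: full virtual page number */
    uint64_t pfn;              /* physical frame number */
    struct hpt_entry *next;    /* collision chain */
} hpt_entry;

hpt_entry *hpt[HPT_BUCKETS];

bool hpt_lookup(uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    /* hash straight into the table, then search the chain */
    for (hpt_entry *e = hpt[vpn % HPT_BUCKETS]; e != NULL; e = e->next) {
        if (e->vpn == vpn) {
            *pa = (e->pfn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
            return true;
        }
    }
    return false;  /* not mapped: page fault */
}
```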
Reducing TLB Miss Handling Time
- Problem
  - must walk the page table on a TLB miss
  - usually incurs cache misses
  - a big problem for IPC in microkernels
- Solution (sketched after the next slide)
  - build a small second-level cache in SW
  - on a TLB miss, first check the SW cache
    - use a simple shift-and-mask index into a hash table

Cache Indexing
- Tag on each block
  - no need to check the index or block offset
- Increasing associativity shrinks the index, expands the tag
- [Diagram: block address = tag | index, followed by the block offset]
  - fully associative: no index
  - direct-mapped: large index
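A sketch of the software second-level TLB cache: a small direct-mapped table indexed by a shift-and-mask hash of the VPN, consulted before the full page-table walk. The size, names, and `walk_page_table` slow path are illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

#define STLB_SIZE  1024       /* power of two, so masking is cheap */
#define PAGE_SHIFT 12

typedef struct { uint64_t vpn; uint64_t pfn; bool valid; } stlb_entry;
stlb_entry stlb[STLB_SIZE];

extern bool walk_page_table(uint64_t va, uint64_t *pa);  /* slow path */

bool tlb_miss_handler(uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> PAGE_SHIFT;
    stlb_entry *e = &stlb[vpn & (STLB_SIZE - 1)];  /* shift and mask */
    if (e->valid && e->vpn == vpn) {               /* SW cache hit */
        *pa = (e->pfn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
        return true;
    }
    if (!walk_page_table(va, pa))                  /* full walk */
        return false;                              /* page fault */
    e->vpn = vpn;                                  /* refill the SW cache */
    e->pfn = *pa >> PAGE_SHIFT;
    e->valid = true;
    return true;
}
```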
Address Translation and Caches
- Where is the TLB with respect to the cache? What are the consequences?
- Most of today's systems have more than one cache
  - Digital 21164 has 3 levels: two on chip (8 KB data, 8 KB instruction, 96 KB unified) and one off chip (2-4 MB)
- Does the OS need to worry about this?
- Definition: page coloring = careful selection of the VA -> PA mapping

TLBs and Caches
[Diagram: three organizations. Conventional: CPU -> TLB (VA to PA) -> physically addressed cache -> memory. Virtually addressed cache: CPU -> cache with VA tags -> memory, translating only on a miss; this raises the alias (synonym) problem. Overlapped: the cache is indexed with the VA while the TLB translates in parallel, with a physically addressed L2 behind it; this requires the cache index to remain invariant across translation.]
Virtual Caches
- Send the virtual address to the cache. Called a virtually addressed cache or just virtual cache, vs. a physical cache or real cache
- Avoids address translation before accessing the cache
  - faster hit time to the cache
- Context switches?
  - just like the TLB (flush or PID)
  - cost is the time to flush + compulsory misses from the empty cache
  - or add a process identifier tag that identifies the process as well as the address within the process: can't get a hit if the process is wrong
- I/O must interact with the cache

I/O and Virtual Caches
- I/O is accomplished with physical addresses
  - DMA
  - flush pages from the cache
  - need PA -> VA reverse translation
  - coherent DMA
- [Diagram: the processor's virtual cache sits above a memory bus that carries physical addresses; an I/O bridge connects the memory bus to an I/O bus with disk, graphics, and network interface controllers and main memory]
Aliases and Virtual Caches
- [Diagram: in the 2^64 address space, a user mapping and the kernel's mapping both point at the same page of physical memory]
- Aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
- But the virtual address is used to index the cache
- Could have the same data in two different locations in the cache

Index with Physical Portion of Address
- If the index is in the physical part of the address, the tag access can start in parallel with translation, then compare against the physical tag
- [Diagram: the address splits into page address | page offset; the cache sees address tag | index | block offset, with the index and block offset contained within the page offset]
- Limits the cache to the page size: what if you want bigger caches and the same trick?
  - higher associativity
  - page coloring
Page Coloring for Aliases
- HW that guarantees every cache frame holds a unique physical address, or
- OS guarantee: the lower n bits of the virtual and physical page numbers must have the same value; if the cache is direct-mapped, aliases then map to the same cache frame
  - one form of page coloring (a sketch of the invariant follows below)
- [Diagram: address = page address | page offset, seen by the cache as address tag | index | block offset]

Page Coloring to Reduce Misses
- Notion of a bin: a region of the cache that may contain cache blocks from a page
- [Diagram: page frames map into bins of the cache]
- Random vs. careful mapping
- Selection of the physical page frame dictates the cache index
- Overall goal is to minimize cache misses
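The OS guarantee above is a small invariant worth writing down. A sketch in C, where the number of color bits is an assumption (it depends on cache size, associativity, and page size):

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT  12
#define COLOR_BITS  4   /* assumed n = log2(cache size / (assoc * page size)) */
#define COLOR_MASK  ((1u << COLOR_BITS) - 1)

static unsigned page_color(uint64_t page_number)
{
    return page_number & COLOR_MASK;   /* the lower n bits */
}

/* True if the mapping va -> pa respects the coloring guarantee:
 * aliases then index the same cache frame in a direct-mapped cache. */
bool colors_match(uint64_t va, uint64_t pa)
{
    return page_color(va >> PAGE_SHIFT) == page_color(pa >> PAGE_SHIFT);
}
```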
Careful Page Mapping [Kessler92, Bershad94]
- Select a page frame such that cache conflict misses are reduced
  - only choose from the available pages (no VM replacement induced)
- Static: smart selection of the page frame at page fault time
- Dynamic: move pages around

A Case for Large Pages
- Page table size is inversely proportional to the page size
  - memory saved
- Fast cache hit time is easy when cache <= page size (VA caches); a bigger page makes this feasible as cache size grows
- Transferring larger pages to or from secondary storage, possibly over a network, is more efficient
- The number of TLB entries is restricted by clock cycle time, so a larger page size maps more memory and reduces TLB misses
A Case for Small Pages
- Fragmentation
  - large pages can waste storage
  - data must be contiguous within a page
- Quicker process start for small processes (??)

Superpages
- Hybrid solution: multiple page sizes
  - 8 KB, 16 KB, 32 KB, 64 KB pages
  - 4 KB, 64 KB, 256 KB, 1 MB, 4 MB, 16 MB pages
- Need to identify candidate superpages
  - kernel
  - frame buffers
  - database buffer pools
  - application/compiler hints
- Detecting superpages
  - static, at page fault time
  - dynamically create superpages
- Page table & TLB modifications
More Details on Page Coloring to Reduce Misses

Page Coloring
- Make the physical index match the virtual index
  - behaves like a virtually indexed cache: no conflicts for sequential pages
- Possibly many conflicts between processes
  - address spaces all have the same structure (stack, code, heap)
  - modify to XOR the PID with the address (MIPS used a variant of this)
- Simple implementation (a sketch follows below)
  - pick an arbitrary page if necessary
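A sketch of this allocation policy with per-color free lists: prefer a frame whose color matches the page's virtual color XORed with the PID (so identically structured address spaces spread across the cache), and fall back to any color if the preferred list is empty. The structures and sizes are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

#define COLOR_BITS 4
#define NUM_COLORS (1u << COLOR_BITS)
#define COLOR_MASK (NUM_COLORS - 1)

typedef struct frame { uint64_t pfn; struct frame *next; } frame;
frame *free_list[NUM_COLORS];   /* one free list per color */

frame *alloc_colored(uint64_t vpn, uint32_t pid)
{
    /* XOR the PID so stacks/code/heaps of different processes
     * do not all land on the same cache bins */
    unsigned color = (unsigned)(vpn ^ pid) & COLOR_MASK;
    for (unsigned i = 0; i < NUM_COLORS; i++) {
        unsigned c = (color + i) & COLOR_MASK;  /* arbitrary color if needed */
        if (free_list[c]) {
            frame *f = free_list[c];
            free_list[c] = f->next;
            return f;
        }
    }
    return NULL;  /* no free frames at all */
}
```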
Bin Hopping
- Allocate sequentially mapped pages (time) to sequential bins (space)
- Can exploit temporal locality: pages mapped close in time will be accessed close in time
- Search from the last allocated bin until a bin with an available page frame is found
  - separate search list per process
- Simple implementation (a sketch follows after the next slide)

Best Bin
- Keep track of two counters per bin
  - used: # of pages allocated to this bin for this address space
  - free: # of available pages in the system for this bin
- Bin selection is based on low values of used and high values of free
  - low used value: reduces conflicts within the address space
  - high free value: reduces conflicts between address spaces
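A sketch of bin hopping: each process keeps a cursor at the last bin it allocated from and searches forward from there, so pages mapped close in time land in adjacent bins. The structures are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

#define NUM_BINS 64

typedef struct frame { uint64_t pfn; struct frame *next; } frame;
extern frame *bin_free_list[NUM_BINS];   /* free frames, grouped by bin */

/* last_bin is the per-process cursor (the "separate search list"). */
frame *bin_hop_alloc(unsigned *last_bin)
{
    for (unsigned i = 1; i <= NUM_BINS; i++) {
        unsigned b = (*last_bin + i) % NUM_BINS;  /* hop to the next bin */
        if (bin_free_list[b]) {
            frame *f = bin_free_list[b];
            bin_free_list[b] = f->next;
            *last_bin = b;
            return f;
        }
    }
    return NULL;  /* no free frames anywhere */
}
```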
Hierarchical
- Best bin could be linear in the # of bins (the linear baseline is sketched below)
- Build a tree
  - internal nodes contain the sum of their children's <used, free> values
- Independent of cache size: simply stop at a particular level in the tree

Benefit of Static Page Coloring
- Reduces cache misses by 10% to 20%
- Multiprogramming
  - want to distribute the mapping to avoid inter-address-space conflicts
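For concreteness, here is the linear best-bin scan over the per-bin <used, free> counters from the previous slide; this is the O(#bins) baseline that the hierarchical <used, free> tree reduces to logarithmic time. The tie-breaking rule below is one plausible reading of "low used, high free":

```c
#define NUM_BINS 64

typedef struct { unsigned used; unsigned free; } bin_stats;

/* Returns the index of the best bin, or -1 if no bin has a free frame. */
int best_bin(const bin_stats bins[NUM_BINS])
{
    int best = -1;
    for (int b = 0; b < NUM_BINS; b++) {
        if (bins[b].free == 0)
            continue;                      /* nothing to allocate here */
        if (best < 0 ||
            bins[b].used < bins[best].used ||        /* fewer self-conflicts */
            (bins[b].used == bins[best].used &&
             bins[b].free > bins[best].free))        /* fewer cross-process */
            best = b;
    }
    return best;
}
```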
Dynamic Page Coloring
- Cache Miss Lookaside (CML) buffer [Bershad94]
  - proposed hardware device
- Monitor the # of misses per page
- If the # of misses >> the # of cache blocks in the page, they must be conflict misses
  - interrupt the processor
  - move the page (recolor)
- Cost of moving the page << the benefit
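A sketch of the CML policy in software terms: count misses per page and treat a page whose miss count far exceeds its cache-block count as conflicting, then recolor it. The threshold constant and the `recolor_page` helper are illustrative, not from [Bershad94]:

```c
#include <stdint.h>

#define BLOCKS_PER_PAGE 64   /* assumed: page size / cache block size */
#define CONFLICT_FACTOR 4    /* assumed ">>" threshold: misses per block */

extern void recolor_page(uint64_t vpn);  /* hypothetical: move to a new frame */

/* Called (conceptually by the CML hardware) on each cache miss to a page. */
void on_cache_miss(uint64_t vpn, unsigned miss_count[/* one per page */])
{
    if (++miss_count[vpn] > CONFLICT_FACTOR * BLOCKS_PER_PAGE) {
        /* far more misses than the page has blocks: must be conflicts */
        recolor_page(vpn);    /* worthwhile only if benefit exceeds move cost */
        miss_count[vpn] = 0;
    }
}
```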