Lecture 14: Memory Hierarchy. Housekeeping

Transcription

Lecture 14: Memory Hierarchy
James C. Hoe
Department of ECE
Carnegie Mellon University
S 16 L14

Housekeeping

- Your goal today
  - big-picture intro to the memory subsystem
  - understand the general concept of memory hierarchy
- Notices
  - I don't teach SRAM/DRAM/Flash except that
    - they exist
    - one is faster than another
    - one has more capacity than another
    - one is cheaper (per bit) than another
  - I also don't teach delay line, Selectron, drum, or core
- Readings
  - P&H Ch6 for the next many lectures

Wishful Memory

- So far we imagined
  - a program sees a contiguous 4GB memory
  - access anywhere in memory in 1 proc. cycle
- We are in good company: Burks, Goldstine, von Neumann, 1946

The Reality

- Can't afford and don't need as much memory as the size of the user address space (think about 64-bit ISAs)
- Most machines are multi-tasked between several programs
- You can't find a memory technology that is affordable in GBytes and also cycles in GHz
- The magic memory abstraction is nevertheless a very useful approximation of reality due to
  - memory hierarchy: large and fast
  - virtual memory: contiguous and private

The Law of Storage

- Bigger is slower
  - SRAM 512, sub-nanosec
  - SRAM, nanosec
  - DRAM, ~50 nanosec
  - Hard Disk, ~10 millisec
- Faster is more expensive (dollars and chip area)
  - SRAM, ~$10K per GByte
  - DRAM, ~$10 per GByte
  - Hard Disk, ~$0.1 per GByte
- ***Note*** these sample values scale with time

How to make memory Bigger, Faster, and Cheaper?

Principles behind the solution

#1: The Locality Principle

- One's recent past is a very good predictor of one's near future.
- Temporal Locality: If you just did something, it is very likely that you will do the same thing again soon
  - since you are here today, there is a good chance you will be here again and again regularly
  - the inverse is also true
- Spatial Locality: If you just did something, it is very likely you will do something similar/related
  - every time I find you in this room, you are probably sitting in the same seat (or nearby)
  - you are probably sitting near the same people
- Programs are even more predictable than people

#1: Memory Locality

- Typical programs have strong locality in memory references
  *** typical programs are composed of loops
- Temporal: programs tend to reference (read and write) the same memory location many times and within a small window of time
- Spatial: programs tend to reference a cluster of nearby memory locations (most notable examples: 1. instruction memory references and 2. array/data structure references)
- Corollary: a program may reference a large number of different memory locations over its lifetime, but not all around the same time
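To make the two kinds of locality concrete, here is a minimal sketch that builds the address trace of a made-up loop summing an array. The function name `loop_trace`, the base addresses, the 4-byte element size, and the 4-instruction loop body are all assumptions chosen for illustration, not values from the lecture.

```python
def loop_trace(n_iters, array_base=0x1000, elem_size=4, code_base=0x400, n_instrs=4):
    """Address trace of a hypothetical loop that sums an array.

    Each iteration fetches the same n_instrs instructions (temporal locality)
    and reads the next array element (spatial locality).
    """
    trace = []
    for i in range(n_iters):
        for k in range(n_instrs):
            trace.append(("I", code_base + 4 * k))        # same loop body every time
        trace.append(("D", array_base + elem_size * i))   # next element of the array
    return trace


trace = loop_trace(100)
instr_addrs = [addr for kind, addr in trace if kind == "I"]
data_addrs = [addr for kind, addr in trace if kind == "D"]

# Temporal locality: many instruction fetches, very few distinct addresses.
distinct_instr = len(set(instr_addrs))

# Spatial locality: each data reference sits right after the previous one.
strides = {b - a for a, b in zip(data_addrs, data_addrs[1:])}

print(len(instr_addrs), "instruction fetches touch", distinct_instr, "addresses")
print("data strides seen:", strides)
```

In this trace the instruction stream is all reuse and the data stream is all neighbors, which is the pattern the corollary relies on: many locations over the lifetime, few at any one time.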

#2: Memoization

- If something is expensive to compute, remember the answer for a while, just in case it is needed again soon
- Memoization needs locality to be effective
- Without locality
  - storing a large number of different answers (many of which are never reused)
  - locating an answer from a large number of stored answers can be more expensive than recomputing it
- With locality
  - a small number of answers gets reused all the time!
  - storing a small number of frequent answers can avoid most recomputations

#3: Cost Amortization

- overhead cost: one-time cost to set something up
- per-unit cost: cost per unit of operation
- total cost = overhead + per-unit cost x N
- average cost = total cost / N = (overhead / N) + per-unit cost
  - the essence of amortization
- It is often okay to have a high overhead cost if the cost can be distributed over a large number of units, which lowers the average cost
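The dependence of memoization on locality can be sketched with a small bounded answer store. Everything here is an illustrative assumption: the class name `SmallMemo`, the capacity of 4, the least-recently-used eviction rule, and the stand-in function `expensive_square`.

```python
from collections import OrderedDict


class SmallMemo:
    """Remember at most `capacity` answers; evict the least recently used one."""

    def __init__(self, fn, capacity):
        self.fn = fn
        self.capacity = capacity
        self.store = OrderedDict()   # argument -> remembered answer
        self.computations = 0        # how many times fn actually ran

    def __call__(self, x):
        if x in self.store:
            self.store.move_to_end(x)        # mark as recently used
            return self.store[x]
        self.computations += 1
        result = self.fn(x)
        self.store[x] = result
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # drop the least recently used answer
        return result


def expensive_square(x):
    # Stand-in for something costly to compute.
    return x * x


# With locality: the same 4 questions asked over and over (100 requests).
with_locality = SmallMemo(expensive_square, capacity=4)
for _ in range(25):
    for x in (1, 2, 3, 4):
        with_locality(x)

# Without locality: 100 different questions, none ever repeated.
without_locality = SmallMemo(expensive_square, capacity=4)
for x in range(100):
    without_locality(x)

print("with locality:   ", with_locality.computations, "computations")
print("without locality:", without_locality.computations, "computations")
```

Both runs serve 100 requests from the same tiny store; only the run with reuse avoids recomputation.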

Putting the principles to work

Memory Hierarchy

[Figure: a small, fast memory ("move what you use here") sits above a big but slow memory ("backup everything here"). The upper level is faster per byte; the lower level is cheaper per byte. With strong locality, the hierarchy appears as fast as the small memory and as large as the big one.]

Managing Memory Hierarchy

- Manage data movement across hierarchies manually
  - vacuum tubes vs Selectron, discussed in the von Neumann paper
  - core vs drum memory in the '50s
  - too painful for programmers on substantial programs
  - still done in some embedded processors (on-chip scratchpad SRAM in lieu of a cache)
- Automatic management
  - simple heuristic: keep the most recently used items in fast mem
  - dates back to ATLAS, 1962
  - in every modern computer (fast processor, slow DRAM)
  - the average programmer doesn't need to know about it
- You don't need to know how big the cache is to write a correct program! You may if you want a fast program.

Modern Memory Hierarchy

- Register File: 32 words, sub-nsec
    (manual register spilling)
- L1 cache: ~32 KB, ~nsec
- L2 cache: 512 KB ~ 1 MB, many nsec
- L3 cache, ...
    (automatic cache management)
- Main memory (DRAM): GB, ~100 nsec
    (automatic demand paging)
- Swap Disk: 100 GB ~ TB, ~10 msec

[Figure label: Memory Abstraction]

Hierarchical Performance Analysis

- A given memory hierarchy level i has a technology-intrinsic access time of t_i
- The perceived access time T_i is longer than t_i
- Except for the outermost hierarchy, when looking up a given address
  - there is a chance (hit rate h_i) you hit, and the access time is t_i
  - there is a chance (miss rate m_i) you miss, and the access time is t_i + T_{i+1}
  - h_i + m_i = 1
- Thus
    T_i = h_i t_i + m_i (t_i + T_{i+1})
    T_i = t_i + m_i T_{i+1}      (think of T_{i+1} as the miss penalty)
- **Note**, h_i and m_i are defined to be the hit rate and miss rate of just the references that missed at L_{i-1}

Hierarchy Design Compromises

- Recursive latency equation
    T_i = t_i + m_i T_{i+1}
- The goal: achieve the desired T_1 within the allowed cost
- T_i ≈ t_i is desirable but not necessary
- Keep m_i low
  - increasing capacity C_i lowers m_i, but increases t_i
  - lower m_i by smarter management, e.g.,
    - replacement: anticipate what you don't need
    - prefetching: anticipate what you will need
- Keep T_{i+1} low
  - faster lower hierarchies help, but at increased cost and/or reduced capacity
  - often better to introduce intermediate hierarchies
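As a worked instance of the recursive latency equation, the recursion can be unrolled for a three-level hierarchy. The depth of three and the assumption that the outermost level always hits (so its perceived time equals its intrinsic time) are chosen here only for illustration.

```latex
\begin{align*}
T_3 &= t_3 \\
T_2 &= t_2 + m_2 T_3 = t_2 + m_2 t_3 \\
T_1 &= t_1 + m_1 T_2 = t_1 + m_1 t_2 + m_1 m_2 t_3
\end{align*}
```

Because m_2 counts only the references that already missed at level 1, the product m_1 m_2 is the fraction of all references that go all the way to level 3.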

Hierarchy Design Considerations

- DRAM
  - optimized for capacity/dollar
  - T_DRAM is essentially the same regardless of capacity
- SRAM
  - optimized first for capacity/latency; second for capacity/dollar
  - different compromises between capacity and latency are possible
  - t_i = O( C_i )
- Hierarchies bridge the difference between CPU speed and DRAM speed
  - T_pclk ≈ T_DRAM: no hierarchy needed
  - T_pclk << T_DRAM: one or more levels of SRAM hierarchies to minimize T_1 while staying within cost

Intel P4 Example (very fast, very deep pipeline)

- 90nm P4, 3.6 GHz
- L1 D-cache
  - C_1 = 16K
  - t_1 = 4 cyc int / 9 cyc fp
  - (note: pipelined)
- L2 D-cache
  - C_2 = 1024 KB
  - t_2 = 18 cyc int / 18 cyc fp
- Main memory
  - t_3 = ~50ns or 180 cyc

  Miss rates               T_1 (cyc)   T_2 (cyc)
  m_1=0.1,  m_2=0.1        7.6         36
  m_1=0.01, m_2=0.01       4.2         19.8
  m_1=0.05, m_2=0.01       5.00        19.8
  m_1=0.01, m_2=0.50       5.08        108

- Notice
  - best-case latency is not 1 anymore. Why not?
  - worst-case access latencies are into 300+ cyc, depending on exactly what happens
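The P4 numbers above follow from the recursive latency equation T_i = t_i + m_i T_{i+1} with the integer access times t_1 = 4, t_2 = 18, t_3 = 180 cycles. The sketch below recomputes them; the function name `perceived_access_times` is hypothetical, and it assumes main memory is the outermost level and always hits.

```python
def perceived_access_times(t, m):
    """Apply T_i = t_i + m_i * T_{i+1} from the outermost level inward.

    t: intrinsic access times [t_1, ..., t_n] in cycles
    m: miss rates [m_1, ..., m_{n-1}]; the outermost level is assumed to always hit
    Returns [T_1, ..., T_n].
    """
    T = [0.0] * len(t)
    T[-1] = float(t[-1])                  # outermost level: T_n = t_n
    for i in range(len(t) - 2, -1, -1):
        T[i] = t[i] + m[i] * T[i + 1]
    return T


t_cycles = [4, 18, 180]                   # L1, L2, main memory (integer path)
cases = [(0.1, 0.1), (0.01, 0.01), (0.05, 0.01), (0.01, 0.50)]

for m1, m2 in cases:
    T1, T2, _ = perceived_access_times(t_cycles, [m1, m2])
    print(f"m1={m1}, m2={m2}: T1={T1:.2f} cyc, T2={T2:.1f} cyc")
```

Comparing the second and fourth cases shows how much a good L1 hides: with m_1 = 0.01, even a 50% L2 miss rate moves T_1 by less than one cycle.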

Aside: Why is DRAM slow?

- DRAM fabrication is at the forefront of VLSI technology, but has scaled with Moore's law in capacity and cost, not speed
- Between 1980 ~ 2004, DRAM went from
  - 64K bit to 1024M bit (exponential, ~55% annual)
  - 250ns to 50ns (linear)
- But, remember, this is a very deliberate choice. We could engineer faster DRAM if we needed to
  - memory capacity needs to grow linearly with CPU speed to keep a balanced system (Amdahl's Other Law)
  - the DRAM/processor speed difference is reconciled through memory hierarchies (L1, L2, L3, ...)
    - L2 became commonplace in the 1990s
    - L3 became commonplace in the 2000s

Pop Quiz

- What does the principle of ____ say about hierarchical memory?
  - Memory Locality
  - Memoization
  - Amortization

Cache Design Basics

Cache

- Generically in computing, any structure that memoizes frequently used results to avoid repeating the long-latency operations required to reproduce the results from scratch, e.g., a web cache
- In computer architecture, an automatically managed memory hierarchy (usually) based on SRAM
  - memoize in a small SRAM the most frequently accessed DRAM memory locations to avoid repeatedly paying for the DRAM access latency

Cache Interface for Dummies

[Figure: an instruction memory and a data memory. Signal labels: Instruction address, Instruction, valid; Address, Write data, Read data, MemRead, MemWrite, valid, ready. Based on figures from P&H CO&D, COPYRIGHT 2004 Elsevier. ALL RIGHTS RESERVED.]

- Like the magic memory we assumed earlier
  - present address, R/W command, etc.
  - most of the time, the result or update is valid after a short/fixed latency (1 cyc?)
- Except, the cache may not be valid/ready on every cycle
  - it eventually will become valid/ready
  - but what happens to the pipeline until then?

The Basic Problem

- Potentially M=2^m bytes of memory; how to keep the most frequently used ones in C bytes of fast storage, where C << M
- Basic issues (intertwined)
  (1) where to cache a memory location?
  (2) how to find a cached memory location?
  (3) granularity of management: large, small, uniform?
  (4) when to bring a memory location into cache?
  (5) which cached memory location to evict to free up space?
- Optimizations

Basic Operation

[Flowchart: an address goes to the cache lookup (issue 2), which asks "hit?"
  - yes: return data
  - no: choose a location (issues 1, 3, 5) and ask "occupied?"
    - yes: evict old to L_{i+1}
    - then fetch new from L_{i+1}, update the cache, and return data]

- Ans to (4): a memory location is brought into the cache on demand. What about prefetch?

Basic Cache Parameters

- Let M = 2^m be the size of the address space in bytes
  - sample values: 2^32, 2^64
- Let G = 2^g be the cache access granularity in bytes
  - sample values: 4, 8
- Let C be the capacity of the cache in bytes
  - sample values: 16 KByte (L1), 1 MByte (L2)

[Plot: hit rate (up to 100%) versus working set size (W), with C marked on the W axis]
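The basic operation flowchart (lookup, hit or miss, choose a location, evict if occupied, fetch on demand) can be sketched as below. This is only an illustration under stated assumptions: the class name `DemandCache` is hypothetical, the next level L_{i+1} is modeled as a plain dict, any free slot may hold any address, and the victim is simply the oldest resident entry. The lecture has not fixed a placement or replacement policy at this point.

```python
class DemandCache:
    """Tiny read-only model of the lookup / fetch / evict flow."""

    def __init__(self, num_slots, next_level):
        self.num_slots = num_slots
        self.next_level = next_level   # stands in for L_{i+1}: address -> data
        self.slots = {}                # resident entries, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.slots:                  # cache lookup: hit?
            self.hits += 1
            return self.slots[address]             # yes: return data
        self.misses += 1
        if len(self.slots) >= self.num_slots:      # chosen location occupied?
            victim = next(iter(self.slots))        # assumed policy: oldest entry
            self.next_level[victim] = self.slots.pop(victim)   # evict old to L_{i+1}
        data = self.next_level[address]            # fetch new from L_{i+1} on demand
        self.slots[address] = data                 # update cache
        return data


memory = {addr: addr * 10 for addr in range(8)}    # made-up contents
cache = DemandCache(num_slots=2, next_level=memory)
results = [cache.read(a) for a in (0, 0, 1, 2, 0)]
print(results, "hits:", cache.hits, "misses:", cache.misses)
```

The last read of address 0 misses even though it was cached earlier, because the 2-slot cache had to evict it to make room: the working set (3 addresses) exceeds the capacity.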

Direct Mapped Cache (v1)

[Figure: the lg2(M)-bit address is split into tag (t bits), idx (lg2(C/G) bits), and g bits. idx selects one line in the Tag Bank (C/G lines by t bits, plus a valid bit) and one line in the Data Bank (C/G lines by G bytes). The stored t-bit tag is compared (=) with the address tag to produce hit?; the Data Bank supplies G bytes of data.]

- let t = lg2(M) - lg2(C)
- What about writes?

Storage Overhead

- For each cache block of G bytes, must also store an additional t+1 bits, where t = lg2(M) - lg2(C)
  - if M=2^32, G=4, C=16K=2^14
    - t = 18 bits for each 4-byte block
    - 60% storage overhead
    - a 16KB cache really needs 25.5KB of SRAM
- Solution: let multiple G-byte words share a common tag
  - each B-byte block holds B/G words
  - if M=2^32, B=16, G=4, C=16K
    - t = 18 bits for each 16-byte block
    - 15% storage overhead
    - a 16KB cache needs 18.4KB of SRAM
  - 15% of 16KB is small; 15% of 1MB is 152KB
    - larger block size for lower/larger hierarchies
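The storage overhead figures can be rechecked with a few lines of arithmetic. The function name `tag_overhead` and its parameter names are hypothetical; it assumes, as the slide does, one t-bit tag plus one valid bit per block and power-of-two sizes.

```python
def tag_overhead(address_bits, capacity_bytes, block_bytes):
    """Return (t, overhead fraction, total SRAM bytes) for a direct-mapped cache.

    t = lg2(M) - lg2(C); each block stores t tag bits plus 1 valid bit.
    Sizes are assumed to be powers of two.
    """
    lg2_capacity = capacity_bytes.bit_length() - 1
    t = address_bits - lg2_capacity
    extra_bits = t + 1                              # tag + valid, per block
    overhead = extra_bits / (8 * block_bytes)       # relative to the data bits
    total_bytes = capacity_bytes * (1 + overhead)
    return t, overhead, total_bytes


KB = 1024
for block in (4, 16):
    t, overhead, total = tag_overhead(32, 16 * KB, block)
    print(f"block={block}B: t={t} bits, overhead={overhead:.1%}, SRAM={total / KB:.2f} KB")
```

With 4-byte blocks the 19 extra bits sit next to only 32 data bits; sharing one tag across a 16-byte block spreads the same 19 bits over 128 data bits.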

Direct Mapped Cache (final)

[Figure: the lg2(M)-bit address is split into tag (t bits), idx (lg2(C/B) bits), bo (lg2(B/G) bits), and g bits. idx selects one entry in the Tag Bank (C/B by t bits, plus a valid bit) and one block in the Data Bank (C/B by B bytes). The stored t-bit tag is compared (=) with the address tag to produce hit?; bo selects G bytes of data out of the B-byte block.]

- let t = lg2(M) - lg2(C)

Direct Mapped Cache

- C bytes of storage divided into C/B blocks
- A block of memory is mapped to one particular cache block according to the block index field of its address
- All addresses with the same block index field map to the same cache block
  - 2^t such addresses; can cache only one such block at a time
  - even if C > working set size, collision is possible
  - given 2 random addresses, the chance of collision is 1/(C/B)
  - notice the likelihood of collision decreases with an increasing number of cache blocks (C/B)

[Plot: hit rate (up to 100%) versus working set size (W), with C marked on the W axis]
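The address split used by the direct-mapped cache (tag, block index, block offset, byte offset) can be sketched with shifts and masks. The function name `split_address` and the example address are made up for illustration; C, B, and G are assumed to be powers of two, as in the sample parameters.

```python
def split_address(addr, C, B, G):
    """Split a byte address into (tag, idx, bo, g) for a direct-mapped cache.

    C: capacity in bytes, B: block size in bytes, G: access granularity in bytes.
    All three are assumed to be powers of two.
    """
    g_bits = G.bit_length() - 1             # lg2(G): byte within a G-byte word
    bo_bits = (B // G).bit_length() - 1     # lg2(B/G): word within the block
    idx_bits = (C // B).bit_length() - 1    # lg2(C/B): which cache block

    g = addr & (G - 1)
    bo = (addr >> g_bits) & ((B // G) - 1)
    idx = (addr >> (g_bits + bo_bits)) & ((C // B) - 1)
    tag = addr >> (g_bits + bo_bits + idx_bits)   # the remaining t upper bits
    return tag, idx, bo, g


C, B, G = 16 * 1024, 16, 4
a = 0x12345678                 # arbitrary example address
b = a + C                      # exactly one cache capacity away

tag_a, idx_a, bo_a, g_a = split_address(a, C, B, G)
tag_b, idx_b, bo_b, g_b = split_address(b, C, B, G)
print(f"a: tag={tag_a:#x} idx={idx_a} bo={bo_a} g={g_a}")
print(f"b: tag={tag_b:#x} idx={idx_b} bo={bo_b} g={g_b}")
print("collide (same idx, different tag):", idx_a == idx_b and tag_a != tag_b)
```

Addresses that differ by a multiple of C share an index and differ only in the tag, so they compete for the same cache block no matter how empty the rest of the cache is.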

Block Size and m_i

- Bytes that share a common tag are all in or all out
- Loading a multi-word block at a time has the effect of prefetching for spatial locality
  - pay the miss penalty only once per block
  - works especially well in instruction caches
  - effective up to the limit of spatial locality
- But, increasing block size (while holding C constant)
  - reduces the number of blocks
  - increases the possibility of collision

[Plot: hit rate versus block size B]

Block Size and T_{i+1}

- Loading a large block can increase T_{i+1}
  - if I want the last word of a block, I have to wait for the entire block to be loaded
- Solution 1: critical-word-first reload
  - L_{i+1} returns the requested word first, then rotates around the complete block
  - supply the requested word to the pipeline as soon as it is available
- Solution 2: sub-blocking
  - individual valid bits for different sub-blocks
  - reload only the requested sub-block on demand
  - note: all sub-blocks still share a common tag

    | tag | v | s-block 0 | v | s-block 1 | v | s-block 2 |
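The "requested word first, then rotate around the block" order of critical-word-first reload amounts to a wraparound sequence. A minimal sketch, with a hypothetical function name and a 4-word block chosen only as an example:

```python
def critical_word_first_order(requested_word, words_per_block):
    """Order in which L_{i+1} returns the words of a block.

    Starts at the requested word, then wraps around until the whole
    block has been delivered.
    """
    return [(requested_word + k) % words_per_block for k in range(words_per_block)]


# A miss on the third word (index 2) of a 4-word block.
order = critical_word_first_order(2, 4)
print(order)
```

The pipeline can resume as soon as the first entry of the sequence arrives, instead of waiting for the word's position in a plain 0, 1, 2, 3 reload.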