CS/COE1541: Introduction to Computer Architecture. Memory hierarchy. Sangyeun Cho. Computer Science Department University of Pittsburgh


CPU clock rate
- Apple II, MOS 6502 (1975): 1~2MHz
- Original IBM PC (1981), Intel 8088: 4.77MHz
- DEC Alpha EV4 (1991): 100~200MHz
- DEC Alpha EV5 (1995): 266~500MHz
- DEC Alpha EV6 (1998): 450~600MHz
- DEC Alpha EV67 (1999): 667~850MHz
- Today's processors: 2~5GHz

DRAM access latency
How long would it take to fetch a memory block from main memory?
- Time to determine that data is not on-chip (cache miss resolution time)
- Time to talk to the memory controller (possibly on a separate chip)
- Possible queuing delay at the memory controller
- Time for opening a DRAM row
- Time for selecting a portion of the selected row (CAS latency)
- Time to transfer data from DRAM to the CPU, possibly through a separate chip
- Time to fill caches as needed
The total latency can be anywhere from 100~400 CPU cycles.

Growing gap
(figure: the widening gap between processor and memory speed)

Idea of memory hierarchy
CPU regs -> L1 cache (SRAM) -> L2 cache (SRAM) -> main memory (DRAM) -> hard disk (magnetics)
Moving toward the CPU: smaller, faster, more expensive per byte. Moving away: larger, slower, cheaper per byte.

Memory hierarchy goals
To create an illusion of fast and large main memory:
- As fast as a small SRAM-based memory
- As large (and cheap) as DRAM-based memory
To provide the CPU with necessary data and instructions as quickly as possible:
- Frequently used data must be placed near the CPU
- Cache hit: the CPU finds its data in the cache
- Cache hit rate = # of cache hits / # of cache accesses
- Avg. memory access latency (AMAL) = hit time + (1 - hit rate) x miss penalty
- To decrease AMAL: decrease the hit time, increase the hit rate, and reduce the miss penalty
The cache also helps reduce bandwidth pressure on the memory bus; the cache becomes a filter.

Cache
(From Webster online): 1. a: a hiding place especially for concealing and preserving provisions or implements b: a secure place of storage 3. a computer memory with very short access time used for storage of frequently or recently used instructions or data, called also cache memory
Notions of cache memory were used as early as 1967 in IBM mainframe machines.

Cache operations
(figure: the CPU requests &A[0]; the cache asks "Do I have A[0]?"; on a miss it gets A[0] from main memory, then returns the data)
for (i = 0; i < n; i++) {
  C[i] = A[i] + B[i];
  SUM = SUM + C[i];
}

Cache operations
CPU-cache operations:
- CPU: read request. Cache: on a hit, provide data; on a miss, talk to main memory, hold the CPU until data arrives, fill the cache (if needed, kick out old data), then provide data.
- CPU: write request. Cache: on a hit, update data; on a miss, put data in a buffer (pretend data is written) and let the CPU proceed.
Cache-main memory operations:
- Cache: read request. Memory: provide data.
- Cache: write request. Memory: update memory.

Cache organization
Caches use blocks or lines (block > byte) as their granule of management.
Memory > cache: we can only keep a subset of memory blocks.
A cache is in essence a fixed-width hash table; the memory blocks kept in a cache are thus associated with their addresses (or "tagged").
(figure: an N-bit address indexes a 2^N-word memory directly; in a cache, the same address is looked up in an address (tag) table alongside a data table, producing hit/miss and the specified data word)

L1 cache vs. L2 cache
Their basic parameters are similar: associativity, block size, and cache size (the capacity of the data array).
Address used to index:
- L1: typically the virtual address (to index quickly); using a virtual address causes some complexities
- L2: typically the physical address, which is available by then
System visibility:
- L1: not visible
- L2: page coloring can affect the hit rate
Hardware organization (esp. in multicores):
- L1: private
- L2: often shared among cores

Key questions
- Where to place a block?
- How to find a block?
- Which block to replace for a new block?
- How to handle a write? Writes make a cache design much more complex!

Where to place a block?
Block placement is a matter of mapping: if you have a simple rule to place data, you can find it later using the same rule.

Direct-mapped cache
With 2^B-byte blocks and 2^M entries, an N-bit address splits into a tag (N-B-M bits), an index (M bits), and a block offset (B bits). The index selects one entry; its stored tag is compared (=?) against the address tag to produce hit/miss, and the offset selects the data within the 2^B-byte block.

2-way set-associative cache
With 2^B-byte blocks and 2^M entries total, the cache behaves like two direct-mapped caches of 2^(M-1) entries each, accessed in parallel with the same N-bit address; a hit in either way supplies the data.

Fully associative cache
With 2^B-byte blocks and 2^M entries, the address splits into only a tag (N-B bits) and a block offset (B bits); there is no index. The tag memory is a CAM (content-addressable memory): input a content, output its index. A match yields an M-bit index into the data array, plus hit/miss.

Why caches work (or do not work)
Principle of locality:
- Temporal locality: if location A is accessed now, it'll be accessed again soon.
- Spatial locality: if location A is accessed now, a nearby location (e.g., A+1) will be accessed soon.
Can you explain how locality is manifested in your program (at the source code level), for data and for instructions? Can you write two programs doing the same thing, one with a high degree of locality and one with very low locality?

Which block to replace?
Which block to replace, to make room for a new block on a miss? Goal: minimize the # of total misses.
Trivial in a direct-mapped cache; N choices in an N-way associative cache.
What is the optimal policy? Evicting the "most remotely used" block (the one whose next use lies furthest in the future) is optimal, but this is an oracle scheme: we do not know the future.
Replacement approaches:
- LRU (least recently used): look at the past to predict the future
- FIFO (first in, first out): honor the new ones
- Random: don't remember anything
- Cost-based: what is the cost (e.g., latency) of bringing this block again?

How to handle a write?
Design considerations: performance and design complexity.
Allocation policy (on a miss): write-allocate, no-write-allocate, write-validate.
Update policy: write-through, write-back.
Typical combinations: write-back with write-allocate; write-through with no-write-allocate.

Write-through vs. write-back
L1 cache: advantages of write-through + no-write-allocate:
- Simple control
- No stalls for evicting dirty data on an L1 miss with an L2 hit
- Avoids L1 cache pollution with results that are not read for a while
- Avoids problems with coherence (L2 is consistent with L1)
- Allows efficient transient error handling: parity protection in L1 and ECC in L2
- But what about high traffic between L1 and L2, esp. in a multicore processor?
L2 cache: advantages of write-back + write-allocate:
- Typically reduces overall bus traffic by filtering all L1 write-through traffic
- Better able to capture temporal locality of infrequently written memory locations
- Provides a safety net for programs where write-allocate helps a lot: garbage-collected heaps, write-followed-by-read situations, linking loaders (if a unified cache, it need not be flushed before execution)
Some ISAs/caches support explicitly installing cache blocks with empty contents or common values (e.g., zero).

Write buffer
Waiting for a write to main memory is inefficient for a write-through scheme: every write to the cache causes a write to main memory, and the processor stalls for that write.
Example: base CPI = 1.2, 13% of all instructions are stores, 10 cycles to write to memory.
CPI = 1.2 + 0.13 x 10 = 2.5
The CPI more than doubled by waiting.

Write buffer
A write buffer holds data while it is waiting to be written to (slow) memory, freeing the processor to continue executing instructions; the processor thinks the write was done (quickly).
- When the write buffer is empty: send written data to the buffer, and eventually to main memory.
- When the write buffer is full: wait for the write buffer to become empty, then send data to the write buffer.

Write buffer
Is one write buffer entry enough? Probably not:
- Writes may occur faster than they can be handled by the memory system.
- Writes tend to occur in bursts; memory may be fast but can't handle several writes at once.
Typical buffer size is 4~8 entries, possibly more.

Revisiting the write-back cache
Each entry in the tag array holds a dirty bit (D), a valid bit (V), and an address tag, alongside the data array. The dirty bit is set when data is modified; the valid bit is set when data is brought in; the address tag is the id of the associated data.

Alpha example
64B blocks, 64kB, 2 ways.

More examples
IBM Power5:
- L1I: 64kB, 2-way, 128B block, LRU
- L1D: 32kB, 4-way, 128B block, write-through, LRU
- L2: 1.875MB (3 banks), 10-way, 128B block, pseudo-LRU
Intel Core Duo:
- L1I: 32kB, 8-way, 64B block, LRU
- L1D: 32kB, 8-way, 64B block, LRU, write-through
- L2: 2MB, 8-way, 64B line, LRU, write-back
Sun Niagara:
- L1I: 16kB, 4-way, 32B block, random
- L1D: 8kB, 4-way, 16B block, random, write-through, write no-allocate
- L2: 3MB, 4 banks, 64B block, 12-way, write-back

Impact of caches on performance
Average memory access latency (AMAL) = hit time + (1 - hit rate) x miss penalty
Example 1: hit time = 1 cycle, miss penalty = 100 cycles, miss rate = 2%. Average memory access latency?
Example 2: 1GHz processor; two configurations: 16kB direct-mapped and 16kB 2-way; miss rates 3% and 2%; hit time = 1 cycle, but the clock cycle time is stretched by 1.1x in the 2-way; miss penalty = 100ns (how many cycles?). Average memory access latency?

1. Reducing miss penalty
- Multi-level caches: miss penalty(L1) = hit time(L2) + miss rate(L2) x miss penalty(L2)
- Critical word first and early restart: when the L2-L1 bus width is smaller than the L1 cache block
- Giving priority to read misses: esp. in dynamically scheduled processors
- Merging write buffer
- Victim caches
- Non-blocking caches: esp. in dynamically scheduled processors

Victim cache
(figure: without a victim cache, requests and replacements flow directly between L1 and L2; with a victim cache, a small fully associative buffer sits between L1 and L2, catching replaced blocks and supplying them back to L1 on a near-miss)
Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," ISCA.

Categorizing misses
- Compulsory: I've not met this block before
- Capacity: the working set is larger than my cache
- Conflict: the cache has space, but the blocks in use map to busy sets
How can you measure the contributions from the different miss categories?

2. Reducing miss rate
- Larger block size: reduces compulsory misses
- Larger cache size: tackles capacity misses
- Higher associativity: attacks conflict misses
- Prefetching: relevancy issues (due to pollution)
- Pseudo-associative caches
- Compiler optimizations: loop interchange, blocking

3. Reducing hit time
- Small & simple cache (e.g., a direct-mapped cache); test different cache configurations using CACTI
- Avoid address translation during cache indexing
- Pipeline the access

Prefetching
The memory hierarchy generally works well:
- L1 cache: 1~3-cycle latency
- L2 cache: 8~13-cycle latency
- Main memory: 100~300-cycle latency
Cache hit rates are critical to high performance.
Prefetching: if we know a priori what data we'll need from the level-n cache, get the data from level-(n+1) and place it in the level-n cache before it is accessed by the processor.
What are the design goals and issues? E.g., shall we load the prefetched block into L1 or not? E.g., instruction prefetch vs. data prefetch.

Evaluating a prefetching scheme
- Coverage: by applying the prefetching scheme, how many misses are covered?
- Timeliness: do prefetched data arrive early enough that the (otherwise) miss latency is effectively hidden?
- Relevance and usefulness: are prefetched blocks actually used? Do they replace other useful data?
How would program execution time change? Especially, dynamically scheduled processors have an intrinsic ability to tolerate long latencies to a degree.

Hardware schemes
Prefetch data under hardware control, without software modification and without increasing the instruction count.
Hardware prefetch engines:
- Programmable engines: program the engine with data access pattern information; who programs this engine, then?
- Automatic: hardware detects access behavior.

Stride detection
A popular method uses a reference prediction table, indexed by the load instruction's PC. Each entry keeps the last address A(i-1), the last stride S = A(i-1) - A(i-2), and other flags (e.g., confidence, time stamp). The next address predicted for this PC is A(i) = A(i-1) + S.
In practice, simpler stream-buffer-like methods are often used; IBM Power4/5 use 8 stream buffers between L1 & L2 and between L2 & L3 (or main memory).

Power4 example
- 8 stream buffers: ascending/descending
- Requires at least 4 sequential misses to install a stream
- Supports L2 to L1, L3 to L2, and memory to L3
- Based on physical addresses; when a page boundary is met, stop there
- Software interface to explicitly (and quickly) install a stream

Software schemes
Use a prefetch instruction (needs ISA support): check if the desired block is already in the cache; if not, bring the block into the cache; if yes, do nothing.
In any case, a prefetch request is a performance hint, not a correctness requirement: not acting on it must not affect program output.
The compiler or programmer then inserts prefetch instructions into programs.
Hardware prefetch vs. software prefetch?

Software prefetch example
Before:
for (int i = 0; i < 100; ++i) {
  a[i] = b[i] + c[i];
}
After:
prefetch(&b[0]); prefetch(&c[0]);
for (i = 0; i < 100; i++) {
  prefetch(&b[i+4]);
  prefetch(&c[i+4]);
  a[i] = b[i] + c[i];
}

Virtual memory
(figure: the pages of processes 1-3 are scattered across physical memory, with the remaining pages kept on the hard disk)

Virtual memory
Motivation: a large, private memory space view.
- What is the typical physical memory size today?
- Virtual view to programmers; virtual view to each process
Management goals: efficient sharing of physical memory; protection. Paging vs. segmentation.
Architectural issues: fast translation of virtual addresses; support for protection; impact on cache design.

Cache vs. virtual memory
                   Cache                       Virtual memory
Block size         16~128B                     4kB~64kB
Hit time           1~3 cycles (L1)             50~150 cycles (main memory)
Miss penalty       8~150 cycles (L2/memory)    1M~10M cycles (hard drive)
Miss rate          0.1~10%                     ~0.001%
Address mapping    25~45 bits to 14~20 bits    32~64 bits to 25~45 bits
Replacement        hardware-based              OS

Paging vs. segmentation
                          Paging                       Segmentation
Words per address         One                          Two (segment & offset)
Programmer visible?       No                           Maybe
Replacing a block         Trivial (same block size)    Can be difficult
Memory use inefficiency   Internal fragmentation       External fragmentation
Efficient disk traffic    Yes                          Not always

Paging basics
- Block placement: anywhere in memory (fully associative)
- Block identification: the page table keeps the mapping information
- Block replacement: LRU, by the OS
- Write policy: write-back

Page table
Implemented in OS kernel memory space, per process. Table size?
Architectural support for fast address translation: the Translation Look-aside Buffer (TLB), a small cache that keeps frequently used page table entries.

Virtually indexed L1 caches
The address from a load/store instruction is virtual; VA-to-PA translation is done via the TLB. If the L1 cache access is done after the TLB access, the L1 latency becomes longer.
A virtually indexed but physically tagged cache allows simultaneous access to the TLB and the cache, overlapping the cache access time and the TLB access time.

Virtually indexed L1 caches
(figure: the TLB translation and the cache index proceed in parallel; the physical address governs the tag comparison)

Issues with virtually indexed L1 caches
Validity of data across context switches (there are many virtual address spaces!):
- Flush at context switches
- Use physical tags
- Use address space identifiers
Handling of aliases (two VAs for one PA):
- Hardware scheme: limit the cache size
- Software scheme: page coloring
I/O and cache coherence: I/O can result in inconsistency between data in main memory and in the caches, since I/O is between main memory and the outside world. Solutions:
- Flush caches before & after I/O operations
- Address bus snooping for cache line invalidation
- Allocate I/O buffers in a non-cacheable memory region

Protection
Between user and kernel processes: user mode vs. privileged mode.
Between users: address boundary & access type checking (base address, upper bound; write allowed?).
Protections are enforced when accessing the page table (TLB).

Alpha example
8kB page size; 8B page table entry: 4B for the physical page number (PPN) and 4B for other information. The 64-bit address is not fully used; the high bits select the segment:
- seg0: user space (stack, text (code), static data, dynamic data, with reserved and not-yet-allocated regions in between, plus a reserved region for shared libraries)
- seg1: information maintained by the OS, not accessible by users
- kseg: kernel-accessible physical addresses; no address translation performed
(figure: the Alpha virtual address space layout, with $sp marking the stack)

Alpha example (figure only)

DRAM technology
DRAM = dynamic RAM. The memory cell is based on a capacitor: if the capacitor has enough charge, it's a 1; if not, it's a 0.
Memory content decays as capacitors lose their charge over time, so periodic refresh actions are required. Reads are also destructive: the data read must be written back before reading another row.
DRAM designs are driven by standards. SDRAM (synchronous DRAM) is the most widely used; RDRAM (Rambus DRAM) is a high-performance proprietary design (the PlayStation series has been its major consumer).

DRAM internals
(figure: command interface, row address path, column address path, memory banks, and data I/O interface)
1. A row in a bank is selected and read using the row address {BA1, BA0, A13~A0}
2. The data is moved to the I/O latch
3. Part of the data is selected using the column address
4. The selected data is moved to the chip I/O

DRAM-related concepts
- Row address & column address: the address is transferred (e.g., from CPU to DRAM) in two steps, row address first and column address later. Why? It halves the # of address pins.
- Fast page mode: once you've read a row, it's fast to read parts of it (rather than data from other rows)
- Cycle time determines bandwidth
- Synchronous DRAM has a clock; DDR (double data rate) uses both edges of the clock when transferring data

DRAM modules
- SIMM (single in-line memory module): 32-bit data bus; primarily in 386, 486, 586, and Apple computers
- DIMM (dual in-line memory module): 64-bit data bus
- RIMM (Rambus in-line memory module): heat spreader; 1.6 GBps
- FB-DIMM (fully buffered DIMM): a newer standard

FB-DIMM
A serial, packet-based interface (a new design trend). Industry acceptance is still low (10% or lower).
- Scalability and throughput: up to 192 GB; 6 channels with 8 DIMMs/channel; 6.7 GB/s per channel
- Improved board layouts: 69 pins per channel (cf. 240 pins per channel today)
- Reliability considerations: ECC on commands, addresses, and data; automatic retry on errors
- Headroom: a more open-ended serial differential signaling scheme

Memory bank interleaving
An organization to improve memory bandwidth: with a proper # of banks, successive memory access latencies are hidden from the processor.

Page coloring
The OS's virtual-to-physical page mapping affects L2 cache performance.
(figure: virtual pages A and B are mapped to physical pages 0~7, and each physical page falls into one of the L2 cache's bins, bin 0~3)

Page coloring
This is about deciding which physical page to give to a virtual page on a page allocation event. Goal: minimize the # of conflict misses.
Approaches:
- Basic method: match a few lower bits of the PPN with those of the VPN, so that neighboring pages (in the virtual address space) occupy different cache bins
- Bin hopping: consecutive page allocations get the next bin, so that pages accessed in the same time frame occupy different cache bins
- Best bin: track the page count for each bin; choose the bin with the minimum page count
- Use OS knowledge about the page allocations
Example: Solaris supports several different page coloring strategies and lets the user choose one.


More information

Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance

Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance What You Will Learn... Computers Are Your Future Chapter 6 Understand how computers represent data Understand the measurements used to describe data transfer rates and data storage capacity List the components

More information

Intel X38 Express Chipset Memory Technology and Configuration Guide

Intel X38 Express Chipset Memory Technology and Configuration Guide Intel X38 Express Chipset Memory Technology and Configuration Guide White Paper January 2008 Document Number: 318469-002 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,

More information

Intel 965 Express Chipset Family Memory Technology and Configuration Guide

Intel 965 Express Chipset Family Memory Technology and Configuration Guide Intel 965 Express Chipset Family Memory Technology and Configuration Guide White Paper - For the Intel 82Q965, 82Q963, 82G965 Graphics and Memory Controller Hub (GMCH) and Intel 82P965 Memory Controller

More information

A N. O N Output/Input-output connection

A N. O N Output/Input-output connection Memory Types Two basic types: ROM: Read-only memory RAM: Read-Write memory Four commonly used memories: ROM Flash, EEPROM Static RAM (SRAM) Dynamic RAM (DRAM), SDRAM, RAMBUS, DDR RAM Generic pin configuration:

More information

Configuring CoreNet Platform Cache (CPC) as SRAM For Use by Linux Applications

Configuring CoreNet Platform Cache (CPC) as SRAM For Use by Linux Applications Freescale Semiconductor Document Number:AN4749 Application Note Rev 0, 07/2013 Configuring CoreNet Platform Cache (CPC) as SRAM For Use by Linux Applications 1 Introduction This document provides, with

More information

Byte Ordering of Multibyte Data Items

Byte Ordering of Multibyte Data Items Byte Ordering of Multibyte Data Items Most Significant Byte (MSB) Least Significant Byte (LSB) Big Endian Byte Addresses +0 +1 +2 +3 +4 +5 +6 +7 VALUE (8-byte) Least Significant Byte (LSB) Most Significant

More information

Intel P6 Systemprogrammering 2007 Föreläsning 5 P6/Linux Memory System

Intel P6 Systemprogrammering 2007 Föreläsning 5 P6/Linux Memory System Intel P6 Systemprogrammering 07 Föreläsning 5 P6/Linux ory System Topics P6 address translation Linux memory management Linux page fault handling memory mapping Internal Designation for Successor to Pentium

More information

Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719

Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Putting it all together: Intel Nehalem http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Intel Nehalem Review entire term by looking at most recent microprocessor from Intel Nehalem is code

More information

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 412, University of Maryland. Guest lecturer: David Hovemeyer.

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 412, University of Maryland. Guest lecturer: David Hovemeyer. Guest lecturer: David Hovemeyer November 15, 2004 The memory hierarchy Red = Level Access time Capacity Features Registers nanoseconds 100s of bytes fixed Cache nanoseconds 1-2 MB fixed RAM nanoseconds

More information

We r e going to play Final (exam) Jeopardy! "Answers:" "Questions:" - 1 -

We r e going to play Final (exam) Jeopardy! Answers: Questions: - 1 - . (0 pts) We re going to play Final (exam) Jeopardy! Associate the following answers with the appropriate question. (You are given the "answers": Pick the "question" that goes best with each "answer".)

More information

Virtual vs Physical Addresses

Virtual vs Physical Addresses Virtual vs Physical Addresses Physical addresses refer to hardware addresses of physical memory. Virtual addresses refer to the virtual store viewed by the process. virtual addresses might be the same

More information

A+ Guide to Managing and Maintaining Your PC, 7e. Chapter 1 Introducing Hardware

A+ Guide to Managing and Maintaining Your PC, 7e. Chapter 1 Introducing Hardware A+ Guide to Managing and Maintaining Your PC, 7e Chapter 1 Introducing Hardware Objectives Learn that a computer requires both hardware and software to work Learn about the many different hardware components

More information

What is a bus? A Bus is: Advantages of Buses. Disadvantage of Buses. Master versus Slave. The General Organization of a Bus

What is a bus? A Bus is: Advantages of Buses. Disadvantage of Buses. Master versus Slave. The General Organization of a Bus Datorteknik F1 bild 1 What is a bus? Slow vehicle that many people ride together well, true... A bunch of wires... A is: a shared communication link a single set of wires used to connect multiple subsystems

More information

Intel Q35/Q33, G35/G33/G31, P35/P31 Express Chipset Memory Technology and Configuration Guide

Intel Q35/Q33, G35/G33/G31, P35/P31 Express Chipset Memory Technology and Configuration Guide Intel Q35/Q33, G35/G33/G31, P35/P31 Express Chipset Memory Technology and Configuration Guide White Paper August 2007 Document Number: 316971-002 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION

More information

W4118: segmentation and paging. Instructor: Junfeng Yang

W4118: segmentation and paging. Instructor: Junfeng Yang W4118: segmentation and paging Instructor: Junfeng Yang Outline Memory management goals Segmentation Paging TLB 1 Uni- v.s. multi-programming Simple uniprogramming with a single segment per process Uniprogramming

More information

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano Intel Itanium Quad-Core Architecture for the Enterprise Lambert Schaelicke Eric DeLano Agenda Introduction Intel Itanium Roadmap Intel Itanium Processor 9300 Series Overview Key Features Pipeline Overview

More information

Enabling Technologies for Distributed Computing

Enabling Technologies for Distributed Computing Enabling Technologies for Distributed Computing Dr. Sanjay P. Ahuja, Ph.D. Fidelity National Financial Distinguished Professor of CIS School of Computing, UNF Multi-core CPUs and Multithreading Technologies

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization

More information

SOLVING HIGH-SPEED MEMORY INTERFACE CHALLENGES WITH LOW-COST FPGAS

SOLVING HIGH-SPEED MEMORY INTERFACE CHALLENGES WITH LOW-COST FPGAS SOLVING HIGH-SPEED MEMORY INTERFACE CHALLENGES WITH LOW-COST FPGAS A Lattice Semiconductor White Paper May 2005 Lattice Semiconductor 5555 Northeast Moore Ct. Hillsboro, Oregon 97124 USA Telephone: (503)

More information

Memory Systems. Static Random Access Memory (SRAM) Cell

Memory Systems. Static Random Access Memory (SRAM) Cell Memory Systems This chapter begins the discussion of memory systems from the implementation of a single bit. The architecture of memory chips is then constructed using arrays of bit implementations coupled

More information

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000 Lecture #11: Wednesday, 3 May 2000 Lecturer: Ben Serebrin Scribe: Dean Liu ILP Execution

More information

Configuring and using DDR3 memory with HP ProLiant Gen8 Servers

Configuring and using DDR3 memory with HP ProLiant Gen8 Servers Engineering white paper, 2 nd Edition Configuring and using DDR3 memory with HP ProLiant Gen8 Servers Best Practice Guidelines for ProLiant servers with Intel Xeon processors Table of contents Introduction

More information

Quiz for Chapter 6 Storage and Other I/O Topics 3.10

Quiz for Chapter 6 Storage and Other I/O Topics 3.10 Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [6 points] Give a concise answer to each

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Application Note 195. ARM11 performance monitor unit. Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007

Application Note 195. ARM11 performance monitor unit. Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007 Application Note 195 ARM11 performance monitor unit Document number: ARM DAI 195B Issued: 15th February, 2008 Copyright ARM Limited 2007 Copyright 2007 ARM Limited. All rights reserved. Application Note

More information

Communicating with devices

Communicating with devices Introduction to I/O Where does the data for our CPU and memory come from or go to? Computers communicate with the outside world via I/O devices. Input devices supply computers with data to operate on.

More information

Memory. The memory types currently in common usage are:

Memory. The memory types currently in common usage are: ory ory is the third key component of a microprocessor-based system (besides the CPU and I/O devices). More specifically, the primary storage directly addressed by the CPU is referred to as main memory

More information

CHAPTER 7: The CPU and Memory

CHAPTER 7: The CPU and Memory CHAPTER 7: The CPU and Memory The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint slides

More information

AMD Opteron Quad-Core

AMD Opteron Quad-Core AMD Opteron Quad-Core a brief overview Daniele Magliozzi Politecnico di Milano Opteron Memory Architecture native quad-core design (four cores on a single die for more efficient data sharing) enhanced

More information

DDR4 Memory Technology on HP Z Workstations

DDR4 Memory Technology on HP Z Workstations Technical white paper DDR4 Memory Technology on HP Z Workstations DDR4 is the latest memory technology available for main memory on mobile, desktops, workstations, and server computers. DDR stands for

More information

Big Picture. IC220 Set #11: Storage and I/O I/O. Outline. Important but neglected

Big Picture. IC220 Set #11: Storage and I/O I/O. Outline. Important but neglected Big Picture Processor Interrupts IC220 Set #11: Storage and Cache Memory- bus Main memory 1 Graphics output Network 2 Outline Important but neglected The difficulties in assessing and designing systems

More information

MICROPROCESSOR AND MICROCOMPUTER BASICS

MICROPROCESSOR AND MICROCOMPUTER BASICS Introduction MICROPROCESSOR AND MICROCOMPUTER BASICS At present there are many types and sizes of computers available. These computers are designed and constructed based on digital and Integrated Circuit

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operatin g Systems: Internals and Design Principle s Chapter 11 I/O Management and Disk Scheduling Seventh Edition By William Stallings Operating Systems: Internals and Design Principles An artifact can

More information

1 Storage Devices Summary

1 Storage Devices Summary Chapter 1 Storage Devices Summary Dependability is vital Suitable measures Latency how long to the first bit arrives Bandwidth/throughput how fast does stuff come through after the latency period Obvious

More information

Enabling Technologies for Distributed and Cloud Computing

Enabling Technologies for Distributed and Cloud Computing Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers The Orca Chip... Heart of IBM s RISC System/6000 Value Servers Ravi Arimilli IBM RISC System/6000 Division 1 Agenda. Server Background. Cache Heirarchy Performance Study. RS/6000 Value Server System Structure.

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture? This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo

More information

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to: 55 Topic 3 Computer Performance Contents 3.1 Introduction...................................... 56 3.2 Measuring performance............................... 56 3.2.1 Clock Speed.................................

More information

Chapter 2: Computer-System Structures. Computer System Operation Storage Structure Storage Hierarchy Hardware Protection General System Architecture

Chapter 2: Computer-System Structures. Computer System Operation Storage Structure Storage Hierarchy Hardware Protection General System Architecture Chapter 2: Computer-System Structures Computer System Operation Storage Structure Storage Hierarchy Hardware Protection General System Architecture Operating System Concepts 2.1 Computer-System Architecture

More information

Computer Systems Structure Input/Output

Computer Systems Structure Input/Output Computer Systems Structure Input/Output Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Examples of I/O Devices

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

OC By Arsene Fansi T. POLIMI 2008 1

OC By Arsene Fansi T. POLIMI 2008 1 IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5

More information

StrongARM** SA-110 Microprocessor Instruction Timing

StrongARM** SA-110 Microprocessor Instruction Timing StrongARM** SA-110 Microprocessor Instruction Timing Application Note September 1998 Order Number: 278194-001 Information in this document is provided in connection with Intel products. No license, express

More information

Secure Cloud Storage and Computing Using Reconfigurable Hardware

Secure Cloud Storage and Computing Using Reconfigurable Hardware Secure Cloud Storage and Computing Using Reconfigurable Hardware Victor Costan, Brandon Cho, Srini Devadas Motivation Computing is more cost-efficient in public clouds but what about security? Cloud Applications

More information

Table Of Contents. Page 2 of 26. *Other brands and names may be claimed as property of others.

Table Of Contents. Page 2 of 26. *Other brands and names may be claimed as property of others. Technical White Paper Revision 1.1 4/28/10 Subject: Optimizing Memory Configurations for the Intel Xeon processor 5500 & 5600 series Author: Scott Huck; Intel DCG Competitive Architect Target Audience:

More information

A Deduplication File System & Course Review

A Deduplication File System & Course Review A Deduplication File System & Course Review Kai Li 12/13/12 Topics A Deduplication File System Review 12/13/12 2 Traditional Data Center Storage Hierarchy Clients Network Server SAN Storage Remote mirror

More information

Module 2. Embedded Processors and Memory. Version 2 EE IIT, Kharagpur 1

Module 2. Embedded Processors and Memory. Version 2 EE IIT, Kharagpur 1 Module 2 Embedded Processors and Memory Version 2 EE IIT, Kharagpur 1 Lesson 5 Memory-I Version 2 EE IIT, Kharagpur 2 Instructional Objectives After going through this lesson the student would Pre-Requisite

More information

Architecture of Hitachi SR-8000

Architecture of Hitachi SR-8000 Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...

More information

Chapter 6. 6.1 Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig. 6.1. I/O devices can be characterized by. I/O bus connections

Chapter 6. 6.1 Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig. 6.1. I/O devices can be characterized by. I/O bus connections Chapter 6 Storage and Other I/O Topics 6.1 Introduction I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections

More information

1. Computer System Structure and Components

1. Computer System Structure and Components 1 Computer System Structure and Components Computer System Layers Various Computer Programs OS System Calls (eg, fork, execv, write, etc) KERNEL/Behavior or CPU Device Drivers Device Controllers Devices

More information

Rethinking SIMD Vectorization for In-Memory Databases

Rethinking SIMD Vectorization for In-Memory Databases SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2 Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of

More information

This Unit: Main Memory. CIS 501 Introduction to Computer Architecture. Memory Hierarchy Review. Readings

This Unit: Main Memory. CIS 501 Introduction to Computer Architecture. Memory Hierarchy Review. Readings This Unit: Main Memory CIS 501 Introduction to Computer Architecture Unit 4: Memory Hierarchy II: Main Memory Application OS Compiler Firmware CPU I/O Memory Digital Circuits Gates & Transistors Memory

More information

I/O. Input/Output. Types of devices. Interface. Computer hardware

I/O. Input/Output. Types of devices. Interface. Computer hardware I/O Input/Output One of the functions of the OS, controlling the I/O devices Wide range in type and speed The OS is concerned with how the interface between the hardware and the user is made The goal in

More information