Chapter Seven

Memories: Review
SRAM:
  value is stored on a pair of inverting gates
  very fast but takes up more space than DRAM (4 to 6 transistors)
DRAM:
  value is stored as a charge on a capacitor (must be refreshed)
  very small but slower than SRAM (factor of 5 to 10)
Exploiting Memory Hierarchy
Users want large and fast memories!
  SRAM access times are 2 - 25 ns at cost of $100 to $250 per Mbyte.
  DRAM access times are 60 - 120 ns at cost of $5 to $10 per Mbyte.
  Disk access times are 10 to 20 million ns at cost of $.10 to $.20 per Mbyte.
  (1997)
Try and give it to them anyway: build a memory hierarchy.
[Figure: levels in the memory hierarchy; from Level 1 near the CPU down to Level n, distance from the CPU in access time increases and the size of the memory at each level grows]

Locality
A principle that makes having a memory hierarchy a good idea.
If an item is referenced:
  temporal locality: it will tend to be referenced again soon
  spatial locality: nearby items will tend to be referenced soon
Why does code have locality?
Our initial focus: two levels (upper, lower)
  block: minimum unit of data
  hit: data requested is in the upper level
  miss: data requested is not in the upper level
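Both kinds of locality show up in code as simple as a loop that sums an array; a minimal sketch (the array contents are arbitrary):

```python
# Temporal locality: the loop body and the variable `total` are reused
# on every iteration.
# Spatial locality: successive iterations touch adjacent array elements,
# i.e. adjacent addresses.
def sum_array(data):
    total = 0
    for x in data:   # sequential accesses: spatial locality
        total += x   # repeated access to `total`: temporal locality
    return total

print(sum_array(list(range(10))))  # 45
```

This is why caching works at all: after the first element of a block is fetched, the rest of the block is very likely to be used too.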
Cache
Two issues:
  How do we know if a data item is in the cache?
  If it is, how do we find it?
Our first example: block size is one word of data, "direct mapped"
For each item of data at the lower level, there is exactly one location in the cache where it might be.
  e.g., lots of items at the lower level share locations in the upper level

Direct Mapped Cache
Mapping: address is modulo the number of blocks in the cache
[Figure: a direct-mapped cache; each memory block maps to exactly one cache location, so several memory blocks share each cache entry]
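The modulo mapping can be made concrete in a few lines; a sketch, assuming a hypothetical 8-block cache:

```python
# Direct mapped: each memory block has exactly one possible cache location.
NUM_BLOCKS = 8  # hypothetical cache size in blocks

def cache_index(block_address):
    # "address modulo the number of blocks in the cache"
    return block_address % NUM_BLOCKS

# Many memory blocks share one cache location:
print(cache_index(1), cache_index(9), cache_index(17))  # 1 1 1
```

Because the mapping is many-to-one, the cache must also store a tag (the remaining address bits) to tell which of the sharing blocks is actually resident.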
Direct Mapped Cache
For MIPS:
[Figure: 32-bit address (showing bit positions) split into a tag, a cache index, and a 2-bit byte offset; the index selects a cache entry whose valid bit and stored tag are compared with the address tag to produce Hit, and whose data field supplies Data]
What kind of locality are we taking advantage of?

Direct Mapped Cache
Taking advantage of spatial locality:
[Figure: direct-mapped cache with multi-word blocks (4K entries); the address now also carries a block offset, which drives a multiplexor to select one word of the block; hit detection via valid bit and tag comparison is unchanged]
Hits vs. Misses
Read hits:
  this is what we want!
Read misses:
  stall the CPU, fetch block from memory, deliver to cache, restart
Write hits:
  can replace data in cache and memory (write-through)
  write the data only into the cache (write-back the cache later)
Write misses:
  read the entire block into the cache, then write the word

Hardware Issues
Make reading multiple words easier by using banks of memory
[Figure: three memory organizations: a. one-word-wide memory, b. wide memory with a multiplexor between the cache and memory, c. interleaved memory with four memory banks on the bus]
It can get a lot more complicated...
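The difference between the two write-hit policies can be sketched with a toy one-word-block cache; all structures here are hypothetical, not from the slides:

```python
memory = {0: 10, 4: 20}  # toy backing store: address -> value
cache = {}               # cached copies: address -> value
dirty = set()            # write-back only: blocks newer than memory

def write_through(addr, value):
    cache[addr] = value
    memory[addr] = value          # memory is updated on every write

def write_back(addr, value):
    cache[addr] = value
    dirty.add(addr)               # memory is updated later, on eviction

def evict(addr):
    if addr in dirty:             # write the modified block back first
        memory[addr] = cache[addr]
        dirty.discard(addr)
    cache.pop(addr, None)
```

With write-back, memory holds a stale value until the block is evicted; write-through keeps cache and memory consistent at the cost of a memory access on every write.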
Performance
Increasing the block size tends to decrease miss rate:
[Figure: miss rate (0% to 40%) vs. block size in bytes, one curve per cache size; larger blocks reduce the miss rate up to a point, most strongly for the smaller caches]
Use split caches because there is more spatial locality in code:

  Program   Block size in words   Instruction miss rate   Data miss rate   Effective combined miss rate
  gcc       1                     6.1%                    2.1%             5.4%
  gcc       4                     2.0%                    1.7%             1.9%
  spice     1                     1.2%                    1.3%             1.2%
  spice     4                     0.3%                    0.6%             0.4%

Performance
Simplified model:
  execution time = (execution cycles + stall cycles) × cycle time
  stall cycles = # of instructions × miss ratio × miss penalty
Two ways of improving performance:
  decreasing the miss ratio
  decreasing the miss penalty
What happens if we increase block size?
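The simplified model can be evaluated directly; a sketch with made-up numbers (instruction count, miss ratio, penalty, and cycle time are all hypothetical):

```python
def execution_time_ns(instructions, cpi, miss_ratio, miss_penalty, cycle_time_ns):
    # execution time = (execution cycles + stall cycles) * cycle time
    execution_cycles = instructions * cpi
    # stall cycles = # of instructions * miss ratio * miss penalty
    stall_cycles = instructions * miss_ratio * miss_penalty
    return (execution_cycles + stall_cycles) * cycle_time_ns

# 1M instructions, CPI 1, 5% miss ratio, 40-cycle miss penalty, 2 ns cycle:
print(execution_time_ns(1_000_000, 1, 0.05, 40, 2))  # 6000000.0 ns
```

With these numbers, stall cycles (2M) dominate execution cycles (1M), which is why attacking the miss ratio or the miss penalty pays off so well.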
Set Associative Caches
Basic idea: a memory block can be mapped to more than one location in the cache
  Cache is divided into sets
  Each memory block is mapped to a particular set
  Each set can have more than one block
  Number of blocks in a set = associativity of the cache
  If a set has only one block, then it is a direct-mapped cache
    i.e., direct-mapped caches have a set associativity of 1
  Each memory block can be placed in any of the blocks of the set to which it maps

Direct mapped cache: block N maps to block (N mod number of blocks in cache)
Set associative cache: block N maps to set (N mod number of sets in cache)
[Figure: placement of the block whose address is 12 in an eight-block cache: direct mapped (one possible block, 12 mod 8 = 4, one tag compared), two-way set associative (set 12 mod 4 = 0, two tags searched), and fully associative (any block, all eight tags searched)]
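The two mapping rules differ only in what is counted by the modulo; a sketch for a hypothetical 8-block cache:

```python
NUM_BLOCKS = 8  # hypothetical cache size in blocks

def map_block(block_address, associativity):
    # associativity 1 = direct mapped; NUM_BLOCKS = fully associative
    num_sets = NUM_BLOCKS // associativity
    return block_address % num_sets  # set number (or block, when assoc == 1)

print(map_block(12, 1))           # direct mapped: block 12 mod 8 = 4
print(map_block(12, 2))           # two-way: set 12 mod 4 = 0
print(map_block(12, NUM_BLOCKS))  # fully associative: the only set, 0
```

Direct mapped and fully associative are just the two extremes of the same formula: as associativity rises, the number of sets shrinks and the hardware must search more tags per access.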
Decreasing miss ratio with associativity
[Figure: an eight-block cache configured four ways: one-way set associative (direct mapped, eight sets of one tag/data pair), two-way (four sets), four-way (two sets), and eight-way (fully associative, one set); each step up in associativity doubles the number of tag/data pairs searched per set]
Compared to direct mapped, give a series of references that:
  results in a lower miss ratio using a 2-way set associative cache
  results in a higher miss ratio using a 2-way set associative cache
assuming we use the least recently used replacement strategy

An implementation
[Figure: a four-way set associative cache; the index field of the address selects a set, the four valid bits and tags in that set are compared with the address tag in parallel, and a 4-to-1 multiplexor selects the matching way's data to produce Hit and Data]
Performance
[Figure: miss rate vs. associativity (one-way, two-way, four-way, eight-way), one curve per cache size from 1 KB to 128 KB; increasing associativity helps most for the smaller caches]

Set Associative Caches
Advantages:
  Miss ratio decreases as associativity increases
Disadvantages:
  Extra memory needed for extra tag bits in the cache
  Extra time for the associative search
Block Replacement Policies
What block to replace on a cache miss?
We have multiple candidates (unlike direct-mapped caches):
  Random
  FIFO (First In First Out)
  LRU (Least Recently Used)
Typically, CPUs use Random or approximate LRU because these are easier to implement in hardware

Example
Cache size = 4 one-word blocks; replacement policy = LRU
Sequence of memory references: 0, 8, 0, 6, 8
Set associativity = 4 (fully associative); number of sets = 1
[Table: the references traced against the single four-block set: 0 miss, 8 miss, 0 hit, 6 miss, 8 hit; all referenced blocks fit in the one set, giving 2 hits]
Example cont'd
Cache size = 4 one-word blocks; replacement policy = LRU
Sequence of memory references: 0, 8, 0, 6, 8
Set associativity = 2; number of sets = 2
[Table: 0, 8, and 6 all map to set 0, which holds only two blocks: 0 miss, 8 miss, 0 hit, 6 miss (evicts 8, the least recently used), 8 miss; 1 hit]

Example cont'd
Cache size = 4 one-word blocks; replacement policy = LRU
Sequence of memory references: 0, 8, 0, 6, 8
Set associativity = 1 (direct-mapped cache)
[Table: 0 and 8 both map to block 0 and keep evicting each other, while 6 maps to block 2: every reference misses; 0 hits]
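The three configurations can be checked with a small LRU simulator; a sketch, assuming the reference sequence 0, 8, 0, 6, 8 used in the textbook version of this example:

```python
def lru_hits(refs, num_blocks=4, associativity=4):
    num_sets = num_blocks // associativity
    sets = [[] for _ in range(num_sets)]  # each set: LRU block first
    hits = 0
    for addr in refs:
        s = sets[addr % num_sets]
        if addr in s:
            hits += 1
            s.remove(addr)       # refresh: re-append as most recently used
        elif len(s) == associativity:
            s.pop(0)             # miss in a full set: evict the LRU block
        s.append(addr)
    return hits

refs = [0, 8, 0, 6, 8]
print(lru_hits(refs, associativity=4))  # fully associative: 2 hits
print(lru_hits(refs, associativity=2))  # two-way: 1 hit
print(lru_hits(refs, associativity=1))  # direct mapped: 0 hits
```

Note that for this particular sequence every increase in associativity helps, but that is not guaranteed in general, which is the point of the exercise on the earlier slide.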
Decreasing miss penalty with multilevel caches
Add a second level cache:
  often primary cache is on the same chip as the processor
  use SRAMs to add another cache above primary memory (DRAM)
  miss penalty goes down if data is in the 2nd level cache
Example:
  CPI of 1.0 on a 500 MHz machine with a 5% miss rate, 200 ns DRAM access
  Adding a 2nd level cache with a 20 ns access time decreases the miss rate to 2%
Using multilevel caches:
  try and optimize the hit time on the 1st level cache
  try and optimize the miss rate on the 2nd level cache
[Figure: improvement factor by year for fast CPUs, slow CPUs, and DRAM; processor performance has improved far faster than DRAM access time]
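The example can be worked through numerically; a sketch, assuming the textbook's figures (500 MHz clock, base CPI 1.0, 5% L1 miss rate, 200 ns DRAM access, 20 ns L2 access, 2% of accesses still going to DRAM):

```python
CYCLE_NS = 2.0   # 500 MHz clock -> 2 ns per cycle
BASE_CPI = 1.0

def cpi_no_l2(l1_miss_rate=0.05, dram_ns=200):
    # every L1 miss pays the full 100-cycle trip to DRAM
    return BASE_CPI + l1_miss_rate * (dram_ns / CYCLE_NS)

def cpi_with_l2(l1_miss_rate=0.05, l2_ns=20, global_miss_rate=0.02, dram_ns=200):
    # every L1 miss pays the 10-cycle L2 access; only the 2% that also
    # miss in L2 pay the 100-cycle DRAM penalty on top
    return (BASE_CPI
            + l1_miss_rate * (l2_ns / CYCLE_NS)
            + global_miss_rate * (dram_ns / CYCLE_NS))

print(cpi_no_l2())    # about 6.0
print(cpi_with_l2())  # about 3.5
```

With these assumptions the L2 cuts the effective CPI from about 6.0 to about 3.5, even though it does nothing for the hits that L1 already serves.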
[Figure: miss rate per type vs. cache size (KB), one curve per associativity from one-way to eight-way; capacity misses set a floor that only a larger cache removes, while added associativity trims the remaining conflict misses]

Virtual Memory
Main memory can act as a cache for the secondary storage (disk)
[Figure: virtual addresses are mapped by address translation either to physical addresses in main memory or to disk addresses]
Advantages:
  illusion of having more physical memory
  program relocation
  protection
Recall: each MIPS program has an address space of size 2^32 bytes
[Figure: MIPS memory layout; the stack grows down from $sp = 7fff ffff hex, dynamic data grows up above the static data at $gp = 1000 8000 hex, the text segment starts at pc = 0040 0000 hex, and the region below it is reserved]

Pages: virtual memory blocks
Page faults: the data is not in memory, retrieve it from disk
  huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
  reducing page faults is important (LRU is worth the price)
  can handle the faults in software instead of hardware
  using write-through is too expensive, so we use write-back
[Figure: a 32-bit virtual address split into a virtual page number (bits 31-12) and a page offset (bits 11-0); translation replaces the virtual page number with a physical page number, and the page offset passes through unchanged to form the physical address]
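The split-and-translate step can be sketched in a few lines for 4 KB pages (the page table contents below are hypothetical):

```python
PAGE_OFFSET_BITS = 12  # 4 KB pages

def translate(vaddr, page_table):
    vpn = vaddr >> PAGE_OFFSET_BITS                  # virtual page number, bits 31-12
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)   # page offset, bits 11-0
    ppn = page_table[vpn]   # a missing entry would be a page fault (KeyError here)
    return (ppn << PAGE_OFFSET_BITS) | offset        # offset is unchanged

page_table = {0x00400: 0x12345}                      # hypothetical single mapping
print(hex(translate(0x00400ABC, page_table)))        # 0x12345abc
```

Only the page-number bits are translated; keeping the offset untouched is what lets a page be the unit of both transfer and protection.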
Page Tables
[Figure: the page table maps each virtual page number, through a valid bit, either to a physical page in physical memory or to a disk address in disk storage]

Page Tables
[Figure: the page table register points to the page table in memory; the virtual page number (bits 31-12) indexes the table, and each entry holds a valid bit and a physical page number; if the valid bit is 0, the page is not present in memory; on a valid entry, the physical page number is concatenated with the page offset (bits 11-0) to form the physical address]
Making Address Translation Fast
A cache for address translations: the translation lookaside buffer (TLB)
[Figure: the TLB holds a valid bit, tag (virtual page number), and physical page address for recently used translations, backed by the full page table, whose entries point either to a physical page or to a disk address]

TLBs and Caches
[Figure: flowchart for a memory access: perform the TLB access; on a TLB miss, raise a TLB miss exception; on a TLB hit, form the physical address; for a read, try to read the data from the cache (a cache hit delivers the data to the CPU, a cache miss stalls); for a write, check the write access bit (if off, raise a write protection exception; if on, write the data into the cache, update the tag, and put the data and the address into the write buffer)]
Modern Systems
Very complicated memory systems:

  Characteristic     Intel Pentium Pro                  PowerPC 604
  Virtual address    32 bits                            52 bits
  Physical address   32 bits                            32 bits
  Page size          4 KB, 4 MB                         4 KB, selectable, and 256 MB
  TLB organization   A TLB for instructions and         A TLB for instructions and
                     a TLB for data                     a TLB for data
                     Both four-way set associative      Both two-way set associative
                     Pseudo-LRU replacement             LRU replacement
                     Instruction TLB: 32 entries        Instruction TLB: 128 entries
                     Data TLB: 64 entries               Data TLB: 128 entries
                     TLB misses handled in hardware     TLB misses handled in hardware

  Characteristic        Intel Pentium Pro                   PowerPC 604
  Cache organization    Split instruction and data caches   Split instruction and data caches
  Cache size            8 KB each for instructions/data     16 KB each for instructions/data
  Cache associativity   Four-way set associative            Four-way set associative
  Replacement           Approximated LRU replacement        LRU replacement
  Block size            32 bytes                            32 bytes
  Write policy          Write-back                          Write-back or write-through

Some Issues
Processor speeds continue to increase very fast
  much faster than either DRAM or disk access times
Design challenge: dealing with this growing disparity
Trends:
  synchronous SRAMs (provide a burst of data)
  redesign DRAM chips to provide higher bandwidth or processing
  restructure code to increase locality
  use prefetching (make the cache visible to the ISA)