Computer Science 246 Computer Architecture

Computer Science 246 Computer Architecture, Spring 2009, Harvard University. Instructor: Prof. David Brooks, dbrooks@eecs.harvard.edu. Memory Hierarchy and Caches (Part 1)

Memory Hierarchy Design. Until now we have assumed a very idealized memory: all accesses take 1 cycle, which assumes an unlimited amount of very fast memory. But fast memory is very expensive, and large amounts of fast memory would also be slow! The tradeoffs are cost vs. speed and size vs. speed. Solution: put smaller, faster, more expensive memory close to the core (the cache) and larger, slower, cheaper memory farther away.

Processor-Memory Performance Gap. [Figure: log-scale performance vs. year, 1980-2004. Processor performance (Moore's Law) grows ~55%/year (2x every 1.5 years); DRAM performance grows ~7%/year (2x every 10 years); the processor-memory performance gap grows ~50%/year.]

The Memory Wall. The logic vs. DRAM speed gap continues to grow. [Figure: clocks per instruction for the core vs. clocks per DRAM access, log scale from 0.01 to 1000, from VAX/1980 through PPro/1996 to 2010+.]

Memory Performance Impact on Performance. Suppose a processor executes at an ideal CPI of 1.1 with an instruction mix of 50% arithmetic/logic, 30% load/store, and 20% control, and that 10% of data memory operations miss with a 50-cycle miss penalty. CPI = ideal CPI + average stalls per instruction = 1.1 + (0.30 data memory ops/instr x 0.10 misses/data memory op x 50 cycles/miss) = 1.1 + 1.5 = 2.6 cycles, so 58% of the time the processor is stalled waiting for memory! [Pie chart: ideal CPI 1.1, data-miss stalls 1.5, instruction-miss stalls 0.5.] A 1% instruction miss rate would add a further 0.5 to the CPI. (A short calculation sketch follows.)
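The arithmetic above is easy to check in a few lines of C; this is only a sketch of the slide's formula, and the function name and parameters are illustrative rather than anything defined in the lecture.

    #include <stdio.h>

    /* CPI = ideal CPI + (data memory ops per instr) x (miss rate) x (miss penalty) */
    static double cpi_with_stalls(double ideal_cpi, double mem_ops_per_instr,
                                  double miss_rate, double miss_penalty) {
        return ideal_cpi + mem_ops_per_instr * miss_rate * miss_penalty;
    }

    int main(void) {
        double cpi = cpi_with_stalls(1.1, 0.30, 0.10, 50.0);   /* 1.1 + 1.5 = 2.6 */
        printf("CPI = %.1f, stalled %.0f%% of the time\n",
               cpi, 100.0 * (cpi - 1.1) / cpi);                /* ~58% */
        return 0;
    }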

The Memory Hierarchy: Goal. Fact: large memories are slow and fast memories are small. How do we create a memory that gives the illusion of being large, cheap, and fast (most of the time)? With hierarchy, and with parallelism.

A Typical Memory Hierarchy. By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology. [Diagram: on-chip components (register file, ITLB, DTLB, instruction and data caches, eDRAM), a second-level cache (SRAM), main memory (DRAM), and secondary memory (disk).] Speed (cycles): ~1/2, ~1, ~10, ~100, ~1,000. Size (bytes): 100s, KBs, 10s of KB, MBs, GBs to TBs. Cost per byte: highest at the top, lowest at the bottom.

Characteristics of the Memory Hierarchy. Access time (and relative size) increase with distance from the processor: the processor exchanges 4-8 byte words with the L1$, the L1$ exchanges 8-32 byte blocks with the L2$, the L2$ exchanges 1 to 4 blocks with main memory, and main memory exchanges 1,024+ byte pages (disk sectors) with secondary memory. The hierarchy is inclusive: what is in the L1$ is a subset of what is in the L2$, which is a subset of what is in main memory, which is a subset of what is in secondary memory.

Memory Hierarchy Technologies. Caches use SRAM for speed and technology compatibility: low density (6-transistor cells), high power, expensive, fast; static, so contents last as long as power is applied. [Diagram: a 2M x 16 SRAM with a 21-bit address, chip select, output enable, write enable, and 16-bit Din/Dout.] Main memory uses DRAM for size (density): high density (1-transistor cells), low power, cheap, slow; dynamic, so it needs to be refreshed regularly (~every 8 ms), which consumes 1% to 2% of the DRAM's active cycles. Addresses are divided into two halves (row and column): RAS, the Row Access Strobe, triggers the row decoder; CAS, the Column Access Strobe, triggers the column selector.

Memory Performance Metrics. Latency: the time to access one word. Access time: the time between the request and when the data is available (or written). Cycle time: the time between requests; usually cycle time > access time. Typical read access times for SRAMs in 2005 range from 2-4 ns for the fastest parts to 8-20 ns for the typical largest parts. Bandwidth: how much data the memory can supply to the processor per unit time, i.e., the width of the data channel times the rate at which it can be used. Size: DRAM is roughly 4 to 8 times denser than SRAM. Cost/cycle time: SRAM and DRAM differ by a factor of roughly 8 to 16.

What is a cache? Small, fast storage used to improve the average access time to slow memory. It holds a subset of the instructions and data used by the program and exploits spatial and temporal locality. [Diagram: processor/registers at the top, then L1-cache, L2-cache, memory, and disk/tape; levels get bigger going down and faster going up.]

Caches are everywhere. In computer architecture, almost everything is a cache! Registers are a (software-managed) cache on variables; the first-level cache is a cache on the second-level cache; the second-level cache is a cache on memory; memory is a cache on disk (virtual memory); the TLB is a cache on the page table; branch prediction is a cache on prediction information?

Program locality is why caches work. The memory hierarchy exploits program locality: programs tend to reference parts of their address space that are local in time and space. Temporal locality: recently referenced addresses are likely to be referenced again (reuse), so keep this data close to the CPU. Spatial locality: if an address is referenced, nearby addresses are likely to be referenced soon. Programs that don't exploit locality won't benefit from caches.

Locality Example: j = val1; k = val2; for (i = 0; i < 10000; i++) { A[i] += j; B[i] += k; } What data locality do i, A, B, j, and k exhibit? What about instruction locality? (See the sketch below.)
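A runnable version of the loop (a sketch: the array size matches the slide, while the initial values standing in for val1/val2 are placeholders) makes the answer concrete: A[i] and B[i] are accessed with unit stride, so they have spatial locality; i, j, and k are reused every iteration, so they have temporal locality; and the small loop body gives near-perfect instruction locality.

    #include <stdio.h>

    #define N 10000

    int A[N], B[N];

    int main(void) {
        int j = 1, k = 2;                /* placeholders for val1, val2 */
        for (int i = 0; i < N; i++) {    /* the loop from the slide */
            A[i] += j;                   /* unit-stride accesses: spatial locality */
            B[i] += k;                   /* i, j, k reused each iteration: temporal locality */
        }
        printf("%d %d\n", A[0], B[N - 1]);
        return 0;
    }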

The Memory Hierarchy: Terminology. Higher levels in the hierarchy are closer to the CPU. At each level, a block is the minimum amount of data whose presence is checked; blocks are also often called lines. Block size is always a power of 2, and the block sizes of the lower levels are usually fixed multiples of the block sizes of the higher levels.

The Memory Hierarchy: Terminology (2). Hit: the data is in some block in the upper level (Blk X). Hit rate: the fraction of memory accesses found in the upper level. Hit time: the time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss. [Diagram: the processor exchanging data with an upper-level memory holding Blk X and a lower-level memory holding Blk Y.] Miss: the data is not in the upper level, so it needs to be retrieved from a block in the lower level (Blk Y). Miss rate = 1 - hit rate. Miss penalty: the time to replace a block in the upper level plus the time to deliver the block to the processor. Hit time << miss penalty.

Terminology (3). Miss penalty = access time + transfer time. [Plot: miss penalty vs. block size, split into an access-time component and a transfer-time component that grows with block size.] Access time is a function of latency: the time for the memory to receive the request and process it. Transfer time is a function of bandwidth: the time for the whole block to come back.

Access vs. Transfer Time. [Diagram: cache-to-memory miss timing broken into segments A, B, and C.] Miss penalty = A + B + C; access time = A + B; transfer time = C.

Memory Hierarchy Performance. Miss rate by itself is a misleading indicator of memory hierarchy performance; average memory access time (AMAT) is better, but execution time is always the best. AMAT = hit time + miss rate x miss penalty. Example: hit time = 1 ns, miss penalty = 20 ns, miss rate = 0.1, so AMAT = 1 + 0.1 x 20 = 3 ns.
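The AMAT formula translates directly into code; a minimal sketch (the function name is hypothetical) that reproduces the 3 ns example:

    #include <stdio.h>

    /* AMAT = hit time + miss rate x miss penalty */
    static double amat(double hit_ns, double miss_rate, double miss_penalty_ns) {
        return hit_ns + miss_rate * miss_penalty_ns;
    }

    int main(void) {
        printf("AMAT = %.1f ns\n", amat(1.0, 0.1, 20.0));   /* 1 + 0.1 * 20 = 3.0 ns */
        return 0;
    }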

Caching Basics. The most basic caching questions: Q1: How do we know if a data item is in the cache? Q2: If it is, how do we find it? If it isn't, how do we get it?

More Detailed Questions Block placement policy? Where does a block go when it is fetched? Block identification policy? How do we find a block in the cache? Block replacement policy? When fetching a block into a full cache, how do we decide what other block gets kicked out? Write strategy? Does any of this differ for reads vs. writes?

General View of Caches. A cache is made of frames: frame = data + tag + state bits. The state bits are valid (the tag/data are present) and dirty (the data has been written). Cache algorithm: find the candidate frame(s); if the incoming tag != the stored tag, it is a miss: evict the block currently in the frame and replace it with the block from memory (or the L2 cache); then return the appropriate word within the block.
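The frame layout and the tag check can be sketched as a tiny C structure and lookup. This is only a sketch: the sizes, names, and the fact that eviction and refill are left to the caller are illustrative assumptions, not part of the lecture.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_FRAMES 1024                /* illustrative cache size  */
    #define WORDS_PER_BLOCK 4              /* illustrative block size  */

    struct frame {
        bool     valid;                    /* state bit: tag and data are meaningful */
        bool     dirty;                    /* state bit: data written since the fill */
        uint32_t tag;                      /* high-order address bits */
        uint32_t data[WORDS_PER_BLOCK];    /* the cached block */
    };

    static struct frame cache[NUM_FRAMES];

    /* Returns true on a hit; on a miss the caller would evict the current block
       and refill the frame from memory (or the L2 cache). */
    bool cache_lookup(uint32_t index, uint32_t tag) {
        const struct frame *f = &cache[index];
        return f->valid && f->tag == tag;  /* incoming tag vs. stored tag */
    }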

Caching: A Simple First Example. [Diagram: a 4-entry cache with index values 00-11, each entry holding a valid bit, tag, and data, mapped onto main memory words 0000xx through 1111xx.] Q1: Is it there? Compare the cache tag to the high-order 2 memory address bits to tell if the memory block is in the cache. The two low-order bits define the byte within the word (32-bit words). Q2: How do we find it? Use the next 2 low-order memory address bits, the index, to determine which cache block: (block address) modulo (# of blocks in the cache).

Direct Mapped Cache. Consider the main memory word reference string 0 1 2 3 4 3 4 15, starting with an empty cache (all blocks initially marked as not valid): 0 miss, 1 miss, 2 miss, 3 miss, 4 miss (evicting 0), 3 hit, 4 hit, 15 miss (evicting 3). 8 requests, 6 misses.

MIPS Direct Mapped Cache Example. One word per block, cache size = 1K words. [Diagram: the 32-bit address is split into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0); the index selects one of 1024 entries, each holding a valid bit, a 20-bit tag, and a 32-bit data word; hit = valid AND tag match.] What kind of locality are we taking advantage of?
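For this 1K-word, one-word-per-block cache, pulling the fields out of an address is just shifts and masks. A sketch in C (the example address is arbitrary; the field widths follow the slide):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t addr        = 0x12345678u;           /* arbitrary example address        */
        uint32_t byte_offset =  addr        & 0x3u;   /* bits 1..0                        */
        uint32_t index       = (addr >> 2)  & 0x3FFu; /* bits 11..2: 10 bits, 1K entries  */
        uint32_t tag         =  addr >> 12;           /* bits 31..12: 20 bits             */
        printf("tag=0x%05x index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)byte_offset);
        return 0;
    }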

Handling Cache Hits. Read hits (I$ and D$): this is what we want! Write hits (D$ only): we can either allow the cache and memory to be inconsistent, writing the data only into the cache block (write-back: the cache contents are written to the next level of the memory hierarchy when the block is evicted; this needs a dirty bit per data cache block to tell whether the block must be written back on eviction), or require the cache and memory to be consistent, always writing the data into both the cache block and the next level of the memory hierarchy (write-through, so no dirty bit is needed). Write-through writes run at the speed of the next level in the memory hierarchy, so they are slow, unless we use a write buffer so we only have to stall when the write buffer is full.

Write Buffer for Write-Through Caching. [Diagram: processor -> cache -> DRAM, with a write buffer between the cache and main memory.] The processor writes data into the cache and the write buffer; the memory controller writes the contents of the write buffer to memory. The write buffer is just a FIFO, typically with about 4 entries. This works fine as long as the store frequency (with respect to time) << 1 / DRAM write cycle time. The memory system designer's nightmare is when the store frequency approaches 1 / DRAM write cycle time, leading to write buffer saturation. One solution is to use a write-back cache; another is to use an L2 cache (next lecture).

Another Reference String Mapping. Consider the main memory word reference string 0 4 0 4 0 4 0 4, starting with an empty cache (all blocks initially marked as not valid). Every access is a miss, each one evicting the other word: 8 requests, 8 misses. This ping-pong effect is due to conflict misses: two memory locations that map into the same cache block. (A toy simulation reproducing this is sketched below.)
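A toy simulation makes the ping-pong effect concrete. This is only a sketch: it models the 4-block, one-word-per-block direct-mapped cache used in these examples and replays the reference string.

    #include <stdbool.h>
    #include <stdio.h>

    #define BLOCKS 4                               /* direct mapped, one word per block */

    int main(void) {
        int  refs[] = {0, 4, 0, 4, 0, 4, 0, 4};    /* word addresses from the slide */
        int  n      = 8;
        bool valid[BLOCKS] = {false};
        int  tag[BLOCKS]   = {0};
        int  misses = 0;

        for (int i = 0; i < n; i++) {
            int index = refs[i] % BLOCKS;          /* 0 and 4 both map to block 0 */
            int t     = refs[i] / BLOCKS;
            if (!valid[index] || tag[index] != t) {
                misses++;                          /* miss: install the new block */
                valid[index] = true;
                tag[index]   = t;
            }
        }
        printf("%d requests, %d misses\n", n, misses);   /* 8 requests, 8 misses */
        return 0;
    }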

Sources of Cache Misses. Compulsory (cold start, process migration, first reference): the first access to a block; a cold fact of life, not much you can do about it, and if you are going to run millions of instructions, compulsory misses are insignificant. Conflict (collision): multiple memory locations map to the same cache location. Solution 1: increase the cache size. Solution 2: increase associativity (next lecture). Capacity: the cache cannot contain all the blocks accessed by the program. Solution: increase the cache size.

Handling Cache Misses. Read misses (I$ and D$): stall the entire pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache and send the requested word to the processor, then let the pipeline resume. Write misses (D$ only): 1. stall the pipeline, fetch the block from the next level in the memory hierarchy, install it in the cache (which may involve evicting a dirty block if using a write-back cache), write the word from the processor to the cache, then let the pipeline resume; or 2. (normally used in write-back caches) write allocate: just write the word into the cache, updating both the tag and data; no need to check for a cache hit, no need to stall; or 3. (normally used in write-through caches with a write buffer) no-write allocate: skip the cache write and just write the word to the write buffer (and eventually to the next memory level); no need to stall if the write buffer isn't full, but the cache block must be invalidated since it would otherwise hold stale data.

Multiword Block Direct Mapped Cache. Four words per block, cache size = 1K words. [Diagram: the 32-bit address is split into a 20-bit tag (bits 31-12), an 8-bit index (bits 11-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0); the index selects one of 256 entries, each holding a valid bit, a 20-bit tag, and a four-word data block; the block offset selects the word within the block.] What kind of locality are we taking advantage of?

Taking Advantage of Spatial Locality. Let a cache block hold more than one word. For the same reference string 0 1 2 3 4 3 4 15, starting with an empty cache but with two-word blocks: 0 miss, 1 hit, 2 miss, 3 hit, 4 miss, 3 hit, 4 hit, 15 miss. 8 requests, 4 misses.

Miss Rate vs. Block Size vs. Cache Size. [Plot: miss rate (%) vs. block size (8 to 256 bytes) for cache sizes of 8 KB, 16 KB, 64 KB, and 256 KB.] The miss rate goes up if the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in the same size cache is smaller (increasing capacity misses).

Block Size Tradeoff. Larger block sizes take advantage of spatial locality, but if the block size is too big relative to the cache size, the miss rate will go up, and a larger block size means a larger miss penalty (the latency to the first word in the block plus the transfer time for the remaining words). [Plots: miss rate vs. block size (exploits spatial locality at first, then fewer blocks compromises temporal locality), miss penalty vs. block size (grows), and average access time vs. block size (eventually rises with the increased miss penalty and miss rate).] In general, average memory access time = hit time + miss penalty x miss rate.

Multiword Block Considerations. Read misses (I$ and D$) are processed the same as for single-word blocks: a miss returns the entire block from memory, and the miss penalty grows as the block size grows. Early restart: the datapath resumes execution as soon as the requested word of the block is returned. Requested word first: the requested word is transferred from memory to the cache (and datapath) first. A nonblocking cache allows the datapath to continue to access the cache while the cache is handling an earlier miss. Write misses (D$): we can't use the simple write-allocate scheme or we would end up with a garbled block in the cache (e.g., for 4-word blocks: a new tag, one word of data from the new block, and three words of data from the old block), so we must fetch the block from memory first and pay the stall time.

Cache Summary. The principle of locality: a program is likely to access a relatively small portion of the address space at any instant of time; temporal locality is locality in time, spatial locality is locality in space. Three major categories of cache misses: compulsory misses (sad facts of life, e.g., cold-start misses); conflict misses (reduce by increasing cache size and/or associativity; nightmare scenario: the ping-pong effect!); and capacity misses (reduce by increasing cache size). The cache design space includes total size, block size, associativity (and replacement policy), write-hit policy (write-through vs. write-back), and write-miss policy (write allocate, write buffers).

Impacts of Cache Performance. The relative cache penalty increases as processor performance improves (faster clock rate and/or lower CPI): memory speed is unlikely to improve as fast as processor cycle time, and when calculating CPI_stall, the cache miss penalty is measured in processor clock cycles needed to handle a miss; the lower the CPI_ideal, the more pronounced the impact of stalls. Example: a processor with a CPI_ideal of 2, a 100-cycle miss penalty, 36% load/store instructions, and 2% I$ and 4% D$ miss rates. Memory-stall cycles = 0.02 x 100 + 0.36 x 0.04 x 100 = 3.44, so CPI_stall = 2 + 3.44 = 5.44. What if CPI_ideal is reduced to 1? 0.5? 0.25? What if the processor clock rate is doubled (doubling the miss penalty)? (See the sketch below.)
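The what-if questions can be answered by plugging into the same formula; a sketch in C using the instruction mix and miss rates above (the function name is illustrative):

    #include <stdio.h>

    /* Memory-stall cycles per instruction =
       I$ miss rate x penalty + load/store fraction x D$ miss rate x penalty */
    static double stall_cycles(double penalty) {
        return 0.02 * penalty + 0.36 * 0.04 * penalty;
    }

    int main(void) {
        double ideal[] = {2.0, 1.0, 0.5, 0.25};
        for (int i = 0; i < 4; i++) {
            double cpi = ideal[i] + stall_cycles(100.0);
            printf("CPI_ideal = %.2f -> CPI = %.2f (%.0f%% of cycles are stalls)\n",
                   ideal[i], cpi, 100.0 * stall_cycles(100.0) / cpi);
        }
        /* Doubling the clock rate doubles the penalty in cycles:
           stall_cycles(200.0) = 6.88 cycles per instruction. */
        return 0;
    }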

Reducing Cache Miss Rates #1: allow more flexible block placement. In a direct-mapped cache a memory block maps to exactly one cache block. At the other extreme, we could allow a memory block to be mapped to any cache block: a fully associative cache. A compromise is to divide the cache into sets, each of which consists of n ways (n-way set associative). A memory block maps to a unique set, specified by the index field, and can be placed in any way of that set (so there are n choices): (block address) modulo (# of sets in the cache).

Set Associative Cache Example. [Diagram: a 2-set, 2-way cache (each entry with a valid bit, tag, and data), one-word blocks, mapped onto main memory words 0000xx through 1111xx.] Q1: Is it there? Compare all the cache tags in the set to the high-order 3 memory address bits to tell if the memory block is in the cache. The two low-order bits define the byte within the word (32-bit words). Q2: How do we find it? Use the next low-order memory address bit to determine which cache set, i.e., (block address) modulo (the number of sets in the cache).

Another Reference String Mapping. Consider the main memory word reference string 0 4 0 4 0 4 0 4 again, starting with an empty cache, but now with a 2-way set associative cache: 0 miss, 4 miss, then every remaining access hits because Mem(0) and Mem(4) co-exist in the same set. 8 requests, 2 misses. This solves the ping-pong effect seen in the direct-mapped cache due to conflict misses, since two memory locations that map into the same cache set can now co-exist!

Four-Way Set Associative Cache. 2^8 = 256 sets, each with four ways (each way holding one block). [Diagram: the 32-bit address is split into a 22-bit tag, an 8-bit index, and a 2-bit byte offset; the index selects a set, the four tags in the set are compared in parallel, and a 4-to-1 multiplexor selects the data from the hitting way.]

Range of Set Associative Caches. For a fixed-size cache, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets; it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit. [Diagram: the address is split into tag (used for the tag compare), index (selects the set), block offset (selects the word in the block), and byte offset. Decreasing associativity leads toward direct mapped (only one way, smaller tags); increasing associativity leads toward fully associative (only one set, where the tag is all the bits except the block and byte offsets).] (See the sketch below.)
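The index/tag bookkeeping can be tabulated directly; a sketch in C that assumes an illustrative 32 KB cache with 64-byte blocks and a 32-bit address, and walks from direct mapped to fully associative:

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        int cache_bytes = 32 * 1024;   /* illustrative total size  */
        int block_bytes = 64;          /* illustrative block size  */
        int blocks      = cache_bytes / block_bytes;
        int offset_bits = (int)log2(block_bytes);

        for (int ways = 1; ways <= blocks; ways *= 2) {
            int sets       = blocks / ways;                  /* halves as ways double */
            int index_bits = (int)log2(sets);                /* shrinks by 1 bit      */
            int tag_bits   = 32 - index_bits - offset_bits;  /* grows by 1 bit        */
            printf("%5d-way: %4d sets, index = %2d bits, tag = %2d bits\n",
                   ways, sets, index_bits, tag_bits);
        }
        return 0;
    }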

Costs of Set Associative Caches. When a miss occurs, which way's block do we pick for replacement? Least Recently Used (LRU): the block replaced is the one that has been unused for the longest time. This requires hardware to keep track of when each way's block was used relative to the other blocks in the set; for a 2-way set associative cache it takes one bit per set (set the bit when a block is referenced and reset the other way's bit). An N-way set associative cache costs N comparators (delay and area) plus a MUX delay (set selection) before the data is available; the data is available only after set selection (and the hit/miss decision), whereas in a direct-mapped cache the cache block is available before the hit/miss decision. So it is not possible to just assume a hit, continue, and recover later if it was a miss.
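For the 2-way case, the one-bit-per-set LRU state mentioned above can be sketched in a couple of lines of C (the array size and function names are illustrative):

    #include <stdbool.h>

    #define SETS 256                        /* illustrative number of sets */

    /* One LRU bit per set: records which way (0 or 1) was used most recently. */
    static bool mru_is_way1[SETS];

    void lru_touch(int set, int way)  { mru_is_way1[set] = (way == 1); }   /* on a hit or fill */
    int  lru_victim(int set)          { return mru_is_way1[set] ? 0 : 1; } /* replace the other way */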

Benefits of Set Associative Caches. The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation. [Plot: miss rate vs. associativity (1-way, 2-way, 4-way, 8-way) for cache sizes from 4KB to 512KB; data from Hennessy & Patterson, Computer Architecture, 2003.] The largest gains are in going from direct mapped to 2-way (a 20%+ reduction in miss rate).

Reducing Cache Miss Rates #2: use multiple levels of caches. With advancing technology there is more than enough room on the die for bigger L1 caches or for a second level of caches, normally a unified L2 cache (i.e., it holds both instructions and data), and in some cases even a unified L3 cache. For our example (CPI_ideal of 2, 100-cycle miss penalty to main memory, 36% load/stores, 2% L1 I$ and 4% L1 D$ miss rates), add a unified L2$ that has a 25-cycle miss penalty and a 0.5% miss rate: CPI_stall = 2 + 0.02 x 25 + 0.36 x 0.04 x 25 + 0.005 x 100 + 0.36 x 0.005 x 100 = 3.54, compared to 5.44 with no L2$. (See the sketch below.)
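The two-level calculation generalizes the earlier one; a minimal sketch in C that reproduces the 3.54 figure using the parameters in the example:

    #include <stdio.h>

    int main(void) {
        double ld_st       = 0.36;    /* fraction of instructions that are loads/stores */
        double l1i_miss    = 0.02, l1d_miss = 0.04;
        double l2_penalty  = 25.0;    /* L1 miss that hits in the L2 */
        double l2_miss     = 0.005;   /* global miss rate of the unified L2 */
        double mem_penalty = 100.0;   /* L2 miss, go to main memory */

        double stalls = l1i_miss * l2_penalty
                      + ld_st * l1d_miss * l2_penalty
                      + l2_miss * mem_penalty
                      + ld_st * l2_miss * mem_penalty;
        printf("CPI = %.2f\n", 2.0 + stalls);   /* 3.54, vs. 5.44 with no L2 */
        return 0;
    }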

Multilevel Cache Design Considerations. The design considerations for L1 and L2 caches are very different. The primary cache should focus on minimizing hit time in support of a shorter clock cycle: smaller, with smaller block sizes. The secondary cache(s) should focus on reducing the miss rate, to reduce the penalty of long main memory access times: larger, with larger block sizes. The miss penalty of the L1 cache is significantly reduced by the presence of an L2 cache, so the L1 can be smaller (i.e., faster) even if that means a higher L1 miss rate. For the L2 cache, hit time is less important than miss rate: the L2$ hit time determines the L1$'s miss penalty, and the L2$ local miss rate is much larger than its global miss rate.

Key Cache Design Parameters.

                               L1 typical        L2 typical
  Total size (blocks)          250 to 2,000      4,000 to 250,000
  Total size (KB)              16 to 64          500 to 8,000
  Block size (B)               32 to 64          32 to 128
  Miss penalty (clocks)        10 to 25          100 to 1,000
  Miss rates (global for L2)   2% to 5%          0.1% to 2%

Two Machines' Cache Parameters.

                       Intel P4                                 AMD Opteron
  L1 organization      Split I$ and D$                          Split I$ and D$
  L1 cache size        8KB for D$, 96KB for trace cache (~I$)   64KB each for I$ and D$
  L1 block size        64 bytes                                 64 bytes
  L1 associativity     4-way set assoc.                         2-way set assoc.
  L1 replacement       ~LRU                                     LRU
  L1 write policy      write-through                            write-back
  L2 organization      Unified                                  Unified
  L2 cache size        512KB                                    1024KB (1MB)
  L2 block size        128 bytes                                64 bytes
  L2 associativity     8-way set assoc.                         16-way set assoc.
  L2 replacement       ~LRU                                     ~LRU
  L2 write policy      write-back                               write-back

4 Questions for the Memory Hierarchy Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)

Q3: Which block should be replaced on a miss? Easy for direct mapped: there is only one choice. For set associative or fully associative caches: random, or LRU (Least Recently Used). For a 2-way set associative cache, random replacement has a miss rate about 1.1 times higher than LRU. LRU is too costly to implement for high degrees of associativity (> 4-way) since tracking the usage information is expensive.

Q1 & Q2: Where can a block be placed/found?

                       # of sets                                   Blocks per set
  Direct mapped        # of blocks in the cache                    1
  Set associative      (# of blocks in the cache) / associativity  Associativity (typically 2 to 16)
  Fully associative    1                                           # of blocks in the cache

                       Location method                             # of comparisons
  Direct mapped        Index                                       1
  Set associative      Index the set; compare the set's tags       Degree of associativity
  Fully associative    Compare all blocks' tags                    # of blocks

Q4: What happens on a write? Write-through: the information is written both to the block in the cache and to the block in the next lower level of the memory hierarchy. Write-through is always combined with a write buffer so that waits for writes to the lower-level memory can be eliminated (as long as the write buffer doesn't fill). Write-back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced, which requires a dirty bit to keep track of whether the block is clean or dirty. Pros and cons of each? Write-through: read misses don't result in writes (so it is simpler and cheaper). Write-back: repeated writes require only one write to the lower level.

Improving Cache Performance. 0. Reduce the time to hit in the cache: a smaller cache; a direct-mapped cache; smaller blocks; for writes, either no write allocate (no "hit" on the cache, just write to the write buffer) or write allocate (to avoid two cycles, first checking for a hit and then writing, pipeline the writes via a delayed write buffer to the cache). 1. Reduce the miss rate: a bigger cache; more flexible placement (increase associativity); larger blocks (16 to 64 bytes typical); a victim cache (a small buffer holding the most recently discarded blocks).

Improving Cache Performance (continued). 2. Reduce the miss penalty: smaller blocks; use a write buffer to hold dirty blocks being replaced, so reads don't have to wait for the write to complete; check the write buffer (and/or victim cache) on a read miss, since it may get lucky; for large blocks, fetch the critical word first; use multiple cache levels (the L2 cache is not tied to the CPU clock rate); a faster backing store/improved memory bandwidth (wider buses; memory interleaving, page-mode DRAMs).

Summary: The Cache Design Space. There are several interacting dimensions: cache size, block size, associativity, replacement policy, write-through vs. write-back, and write allocation. [Diagram: the design space sketched along the cache size, associativity, and block size axes, with a qualitative good/bad tradeoff curve between two factors.] The optimal choice is a compromise: it depends on the access characteristics of the workload and the use (I-cache, D-cache, TLB), and it depends on technology and cost. Simplicity often wins.