Lecture 16: Cache Memories


Lecture 16: Cache Memories

Last Time
- AMAT: average memory access time
- Basic cache organization

Today
- Take QUIZ 12 over P&H 5.7-5.10 before 11:59pm today
- Read 5.4, 5.6 for 3/25
- Homework 6 due Thursday, March 25, 2010
- Hardware cache organization
- Reads versus writes
- Cache optimization

UTCS 352, Lecture 16

Cache Memory Theory

A small fast memory + a big slow memory looks like a big, fast memory.

[Figure: CPU connected to a small fast cache C in front of a big slow memory M]

The Memory Hierarchy

Level            Latency             Bandwidth          Managed by   Capacity
Registers        1 cycle             3-10 words/cycle   compiler     < 1KB
Level 1 Cache    1-3 cycles          1-2 words/cycle    hardware     32KB - 1MB
Level 2 Cache    5-10 cycles         1 word/cycle       hardware     1MB - 4MB
DRAM             30-100 cycles       0.5 words/cycle    OS           64MB - 4GB
Mechanical Disk  10^6 - 10^7 cycles  0.01 words/cycle   OS           4GB+
Tape             (below disk in the hierarchy)

Registers and the L1 cache sit on the CPU chip; the L2 cache, DRAM, disk, and tape are off-chip.

Direct Mapped

Each block is mapped to exactly one cache location:
cache location = (block address) MOD (# blocks in cache)

[Figure: 8-block cache; memory blocks 0-31 each map to cache block (address mod 8)]
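The modulo mapping above can be sketched in a few lines of Python (an illustrative sketch of my own, using the slide's 8-block cache):

```python
NUM_CACHE_BLOCKS = 8  # cache size from the slide's figure

def direct_mapped_location(block_address: int) -> int:
    """Cache location = (block address) mod (# blocks in cache)."""
    return block_address % NUM_CACHE_BLOCKS

# Memory blocks 4, 12, 20, 28 all compete for cache location 4 --
# this is exactly how direct-mapped caches get conflict misses.
locations = [direct_mapped_location(b) for b in (4, 12, 20, 28)]
```

Because the mapping is fixed, two blocks that share a location can evict each other repeatedly even when the rest of the cache is empty.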

Fully Associative

Each block can be mapped to any cache location.

[Figure: 8-block cache; memory blocks 0-31 may each occupy any of the 8 cache blocks]

Set Associative

Each block is mapped to a subset of cache locations:
set selection = (block address) MOD (# sets in cache)

2-way set associative = 2 blocks per set. This example: 4 sets.

[Figure: 8-block cache organized as 4 sets of 2; memory blocks 0-31 map to set (address mod 4)]

How do we use a memory address to find a block in the cache?

How Do We Find a Block in The Cache?

Our example:
- Main memory address space = 32 bits (= 4 GBytes)
- Block size = 4 words = 16 bytes
- Cache capacity = 8 blocks = 128 bytes

32-bit address = block address (tag + index, 28 bits) + block offset (4 bits)
- index: which set
- tag: which data/instruction in the block
- block offset: which word in the block

The split between tag and index bits is determined by the associativity; the tag/index bits can come from anywhere in the block address.
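The field widths above can be checked with a small helper (my own sketch, assuming the direct-mapped case of this example: 4 offset bits and, with 8 blocks, 3 index bits):

```python
OFFSET_BITS = 4   # 16-byte blocks -> 4 block-offset bits
INDEX_BITS = 3    # 8 blocks, direct mapped -> 3 index bits

def split_address(addr: int) -> tuple:
    """Split a 32-bit byte address into (tag, index, block offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x1234)
```

For a more associative cache the index field shrinks and the freed bits move into the tag, which is the point the slide makes about the tag/index split.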

Finding a Block: Direct-Mapped

With cache capacity = 8 blocks: 3-bit index, 25-bit tag. The index selects one of the 8 entries; a single comparator checks the stored tag against the address tag to signal a hit and deliver the data.

[Figure: address split into 25-bit tag and 3-bit index; one tag comparison produces Hit and selects Data]

Finding A Block: 2-Way Set-Associative

S sets, A elements in each set: A-way associative.
S = 4, A = 2: a 2-way associative, 8-entry cache (4 sets, 2 elements per set). The 2-bit index selects a set; two comparators check the 26-bit tags of both elements in parallel.

[Figure: address split into 26-bit tag and 2-bit index; two tag comparators produce Hit and select Data]

Finding A Block: Fully Associative

No index bits: the entire 28-bit block address is the tag, and eight comparators check all entries in parallel.

[Figure: 28-bit tag compared against every entry; any match produces Hit and selects Data]

Set Associative Cache - cont'd

- All of main memory is divided into S sets
- All addresses in set N map to the same set of the cache: Addr = N mod S
- A locations available in each set
- Shares costly comparators across sets
- Low address bits select the set (2 bits in the example)
- High address bits are the tag, used to associatively search the selected set

Extreme cases:
- A = 1: direct mapped cache
- S = 1: fully associative

A need not be a power of 2.
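The general rule, including both extreme cases, fits in one function (an illustrative sketch; the defaults match the slide's 8-entry, 2-way example):

```python
def select_set(block_address: int, num_blocks: int = 8, assoc: int = 2) -> int:
    """Set selection = (block address) mod S, where S = num_blocks / assoc."""
    num_sets = num_blocks // assoc
    return block_address % num_sets

# Extreme cases from the slide:
direct_mapped = select_set(13, num_blocks=8, assoc=1)  # A=1: 8 sets, one block each
fully_assoc = select_set(13, num_blocks=8, assoc=8)    # S=1: every block maps to set 0
```

With A = 1 the function degenerates to the direct-mapped formula, and with S = 1 the "set selection" is trivial and the whole lookup is the associative tag search.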

Cache Organization

[Figure: cache array with valid bits, tags, and data blocks]

- Where does a block get placed? - DONE
- How do we find it? - DONE
- Which one do we replace when a new one is brought in?
- What happens on a write?

Which Block Should Be Replaced on a Miss?

Direct mapped: the choice is easy - there is only one option.

Associative:
- Randomly select a block in the set to replace
- Least-Recently Used (LRU)

Implementing LRU:
- 2-way set-associative: easy (one bit per set)
- >2-way set-associative: harder
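One way to sketch LRU for a single set (my own illustration, using an ordered dict as the recency list rather than the hardware bit tricks the slide alludes to):

```python
from collections import OrderedDict

class LRUSet:
    """One A-way cache set with LRU replacement (illustrative sketch)."""

    def __init__(self, ways: int):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag) -> bool:
        """Return True on a hit; on a miss, insert tag, evicting the LRU block."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)      # mark most recently used
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)   # evict least recently used
        self.blocks[tag] = None
        return False

s = LRUSet(ways=2)
hits = [s.access(t) for t in ("A", "B", "A", "C")]  # C evicts B, since A was just used
```

Real hardware cannot keep a full recency list cheaply, which is why the slide distinguishes the 2-way case (a single bit per set suffices) from higher associativities, where approximations such as pseudo-LRU are common.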

What Happens on a Store?

Need to keep the cache consistent with main memory. Reads are easy - no modifications. Writes are harder - when do we update main memory?

Write-Through
- On a cache write, always update main memory as well
- Use a write buffer to stockpile writes to main memory for speed

Write-Back
- On a cache write, remember that the block is modified (dirty bit)
- Update main memory when the dirty block is replaced
- Sometimes need to flush the cache (I/O, multiprocessing)
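The traffic difference between the two policies can be shown with a toy count (my own sketch; it assumes all stores hit a single cached block that is eventually evicted once):

```python
def main_memory_writes(policy: str, stores: int) -> int:
    """Main-memory writes caused by `stores` stores to one cached block."""
    if policy == "write-through":
        return stores                   # every store also updates memory
    if policy == "write-back":
        return 1 if stores > 0 else 0   # dirty block written back once, on eviction
    raise ValueError(f"unknown policy: {policy}")
```

Ten stores to the same block cost ten memory writes under write-through but only one write-back on eviction, which is why write-back trades a dirty bit and flush complexity for less memory traffic.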

BUT: What if the Store Causes a Miss?

Write-Allocate
- Bring the written block into the cache
- Update the word in the block
- Anticipates further use of the block

No-Write-Allocate
- Main memory is updated
- Cache contents unmodified

Improving cache performance

How Do We Improve Cache Performance?

AMAT = t_hit + p_miss * penalty_miss

- Reduce hit time
- Reduce miss rate
- Reduce miss penalty
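As a quick numeric check of the formula (the numbers are illustrative, not from the lecture):

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """AMAT = t_hit + p_miss * penalty_miss, all in cycles."""
    return hit_time + miss_rate * miss_penalty

# 1-cycle hit, 5% miss rate, 100-cycle miss penalty: 1 + 0.05 * 100 = 6 cycles.
example = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)
```

Each of the three bullets attacks one term of the formula, which is how the remaining slides are organized.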

Questions to think about

As the block size goes up:
- what happens to the miss rate?
- what happens to the miss penalty?
- what happens to the hit time?

As the associativity goes up:
- what happens to the miss rate?
- what happens to the hit time?

Reducing Miss Rate: Increase Associativity

Reduces conflict misses.

Rules of thumb:
- 8-way is roughly as good as fully associative
- Direct mapped of size N is roughly as good as 2-way set associative of size N/2

But!
- A size-N associative cache is larger than a size-N direct-mapped cache
- Associative is typically slower than direct mapped (t_hit is larger)

Reducing Hit Time

Make caches small and simple:
- Hit time = 1 cycle is good (3.3 ns!)
- L1: low associativity, relatively small
- Even L2 caches can be broken into sub-banks; this can be exploited for a faster hit time in L2

Reducing Miss Rate: Increase Block Size

Fetch more data with each cache miss: 16 bytes -> 64, 128, 256 bytes! Works because of (spatial) locality.

[Figure: miss rate (0-25%) versus block size (16-256 bytes) for cache capacities 1K, 4K, 16K, 64K, 256K]

Reduce Miss Penalty: Transfer Time

Should we transfer the whole block at once?

Wider path to memory:
- Transfer more bytes/cycle
- Reduces the total time to transfer a block
- Limited by wires

Two ways to do this:
- Wider path to each memory
- Separate paths to multiple memories (multiple memory banks)

Block size and transfer unit are not necessarily equal!

Reduce Miss Penalty: Deliver Critical Word First

Only one word from the block is needed immediately: LW R3,8(R5)

Don't write the entire block into the cache first - fetch word 2 first and deliver it to the CPU.

Fetch order: 2 3 0 1 (for a block whose words are laid out 0 1 2 3)

Reduce Miss Penalty: More Cache Levels

AverageAccessTime = HitTime_L1 + MissRate_L1 * MissPenalty_L1
MissPenalty_L1    = HitTime_L2 + MissRate_L2 * MissPenalty_L2
etc.

Size/associativity of higher-level caches? L1 -> L2 -> L3
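The recursion above can be evaluated from the innermost level outward (my own helper; the numbers in the example are made up):

```python
def multilevel_amat(levels, memory_time: float) -> float:
    """levels: [(hit_time, local_miss_rate), ...] listed from L1 outward.
    The miss penalty at each level is the access time of the next level."""
    t = memory_time
    for hit_time, miss_rate in reversed(levels):
        t = hit_time + miss_rate * t
    return t

# L1: 1-cycle hit, 10% miss; L2: 10-cycle hit, 50% local miss; memory: 100 cycles.
# L1 miss penalty = 10 + 0.5*100 = 60, so AMAT = 1 + 0.1*60 = 7 cycles.
example = multilevel_amat([(1, 0.10), (10, 0.50)], memory_time=100)
```

Adding an L2 here cuts the effective L1 miss penalty from 100 cycles to 60, which is the whole argument for deeper hierarchies.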

Reduce Miss Penalty: Read Misses First

Let reads pass writes in the write buffer:

SW 512(R0),R3
LW R1,1024(R0)
LW R2,512(R0)

The load from 1024 need not wait for the store to drain, but the load from 512 must still see the value sitting in the write buffer.

[Figure: CPU with tag/data arrays and a write buffer in front of main memory]

Reduce Miss Penalty: Lockup-Free (Nonblocking) Cache

Let the cache continue to function while a miss is being serviced:

LW R1,1024(R0)   MISS
LW R2,512(R0)    can still be serviced

[Figure: CPU with tag/data arrays, write buffer, and outstanding-miss entries in front of main memory]

Reducing Miss Rate: Prefetching

Fetching data that you will probably need.

Instructions (Alpha 21064): on a cache miss,
- fetches the requested block into the instruction stream buffer
- fetches the next sequential block into the cache

Data:
- Automatically fetch data into the cache (spatial locality). Issues?
- Compiler-controlled prefetching: inserts prefetch instructions to fetch data for later use (into registers or the cache)

Reducing Miss Rate: Use a Victim Cache

Small cache (< 8 fully associative entries) [Jouppi 1990]
- Put evicted lines in the victim cache FIRST
- Search both the L1 and the victim cache, accessed in parallel with the main cache
- Captures conflict misses

[Figure: CPU with direct-mapped L1 and small fully associative victim cache, each with its own tag comparison]

VC: Victim Cache Example

Given a direct-mapped L1 of 4 entries and a fully associative 1-entry VC.

Address access sequence: 8, 9, 10, 11, 8, 12, 9, 10, 11, 12, 8

The first access to 12 misses, puts 8 in the VC, and puts 12 in the L1; the third access to 8 hits in the VC.

Snapshots (L1 contents / VC contents):
- A (after the second access to 8): L1 = 8 9 10 11, VC = empty
- B (after the first access to 12): L1 = 12 9 10 11, VC = 8
- C (after the final access to 8): L1 = 12 9 10 11, VC = 8
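The walk-through can be replayed with a small simulator (my own sketch; for simplicity it does not model swapping the line back into L1 on a victim-cache hit):

```python
def simulate(accesses, l1_entries=4, vc_entries=1):
    """Direct-mapped L1 plus tiny fully associative victim cache.
    Returns a hit/miss flag per access."""
    l1 = [None] * l1_entries
    vc = []            # FIFO list of evicted blocks
    hits = []
    for block in accesses:
        idx = block % l1_entries
        if l1[idx] == block or block in vc:
            hits.append(True)
            continue
        hits.append(False)
        if l1[idx] is not None:        # the evicted line goes to the VC first
            vc.append(l1[idx])
            if len(vc) > vc_entries:
                vc.pop(0)
        l1[idx] = block
    return hits

hits = simulate([8, 9, 10, 11, 8, 12, 9, 10, 11, 12, 8])
```

Without the victim cache, the final access to 8 would be a conflict miss (8 and 12 both map to L1 entry 0); the VC turns it into a hit, which is exactly the class of miss it is meant to capture.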

Summary

Recap:
- Using a memory address to find a location in the cache
- Deciding what to evict from the cache
- Improving cache performance

Next Time:
- Homework 6 is due March 25, 2010
- Reading: P&H 5.4, 5.6
- Virtual memory
- TLBs