Cache Memory and Performance

Similar documents
The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy

361 Computer Architecture Lecture 14: Cache Memory

Memory Hierarchy. Arquitectura de Computadoras. Centro de Investigación n y de Estudios Avanzados del IPN. adiaz@cinvestav.mx. MemoryHierarchy- 1

Random-Access Memory (RAM) The Memory Hierarchy. SRAM vs DRAM Summary. Conventional DRAM Organization. Page 1

Slide Set 8. for ENCM 369 Winter 2015 Lecture Section 01. Steve Norman, PhD, PEng

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Logical Operations. Control Unit. Contents. Arithmetic Operations. Objectives. The Central Processing Unit: Arithmetic / Logic Unit.

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

1. Memory technology & Hierarchy

Computers. Hardware. The Central Processing Unit (CPU) CMPT 125: Lecture 1: Understanding the Computer

CSCA0102 IT & Business Applications. Foundation in Business Information Technology School of Engineering & Computing Sciences FTMS College Global

Memory ICS 233. Computer Architecture and Assembly Language Prof. Muhamed Mudawar

Computer Systems Structure Main Memory Organization

Lecture 9: Memory and Storage Technologies

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

Memory Basics. SRAM/DRAM Basics

Chapter 6. Inside the System Unit. What You Will Learn... Computers Are Your Future. What You Will Learn... Describing Hardware Performance

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

Homework # 2. Solutions. 4.1 What are the differences among sequential access, direct access, and random access?

The Central Processing Unit:

Board Notes on Virtual Memory

Chapter 4 System Unit Components. Discovering Computers Your Interactive Guide to the Digital World

CS 6290 I/O and Storage. Milos Prvulovic

CPS104 Computer Organization and Programming Lecture 18: Input-Output. Robert Wagner

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

CHAPTER 7: The CPU and Memory

Communicating with devices

Discovering Computers Living in a Digital World

Price/performance Modern Memory Hierarchy

Lecture 17: Virtual Memory II. Goals of virtual memory

Outline. CS 245: Database System Principles. Notes 02: Hardware. Hardware DBMS Data Storage

Memory. The memory types currently in common usage are:

Operating Systems. Virtual Memory

Chapter 1 Computer System Overview

CS 3530 Operating Systems. L02 OS Intro Part 1 Dr. Ken Hoganson

The Big Picture. Cache Memory CSE Memory Hierarchy (1/3) Disk

Computer Architecture

RAM & ROM Based Digital Design. ECE 152A Winter 2012

Module 2. Embedded Processors and Memory. Version 2 EE IIT, Kharagpur 1

A+ Guide to Managing and Maintaining Your PC, 7e. Chapter 1 Introducing Hardware

2

Secondary Storage. Any modern computer system will incorporate (at least) two levels of storage: magnetic disk/optical devices/tape systems

Introduction Disks RAID Tertiary storage. Mass Storage. CMSC 412, University of Maryland. Guest lecturer: David Hovemeyer.

Computer Literacy. Hardware & Software Classification

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Putting it all together: Intel Nehalem.

An Introduction to Computer Science and Computer Organization Comp 150 Fall 2008

Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory

Lecture 2: Computer Hardware and Ports.

Computer Organization. and Instruction Execution. August 22

TECHNOLOGY BRIEF. Compaq RAID on a Chip Technology EXECUTIVE SUMMARY CONTENTS

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

The Classical Architecture. Storage 1 / 36

Data Storage - I: Memory Hierarchies & Disks

Parts of a Computer. Preparation. Objectives. Standards. Materials Micron Technology Foundation, Inc. All Rights Reserved

Computer Systems Structure Input/Output

1. Computer System Structure and Components

Quiz for Chapter 6 Storage and Other I/O Topics 3.10

Virtual Memory. How is it possible for each process to have contiguous addresses and so many of them? A System Using Virtual Addressing

Outline: Operating Systems

CHAPTER 2: HARDWARE BASICS: INSIDE THE BOX

Machine Architecture and Number Systems. Major Computer Components. Schematic Diagram of a Computer. The CPU. The Bus. Main Memory.

Memory Management CS 217. Two programs can t control all of memory simultaneously

We r e going to play Final (exam) Jeopardy! "Answers:" "Questions:" - 1 -

1 / 25. CS 137: File Systems. Persistent Solid-State Storage

Introducción. Diseño de sistemas digitales.1

Using Synology SSD Technology to Enhance System Performance Synology Inc.

NAND Flash FAQ. Eureka Technology. apn5_87. NAND Flash FAQ

With respect to the way of data access we can classify memories as:

CS/COE1541: Introduction to Computer Architecture. Memory hierarchy. Sangyeun Cho. Computer Science Department University of Pittsburgh

Introduction to Microprocessors

Big Picture. IC220 Set #11: Storage and I/O I/O. Outline. Important but neglected

Obj: Sec 1.0, to describe the relationship between hardware and software HW: Read p.2 9. Do Now: Name 3 parts of the computer.

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-17: Memory organisation, and types of memory

Chapter Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig I/O devices can be characterized by. I/O bus connections

Computer Architecture

Solid State Drive Architecture

COMPUTER SCIENCE AND ENGINEERING - Microprocessor Systems - Mitchell Aaron Thornton

Candidates should be able to: (i) describe the purpose of RAM in a computer system

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

Concept of Cache in web proxies

Chapter 2: Computer-System Structures. Computer System Operation Storage Structure Storage Hierarchy Hardware Protection General System Architecture

Fastboot Techniques for x86 Architectures. Marcus Bortel Field Application Engineer QNX Software Systems

CS 61C: Great Ideas in Computer Architecture Virtual Memory Cont.

What is a bus? A Bus is: Advantages of Buses. Disadvantage of Buses. Master versus Slave. The General Organization of a Bus

Main Memory & Backing Store. Main memory backing storage devices

Memory Systems. Static Random Access Memory (SRAM) Cell

Management Challenge. Managing Hardware Assets. Central Processing Unit. What is a Computer System?

Introduction To Computers: Hardware and Software

Memory management basics (1) Requirements (1) Objectives. Operating Systems Part of E1.9 - Principles of Computers and Software Engineering

Input / Ouput devices. I/O Chapter 8. Goals & Constraints. Measures of Performance. Anatomy of a Disk Drive. Introduction - 8.1

Indexing on Solid State Drives based on Flash Memory

Algorithms and Methods for Distributed Storage Networks 3. Solid State Disks Christian Schindelhauer

System Architecture. CS143: Disks and Files. Magnetic disk vs SSD. Structure of a Platter CPU. Disk Controller...

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

COS 318: Operating Systems. Virtual Memory and Address Translation

Chapter 13. PIC Family Microcontroller

Transcription:

Cache Memory and Performance Memory Hierarchy 1 Many of the following slides are taken with permission from Complete Powerpoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP) Randal E. Bryant and David R. O'Hallaron http://csapp.cs.cmu.edu/public/lectures.html The book is used explicitly in CS 2505 and CS 3214 and as a reference in CS 2506.

An Example Memory Hierarchy Memory Hierarchy 2 L0: Registers CPU registers hold words retrieved from L1 cache Smaller, faster, costlier per byte Larger, slower, cheaper per byte L4: L3: L2: L1: L1 cache (SRAM) L2 cache (SRAM) Main memory (DRAM) Local secondary storage (local disks) L1 cache holds cache lines retrieved from L2 cache L2 cache holds cache lines retrieved from main memory Main memory holds disk blocks retrieved from local disks Local disks hold files retrieved from disks on remote network servers L5: Remote secondary storage (tapes, distributed file systems, Web servers)

Random-Access Memory (RAM) Memory Hierarchy 3 Key features RAM is traditionally packaged as a chip. Basic storage unit is normally a cell(one bit per cell). Multiple RAM chips form a memory. Static RAM (SRAM) Each cell stores a bit with a four or six-transistor circuit. Retains value indefinitely, as long as it is kept powered. Relatively insensitive to electrical noise (EMI), radiation, etc. Faster and more expensive than DRAM. Dynamic RAM (DRAM) Each cell stores bit with a capacitor. One transistor is used for access Value must be refreshed every 10-100 ms. More sensitive to disturbances (EMI, radiation, ) than SRAM. Slower and cheaper than SRAM.

SRAM vs DRAM Summary Memory Hierarchy 4 Trans. Access Needs Needs per bit time refresh? EDC? Cost Applications SRAM 4 or 6 1X No Maybe 100x Cache memories DRAM 1 10X Yes Yes 1X Main memories, frame buffers

Traditional CPU-Memory Bus Structure Memory Hierarchy 5 A bus is a collection of parallel wires that carry address, data, and control signals. Buses are typically shared by multiple devices. CPU chip Register file ALU System bus Memory bus Bus interface I/O bridge Main memory

Memory Read Transaction (1) Memory Hierarchy 6 CPU places address A on the memory bus. Register file Load operation:movl A, %eax %eax ALU Bus interface I/O bridge A Main memory 0 x A

Memory Read Transaction (2) Memory Hierarchy 7 Main memory reads A from the memory bus, retrieves word x, and places it on the bus. Register file Load operation:movl A, %eax %eax ALU I/O bridge Main memory x 0 Bus interface x A

Memory Read Transaction (3) Memory Hierarchy 8 CPU read word x from the bus and copies it into register %eax. Register file Load operation:movl A, %eax %eax x ALU I/O bridge Main memory 0 Bus interface x A

Memory Write Transaction (1) Memory Hierarchy 9 CPU places address A on bus. Main memory reads it and waits for the corresponding data word to arrive. Register file Store operation:movl %eax, A %eax y ALU Bus interface I/O bridge A Main memory 0 A

Memory Write Transaction (2) Memory Hierarchy 10 CPU places data word y on the bus. Register file Store operation:movl %eax, A %eax y ALU Bus interface I/O bridge y Main memory 0 A

Memory Write Transaction (3) Memory Hierarchy 11 Main memory reads data word y from the bus and stores it at address A. register file Store operation:movl %eax, A %eax y ALU I/O bridge main memory 0 bus interface y A

The Bigger Picture: I/O Bus Memory Hierarchy 12 CPU chip Register file ALU System bus Memory bus Bus interface I/O bridge Main memory USB controller Graphics adapter I/O bus Disk controller Expansion slots for other devices such as network adapters. Mouse Keyboard Monitor Disk

Storage Trends Memory Hierarchy 13 SRAM Metric 1980 1985 1990 1995 2000 2005 2010 2010:1980 $/MB 19,200 2,900 320 256 100 75 60 320 access (ns) 300 150 35 15 3 2 1.5 200 DRAM Metric 1980 1985 1990 1995 2000 2005 2010 2010:1980 $/MB 8,000 880 100 30 1 0.1 0.06 130,000 access (ns) 375 200 100 70 60 50 40 9 typical size (MB) 0.064 0.256 4 16 64 2,000 8,000 125,000 Disk Metric 1980 1985 1990 1995 2000 2005 2010 2010:1980 $/MB 500 100 8 0.30 0.01 0.005 0.0003 1,600,000 access (ms) 87 75 28 10 8 4 3 29 typical size (MB) 1 10 160 1,000 20,000 160,000 1,500,000 1,500,000

The CPU-Memory Gap Memory Hierarchy 14 The gap between DRAM, disk, and CPU speeds. 100,000,000.0 10,000,000.0 1,000,000.0 100,000.0 Disk SSD ns 10,000.0 1,000.0 100.0 DRAM Disk seek time Flash SSD access time DRAM access time SRAM access time CPU cycle time Effective CPU cycle time 10.0 1.0 0.1 CPU 0.0 1980 1985 1990 1995 2000 2003 2005 2010 Year

Locality Memory Hierarchy 15 Principle of Locality:Programs tend to use data and instructions with addresses near or equal to those they have used recently Temporal locality: Recently referenced items are likely to be referenced again in the near future Spatial locality: Items with nearby addresses tend to be referenced close together in time

Locality Example Memory Hierarchy 16 sum = 0; for (i = 0; i < n; i++) sum += a[i]; return sum; Data references Reference array elements in succession (stride-1 reference pattern). Reference variable sum each iteration. Instruction references Reference instructions in sequence. Cycle through loop repeatedly. Spatial locality Temporal locality Spatial locality Temporal locality

Taking Advantage of Locality Memory Hierarchy 17 Memory hierarchy Store everything on disk Copy recently accessed (and nearby) items from disk to smaller DRAM memory Main memory Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory Cache memory attached to CPU

An Example Memory Hierarchy Memory Hierarchy 18 L0: Registers CPU registers hold words retrieved from L1 cache Smaller, faster, costlier per byte Larger, slower, cheaper per byte L4: L3: L2: L1: L1 cache (SRAM) L2 cache (SRAM) Main memory (DRAM) Local secondary storage (local disks) L1 cache holds cache lines retrieved from L2 cache L2 cache holds cache lines retrieved from main memory Main memory holds disk blocks retrieved from local disks Local disks hold files retrieved from disks on remote network servers L5: Remote secondary storage (tapes, distributed file systems, Web servers)

Caches Memory Hierarchy 19 Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device. Fundamental idea of a memory hierarchy: For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1. Why do memory hierarchies work? Because of locality, programs tend to access the data at level k more often than they access the data at level k+1. Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit. Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.

General Cache Concepts Memory Hierarchy 20 Cache 84 9 14 10 3 Smaller, faster, more expensive memory caches a subset of the blocks 10 4 Data is copied in block-sized transfer units Memory 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Larger, slower, cheaper memory viewed as partitioned into blocks

General Cache Concepts: Hit Memory Hierarchy 21 Request: 14 Data in block b is needed Cache 8 9 14 3 Block b is in cache: Hit! Memory 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

General Cache Concepts: Miss Memory Hierarchy 22 Request: 12 Data in block b is needed Cache 8 12 9 14 3 Block b is not in cache: Miss! 12 Request: 12 Block b is fetched from memory Memory 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Block b is stored in cache Placement policy: determines where b goes Replacement policy: determines which block gets evicted (victim)

Types of Cache Misses Memory Hierarchy 23 Cold (compulsory) miss Cold misses occur because the cache is empty. Conflict miss Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k. E.g. Block iat level k+1 must be placed in block (imod 4) at level k. Conflict misses occur when the level kcache is large enough, but multiple data objects all map to the same level kblock. E.g. Referencing blocks 0, 8, 0, 8, 0, 8,... would miss every time. Capacity miss Occurs when the set of active cache blocks (working set) is larger than the cache.

Examples of Caching in the Hierarchy Memory Hierarchy 24 Cache Type What is Cached? Where is it Cached? Latency (cycles) Managed By Registers 4-8 bytes words CPU core 0 Compiler TLB Address translations On-Chip TLB 0 Hardware L1 cache 64-bytes block On-Chip L1 1 Hardware L2 cache 64-bytes block On/Off-Chip L2 10 Hardware Virtual Memory 4-KB page Main memory 100 Hardware + OS Buffer cache Parts of files Main memory 100 OS Disk cache Disk sectors Disk controller 100,000 Disk firmware Network buffer cache Parts of files Local disk 10,000,000 AFS/NFS client Browser cache Web pages Local disk 10,000,000 Web browser Web cache Web pages Remote server disks 1,000,000,000 Web proxy server

Cache Memories Memory Hierarchy 25 Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory. Typical system structure: CPU chip Register file Cache memories ALU System bus Memory bus Bus interface I/O bridge Main memory

Cache Performance Metrics Memory Hierarchy 26 Miss Rate Hit Time Fraction of memory references not found in cache (misses / accesses) = 1 hit rate Typical numbers (in percentages): 3-10% for L1 can be quite small (e.g., < 1%) for L2, depending on size, etc. Time to deliver a line in the cache to the processor includes time to determine whether the line is in the cache Typical numbers: Miss Penalty 1-2 clock cycle for L1 5-20 clock cycles for L2 Additional time required because of a miss typically 50-200 cycles for main memory (Trend: increasing!)

Lets think about those numbers Memory Hierarchy 27 Huge difference between a hit and a miss Could be 100x, if just L1 and main memory Would you believe 99% hits is twice as good as 97%? Consider: cache hit time of 1 cycle miss penalty of 100 cycles Average access time: 97% hits: 1 cycle + 0.03 * 100 cycles =4 cycles 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles This is why miss rate is used instead of hit rate

Measuring Cache Performance Memory Hierarchy 28 Components of CPU time Program execution cycles Includes cache hit time Memory stall cycles Mainly from cache misses With simplifying assumptions: Memory stallcycles Memory accesses = Miss rate Miss penalty Program = Instructions Program Misses Instruction Miss penalty

Cache Performance Example Given I-cache miss rate = 2% D-cache miss rate = 4% Miss penalty = 100 cycles Base CPI (ideal cache) = 2 Load & stores are 36% of instructions Memory Hierarchy 29 Miss cycles per instruction I-cache: 0.02 100 = 2 D-cache: 0.36 0.04 100 = 1.44 Actual CPI = 2 + 2 + 1.44 = 5.44 Ideal CPU is 5.44/2 =2.72 times faster

Average Access Time Memory Hierarchy 30 Hit time is also important for performance Average memory access time (AMAT) AMAT = Hit time + Miss rate Miss penalty Example CPU with 1ns clock, hit time = 1 cycle, miss penalty = 20 cycles, I-cache miss rate = 5% AMAT = 1 + 0.05 20 = 2ns 2 cycles per instruction

Performance Summary Memory Hierarchy 31 When CPU performance increased Miss penalty becomes more significant Decreasing base CPI Greater proportion of time spent on memory stalls Increasing clock rate Memory stalls account for more CPU cycles Can t neglect cache behavior when evaluating system performance

Multilevel Caches Memory Hierarchy 32 Primary cache attached to CPU Small, but fast Level-2 cache services misses from primary cache Larger, slower, but still faster than main memory Main memory services L-2 cache misses Some high-end systems include L-3 cache

Multilevel Cache Example Memory Hierarchy 33 Given CPU base CPI = 1, clock rate = 4GHz Miss rate/instruction = 2% Main memory access time = 100ns With just primary cache Miss penalty = 100ns/0.25ns = 400 cycles Effective CPI = 1 + 0.02 400 = 9

Example (cont.) Memory Hierarchy 34 Now add L-2 cache Access time = 5ns Global miss rate to main memory = 0.5% Primary miss with L-2 hit Penalty = 5ns/0.25ns = 20 cycles Primary miss with L-2 miss Extra penalty = 400 cycles CPI = 1 + 0.02 20 + 0.005 400 = 3.4 Performance ratio = 9/3.4 = 2.6

Multilevel Cache Considerations Memory Hierarchy 35 Primary cache Focus on minimal hit time L-2 cache Focus on low miss rate to avoid main memory access Hit time has less overall impact Results L-1 cache usually smaller than a single cache L-1 block size smaller than L-2 block size