The Quest for Speed - Memory: Cache Memory




The Quest for Speed - Memory

Cache Memory
CSE 4, Spring 25, Computer Systems
http://www.cs.washington.edu/4

If all memory accesses (IF/lw/sw) went to main memory, programs would run 20 times slower. And it's getting worse:
» processors speed up by 50% annually
» memory accesses speed up by 9% annually
» it's becoming harder and harder to keep these processors fed

A Solution: Memory Hierarchy

Keep copies of the active data in the small, fast, expensive storage. Keep all data in the big, slow, cheap storage.

Memory Hierarchy

Memory Level   Fabrication Tech   Access Time (ns)   Typ. Size (bytes)   $/MB
Registers      Registers          < 0.5              256
L1 Cache       SRAM               1                  64K
L2 Cache       SRAM                                  1M
Memory         DRAM                                  512M
Disk           Magnetic Disk                                             0.35

What is a Cache?

A cache allows fast access to a subset of a larger data store.
Your web browser's cache gives you fast access to pages you visited recently
» faster because the pages are stored locally
» a subset because the whole web won't fit on your disk
The memory cache gives the processor fast access to memory that it used recently
» faster because it's fancy and usually located on the CPU chip
» a subset because the cache is smaller than main memory

Memory Hierarchy

CPU registers -> L1 cache -> L2 cache -> main memory

Locality of Reference

Temporal locality - nearness in time
» data being accessed now will probably be accessed again soon
» useful data tends to continue to be useful
Spatial locality - nearness in address
» data near the data being accessed now will probably be needed soon
» useful data is often accessed sequentially
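The two kinds of locality can be seen in how loop nests walk a 2-D array. The sketch below is ours, not from the slides; Python lists don't actually store elements contiguously, so take it only as an illustration of access *order* (in C or NumPy, where arrays are contiguous, the row-major loop really does touch consecutive addresses).

```python
def make_matrix(rows, cols):
    """Build a rows x cols matrix, stored row by row (row-major)."""
    return [[r * cols + c for c in range(cols)] for r in range(rows)]

def sum_row_major(m):
    """Good spatial locality: the inner loop visits consecutive elements,
    so each cache block loaded brings in the next several values we need."""
    total = 0
    for row in m:
        for x in row:
            total += x
    return total

def sum_col_major(m):
    """Poor spatial locality: the inner loop strides a whole row's worth
    of bytes between accesses, touching a different block each time."""
    total = 0
    for c in range(len(m[0])):
        for r in range(len(m)):
            total += m[r][c]
    return total

m = make_matrix(4, 4)
assert sum_row_major(m) == sum_col_major(m)  # same answer, different access pattern
```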

Memory Access Patterns

Memory accesses don't usually look like random accesses. They usually look like this:
» hot variables
» stepping through arrays

Cache Terminology

Hit and Miss
» the data item is in the cache, or the data item is not in the cache
Hit rate and Miss rate
» the percentage of references for which the data item is in the cache, or not in the cache
Hit time and Miss time
» the time required to access data in the cache (the cache access time), and the time required to access data not in the cache (the memory access time)

Effective Access Time

t_effective = h * t_cache + (1 - h) * t_memory

where h is the cache hit rate, t_cache is the cache access time, (1 - h) is the cache miss rate, and t_memory is the memory access time. Also known as the Average Memory Access Time (AMAT).

Cache Contents

When do we put something in the cache?
» when it is used for the first time
When do we overwrite something in the cache?
» when we need the space in the cache for some other entry
» all of memory won't fit on the CPU chip, so not every location in memory can be cached
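The effective access time formula can be sketched directly in code. The hit rate and timings below are made-up illustrative numbers, not figures from the notes:

```python
def effective_access_time(hit_rate, t_cache_ns, t_memory_ns):
    """AMAT per the formula in the notes: h * t_cache + (1 - h) * t_memory."""
    return hit_rate * t_cache_ns + (1.0 - hit_rate) * t_memory_ns

# With a 1 ns cache, 100 ns memory, and a 95% hit rate:
amat = effective_access_time(0.95, 1.0, 100.0)
# 0.95 * 1 + 0.05 * 100 = 5.95 ns on average
assert abs(amat - 5.95) < 1e-9
```

Note how even a 5% miss rate dominates the average: most of the 5.95 ns comes from the rare misses, which is why small improvements in hit rate matter so much.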

A small two-level hierarchy

[Figure: an 8-word cache in front of a 32-word memory (128 bytes); each memory address holds a value, and each cache line stores a copy of one of them.]

Fully Associative Cache

In a fully associative cache
» any memory word can be placed in any cache line
» each cache line stores an address and a data value
» accesses are slow (but not as slow as you might think)

Direct Mapped Caches

Fully associative caches are often too slow. With direct mapped caches, the address of the item determines where in the cache to store it
» in our example, the lowest-order two bits are the byte offset within the word stored in the cache
» the next three bits of the address dictate the location of the entry within the cache
» the remaining higher-order bits record the rest of the original address as a tag for this entry

Address Tags

A tag is a label for a cache entry indicating where it came from
» the upper bits of the data item's address
A 7-bit address is divided as: Tag (2 bits) | Index (3 bits) | Byte Offset (2 bits)
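The 7-bit address split above can be sketched with shifts and masks. The helper name is ours, not from the notes:

```python
def split_address(addr):
    """Split a 7-bit address into (tag, index, byte_offset):
    Tag (2 bits) | Index (3 bits) | Byte Offset (2 bits)."""
    byte_offset = addr & 0b11      # lowest 2 bits: byte within the word
    index = (addr >> 2) & 0b111    # next 3 bits: which cache line
    tag = (addr >> 5) & 0b11       # remaining 2 high bits: the tag
    return tag, index, byte_offset

# Address 0b1101101 -> tag 0b11, index 0b011, byte offset 0b01
assert split_address(0b1101101) == (0b11, 0b011, 0b01)
```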

Direct Mapped Cache

[Figure: a direct-mapped cache; the three index bits of each memory address (000 through 111) select one of the eight cache lines, and each line stores a valid bit, a tag, and a value.]

Direct mapped caches cannot store more than one address with the same index. If two addresses collide, then you overwrite the older entry.

N-way Set Associative Caches

2-way associative caches can store two different addresses with the same index
» there are 3-way, 4-way and 8-way set associative designs too
Reduces misses due to conflicts. Larger sets imply slower accesses.

2-way Set Associative Cache

[Figure: a 2-way set associative cache; each index selects a set of two entries, each with its own tag, valid bit, and value. The highlighted cache entry contains values for two addresses that share an index but have different tags.]

Associativity Spectrum

Direct Mapped: fast to access, but conflict misses
N-way Associative: slower to access, fewer conflict misses
Fully Associative: slow to access, no conflict misses
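A conflict miss in a direct-mapped cache can be shown with a toy model. This is our own sketch of the scheme described above (8 lines, 7-bit addresses, 2-bit tag / 3-bit index / 2-bit offset), tracking only valid bits and tags, not data:

```python
NUM_LINES = 8

class DirectMappedCache:
    def __init__(self):
        # Each line holds (valid, tag); all lines start invalid.
        self.lines = [(False, 0)] * NUM_LINES

    def access(self, addr):
        """Return True on a hit; on a miss, install the line and return False."""
        index = (addr >> 2) & 0b111   # index bits pick exactly one line
        tag = addr >> 5
        valid, stored_tag = self.lines[index]
        if valid and stored_tag == tag:
            return True
        self.lines[index] = (True, tag)   # evict whatever was there before
        return False

cache = DirectMappedCache()
assert cache.access(0b0001100) is False   # cold miss
assert cache.access(0b0001100) is True    # same address: hit
assert cache.access(0b1001100) is False   # same index, different tag: conflict
assert cache.access(0b0001100) is False   # the original entry was evicted
```

The last two lines show the collision the slide describes: two addresses that share index bits keep evicting each other, which is exactly the case set associativity is meant to relieve.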

Spatial Locality

Using the cache improves performance by taking advantage of temporal locality
» when a word in memory is accessed, it is loaded into cache memory
» it is then available quickly if it is needed again soon
This does nothing for spatial locality.

Memory Blocks

Divide memory into blocks. If any word in a block is accessed, then load the entire block into the cache.
Block 0: 0x00 - 0x3F
Block 1: 0x40 - 0x7F
Block 2: 0x80 - 0xBF
A cache line for a 16-word block holds: tag | valid | w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 w11 w12 w13 w14 w15

Address Tags Revisited

A cache block size > 1 word requires the address to be divided differently. Instead of a byte offset into a word, we need a byte offset into the block. Assume we have 10-bit addresses, 8 cache lines, and 4 words (16 bytes) per cache line block:
Tag (3 bits) | Index (3 bits) | Block Offset (4 bits)

The Effects of Block Size

Big blocks are good
» fewer first-time misses
» exploits spatial locality
Small blocks are good
» don't evict as much data when bringing in a new entry
» more likely that all items in the block will turn out to be useful
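The revised split for 16-byte blocks can be sketched the same way as before. Again the helper name is ours; the field widths (3/3/4 over a 10-bit address) come from the notes:

```python
def split_block_address(addr):
    """Split a 10-bit address into (tag, index, block_offset):
    Tag (3 bits) | Index (3 bits) | Block Offset (4 bits)."""
    block_offset = addr & 0xF      # 4 bits: byte within the 16-byte block
    index = (addr >> 4) & 0b111    # 3 bits: which of the 8 cache lines
    tag = (addr >> 7) & 0b111      # remaining 3 high bits: the tag
    return tag, index, block_offset

# Two addresses in the same block share tag and index; only the offset differs,
# so the second access hits on the block loaded by the first.
a = split_block_address(0b0101100101)
b = split_block_address(0b0101101110)
assert a[:2] == b[:2] and a[2] != b[2]
```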

Reads vs. Writes

Caching is essentially making a copy of the data. When you read, the copies still match when you're done. When you write, the results must eventually propagate to both copies
» especially at the lowest level of the hierarchy, which is in some sense the permanent copy

Write-Through Caches

Write all updates to both cache and memory.
Advantages
» the cache and the memory are always consistent
» evicting a cache line is cheap because no data needs to be written out to memory at eviction
» easy to implement
Disadvantages
» runs at memory speeds when writing (a write buffer can reduce this problem)

Write-Back Caches

Write the update to the cache only. Write to memory only when the cache block is evicted.
Advantages
» runs at cache speed rather than memory speed
» some writes never go all the way to memory
» when a whole block is written back, a high-bandwidth transfer can be used
Disadvantages
» complexity is required to maintain consistency

Dirty Bit

When evicting a block from a write-back cache, we could
» always write the block back to memory, or
» write it back only if we changed it
Caches use a dirty bit to mark whether a line was changed
» the dirty bit is 0 when the block is loaded
» it is set to 1 if the block is modified
» when the line is evicted, it is written back only if the dirty bit is 1
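The dirty-bit protocol can be sketched with a single cache line over a dictionary standing in for memory. The class and the address/value here are our own illustration, not from the slides:

```python
memory = {0x40: 7}   # backing store: address -> value

class WriteBackLine:
    """One line of a write-back cache, tracking only a value and a dirty bit."""

    def __init__(self, addr):
        self.addr = addr
        self.value = memory[addr]   # loaded on first use
        self.dirty = False          # dirty bit is 0 when the block is loaded

    def write(self, value):
        self.value = value
        self.dirty = True           # set to 1 when the block is modified

    def evict(self):
        if self.dirty:              # write back only if the block changed
            memory[self.addr] = self.value

line = WriteBackLine(0x40)
line.write(99)
assert memory[0x40] == 7    # memory is stale: the write went to the cache only
line.evict()
assert memory[0x40] == 99   # written back at eviction because the dirty bit was set
```

A clean line (never written) skips the write-back entirely at eviction, which is where write-back saves memory traffic over write-through.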

i-cache and d-cache

There are usually two separate caches, one for instructions and one for data
» avoids structural hazards in pipelining
» the combined cache is twice as big but still has the access time of a small cache
» allows both caches to operate in parallel, for twice the bandwidth

Cache Line Replacement

How do you decide which cache block to replace? If the cache is direct-mapped, it's easy
» there is only one slot per index
Otherwise, common strategies are
» random
» least recently used (LRU)

LRU Implementations

LRU is very difficult to implement for high degrees of associativity. A 4-way approximation uses
» 1 bit to indicate the least recently used pair
» 1 bit per pair to indicate the least recently used item in that pair
We will see this again at the operating system level.

Multi-Level Caches

Use each level of the memory hierarchy as a cache over the next lowest level. Inserting level 2 between levels 1 and 3 allows
» level 1 to have a higher miss rate (so it can be smaller and cheaper)
» level 3 to have a larger access time (so it can be slower and cheaper)
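The 4-way pseudo-LRU approximation above (3 bits total: one bit choosing the less recently used pair, plus one bit per pair choosing the less recently used way within it) can be sketched as follows. The class is our own illustration of that scheme:

```python
class PseudoLRU4:
    """Tree-based pseudo-LRU for one 4-way set, using 3 bits of state."""

    def __init__(self):
        self.pair_bit = 0      # which pair (0 -> ways 0/1, 1 -> ways 2/3) is LRU
        self.way_bit = [0, 0]  # within each pair, which way is LRU

    def touch(self, way):
        """Record an access to `way` (0-3) as most recently used."""
        pair = way // 2
        self.pair_bit = 1 - pair            # the other pair becomes the LRU pair
        self.way_bit[pair] = 1 - (way % 2)  # the other way in this pair is LRU

    def victim(self):
        """Pick the (approximately) least recently used way to replace."""
        pair = self.pair_bit
        return pair * 2 + self.way_bit[pair]

lru = PseudoLRU4()
for w in (0, 1, 2, 3):   # access ways in order; way 0 is now the oldest
    lru.touch(w)
assert lru.victim() == 0
```

With only 3 bits instead of a full ordering of 4 ways, the answer is sometimes only approximately LRU, which is exactly the trade-off the slide describes for high associativity.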

Summary: Classifying Caches

Where can a block be placed?
» direct mapped, N-way set associative, or fully associative
How is a block found?
» direct mapped: by index
» set associative: by index and search
» fully associative: by search
What happens on a write access?
» write-back or write-through
Which block should be replaced?
» random
» LRU (least recently used)