Chapter Seven

Memories: Review
- SRAM: value is stored on a pair of inverting gates; very fast, but takes up more space than DRAM (4 to 6 transistors per bit)
- DRAM: value is stored as a charge on a capacitor (must be refreshed); very small, but slower than SRAM (by a factor of 5 to 10)

Exploiting Memory Hierarchy
- Users want large and fast memories!
  SRAM access times are 2 - 25 ns at cost of $100 to $250 per Mbyte.
  DRAM access times are 60 - 120 ns at cost of $5 to $10 per Mbyte.
  Disk access times are 10 to 20 million ns at cost of $.10 to $.20 per Mbyte.
  (1997 figures)
- Try and give it to them anyway: build a memory hierarchy

[Figure: levels in the memory hierarchy - the CPU sits at Level 1, with Level 2 down to Level n below it; distance from the CPU in access time increases downward, as does the size of the memory at each level]

Locality
- A principle that makes having a memory hierarchy a good idea
- If an item is referenced,
  temporal locality: it will tend to be referenced again soon
  spatial locality: nearby items will tend to be referenced soon
- Why does code have locality?
- Our initial focus: two levels (upper, lower)
  block: minimum unit of data
  hit: data requested is in the upper level
  miss: data requested is not in the upper level
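The two kinds of locality are easy to see in ordinary code. Below is a minimal C sketch (the array size N is an illustrative choice, not from the slides): both functions compute the same sum, but the row-major loop walks consecutive addresses and benefits from spatial locality, while the column-major loop jumps N doubles between accesses.

    #include <stdio.h>

    #define N 1024                 /* illustrative matrix size */

    static double a[N][N];

    /* Row-major traversal: consecutive iterations touch adjacent
     * addresses, so most accesses hit in the cache (spatial locality). */
    double sum_rows(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-major traversal: each access lands N*sizeof(double)
     * bytes past the previous one, defeating spatial locality. */
    double sum_cols(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%f %f\n", sum_rows(), sum_cols());
        return 0;
    }

The repeated use of s in both loops is temporal locality at work; the compiler keeps it in a register for the same reason the hardware keeps recently used data in the cache.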

Cache
- Two issues:
  How do we know if a data item is in the cache?
  If it is, how do we find it?
- Our first example:
  block size is one word of data
  "direct mapped"
- For each item of data at the lower level, there is exactly one location in the cache where it might be.
- e.g., lots of items at the lower level share locations in the upper level

Direct Mapped Cache
- Mapping: address is modulo the number of blocks in the cache

[Figure: a small direct-mapped cache and memory - each memory address maps to the cache block selected by its low-order address bits]
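As a one-line sketch of that mapping rule in C (NUM_BLOCKS is an assumed parameter; any power of two behaves the same way):

    #include <stdint.h>

    #define NUM_BLOCKS 8   /* assumed cache size in blocks */

    /* Direct mapping: a block address has exactly one possible
     * cache location, its address modulo the number of blocks. */
    unsigned cache_index(uint32_t block_addr) {
        return block_addr % NUM_BLOCKS;
    }

Because the block count is a power of two, the modulo is just the low-order bits of the block address, which is why hardware implements this mapping with wiring rather than a divider.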

Direct Mapped Cache
- For MIPS:

[Figure: direct-mapped cache datapath - the 32-bit address (bit positions shown) is split into a tag, an index that selects one of 1024 entries, and a 2-bit byte offset; each entry holds a valid bit, a tag, and 32 bits of data, and a comparator checks the stored tag against the address tag to generate Hit]

- What kind of locality are we taking advantage of?

Direct Mapped Cache
- Taking advantage of spatial locality:

[Figure: direct-mapped cache with multiword blocks - the address (bit positions shown) is split into tag, index (4K entries), block offset, and byte offset; each entry holds a valid bit, a tag, and a 128-bit block, and a multiplexor driven by the block offset selects one 32-bit word]
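A hedged sketch of how those address fields are carved up, assuming the one-word-block geometry of the first figure above (32-bit addresses, 1024 entries, 4-byte blocks, so 2 offset bits, 10 index bits, and a 20-bit tag; other geometries just change the constants):

    #include <stdint.h>

    /* Assumed geometry: 1024-entry cache with one-word blocks. */
    #define OFFSET_BITS 2
    #define INDEX_BITS  10

    /* Low 2 bits select the byte within the word. */
    uint32_t byte_off(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }

    /* Next 10 bits select the cache entry. */
    uint32_t index_of(uint32_t addr) { return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); }

    /* Remaining 20 bits are stored and compared as the tag. */
    uint32_t tag_of(uint32_t addr)   { return addr >> (OFFSET_BITS + INDEX_BITS); }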

Hits vs. Misses
- Read hits
  this is what we want!
- Read misses
  stall the CPU, fetch block from memory, deliver to cache, restart
- Write hits:
  can replace data in cache and memory (write-through)
  write the data only into the cache (write back to memory later)
- Write misses:
  read the entire block into the cache, then write the word

Hardware Issues
- Make reading multiple words easier by using banks of memory

[Figure: three memory organizations - a. one-word-wide memory organization; b. wide memory organization, with a multiplexor between the cache and a wide memory; c. interleaved memory organization, with four memory banks on one bus]

- It can get a lot more complicated...
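The two write-hit policies differ only in when the lower level is updated. A toy single-line cache in C makes the trade visible (the names write_through, dirty, and mem are illustrative, not from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    static uint32_t mem[1024];   /* lower level of the hierarchy */
    static uint32_t line;        /* the one cached value         */
    static uint32_t line_addr;   /* which address it holds       */
    static bool     dirty;

    void write_hit(uint32_t value, bool write_through) {
        line = value;                 /* always update the cache */
        if (write_through)
            mem[line_addr] = value;   /* write-through: update memory now */
        else
            dirty = true;             /* write-back: remember, write later */
    }

    void evict(void) {
        if (dirty)                    /* write-back pays its cost here */
            mem[line_addr] = line;
        dirty = false;
    }

Write-through keeps memory consistent on every store; write-back batches that cost into the eventual eviction.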

Performance
- Increasing the block size tends to decrease miss rate:

[Figure: miss rate (0% to 40%) vs. block size in bytes (4 to 256) for cache sizes from 1 KB to 256 KB; larger blocks lower the miss rate until, in the small caches, very large blocks drive it back up]

- Use split caches because there is more spatial locality in code:

  Program  Block size in words  Instruction miss rate  Data miss rate  Effective combined miss rate
  gcc      1                    6.1%                   2.1%            5.4%
  gcc      4                    2.0%                   1.7%            1.9%
  spice    1                    1.2%                   1.3%            1.2%
  spice    4                    0.3%                   0.6%            0.4%

Performance
- Simplified model:
  execution time = (execution cycles + stall cycles) × cycle time
  stall cycles = # of instructions × miss ratio × miss penalty
- Two ways of improving performance:
  decreasing the miss ratio
  decreasing the miss penalty
- What happens if we increase block size?
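Plugging assumed numbers into the simplified model (all five inputs below are made up for illustration; the slide gives no figures here):

    #include <stdio.h>

    /* execution time = (execution cycles + stall cycles) * cycle time
     * stall cycles   = instructions * miss ratio * miss penalty       */
    int main(void) {
        double instructions = 1e9;    /* assumed instruction count     */
        double cpi_base     = 1.0;    /* assumed CPI with no misses    */
        double miss_ratio   = 0.05;   /* assumed: 5% of accesses miss  */
        double miss_penalty = 100.0;  /* assumed: 100 cycles per miss  */
        double cycle_time   = 2e-9;   /* assumed: 2 ns clock           */

        double exec_cycles  = instructions * cpi_base;
        double stall_cycles = instructions * miss_ratio * miss_penalty;
        double time = (exec_cycles + stall_cycles) * cycle_time;

        printf("execution time = %.1f s\n", time);  /* (1e9 + 5e9) * 2 ns = 12 s */
        return 0;
    }

With these numbers the machine spends five cycles stalled for every useful cycle, which is why both knobs, miss ratio and miss penalty, matter.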

Set Associative Caches
- Basic Idea: a memory block can be mapped to more than one location in the cache
- Cache is divided into sets
- Each memory block is mapped to a particular set
- Each set can have more than one block
  Number of blocks in a set = associativity of the cache
- If a set has only one block, then it is a direct-mapped cache
  i.e., direct-mapped caches have a set associativity of 1
- Each memory block can be placed in any of the blocks of the set to which it maps

- Direct mapped cache: block N maps to block (N mod number of blocks in cache)
- Set associative cache: block N maps to set (N mod number of sets in cache)
- Example below shows placement of the block whose address is 12:

[Figure: placement of block 12 in an eight-block cache - direct mapped: only block # 4 (12 mod 8); two-way set associative: either block of set # 0 (12 mod 4), with a tag search within the set; fully associative: any block, with the tag searched in every entry]
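The two mapping rules in C, using the figure's example of an eight-block cache organized two-way set associative (four sets):

    #include <stdint.h>

    #define NUM_BLOCKS 8
    #define ASSOC      2                      /* blocks per set */
    #define NUM_SETS   (NUM_BLOCKS / ASSOC)

    /* Direct mapped: block N has exactly one slot. */
    unsigned dm_slot(uint32_t n) { return n % NUM_BLOCKS; }

    /* Set associative: block N may go in any way of one set. */
    unsigned sa_set(uint32_t n)  { return n % NUM_SETS; }

    /* Block 12: dm_slot(12) == 4; sa_set(12) == 0; a fully
     * associative cache could place it in any of the 8 blocks. */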

Decreasing miss ratio with associativity

[Figure: the same eight blocks organized four ways - one-way set associative (direct mapped, blocks 0 through 7), two-way set associative (sets 0 through 3), four-way set associative (sets 0 and 1), and eight-way set associative (fully associative); each block holds a tag and data]

- Compared to direct mapped, give a series of references that:
  results in a lower miss ratio using a 2-way set associative cache
  results in a higher miss ratio using a 2-way set associative cache
  assuming we use the least recently used replacement strategy

An implementation

[Figure: four-way set associative cache - the address (bit positions shown) supplies an index that selects one of 256 sets; the four ways' valid bits and tags are compared in parallel, and the comparators drive a 4-to-1 multiplexor that delivers the Hit signal and the data]

Performance

[Figure: miss rate (0% to 15%) vs. associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB; increasing associativity helps most for the small caches]

Set Associative Caches
- Advantages:
  Miss ratio decreases as associativity increases
- Disadvantages:
  Extra memory needed for extra tag bits in the cache
  Extra time for the associative search

Block Replacement Policies
- What block to replace on a cache miss?
- We have multiple candidates (unlike direct mapped caches)
  Random
  FIFO (First In First Out)
  LRU (Least Recently Used)
- Typically, CPUs use Random or approximate LRU because these are easier to implement in hardware

Example
- Cache size = 4 one-word blocks; Replacement policy = LRU
- Sequence of memory references: 0, 8, 0, 6, 8
- Set associativity = 4 (fully associative); Number of sets = 1

  Address  Hit/Miss  Set 0  Set 0  Set 0  Set 0
  0        Miss      0      -      -      -
  8        Miss      0      8      -      -
  0        Hit       0      8      -      -
  6        Miss      0      8      6      -
  8        Hit       0      8      6      -

Example cont'd
- Cache size = 4 one-word blocks; Replacement policy = LRU
- Sequence of memory references: 0, 8, 0, 6, 8
- Set associativity = 2; Number of sets = 2
  (block addresses 0, 6, and 8 are all even, so they all map to set 0)

  Address  Hit/Miss  Set 0  Set 0  Set 1  Set 1
  0        Miss      0      -      -      -
  8        Miss      0      8      -      -
  0        Hit       0      8      -      -
  6        Miss      0      6      -      -
  8        Miss      8      6      -      -

Example cont'd
- Cache size = 4 one-word blocks; Replacement policy = LRU
- Sequence of memory references: 0, 8, 0, 6, 8
- Set associativity = 1 (direct mapped cache)

  Address  Hit/Miss  Block 0  Block 1  Block 2  Block 3
  0        Miss      0        -        -        -
  8        Miss      8        -        -        -
  0        Miss      0        -        -        -
  6        Miss      0        -        6        -
  8        Miss      8        -        6        -
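A minimal C simulation of the fully associative case, as a sanity check on the first table (the age-counter implementation of LRU is illustrative; real hardware only approximates LRU):

    #include <stdio.h>

    #define WAYS 4                  /* 4 one-word blocks, fully associative */

    static int blocks[WAYS];        /* block address held by each way */
    static int age[WAYS];           /* ticks since last use, for LRU  */
    static int valid[WAYS];

    /* Return 1 on a hit, 0 on a miss; on a miss, fill an invalid
     * way if one exists, else replace the least recently used way. */
    int lru_access(int addr) {
        int victim = 0;
        for (int i = 0; i < WAYS; i++) age[i]++;
        for (int i = 0; i < WAYS; i++)
            if (valid[i] && blocks[i] == addr) { age[i] = 0; return 1; }
        for (int i = 0; i < WAYS; i++) {
            if (!valid[i]) { victim = i; break; }
            if (age[i] > age[victim]) victim = i;
        }
        valid[victim] = 1; blocks[victim] = addr; age[victim] = 0;
        return 0;
    }

    int main(void) {
        int refs[] = {0, 8, 0, 6, 8};   /* sequence from the example */
        for (int i = 0; i < 5; i++)
            printf("%d: %s\n", refs[i], lru_access(refs[i]) ? "hit" : "miss");
        return 0;                       /* prints miss miss hit miss hit */
    }

Changing WAYS and adding a set index reproduces the two-way and direct-mapped tables, and the same harness can answer the earlier exercise about reference series that help or hurt a 2-way cache.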

Decreasing miss penalty with multilevel caches
- Add a second level cache:
  often primary cache is on the same chip as the processor
  use SRAMs to add another cache above primary memory (DRAM)
  miss penalty goes down if data is in the 2nd level cache
- Example:
  CPI of 1.0 on a 500 MHz machine with a 5% miss rate, 200 ns DRAM access
  Adding a 2nd level cache with 20 ns access time decreases the miss rate to 2%
- Using multilevel caches:
  try and optimize the hit time on the 1st level cache
  try and optimize the miss rate on the 2nd level cache

[Figure: the processor-memory performance gap - improvement factor by year for CPU (fast), CPU (slow), and DRAM; processor performance improves far faster than DRAM access time]
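Working the example above through in C (one interpretation is assumed: all 5% of primary misses pay the 2nd level access time, and 2% of references still go all the way to DRAM):

    #include <stdio.h>

    /* 500 MHz -> 2 ns/cycle; 200 ns DRAM and 20 ns L2 accesses
     * become 100-cycle and 10-cycle penalties respectively.     */
    int main(void) {
        double cycle_ns    = 2.0;
        double mem_penalty = 200.0 / cycle_ns;   /* 100 cycles */
        double l2_penalty  =  20.0 / cycle_ns;   /*  10 cycles */

        double cpi_l1_only = 1.0 + 0.05 * mem_penalty;                     /* 6.0 */
        double cpi_with_l2 = 1.0 + 0.05 * l2_penalty + 0.02 * mem_penalty; /* 3.5 */

        printf("CPI without L2: %.1f\n", cpi_l1_only);
        printf("CPI with L2:    %.1f\n", cpi_with_l2);
        return 0;
    }

The second level cache cuts the effective CPI from 6.0 to 3.5, so the machine runs nearly twice as fast.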

[Figure: miss rate per type vs. cache size in KB - the capacity component is shown shrinking with cache size, and the conflict component shrinks as associativity grows from one-way to eight-way]

Virtual Memory
- Main memory can act as a cache for the secondary storage (disk)

[Figure: virtual addresses map through address translation to physical addresses in main memory, or to disk addresses]

- Advantages:
  illusion of having more physical memory
  program relocation
  protection

- Recall: Each MIPS program has an address space of size 2^32 bytes

[Figure: MIPS memory layout - stack at $sp = 7fff ffff hex, growing down toward dynamic data; static data at $gp = 1000 8000 hex; text at pc = 0040 0000 hex; reserved addresses below]

Pages: virtual memory blocks
- Page faults: the data is not in memory, retrieve it from disk
  huge miss penalty, thus pages should be fairly large (e.g., 4 KB)
  reducing page faults is important (LRU is worth the price)
  can handle the faults in software instead of hardware
  using write-through is too expensive, so we use write-back

[Figure: address translation - the 32-bit virtual address (bit positions 31 down to 0) is split into a virtual page number (bits 31-12) and a page offset (bits 11-0); translation replaces the virtual page number with a physical page number and leaves the page offset unchanged]
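The split in the figure is two shift-and-mask operations. A minimal C sketch for 4 KB pages (ppn stands for whatever physical page number the page table supplies):

    #include <stdint.h>

    #define PAGE_BITS 12   /* 4 KB pages -> 12-bit page offset */

    uint32_t vpn(uint32_t vaddr)    { return vaddr >> PAGE_BITS; }
    uint32_t offset(uint32_t vaddr) { return vaddr & ((1u << PAGE_BITS) - 1); }

    /* Translation swaps the virtual page number for a physical
     * page number and keeps the offset.                         */
    uint32_t phys_addr(uint32_t ppn, uint32_t vaddr) {
        return (ppn << PAGE_BITS) | offset(vaddr);
    }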

Page Tables

[Figure: a page table indexed by virtual page number - each entry has a valid bit and holds either a physical page address (page resident in physical memory) or a disk address (page in disk storage)]

Page Tables

[Figure: page table lookup - the page table register plus the virtual page number (virtual address bits 31-12) index the page table; the entry supplies a physical page number, which is concatenated with the page offset (bits 11-0) to form the physical address]

- If the valid bit is 0, then the page is not present in memory
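A sketch of the lookup in C, assuming a hypothetical flat page table indexed directly by virtual page number (real systems use multi-level or hashed structures to avoid one huge array):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_BITS  12
    #define NUM_VPAGES (1u << 20)   /* 32-bit virtual addresses, 4 KB pages */

    /* Illustrative page-table entry: a valid bit plus a physical
     * page number; if not valid, the OS finds the page on disk.  */
    struct pte { bool valid; uint32_t ppn; };

    static struct pte page_table[NUM_VPAGES];

    /* Returns true and fills *paddr on success; false means a
     * page fault, which is handled in software.                  */
    bool translate(uint32_t vaddr, uint32_t *paddr) {
        struct pte e = page_table[vaddr >> PAGE_BITS];
        if (!e.valid)
            return false;
        *paddr = (e.ppn << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
        return true;
    }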

Making Address Translation Fast
- A cache for address translations: the translation lookaside buffer (TLB)

[Figure: the TLB - entries with a valid bit, a tag (virtual page number), and a physical page address; on a TLB miss, the page table supplies either a physical page or a disk address]

TLBs and caches

[Figure: flowchart of a memory access - TLB access; on a TLB miss, raise a TLB miss exception; on a TLB hit, form the physical address; for a write, check the write access bit (raise a write protection exception if it is off), then write the data into the cache, update the tag, and put the data and the address into the write buffer; for a read, try to read the data from the cache, stalling on a cache miss and delivering the data to the CPU on a cache hit]
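A sketch of the fast path in C: consult the TLB first and walk the page table only on a miss. The table size, full associativity, and replace-entry-0 refill are simplifying assumptions, and page_table_lookup is a stub standing in for the translation sketched earlier:

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 64          /* assumed size, fully associative */
    #define PAGE_BITS   12

    struct tlb_entry { bool valid; uint32_t vpn, ppn; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Stub for the page-table walk sketched earlier. */
    static bool page_table_lookup(uint32_t vpn, uint32_t *ppn) {
        *ppn = vpn;                 /* identity mapping, demo only */
        return true;
    }

    /* Fast path: check the TLB; walk the page table only on a miss. */
    bool tlb_translate(uint32_t vaddr, uint32_t *paddr) {
        uint32_t vpn = vaddr >> PAGE_BITS, ppn;
        for (int i = 0; i < TLB_ENTRIES; i++)
            if (tlb[i].valid && tlb[i].vpn == vpn) {
                ppn = tlb[i].ppn;                      /* TLB hit  */
                goto form_address;
            }
        if (!page_table_lookup(vpn, &ppn))             /* TLB miss */
            return false;                              /* page fault */
        tlb[0] = (struct tlb_entry){ true, vpn, ppn }; /* naive refill */
    form_address:
        *paddr = (ppn << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
        return true;
    }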

Modern Systems
- Very complicated memory systems:

  Characteristic     Intel Pentium Pro                           PowerPC 604
  Virtual address    32 bits                                     52 bits
  Physical address   32 bits                                     32 bits
  Page size          4 KB, 4 MB                                  4 KB, selectable, and 256 MB
  TLB organization   A TLB for instructions and a TLB for data   A TLB for instructions and a TLB for data
                     Both four-way set associative               Both two-way set associative
                     Pseudo-LRU replacement                      LRU replacement
                     Instruction TLB: 32 entries                 Instruction TLB: 128 entries
                     Data TLB: 64 entries                        Data TLB: 128 entries
                     TLB misses handled in hardware              TLB misses handled in hardware

  Characteristic       Intel Pentium Pro                    PowerPC 604
  Cache organization   Split instruction and data caches    Split instruction and data caches
  Cache size           8 KB each for instructions/data      16 KB each for instructions/data
  Cache associativity  Four-way set associative             Four-way set associative
  Replacement          Approximated LRU replacement         LRU replacement
  Block size           32 bytes                             32 bytes
  Write policy         Write-back                           Write-back or write-through

Some Issues
- Processor speeds continue to increase very fast
  much faster than either DRAM or disk access times
- Design challenge: dealing with this growing disparity
- Trends:
  synchronous SRAMs (provide a burst of data)
  redesign DRAM chips to provide higher bandwidth or processing
  restructure code to increase locality
  use prefetching (make the cache visible to the ISA)