Computer Architecture

Cache Memory
Gábor Horváth, associate professor
BUTE Dept. of Networked Systems and Services
ghorvath@hit.bme.hu
Budapest, 27 April 2016

It is the memory...
The memory is a serious bottleneck of von Neumann computers:
- because it is slow,
- and using virtual memory makes it even worse: one memory operation generates several memory accesses due to the page table lookup.

Remedy
Programs do not refer to memory addresses randomly: the memory addresses referenced by a program show a special pattern, and we can utilize it.
Locality of reference:
- Temporal: a memory content that has been referenced will be referenced again in the near future.
- Spatial: if a memory content has been referenced, its neighborhood will be referenced as well in the near future.
- Algorithmic: some algorithms (like traversing linked lists) refer to memory in a systematic way.
Examples:
- Media players: spatial locality yes, temporal locality no.
- A for loop in the C language: both temporal and spatial locality hold.

Memory hierarchy
If programs show this locality of reference behavior, let us move the frequently used data close to the CPU and store it in a special, fast memory.
Why, isn't all memory slow? No. The slowest is the hard disk drive, and we have SRAM, which is much faster than DRAM. SRAM is the fastest, but also much more expensive.

Storage technology   Access time          Price/GB
SRAM                 0.5 - 2.5 ns         2000 - 5000 $
DRAM                 50 - 70 ns           20 - 50 $
HDD                  5 - 20 * 10^6 ns     0.2 $
(data from 2008)

Memory hierarchy (figure: the levels of the memory hierarchy)

Cache implementations
Classification of the various cache implementations.
According to the addressing scheme, the cache can be:
- Transparent: the cache memory stores a part of the system memory.
- Non-transparent: a part of the address space is covered by cache memory (SRAM), the rest is covered by the system memory (DRAM).
According to the management scheme, the cache may have:
- Implicit management: the content of the cache is managed by the hardware (CPU).
- Explicit management: the content of the cache is managed by the applications.

Cache implementations

                          Addressing scheme   Management scheme
Transparent cache         Transparent         Implicit
Software managed cache    Transparent         Explicit
Self-managed scratch-pad  Non-transparent     Implicit
Scratch-pad memory        Non-transparent     Explicit

- Transparent cache: the classical CPU cache.
- Scratch-pad memory: DSPs, microcontrollers, PlayStation 3.
- Software managed cache: on a cache miss it is the task of the application to update the cache.
- Self-managed scratch-pad: rare.

Cache implementation
What we are going to cover:
- Cache organization: how to store data in the cache in an efficient way.
- Cache content management: when to put data into the cache and when not, and what to throw out of the cache if we want to put new data there.

Cache organizations

Transparent cache organizations
The data units stored in the cache are called blocks.
- Size of a block: 2^L bytes.
- The lower L bits of a memory address give the position inside a block.
- The upper bits give the number (ID) of the block.
Additional information stored in the cache with each block:
- Cache tag: which block of the system memory is stored here.
- Valid bit: if it is 1, this cache block stores valid data.
- Dirty bit: if it is 1, this cache block has been modified since it was loaded into the cache.
The principal questions of cache organization are how to store blocks in the cache so that a block can be found fast, while keeping the hardware simple and relatively cheap.

Fully associative cache organization
- Blocks can be placed anywhere in the cache.
- Cache tag: which block of the system memory is stored here.
- It has high energy consumption (and thus produces heat), since at lookup the block number of the address has to be compared to all cache tags, and the comparators are wide: the number of bits to compare equals the number of bits describing a block number.

Direct mapped organization
Each block of the memory can be placed at only a single place in the cache. The block number determines where, e.g. the lower bits of the block number determine it.

Direct mapped organization
Example: a cache that can store 4 blocks: 1 yellow, 1 red, 1 blue, 1 green. The blocks of the memory follow each other in this order; the color corresponds to the lowest 2 bits of the block number.

Direct mapped organization
Lookup consists of two steps:
1) Indexing: the color is given by the lower two bits of the memory address; assume it is red (this is called the index).
2) Comparison: is the red block of the cache storing our memory block? To decide, we need a single comparison; the number of bits to compare equals the number of bits of the tag.

Conclusion
Fully associative organization:
- A block can be placed anywhere.
- At each lookup all of the comparators are in use, and each comparator compares a lot of bits, which is complex and consumes a lot of energy.
Direct mapped organization:
- Placement is restricted, so contention is possible: e.g. a program might refer only to red blocks.
- At each lookup only a single comparator is in use, and there are fewer bits to compare.

N-way set-associative organization
Combines the advantages of the previous two solutions:
- The lower bits of the block number of the memory address determine where the block can be placed in the cache, but not in a unique way.
- The lower bits select a set of places where the block can be stored.
- Each set consists of n blocks; the block we are looking for can be stored in any of them.

N-way set-associative organization
Lookup consists of two steps:
1) Indexing: the color is given by the lower two bits of the memory address; assume it is red (this is called the index).
2) Comparison: there are n places in the cache that can store red blocks. To find out whether our data is in the cache, we need to compare the address to the tags of all n places: n comparators are in use, and the width of each of them is given by the width of the tag.

Conclusion
N-way set-associative organization:
- Restricted placement of blocks, with n possibilities: less contention.
- Only n comparators are switched on at lookup: moderate complexity and energy consumption.
Typical values of n: n = 2...16.

      Core i7   AMD Bulldozer           ARM Cortex A9
L1    8         2 (instr.) / 4 (data)   4
L2    8         16                      16
L3    16        64                      -

Cache content management

Putting data into the cache
When should a memory block be put into the cache?
- Never
- When the program refers to it for the first time
- Before the program refers to it (in the hope that it will use it later)
Never: avoids cache pollution. E.g. media players display each frame only once; the frames will never be referenced later, so it is not worth putting such data into the cache.
At the first reference: the first reference is slow (memory-to-cache transfer), but later references will be fast (served from the cache).
Before the first reference (prefetch): needs speculation.
Writing and reading are not always treated equally:
- Write allocate: data is put into the cache on write operations as well.
- Write no-allocate: data is not loaded into the cache when writing to the memory (if it is not already there).

Avoidance of cache pollution
Cache pollution: a block that was not referenced between being loaded into and leaving the cache.
- It was a mistake to put it into the cache.
- Useful data might have been thrown out when it was loaded.
Explicit solutions: specific instructions with which the programmer can tell the processor what not to load into the cache.
- x86: MOVNTI, MOVNTQ move data from a register to the memory, bypassing the cache.
- PowerPC: LVXL reads vectors from memory to registers; the block is put into the cache, but marked as not important.
- PA-RISC: several instructions have a Spatial Locality Cache Control Hint bit.
- Itanium: data movement instructions have an .nt option, indicating that the data does not exhibit temporal locality.
- Alpha: the ECB instruction (Evict Data Cache Block) indicates that the block will not be referenced in the near future.

Avoidance of cache pollution
Implicit solutions: the CPU tries to detect instructions that pollute the cache automatically. Several solutions are in use; an example is Rivers' algorithm: if an instruction causes too many cache misses, the CPU does not allow it to use the cache.

Data prefetch
An important functionality. Goal: the CPU should not wait for the slow memory at all; all data has to be loaded into the cache before the first reference to it occurs.
This is simple in the case of the instruction cache:
- Non-jump instructions are executed sequentially, so the next instructions to load are known.
- The jump addresses of unconditional jump instructions can be pre-computed, so the next instructions can be loaded in advance.
- In the case of conditional jump instructions, the jump address can be estimated.

Data prefetch
Predicting which data will be referenced in the near future is less trivial. If the prediction is:
- Too conservative: the data will not be in the cache when needed, causing too many cache misses.
- Too aggressive: it pollutes the cache with unnecessary blocks, so useful data gets thrown out, again causing too many cache misses.
Explicit solutions: special instructions.
Implicit solutions: the CPU speculates about which data will be referenced next.

Data prefetch
Explicit prefetch instructions:
- x86: PREFETCHT0/1/2, PREFETCHNTA load a block into the cache. PREFETCHNTA loads the data into a restricted part of the n-way set-associative cache, so it avoids cache pollution as well.
- PowerPC: DCBT, DCBTST load a block into the cache.
- PA-RISC: LDD, LDW load a block into the cache.
- Itanium: lfetch loads a block into the cache.
- Alpha: if the destination register of a data movement instruction is R31, the CPU treats it as a prefetch request.

Data prefetch
Platform-independent solution of the GCC C compiler:
__builtin_prefetch(pointer);

Data prefetch
Implicit solution: the CPU detects iteration over data placed equidistantly in memory: X, X + stride, X + 2*stride, X + 3*stride, ... The stride is detected automatically. An example is the "Intel Smart Memory Access" IP prefetcher in the Intel Core i7.

Cache replacement strategy
We already know which block to put into the cache and when to do it. But where to put it? The cache memory is small, therefore it is usually full, so a block needs to be removed from the cache.
- The number of candidates = the associativity of the cache.
- There is no choice in the case of a direct mapped cache: the new block can be stored at a single place in the cache, and the previous content of that place is removed.
Possible algorithms to select the victim:
- Random choice
- Round robin
- Least recently used (LRU): the most popular strategy
- Not recently used
- Least frequently used

Cache coherence
The system memory is used by several players in the computer, so the content of the cache has to be consistent with it at all times. The cause of the problems: the write operations.
Strategies to follow when modifying data in the cache:
- Write-through: a modification of the cache immediately implies updating the system memory as well.
- Write-back: the data is modified only in the cache; the system memory is updated only when the block leaves the cache. In this case the content of the memory and of the cache will differ in the meantime.