CSE 160, Lecture 5
The Memory Hierarchy; False Sharing; Cache Coherence and Consistency
Scott B. Baden
Using Bang (coming down the home stretch)
Do not use Bang's front end for running mergesort
Use batch, or interactive nodes, via qlogin
Use the front end for editing & compiling only
10% penalty for using the login nodes improperly, doubles with each incident!
Announcements
SDSC Tour on Friday 11/1
Today's lecture
The memory hierarchy
Cache Coherence and Consistency
Implementing synchronization
False sharing
Performance: the processor-memory gap
The result of technological trends
The difference between processing and memory speeds grows exponentially over time
[Plot: relative performance of CPU vs. DRAM, 1980-2000, log scale. CPU improves ~60%/year (Moore's Law), DRAM ~7%/year; the processor-memory performance gap grows ~50%/year.]
An important principle: locality
Memory accesses exhibit two forms of locality
  Temporal locality (time)
  Spatial locality (space)
Often involves loops; opportunities for reuse
Idea: construct a small & fast memory to cache re-used data

  for t = 0 to T-1
    for i = 1 to N-2
      u[i] = (u[i-1] + u[i+1]) / 2

[Diagram: the memory hierarchy, smaller and faster toward the CPU.
  CPU registers: 1 CP (1 word)
  L1: 32 to 64 KB, 2-3 CP
  L2: 256 KB to 4 MB, O(10) CP (10-100 B lines)
  DRAM: GB, O(100) CP
  Disk: TB to PB, O(10^6) CP]
The Benefits of Cache Memory
Let's say that we have a small fast memory that is 10 times faster (in access time) than main memory
If we find what we are looking for 90% of the time (a hit), the access time approaches that of the fast memory
  T_access = 0.90 * 1 + (1 - 0.90) * 10 = 1.9
Memory appears to be about 5 times faster
We organize the references by blocks
We can have multiple levels of cache
Sidebar
If cache memory access time is 10 times faster than main memory
  Cache hit time T_cache = T_main / 10
  T_main is the cache miss penalty
And if we find what we are looking for a fraction f of the time (the cache hit rate)
  Access time = f * T_cache + (1 - f) * T_main
              = f * T_main/10 + (1 - f) * T_main
              = (1 - 9f/10) * T_main
We are now 1/(1 - 9f/10) times faster
To simplify, we use T_cache = 1, T_main = 10
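To make the model concrete, here is a minimal C++ sketch (ours, not from the slides) that evaluates the formula for a few hit rates, using the slide's simplified values T_cache = 1 and T_main = 10:

  #include <cstdio>

  int main() {
      const double t_cache = 1.0, t_main = 10.0;   // slide's simplification
      for (double f : {0.5, 0.9, 0.99}) {          // sample hit rates
          double t = f * t_cache + (1.0 - f) * t_main;
          std::printf("hit rate %.2f: T_access = %.2f, speedup = %.2fx\n",
                      f, t, t_main / t);
      }
      return 0;
  }

At f = 0.90 this reproduces the 1.9-cycle access time above, a speedup of about 5.3x.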
Different types of caches
Separate Instruction (I) and Data (D), or Unified (I+D)
Direct mapped / Set associative
Write Through / Write Back
Allocate on Write / No Allocate on Write
Last Level Cache (LLC)
Translation Lookaside Buffer (TLB)
[Diagram: a dual-socket Clovertown system; each socket holds four Core2 cores with 32K L1 each and two 4MB shared L2s, on an FSB at 10.66 GB/s. Sam Williams et al.]
Simplest cache: direct mapped
[Diagram: each of S cache lines holds a valid bit, tag bits, and a cache block. The m-bit address splits into t tag bits, s line-index bits, and b block-offset bits; the line index selects one line. Randal E. Bryant and David R. O'Hallaron]
Accessing a direct mapped cache
(1) The valid bit of the selected line must be set
(2) The tag bits of the cache line must match the tag bits in the address
(3) If (1) and (2) hold, we have a cache hit; the block offset selects the starting byte
[Diagram: an address with tag 0110, line index i, and offset 100 is checked against the selected line, which holds valid = 1, tag 0110, and bytes w0-w3. Randal E. Bryant and David R. O'Hallaron]
Set associative cache
Why use the middle bits for the index?
[Diagram: S = 2^s sets, each holding several lines; every line has 1 valid bit, T tag bits, and a B = 2^b byte cache block. The address splits into <tag> <set index> <block offset>. Randal E. Bryant and David R. O'Hallaron]
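A minimal sketch (ours; the 64-byte blocks and 64 sets are illustrative) of how the address fields are extracted, and of why the middle bits make a good index: consecutive blocks then map to different sets, so a contiguous array spreads across the whole cache instead of conflicting within one set.

  #include <cstdint>
  #include <cstdio>

  const std::uint32_t b = 6;   // block offset bits: B = 2^6 = 64 bytes/block
  const std::uint32_t s = 6;   // set index bits:    S = 2^6 = 64 sets

  void decompose(std::uint32_t addr) {
      std::uint32_t offset = addr & ((1u << b) - 1);         // low b bits
      std::uint32_t set    = (addr >> b) & ((1u << s) - 1);  // middle s bits
      std::uint32_t tag    = addr >> (b + s);                // remaining high bits
      std::printf("addr 0x%08x -> tag 0x%x, set %u, offset %u\n",
                  addr, tag, set, offset);
  }

  int main() {
      decompose(0x1234ABC0);        // some block
      decompose(0x1234ABC0 + 64);   // the next block lands in the next set
  }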
The 3 C's of cache misses
Cold Start
Capacity
Conflict
[Diagram: dual-socket Clovertown; line size = 64B (L1 and L2); eight Core2 cores with 32K L1 each, four 4MB shared L2s, two FSBs at 10.66 GB/s into a chipset (4x64b controllers) with 21.3 GB/s (read) and 10.6 GB/s (write) to 667MHz FBDIMMs. Sam Williams et al.]
Bang's Memory Hierarchy
Intel Clovertown processor: Intel Xeon E5355 (introduced 2006)
Two Woodcrest dies (Core2) on a multi-chip module; two sockets
Line size = 64B (L1 and L2); write update policy: writeback

        Access latency, throughput (clocks)   Associativity
  L1    3, 1                                  8
  L2    14*, 2                                16

  * Software-visible latency will vary depending on access patterns and other factors

Sources: Intel 64 and IA-32 Architectures Optimization Reference Manual, Table 2.16; techreport.com/articles.x/10021/2
[Diagram: the dual-socket Clovertown system as on the previous slide. Sam Williams et al.]
Examining Bang's Memory Hierarchy
/proc/cpuinfo summarizes the processor
  vendor_id  : GenuineIntel
  model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
  cache size : 4096 KB
  cpu cores  : 4
  processor  : 0 through processor : 7
Detailed memory hierarchy information
/sys/devices/system/cpu/cpu*/cache/index*/*
Log in to bang and view the files
Today's lecture
The memory hierarchy
Cache Coherence and Consistency
Implementing synchronization
False sharing
Cache Coherence
A central design issue in shared memory architectures
Processors may read and write the same cached memory location
If one processor writes to the location, all others must eventually see the write
[Diagram: two processors with private caches above a shared memory holding X := 1]
Cache Coherence
P1 & P2 load X from main memory into their caches
P1 stores 2 into X
The memory system doesn't have a coherent value for X
[Diagram: memory holds X := 1; P1's cache holds X := 2 while P2's cache still holds X := 1]
Cache Coherence Protocols
Ensure that all processors eventually see the same value
Two policies
  Update-on-write (implies a write-through cache)
  Invalidate-on-write
[Diagram: after an update-on-write, memory and both caches hold X := 2]
SMP architectures
Employ a snooping protocol to ensure coherence
Cache controllers listen to bus activity, updating or invalidating the cache as needed
[Diagram: processors P1..Pn, each with a cache ($), on a shared bus with memory and I/O devices; a bus snoop is shown alongside a cache-memory transaction. Patterson & Hennessy]
Memory consistency and correctness
Cache coherence tells us that memory will eventually be consistent
The memory consistency policy tells us when this will happen
Even if memory is consistent, changes don't propagate instantaneously
These give rise to correctness issues involving program behavior
Memory consistency
A memory system is consistent if the following 3 conditions hold
  Program order (you read what you wrote)
  Definition of a coherent view of memory ("eventually")
  Serialization of writes (a single frame of reference)
Program order
If a processor writes and then reads the same location X, and there are no intervening writes to X by other processors, then the read will always return the value previously written
[Diagram: P writes X := 2; both memory and P's cache hold X := 2]
Definition of a coherent view of memory
If a processor P reads from location X, which was previously written by a processor Q, then the read will return the value previously written, provided a sufficient amount of time has elapsed between the write and the read
[Diagram: Q writes X := 1; after enough time, P's Load X returns 1]
Serialization of writes
If two processors write to the same location X, then other processors reading X will observe the same sequence of values, in the order written
If 10 and then 20 is written into X, then no processor can read 20 and then 10
Memory consistency models
Should it be impossible for both if statements to evaluate to true?
With sequential consistency, the results should always be the same, provided that
  Each processor keeps its accesses in the order it made them
  We can't say anything about the ordering across different processors: accesses are interleaved arbitrarily

  Processor 1        Processor 2
  A = 0              B = 0
  A = 1              B = 1
  if (B == 0) ...    if (A == 0) ...
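The slide's example can be written as the classic "store buffering" test with C++11 atomics (our sketch, not the course's code; the names p1, p2, r1, r2 are ours). Under the default sequential consistency the two loads cannot both return 0; weakening them to memory_order_relaxed makes that outcome legal:

  #include <atomic>
  #include <thread>

  std::atomic<int> A{0}, B{0};
  int r1, r2;

  void p1() { A.store(1, std::memory_order_relaxed);    // A = 1
              r1 = B.load(std::memory_order_relaxed); } // if (B == 0) ...
  void p2() { B.store(1, std::memory_order_relaxed);    // B = 1
              r2 = A.load(std::memory_order_relaxed); } // if (A == 0) ...

  int main() {
      std::thread t1(p1), t2(p2);
      t1.join(); t2.join();
      // With relaxed ordering, r1 == 0 && r2 == 0 is a legal outcome;
      // with memory_order_seq_cst (the default) it cannot happen.
  }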
Undefined behavior in C++11
Global: int x, y;

  Thread 1             Thread 2
  x = 17;              y = 37;
  cout << y << " ";    cout << x << endl;

The compiler may rearrange statements to improve performance
The processor may rearrange the order of instructions
The memory system may rearrange the order in which writes are committed
Memory might not get updated; "eventually" can be a long time (though in practice it's often not)
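A hedged fix for this example (our sketch): declaring x and y as std::atomic removes the data race, so the behavior becomes defined, even though the printed values are still nondeterministic:

  #include <atomic>
  #include <iostream>
  #include <thread>

  std::atomic<int> x{0}, y{0};   // atomics: no data race, behavior is defined

  int main() {
      std::thread t1([]{ x = 17; std::cout << y << " "; });
      std::thread t2([]{ y = 37; std::cout << x << std::endl; });
      t1.join(); t2.join();
      // The printed pair can be (0, 17), (37, 0), or (37, 17); under the
      // default sequential consistency it cannot be (0, 0).
  }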
Undefined behavior in earlier versions of C++
Global: char c; char b;

  Thread 1      Thread 2
  c = 1;        b = 1;
  int x = c;    int y = b;

In C++11, x == 1 and y == 1: c and b are separate memory locations
But in earlier dialects you might get 1 & 0, 0 & 1, or 1 & 1
  The linker could allocate b and c next to each other in the same word of memory
  A processor that can't write a single byte has to do a read-modify-write of the whole word, so one thread's write can clobber the other's
Today's lecture
The memory hierarchy
Cache Coherence and Consistency
Implementing synchronization
False sharing
Implementing Synchronization
We build mutexes and other synchronization primitives with special atomic operations, implemented with a single machine instruction, e.g. CMPXCHG
Do atomically: compare the contents of memory location loc to expected; if they are the same, modify the location with newval

  // Executed atomically in hardware:
  int CAS(int *loc, int expected, int newval) {
      if (*loc == expected) {
          *loc = newval;
          return 0;      // success
      } else
          return 1;      // failure
  }

We can then build mutexes with CAS (1 = unlocked, 0 = locked)

  void Lock(int *mutex) {
      while (CAS(mutex, 1, 0))
          ;              // spin until the CAS succeeds
  }
  void Unlock(int *mutex) { *mutex = 1; }
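For comparison, a minimal C++11 spinlock sketch (ours, not the course's code) built on the standard compare_exchange primitive, keeping the slide's convention that 1 = unlocked and 0 = locked:

  #include <atomic>

  struct SpinLock {
      std::atomic<int> state{1};                 // 1 = unlocked, 0 = locked
      void lock() {
          int expected = 1;
          while (!state.compare_exchange_weak(expected, 0,
                                              std::memory_order_acquire))
              expected = 1;   // CAS failed (lock was held): reset and spin
      }
      void unlock() { state.store(1, std::memory_order_release); }
  };

Note that compare_exchange_weak overwrites expected with the observed value on failure, hence the reset inside the loop.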
Memory fences
How are we assured that a value updated within a critical section becomes visible to all other threads?
With a fence instruction, e.g. MFENCE
"A serializing operation guaranteeing that every load and store instruction that precedes, in program order, the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible." [Intel 64 & IA-32 Architectures Software Developer's Manual]
Also see www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

  mutex mtx;
  mtx.lock();
  sum += local_sum;     // lock() and unlock() issue the needed fences for us
  mtx.unlock();
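As an illustration (our sketch, not from the slides), the C++11 counterparts of such fences can publish a value written outside of any lock:

  #include <atomic>

  std::atomic<bool> ready{false};
  int payload = 0;

  void producer() {
      payload = 42;                                   // ordinary (non-atomic) write
      std::atomic_thread_fence(std::memory_order_release);
      ready.store(true, std::memory_order_relaxed);   // publish the flag
  }

  void consumer() {
      while (!ready.load(std::memory_order_relaxed)) {}  // spin until published
      std::atomic_thread_fence(std::memory_order_acquire);
      // payload is now guaranteed to read as 42
  }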
Today's lecture
The memory hierarchy
Cache Coherence and Consistency
Implementing synchronization
False sharing
False sharing
Consider two processors that write to different locations mapping to different parts of the same cache line
[Diagram: P0 and P1 each hold a copy of the same line from main memory]
False sharing
P0 writes a location
Assuming we have a write-through cache, memory is updated
[Diagram: P0's write is propagated to main memory]
False sharing
P1 reads the location written by P0
P1 then writes a different location in the same block of memory
[Diagram: P0 and P1 both hold copies of the line]
False sharing
P1's write updates main memory
The snooping protocol invalidates the corresponding block in P0's cache
[Diagram: P0's copy of the line is invalidated]
False sharing
Successive writes by P0 and P1 cause the processors to uselessly invalidate one another's caches
[Diagram: the line ping-pongs between P0's and P1's caches]
Eliminating false sharing
Cleanly separate locations updated by different processors
  Manually assign scalars to a pre-allocated region of memory using pointers
  Spread out the values to coincide with cache line boundaries
How to avoid false sharing
Reduce the number of accesses to shared state, so false sharing occurs only a small, fixed number of times

  // Before: every thread repeatedly writes the shared counts[] array
  static int counts[];
  for (int k = 0; k < reps; k++)
      for (int r = first; r <= last; ++r)
          if ((values[r] % 2) == 1)
              counts[tid]++;

  // After: accumulate into a local variable, write counts[tid] once per pass
  int _count = 0;
  for (int k = 0; k < reps; k++) {
      for (int r = first; r <= last; ++r)
          if ((values[r] % 2) == 1)
              _count++;
      counts[tid] = _count;
  }

  Before: 4.7s, 6.3s, 7.9s, 10.4s   [NT = 1, 2, 4, 8]
  After:  3.4s, 1.7s, 0.83s, 0.43s  [NT = 1, 2, 4, 8]
Spreading
Put each counter in its own cache line
[Diagram: the counters are spread out so that each occupies the first word of its own cache line]

  // Before: counters are adjacent, so several share one cache line
  static int counts[];
  for (int k = 0; k < reps; k++)
      for (int r = first; r <= last; ++r)
          if ((values[r] % 2) == 1)
              counts[tid]++;

  // After: pad each thread's counter out to a full line
  static int counts[][line_size];
  for (int k = 0; k < reps; k++)
      for (int r = first; r <= last; ++r)
          if ((values[r] % 2) == 1)
              counts[tid][0]++;

               NT=1     NT=2   NT=4   NT=8
  Unoptimized  4.7 sec  6.3    7.9    10.4
  Optimized    4.7      5.3    1.2    1.3
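A hedged alternative to the manual 2D padding above (our sketch): C++11 alignas forces each counter onto its own 64-byte line, the Clovertown line size from the earlier slides, and the compiler pads the struct for us:

  struct alignas(64) PaddedCounter {
      int value = 0;    // one int used; the rest of the line is padding
  };

  static PaddedCounter counts[8];   // one per thread; counts[tid].value++
                                    // never false-shares with another thread
  static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");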
Cache performance bottlenecks in nearest neighbor computations
Recall the image smoothing algorithm

  for (i, j) in 0:N-1 x 0:N-1
      Inew[i,j] = (I[i-1,j] + I[i+1,j] + I[i,j-1] + I[i,j+1]) / 4

[Images: the original, after 100 iterations, and after 1000 iterations]
Memory access pattern
Some nearest neighbors in space are far apart in memory
Stride = N along the vertical dimension

  for (i, j) in 0:N-1 x 0:N-1
      Inew[i,j] = (I[i-1,j] + I[i+1,j] + I[i,j-1] + I[i,j+1]) / 4
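To see where the stride comes from, here is a sketch (ours; the idx helper is hypothetical) of the usual row-major 1D layout, where the vertical neighbors are a full row apart and so land on different cache lines once 4*N exceeds the 64 B line size:

  // Row-major layout: element (i, j) of an N x N array lives at i*N + j
  inline int idx(int i, int j, int N) { return i * N + j; }

  // Neighbor offsets relative to idx(i, j, N):
  //   I[idx(i-1, j, N)]  ->  -N   (vertical: stride N)
  //   I[idx(i+1, j, N)]  ->  +N   (vertical: stride N)
  //   I[idx(i, j-1, N)]  ->  -1   (horizontal: usually the same line)
  //   I[idx(i, j+1, N)]  ->  +1   (horizontal: usually the same line)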
False sharing and conflict misses
False sharing arises at internal partition boundaries: poor spatial locality, with cache lines internally fragmented
Large memory access strides cause conflict misses and poor cache locality
Even worse in 3D: large strides of N^2
[Diagram: contiguous access on a single processor vs. a cache block straddling a partition boundary when the domain is split among processors P0-P8. Parallel Computer Architecture, Culler, Singh & Gupta]