Multiprocessor Cache Coherence



Similar documents
Chapter 12: Multiprocessor Architectures. Lesson 09: Cache Coherence Problem and Cache synchronization solutions Part 1

Vorlesung Rechnerarchitektur 2 Seite 178 DASH

Lecture 23: Multiprocessors

Distributed Systems. REK s adaptation of Prof. Claypool s adaptation of Tanenbaum s Distributed Systems Chapter 1

Embedded Parallel Computing

Synchronization. Todd C. Mowry CS 740 November 24, Topics. Locks Barriers

Introduction to Parallel Computing. George Karypis Parallel Programming Platforms

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Interconnection Networks

Symmetric Multiprocessing

Supporting Cache Coherence in Heterogeneous Multiprocessor Systems

Annotation to the assignments and the solution sheet. Note the following points

COS 318: Operating Systems. Virtual Memory and Address Translation

Measuring and Characterizing Cache Coherence Traffic

Distributed System: Definition

Parallel Algorithm Engineering

Chapter 2: OS Overview

Operating system Dr. Shroouq J.

Virtual Shared Memory (VSM)

Principles and characteristics of distributed systems and environments

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

Distributed Systems LEEC (2005/06 2º Sem.)

Lecture 2 Parallel Programming Platforms

Multi-Threading Performance on Commodity Multi-Core Processors

Client/Server Computing Distributed Processing, Client/Server, and Clusters

Web DNS Peer-to-peer systems (file sharing, CDNs, cycle sharing)

How To Understand The Concept Of A Distributed System

Lesson Objectives. To provide a grand tour of the major operating systems components To provide coverage of basic computer system organization

A Locally Cache-Coherent Multiprocessor Architecture

Operating Systems 4 th Class

Chapter 2 Parallel Architecture, Software And Performance

SOC architecture and design

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers

Switch Fabric Implementation Using Shared Memory

Data Memory Alternatives for Multiscalar Processors

Homework # 2. Solutions. 4.1 What are the differences among sequential access, direct access, and random access?

An Interconnection Network for a Cache Coherent System on FPGAs. Vincent Mirian

Multicore Processor and GPU. Jia Rao Assistant Professor in CS

Chapter 1 Computer System Overview

Virtual vs Physical Addresses

The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy

Outline: Operating Systems

OPERATING SYSTEMS Internais and Design Principles

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

Measuring IP Performance. Geoff Huston Telstra

How To Make A Distributed System Transparent

1. Computer System Structure and Components

COURSE OUTLINE Survey of Operating Systems

Virtual machine interface. Operating system. Physical machine interface

Why the Network Matters

CMSC 611: Advanced Computer Architecture

Java DB Performance. Olav Sandstå Sun Microsystems, Trondheim, Norway Submission ID: 860

Microkernels, virtualization, exokernels. Tutorial 1 CSC469

Introduction to Cloud Computing

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Computers. Hardware. The Central Processing Unit (CPU) CMPT 125: Lecture 1: Understanding the Computer

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures

What is a bus? A Bus is: Advantages of Buses. Disadvantage of Buses. Master versus Slave. The General Organization of a Bus

Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification

white paper Capacity and Scaling of Microsoft Terminal Server on the Unisys ES7000/600 Unisys Systems & Technology Modeling and Measurement

CS550. Distributed Operating Systems (Advanced Operating Systems) Instructor: Xian-He Sun

BBM467 Data Intensive ApplicaAons

Chapter 2. Multiprocessors Interconnection Networks

361 Computer Architecture Lecture 14: Cache Memory

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

Traditional IBM Mainframe Operating Principles

Scalability and Classifications

Chapter 17: Distributed Systems

High Performance Computing. Course Notes HPC Fundamentals

Distributed Operating Systems

Parallel Computing 37 (2011) Contents lists available at ScienceDirect. Parallel Computing. journal homepage:

Distributed System Principles

CHAPTER 1: OPERATING SYSTEM FUNDAMENTALS

Middleware and Distributed Systems. Introduction. Dr. Martin v. Löwis

Introduction to AMBA 4 ACE and big.little Processing Technology

MONITORING PERFORMANCE IN WINDOWS 7

Memory Architecture and Management in a NoC Platform

Microsoft Office SharePoint Server 2007 Performance on VMware vsphere 4.1

FROM RELATIONAL TO OBJECT DATABASE MANAGEMENT SYSTEMS

Intel P6 Systemprogrammering 2007 Föreläsning 5 P6/Linux Memory System

Axapta Object Server MICROSOFT BUSINESS SOLUTIONS AXAPTA

The Oracle Universal Server Buffer Manager

Real Time Programming: Concepts

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip

Weighted Total Mark. Weighted Exam Mark

How to design and implement firmware for embedded systems

Query OLAP Cache Optimization in SAP BW

Chapter 18: Database System Architectures. Centralized Systems

Windows 8 SMB 2.2 File Sharing Performance

YZP : SAUTER Vision Center

StrongARM** SA-110 Microprocessor Instruction Timing

Transcription:

Multiprocessor Cache Coherence M M BUS P P P P The goal is to make sure that READ(X) returns the most recent value of the shared variable X, i.e. all valid copies of a shared variable are identical. 1. Software solutions 2. Hardware solutions Snooping Cache Protocol (for bus-based machines) Directory Based Solutions (for NUMA machines using a scalable switch)

Software Solutions Compiler tags data as cacheable and non-cacheable. Only read-only data is considered cachable and put in private cache. All other data are non-cachable, and can be put in a global cache, if available. Memory Global cache BUS P P P P

Hardware Solution: Snooping Cache Widely used in bus-based multiprocessors. The cache controller constantly watches the bus. Write Invalidate When a processor writes into C, all copies of it in other processors are invalidated. These processors have to read a valid copy either from M, or from the processor that modified the variable. Write Broadcast Instead of invalidating, why not broadcast the updated value to the other processors sharing that copy? This will act as write through for shared data, and write back for private data. Write broadcast consumes more bus bandwidth compared to write invalidate. Why?

MESI Protocol (Papamarcos & Patel 1984) It is a version of the snooping cache protocol. Each cache block can be in one of four states: INVALID SHARED EXCLUSIVE MODIFIED Not valid Multiple caches may hold valid copies. No other cache has this block, M-block is valid Valid block, but copy in M-block is not valid. Event Local Remote Read hit Use local copy No action Read miss I to S, or I to E (S,E,M) to S Write hit (S,E) to M (S,E,M) to I Write miss I to M (S,E,M) to I When a cache block changes its status from M, it first updates the main memory.

Examples of state transitions under MESI protocol A. Read Miss x=5 x=5 x=5 x=3 x=5 S S S I x=5 x=5 x=5 x=5 x=5 S S S S x=5 x=2 x=4 x=3 x=5 E I I I x=5 x=2 x=4 x=5 x=5 S I I S

B. More Read Miss x=9 x=2 x=4 x=3 x=5 M I I I x=9 x=2 x=4 x=3 x=9 writeback M try again x=9 x=2 x=4 x=9 x=9 S I I S Following the read miss, the holder of the modified copy signals the initiator to try again. Meanwhile, it seizes the bus, and write the updated copy into the main memory.

C. Write Miss x=9 x=2 x=4 x=5 M I I I x=9 x=2 x=4 x=9 writeback M allocate X x=9 x=2 x=3 x=9 x=9 now modify x=9 x=2 x=3 x=10 x=9 I I I M

Directory-based cache coherence The snooping cache protocol does not work if there is no bus. Large-scale shared memory multiprocessors may connect processors with memories through switches. A directory has to beep track of the states of the shared variables, and oversee that they are modified in a consistent way. Maintenance of the directory in a distributed environment is another issue. Naïve solutions may lead to deadlock. Consider this: P1 has a read miss for x2 (local to P2) P2 has a read miss for x1 (local to P1) Each will block and expect the other process to send the correct value of x: deadlock (!)

Memory Consistency Models Multiple copies of a shared variable can be found at different locations, and any write operation must update all of them. Coherence vs. consistency Cache coherence protocols guarantee that eventually all copies are updated. Depending on how and when these updates are performed, a read operation may sometimes return unexpected values. Consistency deals with what values can be returned to the user by a read operation (may return unexpected values if the update is not complete). Consistency model is a contract that defines what a programmer can expect from the machine.

Sequential Consistency Program 1. process 0 process 1 {initially,x=0 and y=0} x:=1; y:=1; if (y=0) then x:=2; if (x=0) then y:=2; print x; print y; If both processes run concurrently, then can we see a printout (x=2, y=2)? A key question is: Did process 0 read y before process 1 modified it? One possible scenario is: x=0 w(x,1) read y process 0 y=0 w(y,1) read x process 1 Here, the final values are: (x=1, y=1)

x=0 w(x,1) read y process 0 y=0 w(y,1) read x process 1 Here, the final values are: (x=2, y=1) Properties of SC. SC1. All operations in a single process are executed in program order. SC2. The result of any execution is the same as if a single sequential order has been maintained among all operations.

Implementing SC Case 1 Consider a switch-based multiprocessor. Assume there is no cache. p0 p1 p2 x y Process 0 executes: (x:=1; y:=2) To prevent p1 or p2 from seeing these in a different order, p0 must receive an acknowledgement after every write operation. Case 2 In a multiprocessor where processors have private cache, all invalidate signals must be acknowledged.

Write-buffers and New problems P cache buffer memory Go back to the first program now. Let both x:=1 and y:=1 be written into the write buffers, but before the memory is updated, let the two if statements be evaluated. Both can be true, and (x:=2, y:= 2) are possible! This violates sequential consistency.