Computer Architecture 2 lecture (Vorlesung Rechnerarchitektur 2), page 178: DASH

Theodore French
8 years ago
Directory Architecture for Shared Memory (DASH)

The DASH architecture is a cache-coherent NUMA multiprocessor system developed at CSL, Stanford, by John Hennessy, Daniel Lenoski, Monica Lam, Anoop Gupta, ...

Features: a scalable high-performance MIMD system; distributed shared memory with a single address space; coherent caches with caching of shared data.

Methods for achieving scalability: distributed directories; hierarchical cluster configuration; cache coherence protocol; efficient synchronization.

Prototype: 64 processors (MIPS R3000), organized as 16 clusters of 4 processors each.

[Figure: General architecture of DASH. 16 clusters of 4 processors each; every cluster has a bus with MESI and a remote access cache, and the clusters are connected by the interconnection network.]

Block diagram and topology

The node design is based on the Silicon Graphics Power Station 4D/240 (R3000), with a directory controller board added. Each node (cluster) contains 4 processors on a shared bus with MESI. The memory bus is synchronous and pipelined and has no split transactions (local accesses); long-latency transactions (remote accesses) are retried and are arbitrated again only on a completed transfer.

[Figure: Block diagram of a node. Four processors, each with L1 and L2 caches, share the bus with MESI with the globally addressed main memory, the VME interface, and the directory controller, which connects to the interconnection network.]

[Figure: Block diagram of a 2 x 2 system. Nodes #1 to #4 are connected by a request interconnection network (Request IN) and a reply interconnection network (Reply IN).]

Directory board

[Figure: Directory board block diagram (performance monitor not shown). Mesh routing chips connect the board to the request network and the reply network. Signals shown towards the node bus: Arbitration Masks, MPBUS Data, MPBUS Request, MPBUS Address/Control, Remote Status, Bus Retry.]

The board contains the following units.

Reply Controller (RC) with Remote Access Cache (RAC): stores the state of pending memory requests; the RAC snoops on the bus.

Pseudo CPU (PCPU): forwards remote CPU requests to the local MPBUS; issues cache line invalidations and lock grants.

Directory Controller (DC) with directory DRAM: forwards local requests to remote nodes; replies to remote requests; responds to the MPBUS with directory information; stores locks and lock queues.

There is one directory entry for each memory block. Each directory entry contains a bit vector in which each bit indicates whether the corresponding node (cluster) holds a cached copy of the block (full-map directory). Two further bits mark the memory block as not copied, copied, or dirty. Each of the N nodes of the system keeps M/L entries in its directory, where M is the size of the node memory and L is the cache line (i.e. memory block) size, both in bits.

A major scalability concern unique to DASH-like machines is the amount of directory memory required. If the physical memory of the machine grows in proportion to the number of nodes, then using a bit vector to keep track of all clusters caching a memory block does not scale well [Dash 92]. Neglecting the two state bits per entry, the total amount of directory memory is $N^2 \cdot M/L$ bits: N nodes, each holding M/L entries of N bits.
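To make the scaling argument concrete, the following minimal sketch evaluates the directory size for a full-map directory. The function names and the example figures (16 MiB of memory per node, 16-byte cache lines) are assumptions chosen for illustration, not values taken from the slides.

```python
def directory_bits(n_nodes: int, node_mem_bits: int, line_bits: int,
                   state_bits: int = 0) -> int:
    """Total directory memory in bits for a full-map directory.

    n_nodes       -- N, number of nodes (clusters)
    node_mem_bits -- M, memory per node in bits
    line_bits     -- L, cache line (memory block) size in bits
    state_bits    -- extra state bits per entry (0 reproduces N^2 * M / L)
    """
    entries_per_node = node_mem_bits // line_bits   # M / L entries per node
    bits_per_entry = n_nodes + state_bits           # one presence bit per node
    return n_nodes * entries_per_node * bits_per_entry


def directory_overhead(n_nodes: int, line_bits: int, state_bits: int = 0) -> float:
    """Directory bits divided by total memory bits (N * M)."""
    return (n_nodes + state_bits) / line_bits


if __name__ == "__main__":
    M = 16 * 2**20 * 8   # assumed: 16 MiB per node, expressed in bits
    L = 16 * 8           # assumed: 16-byte cache line, expressed in bits
    for n in (16, 64, 256):
        total = directory_bits(n, M, L)
        print(n, total, f"{directory_overhead(n, L):.1%}")
```

Because the overhead relative to total memory is N/L, it grows linearly with the number of nodes, which is the scalability problem described above.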

Transaction examples

In the case of a read of a dirty memory block located in a remote node, the initiator sends a read request to the home node (determined by the top part of the address). The directory in the home node has the memory block marked as dirty, and the node holding the modified copy is marked in the associated bit vector. The home node forwards the read request to the node holding the dirty copy. There, the pseudo CPU issues the request on the local bus, and the directory controller forwards the reply of the local cache to both the requesting node and the home node.

[Figure: Read of dirty remote memory block]
1. read request: local node (Node #1) to home node
2. forward read request: home node to the node holding the dirty copy
3a. read reply: dirty node to local node
3b. sharing writeback: dirty node to home node

In the case of a write, the invalidation-based protocol requires the write buffer to invalidate all copies (that is, to acquire exclusive ownership) before completing the store. A read exclusive request is therefore issued to the home node. The home node answers with a read exclusive reply and sends invalidations to all nodes holding a copy; each of these nodes acknowledges the invalidation directly to the local node. The local node waits until it has received all invalidation acknowledgements (the count is given by the read exclusive reply). Thus, sequential consistency is maintained, and latency for read exclusive requests is minimized.

[Figure: Write to shared remote memory block]
1. read exclusive (RdEx) request: local node (Node #1) to home node (Node #2)
2a. RdEx reply: home node to local node
2b. invalidations: home node to the nodes holding a copy (Nodes #3, #4, #5)
3. invalidate acks: nodes holding a copy to local node
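The two transactions above can be traced with a small directory model. This is a simplified sketch, not the DASH implementation: the names `DirectoryEntry`, `read`, and `write` are hypothetical, a Python set stands in for the bit vector, and the requester, the home node, and the holders of copies are assumed to be distinct nodes. A write to a block that is already dirty elsewhere is not described in the text and is therefore left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    UNCACHED = "not copied"
    SHARED = "copied"
    DIRTY = "dirty"


@dataclass
class DirectoryEntry:
    state: State = State.UNCACHED
    holders: set = field(default_factory=set)  # stands in for the bit vector


def read(entry, requester, home):
    """Trace a read request; returns a list of (message, source, destination)."""
    msgs = [("read request", requester, home)]
    if entry.state is State.DIRTY:
        (owner,) = entry.holders               # exactly one node holds a dirty block
        msgs.append(("forward read request", home, owner))
        msgs.append(("read reply", owner, requester))
        msgs.append(("sharing writeback", owner, home))
        entry.holders = {owner, requester}
    else:
        msgs.append(("read reply", home, requester))
        entry.holders = entry.holders | {requester}
    entry.state = State.SHARED
    return msgs


def write(entry, requester, home):
    """Trace a read exclusive request; returns (messages, acks_expected)."""
    if entry.state is State.DIRTY:
        raise NotImplementedError("dirty-remote write is outside this sketch")
    sharers = sorted(entry.holders - {requester})
    msgs = [("RdEx request", requester, home),
            ("RdEx reply", home, requester)]   # reply carries the ack count
    msgs += [("invalidation", home, s) for s in sharers]
    msgs += [("invalidate ack", s, requester) for s in sharers]
    entry.state = State.DIRTY
    entry.holders = {requester}
    return msgs, len(sharers)


if __name__ == "__main__":
    block = DirectoryEntry(State.SHARED, {3, 4, 5})
    trace, acks = write(block, requester=1, home=2)
    for message in trace:
        print(message)
    print("acks to wait for:", acks)
```

The requester may complete its store only after it has counted `acks_expected` acknowledgements, which mirrors the waiting rule stated in the text.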

Memory hierarchy

Processor level: the processor's own cache.

Local cluster level: the other processor caches within the local cluster.

Directory home level: the directory and memory associated with the address; access goes through the request/reply network.

Remote cluster level: the processor caches in remote clusters.

[Figure: Memory hierarchy of DASH]

References:
Hwang, Xu: Scalable Parallel Computing
Hwang: Advanced Computer Architecture
Lenoski: Scalable Shared-Memory Multiprocessing
