Chap. 3 - Parallel Architectures


Types in an overview
Multiprocessor systems with shared memory
Programming of shared memory systems
Cache & Memory Coherency
Multiprocessor systems with distributed memory
Programming of distributed memory systems
Networks for parallel computers
Vector Processors
Array Computers
New Trends

Types of Parallel Computers (1)

Rough classification scheme according to the number of instruction streams and the number of data streams (classification principle by Flynn, 1966):

                      Data streams
Instruction streams   Single (SD)    Multiple (MD)
Single (SI)           SISD           SIMD
Multiple (MI)         MISD           MIMD

Classical von Neumann computers fall into the class SISD.

Types of Parallel Computers (2)

MIMD - Multiple Instruction, Multiple Data
All multiprocessor systems: each processor can work with an individual instruction stream on an individual stream of operands.
Subclasses:
Multiple processors with shared memory - close to the PRAM model, but without step-wise synchronization
Multiple processors with local memory, connected via a network (Distributed Memory)
Mixed architecture: Distributed Shared Memory (DSM) - hardware structured like distributed memory, but a shared address space via MMU address translation plus software

Types of Parallel Computers (3)

SIMD - Single Instruction, Multiple Data: each instruction causes operations on multiple pairs of data.
Subclasses:
Vector processors (some of the number crunchers, e.g. CRAY, NEC)
Array processors
Early parallel computers (massively parallel)
Nowadays a few special-purpose architectures
ISA extensions: MMX, SSE, AltiVec for a small set of parallel units (a sketch follows below)
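To make the ISA-extension idea concrete, here is a minimal sketch using the x86 SSE intrinsics; the example itself is not from the slides, only an illustration of one instruction operating on four data pairs at once.

#include <xmmintrin.h>  /* SSE intrinsics */
#include <stdio.h>

int main(void)
{
    float a[4] = { 1.0f,  2.0f,  3.0f,  4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);     /* load 4 floats into a 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);  /* one instruction adds all 4 pairs */
    _mm_storeu_ps(c, vc);

    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}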

Shared Memory Multiprocessors

Structure: processors P_0 ... P_(p-1), each with a private cache, connected through a communication network to the memory modules MEM.
Coordination and cooperation using shared variables in memory
A single instance of the operating system

Programming Shared Memory (1)

Options:
Multiple processes (created with fork) communicating via shared memory (shmem) segments
Explicit message passing among multiple processes: Unix pipelines, MPI
Multithreading: threads run on different nodes of the parallel machine; all threads run in a shared address space
OpenMP - a set of compiler directives for controlling multi-threaded, work-shared execution

Programming Shared Memory (2)

Several threads run on several processors under control of the operating system
OS-specific thread functions, e.g. Solaris threads
Portability standard: POSIX threads (pthread library)
Basic functions (a usage sketch follows after the OpenMP example below):

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg);
void pthread_exit(void *value_ptr);
int pthread_join(pthread_t thread, void **value_ptr);

Programming Shared Memory (3)

OpenMP example for loop parallelization:

for (i = 0; i < 256; i++)
    #pragma omp parallel for
    for (j = 0; j < 256; j++) {
        img[i][j] = img[i][j] - minvalue;
        img[i][j] = (int)((float)img[i][j] * (float)maxvalue
                          / (float)(maxvalue - minvalue));
    }

The pragma preprocessor instruction tells the compiler that the following for loop is to be parallelized.
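A minimal usage sketch of the pthread functions listed above (not from the slides; the worker function and thread count are illustrative): create several threads, each receiving its own id, then join them all.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* worker: each thread prints its own id (hypothetical example function) */
void *worker(void *arg)
{
    long id = (long)arg;
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    long i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);

    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);   /* wait for all threads to finish */

    return 0;
}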

Shared Memory and Caching

Caches are used to relieve the network and main memory from frequent data transfers
Non-shared data can be kept in caches for a long time without interaction with main memory
This improves the scalability of the system, but introduces a consistency problem, which is solved by cache coherency protocols.
Coherency: ensures that no stale copies of data are used
Weaker than consistency, i.e. inconsistencies are allowed, but only while keeping track of them

Cache Coherency Protocols

Invalidation: invalidate a copy when another processor writes to the address (snooping); write-through is always necessary
MESI: keep track of the usage of data (snooping); write back only when necessary
Directory-based cache coherency: for systems without a shared address bus

Cache Coherency: MESI (1)

Motivation for MESI: allow the write-back strategy as long as no other processor is accessing the cached address
Protocols similar to MESI also exist for DSM systems without a shared snooping medium (directory-based caches)
The term MESI comes from the four states M, E, S and I

Cache Coherency: MESI (2)

M - Exclusive Modified: the line is exclusively in this cache and has been modified (written)
E - Exclusive Unmodified: the line is exclusively in this cache, but has not been modified, i.e. was only accessed by read operations
S - Shared Unmodified: the line is also present in other processors' caches, but has not been modified
I - Invalid: the line was modified by another processor; the cache entry may not be used

(A simplified transition function is sketched below.)
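A minimal sketch (not from the slides) of how a cache controller might update the MESI state of a line; the event set and the reduction to a single transition function are simplifying assumptions for illustration, real protocols also issue the matching bus transactions.

/* MESI states */
typedef enum { M, E, S, I } mesi_t;

/* events seen by a cache line: local accesses, plus traffic
   snooped from other processors (assumed event set) */
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

mesi_t mesi_next(mesi_t s, event_t e, int others_have_copy)
{
    switch (e) {
    case LOCAL_READ:
        if (s == I)                       /* read miss: fetch the line */
            return others_have_copy ? S : E;
        return s;                         /* M, E, S are unchanged */
    case LOCAL_WRITE:
        return M;                         /* any local write makes the line
                                             Modified (an I or S line first
                                             gains ownership on the bus) */
    case REMOTE_READ:
        if (s == M || s == E)             /* another cache reads our line: */
            return S;                     /* flush if dirty, become Shared */
        return s;
    case REMOTE_WRITE:
        return I;                         /* another processor writes:
                                             our copy becomes Invalid */
    }
    return s;
}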

Cache Coherency: MESI (3)

States and transitions: figure taken from T. Ungerer, Parallelrechner und Parallele Programmierung (not reproduced here).

Memory Consistency Models (1)

A memory consistency model determines in which order processes get notice of the memory accesses of other processes.
Is this really necessary - is there any problem? Normally not. But yes, because some optimizations were introduced into memory access:
Speculative read operations and non-blocking caches
Delayed write operations
(The sketch below shows why these reorderings are visible to programs.)
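A minimal sketch (not in the slides) of why such reorderings matter, the classic two-thread litmus test: under sequential consistency at most one of the two reads can return 0, while weaker models, and real hardware with delayed writes, also allow both to return 0.

#include <pthread.h>
#include <stdio.h>

/* shared flags; 'volatile' only stops the compiler from caching them,
   it does NOT restore sequential consistency at the hardware level */
volatile int x = 0, y = 0;
volatile int r1, r2;

void *t1(void *arg) { x = 1; r1 = y; return NULL; }
void *t2(void *arg) { y = 1; r2 = x; return NULL; }

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, t1, NULL);
    pthread_create(&b, NULL, t2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* sequential consistency forbids r1 == 0 && r2 == 0;
       delayed writes on weaker models make it possible */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}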

Memory Consistency Models (2)

Sequential Consistency - same result as a sequential execution of the operations in some interleaving. Only the local order, as seen by each processor, must be kept. All processors see the same order.
Processor Consistency - the order complies with the local order of each processor, interleaved arbitrarily. Different processors may see different orders.
Weak Consistency - order is guaranteed only with respect to synchronization operations (memory barriers).
Release Consistency - accesses are classified as concurrent and non-concurrent; non-concurrent accesses are seen in a processor-consistent way, concurrent accesses are ordered with respect to lock and release operations.

Distributed Memory Multiprocessors

Structure: processors P_0 ... P_(p-1), each with its own local memory MEM, connected via a communication network.
No shared memory; only local memory distributed across the processors
Coordination and cooperation using message transfer
Each node runs an instance of the operating system or a micro-kernel

Programming Distributed Memory (1)

Explicit message passing:
Occam (based on CSP)
Sequential programming languages (Fortran, C, C++, Java) with extensions
Libraries: most common are MPI, MPI-2 and PVM
Explicit message passing also works on shared memory

Programming Distributed Memory (2)

Approach:
The same program is started on each node; on each node a process is created
The process asks for its processor number (r) and the number of processes on the parallel computer (s)
The process decides according to r and s what to do
The process selects a subset of the input data according to r and s
This concept is called SPMD: Single Program, Multiple Data (a sketch follows below)
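A minimal MPI sketch of this SPMD pattern (not from the slides; the problem size N and the block partitioning are assumptions for illustration): every process runs the same program, queries its rank r and the process count s, and picks its part of the data.

#include <mpi.h>
#include <stdio.h>

#define N 1000   /* total problem size (assumed) */

int main(int argc, char **argv)
{
    int r, s;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &r);  /* processor number r */
    MPI_Comm_size(MPI_COMM_WORLD, &s);  /* number of processes s */

    /* each process selects its subset of the input data */
    int chunk = N / s;
    int first = r * chunk;
    int last  = (r == s - 1) ? N : first + chunk;  /* last rank takes the rest */

    printf("process %d of %d handles elements %d..%d\n", r, s, first, last - 1);

    MPI_Finalize();
    return 0;
}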

Networks for Parallel Computers (1)

A necessary condition is a direct or indirect connection between each pair of processors. Indirect connections have to be handled by a message-routing protocol.
Static network types:
Bus: time-shared medium, easy to implement, not scalable to larger systems
Mesh topologies: distinct nodes are statically connected; indirect connections via multiple hops are allowed
Fully connected: best choice, but expensive and hard to implement for large systems
Grid and torus (2d and 3d), hypercube, ... (a hypercube routing sketch follows after the next slide)

Networks for Parallel Computers (2)

Dynamic networks (with switches):
Switched network: star-like topology with a switch as the central node, appropriate for small systems; hierarchies of switches for large systems
Switches with N inputs and N outputs can be built differently:
A crossbar is a matrix of N^2 switching points
Multiple stages of lower-complexity switches
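In a d-dimensional hypercube, nodes are numbered 0 .. 2^d - 1 and two nodes are connected exactly when their numbers differ in one bit. A minimal sketch (not from the slides) of neighbor computation and dimension-order routing along such indirect connections:

#include <stdio.h>

/* neighbor of node n along dimension dim: flip bit dim */
int neighbor(int n, int dim) { return n ^ (1 << dim); }

/* dimension-order routing: correct differing bits, lowest dimension first */
void route(int src, int dst, int d)
{
    int node = src;
    printf("%d", node);
    for (int dim = 0; dim < d; dim++)
        if ((node ^ dst) & (1 << dim)) {   /* bits still differ here */
            node = neighbor(node, dim);
            printf(" -> %d", node);
        }
    printf("\n");
}

int main(void)
{
    route(0, 5, 3);   /* in a 3-cube this prints: 0 -> 1 -> 5 */
    return 0;
}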

Vector Processors (1)

SIMD computers, based on the arithmetic pipelining principle:
Split the execution of floating point instructions into steps:
Load operand pair from vector register
Compare exponents
Match exponents by shifting a mantissa
Execute the operation on mantissa and exponent
Normalize the result
Store the result back into the vector register
In a microprocessor these steps are normally executed sequentially; a vector processor executes them in a pipelined way.

Vector Processors (2)

Speedup of a vector pipeline, with n the length of the vectors and k the number of pipeline stages:

S_p = T_1 / T_p = (n * k) / (k + n - 1)

For n >> k: S_p ≈ k
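A quick numeric check with assumed values: for vectors of length n = 1000 on a k = 5 stage pipeline, S_p = (1000 * 5) / (5 + 1000 - 1) = 5000 / 1004 ≈ 4.98, already close to the asymptotic speedup of k = 5.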

Vector Processors (3)

Generic structure of a vector computer. Components:
Control unit, with instruction buffer and execution control
At least one scalar processing unit
Vector unit, composed of many (specialized) vector pipelines
Registers: scalar and vector
Interleaved main memory
Load/store (L/S) units

Vector Processors (4)

Vector computers are mostly load/store architectures.
Vector registers:
Act as source and destination for the vector pipelines
Store temporary data in chained vector operations
Overlap memory access with the operand flow to the vector pipelines: continuous refill for load operations, continuous store-back for write operations
The interleaved main memory allows sequential access by the pipeline at a high clock rate

Vector Processors (5)

Chaining of vector operations, e.g. a vector multiply-add:

VMA V0, V1, V2, V4   ; V0 * V1 + V2 -> V4

The result stream of the multiply pipeline is fed directly into the add pipeline. Chaining allows k to be increased and thus a higher speedup to be obtained.

Vector Processors (6)

Types of parallelism in vector computers (see the loop sketch below):
Vector-pipeline parallelism: iterations on the same type of operands can be executed as pipelined vector instructions
Usage of multiple vector pipelines: execute several independent vector operations in parallel; split large vector pairs and process the parts in parallel on multiple pipelines
Chaining of vector operations
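A minimal sketch (not from the slides) of a loop that a vectorizing compiler can map onto exactly such a chained multiply-add: per strip of the loop, one vector multiply feeds directly into one vector add.

/* a = b * c + d, element-wise */
void vma(float *a, const float *b, const float *c, const float *d, int n)
{
    /* a vectorizing compiler strip-mines this loop into vector
       instructions: load b, c, d into vector registers, chain the
       multiply pipeline into the add pipeline, store the result */
    for (int i = 0; i < n; i++)
        a[i] = b[i] * c[i] + d[i];
}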

Array Computers

SIMD computers with a real array structure:
A single control unit decodes the instructions and generates control signals
A large number of processing elements, each with local registers, execute the same instructions step-synchronized, but on different data
The processing elements exchange data over a neighborhood network

New Trends (1)

Parallel processing moves into modern processors.
Shared memory multiprocessor systems on a chip:
(1) Multicore processors
(2) Multithreaded processors with many virtual processors (e.g. Hyper-Threading)
Combinations of (1) and (2) have been announced
Another trend: many small processors on a chip
Relatively small local memories
Connected via an on-chip network
DMA engines for remote memory transfers
Example: IBM Cell

New Trends (2)

Cell architecture: (figure not reproduced)

New Trends (3)

Synergistic processor element (SPE) architecture: (figure not reproduced)

Summary

Several classes of parallel computers:
SIMD - vector and array processors
Programming with vector or array instructions
Vector processors are suited for problems with a huge fraction of floating point calculations
Array computers are suited for regularly structured problems, e.g. image processing
MIMD multiprocessors
The most universal class
Need explicit parallel programming or compiler tools for automatic code parallelization
Shared memory requires techniques for consistency, but programming is easier
