Parallel Programming: Parallel Architectures
Diego Fabregat-Traver and Prof. Paolo Bientinesi
HPAC, RWTH Aachen
fabregat@aices.rwth-aachen.de
WS15/16
Acknowledgements
  Prof. Felix Wolf, TU Darmstadt
  Prof. Matthias Müller, ITC, RWTH Aachen

References
  Computer Organization and Design. David A. Patterson, John L. Hennessy. Chapter 7.
  Computer Architecture: A Quantitative Approach. John L. Hennessy, David A. Patterson. Appendix F.
Outline
  1 Flynn's Taxonomy
  2 Shared-memory Architectures
  3 Distributed-memory Architectures
  4 Interconnection Networks
Flynn's Taxonomy
  Classification according to instruction and data streams:
    Single instruction stream, single data stream (SISD): classical single-core processor
    Single instruction stream, multiple data stream (SIMD): vector extensions, GPUs (illustrated below)
    Multiple instruction stream, single data stream (MISD): no commercial processor exists
    Multiple instruction stream, multiple data stream (MIMD): multi-core processors, multiprocessors, clusters
  From a parallel programming perspective, only two are relevant: SIMD and MIMD.
  Focus of this course: MIMD.
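As a concrete illustration of the SIMD idea (not taken from the original slides): the loop below is purely data parallel, so a vectorizing compiler can map it to vector instructions (e.g., SSE/AVX), and the same pattern is what a GPU executes across many lanes.

    /* One instruction stream applied to many data elements.
       A vectorizing compiler can execute several iterations per instruction. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }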
Multiple-Instruction Multiple-Data (MIMD)
  Most general model: each processor works on its own data with its own instruction stream
  In practice: Single Program Multiple Data (SPMD), sketched below
    All processors execute the same program
    Just not necessarily the same instruction at the same time
    Control flow is relatively independent (it can be completely different)
    The amount of data to process may vary
  Further breakdown based on memory organization:
    Shared-memory systems
    Distributed-memory systems
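A minimal sketch of the SPMD style, assuming MPI as the runtime (the work split is illustrative, not from the slides): every process runs the same program, and its rank decides which slice of the data it handles.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us? */

        /* Same program everywhere; each rank processes its own slice. */
        int n = 1000;
        int chunk = n / size;
        int first = rank * chunk;
        int last  = (rank == size - 1) ? n : first + chunk;
        printf("Rank %d of %d handles elements [%d, %d)\n", rank, size, first, last);

        MPI_Finalize();
        return 0;
    }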
Shared Memory Multiprocessors
Shared Memory Multiprocessors
  Programmer's view:
    Single physical address space
    Processors communicate via shared variables in memory
    All processors can access any location via loads and stores
  Usually come in one of two flavors:
    Uniform memory access (UMA)
    Nonuniform memory access (NUMA)
Shared Memory Multiprocessors: Uniform Memory Access (UMA)
  Accessing main memory takes about the same time,
  regardless of which processor issues the access and which address it targets
Shared Memory Multiprocessors: Nonuniform Memory Access (NUMA)
  Some memory accesses are faster than others,
  depending on which processor accesses which word
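A minimal sketch of the shared-memory model described above, assuming OpenMP: all threads read the same array and accumulate into a single shared result, i.e., they communicate through ordinary loads and stores to one address space.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        double a[1000], sum = 0.0;                 /* visible to all threads */
        for (int i = 0; i < 1000; i++) a[i] = 1.0;

        /* Each thread adds its share of a[] into the shared variable sum;
           the reduction clause avoids a data race on sum. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000; i++)
            sum += a[i];

        printf("sum = %.1f (up to %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }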
Cluster of Processors
Cluster of Processors
  Each processor (node) has its own private address space
  Processors communicate via message passing
  Coordination/synchronization via send/receive routines (sketched below)
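A minimal sketch of message passing, assuming MPI: rank 0 sends a value to rank 1 with explicit send/receive routines; nothing is shared, the data is copied between private address spaces.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                /* sender: owns the original data */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {         /* receiver: gets a private copy  */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }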
Cluster of (Multi)Processors
  Nowadays, we typically find hybrid configurations:
    Commodity clusters: standard nodes, standard interconnection network
    Custom clusters: custom nodes, custom interconnection network (example: IBM BlueGene)
Interconnection Networks
  Components of a network:
    Nodes
    Links
    Interconnection
Network Performance: Bandwidth
  Bandwidth: maximum rate at which data can be transferred
  Aggregate bandwidth: total data bandwidth supplied by the network
  Effective bandwidth (throughput): fraction of the aggregate bandwidth delivered to the application
Network Performance: Latency
  Latency: time to send and receive a message
  [Figure: components of packet latency. Source: Computer Architecture: A Quantitative Approach, Appendix F. Hennessy, Patterson.]
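The referenced figure decomposes end-to-end packet latency into four parts; written out (adapted from Appendix F, so treat the exact terms as a paraphrase):

    \text{Latency} = \text{Sending overhead} + \text{Time of flight}
                   + \frac{\text{Packet size}}{\text{Bandwidth}} + \text{Receiving overhead}

The third term is the transmission time; it is why long messages are bandwidth-bound, while short messages are dominated by the fixed overheads and the time of flight.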
Shared-media networks
  Only one message at a time
  Processors broadcast their message over the medium
  Each processor listens to every message and receives the ones for which it is the destination
  Decentralized arbitration:
    Before sending a message, a processor listens until the medium is free
    Message collisions can degrade performance
  Low cost, but does not scale
  Example: bus networks connecting processors to memory
Switched-media networks
  Support point-to-point messages between nodes
  Each node has its own communication path to the switch
  Advantages:
    Concurrent transmission of multiple messages among different node pairs
    Scales much better than a bus
Crossbar switch
  Non-blocking: links are not shared among paths to unique destinations
  Requires n^2 crosspoint switches
  Limited scalability
Omega network
  Multi-stage interconnection network (MIN)
  Splits the crossbar into multiple stages of simpler switches
  Complexity: O(n log(n))
    With k x k switches: log_k(n) stages, (n/k) log_k(n) switches
  Blocking, due to paths between different sources and destinations simultaneously sharing network links
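For a concrete feel of the cost (these numbers are not on the original slide): with n = 64 end nodes and 2 x 2 switches (k = 2),

    \log_k(n) = \log_2 64 = 6 \text{ stages}, \qquad
    \frac{n}{k}\,\log_k(n) = \frac{64}{2} \cdot 6 = 192 \text{ switches},

compared to n^2 = 4096 crosspoints for a single 64 x 64 crossbar.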
Distributed switched networks
  Each network switch has one or more end-node devices directly attached to it
  Mostly used for distributed-memory architectures
    End-node device: processor(s) + memory
    Network node: end node + switch
  Nodes are directly connected to other nodes without going through external switches
  Also called direct or static interconnection networks
  Ratio of switches to nodes: 1:1
Evaluation criteria
  Network degree: maximum node degree
    Node degree: number of adjacent nodes (in + out edges)
  Diameter: largest distance between two nodes
  Bisection width (or bisection bandwidth): minimum number of edges that must be removed to cut the network into two roughly equal halves
  Edge/node connectivity: minimum number of edges/nodes that must be removed to disconnect the network
Requirements
  Low network degree, to reduce hardware costs
  Low diameter, to ensure low distance (i.e., latency) for message transfers
  High bisection bandwidth, to ensure high throughput
  High connectivity, to ensure robustness
  Good scalability, to connect large numbers of nodes
Fully connected topology
  Each node is directly connected to every other node
  Expensive for large numbers of nodes: dedicated link between each pair of nodes

  Assuming 64 nodes
    Performance:
      Diameter: 1
      Bisection BW (# links): 1024
      Edge connectivity: 63
    Cost:
      # Switches: 64
      Network degree: 64
      # Links: 2080
Ring topology
  Lower cost: n switches (3 x 3), n network links
  Not a bus! Simultaneous transfers are possible

  Assuming 64 nodes
    Performance:
      Diameter: 32
      Bisection BW (# links): 2
      Edge connectivity: 2
    Cost:
      # Switches: 64
      Network degree: 3
      # Links: 128
N-dimensional meshes
  Typically 2 or 3 dimensions
  Direct links to neighbors
    Each node has 1 or 2 neighbors per dimension: 2 in the interior, fewer for border or corner nodes
  Efficient nearest-neighbor communication
  Suitable for large numbers of nodes

  Assuming 64 nodes (8 x 8, 2D)
    Performance:
      Diameter: 14
      Bisection BW (# links): 8
      Edge connectivity: 2
    Cost:
      # Switches: 64
      Network degree: 5
      # Links: 176
Torus
  Mesh with wrap-around connections
  Each node has exactly 2 neighbors per dimension
Torus
  Assuming 64 nodes (8 x 8, 2D)
    Performance:
      Diameter: 8
      Bisection BW (# links): 16
      Edge connectivity: 4
    Cost:
      # Switches: 64
      Network degree: 5
      # Links: 192
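The 64-node figures on the preceding slides follow from simple closed-form expressions. The sketch below, assuming the counting convention of Appendix F (the link from each end node to its switch is included, adding n links to every topology), prints diameter, bisection width, and total link count for the four topologies.

    #include <stdio.h>

    /* Reproduces the 64-node figures from the preceding slides. */
    int main(void) {
        int n = 64;       /* number of nodes           */
        int k = 8;        /* k x k grid for mesh/torus */

        printf("fully connected: diameter %2d, bisection %4d, links %4d\n",
               1, (n / 2) * (n / 2), n * (n - 1) / 2 + n);
        printf("ring:            diameter %2d, bisection %4d, links %4d\n",
               n / 2, 2, n + n);
        printf("2D mesh (8x8):   diameter %2d, bisection %4d, links %4d\n",
               2 * (k - 1), k, 2 * k * (k - 1) + n);
        printf("2D torus (8x8):  diameter %2d, bisection %4d, links %4d\n",
               2 * (k / 2), 2 * k, 2 * k * k + n);
        return 0;
    }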
Summary
  Flynn's classification; this course focuses on MIMD
  Shared-memory architectures
    Single address space
    Communication: shared variables
  Distributed-memory architectures
    Multiple private address spaces
    Communication: message passing
  Network topologies, performance, and cost
    Latency, bandwidth
    Diameter, bisection bandwidth, connectivity
    # switches, # links, network degree