Parallel Programming: Parallel Architectures
Diego Fabregat-Traver and Prof. Paolo Bientinesi
HPAC, RWTH Aachen
fabregat@aices.rwth-aachen.de
WS15/16
Acknowledgements
  Prof. Felix Wolf, TU Darmstadt
  Prof. Matthias Müller, ITC, RWTH Aachen

References
  Computer Organization and Design. David A. Patterson, John L. Hennessy. Chapter 7.
  Computer Architecture: A Quantitative Approach. John L. Hennessy, David A. Patterson. Appendix F.
Outline
  1 Flynn's Taxonomy
  2 Shared-memory Architectures
  3 Distributed-memory Architectures
  4 Interconnection Networks
Flynn's Taxonomy
  Classification according to instruction and data streams:
    Single instruction stream, single data stream (SISD): classical single-core processor
    Single instruction stream, multiple data stream (SIMD): vector extensions, GPUs (illustrated below)
    Multiple instruction stream, single data stream (MISD): no commercial processor exists
    Multiple instruction stream, multiple data stream (MIMD): multi-core processors, multiprocessors, clusters
  From a parallel programming perspective, only two are relevant: SIMD and MIMD.
  Focus of this course: MIMD.
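As a concrete illustration of the SIMD idea (not taken from the original slides): the loop below is purely data parallel, so a vectorizing compiler can map it to vector instructions (e.g., SSE/AVX), and the same pattern is what a GPU executes across many lanes.

    /* One instruction stream applied to many data elements.
       A vectorizing compiler can execute several iterations per instruction. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }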
Multiple-Instruction Multiple-Data (MIMD)
  Most general model: each processor works on its own data with its own instruction stream
  In practice: Single Program Multiple Data (SPMD), sketched below
    All processors execute the same program
    Just not necessarily the same instruction at the same time
    Control flow is relatively independent (it can be completely different)
    The amount of data to process may vary
  Further breakdown based on memory organization:
    Shared-memory systems
    Distributed-memory systems
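A minimal sketch of the SPMD style, assuming MPI as the runtime (the work split is illustrative, not from the slides): every process runs the same program, and its rank decides which slice of the data it handles.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us? */

        /* Same program everywhere; each rank processes its own slice. */
        int n = 1000;
        int chunk = n / size;
        int first = rank * chunk;
        int last  = (rank == size - 1) ? n : first + chunk;
        printf("Rank %d of %d handles elements [%d, %d)\n", rank, size, first, last);

        MPI_Finalize();
        return 0;
    }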
Shared Memory Multiprocessors
Shared Memory Multiprocessors
  Programmer's view:
    Single physical address space
    Processors communicate via shared variables in memory
    All processors can access any location via loads and stores
  Usually come in one of two flavors:
    Uniform memory access (UMA)
    Nonuniform memory access (NUMA)
Shared Memory Multiprocessors: Uniform Memory Access (UMA)
  Accessing main memory takes about the same time,
  regardless of which processor issues the access and which address it targets
Shared Memory Multiprocessors: Nonuniform Memory Access (NUMA)
  Some memory accesses are faster than others,
  depending on which processor accesses which word
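A minimal sketch of the shared-memory model described above, assuming OpenMP: all threads read the same array and accumulate into a single shared result, i.e., they communicate through ordinary loads and stores to one address space.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        double a[1000], sum = 0.0;                 /* visible to all threads */
        for (int i = 0; i < 1000; i++) a[i] = 1.0;

        /* Each thread adds its share of a[] into the shared variable sum;
           the reduction clause avoids a data race on sum. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 1000; i++)
            sum += a[i];

        printf("sum = %.1f (up to %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }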
Cluster of Processors
Cluster of Processors
  Each processor (node) has its own private address space
  Processors communicate via message passing
  Coordination/synchronization via send/receive routines (sketched below)
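A minimal sketch of message passing, assuming MPI: rank 0 sends a value to rank 1 with explicit send/receive routines; nothing is shared, the data is copied between private address spaces.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                /* sender: owns the original data */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {         /* receiver: gets a private copy  */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }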
Cluster of (Multi)Processors
  Nowadays, we typically find hybrid configurations:
    Commodity clusters: standard nodes, standard interconnection network
    Custom clusters: custom nodes, custom interconnection network (example: IBM BlueGene)
Interconnection Networks
  Components of a network:
    Nodes
    Links
    Interconnection
Network Performance: Bandwidth
  Bandwidth: maximum rate at which data can be transferred
  Aggregate bandwidth: total data bandwidth supplied by the network
  Effective bandwidth (throughput): fraction of the aggregate bandwidth delivered to the application
Network Performance: Latency
  Latency: time to send and receive a message
  [Figure: components of packet latency. Source: Computer Architecture: A Quantitative Approach, Appendix F. Hennessy, Patterson.]
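The referenced figure decomposes end-to-end packet latency into four parts; written out (adapted from Appendix F, so treat the exact terms as a paraphrase):

    \text{Latency} = \text{Sending overhead} + \text{Time of flight}
                   + \frac{\text{Packet size}}{\text{Bandwidth}} + \text{Receiving overhead}

The third term is the transmission time; it is why long messages are bandwidth-bound, while short messages are dominated by the fixed overheads and the time of flight.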
Shared-media networks
  Only one message at a time
  Processors broadcast their message over the medium
  Each processor listens to every message and receives the ones for which it is the destination
  Decentralized arbitration:
    Before sending a message, a processor listens until the medium is free
    Message collisions can degrade performance
  Low cost, but does not scale
  Example: bus networks connecting processors to memory
Switched-media networks
  Support point-to-point messages between nodes
  Each node has its own communication path to the switch
  Advantages:
    Concurrent transmission of multiple messages among different node pairs
    Scales much better than a bus
Crossbar switch
  Non-blocking: links are not shared among paths to unique destinations
  Requires n^2 crosspoint switches
  Limited scalability
Omega network
  Multi-stage interconnection network (MIN)
  Splits the crossbar into multiple stages of simpler switches
  Complexity: O(n log(n))
    With k x k switches: log_k(n) stages, (n/k) log_k(n) switches
  Blocking, due to paths between different sources and destinations simultaneously sharing network links
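For a concrete feel of the cost (these numbers are not on the original slide): with n = 64 end nodes and 2 x 2 switches (k = 2),

    \log_k(n) = \log_2 64 = 6 \text{ stages}, \qquad
    \frac{n}{k}\,\log_k(n) = \frac{64}{2} \cdot 6 = 192 \text{ switches},

compared to n^2 = 4096 crosspoints for a single 64 x 64 crossbar.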
Distributed switched networks
  Each network switch has one or more end-node devices directly attached to it
  Mostly used for distributed-memory architectures
    End-node device: processor(s) + memory
    Network node: end node + switch
  Nodes are directly connected to other nodes without going through external switches
  Also called direct or static interconnection networks
  Ratio of switches to nodes: 1:1
Evaluation criteria
  Network degree: maximum node degree
    Node degree: number of adjacent nodes (in + out edges)
  Diameter: largest distance between two nodes
  Bisection width (or bisection bandwidth): minimum number of edges that must be removed to cut the network into two roughly equal halves
  Edge/node connectivity: minimum number of edges/nodes that must be removed to disconnect the network
Requirements
  Low network degree, to reduce hardware costs
  Low diameter, to ensure low distance (i.e., latency) for message transfers
  High bisection bandwidth, to ensure high throughput
  High connectivity, to ensure robustness
  Good scalability, to connect large numbers of nodes
Fully connected topology
  Each node is directly connected to every other node
  Expensive for large numbers of nodes: dedicated link between each pair of nodes

  Assuming 64 nodes
    Performance:
      Diameter: 1
      Bisection BW (# links): 1024
      Edge connectivity: 63
    Cost:
      # Switches: 64
      Network degree: 64
      # Links: 2080
Ring topology
  Lower cost: n switches (3 x 3), n network links
  Not a bus! Simultaneous transfers are possible

  Assuming 64 nodes
    Performance:
      Diameter: 32
      Bisection BW (# links): 2
      Edge connectivity: 2
    Cost:
      # Switches: 64
      Network degree: 3
      # Links: 128
N-dimensional meshes
  Typically 2 or 3 dimensions
  Direct links to neighbors
    Each node has 1 or 2 neighbors per dimension: 2 in the interior, fewer for border or corner nodes
  Efficient nearest-neighbor communication
  Suitable for large numbers of nodes

  Assuming 64 nodes (8 x 8, 2D)
    Performance:
      Diameter: 14
      Bisection BW (# links): 8
      Edge connectivity: 2
    Cost:
      # Switches: 64
      Network degree: 5
      # Links: 176
Torus
  Mesh with wrap-around connections
  Each node has exactly 2 neighbors per dimension
Torus
  Assuming 64 nodes (8 x 8, 2D)
    Performance:
      Diameter: 8
      Bisection BW (# links): 16
      Edge connectivity: 4
    Cost:
      # Switches: 64
      Network degree: 5
      # Links: 192
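The 64-node figures on the preceding slides follow from simple closed-form expressions. The sketch below, assuming the counting convention of Appendix F (the link from each end node to its switch is included, adding n links to every topology), prints diameter, bisection width, and total link count for the four topologies.

    #include <stdio.h>

    /* Reproduces the 64-node figures from the preceding slides. */
    int main(void) {
        int n = 64;       /* number of nodes           */
        int k = 8;        /* k x k grid for mesh/torus */

        printf("fully connected: diameter %2d, bisection %4d, links %4d\n",
               1, (n / 2) * (n / 2), n * (n - 1) / 2 + n);
        printf("ring:            diameter %2d, bisection %4d, links %4d\n",
               n / 2, 2, n + n);
        printf("2D mesh (8x8):   diameter %2d, bisection %4d, links %4d\n",
               2 * (k - 1), k, 2 * k * (k - 1) + n);
        printf("2D torus (8x8):  diameter %2d, bisection %4d, links %4d\n",
               2 * (k / 2), 2 * k, 2 * k * k + n);
        return 0;
    }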
Summary
  Flynn's classification; this course focuses on MIMD
  Shared-memory architectures
    Single address space
    Communication: shared variables
  Distributed-memory architectures
    Multiple private address spaces
    Communication: message passing
  Network topologies, performance, and cost
    Latency, bandwidth
    Diameter, bisection bandwidth, connectivity
    # switches, # links, network degree