Introduction to Parallel Computing. George Karypis Parallel Programming Platforms





Elements of a Parallel Computer
- Hardware
  - Multiple Processors
  - Multiple Memories
  - Interconnection Network
- System Software
  - Parallel Operating System
  - Programming Constructs to Express/Orchestrate Concurrency
- Application Software
  - Parallel Algorithms
Goal: Utilize the Hardware, System, & Application Software to either
- Achieve Speedup: T_p = T_s / p
- Solve problems requiring a large amount of memory

Parallel Computing Platform
- Logical Organization
  - The user's view of the machine as it is being presented via its system software
- Physical Organization
  - The actual hardware architecture
- Physical Architecture is to a large extent independent of the Logical Architecture

Logical Organization Elements
Control Mechanism
- SISD/SIMD/MIMD/MISD: Single/Multiple Instruction Stream & Single/Multiple Data Stream
- SPMD: Single Program Multiple Data

Logical Organization Elements
Communication Model
- Shared-Address Space
  - UMA/NUMA/ccNUMA
- Message-Passing

Physical Organization
Ideal Parallel Computer Architecture
- PRAM: Parallel Random Access Machine
- PRAM Models
  - EREW/ERCW/CREW/CRCW: Exclusive/Concurrent Read and/or Write
  - Concurrent Writes are resolved via Common/Arbitrary/Priority/Sum
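The four concurrent-write policies named above can be illustrated with a small sketch. This is not taken from the slides: the function name `resolve_concurrent_write`, the `(processor_id, value)` representation, and the choice that the lowest processor id wins under Priority are all assumptions made for illustration.

```python
def resolve_concurrent_write(writes, policy):
    """Resolve several writes aimed at one memory cell in the same PRAM step.

    writes: list of (processor_id, value) pairs (hypothetical representation).
    policy: one of "common", "arbitrary", "priority", "sum".
    """
    if not writes:
        raise ValueError("no writes to resolve")
    values = [value for _, value in writes]
    if policy == "common":
        # Common: the write is allowed only if every processor writes the same value.
        if len(set(values)) != 1:
            raise ValueError("common CRCW requires identical values")
        return values[0]
    if policy == "arbitrary":
        # Arbitrary: any single write may succeed; this sketch takes the first listed.
        return values[0]
    if policy == "priority":
        # Priority: assumed here that the lowest processor id has highest priority.
        return min(writes, key=lambda w: w[0])[1]
    if policy == "sum":
        # Sum: the cell receives the sum of all written values.
        return sum(values)
    raise ValueError(f"unknown policy: {policy}")
```

For example, with writes `[(2, 5), (0, 7), (1, 1)]` the Priority policy (under the lowest-id assumption) stores 7 and the Sum policy stores 13.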

Physical Organization
Interconnection Networks (ICNs)
- Provide processor-to-processor and processor-to-memory connections
- Networks are classified as:
  - Static
    - Consist of a number of point-to-point links (direct network)
    - Historically used to link processors to processors (distributed-memory systems)
  - Dynamic
    - The network consists of switching elements that the various processors attach to (indirect network)
    - Historically used to link processors to memory (shared-memory systems)

Static & Dynamic ICNs

Evaluation Metrics for ICNs
- Diameter
  - The maximum distance between any two nodes
  - Smaller the better
- Connectivity
  - The minimum number of arcs that must be removed to break the network into two disconnected networks
  - Larger the better
  - Measures the multiplicity of paths
- Bisection width
  - The minimum number of arcs that must be removed to partition the network into two equal halves
  - Larger the better
- Bisection bandwidth
  - Applies to networks with weighted arcs; weights correspond to the link width (how much data it can transfer)
  - The minimum volume of communication allowed between any two halves of a network
  - Larger the better
- Cost
  - The number of links in the network
  - Smaller the better
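As a rough illustration of two of these metrics, the sketch below computes the diameter (by breadth-first search from every node) and the cost (number of links) of a small ring and a small hypercube. The helper names `ring`, `hypercube`, `diameter`, and `cost` are assumptions for this example, and the ring builder assumes at least three nodes.

```python
from collections import deque


def ring(p):
    """Adjacency lists of a p-node ring (assumes p >= 3)."""
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}


def hypercube(d):
    """Adjacency lists of a d-dimensional hypercube: neighbors differ in one bit."""
    return {i: [i ^ (1 << b) for b in range(d)] for i in range(1 << d)}


def diameter(adj):
    """Maximum shortest-path distance between any two nodes (BFS from each node)."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def cost(adj):
    """Number of links; each undirected link appears in two adjacency lists."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2
```

For 8 nodes, the ring has diameter 4 and cost 8, while the 3-dimensional hypercube has diameter 3 and cost 12: a shorter diameter bought with more links.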

Metrics and Dynamic Networks

Network Topologies
Bus-Based Networks
- Shared medium
- Information is broadcast
- Evaluation:
  - Diameter: O(1)
  - Connectivity: O(1)
  - Bisection width: O(1)
  - Cost: O(p)

Network Topologies
Crossbar Networks
- Switch-based network
- Supports simultaneous connections
- Evaluation:
  - Diameter: O(1)
  - Connectivity: O(1)?
  - Bisection width: O(p)?
  - Cost: O(p^2)

Network Topologies Multistage Interconnection Networks

Multistage Switch Architecture Pass-through Cross-over

Connecting the Various Stages

Blocking in a Multistage Switch
Routing is done by comparing the bit-level representation of source and destination addresses.
- match goes via pass-through
- mismatch goes via cross-over
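The match/mismatch rule can be traced with a short simulation. The sketch assumes an omega-style network in which a perfect shuffle (a one-bit left rotation of the position) precedes each stage, and that stage i compares the i-th most significant bits of source and destination; the slides do not spell out either detail, and the function name `omega_route` is hypothetical.

```python
def omega_route(src, dst, n_bits):
    """Trace one message through an omega network with 2**n_bits inputs.

    Returns (settings, final_position), where settings lists the switch
    mode used at each stage. Wiring and bit order are assumptions.
    """
    mask = (1 << n_bits) - 1
    pos = src
    settings = []
    for stage in range(n_bits):
        bit = n_bits - 1 - stage          # compare most significant bit first
        s_bit = (src >> bit) & 1
        d_bit = (dst >> bit) & 1
        # Perfect shuffle between stages: rotate the position left by one bit.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & mask
        if s_bit == d_bit:
            settings.append("pass-through")   # match: keep the same switch port
        else:
            settings.append("cross-over")     # mismatch: move to the other port
            pos ^= 1
    return settings, pos
```

Routing from input 2 (010) to output 7 (111) in an 8-input network gives cross-over, pass-through, cross-over, and the message ends at position 7. Two messages block each other when they need the same output port of the same switch at the same stage.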

Network Topologies Complete and star-connected networks.

Network Topologies Cartesian Topologies

Network Topologies Hypercubes

Network Topologies Trees

Summary of Performance Metrics

Physical Organization
Cache Coherence in Shared Memory Systems
- A certain level of consistency must be maintained for multiple copies of the same data
- Required to ensure proper semantics and correct program execution (serializability)
- Two general protocols for dealing with it: invalidate & update

Invalidate/Update Protocols

Invalidate/Update Protocols
- The preferred scheme depends on the characteristics of the underlying application
  - frequency of reads/writes to shared variables
- Classical trade-off between communication overhead (updates) and idling (stalling in invalidates)
- Additional problems with false sharing
- Existing schemes are based on the invalidate protocol
- A number of approaches have been developed for maintaining the state/ownership of the shared data

Communication Costs in Parallel Systems
Message-Passing Systems
The communication cost of a data-transfer operation depends on:
- start-up time: t_s
  - add headers/trailer, error-correction, execute the routing algorithm, establish the connection between source & destination
- per-hop time: t_h
  - time to travel between two directly connected nodes (node latency)
- per-word transfer time: t_w
  - 1/channel-width

Store-and-Forward & Cut-Through Routing

Cut-through Routing Deadlocks
Messages 0, 1, 2, and 3 need to go to nodes A, B, C, and D, respectively.

Communication Model Used for this Class
We will assume that the cost of sending a message of size m is:
    t_comm = t_s + m t_w
In general this holds because t_s is much larger than t_h, and for most of the algorithms that we will study m t_w is much larger than l t_h, where l is the number of links the message traverses.
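To see why dropping the per-hop term is usually harmless, the sketch below compares the simplified cost t_s + m t_w with the standard textbook expressions for store-and-forward, t_s + (m t_w + t_h) l, and cut-through, t_s + l t_h + m t_w. The slides name both routing schemes but do not write these formulas out, so treat them, the function names, and the sample parameter values as assumptions.

```python
def store_and_forward_cost(m, l, t_s, t_h, t_w):
    """Each of the l hops receives the whole m-word message before forwarding it."""
    return t_s + (m * t_w + t_h) * l


def cut_through_cost(m, l, t_s, t_h, t_w):
    """The message is pipelined, so the hop count adds only l * t_h."""
    return t_s + l * t_h + m * t_w


def simplified_cost(m, t_s, t_w):
    """Simplified model: ignore the per-hop term entirely."""
    return t_s + m * t_w


# Hypothetical parameters in arbitrary time units.
t_s, t_h, t_w = 100, 1, 2
m, l = 1000, 4
print(store_and_forward_cost(m, l, t_s, t_h, t_w))
print(cut_through_cost(m, l, t_s, t_h, t_w))
print(simplified_cost(m, t_s, t_w))
```

With these made-up values, the cut-through cost and the simplified cost differ only by l t_h = 4 time units out of roughly 2100, while store-and-forward pays the full transfer time on every hop.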

Routing Mechanisms
Routing: The algorithm used to determine the path that a message will take to go from the source to the destination
Can be classified along different dimensions:
- minimal vs non-minimal
- deterministic vs adaptive

Dimension Ordered Routing
- There is a predefined ordering of the dimensions
- Messages are routed along the dimensions in that order until they cannot move any further
- X-Y routing for meshes
- E-cube routing for hypercubes
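Both schemes can be sketched in a few lines. The version below assumes a 2D mesh without wraparound links for X-Y routing and visits hypercube dimensions from the least significant bit upward for E-cube routing; the function names `xy_route` and `ecube_route` are chosen for this example.

```python
def xy_route(src, dst):
    """X-Y routing on a 2D mesh (no wraparound): finish X first, then Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                      # travel along the X dimension
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then along the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def ecube_route(src, dst, n_dims):
    """E-cube routing: correct differing address bits, lowest dimension first."""
    node = src
    path = [node]
    for dim in range(n_dims):
        if (node ^ dst) & (1 << dim):   # this bit still differs from the destination
            node ^= 1 << dim            # cross the link along this dimension
            path.append(node)
    return path
```

For instance, `xy_route((0, 0), (2, 1))` visits (0,0), (1,0), (2,0), (2,1), and `ecube_route(2, 7, 3)` visits nodes 2, 3, 7. Both routes are minimal and deterministic: the path depends only on the source and destination.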

Topology Embeddings
- Mapping between networks
- Useful in the early days of parallel computing when topology-specific algorithms were being developed
- Embedding quality metrics
  - dilation: maximum number of links an edge is mapped to
  - congestion: maximum number of edges mapped on a single link
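Dilation can be made concrete by embedding a ring into a hypercube. The sketch below uses the reflected Gray code, a common choice that this transcript does not itself name, and relies on the fact that the shortest path between two hypercube nodes has length equal to the Hamming distance of their labels. The helper names and the assumption that every ring edge is mapped onto a shortest path are illustrative.

```python
def gray(i):
    """Reflected binary Gray code of i."""
    return i ^ (i >> 1)


def hamming(a, b):
    """Number of differing bits, i.e. hypercube distance between labels a and b."""
    return bin(a ^ b).count("1")


def ring_dilation(d, mapping):
    """Dilation of embedding a 2**d-node ring into a d-dimensional hypercube.

    mapping: function from ring index to hypercube label. Each ring edge
    (i, i+1 mod p) is assumed to follow a shortest hypercube path.
    """
    p = 1 << d
    return max(hamming(mapping(i), mapping((i + 1) % p)) for i in range(p))
```

With d = 3, `ring_dilation(3, gray)` is 1, so every ring edge lands on a single hypercube link, whereas the identity mapping `ring_dilation(3, lambda i: i)` gives 3 because neighbors such as 3 (011) and 4 (100) differ in all three bits.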

Mapping a Cartesian Topology onto a Hypercube Cool things

Mapping a Cartesian Topology onto a Hypercube