Scalability and Classifications




Scalability and Classifications

1 Types of Parallel Computers
  MIMD and SIMD classifications
  shared and distributed memory multicomputers
  distributed shared memory computers

2 Network Topologies
  static connections
  dynamic network topologies by switches
  ethernet connections

MCS 572 Lecture 2: Introduction to Supercomputing
Jan Verschelde, 11 January 2012


the classification of Flynn

In 1966, Flynn introduced the following classification:

SISD = Single Instruction Single Data stream:
  one single processor handles data sequentially;
  uses pipelining (e.g.: car assembly) to achieve parallelism.
MISD = Multiple Instruction Single Data stream:
  called systolic arrays; has been of little interest.
SIMD = Single Instruction Multiple Data stream:
  graphics computing: issue the same command for a matrix of pixels;
  vector and array processors for regular data structures.
MIMD = Multiple Instruction Multiple Data stream:
  this is the general purpose multiprocessor computer.

Single Program Multiple Data Stream

One model is SPMD: Single Program Multiple Data stream:
1 All processors execute the same program.
2 Branching in the code depends on the identification number of the processing node.

Manager/worker paradigm: the manager (also called root) has identification zero; the workers are labeled 1, 2, ..., p-1.
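The branching on the identification number can be sketched in plain Python. This simulation is our own, not from the slides: in a real SPMD run each rank would be a separate process, e.g. launched by an MPI runtime, rather than a loop iteration.

```python
# SPMD sketch: every process runs the same program and branches on
# its identification number (rank).  Here the p ranks are simulated
# sequentially; in a real SPMD run each rank is a separate process.

def spmd_program(rank, p):
    """The single program executed by all p processing nodes."""
    if rank == 0:
        # manager (root, rank 0): describe work for the workers
        return [("task", w) for w in range(1, p)]
    # worker (ranks 1, 2, ..., p-1): do a share of the work
    return ("result from worker", rank)

p = 4
for rank in range(p):
    print(rank, spmd_program(rank, p))
```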


types of parallel computers

[figure: shared memory -- processors p1, ..., p4 connected through a network to memory modules m1, ..., m4; distributed memory -- each processor pi has its own memory mi, and the processors are connected by a network]

One crude distinction concerns memory, shared or distributed:
A shared memory multicomputer has one single address space, accessible to every processor.
In a distributed memory multicomputer, every processor has its own memory; other processors access that memory only via messages through its owner.
Many parallel computers consist of multicore nodes.

clusters

Definition: A cluster is an independent set of computers combined into a unified system through software and networking.

Beowulf clusters are scalable performance clusters based on commodity hardware, on a private network, with open source software.

What drove the clustering revolution in computing?
1 commodity hardware: a choice of many vendors for processors, memory, hard drives, etc.;
2 networking: Ethernet is the dominant commodity networking technology; supercomputers have specialized networks;
3 open source software infrastructure: Linux and MPI.

Message Passing and Scalability

total time = computation time + communication time,

where the communication time is the overhead. Because we want to reduce the overhead, the ratio

computation/communication ratio = computation time / communication time

determines the scalability of a problem: how well can we increase the problem size n, keeping p, the number of processors, fixed?

Desired: the order of the overhead should be less than the order of the computation, so that the ratio grows with n; examples: O(log2(n)) < O(n) < O(n^2).

Remedy: overlapping communication with computation.
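As an illustration, a back-of-the-envelope check that the ratio improves with n for fixed p. The cost model and constants below are our own assumptions, not from the slides: computation time proportional to n/p, communication time proportional to log2(p), with communication far more expensive per unit.

```python
import math

# Illustrative model (our own assumption): computation time grows
# like n/p and communication time like log2(p), with a much larger
# per-unit cost for communication than for computation.
def comp_comm_ratio(n, p, t_comp=1.0, t_comm=100.0):
    computation = t_comp * (n / p)
    communication = t_comm * math.log2(p)
    return computation / communication

# for fixed p, the ratio grows linearly with the problem size n
print(comp_comm_ratio(10**4, 8), comp_comm_ratio(10**6, 8))
```

Under this model, growing the problem size by a factor of 100 grows the ratio by the same factor, which is why increasing n at fixed p helps scalability.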


distributed shared memory computers

In a distributed shared memory computer:
  memory is physically distributed with each processor, and
  each processor has access to all memory in a single address space.

Benefits:
1 message passing is often not attractive to programmers;
2 shared memory computers allow only a limited number of processors, whereas distributed memory computers scale well.

Disadvantage: access to a remote memory location causes delays, and the programmer does not have control to remedy the delays.


some terminology

bandwidth: the number of bits transmitted per second.

On latency, we distinguish three types:
  message latency: the time to send a zero length message (or startup time);
  network latency: the time for a message to transfer through the network;
  communication latency: the total time to send a message, including software overhead and interface delays.

diameter of a network: the minimum number of links between the nodes that are farthest apart.

On bisecting the network:
  bisection width: the number of links needed to cut the network into two equal parts;
  bisection bandwidth: the number of bits per second which can be sent from one half of the network to the other half.
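To make the diameter and bisection width concrete, a small sketch for two topologies that appear later in these slides (the formulas are standard; the helper names are ours, and p is assumed to be a power of 2 for the hypercube):

```python
import math

# diameter: minimum number of links between the farthest-apart nodes
def ring_diameter(p):
    return p // 2              # go the short way around the ring

def hypercube_diameter(p):
    return int(math.log2(p))   # one bit flip per link; p a power of 2

# bisection width: links cut when splitting the network in two halves
def ring_bisection_width(p):
    return 2                   # a ring is cut by removing two links

def hypercube_bisection_width(p):
    return p // 2              # half the nodes each keep one cut link

for p in (8, 16):
    print(p, ring_diameter(p), hypercube_diameter(p),
          ring_bisection_width(p), hypercube_bisection_width(p))
```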

arrays, rings, meshes, and tori

Connecting p nodes in a complete graph is too expensive.

[figure: an array and a ring of 4 nodes]

[figure: a mesh and a torus of 16 nodes]

hypercube network

Two nodes are connected if their labels differ in exactly one bit.

[figure: hypercubes of 4 nodes (labels 00, 01, 10, 11) and of 8 nodes (labels 000 through 111)]

e-cube or left-to-right routing: flip the bits from left to right, e.g.: going from node 000 to 101 passes through 100.

In a hypercube network with p nodes:
  the maximum number of flips is log2(p),
  the number of connections is...?
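The left-to-right routing rule can be sketched as follows (the function name is ours; node labels are bit strings of equal length):

```python
# e-cube (left-to-right) routing in a hypercube: scan the bits of
# the source label from left to right and flip every bit that
# differs from the destination; each flip traverses one link.

def ecube_route(src, dst):
    """Return the list of node labels visited going from src to dst."""
    path = [src]
    cur = list(src)
    for i in range(len(src)):      # scan bits left to right
        if cur[i] != dst[i]:
            cur[i] = dst[i]        # flip one bit -> move along one link
            path.append("".join(cur))
    return path

print(ecube_route("000", "101"))   # ['000', '100', '101']
```

This reproduces the example on the slide: from node 000 to 101 the route passes through 100.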

a tree network

Consider a binary tree:
  The leaves in the tree are processors.
  The interior nodes in the tree are switches.

Often the tree is fat: with an increasing number of links towards the root of the tree.


a crossbar switch

In a shared memory multicomputer, processors are usually connected to memory modules by a crossbar switch.

[figure: a crossbar switch connecting processors p1, ..., p4 to memory modules m1, ..., m4]

A p-processor shared memory computer requires p^2 switches.

2-by-2 switches

A 2-by-2 switch has two configurations: pass through and cross over.

[figure: a 2-by-2 switch changing from the pass-through to the cross-over configuration]

a multistage network

Rules in the routing algorithm:
1 bit is zero: select the upper output of the switch;
2 bit is one: select the lower output of the switch.

The first bit in the input determines the output of the first switch; the second bit in the input determines the output of the second switch.

[figure: a two-stage network of 2-by-2 switches mapping inputs 00, 10, 01, 11 to outputs 00, 01, 10, 11]

Communication between 2 nodes using 2-by-2 switches causes blocking: other nodes are prevented from communicating.

a 3-stage Omega interconnection network

circuit switching:

[figure: a 3-stage Omega network of 2-by-2 switches connecting 8 inputs (000 through 111) to 8 outputs]

number of switches for p processors: (p/2) * log2(p)
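Using the switch counts from these slides, a quick comparison of the Omega network against a crossbar (a sketch of ours; p is assumed to be a power of 2):

```python
import math

def omega_switches(p):
    # log2(p) stages, each with p/2 two-by-two switches
    return (p // 2) * int(math.log2(p))

def crossbar_switches(p):
    # a crossbar needs a switch at every processor/memory crosspoint
    return p * p

# the Omega network grows like (p/2) log2(p), the crossbar like p^2
for p in (8, 64, 1024):
    print(p, omega_switches(p), crossbar_switches(p))
```

For p = 8 this gives 12 switches for the Omega network versus 64 for the crossbar; the gap widens quickly as p grows.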

circuit and packet switching

If all circuits are occupied, communication is blocked.

Alternative solution: packet switching: the message is broken into packets and sent through the network.

Problems to avoid:
  deadlock: packets are blocked by other packets waiting to be forwarded. This occurs when the buffers are full with packets. Solution: avoid cycles using the e-cube routing algorithm.
  livelock: a packet keeps circling the network and fails to find its destination.


ethernet connection network

Computers in a typical cluster are connected via ethernet.

[figure: compute nodes and a manager node connected by an ethernet network]

personal supercomputing

The SiCortex SC072 is a desktop supercomputer:
  72 processors: 12 nodes of 6 cores each,
  operates at 1 GFlops,
  48GB of memory,
  green supercomputing: low energy usage.

HP workstation Z800 running RedHat Linux:
  two 6-core Intel Xeon at 3.47GHz,
  24GB of internal memory,
  2 NVIDIA Tesla C2050 general purpose graphics processing units.

the ACCC cluster argo

http://www.uic.edu/depts/accc/hardware/argo

A Beowulf cluster with 57 PCs running RedHat Linux 4.2, with one master node and 56 compute nodes (heterogeneous), and approximately 160GB of total memory.

The master node is used to create, edit, and compile programs, and to submit jobs for execution on the compute nodes. User programs must not run on the master node.

The cluster monitoring system at https://argo.cc.uic.edu/ (requires only a UIC netid) allows users to see the job list & usage stats.

summary and recommended reading

We classified computers and introduced network terminology; we have covered chapter 1 of Wilkinson-Allen almost entirely.

Available to UIC via the ACM digital library:
  Michael J. Flynn and Kevin W. Rudd: Parallel Architectures. ACM Computing Surveys 28(1): 67-69, 1996.

Available to UIC via IEEE Xplore:
  George K. Thiruvathukal: Cluster Computing. Guest Editor's Introduction. Computing in Science & Engineering 7(2): 11-13, 2005.

Visit www.beowulf.org.

Exercises

Homework will be collected at a date to be announced. Exercises:

1 Derive a formula for the number of links in a hypercube with p = 2^k processors for some positive number k.
2 Consider a network of 16 nodes, organized in a 4-by-4 mesh with connecting loops to give it the topology of a torus (or doughnut). Can you find a mapping of the nodes which gives it the topology of a hypercube? If so, use 4 bits to assign labels to the nodes. If not, explain why.
3 We derived an Omega network for eight processors. Give an example of a configuration of the switches which is blocking, i.e.: a case for which the switch configurations prevent some nodes from communicating with each other.
4 Draw a multistage Omega interconnection network for p = 16.