What is High Performance Computing?
V. Sundararajan
Scientific and Engineering Computing Group
Centre for Development of Advanced Computing, Pune
Plan
- Why HPC?
- Parallel Architectures
- Current Machines
- Issues
- Arguments against Parallel Computing
- Introduction to MPI
- A Simple Example
Definition
- Supercomputing / parallel computing: parallel supercomputers.
- High Performance Computing: includes the computers, networks, algorithms and environments needed to make such systems usable, ranging from small clusters of PCs to the fastest supercomputers.
Increased Cores Add Value
- Physical limits of single-processor speed reached by 2007 (predicted by Paul Messina in 1997).
- A 3600 MHz Pentium has a clock period of <0.3 ns.
- Light travels 30 cm in 1 ns, and an on-chip signal already moves at only about 10% of that speed, so the distance a signal can cover per clock period is a hard limit.
- Red-shift: reduction in clock speed.
Why HPC?
- To simulate a bio-molecule: the non-bond energy term costs ~10^8 operations per step.
- A 1 microsecond simulation needs ~10^9 steps.
- On a 500 MFLOPS machine (5x10^8 operations per second) this takes about 2x10^9 secs (about 60 years); that 500 MFLOPS may be the sustained rate of a machine with 5000 MFLOPS peak.
- And we need to do a large number of simulations, for even larger molecules.
Why HPC?
- Biggest CCSD(T) calculation: David A. Dixon.
- Hydrocarbon fuels (combustion): need heats of formation to predict reaction equilibria and bond energies.
- Octane (C8H18): aug-cc-pVQZ = 1468 basis functions, 50 electrons.
- Used NWChem on 1400 Itanium-2 processors (2 and 4 GB/proc) on the EMSL MSCF Linux cluster.
- Took 23 hours to complete (3.5 years on a desktop machine).
Parallel Architectures
1. Single Instruction Single Data (SISD): sequential machines.
2. Multiple Instruction Single Data (MISD).
3. Single Instruction Multiple Data (SIMD): an array of processors with a single control unit, e.g. the Connection Machine (CM-5).
4. Multiple Instruction Multiple Data (MIMD): several processors, each with its own instruction and data stream; all recent parallel machines.
Tightly Coupled MIMD
- Shared memory, centralised (uniform access); processors (P) reach the memory with load/store over an interconnection network.
[Diagram: several processors P connected by an interconnection network to a single shared memory]
- Symmetric Multiprocessors.
Loosely Coupled MIMD
- Each processor (P) has its own local memory (M), reached with load/store; processors communicate by send/receive of messages over an interconnection network.
[Diagram: P+M node pairs connected by an interconnection network]
- Distributed message-passing machines; also called the message-passing architecture.
Architectures
Shared memory:
- Scalable up to about 64 or 128 processors, but costly.
- Memory contention problem.
- Synchronisation problem.
- Easy to code.
Distributed memory:
- Scalable to a larger number of processors.
- Message-passing overheads; latency hiding needed.
- Difficult to code.
Cluster of Computers
[Diagram: cluster software stack]
- Programming environment (Java, C, F95, MPI, OpenMP)
- Other subsystems (database, OLTP) and user interface
- Single System Image infrastructure
- Nodes, each with its own OS, connected by a commodity interconnect
Simulation Techniques
- N-body simulations
- Finite difference
- Finite element
- Pattern evolution
Classic methods: do large-scale problems faster, and enable a new class of solutions.
Parallel Problem Solving Methodologies
- Data parallelism
- Pipelining
- Functional parallelism
Parallel Programming Models
1. Directives to the compiler (OpenMP, threads), as sketched below.
2. Parallel libraries (ScaLAPACK, PARPACK).
3. Message passing (MPI).
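As a sketch of model 1 (illustrative only, not from the original slides; assumes a compiler with OpenMP support, e.g. gfortran -fopenmp), a summation loop parallelised with a single directive:

program omp_sum
implicit none
double precision x, xsum
integer i
xsum = 0.0
!$omp parallel do private(x) reduction(+:xsum)
do i = 1, 1000000                     ! problem size chosen for illustration
   x = float(mod(i,10)) / 10.0
   xsum = xsum + x
enddo
!$omp end parallel do
print *, 'xsum=', xsum
end program omp_sum

The compiler distributes the loop iterations over threads; without OpenMP enabled, the directive lines are ignored as comments and the program runs sequentially.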
Programming Style
- SPMD: Single Program, Multiple Data. Each processor executes the same program, but with different data.
- MPMD: Multiple Programs, Multiple Data. Each processor executes a different program; usually this is a master-slave model.
Performance Issues
1. Speedup: S_P = T_s / T_p, where T_s is the time for the sequential code and T_p the time for the parallel code; 1 < S_P < P.
2. Efficiency: E_P = S_P / P; 0 < E_P < 1, and E_P = 1 (i.e. S_P = P) means 100% efficient.
Example: T_s = 100 secs and T_p = 20 secs on P = 8 processors gives S_8 = 5 and E_8 = 0.625.
Communication Overheads
- Latency: startup time for each message transaction, ~1 µs.
- Bandwidth: the rate at which messages are transmitted across the nodes/processors, ~10 Gbits/sec.
- Roughly, sending a message of n bits costs latency + n/bandwidth.
Strongest Argument: Amdahl's Law
S = 1 / (f + (1-f)/P), where f is the sequential part of the code.
Ex. f = 0.1, P = 10 processes:
S = 1 / (0.1 + 0.9/10) = 1 / 0.19 = 5.3 (approx.)
As P -> infinity, S -> 1/f = 10.
Whatever we do, 10 is the maximum speedup possible.
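A minimal sketch (not from the original slides) that tabulates Amdahl's law for the f = 0.1 example above, for processor counts that are powers of two:

program amdahl
implicit none
double precision, parameter :: f = 0.1d0    ! sequential fraction, from the example above
double precision s
integer k, p
do k = 0, 10
   p = 2**k                                 ! P = 1, 2, 4, ..., 1024
   s = 1.0d0 / (f + (1.0d0 - f) / dble(p))  ! Amdahl's law
   print '(a,i5,a,f6.2)', ' P = ', p, '   S = ', s
end do
end program amdahl

The printed speedups flatten out near 1/f = 10, however many processors are used.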
What is MPI?
- An industry-wide standard for passing messages between parallel processes.
- Small: only six library functions are required to write any parallel code.
- Large: there are more than 200 functions in MPI-2.
What is MPI?
- MPI uses a communicator, a handle identifying a set of processes, and topologies for different patterns of communication.
- A communicator is a collection of processes that can send messages to each other.
- A topology is a structure imposed on the processes in a communicator that allows the processes to be addressed in different ways.
- MPI can be called from Fortran, C or C++ codes.
Small MPI
- MPI_Init
- MPI_Comm_size
- MPI_Comm_rank
- MPI_Send
- MPI_Recv
- MPI_Finalize
MPI_Init
Initializes the MPI execution environment.
Synopsis:
  include "mpif.h"
  Call MPI_Init(Error)
Output Parameter:
  Error - error value (integer); 0 means no error.
The default communicator MPI_COMM_WORLD is initialised.
MPI_Comm_size
Determines the size of the group associated with a communicator.
Synopsis:
  include "mpif.h"
  Call MPI_Comm_size(MPI_COMM_WORLD, size, Error)
Input Parameter:
  MPI_COMM_WORLD - default communicator (handle)
Output Parameters:
  size - number of processes in the group of communicator MPI_COMM_WORLD (integer)
  Error - error value (integer)
MPI_Comm_rank
Determines the rank of the calling process in the communicator.
Synopsis:
  include "mpif.h"
  Call MPI_Comm_rank(MPI_COMM_WORLD, myid, Error)
Input Parameter:
  MPI_COMM_WORLD - default communicator (handle)
Output Parameters:
  myid - rank of the calling process in the group of communicator MPI_COMM_WORLD (integer)
  Error - error value (integer)
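A minimal sketch (for illustration; not from the original slides) combining the three calls introduced so far, plus MPI_Finalize, which is described below. Each process reports its rank and the total number of processes:

program whoami
implicit none
include 'mpif.h'
integer myid, nprocs, ierr
call MPI_Init(ierr)                                 ! set up the MPI environment
call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)    ! how many processes in all?
call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)      ! which one is this?
print *, 'Process ', myid, ' of ', nprocs
call MPI_Finalize(ierr)
stop
end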
MPI_Send
Performs a basic send.
Synopsis:
  include "mpif.h"
  Call MPI_Send(buf, count, datatype, dest, tag, MPI_COMM_WORLD, Error)
Input Parameters:
  buf - initial address of send buffer (choice)
  count - number of elements in send buffer (nonnegative integer)
  datatype - data type of each send buffer element (handle)
  dest - rank of destination (integer)
  tag - message tag (integer)
  MPI_COMM_WORLD - communicator (handle)
Output Parameter:
  Error - error value (integer)
Notes: this routine may block until the message is received.
MPI_Recv
Basic receive.
Synopsis:
  include "mpif.h"
  Call MPI_Recv(buf, count, datatype, source, tag, MPI_COMM_WORLD, status, Error)
Input Parameters:
  count - maximum number of elements in receive buffer (integer)
  datatype - datatype of each receive buffer element (handle)
  source - rank of source (integer)
  tag - message tag (integer)
  MPI_COMM_WORLD - communicator (handle)
Output Parameters:
  buf - initial address of receive buffer (choice)
  status - status object (Status)
  Error - error value (integer)
Notes: the count argument indicates the maximum length of a message; the actual number received can be determined with MPI_Get_count.
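A minimal point-to-point sketch (assumed for illustration; not from the original slides): process 0 sends one double precision value to process 1, which receives and prints it. Run it with at least two processes:

program sendrecv
implicit none
include 'mpif.h'
double precision x
integer myid, ierr
integer status(MPI_STATUS_SIZE)
call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
if (myid .eq. 0) then
   x = 3.14d0
   call MPI_Send(x, 1, MPI_DOUBLE_PRECISION, 1, 99, MPI_COMM_WORLD, ierr)
else if (myid .eq. 1) then
   call MPI_Recv(x, 1, MPI_DOUBLE_PRECISION, 0, 99, MPI_COMM_WORLD, status, ierr)
   print *, 'process 1 received ', x
end if
call MPI_Finalize(ierr)
stop
end

The tag (99 here) must match between the send and the receive; MPI_ANY_TAG on the receive side would accept any tag.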
MPI_Finalize
Terminates the MPI execution environment.
Synopsis:
  include "mpif.h"
  Call MPI_Finalize(Error)
Output Parameter:
  Error - error value (integer)
Notes: all processes must call this routine before exiting. The number of processes running after this routine is called is undefined; it is best not to do anything more than a return after calling MPI_Finalize.
Examples
- Summation
- Calculation of pi
- Matrix multiplication
Sequential Flow
Start -> Input -> Do the computation -> Output -> Stop
Parallel using MPI
Sequential: Start -> Input -> Do the computation -> Output -> Stop
Parallel: Start -> Input -> Divide the work -> Communicate input -> Do the computation -> Collect the results -> Output -> Stop
Simple Example
- Consider the summation of real numbers between 0 and 1: compute the sum of x_i for i = 1, 2, ...
- This may sound trivial, but it is useful for illustrating the practical aspects of parallel program development.
- About 10^12 operations in total: on a PC it would take roughly 500 secs, assuming 2 GFLOPS sustained performance.
- Can we speed this up and finish in 1/8th of the time on 8 processors?
Sequential Code

program summation
implicit none
double precision x, xsum
integer i
xsum = 0.0
do i = 1, 1000000000                  ! upper bound assumed; lost from the slide
   x = float(mod(i,10)) / 10.0        ! divisor assumed; lost from the slide
   xsum = xsum + x
enddo
print *, 'xsum=', xsum
stop
end
Parallel Code: SPMD style
Sequential:

program summation
implicit none
double precision x, xsum
integer i
xsum = 0.0
do i = 1, 1000000000                  ! bound and divisor assumed, as before
   x = float(mod(i,10)) / 10.0
   xsum = xsum + x
enddo
print *, 'xsum=', xsum
stop
end

Parallel:

program summation
implicit none
include 'mpif.h'
double precision x, xsum, tsum
integer i, myid, nprocs, ierr
call MPI_Init(ierr)
call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
xsum = 0.0
do i = myid+1, 1000000000, nprocs     ! each process takes every nprocs-th iteration
   x = float(mod(i,10)) / 10.0
   xsum = xsum + x
enddo
call MPI_Reduce(xsum, tsum, 1, MPI_DOUBLE_PRECISION, &
                MPI_SUM, 0, MPI_COMM_WORLD, ierr)   ! combine partial sums on rank 0
if (myid .eq. 0) print *, 'tsum=', tsum
call MPI_Finalize(ierr)
stop
end
Sequential vs Parallel
Sequential:
- Design the algorithm step by step and write the code assuming a single process will execute it.
- f90 -o sum.exe sum.f (or gcc -o sum.exe sum.c)
- Only one executable is created: sum.exe, which runs as a single process.
- Single point of failure; good tools available for debugging.
Parallel:
- Redesign the same algorithm, distributing the work over many processes that execute simultaneously.
- mpif90 -o sum.exe sum.f (or mpicc -o sum.exe sum.c)
- Only one executable is created; mpirun -np 8 sum.exe executes 8 processes simultaneously.
- Cooperative operations, but multiple points of failure; tools for debugging parallel executions are still evolving.
A Quote to Conclude
James Bailey (in "A New Era in Computation", eds. Metropolis and Rota):
"We are all still trained to believe that orderly, sequential processes are more likely to be true. As we were reshaping our computers, so simultaneously were they reshaping us. Maybe when things happen in this world, they actually happen in parallel."