Parallel Programming and High-Performance Computing Part 1: Introduction Dr. Ralf-Peter Mundani CeSIM / IGSSE
General Remarks
materials: http://www5.in.tum.de/lehre/vorlesungen/parhpp/ss08/
Ralf-Peter Mundani: email mundani@tum.de, phone 289 25057, room 3181 (city centre); consultation hour: Tuesday, 4:00 to 6:00 pm (room 02.05.058)
Ioan Lucian Muntean: email muntean@in.tum.de, phone 289 18692, room 02.05.059
lecture (2 SWS): weekly, Tuesday, start at 12:15 pm, room 02.07.023
exercises (1 SWS): fortnightly, Wednesday, start at 4:45 pm, room 02.07.023
General Remarks
content:
- part 1: introduction
- part 2: high-performance networks
- part 3: foundations
- part 4: programming memory-coupled systems
- part 5: programming message-coupled systems
- part 6: dynamic load balancing
- part 7: examples of parallel algorithms
Overview
- motivation
- classification of parallel computers
- levels of parallelism
- quantitative performance evaluation

"I think there is a world market for maybe five computers." (Thomas Watson, chairman of IBM, 1943)
Motivation
numerical simulation: from phenomena (physical phenomenon, technical process) to predictions
1. modelling: determination of parameters, expression of relations
2. numerical treatment: model discretisation, algorithm development
3. implementation: software development, parallelisation
4. visualisation: illustration of abstract simulation results
5. validation: comparison of results with reality
6. embedding: insertion into working process
disciplines involved: mathematics, computer science, application
Motivation
why parallel programming and HPC? complex problems (especially the so-called "grand challenges") demand more computing power:
- climate or geophysics simulation (e.g. tsunami)
- structure or flow simulation (e.g. crash test)
- development systems (e.g. CAD)
- large data analysis (e.g. Large Hadron Collider at CERN)
- military applications (e.g. cryptanalysis)
performance increase due to
- faster hardware, more memory ("work harder")
- more efficient algorithms, optimisation ("work smarter")
- parallel computing ("get some help")
Motivation
objectives (in case all resources were available N times):
- throughput: compute N problems simultaneously; running N instances of a sequential program with different data sets ("embarrassing parallelism"); e.g. SETI@home; drawback: limited resources of single nodes
- response time: compute one problem in a fraction (1/N) of the time; running one instance (i.e. N processes) of a parallel program for jointly solving a problem; e.g. finding prime numbers; drawback: writing a parallel program; communication
- problem size: compute one problem with N-times larger data; running one instance (i.e. N processes) of a parallel program, using the sum of all local memories for computing larger problem sizes; e.g. iterative solution of a system of linear equations (SLE); drawback: writing a parallel program; communication
Overview
- motivation
- classification of parallel computers
- levels of parallelism
- quantitative performance evaluation
definition: "a collection of processing elements that communicate and cooperate to solve large problems" (ALMASI and GOTTLIEB, 1989)
possible appearances of such processing elements:
- specialised units (e.g. steps of a vector pipeline)
- parallel features in modern monoprocessors (superscalar architectures, instruction pipelining, VLIW, multithreading, multicore, ...)
- several uniform arithmetical units (e.g. processing elements of array computers)
- processors of a multiprocessor computer (i.e. the actual parallel computers)
- complete stand-alone computers connected via LAN (workstation or PC clusters, so-called virtual parallel computers)
- parallel computers or clusters connected via WAN (so-called metacomputers)
reminder: dual core, quad core, manycore, and multicore
observation: increasing frequency (and thus core voltage) over the past years
problem: thermal power dissipation increases linearly with frequency and with the square of the core voltage

reminder: dual core, quad core, manycore, and multicore (cont'd)
a 25% reduction in frequency (and thus core voltage) leads to roughly a 50% reduction in dissipation (normal CPU vs. reduced CPU)

reminder: dual core, quad core, manycore, and multicore (cont'd)
idea: installation of two cores per die with the same dissipation as a single-core system (single core vs. dual core)
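The dissipation argument above can be checked with back-of-the-envelope arithmetic. This sketch assumes the common model that dynamic power scales as P ~ f * V^2 and that voltage is scaled proportionally to frequency (both assumptions, not stated on the slide; under them the reduction is even a bit larger than the quoted 50%):

```python
# Sketch of the frequency/dissipation trade-off behind multicore designs.
# Assumption: dynamic power P ~ f * V^2, with core voltage V scaled
# proportionally to frequency f.

def relative_power(f_scale: float) -> float:
    """Power relative to baseline when f and V are both scaled by f_scale."""
    return f_scale * f_scale**2   # P ~ f * V^2 with V ~ f

reduced = relative_power(0.75)    # one core at 75% frequency/voltage
dual = 2 * reduced                # two reduced cores on one die

print(f"one reduced core: power {reduced:.2f}, performance 0.75")
print(f"dual-core die:    power {dual:.2f}, performance 1.50")
```

The dual-core die delivers about 1.5x the performance of the original single core at a comparable (here even slightly lower) power budget, which is exactly the motivation for multicore.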
commercial parallel computers
manufacturers: starting from 1983, big players and small start-ups (see table; "out of business" means no longer in the parallel business); names have been coming and going rapidly; in addition: several manufacturers of vector computers and non-standard architectures

company                  country   year   status in 2003
Sequent                  U.S.      1984   acquired by IBM
Intel                    U.S.      1984   out of business
Meiko                    U.K.      1985   bankrupt
nCUBE                    U.S.      1985   out of business
Parsytec                 Germany   1985   out of business
Alliant                  U.S.      1985   bankrupt

commercial parallel computers (cont'd)
company                  country   year   status in 2003
Encore                   U.S.      1986   out of business
Floating Point Systems   U.S.      1986   acquired by SUN
Myrias                   Canada    1987   out of business
Ametek                   U.S.      1987   out of business
Silicon Graphics         U.S.      1988   active
C-DAC                    India     1991   active
Kendall Square Research  U.S.      1992   bankrupt
IBM                      U.S.      1993   active
NEC                      Japan     1993   active
SUN Microsystems         U.S.      1993   active
Cray Research            U.S.      1993   active
arrival of clusters
in the late eighties, PCs became a commodity market with rapidly increasing performance, mass production, and decreasing prices; growing attractiveness for parallel computing
1994: Beowulf, the first parallel computer built completely out of commodity hardware
- NASA Goddard Space Flight Centre
- 16 Intel DX4 processors
- multiple 10 Mbit Ethernet links
- Linux with GNU compilers
- MPI library
1996: Beowulf cluster performing more than 1 GFlops
1997: a 140-node cluster performing more than 10 GFlops
arrival of clusters (cont'd)
2005: InfiniBand cluster at TUM
- 36 Opteron nodes (quad boards)
- 4 Itanium nodes (quad boards)
- 4 Xeon nodes (dual boards) for interactive tasks
- InfiniBand 4x switch, 96 ports
- Linux (SuSE and Red Hat)
supercomputers
supercomputing or high-performance scientific computing as the most important application of the big "number crunchers"
national initiatives due to huge budget requirements: Accelerated Strategic Computing Initiative (ASCI) in the U.S.
- in the wake of the nuclear testing moratorium in 1992/93
- decision: develop, build, and install a series of five supercomputers of up to $100 million each in the U.S.
- start: ASCI Red (1997, Intel-based, Sandia National Laboratory, the world's first TFlops computer)
- then: ASCI Blue Pacific (1998, LLNL), ASCI Blue Mountain, ASCI White, ...
- meanwhile: new high-end computing memorandum (2004)
supercomputers (cont'd)
federal "Bundeshöchstleistungsrechner" initiative in Germany
- decision in the mid-nineties
- three federal supercomputing centres in Germany (Munich, Stuttgart, and Jülich)
- one new installation every second year (i.e. a six-year upgrade cycle for each centre)
- the newest one to be among the top 10 of the world
overview and state of the art: Top500 list (updated every six months), see http://www.top500.org
MOORE's law
observation by Intel co-founder Gordon E. MOORE, describing an important trend in the history of computer hardware (1965): the number of transistors that can be placed on an integrated circuit increases exponentially, doubling approximately every two years
some numbers: Top500 (statistics charts omitted)
terminology used in the charts:
- cluster: #nodes > #processors/node
- constellation: #nodes < #processors/node
The Earth Simulator
- world's #1 from 2002 to 2004; installed in 2002 in Yokohama, Japan; ES building (approx. 50 m × 65 m × 17 m)
- based on the NEC SX-6 architecture; developed by three governmental agencies
- highly parallel vector supercomputer; consists of 640 nodes (plus 2 control and 128 data-switching nodes), each with 8 vector processors (8 GFlops each) and 16 GB shared memory
- 5120 processors (40.96 TFlops peak performance) and 10 TB memory; 35.86 TFlops sustained performance (Linpack)
- nodes connected by a 640 × 640 single-stage crossbar (83,200 cables with a total length of 2400 km; 8 TBps total bandwidth)
- further 700 TB disc space and 1.6 PB mass storage
BlueGene/L
- world's #1 since 2004; installed in 2005 at LLNL, CA, USA (beta system in 2004 at IBM); cooperation of DoE, LLNL, and IBM
- massively parallel supercomputer; consists of 65,536 nodes (plus 12 front-end and 1024 I/O nodes), each with 2 PowerPC 440d processors (2.8 GFlops each) and 512 MB memory
- 131,072 processors (367 TFlops peak performance) and 33.5 TB memory; 280.6 TFlops sustained performance (Linpack)
- nodes configured as a 3D torus (32 × 32 × 64); global reduction tree for fast operations (global max / sum) in a few microseconds
- 1024 Gbps link to the global parallel file system; further 806 TB disc space; operating system SuSE SLES 9
HLRB II (world's #6 in 04/2006)
- installed in 2006 at LRZ, Garching; installation costs €38 million; monthly costs approx. €400,000; upgrade in 2007 (finished)
- one of Germany's 3 supercomputers; SGI Altix 4700
- consists of 19 nodes (SGI NUMAlink 2D torus) with 256 blades each (ccNUMA link with partition fat tree); Intel Itanium2 Montecito dual core (12.8 GFlops); 4 GB memory per core
- 9728 processor cores (62.3 TFlops peak performance) and 39 TB memory; 56.5 TFlops sustained performance (Linpack)
- footprint 24 m × 12 m; total weight 103 metric tons
standard classification according to FLYNN
global data and instruction streams as criterion:
- instruction stream: sequence of commands to be executed
- data stream: sequence of data subject to instruction streams
two-dimensional subdivision according to
- the number of instructions a computer can execute per time unit
- the number of data elements a computer can process per time unit
hence, FLYNN distinguishes four classes of architectures:
- SISD: single instruction, single data
- SIMD: single instruction, multiple data
- MISD: multiple instruction, single data
- MIMD: multiple instruction, multiple data
drawback: very different computers may belong to the same class
standard classification according to FLYNN (cont'd)
SISD
- one processing unit that has access to one data memory and to one program memory
- classical monoprocessor following VON NEUMANN's principle
standard classification according to FLYNN (cont'd)
SIMD
- several processing units, each with separate access to a (shared or distributed) data memory; one program memory
- synchronous execution of instructions
- examples: array computer, vector computer
- advantage: easy programming model due to control flow with a strictly synchronous-parallel execution of all instructions
- drawback: specialised hardware necessary, easily becomes outdated due to rapid developments in the commodity market
standard classification according to FLYNN (cont'd)
MISD
- several processing units that have access to one data memory; several program memories
- not a very popular class (mainly for special applications such as digital signal processing)
- operating on a single stream of data, forwarding results from one processing unit to the next
- example: systolic array (network of primitive processing elements that "pump" data)
standard classification according to FLYNN (cont'd)
MIMD
- several processing units, each with separate access to a (shared or distributed) data memory; several program memories
- classification according to (physical) memory organisation: shared memory (shared (global) address space) or distributed memory (distributed (local) address space)
- examples: multiprocessor systems, networks of computers
processor coupling
cooperation of processors / computers as well as their shared use of various resources require communication and synchronisation
the following types of processor coupling can be distinguished:
- memory-coupled multiprocessor systems (MemMS)
- message-coupled multiprocessor systems (MesMS)

                     shared address space   distributed address space
global memory        MemMS, SMP             (none)
distributed memory   Mem-MesMS (hybrid)     MesMS
processor coupling (cont'd)
central issues:
- scalability: costs for adding new nodes / processors
- programming model: costs for writing parallel programs
- portability: costs for porting (migration), i.e. transfer from one system to another while preserving executability and flexibility
- load distribution: costs for obtaining a uniform load distribution among all nodes / processors
MesMS are advantageous concerning scalability, MemMS are typically better concerning the rest; hence, a combination of MemMS and MesMS for exploiting all advantages: distributed / virtual shared memory (DSM / VSM), i.e. physically distributed memory with a global shared address space
processor coupling (cont'd)
uniform memory access (UMA)
- each processor P has direct access via the network to each memory module M, with the same access times to all data
- a standard programming model can be used (i.e. no explicit send / receive of messages necessary)
- communication and synchronisation via shared variables; inconsistencies (e.g. write conflicts) have in general to be prevented by the programmer
processor coupling (cont'd)
symmetric multiprocessor (SMP)
- only a small number of processors, in most cases a central bus, one address space (UMA), but bad scalability
- cache coherence implemented in hardware (i.e. a read always provides a variable's value from its last write)
- examples: double or quad boards, SGI Challenge
processor coupling (cont'd)
non-uniform memory access (NUMA)
- memory modules physically distributed among processors
- shared address space, but access times depend on the location of the data (i.e. local addresses are faster than remote addresses)
- differences in access times are visible in the program
- examples: DSM / VSM, Cray T3E
processor coupling (cont'd)
cache-coherent non-uniform memory access (ccNUMA)
- caches for local and remote addresses; cache coherence implemented in hardware for the entire address space
- problem with scalability due to frequent cache updates
- example: SGI Origin 2000
processor coupling (cont'd)
cache-only memory access (COMA)
- each processor has only cache memory
- the entirety of all cache memories = global shared memory
- cache coherence implemented in hardware
- example: Kendall Square Research KSR-1
processor coupling (cont'd)
no remote memory access (NORMA)
- each processor has direct access to its local memory only
- access to remote memory only possible via explicit message exchange (due to the distributed address space)
- synchronisation implicitly via the exchange of messages
- performance improvement between memory and I/O due to parallel data transfer (e.g. Direct Memory Access) possible
- examples: IBM SP2, ASCI Red / Blue / White
Overview
- motivation
- classification of parallel computers
- levels of parallelism
- quantitative performance evaluation
Levels of Parallelism
the suitability of a parallel architecture for a given parallel program strongly depends on the granularity of parallelism
some remarks on granularity:
- quantitative meaning: ratio of computational effort and communication / synchronisation effort (amount of instructions between two necessary communication / synchronisation steps)
- qualitative meaning: level on which work is done in parallel
from coarse-grain to fine-grain parallelism:
- program level
- process level
- block level
- instruction level
- sub-instruction level
Levels of Parallelism
program level
- parallel processing of different programs
- independent units without any shared data; no or only a small amount of communication / synchronisation
- organised by the OS
process level
- a program is subdivided into processes to be executed in parallel
- each process consists of a larger amount of sequential instructions and has a private address space
- synchronisation necessary (in case all processes belong to one program)
- communication in most cases necessary (e.g. data exchange)
- support by the OS via routines for process management, process communication, and process synchronisation
- processes at this level are often referred to as heavy-weight processes
Levels of Parallelism
block level
- blocks of instructions are executed in parallel
- each block consists of a smaller amount of instructions and shares the address space with other blocks
- communication via shared variables; synchronisation mechanisms
- blocks at this level are often referred to as light-weight processes (threads)
instruction level
- parallel execution of machine instructions
- optimising compilers can increase this potential by modifying the order of commands (better exploitation of superscalar architectures and pipelining mechanisms)
sub-instruction level
- instructions are further subdivided into units to be executed in parallel or via overlapping (e.g. vector operations)
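Block-level parallelism as described above can be sketched with two threads that share one process's address space and synchronise via a lock (a minimal illustration, not from the lecture; the helper name `block` is hypothetical):

```python
# Two "blocks" (light-weight processes / threads) share the address space
# of one process; communication happens through the shared variable
# `counter`, synchronisation through a shared lock.
import threading

counter = 0
lock = threading.Lock()

def block(iterations: int) -> None:
    """One thread: repeatedly increments the shared variable."""
    global counter
    for _ in range(iterations):
        with lock:          # synchronisation mechanism: mutual exclusion
            counter += 1

threads = [threading.Thread(target=block, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)   # 20000: both threads updated the same shared variable
```

Without the lock, the two threads could interleave their read-modify-write sequences and lose updates, which is exactly the kind of inconsistency the programmer has to prevent on shared-memory systems.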
Overview
- motivation
- classification of parallel computers
- levels of parallelism
- quantitative performance evaluation
Quantitative Performance Evaluation
execution time: the time T of a parallel program between the start of the execution on one processor and the end of all computations on the last processor
during execution, all processors are in one of the following states:
- compute: computation time T_COMP, time spent for computations
- communicate: communication time T_COMM, time spent for send and receive operations
- idle: idle time T_IDLE, time spent waiting (for messages to be sent / received)
hence T = T_COMP + T_COMM + T_IDLE
Quantitative Performance Evaluation
parallel profile
- measures the amount of parallelism of a parallel program
- graphical representation: x-axis shows time, y-axis shows the amount of parallel activities
- identification of computation, communication, and idle periods
- example: a time chart for three processes (proc. A, B, C), each in state compute, communicate, or idle over time (chart omitted)
Quantitative Performance Evaluation
parallel profile (cont'd)
degree of parallelism: P(t) indicates the amount of processes (of one application) that can be executed in parallel at any point in time (i.e. the y-values of the previous example for any time t)
average parallelism (often referred to as parallel index): A(p) indicates the average amount of processes that can be executed in parallel, hence

    A(p) = 1/(t_2 - t_1) * \int_{t_1}^{t_2} P(t) dt   or   A(p) = (\sum_{i=1}^{p} i * t_i) / (\sum_{i=1}^{p} t_i),

where p is the amount of processes and t_i is the time during which exactly i processes are busy
Quantitative Performance Evaluation
parallel profile (cont'd)
previous example: A(p) = (1*18 + 2*4 + 3*13) / 35 = 65/35 ≈ 1.86 (plot of P(t) over time omitted)
for A(p), several theoretical (typically quite pessimistic) estimates exist, often used as arguments against parallel systems
example: estimate of MINSKY (1971); problem: the amount of used processors is halved in every step, e.g. parallel summation of 2p numbers on p processors; result?
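The average-parallelism computation from the example can be reproduced directly from the discrete formula (the helper name `average_parallelism` is my own):

```python
# Average parallelism A(p) for the slide's example profile, where t_i is
# the total time during which exactly i processes are busy:
# t_1 = 18, t_2 = 4, t_3 = 13 time units (35 time units in total).

def average_parallelism(busy_times: dict) -> float:
    """A(p) = sum(i * t_i) / sum(t_i) over all parallelism degrees i."""
    total_time = sum(busy_times.values())
    weighted = sum(i * t for i, t in busy_times.items())
    return weighted / total_time

a = average_parallelism({1: 18, 2: 4, 3: 13})
print(f"A(p) = {a:.2f}")   # 65/35 ≈ 1.86
```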
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor: correlating the performance of multi- and monoprocessor systems
important: a program that can be executed on both systems
definitions:
- P(1): amount of unit operations of the program on the monoprocessor system
- P(p): amount of unit operations of the program on the multiprocessor system with p processors
- T(1): execution time of the program on the monoprocessor system (measured in steps or clock cycles)
- T(p): execution time of the program on the multiprocessor system with p processors (measured in steps or clock cycles)
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont'd)
simplifying preconditions:
- T(1) = P(1): one operation is executed per step on the monoprocessor system
- T(p) ≤ P(p): more than one operation can be executed per step (for p ≥ 2) on the multiprocessor system with p processors
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont'd)
speed-up: S(p) = T(1) / T(p) indicates the improvement in processing speed; in general, 1 ≤ S(p) ≤ p
efficiency: E(p) = S(p) / p indicates the relative improvement in processing speed, normalised by the amount of processors p; in general, 1/p ≤ E(p) ≤ 1
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont'd)
speed-up and efficiency can be seen in two different ways:
- algorithm-independent: the best known sequential algorithm for the monoprocessor system is compared to the respective parallel algorithm for the multiprocessor system (absolute speed-up, absolute efficiency)
- algorithm-dependent: the parallel algorithm is treated as a sequential one to measure the execution time on the monoprocessor system; unfair due to communication and synchronisation overhead (relative speed-up, relative efficiency)
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont'd)
overhead: O(p) = P(p) / P(1) indicates the necessary overhead of a multiprocessor system for organisation, communication, and synchronisation; in general, 1 ≤ O(p)
parallel index: I(p) = P(p) / T(p) indicates the amount of operations executed on average per time unit; I(p) is a relative speed-up that takes the overhead into account
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont'd)
utilisation: U(p) = I(p) / p indicates the amount of operations each processor executes on average per time unit; conforms to the normalised parallel index
conclusions:
- all defined quantities have a value of 1 for p = 1
- the parallel index is an upper bound for the speed-up: 1 ≤ S(p) ≤ I(p) ≤ p
- the utilisation is an upper bound for the efficiency: 1/p ≤ E(p) ≤ U(p) ≤ 1
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont'd)
example (1)
- a monoprocessor system needs 6000 steps for the execution of 6000 operations to compute some result
- a multiprocessor system with five processors needs 6750 operations for the computation of the same result, but only 1500 steps for the execution
- thus P(1) = T(1) = 6000, P(5) = 6750, and T(5) = 1500
- speed-up and efficiency: S(5) = 6000/1500 = 4 and E(5) = 4/5 = 0.8
- there is an acceleration of factor 4 compared to the monoprocessor system, i.e. on average an improvement of 80% for each processor of the multiprocessor system
Quantitative Performance Evaluation
comparison multiprocessor / monoprocessor (cont'd)
example (2)
- parallel index and utilisation: I(5) = 6750/1500 = 4.5 and U(5) = 4.5/5 = 0.9
- on average 4.5 processors are simultaneously busy, i.e. each processor is working only for 90% of the execution time
- overhead: O(5) = 6750/6000 = 1.125, i.e. there is an overhead of 12.5% on the multiprocessor system compared to the monoprocessor system
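All five quantities from examples (1) and (2) follow mechanically from the definitions; a small sketch (the helper `metrics` is my own naming):

```python
# Speed-up, efficiency, parallel index, utilisation, and overhead for the
# slide's example: P(1) = T(1) = 6000, P(5) = 6750, T(5) = 1500, p = 5.

def metrics(P1: float, T1: float, Pp: float, Tp: float, p: int) -> dict:
    """Evaluate the defined performance quantities."""
    S = T1 / Tp                  # speed-up S(p) = T(1) / T(p)
    return {
        "S": S,
        "E": S / p,              # efficiency E(p) = S(p) / p
        "I": Pp / Tp,            # parallel index I(p) = P(p) / T(p)
        "U": Pp / Tp / p,        # utilisation U(p) = I(p) / p
        "O": Pp / P1,            # overhead O(p) = P(p) / P(1)
    }

m = metrics(P1=6000, T1=6000, Pp=6750, Tp=1500, p=5)
print(m)   # S = 4.0, E = 0.8, I = 4.5, U = 0.9, O = 1.125
```

Note that the bounds from the previous slide hold here: S(5) = 4 ≤ I(5) = 4.5 ≤ 5 and E(5) = 0.8 ≤ U(5) = 0.9 ≤ 1.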
Quantitative Performance Evaluation
scalability
- objective: adding further processing elements to the system shall reduce the execution time without any program modifications, i.e. a linear performance increase with an efficiency close to 1
- important for scalability is a sufficient problem size: one porter may carry one suitcase in a minute, but 60 porters won't do it in a second; however, 60 porters may carry 60 suitcases in a minute
- in case of a fixed problem size and an increasing amount of processors, saturation will occur for a certain value of p, hence scalability is limited
- when scaling the amount of processors together with the problem size (so-called scaled problem analysis), this effect will not appear for well-scalable hard- and software systems
Quantitative Performance Evaluation
AMDAHL's law
- probably the most important and most famous estimate for the speed-up (even if quite pessimistic)
- underlying model: each program consists of a sequential part s, 0 ≤ s ≤ 1, that can only be executed in a sequential way (e.g. synchronisation, data I/O); furthermore, each program consists of a parallelisable part 1 - s that can be executed in parallel by several processes (e.g. finding the maximum value within a set of numbers)
- hence, the execution time of the parallel program executed on p processors can be written as

    T(p) = s*T(1) + ((1 - s)/p)*T(1)
Quantitative Performance Evaluation
AMDAHL's law (cont'd)
the speed-up can thus be computed as

    S(p) = T(1) / T(p) = T(1) / (s*T(1) + ((1 - s)/p)*T(1)) = 1 / (s + (1 - s)/p)

when increasing p we finally get AMDAHL's law:

    lim_{p -> inf} S(p) = lim_{p -> inf} 1 / (s + (1 - s)/p) = 1/s

the speed-up is bounded: S(p) ≤ 1/s
the sequential part can have a dramatic impact on the speed-up; therefore the central effort of all (parallel) algorithms is to keep s small; many parallel programs have a small sequential part (s < 0.1)
Quantitative Performance Evaluation
AMDAHL's law (cont'd)
example: for s = 0.1, S(p) ≤ 10 independent of p; the speed-up is bounded by this limit (plot of S(p) over p for s = 0.1 omitted)
where's the error?
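Evaluating Amdahl's formula for the slide's s = 0.1 shows how quickly the bound bites (the helper `amdahl_speedup` is my own naming):

```python
# Amdahl's law S(p) = 1 / (s + (1 - s)/p) for sequential fraction s = 0.1;
# the speed-up approaches but never exceeds 1/s = 10.

def amdahl_speedup(s: float, p: int) -> float:
    """Speed-up on p processors with sequential fraction s."""
    return 1.0 / (s + (1.0 - s) / p)

for p in (1, 10, 100, 1000):
    print(f"p = {p:5d}: S(p) = {amdahl_speedup(0.1, p):.2f}")
```

Even with 1000 processors the speed-up stays below 10, i.e. efficiency drops below 1%; this is the pessimistic message of the law.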
Quantitative Performance Evaluation
GUSTAFSON's law
- addresses the shortcomings of AMDAHL's law, as it states that any sufficiently large problem can be efficiently parallelised
- instead of a fixed problem size it supposes a fixed-time concept
- underlying model: the execution time on the parallel machine is normalised to 1; this contains a non-parallelisable part σ, 0 ≤ σ ≤ 1
- hence, the execution time of the sequential program on the monoprocessor can be written as T(1) = σ + p*(1 - σ)
- the speed-up can thus be computed as

    S(p) = σ + p*(1 - σ) = p + σ*(1 - p)
Quantitative Performance Evaluation
GUSTAFSON's law (cont'd)
difference to AMDAHL: the sequential part s(p) is not constant, but gets smaller with increasing p:

    s(p) = σ / (σ + p*(1 - σ)),   s(p) ∈ ]0, 1[

- often more realistic, because more processors are used for a larger problem size, and there the parallelisable parts typically increase (more computations, less declarations, ...)
- the speed-up is not bounded for increasing p
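A quick numeric sketch of both formulas (helper names are my own) shows the contrast with Amdahl: the scaled speed-up grows without bound while the equivalent sequential fraction s(p) shrinks:

```python
# Gustafson's law: S(p) = p + sigma*(1 - p) for a fixed non-parallelisable
# fraction sigma of the *parallel* run time, plus the shrinking equivalent
# sequential part s(p) = sigma / (sigma + p*(1 - sigma)).

def gustafson_speedup(sigma: float, p: int) -> float:
    """Scaled speed-up for non-parallelisable part sigma."""
    return p + sigma * (1 - p)

def sequential_part(sigma: float, p: int) -> float:
    """Equivalent Amdahl-style sequential fraction s(p)."""
    return sigma / (sigma + p * (1 - sigma))

for p in (10, 100, 1000):
    print(f"p = {p:5d}: S(p) = {gustafson_speedup(0.1, p):7.1f}, "
          f"s(p) = {sequential_part(0.1, p):.4f}")
```

With sigma = 0.1, the speed-up is about 0.9*p for large p instead of saturating at 10 as under Amdahl's fixed-size assumption.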
Quantitative Performance Evaluation
GUSTAFSON's law (cont'd)
some more thoughts about speed-up
- theory tells: a superlinear speed-up does not exist, since each parallel algorithm can be simulated on a monoprocessor system by emulating in a loop always the next step of a processor from the multiprocessor system
- but a superlinear speed-up can be observed when improving an inferior sequential algorithm, or when a parallel program (that does not fit into the main memory of the monoprocessor system) completely runs in the caches and main memories of the nodes of the multiprocessor system
Quantitative Performance Evaluation
communication-computation ratio (CCR)
- important quantity measuring the success of a parallelisation
- gives the relation of pure communication time and pure computation time; a small CCR is favourable
- typically: the CCR decreases with increasing problem size
example:
- an N × N matrix is distributed among p processors (N/p rows each)
- iterative method: in each step, each matrix element is replaced by the average of its eight neighbour values
- hence, the two neighbouring rows are always necessary
- computation time: 8*N*N/p; communication time: 2*N; CCR: p/(4*N)
what does this mean?
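The CCR of the stencil example can be evaluated for a few configurations (a sketch under the slide's cost model, counting 8 operations per element and 2N exchanged boundary values; the helper `ccr` is my own naming):

```python
# CCR for the slide's example: an N x N matrix split into N/p row blocks;
# per iteration each processor computes 8*N*N/p operations and exchanges
# 2*N boundary values (one row to each neighbouring processor).

def ccr(N: int, p: int) -> float:
    """Communication-computation ratio 2N / (8N^2/p) = p / (4N)."""
    computation = 8 * N * N / p
    communication = 2 * N
    return communication / computation

print(f"N = 1000, p =  10: CCR = {ccr(1000, 10):.6f}")
print(f"N = 1000, p = 100: CCR = {ccr(1000, 100):.6f}")
print(f"N = 4000, p = 100: CCR = {ccr(4000, 100):.6f}")
```

The numbers illustrate the slide's point: adding processors at fixed N increases the CCR (more communication per unit of work), while growing the problem size at fixed p decreases it, which is why large problems parallelise well.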