High Performance Computing, an Introduction to

Transcription

Nicolas Renon, Ph.D., Research Engineer in Scientific Computations, CALMIP - DTSI, Université Paul Sabatier, University of Toulouse (nicolas.renon@univ-tlse3.fr)
Michel Fournié, Maître de conférences, IMT, Université de Toulouse
Toulouse University Computing Center

Summary: High Performance Computing - HPC
  HPC, what is it? What for?
  Concepts in HPC (processors and parallelism)
    Moore's Law, Amdahl's Law
  HPC computational systems and processor architecture
    Taxonomy: shared memory machines, distributed memory machines
    Scalar, superscalar, multi-core processors
  Code optimisation
  Parallel programming
    Make it happen: Message Passing Interface MPI (+BE)
      Execution model
      Why exchange messages
      MPI library: most important routines
    Basics: speed-up, parallel efficiency and traps

Summary: High Performance Computing - HPC
  HPC, what is it? What for?
  Concepts in HPC (processors and parallelism)
    Moore's Law, Amdahl's Law
  HPC computational systems and processor architecture
    Taxonomy: shared memory machines, distributed memory machines
    Scalar, superscalar, multi-core processors
  Code optimisation
  Parallel Computation
    Make it happen: Message Passing Interface MPI (+BE), OpenMP
    Basics: speed-up, parallel efficiency and traps

Principle of Supercomputers: HPC computing systems
  Hardware and software
  Achieve performance on floating-point computation: Flop/s + memory (RAM and file system)
    Flop: floating-point operation (mult, add)
  Storage, I/O (input/output)
  ++ or $++
  Shared resource (mutualisation)
    Several (tens of?) users on a single (big) system
    Need for coherency in the use of resources
    Rules (fair sharing?) => job scheduler
  Supercomputers are dedicated only to computing scientific applications
    Back-up, (huge) file storage
    Remote access
  Nowadays: huge constraints on the hosting infrastructure
    Electricity supply (back-up)
    Cooling
    Weight
    Safety
    Security

HPC System taxonomy
  Shared memory systems: multiprocessors
    One single address space, shared memory
    UMA: Uniform Memory Access
    NUMA: Non-Uniform Memory Access
  Distributed memory systems (clusters): multi-computers
    Multiple (different) address spaces
    NORMA: No Remote Memory Access

OS: Operating System
  Operating system («système d'exploitation» in French): a collection of programs that unify the hardware in order to make it usable.
  Examples: Windows, Linux, MacOSX; Unix-like: AIX (IBM), HPUX (HP), SOLARIS (SUN)
  OS: «allocates hardware resources to your program»
  (Diagram: user / operating system / hardware layers. Source: Operating system, Wikipedia.)

UMA Architecture (Shared Memory)
  User side:
    A single machine (single OS), several processors, one single memory address space
    How to program: an extension of sequential programming
  Machine side: SMP (Symmetric MultiProcessor)
    Bus interconnection between memory and processors
    Central memory and I/O are shared by all processors
    All processors access the same memory
  (Diagram: two processors, each with a register file, functional units (mult, add) and a cache with cache coherency, connected through a bus interconnect to the shared memory, under one OS.)

UMA Architecture (Shared Memory)
  Machine side: SMP (Symmetric MultiProcessor)
    All processors access the same memory
    Cache coherency: different processors modifying data in the same cache line
  (Diagram: shared memory and two processors, each with a register file, functional units (mult, add) and a cache with cache coherency, under one OS.)

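Cache Coherency
  (Diagram: the shared memory holds A=1.5E8; over the bus interconnect, one of three processors (FPU) holds a copy A=1.5E8 in its cache.)

Cache Coherency
  (Diagram: the shared memory holds A=1.5E8; two of the three processors now hold a copy A=1.5E8 in their caches.)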

Cache Coherency
  (Diagram: the shared memory holds A=1.5E8; the three processor caches hold A=1.5E8, A=0.6E-2 and A=1.5E8.)
  Processors have different values of A

Cache Coherency
  (Diagram: same state as on the previous slide, with a question mark on the result.)
  Processors have different values of A
  Unacceptable from the developer's point of view

UMA Machine: Cache Coherency
  Several protocols
    Invalidation: one write invalidates the data on the other processors
    Snooping
  Handled by the system, not by the user
  But: it can slow down parallel performance
  The developer should be aware of this (OpenMP)

UMA Architecture (Shared Memory)
  Scalability: tests on multithreaded DGEMM (MKL, ifort: v11), matrix size = x (DP)
  Automatic parallelisation
  (Plot: MKL DGEMM v, time in seconds versus number of cores; series "NHM EX 2.6 GHz", "Itanium 1.5 GHz" and "NHM EP 2,".)
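
The write-invalidate idea from the cache coherency slides can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the class name `SharedMemorySystem`, the write-through policy and the `invalidate` switch are invented for illustration, and real protocols work on cache lines in hardware rather than on named variables.

```python
class SharedMemorySystem:
    """Toy model: one shared memory plus one private cache per processor."""

    def __init__(self, n_procs, memory):
        self.memory = dict(memory)
        self.caches = [dict() for _ in range(n_procs)]

    def read(self, proc, name):
        cache = self.caches[proc]
        if name not in cache:              # miss: fetch from shared memory
            cache[name] = self.memory[name]
        return cache[name]

    def write(self, proc, name, value, invalidate=True):
        self.caches[proc][name] = value
        self.memory[name] = value          # write-through (assumption, keeps the toy simple)
        if invalidate:
            for other, cache in enumerate(self.caches):
                if other != proc:
                    cache.pop(name, None)  # drop stale copies on the other processors


def demo(invalidate):
    """Three processors cache A, processor 1 overwrites it, then all read A."""
    system = SharedMemorySystem(3, {"A": 1.5e8})
    for p in range(3):
        system.read(p, "A")
    system.write(1, "A", 0.6e-2, invalidate=invalidate)
    return [system.read(p, "A") for p in range(3)]
```

Calling `demo(False)` reproduces the incoherent state of the slides (processors 0 and 2 still see 1.5E8), while `demo(True)` makes every processor re-read the new value. The extra misses after each invalidation are one reason why coherency traffic can slow parallel performance.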

UMA Architecture
  Memory access:
    Concurrent access to central memory => bottleneck; access time increases
    Increase the size (and so the number of levels) of caches
  Consequence: only a small number of processors
  Another paradigm/option: distribute the memory?
  (Diagram: two processors, each with a register file, functional units (mult, add) and a cache with cache coherency, sharing one memory under one OS.)

Laptops, workstations and PCs are multi-core: shared memory
  2 distinct cores, distinct caches
  2 cores, shared cache, cache coherency
  2 dual-cores, cache coherency per dual-core
  Shared memory architecture!
  Parallel application: compulsory

Distributed memory (NORMA)
  Processor and memory tightly interconnected
  MPP: Massively Parallel Processing
  Cluster: interconnection of machines (compute nodes)

Distributed memory: Cluster
  Cluster: massive technology
    A lot of processors
    A lot of interconnected machines (nodes)
    Nodes: multi-processor, multi-core
  (Diagram: eight nodes, each with memory (M), processor (P) and I/O (E/S), attached to a central interconnect.)

Distributed memory: multi-computer architecture (clusters)
  Machine side:
    Massive technology
    Each process accesses its own (local) memory space
    Interconnected nodes: like the internet (Ethernet), but needs to be much faster (bandwidth and latency)
    Process-to-process communication
  (Diagram: two nodes, each with a processor (register file, cache, functional units (mult, add)), main memory, a disk and its own OS: OS 1 and OS 2.)

Distributed memory: multi-computer architecture (clusters)
  User side:
    n different interconnected nodes (n OS), 1 (or more) processor per node
    Parallel programming: message passing (MPI; the work is done by the developer: you?)
    Need for efficient tools to properly access computing resources
  (Diagram: same two nodes as on the previous slide.)
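
The message-passing model described above, where each process only touches its own local memory and anything else must be sent explicitly, can be mimicked in a few lines. This is a sketch, not MPI: two threads stand in for two processes on separate nodes, `queue.Queue` objects stand in for the interconnect, and the names `worker` and `run_two_ranks` are invented for illustration.

```python
import queue
import threading


def worker(rank, local_data, inbox, outbox, results):
    """Each rank sums its own data, then exchanges the partial sum."""
    partial = sum(local_data)      # works on local memory only
    outbox.put(partial)            # "send" the partial sum to the other rank
    remote = inbox.get()           # "receive" (blocks until a message arrives)
    results[rank] = partial + remote


def run_two_ranks(data0, data1):
    """Run two ranks that each end up with the global sum."""
    q01 = queue.Queue()            # channel rank 0 -> rank 1
    q10 = queue.Queue()            # channel rank 1 -> rank 0
    results = {}
    threads = [
        threading.Thread(target=worker, args=(0, data0, q10, q01, results)),
        threading.Thread(target=worker, args=(1, data1, q01, q10, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

For example, `run_two_ranks([1, 2, 3], [4, 5, 6])` leaves the global sum 21 on both ranks, even though neither rank ever reads the other's list directly.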

Interconnection: key of a parallel machine
  In a parallel machine, the hardware (processors / memory) has to be connected
    Specific (fast) protocols: InfiniBand (shrink of Ethernet), Myrinet, proprietary
    Topology: ring, hypercube, torus, fat-tree, ...
  Interconnection:
    Latency: how long does it take to get connected? On the order of a microsecond
    Bandwidth (throughput): rate of data transfer? MB/s
    Topology: how many paths from one point to another
  Units:
    1 MB/s (Mo/s) = 1 megabyte/s = 10^6 bytes/s
    1 GB/s (Go/s) = 1 gigabyte/s = 10^9 bytes/s
    1 TB/s (To/s) = 1 terabyte/s = 10^12 bytes/s

Interconnect
  Interconnection: topology
    Best choice: each processor connected to all the others
      Price (affordable?): feasible for a small number of cores, impossible at large scale (1000 cores)
    The least bad choice: try to avoid bottlenecks
      Scalability of the network topology
      Different strategies
      Cross-bar
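
A common first-order way to combine the latency and bandwidth figures above is to model the time to move one message as latency plus size divided by bandwidth. The sketch below assumes that simple model; the function names and the example numbers (1 microsecond, 1 GB/s) are illustrative assumptions, not measurements from the slides.

```python
def transfer_time_s(message_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost of one message: start-up latency plus streaming time."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def effective_bandwidth(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second actually achieved once latency is included."""
    t = transfer_time_s(message_bytes, latency_s, bandwidth_bytes_per_s)
    return message_bytes / t


# Assumed link: 1 microsecond latency, 1 GB/s = 1e9 bytes/s.
LATENCY_S = 1e-6
BANDWIDTH = 1e9

small = transfer_time_s(8, LATENCY_S, BANDWIDTH)          # one double: latency dominates
large = transfer_time_s(1_000_000, LATENCY_S, BANDWIDTH)  # 1 MB: bandwidth dominates
```

With these assumed numbers, sending a single 8-byte value costs almost exactly the latency, while a 1 MB message is dominated by the streaming term. This is why grouping many small messages into one large message usually pays off.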

Interconnect: example, 2D torus
  (Figure: 2D torus of nodes labelled B0 to B21 with their core numbers and measured latencies.)

Interconnection / network: protocols
  Different protocols
  (Figure. Source: Journées JoSy, Groupe Calcul CNRS, 13/09/2007, Lyon.)
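
The wrap-around structure of a 2D torus is easy to express in code. The sketch below is an illustration only: the grid dimensions `nx` and `ny` are assumptions (the size of the torus in the figure is not recoverable), and nodes are addressed by hypothetical `(x, y)` coordinates rather than by the B-labels of the slide.

```python
def torus_neighbors(x, y, nx, ny):
    """The four direct neighbours of node (x, y) on an nx-by-ny torus."""
    return [
        ((x - 1) % nx, y),
        ((x + 1) % nx, y),
        (x, (y - 1) % ny),
        (x, (y + 1) % ny),
    ]


def torus_hops(a, b, nx, ny):
    """Minimum number of links between nodes a and b, using wrap-around."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, nx - dx) + min(dy, ny - dy)
```

On a 4 x 4 torus, opposite corners `(0, 0)` and `(3, 3)` are only 2 hops apart thanks to the wrap-around links, against 6 hops on a plain mesh of the same size.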

TOP500, June 2012: TOP 500 list and interconnect
  (Table: ranking extract.)
  Machine CURIE, European programme «PRACE» (France)
  Nuclear plant: between 40 MW and 1450 MW

Interconnect / Network: TOP 500
  (Figure.)

Interconnect: Topology
  Hypercube topology: 3D, 4D, 8D
  (Figure.)

Interconnect: Topology
  Fat-tree, «mirrored»
  (Figure.)
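
In a hypercube, nodes can be numbered so that two nodes are linked exactly when their binary labels differ in one bit. The sketch below assumes that standard numbering and also compares link counts with the "each processor to all" option; all function names are invented for illustration.

```python
def hypercube_neighbors(node, dim):
    """Neighbours of a node in a dim-dimensional hypercube: flip one bit."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hypercube_hops(a, b):
    """Minimum number of links between a and b: number of differing bits."""
    return bin(a ^ b).count("1")


def hypercube_link_count(dim):
    """2**dim nodes with dim links each, every link counted once."""
    return dim * 2 ** (dim - 1)


def all_to_all_link_count(n_nodes):
    """Links needed when every node is wired directly to every other node."""
    return n_nodes * (n_nodes - 1) // 2
```

A 3D hypercube connects 8 nodes with 12 links and at most 3 hops, where a full all-to-all wiring of the same 8 nodes needs 28 links. For 1000 nodes, `all_to_all_link_count(1000)` gives 499500 links, which illustrates why the fully connected option does not scale.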

Processors Architecture: MIMD
  MIMD: Multiple Instruction, Multiple Data

                    «Classic» processor    HPC processor (more expensive)
  Example           Pentium IV             Itanium II
  Clock frequency   3.2 GHz                1.5 GHz
  Cache             500 kbytes             6 MBytes
  Peak              6.4 Gflop/s            6 Gflop/s
  Linpack           0.7 Gflop/s            5.4 Gflop/s

  Time to solution: 8 times faster!
  Vendors: Intel (Itanium), AMD (Opteron), IBM (Power), FUJITSU (UltraSparc), NEC (vector processors)
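
The comparison between the two processors can be checked with a little arithmetic. The sketch below only reuses the peak, Linpack and clock figures quoted above; the dictionary layout and function names are illustrative choices.

```python
PROCESSORS = {
    # name: (clock in GHz, peak in Gflop/s, Linpack in Gflop/s), as quoted on the slide
    "Pentium IV": (3.2, 6.4, 0.7),
    "Itanium II": (1.5, 6.0, 5.4),
}


def flops_per_cycle(name):
    """Peak floating-point operations retired per clock cycle."""
    clock_ghz, peak, _ = PROCESSORS[name]
    return peak / clock_ghz


def linpack_efficiency(name):
    """Fraction of the peak actually sustained on Linpack."""
    _, peak, linpack = PROCESSORS[name]
    return linpack / peak


def linpack_speedup(fast, slow):
    """How many times faster `fast` runs Linpack than `slow`."""
    return PROCESSORS[fast][2] / PROCESSORS[slow][2]
```

With these figures the Pentium IV sustains about 11% of its peak and the Itanium II about 90%, and `linpack_speedup("Itanium II", "Pentium IV")` is about 7.7, which is the "8 times faster" of the slide despite the lower clock frequency.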

Processors Architecture: MIMD
  Processor cycle
    Clock frequency = number of pulses per second
    200 MHz = 200 million cycles per second
    Aim: retire one or more operations per cycle
  Optimised architecture:
    Instruction Level Parallelism (ILP): pipeline, multiple functional units (FPU)
    Hierarchical memory (access time): cache levels L1, L2, L3
    Speculative execution: branch prediction, prefetching

Processors Architecture: MIMD
  Processor architecture
  (Diagram: control unit, instruction memory, FPU/ALU, registers, data memory; memory (cache) holding data + instructions.)

Processors Architecture: MIMD
  Example of an architecture scheme
  Superscalar processor: Itanium
  «Massively parallel» architecture: 2 FPU, 4 I&MM units, 3 Branch Prediction
  (Figure.)

Processors Architecture: MIMD
  Pipeline
  Aim: 1 cycle = 1 retired operation
  Very roughly speaking: A1 = B1 + C1 goes through load, exec, write
    3 phases, 1 phase per cycle
  Pipeline example: 3 independent operations
    A1 = B1 + C1
    A2 = B2 + C2
    A3 = B3 + C3

Processors Architecture: MIMD
  Pipeline
  Aim: 1 cycle = 1 retired operation
  Independent operations: A1 = B1 + C1, A2 = B2 + C2, A3 = B3 + C3
  Without pipelining:

  Cycle   load          exec          write
  1       Load B1,C1
  2                     Add B1,C1
  3                                   Store in A1
  4       Load B2,C2
  5                     Add B2,C2
  6                                   Store in A2
  7       Load B3,C3
  8                     Add B3,C3
  9                                   Store in A3

  3 retired operations: 9 cycles
  Resources are «idle» most of the time

Processors Architecture: MIMD
  Pipeline
  Aim: 1 cycle = 1 retired operation
  Independent operations: A1 = B1 + C1, A2 = B2 + C2, A3 = B3 + C3
  With pipelining:

  Cycle   load          exec          write
  1       Load B1,C1
  2       Load B2,C2    Add B1,C1
  3       Load B3,C3    Add B2,C2     Store in A1
  4                     Add B3,C3     Store in A2
  5                                   Store in A3

  3 retired operations: 5 cycles instead of 9
  Pipelining: latency of 3 cycles, then 1 result per cycle
  Independent operations: exhibit as many independent operations as possible
  Feed the pipeline
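
The cycle counts of the two pipeline slides follow a simple rule: without pipelining, n operations through s stages take n x s cycles; with a full pipeline they take s + n - 1 cycles. The sketch below assumes the idealised one-stage-per-cycle model of the slides, with no stalls; the function names are invented for illustration.

```python
def cycles_sequential(n_ops, n_stages):
    """Each operation waits for the previous one to finish all its stages."""
    return n_ops * n_stages


def cycles_pipelined(n_ops, n_stages):
    """First result after n_stages cycles, then one result per cycle."""
    if n_ops == 0:
        return 0
    return n_stages + n_ops - 1


def pipeline_schedule(n_ops, stages=("load", "exec", "write")):
    """Map each cycle number to the stage/operation pairs active in it."""
    schedule = {}
    for op in range(n_ops):
        for offset, stage in enumerate(stages):
            cycle = op + offset + 1          # operation op enters at cycle op + 1
            schedule.setdefault(cycle, []).append(f"{stage} op{op + 1}")
    return schedule
```

For the three additions of the slides, `cycles_sequential(3, 3)` gives 9 and `cycles_pipelined(3, 3)` gives 5. For long streams of independent operations the pipelined cost approaches one cycle per operation, which is why the slides insist on feeding the pipeline.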

Processors Architecture: MIMD
  Feed with data: hierarchical memory
  The FPU (Floating Point Unit) «works on» data held in registers
  Registers: small size
  A1 = B1 + C1
    load B1: in cache?
      Hit (yes): load C1
      Miss (no): cost of n cycles, the unit has to «WAIT» n cycles
  (Diagram: timeline of the load, with the hit/miss decision.)

Processors Architecture: MIMD
  Cache levels: example of the Intel Itanium2 processor, latency and throughput
  1.5 GHz => 1 cycle ≈ 0.67 ns
  (Figure: memory hierarchy. 128 integer registers (1 KB) and 128 FP registers (1 KB); L1D 16 KB; L2U 256 KB, 5+1 cycles; L3U up to 9 MB, 12+1 cycles; main memory (Altix): 145+ ns. Latencies of 2, 5 and 12 cycles and link bandwidths of 16 GB/s, 32 GB/s and 6.4 GB/s are marked between the levels, together with a label "16Rd / 6Wr".)
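
The cost of a cache miss becomes concrete once latencies are converted between nanoseconds and cycles. The sketch below uses the 1.5 GHz clock and the 145 ns memory latency from the Itanium2 slide; the function names are invented, and the hit fractions in the last example are purely hypothetical values chosen for illustration.

```python
def cycle_time_ns(clock_ghz):
    """Duration of one clock cycle in nanoseconds."""
    return 1.0 / clock_ghz


def ns_to_cycles(t_ns, clock_ghz):
    """Number of clock cycles that elapse during t_ns nanoseconds."""
    return t_ns * clock_ghz


def average_load_cycles(levels):
    """Expected cost of a load, given (fraction_served, latency_cycles) pairs.

    The fractions must sum to 1; they are workload dependent.
    """
    total = sum(fraction for fraction, _ in levels)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * latency for fraction, latency in levels)


memory_latency_cycles = ns_to_cycles(145, 1.5)   # main memory, in cycles

# Hypothetical workload: 90% of loads served by L2 (6 cycles), 10% by L3 (13 cycles).
example_cost = average_load_cycles([(0.9, 6), (0.1, 13)])
```

At 1.5 GHz one cycle lasts about 0.67 ns, so a 145 ns access to main memory costs roughly 217 cycles during which the floating-point units may sit idle. That is the «WAIT» of the previous slide.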

Processors Architecture: MIMD
  Cache level effect
  (Figure.)

Processors Architecture: MIMD
  Speculate: exhibit independent operations

    If (cond1) then
      a1 = b1 + c1
    else
      a1 = b1 * c1
    End if

  «Break» the dependency
    Dependency: a1 = b1 + c1 depends on cond1
    Branch prediction: «bet» that cond1 is true, compute a1 = b1 + c1, and check later
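
The «bet, compute, check later» idea can be written out in software form. This is only an analogy under loud assumptions: real speculation happens inside the processor, and the function name `speculative_select` and the callable `cond_fn` (standing for a condition whose value is known late) are invented for illustration.

```python
def speculative_select(cond_fn, b1, c1):
    """Bet that cond1 is true, compute early, and check the bet afterwards.

    Returns (a1, bet_won).
    """
    guess = b1 + c1            # speculative work, done before cond1 is known
    if cond_fn():              # cond1 resolves later
        return guess, True     # bet won: the result is already available
    return b1 * c1, False      # bet lost: discard the guess and recompute
```

With `cond_fn=lambda: True` the speculative sum is kept; with `cond_fn=lambda: False` it is thrown away and the product is computed instead, which corresponds to the cost of a misprediction.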

Processors Architecture HPC
  A core: what is it?
  (Figure.)

Processors Architecture HPC
  (Figure.)

Processors Architecture HPC
  Trends nowadays: multi-core
  Chip makers: Intel, AMD, IBM, FUJITSU (Sparc)
  Process shrink (45 nm, 32 nm, 22 nm, ? nm)
  (Diagram: 1 single-core processor with its cache, next to multi-core processors with several cores per chip.)
  More raw power (x2, x4, x6, x8, ...)
  Better flop/watt and flop/m² ratios
  Same frequency, or lower!
  Bottleneck? (SMP) => bandwidth to the RAM
  1 processor (or socket) = multi-core (multi = 2, 4, 6, 8, 10, 12, 20, ...)

Processors Architecture HPC
  TOP 500, June 2013
  (Figure.)

Processor Architecture: SIMD
  Vector processor
    Operations on vectors (not on scalars): Single Instruction, Multiple Data (SIMD)
    Each cycle => n scalar operations retired
    The code/program must lend itself to vectorization (it needs a lot of data)
    Affordable?

    X     =  X1     X2     X3     ...  Xn
    Y     =  Y1     Y2     Y3     ...  Yn
    X+Y   =  X1+Y1  X2+Y2  X3+Y3  ...  Xn+Yn

  The vector principle (SIMD) is now available on scalar/superscalar processors:
    x86 SSE/AVX: Streaming SIMD Extensions
    AltiVec
    Superscalar (2 FPU)

Processors Architecture: SIMD
  Processor architecture
  (Diagram: one control unit and one instruction memory driving FPU/ALU and registers over several data memories; memory (cache) holding data + instructions.)
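
The benefit of the X+Y vector operation above can be shown by counting instructions. The sketch below is a pure-Python emulation under stated assumptions: `width` stands for the number of lanes of a hypothetical SIMD unit, and each chunk of `width` element-wise additions is counted as one vector instruction. It models the principle only and does not use real SSE/AVX instructions.

```python
def vector_add_lanes(x, y, width):
    """Add x and y in chunks of `width` lanes.

    Returns (result, number of vector instructions issued).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if width < 1:
        raise ValueError("width must be at least 1")
    result = []
    instructions = 0
    for start in range(0, len(x), width):
        xs = x[start:start + width]
        ys = y[start:start + width]
        result.extend(a + b for a, b in zip(xs, ys))  # one instruction, `width` lanes
        instructions += 1
    return result, instructions
```

Adding two 5-element vectors takes 5 instructions with `width=1` (scalar) but only 2 with `width=4`; the last instruction runs with partly empty lanes, which is one reason why vectorization pays off mainly on large amounts of data.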

Processor Architecture: SIMD
  Accelerators
  GP-GPU: General-Purpose Graphics Processing Unit
    Precision, compilers, languages (CUDA), etc.
    Stream processing (data-centric processing)
    SIMD: intensive data parallelism, data locality
    Future: inside the processor chip?

  GP-GPU         Peak             On-board RAM   RAM access time
  Tesla          ... Gflop/s      1 GB           about ... ns
  Tesla Kepler   4.5 Tflop/s      6 GB           ... ns
  ATI            1.5 Tflop/s      2 GB           about ... ns

  Intel accelerators: Xeon Phi, 50 cores! MIMD (x86) cores

Processors Architecture: SIMD
  Accelerator architecture
  (Diagram: many identical blocks, each with its own control unit and data memory.)

TOP 500: accelerators
  (Figures: TOP 500, June 2012 and TOP 500, June 2013.)