High Performance Computing: an Introduction
Nicolas Renon, Ph.D., Research Engineer in Scientific Computation
CALMIP - DTSI, Université Paul Sabatier, University of Toulouse ([email protected])
Toulouse University Computing Center

Heat Equation in 1D (diffusion)
Heat Equation in 1D (diffusion): implementation (MATLAB), explicit scheme
Time to solution: example #1

    % Heat equation on [0,1], EXPLICIT SCHEME
    % --- maximum time
    tmax = 0.02;
    % mesh
    J = 200;
    x = linspace(0, 1, J+2); dx = x(2) - x(1);  % grid incl. boundaries (assumed)
    mu = 1;                                     % diffusion coefficient (assumed)
    dt = 0.5*dx^2/mu;                           % stability-limited step (assumed)
    ud = 0; ug = 0;                             % right/left boundary values (assumed)
    pun = @(u,ud) [u(2:end) ud];                % shifted copy u_{j+1} (assumed helper)
    mun = @(u,ug) [ug u(1:end-1)];              % shifted copy u_{j-1} (assumed helper)
    source = @(mu,x,t) zeros(size(x));          % source term, zero here (assumed)
    u = exp(-(x(2:J+1)-0.5).^2/(2*(1/6)^2));    % initial condition (u_init on the slide)
    ymin = -0.1; ymax = 1.5;                    % plot limits
    t = 0;
    % test 1 : explicit scheme
    while (t < tmax)
        t = t + dt;
        % t1 = cputime;
        u = u + dt*(mu/dx^2*(pun(u,ud) - 2*u + mun(u,ug)) + source(mu, x(2:J+1), t));
        % temps = temps + cputime - t1;
    end

Computation time vs. problem size (J):
    tmax = 0.02 : 0.02 s / 0.2 s / 2 s
    tmax = 0.2  : 0.28 s / ... / 162 s
    tmax = 20   : ... / ... / 3.44 hours

Exponential fit of the measured times: y = 0.0025 e^(2.1907x).
Time to solution: 3 hours to reach the solution, is that reasonable? Exponential behavior, why? Hardware?
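One standard piece of background that the slide's "why?" points at (von Neumann stability analysis; textbook material, not shown on the slide): the explicit update

\[
u_j^{n+1} = u_j^n + \Delta t\left(\frac{\mu}{\Delta x^2}\left(u_{j+1}^n - 2u_j^n + u_{j-1}^n\right) + s_j^n\right)
\]

is stable only if \( \mu\,\Delta t/\Delta x^2 \le 1/2 \). So \(\Delta t\) must shrink like \(\Delta x^2\): the number of steps grows like \(t_{max}\,J^2\), and with \(O(J)\) work per step the total cost grows like \(t_{max}\,J^3\), which is why the times in the table climb so steeply.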
What about the size of the problem, N = ...?
Courtesy of H. Neau, Institut de Mécanique des Fluides de Toulouse

Summary: High Performance Computing (HPC)
- HPC: what is it? What for?
- Concepts in HPC (processors and parallelism): Moore's Law, Amdahl's Law
- HPC computational systems and processor architecture. Taxonomy: shared-memory machines, distributed-memory machines; scalar, superscalar, multi-core processors; code optimisation
- Parallel computation, make it happen: Message Passing Interface MPI (+BE), OpenMP
- Basics: speedup, parallel efficiency and traps
Numerical Simulation: the 3rd Pillar of Science
A chain relationship of three pillars (sic. Jack Dongarra, Univ. of Tennessee, Knoxville):
- Theory: fluid dynamics (Navier-Stokes), electromagnetism (Maxwell), quantum mechanics (Schrödinger)
- Experimentation / observation: Large Hadron Collider (CERN, Geneva), ITER, Hubble
- Numerical simulation: numerical implementation, approximations, with validation / new theory feeding back into the other two

Scientific computation: numerical simulation thanks to scientific computation, i.e. efficient (optimum) execution of a computer program. Time to solution!!!
Computer programs by domain:
- Quantum chemistry (Schrödinger): Gaussian
- Atomic-scale materials: VASP
- Genomics: BLAST
- Mechanics: crash: LS-Dyna; structures: ABAQUS, GETFEM++
- Fluid dynamics (Navier-Stokes): OpenFOAM, Fluent, GETFEM++
Scientific computation sits between high performance computing and mathematics and is transverse (all disciplines): discretisation in space (FEM, finite volumes, ...) and in time (explicit, implicit schemes, ...), numerical methods (solving linear systems, algebra, ...).
HPC: why? The problem is too:
- complex (different scales)
- expensive
- dangerous (nuclear)
Competitiveness: best product before the others.

HPC: where?
- Science: climate change, astrophysics; biology, genomics, pharmacology; chemistry, nanotechnology
- Engineering: crash, earthquake, CFD, combustion
- Economy: financial markets
- Classified: cryptography, nuclear
HPC for research: computation/memory and accuracy. Example: tantalum resolidification.
CURIE machine, European programme «PRACE» (France): cosmos simulation, from the Big Bang to nowadays. Dark Energy Simulation Series (DEUSS): 550 billion particles (N-body simulation), ... GB RAM (distributed), ... cores.
Jean-Michel Alimi (1), Yann Rasera (1), Pier-Stefano Corasaniti (1), Jérôme Courtin (1), Fabrice Roy (1), André Fuzfa (1,2), Vincent Boucher (3,4), Romain Teyssier (4)
(1) Laboratoire Univers et Théories (LUTH), CNRS-UMR8102, Observatoire de Paris, Université Paris Diderot; (2) GAMASCO, Université de Namur, Belgium; (3) CP3, Université catholique de Louvain, Belgium; (4) Service d'Astrophysique, CEA Saclay
Simulating proteins (molecules) related to Alzheimer's disease: ... cores, 960 TFLOP, 2900 basis functions. LCPQ-IRSAMC, CNRS / University of Toulouse / Université Paul Sabatier, M. Caffarel, A. Scemama. CURIE machine, European programme «PRACE» (France).

HPC: competitiveness (research and industry). Procter & Gamble uses HPC for the Pringles?
HPC: competitiveness (research and industry). Capacity to model a whole engine? Reduce design costs, get to market faster. Code coupling: fluids and mechanics, fluids and chemistry. You have to be confident in the numerical solution!!!

HPC: competitiveness (research and industry). Capacity to model a whole aircraft? Size of the mesh? A part of the aircraft? A whole aircraft?
HPC: virtual prototyping. Virtual prototyping: run virtual tests, cheaper and faster.
- CFD (Navier-Stokes) => example: brake cooling. Codes: Fluent (commercial), OpenFOAM (free software)
- FEM => impact, crash. Codes (commercial): Abaqus, Radioss
- CFD (combustion) => in-cylinder simulation. Codes: Ansys-CFX (commercial), AVBP (academic/commercial)
Feel confident enough in the result!!!

Virtual prototyping (slide from a PRACE workshop, September 2009): run virtual tests, cheaper and faster. Feel confident enough in the result!!!
HPC for experimentation. Key words: data deluge, «Big Data».
Large Hadron Collider, Higgs boson: huge data management, 40 million events per second.
GAFA (Google, Apple, Facebook, Amazon) and others. Mankind-created data: ... petabytes in ..., ... petabytes in 2010.
Credit: The Data Deluge, The Economist; Understanding Big Data, Eaton et al.
HPC in a process chain: research & development; highly parallel code (FEM, CFD, etc.); analysis / visualisation; supercomputing / data management and storage.

HPC: operating high technology. The need to compute more and faster, to solve the real world!! The need to think anew: new, efficient techniques/algorithms. It concerns the deepest level of program performance, at the processor (core) level, and a massively parallel approach. The associated computational systems are hugely powerful. What does "powerful" mean? What does "performance" mean? HPC sits between the application (the scientific problem to solve) and mathematics.
TOP 500 List
TOP 500: twice per year (June and November), the list of the most powerful computing systems. (Top500 June 2012.)

HPC: TOP 500 List. What does "powerful" mean? We are dealing with flop/s: FLOATING-POINT OPERATIONS PER SECOND, simple operations (+, *) on x ∈ R.
- There is a peak performance (theory: Rpeak).
- There is a "for real" performance!! (Rmax); it depends on what application/program you are running.
Benchmark: Linpack (Linear Package). Solve, on the system you are benchmarking, Ax = b with A ∈ R^n × R^n, n > ... How far/close are you from the Rpeak (theory) power? The TOP500 looks at Rmax, not Rpeak.
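For reference, the standard Linpack accounting (textbook background, not spelled out on the slide): the benchmark credits the solve of Ax = b with a fixed operation count, so the reported performance is

\[
R_{max} = \frac{\frac{2}{3}n^3 + 2n^2}{t_{solve}}\ \text{flop/s},
\]

and the efficiency discussed later in the course is the ratio R_max / R_peak.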
TOP 500 List, Rmax (Top500 June 2012). (Chart: Rmax of the ranked systems, top systems around 10 PF; the CURIE machine, European programme «PRACE» (France), highlighted.)

HPC: try to figure out what all these flops represent. Basic idea: 1 kflop/s = 10^3 operations per second on floats, i.e. one page, each side with 5 columns of 100 numbers, processed in 1 second. 10^6 flop/s (1 Mflop/s): 1000 pages, 2 reams of paper (10 cm high; 3 inches high).
High performance computing, basic idea (continued):
- 10^9 flop/s (1 Gflop/s, the order of the power of 1 core): 1 column of paper 100 meters high (second floor of the Eiffel Tower)
- 10^12 flop/s (1 Tflop/s, a «little» parallel machine): 1 column 100 km high (mesosphere: meteors, aurorae)
- 10^15 flop/s (1 Pflop/s, currently around 10th position in the TOP 500): ... km (1/3 of the Moon-Earth distance)
- 10^18 flop/s (1 exaflop/s, imagine?)

Japan: K computer (2011), ... cores, 10 PF, Fujitsu / RIKEN; 8 PFlop (Kobe, Japan), efficiency 93%, 548k SPARC cores, 10 MW, 825 Mflop/W; single socket, 8 cores per socket, 8 flops per cycle per core, 6D torus interconnect.
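A small sketch of the Rpeak arithmetic behind these figures, using the numbers on this slide plus one assumption (the 2 GHz clock is my assumption, not stated here):

    #include <stdio.h>

    int main(void) {
        double cores = 548352;        /* ~548k cores, as on the slide */
        double flops_per_cycle = 8;   /* 8 floating-point ops per cycle per core */
        double freq_hz = 2.0e9;       /* ASSUMPTION: 2 GHz clock */
        double rpeak = cores * flops_per_cycle * freq_hz;  /* theoretical peak */
        double rmax = 8.16e15;        /* ~8 PFlop/s sustained, as on the slide */
        printf("Rpeak = %.2f PF\n", rpeak / 1e15);
        printf("efficiency Rmax/Rpeak = %.0f%%\n", 100.0 * rmax / rpeak);
        return 0;
    }

With these numbers Rpeak comes out near 8.8 PF and the efficiency near 93%, consistent with the slide.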
High performance computing: TOP 500 List (Top500 June 2012). Moore's Law? (Chart: power consumption of the ranked systems, top systems around 12 MW; the CURIE machine, European programme «PRACE» (France), highlighted.) For scale, a nuclear plant delivers between 40 MW and 1450 MW.
High performance machines (research): France / Europe, a pyramid of tiers:
- European stage (PRACE): O(1000) TFLOP
- National stage (GENCI): O(100) TFLOP (GENCI: Grands Equipements Nationaux en Calcul Intensif)
- Regional stage (Toulouse mesocentre; CALMIP: Toulouse University Computing Center): O(10) TFLOP
Also: national weather forecasting (Météo France), CNES, and Total (petroleum industry) with PANGEA: 2 PF.
(Research): Toulouse. Toulouse University Computing System (CALMIP), «HYPERION»: 3500 cores / 38.57 TF Rpeak, 223rd in the TOP 500 (November 2009; that is the Middle Ages by now).

Example in CFD: industrial fluidised bed. Neptune_CFD: ... CPU hours used in 2010 on HYPERION; number of cores in production runs: 68-512. Courtesy of O. Simonin, H. Neau, Laviéville, Institut de Mécanique des Fluides de Toulouse, Université de Toulouse / CNRS.
Example in life science: Laboratoire Evolution et Diversité Biologique (UMR 5174). Fish studies in two rivers near Toulouse: ... CPU hours used, mostly data-set exploration (code in C! Good!). Species: chevaine (chub), gandoise, vairon (minnow).

HPC in Europe, PRACE (Partnership for Advanced Computing in Europe), Top500 June 2012: JUGENE 1 PFLOP (#8), HERMIT 1 PFLOP, CURIE 2 PFLOP (#9), SuperMUC 3 PFLOP (#4), MareNostrum 1 PF, FERMI 2.1 PFLOP (#7). Tiers: European (PRACE), national (GENCI), mesocentre (university / région).
TOP 500: geographical repartition.

HPC: operational technology. For a given application/program:
- Get the best out of the processor (core): HPC concepts, tuning the code to the processor architecture, dedicated scientific libraries.
- Make the code parallel: how far can we get (algorithm), according to the parallel efficiency?
- What kind of computing system (big computers: supercomputers)? High-performance processors (compute power, memory/data bandwidth), a lot of memory (distributed, shared), tens of thousands of processors: interconnection?
Summary of the next part: units in HPC; processors (cores?); Moore's Law; the energy problem; efficiency of programs (Rmax vs. Rpeak); parallelism concepts; Amdahl's Law.

Units in HPC. Flop: floating-point operation (mult, add) on floating-point numbers (real numbers).
Orders of magnitude:
- 1 Kflop/s (kiloflop/s) = 10^3 flop/s ("kilo", thousand in Greek)
- 1 Mflop/s (megaflop/s) = 10^6 flop/s ("mega", large in Greek)
- 1 Gflop/s (gigaflop/s) = 10^9 flop/s ("giga", giant in Greek): a processor/core
- 1 Tflop/s (teraflop/s) = 10^12 flop/s ("tera", monster in Greek): a departmental system
- 1 Pflop/s (petaflop/s) = 10^15 flop/s ("peta", from five in Greek, [10^3]^5): the current highest level
- 1 Eflop/s (exaflop/s) = 10^18 flop/s ("exa", from six in Greek, [10^3]^6): to be reached eventually (?)
Units in HPC. Data: memory (bytes), cache, RAM (DIMMs), hard disk. 1 octet (o) = 1 byte (B) = 8 bits.
Orders of magnitude:
- 1 Ko (KB), kilobyte = 10^3 bytes: registers
- 1 Mo (MB), megabyte = 10^6 bytes: processor cache
- 1 Go (GB), gigabyte = 10^9 bytes: RAM
- 1 To (TB), terabyte = 10^12 bytes: disk
- 1 Po (PB), petabyte = 10^15 bytes: huge file systems

Units in Big Data: system-wide aggregation.
- 1 Po (PB), petabyte = 10^15 bytes: all letters delivered by America's postal service in one year
- 1 Eo (EB), exabyte = 10^18 bytes: 10 billion copies of «Le Monde»
- 1 Zo (ZB), zettabyte = 10^21 bytes: the total amount of information in existence
- 1 Yo (YB), yottabyte = 10^24 bytes: ?
Credit: The Data Deluge, The Economist; Understanding Big Data, Eaton et al.
Latency, «time-to-reach» (roughly):
- 1 nanosecond / 1 kilobyte: from the computing units (in the core) to a register
- 100 nanoseconds / 1 megabyte: from a core to RAM
- 1 microsecond / 1 gigabyte: from process to process
- 100 microseconds / 1 terabyte: from computer to computer

Bandwidth: RAM access, interconnection:
- 1 Mo/s (1 MB/s) = 10^6 bytes/s: interconnect
- 1 Go/s (1 GB/s) = 10^9 bytes/s: fast interconnect / RAM
- 1 To/s (1 TB/s) = 10^12 bytes/s: does it exist?

Processors: the classical Moore's Law; the energy problem (not good news for users); the end of the «free lunch»; applications, are they efficient in terms of HPC?
Processor architecture: a control unit, compute units (FPU, ALU), registers, and data memory; the cache is a small memory sitting between the processor and main memory. Processor cycle: the clock frequency is the number of pulses per second; 200 MHz = 200 million cycles per second.
Moore (Intel): Moore's Law is not a physical law. «Something» doubles every 18 months; «something» = transistors; «something» related to «performance». This is an exponential law: could it last? (It held true for 30 years.) Empirical anyway.

Years: late nineties / early 2000s / nowadays
Processors: Pentium III / Itanium / Xeon (quad-core)
Power: 1 GFlop / 6 GFlop / 40 GFlop

The energy problem: the end of the race for frequency. Why? Thermal dissipation = frequency × (voltage^2) × constant, which scales like (frequency)^3. Using new (old) techniques for cooling: water. Lots of electrical supply (see the TOP500), because of lots of processors. Smartphones, tablets: the need to compute faster with less energy (to save battery). A processor nowadays: the voltage of a pocket lamp (1 volt), the current of an oven (250 A), the power of a light bulb (100 watts), the surface of a big stamp (10 cm^2). Frequency stable around 3 GHz.
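The dissipation claim written out (the standard CMOS dynamic-power model; the V ∝ f step is the usual assumption that voltage must rise with frequency):

\[
P_{dyn} = C\,V^2 f, \qquad V \propto f \;\Rightarrow\; P_{dyn} \propto f^3,
\]

so doubling the clock multiplies the heat by roughly eight, which is why the race for frequency stopped.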
Needs are always increasing (perpetually and exponentially). Moore's Law: «something» keeps increasing; the goal is to increase performance under an energy constraint. Parameters we can play with: the number of transistors, «smarter» processors (new architectures); other ways: accelerators (GPGPU, Xeon Phi).

The current vendors' choice: increase the number of transistors (feature size: 65 nm, 45 nm, 32 nm; nano = 10^-9), constant (even lower!) frequency, same architecture. Conclusion/consequence: multi-core/many-core processors. With feature-size reduction (45 nm, ...), 1 mono-core processor becomes 1 multi-core processor (or socket) (multi = 2, 4, 8, 16, ...), the cores sharing caches.
A core, what is it? It is what computes: it decodes instructions and performs them (operations/instructions). Data: registers (small memory buffers).

Problem number 1: the «free» gain of performance used to come from the increase in frequency; now the gain comes from the increase in the number of cores. Multi-core: if your application/program is not parallel => NO GAINS! (Plot: operations per second vs. frequency; in the «free lunch» era a sequential program got faster with the clock; in the «no free lunch» era only a parallel program exploits 2 cores, 4 cores, ...)
Problem number 2: the efficiency of your application's execution. What we observe (roughly): 10% of peak performance; some applications do better, some worse (really worse)!

TOP 500 List, Rmax/Rpeak (Top500 June 2012). (Chart: Rmax/Rpeak across the list, the best systems reaching about 90%.)
Problem number 2, some leads: memory; instructions per processor cycle: a core can potentially retire 4 or 8 operations per cycle, while applications typically retire 1.2-1.4 operations per cycle.

The memory access problem: why are fast processors slow? They are waiting for the data. There must be an equilibrium between compute velocity and data-transfer velocity. (Plot: CPU clock period (ns) and memory access time (ns) over the years; the clock period fell much faster than the memory access time, so the gap keeps widening.)
Try to minimize the problem: the principle of a hierarchical memory, from small capacity with fast access to big capacity with slow access. Access times from the slide: registers and level-1 cache (1 ns) and level-2 (or 3) cache (2 ns) on the silicon chip, RAM memory (DIMMs, 10 ns), disk (300 ns).

Parallelism: why? Automatic? Speedup, parallel efficiency (scalability) and Amdahl's law.
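A tiny illustration of why this hierarchy matters (my own sketch, not from the slides): the two runs below do exactly the same additions, but the unit-stride version walks cache lines in order while the large-stride version misses the cache on almost every access.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)        /* 16M doubles, far larger than any cache */

    /* Sum every element once, visiting them with the given stride. */
    static double sum_with_stride(const double *a, size_t stride) {
        double s = 0.0;
        for (size_t start = 0; start < stride; start++)
            for (size_t i = start; i < N; i += stride)
                s += a[i];
        return s;
    }

    int main(void) {
        double *a = malloc(N * sizeof *a);
        for (size_t i = 0; i < N; i++) a[i] = 1.0;

        size_t strides[] = {1, 4096};
        for (int k = 0; k < 2; k++) {
            clock_t t0 = clock();
            double s = sum_with_stride(a, strides[k]);
            double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("stride %5zu: sum=%.0f  time=%.3f s\n", strides[k], s, dt);
        }
        free(a);
        return 0;
    }

On typical hardware the strided sum runs several times slower even though the flop count is identical: the processor is waiting for the data.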
Parallel computing: solve the problem faster, solve a bigger (more realistic) problem. How? Make «processors» work together: a «communication» problem, independent use of resources. Concepts in parallelism: efficiency, speedup, Amdahl's law. Parallel algorithms: granularity (lots of small tasks vs. few big tasks), load (compute) balancing and synchronisation. It is harder to write a parallel program than a sequential one, but things are getting better (techniques), and that is the aim of this training; the hardware is getting better too, and parallelism is now everywhere (multi-core/multi-socket laptop processors, GPGPU, ...).

Automatic parallelism (at the processor level, invisible to us humans): ILP, instruction-level parallelism; find independent instructions and compute them in parallel. Automatic parallelism has limited efficiency and exists only at the processor level, so an explicit parallelisation has to be made: at the algorithm level, using standard techniques (APIs, libraries), fitting a target (generic?) machine; a minimal sketch follows.
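What "explicit parallelisation with a standard API" looks like, as a hedged sketch (OpenMP is named in the course summary; this particular loop is my example, not the author's):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 10000000

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double dot = 0.0;

        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* Explicit parallelism: the programmer marks the loop; the runtime
           splits the iterations across cores and combines the partial sums. */
        #pragma omp parallel for reduction(+:dot)
        for (int i = 0; i < N; i++)
            dot += a[i] * b[i];

        printf("dot = %.0f on up to %d threads\n", dot, omp_get_max_threads());
        free(a); free(b);
        return 0;
    }

Compile with "gcc -fopenmp"; without the pragma the same code runs sequentially on one core, which is exactly the multi-core "no gains" trap above.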
A parallel application can use several (...) cores/processors simultaneously. If I am using 10 processors/cores, am I running 10 times faster (do I get the solution 10 times faster)?

\[ Speedup = \frac{T_{Serial}}{T_{Parallel}} \]

Good parallel efficiency if: speedup ≈ number of processors.

Application scalability: how the speedup evolves with the number of processors. (Plot: speedup vs. number of processors on log scales up to 1000, with the perfect-speedup line and the curves of applications 1, 2 and 3 falling below it.) It is difficult to keep good efficiency with lots of processors. Why? Algorithmic aspects; computing-system capability.
Practice yourself (D.I.Y.): the algorithmic aspect.
Hypothesis 1: only a part of the application (code) is parallel, f_p + f_s = 1, where f_p = parallel fraction of the code and f_s = sequential fraction of the code.
Hypothesis 2: the parallel part is optimal (perfect behaviour for any number of processors)!

\[ t_N = \left(\frac{f_p}{N} + f_s\right) t_1 \]

where t_1 = time to solution serial/sequential and t_N = time to solution on N processors.

Evaluate the theoretical speedup for a given number of processors N, knowing the serial part f_s = 1/10. Evaluate the limit (the maximum?):

\[ S(N) = \frac{t_1}{t_N}, \qquad f_p + f_s = 1, \qquad \lim_{N \to \infty} S = \,? \]

What does it mean? Consequence: Amdahl's law.
Amdahl's law. (Diagram: the execution time on 1 core (processor) splits into a sequential part and a parallel part; with 2 cores, 4 cores, ..., n cores only the parallel part shrinks, so the sequential part dominates the execution time with n cores.)

Amdahl's law:

\[ \lim_{N \to \infty} S = \frac{1}{f_s} \]

(Plot: speedup vs. number of processors for f_s = 0.1, f_s = 0.05 and f_p = 0.999, against the perfect-speedup line.)
The algorithmic aspect: split the problem. One domain (a mesh of ... elements). Basic idea: split it into subdomains (create a partition of the initial mesh); each core/processor works on one subdomain. Example with 64 cores (subdomains) => ... elements per core; the mesh (domain) is partitioned into sub-meshes (subdomains).

The algorithmic aspect: split the problem. In parallel: T_Parallel = T_Computing + T_Overhead, the overhead being related to communication. What we would like: increasing the number of cores/processors => decreasing time. The control: the overhead should not increase too much. A minimal sketch of the resulting communication pattern follows.
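A hedged sketch of where T_Overhead comes from (MPI is the tool named in the course summary; the 1D decomposition below is my illustration, not the author's code): each rank owns a slice of the mesh and must exchange its boundary («halo») values with its neighbours at every step.

    #include <mpi.h>
    #include <stdio.h>

    #define NLOC 1000   /* interior points owned by each rank (1D decomposition) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* u[0] and u[NLOC+1] are ghost cells holding the neighbours' values. */
        double u[NLOC + 2];
        for (int i = 0; i < NLOC + 2; i++) u[i] = (double)rank;

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* T_Overhead: send my first interior point left, receive my right ghost;
           then the symmetric exchange with the other neighbour. */
        MPI_Sendrecv(&u[1],      1, MPI_DOUBLE, left,  0,
                     &u[NLOC+1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOC],   1, MPI_DOUBLE, right, 1,
                     &u[0],      1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* T_Computing: update the interior points using the ghost values (omitted). */
        if (rank == 0) printf("halo exchange done on %d ranks\n", size);
        MPI_Finalize();
        return 0;
    }

Each step costs two messages per rank regardless of the slice size, so as the subdomains shrink the communication share of T_Parallel grows: exactly the overhead-control issue above.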
Distributed-memory example in CFD: industrial fluidized bed. Time-to-solution and speedup (example for 5 million cells):
- C1a: Altix ICE Harpertown, Intel MPI
- C1b: Altix ICE Harpertown, MPT
- C2a: HYPERION, Altix ICE Nehalem, Intel MPI
- C2b: HYPERION, Altix ICE Nehalem, MPT
- C3: InfiniBand cluster, AMD Shanghai, OpenMPI
Courtesy of O. Simonin, H. Neau, Laviéville, Institut de Mécanique des Fluides de Toulouse, Université de Toulouse / CNRS.

Beware of comparing speedups across machines. Machine A: t_N^A = time to solution on N processors; machine B: t_N^B = time to solution on N processors. Hypothesis 1: machine A is 10 times slower than machine B, t_N^A = 10 t_N^B. Compute the speedup for both machines:

\[ S_A(N) = \frac{t_1^A}{t_N^A} = \frac{10\, t_1^B}{10\, t_N^B} = \frac{t_1^B}{t_N^B} = S_B(N) \]

Same speedup; you still get the solution sooner with machine B.
High performance computing, Amdahl's law (recap). Hypotheses: only a part of the application is parallelised, and the parallelised part is optimal.

\[ t_N = \left(\frac{f_p}{N} + f_s\right) t_1, \qquad 1 = f_p + f_s \]
\[ S = \frac{t_1}{t_N} = \frac{1}{\frac{f_p}{N} + f_s}, \qquad \lim_{N \to \infty} S = \frac{1}{f_s} \]

where t_N = compute time on N processors, f_p = parallel fraction, f_s = sequential fraction, S = speedup.

Practice yourself. Hypotheses: only a part of the application (code) is parallel; the parallel part is optimal (perfect behaviour for any number of processors)!

\[ t_N = \left(\frac{f_p}{N} + f_s\right) t_1, \qquad 1 = f_p + f_s \]

Give the speedup of the code with f_s = 1/10: S = ?? and lim_{N→∞} S = ? Consequence? Amdahl's law. (A small numerical check follows.)
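A quick numerical check of the exercise (the answers follow directly from the formula above; the particular N values printed are my choice):

    #include <stdio.h>

    /* Amdahl speedup: S(N) = 1 / (fs + fp/N), with fp + fs = 1. */
    static double amdahl(double fs, int n) {
        return 1.0 / (fs + (1.0 - fs) / n);
    }

    int main(void) {
        double fs = 0.10;               /* sequential fraction from the exercise */
        int ns[] = {1, 10, 100, 1000};
        for (int k = 0; k < 4; k++)
            printf("S(%4d) = %.2f\n", ns[k], amdahl(fs, ns[k]));
        printf("limit 1/fs = %.0f\n", 1.0 / fs);
        return 0;
    }

It prints S(10) ≈ 5.26, S(100) ≈ 9.17 and S(1000) ≈ 9.91: even with 1000 processors the code cannot run more than 10 times faster.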
HPC: Complexity

Time to solution: example #2, the matrix product, the basis of most (all) numerical methods. C, A, B ∈ M_{n×n}, C = A·B:

\[ c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} \]

    do i = 1, n
      do j = 1, n
        do k = 1, n
          c(i,j) = c(i,j) + a(i,k) * b(k,j)
        end do
      end do
    end do

n = ... (unknowns). (Block illustration: A11..A33 times B11..B33 = C11..C33.)

How many matrix products vs. time to solution: 10 -> 2.98 hours; ... -> ...76 hours; ... -> ...40 days; ... -> ...40 days; ... -> ...97 days.
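The order of magnitude behind such numbers, as a back-of-the-envelope sketch (the 1 Gflop/s sustained rate is my assumption, matching the "power of 1 core" figure earlier in the course):

\[
flops(n) \approx 2n^3, \qquad t \approx \frac{2n^3}{R_{sustained}}; \qquad n = 10^4,\ R = 10^9\ \text{flop/s} \Rightarrow t \approx \frac{2 \times 10^{12}}{10^9} = 2000\ \text{s} \approx 33\ \text{minutes per product.}
\]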
Transport Equation in 1D (convection): explicit scheme.