High Performance Computing, an Introduction




High Performance Computing, an Introduction
Nicolas Renon, Ph.D., Research Engineer in Scientific Computation
CALMIP - DTSI, Université Paul Sabatier, University of Toulouse (nicolas.renon@univ-tlse3.fr)
Toulouse University Computing Center: http://www.calmip.cict.fr
Page 1

Heat Equation in 1D (diffusion)
Page 2

Heat Equation in 1D (diffusion): Implementation (MATLAB), Explicit Scheme
Page 3

Heat Equation in 1D (diffusion)
Page 4

Time to solution: example #1
MATLAB excerpt (dt, dx, mu, x and the helper functions pun, mun, source are not shown here):

  % Heat equation on [0,1], EXPLICIT SCHEME
  % --- maximum time
  tmax = 0.02;
  % --- mesh
  J = 200;
  u_init = exp(-(x-0.5).^2/(2*1/6^2));
  ymin = -0.1; ymax = 1.5;
  t = 0;
  % --- test 1 : explicit scheme
  while (t < tmax)
    t = t + dt;
    % t1 = cputime;
    u = u + dt*(mu/dx^2*(pun(u,ud) - 2*u + mun(u,ug)) + source(mu,x(2:J+1),t));
    % temps = temps + cputime - t1;
  end

Computation time versus mesh size J:

  mesh size J    100     200    500     1000    2000
  tmax = 0.02    0.02    0.2    2.3     17      124     seconds
  tmax = 0.2     0.28, 162.5, 1240 seconds (values reported for a subset of the mesh sizes)
  tmax = 20      3.44 hours (largest case)
Page 5

Time to solution: example #1
[Chart: computation time (s) vs. mesh size for tmax = 0.02, with exponential fit y = 0.0025·e^(2.1907x)]
Time to solution: 3 hours to reach the solution, is that reasonable? Exponential behaviour, why? Hardware?
Page 6
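For reference, a minimal compilable C sketch of the same explicit scheme is given below. It is an illustrative translation, not the course code: it assumes mu = 1, no source term, boundary values held fixed, and a time step within the explicit stability limit (the slide's helpers pun, mun and source are not shown in the transcription).

  /* Minimal C sketch of the 1-D explicit diffusion scheme above.
     Assumptions (not from the slide): mu = 1, no source term, boundary values
     held fixed, and a time step within the stability limit dt <= dx^2/(2*mu). */
  #include <stdio.h>
  #include <stdlib.h>
  #include <math.h>

  int main(void)
  {
      const int    J    = 200;                  /* number of interior mesh points */
      const double mu   = 1.0;                  /* diffusion coefficient (assumed)*/
      const double tmax = 0.02;                 /* final time, as in the slide    */
      const double dx   = 1.0 / (J + 1);        /* mesh spacing on [0,1]          */
      const double dt   = 0.4 * dx * dx / mu;   /* stable explicit time step      */

      double *u    = malloc((J + 2) * sizeof *u);
      double *unew = malloc((J + 2) * sizeof *unew);

      /* initial condition: the Gaussian used on the slide */
      for (int i = 0; i <= J + 1; i++) {
          double x = i * dx;
          u[i] = exp(-(x - 0.5) * (x - 0.5) / (2.0 / 36.0));
      }

      double t = 0.0;
      while (t < tmax) {
          t += dt;
          for (int i = 1; i <= J; i++)   /* interior update; boundaries stay fixed */
              unew[i] = u[i] + dt * mu / (dx * dx) * (u[i+1] - 2.0*u[i] + u[i-1]);
          for (int i = 1; i <= J; i++)
              u[i] = unew[i];
      }

      printf("u near x = 0.5 after t = %.3f : %f\n", tmax, u[(J + 1) / 2]);
      free(u);
      free(unew);
      return 0;
  }

Timing this loop for increasing J reproduces the trend of the table above: the cost grows quickly because a finer mesh forces both more points and a smaller stable time step.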

What about the size of the problem: N = 3,000,000?
Courtesy of H. Neau, Institut de Mécanique des Fluides de Toulouse
Page 7

Summary: High Performance Computing (HPC)
- HPC, what is it? What for?
- Concepts in HPC (processors and parallelism): Moore's Law, Amdahl's Law
- HPC computational systems and processor architecture
  - Taxonomy: shared-memory machines, distributed-memory machines
  - Scalar, superscalar, multi-core processors
  - Code optimisation
- Parallel computation, make it happen: Message Passing Interface (MPI), OpenMP
  - Basics: speedup, parallel efficiency and traps
Page 8

Numerical Simulation: the 3rd Pillar of Science
A chain relationship between three pillars (sic. Jack Dongarra, University of Tennessee, Knoxville):
- Theory: fluid dynamics (Navier-Stokes), electromagnetism (Maxwell), quantum mechanics (Schrödinger)
- Experimentation / observation: Large Hadron Collider (CERN, Geneva), ITER, Hubble
- Numerical simulation: numerical implementation, approximations
with validation / new theory linking them.
Page 9

Scientific computation: numerical simulation thanks to scientific computation, i.e. efficient (optimum) execution of a computer program. Time to solution!!!
Computer programs:
- (Schrödinger) quantum chemistry: Gaussian; atomic-scale materials: VASP
- (Genomics): BLAST
- (Mechanics) crash: LS-DYNA; structures: ABAQUS, GETFEM++
- (Navier-Stokes) fluid dynamics: OpenFOAM, Fluent, GETFEM++
Scientific computation sits between High Performance Computing and Mathematics; it is transverse (all disciplines): spatial discretisation (FEM, finite volume), time discretisation (explicit, implicit schemes), numerical methods (solving linear systems, algebra, ...).
Page 10

HPC: Why?
The problem is too complex, involves different scales, is too expensive, or too dangerous (nuclear).
Competitiveness: best product before the others.
Page 11

HPC: Where?
- Science: climate change, astrophysics, biology, genomics, pharmacology, chemistry, nanotechnology
- Engineering: crash, earthquake, CFD, combustion
- Economy: financial markets
- Classified: cryptography, nuclear
Page 12

HPC: Research — computation/memory and accuracy. Example: tantalum resolidification.
Page 13

Machine CURIE, European programme «PRACE» (France)
Cosmos simulation, from the Big Bang to nowadays. Dark Energy Simulation Series (DEUSS): 550 billion particles (N-body simulation), 300,000 GB of RAM (distributed), 38,016 cores.
Jean-Michel ALIMI (1), Yann RASERA (1), Pier-Stefano CORASANITI (1), Jérôme COURTIN (1), Fabrice ROY (1), André FUZFA (1,2), Vincent BOUCHER (3,4), Romain TEYSSIER (4)
(1) Laboratoire Univers et Théories (LUTH), CNRS-UMR8102, Observatoire de Paris, Université Paris Diderot; (2) GAMASCO, Université de Namur, Belgium; (3) CP3, Université catholique de Louvain, Belgium; (4) Service d'Astrophysique, CEA Saclay
Page 14

Simulating proteins (molecules) related to Alzheimer's disease: 76,000 cores, 960 TFLOP, 2,900 basis functions.
LCPQ-IRSAMC, CNRS / University of Toulouse / Université Paul Sabatier, M. Caffarel - A. Scemama
Machine CURIE, European programme «PRACE» (France)
Page 15

HPC: Competitiveness (Research and Industry)
Procter & Gamble uses HPC for Pringles?
Page 16

HPC: Competitiveness (Research and Industry)
Capacity to model a whole engine? Reduce design costs, get to market faster.
Code coupling: fluid and mechanics, fluid and chemistry.
You have to be confident in the numerical solution!!!
Page 17

HPC: Competitiveness (Research and Industry)
Capacity to model a whole aircraft? What size of mesh: a part of the aircraft, or the whole aircraft?
Page 18

HPC: Virtual Prototyping
Virtual prototyping: run virtual tests cheaper and faster.
- CFD (Navier-Stokes) => example: brake cooling. Codes: Fluent (commercial), OpenFOAM (free software)
- FEM => impact / crash. Codes (commercial): Abaqus, Radioss
- CFD (combustion) => in-cylinder simulation. Codes: Ansys-CFX (commercial), AVBP (academic/commercial)
You have to feel confident enough in the result!!!
Page 19

Virtual Prototyping (slide from a PRACE workshop, September 2009)
Virtual prototyping: run virtual tests cheaper and faster. Feel confident enough in the result!!!
Page 20

Page 21 (figure-only slide)

HPC for experimentation. Key words: Data Deluge, «Big Data»
- Large Hadron Collider: Higgs boson, huge data management, 40 million events per second
- GAFA (Google, Apple, Facebook, Amazon), and others
- Mankind created 150,000 petabytes of data in 2005, 1,200,000 petabytes in 2010
Credit: "The data deluge", The Economist; "Understanding Big Data", Eaton et al.
Page 22

HPC in a process chain: Research & Development — highly parallel codes (FEM, CFD, etc.) — supercomputing / data management and storage — analysis / visualisation.
Page 23

HPC: Operating high technology
- The need to compute more and faster: solve the real world!! The need to think anew: new, efficient techniques and algorithms.
- It concerns the deepest level of program performance, at the processor (core) level, and a massively parallel approach.
- The associated computational systems are hugely powerful. What does "powerful" mean? What does "performance" mean?
HPC sits between the application (the scientific problem to solve) and mathematics.
Page 24

TOP500 List
TOP500: twice per year (June and November), the list of the most powerful computing systems.
[Table: Top500, June 2012 — #1 of 2012 and #1 of 2011]
Page 25

HPC: TOP500 List. What does "powerful" mean?
We are dealing with flop/s: FLOATING-POINT OPERATIONS PER SECOND — simple operations (+, *) on x ∈ R.
- There is a peak performance (theory: Rpeak).
- There is a «for real» performance (Rmax); it depends on what (application/program) you are running.
Benchmark: LINPACK (Linear Package). Solve, with the system you are benchmarking, Ax = b with A ∈ R^(n×n), n > 1000000000. How far/close are you from the Rpeak (theoretical) power? The TOP500 ranks by Rmax, not Rpeak.
Page 26

TOP500 List, June 2012 — Rmax: #1 of 2012 at 16 PF, #1 of 2011 at 10 PF. Machine CURIE, European programme «PRACE» (France).
Page 27

HPC: try to figure out what all these flops represent. Basic idea:
- 1 kflop/s = 10^3 operations on floats: one page, each side with 5 columns of 100 numbers, processed in 1 second.
- 10^6 flop/s (1 Mflop/s): 1,000 pages, 2 «reams of paper» (10 cm high; 3 inches high).
Page 28

High Performance Computing. Basic idea (continued):
- 10^9 flop/s (1 Gflop/s, order of the power of 1 core, O(1p)): a column of paper 100 metres high (second floor of the Eiffel Tower).
- 10^12 flop/s (1 Tflop/s, a «little» parallel machine, O(10^5 p)): a column 100 km high (mesosphere: meteors, aurorae).
- 10^15 flop/s (1 Pflop/s, currently around the 10th position in the TOP500): 100,000 km (1/3 of the Moon-Earth distance).
- 10^18 flop/s (1 exaflop/s — imagine? O(100p)).
Page 29

Japan: K computer (2011), Fujitsu / RIKEN (Kobe, Japan)
- 580,000 cores, 10 PF Rpeak, 8 PFlop/s Rmax, efficiency 93%, 548k SPARC cores
- 10 MW, 825 Mflop/W
- Single socket, 8 cores per socket, 8 flops per core
- 6D torus interconnect
Page 30

High Performance Computing: TOP500 List, June 2012. Moore's Law?
Power consumption: #1 of 2012 at 7 MW, #1 of 2011 at 12 MW. A nuclear plant ranges between 40 MW and 1,450 MW.
Page 31

Machine CURIE, European programme «PRACE» (France). Top500 June 2013: 17 MW; #1 of 2012, #1 of 2011.
Page 32

High Performance Machines
Page 33

(Research): France / Europe
- European stage (PRACE): O(1000) TFLOP
- National stage (GENCI): O(100) TFLOP — national weather forecast (Météo France), CNES
- Regional stage (mesocentre, Toulouse): O(10) TFLOP — CALMIP: Toulouse University Computing Center
- Total (petroleum industry), PANGEA: 2 PF
GENCI: Grands Équipements Nationaux en Calcul Intensif
Page 34

(Research): Toulouse
Toulouse University Computing System (CALMIP), «HYPERION»: 3,500 cores / 38.57 TF Rpeak, 223rd in the TOP500 (November 2009 — that is the Middle Ages by now).
Page 35

Example in CFD: industrial fluidised bed
NEPTUNE_CFD: 3,000,000 CPU hours used in 2010 on HYPERION. Number of cores in production runs: 68 to 512.
Courtesy of O. Simonin, H. Neau, Laviéville - Institut de Mécanique des Fluides de Toulouse - Université de Toulouse / CNRS
Page 36

Example in Life Science — Laboratoire Évolution et Diversité Biologique (UMR 5174)
Fish population fragmentation studies: counting fish in two rivers near Toulouse. 12,000 CPU hours used, mostly for data-set exploration (code in C! Good!). Species: chevaine, gandoise, vairon.
Page 37

HPC in Europe: PRACE (Partnership for Advanced Computing in Europe) — Top500 June 2012
- European tier (PRACE): JUGENE 1 PFLOP (#8), HERMIT 1 PFLOP, CURIE 2 PFLOP (#9), SuperMUC 3 PFLOP (#4), MareNostrum 1 PF, FERMI 2.1 PFLOP (#7)
- National tier (GENCI)
- Mesocentre tier (university / région)
Page 38

TOP500: geographical repartition
Page 39

HPC: Operational Technology
For a given application/program:
- Get the best out of the processor (core): HPC concepts, tuning the code to the processor architecture, dedicated scientific libraries.
- Make the code parallel: how far can we get (algorithm), according to parallel efficiency?
- What kind of computing system (big computers: supercomputers)? High-performance processors (compute power, memory/data bandwidth), a lot of memory (distributed, shared), tens of thousands of processors: interconnection?
Page 40

Summary of the next part
- Units in HPC
- Processors (cores?): Moore's Law, the energy problem, efficiency of programs (Rmax vs. Rpeak)
- Parallelism: concepts, Amdahl's Law
Page 41

Units in HPC
Flop: floating-point operation (mult, add), an operation on floating-point numbers (reals), e.g. 3.14159, -6.8675456342E+08. Orders of magnitude:

  1 Kflop/s   1 kiloflop/s   10^3 flop/s                                  "kilo" from "thousand" in Greek
  1 Mflop/s   1 megaflop/s   10^6 flop/s                                  "mega" from "large" in Greek
  1 Gflop/s   1 gigaflop/s   10^9 flop/s    processor / core              "giga" from "giant" in Greek
  1 Tflop/s   1 teraflop/s   10^12 flop/s   department system             "tera" from "monster" in Greek
  1 Pflop/s   1 petaflop/s   10^15 flop/s   current highest level         "peta" from "five" in Greek ([10^3]^5)
  1 Eflop/s   1 exaflop/s    10^18 flop/s   to be reached eventually (?)  "exa" from "six" in Greek ([10^3]^6)
Page 42

Units in HPC
Data: memory (bytes) — cache, RAM (DIMMs), hard disk. 1 octet (o) = 1 byte (B) = 8 bits. Orders of magnitude:

  1 Ko (KB)   1 kilobyte   10^3 bytes    register
  1 Mo (MB)   1 megabyte   10^6 bytes    processor cache
  1 Go (GB)   1 gigabyte   10^9 bytes    RAM
  1 To (TB)   1 terabyte   10^12 bytes   disk
  1 Po (PB)   1 petabyte   10^15 bytes   huge file systems
Page 43

Units in Big Data (aggregation of systems). Orders of magnitude:

  1 Po (PB)   1 petabyte    10^15 bytes   all letters delivered by America's postal service in one year
  1 Eo (EB)   1 exabyte     10^18 bytes   10 billion copies of «Le Monde»
  1 Zo (ZB)   1 zettabyte   10^21 bytes   total amount of information in existence
  1 Yo (YB)   1 yottabyte   10^24 bytes   ?
Credit: "The data deluge", The Economist; "Understanding Big Data", Eaton et al.
Page 44

Latency, «time-to-reach» (roughly):

  time                amount        from / to
  1 nanosecond        1 kilobyte    computing units (in the core) <-> registers
  100 nanoseconds     1 megabyte    core <-> RAM
  1 microsecond       1 gigabyte    process <-> process
  100 microseconds    1 terabyte    computer <-> computer

Bandwidth: access to RAM, interconnection:

  1 Mo/s   1 megabyte/s   10^6 bytes/s    interconnect
  1 Go/s   1 gigabyte/s   10^9 bytes/s    fast interconnect / RAM
  1 To/s   1 terabyte/s   10^12 bytes/s   exists?
Page 45

Processors
- The classical Moore's Law
- The energy problem (not good news for users): the end of the «free lunch»
- Applications: are they efficient in terms of HPC?
Page 46

Processor architecture
Control unit, compute unit (FPU, ALU), registers; instruction memory and data memory; memory (cache): data + instruction cache.
Processor cycle: the clock frequency is the number of pulses per second, e.g. 200 MHz = 200 million cycles per second.
Page 47

Page 48 (figure-only slide)

Moore (Intel). Moore's Law is not a physical law: «something» doubles every 18 months; «something» = transistors; «something» related to «performance». It is an exponential law: could it last? (It has held for 30 years.) It is empirical anyway.

  years        late nineties   early 2000s   nowadays
  processors   Pentium III     Itanium       Xeon (quad core)
  power        1 GFlop         6 GFlop       40 GFlop
Page 49

The Energy Problem
The end of the race for frequency. Why? Thermal dissipation ∝ frequency × voltage² × constant; since voltage scales roughly with frequency, this grows like (frequency)³.
- Using new (old) techniques for cooling: water.
- Lots of electrical supply (see the TOP500), because of lots of processors.
- Smartphones, tablets: the need to compute faster with less energy (save battery).
A processor nowadays: the voltage of a pocket lamp (1 volt), the current of an oven (250 A), the power of a light bulb (100 watts), the surface of a big stamp (10 cm²). Frequency stable around 3 GHz.
Page 50

Needs are always increasing (perpetually and exponentially). Moore's Law: «something» increases. So: increase performance under a constraint (energy). Which parameters can we play with?
- The number of transistors
- «Smarter» processors (new architectures)
- Other ways: accelerators (GPGPU, Xeon Phi)
Page 51

Current vendors' choice: increase the number of transistors — process shrink: 65 nm, 45 nm, 32 nm (nano: 10^-9) — at constant (or lower!) frequency, with the same architecture.
Conclusion/consequence: multi-core / many-core processors. With the process-size reduction (45 nm, ...), 1 mono-core processor becomes 1 multi-core processor (or socket), with its cores and caches (multi = 2, 4, 8, 16, ...).
Page 52

A core, what is it? It is what computes: it decodes instructions and performs them (operations/instructions), on data held in registers (small memory buffers).
Page 53

Problem number 1
«Free» gain of performance with the increase of frequency; «free» gain of performance with the increase of the number of cores? Multi-core: if your application/program is not parallel => NO GAINS!
[Chart: operations per second vs. frequency — «Free lunch»: a sequential program on 1 core; «No free lunch»: a parallel program on 2 cores, 4 cores]
Page 54

Problem number 2: efficiency of your application's execution.
What we observe (roughly): about 10% of peak performance; some applications do better, some do worse (really worse)!
Page 55

TOP500 List, June 2012 — Rmax/Rpeak: #1 of 2012 at 80%, #1 of 2011 at 90%.
Page 56

Problem number 2 (continued): efficiency of your application's execution.
What we observe (roughly): about 10% of peak performance; some applications do better, some do worse (really worse)!
Some leads: memory; instructions per processor cycle — potentially 4 or 8 operations per cycle, while applications typically retire 1.2-1.4 operations per cycle.
Page 57

Memory Access Problem: why are fast processors slow? They are waiting for the data. There must be an equilibrium between compute velocity and data-transfer velocity.
[Chart: time (ns) vs. year (1997-2009) — CPU clock period compared with memory access time]
Page 58
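To make the per-cycle numbers above concrete, here is a small sketch that converts peak and retired operations per cycle into Gflop/s and a fraction of peak. The 3 GHz clock is an assumed figure for illustration, not taken from the slides.

  /* Rough back-of-the-envelope sketch: theoretical peak of one core versus
     the fraction of peak reached at a given number of retired flops per cycle.
     The clock frequency chosen here is an assumption for illustration only. */
  #include <stdio.h>

  int main(void)
  {
      const double freq_ghz        = 3.0;  /* clock frequency (assumed)            */
      const double flops_per_cycle = 8.0;  /* peak flops per cycle (slide: 4 or 8) */
      const double retired         = 1.3;  /* typical retired flops per cycle      */

      double peak      = freq_ghz * flops_per_cycle;  /* Gflop/s theoretical peak */
      double sustained = freq_ghz * retired;          /* Gflop/s actually reached */

      printf("peak      : %5.1f Gflop/s\n", peak);
      printf("sustained : %5.1f Gflop/s (%.0f%% of peak)\n",
             sustained, 100.0 * sustained / peak);
      return 0;
  }

With these numbers a core reaches roughly 16% of its peak, the same order of magnitude as the ~10% quoted above.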

Trying to minimize the problem: the principle of a hierarchical memory — small capacity with fast access near the core, big capacity with slow access further away.
[Figure: memory pyramid, with labels as on the slide — registers and level-1 cache (1 ns), level-2 (or 3) cache (2 ns), RAM (memory DIMMs, 10 ns), disk (access time 300 ns); the register and cache levels sit on the silicon chip]
Page 59

Parallelism
- Why? Automatic?
- Speedup, parallel efficiency (scalability) and Amdahl's law
Page 60
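One practical consequence of this hierarchy (a generic illustration, not part of the course material): traversing an array in the order it is laid out in memory keeps the caches useful, while a strided traversal defeats them. The sketch below sums the same matrix both ways; on most machines the contiguous loop is markedly faster even though the arithmetic is identical.

  /* Generic illustration of the memory hierarchy: the same N*N sum done with a
     cache-friendly traversal (contiguous accesses) and a cache-unfriendly one
     (stride-N accesses). Timing with clock() is crude but enough to see the gap. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define N 4096

  int main(void)
  {
      double *a = malloc((size_t)N * N * sizeof *a);
      for (size_t i = 0; i < (size_t)N * N; i++)
          a[i] = 1.0;

      clock_t t0 = clock();
      double s1 = 0.0;
      for (int i = 0; i < N; i++)          /* row-major walk: consecutive addresses */
          for (int j = 0; j < N; j++)
              s1 += a[(size_t)i * N + j];
      double row_time = (double)(clock() - t0) / CLOCKS_PER_SEC;

      t0 = clock();
      double s2 = 0.0;
      for (int j = 0; j < N; j++)          /* column walk: jumps of N doubles       */
          for (int i = 0; i < N; i++)
              s2 += a[(size_t)i * N + j];
      double col_time = (double)(clock() - t0) / CLOCKS_PER_SEC;

      printf("row-major sum %.0f in %.3f s, column sum %.0f in %.3f s\n",
             s1, row_time, s2, col_time);
      free(a);
      return 0;
  }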

Parallel Computing
- Solve the problem faster; solve a bigger (more realistic) problem.
- How? Make «processors» work together (a «communication» problem), or use resources independently.
- Concepts in parallelism: efficiency, speedup, Amdahl's law.
- Parallel algorithms: granularity (lots of small tasks vs. a few big tasks), load (compute) balancing and synchronisation.
- It is harder to write a parallel program than a sequential one, but things are getting better (techniques) — that is the aim of this training; hardware is getting better too, and parallelism is now everywhere (multi-core / multi-socket laptop processors, GPGPU, ...).
Page 61

Automatic parallelism (at processor level, invisible to us humans): ILP, instruction-level parallelism — find independent instructions and compute them in parallel. Automatic parallelism has limited efficiency and works only at the processor level. So an explicit parallelisation has to be made: at the algorithm level, using standard techniques (APIs, libraries), fitting a target (generic?) machine — a minimal sketch follows below.
Page 62
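OpenMP, announced in the course summary, is one of these standard APIs. A minimal, generic sketch of what explicit shared-memory parallelisation looks like (assuming a compiler with OpenMP support, e.g. compiled with -fopenmp):

  /* Minimal sketch of explicit shared-memory parallelisation with OpenMP:
     the loop iterations are split among the available cores/threads.
     Compile with OpenMP enabled (e.g. gcc -fopenmp). */
  #include <stdio.h>
  #include <omp.h>

  #define N 10000000

  int main(void)
  {
      static double x[N];
      double sum = 0.0;

      #pragma omp parallel for                     /* each thread gets a chunk of i */
      for (int i = 0; i < N; i++)
          x[i] = 2.0 * i;

      #pragma omp parallel for reduction(+:sum)    /* partial sums combined safely  */
      for (int i = 0; i < N; i++)
          sum += x[i];

      printf("threads available: %d, sum = %.0f\n", omp_get_max_threads(), sum);
      return 0;
  }

MPI plays the analogous role on distributed-memory machines, with explicit messages instead of shared arrays; both are covered later in the training.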

A parallel application can use several (2 to 10^4) cores/processors simultaneously. If I am using 10 processors/cores, am I running 10 times faster (do I get the solution 10 times faster)?

  Speedup = T_serial / T_parallel

Good parallel efficiency if the speedup is close to the number of processors.
Page 63

Application scalability: how the speedup evolves with the number of processors.
[Chart: speedup vs. number of processors (1 to 256) — perfect speedup line and three applications (Appli 1, Appli 2, Appli 3) that fall away from it]
It is difficult to keep good efficiency with lots of processors. Why? Algorithmic aspects, and the capability of the computing system.
Page 64

Practice yourself (D.I.Y.): algorithm aspects
Hypothesis 1: only a part of the application (code) is parallel,

  f_p + f_s = 1,   f_p = parallel fraction of the code,   f_s = sequential fraction of the code.

Hypothesis 2: the parallel part is optimal (perfect behaviour for any number of processors):

  t_N = (f_p / N + f_s) · t_1

where t_1 = time to solution in serial/sequential and t_N = time to solution on N processors.
Page 65

Evaluate the theoretical speedup for a given number of processors N, knowing the serial part f_s = 1/10. Evaluate the limit (maximum?):

  S(N) = t_1 / t_N,   f_p + f_s = 1,   lim_{N -> infinity} S = ?

What does it mean? Consequence: Amdahl's law.
Page 66

Amdahl's law:
[Figure: execution time on 1 core (sequential part + parallel part), then on 2, 4, ..., n cores — only the parallel part shrinks, so the sequential part dominates the execution time with n cores]
Page 67

Amdahl's law:

  lim_{N -> infinity} S = 1 / f_s

[Chart: speedup vs. number of processors (1 to 256) — perfect speedup compared with curves for f_s = 0.1, f_s = 0.05 and f_s = 0.005]
Page 68
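As a sketch of the answer to the exercise above: with f_s = 1/10, the speedup S(N) = 1 / (f_p/N + f_s) climbs towards, but never exceeds, 1/f_s = 10, whatever the number of cores. The small program below evaluates it for growing N.

  /* Amdahl's law for the exercise above: f_s = 0.1 (sequential fraction).
     S(N) = 1 / (f_p / N + f_s), and S(N) -> 1 / f_s = 10 as N grows. */
  #include <stdio.h>

  int main(void)
  {
      const double fs = 0.10;        /* sequential fraction of the code */
      const double fp = 1.0 - fs;    /* parallel fraction               */

      for (int n = 1; n <= 1024; n *= 2) {
          double speedup = 1.0 / (fp / n + fs);
          printf("N = %4d  S(N) = %5.2f\n", n, speedup);
      }
      printf("limit as N -> infinity: 1/fs = %.1f\n", 1.0 / fs);
      return 0;
  }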

Algorithm aspect: split the problem
One domain (a mesh, e.g. 3,000,000 elements). Basic idea: split it into subdomains (create a partition of the initial mesh); each core/processor works on a subdomain. Example: 64 cores (subdomains) => about 46,000 elements per core. The mesh (domain) is partitioned into sub-meshes (subdomains); a minimal partitioning sketch is given after the next slide.
Page 69

Algorithm aspect: split the problem (continued)
Parallel computing time: T_parallel = T_computing + T_overhead, with the overhead related to communication. What we would like: increasing the number of cores/processors => decreasing time, while keeping the overhead under control (it should not increase too much).
Page 70
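A minimal sketch of the bookkeeping behind such a partition, using a simple block distribution of cell indices among ranks. The mesh size and core count are those of the slide; the block scheme itself is a generic choice (real CFD codes partition an unstructured mesh with a graph partitioner instead).

  /* Generic block partition of mesh cells among P workers (the slide's numbers:
     3,000,000 cells on 64 cores). Each rank r owns cells [start, end). */
  #include <stdio.h>

  int main(void)
  {
      const long ncells = 3000000;   /* total number of mesh cells (from the slide) */
      const int  nranks = 64;        /* number of cores/subdomains (from the slide) */

      for (int r = 0; r < nranks; r++) {
          long start = r * ncells / nranks;          /* first cell owned by rank r  */
          long end   = (r + 1) * ncells / nranks;    /* one past the last cell      */
          if (r < 2 || r == nranks - 1)              /* print a few ranks only      */
              printf("rank %2d : cells [%ld, %ld) -> %ld cells\n",
                     r, start, end, end - start);
      }
      printf("average cells per rank: %ld\n", ncells / nranks);
      return 0;
  }

With 3,000,000 cells and 64 ranks this gives 46,875 cells per core, the "about 46,000" figure quoted on the slide; the communication overhead then comes from the cells on the subdomain boundaries.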

Distributed-memory example in CFD: industrial fluidised bed
Time-to-solution and speedup (example with 5 million cells) on several machines:
- C1a: Altix ICE Harpertown, Intel MPI; C1b: Altix ICE Harpertown, MPT
- C2a: HYPERION Altix ICE NHM (Nehalem), Intel MPI; C2b: HYPERION Altix ICE NHM, MPT
- C3: InfiniBand cluster, AMD Shanghai, OpenMPI
Courtesy of O. Simonin, H. Neau, Laviéville - Institut de Mécanique des Fluides de Toulouse - Université de Toulouse / CNRS
Page 71

Comparing two machines. Let t_N^A and t_N^B be the times to solution on N processors for machines A and B. Hypothesis 1: machine A is 10 times slower than machine B, t_N^A = 10 · t_N^B. Compute the speedup for both machines:

  S_A(N) = t_1^A / t_N^A = (10 · t_1^B) / (10 · t_N^B) = t_1^B / t_N^B = S_B(N)

Same speedup, yet you still get the solution sooner with machine B.
Page 72

High Performance Computing — Amdahl's law (recap)
Hypotheses: only a part of the application is parallelised, and the parallelised part is optimal.

  t_N = (f_p / N + f_s) · t_1,   1 = f_p + f_s
  S = t_1 / t_N = 1 / (f_p / N + f_s),   lim_{N -> infinity} S = 1 / f_s

where t_N = compute time on N processors, f_p = parallel fraction, f_s = sequential fraction, S = speedup.
Page 73

Practice yourself. Hypotheses: only a part of the application (code) is parallel; the parallel part is optimal (perfect behaviour for any number of processors):

  t_N = (f_p / N + f_s) · t_1,   1 = f_p + f_s

Give the speedup of the code with f_s = 1/10: S = ?? and lim_{N -> infinity} S = ? Consequence? Amdahl's law.
Page 74

HPC: Complexity
Page 75

Time to solution: example #2, the matrix product — the basis of most (all) numerical methods.
C, A, B square n × n matrices; C = A · B, with c_ij = Σ_{k=1..n} a_ik · b_kj.

  do i = 1, n
    do j = 1, n
      do k = 1, n
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      end do
    end do
  end do

[3×3 block illustration of C = A · B]
With n = 10,000 (unknowns):

  how many matrix products   time to solution
  10                         2.98 hours
  100                        29.76 hours
  1,000                      12.40 days
  100,000                    3.40 years
  1,000,000                  33.97 years
Page 76
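The time-to-solution column follows from an operation count: the triple loop above performs about 2·n³ flops per product (one multiply and one add in the innermost statement). The sketch below, written in C rather than the Fortran-style loop of the slide, turns that count into a runtime estimate; the sustained flop rate is an assumption chosen so the result lands near the slide's figures.

  /* Estimate of time-to-solution for k naive products of n x n matrices:
     each product costs about 2*n^3 flops (a multiply and an add per iteration).
     The sustained rate below is an assumption, chosen near the slide's figures. */
  #include <stdio.h>

  int main(void)
  {
      const double n        = 10000;    /* matrix size (unknowns), from the slide */
      const double rate     = 1.9e9;    /* assumed sustained flop/s of one core   */
      const long   counts[] = {10, 100, 1000, 100000, 1000000};

      double flops_per_product = 2.0 * n * n * n;
      for (int i = 0; i < 5; i++) {
          double seconds = counts[i] * flops_per_product / rate;
          printf("%8ld products : %10.2f hours (%.2f days)\n",
                 counts[i], seconds / 3600.0, seconds / 86400.0);
      }
      return 0;
  }

At this assumed rate one product takes roughly 0.3 hours, so 10 products take about 3 hours and a million of them take decades — which is exactly why both faster algorithms and parallelism matter.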

Transport Equation in 1D (convection)
Page 77

Transport Equation in 1D (convection): Explicit Scheme
Page 78

Transport Equation in 1D (convection)
Page 79