HPC and Parallel efficiency
1 HPC and Parallel efficiency
Martin Hilgeman, EMEA product technologist HPC
2 What is parallel scaling?
3 Parallel scaling
Parallel scaling is the reduction in application execution time when more than one core is used.
[Chart: wall clock time (seconds) and speedup factor versus number of cores]
A 7x speedup on 8 cores (87.5% efficiency) does not seem to be that bad, but...
4 Amdahl's Law
5 Amdahl's Law
Gene Amdahl (1967): "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities", AFIPS Conference Proceedings (30).
"The effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of very nearly the same magnitude."
a = n / (p + n(1 - p))
a: speedup, n: number of processors, p: parallel fraction
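As an illustration of the formula above (not part of the original slides), a minimal C sketch that evaluates Amdahl's speedup for a few parallel fractions and core counts:

    #include <stdio.h>

    /* Amdahl's Law: speedup on n processors for parallel fraction p */
    static double amdahl_speedup(double p, int n)
    {
        return n / (p + n * (1.0 - p));
    }

    int main(void)
    {
        const double fractions[] = { 0.80, 0.95, 0.99 };
        const int cores[] = { 8, 64, 512 };

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                printf("p = %.2f, n = %4d: speedup = %6.1fx\n",
                       fractions[i], cores[j],
                       amdahl_speedup(fractions[i], cores[j]));
        return 0;
    }

For example, p = 0.80 on 8 cores gives 8 / (0.8 + 8 * 0.2) = 3.3x, which is the behaviour modelled on the next slide.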
6 Amdahl's Law: model of an 80% parallel application
Parallel scaling is hindered by the domination of the serial parts in the application.
[Chart: wall clock time, split into parallel and serial portions, versus number of processors]
7 Amdahl's Law limits the maximal speedup
With an infinite number of processors the speedup converges to
a = 1 / (1 - p)
a: speedup, p: parallel fraction
For example, p = 99.5% limits the speedup to 200x, no matter how many processors are used.
[Chart: maximal parallel speedup versus Amdahl's Law percentage (97.0%, 99.0%, 99.5%, ...)]
8 Amdahl's Law curves
a = n / (p + n(1 - p))
a: speedup, n: number of processors, p: parallel fraction
[Chart: parallel speedup versus number of processor cores for parallel fractions of 95.0%, 97.0%, 99.0%, 99.5%, 99.9% and 100.0%]
Multiprocessing can exceed 99%; most parallel codes are between 95% and 99%.
9 Amdahl's Law and efficiency
Parallel efficiency is the speedup divided by the number of cores:
e = a / n = 1 / (p + n(1 - p))
Diminishing returns: there is a tension between the desire to use more processors and the associated cost in efficiency.
[Chart: efficiency as a function of the number of processor cores and the Amdahl's Law percentage]
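To make the diminishing returns concrete, here is the efficiency formula above evaluated for a 99% parallel code (a worked example added for illustration):

    e(8)   = 1 / (0.99 +   8 * 0.01) = 1 / 1.07 ≈ 93.5%
    e(64)  = 1 / (0.99 +  64 * 0.01) = 1 / 1.63 ≈ 61.3%
    e(512) = 1 / (0.99 + 512 * 0.01) = 1 / 6.11 ≈ 16.4%

Even a 99% parallel code drops below two-thirds efficiency at 64 cores; this is the tension between core count and cost referred to above.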
10 Best practices for efficient computing
11 Taking care of process placement
12 Beware of memory topology
In the SMP days, the mapping of processors to memory was straightforward: all processors accessed memory through a single shared memory controller.
[Diagram: processors connected to one memory controller and I/O hub]
13 NUMA for Intel Xeon 56xx
Intel Nehalem and AMD Opteron have their memory controller on the processor, so memory access becomes non-uniform (NUMA).
[Diagram: two-socket Intel Westmere-EP system, sockets connected to each other and to the I/O hub via QPI links]
14 NUMA for Intel Xeon E5
Intel Sandy Bridge also has the PCIe controller on the chip, so PCI device access becomes non-uniform as well (NUPA?).
[Diagram: two-socket Intel Sandy Bridge EN/EP system with a QPI link between the sockets and I/O attached to each socket]
15 NUMA for Intel Xeon E7 processors
4-socket Intel systems (PowerEdge R810, R910, M910) are connected with QPI links.
[Diagram: four Intel Westmere-EX sockets connected with QPI links]
16 NUMA for AMD
4-socket AMD systems (PowerEdge R815, M915, C6145) are connected with HyperTransport3 links.
[Diagram: four AMD Bulldozer sockets connected with HT3 links]
17 Use tools for memory placement
Programming tools:
- sched_setaffinity(): set the process CPU affinity mask
Operating system tools:
- numactl: control NUMA policy for processes or shared memory
- taskset: set a process' CPU affinity
Most MPI libraries have memory placement tools enabled by default:
- Intel MPI: I_MPI_PIN_* environment variables
- Open MPI: --bind-to-core, --bind-to-socket
- MVAPICH2: MV2_ENABLE_AFFINITY environment variable
Some set affinity by default and do the right thing, but be careful. A minimal affinity sketch follows below.
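As a hedged illustration (not part of the original slides), a minimal C sketch of what sched_setaffinity()-based pinning looks like; the core number is taken from the command line here, whereas real placement tools derive it from the machine topology:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int core = (argc > 1) ? atoi(argv[1]) : 0;   /* core to pin to */
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(core, &mask);

        /* Pin the calling process (pid 0 = self) to the chosen core */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("Pinned to core %d\n", core);
        /* ... the real application work would run here ... */
        return 0;
    }

The same effect can be had from the shell with taskset -c <core>, or with numactl --physcpubind=<core> --membind=<node>, which additionally binds memory allocation to a NUMA node.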
18 Case study: gysela5d plasma physics application
- Claimed 92% efficiency on 8192 cores
- Uses hybrid MPI/OpenMP parallelism
- Run on 64 cores: Dell PowerEdge R815 with AMD Opteron processors
- 8 MPI ranks x 8 OpenMP threads
- MVAPICH2 1.6
19 PowerEdge R815 CPU core layout
The PowerEdge R815 has a non-standard dual-plane core mapping, which makes the default placement fail.
20 No placement
- MPI ranks and OpenMP threads are scattered across the sockets
- Sleeping processes (idle OpenMP threads) can start to roam
- Wall clock time: 1296 seconds
21 Naïve (default) placement
- Starting from core 0 and counting onwards: MPI ranks on every 8th core, OpenMP threads in between
- Wall clock time: 249 seconds
22 Optimal placement
- Use a placement script which calculates the right mapping for numactl
- Wall clock time: 191 seconds
23 PowerEdge R910 CPU core layout PowerEdge R910 has the same non-linear core mapping as the R815!
24 No placement
- MPI ranks and OpenMP threads are scattered across the sockets
- Sleeping processes (idle OpenMP threads) can start to roam
- Wall clock time: 152 seconds
25 Naïve (default) placement
- Starting from core 0 and counting onwards: MPI ranks on every 8th core, OpenMP threads in between
- Wall clock time: 150 seconds
26 Optimal placement
- Use a placement script which calculates the right mapping for numactl
- Wall clock time: 131 seconds
27 Placement program
- Written in C99 (~2,000 lines of code)
- Works on all major distributions: RHEL 5.x and open variants, SLES 11 SP2
- Supports all major MPI libraries: MVAPICH, MVAPICH2, Open MPI, Platform MPI/HP MPI, Intel MPI
- Tested with > 5,000 core runs
- Supports hybrid MPI/OpenMP runs too!
28 Placement program
- Supports all machines and processor vendor models: Intel Nehalem-EP, Intel Nehalem-EX, Intel Westmere-EP, Intel Westmere-EX, Intel Sandy Bridge EP, AMD Magny Cours, AMD Interlagos
- Knows which cores are sharing L3 caches
- Understands the AMD Bulldozer module concept to make maximal use of the available resources where possible
A simplified sketch of the underlying idea is shown below.
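The dell_affinity source is not reproduced in these slides. As a hedged sketch of the general idea only: a wrapper of this kind reads the local rank exported by the MPI launcher (OMPI_COMM_WORLD_LOCAL_RANK for Open MPI, MV2_COMM_WORLD_LOCAL_RANK for MVAPICH2), derives a CPU list from it, pins itself, and then exec()s the real application. The linear rank-to-core mapping below is a placeholder; the real tool derives the mapping from the detected processor topology.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static int local_rank(void)
    {
        /* Local rank as exported by the MPI launcher (library-dependent) */
        const char *vars[] = { "OMPI_COMM_WORLD_LOCAL_RANK",
                               "MV2_COMM_WORLD_LOCAL_RANK" };
        for (int i = 0; i < 2; i++) {
            const char *v = getenv(vars[i]);
            if (v) return atoi(v);
        }
        return 0;
    }

    int main(int argc, char *argv[])
    {
        int threads = 1;                       /* OpenMP threads per rank   */
        const char *t = getenv("OMP_NUM_THREADS");
        if (t) threads = atoi(t);

        int rank  = local_rank();
        int first = rank * threads;            /* naive linear mapping only */

        cpu_set_t mask;
        CPU_ZERO(&mask);
        for (int c = first; c < first + threads; c++)
            CPU_SET(c, &mask);
        sched_setaffinity(0, sizeof(mask), &mask);

        fprintf(stderr, "local rank %d -> cores %d-%d\n",
                rank, first, first + threads - 1);

        /* Replace this wrapper process with the real application;
           the CPU affinity mask is inherited across exec */
        if (argc > 1) execvp(argv[1], &argv[1]);
        return 0;
    }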
29 Works only on Dell systems!
Try to run on a whitebox system:
    $ mpirun -np 1 ./dell_affinity.exe hello_world.exe
    This is not a Dell system. Exiting.
30 Program examples: Usage
    open64_acml]$ mpirun -np 1 dell_affinity.exe -h
    Invalid option: -h
    Usage: dell_affinity.exe -n <# local MPI ranks> -t <# OpenMP threads per rank>
TACC single node, 6 MPI ranks:
    login2$ mpirun -np 6 dell_affinity.exe ./mpi.exe
    dell_affinity.exe: Using MVAPICH2.
    dell_affinity.exe: PPN: 6 OMP_NUM_THREADS: 1.
    dell_affinity.exe: Intel Westmere processor detected.
    dell_affinity.exe: Placing MPI rank 0 on host login2 local rank 0 cpulist 1 memlist 0
    dell_affinity.exe: Placing MPI rank 1 on host login2 local rank 1 cpulist 3 memlist 0
    dell_affinity.exe: Placing MPI rank 2 on host login2 local rank 2 cpulist 5 memlist 0
    dell_affinity.exe: Placing MPI rank 3 on host login2 local rank 3 cpulist 7 memlist 0
    dell_affinity.exe: Placing MPI rank 4 on host login2 local rank 4 cpulist 9 memlist 0
    dell_affinity.exe: Placing MPI rank 5 on host login2 local rank 5 cpulist 11 memlist 0
31 Program examples: TACC single node
2 MPI ranks, 6 OpenMP threads per rank:
    login1$ mpirun -np 4 ./dell_affinity.exe -n 2 -t 6 -v /bin/true
    ./dell_affinity.exe: Using Open MPI.
    ./dell_affinity.exe: PPN = 2
    ./dell_affinity.exe: OMP_NUM_THREADS = 6
    ./dell_affinity.exe: Intel Westmere EP processor detected.
    ./dell_affinity.exe: node 1: cpulist:
    ./dell_affinity.exe: node 0: cpulist:
    ./dell_affinity.exe: Placing MPI rank 0 on host login1.ls4.tacc.utexas.edu local rank 0 cpulist 0,2,4,6,8,10 memlist 0
    ./dell_affinity.exe: Placing MPI rank 3 on host login1.ls4.tacc.utexas.edu local rank 1 cpulist 1,3,5,7,9,11 memlist 1
    ./dell_affinity.exe: Placing MPI rank 1 on host login1.ls4.tacc.utexas.edu local rank 1 cpulist 1,3,5,7,9,11 memlist 1
    ./dell_affinity.exe: Placing MPI rank 2 on host login1.ls4.tacc.utexas.edu local rank 0 cpulist 0,2,4,6,8,10 memlist 0
32 Program examples: TACC
Two nodes, 4 MPI ranks per node, 3 OpenMP threads per rank:
    login2$ mpirun_rsh -ssh -hostfile ./hosts -np 8 dell_affinity.exe -n 4 -t 3 ./mpi.exe
    dell_affinity.exe: Using MVAPICH2.
    dell_affinity.exe: PPN: 4 OMP_NUM_THREADS: 3.
    dell_affinity.exe: Intel Westmere processor detected.
    dell_affinity.exe: Placing MPI rank 0 on host login1 local rank 0 cpulist 1,3,5 memlist 0
    dell_affinity.exe: Placing MPI rank 1 on host login1 local rank 0 cpulist 7,9,11 memlist 0
    dell_affinity.exe: Placing MPI rank 2 on host login1 local rank 2 cpulist 0,2,4 memlist 1
    dell_affinity.exe: Placing MPI rank 3 on host login1 local rank 3 cpulist 6,8,10 memlist 1
    dell_affinity.exe: Placing MPI rank 4 on host login2 local rank 0 cpulist 1,3,5 memlist 0
    dell_affinity.exe: Placing MPI rank 5 on host login2 local rank 1 cpulist 7,9,11 memlist 0
    dell_affinity.exe: Placing MPI rank 6 on host login2 local rank 2 cpulist 0,2,4 memlist 1
    dell_affinity.exe: Placing MPI rank 7 on host login2 local rank 3 cpulist 6,8,10 memlist 1
33 Program examples: Cambridge
Single C6145 with AMD Interlagos, 16 MPI ranks:
    open64_acml]$ mpirun -np 16 dell_affinity.exe ~/martinh/bin/mpi.exe
    dell_affinity.exe: Using MVAPICH2.
    dell_affinity.exe: PPN: 16 OMP_NUM_THREADS: 1.
    dell_affinity.exe: AMD Interlagos processor detected.
    dell_affinity.exe: Placing OMP threads on separate modules.
    dell_affinity.exe: Placing MPI rank 0 on host bench local rank 0 cpulist 0 memlist 0
    dell_affinity.exe: Placing MPI rank 1 on host bench local rank 1 cpulist 4 memlist 0
    dell_affinity.exe: Placing MPI rank 2 on host bench local rank 2 cpulist 8 memlist 1
    dell_affinity.exe: Placing MPI rank 3 on host bench local rank 3 cpulist 12 memlist 1
    dell_affinity.exe: Placing MPI rank 4 on host bench local rank 4 cpulist 16 memlist 2
    dell_affinity.exe: Placing MPI rank 5 on host bench local rank 5 cpulist 20 memlist 2
    dell_affinity.exe: Placing MPI rank 6 on host bench local rank 6 cpulist 24 memlist 3
    dell_affinity.exe: Placing MPI rank 7 on host bench local rank 7 cpulist 28 memlist 3
    dell_affinity.exe: Placing MPI rank 8 on host bench local rank 8 cpulist 32 memlist 4
    dell_affinity.exe: Placing MPI rank 9 on host bench local rank 9 cpulist 36 memlist 4
    dell_affinity.exe: Placing MPI rank 10 on host bench local rank 10 cpulist 40 memlist 5
    dell_affinity.exe: Placing MPI rank 11 on host bench local rank 11 cpulist 44 memlist 5
    dell_affinity.exe: Placing MPI rank 12 on host bench local rank 12 cpulist 48 memlist 6
    dell_affinity.exe: Placing MPI rank 13 on host bench local rank 13 cpulist 52 memlist 6
    dell_affinity.exe: Placing MPI rank 14 on host bench local rank 14 cpulist 56 memlist 7
    dell_affinity.exe: Placing MPI rank 15 on host bench local rank 15 cpulist 60 memlist 7
34 Program examples: Cambridge
Single C6145 with AMD Interlagos, 8 MPI ranks, 8 OpenMP threads per rank:
    [dell-guest@bench open64_acml]$ mpirun -np 8 dell_affinity.exe -t 8 ~/martinh/bin/mpi.exe
    dell_affinity.exe: Using MVAPICH2.
    dell_affinity.exe: PPN: 8 OMP_NUM_THREADS: 8.
    dell_affinity.exe: AMD Interlagos processor detected.
    dell_affinity.exe: Placing MPI rank 0 on host bench local rank 0 cpulist 0,1,2,3,4,5,6,7 memlist 0
    dell_affinity.exe: Placing MPI rank 1 on host bench local rank 1 cpulist 8,9,10,11,12,13,14,15 memlist 1
    dell_affinity.exe: Placing MPI rank 2 on host bench local rank 2 cpulist 16,17,18,19,20,21,22,23 memlist 2
    dell_affinity.exe: Placing MPI rank 3 on host bench local rank 3 cpulist 24,25,26,27,28,29,30,31 memlist 3
    dell_affinity.exe: Placing MPI rank 4 on host bench local rank 4 cpulist 32,33,34,35,36,37,38,39 memlist 4
    dell_affinity.exe: Placing MPI rank 5 on host bench local rank 5 cpulist 40,41,42,43,44,45,46,47 memlist 5
    dell_affinity.exe: Placing MPI rank 6 on host bench local rank 6 cpulist 48,49,50,51,52,53,54,55 memlist 6
    dell_affinity.exe: Placing MPI rank 7 on host bench local rank 7 cpulist 56,57,58,59,60,61,62,63 memlist 7
35 LS-DYNA benchmark: neon_refined
- LS-DYNA mpp971 v5.1.1, Platform MPI
- Ran on PE R
- Architecture knowledge is key!
[Tables: wall clock time (s) per number of MPI ranks for three modes: as-is, Platform MPI pinning, dell_affinity]
36 Parallel optimization
37 Parallel optimization
A lot of attention is being paid to:
- InfiniBand networking buzzwords: fat-tree, multi-rail, QDR/FDR/EDR, non-blocking
- MPI library features: shared memory optimization, collective offloading, single-sided messaging, message buffering
Better to start at the root of the parallel performance.
38 Do these programs run efficiently?
[Profiles: LS-DYNA explicit and PARATEC]
39 Case study: PARATEC load balancing
- PARAllel Total Energy Code, developed at NERSC for ab initio electronic structure calculations in materials science
- Uses Density Functional Theory (DFT) to describe the electronic structure of a material (solid, crystal, metal)
- Knowing the electronic structure of a material tells you everything about its properties
- The electronic structure is described by wave functions, which (unfortunately) cannot be solved analytically
- Approach: expand the wave functions in plane waves (in Fourier space) and describe the nucleus of an atom with a pseudopotential
- 3D parallel Fourier transformations are needed to convert to real (Cartesian) space, and these are *very* expensive!
40 Benchmark setup
- Si (silicon) in diamond structure: 686 atoms, 7x7x7 cell, 1372 electronic bands
- Jobs ran at the Texas Advanced Computing Center on the Dell Linux cluster Lonestar: 1,888 Dell PowerEdge M610 blades, 22,656 Intel Xeon cores, Mellanox QDR InfiniBand, 1 PB Lustre parallel storage
- Used 196 cores for the calculations
41 Default g vector distribution
[Chart: computational time and MPI time (wall clock, seconds) per MPI rank, showing an uneven load]
- Computation time: 648 seconds
- Communication time: 276 seconds
- Communication %: 29.9%
- Load imbalance: 21.2%
42 Optimized g vector distribution
[Chart: computational time and MPI time (wall clock, seconds) per MPI rank, showing an even load]
- Speedup: 14.3%
- Computation time: 638 seconds
- Communication time: 154 seconds
- Communication %: 19.5%
- Load imbalance: 5.8%
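For reference, a minimal MPI sketch (not from the slides) of how per-rank compute time and a load-imbalance figure can be measured; the metric used here, (max - avg) / max, is one common definition and not necessarily the exact one behind the numbers above:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Time the compute phase on each rank */
        double t0 = MPI_Wtime();
        /* ... computational work of this rank goes here ... */
        double t_local = MPI_Wtime() - t0;

        double t_max, t_sum;
        MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&t_local, &t_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            double t_avg = t_sum / size;
            double imb   = (t_max > 0.0) ? 100.0 * (t_max - t_avg) / t_max : 0.0;
            printf("max %.2f s, avg %.2f s, imbalance %.1f %%\n",
                   t_max, t_avg, imb);
        }

        MPI_Finalize();
        return 0;
    }

Instrumenting the compute and communication phases separately in this way gives the communication percentage and load-imbalance figures that the case study is based on.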
43 Conclusion
44 Conclusion
- Architecture knowledge is key to obtaining good scalability
- People concentrate on MPI optimization work but often forget load balancing issues
- Use system tools and profilers as standard practice!
45 Questions?