Optimization on Huygens

1 Optimization on Huygens Wim Rijks

2 Contents
- Introductory remarks
- Support team
- Optimization strategy
- Amdahl's law
- Compiler options
- An example

3 Optimization Introductory Remarks
Modern supercomputers are still expensive; use them efficiently:
- Design for parallelism.
- Optimize sequential execution with the characteristics of modern CPUs in mind.
- Optimize for the configuration of the system you run on.
- Old code: consider rewriting it from scratch, and concentrate on the hotspots.
- Invest a little time in cleaning up the code.
- Consider whether your programming effort is worth the gain (manpower is expensive too).
- Don't hesitate to ask SARA for assistance.

4 Optimization SARA Support team
Marcin Zielinski, John Donners, Walter Lioen, Jeroen Engelberts, Wim Rijks, Willem Vermin

5 Optimization Support
- Submit a proposal for parallelization to NCF.
- Submit a preparatory access proposal to PRACE.
- Visit a PRACE workshop or summer school.

6 Optimization Profile your code
As a first step: profile your code!
- Very rough: $ time ./executable
- More refined: bracket blocks of code with a timing routine (MPI_Wtime(), date_and_time(), cpu_time()).
- Use gprof:
    Compile with the flags -pg -g
    Execute your executable: $ ./executable
    Generate the profile: $ gprof ./executable gmon.out
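The "bracket blocks of code" step can be sketched in C with the standard clock() routine, which plays roughly the role that cpu_time() plays in Fortran. This is a minimal sketch, not the deck's own code: busy_sum is a hypothetical stand-in for a hotspot, and the function names are made up for illustration.

```c
#include <assert.h>
#include <time.h>

/* CPU seconds between two clock() readings. */
double cpu_seconds(clock_t start, clock_t stop)
{
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Hypothetical hotspot: adds up the integers 1..n one by one. */
double busy_sum(long n)
{
    assert(n >= 0);
    volatile double s = 0.0;   /* volatile keeps the loop from being optimized away */
    for (long i = 1; i <= n; ++i)
        s += (double)i;
    return s;
}

/* Brackets one call of busy_sum with two clock() readings.
   Stores the computed sum in *result and returns the CPU time used. */
double time_busy_sum(long n, double *result)
{
    clock_t t1 = clock();
    *result = busy_sum(n);
    clock_t t2 = clock();
    return cpu_seconds(t1, t2);
}
```

Calling time_busy_sum(100000000L, &r) and printing the returned value gives the CPU time of that block only, which is usually more informative than timing the whole executable.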

7 Optimization GPROF profiling report
Each sample counts as 0.01 seconds.

  %    cumulative    self                self     total
 time    seconds    seconds     calls   s/call   s/call  name
  ...      ...        ...        ...      ...      ...   multigridpoissonsolver_nmod_pois
 ...90    23.74      7.08         10     0.71     1.79   main
 2.44     24.36      0.62         20     0.03     0.03   multigridpoissonsolver_nmod_inipoi
 1.38     24.71      0.35                                _init
 0.79     24.91      0.20     316932      ...      ...   multigridpoissonsolver_nmod_inicoe
  ...      ...        ...        ...      ...      ...   rhotempmultifgm
  ...      ...        ...        ...      ...      ...   sgsvre
  ...      ...        ...        ...      ...      ...   cflxv
  ...      ...        ...        ...      ...      ...   vflxw
  ...      ...        ...        ...      ...      ...   _start
  ...      ...        ...        ...      ...      ...   cflxu

8 Parallel optimization Strategy
- Choose a parallelization paradigm:
    OpenMP: xlf_r -qsmp=omp -qnosave
    MPI: mpfort (you don't have to specify the MPI library explicitly)
    Hybrid (MPI + OpenMP)
    Other
- Keep realistic expectations.
- Remember Amdahl's law: the speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program.

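9 Parallel optimization Amdahl's law
$$S = \frac{1}{(1 - P) + P / S_p}$$
$S$ = speedup, $P$ = parallel fraction, $S_p$ = speedup of the parallel fraction.

Run time relative to a single task, for a code that is 10% sequential and 90% parallel:
Tasks   Sequential   Parallel
1       10%          90%
2       10%          45%
4       10%          22.5%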
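As a small worked example of the formula, the following C sketch evaluates Amdahl's law for a given parallel fraction P and parallel speedup Sp. The function name amdahl_speedup is chosen here for illustration only.

```c
#include <assert.h>

/* Amdahl's law: S = 1 / ((1 - P) + P / Sp).
   p  : fraction of the run time that can run in parallel (0..1)
   sp : speedup achieved on that parallel fraction (> 0)        */
double amdahl_speedup(double p, double sp)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(sp > 0.0);
    return 1.0 / ((1.0 - p) + p / sp);
}
```

For the 10% / 90% split on the slide, amdahl_speedup(0.9, 2.0) is about 1.82 and amdahl_speedup(0.9, 4.0) is about 3.08. However large Sp becomes, the result stays below 1 / (1 - P) = 10, which is why the sequential fraction deserves attention first.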

10 Sequential optimization strategy (1)
- Check the literature for the most efficient algorithm.
- Check whether the algorithm is already implemented: use existing applications or libraries!
- Choose a language:
    Don't use interpreted code (Python, Perl, ...).
    Traditional: C, C++, Fortran + MPI or OpenMP.
    New developments: UPC, CAF, ...
- Choose a compiler, preferably the vendor-supplied compiler. On Huygens: IBM's XL compiler suite.

11 Sequential optimization strategy (2)
- Write clean, not too complex code: let the compiler do as much work as possible.
- But still keep in mind while designing your code:
    Modern CPUs prefer operations on large arrays.
    Locality and reuse of memory.
- Profile your code and find the hotspots.
- Try out compiler options for optimization:
    During development and debugging: use conservative optimization flags.
    Then use more aggressive optimization. This can alter the logic of the code: keep checking the consistency of the results.
- Reassess critical pieces of code.

12 Sequential optimization Use libraries
- For BLAS, LAPACK and FFT routines use vendor libraries such as ESSL (IBM) or MKL (Intel).
- Check out other numerical libraries (NAG, IMSL, ScaLAPACK, FFTW, PETSc, MUMPS).
- Have a look on the internet for a list of libraries and their contents (table of support routines, direct solvers, sparse direct solvers, preconditioners, sparse iterative solvers, sparse eigenvalue solvers): http://www.netlib.org/utk/people/jackdongarra/lasw.

13 Sequential optimization Use libraries

14 Sequential optimization Characteristics of Modern CPUs
- Pipelined floating point processor:
    Very efficient when performing the same operation on large arrays.
    After a certain startup time it can produce a result each clock cycle.
- Two or more levels of cache memory:
    Accessing vectors in memory with stride 1 is very important.
    Try to reuse data that is already in cache.
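To illustrate what stride 1 means in practice, here is a minimal C sketch that sums the same matrix twice, once walking memory contiguously and once jumping a full row per step. It assumes an n x n matrix stored row-major in a flat array (the C convention; Fortran arrays are column-major, so the roles of the indices are reversed there). The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Stride 1: the inner loop runs over the column index j, so
   consecutive iterations touch consecutive memory locations. */
double sum_stride1(const double *a, size_t n)
{
    assert(a != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            s += a[i * n + j];
    return s;
}

/* Stride n: the inner loop runs over the row index i, so each
   iteration jumps n elements ahead and cache lines are reused poorly. */
double sum_stride_n(const double *a, size_t n)
{
    assert(a != NULL);
    double s = 0.0;
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i)
            s += a[i * n + j];
    return s;
}
```

Both functions visit every element exactly once; only the order differs. For matrices that do not fit in cache, the stride-1 version is typically the faster one, although an optimizing compiler may interchange the loops on its own.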

15 Sequential optimization Compiler options
- Invocation: xlc, xlc_r, xlC, xlC_r, xlf, xlf_r, xlf90, xlf90_r, mpcc, mpCC, mpfort
- Parallelization: OpenMP flags: -qsmp=omp -qnosave
- Optimization: -O[n] -qstrict -qhot -qarch=<proc> -qtune=<proc> -qcache=<proc> -ipa -qessl
- For <proc> choose auto or pwr6.

16 Machine optimization
- Check SMT (Simultaneous Multithreading). Gromacs is optimal using 62 tasks per node.
- Use processor affinity.
- Check huge pages.

17 Sequential optimization Example: matrix multiply
Demonstrate:
- Choice of compiler
- Choice of compiler options
- Effect of stride > 1
- Using a library implementation
- Using a Fortran intrinsic function
- Replacing intrinsics with ESSL
- Influence of simple optimization

18 Sequential optimization Example: Matrix multiply

    ! Matrix multiply C = A * B
    c = 0.d0
    call cpu_time(t1)
    do i=1,ndim
      do j=1,ndim
        do k=1,ndim
          c(i,j) = c(i,j) + a(i,k)*b(k,j)
        enddo
      enddo
    enddo
    call cpu_time(t2)
    gflops = 2.d0*ndim*ndim*ndim/(giga*(t2-t1))
    print *,c(1,1),c(ndim,ndim)
    write(6,'("loop order: i,j,k CPU time: ",1pd14.7," GFLOPS: ",f10.3)') t2-t1, gflops
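For readers working in C, the same benchmark kernel might look as follows. This is a sketch under stated assumptions rather than a translation of the deck's program: matrices are n x n and stored row-major in flat arrays, giga is taken to be 1.0e9, and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* C = A * B with loop order i,j,k (i outermost, k innermost). */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t x = 0; x < n * n; ++x)
        c[x] = 0.0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* The inner statement does one multiply and one add, so the whole
   product costs 2*n^3 floating point operations. Dividing by the
   elapsed time and by 1.0e9 (assumed value of "giga") gives GFLOPS. */
double matmul_gflops(size_t n, double seconds)
{
    assert(seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / (1.0e9 * seconds);
}
```

A 1000 x 1000 product that takes 2 seconds of CPU time therefore runs at 1 GFLOPS.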

19 Sequential optimization Example: loop order + compiler options
Compiler and options per column:
(1) gfortran -m64 -O3 -ffree-format
(2) xlf90_r -O2 -qstrict
(3) xlf90_r -O3 -qstrict
(4) xlf90_r -O3 -qhot
(5) xlf90_r -O3 -qhot -qstrict
(6) xlf90_r -O3 -qhot -qtune=auto -qarch=auto -qcache=auto
(7) xlf90_r -O4 -qstrict
(8) xlf90_r -qessl -O3 -qhot -qstrict -qtune=auto -qarch=auto -qcache=auto

CPU time in seconds per loop order (outermost index first):
Loop order    (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)
k,j,i        15.4   15.9   9.16   7.53   6.48   7.51   6.74   6.62
j,k,i        15.2   13.2   11.4   7.53   6.47   7.52   6.77   6.61
i,j,k        53.8   62.0   53.1   7.52   8.71   7.53   6.76   8.25
j,i,k        42.0   50.2   36.6   7.53   6.48   7.52   6.73   6.59
i,k,j        82.2   68.4   69.3   7.53   6.51   7.52   6.76   6.68
k,i,j        81.6   68.2   65.8   7.52   6.47   7.52   6.72   6.66
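The effect of loop order can be tried out in C with the sketch below. Note the assumption that changes the picture compared with the Fortran timings: C stores arrays row-major, so here the favourable order is the one with j innermost, whereas in Fortran (column-major) it is the one with i innermost. Function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Loop order i,k,j: the inner loop walks c and b with stride 1
   (row-major storage), and a(i,k) is a loop invariant. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t x = 0; x < n * n; ++x)
        c[x] = 0.0;
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < n; ++k) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
}

/* Loop order j,k,i: the inner loop walks c and a with stride n,
   which is the unfavourable case for row-major storage. */
void matmul_jki(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t x = 0; x < n * n; ++x)
        c[x] = 0.0;
    for (size_t j = 0; j < n; ++j)
        for (size_t k = 0; k < n; ++k) {
            double bkj = b[k * n + j];
            for (size_t i = 0; i < n; ++i)
                c[i * n + j] += a[i * n + k] * bkj;
        }
}
```

Both routines compute the same product. Timing them at low optimization levels should show a gap similar in spirit to columns (1) to (3) of the table, while aggressive options that let the compiler restructure the loops may remove most of the difference, as columns (4) to (8) suggest for the XL compiler.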

20 Optimization strategy Example: using libraries
Compiler options, columns (1) to (3): -O3 -qhot -qstrict
Compiler options, column (4): -O3 -qhot -qstrict -qessl

Method        (1)    (2)    (3)    (4)
Library      BLAS  ATLAS   ESSL   ESSL
j,k,i        6.13   6.25   6.41   6.48
dgemm        9.93   7.53   1.16   1.15
dgemul       n.a.   n.a.   1.16   1.16
matmul       4.31   4.69   4.72   1.18

21 Optimization strategy Example: Matrix multiply

    call cpu_time(t1)
    ! Transpose A once, so that the innermost loop over k reads a_trans with stride 1
    do i=1,ndim
      do j=1,ndim
        a_trans(j,i) = a(i,j)
      enddo
    enddo
    do i=1,ndim
      do j=1,ndim
        do k=1,ndim
          ! Original statement: c(i,j) = c(i,j) + a(i,k)*b(k,j)
          c(i,j) = c(i,j) + a_trans(k,i)*b(k,j)
        enddo
      enddo
    enddo
    call cpu_time(t2)
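The same idea carries over to C, with one change that follows from the storage order. In row-major C the strided operand of the i,j,k loop is B rather than A, so this sketch transposes B. It is an adaptation under that assumption, not the deck's code, and matmul_transposed is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* C = A * B for n x n row-major matrices, using a transposed copy of B
   so that both operands of the inner loop are read with stride 1.
   Returns 0 on success, -1 if the scratch buffer cannot be allocated. */
int matmul_transposed(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    assert(a != NULL && b != NULL && c != NULL);

    double *bt = malloc(n * n * sizeof *bt);
    if (bt == NULL)
        return -1;

    /* One-off O(n^2) transpose: bt(j,k) = b(k,j). */
    for (size_t k = 0; k < n; ++k)
        for (size_t j = 0; j < n; ++j)
            bt[j * n + k] = b[k * n + j];

    /* O(n^3) product: a(i,:) and bt(j,:) are both contiguous in k. */
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            double s = 0.0;
            for (size_t k = 0; k < n; ++k)
                s += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = s;
        }

    free(bt);
    return 0;
}
```

The transpose costs n^2 copies and n^2 extra doubles of memory, which is small next to the 2*n^3 operations of the product once n is large.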

22 Optimization An example: matrix multiply
Compiled with xlf_r -O2 -qstrict:
- Original code: 60.2 s
- With a_trans: 13.5 s
Compiled with xlf_r -O4 -qstrict:
- Original code: 6.76 s
- With a_trans: 3.63 s

23 Questions
