Cluster performance, how to get the most out of Abel. Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18th 2013


1 Cluster performance, how to get the most out of Abel Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18th 2013

2 Introduction Architecture x86-64 and NVIDIA Compilers MPI Interconnect Storage Batch queue system

3 Installed compute hardware 630 Supermicro nodes Two socket Intel E GHz octa core 64 GiB memory FDR InfiniBand

4 Compute nodes - performance CPU performance 332 Gflops/s Theoretical 318 Gflops/s Practical HPL (top500) Memory bandwidth 63 GiB/s Practical (STREAM) Memory latency 115 nanoseconds (random access)
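
The theoretical figure on this slide can be reproduced with a short calculation. The sketch below assumes 8 double precision floating point operations per core per clock cycle (one 4-wide AVX add plus one 4-wide AVX multiply, which is what Sandy Bridge can issue) and the 2.6 GHz clock reported by the HPL script on the next slide; the function name is only illustrative.

```python
def peak_gflops(sockets, cores_per_socket, clock_ghz, flops_per_cycle):
    """Theoretical peak of one node in Gflops."""
    return sockets * cores_per_socket * clock_ghz * flops_per_cycle


# Two sockets, eight cores each, 2.6 GHz.
# Assumption: 8 double precision flops per cycle per core with AVX.
peak = peak_gflops(sockets=2, cores_per_socket=8, clock_ghz=2.6, flops_per_cycle=8)

# Measured HPL result from the slide, as a fraction of the peak.
hpl_efficiency = 318.0 / peak

print(f"peak: {peak:.1f} Gflops, HPL efficiency: {hpl_efficiency:.1%}")
# prints "peak: 332.8 Gflops, HPL efficiency: 95.6%"
```

The same formula works for other node types once the core count, clock and flops per cycle are known.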

5 Node performance High Performance Linpack performance (top500 test): [HPL output table, values not preserved. Columns: T/V, N, NB, P, Q, Time, Gflops; run variant WR11R2R; residual check ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*n) PASSED] [olews@login-0-0 hpl]$ ./xhpl-max.sh HPL.single.node.log No clock freq given, setting it to 2.6 GHz High perf. linpack results: [summary table, values not preserved. Columns: Params, size, block, nxm, time, Tflops, %peak; three WR11R2R rows]

6 Installed compute hardware, GPU 16 Supermicro nodes with GPUs Two socket, two GPUs Intel E GHz quad core 64 GiB memory FDR InfiniBand Tesla K20Xm 6 GiB memory 2688 SP cores 896 DP cores

7 Node performance, all hardware High Performance Linpack performance (top500 test): [HPL output table, values not preserved. Columns: T/V, N, NB, P, Q, Time, Gflops; run variant WR10L2L; residual check ||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*n) PASSED] [olews@login-0-0 hpl]$ ./xhpl-max.sh HPL.single.node.log High perf. linpack results: [summary table, values not preserved. Columns: Params, size, block, nxm, time, Tflops, %peak; three WR10L2L rows]

8 Node performance, two K20Xm [Figure: SGEMM performance, GPU vs. CPU, Tesla K20X (CUDA BLAS) vs. Intel SB (MKL BLAS); single precision 32 bit, 2.6 Tflops/s; axes: performance [Gflops/s] vs. total matrix footprint [MiB]] [Figure: DGEMM performance, GPU vs. CPU, Tesla K20X (CUDA BLAS) vs. Intel SB (MKL BLAS); double precision 64 bit, 1 Tflops/s; axes: performance [Gflops/s] vs. total matrix footprint [MiB]]

9 InfiniBand Basics Ping-pong Latency key for performance Intra rack : 0.95 microseconds Inter rack : 1.40 microseconds Ping-pong Bandwidth 6.14 GiB/s All numbers measured under full production using OpenMPI
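
Why latency is key can be seen from the simplest possible cost model for a message: time = latency + size / bandwidth. The sketch below uses the intra rack latency and the ping-pong bandwidth from this slide; the latency unit is taken to be microseconds, which is an assumption (it is the typical order of magnitude for FDR InfiniBand). The function names are only illustrative, and real transfers deviate from this linear model.

```python
GIB = 2**30  # bytes per GiB


def transfer_time_s(nbytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: fixed startup latency plus streaming time."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def half_bandwidth_size(latency_s, bandwidth_bytes_per_s):
    """Message size at which latency and streaming time are equal."""
    return latency_s * bandwidth_bytes_per_s


LATENCY_S = 0.95e-6      # intra rack, assumed to be microseconds
BANDWIDTH = 6.14 * GIB   # bytes per second

n_half = half_bandwidth_size(LATENCY_S, BANDWIDTH)
print(f"messages below about {n_half:.0f} bytes are dominated by latency")

for nbytes in (8, 1024, 1024**2):
    t = transfer_time_s(nbytes, LATENCY_S, BANDWIDTH)
    print(f"{nbytes:>8} bytes: {t * 1e6:8.2f} microseconds")
```

Under this model an application that sends many small messages sees almost none of the 6.14 GiB/s; it pays the latency on every message.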

10 InfiniBand Basics TCP/IP over InfiniBand IPoIB on all nodes Both GbE and IB interfaces, named eth0 and ib0 Example: node compute-x-y has two interfaces, cx-y is the eth0 interface, ib-x-y is the ib0 interface scp gamessplus.tar.bz2 compute-9-1.local:/tmp/ gamessplus.tar.bz2 100% 722MB 55.5MB/s 00:13 scp gamessplus.tar.bz2 ib-9-1:/tmp/ gamessplus.tar.bz2 100% 722MB 90.2MB/s 00:08

11 Compilers Intel Fortran, C and C++ GNU C, C++ and Fortran (module available for the newest version) Portland Group C, C++ and Fortran Open64 C, C++ and Fortran

12 Environment set by module CC=icc F90=ifort F77=ifort FC=ifort FORTRAN=ifort FPATH & CPATH set to include files CXXFLAGS, CFLAGS, FFLAGS set to -O2 -xavx -mavx MKLPATH, MKLROOT, IPPROOT are also set to relevant paths No variables are set to point to interfaces (fftw etc.)

13 MPI programming environment MPI: OMPI_F77=ifort OMPI_FC=ifort OMPI_CC=icc OMPI_CXX=icpc CPATH, FPATH set to the OpenMPI include path plus the include path of the requested compiler

14 Intel set of compilers Versions 2011, 2013 currently installed Modules available for all versions ifort -help gives overview of options Documentation found at :

15 Intel compiler New vector unit in Sandy Bridge, 256 bits wide 8 floats / 4 double operations in one instruction! Options for in core parallelization SIMD, default by module : -xavx and -mavx

16 Intel compiler Options for code optimization
   -O1   optimize for maximum speed, but disable some optimizations which increase code size for a small speed benefit
   -O2   optimize for maximum speed (DEFAULT)
   -O3   optimize for maximum speed and enable more aggressive optimizations that may not improve performance on some programs
   -O    same as -O2
   -O0   disable optimizations
   -fast enable -xhost -O3 -ipo -no-prec-div -static; options set by -fast cannot be overridden, with the exception of -xhost; list the options separately to change behavior
   -fast is not always safe, especially -ipo; be aware.

17 Intel compiler Options for code optimization
   -xavx                 May generate Intel(R) Advanced Vector Extensions
   -unroll               Unroll loops
   -pad                  Change variable and array memory layout
   -fomit-frame-pointer  Enable use of EBP as general purpose register
   -ftz                  Enable/disable flushing denormal results to zero
   -fno-alias            Assume no aliasing in program
   -opt-report           Generate an optimization report
   -vec-report           Control amount of vectorizer diagnostic information
   Read the Intel manuals and study the optimization and vectorization reports.

18 Intel math libraries If I had only one slide today this would be it! MKL - Math Kernel library IPP - Intel Performance Primitives Imf, imv simple functions, exp, sin etc

19 Intel MKL All common datatypes supported Single, s (real,float) Double, d (real*8, double) Complex, c (complex) Double complex, z (complex*16) In many cases 32 and 64 bit integer, lp and ilp lib names (i4, i8 or int, long)
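
The one-letter datatype codes on this slide are the standard BLAS/LAPACK prefixes, so the routine name tells you the precision. A minimal sketch of the naming rule follows; the helper function is hypothetical and only builds names, it does not call MKL.

```python
# Standard BLAS/LAPACK precision prefixes, as listed on the slide.
PRECISION_PREFIX = {
    "s": "single precision real (real, float)",
    "d": "double precision real (real*8, double)",
    "c": "single precision complex (complex)",
    "z": "double precision complex (complex*16)",
}


def blas_name(prefix, routine="gemm"):
    """Build a routine name such as 'dgemm' from a prefix and a base name."""
    if prefix not in PRECISION_PREFIX:
        raise ValueError(f"unknown precision prefix: {prefix!r}")
    return prefix + routine


for p, meaning in PRECISION_PREFIX.items():
    print(f"{blas_name(p):6s} {meaning}")
```

The lp and ilp library variants are a separate choice: they select 32 or 64 bit integers for array indices and do not change the routine names.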

20 Intel MKL a well filled toolbox BLAS LAPACK BLACS SCALAPACK Sparse solvers Vector math Statistics FFT PBLAS (subset from Scalapack) Partial Differential Equations Nonlinear Optimization Problem Solvers Data Fitting Functions

21 Libraries significant gain? Example 1 matrix multiplication Matrix multiply dgemm function d : double precision, 64 bit floats Fortran reference dgemm : 29.3 sec Intel MKL dgemm : 17.9 sec Performance gain : speedup of 1.64 Not twice as fast...
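
The speedup quoted here is simply the ratio of the two run times. A small sketch of the arithmetic, using the timings from this slide (the function names are only illustrative):

```python
def speedup(t_reference_s, t_optimized_s):
    """How many times faster the optimized run is."""
    return t_reference_s / t_optimized_s


def time_saved_fraction(t_reference_s, t_optimized_s):
    """Fraction of the reference run time that is saved."""
    return 1.0 - t_optimized_s / t_reference_s


t_fortran = 29.3  # Fortran reference dgemm, seconds
t_mkl = 17.9      # Intel MKL dgemm, seconds

print(f"speedup: {speedup(t_fortran, t_mkl):.2f}")
print(f"time saved: {time_saved_fraction(t_fortran, t_mkl):.0%}")
```

A speedup of 1.64 corresponds to roughly 39% less wall time for the same work.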

22 Libraries significant gain? Example 2 compression Gzip compiled using IPP Standard gzip (gzip -9) : secs IPP gzip (gzip -9) : secs Performance gain : speedup of 1.99 Not twice as fast, but very close!

23 Libraries significant gain? Example 3 FFT Running the performance test suite provided with fftw is left as an exercise! The next training will present your submitted results!

24 Intel performance libraries Interfaces for : FFTW (Fortran and C, float/double i4/i8) Fortran and C BLAS95, LAPACK95 (Fortran 95 interface) (BLAS & LAPACK are included in std MKL) Interfaces located at : $MKLROOT/interfaces

25 Intel performance libraries Documentation at intel web site General landing page for documentation Specific for MKL

26 Portland set of compilers Version currently installed. Also a version with GPU support, we have a test cluster with GPUs. Modules available for all versions, module name pgi pgf90 -help gives an overview of options Documentation found at :

27 Portland compiler Intel Sandy Bridge supported with : -tp=sandybridge-64 pgf90 -help gives overview of flags -Mvect has a lot of options: -Mprefetch -Mvect=simd:256 -Minfo=all

28 Hints when running programs For MPI, try using all cores: --ntasks-per-node=16 Reserve the memory you need; this does not matter if you reserve all cores Provide sensible time estimates Place files not needed for backup in a directory called «nobackup»
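
These hints map directly onto the header of a SLURM job script. The sketch below renders such a header from Python. The option names (--job-name, --nodes, --ntasks-per-node, --mem-per-cpu, --time) are standard sbatch options, but the job name, memory and time values are hypothetical placeholders, and site specific options such as the account are left out.

```python
def sbatch_header(job_name, nodes, ntasks_per_node=16,
                  mem_per_cpu_mb=2000, time_limit="01:00:00"):
    """Return the #SBATCH header lines of a job script as one string."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        # Use all 16 cores of a two socket, octa core node for MPI jobs.
        f"#SBATCH --ntasks-per-node={ntasks_per_node}",
        # Reserve the memory you need (placeholder value, in megabytes).
        f"#SBATCH --mem-per-cpu={mem_per_cpu_mb}M",
        # Provide a sensible time estimate (placeholder value, hh:mm:ss).
        f"#SBATCH --time={time_limit}",
    ]
    return "\n".join(lines) + "\n"


print(sbatch_header("my_mpi_job", nodes=2), end="")
```

The job commands themselves would follow the header in the same file.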

29 Scratch space Highest performing disk storage available Bandwidth: approx. 8 GiB/sec sequential, using 16 jobs Metadata: 3635 create, fill, touch and delete file operations per second (metadata benchmark, using 64 jobs) Run metadata benchmark faster than expected Cleaned every night, orphan files deleted

30 Scratch space The directory $SCRATCH is cleared and removed when the SLURM job terminates To copy back files from $SCRATCH automatically, use the script chkfile or a copy back to your home dir. Do not run jobs that use file IO on /cluster ($HOME); this upsets the backup. If you do, use a directory called nobackup

31 Scratch space Environment variable SCRATCH still points to a SLURM job unique directory The directory $SCRATCH is now shared and common to all nodes allocated to the job Performance of $SCRATCH is far higher than that of the local SATA disk used before Do use $SCRATCH as your work area!

32 Disk limitations FhGFS does not like a huge number of files Millions of files are bad Backup cannot cope with millions of files Put them in a nobackup directory Tar them into a single tar archive Do use $SCRATCH as your work area!
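
Packing many small files into one archive can be done with the tar command or, as in the minimal sketch below, with Python's standard tarfile module. The function name and the paths in the usage comment are hypothetical; the archive should be written outside the directory being packed.

```python
import tarfile
from pathlib import Path


def pack_directory(src_dir, archive_path):
    """Pack src_dir into one gzip compressed tar archive.

    Returns the number of regular files that were packed.
    """
    src = Path(src_dir)
    n_files = sum(1 for p in src.rglob("*") if p.is_file())
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps paths in the archive relative to the directory name.
        tar.add(str(src), arcname=src.name)
    return n_files


# Hypothetical usage: one archive instead of a directory full of small files.
# pack_directory("results", "nobackup/results.tar.gz")
```

The file system and the backup then see a single file, regardless of how many files the job produced.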
