Application performance analysis on Pilatus
Abstract

The US group at CSCS performed a set of benchmarks on Pilatus using the three Programming Environments available (GNU, Intel, PGI); the results can be retrieved on the web-site (link at the bottom of the page). The purpose of these benchmarks is to indicate which of the available compilers builds the executable with the best performance: it turned out that the Intel and GNU compilers produce the fastest executables (with hyperthreading enabled). In some cases the codes have been compared against the equivalent counterparts installed on Monte Rosa; this additional information is not meant as a strict measurement of the difference between the two systems, but rather as a hint of the performance that users can expect when using the default applications. Below we report the procedure used to run the benchmarks, together with a summary of the results and the scalability plots (all in log-log scale) of a few selected libraries and applications widely used by the scientific community, as well as parallel benchmarks.

Information on the cluster and the processors

Vendor_id: GenuineIntel
Cores: 16 (32 with hyperthreading)
Packages (sockets): 2
Model name: Intel(R) Xeon(R) CPU E GHz
CPU MHz: (if powersave on)
Cache size: KB
Memory: 64 GB DDR3 at 1333 MHz per node
Interconnect: Mellanox ConnectX FDR InfiniBand, currently configured as QDR

Programming Environments and MPI library

PrgEnv-intel/
PrgEnv-gnu/4.7.1
PrgEnv-pgi/12.5
MPI library: mvapich2/1.8

Benchmarks

Gromacs-4.5.5, NPB, Echam6, SPECFEM3D, FFTW, ScaLAPACK, STREAM, SPH-flow

More details on the benchmarks, with the full data, are available at the following web-page:
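As an illustration, a typical session to select a Programming Environment and the MPI library before building any of the benchmarks might look as follows (a minimal sketch; the module names are the ones listed above, the exact versions installed on Pilatus may differ):

    module load PrgEnv-gnu/4.7.1      # or PrgEnv-pgi/12.5, or the PrgEnv-intel module
    module load mvapich2/1.8          # MPI library used for all the benchmarks
    mpicc -show                       # check which compiler the MPI wrapper invokes
    grep -c ^processor /proc/cpuinfo  # 32 logical CPUs per node with hyperthreading enabled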
2 Gromacs HDL benchmark

The module gromacs/4.5.5 loads the PATH for the Gromacs molecular dynamics engine. The source code is available for download at the web-page. An optimized FFTW3 build is necessary, since the pre-compiled libraries do not use AVX instructions; the MKL FFT with Intel does not perform better than FFTW, therefore only the latter is shown. The module fftw/3.3.2 sets the include and library paths. The optimization flags used are -O3 with -mavx (Intel), -march=corei7-avx (GNU) and -tp=sandybridge-64 -Mvect=simd:256 (PGI). On Rosa we used gcc with -O3.

Simulation setup and Results

Retrieve the input file hdl.tpr from /project/csstaff/lucamar/benchmark/gromacs/hdl. Run a performance and scalability test with 32, 64, 128, 256 and 512 MPI tasks using 32 tasks per node (from 1 up to 8 nodes). The runtimes (s) of the benchmark are plotted for each Programming Environment against the number of MPI tasks. Gromacs on the SandyBridge (with hyperthreading) performs better than on Monte Rosa. The relatively poor performance of the version compiled with the PGI compiler is due to the fact that FFTW3 cannot be compiled with AVX intrinsics using PGI. The difference in performance between Pilatus and Rosa is easily explained, since Gromacs has assembly loops written for Intel architectures: when SSE and AVX are used, the performance does not depend on the compiler.
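A minimal sketch of the scaling runs (the mdrun_mpi binary name and the plain mpirun launch are assumptions; the batch system syntax is omitted):

    module load gromacs/4.5.5 fftw/3.3.2
    cp /project/csstaff/lucamar/benchmark/gromacs/hdl/hdl.tpr .
    for n in 32 64 128 256 512; do            # 32 MPI tasks per node, from 1 up to 8 nodes
        mpirun -np $n mdrun_mpi -s hdl.tpr -deffnm hdl_$n
    done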
3 NAS Parallel Benchmarks (NPB)

The NAS Parallel Benchmarks (NPB) are a small set of programs designed to evaluate the performance of supercomputers: In the present benchmark we use CLASS=D of the hybrid MPI+OpenMP implementation of BT (Block Tri-diagonal solver), multi-zone version (BT-MZ). The compilation flags are:

Intel: -O3 -mavx -mcmodel=medium -shared-intel -openmp
GNU: -O3 -march=corei7-avx -fopenmp -mcmodel=medium
PGI: -O3 -tp=sandybridge-64 -Mvect=simd:256 -mcmodel=medium -mp

Build with make bt-mz CLASS=D NPROCS=n, where n=1,2,... is the number of MPI tasks (a build and launch sketch is given below).

Simulation Setup and Results

The CLASS=D executables were run on 1 to 16 nodes using up to 16 MPI tasks (1 task per node) and 32 OpenMP threads per task. The runtimes (s) of the benchmark are plotted for each Programming Environment against the total number of CPUs. The NAS Parallel Benchmarks show good scalability with all the compilers available on the Intel SandyBridge. The Intel compiler shows the best performance and scalability, while GNU and PGI perform relatively poorly.
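For example, the class D BT-MZ binary for 16 MPI tasks can be built and launched roughly as follows (the binary path under bin/ and the -ppn option of the launcher are assumptions):

    make bt-mz CLASS=D NPROCS=16              # in the NPB multi-zone source directory
    export OMP_NUM_THREADS=32                 # 32 OpenMP threads per MPI task
    mpiexec -np 16 -ppn 1 ./bin/bt-mz.D.16    # 1 MPI task per node on 16 nodes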
4 Echam6

Echam is a global climate model developed by the Max Planck Institute for Meteorology. Website:

Echam6 has been compiled on Pilatus with the optimization flag -O2 for the PGI compiler and -O3 with AVX instructions for the GNU compiler (the Intel compiler was used on Rosa).

Simulation Setup and Results

The Echam6 executables ran the T63L47GR15 model for a one-month prediction on 1 to 8 nodes using 32 MPI tasks per node. The runtimes (s) of the benchmark are plotted for the Programming Environments against the total number of MPI tasks. The best performance is obtained with the GNU compiler, which also outperforms the same version installed on Monte Rosa below 256 CPUs. Echam6 gives a segmentation fault at runtime when compiled with Intel on Pilatus.
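A sketch of the scaling runs, assuming an executable named echam6 and leaving out the model namelist setup (both the executable name and the plain mpirun launch are assumptions):

    for nodes in 1 2 4 8; do
        np=$((nodes * 32))                    # 32 MPI tasks per node
        mpirun -np $np ./echam6 > echam6_T63L47GR15_${np}tasks.log 2>&1
    done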
5 SPECFEM3D (courtesy of J. Poznanovic)

Specfem3d simulates seismic wave propagation in sedimentary basins or any other regional geological model. See:

Specfem3d was compiled with all compilers on Pilatus, using -O3, some floating-point optimization flags and the appropriate architecture flags.

Simulation Setup and Results

The performance of SPECFEM3D has been benchmarked on Pilatus and Rosa using all available compilers, from 1 up to 16 nodes. The runtimes (s) of the benchmark are plotted for the Programming Environments against the number of nodes. The performance spread between different compilers on the same architecture is significant. Note the wide performance gap of GNU on Pilatus vs. Rosa. The Intel compiler on Pilatus gives the best performance, comparable to Cray on Rosa.
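As an illustration of how the per-compiler flags enter the build, a configure line of the following form can be used (the FC/MPIFC/FLAGS_CHECK variable names are assumptions based on the usual SPECFEM3D build system; the exact lines used for these runs are not reported here):

    ./configure FC=ifort MPIFC=mpif90 FLAGS_CHECK="-O3 -mavx"   # Intel; GNU and PGI analogous
    make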
6 FFTW

FFTW is a C library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, with real and complex data: The optimization flags used to compile are -O3 with SandyBridge-specific options.

Simulation Setup and Results

The performance of FFTW 3.3.2 has been tested with all compilers available on Pilatus, using the FFTW_MEASURE planner flag on a two-dimensional mesh. The performance has been compared against Rosa (GNU version), from 32 to 320 CPUs. The runtimes (s), averaged over several calls, are plotted against the number of MPI tasks. The performance of the GNU and Intel compilers is similar, while PGI, with no AVX intrinsics available for FFTW, performs generally worse, except when the number of CPUs is not a divisor of the mesh size: in that case the performance of GNU and Intel is also poor. The default fftw/ module on Rosa (GNU 4.7) is outperformed by fftw/3.3.2 on Pilatus.
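A plausible configure sketch for an AVX-enabled FFTW 3.3.2 build (the exact configure lines used on Pilatus are not reported; --enable-avx and --enable-mpi are standard FFTW 3.3 options):

    # GNU environment; Intel analogous with CC=icc CFLAGS="-O3 -mavx".
    # PGI cannot use --enable-avx, hence the weaker PGI results above.
    ./configure CC=gcc CFLAGS="-O3 -march=corei7-avx" --enable-avx --enable-mpi
    make && make install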
7 ScaLAPACK

ScaLAPACK is a library of high-performance linear algebra routines for parallel distributed-memory machines. ScaLAPACK solves dense and banded linear systems, least-squares problems, eigenvalue problems and singular value problems. More information and the source code are available at:

Simulation Setup and Results

The performance of the ScaLAPACK 2.0.2 function PDSYEV has been tested with the GNU and Intel compilers on Pilatus on a fixed matrix size. The performance has been compared against the ScaLAPACK available on Rosa (Intel version of cray-libsci/ ), from 1 to 16 nodes (32 CPUs per node). The runtimes (s) are plotted against the number of MPI tasks. The best performance is obtained with the Intel compiler; however, the GNU build on Pilatus still outperforms the PDSYEV routine provided by Cray and available in the cray-libsci/ module on the Cray XE6 Monte Rosa (Intel version).
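A minimal sketch of building and running a PDSYEV test driver against ScaLAPACK 2.0.2 (the driver source pdsyev_test.f90, the SCALAPACK_ROOT variable and the plain mpirun launch are all hypothetical):

    module load PrgEnv-gnu/4.7.1 mvapich2/1.8
    mpif90 -O3 -march=corei7-avx pdsyev_test.f90 \
        -L$SCALAPACK_ROOT/lib -lscalapack -llapack -lblas -o pdsyev_test
    mpirun -np 512 ./pdsyev_test              # 16 nodes x 32 MPI tasks per node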
8 STREAM

STREAM is a synthetic benchmark program that measures sustained memory bandwidth for simple vector kernels: It is designed to work with data sets larger than the Last Level Cache (LLC) on the target system, so that the results are indicative of very large vector-oriented applications. It is available in Fortran and C, with OpenMP and MPI versions. (Reference: Performance Guide for HPC Applications on iDataPlex dx360 M4 systems.)

To avoid measuring cache bandwidth, a minimum array size is required. The array sizes must be chosen so that each of the three arrays is at least 4x larger than the sum of all the last-level caches used in the run: each SandyBridge chip has a shared L3 cache of 20 MB, i.e. the two chips of a compute node have 40 MB, or the equivalent of 5 million double precision words. The size of each array must therefore be at least four times 5 million double precision words (~160 MB).

Simulation setup and Results

The benchmark measures the 4 following loops:

COPY: a(i) = b(i), with memory access to 2 double precision words (16 bytes) and no floating point operation per iteration
SCALE: a(i) = q * b(i), with memory access to 2 double precision words (16 bytes) and one floating point operation per iteration
ADD: a(i) = b(i) + c(i), with memory access to 3 double precision words (24 bytes) and one floating point operation per iteration
TRIAD: a(i) = b(i) + q * c(i), with memory access to 3 double precision words (24 bytes) and two floating point operations per iteration

Intel SandyBridge, GNU 4.7.1: make CC=gcc CFLAGS="-DN= -O3"
Intel SandyBridge, Intel 12.1.2: make CC=icc CFLAGS="-DN= -O3"
The output tables report, for each of Copy, Scale, Add and Triad, the rate (MB/s) and the average, minimum and maximum times; the full figures are part of the data available on the web-page.

Data: On AMD Interlagos (Cray XE6) the measured bandwidth is ~6 GB/s for all loops and both compilers (gnu/47 and intel/12), while on the Intel SandyBridge the measured bandwidth is ~13 GB/s for all loops and both compilers (gnu/47 and intel/12). The two sets of SandyBridge results show that single-processor binaries produced by the Intel compiler are faster, if the compiler options recommended by Intel are used.

Known issue with the Intel compiler:
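Following the sizing rule above, a build and run sketch might look as follows (the array size 80000000 is an illustrative value that satisfies the minimum of 4 x 5 million doubles per array; the original runs used a value not recorded here, and the binary name depends on the STREAM Makefile used):

    make CC=gcc CFLAGS="-DN=80000000 -O3"     # GNU 4.7.1 build; use CC=icc for Intel 12.1.2
    ./stream_c.exe                            # single-processor run, as in the results above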
9 SPH-flow

SPH-flow is a hybrid MPI/OpenMP code using the HDF5 library for I/O. Get the source code from:

Choose the Programming Environment and load hdf5/. The optimization flags are the following:

GNU: -O3 -march=corei7-avx
Intel: -O3 -mavx -shared-intel -mcmodel=medium
PGI: -O3 -tp=sandybridge-64 -Mvect=simd:256 -mcmodel=medium

Simulation setup and Results

The walltimes (in s) shown here correspond to pure MPI jobs only, using 32 cores per node on 2 up to 20 nodes of the SandyBridge and 4 steps of dambreak.ini. The best performance is obtained with the Intel compiler, average performance with GNU and the worst performance with PGI. The difference between the fastest (Intel) and the slowest (PGI) executable is considerable (~50% on 20 nodes).
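A sketch of the pure MPI scaling runs (the executable name sph-flow and the plain mpirun launch are assumptions):

    module load PrgEnv-intel
    module load hdf5
    for nodes in 2 4 8 16 20; do
        mpirun -np $((nodes * 32)) ./sph-flow dambreak.ini    # 4 steps of the dambreak case
    done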
10 Conclusions

The purpose of the benchmarks reported here is to guide users on Pilatus towards the compiler that builds the executable with the best performance. Among the applications that we have tested, the best performance is in general achieved with the Intel and GNU compilers. The PGI compiler performs worse than the other two most of the time, with the rare exceptions reported above. The performance of the default versions of some applications installed on Monte Rosa is presented as well in the current report. These data are included to give the user an idea of the performance that might be obtained using the default modules available on the two systems, without trying to implement specific optimizations. However, the present comparison cannot strictly measure a performance difference: in some cases different versions were compared on the two machines, e.g. FFTW on Pilatus vs. FFTW on Rosa, and the ScaLAPACK library built from the downloaded source code vs. the cray-libsci version of the ScaLAPACK routines.
