Application performance analysis on Pilatus

Abstract

The US group at CSCS performed a set of benchmarks on Pilatus using the three Programming Environments available (GNU, Intel, PGI); the results can be retrieved from the website http://hpcforge.org (link at the bottom of the page). The purpose of these benchmarks is to indicate which of the available compilers builds the executable with the best performance: the Intel and GNU compilers came out ahead (with hyperthreading enabled). In some cases the codes have been compared against their counterparts installed on Monte Rosa; this additional information is not meant as a strict measurement of the difference between the two systems, but rather as a hint of the performance users can expect when using the default applications. Below we report the procedure used to run the benchmarks, together with a summary of the results and the scalability plots (all in log-log scale) of a few selected libraries, applications widely used by the scientific community, and parallel benchmarks.

Information on the cluster and the processors

Vendor_id: GenuineIntel
Cores: 16 per node (32 with hyperthreading)
Packages (sockets): 2
Model name: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
CPU MHz: 1200.000 (if powersave is on)
Cache size: 20480 KB
Memory: 64 GB DDR3 at 1333 MHz per node
Interconnect: Mellanox ConnectX FDR InfiniBand, currently configured as QDR

Programming Environments and MPI library

PrgEnv-intel/12.1.2.273
PrgEnv-gnu/4.7.1
PrgEnv-pgi/12.5
MPI library: mvapich2/1.8

Benchmarks

Gromacs-4.5.5, NPB, Echam6, SPECFEM3D, FFTW, ScaLAPACK, STREAM, SPH-flow

More details on the benchmarks, with the full data, are available at the following web page: https://hpcforge.org/plugins/mediawiki/wiki/mcbench/index.php/pilatus_(sandybridge_cluster)_benchmarks
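
As a quick orientation, the lines below sketch how one of the listed Programming Environments and the MPI library might be selected before building a benchmark. The module names are taken from the list above; the module command sequence itself is an assumption for illustration, not a prescribed setup procedure.

# hypothetical environment setup before building a benchmark
module load PrgEnv-intel/12.1.2.273   # or PrgEnv-gnu/4.7.1, PrgEnv-pgi/12.5
module load mvapich2/1.8              # MPI library used for all benchmarks
module list                           # verify the loaded environment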

Gromacs 4.5.5 HDL benchmark

The module gromacs/4.5.5 sets the PATH for the Gromacs molecular dynamics engine. The source code is available for download at http://www.gromacs.org/. An optimized FFTW3 build is necessary, since the pre-compiled libraries do not use AVX instructions; the MKL FFT with Intel does not perform better than FFTW, therefore only the latter is shown. The module fftw/3.3.2 sets the include and library paths. The optimization flags used are -O3 with -mavx (Intel), -march=corei7-avx (GNU), and -tp=sandybridge-64 -Mvect=simd:256 (PGI). On Rosa we used gcc-4.6.2 with -O3.

Simulation setup and results

Retrieve the input file hdl.tpr from /project/csstaff/lucamar/benchmark/gromacs/hdl. Run a performance and scalability test with 32, 64, 128, 256 and 512 MPI tasks, using 32 tasks per node (from 1 up to 8 nodes). The runtimes (s) of the benchmark are plotted for each Programming Environment against the number of MPI tasks. Gromacs-4.5.5 on the SandyBridge (with hyperthreading) performs better than on Monte Rosa. The relatively poor performance of the version compiled with the PGI compiler is due to the fact that FFTW3 cannot be compiled with AVX intrinsics using PGI. The difference in performance between Pilatus and Rosa is easily explained, since Gromacs has assembly loops written for Intel architectures: once SSE and AVX are in use, the performance does not depend on the compiler used.
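
A minimal run sketch for this test is shown below. The mpirun launcher and the mdrun binary name are assumptions based on a standard Gromacs 4.5.5 installation with mvapich2; the input path is the one given above.

# hypothetical run of the HDL benchmark on 8 nodes (256 MPI tasks, 32 per node)
module load PrgEnv-intel/12.1.2.273 mvapich2/1.8 fftw/3.3.2 gromacs/4.5.5
cp /project/csstaff/lucamar/benchmark/gromacs/hdl/hdl.tpr .
mpirun -np 256 mdrun -s hdl.tpr -deffnm hdl_bench   # runtime reported at the end of hdl_bench.log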

NAS Parallel Benchmarks (NPB)

The NAS Parallel Benchmarks (NPB) are a small set of programs designed to evaluate the performance of supercomputers: http://www.nas.nasa.gov/publications/npb.html. In the present benchmark we use CLASS=D of the hybrid MPI+OpenMP implementation of the BT (Block Tri-diagonal) solver, multi-zone version (BT-MZ). The compiler flags are:

Intel: -O3 -mavx -mcmodel=medium -shared-intel -openmp
GNU: -O3 -march=corei7-avx -fopenmp -mcmodel=medium
PGI: -O3 -tp=sandybridge-64 -Mvect=simd:256 -mcmodel=medium -mp

Build with make bt-mz CLASS=D NPROCS=n, where n=1,2,... is the number of MPI tasks.

Simulation setup and results

The CLASS=D executables were run on 1 to 16 nodes using up to 16 MPI tasks (1 task per node) and 32 OpenMP threads per task. The runtimes (s) of the benchmark are plotted for each Programming Environment against the total number of CPUs. The NAS Parallel Benchmarks show good scalability with all the compilers available on the Intel SandyBridge. The Intel compiler shows the best performance and scalability, while GNU and PGI perform comparatively poorly.
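
A build-and-run sketch for the hybrid BT-MZ case is given below. The source directory, the executable naming scheme, and the mpirun/OMP_NUM_THREADS usage are assumptions based on the standard NPB multi-zone distribution and mvapich2.

# hypothetical build and run of BT-MZ, CLASS=D, 16 MPI tasks x 32 OpenMP threads
cd NPB3.3-MZ-MPI                     # assumed location of the hybrid MPI+OpenMP sources
make bt-mz CLASS=D NPROCS=16         # build as described above
export OMP_NUM_THREADS=32            # one MPI task per node, 32 threads per task
mpirun -np 16 ./bin/bt-mz.D.16       # executable name follows the NPB naming convention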

Echam6

Echam is a global climate model developed by the Max Planck Institute for Meteorology: http://www.mpimet.mpg.de/en/wissenschaft/modelle/echam.html. Echam6 has been compiled on Pilatus with the optimization flag -O2 for the PGI compiler and -O3 with AVX instructions for the GNU compiler (with Intel on Rosa).

Simulation setup and results

The Echam6 executables ran the T63L47GR15 model for a one-month prediction on 1 to 8 nodes using 32 MPI tasks per node. The runtimes (s) of the benchmark are plotted for the Programming Environments against the total number of MPI tasks. The best performance is obtained with the GNU compiler, which also outperforms the same version installed on Monte Rosa below 256 CPUs. Echam6 gives a segmentation fault at runtime when compiled with Intel on Pilatus.

SPECFEM3D (courtesy of J. Poznanovic)

SPECFEM3D simulates seismic wave propagation in sedimentary basins or any other regional geological model. See http://www.geodynamics.org/cig/software/specfem3d. SPECFEM3D was compiled with all compilers on Pilatus, using -O3, some floating-point optimization flags, and the appropriate architecture flags.

Simulation setup and results

The performance of SPECFEM3D has been benchmarked on Pilatus and Rosa using all available compilers, from 1 up to 16 nodes. The runtimes (s) of the benchmark are plotted for the Programming Environments against the number of nodes. The performance spread between different compilers on the same architecture is significant. Note the wide performance gap of GNU on Pilatus vs Rosa. The Intel compiler on Pilatus gives the best performance, comparable to Cray on Rosa.
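
The architecture flags are not spelled out above; the sketch below shows one plausible choice, reusing the per-compiler flags applied to the other codes in this report. The configure invocation and compiler wrapper names are assumptions, not the exact build lines used for this benchmark.

# hypothetical per-compiler configure lines, mirroring the flags used elsewhere in this report
./configure FC=ifort MPIFC=mpif90 FCFLAGS="-O3 -mavx"                                    # Intel
# ./configure FC=gfortran MPIFC=mpif90 FCFLAGS="-O3 -march=corei7-avx"                   # GNU
# ./configure FC=pgfortran MPIFC=mpif90 FCFLAGS="-O3 -tp=sandybridge-64 -Mvect=simd:256"  # PGI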

FFTW

FFTW is a C library for computing the discrete Fourier transform (DFT) in one or more dimensions, of arbitrary input size, for real and complex data: http://fftw.org/. The optimization flags used to compile are -O3 with SandyBridge-specific options.

Simulation setup and results

The performance of FFTW/3.3.2 has been tested with all compilers available on Pilatus, planning with the FFTW_MEASURE flag on a two-dimensional 32768 x 32768 mesh. The performance has been compared against Rosa (GNU version), from 32 to 320 CPUs. The runtimes (s), averaged over several calls, are plotted against the number of MPI tasks. The performance of the GNU and Intel compilers is similar, while PGI, without AVX intrinsics for FFTW, generally performs worse, except when the number of CPUs is not a divisor of the mesh size: in that case the performance of GNU and Intel degrades. The default fftw/3.3.0.1 on Rosa (GNU 4.7) is outperformed by fftw/3.3.2 on Pilatus.
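
A compile-and-run sketch for such a test is given below. The test source name, the FFTW_INC/FFTW_LIB variables assumed to be set by the fftw/3.3.2 module, and the link order are illustrative assumptions.

# hypothetical build of a 2D MPI FFTW test (Intel environment shown)
module load PrgEnv-intel/12.1.2.273 mvapich2/1.8 fftw/3.3.2
mpicc -O3 -mavx fftw2d_bench.c -I$FFTW_INC -L$FFTW_LIB -lfftw3_mpi -lfftw3 -lm -o fftw2d_bench
mpirun -np 320 ./fftw2d_bench 32768 32768   # plan with FFTW_MEASURE, time repeated transforms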

ScaLAPACK

ScaLAPACK is a library of high-performance linear algebra routines for parallel distributed-memory machines. ScaLAPACK solves dense and banded linear systems, least-squares problems, eigenvalue problems, and singular value problems. More information and the source code are available at http://www.netlib.org/scalapack/.

Simulation setup and results

The performance of the ScaLAPACK/2.0.2 function PDSYEV has been tested with the GNU and Intel compilers on Pilatus with a matrix of size 2048. The performance has been compared against the ScaLAPACK available on Rosa (Intel version of cray-libsci/11.1.00), from 1 to 16 nodes (32 CPUs per node). The runtimes (s) are plotted against the number of MPI tasks. The best performance is obtained with the Intel compiler; however, the GNU compiler on Pilatus still outperforms the PDSYEV routine provided by Cray in the module cray-libsci/11.1.00 on the Cray XE6 Monte Rosa (Intel version).
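
A build sketch for a PDSYEV test driver is shown below. The driver source name, the SCALAPACK_LIB variable, and the link lines (netlib ScaLAPACK for GNU, MKL cluster libraries for Intel) are assumptions about how such a test is typically linked, not the exact lines used for this report.

# hypothetical link of a PDSYEV test driver
module load PrgEnv-gnu/4.7.1 mvapich2/1.8
mpif90 -O3 -march=corei7-avx pdsyev_test.f90 -L$SCALAPACK_LIB -lscalapack -llapack -lblas -o pdsyev_gnu
# Intel alternative, taking ScaLAPACK/BLACS from MKL:
# mpif90 -O3 -mavx pdsyev_test.f90 -mkl=cluster -o pdsyev_intel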

STREAM

STREAM is a synthetic benchmark program that measures sustained memory bandwidth for simple vector kernels: http://www.cs.virginia.edu/stream/ref.html. It is designed to work with data sets larger than the last-level cache (LLC) of the target system, so that the results are indicative of very large vector-oriented applications. It is available in Fortran and C, with OpenMP and MPI versions. (Reference: Performance Guide for HPC Applications on iDataPlex dx360 M4 systems.)

To avoid measuring from cache, a minimum array size is required: each of the three arrays must be at least 4x larger than the sum of all the last-level caches used in the run. Each SandyBridge chip has a shared L3 cache of 20 MB, i.e. the two chips on a compute node have 40 MB, or the equivalent of 5 million double-precision words; the size of each array must therefore be at least four times 5 million double-precision words (~160 MB).

Simulation setup and results

The benchmark measures the following four loops:

COPY: a(i) = b(i) => memory access to 2 double-precision words (16 bytes) and no floating-point operation per iteration
SCALE: a(i) = q * b(i) => memory access to 2 double-precision words (16 bytes) and one floating-point operation per iteration
SUM: a(i) = b(i) + c(i) => memory access to 3 double-precision words (24 bytes) and one floating-point operation per iteration
TRIAD: a(i) = b(i) + q * c(i) => memory access to 3 double-precision words (24 bytes) and two floating-point operations per iteration

INTEL/SANDYBRIDGE/GNU/4.7.1: make CC=gcc CFLAGS="-DN=20000000 -O3"

Function   Rate (MB/s)   Avg time   Min time   Max time
Copy:      13247.3058    0.0242     0.0242     0.0242
Scale:     13102.4657    0.0245     0.0244     0.0247
Add:       13111.8950    0.0367     0.0366     0.0368
Triad:     13350.3927    0.0360     0.0360     0.0361

INTEL/SANDYBRIDGE/INTEL/12.1.2: make CC=icc CFLAGS="-DN=20000000 -O3"

Function   Rate (MB/s)   Avg time   Min time   Max time
Copy:      13477.7053    0.0238     0.0237     0.0240
Scale:      7765.8814    0.0413     0.0412     0.0417
Add:       10196.6943    0.0471     0.0471     0.0473
Triad:     10282.0995    0.0468     0.0467     0.0469

Data: https://hpcforge.org/plugins/mediawiki/wiki/mcbench/index.php/stream-pilatus

On AMD Interlagos (Cray XE6) the measured bandwidth is ~6 GB/s for all loops and both compilers (gnu/47 and intel/12), while on the Intel SandyBridge it is ~13 GB/s for all loops and both compilers. The two sets of SandyBridge results show that single-processor binaries produced by the Intel compiler are faster if the compiler options recommended by Intel are used. Known issue with the Intel compiler: http://software.intel.com/en-us/articles/hpcc-stream-performance-loss-with-the-11-0-compiler/
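
As a worked example of the sizing rule, the sketch below repeats the GNU make line from above and spells out the array-size arithmetic as comments; the stream_c.exe binary name is an assumption based on the standard STREAM Makefile.

# array sizing: N=20000000 doubles per array x 8 bytes = 160 MB per array,
# i.e. at least 4x the 40 MB of combined L3 cache on a two-socket node
make CC=gcc CFLAGS="-DN=20000000 -O3"   # GNU build, as above
./stream_c.exe                          # assumed binary name from the standard STREAM Makefile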

SPH-flow

SPH-flow is a hybrid MPI/OpenMP code that uses the HDF5 library for I/O. Get the source code from http://www.sph-flow.com/, choose the Programming Environment, and load hdf5/1.8.9. The optimization flags are the following:

GNU: -O3 -march=corei7-avx
Intel: -O3 -mavx -shared-intel -mcmodel=medium
PGI: -O3 -tp=sandybridge-64 -Mvect=simd:256 -mcmodel=medium

Simulation setup and results

The walltimes (in s) shown here correspond to pure MPI jobs only, using 32 cores per node on 2 up to 20 nodes of the SandyBridge and 4 steps of dambreak.ini. The best performance is obtained with the Intel compiler, average performance with GNU, and the worst performance with PGI. The difference between the fastest (Intel) and the slowest (PGI) executable is considerable (~50% on 20 nodes).
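
A run sketch for the pure-MPI case is given below. The executable name and the way the input file is passed are assumptions; the module names, node and core counts, and the dambreak.ini input come from the description above.

# hypothetical pure-MPI run of the dambreak test case on 20 nodes (32 cores per node)
module load PrgEnv-intel/12.1.2.273 mvapich2/1.8 hdf5/1.8.9
mpirun -np 640 ./sph_flow dambreak.ini   # 4 time steps; walltime compared across compilers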

Conclusions

The benchmarks reported here are meant to guide users on Pilatus towards the compiler that builds the executable with the best performance. Among the applications we have tested, the best performance is in general achieved with the Intel and GNU compilers. The PGI compiler performs worse than the other two most of the time, with the rare exceptions reported above. The performance of the default versions of some applications installed on Monte Rosa is presented in this report as well. These data are included to give the user an idea of the performance that might be obtained using the default modules available on the two systems, without attempting application-specific optimizations. However, the present comparison cannot strictly measure a performance difference: in some cases different versions were compared on the two machines, e.g. FFTW-3.3.2 on Pilatus vs. FFTW-3.3.0.1 on Rosa, and the ScaLAPACK source code downloaded from www.netlib.org vs. the cray-libsci version of the ScaLAPACK routines.