Case Study on Productivity and Performance of GPGPUs
Sandra Wienke (wienke@rz.rwth-aachen.de)
ZKI Arbeitskreis Supercomputing, April 2012
Rechen- und Kommunikationszentrum (RZ)
RWTH GPU-Cluster
- 56 Nvidia Quadro 6000 (Fermi), 2 GPUs per host
- Host: 12-core Westmere CPU
- High utilization of resources:
  - Daytime: VR: new CAVE (48 GPUs); HPC: interactive software development (8 GPUs)
  - Nighttime: HPC: processing of GPGPU compute jobs (54-56 GPUs)
(Photo: C. Iwainsky; CAVE, VR, RWTH Aachen, since 2004)
Agenda
- Introduction
- Real-world Applications
- Performance
- Productivity
- Conclusion & Outlook
Introduction
Today's GPUs are usable for scientific applications:
- Double precision computations
- ECC
Performance
- Speedup over serial/parallel CPU version (including data transfers)
- (Performance per Watt)
Programmability, productivity
- Modified lines of code
- Manpower
- Ratio of development effort to performance
Introduction
Investigation of 2 real-world software packages using different programming models:
- OpenMP (C/Fortran): by the OpenMP ARB; industry standard, shared-memory programming, CPUs
- CUDA (C/C++): by NVIDIA; GPU programming model, NVIDIA GPUs
- OpenCL (C): by the Khronos Group; open standard, heterogeneous programming, CPU/GPU/
- PGI Accelerator Model (C/Fortran): by PGI; directive-based GPU programming model, NVIDIA GPUs
- OpenACC (C/Fortran): by PGI, Cray, CAPS, NVIDIA; directive-based accelerator programming model, industry standard published in Nov. 2011, NVIDIA GPUs (currently)
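As a minimal illustration of the directive-based models listed above (a sketch with assumed names, not code from the talk): the loop itself stays ordinary C, and an accelerator compiler offloads it based on the pragma, while any other compiler simply ignores the pragma and runs the loop on the host.

```c
/* Sketch (hypothetical function and data, not from the slides):
 * directive-based GPU programming in the PGI Accelerator / OpenACC style.
 * An accelerator compiler generates GPU code for the annotated loop;
 * other compilers ignore the unknown pragma, so the code stays portable. */
void saxpy(int n, float a, const float *x, float *y) {
#pragma acc kernels loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

This low intrusiveness (one directive instead of a rewritten kernel plus explicit data transfers) is what the later productivity numbers for the PGI Accelerator model reflect.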
Real-World Applications: KegelSpan
- 3D simulation of the bevel gear cutting process [1]
- Computes key values (e.g. chip thickness) to analyze tool load and tool wear
- Fortran code (chip thickness computation): loop nest, dependencies in inner loop (minimum computation)
- Implementations:
  - simple: outer loop in parallel on threads, inner loop serially (CPU, GPU); + optimized data access pattern, + reduction of data transfers (GPU)
  - vec: simple + code restructuring for auto-vectorization (CPU)
  - smem: simple + storing input/intermediate data in shared memory (GPU)
(Source: BMW, ZF, Klingelnberg)

[1] C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, volume 2108.2 of VDI-Berichte, pages 1381-1384, Düsseldorf, 2010. VDI Verlag.
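The "simple" scheme above can be sketched as follows (a hedged illustration with hypothetical names and data layout; the real chip-thickness computation is Fortran and more involved):

```c
#include <float.h>

/* Sketch of the "simple" KegelSpan parallelization (hypothetical layout):
 * the outer loop is distributed across threads, while the inner minimum
 * computation carries a dependency on `best` and therefore runs serially. */
void min_per_row(int n, int m, const double *d, double *out) {
#pragma omp parallel for            /* outer loop: one row per thread */
    for (int i = 0; i < n; ++i) {
        double best = DBL_MAX;
        for (int j = 0; j < m; ++j) /* inner loop: serial min reduction */
            if (d[i * m + j] < best)
                best = d[i * m + j];
        out[i] = best;
    }
}
```

The serial inner minimum is exactly the dependency the slide mentions: it prevents straightforward vectorization and motivates the separate "vec" and "smem" variants.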
Real-World Applications: NINA
- Software [2] for the solution of large-scale Neuromagnetic INverse problems
- Matlab code (main program), C code (objective function, 1st- and 2nd-order derivatives)
- Matrix-vector multiplications, vector operations
- Implementations:
  - simple: outer loop in parallel on threads, inner loop serially (CPU)
  - blocked: simple + blocked matrix-vector multiplication (CPU, GPU)
  - vec: blocked + code restructuring for auto-vectorization (CPU)
  - l2p: blocked + two levels of parallelism (GPU)
  - advanced: blocked + asynchronous data transfers, asynchronous kernel execution, pinned memory, specification of constant values as preprocessor macros (GPU)
- Loop unrolling important (pragma vs. manual unrolling)

[2] M. Bücker, R. Beucker, and A. Rupp. Parallel Minimum p-norm Solution of the Neuromagnetic Inverse Problem for Realistic Signals Using Exact Hessian-Vector Products. SIAM Journal on Scientific Computing, 30(6):2905-2921, 2008.
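A minimal sketch of the "blocked" matrix-vector multiplication (assumed row-major layout and block size; not the NINA source): processing the matrix in column blocks keeps a block of x resident in cache (or, on the GPU, in shared memory) while many rows reuse it.

```c
/* Sketch of a blocked matrix-vector product y = A*x (row-major A,
 * hypothetical block size BLK; not the NINA source). Each column block
 * of x is reused across all rows before moving to the next block,
 * improving locality for this memory-bound operation. */
#define BLK 64

void matvec_blocked(int n, int m, const double *A,
                    const double *x, double *y) {
    for (int i = 0; i < n; ++i)
        y[i] = 0.0;
    for (int jb = 0; jb < m; jb += BLK) {       /* column blocks */
        int jend = (jb + BLK < m) ? jb + BLK : m;
        for (int i = 0; i < n; ++i) {           /* rows reuse x[jb..jend) */
            double sum = 0.0;
            for (int j = jb; j < jend; ++j)
                sum += A[i * m + j] * x[j];
            y[i] += sum;
        }
    }
}
```

The "l2p" GPU variant then adds a second level of parallelism, e.g. splitting the per-row dot product across threads as well.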
Performance Setup

Model            GPU                        Host                                  Compiler
OpenMP, Serial   -                          Intel Westmere EP 12-core processor,  Intel 12.1.3
                                            Scientific Linux 6.1
CUDA             NVIDIA Tesla C2050,        Intel Westmere 4-core processor,      GCC 4.4.5
OpenCL           ECC on,                    Scientific Linux 6.1                  Intel 12.1.2
PGI Accelerator  CUDA Toolkit 4.0                                                 PGI 11.10

1 Experimental system from Cray; 2 comprises an early implementation of OpenACC.
Some results were removed as they have not been published yet.
Performance: KegelSpan
[Bar chart: speedup of the KegelSpan implementations over the serial version (y-axis up to 80x), with panels for Single Precision and Double Precision; reported values range from the 1.0x baseline up to 75.3x.]
Productivity: KegelSpan

Added + modified lines of code (host/kernel); original serial version: ~150 kernel code lines:
  CUDA (smem):       152/58 = 210
  OpenCL (smem):     183/58 = 241
  PGI Acc (simple):   84/14 =  98
  OpenMP (simple):     -/4  =   4

Manpower:
                   estimated 1st-time effort*   estimated 2nd-time effort**
  CUDA             30 days                      4.5 days
  OpenCL           40 days                      5 days
  PGI Acc          25 days                      1.5 days
  OpenMP           5 days                       0.5 days

*  Effort to understand the architecture + programming model, to develop (+ code reorganization), to debug
** Effort to just develop the application (assuming knowledge of architecture + programming model)
Conclusion: KegelSpan
- No real surprises
- CUDA implementation 2x faster than highly-vectorized OpenMP code
  - Vectorization for DP?
  - Both: a lot of code restructuring (high effort)
  - Memory bound?
- CUDA speedup over the simple OpenMP version: 6.3x (SP) and 3.5x (DP)
- PGI Accelerator: good ratio of effort to performance, especially in DP
Conclusion
Programming model matters; many assumptions hold:
- Low-level GPU programming models (CUDA, OpenCL): good performance, most effort
- Directive-based GPU programming models (PGI Acc or OpenACC):
  - Often good ratio of effort to performance (still potential)
  - Essential for further growth and acceptance of accelerators
  - Important step towards (OpenMP) standardization (OpenMP for Accelerators)
- (Auto-)Vectorization on CPUs:
  - Gets more important in the future (e.g. AVX on Sandy Bridge)
  - Performance benefit for double precision floating point operations uncertain
  - Increasing development effort, but better understanding of the architecture
Outlook
- Advance of programming models: OpenMP for Accelerators; better compilers for auto-vectorization?
- Advance of computer architectures: NVIDIA Kepler, Intel MIC
- Aim: comprehensive TCO calculation: manpower, performance (runtime), power consumption

Thank you for your attention!