Case Study on Productivity and Performance of GPGPUs

1 Case Study on Productivity and Performance of GPGPUs
Sandra Wienke
ZKI Arbeitskreis Supercomputing, April 2012
Rechen- und Kommunikationszentrum (RZ)

2 RWTH GPU-Cluster
- 56 Nvidia Quadro 6000 (Fermi)
  - 2 GPUs per host
  - Host: 12-core Westmere CPU
- High utilization of resources
  - Daytime
    - VR: new CAVE (48 GPUs)
    - HPC: interactive software development (8 GPUs)
  - Nighttime
    - HPC: processing of GPGPU compute jobs (54-56 GPUs)
Photo: C. Iwainsky. CAVE, VR, RWTH Aachen, since 2004.

3 Agenda
- Introduction
- Real-world Applications
- Performance
- Productivity
- Conclusion & Outlook

4 Introduction
- Today's GPUs are usable for scientific applications
  - Double-precision computations
  - ECC
- Performance
  - Speedup over serial/parallel CPU version (including data transfers)
  - (Performance per watt)
- Programmability, productivity
  - Modified lines of code
  - Manpower
  - Ratio of development effort to performance
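The performance metric on this slide, speedup with data transfers included, can be sketched as a small C helper. This is only an illustration: the function name and the split of the GPU time into a kernel part and a transfer part are assumptions, not something the slides specify.

```c
#include <assert.h>

/* Speedup of an accelerated run over a CPU baseline.
 * Host<->device transfer time is charged to the GPU side, following the
 * "including data transfers" convention stated on the slide.
 * Names and the kernel/transfer split are illustrative assumptions. */
double speedup_with_transfers(double t_cpu_s, double t_kernel_s,
                              double t_transfer_s)
{
    assert(t_cpu_s > 0.0 && t_kernel_s > 0.0 && t_transfer_s >= 0.0);
    return t_cpu_s / (t_kernel_s + t_transfer_s);
}
```

For instance, a 10 s CPU run against a 1.5 s kernel plus 0.5 s of transfers gives a speedup of 5, whereas ignoring the transfers would suggest about 6.7.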

5 Introduction
Investigation of 2 real-world software packages using different programming models:
- OpenMP (C/Fortran), by OpenMP ARB: industry standard, shared-memory programming, CPUs
- CUDA (C/C++), by NVIDIA: GPU programming model, NVIDIA GPUs
- OpenCL (C), by Khronos Group: open standard, heterogeneous programming, CPU/GPU/...
- PGI Accelerator Model (C/Fortran), by PGI: directive-based GPU programming model, NVIDIA GPUs
- OpenACC (C/Fortran), by PGI, Cray, CAPS, NVIDIA: directive-based accelerator programming model, industry standard published in Nov. 2011, NVIDIA GPUs (currently)
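To make "directive-based" concrete, here is a minimal sketch of the same loop annotated once with OpenMP (CPU threads) and once with OpenACC (accelerator offload). The axpy kernel and both function names are chosen for illustration only and do not come from the two applications in the study. A compiler without OpenMP or OpenACC support ignores the pragmas and runs the loops serially.

```c
#include <assert.h>
#include <stddef.h>

/* y = alpha * x + y on CPU threads: OpenMP distributes the iterations. */
void axpy_openmp(long n, float alpha, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}

/* Same loop offloaded with OpenACC: the data clauses describe the
 * host<->device transfers (x is copied in, y is copied in and back). */
void axpy_openacc(long n, float alpha, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (long i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```

In both variants the loop body stays untouched; only the annotation changes. A CUDA or OpenCL version would instead move the body into a separate kernel and add explicit host code for memory management and launches.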

6 Real-World Applications: KegelSpan
- 3D simulation of the bevel gear cutting process [1]
- Computes key values (among others, chip thickness) to analyze tool load and tool wear
- Fortran code (chip thickness computation)
  - Loop nest
  - Dependencies in inner loop (minimum computation)
- Implementation
  - simple: outer loop in parallel on threads, inner loop serially (CPU, GPU) + optimized data access pattern + reduction of data transfers (GPU)
  - vec: simple + code restructuring for auto-vectorization (CPU)
  - smem: simple + storing input/intermediate data in shared memory (GPU)
Image source: BMW, ZF, Klingelnberg

[1] C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, VDI-Berichte, Düsseldorf. VDI Verlag.
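The "simple" strategy (outer loop in parallel, inner loop serial because it carries a running minimum) can be sketched in C as follows. The original code is Fortran and is not shown in the slides, so the array layout, the function name and the use of a plain distance matrix are all hypothetical stand-ins for the chip thickness computation.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* For each of n_points outer items, take the minimum over n_cand
 * candidates. dist is a hypothetical row-major n_points x n_cand array. */
void min_per_point(const double *dist, size_t n_points, size_t n_cand,
                   double *out)
{
    assert(dist != NULL && out != NULL);
    /* Outer iterations are independent, so they can run on threads. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n_points; ++i) {
        double best = DBL_MAX;
        /* Inner loop stays serial: each step depends on the running minimum. */
        for (size_t j = 0; j < n_cand; ++j) {
            double d = dist[(size_t)i * n_cand + j];
            if (d < best)
                best = d;
        }
        out[i] = best;
    }
}
```

On a GPU the same decomposition would map one outer iteration to one thread, which is why the slide lists the data access pattern and shared memory as the follow-up optimizations.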

7 Real-World Applications: NINA
- Software [2] for the solution of Neuromagnetic INverse large-scale problems
- Matlab code (main program), C code (objective function, first- and second-order derivatives)
- Matrix-vector multiplications, vector operations
- Implementation
  - simple: outer loop in parallel on threads, inner loop serially (CPU)
  - blocked: simple + blocked matrix-vector multiplication (CPU, GPU)
  - vec: blocked + code restructuring for auto-vectorization (CPU)
  - l2p: blocked + two levels of parallelism (GPU)
  - advanced: blocked + asynchronous data transfer, asynchronous kernel execution, pinned memory, specification of constant values as preprocessor macros (GPU)
- Loop unrolling is important (pragma vs. manual unrolling)
Image source: [2]

[2] M. Bücker, R. Beucker, and A. Rupp. Parallel Minimum p-norm Solution of the Neuromagnetic Inverse Problem for Realistic Signals Using Exact Hessian-Vector Products. SIAM Journal on Scientific Computing, 30(6), 2008.
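A blocked matrix-vector product, as named in the "blocked" variant, might look like the sketch below. The slides do not describe the blocking scheme that was used, so column blocking, the `block` parameter and the function name are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* y = A * x for a row-major rows x cols matrix, processed in column
 * blocks so that one slice of x is reused across all rows before the
 * next slice is touched. The blocking scheme is an assumption. */
void matvec_blocked(const double *a, const double *x, double *y,
                    size_t rows, size_t cols, size_t block)
{
    assert(a != NULL && x != NULL && y != NULL && block > 0);
    for (size_t i = 0; i < rows; ++i)
        y[i] = 0.0;
    for (size_t jb = 0; jb < cols; jb += block) {
        size_t jend = (jb + block < cols) ? jb + block : cols;
        /* Rows are independent: outer loop in parallel, inner loop serial. */
        #pragma omp parallel for
        for (long i = 0; i < (long)rows; ++i) {
            double sum = 0.0;
            for (size_t j = jb; j < jend; ++j)
                sum += a[(size_t)i * cols + j] * x[j];
            y[i] += sum;
        }
    }
}
```

A suitable block size depends on the cache (CPU) or shared memory (GPU) capacity and would have to be tuned per platform.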

8 Performance Setup
- OpenMP, serial
  - GPU: none
  - Host: Intel Westmere EP 12-core processor, Scientific Linux 6.1
  - Compiler: Intel
- CUDA, OpenCL, PGI Accelerator
  - GPU: NVIDIA Tesla C2050, ECC on, CUDA Toolkit 4.0
  - Host: Intel Westmere 4-core processor, Scientific Linux 6.1
  - Compiler: GCC (CUDA), Intel 12.1.2 (OpenCL), PGI (PGI Accelerator)
- Experimental system from Cray: comprises an early implementation of OpenACC
Some results were removed as they have not been published yet.

9 Performance KegelSpan
[Figure: bar chart of speedup, single precision and double precision]

10 Productivity KegelSpan

Added + modified lines of code (original serial version: ~150 kernel code lines)

                 CUDA (smem)   OpenCL (smem)   PGI Acc (simple)   OpenMP (simple)
Host/kernel      152/58        183/58          84/14              -/4
Total            210           241             98                 4

Manpower

                               CUDA       OpenCL    PGI Acc    OpenMP
Estimated 1st-time effort*     30 days    40 days   25 days    5 days
Estimated 2nd-time effort**    4.5 days   5 days    1.5 days   0.5 days

*  Effort to understand the architecture and programming model, to develop (including code reorganization), and to debug
** Effort to just develop the application (assuming knowledge of the architecture and programming model)
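The figures on this slide can be cross-checked with a few lines of C. The table below copies the slide's numbers; the struct, the function names and the idea of reading "1st-time minus 2nd-time effort" as a learning overhead are assumptions added for illustration. The missing OpenMP host count ("-") is stored as 0.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *model;
    int host_loc;        /* added + modified host lines */
    int kernel_loc;      /* added + modified kernel lines */
    double first_days;   /* estimated 1st-time effort */
    double second_days;  /* estimated 2nd-time effort */
} effort_entry;

/* Values as given on the slide; "-" for OpenMP host lines is stored as 0. */
const effort_entry kegelspan_effort[4] = {
    {"CUDA (smem)",      152, 58, 30.0, 4.5},
    {"OpenCL (smem)",    183, 58, 40.0, 5.0},
    {"PGI Acc (simple)",  84, 14, 25.0, 1.5},
    {"OpenMP (simple)",    0,  4,  5.0, 0.5},
};

/* Total added + modified lines for one programming model. */
int total_loc(const effort_entry *e)
{
    assert(e != NULL);
    return e->host_loc + e->kernel_loc;
}

/* Days spent beyond pure development the first time around
 * (one possible reading of the gap between the two estimates). */
double learning_overhead_days(const effort_entry *e)
{
    assert(e != NULL);
    return e->first_days - e->second_days;
}
```

With these numbers the totals reproduce the slide (210, 241, 98 and 4 lines), and the gap between first-time and second-time effort is 25.5 days for CUDA against 4.5 days for OpenMP.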

11 Conclusion KegelSpan
- No real surprises
- CUDA implementation 2x faster than highly vectorized OpenMP code
- Vectorization for DP?
- Both: a lot of code restructuring (high effort)
- Memory bound?
- CUDA speedup over simple OpenMP version: 6.3x (SP) and 3.5x (DP)
- PGI Accelerator: good ratio of effort to performance, especially in DP

12 Conclusion
Programming model matters; many assumptions hold.
- Low-level GPU programming models (CUDA, OpenCL)
  - Good performance
  - Most effort
- Directive-based GPU programming model (PGI Acc or OpenACC)
  - Often good ratio of effort to performance (still potential)
  - Essential for further growth and acceptance of accelerators
  - Important step towards (OpenMP-) standardization (OpenMP for Accelerators)
- (Auto-) Vectorization on CPUs
  - Gets more important in future (e.g. AVX on Sandy Bridge)
  - Performance benefit for double-precision floating-point operations uncertain
  - Increasing development effort, but better understanding of architecture

13 Outlook
- Advance of programming models
  - OpenMP for Accelerators
  - Better compiler for auto-vectorization?
- Advance of computer architectures
  - NVIDIA Kepler
  - Intel MIC
- Aim: comprehensive TCO calculation
  - Manpower
  - Performance (runtime)
  - Power consumption
Thank you for your attention!
