Case Study on Productivity and Performance of GPGPUs




Case Study on Productivity and Performance of GPGPUs
Sandra Wienke (wienke@rz.rwth-aachen.de)
ZKI Arbeitskreis Supercomputing, April 2012
Rechen- und Kommunikationszentrum (RZ)

RWTH GPU-Cluster
- 56 Nvidia Quadro 6000 (Fermi), 2 GPUs per host
- Host: 12-core Westmere CPU
- High utilization of resources
- Daytime: VR in the new CAVE (48 GPUs); HPC: interactive software development (8 GPUs)
- Nighttime: HPC processing of GPGPU compute jobs (54-56 GPUs)
(Foto: C. Iwainsky; CAVE and VR at RWTH Aachen since 2004)

Agenda
- Introduction
- Real-world Applications
- Performance
- Productivity
- Conclusion & Outlook

Introduction
- Today's GPUs are usable for scientific applications: double-precision computations, ECC
- Performance: speedup over the serial/parallel CPU version (including data transfers); (performance per Watt)
- Programmability, productivity: modified lines of code; manpower; ratio of development effort to performance

Introduction
Investigation of 2 real-world software packages using different programming models:
- OpenMP (C/Fortran): by the OpenMP ARB; industry standard, shared-memory programming, CPUs
- CUDA (C/C++): by NVIDIA; GPU programming model, NVIDIA GPUs
- OpenCL (C): by the Khronos Group; open standard, heterogeneous programming, CPU/GPU/...
- PGI Accelerator Model (C/Fortran): by PGI; directive-based GPU programming model, NVIDIA GPUs
- OpenACC (C/Fortran): by PGI, Cray, CAPS, NVIDIA; directive-based accelerator programming model, industry standard published in Nov. 2011, NVIDIA GPUs (currently)

Real-World Applications: KegelSpan
- 3D simulation of the bevel gear cutting process [1]
- Computes key values (e.g. chip thickness) to analyze tool load and tool wear
- Fortran code (chip thickness computation): loop nest with dependencies in the inner loop (minimum computation)
- Implementations:
  - simple: outer loop in parallel on threads, inner loop serial (CPU, GPU); + optimized data access pattern and reduced data transfers (GPU)
  - vec: simple + code restructuring for auto-vectorization (CPU)
  - smem: simple + storing input/intermediate data in shared memory (GPU)
(Source: BMW, ZF, Klingelnberg)

[1] C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, volume 2108.2 of VDI-Berichte, pages 1381-1384, Düsseldorf, 2010. VDI Verlag.

Real-World Applications: NINA
- Software [2] for the solution of large-scale Neuromagnetic INverse problems
- Matlab code (main program), C code (objective function, 1st- and 2nd-order derivatives)
- Main operations: matrix-vector multiplications, vector operations
- Implementations:
  - simple: outer loop in parallel on threads, inner loop serial (CPU)
  - blocked: simple + blocked matrix-vector multiplication (CPU, GPU)
  - vec: blocked + code restructuring for auto-vectorization (CPU)
  - l2p: blocked + two levels of parallelism (GPU)
  - advanced: blocked + asynchronous data transfers, asynchronous kernel execution, pinned memory, specification of constant values as preprocessor macros (GPU)
- Loop unrolling is important (pragma vs. manual unrolling)

[2] M. Bücker, R. Beucker, and A. Rupp. Parallel Minimum p-norm Solution of the Neuromagnetic Inverse Problem for Realistic Signals Using Exact Hessian-Vector Products. SIAM Journal on Scientific Computing, 30(6):2905-2921, 2008.

Performance Setup
- OpenMP, serial: host Intel Westmere EP 12-core processor, Scientific Linux 6.1; compiler Intel 12.1.3
- CUDA, OpenCL, PGI Accelerator: GPU NVIDIA Tesla C2050 (ECC on), CUDA Toolkit 4.0; host Intel Westmere 4-core processor, Scientific Linux 6.1; compilers GCC 4.4.5 (CUDA), Intel 12.1.2 (OpenCL), PGI 11.10 (PGI Accelerator)

(1) Experimental system from Cray. (2) Comprises an early implementation of OpenACC.
Some results were removed as they have not been published yet.

Performance KegelSpan
[Bar chart: speedup of the KegelSpan implementations over the serial baseline (1.0), on a 0-80 scale, for single and double precision. Single-precision bars include 75.3, 74.4, 49.3, 41.4, 36.2, 33.6, 11.9, 6.3, 6.2, and 4.1; double-precision bars include 11.9, 3.5, 3.0, and 2.8.]

Productivity KegelSpan

Added + modified lines of code (host/kernel); the original serial version has ~150 kernel code lines:
- CUDA (smem): 152/58 = 210
- OpenCL (smem): 183/58 = 241
- PGI Acc (simple): 84/14 = 98
- OpenMP (simple): -/4 = 4

Manpower (estimated 1st-time effort* / estimated 2nd-time effort**):
- CUDA: 30 days / 4.5 days
- OpenCL: 40 days / 5 days
- PGI Acc: 25 days / 1.5 days
- OpenMP: 5 days / 0.5 days

* Effort to understand the architecture + programming model, to develop (+ code reorganization), and to debug
** Effort to just develop the application (assuming knowledge of architecture + programming model)

Conclusion KegelSpan
- No real surprises
- CUDA implementation 2x faster than the highly-vectorized OpenMP code (vectorization for DP?)
- Both: a lot of code restructuring (high effort); memory bound?
- CUDA speedup over the simple OpenMP version: 6.3x (SP) and 3.5x (DP)
- PGI Accelerator: good ratio of effort to performance, especially in DP

Conclusion
- Programming model matters; many assumptions hold:
  - Low-level GPU programming models (CUDA, OpenCL): good performance, most effort
  - Directive-based GPU programming models (PGI Acc or OpenACC): often a good ratio of effort to performance (still potential); essential for further growth and acceptance of accelerators; an important step towards (OpenMP) standardization (OpenMP for Accelerators)
  - (Auto-)vectorization on CPUs: becomes more important in the future (e.g. AVX on Sandy Bridge); performance benefit for double-precision floating-point operations uncertain; increases development effort, but gives a better understanding of the architecture

Outlook
- Advances in programming models: OpenMP for Accelerators; better compilers for auto-vectorization?
- Advances in computer architectures: NVIDIA Kepler, Intel MIC
- Aim: a comprehensive TCO calculation covering manpower, performance (runtime), and power consumption

Thank you for your attention!