CUDA programming on NVIDIA GPUs




p. 1/21 CUDA programming on NVIDIA GPUs. Mike Giles (mike.giles@maths.ox.ac.uk), Oxford University Mathematical Institute, Oxford-Man Institute for Quantitative Finance, Oxford e-Research Centre

p. 2/21 Overview: hardware view; software view

p. 3/21 Hardware view. At the top level, there is a PCIe graphics card with a many-core GPU sitting inside a standard PC/server with one or two multicore CPUs. (Diagram: motherboard with DDR2 memory; graphics card with GDDR3 memory.)

p. 4/21 Hardware view. At the GPU level, the basic building block is a multiprocessor with:
- 8 cores
- 8192 registers (16384 on the newest chips)
- 16KB of shared memory
- 8KB cache for constants held in graphics memory
- 8KB cache for textures held in graphics memory
Different chips have different numbers of these:

product     multiprocessors   bandwidth   cost
8800 GT     14                58 GB/s     100
9800 GT     14                58 GB/s     100
9800 GX2    32                128 GB/s    250

p. 5/21 Hardware view. The key hardware feature is that the 8 cores in a multiprocessor are SIMD (Single Instruction Multiple Data) cores:
- all cores execute the same instructions simultaneously, but with different data
- similar to vector computing on CRAY supercomputers
- minimum of 4 threads per core, so we end up with a minimum of 32 threads all doing the same thing at (almost) the same time
- natural for graphics processing and much scientific computing
- SIMD is also a natural choice for massively multicore chips, to simplify each core

p. 6/21 Software view. At the top level, we have a master process which runs on the CPU and performs the following steps:
1. initialises card
2. allocates memory in host and on device
3. copies data from host to device memory
4. launches multiple copies of execution kernel on device
5. copies data from device memory to host
6. repeats 2-4 as needed
7. de-allocates all memory and terminates
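The steps above can be sketched in host code as follows; the kernel name, sizes, and data are placeholders, so this is a minimal illustration rather than a complete program.

```cuda
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void my_kernel(float *d_x);        // placeholder kernel

int main(void)
{
    int    n     = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *) malloc(bytes);     // 2. allocate host memory
    float *d_x;
    cudaMalloc((void **) &d_x, bytes);        //    ... and device memory

    cudaMemcpy(d_x, h_x, bytes,
               cudaMemcpyHostToDevice);       // 3. copy host -> device
    my_kernel<<<n/256, 256>>>(d_x);           // 4. launch kernel copies
    cudaMemcpy(h_x, d_x, bytes,
               cudaMemcpyDeviceToHost);       // 5. copy device -> host

    cudaFree(d_x);                            // 7. de-allocate and terminate
    free(h_x);
    return 0;
}
```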

p. 7/21 Software view. At a lower level, within the GPU:
- each copy of the execution kernel executes on one of the multiprocessors
- if the number of copies exceeds the number of multiprocessors, then more than one will run at a time on each multiprocessor if there are enough registers and shared memory, and the others will wait in a queue and execute later
- all threads within one copy can access local shared memory, but can't see what the other copies are doing (even if they are on the same multiprocessor)
- there are no guarantees on the order in which the copies will execute

p. 8/21 CUDA is NVIDIA's program development environment:
- based on C with some extensions
- FORTRAN support coming in the near future
- a multicore x86 back-end is also coming soon, to make CUDA code portable
- may evolve into the proposed OpenCL standard, which may also be supported by AMD/ATI
- lots of example code and good documentation
- 2-4 week learning curve for those with experience of OpenMP and MPI programming
- growing user community, active on NVIDIA forums

p. 9/21 At the host code level, there are library routines for:
- memory allocation on the graphics card
- data transfer to/from graphics memory: constants, texture arrays (useful for lookup tables), ordinary data
- error-checking
- timing
There is also a special syntax for launching multiple copies of the kernel process on the GPU.
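As a sketch of the error-checking and timing routines, using the CUDA runtime API (the kernel launch itself is left as a placeholder):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

void check_and_time(void)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // ... kernel launch goes here ...
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaError_t err = cudaGetLastError();     // error-checking
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    float ms;
    cudaEventElapsedTime(&ms, start, stop);   // timing in milliseconds
    printf("kernel time: %f ms\n", ms);
}
```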

p. 10/21 In its simplest form it looks like:

kernel_routine<<<gridDim, blockDim>>>(args);

where
- gridDim is the number of copies of the kernel (the "grid" size)
- blockDim is the number of threads within each copy (the "block" size)
- args is a limited number of arguments, usually mainly pointers to arrays in graphics memory
The more general form allows gridDim and blockDim to be 2D or 3D, to simplify application programs.

p. 11/21 At the lower level, when one copy of the kernel is started on a multiprocessor it is executed by a number of threads, each of which knows about:
- some variables passed as arguments
- pointers to arrays in device memory (also arguments)
- constants held in device memory
- uninitialised shared memory, and private registers/local variables
- some special variables:
    gridDim    size (or dimensions) of grid of blocks
    blockIdx   index (or 2D/3D indices) of block
    blockDim   size (or dimensions) of each block
    threadIdx  index (or 2D/3D indices) of thread
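A minimal kernel sketch showing how these special variables are typically combined into a global thread index (the kernel itself is a made-up example):

```cuda
// Each thread computes its own global index from blockIdx,
// blockDim and threadIdx, then works on one array element.
__global__ void scale(float *x, float a, int n)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid < n)              // guard against a partial final block
        x[tid] = a * x[tid];
}
```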

p. 12/21 The kernel code involves operations such as:
- read from / write to arrays in device memory
- write to / read from arrays in shared memory
- read constants
- use a texture array for a lookup table
- perform integer and floating point operations on data held in registers
The reading and writing is done implicitly: data is automatically read into a register when needed for an operation, and automatically written back when there is an assignment.

p. 13/21 Simple Monte Carlo example:
- the simplest possible example involves the calculation of a large number of paths
- each path calculation is completely independent of the others
- each should have its own set of random numbers; still waiting for NVIDIA to develop a random number generation library, so for now they are all set to 0.3
- path_calc calculates the paths, and portfolio calculates the portfolio value, which is then written into a device array
- the host code copies the results back and averages them
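A sketch of what the Monte Carlo kernel might look like: path_calc and portfolio are the routines named above, but their signatures and the path length here are assumptions.

```cuda
#define NPATH_STEPS 64                          // assumed path length

__device__ void  path_calc(float *path);        // assumed signatures
__device__ float portfolio(const float *path);

// One thread per path: compute the path, then write the
// portfolio value into the device array d_v.
__global__ void mc_kernel(float *d_v)
{
    int   tid = threadIdx.x + blockIdx.x * blockDim.x;
    float path[NPATH_STEPS];

    path_calc(path);
    d_v[tid] = portfolio(path);
}
```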

p. 14/21 Monte Carlo LIBOR application:
- ideal because it is trivially parallel: each path calculation is independent of the others
- timings in seconds for 96,000 paths, with 1500 blocks each of 64 threads, so one thread per path
- executed 5 blocks at a time on each multiprocessor, so 40 active threads per core
- remember: the CUDA results are for single precision

                             time
original code (VS C++)       26.9
CUDA code (8800GTX)           0.2

p. 15/21 CUDA multithreading. Lots of active threads is the key to high performance:
- no "context switching"; each thread has its own registers, which limits the number of active threads
- threads execute in warps of 32 threads per multiprocessor (4 per core)
- execution alternates between active warps, with warps becoming temporarily inactive when waiting for data

p. 16/21 CUDA multithreading:
- for each thread, one operation completes long before the next starts; this avoids the complexity of pipeline overlaps which can limit the performance of modern processors
  (diagram: operations 1-5 of several warps interleaved along a time axis)
- memory access from device memory has a latency of 400-600 cycles; with 40 threads this is equivalent to 10-15 operations, and can be managed by the compiler

p. 17/21 What have I not discussed?
- memory transfer "coalescence": bandwidth to device memory is maximised by a half-warp of 16 threads loading 16 contiguous floats or ints with correct alignment; this is often the most important/tedious aspect of getting good performance
- use of shared memory: essential for PDE applications, to minimise the amount of data loaded from device memory
- use of textures: rather specialised; I've used them only for a random-access lookup table, to avoid the penalty of non-coalesced transfers
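To illustrate the coalescence point, a contrived pair of kernels (the names are mine):

```cuda
// Coalesced: consecutive threads in a half-warp read
// 16 contiguous floats, so the loads combine into a
// single wide memory transaction.
__global__ void copy_coalesced(float *out, const float *in)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    out[tid] = in[tid];
}

// Non-coalesced: a stride between threads scatters the
// loads across memory, so each becomes a separate transfer.
__global__ void copy_strided(float *out, const float *in, int stride)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    out[tid] = in[tid * stride];
}
```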

p. 18/21 Finite difference application:
- recently started work on simple 3D finite difference applications: Jacobi iteration for the discrete Laplace equation, CG iteration for the discrete Laplace equation, ADI time-marching
- conceptually straightforward for someone who is used to partitioning grids for MPI implementations
- each multiprocessor works on a block of the grid
- threads within each block read data into local shared memory, do the calculations in parallel, and write the new data back to main device memory

p. 19/21 3D finite difference implementation:
- insufficient shared memory to hold a whole 3D block, so hold 3 working planes at a time (halo depth of 1, just one Jacobi iteration at a time)
- key steps in the kernel code:
    load in the k=0 z-plane (including x and y halos)
    loop over all z-planes:
        load the k+1 z-plane (over-writing the k-2 plane)
        process the k z-plane
        store the new k z-plane
- 50x speedup relative to a single Xeon core, compared to a 5x speedup using OpenMP with 8 cores
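The plane-cycling loop can be sketched as follows; BX, BY and the helper routines load_plane/process_and_store_plane are assumptions standing in for the real kernel:

```cuda
#define BX 16
#define BY 16

__device__ void load_plane(float plane[BY+2][BX+2], int k);   // assumed helpers
__device__ void process_and_store_plane(float (*u)[BY+2][BX+2], int k);

// Three working z-planes cycle through shared memory: on each
// iteration the k+1 plane over-writes the k-2 plane, so only
// 3*(BY+2)*(BX+2) floats of shared memory are ever needed.
__global__ void jacobi3d(int NZ)
{
    __shared__ float u[3][BY+2][BX+2];   // 3 planes incl. x/y halos

    load_plane(u[0], 0);                 // k = 0 z-plane
    load_plane(u[1], 1);

    for (int k = 1; k < NZ-1; k++) {
        load_plane(u[(k+1)%3], k+1);     // over-writes the k-2 plane
        __syncthreads();
        process_and_store_plane(u, k);   // new k plane to device memory
        __syncthreads();
    }
}
```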

p. 20/21 Final words. Will GPUs have real impact?
- I think they're the most exciting development since the initial development of PVM and Beowulf clusters
- they have generated a lot of interest/excitement in academia, and are being used by application scientists, not just computer scientists
- potential for 10-100x speedup, and improvement in GFLOPS/£ and GFLOPS/watt
- effectively a personal cluster in a PC under your desk
- needs work on tools and libraries to simplify the development effort

p. 21/21 Webpages
Wikipedia overviews of GeForce cards:
  en.wikipedia.org/wiki/GeForce_8_Series
  en.wikipedia.org/wiki/GeForce_9_Series
NVIDIA's CUDA homepage:
  www.nvidia.com/object/cuda_home.html
Microprocessor Report article:
  www.nvidia.com/docs/IO/47906/220401_Reprint.pdf
LIBOR test code:
  www.maths.ox.ac.uk/~gilesm/hpc/