Programming GPUs with CUDA Max Grossman Department of Computer Science Rice University johnmc@rice.edu COMP 422 Lecture 23 12 April 2016
Why GPUs? Two major trends: GPU performance is pulling away from traditional processors; availability of general (non-graphics) programming interfaces. GPU in every PC and workstation: massive volume, potentially broad impact. Figure Credits: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 3.1 2
Power and Area Efficiency GPU http://www.realworldtech.com/compute-efficiency-2012/2 3
GPGPU? General Purpose computation using GPU applications beyond 3D graphics typically, data-intensive science and engineering applications Data-intensive algorithms leverage GPU attributes large data arrays, streaming throughput fine-grain single instruction multiple threads (SIMT) parallelism low-latency floating point computation 4
GPGPU Programming in 2005 Stream-based programming model Express algorithms in terms of graphics operations use GPU pixel shaders as general-purpose SP floating point units Directly exploit pixel shaders vertex shaders video memory threads interact through off-chip video memory Example: GPUSort (Govindaraju, Manocha; 2005) Figure Credits: Dongho Kim, School of Media, Soongsil University 5
Fragment from GPUSort
// invert the other half of the bitonic array and merge
glBegin(GL_QUADS);
for (int start = 0; start < num_quads; start++) {
    glTexCoord2f(s + width, 0);        glVertex2f(s, 0);
    glTexCoord2f(s + width/2, 0);      glVertex2f(s + width/2, 0);
    glTexCoord2f(s + width/2, height); glVertex2f(s + width/2, height);
    glTexCoord2f(s + width, height);   glVertex2f(s, height);
    s += width;
}
glEnd();
(Govindaraju, Manocha; 2005) 6
CUDA CUDA = Compute Unified Device Architecture. Software platform for parallel computing on Nvidia GPUs, introduced in 2006; Nvidia's repositioning of GPUs as versatile compute devices. C plus a few simple extensions: write a program for one thread, instantiate it for many parallel threads; familiar language, simple data-parallel extensions. CUDA is a scalable parallel programming model: runs on any number of processors without recompiling. Slide credit: Patrick LeGresley, NVidia 7
NVIDIA Fermi GPU (2010) 3.0B transistors; 16 SMs on die, 1 SM disabled for yield (15 active); 32 CUDA cores per SM; CUDA core = programmable shader; 480 cores total. Figure credit: http://tpucdn.com/reviews/nvidia/geforce_gtx_480_fermi/images/arch.jpg 8
NVIDIA's Fermi GPU Configurable L1 data cache: 48K shared memory / 16K L1 cache, or 48K L1 cache / 16K shared memory. 768K L2 cache. 4 SFU (transcendental math and interpolation) per SM. 16 LD/ST units per SM. 16 DP FMA per clock per SM. GigaThread Engine: executes up to 16 concurrent kernels. Unified address space (thread local, block shared, global) supports C/C++ pointers. ECC support. Figure Credit: http://www.nvidia.com/content/pdf/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi Streaming Multiprocessor 9
NVIDIA GPU Stats Through Fermi Note: Fermi specification here is pre-release. Actual cores = 480 Figure Credit: http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf 10
NVIDIA Kepler GK110 (May 2012) 28nm chip, 7.1B transistors. 15 Streaming Multiprocessors (SMX), each with: 192 CUDA cores (fully pipelined FP and INT units; IEEE 754-2008; fused multiply-add); four warp schedulers for 32-thread groups (warps), 4 warps issue and execute concurrently, 2 inst/warp/cycle; 64 DP FP units; 32 SFU; 32 LD/ST units. 6 64-bit memory controllers. http://www.nvidia.com/content/pdf/kepler/nvidia-kepler-gk110-architecture-whitepaper.pdf 11
Why CUDA? Business rationale opportunity for Nvidia to sell more chips extend the demand from graphics into HPC insurance against uncertain future for discrete GPUs both Intel and AMD integrating GPUs onto microprocessors Technical rationale hides GPU architecture behind the programming API programmers never write directly to the metal insulate programmers from details of GPU hardware enables Nvidia to change GPU architecture completely, transparently preserve investment in CUDA programs simplifies the programming of multithreaded hardware CUDA automatically manages threads 12
CUDA Design Goals Support heterogeneous parallel programming (CPU + GPU) Scale to hundreds of cores, thousands of parallel threads Enable programmer to focus on parallel algorithms not GPU characteristics, programming language, scheduling... 13
CUDA Software Stack for Heterogeneous Computing Figure Credit: NVIDIA CUDA Compute Unified Device Architecture Programming Guide 1.1 14
Key CUDA Abstractions Hierarchy of concurrent threads Lightweight synchronization primitives Shared memory model for cooperating threads 15
Hierarchy of Concurrent Threads Parallel kernels composed of many threads: all threads execute same sequential program; use parallel threads rather than sequential loops. Threads are grouped into thread blocks: threads in block can sync and share memory. Blocks are grouped into grids. Threads and blocks have unique IDs: threadIdx (1D, 2D, or 3D), blockIdx (1D or 2D); simplifies addressing when processing multidimensional data. Slide credit: Patrick LeGresley, NVidia 16
CUDA Programming Example
Computing y = ax + y with a serial loop (host code):
void saxpy_serial(int n, float alpha, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
// invoke serial saxpy kernel
saxpy_serial(n, 2.0, x, y);
Computing y = ax + y in parallel using CUDA (device code):
__global__ void saxpy_parallel(int n, float alpha, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = alpha * x[i] + y[i];
}
// invoke parallel saxpy kernel (256 threads per block) -- host code
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
17
Synchronization and Coordination Threads within a block may synchronize with barriers: ... step 1 ... __syncthreads(); ... step 2 ... Blocks can coordinate via atomic memory operations, e.g. increment shared counter with atomicInc() 18
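A minimal sketch (not from the original slides) of cross-block coordination through atomics: every thread bumps a bin of a global histogram with atomicAdd(), which is safe even when threads from different blocks hit the same bin. The kernel and variable names are hypothetical.

__global__ void histogram(const unsigned char *in, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1);   // atomic read-modify-write in global memory
}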
CPU vs. GPGPU vs. CUDA Comparing the abstract models: CPU, GPGPU, CUDA/GPGPU (figure). Figure Credit: http://www.nvidia.com/docs/io/55972/220401_reprint.pdf 19
CUDA Memory Model Figure credits: Patrick LeGresley, NVidia 20
Memory Model (Continued) Figure credit: Patrick LeGresley, NVidia 21
Memory Access Latencies (FERMI) Registers: each SM has 32K registers; each thread has private registers; max # of registers per thread: 63; latency ~1 cycle; bandwidth ~8,000 GB/s. L1+Shared Memory: on-chip memory that can be used either as L1 cache or to share data among threads in the same thread block; 64 KB, split 48 KB shared / 16 KB L1 or 16 KB shared / 48 KB L1; latency 10-20 cycles; bandwidth ~1,600 GB/s. Local Memory: holds "spilled" registers, arrays. L2 Cache: 768 KB unified L2 cache, shared among the 16 SMs; caches load/store from/to global memory, copies to/from CPU host, and texture requests; L2 implements atomic operations. Global Memory: accessible by all threads as well as host (CPU); high latency (400-800 cycles), but generally cached 22
Minimal Extensions to C Declaration specifiers to indicate where things live. Functions: __global__ void KernelFunc(...); // kernel callable from host, must return void. __device__ float DeviceFunc(...); // function callable on device; no recursion, no static variables within function. __host__ float HostFunc(); // only callable on host. Variables (later slide). Extend function invocation syntax for parallel kernel launch: KernelFunc<<<500, 128>>>(...); // 500 blocks, 128 threads each. Built-in variables for thread identification in kernels: dim3 threadIdx; dim3 blockIdx; dim3 blockDim; 23
Invoking a Kernel Function Call a kernel function with an execution configuration. Any call to a kernel function is asynchronous; explicit synchronization is needed to block: cudaThreadSynchronize() forces the runtime to wait until all preceding device tasks have finished 24
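A minimal sketch (not from the slides) of this asynchronous behavior; my_kernel and do_independent_host_work are hypothetical placeholders.

my_kernel<<<nblocks, 256>>>(d_data, n);   // launch returns immediately to the host
do_independent_host_work();               // overlaps with kernel execution on the GPU
cudaThreadSynchronize();                  // block until all preceding device tasks finish
// now safe to copy results back with cudaMemcpy (which also synchronizes with prior work)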
Example of CUDA Thread Organization Grid as 2D array of blocks; block as 3D array of threads.
__global__ void KernelFunction(...);
dim3 dimBlock(4, 2, 2);
dim3 dimGrid(2, 2, 1);
KernelFunction<<<dimGrid, dimBlock>>>(...);
25
CUDA Variable Declarations __device__ is optional with __local__, __shared__, or __constant__. Automatic variables without any qualifier reside in a register, except arrays, which reside in local memory. Pointers: allocated on the host and passed to the kernel: __global__ void KernelFunc(float *ptr); or address obtained for a global variable: float *ptr = &GlobalVar; 26
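As an illustrative sketch (not from the slides), the fragment below shows where each kind of variable lives; all names are hypothetical.

__constant__ float coeff[16];            // constant memory, readable by all threads
__device__   float GlobalVar;            // global (device) memory

__global__ void KernelFunc(float *ptr)   // ptr: device pointer allocated on the host
{
    __shared__ float tile[256];          // per-block shared memory
    int   idx = threadIdx.x;             // automatic scalar: lives in a register
    float temp[4];                       // automatic array: placed in local memory
    temp[0]   = ptr[idx];
    tile[idx] = temp[0] * coeff[idx % 16] + GlobalVar;
    ptr[idx]  = tile[idx];
}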
Using Per Block Shared Memory Share variables among threads in a block with shared memory:
__shared__ int scratch[BLOCKSIZE];
scratch[threadIdx.x] = arr[threadIdx.x];
// ... compute on scratch values ...
arr[threadIdx.x] = scratch[threadIdx.x];
Communicate values between threads:
scratch[threadIdx.x] = arr[threadIdx.x];
__syncthreads();
int left = (threadIdx.x > 0) ? scratch[threadIdx.x - 1] : 0;  // guard thread 0 against an out-of-bounds read
27
Features Available in GPU Code Special variables for thread identification in kernels: dim3 threadIdx; dim3 blockIdx; dim3 blockDim; Intrinsics that expose specific operations in kernel code: __syncthreads(); // barrier synchronization. Standard math library operations: exponentiation, truncation and rounding, trigonometric functions, min/max/abs, log, quotient/remainder, etc. Atomic memory operations: atomicAdd, atomicMin, atomicAnd, atomicCAS, etc. 28
Runtime Support Memory management for pointers to GPU memory: cudaMalloc(), cudaFree(). Copying from host to/from device, device to device: cudaMemcpy(), cudaMemcpy2D(), cudaMemcpy3D() 29
More Complete Example: Vector Addition kernel code... 30
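The kernel code on this slide did not survive extraction; what follows is a minimal sketch of a typical CUDA vector-addition kernel under that assumption (names are hypothetical).

__global__ void vecAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the partially filled last block
        C[i] = A[i] + B[i];
}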
Vector Addition Host Code 31
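The host code on this slide also did not survive extraction; the sketch below shows the usual allocate / copy / launch / copy-back / free sequence that such a slide presents (not the original code; buffer names are hypothetical).

int n = 1 << 20;
size_t bytes = n * sizeof(float);
float *h_A = (float *) malloc(bytes), *h_B = (float *) malloc(bytes), *h_C = (float *) malloc(bytes);
/* ... initialize h_A and h_B ... */

float *d_A, *d_B, *d_C;
cudaMalloc((void **) &d_A, bytes);                      // allocate device memory
cudaMalloc((void **) &d_B, bytes);
cudaMalloc((void **) &d_C, bytes);

cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);    // host -> device
cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

int threads = 256;
int blocks  = (n + threads - 1) / threads;
vecAdd<<<blocks, threads>>>(d_A, d_B, d_C, n);          // launch kernel

cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);    // device -> host

cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);            // release device memory
free(h_A); free(h_B); free(h_C);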
Extended C Summary 32
Compiling CUDA 33
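The compilation details on this slide did not survive extraction; as an illustrative example (flags and file names are hypothetical), CUDA programs are built with nvcc, which splits host code (handed to the host C/C++ compiler) from device code (compiled to PTX and machine code):

# compile a CUDA source file for a Fermi-class GPU (compute capability 2.0)
nvcc -arch=sm_20 -O2 -o saxpy saxpy.cu

# emit the intermediate PTX for inspection
nvcc -arch=sm_20 -ptx saxpy.cu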
Ideal CUDA programs High intrinsic parallelism e.g. per-element operations Minimal communication (if any) between threads limited synchronization High ratio of arithmetic to memory operations Few control flow statements SIMT execution divergent paths among threads in a block may be serialized (costly) compiler may replace conditional instructions by predicated operations to reduce divergence 34
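An illustrative sketch (not from the slides) of the divergence point: in the first kernel, odd and even lanes of the same 32-thread warp take different branches and the two paths are executed one after the other; the second, branch-free form is a candidate for predication. Kernel names are hypothetical.

// divergent: threads in one warp follow different paths, which are serialized
__global__ void divergent(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) x[i] = x[i] * 2.0f;   // path A
        else            x[i] = x[i] + 1.0f;   // path B
    }
}

// branch-free alternative: the short conditional can become predicated instructions
__global__ void predicated(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = (i % 2 == 0) ? x[i] * 2.0f : x[i] + 1.0f;
}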
CUDA Matrix Multiply: Host Code 35
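The host code on this slide did not survive extraction; below is a hedged reconstruction of the usual driver for a simple width x width matrix multiply (not the original slide code; MatMulKernel and the other names are hypothetical).

void MatMul(const float *A, const float *B, float *C, int width)
{
    size_t bytes = width * width * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **) &d_A, bytes);
    cudaMalloc((void **) &d_B, bytes);
    cudaMalloc((void **) &d_C, bytes);

    cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                                 // 256 threads per block
    dim3 grid((width + 15) / 16, (width + 15) / 16);    // enough blocks to cover the matrix
    MatMulKernel<<<grid, block>>>(d_A, d_B, d_C, width);

    cudaMemcpy(C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}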
CUDA Matrix Multiply: Device Code 36
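The device code on this slide likewise did not survive extraction; a hedged reconstruction of the naive kernel, one thread per output element, follows (not the original slide code).

__global__ void MatMulKernel(const float *A, const float *B, float *C, int width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {
        float sum = 0.0f;
        for (int k = 0; k < width; k++)
            sum += A[row * width + k] * B[k * width + col];   // dot product of row and column
        C[row * width + col] = sum;
    }
}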
Concurrent Kernel Execution in Fermi http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf 37
Optimization Considerations Kernel optimizations: make use of shared memory; minimize use of divergent control flow (SIMT execution must follow all paths taken within a thread group); use intrinsic instructions when possible to exploit the hardware support behind them. CPU/GPU interaction: use asynchronous memory copies. Key resource considerations for Fermi GPUs: max dimensions of a block (1024, 1024, 64); max 1024 threads per block; 32K registers per SM; 16 KB or 48 KB L1 cache / 48 KB or 16 KB shared memory; see the programmer's guide for a complete set of limits for compute capability 2.x 38
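A minimal sketch (not from the slides) of the asynchronous-copy advice: with pinned host memory and a CUDA stream, the copies and the kernel launch return immediately, letting the CPU keep working; my_kernel and the buffer names are hypothetical, and d_buf is assumed to have been allocated with cudaMalloc.

cudaStream_t stream;
cudaStreamCreate(&stream);

float *h_buf;
cudaMallocHost((void **) &h_buf, bytes);         // pinned host memory, required for true async copies

cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
my_kernel<<<blocks, threads, 0, stream>>>(d_buf, n);
cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);

/* ... independent CPU work here ... */

cudaStreamSynchronize(stream);                   // wait for the copies and the kernel
cudaStreamDestroy(stream);
cudaFreeHost(h_buf);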
CUDA Resources General information about CUDA www.nvidia.com/object/cuda_home.html Nvidia GPUs compatible with CUDA www.nvidia.com/object/cuda_learn_products.html CUDA sample source code www.nvidia.com/object/cuda_get_samples.html Download the CUDA SDK www.nvidia.com/object/cuda_get.html 39
Portable CUDA Alternative: OpenCL Framework for writing programs that execute on heterogeneous platforms, including CPUs, GPUs, etc.; supports both task and data parallelism; based on a subset of ISO C99 with extensions for parallelism; numerics based on the IEEE 754 floating point standard; interoperates efficiently with graphics APIs, e.g. OpenGL. OpenCL is managed by the non-profit Khronos Group. Initial specification approved for public release Dec. 8, 2008; specification 1.2 released Nov 14, 2011 40
OpenCL Kernel Example: 1D FFT 41
OpenCL Host Program: 1D FFT 42
References Patrick LeGresley, High Performance Computing with CUDA, Stanford University Colloquium, October 2008, http://www.stanford.edu/dept/icme/docs/seminars/legresley-2008-10-27.pdf Vivek Sarkar. Introduction to General-Purpose Computation on GPUs (GPGPUs), COMP 635, September 2007. Rob Farber. CUDA, Supercomputing for the Masses, Parts 1-11, Dr. Dobb's Portal, http://www.ddj.com/architect/207200659, April 2008-March 2009. Tom Halfhill. Parallel Processing with CUDA, Microprocessor Report, January 2008. N. Govindaraju et al. A cache-efficient sorting algorithm for database and data mining computations using graphics processors. http://gamma.cs.unc.edu/sort/gpusort.pdf http://defectivecompass.wordpress.com/2006/06/25/learning-from-gpusort http://en.wikipedia.org/wiki/opencl http://www.khronos.org/opencl http://www.khronos.org/files/opencl-quick-reference-card.pdf 43