CUDA C/C++ BASICS. NVIDIA Corporation


CUDA PROGRAMMING MODEL

Anatomy of a CUDA C/C++ Application
Serial code executes in a host (CPU) thread
Parallel code executes in many device (GPU) threads across multiple processing elements
A CUDA C/C++ application therefore alternates: serial code on the host, parallel code on the device, serial code on the host again, and so on.

Compiling CUDA C Applications
Source files mix ordinary C functions (e.g. serial_function(), other_function(), a serial saxpy_serial() loop, main()) with the parallel CUDA C functions derived from them.
NVCC (built on Open64/LLVM) compiles the CUDA C functions into CUDA object files; the rest of the C application is compiled by the host CPU compiler into CPU object files.
The linker combines both sets of object files into a single CPU-GPU executable.

CUDA C: C with a few keywords

Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
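The same decomposition can be sketched on the CPU in plain C: each (block, thread) pair computes one element using the kernel's index formula. This is an illustrative sketch only, not NVIDIA code; the 256-thread block size mirrors the launch above, and saxpy_blocked() is a hypothetical helper name.

```c
/* Serial SAXPY reference, matching the slide's saxpy_serial(). */
void saxpy_serial(int n, float a, float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* CPU sketch of the CUDA decomposition: every (block, thread) pair
 * computes one element, with the same index math as the kernel. */
void saxpy_blocked(int n, float a, float *x, float *y) {
    int threads = 256;
    int nblocks = (n + threads - 1) / threads;   /* round up */
    for (int b = 0; b < nblocks; ++b)
        for (int t = 0; t < threads; ++t) {
            int i = b * threads + t;    /* blockIdx.x*blockDim.x + threadIdx.x */
            if (i < n)                  /* guard the final partial block */
                y[i] = a * x[i] + y[i];
        }
}
```

Both functions produce identical results; the blocked version simply makes the thread/block structure explicit.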

CUDA Kernels
Parallel portion of application: executes as a kernel
The entire GPU executes the kernel, with many threads
CUDA threads are lightweight, switch fast, and 1000s execute simultaneously
CPU (Host): executes functions
GPU (Device): executes kernels

CUDA Kernels: Parallel Threads
A kernel is a function executed on the GPU as an array of threads in parallel
All threads execute the same code, but can take different paths

float x = input[threadIdx.x];
float y = func(x);
output[threadIdx.x] = y;

Each thread has an ID, used to:
Select input/output data
Make control decisions

CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed on the GPU as a grid of blocks of threads

Kernel Execution
Each CUDA thread is executed by a CUDA core
Each CUDA thread block is executed by one CUDA streaming multiprocessor (SM) and does not migrate
Several concurrent blocks can reside on one SM, depending on the blocks' memory requirements and the SM's memory resources
Each CUDA kernel grid is executed on one CUDA-enabled device; multiple kernels can execute on a device at one time

Thread blocks allow cooperation
Threads may need to cooperate:
Cooperatively load/store blocks of memory that all will use
Share results with each other, or cooperate to produce a single result
Synchronize with each other
(Slide diagram: a Fermi SM with instruction cache, dual schedulers and dispatch units, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64K configurable cache/shared memory, and uniform cache.)

Thread blocks allow scalability
Blocks can execute in any order, concurrently or sequentially
This independence between blocks gives scalability: a kernel scales across any number of SMs
(Slide diagram: the same 8-block kernel grid launched on a device with 2 SMs, running 4 blocks per SM, and on a device with 4 SMs, running 2 blocks per SM.)

What is CUDA?
CUDA Architecture:
Exposes GPU parallelism for general-purpose computing
Retains performance
CUDA C/C++:
Based on industry-standard C/C++
A small set of extensions to enable heterogeneous programming
Straightforward APIs to manage devices, memory, etc.
This session introduces CUDA C/C++

Introduction to CUDA C/C++
What will you learn in this session?
Start from Hello World!
Write and launch CUDA C/C++ kernels
Manage GPU memory
Manage communication and synchronization

Prerequisites
You (probably) need experience with C or C++
You don't need GPU experience
You don't need parallel programming experience
You don't need graphics experience

CONCEPTS covered in this session: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices

HELLO WORLD!

Heterogeneous Computing
Terminology:
Host: the CPU and its memory (host memory)
Device: the GPU and its memory (device memory)

Heterogeneous Computing
A complete example, mixing serial code (runs on the host) with a parallel function (runs on the device):

#include <iostream>
#include <algorithm>
using namespace std;

#define N          1024
#define RADIUS     3
#define BLOCK_SIZE 16

// parallel fn
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) { fill_n(x, n, 1); }

// serial code
int main(void) {
    int *in, *out;          // host copies
    int *d_in, *d_out;      // device copies
    int size = (N + 2*RADIUS) * sizeof(int);

    // Alloc space for host copies and setup values
    in  = (int *)malloc(size); fill_ints(in,  N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // parallel code: launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // serial code: copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

Simple Processing Flow
1. Copy input data from CPU memory to GPU memory (across the PCI bus)
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory

Hello World!

int main(void) {
    printf("Hello World!\n");
    return 0;
}

Standard C that runs on the host
The NVIDIA compiler (nvcc) can be used to compile programs with no device code
Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$

Hello World! with Device Code

__global__ void mykernel(void) {
    printf("Hello World from device!\n");
}

int main(void) {
    mykernel<<<1,1>>>();
    cudaDeviceSynchronize();
    printf("Hello World from host!\n");
    return 0;
}

Two new syntactic elements...

Hello World! with Device Code

__global__ void mykernel(void) {
    printf("Hello world from device!\n");
}

The CUDA C/C++ keyword __global__ indicates a function that:
Runs on the device
Is called from host code
nvcc separates source code into host and device components:
Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
Host functions (e.g. main()) are processed by the standard host compiler (gcc, cl.exe)

Hello World! with Device Code

mykernel<<<1,1>>>();

Triple angle brackets mark a call from host code to device code
Also called a kernel launch
We'll return to the parameters (1,1) in a moment
That's all that is required to execute a function on the GPU!

Access to BigRed2
ssh <username>@bigred2.uits.iu.edu
cp -r /N/u/jbentz/BigRed2/oct2/cuda .
cd cuda
module load cudatoolkit
Use the batch system for job submission:
qsub: submit a job to the queue
qstat: show all jobs in the queue
qdel: delete a job from the queue

Exercises: General Instructions (compiling)
Exercises are in the cuda/exercises directory
Solutions are in the cuda/exercise_solutions directory
To compile, use one of the provided makefiles:
> make

Exercises: General Instructions (running)
To run, just execute on the command line:
> qsub runit

Hello world
Login to BigRed2
Each coding project is in a separate folder under ~/cuda/exercises
cd cuda/exercises/hello_world
All dirs have Makefiles for you
Try building/running the code:
make
Make sure you've loaded the cudatoolkit module!
qsub runit


Hello World! with Device Code

__global__ void mykernel(void) {
    printf("Hello from device!\n");
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello from Host!\n");
    return 0;
}

mykernel() does nothing interesting; somewhat anticlimactic!
Output:
$ nvcc hello.cu
$ a.out
Hello from device!
Hello from Host!
$

Parallel Programming in CUDA C/C++
But wait... GPU computing is about massive parallelism!
We need a more interesting example
We'll start by adding two integers and build up to vector addition: a + b = c

Addition on the Device
A simple kernel to add two integers:

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

As before, __global__ is a CUDA C/C++ keyword meaning:
add() will execute on the device
add() will be called from the host

Addition on the Device
Note that we use pointers for the variables:

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

add() runs on the device, so a, b and c must point to device memory
We need to allocate memory on the GPU

Memory Management
Host and device memory are separate entities
Device pointers point to GPU memory
May be passed to/from host code
May not be dereferenced in host code
Host pointers point to CPU memory
May be passed to/from device code
May not be dereferenced in device code
Simple CUDA API for handling device memory:
cudaMalloc(), cudaFree(), cudaMemcpy()
Similar to the C equivalents malloc(), free(), memcpy()

Addition on the Device: add()
Returning to our add() kernel:

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

Let's take a look at main()
Open exercises/simple_add/kernel.cu
Fill in the missing code as indicated; you need to replace FIXME with code
The comments should help; if something isn't clear, PLEASE ASK!

Addition on the Device: main()

int main(void) {
    int a, b, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    a = 2;
    b = 7;

Addition on the Device: main()

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

RUNNING IN PARALLEL

Moving to Parallel
GPU computing is about massive parallelism
So how do we run code in parallel on the device?
add<<< 1, 1 >>>();  becomes  add<<< N, 1 >>>();
Instead of executing add() once, execute it N times in parallel

Vector Addition on the Device
With add() running in parallel we can do vector addition
Terminology: each parallel invocation of add() is referred to as a block
The set of blocks is referred to as a grid
Each invocation can refer to its block index using blockIdx.x

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

By using blockIdx.x to index into the array, each block handles a different index

Vector Addition on the Device

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

On the device, each block can execute in parallel:
Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];

Vector Addition on the Device: add()
Returning to our parallelized add() kernel:

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

Let's take a look at main()
Open exercises/simple_add_blocks/kernel.cu
Fill in the missing code as indicated; it should be clear from the comments where you need to add some code
Replace FIXME with the proper piece of code.

Vector Addition on the Device: main()

#define N 512

int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

Vector Addition on the Device: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Review (1 of 2)
Difference between host and device:
Host = CPU, Device = GPU
Using __global__ to declare a function as device code:
Executes on the device
Called from the host
Passing parameters from host code to a device function

Review (2 of 2)
Basic device memory management:
cudaMalloc()
cudaMemcpy()
cudaFree()
Launching parallel kernels:
Launch N copies of add() with add<<<N,1>>>( );
Use blockIdx.x to access the block index

INTRODUCING THREADS

CUDA Threads
Terminology: a block can be split into parallel threads
Let's change add() to use parallel threads instead of parallel blocks:

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

We use threadIdx.x instead of blockIdx.x
We need to make one change in main()
Open exercises/simple_add_threads/kernel.cu

Vector Addition Using Threads: main()

#define N 512

int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

Vector Addition Using Threads: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N threads
    add<<<1,N>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

COMBINING THREADS AND BLOCKS

Combining Blocks and Threads
We've seen parallel vector addition using:
Many blocks with one thread each
One block with many threads
Let's adapt vector addition to use both blocks and threads
Why? We'll come to that
First let's discuss data indexing

Indexing Arrays with Blocks and Threads
No longer as simple as using blockIdx.x or threadIdx.x alone
Consider indexing an array with one element per thread (8 threads/block):
threadIdx.x runs 0..7 within each block, while blockIdx.x runs 0, 1, 2, 3 across the blocks
With M threads/block, a unique index for each thread is given by:
int index = threadIdx.x + blockIdx.x * M;

Indexing Arrays: Example
Which thread will operate on the red element (global index 21 of 0..31)?
With M = 8 threads per block, element 21 falls in block 2 (blockIdx.x = 2) at thread position 5 (threadIdx.x = 5):
int index = threadIdx.x + blockIdx.x * M;
          = 5 + 2 * 8;
          = 21;
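The worked example above can be checked in plain C, along with the claim that the formula assigns a unique index to every (block, thread) pair. This is an illustrative sketch; global_index() is a hypothetical helper, not a CUDA API.

```c
/* Flattened global index for one thread, mirroring the slide's formula:
 * index = threadIdx.x + blockIdx.x * M   (M = threads per block) */
int global_index(int threadIdx_x, int blockIdx_x, int M) {
    return threadIdx_x + blockIdx_x * M;
}
```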

Vector Addition with Blocks and Threads
Use the built-in variable blockDim.x for threads per block:
int index = threadIdx.x + blockIdx.x * blockDim.x;
Combined version of add() using parallel threads and parallel blocks:

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

What changes need to be made in main()?
Open exercises/simple_add_blocks_threads/kernel.cu

Addition with Blocks and Threads: main()

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

Addition with Blocks and Threads: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Handling Arbitrary Vector Sizes
Typical problems are not friendly multiples of blockDim.x
Avoid accessing beyond the end of the arrays:

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

Update the kernel launch (round the number of blocks up):
add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);
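The (N + M-1) / M expression is just integer ceiling division: enough blocks to cover all N elements, with the if (index < n) guard absorbing the overshoot in the final block. A small C sketch makes the behaviour explicit (num_blocks() is an illustrative name, not part of CUDA):

```c
/* Number of blocks needed to cover n elements with M threads per block:
 * integer ceiling division, so a final partial block is still launched. */
int num_blocks(int n, int M) {
    return (n + M - 1) / M;
}
```

For example, 513 elements with 256 threads per block need 3 blocks; the guard leaves the last block's extra 255 threads idle.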

Why Bother with Threads?
Threads seem unnecessary
They add a level of complexity
What do we gain?
Unlike parallel blocks, threads have mechanisms to:
Communicate
Synchronize
To look closer, we need a new example

Review
Launching parallel kernels:
Launch N copies of add() with add<<<N/M, M>>>( );
Use blockIdx.x to access the block index
Use threadIdx.x to access the thread index within a block
Allocate elements to threads:
int index = threadIdx.x + blockIdx.x * blockDim.x;

COOPERATING THREADS

1D Stencil
Consider applying a 1D stencil to a 1D array of elements
Each output element is the sum of the input elements within a radius
If the radius is 3, then each output element is the sum of 7 input elements (the element itself plus 3 neighbours on each side)
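As a reference point, the stencil itself is easy to state in serial C. A sketch under one assumption: the caller supplies `radius` halo elements on each side of the n interior elements, just as the full example later launches the kernel with d_in + RADIUS.

```c
/* Serial reference for the 1D stencil: each output element is the sum
 * of the 2*radius + 1 input elements centred on it. `in` must point at
 * the first interior element, with `radius` valid halo elements before
 * it and after in[n-1]. */
void stencil_1d_serial(const int *in, int *out, int n, int radius) {
    for (int i = 0; i < n; ++i) {
        int result = 0;
        for (int offset = -radius; offset <= radius; ++offset)
            result += in[i + offset];   /* halo makes these reads legal */
        out[i] = result;
    }
}
```

With an input of all 1s and radius 3, every output is 7, matching the slide's "sum of 7 input elements".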

Implementing Within a Block
Each thread processes one output element
blockDim.x elements per block
Input elements are read several times
With radius 3, each input element is read seven times

Simple Stencil in 1D
Open exercises/simple_stencil/kernel.cu
Finish the kernel and the kernel launch
Each thread calculates one stencil value
Each thread reads 2*RADIUS + 1 values
dim3 type: a CUDA 3-dimensional struct used for grid/block sizes
GPU timers have been inserted into the code to time the execution of the kernel
Try various sizes of N, RADIUS, BLOCK
Time a large (over a million) value of N with a RADIUS of 7

Can we do better?
Input elements are read multiple times
With RADIUS=3, each input element is read seven times!
Neighbouring threads read mostly the same elements:
Thread 7 reads elements 4 through 10
Thread 8 reads elements 5 through 11
Can we avoid this redundant reading of data?
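The "read seven times" claim can be checked with a small C sketch that tallies how many times each padded input element is touched when every output reads its 2*radius + 1 neighbours (count_reads() is an illustrative helper, not a CUDA API):

```c
/* Tally reads of each padded input element for a 1D stencil with n
 * outputs: output i reads padded indices [i, i + 2*radius]. Interior
 * elements end up read 2*radius + 1 times; edge elements fewer. */
void count_reads(int *reads, int n, int radius) {
    for (int i = 0; i < n + 2 * radius; ++i)
        reads[i] = 0;
    for (int i = 0; i < n; ++i)                  /* one output per thread */
        for (int off = -radius; off <= radius; ++off)
            reads[i + radius + off] += 1;        /* each output touches 7 inputs */
}
```

This is exactly the redundancy that shared memory removes: with a tile in shared memory, each input element is fetched from global memory once per block instead of up to seven times.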

Sharing Data Between Threads
Terminology: within a block, threads share data via shared memory
Extremely fast on-chip memory, user-managed
Declare using __shared__, allocated per block
Data is not visible to threads in other blocks

Implementing With Shared Memory
Cache data in shared memory (a user-managed scratch-pad):
Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
Compute blockDim.x output elements
Write blockDim.x output elements to global memory
Each block needs a halo of radius elements at each boundary (a halo on the left and a halo on the right of the blockDim.x output elements)

Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

Simple Stencil 1D with shared memory
cd exercises/simple_stencil_smem/
Run the code. It will build/run without modification.
If errors occur, each offending element will be printed to the screen.
What is the result with N=10,000 and BLOCK=32?
What is the result with N=10,000 and BLOCK=64? Why?

Data Race!
The stencil example as written will not work
Suppose thread 15 reads the halo before thread 0 has fetched it:

temp[lindex] = in[gindex];               // Store at temp[18]
if (threadIdx.x < RADIUS) {              // Skipped, since threadIdx.x > RADIUS
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
result += temp[lindex + 1];              // Load from temp[19]

__syncthreads()

void __syncthreads();

Synchronizes all threads within a block
Used to prevent RAW / WAR / WAW hazards
All threads must reach the barrier
In conditional code, the condition must be uniform across the block
Insert __syncthreads() into the kernel in the proper location
Compare the timing of the previous simple stencil with the current shared-memory implementation for the same (large) N and BLOCK=512

Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

Review (1 of 2)
Launching parallel threads:
Launch N blocks with M threads per block with kernel<<<N, M>>>( );
Use blockIdx.x to access the block index within the grid
Use threadIdx.x to access the thread index within the block
Allocate elements to threads:
int index = threadIdx.x + blockIdx.x * blockDim.x;

Review (2 of 2)
Use __shared__ to declare a variable/array in shared memory
Data is shared between threads in a block
Not visible to threads in other blocks
Using a large shared memory size impacts the number of blocks that can be scheduled on an SM (48K total smem size)
Use __syncthreads() as a barrier
Use it to prevent data hazards
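The shared-memory/occupancy interaction can be illustrated with back-of-the-envelope arithmetic, assuming the 48 KB per-SM figure quoted above. These helper names are illustrative only, and real residency limits also depend on registers, thread counts, and the specific GPU:

```c
/* Shared-memory footprint of one stencil block, in bytes: an int tile
 * of block_size elements plus a halo of `radius` on each side. */
int smem_bytes_per_block(int block_size, int radius) {
    return (block_size + 2 * radius) * (int)sizeof(int);
}

/* How many such blocks fit in a given per-SM shared-memory budget,
 * considering shared memory as the only limiting resource. */
int max_resident_blocks_by_smem(int smem_budget_bytes, int bytes_per_block) {
    return smem_budget_bytes / bytes_per_block;
}
```

For the timing exercise's BLOCK=512 and RADIUS=7, each block needs (512 + 14) * 4 = 2104 bytes, so shared memory alone would allow 23 resident blocks per SM; much larger tiles would cut that number down and reduce the SM's ability to hide latency.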

Thank you!