CUDA Debugging
GPGPU Workshop, August 2012
Sandra Wienke, Center for Computing and Communication, RWTH Aachen University
Nikolay Piskun, Chris Gottbrath, Rogue Wave Software
CUDA in a Nutshell: Programming Model & Memory Model

[Figure: the CUDA programming and memory hierarchy. A thread (with its registers) runs on a core; a block (with shared memory/L1) runs on a streaming multiprocessor (SM-1 ... SM-n); the grid (kernel) runs on the device (GPU) with its L2 cache and device global memory; the host CPU and host memory are connected to the device via PCIe. Per-thread example code: float x = input[threadID]; float y = func(x); output[threadID] = y;]
CUDA in a Nutshell: CUDA Runtime API

// Kernel definition; __global__ indicates execution on the GPU
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // compute thread ID
    if (i < n) {
        y[i] = a * x[i] + y[i];                     // compute SAXPY
    }
}

int main(int argc, char* argv[])
{
    int n = 10240;
    float *h_x, *h_y;   // pointers to CPU memory
    // Allocate and initialize h_x and h_y

    float *d_x, *d_y;   // pointers to GPU memory
    cudaMalloc(&d_x, n * sizeof(float));            // allocate data on GPU
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);  // copy/transfer data to GPU
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    // Invoke parallel SAXPY kernel on the GPU
    dim3 threadsPerBlock(128);
    dim3 blocksPerGrid(n / threadsPerBlock.x);
    saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0, d_x, d_y);

    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy/transfer data to CPU
    cudaFree(d_x);                                  // free data on GPU
    cudaFree(d_y);
    free(h_x);
    free(h_y);
    return 0;
}
CUDA Toolkit
- Developer kit: libraries, headers, profiler, compiler, ...
- Compiling CUDA applications: nvcc [-arch=sm_20] mykernel.cu
- Debugging flags: -g (host code), -G (device code); see the command lines below

CUDA command line tools
- Debugger: cuda-gdb
- Detecting memory access errors: cuda-memcheck

CUDA GUI-based debugger: TotalView
- Debugging host and device code in the same session
- Thread navigation by logical or physical coordinates
- Displaying hierarchical memory, ...
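A minimal sketch of the corresponding command lines (mykernel.cu and a.out stand in for your own file names):

    nvcc -g -G -arch=sm_20 mykernel.cu   # build with host (-g) and device (-G) debug information
    cuda-gdb ./a.out                     # debug on the command line
    cuda-memcheck ./a.out                # report invalid global memory accesses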
Setting breakpoints in CUDA kernels
- Start debugging (e.g. Go)
- A message box appears when the kernel is loaded
- Set kernel breakpoints as in host code
Debugger thread IDs in a Linux CUDA process
- Host threads: positive numbers
- CUDA threads: negative numbers

GPU thread navigation
- Logical coordinates: block (3 dimensions), thread (3 dimensions)
- Physical coordinates: device, SM, warp, core/lane
- Only valid selections are permitted
Warp: a group of 32 threads
- Share one PC
- Advance synchronously
- Problem: diverging threads, e.g. if (threadIdx.x > 2) {...} else {...} (see the sketch after this list)

Single stepping
- Advances all GPU hardware threads within the same warp
- Stepping over a __syncthreads() call advances all threads within the block

Advancing more than just one warp
- Halt, or Run To a selected line number in the source pane
- Set a breakpoint and Continue the process
- Stops all the host and device threads
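A small sketch of such a diverging kernel (divergent_kernel and out are illustrative names, not from the slides):

__global__ void divergent_kernel(int *out)
{
    // Threads 0-2 and threads 3-31 of each warp take different branches,
    // so the threads of the warp temporarily stop sharing one PC.
    if (threadIdx.x > 2) {
        out[threadIdx.x] = 1;
    } else {
        out[threadIdx.x] = 2;
    }
    __syncthreads();  // stepping over this call advances all threads in the block
}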
Displaying CUDA device properties
- Tools > CUDA Devices
- Helps with the mapping between logical & physical coordinates
- Shows the PCs across SMs, warps, and lanes

GPU thread divergence?
- Different PCs within a warp indicate diverging threads
Displaying GPU data
- Dive into a variable, or watch it
- Type into the Expression List
- Device memory spaces: @ notation

Storage qualifier   Meaning of address
@global             Offset within global storage
@shared             Offset within shared storage
@local              Offset within local storage
@register           PTX register name
@generic            Offset within generic address space (e.g. pointer to global, local or shared memory)
@constant           Offset within constant storage
@texture            Offset within texture storage
@parameter          Offset within parameter storage
Checking GPU memory
- Enable CUDA memory checking during startup or in the Debug menu
- Detects global memory addressing violations and misaligned global memory accesses

Further features
- Multi-device support
- Host-pinned memory support
- MPI-CUDA applications
Tips: Check CUDA API calls
- All CUDA API routines return an error code (cudaError_t)
- Or cudaGetLastError() returns the last error from a CUDA runtime call
- cudaGetErrorString(cudaError_t) returns the corresponding message
1. Write a macro to check CUDA API return codes (see the sketch below), or use the SafeCall and CheckError macros from cutil.h (NVIDIA GPU Computing SDK)
2. Use TotalView to examine the return code
   - Evaluate the CUDA API call in the Expression List
   - If needed, dive on the error value and typecast it to a cudaError_t
   - You can also surround the API call with cudaGetErrorString() in the expression field and typecast the result to char[xx]*
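One possible version of such a macro (CUDA_CHECK is an illustrative name; only cudaError_t, cudaSuccess, and cudaGetErrorString come from the CUDA runtime):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",          \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));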
Tips: Check and use available hardware features
- printf statements are possible within kernels (since Fermi)
- Use double-precision floating-point operations (since GT200)
- Enable ECC and check whether single- or double-bit errors occurred using nvidia-smi -q (since Fermi)

Tips: Check final numerical results on the host
- While porting, it is recommended to compare all computed GPU results with host results (see the sketch below)
  1. Compute checksums of the GPU and host array values
  2. If that is not sufficient, compare the arrays element-wise
- Comparative debugging approach, e.g. the statistics view
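A minimal sketch of the host-side comparison, assuming h_y holds the result copied back from the GPU and h_ref a host-computed reference (compare_results, h_ref, and tol are illustrative names):

#include <math.h>
#include <stdio.h>

int compare_results(const float *h_y, const float *h_ref, int n, float tol)
{
    // Step 1: compare checksums of the GPU and host array values
    double sum_gpu = 0.0, sum_ref = 0.0;
    for (int i = 0; i < n; ++i) {
        sum_gpu += h_y[i];
        sum_ref += h_ref[i];
    }
    if (fabs(sum_gpu - sum_ref) <= tol)
        return 0;  // checksums agree

    // Step 2: checksums differ, so compare the arrays element-wise
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (fabsf(h_y[i] - h_ref[i]) > tol) {
            printf("mismatch at %d: gpu=%f host=%f\n", i, h_y[i], h_ref[i]);
            ++mismatches;
        }
    }
    return mismatches;
}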
Tips: Check intermediate results
- If results are stored directly in global memory: dive on the result array
- If results are stored in on-chip memory (e.g. registers): tedious debugging
  - TotalView: viewing variables across CUDA threads is not possible yet
1. Create an additional host array for intermediate results with size #threads * #results * sizeof(result) (see the sketch below)
   - Use the array on the GPU: each thread stores its result at a unique index
   - Transfer the array back to the host and examine the results
2. If there is only a limited number of thread blocks: create an additional array in shared memory within the kernel function: __shared__ float myarray[size]
   - Use defines to exchange access to the on-chip variable with array access
   - Examine the results by diving on the array and switching between blocks
- Use filters, array statistics, freeze, duplicate, last values, and watchpoints
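A sketch of technique 1 (kernel_dbg, d_dbg, and RESULTS_PER_THREAD are illustrative names):

#define RESULTS_PER_THREAD 1  // number of intermediate results per thread

__global__ void kernel_dbg(const float *x, float *y, float *d_dbg, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float tmp = x[i] * x[i];              // intermediate result held in a register
        d_dbg[i * RESULTS_PER_THREAD] = tmp;  // each thread writes to its unique slot
        y[i] = tmp + 1.0f;
    }
}

// Host side: cudaMalloc d_dbg with nThreads * RESULTS_PER_THREAD * sizeof(float),
// run the kernel, cudaMemcpy d_dbg back, and dive on the host copy in TotalView.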