MATSE HPC Battle 2012/13
Sandra Wienke
Center for Computing and Communication (Rechen- und Kommunikationszentrum, RZ), RWTH Aachen University
Agenda
- IDE: Eclipse
- Debugging (CUDA): TotalView
- Profiling (CUDA & OpenACC): NVIDIA Visual Profiler
- Appendix: Debugging host code with TotalView
IDE - Eclipse
Eclipse + Nsight: IDE for GPU programming
- CUDA syntax highlighting, CUDA debugging, CUDA profiling
- OpenACC programming, OpenACC profiling
Using Nsight in the RWTH cluster environment: module load cuda; nsight
1. Choose a workspace
2. File -> New -> Makefile Project with Existing Code
3. Choose the file directory of the CG solver
4. Toolchain: CUDA Toolkit 5.0
5. Create Makefile targets (Makefile tab in the right pane), e.g. cuda, clean
6. Double-click on a Makefile target to execute it
Parallel Nsight download: http://www.nvidia.com/object/nsight.html
IDE - Eclipse
Debugging (CUDA)
- Use the debug configuration of the Makefile to create the executable
- Press the debug button (green bug); proceed (even if errors are reported)
- Switch to the debug perspective (the application suspends in the main function; at this point no GPU code is running yet)
- Add a breakpoint in the device code
- Resume the application
Profiling (CUDA, OpenACC)
- Internally uses NVIDIA's Visual Profiler (see later)
- Use the release target of the Makefile to create the executable
- Press the profile button (watch icon); proceed (even if errors are reported)
- Switch the perspective
- Output/interpretation: see chapter NVIDIA Visual Profiler
Agenda
- IDE: Eclipse
- Debugging (CUDA): TotalView
- Profiling (CUDA & OpenACC): NVIDIA Visual Profiler
- Appendix: Debugging host code with TotalView
Debugging (CUDA)
- Debugging host code: as usual
- Debugging GPU kernels requires special tools
  - CUDA debuggers available
  - OpenACC debuggers not available
Compiling CUDA applications
- nvcc [-arch=sm_20] mykernel.cu
- RWTH cluster environment: module load cuda
- Debugging flags: -g -G
  nvcc -g -G [-arch=sm_20] mykernel.cu (see the debug target in the Makefile)
CUDA command-line tools (see the sketch below)
- Debugger: cuda-gdb
- Detecting memory access errors: cuda-memcheck
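A minimal sketch of a debuggable CUDA build; the file name mykernel.cu and the trivial scale kernel are illustrative and not taken from the CG solver, and the compile/tool commands from this slide are repeated as comments:

    // Build and inspect, e.g.:
    //   nvcc -g -G -arch=sm_20 mykernel.cu -o mykernel
    //   cuda-gdb ./mykernel        (command-line debugging)
    //   cuda-memcheck ./mykernel   (detect invalid memory accesses)
    #include <cstdio>

    __global__ void scale(float *x, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)          // a convenient line for a first kernel breakpoint
            x[i] *= s;
    }

    int main()
    {
        const int n = 256;
        float *d_x;
        cudaMalloc(&d_x, n * sizeof(float));
        scale<<<(n + 127) / 128, 128>>>(d_x, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d_x);
        return 0;
    }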
Debugging (CUDA)
CUDA GUI-based debugger: TotalView
- Debugging host and device code in the same session
- Thread navigation by logical or physical coordinates
- Displaying hierarchical memory, ...
- General information on debugging with TotalView can be found in the appendix
CUDA GUI-based debugger: Eclipse (see above)
RWTH cluster environment: module load totalview; totalview
- If you get an error concerning the CUDA version, try to compile your application with CUDA 4.1: module switch cuda cuda/41
Debugging (CUDA) - TotalView
Setting breakpoints in CUDA kernels
- Start debugging (e.g. Go)
- A message box appears when the kernel is loaded
- Set kernel breakpoints as in host code
Debugging (CUDA) - TotalView
Debugger thread IDs in a Linux CUDA process
- Host thread: positive number
- CUDA thread: negative number
GPU thread navigation
- Logical coordinates: blocks (3 dimensions), threads (3 dimensions)
- Physical coordinates: device, SM, warp, core/lane
- Only valid selections are permitted
Debugging (CUDA) - TotalView
Warp: group of 32 threads
- Share one PC
- Advance synchronously
- Problem: diverging threads, e.g. if (threadIdx.x > 2) {...} else {...} (see the sketch below)
Single stepping
- Advances all GPU hardware threads within the same warp
- Stepping over a __syncthreads() call advances all threads within the block
Advancing more than just one warp
- Halt, then Run To a selected line number in the source pane
- Or set a breakpoint and Continue the process
- Stops all the host and device threads
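A small sketch of the divergence pattern mentioned above (names and sizes are made up): threads 0-2 of each warp take the else branch, the remaining threads take the if branch, so the warp's single PC has to serialize both paths; stepping over the __syncthreads() advances all threads of the block.

    __global__ void divergeExample(int *out)
    {
        __shared__ int buf[128];
        int tid = threadIdx.x;

        if (threadIdx.x > 2) {      // most threads of the warp go here ...
            buf[tid] = 2 * tid;
        } else {                    // ... threads 0, 1, 2 diverge to here
            buf[tid] = -tid;
        }

        __syncthreads();            // stepping over this advances the whole block

        out[blockIdx.x * blockDim.x + tid] = buf[tid];
    }

    int main()
    {
        int *d_out;
        cudaMalloc(&d_out, 128 * sizeof(int));
        divergeExample<<<1, 128>>>(d_out);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }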
Debugging (CUDA) - TotalView
Displaying CUDA device properties
- Tools -> CUDA Devices
- Helps mapping between logical & physical coordinates
- Shows PCs across SMs, warps, lanes
GPU thread divergence?
- Different PC within a warp -> diverging threads
Debugging (CUDA) - TotalView
Displaying GPU data
- Dive into a variable or watch it
- Type it into the Expression List
- Device memory spaces: @ notation (see the sketch below)

Storage qualifier   Meaning of address
@global             Offset within global storage
@shared             Offset within shared storage
@local              Offset within local storage
@register           PTX register name
@generic            Offset within generic address space (e.g. pointer to global, local or shared memory)
@constant           Offset within constant storage
@texture            Offset within texture storage
@parameter          Offset within parameter storage
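A sketch for trying the @ notation: each variable below lives in one of the memory spaces listed above (all names and sizes are illustrative).

    #include <cstdio>

    __constant__ float c_coeff[4];           // constant memory      -> @constant

    __global__ void spaces(float *g_out)     // g_out points into    @global;
    {                                        // the pointer itself is a @parameter
        __shared__ float s_buf[64];          // shared memory        -> @shared
        float r_tmp = threadIdx.x * c_coeff[0];  // scalar temporary -> @register
        float l_arr[8];                      // per-thread array, often -> @local

        for (int k = 0; k < 8; ++k) l_arr[k] = r_tmp + k;
        s_buf[threadIdx.x] = l_arr[7];
        __syncthreads();
        g_out[threadIdx.x] = s_buf[threadIdx.x];
    }

    int main()
    {
        float h_coeff[4] = {1.f, 2.f, 3.f, 4.f}, *d_out;
        cudaMemcpyToSymbol(c_coeff, h_coeff, sizeof(h_coeff));
        cudaMalloc(&d_out, 64 * sizeof(float));
        spaces<<<1, 64>>>(d_out);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }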
Debugging (CUDA) - TotalView
Checking GPU memory
- Enable CUDA memory checking during startup or in the Debug menu
- Detects global memory addressing violations and misaligned global memory accesses (see the sketch below)
Further features
- Multi-device support
- Host-pinned memory support
- MPI-CUDA applications
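A sketch of the kind of defect the memory checker is meant to catch: an off-by-one comparison lets one thread write past the end of the allocation (the bug and all names are made up for illustration).

    __global__ void offByOne(int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i <= n)            // BUG: should be  i < n
            data[i] = i;       // thread i == n writes one element out of bounds
    }

    int main()
    {
        const int n = 128;
        int *d_data;
        cudaMalloc(&d_data, n * sizeof(int));
        offByOne<<<1, n + 32>>>(d_data, n);   // enough threads to reach the bad index
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }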
Debugging (CUDA) - Tips
Check CUDA API calls
- All CUDA API routines return an error code (cudaError_t)
- Alternatively, cudaGetLastError() returns the last error from a CUDA runtime call
- cudaGetErrorString(cudaError_t) returns the corresponding message
1. Write a macro to check CUDA API return codes (see the sketch below), or use the SafeCall and CheckError macros from cutil.h (NVIDIA GPU Computing SDK)
2. Use TotalView to examine the return code
   - Evaluate the CUDA API call in the Expression List
   - If needed, dive on the error value and typecast it to a cudaError_t
   - You can also surround the API call by cudaGetErrorString() in the expression field and typecast the result to char[xx]*
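A minimal sketch of the macro from step 1 (the name CUDA_CHECK is our own choice, not from the SDK):

    #include <cstdio>
    #include <cstdlib>

    #define CUDA_CHECK(call)                                              \
        do {                                                              \
            cudaError_t err = (call);                                     \
            if (err != cudaSuccess) {                                     \
                fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                        cudaGetErrorString(err), __FILE__, __LINE__);     \
                exit(EXIT_FAILURE);                                       \
            }                                                             \
        } while (0)

    int main()
    {
        float *d_x;
        CUDA_CHECK(cudaMalloc(&d_x, 1024 * sizeof(float)));
        // kernel launches do not return a cudaError_t; check them afterwards:
        // myKernel<<<grid, block>>>(...);
        CUDA_CHECK(cudaGetLastError());        // launch configuration errors
        CUDA_CHECK(cudaDeviceSynchronize());   // errors during kernel execution
        CUDA_CHECK(cudaFree(d_x));
        return 0;
    }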
Debugging (CUDA) - Tips
Check + use available hardware features
- printf statements are possible within kernels (since Fermi)
- Use double-precision floating point operations (since GT200)
- Enable ECC and check whether single- or double-bit errors occurred using nvidia-smi -q (since Fermi)
Check final numerical results on the host
- While porting, it is recommended to compare all computed GPU results with host results
1. Compute check sums of the GPU and host array values (see the sketch below)
2. If not sufficient, compare the arrays element-wise
   - Comparative debugging approach, e.g. statistics view
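A sketch of the checksum idea from step 1, including a kernel printf (Fermi or later, i.e. compile with -arch=sm_20): the same scaling is done on host and device and only the two sums are compared first; all names are illustrative and not taken from the CG solver.

    #include <cstdio>
    #include <cmath>

    __global__ void scale(float *x, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            x[i] *= s;
            if (i == 0) printf("device: x[0] = %f\n", x[0]);  // needs Fermi or later
        }
    }

    int main()
    {
        const int n = 1024;
        float h_ref[n], h_gpu[n], *d_x;
        for (int i = 0; i < n; ++i) h_ref[i] = h_gpu[i] = (float)i;

        cudaMalloc(&d_x, n * sizeof(float));
        cudaMemcpy(d_x, h_gpu, n * sizeof(float), cudaMemcpyHostToDevice);
        scale<<<(n + 255) / 256, 256>>>(d_x, 0.5f, n);
        cudaMemcpy(h_gpu, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x);

        double sumHost = 0.0, sumGpu = 0.0;
        for (int i = 0; i < n; ++i) {
            h_ref[i] *= 0.5f;              // reference result computed on the host
            sumHost += h_ref[i];
            sumGpu  += h_gpu[i];
        }
        printf("checksums: host %f  gpu %f  |diff| %g\n",
               sumHost, sumGpu, fabs(sumHost - sumGpu));
        // if the checksums differ, fall back to an element-wise comparison
        return 0;
    }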
Debugging (CUDA) - Tips
Check intermediate results
- If results are directly stored in global memory: dive on the result array
- If results are stored in on-chip memory (e.g. registers): tedious debugging
  - TotalView: viewing variables across CUDA threads is not possible yet
1. Create an additional array on the host for intermediate results with size #threads * #results * sizeof(result) (see the sketch below)
   - Use the array on the GPU: each thread stores its result at a unique index
   - Transfer the array back to the host and examine the results
2. If the number of thread blocks is limited: create an additional array in shared memory within the kernel function, e.g. __shared__ float myarray[SIZE]
   - Use defines to exchange access to the on-chip variable with array access
   - Examine the results by diving on the array and switching between blocks
- Use filter, array statistics, freeze, duplicate, last values and watchpoints
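A sketch of variant 1: an extra debug array in global memory in which every thread stores its on-chip intermediate value at a unique index, so it can be examined on the host (or dived on in TotalView); names and sizes are illustrative.

    #include <cstdio>

    __global__ void kernelWithDebug(const float *in, float *out, float *dbg, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float tmp = in[i] * in[i];   // intermediate value normally kept in a register
        dbg[i] = tmp;                // expose it: one slot per thread

        out[i] = tmp + 1.0f;
    }

    int main()
    {
        const int n = 256;
        float h_in[n], h_dbg[n], *d_in, *d_out, *d_dbg;
        for (int i = 0; i < n; ++i) h_in[i] = (float)i;

        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMalloc(&d_dbg, n * sizeof(float));   // #threads * #results * sizeof(result)
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        kernelWithDebug<<<(n + 127) / 128, 128>>>(d_in, d_out, d_dbg, n);
        cudaMemcpy(h_dbg, d_dbg, n * sizeof(float), cudaMemcpyDeviceToHost);

        printf("intermediate result of thread 5: %f\n", h_dbg[5]);
        cudaFree(d_in); cudaFree(d_out); cudaFree(d_dbg);
        return 0;
    }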
Agenda
- IDE: Eclipse
- Debugging (CUDA): TotalView
- Profiling (CUDA & OpenACC): NVIDIA Visual Profiler
- Appendix: Debugging host code with TotalView
Profiling (CUDA & OpenACC)
Profiling = analyzing the behavior of the application during runtime
- e.g. runtime of functions, memory throughput
NVIDIA Visual Profiler for CUDA & OpenACC codes
- Profiles only GPU data movement & computation (not host code)
- RWTH cluster environment: module load cuda; nvvp
1. Compile your program and start the profiler: nvvp
2. Select File -> New Session
3. Choose your executable as file
   - Specify arguments, e.g. the matrix file
   - Specify environment variables, e.g. CG_MAX_ITER
4. If you want to shorten the execution time, set a timeout limit
5. Finish the session configuration & wait for results
Profiling (CUDA & OpenACC)
Session tab - timeline
- Long memory copy from host to device
- Short memory copy from device to host
- Compute time for the first kernel
- Collapse the rows to see summarized information only
- Is data only transferred when needed?
- Which kernel needs the most time?
Profiling (CUDA & OpenACC)
Analysis tab
- Gives hints for optimization (not always useful)
Details tab
- Switch from the Analysis tab to the Details tab
- Shows runtime, grid dimensions, ...
- Kernel name: <func>_<line>_gpu (see the sketch below)
- On the right-hand side, activate the summary view
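An OpenACC sketch (PGI-style, names illustrative) of where such a kernel name comes from: the accelerator kernel generated for the loop below would show up in the profiler under a name built from the enclosing function and the loop's source line, roughly saxpy_<line>_gpu.

    /* saxpy.c - compiled e.g. with: pgcc -acc -Minfo=accel -c saxpy.c */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        #pragma acc kernels loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }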
Agenda
- IDE: Eclipse
- Debugging (CUDA): TotalView
- Profiling (CUDA & OpenACC): NVIDIA Visual Profiler
- Appendix: Debugging host code with TotalView
Appendix - Debugging host code
Start TotalView and select your program to debug
Appendix - Debugging host code
Process window of TotalView
- Toolbar
- Process and thread status
- Stack Trace pane
- Stack Frame pane
- Source pane
- Tabbed pane
Appendix - Debugging host code
Breakpoints
- Interrupt execution when a specific code line is reached
- Conditional breakpoints possible
- Set by clicking in the source pane
- Temporary disabling is possible
Watchpoints
- Interrupt when a change occurs to a specific memory location
- Conditional watchpoints possible (e.g. only stop if the sign of the value changes or a specified threshold is reached)
(A small host-code sketch to try this on follows below.)
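A minimal host-code sketch, made up for illustration; the array a and the element a[29] match the watchpoint used on the following slides.

    #include <cstdio>

    int main()
    {
        int a[100];

        for (int i = 0; i < 100; ++i)
            a[i] = 0;                  // convenient line for a breakpoint

        for (int i = 0; i < 100; ++i)
            a[i] = i * i - 50 * i;     // a[29] changes here: a watchpoint on it fires

        printf("a[29] = %d\n", a[29]);
        return 0;
    }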
Appendix - Debugging host code
Setting a breakpoint
Appendix - Debugging host code
Inspecting an array in C/C++
- Double-click on the array name
- A typecast may be necessary (see the sketch below)
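A sketch of why the typecast is needed (the program is illustrative): for a dynamically allocated array TotalView only sees a plain pointer, so the displayed type has to be changed, e.g. from double* to something like double[100]*, to view all elements.

    #include <cstdio>
    #include <cstdlib>

    int main()
    {
        const int n = 100;
        // TotalView shows 'values' as a single double* - retype it to
        // something like 'double[100]*' and dive to inspect the whole array.
        double *values = (double *)malloc(n * sizeof(double));

        for (int i = 0; i < n; ++i)
            values[i] = 0.5 * i;

        printf("values[42] = %f\n", values[42]);
        free(values);
        return 0;
    }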
Appendix - Debugging host code
Data visualizations are helpful for big data arrays
Appendix - Debugging host code
Create a watchpoint for a[29]
Appendix - Debugging host code
The watchpoint will interrupt as soon as a[29] changes