CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke, Center for Computing and Communication, RWTH Aachen University


CUDA Debugging. GPGPU Workshop, August 2012. Sandra Wienke, Center for Computing and Communication, RWTH Aachen University; Nikolay Piskun, Chris Gottbrath, Rogue Wave Software. Rechen- und Kommunikationszentrum (RZ)

CUDA in a Nutshell: Programming Model & Memory Model

[Diagram: the host (CPU with its own host memory) connects to the device (GPU) via PCIe. A kernel launch creates a grid of thread blocks; blocks are scheduled onto streaming multiprocessors (SM-1 ... SM-n). Each thread has its own registers; each SM provides shared memory and an L1 cache; all SMs share an L2 cache and the device global memory. Per-thread example code:
float x = input[threadID];
float y = func(x);
output[threadID] = y;]

CUDA in a Nutshell: CUDA Runtime API

#include <cstdlib>
#include <cuda_runtime.h>

// __global__ indicates kernel execution: this function runs on the GPU
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // compute thread ID
    if (i < n) {
        y[i] = a * x[i] + y[i];                     // compute SAXPY
    }
}

int main(int argc, char* argv[])
{
    int n = 10240;
    float *h_x, *h_y;   // pointers to CPU memory
    // Allocate and initialize h_x and h_y

    float *d_x, *d_y;   // pointers to GPU memory
    // Allocate data on GPU
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    // Copy/transfer data to GPU
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

    // Invoke parallel SAXPY kernel on GPU
    dim3 threadsPerBlock(128);
    dim3 blocksPerGrid(n / threadsPerBlock.x);
    saxpy_parallel<<<blocksPerGrid, threadsPerBlock>>>(n, 2.0, d_x, d_y);

    // Copy/transfer data back to CPU
    cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Free data on GPU
    cudaFree(d_x);
    cudaFree(d_y);
    free(h_x);
    free(h_y);
    return 0;
}

CUDA Toolkit
Developer kit: libraries, headers, profiler, compiler, ...
Compiling CUDA applications: nvcc [-arch=sm_20] mykernel.cu
Debugging flags: -g (host code), -G (device code)
CUDA command-line tools:
  Debugger: cuda-gdb
  Detecting memory access errors: cuda-memcheck
CUDA GUI-based debugger: TotalView
  Debugging host and device code in the same session
  Thread navigation by logical or physical coordinates
  Displaying hierarchical memory, ...
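A typical debug build and memory check might look like this (the output name is illustrative):

nvcc -g -G -arch=sm_20 mykernel.cu -o mykernel
cuda-memcheck ./mykernel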

Setting breakpoints in CUDA kernels
Start debugging (e.g. Go)
A message box appears when the kernel is loaded
Set kernel breakpoints as in host code

Debugger thread IDs in a Linux CUDA process
  Host thread: positive number
  CUDA thread: negative number
GPU thread navigation
  Logical coordinates: blocks (3 dimensions), threads (3 dimensions)
  Physical coordinates: device, SM, warp, core/lane
  Only valid selections are permitted

Single stepping
Warp: group of 32 threads
  Share one PC (program counter)
  Advance synchronously
  Problem: diverging threads, e.g. if (threadIdx.x > 2) {...} else {...} (see the sketch below)
Single stepping advances all GPU hardware threads within the same warp
Stepping over a __syncthreads() call advances all threads within the block
Advancing more than just one warp:
  Halt (stops all the host and device threads), then Run To a selected line number in the source pane, or set a breakpoint and Continue the process
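As a concrete picture of the divergence case above, a minimal kernel sketch (the kernel name and output buffer are illustrative, not from the slides):

__global__ void diverge(int *out)
{
    int i = threadIdx.x;
    if (i > 2) {
        out[i] = 1;   // lanes 3..31 of the first warp take this path
    } else {
        out[i] = 2;   // lanes 0..2 take this path; the warp's single PC
                      // serializes the two paths one after the other
    }
    __syncthreads();  // stepping over this advances all threads in the block
}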

Displaying CUDA device properties
Tools > CUDA Devices
Helps with mapping between logical & physical coordinates
Shows PCs across SMs, warps, and lanes
GPU thread divergence? A different PC within a warp indicates diverging threads

Displaying GPU data
Dive into a variable or watch
Type into the Expression List
Device memory spaces: @ notation

Storage qualifier   Meaning of address
@global             Offset within global storage
@shared             Offset within shared storage
@local              Offset within local storage
@register           PTX register name
@generic            Offset within generic address space (e.g. pointer to global, local or shared memory)
@constant           Offset within constant storage
@texture            Offset within texture storage
@parameter          Offset within parameter storage

Checking GPU memory
Enable CUDA memory checking during startup or in the Debug menu
Detects global memory addressing violations and misaligned global memory accesses
Further features:
  Multi-device support
  Host-pinned memory support
  MPI-CUDA applications

Tips: Check CUDA API calls
All CUDA API routines return an error code (cudaError_t)
Alternatively, cudaGetLastError() returns the last error from a CUDA runtime call; cudaGetErrorString(cudaError_t) returns the corresponding message
1. Write a macro to check CUDA API return codes (see the sketch below), or use the SafeCall and CheckError macros from cutil.h (NVIDIA GPU Computing SDK)
2. Use TotalView to examine the return code:
   Evaluate the CUDA API call in the expression list
   If needed, dive on the error value and typecast it to cudaError_t
   You can also surround the API call with cudaGetErrorString() in the expression field and typecast the result to char[xx]*
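A minimal sketch of such a checking macro, using only the runtime API calls named above (the macro name CUDA_CHECK is illustrative; SafeCall/CheckError in cutil.h follow the same pattern):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc(&d_x, n * sizeof(float)));
//   saxpy_parallel<<<blocks, threads>>>(n, 2.0f, d_x, d_y);
//   CUDA_CHECK(cudaGetLastError());  // picks up kernel launch errors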

Tips: Check and use available hardware features
printf statements are possible within kernels (since Fermi)
Double-precision floating point operations are available (since GT200)
Enable ECC and check whether single- or double-bit errors occurred using nvidia-smi -q (since Fermi)

Tips: Check final numerical results on the host
While porting, it is recommended to compare all computed GPU results with host results:
1. Compute checksums of the GPU and host array values
2. If that is not sufficient, compare the arrays element-wise (see the sketch below)
A comparative debugging approach, e.g. using the statistics view
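A minimal sketch of the element-wise comparison in step 2 (the function name, tolerance handling, and mismatch limit are illustrative assumptions):

#include <math.h>
#include <stdio.h>

// Returns the number of elements whose GPU result deviates from the
// host reference by more than tol; prints the first few mismatches.
int compare_results(const float *host_ref, const float *gpu_res,
                    int n, float tol)
{
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (fabsf(host_ref[i] - gpu_res[i]) > tol) {
            if (mismatches < 10)
                printf("i=%d: host=%f gpu=%f\n", i, host_ref[i], gpu_res[i]);
            ++mismatches;
        }
    }
    return mismatches;
}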

Tips: Check intermediate results
If results are stored directly in global memory: dive on the result array
If results are stored in on-chip memory (e.g. registers): tedious debugging
TotalView: viewing variables across CUDA threads is not possible yet
1. Create an additional array on the host for intermediate results, of size #threads * #results * sizeof(result)
   Use the array on the GPU: each thread stores its result at a unique index (see the sketch below)
   Transfer the array back to the host and examine the results
2. If the number of thread blocks is limited: create an additional array in shared memory within the kernel function: __shared__ float myarray[size]
   Use defines to exchange access to the on-chip variable with array access
   Examine the results by diving on the array and switching between blocks
Use filters, array statistics, freeze, duplicate, last values, and watchpoints
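A minimal sketch of tip 1 (the kernel name, debug buffer, and the computed intermediate value are all illustrative):

// dbg has room for #threads * resultsPerThread floats; each thread
// writes its intermediate value(s) at a unique index so the host can
// copy the buffer back and inspect it in the debugger.
__global__ void kernel_dbg(const float *x, float *dbg, int resultsPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float t = x[tid] * x[tid];            // some intermediate result
    dbg[tid * resultsPerThread + 0] = t;  // store for host inspection
    // ... remainder of the original computation ...
}
// Host side: cudaMemcpy the dbg array back and examine the results.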