Debugging CUDA Applications. Przetwarzanie Równoległe (Parallel Processing) CUDA/CELL




Debugging CUDA Applications
Przetwarzanie Równoległe (Parallel Processing) CUDA/CELL

Michał Wójcik, Tomasz Boiński
Department of Computer Systems Architecture (Katedra Architektury Systemów Komputerowych)
Faculty of Electronics, Telecommunications and Informatics
Gdańsk University of Technology (Politechnika Gdańska)
January 24, 2014

cuda-gdb

- a ported version of GDB (The GNU Project Debugger),
- extends GDB with the ability to debug CUDA device code,
- can still debug standard CPU code,
- works on Linux and Mac OS X,
- shipped as part of the CUDA Toolkit since version 2.1.

CUDA Debugging

To compile a CUDA application with debugging support, the -g -G option pair must be passed to the CUDA compiler:

    nvcc -g -G foo.cu -o foo

What it does:
- forces -O0 (mostly unoptimized) compilation,
- makes the compiler include symbolic debugging information in the executable.
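
Since -G mostly disables device-side optimizations, it is common to keep a separate debug binary next to the regular one. A minimal sketch, assuming a source file foo.cu (the file and binary names are placeholders, not part of the original slides):

    nvcc -g -G foo.cu -o foo_dbg   # debug build: host (-g) and device (-G) symbols, device code unoptimized
    nvcc -O2 foo.cu -o foo         # regular optimized build for normal runs
    cuda-gdb ./foo_dbg             # debug the unoptimized binary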

Defining breakpoints

- cuda-gdb is used the same way as gdb,
- function symbol names and source file line numbers can be used as breakpoint locations,
- breakpoints can be placed in both host and device functions,
- for a kernel named mykernel_main the break commands are:
    (cuda-gdb) break mykernel_main
    (cuda-gdb) b mykernel_main
- the above command sets a breakpoint at a particular device location (the address of mykernel_main) and forces all resident GPU threads to stop at this location,
- there is currently no method to stop only certain threads or warps at a given breakpoint.

Stepping granularity

- cuda-gdb supports stepping GPU code at the finest granularity of a warp: typing next or step at the cuda-gdb command line (when the focus is on device code) advances all threads in the same warp as the current thread of focus,
- to advance the execution of more than one warp, you must set a breakpoint at the desired location,
- when stepping over the thread barrier call __syncthreads(), an implicit breakpoint is set immediately after the barrier and all threads are continued to this point,
- on GPUs with sm_type lower than sm_20 it is not possible to step over a subroutine in device code: cuda-gdb always steps into the device function, because subroutines are always inlined,
- on GPUs with sm_type sm_20 and higher, subroutines can be stepped over when the __noinline__ keyword is used on the function and the code is compiled with -arch=sm_20 (see the sketch below).
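
As an illustration of the last point, a minimal sketch (not from the original slides; the function names are placeholders) of a device helper marked __noinline__ so that, on sm_20 or newer hardware and compiled with -arch=sm_20, cuda-gdb can step over the call instead of being forced into it:

    // hypothetical helper kept as a real call (not inlined) so "next" can step over it
    __device__ __noinline__ int scale(int v)
    {
        return 2 * v;
    }

    __global__ void scale_kernel(int *data)
    {
        // with focus on device code: "step" enters scale(), "next" steps over it
        data[threadIdx.x] = scale(data[threadIdx.x]);
    }

Compiled, for example, with: nvcc -g -G -arch=sm_20 scale.cu -o scale_dbg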

Query info about devices

- (cuda-gdb) info cuda devices   -- all the devices,
- (cuda-gdb) info cuda sm        -- all SMs in the current device,
- (cuda-gdb) info cuda warp      -- all warps in the current SM,
- (cuda-gdb) info cuda lane      -- all lanes in the current warp,
- (cuda-gdb) info cuda kernels   -- all active kernels,
- (cuda-gdb) info cuda blocks    -- active blocks in the current kernel,
- (cuda-gdb) info cuda threads   -- active threads in the current kernel.

These options vary between cuda-gdb versions!

The coordinates

Inspecting the current coordinates (while inside a kernel function) is done with the (cuda-gdb) cuda command, followed by one or more of: device, sm, warp, lane, block, thread, kernel.

The coordinates (2)

To change the physical coordinates use:

    (cuda-gdb) cuda device 0 sm 1 warp 2 lane 3

To change the logical coordinates use:

    (cuda-gdb) cuda thread (15,0,0)
    (cuda-gdb) cuda block (1,0) thread (3,0,0)

- providing only the CUDA thread coordinates keeps the current block of focus while switching to the specified thread,
- providing only thread (10) switches to the CUDA thread with X coordinate 10 and Y and Z coordinates 0.

To change the kernel focus use:

    (cuda-gdb) cuda kernel 0
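
As an illustrative combination of the commands above (not part of the original walkthrough; the exact echo format differs between cuda-gdb versions), one can switch focus and then inspect the built-in coordinate variables directly:

    (cuda-gdb) cuda kernel 0 block (1,0,0) thread (31,0,0)
    (cuda-gdb) print threadIdx
    (cuda-gdb) print blockIdx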

Other features

- when debugging a kernel that appears to be hanging or looping indefinitely, the Ctrl-C signal can be used; it freezes the GPU and reports back the current source code location,
- cuda-gdb supports an initialization file, which must reside in your home directory (~/.cuda-gdbinit),
- this file can contain any cuda-gdb command/extension as input, to be processed upon starting cuda-gdb (see the example below),
- this is just like the .gdbinit file used by standard versions of GDB, only renamed.
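
A minimal illustrative ~/.cuda-gdbinit (not from the original slides; the chosen settings are only examples of commands the file may contain, and their availability depends on the cuda-gdb version):

    # ~/.cuda-gdbinit -- read automatically when cuda-gdb starts
    set cuda memcheck on          # enable memory access checking inside the debugger (if supported)
    set cuda api_failures stop    # stop when a CUDA API call returns an error (if supported)
    break main                    # always start with a breakpoint on main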

Example - kernel

    #include <stdio.h>
    #include <stdlib.h>

    // Simple 8-bit bit reversal compute test

    #define N 256

    __global__ void bitreverse(void *data)
    {
        unsigned int *idata = (unsigned int *) data;
        extern __shared__ int array[];

        array[threadIdx.x] = idata[threadIdx.x];

        array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4) |
                             ((0x0f0f0f0f & array[threadIdx.x]) << 4);
        array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2) |
                             ((0x33333333 & array[threadIdx.x]) << 2);
        array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1) |
                             ((0x55555555 & array[threadIdx.x]) << 1);

        idata[threadIdx.x] = array[threadIdx.x];
    }
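
For reference (this note is not part of the original slides): the kernel reverses the order of the bits within each byte, so for inputs 0..255 it reverses the 8 low bits; e.g. 1 (00000001b) becomes 128 (10000000b) and 3 (00000011b) becomes 192 (11000000b), which matches the shared-memory dump shown later in the walkthrough. A hypothetical host-side reference doing the same computation:

    // hypothetical CPU reference: reverse the bit order within each byte of x
    static unsigned int bitreverse_ref(unsigned int x)
    {
        x = ((0xf0f0f0f0u & x) >> 4) | ((0x0f0f0f0fu & x) << 4);
        x = ((0xccccccccu & x) >> 2) | ((0x33333333u & x) << 2);
        x = ((0xaaaaaaaau & x) >> 1) | ((0x55555555u & x) << 1);
        return x;   // e.g. bitreverse_ref(1) == 128, bitreverse_ref(3) == 192
    }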

Example - host code

    int main(void)
    {
        void *d = NULL; int i;
        unsigned int idata[N], odata[N];

        for (i = 0; i < N; i++)
            idata[i] = (unsigned int) i;

        cudaMalloc((void **) &d, sizeof(int) * N);
        cudaMemcpy(d, idata, sizeof(int) * N, cudaMemcpyHostToDevice);

        bitreverse<<<1, N, N * sizeof(int)>>>(d);

        cudaMemcpy(odata, d, sizeof(int) * N, cudaMemcpyDeviceToHost);

        for (i = 0; i < N; i++)
            printf("%u -> %u\n", idata[i], odata[i]);

        cudaFree((void *) d);
        return 0;
    }
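
The host code above does not check CUDA API return codes. While debugging, it can help to wrap each call so failures are reported immediately; a minimal sketch, not part of the original example (the macro name CUDA_CHECK is arbitrary):

    // hypothetical helper: abort with file/line information when a CUDA runtime call fails
    #define CUDA_CHECK(call)                                           \
        do {                                                           \
            cudaError_t err_ = (call);                                 \
            if (err_ != cudaSuccess) {                                 \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                        cudaGetErrorString(err_));                     \
                exit(EXIT_FAILURE);                                    \
            }                                                          \
        } while (0)

    // usage: CUDA_CHECK(cudaMalloc((void **) &d, sizeof(int) * N));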

Example - start debugging

    $ nvcc -g -G bitreverse.cu -o bitreverse
    $ cuda-gdb bitreverse
    (cuda-gdb) b main               // breakpoint
    Breakpoint 1 at 0x400950: file bitreverse.cu, line 25
    (cuda-gdb) b bitreverse         // breakpoint
    Breakpoint 2 at 0x400b06: file bitreverse.cu, line 8
    (cuda-gdb) b bitreverse.cu:21   // breakpoint
    Breakpoint 3 at 0x400b12: file bitreverse.cu, line 21

- compile the code with the appropriate flags,
- run cuda-gdb,
- put a breakpoint at a host function,
- put a breakpoint at a device function,
- put a breakpoint at a given line number.

Example

    (cuda-gdb) r    // run
    Starting program: /macierz/home/psysiu/dydaktyka/cuda/debugging/bitreverse
    [Thread debugging using libthread_db enabled]
    [New process 26269]
    [New Thread 140057761130272 (LWP 26269)]
    [Switching to Thread 140057761130272 (LWP 26269)]
    Breakpoint 1, main () at bitreverse.cu:25
    25    void *d = NULL; int i;

- run the program,
- new process and thread information is printed,
- execution stops at the first breakpoint.

Example

At this point, commands can be entered to advance execution and/or print program state. For this walkthrough we will continue to the device kernel.

    (cuda-gdb) c    // continue
    Continuing.
    [Context Create of context 0x1a53e10 on Device 0]
    Breakpoint 3 at 0x1b92070: file bitreverse.cu, line 21.
    [Launch of CUDA Kernel 0 (bitreverse<<<(1,1,1),(256,1,1)>>>) on Device 0]
    [Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 2, warp 0]
    Breakpoint 2, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x200100000) at bitreverse.cu:9
    warning: Source file is more recent than executable.
    9    unsigned int *idata = (unsigned int *) data;

- a new context is created on the first device,
- the kernel is launched,
- the breakpoint in the kernel is hit,
- cuda-gdb has detected that we have reached a CUDA device kernel, so it prints the current CUDA thread and focus.
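
A related convenience (an assumption on my part, not shown in the original deck, and only available in some cuda-gdb versions): the debugger can be asked to stop automatically at the entry of every application kernel launch, without setting explicit breakpoints:

    (cuda-gdb) set cuda break_on_launch application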

Example

    (cuda-gdb) info threads
      2 Thread 139666865731360 (LWP 26716)  0x00007f06b7f1ee89 in ?? () from /usr/lib/libcuda.so
      1 process 26716                       0x00007f06b7f1ee89 in ?? () from /usr/lib/libcuda.so

    (cuda-gdb) info cuda threads
      BlockIdx ThreadIdx  To BlockIdx ThreadIdx  Count  VirtualPC           Filename       Line
    Kernel 0
    * (0,0,0)  (0,0,0)       (0,0,0) (255,0,0)   256    0x00000000020df868  bitreverse.cu  9

The above output tells us that the host thread of focus has LWP ID 26716 and the current CUDA thread has block coordinates (0,0,0) and thread coordinates (0,0,0).

Example

    (cuda-gdb) info cuda threads
      BlockIdx ThreadIdx  To BlockIdx ThreadIdx  Count  VirtualPC           Filename       Line
    Kernel 0
    * (0,0,0)  (0,0,0)       (0,0,0) (255,0,0)   256    0x0000000001b91868  bitreverse.cu  9

    (cuda-gdb) thread
    [Current thread is 2 (Thread 140057761130272 (LWP 26269))]

    (cuda-gdb) thread 2
    [Switching to thread 2 (Thread 140057761130272 (LWP 26269))]
    #0  0x00007f61bbe16f05 in pthread_mutex_lock ()

    (cuda-gdb) bt    // backtrace
    #0  0x00007f61bbe16f05 in pthread_mutex_lock () from /lib/libpthread.so.0
    #1  0x00007f61bb1d59b7 in ?? () from /usr/lib/libcuda.so
    #2  0x00007f61bb20391e in ?? () from /usr/lib/libcuda.so
    #3  0x00007f61bb2098d5 in ?? () from /usr/lib/libcuda.so
    #4  0x00007f61bb209ab6 in ?? () from /usr/lib/libcuda.so
    #5  0x00007f61bb1f495b in ?? () from /usr/lib/libcuda.so
    #6  0x00007f61bb1eae9c in ?? () from /usr/lib/libcuda.so
    #7  0x00007f61bb1cae41 in ?? () from /usr/lib/libcuda.so
    #8  0x00007f61bb1cebc8 in ?? () from /usr/lib/libcuda.so
    #9  0x00007f61bb1c1244 in ?? () from /usr/lib/libcuda.so
    #10 0x00007f61bcd48de2 in ?? () from /usr/local/cuda/lib64/libcudart.so.4
    #11 0x00007f61bcd6c824 in cudaMemcpy () from /usr/local/cuda/lib64/libcudart.so.4
    #12 0x0000000000400a3d in main () at bitreverse.cu:37

Example cont.

- get info about CUDA (device) threads,
- show the current host thread,
- switch to host thread 2,
- show the backtrace of the host thread.

Example

    (cuda-gdb) cuda thread 170
    [Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (170,0,0), device 0, sm 2, warp 0]
    9    unsigned int *idata = (unsigned int *) data;
    (cuda-gdb) n    // next
    12   array[threadIdx.x] = idata[threadIdx.x];
    (cuda-gdb) n
    14   array[threadIdx.x] = ((0xf0f0f0f0 & array[threadIdx.x]) >> 4)
    (cuda-gdb) n
    16   array[threadIdx.x] = ((0xcccccccc & array[threadIdx.x]) >> 2)
    (cuda-gdb) n
    18   array[threadIdx.x] = ((0xaaaaaaaa & array[threadIdx.x]) >> 1)
    (cuda-gdb) n
    Breakpoint 3, bitreverse<<<(1,1,1),(256,1,1)>>> (data=0x200100000)
    21   idata[threadIdx.x] = array[threadIdx.x];

- switch to thread number 170,
- step through the code.

Example

    (cuda-gdb) p &array
    $13 = (@shared int (*)[0]) 0x0
    (cuda-gdb) p array[0]@256
    $8 = {0, 128, 64, 192, 32, 160, 96, 224, 16, 144, 80, 208, 48, 176, 112, 240, 8, 136, 72, 200,
          40, 168, 104, 232, 24, 152, 88, 216, 56, 184, 120, 248, 4, 132, 68, 196, 36, 164, 100, 228,
          20, 148, 84, 212, 52, 180, 116, 244, 12, 140, 76, 204, 44, 172, 108, 236, 28, 156, 92, 220,
          60, 188, 124, 252, 2, 130, 66, 194, 34, 162, 98, 226, 18, 146, 82, 210, 50, 178, 114, 242,
          10, 138, 74, 202, 42, 170, 106, 234, 26, 154, 90, 218, 58, 186, 122, 250, 6, 134, 70, 198,
          38, 166, 102, 230, 22, 150, 86, 214, 54, 182, 118, 246, 14, 142, 78, 206, 46, 174, 110, 238,
          30, 158, 94, 222, 62, 190, 126, 254, 1, 129, 65, 193, 33, 161, 97, 225, 17, 145, 81, 209,
          49, 177, 113, 241, 9, 137, 73, 201, 41, 169, 105, 233, 25, 153, 89, 217, 57, 185, 121, 249,
          5, 133, 69, 197, 37, 165, 101, 229, 21, 149, 85, 213, 53, 181, 117, 245, 13, 141, 77, 205,
          45, 173, 109, 237, 29, 157, 93, 221, 61, 189, 125, 253, 3, 131, 67, 195, 35, 163, 99, 227...}
    (cuda-gdb) p &data
    $15 = (@generic void * @parameter *) 0x20
    (cuda-gdb) p *(@generic void * @parameter *) 0x20
    $16 = (@generic void * @parameter) 0x200100000
    (cuda-gdb) delete b
    Delete all breakpoints? (y or n) y
    (cuda-gdb) continue
    Continuing.

- check the shared memory content,
- delete the breakpoints and continue the program.

Bibliography I

[1] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.
[2] NVIDIA CUDA Debugger (cuda-gdb). http://docs.nvidia.com/cuda/cuda-gdb/index.html. Accessed 2014-01-24.