GPU Tools Sandra Wienke



Sandra Wienke
Center for Computing and Communication (Rechen- und Kommunikationszentrum, RZ), RWTH Aachen University
MATSE HPC Battle 2012/13

Agenda
- IDE: Eclipse
- Debugging (CUDA): TotalView
- Profiling (CUDA & OpenACC): NVIDIA Visual Profiler
- Appendix: Debugging host code with TotalView

IDE - Eclipse
Eclipse + Parallel Nsight: IDE for GPU programming
- CUDA syntax highlighting, CUDA debugging, CUDA profiling
- OpenACC programming, OpenACC profiling
Using Nsight on the RWTH Cluster environment: run "module load cuda", then "nsight"
1. Choose a workspace
2. File > New > Makefile Project with Existing Code
3. Choose the file directory of the CG Solver
4. Toolchain: CUDA Toolkit 5.0
5. Create Makefile targets (Makefile tab in right pane), e.g. cuda, clean
6. Double-click on a Makefile target for execution
Or download Parallel Nsight: http://www.nvidia.com/object/nsight.html

IDE - Eclipse
Debugging (CUDA)
- Use the debug configuration of the Makefile to create the executable
- Press the debug button (green bug)
- Proceed (even if errors are reported)
- Switch to the debug perspective (the application will suspend in the main function; at this point there is no GPU code running)
- Add a breakpoint in the device code
- Resume the application
Profiling (CUDA, OpenACC)
- Internally uses NVIDIA's Visual Profiler (see later)
- Use the release target of the Makefile to create the executable
- Press the profile button (watch)
- Proceed (even if errors are reported)
- Switch perspective
- Output/interpretation: see chapter NVIDIA Visual Profiler

Debugging (CUDA)
- Debugging host code as usual
- Debugging GPU kernels requires special tools
  - CUDA debuggers available
  - OpenACC debuggers not available
- Compiling CUDA applications: nvcc [-arch=sm_20] mykernel.cu
  - RWTH Cluster environment: module load cuda
- Debugging flags: -g -G
  - nvcc -g -G [-arch=sm_20] mykernel.cu (see Makefile debug target)
- CUDA command line tools
  - Debugger: cuda-gdb
  - Detecting memory access errors: cuda-memcheck

Debugging (CUDA)
- CUDA GUI-based debugger: TotalView
  - Debugging host and device code in the same session
  - Thread navigation by logical or physical coordinates
  - Displaying hierarchical memory
  - General information on debugging with TotalView can be found in the appendix
- CUDA GUI-based debugger: Eclipse (see above)
- RWTH Cluster environment: run "module load totalview", then "totalview"
  - If you get an error concerning the CUDA version, try to compile your application with CUDA 4.1: module switch cuda cuda/41

Debugging (CUDA) - TotalView
Setting breakpoints in CUDA kernels
- Start debugging (e.g. "Go")
- A message box appears when the kernel is loaded
- Set kernel breakpoints as in host code

Debugging (CUDA) - TotalView
- Debugger thread IDs in a Linux CUDA process
  - Host thread: positive number
  - CUDA thread: negative number
- GPU thread navigation
  - Logical coordinates: blocks (3 dimensions), threads (3 dimensions)
  - Physical coordinates: device, SM, warp, core/lane
  - Only valid selections are permitted
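To make the link between logical thread coordinates and warp/lane positions more concrete, here is a minimal host-side C++ sketch. It assumes the usual CUDA linearization of a thread within its block (x varies fastest) and a warp size of 32. The names Dim3, WarpPosition and locateThread are hypothetical helpers for illustration, not part of CUDA or TotalView. The result is the warp number within the block; the hardware SM and warp slot shown by the debugger are assigned at runtime and cannot be derived this way.

```cpp
#include <cassert>

// Hypothetical helper types for illustration; not part of CUDA or TotalView.
struct Dim3 { unsigned x, y, z; };
struct WarpPosition { unsigned linearThread, warpInBlock, lane; };

constexpr unsigned kWarpSize = 32;  // warp size assumed to be 32 threads

// Linearize a thread's logical coordinates within its block (x varies fastest),
// then split the linear index into a warp number within the block and a lane.
inline WarpPosition locateThread(Dim3 threadIdx, Dim3 blockDim) {
    assert(threadIdx.x < blockDim.x && threadIdx.y < blockDim.y &&
           threadIdx.z < blockDim.z);
    unsigned linear = threadIdx.x
                    + threadIdx.y * blockDim.x
                    + threadIdx.z * blockDim.x * blockDim.y;
    return WarpPosition{linear, linear / kWarpSize, linear % kWarpSize};
}
```

For example, thread (5, 2, 0) in a 16 x 16 x 1 block has linear index 37, so it sits in warp 1 of its block at lane 5.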

Debugging (CUDA) - TotalView
- Warp: group of 32 threads
  - Share one PC
  - Advance synchronously
  - Problem: diverging threads, e.g. if (threadIdx.x > 2) {...} else {...}
- Single stepping
  - Advances all GPU hardware threads within the same warp
  - Stepping over a __syncthreads() call advances all threads within the block
- Advancing more than just one warp
  - Run To a selected line number in the source pane
  - Set a breakpoint and Continue the process
- Halt
  - Stops all the host and device threads
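The divergence example can be illustrated with a small host-side C++ simulation. This is only a sketch: it assumes a one-dimensional block, so that each warp covers 32 consecutive threadIdx.x values, and it models the branch condition threadIdx.x > 2 from the example. BranchMasks, evaluateBranch and diverges are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// One bit per lane: which lanes take the branch and which take the else path.
struct BranchMasks { std::uint32_t taken, notTaken; };

// Evaluate `threadIdx.x > 2` for the 32 lanes of one warp whose first lane
// has thread index firstThreadX (assumed to be a multiple of 32).
inline BranchMasks evaluateBranch(unsigned firstThreadX) {
    assert(firstThreadX % 32 == 0);
    BranchMasks m{0, 0};
    for (unsigned lane = 0; lane < 32; ++lane) {
        unsigned threadIdxX = firstThreadX + lane;
        if (threadIdxX > 2) m.taken |= (1u << lane);
        else                m.notTaken |= (1u << lane);
    }
    return m;
}

// A warp diverges when some lanes take the branch and others do not.
inline bool diverges(const BranchMasks& m) {
    return m.taken != 0 && m.notTaken != 0;
}
```

Under these assumptions only the first warp of a block diverges: lanes 0 to 2 take the else path while lanes 3 to 31 take the if path. Every later warp takes the if path as a whole.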

Debugging (CUDA) - TotalView
- Displaying CUDA device properties: Tools > CUDA Devices
- Helps mapping between logical & physical coordinates
- Shows PCs across SMs, warps, lanes
- GPU thread divergence? Different PCs within a warp indicate diverging threads

Debugging (CUDA) - TotalView
Displaying GPU data
- Dive into a variable or watch it
- Type it into the Expression List
- Device memory spaces: @ notation

Storage qualifier   Meaning of address
@global             Offset within global storage
@shared             Offset within shared storage
@local              Offset within local storage
@register           PTX register name
@generic            Offset within generic address space (e.g. pointer to global, local or shared memory)
@constant           Offset within constant storage
@texture            Offset within texture storage
@parameter          Offset within parameter storage

Debugging (CUDA) - TotalView
Checking GPU memory
- Enable CUDA memory checking during startup or in the Debug menu
- Detects global memory addressing violations and misaligned global memory accesses
Further features
- Multi-device support
- Host-pinned memory support
- MPI-CUDA applications

Debugging (CUDA) - Tips
Check CUDA API calls
- All CUDA API routines return an error code (cudaError_t)
- Alternatively, cudaGetLastError() returns the last error from a CUDA runtime call
- cudaGetErrorString(cudaError_t) returns the corresponding message
1. Write a macro to check CUDA API return codes, or use the SafeCall and CheckError macros from cutil.h (NVIDIA GPU Computing SDK)
2. Use TotalView to examine the return code
   - Evaluate the CUDA API call in the expression list
   - If needed, dive on the error value and typecast it to a cudaError_t type
   - You can also wrap the API call in cudaGetErrorString() in the expression field and typecast it to char[xx]*
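A return-code checking macro of the kind suggested in step 1 might look like the following sketch. To keep it compilable without the CUDA toolkit, it uses stand-ins: StubError, stubSuccess and stubGetErrorString are placeholders invented for this example. With CUDA available, they would be replaced by cudaError_t, cudaSuccess and cudaGetErrorString() from cuda_runtime.h. The names CHECK_CALL and checkCall are hypothetical as well.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Stand-ins so the sketch compiles without the CUDA toolkit (assumption).
// With CUDA, use cudaError_t, cudaSuccess and cudaGetErrorString() instead.
enum StubError { stubSuccess = 0, stubErrorInvalidValue = 1 };

inline const char* stubGetErrorString(StubError e) {
    return e == stubSuccess ? "no error" : "invalid argument";
}

// Throws with source location and error message if the call did not succeed.
inline void checkCall(StubError code, const char* expr,
                      const char* file, int line) {
    assert(expr != nullptr && file != nullptr);
    if (code != stubSuccess) {
        std::ostringstream msg;
        msg << file << ":" << line << ": " << expr
            << " failed: " << stubGetErrorString(code);
        throw std::runtime_error(msg.str());
    }
}

// Hypothetical macro name; wrap every API call that returns an error code.
#define CHECK_CALL(call) checkCall((call), #call, __FILE__, __LINE__)
```

Each API call is then written as CHECK_CALL(someApiCall(...)), so a failure is reported together with the file, line and call text instead of being silently ignored. Throwing an exception is one design choice; printing the message and exiting would work equally well.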

Debugging (CUDA) - Tips
Check + use available hardware features
- printf statements are possible within kernels (since Fermi)
- Use double precision floating point operations (since GT200)
- Enable ECC and check whether single or double bit errors occurred using nvidia-smi -q (since Fermi)
Check final numerical results on host
- While porting, it is recommended to compare all computed GPU results with host results
1. Compute checksums of GPU and host array values
2. If not sufficient, compare arrays element-wise
- Comparative debugging approach, e.g. statistics view
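The two-step comparison of GPU and host results can be sketched on the host as follows. This is an illustrative example, not code from the talk: it assumes the GPU results have already been copied back into a host vector, and the function names and tolerances are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Step 1: cheap check. Sum both arrays in double precision and compare the
// sums within a relative tolerance.
inline bool checksumsMatch(const std::vector<float>& host,
                           const std::vector<float>& gpu, double relTol) {
    double sumHost = 0.0, sumGpu = 0.0;
    for (float v : host) sumHost += v;
    for (float v : gpu)  sumGpu  += v;
    return std::fabs(sumHost - sumGpu)
           <= relTol * std::fmax(1.0, std::fabs(sumHost));
}

// Step 2: element-wise check. Returns the index of the first element that
// differs by more than absTol, or host.size() if there is none.
inline std::size_t firstMismatch(const std::vector<float>& host,
                                 const std::vector<float>& gpu, float absTol) {
    assert(host.size() == gpu.size());
    for (std::size_t i = 0; i < host.size(); ++i)
        if (std::fabs(host[i] - gpu[i]) > absTol) return i;
    return host.size();
}
```

A checksum alone can miss errors that cancel out: the arrays {1, 2, 3, 4} and {1, 3, 2, 4} have the same sum, but the element-wise check reports a mismatch at index 1.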

Debugging (CUDA) - Tips
Check intermediate results
- If results are directly stored in global memory: dive on the result array
- If results are stored in on-chip memory (e.g. registers): tedious debugging
  - TotalView: view of variables across CUDA threads not possible yet
1. Create an additional array on the host for intermediate results with size #threads * #results * sizeof(result)
   - Use the array on the GPU: each thread stores its result at a unique index
   - Transfer the array back to the host and examine the results
2. If the number of thread blocks is limited: create an additional array in shared memory within the kernel function: __shared__ <type> myarray[size]
   - Use defines to exchange access to the on-chip variable with array access
   - Examine results by diving on the array and switching between blocks
- Use filter, array statistics, freeze, duplicate, last values and watchpoints
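The indexing behind the first approach (one debug array with #threads * #results entries, each thread writing to its own slots) can be sketched on the host as below. The nested loops stand in for a kernel launch, and the stored value is an arbitrary placeholder for a real intermediate result. slot and collectIntermediates are hypothetical names chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unique slot for result number r of the thread with global index `thread`.
inline std::size_t slot(std::size_t thread, std::size_t r,
                        std::size_t resultsPerThread) {
    assert(r < resultsPerThread);
    return thread * resultsPerThread + r;
}

// Host-side stand-in for a kernel launch: every (block, thread) pair writes
// its intermediate values into its own slots of the debug array.
inline std::vector<int> collectIntermediates(std::size_t blocks,
                                             std::size_t threadsPerBlock,
                                             std::size_t resultsPerThread) {
    std::vector<int> debug(blocks * threadsPerBlock * resultsPerThread, -1);
    for (std::size_t b = 0; b < blocks; ++b) {
        for (std::size_t t = 0; t < threadsPerBlock; ++t) {
            // Same formula a kernel would use:
            // blockIdx.x * blockDim.x + threadIdx.x
            std::size_t global = b * threadsPerBlock + t;
            for (std::size_t r = 0; r < resultsPerThread; ++r)
                debug[slot(global, r, resultsPerThread)] =
                    static_cast<int>(global * 10 + r);  // placeholder value
        }
    }
    return debug;
}
```

Because every thread owns a disjoint range of slots, no two threads overwrite each other, and after the transfer back to the host the array can be inspected per thread.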

Profiling (CUDA & OpenACC)
- Profiling = analyzing the behavior of an application during runtime, e.g. runtime of functions, memory throughput
- NVIDIA Visual Profiler for CUDA & OpenACC codes
  - Profiles only GPU data movement & computation (not host code)
1. Compile your program and start the profiler: nvvp
   - RWTH Cluster environment: run "module load cuda", then "nvvp"
2. Select File > New Session
3. Choose your executable as file
   - Specify arguments, e.g. the matrix file
   - Specify environment variables, e.g. CG_MAX_ITER
4. If you want to shorten the execution time, set a timeout limit
5. Finish the session configuration & wait for results

Profiling (CUDA & OpenACC)
Session tab, timeline view:
- Long memory copy from host to device
- Short memory copy from device to host
- Compute time for first kernel
- Collapse to see summarized info only
Questions to ask:
- Is data only transferred when needed?
- Which kernel needs the most time?

Profiling (CUDA & OpenACC)
- Analysis tab: gives hints for optimization (not always useful)
- Details tab: switch from the Analysis tab to the Details tab
  - Shows runtime and grid dimensions
  - Kernel name: <func>_<line>_gpu
  - On the right-hand side, activate the summary view

Appendix - Debugging host code
Start TotalView and select your program to debug

Appendix - Debugging host code
Process window of TotalView
- Toolbar
- Process and Thread Status
- Stack Trace Pane
- Stack Frame Pane
- Source Pane
- Tabbed Pane

Appendix - Debugging host code
Breakpoints
- Interrupt execution when reaching a specific code line
- Conditional breakpoints possible
- Set by clicking in the source pane
- Temporary disabling is possible
Watchpoints
- Interrupt when a change occurs to a specific memory location
- Conditional watchpoints possible (e.g. only stop if the sign of the value changes or a specified threshold is reached)

Appendix - Debugging host code
Setting a breakpoint

Appendix - Debugging host code
Inspecting an array in C/C++
- Double-click on the array name
- Typecast necessary

Appendix - Debugging host code
Data visualizations are helpful for big data arrays

Appendix - Debugging host code
Create a watchpoint for a[29]

Appendix - Debugging host code
The watchpoint will interrupt execution as soon as a[29] changes