
Hands-On: CUDA Tools and Performance Optimization
JSC GPU Programming Course, 26 March 2011
Dominic Eschweiler (Member of the Helmholtz Association)

Outline of This Talk
- Introduction
- Setup
- CUDA-GDB
- Profiling
- Performance

Introduction
Every section (and subsection) of the exercise paper is a task for the hands-on. The task description tells you which file to open. After opening the file, look for the parts that are marked with TODO:

// TODO: Some descriptions
...
//
...
// TODO //

Every informational part on these slides is marked with bullets; every task is numbered.

Setup Your Tools (Exercise 2)
1 Use ssh -X jugipsy to open a remote session on our GPU system.
2 Check that X forwarding works (type cudaprof).
3 Extract testbed.tar.gz into your home directory (tar -xzf testbed.tar.gz) and change into its folder.

CUDA-GDB
Debugging CUDA programs is a hard task because the compute kernel is a black box: printf (except on Fermi) and system calls are not available inside kernels. NVIDIA offers a special version of GDB for debugging that is able to debug kernels.

CUDA-GDB (Exercise 2.1)
1 Change to the gdb subfolder: cd gdb.
2 Type make and then do an ls.
3 Open the file gdb_test.cu with a text editor of your choice.
4 Now start it under cuda-gdb by typing cuda-gdb ./gdb_test.

(CUDA-)GDB Commands (Exercise 2.1)
break <filename.cu>:<linenumber> - add a breakpoint at this line in the named file.
break <functionname> - add a breakpoint at the beginning of this function.
run - execute the program and stop at the first breakpoint.
continue - go on to the next breakpoint.
next - step to the next line.
step - step into the current function.
print <variable> - print the current value of this variable.
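As a hedged illustration of what you could step through with the commands above (this is not the actual contents of gdb_test.cu; kernel and variable names are made up), a minimal debuggable CUDA program might look like this:

```cuda
#include <cstdio>

// Hypothetical example kernel: each thread squares one element.
__global__ void square(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * data[i];   // "break square" stops at kernel entry
}

int main()
{
    const int n = 8;
    float h[n], *d;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    square<<<1, n>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[3] = %f\n", h[3]);   // inspect with "print h[3]" in the debugger
    return 0;
}
```

Building with nvcc -g -G keeps both host and device debug information, which is what lets cuda-gdb place breakpoints inside the kernel.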

Profiling
Measuring the durations of different function calls at runtime is a hard job if the programmer only uses printf() and gettimeofday():
- it introduces alien code into the program,
- it is not very fault tolerant,
- it is not clear whether the results can be trusted.
A profiler is a performance analysis tool that, most commonly, measures the frequency and duration of function calls. cudaprof is NVIDIA's profiler for CUDA.
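The hand-rolled timing the slide warns about looks roughly like this (a sketch; the kernel name and sizes are made up). Note how easy it is to get a meaningless number, which is exactly the fault-tolerance problem mentioned above:

```cuda
#include <cstdio>
#include <sys/time.h>

__global__ void some_kernel(float *p) { /* ... work ... */ }

int main()
{
    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);            // "alien" timing code inside the program

    some_kernel<<<4, 256>>>(d);
    cudaDeviceSynchronize();            // forgetting this makes the result garbage,
                                        // because kernel launches are asynchronous
    gettimeofday(&t1, NULL);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3
              + (t1.tv_usec - t0.tv_usec) / 1e3;
    printf("kernel took %.3f ms\n", ms);   // no per-call breakdown, no counters

    cudaFree(d);
    return 0;
}
```

A profiler gives the same elapsed time plus hardware counters, without touching the program source.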

Profile Your Code (Exercise 2.2)
1 Start the profiler (run cudaprof).
2 You will then see the main window.
Figure: CUDA profiler main window.

Profile Your Code (Exercise 2.2)
1 Add a new project by clicking on File and then on New.
2 Type in a name and press OK.
Figure: New project dialog.

Profile Your Code (Exercise 2.2)
1 Click Start and look at the profiling output.
2 Open cudaprof.pdf and find out what each counter means.
Figure: CUDA profiler with the project dialog.

Performance: GPU Architecture
Figure: The G80 architecture (host interface, input assembler, vertex/geometry/pixel thread issue, streaming processors (SP), texture fetch units (TF), L1/L2 caches, framebuffer partitions (FB)).

Performance: Memory Layer
Figure: Memory hierarchy (shared memory, global memory, host memory).
Figure: Memory organization (per-thread accesses to shared memory cost about 4 cycles; accesses to local and global memory cost about 600 cycles).

Performance: Make (Exercise 3.1)
1 Go to the main folder of the testbed.
2 Type make debug:

ptxas info : Compiling entry function _Z15P1_Fixed_KernelPfS_S_
ptxas info : Used 6 registers, 24+16 bytes smem, 12 bytes cmem[1]
ptxas info : Compiling entry function _Z16P1_Broken_KernelPfS_S_
ptxas info : Used 6 registers, 24+16 bytes smem, 12 bytes cmem[1]

3 Go to the subfolder cudals and type make again.
4 Type ./cudals and find out how many registers the GPU has and how many threads per block can be launched.

Performance: Excessive Global Memory Usage (Exercise 3.2)
Figure: Shared memory is much faster than global memory.
This part demonstrates a performance issue where the kernel uses only global memory for performing calculations. The better way is to use registers or shared memory to store intermediate results.
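A sketch of the issue (kernel names are made up, not the ones in the testbed): the same computation with intermediates kept in global memory versus in a register.

```cuda
// Bad: every intermediate step is a global-memory round trip
// (roughly 600 cycles each, per the memory-layer slide).
__global__ void scale_global(float *data, float *scratch)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[i] = data[i] * 2.0f;     // global write
    scratch[i] = scratch[i] + 1.0f;  // global read + write again
    data[i]    = scratch[i];         // global read + write again
}

// Better: accumulate in a register, touch global memory once per direction.
__global__ void scale_register(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float tmp = data[i];   // one global read
    tmp = tmp * 2.0f;      // intermediates live in a register
    tmp = tmp + 1.0f;
    data[i] = tmp;         // one global write
}
```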

Performance: Bank Conflicts (Exercise 3.3)
Shared memory is a parallel memory that is distributed over several banks, and every bank can be accessed by only one thread at a time. If more than one thread tries to access the same bank of the shared memory, the execution is serialized.
Figure: Bank conflicts.
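A sketch of the access patterns (names are made up; this assumes the 16-bank shared memory of G80-class hardware, where consecutive 4-byte words fall into consecutive banks and conflicts are resolved per half-warp):

```cuda
__global__ void bank_access(float *out)
{
    __shared__ float buf[512];
    buf[threadIdx.x] = (float)threadIdx.x;   // fill the buffer
    __syncthreads();

    // Conflict-free: thread t reads bank (t mod 16),
    // so every thread of a half-warp hits a different bank.
    float a = buf[threadIdx.x];

    // 2-way conflict: with stride 2, threads t and t+8 of a half-warp
    // map onto the same bank, so their accesses are serialized.
    float b = buf[(2 * threadIdx.x) % 512];

    out[threadIdx.x] = a + b;
}
```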

Performance: Memory Coalescing (Exercise 3.4)
Figure: Sometimes memory accesses can be coalesced.
One transfer cycle between global and shared memory always has a size of 128 bits. The compiler automatically performs every smaller access as a 128-bit transfer. The better way is to coalesce smaller transfers into bigger ones (if possible).
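A sketch of the idea (kernel names assumed): four separate 32-bit loads per thread versus one 128-bit load through the built-in float4 type.

```cuda
// Four separate 32-bit accesses per thread ...
__global__ void copy_scalar(float *dst, const float *src)
{
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    dst[i]     = src[i];
    dst[i + 1] = src[i + 1];
    dst[i + 2] = src[i + 2];
    dst[i + 3] = src[i + 3];
}

// ... versus a single 128-bit access per thread.
__global__ void copy_vector(float4 *dst, const float4 *src)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];   // one 128-bit transfer instead of four 32-bit ones
}
```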

Performance: Scattered Host Transfer (Exercise 3.5)
Figure: Scattered versus non-scattered host transfer.
It can happen that the input data is not located in a consecutive buffer on the host. Scattering the copies directly between host and device memory is a bad idea.
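A host-side sketch (function and parameter names are made up): instead of one cudaMemcpy per scattered fragment, gather the fragments into one contiguous staging buffer first and pay the transfer latency only once.

```cuda
#include <cstdlib>
#include <cstring>

// Bad: n small host-to-device transfers, each paying full latency.
void copy_scattered(float *dev, float **fragments, size_t *len, int n)
{
    size_t off = 0;
    for (int i = 0; i < n; ++i) {
        cudaMemcpy(dev + off, fragments[i], len[i] * sizeof(float),
                   cudaMemcpyHostToDevice);
        off += len[i];
    }
}

// Better: cheap host-side memcpys into a staging buffer,
// then a single large transfer to the device.
void copy_staged(float *dev, float **fragments, size_t *len, int n,
                 size_t total)
{
    float *staging = (float *)malloc(total * sizeof(float));
    size_t off = 0;
    for (int i = 0; i < n; ++i) {
        memcpy(staging + off, fragments[i], len[i] * sizeof(float));
        off += len[i];
    }
    cudaMemcpy(dev, staging, total * sizeof(float), cudaMemcpyHostToDevice);
    free(staging);
}
```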

Performance: Thread Register Imbalance (Exercise 3.6)
A kernel should always use every available thread slot on the multiprocessor. This can be limited by the number of registers that are used per thread (see the compiler output).
Figure: Keep the multiprocessors busy.
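One way to influence this (a sketch; the kernel name and the bounds are illustrative) is to ask the compiler to cap register usage so that more blocks fit on a multiprocessor:

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// tells the compiler to keep register usage low enough that, with
// 256-thread blocks, at least 4 blocks can be resident per multiprocessor.
__global__ void __launch_bounds__(256, 4)
heavy_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * data[i] + 1.0f;
}
```

Alternatively, pass --maxrregcount=<n> to nvcc and compare the "ptxas info : Used N registers" lines from Exercise 3.1 before and after.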

Performance: Wait at Barrier (Exercise 3.7)
Sometimes some initialization is needed that must be separated from the calculation steps by a barrier. If shared memory is used in this initialization step, an easy way to reduce the barrier waiting time is to let only one thread do the initialization.
Figure: Wait at barrier.
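A sketch of the pattern (kernel and table names are made up): one thread fills a shared lookup table, and the single __syncthreads() barrier separates the initialization from the calculation.

```cuda
__global__ void apply_lut(float *data)
{
    __shared__ float lut[256];

    if (threadIdx.x == 0) {
        // Only thread 0 initializes the shared table;
        // the other threads skip straight to the barrier.
        for (int i = 0; i < 256; ++i)
            lut[i] = i * 0.5f;
    }
    __syncthreads();   // everyone waits here before reading lut

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] += lut[threadIdx.x % 256];
}
```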

Performance: Branch Diverging (Exercise 3.8)
Branching is traditionally a hard job for SIMD architectures. Branches in CUDA have no impact on performance only if they are aligned to warp borders.
Figure: Branch diverging.
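A sketch of both cases (kernel names assumed): the same two-way branch, once diverging inside every warp and once aligned to 32-thread warp borders.

```cuda
// Diverging: the even/odd split puts both paths into every warp,
// so each warp executes the if-case AND the else-case serially.
__global__ void branch_diverging(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

// Aligned: all 32 threads of a warp take the same path,
// so no warp pays for both branches.
__global__ void branch_aligned(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}
```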