Hands On: CUDA Tools and Performance Optimization
JSC GPU Programming Course, 26 March 2011
Dominic Eschweiler
Outline of This Talk
Introduction
Setup
CUDA-GDB
Profiling
Performance
Introduction
Every section (and subsection) of the exercise paper is a task for the hands-on. The task description tells you which file to open. After opening the file, look for the parts marked with TODO:

    /* TODO: Some descriptions
       ...
    */
    ...
    /* TODO */

Every informational part on these slides has bullets; every task has a number.
Setup Your Tools (Exercise 2)
1 Open a remote session to our GPU system: ssh -X jugipsy
2 Check that X forwarding works (type cudaprof).
3 Extract the testbed to your home directory (tar -xzf testbed.tar.gz) and change into its folder.
CUDA-GDB
Debugging CUDA programs is hard because the compute kernel is a black box. Neither printf (except on Fermi) nor system calls are available inside kernels. NVIDIA offers a special version of GDB, cuda-gdb, which is able to debug kernels.
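To debug device code, the program has to be compiled with device-side debug information; the Makefile in the exercise presumably does the equivalent of:

    nvcc -g -G -o gdb_test gdb_test.cu

Here -g emits host debug information and -G device debug information (which also disables most device-code optimizations, so the binary runs slower).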
CUDA-GDB (Exercise 2.1)
1 Change to the gdb subfolder: cd gdb
2 Type make, then ls.
3 Open the file gdb_test.cu in a text editor of your choice.
4 Start it under the debugger: cuda-gdb ./gdb_test
(CUDA-)GDB Commands (Exercise 2.1)
break <filename.cu>:<linenumber>   Add a breakpoint on this line in the named file.
break <functionname>               Add a breakpoint at the beginning of this function.
run                                Execute the program and halt at the first breakpoint.
continue                           Continue to the next breakpoint.
next                               Step to the next line.
step                               Step into the current function.
print <variable>                   Print the current value of this variable.
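A possible session might look like this; the line number and variable name are hypothetical, since the actual contents of gdb_test.cu are in the testbed:

    (cuda-gdb) break gdb_test.cu:42
    (cuda-gdb) run
    (cuda-gdb) print result
    (cuda-gdb) continue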
Profiling
Measuring the duration of different function calls at runtime is a hard job if the programmer only uses printf() and gettimeofday():
it introduces alien code into the program,
it is not very fault tolerant,
it is not clear whether one can trust the results.
A profiler is a performance analysis tool that, most commonly, measures the frequency and duration of function calls. cudaprof is NVIDIA's profiler for CUDA.
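For comparison, timing a single kernel by hand already takes this much scaffolding; a minimal sketch using CUDA events instead of gettimeofday (the kernel name my_kernel is hypothetical):

    #include <stdio.h>

    __global__ void my_kernel(float *data)
    {
        data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
    }

    int main(void)
    {
        float *d_data;
        cudaMalloc((void **)&d_data, 1024 * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        my_kernel<<<4, 256>>>(d_data);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);      /* wait until the kernel has finished */

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel took %f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }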
Profile Your Code (Exercise 2.2)
1 Start the profiler (run cudaprof).
2 After that you will see the main window.
Figure: CUDA profiler main window.
Profile Your Code (Exercise 2.2)
1 Add a new project by clicking on File and then on New.
2 Type in a name and press OK.
Figure: New project dialog.
Profile Your Code (Exercise 2.2)
1 Click Start and look at the profiling output.
2 Open cudaprof.pdf and find out what each counter means.
Figure: CUDA profiler with the project dialog.
Performance: GPU Architecture
Figure: The G80 architecture (host interface, input assembler, setup/raster/z-cull, vertex/geometry/pixel thread issue, streaming processors (SP) with texture fetch (TF) units, L1/L2 caches, and framebuffer (FB) partitions).
Performance: Memory Layer
Figure: Memory hierarchy (shared memory, global memory, host memory).
Figure: Memory organization; a local/shared memory access costs about 4 cycles per thread, a global memory access about 600 cycles.
Performance: Make (Exercise 3.1)
1 Go to the main folder of the testbed.
2 Type make debug. The compiler reports the resource usage of each kernel:

    ptxas info : Compiling entry function '_Z15P1_Fixed_KernelPfS_S_'
    ptxas info : Used 6 registers, 24+16 bytes smem, 12 bytes cmem[1]
    ptxas info : Compiling entry function '_Z16P1_Broken_KernelPfS_S_'
    ptxas info : Used 6 registers, 24+16 bytes smem, 12 bytes cmem[1]

3 Go to the subfolder cudals and type make again.
4 Type ./cudals and find out how many registers the GPU has and how many threads per block can be launched.
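cudals ships with the testbed, so its exact output is not reproduced here; a minimal sketch of how such a tool can query these limits through the standard cudaGetDeviceProperties call:

    #include <stdio.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   /* query device 0 */

        printf("registers per block : %d\n", prop.regsPerBlock);
        printf("threads per block   : %d\n", prop.maxThreadsPerBlock);
        printf("shared mem per block: %lu bytes\n",
               (unsigned long)prop.sharedMemPerBlock);
        return 0;
    }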
Performance: Excessive Global Memory Usage (Exercise 3.2)
Figure: Shared memory is much faster than global memory.
This part demonstrates a performance issue where the kernel only uses global memory for performing its calculations. The better way is to use registers or shared memory to store intermediate results.
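The exercise's own kernels are in the testbed; a minimal sketch of the general pattern, using a hypothetical per-thread sum over n input elements:

    __global__ void sum_broken(float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = 0.0f;
        for (int k = 0; k < n; k++)
            out[i] += in[i * n + k];   /* out[i] is read from and written to
                                          global memory on every iteration */
    }

    __global__ void sum_fixed(float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;              /* intermediate result lives in a register */
        for (int k = 0; k < n; k++)
            acc += in[i * n + k];
        out[i] = acc;                  /* a single global memory write at the end */
    }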
Performance: Bank Conflicts (Exercise 3.3)
Shared memory is a parallel memory that is distributed over several banks, where each bank can only be accessed by one thread at a time. If more than one thread tries to access the same bank of shared memory, the execution is serialized.
Figure: Bank conflicts.
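A minimal sketch of the two access patterns, assuming the 16 banks of pre-Fermi hardware (successive 32-bit words fall into successive banks):

    __global__ void bank_demo(float *out)
    {
        __shared__ float s[16 * 16];
        int t = threadIdx.x;

        s[t % (16 * 16)] = (float)t;     /* fill in some data */
        __syncthreads();

        /* conflicting: with 16 banks, the addresses t * 16 all map to
           bank 0, so the 16 reads of a half-warp are serialized */
        float bad = s[(t * 16) % (16 * 16)];

        /* conflict-free: consecutive threads hit consecutive banks */
        float good = s[t % (16 * 16)];

        out[t] = bad + good;
    }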
Performance: Memory Coalescing (Exercise 3.4)
Figure: Sometimes memory accesses can be coalesced.
One transfer cycle between global and shared memory always has a size of 128 bit. The compiler automatically carries out every smaller access as a 128-bit transfer. The better way is to coalesce smaller transfers into bigger ones (if possible).
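One way to get wider transfers is the built-in float4 vector type, which loads four floats in a single 128-bit access; a minimal sketch, assuming the element count is a multiple of 4 and the buffers are float4-aligned:

    __global__ void copy_scalar(float *in, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];            /* one 32-bit access per thread */
    }

    __global__ void copy_vector(float4 *in, float4 *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];            /* one 128-bit access per thread */
    }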
Performance: Scattered Host Transfer (Exercise 3.5)
Figure: Scattered versus non-scattered host transfer.
It can happen that the input data is not located in one consecutive buffer on the host. Scattering the copies directly between host and device memory is a bad idea.
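A minimal sketch of the fix: gather the scattered fragments into one host staging buffer first, then issue a single large cudaMemcpy instead of many small ones (the helper and its parameters are hypothetical):

    #include <stdlib.h>
    #include <string.h>

    void upload_gathered(float *d_dst, float **fragments, size_t *lens, int n)
    {
        size_t total = 0;
        for (int i = 0; i < n; i++)
            total += lens[i];

        /* gather all fragments into one consecutive staging buffer */
        float *staging = (float *)malloc(total * sizeof(float));
        size_t off = 0;
        for (int i = 0; i < n; i++) {
            memcpy(staging + off, fragments[i], lens[i] * sizeof(float));
            off += lens[i];
        }

        /* one large transfer instead of n small ones */
        cudaMemcpy(d_dst, staging, total * sizeof(float),
                   cudaMemcpyHostToDevice);
        free(staging);
    }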
Performance: Thread Register Imbalance (Exercise 3.6)
A kernel should always use every available thread slot on the multiprocessor. This can be limited by the number of registers that are used per thread (see the compiler output).
Figure: Keep the multiprocessors busy (one thread block consuming all N registers versus four blocks with N/4 registers each filling all four block slots).
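Register usage can be capped at compile time; a sketch using the standard __launch_bounds__ qualifier (the numbers are just examples):

    /* at most 256 threads per block, and at least 2 resident blocks per
       multiprocessor: this forces the compiler to limit registers per thread */
    __global__ void __launch_bounds__(256, 2) balanced_kernel(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }

Alternatively, nvcc --maxrregcount=32 caps the register count for all kernels in a file.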
Performance: Wait at Barrier (Exercise 3.7)
Figure: Wait at barrier.
Sometimes some initialization is needed that must be separated from the calculation steps by a barrier. If shared memory is used in this initialization step, an easy way to reduce barrier waiting time is to let only one thread do the initialization.
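A minimal sketch of that single-thread initialization pattern, with a hypothetical shared lookup table:

    __global__ void barrier_demo(float *out)
    {
        __shared__ float table[32];

        if (threadIdx.x == 0) {          /* only one thread initializes */
            for (int i = 0; i < 32; i++)
                table[i] = i * 0.5f;     /* hypothetical lookup values */
        }
        __syncthreads();                 /* barrier between init and calculation */

        out[threadIdx.x] = table[threadIdx.x % 32];
    }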
Performance: Branch Divergence (Exercise 3.8)
Figure: Branch divergence.
Branching is traditionally a hard job for SIMD architectures. Branches in CUDA only have no impact on performance if they are aligned to warp borders.
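A minimal sketch, assuming the warp size of 32 on this hardware; in the second kernel the condition is constant within each warp, so no warp executes both paths:

    __global__ void diverging(float *out)
    {
        /* bad: even and odd threads of the same warp take different paths */
        if (threadIdx.x % 2 == 0)
            out[threadIdx.x] = 1.0f;
        else
            out[threadIdx.x] = 2.0f;
    }

    __global__ void warp_aligned(float *out)
    {
        /* good: the branch condition is uniform within each warp of 32 threads */
        if ((threadIdx.x / 32) % 2 == 0)
            out[threadIdx.x] = 1.0f;
        else
            out[threadIdx.x] = 2.0f;
    }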