CUDA Optimization with NVIDIA Tools
Julien Demouth, NVIDIA
What Will You Learn?
An iterative method to optimize your GPU code
A way to conduct that method with NVIDIA tools
What Does the Application Do?
It does not matter!!! We care about memory accesses, instructions, and latency.
Companion code: https://github.com/jdemouth/nsight-gtc2013
Our Method
Trace the application
Identify the hot spot and profile it
Identify the performance limiter: memory bandwidth, instruction throughput, or latency
Optimize the code
Iterate
Our Environment
We use an NVIDIA Tesla K20c (GK110, SM 3.5), ECC OFF, Microsoft Windows 7 x64, Microsoft Visual Studio 2012, CUDA 5.5, NVIDIA Nsight 3.1.
ITERATION 1
Trace the Application (Nsight VSE)
CUDA Launch Summary (Nsight VSE)
spmv_kernel_v0 is a hot spot, let's start here!!!

Kernel            Time      Speedup
Original version  104.72ms  -
Trace the Application (NVVP)
Trace the Application (nvprof)
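The equivalent command-line flow, as a sketch (./app stands for your executable; exact flag spellings can vary slightly across CUDA versions):

    # Per-kernel time summary: enough to spot the hot spot
    nvprof ./app

    # Chronological trace of every kernel launch and memcpy
    nvprof --print-gpu-trace ./app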
Profile the Most Expensive Kernel (Nsight VSE)
CUDA Launches (Nsight VSE)
Profile the Most Expensive Kernel (NVVP)
Profile the Most Expensive Kernel (nvprof)
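From the command line, the same profiling pass can be restricted to the hot-spot kernel. A sketch: the metric names below are real nvprof metrics, but the available set and the --kernels filter depend on your GPU and CUDA version:

    # Collect a few latency-related metrics for the hot kernel only
    nvprof --kernels spmv_kernel_v0 \
           --metrics achieved_occupancy,gld_transactions_per_request,gst_transactions_per_request \
           ./app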
Identify the Main Limiter
Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
Memory Bandwidth
[Diagram: the memory hierarchy: registers and SMEM/L1$ inside each SM, a shared L2$, and DRAM (global memory)]
Memory Bandwidth
Utilization of DRAM bandwidth: 29.14%
We are not limited by the memory bandwidth (< 70-80%)
Memory Bandwidth (nvprof)
Utilization of DRAM bandwidth: 31.86%

                   Read   Write  Total
Bandwidth (GB/s)   60.77  2.38   63.15
Utilization (%)    29.22  1.14   30.36

Peak bandwidth (K20c): 208 GB/s
We are not limited by the memory bandwidth
Instruction Throughput
[Diagram: an SM with registers, SMEM/L1$, and its execution pipes: Load/Store, Control Flow, ALU, Texture]
Instructions go to the pipes; the schedulers issue 1 or 2 instructions every cycle
We cannot issue to a pipe that is saturated
Instruction Throughput
All pipes are underutilized: < 70-75%
We are not limited by instruction throughput
Instruction Throughput
All pipes have Low/Mid utilization:

Pipe         Load/Store  Control Flow  ALU  Texture
Utilization  Mid         Low           Low  Idle

We are not limited by instruction throughput
Guided Analysis (NVVP)
Latency
First two things to check:
Occupancy
Memory accesses (coalesced/uncoalesced accesses)
Other things to check (if needed):
Control flow efficiency (branching, idle threads)
Divergence
Bank conflicts in shared memory
Latency (Occupancy)
Occupancy: 55.98% achieved / 62.50% theoretical
It's neither clearly high nor clearly low: hard to say
Latency (Occupancy)
Guided Analysis (NVVP): theoretical occupancy is less than 100% but is large enough that increasing occupancy may not improve performance
Latency (Occupancy)
Eligible Warps per Active Cycle: 10.43
[Diagram: an SM with its four scheduler units, registers, and SMEM/L1$]
Occupancy is not an issue (> 4 eligible warps)
Memory Transactions
Warps of threads (32 threads)
L1 transaction: 128B, alignment: 128B (0, 128, 256, ...)
L2 transaction: 32B, alignment: 32B (0, 32, 64, 96, ...)
Memory Transactions (fp32)
Ideal case: 32 aligned and consecutive fp32 numbers
1x L1 transaction: 128B needed / 128B transferred
4x L2 transactions: 128B needed / 128B transferred
Memory Transactions (fp64)
Ideal case: 32 aligned and consecutive fp64 numbers
2x L1 transactions: 256B needed / 256B transferred
8x L2 transactions: 256B needed / 256B transferred
Memory Transactions (fp64)
Worst case: 32 fp64 with a stride of 128B (16x fp64)
32x L1 transactions: 256B needed / 32x 128B transferred
32x L2 transactions: 256B needed / 32x 32B transferred
Memory Transactions (fp64)
Misaligned: 32 fp64
3x L1 transactions: 256B needed / 384B transferred
9x L2 transactions: 256B needed / 288B transferred
Memory Transactions
Broadcast: 1 fp64 read by 32 threads
1x L1 transaction: 8B needed / 128B transferred
1x L2 transaction: 8B needed / 32B transferred
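These transaction rules are easy to reproduce in a toy kernel. The sketch below is illustrative only (data, out, and the stride value are made up for this example):

    // Each warp issues one request per load instruction; the number of
    // L1/L2 transactions depends on how the 32 addresses are laid out.
    __global__ void access_patterns(const double* data, double* out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;

        double coalesced = data[tid];       // consecutive fp64: 2x L1 transactions per warp
        double strided   = data[tid * 16];  // 128B stride: 32x L1 transactions per warp
        double broadcast = data[0];         // same address: a single transaction per warp

        out[tid] = coalesced + strided + broadcast;
    }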
Replays
A memory request = one LD/ST instruction
The 1st transaction is issued; the other transactions induce replays
Note: for each fp64 request, we have at least 1 replay
Latency (Memory Accesses)
Transactions per Request: 19.92 (loads) / 8 (stores)
We have too many uncoalesced accesses!!!
Where Do Those Accesses Happen? (Nsight VSE)
CUDA Source Profiler (Nsight VSE): shows where the uncoalesced requests are (compile with -lineinfo)
Tip: sort by L2 Global Transactions Executed
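The source-level view needs line information in the binary. The flag is real nvcc; the file names are placeholders:

    # -lineinfo embeds source-line info so the profiler can map
    # transactions back to CUDA source lines (unlike -G, it does
    # not disable device optimizations)
    nvcc -lineinfo -arch=sm_35 -o app main.cu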
Where Do Those Accesses Happen? (NVVP)
1. Select a run
2. Click on Unguided Analysis
3. Run Memory Access Pattern
4. The analysis flags the lines where something goes wrong
Access Pattern
Double precision numbers: 64-bit
[Diagram: threads 0, 1, 2, ... loading addresses far apart, each hitting separate L1 (128B) and L2 (32B) transactions]
Per warp: up to 32 L1 transactions (ideal case: 2)
Per warp: up to 32 L2 transactions (ideal case: 8)
Access Pattern
[Diagram: on the next iteration, the same threads load the next elements, again in separate L1/L2 transactions]
Idea: use the read-only cache (LDG load)
On Fermi: use a texture, or use 48KB for L1
First Modification: Use LDG
LD: DRAM - L2$ - L1$ - registers
LDG: DRAM - L2$ - TEX$ - registers
[Diagram: two SMs showing the LD path through L1$ and the LDG path through TEX$]
First Modification: Use LDG
We change the source code. It is slower: 125.67ms

Kernel            Time      Speedup
Original version  104.72ms  -
LDG to load A     125.67ms  0.83x
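The change itself is one intrinsic. A minimal sketch of the kind of edit, assuming a CSR-style kernel where vals holds the matrix coefficients (illustrative names, not the companion code verbatim):

    // Before (v0): regular load, served through L1
    double a = vals[element];

    // After (v1): __ldg (SM 3.5+) routes the load through the read-only
    // texture cache. Here it hurts: each coefficient is read exactly
    // once, so there is no reuse for the cache to exploit.
    double a = __ldg(&vals[element]);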
First Modification: Use LDG
No benefit from the read-only cache: hit rate at 3.3%
Worse hit rate in L2$: 15.8% compared to 36.7%
First Modification: Use LDG
Eligible Warps per Active Cycle has dropped to 0.54
First Modification: Use LDG
Warps cannot issue because they have to wait
First Modification: Use LDG
The loads compete too much for the cache: low hit rate (3.3%)
Texture requests introduce too much latency (in that case)
Things to check in those cases:
Texture Hit Rate: low means no reuse
Issue Efficiency and Stall Reasons
It was actually expected: GPU caches are not CPU caches!!!
First Modification: Use LDG
Other accesses may benefit from LDG: memory blocks accessed several times by several threads
How can we detect it? Source code analysis; there is no way to detect it from Nsight
First Modification: Use LDG
We change the source code: in y = Ax, we use LDG when loading x
It's faster: 98.30ms

Kernel            Time      Speedup
Original version  104.72ms  -
LDG to load A     125.67ms  0.83x
LDG to load X     98.30ms   1.07x
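x is a better candidate because the same entries of x are read by threads across many rows, so there is reuse for the read-only cache to exploit. A one-line sketch with the same illustrative names as before:

    // Inner loop of y = A*x: the matrix value is read once (plain load),
    // the vector value is read many times across rows (__ldg).
    sum += vals[element] * __ldg(&x[cols[element]]);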
First Modification: Use LDG
Good hit rate in the texture cache: 82%
Slightly less data transferred from L2 (94MB vs 110MB)
ITERATION 2
CUDA Launch Summary
spmv_kernel_v2 is still a hot spot, so we profile it
Identify the Main Limiter
Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
Identify the Main Limiter
We are still limited by latency
Low DRAM utilization: 29.95%
Pipe utilization is Low/Mid: < 70-75%
Identify the Main Limiter
We are not limited by occupancy: > 4 Eligible Warps per Active Cycle (8.16)
Too many uncoalesced accesses: 40.79% of Replay Overhead
Second Strategy: Change Memory Accesses
4 consecutive threads load 4 consecutive elements
[Diagram: threads 0-3, 4-7, 8-11 each covering a contiguous chunk, so a warp needs far fewer L1 (128B) and L2 (32B) transactions]
Per warp: up to 8 L1 transactions (ideal case: 2)
Per warp: up to 8 L2 transactions (ideal case: 8)
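A sketch of what this kernel can look like. The CSR array names and the shuffle-based reduction are assumptions for this example, not the companion code verbatim; __shfl_down_sync is the modern (CUDA 9+) spelling, where the deck's CUDA 5.5 code would have reduced through shared memory or paired 32-bit __shfl_down calls:

    __global__ void spmv_4_threads_per_row(int num_rows,
                                           const int* row_offsets,
                                           const int* cols,
                                           const double* vals,
                                           const double* x,
                                           double* y)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int row  = tid / 4;   // 4 consecutive threads share one row
        int lane = tid % 4;

        double sum = 0.0;
        if (row < num_rows)
            for (int i = row_offsets[row] + lane; i < row_offsets[row + 1]; i += 4)
                sum += vals[i] * __ldg(&x[cols[i]]);  // consecutive threads, consecutive elements

        // Combine the 4 partial sums inside each group of 4 lanes.
        sum += __shfl_down_sync(0xffffffffu, sum, 2, 4);
        sum += __shfl_down_sync(0xffffffffu, sum, 1, 4);
        if (lane == 0 && row < num_rows) y[row] = sum;
    }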
Second Strategy: Change Memory Accesses
It's much faster: 45.61ms

Kernel                     Time      Speedup
Original version           104.72ms  -
LDG to load A              125.67ms  0.83x
LDG to load X              98.30ms   1.07x
Coalescing with 4 Threads  45.61ms   2.30x
Second Strategy: Change Memory Accesses
We have far fewer Transactions per Request: 5.16 (LD)
Second Strategy: Change Memory Accesses
Much less traffic from L2: 28.27MB (it was 109.83MB)
Much less DRAM traffic: 25.93MB (it was 69.59MB)
ITERATION 3
CUDA Launch Summary
spmv_kernel_v3 is still a hot spot, so we profile it
Identify the Main Limiter
Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
Identify the Main Limiter
We are still limited by latency
Low DRAM utilization: 36.01%
Pipe utilization is still Low/Mid
Latency
Eligible Warps per Active Cycle: 6.70 on average
We are not limited by occupancy
Latency
Memory accesses:
Load: 5.16 Transactions per Request
Store: 2 Transactions per Request
We still have too many uncoalesced accesses
Latency
We still have too many uncoalesced accesses
Instruction Serialization (replays): 68.44%
Stall reasons: 43.14% due to Data Requests
Latency
Serialization = (Inst. Issued - Inst. Executed) / Inst. Issued
Inst. Replay Overhead = average number of replays per instruction, so Inst. Issued = (1 + Inst. Replay Overhead) x Inst. Executed
Hence Serialization = Inst. Replay Overhead / (1 + Inst. Replay Overhead)
Here, a replay overhead of 2.17 gives a serialization of 68.45%
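As a worked check of those numbers, the same relation in LaTeX form:

    \mathrm{Serialization} \;=\; \frac{r}{1+r}, \qquad r = \text{Inst. Replay Overhead}

    r = 2.17 \;\Rightarrow\; \frac{2.17}{1 + 2.17} \approx 0.6845 = 68.45\%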
Latency
Issue Stall Reasons:

Instruction Fetch      0.58%
Execution Dependency   32.51%
Data Request           37.85%
Texture                0.41%
Sync                   0.00%
Other                  15.13%
Latency
Where Do Those Accesses Happen?
Same lines of code as before
What Can We Do?
In our kernel: 4 threads per row of the matrix A
[Diagram: threads 0-3, 4-7, 8-11 on separate rows, spread over several L1/L2 transactions]
New approach: 1 warp of threads per row of the matrix A
[Diagram: threads 0, 1, 2, ..., 31 (some possibly idle) covering one row with consecutive 32B L2 transactions]
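The warp-per-row variant is the same pattern with a group size of 32, so a full warp sweeps each row with consecutive loads. Same assumptions and caveats as the earlier sketch (illustrative names, modern __shfl_down_sync spelling):

    __global__ void spmv_warp_per_row(int num_rows,
                                      const int* row_offsets,
                                      const int* cols,
                                      const double* vals,
                                      const double* x,
                                      double* y)
    {
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int row  = tid / 32;   // one full warp per row
        int lane = tid % 32;

        double sum = 0.0;
        if (row < num_rows)
            for (int i = row_offsets[row] + lane; i < row_offsets[row + 1]; i += 32)
                sum += vals[i] * __ldg(&x[cols[i]]);

        // Tree reduction across the warp.
        for (int offset = 16; offset > 0; offset /= 2)
            sum += __shfl_down_sync(0xffffffffu, sum, offset);

        if (lane == 0 && row < num_rows) y[row] = sum;
    }

The ½-warp-per-row variant that appears in a later iteration is the same kernel with the group size dropped to 16, which leaves fewer lanes idle on short rows.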
One Warp Per Row
It's faster: 37.50ms

Kernel                     Time      Speedup
Original version           104.72ms  -
LDG to load A              125.67ms  0.83x
LDG to load X              98.30ms   1.07x
Coalescing with 4 Threads  45.61ms   2.30x
1 Warp per Row             37.50ms   2.79x
One Warp Per Row
Far fewer Transactions per Request: 1.37 (LD) / 1 (ST)
ITERATION 4
One Warp Per Row
spmv_kernel_v4 is the hot spot
One Warp Per Row
DRAM utilization: 37.36%
Pipe utilization is Low/Mid
We are still limited by latency
One Warp Per Row
Occupancy and memory accesses are OK (not shown)
Control Flow Efficiency: 87.31%
Control Flow Efficiency
All threads work: 100% efficiency
Some threads do nothing: less efficiency

if( threadIdx.x % 32 < 24 ) {
  // do some long computation
}

Efficiency = 24/32 = 75%
Control Flow Efficiency
Low efficiency in one of the key loops: 69.9%
One Half Warp Per Row
It is faster: 35.81ms

Kernel                     Time      Speedup
Original version           104.72ms  -
LDG to load A              125.67ms  0.83x
LDG to load X              98.30ms   1.07x
Coalescing with 4 Threads  45.61ms   2.30x
1 Warp per Row             37.50ms   2.79x
½ Warp per Row             35.81ms   2.93x
ITERATION 5
One Half Warp Per Row
DRAM utilization: 48.18%
Pipe utilization is Low/Mid
We are still limited by latency
One Half Warp Per Row
Occupancy is not an issue
Memory accesses are good enough
One Half Warp Per Row
Branch divergence induces latency
We have 16.32% of divergent branches
Branch Divergence
Branch Divergence
Execution time = time of the if branch + time of the else branch

if( threadIdx.x % 32 < 24 ) {
  // do some long computation
} else {
  // do some long computation
}
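One common way to remove this kind of divergence, when the two branches do the same work on different data, is to select the data and keep a single control path. A hypothetical illustration, not the companion code's actual fix:

    __global__ void smooth(const double* in, double* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Divergent version: the warp runs both loop bodies back to back.
        //   if (threadIdx.x % 32 < 24) { /* long loop with c = 0.25 */ }
        //   else                       { /* long loop with c = 0.75 */ }

        // Divergence-free version: pick the coefficient, run one loop.
        double c = (threadIdx.x % 32 < 24) ? 0.25 : 0.75;
        double acc = 0.0;
        for (int k = 0; k < 64; ++k)
            acc += c * in[i];
        out[i] = acc;
    }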
One Half Warp Per Row
We fix the branch divergence. It is faster: 29.60ms

Kernel                     Time      Speedup
Original version           104.72ms  -
LDG to load A              125.67ms  0.83x
LDG to load X              98.30ms   1.07x
Coalescing with 4 Threads  45.61ms   2.30x
1 Warp per Row             37.50ms   2.79x
½ Warp per Row             35.81ms   2.93x
No divergence              29.60ms   3.54x
One Half Warp Per Row
DRAM utilization: 60.80%
We are still far from the peak
So Far
We have successively:
Improved caching using LDG (use with care)
Improved coalescing
Improved control flow efficiency
Improved branching
Our new kernel is 3.5x faster than our first implementation
The tools helped us a lot
ITERATION 6
Next Kernel
We are satisfied with the performance of spmv_kernel
We move to the next kernel: jacobi_smooth
What Have You Seen?
An iterative method to optimize your GPU code:
Trace your application
Identify the hot spot and profile it
Identify the performance limiter
Optimize the code
Iterate
A way to conduct that method with NVIDIA tools