Guided Performance Analysis with the NVIDIA Visual Profiler




Identifying Performance Opportunities
- NVIDIA Nsight Eclipse Edition (nsight)
- NVIDIA Visual Profiler (nvvp)
- nvprof command-line profiler

Guided Performance Analysis (NEW in 5.5)
- Step-by-step optimization guidance

Improved Analysis Visualization (NEW in 5.5)
- More clearly presents optimization opportunities

Demo - Analysis Introduction

Guided Analysis: Kernel Optimization
- Overall application optimization strategy, based on identifying the primary performance limiter
- Common optimization opportunities
- How to use the CUDA profiling tools to identify optimization opportunities and execute the optimization strategy

Identifying Candidate Kernels
- The analysis system estimates which kernels are the best candidates for speedup
- The estimate is based on execution time and achieved occupancy

Primary Performance Limiter
- The factor most likely limiting a kernel's performance:
  - Memory bandwidth
  - Compute resources
  - Instruction and memory latency
- The primary limiter should be addressed first
- It is often beneficial to examine the secondary limiter as well

Calculating the Performance Limiter
- Memory bandwidth utilization
- Compute resource utilization
- A high utilization value: likely the performance limiter
- A low utilization value: likely not the performance limiter
- The two values are taken together to determine the primary performance limiter

Memory Bandwidth Utilization
- Traffic to/from each memory subsystem, relative to its peak
- Overall value is the maximum utilization of any memory subsystem:
  - L1/shared memory
  - L2 cache
  - Texture cache
  - Device memory
  - System memory (via PCIe)

Compute Resource Utilization
- Number of instructions issued, relative to the peak capabilities of the GPU
- Some resources are shared across all instructions
- Some resources are specific to instruction classes: integer, FP, control-flow, etc.
- Overall value is the maximum utilization of any resource

Calculating the Performance Limiter from Utilizations
- Both compute and memory utilization high: both are highly utilized
- Memory high, compute low: memory bandwidth limited
- Compute high, memory low: compute resource limited
- Both low: latency limited
(The slides show compute and memory utilization as bar charts for each case.)
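The four cases above can be sketched as a small classifier. This is an illustrative sketch only: the 60% threshold and the function name are assumptions, not the Visual Profiler's actual cutoff or API.

```python
def primary_limiter(compute_util, memory_util, high=0.6):
    """Classify the likely primary performance limiter from two
    utilization values in [0, 1]. The `high` threshold is an
    illustrative assumption, not the profiler's real cutoff."""
    compute_high = compute_util >= high
    memory_high = memory_util >= high
    if compute_high and memory_high:
        return "compute and memory"   # both highly utilized
    if memory_high:
        return "memory bandwidth"     # memory high, compute low
    if compute_high:
        return "compute resources"    # compute high, memory low
    return "latency"                  # both low

print(primary_limiter(0.85, 0.30))  # → compute resources
print(primary_limiter(0.20, 0.90))  # → memory bandwidth
```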

Demo - Performance Limiter

Limiter-Specific Optimization Analysis
- Memory bandwidth limited
- Compute limited
- Latency limited

Memory Bandwidth Limited
- Global memory load/store
  - Access pattern inefficiencies (e.g. misaligned accesses)
  - Memory bandwidth limits
- Local memory overhead
  - Register spilling
  - Local stack variables
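The cost of an inefficient access pattern can be illustrated by counting how many memory segments a single warp touches. The 32-byte segment size, 4-byte elements, and function name here are illustrative assumptions; real transaction sizes depend on the GPU and cache configuration.

```python
def transactions_per_warp(stride_elems, elem_size=4, seg_bytes=32):
    """Count the distinct memory segments touched when thread i of a
    32-thread warp loads element i * stride_elems. Fewer segments
    means better coalescing. Segment size is an assumption."""
    addrs = [i * stride_elems * elem_size for i in range(32)]
    return len({a // seg_bytes for a in addrs})

print(transactions_per_warp(1))   # unit stride: 4 segments, fully coalesced
print(transactions_per_warp(32))  # large stride: 32 segments, one per thread
```

A unit-stride pattern lets the warp's 128 bytes of requests be served by a handful of segments; a large stride forces one segment per thread, wasting most of the fetched bandwidth.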

Compute Resource Limited
- Divergent branches
  - Low warp execution efficiency due to divergent branches
- Over-subscribed function units
  - Load/store, arithmetic, control-flow, texture
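Warp execution efficiency, the metric behind the divergent-branch analysis, is the average fraction of a warp's 32 lanes that are active per issued instruction. A minimal sketch of the arithmetic (the function name and the example mask sequence are assumptions for illustration):

```python
def warp_execution_efficiency(active_lane_counts):
    """Average fraction of the warp's 32 lanes active per issued
    instruction. active_lane_counts lists, per instruction, how many
    of the 32 lanes were enabled."""
    return sum(active_lane_counts) / (32 * len(active_lane_counts))

# A divergent branch where 8 of 32 lanes take the 'if' path and the
# other 24 take the 'else' path: the two paths are serialized, so two
# instructions issue with partial masks between two full-warp ones.
print(warp_execution_efficiency([32, 8, 24, 32]))  # → 0.75
```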

Latency Limited
- Low theoretical occupancy
  - Block size
  - Registers
  - Shared memory
- Low achieved occupancy
  - Too few blocks
  - Instruction stalls

Demo - Limiter Details

Summary
- Visual Profiler Guided Analysis gives step-by-step optimization advice
- Kernel analysis strategy is based on identifying the primary limiter:
  - Memory bandwidth
  - Compute resources
  - Instruction and memory latency
- Visual results and integrated documentation

Upcoming GTC Express Webinars
- September 17 - ArrayFire: A Productive GPU Software Library for Defense and Intelligence Applications
- September 19 - Learn How to Debug OpenGL 4.2 with NVIDIA Nsight Visual Studio Edition 3.1
- September 24 - Pythonic Parallel Patterns for the GPU with NumbaPro
- September 25 - An Introduction to GPU Programming
- September 26 - Learn How to Profile OpenGL 4.2 with NVIDIA Nsight Visual Studio Edition 3.1
Register at www.gputechconf.com/gtcexpress

GTC 2014 Call for Submissions
Looking for submissions in the fields of:
- Science and research
- Professional graphics
- Mobile computing
- Automotive applications
- Game development
- Cloud computing
Submit by September 27 at www.gputechconf.com

Extra

Understanding Latency
- Memory load: the delay from when the load instruction executes until the data returns
- Arithmetic: the delay from when the instruction starts until the result is produced
(The slide illustrates a warp stalled waiting for a loaded value A before it can be used.)
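Little's law gives a rough feel for how much concurrency is needed to cover a latency: warps in flight ≈ latency × issue rate. This is an illustrative back-of-the-envelope sketch; the function name, the one-warp-per-cycle issue rate, and the example latencies are assumptions, not measured hardware values.

```python
import math

def warps_to_hide_latency(latency_cycles, warps_issued_per_cycle=1.0):
    """Little's law sketch: concurrent warps needed so the scheduler
    always has a ready warp while others wait out their latency.
    The issue rate is an illustrative assumption."""
    return math.ceil(latency_cycles * warps_issued_per_cycle)

print(warps_to_hide_latency(20))   # a short arithmetic latency: ~20 warps
print(warps_to_hide_latency(400))  # a long memory latency: far more warps
```

When the required warp count exceeds what an SM can hold, the remaining latency must be covered by instruction-level parallelism within each warp.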

Hiding Latency
- An SM can do many things at once: while Warp 0 is waiting, the SM stays busy executing other warps
- Latency can be hidden as long as there are enough warps ready to execute
(The slide shows a timeline of warps interleaving on an SM.)

Occupancy
- Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps
- Theoretical occupancy: an upper bound based on
  - Threads per block
  - Shared memory usage per block
  - Register usage per thread
- Achieved occupancy: the actual measured occupancy of the running kernel
  - (active_warps / active_cycles) / MAX_WARPS_PER_SM
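The theoretical-occupancy bound above can be sketched as the minimum over the per-SM resource limits. The specific limits used here (64 warps, 65,536 registers, 48 KB shared memory, 16 blocks per SM) are illustrative, roughly Kepler-class assumptions; real limits vary by compute capability, and the CUDA occupancy calculator is the authoritative tool.

```python
def theoretical_occupancy(threads_per_block, regs_per_thread, smem_per_block,
                          max_warps=64, max_regs=65536, max_smem=49152,
                          max_blocks=16):
    """Upper bound on occupancy: blocks per SM is capped by whichever
    resource runs out first. Per-SM limits are illustrative assumptions."""
    warps_per_block = -(-threads_per_block // 32)  # ceil division
    limits = [max_blocks, max_warps // warps_per_block]
    if regs_per_thread:
        limits.append(max_regs // (regs_per_thread * threads_per_block))
    if smem_per_block:
        limits.append(max_smem // smem_per_block)
    blocks_per_sm = min(limits)
    return blocks_per_sm * warps_per_block / max_warps

# 256 threads/block, 32 registers/thread, 8 KB shared memory/block:
# shared memory caps the SM at 6 blocks = 48 warps of a possible 64.
print(theoretical_occupancy(256, 32, 8192))  # → 0.75
```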

Low Theoretical Occupancy
- Warps on an SM = blocks on the SM × warps per block
- If low:
  - Not enough blocks in the kernel
  - Blocks per SM limited by threads, shared memory, or registers
- The CUDA Best Practices Guide has an extensive discussion of improving occupancy

Low Achieved Occupancy
- In theory the kernel allows enough warps on each SM
- In practice the measured achieved occupancy is low
- The likely cause is that the SMs do not remain equally busy over the duration of the kernel's execution
(The slides show blocks being scheduled unevenly across SM0 and SM1 over time.)
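One common reason SMs do not stay equally busy is the tail effect: equal-duration blocks run in "waves", and a partial final wave leaves some SM slots idle. A minimal sketch of that arithmetic (the function name and slot counts are illustrative assumptions):

```python
import math

def wave_utilization(num_blocks, num_sms, blocks_per_sm):
    """Fraction of SM block slots kept busy when equal-duration blocks
    run in waves of num_sms * blocks_per_sm. A partial final wave
    drags achieved occupancy below the theoretical bound."""
    slots_per_wave = num_sms * blocks_per_sm
    waves = math.ceil(num_blocks / slots_per_wave)
    return num_blocks / (waves * slots_per_wave)

print(wave_utilization(12, 2, 3))  # 12 blocks fill 2 waves exactly → 1.0
print(wave_utilization(7, 2, 3))   # 1 block straggles in wave 2 → ~0.58
```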

Lots of Warps, But Not Enough Can Execute
- An SM can do many things at once, but it sits idle whenever no warp is ready to execute
(The slide shows a timeline where the SM goes idle between warp executions because no ready warp is available.)