NVIDIA Tools For Profiling And Monitoring
David Goodwin
CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale Computing
Outline
- CUDA Profiling and Monitoring
  - Libraries
  - Tools
  - Technologies
  - Directions
CUDA Profiling Tools Interface (CUPTI)
- CUPTI is the only supported interface for CUDA profiling
  - 3rd-party profilers: Vampir, TAU, PAPI, ...
  - Internal: NVIDIA Visual Profiler, nvprof
[Diagram: Application -> Profiling Tool Library -> CUPTI -> CUDA Runtime -> GPU, spanning CPU and GPU]
CUPTI Callbacks
- A profiling tool can register for a callback on every CUDA API call (see the sketch below)
  - Callback at entry and exit
  - Callbacks get access to function parameters and the return code
- Also callbacks for CUDA object create/destroy and synchronization
[Diagram: the Application's cudaMalloc() call enters the CUDA Runtime; CUPTI delivers callbacks to the Profiling Tool Library]
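A minimal sketch of how a tool might register for runtime-API callbacks, assuming the CUPTI callback API (cuptiSubscribe, cuptiEnableDomain) shipped with the CUDA toolkit; error checking is omitted and the logging is illustrative:

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <cupti.h>

// Invoked twice per runtime API call: once at entry, once at exit.
static void CUPTIAPI apiCallback(void *userdata, CUpti_CallbackDomain domain,
                                 CUpti_CallbackId cbid,
                                 const CUpti_CallbackData *cbInfo)
{
    if (cbInfo->callbackSite == CUPTI_API_ENTER) {
        printf("enter %s\n", cbInfo->functionName);
    } else if (cbInfo->callbackSite == CUPTI_API_EXIT) {
        // For the runtime API the return value is a cudaError_t.
        printf("exit  %s (return %d)\n", cbInfo->functionName,
               *(int *)cbInfo->functionReturnValue);
    }
}

int main(void)
{
    CUpti_SubscriberHandle subscriber;
    // Register the callback; only one subscriber may be active at a time.
    cuptiSubscribe(&subscriber, (CUpti_CallbackFunc)apiCallback, NULL);
    // Request callbacks for every CUDA runtime API call.
    cuptiEnableDomain(1, subscriber, CUPTI_CB_DOMAIN_RUNTIME_API);

    void *devPtr;
    cudaMalloc(&devPtr, 1024);   // triggers entry/exit callbacks
    cudaFree(devPtr);

    cuptiUnsubscribe(subscriber);
    return 0;
}
```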
CUPTI Events and Metrics
- Events: low-level raw counts (see the collection sketch below)
  - Sampled, or collected over the duration of kernel execution
  - HW counters and SW patches
  - Replay kernel execution to collect large amounts of data
- Metrics: typical profiling values
  - E.g. IPC, throughput, hit rate, etc.
  - Derived from event values and system properties
  - More actionable
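A minimal sketch of raw event collection around a kernel launch, assuming the CUPTI events API (cuptiEventGetIdFromName and the cuptiEventGroup* calls); the event name "inst_executed" and device 0 are illustrative, available events vary by GPU architecture, and error checking is omitted:

```c
#include <stdio.h>
#include <cuda.h>
#include <cupti.h>

int main(void)
{
    // The events API operates on a driver-API context.
    CUdevice device;
    CUcontext context;
    cuInit(0);
    cuDeviceGet(&device, 0);
    cuCtxCreate(&context, 0, device);

    // Look up the event by name (architecture-dependent).
    CUpti_EventID eventId;
    cuptiEventGetIdFromName(device, "inst_executed", &eventId);

    // Group the event and enable counting on the current context.
    CUpti_EventGroup group;
    cuptiEventGroupCreate(context, &group, 0);
    cuptiEventGroupAddEvent(group, eventId);
    cuptiEventGroupEnable(group);

    // ... launch and synchronize the kernel to be measured here ...

    // Read back the raw count accumulated while the group was enabled.
    uint64_t value;
    size_t bytes = sizeof(value);
    cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE,
                             eventId, &bytes, &value);
    printf("inst_executed = %llu\n", (unsigned long long)value);

    cuptiEventGroupDisable(group);
    cuptiEventGroupDestroy(group);
    cuCtxDestroy(context);
    return 0;
}
```

A metric such as IPC would then be derived from counts like this one combined with system properties (e.g. elapsed cycles), rather than read directly.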
CUPTI Activity Trace
- Delivers an asynchronous stream of system activity (see the tracing sketch below)
  - GPU: kernel, memset, memcpy
  - CUDA API invocations
  - Developer-defined activity
- Dynamic, per-PC activity
  - Behavior of individual branches
  - Behavior of individual loads/stores
[Diagram: Application -> Profiling Tool Library -> CUPTI -> CUDA Runtime -> GPU, spanning CPU and GPU]
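A minimal sketch of activity tracing, assuming the asynchronous-buffer activity API (cuptiActivityRegisterCallbacks, cuptiActivityEnable, cuptiActivityFlushAll) from later CUPTI releases; earlier releases used an enqueue-based buffer API, record layouts vary across versions, and the buffer size is arbitrary:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cupti.h>

#define BUF_SIZE (32 * 1024)

// CUPTI asks the tool for an empty buffer to fill with activity records.
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords)
{
    *buffer = (uint8_t *)malloc(BUF_SIZE);
    *size = BUF_SIZE;
    *maxNumRecords = 0;   // 0 = as many records as fit
}

// CUPTI hands back a filled buffer; walk the records it contains.
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize)
{
    CUpti_Activity *record = NULL;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) ==
           CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_KERNEL)
            printf("kernel activity record\n");
        // Cast record to the kind-specific struct for timestamps, names, etc.
    }
    free(buffer);
}

int main(void)
{
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL);

    // ... run the CUDA workload to be traced here ...

    cuptiActivityFlushAll(0);   // force delivery of buffered records
    return 0;
}
```

Because records are buffered and delivered asynchronously, tracing stays off the critical path of the application's CUDA calls.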
NVIDIA Profiling Tools
- Visual Profiler
  - Graphical, Eclipse-based
  - Timeline
  - Automated analysis
- nvprof
  - Command-line
  - Backend for Visual Profiler
- Built on CUPTI
Visual Profiler - Timeline
[Screenshot: Visual Profiler timeline view]
Automated Analysis
- Actionable feedback
- Direct link to more extensive documentation
- Based on both trace and profile information
Automated Analysis - Expert System
- Identify bottlenecks
  - Multiprocessor
    - Compute, latency, or memory bound
    - FU, register, etc. resource limiters
  - Memory subsystems
    - Global, shared
    - Texture, constant
    - Caches
Analysis Examples
- "Occupancy can potentially be improved by increasing the number of threads per block."
- "Global memory loads may have a bad access pattern, leading to inefficient use of global memory bandwidth."
- "The kernel is likely memory bound, but it is not fully utilizing the available DRAM bandwidth."
- "Divergent branches are causing significant instruction issue overhead."
nvprof
- Command-line interface to most CUPTI functionality
- Summary and full-trace outputs
- Collect on a headless node, visualize with Visual Profiler

$ nvprof dct8x8
======== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
  49.52    9.36ms    101   92.68us   92.31us   94.31us  CUDAkernel2DCT(float*, float*, int)
  37.47    7.08ms     10  708.31us  707.99us  708.50us  CUDAkernel1DCT(float*, int, int, int)
   3.75  708.42us      1  708.42us  708.42us  708.42us  CUDAkernel1IDCT(float*, int, int, int)
   1.84  347.99us      2  173.99us  173.59us  174.40us  CUDAkernelQuantizationFloat()
   1.75  331.37us      2  165.69us  165.67us  165.70us  [CUDA memcpy DtoH]
   1.41  266.70us      2  133.35us   89.70us  177.00us  [CUDA memcpy HtoD]
   1.00  189.64us      1  189.64us  189.64us  189.64us  CUDAkernelShortDCT(short*, int)
   0.94  176.87us      1  176.87us  176.87us  176.87us  [CUDA memcpy HtoA]
Future (Profiling)
- Tighter integration of CPU/GPU profiling
- Expand the expert system
  - Upward: help identify and extract data parallelism
  - Downward: more precise feedback
- Auto-tuning
- PC sampling
- Expose the expert system through CUPTI so 3rd-party tools can take advantage
- Standards for higher-level profiling
  - OpenACC
Monitoring
- Cluster, not individual nodes
- Coarser measurement and control
- Zero or little overhead
- NVIDIA Monitoring Library: in-band
- Proprietary bus protocol: out-of-band
NVIDIA Monitoring Library (NVML)
- GPU aggregate utilizations: compute, bandwidth
- Memory usage
- Temperature
- Clocks
- Power: power draw, power states
(A query sketch follows below.)
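A minimal sketch of an NVML monitoring query for GPU 0, assuming the standard NVML C API (nvmlDeviceGetUtilizationRates, nvmlDeviceGetPowerUsage, etc.); error checking is omitted, and the program links against libnvidia-ml:

```c
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlInit();

    nvmlDevice_t device;
    nvmlDeviceGetHandleByIndex(0, &device);

    // Aggregate compute and memory-bandwidth utilization (percent).
    nvmlUtilization_t util;
    nvmlDeviceGetUtilizationRates(device, &util);

    // Framebuffer memory usage (bytes).
    nvmlMemory_t mem;
    nvmlDeviceGetMemoryInfo(device, &mem);

    // Temperature (degrees C), SM clock (MHz), power draw (milliwatts).
    unsigned int temp, clock, power;
    nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temp);
    nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &clock);
    nvmlDeviceGetPowerUsage(device, &power);

    printf("util: %u%% compute, %u%% memory\n", util.gpu, util.memory);
    printf("mem:  %llu / %llu bytes used\n",
           (unsigned long long)mem.used, (unsigned long long)mem.total);
    printf("temp: %u C, sm clock: %u MHz, power: %.1f W\n",
           temp, clock, power / 1000.0);

    nvmlShutdown();
    return 0;
}
```

Because these queries read coarse aggregate state rather than instrumenting the workload, a cluster monitor can poll them periodically with little or no overhead.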
Future (Monitoring)
- Power management
  - Improved control over clock speeds and power budget
  - Integration with power-balancing solutions
  - Exploit variations in GPU power efficiencies
- Improved profiling capabilities
  - Support concurrent workloads (per-process vs. per-GPU reporting)
  - System-level metrics: network bottlenecks, communication locality, ...
  - Higher-accuracy utilization measurement
- Cluster-level profiling expert system
Future (Monitoring) cont.
- In-band vs. out-of-band
- Proprietary vs. standard
Questions?