Cluster Monitoring and Management Tools
Rajat Phull, NVIDIA Software Engineer
Rob Todd, NVIDIA Software Engineer
MANAGE GPUS IN THE CLUSTER
Users: administrators, end users, middleware engineers
Tools: monitoring/management tools built on NVML, nvidia-smi, and health tools
NVIDIA MANAGEMENT PRIMER
NVIDIA Management Library (NVML)
  Provides a low-level C API for application developers to monitor, manage, and analyze specific characteristics of a GPU.
  nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temp);
NVIDIA System Management Interface (nvidia-smi)
  A command line tool that uses NVML to provide information in a more readable or parse-ready format.
  Exposes most of the NVML API at the command line.
  # nvidia-smi --query-gpu=temperature.gpu --format=csv
Health Tools
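A minimal C sketch of the temperature query shown above, assuming GPU index 0 and abbreviating error handling:

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    unsigned int temp = 0;
    nvmlDevice_t device;

    // Initialize NVML before any other call
    if (nvmlInit() != NVML_SUCCESS)
        return 1;

    // Query the temperature of GPU 0 in degrees Celsius
    if (nvmlDeviceGetHandleByIndex(0, &device) == NVML_SUCCESS &&
        nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS)
        printf("GPU 0 temperature: %u C\n", temp);

    nvmlShutdown();
    return 0;
}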
SOFTWARE RELATIONSHIPS
Key SW components:
  CUDA Toolkit: CUDA Libraries, CUDA Runtime
  NV Driver: NVSMI (Stats, Dmon, Daemon, Replay), NVML Library, NVIDIA Kernel Mode Driver
  GPU Deployment Kit: Validation Suite, NVML SDK
MANAGEMENT CAPABILITIES
NVIDIA management features:
  Configuration: identification, PCI information, topology, mode of operation query/control
  Health: ECC errors, XID errors, replay/failure counters
  Performance: violation counters, thermal data, power control/query, clock control and performance limits, GPU utilization
  Logging & Analysis: events, samples, process accounting
NVML EXAMPLE WITH C

#include <stdio.h>
#include <nvml.h>

int main()
{
    nvmlReturn_t result;
    nvmlPciInfo_t pci;
    nvmlDevice_t device;
    char name[NVML_DEVICE_NAME_BUFFER_SIZE];

    // First initialize NVML library
    result = nvmlInit();
    if (NVML_SUCCESS != result) {
        printf("Failed to initialize NVML: %s\n", nvmlErrorString(result));
        return 1;
    }

    result = nvmlDeviceGetHandleByIndex(0, &device);
    // (check for error...)
    result = nvmlDeviceGetName(device, name, NVML_DEVICE_NAME_BUFFER_SIZE);
    // (check for error...)
    result = nvmlDeviceGetPciInfo(device, &pci);
    // (check for error...)
    printf("0. %s [%s]\n", name, pci.busId);

    result = nvmlShutdown();
    // (check for error...)
    return 0;
}
NVML EXAMPLE WITH PYTHON BINDINGS
Python bindings: https://pypi.python.org/pypi/nvidia-ml-py/
Errors are handled by raising NVMLError(returncode)

import pynvml

pynvml.nvmlInit()
device = pynvml.nvmlDeviceGetHandleByIndex(0)
pci = pynvml.nvmlDeviceGetPciInfo(device)
print(pci.busId)
pynvml.nvmlShutdown()
CONFIGURATION
Identification
  Device handles: ByIndex, ByUUID, ByPCIBusID, BySerial
  Basic info: serial, UUID, brand, name, index
PCI Information
  Current and max link/gen, domain/bus/device
Topology
  Get/set CPU affinity (uses sched_affinity calls)
Mode of operation
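A minimal sketch of the identification and topology calls above (device 0 is assumed; error handling omitted):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    char uuid[NVML_DEVICE_UUID_BUFFER_SIZE];

    nvmlInit();

    // Look up a device by index, then read its UUID
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetUUID(dev, uuid, NVML_DEVICE_UUID_BUFFER_SIZE);
    printf("GPU 0 UUID: %s\n", uuid);

    // Handles can also be obtained by UUID, PCI bus id, or serial
    nvmlDeviceGetHandleByUUID(uuid, &dev);

    // Pin the calling process to the CPUs closest to this GPU (NUMA-friendly)
    nvmlDeviceSetCpuAffinity(dev);

    nvmlShutdown();
    return 0;
}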
ECC SETTINGS
Tesla and Quadro GPUs support ECC memory
  Correctable errors are logged but not scrubbed
  Uncorrectable errors cause an error at the user and system level
  The GPU rejects new work after an uncorrectable error, until reboot
ECC can be turned off, which makes more GPU memory available at the cost of error correction/detection
  Configured using NVML or nvidia-smi:
  # nvidia-smi -e 0
  Requires reboot to take effect
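The same setting can be changed through NVML; a rough sketch (the new mode stays pending until the next reboot; root privileges and error checks omitted):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlEnableState_t current, pending;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Disable ECC; takes effect after the next reboot
    nvmlDeviceSetEccMode(dev, NVML_FEATURE_DISABLED);

    // Current vs. pending mode shows whether a reboot is still required
    nvmlDeviceGetEccMode(dev, &current, &pending);
    printf("ECC current=%d pending=%d\n", current, pending);

    nvmlShutdown();
    return 0;
}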
P2P AND RDMA
Shows traversal expectations and potential bandwidth bottlenecks via NVSMI
Cgroups friendly, for NUMA binding

GPUDirect Comm Matrix
        GPU0   GPU1   GPU2   mlx5_0  mlx5_1  CPU Affinity
GPU0    X      PIX    SOC    PHB     SOC     0-9
GPU1    PIX    X      SOC    PHB     SOC     0-9
GPU2    SOC    SOC    X      SOC     PHB     10-19
mlx5_0  PHB    PHB    SOC    X       SOC
mlx5_1  SOC    SOC    PHB    SOC     X

Legend:
  X   = Self
  SOC = Path traverses a socket-level link (e.g. QPI)
  PHB = Path traverses a PCIe host bridge
  PXB = Path traverses multiple PCIe internal switches
  PIX = Path traverses a PCIe internal switch
  CPU Affinity = The cores that are most ideal for NUMA
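NVML also exposes this topology programmatically; a sketch assuming at least two GPUs in the system (the level is printed as its raw nvmlGpuTopologyLevel_t value; error handling omitted):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t gpu0, gpu1;
    nvmlGpuTopologyLevel_t level;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &gpu0);
    nvmlDeviceGetHandleByIndex(1, &gpu1);

    // Lowest common topology level between the two GPUs
    // (internal PCIe switch, host bridge, socket, ...)
    nvmlDeviceGetTopologyCommonAncestor(gpu0, gpu1, &level);
    printf("GPU0 <-> GPU1 common ancestor level: %d\n", (int)level);

    nvmlShutdown();
    return 0;
}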
HEALTH
Both APIs and tools to monitor/manage the health of a GPU
  ECC error detection: both SBE and DBE
  XID errors
  PCIe throughput and errors: gen/width, errors, throughput
  Violation counters: thermal and power violations of maximum thresholds
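A sketch of some of the corresponding NVML health queries (volatile ECC counters reset when the driver reloads; error handling omitted):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned long long dbe = 0;
    unsigned int txKBps = 0;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Uncorrected (double-bit) ECC errors since the driver was loaded
    nvmlDeviceGetTotalEccErrors(dev, NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                                NVML_VOLATILE_ECC, &dbe);

    // Instantaneous PCIe transmit throughput, in KB/s
    nvmlDeviceGetPcieThroughput(dev, NVML_PCIE_UTIL_TX_BYTES, &txKBps);

    printf("DBE count: %llu, PCIe TX: %u KB/s\n", dbe, txKBps);

    nvmlShutdown();
    return 0;
}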
PERFORMANCE
  Driver Persistence
  Power and Thermal Management
  Clock Management
DRIVER PERSISTENCE
By default, the driver unloads when the GPU is idle
  The driver must re-load when a job starts, slowing startup
  If ECC is on, memory is cleared between jobs
Persistence mode keeps the driver loaded when GPUs are idle:
  # nvidia-smi -i <device#> -pm 1
  Faster job startup time
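Persistence mode can also be set through NVML (Linux only, requires root); a minimal sketch with error handling omitted:

#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Keep the driver loaded even when no clients are connected
    nvmlDeviceSetPersistenceMode(dev, NVML_FEATURE_ENABLED);

    nvmlShutdown();
    return 0;
}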
POWER AND THERMAL DATA
[Chart: GPU clocks, GFLOPS, and power/temperature over time across runs (run1-run7), showing inconsistent application performance once the power/thermal limit is reached]
Power/thermal capping: clocks are lowered as a preventive measure
POWER AND THERMAL DATA
List temperature margins:
  nvidia-smi -q -d temperature
  Temperature
    Current Temp      : 90 C
    GPU Slowdown Temp : 92 C
    GPU Shutdown Temp : 97 C
Query power cap settings:
  nvidia-smi -q -d power
  Power Readings
    Power Limit          : 95 W
    Default Power Limit  : 100 W
    Enforced Power Limit : 95 W
    Min Power Limit      : 70 W
    Max Power Limit      : 10 W
Set power cap:
  nvidia-smi --power-limit=150
  Power limit for GPU 0000:0X:00.0 was set to 150.00W from 95.00W
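Roughly the same data is available through NVML; a sketch (power values are in milliwatts, setting the limit requires root, error handling omitted):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int slowdownC = 0, limitMw = 0, minMw = 0, maxMw = 0;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Temperature at which clocks start to be throttled
    nvmlDeviceGetTemperatureThreshold(dev, NVML_TEMPERATURE_THRESHOLD_SLOWDOWN, &slowdownC);

    // Current power cap and the allowed cap range (milliwatts)
    nvmlDeviceGetPowerManagementLimit(dev, &limitMw);
    nvmlDeviceGetPowerManagementLimitConstraints(dev, &minMw, &maxMw);
    printf("Slowdown: %u C, cap: %u mW (range %u-%u mW)\n",
           slowdownC, limitMw, minMw, maxMw);

    // Lower the power cap to the minimum allowed (root only)
    nvmlDeviceSetPowerManagementLimit(dev, minMw);

    nvmlShutdown();
    return 0;
}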
CLOCK MANAGEMENT
List supported clocks:
  nvidia-smi -q -d supported_clocks
  Example:
  Supported Clocks
    Memory   : 3004 MHz
      Graphics : 875 MHz
      Graphics : 810 MHz
      Graphics : 745 MHz
      Graphics : 666 MHz
    Memory   : 324 MHz
      Graphics : 324 MHz
List current clocks:
  nvidia-smi -q -d clocks
  Clocks
    Graphics : 324 MHz
    SM       : 324 MHz
    Memory   : 324 MHz
  Applications Clocks
    Graphics : 745 MHz
    Memory   : 3004 MHz
  Default Applications Clocks
    Graphics : 745 MHz
    Memory   : 3004 MHz
Set application clocks and launch a CUDA application:
  nvidia-smi -ac 3004,810
  Clocks
    Graphics : 810 MHz
    SM       : 810 MHz
    Memory   : 3004 MHz
  Applications Clocks
    Graphics : 810 MHz
    Memory   : 3004 MHz
Reset application clocks:
  nvidia-smi -rac
  Applications Clocks
    Graphics : 745 MHz
    Memory   : 3004 MHz
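The equivalent NVML calls, as a sketch (clock values must come from the supported-clocks queries; setting clocks typically requires root; buffer sizes and error handling simplified):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int memClocks[32], gfxClocks[128];
    unsigned int memCount = 32, gfxCount = 128;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Enumerate supported memory clocks, then the graphics clocks valid for one of them
    nvmlDeviceGetSupportedMemoryClocks(dev, &memCount, memClocks);
    nvmlDeviceGetSupportedGraphicsClocks(dev, memClocks[0], &gfxCount, gfxClocks);
    printf("%u graphics clocks supported at %u MHz memory\n", gfxCount, memClocks[0]);

    // Pin application clocks to a supported <memory, graphics> pair, e.g. 3004,810
    nvmlDeviceSetApplicationsClocks(dev, memClocks[0], gfxClocks[0]);

    // Restore the default application clocks
    nvmlDeviceResetApplicationsClocks(dev);

    nvmlShutdown();
    return 0;
}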
CLOCK BEHAVIOR (K80)
  Fixed clocks: best for consistent perf
  Autoboost (boost up): generally best for max perf
MONITORING & ANALYSIS
  Events
  Samples
  Background Monitoring
HIGH FREQUENCY MONITORING
Provides higher quality data for perf limiters, error events, and sensors
Includes XIDs, power, clocks, utilization, and throttle events
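NVML buffers recent samples of several of these metrics; a rough sketch of pulling buffered power-draw samples (buffer sizing simplified, the value union member depends on the reported sample value type, error handling omitted):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlValueType_t valType;
    nvmlSample_t samples[64];
    unsigned int count = 64;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Pull power-draw samples buffered by the driver (lastSeenTimeStamp = 0 means "all available")
    if (nvmlDeviceGetSamples(dev, NVML_TOTAL_POWER_SAMPLES, 0,
                             &valType, &count, samples) == NVML_SUCCESS) {
        for (unsigned int i = 0; i < count; i++)
            printf("%llu us: %u mW\n", samples[i].timeStamp,
                   samples[i].sampleValue.uiVal);
    }

    nvmlShutdown();
    return 0;
}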
HIGH FREQUENCY MONITORING
nvidia-smi stats
Visualize monitored data using a 3rd-party/custom UI

Sample output (metric, timestamp, value):
  procclk , 1395544840748857, 324
  memclk  , 1395544840748857, 324
  pwrdraw , 1395544841083867, 20
  pwrdraw , 1395544841251269, 20
  gpuutil , 1395544840912983, 0
  violpwr , 1395544841708089, 0
  procclk , 1395544841798380, 705
  memclk  , 1395544841798380, 2600
  pwrdraw , 1395544841843620, 133
  xid     , 1395544841918978, 31
  pwrdraw , 1395544841948860, 250
  violpwr , 1395544842708054, 345

[Chart: power draw (W), power capping (violation dt), and clock changes (MHz) over time; annotations mark idle clocks, clock boost, an XID error, and power capping]
BRIEF FORMAT
Scrolling single-line interface
Metrics/devices to be displayed can be configured
  nvidia-smi dmon -i <device#>
[Screenshot: dmon output while a CUDA app runs; power limit = 160 W, slowdown temp = 90 C]
BACKGROUND MONITORING
root@:~$ nvidia-smi daemon
Background nvsmi process
  Only one instance allowed
  Must be run as root
Logs GPU stats to daily log files: /var/log/nvstats-yyyymmdd
  (Log file path can be configured; compressed file)
PLAYBACK/EXTRACT LOGS
Extract/replay the complete log file, or parts of it, generated by the daemon
Useful to isolate GPU problems that happened in the past
  nvidia-smi replay -f <replay file> -b 9:00:00 -e 9:00:05
LOOKING AHEAD
  NVIDIA Diagnostic Tool Suite
  Cluster Management APIs
NVIDIA DIAGNOSTIC TOOL SUITE
User-runnable, user-actionable health and diagnostic tool
SW, HW, perf, and system integration coverage
Command line, pass/fail, configurable
Goal is to consolidate key needs around one tool
Usage scenarios:
  Prologue: pre-job sanity
  Epilog: post-job analysis
  Manual: offline debug
Run by an admin (interactive) or a resource manager (scripted)
NVIDIA DIAGNOSTIC TOOL SUITE
Extensible diagnostic tool; healthmon will be deprecated
Determines if a system is ready for a job
Config / Mode
Plugins:
  Hardware: PCIe, SM/CE, FB
  Software: Driver Sanity, CUDA Sanity, Driver Conflicts
  Analysis
Data collection via NVML stats; output to logs and stdout
NVIDIA DIAGNOSTIC TOOL SUITE
  JSON format
  Binary and text logging options
  Metrics vary by plugin
  Various existing tools to parse, analyze, and display data
NVIDIA CLUSTER MANAGEMENT
[Diagram: an ISV/OEM management console on the head node talks over the network to ISV/OEM software on each compute node, which uses the NV node engine (lib) to manage that node's GPUs]
NVIDIA CLUSTER MANAGEMENT
[Diagram: an NV management client (ISV & OEM) on the head node talks to the NV cluster engine, which communicates over the network with the NV node engine running alongside ISV/OEM software on each compute node to manage its GPUs]
NVIDIA CLUSTER MANAGEMENT
  Stateful, proactive monitoring with actionable insights
  Comprehensive health diagnostics
  Policy management
  Configuration management
NVIDIA REGISTERED DEVELOPER PROGRAMS
Everything you need to develop with NVIDIA products
Membership is your first step in establishing a working relationship with NVIDIA Engineering
  Exclusive access to pre-releases
  Submit bugs and feature requests
  Stay informed about latest releases and training opportunities
  Access to exclusive downloads
  Exclusive activities and special offers
  Interact with other developers in the NVIDIA Developer Forums
REGISTER FOR FREE AT: developer.nvidia.com
THANK YOU
S5894 - Hangout: GPU Cluster Management & Monitoring
Thursday, 03/19, 5pm to 6pm, Location: Pod A
http://docs.nvidia.com/deploy/index.html
Contact: cudatools@nvidia.com
APPENDIX
SUPPORTED PLATFORMS/PRODUCTS
Supported platforms:
  Windows (64-bit) / Linux (32-bit and 64-bit)
Supported products:
  Full support
    All Tesla products, starting with the Fermi architecture
    All Quadro products, starting with the Fermi architecture
    All GRID products, starting with the Kepler architecture
    Selected GeForce Titan products
  Limited support
    All GeForce products, starting with the Fermi architecture
CURRENT TESLA GPUS
GPU   Single Precision Peak (SGEMM)  Double Precision Peak (DGEMM)  Memory Size  Memory Bandwidth (ECC off)  PCIe Gen  System Solution
K80   5.6 TF                         1.8 TF                         2 x 12 GB    480 GB/s                    Gen3      Server
K40   4.29 TF (3.22 TF)              1.43 TF (1.33 TF)              12 GB        288 GB/s                    Gen3      Server + Workstation
K20X  3.95 TF (2.90 TF)              1.32 TF (1.22 TF)              6 GB         250 GB/s                    Gen2      Server only
K20   3.52 TF (2.61 TF)              1.17 TF (1.10 TF)              5 GB         208 GB/s                    Gen2      Server + Workstation
K10   4.58 TF                        0.19 TF                        8 GB         320 GB/s                    Gen3      Server only
AUTO BOOST
User-specified settings for automated clock changes
  nvidia-smi --auto-boost-default=0/1
  Persistence mode keeps the setting from being lost when the driver unloads
  Enabled by default on Tesla K80
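A sketch of the corresponding NVML calls (changing the default typically requires root; the final flags argument is reserved and set to 0; error handling omitted):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlEnableState_t isEnabled, defaultEnabled;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // Query current and default auto boost state
    nvmlDeviceGetAutoBoostedClocksEnabled(dev, &isEnabled, &defaultEnabled);
    printf("auto boost: current=%d default=%d\n", isEnabled, defaultEnabled);

    // Disable auto boost by default for future application runs
    nvmlDeviceSetDefaultAutoBoostedClocksEnabled(dev, NVML_FEATURE_DISABLED, 0);

    nvmlShutdown();
    return 0;
}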
GPU PROCESS ACCOUNTING
Provides per-process accounting of GPU usage using Linux PID
Accessible via NVML or nvidia-smi (in comma-separated format)
Requires driver to be continuously loaded (i.e. persistence mode)
No RM integration yet; use site scripts, i.e. prologue/epilogue
Enable accounting mode:
  $ sudo nvidia-smi -am 1
Human-readable accounting output:
  $ nvidia-smi -q -d ACCOUNTING
Output comma-separated fields:
  $ nvidia-smi --query-accounted-apps=gpu_name,gpu_util --format=csv
Clear current accounting logs:
  $ sudo nvidia-smi -caa
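A sketch of the accounting queries through NVML (assumes accounting mode is already enabled and at least one process has been recorded; error handling omitted):

#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    unsigned int pids[128];
    unsigned int count = 128;
    nvmlAccountingStats_t stats;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);

    // PIDs with recorded accounting data on this GPU
    if (nvmlDeviceGetAccountingPids(dev, &count, pids) == NVML_SUCCESS && count > 0) {
        // Per-process stats: GPU/memory utilization, max memory usage, run time
        nvmlDeviceGetAccountingStats(dev, pids[0], &stats);
        printf("pid %u: gpu util %u%%, max mem %llu B, time %llu ms\n",
               pids[0], stats.gpuUtilization, stats.maxMemoryUsage, stats.time);
    }

    nvmlShutdown();
    return 0;
}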
MONITORING SYSTEM WITH NVML SUPPORT
Examples: Ganglia, Nagios, Bright Cluster Manager, Platform HPC
Or write your own plugins using NVML
TURN OFF ECC
ECC can be turned off, which makes more GPU memory available at the cost of error correction/detection
Configured using NVML or nvidia-smi:
  # nvidia-smi -e 0
Requires reboot to take effect