TOOLS AND TIPS FOR MANAGING A GPU CLUSTER Adam DeConinck HPC Systems Engineer, NVIDIA

Steps for configuring a GPU cluster:
1. Select compute node hardware
2. Configure your compute nodes
3. Set up your cluster for GPU jobs
4. Monitor and test your cluster

NVML and nvidia-smi. The primary management tools mentioned throughout this talk are NVML and nvidia-smi. NVML (the NVIDIA Management Library) queries state and configures the GPU, with C, Perl, and Python APIs. nvidia-smi is the command-line client for NVML. The GPU Deployment Kit includes the NVML headers, docs, and nvidia-healthmon.

Select compute node hardware: choose the correct GPU, select server hardware, and consider compatibility with networking hardware.

What GPU should I use? The Tesla M-series is designed for servers: passively cooled, higher performance, chassis/BMC integration, and out-of-band monitoring.

PCIe Topology Matters. The biggest factor right now in server selection is PCIe topology: direct memory access between devices with P2P transfers, and unified addressing for the system and GPUs, work best when all devices are on the same PCIe root or switch. (Diagram: a single 0x0000-0xFFFF address space spanning system memory, GPU0 memory, and GPU1 memory, shared by the CPU and both GPUs.)

P2P on dual-socket servers. P2P communication is supported between GPUs attached to the same IOH; the link between the two CPU sockets is incompatible with the PCIe P2P specification, so P2P does not work between GPUs on different sockets. (Diagram: GPU0/GPU1 on CPU0's PCIe root and GPU2/GPU3 on CPU1's, each at x16.)

For many GPUs, use PCIe switches. PCIe switches are fully supported, and the best P2P performance is between devices on the same switch. (Diagram: GPU0/GPU1 behind Switch 0 and GPU2/GPU3 behind Switch 1, each GPU at x16, with both switches attached to CPU0's PCIe root.)

How many GPUs per server? If apps use P2P heavily: more GPUs per node are better; choose servers with an appropriate PCIe topology and tune the application to do transfers within one PCIe complex. If apps don't use P2P: performance may be dominated by host <-> device data transfers, so prefer more servers with fewer GPUs per server.

For many devices, use PCIe switches. PCIe switches are fully supported for all operations, and the best P2P performance is between devices on the same switch. P2P is also supported with other devices such as a NIC via GPUDirect RDMA. (Diagram: GPU0/GPU1 behind Switch 0; GPU2 and another device such as a NIC behind Switch 1; all at x16 on CPU0's PCIe root.)

GPUDirect RDMA on the network: PCIe P2P between the NIC and GPU without touching host memory, for greatly improved performance. Currently supported on Cray (XK7 and XC-30) and Mellanox FDR InfiniBand. Some MPI implementations support GPUDirect RDMA.

Configure your compute nodes: configure the system BIOS, install and configure device drivers, configure GPU devices, and set GPU power limits.

Configure system BIOS. Configure a large PCIe address space: many servers ship with 64-bit PCIe addressing turned off, but it needs to be turned on for the Tesla K40 or systems with many GPUs; the setting might be called "Enable 4G Decoding" or similar. Configure cooling for passive GPUs: the Tesla M-series has passive cooling and relies on system fans, and it communicates thermals to the BMC to manage fan speed, so make sure BMC firmware is up to date and the fans are configured correctly. Make sure the remote console uses the onboard VGA, not the offboard NVIDIA GPU.

Disable the nouveau driver. nouveau does not support CUDA and will conflict with the NVIDIA driver. Two steps to disable it: 1. Edit /etc/modprobe.d/disable-nouveau.conf to blacklist nouveau and set nouveau modeset=0. 2. Rebuild the initial ramdisk: RHEL: dracut --force; SUSE: mkinitrd; Debian/Ubuntu: update-initramfs -u.
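
As a concrete example, the modprobe configuration typically looks like the following (a minimal sketch; the file name is arbitrary as long as it ends in .conf):

    # /etc/modprobe.d/disable-nouveau.conf
    blacklist nouveau
    options nouveau modeset=0

    # Rebuild the initramfs afterwards so the blacklist takes effect at boot:
    #   dracut --force          (RHEL)
    #   mkinitrd                (SUSE)
    #   update-initramfs -u     (Debian/Ubuntu)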

Install the NVIDIA driver. There are two ways to install the driver. Command-line installer: bundled with the CUDA toolkit (developer.nvidia.com/cuda) or stand-alone (www.nvidia.com/drivers). RPM/DEB packages: provided by NVIDIA (major versions only) or by Linux distros (on their own release schedule). It is not easy to switch between these methods.

Initializing a GPU in runlevel 3. Most clusters operate at runlevel 3 (no X server), so you should initialize the GPU explicitly in an init script. At minimum: load the kernel modules (nvidia, plus nvidia_uvm in CUDA 6) and create the device nodes with mknod. Optional steps: configure compute mode, set driver persistence, and set power limits.
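
A minimal sketch of such an init script, modeled on the device-node creation example in NVIDIA's Linux documentation (assumes character major number 195 for the nvidia devices and a dynamically assigned major for nvidia-uvm; adapt it to your init system):

    #!/bin/sh
    # Load the NVIDIA kernel modules (nvidia_uvm is only needed for CUDA 6+).
    /sbin/modprobe nvidia || exit 1
    /sbin/modprobe nvidia_uvm

    # Create one /dev/nvidiaN node per GPU, plus the control device.
    NVDEVS=$(lspci | grep -i NVIDIA)
    N3D=$(echo "$NVDEVS" | grep -c "3D controller")
    NVGA=$(echo "$NVDEVS" | grep -c "VGA compatible controller")
    N=$((N3D + NVGA - 1))
    for i in $(seq 0 $N); do
        mknod -m 666 /dev/nvidia$i c 195 $i
    done
    mknod -m 666 /dev/nvidiactl c 195 255

    # nvidia-uvm gets a dynamic major number; look it up in /proc/devices.
    UVM_MAJOR=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
    [ -n "$UVM_MAJOR" ] && mknod -m 666 /dev/nvidia-uvm c "$UVM_MAJOR" 0

Optional steps such as compute mode, persistence, and power limits (covered below) can be appended to the same script.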

Install GPUDirect RDMA network drivers (if available). Mellanox OFED 2.1 (beta) has support for GPUDirect RDMA; it should also be supported on Cray systems running CLE. HW required: Mellanox FDR HCAs and Tesla K10/K20/K20X/K40. SW required: NVIDIA driver 331.20 or better, CUDA 5.5 or better, and the GPUDirect plugin from Mellanox. This enables an additional kernel driver, nv_peer_mem.
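
Once the plugin is installed, a quick sanity check is to confirm the extra module is actually present (a sketch):

    # Load the peer-memory module and verify it is listed
    modprobe nv_peer_mem
    lsmod | grep nv_peer_mem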

Configure driver persistence. By default, the driver unloads when the GPU is idle, so it must re-load when a job starts, slowing startup; if ECC is on, memory is also cleared between jobs. The persistence daemon keeps the driver loaded while GPUs are idle: # /usr/bin/nvidia-persistenced --persistence-mode [--user <username>]. The result is faster job startup time and slightly lower idle power.

Configure ECC. Tesla and Quadro GPUs support ECC memory. Correctable errors are logged but not scrubbed; uncorrectable errors cause an error at the user and system level, and the GPU rejects new work after an uncorrectable error until reboot. ECC can be turned off, which makes more GPU memory available at the cost of error correction/detection. It is configured using NVML or nvidia-smi (# nvidia-smi -e 0) and requires a reboot to take effect.
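
For example, to inspect ECC state and error counts before changing anything, and to disable ECC on a single GPU (a sketch; -e 1 re-enables it):

    # Show current/pending ECC mode and aggregate error counts
    nvidia-smi -q -d ECC

    # Disable ECC on GPU 0 only; takes effect after the next reboot
    nvidia-smi -i 0 -e 0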

Set GPU power limits. Power consumption limits can be set with NVML/nvidia-smi on a per-GPU basis, which is useful in power-constrained environments: nvidia-smi -pl <power in watts>. Settings don't persist across reboots, so set this in your init script. Requires driver persistence.
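
As an example, query the allowed range and then cap a single GPU (a sketch; the 225 W figure is purely illustrative, so check the limits reported for your own boards):

    # Show default, current, and min/max enforceable power limits
    nvidia-smi -q -d POWER

    # Keep the driver loaded, then cap GPU 0 at 225 W
    nvidia-smi -pm 1
    nvidia-smi -i 0 -pl 225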

Set up your cluster for GPU jobs: enable GPU integration in the resource manager and MPI, set up GPU process accounting to measure usage, configure GPU Boost clocks (or allow users to do so), and manage job topology on GPU compute nodes.

Resource manager integration. Most popular resource managers have some NVIDIA integration features available: SLURM, Torque, PBS Pro, Univa Grid Engine, LSF. GPU status monitoring: report the current config, load sensor for utilization. Managing process topology: GPUs as consumables, assignment using CUDA_VISIBLE_DEVICES, and setting GPU configuration on a per-job basis. Health checks: run nvidia-healthmon or integrate with a monitoring system. NVIDIA integration is usually configured at compile time (open source) or as a plugin.

GPU process accounting. Provides per-process accounting of GPU usage using the Linux PID. Accessible via NVML or nvidia-smi (in comma-separated format). Requires the driver to be continuously loaded (i.e. persistence mode). No RM integration yet, so use site scripts, i.e. prologue/epilogue. Enable accounting mode: $ sudo nvidia-smi -am 1. Human-readable accounting output: $ nvidia-smi -q -d ACCOUNTING. Output comma-separated fields: $ nvidia-smi --query-accounted-apps=gpu_name,gpu_util --format=csv. Clear current accounting logs: $ sudo nvidia-smi -caa.
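
One way to wire this into the scheduler is a pair of site prologue/epilogue scripts (a sketch; the log path and the SLURM job-ID variable are assumptions, not part of the talk):

    # prologue: make sure accounting is on and starts from a clean slate
    nvidia-smi -am 1
    nvidia-smi -caa

    # epilogue: record per-process usage for this job, then clear the logs
    nvidia-smi --query-accounted-apps=gpu_name,gpu_util --format=csv \
        >> /var/log/gpu-accounting/${SLURM_JOB_ID:-unknown}.csv
    nvidia-smi -caa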

MPI integration with CUDA. Recent versions of most MPI libraries support sending/receiving directly from CUDA device memory: OpenMPI 1.7+, MVAPICH2 1.8+, Platform MPI, Cray MPT. This typically needs to be enabled for the MPI at compile time. Depending on the version and system topology, it may also support GPUDirect RDMA. Non-CUDA apps can use the same MPI without problems (but might link libcuda.so even if it is not needed). Enable this in the MPI modules provided for users.

GPU Boost (user-defined clocks): use power headroom to run at higher clocks. (Chart: normalized performance of Tesla K40 with GPU Boost vs. Tesla K40 at base clocks, showing speedups of roughly 11% to 25% on benchmarks including AMBER SPFP-TRPCage, LAMMPS-EAM, and NAMD 2.9-APOA1.)

GPU Boost (user-defined clocks). Configure with nvidia-smi: nvidia-smi -q -d SUPPORTED_CLOCKS lists the available clock pairs; nvidia-smi -ac <memory clock,graphics clock> sets application clocks; nvidia-smi -q -d CLOCK shows the current mode; nvidia-smi -rac resets all clocks; nvidia-smi -acp 0 allows non-root users to change clocks. Changing clocks doesn't affect the power cap, which is configured separately. Requires driver persistence. Currently supported on K20, K20X, and K40.
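
A typical sequence on a K40 might look like this (a sketch; 3004,875 is one of the K40's supported memory/graphics clock pairs, so check SUPPORTED_CLOCKS on your own boards first):

    # List the supported memory/graphics clock combinations
    nvidia-smi -q -d SUPPORTED_CLOCKS

    # Allow non-root users to change application clocks, then set them
    nvidia-smi -acp 0
    nvidia-smi -ac 3004,875

    # Verify, and reset to the default clocks when finished
    nvidia-smi -q -d CLOCK
    nvidia-smi -rac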

Managing CUDA contexts with compute mode. The compute mode determines how GPUs manage multiple CUDA contexts. 0/DEFAULT: accept simultaneous contexts. 1/EXCLUSIVE_THREAD: a single context allowed, from a single thread. 2/PROHIBITED: no CUDA contexts allowed. 3/EXCLUSIVE_PROCESS: a single context allowed, but multiple threads are OK; this is the most common setting in clusters. Changing this setting requires root access, but it sometimes makes sense to make it user-configurable.
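
To illustrate, setting exclusive-process mode on a node's GPUs (a sketch; this is typically done in the init script or a job prologue):

    # Set compute mode to EXCLUSIVE_PROCESS (3) on all GPUs
    nvidia-smi -c EXCLUSIVE_PROCESS

    # Or per GPU, by index
    nvidia-smi -i 0 -c 3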

N processes on 1 GPU: MPS. The Multi-Process Service (MPS) allows multiple processes to share a single CUDA context, giving improved performance where multiple processes share a GPU (vs. multiple open contexts) and easier porting of MPI apps: ranks can continue to map one per CPU core, but all ranks can access the GPU through the MPS server's persistent context. Server process: nvidia-cuda-mps-server. Control daemon: nvidia-cuda-mps-control. (Diagram: application ranks 0-3 funneling through the MPS persistent context onto GPU0.)
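
A minimal sketch of starting and stopping MPS on a node (the directory paths are assumptions; EXCLUSIVE_PROCESS mode is commonly set so that only the MPS server owns a context on the GPU):

    # Start the MPS control daemon; it launches nvidia-cuda-mps-server on demand
    export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
    export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
    nvidia-smi -c EXCLUSIVE_PROCESS
    nvidia-cuda-mps-control -d

    # Shut it down when the job ends or the node drains
    echo quit | nvidia-cuda-mps-control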

PCIe-aware process affinity. To get good performance, CPU processes should be scheduled on cores local to the GPUs they use. There are no good out-of-the-box tools for this yet: hwloc can help identify CPU <-> GPU locality, the PCIe device ID with NVML can be used to map to the CUDA device rank, and process affinity can be set with MPI or numactl. Possible admin actions: document the node topology and how to set affinity, and provide wrapper scripts using numactl to set the recommended affinity. (Diagram: GPU 0/GPU 1 on CPU0's PCIe root and GPU 2/GPU 3 on CPU1's, each link x16.)
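
As an illustration of such a wrapper, a hypothetical numactl-based launcher for OpenMPI (the rank-to-GPU/NUMA mapping assumes a two-socket, four-GPU node like the one in the diagram; adapt it to your real topology):

    #!/bin/bash
    # gpu-numa-wrap.sh: bind each local MPI rank to the NUMA node nearest its GPU.
    LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}

    # Assumed mapping: ranks 0,1 -> GPUs 0,1 on socket 0; ranks 2,3 -> GPUs 2,3 on socket 1.
    export CUDA_VISIBLE_DEVICES=$LOCAL_RANK
    if [ "$LOCAL_RANK" -lt 2 ]; then NUMA_NODE=0; else NUMA_NODE=1; fi

    exec numactl --cpunodebind=$NUMA_NODE --membind=$NUMA_NODE "$@"

Launched as, for example, mpirun -np 4 ./gpu-numa-wrap.sh ./my_app.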

Multiple user jobs on a multi-GPU node. The CUDA_VISIBLE_DEVICES environment variable controls which GPUs are visible to a process, as a comma-separated list of devices: export CUDA_VISIBLE_DEVICES="0,2". Tooling and resource manager support exists but is limited. Example: configure SLURM with CPU <-> GPU mappings; SLURM will use cgroups and CUDA_VISIBLE_DEVICES to assign resources, though there is limited ability to manage process affinity this way. Where possible, assign all of a job's resources on the same PCIe root complex.
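
A sketch of the SLURM side of that example (illustrative only; the CPU ranges assume the two-socket topology above, and the gres.conf/slurm.conf syntax should be checked against your SLURM version):

    # gres.conf on each compute node: tie each GPU to its local cores
    Name=gpu File=/dev/nvidia0 CPUs=0-7
    Name=gpu File=/dev/nvidia1 CPUs=0-7
    Name=gpu File=/dev/nvidia2 CPUs=8-15
    Name=gpu File=/dev/nvidia3 CPUs=8-15

    # slurm.conf: advertise the GPUs as consumable generic resources
    GresTypes=gpu
    NodeName=gpu-node[01-16] Gres=gpu:4   # plus the usual CPU/memory fields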

Monitor and test your cluster: use nvidia-healthmon to do GPU health checks on each job, use a cluster monitoring system to watch GPU behavior, and stress test the cluster.

Automatic health checks: nvidia-healthmon. Runs a set of fast sanity checks against each GPU in the system: basic sanity checks, PCIe link configuration and bandwidth between the host and peers, and GPU temperature. All checks are configurable, so set them up based on your system's expected values. Use your cluster health checker to run this for every job: a single command runs all checks, it returns 0 if successful and non-zero if a test fails, and it does not require root to run.
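
For instance, a job prologue might gate on the exit code like this (a sketch; how you drain the node depends on your resource manager):

    # Run the GPU health checks before handing the node to the job
    if ! nvidia-healthmon; then
        echo "nvidia-healthmon failed on $(hostname), draining node" >&2
        # e.g. scontrol update NodeName=$(hostname) State=DRAIN Reason=gpu-health
        exit 1
    fi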

Use a monitoring system with NVML support. Examples: Ganglia, Nagios, Bright Cluster Manager, Platform HPC. Or write your own plugins using NVML.

Good things to monitor. GPU temperature: check for hot spots; monitor with NVML or out-of-band via the system BMC. GPU power usage: higher than expected power usage => possible HW issues. Current clock speeds: lower than expected => power capping or HW problems; check Clocks Throttle Reasons in nvidia-smi. ECC error counts.
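
A plugin for any of these systems can be as simple as a periodic nvidia-smi query (a sketch; all of the fields below are standard --query-gpu fields):

    # One CSV line per GPU: temperature, power, SM clock, utilization, ECC errors
    nvidia-smi --query-gpu=index,temperature.gpu,power.draw,clocks.sm,utilization.gpu,ecc.errors.corrected.volatile.total \
        --format=csv,noheader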

Good things to monitor. Xid errors in syslog may indicate a HW error or a programming error; common non-HW causes are out-of-bounds memory access (13), illegal access (31), and bad termination of program (45). Turn on PCIe parity checking with EDAC (modprobe edac_core; echo 1 > /sys/devices/system/edac/pci/check_pci_parity) and monitor the value of /sys/devices/<pciaddress>/broken_parity_status.
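
A node-health script might fold these into two quick checks (a sketch; the syslog path is distro-dependent):

    # Any recent Xid messages from the NVIDIA driver?
    grep -i "NVRM: Xid" /var/log/messages | tail

    # Enable PCIe parity checking, then list any devices reporting broken parity
    modprobe edac_core
    echo 1 > /sys/devices/system/edac/pci/check_pci_parity
    find /sys/devices -name broken_parity_status 2>/dev/null | xargs grep -l 1 2>/dev/null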

Stress-test your cluster. The best workload for testing is the user application; alternatively, use the CUDA Samples or benchmarks (like HPL). Stress the entire system, not just the GPUs, and do repeated runs in succession. Things to watch for: inconsistent performance between nodes (config errors on some nodes), inconsistent performance between runs (cooling issues; check GPU temperatures), and slow GPUs or PCIe transfers (misconfigured SBIOS, seating issues). Get pilot users with stressful workloads and monitor during their runs. Use successful test data to set stricter bounds on monitoring and healthmon.

Always use the serial number to identify bad boards. There are multiple possible ways to enumerate GPUs (PCIe, NVML, CUDA runtime), and these may not be consistent with each other or between boots! The serial number always maps to the physical board and is printed on the board; the UUID always maps to the individual GPU (i.e., 2 UUIDs and 1 serial number if a board has 2 GPUs).
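
To collect these identifiers for an inventory, one query is enough (a sketch; serial, uuid, and pci.bus_id are standard --query-gpu fields):

    # Map each enumerated GPU to its bus address, board serial number, and UUID
    nvidia-smi --query-gpu=index,pci.bus_id,serial,uuid --format=csv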

Key take-aways. Topology matters, for both HW selection and job configuration, and you should provide tools which expose this to your users. Use NVML-enabled tools for GPU configuration and monitoring (or write your own!). Lots of hooks exist for cluster integration and management, and for third-party tools.

Where to find more information: docs.nvidia.com, developer.nvidia.com/cluster-management, the documentation in the GPU Deployment Kit, the man pages for the tools (nvidia-smi, nvidia-healthmon, etc.), and other talks in the Clusters and GPU Management tag here at GTC.

QUESTIONS? @ajdecon #GTC14