The Top Six Advantages of CUDA-Ready Clusters
Ian Lumb, Bright Evangelist
GTC Express Webinar, January 21, 2015

"We scientists are time-constrained," said Dr. Yamanaka. "Our priority is our research, not managing our clusters. Bright [Cluster Manager] is intuitive to use, and with it I can effectively manage my cluster without wasting time writing scripts, or synchronizing management tool revisions. Provisioning is fast and easy too. I prefer this approach over open source toolkits."
http://www.brightcomputing.com/news-tokyo-institute-of-technology-gordon-bell-prize-winner-uses-bright-cluster-managerto-develop-applications-for-one-of-the-worlds-fastest-supercomputers

CUDA-Ready Clusters
1. You focus on coding, not infrastructure & toolchains
2. You're always in sync with GPUs + CUDA
3. You cross-develop with confidence and ease
   Maintaining and using highly customized environments
4. You choose and combine in programming GPUs
   CUDA or OpenCL or OpenACC, and combine with MPI
5. You have converged HPC + Big Data Analytics
   You have access to Hadoop alongside HPC
6. You seamlessly utilize The Cloud
   You extend into AWS, deploy OpenStack
CUDA-ready clusters are GPU developer-ready

Bright Cluster Manager (architecture diagram)
  Hardware: CPU, GPUs, Memory, Disk, Ethernet, Interconnect, IPMI / iLO, PDU
  Operating system: SLES / RHEL / CentOS / SL
  Core: Cluster Management Daemon; SSL / SOAP / X509 / IPtables
  Functions: Provisioning, Monitoring, Automation, Health Checks, Management
  Interfaces: Cluster Management GUI, Cluster Management Shell, User Portal
  Workload managers: Slurm, PBS Pro, Torque/Maui, Torque/MOAB, Grid Engine, LSF
  CUDA Environment: Compilers, Libraries, Debuggers, Profilers

Unified Memory
http://info.brightcomputing.com/blog/bid/196783/bright-cluster-manager-integrates-support-for-cuda-6

NVIDIA GPU Boost

Modernized monitoring for HPC clusters
http://insidehpc.com/2014/11/monitoring-hpc-clusters-modernized/

Cluster Health Management
Provide a problem-free environment for running jobs
Four elements
1. Cluster management automation
2. Regular health checks
3. Pre-job health checks
4. Hardware stability & performance tests
All elements above are configurable and extensible

Syncing with GPUs + CUDA
Innovation characterizes the entire history and evolution of GPU programmability through CUDA, BUT it introduces challenges as well as opportunities
Bright Computing's approach leverages
  People: Proactively maintaining business and technical relationships
  Process: Hands-on engineering begins with release candidates
  Product: Preliminary to fully productized implementations
Bright Cluster Manager is released once or twice per year
Updates flow continuously
http://info.brightcomputing.com/blog/cuda-6.5-something-for-nothing
http://www.brightcomputing.com/news-bright-cluster-manager-adds-support-for-the-nvidia-tesla-k80-dual-gpu-accelerator

Available Versions of the CUDA Toolkit

Using CUDA 6.0
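When several CUDA toolkit versions are installed side by side, it helps to confirm which one is active in the current session. The following is a minimal Python sketch, assuming that `nvcc` is on the PATH once the desired toolkit has been selected (for example through environment modules) and that its `--version` output contains a "release X.Y" string. The helper names `parse_nvcc_release` and `active_cuda_release` are hypothetical and are not part of Bright Cluster Manager or the CUDA Toolkit.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Return the CUDA release (e.g. '6.0') found in `nvcc --version` output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", version_text)
    return match.group(1) if match else None


def active_cuda_release():
    """Report the release of whichever nvcc is first on PATH, or None if nvcc is absent."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    completed = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(completed.stdout)
```

Calling `active_cuda_release()` before and after switching toolkits is one way to check that the switch took effect.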

HPC Development Environment
  Compilers (GNU, Intel*, AMD, Portland*, etc.)
  Debuggers and profilers (GNU, TAU, Allinea, TotalView)
  MPI libraries (OpenMPI, MPICH, MPICH-MX, MVAPICH)
  Other libraries (threading libraries, OpenMP, Global Arrays, HDF5, IPP, TBB, NetCDF, PETSc, etc.)
  Mathematical libraries (ACML, MKL*, FFTW, GMP, GotoBLAS, ScaLAPACK, etc.)
  Environment modules

Programming GPUs
  CUDA, OpenCL, OpenACC, MPI
Tools
  CUDA gdb, nvidia-smi, CUDA Utility Library, Examples
3rd Party
  Allinea, Rogue Wave
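Of the tools listed, `nvidia-smi` is the easiest to script. Below is a minimal Python sketch, assuming `nvidia-smi` is installed on the node and supports the `--query-gpu` and `--format=csv,noheader,nounits` options. The function names and the choice of fields are illustrative assumptions, not something prescribed by the slides.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; this selection is an illustrative assumption.
QUERY_FIELDS = ["index", "name", "temperature.gpu", "memory.used", "memory.total"]


def parse_gpu_rows(csv_text):
    """Convert CSV output from nvidia-smi (no header, no units) into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        values = [field.strip() for field in record]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi on the local node and return one dict per GPU."""
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_gpu_rows(completed.stdout)
```

Keeping the parser separate from the subprocess call means the parsing logic can be exercised on a machine without GPUs.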

CUDA Development Environment

HPC and Hadoop
  Use GPUs for HPC and Big Data Analytics
  Introduce GPUs into Hadoop clusters
  Make use of Hadoop services

GPUs in the Cloud? The Top Four Reasons
1. You can realize possibilities using the cloud
   You can scale up and scale out
2. You still realize the promise of GPU programmability via HPC in the cloud
3. Your use of the cloud is transparent
   You've found ways to "hide" latency
   Constraints apply for MPI apps
4. Your go-to apps still work in the cloud
http://info.brightcomputing.com/blog/bid/196290/the-top-4-reasons-you-should-try-cloud-based-gpus-for-hpc

Cloud Utilization Scenario I: Cluster on Demand
(Diagram: head node with node001, node002 and node003)

Cloud Utilization Scenario II: Cluster Extension
(Diagram: head node with node001, node002 and node003, extended by node004, node005, node006 and node007)

Case Study: TUAT (1)
The Customer
  Engages in materials-science research
  Compares computational models with physical experiments
  High-resolution, 3D phase-field modeling at large scales using GPUs
The Challenge
  Make available the latest innovations in GPU technology without distracting focus from research

Case Study: TUAT (2)
The Solution
  Laboratory GPU cluster designed and implemented by HPCTech Corp.
  Bright Cluster Manager deployed by HPCTech
  Use Bright to fully manage the entire CUDA environment, including regular updates
  Use modules environment via Bright to manage multiple CUDA environments
  Prototype simulations using laboratory HPC cluster
    Includes debugging and tuning code
  Execute large-scale simulations using TSUBAME
The Results

Figure: Snapshots of austenite-to-ferrite transformation behavior in an Fe-C alloy simulated by a multi-phase-field method. Upper and lower panels show the time evolution of ferrite grains and carbon concentration during the phase transformation, at calculation steps 25000, 150000 and 275000 (scale bar: 51 μm; carbon concentration scale: 0.01 to 0.38 wt.%). The simulation was performed on a 512 × 512 × 256 computational grid using 8 GPUs in the lab cluster. (Prof. A. Yamanaka, TUAT)

Figure: Elapsed time [×1000 s] (axis range 0 to 5) versus number of GPUs (128, 256, 512). Performance of multiple-GPU computation of the multi-phase-field simulation of austenite-to-ferrite transformation in an Fe-C alloy. The performance was measured by performing the simulations on the TSUBAME2.5 supercomputer of Tokyo Institute of Technology. The numbers of computational grid points, crystal grains and calculation steps were 512^3, 4068 and 10^5, respectively. (Prof. A. Yamanaka, TUAT, priv. comm.)
http://www.tuat.ac.jp/~yamanaka/
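The scaling run keeps the problem size fixed while the GPU count grows, so each GPU handles a smaller share of the grid. The short Python sketch below works through that arithmetic, reading the grid size in the caption as 512 × 512 × 512 points. It also includes a helper for strong-scaling efficiency that takes measured elapsed times as inputs. No timings from the plot are reproduced here, and both function names are hypothetical.

```python
def grid_points_per_gpu(grid_shape, gpu_count):
    """Average number of grid points handled by each GPU for an even decomposition."""
    total = 1
    for extent in grid_shape:
        total *= extent
    return total / gpu_count


def strong_scaling_efficiency(base_gpus, base_time, gpus, time):
    """Parallel efficiency relative to a baseline run of the same problem size.

    An efficiency of 1.0 means elapsed time fell in proportion to the GPU count.
    """
    speedup = base_time / time
    ideal_speedup = gpus / base_gpus
    return speedup / ideal_speedup


for gpus in (128, 256, 512):
    # 512^3 = 134,217,728 grid points in total
    print(gpus, int(grid_points_per_gpu((512, 512, 512), gpus)))
```

For the three GPU counts on the horizontal axis, the per-GPU share works out to 1,048,576, 524,288 and 262,144 grid points respectively.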

Case Study: TUAT (3)
"We scientists are time-constrained," said Dr. Yamanaka. "Our priority is our research, not managing our clusters. Bright is intuitive to use, and with it I can effectively manage my cluster without wasting time writing scripts, or synchronizing management tool revisions. Provisioning is fast and easy too. I prefer this approach over open source toolkits."

Q & A
Ian Lumb, ian.lumb@brightcomputing.com
http://www.brightcomputing.com/

Additional Slides

Cluster Health Management
Goal: provide a problem-free environment for running jobs
Four elements
1. Cluster management automation
2. Regular health checks
   Actions that return PASS, FAIL or UNKNOWN
   Can be associated with a settable severity and a message
   Can launch an action based on any response value
3. Pre-job health checks
   Let the workload manager hold the job very briefly
   Check the health of each reserved node
   If unhealthy, take the node offline and inform the system administrator
   Let the workload manager reschedule the job to a different set of nodes
4. Hardware stability & performance tests
   Very wide range of tests
   May include disk overwrites and reboot(s)
All elements above are configurable and extensible
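The health-check behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not Bright Cluster Manager's API: the `Result`, `HealthCheck` and `pre_job_check` names are hypothetical, and treating an UNKNOWN result as unhealthy in the pre-job step is an assumption made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional


class Result(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNKNOWN = "UNKNOWN"


@dataclass
class HealthCheck:
    name: str
    probe: Callable[[str], Result]  # takes a node name, returns a Result
    severity: int = 0               # settable severity
    message: str = ""               # message attached to the check
    actions: Optional[Dict[Result, Callable[[str], None]]] = None

    def run(self, node: str) -> Result:
        try:
            result = self.probe(node)
        except Exception:
            # A probe that cannot answer is reported as UNKNOWN, not FAIL.
            result = Result.UNKNOWN
        action = (self.actions or {}).get(result)
        if action is not None:
            action(node)  # an action may be tied to any response value
        return result


def pre_job_check(nodes, checks, take_offline, notify_admin):
    """Check each reserved node; return (healthy, unhealthy) node lists.

    Assumption: any result other than PASS marks the node unhealthy.
    """
    healthy, unhealthy = [], []
    for node in nodes:
        results = [check.run(node) for check in checks]
        if all(result is Result.PASS for result in results):
            healthy.append(node)
        else:
            take_offline(node)
            notify_admin(node)
            unhealthy.append(node)
    return healthy, unhealthy
```

In this model, a non-empty `unhealthy` list would be the workload manager's cue to reschedule the held job onto a different set of nodes.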

Bright API