NVIDIA HPC Update
Carl Ponder, PhD; cponder@nvidia.com; NVIDIA, Austin, TX, USA; Sr. Applications Engineer, Developer Technology Group
Stan Posey; sposey@nvidia.com; NVIDIA, Santa Clara, CA, USA; HPC Program Manager, ESM and CFD Solutions
IBM and NVIDIA CWO Review: NVIDIA HPC Strategy for CWO GPU Model Developments
Global: ECMWF ESCAPE Project, NOAA NGGPS
Meso-scale: COSMO, WRF
GPU Motivation (I): Performance Trends
[Charts: peak double-precision FLOPS (GFLOPS) and peak memory bandwidth (GB/s), NVIDIA GPU vs. x86 CPU, 2008-2017, across generations M1060, M2050, M2090, K20, K40, K80, Pascal, Volta; GPUs hold roughly a 5x advantage in both metrics, growing toward ~15x with Volta.]
GPU Motivation (II): Power Limiting Trends
"Challenges of Getting ECMWF's Weather Forecast Model (IFS) to the Exascale," G. Mozdzynski, ECMWF 16th HPC Workshop
Conclusion: current mathematics, algorithms, and hardware will not be sufficient (by far)
GPU Motivation (III): ES Model Trends
Higher grid resolution with manageable compute and energy costs: global atmosphere models from ~10 km today to cloud-resolving scales of 1-3 km (128 km, 16 km, 10 km, 3 km; source: Project Athena, http://www.wxmaps.org/athena/home/)
Increased ensemble use and more ensemble members to manage uncertainty: 10x-50x more jobs
Fewer model approximations (non-hydrostatic), more features (physics, chemistry, etc.)
Accelerator technology identified as a cost-effective and practical approach to future NWP requirements
Tesla GPU Progression During Recent Years

Feature             2012 (Fermi) M2075   2014 (Kepler) K20X   2014 (Kepler) K40   2014 (Kepler) K80     Kepler / Fermi
Peak SP / SGEMM     1.03 TF              3.93 / 2.95 TF       4.29 / 3.22 TF      8.74 TF               ~4x
Peak DP / DGEMM     0.515 TF             1.31 / 1.22 TF       1.43 / 1.33 TF      2.90 TF               ~3x
Memory size         6 GB                 6 GB                 12 GB               24 GB (12 each)       2x
Mem BW (ECC off)    150 GB/s             250 GB/s             288 GB/s            480 GB/s (240 each)   2x
Memory clock        2.6 GHz              3.0 GHz              3.0 GHz
PCIe generation     Gen 2                Gen 2                Gen 3               Gen 3                 2x
# of cores          448                  2688                 2880                4992 (2496 each)      5x
Board power         235 W                235 W                235 W               300 W                 0% / +28%

Note: Tesla K80 specifications are shown as the aggregate of two GPUs on a single board
Features of Pascal GPU Architecture (2016)
NVLink interconnect at 80 GB/s (speed of CPU memory)
Stacked memory: 4x higher bandwidth (~1 TB/s), 3x capacity, 4x more efficient
Unified memory: lower development effort (available today in CUDA 6)
[Diagram: Tesla GPU with stacked HBM at ~1 TB/s, NVLink at 80-200 GB/s to a CPU (POWER, others) with DDR4 at 50-75 GB/s, unified virtual memory spanning both.]
NVLink Interconnect: Alternative to PCIe
High-speed interconnect for CPU-to-GPU and GPU-to-GPU
Performance advantage over PCIe Gen 3 (5x to 12x faster)
GPUs and CPUs share data structures at CPU memory speeds
[Diagram: Tesla GPU with stacked HBM at ~1 TB/s, NVLink at 80-200 GB/s to a CPU (POWER, others) with DDR4 at 50-75 GB/s, unified virtual memory spanning both.]
GPU Density Increasing: HPC Vendors & NVLink (2016)
All vendors will support GPU-to-GPU NVLink; several will support CPU-to-GPU
Dense GPU nodes today: Cray CS-Storm (8x K80), Dell C4130 (4x K80), HP SL270 (8x K80)
GPU and CPU Platforms and Configurations
2014: Kepler GPU over PCIe with x86, ARM64, or POWER CPUs; more CPU choices for GPU acceleration
2016: Pascal GPU with NVLink to the POWER CPU (PCIe remains for x86, ARM64); NVLink interconnect as an alternative to PCIe
IBM POWER + NVIDIA GPU Accelerated HPC
Next-gen IBM supercomputers and enterprise servers: POWER CPU + Tesla GPU
OpenPOWER Foundation: open ecosystem built on the POWER architecture (30+ members), long-term roadmap integration
First GPU-accelerated POWER-based systems available since 2015
DOE CORAL Based on POWER + GPU (2017)
US DOE CORAL systems: Summit (ORNL) and Sierra (LLNL), installation 2017 at ~150 PF each
Nodes of POWER9 CPUs + Tesla Volta GPUs, with NVLink interconnect for CPUs + GPUs
ORNL Summit system: approximately 3,400 total nodes, each node 40+ TF peak performance; about 1/5 the nodes of #2 Titan (18K+), at the same energy use as Titan (27 PF)
CORAL Summit system: 5-10x faster than Titan, 1/5th the nodes, same energy use as Titan
Feature Comparison of Summit (2017) vs. Titan (2012)
[Table of Summit-to-Titan ratios: ~5-10x application performance, ~1/5x node count, with per-node feature gains of ~30x, ~15x, and ~10-20x, at roughly unchanged energy use.]
CORAL Systems for US DOE Climate Model ACME
ACME: Accelerated Climate Model for Energy
DOE labs: Argonne, LANL, LBL, LLNL, ORNL, PNNL, Sandia
Co-design project using US DOE LCF systems; branch of CESM using CAM-SE (atmosphere) and MPAS-O (ocean)
ACME project roadmap (page 2 of the strategy report, 11 Jul 2014): 2014 systems Titan OLCF [AMD + GPU] and Mira ALCF [Blue Gene/Q]; 2017 systems CORAL (Summit, Sierra, Aurora) and Trinity / NERSC-8 (Cori)
Source: http://climatemodeling.science.energy.gov/sites/default/files/publications/acme-project-strategy-plan_0.pdf
GPU Programming Models for CPU Platforms
Available for x86 today; POWER and ARM support under development
Libraries (AmgX, cuBLAS), compiler directives, and programming languages
PGI Compilers for OpenPOWER + GPU
Feature parity with PGI compilers on Linux/x86 + Tesla: CUDA Fortran, OpenACC, OpenMP, CUDA C/C++ host compiler
Integrated with IBM's optimized LLVM / POWER code generator
Limited access in 2015, beta 1H 2016, production in 2016; x86 code recompiles unchanged
NVIDIA GPU Focus for Atmosphere Models: Recent Developments
Global: ECMWF ESCAPE Project, NOAA NGGPS
Regional: COSMO, WRF
ECMWF Project for Exascale Weather Prediction
"Expectations towards Exascale: Weather and Climate Prediction," P. Bauer, ECMWF, EESI 2015
ESCAPE HPC goals: standardized, highly optimized kernels on specialized hardware; overlap of communication and computation; compilers/standards supporting portability
Project approach: co-design between domain scientists, computer scientists, and HPC vendors
ECMWF Initial IFS Investigation on Titan GPUs
"Challenges of Getting ECMWF's Weather Forecast Model (IFS) to the Exascale," G. Mozdzynski, ECMWF 16th HPC Workshop
Legendre transform kernels observed at >3x to >4x speedup for Titan K20X vs. XC-30 24-core Ivy Bridge
NOAA NGGPS: Background on Program
From: "Next Generation HPC and Forecast Model Application Readiness at NCEP," John Michalakes, NOAA NCEP; AMS, Phoenix, AZ, Jan 2015
NOAA NGGPS: NH Model Dycore Candidates
Of the five non-hydrostatic dycore candidates, FV3 and MPAS were selected for Level-II evaluation
NVIDIA and NOAA investigating potential for GPUs; NOAA evaluations under the HPC program SENA (Software Engineering for Novel Architectures, 2015)
From: "Next Generation HPC and Forecast Model Application Readiness at NCEP," John Michalakes, NOAA NCEP; AMS, Phoenix, AZ, Jan 2015
COSMO Regional Model on GPUs (http://www.cosmo-model.org/)
COSMO model fully implemented on GPUs by MeteoSwiss
COSMO consortium approved GPU code for the standard distribution (5.3)
MeteoSwiss to deploy the first-ever operational NWP on GPUs: successful daily test runs on GPUs since Dec 2014, operational in 2016
[Images: COSMO-2 (2.2 km) and COSMO-1 (1.1 km) forecasts vs. reality; increased resolution = quality forecast]
Operational NWP at MeteoSwiss Before GPUs
"Delivering Performance in Scientific Simulations: Present and Future Role of GPUs in Supercomputers," Dr. Thomas Schulthess, Director CSCS; NVIDIA GTC, March 2014
New Operational NWP at MeteoSwiss With GPUs
"Climate Simulation and Numerical Weather Prediction Using GPUs," Dr. Xavier Lapillonne, MeteoSwiss; EGU Assembly, Apr 2015
MeteoSwiss will deploy the first-ever operational NWP with GPUs (2016): two identical systems (Cray CS + K80 GPUs) purchased, to be installed during 2015; successful daily runs of pre-operational COSMO on GPUs since Dec 2014 at CSCS
New operational NWP configurations, possible only with the performance-per-energy gains from GPUs:
Short-range rapid update: 2.2 km increased to 1.1 km (grid = 1158x774x80); 24 hr forecast generated every 3 h (8 per day)
Medium-range ensemble (EPS): 6.6 km increased to 2.2 km (grid = 582x390x60); 5+ day forecast generated every 12 h (2 per day)
MeteoSwiss GPU-Driven Weather Prediction
Before GPUs: MeteoSwiss COSMO NWP configurations since 2008
IFS from ECMWF: 2 per day, 10-day forecast
COSMO-7 (6.6 km): 3 per day, 3-day forecast
COSMO-2 (2.2 km): 8 per day, 24 hr forecast
With GPUs: MeteoSwiss COSMO NWP configurations during 2016
IFS from ECMWF: 2 per day, 10-day forecast
COSMO-E (2.2 km): 2 per day, 5-day forecast
COSMO-1 (1.1 km): 8 per day, 24 hr forecast
New configurations of higher resolution and ensemble prediction possible owing to the performance-per-energy gains from GPUs
X. Lapillonne, MeteoSwiss; EGU Assembly, Apr 2015
COSMO Model Performance for K20X (Oct 2014)
"Are OpenACC Directives the Easy Way to Port NWP Applications to GPUs?," Dr. Xavier Lapillonne, MeteoSwiss; ECMWF 16th HPC Workshop, Oct 2014
Benchmark and CPU + GPU configurations: COSMO-2 (2 km) configuration, 24 h run, grid = 520x350x60; system "todi," XK7 at CSCS; all results for 9 compute nodes with 9 AMD CPUs and 9 K20X GPUs, socket-to-socket; CPU runs use the full 16 cores, GPU runs use 1 CPU core
Physics with OpenACC optimizations: 3.4x speedup; assimilation (OpenACC on 1 CPU core) 3.3x slower, but < 15% of the profile; total GPU speedup 2.7x
COSMO Model Performance for K20X (Apr 2015)
"Climate Simulation and Numerical Weather Prediction Using GPUs," Dr. Xavier Lapillonne, MeteoSwiss; EGU Assembly, Apr 2015
Benchmark and CPU + GPU configurations: COSMO-1 (1 km) configuration, 24 h run, grid = 1158x774x80; system Piz Daint, XC30 at CSCS; all results for 70 compute nodes with 1 Sandy Bridge CPU and 1 K20X each, socket-to-socket; CPU runs use the full 8 cores, GPU runs use 1 CPU core
Total GPU speedup 2.2x; gains in energy per solution are more substantial, but not reported
COSMO Development on GPUs by MeteoSwiss
"Climate Simulation and Numerical Weather Prediction Using GPUs," Dr. Xavier Lapillonne, MeteoSwiss; EGU Assembly, Apr 2015
[Slides covering the development approach and implementation]
References: COSMO Model on NVIDIA GPUs
"Towards GPU-accelerated Operational Weather Forecasting," Oliver Fuhrer (MeteoSwiss), NVIDIA GTC 2013, Mar 2013; http://on-demand.gputechconf.com/gtc/2013/presentations/s3417-gpu-accelerated-operational-weather-forecasting.pdf
"Delivering Performance in Scientific Simulations: Present and Future Role of GPUs in Supercomputers," Thomas Schulthess (CSCS), NVIDIA GTC 2014, Mar 2014; http://on-demand.gputechconf.com/gtc/2014/presentations/s4719-scientific-sims-gpus-supercomputing.pdf
"Implementation of COSMO on Accelerators," Oliver Fuhrer (MeteoSwiss), ECMWF Scalability Workshop, Apr 2014; http://old.ecmwf.int/newsevents/meetings/workshops/2014/scalability/
"Implementation of COSMO on Accelerators," Xavier Lapillonne (MeteoSwiss), ECMWF 16th HPC Workshop, Oct 2014; http://www.ecmwf.int/sites/default/files/hpc-ws-lapillonne.pdf
WRF Progress Update (http://www.wrf-model.org/)
Independent development projects: (I) CUDA and (II) OpenACC
I. CUDA through funded collaboration with SSEC (www.ssec.wisc.edu); SSEC lead Dr. Bormin Huang, NVIDIA CUDA Fellow (research.nvidia.com/users/bormin-huang); objective: hybrid WRF benchmark capability starting 2H 2015
II. OpenACC directives through collaboration with NCAR MMM; objective: NCAR GPU interest in a Fortran version to host on the distribution site
Project Status (I): NVIDIA-SSEC CUDA WRF
Objective: provide Q3 benchmark capability of hybrid CPU+GPU WRF 3.6.1
SSEC ported 20+ modules to CUDA across 5 physics types in WRF: cloud microphysics = 11; radiation = 4; surface layer = 3; planetary BL = 1; cumulus = 1
Project to integrate the CUDA modules into the latest full model, WRF 3.6.1: SSEC development began in 2010 on WRF 3.3/3.4 vs. the latest release 3.6.1; SSEC's focus was on publications, so module development was isolated from the full WRF model; start with the existing modules most common across target customers' input
Objective: speedups measured on individual modules within a full model run
NVIDIA-SSEC project success: target modules achieve published GPU performance within the full WRF model run; final clean-up and performance studies underway, real benchmarks in 2H 2015
TQI/SSEC to fund/port all modules to CUDA towards a full WRF model on GPUs: focus on additional physics modules and ongoing WRF-ARW dycore development
SSEC Speedups of WRF Physics Modules
(* = modules for initial NVIDIA-funded integration project)

WRF Module               GPU Speedup (w/wo I/O)   Technical Paper Publication
Kessler MP               70x / 816x               J. Comp. & GeoSci., 52, 292-299, 2012
Purdue-Lin MP            156x / 692x              SPIE: doi:10.1117/12.901825
WSM 3-class MP           150x / 331x
WSM 5-class MP *         202x / 350x              JSTARS, 5, 1256-1265, 2012
Eta MP                   37x / 272x               SPIE: doi:10.1117/12.976908
WSM 6-class MP *         165x / 216x              Submitted to J. Comp. & GeoSci.
Goddard GCE MP           348x / 361x              Accepted for publication in JSTARS
Thompson MP *            76x / 153x
SBU 5-class MP           213x / 896x              JSTARS, 5, 625-633, 2012
WDM 5-class MP           147x / 206x
WDM 6-class MP           150x / 206x              J. Atmo. Ocean. Tech., 30, 2896, 2013
RRTMG LW *               123x / 127x              JSTARS, 7, 3660-3667, 2014
RRTMG SW *               202x / 207x              Submitted to J. Atmos. Ocean. Tech.
Goddard SW               92x / 134x               JSTARS, 5, 555-562, 2012
Dudhia SW *              19x / 409x
MYNN SL                  6x / 113x
TEMF SL                  5x / 214x
Thermal Diffusion LS     10x / 311x               Submitted to JSTARS
YSU PBL *                34x / 193x               Submitted to GMD

Hybrid WRF customer benchmark capability starting in 2H 2015
Hardware and benchmark case: CPU Xeon Core-i7 3930K, 1 core used; benchmark CONUS 12 km for 24 Oct, 24 h; grid 433 x 308, 35 levels
NVIDIA-SSEC Phase-I Priority on 6 Modules
Priority modules selected from target customer input (NCEP HRRR, NCEP NWS, NCAR, KISTI, KAUST):
Cloud MP (~20% of profile): WSM 5-class MP, WSM 6-class MP, Thompson MP
Radiation (~6%): Dudhia radiation, RRTMG radiation
Planetary BL: YSU planetary BL
NOTE: the new NCAR global model MPAS uses WRF physics: WSM6, RRTMG, and YSU
Project Status (II): NVIDIA-NCAR OpenACC WRF
Objective: provide benchmark capability of hybrid CPU+GPU WRF 3.6.1
NVIDIA ported 30+ routines to OpenACC for both dynamics and physics (list on next page)
Project to integrate the OpenACC modules with the full model WRF 3.6.1 release: open-source project with plans for NCAR hosting on the WRF trunk distribution site; objective to minimize code refactoring, for scientists' readability and acceptance
NVIDIA-NCAR project success: several performance-sensitive routines of dynamics and physics are completed, including the advection routines for dynamics and expensive microphysics schemes; hybrid WRF performance using a single-socket CPU + GPU is faster than CPU-only, with incremental improvements as more OpenACC code is developed for execution on the GPU
WRF Modules Available in OpenACC
Project to integrate the OpenACC modules into the latest full model, WRF 3.6.1
Several dynamics routines, including all of advection
Several physics schemes:
Planetary boundary layer: YSU, GWDO
Cumulus: KF-ETA
Microphysics: Kessler, Morrison, Thompson
Radiation: RRTM
Surface physics: Noah
Summary Highlights
NVIDIA observes strong NWP community interest in GPU acceleration: appeal of new technologies (Pascal, NVLink, more CPU platform choices)
NVIDIA applications engineering collaboration in several model projects: investments in OpenACC (PGI 15 releases, continued Cray collaboration)
GPU progress for several models, a few examined here: MeteoSwiss developments towards COSMO operational NWP on GPUs; ECMWF developments of IFS and launch of the H2020 ESCAPE project
NVIDIA HPC technology behind the next-gen pre-exascale systems: new GPU technologies for the DOE CORAL program and the ACME climate model
Thank You Carl Ponder, PhD; cponder@nvidia.com; NVIDIA, Austin, TX, USA