NVIDIA HPC Update
Carl Ponder, PhD; cponder@nvidia.com; NVIDIA, Austin, TX, USA; Sr. Applications Engineer, Developer Technology Group
Stan Posey; sposey@nvidia.com; NVIDIA, Santa Clara, CA, USA; HPC Program Manager, ESM and CFD Solutions
IBM and NVIDIA CWO Review: NVIDIA HPC Strategy for CWO GPU Model Developments
Global: ECMWF ESCAPE Project, NOAA NGGPS
Meso-scale: COSMO, WRF
GPU Motivation (I): Performance Trends
[Charts: peak double-precision FLOPS (GFLOPS) and peak memory bandwidth (GB/s), NVIDIA GPU vs. x86 CPU, 2008-2017, across generations M1060, M2050, M2090, K20, K40, K80, Pascal, Volta; GPUs hold roughly a 5x advantage in both metrics, growing toward ~15x with Volta.]
GPU Motivation (II): Power Limiting Trends
"Challenges of Getting ECMWF's Weather Forecast Model (IFS) to the Exascale," G. Mozdzynski, ECMWF 16th HPC Workshop
Conclusion: current mathematics, algorithms, and hardware will not be sufficient (by far)
GPU Motivation (III): ES Model Trends
Higher grid resolution with manageable compute and energy costs: global atmosphere models from ~10 km today to cloud-resolving scales of 1-3 km (128 km, 16 km, 10 km, 3 km; source: Project Athena, http://www.wxmaps.org/athena/home/)
Increased ensemble use and more ensemble members to manage uncertainty: 10x-50x more jobs
Fewer model approximations (non-hydrostatic), more features (physics, chemistry, etc.)
Accelerator technology identified as a cost-effective and practical approach to future NWP requirements
Tesla GPU Progression During Recent Years

Feature             2012 (Fermi) M2075   2014 (Kepler) K20X   2014 (Kepler) K40   2014 (Kepler) K80     Kepler / Fermi
Peak SP / SGEMM     1.03 TF              3.93 / 2.95 TF       4.29 / 3.22 TF      8.74 TF               ~4x
Peak DP / DGEMM     0.515 TF             1.31 / 1.22 TF       1.43 / 1.33 TF      2.90 TF               ~3x
Memory size         6 GB                 6 GB                 12 GB               24 GB (12 each)       2x
Mem BW (ECC off)    150 GB/s             250 GB/s             288 GB/s            480 GB/s (240 each)   2x
Memory clock        2.6 GHz              3.0 GHz              3.0 GHz
PCIe generation     Gen 2                Gen 2                Gen 3               Gen 3                 2x
# of cores          448                  2688                 2880                4992 (2496 each)      5x
Board power         235 W                235 W                235 W               300 W                 0% / +28%

Note: Tesla K80 specifications are shown as the aggregate of two GPUs on a single board
Features of Pascal GPU Architecture (2016)
NVLink interconnect at 80 GB/s (speed of CPU memory)
Stacked memory: 4x higher bandwidth (~1 TB/s), 3x capacity, 4x more efficient
Unified memory: lower development effort (available today in CUDA 6)
[Diagram: Tesla GPU with stacked HBM at ~1 TB/s, NVLink at 80-200 GB/s to a CPU (POWER, others) with DDR4 at 50-75 GB/s, unified virtual memory spanning both.]
NVLink Interconnect: Alternative to PCIe
High-speed interconnect for CPU-to-GPU and GPU-to-GPU
Performance advantage over PCIe Gen 3 (5x to 12x faster)
GPUs and CPUs share data structures at CPU memory speeds
[Diagram: Tesla GPU with stacked HBM at ~1 TB/s, NVLink at 80-200 GB/s to a CPU (POWER, others) with DDR4 at 50-75 GB/s, unified virtual memory spanning both.]
GPU Density Increasing: HPC Vendors & NVLink (2016)
All vendors will support GPU-to-GPU NVLink; several will support CPU-to-GPU
Dense GPU nodes today: Cray CS-Storm (8x K80), Dell C4130 (4x K80), HP SL270 (8x K80)
GPU and CPU Platforms and Configurations
2014: Kepler GPU over PCIe with x86, ARM64, or POWER CPUs; more CPU choices for GPU acceleration
2016: Pascal GPU with NVLink to the POWER CPU (PCIe remains for x86, ARM64); NVLink interconnect as an alternative to PCIe
IBM POWER + NVIDIA GPU Accelerated HPC
Next-gen IBM supercomputers and enterprise servers: POWER CPU + Tesla GPU
OpenPOWER Foundation: open ecosystem built on the POWER architecture (30+ members), long-term roadmap integration
First GPU-accelerated POWER-based systems available since 2015
DOE CORAL Based on POWER + GPU (2017)
US DOE CORAL systems: Summit (ORNL) and Sierra (LLNL), installation 2017 at ~150 PF each
Nodes of POWER9 CPUs + Tesla Volta GPUs, with NVLink interconnect for CPUs + GPUs
ORNL Summit system: approximately 3,400 total nodes, each node 40+ TF peak performance; about 1/5 the nodes of #2 Titan (18K+), at the same energy use as Titan (27 PF)
CORAL Summit system: 5-10x faster than Titan, 1/5th the nodes, same energy use as Titan
Feature Comparison of Summit (2017) vs. Titan (2012)
[Table of Summit-to-Titan ratios: ~5-10x application performance, ~1/5x node count, with per-node feature gains of ~30x, ~15x, and ~10-20x, at roughly unchanged energy use.]
CORAL Systems for US DOE Climate Model ACME
ACME: Accelerated Climate Model for Energy
DOE labs: Argonne, LANL, LBL, LLNL, ORNL, PNNL, Sandia
Co-design project using US DOE LCF systems; branch of CESM using CAM-SE (atmosphere) and MPAS-O (ocean)
ACME project roadmap (page 2 of the strategy report, 11 Jul 2014): 2014 systems Titan OLCF [AMD + GPU] and Mira ALCF [Blue Gene/Q]; 2017 systems CORAL (Summit, Sierra, Aurora) and Trinity / NERSC-8 (Cori)
Source: http://climatemodeling.science.energy.gov/sites/default/files/publications/acme-project-strategy-plan_0.pdf
GPU Programming Models for CPU Platforms
Available for x86 today; POWER and ARM support under development
Libraries (AmgX, cuBLAS), compiler directives, and programming languages
PGI Compilers for OpenPOWER + GPU
Feature parity with PGI compilers on Linux/x86 + Tesla: CUDA Fortran, OpenACC, OpenMP, CUDA C/C++ host compiler
Integrated with IBM's optimized LLVM / POWER code generator
Limited access in 2015, beta 1H 2016, production in 2016; x86 code recompiles unchanged
NVIDIA GPU Focus for Atmosphere Models: Recent Developments
Global: ECMWF ESCAPE Project, NOAA NGGPS
Regional: COSMO, WRF
ECMWF Project for Exascale Weather Prediction
"Expectations towards Exascale: Weather and Climate Prediction," P. Bauer, ECMWF, EESI 2015
ESCAPE HPC goals: standardized, highly optimized kernels on specialized hardware; overlap of communication and computation; compilers/standards supporting portability
Project approach: co-design between domain scientists, computer scientists, and HPC vendors
ECMWF Initial IFS Investigation on Titan GPUs
"Challenges of Getting ECMWF's Weather Forecast Model (IFS) to the Exascale," G. Mozdzynski, ECMWF 16th HPC Workshop
Legendre transform kernels observed at >3x to >4x speedup for Titan K20X vs. XC-30 24-core Ivy Bridge
NOAA NGGPS: Background on Program
From: "Next Generation HPC and Forecast Model Application Readiness at NCEP," John Michalakes, NOAA NCEP; AMS, Phoenix, AZ, Jan 2015
NOAA NGGPS: NH Model Dycore Candidates
Of the five non-hydrostatic dycore candidates, FV3 and MPAS were selected for Level-II evaluation
NVIDIA and NOAA investigating potential for GPUs; NOAA evaluations under the HPC program SENA (Software Engineering for Novel Architectures, 2015)
From: "Next Generation HPC and Forecast Model Application Readiness at NCEP," John Michalakes, NOAA NCEP; AMS, Phoenix, AZ, Jan 2015
COSMO Regional Model on GPUs (http://www.cosmo-model.org/)
COSMO model fully implemented on GPUs by MeteoSwiss
COSMO consortium approved GPU code for the standard distribution (5.3)
MeteoSwiss to deploy the first-ever operational NWP on GPUs: successful daily test runs on GPUs since Dec 2014, operational in 2016
[Images: COSMO-2 (2.2 km) and COSMO-1 (1.1 km) forecasts vs. reality; increased resolution = quality forecast]
Operational NWP at MeteoSwiss Before GPUs
"Delivering Performance in Scientific Simulations: Present and Future Role of GPUs in Supercomputers," Dr. Thomas Schulthess, Director CSCS; NVIDIA GTC, March 2014
New Operational NWP at MeteoSwiss With GPUs
"Climate Simulation and Numerical Weather Prediction Using GPUs," Dr. Xavier Lapillonne, MeteoSwiss; EGU Assembly, Apr 2015
MeteoSwiss will deploy the first-ever operational NWP with GPUs (2016): two identical systems (Cray CS + K80 GPUs) purchased, to be installed during 2015; successful daily runs of pre-operational COSMO on GPUs since Dec 2014 at CSCS
New operational NWP configurations, possible only with the performance-per-energy gains from GPUs:
Short-range rapid update: 2.2 km increased to 1.1 km (grid = 1158x774x80); 24 hr forecast generated every 3 h (8 per day)
Medium-range ensemble (EPS): 6.6 km increased to 2.2 km (grid = 582x390x60); 5+ day forecast generated every 12 h (2 per day)
MeteoSwiss GPU-Driven Weather Prediction
Before GPUs: MeteoSwiss COSMO NWP configurations since 2008
IFS from ECMWF: 2 per day, 10-day forecast
COSMO-7 (6.6 km): 3 per day, 3-day forecast
COSMO-2 (2.2 km): 8 per day, 24 hr forecast
With GPUs: MeteoSwiss COSMO NWP configurations during 2016
IFS from ECMWF: 2 per day, 10-day forecast
COSMO-E (2.2 km): 2 per day, 5-day forecast
COSMO-1 (1.1 km): 8 per day, 24 hr forecast
New configurations of higher resolution and ensemble prediction possible owing to the performance-per-energy gains from GPUs
X. Lapillonne, MeteoSwiss; EGU Assembly, Apr 2015
COSMO Model Performance for K20X (Oct 2014)
"Are OpenACC Directives the Easy Way to Port NWP Applications to GPUs?," Dr. Xavier Lapillonne, MeteoSwiss; ECMWF 16th HPC Workshop, Oct 2014
Benchmark and CPU + GPU configurations: COSMO-2 (2 km) configuration, 24 h run, grid = 520x350x60; system "todi," XK7 at CSCS; all results for 9 compute nodes with 9 AMD CPUs and 9 K20X GPUs, socket-to-socket; CPU runs use the full 16 cores, GPU runs use 1 CPU core
Physics with OpenACC optimizations: 3.4x speedup; assimilation (OpenACC on 1 CPU core) 3.3x slower, but < 15% of the profile; total GPU speedup 2.7x
COSMO Model Performance for K20X (Apr 2015)
"Climate Simulation and Numerical Weather Prediction Using GPUs," Dr. Xavier Lapillonne, MeteoSwiss; EGU Assembly, Apr 2015
Benchmark and CPU + GPU configurations: COSMO-1 (1 km) configuration, 24 h run, grid = 1158x774x80; system Piz Daint, XC30 at CSCS; all results for 70 compute nodes with 1 Sandy Bridge CPU and 1 K20X each, socket-to-socket; CPU runs use the full 8 cores, GPU runs use 1 CPU core
Total GPU speedup 2.2x; gains in energy per solution are more substantial, but not reported
COSMO Development on GPUs by MeteoSwiss
"Climate Simulation and Numerical Weather Prediction Using GPUs," Dr. Xavier Lapillonne, MeteoSwiss; EGU Assembly, Apr 2015
[Slides covering the development approach and implementation]
References: COSMO Model on NVIDIA GPUs
"Towards GPU-accelerated Operational Weather Forecasting," Oliver Fuhrer (MeteoSwiss), NVIDIA GTC 2013, Mar 2013; http://on-demand.gputechconf.com/gtc/2013/presentations/s3417-gpu-accelerated-operational-weather-forecasting.pdf
"Delivering Performance in Scientific Simulations: Present and Future Role of GPUs in Supercomputers," Thomas Schulthess (CSCS), NVIDIA GTC 2014, Mar 2014; http://on-demand.gputechconf.com/gtc/2014/presentations/s4719-scientific-sims-gpus-supercomputing.pdf
"Implementation of COSMO on Accelerators," Oliver Fuhrer (MeteoSwiss), ECMWF Scalability Workshop, Apr 2014; http://old.ecmwf.int/newsevents/meetings/workshops/2014/scalability/
"Implementation of COSMO on Accelerators," Xavier Lapillonne (MeteoSwiss), ECMWF 16th HPC Workshop, Oct 2014; http://www.ecmwf.int/sites/default/files/hpc-ws-lapillonne.pdf
WRF Progress Update (http://www.wrf-model.org/)
Independent development projects: (I) CUDA and (II) OpenACC
I. CUDA through funded collaboration with SSEC (www.ssec.wisc.edu); SSEC lead Dr. Bormin Huang, NVIDIA CUDA Fellow (research.nvidia.com/users/bormin-huang); objective: hybrid WRF benchmark capability starting 2H 2015
II. OpenACC directives through collaboration with NCAR MMM; objective: NCAR GPU interest in a Fortran version to host on the distribution site
Project Status (I): NVIDIA-SSEC CUDA WRF
Objective: provide Q3 benchmark capability of hybrid CPU+GPU WRF 3.6.1
SSEC ported 20+ modules to CUDA across 5 physics types in WRF: cloud microphysics = 11; radiation = 4; surface layer = 3; planetary BL = 1; cumulus = 1
Project to integrate the CUDA modules into the latest full model, WRF 3.6.1: SSEC development began in 2010 on WRF 3.3/3.4 vs. the latest release 3.6.1; SSEC's focus was on publications, so module development was isolated from the full WRF model; start with the existing modules most common across target customers' input
Objective: speedups measured on individual modules within a full model run
NVIDIA-SSEC project success: target modules achieve published GPU performance within the full WRF model run; final clean-up and performance studies underway, real benchmarks in 2H 2015
TQI/SSEC to fund/port all modules to CUDA towards a full WRF model on GPUs: focus on additional physics modules and ongoing WRF-ARW dycore development
SSEC Speedups of WRF Physics Modules
(* = modules for initial NVIDIA-funded integration project)

WRF Module               GPU Speedup (w/wo I/O)   Technical Paper Publication
Kessler MP               70x / 816x               J. Comp. & GeoSci., 52, 292-299, 2012
Purdue-Lin MP            156x / 692x              SPIE: doi:10.1117/12.901825
WSM 3-class MP           150x / 331x
WSM 5-class MP *         202x / 350x              JSTARS, 5, 1256-1265, 2012
Eta MP                   37x / 272x               SPIE: doi:10.1117/12.976908
WSM 6-class MP *         165x / 216x              Submitted to J. Comp. & GeoSci.
Goddard GCE MP           348x / 361x              Accepted for publication in JSTARS
Thompson MP *            76x / 153x
SBU 5-class MP           213x / 896x              JSTARS, 5, 625-633, 2012
WDM 5-class MP           147x / 206x
WDM 6-class MP           150x / 206x              J. Atmo. Ocean. Tech., 30, 2896, 2013
RRTMG LW *               123x / 127x              JSTARS, 7, 3660-3667, 2014
RRTMG SW *               202x / 207x              Submitted to J. Atmos. Ocean. Tech.
Goddard SW               92x / 134x               JSTARS, 5, 555-562, 2012
Dudhia SW *              19x / 409x
MYNN SL                  6x / 113x
TEMF SL                  5x / 214x
Thermal Diffusion LS     10x / 311x               Submitted to JSTARS
YSU PBL *                34x / 193x               Submitted to GMD

Hybrid WRF customer benchmark capability starting in 2H 2015
Hardware and benchmark case: CPU Xeon Core-i7 3930K, 1 core used; benchmark CONUS 12 km for 24 Oct, 24 h; grid 433 x 308, 35 levels
NVIDIA-SSEC Phase-I Priority on 6 Modules
Priority modules selected from target customer input (NCEP HRRR, NCEP NWS, NCAR, KISTI, KAUST):
Cloud MP (~20% of profile): WSM 5-class MP, WSM 6-class MP, Thompson MP
Radiation (~6%): Dudhia radiation, RRTMG radiation
Planetary BL: YSU planetary BL
NOTE: the new NCAR global model MPAS uses WRF physics: WSM6, RRTMG, and YSU
Project Status (II): NVIDIA-NCAR OpenACC WRF
Objective: provide benchmark capability of hybrid CPU+GPU WRF 3.6.1
NVIDIA ported 30+ routines to OpenACC for both dynamics and physics (list on next page)
Project to integrate the OpenACC modules with the full model WRF 3.6.1 release: open-source project with plans for NCAR hosting on the WRF trunk distribution site; objective to minimize code refactoring, for scientists' readability and acceptance
NVIDIA-NCAR project success: several performance-sensitive routines of dynamics and physics are completed, including the advection routines for dynamics and expensive microphysics schemes; hybrid WRF performance using a single-socket CPU + GPU is faster than CPU-only, with incremental improvements as more OpenACC code is developed for execution on the GPU
WRF Modules Available in OpenACC
Project to integrate the OpenACC modules into the latest full model, WRF 3.6.1
Several dynamics routines, including all of advection
Several physics schemes:
Planetary boundary layer: YSU, GWDO
Cumulus: KF-ETA
Microphysics: Kessler, Morrison, Thompson
Radiation: RRTM
Surface physics: Noah
Summary Highlights
NVIDIA observes strong NWP community interest in GPU acceleration: appeal of new technologies (Pascal, NVLink, more CPU platform choices)
NVIDIA applications engineering collaboration in several model projects: investments in OpenACC (PGI 15 releases, continued Cray collaboration)
GPU progress for several models, a few examined here: MeteoSwiss developments towards COSMO operational NWP on GPUs; ECMWF developments of IFS and launch of the H2020 ESCAPE project
NVIDIA HPC technology behind the next-gen pre-exascale systems: new GPU technologies for the DOE CORAL program and the ACME climate model
Thank You Carl Ponder, PhD; cponder@nvidia.com; NVIDIA, Austin, TX, USA