GPU IN SCIENTIFIC COMPUTING Training session of the Club des Affiliés du LAAS-CNRS, Toulouse, March 22, 2016 Frédéric Parienté, Tesla Accelerated Computing, NVIDIA
GAMING PRO ENTERPRISE VISUALIZATION DATA CENTER AUTO THE WORLD LEADER IN VISUAL COMPUTING 2
FIVE THINGS TO REMEMBER
  1. Time of accelerators has come
  2. NVIDIA is focused on co-design from top-to-bottom
  3. Accelerators are surging in supercomputing
  4. Machine learning is the next killer application for HPC
  5. Tesla platform leads in every way 3
"It's time to start planning for the end of Moore's Law, and it's worth pondering how it will end, not just when." Robert Colwell, Director, Microsystems Technology Office, DARPA 4
TESLA ACCELERATED COMPUTING PLATFORM Focused on Co-Design from Top to Bottom
  Fast GPU Engineered for High Throughput
  Fast GPU + Strong CPU (NVIDIA GPU, x86 CPU)
  Productive Programming Model & Tools
  Expert Co-Design
  Accessibility
Co-design layers: APPLICATION, MIDDLEWARE, SYS SW, LARGE SYSTEMS, PROCESSOR
[Chart: TFLOPS (0.0 to 3.0) by Tesla GPU generation: M1060, M2090, K20, K40, K80; x-axis 2008 to 2014] 5
ACCELERATORS SURGE IN WORLD'S TOP SUPERCOMPUTERS
  100+ accelerated systems now on Top500 list
  1/3 of total FLOPS powered by accelerators
  NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
  Tesla supercomputers growing at 50% CAGR over past five years
[Chart: Top500: # of Accelerated Supercomputers (0 to 125), 2013 to 2015] 6
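To put the 50% CAGR figure in perspective, the sketch below compounds an annual growth rate over a number of years. The function name `compound_growth` is hypothetical and is not from the slides; it only illustrates the arithmetic behind a compound annual growth rate.

```c
/* Hypothetical helper (not from the slides): total growth factor after
 * compounding a constant annual rate for a whole number of years.
 * Example: a rate of 0.50 (50% CAGR) over 5 years. */
double compound_growth(double annual_rate, int years)
{
    double factor = 1.0;
    for (int i = 0; i < years; ++i) {
        factor *= 1.0 + annual_rate;  /* one more year of growth */
    }
    return factor;
}
```

Under this reading, `compound_growth(0.50, 5)` gives about 7.6, i.e. a 50% CAGR sustained for five years corresponds to roughly a 7.6x increase.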
70% OF TOP HPC APPS ACCELERATED
INTERSECT360 SURVEY OF TOP APPS (Intersect360, Nov 2015, HPC Application Support for GPU Computing)
  Top 10 HPC Apps: 90% Accelerated
  Top 50 HPC Apps: 70% Accelerated
TOP 25 APPS IN SURVEY: GROMACS, SIMULIA Abaqus, NAMD, AMBER, ANSYS Mechanical, MSC NASTRAN, SPECFEM3D, LAMMPS, NWChem, LS-DYNA, Schrodinger, Gaussian, GAMESS, ANSYS Fluent, WRF, VASP, OpenFOAM, CHARMM, Quantum Espresso, ANSYS CFX, Star-CD, CCSM, COMSOL, Star-CCM+, BLAST
Legend: All popular functions accelerated; Some popular functions accelerated; In development; Not supported 7
370 GPU-Accelerated Applications www.nvidia.com/appscatalog 8
TESLA BOOSTS DATACENTER THROUGHPUT $500M Datacenter, 4x increase in ROI
  70% of Applications 5x Faster with GPU
  100% CPU Nodes: 1000 Jobs Per Day
  30% CPU Nodes + 70% GPU-Accelerated Nodes: 3800 Jobs Per Day 9
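The 1000 and 3800 jobs-per-day figures can be reproduced with simple arithmetic. The sketch below is one possible reading, not NVIDIA's stated method: it assumes equally sized nodes and that every job on a GPU-accelerated node gets the full 5x speedup. The function name `jobs_per_day` is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper (not from the slides): daily throughput of a
 * datacenter in which a fraction of equally sized nodes is GPU-accelerated.
 * Assumption: every job placed on an accelerated node runs `speedup`
 * times faster than on a CPU-only node. */
double jobs_per_day(double baseline_jobs, double gpu_node_fraction, double speedup)
{
    assert(gpu_node_fraction >= 0.0 && gpu_node_fraction <= 1.0);
    double cpu_share = (1.0 - gpu_node_fraction) * baseline_jobs;       /* CPU-only nodes */
    double gpu_share = gpu_node_fraction * baseline_jobs * speedup;     /* accelerated nodes */
    return cpu_share + gpu_share;
}
```

With a baseline of 1000 jobs per day, 70% accelerated nodes and a 5x speedup, this gives 300 + 3500 = 3800 jobs per day, which matches the slide.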
NEXT-GEN SUPERCOMPUTERS ARE GPU-ACCELERATED SUMMIT, SIERRA: U.S. Dept. of Energy Pre-Exascale Supercomputers for Science. NOAA: New Supercomputer for Next-Gen Weather Forecasting. IBM Watson: Breakthrough Natural Language Processing for Cognitive Computing 10
MACHINE LEARNING HPC 1ST CONSUMER KILLER-APP: MICROSOFT CORTANA; GOOGLE OPEN-SOURCE TENSORFLOW; FACEBOOK MESSENGER FACIAL RECOGNITION; MICROSOFT OPEN-SOURCE DMTK; YOUTUBE CLICK-TO-BUY ADS; GOOGLE PHOTO 11
TESLA PLATFORM LEADS IN EVERY WAY PROCESSOR INTERCONNECT SOFTWARE ECOSYSTEM 12
TESLA PLATFORM FOR HPC 13
Approximately a third of HPC systems operating today are equipped with accelerators and nearly half of all newly deployed systems have them. Source: ACCELERATED COMPUTING: A TIPPING POINT FOR HPC Intersect360 Nov 2015 14
TESLA FOR SIMULATION LIBRARIES DIRECTIVES LANGUAGES ACCELERATED COMPUTING TOOLKIT TESLA ACCELERATED COMPUTING 15
Tesla Accelerates Discoveries Using a supercomputer powered by the Tesla Platform with over 3,000 Tesla accelerators, University of Illinois scientists performed the first all-atom simulation of the HIV virus and discovered the chemical structure of its capsid, the perfect target for fighting the infection. Without GPUs, the supercomputer would need to be 5x larger for similar performance. 16
TESLA K80 World's Fastest Accelerator for HPC & Data Analytics
5x Faster AMBER Performance: Simulation Time from 1 Month to 1 Week
[Chart: # of Days (0 to 30), Dual CPU Server vs Tesla K80 Server]
  CUDA Cores          4992
  Peak DP             1.9 TFLOPS
  Peak DP w/ Boost    2.9 TFLOPS
  GDDR5 Memory        24 GB
  Bandwidth           480 GB/s
  Power               300 W
  GPU Boost           Dynamic
AMBER Benchmark: PME-JAC-NVE Simulation for 1 microsecond. CPU: E5-2698v3 @ 2.3GHz, 64GB System Memory, CentOS 6.2 17
TESLA K80: 10X FASTER ON REAL-WORLD APPS
[Chart: K80 vs CPU speed-up (0x to 15x) across Benchmarks, Molecular Dynamics, Quantum Chemistry, Physics]
CPU: 12 cores, E5-2697v2 @ 2.70GHz, 64GB System Memory, CentOS 6.2. GPU: Single Tesla K80, Boost enabled 18
TESLA K80 BOOSTS DATA CENTER THROUGHPUT
ACCELERATING KEY APPS [Chart: Speed-up vs Dual CPU (0x to 15x), K80 vs CPU, for QMCPACK, LAMMPS, CHROMA, NAMD, AMBER]
1/3 OF NODES ACCELERATED, 2X SYSTEM THROUGHPUT
  CPU-only System: 100 Jobs Per Day
  Accelerated System: 220 Jobs Per Day
CPU: Dual E5-2698 v3 @ 2.3GHz (3.6GHz Turbo), 64GB System Memory, CentOS 6.2. GPU: Single Tesla K80, Boost enabled 19
TESLA FOR VISUALIZATION IRAY OPTIX INDEX VISUALIZATION TOOLS FOR HPC TESLA ACCELERATED COMPUTING 20
VISUALIZE DATA INSTANTLY FOR FASTER SCIENCE
Traditional (CPU Supercomputer + Viz Cluster): Slower Time to Discovery
  Simulation: 1 Week; Data Transfer: Days; Viz: 1 Day
  Multiple Iterations
  Time to Discovery = Months
Tesla Platform (GPU-Accelerated Supercomputer): Faster Time to Discovery
  Interactive, Scalable, Flexible
  Visualize while you simulate/without data transfers
  Restart Simulation Instantly
  Multiple Iterations
  Time to Discovery = Weeks 21
VISUALIZATION-ENABLED SUPERCOMPUTERS Simulation + Visualization. Systems: CSCS Piz Daint, NCSA Blue Waters, ORNL Titan. Applications: Galaxy Formation, Molecular Dynamics, Cosmology 22
GROWING ADOPTION IN CLIMATE & WEATHER
MeteoSwiss Deploys World's First Accelerated Weather Supercomputer
  2x higher resolution for daily forecasts
  14x more simulation with ensemble approach for medium-range forecasts
NOAA Chooses Tesla To Improve Weather Forecast Research
  Develop global model with 3km resolution, five-fold increase from today's resolution
  Improved resolution requires 100x computational complexity 23
U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS Powered by the Tesla Platform
  100-300 PFLOPS Peak
  10x in Scientific App Performance
  IBM POWER9 CPU + NVIDIA Volta GPU
  NVLink High Speed Interconnect
  40 TFLOPS per Node, >3,400 Nodes
  2017
Major Step Forward on the Path to Exascale 24
ACCELERATED COMPUTING DELIVERS 5X HIGHER ENERGY EFFICIENCY
  IBM POWER CPU: Most Powerful Serial Processor
  NVIDIA NVLink: Fastest CPU-GPU Interconnect, 80-200 GB/s
  NVIDIA Volta GPU: Most Powerful Parallel Processor 25
CORAL: BUILT FOR GRAND SCIENTIFIC CHALLENGES
  Fusion Energy: Role of material disorder, statistics, and fluctuations in nanoscale materials and systems
  Climate Change: Study climate change adaptation and mitigation scenarios; realistically represent detailed features
  Biofuels: Search for renewable and more efficient energy sources
  Astrophysics: Radiation transport critical to astrophysics, laser fusion, atmospheric dynamics, and medical imaging
  Combustion: Combustion simulations to enable the next gen diesel/biofuels to burn more efficiently
  Nuclear Energy: Unprecedented high-fidelity radiation transport calculations for nuclear energy applications 26
TESLA PLATFORM FOR MACHINE LEARNING 27
THE BIG BANG IN MACHINE LEARNING DNN BIG DATA GPU "Google's AI engine also reflects how the world of computer hardware is changing. (It) depends on machines equipped with GPUs... And it depends on these chips more than the larger tech universe realizes." 28
Tesla Revolutionizes Machine Learning GOOGLE BRAIN APPLICATION DEEP LEARNING
                BEFORE TESLA      AFTER TESLA
  Cost          $5,000K           $200K
  Servers       1,000 Servers     16 Tesla Servers
  Energy        600 kW            4 kW
  Performance   1x                6x 29
THE AI RACE IS ON 30
NVIDIA GPU THE ENGINE OF DEEP LEARNING WATSON CHAINER THEANO MATCONVNET TENSORFLOW CNTK TORCH CAFFE NVIDIA CUDA ACCELERATED COMPUTING PLATFORM 31
CUDA BOOSTS DEEP LEARNING 5X IN 2 YEARS
[Chart: Caffe Performance (0 to 6); configurations K40, K40+cuDNN1, M40+cuDNN3, M40+cuDNN4; x-axis 11/2013, 9/2014, 7/2015, 12/2015]
AlexNet training throughput based on 20 iterations. CPU: 1x E5-2680v3 12 Core 2.5GHz, 128GB System Memory, Ubuntu 14.04 32
AMAZING RATE OF IMPROVEMENT
[Charts: Image Recognition (IMAGENET), ImageNet Accuracy, CV-based vs DNN-based; Pedestrian Detection (CALTECH), Accuracy, NVIDIA GPU; Object Detection (KITTI), Top Score and NVIDIA DRIVENet; x-axes 2010 to 2015 and 11/2013 to 1/2016] 33
CUDA FOR DEEP LEARNING DEVELOPMENT DEEP LEARNING SDK: DIGITS, cuDNN, cuSPARSE, cuBLAS, NCCL. TITAN X, DEVBOX, GPU CLOUD 34
FACEBOOK'S DEEP LEARNING MACHINE Purpose-Built for Deep Learning Training
  2x Faster Training for Faster Deployment
  2x Larger Networks for Higher Accuracy
  Powered by Eight Tesla M40 GPUs
  Open Rack Compliant
"Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models." Serkan Piantino, Engineering Director of Facebook AI Research 35
DESIGNED FOR AI COMPUTING AT LARGE SCALE
Built on the NVIDIA Tesla Platform
  8 Tesla M40s deliver aggregate 96 GB GDDR5 memory and 56 teraflops of SP performance
  Leverages world's leading deep learning platform to tap into frameworks such as Torch and libraries such as cuDNN
Operational Efficiency and Serviceability
  Free-air Cooled Design Optimizes Thermal and Power Efficiency
  Components swappable without tools
  Configurable PCI-e for versatility 36
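The aggregate figures follow directly from the per-GPU Tesla M40 specifications quoted on the M40 slide (12 GB GDDR5 memory and 7 TFLOPS peak SP). The sketch below shows that arithmetic; the struct and function names are hypothetical, and peak figures simply add up here, which says nothing about sustained performance.

```c
/* Hypothetical types and helper (not from the slides): sum per-GPU
 * peak figures over a number of identical GPUs in one server. */
typedef struct {
    double memory_gb;   /* on-board memory per GPU, in GB */
    double sp_tflops;   /* peak single-precision TFLOPS per GPU */
} gpu_spec;

gpu_spec aggregate(gpu_spec per_gpu, int gpu_count)
{
    gpu_spec total;
    total.memory_gb = per_gpu.memory_gb * gpu_count;
    total.sp_tflops = per_gpu.sp_tflops * gpu_count;
    return total;
}
```

For eight GPUs with 12 GB and 7 TFLOPS each, `aggregate` returns 96 GB and 56 TFLOPS, consistent with the slide.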
TESLA M40 World's Fastest Accelerator for Deep Learning Training
13x Faster Training (Caffe): Reduce Training Time from 5 Days to less than 10 Hours
[Chart: Number of Days (0 to 5), Dual CPU Server vs GPU Server with 4x TESLA M40]
  CUDA Cores      3072
  Peak SP         7 TFLOPS
  GDDR5 Memory    12 GB
  Bandwidth       288 GB/s
  Power           250 W
Note: Caffe benchmark with AlexNet, training 1.3M images with 90 epochs. CPU server uses 2x Xeon E5-2699v3 CPU, 128GB System Memory, Ubuntu 14.04 37
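The "5 days to less than 10 hours" claim is consistent with the quoted 13x speedup, as the short sketch below illustrates. The function name `accelerated_hours` is hypothetical, and the calculation assumes the speedup applies uniformly to the whole training run.

```c
/* Hypothetical helper (not from the slides): wall-clock hours after
 * applying a uniform speedup to a baseline duration given in days. */
double accelerated_hours(double baseline_days, double speedup)
{
    double baseline_hours = baseline_days * 24.0;
    return baseline_hours / speedup;
}
```

Five days is 120 hours, and 120 / 13 is a little over 9 hours, which is below the 10-hour mark stated on the slide.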
TESLA M4 Highest Throughput Hyperscale Workload Acceleration
  Video Processing (Stabilization and Enhancements): 4x
  Image Processing (Resize, Filter, Search, Auto-Enhance): 5x
  Video Transcode (H.264 & H.265, SD & HD): 2x
  Machine Learning Inference: 2x
  CUDA Cores      1024
  Peak SP         2.2 TFLOPS
  GDDR5 Memory    4 GB
  Bandwidth       88 GB/s
  Form Factor     PCIe Low Profile
  Power           50-75 W
Preliminary specifications. Subject to change. 38
TESLA PLATFORM FOR DEVELOPERS 39
10X GROWTH IN ACCELERATED COMPUTING
                              2008       2015
  CUDA Downloads              150,000    3 Million
  CUDA Apps                   27         370
  Universities Teaching       60         800
  Academic Papers             4,000      60,000
  Tesla GPUs                  6,000      450,000
  Supercomputing Teraflops    77         54,000 40
HOW GPU ACCELERATION WORKS Application Code = Compute-Intensive Functions (5% of Code) on the GPU + Rest of Sequential CPU Code on the CPU 41
COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS Libraries (AmgX, cuBLAS), Compiler Directives, Programming Languages / x86 42
GPU ACCELERATED LIBRARIES Drop-in Acceleration for Your Applications
  Domain-specific (Deep Learning, GIS, EDA, Bioinformatics, Fluids): NVBIO, Triton Ocean SDK
  Visual Processing (Image & Video): NVIDIA CODEC SDK, NVIDIA NPP
  Linear Algebra (Dense, Sparse, Matrix): NVIDIA cuBLAS, cuSPARSE, cuSOLVER
  Math Algorithms (AMG, Templates, Solvers): AmgX, NVIDIA cuRAND
developer.nvidia.com/gpu-accelerated-libraries 43
OpenACC: Simple, Powerful, Portable. Fueling the Next Wave of Scientific Discoveries in HPC
    main()
    {
      <serial code>
      #pragma acc kernels   // automatically runs on GPU
      {
        <parallel code>
      }
    }
RIKEN Japan, NICAM (Climate Modeling): 7-8x Speed-Up, 5% of Code Modified
University of Illinois, PowerGrid (MRI Reconstruction): 70x Speed-Up, 2 Days of Effort
8000+ Developers using OpenACC
http://www.cray.com/sites/default/files/resources/openacc_213462.12_openacc_cosmo_cs_fnl.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/s5297-hisashi-yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7 44
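As a concrete instance of the `#pragma acc kernels` pattern, here is a minimal SAXPY sketch in C. The function name `saxpy` and the choice of data clauses are illustrative assumptions, not taken from the slides. A compiler without OpenACC support ignores the pragma and runs the loop serially on the CPU, so the same source serves both cases.

```c
/* Illustrative sketch (not from the slides): y = a*x + y over n elements.
 * With an OpenACC-enabled compiler the loop may be offloaded to the GPU;
 * without one, the pragma is ignored and the loop runs on the CPU.
 * `restrict` tells the compiler that x and y do not overlap. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* copyin: x is only read on the device; copy: y is read and written back */
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The serial code around the call stays unchanged; only the loop carrying the parallel work is annotated, which is the point the slide makes about modifying a small fraction of the code.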
LS-DALTON Large-scale Application for Calculating High-accuracy Molecular Energies
Minimal Effort
  Lines of Code Modified: <100 Lines
  # of Weeks Required: 1 Week
  # of Codes to Maintain: 1 Source
Big Performance: LS-DALTON CCSD(T) Module Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)
[Chart: Speedup vs CPU (0.0x to 12.0x) for Alanine-1 (13 Atoms), Alanine-2 (23 Atoms), Alanine-3 (33 Atoms)]
"OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation." Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University 45
OPENACC DELIVERS TRUE PERFORMANCE PORTABILITY Paving the Path Forward: Single Code for All HPC Processors
[Chart: Application Performance Benchmark, Speedup vs Single CPU Core (0x to 35x), for 359.MINIGHOST (MANTEVO), NEMO (CLIMATE & OCEAN), CLOVERLEAF (PHYSICS); series: CPU: MPI + OpenMP; CPU: MPI + OpenACC; CPU + GPU: MPI + OpenACC; bar labels in extraction order: 30.3x, 11.9x, 7.6x, 7.1x, 7.1x, 4.1x, 4.3x, 5.2x, 5.3x]
359.miniGhost: CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80, single GPU
NEMO: Each socket CPU: Intel Xeon E5-2698 v3, 16 cores; GPU: NVIDIA K80, both GPUs
CLOVERLEAF: CPU: Dual socket Intel Xeon CPU E5-2690 v2, 20 cores total; GPU: Tesla K80, both GPUs 46
CUDA Super Simplified Memory Management Code

CPU Code
    void sortfile(FILE *fp, int N) {
      char *data;
      data = (char *)malloc(N);              // host allocation
      fread(data, 1, N, fp);
      qsort(data, N, 1, compare);
      use_data(data);
      free(data);
    }

CUDA 6 Code with Unified Memory
    void sortfile(FILE *fp, int N) {
      char *data;
      cudaMallocManaged(&data, N);           // one pointer, valid on CPU and GPU
      fread(data, 1, N, fp);
      qsort<<<...>>>(data, N, 1, compare);   // kernel launch
      cudaDeviceSynchronize();               // wait for the GPU before the CPU reads data
      use_data(data);
      cudaFree(data);
    } 47
GPU DEVELOPER ECO-SYSTEM
  Numerical Packages: MATLAB, Mathematica, LabVIEW
  Debuggers & Profilers: CUDA-GDB, NV Visual Profiler, NVIDIA Nsight Visual Studio, Allinea, TotalView
  Languages & Directives: C, C++, Fortran, Java, Python, OpenACC, OpenMP
  Cluster Tools: GPUDirect RDMA, Datacenter GPU Manager
  Libraries: FFT, BLAS, SPARSE, LAPACK, NPP, Video, Imaging
  Consultants & Training: ANEO, GPU Tech
  OEM Solution Providers 48
DEVELOP ON GEFORCE, DEPLOY ON TESLA
GeForce: Designed for Developers & Gamers. Available Everywhere. developer.nvidia.com/cuda-gpus developer.nvidia.com/devbox
Tesla: Designed for the Data Center. ECC, 24x7 Runtime, GPU Monitoring, Cluster Management, GPUDirect-RDMA, Hyper-Q for MPI, 3 Year Warranty, Integrated OEM Systems, Professional Support 49
GTC EUROPE: EUROPE'S BRIGHTEST MINDS & BEST IDEAS. Sep 28-29, 2016, Amsterdam. www.gputechconf.eu #GTC16
Topics: DEEP LEARNING & ARTIFICIAL INTELLIGENCE; SELF-DRIVING CARS; VIRTUAL REALITY & AUGMENTED REALITY; SUPERCOMPUTING & HPC
GTC Europe is a two-day conference designed to expose the innovative ways developers, businesses and academics are using parallel computing to transform our world.
2 Days, 800 Attendees, 50+ Exhibitors, 50+ Speakers, 15+ Tracks, 15+ Workshops, 1-to-1 Meetings 51