HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK
Steve Oberlin, CTO, Accelerated Computing
US to Build Two Flagship Supercomputers: SUMMIT and SIERRA. Partnership for Science. 100-300 PFLOPS peak performance. 10x in scientific applications. 2017. Major step forward on the path to exascale.
Just 4 nodes in Summit would make the Top500 list of supercomputers today. Similar power as Titan; 5-10x faster; 1/5th the size. 150 PF = 3M laptops: one laptop for every resident in the State of Mississippi.
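The laptop comparison above implies a per-laptop performance figure, which a few lines of arithmetic can make explicit. This is a minimal sketch: the variable names are illustrative, and the only inputs are the 150 PF and 3M figures quoted on the slide.

```python
# Back-of-the-envelope check of "150 PF = 3M laptops".
PETA = 10**15
GIGA = 10**9

system_flops = 150 * PETA   # 150 PF, as quoted on the slide
laptop_count = 3_000_000    # 3M laptops, as quoted on the slide

# Integer arithmetic keeps the result exact.
flops_per_laptop = system_flops // laptop_count
gflops_per_laptop = flops_per_laptop // GIGA

print(f"Implied per-laptop performance: {gflops_per_laptop} GFLOPS")
```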
Optimizing Serial/Parallel Execution. Application code divides into parallel work and serial work. GPU: parallel work, the majority of ops. CPU: serial work, system and sequential ops.
NVLink-Enabled Heterogeneous Node. 5x higher energy efficiency; 80-200 GB/s. IBM POWER CPU: most powerful serial processor. NVIDIA NVLink: fastest CPU-GPU interconnect. NVIDIA Volta GPU: most powerful parallel processor.
NVLink: Logical Node Integration. TESLA GPU linked over NVLink (80 GB/s) to a POWER or ARM CPU. 5x PCIe bandwidth; move data at CPU memory speed; 3x lower energy/bit. HBM: 1 Terabyte/s stacked memory, throughput optimized. DDR4: 50-75 GB/s DDR memory, latency optimized.
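To see what "5x PCIe bandwidth" means for data movement, the sketch below compares ideal transfer times over the two links. The 80 GB/s NVLink figure comes from the slide, and the PCIe figure is simply that value divided by five. The 16 GB buffer size and the function name are hypothetical, and latency and protocol overhead are ignored.

```python
# Ideal time to move a buffer between CPU and GPU memory.
NVLINK_GB_PER_S = 80.0                   # quoted on the slide
PCIE_GB_PER_S = NVLINK_GB_PER_S / 5.0    # implied by "5x PCIe bandwidth"


def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Return size / bandwidth, ignoring latency and protocol overhead."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


buffer_gb = 16.0  # hypothetical working-set size, not from the slides

pcie_time = transfer_seconds(buffer_gb, PCIE_GB_PER_S)
nvlink_time = transfer_seconds(buffer_gb, NVLINK_GB_PER_S)

print(f"PCIe:   {pcie_time:.2f} s")
print(f"NVLink: {nvlink_time:.2f} s")
```

At 80 GB/s the link sits just above the 50-75 GB/s DDR4 range quoted on the slide, which is the sense in which data moves "at CPU memory speed".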
NVLink High-speed GPU Interconnect. 2014: KEPLER GPU, connected over PCIe to an X86, ARM64, or POWER CPU. 2016: PASCAL GPU, with NVLink alongside PCIe to an X86, ARM64, or POWER CPU.
NVLink Unleashes Multi-GPU Performance. GPUs interconnected with NVLink: 5x faster than PCIe Gen3 x16. Over 2x application performance speedup when next-gen GPUs connect via NVLink versus PCIe.
[Diagram: CPU, PCIe switch, two TESLA GPUs.]
[Chart: speedup vs. PCIe-based server, axis 1.00x to 2.25x, for ANSYS Fluent, Multi-GPU Sort, LQCD QUDA, AMBER, and 3D FFT.]
Notes: 3D FFT and ANSYS use a 2 GPU configuration; all other apps compare a 4 GPU configuration. AMBER Cellulose (256x128x128); FFT problem size (256^3).
Two Computing Models for Accelerators. Many-Weak-Cores (MWC) model: Xeon Phi (and others), many weak serial cores; single CPU core for both serial & parallel work. Heterogeneous computing model: complementary processors work together; CPU optimized for serial tasks, GPU accelerator optimized for parallel tasks.
Amdahl's Law Analysis: 98% Parallel Work. [Chart: run time in minutes (axis 0-10), split into serial and parallel work, for 1x CPU, 2 x MWC (.25x CPU), and 1 GPU + 1 CPU.]
Amdahl's Law Analysis: 90% Parallel Work. [Chart: run time in minutes (axis 0-10), split into serial and parallel work, for 1x CPU, 2 x MWC (.25x CPU), and 1 GPU + 1 CPU.]
Amdahl's Law Analysis: 80% Parallel Work. [Chart: run time in minutes (axis 0-10), split into serial and parallel work, for 1x CPU, 2 x MWC (.25x CPU), and 1 GPU + 1 CPU.]
Amdahl's Law Analysis: 70% Parallel Work. [Chart: run time in minutes (axis 0-14), split into serial and parallel work, for 1x CPU, 2 x MWC (.25x CPU), and 1 GPU + 1 CPU.]
Amdahl's Law Analysis: 60% Parallel Work. [Chart: run time in minutes (axis 0-18), split into serial and parallel work, for 1x CPU, 2 x MWC (.25x CPU), and 1 GPU + 1 CPU.]
Amdahl's Law Analysis: 50% Parallel Work. [Chart: run time in minutes (axis 0-25), split into serial and parallel work, for 1x CPU, 2 x MWC (.25x CPU), and 1 GPU + 1 CPU.]
Amdahl's Law Analysis: 40% Parallel Work. [Chart: run time in minutes (axis 0-30), split into serial and parallel work, for 1x CPU, 2 x MWC (.25x CPU), and 1 GPU + 1 CPU.]
Amdahl's Law Analysis: 30% Parallel Work. [Chart: run time in minutes (axis 0-30), split into serial and parallel work, for 1x CPU, 2 x MWC (.25x CPU), and 1 GPU + 1 CPU.]
Amdahl's Law Analysis: 20% Parallel Work. [Chart: run time in minutes (axis 0-35), split into serial and parallel work, for 1x CPU, 2 x MWC (.25x CPU), and 1 GPU + 1 CPU.]
Amdahl's Law Analysis: 10% Parallel Work. [Chart: run time in minutes (axis 0-40), split into serial and parallel work, for 1x CPU, 2 x MWC (.25x CPU), and 1 GPU + 1 CPU.]
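The comparison in the Amdahl's Law slides can be sketched as a small run-time model. This is an illustration under stated assumptions, not the model behind the original charts. The 0.25x serial speed of the MWC core is taken from the slide labels. The 10-minute CPU baseline is an assumption suggested by the chart axes. The 10x parallel speedup applied to both accelerated configurations is purely hypothetical, chosen so that only the serial core differs. The names `Machine` and `run_time` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    serial_speed: float    # serial throughput relative to the 1x CPU
    parallel_speed: float  # parallel throughput relative to the 1x CPU


BASELINE_MINUTES = 10.0           # assumed run time of the whole job on the 1x CPU
ASSUMED_PARALLEL_SPEEDUP = 10.0   # hypothetical value, not taken from the slides

MACHINES = [
    Machine("1x CPU", 1.0, 1.0),
    Machine("2 x MWC (.25x CPU)", 0.25, ASSUMED_PARALLEL_SPEEDUP),
    Machine("1 GPU + 1 CPU", 1.0, ASSUMED_PARALLEL_SPEEDUP),
]


def run_time(machine: Machine, parallel_fraction: float,
             baseline_minutes: float = BASELINE_MINUTES) -> float:
    """Amdahl-style run time: serial and parallel parts are scaled separately."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial = baseline_minutes * (1.0 - parallel_fraction) / machine.serial_speed
    parallel = baseline_minutes * parallel_fraction / machine.parallel_speed
    return serial + parallel


# Same parallel fractions as the slide sequence.
for percent in (98, 90, 80, 70, 60, 50, 40, 30, 20, 10):
    times = ", ".join(
        f"{m.name}: {run_time(m, percent / 100):.1f} min" for m in MACHINES
    )
    print(f"{percent}% parallel -> {times}")
```

Under these assumptions the many-weak-cores configuration becomes slower than the plain CPU once the parallel share drops below roughly 77%, because its serial segment takes four times as long. The CPU + GPU pairing never runs slower than the CPU alone.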
TESLA K80: WORLD'S FASTEST ACCELERATOR FOR DATA ANALYTICS AND SCIENTIFIC COMPUTING. Dual-GPU accelerator for max throughput: 2x faster; 2.9 TF, 4992 cores, 480 GB/s. Double the memory, designed for big data apps: 24GB (K40: 12GB). Maximum performance: dynamically maximize performance for every application (Oil & Gas, HPC, Viz, Data Analytics).
[Chart: Deep Learning: Caffe, axis 0x to 25x, for CPU, Tesla K40, and Tesla K80.]
Caffe benchmark: AlexNet training throughput based on 20 iterations. CPU: E5-2697v2 @ 2.70GHz, 64GB system memory, CentOS 6.2.
Performance Lead Continues to Grow.
[Two charts, 2008-2014, NVIDIA GPU vs. x86 CPU: peak double precision FLOPS (GFLOPS, axis 0-3500) and peak memory bandwidth (GB/s, axis 0-600). GPU points: M1060, M2090, K20, K40, K80. CPU points: Westmere, Sandy Bridge, Ivy Bridge, Haswell.]
10x Faster than CPU on Applications.
[Chart: K80 vs. CPU, axis 0x to 15x, for benchmarks, molecular dynamics, quantum chemistry, and physics.]
CPU: 12 cores, E5-2697v2 @ 2.70GHz, 64GB system memory, CentOS 6.2. GPU: single Tesla K80, Boost enabled.
Tesla Platform Enables Optimization: Scalable Nodes, ISA Choice. NVLink; x86.
Tesla Platform Enables Optimization: Ecosystem. Industry standard CPUs and interconnects. CPUs: ARM64, POWER, x86. Interconnects: Ethernet, Cray, InfiniBand, others. NVIDIA GPU. Industry-driven solutions.
CORAL Scalable Heterogeneous Node: NVLink In Practice. Approximately 3,400 nodes, each with:
IBM POWER9 CPUs and multiple NVIDIA Tesla Volta GPUs
CPUs and GPUs integrated on-node with high speed NVLink
Large coherent memory: over 512 GB (HBM + DDR4), all directly addressable from the CPUs and GPUs
An additional 800 GB of NVRAM, usable as burst buffer or as extended memory
Over 40 TF peak performance/node(!)
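The per-node figures above can be rolled up into rough system-wide totals. This is a minimal sketch: the inputs are the slide's approximate or lower-bound values, the constant names are illustrative, and decimal units (1 PB = 10^6 GB, 1 PF = 1000 TF) are assumed.

```python
# Rough system totals from the per-node CORAL figures.
NODES = 3400                 # "approximately 3,400 nodes"
PEAK_TF_PER_NODE = 40        # "over 40 TF", so a lower bound
COHERENT_GB_PER_NODE = 512   # "over 512 GB", so a lower bound
NVRAM_GB_PER_NODE = 800      # "an additional 800 GB of NVRAM"

system_peak_pf = NODES * PEAK_TF_PER_NODE / 1000        # TF -> PF
coherent_memory_pb = NODES * COHERENT_GB_PER_NODE / 1e6  # GB -> PB
nvram_pb = NODES * NVRAM_GB_PER_NODE / 1e6               # GB -> PB

print(f"Peak performance: at least ~{system_peak_pf:.0f} PF")
print(f"Coherent memory:  at least ~{coherent_memory_pb:.2f} PB")
print(f"NVRAM:            ~{nvram_pb:.2f} PB")
```

The resulting peak figure sits inside the 100-300 PFLOPS range quoted for the two systems earlier in the deck.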
Optimized Heterogeneous Node: CORAL Application Performance Projections
YOUR PLATFORM FOR DISCOVERY Tesla Accelerated Computing