GPU Computing
1. The GPU Advantage
2. To ExaScale and Beyond
3. The GPU is the Computer
The GPU Advantage: A Tale of Two Machines
Tianhe-1A at NSC Tianjin: The World's Fastest Supercomputer. 2.507 Petaflops, 7168 Tesla M2050 GPUs.
[Chart: 3 of Top 5 Supercomputers. Performance of Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II.]
[Chart: Top 5 Performance and Power. Performance and power (Megawatts) of Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II.]
NVIDIA/NCSA Green 500 Entry
128 nodes, each with:
- 1x Core i3 530 (2 cores, 2.93 GHz => 23.4 GFLOPS peak)
- 1x Tesla C2050 (14 cores, 1.15 GHz => 515.2 GFLOPS peak)
- 4x QDR Infiniband
- 4 GB DRAM
Theoretical Peak Perf: 68.95 TF
Footprint: ~20 ft^2 => 3.45 TF/ft^2
Cost: $500K (street price) => 137.9 MFLOPS/$
Linpack: 33.62 TF at 36.0 kW => 934 MFLOPS/W
The GPU Advantage: Efficiency and Programmability
GPU: 200 pJ/Instruction. Optimized for throughput; explicit management of on-chip memory.
CPU: 2 nJ/Instruction. Optimized for latency; caches.
[Chart: CUDA GPU Roadmap, DP GFLOPS per Watt: Tesla (2007) and Fermi (2009) low on the scale, Kepler (2011) mid-scale, Maxwell (2013) near 16.]
The GPU Advantage: CUDA Enables Programmability
CUDA C: C with a Few Keywords

Standard C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0f, x, y);

CUDA C Code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x, y);
GPU Computing Ecosystem:
- Libraries and Mathematical Packages
- Languages & APIs: DirectX, Fortran
- Integrated Development Environment: Parallel Nsight for MS Visual Studio
- Tools & Partners
- Consultants, Training & Certification
- Research & Education
- All Major Platforms
GPU Computing Today, By the Numbers:
- 200 Million CUDA-capable GPUs
- 600,000 CUDA Toolkit downloads
- 100,000 active GPU computing developers
- 8,000 members in the Parallel Nsight Developer Program
- 362 universities teaching CUDA worldwide
- 11 CUDA Centers of Excellence worldwide
To ExaScale and Beyond
Science Needs 1000x More Computing
[Chart: molecular dynamics problem size vs. compute (Gigaflops, log scale; 1,000,000 = 1 Petaflop, 1,000,000,000 = 1 Exaflop): BPTI, 3K atoms (1982); Estrogen Receptor, 36K atoms (1997); F1-ATPase, 327K atoms (2003); Ribosome, 2.7M atoms (2006); Chromatophore, 50M atoms (2010); Bacteria, billions of atoms (2012). Note on slide: "Ran for 8 months to simulate 2 nanoseconds."]
DARPA Study Identifies Four Challenges for ExaScale Computing
Report published September 28, 2008. Four major challenges:
- Energy and Power
- Memory and Storage
- Concurrency and Locality
- Resiliency
The number one issue is power: extrapolations of current architectures and technology indicate over 100 MW for an Exaflop! Power also constrains what we can put on a chip.
Available at www.darpa.mil/ipto/personnel/docs/exascale_study_initial.pdf
Power is THE Problem. A GPU is the Solution.
[Chart: GFLOPS/W per core, log scale from 0.1 to 100, 2010-2016. An ExaFLOPS system at 20 MW requires 50 GFLOPS/W; today's GPU delivers about 5 GFLOPS/W. Series: GPU, CPU, EF target.]
10x Energy Gap for Today's GPU
[Chart: same axes; a 10x ExaFLOPS gap separates today's GPU at 5 GFLOPS/W from the 50 GFLOPS/W target.]
GPUs Close the Gap with Process and Architecture
[Chart: 4x from process technology and 4x from architecture carry the GPU line from 2010 toward 2016.]
GPUs Close the Gap; With CPUs, a Gap Remains
[Chart: a 6x residual CPU gap remains against the 50 GFLOPS/W target.]
Heterogeneous Computing is Required to get to ExaScale
Echelon: NVIDIA's Extreme-Scale Computing Project
Echelon Team
System Sketch
[Diagram: a Dragonfly interconnect (optical fiber) connects high-radix router modules (RM). Each Processor Chip (PC) holds SM0..SM127, latency cores LC0..LC7 (C0..C7), L2 banks 0..1023, a NoC, memory controllers (MC), a NIC, DRAM cubes, and NV RAM. Software stack: Self-Aware OS, Self-Aware Runtime, Locality-Aware Compiler & Autotuner.]
- Node 0 (N0): 20 TF, 1.6 TB/s, 256 GB
- Module 0 (M0): 160 TF, 12.8 TB/s, 2 TB
- Cabinet 0 (C0): 2.6 PF, 205 TB/s, 32 TB
Execution Model
[Diagram: a thread issues loads/stores to objects A and B in a global address space through an abstract memory hierarchy; object B is also reachable via an active message.]
The High Cost of Data Movement
Fetching operands costs more than computing on them.
[Diagram (28nm, 20mm die): 64-bit DP op, 20 pJ; 256-bit access to 8 KB SRAM, 50 pJ; 256-bit on-chip buses, 26 pJ to 256 pJ to 1 nJ with distance; efficient off-chip link, 500 pJ; DRAM Rd/Wr, 16 nJ.]
An NVIDIA ExaScale Machine
Lane: 4 DFMAs, 20 GFLOPS
[Diagram: L0 I$, L0 D$, main registers, operand registers, 4x DFMA, 2x LSI]
SM: 8 lanes, 160 GFLOPS
[Diagram: 8 lanes sharing a cache ($) and switch]
Chip: 128 SMs, 20.48 TFLOPS, plus 8 Latency Processors (LP)
[Diagram: 128 SMs at 160 GF each, 1024 SRAM banks of 256 KB each, MCs, NI, NoC]
Node MCM: 20 TF + 256 GB
[Diagram: GPU chip (20 TF, 256 MB on-chip), DRAM stacks, NV memory; 150 GB/s network BW, 1.4 TB/s DRAM BW]
Cabinet: 128 Nodes, 2.56 PF, 38 kW
[Diagram: 32 modules, 4 nodes/module, central router module(s), Dragonfly interconnect]
System: to ExaScale and Beyond. Dragonfly interconnect; 400 cabinets is ~1 EF and ~15 MW.
GPU Computing is the Future
1. GPU computing is #1 today: on top of the Top 500 AND dominant on the Green 500.
2. GPU computing enables ExaScale at reasonable power.
3. The GPU is the computer: a general-purpose computing engine, not just an accelerator.
4. The real challenge is software.
Power is THE Problem
1. Data movement dominates power.
2. Optimize the storage hierarchy.
3. Tailor memory to the application.
Some Applications Have Hierarchical Re-Use
[Chart: % miss vs. working-set size for DGEMM, size from 1.0E+0 to 1.0E+12 bytes (log scale): the miss rate falls smoothly as capacity grows.]
Applications with Hierarchical Reuse Want a Deep Storage Hierarchy
[Diagram: L2 banks backed by L3 and L4 levels.]
Some Applications Have Plateaus in Their Working Sets
[Chart: % miss vs. working-set size for a table-lookup workload, size from 1.0E+0 to 1.0E+12 bytes (log scale): the miss rate drops in discrete steps at each plateau.]
Applications with Plateaus Want a Shallow Storage Hierarchy
[Diagram: L2 banks attached directly to the NoC.]
Configurable Memory Can Do Both At the Same Time
- Flat hierarchy for large working sets
- Deep hierarchy for reuse
- Shared memory for explicit management
- Cache memory for unpredictable sharing
Configurable Memory Reduces Distance and Energy
[Diagram: routers placing configurable memory close to the lanes that use it.]