GPU Computing
1. The GPU Advantage
2. To ExaScale and Beyond
3. The GPU is the Computer
The GPU Advantage: A Tale of Two Machines
Tianhe-1A at NSC Tianjin: The World's Fastest Supercomputer. 2.507 Petaflops, 7168 Tesla M2050 GPUs.
[Chart: 3 of Top 5 Supercomputers. Performance of Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II.]
[Chart: Top 5 Performance and Power. Performance and power (Megawatts) of Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II.]
NVIDIA/NCSA Green 500 Entry
128 nodes, each with:
- 1x Core i3 530 (2 cores, 2.93 GHz => 23.4 GFLOPS peak)
- 1x Tesla C2050 (14 cores, 1.15 GHz => 515.2 GFLOPS peak)
- 4x QDR Infiniband
- 4 GB DRAM
Theoretical Peak Perf: 68.95 TF
Footprint: ~20 ft^2 => 3.45 TF/ft^2
Cost: $500K (street price) => 137.9 MFLOPS/$
Linpack: 33.62 TF at 36.0 kW => 934 MFLOPS/W
The GPU Advantage: Efficiency and Programmability
GPU: 200 pJ/Instruction. Optimized for throughput; explicit management of on-chip memory.
CPU: 2 nJ/Instruction. Optimized for latency; caches.
[Chart: CUDA GPU Roadmap, DP GFLOPS per Watt: Tesla (2007) and Fermi (2009) low on the scale, Kepler (2011) mid-scale, Maxwell (2013) near 16.]
The GPU Advantage: CUDA Enables Programmability
CUDA C: C with a Few Keywords

Standard C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0f, x, y);

CUDA C Code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, x, y);
GPU Computing Ecosystem:
- Libraries and Mathematical Packages
- Languages & APIs: DirectX, Fortran
- Integrated Development Environment: Parallel Nsight for MS Visual Studio
- Tools & Partners
- Consultants, Training & Certification
- Research & Education
- All Major Platforms
GPU Computing Today, By the Numbers:
- 200 Million CUDA-capable GPUs
- 600,000 CUDA Toolkit downloads
- 100,000 active GPU computing developers
- 8,000 members in the Parallel Nsight Developer Program
- 362 universities teaching CUDA worldwide
- 11 CUDA Centers of Excellence worldwide
To ExaScale and Beyond
Science Needs 1000x More Computing
[Chart: molecular dynamics problem size vs. compute (Gigaflops, log scale; 1,000,000 = 1 Petaflop, 1,000,000,000 = 1 Exaflop): BPTI, 3K atoms (1982); Estrogen Receptor, 36K atoms (1997); F1-ATPase, 327K atoms (2003); Ribosome, 2.7M atoms (2006); Chromatophore, 50M atoms (2010); Bacteria, billions of atoms (2012). Note on slide: "Ran for 8 months to simulate 2 nanoseconds."]
DARPA Study Identifies Four Challenges for ExaScale Computing
Report published September 28, 2008. Four major challenges:
- Energy and Power
- Memory and Storage
- Concurrency and Locality
- Resiliency
The number one issue is power: extrapolations of current architectures and technology indicate over 100 MW for an Exaflop! Power also constrains what we can put on a chip.
Available at www.darpa.mil/ipto/personnel/docs/exascale_study_initial.pdf
Power is THE Problem. A GPU is the Solution.
[Chart: GFLOPS/W per core, log scale from 0.1 to 100, 2010-2016. An ExaFLOPS system at 20 MW requires 50 GFLOPS/W; today's GPU delivers about 5 GFLOPS/W. Series: GPU, CPU, EF target.]
10x Energy Gap for Today's GPU
[Chart: same axes; a 10x ExaFLOPS gap separates today's GPU at 5 GFLOPS/W from the 50 GFLOPS/W target.]
GPUs Close the Gap with Process and Architecture
[Chart: 4x from process technology and 4x from architecture carry the GPU line from 2010 toward 2016.]
GPUs Close the Gap; With CPUs, a Gap Remains
[Chart: a 6x residual CPU gap remains against the 50 GFLOPS/W target.]
Heterogeneous Computing is Required to get to ExaScale
Echelon: NVIDIA's Extreme-Scale Computing Project
Echelon Team
System Sketch
[Diagram: a Dragonfly interconnect (optical fiber) connects high-radix router modules (RM). Each Processor Chip (PC) holds SM0..SM127, latency cores LC0..LC7 (C0..C7), L2 banks 0..1023, a NoC, memory controllers (MC), a NIC, DRAM cubes, and NV RAM. Software stack: Self-Aware OS, Self-Aware Runtime, Locality-Aware Compiler & Autotuner.]
- Node 0 (N0): 20 TF, 1.6 TB/s, 256 GB
- Module 0 (M0): 160 TF, 12.8 TB/s, 2 TB
- Cabinet 0 (C0): 2.6 PF, 205 TB/s, 32 TB
Execution Model
[Diagram: a thread issues loads/stores to objects A and B in a global address space through an abstract memory hierarchy; object B is also reachable via an active message.]
The High Cost of Data Movement
Fetching operands costs more than computing on them.
[Diagram (28nm, 20mm die): 64-bit DP op, 20 pJ; 256-bit access to 8 KB SRAM, 50 pJ; 256-bit on-chip buses, 26 pJ to 256 pJ to 1 nJ with distance; efficient off-chip link, 500 pJ; DRAM Rd/Wr, 16 nJ.]
An NVIDIA ExaScale Machine
Lane: 4 DFMAs, 20 GFLOPS
[Diagram: L0 I$, L0 D$, main registers, operand registers, 4x DFMA, 2x LSI]
SM: 8 lanes, 160 GFLOPS
[Diagram: 8 lanes sharing a cache ($) and switch]
Chip: 128 SMs, 20.48 TFLOPS, plus 8 Latency Processors (LP)
[Diagram: 128 SMs at 160 GF each, 1024 SRAM banks of 256 KB each, MCs, NI, NoC]
Node MCM: 20 TF + 256 GB
[Diagram: GPU chip (20 TF, 256 MB on-chip), DRAM stacks, NV memory; 150 GB/s network BW, 1.4 TB/s DRAM BW]
Cabinet: 128 Nodes, 2.56 PF, 38 kW
[Diagram: 32 modules, 4 nodes/module, central router module(s), Dragonfly interconnect]
System: to ExaScale and Beyond. Dragonfly interconnect; 400 cabinets is ~1 EF and ~15 MW.
GPU Computing is the Future
1. GPU computing is #1 today: on top of the Top 500 AND dominant on the Green 500.
2. GPU computing enables ExaScale at reasonable power.
3. The GPU is the computer: a general-purpose computing engine, not just an accelerator.
4. The real challenge is software.
Power is THE Problem
1. Data movement dominates power.
2. Optimize the storage hierarchy.
3. Tailor memory to the application.
Some Applications Have Hierarchical Re-Use
[Chart: % miss vs. working-set size for DGEMM, size from 1.0E+0 to 1.0E+12 bytes (log scale): the miss rate falls smoothly as capacity grows.]
Applications with Hierarchical Reuse Want a Deep Storage Hierarchy
[Diagram: L2 banks backed by L3 and L4 levels.]
Some Applications Have Plateaus in Their Working Sets
[Chart: % miss vs. working-set size for a table-lookup workload, size from 1.0E+0 to 1.0E+12 bytes (log scale): the miss rate drops in discrete steps at each plateau.]
Applications with Plateaus Want a Shallow Storage Hierarchy
[Diagram: L2 banks attached directly to the NoC.]
Configurable Memory Can Do Both At the Same Time
- Flat hierarchy for large working sets
- Deep hierarchy for reuse
- Shared memory for explicit management
- Cache memory for unpredictable sharing
Configurable Memory Reduces Distance and Energy
[Diagram: routers placing configurable memory close to the lanes that use it.]