GPU Computing. The GPU Advantage. To ExaScale and Beyond. The GPU is the Computer


1

2 GPU Computing: The GPU Advantage, To ExaScale and Beyond, The GPU is the Computer

3 The GPU Advantage

4 The GPU Advantage: A Tale of Two Machines

5 Tianhe-1A at NSC Tianjin

6 Tianhe-1A at NSC Tianjin: The World's Fastest Supercomputer. Petaflop performance, 7168 Tesla M2050 GPUs.

7 Tesla M2050 GPUs

8 3 of Top 5 Supercomputers (chart: performance in Gigaflops for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II)

9 Top 5: Performance and Power (chart: Gigaflops and Megawatts for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II)

10 NVIDIA/NCSA Green 500 Entry

11 NVIDIA/NCSA Green 500 Entry

12 NVIDIA/NCSA Green 500 Entry. 128 nodes, each with: 1x Core i3 530 (2 cores, 2.93 GHz => 23.4 GFLOP peak); 1x Tesla C2050 (14 cores, 1.15 GHz => GFLOP peak); 4x QDR Infiniband; 4 GB DRAM. Theoretical Peak Perf: TF. Footprint: ~20 ft^2 => 3.45 TF/ft^2. Cost: $500K (street price) => MF/$. Linpack: TF, 36.0 kW => 934 MF/W.
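
The peak figures above are just cores x clock x FLOPs per cycle. A small sketch of that arithmetic in C, assuming 4 double-precision FLOPs per core per cycle for the Core i3 530 (an assumption used for illustration, not stated on the slide):

    #include <stdio.h>

    /* Peak GFLOP/s = cores * clock (GHz) * FLOPs per cycle per core. */
    static double peak_gflops(int cores, double ghz, int flops_per_cycle)
    {
        return cores * ghz * flops_per_cycle;
    }

    int main(void)
    {
        printf("Core i3 530 peak: %.1f GFLOP/s\n", peak_gflops(2, 2.93, 4));  /* ~23.4 */
        return 0;
    }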

13 The GPU Advantage: Efficiency and Programmability

14 GPU: 200 pJ/Instruction. CPU: 2 nJ/Instruction.

15 GPU: 200 pJ/Instruction, optimized for throughput, with explicit management of on-chip memory. CPU: 2 nJ/Instruction, optimized for latency, with caches.
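
One concrete way that explicit management shows up in CUDA C: a block stages its tile of data in __shared__ memory, synchronizes, and then reuses the tile without further DRAM traffic. A minimal sketch (the tile size, the in-block reversal, and the assumption that n is a multiple of the block size are all illustrative):

    // Sketch of explicitly managed on-chip memory in CUDA C.
    #define TILE 256

    __global__ void reverse_tiles(float *d, int n)
    {
        __shared__ float tile[TILE];                  // explicitly managed on-chip storage
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = d[gid];                   // one coalesced load from DRAM
        __syncthreads();                              // whole tile now resident on chip
        d[gid] = tile[blockDim.x - 1 - threadIdx.x];  // reuse the tile; no second DRAM read
    }

    // Launch (n assumed to be a multiple of TILE): reverse_tiles<<<n / TILE, TILE>>>(d_data, n);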

16 DP GFLOPS per Watt: CUDA GPU Roadmap (chart: Tesla, Fermi, Kepler, Maxwell generations)

17 The GPU Advantage: Efficiency and Programmability

18 The GPU Advantage: CUDA Enables Programmability

19 CUDA C: C with a Few Keywords

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

CUDA C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
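
The slide's parallel version assumes x and y already point to GPU memory. A minimal host-side sketch of that setup, with sizes and fill values chosen purely for illustration (not part of the slide):

    // Hedged host-side sketch around the slide's saxpy_parallel kernel.
    #include <stdlib.h>

    int main(void)
    {
        int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *x = (float*)malloc(bytes), *y = (float*)malloc(bytes);   // host arrays
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }       // illustrative data

        float *d_x, *d_y;                                               // device copies
        cudaMalloc(&d_x, bytes);
        cudaMalloc(&d_y, bytes);
        cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

        int nblocks = (n + 255) / 256;
        saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);            // the slide's launch

        cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);              // y now holds 2*x + y
        cudaFree(d_x); cudaFree(d_y); free(x); free(y);
        return 0;
    }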

20 GPU Computing Ecosystem: Libraries; Mathematical Packages; Research & Education; Consultants, Training & Certification; Integrated Development Environment (Parallel Nsight for MS Visual Studio); Tools & Partners; Languages & APIs (DirectX, Fortran); All Major Platforms.

21 GPU Computing Today, By the Numbers: 200 Million CUDA-capable GPUs; 600,000 CUDA Toolkit downloads; active GPU computing developers; members in the Parallel Nsight Developer Program; universities teaching CUDA worldwide; CUDA Centers of Excellence worldwide.

22 To ExaScale and Beyond

23 Science Needs 1000x More Computing (chart: Gigaflops from 1,000 up through 1,000,000 = 1 Petaflop and 1,000,000,000 = 1 Exaflop, versus problem size: BPTI, 3K atoms; Estrogen Receptor, 36K atoms; F1-ATPase, 327K atoms; Ribosome, 2.7M atoms, which ran for 8 months to simulate 2 nanoseconds; Chromatophore, 50M atoms; Bacteria, billions of atoms)

24 DARPA Study Identifies Four Challenges for ExaScale Computing. Report published September 28, 2008. Four major challenges: the Energy and Power challenge; the Memory and Storage challenge; the Concurrency and Locality challenge; the Resiliency challenge. The number one issue is power: extrapolations of current architectures and technology indicate over 100 MW for an Exaflop! Power also constrains what we can put on a chip. Available at

25 Power is THE Problem

26 Power is THE Problem. A GPU is the Solution.

27 GFLOPS/W (Core): an ExaFLOPS machine at 20 MW requires 50 GFLOPS/W; today's GPU delivers 5 GFLOPS/W (chart: GPU, CPU, EF)

28 GFLOPS/W (Core): a 10x energy gap between today's GPU at 5 GFLOPS/W and the 50 GFLOPS/W ExaFLOPS target (chart: GPU, CPU, EF)
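
The 50 GFLOPS/W figure is simply the exaflop goal divided by the 20 MW power budget, and the 10x gap follows from today's roughly 5 GFLOPS/W GPUs. A tiny sanity check in C:

    #include <stdio.h>

    int main(void)
    {
        double exaflop  = 1e18;                             /* FLOP/s                */
        double budget_w = 20e6;                             /* 20 MW                 */
        double gpu_gfw  = 5.0;                              /* today's GPU, GFLOPS/W */
        double target   = exaflop / budget_w / 1e9;         /* required GFLOPS/W     */
        printf("target %.0f GFLOPS/W, gap %.0fx\n", target, target / gpu_gfw);
        return 0;                                           /* target 50, gap 10x    */
    }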

29 GPUs Close the Gap with Process and Architecture: gains from Architecture plus 4x from Process Technology (chart: GFLOPS/W (Core) for GPU, CPU, EF)

30 GPUs Close the Gap with Process and Architecture (chart: GFLOPS/W for GPU, CPU*, EF)

31 GPUs Close the Gap; with CPUs, a Gap Remains: a residual CPU gap (chart: GFLOPS/W for GPU, CPU*, EF)

32 GPUs Close the Gap; with CPUs, a Gap Remains: Heterogeneous Computing is Required to get to ExaScale (chart: GFLOPS/W for GPU, CPU*, EF)

33 Echelon: NVIDIA's Extreme-Scale Computing Project

34 Echelon Team

35 System Sketch (diagram): Dragonfly Interconnect (optical fiber); High-Radix Router Module (RM); Processor Chip (PC) with SM0..SM127, latency cores LC0..LC7, L0 caches, NoC, L2 banks, MC, NV RAM, NIC, and DRAM Cubes; Self-Aware OS, Self-Aware Runtime, Locality-Aware Compiler & Autotuner. Node 0 (N0): 20 TF, 1.6 TB/s, 256 GB. Module 0 (M0): 160 TF, 12.8 TB/s, 2 TB. Cabinet 0 (C0): 2.6 PF, 205 TB/s, 32 TB. Echelon System (nodes up to N7, modules up to M15).

36 Load/Store Execution Model (diagram): objects and threads A and B in a Global Address Space, an Abstract Memory Hierarchy, and Active Messages.

37 The High Cost of Data Movement: fetching operands costs more than computing on them (diagram, 28nm process, 20mm die). 64-bit DP operation: 20 pJ. 256-bit access to an 8 kB SRAM: 50 pJ. 256-bit on-chip buses: 26 pJ, 256 pJ, and 1 nJ depending on distance across the die. Efficient off-chip link: 500 pJ. DRAM Rd/Wr: 16 nJ.
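
Those per-operation energies make it easy to bound a kernel's energy from its operation counts and to see why DRAM traffic dominates. A back-of-the-envelope sketch in C (the operation counts below are hypothetical; the per-operation energies are the slide's):

    #include <stdio.h>

    int main(void)
    {
        double e_flop_pj = 20.0;        /* 64-bit DP operation          */
        double e_sram_pj = 50.0;        /* 256-bit access to 8 kB SRAM  */
        double e_dram_pj = 16000.0;     /* DRAM read/write (16 nJ)      */

        /* Hypothetical kernel profile, for illustration only. */
        double flops = 1e9, sram = 2e9, dram = 1e8;

        printf("compute %.2f J, SRAM %.2f J, DRAM %.2f J\n",
               flops * e_flop_pj * 1e-12,    /* 0.02 J */
               sram  * e_sram_pj * 1e-12,    /* 0.10 J */
               dram  * e_dram_pj * 1e-12);   /* 1.60 J: data movement dominates */
        return 0;
    }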

38 An NVIDIA ExaScale Machine

39 Lane: 4 DFMAs, 20 GFLOPS (diagram: L0 I$, L0 D$, Main Registers, Operand Registers, four DFMA units, LSI units)

40 SM: 8 lanes, 160 GFLOPS (diagram: lanes, shared $, Switch)

41 Chip: 128 SMs at 160 GF each (about 20 TFLOPS) + 8 Latency Processors; 1024 SRAM banks, 256 KB each (diagram: SMs, latency processors, MCs, NI, NoC)

42 Node MCM: 20 TF + 256 GB. GPU chip (20 TF DP, 256 MB), DRAM stacks, NV Memory; 150 GB/s network BW, 1.4 TB/s DRAM BW.

43 Cabinet: 128 Nodes, 2.56 PF, 38 kW. 32 Modules, 4 Nodes/Module, Central Router Module(s), Dragonfly Interconnect (diagram: nodes, routers, modules).

44 System: to ExaScale and Beyond. Dragonfly Interconnect; 400 Cabinets is ~1 EF and ~15 MW.
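
The per-level numbers compose multiplicatively from SM to system; a small sanity check of the figures given on slides 40 through 44 (the 20 TF node value is the slides' rounded number):

    #include <stdio.h>

    int main(void)
    {
        double chip_tf    = 160.0 * 128 / 1000.0;      /* 128 SMs x 160 GF  -> ~20.5 TF */
        double cabinet_pf = 20.0 * 128 / 1000.0;       /* 128 nodes x 20 TF -> 2.56 PF  */
        double system_ef  = cabinet_pf * 400 / 1000.0; /* 400 cabinets      -> ~1 EF    */
        printf("chip %.1f TF, cabinet %.2f PF, system %.2f EF\n",
               chip_tf, cabinet_pf, system_ef);
        return 0;
    }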

45

46 GPU Computing is the Future. GPU Computing is #1 Today: on the Top 500, AND dominant on the Green 500. GPU Computing Enables ExaScale at Reasonable Power. The GPU is the Computer: a general-purpose computing engine, not just an accelerator. The Real Challenge is Software.

47

48

49 Power is THE Problem: Data Movement Dominates Power. Optimize the Storage Hierarchy. Tailor Memory to the Application.

50 Some Applications Have Hierarchical Re-Use (chart: % miss rate vs. working-set size from 1.0E+0 to 1.0E+12 for DGEMM)

51 Applications with Hierarchical Reuse Want a Deep Storage Hierarchy (diagram: multiple L2s beneath an L3 and an L4)

52 Some Applications Have Plateaus in Their Working Sets (chart: % miss rate vs. working-set size from 1.0E+0 to 1.0E+12 for a table-lookup workload)

53 Applications with Plateaus Want a Shallow Storage Hierarchy (diagram: L2s attached directly to the NoC)

54 Configurable Memory Can Do Both at the Same Time: a flat hierarchy for large working sets; a deep hierarchy for reuse; shared memory for explicit management; cache memory for unpredictable sharing (diagram: NoC)
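
Today's Fermi-class GPUs already offer a limited form of this configurability: the 64 KB of on-chip memory per SM can be split between shared memory and L1 cache on a per-kernel basis. A minimal sketch using the CUDA runtime call for that split (the two kernels are hypothetical placeholders):

    // Hypothetical kernels used only to illustrate the cache-configuration call.
    __global__ void stencil_kernel(float *d, int n) { /* ... tiles data in __shared__ memory ... */ }
    __global__ void lookup_kernel(float *d, int n)  { /* ... irregular, cache-friendly accesses ... */ }

    void configure_memory(void)
    {
        // Explicitly managed reuse: prefer the larger shared-memory configuration.
        cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);

        // Unpredictable sharing: prefer the larger L1-cache configuration.
        cudaFuncSetCacheConfig(lookup_kernel, cudaFuncCachePreferL1);
    }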

55 Configurable Memory Reduces Distance and Energy (diagram: routers)
