GPU Computing. The GPU Advantage. To ExaScale and Beyond. The GPU is the Computer


GPU Computing. 1. The GPU Advantage 2. To ExaScale and Beyond 3. The GPU is the Computer

The GPU Advantage

The GPU Advantage: A Tale of Two Machines

Tianhe-1A at NSC Tianjin

Tianhe-1A at NSC Tianjin: The World's Fastest Supercomputer. 2.507 Petaflops, 7168 Tesla M2050 GPUs

Tesla M2050 GPUs

3 of Top 5 Supercomputers [chart: performance (Gigaflops) of Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II]

Top 5 Performance and Power [chart: performance (Gigaflops) and power (Megawatts) of Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II]

NVIDIA/NCSA Green 500 Entry
128 nodes, each with:
- 1x Core i3 530 (2 cores, 2.93 GHz => 23.4 GFLOPS peak)
- 1x Tesla C2050 (14 cores, 1.15 GHz => 515.2 GFLOPS peak)
- 4x QDR InfiniBand
- 4 GB DRAM
Theoretical Peak Perf: 68.95 TF
Footprint: ~20 ft^2 => 3.45 TF/ft^2
Cost: $500K (street price) => 137.9 MF/$
Linpack: 33.62 TF, 36.0 kW => 934 MF/W

The GPU Advantage: Efficiency and Programmability

GPU: 200 pJ/Instruction. CPU: 2 nJ/Instruction

GPU: 200 pJ/Instruction; Optimized for Throughput; Explicit Management of On-chip Memory. CPU: 2 nJ/Instruction; Optimized for Latency; Caches

CUDA GPU Roadmap [chart: double-precision GFLOPS per watt, 2007-2013: Tesla, Fermi, Kepler, Maxwell]

The GPU Advantage: Efficiency and Programmability

The GPU Advantage: CUDA Enables Programmability

CUDA C: C with a Few Keywords

Standard C Code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

CUDA C Code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

GPU Computing Ecosystem: Libraries; Mathematical Packages; Research & Education; Consultants, Training & Certification; Integrated Development Environment (Parallel Nsight for MS Visual Studio); Tools & Partners; Languages & APIs (DirectX, Fortran); All Major Platforms

GPU Computing Today, By the Numbers:
200 Million CUDA-Capable GPUs
600,000 CUDA Toolkit Downloads
100,000 Active GPU Computing Developers
8,000 Members in Parallel Nsight Developer Program
362 Universities Teaching CUDA Worldwide
11 CUDA Centers of Excellence Worldwide

To ExaScale and Beyond

Science Needs 1000x More Computing [chart: molecular dynamics problem size vs. compute, 1982-2012, Gigaflops axis from 1,000 to 1,000,000,000 with 1 Petaflop and 1 Exaflop marked: BPTI 3K atoms; Estrogen Receptor 36K atoms; F1-ATPase 327K atoms; Ribosome 2.7M atoms (ran for 8 months to simulate 2 nanoseconds); Chromatophore 50M atoms near 1 Petaflop; Bacteria with billions of atoms toward 1 Exaflop]

DARPA Study Identifies Four Challenges for ExaScale Computing. Report published September 28, 2008. Four Major Challenges: the Energy and Power challenge; the Memory and Storage challenge; the Concurrency and Locality challenge; the Resiliency challenge. The number one issue is power: extrapolations of current architectures and technology indicate over 100 MW for an Exaflop! Power also constrains what we can put on a chip. Available at www.darpa.mil/ipto/personnel/docs/exascale_study_initial.pdf

Power is THE Problem

Power is THE Problem. A GPU is the Solution

ExaFLOPS at 20 MW = 50 GFLOPS/W [chart: GFLOPS/W (Core), log scale 0.1 to 100, 2010-2016, with trend lines for GPU (5 GFLOPS/W today), CPU, and the EF target]

10x Energy Gap for Today's GPU [chart: GFLOPS/W (Core) vs. year, 2010-2016: today's GPU at 5 GFLOPS/W against the 50 GFLOPS/W ExaFLOPS target, a 10x gap]

GPUs Close the Gap with Process and Architecture [chart: GFLOPS/W (Core) vs. year, 2010-2016: 4x from Process Technology, 4x from Architecture]

GPUs Close the Gap; With CPUs, a Gap Remains [chart: GFLOPS/W vs. year, 2010-2016: a 6x residual CPU gap]

GPUs Close the Gap; With CPUs, a Gap Remains. Heterogeneous Computing is Required to Get to ExaScale

Echelon: NVIDIA's Extreme-Scale Computing Project

Echelon Team

System Sketch [diagram: Echelon system. Processor Chip (PC): SM0-SM127 plus latency cores LC0-LC7 on a NoC, with L2 banks L2_0-L2_1023, MC, NV RAM, NIC, Self-Aware OS, and Self-Aware Runtime. Node 0 (N0): 20 TF, 1.6 TB/s, 256 GB. Module 0 (M0): 160 TF, 12.8 TB/s, 2 TB. Cabinet 0 (C0): 2.6 PF, 205 TB/s, 32 TB. Cabinets joined by a Dragonfly Interconnect (optical fiber) through High-Radix Router Modules (RM). Locality-Aware Compiler & Autotuner]

Execution Model [diagram: objects A and B in a Global Address Space, threads with Load/Store access through an Abstract Memory Hierarchy, and Active Messages between nodes]

The High Cost of Data Movement. Fetching operands costs more than computing on them [diagram, 28nm chip 20 mm across: 64-bit DP operation 20 pJ; 256-bit access to 8 KB SRAM 50 pJ; moving data over 256-bit buses 26 pJ to 256 pJ on chip, 1 nJ across the die; efficient off-chip link 500 pJ; DRAM Rd/Wr 16 nJ]

An NVIDIA ExaScale Machine

Lane: 4 DFMAs, 20 GFLOPS [diagram: L0 I$, L0 D$, Main Registers, Operand Registers, four DFMA units, LSI units]

SM: 8 Lanes, 160 GFLOPS [diagram: lanes sharing a cache and switch]

Chip: 128 SMs (160 GF each) = 20.48 TFLOPS, plus 8 Latency Processors [diagram: SMs and latency processors on a NoC; 1024 memory banks of 256 KB each; MCs and NI]

Node MCM: 20 TF + 256 GB [diagram: GPU chip (20 TF, 256 MB on-chip), DRAM stacks (1.4 TB/s DRAM BW), NV Memory, 150 GB/s network BW]

Cabinet: 128 Nodes, 2.56 PF, 38 kW. 32 Modules, 4 Nodes/Module, Central Router Module(s), Dragonfly Interconnect [diagram: modules of nodes around router modules]

System: To ExaScale and Beyond. Dragonfly Interconnect. 400 Cabinets is ~1 EF and ~15 MW

GPU Computing is the Future
1. GPU Computing is #1 Today: on the Top 500 AND dominant on the Green 500
2. GPU Computing Enables ExaScale at Reasonable Power
3. The GPU is the Computer: a general-purpose computing engine, not just an accelerator
4. The Real Challenge is Software

Power is THE Problem
1. Data Movement Dominates Power
2. Optimize the Storage Hierarchy
3. Tailor Memory to the Application

Some Applications Have Hierarchical Re-Use [chart: miss rate (%) vs. working-set size, 1 to 1e12, for DGEMM]

Applications with Hierarchical Reuse Want a Deep Storage Hierarchy [diagram: multiple L2 caches beneath an L3 beneath an L4]

Some Applications Have Plateaus in Their Working Sets [chart: miss rate (%) vs. working-set size, 1 to 1e12, for a table-lookup workload]

Applications with Plateaus Want a Shallow Storage Hierarchy [diagram: L2 caches attached directly to the NoC]

Configurable Memory Can Do Both at the Same Time: a flat hierarchy for large working sets; a deep hierarchy for reuse; shared memory for explicit management; cache memory for unpredictable sharing [diagram: configurable banks on the NoC]

Configurable Memory Reduces Distance and Energy [diagram: routers]