GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

In this talk:
A Brief History of GPUs
Hardware Overview
Programming Models
Ask questions at any point!

A Brief History of GPUs

Once upon a time (1997)...
GPU: Graphics Processing Unit. Originated as specialized hardware for 3D games.
Why a different processor? Rendering is the most computationally intense part of a game, and the CPU is not an ideal device for computer graphics rendering.
[Screenshots: Quake with software rendering vs. Quake with hardware rendering]
A freed CPU allows more complex AI, dynamic world generation, and realistic dynamics.

Evolution of GPUs
RIVA 128: 3M transistors
GeForce 256: 23M transistors
GeForce 3: 60M transistors
GeForce FX: 250M transistors
GeForce 8800: 681M transistors
Kepler: 7B transistors
[Timeline, 1995-2012: from fixed function, through programmable shaders, to general-programmable]

NVIDIA Kepler
NVIDIA Kepler K80:
2.91 TFLOP/s double precision
8.74 TFLOP/s single precision
480 GB/s memory bandwidth
4,992 functional units (cores)
24 GB DRAM
About 2x faster than the #1 system on the Top500 in 1997.
[Image: NVIDIA GK110 (Kepler) die]

Tesla K80: 10x Faster on Scientific Apps
[Chart: K80 vs. CPU speedup (0x-15x) across molecular dynamics, quantum chemistry, and physics benchmarks]
CPU: 12 cores, E5-2697v2 @ 2.70GHz, 64GB system memory, CentOS 6.2. GPU: single Tesla K80, boost enabled.

TITAN: World's Fastest Open Science Supercomputer
18,688 Tesla K20X GPUs
27 petaflops peak, 17.6 petaflops on Linpack
90% of performance from GPUs
Ranked 2nd on the Top500, June 2015

Hardware Overview

Accelerated Computing
CPU: optimized for serial tasks. GPU accelerator: optimized for parallel tasks.

Low Latency or High Throughput?
CPU: optimized for low-latency access to cached data sets; control logic for out-of-order and speculative execution.
GPU: optimized for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation.

Low Latency or High Throughput?
Design leads to performance: CPU architecture must minimize latency within each thread, while GPU architecture hides latency with computation (data parallelism across 10,000+ threads).
[Diagram: a high-throughput GPU interleaves many threads T1..Tn, switching to ready work while other threads wait for data; a low-latency CPU core runs a few threads T1..T4, each stalling on memory in turn]

Work Pattern
GPU as a coprocessor: the compute-intensive functions of the application run on the GPU, while the rest of the sequential code runs on the CPU.

Simple Processing Flow
1. Copy input data from CPU memory to GPU memory (across the PCI bus)
2. Execute the GPU program; results are stored in GPU memory
3. Copy results from GPU memory back to CPU memory
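These three steps map directly onto the CUDA runtime API. A minimal host-side sketch (illustrative, not from the slides; `kernel`, `h_in`, `h_out`, `blocks`, `threads`, and `N` are placeholder names):

```cuda
#include <cuda_runtime.h>

// 1. Copy input data from CPU memory to GPU memory
float *d_in, *d_out;
cudaMalloc(&d_in,  N * sizeof(float));
cudaMalloc(&d_out, N * sizeof(float));
cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

// 2. Execute the GPU program; results stay in GPU memory
kernel<<<blocks, threads>>>(d_in, d_out, N);

// 3. Copy results from GPU memory back to CPU memory
cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
```

Every transfer in steps 1 and 3 crosses the PCI bus, which is why minimizing host-device traffic matters for performance.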

System Diagram: Single GPU
[Diagram: CPU and one GPU connected across the PCI bus]

System Diagram: Many GPUs
[Diagram: multiple GPUs attached to the CPU across PCI buses]

Programming Models

Three Ways to Accelerate Applications
Libraries: drop-in acceleration
OpenACC Directives: easily accelerate applications
Language Extensions: maximum flexibility

Sparse Matrix-Vector Multiply
y = Ax, with A stored in CSR (compressed sparse row) format.
Used in many applications: fluid dynamics, circuit simulation, structural mechanics.

Three Ways to Accelerate Applications
Libraries: drop-in acceleration
OpenACC Directives: easily accelerate applications
Language Extensions: maximum flexibility

Libraries: cuSPARSE
cusparse<t>csrmv() performs a matrix-vector multiply using a matrix in CSR format.
A maintained library: bug free, high performance, performance portable.
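For illustration, a call to the single-precision variant might look like the following sketch (based on the CUDA 7-era cuSPARSE API; the device pointers d_val, d_rowptr, d_colind, d_x, d_y and the sizes n and nnz are assumed to be set up already):

```cuda
#include <cusparse_v2.h>

cusparseHandle_t handle;
cusparseMatDescr_t descr;
cusparseCreate(&handle);
cusparseCreateMatDescr(&descr);         // general matrix, 0-based indexing by default

const float alpha = 1.0f, beta = 0.0f;  // computes y = alpha*A*x + beta*y
cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
               n, n, nnz, &alpha, descr,
               d_val, d_rowptr, d_colind, d_x, &beta, d_y);

cusparseDestroyMatDescr(descr);
cusparseDestroy(handle);
```

The library picks the parallelization strategy internally, which is what makes this the "drop-in" path: the caller only supplies the CSR arrays.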

Three Ways to Accelerate Applications
Libraries: drop-in acceleration
OpenACC Directives: easily accelerate applications
Language Extensions: maximum flexibility

OpenACC: The Standard for Massively Parallel Directives
Simple: directives are the easy path to accelerate compute-intensive applications.
Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors.
Powerful: GPU directives allow complete access to the massive parallel power of a GPU.

Standard Fortran

subroutine spmv_cpu(rowstart, col, val, invec, outvec, n)
    INTEGER, dimension(:), intent(in) :: rowstart, col
    REAL, dimension(:), intent(in) :: val, invec
    REAL, dimension(:), intent(out) :: outvec
    INTEGER, intent(in) :: n
    REAL :: rowsum
    INTEGER :: i, index

    do i = 1, n
        rowsum = 0.
        do index = rowstart(i), rowstart(i+1)-1
            rowsum = rowsum + val(index)*invec(col(index))
        end do
        outvec(i) = rowsum
    end do
end subroutine spmv_cpu

...
call spmv_cpu(rowstart, col, val, invec, outvec, 256000)

OpenACC

subroutine spmv_acc(rowstart, col, val, invec, outvec, n)
    INTEGER, dimension(:), intent(in) :: rowstart, col
    REAL, dimension(:), intent(in) :: val, invec
    REAL, dimension(:), intent(out) :: outvec
    INTEGER, intent(in) :: n
    REAL :: rowsum
    INTEGER :: i, index

    !$acc kernels
    do i = 1, n
        rowsum = 0.
        do index = rowstart(i), rowstart(i+1)-1
            rowsum = rowsum + val(index)*invec(col(index))
        end do
        outvec(i) = rowsum
    end do
    !$acc end kernels
end subroutine spmv_acc

...
call spmv_acc(rowstart, col, val, invec, outvec, 256000)

Three Ways to Accelerate Applications
Libraries: drop-in acceleration
OpenACC Directives: easily accelerate applications
Language Extensions: maximum flexibility

GPU Language Extensions: CUDA
CUDA is available through C/C++, Fortran, Python, MATLAB, and more.
CUDA Fortran: based on industry-standard Fortran; a small set of extensions to enable heterogeneous programming; straightforward APIs to manage devices, memory, etc.

Standard Fortran

subroutine spmv_cpu(rowstart, col, val, invec, outvec, n)
    INTEGER, dimension(:), intent(in) :: rowstart, col
    REAL, dimension(:), intent(in) :: val, invec
    REAL, dimension(:), intent(out) :: outvec
    INTEGER, intent(in) :: n
    REAL :: rowsum
    INTEGER :: i, index

    do i = 1, n
        rowsum = 0.
        do index = rowstart(i), rowstart(i+1)-1
            rowsum = rowsum + val(index)*invec(col(index))
        end do
        outvec(i) = rowsum
    end do
end subroutine spmv_cpu

...
call spmv_cpu(rowstart, col, val, invec, outvec, 256000)

CUDA Fortran

attributes(global) subroutine spmv_cuda(rowstart, col, val, invec, outvec, n)
    INTEGER, dimension(:), intent(in) :: rowstart, col
    REAL, dimension(:), intent(in) :: val, invec
    REAL, dimension(:), intent(out) :: outvec
    INTEGER, value, intent(in) :: n
    REAL :: rowsum
    INTEGER :: i, index

    ! One thread per row of the matrix
    i = (blockidx%x - 1) * blockdim%x + threadidx%x
    if (i <= n) then
        rowsum = 0.
        do index = rowstart(i), rowstart(i+1)-1
            rowsum = rowsum + val(index)*invec(col(index))
        end do
        outvec(i) = rowsum
    end if
end subroutine spmv_cuda

...
! 1000 blocks of 256 threads cover the 256,000 rows
call spmv_cuda<<< 1000, 256 >>>(rowstart, col, val, invec, outvec, 256000)

Three Ways to Accelerate Applications
Libraries: drop-in acceleration
OpenACC Directives: easily accelerate applications
Language Extensions: maximum flexibility