GPU Hardware and Programming Models Jeremy Appleyard, September 2015
In this talk: A Brief History of GPUs, Hardware Overview, Programming Models. Ask questions at any point! 2
A Brief History of GPUs 3
Once upon a time (1997)... GPU: Graphics Processing Unit. Originated as specialized hardware for 3D games. Why a separate processor? Rendering is the most computationally intense part of a game, and the CPU is not an ideal device for graphics rendering. [Images: Quake software rendering vs. Quake hardware rendering] A freed CPU allows more complex AI, dynamic world generation, and realistic dynamics. 4
Evolution of GPUs (transistor counts): RIVA 128, 3M (1995); GeForce 256, 23M (2000); GeForce 3, 60M (2001); GeForce FX, 250M (2003); GeForce 8800, 681M (2006); Kepler, 7B (2012). Capability progression: fixed function, then programmable shaders, then general-programmable. 5
NVIDIA Kepler. Tesla K80: 2.91 TFLOP/s double precision, 8.74 TFLOP/s single precision, 480 GB/s memory bandwidth, 4,992 functional units (cores), 24 GB DRAM. About 2x faster than the #1 system on the Top500 in 1997. [Image: NVIDIA GK110, Kepler] 6
Tesla K80: 10x Faster on Scientific Apps. [Chart: K80 speedup over CPU, up to ~15x, across molecular dynamics, quantum chemistry, and physics benchmarks] CPU: 12 cores, E5-2697v2 @ 2.70GHz, 64GB system memory, CentOS 6.2. GPU: single Tesla K80, Boost enabled. 7
TITAN: World's Fastest Open Science Supercomputer. 18,688 Tesla K20X GPUs. 27 petaflops peak, 17.6 petaflops on Linpack. 90% of performance from GPUs. Top500: ranked 2nd, June 2015. 8
Hardware Overview 9
Accelerated Computing: CPU, optimized for serial tasks; GPU accelerator, optimized for parallel tasks. 10
Low Latency or High Throughput? CPU: optimized for low-latency access to cached data sets; control logic for out-of-order and speculative execution. GPU: optimized for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation. 11
Low Latency or High Throughput? Design leads to performance: the CPU architecture must minimize latency within each thread, while the GPU architecture hides latency with computation (data parallelism, 10,000+ threads!). [Diagram: a GPU high-throughput processor interleaves many threads T1..Tn, overlapping processing with waiting for data; a CPU low-latency core runs threads T1..T4, minimizing time to completion of each] 12
Work Pattern: GPU as a coprocessor. [Diagram: application code split, with compute-intensive functions offloaded to the GPU and the rest of the sequential code remaining on the CPU] 13
Simple Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Execute GPU program. Results stored in GPU memory. 3. Copy results from GPU memory to CPU memory 17
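The three steps above map directly onto the CUDA runtime API. Below is a minimal sketch, assuming a hypothetical `scale` kernel as a stand-in for the GPU program (error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: scales each element in place on the GPU.
__global__ void scale(float *d_data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= s;
}

void run(float *h_data, int n)
{
    float *d_data;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_data, bytes);

    // 1. Copy input data from CPU memory to GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // 2. Execute GPU program; results stay in GPU memory
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    // 3. Copy results from GPU memory back to CPU memory
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
}
```

Note that steps 1 and 3 cross the PCI bus, which is far slower than GPU memory, so real applications try to keep data resident on the GPU between kernel launches.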
System Diagram: single GPU, connected to the CPU over the PCI bus. 18
System Diagram: many GPUs, each connected over a PCI bus. 19
Programming Models 20
Three Ways to Accelerate Applications: Libraries (drop-in acceleration), OpenACC directives (easily accelerate applications), language extensions (maximum flexibility). 21
Sparse Matrix-Vector Multiply: y = Ax, with A stored in CSR format. Used in many applications: fluid dynamics, circuit simulation, structural mechanics. 22
Libraries: cuSPARSE. cusparse<t>csrmv() performs a matrix-vector multiply using a matrix in CSR format. A maintained library is bug-free, high performance, and performance portable. 24
OpenACC The Standard for Massively Parallel Directives Simple: Directives are the easy path to accelerate compute intensive applications Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors Powerful: GPU Directives allow complete access to the massive parallel power of a GPU 26
Standard Fortran

subroutine spmv_cpu(rowstart, col, val, invec, outvec, n)
  INTEGER, dimension(:), intent(in) :: rowstart, col
  REAL, dimension(:), intent(in) :: val, invec
  REAL, dimension(:), intent(out) :: outvec
  INTEGER, intent(in) :: n
  REAL :: rowsum
  INTEGER :: i, index

  do i=1,n
    rowsum = 0.
    do index=rowstart(i),rowstart(i+1)-1
      rowsum = rowsum + val(index)*invec(col(index))
    end do
    outvec(i) = rowsum
  end do
end subroutine spmv_cpu

...
call spmv_cpu(rowstart, col, val, invec, outvec, 256000)

27
OpenACC

subroutine spmv_acc(rowstart, col, val, invec, outvec, n)
  INTEGER, dimension(:), intent(in) :: rowstart, col
  REAL, dimension(:), intent(in) :: val, invec
  REAL, dimension(:), intent(out) :: outvec
  INTEGER, intent(in) :: n
  REAL :: rowsum
  INTEGER :: i, index

!$acc kernels
  do i=1,n
    rowsum = 0.
    do index=rowstart(i),rowstart(i+1)-1
      rowsum = rowsum + val(index)*invec(col(index))
    end do
    outvec(i) = rowsum
  end do
!$acc end kernels
end subroutine spmv_acc

...
call spmv_acc(rowstart, col, val, invec, outvec, 256000)

28
GPU Language Extensions: CUDA. CUDA is available through C/C++, Fortran, Python, and Matlab. CUDA Fortran: based on industry-standard Fortran; a small set of extensions to enable heterogeneous programming; straightforward APIs to manage devices, memory, etc. 30
CUDA Fortran

attributes(global) subroutine spmv_cuda(rowstart, col, val, invec, outvec, n)
  INTEGER, dimension(:), intent(in) :: rowstart, col
  REAL, dimension(:), intent(in) :: val, invec
  REAL, dimension(:), intent(out) :: outvec
  INTEGER, value, intent(in) :: n
  REAL :: rowsum
  INTEGER :: i, index

  i = (blockidx%x - 1) * blockdim%x + threadidx%x
  if (i <= n) then
    rowsum = 0.
    do index=rowstart(i),rowstart(i+1)-1
      rowsum = rowsum + val(index)*invec(col(index))
    end do
    outvec(i) = rowsum
  end if
end subroutine spmv_cuda

...
call spmv_cuda<<< 1000, 256 >>>(rowstart, col, val, invec, outvec, 256000)

32