GPU Hardware and Programming Models. Jeremy Appleyard, September 2015


1 GPU Hardware and Programming Models Jeremy Appleyard, September 2015

2 In this talk: A brief history of GPUs; Hardware overview; Programming models. Ask questions at any point!

3 A Brief History of GPUs 3

4 Once upon a time (1997)... GPU: Graphics Processing Unit. GPUs originated as specialized hardware for 3D games. Why a different processor? Rendering is the most computationally intense part of a game, and the CPU is not an ideal device for computer graphics rendering. (Figures: Quake software rendering; Quake hardware rendering.) Freeing the CPU allows more complex AI, dynamic world generation, and realistic dynamics.

5 Evolution of GPUs (timeline figure): RIVA 128, 3M xtors; GeForce M xtors; GeForce 3, 60M xtors; GeForce FX, 250M xtors; GeForce M xtors; Kepler, 7B xtors. Eras: fixed function; programmable shaders; general-programmable.

6 NVIDIA Kepler. NVIDIA Kepler K: TFLOP/s double precision; 8.74 TFLOP/s single precision; 480 GB/s memory bandwidth; 4,992 functional units (cores); 24 GB DRAM. About 2x faster than #1 on Top500 in 1997. (Figure: NVIDIA GK110 - Kepler.)

7 Tesla K80: 10x Faster on Scientific Apps. (Bar chart: K80 speedup relative to CPU, axis from 0x to 15x; benchmarks grouped as Molecular Dynamics, Quantum Chemistry, Physics.) CPU: 12 cores, 2.70GHz, 64GB system memory, CentOS 6.2. GPU: single Tesla K80, Boost enabled.

8 TITAN: World's Fastest Open Science Supercomputer. 18,688 Tesla K20X GPUs; 27 Petaflops peak, 17.6 Petaflops on Linpack; 90% of performance from GPUs; Top500 ranked 2nd, June

9 Hardware Overview 9

10 Accelerated Computing. CPU: optimized for serial tasks. GPU accelerator: optimized for parallel tasks.

11 Low Latency or High Throughput? CPU: optimized for low-latency access to cached data sets; control logic for out-of-order and speculative execution. GPU: optimized for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation.

12 Low Latency or High Throughput: design leads to performance. CPU architecture must minimize latency within each thread. GPU architecture hides latency with computation (data-parallelism, 10k+ threads!). (Diagram: GPU, a high-throughput processor, with computation threads T1, T2, T3, T4, ..., Tn; CPU core, a low-latency processor, with threads T1 to T4. Legend: processing; waiting for data; ready to be processed.)

13 Work Pattern: GPU as a coprocessor. The application code is split: compute-intensive functions run on the GPU, and the rest of the sequential code runs on the CPU.

14 Simple Processing Flow PCI Bus 14

15 Simple Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 15

16 Simple Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Execute GPU program. Results stored in GPU memory. 16

17 Simple Processing Flow PCI Bus 1. Copy input data from CPU memory to GPU memory 2. Execute GPU program. Results stored in GPU memory. 3. Copy results from GPU memory to CPU memory 17
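The three steps above can be sketched with OpenACC data clauses, which the talk introduces later. This is a minimal illustration, not code from the slides: the module name flow_demo and the subroutine scale_vector are hypothetical, and the vector scaling is just a stand-in for a real GPU program. Compiled without OpenACC support, the !$acc lines are ordinary comments and the loop runs on the CPU.

```fortran
module flow_demo
  implicit none
contains
  ! Hypothetical example: y = factor * x, with the three steps of the
  ! processing flow made explicit through OpenACC data clauses.
  subroutine scale_vector(x, y, factor)
    real, intent(in)  :: x(:)
    real, intent(out) :: y(:)
    real, intent(in)  :: factor
    integer :: i

    ! Step 1: copyin(x) copies the input from CPU memory to GPU memory.
    ! Step 3: copyout(y) copies the results back to CPU memory when the
    !         data region ends.
    !$acc data copyin(x) copyout(y)

    ! Step 2: execute the loop on the GPU; results stay in GPU memory
    !         until the data region ends.
    !$acc parallel loop
    do i = 1, size(x)
      y(i) = factor * x(i)
    end do

    !$acc end data
  end subroutine scale_vector
end module flow_demo
```

Calling scale_vector with x = (1, 2, 3) and factor = 2 fills y with (2, 4, 6), whether or not a GPU is present.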

18 System Diagram Single GPU PCI Bus 18

19 System Diagram Many GPUs PCI Bus PCI Bus 19

20 Programming Models 20

21 Three Ways to Accelerate Applications: Libraries (drop-in acceleration); OpenACC directives (easily accelerate applications); Language extensions (maximum flexibility).

22 Sparse Matrix-Vector Multiply: y = Ax, with A stored in CSR format. Used in many applications: fluid dynamics, circuit simulation, structural mechanics.
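To make the CSR (compressed sparse row) layout concrete, here is a small sketch that builds the three CSR arrays from a dense matrix. The array names rowstart, col and val follow the Fortran slides later in the talk; the module csr_demo and the helper dense_to_csr are hypothetical and only serve as illustration. Indices are 1-based, as in the talk's code.

```fortran
module csr_demo
  implicit none
contains
  ! Convert a dense matrix to CSR storage (1-based indices).
  ! rowstart has n+1 entries: the nonzeros of row i occupy positions
  ! rowstart(i) .. rowstart(i+1)-1 of col (column index) and val (value).
  subroutine dense_to_csr(a, rowstart, col, val)
    real, intent(in) :: a(:,:)
    integer, allocatable, intent(out) :: rowstart(:), col(:)
    real, allocatable, intent(out) :: val(:)
    integer :: i, j, k, n, nnz

    n = size(a, 1)
    nnz = count(a /= 0.0)          ! number of stored entries
    allocate(rowstart(n+1), col(nnz), val(nnz))

    k = 1
    do i = 1, n
      rowstart(i) = k              ! where row i begins in col/val
      do j = 1, size(a, 2)
        if (a(i,j) /= 0.0) then
          col(k) = j
          val(k) = a(i,j)
          k = k + 1
        end if
      end do
    end do
    rowstart(n+1) = k              ! one past the last nonzero
  end subroutine dense_to_csr
end module csr_demo
```

For the 3x3 matrix with rows (10, 0, 0), (0, 20, 30) and (40, 0, 50), this gives rowstart = (1, 2, 4, 6), col = (1, 2, 3, 1, 3) and val = (10, 20, 30, 40, 50). Only the five nonzeros are stored, and the inner loop of the SpMV routine visits exactly those entries.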


24 Libraries: cuSPARSE. cusparse<t>csrmv() performs a matrix-vector multiply using a matrix in CSR format. Maintained library: bug free, high performance, performance portable.


26 OpenACC: The Standard for Massively Parallel Directives. Simple: directives are the easy path to accelerate compute-intensive applications. Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors. Powerful: GPU directives allow complete access to the massive parallel power of a GPU.

27 Standard Fortran

subroutine spmv_cpu(rowstart, col, val, invec, outvec, n)
  INTEGER, dimension(:), intent(in) :: rowstart, col
  REAL, dimension(:), intent(in) :: val, invec
  REAL, dimension(:), intent(out) :: outvec
  INTEGER, intent(in) :: n
  REAL :: rowsum
  INTEGER :: i, index

  do i=1,n
    rowsum = 0.
    do index=rowstart(i),rowstart(i+1)-1
      rowsum = rowsum + val(index)*invec(col(index))
    end do
    outvec(i) = rowsum
  end do
end subroutine spmv_cpu

...

call spmv_cpu(rowstart, col, val, invec, outvec, )

28 OpenACC

subroutine spmv_acc(rowstart, col, val, invec, outvec, n)
  INTEGER, dimension(:), intent(in) :: rowstart, col
  REAL, dimension(:), intent(in) :: val, invec
  REAL, dimension(:), intent(out) :: outvec
  INTEGER, intent(in) :: n
  REAL :: rowsum
  INTEGER :: i, index

  !$acc kernels
  do i=1,n
    rowsum = 0.
    do index=rowstart(i),rowstart(i+1)-1
      rowsum = rowsum + val(index)*invec(col(index))
    end do
    outvec(i) = rowsum
  end do
  !$acc end kernels
end subroutine spmv_acc

...

call spmv_acc(rowstart, col, val, invec, outvec, )


30 GPU Language Extensions: CUDA. CUDA is available through C/C++, Fortran, Python, Matlab. CUDA Fortran: based on industry-standard Fortran; small set of extensions to enable heterogeneous programming; straightforward APIs to manage devices, memory, etc.


32 CUDA Fortran

attributes(global) subroutine spmv_cuda(rowstart, col, val, invec, outvec, n)
  INTEGER, dimension(:), intent(in) :: rowstart, col
  REAL, dimension(:), intent(in) :: val, invec
  REAL, dimension(:), intent(out) :: outvec
  INTEGER, value, intent(in) :: n
  REAL :: rowsum
  INTEGER :: i, index

  i = (blockidx%x - 1) * blockdim%x + threadidx%x
  if (i <= n) then
    rowsum = 0.
    do index=rowstart(i),rowstart(i+1)-1
      rowsum = rowsum + val(index)*invec(col(index))
    end do
    outvec(i) = rowsum
  endif
end subroutine spmv_cuda

...

call spmv_cuda<<< 1000,256 >>>(rowstart, col, val, invec, outvec, )
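The index arithmetic in the kernel, and the relation between the row count and the launch configuration, can be checked on the CPU with plain Fortran. The sketch below is only an illustration: the module launch_demo and the functions global_index and blocks_needed are hypothetical names, not part of CUDA Fortran.

```fortran
module launch_demo
  implicit none
contains
  ! Global 1-based index of a thread, mirroring the kernel's
  ! (blockidx%x - 1) * blockdim%x + threadidx%x.
  pure integer function global_index(block_id, thread_id, block_dim) result(i)
    integer, intent(in) :: block_id, thread_id, block_dim
    i = (block_id - 1) * block_dim + thread_id
  end function global_index

  ! Smallest number of blocks of block_dim threads that covers n rows
  ! (integer ceiling of n / block_dim).
  pure integer function blocks_needed(n, block_dim) result(nblocks)
    integer, intent(in) :: n, block_dim
    nblocks = (n + block_dim - 1) / block_dim
  end function blocks_needed
end module launch_demo
```

With the launch configuration on the slide, 1000 blocks of 256 threads, the first thread gets index 1 and the last gets index 256,000, so the kernel covers any n up to 256,000. A matrix with 256,001 rows would need 1001 blocks. When n is not a multiple of the block size, the surplus threads in the last block are the ones the kernel's "if (i <= n)" test switches off.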

