GPU Computing Architectures


1 GPU Computing Architectures
10th Summer School in Statistics for Astronomers
Pierre-Yves Taunay
Research Computing and Cyberinfrastructure
224A Computer Building
The Pennsylvania State University
University Park
June

2 Introduction

3 Objectives
1. (Re)discover GPUs
2. Reasons for GPU computing
3. Review GPU architectures
4. Example(s)

4 Reminders
- Thread: sequence of instructions to be executed on a core
- SIMD: Single Instruction, Multiple Data

5 GPU
- GPU: Graphics Processing Unit
- Dedicated to graphics
- Highly parallel architecture
- Better at that than CPUs

6 GPGPU what
- GPGPU: General-Purpose computing on GPUs
- Took off with the introduction of CUDA in 2006
- CUDA: Compute Unified Device Architecture
  - Hardware and software model for NVIDIA GPUs
- Alternative: OpenCL

7 GPGPU where
Everywhere!
- Finance
- Computational Engineering
- Numerical Methods
- Defense
- Computational Chemistry
- Astrophysics
- ...

8 GPGPU why
- Previous session: expensive machines to solve larger problems faster
- GPUs: do that at a fraction of the cost!

Hardware                                                        Flops (DP)     Power (W)   Price (k$)
2 Ivy Bridge EX (2 × 15 cores, 2.8 GHz; TFlops DP ops/cycle)
K40 GPU                                                         1.43 TFlops
GTX Titan Black                                                 1.7 TFlops
Table: K40 GPU vs. GTX Titan Black vs. dual-socket server with Ivy Bridge EX

- Can use a gamer's card (e.g. GTX) to do calculations
- Titan Black: $1k

9 GPGPU why
Great! Let's ditch the CPU, then. Not so fast!
- CPUs are great at serial work
- Still needed for other operations
- Share the load between CPU and GPU
- Amdahl's law

10 GPGPU how
- Different approaches throughout the years
- Used to be C only
- Now: C, C++, Python, Fortran, Haskell, IDL, Java, Julia, Lua, Mathematica, MATLAB, .NET, Perl, Ruby, R

11 Upcoming
- CPU vs. GPU
- GPU computing architecture
- Execution model
- GPU memory architecture
- Example

12 CPU vs GPU

13 CPU, GPU
CPU (host)
- Multiple cores, e.g. 15 per CPU; quad-socket: 60 cores
- Runs 1 thread per core
- Heavy threads
GPU (device)
- NVIDIA card: 32 threads minimum
- 32 threads = 1 warp
- 2048 threads run actively on a streaming multiprocessor (SMX)
- 15 SMX on a card
- 30k+ concurrent threads
- Lightweight threads

14 GPU Integration
Figure: Schematic of a compute node with GPUs

15 GPU Integration
A word on memory spaces
- CPU and GPU: distinct memories
- Remark: CUDA 6 Unified Memory

16 Summary
- Many more lightweight threads on GPU
- GPU is a PCIe card: mind the transfer rates!
- GPU and CPU do not share the same memory

17 GPU Architecture GK

18 GK110 at large

19 GK110 SMX 1/4

20 GK110 SMX 2/4
- 4 warp schedulers
- Bunch of execution units:
  - 192 CUDA cores
  - 64 double-precision (DP) units
  - 32 load/store (LD/ST) units
  - 32 Special Function Units (SFU)
- L1 cache / shared memory
- Texture memory
- Registers for threads

21 GK110 SMX 3/4
Warps
- 32 threads
- Scheduled through warp schedulers
- All threads of a warp execute the exact same instruction: SIMD
SMX
- Schedulers select four warps
- Issue one instruction from each warp to a group of cores / LD/ST units / SFUs
- Instructions can be dual-issued, including DP

22 GK110 SMX 4/4
Remark: the scheduling order cannot be predicted

23 Summary
- GPU has multiple SMX that execute thread instructions
- Scheduling through warp schedulers

24 Execution model

25 Programmer's POV
- GPU function: kernel
- CUDA threads are organized in blocks
- Blocks are organized in grids

26 Physical organization
Actual architecture


27 Executing GPU program
Asynchronous behavior
- CPU initializes the device
- CPU queues GPU kernels
- Control returns to the CPU after queuing: asynchronous

[Some_program]
    cpu_func1();
    gpu_kernel1<<< >>>();
    cpu_func2();
    cpu_func3();
    gpu_kernel2<<< >>>();
    cpu_func4();
    cudaDeviceSynchronize();

28 Summary
- Programmer's POV: kernel, grid, blocks, threads
- GPU execution is asynchronous with respect to the CPU

29 GPU Architecture Memory

30 GPU DRAM
- Limited: 5 GB
- K40: 12 GB

31 The logical organization

32 The logical organization
Memory      Size                  Scope    R/W    Latency (cycles)   BW (GB/s)
Global      5 GB                  Grid     R/W
Constant    64 KB                 Grid     R                         N/A
Texture     N/A                   Grid     R                         N/A
Shared      16/32/48 KB per SM    Block    R/W    2-4                2,260
Local       512 KB per thread     Thread   R/W                       N/A
Registers   255 per thread        Thread   R/W    1                  N/A
Table: Memory performance of the Tesla K20

33 The physical layout

34 The physical layout
GPU         L1 cache size (KB)   L2 cache size (KB)   SMEM size (KB)   Max. resident threads
Tesla K20   48/32/16             1,536                16/32/48         2,048
Table: Physical characteristics for GK

35 Summary
- GPU memory is limited
- Different memories and caches have different performance
- Optimization points

36 Example
Likelihood calculation

37 Using GPUs I
Native CUDA
    #include <stdlib.h>
    #include <omp.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    // parse_command_line, read_data and log_likelihood are defined elsewhere (not shown)
    int main(int argc, char *argv[]) {
        int nobs, sizex, nsample = 0;
        char *location = NULL;
        int ret = 0;

        // Parse the command line
        ret = parse_command_line(argc, argv, &nobs, &sizex,
                                 &nsample, &location);

        // Parse the data on CPU
        double *X = (double *) malloc(nobs * sizex * sizeof(double));
        double *isigma = (double *) malloc(sizex * sizex * sizeof(double));
        double *mu = (double *) malloc(sizex * sizeof(double));
        double det_sigma = 0.0;

        ret = read_data(X, isigma, &det_sigma, mu,
                        &nobs, &sizex, location);

        // Timing variables
        double tic, toc, tot_time = 0.0;

38 Using GPUs II
Native CUDA
        // Result
        double res = 0.0;

        // Allocate GPU memory
        double *d_lv, *d_tmp, *d_ones;
        cudaMalloc((void **)&d_lv, nobs * sizex * sizeof(double));
        cudaMalloc((void **)&d_tmp, nobs * sizex * sizeof(double));
        cudaMalloc((void **)&d_ones, nobs * sizeof(double));

        double *d_x, *d_isigma, *d_mu;
        cudaMalloc((void **)&d_x, nobs * sizex * sizeof(double));
        cudaMalloc((void **)&d_isigma, sizex * sizex * sizeof(double));
        cudaMalloc((void **)&d_mu, sizex * sizeof(double));

        // Copy the data read onto the GPU
        cudaMemcpy(d_x, X, nobs * sizex * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(d_isigma, isigma, sizex * sizex * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(d_mu, mu, sizex * sizeof(double), cudaMemcpyHostToDevice);

39 Using GPUs III
Native CUDA
        // Create a handle for cuBLAS
        cublasHandle_t handle;
        cublasStatus_t stat;
        stat = cublasCreate(&handle);
        cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);

        tic = omp_get_wtime();
        res = 0.0;

        // Main driver
        log_likelihood(d_x, d_isigma, d_mu, det_sigma, nobs, sizex, &res,
                       d_lv, d_tmp, d_ones, &handle);

        toc = omp_get_wtime();
        tot_time += toc - tic;

        // Release GPU memory
        cudaFree(d_mu);
        cudaFree(d_x);
        cudaFree(d_isigma);
        cudaFree(d_lv);
        cudaFree(d_tmp);
        cudaFree(d_ones);

40 Using GPUs IV
Native CUDA
        // Release CPU memory
        free(X);
        free(isigma);
        free(mu);
        free(location);

        return EXIT_SUCCESS;
    }

41 Timing results
CPU vs GPU
NP    Serial    OpenMP    MPI    GPU native
Table: Runtime of the log-likelihood example for various numbers of processors (NP) and dataset sizes.

42 Conclusion

43 Conclusion
- GPUs are great at parallel tasks
  - Large number of lightweight threads
  - Inherently parallel architecture with SMX and warp schedulers
- Programmer's POV
  - Kernels, grids, blocks, threads
  - Asynchronous execution (mostly)
  - Can't access CPU memory
- Limited memory: large optimization target
- Multiple languages for GPU programming

44 Want to learn more about GPU programming?
Online resources
- CUDA:
- OpenCL:
MOOC
- Coursera: Intro to heterogeneous computing (Wen-mei Hwu)
- Udacity: Intro to parallel programming (NVIDIA)
Books
- Programming Massively Parallel Processors, David B. Kirk, Wen-mei W. Hwu
- CUDA By Example, Jason Sanders, Edward Kandrot
- Numerical Computations with GPUs, Volodymyr Kindratenko

45 Questions?
