General & Special-purpose architecture. General-purpose GPU. GPGPU Programming models. GPGPU Memory models. Next generation

2 General & Special-purpose architecture; General-purpose GPU; GPGPU Programming models; GPGPU Memory models; Next generation
26/02/2009, Cristian Dittamo, Dept. of Computer Science, University of Pisa

3 Von Neumann computational model
  - Super-scalar processors, multi-cores
  - Programming languages
  - Virtual Machines

4 The need, in computer games, for efficient specialized processing of 3D meshes silently introduced a non-von Neumann computational model.
  GPU
  - Non-von Neumann computational model
  - Single Instruction Multiple Data
  - Fixed-function pipeline

5 What are GPUs good at?
  - Large data sets
  - Arithmetic intensity = high compute/IO ratio
  - Minimal control flow or recursion
  - High locality

                                              | CPU                                 | GPU
  Optimization                                | High performance on sequential code | Arithmetic intensity
  Die area for computation (% of transistors) | 20%                                 | 80%
  Memory                                      | Low latency (1/10 of GPU)           | High bandwidth (10 times CPU)
  Cache                                       | Big (10 times GPU)                  | Small
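
Arithmetic intensity can be made concrete as floating-point operations per byte of memory traffic. The following is a minimal C++ sketch, assuming a SAXPY-style loop (y[i] = a * x[i] + y[i]) as the workload; the workload and both function names are illustrative assumptions, not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>

// Arithmetic intensity = floating-point operations per byte moved.
// Hypothetical helper, for illustration only.
double arithmetic_intensity(double flops, double bytes_moved) {
    assert(bytes_moved > 0.0);
    return flops / bytes_moved;
}

// SAXPY (y[i] = a * x[i] + y[i]) over n floats (n must be greater than 0):
//   2 flops per element (one multiply, one add)
//   3 memory accesses per element (read x, read y, write y)
double saxpy_intensity(std::size_t n) {
    const double flops = 2.0 * static_cast<double>(n);
    const double bytes = 3.0 * sizeof(float) * static_cast<double>(n);
    return arithmetic_intensity(flops, bytes);
}
```

With 4-byte floats the ratio is 2/12 flops per byte regardless of n, so a loop like this is limited by memory bandwidth rather than by the number of ALUs; kernels that do more arithmetic per loaded value are the ones that match the GPU column of the table.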

6 Why are GPUs getting faster?
  1. Arithmetic intensity: the specialized nature of GPUs makes it easier to use additional transistors for computation.
  2. Economics: the multi-billion dollar video game market drives innovation.
  3. Intense competition: a fast-moving industry with no clear winner. NVIDIA, AMD, Intel and Sony are the major players.

7 General & Special-purpose architecture; General-purpose GPU; Programming models; Memory models; Next generation

9 Hardware comparison table (only the labels survive in this transcription)
  - Columns: NVIDIA GeForce GTX 280, AMD-ATI Firestream 9250, Intel Core 2 Extreme, AMD Phenom X4, Intel Pentium D
  - Rows: Chip (G92, RV770, QX), Fabrication process (nm), # transistors (million), Max GPU power (W), Core clock (MHz), Memory clock (MHz), Peak memory bandwidth (GB/s)

11 Fixed function: configurable, but not programmable
   Programmable shading: fixed pipeline
   Programmable graphics: customizable pipeline

14 Stream Computing
  - SIMD programming model
  - Multiple processing units
  - Non-determinism in how the data in streams gets processed by the cores breaks the sequential von Neumann architecture schema.
  - The computation of each core is driven by a program, the kernel.
  - The GPU infrastructure is responsible for assigning cores to kernels; each running instance of a kernel is called a thread.
  - Each thread has an associated set of output locations in GPU memory, referred to as the domain of execution.
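
The relation between kernel, thread and domain of execution can be sketched on the CPU. This is a host-only C++ emulation under stated assumptions: the domain is a 2D region, the `Kernel` alias and `run_over_domain` are hypothetical names, and the instances run sequentially here although a GPU would run them in an unspecified order.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// A kernel instance (a "thread") receives its output location (x, y) and
// returns the value to store there. The alias name is hypothetical.
using Kernel = std::function<float(std::size_t, std::size_t)>;

// Runs one kernel instance per location of a width x height domain of
// execution. Each instance writes only its own location, so the result does
// not depend on the order in which the instances are scheduled.
std::vector<float> run_over_domain(std::size_t width, std::size_t height,
                                   const Kernel& kernel) {
    std::vector<float> out(width * height);
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            out[y * width + x] = kernel(x, y);
    return out;
}
```

Because no instance reads what another instance writes, the two loops could be reordered or split across cores without changing the output, which is the property the stream model relies on.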

15 The GPU schedules the array of threads onto the pool of physical processors.
  - It is possible to schedule different kernels on a GPU at once.
  - Each physical processor in the GPU executes a group of threads together, called a wavefront or warp.
  - The number of wavefronts in execution at the same time depends on the active register usage of the kernel.
  - If the register demand cannot be met, the hardware spills thread data into memory, with a significant impact on performance.
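
The dependence of resident wavefronts on register usage can be illustrated with a small calculation. This is a simplified sketch: the function name, the hardware cap and every number passed to it are assumptions chosen for easy arithmetic, not vendor specifications, and real hardware applies further limits (for example shared memory).

```cpp
#include <algorithm>
#include <cassert>

// How many wavefronts (warps) can be resident on one SIMD core, given the
// register demand of a kernel. All parameters are illustrative assumptions.
int resident_wavefronts(int registers_per_core,    // size of the register file
                        int registers_per_thread,  // demand of the kernel
                        int threads_per_wavefront,
                        int hardware_limit) {      // cap imposed by the scheduler
    assert(registers_per_thread > 0 && threads_per_wavefront > 0);
    const int per_wavefront = registers_per_thread * threads_per_wavefront;
    return std::min(hardware_limit, registers_per_core / per_wavefront);
}
```

With an assumed file of 16384 registers and 32 threads per wavefront, doubling the per-thread demand from 16 to 32 registers halves the resident wavefronts from 32 to 16, leaving the scheduler fewer wavefronts with which to hide memory latency.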

16 The effectiveness of a SIMD pipeline rests on the assumption that all threads running the same shader program exhibit identical control-flow behavior.
  - An input-dependent branch leads to different control-flow paths for different threads in a warp.
  - Branch divergence: a hazard that occurs because a SIMD pipeline cannot execute different instructions in the same cycle.
  Conditionals/branches via predication
  With branch predication, none of the instructions whose execution depends on the controlling condition gets skipped:
  1. Each instruction is associated with a per-thread condition code, or predicate, that is set to true or false based on the controlling condition.
  2. Each of these instructions gets scheduled for execution.
  3. Only the instructions with a true predicate are actually executed. Instructions with a false predicate do not write results, and also do not evaluate addresses or read operands.
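
The three predication steps can be emulated on the CPU. The sketch below is plain C++, not GPU code: the SIMD width of 8 lanes, the branch `if (x < 0) y = -x; else y = 2 * x;` and all names are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 8;  // assumed SIMD width, for illustration
using Lanes = std::array<int, kLanes>;

// Emulates `if (x < 0) y = -x; else y = 2 * x;` on one SIMD group.
// Both sides of the branch are issued for every lane; the per-lane
// predicate decides which lanes commit a result.
Lanes predicated_branch(const Lanes& x) {
    Lanes y{};
    std::array<bool, kLanes> pred{};

    // 1. Set the per-lane predicate from the controlling condition.
    for (std::size_t i = 0; i < kLanes; ++i) pred[i] = x[i] < 0;

    // 2. Issue the "then" instruction; only lanes with a true predicate write.
    for (std::size_t i = 0; i < kLanes; ++i)
        if (pred[i]) y[i] = -x[i];

    // 3. Issue the "else" instruction under the negated predicate.
    for (std::size_t i = 0; i < kLanes; ++i)
        if (!pred[i]) y[i] = 2 * x[i];

    return y;
}
```

Every lane pays for both instruction streams even though it commits only one of them, which is why divergent branches cost roughly the sum of both paths on a SIMD pipeline.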

18 Command Processor: reads and initiates the commands sent by the host CPU to the GPU for execution.
   Stream processor: a set of SIMD pipelines, each independent of the others, which operate in parallel on data streams. SIMD pipelines can process data or transfer data to and from memory.
   Memory controller: has direct access to the local memory and to the host-specified areas of system memory; it also performs the functions of a DMA controller.
   Caches: optimize code and data storage in the memory hierarchy.

19 MIMD array of 10 SIMT units (TPCs)
  - TPC, a 3-way collection of: 3 shader processors (SMs), 1 TEX unit
  - SM comprises: 8 scalar ALUs, the Stream Processors (SPs) [FP32]; 1 64-bit ALU [FP64]
  - Total of 240 scalar processors

20 MIMD array of 10 SIMD cores
  - SIMD core contains: 16 Shader Processor Units (SPUs), 1 Texture Unit
  - SPU: 5-ALU-wide superscalar; 4 regular [FP32], 1 fat [FP64]; processes 4 threads in 4 cycles
  - Total of 800 ALUs
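
The totals quoted on slides 19 and 20 follow directly from the unit counts. A compile-time C++ check, in which the constant names are hypothetical and the figures are the ones given on the two slides:

```cpp
// Slide 19: 10 TPCs x 3 SMs per TPC x 8 scalar SPs per SM.
constexpr int kTpcs = 10, kSmsPerTpc = 3, kSpsPerSm = 8;
constexpr int kScalarProcessors = kTpcs * kSmsPerTpc * kSpsPerSm;

// Slide 20: 10 SIMD cores x 16 SPUs per core x 5 ALUs per SPU.
constexpr int kSimdCores = 10, kSpusPerCore = 16, kAlusPerSpu = 5;
constexpr int kAlus = kSimdCores * kSpusPerCore * kAlusPerSpu;

static_assert(kScalarProcessors == 240, "matches the total on slide 19");
static_assert(kAlus == 800, "matches the total on slide 20");
```

The two totals are not directly comparable: the 240 units on slide 19 are scalar processors, while the 800 ALUs on slide 20 are grouped five at a time into superscalar SPUs.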

23 Software stack
  - a device driver,
  - an application programming interface (API) and its runtime,
  - two higher-level mathematical libraries of common usage, CUFFT and CUBLAS.
  The CUDA programming interface consists of:
  - A minimal set of extensions to the C language that allow the programmer to target portions of the source code for execution on the device.
  - A runtime library split into:
    - Host component, which runs on the CPU and provides functions to control and access one or more compute devices from the host.
    - Device component (kernel), which runs on the device and provides device-specific functions.
    - Common component, which provides built-in vector types and a subset of the C standard library that are supported in both host and device code.

24 An Extension to the C Programming Language
  - Function type qualifiers to specify execution on host or device: __device__, __global__, __host__
  - Variable type qualifiers to specify the memory location on the device: __device__, __constant__, __shared__
  - Static execution configuration <<< dgrid, dblock, nbyteshmem, Strm >>>: defines the dimensions of the grid and of the blocks
      __global__ void Func(float* parameter);
      Func<<< Dg, Db, Ns >>>(parameter);
    The configuration is evaluated before the actual execution.
  - Four built-in variables that specify the grid and block dimensions and the block and thread indices: gridDim, blockDim, blockIdx, threadIdx

25 Multiple levels of parallelism
  Thread block
  - Batch of threads that can cooperate with each other
  - Fast shared memory
  - Synchronizable
  - Threads identifiable via thread ID
  - A block can be a one-, two- or three-dimensional array of threads
  Grid of thread blocks
  - Limited number of threads in a block
  - Allows a larger number of threads to execute the same kernel with one invocation
  - Blocks identifiable via block ID
  - Leads to a reduction in thread cooperation
  - A grid can be a one- or two-dimensional array of blocks
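
How block ID and thread ID combine into one global element index can be shown with a host-only emulation. This is plain C++, not the CUDA runtime: the `Index` struct mirrors the names of the CUDA built-in variables, while `scale_kernel` and `launch` are hypothetical, and the blocks and threads run sequentially here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mirrors the names of the CUDA built-in variables for a 1D grid of 1D blocks.
struct Index { unsigned blockIdx, threadIdx, blockDim, gridDim; };

// Kernel body: one "thread" handles one element.
void scale_kernel(const Index& idx, float factor, std::vector<float>& data) {
    const std::size_t i =
        static_cast<std::size_t>(idx.blockIdx) * idx.blockDim + idx.threadIdx;
    if (i < data.size()) data[i] *= factor;  // guard the partial last block
}

// Emulates a launch with enough blocks of `blockDim` threads to cover `data`.
void launch(unsigned blockDim, float factor, std::vector<float>& data) {
    assert(blockDim > 0);
    const unsigned gridDim =
        static_cast<unsigned>((data.size() + blockDim - 1) / blockDim);
    for (unsigned b = 0; b < gridDim; ++b)        // blocks of the grid
        for (unsigned t = 0; t < blockDim; ++t)   // threads of one block
            scale_kernel({b, t, blockDim, gridDim}, factor, data);
}
```

For 10 elements and 4 threads per block the grid has 3 blocks, so 12 kernel instances run and the bounds check discards the last two. The grid level is what lets a single invocation cover more elements than one block may hold.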

26
  1. Write the kernel (in C)
  2. Initialize the CUDA environment
  3. Allocate memory on the GPU
  4. Prepare the input data in GPU memory
  5. Configure the execution and execute the CUDA kernel

27 CTM
  - Low-level programming interface
  - Direct access to the native instruction set and memory
  - GPGPU

28 Device-driver layer that sits on top of CTM
  The CAL system comprises:
  - Master process: executes on the CPU, submits commands for execution, and queries the completion status of tasks.
  - Device: a hardware component capable of running CAL programs (kernels).
  - Kernel: executed on the processors of the device and implemented in the AMD Intermediate Language (IL).
  A computation is invoked by setting up one or more outputs and specifying the region of these outputs (the domain of execution) that must be computed.

40 HLC LLC

43 (diagram labels)
  - .br file: integrated stream kernel and CPU program
  - brcc: CPU/stream code splitter; kernel compiler (IL code generator)
  - HLC: some processing for stream operators
  - CPU code (C)
  - Evaluate kernels on the CPU using explicit loops over stream elements: CPU emulation code (C++)
  - IL code appended to the generated .cpp file
  - LLC; other libs; .cpp file; g++ / cl.exe; brook lib; cal lib
  - brt: stream runtime with a CPU backend and a GPU backend (CAL)

45 CPU and GPU memory hierarchy
  - CPU: Disk, Host Memory, Caches, Registers
  - GPU: Local Memory, GPU Caches, Constant Registers, Temporary Registers

46
                       | CPU                 | GPU
  Allocate/free memory | any program point   | only before computation
  Memory access        | random              | limited
  Register             | read/write          | read/write
  Local memory         | read/write to stack | read/write to stack
  Host memory          | read/write to heap  | read-only during computation, write-only at end of computation
  Disk                 | read/write to disk  | no direct access

47 CPU interface
  - allocate/free resource
  - copy CPU to GPU
  - copy GPU to CPU
  GPU interface
  - random-access read/write
  - stream read/write
  - copy GPU to GPU

48 NVIDIA GeForce GTX 280
  - Off-chip memory: 1 GB, 256-bit GDDR3, 993 MHz clock speed; 400 to 600 clock cycles latency
  - On-chip memory: 256 KB L2 caches; 32 KB L1 caches; Local Data Share (16 KB); 4 to 6 clock cycles latency
  AMD-ATI Firestream 9250
  - Off-chip memory: 1 GB, 256-bit GDDR3, 993 MHz clock speed
  - On-chip memory: 256 KB L2 caches; 32 KB L1 caches; Local Data Share (16 KB); Global Data Share (16 KB)
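
A peak off-chip bandwidth can be derived from the bus width and memory clock listed for the Firestream 9250. This is a hedged sketch: the function name is hypothetical, and the factor of two transfers per clock is an assumption based on GDDR3 being double-data-rate memory; it is not stated on the slide.

```cpp
// Peak bandwidth in GB/s (10^9 bytes per second) of a memory interface.
// transfers_per_clock = 2 is an assumption (double data rate, as in GDDR3).
constexpr double peak_bandwidth_gb_s(int bus_width_bits, double clock_mhz,
                                     int transfers_per_clock = 2) {
    return (bus_width_bits / 8.0)      // bytes per transfer
           * clock_mhz * 1e6           // clock cycles per second
           * transfers_per_clock       // transfers per clock cycle
           / 1e9;                      // bytes/s to GB/s
}
```

For a 256-bit bus at 993 MHz this gives 32 bytes x 993 x 10^6 x 2, about 63.55 GB/s. The figure is a theoretical ceiling; the 400 to 600 cycle latency quoted on the slide is what kernels must hide to get close to it.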

49 Each thread can:
  - R/W per-thread registers
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
  The host can R/W global, constant, and texture memories.
  (Diagram: a (Device) Grid with Block (0, 0) and Block (1, 0); each block has its own Shared Memory and contains Thread (0, 0) and Thread (1, 0), each with its own Registers and Local Memory; Global Memory, Constant Memory and Texture Memory are shared by the grid and reachable from the Host.)

50 Shared memory
  - Pros: fast; fast communication among the threads of a block (intra-block); reduces GPU-CPU data transfers
  - Cons: the hardware will spill thread data into off-chip memory
  Global Shared memory
  - Pros: communication among threads of different blocks (inter-block); reduces GPU-CPU data transfers
  Kernel implementation; Learning curve; Optimizations; Static execution configuration; Toolchain

51 Restrictions:
  1. Memory access is limited to reads from input/gather streams
  2. Writing to static or global variables is not allowed inside kernels
  3. It is illegal to call non-kernel functions from a kernel
  4. Kernels invoked from the application have a void return type
  5. Kernels are callable as subroutines from other kernels only
  6. All variables are automatic
  7. Pointers are not supported
  8. Memory cannot be allocated
  9. Recursion is not allowed