# Introduction to GPGPU. Tiziano Diamanti

2 Agenda: From GPUs to GPGPUs; GPGPU architecture; CUDA programming model.

3 Perspective projection: The lines that connect the vanishing point to every point of the 3D model intersect the XY plane. The points of intersection form the projected object.

4 An example of perspective projection:

$$P_x = \frac{Z_x Q_z - Q_x Z_z}{Q_z - Z_z}, \qquad P_y = \frac{Z_y Q_z - Q_y Z_z}{Q_z - Z_z}$$

Where: $P = (P_x, P_y)$ is the pixel on the screen, $Q = (Q_x, Q_y, Q_z)$ is the starting point in 3D coordinates, and $Z = (Z_x, Z_y, Z_z)$ is the vanishing point. For simplicity the projection is onto the XY plane, so $z = 0$.
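
The projection can be sketched in a few lines of Python. This is a minimal illustration, assuming the slide's formula is the intersection of the line through Z and Q with the plane z = 0; the function name `project_to_xy` and the sample coordinates are hypothetical.

```python
def project_to_xy(q, vp):
    """Project the 3D point q onto the plane z = 0.

    q  : (x, y, z) point of the model (Q on the slide)
    vp : (x, y, z) point the projection lines start from (Z on the slide)
    """
    qx, qy, qz = q
    zx, zy, zz = vp
    denom = qz - zz
    if denom == 0:
        # The line through vp and q is parallel to the XY plane.
        raise ValueError("q and vp have the same z coordinate")
    px = (zx * qz - qx * zz) / denom
    py = (zy * qz - qy * zz) / denom
    return px, py


# A point behind the plane, seen from (0, 0, -1)
print(project_to_xy((2.0, 4.0, 1.0), (0.0, 0.0, -1.0)))  # (1.0, 2.0)
```

A point that already lies on the plane (Qz = 0) projects onto itself, which is a quick sanity check for the signs in the formula.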

5 Hidden line removal: Many algorithms for solving this problem can be found in the literature, such as the painter's algorithm.

6 Rasterization: Graphic primitives are transformed into pixels of the frame buffer.

7 Z-Buffer: When using filled polygons instead of lines, there is a method that is easily implemented in hardware to solve the depth problem: the Z-buffer. This buffer has the same size as the viewport and stores, for each pixel that has been drawn, its depth value, where depth is the distance from the observer. The color of a pixel is changed if and only if the new depth value is less than the one already stored. In this way the polygons closer to the observer cover the more distant ones, because their pixels overwrite those of the polygons that are further away.
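
The depth test described above can be sketched in software as follows. This is only an illustration of the idea, not how the hardware implements it; the `render` function, the fragment tuples, and the one-row "viewport" are assumptions made for the example.

```python
import math


def render(fragments, width, height, background="."):
    """Draw fragments (x, y, depth, color) using a Z-buffer.

    depth is the distance from the observer: smaller means closer.
    """
    depth = [[math.inf] * width for _ in range(height)]
    color = [[background] * width for _ in range(height)]
    for x, y, d, c in fragments:
        # Overwrite the pixel only if the new fragment is closer.
        if d < depth[y][x]:
            depth[y][x] = d
            color[y][x] = c
    return color


# Polygon A (depth 0.5) covers pixels 0-2, polygon B (depth 0.3) covers 2-3.
fragments = [(x, 0, 0.5, "A") for x in range(0, 3)]
fragments += [(x, 0, 0.3, "B") for x in range(2, 4)]
print("".join(render(fragments, width=4, height=1)[0]))  # AABB
```

The result does not depend on the order in which the polygons are drawn, which is the main advantage over sorting-based approaches such as the painter's algorithm.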

8 Z-Buffer [Figure: top view of the eye and two polygons at Z = -0.5 and Z = -0.3, next to the final image]

9 The Z-Buffer algorithm. Step 1: Initialization/enabling of the depth buffer.

10 The Z-Buffer algorithm. Step 2: OpenGL stores the z coordinates of the polygons as they are rendered on the screen. [Figure: eye and polygons at Z = -0.5 and Z = -0.3]

11 The Z-Buffer algorithm. Step 3: Draw the polygons according to their z position. [Figure: eye and polygons at Z = -0.5 and Z = -0.3]

12 Texture mapping: Texture mapping applies a bitmap image to a two-dimensional polygon.
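
As a rough illustration, applying a bitmap to a polygon comes down to looking up, for each point of the polygon, the corresponding texel of the image. The sketch below assumes normalized texture coordinates (u, v) in [0, 1] and nearest-neighbor sampling; `sample_nearest` and the checkerboard texture are hypothetical names used only for this example.

```python
def sample_nearest(texture, u, v):
    """Return the texel nearest to (u, v), with u and v in [0, 1].

    texture is a list of rows; each row is a list of texel values.
    """
    height = len(texture)
    width = len(texture[0])
    # Clamp so that u = 1.0 or v = 1.0 still maps to the last texel.
    col = min(int(u * width), width - 1)
    row = min(int(v * height), height - 1)
    return texture[row][col]


checker = [["black", "white"],
           ["white", "black"]]

print(sample_nearest(checker, 0.1, 0.1))
print(sample_nearest(checker, 0.9, 0.1))
```

Real GPUs usually interpolate between neighboring texels instead of picking the nearest one, but the lookup principle is the same.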

13 Rendering Pipeline: Application → Vertex Processor → Rasterizer → Fragment Processor → Buffer. Data passed between the stages: vertices, transformed vertices, fragments, textured fragments. [Diagram labels: vertex connections, fragment position, wooden texture]

14 The first graphics computers: The first graphics supercomputers were typically SGI machines; they had hardware acceleration and were used in areas such as military and aviation simulation.

15 3D accelerators for PC: PC graphics accelerators have been available since 1997. Progress has been very fast (about one new generation every year).

16 1999 Nvidia Riva TNT: 128-bit bus and graphics engine; 180 million pixels/s fill rate; 6 million triangles/s peak; 16 MB frame buffer.

17 The AGP bus: The Accelerated Graphics Port was introduced by Intel in 1997. The PCI bus was 32 bits wide and ran at 33 MHz, so its bandwidth was 33 MHz × 4 bytes = 133 MB/s. The AGP bus (1x) ran at 66 MHz with a width of 32 bits, so its bandwidth was 266 MB/s. AGP 2x offered 533 MB/s, AGP 4x doubled that again to 1066 MB/s, and AGP 8x offered about 2 GB/s.
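
The bandwidth figures on this slide all follow from clock × bus width × transfers per clock. The short sketch below reproduces them; it assumes the nominal clocks of 33.33 MHz and 66.66 MHz (which the slide rounds to 33 and 66), and the function name is hypothetical.

```python
def bus_bandwidth_mb_s(clock_mhz, width_bits, transfers_per_clock=1):
    """Peak bandwidth in MB/s of a parallel bus."""
    return clock_mhz * (width_bits / 8) * transfers_per_clock


PCI_CLOCK_MHZ = 100 / 3  # nominal 33.33 MHz (assumption, slide says 33)
AGP_CLOCK_MHZ = 200 / 3  # nominal 66.66 MHz (assumption, slide says 66)

print(f"PCI    : {bus_bandwidth_mb_s(PCI_CLOCK_MHZ, 32):7.1f} MB/s")
for multiplier in (1, 2, 4, 8):
    bw = bus_bandwidth_mb_s(AGP_CLOCK_MHZ, 32, multiplier)
    print(f"AGP {multiplier}x : {bw:7.1f} MB/s")
```

Each AGP mode keeps the same 66 MHz clock and 32-bit width and only increases the number of transfers per clock, which is why the bandwidth doubles at every step.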

18 2000 Nvidia GeForce: Transform & Lighting: for the first time, perspective and illumination are calculated on the GPU. 256-bit bus and graphics engine; 480 million pixels/s; 15 million triangles/s; 32 MB frame buffer.

19 2001 Nvidia GeForce 3: 57 million transistors; first 3D chip with vertex and pixel shaders; 2 textures per pixel.

20 2002 Nvidia GeForce 4 Ti: 63 million transistors; millions of triangles/s; 128 MB frame buffer; vertex shader units were doubled.

21 2002 Nvidia GeForce FX: 130 million transistors; 315 million triangles/s; 128/256 MB frame buffer; DirectX 9 vertex and pixel shaders.

22 The PCI-Express bus: Introduced by Intel in 2004. PCI-Express 16x offers 4 GB/s in each direction (to and from the GPU); this is increasingly important for reading back the results of calculations performed on the GPU (GPGPU).

23 2004 Nvidia GeForce: millions of transistors; 128/256/512 MB frame buffer; 16 graphics pipelines for pixel shaders; 6 units for vertex shaders; DirectX 9.0c.

24 Two graphics cards in one PC: Nvidia SLI, ATI Crossfire.

25 2005: Nvidia GeForce: millions of transistors; 24 pixel pipelines; 8 units for vertex shaders; available only for PCI-Express; 15.6 billion pixels/s; 1400 million vertices/s.

26 2006: Nvidia GeForce 8: Shader model 4.0, geometry shader (DirectX 10); up to 768 MB of on-board memory; 36.8 billion pixels/s; 681 million transistors; 128 unified graphics pipelines; millions of vertices/s.

27 PCI Express 2.0: PCI Express 2.0 doubles the bus clock frequency of version 1.1, doubling the available bandwidth. It is backward compatible with the PCI Express 1.1 specification.
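
The 4 Gbytes/s quoted earlier for a 16-lane slot, and its doubling with PCI Express 2.0, can be checked with a small calculation. The per-lane rates used here (2500 and 5000 megatransfers per second) and the 8b/10b line encoding come from the PCI Express specifications rather than from the slides, so treat them as assumptions of this sketch; the function name is hypothetical.

```python
def pcie_bandwidth_mb_s(rate_mt_s, lanes, payload_bits=8, line_bits=10):
    """Peak bandwidth in MB/s, per direction, of a PCI Express link.

    rate_mt_s : raw transfer rate per lane in megatransfers/s (1 bit each)
    payload_bits / line_bits : efficiency of the line encoding (8b/10b)
    """
    payload_mbit_s = rate_mt_s * payload_bits / line_bits
    return payload_mbit_s / 8 * lanes


print(pcie_bandwidth_mb_s(2500, lanes=16))  # PCI Express 1.1 x16 -> 4000.0
print(pcie_bandwidth_mb_s(5000, lanes=16))  # PCI Express 2.0 x16 -> 8000.0
```

Because the link is full duplex, the same bandwidth is available simultaneously in the other direction.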

28 2008: Nvidia GeForce 9: Shader model 4.0, geometry shader (DirectX 10); up to 1 GB of on-board memory; 43.2 billion pixels per second; support for PCI Express; millions of transistors; 128 graphics pipelines; 65 nm transistors.

29 2008: GTX 200: Shader model 4.0, geometry shader (DirectX 10); 240 streaming processors; 55 nm transistors; 51.8 billion pixels per second.

30 New generation: Fermi. The new generation of Nvidia graphics chips has been dubbed Fermi and is marketed under the GTX 400/500 names. The original design included 512 CUDA cores and up to 6 GB of GDDR5 memory. It is manufactured on TSMC's 40 nm process (Nvidia has always been fabless). _platform/b010101_40nm.htm

31 2010: Fermi. 512 CUDA cores, up to 6 GB of GDDR5 memory; TSMC 40 nm transistors. GeForce GTX 580: 512 CUDA cores, 1536 MB GDDR5. GeForce GTX 570: 480 CUDA cores, 1280 MB GDDR5.

32 New generation: Fermi

33 Fermi: The GPU is organized in 4 Graphics Processing Clusters (GPC). Each GPC has 4 sub-units (streaming multiprocessors, SMs), each with 32 streaming processors that execute the same instruction in parallel (in comparison, the GTX 200 chip had 8). Each SM has an L1 cache and shared memory, and 2 dispatch units.

34 Fermi introduces cache

35 Shared memory: A sort of explicit cache. It resides on the chip, so it is much faster than the on-board memory. Its size is 16 KB (48 KB on Fermi).

36 Fermi (3): NVIDIA introduces the GigaThread™ Engine, which allows concurrent kernel execution: threads belonging to different kernels can run simultaneously, which was not possible on previous-generation GPUs.

37 GF 104: The GF104 chip, introduced with the GeForce GTX 460 graphics card, brings some hardware differences. Each SM has 48 CUDA cores instead of 32, for a total of 384 cores. The GTX 460 has one SM disabled, for a total of 336 cores. The GTX 560 has the full 384 cores enabled.

38 GF 104: To balance the increase in cores per SM, the dispatch units have been doubled from 2 to 4.
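
The core counts quoted for Fermi and GF104 are simply the number of active multiprocessors times the cores in each one. The sketch below checks the arithmetic; the count of 8 multiprocessors for GF104 is not stated on the slides and is derived here from 384 / 48, and the helper name is hypothetical.

```python
def cuda_cores(sm_count, cores_per_sm, disabled_sms=0):
    """Total CUDA cores available on a chip."""
    return (sm_count - disabled_sms) * cores_per_sm


# Full Fermi: 4 GPCs x 4 multiprocessors each, 32 cores per multiprocessor.
print(cuda_cores(sm_count=4 * 4, cores_per_sm=32))                # 512

# GF104: 48 cores per multiprocessor; 8 multiprocessors (derived: 384 / 48).
print(cuda_cores(sm_count=8, cores_per_sm=48))                    # 384

# GTX 460: same chip with one multiprocessor disabled.
print(cuda_cores(sm_count=8, cores_per_sm=48, disabled_sms=1))    # 336
```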

39 Nvidia naming. Mainstream & laptops: GeForce (target: video games and multimedia). Workstation: Quadro (target: graphics professionals who use CAD and 3D modeling applications; the higher price is due to more memory and especially to the specific drivers that accelerate these applications). GPGPU: Tesla (target: High Performance Computing).

40 Fermi: real products. Mainstream: GeForce GTX 580: 512 CUDA cores, 1536 MB GDDR5; GeForce GTX 570: 480 CUDA cores, 1280 MB GDDR5. Computing (memory can be configured to be ECC): Tesla C2050: 448 CUDA cores, 3 GB GDDR5; Tesla C2070: 448 CUDA cores, 6 GB GDDR5. Note: With ECC on, 12.5% of the GPU memory is used for ECC bits. For example, 3 GB of total memory yields 3 × (1 - 0.125) = 2.625 GB of user-available memory with ECC on.
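
The effect of the ECC note on the two Tesla boards can be worked out directly from the 12.5% figure. This is a small sketch of that arithmetic only; `usable_memory_gb` is a hypothetical helper, not part of any NVIDIA tool.

```python
def usable_memory_gb(total_gb, ecc_fraction=0.125):
    """Memory left for user data when a fraction is reserved for ECC bits."""
    return total_gb * (1 - ecc_fraction)


print(usable_memory_gb(3))  # Tesla C2050 -> 2.625
print(usable_memory_gb(6))  # Tesla C2070 -> 5.25
```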

41 Tesla C2050: Double-precision floating-point performance (peak): 515 Gflops. Single-precision floating-point performance (peak): 1.03 Tflops. They were 78 and 933 Gflops for the previous generation.

42 Rendering Pipeline: Application → Vertex Processor → Rasterizer → Fragment Processor → Buffer. Data passed between the stages: vertices, transformed vertices, fragments, textured fragments. [Diagram labels: vertex connections, fragment position, wooden texture]

43 Shading languages: HLSL (Microsoft, 2002); Cg (Nvidia, 2002); GLSL (ARB, 2003); ASM shading languages (2001); Direct3D (Microsoft, 1995); OpenGL (ARB, 1992).

44 GLSL: example

    // Vertex shader: transform each vertex by the model-view-projection matrix
    void main()
    {
        gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
    }

    // Fragment shader: color every fragment opaque red
    void main()
    {
        gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0);
    }

45 High-level languages: C-like syntax. Data types: vectors (from 1 to 4 floating-point, integer, or boolean components), matrices (2x2, 3x3, 4x4), arrays and textures. Conditions, loops, functions. Matrix and vector algebra. Special instructions: trigonometry, exponentials, geometry, interpolation.

46 GPGPU (General-Purpose computation using GPUs): Non-graphics use of the programmable shaders.

47 Future trends: The power dissipation cannot be increased further: we are already at the limits of air cooling. Power consumption does not increase linearly with the clock: $P = C f V^2$, and $V$ is proportional to $f$, so the relation is cubic. High clock rates lead to very low efficiency. Multi-core processors can be beneficial: reducing the clock by 20% leads to an energy saving of about 50%. It is more efficient to spend transistors on more cores than to turn up the clock of a single processor.
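
The "20% slower clock, about 50% less power" claim follows from the cubic relation. The sketch below works it out and compares a single core at full clock with a hypothetical dual-core chip at 80% clock. It assumes the slide's model (P = C·f·V² with V proportional to f) and, for the dual-core case, perfect parallel scaling, which real workloads rarely achieve.

```python
def relative_power(clock_scale):
    """Power relative to the baseline when the clock is scaled.

    Assumes P = C * f * V**2 with V proportional to f, so P scales as f**3.
    """
    return clock_scale ** 3


one_core = relative_power(0.8)
print(f"one core at 80% clock: {one_core:.3f} of baseline power")

# Hypothetical dual-core chip, both cores at 80% clock.
dual_power = 2 * relative_power(0.8)
dual_throughput = 2 * 0.8  # assumes the work splits perfectly across cores
print(f"two cores at 80% clock: {dual_power:.3f} of baseline power, "
      f"{dual_throughput:.1f}x throughput")
```

Under these assumptions two slower cores draw roughly the same power as one fast core while offering up to 1.6 times the throughput, which is the argument for going multi-core.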

48 Architecture of a GPU. Nvidia GTX 580: Bandwidth: GB/s; Estimated: Gflops. Intel Core i7-980X: Max memory bandwidth: 25.6 GB/s; Estimated: 107 Gflops.

49 AMD's architecture: VLIW 5. Very Long Instruction Word 5: designed to process a 4-component dot product (e.g. w, x, y, z) and a scalar component (e.g. lighting) at the same time. Found on models of the 6800 series and earlier.

50 AMD's architecture: VLIW 4. In games, VLIW 5 reached an average efficiency of 3.4 units used out of 5. Starting from the 6900 series, AMD introduced VLIW 4. The space previously allocated to the t-unit can now be used to have more SIMDs. Drivers and compilers are more complicated on this architecture than on Nvidia's, because they need to exploit not only the parallelism across SIMDs but also the vectorization inside each SIMD.

51 Nvidia vs AMD: Nvidia's SIMDs are simpler (one instruction per clock cycle), but they run at double the clock of the rest of the chip. For example, these are the specs of the GeForce GTX 580: CUDA cores: 512; graphics clock: 772 MHz; processor clock: 1544 MHz. AMD's Radeon 6970 specs: stream processors: 1536 (384 × 4); clock: 880 MHz. AMD's Radeon 6870 specs: stream processors: 1120 (224 × 5); clock: 900 MHz.
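
The specs above can be turned into a rough peak-throughput estimate: cores × clock × floating-point operations per core per cycle. The sketch assumes 2 operations per cycle (one fused or combined multiply-add), which is not stated on the slide, and `peak_gflops` is a hypothetical helper. These are theoretical peaks, not measured performance.

```python
def peak_gflops(cores, clock_mhz, flops_per_cycle=2):
    """Theoretical single-precision peak in Gflops.

    flops_per_cycle=2 assumes one multiply-add per core per cycle.
    """
    return cores * clock_mhz * flops_per_cycle / 1000


cards = {
    "GeForce GTX 580": (512, 1544),  # CUDA cores, processor clock in MHz
    "Radeon 6970": (1536, 880),      # stream processors, clock in MHz
    "Radeon 6870": (1120, 900),
}

for name, (cores, clock_mhz) in cards.items():
    print(f"{name}: {peak_gflops(cores, clock_mhz):.0f} Gflops")
```

The AMD parts reach a higher peak on paper, but that peak is only approached when the compiler manages to fill every slot of the VLIW instruction.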

