t.diamanti@cineca.it
Agenda
From GPUs to GPGPUs
GPGPU architecture
CUDA programming model
Perspective projection
Vectors connecting the vanishing point to every point of the 3D model intersect the XY plane. The points of intersection form our projected object.
An example of perspective projection
Px = (Zx*Qz - Qx*Zz) / (Qz - Zz)
Py = (Zy*Qz - Qy*Zz) / (Qz - Zz)
Where:
P (Px, Py): pixel on the screen
Q (Qx, Qy, Qz): starting point in 3D coordinates
Z (Zx, Zy, Zz): vanishing point
Projection is on the XY plane for simplicity, so z = 0.
Hidden lines removal
Many algorithms for solving this problem can be found in the literature, such as the painter's algorithm.
Rasterization
Graphic primitives are transformed into pixels of the frame buffer.
Z-Buffer
When using filled polygons instead of lines, there is a method, easily implemented in hardware, to solve the depth problem: the Z-buffer. This buffer has the same size as the viewport and stores, for each pixel that has been drawn, a depth value, where depth is the distance from the observer. For each pixel, the color is changed if and only if the new depth value is less than the stored one. In this way the polygons closer to the observer cover the more distant ones, in the sense that their pixels overwrite those of polygons that are further away.
Z-Buffer
(figure: top view of two overlapping polygons at Z = -.5 and Z = -.3, the eye, and the resulting final image)
The Z-Buffer algorithm
Step 1: initialize and enable the depth buffer (every entry is set to a far sentinel value, e.g. -999).
Step 2: as the polygons are rendered on the screen, OpenGL stores their z coordinates in the buffer: the pixels covered by the polygon at Z = -.5 now hold -.5.
Step 3: polygons are drawn according to their z position: where the polygon at Z = -.3 overlaps, it is closer to the eye and overwrites the -.5 entries.
Texture mapping
Texture mapping is the application of a bitmap image to a two-dimensional polygon.
Rendering Pipeline
Application -> Vertex Processor -> Rasterizer -> Fragment Processor -> Buffer
(Vertices become transformed vertices, then fragments, then textured fragments; the rasterizer receives vertex connections and fragment positions; a wooden texture is applied in the fragment stage.)
The first graphic computers
The first graphic supercomputers, typically from SGI, had hardware acceleration and were used in areas such as military or aviation simulation.
3D accelerators for PC
Graphics accelerators for PCs have been available since 1997. Progress is very fast (about a new generation every year).
1999: Nvidia Riva TNT
128-bit bus and graphics engine
180 million pixels/s fill rate
6 million triangles/s peak
16 MB frame buffer
The AGP bus
Accelerated Graphics Port, introduced by Intel in 1997.
The PCI bus was 32 bits wide and ran at 33 MHz, so its bandwidth was 33 M transfers/s * 4 bytes = 133 MB/s.
The AGP bus (1x) ran at 66 MHz with a 32-bit width, so its bandwidth was 266 MB/s. AGP 2x offered 533 MB/s, AGP 4x doubled again to 1066 MB/s, and AGP 8x offered 2 GB/s.
2000: Nvidia GeForce
Transform & Lighting: for the first time, perspective and illumination are calculated on the GPU
256-bit bus and graphics engine
480 million pixels/s
15 million triangles/s
32 MB frame buffer
2001: Nvidia GeForce 3
57 million transistors
First 3D chip with vertex and pixel shaders
2 textures per pixel
2002: Nvidia GeForce 4 Ti
63 million transistors
75-100 million triangles/s
128 MB frame buffer
Vertex shader units were doubled
2002: Nvidia GeForce FX
130 million transistors
315 million triangles/s
128/256 MB frame buffer
DirectX 9 vertex and pixel shaders
The PCI-Express bus
Introduced by Intel in 2004. PCI-Express 16x offers 4 GB/s in each direction (to and from the GPU); this is increasingly important for retrieving the results of calculations performed on the GPU (GPGPU).
2004: Nvidia GeForce 6
222 million transistors
128/256/512 MB frame buffer
16 graphics pipelines for pixel shaders
6 vertex shader units
DirectX 9.0c
Two graphics cards in one PC: Nvidia SLI, ATI CrossFire
2005: Nvidia GeForce 7
302 million transistors
24 pixel pipelines
8 vertex shader units
Available only for PCI-Express
15.6 billion pixels/s
1400 million vertices/s
2006: Nvidia GeForce 8
Shader model 4.0, geometry shader (DirectX 10)
Up to 768 MB of on-board memory
36.8 billion pixels/s
681 million transistors
128 unified graphics pipelines
10,800 million vertices/s
PCI Express 2.0
PCI Express 2.0 doubles the bus clock frequency of version 1.1, doubling the available bandwidth. It is backward compatible with the PCI Express 1.1 specification.
2008: Nvidia GeForce 9
Shader model 4.0, geometry shader (DirectX 10)
Up to 1 GB of on-board memory
43.2 billion pixels per second
Support for PCI Express 2.0
754 million transistors
128 graphics pipelines
65 nm transistors
2008: GTX 200
Shader model 4.0, geometry shader (DirectX 10)
240 streaming processors
55 nm transistors
51.8 billion pixels per second
New generation: Fermi
The new generation of Nvidia graphics chips has been dubbed Fermi and is marketed as GTX 400/500. The original project included 512 CUDA cores and up to 6 GB of GDDR5 memory. Produced with TSMC's 40 nm process (Nvidia has always been fabless).
http://www.tsmc.com/english/b_technology/b01_platform/b010101_40nm.htm
2010: Fermi
512 CUDA cores, up to 6 GB GDDR5 memory
TSMC 40 nm transistors
GeForce GTX 580: 512 CUDA Cores, 1536 MB GDDR5
GeForce GTX 570: 480 CUDA Cores, 1280 MB GDDR5
New generation: Fermi
Fermi
The GPU is organized in 4 Graphics Processing Clusters (GPCs)
Each GPC has 4 sub-units (SMs), each one with 32 streaming processors that execute the same instruction in parallel (in comparison, the GTX 200 chip had 8)
Each SM has L1 cache and shared memory
Each SM has 2 dispatch units
Fermi introduces caches: an L1 cache per SM and an L2 cache shared by the whole chip.
Shared memory
A sort of explicit cache. It resides on the chip, so it is much faster than the on-board memory. Its size is 16 KB (48 KB on Fermi).
Fermi (3)
Nvidia introduces the GigaThread Engine, which allows concurrent kernel execution: threads belonging to different kernels can run simultaneously, which was not possible with previous-generation GPUs.
GF104
The GF104 chip, introduced with the GTX 460 graphics card, brings some hardware differences:
Each SM has 48 CUDA cores instead of 32
This provides a total of 384 cores
The GTX 460 has one SM disabled, for a total of 336 cores
The GTX 560 has the full 384 cores enabled
GF104
To balance the increase in cores per SM, the dispatch units were doubled from 2 to 4.
nvidia naming
Mainstream & laptops: GeForce. Target: video games and multimedia.
Workstation: Quadro. Target: graphics professionals who use CAD and 3D modeling applications. The price premium pays for more memory and especially for specific drivers that accelerate these applications.
http://www.nvidia.it/object/maxtreme_workstation_it.html
GPGPU: Tesla. Target: High Performance Computing.
Fermi: real products
Mainstream:
GeForce GTX 580: 512 CUDA Cores, 1536 MB GDDR5
GeForce GTX 570: 480 CUDA Cores, 1280 MB GDDR5
Computing (memory can be configured to be ECC):
Tesla C2050: 448 CUDA Cores, 3 GB GDDR5
Tesla C2070: 448 CUDA Cores, 6 GB GDDR5
Note: with ECC on, 12.5% of the GPU memory is used for ECC bits. For example, 3 GB of total memory yields 2.625 GB of user-available memory with ECC on.
Tesla C2050
Double precision floating point performance (peak): 515 Gflops
Single precision floating point performance (peak): 1.03 Tflops
They were 78 and 933 Gflops respectively for the previous generation.
Rendering Pipeline
Application -> Vertex Processor -> Rasterizer -> Fragment Processor -> Buffer
(Vertices become transformed vertices, then fragments, then textured fragments; the rasterizer receives vertex connections and fragment positions; a wooden texture is applied in the fragment stage.)
Shading languages
HLSL (Microsoft, 2002)
Cg (nvidia, 2002)
GLSL (ARB, 2003)
ASM shading languages (2001)
Direct3D (Microsoft, 1995)
OpenGL (ARB, 1992)
GLSL: example

// Vertex shader
void main()
{
    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
}

// Fragment shader
void main()
{
    gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0);
}

(Note: GLSL is case sensitive; the built-in variables are gl_Position, gl_ModelViewProjectionMatrix, gl_Vertex, and gl_FragColor.)
High-level languages
C-like syntax
Data types: vectors (1 to 4 components: floating point, integer, boolean), matrices (2x2, 3x3, 4x4), arrays and textures
Conditions, loops, functions
Matrix and vector algebra
Special instructions: trigonometry, exponentials, geometry, interpolations
GPGPU (General Purpose computation using GPUs)
Non-graphic use of the programmable shaders.
Future trends
Power dissipation cannot be increased much further: we are already at the limits of air cooling.
Power consumption grows faster than linearly with the clock: P = C * f * V^2, and since V is proportional to f, the relation is cubic.
High clock rates therefore lead to very low efficiency.
Multi-core processors can be beneficial: reducing the clock by 20% yields an energy saving of about 50%, a more efficient use of transistors than raising the clock of a single processor.
Architecture of a GPU
nvidia GTX 580: bandwidth 192.4 GB/s, estimated 1581.1 Gflops
Intel Core i7-980X: max memory bandwidth 25.6 GB/s, estimated 107 Gflops
AMD's architecture: VLIW5
Very Long Instruction Word 5
Designed to process a 4-component dot product (e.g. w, x, y, z) and a scalar component (e.g. lighting) at the same time
Found on the 6800 series and earlier models
AMD's architecture: VLIW4
In games, VLIW5 reached an average efficiency of 3.4 of the 5 slots
Starting from the 6900 series, AMD introduced VLIW4
The space previously allocated to the t-unit can now be used for more SIMDs
Drivers and compilers are more complicated on this architecture than on Nvidia's, because they must exploit not only the parallelism across SIMDs but also the vectorization inside each SIMD
nvidia vs AMD
Nvidia's SIMDs are simpler (one instruction per clock cycle), but they run at double the clock of the rest of the chip. For example, these are the specs of the GeForce GTX 580:
CUDA Cores: 512
Graphics Clock: 772 MHz
Processor Clock: 1544 MHz
AMD Radeon 6970 specs:
Stream processors: 1536 (384 * 4)
Clock: 880 MHz
AMD Radeon 6870 specs:
Stream processors: 1120 (224 * 5)
Clock: 900 MHz