Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it




Agenda
- From GPUs to GPGPUs
- GPGPU architecture
- CUDA programming model

Perspective projection
The vectors that connect the vanishing point to every point of the 3D model intersect the XY plane. These points of intersection form the projected object.

An example of perspective projection

\[ P_x = \frac{Z_x Q_z - Q_x Z_z}{Q_z - Z_z}, \qquad P_y = \frac{Z_y Q_z - Q_y Z_z}{Q_z - Z_z} \]

Where:
- P (P_x, P_y) is the pixel on the screen
- Q (Q_x, Q_y, Q_z) is the starting point in 3D coordinates
- Z (Z_x, Z_y, Z_z) is the vanishing point

The projection is on the XY plane for simplicity, so z = 0.
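As a minimal sketch of this projection, the Python function below intersects the line through Z and Q with the plane z = 0. The function name `project_to_xy` and the sample coordinates are illustrative assumptions, not part of the original slides.

```python
def project_to_xy(q, z):
    """Project the 3D point q onto the plane z = 0 along the line through z and q.

    q is the point of the model (Qx, Qy, Qz); z is the vanishing point (Zx, Zy, Zz).
    """
    qx, qy, qz = q
    zx, zy, zz = z
    denom = qz - zz
    if denom == 0:
        # Q and Z have the same depth: the line is parallel to the XY plane.
        raise ValueError("the line through Q and Z is parallel to the XY plane")
    px = (zx * qz - qx * zz) / denom
    py = (zy * qz - qy * zz) / denom
    return px, py


# Hypothetical sample values: vanishing point behind the screen, model point in front.
sample = project_to_xy((2.0, 4.0, 5.0), (0.0, 0.0, -5.0))
```

With these sample values the screen plane lies halfway between Z and Q, so the projected point is Q scaled by one half in x and y.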

Hidden line removal
Many algorithms for solving this problem can be found in the literature, such as the painter's algorithm.

Rasterization
Graphics primitives are transformed into pixels of the frame buffer.

Z-Buffer
When using filled polygons instead of lines, there is a method, easily implemented in hardware, that solves the depth problem: the Z-buffer. This buffer has the same size as the viewport and stores the depth value of each pixel that has been drawn, where depth is the distance from the observer. The color of a pixel is changed if and only if the new depth value is less than the one already stored. In this way the polygons closer to the observer cover the more distant ones, because their pixels overwrite those of the polygons that are further away.

[Figure: Z-Buffer. Top view of the eye and two polygons at Z = -.5 and Z = -.3, with the final image.]

The Z-Buffer algorithm
Step 1: Initialization/enabling of the depth buffer.

Depth buffer:
    -999  -999  -999  -999
    -999  -999  -999  -999
    -999  -999  -999  -999
    -999  -999  -999  -999

The Z-Buffer algorithm
Step 2: OpenGL stores the z coordinates of the polygons as they are rendered on the screen.

Depth buffer (polygon at Z = -.5 drawn):
    -999  -999  -999  -999
    -999  -999  -999  -999
     -.5   -.5  -999  -999
     -.5   -.5  -999  -999

The Z-Buffer algorithm
Step 3: Draw the polygons according to their z position.

Depth buffer (polygon at Z = -.3 drawn over the one at Z = -.5):
    -999  -999  -999  -999
    -999   -.3   -.3  -999
     -.5   -.3   -.3  -999
     -.5   -.5  -999  -999
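The three steps can be sketched in a few lines of Python. This is an illustrative software version, not how OpenGL or the hardware implements it. It follows the convention of the grids on these slides: the buffer starts at -999 and a fragment wins when its z is larger, that is, closer to the eye. The helper names and the rectangle positions are assumptions chosen to reproduce the 4x4 grids.

```python
FAR = -999.0  # initial depth value used on the slides


def make_buffers(width, height, background="."):
    """Step 1: create a depth buffer and a color buffer of the viewport size."""
    depth = [[FAR] * width for _ in range(height)]
    color = [[background] * width for _ in range(height)]
    return depth, color


def draw_rect(depth, color, x0, y0, x1, y1, z, c):
    """Steps 2 and 3: write each pixel only if the new fragment is closer."""
    for y in range(y0, y1):
        for x in range(x0, x1):
            if z > depth[y][x]:      # closer to the eye than what is stored
                depth[y][x] = z
                color[y][x] = c


depth, color = make_buffers(4, 4)
draw_rect(depth, color, 0, 2, 2, 4, -0.5, "A")  # far polygon, Z = -.5
draw_rect(depth, color, 1, 1, 3, 3, -0.3, "B")  # near polygon, Z = -.3

# The result does not depend on the drawing order.
depth_rev, color_rev = make_buffers(4, 4)
draw_rect(depth_rev, color_rev, 1, 1, 3, 3, -0.3, "B")
draw_rect(depth_rev, color_rev, 0, 2, 2, 4, -0.5, "A")
```

Unlike the painter's algorithm, no sorting of the polygons is needed: the per-pixel depth test gives the same image whichever polygon is drawn first.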

Texture mapping
Texture mapping applies a bitmap image to a two-dimensional polygon.

Rendering Pipeline
- Stages: Application, Vertex Processor, Rasterizer, Fragment Processor, Buffer
- Data flowing between stages: vertices, transformed vertices, fragments, textured fragments
- [Figure labels: vertex connections, fragment position, wooden texture]

The first graphics computers
The first graphics supercomputers were typically SGI machines. They had hardware acceleration and were used in areas such as military and aviation simulation.

3D accelerators for PCs
Graphics accelerators for PCs have been available since 1997. Progress has been very fast (about one new generation every year).

1999: Nvidia Riva TNT
- 128-bit bus and graphics engine
- Fill rate of 180 million pixels/s
- Peak of 6 million triangles/s
- 16 MB frame buffer

The AGP bus
- Accelerated Graphics Port, introduced by Intel in 1997
- The PCI bus was 32 bits wide and ran at 33 MHz, so its bandwidth was 33 MHz * 4 bytes = 133 MB/s
- The AGP bus (1x) was 32 bits wide and ran at 66 MHz, so its bandwidth was 266 MB/s
- AGP 2x offered 533 MB/s
- AGP 4x doubled this again to 1066 MB/s
- AGP 8x offered about 2 GB/s
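The bandwidth figures follow from clock frequency times bus width times transfers per clock. The short Python sketch below reproduces them, assuming the nominal clocks of 33.33 MHz for PCI and 66.67 MHz for AGP and 1 MB = 10^6 bytes. The function name is illustrative, and results are truncated to whole MB/s as on the slide.

```python
from fractions import Fraction


def bus_bandwidth_mb_s(clock_mhz, width_bits, transfers_per_clock=1):
    """Peak bandwidth in MB/s, truncated to an integer (1 MB = 10**6 bytes)."""
    bytes_per_transfer = Fraction(width_bits, 8)
    return int(clock_mhz * bytes_per_transfer * transfers_per_clock)


PCI_CLOCK_MHZ = Fraction(100, 3)  # nominal 33.33 MHz
AGP_CLOCK_MHZ = Fraction(200, 3)  # nominal 66.67 MHz

pci = bus_bandwidth_mb_s(PCI_CLOCK_MHZ, 32)
# AGP 1x, 2x, 4x, 8x transfer 1, 2, 4 and 8 words per clock on the same 32-bit bus.
agp = {n: bus_bandwidth_mb_s(AGP_CLOCK_MHZ, 32, n) for n in (1, 2, 4, 8)}
```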

2000: Nvidia GeForce
- Transform & Lighting: for the first time, perspective and illumination are calculated on the GPU
- 256-bit bus and graphics engine
- 480 million pixels/s
- 15 million triangles/s
- 32 MB frame buffer

2001: Nvidia GeForce 3
- 57 million transistors
- First 3D chip with vertex and pixel shaders
- 2 textures per pixel

2002: Nvidia GeForce 4 Ti
- 63 million transistors
- 75-100 million triangles/s
- 128 MB frame buffer
- Vertex shader units were doubled

2002: Nvidia GeForce FX
- 130 million transistors
- 315 million triangles/s
- 128/256 MB frame buffer
- DirectX 9 vertex and pixel shaders

The PCI-Express bus
- Introduced by Intel in 2004
- PCI-Express 16x offers 4 GB/s in each direction (to and from the GPU)
- This is increasingly important for GPGPU, where the results of calculations on the GPU have to be read back

2004: Nvidia GeForce 6
- 222 million transistors
- 128/256/512 MB frame buffer
- 16 graphics pipelines for pixel shaders
- 6 units for vertex shaders
- DirectX 9.0c

Two graphics cards in one PC: Nvidia SLI, ATI CrossFire

2005: Nvidia GeForce 7
- 302 million transistors
- 24 pixel pipelines
- 8 units for vertex shaders
- Available only for PCI-Express
- 15.6 billion pixels/s
- 1400 million vertices/s

2006: Nvidia GeForce 8
- Shader model 4.0, geometry shader (DirectX 10)
- Up to 768 MB of on-board memory
- 36.8 billion pixels/s
- 681 million transistors
- 128 unified graphics pipelines
- 10800 million vertices/s

PCI Express 2.0
PCI Express 2.0 doubles the bus clock frequency of version 1.1, doubling the available bandwidth. It is backward compatible with the PCI Express 1.1 specification.

2008: Nvidia GeForce 9
- Shader model 4.0, geometry shader (DirectX 10)
- Up to 1 GB of on-board memory
- 43.2 billion pixels per second
- Support for PCI Express 2.0
- 754 million transistors
- 128 graphics pipelines
- 65 nm process

2008: GTX 200
- Shader model 4.0, geometry shader (DirectX 10)
- 240 streaming processors
- 55 nm process
- 51.8 billion pixels per second

New generation: Fermi
The new generation of Nvidia graphics chips has been dubbed Fermi and is marketed as the GTX 400/500 series. The original design included 512 CUDA cores and up to 6 GB of GDDR5 memory. It is manufactured on TSMC's 40 nm process (Nvidia has always been fabless). http://www.tsmc.com/english/b_technology/b01_platform/b010101_40nm.htm

2010: Fermi
- 512 CUDA cores, up to 6 GB of GDDR5 memory
- TSMC 40 nm process
- GeForce GTX 580: 512 CUDA cores, 1536 MB GDDR5
- GeForce GTX 570: 480 CUDA cores, 1280 MB GDDR5


Fermi
- The GPU is organized in 4 Graphics Processing Clusters (GPCs)
- Each GPC has 4 sub-units (streaming multiprocessors, SMs), each one with 32 streaming processors that execute the same instruction in parallel (in comparison, each sub-unit of the GTX 200 chip had 8)
- Each SM has its own L1 cache and shared memory
- Each SM has 2 dispatch units

Fermi introduces cache

Shared memory
- A sort of explicit cache
- Resides on the chip, so it is much faster than the on-board memory
- Size is 16 KB (48 KB on Fermi)

Fermi (3)
NVIDIA introduces the GigaThread(TM) Engine, which allows concurrent kernel execution: threads belonging to different kernels can run simultaneously, which was not possible on previous-generation GPUs.

GF104
The GF104 chip, introduced with the GeForce GTX 460 graphics card, brings some hardware differences:
- Each SM has 48 CUDA cores instead of 32
- The chip provides a total of 384 cores
- The GTX 460 has one SM disabled, for a total of 336 cores
- The GTX 560 has the full 384 cores enabled

GF104
To balance the increase in cores per SM, the dispatch units have been doubled from 2 to 4.

Nvidia naming
- Mainstream & laptops: GeForce. Target: video games and multimedia
- Workstation: Quadro. Target: graphics professionals who use CAD and 3D modeling applications. The price premium is due to more memory and, above all, to the specific drivers that accelerate these applications. http://www.nvidia.it/object/maxtreme_workstation_it.html
- GPGPU: Tesla. Target: High Performance Computing

Fermi: real products
Mainstream:
- GeForce GTX 580: 512 CUDA cores, 1536 MB GDDR5
- GeForce GTX 570: 480 CUDA cores, 1280 MB GDDR5
Computing (memory can be configured to be ECC):
- Tesla C2050: 448 CUDA cores, 3 GB GDDR5
- Tesla C2070: 448 CUDA cores, 6 GB GDDR5
* Note: With ECC on, 12.5% of the GPU memory is used for ECC bits. For example, 3 GB of total memory yields 2.625 GB of user-available memory with ECC on.
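The ECC note is simple arithmetic, sketched here in Python. The function name and the constant are illustrative, and the only input is the 12.5% overhead stated in the note.

```python
ECC_FRACTION = 0.125  # share of GPU memory reserved for ECC bits, from the note


def usable_memory_gb(total_gb, ecc_enabled=True):
    """Memory left to the user, in GB, for a card with total_gb of memory."""
    if not ecc_enabled:
        return float(total_gb)
    return total_gb * (1.0 - ECC_FRACTION)


c2050_usable = usable_memory_gb(3)  # Tesla C2050, 3 GB
c2070_usable = usable_memory_gb(6)  # Tesla C2070, 6 GB
```

For the 6 GB Tesla C2070 the same rule leaves 5.25 GB available with ECC on.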

Tesla C2050
- Double precision floating point performance (peak): 515 Gflops
- Single precision floating point performance (peak): 1.03 Tflops
- The previous generation reached 78 and 933 Gflops respectively

Rendering Pipeline
- Stages: Application, Vertex Processor, Rasterizer, Fragment Processor, Buffer
- Data flowing between stages: vertices, transformed vertices, fragments, textured fragments
- [Figure labels: vertex connections, fragment position, wooden texture]

Shading languages
- HLSL (Microsoft, 2002)
- Cg (Nvidia, 2002)
- GLSL (ARB, 2003)
- ASM shading languages (2001)
- Direct3D (Microsoft, 1995)
- OpenGL (ARB, 1992)

GLSL: example

    // Vertex shader: transform the vertex into clip space
    void main()
    {
        gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
    }

    // Fragment shader: color every fragment opaque red
    void main()
    {
        gl_FragColor = vec4(1.0, 0.0, 0.0, 1.0);
    }

High-level languages
- C-like syntax
- Data types:
  - Vectors (from 1 to 4 floating point, integer or boolean components)
  - Matrices (2x2, 3x3, 4x4)
  - Arrays and textures
- Conditions, loops, functions
- Matrix and vector algebra
- Special instructions: trigonometry, exponentials, geometry, interpolation

GPGPU (General Purpose computation using GPU)
Non-graphics use of the programmable shaders

Future trends
- Power dissipation cannot be increased much further: we are already at the limits of air cooling
- Power consumption does not increase linearly with the clock: P = C f V^2, and since V is proportional to f, the relation is cubic (P is proportional to f^3)
- High clock rates lead to very low efficiency
- Multi-core processors can be beneficial: reducing the clock by 20% leads to an energy saving of about 50%
- It is more efficient to spend transistors on more cores than to raise the clock of a single processor
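The cubic relation can be checked numerically. The Python sketch below assumes exactly the model on the slide (P = C f V^2 with V proportional to f) and, for the dual-core comparison, ideal scaling of throughput with core count and clock, which is a simplifying assumption. The function name is illustrative.

```python
def relative_power(clock_ratio):
    """Power relative to the baseline when the clock is scaled by clock_ratio.

    Assumes P = C * f * V**2 with V proportional to f, so P scales as f**3.
    """
    return clock_ratio ** 3


# Reducing the clock by 20% leaves about 51% of the power: a saving of roughly 50%.
saving = 1.0 - relative_power(0.8)

# Two cores at 80% clock use about the same power as one core at full clock,
# but (under ideal scaling) deliver 1.6 times the throughput.
dual_core_power = 2 * relative_power(0.8)
dual_core_throughput = 2 * 0.8
```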

Architecture of a GPU
Nvidia GTX 580:
- Memory bandwidth: 192.4 GB/s
- Estimated 1581.1 GFLOPS
Intel Core i7-980X:
- Max memory bandwidth: 25.6 GB/s
- Estimated 107 GFLOPS

AMD's architecture: VLIW5
- Very Long Instruction Word 5
- Designed to process a 4-component dot product (e.g. w, x, y, z) and a scalar component (e.g. lighting) at the same time
- Found on the 6800 series and earlier models

AMD's architecture: VLIW4
- In games, VLIW5 reached an average efficiency of 3.4 (out of 5)
- Starting from the 6900 series, AMD introduced VLIW4
- The space previously allocated to the t-unit can now be used to fit more SIMDs
- Drivers and compilers are more complicated on this architecture than on Nvidia's, because they need to exploit not only the parallelism across SIMDs but also the vectorization inside each SIMD

Nvidia vs AMD
Nvidia's SIMDs are simpler (one instruction per clock cycle), but they run at double the clock of the rest of the chip. For example, these are the specs of the GeForce GTX 580:
- CUDA cores: 512
- Graphics clock: 772 MHz
- Processor clock: 1544 MHz
AMD's Radeon 6970 specs:
- Stream processors: 1536 (384 * 4)
- Clock: 880 MHz
AMD's Radeon 6870 specs:
- Stream processors: 1120 (224 * 5)
- Clock: 900 MHz
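These specs can be turned into peak single-precision figures. The sketch below assumes that each core or stream processor completes one multiply-add, counted as 2 floating point operations, per processor clock. That factor is an assumption not stated on the slides, although it reproduces the 1581.1 GFLOPS estimate quoted earlier for the GTX 580. The function name is illustrative.

```python
def peak_gflops(cores, clock_mhz, flops_per_core_per_clock=2):
    """Theoretical peak in GFLOPS: cores * clock * flops per core per clock."""
    return cores * clock_mhz * flops_per_core_per_clock / 1000.0


gtx580 = peak_gflops(512, 1544)   # uses the processor clock, not the graphics clock
hd6970 = peak_gflops(1536, 880)
hd6870 = peak_gflops(1120, 900)
```

On paper the AMD chips come out ahead, but as the VLIW slides point out, reaching that peak depends on how well the compiler fills the 4 or 5 slots of each instruction word.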