GeForce 8800 & NVIDIA CUDA. A New Architecture for Computing on the GPU

Size: px

Start display at page:

Download "GeForce 8800 & NVIDIA CUDA. A New Architecture for Computing on the GPU"

Cecily Bond
7 years ago
Views:

1 GeForce 8800 & NVIDIA CUDA A New Architecture for Computing on the GPU

Stunning Graphics Realism Lush, Rich Worlds Crysis 2006 Crytek / Electronic Arts Incredible Physics Effects Core of the Definitive Gaming Platform Hellgate: London 2005-2006

2 Stunning Graphics Realism Lush, Rich Worlds Crysis 2006 Crytek / Electronic Arts Incredible Physics Effects Core of the Definitive Gaming Platform Hellgate: London Flagship Studios, Inc. Licensed by NAMCO BANDAI Games America, Inc. Full Spectrum Warrior: Ten Hammers 2006 Pandemic Studios, LLC. All rights reserved THQ Inc. All rights reserved.

obstructions Earlier detection X-Ray tube Advanced Imaging Solution of the Year Axis of rotation reduced

3 Digital Tomosynthesis: Signal Processing Pioneering work at Massachusetts General Hospital 100X speed-up with NVIDIA GPU 32 node server takes 5 hours 1 GPU takes 5 minutes Improved diagnostic value Clearer images Fewer obstructions Earlier detection X-Ray tube Advanced Imaging Solution of the Year Axis of rotation reduced reconstruction time from 5 hours to 5 minutes Digital detector Low-dose X-ray Projections Computationally Intense Reconstruction

Electromagnetic Simulation Acceleware FDTD acceleration technology for the GPU 3D Finite-Difference and Finite-Element Modeling of: Cell phone irradiation MRI Design / Modeling Printed Circuit

4 Electromagnetic Simulation Acceleware FDTD acceleration technology for the GPU 3D Finite-Difference and Finite-Element Modeling of: Cell phone irradiation MRI Design / Modeling Printed Circuit Boards Radar Cross Section (Military) Large speedups with NVIDIA GPUs X Performance (Mcells/s) X Pacemaker with Transmit Antenna 200 5X 100 1X 0 CPU 3.2 GHz 1 GPU 2 GPUs 4 GPUs

CUDA & GPU Computing New Architecture for Computing Thread Execution Manager Control P = P + V * t Control P = P + V * t Parall el Data Cache Shared Data P1, V1 P2, V2 P3, V3 P4, V4 P5, V5 DRAM

5 CUDA & GPU Computing New Architecture for Computing Thread Execution Manager Control P = P + V * t Control P = P + V * t Parall el Data Cache Shared Data P1, V1 P2, V2 P3, V3 P4, V4 P5, V5 DRAM Standard C Programming dim3 DimGrid(100, 50); // 5000 thread blocks dim3 DimBlock(4, 8, 8); // 256 threads per block size_t SharedMemBytes = 64; // 64 bytes of shared memory KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...); Control P = P + V * t Unprecedented Performance New Applications

6 GPU Computing Model

7 GPGPU Programming Model OpenGL Program to Add A and B Start by creating a quad Vertex Program Programs created with raster operation Rasterization Fragment Program Read textures as input to OpenGL shader program CPU Reads Texture Memory for Results Write answer to texture memory as a color All this just to do A + B

8 Current Constraints Graphics API Addressing modes Limited texture size/dimension Shader capabilities Limited outputs Instruction sets Integer & bit ops Communication limited Between pixels Scatter a[i] = p Input Registers Fragment Program Output Registers Texture Constants Registers

9 GeForce 7800 Pixel Input Registers Texture Fragment Program Constants Registers Output Registers

10 Thread Programs Features Millions of instructions Full Integer and Bit instructions Thread Number No limits on branching, looping 1D, 2D, or 3D thread ID allocation Texture Thread Program Constants Registers Output Registers

11 Global Memory Features Fully general load/store to GPU memory: Scatter/Gather Thread Number Programmer flexibility on how memory is accessed Untyped, not limited to fixed texture types Pointer support Texture Thread Program Constants Registers Global Memory

12 Parallel Data Cache Thread Number Features Dedicated on-chip memory Shared between threads for interthread communication Explicitly managed As fast as registers Parallel Data Cache Thread Program Texture Constants Registers Global Memory

13 Example Algorithm - Fluids Goal: Calculate PRESSURE in a fluid Pressure = Sum of neighboring pressures P n = P 1 + P 2 + P 3 + P 4 So the pressure for each particle is Pressure depends on neighbors Pressure 1 = P 1 + P 2 + P 3 + P 4 Pressure 2 = P 3 + P 4 + P 5 + P 6 Pressure 3 = P 5 + P 6 + P 7 + P 8 Pressure 4 = P 7 + P 8 + P 9 + P 10

14 Control Example Fluid Algorithm CPU P n =P 1 +P 2 +P 3 +P 4 Cache P 1 P 2 P 3 P 4 Single thread out of cache DRAM GPGPU Control P n =P 1 +P 2 +P 3 +P 4 Control P n =P 1 +P 2 +P 3 +P 4 Control P n =P 1 +P 2 +P 3 +P 4 P1,P2 P3,P4 P1,P2 P3,P4 Video Memory P1,P2 P3,P4 CUDA GPU Computing Thread Execution Manager Control P n =P 1 +P 2 +P 3 +P 4 Control P n =P 1 +P 2 +P 3 +P 4 Parallel Data Cache Shared Data P 1 P 2 P 3 P 4 P 5 DRAM Multiple passes through video memory Control P n =P 1 +P 2 +P 3 +P 4 Data/Computation Parallel execution through cache Program/Control

Parallel Data Cache Addresses a fundamental problem of stream computing Bring the data closer to the Stage computation for the parallel data cache Minimize trips to external memory Share values to

15 Parallel Data Cache Addresses a fundamental problem of stream computing Bring the data closer to the Stage computation for the parallel data cache Minimize trips to external memory Share values to minimize overfetch and computation Increases arithmetic intensity by keeping data close to the processors User managed generic memory, threads read/write arbitrarily Thread Execution Manager Control P n =P 1 +P 2 +P 3 +P 4 Control P n =P 1 +P 2 +P 3 +P 4 Control P n =P 1 +P 2 +P 3 +P 4 Parallel Data Cache Shared Data P 1 P 2 P 3 P 4 P 5 DRAM Parallel execution through cache

16 Streaming vs. GPU Computing Streaming Gather in, Restricted write Memory is far from No inter-element communication GPGPU CUDA CUDA More general data parallel model Full Scatter / Gather PDC brings the data closer to the App decides how to decompose the problem across threads Share and communicate between threads to solve problems efficiently

17 GeForce 8800 GTX Graphics Board Core Multi- Processors Shader Memory Memory 575MHz MHz 900MHz 768MB GDDR3 $599 e-taile

GeForce 8800 GPU Computing Processors execute computing threads Host Input Assembler Thread Execution Manager Thread Processors Thread Processors Thread Processors Thread Processors Thread Processors

18 GeForce 8800 GPU Computing Processors execute computing threads Host Input Assembler Thread Execution Manager Thread Processors Thread Processors Thread Processors Thread Processors Thread Processors Thread Processors Thread Processors Thread Processors Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Load/store Global Memory

19 Thread Processor Parallel Data Cache Thread Processors Parallel Data Cache 128, 1.35 GHz processors 16KB Parallel Data Cache per cluster Scalar architecture IEEE 754 Precision Full featured instruction set

20 CUDA Programming Model

21 Programming Model: A Highly Multi-threaded Coprocessor The GPU is viewed as a compute device that: Is a coprocessor to the CPU or host Has its own DRAM (device memory) Runs many threads in parallel Data-parallel portions of an application execute on the device as kernels which run many cooperative threads in parallel Differences between GPU and CPU threads GPU threads are extremely lightweight Very little creation overhead GPU needs 1000s of threads for full efficiency Multi-core CPU needs only a few

22 C on the GPU A simple, explicit programming language solution Extend only where necessary global void KernelFunc(...); device int GlobalVar; shared int SharedVar; KernelFunc<<< 500, 128 >>>(...);

23 Runtime Component: Memory Management Explicit GPU memory allocation Returns pointers to GPU memory Device memory allocation cudamalloc(), cudafree() Memory copy from host to device, device to host, device to device cudamemcpy(), cudamemcpy2d(),... cudagetsymboladdress() OpenGL & DirectX interoperability cudaglmapbufferobject()

24 CUDA SDK Standard Libraries: FFT, BLAS, Integrated CPU and GPU C Source Code NVIDIA C Compiler NVIDIA Assembly for Computing CPU Host Code CUDA Runtime & Driver Profiler

25 CUDA Stable Fluids Demo CUDA port of: Jos Stam, "Stable Fluids", In SIGGRAPH 99 Conference Proceedings, Annual Conference Series, August 1999,

26 New Applications Enabled by CUDA 197x CUDA Advantage 20x 10x 47x Rigid Body Physics Solver Matrix Numerics Wave Equation Biological Sequence Match Finance GeForce 8800 vs GHz Core 2 Duo

27 CUDA Performance Advantages Performance: Benefits: BLAS1: 60+ GB/sec Leveraging the parallel data cache BLAS3: 100+ GFLOPS GPU memory bandwidth FFT: 52 benchfft* GFLOPS GPU GFLOPS performance FDTD: 1.2 Gcells/sec Custom hardware intrinsics SSEARCH: 5.2 Gcells/sec sinf, cosf, expf, logf,... Black Scholes: 4.7 GOptions/sec All benchmarks are compiled code! *GFLOPS as defined by

28 Conclusions GPU Computing on GeForce 8800 Simple threading model Parallel data cache General global memory access CUDA Programming model C on GPUs Tool chain and driver designed for computation Libraries optimized for GPU Computing CUFFT, CUBLAS Availability Linux and Windows Register for the Beta online

29 Questions?

GPU Parallel Computing Architecture and CUDA Programming Model

GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel