Advanced Programming (GPGPU) Mike Houston


The world changed over the last year Multiple GPGPU initiatives Vendors without GPGPU talking about it A few big apps: Game physics Folding@Home Video processing Finance modeling Biomedical Real-time image processing Courses UIUC ECE 498 Supercomputing 2006 SIGGRAPH 2006/2007 Lots of academic research Actual GPGPU companies PeakStream RapidMind Accelware 2

What can you do on GPUs other than graphics? Large matrix/vector operations (BLAS) Protein Folding (Molecular Dynamics) FFT (SETI, signal processing) Ray Tracing Physics Simulation [cloth, fluid, collision] Sequence Matching (Hidden Markov Models) Speech Recognition (Hidden Markov Models, Neural nets) Databases Sort/Search Medical Imaging (image segmentation, processing) And many, many more http://www.gpgpu.org 3

Task vs. Data parallelism Task parallel Independent processes with little communication Easy to use Free on modern operating systems with SMP Data parallel Lots of data on which the same computation is being executed No dependencies between data elements in each step in the computation Can saturate many ALUs But often requires redesign of traditional algorithms 4

CPU vs. GPU CPU Really fast caches (great for data reuse) Fine branching granularity Lots of different processes/threads High performance on a single thread of execution GPU Lots of math units Fast access to onboard memory Run a program on each fragment/vertex High throughput on parallel tasks CPUs are great for task parallelism GPUs are great for data parallelism 5

The Importance of Data Parallelism for GPUs GPUs are designed for highly parallel tasks like rendering GPUs process independent vertices and fragments Temporary registers are zeroed No shared or static data No read-modify-write buffers In short, no communication between vertices or fragments Data-parallel processing GPU architectures are ALU-heavy Multiple vertex & pixel pipelines Lots of compute power GPU memory systems are designed to stream data Linear access patterns can be prefetched Hide memory latency 6

GPGPU Terminology 7

Arithmetic Intensity Arithmetic intensity Math operations per word transferred Computation / bandwidth Ideal apps to target GPGPU have: Large data sets High parallelism Minimal dependencies between data elements High arithmetic intensity Lots of work to do without CPU intervention 8
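To make arithmetic intensity concrete, here is a small illustration (not from the slides) using SAXPY as the low-intensity case; the numbers in the comments are simple operation and word counts.

```cpp
// Hypothetical example: arithmetic intensity of SAXPY.
// Each element does 2 floating-point ops (1 mul + 1 add) while moving 3 words
// (read x[i], read y[i], write y[i]), so intensity is roughly 2/3 op per word --
// far too low to keep a GPU's ALUs busy. By contrast, dense matrix multiply
// performs O(N^3) ops on O(N^2) data, so its intensity grows with N.
void saxpy(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```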

Data Streams & Kernels Streams Collection of records requiring similar computation Vertex positions, voxels, FEM cells, etc. Provide data parallelism Kernels Functions applied to each element in the stream (transforms, PDEs, ...) No dependencies between stream elements Encourage high arithmetic intensity 9

Scatter vs. Gather Gather Indirect read from memory ( x = a[i] ) Naturally maps to a texture fetch Used to access data structures and data streams Scatter Indirect write to memory ( a[i] = x ) Difficult to emulate: Render to vertex array Sorting buffer Needed for building many data structures Usually done on the CPU 10
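To make the distinction concrete, a minimal CPU-side sketch (illustrative only, not GPU code): gather is an indirect read, which a fragment shader can express as a texture fetch, while scatter is an indirect write, which a fragment shader cannot express because its output address is fixed.

```cpp
// Gather:  x[i] = a[idx[i]]  (indirect read  -> maps to a texture fetch)
// Scatter: a[idx[i]] = x[i]  (indirect write -> no fragment-shader equivalent)
void gather(int n, const int* idx, const float* a, float* x) {
    for (int i = 0; i < n; ++i)
        x[i] = a[idx[i]];
}

void scatter(int n, const int* idx, const float* x, float* a) {
    for (int i = 0; i < n; ++i)
        a[idx[i]] = x[i];
}
```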

Mapping algorithms to the GPU 11

Mapping CPU algorithms to the GPU
Basics:
  Stream/Arrays -> Textures
  Parallel loops -> Quads
  Loop body -> Vertex + fragment program
  Output arrays -> Render targets
  Memory read -> Texture fetch
  Memory write -> Framebuffer write
Controlling the parallel loop:
  Rasterization = Kernel invocation
  Texture coordinates = Computational domain
  Vertex coordinates = Computational range 12

Computational Resources Programmable parallel processors Vertex & Fragment pipelines Rasterizer Mostly useful for interpolating values (texture coordinates) and per-vertex constants Texture unit Read-only memory interface Render to texture Write-only memory interface 13

Vertex Processors Fully programmable (SIMD / MIMD) Processes 4-vectors (RGBA / XYZW) Capable of scatter but not gather Can change the location of current vertex Cannot read info from other vertices Can only read a small constant memory Vertex Texture Fetch Random access memory for vertices Limited gather capabilities Can fetch from texture Cannot fetch from current vertex stream 14

Fragment Processors Fully programmable (SIMD) Processes 4-component vectors (RGBA / XYZW) Random access memory read (textures) Generally capable of gather but not scatter Indirect memory read (texture fetch), but no indirect memory write Output address fixed to a specific pixel Typically more useful than vertex processor More fragment pipelines than vertex pipelines Direct output (fragment processor is at end of pipeline) Better memory read performance For GPGPU, we mainly concentrate on using the fragment processors Most of the flops Highest memory bandwidth 15

And then they were unified The current trend is to unify shading resources DX10 vertex/geometry/fragment shading have similar capabilities Just a pool of processors, scheduled by the hardware dynamically You can get all of the board's resources through each stage (figure: NVIDIA 8800GTX) 16

GPGPU example Adding Vectors
Place arrays into 2D textures
Convert loop body into a shader
Loop body = Render a quad
  Needs to cover all the pixels in the output
  1:1 mapping between pixels and texels
Readback framebuffer into result array

CPU version:
  float a[5*5];
  float b[5*5];
  float c[5*5];
  // initialize vector a
  // initialize vector b
  for (int i = 0; i < 5*5; i++) {
      c[i] = a[i] + b[i];
  }

(Figure: a 5x5 texture holding elements 0..24)

GPU fragment program:
  !!ARBfp1.0
  TEMP R0;
  TEMP R1;
  TEX R0, fragment.position, texture[0], 2D;
  TEX R1, fragment.position, texture[1], 2D;
  ADD R0, R0, R1;
  MOV fragment.color, R0; 17

How this basically works - Adding vectors
  Bind Input Textures: Vector A, Vector B
  Bind Render Targets: Vector C
  Load Shader
  Set Shader Params
  Render Quad
  Readback Buffer: Vector C (C = A + B)

  !!ARBfp1.0
  TEMP R0;
  TEMP R1;
  TEX R0, fragment.position, texture[0], 2D;
  TEX R1, fragment.position, texture[1], 2D;
  ADD R0, R0, R1;
  MOV fragment.color, R0; 18

Rolling your own GPGPU apps Lots of information on GPGPU.org For those with a strong graphics background: Do all the graphics setup yourself Write your kernels: Use high level languages Cg, HLSL, ASHLI Or, direct assembly ARB_fragment_program, ps20, ps2a, ps2b, ps30 High level languages and systems to make GPGPU easier BrookGPU (http://graphics.stanford.edu/projects/brookgpu/) RapidMind (http://www.rapidmind.net) PeakStream (http://www.peakstreaminc.com) CUDA NVIDIA (http://developer.nvidia.com/cuda) CTM AMD/ATI (ati.amd.com/companyinfo/researcher/documents.html ) 19

Basic operations Map Reduce Scan Gather/Scatter Covered earlier 20

Map operation Given: an array or stream of data elements A and a function ƒ(x), map(A, ƒ) applies ƒ(x) to all a_i in A. The GPU implementation is straightforward: A is a texture, the a_i are texels; the pixel shader implements ƒ(x), reading a_i as x; draw a quad with as many pixels as texels in A with the ƒ(x) pixel shader active; the output(s) are stored in another texture. Courtesy John Owens 21
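For reference, a minimal sequential sketch of map (the names and types here are illustrative); the GPU performs the same operation by drawing a quad with the ƒ(x) pixel shader bound, producing one output texel per element.

```cpp
#include <vector>

// Apply f to every element of a; conceptually, one fragment per element.
template <typename F>
std::vector<float> map_stream(const std::vector<float>& a, F f) {
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = f(a[i]);
    return out;
}
```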

Parallel Reductions Given: a binary associative operator ⊕ with identity I and an ordered set s = [a_0, a_1, ..., a_n-1] of n elements, Reduce(⊕, s) returns a_0 ⊕ a_1 ⊕ ... ⊕ a_n-1. Example: Reduce(+, [3 1 7 0 4 1 6 3]) = 25. Reductions are common in parallel algorithms; common reduction operators are +, ×, min, and max. Note that floating point is only pseudo-associative. Courtesy John Owens 22
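A sketch of the logarithmic pass structure a data-parallel reduction uses, assuming for simplicity that n is a nonzero power of two; on a GPU each pass corresponds to rendering a quad half the size of the previous one.

```cpp
#include <vector>

float reduce_add(std::vector<float> v) {
    // Each pass combines element i with element i + active/2.
    for (std::size_t active = v.size(); active > 1; active /= 2)
        for (std::size_t i = 0; i < active / 2; ++i)
            v[i] += v[i + active / 2];
    return v[0];
}
// reduce_add({3, 1, 7, 0, 4, 1, 6, 3}) == 25
```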

Parallel Scan (aka prefix sum) Given: a binary associative operator ⊕ with identity I and an ordered set s = [a_0, a_1, ..., a_n-1] of n elements, scan(⊕, s) returns [a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_n-1)]. Example: scan(+, [3 1 7 0 4 1 6 3]) = [3 4 11 11 15 16 22 25]. (From Blelloch, 1990, "Prefix Sums and Their Applications") Courtesy John Owens 23
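And the sequential reference for inclusive scan, which shows what the output means; the parallel GPU versions (e.g. Blelloch's algorithm) produce the same result in O(log n) steps.

```cpp
#include <vector>

std::vector<float> inclusive_scan(const std::vector<float>& a) {
    std::vector<float> out(a.size());
    float running = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        running += a[i];      // a_0 + a_1 + ... + a_i
        out[i] = running;
    }
    return out;
}
// inclusive_scan({3, 1, 7, 0, 4, 1, 6, 3}) == {3, 4, 11, 11, 15, 16, 22, 25}
```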

Applications of Scan Radix sort Quicksort String comparison Lexical analysis Stream compaction Polynomial evaluation Solving recurrences Tree operations Histograms Courtesy John Owens 24

Brook: General Purpose Streaming Language Stream programming model GPU = streaming coprocessor C with stream extensions Cross platform ATI & NVIDIA OpenGL, DirectX, CTM Windows & Linux 25

Streams
Collection of records requiring similar computation
  particle positions, voxels, FEM cells, ...
  Ray r<200>;
  float3 velocityfield<100,100,100>;
Similar to arrays, but index operations are disallowed: position[i]
Read/write stream operators:
  streamread (r, r_ptr);
  streamwrite (velocityfield, v_ptr); 26

Kernels
Functions applied to streams
  similar to a for_all construct
  no dependencies between stream elements

  kernel void foo (float a<>, float b<>, out float result<>) {
      result = a + b;
  }

  float a<100>;
  float b<100>;
  float c<100>;
  foo(a, b, c);

Equivalent C loop:
  for (i=0; i<100; i++)
      c[i] = a[i] + b[i]; 27

Kernels Kernel arguments input/output streams kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } 28

Kernels Kernel arguments input/output streams gather streams kernel void foo (..., float array[] ) { a = array[i]; } 29

Kernels Kernel arguments input/output streams gather streams iterator streams kernel void foo (..., iter float n<> ) { a = n + b; } 30

Kernels Kernel arguments input/output streams gather streams iterator streams constant parameters kernel void foo (..., float c ) { a = c + b; } 31

Reductions
Compute a single value from a stream
  associative operations only

  reduce void sum (float a<>, reduce float r<>) {
      r += a;
  }

  float a<100>;
  float r;
  sum(a, r);

Equivalent C loop:
  r = a[0];
  for (int i=1; i<100; i++)
      r += a[i]; 32

Reductions
Multi-dimensional reductions
  stream shape differences are resolved by the reduce function

  reduce void sum (float a<>, reduce float r<>) {
      r += a;
  }

  float a<20>;
  float r<5>;
  sum(a, r);

Equivalent C loops:
  for (int i=0; i<5; i++) {
      r[i] = a[i*4];
      for (int j=1; j<4; j++)
          r[i] += a[i*4 + j];
  } 33

Stream Repeat & Stride
Kernel arguments of different shape are resolved by repeat and stride

  kernel void foo (float a<>, float b<>, out float result<>);

  float a<20>;
  float b<5>;
  float c<10>;
  foo(a, b, c);

Resulting invocations (a is strided by 2, b is repeated twice):
  foo(a[0], b[0], c[0])
  foo(a[2], b[0], c[1])
  foo(a[4], b[1], c[2])
  foo(a[6], b[1], c[3])
  foo(a[8], b[2], c[4])
  foo(a[10], b[2], c[5])
  foo(a[12], b[3], c[6])
  foo(a[14], b[3], c[7])
  foo(a[16], b[4], c[8])
  foo(a[18], b[4], c[9]) 34
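The expansion above follows a simple index rule; the sketch below is a CPU illustration of that rule (not Brook's actual implementation): each output index is scaled by len(argument)/len(output), which strides through the longer stream and repeats elements of the shorter one.

```cpp
#include <cstdio>

// For output element i, read argument element i * arg_len / out_len.
// a<20> against c<10> gives stride 2; b<5> against c<10> repeats each element twice.
int resolve(int i, int arg_len, int out_len) {
    return i * arg_len / out_len;
}

int main() {
    for (int i = 0; i < 10; ++i)
        std::printf("foo(a[%d], b[%d], c[%d])\n",
                    resolve(i, 20, 10), resolve(i, 5, 10), i);
    return 0;
}
```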

Matrix Vector Multiply
  kernel void mul (float a<>, float b<>, out float result<>) {
      result = a*b;
  }
  reduce void sum (float a<>, reduce float result<>) {
      result += a;
  }

  float matrix<20,10>;
  float vector<1, 10>;
  float tempmv<20,10>;
  float result<20, 1>;

  mul(matrix, vector, tempmv);
  sum(tempmv, result);

(Figure: the vector V is repeated against the matrix M to form the elementwise product T) 35

Matrix Vector Multiply (continued): the same mul and sum kernels as above. (Figure: sum() reduces the temporary T to the result R) 36
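A CPU sketch of the same two-stage structure, assuming a row-major 20x10 matrix (the layout here is an assumption for illustration): stage one is the elementwise mul kernel with the vector repeated across rows, stage two is the per-row sum reduction.

```cpp
#include <vector>

std::vector<float> matvec(const std::vector<float>& m,   // 20*10, row-major
                          const std::vector<float>& v) { // 10 elements
    const int rows = 20, cols = 10;
    std::vector<float> temp(rows * cols), result(rows, 0.0f);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            temp[r * cols + c] = m[r * cols + c] * v[c];  // mul kernel: T = M .* V
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            result[r] += temp[r * cols + c];              // sum reduction per row
    return result;
}
```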

Runtime
Accessing stream data for graphics apps
  Brook runtime API available in C++ code
  autogenerated .hpp files for Brook code

  brook::initialize( "dx9", (void*)device );

  // Create streams
  fluidstream0 = stream::create<float4>( kfluidsize, kfluidsize );
  normalstream = stream::create<float3>( kfluidsize, kfluidsize );

  // Get a handle to the texture being used by
  // the normal stream as a backing store
  normaltexture = (IDirect3DTexture9*)
      normalstream->getindexedfieldrenderdata(0);

  // Call the simulation kernel
  simulationkernel( fluidstream0, fluidstream0, controlconstant, fluidstream1 ); 37

Applications ray-tracer segmentation SAXPY SGEMV fft edge detect linear algebra 38

Brook for GPUs Release v0.4 available on Sourceforge CVS tree *much* more up to date and includes CTM support Project Page http://graphics.stanford.edu/projects/brook Source http://www.sourceforge.net/projects/brook Paper: Brook for GPUs: Stream Computing on Graphics Hardware Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan Fly-fishing fly images from The English Fly Fishing Shop 39

Understanding GPUs Through Benchmarking

Introduction Key areas for GPGPU Memory latency behavior Memory bandwidths Upload/Download Instruction rates Branching performance Chips analyzed ATI X1900XTX (R580) NVIDIA 7900GTX (G71) NVIDIA 8800GTX (G80) 41

GPUBench An open-source suite of micro-benchmarks GL (we'll be using this for the talk) DX9 (alpha version) Developed at Stanford to aid our understanding of GPUs Vendors wouldn't directly tell us arch details Behavior under GPGPU apps is different than under games and other benchmarks Library of results http://graphics.stanford.edu/projects/gpubench/ 42

Memory latency Questions Can latency be hidden? Does access pattern affect latency? 43

Methodology Try different numbers of texture fetches Different access patterns: Cache hit every fetch to the same texel Sequential every fetch increments address by 1 Random dependent lookup with random texture Increase the ALU ops of the shader ALU ops must be dependent to avoid optimization GPUBench test: fetchcost 44

Fetch cost - ATI, cache hit (figure: ATI X1800XT and ATI X1900XTX; the X1900XTX has 3x the ALUs per pipe; annotations at 4 and 12 ALU ops). Cost = max(ALU, TEX) 45

Fetch cost - ATI, sequential (figure: ATI X1800XT and ATI X1900XTX; the X1900XTX has 3x the ALUs per pipe; annotations at 8 and 24 ALU ops). Cost = max(ALU, TEX) 46

Fetch cost - NVIDIA, cache hit (figure: NVIDIA 7900 GTX; 4 ALU op penalty). Cost = sum(ALU, TEX) 47

Fetch cost - NVIDIA, sequential (figure: NVIDIA 7900 GTX; 8 ALU op issue penalty). Cost = sum(ALU, TEX) 48

Fetch cost - NVIDIA 8800 GTX, cache hit and sequential (figure: NVIDIA 8800 GTX; annotations at 4 and 8 ALU ops). Cost = max(ALU, TEX) 49
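The takeaway from these plots can be condensed into two approximate cost models (a simplification of the measured behavior, not an exact formula): on the ATI R5xx parts and the NVIDIA 8800 GTX, texture fetch cost overlaps with independent math, while on the NVIDIA 7900 GTX it largely adds to it.

```cpp
#include <algorithm>

// Approximate per-fragment cost, in ALU-op units, suggested by the fetch-cost results.
float cost_overlapped(float alu_ops, float tex_cost) {   // ATI R5xx, NVIDIA 8800 GTX
    return std::max(alu_ops, tex_cost);
}
float cost_serialized(float alu_ops, float tex_cost) {   // NVIDIA 7900 GTX
    return alu_ops + tex_cost;
}
```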

Bandwidth to ALUs Questions Cache performance? Sequential performance? Random-read performance? 50

Methodology Cache hit Use a constant as index to texture(s) Sequential Use fragment position to index texture(s) Random Index a seeded texture with fragment position to look up into input texture(s) GPUBench test: inputfloatbandwidth 51

Results (figure: ATI X1900XTX vs. NVIDIA 7900GTX; annotations: better random bandwidth, better effective cache bandwidth, sequential bandwidth (SEQ) about the same) 52

Results (figure: NVIDIA 7900GTX vs. NVIDIA 8800GTX; the 8800GTX has 2x the bandwidth of the 7900GTX) 53

Off-board bandwidth Questions How fast can we get data on the board (download)? How fast can we get data off the board (readback)? GPUBench tests: download readback 54

Download - host to GPU is slow (figure: ATI X1900XTX, NVIDIA 7900GTX) 55

Download - the next generation is not much better (figure: NVIDIA 7900GTX, NVIDIA 8800GTX) 56

Readback - GPU to host is slow; ATI GL readback performance is abysmal (figure: ATI X1900XTX, NVIDIA 7900GTX) 57

Readback - the next generation is not much better (figure: NVIDIA 7900GTX, NVIDIA 8800GTX) 58

Instruction Issue Rate Questions What is the raw performance achievable? Do different instructions have different costs? Vector vs. scalar issue differences? 59

Methodology Write long shaders with dependent instructions >100 instructions All instructions dependent But try to structure to allow for multi-issue Test float1 vs. float4 performance GPUBench tests: instrissue 60
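The key trick in this methodology is making every instruction depend on the previous one so the compiler cannot collapse the chain; a CPU analogue of that structure (illustrative only, not the actual GPUBench shaders) looks like this:

```cpp
// A long chain of dependent MADs: each iteration consumes the previous result,
// so the work cannot be removed or reordered -- the same structure the
// >100-instruction benchmark shaders use.
float dependent_chain(float x) {
    for (int i = 0; i < 128; ++i)
        x = x * 1.0000001f + 1e-7f;
    return x;
}
```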

Results - vector issue (figure: ATI X1900XTX and NVIDIA 7900GTX; the legend marks instructions that are more costly than others) 61

Results - vector issue (figure: ATI X1900XTX and NVIDIA 7900GTX): ADD/SUB are faster; peak (single-instruction) FLOPS come with MAD 62

Results - vector issue (figure: NVIDIA 7900GTX and NVIDIA 8800GTX): the 8800GTX is 37% faster (peak) 63

When benchmarks go wrong Smart compilers can subvert testing by optimizing away shaders; a bug was found in a previous subtract test, and no clever way to write an RCP test has been found yet. Always sanity-check results against theoretical peak! (Figure: NVIDIA 7800GTX, GPUBench 1.2) 64

Results - scalar issue (figure: NVIDIA 7900GTX and NVIDIA 8800GTX): the 8800GTX is a scalar-issue processor 65

Branching Performance Questions Is predication better than branching? Is using Early-Z culling a better option? What is the cost of branching? What branching granularity is required? How much can I really save branching around heavy computation? 66

Methodology Early-Z: Set a Z-buffer and compare function to mask out compute; change the coherence of blocks; change the sizes of blocks; set differing amounts of pixels to be drawn. Shader branching: if { do a little } else { LOTS of math }; change the coherence of blocks; change the sizes of blocks; have differing amounts of pixels execute the heavy math branch. GPUBench tests: branching 67

Results - Early-Z, NVIDIA (figure: NVIDIA 7900GTX): random masking is bad; 4x4 coherence is almost perfect 68

Results - Branching, NVIDIA (figure: NVIDIA 7900GTX): fully coherent branching has good performance, but there is overhead 69

Results - Branching, NVIDIA (figure: NVIDIA 7900GTX): performance increases with branch coherence; needs > 32x32 branch coherence 70

Results - Branching, NVIDIA (figure: NVIDIA 8800GTX): performance increases with branch coherence; needs > 16x16 branch coherence (it turns out 16x4 is as good as 16x16) 71
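The reason coherence matters is that fragments are shaded in SIMD batches: if any fragment in a batch takes the expensive path, the whole batch pays for it, so branching only saves time when entire blocks of fragments agree. A rough cost model (an illustration, not a measured formula):

```cpp
// Per-batch cost under divergence: savings appear only when an entire batch
// (e.g. a 16x16 or 32x32 block of fragments) takes the cheap path together.
float batch_cost(bool any_fragment_takes_heavy_path,
                 float light_cost, float heavy_cost) {
    return any_fragment_takes_heavy_path ? heavy_cost : light_cost;
}
```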

Summary Benchmarks can help discern app behavior and architecture characteristics We use these benchmarks as predictive models when designing algorithms Folding@Home ClawHMMer CFD Be wary of driver optimizations Driver revisions change behavior Raster order, scheduler, compiler 72