Brook for GPUs: Stream Computing on Graphics Hardware



Similar documents
Data Parallel Computing on Graphics Hardware. Ian Buck Stanford University

General Purpose Computation on Graphics Processors (GPGPU) Mike Houston, Stanford University

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

GPGPU Computing. Yong Cao

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

L20: GPU Architecture and Models

Interactive Level-Set Deformation On the GPU

Introduction to GPU Architecture

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

Introduction to GPU Programming Languages

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

CSE 564: Visualization. GPU Programming (First Steps) GPU Generations. Klaus Mueller. Computer Science Department Stony Brook University

Introduction to GPGPU. Tiziano Diamanti

Hardware design for ray tracing

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university

GPGPU: General-Purpose Computation on GPUs

Interactive Level-Set Segmentation on the GPU

Computer Graphics Hardware An Overview

Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions

GPU Shading and Rendering: Introduction & Graphics Hardware

QCD as a Video Game?

Optimization for DirectX9 Graphics. Ashu Rege

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Radeon HD 2900 and Geometry Generation. Michael Doggett

Flexible Agent Based Simulation for Pedestrian Modelling on GPU Hardware

Shader Model 3.0. Ashu Rege. NVIDIA Developer Technology Group

GPU Parallel Computing Architecture and CUDA Programming Model

GPU(Graphics Processing Unit) with a Focus on Nvidia GeForce 6 Series. By: Binesh Tuladhar Clay Smith

A Survey of General-Purpose Computation on Graphics Hardware

ATI Radeon 4800 series Graphics. Michael Doggett Graphics Architecture Group Graphics Product Group

GPU for Scientific Computing. -Ali Saleh

GPU Architecture. Michael Doggett ATI

Developer Tools. Tim Purcell NVIDIA

Writing Applications for the GPU Using the RapidMind Development Platform

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

Graphic Processing Units: a possible answer to High Performance Computing?

Choosing a Computer for Running SLX, P3D, and P5

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005

Silverlight for Windows Embedded Graphics and Rendering Pipeline 1

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

Next Generation GPU Architecture Code-named Fermi

Introduction to Computer Graphics

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

NVPRO-PIPELINE A RESEARCH RENDERING PIPELINE MARKUS TAVENRATH MATAVENRATH@NVIDIA.COM SENIOR DEVELOPER TECHNOLOGY ENGINEER, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Realtime Ray Tracing and its use for Interactive Global Illumination

GPU architecture II: Scheduling the graphics pipeline

CUBE-MAP DATA STRUCTURE FOR INTERACTIVE GLOBAL ILLUMINATION COMPUTATION IN DYNAMIC DIFFUSE ENVIRONMENTS

Evaluation of CUDA Fortran for the CFD code Strukti

GUI GRAPHICS AND USER INTERFACES. Welcome to GUI! Mechanics. Mihail Gaianu 26/02/2014 1

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

A Crash Course on Programmable Graphics Hardware

OpenGL Performance Tuning

Web Based 3D Visualization for COMSOL Multiphysics

Introduction to GPU hardware and to CUDA

Reduced Precision Hardware for Ray Tracing. Sean Keely University of Texas, Austin

CUDA programming on NVIDIA GPUs

Real-Time Graphics Architecture

ST810 Advanced Computing

HPC with Multicore and GPUs

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

GPU Architecture. An OpenCL Programmer s Introduction. Lee Howes November 3, 2010

The Future Of Animation Is Games

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

GPUs for Scientific Computing

GPGPU for Real-Time Data Analytics: Introduction. Nanyang Technological University, Singapore 2

Multi-Threading Performance on Commodity Multi-Core Processors

Medical Image Processing on the GPU. Past, Present and Future. Anders Eklund, PhD Virginia Tech Carilion Research Institute

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

INTRODUCTION TO RENDERING TECHNIQUES

Data Visualization Using Hardware Accelerated Spline Interpolation

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

System requirements for Autodesk Building Design Suite 2017

Texture Cache Approximation on GPUs

GPU Computing - CUDA

OpenACC 2.0 and the PGI Accelerator Compilers

Introduction to GPU Computing

Autodesk Revit 2016 Product Line System Requirements and Recommendations

Le langage OCaml et la programmation des GPU

GPGPU Parallel Merge Sort Algorithm

Advanced Rendering for Engineering & Styling

Stream Processing on GPUs Using Distributed Multimedia Middleware

2) Write in detail the issues in the design of code generator.

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

Clustering Billions of Data Points Using GPUs

Instruction Set Architecture (ISA) Design. Classification Categories

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

Transcription:

Brook for GPUs: Stream Computing on Graphics Hardware Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan Computer Science Department Stanford University

recent trends multiplies per second NVIDIA NV30, 35, 40 GFLOPS ATI R300, 360, 420 Pentium 4 July 01 Jan 02 July 02 Jan 03 July 03 Jan 04 SIGGRAPH 2004 2

recent trends GPU-based SIGGRAPH/Graphics Hardware papers 13 4 July 01 Jan 02 July 02 Jan 03 July 03 Jan 04 SIGGRAPH 2004 3

domain specific solutions map directly to graphics primitives requires extensive knowledge of GPU programming SIGGRAPH 2004 4

building an abstraction general GPU computing question can we simplify GPU programming? what is the correct abstraction for GPU-based computing? what is the scope of problems that can be implemented efficiently on the GPU? SIGGRAPH 2004 5

contributions Brook stream programming environment for GPU-based computing language, compiler, and runtime system virtualizing or extending GPU resources analysis of when GPUs outperform CPUs SIGGRAPH 2004 6

GPU programming model each fragment shaded independently no dependencies between fragments temporary registers are zeroed no static variables no read-modify-write textures multiple pixel pipes SIGGRAPH 2004 7

GPU = data parallel each fragment shaded independently no dependencies between fragments temporary registers are zeroed no static variables no read-modify-write textures multiple pixel pipes data parallelism support ALU heavy architectures hide memory latency [Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98] SIGGRAPH 2004 8

compute vs. bandwidth GFLOPS 7x Gap GFloats/sec R300 R360 R420 ATI Hardware SIGGRAPH 2004 9

compute vs. bandwidth arithmetic intensity = compute-to-bandwidth ratio graphics pipeline vextex BW: 1 vertex = 32 bytes; OP: 100-500 f32-ops / vertex fragment BW: 1 fragment = 10 bytes OP: 300-1000 i8-ops/fragment SIGGRAPH 2004 10

Brook language stream programming model enforce data parallel computing streams encourage arithmetic intensity kernels SIGGRAPH 2004 11

design goals general purpose computing GPU = general streaming-coprocessor GPU-based computing for the masses no graphics experience required eliminating annoying GPU limitations performance platform independent ATI & NVIDIA DirectX & OpenGL Windows & Linux SIGGRAPH 2004 12

Other languages Cg / HLSL / OpenGL Shading Language + C-like language for expressing shader computation graphics execution model requires graphics API for data management and shader execution Sh [McCool et al. '04] + functional approach for specifying shaders evolved from a shading language Connection Machine C* StreamIt, StreamC & KernelC, Ptolemy SIGGRAPH 2004 13

Brook language C with streams streams collection of records requiring similar computation particle positions, voxels, FEM cell, Ray r<200>; float3 velocityfield<100,100,100>; data parallelism provides data to operate on in parallel SIGGRAPH 2004 14

kernels kernels functions applied to streams similar to for_all construct no dependencies between stream elements kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } float a<100>; float b<100>; float c<100>; foo(a,b,c); for (i=0; i<100; i++) c[i] = a[i]+b[i]; SIGGRAPH 2004 15

kernels kernels arguments input/output streams kernel void foo (float a<>, float b<>, out float result<>) { result = a + b; } SIGGRAPH 2004 16

kernels kernels arguments input/output streams gather streams kernel void foo (..., float array[] ) { a = array[i]; } SIGGRAPH 2004 17

kernels kernels arguments input/output streams gather streams iterator streams kernel void foo (..., iter float n<> ) { a = n + b; } SIGGRAPH 2004 18

kernels kernels arguments input/output streams gather streams iterator streams constant parameters kernel void foo (..., float c ) { a = c + b; } SIGGRAPH 2004 19

kernels why not allow direct array operators? A + B * C arithmetic intensity temporaries kept local to computation explicit communication kernel arguments Ray-triangle intersection kernel void krnintersecttriangle(ray ray<>, Triangle tris[], RayState oldraystate<>, GridTrilist trilist[], out Hit candidatehit<>) { float idx, det, inv_det; float3 edge1, edge2, pvec, tvec, qvec; if(oldraystate.state.y > 0) { idx = trilist[oldraystate.state.w].trinum; edge1 = tris[idx].v1 - tris[idx].v0; edge2 = tris[idx].v2 - tris[idx].v0; pvec = cross(ray.d, edge2); det = dot(edge1, pvec); inv_det = 1.0f/det; tvec = ray.o - tris[idx].v0; candidatehit.data.y = dot( tvec, pvec ); qvec = cross( tvec, edge1 ); candidatehit.data.z = dot( ray.d, qvec ); candidatehit.data.x = dot( edge2, qvec ); candidatehit.data.xyz *= inv_det; candidatehit.data.w = idx; } else { candidatehit.data = float4(0,0,0,-1); } } SIGGRAPH 2004 20

reductions reductions compute single value from a stream reduce void sum (float a<>, reduce float r<>) r += a; } SIGGRAPH 2004 21

reductions reductions compute single value from a stream reduce void sum (float a<>, reduce float r<>) r += a; } float a<100>; float r; sum(a,r); r = a[0]; for (int i=1; i<100; i++) r += a[i]; SIGGRAPH 2004 22

reductions reductions associative operations only (a+b)+c = a+(b+c) sum, multiply, max, min, OR, AND, XOR matrix multiply permits parallel execution SIGGRAPH 2004 23

system outline brcc source to source compiler generate CG & HLSL code CGC and FXC for shader assembly virtualization brt Brook run-time library stream texture management kernel shader execution SIGGRAPH 2004 24

eliminating GPU limitations treating texture as memory limited texture size and dimension compiler inserts address translation code float matrix<8096,10,30,5>; SIGGRAPH 2004 25

eliminating GPU limitations extending kernel outputs duplicate kernels, let cgc or fxc do dead code elimination better solution: "Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware Tim Foley, Mike Houston, and Pat Hanrahan "Mio: Fast Multipass Partitioning via Priority-Based Instruction Scheduling" Andrew T. Riffel, Aaron E. Lefohn, Kiril Vidimce, Mark Leone, and John D. Owens SIGGRAPH 2004 26

applications ray-tracer segmentation SAXPY SGEMV fft edge detect linear algebra

evaluation Relative Performance 7 ATI Radeon X800 XT 6 5 4 3 2 1 NVIDIA GeForce 6800 Pentium 4 3.0 GHz compared against: Intel Math Library Atlas Math Library cached blocked segmentation FFTW Wald ['04] SSE Ray-Triangle SAXPY Segment SGEMV FFT Ray-tracer SIGGRAPH 2004 28

evaluation Relative Performance 7 6 5 4 3 2 1 GPU wins when limited data reuse SAXPY FFT Pentium 4 3.0 GHz 44 GB/sec peak cache bandwidth NVIDIA GeForce 6800 Ultra 36 GB/sec peak memory bandwidth SAXPY FFT SIGGRAPH 2004 29

evaluation Relative Performance 7 6 5 4 3 2 1 GPU wins when arithmetic intensity Segment 3.7 ops per word SGEMV 1/3 ops per word Segment SGEMV SIGGRAPH 2004 30

outperforming the CPU considering GPU transfer costs: T r computational intensity: γ γ K gpu / T r work per word transferred considering CPU cost to issuing a kernel SIGGRAPH 2004 31

efficiency Brook version within 80% of hand-coded GPU version FF T Relative Performance 1 ATI Pentium 4 Hand coded Brook Hand coded SIGGRAPH 2004 32 C

summary GPUs are faster than CPUs and getting faster why? data parallelism arithmetic intensity what is the right programming model? Brook stream computing SIGGRAPH 2004 33

summary GPU-based computing for the masses bioinfomatics rendering simulation statistics SIGGRAPH 2004 34

acknowledgements paper Bill Mark (UT-Austin) Nick Triantos, Tim Purcell (NVIDIA) Mark Segal (ATI) Kurt Akeley Reviewers sponsors language Stanford Merrimac Group Reservoir Labs DARPA contract MDA904-98-R-S855, F29601-00-2-0085 DOE ASC contract LLL-B341491 NVIDIA, ATI, IBM, Sony Rambus Stanford Graduate Fellowship Stanford School of Engineering Fellowship SIGGRAPH 2004 35

Brook for GPUs release v0.3 available on Sourceforge project page http://graphics.stanford.edu/projects/brook source http://www.sourceforge.net/projects/brook over 6K downloads! interested in collaborating? fly-fishing fly images from The English Fly Fishing Shop SIGGRAPH 2004 36