Recent Trends: GPU Computation Strategies & Tricks

Transcription:

Recent Trends
GPU Computation Strategies & Tricks
Ian Buck, NVIDIA

Compute is Cheap...
- Parallelism to keep 100s of ALUs per chip busy
- Shading is highly parallel: millions of fragments per frame
[Figure: 64-bit FPU, 0.5 mm, drawn to scale on a 12 mm x 12 mm 90 nm chip. Courtesy of Bill Dally]

...but Bandwidth is Expensive
- Latency tolerance to cover 500-cycle remote memory access time
- Locality to match 20 Tb/s ALU bandwidth to ~100 Gb/s chip bandwidth
[Figure: 12 mm x 12 mm 90 nm chip, labels "0.5 mm" and "1 clock". Courtesy of Bill Dally]

Optimizing for GPUs
- Shading is compute intensive
  - 100s of floating-point operations
  - Output: one 32-bit color value
- Compute-to-bandwidth ratio: arithmetic intensity
[Figure: 12 mm x 12 mm 90 nm chip, labels "0.5 mm" and "1 clock". Courtesy of Bill Dally]

Compute vs. Bandwidth
[Chart: GFLOPS and GFloats/sec for R300, R360, R420]
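The compute-to-bandwidth ratio named on the "Optimizing for GPUs" slide can be sketched as a small calculation. This is only an illustration: the function name and the figure of 200 operations per fragment are assumptions (the slide says only "100s of floating point operations"); the two bandwidth figures are taken from the slide.

```python
def arithmetic_intensity(flops: float, words_transferred: float) -> float:
    """Operations performed per word moved to or from memory."""
    if words_transferred <= 0:
        raise ValueError("words_transferred must be positive")
    return flops / words_transferred


# Assumed example shader: 200 floating-point operations per fragment,
# one 32-bit word (the color value) written out.
shading_intensity = arithmetic_intensity(flops=200, words_transferred=1)

# Bandwidth figures quoted on the slide, both expressed in Gb/s.
alu_bandwidth_gbps = 20_000.0   # 20 Tb/s of ALU bandwidth
chip_bandwidth_gbps = 100.0     # ~100 Gb/s of off-chip bandwidth
bandwidth_gap = alu_bandwidth_gbps / chip_bandwidth_gbps

print(f"ops per word: {shading_intensity:.0f}")
print(f"ALU bandwidth is about {bandwidth_gap:.0f}x the chip bandwidth")
```

Under these assumptions, a kernel needs on the order of hundreds of operations per word transferred before the ALUs, rather than the memory interface, become the limit.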

Arithmetic Intensity
[Chart: GFLOPS and GFloats/sec for R300, R360, R420; "7x Gap"]

Arithmetic Intensity
- GPU wins when: arithmetic intensity
  - Segment: 3.7 ops per word, 11 GFLOPS
[Chart: GFLOPS, compared with Pentium 4 3.0 GHz]

Arithmetic Intensity
- Overlapping computation with communication

Memory Bandwidth
- GPU wins when: streaming memory bandwidth
  - SAXPY
  - FFT
[Chart: compared with Pentium 4 3.0 GHz]

Memory Bandwidth
- Streaming memory system
  - Optimized for sequential performance
- GPU cache is limited
  - Optimized for texture filtering
  - Read-only
  - Small
- Local storage: CPU >> GPU
[Chart: compared with Pentium 4]

Computational Intensity
- Considering GPU transfer costs: T_r
[Diagram: CPU, GPU, Memory]
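SAXPY, listed on the memory bandwidth slide, is a useful contrast to shading because it does very little arithmetic per word. The sketch below is a plain-Python reference, not GPU code, and the word count assumes no caching and that the scalar a stays in a register.

```python
def saxpy(a, x, y):
    """Return a*x + y element-wise (single-precision on real hardware)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


def saxpy_ops_per_word(n: int) -> float:
    """Arithmetic intensity of SAXPY over n elements."""
    flops = 2 * n   # one multiply and one add per element
    words = 3 * n   # read x[i], read y[i], write the result
    return flops / words


print(saxpy(2.0, [1.0, 2.0], [10.0, 20.0]))
print(f"ops per word: {saxpy_ops_per_word(1000):.2f}")
```

At roughly two thirds of an operation per word, SAXPY is limited by how fast memory can be streamed, which is why it appears under memory bandwidth rather than arithmetic intensity.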

Computational Intensity
- Considering GPU transfer costs: T_r
- Computational intensity: γ
    γ = K_gpu / T_r    (work per word transferred)
- To outperform the CPU, the GPU time per word must stay below the CPU time, T_r + K_gpu < K_cpu, which gives:
    γ > 1 / (s - 1)
  where the speedup is s = K_cpu / K_gpu

Kernel Overhead
- Considering CPU cost of issuing a kernel
  - Generating compute geometry
  - Graphics driver
[Chart: CPU limited vs. GPU limited]

Floating Point Precision
    [ s | exponent | mantissa ]
    value = sign * 1.mantissa * 2^(exponent+bias)
- NVIDIA FP32: s23e8
- ATI 24-bit float: s16e7
- NVIDIA FP16: s10e5

Floating Point Precision: Common Bug
- Pack large 1D array in 2D texture
- Compute 1D address in shader
- Convert 1D address into 2D
- FP precision will leave unaddressable texels!
- Counting limit (first integer that cannot be represented exactly):
  - NVIDIA FP32: 16,777,217
  - ATI 24-bit float: 131,073
  - NVIDIA FP16: 2,049

Problem: a[i] = p
- Indirect write
  - Can't set the x,y of fragment in pixel shader
  - Often want to do: a[i] += p

Solution 1: Convert to Gather
    for each spring
        f = computed force
        mass_force[left] += f;
        mass_force[right] -= f;
[Diagram: f1 m1 f2 m2 f3]
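The integer limits on the floating point precision slide follow from the mantissa width: a format with m stored mantissa bits and an implicit leading 1 holds every integer up to 2^(m+1) exactly, so 2^(m+1) + 1 is the first one it cannot hold. The sketch below checks this with Python's standard struct module for the 32-bit and 16-bit formats. The helper names are illustrative, the texture width of 4096 is an assumed example, and the 24-bit ATI format is only computed from the formula because Python has no such native type.

```python
import struct


def first_unrepresentable_integer(mantissa_bits: int) -> int:
    """First positive integer a float with this many stored mantissa bits cannot hold."""
    # Every integer up to 2**(mantissa_bits + 1) is exact thanks to the implicit leading 1.
    return 2 ** (mantissa_bits + 1) + 1


def roundtrip_fp32(value: int) -> int:
    """Store an integer in an IEEE 32-bit float and read it back."""
    return int(struct.unpack("<f", struct.pack("<f", float(value)))[0])


def roundtrip_fp16(value: int) -> int:
    """Store an integer in an IEEE 16-bit float and read it back."""
    return int(struct.unpack("<e", struct.pack("<e", float(value)))[0])


def address_1d_to_2d(index: int, width: int):
    """Convert a 1D array index into (x, y) texel coordinates."""
    return index % width, index // width


limits = {
    "FP32 (s23e8)": first_unrepresentable_integer(23),
    "24-bit float (s16e7)": first_unrepresentable_integer(16),
    "FP16 (s10e5)": first_unrepresentable_integer(10),
}
for name, limit in limits.items():
    print(f"{name}: {limit:,}")

# The common bug: the intended index and the index the shader actually
# computes in FP32 map to different texels (assumed texture width: 4096).
wanted = limits["FP32 (s23e8)"]
stored = roundtrip_fp32(wanted)
print(address_1d_to_2d(wanted, 4096), "vs", address_1d_to_2d(stored, 4096))
```

Any texel whose 1D index does not survive the round trip can never be addressed from the shader, which is the failure the slide warns about.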

Solution 1: Convert to Gather
    for each spring
        f = computed force
    for each mass
        mass_force = f[left] - f[right];
[Diagram: f1 m1 f2 m2 f3]

Solution 2: Address Sorting
- Sort & Search
  - Shader outputs destination address and data
  - Bitonic sort based on address
  - Run binary search shader over destination buffer
  - Each fragment searches for source data

Solution 3: Vertex processor
- Render points
  - Use vertex shader to set destination
  - or just read back the data and re-issue
- Vertex Textures
  - Render data and address to texture
  - Issue points, set point x,y in vertex shader using address texture
  - Requires texld instruction in vertex program

Strategies & Tricks:

Problem: if (a) b = f(); else b = g();
- Limited fragment shader conditional support

Pre-computation
- Pre-compute anything that will not change every iteration!
- Example: static obstacles in fluid sim
  - When user draws obstacles, compute texture containing boundary info for cells
  - Reuse that texture until obstacles are modified
  - Combine with Z-cull for higher performance!
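Solution 1 from the slides above, turning a scatter into a gather, can be compared side by side on the CPU. This is a minimal sketch: the function names, the list-of-pairs spring layout, and the precomputed incidence table are assumptions made for illustration. With the scatter convention as written on the slide (left mass += f, right mass -= f), the force gathered by an interior mass works out to f[right] - f[left]; flipping the force convention flips the sign.

```python
def forces_by_scatter(spring_forces, springs, num_masses):
    """Scatter: each spring writes into the two masses it connects."""
    mass_force = [0.0] * num_masses
    for f, (left, right) in zip(spring_forces, springs):
        mass_force[left] += f    # indirect write, not possible in a pixel shader
        mass_force[right] -= f
    return mass_force


def forces_by_gather(spring_forces, springs, num_masses):
    """Gather: each mass reads the springs attached to it; no indirect writes."""
    # The topology is static, so this table can be built once ahead of time.
    incident = [[] for _ in range(num_masses)]
    for s, (left, right) in enumerate(springs):
        incident[left].append((s, +1.0))
        incident[right].append((s, -1.0))
    # One independent output per mass, as a fragment program would produce.
    return [
        sum(sign * spring_forces[s] for s, sign in incident[m])
        for m in range(num_masses)
    ]


# A chain of three masses joined by two springs: (left mass, right mass).
springs = [(0, 1), (1, 2)]
spring_forces = [1.0, 3.0]
print(forces_by_scatter(spring_forces, springs, 3))
print(forces_by_gather(spring_forces, springs, 3))
```

Both passes produce the same per-mass forces, but the gather version writes each output exactly once, at a location known before the kernel runs.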

Static Branch Resolution
- Avoid branches where outcome is fixed
  - One region is always true, another false
  - Separate FPs for each region, no branches
- Example: boundaries

Branching with Occlusion Query
- Use it for iteration termination
    Do { // outer loop on CPU
        BeginOcclusionQuery {
            // Render with fragment program that
            // discards fragments that satisfy
            // termination criteria
        } EndQuery
    } While query returns > 0
- Can be used for subdivision techniques

Predication
- Execute both f and g
    if (a) b = f(); else b = g();
- Use CMP instruction
    CMP b, -a, f, g
- Executes all conditional code

Predication
- Use DP4 instruction
    a = (0, 1, 0, 0)
    f = (x, y, z, w)
    DP4 b.x, a, f
- Executes all conditional code
    if (a.x) b = x;
    else if (a.y) b = y;
    else if (a.z) b = z;
    else if (a.w) b = w;

Using the depth buffer
- Set Z buffer to a
- Z-test can prevent shader execution
    glEnable(GL_DEPTH_TEST)
- Locality in conditional
    if (a) b = f(); else b = g();

Using the depth buffer
- Optimization disabled with:
  - ATI:
    - Writing Z in shader
    - Enabling Alpha test
    - Using texkill in shader
  - NVIDIA:
    - Changing depth test direction in frame
    - Writing stencil while rejecting based on stencil
    - Changing stencil func/ref/mask in frame
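The two predication slides above replace a branch with arithmetic that always runs. The sketch below mimics the two instructions in plain Python; the function names are illustrative, and the CMP behavior follows the ARB fragment program definition, which selects the second operand where the first is negative.

```python
def cmp_select(a, f, g):
    """Mimic `CMP b, -a, f, g`: pick f where -a < 0 (that is, a > 0), else g."""
    return f if -a < 0 else g


def dp4_select(mask, values):
    """Mimic `DP4 b.x, a, f`: a one-hot mask selects one component via a dot product."""
    return sum(m * v for m, v in zip(mask, values))


def predicated_branch(a, f, g, x):
    """Evaluate both sides, then select, as hardware without true branching does."""
    f_result = f(x)   # always executed
    g_result = g(x)   # always executed
    return cmp_select(a, f_result, g_result)


print(dp4_select((0, 1, 0, 0), (10.0, 20.0, 30.0, 40.0)))
print(predicated_branch(1.0, lambda v: v + 1, lambda v: v * 2, 5))
print(predicated_branch(0.0, lambda v: v + 1, lambda v: v * 2, 5))
```

The cost is the sum of both sides, so predication pays off mainly when f and g are short.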

Depth

Conditional Instructions
- Available with NV_fragment_program2
    MOVC CC, R0;
    IF GT.x;
        MOV R0, R1; # executes if R0.x > 0
    ELSE;
        MOV R0, R2; # executes if R0.x <= 0
    ENDIF;

GeForce 6/7 Series Branching
- True, SIMD branching
  - Lots of incoherent branching can hurt performance
  - Should have coherent regions of ~1000 pixels
  - That is only about 30x30 pixels, so still very usable!
- Don't ignore overhead of branch instructions
  - Branching over < 5 instructions may not be worth it
- Use branching for early exit from loops
  - Save a lot of computation

Conditional Instructions

Branching Techniques
- Fragment program branches can be expensive
  - No true fragment branching on GeForce FX or Radeon 9x00-X850
  - SIMD branching on GeForce 6/7 Series
    - Incoherent branching hurts performance
- Sometimes better to move decisions up the pipeline
  - Replace with math
  - Occlusion Query
  - Static Branch Resolution
  - Depth Buffer
  - Pre-computation
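Why incoherent SIMD branching hurts can be illustrated with a toy cost model. Everything in it is an assumption made for illustration rather than a description of how GeForce 6/7 hardware schedules work: pixels are grouped into fixed-size blocks, and a block whose pixels disagree on the condition pays for both sides of the branch.

```python
def simd_branch_cost(conditions, block_size, cost_true, cost_false):
    """Toy model: total cost when each block executes every path any of its pixels takes."""
    total = 0
    for start in range(0, len(conditions), block_size):
        block = conditions[start:start + block_size]
        takes_true = any(block)        # at least one pixel needs the 'if' side
        takes_false = not all(block)   # at least one pixel needs the 'else' side
        per_pixel = cost_true * takes_true + cost_false * takes_false
        total += len(block) * per_pixel
    return total


# Eight pixels in blocks of four; the 'if' side costs 10, the 'else' side 2.
coherent = [True, True, True, True, False, False, False, False]
incoherent = [True, False, True, False, True, False, True, False]

print(simd_branch_cost(coherent, 4, 10, 2))
print(simd_branch_cost(incoherent, 4, 10, 2))
```

In this model the same number of pixels take each side in both cases, yet the interleaved pattern costs twice as much because no block can skip either path.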