OpenGL Performance Tuning



Similar documents
Optimizing Unity Games for Mobile Platforms. Angelo Theodorou Software Engineer Unite 2013, 28 th -30 th August

Image Processing and Computer Graphics. Rendering Pipeline. Matthias Teschner. Computer Science Department University of Freiburg

Shader Model 3.0. Ashu Rege. NVIDIA Developer Technology Group

Computer Graphics Hardware An Overview

INTRODUCTION TO RENDERING TECHNIQUES

Performance Optimization and Debug Tools for mobile games with PlayCanvas

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

GPU(Graphics Processing Unit) with a Focus on Nvidia GeForce 6 Series. By: Binesh Tuladhar Clay Smith

Image Synthesis. Transparency. computer graphics & visualization

GPU Architecture. Michael Doggett ATI

Optimizing AAA Games for Mobile Platforms

How To Understand The Power Of Unity 3D (Pro) And The Power Behind It (Pro/Pro)

GPGPU Computing. Yong Cao

NVPRO-PIPELINE A RESEARCH RENDERING PIPELINE MARKUS TAVENRATH MATAVENRATH@NVIDIA.COM SENIOR DEVELOPER TECHNOLOGY ENGINEER, NVIDIA

Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions

DATA VISUALIZATION OF THE GRAPHICS PIPELINE: TRACKING STATE WITH THE STATEVIEWER

GPUs Under the Hood. Prof. Aaron Lanterman School of Electrical and Computer Engineering Georgia Institute of Technology

Modern Graphics Engine Design. Sim Dietrich NVIDIA Corporation

CSE 564: Visualization. GPU Programming (First Steps) GPU Generations. Klaus Mueller. Computer Science Department Stony Brook University

Interactive Information Visualization using Graphics Hardware Študentská vedecká konferencia 2006

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

Radeon HD 2900 and Geometry Generation. Michael Doggett

Silverlight for Windows Embedded Graphics and Rendering Pipeline 1

Optimization for DirectX9 Graphics. Ashu Rege

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university

Dynamic Resolution Rendering

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Advanced Visual Effects with Direct3D

Lecture 15: Hardware Rendering

John Carmack Archive -.plan (2002)

Writing Applications for the GPU Using the RapidMind Development Platform

Introduction to Game Programming. Steven Osman

Introduction to GPGPU. Tiziano Diamanti

CUBE-MAP DATA STRUCTURE FOR INTERACTIVE GLOBAL ILLUMINATION COMPUTATION IN DYNAMIC DIFFUSE ENVIRONMENTS

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Touchstone -A Fresh Approach to Multimedia for the PC

L20: GPU Architecture and Models

2: Introducing image synthesis. Some orientation how did we get here? Graphics system architecture Overview of OpenGL / GLU / GLUT

A Crash Course on Programmable Graphics Hardware

Software Virtual Textures

Interactive Level-Set Deformation On the GPU

3D Computer Games History and Technology

Advanced Rendering for Engineering & Styling

Deferred Shading & Screen Space Effects

Advanced Graphics and Animations for ios Apps

Introduction to GPU Programming Languages

Low power GPUs a view from the industry. Edvard Sørgård

Developer Tools. Tim Purcell NVIDIA

How To Teach Computer Graphics

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

Lecture Notes, CEng 477

QCD as a Video Game?

Introduction to Computer Graphics

Introduction to Computer Graphics

Data Parallel Computing on Graphics Hardware. Ian Buck Stanford University

A Short Introduction to Computer Graphics

Beyond 2D Monitor NVIDIA 3D Stereo

Color correction in 3D environments Nicholas Blackhawk

GPU Christmas Tree Rendering. Evan Hart

GPU Shading and Rendering: Introduction & Graphics Hardware

Intel Graphics Media Accelerator 900

Monash University Clayton s School of Information Technology CSE3313 Computer Graphics Sample Exam Questions 2007

BRINGING UNREAL ENGINE 4 TO OPENGL Nick Penwarden Epic Games Mathias Schott, Evan Hart NVIDIA

Console Architecture. By: Peter Hood & Adelia Wong

Computer Graphics. Anders Hast

High Dynamic Range and other Fun Shader Tricks. Simon Green

The OpenGL R Graphics System: A Specification (Version 3.3 (Core Profile) - March 11, 2010)

Beginning Android 4. Games Development. Mario Zechner. Robert Green

Programming 3D Applications with HTML5 and WebGL

ATI Radeon 4800 series Graphics. Michael Doggett Graphics Architecture Group Graphics Product Group

Pitfalls of Object Oriented Programming

General Purpose Computation on Graphics Processors (GPGPU) Mike Houston, Stanford University

GUI GRAPHICS AND USER INTERFACES. Welcome to GUI! Mechanics. Mihail Gaianu 26/02/2014 1

Workstation Applications for Windows. NVIDIA MAXtreme User s Guide

Graphical displays are generally of two types: vector displays and raster displays. Vector displays

An Introduction to Modern GPU Architecture Ashu Rege Director of Developer Technology

Hardware design for ray tracing

Introduction to GPU Architecture

NVIDIA GeForce GTX 580 GPU Datasheet

Real-Time Graphics Architecture

Comp 410/510. Computer Graphics Spring Introduction to Graphics Systems

Web Based 3D Visualization for COMSOL Multiphysics

What is GPUOpen? Currently, we have divided console & PC development Black box libraries go against the philosophy of game development Game

MS SQL Performance (Tuning) Best Practices:

Interactive Rendering In The Post-GPU Era. Matt Pharr Graphics Hardware 2006

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

Midgard GPU Architecture. October 2014

Computer Graphics on Mobile Devices VL SS ECTS

3D Math Overview and 3D Graphics Foundations

Anime Studio Debut vs. Pro

Using Photorealistic RenderMan for High-Quality Direct Volume Rendering

Lezione 4: Grafica 3D*(II)

Q. Can an Exceptional3D monitor play back 2D content? A. Yes, Exceptional3D monitors can play back both 2D and specially formatted 3D content.

Making natural looking Volumetric Clouds In Blender 2.48a

Equalizer. Parallel OpenGL Application Framework. Stefan Eilemann, Eyescale Software GmbH

3D Stereoscopic Game Development. How to Make Your Game Look

Blender Notes. Introduction to Digital Modelling and Animation in Design Blender Tutorial - week 9 The Game Engine

Game Development in Android Disgruntled Rats LLC. Sean Godinez Brian Morgan Michael Boldischar

OctaVis: A Simple and Efficient Multi-View Rendering System

Transcription:

OpenGL Performance Tuning Evan Hart ATI Pipeline slides courtesy John Spitzer - NVIDIA

Overview What to look for in tuning How it relates to the graphics pipeline Modern areas of interest Vertex Buffer Objects Shader performance

Simplified Graphics Pipeline CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer Texture Storage + Filtering Vertices Pixels

Possible Pipeline Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer Texture Storage + Filtering CPU/Bus Bound Vertex Bound Pixel Bound

Battle Plan for Better Performance Locate the bottleneck(s) Eliminate the bottleneck (if possible) Decrease workload of the bottlenecked stage Otherwise, balance the pipeline Increase workload of the non-bottlenecked stages:

Bottleneck Identification Run App Vary FB FPS varies? Yes FB limited No Vary texture size/filtering Vary resolution Vary vertex instructions FPS varies? No FPS varies? No FPS varies? No Yes Yes Yes Texture limited Vary fragment instructions Transform limited Yes FPS varies? No Fragment limited Raster limited Vary vertex size/ AGP rate FPS varies? No Yes Transfer limited CPU limited

CPU Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

CPU Bottlenecks Application limited (most games are in some way) Driver or API limited Too many state changes per draw Non-optimal paths Use VTune (Intel performance analyzer) caveat: truly GPU-limited games hard to distinguish from pathological use of API

Geometry Transfer Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

Geometry Transfer Bottlenecks Vertex data problems size issues (just under or over 32 bytes) non-native types (e.g. double, packed byte normals) Using the wrong API calls Immediate mode, non-accelerated vertex arrays Non-indexed primitives (e.g. gldrawarrays) Prefer gldrawrangeelements over gldrawelements AGP misconfigured or aperture set too small

Optimizing Geometry Transfer Dynamic geometry - use ARB_vertex_buffer_object vertex size ideally multiples of 32 bytes access vertices in sequential pattern always use indexed primitives (i.e. gldrawelements) 16 bit indices can be faster than 32 bit try to batch at least 100 tris/call Static geometry - can use display lists

Geometry Transform Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer Texture Storage + Filtering CPU/Bus Bound Vertex Bound Pixel Bound

Geometry Transform Bottlenecks Too many vertices Too much computation per vertex Vertex cache inefficiency

Too Many Vertices Maximize vertex cache usage with locality of reference Use levels of detail (but beware of CPU overhead) Use bump maps to fake geometric details

Too Much Vertex Computation: Fixed Function Avoid superflous work >3 lights (saturation occurs quickly) local lights/viewer, unless really necessary unused texgen or non-identity texture matrices Consider commuting to vertex program if a good shortcut exists example: texture matrix only needs to be 2x2 Fixed function already tuned for the HW

Too Much Vertex Computation: Vertex Programs Move per-object calculations to CPU, save results as constants Leverage full spectrum of instruction set (LIT, DST, SIN,...) Leverage swizzle and mask operators to minimize MOVs Consider using shader levels of detail

Vertex Cache Inefficiency Always use indexed primitives on high-poly models Re-order vertices to be sequential in use (e.g. NVTriStrip) Favor strip order over random order for triangle lists

Rasterization Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

Rasterization Rarely the bottleneck (exception: stencil shadow volumes) Speed influenced primarily by size of triangles Also, by number of vertex attributes to be interpolated Be sure to maximize depth culling efficiency

Maximize Depth Culling Efficiency Always clear depth at the beginning of each frame clear with stencil, if stencil buffer exists feel free to combine with color clear, if applicable Coarsely sort objects front to back Don t switch the direction of the depth test mid-frame Constrain near and far planes to geometry visible in frame Avoid polygon offset unless you really need it NVIDIA advice use scissor and depth bounds test to minimize superfluous fragment generation for stencil shadow volumes ATI advice avoid EQUAL and NOTEQUAL depth tests

Texture Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

Texture Bottlenecks Running out of texture memory Poor texture cache utilization Excessive texture filtering

Conserving Texture Memory Texture resolutions should be only as big as needed Avoid expensive internal formats Modern GPUs allow floating point formats Compress textures: Collapse monochrome channels into alpha Use 16-bit color depth when possible (environment maps and shadow maps) Use DXT compression Be smart and use tools to selectively compress (ATI s Compressonator)

Poor Texture Cache Utilization Localize texture accesses beware of dependent texturing ALWAYS use mipmapping Avoid negative LOD bias to sharpen texture caches are tuned for standard LODs sharpening usually causes aliasing in the distance opt for anisotropic filtering over sharpening

Excessive Texture Filtering Use trilinear filtering only when needed trilinear filtering can cut fillrate in half typically, only diffuse maps truly benefit light maps are too low resolution to benefit environment maps are distorted anyway Use anisotropic filtering judiciously often more expensive than trilinear not useful for environment maps

Fragment Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer CPU/Bus Bound Vertex Bound Texture Storage + Filtering Pixel Bound

Fragment Bottlenecks Too many fragments Too much computation per fragment Unnecessary fragment operations

Too Many Fragments Follow prior advice for maximizing depth culling efficiency Consider using a depth-only first pass shade only the visible fragments in subsequent pass(es) improve fragment throughput at the expense of additional vertex burden (only use for frames employing complex shaders) Less helpful if opaque geometry is rendered front to back

Too Much Fragment Computation Use a mix of texture and math instructions (they often run in parallel) Move constant per-triangle calculations to vertex program, send data as texture coordinates Do similar with values that can be linear interpolated (e.g. fresnel) Consider using shader levels of detail

Framebuffer Bottlenecks CPU transfer transform raster texture fragment frame buffer CPU Geometry Storage Geometry Processor Rasterizer Fragment Processor Frame buffer Texture Storage + Filtering CPU/Bus Bound Vertex Bound Pixel Bound

Minimizing Framebuffer Traffic Collapse multiple passes with longer shaders Turn off Z writes for transparent objects and multipass Question the use of floating point frame buffers Reduce number and size of render-to-texture targets Cube maps and shadow maps can be of small resolution and at 16- bit color depth and still look good Try turning cube-maps into hemisphere maps for reflections instead Can be smaller than an equivalent cube map Fewer render target switches Reuse render target textures to reduce memory footprint

NVIDIA specific FB Optimizations Do not mask off only some color channels unless really necessary Use 16-bit Z depth if you can get away with it

Use Occlusion Query Use occlusion query to minimize useless rendering It s cheap and easy! Examples: multi-pass rendering rough visibility determination (lens flare, portals) Better than glreadpixels Caveat: need time for query to process

Vertex Buffer Objects Improve geometry throughput and CPU usage Provides extra information to the driver Can be 3-4x faster than regular vertex arrays Can hurt performance in non-optimal cases

VBO Quick Tips Do use static VBO s for maximum performance Do invalidate buffer contents on streaming buffers to prevent synchronization Do update contiguous blocks of data in a single operation Do set the usage flags properly Do use Element Array

VBO Don ts Do not use glarrayelement Do not use non-native types Double, int, RGB color in ubyte format Do not make really small VBO s Do not read VBO s unnecessarily Do not use one massive VBO for everything

Vendor Specific VBO Info NVIDIA Avoid Redundant calls to gl*pointer Use the first parameter of gldrawarrays Use BufferData instead of Map to replace an entire buffer ATI Prefer BufferData and BufferSubData over Map to reduce synchronization overhead

Shader Performance Generic category that covers many areas of the pipeline Compilers do much of the work

General Shader Tips Lift constant or linear expressions to higher levels Avoid excessive control flow Utilize built-in functions/operations Switch to specialized versions when you know certain problem domain limits Avoid unnecessary complexity Compiler is likely better with the straight-forward code

ATI Specific Shader Tips Be careful with unnecessary swizzles in fragment shaders Not all swizzles are native Avoid unnecessary use of Rectangle Textures Don t use the derivative operator Check native instruction counts for ops Available in the ATI OpenGL SDK Try to use ALU normalization

GeForceFX-specific Optimizations Use even numbers of texture instructions Use even numbers of blending (math) instructions Use normalization cubemaps to efficiently normalize vectors Leverage full spectrum of instruction set (LIT, DST, SIN,...) Leverage swizzle and mask operators to minimize MOVs Minimize temporary storage Use 16-bit registers where applicable (most cases) Use all components in each (swizzling is free)

Conclusion Complex, programmable GPUs have many potential bottlenecks Rarely is there but one bottleneck in a game Understand what you are bound by in various sections of the scene The skybox is probably texture limited The skinned, dot3 characters are probably transfer or transform limited Exploit imbalances to get things for free