Real-Time BC6H Compression on GPU. Krzysztof Narkowicz Lead Engine Programmer Flying Wild Hog



Similar documents
Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005

Shader Model 3.0. Ashu Rege. NVIDIA Developer Technology Group

Optimizing AAA Games for Mobile Platforms

Optimization for DirectX9 Graphics. Ashu Rege

Computer Graphics Hardware An Overview

Image Processing and Computer Graphics. Rendering Pipeline. Matthias Teschner. Computer Science Department University of Freiburg

Video-Conferencing System

Dynamic Resolution Rendering

What is GPUOpen? Currently, we have divided console & PC development Black box libraries go against the philosophy of game development Game

Introduction to image coding

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Blender addons ESRI Shapefile import/export and georeferenced raster import

Software Virtual Textures

Radeon HD 2900 and Geometry Generation. Michael Doggett

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

This Unit: Floating Point Arithmetic. CIS 371 Computer Organization and Design. Readings. Floating Point (FP) Numbers

How To Make A Texture Map Work Better On A Computer Graphics Card (Or Mac)

CSE 237A Final Project Final Report

Computer Networks and Internets, 5e Chapter 6 Information Sources and Signals. Introduction

Low power GPUs a view from the industry. Edvard Sørgård

GPU Architecture. Michael Doggett ATI

GPU(Graphics Processing Unit) with a Focus on Nvidia GeForce 6 Series. By: Binesh Tuladhar Clay Smith

Advanced Visual Effects with Direct3D

Introduction Computer stuff Pixels Line Drawing. Video Game World 2D 3D Puzzle Characters Camera Time steps

The Future Of Animation Is Games

Image Compression through DCT and Huffman Coding Technique

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Data Storage 3.1. Foundations of Computer Science Cengage Learning

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

Quality Estimation for Scalable Video Codec. Presented by Ann Ukhanova (DTU Fotonik, Denmark) Kashaf Mazhar (KTH, Sweden)

NVIDIA GeForce GTX 580 GPU Datasheet

Fast Arithmetic Coding (FastAC) Implementations

Solomon Systech Image Processor for Car Entertainment Application

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Water Flow in. Alex Vlachos, Valve July 28, 2010

To determine vertical angular frequency, we need to express vertical viewing angle in terms of and. 2tan. (degree). (1 pt)

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university

Binary Division. Decimal Division. Hardware for Binary Division. Simple 16-bit Divider Circuit

Hi everyone, my name is Michał Iwanicki. I m an engine programmer at Naughty Dog and this talk is entitled: Lighting technology of The Last of Us,

Data Storage. Chapter 3. Objectives. 3-1 Data Types. Data Inside the Computer. After studying this chapter, students should be able to:

Calibration Best Practices

MPEG-4 AVC/H.264 Video Codecs Comparison

balesio Native Format Optimization Technology (NFO)

MPEG-4 AVC/H.264 Video Codecs Comparison

Performance Analysis and Comparison of JM 15.1 and Intel IPP H.264 Encoder and Decoder

Image Synthesis. Transparency. computer graphics & visualization

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

Image Compression and Decompression using Adaptive Interpolation

Interactive Level-Set Deformation On the GPU

Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions

3D Computer Games History and Technology

Digital Video: A Practical Guide

NVIDIA Tools For Profiling And Monitoring. David Goodwin

How To Run A Factory I/O On A Microsoft Gpu 2.5 (Sdk) On A Computer Or Microsoft Powerbook 2.3 (Powerpoint) On An Android Computer Or Macbook 2 (Powerstation) On

DATA VISUALIZATION OF THE GRAPHICS PIPELINE: TRACKING STATE WITH THE STATEVIEWER

Thermistor Calculator. Features. General Description. Input/Output Connections. When to use a Thermistor Calculator 1.10

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

Building an Advanced Invariant Real-Time Human Tracking System

Touchstone -A Fresh Approach to Multimedia for the PC

Amazing renderings of 3D data... in minutes.

Comparison of different image compression formats. ECE 533 Project Report Paula Aguilera

FCE: A Fast Content Expression for Server-based Computing

APP INVENTOR. Test Review

Outline. srgb DX9, DX10, XBox 360. Tone Mapping. Motion Blur

WHITE PAPER. H.264/AVC Encode Technology V0.8.0

Introduction to Xilinx System Generator Part II. Evan Everett and Michael Wu ELEC Spring 2013

FAST INVERSE SQUARE ROOT

CSE 564: Visualization. GPU Programming (First Steps) GPU Generations. Klaus Mueller. Computer Science Department Stony Brook University

GPU-Driven Rendering Pipelines

Numerical Matrix Analysis

Writing Applications for the GPU Using the RapidMind Development Platform

GPU Hardware Performance. Fall 2015

Video Codec Requirements and Evaluation Methodology

ECE 0142 Computer Organization. Lecture 3 Floating Point Representations

Broadband Networks. Prof. Dr. Abhay Karandikar. Electrical Engineering Department. Indian Institute of Technology, Bombay. Lecture - 29.

GPU Shading and Rendering: Introduction & Graphics Hardware

Color correction in 3D environments Nicholas Blackhawk

Introduction to Game Programming. Steven Osman

Measuring Video Quality in Videoconferencing Systems

H 261. Video Compression 1: H 261 Multimedia Systems (Module 4 Lesson 2) H 261 Coding Basics. Sources: Summary:

Video compression: Performance of available codec software

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

Filter Comparison. Match #1: Analog vs. Digital Filters

Optimizing Unity Games for Mobile Platforms. Angelo Theodorou Software Engineer Unite 2013, 28 th -30 th August

Palmprint Recognition. By Sree Rama Murthy kora Praveen Verma Yashwant Kashyap

CS-184: Computer Graphics

OpenGL Performance Tuning

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Grid Computing for Artificial Intelligence

SCANNING, RESOLUTION, AND FILE FORMATS

NVPRO-PIPELINE A RESEARCH RENDERING PIPELINE MARKUS TAVENRATH MATAVENRATH@NVIDIA.COM SENIOR DEVELOPER TECHNOLOGY ENGINEER, NVIDIA

Michael W. Marcellin and Ala Bilgin

If you are working with the H4D-60 or multi-shot cameras we recommend 8GB of RAM on a 64 bit Windows and 1GB of video RAM.

Data Parallel Computing on Graphics Hardware. Ian Buck Stanford University

JPEG compression of monochrome 2D-barcode images using DCT coefficient distributions

Transcription:

Real-Time BC6H Compression on GPU Krzysztof Narkowicz Lead Engine Programmer Flying Wild Hog

Introduction BC6H is lossy block based compression designed for FP16 HDR textures Hardware supported since DX11 and current-gen consoles Fixed size 4x4 texel blocks No alpha 6:1 compression ratio (8bits per texel) Great replacement of hacky encodings like RGBE, RGBM, RGBK...

A Bag Of Tricks block Signed or unsigned half floats Mix of three algorithms: Endpoints and indices Delta compression Partitioning Different compression modes selected per

BC6H Endpoints And Indices Two endpoints and 16 indices are stored per block Endpoints define a line segment in RGB space Indices define location of every texel in a block on this segment

BC6H Delta Compression First endpoint stored in higher precision Signed offset between endpoints is stored instead of second endpoint Increases quality for blocks with similar values

BC6H Partitioning Allows two separate line segments per block Increases quality for blocks containing a large color variation Use one of 32 predefined partitions to assign current texel to one of two line segments

BC6H Partitioning With Delta Compression One base endpoint is stored directly Other 3 are stored as signed offsets from the base endpoint Helps with limited endpoint precision

BC6H Modes 14 different modes: 1 mode which doesn t use partitioning or delta compression 3 modes which use just delta compression 1 mode which uses just partitioning 9 modes which use partitioning and delta compression Modes choose different tradeoffs between endpoints precision and offset precision

Optimal Compression Is Hard There are 10 modes with partitioning Every one of those requires finding 4 endpoints and 16 optimal indices per every partition (and we have 32 partitions) Huge search space

Real-Time Compression on GPU Typical cases: Dynamic env maps HDR textures transcribed at runtime from other formats User generated content GPU based compression avoids CPU-GPU synchronization and data transfer between CPU and GPU

Our Use Case Dynamic lighting conditions Procedural world geometry Generate nearby env maps during camera movement Compress generated env maps to BC6H in real-time Enables dense env map placement

Real-Time Compression Performance or quality? Fast Performance is crucial Quality doesn t have to be top notch Quality Quality is important Compression can take a few ms

Fast Preset Which modes? Modes with partitioning are too slow Modes with only delta compression are fast, but improve only a few specific cases Mode 11 Just endpoints and indices Two endpoints quantized to 10 bit floating points 16 indices (4 bits per index)

Endpoints Compute RGB bounding box of the block s texels [Waveren 06] Use it s min and max values as endpoints

RGB BBox Inset Inset RGB bbox by a small percentage for better precision

RGB BBox Inset HDR data leads to very large offsets

RGB BBox Inset Results Reference Compressed

RGB BBox Refine Rebuild bbox using the second smallest and second largest R, G and B

RGB BBox Refine Results Reference Compressed

Indices Need to select one of 16 interpolated colors on segment between endpoints Project on segment and pick nearest index

Indices Not so simple as interpolation weight s aren t evenly distributed: Simple approximation is to fit equation for smallest error: Index i = Clamp[texelPos i 14 15 + 1 30, 0,15]

Fix-up Index MSB of the first index is implicitly assumed to be zero and isn t stored We need to ensure this property by swapping endpoints

Quality Comparison RMSE, MSE, PSNR are very bad error metrics for HDR images A slight round-off error for very bright pixels results in incorrectly high RMSE RMSLE [Richter 14] RMSLE = n 1 n i log x i + 1 ) log y i + 1 2

Fast Preset Results Intel timings on i7 860 Fast preset and DirectXTex timings on AMD R9 270 (mid range GPU) 256x256 env map with mip maps - 0.07ms Fast preset Intel very fast Intel fast Intel very slow DirectX Tex RMSLE 0.0552 0.0470 0.0307 0.0293 0.0413 Mp/s 8022.40 63.10 5.35 0.33 0.65

Fast Preset Results Reference Compressed

Fast Preset Results Reference Compressed

Quality Preset Increasing quality requires partitioning We tested all modes with partitioning and picked two most important Mode 2 7 bits per first endpoint 6 bits per 3 signed offsets Mode 6 9 bits per first endpoint 5 bits per 3 signed offsets All use 3 bits per index

Endpoints Compute two RGB bounding boxes per partition as there are two RGB line segments now Use their min and max corners as endpoints 32 partitioning modes

Indices Compute indices per partition using similar approximation as before Two fix-up indices: For the first segment it s the first index For the second segment it s specified by a predefined lookup table

Selecting Mode And Partition We compress block in 65 ways 2 partitioning modes with 32 partitions each 1 non partitioning mode (mode 11) Compute compression error for every combination Again using RMSLE (log2) Pick combination with the lowest error

Quality Preset Results 256x256 env map with mip maps 6.84 ms Quality is comparable to offline compressors Fast preset Quality preset Intel very fast Intel fast Intel very slow DirectX Tex RMSLE 0.0552 0.0333 0.0470 0.0307 0.0293 0.0413 Mp/s 8022.4 143.55 63.10 5.35 0.33 0.65

Quality Preset Results Reference Compressed

DX11 Implementation Render to a 16 times smaller R32G32B32A32_Uint temporary target Run pixel shader and output compressed blocks as pixels Copy results to a BC6H texture (CopyResource) Using modern APIs: Skip the copy step Implement as an async compute

Shader Optimization Fetch source texels using gather Replace 16 bit integer math with float math Tightly bit pack lookup tables into unsigned integers

Embedding palletized sprites in shader code https://www.shadertoy.com/view/xtlsd7

Misc Source code, example application and test images on GitHub: https://github.com/knarkowicz/gpurealtimebc6h Thanks: Mark Cerny (presentation s mentor) Michał Iwanicki for helping me with the slides Contact me: @knarkowicz k.narkowicz@gmail.com

Questions?

References [Waveren 06] J.M.P. van Waveren, Real-Time DXT Compression, 2006 [Richter 14] Thomas Richter, HDR Image Quality in the Evolution of JPEG XT, MMSP 2014