Reduced Precision Hardware for Ray Tracing. Sean Keely University of Texas, Austin

Question: Why don't GPUs accelerate ray tracing?

Real-time ray tracing needs a very high ray rate. Example scene: 3 area lights + AO/GI, 64 rays/pixel, 1080p, 30 Hz -> 4 billion rays/s. Software approaches give 80-400M rays/s.
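The 4 billion figure follows directly from the example scene's parameters. A minimal sketch of the arithmetic, assuming 1080p means 1920 x 1080 pixels (the function and variable names are illustrative, not from the talk):

```python
# Worked arithmetic for the example scene: 64 rays/pixel, 1080p, 30 Hz.
WIDTH, HEIGHT = 1920, 1080      # assumption: 1080p = 1920 x 1080
RAYS_PER_PIXEL = 64
FRAMES_PER_SECOND = 30


def required_ray_rate(width, height, rays_per_pixel, fps):
    """Rays per second needed to render every pixel of every frame."""
    return width * height * rays_per_pixel * fps


rate = required_ray_rate(WIDTH, HEIGHT, RAYS_PER_PIXEL, FRAMES_PER_SECOND)
print(f"required: {rate / 1e9:.2f} billion rays/s")

# Gap to the quoted software range of 80-400M rays/s.
for software_rate in (80e6, 400e6):
    print(f"{software_rate / 1e6:.0f}M rays/s software is {rate / software_rate:.1f}x short")
```

The product comes out just under 4 billion rays/s, which is the rounded figure quoted on the slide.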

Single-precision ASICs will be large. Approx. 4 Tflops trace + 2 Tflops intersect + ?? shading. Will need roughly the die area of a high-end GPU for a trace & intersect co-processor.

Previous work competes with current GPUs. STRaTA, D. Kopta 2013: MIMD configurable pipeline, avoids divergence penalty; repurposes L2 as ray cache. SGRT, W. Lee 2014: MIMD co-processor design. Most recent work addresses the performance cost of co-processor design.

Goal: Accelerate ray tracing in a GPU, not next to one. Add high-performance BVH traversal acceleration to current GPU architecture: MIMD traversal, SIMD programs. Constraints: low impact (not a co-processor), small die area, low power, low bandwidth.

How: Reduced Precision and Integration. 4 billion rays/s needs ~24 W just for multiplies in traversal. This can be reduced to ~1 W with reduced precision. One off-chip data access costs ~100x more energy than an FMA. Reduced precision saves here too! Don't build new registers, cache, or wires. Integrate into a GPU and get these for free!
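Dividing the quoted power figures by the target ray rate gives a per-ray energy budget for traversal multiplies. A small sketch of that division, using only the numbers on the slide (the helper name is illustrative):

```python
# Per-ray energy implied by the slide's power figures at 4 billion rays/s.
RAYS_PER_SECOND = 4e9


def nanojoules_per_ray(watts, rays_per_second=RAYS_PER_SECOND):
    """Watts are joules/second, so W / (rays/s) = J/ray; scaled here to nJ."""
    return watts * 1e9 / rays_per_second


full_precision_nj = nanojoules_per_ray(24.0)    # ~24 W for traversal multiplies
reduced_precision_nj = nanojoules_per_ray(1.0)  # ~1 W with reduced precision

print(f"full precision:    {full_precision_nj} nJ/ray")
print(f"reduced precision: {reduced_precision_nj} nJ/ray")
print(f"saving:            {full_precision_nj / reduced_precision_nj:.0f}x")
```

Both power figures are approximate on the slide, so the per-ray values are order-of-magnitude estimates.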

Reduced Precision BVH Traversal

Robust Ray-AABB Test. $T_F > T_N$: intersect box (HIT). $T_N > T_F$: miss box (MISS). Ambiguity if $T_F \approx T_N$. Need $T_N > T_F + \varepsilon$, $\varepsilon = 2 \cdot \mathrm{ULP}(T_F)$, to declare a miss [T. Ize, 13].
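The conservative miss rule can be sketched with a standard slab test, where the near distance is the largest slab entry and the far distance is the smallest slab exit. This is a minimal illustration, not the hardware's implementation: it runs in double precision, assumes every direction component is nonzero, and adds a behind-the-origin check that the slide does not mention.

```python
import math


def robust_ray_aabb(origin, direction, box_lo, box_hi):
    """Slab test that reports a miss only if t_near > t_far + 2*ULP(t_far).

    Assumption: every component of `direction` is nonzero.
    """
    t_near, t_far = -math.inf, math.inf
    for o, d, lo, hi in zip(origin, direction, box_lo, box_hi):
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:                      # ray runs against this axis
            t0, t1 = t1, t0
        t_near = max(t_near, t0)         # latest entry over all slabs
        t_far = min(t_far, t1)           # earliest exit over all slabs
    if t_far < 0.0:                      # box entirely behind the origin (added check)
        return False
    eps = 2.0 * math.ulp(t_far)          # margin from the slide
    return not (t_near > t_far + eps)    # ambiguous cases count as hits


print(robust_ray_aabb((0, 0, 0), (1, 1, 1), (1, 1, 1), (2, 2, 2)))        # clear hit
print(robust_ray_aabb((0, 0, 0), (1, 1, 1), (1, -2, 1), (2, -1, 2)))      # clear miss
print(robust_ray_aabb((0, 0, 0), (1, 1, 1), (1, 0.5, 0.5), (2, 1, 1)))    # grazing corner
```

Treating the ambiguous band as a hit can only add traversal work; it cannot cause a box that the ray really intersects to be skipped.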

Hit or Miss? (Figure sequence: high precision intersection vs. low precision intersection.)

Minimize the effective box size Seek P such that

Traversal Point Update. Gap due to finite precision. Problems: computing $T_N$ needs multiplies; unbounded box size; multiple applications.

Traversal Point Update. Gap due to finite precision. f: precision of update procedure. S: maximum parent-child edge ratio. C: maximum internal parent-child offset.

Traversal Point Update. 5-bit child boxes -> S = 32. Need 8-bit arithmetic to guarantee. Can actually use just 1 bit! Incorrectly taken paths are quickly aborted. Much smaller than even 8-bit arithmetic. Reduces precision of the adds and divides as well. No need to share a divider; it is replaced with a LUT.

Reduced Precision Traversal Unit. Fixed-function traversal unit implements:
1. Two 1-bit precision traversal point updates
2. Two 5-bit precision robust ray-AABB tests
3. Near child detection
4. Single-bit traversal stack
5.
A fully pipelined traversal unit is roughly 6% of the area of a single SIMD FPU in a current GPU.

BVH Compression. Traversal inputs drop to 5 bits early on, so traversal can't distinguish all boxes. Opportunity to save bandwidth! BVH boxes are quantized to 5 bits. 12-byte node format holding two boxes.
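One way to quantize a child box to 5 bits per plane is to snap it outward onto a grid spanning its parent box. The slides do not give the encoding, so the sketch below is an assumption throughout: the grid layout, the rounding rule, and all names are illustrative. Two boxes at 6 planes x 5 bits use 60 of the 96 bits in a 12-byte node; how the remaining bits are spent is not stated.

```python
import math

BITS = 5
LEVELS = (1 << BITS) - 1   # codes 0..31 span the parent extent (assumed layout)


def quantize_child(parent_lo, parent_hi, child_lo, child_hi):
    """Snap a child box outward onto the parent's grid (conservative rounding).

    Assumption: the parent box has nonzero extent on every axis.
    """
    lo_codes, hi_codes = [], []
    for p0, p1, c0, c1 in zip(parent_lo, parent_hi, child_lo, child_hi):
        scale = (p1 - p0) / LEVELS
        lo = math.floor((c0 - p0) / scale)   # round lower plane down
        hi = math.ceil((c1 - p0) / scale)    # round upper plane up
        lo_codes.append(min(max(lo, 0), LEVELS))
        hi_codes.append(min(max(hi, 0), LEVELS))
    return lo_codes, hi_codes


def dequantize_child(parent_lo, parent_hi, lo_codes, hi_codes):
    """Recover the (slightly enlarged) child box from its codes."""
    lo, hi = [], []
    for p0, p1, a, b in zip(parent_lo, parent_hi, lo_codes, hi_codes):
        scale = (p1 - p0) / LEVELS
        lo.append(p0 + a * scale)
        hi.append(p0 + b * scale)
    return lo, hi


parent = ((0.0, 0.0, 0.0), (31.0, 31.0, 31.0))
child = ((3.2, 4.7, 10.0), (8.1, 9.9, 20.5))
codes = quantize_child(*parent, *child)
box = dequantize_child(*parent, *codes)
print(codes, box)
print("bits for two boxes:", 2 * 6 * BITS, "of", 12 * 8)
```

Rounding outward keeps the decoded box a superset of the original, so quantization can add traversal steps but should not hide geometry; a production encoder would also need to guard against floating-point rounding at the grid boundaries.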

GPU Integration & Scheduling

Schedule to minimize off-chip traffic. Treelet scheduling [T. Aila,... 10]: smaller treelets are better, except for queuing traffic. Stay out of memory! On-chip queuing. Lots of rays: improve the odds of reuse.

We can use everything here

Registers are great for rays: 64-byte SIMD registers, random access, ultra-high bandwidth.

Registers are great for rays: 64-byte ray state. Store one ray across one register. All queues are linked lists of registers.
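Queues built as linked lists of registers can be sketched with a register file in which every slot holds one ray plus a link to the next slot. The class below is a hypothetical software model (names, the free list, and the per-queue head/tail tables are assumptions), meant only to show that many queues can share one pool without copying rays.

```python
NIL = -1


class RegisterQueues:
    """Hypothetical model: one ray per register, queues as linked lists."""

    def __init__(self, num_registers):
        self.ray = [None] * num_registers
        # Initially every register sits on the free list: 0 -> 1 -> ... -> NIL.
        self.next = list(range(1, num_registers)) + [NIL]
        self.free_head = 0 if num_registers else NIL
        self.head, self.tail = {}, {}   # per-queue first/last register index

    def push(self, queue_id, ray):
        reg = self.free_head
        if reg == NIL:
            raise RuntimeError("register file full")
        self.free_head = self.next[reg]
        self.ray[reg], self.next[reg] = ray, NIL
        if queue_id in self.tail:
            self.next[self.tail[queue_id]] = reg   # append to existing list
        else:
            self.head[queue_id] = reg              # first ray in this queue
        self.tail[queue_id] = reg

    def pop(self, queue_id):
        reg = self.head.get(queue_id, NIL)
        if reg == NIL:
            return None
        ray, nxt = self.ray[reg], self.next[reg]
        if nxt == NIL:
            del self.head[queue_id], self.tail[queue_id]
        else:
            self.head[queue_id] = nxt
        self.ray[reg], self.next[reg] = None, self.free_head   # recycle register
        self.free_head = reg
        return ray


queues = RegisterQueues(num_registers=4)
queues.push("treelet_7", "ray A")
queues.push("treelet_2", "ray B")
queues.push("treelet_7", "ray C")
print(queues.pop("treelet_7"), queues.pop("treelet_7"), queues.pop("treelet_2"))
```

Moving a ray between queues only rewrites link fields; the 64-byte ray state stays in its register.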

MIMD traversal shares with SIMD core.

Queue data stays on chip. Single global resource. Has remote logic capability already.

Queue data stays on chip. One per core. 40% used for queues.

Queue rays when data is expensive. Add Hit-Only load type. Cache reserved for scene data.
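A plausible reading of the Hit-Only load is a load that returns data only when it is already cached and otherwise reports a miss without fetching, so the ray can be queued rather than stalled. The sketch below models that reading with a dictionary as the cache; the function names and the deferral table are assumptions, not the hardware interface.

```python
def hit_only_load(cache, address):
    """Return (True, data) on a cache hit; (False, None) on a miss.

    Assumed semantics: a miss never triggers an off-chip fetch.
    """
    if address in cache:
        return True, cache[address]
    return False, None


def traversal_step(ray, node_address, cache, deferred):
    """Try to read a BVH node; queue the ray by address if the data is not cached."""
    hit, node = hit_only_load(cache, node_address)
    if not hit:
        deferred.setdefault(node_address, []).append(ray)   # wait for this data
        return None
    return node


cache = {0x100: "node data"}
deferred = {}
print(traversal_step("ray A", 0x100, cache, deferred))   # served from cache
print(traversal_step("ray B", 0x200, cache, deferred))   # queued instead of fetched
print(traversal_step("ray C", 0x200, cache, deferred))   # joins the same queue
print(deferred)
```

Grouping deferred rays by the data they need means one later fetch can serve every ray waiting on it.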

Queue-based intersect and shading. Register transpose units ease transitioning between MIMD and SIMD operation.
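The transpose can be pictured as converting between one ray per register (convenient for per-ray MIMD traversal) and one ray per lane (convenient for SIMD intersect and shading programs). A minimal sketch, assuming a 64-byte register holds 16 lanes of 4 bytes; the lane width and the field layout are assumptions:

```python
LANES = 16   # assumption: 64-byte register = 16 lanes of 4 bytes


def transpose(registers):
    """Swap rows and columns of a square block of registers."""
    return [list(column) for column in zip(*registers)]


# MIMD layout: register r holds the 16 words of ray r.
rays = [[f"ray{r}.word{w}" for w in range(LANES)] for r in range(LANES)]

# SIMD layout: register w holds word w of 16 different rays, one per lane.
simd = transpose(rays)

print(rays[3][5], "|", simd[5][3])
```

Applying the transpose twice restores the original layout, which is what lets rays move back and forth between the two execution styles.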

Ray tracing fits in current GPUs

Evaluation

Experimental setup: two simulators measure key metrics. Scenes: Crown (4.9M triangles), Hairball (2.9M triangles), Powerplant (12.8M triangles), Vegetation (1.1M triangles).

Reduced precision has low costs. (Chart: normalized traversal steps per ray for Hairball, Vegetation, Powerplant, and Crown; series: Baseline, Compressed BVH, 8-bit Updates, 1-bit Updates.) Compressed BVH: 7% more work (usually). Reduced precision traversal: 3% more work. 1-bit traversal point update: 1-2% more work. Total overhead is small: 10-15% (usually).

Off-chip traffic is in the real-time range. Simulated workload set to 30 frames per second. (Chart: GB of off-chip traffic for Hairball, Vegetation, Powerplant, and Crown, marked as traversal limited or bandwidth limited; bar labels: 282 GB/s, 219 GB/s, 231 GB/s, 123 GB/s.)

Simulated ray rate. Average ray rate: 3.4 billion rays/s. (Chart: million rays per second for Hairball, Vegetation, Powerplant, and Crown, compared with current software performance; bar labels: 3711, 3580, 3964, 2248.)
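The quoted average can be checked against the four bar labels on the chart. A short sketch; the values are taken as printed, without assuming which scene each one belongs to:

```python
# Bar labels from the ray-rate chart, in million rays per second.
ray_rates_mrps = [3711, 3580, 3964, 2248]

average_mrps = sum(ray_rates_mrps) / len(ray_rates_mrps)
print(f"average: {average_mrps} M rays/s = {average_mrps / 1000:.1f} billion rays/s")

# Fraction of the 4 billion rays/s target from the motivating example.
print(f"share of 4 billion rays/s target: {average_mrps / 4000:.0%}")
```

The mean of the four labels rounds to the 3.4 billion rays/s stated on the slide.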

Conclusion: Reduced precision yields surprising performance benefits, roughly 20x. Hardware ray tracing acceleration can be a lightweight feature of modern GPUs.

Acknowledgements: Samuli Laine for use of Vegetation and Hairball; Martin Lubich & Intel for use of Crown; Daniel Kopta for assistance with comparisons in the paper; UT Graphics group and others for many useful discussions.

Questions?

Ground Truth

Incorrectly taken paths abort quickly
