SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing

SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing
Won-Jong Lee, Shi-Hwa Lee, Jae-Ho Nah*, Jin-Woo Kim*, Youngsam Shin, Jaedon Lee, Seok-Yoon Jung
SAIT, SAMSUNG Electronics; *Yonsei Univ., Korea
Talks, ACM SIGGRAPH 2012

Outline
- Introduction
- SGRT Core Architecture
- T&I Engine: H/W Accelerator
- SRP: Programmable DSP
- SMK: Parallelization Framework
- Experimental Results
- Conclusion

Introduction

Graphics Trends
- Graphics is becoming more important as smart devices proliferate
- Evolving toward more realistic graphics
- Mobile graphics trail earlier PC graphics (by 5~6 years)
[Figure: timeline of "Reality" versus year ('10, '15, '20). PC/Console: Realistic 3D Game ('10), Immersive AR/MR. Mobile/CE: 3D Game ('04), Smart Phone ('09), Smart TV ('10), Realistic 3D Game on Mobile/CE, Immersive AR/MR on Mobile/CE.]

Mobile SoC
[Figure: Apple A5X die photo. Image courtesy: Chipworks]

Current Mobile GPU for Ray Tracing
- Inadequate performance
  - Flagship mobile GPU: ~256 GFLOPS (ARM Mali T658)
  - Real-time ray tracing @HD: >300 Mray/sec (1~2 TFLOPS)
- Unsuitable execution model
  - Multithreaded SIMD is a poor fit for processing incoherent rays
- Weak branch support
  - Performance drops with recursion, function calls, and control flow
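The performance gap quoted on this slide can be worked out directly. The sketch below uses only the slide's figures and treats the quoted peak numbers as directly comparable, which is a simplifying assumption; the function names are illustrative.

```python
# Figures quoted on the slide.
MOBILE_GPU_FLOPS = 256e9        # ~256 GFLOPS, flagship mobile GPU
TARGET_RAYS_PER_SEC = 300e6     # >300 Mray/sec for real-time ray tracing at HD
REQUIRED_FLOPS = (1e12, 2e12)   # 1~2 TFLOPS


def flops_per_ray(flops, rays_per_sec):
    """Compute budget per ray implied by a FLOPS figure and a ray rate."""
    return flops / rays_per_sec


def shortfall(required_flops, available_flops):
    """How many times more compute is required than is available."""
    return required_flops / available_flops


per_ray_low, per_ray_high = (flops_per_ray(f, TARGET_RAYS_PER_SEC) for f in REQUIRED_FLOPS)
gap_low, gap_high = (shortfall(f, MOBILE_GPU_FLOPS) for f in REQUIRED_FLOPS)

print(f"Implied budget: {per_ray_low:.0f} to {per_ray_high:.0f} FLOPs per ray")
print(f"Shortfall of the mobile GPU: {gap_low:.1f}x to {gap_high:.1f}x")
```

Under these assumptions the flagship mobile GPU falls short by roughly 4x to 8x in raw compute alone, before the execution-model and branching issues are considered.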

Need a New Architecture?
- Dedicated, fixed-function H/W
  - High performance and power efficiency, but limited flexibility
  - RPU [Woop, SIGGRAPH 2005]
- Fully programmable processor
  - Flexible, but inadequate performance and power consumption
  - Reconfigurable stream processor [Kim, CICC 2012]: 1~2 Mrays/sec
  - MIMD threaded processor [Spjut, SHAW-3 2012]: ~30 Mrays/sec

Requirements
- Performance for real-time rendering
  - 200~300 Mray/sec
- Reasonable flexibility
  - Programmable shading and ray generation
  - Support for various BVHs: SAH/Binned/SBVH/LBVH, etc.
  - Easy to extend to GI (path tracing, photon mapping, etc.)
  - Easy to combine a rasterizer (OpenGL ES) with ray tracing
- Low power & cost

SGRT

Our Approach
- Combination of CPU, H/W and DSP (mobile SoC)
  - Tree build (sorting, irregular work): multi-core CPU with multi-level cache
  - Refit, traversal, intersection (embarrassingly parallel): dedicated H/W
  - Ray generation & shading (need for flexibility): programmable DSP
[Figure: Multi-core CPU (Tree Build), Dedicated H/W (Traversal & Intersection), and Programmable DSP (Ray Gen. & Shading), connected to memory]

System Architecture
- SGRT (Samsung reconfigurable GPU based on Ray Tracing)
  - T&I Engine: fast, compact H/W to accelerate traversal & intersection
  - SRP: Samsung Reconfigurable Processor to support flexible shading
  - SMK: parallelization framework
[Figure: system block diagram. Labels: Multi-core ARM (Core #1 to #4), Host System BUS, Host DRAM; SGRT Core #1 to #4; T&I Engine (Refitting Unit, Traversal Units, Intersection Unit, L1 caches, L2 cache); SRP (VLIW Engine, Coarse Grained Reconfigurable Array, Internal SRAM, I-Cache, C-Mem, Texture Unit with L1 cache); AXI System BUS; External DRAM]

T&I Engine: A MIMD H/W Accelerator
- Newly designed H/W accelerator based on our previous work
  - KD-tree H/W engine [Nah, SIGGRAPH ASIA 2011]
- Single-ray-based MIMD architecture: efficient processing of incoherent rays
- Ray Accumulation Unit (RAU): hardware multithreading
- Optimized restart & short stack algorithm
  - Adaptive restart trail [Lee, HPG 2012]
- Early intersection test
  - Reduces expensive ray-primitive intersection (IST) tests
[Figure: T&I Engine block diagram. Labels: rays in, hit info out; Ray Dispatcher; Traversal Units and Intersection Unit, each with RAU, pipeline, and L1$ (plus a stack in the traversal units); shared L2$; MIMD arch.]

Early (Two-Pass) Intersection Test
[Figure: comparison of two test sequences. Conventional IST: ray-node AABB test, then ray-primitive test in the intersection unit. Early IST: ray-node AABB test, then ray-primitive AABB test, then ray-primitive test in the intersection unit. Legend: inner node, leaf node, primitive AABB, primitive.]
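The two-pass idea can be sketched in software as follows. This is an illustration under assumptions, not the T&I Engine's implementation: it uses a standard slab test for the ray-AABB pass and the Möller-Trumbore test for the ray-triangle pass, and it recomputes each primitive's AABB on the fly, whereas hardware would presumably store it. All function names are hypothetical.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_aabb(orig, direction, lo, hi):
    """Cheap first pass: slab test of a ray (t >= 0) against the box [lo, hi]."""
    t_near, t_far = 0.0, math.inf
    for o, d, l, h in zip(orig, direction, lo, hi):
        if d == 0.0:
            if o < l or o > h:      # parallel to this slab and outside it
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def ray_triangle(orig, direction, v0, v1, v2, eps=1e-9):
    """Expensive second pass: Moller-Trumbore. Returns hit distance t or None."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None
    inv = 1.0 / det
    s = sub(orig, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None


def tri_aabb(tri):
    lo = tuple(min(v[i] for v in tri) for i in range(3))
    hi = tuple(max(v[i] for v in tri) for i in range(3))
    return lo, hi


def intersect_leaf(orig, direction, tris, early=True):
    """Returns (nearest t or None, number of full ray-triangle tests run)."""
    best, full_tests = None, 0
    for tri in tris:
        if early and not ray_aabb(orig, direction, *tri_aabb(tri)):
            continue                # rejected by the cheap primitive-AABB pass
        full_tests += 1
        t = ray_triangle(orig, direction, *tri)
        if t is not None and (best is None or t < best):
            best = t
    return best, full_tests


leaf = [
    ((-1.0, -1.0, 0.0), (1.0, -1.0, 0.0), (0.0, 1.0, 0.0)),   # in the ray's path
    ((5.0, 5.0, 0.0), (6.0, 5.0, 0.0), (5.0, 6.0, 0.0)),      # off to the side
    ((9.0, 9.0, 2.0), (10.0, 9.0, 2.0), (9.0, 10.0, 2.0)),    # off to the side
]
ray_o, ray_d = (0.0, 0.0, -1.0), (0.0, 0.0, 1.0)
print(intersect_leaf(ray_o, ray_d, leaf, early=False))
print(intersect_leaf(ray_o, ray_d, leaf, early=True))
```

Both calls find the same nearest hit; with the early pass enabled, only the triangle whose bounding box the ray actually enters reaches the full ray-triangle test.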

Ray Accumulation Unit
- Specialized H/W multithreading for latency hiding [Nah, 2011]
  - Rays that miss the cache are accumulated in the RA buffer; other rays can be processed in the meantime
  - Coherence can be increased: rays that reference the same cache line are accumulated in the same row of the RA buffer
- Experimental results: up to 3x performance gain
[Figure: RAU data path. Labels: input buffer, ray accumulation unit with control buffer and per-row occupation counter, non-blocking cache (cache address, cache data, hit result), cache hit and cache miss paths, ray + data forwarded to the traversal or intersection pipeline]
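The accumulation behaviour described above can be modelled with a small toy class. This is a sketch under assumptions: the row count, row capacity, and the stall policy when the buffer is full are not given on the slide, and the class and method names are hypothetical.

```python
class RayAccumulationBuffer:
    """Toy model of an RA buffer: one row per outstanding cache line."""

    def __init__(self, max_rows, row_capacity):
        self.rows = {}                  # cache line address -> waiting ray ids
        self.max_rows = max_rows        # assumed parameter
        self.row_capacity = row_capacity  # assumed parameter

    def on_miss(self, ray_id, line_addr):
        """Park a ray that missed the cache.

        Returns "new" if a new row was opened (a memory request is needed),
        "merged" if the ray joined a row already waiting on the same line,
        or "stall" if there is no room (assumed policy).
        """
        row = self.rows.get(line_addr)
        if row is not None:
            if len(row) >= self.row_capacity:
                return "stall"
            row.append(ray_id)
            return "merged"
        if len(self.rows) >= self.max_rows:
            return "stall"
        self.rows[line_addr] = [ray_id]
        return "new"

    def on_fill(self, line_addr):
        """The cache line arrived: release every ray waiting on it, in order."""
        return self.rows.pop(line_addr, [])


buf = RayAccumulationBuffer(max_rows=2, row_capacity=2)
events = [buf.on_miss(0, 0x40), buf.on_miss(1, 0x40), buf.on_miss(2, 0x80)]
released = buf.on_fill(0x40)
print(events, released)
```

Rays 0 and 1 share a row because they wait on the same cache line, so a single fill releases both to the pipeline together. That grouping is the coherence gain the slide refers to.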

Samsung Reconfigurable Processor
- A flexible architecture template [Lee, HPG 2011/2012]
  - Arithmetic, special-function, and texture instructions are implemented in the ISA
  - The VLIW engine is useful for general-purpose computation (function invocation, control flow)
  - The CGRA makes full use of software pipelining for loop acceleration
[Figure: SRP block diagram. Labels: instruction, data, VLIW, central RF (register file), array of FUs with local RFs (CGA); code timeline alternating control processing and data processing, with for-loops marked as the data-processing parts]

Packet Stream Tracing on SRP
- Recursion is removed
  - Job-queue-based streamed iteration
  - Rays are classified according to the type of operation
- A packet of rays is batched
- Each kernel is mapped onto the CGA; loop acceleration yields a high IPC, up to the maximum number of FUs in the array
[Figure: kernel flow. Labels: intersection result; classify hit rays, update colors; compute normal vectors; classify secondary rays & texture; generate secondary rays (reflection ray, refraction ray); compute texture color; compute N·L, classify shading; shading, generate shadow rays (shadow ray). Steps are marked as either CGA kernels or VLIW code.]
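Replacing recursion with job queues can be sketched as below. This is a minimal illustration, assuming one queue per ray type and a fixed packet size; the `shade` callback, the queue names, and the packet size are hypothetical and do not reflect the actual SRP kernels.

```python
from collections import deque


def trace_iteratively(primary_rays, shade, packet_size=4):
    """Process rays in per-type packets instead of recursing per ray.

    shade(kind, ray) returns a list of (kind, ray) secondary rays to enqueue.
    Returns a log of (kind, packet length) in processing order.
    """
    queues = {
        "primary": deque(primary_rays),
        "reflection": deque(),
        "refraction": deque(),
        "shadow": deque(),
    }
    log = []
    while any(queues.values()):
        for kind, queue in queues.items():
            if not queue:
                continue
            # Batch up to packet_size rays of the same type into one packet.
            packet = [queue.popleft() for _ in range(min(packet_size, len(queue)))]
            log.append((kind, len(packet)))
            for ray in packet:
                # Secondary rays go back into queues rather than onto a call stack.
                for new_kind, new_ray in shade(kind, ray):
                    queues[new_kind].append(new_ray)
    return log


def toy_shade(kind, ray):
    """Toy shader: each primary ray spawns one reflection and one shadow ray."""
    pixel, depth = ray
    if kind != "primary":
        return []
    return [("reflection", (pixel, depth + 1)), ("shadow", (pixel, depth + 1))]


log = trace_iteratively([(pixel, 0) for pixel in range(6)], toy_shade)
print(log)
```

Because every packet holds rays of a single type, each packet runs the same kernel over all of its rays, which is the loop shape a CGA-style array can software-pipeline.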

Parallelization Framework
- Parallel ray tracing with a multi-tasking system
  - Uses an embedded RTOS, SMK (Samsung Multi-Platform Kernel) [Shin, SAC 2011]
  - Supports multi-tasking through systematic scheduling of the task queues
- The task for each SGRT core is responsible for different pixels (or pixel tiles)
- The scheduler distributes the next task to an idle SGRT core first (dynamic load balancing)
[Figure: several SGRT cores, each pairing a T&I Engine and an SRP, running under SMK]
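The effect of handing the next tile to whichever core becomes idle first can be illustrated with a small simulation. This sketch is not SMK's scheduler: the tile costs are made-up integers in arbitrary time units, and the static split is included only as a baseline for comparison.

```python
import heapq


def dynamic_schedule(tile_costs, num_cores):
    """Give each tile, in order, to the core that becomes idle earliest."""
    idle = [(0, core) for core in range(num_cores)]   # (time core is free, core id)
    heapq.heapify(idle)
    assignment = {core: [] for core in range(num_cores)}
    for tile, cost in enumerate(tile_costs):
        free_at, core = heapq.heappop(idle)
        assignment[core].append(tile)
        heapq.heappush(idle, (free_at + cost, core))
    makespan = max(free_at for free_at, _ in idle)
    return assignment, makespan


def static_makespan(tile_costs, num_cores):
    """Baseline: split the tiles into equal contiguous chunks up front."""
    chunk = -(-len(tile_costs) // num_cores)          # ceiling division
    return max(sum(tile_costs[i:i + chunk])
               for i in range(0, len(tile_costs), chunk))


costs = [8, 1, 1, 1, 1, 1, 1, 1]    # one expensive tile, seven cheap ones
assignment, dyn = dynamic_schedule(costs, num_cores=2)
print(assignment, dyn, static_makespan(costs, num_cores=2))
```

In this toy case one core spends its whole time on the expensive tile while the other works through the cheap ones, so the frame finishes in 8 time units instead of the 11 needed by the static split.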

Evaluation

Simulation Environment
- Built a cycle-accurate simulator for the T&I Engine and an in-house cycle-accurate compiled simulator, called csim, for the SRP
- Test conditions with two benchmarks
  - Full SAH, cost ratio 5:1 (TRV:IST) for a shallow tree
  - Ferrari scene (210K triangles, 1 light source)
  - Fairy scene (170K triangles, 2 light sources)
  - Shadow, reflection, refraction @WVGA (800x640)

Preliminary Results
- Architecture configuration
  - 4 SGRT cores, traversal & intersection unit = 4:1 per SGRT core
  - 1 GHz core clock
- Achieved around 170 MRPS (T&I) and 255 MRPS (RGS) for Fairy
- Recent GPU ray tracer: 156~317 MRPS on NVIDIA Kepler [Aila, HPG 2012]
Scene    # of tri.  # of rays  T&I pipeline usage (%)  T&I TRV $ hit ratio (%)  T&I IST $ hit ratio (%)  T&I MRPS  SRP MRPS  Simulated FPS
Fairy    170K       1.7M       87.27                   93.83                    96.53                    171.32    255.72    87.82
Ferrari  210K       1.5M       79.75                   92.56                    92.92                    122.48    319.56    67.83

FPGA
- We are also currently testing SGRT on an FPGA board

Conclusion

- SGRT: a novel mobile GPU based on ray tracing, and the first mobile GPU to realize real-time ray tracing
- Carefully designed to suit the mobile SoC environment
- Currently implementing the T&I Engine at the RTL level
- Future work
  - Analyze cost and power consumption
  - Support dynamic scenes with a fast BVH build algorithm optimized for the mobile environment
  - Higher-level shading/ecosystem
- Poster (#103) session: 8/7, 8/8 12:15-13:15PM

Acknowledgements
- This project is based on a collaboration with two universities (Yonsei, National Kongju).
- The authors thank two professors (Tack-Don Han, Hyun-Sang Park) for their valuable advice.
Thanks