SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing

1 SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing
Won-Jong Lee, Shi-Hwa Lee, Jae-Ho Nah*, Jin-Woo Kim*, Youngsam Shin, Jaedon Lee, Seok-Yoon Jung
SAIT, SAMSUNG Electronics; *Yonsei Univ., Korea
Talks, ACM SIGGRAPH 2012

2 Outline
- Introduction
- SGRT Core Architecture
  - T&I Engine: H/W accelerator
  - SRP: programmable DSP
  - SMK: parallelization framework
- Experimental Results
- Conclusion

3 Introduction

4 Reality Graphics Trends
- Graphics is becoming more important as the number of smart devices grows
- Graphics is evolving toward greater realism
- Mobile graphics follows PC graphics with a lag of about 5~6 years
[Figure: timeline. PC/Console: Realistic 3D Game ('10), then Immersive AR/MR. Mobile/CE: 3D Game ('04), Smart Phone ('09), Smart TV ('10), then Realistic 3D Game on Mobile/CE and Immersive AR/MR on Mobile/CE]

5 Mobile SoC
[Figure: Apple A5X die photo. Image courtesy: Chipworks]

6 Current Mobile GPU for Ray Tracing
- Inadequate performance
  - Flagship mobile GPU: ~256 GFLOPS (ARM Mali T658)
  - Real-time ray tracing needs >300 Mrays/sec (1~2 TFLOPS)
- Unsuitable execution model
  - Multithreaded SIMD is a poor fit for processing incoherent rays
- Weak branch support
  - Performance drops with recursion, function calls, and control flow
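The gap stated on this slide can be worked out as a back-of-envelope calculation. The sketch below uses only the slide's figures (300 Mrays/sec, 1~2 TFLOPS, ~256 GFLOPS); the function names are hypothetical and chosen for illustration.

```python
def flops_per_ray(required_tflops, mrays_per_sec):
    """Implied floating-point operations spent on each ray."""
    return (required_tflops * 1e12) / (mrays_per_sec * 1e6)


def compute_shortfall(required_tflops, available_gflops):
    """How many times more compute is needed than the GPU provides."""
    return (required_tflops * 1e12) / (available_gflops * 1e9)


if __name__ == "__main__":
    for tflops in (1.0, 2.0):
        print(
            f"{tflops} TFLOPS: "
            f"{flops_per_ray(tflops, 300):.0f} FLOPs/ray, "
            f"{compute_shortfall(tflops, 256):.1f}x a 256 GFLOPS GPU"
        )
```

Under these figures, each ray implies a budget of a few thousand floating-point operations, and the flagship mobile GPU falls short by roughly 4x to 8x before the execution-model problems are even considered.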

7 Need a New Architecture?
- Dedicated, fixed-function H/W
  - Good performance and power efficiency, but weak flexibility
  - RPU [Woop, SIGGRAPH 2005]
- Fully programmable processor
  - Flexible, but inadequate performance and power consumption
  - Reconfigurable stream processor [Kim, CICC 2012]: 1~2 Mrays/sec
  - MIMD threaded processor [Spjut, SHAW ]: ~30 Mrays/sec

8 Requirements
- Performance for real-time rendering: 200~300 Mrays/sec
- Reasonable flexibility
  - Programmable shading and ray generation
  - Support for various BVHs: SAH/Binned/SBVH/LBVH..
  - Easy to extend to GI (path tracing, photon mapping..)
  - Easy to combine a rasterizer (OpenGL ES) with ray tracing
- Low power & cost

9 SGRT

10 Our Approach
Combination of CPU, dedicated H/W, and DSP (mobile SoC)
- Tree build: sorting, irregular work; handled by the multi-core CPU (with multi-level cache)
- Refit, traversal, intersection: embarrassingly parallel; handled by dedicated H/W
- Ray generation & shading: need flexibility; handled by the programmable DSP
[Figure: multi-core CPU (tree build), dedicated H/W (traversal & intersection), and programmable DSP (ray gen. & shading), with memory]

11 System Architecture
SGRT (Samsung reconfigurable GPU based on Ray Tracing)
- T&I Engine: fast, compact H/W to accelerate traversal & intersection
- SRP: Samsung Reconfigurable Processor to support flexible shading
- SMK: parallelization framework
[Figure: block diagram. Host: multi-core ARM (Core #1~#4), host system bus, host DRAM. SGRT Cores #1~#4, each with a T&I Engine (refitting unit, traversal units, intersection unit, L1 caches, L2 cache) and an SRP (VLIW engine, coarse-grained reconfigurable array, internal SRAM, I-cache, C-Mem, texture unit with L1 cache). AXI system bus, external DRAM]

12 T&I Engine: A MIMD H/W Accelerator
- Newly designed H/W accelerator based on our previous work, the KD-tree H/W engine [Nah, SIGGRAPH ASIA 2011]
- Single-ray-based MIMD architecture: efficient processing of incoherent rays
- Ray Accumulation Unit (RAU): hardware multithreading
- Optimized restart & short-stack algorithm: adaptive restart trail [Lee, HPG 2012]
- Early intersection test: reduces the number of expensive ray-primitive intersection tests (IST)
[Figure: rays enter the T&I Engine through a ray dispatcher; traversal units (each with RAU, pipeline, stack, and L1$) and an intersection unit (RAU, pipeline, L1$) share an L2$ in a MIMD arrangement; hit info is returned]

13 Early (Two-Pass) Intersection Test
- Conventional IST: ray-node AABB test, then ray-primitive test
- Early IST: ray-node AABB test, then ray-primitive AABB test, then ray-primitive test
[Figure: both test sequences on an example tree of inner nodes, leaf nodes, primitive AABBs, and primitives T0~T7; the ray-primitive test is labeled with the intersection unit]
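A minimal software sketch of the two-pass idea follows: each primitive's bounding box is tested first, and the full ray-triangle test runs only for the survivors. It assumes a slab test for the AABB and the Möller-Trumbore test for triangles; the slides do not say which tests the hardware uses, and all function names here are hypothetical.

```python
EPS = 1e-9


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def triangle_aabb(tri):
    """Axis-aligned bounding box (lo, hi) of a triangle."""
    lo = tuple(min(v[i] for v in tri) for i in range(3))
    hi = tuple(max(v[i] for v in tri) for i in range(3))
    return lo, hi


def ray_hits_aabb(origin, direction, lo, hi):
    """Cheap slab test: does the ray (t >= 0) overlap the box?"""
    t_near, t_far = 0.0, float("inf")
    for i in range(3):
        if abs(direction[i]) < EPS:
            # Ray is parallel to this slab: it must start inside it.
            if origin[i] < lo[i] or origin[i] > hi[i]:
                return False
            continue
        t0 = (lo[i] - origin[i]) / direction[i]
        t1 = (hi[i] - origin[i]) / direction[i]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def ray_hits_triangle(origin, direction, tri):
    """Expensive test (Moller-Trumbore). Returns hit distance or None."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < EPS:
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > EPS else None


def intersect_leaf(origin, direction, triangles, early=True):
    """Returns (closest hit distance or None, number of triangle tests)."""
    closest, triangle_tests = None, 0
    for tri in triangles:
        # Pass 1: skip primitives whose bounding box the ray misses.
        if early and not ray_hits_aabb(origin, direction, *triangle_aabb(tri)):
            continue
        # Pass 2: full ray-primitive test.
        triangle_tests += 1
        t = ray_hits_triangle(origin, direction, tri)
        if t is not None and (closest is None or t < closest):
            closest = t
    return closest, triangle_tests
```

Calling `intersect_leaf` with `early=False` gives the conventional behavior, so the two test counts can be compared on the same leaf. In hardware the primitive AABBs would presumably be precomputed rather than derived per test as in this sketch.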

14 Ray Accumulation Unit
- Specialized H/W multithreading for latency hiding [Nah, 2011]
- Rays that miss the cache are accumulated in the RA buffer; other rays can be processed during this period
- Coherence can increase, because rays that reference the same cache line are accumulated in the same row of the RA buffer
- Experimental results: up to 3x performance gain
[Figure: input buffer, non-blocking cache (cache address, cache data, hit result), and Ray Accumulation Unit (control, buffer rows holding rays, cache address, cache data, and an occupation counter); ray + data go to the traversal or intersection pipeline on a cache hit, and missed rays wait in the buffer]
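The accumulation behavior described above can be illustrated with a toy software model. This is a sketch under stated assumptions, not the hardware design: the 64-byte line size, the class name `RayAccumulationBuffer`, and its methods are all hypothetical, and buffer capacity limits are ignored.

```python
from collections import OrderedDict

LINE_SIZE = 64  # assumed cache-line size in bytes (not given in the slides)


class RayAccumulationBuffer:
    """Toy model: rays that miss wait in a row keyed by cache line."""

    def __init__(self, cached_lines=()):
        self.cached = set(cached_lines)   # lines currently in the cache
        self.rows = OrderedDict()         # cache line -> waiting ray ids
        self.memory_requests = 0          # fetches issued to the next level

    def submit(self, ray_id, address):
        """Return True if the ray can enter the pipeline immediately."""
        line = address // LINE_SIZE
        if line in self.cached:
            return True
        if line not in self.rows:
            # First miss on this line: open a row and issue one fetch.
            self.rows[line] = []
            self.memory_requests += 1
        # Later misses on the same line share the row and the fetch.
        self.rows[line].append(ray_id)
        return False

    def fill(self, line):
        """A fetch completed: release every ray waiting on that line."""
        self.cached.add(line)
        return self.rows.pop(line, [])
```

In this model, rays that miss on the same line trigger a single fetch and are released together, which is one way to read the slide's claim that accumulation increases coherence while other rays keep the pipeline busy.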

15 Samsung Reconfigurable Processor
- A flexible architecture template [Lee, HPG 2011/2012]
- The ISA implements arithmetic, special-function, and texture instructions
- The VLIW engine is useful for general-purpose computation (function invocation, control flow)
- The CGRA makes full use of software pipelining for loop acceleration
[Figure: instruction and data paths into the VLIW engine with a central register file (RF) and functional units (FUs), and a coarse-grained array (CGA) of FUs with local RFs; program execution alternates between control processing and data processing of "for" loops]

16 Packet Stream Tracing on SRP
- Recursion is removed: job-queue-based streamed iteration
- Rays are classified according to the type of operation
- A packet of rays is batched
- Each kernel is mapped onto the CGA; the accelerated loops reach a high IPC, up to the maximum number of FUs in the array
[Figure: flow of CGA kernels and VLIW code starting from the intersection result: classify hit rays and update colors; compute normal vectors; classify secondary rays & texture; generate secondary rays (reflection ray, refraction ray); compute texture color; compute N·L and classify shading; shading and generate shadow rays (shadow ray)]
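The step from recursion to job-queue-based streamed iteration can be sketched with a toy shading model. Everything below is an illustrative assumption: each hit contributes a constant local color and spawns one reflection ray with a fixed reflectance, and the names `trace_recursive` and `trace_streamed` are hypothetical.

```python
from collections import deque

MAX_DEPTH = 3       # assumed recursion limit
LOCAL_COLOR = 1.0   # toy local shading result at every hit
REFLECTANCE = 0.5   # toy weight carried by each reflection ray


def trace_recursive(depth=0):
    """Reference: classic recursive formulation for one pixel."""
    if depth >= MAX_DEPTH:
        return 0.0
    return LOCAL_COLOR + REFLECTANCE * trace_recursive(depth + 1)


def trace_streamed(num_pixels, packet_size=4):
    """Same result without recursion: rays are jobs in a queue,
    processed in packets, and each job carries its own weight."""
    framebuffer = [0.0] * num_pixels
    jobs = deque((pixel, 0, 1.0) for pixel in range(num_pixels))
    while jobs:
        # Batch a packet of rays for one pass of the kernels.
        count = min(packet_size, len(jobs))
        packet = [jobs.popleft() for _ in range(count)]
        for pixel, depth, weight in packet:
            # Update the pixel color with this ray's weighted contribution.
            framebuffer[pixel] += weight * LOCAL_COLOR
            # Generate the secondary ray as a new job instead of recursing.
            if depth + 1 < MAX_DEPTH:
                jobs.append((pixel, depth + 1, weight * REFLECTANCE))
    return framebuffer
```

With these toy constants both versions give 1.75 per pixel. The streamed form keeps no call stack, which suits a processor with weak support for recursion, and its inner loop is the kind of code the slide describes as mapped onto the CGA.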

17 Parallelization Framework
- Parallel ray tracing on a multitasking system
- Uses an embedded RTOS, SMK (Samsung Multi-Platform Kernel) [Shin, SAC 2011]
- Supports multitasking through systematic scheduling of the task queues
- The task for each SGRT core is responsible for different pixels (or pixel tiles)
- The scheduler distributes the next task to an idle SGRT core first (dynamic load balancing)
[Figure: each SGRT core pairs a T&I Engine and an SRP, managed by SMK]
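The dynamic load balancing described above can be sketched with a shared task queue from which idle workers pull the next tile. This uses Python threads as stand-ins for SGRT cores and is not the SMK API; `render_tiles` and its parameters are hypothetical names.

```python
import queue
import threading


def render_tiles(num_tiles, num_cores, render_tile):
    """Each worker (standing in for one SGRT core) takes the next tile
    as soon as it becomes idle, so faster workers process more tiles."""
    tasks = queue.Queue()
    for tile in range(num_tiles):
        tasks.put(tile)

    results = {}
    lock = threading.Lock()

    def worker(core_id):
        while True:
            try:
                tile = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this core
            value = render_tile(tile)
            with lock:
                results[tile] = (core_id, value)

    threads = [threading.Thread(target=worker, args=(core,))
               for core in range(num_cores)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    done = render_tiles(16, 4, lambda tile: tile * tile)
    print(len(done), "tiles rendered")
```

Which worker renders which tile is not fixed in advance. That is the point of the scheme: tiles that take longer to render do not stall the other cores.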

18 Evaluation

19 Simulation Environment
- Built a cycle-accurate simulator for the T&I Engine, and used an in-house cycle-accurate compiled simulator, called csim, for the SRP
- Test conditions with two benchmarks
  - Full SAH, cost ratio 5:1 (TRV:IST) for a shallow tree
  - Ferrari scene (210K triangles, 1 light source)
  - Fairy scene (170K triangles, 2 light sources)
  - Shadow, reflection, refraction @ WVGA (800x640)
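The effect of the 5:1 traversal-to-intersection cost ratio can be sketched with the standard surface area heuristic (SAH) cost model. The slides do not give the builder's exact formulation, so the functions below are an assumed textbook version with hypothetical names.

```python
def leaf_cost(num_prims, c_ist=1.0):
    """Expected cost of testing every primitive in an unsplit node."""
    return num_prims * c_ist


def split_cost(area_parent, area_left, n_left, area_right, n_right,
               c_trv=5.0, c_ist=1.0):
    """Expected cost of one traversal step plus both children, each
    weighted by the probability (surface-area ratio) of a ray hitting it."""
    p_left = area_left / area_parent
    p_right = area_right / area_parent
    return c_trv + p_left * n_left * c_ist + p_right * n_right * c_ist


def should_split(area_parent, area_left, n_left, area_right, n_right,
                 c_trv=5.0, c_ist=1.0):
    """Split only when it is expected to be cheaper than a leaf."""
    split = split_cost(area_parent, area_left, n_left,
                       area_right, n_right, c_trv, c_ist)
    return split < leaf_cost(n_left + n_right, c_ist)


if __name__ == "__main__":
    candidate = (10.0, 6.0, 4, 6.0, 4)  # made-up node, for illustration
    print("5:1 ratio:", should_split(*candidate))
    print("1:1 ratio:", should_split(*candidate, c_trv=1.0))
```

For the made-up candidate in the usage example, the 5:1 ratio rejects a split that a 1:1 ratio would accept. A higher traversal cost therefore leaves larger leaves and a shallower tree, consistent with the slide's "for a shallow tree".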

20 Preliminary Results
- Architecture configuration: 4 SGRT cores, traversal units to intersection units = 4:1 per SGRT core, 1 GHz core clock
- Achieved around 170 MRPS (T&I) and 255 MRPS (RGS: ray generation & shading) for Fairy
- Recent GPU ray tracer: 156~317 MRPS on NVIDIA Kepler [Aila, HPG 2012]

Scene    # of tri.  # of rays
Fairy    170K       1.7M
Ferrari  210K       1.5M
Further table columns (values not preserved in this transcript): pipeline usage; T&I Engine TRV $ hit ratio, IST $ hit ratio, and MRPS; SRP MRPS; simulated FPS
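The relation between ray throughput (MRPS) and frame rate can be sketched as a rough upper bound. This assumes the T&I and ray generation & shading stages run as a pipeline limited by the slower stage, and that every ray in the frame passes through both; neither assumption is stated in the slides, the result is not the deck's simulated FPS figure, and `fps_upper_bound` is a hypothetical name.

```python
def fps_upper_bound(rays_per_frame, ti_mrps, rgs_mrps):
    """Frames per second if the slower stage limits throughput."""
    bottleneck_rays_per_sec = min(ti_mrps, rgs_mrps) * 1e6
    return bottleneck_rays_per_sec / rays_per_frame


if __name__ == "__main__":
    # Fairy figures from the slide: 1.7M rays, 170 MRPS (T&I), 255 MRPS (RGS).
    print(fps_upper_bound(1_700_000, 170, 255))
```

Under those assumptions the Fairy figures give a bound on the order of 100 frames per second.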

21 FPGA
We are currently also testing SGRT on an FPGA board.

22 Conclusion

23 Conclusion
- SGRT: a novel mobile GPU based on ray tracing, the first mobile GPU to realize real-time ray tracing
- Carefully designed to suit the mobile SoC environment
- Currently implementing the T&I Engine at the RTL level
Future work
- Analyze cost and power consumption
- Support dynamic scenes with a fast BVH build algorithm optimized for the mobile environment
- Higher-level shading/ecosystem
Poster (#103) session: 8/7, 8/8, 12:15-13:15 PM

24 Acknowledgements
This project is a collaboration with two universities (Yonsei, National Kongju). The authors thank two professors (Tack-Don Han, Hyun-Sang Park) for their valuable advice.
Thanks
