SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing
Won-Jong Lee, Shi-Hwa Lee, Jae-Ho Nah*, Jin-Woo Kim*, Youngsam Shin, Jaedon Lee, Seok-Yoon Jung
SAIT, SAMSUNG Electronics; *Yonsei Univ., Korea
Talks, ACM SIGGRAPH 2012
Outline
- Introduction
- SGRT Core Architecture
  - T&I Engine: H/W Accelerator
  - SRP: Programmable DSP
  - SMK: Parallelization Framework
- Experimental Results
- Conclusion
Introduction
Graphics Trends
- Graphics is becoming more important as smart devices proliferate
- Evolving toward more realistic graphics
- Mobile graphics follows earlier PC graphics with a lag of 5~6 years
[Figure: realism over time ('10, '15, '20). PC/Console: Realistic 3D Game ('10), Immersive AR/MR. Mobile/CE: 3D Game ('04), Smart Phone ('09), Smart TV ('10), Realistic 3D Game on Mobile/CE, Immersive AR/MR on Mobile/CE.]
Mobile SoC
[Figure: Apple A5X die photo. Image courtesy: Chipworks]
Current Mobile GPU for Ray Tracing
- Inadequate performance
  - Flagship mobile GPU: ~256 GFLOPS (ARM Mali T658)
  - Real-time ray tracing @HD: >300 Mrays/sec (1~2 TFLOPS)
- Unsuitable execution model
  - Multithreaded SIMD is a poor fit for processing incoherent rays
- Weak branch support
  - Performance drops with recursion, function calls, and control flow
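The size of the performance gap can be made concrete with a little arithmetic on the figures quoted above. This is only a sketch: the function and constant names are illustrative, and the per-ray figure is a rough budget implied by the slide's numbers, not a measured value.

```python
# Figures quoted on the slide; names are illustrative, not from the talk.
MOBILE_GPU_GFLOPS = 256.0        # flagship mobile GPU (approximate)
REQUIRED_TFLOPS = (1.0, 2.0)     # estimated need for real-time ray tracing at HD
REQUIRED_MRAYS_PER_SEC = 300.0   # quoted lower bound on ray throughput


def compute_gap(required_tflops, available_gflops):
    """How many times more compute is needed than is available."""
    return required_tflops * 1000.0 / available_gflops


def flops_per_ray(required_tflops, mrays_per_sec):
    """Rough FLOP budget per ray implied by the two quoted figures."""
    return required_tflops * 1e12 / (mrays_per_sec * 1e6)


low_gap, high_gap = (compute_gap(t, MOBILE_GPU_GFLOPS) for t in REQUIRED_TFLOPS)
print(f"compute gap: {low_gap:.1f}x to {high_gap:.1f}x")
```

Under these figures the shortfall is roughly 4x to 8x in raw compute alone, before accounting for the execution-model and branching issues listed on the slide.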
Need a New Architecture?
- Dedicated, fixed-function H/W
  - High performance and power efficiency, but limited flexibility
  - RPU [Woop, SIGGRAPH 2005]
- Fully programmable processor
  - Flexible, but inadequate performance and power consumption
  - Reconfigurable stream processor [Kim, CICC 2012]: 1~2 Mrays/sec
  - MIMD threaded processor [Spjut, SHAW-3 2012]: ~30 Mrays/sec
Requirements
- Performance for real-time rendering
  - 200~300 Mrays/sec
- Reasonable flexibility
  - Programmable shading and ray generation
  - Support for various BVHs: SAH, binned, SBVH, LBVH, ...
  - Easy to extend to GI (path tracing, photon mapping, ...)
  - Easy to combine a rasterizer (OpenGL ES) with ray tracing
- Low power & cost
SGRT
Our Approach
- Combination of CPU, H/W and DSP (Mobile SoC)
- Tree build: sorting, irregular work
  - Multi-core CPU (with multi-level cache)
- Refit, traversal, intersection: embarrassingly parallel
  - Dedicated H/W
- Ray generation & shading: need for flexibility
  - Programmable DSP
[Figure: Multi-core CPU (Tree Build), Dedicated H/W (Traversal & Intersection), and Programmable DSP (Ray Gen. & Shading), connected to memory.]
System Architecture
- SGRT (Samsung reconfigurable GPU based on Ray Tracing)
  - T&I Engine: fast, compact H/W to accelerate traversal & intersection
  - SRP: Samsung Reconfigurable Processor to support flexible shading
  - SMK: parallelization framework
[Figure: system block diagram. Host side: multi-core ARM (Core #1 to #4), host system bus, host DRAM. SGRT Core #1 to #4. T&I Engine: refitting unit, traversal units, intersection unit, L1 caches, L2 cache. SRP: VLIW engine, coarse-grained reconfigurable array, internal SRAM, I-cache, C-Mem, texture unit with L1 cache. AXI system bus, external DRAM.]
T&I Engine: A MIMD H/W Accelerator
- Newly designed H/W accelerator based on our previous work
  - KD-tree H/W engine [Nah, SIGGRAPH ASIA 2011]
- Single-ray-based MIMD architecture: efficient processing of incoherent rays
- Ray Accumulation Unit (RAU): hardware multithreading
- Optimized restart & short-stack algorithm
  - Adaptive restart trail [Lee, HPG 2012]
- Early intersection test
  - Reduces expensive ray-primitive intersection (IST) tests
[Figure: T&I Engine block diagram (MIMD arch.). Rays enter a ray dispatcher that feeds the traversal units (RAU, pipeline, stack, L1$) and an intersection unit (RAU, pipeline, L1$), backed by an L2$; hit info is returned.]
Early (Two-Pass) Intersection Test
[Figure: conventional IST vs. early IST. Conventional IST: ray-node AABB test, then ray-primitive test in the intersection unit. Early IST: ray-node AABB test, then ray-primitive AABB test, then ray-primitive test in the intersection unit. Legend: inner node, leaf node, primitive AABB, primitive.]
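The idea of the early test can be sketched in software: a cheap ray-vs-primitive-AABB check filters out primitives before the expensive ray-triangle test runs. The sketch below is an assumption-laden illustration, not the hardware design: all function names are hypothetical, it uses a standard slab test and the Möller-Trumbore triangle test (the slide does not name the tests used), and it models only the order of the tests, not which hardware unit performs them.

```python
import math

EPS = 1e-9


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_aabb_hit(orig, direction, box_min, box_max, t_max):
    """Slab test: True if the ray overlaps the box somewhere in [0, t_max]."""
    t_near, t_far = 0.0, t_max
    for o, d, lo, hi in zip(orig, direction, box_min, box_max):
        if d == 0.0:                      # ray parallel to this slab
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True


def ray_triangle_t(orig, direction, tri):
    """Moller-Trumbore test: hit distance t, or None on a miss."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < EPS:
        return None
    inv = 1.0 / det
    s = sub(orig, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > EPS else None


def tri_aabb(tri):
    """Primitive AABB; real hardware would presumably fetch a stored box."""
    return (tuple(min(v[i] for v in tri) for i in range(3)),
            tuple(max(v[i] for v in tri) for i in range(3)))


def intersect_leaf(orig, direction, tris, early=True):
    """Return (closest t or None, number of full ray-triangle tests run)."""
    closest, full_tests = math.inf, 0
    for tri in tris:
        if early:
            lo, hi = tri_aabb(tri)
            if not ray_aabb_hit(orig, direction, lo, hi, closest):
                continue                  # pass 1 rejected: skip the costly test
        full_tests += 1
        t = ray_triangle_t(orig, direction, tri)
        if t is not None and t < closest:
            closest = t
    return (None if closest == math.inf else closest), full_tests
```

For a leaf holding one triangle on the ray's path and one far to the side, `intersect_leaf(..., early=True)` finds the same hit as `early=False` while running fewer full ray-triangle tests.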
Ray Accumulation Unit
- Specialized H/W multithreading for latency hiding [Nah, 2011]
  - Rays that miss the cache are accumulated in the RA buffer; other rays can be processed in the meantime
- Coherence can be increased: rays that reference the same cache line are accumulated in the same row of the RA buffer
- Experimental results: up to 3x performance gain
[Figure: RAU data flow. An input buffer feeds rays to the Ray Accumulation Unit (control, buffer with per-row occupation counter), which queries a non-blocking cache by cache address. On a cache hit, ray + cache data go to the traversal or intersection pipeline; on a cache miss, the ray waits in the buffer.]
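A small software model can illustrate the accumulation behaviour described above. It is a sketch under stated assumptions: the class name, method names, row count, row capacity, and 64-byte line size are all hypothetical, and the real unit is a hardware structure whose parameters are not given in the talk.

```python
class RayAccumulationBuffer:
    """Toy model: rays that miss the cache wait in a row keyed by cache line."""

    def __init__(self, rows, row_capacity, line_bytes=64):
        self.max_rows = rows              # hypothetical number of buffer rows
        self.row_capacity = row_capacity  # hypothetical rays per row
        self.line_bytes = line_bytes      # hypothetical cache line size
        self.rows = {}                    # cache line index -> waiting ray ids

    def line_of(self, address):
        return address // self.line_bytes

    def on_miss(self, ray_id, address):
        """Park a ray. Returns False if there is no room (pipeline would stall)."""
        line = self.line_of(address)
        row = self.rows.get(line)
        if row is None:
            if len(self.rows) == self.max_rows:
                return False
            row = self.rows[line] = []
        if len(row) == self.row_capacity:
            return False
        row.append(ray_id)
        return True

    def on_fill(self, address):
        """The cache line arrived: release every ray waiting on it together."""
        return self.rows.pop(self.line_of(address), [])


buf = RayAccumulationBuffer(rows=2, row_capacity=2)
buf.on_miss(0, 0x100)
buf.on_miss(1, 0x120)          # same 64-byte line as 0x100, so same row
released = buf.on_fill(0x100)  # both rays resume together
```

Releasing a whole row at once is what lets rays that touch the same cache line proceed back-to-back, which is the coherence benefit the slide refers to.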
Samsung Reconfigurable Processor
- A flexible architecture template [Lee, HPG 2011/2012]
  - The ISA implements arithmetic, special-function, and texture instructions
  - The VLIW engine is useful for general-purpose computation (function invocation, control flow)
  - The CGRA makes full use of software pipelining for loop acceleration
[Figure: SRP structure. VLIW mode: instruction and data paths, central register file (RF), row of function units (FUs). CGA mode: array of FUs, each with a local RF. Execution alternates between control processing (VLIW) and data processing (CGA) for each for-loop.]
Packet Stream Tracing on SRP
- Remove recursion
  - Job-queue-based streamed iteration
- A packet of rays is batched
  - Classified according to the type of operation
- Each kernel is mapped onto the CGA; loop acceleration gives a high IPC, up to the number of FUs in the array
[Figure: shading flow, split into CGA kernels and VLIW code. Stages: intersection result; classify hit rays, update colors; compute normal vectors; classify secondary rays & texture; generate secondary rays (reflection ray, refraction ray); compute texture color; compute N·L, classify shading; shading, generate shadow rays (shadow ray).]
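The replacement of recursion by queue-based iteration can be sketched as follows. This is a minimal illustration, not the SRP implementation: the ray representation, `shade`, `MAX_DEPTH`, `PACKET_SIZE`, and the fullest-queue-first policy are all hypothetical choices made for the example.

```python
from collections import defaultdict, deque

MAX_DEPTH = 2     # hypothetical bounce limit
PACKET_SIZE = 4   # hypothetical number of rays batched per kernel launch


def shade(ray):
    """Hypothetical stand-in for a shading kernel: returns the rays it spawns."""
    kind, depth = ray
    if kind == "shadow" or depth >= MAX_DEPTH:
        return []
    return [("reflection", depth + 1), ("shadow", depth + 1)]


def trace_streamed(primary_rays):
    """Process rays iteratively from per-type job queues instead of recursing."""
    queues = defaultdict(deque)
    for ray in primary_rays:
        queues[ray[0]].append(ray)
    packets = []                                   # log of (kind, packet size)
    while any(queues.values()):
        kind = max(queues, key=lambda k: len(queues[k]))   # fullest queue first
        q = queues[kind]
        packet = [q.popleft() for _ in range(min(PACKET_SIZE, len(q)))]
        packets.append((kind, len(packet)))
        for ray in packet:                         # one kernel over a uniform batch
            for child in shade(ray):
                queues[child[0]].append(child)     # classify by operation type
    return packets


log = trace_streamed([("primary", 0)] * 4)
```

Because every packet holds rays of a single type, the loop body over a packet is uniform, which is the property that makes it a candidate for mapping onto a loop-accelerating array.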
Parallelization Framework
- Parallel ray tracing with a multi-tasking system
  - Uses an embedded RTOS, SMK (Samsung Multi-Platform Kernel) [Shin, SAC 2011]
  - Supports multi-tasking through systematic scheduling of the task queues
- The individual task on each SGRT core is responsible for different pixels (or pixel tiles)
  - Dynamic load balancing: the scheduler distributes the next tasks to idle SGRT cores first
[Figure: multiple SGRT cores, each consisting of a T&I Engine, an SRP, and SMK.]
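The idle-core-first policy can be illustrated with a small simulation. This is only a sketch: `schedule_tiles` and the per-tile cost model are hypothetical and do not reflect SMK's actual API or scheduler internals.

```python
import heapq


def schedule_tiles(tile_costs, num_cores):
    """Greedy dynamic scheduling: each tile goes to the core that is idle first.

    tile_costs: hypothetical per-tile render times, in arbitrary units.
    Returns (tiles assigned to each core, time at which the last core finishes).
    """
    idle_at = [(0, core) for core in range(num_cores)]   # (idle time, core id)
    heapq.heapify(idle_at)
    assignment = [[] for _ in range(num_cores)]
    for tile, cost in enumerate(tile_costs):
        t, core = heapq.heappop(idle_at)                 # earliest idle core
        assignment[core].append(tile)
        heapq.heappush(idle_at, (t + cost, core))
    makespan = max(t for t, _ in idle_at)
    return assignment, makespan


# One expensive tile followed by eight cheap ones, on two cores.
assignment, makespan = schedule_tiles([8, 1, 1, 1, 1, 1, 1, 1, 1], num_cores=2)
```

In this example the cheap tiles all flow to the second core while the first is busy with the expensive tile, so both cores finish at time 8; a fixed round-robin split of the same tiles would leave one core busy until time 12.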
Evaluation
Simulation Environment
- Built a cycle-accurate simulator for the T&I Engine, and used an in-house cycle-accurate compiled simulator, csim, for the SRP
- Test conditions with two benchmarks
  - Full SAH build, cost ratio 5:1 (TRV:IST) for a shallow tree
  - Ferrari scene (210K triangles, 1 light source)
  - Fairy scene (170K triangles, 2 light sources)
  - Shadow, reflection, refraction @WVGA (800x640)
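The effect of the 5:1 traversal-to-intersection cost ratio can be illustrated with the standard surface area heuristic. The slide does not give the exact cost function used, so the formulation below is the textbook one and is an assumption; the function names and the absolute cost scale are illustrative.

```python
C_TRV, C_IST = 5.0, 1.0   # 5:1 ratio from the slide; absolute scale is arbitrary


def surface_area(lo, hi):
    """Surface area of an axis-aligned box given its min and max corners."""
    dx, dy, dz = (h - l for l, h in zip(lo, hi))
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def sah_split_cost(parent_box, left_box, n_left, right_box, n_right):
    """Textbook SAH: traversal cost plus area-weighted child intersection cost."""
    sa_parent = surface_area(*parent_box)
    p_left = surface_area(*left_box) / sa_parent
    p_right = surface_area(*right_box) / sa_parent
    return C_TRV + C_IST * (p_left * n_left + p_right * n_right)


def leaf_cost(n_prims):
    return C_IST * n_prims


def should_split(parent_box, left_box, n_left, right_box, n_right):
    """Split only when it is estimated to be cheaper than making a leaf."""
    split = sah_split_cost(parent_box, left_box, n_left, right_box, n_right)
    return split < leaf_cost(n_left + n_right)


parent = ((0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
left = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
right = ((1.0, 0.0, 0.0), (2.0, 1.0, 1.0))
print(should_split(parent, left, 4, right, 4))   # 8 primitives
print(should_split(parent, left, 8, right, 8))   # 16 primitives
```

With a relatively high traversal cost, a node with few primitives is kept as a leaf rather than split, which is how such a ratio tends to produce a shallower tree.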
Preliminary Results
- Architecture configuration
  - 4 SGRT cores; traversal units : intersection units = 4:1 per SGRT core
  - 1 GHz core clock
- Achieved around 170 MRPS (T&I) and 255 MRPS (RGS: ray generation & shading) for Fairy
  - Recent GPU ray tracer: 156~317 MRPS (NVIDIA Kepler) [Aila, HPG 2012]

Scene    # of tri.  # of rays  Pipeline usage (%)  TRV $ hit ratio (%)  IST $ hit ratio (%)  T&I Engine MRPS  SRP MRPS  Simulated FPS
Fairy    170K       1.7M       87.27               93.83                96.53                171.32           255.72    87.82
Ferrari  210K       1.5M       79.75               92.56                92.92                122.48           319.56    67.83
FPGA
- We are also currently testing SGRT on an FPGA board
Conclusion
- SGRT: a novel mobile GPU based on ray tracing; the first mobile GPU to realize real-time ray tracing
- Carefully designed to suit the mobile SoC environment
- Currently implementing the T&I Engine at the RTL level
- Future work
  - Analyze cost and power consumption
  - Support dynamic scenes with a fast BVH build algorithm optimized for the mobile environment
  - Higher-level shading/ecosystem
Poster (#103) session: 8/7, 8/8 12:15-13:15PM
Acknowledgements
This project is a collaboration with two universities (Yonsei, National Kongju). The authors thank two professors (Tack-Don Han, Hyun-Sang Park) for their valuable advice.
Thanks