Midgard GPU Architecture. October 2014





The Midgard Architecture: Hardware Evolution

Mali GPU Roadmap (figure)

Mali-T760 High-Level Architecture
- Distributes tasks to shader cores
- Efficient mapping of geometry to tiles
- Up to sixteen shader cores
- Address translation and protection
- Configurable cache shared among all shader cores
- Maintains cache coherency between different processors

The Midgard Architecture: The Graphics Pipeline

Immediate Mode Rendering
Pipeline: Geometry (Triangles) -> Vertex Shader -> Rasterizer -> Fragment Shader -> Blender -> Screen
- Typical desktop GPU
- Triangle data drawn immediately
- High memory bandwidth for blending

Tiled Mode Rendering
Pipeline: Geometry (Triangles) -> Vertex Shader -> Triangle lists, one per tile -> Rasterizer -> Fragment Shader -> Blender <-> Tile Buffer -> Screen
- Typical mobile GPU
- Triangle data stored temporarily
- Tile buffer is a local memory buffer
- Blender bandwidth stays on-chip
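The "triangle lists, one per tile" step can be sketched in a few lines. This is only an illustration of the idea, not the Mali tiler: it assumes 16x16-pixel tiles (the size quoted later for the Mali-T600 series) and uses conservative bounding-box binning, and the names `bin_triangles` and `TILE` are hypothetical.

```python
TILE = 16  # assumed tile edge in pixels (16x16 is quoted later for Mali-T600)


def bin_triangles(triangles, width, height, tile=TILE):
    """Build one triangle list per tile.

    triangles: list of triangles, each a list of three (x, y) screen positions.
    Returns {(tile_x, tile_y): [triangle indices]}.
    Binning is conservative: a triangle is added to every tile touched by its
    bounding box, even if the triangle itself does not cover that tile.
    """
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the screen and convert to tile coordinates.
        tx0 = max(0, int(min(xs)) // tile)
        tx1 = min(tiles_x - 1, int(max(xs)) // tile)
        ty0 = max(0, int(min(ys)) // tile)
        ty1 = min(tiles_y - 1, int(max(ys)) // tile)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


example = [
    [(1, 1), (10, 1), (1, 10)],      # fits inside tile (0, 0)
    [(10, 10), (40, 10), (10, 20)],  # spans three tile columns and two rows
]
lists_per_tile = bin_triangles(example, width=64, height=32)
```

Each tile can then be rasterized and shaded from its own list, with blending done against the on-chip tile buffer and only the finished tile written out.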

Basic Dataflow (diagram)
Labels: CPU (Application, Driver); Shared Memory (Job Descriptors, Input Data, Intermediate Data, Output Data); Midgard Hardware / GPU (Job Manager, Shader Cores); Display System

The Midgard Architecture: The Tri-pipe Architecture

Shader Core Architecture (block diagram)
Blocks: Compute Thread Creator; Triangle Setup Unit; Rasterizer; Early Z; Thread Issue; Tri Pipe (Reg file + Arith / LUT / Branch, x2; Load / Store / Varying; Texturing); Thread Completion; Late Z; Blender; Tile Buffers
Data: Tiler Data Structures; Compute Data and Results; Textures; Z/Stencil Buffer; Frame Buffer

Tri-pipe Architecture (Mali-T624)
Diagram: Thread Issue; Reg file + Arith / LUT / Branch (x2); Load / Store / Varying; Texturing; Thread Completion
- Unified shader architecture
  - Fragment and vertex shaders
  - Geometry and compute shaders
- Very high throughput graphics
  - Multiple parallel pipelines
  - Two low-latency arithmetic pipes
  - 256 simultaneous threads
  - Low latency for computation

The Midgard Architecture: Scalability

T720 scalability: 1-8 shader cores
Diagram: Thread Issue; Reg file + Arith / LUT / Branch (x1); Load / Store / Varying; Texturing; Thread Completion

T760 scalability: 1-16 shader cores
Diagram: Thread Issue; Reg file + Arith / LUT / Branch (x2); Load / Store / Varying; Texturing; Thread Completion

The Midgard Architecture: 64-bit Support

Midgard architecture: 64-bit embedded
- Midgard is a fully featured 64-bit GPU architecture
  - Native 64-bit address space, 64-bit arithmetic (integer and IEEE FP)
  - Very fast, optimized 16/32-bit performance: "right-sized" computing
- Wake-up call: 32-bit addresses are not enough
  - The 32-bit address space is nearly full on current high-end ARM platforms
  - Mali-T624 uses the same page tables as the ARM CPU for >4 GByte memory
  - Be ready; do not get left behind with a 32-bit-only GPU architecture
- IEEE double-precision floating point, Full Profile OpenCL
  - Numerical stability issues break many compute/graphics algorithms
  - FP64 is the default standard for scientific algorithm development
  - FP64 is exposed in key APIs: OpenGL 4.x, DirectX 11, CUDA and OpenCL 1.x

64-bit use-cases
- Pointer arithmetic for addresses larger than 32 bits => 64-bit integers
- 64-bit integers and 64-bit floats needed
  - High-dynamic-range image processing such as summed area tables, allowing:
    - Bokeh effect
    - Summed-area variance shadow maps
- Double-precision GPU computing algorithms coming from the desktop and scientific computing worlds
  - Approximate results may be acceptable for images viewed by the human eye
  - Accuracy is needed for other (non-image) computing
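A rough way to see why summed area tables push past 32 bits: the last entry of the table holds the sum of every pixel in the image. The sketch below builds a small table and estimates the worst-case bit width. The 1920x1080 resolution and the 8-bit and 16-bit channel depths are illustrative assumptions, not figures from the slides, and all function names are hypothetical.

```python
def summed_area_table(image):
    """Return a (height+1) x (width+1) table where sat[y][x] is the sum of
    image[0:y][0:x]. The extra zero row and column simplify box queries."""
    height, width = len(image), len(image[0])
    sat = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_sum(sat, x0, y0, x1, y1):
    """Sum of pixels with x0 <= x < x1 and y0 <= y < y1, using four lookups."""
    return sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]


def bits_needed(width, height, max_value):
    """Bits required to hold the largest possible table entry."""
    return (width * height * max_value).bit_length()


sat = summed_area_table([[1, 2], [3, 4]])
total = box_sum(sat, 0, 0, 2, 2)          # whole image
right_column = box_sum(sat, 1, 0, 2, 2)   # pixels 2 and 4

ldr_bits = bits_needed(1920, 1080, 255)    # 8-bit channel (assumed)
hdr_bits = bits_needed(1920, 1080, 65535)  # 16-bit channel (assumed)
```

Under these assumptions the 8-bit case still fits in a 32-bit integer, while the 16-bit case does not, which is consistent with the slide's point that HDR summed area tables need wider arithmetic.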

The Midgard Architecture: Memory Bandwidth

T6xx bandwidth and power save features
Figure labels: Mali-400MP; Mali-T624
- Improved Texture Compression
- Transaction Elimination
- Hierarchical Tiling

ASTC Texture Compression
- Texture compression format developed by ARM and adopted by Khronos
- Support for LDR + HDR + 3D in ARM Mali GPUs: Mali-T62x, Mali-T760, Mali-T720
- Supports a wide range of color formats and bit rates (bits per pixel)

Transaction Elimination
- Lower memory bandwidth
- Eliminates redundant writes to the output buffer
- A list of CRCs is calculated per tile for frame N and compared with the CRCs calculated for frame N+1
- Tiles that match the previous frame are eliminated (highlighted in the original figure)

Bandwidth reduction at 32bpp HD @ 60fps:
  Active game:  >20% reduction,  95 MB/s
  Static frame: 100% reduction, 475 MB/s
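The per-tile signature comparison can be sketched as follows. This is a minimal illustration under stated assumptions: it uses CRC-32 from Python's `zlib` as the signature (the slides only say "CRC"), 16x16 tiles, and a 32bpp frame held as a flat byte string. The helper names are hypothetical. The last function reproduces the slide's 475 MB/s figure, assuming "HD" means 1920x1080 and MB means 2^20 bytes.

```python
import zlib

TILE = 16            # assumed tile edge in pixels
BYTES_PER_PIXEL = 4  # 32bpp


def tile_signatures(frame, width, height, tile=TILE):
    """Return {(tile_x, tile_y): crc} for a frame stored as row-major bytes."""
    sigs = {}
    row_bytes = width * BYTES_PER_PIXEL
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            crc = 0
            # Feed the tile's rows into a running CRC-32.
            for y in range(ty, min(ty + tile, height)):
                start = y * row_bytes + tx * BYTES_PER_PIXEL
                end = y * row_bytes + min(tx + tile, width) * BYTES_PER_PIXEL
                crc = zlib.crc32(frame[start:end], crc)
            sigs[(tx // tile, ty // tile)] = crc
    return sigs


def tiles_to_write(prev_sigs, new_sigs):
    """Tiles whose signature changed; all other writes can be skipped."""
    return [key for key, crc in new_sigs.items() if prev_sigs.get(key) != crc]


def framebuffer_write_rate_mib(width, height, bytes_per_pixel, fps):
    """Uncompressed framebuffer write rate in MiB/s."""
    return width * height * bytes_per_pixel * fps / 2**20


width, height = 32, 16                     # two tiles side by side
frame_n = bytes(width * height * BYTES_PER_PIXEL)
frame_n1 = bytearray(frame_n)
frame_n1[3 * width * BYTES_PER_PIXEL + 20 * BYTES_PER_PIXEL] = 255  # pixel (20, 3)

changed = tiles_to_write(tile_signatures(frame_n, width, height),
                         tile_signatures(bytes(frame_n1), width, height))
full_rate = framebuffer_write_rate_mib(1920, 1080, BYTES_PER_PIXEL, 60)
```

A fully static frame eliminates the whole write rate, and a 20% match rate saves roughly a fifth of it, which matches the 475 MB/s and 95 MB/s figures on the slide.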

1311 Transaction elimination signature matches: 1142 (28% elimination)

1312 Transaction elimination signature matches: 1148 (28% elimination)

Mali-T760 bandwidth saving summary
- Reusing highly efficient Mali-T620 bandwidth saving features
  - Adaptive Scalable Texture Compression (ASTC)
  - Transaction Elimination
- and extending them even further
  - ARM Frame Buffer Compression (AFBC)
  - Smart Composition
Diagram (GPU): AFBC-coded Input Surfaces -> AFBC Decoding -> Smart Composition Pre-Processing -> Composition Processing -> Transaction Elimination -> AFBC Encoding -> AFBC-coded Output Surface

AFBC: Bandwidth Reduction in Multimedia Systems
- Increases energy efficiency across the SoC
  - Designed for integration with all major multimedia IP blocks
- Maintains image quality
  - Lossless frame buffer compression
  - Flexible formats, extensible to higher pixel-depth formats
- Efficient implementations
  - Easy parallelization of encode and decode
- Supported in current and all future generations of media processing IP from ARM
  - Mali-T760, Mali-V500, Mali-DP500
- Licensable for integration with 3rd-party media processing IP
Chart: system bandwidth reduction for User Interface, GFXBench 2.7 T-Rex, GFXBench 3.0 Manhattan and Angry Birds. Data labels: 33.0%, 49.8%, 45.5%, 9.7% (pairing with workloads not preserved).

Smart Composition: Reducing Bandwidth and Workloads
- Extending Transaction Elimination
  - Significantly lower composition bandwidth
  - Reduces unnecessary shader workload
- Ignoring repetitive tile data by tracking Change Maps
  - Integrates with the window manager
  - Notifies the Mali GPU which regions need updating
- For partially static applications, further savings
  - Applications notify which screen regions are updated
  - Bandwidth and workload reduction
- Enables complete end-to-end Smart Composition
Chart (source: ARM): bandwidth reduction with Smart Composition. Labels: 90%, Chrome Browser, 80%, Facebook application, 70%, Twitter application.

The Midgard Architecture: Summary

Evolution of Midgard architecture
- High scalability
  - Multiple configurations of shader cores
  - Different product lines derived from the same architecture
- Bandwidth reduction and high power efficiency
  - ASTC Texture Compression
  - Transaction Elimination / Smart Composition
  - ARM Frame Buffer Compression
  - Tiler & interconnect optimisations
- Support for all major graphics and compute APIs
  - Khronos-compliant OpenGL ES 3.1 / 3.0 / 2.0 / 1.1
  - Khronos-compliant OpenCL 1.1 / 1.0
  - Microsoft Windows-compliant Direct3D 11.1

The Midgard Architecture: Questions and Answers

Efficient Rendering with Tile Local Storage
Marius Bjorge, Staff Engineer

Agenda
- Motivation
- Introduction to extensions
- Example use-cases

Background
- ARM Mali-T600 series GPU
  - Tile-based rendering
  - 16x16 tile size
- Fast on-chip memory
  - 128 bits of per-pixel color data
  - Raw bit access
Figure labels: Tilebuffer pixel; 128-bit pixel data; Depth; Stencil; Sample (x4)

Color Buffer Access
- Read existing color from the thread's pixel location
  - No need to re-bind texture
- Multi-sampling support
  - Approximate but fast reading of the MSAA framebuffer
  - Adds support for explicit per-sample shading
- Useful for things like programmable blending
- Exposed as ARM_shader_framebuffer_fetch

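Color Buffer Access - Example

    #extension GL_ARM_shader_framebuffer_fetch : require
    precision mediump float;
    uniform vec4 uBlendColor;

    void main()
    {
        // Color already stored in the tile buffer at this fragment's pixel
        vec4 PrevColor = gl_LastFragColorARM;
        // Programmable blend: interpolate towards uBlendColor by its alpha
        gl_FragColor = mix(PrevColor, uBlendColor, uBlendColor.w);
    }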

Depth / Stencil Buffer Access
- Read existing depth and stencil from the thread's pixel location
  - No need to re-bind texture
- Useful for
  - Soft particles
  - Modulated shadows
  - Position reconstruction
  - Programmable depth/stencil testing
  - Re-interpreting depth as color: keep variance shadow map moments on-chip
- Exposed as ARM_shader_framebuffer_fetch_depth_stencil

Depth Buffer Access - Example

    #extension GL_ARM_shader_framebuffer_fetch_depth_stencil : require
    precision mediump float;
    uniform float uDepthRef;
    uniform int uStencilRef;

    void main()
    {
        // Programmable stencil test against the value already in the tile buffer
        if (gl_LastFragStencilARM != uStencilRef) discard;
        // Programmable depth test against the value already in the tile buffer
        if (gl_LastFragDepthARM > uDepthRef) discard;
        gl_FragColor = vec4(1.0, 0.0, 0.0, 0.0);
    }

Pixel Local Storage (PLS)
- Exposed as EXT_shader_pixel_local_storage
- Per-pixel scratch memory available to fragment shaders
  - Automatically discarded once a tile is fully processed
  - No impact on external memory bandwidth
- Shader declares a view of PLS memory
  - Re-interpret PLS between different passes
  - Can have separate input and output views
  - Independent of framebuffer format

Pixel Local Storage (PLS) - Example

    __pixel_localEXT FragDataLocal
    {
        layout(r32f) highp float value;
        layout(r11f_g11f_b10f) mediump vec3 normal;
        layout(rgb10_a2) highp vec4 color;
        layout(rgba8ui) mediump uvec4 flags;
    } pls;

Slide annotations: the layout qualifier gives the storage format; the declared type gives the read/write format.
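As a quick sanity check on the example block, the four storage formats can be added up. The bit widths below follow from the format names themselves; treating 128 bits per pixel as the budget is an assumption taken from the "128 bits of per-pixel color data" figure on the Background slide, not a limit stated for PLS. The names `FORMAT_BITS`, `FRAG_DATA_LOCAL`, `pls_bits_per_pixel` and `tile_storage_bytes` are hypothetical.

```python
# Bits occupied by each storage format, derived from the format names.
FORMAT_BITS = {
    "r32f": 32,
    "r11f_g11f_b10f": 11 + 11 + 10,
    "rgb10_a2": 10 + 10 + 10 + 2,
    "rgba8ui": 4 * 8,
}

# Members of the FragDataLocal block from the slide: (name, storage format).
FRAG_DATA_LOCAL = [
    ("value", "r32f"),
    ("normal", "r11f_g11f_b10f"),
    ("color", "rgb10_a2"),
    ("flags", "rgba8ui"),
]


def pls_bits_per_pixel(members):
    """Total per-pixel storage used by a PLS block declaration."""
    return sum(FORMAT_BITS[fmt] for _, fmt in members)


def tile_storage_bytes(bits_per_pixel, tile_w=16, tile_h=16):
    """On-chip bytes needed for one tile at the given per-pixel size."""
    return tile_w * tile_h * bits_per_pixel // 8


block_bits = pls_bits_per_pixel(FRAG_DATA_LOCAL)
tile_bytes = tile_storage_bytes(block_bits)
```

Each member packs into 32 bits, so the block fills the assumed 128-bit per-pixel budget exactly, and a 16x16 tile of such pixels needs 4 KiB of on-chip storage.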

Pixel Local Storage (PLS)
- Rendering pipeline changes slightly when PLS is enabled
  - Writing to PLS bypasses blending
- Note
  - Fragment order and fragment tests still apply
  - PLS and color share the same memory location
Diagram: Memory (Position data, Varyings, Textures, Uniforms, Framebuffer); Tile execution (Primitive setup -> Rasterization -> Fragment Shading -> Blending -> PLS / Color Tilebuffer -> Writeback)

Pixel Local Storage (PLS)
- Limitations of the current implementation
  - No MRT or MSAA support
  - Size limitation

Why Pixel Local Storage?
- An alternative approach is to use MRT with framebuffer fetch
  - If the driver can prove that render targets are not used later, it can avoid the write-back
- PLS is more explicit than MRT
  - Harder for the application to get it wrong
  - Driver doesn't have to make guesses
- PLS is more flexible
  - Re-interpret PLS data between fragment shader invocations
  - Not limited to OpenGL ES 3.x framebuffer formats

Deferred Shading
- Popular technique on PC and console games
- Very memory bandwidth intensive
- Traditionally not a good fit for mobile

Order Independent Transparency
- Unsolved problem
- For a correct result, fragments must be rendered in order, from nearest to farthest or from farthest to nearest

Order Independent Transparency (figure)
- Input fragments: 0 1 2 3
- Order: 0, 1, 2, 3
- Order: 0, 2, 3, 1
- Order: 2, 3, 1, 0
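The order dependence is easy to reproduce with the standard "over" blend. The sketch below blends four half-transparent fragments in the three orders shown on the slide. The fragment colors, alpha values and black background are made up for illustration, and the function names are hypothetical.

```python
def over(src_rgb, src_alpha, dst_rgb):
    """Standard alpha blend of a source fragment over the current color."""
    return tuple(src_alpha * s + (1.0 - src_alpha) * d
                 for s, d in zip(src_rgb, dst_rgb))


def blend_in_order(fragments, order, background=(0.0, 0.0, 0.0)):
    """Blend fragments onto the background in the given submission order."""
    color = background
    for index in order:
        rgb, alpha = fragments[index]
        color = over(rgb, alpha, color)
    return color


# Four illustrative fragments: (rgb, alpha). Values are assumptions.
fragments = [
    ((1.0, 0.0, 0.0), 0.5),  # fragment 0: red
    ((0.0, 1.0, 0.0), 0.5),  # fragment 1: green
    ((0.0, 0.0, 1.0), 0.5),  # fragment 2: blue
    ((1.0, 1.0, 1.0), 0.5),  # fragment 3: white
]

result_a = blend_in_order(fragments, [0, 1, 2, 3])
result_b = blend_in_order(fragments, [0, 2, 3, 1])
result_c = blend_in_order(fragments, [2, 3, 1, 0])
```

The same four fragments give three different final colors, which is why plain alpha blending needs sorted input and why exact or approximate OIT schemes are needed when sorting is impractical.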

Order Independent Transparency
- Exact OIT
  - Accurately computes the final color
  - All fragments must be sorted
- Approximate OIT
  - Does not store all fragments
  - Multi-Layer Alpha Blending [Salvi et al, 2014]
  - Adaptive Range
- Pixel Local Storage can be used for both

Reference (image)

Alpha blending (image)

Adaptive Range (image)

MLAB (image)

Reference (image)

Chaining pipelines (diagram)
- Phases: Opaque phase; OIT phase
- Stages, in order: Fill gbuffer; Accumulate lights; Tonemap; Init OIT; Transparent rendering; Resolve
- Pixel Local Storage layouts: RGB10A2, RGB10A2, RG16F, RG16F, R32UI, R32UI, R32UI, R32UI
- Output: Color

Performance (bar chart, vertical axis 0% to 140%)
- Bars: Deferred Shading (MRT); Deferred Shading (PLS); Deferred Shading and Adaptive Range OIT (PLS); Deferred Shading and MLAB3 OIT (PLS)
- Data labels: 100%, 122%, 100%, 93%

Summary
- Future work
  - Add support for MSAA and MRT when used together with PLS
  - More flexible size of PLS
- ARM Mali-T760 GPU
- Device availability
  - Samsung Note 4 (Exynos version)
  - Samsung Galaxy Tab S (Exynos versions)

Thank you!
Marius.Bjorge@arm.com