How To Build An Engine 4 Mobile Graphics On Anarm V8-A (A64)

Similar documents
Unreal Engine 4: Mobile Graphics on ARM CPU and GPU Architecture

Midgard GPU Architecture. October 2014

Optimizing AAA Games for Mobile Platforms

Optimizing Unity Games for Mobile Platforms. Angelo Theodorou Software Engineer Unite 2013, 28 th -30 th August

Performance Optimization and Debug Tools for mobile games with PlayCanvas

GPU Architecture. Michael Doggett ATI

Low power GPUs a view from the industry. Edvard Sørgård

How To Understand The Power Of Unity 3D (Pro) And The Power Behind It (Pro/Pro)

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005

The Evolution of Computer Graphics. SVP, Content & Technology, NVIDIA

Radeon HD 2900 and Geometry Generation. Michael Doggett

Computer Graphics Hardware An Overview

Making Dreams Come True: Global Illumination with Enlighten. Graham Hazel Senior Product Manager Sam Bugden Technical Artist

big.little Technology Moves Towards Fully Heterogeneous Global Task Scheduling Improving Energy Efficiency and Performance in Mobile Devices

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

High Performance or Cycle Accuracy?

GPU(Graphics Processing Unit) with a Focus on Nvidia GeForce 6 Series. By: Binesh Tuladhar Clay Smith

The Future Of Animation Is Games

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Which ARM Cortex Core Is Right for Your Application: A, R or M?

ANDROID DEVELOPER TOOLS TRAINING GTC Sébastien Dominé, NVIDIA

L20: GPU Architecture and Models

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

Vulkan on NVIDIA GPUs. Piers Daniell, Driver Software Engineer, OpenGL and Vulkan

Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions

How To Teach Computer Graphics

How To Develop For A Powergen 2.2 (Tegra) With Nsight) And Gbd (Gbd) On A Quadriplegic (Powergen) Powergen Powergen 3

Introduction to GPGPU. Tiziano Diamanti

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Shader Model 3.0. Ashu Rege. NVIDIA Developer Technology Group

Image Processing and Computer Graphics. Rendering Pipeline. Matthias Teschner. Computer Science Department University of Freiburg

Real-Time Realistic Rendering. Michael Doggett Docent Department of Computer Science Lund university

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

CSE 564: Visualization. GPU Programming (First Steps) GPU Generations. Klaus Mueller. Computer Science Department Stony Brook University

QCD as a Video Game?

GPGPU Computing. Yong Cao

Cross-Platform Game Development Best practices learned from Marmalade, Unreal, Unity, etc.

Touchstone -A Fresh Approach to Multimedia for the PC

Web Based 3D Visualization for COMSOL Multiphysics

Computer Graphics on Mobile Devices VL SS ECTS

Writing Applications for the GPU Using the RapidMind Development Platform

2: Introducing image synthesis. Some orientation how did we get here? Graphics system architecture Overview of OpenGL / GLU / GLUT

Equalizer. Parallel OpenGL Application Framework. Stefan Eilemann, Eyescale Software GmbH

Next Generation GPU Architecture Code-named Fermi

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

OpenGL Performance Tuning

ATI Radeon 4800 series Graphics. Michael Doggett Graphics Architecture Group Graphics Product Group

Introduction to GPU Architecture

Application Performance Analysis of the Cortex-A9 MPCore

Developer Tools. Tim Purcell NVIDIA

Trends in HTML5. Matt Spencer UI & Browser Marketing Manager

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Project SHIELD and Tegra 4: Redefining AFK

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

How To Use An Amd Ramfire R7 With A 4Gb Memory Card With A 2Gb Memory Chip With A 3D Graphics Card With An 8Gb Card With 2Gb Graphics Card (With 2D) And A 2D Video Card With

Silverlight for Windows Embedded Graphics and Rendering Pipeline 1

NVIDIA GeForce GTX 580 GPU Datasheet

TEGRA X1 DEVELOPER TOOLS SEBASTIEN DOMINE, SR. DIRECTOR SW ENGINEERING

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

DATA VISUALIZATION OF THE GRAPHICS PIPELINE: TRACKING STATE WITH THE STATEVIEWER

Image Synthesis. Transparency. computer graphics & visualization

Hardware accelerated Virtualization in the ARM Cortex Processors

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Introduction to Game Programming. Steven Osman

SAPPHIRE HD GB GDDR5 PCIE.

LOOKING FOR AN AMAZING PROCESSOR. Product Brief 6th Gen Intel Core Processors for Desktops: S-series

Development With ARM DS-5. Mervyn Liu FAE Aug. 2015

Game Development in Android Disgruntled Rats LLC. Sean Godinez Brian Morgan Michael Boldischar

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Maximize Application Performance On the Go and In the Cloud with OpenCL* on Intel Architecture

BRINGING UNREAL ENGINE 4 TO OPENGL Nick Penwarden Epic Games Mathias Schott, Evan Hart NVIDIA

System requirements for Autodesk Building Design Suite 2017

ARM Microprocessor and ARM-Based Microcontrollers

Advanced Visual Effects with Direct3D

4.1 Introduction 4.2 Explain the purpose of an operating system Describe characteristics of modern operating systems Control Hardware Access

Introduction to GPU hardware and to CUDA

Università Degli Studi di Parma. Distributed Systems Group. Android Development. Lecture 1 Android SDK & Development Environment. Marco Picone

SAPPHIRE TOXIC R9 270X 2GB GDDR5 WITH BOOST

Crosswalk: build world class hybrid mobile apps

Advanced Rendering for Engineering & Styling

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Embedded Development Tools

OpenEXR Image Viewing Software

Parallel Programming Survey

Awards News. GDDR5 memory provides twice the bandwidth per pin of GDDR3 memory, delivering more speed and higher bandwidth.

Visualizing gem5 via ARM DS-5 Streamline. Dam Sunwoo ARM R&D December 2012

VA (Video Acceleration) API. Jonathan Bian 2009 Linux Plumbers Conference

Dynamic Resolution Rendering

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Intel Graphics Media Accelerator 900

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

GPU Architecture. An OpenCL Programmer s Introduction. Lee Howes November 3, 2010

Applied Micro development platform. ZT Systems (ST based) HP Redstone platform. Mitac Dell Copper platform. ARM in Servers

GPU Parallel Computing Architecture and CUDA Programming Model

Android Development: a System Perspective. Javier Orensanz

White Paper. Intel Sandy Bridge Brings Many Benefits to the PC/104 Form Factor

Transcription:

Unreal Engine 4: Mobile Graphics on ARM CPU and GPU Architecture Jesse Barker, Principal Software Engineer, ARM Marius Bjørge, Graphics Research Engineer, ARM Niklas Smedis Smedberg, Senior Engine Programmer, Epic Games Brad Grantham, Principal Software Engineer, ARM Graham Hazel, Senior Product Manager, Geomerics 1

Agenda Programming for ARM v8-a Technology ARM Mali GPU Architecture Hardware evolution The tri-pipe architecture Exposing the tile Unreal Engine 4 Case Study: Moon Temple Enlighten in Unreal Engine 4 2

Programming for ARMv8-A Technology Jesse Barker Principal Software Engineer, ARM 3

ARM Architecture Evolution Virtualization ARMv4 ARM7TDMI ARM926EJ ARM1176 Cortex -A9 Cortex-A57 4 1995 2005 2015

ARMv8-A AArch32 Maintaining compatibility AArch32 maintains full-compatibility with ARMv7 while addressing emerging software trends AArch32: evolution of 32-bit Enhanced floating point support (IEE754-2008) Ideal for concurrent programming C11, C++ 11, Java5 More efficient, high-performance thread-safe software Cryptography support (AES, Sha-1, Sha-256) Applications and software ARMv7-A AArch32 A32+T32 ARMv7-A Compatible CRYPTO Scalar FP Advanced SIMD ARMv8-A AArch64 A64 5

ARMv8-A Architecture Designed for efficiency Design 64-bit architecture Increased number and size of general purpose registers Large Virtual Address Space Efficient 32-bit/64-bit architecture Double the number and size of NEON registers Cryptography support Why it Matters Efficient access to large datasets Gains in performance and code efficiency 1. Applications not limited to 4GB memory 2. Large memory mapped files handled efficiently 1. Common software architecture (phone, tablet, clamshell) 2. A single software model across the entire portfolio Enhanced capacity of SIMD multimedia engine 1. Over10x software encryption performance 2. New security models for consumer and enterprise 6

AArch64 Performance Over AArch32 >20% increase on several key workloads Most workloads increase, some slow down Slowdowns are often outliers like mcf in spec2k with unrealistic data access patterns Overall trend is increasing performance with 64b Will increase further as compilers mature 80% 70% 60% 50% 40% 30% 20% 10% 0% -10% ARMv8 AArch64 performance vs. AArch32 Cortex-A53 Cortex-A57 7

Multi-core ARM big.little Technology Taking advantage of parallelism Platform trending toward multi-cores Single thread performance improvements diminishing Thermally constrained use cases are now commonplace Production differentiation via different CPU combinations Modern OSs are supporting multi-core How to exploit parallelism. In the core API-level parallelization Multi-thread programming ARM NEON tech/simd OpenMP, Renderscript, OpenCL, etc. Never easy, but increasingly necessary 8

ARMv8-A and 64-bit Everywhere Mega trend is the move to ARMv8-A and AARCH64 Low cost development platforms available from 96boards.org Huge growth in share of 64-bit platforms in smartphone and tablets in 2015 9

Mali GPU Architecture Marius Bjørge Graphics Research Engineer, ARM 10

The Midgard Architecture HARDWARE EVOLUTION 11

Driving for Efficiency The Mali GPU roadmap 12

Mali GPU High-Level Architecture A breakdown of the Mali-T880 Up to sixteen shader cores Distributes tasks to shader cores ARM Mali -T880 GPU Inter-Core Task Management SC SC SC SC SC SC SC SC SC SC SC SC SC SC SC SC Efficient mapping of geometry to tiles Advanced Tiling Unit Thread Issue Memory Management Unit L2 Cache L2 Cache AMBA 4 ACE-Lite AMBA 4 ACE-Lite Addresses translation and protection Arithmetic Pipeline Arithmetic Pipeline Arithmetic Pipeline Load/Store Pipeline Texture Pipeline Configurable cache shared among all shader cores Maintains cache coherency between different processors 13 Thread Completion

The Midgard Architecture THE TRI-PIPE ARCHITECTURE 14

Shader Core Architecture Compute Thread Creator Rasterizer Triangle Setup Unit Tiler Data Structures Early Z Thread Execution Tri Pipe Thread Issue Compute Data and Results Reg file Arith / LUT / Branch Reg file Arith / LUT / Branch Load / Store / Varying Texturing Z/Stencil Buffer Textures Thread Completion Late Z Blender Tile Buffers Tile Buffers Frame Buffer 15

Tri-pipe Architecture Thread Issue Reg file Arith / LUT / Branch Reg file Arith / LUT / Branch Load / Store / Varying Texturing 16 Unified shader architecture Fragment and vertex shaders Compute shaders Very high throughput graphics Thread Completion Multiple parallel pipelines Two low-latency arithmetic pipes 256 simultaneous threads Low-latency for computation

The Midgard Architecture EXPOSING THE TILE 17

The Tilebuffer Mali-T600/T700/T800 Series GPU Tile-based rendering 16x16 tile size Fast on-chip memory 16 bytes of per-pixel color data Raw bit access Tilebuffer pixel Depth Stencil 128-bit pixel data Sample Sample More recent GPU architectures allow more flexible tile sizes and open up more per-pixel color data Sample Sample 18

Exposing the Tilebuffer Shader Framebuffer Fetch Access previous fragment color, depth and stencil Programmable blending, soft particles, etc. Shader Pixel Local Storage (PLS) 19

Pixel Local Storage (PLS) Exposed as EXT_shader_pixel_local_storage Per-pixel scratch memory available to fragment shaders Automatically discarded once a tile is fully processed No impact on external memory bandwidth Shader declares an interface block of PLS memory Re-interpret PLS between different passes Can have separate input and output views Independent of framebuffer format 20

Pixel Local Storage pixel_localext FragDataLocal { layout(r32f) highp float_value; layout(r11f_g11f_b10f) mediump vec3 normal; layout(rgb10_a2) highp vec4 color; layout(rgba8ui) mediump uvec4 flags; } pls; See the extension spec for more information! https://www.khronos.org/registry/gles/extensions/ext/ext_shader_pixel_local_storage.txt http://malideveloper.arm.com 21

Pixel Local Storage Rendering pipeline changes slightly when PLS is enabled Writing to PLS bypasses blending Note Fragment order Fragment tests still apply PLS and color share the same memory location Memory Position data Varyings Textures Uniforms Tile execution Primitive Setup Rasterization Fragment shading Blending PLS/color Tilebuffer Framebuffer Writeback 22

Why Pixel Local Storage? An alternative approach is to use multiple render targets (MRT) with framebuffer fetch if the driver can prove that render targets are not used later, it can avoid the write-back PLS is more explicit than MRT Harder for the application to get it wrong Driver doesn t have to make guesses PLS is more flexible Re-interpret PLS data between fragment shader invocations Not limited to OpenGL ES 3.x framebuffer formats 23

Deferred Shading Popular technique in PC and console games Very memory bandwidth intensive Traditionally not a good fit for mobile Diffuse (RGBA8) Depth (D32F) Normals (RGBA8) 24

Order Independent Transparency Unsolved problem Depth peeling Approximate approaches Multi-Layer Alpha Blending [Salvi et al, 2014] Adaptive Range 25

Pixel Local Storage Opaque phase OIT phase Fill gbuffer Light accumulation Resolve Init OIT Transparent rendering Resolve + Tonemap Pixel Local Storage RGB10A2 RGB10A2 RGB16F RGB16F R32UI R32UI R32UI R32UI Color At this point we change the layout of the PLS 26

Performance Comparison of Approaches 140% 120% 122% 100% 100% 100% 93% 80% 60% 40% 27 20% 0% MRT + AB PLS + AB PLS + Adaptive Range PLS + MLAB3 AB = Alpha Blending MLAB3 = 3 layer Multi-Layer Alpha Blending Relative performance

Unreal Engine 4 Niklas Smedis Smedberg Senior Engine Programmer, Epic Games Brad Grantham Principal Software Engineer, ARM 28

PSNR (db) Compress, Compress, Compress! ASTC = Adaptive Scalable Texture Compression Texture compression standard developed by ARM, adopted by Khronos KHR_texture_compression_astc_ldr for OpenGL ES and Open GL Increased quality and fidelity at low bit-rates Expansive range of input formats offers complete flexibility Choice of base format, 2D and 3D plus addition of HDR formats 50 45 40 35 30 25 8 5.12 3.56 2 1.28 0.89 29 Compression Rate (bpp)

Input Color Formats Input bits/pixel Compression in the Pre-ASTC World HDR RGB+A BC6 HDR RGBA HDR XY+Z All Major Players 64 48 HDR RGB HDR X+Y RGB+A PVRTC PVRTC ETC, BC2 BC3, BC7 48 32 32 RBGA XY+Z ETC, BC1 BC7 32 24 RGB HDR L ETC, BC5 24 16 X+Y LA ETC, BC4 16 16 L 8 1 2 3 4 5 6 7 8 Compressed bits/pixel 30

Input Color Formats Input bits/pixel ASTC Choices All ASTC HDR RGB+A HDR RGBA HDR XY+Z HDR RGB HDR X+Y RGB+A RBGA XY+Z RGB HDR L X+Y LA L 64 48 48 32 32 32 24 24 16 16 16 8 1 2 3 4 5 6 7 8 Compressed bits/pixel 31

ASTC for Mobile Games ASTC is widely supported by all major hardware vendors It s free to use Finally a good texture format that can work everywhere! Avoids separate SKUs per hardware manufacturer: PVRTC, ATC, DXT, <supports-gl-texture android:name="gl_amd_compressed_atc_texture" /> Support for ASTC is also required by Google s Android Extension Pack GL_ANDROID_extension_pack_es31a 32

ASTC Support in Unreal Engine 4 33

Game Texture Comparison 2048x2048 RGB Normal Map, with mips 17 MB uncompressed Original: 17 MB ETC: 3 MB ASTC 6x6: 2.5 MB 34

Game Texture Comparison Same texture zoomed in for Truth Original: 17 MB ETC: 3 MB ASTC 6x6: 2.5 MB 35

Unreal Engine 4 Demo: Moon Temple Made specifically for ARM Unreal Engine 4 Goals: 64-bit Android ASTC PLS 36

Unreal Engine 4 Pixel Local Storage Read & write custom data for each pixel E.g. Depth Blend particles softly against the background 37

Unreal Engine 4 Pixel Local Storage 38

Moon Temple Demo 39

Enabling 64-bit Android in Unreal Engine 4 Android NDK r10c 64-bit AArch64 compilers Android SDK 21 Required for Lollipop, 64-bit UE4 Engine changes collaboration between ARM and Epic Games Patches submitted Available in future release packaging considerations to resolve New Android platform arm64, 64-bit libue4.so Results: 8% Sun Temple FPS uplift just from compiling 64-bit 40

Measuring ASTC Benefit Streamline tool, part of ARM Development Studio 5 (DS-5) to know more https://ds.arm.com Capture CPU and GPU parameters during runtime for analysis ASTC requires less memory, so bandwidth use should drop We should see that reflected in L2 cache external R+W beats Example image from Streamline 41

Measuring ASTC Benefit Result of Streamline L2 counters: ETC2 over 30s: 1.29 GB/s ASTC 6x6 over same 30s:.98 GB/s 24.4% less bandwidth used per frame And ASTC OBB is 12% smaller than ETC2 OBB (179MB versus 203MB) 42

Enlighten in Unreal Engine 4 Graham Hazel Senior Product Manager 43

Enlighten in Unreal Engine 4 Enlighten is global illumination middleware, available pre-integrated into UE4 Runtime is lightweight and optimised for a wide range of platforms, including Android 64-bit ios 64-bit Windows PC Mac OS X PlayStation 4 Xbox One Find out more Thursday 10AM, West Hall 3014, and at the ARM Booth 1624 44

Enlighten in Unreal Engine 4 45

To Find Out More. 46 ARM Booth #1624 on Expo Floor Live demos In-depth Q&A with ARM engineers More tech talks at the ARM Lecture Theatre Epic Games: Live Session with Unreal Engine 4 for Mobile Devices Geomerics Enlighten session ARM tools Live Sessions http://malideveloper.arm.com/gdc2015 Revisit this talk in PDF and video format post GDC Download the tools and resources

More Talks from ARM at GDC 2015 Available post-show online at Mali Developer Center Unreal Engine 4 mobile graphics and the latest ARM CPU and GPU architecture - Weds 9:30AM; West Hall 3003 This talk introduces the latest advances in features and benefits of the ARMv8-A and tile-based Mali GPU architectures on Unreal Engine 4, allowing mobile game developers to move to 64-bit s improved instruction set. Unleash the benefits of OpenGL ES 3.1 and Android Extension Pack (AEP) Weds 2PM; West Hall 3003 OpenGL ES 3.1 provides a rich set of tools for creating stunning images. This talk will cover best practices for using advanced features of OpenGL ES 3.1 on ARM Mali GPUs using recently developed examples from the Mali SDK. Making dreams come true global illumination made easy Thurs 10AM; West Hall 3014 In this talk, we present an overview of the Enlighten feature set and show through workflow examples and gameplay demonstrations how it enables fast iteration and high visual quality on all gaming platforms. How to optimize your mobile game with ARM Tools and practical examples Thurs 11:30AM; West Hall 3014 This talk introduces you to the tools and skills needed to profile and debug your application by showing you optimization examples from popular game titles. Enhancing your Unity mobile game Thurs 4PM; West Hall 3014 47 Learn how to get the most out of Unity when developing under the unique challenges of mobile platforms.

Any Questions? Ask the best question and win a PiPO P4 tablet! Rockchip RK3288 processor ARM Cortex-A17 MP4 CPU ARM Mali-T760 MP4 GPU 48

Thank You The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. Any other marks featured may be trademarks of their respective owners 49