Bifrost - The GPU architecture for next five billion

Transcription

1 Bifrost - The GPU architecture for next five billion. Alan Tsai, Regional Marketing Manager, Media Processing Group. ARM Tech Forum Taipei, July 1st, 2016

2 Why New GPU Architecture? ARM 2016

3 Market Drivers and Trends: evolution of mobile gaming and graphics; virtual reality; augmented reality; increasing user interface complexity; APIs adapting to developer needs.

4 Evolution of Mobile Graphics
2010: TrueForce. Hardware: Galaxy S2. GPU: Mali-400MP4. API support: OpenGL ES 2.0. Primitives per frame: 16k. Cycles per pixel: 3.7. Draw calls per frame: 50.
2013: Trollheim. Hardware: Nexus 10. GPU: Mali-T604. API support: OpenGL ES 3.0, OpenCL 1.1. Primitives per frame: 150k. Cycles per pixel: 16. Draw calls per frame: 60.
2016: Lofoten. Hardware: Galaxy S7. GPU: Mali-T880MP12. API support: Vulkan 1.0. Primitives per frame: 600k. Cycles per pixel: 40. Draw calls per frame: 500.

5 Vulkan: Developer-driven GFX API. Vulkan graphics API driven by developer need: low-level API; ideal for new and emerging use-cases; fully exploits heterogeneous systems; fully multi-threaded; benefits from HW coherency. [Diagram, OpenGL ES: Application -> Mali OpenGL ES Driver (driver handles context, memory and error management) -> Mali GPU. Diagram, Vulkan: Application (application handles memory allocation, resources, and thread management to generate command buffers) -> Mali Vulkan Driver (low-overhead driver) -> Mali GPU.]

6 What Vulkan Means for GPU Architecture. Stricter system requirements: GPU address faults cannot destabilize the system; process isolation must be guaranteed; coherent shared memory is mandatory. GPU architecture is more exposed: reduces flexibility in some areas, e.g. resource descriptors. Application provides more information, sooner: reduces need for indirection/late binding.

7 Bifrost

8 Why is the Architecture Called Bifrost?

9 ARM Mali Architecture Evolution
BIFROST (Mali-G71 GPU): unified shader cores, scalar ISA, clause execution, full coherency, Vulkan, OpenCL.
MIDGARD (Mali-T600, Mali-T700 and Mali-T800 GPU series): unified shader cores, SIMD ISA, OpenGL ES 3.x, OpenCL, Vulkan.
UTGARD (Mali-200, Mali-300, Mali-400, Mali-450 and Mali-470 GPUs): separate shader cores, SIMD ISA, OpenGL ES 2.x.

10 Bifrost Features. A more efficient architecture: more performance overall, per mm² and per line of real-world shader code. Major shader core redesign; new scalar, clause-based ISA; new quad-based arithmetic units; new core fabric; new geometry data flow (reduces memory bandwidth and footprint). 1.5x performance improvement.

11 Bifrost Architectural Innovations. Energy efficiency: claused shaders; index-driven position shading; wire light pipelines. Developer friendly: designed for Vulkan and VR/AR. Heterogeneous computing: full system coherency. [Diagram labels: Midgard, Bifrost, CPU, CPU, GPU, Coherent Interconnect, DRAM.]

12 Bifrost GPU Design

13 Bifrost GPU Design [Block diagram: Driver Software; Job Manager; Shader Core 0, Shader Core 1, Shader Core 2 through Shader Core 31; Control Fabric; Tiler; MMU; three L2 Cache Segments; three AXI Memory Buses.]

14 Scalable System Design: up to 32 shader cores supported. [Block diagram as on slide 13.]

15 Geometry Flow Improvement [Block diagram as on slide 13.]

16 Index-Driven Position Shading [Figure: geometry data flow on Midgard and Bifrost. Processing stages: Tiler Assembly, Position Shading, Tiler Culling, Varying Shading, Fragment Shading. Memory: Indices, Positions, Transformed Positions, Polygon List, Vertex Attributes, Vertex Varyings. Read/write bandwidth is shown as a multiple of memory storage size; values in the figure include ½x, 1x, 1.5x, 2.0x, 2.5x and 3.5x.]

17 Memory System: full coherency using ACE protocol. [Block diagram as on slide 13.]

18 Memory System. Full system coherency support: supports tightly coupled CPU+GPU use cases. L2 cache improvements: single logical L2 cache makes software easier; fewer partial lines written to AXI, which improves LPDDR4 performance. [Diagram labels: Cortex-A73 CPU, Mali-G71 GPU, CoreLink CCI-550, DMC-500, DRAM.]

19 Bifrost Core Design

20 Execution Core Improvements [Block diagram as on slide 13.]

21 Bifrost Core Design [Block diagram: Compute Frontend; Fragment Frontend; two Quad Creators; Quad Manager; Quad Control; Execution Engine 0, Execution Engine 1, Execution Engine 2, each with Quad State; Control Fabric; Load/store Unit; Attribute Unit; Varying Unit; Texture Unit; Blender & Tile Access; Depth & Stencil; ZS Memory; Tile Memory; Tile Writeback; three links to the L2 memory system.]

22 Quad Creation [Shader core block diagram as on slide 21.]

23 Quad Management [Shader core block diagram as on slide 21.]

24 Quad Execution [Shader core block diagram as on slide 21.]

25 Quad Vectorization. Bifrost uses quad-parallel execution: four scalar threads executed in lockstep in a quad; one quad at a time executes in each pipeline stage; each thread fills one 32-bit lane of the hardware; 4 threads doing a vec3 FP32 add takes 3 cycles; improves utilization. Quad vectorization is compiler friendly: each thread only sees a stream of scalar operations; vector operations can always be split into scalars. [Diagram: Lane 0 to Lane 3 across Cycle 1 to Cycle 4, with entries T0.x, T0.y, T0.z through T3.x, T3.y, T3.z and Idle slots.]
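The cycle count on the quad vectorization slide can be checked with a small sketch. It compares two simplified models on a 4-lane unit: one thread's vector per cycle (the contrast case the idle lanes in the diagram appear to illustrate; this reading is an assumption) and four scalar threads in lockstep. The function names and the lane model are hypothetical and are not taken from ARM documentation.

```python
import math

LANES = 4  # 32-bit lanes in the arithmetic unit, per the slide


def vector_per_thread(num_threads, vec_width):
    """Assumed contrast model: each cycle processes one thread's whole vector,
    so a vec3 leaves one of the four lanes idle."""
    cycles = num_threads
    utilization = (num_threads * vec_width) / (cycles * LANES)
    return cycles, utilization


def quad_scalar(num_threads, vec_width):
    """Quad model: four threads run in lockstep, one scalar component per cycle,
    each thread occupying one lane."""
    quads = math.ceil(num_threads / LANES)
    cycles = quads * vec_width
    utilization = (num_threads * vec_width) / (cycles * LANES)
    return cycles, utilization


if __name__ == "__main__":
    print("vector per thread:", vector_per_thread(4, 3))
    print("quad scalar:      ", quad_scalar(4, 3))
```

Under these assumptions, four threads doing a vec3 add take 4 cycles at 75% lane utilization in the first model and 3 cycles at full utilization in the quad model, which matches the 3-cycle figure on the slide.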

26 Classic Instruction Execution: scheduling decision before every instruction; architecturally visible state guaranteed after every instruction. [Diagram legend: Overhead, Instruction.]

27 Clause Execution: back-to-back execution guaranteed within a clause; allows aggressive optimisation. [Diagram legend: Overhead, Instruction.]
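A rough way to see the difference between the two execution models is to count scheduling decisions. The sketch below assumes one decision per instruction in the classic model and one per clause in the clause model; the clause length used in the example is a hypothetical value chosen for illustration, not a figure from the slides.

```python
import math


def scheduling_decisions(num_instructions, clause_length=None):
    """Count scheduling decisions for a straight-line instruction stream.

    clause_length=None models classic execution (a decision before every
    instruction); otherwise instructions are grouped into clauses of at most
    clause_length instructions (hypothetical value) with one decision each.
    """
    if clause_length is None:
        return num_instructions
    return math.ceil(num_instructions / clause_length)


if __name__ == "__main__":
    print("classic:", scheduling_decisions(16))
    print("clauses:", scheduling_decisions(16, clause_length=8))
```

With these assumed numbers, 16 instructions need 16 decisions in the classic model and 2 when grouped into clauses of 8.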

28 Clause Execution. A simple register-based instruction set: each instruction fetches arguments from the register file and writes results back to the register file. Example: ADD R2, R0, R1. [Diagram: register file R0 to R7.]

29 Clause Execution. Register file access can be expensive: the register file is often the most-used part of the GPU; high bandwidth means high power consumption; thread allocation keeps registers close to where they are used. Example: ADD R2, R0, R1; ADD R4, R2, R3. [Diagram: register file R0 to R7.]

30 Clause Execution. Back-to-back register access is common: the result from one instruction is often only used as input to the next. Example: ADD R2, R0, R1; ADD R4, R2, R3; ADD R0, R4, R5. [Diagram: register file R0 to R7.]

31 Clause Execution. Back-to-back register access is common. Register file bypass saves power and allows use of simpler, smaller register files. Example with temporary T: ADD T, R0, R1; ADD T, T, R3; ADD R0, T, R5. [Diagram: register file R0 to R7 and temporary T.]
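The saving from the bypass can be counted directly from the two instruction sequences on the slides. The sketch below is a simplified model that treats every operand other than the temporary `T` as one register file access; the tuple encoding and the function name are hypothetical.

```python
def register_file_traffic(program, bypass="T"):
    """Return (reads, writes) to the main register file.

    Each instruction is (opcode, dst, src_a, src_b). Operands named `bypass`
    are assumed to travel through the temporary and never touch the main
    register file.
    """
    reads = writes = 0
    for _opcode, dst, src_a, src_b in program:
        reads += sum(1 for src in (src_a, src_b) if src != bypass)
        writes += 0 if dst == bypass else 1
    return reads, writes


# Sequence from slide 30: every result goes through the register file.
PLAIN = [
    ("ADD", "R2", "R0", "R1"),
    ("ADD", "R4", "R2", "R3"),
    ("ADD", "R0", "R4", "R5"),
]

# Sequence from slide 31: intermediate results stay in the temporary T.
BYPASSED = [
    ("ADD", "T", "R0", "R1"),
    ("ADD", "T", "T", "R3"),
    ("ADD", "R0", "T", "R5"),
]

if __name__ == "__main__":
    print("plain:   ", register_file_traffic(PLAIN))
    print("bypassed:", register_file_traffic(BYPASSED))
```

In this model the plain sequence makes 6 reads and 3 writes, while the bypassed one makes 4 reads and 1 write for the same final result in R0.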

32 Clause Scheduling: delay next clause if asynchronous data not ready. [Diagram labels: TEX (texture unit operation); Unrelated?; Required data not ready?; Use result. Legend: Overhead, Instruction.]

33 Clause Scheduling: another quad can use this execution unit; high utilization, high efficiency. [Diagram labels: TEX (texture unit operation); Use result. Legend: Overhead, Quad 1, Quad 2.]
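The effect of letting another quad use the execution unit can be illustrated with a toy scheduler. This is a minimal sketch under stated assumptions: each quad runs a clause that issues a texture fetch followed by a clause that uses the result, the texture latency and clause lengths are made-up numbers, and the scheduling policy (first ready quad wins) is hypothetical rather than Bifrost's actual policy.

```python
from dataclasses import dataclass

TEX_LATENCY = 10  # hypothetical texture latency in cycles


@dataclass
class Clause:
    name: str
    cycles: int
    issues_tex: bool = False  # clause starts an asynchronous texture fetch
    needs_tex: bool = False   # clause consumes the texture result


@dataclass
class Quad:
    name: str
    clauses: list
    next_clause: int = 0
    tex_ready_at: int = 0


def make_quad(name):
    """Build a quad with a TEX clause followed by a clause that uses the result."""
    return Quad(name, [Clause("tex", 2, issues_tex=True),
                       Clause("use", 4, needs_tex=True)])


def run(quads):
    """Run all quads on one execution unit; return (total_cycles, busy_cycles)."""
    cycle = busy = 0
    while any(q.next_clause < len(q.clauses) for q in quads):
        ready = None
        for q in quads:
            if q.next_clause < len(q.clauses):
                clause = q.clauses[q.next_clause]
                # A clause that needs texture data is delayed until it arrives.
                if not clause.needs_tex or q.tex_ready_at <= cycle:
                    ready = q
                    break
        if ready is None:
            cycle += 1  # no quad is ready: the unit idles for a cycle
            continue
        clause = ready.clauses[ready.next_clause]
        cycle += clause.cycles
        busy += clause.cycles
        if clause.issues_tex:
            ready.tex_ready_at = cycle + TEX_LATENCY
        ready.next_clause += 1
    return cycle, busy


if __name__ == "__main__":
    print("one quad: ", run([make_quad("Quad 1")]))
    print("two quads:", run([make_quad("Quad 1"), make_quad("Quad 2")]))
```

With these assumed numbers, one quad keeps the unit busy for 6 of 16 cycles, while two quads keep it busy for 12 of 20 cycles, because the second quad's TEX clause runs while the first quad waits for its texture data.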

34 Bifrost Arithmetic Functional Units. Executes quad-parallel scalar operations: 4x32-bit multiplier (FMA); 4x32-bit adder (ADD); adder includes special function unit. Smaller and more area efficient. Simplified layout eases compilation: better scheduling in today's code; better utilization; one instruction word contains two instructions. [Diagram labels: Main Regs Read, FMA, ADD/SF, Main Regs Write, Temp Registers.]

35 Bifrost Arithmetic Functional Units. Retains support for smaller width data types: integers useful for deep learning; 2x performance for FP16, useful for pixel shaders. Per 32-bit lane: 4 x int8 (8-bit integers); 2 x int16 (16-bit integers); 1 x int32 (32-bit integers); 2 x float16 (16-bit floating point); 1 x float32 (32-bit floating point). [Diagram labels: Main Regs Read, FMA, ADD/SF, Main Regs Write, Temp Registers.]
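The per-lane counts above follow from dividing the 32-bit lane width by the element width. The sketch below shows that arithmetic and uses Python's standard `struct` module to pack narrow values into a single 32-bit word; it illustrates the data layout only and does not model how the hardware operates on packed values. The function names are hypothetical.

```python
import struct

LANE_BITS = 32  # each thread fills one 32-bit lane, per the slides


def elements_per_lane(element_bits):
    """Number of elements of the given width that fit in one 32-bit lane."""
    return LANE_BITS // element_bits


def pack_int8x4(values):
    """Pack four signed 8-bit integers into one little-endian 32-bit word."""
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def unpack_int8x4(word):
    """Recover the four signed 8-bit integers from a 32-bit word."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))


def pack_float16x2(a, b):
    """Pack two half-precision floats into one little-endian 32-bit word."""
    return struct.unpack("<I", struct.pack("<2e", a, b))[0]


if __name__ == "__main__":
    for bits in (8, 16, 32):
        print(bits, "bit elements per lane:", elements_per_lane(bits))
    word = pack_int8x4([1, -2, 3, -4])
    print(hex(word), unpack_int8x4(word))
```

Two float16 values per lane is consistent with the 2x FP16 figure on the slide, assuming both halves are processed in the same cycle.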

36 Load/Store Units Separated [Shader core block diagram as on slide 21.]

37 First Incarnation of Bifrost

38 Mali-G71: built on the innovative new Bifrost architecture; premium GPU delivering our highest performance ever; ARM's most scalable GPU to date.

39 Mali-G71: Efficiency Drives Performance. 20% higher energy efficiency*; 32 shader cores; 40% better performance density*; 20% bandwidth improvement*. Optimized for next generation, advanced, real-world content. *Compared to Mali-T880, on same process node under the same conditions.

40 The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright 2016 ARM Limited