Bifrost - The GPU architecture for next five billion

Alan Tsai, Regional Marketing Manager, Media Processing Group
ARM Tech Forum Taipei, July 1st, 2016

Why New GPU Architecture?
ARM 2016

Market Drivers and Trends
  Evolution of mobile gaming and graphics
  Virtual Reality
  Augmented Reality
  Increasing User Interface complexity
  APIs adapting to developer needs

Evolution of Mobile Graphics
  2010: TrueForce
    Hardware: Galaxy S2
    GPU: Mali-400MP4
    API support: OpenGL ES 2.0
    Primitives per frame: 16k
    Cycles per pixel: 3.7
    Draw calls per frame: 50
  2013: Trollheim
    Hardware: Nexus 10
    GPU: Mali-T604
    API support: OpenGL ES 3.0, OpenCL 1.1
    Primitives per frame: 150k
    Cycles per pixel: 16
    Draw calls per frame: 60
  2016: Lofoten
    Hardware: Galaxy S7
    GPU: Mali-T880MP12
    API support: Vulkan 1.0
    Primitives per frame: 600k
    Cycles per pixel: 40
    Draw calls per frame: 500

Vulkan: Developer-driven GFX API
  Vulkan graphics API driven by developer need
  Low-level API
  Ideal for new and emerging use-cases
  Fully exploit heterogeneous system
  Fully multi-threaded
  Benefit from HW coherency
  [Diagram: two software stacks]
    OpenGL ES: Application; Mali OpenGL ES Driver (driver handles context, memory and error management); Mali GPU
    Vulkan: Application (application handles memory allocation, resources, and thread management to generate command buffers); Mali Vulkan Driver (low-overhead driver); Mali GPU

What Vulkan Means for GPU Architecture
  Stricter system requirements
    GPU address faults cannot destabilize system
    Process isolation must be guaranteed
    Coherent shared memory is mandatory
  GPU architecture is more exposed
    Reduces flexibility in some areas, e.g. resource descriptors
  Application provides more information, sooner
    Reduces need for indirection/late binding

Bifrost

Why is the Architecture Called Bifrost?

ARM Mali Architecture Evolution
  BIFROST: Mali-G71 GPU
    Unified shader cores, scalar ISA, clause execution, full coherency, Vulkan, OpenCL
  MIDGARD: Mali-T600 GPU series, Mali-T700 GPU series, Mali-T800 GPU series
    Unified shader cores, SIMD ISA, OpenGL ES 3.x, OpenCL, Vulkan
  UTGARD: Mali-200 GPU, Mali-300 GPU, Mali-400 GPU, Mali-450 GPU, Mali-470 GPU
    Separate shader cores, SIMD ISA, OpenGL ES 2.x

Bifrost Features
  A more efficient architecture: more performance overall, per mm² and per line of real-world shader code
  Major shader core redesign
    New scalar, clause-based ISA
    New quad-based arithmetic units
    New core fabric
  New geometry data flow
    Reduces memory bandwidth and footprint
  1.5x performance improvement

Bifrost Architectural Innovations
  Energy efficiency
    Claused Shaders
    Index Driven Position Shading
    Wire light pipelines
  Developer friendly
    Designed for Vulkan and VR/AR
  Heterogeneous computing
    Full system coherency
  [Diagram labels: Midgard, Bifrost; CPU, CPU, GPU, Coherent Interconnect, DRAM]

Bifrost GPU Design

Bifrost GPU Design
  [Block diagram: Driver Software; Job Manager; Shader Core 0 to Shader Core 31; Control Fabric; Tiler; MMU; L2 Cache Segments; AXI Memory Buses]

Scalable System Design
  Up to 32 shader cores supported
  [GPU block diagram repeated]

Geometry Flow Improvement
  [GPU block diagram repeated]

Index-Driven Position Shading
  [Figure: geometry data flow, Midgard vs. Bifrost, with bandwidth used relative to memory storage size]
    Processing: Tiler Assembly, Position Shading, Tiler Culling, Varying Shading, Fragment Shading
    Memory: Indices, Positions, Transformed Positions, Polygon List, Vertex Attributes, Vertex Varyings
    Read/write bandwidth [x times of storage size], as labelled in the figure: 1x, 1x, 1x, ½x, ½x, ½x, ½x; totals 3.5x, 2.0x, 2.5x, 1.5x

Memory System
  Full coherency using ACE protocol
  [GPU block diagram repeated]

Memory System
  Full system coherency support
    Supports tightly coupled CPU+GPU use cases
  L2 cache improvements
    Single logical L2 cache makes software easier
    Fewer partial lines written to AXI, which improves LPDDR4 performance
  [Diagram labels: Cortex-A73 CPU, Mali-G71 GPU, CoreLink CCI-550, DMC-500, DRAM]

Bifrost Core Design

Execution Core Improvements
  [GPU block diagram repeated]

Bifrost Core Design
  [Shader core block diagram: Compute Frontend; Fragment Frontend; Quad Creators; Quad Manager; Quad Control; Execution Engine 0 to 2, each with Quad State; Control Fabric; Load/store Unit; Attribute Unit; Varying Unit; Texture Unit; Blender & Tile Access; Depth & Stencil; ZS Memory; Tile Memory; Tile Writeback; links to the L2 memory system]

Quad Creation
  [Shader core block diagram repeated]

Quad Management
  [Shader core block diagram repeated]

Quad Execution
  [Shader core block diagram repeated]

Quad Vectorization
  Bifrost uses quad-parallel execution
    Four scalar threads executed in lockstep in a quad
    One quad at a time executes in each pipeline stage
    Each thread fills one 32-bit lane of the hardware
    4 threads doing a vec3 FP32 add takes 3 cycles
    Improves utilization
  Quad vectorization is compiler friendly
    Each thread only sees a stream of scalar operations
    Vector operations can always be split into scalars
  [Figure: Lane 0 to Lane 3 over Cycle 1 to Cycle 4; threads T0 to T3, each with x, y, z components, plus Idle slots]
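
The "3 cycles" figure on the quad vectorization slide can be checked with a small lane-occupancy model. This is only a sketch of one reading of the slide's figure: the function names and the vector-per-thread baseline are assumptions made for illustration, not a description of ARM's actual pipeline.

```python
# Toy lane-occupancy model (illustrative assumption, not ARM's pipeline):
# four threads each perform a vec3 FP32 add on hardware with four 32-bit lanes.

LANES = 4  # four 32-bit lanes, as stated on the slide


def vector_per_thread_schedule(num_threads, components):
    """Baseline: each cycle issues one thread's whole vector across the lanes.
    Lanes not covered by the vector are idle."""
    return [
        [f"T{t}.{c}" for c in components] + ["Idle"] * (LANES - len(components))
        for t in range(num_threads)
    ]


def quad_scalar_schedule(num_threads, components):
    """Quad model: each cycle issues one scalar component for every thread
    in the quad, one thread per lane."""
    return [[f"T{t}.{c}" for t in range(num_threads)] for c in components]


def utilization(schedule):
    """Fraction of lane slots doing useful work."""
    slots = [slot for cycle in schedule for slot in cycle]
    return sum(slot != "Idle" for slot in slots) / len(slots)


vec3 = ["x", "y", "z"]
simd = vector_per_thread_schedule(4, vec3)
quad = quad_scalar_schedule(4, vec3)

print(len(simd), utilization(simd))  # 4 cycles, 0.75 of lane slots used
print(len(quad), utilization(quad))  # 3 cycles, 1.0 of lane slots used
```

In this model the same twelve scalar additions take four cycles with one idle lane per cycle in the vector-per-thread layout, and three fully occupied cycles in the quad layout.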

Classic Instruction Execution
  Scheduling decision before every instruction
  Architecturally visible state guaranteed after every instruction
  [Figure legend: Overhead, Instruction]

Clause Execution
  Back-to-back execution guaranteed within a clause
  Allows aggressive optimisation
  [Figure legend: Overhead, Instruction]

Clause Execution
  A simple register-based instruction set
    Each instruction fetches arguments from the register file
    And writes results back to the register file
  [Figure: register file R0 to R7 around ADD R2, R0, R1]

Clause Execution
  Register file access can be expensive
    Register file is often the most-used part of the GPU
    High bandwidth means high power consumption
    Thread allocation keeps registers close to where they are used
  [Figure: register file R0 to R7 around ADD R2, R0, R1 and ADD R4, R2, R3]

Clause Execution
  Back-to-back register access is common
    The result from one instruction is often only used as input to the next
  [Figure: register file R0 to R7 around ADD R2, R0, R1; ADD R4, R2, R3; ADD R0, R4, R5]

Clause Execution
  Back-to-back register access is common
    Register file bypass saves power
    Allows use of simpler, smaller register files
  [Figure: register file R0 to R7 and temporary T around ADD T, R0, R1; ADD T, T, R3; ADD R0, T, R5]
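
The saving from the temporary register can be made concrete by counting main register-file accesses for the two three-instruction sequences shown on the slides. This is a minimal sketch: the tuple encoding and the `regfile_accesses` helper are hypothetical, and treating `T` as generating no register-file traffic is an assumption of the model.

```python
# Count main register-file reads and writes for a short ADD sequence.
# "T" is the temporary (bypass) register from the slide; this sketch
# assumes that accesses to T never touch the main register file.


def regfile_accesses(program):
    """program: list of (opcode, dst, src_a, src_b) tuples.
    Returns (reads, writes) against the main register file."""
    reads = writes = 0
    for _opcode, dst, src_a, src_b in program:
        reads += sum(src != "T" for src in (src_a, src_b))
        writes += int(dst != "T")
    return reads, writes


# Sequence from the slide without bypass.
plain = [
    ("ADD", "R2", "R0", "R1"),
    ("ADD", "R4", "R2", "R3"),
    ("ADD", "R0", "R4", "R5"),
]

# Same computation with intermediate results kept in T.
bypassed = [
    ("ADD", "T", "R0", "R1"),
    ("ADD", "T", "T", "R3"),
    ("ADD", "R0", "T", "R5"),
]

print(regfile_accesses(plain))     # (6, 3)
print(regfile_accesses(bypassed))  # (4, 1)
```

Under these assumptions the bypassed sequence needs 5 register-file accesses instead of 9, which is the kind of reduction the slide associates with lower power and simpler register files.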

Clause Scheduling
  Required data not ready?
    Delay next clause if asynchronous data not ready
  [Figure: clause timeline with labels TEX, Unrelated, ?, Use result, Texture unit operation; legend: Overhead, Instruction]

Clause Scheduling
  Another quad can use this execution unit
  High utilization, high efficiency
  [Figure: clause timeline with labels TEX, ?, Use result, Texture unit operation; legend: Overhead, Quad 1, Quad 2]
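
The scheduling idea on these two slides can be illustrated with a toy simulation of one execution unit shared by several quads. Everything here is an assumption made for illustration: the `run` function, the clause encoding, and the cycle counts (4-cycle clauses, 8-cycle texture latency) are invented and do not come from the presentation.

```python
# Toy clause scheduler for a single execution unit.
# Each clause is (exec_cycles, async_latency): it runs back-to-back for
# exec_cycles, and the quad's next clause may only start async_latency
# cycles after it ends (e.g. while a texture result is in flight).


def run(quads):
    """quads: list of clause lists. Returns (total_cycles, busy_cycles)."""
    next_clause = [0] * len(quads)  # index of the next clause per quad
    ready_at = [0] * len(quads)     # earliest start cycle of that clause
    now = busy = 0
    while any(next_clause[q] < len(quads[q]) for q in range(len(quads))):
        runnable = [
            q for q in range(len(quads))
            if next_clause[q] < len(quads[q]) and ready_at[q] <= now
        ]
        if not runnable:
            now += 1  # no quad has its data yet: the unit idles one cycle
            continue
        q = runnable[0]
        exec_cycles, async_latency = quads[q][next_clause[q]]
        now += exec_cycles   # the clause executes without interruption
        busy += exec_cycles
        ready_at[q] = now + async_latency
        next_clause[q] += 1
    return now, busy


# Hypothetical shader: one clause issues a texture fetch, the next uses it.
shader = [(4, 8), (4, 0)]

for count in (1, 2, 3):
    total, busy = run([shader] * count)
    print(count, total, busy, busy / total)
```

With these made-up numbers a single quad keeps the unit busy for 8 of 16 cycles, two quads for 16 of 20, and three quads for 24 of 24, which matches the slide's point that other quads can fill the time a waiting quad would otherwise waste.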

Bifrost Arithmetic Functional Units
  Executes quad-parallel scalar operations
    4x32-bit multiplier (FMA)
    4x32-bit adder (ADD)
    Adder includes special function unit
  Smaller and more area efficient
  Simplified layout eases compilation
    Better scheduling in today's code
    Better utilization
    One instruction word contains two instructions
  [Diagram labels: Main Regs Read, FMA, ADD/SF, Main Regs Write, Temp Registers]

Bifrost Arithmetic Functional Units
  Retains support for smaller width data types
    Integers useful for deep learning
    2x performance for FP16, useful for pixel shaders
  [Diagram: contents of one 32-bit lane]
    8-bit integers: int8 int8 int8 int8
    16-bit integers: int16 int16
    32-bit integers: int32
    16-bit floating point: float16 float16
    32-bit floating point: float32
  [Diagram labels: Main Regs Read, FMA, ADD/SF, Main Regs Write, Temp Registers]
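
The data-type diagram reads as a packing rule: narrower values share a 32-bit lane. The sketch below works through that arithmetic. It assumes throughput scales with the number of values per lane, which is consistent with the slide's 2x FP16 claim but is otherwise an assumption of this example. The packing check uses Python's standard `struct` module.

```python
import struct

LANE_BITS = 32   # lane width stated on the slides
QUAD_LANES = 4   # one lane per thread in a quad


def elements_per_lane(width_bits):
    """How many values of the given width fit in one 32-bit lane."""
    return LANE_BITS // width_bits


# Types listed on the slide and their widths in bits.
TYPES = {"int8": 8, "int16": 16, "int32": 32, "float16": 16, "float32": 32}

for name, width in TYPES.items():
    per_lane = elements_per_lane(width)
    print(f"{name}: {per_lane} per lane, {per_lane * QUAD_LANES} per quad")

# Packing check: two float16 values, or four int8 values, occupy exactly
# one 32-bit word.
two_halves = struct.pack("<2e", 1.5, -2.0)
four_bytes = struct.pack("<4b", 1, -2, 3, -4)
print(len(two_halves) * 8, len(four_bytes) * 8)  # 32 32
```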

Load/Store Units Separated
  [Shader core block diagram repeated]

First Incarnation of Bifrost

Mali-G71
  Built on the innovative new Bifrost architecture
  Premium GPU delivering our highest performance ever
  ARM's most scalable GPU to date

Mali-G71 Efficiency Drives Performance
  20% higher energy efficiency*
  40% better performance density*
  20% bandwidth improvement*
  32 shader cores
  Optimized for next generation, advanced, real-world content
  *Compared to Mali-T880, on same process node under the same conditions.

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright 2016 ARM Limited