GPU architecture II: Scheduling the graphics pipeline


Mike Houston, AMD / Stanford
Aaron Lefohn, Intel / University of Washington

Notes: The front half of this talk is almost verbatim from "Keeping Many Cores Busy: Scheduling the Graphics Pipeline" by Jonathan Ragan-Kelley, from the SIGGRAPH Beyond Programmable Shading course, 2010. I'll be adding my own twists from my background working on some of the hardware in the talk.

This talk: how to think about scheduling GPU-style pipelines; the four constraints which drive scheduling decisions; and examples of these concepts in real GPU designs. Goals: know why GPUs and APIs impose the constraints they do; develop intuition for what they can do well; understand key patterns for building your own pipelines. We then dig into the ATI Radeon HD 5800 series ("Cypress") architecture. What questions do you have?

First, a definition. Scheduling [n.]: assigning computations and data to resources in space and time.

The workload: the Direct3D logical pipeline. Data flows through IA, VS, PA, HS, Tess, DS, GS, and Rast, a mix of fixed-function stages (IA, Tess, PA, Rast) and programmable stages (VS, HS, DS, GS).

The machine: a modern GPU. The logical pipeline's fixed-function stages (input assembler, primitive assembler, rasterizer, output) map onto fixed-function logic, while the programmable stages run on a pool of programmable cores with texture units, all fed by a task distributor under fixed-function control.

Scheduling a draw call as a series of tasks: over time, IA, VS, PA, and Rast tasks for the draw flow through the input assembler, shader cores, primitive assembler, and rasterizer.

An efficient schedule keeps hardware busy: IA, VS, PA, and Rast tasks from many batches execute concurrently, overlapping in time so that no unit sits idle.

Choosing which tasks to run when (and where):
Resource constraints: tasks can only execute when there are sufficient resources for their computation and their data.
Coherence: control coherence is essential to shader core efficiency; data coherence is essential to memory and communication efficiency.
Load balance: irregularity in execution time creates bubbles in the pipeline schedule.
Ordering: graphics APIs define strict ordering semantics, which restrict possible schedules.

Resource constraints limit scheduling options: a task cannot be scheduled until the resources its computation and outputs require are available.

Resource constraints limit scheduling options: if tasks launch before their downstream resources exist, the pipeline can deadlock, with every unit waiting on space that will never free up. Key concept: pre-allocation of resources helps guarantee forward progress.
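The pre-allocation rule can be sketched in a few lines. This is an illustrative model, not driver code: the `Queue`/`try_launch` names and the slot-reservation scheme are assumptions made for the example.

```python
# Hypothetical sketch of the pre-allocation rule: a task may only launch
# once every resource it will need (here, slots in its output queue) has
# been reserved up front. If reservation fails, the task waits instead of
# occupying a core while blocked, so the pipeline cannot deadlock.

class Queue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.reserved = 0            # slots promised to in-flight tasks

    def try_reserve(self, n):
        """Reserve n output slots before launch; fail rather than stall."""
        if len(self.items) + self.reserved + n > self.capacity:
            return False
        self.reserved += n
        return True

    def push(self, item):
        """Fill a slot that was reserved earlier."""
        assert self.reserved > 0, "push without a reservation"
        self.reserved -= 1
        self.items.append(item)

def try_launch(task_outputs, out_queue):
    """Launch only if all the task's outputs already fit downstream."""
    return out_queue.try_reserve(task_outputs)
```

With a capacity-4 queue, a task producing 3 outputs launches, but a second task producing 2 more must wait until the first retires its work.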

Coherence is a balancing act. There is intrinsic tension between horizontal (control, fetch) coherence and vertical (producer-consumer) locality, and between locality and load balance.

Graphics workloads are irregular: the rasterizer may hand one core SuperExpensive() while its neighbor gets Trivial()! But cores are optimized for regular, self-similar work, and imbalanced work creates bubbles in the task schedule. Solution: dynamically generating and aggregating tasks isolates irregularity and recaptures coherence; redistributing tasks restores load balance.

Redistribution after irregular amplification: the rasterizer amplifies a few primitives into very different numbers of fragments, skewing the schedule.

Redistribution after irregular amplification: fragments are regrouped into uniform tasks before shading. Key concept: manage irregularity by dynamically generating, aggregating, and redistributing tasks.
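The aggregation step above can be sketched as follows. This is a minimal illustration of the idea only; the task size and the `aggregate` helper are invented for the example.

```python
# Illustrative sketch: irregular amplification (each primitive produces a
# different number of fragments) is tamed by packing fragments into
# fixed-size tasks that can be redistributed evenly across cores.

TASK_SIZE = 8   # fragments per shading task (illustrative)

def aggregate(frag_counts):
    """Pack per-primitive fragment counts into uniform shading tasks."""
    frags = []
    tasks = []
    for prim, count in enumerate(frag_counts):
        frags.extend((prim, i) for i in range(count))
        while len(frags) >= TASK_SIZE:
            tasks.append(frags[:TASK_SIZE])
            frags = frags[TASK_SIZE:]
    if frags:
        tasks.append(frags)          # final partial task
    return tasks

# One huge primitive and several trivial ones still yield uniform tasks:
tasks = aggregate([20, 1, 0, 3])
```

A 20-fragment primitive next to 0- and 1-fragment primitives still produces evenly sized tasks, which is what recaptures coherence and load balance.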

Ordering rule: all framebuffer updates must appear as though all triangles were drawn in strict sequential order. Key concept: carefully structure task redistribution to maintain API ordering.

Building a real pipeline

Static tile scheduling: the simplest thing that could possibly work. Multiple cores, with 1 vertex front-end and n back-ends, each back-end statically assigned a fixed tile of the screen. Exemplar: ARM Mali 400.

Static tile scheduling: locality is captured within tiles; resource constraints are static, which keeps the design simple; ordering falls out of the single front-end and sequential processing within each tile. Exemplar: ARM Mali 400.

Static tile scheduling, the problem: load imbalance. With only one task creation point and no dynamic task redistribution, cores whose tiles receive little work sit idle while one core does all the heavy shading. Exemplar: ARM Mali 400.
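The imbalance is easy to see in a toy model. The tile costs and the round-robin assignment below are made-up numbers for illustration, not Mali behavior.

```python
# Sketch of static tile scheduling's weakness: tiles are bound to back-end
# cores up front, so a core whose tiles happen to contain the expensive
# geometry does all the work while the other cores idle.

def static_schedule(tile_costs, n_cores):
    """Assign tile i to core i % n_cores, as a static scheduler might."""
    per_core = [0] * n_cores
    for i, cost in enumerate(tile_costs):
        per_core[i % n_cores] += cost
    return per_core

# All the expensive triangles land in tiles owned by core 0:
loads = static_schedule([100, 1, 1, 1, 100, 1, 1, 1], n_cores=4)
```

Here core 0 carries 200 units of work while the others carry 2 each; no mechanism exists to move that work once the assignment is fixed.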

Sort-last fragment shading: redistributing fragments after rasterization restores fragment load balance. But how can we maintain order? Exemplars: NVIDIA G80, ATI RV770.

Sort-last fragment shading, step 1: preallocate outputs in FIFO order. Exemplars: NVIDIA G80, ATI RV770.

Sort-last fragment shading, step 2: complete shading asynchronously. Exemplars: NVIDIA G80, ATI RV770.

Sort-last fragment shading, step 3: retire fragments to the output in FIFO order. Exemplars: NVIDIA G80, ATI RV770.
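The preallocate/complete-asynchronously/retire-in-order pattern is essentially a reorder buffer. A minimal sketch, with invented names (`ReorderFIFO`, tickets) standing in for whatever the hardware actually uses:

```python
# Sketch of sort-last ordering: output slots are preallocated in FIFO
# (submission) order, shading completes asynchronously in any order, and
# the output unit retires only the contiguous completed prefix, so
# framebuffer updates still appear in strict sequential order.

class ReorderFIFO:
    def __init__(self):
        self.slots = []        # None = still shading, else the result
        self.head = 0          # next slot to retire

    def preallocate(self):
        """Reserve the next slot at rasterization time; returns a ticket."""
        self.slots.append(None)
        return len(self.slots) - 1

    def complete(self, ticket, result):
        """Shader cores finish in any order."""
        self.slots[ticket] = result

    def retire(self):
        """Retire the contiguous prefix of completed slots, in order."""
        done = []
        while self.head < len(self.slots) and self.slots[self.head] is not None:
            done.append(self.slots[self.head])
            self.head += 1
        return done

fifo = ReorderFIFO()
t0, t1, t2 = fifo.preallocate(), fifo.preallocate(), fifo.preallocate()
fifo.complete(t2, "C")         # the last fragment finishes first...
fifo.complete(t0, "A")
order = fifo.retire()          # ...but only "A" can retire so far
fifo.complete(t1, "B")
order += fifo.retire()
```

Even though "C" finishes first, it cannot retire past the unfinished "B", so the output stream stays in submission order.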

Unified shaders: solve load balance by time-multiplexing different stages onto shared processors according to load. Fixed-function units (input assembler, primitive assembler, rasterizer, output, texture units) surround the shared cores, with a task distributor assigning work. Exemplars: NVIDIA G80, ATI RV770.

Unified shaders: time-multiplexing cores. Over time, tasks from different stages of the logical pipeline (IA, VS, PA, Rast) are interleaved across the machine according to load. Exemplars: NVIDIA G80, ATI RV770.

Prioritizing the logical pipeline: each stage is assigned a priority, decreasing down the pipeline (IA 5, VS 4, PA 3, Rast 2, and so on down to 0).

Prioritizing the logical pipeline: fixed-size queue storage sits between stages, holding each stage's pending input.

Scheduling the pipeline: at each step, pick the highest-priority stage that has queued input and room in its output queue, and run it.

Scheduling the pipeline: a stage can be high priority but stalled on its output queue, while a lower-priority stage is ready to run; the scheduler runs the ready stage.

Scheduling the pipeline: queue sizes and backpressure provide a natural knob for balancing horizontal batch coherence against producer-consumer locality.
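The scheduling rule from the last few slides fits in one function. A minimal sketch with assumed stage names and an arbitrary queue capacity; real hardware schedules many such decisions in parallel:

```python
# Run the highest-priority stage that has queued input AND room in its
# output queue. The fixed queue capacity is the backpressure knob: small
# queues force producer-consumer locality, large queues favor big
# coherent batches.

STAGES = ["IA", "VS", "PA", "Rast"]   # priority: earlier stage = higher
CAPACITY = 2                          # fixed-size inter-stage queues

def pick_stage(queues):
    """queues[s] = items waiting at stage s's input; the final stage
    writes to the (unbounded, here unmodeled) framebuffer."""
    for s in range(len(STAGES)):
        has_input = queues[s] > 0
        output_full = s + 1 < len(STAGES) and queues[s + 1] >= CAPACITY
        if has_input and not output_full:
            return STAGES[s]
    return None                        # everything empty or stalled

# VS is high priority but stalled on its full output queue, and PA is
# likewise stalled, so the lower-priority rasterizer runs instead:
choice = pick_stage([0, 3, CAPACITY, CAPACITY])
```

Shrinking `CAPACITY` drains work downstream sooner (more vertical locality); growing it lets each stage accumulate larger coherent batches before switching.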

OptiX

Summary

Key concepts: Think of scheduling the pipeline as mapping tasks onto cores. Preallocate resources before launching a task; preallocation helps ensure forward progress and prevent deadlock. Graphics is irregular; dynamically generating, aggregating, and redistributing tasks at irregular amplification points regains coherence and load balance. Order matters; carefully structure task redistribution to maintain ordering.

Why don't we have dynamic resource allocation (e.g. recursion, malloc() in shaders)? Static preallocation of resources guarantees forward progress; tasks which outgrow available resources can stall, causing deadlock.

Geometry shaders are slow because they allow dynamic amplification in shaders. Pick your poison: (1) always stream through DRAM (exemplar: ATI R600): smooth falloff for large amplification, but very slow for small amplification (DRAM latency); or (2) scale down parallelism to fit (exemplar: NVIDIA G80): fast for small amplification, but poor shader throughput (no parallelism) for large amplification.

Why isn't rasterization programmable? Yes, partly because it is computationally intensive, but also: it is highly irregular; it must generate and aggregate regular output; and it must integrate with an order-preserving task redistribution mechanism.

Questions for the future: Can we relax the strict ordering requirements? Can you build a generic scheduler for application-defined pipelines? What application-specific information would a generic scheduler need to work well?

Starting points to learn more. The next step, parallel primitive processing: Eldridge et al., "Pomegranate: A Fully Scalable Graphics Architecture", SIGGRAPH 2000; Tim Purcell, "Fast Tessellated Rendering on Fermi GF100", Hot3D, HPG 2010. Scheduling cyclic graphs, in software, on current GPUs: Parker et al., "OptiX: A General Purpose Ray Tracing Engine", SIGGRAPH 2010. Details of the ARM Mali design: Tom Olson, "Mali-400 MP: A Scalable GPU for Mobile Devices", Hot3D, HPG 2010.

Cypress Deep Dive

ATI Radeon HD 5870 GPU features, compared with the ATI Radeon HD 4870:

  Feature                HD 4870          HD 5870           Difference
  Area                   263 mm^2         334 mm^2          1.27x
  Transistors            956 million      2.15 billion      2.25x
  Memory bandwidth       115 GB/sec       153 GB/sec        1.33x
  L2-L1 read bandwidth   512 bytes/clk    640 bytes/clk     1.25x
  L1 bandwidth           512 bytes/clk    1280 bytes/clk    2.5x
  Vector GPR             2.62 MB          5.24 MB           2x
  LDS memory             160 kB           640 kB            4x
  LDS bandwidth          640 bytes/clk    2560 bytes/clk    4x
  Concurrent threads     15872            31744             2x
  ALU units              800              1600              2x
  Board power, idle      90 W             27 W              0.3x
  Board power, max       160 W            188 W             1.17x

20 SIMD engines (MPMD), each with 16 Stream Cores (SC), each SC with 5 Processing Elements (PE), for 1600 PEs in total. Each PE is IEEE 754-2008 precision capable: denormal support, fma, flags, and round modes, plus fast integer support. Each SIMD engine has a local data share: 32 kB of shared low-latency memory, 32 banks with hardware conflict management, and 32 integer atomic units. 80 read address probes: 4 addresses per SIMD engine (32-128 bits of data), with 4 filter or convert logic units per SIMD. Global memory access: 32 SC read/write/integer-atomic accesses per clock, a relaxed unordered memory consistency model, an on-chip 64 kB global data share, and a 153 GB/sec GDDR5 memory interface.

Compute aspects of the Radeon HD 5870: Stream Cores, the Local Data Share (LDS), the SIMD engine, load/store/atomic data access, dispatch and indirect dispatch, and the Global Data Share (GDS).

Stream Core with Processing Elements (PE). Each Stream Core includes: 4 PEs, capable per issue of 4 independent SP or integer ops, 2 DP adds or dependent SP pairs, or 1 DP fma or multiply or SP dot4; 1 Special Function PE, capable of 1 SP or integer operation and SP or DP transcendental ops; operand prep logic; general purpose registers; and data forwarding and predication logic.

Processing Element (PE) precision improvements: FMA (fused multiply-add), IEEE 754-2008 precise with all round modes, proper handling of NaN/Inf/zero, and full denormal support in hardware for SP and DP. MULADD instruction without truncation, enabling an IEEE multiply followed by an IEEE add to be combined, with rounding and normalization after both the multiplication and the subsequent addition. IEEE rounding modes (round to nearest even, round toward +infinity, round toward -infinity, round toward zero) supported under program control anywhere in the shader; double and single precision modes are controlled separately; applies to all slots in a VLIW. Denormal programmable mode control for SP and DP independently, with separate control for input flush-to-zero and underflow flush-to-zero. FP conversion ops between 16-bit, 32-bit, and 64-bit floats with full IEEE 754 precision. Exception detection in hardware for floating-point numbers (inexact, underflow, overflow, division by zero, denormal, invalid operation), with a software recording and reporting mechanism. 64-bit transcendental approximation: hardware-based double precision approximation for reciprocal, reciprocal square root, and square root.
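Why fused rounding matters can be shown on any machine. This illustrative snippet (plain Python, not Cypress code) exposes the residual that a separate multiply-then-add throws away and a true FMA keeps:

```python
# With a separate multiply and add, the product is rounded to 53 bits
# before the add, losing the low-order term. A fused multiply-add rounds
# only once, after the addition, so that term survives. We use exact
# rational arithmetic to stand in for the fused result.
from fractions import Fraction

a = 1.0 + 2.0 ** -30            # exactly representable in a double
p = a * a                       # exact value is 1 + 2**-29 + 2**-60,
                                # but the multiply rounds it to 1 + 2**-29

naive = a * a - p               # separate mul+add: the residual is lost
exact = Fraction(a) * Fraction(a) - Fraction(p)   # what an FMA would keep
```

The unfused path returns exactly 0.0, while the true residual is 2**-60; hardware FMA recovers it in one instruction with one rounding.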

Processing Element (PE) improved IPC: co-issue of dependent ops in one VLIW instruction, with full IEEE intermediate rounding and normalization. Dot4 (A = A*B + C*D + E*F + G*H); dual Dot2 (A = A*B + C*D; F = G*H + I*J); dual dependent multiplies (A = A*B*C; F = G*H*I); dual dependent adds (A = B+C+D; E = F+G+H); dependent muladd (A = A*B+C + D*E; F = G*H + I + J*K). Also 24-bit integer MUL and MULADD (4 co-issued), used heavily for integer thread-group address calculation.

Processing Element (PE) new integer ops: count bits set (32-bit and 64-bit operands); insert bit field; extract bit field; find first bit (high, low, signed high); reverse bits. Extended integer math: integer add with carry; integer subtract with borrow; 1-bit prefix sum on a 64-bit mask (useful for compaction); an accessible 64-bit counter; uniform indexing of constants.
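The compaction use of a prefix sum over a bit mask can be sketched as follows. The `compact` helper is invented for the example; Python ints stand in for the hardware's 64-bit mask and popcount:

```python
# Why a prefix sum over a bit mask enables compaction: each active lane's
# output index is simply the number of active lanes before it, which is a
# popcount of the mask bits below its position.

def compact(values, mask):
    """Scatter values whose mask bit is set into a dense output array."""
    out = [None] * bin(mask).count("1")
    for lane, v in enumerate(values):
        if (mask >> lane) & 1:
            below = mask & ((1 << lane) - 1)     # lanes before this one
            out[bin(below).count("1")] = v       # prefix popcount = slot
    return out

# Lanes 0, 1, and 3 are active (mask 0b1011); lane 2 is discarded:
dense = compact(["a", "b", "c", "d"], mask=0b1011)
```

Every active lane computes its destination independently, which is why a hardware prefix sum on the mask lets a whole wavefront compact its results without serialization.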

Processing Element (PE) special ops. Conversion ops: FP32 to FP64 and FP64 to FP32 (with IEEE conversion rules); FP32 to FP16 and FP16 to FP32 (with IEEE conversion rules); FP32 to Int/UInt and Int/UInt to FP32. Very fast 8-bit sum of absolute differences (SAD): 4x1 SAD per lane, allowing a 4x4 8-bit SAD in one VLIW; used for video encoding and computer vision; exposed via an OpenCL extension. Video ops: 8-bit packed to float and float to 8-bit packed conversion ops; 4-pixel 8-bit average (bilinear interpolation with programmable rounding); arbitrary byte or bit extraction from 64 bits.
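What the SAD instruction computes is simple to state in software. A reference sketch (the hardware does a full 4x4 block per VLIW; this scalar version is for illustration only, with made-up pixel values):

```python
# Sum of absolute differences (SAD) over two 4x4 blocks of 8-bit pixels,
# the primitive accelerated for video-encoding motion search: the encoder
# slides a candidate block over the reference frame and keeps the
# position with the lowest SAD score.

def sad_4x4(block_a, block_b):
    """Sum of |a - b| over the 16 pixels of two 4x4 blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

cur = [[10, 10, 10, 10]] * 4     # current-frame block
ref = [[12, 9, 10, 10]] * 4      # candidate reference block
score = sad_4x4(cur, ref)        # lower score = better match
```

Doing this 16-pixel reduction in one VLIW instruction, across all lanes, is what makes exhaustive block matching practical on the GPU.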

Local Data Share (LDS): share data between work-items of a work-group to increase performance. High-bandwidth access per SIMD engine: peak is double the external R/W bandwidth; fully coalesced R/W/atomic access, with optimization for broadcast reads. Low-latency access per SIMD engine: 0 latency for direct reads (conflict-free or broadcast), 1 VLIW instruction of latency for an LDS indirect op; all bank conflicts are detected in hardware and serialized as necessary, with fast support for broadcast reads. Hardware allocation of LDS space per thread-group dispatch: base and size are stored with the wavefront for private access, with a hardware range check (out-of-bound reads return 0, out-of-bound writes are killed). Access rates: 32 byte/ubyte/short/ushort reads or writes per clock (reads are sign-extended); 32 dword accesses per clock. Per lane: 32-bit read, read2, write, write2; 32-bit atomics (add, sub, rsub, inc, dec, min, max, and, or, xor, mskor, exchange, exchange2, compare_swap(int), compare_swap(float, with NaN/Inf handling)), returning the pre-op value to the Stream Core's 4 primary Processing Elements.

SIMD Engine: can process wavefronts from multiple kernels concurrently. Masking and/or branching enables thread divergence within a wavefront, letting each thread traverse a unique program execution path. Full hardware barrier support for up to 8 work-groups per SIMD engine (for thread data sharing). Each Stream Core receives, per VLIW instruction issue, either 5 unique ALU ops, or 4 unique ALU ops plus an LDS op (up to 3 operands per thread). LDS and global memory access for byte/ubyte/short/ushort reads and writes is supported at 32-bit dword rates. Private loads and read-only texture reads go via the read cache; unordered, shared-consistent loads/stores/atomics go via the R/W cache. Wavefront length is 64 threads, each executing a 5-way VLIW instruction; each issue runs a quarter wavefront (16 threads) on each of 4 clocks (T0-15, T16-31, T32-47, T48-63).
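How masking lets threads within a wavefront diverge can be sketched with a toy execution model. This is an illustrative model only, not real hardware behavior: real GPUs track the mask in hardware and skip fully dead paths.

```python
# Sketch of wavefront divergence via an execution mask: all 64 threads
# step through both sides of a branch, but a per-lane mask keeps only the
# threads on the taken path live, so each thread still sees its own
# unique execution path.

WAVEFRONT = 64

def run_branch(cond, then_fn, else_fn, values):
    """Execute if/else over a wavefront using an execution mask."""
    mask = [cond(v) for v in values]          # per-thread predicate
    out = list(values)
    for lane in range(WAVEFRONT):             # 'then' side, masked
        if mask[lane]:
            out[lane] = then_fn(out[lane])
    for lane in range(WAVEFRONT):             # 'else' side, inverted mask
        if not mask[lane]:
            out[lane] = else_fn(out[lane])
    return out

# Even lanes take the multiply path, odd lanes the add path:
res = run_branch(lambda v: v % 2 == 0,
                 lambda v: v * 10,
                 lambda v: v + 1,
                 list(range(WAVEFRONT)))
```

Both sides of the branch cost issue slots for the whole wavefront, which is why divergent control flow within a wavefront costs throughput even though it is functionally correct.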
