Introduction to GPU Architecture


Introduction to GPU Architecture. Ofer Rosenberg, PMTS SW, OpenCL Dev. Team, AMD. Based on "From Shader Code to a Teraflop: How GPU Shader Cores Work" by Kayvon Fatahalian, Stanford University.

Content
1. Three major ideas that make GPU processing cores run fast
2. Closer look at real GPU designs: NVIDIA GTX 580, AMD Radeon 6970
3. The GPU memory hierarchy: moving data to processors
4. Heterogeneous cores

Part 1: Throughput processing. Three key concepts behind how modern GPU processing cores run code. Knowing these concepts will help you:
1. Understand the space of GPU core (and throughput CPU core) designs
2. Optimize shaders/compute kernels
3. Establish intuition: what workloads might benefit from the design of these architectures?

What's in a GPU? A GPU is a heterogeneous chip multi-processor (highly tuned for graphics). The diagram shows eight shader cores alongside fixed-function blocks: texture units (Tex), Input Assembly, Rasterizer, Output Blend, Video Decode, and a Work Distributor. (Which of these are HW and which are SW?)

A diffuse reflectance shader. Shader programming model: fragments are processed independently, but there is no explicit parallel programming.

    sampler mySamp;
    Texture2D<float3> myTex;
    float3 lightDir;

    float4 diffuseShader(float3 norm, float2 uv) {
      float3 kd;
      kd = myTex.Sample(mySamp, uv);
      kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
      return float4(kd, 1.0);
    }
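As a rough CUDA analogue (a sketch of my own, not part of the original talk), the same programming model looks like this: each thread shades one fragment, the kernel body is plain scalar code, and all parallelism comes from how the runtime maps fragments to threads. The texture sample is replaced by a plain array read to keep the sketch self-contained; all names are illustrative.

    // One thread per fragment; no explicit parallelism inside the body.
    __global__ void shadeFragments(const float3 *albedo,   // stand-in for the texture sample
                                   const float3 *norms,    // per-fragment normals
                                   float3 lightDir, float4 *out, int nFrags)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nFrags) return;

        float3 n = norms[i];
        float ndotl = lightDir.x * n.x + lightDir.y * n.y + lightDir.z * n.z;
        ndotl = fminf(fmaxf(ndotl, 0.0f), 1.0f);            // clamp(dot(lightDir, norm), 0, 1)

        float3 kd = albedo[i];
        out[i] = make_float4(kd.x * ndotl, kd.y * ndotl, kd.z * ndotl, 1.0f);
    }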

Compile shader: one unshaded fragment input record goes in, one shaded fragment output record comes out.

Source (HLSL):

    sampler mySamp;
    Texture2D<float3> myTex;
    float3 lightDir;

    float4 diffuseShader(float3 norm, float2 uv) {
      float3 kd;
      kd = myTex.Sample(mySamp, uv);
      kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
      return float4(kd, 1.0);
    }

Compiled shader:

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul  r3, v0, cb0[0]
    madd r3, v1, cb0[1], r3
    madd r3, v2, cb0[2], r3
    clmp r3, r3, l(0.0), l(1.0)
    mul  o0, r0, r3
    mul  o1, r1, r3
    mul  o2, r2, r3
    mov  o3, l(1.0)

Execute shader. The compiled instructions run one at a time on the simple core (Fetch/Decode, ALU (Execute), Execution Context); the original slides step through the listing instruction by instruction:

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul  r3, v0, cb0[0]
    madd r3, v1, cb0[1], r3
    madd r3, v2, cb0[2], r3
    clmp r3, r3, l(0.0), l(1.0)
    mul  o0, r0, r3
    mul  o1, r1, r3
    mul  o2, r2, r3
    mov  o3, l(1.0)

CPU-style cores: Fetch/Decode, ALU (Execute), Execution Context, a data cache (a big one), out-of-order control logic, a fancy branch predictor, and a memory pre-fetcher.

Slimming down. Idea #1: remove the components that help a single instruction stream run fast, leaving just Fetch/Decode, ALU (Execute), and Execution Context.

Two cores (two fragments in parallel): fragment 1 runs the <diffuseShader> instruction stream on one core and fragment 2 runs it on the other; each core has its own Fetch/Decode, ALU (Execute), and Execution Context.

Four cores (four fragments in parallel): four copies of the simple core, each with its own Fetch/Decode, ALU (Execute), and Execution Context.

Sixteen cores (sixteen fragments in parallel) 16 cores = 16 simultaneous instruction streams

Instruction stream sharing. But... many fragments should be able to share an instruction stream! (Every fragment runs the same <diffuseShader> code.)

Recall the simple processing core: Fetch/Decode, ALU (Execute), Execution Context.

Add ALUs. Idea #2: amortize the cost/complexity of managing an instruction stream across many ALUs (SIMD processing). A single Fetch/Decode unit now drives ALUs 1-8, each with its own context (Ctx), backed by shared Ctx data.

Modifying the shader. Original compiled shader: processes one fragment using scalar ops on scalar registers.

    <diffuseShader>:
    sample r0, v4, t0, s0
    mul  r3, v0, cb0[0]
    madd r3, v1, cb0[1], r3
    madd r3, v2, cb0[2], r3
    clmp r3, r3, l(0.0), l(1.0)
    mul  o0, r0, r3
    mul  o1, r1, r3
    mul  o2, r2, r3
    mov  o3, l(1.0)

Modifying the shader. New compiled shader: processes eight fragments using vector ops on vector registers.

    <VEC8_diffuseShader>:
    VEC8_sample vec_r0, vec_v4, t0, vec_s0
    VEC8_mul  vec_r3, vec_v0, cb0[0]
    VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
    VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
    VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
    VEC8_mul  vec_o0, vec_r0, vec_r3
    VEC8_mul  vec_o1, vec_r1, vec_r3
    VEC8_mul  vec_o2, vec_r2, vec_r3
    VEC8_mov  o3, l(1.0)

Modifying the shader. Fragments 1-8 are mapped onto ALUs 1-8, all driven by the single VEC8 instruction stream.

128 fragments in parallel 16 cores = 128 ALUs, 16 simultaneous instruction streams

128 [vertices / primitives / fragments / OpenCL work items] in parallel: the same design applies regardless of what the work items are (vertices, primitives, fragments, compute work items).

But what about branches? With eight ALUs running in lockstep, each ALU evaluates the condition for its own fragment; in the example, the lanes split T T F T F F F F. The lanes that take one side of the branch execute while the others are masked off, and then the remaining lanes execute the other side. Not all ALUs do useful work! Worst case: 1/8 peak performance.

    <unconditional shader code>
    if (x > 0) {
      y = pow(x, exp);
      y *= Ks;
      refl = y + Ka;
    } else {
      x = 0;
      refl = Ka;
    }
    <resume unconditional shader code>
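In CUDA terms (a hedged sketch of my own, not part of the original slides), the same effect appears when the 32 threads of a warp disagree on a branch condition: the hardware runs each side of the branch in turn with the non-participating lanes masked off.

    // Warp divergence sketch. If threads within a warp take different
    // branch directions, the two paths execute serially, so in the worst
    // case only a fraction of the lanes do useful work on any given clock.
    __global__ void divergent(const float *x, float *refl,
                              float Ka, float Ks, float e, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        if (x[i] > 0.0f) {              // lanes with x > 0 run this path first...
            float y = powf(x[i], e) * Ks;
            refl[i] = y + Ka;
        } else {                        // ...then the remaining lanes run this path
            refl[i] = Ka;
        }
    }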

Clarification: SIMD processing does not imply SIMD instructions.
Option 1: explicit vector instructions (x86 SSE, AVX, Intel Larrabee).
Option 2: scalar instructions with implicit HW vectorization; the HW determines instruction-stream sharing across ALUs (the amount of sharing is hidden from software). NVIDIA GeForce ("SIMT" warps), ATI Radeon architectures ("wavefronts").
In practice: 16 to 64 fragments share an instruction stream.

Stalls! Stalls occur when a core cannot run the next instruction because of a dependency on a previous operation. Texture access latency = 100s to 1000s of cycles. We've removed the fancy caches and logic that help avoid stalls.

But we have LOTS of independent fragments. Idea #3: Interleave processing of many fragments on a single core to avoid stalls caused by high latency operations.

Hiding shader stalls (time in clocks). The core's context storage is split into four groups (1-4), holding fragments 1-8, 9-16, 17-24, and 25-32. When the group currently executing stalls (e.g. on a texture fetch), the core switches to another runnable group; over time each group alternates between running and stalling.

Throughput! Each group starts, runs until it stalls, then waits while the other groups run; eventually every group finishes. The key trade-off: increase the run time of one group to increase the throughput of many groups.
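For a rough feel of the numbers (illustrative values of my own, not from the slides): if a texture fetch costs on the order of 400 cycles and a group can issue roughly 20 cycles of independent ALU work before it needs the fetched data, the core needs about 400 / 20 = 20 runnable groups interleaved to keep the ALUs busy across the stall. With fewer groups, some clocks go idle; that trade-off is exactly what the context-storage slides below quantify.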

Storing contexts: the core has a pool of context storage (128 KB) that can be partitioned among fragment groups.

Eighteen small contexts (maximal latency hiding).

Twelve medium contexts.

Four large contexts (low latency-hiding ability).
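As a back-of-the-envelope illustration, splitting the 128 KB pool from the earlier slide evenly gives 128 KB / 18 ≈ 7 KB per context, 128 KB / 12 ≈ 10.7 KB, and 128 KB / 4 = 32 KB. A shader that needs a lot of per-group state (registers, temporaries) forces large contexts, leaving fewer groups to switch between and therefore less ability to hide latency.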

Clarification: interleaving between contexts can be managed by hardware or software (or both!).
NVIDIA / AMD Radeon GPUs: HW schedules/manages all contexts (lots of them); special on-chip storage holds fragment state.
Intel Larrabee: HW manages four x86 (big) contexts at fine granularity; SW scheduling interleaves many groups of fragments on each HW context; the L1/L2 cache holds fragment state (as determined by SW).

Example chip: 16 cores, 8 mul-add ALUs per core (128 total), 16 simultaneous instruction streams, 64 concurrent (but interleaved) instruction streams, 512 concurrent fragments = 256 GFLOPs (@ 1 GHz).
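Checking the slide's arithmetic: 16 cores × 8 mul-add ALUs = 128 ALUs; counting a mul-add as 2 FLOPs, 128 × 2 FLOPs × 1 GHz = 256 GFLOPs. Likewise, 64 interleaved instruction streams × 8 fragments per stream = 512 concurrent fragments.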

Summary: three key ideas
1. Use many slimmed-down cores to run in parallel.
2. Pack cores full of ALUs (by sharing an instruction stream across groups of fragments). Option 1: explicit SIMD vector instructions. Option 2: implicit sharing managed by hardware.
3. Avoid latency stalls by interleaving execution of many groups of fragments: when one group stalls, work on another group.

Part 2: Putting the three ideas into practice: a closer look at real GPUs (NVIDIA GeForce GTX 580, AMD Radeon HD 6970).

Disclaimer: the following slides describe a reasonable way to think about the architecture of commercial GPUs. Many factors play a role in actual chip performance.

NVIDIA GeForce GTX 580 (Fermi). NVIDIA-speak: 512 stream processors ("CUDA cores"), SIMT execution. Generic speak: 16 cores, 2 groups of 16 SIMD functional units per core.

NVIDIA GeForce GTX 580 core. Each SIMD function unit does 1 MUL-ADD per clock; control is shared across 16 units. Groups of 32 [fragments / vertices / CUDA threads] share an instruction stream. Up to 48 groups are simultaneously interleaved, so up to 1536 individual contexts can be stored in the execution contexts (128 KB). Shared memory: 16+48 KB. Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G.

NVIDIA GeForce GTX 580 core (continued). The core contains 32 functional units (two groups of 16, each with its own Fetch/Decode). Two groups are selected each clock (fetch, decode, and execute two instruction streams in parallel). Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G.

NVIDIA GeForce GTX 580 SM (the same core in NVIDIA terminology). Each functional unit is a "CUDA core" (1 MUL-ADD per clock); the SM contains 32 CUDA cores. Two warps are selected each clock (fetch, decode, and execute two warps in parallel). Up to 48 warps are interleaved, totaling 1536 CUDA threads. Execution contexts: 128 KB; shared memory: 16+48 KB. Source: Fermi Compute Architecture Whitepaper; CUDA Programming Guide 3.1, Appendix G.

NVIDIA GeForce GTX 580: there are 16 of these SMs on the chip. That's roughly 24,500 fragments, i.e. 24,500 CUDA threads, in flight!
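Checking the figure: 16 SMs × 48 interleaved warps per SM × 32 threads per warp = 24,576 resident CUDA threads, which the slide rounds to roughly 24,500.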

AMD Radeon HD 6970 (Cayman). AMD-speak: 1536 stream processors. Generic speak: 24 cores, 16 beefy SIMD functional units per core, 4 multiply-adds per functional unit (VLIW processing); 24 × 16 × 4 = 1536.

ATI Radeon HD 6970 core. Each SIMD function unit does up to 4 MUL-ADDs per clock; control is shared across 16 units. Groups of 64 [fragments/vertices/etc.] share an instruction stream, and it takes four clocks to execute an instruction for all fragments in a group. Execution contexts: 256 KB; shared memory: 32 KB. Source: ATI Radeon HD 5000 Series: An Inside View (HPG 2010).

ATI Radeon HD 6970 SIMD engine (the same core in AMD terminology). Each functional unit is a "stream processor" (up to 4 MUL-ADDs per clock); control is shared across 16 units. Groups of 64 [fragments/vertices/etc.] form a wavefront, and it takes four clocks to execute an instruction for an entire wavefront. Execution contexts: 256 KB; shared memory: 32 KB. Source: ATI Radeon HD 5000 Series: An Inside View (HPG 2010).

ATI Radeon HD 6970: there are 24 of these cores on the 6970. That's about 32,000 fragments! (There is a global limit of 496 wavefronts, i.e. 496 × 64 ≈ 31,700 work items.)

The talk thus far: processing data. Part 3: moving data to processors.

Recall: CPU-style core. OOO exec logic, branch predictor, Fetch/Decode, ALU, data cache (a big one), execution context.

CPU-style memory hierarchy. Per core: L1 cache (32 KB) and L2 cache (256 KB); an L3 cache (8 MB) shared across cores; 25 GB/sec to memory. CPU cores run efficiently when data is resident in cache (caches reduce latency and provide high bandwidth).

Throughput core (GPU-style): Fetch/Decode, ALUs 1-8, execution contexts (128 KB), connected to memory at 150 GB/sec. More ALUs and no large traditional cache hierarchy: it needs a high-bandwidth connection to memory.

Bandwidth is a critical resource. A high-end GPU (e.g. Radeon HD 6970) has over twenty times (2.7 TFLOPS) the compute performance of a quad-core CPU, but no large cache hierarchy to absorb memory requests. The GPU memory system is designed for throughput: a wide bus (150 GB/sec), with memory requests repacked/reordered/interleaved to maximize use of the memory bus. Still, this is only about six times the bandwidth available to the CPU.

Bandwidth thought experiment. Task: element-wise multiply-add over long vectors (D = A × B + C):
1. Load input A[i]
2. Load input B[i]
3. Load input C[i]
4. Compute A[i] × B[i] + C[i]
5. Store the result into D[i]
That is four memory operations (16 bytes) for every MUL-ADD. The Radeon HD 6970 can do 1536 MUL-ADDs per clock, so we would need ~20 TB/sec of bandwidth to keep the functional units busy. The result: less than 1% efficiency, but still 6x faster than the CPU!
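Written as a CUDA kernel (a hedged sketch of my own, not part of the original slides), the thought experiment looks like the code below: each thread performs a single MUL-ADD but moves 16 bytes, so the kernel is limited by memory bandwidth rather than ALU throughput.

    // Bandwidth-bound multiply-add: D[i] = A[i] * B[i] + C[i].
    // Four 4-byte memory operations per MUL-ADD, exactly as in the slide.
    __global__ void mulAdd(const float *A, const float *B, const float *C,
                           float *D, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            D[i] = A[i] * B[i] + C[i];   // 16 bytes of traffic for 1 mul-add
    }

A launch such as mulAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, dD, n) processes one element per thread (dA..dD being hypothetical device pointers). Assuming the 6970's roughly 880 MHz clock, 1536 MUL-ADDs × 16 bytes × 0.88 GHz ≈ 21.6 TB/sec of demand, which is where the slide's ~20 TB/sec figure comes from.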

Bandwidth limited! If processors request data at too high a rate, the memory system cannot keep up. No amount of latency hiding helps this. Overcoming bandwidth limits is a common challenge for GPU-compute application developers.

Reducing bandwidth requirements. Request data less often (instead, do more math): increase "arithmetic intensity". Fetch data from memory less often (share/reuse data across fragments): on-chip communication or storage.

Reducing bandwidth requirements. Two examples of on-chip storage: texture caches and OpenCL local memory (CUDA shared memory). Texture caches capture reuse of texture data across neighboring fragments, not temporal reuse within a single shader program.
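As a concrete illustration of the second kind of on-chip storage (an assumed example, not from the slides), the CUDA sketch below stages a tile of the input in shared memory so each element is fetched from DRAM once per block and then reused by up to three threads of a 3-point average:

    #define TILE 256   // must match the block size at launch

    __global__ void avg3(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE + 2];           // tile plus one halo element per side
        int gid = blockIdx.x * TILE + threadIdx.x;

        // Stage this block's slice into shared memory (zero-padded past the array end).
        tile[threadIdx.x + 1] = (gid < n) ? in[gid] : 0.0f;
        if (threadIdx.x == 0)
            tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
        if (threadIdx.x == TILE - 1)
            tile[TILE + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
        __syncthreads();

        // Each element fetched from DRAM once is reused by up to three threads.
        if (gid < n)
            out[gid] = (tile[threadIdx.x] + tile[threadIdx.x + 1] +
                        tile[threadIdx.x + 2]) / 3.0f;
    }

Launched as avg3<<<(n + TILE - 1) / TILE, TILE>>>(dIn, dOut, n), the kernel reads each input value roughly once from memory instead of three times.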

Modern GPU memory hierarchy. Per core: execution contexts (128 KB), shared local storage or L1 cache (64 KB), and a read-only texture cache; a chip-wide L2 cache (~1 MB) sits in front of memory. On-chip storage takes load off the memory system. Many developers are calling for more cache-like storage (particularly for GPU-compute applications).

Don't forget about offload cost. PCIe bandwidth/latency: about 8 GB/s each direction in practice; attempt to pipeline/multi-buffer uploads and downloads. Dispatch latency: O(10) usec to dispatch from CPU to GPU. This means the offload cost is O(10M) instructions.
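One common way to "pipeline/multi-buffer uploads and downloads" in CUDA is double-buffering with streams (a hedged sketch; function and variable names are my own, and hIn/hOut are assumed to be pinned with cudaMallocHost so the asynchronous copies can actually overlap):

    #include <cuda_runtime.h>

    // Stand-in for whatever work is actually offloaded to the GPU.
    __global__ void scaleKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    // Alternate between two streams/device buffers so the PCIe copy of one
    // chunk overlaps the kernel processing the previous chunk.
    void processChunks(const float *hIn, float *hOut, float *dBuf[2],
                       int nChunks, int chunkElems)
    {
        cudaStream_t stream[2];
        for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

        size_t bytes = (size_t)chunkElems * sizeof(float);
        for (int k = 0; k < nChunks; ++k) {
            int s = k % 2;
            cudaMemcpyAsync(dBuf[s], hIn + (size_t)k * chunkElems, bytes,
                            cudaMemcpyHostToDevice, stream[s]);
            scaleKernel<<<(chunkElems + 255) / 256, 256, 0, stream[s]>>>(dBuf[s], chunkElems);
            cudaMemcpyAsync(hOut + (size_t)k * chunkElems, dBuf[s], bytes,
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        for (int s = 0; s < 2; ++s) {
            cudaStreamSynchronize(stream[s]);
            cudaStreamDestroy(stream[s]);
        }
    }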

Heterogeneous cores to the rescue? Tighter integration of CPU- and GPU-style cores: reduce offload cost, reduce memory copies/transfers, power management. The industry is shifting rapidly in this direction: AMD Fusion APUs, Intel Sandy Bridge, NVIDIA Tegra 2, Apple A4 and A5, Qualcomm Snapdragon, TI OMAP.

AMD A-Series APU ("Llano").

Others, where the GPU is not compute-ready yet: Intel Sandy Bridge, NVIDIA Tegra 2.