AMD GPU Architecture. OpenCL Tutorial, PPAM 2009. Dominik Behr, September 13th, 2009

1 AMD GPU Architecture OpenCL Tutorial, PPAM 2009 Dominik Behr September 13th, 2009

2 Overview
AMD GPU architecture
How OpenCL maps on GPU and CPU
How to optimize for AMD GPUs and CPUs in OpenCL

3 The Chip (ATI Radeon HD 4870)
[Figure: chip block diagram. Blocks: Primitive Assembly/Rasterizer, Vertex Grouper/Tessellator, Shader Parameter Interpolator, Shader Export, Shader Core, Tex L2, VC, Memory Controller, Render Backend.]

4 Vertex path
[Figure: the chip block diagram from slide 3, marking the vertex path.]

5 Pixel path
[Figure: the chip block diagram from slide 3, marking the pixel path.]

6 Compute path
[Figure: the chip block diagram from slide 3, marking the compute path.]

7 Shader Core (ATI Radeon HD 4870)
[Figure: shader core block diagram. Blocks: Dispatch Processor, Vertex Cache/GDS, SIMD (*10), Export (to Shader Export).]

8 SIMD (ATI Radeon HD 4870)
One VLIW instruction across 64 threads in 4 cycles
[Figure: SIMD block diagram. Labels: Texture Cache L1, TexAddr, From TCL2, TexData, Local Data Share 16KB.]
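The issue rate on this slide implies the width of one SIMD. The sketch below works that out, assuming each thread processor handles one thread per cycle. The function names are illustrative and not part of any AMD API. The 10 SIMDs and 5 VLIW slots used in the usage note come from the Shader Core and ALU slides.

```c
/* Thread processors needed so that one instruction covers a whole
   wavefront in the given number of cycles (illustrative helper). */
unsigned thread_processors_per_simd(unsigned wavefront_size,
                                    unsigned issue_cycles)
{
    return wavefront_size / issue_cycles;
}

/* Total ALU lanes on the chip: SIMDs * thread processors * VLIW slots. */
unsigned total_alu_lanes(unsigned simds, unsigned tps_per_simd,
                         unsigned vliw_slots)
{
    return simds * tps_per_simd * vliw_slots;
}
```

With the slide values, thread_processors_per_simd(64, 4) gives 16, and total_alu_lanes(10, 16, 5) gives 800.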

9 Thread Processor
Cycle  Read         Write
0      SrcA         Dst Vec
1      SrcB         Dst Scalar
2      SrcC/Exp     Interp
3      TexAddr/Exp  TexData
[Figure: thread processor datapath. Labels: Register file (X Y Z W), Read MUX, Thread Processor (X Y Z W T).]

10 Register file (one SIMD)
[Figure: register file of one SIMD, 64 XYZW vectors wide and 256 float4 registers high, indexed 0 to 255.]
float4 * 64 vectors * 256 registers * 10 SIMD = 2.5MB of registers on ATI Radeon HD 4870
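The register file arithmetic on this slide can be checked with a few lines of C. This is a minimal sketch; the function name is made up for illustration, and a float4 is taken to be 16 bytes (four 32-bit floats).

```c
#include <stddef.h>

/* Total register storage in bytes: bytes per register * lanes per SIMD
   * registers per lane * number of SIMDs. */
size_t register_file_bytes(size_t bytes_per_reg, size_t lanes,
                           size_t regs_per_lane, size_t simds)
{
    return bytes_per_reg * lanes * regs_per_lane * simds;
}
```

register_file_bytes(16, 64, 256, 10) returns 2621440, which is 2.5 * 1024 * 1024 bytes, matching the 2.5MB figure.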

11 Register file (one SIMD)
[Figure: register file partitioning in two modes. Labels: Pixel shader mode in CAL; OpenCL and Compute mode in CAL; Clause Temps; Shared (Global) Regs; Pixel Shader Regs; Vertex Shader Regs; Compute Shader Regs.]

12 Instruction Set
3 instruction classes:
Control flow
ALU
Texture/Vertex fetch
References:
amd.com/gpu_assets/r700-Family_Instruction_Set_Architecture.pdf
http://developer.amd.com/gpu_assets/intermediate_language_specification--Stream_Processor.pdf

13 Instruction Set: Control Flow
Instruction classes:
Start ALU clause and lock constant cache lines
Start Fetch clause (Texture, Vertex, Vertex_TC, global mem read)
Control flow (PUSH, POP, ELSE, JUMP, LOOP, BREAK, CONTINUE)
Export (position, vertex attributes, pixels, z, scratch memory, compute shader global mem export)

14 Instruction Set: ALU
One 5-way (XYZWT) VLIW instruction (bundle) per clock
Instruction classes:
Floating point (ADD, MUL, MULADD etc.)
Double FP (XY, ZW slots) (mul_64, muladd_64)
Integer (add_int, lshr_int, or_int etc.)
Transcendental (only in T slot) (sin, recip, sqrt, flt_to_int)
Predicates, kill

15 Instruction Set: Fetch
Instruction classes:
sample (texture load using fp coords and filtering)
ld (texture load using int coords)
get/set gradients
vfetch/semfetch (vertex fetches)
read mem, scratch, reduction buffer (uncached)
read/write global and local data share

16 Instruction Set: Simple Compute Shader
CF Start: exec_alu, exec_vtx_tc, exec_alu, export_mem, end
ALU calc_addr: add_int, mov, mullo_int
VTX fetch_data: read_mem, read_mem
ALU math: add, mul, muladd, sin

17 OpenCL view
OpenCL (abstraction)  GPU (reality)
compute device        GPU (ATI Radeon HD 4870)
compute unit          SIMD
processing element    Thread Processor
[Figure: OpenCL platform model. A host is connected to compute devices; each compute device contains compute units, and each compute unit contains processing elements.]

18 How OpenCL maps onto GPU
OpenCL (abstract)  ATI Radeon HD 4870 (reality)
work-item          CS instance, pixel, vertex
work-group         group of wavefronts (waves)
local memory       Local Data Share *
private memory     scratch memory (x[] in CAL/IL)
global memory      global buffer (g[] in CAL/IL)
scalar variable    single or double component of a 128-bit register
vector variable    one or more 128-bit registers
image              texture ** (not yet)
* Local Data Share supports only owner writes in ATI Radeon HD
** Image capabilities not currently implemented in ATI Stream 2.0 beta.
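Since a work-group maps to a group of wavefronts, the number of wavefronts it occupies follows from its size. A minimal sketch, assuming the 64-thread wavefront from the SIMD slide and assuming a partially filled last wavefront still counts as a whole one; the function name is illustrative.

```c
/* Wavefronts needed to hold one work-group, rounding up. */
unsigned wavefronts_per_work_group(unsigned work_group_size,
                                   unsigned wavefront_size)
{
    return (work_group_size + wavefront_size - 1) / wavefront_size;
}
```

For example, a work-group of 256 work-items occupies 4 wavefronts, while a work-group of 100 occupies 2, the second only partly used.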

19 How OpenCL maps on CPU
OpenCL (abstract)   CPU (reality)
compute device      PC system
compute unit        processor core
processing element  processor core
work-item           software thread
work-group          hardware thread
local memory        CPU data L1 (per work-group allocated RAM)
global memory       system RAM
private memory      per work-item allocated system RAM
scalar variable     CPU register or stack space
vector variable     SSE register or stack space
image               (not yet) array in system RAM

20 Optimizations: Vectors
We prefer float4 and int4 vectors. They map very well to 5-way VLIW and 128-bit registers on GPU and to SSE on CPUs.
SSE 128-bit loads/stores.
GPU 128-bit loads/stores.
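To illustrate the float4 preference, here is a host-side C sketch of a kernel body that processes one 128-bit element per work-item. The float4 struct stands in for OpenCL C's built-in float4 type, and saxpy4 is a hypothetical name. In an actual kernel the loop would disappear and the index would come from get_global_id(0).

```c
/* Host-side stand-in for OpenCL C's built-in float4 (128 bits). */
typedef struct { float x, y, z, w; } float4;

/* y = a * x + y over n float4 elements. Each iteration corresponds to
   one work-item doing four independent multiply-adds, which gives the
   compiler four operations to pack into one VLIW bundle. */
void saxpy4(float a, const float4 *x, float4 *y, unsigned n)
{
    for (unsigned i = 0; i < n; ++i) {
        y[i].x = a * x[i].x + y[i].x;
        y[i].y = a * x[i].y + y[i].y;
        y[i].z = a * x[i].z + y[i].z;
        y[i].w = a * x[i].w + y[i].w;
    }
}
```

Each element is read and written as a single 16-byte unit, which is the access size the slide recommends for both SSE and the GPU.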

21 Optimizations: Control Flow
Divergent control flow within a wavefront is bad. The GPU uses predication or thread masking, so parts of the machine go unused.
A kernel with many control flow statements and little math or few fetches can become control flow bound.
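The cost of divergence can be seen in a simplified model: a wavefront executes every side of a branch that at least one of its threads takes, with the other threads masked off. The sketch below is only that model, with made-up cost units; it ignores the overhead of the control flow instructions themselves, and wave_branch_cost is a hypothetical helper.

```c
#include <stdbool.h>

/* Cost of an if/else for one wavefront under thread masking: each side
   is paid in full if any thread in the wave takes it. */
unsigned wave_branch_cost(const bool *takes_then, unsigned wave_size,
                          unsigned then_cost, unsigned else_cost)
{
    bool any_then = false;
    bool any_else = false;
    for (unsigned i = 0; i < wave_size; ++i) {
        if (takes_then[i])
            any_then = true;
        else
            any_else = true;
    }
    return (any_then ? then_cost : 0) + (any_else ? else_cost : 0);
}
```

A wave whose threads all agree pays for one side only. A single disagreeing thread makes the whole wave pay for both.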

22 Optimizations: Memory Access
Coalescing! Use 128-bit accesses. They are the most efficient and fully utilize the buses.
Reads: Make each wavefront read whole rows from memory. The HW fetches whole cache lines, so have other threads in the wave reuse the data.
Writes: Write whole rows. Be careful not to overload a single backend or memory channel.
Writes flow from Shader Core to Shader Export (coalescing) -> 4 Rendering Backends -> 8 memory channels (two per backend, bits 8-10 address the channel)
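The channel selection mentioned on this slide can be sketched directly from the stated bit range. This assumes the channel index is exactly address bits 8-10 and nothing else, which is all the slide says; the real hardware mapping may involve more than that. Both function names are illustrative.

```c
#include <stdint.h>

/* Channel index (0..7) taken from address bits 8-10. */
unsigned memory_channel(uint64_t byte_address)
{
    return (unsigned)((byte_address >> 8) & 0x7u);
}

/* Number of distinct channels hit by `count` accesses that start at
   `base` and advance by `stride_bytes` each time. */
unsigned channels_touched(uint64_t base, uint64_t stride_bytes,
                          unsigned count)
{
    unsigned seen = 0;   /* bit i set when channel i was used */
    unsigned n = 0;
    for (unsigned i = 0; i < count; ++i)
        seen |= 1u << memory_channel(base + (uint64_t)i * stride_bytes);
    for (; seen != 0; seen >>= 1)
        n += seen & 1u;
    return n;
}
```

Under this assumption, accesses 256 bytes apart rotate through all 8 channels, while accesses 2048 bytes apart all land on one channel, which is the overload case the slide warns about.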

23 Optimizations: Registers
Stack variables and private variables map to GPU registers (at the discretion of the compiler).
The number of registers used affects the number of wavefronts that can fit into a SIMD. A work-group must fit on a SIMD.
More waves help cover memory access latencies. Big ALU clauses (more math in the kernel) also help cover memory latencies.
You need a minimum of 3 waves (two in ALU, one in fetch), preferably 5 or more.
Sometimes it is better to recompute than to fetch from memory.
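The link between register use and wavefront count can be estimated as follows. This is a rough sketch under stated assumptions: the only limit considered is the 256 float4 registers per lane shown on the register file slide, while clause temporaries and any other hardware limits are ignored. The function names and the work-group check are illustrative.

```c
/* Registers available to each lane of a SIMD (from the register file slide). */
enum { REGS_PER_LANE = 256, WAVEFRONT_SIZE = 64 };

/* Wavefronts that fit on one SIMD when each work-item uses the given
   number of float4 registers. Returns 0 for an invalid count of 0. */
unsigned waves_per_simd(unsigned regs_per_work_item)
{
    if (regs_per_work_item == 0)
        return 0;
    return REGS_PER_LANE / regs_per_work_item;
}

/* Nonzero if a work-group of the given size fits on a single SIMD. */
int work_group_fits(unsigned work_group_size, unsigned regs_per_work_item)
{
    unsigned needed = (work_group_size + WAVEFRONT_SIZE - 1) / WAVEFRONT_SIZE;
    return needed <= waves_per_simd(regs_per_work_item);
}
```

In this model a kernel using 100 registers per work-item leaves room for only 2 waves, below the minimum of 3 given on the slide, while 50 registers allows 5.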

24 Optimizations: Balance
Fetch:ALU:Export ratio: … bit fetches : … way VLIW ALU instructions : … bit exports
Registers:Waves:Latency compensation
Use registers to store fetched or intermediate data; there are a lot of them.
But make sure there are enough waves to hide latencies, and/or enough math.

25 Software Architecture
[Figure: software stack. Labels: App, OpenCL kernel source, OpenCL runtime, OpenCL compiler, CL kernel, IL kernel, CPU binary, CAL, GPU Shader, CPU, GPU.]

26 Optimizations: API
Building programs and creating kernels is expensive. For interactive applications, do it once and early, while loading the app.
Running with different work-group sizes may also trigger a less expensive but still significant compile.
Moving data from host to (GPU) device and back can also be very expensive. Keep it on the device, chain kernels, and read back only what you really need on the host.
Use Direct3D and OpenGL bindings to share the data with rendering APIs.

27 Stream Kernel Analyzer

28 Trademark Attribution
AMD, the AMD Arrow logo, ATI, the ATI logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc. Microsoft and Direct3D are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. Advanced Micro Devices, Inc. All rights reserved.
