GPU Architecture: An OpenCL Programmer's Introduction. Lee Howes, November 3, 2010



The aim of this webinar
- To provide a general background to modern GPU architectures
- To place the AMD GPU designs in context:
  - With other types of architecture
  - With other GPU designs
- To give an idea of why certain optimizations become necessary on such architectures, and why the architectures are designed in that way

GPU Architecture November 2010 developer.amd.com -> OpenCL Zone

Agenda
- Talk about GPUs as graphics processing devices
  - What they are designed for
  - What this means architecturally
- The implications of SIMD execution on application development
- LDS and latency hiding
- How the GPU fits in the CPU design space
- A description of features of the AMD Radeon HD5870 GPU

What is a GPU?

In a nutshell
The GPU is a multicore processor optimized for graphics workloads.
[Figure: block diagram of eight shader cores and four texture units (Tex), alongside a rasterizer, output blend, video decode and a scheduler]

Processing pixels
[Figure: a pixel on a surface, labeled with the direction of light and the normal at the surface; successive slides extend the picture from one pixel to four and then eight]

    Texture2D<float3> mytex;
    SamplerState mysamp;   // sampler used below; its declaration is not on the slide
    float3 lightdir;

    float4 diffuseshader(float3 norm, float2 uv)
    {
        float3 kd;
        kd = mytex.Sample(mysamp, uv);                // surface colour from the texture
        kd *= clamp(dot(lightdir, norm), 0.0, 1.0);   // scale by the diffuse term
        return float4(kd, 1.0);
    }
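The per-pixel work in diffuseshader can be sketched in plain Python. This is only an illustration: the texture sample is replaced by a colour passed in as `kd`, and the helper names `dot`, `clamp` and `diffuse_shader` are assumptions for this sketch rather than part of any shader API.

```python
def dot(a, b):
    """Dot product of two 3-component vectors."""
    return sum(x * y for x, y in zip(a, b))


def clamp(x, lo, hi):
    """Limit x to the range [lo, hi]."""
    return max(lo, min(hi, x))


def diffuse_shader(kd, light_dir, norm):
    """Scale the surface colour kd by the clamped dot(light_dir, norm).

    kd stands in for the texture sample on the slide (an assumption,
    so that the sketch needs no texture data).
    """
    scale = clamp(dot(light_dir, norm), 0.0, 1.0)
    return tuple(c * scale for c in kd) + (1.0,)


# Every pixel runs the same function on its own inputs.
light = (0.0, 0.0, 1.0)
pixels = [
    ((1.0, 0.5, 0.25), (0.0, 0.0, 1.0)),   # normal facing the light
    ((1.0, 0.5, 0.25), (0.0, 0.0, -1.0)),  # normal facing away
]
shaded = [diffuse_shader(kd, light, norm) for kd, norm in pixels]
```

The point to take away is that the function is identical for every pixel and only the inputs differ, which is what makes the workload a natural fit for SIMD hardware.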

SIMD execution and its implications

SIMD pixel execution
[Figure: four pixels, each paired with its own copy of the diffuseshader code from the Processing pixels slides: the same instructions applied to four data elements]

Branches that diverge
[Figure: four ALUs, each running a copy of the shader below; the sequence of slides steps through the branch]

    Buffer<float> mytex;

    float diffuseshader(float threshold, float index)
    {
        float brightness = mytex[index];
        float output;
        if (brightness > threshold)
            output = threshold;
        else
            output = brightness;
        return output;
    }
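A rough Python model of how a lockstep SIMD unit might handle this if/else. It assumes a simple masking scheme in which each side of the branch is stepped through once if any lane needs it; the function name and the pass-counting are illustrative assumptions, not a description of a specific device.

```python
def simd_clamp_to_threshold(brightness, threshold):
    """Run the if/else from the slide on all lanes in lockstep.

    Returns (outputs, passes), where passes counts how many sides of
    the branch the whole SIMD unit had to step through.
    """
    mask = [b > threshold for b in brightness]  # per-lane condition
    out = [None] * len(brightness)
    passes = 0

    if any(mask):  # "then" side: lanes with a False mask sit idle
        passes += 1
        for lane, active in enumerate(mask):
            if active:
                out[lane] = threshold

    if not all(mask):  # "else" side: lanes with a True mask sit idle
        passes += 1
        for lane, active in enumerate(mask):
            if not active:
                out[lane] = brightness[lane]

    return out, passes


uniform = simd_clamp_to_threshold([0.1, 0.2, 0.3, 0.4], 0.5)
divergent = simd_clamp_to_threshold([0.1, 0.9, 0.3, 0.7], 0.5)
```

In the uniform case all lanes agree and only one side of the branch is executed; in the divergent case both sides are executed with part of the vector idle each time.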

SIMD execution: SIMD instructions?
- Programming SIMD with SIMD instructions
  - Most vector instruction sets: SSE, AVX
  - Intel's Larrabee and Knights Corner
  - Programmed masking
  - OpenCL compiling to SSE
  - Pixel shaders compiling to Larrabee
- Programming SIMD with scalar instructions
  - GPU shader languages
  - Hardware-controlled masking
  - Current-generation GPUs
  - GPU intermediate languages
  - OpenCL

Why does this matter for Compute?
- Graphics code traditionally has relatively short shaders on large triangles
  - The level of branch divergence overall will not be high
- With graphics code you cannot necessarily control it
  - SIMD batches are constructed by the hardware depending on the scene properties
- For OpenCL code you are defining your execution space
  - You choose what work is performed by which work item
  - You choose how to structure your algorithm to avoid this divergence
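To illustrate the last point, the sketch below counts how many 64-element wavefronts contain work items on both sides of a branch under two different assignments of work to work items. The problem size and the two predicates are hypothetical, chosen only to show the effect.

```python
WAVEFRONT_SIZE = 64  # 64-element wavefronts, as on the HD5870


def divergent_wavefronts(takes_branch, n_items):
    """Count wavefronts whose work items do not all take the same path."""
    count = 0
    for start in range(0, n_items, WAVEFRONT_SIZE):
        end = min(start + WAVEFRONT_SIZE, n_items)
        paths = {takes_branch(i) for i in range(start, end)}
        if len(paths) > 1:
            count += 1
    return count


n = 1024  # hypothetical number of work items

# Alternating work items take different paths: every wavefront diverges.
interleaved = divergent_wavefronts(lambda i: i % 2 == 0, n)

# The same split, but grouped so each wavefront is uniform.
grouped = divergent_wavefronts(lambda i: i < n // 2, n)
```

Both assignments do the same total work; the grouped one simply arranges it so that no wavefront has to execute both sides of the branch.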

Throughput execution and latency hiding

Covering pipeline latency
[Figure: timeline for lanes 0-3: Instruction 0, then a stall, then Instruction 1]

Covering pipeline latency: logical vector
[Figure: the vector is widened to lanes 0-3, 4-7, 8-11 and 12-15; Instruction 0 is followed by stall slots before Instruction 1]

Covering pipeline latency: ALU operations
[Figure: timeline in which Instruction 0 issues in turn for lanes 0-3, 4-7, 8-11 and 12-15, followed by Instruction 1 for lanes 0-3, 4-7, 8-11 and 12-15, with no stall slots]
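A small issue-slot model of these timelines, assuming a 4-wide physical vector, one lane group issued per cycle, and a pipeline latency of 4 cycles between dependent instructions. The latency value and the function name are assumptions made for this sketch.

```python
def cycles_for(n_instructions, lane_groups, pipeline_latency):
    """Cycles needed to issue n dependent instructions over lane_groups.

    One lane group issues per cycle. A lane group's next instruction
    cannot issue until pipeline_latency cycles after its previous one.
    """
    ready = [0] * lane_groups  # earliest cycle each lane group may issue
    cycle = 0
    for _ in range(n_instructions):
        for group in range(lane_groups):
            cycle = max(cycle, ready[group])       # stall until result is ready
            ready[group] = cycle + pipeline_latency
            cycle += 1                             # issue slot consumed
    return cycle


LATENCY = 4  # assumed pipeline depth in cycles

# Lanes 0-3 only: the second instruction waits for the first.
narrow_cycles = cycles_for(2, 1, LATENCY)
narrow_utilization = (2 * 1) / narrow_cycles

# Logical vector of lanes 0-15 issued as four groups: no stalls.
wide_cycles = cycles_for(2, 4, LATENCY)
wide_utilization = (2 * 4) / wide_cycles
```

With four lane groups in flight, the time spent issuing the other three groups covers the latency of the first, so every issue slot does useful work.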

Covering memory latency
[Figure: timeline for lanes 0-3: Instruction 0, then a stall, then Instruction 1]

Covering memory latency: we still stall
[Figure: with lanes 0-3, 4-7, 8-11 and 12-15, Instruction 0 is still followed by stall slots before Instruction 1]

Covering memory latency: another wavefront
[Figure: with lanes 0-3, 4-7, 8-11 and 12-15, Instruction 0 of a second wavefront is issued between Instruction 0 and Instruction 1 of the first]
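A back-of-the-envelope model of hiding memory latency with extra wavefronts. It assumes each wavefront alternates a fixed burst of ALU work with one memory access, and that the scheduler can switch wavefronts at no cost; the cycle counts are hypothetical.

```python
import math


def alu_utilization(wavefronts, alu_cycles, mem_latency):
    """Fraction of cycles the ALUs are busy with `wavefronts` in flight."""
    busy = wavefronts * alu_cycles
    period = alu_cycles + mem_latency  # one wavefront's compute + wait
    return min(1.0, busy / period)


def wavefronts_to_hide(alu_cycles, mem_latency):
    """Smallest number of wavefronts that keeps the ALUs fully busy."""
    return math.ceil((alu_cycles + mem_latency) / alu_cycles)


ALU_CYCLES = 4    # hypothetical ALU work between memory accesses
MEM_LATENCY = 12  # hypothetical memory latency in cycles

one = alu_utilization(1, ALU_CYCLES, MEM_LATENCY)
two = alu_utilization(2, ALU_CYCLES, MEM_LATENCY)
needed = wavefronts_to_hide(ALU_CYCLES, MEM_LATENCY)
```

The longer the memory latency relative to the ALU work between accesses, the more wavefronts must be resident to keep the ALUs busy.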

Latency hiding in the SIMD engine
[Figure: a SIMD engine of four ALUs with a group of four pixels; across three slides the picture alternates between one and two groups of four pixels]

A throughput-oriented SIMD engine
[Figure: four ALUs shared by two groups of four pixels; a second slide adds a State block to each group]

Adding the memory hierarchy
- Unlike most CPUs, GPUs do not have vast cache hierarchies
  - Caches on CPUs allow primarily for lower access latency
  - Heavy multithreading reduces the latency requirement
- Latency is not an issue; we cover that with other threads
- Total bandwidth is still an issue, even with high-latency, high-speed memory interfaces

Texture caches and local memory
- Designed to support sharing between work items
- Reduce bandwidth demand, not latency
[Figure: a SIMD engine connected to two memories. Global: high-latency loads, limited bandwidth. Local: efficient random accesses, very high bandwidth]
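As an illustration of sharing between work items, the sketch below computes a 1D sliding-window sum for one work group in two ways and counts the global-memory loads each needs. The stencil, the group size and the function names are assumptions chosen for this example; the Python list slice stands in for a tile copied into local memory.

```python
def stencil_direct(data, start, group_size, radius):
    """Each work item reads its whole neighbourhood from global memory."""
    loads = 0
    out = []
    for i in range(start, start + group_size):
        total = 0
        for j in range(i - radius, i + radius + 1):
            total += data[j]
            loads += 1
        out.append(total)
    return out, loads


def stencil_tiled(data, start, group_size, radius):
    """The group loads a tile once, then work items share it."""
    tile = data[start - radius:start + group_size + radius]
    loads = len(tile)  # one global load per tile element
    width = 2 * radius + 1
    out = [sum(tile[k:k + width]) for k in range(group_size)]
    return out, loads


data = list(range(100))
direct_out, direct_loads = stencil_direct(data, 8, 64, 4)
tiled_out, tiled_loads = stencil_tiled(data, 8, 64, 4)
```

Both versions produce the same results, but the tiled version issues one global load per tile element rather than one per neighbour per work item.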

A throughput-oriented SIMD engine
[Figure: the SIMD engine with Fetch and Decode stages, two groups of four pixels each with their State, four ALUs, Local storage/cache and an Execute stage; the following slide reduces this to a single block labeled Fetch, Decode, Local storage/cache, Execute]

The GPU shader cores

The design space

The AMD Phenom II X6
- 6 cores
- One state set per core
- 4-wide SIMD (actually there are two pipes)

The Intel i7 6-core variants
- 6 cores
- Two state sets per core (SMT/Hyper-Threading)
- 4-wide SIMD

Sun UltraSPARC T2
- 8 cores
- Eight state sets per core
- No SIMD

The AMD HD5870 GPU
- 20 cores
- Up to 24 logical 64-wide SIMD state sets per core (number depends on register requirements)
- 16-wide physical
[Figure: a grid of 20 cores, drawn alongside the Phenom II X6, Intel i7 6-core and UltraSPARC T2 for comparison]
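The comparison can be made concrete by multiplying the three figures given for each design. The sketch below takes "No SIMD" as a width of 1 and uses the HD5870's upper bound of 24 state sets; the dictionary layout and function name are just for illustration.

```python
# (cores, state sets per core, logical SIMD width), taken from the slides
designs = {
    "AMD Phenom II X6": (6, 1, 4),
    "Intel i7 6-core": (6, 2, 4),
    "Sun UltraSPARC T2": (8, 8, 1),   # no SIMD, treated as width 1
    "AMD HD5870 GPU": (20, 24, 64),   # up to 24 state sets per core
}


def elements_in_flight(cores, state_sets, simd_width):
    """Upper bound on data elements with live state on the chip."""
    return cores * state_sets * simd_width


in_flight = {name: elements_in_flight(*spec) for name, spec in designs.items()}
```

On these figures the GPU can hold state for tens of thousands of work items at once, against tens for the CPUs, which is the basis for its latency-hiding approach.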

The AMD Radeon HD5870 GPU

Features of the Radeon HD5870 architecture
ATI Radeon HD 5870: 2.72 Teraflops architecture

Area                    334 mm²
Transistors             2.15 billion
Memory Bandwidth        153 GB/sec
L2-L1 Rd Bandwidth      512 bytes/clk
L1 Bandwidth            1280 bytes/clk
Vector GPR              5.24 MB
LDS Memory              640 KB
LDS Bandwidth           2560 bytes/clk
Concurrent Wavefronts   496
Shader (ALU units)      1600
Idle power              27 W
Max power               188 W
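Several of these figures can be cross-checked against per-SIMD numbers given elsewhere in the talk (20 SIMD engines, 16 processing elements each, 5-way VLIW, 32k of LDS per engine). The 850 MHz engine clock used below does not appear on the slides and is an assumption, as is counting a multiply-add as two floating-point operations.

```python
simd_engines = 20
pes_per_engine = 16
vliw_lanes = 5
alus = simd_engines * pes_per_engine * vliw_lanes  # shader ALU count

clock_hz = 850_000_000        # assumed engine clock; not on the slide
flops_per_alu_per_clock = 2   # multiply-add counted as two operations
peak_tflops = alus * flops_per_alu_per_clock * clock_hz / 1e12

lds_total_k = simd_engines * 32  # 32k of LDS per SIMD engine

# Chip-wide bytes/clk from the table, expressed as bits/clk per SIMD engine
lds_bits_per_clk_per_simd = 2560 // simd_engines * 8
l1_bits_per_clk_per_simd = 1280 // simd_engines * 8
```

Under these assumptions the ALU count, peak throughput and total LDS match the table, and the per-SIMD bandwidths come out at 1024 and 512 bits per clock.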

High level view
[Figure: block diagram of the HD5870. A command processor/group generator feeds two sequencers, each with an instruction cache and ten SIMD engines (0-9 and 10-19), and the GDS. A read cache crossbar connects to the read L2 caches (8k per memory channel); a write crossbar connects to write combine caches and R/W caches for global atomics. An 8-channel memory controller connects to eight GDDR5 devices]

Clause execution
[Figure: the high-level block diagram (command processor/group generator, sequencers, instruction caches, SIMD engines 0-9 and 10-19, GDS, read and write crossbars, read L2 caches, write combine caches, R/W caches for global atomics, 8-channel memory controller, GDDR5), shown with the ISA listing below]

    08 ALU_PUSH_BEFORE: ADDR(62) CNT(2)
         15  x: SETGT_INT R0.x, R2.x, R4.x
         16  x: PREDNE_INT, R0.x, 0.0f
    09 JUMP POP_CNT(1) ADDR(18)
    10 ALU: ADDR(64) CNT(9)
         17  x: SUB_INT R5.x, R3.x, R2.x
             y: LSHL, R3.x, ( ).x
             z: LSHL, R2.x, ( ).x VEC_120
             t: MOV R8.x, 0.0f
         18  x: SUB_INT R6.x, PV17.y, PV17.z
             y: MOV R8.y, 0.0f
             z: MOV R8.z, 0.0f
             w: MOV R8.w, 0.0f
    11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17)
    12 ALU: ADDR(73) CNT(12)
         19  x: LSHL, R5.x, ( ).x
             w: ADD_INT, R6.x, R1.x VEC_120
             t: ADD_INT R7.x, R1.x, ( ).y
    13 TEX: ADDR(368) CNT(2)
         22  VFETCH R0.x, R0.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
         23  VFETCH R1.x, R0.z, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)

The SIMD Engine
- 16 processing elements: physical vector
- Executes two 64-element wavefronts over 4 cycles: logical vector
- Each lane executes a 5-way VLIW instruction
- Lane masking and branching to support divergence
- Hardware barrier support for up to 8 work groups per SIMD engine
[Figure: SIMD engine block diagram: SEQ/branch control; 32k Local Data Share with 32 banks and integer atomic units; 8k read L1 with four Address & Load and four Filter & Format units; Exp/Ld/Store]

The Local Data Share
[Figure: LDS block diagram. The SIMD engine's SEQ and processing elements PE0 to PE15 connect through source collectors and return data staging, with conflict detection and control scheduling, to an input address crossbar, a read data crossbar and a write data crossbar. These serve 32 banks (0-31) with integer atomic units and a pre-op return value path]

LDS features
- High bandwidth: twice external bandwidth (1024 bits/clock compared with 512 bits/clock per SIMD)
- Fully coalesced reads, writes and atomics, with optimization for broadcast reads
- Low latency access
  - 0 latency direct reads
  - 1 VLIW instruction latency for indirect ops (8 real cycles in the pipeline)
- Bank conflicts are hardware detected and serialized
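A simple model of bank-conflict serialization for one set of simultaneous LDS accesses. It assumes consecutive 32-bit words map to consecutive banks across the 32 banks, and that accesses to the same address are served together as a broadcast; the 16-lane access patterns and the function name are assumptions for this sketch.

```python
from collections import defaultdict

NUM_BANKS = 32  # the LDS has 32 banks


def lds_passes(word_addresses):
    """Serialized passes needed for one set of simultaneous accesses.

    Assumes word address modulo NUM_BANKS selects the bank. Accesses
    to the same address count once (broadcast).
    """
    per_bank = defaultdict(set)
    for addr in word_addresses:
        per_bank[addr % NUM_BANKS].add(addr)
    return max(len(addrs) for addrs in per_bank.values())


lanes = range(16)
unit_stride = lds_passes([lane for lane in lanes])      # distinct banks
stride_4 = lds_passes([lane * 4 for lane in lanes])     # two per bank
stride_32 = lds_passes([lane * 32 for lane in lanes])   # all in bank 0
broadcast = lds_passes([5 for _ in lanes])              # same address
```

Under this model a stride that is a multiple of the bank count is the worst case, since every access lands in the same bank and must be serialized.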

The processing element (using OpenCL terms)
[Figure: processing element data path. Input data, fetch return, general purpose registers, LDS op 0, LDS op 1 and constants feed operand preparation; outputs are fetch addr/data, export addr/data and LDS requests]
Operations:
- 4 x 32b FP FMA or MulAdd
- 1 x 64b FMA or Mul
- 2 x 64b FP Add
- 4 x 24b Int Mul or MulAdd
- 4 x 32b Int Add, Logical or Special
- 1 x 32b FP MulAdd
- 1 x 32b Integer
- 1 x 32b Special (log, exp, rcp, sin, ...)

Dependent operations
- Co-issue of dependent operations in a single VLIW packet
  - Full IEEE intermediate rounding & normalization
  - Dot4 (A = A*B + C*D + E*F + G*H)
  - Dual Dot2 (A = A*B + C*D; F = F*H + I*J)
  - Dual dependent multiplies (A = B * C * D; F = G * H * I)
  - Dual dependent adds (A = B + C + D; F = G + H + I)
  - Dependent MulAdd (A = B * C + D + E * F)
- 24-bit integer Mul and MulAdd (4-way co-issue)
  - Heavy use for workgroup address calculation
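The address calculation mentioned in the last point has the form group_id * local_size + local_id, which fits a 24-bit multiply-add when the operands are small enough. The sketch below emulates an unsigned 24-bit multiply and multiply-add in Python. The names follow OpenCL's mul24 and mad24 built-ins, but keeping only the low 24 bits of out-of-range operands is an assumption of this model, since OpenCL leaves that case implementation-defined.

```python
MASK24 = (1 << 24) - 1
MASK32 = (1 << 32) - 1


def mul24(x, y):
    """Unsigned multiply using only the low 24 bits of each operand.

    The low 32 bits of the product are kept.
    """
    return ((x & MASK24) * (y & MASK24)) & MASK32


def mad24(x, y, z):
    """mul24(x, y) + z, wrapped to 32 bits."""
    return (mul24(x, y) + z) & MASK32


def global_index(group_id, local_size, local_id):
    """Global work-item index from work-group and local coordinates."""
    return mad24(group_id, local_size, local_id)


index = global_index(3, 256, 17)
```

The result is only meaningful while group_id and local_size each fit in 24 bits, which is the usual case for work-group indexing.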

The Global Data Share
[Figure: GDS block diagram. Dual arrays of SIMD engines connect over a left bus and a right bus to left and right source collectors (4 work items/clock each) and return data staging, with conflict detection and control scheduling and fast append counter control. An input address crossbar, a read data crossbar and a write data crossbar serve 32 banks (0-31) with integer atomic units and a pre-op return value path]

GDS features
- Low latency access to a global shared memory
  - 25 clocks latency
  - Issued in a separate clause in the same way as texture accesses
- Fully coalesced reads, writes and atomics, as in the LDS
  - 8 work items per clock can request
- Driver allocation and initialization
- Useful for low latency global reductions

Summary
- We've:
  - looked at the basic principles of the GPU architecture in the processor design space
  - seen some of the tradeoffs that lead to GPU features
  - gone over the basic features of the HD 5870 architecture that affect compute applications
- Later talks will go into detail on GPU optimizations on these architecture features

Questions and Answers
- Visit the OpenCL Zone on developer.amd.com: http://developer.amd.com/zones/openclzone/
- The OpenCL Programming Webinars page includes:
  - Schedule of upcoming webinars
  - On-demand versions of this and past webinars
  - Slide decks of this and past webinars
- Upcoming webinars include:
  - OpenCL Programming In Detail
  - Real World Application Example
  - Optimization Techniques And Device Fission Extensions for OpenCL

Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. © 2009 Advanced Micro Devices, Inc. All rights reserved.