GPU Architecture. An OpenCL Programmer's Introduction. Lee Howes, November 3, 2010


1 GPU Architecture: An OpenCL Programmer's Introduction. Lee Howes, November 3, 2010

2 The aim of this webinar: to provide a general background to modern GPU architectures; to place the AMD GPU designs in context with other types of architecture and with other GPU designs; and to give an idea of why certain optimizations become necessary on such architectures and why the architectures are designed in that way.

3 Agenda: GPUs as graphics processing devices, what they are designed for and what this means architecturally; the implications of SIMD execution on application development; LDS and latency hiding; how the GPU fits in the CPU design space; and a description of the features of the AMD Radeon HD5870 GPU.

4 What is a GPU?

5 In a nutshell: the GPU is a multicore processor optimized for graphics workloads (block diagram: shader cores alongside texture units, a rasterizer, output blend, video decode and a scheduler).

6 Processing pixels (diagram: direction of light and normal at the surface for a single pixel).

7 Processing pixels (diagram: direction of light and normal at the surface, now for four pixels).

8 Processing pixels (the same four pixels, each shaded by the diffuse shader below): Texture2D<float3> mytex; float3 lightdir; float4 diffuseshader(float3 norm, float2 uv) { float3 kd = mytex.sample(mysamp, uv); kd *= clamp(dot(lightdir, norm), 0.0, 1.0); return float4(kd, 1.0); }
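As a hedged illustration of how the same per-pixel shading maps onto OpenCL (the buffer, image and sampler names here are illustrative and not part of the original slides), the sketch below runs one work-item per pixel: sample the texture, scale by the clamped dot product of light direction and normal, write the result.

    __constant sampler_t samp = CLK_NORMALIZED_COORDS_TRUE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

    // One work-item shades one pixel.
    __kernel void diffuse(__read_only image2d_t tex,      // hypothetical diffuse texture
                          __global const float4 *normals, // per-pixel surface normals
                          __global const float2 *uvs,     // per-pixel texture coordinates
                          float4 lightDir,                // constant light direction
                          __global float4 *out,
                          int width)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        int i = y * width + x;

        float4 kd = read_imagef(tex, samp, uvs[i]);
        float ndotl = clamp(dot(lightDir.xyz, normals[i].xyz), 0.0f, 1.0f);
        out[i] = (float4)(kd.xyz * ndotl, 1.0f);
    }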

9 Processing pixels (diagram: the shading extended to eight pixels).

10 SIMD execution and its implications

11 SIMD pixel execution: four identical copies of the diffuse shader from slide 8, one per pixel, run across the SIMD lanes.

12 Branches that diverge (four ALU lanes, each running the kernel below): Buffer<float> mytex; float diffuseshader(float threshold, float index) { float brightness = mytex[index]; float output; if (brightness > threshold) output = threshold; else output = brightness; return output; }

13-18 Branches that diverge (animation frames repeating the kernel above across the four ALU lanes: when lanes take different sides of the branch, the SIMD unit executes both paths and masks off the lanes that do not match, so divergent branches cost throughput).

19 SIMD execution: SIMD instructions? Programming SIMD with SIMD instructions: most vector instruction sets (SSE, AVX), Intel's Larrabee and Knights Corner; masking is programmed in software, as when OpenCL compiles to SSE or pixel shaders compile to Larrabee. Programming SIMD with scalar instructions: GPU shader languages, GPU intermediate languages and OpenCL on current-generation GPUs, with hardware-controlled masking.
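To make the contrast concrete, here is a hedged sketch (not from the slides; names are illustrative) of the same operation written both ways in OpenCL C: scalar per-work-item code, where the implementation forms the SIMD batches, and explicit float4 code in the spirit of programming with vector instructions.

    // Scalar style: one element per work-item; the hardware/compiler packs
    // work-items into SIMD lanes (wavefronts) and handles masking.
    __kernel void saxpy_scalar(__global const float *x,
                               __global float *y,
                               float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }

    // Explicit vector style: each work-item processes a float4, closer to
    // programming with SSE/AVX-like vector instructions.
    __kernel void saxpy_vec4(__global const float4 *x,
                             __global float4 *y,
                             float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }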

20 Why does this matter for compute? Graphics code traditionally has relatively short shaders running on large triangles, so the overall level of branch divergence is not high, and with graphics code you cannot necessarily control it anyway: SIMD batches are constructed by the hardware depending on the scene. For OpenCL code you define the execution space yourself: you choose what work is performed by which work-item, and you choose how to structure your algorithm to avoid this divergence (see the sketch below).
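As one small, hedged example in OpenCL C (buffer names are illustrative): the divergent branch from slide 12 simply computes the minimum of brightness and threshold, so it can be rewritten without a branch at all, and every lane in the wavefront then executes the same instruction.

    // Branchy version: lanes whose condition differs force the wavefront
    // to execute both sides of the if/else with masking.
    __kernel void clamp_branchy(__global const float *tex,
                                __global float *out,
                                float threshold)
    {
        size_t i = get_global_id(0);
        float brightness = tex[i];
        float result;
        if (brightness > threshold)
            result = threshold;
        else
            result = brightness;
        out[i] = result;
    }

    // Branch-free version: same result, no divergence.
    __kernel void clamp_uniform(__global const float *tex,
                                __global float *out,
                                float threshold)
    {
        size_t i = get_global_id(0);
        out[i] = fmin(tex[i], threshold);
    }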

21 Throughput execution and latency hiding

22 Covering pipeline latency (diagram: lanes 0-3 issue instruction 0, stall, then issue instruction 1).

23 Covering pipeline latency: logical vector (diagram: lane groups 0-3, 4-7, 8-11 and 12-15 issue instruction 0 in turn, stall, then issue instruction 1).

24 Covering pipeline latency: ALU operations (diagram: lane groups 0-3, 4-7, 8-11 and 12-15 issue instruction 0 back to back and then instruction 1, so the logical vector's ALU work fills the pipeline slots).

25 Covering memory latency (diagram: lanes 0-3 issue instruction 0, stall on memory, then issue instruction 1).

26 Covering memory latency: we still stall (diagram: even with lane groups 0-3, 4-7, 8-11 and 12-15 issuing instruction 0 in turn, the memory latency is long enough that the stall remains before instruction 1).

27 Covering memory latency: another wavefront (diagram: while the first wavefront waits on memory, a second wavefront issues its instruction 0 on the same lanes, so the latency is hidden before instruction 1).

28 Latency hiding in the SIMD engine (diagram: four pixels mapped onto the four ALUs).

29 Latency hiding in the SIMD engine (diagram: two groups of four pixels share the four ALUs).

30 Latency hiding in the SIMD engine (diagram: one group of four pixels occupies the four ALUs).

31 A throughput-oriented SIMD engine (diagram: two groups of four pixels multiplexed onto the four ALUs).

32 A throughput-oriented SIMD engine (diagram: each group of four pixels keeps its own state set while sharing the four ALUs).

33 Adding the memory hierarchy. Unlike most CPUs, GPUs do not have vast cache hierarchies: caches on CPUs exist primarily to lower access latency, but heavy multithreading reduces the latency requirement, so latency is not the issue; it is covered by other threads. Total bandwidth is still an issue, even with high-latency, high-speed memory interfaces.
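The practical consequence for an OpenCL programmer is to give the device far more work-items than it has lanes, so that spare wavefronts exist to cover memory latency. A minimal host-side sketch, assuming a queue and kernel have already been created and that the element count is a multiple of 64:

    #include <CL/cl.h>

    // Launch 'kernel' over 'n' elements with a wavefront-friendly work-group size.
    // A sketch only: error handling and non-multiple-of-64 sizes are omitted.
    cl_int launch_1d(cl_command_queue queue, cl_kernel kernel, size_t n)
    {
        size_t global = n;   // many work-items -> many wavefronts to hide latency
        size_t local  = 64;  // a multiple of the 64-wide wavefront
        return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                      &global, &local, 0, NULL, NULL);
    }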

34 Texture caches and local memory are designed to support sharing between work-items: they reduce bandwidth demand, not latency. Global memory: high-latency loads and limited bandwidth. Local memory, attached to the SIMD engine: efficient random accesses and very high bandwidth.
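As a hedged sketch of the kind of sharing this enables (the kernel and buffer names are illustrative, not from the slides): each work-group stages data in __local memory once and then reuses it, so work-items do not each re-read the same global values.

    // Sum one block of floats per work-group using local memory; one partial
    // sum per group. Assumes the work-group size is a power of two.
    __kernel void group_sum(__global const float *in,
                            __global float *partial,   // one element per work-group
                            __local float *scratch)    // sized to the work-group
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        scratch[lid] = in[gid];                 // one global read per work-item
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction entirely in local memory: no further global traffic.
        for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }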

35 A throughput-oriented SIMD engine (diagram: fetch and decode stages feeding the four ALUs, two pixel groups with their state sets, and local storage/cache alongside the execute stage).

36 A throughput-oriented SIMD engine (diagram: the abstracted core: fetch, decode, execute, and local storage/cache).

37 The GPU shader cores

38 The design space

39 The AMD Phenom II X6: 6 cores, one state set per core, 4-wide SIMD (actually there are two pipes).

40 The Intel i7, 6-core variants: 6 cores, two state sets per core (SMT/Hyper-Threading), 4-wide SIMD (shown alongside the Phenom II X6).

41 Sun UltraSPARC T2: 8 cores, eight state sets per core, no SIMD (shown alongside the Phenom II X6 and Intel i7 6-core).

42 The AMD HD5870 GPU: 20 cores, up to 24 logical 64-wide SIMD state sets per core (the number depends on register requirements), 16-wide physical SIMD (shown alongside the Phenom II X6, Intel i7 6-core and UltraSPARC T2).

43 The AMD Radeon HD5870 GPU

44 Features of the Radeon HD5870 architecture (ATI Radeon HD 5870, a teraflop-scale design). Area: 334 mm2; transistors: 2.15 billion; memory bandwidth: 153 GB/sec; L2-L1 read bandwidth: 512 bytes/clk; L1 bandwidth: 1280 bytes/clk; vector GPRs: 5.24 MB; LDS memory: 640 kB; LDS bandwidth: 2560 bytes/clk; concurrent wavefronts: 496; shader (ALU) units: 1600; idle power: 27 W; max power: 188 W.
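For orientation, the peak single-precision rate follows from the 1600 ALU units in this table combined with the HD5870's published 850 MHz engine clock (the clock is not listed on this slide), with each ALU able to retire a multiply-add, that is two floating-point operations, per cycle:

    1600 \;\text{ALUs} \times 2 \;\tfrac{\text{FLOP}}{\text{cycle}} \times 0.85 \;\text{GHz} = 2720 \;\text{GFLOPS} \approx 2.72 \;\text{TFLOPS}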

45 High level view (block diagram): two banks of SIMD engines (0-9 and 10-19), each with an instruction cache and sequencer; a command processor/group generator; the GDS; a read cache with read and write crossbars; write-combine caches and an R/W cache for global atomics on each side; read L2 caches (8k per memory channel); an 8-channel memory controller; and eight GDDR5 devices.

46 Clause execution (the slide 45 block diagram annotated with an ISA clause listing): 08 ALU_PUSH_BEFORE: ADDR(62) CNT(2) 15 x: SETGT_INT R0.x, R2.x, R4.x 16 x: PREDNE_INT, R0.x, 0.0f 09 JUMP POP_CNT(1) ADDR(18) 10 ALU: ADDR(64) CNT(9) 17 x: SUB_INT R5.x, R3.x, R2.x y: LSHL, R3.x, ( ).x z: LSHL, R2.x, ( ).x VEC_120 t: MOV R8.x, 0.0f 18 x: SUB_INT R6.x, PV17.y, PV17.z y: MOV R8.y, 0.0f z: MOV R8.z, 0.0f w: MOV R8.w, 0.0f 11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17) 12 ALU: ADDR(73) CNT(12) 19 x: LSHL, R5.x, ( ).x w: ADD_INT, R6.x, R1.x VEC_120 t: ADD_INT R7.x, R1.x, ( ).y 13 TEX: ADDR(368) CNT(2) 22 VFETCH R0.x, R0.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 23 VFETCH R1.x, R0.z, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)

47-50 Clause execution (animation frames repeating the slide 46 diagram and ISA listing while stepping through the ALU, flow-control and fetch clauses).

51 The SIMD engine: sequencer/branch control; 32 kB Local Data Share, 32 banks with integer atomic units; four filter & format units and four address & load units; 8 kB read L1; an export/load/store path. 16 processing elements form the physical vector; it executes two 64-element wavefronts over 4 cycles as the logical vector; each lane executes a 5-way VLIW instruction; lane masking and branching support divergence; hardware barriers support up to 8 work-groups per SIMD engine.

52-53 The SIMD engine (animation frames repeating the slide 51 diagram and bullet points).

54 The Local Data Share (block diagram): conflict detection and control scheduling; source collectors and return data staging; input address crossbar; read data and write data crossbars; integer atomic units; connections to the sequencer and to processing elements PE0 through PE15; a pre-op return value path.

55 LDS features: high bandwidth, twice external bandwidth (1024b/clock compared with 512b/clock per SIMD); fully coalesced reads, writes and atomics, with optimization for broadcast reads; low latency access: 0-latency direct reads and 1 VLIW instruction of latency for indirect ops (8 real cycles in the pipeline); bank conflicts are detected by hardware and serialized.
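A common consequence for OpenCL code is padding local arrays so that work-items in a wavefront hit different LDS banks. A hedged sketch (the tile size and names are illustrative, not from the slides): a 16x16 tile padded to 17 columns so column-wise reads do not all land in the same bank.

    #define TILE 16

    // Transpose a TILE x TILE block per work-group via local memory.
    // The +1 padding keeps column accesses spread across the LDS banks.
    __kernel void transpose_tiled(__global const float *in,
                                  __global float *out,
                                  int width, int height)
    {
        __local float tile[TILE][TILE + 1];

        int x = get_group_id(0) * TILE + get_local_id(0);
        int y = get_group_id(1) * TILE + get_local_id(1);
        if (x < width && y < height)
            tile[get_local_id(1)][get_local_id(0)] = in[y * width + x];

        barrier(CLK_LOCAL_MEM_FENCE);

        int tx = get_group_id(1) * TILE + get_local_id(0);  // transposed coordinates
        int ty = get_group_id(0) * TILE + get_local_id(1);
        if (tx < height && ty < width)
            out[ty * height + tx] = tile[get_local_id(0)][get_local_id(1)];
    }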

56 The processing element (using OpenCL terms). Diagram blocks: input data, fetch return, general purpose registers, LDS op 0 and LDS op 1, constants, operand preparation, fetch addr/data, export addr/data, LDS requests. ALU capabilities: 4 x 32-bit FP FMA or MulAdd; 1 x 64-bit FMA or Mul; 2 x 64-bit FP Add; 4 x 24-bit Int Mul or MulAdd; 4 x 32-bit Int Add, Logical or Special; 1 x 32-bit FP MulAdd; 1 x 32-bit Integer; 1 x 32-bit Special (log, exp, rcp, sin, ...).
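In OpenCL C, the transcendental capability is most directly reached through the built-in math functions; the native_ variants trade precision for speed and are the ones most likely to map onto this special-function hardware. A hedged sketch, not a statement about what the compiler actually generates; the kernel and parameter names are illustrative.

    // Per-element attenuation using transcendentals; native_* variants are
    // lower-precision, faster versions of the standard built-ins.
    __kernel void attenuate(__global const float *dist,
                            __global float *out,
                            float falloff)
    {
        size_t i = get_global_id(0);
        float d = dist[i];
        out[i] = native_exp(-falloff * d) * native_recip(1.0f + d * d);
    }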

57 Dependent operations: co-issue of dependent operations in a single VLIW packet, with full IEEE intermediate rounding and normalization. Dot4 (A = A*B + C*D + E*F + G*H); dual Dot2 (A = A*B + C*D; F = F*H + I*J); dual dependent multiplies (A = B * C * D; F = G * H * I); dual dependent adds (A = B + C + D; F = G + H + I); dependent MulAdd (A = B * C + D + E * F); 24-bit integer Mul and MulAdd (4-way co-issue), heavily used for work-group address calculation.
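A hedged OpenCL C illustration of the kind of dependent arithmetic this enables (whether the compiler actually packs it this way is up to the code generator; the names are illustrative):

    // A dot product and a dependent multiply-add per element; on this VLIW
    // architecture such dependent operations can be co-issued in one packet.
    __kernel void weighted_dot(__global const float4 *a,
                               __global const float4 *b,
                               __global float *out,
                               float scale, float bias)
    {
        size_t i = get_global_id(0);
        float d = dot(a[i], b[i]);      // Dot4-style operation
        out[i] = mad(d, scale, bias);   // dependent multiply-add
    }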

58 The Global Data Share (block diagram): shared by the dual arrays of SIMD engines; conflict detection and control scheduling; fast append counter control; left and right source collectors (4 work-items/clock each) and return data staging; input address crossbar; read data and write data crossbars; integer atomic units; sequencer connection, left and right buses, and a pre-op return value path.

59 GDS features: low latency access to a global shared memory (25 clocks latency); accesses are issued in a separate clause, in the same way as texture accesses; fully coalesced reads, writes and atomics, as with the LDS; 8 work-items per clock can make requests; allocation and initialization are handled by the driver; useful for low-latency global reductions.
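OpenCL does not expose the GDS directly, but global atomics give a flavour of the low-latency reduction pattern the slide refers to. A hedged sketch using the standard 32-bit integer atomics (core in OpenCL 1.1); the kernel name and counter layout are illustrative.

    // Count how many elements exceed a threshold, accumulating into a single
    // global counter with atomic increments. A real kernel would usually
    // reduce within the work-group first to cut the number of atomics.
    __kernel void count_above(__global const float *in,
                              __global volatile int *counter,
                              float threshold)
    {
        size_t i = get_global_id(0);
        if (in[i] > threshold)
            atomic_inc(counter);
    }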

60 Summary. We've looked at the basic principles of the GPU architecture in the processor design space, seen some of the tradeoffs that lead to GPU features, and gone over the basic features of the HD 5870 architecture that affect compute applications. Later talks will go into detail on GPU optimizations for these architectural features.

61 Questions and answers. Visit the OpenCL Zone on developer.amd.com; the OpenCL Programming Webinars page includes a schedule of upcoming webinars, on-demand versions of this and past webinars, and slide decks of this and past webinars. Upcoming webinars include: OpenCL Programming In Detail; Real World Application Example; Optimization Techniques; and Device Fission Extensions for OpenCL.

62 Trademark attribution. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. 2010 Advanced Micro Devices, Inc. All rights reserved.
