GPU Architecture. An OpenCL Programmer's Introduction. Lee Howes, November 3, 2010


1 GPU Architecture: An OpenCL Programmer's Introduction. Lee Howes, November 3, 2010

2 The aim of this webinar: to provide a general background to modern GPU architectures; to place the AMD GPU designs in context with other types of architecture and with other GPU designs; and to give an idea of why certain optimizations become necessary on such architectures and why the architectures are designed in that way.

3 Agenda: GPUs as graphics processing devices, what they are designed for and what this means architecturally; the implications of SIMD execution on application development; LDS and latency hiding; how the GPU fits in the CPU design space; and a description of the features of the AMD Radeon HD5870 GPU.

4 What is a GPU?

5 In a nutshell: the GPU is a multicore processor optimized for graphics workloads (block diagram: shader cores alongside texture units, a rasterizer, output blend, video decode and a scheduler).

6 Processing pixels (diagram: direction of light and normal at the surface for a single pixel).

7 Processing pixels (diagram: direction of light and normal at the surface, now for four pixels).

8 Processing pixels (the same four pixels, each shaded by the diffuse shader below): Texture2D<float3> mytex; float3 lightdir; float4 diffuseshader(float3 norm, float2 uv) { float3 kd = mytex.sample(mysamp, uv); kd *= clamp(dot(lightdir, norm), 0.0, 1.0); return float4(kd, 1.0); }
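As a hedged illustration of how the same per-pixel shading maps onto OpenCL (the buffer, image and sampler names here are illustrative and not part of the original slides), the sketch below runs one work-item per pixel: sample the texture, scale by the clamped dot product of light direction and normal, write the result.

    __constant sampler_t samp = CLK_NORMALIZED_COORDS_TRUE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

    // One work-item shades one pixel.
    __kernel void diffuse(__read_only image2d_t tex,      // hypothetical diffuse texture
                          __global const float4 *normals, // per-pixel surface normals
                          __global const float2 *uvs,     // per-pixel texture coordinates
                          float4 lightDir,                // constant light direction
                          __global float4 *out,
                          int width)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        int i = y * width + x;

        float4 kd = read_imagef(tex, samp, uvs[i]);
        float ndotl = clamp(dot(lightDir.xyz, normals[i].xyz), 0.0f, 1.0f);
        out[i] = (float4)(kd.xyz * ndotl, 1.0f);
    }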

9 Processing pixels (diagram: the shading extended to eight pixels).

10 SIMD execution and its implications

11 SIMD pixel execution: four identical copies of the diffuse shader from slide 8, one per pixel, run across the SIMD lanes.

12 Branches that diverge (four ALU lanes, each running the kernel below): Buffer<float> mytex; float diffuseshader(float threshold, float index) { float brightness = mytex[index]; float output; if (brightness > threshold) output = threshold; else output = brightness; return output; }

13-18 Branches that diverge (animation frames repeating the kernel above across the four ALU lanes: when lanes take different sides of the branch, the SIMD unit executes both paths and masks off the lanes that do not match, so divergent branches cost throughput).

19 SIMD execution: SIMD instructions? Programming SIMD with SIMD instructions: most vector instruction sets (SSE, AVX), Intel's Larrabee and Knights Corner; masking is programmed in software, as when OpenCL compiles to SSE or pixel shaders compile to Larrabee. Programming SIMD with scalar instructions: GPU shader languages, GPU intermediate languages and OpenCL on current-generation GPUs, with hardware-controlled masking.
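To make the contrast concrete, here is a hedged sketch (not from the slides; names are illustrative) of the same operation written both ways in OpenCL C: scalar per-work-item code, where the implementation forms the SIMD batches, and explicit float4 code in the spirit of programming with vector instructions.

    // Scalar style: one element per work-item; the hardware/compiler packs
    // work-items into SIMD lanes (wavefronts) and handles masking.
    __kernel void saxpy_scalar(__global const float *x,
                               __global float *y,
                               float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }

    // Explicit vector style: each work-item processes a float4, closer to
    // programming with SSE/AVX-like vector instructions.
    __kernel void saxpy_vec4(__global const float4 *x,
                             __global float4 *y,
                             float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }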

20 Why does this matter for compute? Graphics code traditionally has relatively short shaders running on large triangles, so the overall level of branch divergence is not high, and with graphics code you cannot necessarily control it anyway: SIMD batches are constructed by the hardware depending on the scene. For OpenCL code you define the execution space yourself: you choose what work is performed by which work-item, and you choose how to structure your algorithm to avoid this divergence (see the sketch below).
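As one small, hedged example in OpenCL C (buffer names are illustrative): the divergent branch from slide 12 simply computes the minimum of brightness and threshold, so it can be rewritten without a branch at all, and every lane in the wavefront then executes the same instruction.

    // Branchy version: lanes whose condition differs force the wavefront
    // to execute both sides of the if/else with masking.
    __kernel void clamp_branchy(__global const float *tex,
                                __global float *out,
                                float threshold)
    {
        size_t i = get_global_id(0);
        float brightness = tex[i];
        float result;
        if (brightness > threshold)
            result = threshold;
        else
            result = brightness;
        out[i] = result;
    }

    // Branch-free version: same result, no divergence.
    __kernel void clamp_uniform(__global const float *tex,
                                __global float *out,
                                float threshold)
    {
        size_t i = get_global_id(0);
        out[i] = fmin(tex[i], threshold);
    }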

21 Throughput execution and latency hiding

22 Covering pipeline latency (diagram: lanes 0-3 issue instruction 0, stall, then issue instruction 1).

23 Covering pipeline latency: logical vector (diagram: lane groups 0-3, 4-7, 8-11 and 12-15 issue instruction 0 in turn, stall, then issue instruction 1).

24 Covering pipeline latency: ALU operations (diagram: lane groups 0-3, 4-7, 8-11 and 12-15 issue instruction 0 back to back and then instruction 1, so the logical vector's ALU work fills the pipeline slots).

25 Covering memory latency (diagram: lanes 0-3 issue instruction 0, stall on memory, then issue instruction 1).

26 Covering memory latency: we still stall (diagram: even with lane groups 0-3, 4-7, 8-11 and 12-15 issuing instruction 0 in turn, the memory latency is long enough that the stall remains before instruction 1).

27 Covering memory latency: another wavefront (diagram: while the first wavefront waits on memory, a second wavefront issues its instruction 0 on the same lanes, so the latency is hidden before instruction 1).

28 Latency hiding in the SIMD engine (diagram: four pixels mapped onto the four ALUs).

29 Latency hiding in the SIMD engine (diagram: two groups of four pixels share the four ALUs).

30 Latency hiding in the SIMD engine (diagram: one group of four pixels occupies the four ALUs).

31 A throughput-oriented SIMD engine (diagram: two groups of four pixels multiplexed onto the four ALUs).

32 A throughput-oriented SIMD engine (diagram: each group of four pixels keeps its own state set while sharing the four ALUs).

33 Adding the memory hierarchy. Unlike most CPUs, GPUs do not have vast cache hierarchies: caches on CPUs exist primarily to lower access latency, but heavy multithreading reduces the latency requirement, so latency is not the issue; it is covered by other threads. Total bandwidth is still an issue, even with high-latency, high-speed memory interfaces.
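The practical consequence for an OpenCL programmer is to give the device far more work-items than it has lanes, so that spare wavefronts exist to cover memory latency. A minimal host-side sketch, assuming a queue and kernel have already been created and that the element count is a multiple of 64:

    #include <CL/cl.h>

    // Launch 'kernel' over 'n' elements with a wavefront-friendly work-group size.
    // A sketch only: error handling and non-multiple-of-64 sizes are omitted.
    cl_int launch_1d(cl_command_queue queue, cl_kernel kernel, size_t n)
    {
        size_t global = n;   // many work-items -> many wavefronts to hide latency
        size_t local  = 64;  // a multiple of the 64-wide wavefront
        return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                      &global, &local, 0, NULL, NULL);
    }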

34 Texture caches and local memory are designed to support sharing between work-items: they reduce bandwidth demand, not latency. Global memory: high-latency loads and limited bandwidth. Local memory, attached to the SIMD engine: efficient random accesses and very high bandwidth.
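As a hedged sketch of the kind of sharing this enables (the kernel and buffer names are illustrative, not from the slides): each work-group stages data in __local memory once and then reuses it, so work-items do not each re-read the same global values.

    // Sum one block of floats per work-group using local memory; one partial
    // sum per group. Assumes the work-group size is a power of two.
    __kernel void group_sum(__global const float *in,
                            __global float *partial,   // one element per work-group
                            __local float *scratch)    // sized to the work-group
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        scratch[lid] = in[gid];                 // one global read per work-item
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction entirely in local memory: no further global traffic.
        for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }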

35 A throughput-oriented SIMD engine (diagram: fetch and decode stages feeding the four ALUs, two pixel groups with their state sets, and local storage/cache alongside the execute stage).

36 A throughput-oriented SIMD engine (diagram: the abstracted core: fetch, decode, execute, and local storage/cache).

37 The GPU shader cores

38 The design space

39 The AMD Phenom II X6: 6 cores, one state set per core, 4-wide SIMD (actually there are two pipes).

40 The Intel i7, 6-core variants: 6 cores, two state sets per core (SMT/Hyper-Threading), 4-wide SIMD (shown alongside the Phenom II X6).

41 Sun UltraSPARC T2: 8 cores, eight state sets per core, no SIMD (shown alongside the Phenom II X6 and Intel i7 6-core).

42 The AMD HD5870 GPU: 20 cores, up to 24 logical 64-wide SIMD state sets per core (the number depends on register requirements), 16-wide physical SIMD (shown alongside the Phenom II X6, Intel i7 6-core and UltraSPARC T2).

43 The AMD Radeon HD5870 GPU

44 Features of the Radeon HD5870 architecture (ATI Radeon HD 5870, a teraflop-scale design). Area: 334 mm2; transistors: 2.15 billion; memory bandwidth: 153 GB/sec; L2-L1 read bandwidth: 512 bytes/clk; L1 bandwidth: 1280 bytes/clk; vector GPRs: 5.24 MB; LDS memory: 640 kB; LDS bandwidth: 2560 bytes/clk; concurrent wavefronts: 496; shader (ALU) units: 1600; idle power: 27 W; max power: 188 W.
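For orientation, the peak single-precision rate follows from the 1600 ALU units in this table combined with the HD5870's published 850 MHz engine clock (the clock is not listed on this slide), with each ALU able to retire a multiply-add, that is two floating-point operations, per cycle:

    1600 \;\text{ALUs} \times 2 \;\tfrac{\text{FLOP}}{\text{cycle}} \times 0.85 \;\text{GHz} = 2720 \;\text{GFLOPS} \approx 2.72 \;\text{TFLOPS}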

45 High level view (block diagram): two banks of SIMD engines (0-9 and 10-19), each with an instruction cache and sequencer; a command processor/group generator; the GDS; a read cache with read and write crossbars; write-combine caches and an R/W cache for global atomics on each side; read L2 caches (8k per memory channel); an 8-channel memory controller; and eight GDDR5 devices.

46 Clause execution (the slide 45 block diagram annotated with an ISA clause listing): 08 ALU_PUSH_BEFORE: ADDR(62) CNT(2) 15 x: SETGT_INT R0.x, R2.x, R4.x 16 x: PREDNE_INT, R0.x, 0.0f 09 JUMP POP_CNT(1) ADDR(18) 10 ALU: ADDR(64) CNT(9) 17 x: SUB_INT R5.x, R3.x, R2.x y: LSHL, R3.x, ( ).x z: LSHL, R2.x, ( ).x VEC_120 t: MOV R8.x, 0.0f 18 x: SUB_INT R6.x, PV17.y, PV17.z y: MOV R8.y, 0.0f z: MOV R8.z, 0.0f w: MOV R8.w, 0.0f 11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17) 12 ALU: ADDR(73) CNT(12) 19 x: LSHL, R5.x, ( ).x w: ADD_INT, R6.x, R1.x VEC_120 t: ADD_INT R7.x, R1.x, ( ).y 13 TEX: ADDR(368) CNT(2) 22 VFETCH R0.x, R0.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 23 VFETCH R1.x, R0.z, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)

47-50 Clause execution (animation frames repeating the slide 46 diagram and ISA listing while stepping through the ALU, flow-control and fetch clauses).

51 The SIMD engine: sequencer/branch control; 32 kB Local Data Share, 32 banks with integer atomic units; four filter & format units and four address & load units; 8 kB read L1; an export/load/store path. 16 processing elements form the physical vector; it executes two 64-element wavefronts over 4 cycles as the logical vector; each lane executes a 5-way VLIW instruction; lane masking and branching support divergence; hardware barriers support up to 8 work-groups per SIMD engine.

52-53 The SIMD engine (animation frames repeating the slide 51 diagram and bullet points).

54 The Local Data Share (block diagram): conflict detection and control scheduling; source collectors and return data staging; input address crossbar; read data and write data crossbars; integer atomic units; connections to the sequencer and to processing elements PE0 through PE15; a pre-op return value path.

55 LDS features: high bandwidth, twice external bandwidth (1024b/clock compared with 512b/clock per SIMD); fully coalesced reads, writes and atomics, with optimization for broadcast reads; low latency access: 0-latency direct reads and 1 VLIW instruction of latency for indirect ops (8 real cycles in the pipeline); bank conflicts are detected by hardware and serialized.
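A common consequence for OpenCL code is padding local arrays so that work-items in a wavefront hit different LDS banks. A hedged sketch (the tile size and names are illustrative, not from the slides): a 16x16 tile padded to 17 columns so column-wise reads do not all land in the same bank.

    #define TILE 16

    // Transpose a TILE x TILE block per work-group via local memory.
    // The +1 padding keeps column accesses spread across the LDS banks.
    __kernel void transpose_tiled(__global const float *in,
                                  __global float *out,
                                  int width, int height)
    {
        __local float tile[TILE][TILE + 1];

        int x = get_group_id(0) * TILE + get_local_id(0);
        int y = get_group_id(1) * TILE + get_local_id(1);
        if (x < width && y < height)
            tile[get_local_id(1)][get_local_id(0)] = in[y * width + x];

        barrier(CLK_LOCAL_MEM_FENCE);

        int tx = get_group_id(1) * TILE + get_local_id(0);  // transposed coordinates
        int ty = get_group_id(0) * TILE + get_local_id(1);
        if (tx < height && ty < width)
            out[ty * height + tx] = tile[get_local_id(0)][get_local_id(1)];
    }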

56 The processing element (using OpenCL terms). Diagram blocks: input data, fetch return, general purpose registers, LDS op 0 and LDS op 1, constants, operand preparation, fetch addr/data, export addr/data, LDS requests. ALU capabilities: 4 x 32-bit FP FMA or MulAdd; 1 x 64-bit FMA or Mul; 2 x 64-bit FP Add; 4 x 24-bit Int Mul or MulAdd; 4 x 32-bit Int Add, Logical or Special; 1 x 32-bit FP MulAdd; 1 x 32-bit Integer; 1 x 32-bit Special (log, exp, rcp, sin, ...).
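In OpenCL C, the transcendental capability is most directly reached through the built-in math functions; the native_ variants trade precision for speed and are the ones most likely to map onto this special-function hardware. A hedged sketch, not a statement about what the compiler actually generates; the kernel and parameter names are illustrative.

    // Per-element attenuation using transcendentals; native_* variants are
    // lower-precision, faster versions of the standard built-ins.
    __kernel void attenuate(__global const float *dist,
                            __global float *out,
                            float falloff)
    {
        size_t i = get_global_id(0);
        float d = dist[i];
        out[i] = native_exp(-falloff * d) * native_recip(1.0f + d * d);
    }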

57 Dependent operations: co-issue of dependent operations in a single VLIW packet, with full IEEE intermediate rounding and normalization. Dot4 (A = A*B + C*D + E*F + G*H); dual Dot2 (A = A*B + C*D; F = F*H + I*J); dual dependent multiplies (A = B * C * D; F = G * H * I); dual dependent adds (A = B + C + D; F = G + H + I); dependent MulAdd (A = B * C + D + E * F); 24-bit integer Mul and MulAdd (4-way co-issue), heavily used for work-group address calculation.
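A hedged OpenCL C illustration of the kind of dependent arithmetic this enables (whether the compiler actually packs it this way is up to the code generator; the names are illustrative):

    // A dot product and a dependent multiply-add per element; on this VLIW
    // architecture such dependent operations can be co-issued in one packet.
    __kernel void weighted_dot(__global const float4 *a,
                               __global const float4 *b,
                               __global float *out,
                               float scale, float bias)
    {
        size_t i = get_global_id(0);
        float d = dot(a[i], b[i]);      // Dot4-style operation
        out[i] = mad(d, scale, bias);   // dependent multiply-add
    }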

58 The Global Data Share (block diagram): shared by the dual arrays of SIMD engines; conflict detection and control scheduling; fast append counter control; left and right source collectors (4 work-items/clock each) and return data staging; input address crossbar; read data and write data crossbars; integer atomic units; sequencer connection, left and right buses, and a pre-op return value path.

59 GDS features: low latency access to a global shared memory (25 clocks latency); accesses are issued in a separate clause, in the same way as texture accesses; fully coalesced reads, writes and atomics, as with the LDS; 8 work-items per clock can make requests; allocation and initialization are handled by the driver; useful for low-latency global reductions.
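OpenCL does not expose the GDS directly, but global atomics give a flavour of the low-latency reduction pattern the slide refers to. A hedged sketch using the standard 32-bit integer atomics (core in OpenCL 1.1); the kernel name and counter layout are illustrative.

    // Count how many elements exceed a threshold, accumulating into a single
    // global counter with atomic increments. A real kernel would usually
    // reduce within the work-group first to cut the number of atomics.
    __kernel void count_above(__global const float *in,
                              __global volatile int *counter,
                              float threshold)
    {
        size_t i = get_global_id(0);
        if (in[i] > threshold)
            atomic_inc(counter);
    }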

60 Summary. We've looked at the basic principles of the GPU architecture in the processor design space, seen some of the tradeoffs that lead to GPU features, and gone over the basic features of the HD 5870 architecture that affect compute applications. Later talks will go into detail on GPU optimizations for these architectural features.

61 Questions and answers. Visit the OpenCL Zone on developer.amd.com; the OpenCL Programming Webinars page includes a schedule of upcoming webinars, on-demand versions of this and past webinars, and slide decks of this and past webinars. Upcoming webinars include: OpenCL Programming In Detail; Real World Application Example; Optimization Techniques; and Device Fission Extensions for OpenCL.

62 Trademark attribution. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. 2010 Advanced Micro Devices, Inc. All rights reserved.
