GPU Architecture: An OpenCL Programmer's Introduction. Lee Howes, November 3, 2010
2 The aim of this webinar: to provide a general background to modern GPU architectures; to place the AMD GPU designs in context, both with other types of architecture and with other GPU designs; and to give an idea of why certain optimizations become necessary on such architectures, and why the architectures are designed that way. 2 GPU Architecture November 2010 developer.amd.com -> OpenCL Zone
3 Agenda: GPUs as graphics processing devices, what they are designed for, and what this means architecturally. The implications of SIMD execution on application development. LDS and latency hiding. How the GPU fits in the CPU design space. A description of the features of the AMD Radeon HD5870 GPU.
4 What is a GPU?
5 In a nutshell: the GPU is a multicore processor optimized for graphics workloads. [Block diagram: eight shader cores alongside texture units, a rasterizer, output blend, video decode, and a scheduler.]
6 Processing pixels. [Diagram: direction of light, normal at surface, a pixel being shaded.]
7 Processing pixels. [Diagram: the same setup with four pixels being shaded.]
8 Processing pixels. Each pixel runs a diffuse shader:

Texture2D<float3> mytex;
float3 lightdir;

float4 diffuseshader(float3 norm, float2 uv)
{
    float3 kd;
    kd = mytex.sample(mysamp, uv);
    kd *= clamp(dot(lightdir, norm), 0.0, 1.0);
    return float4(kd, 1.0);
}
9 Processing pixels. [Diagram: the same shader applied across many pixels.]
10 SIMD execution and its implications
11 SIMD pixel execution. The same shader runs in lockstep on a group of pixels, one per SIMD lane:

Texture2D<float3> mytex;
float3 lightdir;

float4 diffuseshader(float3 norm, float2 uv)
{
    float3 kd;
    kd = mytex.sample(mysamp, uv);
    kd *= clamp(dot(lightdir, norm), 0.0, 1.0);
    return float4(kd, 1.0);
}
12 Branches that diverge. [Diagram: four ALUs, each running its own instance of the shader below; when lanes disagree on the condition, the SIMD batch must step through both sides of the branch.]

Buffer<float> mytex;

float diffuseshader(float threshold, float index)
{
    float brightness = mytex[index];
    float output;
    if (brightness > threshold)
        output = threshold;
    else
        output = brightness;
    return output;
}
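The divergence behaviour animated on these slides can be condensed into a small C model. This is a hedged sketch, not AMD's hardware mechanism: run_batch, the 4-lane batch width, and the instruction-slot cost unit are all invented for illustration. Each lane evaluates the branch condition into a mask, the mask selects each lane's result, and a batch whose lanes disagree is charged for both paths.

```c
#include <assert.h>

/* Model of a 4-wide SIMD batch executing the thresholding shader.
   All lanes evaluate the condition; a per-lane mask selects each
   lane's result. Lane count and cost unit are illustrative. */
#define LANES 4

/* Runs one SIMD batch and returns how many instruction slots the
   branch consumed: 1 if all lanes agree, 2 if they diverge. */
static int run_batch(const float brightness[LANES], float threshold,
                     float out[LANES])
{
    int mask[LANES];
    int any_true = 0, any_false = 0;
    int i;

    for (i = 0; i < LANES; ++i) {
        mask[i] = brightness[i] > threshold;
        any_true |= mask[i];
        any_false |= !mask[i];
    }
    for (i = 0; i < LANES; ++i)
        out[i] = mask[i] ? threshold : brightness[i];

    /* Divergent lanes force both paths to be issued. */
    return (any_true && any_false) ? 2 : 1;
}
```

With a uniform input the batch issues one path; mixing values above and below the threshold doubles the issue count, which is the divergence penalty the slides depict.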
19 SIMD execution: SIMD instructions? Programming SIMD with SIMD instructions: most vector instruction sets (SSE, AVX), Intel's Larrabee and Knights Corner; programmed masking; OpenCL compiling to SSE; pixel shaders compiling to Larrabee. Programming SIMD with scalar instructions: GPU shader languages; hardware-controlled masking; current-generation GPUs; GPU intermediate languages; OpenCL.
20 Why does this matter for compute? Graphics code traditionally has relatively short shaders on large triangles, so the overall level of branch divergence will not be high. With graphics code you cannot necessarily control it: SIMD batches are constructed by the hardware depending on the scene properties. For OpenCL code you are defining your execution space: you choose what work is performed by which work-item, and you choose how to structure your algorithm to avoid this divergence.
21 Throughput execution and latency hiding
22 Covering pipeline latency. [Timeline: lanes 0-3 issue instruction 0, stall, then issue instruction 1.]
23 Covering pipeline latency: logical vector. [Timeline: instruction 0 issued in turn on lanes 0-3, 4-7, 8-11 and 12-15, followed by stalls, then instruction 1.]
24 Covering pipeline latency: ALU operations. [Timeline: instruction 0 issued back to back on lanes 0-3, 4-7, 8-11 and 12-15, immediately followed by instruction 1 on lanes 0-3, 4-7, 8-11 and 12-15; no stalls.]
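The schedule on these slides can be checked with a few lines of C. A hedged sketch: stall_cycles and its cycle counts are invented for the example; it simply compares when a lane group gets its next issue turn against when its previous result is ready.

```c
#include <assert.h>

/* Stall cycles seen by lane group 0 between instruction 0 and
   instruction 1, when `lane_groups` groups issue in rotation (one
   group per cycle) and a result takes `pipeline_latency` cycles to
   become usable. Illustrative model only. */
static int stall_cycles(int pipeline_latency, int lane_groups)
{
    int next_issue = lane_groups; /* cycle of group 0's next turn */
    return next_issue >= pipeline_latency
               ? 0
               : pipeline_latency - next_issue;
}
```

With one lane group and a 4-cycle pipeline the group stalls for 3 cycles, as on the first timeline; with four groups the rotation itself covers the ALU latency and there are no stalls.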
25 Covering memory latency. [Timeline: lanes 0-3 issue instruction 0, stall on memory, then issue instruction 1.]
26 Covering memory latency: we still stall. [Timeline: instruction 0 on lanes 0-3, 4-7, 8-11 and 12-15, then stalls, then instruction 1; rotating lane groups alone cannot cover the much longer memory latency.]
27 Covering memory latency: another wavefront. [Timeline: while the first wavefront waits on memory after instruction 0, a second wavefront issues its instruction 0; the first wavefront then resumes with instruction 1.]
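The same reasoning gives a back-of-the-envelope rule for how many wavefronts must be resident to keep the SIMD engine busy. A hedged sketch with invented parameters (real clause lengths and memory latencies vary widely):

```c
#include <assert.h>

/* Wavefronts needed so that, while one waits on memory, the other
   wavefronts' ALU clauses fill the gap. Ceiling division; the
   parameters are illustrative, not HD5870 figures. */
static int wavefronts_to_hide(int mem_latency_cycles,
                              int alu_cycles_per_wavefront)
{
    int extra = (mem_latency_cycles + alu_cycles_per_wavefront - 1)
                / alu_cycles_per_wavefront;
    return 1 + extra; /* the waiting wavefront plus those covering it */
}
```

For example, a 400-cycle memory latency covered by 100-cycle ALU clauses needs five resident wavefronts; more ALU work per wavefront means fewer are needed.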
28 Latency hiding in the SIMD engine. [Diagram: four pixels on the four ALUs.]
29 Latency hiding in the SIMD engine. [Diagram: two groups of four pixels share the four ALUs.]
30 Latency hiding in the SIMD engine. [Diagram: the second group computes while the first waits.]
31 A throughput-oriented SIMD engine. [Diagram: two groups of four pixels resident on the four ALUs.]
32 A throughput-oriented SIMD engine. [Diagram: each resident pixel group carries its own state.]
33 Adding the memory hierarchy. Unlike most CPUs, GPUs do not have vast cache hierarchies. Caches on CPUs exist primarily to lower access latency, but on the GPU heavy multithreading reduces the latency requirement: latency is not an issue, because we cover it with other threads. Total bandwidth is still an issue, even with high-latency, high-speed memory interfaces.
34 Texture caches and local memory. These are designed to support sharing between work-items; they reduce bandwidth requirements, not latency. Global memory: high-latency loads, limited bandwidth. Local memory, next to the SIMD engine: efficient random accesses, very high bandwidth.
35 A throughput-oriented SIMD engine. [Diagram: fetch and decode stages feed an execute stage of four ALUs; two pixel groups with their state are resident; a local storage/cache sits alongside execute.]
36 A throughput-oriented SIMD engine. [Diagram: the same core abstracted to fetch, decode, execute and local storage/cache.]
37 The GPU shader cores
38 The design space
39 The AMD Phenom II X6: 6 cores; one state set per core; 4-wide SIMD (actually there are two pipes).
40 The Intel i7, 6-core variants: 6 cores; two state sets per core (SMT/Hyper-Threading); 4-wide SIMD.
41 Sun UltraSPARC T2: 8 cores; eight state sets per core; no SIMD.
42 The AMD HD5870 GPU: 20 cores; up to 24 logical 64-wide SIMD state sets per core (the number depends on register requirements); 16-wide physical SIMD.
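The caveat "number depends on register requirements" can be made concrete with a small model. A hedged sketch: the register-file size and per-work-item register count below are hypothetical example numbers; only the 24-wavefront cap and the 64-wide wavefront come from the slide.

```c
#include <assert.h>

/* Resident wavefront (state-set) count per core, limited either by a
   hardware cap or by how many per-wavefront register allocations fit
   in the register file. All sizes are example numbers. */
static int resident_wavefronts(int regfile_regs,
                               int regs_per_work_item,
                               int work_items_per_wavefront,
                               int hw_max_wavefronts)
{
    int by_regs = regfile_regs
                  / (regs_per_work_item * work_items_per_wavefront);
    return by_regs < hw_max_wavefronts ? by_regs : hw_max_wavefronts;
}
```

A lean kernel hits the hardware cap; a register-hungry one drops the resident count, and with it the latency-hiding ability described earlier.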
43 The AMD Radeon HD5870 GPU
44 Features of the Radeon HD5870 architecture. ATI Radeon HD 5870, a Teraflops architecture:
Area: 334 mm2
Transistors: 2.15 billion
Memory bandwidth: 153 GB/sec
L2-L1 read bandwidth: 512 bytes/clk
L1 bandwidth: 1280 bytes/clk
Vector GPR: 5.24 MB
LDS memory: 640 kB
LDS bandwidth: 2560 bytes/clk
Concurrent wavefronts: 496
Shader (ALU) units: 1600
Idle power: 27 W
Max power: 188 W
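The table's ALU count supports a quick sanity check of the "Teraflops architecture" headline. A hedged sketch: the 850 MHz engine clock is assumed from public HD5870 specifications (it is not in the table), and a multiply-add is counted as two floating-point operations.

```c
#include <assert.h>

/* Peak single-precision GFLOPS: each ALU unit retires one
   multiply-add (2 FLOPs) per clock. The clock frequency is an
   assumed parameter, not taken from the slide. */
static double peak_gflops(int alu_units, double clock_ghz)
{
    return alu_units * 2.0 * clock_ghz;
}
```

1600 ALU units at an assumed 0.85 GHz come to roughly 2.72 TFLOPS, consistent with the headline.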
45 High-level view. [Block diagram: Command Processor/Group Generator; two sequencers, each with an instruction cache, driving SIMD Engines 0-9 and SIMD Engines 10-19; GDS; read cache crossbar and write crossbar; write-combine caches and R/W caches for global atomics; read L2 caches (8k per memory channel); an 8-channel memory controller to eight GDDR5 devices.]
46 Clause execution. [The same block diagram, with the sequencer stepping through a compiled clause stream:]

08 ALU_PUSH_BEFORE: ADDR(62) CNT(2)
  15 x: SETGT_INT R0.x, R2.x, R4.x
  16 x: PREDNE_INT, R0.x, 0.0f
09 JUMP POP_CNT(1) ADDR(18)
10 ALU: ADDR(64) CNT(9)
  17 x: SUB_INT R5.x, R3.x, R2.x
     y: LSHL, R3.x, ( ).x
     z: LSHL, R2.x, ( ).x VEC_120
     t: MOV R8.x, 0.0f
  18 x: SUB_INT R6.x, PV17.y, PV17.z
     y: MOV R8.y, 0.0f
     z: MOV R8.z, 0.0f
     w: MOV R8.w, 0.0f
11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17)
12 ALU: ADDR(73) CNT(12)
  19 x: LSHL, R5.x, ( ).x
     w: ADD_INT, R6.x, R1.x VEC_120
     t: ADD_INT R7.x, R1.x, ( ).y
13 TEX: ADDR(368) CNT(2)
  22 VFETCH R0.x, R0.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
  23 VFETCH R1.x, R0.z, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
51 The SIMD Engine. [Diagram: SEQ/branch control; 32k Local Data Share, 32 banks with integer atomic units; filter & format units; address & load units; 8k read L1; export/load/store path.] 16 processing elements: the physical vector. Executes two 64-element wavefronts over 4 cycles: the logical vector. Each lane executes a 5-way VLIW instruction. Lane masking and branching support divergence. Hardware barrier support for up to 8 work-groups per SIMD engine.
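The statement that 16 processing elements execute a 64-element wavefront over 4 cycles implies a simple lane-to-hardware mapping. A hedged sketch: the row-major assignment below is an illustrative choice, not documented hardware behaviour.

```c
#include <assert.h>

/* Maps a wavefront lane (0..63) to the cycle and processing element
   on which it executes, for 16 PEs stepping through a 64-wide
   wavefront over 4 cycles. Row-major mapping is assumed. */
typedef struct { int cycle; int pe; } lane_slot;

static lane_slot lane_schedule(int lane)
{
    lane_slot s;
    s.cycle = lane / 16; /* one quarter-wavefront issued per cycle */
    s.pe    = lane % 16; /* which of the 16 PEs handles the lane   */
    return s;
}
```

Under this mapping the physical vector (16 lanes per cycle) and the logical vector (64 lanes per wavefront) coexist without any lane ever needing a second PE.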
54 The Local Data Share. [Diagram: conflict detection and control scheduling; source collectors and return-data staging; input address crossbar; read and write data crossbars; integer atomic units; SEQ; ports for PE0-PE15; pre-op return value path.]
55 LDS features. High bandwidth: twice external bandwidth (1024b/clock compared with 512b/clock per SIMD). Fully coalesced reads, writes and atomics, with an optimization for broadcast reads. Low latency access: 0 latency for direct reads; 1 VLIW instruction of latency for indirect ops (8 real cycles in the pipeline). Bank conflicts are detected by hardware and serialized.
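"Bank conflicts detected and serialized" can be modelled by counting distinct addresses per bank. A hedged sketch: lds_access_cycles, the lane count and the one-pass cost unit are invented for illustration; the 32-bank figure comes from the SIMD engine slide, and the broadcast case mirrors the slide's broadcast-read optimization.

```c
#include <assert.h>

#define LDS_BANKS 32

/* Serialized passes for one LDS access: distinct addresses mapping
   to the same bank serialize; distinct banks proceed in parallel;
   lanes sharing one address are a conflict-free broadcast. */
static int lds_access_cycles(const unsigned addr[], int nlanes)
{
    int worst = 1;
    for (int b = 0; b < LDS_BANKS; ++b) {
        int distinct = 0;
        for (int i = 0; i < nlanes; ++i) {
            if ((int)(addr[i] % LDS_BANKS) != b)
                continue;
            int seen = 0; /* already counted this address? */
            for (int j = 0; j < i; ++j)
                if (addr[j] == addr[i]) { seen = 1; break; }
            if (!seen)
                ++distinct;
        }
        if (distinct > worst)
            worst = distinct;
    }
    return worst;
}
```

Unit-stride addresses touch distinct banks and complete in one pass; a stride of 32 words lands every lane in the same bank and serializes fully; a broadcast stays at one pass.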
56 The processing element (using OpenCL terms). [Diagram: input data and fetch return feed the general-purpose registers; LDS op 0, LDS op 1, constants and operand preparation feed the ALUs; fetch address/data, export address/data and LDS requests go out.] Per-clock ALU throughput: 4 32b FP FMA or MulAdd; 1 64b FMA or Mul; 2 64b FP Add; 4 24b Int Mul or MulAdd; 4 32b Int Add, Logical or Special; 1 32b FP MulAdd; 1 32b Integer; 1 32b Special (log, exp, rcp, sin).
57 Dependent operations. Co-issue of dependent operations in a single VLIW packet, with full IEEE intermediate rounding and normalization: Dot4 (A = A*B + C*D + E*F + G*H); Dual Dot2 (A = A*B + C*D; F = F*H + I*J); dual dependent multiplies (A = B * C * D; F = G * H * I); dual dependent adds (A = B + C + D; F = G + H + I); dependent MulAdd (A = B * C + D + E * F); 24-bit integer Mul and MulAdd (4-way co-issue), used heavily for work-group address calculation.
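The first entry, restored to A = A*B + C*D + E*F + G*H, is a four-term dot product evaluated in one VLIW packet. A trivial C rendering of the dataflow (the function and operand names are illustrative):

```c
#include <assert.h>

/* The Dot4 pattern: four multiplies feeding a dependent summation,
   which the hardware co-issues in a single VLIW packet. Plain C
   computes it sequentially; the result is the same. */
static float dot4(float a, float b, float c, float d,
                  float e, float f, float g, float h)
{
    return a * b + c * d + e * f + g * h;
}
```

The point of the slide is that these four multiplies and three dependent adds occupy one packet rather than several serialized instructions.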
58 The Global Data Share. [Diagram: dual arrays of SIMD engines on left and right buses; conflict detection and control scheduling; fast append counter control; left and right source collectors (4 WIs/clock each) and return-data staging; input address crossbar; read and write data crossbars; integer atomic units; SEQ; pre-op return value path.]
59 GDS features. Low latency access to a global shared memory: 25 clocks of latency, issued in a separate clause in the same way as texture accesses. Fully coalesced reads, writes and atomics, as with the LDS; 8 work-items per clock can request. The driver handles allocation and initialization. Useful for low-latency global reductions.
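The closing point, low-latency global reductions, can be pictured as every work-group folding its partial sum into one GDS word. A hedged sketch: gds_reduce is a sequential stand-in invented for illustration; real hardware would apply the GDS integer atomic units concurrently.

```c
#include <assert.h>

/* Sketch of a global reduction through a small shared memory: each
   work-group adds its partial sum into one shared accumulator,
   modelled sequentially here (hardware would use GDS atomics).
   Group count and values are illustrative. */
static int gds_reduce(const int partial_sums[], int ngroups)
{
    int gds_accumulator = 0; /* one word of global data share */
    for (int g = 0; g < ngroups; ++g)
        gds_accumulator += partial_sums[g]; /* atomic add in hardware */
    return gds_accumulator;
}
```

The attraction over a global-memory reduction is the roughly 25-clock access latency quoted above, far below a round trip to GDDR5.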
60 Summary. We've looked at the basic principles of the GPU architecture in the processor design space, seen some of the tradeoffs that lead to GPU features, and gone over the basic features of the HD 5870 architecture that affect compute applications. Later talks will go into detail on GPU optimizations for these architectural features.
61 Questions and Answers. Visit the OpenCL Zone on developer.amd.com. The OpenCL Programming Webinars page includes a schedule of upcoming webinars, on-demand versions of this and past webinars, and slide decks of this and past webinars. Upcoming webinars include: OpenCL Programming In Detail; Real World Application Example; Optimization Techniques; and Device Fission Extensions for OpenCL.
62 Trademark Attribution. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. 2010 Advanced Micro Devices, Inc. All rights reserved.
More informationGPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
More informationIntroduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software
GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas
More informationADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit
More informationWhite Paper AMD GRAPHICS CORES NEXT (GCN) ARCHITECTURE
White Paper AMD GRAPHICS CORES NEXT (GCN) ARCHITECTURE Table of Contents INTRODUCTION 2 COMPUTE UNIT OVERVIEW 3 CU FRONT-END 5 SCALAR EXECUTION AND CONTROL FLOW 5 VECTOR EXECUTION 6 VECTOR REGISTERS 6
More informationTHE FUTURE OF THE APU BRAIDED PARALLELISM Session 2901
THE FUTURE OF THE APU BRAIDED PARALLELISM Session 2901 Benedict R. Gaster AMD Programming Models Architect Lee Howes AMD MTS Fusion System Software PROGRAMMING MODELS A Track Introduction Benedict Gaster
More informationMore on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction
More informationMulti-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007
Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer
More informationChapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors
Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors Lesson 05: Array Processors Objective To learn how the array processes in multiple pipelines 2 Array Processor
More informationIntroduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
More informationCHAPTER 7: The CPU and Memory
CHAPTER 7: The CPU and Memory The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint slides
More informationGPU Parallel Computing Architecture and CUDA Programming Model
GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel
More informationCentral Processing Unit (CPU)
Central Processing Unit (CPU) CPU is the heart and brain It interprets and executes machine level instructions Controls data transfer from/to Main Memory (MM) and CPU Detects any errors In the following
More informationInstruction Set Architecture. or How to talk to computers if you aren t in Star Trek
Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture
More informationData Parallel Computing on Graphics Hardware. Ian Buck Stanford University
Data Parallel Computing on Graphics Hardware Ian Buck Stanford University Brook General purpose Streaming language DARPA Polymorphous Computing Architectures Stanford - Smart Memories UT Austin - TRIPS
More informationIntroduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
More information"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012
"JAGUAR AMD s Next Generation Low Power x86 Core Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012 TWO X86 CORES TUNED FOR TARGET MARKETS Mainstream Client and Server Markets Bulldozer
More information(Refer Slide Time: 00:01:16 min)
Digital Computer Organization Prof. P. K. Biswas Department of Electronic & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture No. # 04 CPU Design: Tirning & Control
More informationNVIDIA GeForce GTX 580 GPU Datasheet
NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationPipelining Review and Its Limitations
Pipelining Review and Its Limitations Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic
More informationAdvanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2
Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of
More informationConsole Architecture. By: Peter Hood & Adelia Wong
Console Architecture By: Peter Hood & Adelia Wong Overview Gaming console timeline and evolution Overview of the original xbox architecture Console architecture of the xbox360 Future of the xbox series
More information1. Memory technology & Hierarchy
1. Memory technology & Hierarchy RAM types Advances in Computer Architecture Andy D. Pimentel Memory wall Memory wall = divergence between CPU and RAM speed We can increase bandwidth by introducing concurrency
More informationIntel Pentium 4 Processor on 90nm Technology
Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended
More informationHow To Use An Amd Ramfire R7 With A 4Gb Memory Card With A 2Gb Memory Chip With A 3D Graphics Card With An 8Gb Card With 2Gb Graphics Card (With 2D) And A 2D Video Card With
SAPPHIRE R9 270X 4GB GDDR5 WITH BOOST & OC Specification Display Support Output GPU Video Memory Dimension Software Accessory 3 x Maximum Display Monitor(s) support 1 x HDMI (with 3D) 1 x DisplayPort 1.2
More informationParallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
More informationGPU Hardware CS 380P. Paul A. Navrá7l Manager Scalable Visualiza7on Technologies Texas Advanced Compu7ng Center
GPU Hardware CS 380P Paul A. Navrá7l Manager Scalable Visualiza7on Technologies Texas Advanced Compu7ng Center with thanks to Don Fussell for slides 15-28 and Bill Barth for slides 36-55 CPU vs. GPU characteris7cs
More informationSAPPHIRE TOXIC R9 270X 2GB GDDR5 WITH BOOST
SAPPHIRE TOXIC R9 270X 2GB GDDR5 WITH BOOST Specification Display Support Output GPU Video Memory Dimension Software Accessory supports up to 4 display monitor(s) without DisplayPort 4 x Maximum Display
More informationCUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application
More informationComputer Graphics Hardware An Overview
Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and
More informationChapter 2 Parallel Architecture, Software And Performance
Chapter 2 Parallel Architecture, Software And Performance UCSB CS140, T. Yang, 2014 Modified from texbook slides Roadmap Parallel hardware Parallel software Input and output Performance Parallel program
More informationIntro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1
Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion
More informationBEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single
More informationComputer Architecture TDTS10
why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers
More informationPROBLEMS #20,R0,R1 #$3A,R2,R4
506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,
More informationGPGPU for Real-Time Data Analytics: Introduction. Nanyang Technological University, Singapore 2
GPGPU for Real-Time Data Analytics: Introduction Bingsheng He 1, Huynh Phung Huynh 2, Rick Siow Mong Goh 2 1 Nanyang Technological University, Singapore 2 A*STAR Institute of High Performance Computing,
More informationOpenCL Programming for the CUDA Architecture. Version 2.3
OpenCL Programming for the CUDA Architecture Version 2.3 8/31/2009 In general, there are multiple ways of implementing a given algorithm in OpenCL and these multiple implementations can have vastly different
More informationIntroducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
More informationSAPPHIRE VAPOR-X R9 270X 2GB GDDR5 OC WITH BOOST
SAPPHIRE VAPOR-X R9 270X 2GB GDDR5 OC WITH BOOST Specification Display Support Output GPU Video Memory Dimension Software Accessory 4 x Maximum Display Monitor(s) support 1 x HDMI (with 3D) 1 x DisplayPort
More informationLet s put together a Manual Processor
Lecture 14 Let s put together a Manual Processor Hardware Lecture 14 Slide 1 The processor Inside every computer there is at least one processor which can take an instruction, some operands and produce
More informationGPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
More informationCPU Session 1. Praktikum Parallele Rechnerarchtitekturen. Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14, 2015 1
CPU Session 1 Praktikum Parallele Rechnerarchtitekturen Praktikum Parallele Rechnerarchitekturen / Johannes Hofmann April 14, 2015 1 Overview Types of Parallelism in Modern Multi-Core CPUs o Multicore
More informationIBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus
Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 CELL INTRODUCTION 2 1 CELL SYNERGY Cell is not a collection of different processors, but a synergistic whole Operation paradigms,
More informationIntel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual
Intel Media Server Studio - Metrics Monitor (v1.1.0) Reference Manual Overview Metrics Monitor is part of Intel Media Server Studio 2015 for Linux Server. Metrics Monitor is a user space shared library
More informationRethinking SIMD Vectorization for In-Memory Databases
SIGMOD 215, Melbourne, Victoria, Australia Rethinking SIMD Vectorization for In-Memory Databases Orestis Polychroniou Columbia University Arun Raghavan Oracle Labs Kenneth A. Ross Columbia University Latest
More informationDesign Cycle for Microprocessors
Cycle for Microprocessors Raúl Martínez Intel Barcelona Research Center Cursos de Verano 2010 UCLM Intel Corporation, 2010 Agenda Introduction plan Architecture Microarchitecture Logic Silicon ramp Types
More informationMulti-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
More informationİSTANBUL AYDIN UNIVERSITY
İSTANBUL AYDIN UNIVERSITY FACULTY OF ENGİNEERİNG SOFTWARE ENGINEERING THE PROJECT OF THE INSTRUCTION SET COMPUTER ORGANIZATION GÖZDE ARAS B1205.090015 Instructor: Prof. Dr. HASAN HÜSEYİN BALIK DECEMBER
More informationGPU Hardware Performance. Fall 2015
Fall 2015 Atomic operations performs read-modify-write operations on shared or global memory no interference with other threads for 32-bit and 64-bit integers (c. c. 1.2), float addition (c. c. 2.0) using
More informationOverview. CPU Manufacturers. Current Intel and AMD Offerings
Central Processor Units (CPUs) Overview... 1 CPU Manufacturers... 1 Current Intel and AMD Offerings... 1 Evolution of Intel Processors... 3 S-Spec Code... 5 Basic Components of a CPU... 6 The CPU Die and
More informationL20: GPU Architecture and Models
L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.
More informationwhat operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?
Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the
More informationE6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
More informationChoosing a Computer for Running SLX, P3D, and P5
Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line
More informationCPU Organization and Assembly Language
COS 140 Foundations of Computer Science School of Computing and Information Science University of Maine October 2, 2015 Outline 1 2 3 4 5 6 7 8 Homework and announcements Reading: Chapter 12 Homework:
More informationFamily 12h AMD Athlon II Processor Product Data Sheet
Family 12h AMD Athlon II Processor Publication # 50322 Revision: 3.00 Issue Date: December 2011 Advanced Micro Devices 2011 Advanced Micro Devices, Inc. All rights reserved. The contents of this document
More informationFPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
More informationA Performance-Oriented Data Parallel Virtual Machine for GPUs
A Performance-Oriented Data Parallel Virtual Machine for GPUs Mark Segal Mark Peercy ATI Technologies, Inc. Abstract Existing GPU programming interfaces require that applications employ a graphics-centric
More informationVLIW Processors. VLIW Processors
1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW
More informationSPARC64 X: Fujitsu s New Generation 16 Core Processor for the next generation UNIX servers
X: Fujitsu s New Generation 16 Processor for the next generation UNIX servers August 29, 2012 Takumi Maruyama Processor Development Division Enterprise Server Business Unit Fujitsu Limited All Rights Reserved,Copyright
More informationUNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS
UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS Structure Page Nos. 2.0 Introduction 27 2.1 Objectives 27 2.2 Types of Classification 28 2.3 Flynn s Classification 28 2.3.1 Instruction Cycle 2.3.2 Instruction
More informationGPUs Under the Hood. Prof. Aaron Lanterman School of Electrical and Computer Engineering Georgia Institute of Technology
GPUs Under the Hood Prof. Aaron Lanterman School of Electrical and Computer Engineering Georgia Institute of Technology Bandwidth Gravity of modern computer systems The bandwidth between key components
More informationA Computer Vision System on a Chip: a case study from the automotive domain
A Computer Vision System on a Chip: a case study from the automotive domain Gideon P. Stein Elchanan Rushinek Gaby Hayun Amnon Shashua Mobileye Vision Technologies Ltd. Hebrew University Jerusalem, Israel
More informationMICROPROCESSOR AND MICROCOMPUTER BASICS
Introduction MICROPROCESSOR AND MICROCOMPUTER BASICS At present there are many types and sizes of computers available. These computers are designed and constructed based on digital and Integrated Circuit
More informationIntroduction to GPU Computing
Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture
More informationME964 High Performance Computing for Engineering Applications
ME964 High Performance Computing for Engineering Applications Intro, GPU Computing February 9, 2012 Dan Negrut, 2012 ME964 UW-Madison "The Internet is a great way to get on the net. US Senator Bob Dole
More informationCPU Organisation and Operation
CPU Organisation and Operation The Fetch-Execute Cycle The operation of the CPU 1 is usually described in terms of the Fetch-Execute cycle. 2 Fetch-Execute Cycle Fetch the Instruction Increment the Program
More informationFamily 10h AMD Phenom II Processor Product Data Sheet
Family 10h AMD Phenom II Processor Product Data Sheet Publication # 46878 Revision: 3.05 Issue Date: April 2010 Advanced Micro Devices 2009, 2010 Advanced Micro Devices, Inc. All rights reserved. The contents
More informationHigh-speed image processing algorithms using MMX hardware
High-speed image processing algorithms using MMX hardware J. W. V. Miller and J. Wood The University of Michigan-Dearborn ABSTRACT Low-cost PC-based machine vision systems have become more common due to
More informationHardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
More informationA Powerful solution for next generation Pcs
Product Brief 6th Generation Intel Core Desktop Processors i7-6700k and i5-6600k 6th Generation Intel Core Desktop Processors i7-6700k and i5-6600k A Powerful solution for next generation Pcs Looking for
More informationOptimizing Code for Accelerators: The Long Road to High Performance
Optimizing Code for Accelerators: The Long Road to High Performance Hans Vandierendonck Mons GPU Day November 9 th, 2010 The Age of Accelerators 2 Accelerators in Real Life 3 Latency (ps/inst) Why Accelerators?
More informationAwards News. GDDR5 memory provides twice the bandwidth per pin of GDDR3 memory, delivering more speed and higher bandwidth.
SAPPHIRE FleX HD 6870 1GB GDDE5 SAPPHIRE HD 6870 FleX can support three DVI monitors in Eyefinity mode and deliver a true SLS (Single Large Surface) work area without the need for costly active adapters.
More informationAdaptive Stable Additive Methods for Linear Algebraic Calculations
Adaptive Stable Additive Methods for Linear Algebraic Calculations József Smidla, Péter Tar, István Maros University of Pannonia Veszprém, Hungary 4 th of July 204. / 2 József Smidla, Péter Tar, István
More information