Tutorial: Harnessing the Power of FPGAs using Altera's OpenCL Compiler Desh Singh, Tom Czajkowski, Andrew Ling
1 Tutorial: Harnessing the Power of FPGAs using Altera's OpenCL Compiler Desh Singh, Tom Czajkowski, Andrew Ling
2 OPENCL INTRODUCTION
3 Programmable Solutions Technology scaling favors programmability and parallelism. The spectrum runs from coarse-grained to fine-grained: Single Cores (CPUs and DSPs), Multi-Cores, Coarse-Grained Massively Parallel Processor Arrays (GPGPUs), and Fine-Grained Massively Parallel Arrays (FPGAs).
4 CPU + Hardware Accelerator ARM-based system-on-a-chip (SoC) devices by Intel, Apple, NVIDIA, Qualcomm, TI, and others; SoC FPGAs by Altera
5 OpenCL Overview Open Computing Language. Software-centric: a C/C++ API for the host program and OpenCL C (C99-based) for the acceleration device. Unified design methodology: CPU offload, Memory Access, Parallelism, Vectorization. [Diagram: Host CPU runs the C/C++ API; OpenCL C runs on the hardware accelerator]
6 OpenCL Driving Force Developed and published by the Khronos consortium. Royalty-free standard. Application-driven: Consumer Electronics (Image Processing & Video Encoding, Augmented Reality & Computational Photography), Military (Video and Radar Analytics), Scientific / High Performance Computing (Financial, Molecular Dynamics, Bioinformatics, etc.), Medical rendering algorithms.
7 OpenCL Abstract Programming Model Host: main() { read_data( ); manipulate( ); clEnqueueWriteBuffer( ); clEnqueueNDRangeKernel( , sum, ); clEnqueueReadBuffer( ); display_result( ); } Device: Global Mem, Compute Unit, Local Mem. Explicit Data Storage, Hierarchical Memory Model, Explicit Parallelism (Vectorization, Multi-threading). __kernel void sum(__global float *a, __global float *b, __global float *y) { int gid = get_global_id(0); y[gid] = a[gid] + b[gid]; }
8 OpenCL Host Program Pure software written in standard C. Communicates with the accelerator device via a set of library routines which abstract the communication between the host processor and the kernels: copy data from host to FPGA, ask the FPGA to run a particular kernel, copy data from FPGA to host. main() { read_data_from_file( ); manipulate_data( ); clEnqueueWriteBuffer( ); clEnqueueTask( , my_kernel, ); clEnqueueReadBuffer( ); display_result_to_user( ); }
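As a minimal sketch of what those library routines look like in practice (the names queue, my_kernel, buf_in, buf_out, host_in, host_out, and nbytes are placeholders, and setup and error handling are omitted), the host-side sequence on this slide might be written as:

    /* Hypothetical host fragment; context, queue, kernel, and buffers
       are assumed to have been created already. */
    clEnqueueWriteBuffer(queue, buf_in, CL_TRUE, 0, nbytes, host_in, 0, NULL, NULL);  /* Host -> FPGA */
    clSetKernelArg(my_kernel, 0, sizeof(cl_mem), &buf_in);
    clSetKernelArg(my_kernel, 1, sizeof(cl_mem), &buf_out);
    clEnqueueTask(queue, my_kernel, 0, NULL, NULL);                                   /* run the kernel */
    clEnqueueReadBuffer(queue, buf_out, CL_TRUE, 0, nbytes, host_out, 0, NULL, NULL); /* FPGA -> Host */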
9 OpenCL Kernels Data-parallel function. Defines many parallel threads of execution; each thread has an identifier specified by get_global_id. Contains keyword extensions to specify parallelism and memory hierarchy. Executed by a compute object: CPU, GPU, or Accelerator. __kernel void sum(__global const float *a, __global const float *b, __global float *answer) { int xid = get_global_id(0); answer[xid] = a[xid] + b[xid]; }
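To give each of the N threads a distinct get_global_id(0), the host launches the kernel over an N-point index space; a sketch (queue and kernel handles are assumed placeholders):

    size_t global_size = N;   /* one work-item per vector element */
    clEnqueueNDRangeKernel(queue, sum_kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);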
10 Flow OpenCL Host Program + Kernels. Host: main() { read_data( ); manipulate( ); clEnqueueWriteBuffer( ); clEnqueueNDRangeKernel( , sum, ); clEnqueueReadBuffer( ); display_result( ); } Kernel: __kernel void sum(__global float *a, __global float *b, __global float *y) { int gid = get_global_id(0); y[gid] = a[gid] + b[gid]; } The host program goes through a standard C compiler to an x86 EXE; the kernels go through the OpenCL compiler to Verilog and then Quartus II to a SOF; the two sides communicate over PCIe.
11 FPGA OpenCL Architecture x86 / External Processor FPGA PCIe External Memory Controller & PHY External Memory Controller & PHY DDR* Global Memory Interconnect M9K M9K M9K Kernel Pipeline Kernel Pipeline Kernel Pipeline M9K M9K M9K Local Memory Interconnect Modest external memory bandwidth Extremely high internal memory bandwidth Highly customizable compute cores
12 Compiling OpenCL to FPGAs Host Program: main() { read_data_from_file( ); manipulate_data( ); clEnqueueWriteBuffer( ); clEnqueueNDRangeKernel( , sum, ); clEnqueueReadBuffer( ); display_result_to_user( ); } compiles with a standard C compiler to an x86 binary. Kernel Programs: __kernel void sum(__global const float *a, __global const float *b, __global float *answer) { int xid = get_global_id(0); answer[xid] = a[xid] + b[xid]; } compile with the ACL Compiler to a SOF. [Diagram: the kernel becomes a dataflow pipeline of Load, Store, and compute nodes on the FPGA; host and FPGA communicate over PCIe, with the kernel pipelines accessing DDRx]
13 Mapping Multithreaded Kernels to FPGAs The simplest way of mapping kernel functions to FPGAs is to replicate the hardware for each thread, which is inefficient and wasteful. A better method takes advantage of pipeline parallelism: attempt to create a deeply pipelined representation of a kernel, and on each clock cycle attempt to send in input data for a new thread. This is a method of mapping coarse-grained thread parallelism to fine-grained FPGA parallelism.
14 Example Pipeline for Vector Add 8 threads for vector add example. [Animation over slides 14-18: thread IDs advance one stage per cycle through the Load, Load, +, Store pipeline] On each cycle the portions of the pipeline are processing different threads: while thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored.
19 ALTERA OPENCL SYSTEM ARCHITECTURE
20 OpenCL System Architecture External Interface x86 / External Processor FPGA PCIe External Memory Controller & PHY External Memory Controller & PHY DDR* Global Memory Interconnect M9K M9K M9K Kernel Pipeline Kernel Pipeline Kernel Pipeline M9K M9K M9K Local Memory Interconnect
21 External Interface : High Level PCIe DMA DDR Controller K0 K1 K2 Kn Kernel Pipelines
22 OpenCL System Architecture x86 / External Processor FPGA PCIe External Memory Controller & PHY External Memory Controller & PHY DDR* Global Memory Interconnect M9K M9K M9K Kernel Pipeline Kernel Pipeline Kernel Pipeline M9K M9K M9K Local Memory Interconnect Kernel System
23 Altera OpenCL Kernel Architecture [filename]_system CRA [kernel]_top_wrapper [kernel]_function_wrapper Dispatcher CSR NDR ARG 0 ARG 1 ID ITER ID ITER [kernel]_function EWI BB0 [kernel]_function EWI BB0 BB0 BB0 BB0 BB0 BB0 BB0 DONE Local Memory Interconnect Local Memory Interconnect Local Memory Interconnect Local Memory Interconnect LOCAL MEM #0 LOCAL MEM #1 LOCAL MEM #2 LOCAL MEM #3 Constant Cache Kernel Pipelines Global Memory Interconnect MEM0 MEM1
24 OpenCL CAD Flow vectoradd_kernel.cl → CLANG front end → Unoptimized LLVM IR → Optimizer → Optimized LLVM IR → RTL generator → Verilog. Front End: parses OpenCL extensions and intrinsics and produces the LLVM IR system description. vectoradd_host.c → C compiler (linked against the ACL runtime library) → program.exe. ACL interface: DDR*, PCIe.
25 OpenCL CAD Flow vectoradd_kernel.cl → CLANG front end → Unoptimized LLVM IR → Optimizer → Optimized LLVM IR → RTL generator → Verilog. Middle End: ~150 compiler passes such as loop fusion, auto vectorization, and branch elimination, leading to more efficient HW. vectoradd_host.c → C compiler (linked against the ACL runtime library) → program.exe. ACL interface: DDR*, PCIe.
26 OpenCL CAD Flow vectoradd_kernel.cl → CLANG front end → Unoptimized LLVM IR → Optimizer → Optimized LLVM IR → RTL generator → Verilog. Back End: instantiates Verilog IP for each operation in the intermediate representation; creates control flow circuitry to handle loops, memory stalls, and branching; performs traditional optimizations such as scheduling and resource sharing. The generated Verilog is integrated via QSYS and compiled with Quartus. ACL interface: DDR*, PCIe.
27 MEMORY ACCESS OPTIMIZATIONS
28 Memory Access Optimizations At a high level, present OpenCL benchmarks are memory-to-memory transformations: data is read from global memory, processed, and the results are stored back in global memory. Global Memory is implemented using DDR memory (two separate DIMMs) and has a long access latency; the access pattern to this memory has a significant impact on application performance, and in many cases dominates it. Local Memory is implemented on-chip; it is fast but very limited in size (54 Mbits on Stratix V A7 and 41 Mbits on Stratix V D5). The first and most significant challenge is to architect your application to balance the use of the available memories to meet design requirements.
29 Global Memory - DDRx Configuration Half Rate: System → Memory Controller → Memory PHY over 256-bit busses at 200 MHz*, driving DDRx SDRAM over 64 bits at 800 Mbps (400 MHz*). Quarter Rate: System → Memory Controller → Memory PHY over 512-bit busses at 200 MHz*, driving DDRx SDRAM over 64 bits at 1600 Mbps (800 MHz*). Wide on-chip busses at a modest system clock carry the same bandwidth as the narrow, high-rate memory interface. *This frequency target is an example only; it does not reflect the actual frequency configuration supported by the DDRx controller.
30 DDRx Memory Efficiency Access pattern is critical to DDR efficiency, which favors large burst transactions. There is a tradeoff between hardware complexity and the efficiency of the memory access pattern: certain access patterns do not warrant complex hardware, and complex hardware can be simplified with knowledge about the pattern or surrounding hazards. The OpenCL challenge: make memory access as efficient as possible, for a variety of common access patterns, without placing a burden on device resources; convert inefficient patterns (e.g. a series of short single-word reads) into efficient patterns (e.g. a single large burst read). Random access requests go to a Simple LSU, which provides simple direct access to the global memory bus; requests to consecutive addresses go to a Coalescing LSU, which coalesces requests to optimally utilize the wider bus to global memory. (LSU: Load/Store Unit)
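A sketch of the two patterns as they appear in kernel code (illustrative, not from the slides): the gid-indexed accesses below are consecutive across threads and suit the coalescing LSU, while the data-dependent lookup through idx is effectively random and falls back to the simple LSU.

    __kernel void gather(__global const int *src, __global const int *idx,
                         __global int *dst)
    {
        int gid = get_global_id(0);
        /* idx[gid] and dst[gid]: consecutive across threads -> coalescing LSU
           src[idx[gid]]: data-dependent address -> simple LSU */
        dst[gid] = src[idx[gid]];
    }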
31 Load/Store Memory Coalescing External memory has wide words (256/512 bits); loads/stores typically access narrower words (32 or 128 bits) within a DDR word. Given a sequence of loads/stores, we want to make as few external memory read/write requests as possible.
32 Load/Store Memory Coalescing Coalescing is important for good performance: combine loads/stores that access the same DDR word or the one ahead of the previously-accessed DDR word, and make one big multi-word burst request to external memory whenever possible. Fewer requests mean less contention to global memory; contiguous bursts mean less external memory overhead. Example load/store addresses (128-bit words, ending at 100c, 100d): one burst request for 4 DDR words plus two single-word requests, for 3 requests in total.
33 Static Versus Dynamic Coalescing In the Altera OpenCL SDK, there are two types of coalescing available: static, which can be inferred by examining the kernel source code, and dynamic, performed by the hardware whenever possible; but the hardware to do this is not insignificant. Making it easy for the compiler to determine the access pattern in your application will drive it towards a better implementation.
34 Example 1 RANDOM access (with ModelSim)
__kernel void __attribute__((max_work_group_size(256)))
example1(__global int *a, __global int *b)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int size = get_local_size(0);
    int group = get_group_id(0);
    int offset = (group+1)*size;
    b[gid] = a[gid] + a[offset-lid-1];
}
35 Example 1 RANDOM access [Animation over slides 35-42: thread IDs advance through the two Loads, the +, and the Store, with both loads going out to Global Memory]
43 Load-Store Unit - High-Level Block Diagram coalescer shared load/store sites ARB address count State FIFO waiting threads scheduled bursts Burst FIFO GLOBAL buffer DEMUX valid data count GLOBAL
44 Load-Store Unit - High-Level Block Diagram Kernel Pipeline Interface coalescer DDR Interconnect Interface shared load/store sites Avalon Streaming ARB State FIFO waiting threads address Burst FIFO count scheduled bursts Avalon Memory Mapped GLOBAL buffer DEMUX valid data count GLOBAL
45 Avalon Memory Mapped Single Request (Read/Write)
46 Avalon Memory Mapped Burst Request (Read)
47 Avalon Streaming
48 ModelSim Example Load Contention
49 Example 1 STREAMING access
__kernel void __attribute__((max_work_group_size(256)))
example1(__global int *a, __global int *b)
{
    __local int buffer[256];
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int size = get_local_size(0);
    buffer[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    b[gid] = buffer[lid] + buffer[size-lid-1];
}
50 FPGA Translation simplified view with memory, optimized Load Store Local Memory Global Memory Load Load + Store
51 ModelSim Example No Load Contention
52 Memory Partitioning In some applications, it is important to place data in appropriate memory locations to fully utilize DDR3 bandwidth across both DIMMs. A good example is single-precision general matrix multiplication (SGEMM).
53 Example 2 SGEMM (1024x1024)
#define BLOCK_SIZE 32
#define AS(i, j) A_local[j + i * BLOCK_SIZE]
#define BS(i, j) B_local[j + i * BLOCK_SIZE]
__kernel void sgemm(__global float *restrict C, __global float *A, __global float *B,
                    __local float *restrict A_local, __local float *restrict B_local)
{
    // Block index
    int bx = get_group_id(0); int by = get_group_id(1);
    // Thread index
    int tx = get_local_id(0); int ty = get_local_id(1);
    float sum = 0.0f;
    for (int i = 1024 * BLOCK_SIZE * by, j = BLOCK_SIZE * bx;
         i <= 1024 * BLOCK_SIZE * (1+by) - 1;
         i += BLOCK_SIZE, j += BLOCK_SIZE * 1024) {
        AS(ty, tx) = A[i * ty + tx];
        BS(ty, tx) = B[j * ty + tx];
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            sum += AS(ty, k) * BS(k, tx);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = sum;
}
Callout: Global Memory Accesses (the AS/BS loads from A and B)
54 Example 2 SGEMM (default) Kernel Load A DIMM1 A B C Load B Store C DIMM2 A B C
55 Example 2 SGEMM (partitioned) Kernel Load A DIMM1 A Load B Store C DIMM2 B C
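One way to request this placement from the host is through per-bank buffer flags; a sketch, assuming the SDK exposes Altera-specific flags such as CL_MEM_BANK_1_ALTERA and CL_MEM_BANK_2_ALTERA (the exact flag names depend on the SDK version):

    /* Sketch: put A on DIMM1 and B, C on DIMM2, matching the partitioned diagram.
       ctx, nbytes, and err are placeholder names; error checks omitted. */
    cl_mem A = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_BANK_1_ALTERA, nbytes, NULL, &err);
    cl_mem B = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_BANK_2_ALTERA, nbytes, NULL, &err);
    cl_mem C = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_BANK_2_ALTERA, nbytes, NULL, &err);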
56 Kernel/Host Work Balancing Sometimes work that could be done in the kernel (perhaps a little more efficiently there) can also be performed efficiently by the host. The cost of implementing the minor improvement in hardware is simply not worth it; it may be better to off-load some operations to the host and simply store their result in memory.
57 Example - FFT
__kernel void fft(__global float2 *datain, __global float2 *dataout,
                  const unsigned N, const unsigned log2n)
{
    __local float2 array1[FFT_POINTS];
    __local float2 array2[FFT_POINTS];
    __local float2 *scratch1 = array1;
    __local float2 *scratch2 = array2;
    const unsigned tid = get_local_id(0);
    const unsigned group_id = get_group_id(0);
    const unsigned gid = group_id * FFT_POINTS + tid;
    // Load data - assuming N/2 threads
    scratch1[tid] = datain[gid];
    scratch1[tid + FFT_POINTS/2] = datain[gid + FFT_POINTS/2];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (unsigned t = 0; t < LOG2N; t++) {
        float2 c0 = scratch1[tid];
        float2 c1 = scratch1[tid + FFT_POINTS/2];
        unsigned outsubarrayid = tid >> t;
        unsigned outsubarraysize = 0x1 << (t+1);
        unsigned halfoutsubarraysize = outsubarraysize >> 1;
        unsigned posinhalfoutsubarray = tid & (halfoutsubarraysize-1);
        const float TWOPI = 2.0f * M_PI_F;
        float theta = -1 * TWOPI * posinhalfoutsubarray / outsubarraysize;
        float2 term2;
        term2.x = ( c1.x*cos(theta) - c1.y*sin(theta) );
        term2.y = ( c1.y*cos(theta) + c1.x*sin(theta) );
        unsigned index = posinhalfoutsubarray * (FFT_POINTS >> (t+1));
        scratch2[outsubarrayid * outsubarraysize + posinhalfoutsubarray] = c0 + term2;
        scratch2[outsubarrayid * outsubarraysize + posinhalfoutsubarray + halfoutsubarraysize] = c0 - term2;
        // Swap the scratch arrays
        __local float2 *f2tmp = scratch1;
        scratch1 = scratch2;
        scratch2 = f2tmp;
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    dataout[gid] = scratch1[tid];
    dataout[gid + FFT_POINTS/2] = scratch1[tid + FFT_POINTS/2];
}
Callouts: Setup, Barrier, Index computation, Compute Twiddle Factor, Apply butterfly operations, Swap buffers, Barrier, Store Result
58 Example - FFT (same kernel as the previous slide) Callout: the twiddle-factor computation, with sin/cos evaluated inside every loop iteration, is a large computation, parts of which can be pre-computed.
59 Example - FFT The kernel now takes an extra argument, __global float2 *twiddle, and replaces the in-kernel sin/cos with a table lookup: unsigned index = posinhalfoutsubarray * (FFT_POINTS >> (t+1)); float2 term2 = (float2)(c1.x*twiddle[index].x - c1.y*twiddle[index].y, c1.y*twiddle[index].x + c1.x*twiddle[index].y); The rest of the kernel is unchanged. Callouts: Setup, Barrier, Index computation, Load Twiddle Factor, Apply butterfly operations, Swap buffers, Barrier, Store Result
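The twiddle table itself can now be filled once on the host; a hypothetical host-side sketch (twiddle_host, twiddle_buf, and queue are placeholder names) that matches the kernel's indexing, where entry i corresponds to the angle -2*pi*i/FFT_POINTS:

    #include <math.h>
    /* Precompute cos/sin once on the host instead of per butterfly in the kernel. */
    for (unsigned i = 0; i < FFT_POINTS; i++) {
        float theta = -2.0f * (float)M_PI * (float)i / (float)FFT_POINTS;
        twiddle_host[i].s[0] = cosf(theta);   /* real part */
        twiddle_host[i].s[1] = sinf(theta);   /* imaginary part */
    }
    clEnqueueWriteBuffer(queue, twiddle_buf, CL_TRUE, 0,
                         FFT_POINTS * sizeof(cl_float2), twiddle_host, 0, NULL, NULL);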
60 Constant Cache In some applications, for example FFT, some data stored in memory is loaded once and then reused frequently. This can pose a significant bottleneck on performance. What can we do? Copying the data to local memory may not be enough, as each work-group would have to perform the copy operation. Solution: we can direct the OpenCL SDK to create a constant cache that only loads data when it is not already present, regardless of which work-group requires the data.
61 Example - FFT The only source change from the previous version: the twiddle argument is declared __constant float2 *twiddle instead of __global, directing the compiler to place a constant cache in front of the twiddle loads. As the comments in the full listing note, the scratch arrays are swapped each iteration, so scratch1 holds the valid output data when it is stored (assuming N/2 threads). The kernel body is otherwise identical.
62 FFT Datapath with a Constant Cache FFT_system FFT_wrapper FFT_function Local Memory Interconnect LOCAL MEM #0 twiddle FFT Datapath Local Memory Interconnect Local Memory Interconnect Local Memory Interconnect LOCAL MEM #1 LOCAL MEM #2 LOCAL MEM #3 Constant Cache Global Memory Interconnect MEM0 MEM1
63 ModelSim Example FFT Constant Cache FFT_system FFT_wrapper FFT_function Local Memory Interconnect LOCAL MEM #0 twiddle FFT Datapath Local Memory Interconnect Local Memory Interconnect Local Memory Interconnect LOCAL MEM #1 LOCAL MEM #2 LOCAL MEM #3 Constant Cache Load Unit Global Memory Interconnect MEM0 MEM1
64 ModelSim Example FFT Constant Cache
65 KERNEL OPTIMIZATIONS
66 Basic Optimizations We can optimize kernels via attributes: reqd_work_group_size (required work-group size) and max_work_group_size (maximum work-group size). These attributes give the compiler information about the size of work-groups; how much hardware is used to implement barriers is directly related to work-group sizes. A sketch of the syntax follows.
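A minimal sketch of the attribute syntax (kernel names and bodies are illustrative):

    /* Exact size known: barrier buffers can be sized for exactly 64 threads. */
    __kernel __attribute__((reqd_work_group_size(64, 1, 1)))
    void scale_fixed(__global float *x) { x[get_global_id(0)] *= 2.0f; }

    /* Only an upper bound known: hardware is sized for at most 256 threads. */
    __kernel __attribute__((max_work_group_size(256)))
    void scale_bounded(__global float *x) { x[get_global_id(0)] *= 2.0f; }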
67 Example Monte-Carlo Black Scholes
__kernel void __attribute__((reqd_work_group_size(128,1,1)))
black_scholes(int m, int n, float K1, float K2, float S_0, float K,
              unsigned int seed, __global float *result)
{
    const float two_to_the_neg_32 = 2.3283064365e-10f; // 2^-32
    int num_sim, num_pts, i, counter = 0;
    float value, z, diff, sum = 0.0f, S = S_0;
    int local_id = get_local_id(0), running_offset = get_local_id(0);
    __local unsigned int mt[1024];
    __local float s_partial[128];
    if (local_id == 0) RNG_init(seed, mt);
    for (num_sim = 0; num_sim < m; num_sim++) {
        barrier(CLK_LOCAL_MEM_FENCE);
        unsigned int u = RNG_read_int32(running_offset, mt);
        if (u == 0) u++;
        running_offset = (running_offset + get_local_size(0)) & 1023;
        value = ((float)u) * two_to_the_neg_32;
        if (value >= 1.0f) value = 0.999f;
        z = icdf(value);
        S *= exp(K1 + K2*z);
        counter = (counter + 1) % n;
        if (counter == 0) {
            diff = S - K;
            if (diff > 0.0f) sum += diff;
            S = S_0;
        }
    }
    s_partial[local_id] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);
    // Save the result when exiting.
    if (local_id == 0) {
        for (i = 1; i < get_local_size(0); i++) sum = sum + s_partial[i];
        result[get_group_id(0)] = sum;
    }
}
Callouts: Setup, RNG Init, Barrier, Get Random Number, ICDF, Brownian Motion, Barrier, Compute and Store Result
68 About Barriers Barriers synchronize memory operations within a workgroup to prevent hazards. They ensure that every work-item in a workgroup has completed the memory operations before the barrier prior to allowing any work-item in the same workgroup to execute memory operations after the barrier. This requires that the live variables of each thread be stored in a buffer. [Diagram: Control Logic with Thread In / Thread Out and a buffer of Live Variables 0 .. n-1]
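The canonical pattern, as in the streaming example earlier: every work-item writes local memory, the barrier guarantees those writes are visible, and only then do work-items read each other's data.

    buffer[lid] = a[gid];            // each work-item writes its own slot
    barrier(CLK_LOCAL_MEM_FENCE);    // all writes above complete before any read below
    b[gid] = buffer[lid] + buffer[size - lid - 1];   // read another work-item's slot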
69 Reducing Barrier Area The live-variable buffer at each barrier scales with the work-group size (WGS*545 per barrier in this kernel); reducing WGS from 256 to 128 means fewer on-chip RAM blocks for each of the two barriers in the pipeline (Setup, RNG Init, Barrier, Get Random Number, ICDF, Brownian Motion, Barrier, Compute and Store Result).
70 Loop Unrolling Loop unrolling can help parallelize computation inside a kernel. This optimization is important for applications such as SGEMM, where more operations mean more ability to optimize the circuit AND higher throughput. A minimal sketch follows.
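An illustrative sketch of the pragma: fully unrolling a fixed-bound loop gives the compiler four independent multiply-adds to schedule in parallel instead of one operator reused four times.

    float sum = 0.0f;
    #pragma unroll
    for (int k = 0; k < 4; ++k)
        sum += a[k] * b[k];   // becomes 4 parallel multiply-add operators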
71 Example SGEMM unrolling
#define BLOCK_SIZE 4
#define AS(i, j) A_local[j + i * BLOCK_SIZE]
#define BS(i, j) B_local[j + i * BLOCK_SIZE]
__kernel void sgemm(__global float *restrict C, __global float *A, __global float *B,
                    __local float *restrict A_local, __local float *restrict B_local)
{
    int bx = get_group_id(0); int by = get_group_id(1);
    int tx = get_local_id(0); int ty = get_local_id(1);
    float sum = 0.0f;
    for (int i = 1024 * BLOCK_SIZE * by, j = BLOCK_SIZE * bx;
         i <= 1024 * BLOCK_SIZE * (1+by) - 1;
         i += BLOCK_SIZE, j += BLOCK_SIZE * 1024) {
        AS(ty, tx) = A[i * ty + tx];
        BS(ty, tx) = B[j * ty + tx];
        barrier(CLK_LOCAL_MEM_FENCE);
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            sum += AS(ty, k) * BS(k, tx);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = sum;
}
Callouts: Setup, Load Section of each Matrix to Local Memory, Barrier, Dot Product, Barrier, Store Result
72 Unrolling SGEMM Block Diagram Setup Load Section of each Matrix to Local Memory Barrier Dot Product Barrier Store Result
73 Unrolling SGEMM Block Diagram Setup Load Section of each Matrix to Local Memory Barrier Higher Throughput Dot Product Barrier Store Result
74 Unrolling SGEMM Block Diagram Setup Setup Load Section of each Matrix to Local Memory Load Section of each Matrix to Local Memory Barrier Higher Throughput Barrier Dot Product Dot Product Barrier Barrier Store Result Store Result
75 Example Initial Implementation valid_in_0 valid_in_1 sum Load A Load B * + New sum valid_out
76 Example Improved Implementation valid_in sum Load A Load A Load A Load A Load B Load B Load B Load B * * * * New sum valid_out
77 ModelSim Example Loop Rolled vs Unrolled
78 Vectorization and Replication In many cases, we may wish to have the kernel replicate its operations several times. There are two ways to do this: vectorization and replication. Vectorization widens your datapath to use wider operands (e.g. changes float to float2); replication creates additional copies of the kernel pipeline on the FPGA. A sketch of the two attributes follows.
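A sketch of the two attributes side by side, using the names from this deck (later SDK releases renamed them, e.g. num_simd_work_items and num_compute_units); the kernel bodies are elided:

    __kernel __attribute__((num_vector_lanes(4)))   // one pipeline, 4-wide datapath
    void vectorized(__global float *x) { /* ... */ }

    __kernel __attribute__((num_copies(4)))         // four independent pipeline copies
    void replicated(__global float *x) { /* ... */ }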
79 Vectorization Example - SGEMM
#define BLOCK_SIZE 32
#define AS(i, j) A_local[j + i * BLOCK_SIZE]
#define BS(i, j) B_local[j + i * BLOCK_SIZE]
__kernel void __attribute__((num_vector_lanes(4)))
__attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
sgemm(__global float *restrict C, __global float *A, __global float *B,
      __local float *restrict A_local, __local float *restrict B_local)
{
    // Block index
    int bx = get_group_id(0); int by = get_group_id(1);
    // Thread index
    int tx = get_local_id(0); int ty = get_local_id(1);
    float sum = 0.0f;
    for (int i = 1024 * BLOCK_SIZE * by, j = BLOCK_SIZE * bx;
         i <= 1024 * BLOCK_SIZE * (1+by) - 1;
         i += BLOCK_SIZE, j += BLOCK_SIZE * 1024) {
        AS(ty, tx) = A[i * ty + tx];
        BS(ty, tx) = B[j * ty + tx];
        barrier(CLK_LOCAL_MEM_FENCE);
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            sum += AS(ty, k) * BS(k, tx);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = sum;
}
Callout: Each thread now performs the work of four!
80 SGEMM Vectorization Setup Setup Load Section of each Matrix to Local Memory Load Section of each Matrix to Local Memory Barrier Barrier Dot Product Dot Product Dot Product Dot Product Dot Product Barrier Barrier Store Result Store Result
81 Replication Example - MCBS
__kernel void __attribute__((reqd_work_group_size(128,1,1)))
__attribute__((num_copies(4)))
black_scholes(int m, int n, float K1, float K2, float S_0, float K,
              unsigned int seed, __global float *result)
{
    const float two_to_the_neg_32 = 2.3283064365e-10f; // 2^-32
    int num_sim, num_pts, i, counter = 0;
    float value, z, diff, sum = 0.0f, S = S_0;
    int local_id = get_local_id(0), running_offset = get_local_id(0);
    __local unsigned int mt[1024];
    __local float s_partial[128];
    if (local_id == 0) RNG_init(seed, mt);
    for (num_sim = 0; num_sim < m; num_sim++) {
        barrier(CLK_LOCAL_MEM_FENCE);
        unsigned int u = RNG_read_int32(running_offset, mt);
        if (u == 0) u++;
        running_offset = (running_offset + get_local_size(0)) & 1023;
        value = ((float)u) * two_to_the_neg_32;
        if (value >= 1.0f) value = 0.999f;
        z = icdf(value);
        S *= exp(K1 + K2*z);
        counter = (counter + 1) % n;
        if (counter == 0) {
            diff = S - K;
            if (diff > 0.0f) sum += diff;
            S = S_0;
        }
    }
    s_partial[local_id] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);
    if (local_id == 0) {
        for (i = 1; i < get_local_size(0); i++) sum = sum + s_partial[i];
        result[get_group_id(0)] = sum;
    }
}
Callouts: Setup, RNG Init, Barrier, Get Random Number, ICDF, Brownian Motion, Barrier, Compute and Store Result
82 Vectorization vs. Replication kernel_system More Copies CRA kernel_wrapper kernel_function CSR NDR ARGS Dispatcher Kernel Datapath Kernel Datapath Local Memory Interconnect Local Memory Interconnect Global Memory Interconnect LOCAL MEM #0- (n-1) LOCAL MEM #(n-2n-1) MEM0 MEM1 More Vector Lanes CRA kernel_system kernel_wrapper kernel_function CSR NDR ARGS Dispatcher Kernel Datapath Global Memory Interconnect Local Memory Interconnect MEM0 MEM1 LOCAL MEM #0- (n-1)
83 FLOATING POINT OPTIMIZATIONS
84 Tree-Balancing Floating-point operations are generally not reordered, because rounding after each operation affects the outcome, i.e. ((a+b)+c) != (a+(b+c)). This usually creates an imbalance in a pipeline, which costs area. We can remedy the problem by telling the compiler that it is acceptable to balance operations; for example, create a tree of floating-point additions in SGEMM rather than a chain. Use the --fp-relaxed=true flag in the OpenCL SDK.
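What the flag permits, in a line of illustrative code: the compiler may rewrite the depth-3 chain as a depth-2 tree, changing rounding behavior but shortening and balancing the pipeline.

    float chain    = ((a + b) + c) + d;   // chain: 3 sequential FP adds
    float balanced = (a + b) + (c + d);   // tree: 2 levels, first two adds in parallel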
85 Example FP balancing [Diagram: a chain of adders with a register after each stage is rebalanced into an adder tree, using fewer registers] Area Reduction
86 Why is FP large? Breaking a number into exponent and mantissa requires pre- and post-processing. Some operations are easier than others: multiplication requires the addition of exponents, but the mantissas are multiplied as though they were integers; addition requires barrel shifters to align the inputs so that the mantissas are added when the exponents are equal. Corner cases must also be handled: +/- Infinity, NaN (not-a-number), and denormalized numbers (exponent = 0).
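A sketch of that pre-processing in OpenCL C (illustrative): every FP operator first unpacks its operands into these fields, and re-packing afterwards is the normalization/rounding cost discussed next.

    unsigned u        = as_uint(f);        // reinterpret the float's bits
    unsigned sign     = u >> 31;
    unsigned exponent = (u >> 23) & 0xFF;  // 0 indicates a denormalized number
    unsigned mantissa = u & 0x7FFFFF;      // implicit leading 1 when exponent != 0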
87 Hardware circuit for FP adder Comprises Alignment (100 ALMs), Operation (21 ALMs), Normalization (81 ALMs), and Rounding (50 ALMs). Normalization and rounding together occupy half of the circuit area. [Diagram: A, B → Alignment → + → Normalize → Round → Result]
88 What can we do? In a chain of operations it is possible to reduce the circuit area. Extending the size of the mantissa helps preserve precision, but costs less than keeping rounding modules around; rounding can be applied once at the end of a chain. Removing support for NaN and Inf reduces the size of the normalization and alignment logic. To enable FPC in Altera's OpenCL SDK: --opt-arg -fpc=true. [Diagram: A, B, C, D → Alignment → + → reduced Normalize → Alignment → + → Normalize → single Round → Result]
89 CASE STUDIES
90 AES Encryption Counter (CTR) based encryption/decryption with a 256-bit key. Advantage FPGA: integer arithmetic, coarse-grain bit operations, complex decision making. Results compare throughput (GB/s) on an E5503 Xeon Processor (0.01, single core), an AMD Radeon HD GPU, and the PCIe385 A7 Accelerator (2 kernels, partial utilization); the FPGA also conserves power, with room to fill up the device for even higher performance.
91 Multi-Asset Barrier Option Pricing Monte-Carlo simulation of the Heston model: $dS_t = \mu S_t\,dt + \sqrt{v_t}\,S_t\,dW^S_t$, $dv_t = \kappa(\theta - v_t)\,dt + \xi\sqrt{v_t}\,dW^v_t$. ND Range: Assets x Paths (64x ). Advantage FPGA: complex control flow. Results compare power (W), performance (Msims/s), and Msims/W on a W3690 Xeon Processor, an NVIDIA Tesla C GPU, and the PCIe385 D5 Accelerator.
92 Document Filtering Unstructured data analytics using a Bloom Filter. Advantage FPGA: integer arithmetic, flexible memory configuration. Results compare power (W), performance (MTs), and MTs/W on a W3690 Xeon Processor, an NVIDIA Tesla C GPU, a DE4 Stratix IV-530 Accelerator, and the PCIe385 A7 Accelerator.
93 Fractal Video Compression Best matching codebook; correlation with SAD. Advantage FPGA: integer arithmetic. Results compare power (W), performance (FPS), and FPS/W on a W3690 Xeon Processor, an NVIDIA Tesla C GPU, a DE4 Stratix IV-530 Accelerator, and the PCIe385 A7 Accelerator.
94 Thank You
95 ModelSim Example Load Contention
96 ModelSim Example No Load Contention
97 ModelSim Example FFT Constant Cache
98 ModelSim Example Loop Rolled
99 ModelSim Example Loop Rolled Zoomed
100 ModelSim Example Loop Unrolled
101 Example FFT; Block Diagram Load Data for Processing merge Index Computation Const Load (twiddle) Compute Butterfly Save Intermediate Value Last Iteration? Store Result
102 Example FFT; High-level Data Flow Graph Load Data for Processing merge Index Computation Const Load (twiddle) Compute Butterfly Constant Load will generate a cache to reduce demand on DDRx bandwidth Save Intermediate Value Last Iteration? Store Result