Tutorial: Harnessing the Power of FPGAs using Altera's OpenCL Compiler Desh Singh, Tom Czajkowski, Andrew Ling




OPENCL INTRODUCTION

Programmable Solutions Technology scaling favors programmability and parallelism: single-core and multi-core CPUs and DSPs (coarse-grained), massively parallel processor arrays and GPGPUs (coarse-grained massively parallel arrays), and FPGAs (fine-grained massively parallel arrays).

CPU + Hardware Accelerator: ARM-based system-on-a-chip (SoC) devices by Intel, Apple, NVIDIA, Qualcomm, TI, and others; SoC FPGAs by Altera.

OpenCL Overview Open Computing Language. Software-centric: C/C++ API for the host program, OpenCL C (C99-based) for the acceleration device. Unified design methodology: CPU offload, memory access, parallelism, vectorization. (Figure: the host CPU runs the C/C++ API, the hardware accelerator runs OpenCL C.)

OpenCL Driving Force Developed and published by the Khronos consortium (http://www.khronos.org); a royalty-free standard. Application-driven: consumer electronics (image processing and video encoding, augmented reality and computational photography), military (video and radar analytics), scientific / high-performance computing (financial, molecular dynamics, bioinformatics, etc.), medical rendering algorithms.

OpenCL Abstract Programming Model. Host: main() { read_data( ); manipulate( ); clEnqueueWriteBuffer( ); clEnqueueNDRangeKernel( , sum, ); clEnqueueReadBuffer( ); display_result( ); } Device: global memory and compute units with local memories. Explicit data storage (hierarchical memory model); explicit parallelism (vectorization, multi-threading). __kernel void sum(__global float *a, __global float *b, __global float *y) { int gid = get_global_id(0); y[gid] = a[gid] + b[gid]; }

OpenCL Host Program Pure software written in standard C. Communicates with the accelerator device via a set of library routines which abstract the communication between the host processor and the kernels: copy data from host to FPGA, ask the FPGA to run a particular kernel, copy data from FPGA to host. main() { read_data_from_file( ); manipulate_data( ); clEnqueueWriteBuffer( ); clEnqueueTask( , my_kernel, ); clEnqueueReadBuffer( ); display_result_to_user( ); }
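The host pseudo-code above omits the OpenCL setup boilerplate. Below is a minimal sketch (not from the slides) of what a complete host program looks like around those enqueue calls; the platform/device query, buffer sizes, and program creation are illustrative assumptions, and error checking is omitted.

    #include <stdio.h>
    #include <CL/cl.h>

    #define N 1024

    int main(void) {
        float a[N], b[N], y[N];
        /* read_data_from_file() / manipulate_data() would fill a[] and b[] here */

        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

        /* For an FPGA target the program is typically created from a precompiled
           binary with clCreateProgramWithBinary(); details elided in this sketch. */
        cl_program prog = NULL;
        cl_kernel sum = clCreateKernel(prog, "sum", NULL);

        cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof(a), NULL, NULL);
        cl_mem d_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof(b), NULL, NULL);
        cl_mem d_y = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(y), NULL, NULL);

        /* Copy data from host to FPGA */
        clEnqueueWriteBuffer(q, d_a, CL_TRUE, 0, sizeof(a), a, 0, NULL, NULL);
        clEnqueueWriteBuffer(q, d_b, CL_TRUE, 0, sizeof(b), b, 0, NULL, NULL);

        /* Ask the FPGA to run the kernel over N work items */
        clSetKernelArg(sum, 0, sizeof(cl_mem), &d_a);
        clSetKernelArg(sum, 1, sizeof(cl_mem), &d_b);
        clSetKernelArg(sum, 2, sizeof(cl_mem), &d_y);
        size_t gsize = N;
        clEnqueueNDRangeKernel(q, sum, 1, NULL, &gsize, NULL, 0, NULL, NULL);

        /* Copy data from FPGA back to host */
        clEnqueueReadBuffer(q, d_y, CL_TRUE, 0, sizeof(y), y, 0, NULL, NULL);

        /* display_result_to_user() */
        return 0;
    }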

OpenCL Kernels Data-parallel function. Defines many parallel threads of execution; each thread has an identifier specified by get_global_id. Contains keyword extensions to specify parallelism and memory hierarchy. Executed by a compute device (CPU, GPU, Accelerator). __kernel void sum(__global const float *a, __global const float *b, __global float *answer) { int xid = get_global_id(0); answer[xid] = a[xid] + b[xid]; } (Figure: float *a = 0 1 2 3 4 5 6 7, float *b = 7 6 5 4 3 2 1 0; after __kernel void sum( ), float *answer = 7 7 7 7 7 7 7 7.)

Flow OpenCL Host Program + Kernels. Host code: main() { read_data( ); manipulate( ); clEnqueueWriteBuffer( ); clEnqueueNDRangeKernel( , sum, ); clEnqueueReadBuffer( ); display_result( ); } goes through a standard C compiler to produce the x86 EXE. Kernel code: __kernel void sum(__global float *a, __global float *b, __global float *y) { int gid = get_global_id(0); y[gid] = a[gid] + b[gid]; } goes through the OpenCL compiler to produce Verilog, which Quartus II compiles into the SOF. Host and FPGA communicate over PCIe.

FPGA OpenCL Architecture: an x86 / external processor connects over PCIe to the FPGA; external memory controllers & PHYs connect to DDR*; a global memory interconnect feeds the kernel pipelines, each with M9K-based local memories on a local memory interconnect. Modest external memory bandwidth, extremely high internal memory bandwidth, highly customizable compute cores.

Compiling OpenCL to FPGAs. The kernel programs (__kernel void sum(__global const float *a, __global const float *b, __global float *answer) { int xid = get_global_id(0); answer[xid] = a[xid] + b[xid]; }) are compiled by the ACL Compiler into a SOF, which configures the FPGA with a pipeline of Load, Store, and compute units attached to DDRx. The host program (main() { read_data_from_file( ); manipulate_data( ); clEnqueueWriteBuffer( ); clEnqueueNDRangeKernel( , sum, ); clEnqueueReadBuffer( ); display_result_to_user( ); }) is compiled by a standard C compiler into an x86 binary. The two sides communicate over PCIe.

Mapping Multithreaded Kernels to FPGAs The simplest way of mapping kernel functions to FPGAs is to replicate hardware for each thread, but this is inefficient and wasteful. A better method takes advantage of pipeline parallelism: attempt to create a deeply pipelined representation of a kernel, and on each clock cycle attempt to send in input data for a new thread. This maps coarse-grained thread parallelism onto fine-grained FPGA parallelism.

Example Pipeline for Vector Add 8 threads for vector add example 0 1 2 3 4 5 6 7 Load Load Thread IDs + Store On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Example Pipeline for Vector Add 8 threads for vector add example 1 2 3 4 5 6 7 Load Load 0 + Store Thread IDs On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Example Pipeline for Vector Add 8 threads for vector add example 2 3 4 5 6 7 Load Load 1 + 0 Store Thread IDs On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

Example Pipeline for Vector Add 8 threads for vector add example Load 3 2 + 1 Store 0 Load 4 5 6 7 Thread IDs On each cycle the portions of the pipeline are processing different threads While thread 2 is being loaded, thread 1 is being added, and thread 0 is being stored

ALTERA OPENCL SYSTEM ARCHITECTURE

OpenCL System Architecture External Interface. (System diagram: x86 / external processor, PCIe, external memory controllers & PHY, DDR*, global memory interconnect, M9K local memories, kernel pipelines, local memory interconnect; this slide highlights the external interface.)

External Interface: High Level. PCIe with DMA and the DDR controller feed the kernel pipelines K0, K1, K2, ... Kn.

OpenCL System Architecture Kernel System. (Same system diagram; this slide highlights the kernel system: the kernel pipelines, M9K local memories, and local memory interconnect.)

Altera OpenCL Kernel Architecture [filename]_system CRA [kernel]_top_wrapper [kernel]_function_wrapper Dispatcher CSR NDR ARG 0 ARG 1 ID ITER ID ITER [kernel]_function EWI BB0 [kernel]_function EWI BB0 BB0 BB0 BB0 BB0 BB0 BB0 DONE Local Memory Interconnect Local Memory Interconnect Local Memory Interconnect Local Memory Interconnect LOCAL MEM #0 LOCAL MEM #1 LOCAL MEM #2 LOCAL MEM #3 Constant Cache Kernel Pipelines Global Memory Interconnect MEM0 MEM1

OpenCL CAD Flow. Kernel side: vectoradd_kernel.cl -> CLANG front end -> unoptimized LLVM IR -> Optimizer -> optimized LLVM IR -> RTL generator -> Verilog -> system description. Front End: parses OpenCL extensions and intrinsics, produces LLVM IR. Host side: vectoradd_host.c -> C compiler (with the ACL runtime library) -> program.exe. ACL interface: DDR*, PCIe.

OpenCL CAD Flow. Middle End: ~150 compiler passes such as loop fusion, auto-vectorization, and branch elimination, leading to more efficient hardware. (Same flow diagram as the previous slide.)

OpenCL CAD Flow. Back End: instantiate Verilog IP for each operation in the intermediate representation; create control flow circuitry to handle loops, memory stalls, and branching; apply traditional optimizations such as scheduling and resource sharing. (The generated Verilog is integrated through QSYS and compiled by Quartus together with the DDR* and PCIe interfaces.)

MEMORY ACCESS OPTIMIZATIONS

Memory Access Optimizations At a high level, present OpenCL benchmarks are memory-to-memory transformations: some data is read from global memory, processed, and the results are stored in global memory. Global memory is implemented using DDR3-1600 memory (two separate DIMMs) and has long latency; the access pattern to this memory has a significant impact on application performance and in many cases can dominate it. Local memory is implemented on-chip; it is fast, but very limited in size (54M bits on Stratix V A7 and 41M bits on Stratix V D5). The first and most significant challenge is to architect your application to balance the use of the available memories to meet design requirements.

Global Memory - DDRx Configuration. Half Rate: 256-bit, 200 MHz system and memory controller interfaces through the memory PHY to a 64-bit, 800 Mbps DDRx SDRAM interface (example clocks: 200 MHz*, 400 MHz*). Quarter Rate: 512-bit, 200 MHz system and memory controller interfaces through the memory PHY to a 64-bit, 1600 Mbps (400 MHz) DDRx SDRAM interface (example clocks: 200 MHz, 800 MHz). Wide busses on the system side. *This frequency target is an example only; it does not reflect the actual frequency configuration that is supported by the DDRx controller.

DDRx Memory Efficiency Access pattern is critical to DDR efficiency, which favors large burst transactions. There is a tradeoff between hardware complexity and the efficiency of the memory access pattern: certain access patterns do not warrant complex hardware, and complex hardware can be simplified with knowledge about the pattern or surrounding hazards. OpenCL challenge: make memory access as efficient as possible, for a variety of common access patterns, without placing a burden on device resources; convert inefficient patterns (e.g. a series of short single-word reads) into efficient patterns (e.g. a single large burst read). Random access requests use a Simple LSU, which provides simple direct access to the global memory bus; requests to consecutive addresses use a Coalescing LSU, which coalesces requests to optimally utilize the wider bus to global memory. (LSU: Load/Store Unit.)

Load/Store Memory Coalescing External memory has wide words (256/512 bits); loads/stores typically access narrower words (32 or 128 bits). (Figure: two 128-bit accesses within one 256-bit DDR word.) Given a sequence of loads/stores, we want to make as few external memory read/write requests as possible.

Load/Store Memory Coalescing Coalescing is important for good performance: combine loads/stores that access the same DDR word or the one ahead of the previously-accessed DDR word, and make one big multi-word burst request to external memory whenever possible. Fewer requests mean less contention to global memory; contiguous bursts mean less external memory overhead. Example load/store addresses (128-bit words): 1000 1001 1002 1003 1004 1005 1006 1007 coalesce into 1 burst request for 4 DDR words; 100a is 1 word; 100c 100d is 1 word; 3 requests in total.

Static Versus Dynamic Coalescing In the Altera OpenCL SDK, two types of coalescing are available: static, which can be inferred by examining the kernel source code, and dynamic, which is performed by the hardware whenever possible (but the hardware to do this is not insignificant). Making it easy for the compiler to determine the access pattern in your application will drive it towards a better implementation.
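As a small illustration (not from the slides), the two kernels below read from the same array, but the first reads a[] at consecutive, compile-time-analyzable addresses and can be coalesced statically into wide burst requests, while the second uses a data-dependent index that leaves only narrow per-work-item requests or dynamic coalescing hardware.

    /* Unit-stride access: addresses are consecutive across work items,
       so the compiler can statically coalesce them into bursts. */
    __kernel void copy_streaming(__global const float *a, __global float *b) {
        int gid = get_global_id(0);
        b[gid] = a[gid];
    }

    /* Data-dependent gather: the address comes from an index buffer,
       so the access pattern cannot be determined at compile time. */
    __kernel void copy_gather(__global const float *a, __global const int *idx,
                              __global float *b) {
        int gid = get_global_id(0);
        b[gid] = a[idx[gid]];
    }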

Example 1 RANDOM access (with ModelSim)
__kernel void __attribute__((max_work_group_size(256)))
example1(__global int *a, __global int *b)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int size = get_local_size(0);
    int group = get_group_id(0);
    int offset = (group+1)*size;
    b[gid] = a[gid] + a[offset-lid-1];
}

Example 1 RANDOM access 0 Load Load + Global Memory Store

Example 1 RANDOM access 1 Load 0 Load + Global Memory Store

Example 1 RANDOM access 2 Load 1 0 Load + Global Memory Store

Load-Store Unit - High-Level Block Diagram coalescer shared load/store sites ARB address count State FIFO waiting threads scheduled bursts Burst FIFO GLOBAL buffer DEMUX valid data count GLOBAL

Load-Store Unit - High-Level Block Diagram Kernel Pipeline Interface coalescer DDR Interconnect Interface shared load/store sites Avalon Streaming ARB State FIFO waiting threads address Burst FIFO count scheduled bursts Avalon Memory Mapped GLOBAL buffer DEMUX valid data count GLOBAL

Avalon Memory Mapped Single Request (Read/Write)

Avalon Memory Mapped Burst Request (Read)

Avalon Streaming

ModelSim Example Load Contention

Example 1 STREAMING access
__kernel void __attribute__((max_work_group_size(256)))
example1(__global int *a, __global int *b)
{
    __local int buffer[256];
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int size = get_local_size(0);
    buffer[lid] = a[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    b[gid] = buffer[lid] + buffer[size-lid-1];
}

FPGA Translation simplified view with memory, optimized Load Store Local Memory Global Memory Load Load + Store

ModelSim Example No Load Contention

Memory Partitioning In some applications, it is important to place data in appropriate memory locations to fully utilize the DDR3 bandwidth of both DIMMs. A good example is single-precision general matrix multiplication (SGEMM).
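One way to express such a placement from the host (a sketch, not from the slides) is to request a specific memory bank when each buffer is created. The Altera SDK exposed vendor-specific buffer flags for this; the flag names below (CL_MEM_BANK_1_ALTERA / CL_MEM_BANK_2_ALTERA) are assumptions that vary by SDK version, so check the SDK's CL/cl_ext.h header.

    #include <CL/cl.h>

    /* Sketch: place matrix A in one DDR bank (DIMM) and B, C in the other.
       CL_MEM_BANK_n_ALTERA are assumed, version-dependent flag names. */
    void create_partitioned_buffers(cl_context ctx, size_t bytes,
                                    cl_mem *A, cl_mem *B, cl_mem *C) {
        cl_int err;
        *A = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_BANK_1_ALTERA, bytes, NULL, &err); /* DIMM1 */
        *B = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_BANK_2_ALTERA, bytes, NULL, &err); /* DIMM2 */
        *C = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_BANK_2_ALTERA, bytes, NULL, &err); /* DIMM2 */
    }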

Example 2 SGEMM (1024x1024)
#define BLOCK_SIZE 32
#define AS(i, j) A_local[j + i * BLOCK_SIZE]
#define BS(i, j) B_local[j + i * BLOCK_SIZE]
__kernel void sgemm(__global float *restrict C, __global float *A, __global float *B,
                    __local float *restrict A_local, __local float *restrict B_local)
{
    // Block index
    int bx = get_group_id(0);
    int by = get_group_id(1);
    // Thread index
    int tx = get_local_id(0);
    int ty = get_local_id(1);

    float sum = 0.0f;
    for (int i = 1024 * BLOCK_SIZE * by, j = BLOCK_SIZE * bx;
         i <= 1024 * BLOCK_SIZE * (1 + by) - 1;
         i += BLOCK_SIZE, j += BLOCK_SIZE * 1024) {
        AS(ty, tx) = A[i + 1024 * ty + tx];   // global memory accesses
        BS(ty, tx) = B[j + 1024 * ty + tx];   // global memory accesses
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int k = 0; k < BLOCK_SIZE; ++k) {
            sum += AS(ty, k) * BS(k, tx);
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = sum;
}

Example 2 SGEMM (default): with the default allocation, A, B, and C are spread across both DIMM1 and DIMM2, so Load A, Load B, and Store C each access both DIMMs.

Example 2 SGEMM (partitioned): A is placed in DIMM1 while B and C are placed in DIMM2, so Load A uses DIMM1 and Load B / Store C use DIMM2.

Kernel/Host Work Balancing It can sometimes be the case that some work can be performed efficiently by the host that could also be done as efficiently (or a little more efficiently) in the kernel. However, the cost of implementing the minor improvement in hardware is simply not worth it; it may be better to off-load some operations to the host and simply store their result in memory.
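The FFT example on the following slides is a case in point: the twiddle factors can be computed once on the host and written to a device buffer that the kernel then simply reads. A minimal host-side sketch (not from the slides; FFT_POINTS and the buffer handle are assumptions) that matches the index computation used in the kernel:

    #include <math.h>
    #include <CL/cl.h>

    #define FFT_POINTS 1024   /* assumed transform size */

    /* Precompute twiddle factors on the host instead of evaluating
       cos/sin inside the kernel, then push the table to the device. */
    void write_twiddle_table(cl_command_queue q, cl_mem d_twiddle) {
        float tw[2 * FFT_POINTS];                /* interleaved (re, im), i.e. float2 */
        const float TWOPI = 6.28318530718f;
        for (int i = 0; i < FFT_POINTS; i++) {
            float theta = -TWOPI * i / FFT_POINTS;
            tw[2*i]     = cosf(theta);
            tw[2*i + 1] = sinf(theta);
        }
        clEnqueueWriteBuffer(q, d_twiddle, CL_TRUE, 0, sizeof(tw), tw, 0, NULL, NULL);
    }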

Example - FFT
__kernel void fft(__global float2 *datain, __global float2 *dataout,
                  const unsigned N, const unsigned log2n)
{
    __local float2 array1[FFT_POINTS];
    __local float2 array2[FFT_POINTS];
    __local float2 *scratch1 = array1;
    __local float2 *scratch2 = array2;
    const unsigned tid = get_local_id(0);
    const unsigned group_id = get_group_id(0);
    const unsigned gid = group_id * FFT_POINTS + tid;

    // Load data - assuming N/2 threads
    scratch1[tid] = datain[gid];
    scratch1[tid + FFT_POINTS/2] = datain[gid + FFT_POINTS/2];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned t = 0; t < LOG2N; t++) {
        float2 c0 = scratch1[tid];
        float2 c1 = scratch1[tid + FFT_POINTS/2];
        unsigned outsubarrayid = tid >> t;
        unsigned outsubarraysize = 0x1 << (t+1);
        unsigned halfoutsubarraysize = outsubarraysize >> 1;
        unsigned posinhalfoutsubarray = tid & (halfoutsubarraysize-1);

        const float TWOPI = 2.0f * M_PI_F;
        float theta = -1 * TWOPI * posinhalfoutsubarray / outsubarraysize;
        float2 term2;
        term2.x = (c1.x*cos(theta) - c1.y*sin(theta));
        term2.y = (c1.y*cos(theta) + c1.x*sin(theta));
        unsigned index = posinhalfoutsubarray * (FFT_POINTS >> (t+1));

        scratch2[outsubarrayid * outsubarraysize + posinhalfoutsubarray] = c0 + term2;
        scratch2[outsubarrayid * outsubarraysize + posinhalfoutsubarray + halfoutsubarraysize] = c0 - term2;

        __local float2 *f2tmp;
        f2tmp = scratch1;
        scratch1 = scratch2;
        scratch2 = f2tmp;
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    dataout[gid] = scratch1[tid];
    dataout[gid + FFT_POINTS/2] = scratch1[tid + FFT_POINTS/2];
}
Stages: Setup, Barrier, Index computation, Compute twiddle factor, Apply butterfly operations, Swap buffers, Barrier, Store result.

Example - FFT (the same kernel as on the previous slide). The highlighted twiddle-factor computation inside the loop (theta, cos(theta), sin(theta)) is a large computation, parts of which can be pre-computed.

Example - FFT
__kernel void fft(__global float2 *datain, __global float2 *dataout,
                  __global float2 *twiddle, const unsigned N, const unsigned log2n)
{
    __local float2 array1[FFT_POINTS];
    __local float2 array2[FFT_POINTS];
    __local float2 *scratch1 = array1;
    __local float2 *scratch2 = array2;
    const unsigned tid = get_local_id(0);
    const unsigned group_id = get_group_id(0);
    const unsigned gid = group_id * FFT_POINTS + tid;

    // Load data - assuming N/2 threads
    scratch1[tid] = datain[gid];
    scratch1[tid + FFT_POINTS/2] = datain[gid + FFT_POINTS/2];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (unsigned t = 0; t < LOG2N; t++) {
        float2 c0 = scratch1[tid];
        float2 c1 = scratch1[tid + FFT_POINTS/2];
        unsigned outsubarrayid = tid >> t;
        unsigned outsubarraysize = 0x1 << (t+1);
        unsigned halfoutsubarraysize = outsubarraysize >> 1;
        unsigned posinhalfoutsubarray = tid & (halfoutsubarraysize-1);

        unsigned index = posinhalfoutsubarray * (FFT_POINTS >> (t+1));
        float2 term2 = (float2)(c1.x*twiddle[index].x - c1.y*twiddle[index].y,
                                c1.y*twiddle[index].x + c1.x*twiddle[index].y);

        scratch2[outsubarrayid * outsubarraysize + posinhalfoutsubarray] = c0 + term2;
        scratch2[outsubarrayid * outsubarraysize + posinhalfoutsubarray + halfoutsubarraysize] = c0 - term2;

        __local float2 *f2tmp;
        f2tmp = scratch1;
        scratch1 = scratch2;
        scratch2 = f2tmp;
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    dataout[gid] = scratch1[tid];
    dataout[gid + FFT_POINTS/2] = scratch1[tid + FFT_POINTS/2];
}
Stages: Setup, Barrier, Index computation, Load twiddle factor, Apply butterfly operations, Swap buffers, Barrier, Store result.

Constant Cache In some applications, for example FFT, some data stored in memory is loaded once and then reused frequently; this could pose a significant bottleneck on performance. What can we do? Copying the data to local memory may not be enough, as each work group would have to perform the copy operation. Solution: we can direct the OpenCL SDK to create a constant cache that only loads data when it is not already present within it, regardless of which workgroup requires the data.

Example - FFT
__kernel void fft(__global float2 *datain, __global float2 *dataout,
                  __constant float2 *twiddle, const unsigned N, const unsigned log2n)
{ ... }
The kernel body is identical to the previous slide; only the twiddle argument has changed from __global to __constant, so its reads are served by a constant cache. (The slide's comments note that the scratch arrays are swapped each iteration, so scratch1 holds the valid output data when the result is stored, assuming N/2 threads.)

FFT Datapath with a Constant Cache FFT_system FFT_wrapper FFT_function Local Memory Interconnect LOCAL MEM #0 twiddle FFT Datapath Local Memory Interconnect Local Memory Interconnect Local Memory Interconnect LOCAL MEM #1 LOCAL MEM #2 LOCAL MEM #3 Constant Cache Global Memory Interconnect MEM0 MEM1

ModelSim Example FFT Constant Cache FFT_system FFT_wrapper FFT_function Local Memory Interconnect LOCAL MEM #0 twiddle FFT Datapath Local Memory Interconnect Local Memory Interconnect Local Memory Interconnect LOCAL MEM #1 LOCAL MEM #2 LOCAL MEM #3 Constant Cache Load Unit Global Memory Interconnect MEM0 MEM1

ModelSim Example FFT Constant Cache

KERNEL OPTIMIZATIONS

Basic Optimizations We can optimize kernels via attributes: required work group size and maximum work group size. These attributes give the compiler information about the size of work groups; how much hardware is used to implement barriers is directly related to work group sizes.
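A minimal sketch of the attribute syntax used throughout this tutorial (the kernel bodies are placeholders, not from the slides):

    /* Exact work-group size known at compile time: barrier buffers sized for 64x1x1 */
    __kernel __attribute__((reqd_work_group_size(64, 1, 1)))
    void fixed_wg(__global float *data) {
        data[get_global_id(0)] *= 2.0f;
    }

    /* Only an upper bound is known: barrier buffers sized for at most 256 work items */
    __kernel __attribute__((max_work_group_size(256)))
    void bounded_wg(__global float *data) {
        data[get_global_id(0)] *= 2.0f;
    }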

Example Monte-Carlo Black Scholes
__kernel void __attribute__((reqd_work_group_size(128,1,1)))
black_scholes(int m, int n, float K1, float K2, float S_0, float K,
              unsigned int seed, __global float *result)
{
    const float two_to_the_neg_32 = (float)(0.000000000232830643653869628f);
    int num_sim, num_pts, i, counter = 0;
    float value, z, diff, sum = 0.0f, S = S_0;
    int local_id = get_local_id(0), running_offset = get_local_id(0);
    __local unsigned int mt[1024];
    __local float s_partial[128];

    if (local_id == 0) RNG_init(seed, mt);

    for (num_sim = 0; num_sim < m; num_sim++) {
        barrier(CLK_LOCAL_MEM_FENCE);
        unsigned int u = RNG_read_int32(running_offset, mt);
        if (u == 0) u++;
        running_offset = (running_offset + get_local_size(0)) & 1023;
        value = ((float)u) * two_to_the_neg_32;
        if (value >= 1.0f) value = 0.999f;
        z = icdf(value);
        S *= exp(K1 + K2*z);
        counter = (counter + 1) % n;
        if (counter == 0) {
            diff = S - K;
            if (diff > ((float)0.0)) sum += diff;
            S = S_0;
        }
    }
    s_partial[local_id] = sum;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Save the result when exiting.
    if (local_id == 0) {
        for (i = 1; i < get_local_size(0); i++) sum = sum + s_partial[i];
        result[get_group_id(0)] = sum;
    }
}
Stages: Setup, RNG Init, Barrier, Get Random Number, ICDF, Brownian Motion, Barrier, Compute and Store Result.

About Barriers Barriers synchronize memory operations within a workgroup to prevent hazards: they ensure that each work item in a workgroup has completed its memory operations before the barrier prior to allowing another work item in the same workgroup to execute memory operations after the barrier. This requires that live variables for each thread be stored in a buffer. (Diagram: control logic, thread in / thread out, live variables 0, 1, ..., n-1.)

Reducing Barrier Area Each barrier in the Black-Scholes pipeline (Setup / RNG Init, Barrier, Get Random Number / ICDF / Brownian Motion, Barrier, Compute and Store Result) buffers WGS*545 worth of live variables. Reducing the work group size from WGS=256 to WGS=128 shrinks each of these buffers, so the same pipeline needs fewer on-chip RAM blocks.

Loop Unrolling Loop unrolling can help parallelize computation inside a kernel. This optimization is important for applications such as SGEMM, where more operations mean both more ability to optimize the circuit and higher throughput.
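In the kernel source this is requested with #pragma unroll on the loop; the sketch below (not from the slides) fully unrolls a small accumulation so that all of its multiply-adds are instantiated in the pipeline. A partial unroll factor (e.g. #pragma unroll 4) can typically be given when full unrolling is too expensive.

    __kernel void dot16(__global const float *a, __global const float *b,
                        __global float *out) {
        int gid = get_global_id(0);
        float sum = 0.0f;

        #pragma unroll   /* fully unroll: 16 multiply-adds become parallel hardware */
        for (int k = 0; k < 16; k++)
            sum += a[gid * 16 + k] * b[gid * 16 + k];

        out[gid] = sum;
    }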

Example SGEMM unrolling: the same SGEMM kernel as before, now with #define BLOCK_SIZE 4 and #pragma unroll placed immediately before the dot-product loop (for (int k = 0; k < BLOCK_SIZE; ++k) { sum += AS(ty, k) * BS(k, tx); }), so the inner loop is fully unrolled. Stages: Setup, Load section of each matrix to local memory, Barrier, Dot product, Barrier, Store result.

Unrolling SGEMM Block Diagram Setup Load Section of each Matrix to Local Memory Barrier Dot Product Barrier Store Result

Unrolling SGEMM Block Diagram Setup Load Section of each Matrix to Local Memory Barrier Higher Throughput Dot Product Barrier Store Result

Unrolling SGEMM Block Diagram Setup Setup Load Section of each Matrix to Local Memory Load Section of each Matrix to Local Memory Barrier Higher Throughput Barrier Dot Product Dot Product Barrier Barrier Store Result Store Result

Example Initial Implementation valid_in_0 valid_in_1 sum Load A Load B * + New sum valid_out

Example Improved Implementation valid_in sum Load A Load A Load A Load A Load B Load B Load B Load B * * * * + + + + New sum valid_out

ModelSim Example Loop Rolled vs Unrolled

Vectorization and Replication In many cases we may wish to have the kernel replicate its operations several times. There are two ways to do this: vectorization and replication. Vectorization extends your datapath to use wider operands (e.g. changes float to float2); replication creates additional copies of the kernel pipeline on the FPGA.
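Both are requested with kernel attributes. The attribute names used in this tutorial are num_vector_lanes and num_copies (later SDK releases expose the same concepts under different names); the kernel bodies below are placeholders, not from the slides.

    /* Vectorization: each work item is widened to process 4 elements on a 4-lane datapath */
    __kernel __attribute__((num_vector_lanes(4)))
             __attribute__((reqd_work_group_size(64, 1, 1)))
    void scale_vec(__global float *data) {
        data[get_global_id(0)] *= 2.0f;
    }

    /* Replication: four complete copies of the kernel pipeline share global memory */
    __kernel __attribute__((num_copies(4)))
             __attribute__((reqd_work_group_size(64, 1, 1)))
    void scale_rep(__global float *data) {
        data[get_global_id(0)] *= 2.0f;
    }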

Vectorization Example - SGEMM: the same SGEMM kernel (BLOCK_SIZE 32, #pragma unroll on the dot-product loop), now declared with __attribute__((num_vector_lanes(4))) and __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1))). Each thread now performs the work of four!

SGEMM Vectorization Setup Setup Load Section of each Matrix to Local Memory Load Section of each Matrix to Local Memory Barrier Barrier Dot Product Dot Product Dot Product Dot Product Dot Product Barrier Barrier Store Result Store Result

Replication Example - MCBS: the same Black-Scholes kernel as before, now declared with __attribute__((reqd_work_group_size(128,1,1))) and __attribute__((num_copies(4))), which instructs the compiler to create four copies of the kernel pipeline. Stages: Setup, RNG Init, Barrier, Get Random Number, ICDF, Brownian Motion, Barrier, Compute and Store Result.

Vectorization vs. Replication kernel_system More Copies CRA kernel_wrapper kernel_function CSR NDR ARGS Dispatcher Kernel Datapath Kernel Datapath Local Memory Interconnect Local Memory Interconnect Global Memory Interconnect LOCAL MEM #0- (n-1) LOCAL MEM #(n-2n-1) MEM0 MEM1 More Vector Lanes CRA kernel_system kernel_wrapper kernel_function CSR NDR ARGS Dispatcher Kernel Datapath Global Memory Interconnect Local Memory Interconnect MEM0 MEM1 LOCAL MEM #0- (n-1)

FLOATING POINT OPTIMIZATIONS

Tree-Balancing Floating-point operations are generally not reordered, because rounding after each operation affects the outcome, i.e. ((a+b) + c) != (a+(b+c)). This usually creates an imbalance in a pipeline, which costs area. We can remedy the problem by telling the compiler that it is acceptable to balance operations, for example to create a tree of floating-point additions in SGEMM rather than a chain. Use the --fp-relaxed=true flag in the OpenCL SDK.
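The situation arises naturally when an unrolled accumulation is compiled: without reordering, the eight additions in the sketch below (not from the slides) form a dependent chain; with the --fp-relaxed=true flag the compiler is allowed to rebuild them as a balanced adder tree.

    __kernel void accum8(__global const float *a, __global float *out) {
        int base = get_global_id(0) * 8;
        float sum = 0.0f;

        #pragma unroll   /* eight dependent additions: a chain unless reordering is allowed */
        for (int k = 0; k < 8; k++)
            sum += a[base + k];

        out[get_global_id(0)] = sum;
    }
    /* The flag is passed to the offline kernel compiler, e.g. (exact command depends on SDK version):
       aoc --fp-relaxed=true accum8.cl */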

Example FP balancing: a chain of floating-point adders with registers between stages is rebalanced into a tree of adders, giving an area reduction.

Why is FP large? Breaking up a number into exponent and mantissa requires pre- and post-processing. Some operations are easier than others: multiplication requires the addition of exponents, but mantissas are multiplied as though they were integers; addition requires barrel shifters to align inputs such that the mantissas are added when exponents are equal. Corner cases must be handled: +/- infinity, NaN (not-a-number), and denormalized numbers (exponent = 0).

Hardware circuit for FP adder Comprises alignment (100 ALMs), operation (21 ALMs), normalization (81 ALMs), and rounding (50 ALMs); normalization and rounding together occupy half of the circuit area. (Diagram: inputs A and B, alignment, +, normalize, round, result.)

What can we do? In a chain of operations it is possible to reduce the circuit area. Extending the size of the mantissa helps preserve precision, but costs less than keeping rounding modules around, so rounding can be applied once at the end of a chain. Removing support for NaN and Inf reduces the size of the normalization and alignment logic. To enable FPC in Altera's OpenCL SDK, use --opt-arg -fpc=true. (Diagram: inputs A, B, C, D pass through alignment and addition stages, with a single normalize and round at the end of the chain producing the result.)

CASE STUDIES

AES Encryption Counter (CTR) based encryption/decryption, 256-bit key. Advantage FPGA: integer arithmetic, coarse-grain bit operations, complex decision making. Results, throughput (GB/s): E5503 Xeon Processor 0.01 (single core); AMD Radeon HD 7970 0.33; PCIe385 A7 Accelerator 5.20 at 42% utilization (2 kernels), with power conservation and room to fill up the device for even higher performance.

Multi-Asset Barrier Option Pricing Monte-Carlo simulation, Heston model: dS_t = mu*S_t*dt + sqrt(v_t)*S_t*dW_t^S, dv_t = kappa*(theta - v_t)*dt + xi*sqrt(v_t)*dW_t^v. ND range: assets x paths (64 x 1000000). Advantage FPGA: complex control flow. Results (platform, power (W), performance (Msims/s), Msims/W): W3690 Xeon Processor 130, 32, 0.25; nvidia Tesla C2075 225, 63, 0.28; PCIe385 D5 Accelerator 23, 170, 7.40.

Document Filtering Unstructured data analytics, Bloom filter. Advantage FPGA: integer arithmetic, flexible memory configuration. Results (platform, power (W), performance (MTs), MTs/W): W3690 Xeon Processor 130, 2070, 15.92; nvidia Tesla C2075 215, 3240, 15.07; DE4 Stratix IV-530 Accelerator 21, 1755, 83.57; PCIe385 A7 Accelerator 25, 3602, 144.08.

Fractal Video Compression Best matching codebook, correlation with SAD. Advantage FPGA: integer arithmetic. Results (platform, power (W), performance (FPS), FPS/W): W3690 Xeon Processor 130, 4.6, 0.035; nvidia Tesla C2075 215, 53.1, 0.247; DE4 Stratix IV-530 Accelerator 21, 70.9, 2.976; PCIe385 A7 Accelerator 25, 74.4, 2.976.

Thank You

ModelSim Example Load Contention

ModelSim Example No Load Contention

ModelSim Example FFT Constant Cache

ModelSim Example Loop Rolled

ModelSim Example Loop Rolled Zoomed

ModelSim Example Loop Unrolled

Example FFT; Block Diagram Load Data for Processing merge Index Computation Const Load (twiddle) Compute Butterfly Save Intermediate Value Last Iteration? Store Result

Example FFT; High-level Data Flow Graph Load Data for Processing merge Index Computation Const Load (twiddle) Compute Butterfly Constant Load will generate a cache to reduce demand on DDRx bandwidth Save Intermediate Value Last Iteration? Store Result