Altera SDK for OpenCL v14.1. February 6, 2015




Industry Challenges
- A variety of applications are becoming bottlenecked by scalable performance requirements
  - E.g. object detection and recognition, image tracking and processing, cryptography, cloud, search engines, deep packet inspection, etc.
- Overloading CPUs' capabilities
  - Frequencies are capped
  - Processors keep adding more cores
  - Need to coordinate all the cores and manage data
- Product life cycles are long
  - GPUs' lifespan is short
  - Require re-optimization and regression testing between generations
  - Support agreements for GPUs are costly
- Power dissipation of CPUs and GPUs limits system size
- Maintaining coherency throughout a scalable system

OpenCL and FPGAs Address These Challenges
- Power-efficient acceleration
  - Typically 1/5 the power of a GPU and orders of magnitude more performance per watt than a CPU
- FPGA life cycle over 15 years
  - GPUs' lifespan is short
  - Require re-optimization testing between generations
  - FPGA OpenCL code retargeted to future devices without modification
- Our OpenCL flow abstracts away the FPGA hardware flow
  - Puts the FPGA into software engineers' hands
- Our OpenCL SDK allows for streaming IO channels and kernel channels
  - Data movement without host involvement
  - Low-latency data transmissions to the accelerator
- Shared virtual memory
  - IBM CAPI and Intel QPI

Efficiency via Specialization
[Figure: efficiency of GPUs, FPGAs, and ASICs]
Source: Bob Brodersen, Berkeley Wireless group

Application Development Paradigm
[Figure: developer populations, labeled ASIC, FPGA programmers, parallel programmers, standard CPU programmers]
- OpenCL expands the number of application developers

More SW Engineering Resources than HW?
- 1000:1 software engineers to FPGA designers
- Software engineers are not used to long compile times
- OpenCL solves this!
  - Our OpenCL flow abstracts away the FPGA hardware flow, bringing the FPGA to low-level software programmers
  - Software developers write, optimize and debug in their familiar software environment
    - Quartus is run behind the scenes
    - Emulator and profiler are software development tools
    - Pushing long compile times to the end
  - OpenCL optimization doesn't require a board
    - Allowing SW to drive board requirements (.xml file)

OpenCL on FPGAs Fits Into All Markets
Data processing algorithms:
- Automotive/Industrial (pedestrian detection, motion estimation)
- Military/Government (crypto, image detection)
- Networking (DPI, SDN, NFV)
- Computer & Storage (HPC, financial, data compression)
- Broadcast, Consumer (video image processing)
- Medical (diagnostic image processing, bioinformatics)

OpenCL and FPGA Acceleration in the News
- IBM and Altera Collaborate on OpenCL
  "IBM's collaboration with Altera on OpenCL and support of the IBM Power architecture with the Altera SDK for OpenCL can bring more innovation to address Big Data and cloud computing challenges," said Tom Rosamilia, senior vice president, IBM Systems.
- Intel Reveals FPGA and Xeon in One Socket
  "That allows end users that have applications that can benefit from acceleration to load their IP and accelerate that algorithm on that FPGA as an offload," explained the vice president of Intel's data center group, Diane Bryant.
- Search Engine Gets Help From FPGA
  "Altera was really interested in helping with the development; the resources they were willing to throw our way were more significant than those from Xilinx." (Microsoft engineering manager)
- Baidu and Altera Demonstrate Faster Image Classification
  Altera Corp. and Baidu, China's largest online search engine, are collaborating on using FPGAs and convolutional neural network (CNN) algorithms for deep learning applications.
- Xilinx Announces SDAccel Development Environment for OpenCL Delivering Up to 25X Better Performance/Watt to the Data Center

What is OpenCL?
- A software programming model for software engineers and a software methodology for system architects
  - First industry standard for heterogeneous computing
- Provides increased performance with hardware acceleration
  - Low-level programming language
  - Based on ANSI C99
- Open, royalty-free standard
  - Managed by Khronos Group
  - Altera is an active member
  - Conformance requirements
    - V1.0 is current reference
    - V2.0 is current release
  - http://www.khronos.org
[Figure: host (C/C++ API) and accelerator (OpenCL C)]

Why Does OpenCL Exist?
[Figure: programmability versus performance, with labels CPU (single-core, multi-core; C/C++, AVX/OpenMP), GPGPU (driver, stream API, CUDA), and OpenCL (heterogeneous programming architecture, PCIe accelerator programming language)]

Programming Language Offerings
Rows of the slide's comparison table, with entries listed left to right:
- Target: GPU, multi-core CPU; DSP/embedded; FPGA
- Scope: system (heterogeneous platform); device; IP block
- Designer: programmer; embedded programmer; hardware designer
- Design flow: CUDA/OpenCL; Code Composer Studio (TI C); HLS; Quartus II (Verilog/VHDL)
- Design activity: task parallelism, data parallelism; real-time function acceleration; IP design and integration
- Design constraints: throughput/latency, power efficiency; real-time execution, cost; clock frequency, resource utilization, interface requirements, power
- Hardware knowledge: none (coding style guidelines); limited (macro-architecture, bandwidth level); yes (protocol level, timing closure, micro-architecture)
- Today; Today; PoC

HLS vs OpenCL Positioning
OpenCL:
- Targets CPUs, GPUs and FPGAs
- Target user is a software developer
- Implements the FPGA in a software development flow
- Performance is determined by the resources allocated
- Host required
HLS:
- Targets FPGAs
- Target user is an FPGA designer
- Implements the FPGA in a traditional FPGA development flow
- Performance is defined, and the amount of resource needed to achieve it is reported
- Host not required

Altera SDK for OpenCL Competitive Differentiator
- Altera's SDK for OpenCL has proven to be a powerful solution for many vendors
  - Won the design tool and development software Elektra award in Europe
  - Won Ultimate Product of the Year for 2014
- Actively being used today:
  "I was extremely happy to get a great performance with such low effort. I was so impressed with how powerful the Altera tool was!"
  -- Senior Engineer, Altera OpenCL Customer

First Conformant OpenCL Solution for FPGAs!
- OpenCL v1.0 specification
- >8500 programs tested
- Supports ARM host (CV and AV SoC)
- http://www.khronos.org/conformance/adopters/conformant-companies
- http://www.khronos.org/conformance/adopters/conformant-products

Heterogeneous Platform Model
[Figure: OpenCL platform model. A host with host memory is connected to a (compute) device with global memory; the device contains compute units, which contain processing elements. Example platform: x86 host over PCIe.]

Heterogeneous Platform Model (multiple devices)
[Figure: OpenCL platform model. A host with host memory is connected to several devices, each with its own global memory. Example platform: x86 host over PCIe.]

OpenCL Use Model: Abstracting the FPGA Away

Host code:

    main() {
        read_data(...);
        manipulate(...);
        clEnqueueWriteBuffer(...);
        clEnqueueNDRangeKernel(..., sum, ...);
        clEnqueueReadBuffer(...);
        display_result(...);
    }

OpenCL accelerator code:

    __kernel void sum(__global float *a,
                      __global float *b,
                      __global float *y)
    {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }

Build flow:
- Host code -> standard gcc compiler -> EXE -> host
- Accelerator code -> Altera Offline Compiler -> Verilog -> Quartus II -> AOCX -> accelerator
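To make the use model concrete without an FPGA or an OpenCL runtime, the following is a minimal plain-C sketch of what the `sum` kernel computes. The loop stands in for the NDRange launch, with one call of the kernel body per global ID. The function names `sum_work_item` and `emulate_sum_ndrange` are hypothetical and are not part of OpenCL or the Altera SDK.

```c
#include <stddef.h>

/* Body of the `sum` kernel for a single work-item.
   `gid` stands in for the value get_global_id(0) would return. */
static void sum_work_item(size_t gid, const float *a, const float *b, float *y)
{
    y[gid] = a[gid] + b[gid];
}

/* Hypothetical stand-in for an NDRange launch: runs the kernel body once
   for every global ID in [0, global_size). A real device may execute
   these work-items in parallel or as a pipeline. */
void emulate_sum_ndrange(size_t global_size,
                         const float *a, const float *b, float *y)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        sum_work_item(gid, a, b, y);
}
```

Calling `emulate_sum_ndrange(n, a, b, y)` on three arrays of length `n` fills `y` with the element-wise sums, which is the result the host would read back with clEnqueueReadBuffer.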

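OpenCL Programming Model
[Figure: host.c (including opencl.h) is built with gcc and works through driver, platform, context, device, queue, program, kernel and buffer objects, then launches the kernel; device.cl is built with aoc. Application stages shown: acquire, compute, visualize.]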

The Only Custom Accelerator Solution: Platforms
[Figure: FPGA platform split into two domains]
- OpenCL domain, built with the Altera OpenCL Compiler
  - Kernel IP blocks
  - Interconnect
- IO infrastructure, a prebuilt BSP made with standard HDL tools by an FPGA developer
  - DDR3 memory interfaces (DDR)
  - QDRII memory interfaces (QDR)
  - 10Gb MAC/UOE data interfaces (10G network)
  - PCIe gen2 x8 host interface (host)

Altera Reference Platforms

                   Network Enabled                     High Performance Computing (HPC)
Requirement        Low latency                         Compute power / memory bandwidth
Global memory      DDR and QDRII+                      Large amount of DDR
IO channels        2x10GbE (MAC/UOE)                   None (minimize IP overhead)
Reference design   OPRA (streaming),                   Option pricing
                   trading (with global memory access)

[Figure: architecture of each platform. Both show the OpenCL API, HAL, UMD and KMD software stack over PCIe and DMA to a Stratix V FPGA running the OpenCL kernels with DDR3; the network-enabled platform adds two 10G UDP ports, a CPLD bridge and flash.]

SoC Reference Platforms
- HPS block removes the complexities of BSP creation
- Coherency between host and accelerator
- The OpenCL Platforms page contains the CV SoC devkit platform user's guide
[Figure: block diagram with labels HPS, HPS DDR3, Stratix V FPGA, H2F/F2H, LWH2F (CSR, 32-bit, 50 MHz), F2S, FPGA memory, OpenCL kernels, DVI, DVO, scratch DDR3, camera, monitor]

Altera Network Enabled Reference Platform for OpenCL
[Figure: reference design stack. Software layer: host.c (C/C++ API) and device.cl (OpenCL C), built by the compiler. Hardware layer: reference platform with host and device on the reference board.]
- Host OS: 64-bit RHEL 6.4, Windows 7
- Reference board: s5_hft (S5PH-Q)

Guaranteed Timing Flow
1. Inputs to AOC: kernel.cl and Boardspec.xml
2. The board logic comes from a post-fit QXP partition (PCIe, UniPHY, DMA, ...)
3. Synthesis / P&R / STA run on the OpenCL kernels ONLY
4. Meet timing?
   - Yes: DONE!
   - No: reconfigure the kernel PLL and re-run STA with the new PLL value

Heterogeneous Memory Support
[Figure: host with host memory and host IO; device with compute units (CU), interface, IO, Global Memory1 and Global Memory2]
- Memories with different characteristics
  - DDR: sequential access
  - QDR: random access
  - On-chip: low latency
  - MoSys: efficient
  - HMC: high capacity
- Combine different memories
  - Attribute-based
  - Automatic

    __kernel void foo(__global uint *data
                      __attribute__((buffer_location("QDR"))))
    {
        foo(data[i]);
    }

Channels Advantage
[Figure: two copies of the platform diagram (DDR3 and QDRII interfaces, 10Gb network interfaces, host interface, CvP update, interconnect, OpenCL kernels). One is labeled "Standard OpenCL"; the other is labeled "Altera Vendor Extension: IO and Kernel Channels".]

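Kernel Development Flow
[Flowchart] Modify kernel.cl, then iterate through the tools until done:
- x86 emulator (seconds)
- Optimization report (minutes)
- Prototype (minutes)
- Profiler (hours)
Questions asked along the way:
- Functional bugs?
- Stall-free pipeline?
- Memory coalesced?
- Hardware performance met?
DONE!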

x86 Emulator (Beta, v14.1)
- Enables functional debug of kernel code on an x86 system
- Prototype support to allow users to run kernels on an x86 platform
- Debug support for Altera vendor-specific features such as channels
- Supports:
  - OpenCL syntax
  - Channels
  - Printf

    __kernel void accel(...)
    {
        gid = get_global_id(0);
        out[gid] = proc(data[gid]);
    }

[Figure: kernel -> x86 kernel compiler -> ./kernel_tb -> running]

Example: Load to Store Dependency

    1  __kernel void prefixsum( __global int* restrict A, unsigned N ) {
    2    for ( unsigned i = 1 ; i < N ; i++ ) {
    3      int a = A[i-1];
    4      A[i] += a;
    5    }
    6  }

    ==============================================================================
                            *** Optimization Report ***
    ==============================================================================
    Kernel: prefixsum                                                      Ln.Col
    ==============================================================================
    Loop for.body                                                            2.25
      Pipelined execution inferred.
      Successive iterations launched every 321 cycles due to:
        Memory dependency on Load Operation from:                            3.21
          Store Operation                                                    4.7
        Largest Critical Path Contributors:
          49%: Load Operation                                                3.21
          49%: Store Operation                                               4.7
    ==============================================================================

Slide callouts:
- Relative cost of global memory to local computation
- True fix requires restructuring the code
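The slide says the true fix requires restructuring the code but does not show one. The following is a minimal plain-C sketch of one possible restructuring, assuming the goal is to stop each iteration from loading the value the previous iteration just stored. The function names are hypothetical, and the sketch makes no claim about the initiation interval the Altera compiler would actually report for it.

```c
#include <stddef.h>

/* As on the slide: iteration i loads A[i-1], which iteration i-1 stored. */
void prefixsum_naive(int *A, unsigned N)
{
    for (unsigned i = 1; i < N; i++) {
        int a = A[i - 1];
        A[i] += a;
    }
}

/* Restructured sketch: the running sum is carried in a local variable
   (a register in hardware), so no iteration reads back from memory
   what an earlier iteration wrote. */
void prefixsum_carried(int *A, unsigned N)
{
    if (N == 0)
        return;
    int running = A[0];
    for (unsigned i = 1; i < N; i++) {
        running += A[i];   /* depends only on the local accumulator */
        A[i] = running;    /* store is never re-read by the loop */
    }
}
```

Both functions produce the same in-place prefix sum; only the second keeps the loop-carried dependency out of global memory.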

Example: Accumulating a Value

    1  __kernel void test( __global float* restrict input,
    2                      __global float* restrict output,
    3                      unsigned N ) {
    4    float mul = 1.0f;
    5    for ( unsigned i = 0; i < N; i++ ) {
    6      mul *= input[ i ];
    7    }
    8    *output = mul;
    9  }

    ==================================================================================
                              *** Optimization Report ***
    ==================================================================================
    Kernel: test                                                               Ln.Col
    ==================================================================================
    Loop for.body                                                                5.24
      Pipelined execution inferred.
      Successive iterations launched every 3 cycles due to:
        Data dependency on variable mul                                          4.10
        Largest Critical Path Contributor:
          100%: Fmul Operation                                                   6.7
    ==================================================================================
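The report attributes the 3-cycle launch interval to the dependency on `mul`. One common way to relax such a dependency is to interleave several independent partial products, so that consecutive iterations update different accumulators. The following is a minimal plain-C sketch of that idea; `LANES` and the function names are assumptions for illustration, not values or APIs from the slide. Because it reorders floating-point multiplications, the result can differ from the serial loop in the last bits.

```c
#define LANES 4  /* assumed lane count; would be chosen to cover the multiplier latency */

/* Serial form, as on the slide: every multiply waits for the previous one. */
float product_serial(const float *input, unsigned N)
{
    float mul = 1.0f;
    for (unsigned i = 0; i < N; i++)
        mul *= input[i];
    return mul;
}

/* Interleaved sketch: iteration i updates partial[i % LANES], so a given
   accumulator is only touched once every LANES iterations. */
float product_interleaved(const float *input, unsigned N)
{
    float partial[LANES];
    for (unsigned k = 0; k < LANES; k++)
        partial[k] = 1.0f;

    for (unsigned i = 0; i < N; i++)
        partial[i % LANES] *= input[i];

    /* Final reduction of the partial products. */
    float mul = 1.0f;
    for (unsigned k = 0; k < LANES; k++)
        mul *= partial[k];
    return mul;
}
```

For inputs whose products are exactly representable (for example powers of two), both functions return the same value.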

Rapid Prototyping (Beta, v14.1)
- Increases productivity during application development
  - Uses a library of pre-compiled templates to skip Quartus II compilation
  - Can test small versions of the final design on hardware very quickly
- Ability to generate custom templates based on user kernels
  - Tailors the Rapid Prototyping Template Library to the user
[Figure: two flows from the user program to the HW implementation]
- aoc: OpenCL compiler -> Quartus II (~hours)
- aoc march=prototype: OpenCL compiler -> configuration from the template library (~minutes)

Profiler (Beta, v14.1)
- Instruments the pipeline with performance counters and profiling logic
- Transfers the profiling information to the host via the PCIe link

    __kernel void accel(...)
    {
        gid = get_global_id(0);
        out[gid] = a[gid] + b[gid];
    }

[Figure: kernel pipeline (load, load, +, store) with counters exposed through memory-mapped registers]

Profiler (Beta, v14.1), continued
[Figure: profiler view showing bottlenecks, bandwidth, saturation, pipeline occupancy]

OpenCL Host Library & Run Time Environment (RTE)
- Host library improvements:
  - Lower CPU usage
  - Improved scalability
  - Lower memory footprint
  - Faster run time
- SDK & Run Time Environment:

OS                       SDK (needs ACDS)    RTE
Windows x86-64           Installer           Installer
Linux (RHEL) x86-64      Installer, RPM      Installer, RPM
Linux (RHEL) Power       -                   RPM
Linux (custom) CV SoC    -                   Tarball

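Installable Client Driver (Beta, v14.1)
[Figure: host.c (including opencl.h) calls clGetPlatformIDs; the ICD loader dispatches to vendor libraries such as nvidiaopencl and AlteraOpenCL. Application stages shown: acquire, compute, visualize; kernel source in device.cl.]
- Windows: registry key HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors, with one DWORD value per <library>.dll
- Linux: /etc/OpenCL/vendors/<vendor>.icd, naming <library>.so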

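Altera Client Driver (Beta, v14.1)
[Figure: host.c (including opencl.h) calls clGetPlatformIDs through the ICD, which dispatches to nvidiaopencl or AlteraOpenCL; clGetDeviceIDs then goes through the ACD beneath AlteraOpenCL. Application stages shown: acquire, compute, visualize; kernel source in device.cl.]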

OpenCL + FPGA Key Benefits
- Faster development vs. traditional FPGA design flow
  - Puts the FPGA in the software developers' hands
  - Familiar C-based development flow
- Higher performance/watt vs. CPU/GPGPU
  - Implement exactly what you need
  - Pipeline parallel structures
  - Custom interconnect converging with data processing cores
- Lower power vs. CPU/GPGPU
  - Core frequency lower: 200-250 MHz vs. 1 GHz
  - Turn off unused logic
  - Up to 1/5 the power
- Portability & obsolescence free
  - Code can transfer between different HW accelerators (CPU, GPGPU, FPGA, etc.)
  - Code ports seamlessly to new generations of the FPGA
  - FPGA life cycle considerably longer than that of CPUs or GPGPUs

Additional Resources

Altera SDK for OpenCL Design Flow
- Set up: Getting Started Guide (document)
  - Install Quartus II v13.1 with Altera SDK for OpenCL
  - Install a C compiler or development environment
  - Obtain and set up a license from the Self Service Licensing Center
  - Install the FPGA (OpenCL) board: aocl install
- Design: Programming Guide (document)
  - Develop kernel code and compile on CPU/GPU for functional correctness
  - Build, compile & link the host application (Visual Studio/GCC)
  - Compile the OpenCL kernel with the Altera Offline Compiler (aoc)
  - Run the application
- Optimize: Best Practices (document)
  - Optimize the kernel for FPGA hardware

Additional Altera OpenCL Collateral
- White papers on OpenCL
- OpenCL online demos
- OpenCL design examples
- Instructor-led training
  - Parallel Computing with OpenCL Workshop by Altera (1 day)
  - Optimization of OpenCL for Altera FPGAs Training by Altera (1 day)
- Online training
  - Introduction to Parallel Computing with OpenCL
  - Writing OpenCL Programs for Altera FPGAs
  - Running OpenCL on Altera FPGAs
  - Single-Threaded vs. Multi-Threaded Kernels
  - Building Custom Platforms for Altera SDK for OpenCL
- OpenCL board partners page

Application Benchmarking

Case Study: GZIP Compression
- OpenCL was:
  - 10% slower
  - 12% more resources
  - 3x faster development time
- An Altera summer intern ported and optimized the GZIP algorithm in a little more than a month
- An industry-leading company's FPGA engineer coded it in Verilog in 3 months
- Much lower design effort and design time

Results: CHREC / Univ. of Florida (Sobel, Canny, & SURF)

OpenCL vs. VHDL productivity table

VHDL development time    OpenCL development time
6 months                 1 month

OpenCL vs. VHDL performance table

         VHDL performance                                  OpenCL performance
         Stratix 4                Predicted Stratix 5      Stratix 5
Apps.    Frames/sec  Max freq.    Frames/sec  Max freq.    Frames/sec  Max freq.
Sobel    475         170          909         300          870         300
Canny    470         170          890         300          823         309
SURF     392         170          870         300          804         283

Conclusions
- Productivity
  - Avoid the productivity challenges of HDL
  - 6x increase in productivity
  - OpenCL offers a familiar C environment
- Performance
  - Develop fully pipelined kernels
  - Minimum performance cost: < 10% overhead
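The "< 10% overhead" conclusion can be checked against the frame rates above. The following sketch assumes the comparison is between the predicted Stratix 5 VHDL frame rate and the OpenCL Stratix 5 frame rate; the function name `overhead_percent` is hypothetical.

```c
/* Percentage by which the OpenCL frame rate trails the predicted VHDL
   frame rate on the same device family. */
double overhead_percent(double vhdl_fps, double opencl_fps)
{
    return 100.0 * (vhdl_fps - opencl_fps) / vhdl_fps;
}
```

With the table's figures this gives roughly 4.3% for Sobel (909 vs. 870), 7.5% for Canny (890 vs. 823) and 7.6% for SURF (870 vs. 804), all under 10%.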

Case Study: Image Classification
- Deep learning algorithm
  - Convolutional neural network (CNN)
  - Based on Hinton's CNN
- Early results on Stratix V
  - 2X perf./power vs. GPGPU despite soft floating point
  - 8+ simultaneous kernels vs. 2 on GPGPU
  - Exploiting OpenCL channels between kernels
- A10 expectations
  - Hard floating point
  - Better density and frequency
  - ~4X performance/watt vs. SV
- Dataset: the CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. The classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.
[Figure: 10 random images from each class; Hinton's CNN algorithm]

AES Encryption
- Encryption/decryption
- 256-bit key
- Counter (CTR) method
- Advantage FPGA
  - Integer arithmetic
  - Coarse-grain bit operations
  - Complex decision making

Results

Platform                                   Power (W)   Performance (GB/s)   Efficiency (MB/s/W)
E5503 Xeon Processor (single core)         est. 80     0.01                 0.125
AMD Radeon HD 7970                         est. 100    0.33                 3.3
PCIe385 A7 Accelerator                     25          5.20                 208
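The efficiency column is throughput divided by power. The following sketch reproduces it, assuming 1 GB = 1000 MB, which is the conversion that matches the slide's figures; the function name is hypothetical.

```c
/* Efficiency in MB/s per watt from throughput in GB/s and power in watts.
   Assumes 1 GB = 1000 MB (this reproduces the slide's numbers). */
double efficiency_mb_per_s_per_w(double gb_per_s, double watts)
{
    return gb_per_s * 1000.0 / watts;
}
```

For example, 5.20 GB/s at 25 W gives 208 MB/s/W, and 0.33 GB/s at an estimated 100 W gives 3.3 MB/s/W.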

Multi-Asset Barrier Option Pricing
- Monte-Carlo simulation
  - No closed-form solution possible
  - High-quality random number generator required
  - Billions of simulations required
- Used the GPU vendor's example code
- Advantage FPGA
  - Complex control flow
- Optimizations
  - Channels, loop pipelining

Results

Platform                   Power (W)   Performance (Bsims/s)   Efficiency (Msims/s/W)
W3690 Xeon Processor       130         .032                    0.0025
nvidia Kepler20            212         10.1                    48
Bittware S5-PCIe-HQ        45          12.0                    266

Document Filtering
- Unstructured data analytics
- Bloom filter
- Advantage FPGA
  - Integer arithmetic
  - Flexible memory configuration

Results

Platform                   Power (W)   Performance (MTs)   Efficiency (MTs/W)
W3690 Xeon Processor       130         2070                15.92
nvidia Tesla C2075         215         3240                15.07
PCIe385 A7 Accelerator     25          3602                144.08
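The slide names the Bloom filter but does not show it. As background, the following is a minimal plain-C sketch of a Bloom filter over text terms. The filter size, the number of hash functions and the seeded FNV-1a hash are all assumptions chosen for illustration; they are not taken from the benchmarked design.

```c
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   1024u  /* assumed filter size in bits */
#define BLOOM_HASHES 3u     /* assumed number of hash functions */

typedef struct {
    uint8_t bits[BLOOM_BITS / 8];
} bloom_t;

/* Seeded FNV-1a, reduced to a bit index. An illustrative hash choice. */
static uint32_t bloom_hash(const char *term, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    for (const unsigned char *p = (const unsigned char *)term; *p; ++p) {
        h ^= *p;
        h *= 16777619u;
    }
    return h % BLOOM_BITS;
}

void bloom_init(bloom_t *f)
{
    memset(f->bits, 0, sizeof f->bits);
}

/* Set one bit per hash function. */
void bloom_add(bloom_t *f, const char *term)
{
    for (uint32_t k = 0; k < BLOOM_HASHES; ++k) {
        uint32_t bit = bloom_hash(term, k);
        f->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns 0 if the term was definitely never added,
   1 if it may have been added (false positives are possible). */
int bloom_maybe_contains(const bloom_t *f, const char *term)
{
    for (uint32_t k = 0; k < BLOOM_HASHES; ++k) {
        uint32_t bit = bloom_hash(term, k);
        if (!(f->bits[bit / 8] & (1u << (bit % 8))))
            return 0;
    }
    return 1;
}
```

The work per term is a few integer hashes and single-bit memory accesses, which is consistent with the slide listing integer arithmetic and flexible memory configuration as the FPGA's advantages.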

Consumer (Japan) Image Processing
- Adaptive weighted images
[Equation: per-pixel weighted sum p_xy over neighboring pixels with coefficients c, d and normalizing weight W; the exact form is not recoverable from the transcription]
- Advantage FPGA
  - Integer arithmetic

Results

Platform                   Power (W)   Performance (FPS)   Efficiency (FPS/W)
W3565 Xeon Processor       est. 130    0.05                .0004
nvidia Quadro 4000         est. 150    2.94                .0200
PCIe385 A7 Accelerator     21          4.29                .2040

Smith-Waterman
- Sequence alignment
- Scoring matrix
- Advantage FPGA
  - Integer arithmetic
  - SMT streaming

Results

Platform                   Power (W)   Performance (MCUPS)   Efficiency (MCUPS/W)
W3565 Xeon Processor       140         40                    .29
nvidia K20                 225         704                   3.13
PCIe385 A7 Accelerator     25          32596                 1303.00
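For readers unfamiliar with the scoring matrix, the following is a minimal plain-C sketch of Smith-Waterman local alignment scoring with a linear gap penalty. Each inner-loop iteration is one cell update, the unit counted by the MCUPS (million cell updates per second) figures. The scoring parameters, the length cap and the function names are assumptions for illustration and do not describe the benchmarked implementation.

```c
#include <string.h>

#define SW_MAX_LEN 64  /* assumed cap on sequence length, keeps the matrix on the stack */

static int sw_max(int a, int b) { return a > b ? a : b; }

/* Returns the best local alignment score of s against t, or -1 if either
   sequence exceeds SW_MAX_LEN. `gap` and `mismatch` are expected to be
   negative; `match` positive. */
int smith_waterman_score(const char *s, const char *t,
                         int match, int mismatch, int gap)
{
    size_t n = strlen(s), m = strlen(t);
    if (n > SW_MAX_LEN || m > SW_MAX_LEN)
        return -1;

    int H[SW_MAX_LEN + 1][SW_MAX_LEN + 1];
    int best = 0;

    for (size_t i = 0; i <= n; ++i) H[i][0] = 0;
    for (size_t j = 0; j <= m; ++j) H[0][j] = 0;

    for (size_t i = 1; i <= n; ++i) {
        for (size_t j = 1; j <= m; ++j) {
            /* One cell update: best of restart, diagonal, and the two gap moves. */
            int diag = H[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? match : mismatch);
            int up   = H[i - 1][j] + gap;
            int left = H[i][j - 1] + gap;
            int h = sw_max(0, sw_max(diag, sw_max(up, left)));
            H[i][j] = h;
            best = sw_max(best, h);
        }
    }
    return best;
}
```

Each cell depends only on its left, upper and upper-left neighbors, so cells along an anti-diagonal are independent, which is what makes the algorithm suitable for streaming hardware.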

Multi Function Printer
- Image processing
  - RGB output of the raster scanner converted to CMYK colorants for printing
- Advantage FPGA
  - SoC solution
  - IO and kernel channels
  - Heterogeneous memory accesses
- Goal: 50 PPM at A4/letter size
- Results
  - >40X improvement over the C-based algorithm on ARM only
  - No NEON coprocessor used
  - C6 speed grade part improved 20% to 128 PPM

Suricata: IDS/IPS Implementation (Cybersecurity)
[Figure: ingress network path from 2x 10 Gbps ETH IO through packet processing and analysis stages (STD IDS, DPI), traffic control and IPS packet manipulation to 2x 10 Gbps ETH IO; STD rules memory and DPI rules memory (QDR or DDR); IDS/IPS management; mirror for the egress network path]

Packet Analysis Kernel - IDS (task)
- Stream in decoded packets and store in local memory (aoclreadchannel)
- Parallel regex with STD rules in global memory (heterogeneous memory support)
- Write results to global memory
- Stream out decoded packets (aoclwritechannel)

Host - IDS/IPS Management
- Read results from global memory and log
- Decide to modify or delete packets

Packet Manipulation Kernel - IPS (task)
- Stream in decoded packets (aoclreadchannel)
- Read and process decision from the host
- Stream out decoded packets (aoclwritechannel)

Decoder Kernel (autorun)
- Stream in encoded packets (aoclreadchannel)
- Unpack single streams
- Stream out decoded packets (aoclwritechannel)

Packet Analysis Kernel - DPI (task)
- Stream in decoded packets and store in local memory (aoclreadchannel)
- Parallel regex with DPI rules in global memory (heterogeneous memory support)
- Write results to global memory
- Stream out decoded packets (aoclwritechannel)

Encoder Kernel (autorun)
- Stream in decoded packets (aoclreadchannel)
- Repack multiple streams
- Stream out encoded packets (aoclwritechannel)

Haplotype Caller (Pair-HMM)
- Smith-Waterman-like algorithm
  - Uses hidden Markov models to compare gene sequences
- 3 stages: Assembler, Pair-HMM (70%), Traversal + Genotyping
- Floating point (SP + DP)
- C++ code starting point (from Java)
- Whole genome takes 7.6 days!

Results

Platform                   Runtime (ms)
Java (gatk 2.8)            10,800
Intel Xeon E5-1650         138
nvidia Tesla K40           70
Nallatech SV-D8            15.5

Sobel Filter
- Fundamental image filter algorithm
  - Used commonly in industrial and automotive applications
- Sliding-window-based design pattern
  - Same shift register structure, except in two dimensions
[Figure: two-dimensional shift register. Pixels enter at index 0; the labeled positions are WIDTH-9, WIDTH-1, WIDTH*2-9, WIDTH*3-9, WIDTH*3, WIDTH*4-9 and WIDTH*4-1, with window taps A, B, C and E, F, G.]

Task Implementation and Results

Altera OpenCL:

    __kernel void sobel(int iters)
    {
        // Coefficients
        int Gx[3][3] = {{-1,-2,-1},{0,0,0},{1,2,1}};
        int Gy[3][3] = {{-1,0,1},{-2,0,2},{-1,0,1}};

        int rows[2 * COLS + 3]; // line buffer
        int count = 0;
        while (count != iters) {
            // Shift the line buffer
            #pragma unroll
            for (int i = COLS * 2 + 2; i > 0; --i) {
                rows[i] = rows[i - 1];
            }
            rows[0] = read_channel_altera(in_channel);

            int x_dir = 0;
            int y_dir = 0;
            #pragma unroll
            for (int i = 0; i < 3; ++i) {
                #pragma unroll
                for (int j = 0; j < 3; ++j) {
                    x_dir += rows[i * COLS + j] * Gx[i][j];
                    y_dir += rows[i * COLS + j] * Gy[i][j];
                }
            }
            int edge_weight = abs(x_dir) + abs(y_dir);
            write_channel_altera(out_channel, edge_weight);
            ++count;
        }
    }

On our design example website: http://www.altera.com/support/examples/opencl/sobel-filter.html

Device       Resolution   FPS
Cyclone V    1080p        60
Stratix V    1080p        135
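The kernel above can be exercised on a CPU by replacing the channels with arrays. The following is a minimal plain-C sketch of the same line-buffer loop. `COLS` is set to an arbitrary small width, the function name `sobel_stream` is hypothetical, and the line buffer is zero-initialized here for determinism, which the kernel on the slide does not do.

```c
#include <stdlib.h>

#define COLS 8  /* assumed image width for this sketch */

/* Streams n pixels through the line buffer. in[] stands in for the input
   channel and out[] for the output channel; one output per input pixel. */
void sobel_stream(const int *in, int *out, int n)
{
    static const int Gx[3][3] = {{-1,-2,-1},{0,0,0},{1,2,1}};
    static const int Gy[3][3] = {{-1,0,1},{-2,0,2},{-1,0,1}};
    int rows[2 * COLS + 3] = {0};  /* line buffer: two full rows plus three pixels */

    for (int count = 0; count < n; ++count) {
        /* Shift the line buffer by one pixel and insert the new one. */
        for (int i = COLS * 2 + 2; i > 0; --i)
            rows[i] = rows[i - 1];
        rows[0] = in[count];

        /* Apply both 3x3 coefficient sets to the window taps. */
        int x_dir = 0;
        int y_dir = 0;
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                x_dir += rows[i * COLS + j] * Gx[i][j];
                y_dir += rows[i * COLS + j] * Gy[i][j];
            }
        }
        out[count] = abs(x_dir) + abs(y_dir);
    }
}
```

Once the buffer has filled with a uniform image, every output is 0, because both coefficient sets sum to zero; outputs produced while the buffer is still filling reflect the initial zeros rather than real image content.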