How OpenCL enables easy access to FPGA performance?



Similar documents
Altera SDK for OpenCL v14.1. February 6, 2015

FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25

CFD Implementation with In-Socket FPGA Accelerators

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Tutorial: Harnessing the Power of FPGAs using Altera s OpenCL Compiler Desh Singh, Tom Czajkowski, Andrew Ling

Xeon+FPGA Platform for the Data Center

Model-based system-on-chip design on Altera and Xilinx platforms

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Embedded Systems: map to FPGA, GPU, CPU?

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Next Generation GPU Architecture Code-named Fermi

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Data Center and Cloud Computing Market Landscape and Challenges

Mitglied der Helmholtz-Gemeinschaft. OpenCL Basics. Parallel Computing on GPU and CPU. Willi Homberg. 23. März 2011

Parallel Programming Survey

All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

COSCO 2015 Heterogeneous Computing Programming

Cloud Data Center Acceleration 2015

Qsys and IP Core Integration

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Altera SDK for OpenCL

Intel Xeon +FPGA Platform for the Data Center

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

AN FPGA FRAMEWORK SUPPORTING SOFTWARE PROGRAMMABLE RECONFIGURATION AND RAPID DEVELOPMENT OF SDR APPLICATIONS

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

Case Study on Productivity and Performance of GPGPUs

Networking Virtualization Using FPGAs

Architectures and Platforms

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Eli Levi Eli Levi holds B.Sc.EE from the Technion.Working as field application engineer for Systematics, Specializing in HDL design with MATLAB and

Computer Graphics Hardware An Overview

Direct GPU/FPGA Communication Via PCI Express

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads

Architekturen und Einsatz von FPGAs mit integrierten Prozessor Kernen. Hans-Joachim Gelke Institute of Embedded Systems Professur für Mikroelektronik

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

Multi-Threading Performance on Commodity Multi-Core Processors

Introduction to GPU hardware and to CUDA

~ Greetings from WSU CAPPLab ~

7a. System-on-chip design and prototyping platforms

Xilinx SDAccel. A Unified Development Environment for Tomorrow s Data Center. By Loring Wirbel Senior Analyst. November

Fujisoft solves graphics acceleration for the Android platform

Accelerating CFD using OpenFOAM with GPUs

Course materials. In addition to these slides, C++ API header files, a set of exercises, and solutions, the following are useful:

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com

FPGA-based MapReduce Framework for Machine Learning

Kalray MPPA Massively Parallel Processing Array

HPC with Multicore and GPUs

The new frontier of the DATA acquisition using 1 and 10 Gb/s Ethernet links. Filippo Costa on behalf of the ALICE DAQ group

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

FPGA-Accelerated Heterogeneous Hyperscale Server Architecture for Next-Generation Compute Clusters

Extending the Power of FPGAs. Salil Raje, Xilinx

Hardware Acceleration for CST MICROWAVE STUDIO

Evaluation of CUDA Fortran for the CFD code Strukti

Data Centric Systems (DCS)

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Introduction to GPGPU. Tiziano Diamanti

Le langage OCaml et la programmation des GPU

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU Parallel Computing Architecture and CUDA Programming Model

GPU Computing - CUDA

Network connectivity controllers

OpenCL Programming for the CUDA Architecture. Version 2.3

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

LS DYNA Performance Benchmarks and Profiling. January 2009

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

How To Build An Ark Processor With An Nvidia Gpu And An African Processor

Intro to GPU computing. Spring 2015 Mark Silberstein, , Technion 1

A New, High-Performance, Low-Power, Floating-Point Embedded Processor for Scientific Computing and DSP Applications

NVIDIA GeForce GTX 580 GPU Datasheet

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

FPGA Accelerator Virtualization in an OpenPOWER cloud. Fei Chen, Yonghua Lin IBM China Research Lab

Embedded Linux RADAR device

High Performance GPGPU Computer for Embedded Systems

Parallel Firewalls on General-Purpose Graphics Processing Units

Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration

High-Density Network Flow Monitoring

Infrastructure Matters: POWER8 vs. Xeon x86

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Stream Processing on GPUs Using Distributed Multimedia Middleware

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/ CAE Associates

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

Enabling Technologies for Distributed and Cloud Computing

ANDROID DEVELOPER TOOLS TRAINING GTC Sébastien Dominé, NVIDIA

Introduction to OpenCL Programming. Training Guide

High Performance Computing in CST STUDIO SUITE

GeoImaging Accelerator Pansharp Test Results

Parallel Computing with MATLAB

GPU-based Decompression for Medical Imaging Applications

Development With ARM DS-5. Mervyn Liu FAE Aug. 2015

Going to the wire: The next generation financial risk management platform

Findings in High-Speed OrthoMosaic

Optimizing Code for Accelerators: The Long Road to High Performance

Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Transcription:

How OpenCL enables easy access to FPGA performance? Suleyman Demirsoy

Agenda Introduction OpenCL Overview S/W Flow H/W Architecture Product Information & design flow Applications Additional Collateral 2

Introduction

Driver Stream API CUDA The Quest for Performance CPU Programmability Single-Core Multi-Core C/C++ AVX/OpenMP Heterogeneous Programming Architecture PCIe Accelerator Programming Language OpenCL GPGPU Performance 4

I/O FPGA Architecture Massive Parallelism Millions of logic elements Thousands of 20Kb memory blocks Thousands of Variable Precision DSP blocks Dozens of High-speed transceivers Hardware-centric VHDL/Verilog Synthesis Place&Route I/O I/O Programmable Routing Switch Logic Element I/O 5

OpenCL Overview

OpenCL (Open Computing Language) Overview Software programming model: C/C++ API for host program OpenCL C for acceleration device Provides increased performance with hardware acceleration CPU offload to appropriate accelerator Local Memory Explicit Parallelism Task (SMT) Data (SPMD) Open, royalty-free, standard Managed by Khronos Group Altera active member Conformance requirements V1.0 is current reference V2.0 is current release http://www.khronos.org Low Level Programming Language! Host C/C++ API OpenCL C Device 10

Altera OpenCL Program Overview 2010 research project Toronto Technology Center 2011 Development started Proof of concept 9 customer evaluations 2012 Early Access Program Demo s at Supercomputing 12 Over 60 customer evaluations 2013 First public release announcement with release Passed v1.0 conformance release v13.1 Installation image accessible from ACDS download infrastructure Documentation available online Boards available from vendor web site and ACDS installation Support flow in place Optimization improvements SoC support Design Examples on Altera.com 11

Passed OpenCL Conformance! First FPGA to pass OpenCL conformance OpenCL v1.0 specification >8500 Programs tested 12 http://www.khronos.org/conformance/adopters/conformant-companies http://www.khronos.org/conformance/adopters/conformant-products

Heterogeneous Platform Model OpenCL Platform Model Host Memory (Compute) Device Host Compute Unit Global Memory Processing Element Example Platform x86 PCIe 13

Heterogeneous Platform Model OpenCL Platform Model Host Memory Device Device Host Global Memory Example Platform x86 PCIe 14

Use Model: Abstracting the FPGA away main() { read_data( ); manipulate( ); clenqueuewritebuffer( ); clenqueuendrange(,sum, ); clenqueuereadbuffer( ); display_result( ); } OpenCL Host Program + Kernels kernel void sum ( global float *a, global float *b, global float *y) { int gid = get_global_id(0); y[gid] = a[gid] + b[gid]; } Standard gcc Compiler Altera Offline Compiler Verilog x86 EXE AOCX Quartus II 15

Use Model: clcreateprogramwithbinary OpenCL.h API fp = fopen( file.aocx","rb"); fseek(fp,0,seek_end); lengths[0] = ftell(fp); binaries[0] = (unsigned char*)malloc(sizeof(unsigned char)*lengths[0]); rewind(fp); fread(binaries[0],lengths[0],1,fp); fclose(fp);.cl cl_platform clgetplatforms Program (exe) const char** const char** const char** kernel clgetdevices cl_device clcreatecontext cl_context clcreateprogramwithbinary cl_program Program (exe) clbuildprogram Offline Compiler host.c cl_command _queue clcreatecommandqueue clenqueuendrangekernel exe exe cl_program Kernel (src) Kernel (src) clcreatekernel cl_kernel exe.aocx CL File OpenCL Program Bitstream 16

Reference Platforms Requirement Network Enabled Low Latency Half-Size High Performance Computing (HPC) Compute Power/ Memory Bandwidth Full-Size Form Factor Component Single Dual Global Memory DDR3-1600 and QDRII+ 550MHz DDR3-1333/FPGA IO Channels 2x10GbE (MAC/UOE) None (Minimize IP overhead) Reference Design OPRA (Streaming) Trading (with global memory access) Option Pricing 17

Altera HPC Reference Platform for OpenCL C/C++ API OpenCL C host.c device.cl Compiler Reference Design Software Layer Hardware Layer Reference Platform Host Device Reference Board 64-bit RHEL 6.4 Windows 7 s5_hpc (S5PE-DS) 18

Altera Network Enabled Reference Platform for OpenCL C/C++ API OpenCL C host.c device.cl Compiler Reference Design Software Layer Hardware Layer Reference Platform Host Device Reference Board 19 64-bit RHEL 6.4 Windows 7 s5_hft (S5PH-Q)

Interconnect Altera Network Enabled Reference Platform for OpenCL DDR DDR3 Memory C/C++ API OpenCL C DDR QDR DDR3 Memory QDRII Memory Software Layer host.c device.cl Hardware Layer CvP Update Compiler Reference Design QDR QDRII Memory QDR QDR QDRII Memory QDRII Memory OpenCL Kernels OpenCL Kernels Reference Platform 10G Network Host Host 10Gb MAC/UOE Data 10Gb MAC/UOE Data PCIe gen2x8 Host Device Reference Board 20 64-bit RHEL 6.4 Windows 7 s5_hft (S5PH-Q)

Interconnect OpenCL Modular Hardware Architecture DDR DDR DDR3 Memory DDR3 Memory CvP Update Built with Altera OpenCL Compiler QDR QDRII Memory QDR QDRII Memory QDR QDRII Memory QDR QDRII Memory OpenCL Kernels OpenCL Kernels 10G Network 10Gb MAC/UOE Data 10Gb MAC/UOE Data 21 Host PCIe gen2x8 Host Prebuilt BSP with standard HDL tools

OpenCL on SoC Platform Single Chip Solution Lightweight bridge Starting/stopping kernel, reconfiguring PLL, etc FPGA to SDRAM bridge Default way to move data between HPS and FPGA 256bits wide @ 100Mhz ~ 2.6GB/s FPGA external memory Scratch ram for FPGA kernels Store intermediate data before passing to next kernel FPGA to HPS and HPS to FPGA bridges are connected to it as a secondary connection Very slow: 32bits @50Mhz w/out DMA H2F/F2H FPGA 32bit, 50Mz HPS External Memory 256bit, 100Mhz ARM Host LWH2F F2S CSR FPGA Memory Kernels 256bit 100Mhz FPGA External Memory 22

The Key to Performance More Operations Per Second Parallelism Maximize Throughput Minimize Latency Quick Data Access Memory Access Pipelining Instructions Processes Loop unrolling Duplication (SPMD) Multi-threading (SMT) Avoid transfer/copy Work in local memory instead of shared memory Coalesce accesses 24

Multiple Devices v13.1 Beta Altera Platform Multiple Devices/Board Multiple Boards/Host Host Memory Device Board Host Device Device Board IO 25

Heterogeneous Memory Support v13.1 Beta Host Memory Host IO Global Memory1 Global Memory2 IO Device CU Memories with different characteristics DDR Sequential Access QDR Random Access Attribute-based kernel void foo( global uint *data attribute((buffer_location(qdr) )) ) { foo(data[i]); } 26

Channels v13.1 EAP 27 Standard OpenCL Vendor Extension Channels Host CvP Update Interconnect DDR3 10Gb 10Gb DDR3 QDRII QDRII QDRII QDRII OpenCL Kernels OpenCL Kernels DDR DDR QDR QDR QDR QDR 10G Host Host CvP Update Interconnect DDR3 10Gb 10Gb DDR3 QDRII QDRII QDRII QDRII OpenCL Kernels OpenCL Kernels DDR DDR QDR QDR QDR QDR Host Network 10G Network

Interconnect Custom Board Support Generation v13.1 EAP DDR DDR3 Memory Avalon MM DDR DDR3 Memory Avalon MM CvP Update QDR QDRII Memory Avalon MM QDR QDRII Memory Avalon MM QDR QDR QDRII Memory QDRII Memory Avalon MM Avalon MM OpenCL Kernels OpenCL Kernels 10G Network 10Gb MAC/UOE Data 10Gb MAC/UOE Data Avalon ST Avalon ST Host PCIe gen2x8 Host Avalon MM s to compiler Host CPU : Avalon Memory Map Global Memory: Avalon Memory Map Option IO: Avalon Streaming 28

OpenCL + FPGA Key Benefits Higher performance/watt vs. CPU/GPGPU Offload performance-intensive functions from the host processor to an FPGA Implement exactly what you need Pipeline parallel structures Custom interconnect converging with data processing cores Lower power vs. CPU/GPGPU Core frequency lower: 200-250MHz vs 1GHz Turn off unused logic Up to 1/5 the power Faster development vs. traditional FPGA design flow Higher level of design abstraction Higher department productivity Leverage your software development team Familiar C-based design entry Portability & Obsolescence free 29 Code can transfer between different HW accelerators (CPU, GPGPU, FPGA, etc) FPGA life cycle considerably longer than CPUs or GPGPUs

Product Information & Design Flow

How to Get OpenCL Part of ACDS v13.1 installation Dedicated download page at http://dl.altera.com/opencl/ Requires licensed Quartus software Supported on Windows and Linux Still need standard GCC compiler for host side code Visual Studio, Eclipse etc. 31

What is included with the Altera SDK for OpenCL? Offline compiler (aoc) GCC based model Altera OpenCL utility (aocl) Diagnostics for board installation Flash or program FPGA image Install board drivers (typically PCIe) Host libraries Required by host code and provided by the vendor (Altera) APB Board driver Design examples FFT, vectoradd, matrixmult, moving average 32

Altera SDK for OpenCL Product Details Altera SDK for OpenCL Licensing OS Purchase a 1 year perpetual license Fixed & float available 60-day evaluation license available on request Requires Quartus II v13.1 Subscription Edition or Development Kit Edition Microsoft 64-bit Windows 7 Red Hat Enterprise 64-bit Linux (RHEL) 6.x Memory requirements SDK: Computer equipped with at least 16 GB RAM Quartus II: Refer to memory requirements for target FPGA 33

Altera Preferred Board Partner Program for OpenCL Provide customers with a portfolio of COTS boards to evaluate, develop and go-to-production with Customers can develop code and target that preferred board Altera SDK for OpenCL flow has been verified on the board Ensures an exceptional out-of-box customer experience for the customer Customer purchase directly from partners Altera s Preferred Board include: Includes Quartus II Development Kit Edition Software (one year license) Includes an Altera SDK for OpenCL License (one year, perpetual license) 34

APBs Available as of 13.1 Release Partner Board Altera Device Where to Get? Altera DK-DEV-5CSXC6NES Cyclone V SX SoC Part of ACDS 13.1 BittWare S5PH-Q Stratix V D5 Part of ACDS 13.1 S5PH-Q Stratix V D8 Part of ACDS 13.1 S5PH-Q Stratix V A7 Contact BittWare S5PH-Q Stratix V AB Contact BittWare S5PH-DS Dual Stratix V AB Contact BittWare Terabox 8, S5PH-DS Boards Contact BittWare Nallatech 385-A7 Accelerator Card Stratix V A7 Part of ACDS 13.1 385-D5 Accelerator Card Stratix V D5 Part of ACDS 13.1 PLDA XP5S620LP-40G Stratix V A7 Contact PLDA Terasic DE5 Stratix V A7 Contact Terasic 35

Optimize Design Set Up Altera SDK for OpenCL Design Flow Getting Started Guide (document) Install Quartus II v13.1 with Altera SDK for OpenCL Install C Compiler or Development Environment Obtain and setup license from the Self Service Licensing Center Install the FPGA (OpenCL) board aocl install Programming Guide (document) Develop kernel code and compile on CPU/GPU for functional correctness Build, compile & link the host application (Visual Studio/GCC) Compile the OpenCL kernel with Altera offline Compiler (aoc) Run the application Optimization Guide (document) Optimize kernel for FPGA hardware 36

Applications

AES Encryption Encryption/decryption 256bit key Counter (CTR) method Advantage FPGA Integer arithmetic Coarse grain bit operations Complex decision making Results Platform Power (W) Performance (GB/s) Efficiency (GB/s/W) E5503 Xeon Processor (single core) est 80 0.01 1.25e-4 AMD Radeon HD 7970 est 100 0.33 3.30e-3 PCIe385 A7 Accelerator 25 5.20 2.08e-1 38

Multi-Asset Barrier Option Pricing Monte-Carlo simulation No closed form solution possible High quality random number generator required Billions of simulations required Used GPU vendors example code Advantage FPGA Complex Control Flow Optimizations Channels, loop pipelining Results Platform Power (W) Performance (Bsims/s) Efficiency (Msims/s/W) W3690 Xeon Processor 130.032 0.0025 nvidia Kepler20 212 10.1 48 Bittware S5-PCIe-HQ 45 12.0 266 39

Document Filtering Unstructured data analytics Bloom Filter Advantage FPGA Integer Arithmetic Flexible Memory Configuration Results Platform Power (W) Performance (MTs) Efficiency (MTs/W) W3690 Xeon Processor 130 2070 15.92 nvidia Tesla C2075 215 3240 15.07 PCIe385 A7 Accelerator 25 3602 144.08 40

Consumer (Japan) Image Processing Adaptive weighted images p xy c d c d c c 1 ij 1 xy 1 ( i1) j 2 xy 2 ij xy 2 ( i1 ) j W d d 2 xy Advantage FPGA Integer Arithmetic Results Platform Power (W) Performance (FPS) Efficiency (FPS/W) W3565 Xeon Processor est 130 0.05.0004 nvidia Quadro 4000 est 150 2.94.0200 PCIe385 A7 Accelerator 21 4.29.2040 41

Smith-Waterman Sequence Alignment Scoring Matrix Advantage FPGA Integer Arithmetic SMT Streaming Results Platform Power (W) Performance (MCUPS) Efficiency (MCUPS/W) W3565 Xeon Processor 140 40.29 nvidia K20 225 704 3.13 PCIe385 A7 Accelerator 25 32596 1303.00 42

Multi Function Printer Image Processing RGB output of raster scanner converted to CMYK colorants for printing Advantage FPGA SoC Solution IO and Kernel Channels Heterogeneous memory accesses Goal 50PPM at A4/letter size Results >40X improvement over C based algorithm on ARM only No NEON coprocessor used C6 speed grade part improved 20% to 128PPM 43

44

Additional Resources

Additional Altera Collateral White papers on OpenCL OpenCL online demos OpenCL design examples Instructor-Led training OpenCL for Altera FPGAs Training by Acceleware (4 Day) Parallel Computing with OpenCL Workshop by Altera (1 Day) Optimization of OpenCL for Altera FPGAs Training by Altera (1 Day) Online training Introduction to Parallel Computing with OpenCL Writing OpenCL Programs for Altera FPGAs Running OpenCL on Altera FPGAs OpenCL board partners page 46

Summary Productivity Unified software programmer friendly design environment for a variety of devices, now including FPGA, in a heterogeneous platform Performance Excellent throughput and latency for algorithms pushing SMT limits, and SPMD with large local memory demands Efficiency Cost Dedicated custom processors for the parallel tasks make for the most compelling performance/watt results SoC solution with host and accelerator in a single device creates a simpler system and can lowers system costs for real-time performance acceleration 47