Altera SDK for OpenCL v14.1 February 6, 2015
Industry Challenges
- A variety of applications are becoming bottlenecked by scalable performance requirements
  - E.g. object detection and recognition, image tracking and processing, cryptography, cloud, search engines, deep packet inspection, etc.
- CPU capabilities are overloaded
  - Frequencies are capped
  - Processors keep adding more cores
  - Need to coordinate all the cores and manage data
- Product life cycles are long
  - GPU lifespan is short
  - GPUs require re-optimization and regression testing between generations
  - Support agreements for GPUs are costly
- Power dissipation of CPUs and GPUs limits system size
- Maintaining coherency throughout a scalable system

OpenCL and FPGAs Address These Challenges
- Power-efficient acceleration
  - Typically 1/5 the power of a GPU and orders of magnitude more performance per watt than a CPU
- FPGA life cycle of over 15 years
  - GPU lifespan is short; GPUs require re-optimization and testing between generations
  - FPGA OpenCL code is retargeted to future devices without modification
- Our OpenCL flow abstracts away the FPGA hardware flow
  - Puts the FPGA into software engineers' hands
- Our OpenCL SDK allows for streaming IO channels and kernel channels
  - Data movement without host involvement
  - Low-latency data transmission to the accelerator
- Shared virtual memory
  - IBM CAPI and Intel QPI

Efficiency via Specialization
[Figure: efficiency via specialization, comparing FPGAs, ASICs and GPUs. Source: Bob Brodersen, Berkeley Wireless group]

Application Development Paradigm
[Figure: developer populations, from smallest to largest: ASIC/FPGA programmers, parallel programmers, standard CPU programmers]
- OpenCL expands the number of application developers

More SW Engineering Resources than HW?
- 1000:1 ratio of software engineers to FPGA designers
- Software engineers are not used to long compile times
- OpenCL solves this!
  - Our OpenCL flow abstracts away the FPGA hardware flow, bringing the FPGA to low-level software programmers
  - Software developers write, optimize and debug in their familiar software environment
    - Quartus is run behind the scenes
    - Emulator and profiler are software development tools
    - Long compile times are pushed to the end
  - OpenCL optimization doesn't require a board
    - Allows software to drive board requirements (.xml file)

OpenCL On FPGAs Fit Into All Markets
Data processing algorithms, by market:
- Automotive/Industrial (pedestrian detection, motion estimation)
- Military/Government (crypto, image detection)
- Networking (DPI, SDN, NFV)
- Computer & Storage (HPC, financial, data compression)
- Broadcast, Consumer (video image processing)
- Medical (diagnostic image processing, bioinformatics)

OpenCL and FPGA Acceleration in the News
- IBM and Altera Collaborate on OpenCL
  "IBM's collaboration with Altera on OpenCL and support of the IBM Power architecture with the Altera SDK for OpenCL can bring more innovation to address Big Data and cloud computing challenges," said Tom Rosamilia, senior vice president, IBM Systems
- Intel Reveals FPGA and Xeon in One Socket
  "That allows end users that have applications that can benefit from acceleration to load their IP and accelerate that algorithm on that FPGA as an offload," explained the vice president of Intel's data center group, Diane Bryant
- Search Engine Gets Help From FPGA
  "Altera was really interested in helping with the development, the resources they were willing to throw our way were more significant than those from Xilinx" (Microsoft engineering manager)
- Baidu and Altera Demonstrate Faster Image Classification
  Altera Corp. and Baidu, China's largest online search engine, are collaborating on using FPGAs and convolutional neural network (CNN) algorithms for deep learning applications.
- Xilinx Announces SDAccel Development Environment for OpenCL
  Delivering up to 25X better performance/watt to the data center

What is OpenCL?
- A software programming model for software engineers and a software methodology for system architects
  - First industry standard for heterogeneous computing
- Provides increased performance with hardware acceleration
  - Low-level programming language
  - Based on ANSI C99
- Open, royalty-free standard
  - Managed by the Khronos Group
  - Altera is an active member
  - Conformance requirements
    - V1.0 is the current reference
    - V2.0 is the current release
  - http://www.khronos.org
[Figure: the host is programmed through the C/C++ API; the accelerator is programmed in OpenCL C]

Why Does OpenCL Exist?
[Figure: programmability versus performance. Labels: CPU, Single-Core, Multi-Core, C/C++, AVX/OpenMP, GPGPU, Driver, Stream API, CUDA, Heterogeneous Programming Architecture, PCIe Accelerator, Programming Language, OpenCL]

Programming Language Offerings
[Table flattened in extraction; row values are listed in source order and column alignment is not fully recoverable]
- Target: GPU; Multi-Core CPU; DSP/Embedded; FPGA
- Scope: System (Heterogeneous Platform); Device; IP Block
- Designer: Programmer; Embedded Programmer; Hardware Designer
- Design Flow: CUDA/OpenCL; Code Composer Studio (TI C); Quartus II (Verilog/VHDL); HLS
- Design Activity: Task Parallelism, Data Parallelism; Real Time Function Acceleration; IP Design and Integration
- Design Constraints: Throughput/Latency, Power Efficiency; Real Time Execution, Cost; Clock Frequency, Resource Utilization, Interface Requirements, Power
- Hardware Knowledge: None (Coding Style Guidelines); Limited (macro architecture, bandwidth level); Yes (protocol-level, timing closure, micro architecture)
- Unlabeled row: Today; Today; PoC

HLS vs OpenCL Positioning
OpenCL
- Targets CPUs, GPUs and FPGAs
- Target user is a software developer
- Implements the FPGA in a software development flow
- Performance is determined by the resources allocated
- Host required
HLS
- Targets FPGAs
- Target user is an FPGA designer
- Implements the FPGA in a traditional FPGA development flow
- Performance is defined, and the amount of resource needed to achieve it is reported
- Host not required

Altera SDK for OpenCL Competitive Differentiator
- Altera's SDK for OpenCL has proven to be a powerful solution for many vendors
  - Won the design tool and development software Elektra award in Europe
  - Won Ultimate Product of the Year for 2014
- Actively being used today:
  "I was extremely happy to get a great performance with such low effort. I was so impressed with how powerful the Altera tool was!" (Senior Engineer, Altera OpenCL Customer)

First Conformant OpenCL Solution for FPGAs!
- OpenCL v1.0 specification
- >8500 programs tested
- Supports ARM host (CV and AV SoC)
- http://www.khronos.org/conformance/adopters/conformant-companies
- http://www.khronos.org/conformance/adopters/conformant-products

Heterogeneous Platform Model
[Figure: OpenCL platform model. The host, with host memory, connects to a (compute) device. The device has global memory and compute units, each made up of processing elements. Example platform: an x86 host connected over PCIe.]

Heterogeneous Platform Model
[Figure: OpenCL platform model with more than one device. The host, with host memory, connects to several devices, each with global memory. Example platform: an x86 host connected over PCIe.]

OpenCL Use Model: Abstracting the FPGA Away

Host code:

    main() {
        read_data( ... );
        manipulate( ... );
        clEnqueueWriteBuffer( ... );
        clEnqueueNDRangeKernel( ..., sum, ... );
        clEnqueueReadBuffer( ... );
        display_result( ... );
    }

OpenCL accelerator code:

    __kernel void sum(__global float *a,
                      __global float *b,
                      __global float *y)
    {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }

Build flow:
- Host code: standard gcc compiler, producing an EXE that runs on the host
- Accelerator code: Altera Offline Compiler, producing Verilog, then Quartus II, producing an AOCX that runs on the accelerator

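The split between host and kernel can be mimicked in plain C. This is only a minimal sketch: `sum_work_item` and `launch_ndrange` are hypothetical helpers that stand in for the kernel body and for the runtime's NDRange launch, and neither is part of the OpenCL API or the Altera SDK. It shows that the kernel body runs once per global ID.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the sum kernel for a single work-item. The gid argument plays
   the role of get_global_id(0). Hypothetical helper, not an OpenCL call. */
void sum_work_item(size_t gid, const float *a, const float *b, float *y)
{
    y[gid] = a[gid] + b[gid];
}

/* Stand-in for an NDRange launch over global_size work-items. Here the
   work-items simply run one after another in a loop. */
void launch_ndrange(size_t global_size, const float *a, const float *b, float *y)
{
    assert(a != NULL && b != NULL && y != NULL);
    for (size_t gid = 0; gid < global_size; ++gid) {
        sum_work_item(gid, a, b, y);
    }
}
```

Calling `launch_ndrange(n, a, b, y)` on three host arrays of length `n` fills `y` with the element-wise sums, which is the result the host would read back from the device buffer.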
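OpenCL Programming Model
[Figure: host.c with opencl.h is compiled by gcc; device.cl is compiled by aoc. Host-side objects shown: Driver, Platform, Context, Device, Queue, Program, Kernel, Buffer, Launch. Application stages: Acquire, Compute, Visualize.]
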
The Only Custom Accelerator Solution: Platforms
[Figure: platform block diagram]
- OpenCL domain (built with the Altera OpenCL Compiler): Kernel IP blocks joined by an interconnect
- IO infrastructure (prebuilt BSP, built with standard HDL tools by an FPGA developer):
  - Two DDR3 memory interfaces (DDR)
  - Four QDRII memory interfaces (QDR)
  - Two 10Gb MAC/UOE data interfaces (10G network)
  - PCIe gen2x8 host interface (host)

Altera Reference Platforms
Network Enabled platform
- Requirement: low latency
- Architecture: host stack (OpenCL API, HAL, UMD, KMD); Stratix V FPGA with PCIe, DMA, OpenCL kernels, two DDR3 banks, two 10G UDP ports, CPLD, CPLD bridge and FLASH
- Global memory: DDR and QDRII+
- IO channels: 2x10GbE (MAC/UOE)
- Reference designs: OPRA (streaming), Trading (with global memory access)
High Performance Computing (HPC) platform
- Requirement: compute power / memory bandwidth
- Architecture: host stack (OpenCL API, HAL, UMD, KMD); Stratix V FPGA with PCIe, DMA, OpenCL kernels and two DDR3 banks
- Global memory: large amount of DDR
- IO channels: none (minimize IP overhead)
- Reference design: Option Pricing

SoC Reference Platforms
- HPS block removes the complexities of BSP creation
- Coherency between host and accelerator
- The OpenCL Platforms page contains the CV SoC devkit platform user's guide
[Figure: SoC platform block diagram. Labels: HPS, HPS DDR3, Stratix V FPGA, H2F/F2H, LWH2F, F2S, CSR (32-bit, 50 MHz), FPGA memory, OpenCL kernels, DVI, DVO, scratch DDR3, camera, monitor]

Altera Network Enabled Reference Platform for OpenCL
[Figure: software layer: host.c (C/C++ API) and device.cl (OpenCL C) pass through the compiler to form the reference design. Hardware layer: reference platform with host and device on the reference board.]
- Host OS: 64-bit RHEL 6.4, Windows 7
- Board: s5_hft (S5PH-Q)

Guaranteed Timing Flow
1. Inputs to AOC: kernel.cl, Boardspec.xml and the post-fit QXP partition (PCIe, UniPHY, DMA, ...)
2. Synthesis / P&R / STA on the OpenCL kernels ONLY
3. Meet timing?
   - Yes: continue to step 4
   - No: reconfigure the kernel PLL and re-run STA with the new PLL value
4. DONE!

Heterogeneous Memory Support
[Figure: host with host memory and host IO; device with compute units (CU), an interface, Global Memory1, Global Memory2 and IO]
- Memories with different characteristics
  - DDR: sequential access
  - QDR: random access
  - On-chip: low latency
  - MoSys: efficient
  - HMC: high capacity
- Combine different memories
  - Attribute-based
  - Automatic

    __kernel void foo( __global uint *data
                       __attribute__((buffer_location(qdr))) )
    {
        foo(data[i]);
    }

Channels Advantage
[Figure: two copies of the platform block diagram, side by side. Left: standard OpenCL. Right: Altera vendor extension with IO and kernel channels. Each shows OpenCL kernels on an interconnect, two DDR3 interfaces, four QDRII interfaces, two 10Gb network interfaces, a host interface and CvP update.]

Kernel Development Flow
- Modify kernel.cl
- Tools, fastest first:
  - x86 Emulator (seconds)
  - Optimization Report (minutes)
  - Prototype (minutes)
  - Profiler (hours)
- Checks along the way:
  - Functional bugs?
  - Stall-free pipeline?
  - Memory coalesced?
  - Hardware performance met?
- DONE!

x86 Emulator (Beta v14.1)
- Enables functional debug of kernel code on an x86 system
- Prototype support that lets users run kernels on an x86 platform
- Debug support for Altera vendor-specific extensions such as channels
- Supports:
  - OpenCL syntax
  - Channels
  - Printf

    __kernel void accel( ... )
    {
        gid = get_global_id(0);
        out[gid] = proc(data[gid]);
    }

[Figure: the kernel passes through the x86 kernel compiler and runs as ./kernel_tb]

Example: Load to Store Dependency

    1  __kernel void prefixsum( __global int* restrict A, unsigned N ) {
    2    for ( unsigned i = 1 ; i < N ; i++ ) {
    3      int a = A[i-1];
    4      A[i] += a;
    5    }
    6  }

    ==============================================================
                     *** Optimization Report ***
    ==============================================================
    Kernel: prefixsum                                       Ln.Col
    ==============================================================
    Loop for.body                                            2.25
      Pipelined execution inferred.
      Successive iterations launched every 321 cycles due to:
        Memory dependency on Load Operation from:            3.21
          Store Operation                                    4.7
      Largest Critical Path Contributors:
        49%: Load Operation                                  3.21
        49%: Store Operation                                 4.7
    ==============================================================

Callouts:
- 321 cycles: relative cost of global memory to local computation
- True fix requires restructuring the code

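One possible restructuring of the prefix sum, sketched in plain C so it can be checked on a CPU. The idea is to carry the running sum in a local variable, so each iteration no longer loads the value the previous iteration just stored to global memory. The function names are hypothetical, and the source does not state what iteration interval the compiler reports for this form.

```c
#include <assert.h>
#include <stddef.h>

/* Loop as written on the slide: each iteration loads A[i-1], which the
   previous iteration has just stored. */
void prefix_sum_original(int *A, unsigned N)
{
    for (unsigned i = 1; i < N; i++) {
        int a = A[i - 1];
        A[i] += a;
    }
}

/* Restructured loop: the running sum lives in a local variable, so the
   loop-carried dependency is on a register rather than on memory. */
void prefix_sum_restructured(int *A, unsigned N)
{
    assert(A != NULL || N == 0);
    if (N == 0) {
        return;
    }
    int running = A[0];
    for (unsigned i = 1; i < N; i++) {
        running += A[i];
        A[i] = running;
    }
}
```

Both functions produce the same array contents; only the source of the value carried between iterations differs.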
Example: Accumulating a Value

    1  __kernel void test( __global float* restrict input,
    2                      __global float* restrict output,
    3                      unsigned N ) {
    4    float mul = 1.0f;
    5    for ( unsigned i = 0; i < N; i++ ) {
    6      mul *= input[ i ];
    7    }
    8    *output = mul;
    9  }

    ==============================================================
                     *** Optimization Report ***
    ==============================================================
    Kernel: test                                            Ln.Col
    ==============================================================
    Loop for.body                                            5.24
      Pipelined execution inferred.
      Successive iterations launched every 3 cycles due to:
        Data dependency on variable mul                      4.10
      Largest Critical Path Contributor:
        100%: Fmul Operation                                 6.7
    ==============================================================

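A common way to relax a dependency like the one on `mul` is to spread the accumulation over several partial products held in a shift register, then combine them after the loop. The plain C sketch below illustrates that pattern under stated assumptions: the depth `DEPTH` is an arbitrary choice for illustration, the function names are hypothetical, and the source does not say this is the fix the authors applied. Reordering floating-point multiplications can also change rounding in general.

```c
#include <assert.h>
#include <stddef.h>

#define DEPTH 4  /* number of independent partial products; assumed value */

/* Straightforward accumulation, as in the slide's kernel. */
float product_simple(const float *input, unsigned n)
{
    float mul = 1.0f;
    for (unsigned i = 0; i < n; i++) {
        mul *= input[i];
    }
    return mul;
}

/* Accumulation through a shift register: each new product depends on a
   value written DEPTH iterations earlier, not on the previous iteration. */
float product_with_shift_register(const float *input, unsigned n)
{
    float shift_reg[DEPTH + 1];
    assert(input != NULL || n == 0);

    for (unsigned j = 0; j < DEPTH + 1; j++) {
        shift_reg[j] = 1.0f;
    }
    for (unsigned i = 0; i < n; i++) {
        shift_reg[DEPTH] = shift_reg[0] * input[i];
        for (unsigned j = 0; j < DEPTH; j++) {
            shift_reg[j] = shift_reg[j + 1];
        }
    }
    /* Combine the DEPTH partial products into the final result. */
    float mul = 1.0f;
    for (unsigned j = 0; j < DEPTH; j++) {
        mul *= shift_reg[j];
    }
    return mul;
}
```

In a kernel, the inner shift and the final combine loop would typically be fully unrolled so the register array maps to hardware registers.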
Rapid Prototyping (Beta v14.1)
- Increases productivity during application development
- Uses a library of pre-compiled templates to skip Quartus II compilation
- Can test small versions of the final design on hardware very quickly
- Ability to generate custom templates based on user kernels
  - Tailors the Rapid Prototyping Template Library to the user
[Figure: standard flow: user program, OpenCL compiler (aoc), Quartus II (~hours), HW implementation. Prototype flow: user program, OpenCL compiler (aoc march=prototype), configuration from the Template Library (~minutes).]

Profiler (BETA v14.1)
- Instruments the pipeline with performance counters and profiling logic
- Transfers the profiling information to the host via the PCIe link

    __kernel void accel( ... )
    {
        gid = get_global_id(0);
        out[gid] = a[gid] + b[gid];
    }

[Figure: kernel pipeline with two Load units, an adder and a Store unit, each connected to memory-mapped registers]

Profiler (BETA v14.1)
[Figure: profiler screenshot showing bottlenecks, bandwidth, saturation and pipeline occupancy]

OpenCL Host Library & Run Time Environment (RTE)
Host library improvements:
- Lower CPU usage
- Improved scalability
- Lower memory footprint
- Faster run time
SDK & Run Time Environment:

    OS                       SDK (needs ACDS)   RTE
    Windows x86-64           Installer          Installer
    Linux (RHEL) x86-64      Installer, RPM     Installer, RPM
    Linux (RHEL) Power       -                  RPM
    Linux (custom) CV SoC    -                  Tarball

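Installable Client Driver (BETA v14.1)
[Figure: host.c (opencl.h) calls clGetPlatformIDs; the ICD dispatches to a vendor library such as nvidiaopencl or AlteraOpenCL. Application stages: Acquire, Compute, Visualize; kernels in device.cl.]
- Windows: registry key HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors, with a DWORD value named <library>.dll
- Linux: /etc/OpenCL/vendors/<vendor>.icd, containing <library>.so
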
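Altera Client Driver (BETA v14.1)
[Figure: host.c (opencl.h) calls clGetPlatformIDs and clGetDeviceIDs; the ICD dispatches to nvidiaopencl or AlteraOpenCL, and AlteraOpenCL dispatches further through the ACD. Application stages: Acquire, Compute, Visualize; kernels in device.cl.]
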
OpenCL + FPGA Key Benefits
- Faster development vs. traditional FPGA design flow
  - Puts the FPGA in the software developer's hands
  - Familiar C-based development flow
- Higher performance/watt vs. CPU/GPGPU
  - Implement exactly what you need
  - Pipeline parallel structures
  - Custom interconnect converging with data processing cores
- Lower power vs. CPU/GPGPU
  - Core frequency lower: 200-250 MHz vs 1 GHz
  - Turn off unused logic
  - Up to 1/5 the power
- Portability & obsolescence-free
  - Code can transfer between different HW accelerators (CPU, GPGPU, FPGA, etc.)
  - Code ports seamlessly to new generations of the FPGA
  - FPGA life cycle considerably longer than that of CPUs or GPGPUs

Additional Resources
Altera SDK for OpenCL Design Flow
Set Up: Getting Started Guide (document)
- Install Quartus II v13.1 with the Altera SDK for OpenCL
- Install a C compiler or development environment
- Obtain and set up a license from the Self Service Licensing Center
- Install the FPGA (OpenCL) board: aocl install
Design: Programming Guide (document)
- Develop kernel code and compile on CPU/GPU for functional correctness
- Build, compile & link the host application (Visual Studio/GCC)
- Compile the OpenCL kernel with the Altera Offline Compiler (aoc)
- Run the application
Optimize: Best Practices (document)
- Optimize the kernel for FPGA hardware

Additional Altera OpenCL Collateral
- White papers on OpenCL
- OpenCL online demos
- OpenCL design examples
- Instructor-led training
  - Parallel Computing with OpenCL Workshop by Altera (1 day)
  - Optimization of OpenCL for Altera FPGAs Training by Altera (1 day)
- Online training
  - Introduction to Parallel Computing with OpenCL
  - Writing OpenCL Programs for Altera FPGAs
  - Running OpenCL on Altera FPGAs
  - Single-Threaded vs. Multi-Threaded Kernels
  - Building Custom Platforms for Altera SDK for OpenCL
- OpenCL board partners page

Application Benchmarking
Case Study: GZIP Compression
- OpenCL result compared with hand-coded Verilog:
  - 10% slower
  - 12% more resources
  - 3x faster development time
- An Altera summer intern ported and optimized the GZIP algorithm in a little more than a month
- An industry-leading company's FPGA engineer coded the Verilog version in 3 months
- Much lower design effort and design time

Conclusions
Results: CHREC / Univ. of Florida

OpenCL vs. VHDL productivity table

    Apps.                   VHDL development time   OpenCL development time
    Sobel, Canny, & SURF    6 months                1 month

OpenCL vs. VHDL performance table

             VHDL, Stratix 4          VHDL, predicted Stratix 5   OpenCL, Stratix 5
    App      Frames/sec  Max freq.    Frames/sec  Max freq.       Frames/sec  Max freq.
    Sobel    475         170          909         300             870         300
    Canny    470         170          890         300             823         309
    SURF     392         170          870         300             804         283

Productivity
- Avoid the productivity challenges of HDL
- 6x increase in productivity
- OpenCL offers a familiar C environment
Performance
- Develop fully pipelined kernels
- Minimum performance cost: < 10% overhead

Case Study: Image Classification
- Deep learning algorithm
  - Convolutional neural network (CNN)
  - Based on Hinton's CNN
- Early results on Stratix V
  - 2X performance/power vs. GPGPU despite soft floating point
  - 8+ simultaneous kernels vs. 2 on GPGPU
  - Exploiting OpenCL channels between kernels
- A10 expectations
  - Hard floating point
  - Better density and frequency
  - ~4X performance/watt vs. SV
- The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
  - Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
[Figure: 10 random images from each CIFAR-10 class; diagram of Hinton's CNN algorithm]

AES Encryption
- Encryption/decryption
  - 256-bit key
  - Counter (CTR) method
- Advantage FPGA
  - Integer arithmetic
  - Coarse-grain bit operations
  - Complex decision making
Results

    Platform                              Power (W)   Performance (GB/s)   Efficiency (MB/s/W)
    E5503 Xeon Processor (single core)    est 80      0.01                 0.125
    AMD Radeon HD 7970                    est 100     0.33                 3.3
    PCIe385 A7 Accelerator                25          5.20                 208

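The efficiency column in the AES results follows from dividing throughput by power. A small C sketch of that arithmetic, assuming 1 GB = 1000 MB (the function name is hypothetical):

```c
#include <assert.h>

/* Efficiency in MB/s per watt, from throughput in GB/s and power in W.
   Assumes decimal units: 1 GB = 1000 MB. */
double efficiency_mb_per_s_per_w(double perf_gb_per_s, double power_w)
{
    assert(power_w > 0.0);
    return perf_gb_per_s * 1000.0 / power_w;
}
```

For the PCIe385 A7 row, 5.20 GB/s at 25 W gives 5200 / 25 = 208 MB/s/W, and the Xeon row gives 10 / 80 = 0.125 MB/s/W, matching the table.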
Multi-Asset Barrier Option Pricing
- Monte Carlo simulation
  - No closed-form solution possible
  - High-quality random number generator required
  - Billions of simulations required
- Used the GPU vendor's example code
- Advantage FPGA
  - Complex control flow
- Optimizations
  - Channels, loop pipelining
Results

    Platform                  Power (W)   Performance (Bsims/s)   Efficiency (Msims/s/W)
    W3690 Xeon Processor      130         0.032                   0.0025
    NVIDIA Kepler20           212         10.1                    48
    Bittware S5-PCIe-HQ       45          12.0                    266

Document Filtering
- Unstructured data analytics
  - Bloom filter
- Advantage FPGA
  - Integer arithmetic
  - Flexible memory configuration
Results

    Platform                  Power (W)   Performance (MTs)   Efficiency (MTs/W)
    W3690 Xeon Processor      130         2070                15.92
    NVIDIA Tesla C2075        215         3240                15.07
    PCIe385 A7 Accelerator    25          3602                144.08

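For readers unfamiliar with the data structure behind this benchmark, here is a minimal Bloom filter in plain C. It is a generic illustration, not the benchmarked implementation: the filter size, the number of hash functions, the seeded FNV-1a-style hash and all function names are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   1024u  /* filter size in bits; arbitrary for the sketch */
#define BLOOM_HASHES 3u     /* number of hash functions; arbitrary */

typedef struct {
    uint8_t bits[BLOOM_BITS / 8];
} bloom_t;

/* Seeded FNV-1a-style string hash, reduced to a bit index. */
uint32_t bloom_hash(const char *s, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    for (; *s != '\0'; ++s) {
        h ^= (uint8_t)*s;
        h *= 16777619u;
    }
    return h % BLOOM_BITS;
}

/* Clear every bit so the filter represents the empty set. */
void bloom_init(bloom_t *f)
{
    assert(f != NULL);
    memset(f->bits, 0, sizeof f->bits);
}

/* Insert a term by setting one bit per hash function. */
void bloom_add(bloom_t *f, const char *term)
{
    for (uint32_t k = 0; k < BLOOM_HASHES; ++k) {
        uint32_t idx = bloom_hash(term, k);
        f->bits[idx / 8] |= (uint8_t)(1u << (idx % 8));
    }
}

/* Returns 0 if the term is definitely absent, 1 if it may be present
   (false positives are possible, false negatives are not). */
int bloom_maybe_contains(const bloom_t *f, const char *term)
{
    for (uint32_t k = 0; k < BLOOM_HASHES; ++k) {
        uint32_t idx = bloom_hash(term, k);
        if ((f->bits[idx / 8] & (1u << (idx % 8))) == 0) {
            return 0;
        }
    }
    return 1;
}
```

The work per term is a handful of integer hashes and single-bit memory accesses, which fits the slide's stated FPGA advantages of integer arithmetic and flexible memory configuration.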
Consumer (Japan) Image Processing
- Adaptive weighted images
[Equation: adaptive weighting formula for p_xy in terms of coefficients c1, c2, weights W and distances d over neighboring pixels (i, j) and (i-1, j); the layout was lost in extraction]
- Advantage FPGA
  - Integer arithmetic
Results

    Platform                  Power (W)   Performance (FPS)   Efficiency (FPS/W)
    W3565 Xeon Processor      est 130     0.05                0.0004
    NVIDIA Quadro 4000        est 150     2.94                0.0200
    PCIe385 A7 Accelerator    21          4.29                0.2040

Smith-Waterman
- Sequence alignment
  - Scoring matrix
- Advantage FPGA
  - Integer arithmetic
  - SMT streaming
Results

    Platform                  Power (W)   Performance (MCUPS)   Efficiency (MCUPS/W)
    W3565 Xeon Processor      140         40                    0.29
    NVIDIA K20                225         704                   3.13
    PCIe385 A7 Accelerator    25          32596                 1303.00

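The scoring matrix at the heart of Smith-Waterman can be sketched in a few lines of plain C. This is a textbook version with a linear gap penalty, not the benchmarked kernel; the scoring constants, the length limit and the function names are assumptions. Each inner-loop iteration is one cell update, which is the unit counted by MCUPS (million cell updates per second).

```c
#include <assert.h>
#include <string.h>

#define SW_MAX_LEN  64    /* longest sequence the sketch accepts; arbitrary */
#define SW_MATCH    2     /* score for a matching pair; assumed value */
#define SW_MISMATCH (-1)  /* score for a mismatching pair; assumed value */
#define SW_GAP      (-1)  /* linear gap penalty; assumed value */

int sw_max(int x, int y)
{
    return x > y ? x : y;
}

/* Returns the best local alignment score between sequences a and b. */
int smith_waterman_score(const char *a, const char *b)
{
    size_t n = strlen(a);
    size_t m = strlen(b);
    int H[SW_MAX_LEN + 1][SW_MAX_LEN + 1];
    int best = 0;

    assert(n <= SW_MAX_LEN && m <= SW_MAX_LEN);

    /* First row and column are zero: an alignment may start anywhere. */
    for (size_t i = 0; i <= n; ++i) H[i][0] = 0;
    for (size_t j = 0; j <= m; ++j) H[0][j] = 0;

    for (size_t i = 1; i <= n; ++i) {
        for (size_t j = 1; j <= m; ++j) {
            int sub  = (a[i - 1] == b[j - 1]) ? SW_MATCH : SW_MISMATCH;
            int diag = H[i - 1][j - 1] + sub;
            int up   = H[i - 1][j] + SW_GAP;
            int left = H[i][j - 1] + SW_GAP;
            int h    = sw_max(0, sw_max(diag, sw_max(up, left)));
            H[i][j] = h;
            best = sw_max(best, h);
        }
    }
    return best;
}
```

Every cell depends only on its upper, left and upper-left neighbors and uses integer arithmetic, so cells along an anti-diagonal are independent of each other.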
Multi Function Printer
- Image processing
  - RGB output of the raster scanner is converted to CMYK colorants for printing
- Advantage FPGA
  - SoC solution
  - IO and kernel channels
  - Heterogeneous memory accesses
- Goal
  - 50 PPM at A4/letter size
- Results
  - >40X improvement over the C-based algorithm on ARM only
    - No NEON coprocessor used
  - C6 speed grade part improved 20% to 128 PPM

Suricata: IDS/IPS Implementation (Cybersecurity)
[Figure: ingress network path: 2x 10 Gbps ETH IO, STD IDS packet processing and analysis, DPI packet processing and analysis, traffic control, IPS packet manipulation, ETH IO, 2x 10 Gbps; mirrored for the egress network path. STD rules memory (QDR or DDR), DPI rules memory (QDR or DDR), IDS/IPS management.]
Packet Analysis Kernel - IDS (task)
- Stream in decoded packets and store in local memory (aoclreadchannel)
- Parallel regex with STD rules in global memory (heterogeneous memory support)
- Write results to global memory
- Stream out decoded packets (aoclwritechannel)
Host - IDS/IPS Management
- Read results from global memory and log
- Decide to modify or delete packets
Packet Manipulation Kernel - IPS (task)
- Stream in decoded packets (aoclreadchannel)
- Read and process the decision from the host
- Stream out decoded packets (aoclwritechannel)
Decoder Kernel (autorun)
- Stream in encoded packets (aoclreadchannel)
- Unpack single streams
- Stream out decoded packets (aoclwritechannel)
Packet Analysis Kernel - DPI (task)
- Stream in decoded packets and store in local memory (aoclreadchannel)
- Parallel regex with DPI rules in global memory (heterogeneous memory support)
- Write results to global memory
- Stream out decoded packets (aoclwritechannel)
Encoder Kernel (autorun)
- Stream in decoded packets (aoclreadchannel)
- Repack multiple streams
- Stream out encoded packets (aoclwritechannel)

Haplotype Caller (Pair-HMM)
- Smith-Waterman-like algorithm
  - Uses hidden Markov models to compare gene sequences
- 3 stages: Assembler, Pair-HMM (70%), Traversal + Genotyping
- Floating point (SP + DP)
- C++ code starting point (from Java)
- Whole genome takes 7.6 days!
Results

    Platform                  Runtime (ms)
    Java (gatk 2.8)           10,800
    Intel Xeon E5-1650        138
    NVIDIA Tesla K40          70
    Nallatech SV-D8           15.5

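Sobel Filter
- Fundamental image filter algorithm
  - Used commonly in industrial and automotive applications
- Sliding-window-based design pattern
  - Same shift register structure, except in two dimensions
[Figure: line-buffer shift register. Pixels enter at index 0; window taps A B C, E F G and further taps sit at offsets between WIDTH-9 and WIDTH*4-1 (labels shown: 0, WIDTH-9, WIDTH-1, WIDTH*2-9, WIDTH*3-9, WIDTH*3, WIDTH*4-9, WIDTH*4-1)]
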
Task Implementation and Results

Altera OpenCL:

    __kernel void sobel(int iters)
    {
        // Coefficients
        int Gx[3][3] = {{-1,-2,-1},{0,0,0},{1,2,1}};
        int Gy[3][3] = {{-1,0,1},{-2,0,2},{-1,0,1}};
        int rows[2 * COLS + 3];  // line buffer
        int count = 0;
        while (count != iters) {
            // Shift the line buffer
            #pragma unroll
            for (int i = COLS * 2 + 2; i > 0; --i) {
                rows[i] = rows[i - 1];
            }
            rows[0] = read_channel_altera(in_channel);

            int x_dir = 0;
            int y_dir = 0;
            #pragma unroll
            for (int i = 0; i < 3; ++i) {
                #pragma unroll
                for (int j = 0; j < 3; ++j) {
                    x_dir += rows[i * COLS + j] * Gx[i][j];
                    y_dir += rows[i * COLS + j] * Gy[i][j];
                }
            }
            int edge_weight = abs(x_dir) + abs(y_dir);
            write_channel_altera(out_channel, edge_weight);
            ++count;
        }
    }

On our design example website: http://www.altera.com/support/examples/opencl/sobel-filter.html

    Device       Resolution   FPS
    Cyclone V    1080p        60
    Stratix V    1080p        135

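The sliding-window pattern in the kernel can be exercised on a CPU by replacing the channels with plain arrays. The sketch below is such an emulation in C, under stated assumptions: `COLS` is set to a small hypothetical width, the function name `sobel_stream` is invented for illustration, and the line buffer is zero-initialized here so the first outputs are well defined.

```c
#include <assert.h>
#include <stdlib.h>

#define COLS 8  /* image width in pixels; hypothetical value for the sketch */

/* Emulates the streaming Sobel kernel: in[count] stands in for the input
   channel read and out[count] for the output channel write. */
void sobel_stream(const int *in, int *out, int iters)
{
    static const int Gx[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    static const int Gy[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    int rows[2 * COLS + 3] = {0};  /* line buffer, zeroed for the emulation */

    assert(in != NULL && out != NULL && iters >= 0);

    for (int count = 0; count < iters; ++count) {
        /* Shift the line buffer by one pixel and insert the new pixel. */
        for (int i = COLS * 2 + 2; i > 0; --i) {
            rows[i] = rows[i - 1];
        }
        rows[0] = in[count];

        /* Apply both 3x3 masks to the window held in the line buffer. */
        int x_dir = 0;
        int y_dir = 0;
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                x_dir += rows[i * COLS + j] * Gx[i][j];
                y_dir += rows[i * COLS + j] * Gy[i][j];
            }
        }
        out[count] = abs(x_dir) + abs(y_dir);
    }
}
```

Feeding a constant image produces zero edge weight once the line buffer has filled (after 2 * COLS + 3 pixels), since both masks sum to zero; earlier outputs reflect the zero-filled start of the buffer.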