Embedded Systems: map to FPGA, GPU, or CPU?
Jos van Eijndhoven, jos@vectorfabrics.com
Bits&Chips Embedded Systems, Nov 7, 2013
Moore's law versus Amdahl's law
[Chart: the number of transistors and the computational capacity of hardware grow over time, while software performance lags behind — hardware capabilities are underutilized. This programming bottleneck drove the introduction of multicore technology.]
Multi-core CPUs are here to stay
- CPUs grow to 2, 4, 8, ... 64, ... 256 cores
- Mobile, desktop, server
- Multi-threaded programming model to keep the cores busy
- Complex multi-level caches, hardware cache coherency
- Examples: Nvidia Tegra 3, AMD Fusion (Llano), Intel Xeon Phi
Creating parallel programs is hard

Herb Sutter, chair of the ISO C++ standards committee, Microsoft:
"Everybody who learns concurrency thinks they understand it, ends up finding mysterious races they thought weren't possible, and discovers that they didn't actually understand it yet after all."

Edward A. Lee, EECS professor at U.C. Berkeley:
"Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism."
Learning raises the feeling of complexity
- Provides good insight into C++ concurrency: C++11 standardizes several concurrency primitives, and warns of many, many subtle problems
- The authoritative description (4th edition) apparently requires 1300+ pages...
- Safe concurrency by defensive design: shows that Java shares many concurrency issues with C++
Further appetite for performance?
- General-purpose CPUs are (traditionally) designed to handle code with complex control flow
- Their effective use of silicon for computation is low: area(ALUs) / area(total die) is about 1%
- How to significantly increase operations/sec/$ and operations/J?
- Hand off compute load to:
  - Function-specific hardware accelerators (H.264 decode, LTE channel decode, GFX rendering, IP packet processing, ...)
  - GP-GPU: general-purpose programmable graphics processing units
  - FPGA accelerators: field-programmable gate arrays
Offload CPU: computational efficiency
GP-GPU:
- High floating-point performance (>1 TFLOPS)
- Large off-chip memory bandwidth
- Needs thousands of concurrent threads
- Few inter-thread data dependencies and little data-dependent control
- High-end chips draw huge power (>100 W)
FPGA:
- High integer performance (>1 Tops)
- Good power efficiency
- Needs hundreds of concurrent instructions
- Takes HW design expertise and effort
- High-end chips are very expensive (>$1000)
CPU + FPGA combinations
- Xilinx Zynq or Altera Cyclone with dual-core ARM
- Or all kinds of boards that fit the PC architecture
CPU + GP-GPU combinations
- AMD Fusion for desktop, gaming
- Nvidia Tesla for high-end compute
- Intel Haswell: desktop, laptop
- ODROID: ARM quad-core with embedded GP-GPU
Intel for embedded: don't underestimate
- Intel NUC (Next Unit of Computing): Core i3 or i5 on a 4" x 4" board
- Intel Atom 'Bay Trail': dual- and quad-core, 22 nm, with embedded GP-GPU
- Intel Quark: 1/10 the power of an Atom, 32-bit x86 architecture; Arduino-style development board
CPU + accelerator application mapping
Functional partitioning:
- Create a SW thread with the appropriate functionality
- Channels for synchronized inter-thread communication
- Plain shared data for unsynchronized access
[Diagram: Application = CPU-thread 1 <-> Channel <-> FPGA Accelerator <-> Channel <-> CPU-thread 2]
A conceptually nice picture, but with real implementation hurdles:
- Application I/O to hardware is shielded by any 'real' operating system
- Thread control (sleep/wakeup) interacts with accelerator progress
- C code of the SW thread is mapped to the FPGA through high-level synthesis
Creation of an FPGA accelerator
Software functional reference:
- Compute kernel: C source code in a SW thread
- Inter-thread communication API (channels, shared memory, mutex, ...)
FPGA hardware implementation:
- HW implementation of the compute kernel, via HLS tooling
- HW implementation of the same communication API, from an IP library
High-level synthesis tooling (e.g. Xilinx Vivado):
- Choose local (embedded) memories for some of the C variables, synthesize shared-memory access for others
- Balance the amount of hardware against the required performance (loop unrolling)
HW/SW communication stack
CPU-side stack:
- Application (SW virtual address space)
- Compute library, e.g. LAPACK, crypto
- Channel
- User-level driver, kernel driver (Linux)
- Multi-core CPU with MMU and caches; snoop control unit; DDR; PCIe / AXI memory bus
Accelerator-side (FPGA) stack:
- LAPACK accelerator, crypto accelerator
- Channel
- FIFO interfaces to the accelerators; DMA streaming, caches; shared access to local SRAMs
- PCIe / AXI interface
ARM (Cortex-A9) multicore example
[Diagram: multicore ARM with L2 cache and DDR, with an FPGA or GPU attached]
Intel (i5) multicore example
[Diagram: multicore i5 with memory bus and DDR, with an FPGA or GPU attached via PCIe]
- Device reads will be pulled from the CPU L1/L2/L3 caches
- PCIe 3.0 improves on writes with new caching hints in the protocol
Memory-mapped communication?
A shared-memory paradigm to communicate with the GPU/FPGA?
- Matches the C/Java programming model
- Highly efficient, low run-time overhead
- No system calls for data transport: just CPU loads/stores
- Takes advantage of the existing on-chip caches to buffer data
Sounds nice... Can I transfer a C/Java object pointer through my channel, for dereferencing inside my accelerator? Well, that would require tackling:
- Cache coherency issues
- MMU issues (virtual memory paging support)
Shared memory with GP-GPU?
Today, Nvidia's CUDA is the popular programming environment:
- Based on separate memories (use on-card memory)
- Explicit data transport to/from the GPU card, avoiding shared memory
- Allows a streaming model, where CPU and GPU are concurrently active
Providers of integrated GPUs (AMD, Intel, ARM) are working to improve on this programming model:
- Integrated GPUs do share global memory with the CPU, so there is no need to really copy data
- MMUs are being added to the GPU, allowing pointers to be shared
- Cache coherency support remains (for now) only partial, requiring SW-driven transfer of ownership of data segments
Shared memory with FPGA?
FPGA vendors are late to provide the SW and tools to integrate an accelerator with the host CPU and OS:
- Support for the OpenCL programming model is coming
- They rely on explicit data transport to/from FPGA local memory
- Creating mmap-capable device drivers can be done by yourself?
- Also, MMU sharing can be implemented by yourself in the FPGA?
GPU vendors are ahead of FPGA vendors in attracting customers with SW-oriented tooling.
Evaluating an application mapping (1)
Vector Fabrics studied the mapping of a particular video object-recognition algorithm for one of our customers:
- Its compute kernel contained a 2-D convolution to match images
- The software reference implementation performed 0.9G multiply-adds per second on a desktop PC: too low for actual deployment
We created performance estimates for potential mappings to different target architectures.
Evaluating an application mapping (2)
One week of optimizing the algorithm for an Intel i5 platform:
- Multi-threading to utilize the available 4 cores, and vectorization (SSSE3) to speed up the pixel operations
- Reaching 25G multiply-adds/sec
One week of mapping the C kernel to an FPGA implementation (not including the CPU-FPGA communication):
- Rewriting the C kernel for use in a synthesis tool (Xilinx Vivado)
- Carefully tuning the on-chip memory architecture for high parallelism
- Reaching, amazingly, the same 25G multiply-adds/sec on a (ballpark) $200 FPGA chip
A few days to study the mapping to a midrange Nvidia GPU card:
- A rough estimate showed the potential to achieve about 75G multiply-adds/sec
- It required mapping a much larger code portion to avoid frequent data transfers: a really difficult task
Conclusion
- Multi-core CPUs are everywhere, yet multi-threaded programming is difficult and error-prone. Heterogeneous system programming adds further complexity.
- GP-GPU vendors have treated the SW programmer better than FPGA vendors, by delivering integrated compilers and OS device drivers (and they now proceed with memory-mapped integration).
- Spending three weeks on code tuning and mapping was sufficient to obtain good insight into the opportunities of these heterogeneous architectures.
- Don't underestimate the power and potential of Intel.
Thank you
Check www.vectorfabrics.com for a free demo on concurrency analysis.