Embedded Systems: map to FPGA, GPU, CPU?




Embedded Systems: map to FPGA, GPU, CPU?
Jos van Eijndhoven, jos@vectorfabrics.com
Bits&Chips Embedded Systems, Nov 7, 2013

Moore's law versus Amdahl's law
[Chart: number of transistors (Moore's law) versus delivered software performance over time. Hardware capabilities go underutilized; the programming bottleneck appeared with the introduction of multicore technology.]

Multi-core CPUs are here to stay
- CPUs grow to 2, 4, 8, ... 64 ... 256 cores: mobile, desktop, server
- Multi-threaded programming model to keep cores busy
- Complex multi-level caches, hardware cache coherency
- Examples: Nvidia Tegra 3, AMD Fusion "Llano", Intel Xeon Phi

Creating parallel programs is hard
Herb Sutter, chair of the ISO C++ standards committee, Microsoft: "Everybody who learns concurrency thinks they understand it, ends up finding mysterious races they thought weren't possible, and discovers that they didn't actually understand it yet after all."
Edward A. Lee, EECS professor at U.C. Berkeley: "Although threads seem to be a small step from sequential computation, in fact, they represent a huge step. They discard the most essential and appealing properties of sequential computation: understandability, predictability, and determinism."

Learning raises the feeling of complexity
- Provides good insight into C++ concurrency; C++11 standardizes several concurrency primitives; warns of many, many subtle problems
- The authoritative description (4th edition) apparently requires 1300+ pages...
- Safe concurrency by defensive design; shows that Java shares many concurrency issues with C++

Further appetite for performance?
General-purpose CPUs are (traditionally) designed to handle code with complex control flow. Their effective use of silicon for computations is low: area(ALUs)/area(total die) is about 1%.
How to significantly increase operations/sec/$ and operations/J? Hand off the compute load to:
- Function-specific hardware accelerators (H.264 decode, LTE channel decode, GFX rendering, IP packet processing, ...)
- GP-GPU: general-purpose programmable graphics processing units
- FPGA accelerators: field-programmable gate arrays

Offload CPU: computational efficiency
GP-GPU:
- High floating-point performance (>1 TFLOPS)
- Large off-chip memory bandwidth
- Needs thousands of concurrent threads
- Few inter-thread data dependencies and little data-dependent control
- High-end chips draw huge power (>100 W)
FPGA:
- High integer performance (>1 TOPS)
- Good power efficiency
- Needs hundreds of concurrent instructions
- Takes HW design expertise and effort
- High-end chips are very expensive (>$1000)

CPU + FPGA combinations
- Xilinx Zynq or Altera Cyclone with dual-core ARM
- Or all kinds of boards to fit the PC architecture

CPU + GP-GPU combinations
- AMD Fusion for desktop, gaming, ...
- Nvidia Tesla for high-end compute
- Intel Haswell: desktop, laptop, ...
- ODROID: quad-core ARM with embedded GP-GPU

Intel for embedded: don't underestimate
- Intel NUC (Next Unit of Computing): Core i3 or i5 on a 4" x 4" board
And furthermore:
- Intel Atom "Bay Trail": dual- and quad-core, 22 nm, with embedded GP-GPU
- Intel Quark: 1/10 the power of Atom, 32-bit x86 architecture; Arduino-style development board

CPU - Accelerator application mapping
Functional partitioning:
- Create a SW thread with the appropriate functionality
- Channels for synchronized inter-thread communication
- Plain shared data for unsynchronized access
[Diagram: an application with CPU-thread 1 and CPU-thread 2, each connected through a channel to an FPGA accelerator]
A conceptually nice picture, but with real implementation hurdles:
- Application I/O to hardware is shielded by any 'real' operating system
- Thread control (sleep/wakeup) interacts with accelerator progress
- C code of the SW thread is mapped to the FPGA through high-level synthesis

Creation of an FPGA accelerator
Software functional reference:
- Compute kernel: C source code in a SW thread
- Inter-thread communication API (channels, shared memory, mutex, ...)
FPGA hardware implementation:
- HLS tooling and IP library
- FPGA HW implementation of the compute kernel
- HW implementation of the same communication API
High-level synthesis tooling (e.g. Xilinx Vivado):
- Choose local (embedded) memories for some of the C variables; synthesize shared-memory access for others
- Balance the amount of hardware against the required performance (loop unrolling)

HW/SW communication stack
[Diagram: layered stack. CPU side: an application in a SW virtual address space, on top of a compute library (e.g. LAPACK, crypto), a user-level driver, and a kernel driver under Linux, running on a multi-core CPU with MMU, caches, a snoop control unit, and DDR. Accelerator (FPGA) side: LAPACK and crypto accelerators behind FIFO interfaces, with DMA streaming, caches, and shared access to local SRAMs. The two sides communicate over channels across a PCIe / AXI memory bus.]

ARM (Cortex-A9) multicore example
[Diagram: multi-core ARM cluster with shared L2 cache and DDR, with an FPGA or GPU attached.]

Intel (i5) multicore example
[Diagram: FPGA or GPU attached via PCIe to the memory bus and DDR.]
- Device reads will be pulled from the CPU L1/L2/L3 caches
- PCIe 3.0 improves on writes with new caching hints in the protocol

Memory-mapped communication?
A shared-memory paradigm to communicate with the GPU/FPGA?
- Matches the C/Java programming model
- Highly efficient, low run-time overhead
- No system calls for data transport: just CPU loads/stores
- Takes advantage of existing on-chip caches to buffer data
Sounds nice... Can I transfer a C/Java object pointer through my channel, for dereferencing inside my accelerator? Well, that would require tackling:
- Cache coherency issues
- MMU issues (virtual memory paging support)

Shared memory with GP-GPU?
Today, Nvidia's CUDA is the popular programming environment:
- Based on separate memories (use on-card memory)
- Explicit data transport to/from the GPU card, avoiding shared memory
- Allows a streaming model, where CPU and GPU are concurrently active
Providers of integrated GPUs (AMD, Intel, ARM) are working to improve on this programming model:
- Integrated GPUs do share global memory with the CPU; there is no need to really copy data
- MMUs are being added to the GPU, allowing pointers to be shared
- Cache coherency support remains (for now) only partial, requiring SW-driven transfer of ownership of data segments

Shared memory with FPGA?
FPGA vendors are late to provide the SW and tools to integrate an accelerator with the host CPU and OS:
- Support for the OpenCL programming model is coming
- They rely on explicit data transport to/from FPGA local memory
- Creating mmap-capable device drivers can be done by yourself?
- Also, MMU sharing can be implemented by yourself in the FPGA?
GPU vendors are ahead of FPGA vendors in attracting customers with SW-oriented tooling.

Evaluating an application mapping (1)
Vector Fabrics studied the mapping of a particular video object-recognition algorithm for one of our customers:
- Its compute kernel contained a 2-D convolution to match images.
- The software reference implementation performed 0.9G multiply-adds per second on a desktop PC: too slow for actual deployment.
- We created performance estimates for potential mappings to different target architectures.

Evaluating an application mapping (2)
- One week of optimizing the algorithm for an Intel i5 platform: multi-threading to utilize the available 4 cores, and vectorization (SSSE3) to speed up pixel operations. Reached 25G multiply-adds/sec.
- One week of mapping the C kernel to an FPGA implementation (not including the CPU-FPGA communication): rewriting the C kernel for use in a synthesis tool (Xilinx Vivado), and carefully tuning the on-chip memory architecture for high parallelism. Reached, amazingly, the same 25G multiply-adds/sec on a (ballpark) 200 FPGA chip.
- A few days to study a mapping onto a midrange Nvidia GPU card: a rough estimate showed the potential to achieve about 75G multiply-adds/sec, but it required mapping a much larger code portion to avoid frequent data transfers. That would be a really difficult task.

Conclusion
- Multi-core CPUs are everywhere, yet multi-threaded programming is difficult and error-prone. Heterogeneous system programming adds further complexity.
- GP-GPU vendors have catered to the SW programmer better than FPGA vendors, by delivering integrated compilers and OS device drivers (and now proceed with memory-mapped integration).
- Spending three weeks on code tuning and mapping was sufficient to obtain good insight into the opportunities of heterogeneous architectures.
- Don't underestimate the power and potential of Intel.

Thank you
Check www.vectorfabrics.com for a free demo on concurrency analysis.