Low-latency data acquisition to GPUs using FPGA-based 3rd party devices. Denis Perret, LESIA / Observatoire de Paris




Low-latency data acquisition to GPUs using FPGA-based 3rd party devices. Denis Perret, LESIA / Observatoire de Paris. RTC_4_AO workshop, PARIS 2016

RTC for ELT AO: the deformable mirror and the sensors are connected to the RTC through a 40GbE network (high bandwidth, low latency, hard real time); telemetry goes over a 40GbE network (high bandwidth, soft real time).

Using FPGA to reduce acquisition latency ~ Interface: 10 GbE, low-latency acquisition interface ~ Without Direct Memory Access to the RAM: multiple data copies = introduced latency. [diagram: 10GbE NIC - PCIe - CPU - RAM]

Using FPGA to reduce acquisition latency ~ Interface: 10 GbE, low-latency acquisition interface ~ With Direct Memory Access to the RAM: single data transfer = low latency, maximum bandwidth. [diagram: 10GbE NIC - PCIe - CPU - RAM]
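A minimal sketch of what the host side can do to help keep that a single transfer: allocate the landing buffer as page-locked (pinned) memory, so the later copy towards the GPU is itself one DMA transfer. Whether the NIC driver can DMA straight into this particular buffer is driver-specific and only assumed here; the frame size is illustrative.

```cpp
// Sketch: pinned host buffer as the DMA landing zone (assumed NIC support).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t kFrameBytes = 240 * 240 * 2;   // hypothetical frame size
    void *rx_buffer = nullptr;

    // Page-locked allocation: the GPU DMA engine can read it directly over PCIe.
    if (cudaHostAlloc(&rx_buffer, kFrameBytes, cudaHostAllocDefault) != cudaSuccess) {
        std::fprintf(stderr, "pinned allocation failed\n");
        return 1;
    }

    // ... the NIC driver (or a recv() loop) fills rx_buffer here ...

    cudaFreeHost(rx_buffer);
    return 0;
}
```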

No PCIe P2P at all. [diagram: encapsulated data flow from the NIC to host RAM over PCIe; the CPU application handles the interrupts and the data decapsulation] ~ The NIC DMA engine sends data to host memory and raises an interrupt to the CPU when the buffer is (half) full. ~ The CPU suspends its tasks and saves its current state (context switch). ~ The CPU decapsulates the data (Ethernet, TCP/UDP, GigE Vision...).
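A hedged sketch of that CPU-side receive-and-decapsulate path using a plain UDP socket; the port number and the 8-byte application header are placeholders, not the real GigE Vision (GVSP) layout.

```cpp
// Sketch: the CPU receives UDP datagrams and strips a (placeholder) header.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(3956);                 // hypothetical camera stream port
    bind(sock, reinterpret_cast<sockaddr *>(&addr), sizeof(addr));

    std::vector<uint8_t> datagram(9000);         // jumbo-frame sized buffer
    const size_t kHeaderBytes = 8;               // placeholder protocol header

    for (int pkt = 0; pkt < 1000; ++pkt) {
        // This is the CPU work the slide describes: interrupt-driven receive,
        // copy out of the NIC buffers, then decapsulation.
        ssize_t n = recvfrom(sock, datagram.data(), datagram.size(), 0, nullptr, nullptr);
        if (n <= static_cast<ssize_t>(kHeaderBytes)) continue;
        const uint8_t *pixels = datagram.data() + kHeaderBytes;   // decapsulated payload
        std::printf("packet %d: %zd payload bytes\n", pkt, n - (ssize_t)kHeaderBytes);
        (void)pixels;                            // hand off to the processing pipeline
    }
    close(sock);
    return 0;
}
```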

No PCIe P2P at all. [diagram: decapsulated data sit in host RAM; the CPU application initiates the data and operation stream transfer to the GPU over PCIe] ~ The CPU builds a CUDA stream containing operations such as kernel launches and memory transfer orders. ~ The stream is sent to the GPU. ~ The pixels are copied to the GPU RAM (GPU DMA engine, reading process over PCIe).
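A minimal sketch of such a stream (variable and kernel names are illustrative, not the presenter's code): the CPU enqueues a host-to-GPU copy and a kernel launch, and the GPU DMA engine pulls the pixels from pinned host RAM over PCIe when the stream reaches the copy.

```cpp
// Sketch: build and launch a CUDA stream with a copy + kernel.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void process_frame(const uint16_t *pixels, float *result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) result[i] = static_cast<float>(pixels[i]);   // placeholder computation
}

int main() {
    const int kPixels = 240 * 240;
    uint16_t *h_pixels, *d_pixels;
    float *d_result;
    cudaHostAlloc(&h_pixels, kPixels * sizeof(uint16_t), cudaHostAllocDefault);
    cudaMalloc(&d_pixels, kPixels * sizeof(uint16_t));
    cudaMalloc(&d_result, kPixels * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Operations are only enqueued here; the GPU DMA engine reads the pinned
    // host buffer over PCIe when the stream reaches the copy entry.
    cudaMemcpyAsync(d_pixels, h_pixels, kPixels * sizeof(uint16_t),
                    cudaMemcpyHostToDevice, stream);
    process_frame<<<(kPixels + 255) / 256, 256, 0, stream>>>(d_pixels, d_result, kPixels);

    cudaStreamSynchronize(stream);   // wait for the whole stream to drain
    cudaStreamDestroy(stream);
    cudaFree(d_result);
    cudaFree(d_pixels);
    cudaFreeHost(h_pixels);
    return 0;
}
```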

No PCIe P2P at all. [diagram: computation results return from GPU RAM to host RAM over PCIe; the CPU application does something else and gets interrupted periodically] ~ The CPU waits for the computation to end (or does something else). ~ The results are sent to the CPU RAM. ~ The synchronization mechanisms between the GPU and the CPU are hidden (interrupts...).
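Continuing the previous sketch (names are illustrative): the results copy is queued in the same stream, and the CPU either blocks on it or keeps working and polls the stream; the actual GPU/CPU synchronization (interrupts) stays hidden inside the CUDA runtime.

```cpp
// Sketch: retrieve results from the stream built in the previous example.
#include <cuda_runtime.h>
#include <cstddef>

void collect_results(cudaStream_t stream, const float *d_result,
                     float *h_result, size_t n) {
    // Queue the GPU->CPU copy behind the kernel in the same stream.
    cudaMemcpyAsync(h_result, d_result, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);

    // Option 1: block until the kernel and the copy have completed.
    // cudaStreamSynchronize(stream);

    // Option 2: do something else and check the stream from time to time.
    while (cudaStreamQuery(stream) == cudaErrorNotReady) {
        // ... do other work ...
    }
}
```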

No PCIe P2P at all. [diagram: encapsulated results go from host RAM through PCIe to the NIC; the CPU application encapsulates the data and initiates the transfer] ~ The CPU encapsulates the results (Ethernet, TCP/UDP...). ~ The CPU initiates the transfer by configuring and launching the NIC DMA engine (reading process).
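A hedged sketch of that last step with a plain UDP socket (the destination port and the 4-byte application header are placeholders): the CPU wraps the results in a datagram, and the kernel network stack then programs the NIC DMA engine to read the packet from host RAM.

```cpp
// Sketch: encapsulate the results and hand them to the NIC via a UDP socket.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>
#include <vector>

void send_results(const float *results, size_t n, const char *dm_ip) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in dst{};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(5000);                  // hypothetical DM controller port
    inet_pton(AF_INET, dm_ip, &dst.sin_addr);

    std::vector<uint8_t> packet(4 + n * sizeof(float));
    uint32_t frame_id = 0;                       // placeholder application header
    std::memcpy(packet.data(), &frame_id, sizeof(frame_id));
    std::memcpy(packet.data() + 4, results, n * sizeof(float));

    sendto(sock, packet.data(), packet.size(), 0,
           reinterpret_cast<sockaddr *>(&dst), sizeof(dst));
    close(sock);
}
```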

With PCIe P2P and a custom NIC. [diagram: data go from the custom NIC (TOE, data decapsulation) straight to GPU RAM over PCIe; a polling kernel runs on the GPU while the CPU application is minding its own business] ~ The CPU gets the address and size of a dedicated GPU buffer and uses them to configure the NIC DMA engine. ~ The data are written directly to the GPU memory. ~ The GPU continuously detects new data by polling its memory (busy loop), performs the computation and fills a local buffer with the results.
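A minimal sketch of such a GPU polling kernel, assuming the custom NIC writes a frame into GPU memory and then bumps a frame counter there (the driver-side mapping of that GPU buffer for the NIC DMA engine, e.g. through GPUDirect RDMA, is not shown). Names, memory layout and the computation are illustrative, not the presenter's actual code.

```cpp
// Sketch: persistent "busy loop" kernel waiting for NIC-written frames.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void polling_kernel(volatile uint32_t *frame_counter,
                               const volatile uint16_t *rx_pixels,
                               float *results, int n_pixels, int n_frames) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    uint32_t last_seen = 0;
    for (int f = 0; f < n_frames; ++f) {
        // Every thread busy-waits until the NIC has published a new frame.
        while (*frame_counter == last_seen) { /* spin */ }
        last_seen = *frame_counter;

        // Placeholder computation on the freshly written pixels.
        if (i < n_pixels) results[i] = static_cast<float>(rx_pixels[i]);
    }
}

int main() {
    const int kPixels = 240 * 240, kFrames = 1000;
    uint32_t *d_counter;  uint16_t *d_rx;  float *d_results;
    cudaMalloc(&d_counter, sizeof(uint32_t));
    cudaMemset(d_counter, 0, sizeof(uint32_t));
    cudaMalloc(&d_rx, kPixels * sizeof(uint16_t));
    cudaMalloc(&d_results, kPixels * sizeof(float));

    // The bus addresses of d_counter / d_rx would be handed to the NIC DMA
    // engine by a kernel driver (GPUDirect RDMA); not shown. Stand-alone,
    // nothing bumps the counter, so the kernel simply waits for frames.
    polling_kernel<<<(kPixels + 255) / 256, 256>>>(d_counter, d_rx, d_results,
                                                   kPixels, kFrames);
    cudaDeviceSynchronize();
    return 0;
}
```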

Getting back the results: 1st way. [diagram: results go from GPU RAM to the custom NIC (TOE, data decapsulation) over PCIe; the polling kernel keeps running and the CPU application is still minding its own business] ~ The CPU gets notified, in a hidden way, that the computation has finished. ~ The CPU initiates the transfer from GPU to FPGA (FPGA DMA engine -> reading process over PCIe).

Getting back the results: 2nd way. [diagram: results go from GPU RAM to the custom NIC over PCIe; the polling kernel keeps running and the CPU application is still minding its own business] ~ The GPU sends the data directly by writing to the NIC addressable space (GPU DMA -> write process over PCIe).

Getting back the results: 3rd way. [diagram: results go from GPU RAM to the custom NIC over PCIe; the polling kernel keeps running and the CPU application is still minding its own business] ~ The GPU notifies the FPGA by writing a flag into the FPGA addressable space. ~ The FPGA gets the data back (FPGA DMA engine -> reading process).
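A hedged sketch of the GPU-side flag write (the same BAR-mapping idea underlies the 2nd way, where the GPU writes the data themselves into the NIC address space): the FPGA BAR is mmap()ed on the host, registered with CUDA as I/O memory, and a kernel writes a "results ready" flag through the resulting device pointer. The sysfs path, offset and flag value are assumptions, and cudaHostRegisterIoMemory support depends on the platform and driver.

```cpp
// Sketch: GPU writes a doorbell flag into a mapped FPGA BAR over PCIe.
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

__global__ void notify_fpga(volatile uint32_t *doorbell) {
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *doorbell = 1;                      // "results ready" flag for the FPGA
}

int main() {
    const size_t kBarSize = 4096;
    // Map the FPGA's PCIe BAR (device path is hypothetical).
    int fd = open("/sys/bus/pci/devices/0000:03:00.0/resource2", O_RDWR | O_SYNC);
    if (fd < 0) { std::perror("open BAR"); return 1; }
    void *bar = mmap(nullptr, kBarSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) { std::perror("mmap BAR"); return 1; }

    // Expose the mapped I/O memory to CUDA and get a device-usable pointer.
    cudaHostRegister(bar, kBarSize, cudaHostRegisterMapped | cudaHostRegisterIoMemory);
    uint32_t *d_doorbell = nullptr;
    cudaHostGetDevicePointer(reinterpret_cast<void **>(&d_doorbell), bar, 0);

    notify_fpga<<<1, 1>>>(d_doorbell);      // the GPU writes the flag over PCIe
    cudaDeviceSynchronize();

    cudaHostUnregister(bar);
    munmap(bar, kBarSize);
    close(fd);
    return 0;
}
```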

Using FPGA to reduce acquisition latency: first tests ~ Stratix V PCIe development board from PLDA (+ QuickPCIe, QuickUDP IP cores): 42 Gb/s demonstrated from board to GPU; 8.8 Gb/s per 10GbE link in loopback mode ~ 10 GbE camera from Emergent Vision Technologies (8.9 Gb/s to memory), GigE Vision protocol, 1.5 kfps at 240x240 pixels coded on 10 bits (360 fps at 2k x 1k on 8 bits, or 1k x 1k on 10 bits)

Using FPGA to reduce acquisition latency: latency measurement. [block diagram: a data generator with Vision protocol handling plays the camera; the CUSTOM_NIC performs UDP decapsulation and demultiplexing, with PHY loopback used for the latency measurements; DMA engines move pixels over PCIe 3.0 into gpu_ram, which holds the polling kernel, a ring buffer, the computation kernels and a DM command buffer; deformable mirror commands and answers go back through the NIC, while the CPU application handles camera control]

P2P from FPGA to GPU (way back still launched by the CPU). [plot: latency over about 4096 iterations for four configurations: no p2p & computation, p2p & computation, no p2p & no computation, p2p & no computation; y-axis roughly 600 to 1200]

Interrupts vs polling (on the CPU). [test setup: two PCs, each with a PLDA board, linked over 10GbE; the "QuickPlay" board runs an image-generator HDL kernel and a C kernel computing the CoG (or not), over UDP0/UDP1; the non-QuickPlay board runs a latency-measurement HDL kernel with QuickPCIe DMA, BAR2 and DDR, and talks over PCIe to the host RAM where the computing kernel(s) and the polling kernel run]

Interrupts vs polling (on the CPU). [plot: latency distributions for Memory polling, Optimized Interrupt + "stress -c 8", Non-optimized Interrupt, and Non-optimized Interrupt + "stress -c 8"; x-axis roughly 9300 to 9650, y-axis 0 to 700]

Interrupts vs polling (on the CPU). [plot: latency distributions for Polling + isolated CPU, Optimized Interrupt, and Polling + non-isolated CPU; x-axis roughly 9330 to 9360, y-axis 0 to 700]

Using FPGA to reduce the load on GPU/PCIe/network. [block diagram, read direction: pixels stream through double-port DPRAMs and FIFOs into N parallel CoG blocks, one per row of the N*N subimages, before a DMA transfer over PCIe to RAM] ~ FPGAs are well suited for on-the-fly decapsulation, data rearrangement and highly parallelized computation. ~ Less load on the PCIe, smaller buffers in RAM, less latency. ~ This could be done in the WFS itself, lowering the load on the network.
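For reference, an illustrative CUDA version of the per-subimage centre-of-gravity (CoG) computation that this slide proposes to move into the FPGA; one block per subimage, with sizes and memory layout assumed.

```cpp
// Sketch: centre of gravity of each sub x sub subimage of a grid x grid array.
#include <cuda_runtime.h>
#include <cstdint>

// frame: full image of (grid*sub) x (grid*sub) pixels, row-major.
// cog:   2 floats (x, y) per subimage.
__global__ void cog_per_subimage(const uint16_t *frame, float *cog,
                                 int grid, int sub) {
    int sx = blockIdx.x, sy = blockIdx.y;            // subimage coordinates
    int width = grid * sub;

    __shared__ float sum_i, sum_x, sum_y;
    if (threadIdx.x == 0) { sum_i = sum_x = sum_y = 0.f; }
    __syncthreads();

    // Each thread accumulates a few pixels of its subimage.
    for (int p = threadIdx.x; p < sub * sub; p += blockDim.x) {
        int x = p % sub, y = p / sub;
        float v = frame[(sy * sub + y) * width + (sx * sub + x)];
        atomicAdd(&sum_i, v);
        atomicAdd(&sum_x, v * x);
        atomicAdd(&sum_y, v * y);
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        int s = sy * grid + sx;
        cog[2 * s + 0] = (sum_i > 0.f) ? sum_x / sum_i : 0.f;
        cog[2 * s + 1] = (sum_i > 0.f) ? sum_y / sum_i : 0.f;
    }
}
// Launch example: cog_per_subimage<<<dim3(grid, grid), 128>>>(d_frame, d_cog, grid, sub);
```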

...or even do everything with FPGAs ~ Exploring the possibilities with high-level development environments: ~ Vendor-specific HLS: Xilinx Vivado, ~ QuickPlay: based on KPN. Has its own HLS and will be able to use other HLS tools. They are planning to integrate PCIe P2P. ~ OpenCL: available for FPGAs (Altera), CPUs and GPUs. Can now handle network protocols: Altera introduced I/O channels allowing kernels to read and write network streams defined by the board designer. P2P over PCIe should be possible (synchronization?). Integration of custom HDL blocks? ~ MATLAB to HDL: has anyone tried it? It may be useful for developing IPs. -> Interactions with HPC tools such as MPI, DDS or CORBA remain quite challenging.