Low-latency data acquisition to GPUs using FPGA-based 3rd-party devices
Denis Perret, LESIA / Observatoire de Paris
RTC_4_AO workshop, PARIS 2016
RTC for ELT AO
[Diagram: the sensors and deformable mirrors are connected to the hard real-time pipeline through a 40GbE network (high bandwidth, low latency); telemetry goes to the soft real-time side through a 40GbE network (high bandwidth).]
Using FPGA to reduce acquisition latency
~ Interface: 10 GbE, low-latency acquisition interface
~ Without Direct Memory Access to the GPU RAM: multiple data copies = added latency
[Diagram: 10GbE NIC -> PCIe -> CPU RAM -> GPU RAM; the pixels transit through host memory before reaching the GPU.]
Using FPGA to reduce acquisition latency
~ Interface: 10 GbE, low-latency acquisition interface
~ With Direct Memory Access to the GPU RAM: single data transfer = low latency, maximum bandwidth
[Diagram: 10GbE NIC -> PCIe -> GPU RAM directly, bypassing the copy through CPU RAM.]
No PCIe P2P at all
[Diagram: encapsulated data, NIC -> PCIe -> CPU RAM; the CPU application handles the interrupts and the data decapsulation.]
~ The NIC DMA engine sends the data to host memory and raises an interrupt to the CPU when the buffer is (half) full.
~ The CPU suspends its tasks and saves its current state (context switch).
~ The CPU decapsulates the data (Ethernet, TCP/UDP, GigE Vision, ...).
No PCIe P2P at all
[Diagram: decapsulated data in CPU RAM; the CPU application initiates the data and operation stream transfer to the GPU over PCIe.]
~ The CPU builds a CUDA stream containing operations such as kernel launches and memory transfer orders.
~ The stream is sent to the GPU.
~ The pixels are copied to the GPU RAM (GPU DMA, reading process over PCIe); see the sketch below.
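Below is a minimal sketch, in the spirit of the path just described, of a CUDA stream queuing the host-to-device copy, the computation kernel and the device-to-host copy. The kernel `process` and the buffer names are assumptions for illustration, not the workshop code; `h_pixels` and `h_results` are assumed to be pinned (cudaHostAlloc) so the copies can be truly asynchronous.

```cuda
// Minimal sketch of the non-P2P path: the CPU queues everything in one CUDA
// stream. All names and the placeholder computation are illustrative.
#include <cuda_runtime.h>

__global__ void process(const unsigned short *pixels, float *results, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        results[i] = 0.001f * pixels[i];          // placeholder computation
}

void process_frame(const unsigned short *h_pixels, float *h_results,
                   unsigned short *d_pixels, float *d_results,
                   int n, cudaStream_t stream)
{
    // 1. Pixels decapsulated by the CPU are copied to the GPU RAM
    //    (GPU DMA engine reads over PCIe).
    cudaMemcpyAsync(d_pixels, h_pixels, n * sizeof(*d_pixels),
                    cudaMemcpyHostToDevice, stream);

    // 2. The computation kernel is queued in the same stream.
    process<<<(n + 255) / 256, 256, 0, stream>>>(d_pixels, d_results, n);

    // 3. The results are copied back to the CPU RAM.
    cudaMemcpyAsync(h_results, d_results, n * sizeof(*d_results),
                    cudaMemcpyDeviceToHost, stream);

    // 4. The CPU either waits here or goes on doing something else.
    cudaStreamSynchronize(stream);
}
```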
No PCIe P2P at all
[Diagram: computation results travel from GPU RAM back to CPU RAM over PCIe; the CPU application does something else and gets interrupted periodically.]
~ The CPU waits for the computation to end (or does something else).
~ The results are sent back to the CPU RAM.
~ The synchronization mechanisms between the GPU and the CPU are hidden (interrupts, ...).
No PCIe P2P at all
[Diagram: encapsulated results, CPU RAM -> PCIe -> NIC; the CPU application encapsulates the data and initiates the transfer.]
~ The CPU encapsulates the results (Ethernet, TCP/UDP, ...).
~ The CPU initiates the transfer by configuring and launching the NIC DMA engine (reading process).
With PCIe P2P and a custom NIC
[Diagram: the custom NIC (TOE, data decapsulation) writes the decapsulated data over PCIe straight into GPU RAM, where a polling kernel runs; the CPU application is minding its own business.]
~ The CPU gets the address and size of a dedicated GPU buffer and uses it to configure the NIC DMA engine.
~ The data are written directly to the GPU memory.
~ A GPU kernel runs indefinitely, detects new data by polling its memory (busy loop), performs the computation and fills a local buffer with the results; see the sketch below.
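A minimal sketch of such a persistent polling kernel, assuming the NIC DMA engine writes a full frame into GPU memory and then increments a frame counter there; the names, the counter protocol and the placeholder computation are assumptions, not the actual kernel used here.

```cuda
// Persistent ("infinite") polling kernel, launched once with a single block:
//   polling_kernel<<<1, 256, 0, stream>>>(d_counter, d_pixels, d_results, n, d_stop);
#include <cuda_runtime.h>

__global__ void polling_kernel(volatile unsigned int *frame_counter,
                               const volatile unsigned short *pixels,
                               float *results, int n,
                               volatile int *stop)
{
    __shared__ unsigned int frame;   // frame id seen by the whole block
    __shared__ int quit;
    unsigned int last_frame = 0;

    for (;;) {
        // One thread busy-waits on the counter written by the NIC DMA engine.
        if (threadIdx.x == 0) {
            while (*frame_counter == last_frame && !*stop)
                ;                                   // busy loop, no CPU involved
            frame = *frame_counter;
            quit  = *stop;
        }
        __syncthreads();
        if (quit) break;
        last_frame = frame;

        // Placeholder computation on the new frame: fill the local result buffer.
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            results[i] = 0.001f * pixels[i];
        __syncthreads();
    }
}
```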
Getting back the results: 1st way
[Diagram: results in GPU RAM are copied over PCIe into the custom NIC's RAM; the CPU application is still minding its own business.]
~ The CPU gets notified, in a hidden way, that the computation has ended.
~ The CPU initiates the transfer from GPU to FPGA (FPGA DMA engine -> reading process over PCIe).
Getting back the results: 2nd way
[Diagram: the GPU writes the results over PCIe into the custom NIC's RAM; the CPU application is still minding its own business.]
~ The GPU sends the data directly by writing to the NIC addressable space (GPU DMA -> write process over PCIe).
Getting back the results: 3rd way
[Diagram: results in GPU RAM; the GPU raises a flag in the FPGA addressable space and the FPGA fetches the results over PCIe; the CPU application is still minding its own business.]
~ The GPU notifies the FPGA by writing a flag into the FPGA addressable space.
~ The FPGA gets the data back (FPGA DMA engine -> reading process); see the sketch below.
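For the 2nd and 3rd ways, one possible way to let a CUDA kernel write into the FPGA/NIC addressable space is to mmap the BAR from the FPGA driver and register it with CUDA as I/O memory. This is only a sketch under that assumption: the /dev/custom_nic node and FLAG_OFFSET are hypothetical, the cudaHostRegisterIoMemory flag needs CUDA 7.5+ and a driver that exposes the BAR; it is not necessarily the mechanism used in this setup.

```cuda
// Sketch: obtain a device pointer into the FPGA/NIC BAR so a kernel can write
// results (2nd way) or a "results ready" flag (3rd way) directly over PCIe.
// Error handling omitted for brevity.
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

void *map_bar_for_gpu(size_t bar_size)
{
    int fd = open("/dev/custom_nic", O_RDWR | O_SYNC);   // hypothetical driver node
    void *bar = mmap(NULL, bar_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    // Register the I/O mapping with CUDA and get a device pointer to it.
    cudaHostRegister(bar, bar_size, cudaHostRegisterIoMemory);
    void *d_bar = NULL;
    cudaHostGetDevicePointer(&d_bar, bar, 0);
    return d_bar;
}

// Inside a kernel, the 3rd-way flag could then be raised with e.g.:
//   ((volatile unsigned int *)d_bar)[FLAG_OFFSET] = 1;
```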
Using FPGA to reduce acquisition latency: first tests
~ Stratix V PCIe development board from PLDA (+ QuickPCIe, QuickUDP IP cores): 42 Gb/s demonstrated from board to GPU; 8.8 Gb/s per 10GbE link in loopback mode.
~ 10 GbE camera from Emergent Vision Technologies (8.9 Gb/s to GPU memory), GigE Vision protocol, 1.5 kfps at 240x240 pixels coded on 10 bits (360 fps at 2k x 1k on 8 bits, or 1k x 1k on 10 bits).
Using FPGA to reduce acquisition latency: latency measurement
[Diagram: measurement setup. The CUSTOM_NIC (FPGA) contains the PHYs (one in loopback), UDP decapsulation, a DEMUX, GigE Vision protocol handling, a data generator, camera control, a DMC block and several DMA engines; over PCIe 3.0 it exchanges pixels, measurements, commands and answers with the CPU RAM and application, and writes pixels into the GPU RAM, where a ring buffer, the polling kernel, a DM command buffer and the computation kernels live; deformable mirror commands go back out through the NIC.]
P2P from FPGA to GPU (way back still launched by the CPU)
[Plot: per-frame latency over ~4096 frames for four series: no p2p & computation, p2p & computation, no p2p & no computation, p2p & no computation.]
Interrupts vs polling (on the CPU)
[Diagram: test setup with two PCs linked over 10GbE. One hosts a PLDA "QuickPlay" board (DDR, image-generator HDL kernel, CoG c-kernel (or not), UDP0/UDP1, QPCIE DMA); the other hosts a non-QuickPlay PLDA board (DDR, latency-measurement HDL kernel, UDP0/UDP1, PCIe BAR2, on-board RAM), together with the computing kernel(s) and the polling kernel.]
Interrupts vs polling (on the CPU)
[Plot: latency comparison of memory polling, optimized interrupts + "stress -c 8", non-optimized interrupts, and non-optimized interrupts + "stress -c 8".]
Interrupts vs polling (on the CPU)
[Plot: latency comparison of polling on an isolated CPU core, optimized interrupts, and polling on a non-isolated CPU core.]
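As a rough illustration of the polling side of this comparison, here is a host-side busy loop on a frame counter that the FPGA DMA engine is assumed to update in the host buffer after each frame; names and the counter protocol are illustrative. Pinning this thread to a core isolated with isolcpus is what gives the low-jitter "polling + isolated CPU" behaviour.

```cuda
// Host-side memory polling (vs. waiting on an interrupt): no IRQ, no context
// switch, no scheduler wake-up latency. Illustrative sketch only.
#include <atomic>

static inline unsigned int wait_for_frame(volatile unsigned int *frame_counter,
                                          unsigned int last_frame)
{
    // Busy loop on the counter written by the FPGA DMA engine.
    while (*frame_counter == last_frame)
        ;                                           // optionally _mm_pause() here

    // Make sure the DMA'd payload is read only after the counter update is seen.
    std::atomic_thread_fence(std::memory_order_acquire);
    return *frame_counter;
}
```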
Using FPGA to reduce the load on GPU/PCIe/network
[Diagram: in the read direction, the N x N subimages are buffered in dual-port RAM blocks, rearranged through FIFOs, fed to parallel CoG blocks, and the results are sent by DMA over PCIe to the RAM.]
~ FPGAs are well suited to on-the-fly decapsulation, data rearrangement and highly parallelized computation.
~ Less load on the PCIe, smaller buffers in the GPU RAM, less latency.
~ Could be done in the WFS, which lowers the load on the network as well (an illustrative CoG kernel is sketched below).
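For reference, a minimal GPU version of the per-subaperture centre-of-gravity that such an FPGA pipeline computes on the fly, one thread per subimage; sizes and names are illustrative, this is not the FPGA implementation, and a tuned GPU kernel would rather use one block per subimage with a shared-memory reduction.

```cuda
// `frame` is W x W pixels split into nsub x nsub subimages of S x S pixels;
// the kernel outputs one (x, y) centroid per subimage.
#include <cuda_runtime.h>

__global__ void cog_kernel(const unsigned short *frame,
                           float *cog_x, float *cog_y,
                           int W, int S, int nsub)
{
    int sub = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per subimage
    if (sub >= nsub * nsub) return;

    int x0 = (sub % nsub) * S;            // subimage origin in the full frame
    int y0 = (sub / nsub) * S;

    float sum = 0.f, sum_x = 0.f, sum_y = 0.f;
    for (int y = 0; y < S; ++y)
        for (int x = 0; x < S; ++x) {
            float v = frame[(y0 + y) * W + (x0 + x)];
            sum   += v;
            sum_x += v * x;
            sum_y += v * y;
        }
    cog_x[sub] = (sum > 0.f) ? sum_x / sum : 0.f;
    cog_y[sub] = (sum > 0.f) ? sum_y / sum : 0.f;
}
```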
Or even do everything with FPGAs
~ Exploring the possibilities with high-level development environments:
  ~ Vendor-specific HLS: Xilinx Vivado, ...
  ~ QuickPlay: based on KPN (Kahn process networks). Has its own HLS and will be able to use other HLS tools. PCIe P2P integration is planned.
  ~ OpenCL: available for FPGAs (Altera), CPUs and GPUs. Can now handle network protocols: Altera introduced I/O channels allowing kernels to read and write network streams defined by the board designer. P2P over PCIe should be possible (synchronization?). Integration of custom HDL blocks?
  ~ MATLAB to HDL: did someone try it? Maybe useful to help develop IPs.
-> Interactions with HPC tools such as MPI, DDS or CORBA remain quite challenging.