An FPGA platform for ultra-fast data acquisition
M. Caselle, M. Balzer, S. Chilingaryan, A. Kopmann, U. Stevanovic, M. Vogelgesang
December 2012, UFO project group, KIT
KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association
www.kit.edu
Ultra-Fast X-ray Imaging (ANKA/UFO experimental station)
UFO → Ultra-Fast X-ray Imaging of Scientific Processes with On-line Assessment and Data-driven Process Control
High spatial resolution (< 1 µm) for 2D and 3D visualization, plus time resolution (2D: 10 kHz, 3D: 10 Hz), gives insight into the temporal evolution of structures and thus access to the dynamics of processes.
Main application fields: medical diagnostics, biology, non-destructive testing, materials research, etc.
Requirements:
- High-granularity, low-noise monolithic silicon pixel detector: a few µm pixel pitch, a several-MPixel matrix operating at several kframes/s
- High readout bandwidth up to 50 Gb/s into GPUs (3D tomography reconstruction)
KIT-IPE readout concept for high-data-throughput scientific applications
Concept (data-flow diagram):
- Data source (up to 10 GB/s): X-ray detector, CMOS image sensor, fast ADC
- Input/connection stage: DAQ boards with FPGAs and high-speed memory on a small PCIe backplane; feedback loops, real-time data elaboration, data reduction
- High-throughput data flow (up to 4 GB/s): drivers, GPU/CPU algorithms, fast data storage (GPU/CPU infrastructure under development by the Data Processing group at KIT-IPE)
- Data storage (up to 0.25 GB/s): LSDF
Implementation of the concept: a daughter sensor board carrying the image sensor, a mother readout board with a Virtex-6 FPGA and the PCIe link to the DAQ (together forming the UFO camera), and a small backplane. This talk focuses on the FPGA and the readout board.
Flexible high-throughput FPGA platform
FPGA internal architecture (base configuration):
- PC DAQ side: CPU, chipset root port, memory; optical/electrical link, x4 lanes @ 5 Gb/s
- User bank register and master-control FSM
- FIFO and SerDes input stage (KIT IP core), x lanes @ 500 Mbit/s from the data source (detector)
- Remote detector control
Added to the base architecture: on-line parallel data processing and a DDR interface (KIT IP core) to external DDR3 memory (800 MHz @ 64 bit).
Completing the platform: PCI Express + DMA (KIT IP core), decoupled from the user logic by FIFOs.
Three logic cores have been developed for a flexible high-throughput platform:
- PCIe bus-master DMA readout architecture
- Multi-port high-speed DDR3 interface
- Configurable 2..16-bit SerDes (serializer/deserializer) architecture
In addition:
- PCI Express/DMA Linux 32/64-bit driver with ring-buffer data management
- Integration in the parallel GPU/CPU computing framework
Virtex-6 floorplan: DDR3 interface, PCIe, SerDes & input stage, DMA.
PCIe bus-master DMA readout architecture
Software layers: GPU/CPU applications, data decoding and consistency check, DMA driver.
FPGA core: optical/electrical x4 lanes @ 5 Gb/s into GTX transceivers, Xilinx integrated block for PCI Express Gen2, Northwest Logic DMA engines, a custom PCIe DMA interface, user bank register, I/O interface logic with FIFO read/write control-packet FSMs, and FPGA temperature & voltage monitoring.
User-logic interface: OUT port (data_out[0..63], data_valid, busy_logic, clock_out) and IN port (data_in[0..63], wr_en, back-pressure, clock_in).
- Bus-master DMA operating with 4 PCIe lanes @ Gen2 (250 MHz)
- Two independent engines for writes/reads between the FPGA (user logic) and PC host memory
- FIFO-like IN and OUT interfaces for the user logic
- FIFOs decouple the clock domains of the DMA and the custom user logic
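The ring-buffer data management mentioned for the Linux driver can be pictured as follows. This is a hypothetical, simplified model (all names are illustrative, not the actual driver API): the DMA engine fills fixed-size pages at a head index while the application drains them from a tail index, so FPGA writes and host reads proceed independently, with back-pressure when the ring is full.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative ring-buffer bookkeeping for a DMA driver (sketch only). */
#define RING_PAGES 64u

struct ring {
    uint32_t head;  /* next page the DMA engine will fill */
    uint32_t tail;  /* next page the application will read */
};

/* DMA-completion side: one more page is ready.
 * Returns 0 on success, -1 if the ring is full (back-pressure). */
static int ring_produce(struct ring *r)
{
    if ((r->head + 1u) % RING_PAGES == r->tail % RING_PAGES)
        return -1;                    /* full: DMA must stall */
    r->head = (r->head + 1u) % RING_PAGES;
    return 0;
}

/* Application side: claim the oldest filled page, or -1 if empty. */
static int ring_consume(struct ring *r)
{
    if (r->tail == r->head)
        return -1;                    /* empty: nothing to read yet */
    int page = (int)r->tail;
    r->tail = (r->tail + 1u) % RING_PAGES;
    return page;
}
```

Keeping one page of slack between head and tail is the classic way to distinguish "full" from "empty" without an extra counter.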
Preliminary: a new PCIe bus-master DMA architecture
Disadvantages of IP cores from external vendors:
1) expensive (35k for the Northwest Logic DMA; 10-60k for EZDMA/QuickPCIe by PLDA)
2) tied to a single FPGA family (Virtex-6, speed grade -2)
The new KIT-IPE bus-master DMA engines operate with x8 PCI Express lanes @ Gen2: IN/OUT data at 128 bit @ 250 MHz → internal bandwidth of 32 Gb/s in both read and write, with an FPGA resource estimation below 4%.
[Figure: data transfer rate vs. packet size (up to ~64 KB), comparing PC memory ↔ FPGA throughput for the Northwest Logic DMA and the KIT DMA; data valid for the X58 PCIe chipset. Virtex-6 floorplan showing the PCIe Gen2 RX and TX engines.]
- FPGA ring-buffer management → ongoing
- 64-bit Linux driver → under optimization (~32 Gb/s)
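The 32 Gb/s figure can be cross-checked arithmetically: PCIe Gen2 runs 5 GT/s per lane with 8b/10b line coding, so an x8 link carries 8 × 5 × 8/10 = 32 Gb/s of usable data, exactly matching the quoted 128-bit @ 250 MHz internal datapath. A minimal check:

```c
#include <assert.h>

/* PCIe Gen2 usable bandwidth in Gb/s for a given lane count:
 * 5 GT/s per lane minus the 8b/10b line-code overhead. */
static double pcie_gen2_gbps(int lanes)
{
    return lanes * 5.0 * 8.0 / 10.0;
}

/* Bandwidth of an internal FPGA datapath in Gb/s. */
static double datapath_gbps(int width_bits, double clock_hz)
{
    return width_bits * clock_hz / 1e9;
}
```

So the link and the 128-bit internal bus are rate-matched by construction, which is why the engines can stream at line rate in both directions.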
Two-port DDR3 memory interface architecture
Why a two-port DDR3 memory controller? The Xilinx Multi-Port Memory Controller (IP core) is limited in maximum data throughput (less than 2 GB/s per port) and has a complex user interface.
Ref.: LogiCORE IP Multi-Port Memory Controller (MPMC) v6.03.a, DS643, March 1, 2011
Architecture: a user control interface (data_in[0..N], wr_en, clock_in, start addresses for ports 1 and 2, read/write, DDR busy, data_out[0..M], data_valid, clock_out) feeds WR/RD FIFOs; an arbiter FSM with WR/RD DDR FSMs (256 bit @ 200 MHz) drives the Xilinx PHY-DDR IP core and the DDR3 memory (800 MHz @ 64 bit), which is split into a data-frame segment and an on-line data-processing segment.
- Bandwidth of 51 Gb/s, limited by the FPGA speed grade (Virtex-6, speed grade -1)
- Two concurrent operations in the same or different segments (~25 Gb/s each)
- FIFO-like data interface; only a minimum of control signals is required
- FIFOs decouple the clock domains of the memory controller and the custom user logic
- Configurable user-defined data widths N and M → 32/64/128/512 bits
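The arbiter FSM's job can be modeled in a few lines. This is a behavioral sketch, not the VHDL implementation: the DDR3 back end serves one port per burst, so pending requests from the data-frame port (0) and the on-line-processing port (1) are granted round-robin, which is one simple policy that yields the roughly equal ~25 Gb/s split per port quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical round-robin arbiter model for the two-port controller. */
struct arbiter {
    int last_grant;              /* port served by the previous burst */
};

/* Returns the port granted the next DDR burst, or -1 if neither
 * port has a pending request. */
static int arbiter_grant(struct arbiter *a, bool req0, bool req1)
{
    bool req[2] = { req0, req1 };
    /* Start from the port after the last one served (round-robin). */
    for (int i = 1; i <= 2; i++) {
        int p = (a->last_grant + i) % 2;
        if (req[p]) {
            a->last_grant = p;
            return p;
        }
    }
    return -1;
}
```

When only one port requests, it gets every burst (full ~51 Gb/s); when both request, bursts alternate.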
A configurable SerDes input stage architecture
Why not a Xilinx ISERDESE1 stage? The parallel output data width is limited to no more than 10 bits (for two ISERDESE1 in cascade) and is not dynamically configurable, and the alignment FSM is not included in the Xilinx tools.
Ref.: Virtex-6 FPGA SelectIO Resources User Guide, UG361 (v1.3), August 16, 2010
Configurable 2-to-16-bit parallel data output: custom SerDes logic with an MSB-alignment state machine.
Signal path: serial data and clock from the CMOS sensor, fast ADC, etc. enter through IBUFs; an IODELAY tunes the clock-to-data timing, an IDDR captures the serial stream, and an I/O clock buffer plus a regional clock buffer provide the divided clocks; a bit-slip alignment FSM locks onto a training pattern (e.g. 10110010) and asserts data-lock on the parallel data output.
- Individual clock-to-data time tuning via IODELAY (time step of 75 ps)
- I/O clock buffer located in the centre of the FPGA bank
- Regional buffer synchronous to the parallel data output
- SerDes input stage fully configurable by the user
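The bit-slip alignment can be illustrated with a small behavioral model (a sketch, not the FPGA code; the 8-bit width and the 10110010 training pattern are taken from the slide's example): the deserializer delivers words at an arbitrary bit offset, and the FSM issues bit-slips, each equivalent to a one-bit rotation of the word, until the word matches the training pattern, at which point data-lock is asserted.

```c
#include <assert.h>
#include <stdint.h>

/* Count how many bit-slips (one-bit rotations) are needed until the
 * deserialized word matches the training pattern; -1 means no lock. */
static int align_bitslips(uint16_t word, uint16_t training, int width)
{
    uint16_t mask = (uint16_t)((1u << width) - 1u);
    for (int slips = 0; slips < width; slips++) {
        if (word == training)
            return slips;        /* data-lock after this many slips */
        /* One bit-slip: rotate the word left by one bit. */
        word = (uint16_t)(((word << 1) | (word >> (width - 1))) & mask);
    }
    return -1;                   /* training pattern never seen */
}
```

In hardware the FSM performs one slip per training word and re-checks, so lock is reached within at most `width` words.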
Future developments for high-speed readout systems
Requirements:
- Real-time FPGA + GPU data elaboration → high data throughput (range of 64 Gb/s)
- Data source and FPGA readout board located far from the DAQ system
- Use of a commercial, well-known protocol for easy interfacing with commercial devices/boards
Two different approaches are possible:
- Peer-to-peer (P2P) streaming data transfer (based on the new generation of the PCI Express protocol)
- Point-to-node (network) transfer for distributed GPU/CPU high-performance-computing (HPC) clusters
IPE PCI Express readout card: overview
- PCIe Gen3 optical/electrical data transmission (8 lanes × 8 GT/s)
Readout-board concept: the data source feeds a Virtex-6 FPGA with DDR3 memory, a PCIe integrated-block endpoint (x8 lanes Gen2), user logic, and a multi-port PCIe switch (x8 lanes Gen3) driving an electrical cable (up to 5 m) or an optical cable (up to 30 m) to the PC host board; no DMA is needed.
- 64 Gb/s (write) + 64 Gb/s (read) → full-duplex mode
- FPGA real-time processing → close to the data source
- MiniPOD x12-lane optics for PCIe Gen3 (8 GT/s per lane); size 18.6 mm × 22 mm, height 14.5 mm
IPE PCI Express host card: overview
- PCIe host board with high-speed data recording: NAND flash SSD up to 2 TB @ 64 Gbit/s, FPGA with SRAM and integrated DMA
- Fully configurable data flow: x16-lane PCIe slot to the PC memory system for GPU data elaboration
High-bandwidth readout system based on InfiniBand
Data source → input stage → FPGA with memory and QSFP+ (optical or electrical data link up to 100 m) into a µTCA/ATCA crate.
- 40 Gb/s → InfiniBand, in house
- 120 Gb/s → InfiniBand, available soon
- 384 Gb/s → in the next two years
An 8-port InfiniBand switch is capable of up to 640 Gb/s. The InfiniBand DAQ/GPU cluster (QDR 40 Gb/s InfiniBand protocol, InfiniBand router, heterogeneous FPGA + CPU + GPU, ultra-low latency for high cluster performance) is under development at KIT-IPE by the Data Processing group.
InfiniBand readout board: overview
Board components: Xilinx Virtex-6 FPGA (user logic, PCIe), Mellanox InfiniHost III Ex InfiniBand silicon device with SDRAM, DDR3, a MicroBlaze soft core, JTAG for programming and BSDL (VHDL) test, CPLD & flash, QSFP+ connectors to the InfiniBand switch, and high-speed connectors (HPC Samtec or similar) from the data source.
- IP-based application layers are possible (i.e. TCP, UDP, SSH, FTP, ...)
- The InfiniHost provides the PHY, link, and transport layers for InfiniBand
- Remote DMA (RDMA) for fast data transfer → intranet communication
Conclusions and what's next
Logic cores for the high-data-throughput platform → employed in several scientific applications:
- A terahertz detector + readout system for CSR (M. Caselle, V. Judin, A.S. Müller, M. Siegel, N. Smale, P. Thoma, M. Weber, S. Wünsch), KIT departments IPE, IMS, and ANKA
- An X-ray camera for phase-contrast tomography (M. Caselle, A. Kopmann, Felix Beckmann (HZG), Joerg Burmester (HZG)), KIT and HZG
- An X-ray camera for high-spatial-resolution tomography (M. Caselle, M. Balzer, A. Kopmann, V. E. Asadchikova), Shubnikov Institute of Crystallography, Russian Academy of Sciences, Moscow, Russia
- Readout electronics for the ultrafast electron-beam X-ray tomography system "ROFEX" at HZDR (proposal under discussion)
- New KIT DMA engines (32 Gb/s) → developed and tested
- 64-bit Linux driver → under optimization
What's next:
- Design & production of readout boards based on PCIe Gen3 optical communication and on the InfiniBand protocol
- Integration in the GPU/CPU computing infrastructure
UFO camera: frame rate from 500 fps to 2 kfps; bandwidth 8 Gb/s, future upgrade to 50 Gb/s.
CSR readout: bandwidth 6 Gb/s, future upgrade to 24 Gb/s.
[Figure: ANKA filling scheme with 184 bunches, revolution time 368 ns, bunch trains (trains 1-3) with 2 ns bunch spacing; pulse shape of width ~200 ps; 32 samples inside each ~92 ps bunch; sample time resolution better than 3 ps.]
Goal: recording & analysis of the time evolution of each bunch in a multi-bunch accelerator filling scheme.
Backup slides
InfiniBand: link-layer flow control
- Credit-based link-level flow control
- Link flow control ensures NO packet loss within the fabric, even in the presence of congestion
- Link receivers grant packet receive-buffer space credits per virtual lane
- Flow-control credits are issued in 64-byte units
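The credit mechanism above can be sketched as follows (a behavioral model for illustration, not Mellanox firmware): the receiver advertises buffer space for one virtual lane as credits in 64-byte units, and the transmitter sends a packet only when it holds enough credits, so a congested link back-pressures instead of dropping packets.

```c
#include <assert.h>
#include <stdint.h>

#define CREDIT_UNIT 64u          /* flow-control credits are 64-byte blocks */

struct vl_state {
    uint32_t credits;            /* credits currently granted by the receiver */
};

/* Receiver side: grant buffer space for `bytes` of payload (round up). */
static void grant_buffer(struct vl_state *vl, uint32_t bytes)
{
    vl->credits += (bytes + CREDIT_UNIT - 1u) / CREDIT_UNIT;
}

/* Transmitter side: send a packet of `bytes` if credits allow.
 * Returns 1 if sent, 0 if it must wait (back-pressure, never loss). */
static int try_send(struct vl_state *vl, uint32_t bytes)
{
    uint32_t need = (bytes + CREDIT_UNIT - 1u) / CREDIT_UNIT;
    if (need > vl->credits)
        return 0;
    vl->credits -= need;
    return 1;
}
```

Because credits are only granted for buffer space that actually exists, the receiver can never be overrun, which is the property the slide emphasizes.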
InfiniBand: application layers and latency
Application layers: UDP or TCP, FTP, ssh.
[Figure: latency vs. number of nodes when designing with InfiniBand.]
Ref.: Introduction to InfiniBand for End Users, InfiniBand Trade Association, Beaverton, OR
UFO architecture: overview
Experimental station: X-ray beamline, sample, scintillator & optical lens, sample and detector manipulators, sample environment. A smart high-speed camera (detector + memory + FPGA with on-line processing) streams over a fast optical data link to a GPU server for on-line monitoring and evaluation (raw data processing, data evaluation, post-processing, visualization), with data management and distribution (storage, processing, management) in the Large Scale Data Facility (LSDF). Control paths: a fast hardware loop for beamline fast-reject triggering, and software control loops for 2D- and 3D-image-based control.
- High-speed, high-bandwidth, fully programmable camera (continuous data acquisition at full speed)
- Optimized image-processing algorithms using GPU computing
- Fast HW loop: on-line image-based self-event-trigger architecture (fast reject)
- SW control loops based on 2D and 3D data evaluation: 2D data → camera calibration, autofocus, self-alignment, etc.; reconstructed 3D data → e.g. optical flow
UFO camera: overview
Hardware: mother readout board (ML605) with a Xilinx FPGA (fast readout & on-line data processing), large DDR3 local memory, and a PCIe endpoint link to the UFO infrastructure; a daughter board with the CMOSIS sensor; and a Peltier cell for camera cooling, with its control board, heat sink, and fan.
The main features already implemented and tested include:
- Fully configurable camera → adjustable image exposure time and dynamic range, plus analog and digital pixel features such as pixel threshold, mask, analog gain, etc.
- Continuous data acquisition at full speed
- On-line image-based self-event-trigger architecture (fast reject)
- Region-of-interest readout strategy using the self-event-trigger information
- Easily extendable to any available CMOS image sensor
Readout electronics for coherent synchrotron radiation (CSR)
Goals:
- Measure the peak amplitude of each bunch (resolution of a few mV)
- Measure the pulse width of each bunch (resolution of a few ps)
- Measure the relative time jitter between electron bunches (resolution of a few ps)
Strategy: digitize each pulse with 4 samples, then apply pulse reconstruction and a constant-fraction discriminator (CFD) for a precise pulse timestamp.
Analog signal (single bunch, amplifier output): 20-200 mV, FWHM = 42 ps.
[Figure: ANKA filling scheme with 184 bunches, revolution time 368 ns, bunch trains with 2 ns spacing; pulse shape of width ~200 ps; 32 samples inside each ~92 ps bunch; sample time resolution better than 3 ps. ANKA CSR: long observation time with YBCO detectors.]
Goal: recording & analysis of the time evolution of each bunch in a multi-bunch accelerator filling scheme.
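The constant-fraction timestamp idea can be sketched numerically. This is an illustrative model only (the actual pulse-reconstruction algorithm is not given on the slide; sample spacing, fraction, and values below are assumptions): from the digitized samples, find the pulse peak, then locate where the rising edge crosses a fixed fraction of that peak and interpolate linearly between the two bracketing samples, giving a sub-sample timestamp that is largely independent of the pulse amplitude.

```c
#include <assert.h>

/* Linear-interpolation constant-fraction timestamp over n samples
 * spaced dt_ps apart; returns the crossing time in ps, or -1 if the
 * threshold is never crossed on a rising edge. */
static double cfd_timestamp(const double *s, int n, double fraction,
                            double dt_ps)
{
    /* Peak amplitude over the digitized samples. */
    double peak = s[0];
    for (int i = 1; i < n; i++)
        if (s[i] > peak)
            peak = s[i];

    double thr = fraction * peak;
    /* First sample pair bracketing the threshold crossing. */
    for (int i = 1; i < n; i++) {
        if (s[i - 1] < thr && s[i] >= thr) {
            double frac = (thr - s[i - 1]) / (s[i] - s[i - 1]);
            return ((i - 1) + frac) * dt_ps;   /* sub-sample timestamp */
        }
    }
    return -1.0;
}
```

Because the threshold scales with the measured peak, two pulses of different amplitude but identical shape yield the same timestamp, which is what makes a CFD suitable for few-ps jitter measurements.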