An FPGA platform for ultra-fast data acquisition
M. Caselle, M. Balzer, S. Chilingaryan, A. Kopmann, U. Stevanovic, M. Vogelgesang
December 2012, UFO project group, KIT
KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association
www.kit.edu
Ultra-Fast X-ray Imaging (ANKA/UFO experimental station)
UFO → Ultra-Fast X-ray Imaging of Scientific Processes with On-line Assessment and Data-driven Process Control
High spatial resolution (< 1 µm) for 2D and 3D visualization, plus time resolution (2D: 10 kHz, 3D: 10 Hz), gives insight into the temporal evolution of structures and thus access to the dynamics of processes.
Main application fields: medical diagnostics, biology, non-destructive testing, materials research, etc.
Requirements:
- High-granularity, low-noise monolithic silicon pixel detector: a few µm pixel pitch, a several-MPixel matrix operating at several kframes/s
- High readout bandwidth up to 50 Gb/s into GPUs (3D tomography reconstruction)
KIT-IPE readout concept for high-data-throughput scientific applications
Concept (data-flow diagram):
- Data source (up to 10 GB/s): X-ray detector, CMOS image sensor, fast ADC
- Input/connection stage: DAQ boards with FPGAs and high-speed memory on a small PCIe backplane; feedback loops, real-time data elaboration, data reduction
- High-throughput data flow (up to 4 GB/s): drivers, GPU/CPU algorithms, fast data storage (GPU/CPU infrastructure under development by the Data Processing group at KIT-IPE)
- Data storage (up to 0.25 GB/s): LSDF
Implementation of the concept: a daughter sensor board carrying the image sensor, a mother readout board with a Virtex-6 FPGA and the PCIe link to the DAQ (together forming the UFO camera), and a small backplane. This talk focuses on the FPGA and the readout board.
Flexible high-throughput FPGA platform
FPGA internal architecture (base configuration):
- PC DAQ side: CPU, chipset root port, memory; optical/electrical link, x4 lanes @ 5 Gb/s
- User bank register and master-control FSM
- FIFO and SerDes input stage (KIT IP core), x lanes @ 500 Mbit/s from the data source (detector)
- Remote detector control
Added to the base architecture: on-line parallel data processing and a DDR interface (KIT IP core) to external DDR3 memory (800 MHz @ 64 bit).
Completing the platform: PCI Express + DMA (KIT IP core), decoupled from the user logic by FIFOs.
Three logic cores have been developed for a flexible high-throughput platform:
- PCIe bus-master DMA readout architecture
- Multi-port high-speed DDR3 interface
- Configurable 2..16-bit SerDes (serializer/deserializer) architecture
In addition:
- PCI Express/DMA Linux 32/64-bit driver with ring-buffer data management
- Integration in the parallel GPU/CPU computing framework
Virtex-6 floorplan: DDR3 interface, PCIe, SerDes & input stage, DMA.
PCIe bus-master DMA readout architecture
Software layers: GPU/CPU applications, data decoding and consistency check, DMA driver.
FPGA core: optical/electrical x4 lanes @ 5 Gb/s into GTX transceivers, Xilinx integrated block for PCI Express Gen2, Northwest Logic DMA engines, a custom PCIe DMA interface, user bank register, I/O interface logic with FIFO read/write control-packet FSMs, and FPGA temperature & voltage monitoring.
User-logic interface: OUT port (data_out[0..63], data_valid, busy_logic, clock_out) and IN port (data_in[0..63], wr_en, back-pressure, clock_in).
- Bus-master DMA operating with 4 PCIe lanes @ Gen2 (250 MHz)
- Two independent engines for writes/reads between the FPGA (user logic) and PC host memory
- FIFO-like IN and OUT interfaces for the user logic
- FIFOs decouple the clock domains of the DMA and the custom user logic
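The ring-buffer data management mentioned for the Linux driver can be pictured as follows. This is a hypothetical, simplified model (all names are illustrative, not the actual driver API): the DMA engine fills fixed-size pages at a head index while the application drains them from a tail index, so FPGA writes and host reads proceed independently, with back-pressure when the ring is full.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative ring-buffer bookkeeping for a DMA driver (sketch only). */
#define RING_PAGES 64u

struct ring {
    uint32_t head;  /* next page the DMA engine will fill */
    uint32_t tail;  /* next page the application will read */
};

/* DMA-completion side: one more page is ready.
 * Returns 0 on success, -1 if the ring is full (back-pressure). */
static int ring_produce(struct ring *r)
{
    if ((r->head + 1u) % RING_PAGES == r->tail % RING_PAGES)
        return -1;                    /* full: DMA must stall */
    r->head = (r->head + 1u) % RING_PAGES;
    return 0;
}

/* Application side: claim the oldest filled page, or -1 if empty. */
static int ring_consume(struct ring *r)
{
    if (r->tail == r->head)
        return -1;                    /* empty: nothing to read yet */
    int page = (int)r->tail;
    r->tail = (r->tail + 1u) % RING_PAGES;
    return page;
}
```

Keeping one page of slack between head and tail is the classic way to distinguish "full" from "empty" without an extra counter.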
Preliminary: a new PCIe bus-master DMA architecture
Disadvantages of IP cores from external vendors:
1) expensive (35k for the Northwest Logic DMA; 10-60k for EZDMA/QuickPCIe by PLDA)
2) tied to a single FPGA family (Virtex-6, speed grade -2)
The new KIT-IPE bus-master DMA engines operate with x8 PCI Express lanes @ Gen2: IN/OUT data at 128 bit @ 250 MHz → internal bandwidth of 32 Gb/s in both read and write, with an FPGA resource estimation below 4%.
[Figure: data transfer rate vs. packet size (up to ~64 KB), comparing PC memory ↔ FPGA throughput for the Northwest Logic DMA and the KIT DMA; data valid for the X58 PCIe chipset. Virtex-6 floorplan showing the PCIe Gen2 RX and TX engines.]
- FPGA ring-buffer management → ongoing
- 64-bit Linux driver → under optimization (~32 Gb/s)
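The 32 Gb/s figure can be cross-checked arithmetically: PCIe Gen2 runs 5 GT/s per lane with 8b/10b line coding, so an x8 link carries 8 × 5 × 8/10 = 32 Gb/s of usable data, exactly matching the quoted 128-bit @ 250 MHz internal datapath. A minimal check:

```c
#include <assert.h>

/* PCIe Gen2 usable bandwidth in Gb/s for a given lane count:
 * 5 GT/s per lane minus the 8b/10b line-code overhead. */
static double pcie_gen2_gbps(int lanes)
{
    return lanes * 5.0 * 8.0 / 10.0;
}

/* Bandwidth of an internal FPGA datapath in Gb/s. */
static double datapath_gbps(int width_bits, double clock_hz)
{
    return width_bits * clock_hz / 1e9;
}
```

So the link and the 128-bit internal bus are rate-matched by construction, which is why the engines can stream at line rate in both directions.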
Two-port DDR3 memory interface architecture
Why a two-port DDR3 memory controller? The Xilinx Multi-Port Memory Controller (IP core) is limited in maximum data throughput (less than 2 GB/s per port) and has a complex user interface.
Ref.: LogiCORE IP Multi-Port Memory Controller (MPMC) v6.03.a, DS643, March 1, 2011
Architecture: a user control interface (data_in[0..N], wr_en, clock_in, start addresses for ports 1 and 2, read/write, DDR busy, data_out[0..M], data_valid, clock_out) feeds WR/RD FIFOs; an arbiter FSM with WR/RD DDR FSMs (256 bit @ 200 MHz) drives the Xilinx PHY-DDR IP core and the DDR3 memory (800 MHz @ 64 bit), which is split into a data-frame segment and an on-line data-processing segment.
- Bandwidth of 51 Gb/s, limited by the FPGA speed grade (Virtex-6, speed grade -1)
- Two concurrent operations in the same or different segments (~25 Gb/s each)
- FIFO-like data interface; only a minimum of control signals is required
- FIFOs decouple the clock domains of the memory controller and the custom user logic
- Configurable user-defined data widths N and M → 32/64/128/512 bits
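The arbiter FSM's job can be modeled in a few lines. This is a behavioral sketch, not the VHDL implementation: the DDR3 back end serves one port per burst, so pending requests from the data-frame port (0) and the on-line-processing port (1) are granted round-robin, which is one simple policy that yields the roughly equal ~25 Gb/s split per port quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical round-robin arbiter model for the two-port controller. */
struct arbiter {
    int last_grant;              /* port served by the previous burst */
};

/* Returns the port granted the next DDR burst, or -1 if neither
 * port has a pending request. */
static int arbiter_grant(struct arbiter *a, bool req0, bool req1)
{
    bool req[2] = { req0, req1 };
    /* Start from the port after the last one served (round-robin). */
    for (int i = 1; i <= 2; i++) {
        int p = (a->last_grant + i) % 2;
        if (req[p]) {
            a->last_grant = p;
            return p;
        }
    }
    return -1;
}
```

When only one port requests, it gets every burst (full ~51 Gb/s); when both request, bursts alternate.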
A configurable SerDes input stage architecture
Why not a Xilinx ISERDESE1 stage? The parallel output data width is limited to no more than 10 bits (for two ISERDESE1 in cascade) and is not dynamically configurable, and the alignment FSM is not included in the Xilinx tools.
Ref.: Virtex-6 FPGA SelectIO Resources User Guide, UG361 (v1.3), August 16, 2010
Configurable 2-to-16-bit parallel data output: custom SerDes logic with an MSB-alignment state machine.
Signal path: serial data and clock from the CMOS sensor, fast ADC, etc. enter through IBUFs; an IODELAY tunes the clock-to-data timing, an IDDR captures the serial stream, and an I/O clock buffer plus a regional clock buffer provide the divided clocks; a bit-slip alignment FSM locks onto a training pattern (e.g. 10110010) and asserts data-lock on the parallel data output.
- Individual clock-to-data time tuning via IODELAY (time step of 75 ps)
- I/O clock buffer located in the centre of the FPGA bank
- Regional buffer synchronous to the parallel data output
- SerDes input stage fully configurable by the user
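The bit-slip alignment can be illustrated with a small behavioral model (a sketch, not the FPGA code; the 8-bit width and the 10110010 training pattern are taken from the slide's example): the deserializer delivers words at an arbitrary bit offset, and the FSM issues bit-slips, each equivalent to a one-bit rotation of the word, until the word matches the training pattern, at which point data-lock is asserted.

```c
#include <assert.h>
#include <stdint.h>

/* Count how many bit-slips (one-bit rotations) are needed until the
 * deserialized word matches the training pattern; -1 means no lock. */
static int align_bitslips(uint16_t word, uint16_t training, int width)
{
    uint16_t mask = (uint16_t)((1u << width) - 1u);
    for (int slips = 0; slips < width; slips++) {
        if (word == training)
            return slips;        /* data-lock after this many slips */
        /* One bit-slip: rotate the word left by one bit. */
        word = (uint16_t)(((word << 1) | (word >> (width - 1))) & mask);
    }
    return -1;                   /* training pattern never seen */
}
```

In hardware the FSM performs one slip per training word and re-checks, so lock is reached within at most `width` words.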
Future developments for high-speed readout systems
Requirements:
- Real-time FPGA + GPU data elaboration → high data throughput (range of 64 Gb/s)
- Data source and FPGA readout board located far from the DAQ system
- Use of a commercial, well-known protocol for easy interfacing with commercial devices/boards
Two different approaches are possible:
- Peer-to-peer (P2P) streaming data transfer (based on the new generation of the PCI Express protocol)
- Point-to-node (network) transfer for distributed GPU/CPU high-performance-computing (HPC) clusters
IPE PCI Express readout card: overview
- PCIe Gen3 optical/electrical data transmission (8 lanes × 8 GT/s)
Readout-board concept: the data source feeds a Virtex-6 FPGA with DDR3 memory, a PCIe integrated-block endpoint (x8 lanes Gen2), user logic, and a multi-port PCIe switch (x8 lanes Gen3) driving an electrical cable (up to 5 m) or an optical cable (up to 30 m) to the PC host board; no DMA is needed.
- 64 Gb/s (write) + 64 Gb/s (read) → full-duplex mode
- FPGA real-time processing → close to the data source
- MiniPOD x12-lane optics for PCIe Gen3 (8 GT/s per lane); size 18.6 mm × 22 mm, height 14.5 mm
IPE PCI Express host card: overview
- PCIe host board with high-speed data recording: NAND flash SSD up to 2 TB @ 64 Gbit/s, FPGA with SRAM and integrated DMA
- Fully configurable data flow: x16-lane PCIe slot to the PC memory system for GPU data elaboration
High-bandwidth readout system based on InfiniBand
Data source → input stage → FPGA with memory and QSFP+ (optical or electrical data link up to 100 m) into a µTCA/ATCA crate.
- 40 Gb/s → InfiniBand, in house
- 120 Gb/s → InfiniBand, available soon
- 384 Gb/s → in the next two years
An 8-port InfiniBand switch is capable of up to 640 Gb/s. The InfiniBand DAQ/GPU cluster (QDR 40 Gb/s InfiniBand protocol, InfiniBand router, heterogeneous FPGA + CPU + GPU, ultra-low latency for high cluster performance) is under development at KIT-IPE by the Data Processing group.
InfiniBand readout board: overview
Board components: Xilinx Virtex-6 FPGA (user logic, PCIe), Mellanox InfiniHost III Ex InfiniBand silicon device with SDRAM, DDR3, a MicroBlaze soft core, JTAG for programming and BSDL (VHDL) test, CPLD & flash, QSFP+ connectors to the InfiniBand switch, and high-speed connectors (HPC Samtec or similar) from the data source.
- IP-based application layers are possible (i.e. TCP, UDP, SSH, FTP, ...)
- The InfiniHost provides the PHY, link, and transport layers for InfiniBand
- Remote DMA (RDMA) for fast data transfer → intranet communication
Conclusions and what's next
Logic cores for the high-data-throughput platform → employed in several scientific applications:
- A terahertz detector + readout system for CSR (M. Caselle, V. Judin, A.S. Müller, M. Siegel, N. Smale, P. Thoma, M. Weber, S. Wünsch), KIT departments IPE, IMS, and ANKA
- An X-ray camera for phase-contrast tomography (M. Caselle, A. Kopmann, Felix Beckmann (HZG), Joerg Burmester (HZG)), KIT and HZG
- An X-ray camera for high-spatial-resolution tomography (M. Caselle, M. Balzer, A. Kopmann, V. E. Asadchikova), Shubnikov Institute of Crystallography, Russian Academy of Sciences, Moscow, Russia
- Readout electronics for the ultrafast electron-beam X-ray tomography system "ROFEX" at HZDR (proposal under discussion)
- New KIT DMA engines (32 Gb/s) → developed and tested
- 64-bit Linux driver → under optimization
What's next:
- Design & production of readout boards based on PCIe Gen3 optical communication and on the InfiniBand protocol
- Integration in the GPU/CPU computing infrastructure
UFO camera: frame rate from 500 fps to 2 kfps; bandwidth 8 Gb/s, future upgrade to 50 Gb/s.
CSR readout: bandwidth 6 Gb/s, future upgrade to 24 Gb/s.
[Figure: ANKA filling scheme with 184 bunches, revolution time 368 ns, bunch trains (trains 1-3) with 2 ns bunch spacing; pulse shape of width ~200 ps; 32 samples inside each ~92 ps bunch; sample time resolution better than 3 ps.]
Goal: recording & analysis of the time evolution of each bunch in a multi-bunch accelerator filling scheme.
Backup slides
InfiniBand: link-layer flow control
- Credit-based link-level flow control
- Link flow control ensures NO packet loss within the fabric, even in the presence of congestion
- Link receivers grant packet receive-buffer space credits per virtual lane
- Flow-control credits are issued in 64-byte units
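The credit mechanism above can be sketched as follows (a behavioral model for illustration, not Mellanox firmware): the receiver advertises buffer space for one virtual lane as credits in 64-byte units, and the transmitter sends a packet only when it holds enough credits, so a congested link back-pressures instead of dropping packets.

```c
#include <assert.h>
#include <stdint.h>

#define CREDIT_UNIT 64u          /* flow-control credits are 64-byte blocks */

struct vl_state {
    uint32_t credits;            /* credits currently granted by the receiver */
};

/* Receiver side: grant buffer space for `bytes` of payload (round up). */
static void grant_buffer(struct vl_state *vl, uint32_t bytes)
{
    vl->credits += (bytes + CREDIT_UNIT - 1u) / CREDIT_UNIT;
}

/* Transmitter side: send a packet of `bytes` if credits allow.
 * Returns 1 if sent, 0 if it must wait (back-pressure, never loss). */
static int try_send(struct vl_state *vl, uint32_t bytes)
{
    uint32_t need = (bytes + CREDIT_UNIT - 1u) / CREDIT_UNIT;
    if (need > vl->credits)
        return 0;
    vl->credits -= need;
    return 1;
}
```

Because credits are only granted for buffer space that actually exists, the receiver can never be overrun, which is the property the slide emphasizes.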
InfiniBand: application layers and latency
Application layers: UDP or TCP, FTP, ssh.
[Figure: latency vs. number of nodes when designing with InfiniBand.]
Ref.: Introduction to InfiniBand for End Users, InfiniBand Trade Association, Beaverton, OR
UFO architecture: overview
Experimental station: X-ray beamline, sample, scintillator & optical lens, sample and detector manipulators, sample environment. A smart high-speed camera (detector + memory + FPGA with on-line processing) streams over a fast optical data link to a GPU server for on-line monitoring and evaluation (raw data processing, data evaluation, post-processing, visualization), with data management and distribution (storage, processing, management) in the Large Scale Data Facility (LSDF). Control paths: a fast hardware loop for beamline fast-reject triggering, and software control loops for 2D- and 3D-image-based control.
- High-speed, high-bandwidth, fully programmable camera (continuous data acquisition at full speed)
- Optimized image-processing algorithms using GPU computing
- Fast HW loop: on-line image-based self-event-trigger architecture (fast reject)
- SW control loops based on 2D and 3D data evaluation: 2D data → camera calibration, autofocus, self-alignment, etc.; reconstructed 3D data → e.g. optical flow
UFO camera: overview
Hardware: mother readout board (ML605) with a Xilinx FPGA (fast readout & on-line data processing), large DDR3 local memory, and a PCIe endpoint link to the UFO infrastructure; a daughter board with the CMOSIS sensor; and a Peltier cell for camera cooling, with its control board, heat sink, and fan.
The main features already implemented and tested include:
- Fully configurable camera → adjustable image exposure time and dynamic range, plus analog and digital pixel features such as pixel threshold, mask, analog gain, etc.
- Continuous data acquisition at full speed
- On-line image-based self-event-trigger architecture (fast reject)
- Region-of-interest readout strategy using the self-event-trigger information
- Easily extendable to any available CMOS image sensor
Readout electronics for coherent synchrotron radiation (CSR)
Goals:
- Measure the peak amplitude of each bunch (resolution of a few mV)
- Measure the pulse width of each bunch (resolution of a few ps)
- Measure the relative time jitter between electron bunches (resolution of a few ps)
Strategy: digitize each pulse with 4 samples, then apply pulse reconstruction and a constant-fraction discriminator (CFD) for a precise pulse timestamp.
Analog signal (single bunch, amplifier output): 20-200 mV, FWHM = 42 ps.
[Figure: ANKA filling scheme with 184 bunches, revolution time 368 ns, bunch trains with 2 ns spacing; pulse shape of width ~200 ps; 32 samples inside each ~92 ps bunch; sample time resolution better than 3 ps. ANKA CSR: long observation time with YBCO detectors.]
Goal: recording & analysis of the time evolution of each bunch in a multi-bunch accelerator filling scheme.
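The constant-fraction timestamp idea can be sketched numerically. This is an illustrative model only (the actual pulse-reconstruction algorithm is not given on the slide; sample spacing, fraction, and values below are assumptions): from the digitized samples, find the pulse peak, then locate where the rising edge crosses a fixed fraction of that peak and interpolate linearly between the two bracketing samples, giving a sub-sample timestamp that is largely independent of the pulse amplitude.

```c
#include <assert.h>

/* Linear-interpolation constant-fraction timestamp over n samples
 * spaced dt_ps apart; returns the crossing time in ps, or -1 if the
 * threshold is never crossed on a rising edge. */
static double cfd_timestamp(const double *s, int n, double fraction,
                            double dt_ps)
{
    /* Peak amplitude over the digitized samples. */
    double peak = s[0];
    for (int i = 1; i < n; i++)
        if (s[i] > peak)
            peak = s[i];

    double thr = fraction * peak;
    /* First sample pair bracketing the threshold crossing. */
    for (int i = 1; i < n; i++) {
        if (s[i - 1] < thr && s[i] >= thr) {
            double frac = (thr - s[i - 1]) / (s[i] - s[i - 1]);
            return ((i - 1) + frac) * dt_ps;   /* sub-sample timestamp */
        }
    }
    return -1.0;
}
```

Because the threshold scales with the measured peak, two pulses of different amplitude but identical shape yield the same timestamp, which is what makes a CFD suitable for few-ps jitter measurements.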