June 11, 2009
Version 1.0

USB readout board for PEBS Performance test

Guido Haefeli (1), Li Liang (2)

Abstract

In the context of the PEBS [1] experiment a readout board was developed to facilitate the lab and testbeam activity. The DAQ board is based on a mezzanine card USB interface from QuickUSB [2], which provides a nominal data acquisition bandwidth of 480 Mbit/s given by its USB 2.0 interface to the PC. This document summarizes the results obtained for one and for several boards attached to a single PC.

To be approved by:

(1) E-mail: Guido.Haefeli@epfl.ch
(2) E-mail: liang-li02@mails.tsinghua.edu.cn

Contents

1   Revision
2   Introduction
3   USB access function call time
4   Fifo status check time
5   ADC sampling time
6   Data bandwidth measured as a function of access block size
7   Calculation and comparison with measured speed for 3 examples
8   Measured access performance for HPE-VA32
9   System test with multiple boards (HPE-VA32)
    9.1  More observations concerning the 10 board system
10  QuickUSB Latency and Throughput
11  Conclusion

1 Revision

Version 1.0, 10.6.2009: Initial version

2 Introduction

The USB based readout board [3] used for these tests is the first version of the readout board for the PEBS experiment, dedicated for use in a lab environment. Power consumption, size and cooling are only secondary requirements. To understand the performance and to obtain information about the important system parameters, a series of performance tests was conducted and is summarized in this note. In addition to the single board operation, several boards were tested simultaneously in order to perform a DAQ in a larger system and to perform a so-called event building. In the tests where several boards are used, a USB hub is employed to attach several USB boards to one USB port on the DAQ PC. The tests with several boards were conducted in view of a possible deployment of the USB DAQ in the PEBS experiment. Since the estimated total number of USB boards required for the complete PEBS experiment is of the order of 15, the test results directly give some performance estimates for such a readout system.

3 USB access function call time

To access the data of the FPGA, a function is called in the C code to read or write large blocks of data or single data words. Three different calls can be used and, naturally, different performance is obtained. Figure 1 gives the time required for the function call alone; a minimal sketch of how such a call time can be measured follows the figure. Note that the fastest call, the port read, is not suited for transferring data from the FPGA: it accesses pins on the USB interface mezzanine directly.

  Function                                               Access time
  QuickUsbReadData(hDev, *data, *length)                 0.5 ms
  QuickUsbReadCommand(hDev, address, *data, *length)     0.6 ms
  QuickUsbReadPort(hDev, address, *data, *length)        0.3 ms

Figure 1: Time required for one function call of different types.
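A call time of this order can be measured by averaging over many calls. The following is a minimal sketch of such a measurement, not the original test program: the header name QuickUSB.h, the QHANDLE type, QuickUsbOpen/QuickUsbClose and the module name "QUSB-0" are taken from the QuickUSB documentation and should be checked against the installed library; timing with clock() is only approximate.

    /* Sketch: average the time per QuickUsbReadData() call (Section 3). */
    #include <stdio.h>
    #include <time.h>
    #include "QuickUSB.h"

    #define N_CALLS     1000
    #define BLOCK_BYTES 512

    int main(void)
    {
        QHANDLE hDev;
        unsigned char data[BLOCK_BYTES];
        unsigned long len;
        clock_t t0, t1;
        int i;

        if (!QuickUsbOpen(&hDev, "QUSB-0")) {      /* assumed module name */
            fprintf(stderr, "cannot open QuickUSB module\n");
            return 1;
        }

        t0 = clock();
        for (i = 0; i < N_CALLS; i++) {
            len = BLOCK_BYTES;
            QuickUsbReadData(hDev, data, &len);    /* data read, ~0.5 ms/call */
        }
        t1 = clock();

        printf("average time per QuickUsbReadData call: %.3f ms\n",
               1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC / N_CALLS);

        QuickUsbClose(hDev);
        return 0;
    }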

4 Fifo status check time

In a real data acquisition, a check of the availability of data in the FPGA's Fifo is required before the data read transfer. The time for the Fifo status check function is dominated by the read function call. The check takes 0.6 ms and is, for small data block transfers, as large as the data read transfer itself. To minimize the number of Fifo checks, a large number of events can be transferred with one data read access. This mode of operation can be called multi event packing: several events are transferred from the FPGA to the PC in one large access. The number of events packed in one access is called the number of events per access. For the measurements given in figure 3, the time for Fifo checking is given.

5 ADC sampling time

The time required to sample the serial data stream from the detector is fixed in the present readout board to 1.2 µs per sample. For a detector with 128 samples followed by a reset this amounts to 155 µs per event. The maximal performance of the DAQ is limited by this ADC sampling rate.

6 Data bandwidth measured as a function of access block size

As a first test we measured the data bandwidth of the data transfer from the FPGA to the PC. To measure the available data bandwidth without any processing, the test is performed under the following conditions:

- Only one board is attached and no other processes are running on the PC.
- The data is read from a Fifo on the FPGA which always contains data; the Fifo is not checked for full or empty in this test.
- The data is not processed and not stored on disc at the PC; it is only read from the FPGA over the USB interface and written into an array.

We observed that each call of a read function of the type QuickUsbReadData(hDevice, data, &len) requires 0.5 ms. Taking this into account, the access times recorded in the table of figure 2 can be well understood. The maximal data bandwidth is achieved if the access block size is large and therefore the rate of function calls is low: only 7% of the total time is spent in the function call for a very large (300 KByte) block size. A sketch of such a measurement loop is given after figure 2.

  # Events  Events/access   Events/access   Access block   Time [s]  Data transfer  Event     Function   Time in function
            36 word         128 word        size [KByte]             rate           rate      call rate  calls (0.5 ms/call)
            (Spiroc EPFL)   (HPE-VA32)
  100,000   500             146             304            1.4       340 Mbit/s     71.4 kHz  0.14 kHz   7%
  100,000   50              15              30             2.9       166 Mbit/s     34.5 kHz  0.69 kHz   35%
                            8               16.6                     160 Mbit/s
  10,000    10              2.9             6              0.64      75 Mbit/s      15.6 kHz  1.56 kHz   78%
                            1               2.1                      33 Mbit/s
  10,000    1               0.3             0.6            5.2       9.2 Mbit/s     1.9 kHz   1.9 kHz    95%

Figure 2: Data transfer bandwidth measured without processing and storage for one board.
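A minimal sketch of the bandwidth test loop described above, under the same assumptions about the QuickUSB header, handle type and open call as in the previous sketch; the FPGA is assumed to keep its Fifo permanently filled:

    /* Sketch of the raw bandwidth test of Section 6: read fixed-size blocks
     * without any Fifo check, processing or storage, and report the rate. */
    #include <stdio.h>
    #include <time.h>
    #include "QuickUSB.h"

    #define BLOCK_BYTES (300 * 1024)  /* large block: call overhead only ~7% */
    #define N_BLOCKS    100

    int main(void)
    {
        QHANDLE hDev;
        static unsigned char data[BLOCK_BYTES];
        unsigned long len;
        clock_t t0, t1;
        double seconds, mbit_per_s;
        int i;

        if (!QuickUsbOpen(&hDev, "QUSB-0"))
            return 1;

        t0 = clock();
        for (i = 0; i < N_BLOCKS; i++) {
            len = BLOCK_BYTES;
            QuickUsbReadData(hDev, data, &len);  /* no Fifo check, data discarded */
        }
        t1 = clock();

        seconds    = (double)(t1 - t0) / CLOCKS_PER_SEC;
        mbit_per_s = 8.0 * (double)BLOCK_BYTES * N_BLOCKS / seconds / 1e6;
        printf("%.1f Mbit/s with %d KByte blocks\n", mbit_per_s, BLOCK_BYTES / 1024);

        QuickUsbClose(hDev);
        return 0;
    }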

7 Calculation and comparison with measured speed for 3 examples

When the Fifo is detected to be almost full (the almost-full level is a value that can be chosen), the busy signal is set and no more triggers should be sent. The busy signal must be set at the latest when the Fifo can still hold one more event. Three examples are given below to illustrate the situation; a sketch of the corresponding packed readout loop follows the examples.

Example 1: Only one event per access is used, and Fifo almost full is set if 8 events or more are in the Fifo (the Fifo has space for 16 events, so a margin of 50% is introduced). The Fifo check function needs to be called for each event (0.6 ms); the ADC sampling can continue while the event is read. With a data access time of about 0.5 ms (dominated by the function call time), the total time is 0.6 ms + 0.5 ms = 1.1 ms and therefore we can expect an event rate of about 0.9 kHz, compared to 0.91 kHz measured (see the first line of figure 3).

Example 2: The read data access is grouped by 8 events. The Fifo check function is called during the ADC data sampling until the 8 event limit is reached; this is also the moment where the busy signal is set. ADC sampling is then stopped until at least one event is read from the Fifo. Since the Fifo status check is performed during the ADC sampling time, the randomness of the check time introduces a dead time of roughly half the Fifo check function time, 0.3 ms on average. Because the ADC sampling time and the USB DAQ read time are of the same order, the performance is difficult to calculate. For 8 events the time required is composed of 8 x 0.155 ms ADC sampling time, 0.6 ms/2 Fifo check time, 0.5 ms function call time and 0.332 ms/8 for the time the ADC sampling is stopped due to Fifo almost full. This amounts to 0.26 ms per event or a 3.8 kHz event rate (measured 3.9 kHz, see line 8 of figure 3).

Example 3: In this example the Fifo almost full level is set to 14, and the access is grouped in blocks of 8 events. In this case the ADC sampling time is at least 0.155 ms per event; if never blocked, this would give an event rate of 6.5 kHz. If we calculate the DAQ read time, this is 0.832 ms for 8 events plus 0.6 ms for the Fifo check, which results in 0.179 ms per event or a 5.6 kHz event rate, in perfect agreement with the 5.6 kHz measured (see line 10 of figure 3).
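The packed readout of Examples 2 and 3 can be sketched as follows. This is an illustration only, not the original DAQ code: the command address FIFO_STATUS_ADDR and the meaning of the returned status word are hypothetical, and only the QuickUSB calls listed in figure 1 are used (argument types should be checked against the installed header).

    /* Sketch of a packed readout: check the Fifo level once, then read a
     * block of EVENTS_PER_ACCESS events with a single QuickUsbReadData(). */
    #include "QuickUSB.h"

    #define EVENT_BYTES       2080   /* (128+2) words x 8 uplinks x 2 bytes */
    #define EVENTS_PER_ACCESS 8
    #define FIFO_STATUS_ADDR  0x01   /* hypothetical command address */

    static unsigned char block[EVENT_BYTES * EVENTS_PER_ACCESS];

    int read_packed_events(QHANDLE hDev)
    {
        unsigned short status;
        unsigned short slen = sizeof(status);
        unsigned long  blen;

        /* Fifo status check: ~0.6 ms, done once per 8-event access */
        if (!QuickUsbReadCommand(hDev, FIFO_STATUS_ADDR,
                                 (unsigned char *)&status, &slen))
            return -1;
        if (status < EVENTS_PER_ACCESS)  /* not enough events buffered yet */
            return 0;

        /* Data read: ~0.5 ms call overhead + ~0.332 ms transfer for 8 events */
        blen = sizeof(block);
        if (!QuickUsbReadData(hDev, block, &blen))
            return -1;

        return EVENTS_PER_ACCESS;        /* events now available in block[] */
    }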

The observed readout performance can be understood very precisely if the measured times for the function calls are taken into account. The bandwidth of the USB readout is dominated by the delays introduced by the function calls unless the access size is very large (of the order of 100 KByte). To obtain high performance, multiple events should be read in one access. Using event packing of the order of 50 to 100 reduces the effect of the slow function calls to a negligible level. For example, with a packing factor of 100 and a trigger rate of 2 kHz, the two function calls (Fifo check 0.6 ms and data read 0.5 ms) occupy only 2.2% of the time. In a system with 15 boards the time for function calls is 33%, which can be expressed as a usable bandwidth of 67% of the maximal user bandwidth of 360 Mbit/s, i.e. 240 Mbit/s.
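The overhead estimate above can be reproduced with a few lines of C (all numbers are taken from this note; no QuickUSB calls are involved):

    #include <stdio.h>

    int main(void)
    {
        const double t_check_ms  = 0.6;    /* Fifo status check per access */
        const double t_read_ms   = 0.5;    /* data read function call      */
        const double trigger_khz = 2.0;    /* trigger rate                 */
        const int    packing     = 100;    /* events per access            */
        const int    n_boards    = 15;     /* full PEBS-scale system       */
        const double max_bw_mbit = 360.0;  /* maximal user bandwidth       */

        double period_ms = packing / trigger_khz;                 /* 50 ms */
        double one_board = (t_check_ms + t_read_ms) / period_ms;
        double n_board   = n_boards * (t_check_ms + t_read_ms) / period_ms;

        printf("1 board  : %.1f%% of the time in function calls\n",
               100.0 * one_board);
        printf("%d boards: %.0f%% in calls -> usable bandwidth %.0f Mbit/s\n",
               n_boards, 100.0 * n_board, (1.0 - n_board) * max_bw_mbit);
        return 0;
    }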

8 Measured access performance for HPE-VA32

The event size is (128 data words + 2 header words) x 8 uplinks x 2 bytes = 2080 bytes. The two following tables give the performance measured on a standard PC running Windows XP. The first table shows the values for internal triggers, the trigger mode in which the next trigger is generated as soon as the ADC sampling of the last event is finished and the data Fifo can hold one more event. The performance in this mode is about 15% higher than in the external trigger mode, where the triggers are generated by a periodic generator running at 20 kHz. The external trigger is gated with the busy signal of the USB DAQ board (Trigger = Trigger_gen AND NOT Busy).

Internal trigger mode:

  Events  Ev/acc  Read only  Pedestal noise calc.   ADC sampling  USB DAQ readout time  USB Fifo
                  [ms]       + write to disc [ms]   time [ms]     [ms] / est. speed     check [ms]
  5000    1       5450       5570                   775           2521 @ 33 Mbit/s      3000
  5000    2       2900       3000                   775           ..                    1500
  5000    3       2200       2297                   775           1109 @ 75 Mbit/s      1000
  5000    4       1670       1781                   775           ..                    750
  5000    5       1375       1453                   775           ..                    600
  5000    6       1210       1312                   775           ..                    500
  5000    7       1234       1328                   775           ..                    429
  5000    8       1280       1375                   775           520 @ 160 Mbit/s      375
  5000*   8       520        -                      -             -                     -
  5000**  8       890        -                      775           520                   375

  *  No check function called, only the data read function: 0.5 ms per function call
     plus the data transfer time, 160 Mbit/s on average.
  ** Fifo_almost_full was increased to 14 events while the readout is still done after
     8 events; this allows for continuous ADC sampling.

External trigger mode:

  Events  Ev/acc  Read only  Pedestal noise calc.   ADC sampling  USB DAQ readout time  USB Fifo
                  [ms]       + write to disc [ms]   time [ms]     [ms] / est. speed     check [ms]
  5000    1       5420       5547                   775           2521 @ 33 Mbit/s      3000
  5000    2       2890       3000                   775           ..                    1500
  5000    3       2200       2300                   775           1109 @ 75 Mbit/s      1000
  5000    4       1672       1780                   775           ..                    750
  5000    5       1390       1490                   775           ..                    600
  5000    6       1219       1313                   775           ..                    500
  5000    7       1410       1516                   775           ..                    429
  5000    8       1547       1656                   775           520 @ 160 Mbit/s      375

Figure 3: Internal trigger mode in the upper and external trigger mode in the lower table.

9 System test with multiple boards (HPE-VA32)

The following table gives the performance measured with a USB hub and 10 USB DAQ boards. From the performance with one board, the system performance can be estimated, assuming that the hub does not introduce any overhead. One observes that the use of 10 USB devices instead of only one increases the readout time by about 30%. The increase is not due to the hub; the mere presence of the 10 devices is sufficient to introduce this penalty - we tested this by using only one device and keeping 9 devices in an idle state. The test also shows that the performance decreases further if ten boards are read out instead of only one: the calculated and the measured readout times differ by a factor of two (figure 4). In the tests performed, the Fifo check is applied only to the first board; this is possible since the system runs fully synchronously. In a system where zero suppression is done on the FPGA of the USB board, the event size becomes variable and the Fifo check function has to be called for each board to read an event or multi-event length counter in order to read the correct amount of data from the Fifo. For fixed length events, the length is the same for all USB boards, as in the sketch given after figure 4.

  Events  Ev/acc  Internal trigger   Expected from the USB   In % of   External trigger
                  (no processing)    DAQ limitation alone*    expected  (no processing)
  5000    1       55.3 s             28.2 s                   51%       55.4 s
  5000    2       31.0 s             ..                                 31.2 s
  5000    3       22.8 s             12.1 s                   53%       22.8 s
  5000    4       18.5 s             ..                                 18.5 s
  5000    5       15.3 s             ..                                 15.3 s
  5000    6       13.7 s             ..                                 13.6 s
  5000    7       12.9 s             ..                                 13.6 s
  5000    8       12.3 s             5.58 s                   45%       12.3 s

  * The USB DAQ speed for the readout of 10 boards can be calculated from the different
    delays, which are all known: Fifo check 0.6 ms / 8 events, data read from each board
    10 x 0.832 ms / 8 events => 1.115 ms per event and therefore 5.58 s for 5000 events.

Figure 4: Measured readout time for 10 USB DAQ boards, for internal and external trigger mode. The performance is only about 50% of the one calculated from the one board measurements.
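A sketch of the synchronous multi-board readout described above (fixed event length, Fifo check on the first board only). The device handles are assumed to be opened beforehand; FIFO_STATUS_ADDR and the status word are again hypothetical.

    #include "QuickUSB.h"

    #define N_BOARDS          10
    #define EVENT_BYTES       2080
    #define EVENTS_PER_ACCESS 8
    #define FIFO_STATUS_ADDR  0x01    /* hypothetical command address */

    static unsigned char event_block[N_BOARDS][EVENT_BYTES * EVENTS_PER_ACCESS];

    int read_all_boards(QHANDLE hDev[N_BOARDS])
    {
        unsigned short status, slen = sizeof(status);
        unsigned long  blen;
        int board;

        /* One Fifo check on board 0 only (~0.6 ms per 8-event access);
         * the system runs synchronously, so all boards hold the same events */
        if (!QuickUsbReadCommand(hDev[0], FIFO_STATUS_ADDR,
                                 (unsigned char *)&status, &slen))
            return -1;
        if (status < EVENTS_PER_ACCESS)
            return 0;

        /* Sequential data read from every board: ~0.832 ms each for 8 events,
         * i.e. about 1.115 ms per event for the 10-board system */
        for (board = 0; board < N_BOARDS; board++) {
            blen = sizeof(event_block[board]);
            if (!QuickUsbReadData(hDev[board], event_block[board], &blen))
                return -1;
        }
        /* event building: the n-th event of every board belongs together */
        return EVENTS_PER_ACCESS;
    }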

The access performance for 1 up to 10 boards is given in figure 5. The limiting factor is the ADC sampling time for one board, a mixture of ADC sampling and DAQ readout for two boards, and the USB DAQ speed for 4 boards and more. The choice of internal or external trigger mode changes the performance only for the one-board access; for several boards the difference is negligible.

  Events  Ev/acc  # USB boards  External trigger (no processing)  Limiting factor
  5000    8       1             1.44 s                            ADC sampling
  5000    8       2             2.27 s                            DAQ, ADC
  5000    8       4             4.00 s                            DAQ
  5000    8       6             6.38 s                            DAQ
  5000    8       8             8.73 s                            DAQ
  5000    8       10            12.3 s                            DAQ

  * The time is increased for the external trigger compared to the internal trigger
    because of the trigger rate of only 20 kHz of the generator: on average it takes
    half a clock cycle until a trigger is generated after the busy is removed.

Figure 5: Measured readout time for 1 up to 10 USB DAQ boards, external trigger mode.

9.1 More observations concerning the 10 board system

During the tests we observed some dependence of the readout performance on the system configuration. To get an idea of how much changes to the system can influence the performance, see the table in figure 6. There is a clear system performance degradation with an increasing number of USB boards seen by the PC:

- Reading one board only but leaving the other boards plugged in increases the readout time from 1.28 s to 1.69 s.
- Additional Fifo checking at each board degrades the performance.
- Adding additional USB devices to the PC degrades the performance.
- Using a different hub also has a negative influence. The 10-port (3) and 7-port (4) hubs used in the test are from the same manufacturer; the 10-port hub behaves better.
- Using a laptop PC degrades the system a lot.

(3) MX-UA6 - USB 2.0 Hub 10-Port
(4) MX-217C - USB 2.0 Vertical Hub 7-Port

  Column definitions:
    A = Internal trigger, boards that are not used removed
    B = Internal trigger, boards that are not used left plugged (*)
    C = * + check the Fifo at each board (**)
    D = ** + USB memory stick added to the PC
    E = ** + two different hubs used
    F = ** + a laptop PC used instead of the (quite slow) lab PC

  Events  Ev/acc  # USB boards  A        B        C        D       E       F
  5000    8       1             1.28 s   1.69 s   2.22 s
  5000    8       2             2.27 s   2.77 s   3.92 s
  5000    8       4             4.00 s   5.03 s   7.19 s   7.69 s  7.70 s  10.8 s
  5000    8       6             6.38 s   7.53 s   10.8 s
  5000    8       8             8.73 s   9.94 s   14.2 s
  5000    8       10            12.3 s   12.3 s   17.6 s

Figure 6: Some measurements to see the influence of the system configuration.

10 QuickUSB Latency and Throughput

Taken from the QuickUSB User Guide: The period of time between the start of a transfer and the time that it actually occurs is the transfer latency. USB transfer latency is the result of several factors. First, USB is a frame oriented bus and all packets must be scheduled to a timebase of either 1 ms (full speed) or 125 µs (Hi-Speed) (5). Second, the operating system generally assesses a software latency penalty when switching from user mode to kernel mode. Throughput is a measure of data transfer speed and is generally expressed in megabytes per second (MB/s). Transfer latency affects throughput because it increases the amount of time a transfer takes regardless of the connection speed. However, as the data transfer size becomes larger, the transfer latency becomes a smaller fraction of the total transfer time, thereby diminishing its effect. When the transfer size is small, the transfer latency will seriously degrade throughput. Therefore, for applications that require the highest throughput, transfer sizes of at least 64 KB are recommended. Another way to mitigate transfer latency issues is to minimize the amount of time that the USB subsystem waits to schedule USB packets. You can accomplish this using asynchronous function calls (6). With asynchronous function calls, the transfer is scheduled when the function is called, but the function returns without waiting for the transfer to complete. Using this mechanism, one can concurrently schedule enough USB transfers to ensure that the USB will not idle waiting for data to be transferred to or from your device. The simplest and most reliable technique for this is to employ multiple transfer buffers and rotate them on an as-needed basis.

(5) In our system we measure an actual latency of 0.5 ms for the data read and 0.6 ms for the CMD read access.
(6) For the measurements we use only synchronous data transfer. Asynchronous transfer might help to reduce the data transfer time for the multiboard access.
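The rotating-buffer scheme suggested by the User Guide can be sketched as follows. Since only synchronous transfers were used for the measurements in this note, the two functions start_async_read() and wait_async_read() below are placeholders for whatever asynchronous read/completion pair the installed QuickUSB library provides; only the buffer rotation logic is illustrated.

    #include <stddef.h>

    #define N_BUFFERS   4
    #define BLOCK_BYTES (64 * 1024)   /* at least 64 KB as recommended */

    /* hypothetical wrappers around the library's asynchronous API */
    int  start_async_read(void *hDev, unsigned char *buf,
                          unsigned long len, int *tag);
    int  wait_async_read(void *hDev, int tag);
    void process_block(const unsigned char *buf, unsigned long len);

    static unsigned char buffer[N_BUFFERS][BLOCK_BYTES];
    static int tag[N_BUFFERS];

    void acquire(void *hDev, int n_blocks)
    {
        int i;

        /* keep several transfers scheduled so the USB never idles */
        for (i = 0; i < N_BUFFERS && i < n_blocks; i++)
            start_async_read(hDev, buffer[i], BLOCK_BYTES, &tag[i]);

        for (i = 0; i < n_blocks; i++) {
            int slot = i % N_BUFFERS;
            wait_async_read(hDev, tag[slot]);          /* oldest pending transfer */
            process_block(buffer[slot], BLOCK_BYTES);  /* use the completed data  */
            if (i + N_BUFFERS < n_blocks)              /* re-arm the same buffer  */
                start_async_read(hDev, buffer[slot], BLOCK_BYTES, &tag[slot]);
        }
    }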

11 Conclusion

A single board readout as fast as a 3.9 kHz event rate with an event size of 2 KByte was measured. The limiting component for the speed of the system in this case is the ADC sampling time mixed with the DAQ readout time. The average bandwidth obtained on the single USB link is 70 Mbit/s.

The current setup with 10 USB boards, a standard PC and a USB hub allows a DAQ without zero suppression at an event rate of 400 Hz. The total event size is 20.8 KByte and the average bandwidth on the USB link is, as in the single board case, 70 Mbit/s. Note that in this case the ADC sampling time is no longer a limiting factor for the readout because all 10 boards sample in parallel; the readout time is the only limiting element for the 10 board setup.

For a large system DAQ, the access block size should be as large as possible, but at least of the order of 64 KByte. For an estimate with the current setup one can assume the 16 KByte block size that was used in the readout where 8 full events were read. If one assumes a data reduction by zero suppression in the FPGA of the order of 100, the required number of events to be packed in one block access is 16 KByte / (2 KByte / 100) = 800 events. Packing and zero suppression can therefore lead back to acceptable event rates: for this test setup, zero suppression and packing of 800 events leads to a 100 x 400 Hz = 40 kHz event rate. Note that the lab system is only 70% of a total PEBS readout.

References

[1] http://accms04.physik.rwth-aachen.de/~schael/pebs.html
[2] http://www.quickusb.com/store/
[3] http://lphe.epfl.ch/tell1/usb_board/