USB readout board for PEBS Performance test

June 11, 2009
Version 1.0

Guido Haefeli (footnote 1), Li Liang (footnote 2)

Abstract

In the context of the PEBS [1] experiment, a readout board was developed to facilitate lab and testbeam activities. The DAQ board is based on a USB interface mezzanine card from QuickUSB [2], which allows a nominal data acquisition bandwidth of 480 Mbit/s, given by its USB 2.0 interface to the PC. This document summarizes the results obtained for one and for several boards attached to a single PC.

To be approved by:

Footnote 1: E-mail: Guido.Haefeli@epfl.ch
Footnote 2: E-mail: liang-li02@mails.tsinghua.edu.cn

Contents

1 Revision
2 Introduction
3 USB access function call time
4 Fifo status check time
5 ADC sampling time
6 Data bandwidth measured as a function of access block size
7 Calculation and comparison with measured speed for 3 examples
8 Measured access performance for HPE-VA32
9 System test with multiple boards (HPE-VA32)
  9.1 More observations concerning the 10 board system
10 QuickUSB Latency and Throughput
11 Conclusion

1 Revision

Version 1.0, 10.6.2009: Initial version

2 Introduction

The USB-based readout board [3] used for these tests is the first version of the readout board for the PEBS experiment, dedicated to use in a lab environment. Power consumption, size and cooling are only secondary requirements. To understand the performance and to obtain information about the important system parameters, a series of performance tests was conducted; the results are summarized in this note. In addition to single-board operation, several boards were tested simultaneously in order to perform a DAQ in a larger system and to perform so-called event building. In the tests where several boards are used, a USB hub is employed to attach several USB boards to one USB port on the DAQ PC. The tests with several boards were conducted in view of a possible use of the USB DAQ in the PEBS experiment. Since the estimated total number of USB boards required for the complete PEBS experiment is of the order of 15, the test results directly give performance estimates for such a readout system.

3 USB access function call time

To access the data of the FPGA, a function is called in the C code to read or write large blocks of data or only single data words. Three different calls can be used, and they show different performance. Figure 1 gives the time required for the function call alone. Note that the fastest call, the port read, is not suited for transferring data from the FPGA directly: it accesses pins located directly on the USB interface mezzanine.

Function                                               Access time
QuickUsbReadData(hDev, *data, *length);                0.5 ms
QuickUsbReadCommand(hDev, address, *data, *length);    0.6 ms
QuickUsbReadPort(hDev, address, *data, *length);       0.3 ms

Figure 1: Time required for one function call of different types.

4 Fifo status check time

In a real data acquisition, a check of the availability of data in the FPGA's Fifo is required before each data read transfer.
The time for the Fifo status check function is

dominated by the read function call. The check takes 0.6 ms and, for small data block transfers, is as long as the data read transfer itself. To minimize the number of Fifo checks, a large number of events can be transferred with one data read access. This mode of operation is called multi-event packing: several events are transferred from the FPGA to the PC in one large access. The number of events packed in one access is called the number of events per access. The time spent on Fifo checking is given for the measurements in figure 3.

5 ADC sampling time

The time required to sample the serial data stream from the detector is fixed in the present readout board to 1.2 µs per sample. For a detector with 128 samples followed by a reset, this amounts to 155 µs per event. The maximal performance of the DAQ is limited by this ADC sampling time.

6 Data bandwidth measured as a function of access block size

As a first test we measured the bandwidth of the data transfer from the FPGA to the PC. To measure the available data bandwidth without any processing, the test is performed under the following conditions:

- Only one board is attached and no other processes are running on the PC.
- The data is read from a Fifo on the FPGA which always contains data. No Fifo full or empty check is made for this test.
- The data is not processed and not stored on disc at the PC. It is only read from the FPGA over the USB interface and written into an array.

We observed that each call of the read function of the type

QuickUsbReadData(hDevice, data, &len);

requires a time of 0.5 ms. Taking this into account, the access times recorded in the table in figure 2 can be well understood. The maximal data bandwidth is achieved if the access block size is large and the rate of function calls is therefore low: only 7% of the total time is used for the function call in the case of a very large (300 KByte) block size.
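The derived columns of figure 2 (data transfer rate, event rate, function call rate and the fraction of time spent in function calls) can be recomputed from the measured inputs. The following is only an illustrative sketch, written in Python rather than in the C of the DAQ software. The function name `derived_columns` is hypothetical, and 1 KByte is assumed to mean 1000 Byte.

```python
CALL_TIME_S = 0.5e-3  # measured time of one QuickUsbReadData call (section 3)


def derived_columns(n_events, events_per_access, block_kbyte, total_time_s):
    """Recompute the derived columns of figure 2 from the measured inputs.

    Assumption: 1 KByte = 1000 Byte.
    Returns (data rate [Mbit/s], event rate [kHz], call rate [kHz],
    fraction of the total time spent in function calls).
    """
    n_calls = n_events / events_per_access
    data_rate_mbit = n_calls * block_kbyte * 1000 * 8 / total_time_s / 1e6
    event_rate_khz = n_events / total_time_s / 1e3
    call_rate_khz = n_calls / total_time_s / 1e3
    call_fraction = n_calls * CALL_TIME_S / total_time_s
    return data_rate_mbit, event_rate_khz, call_rate_khz, call_fraction


# Measured rows of figure 2:
# (events, events per access, block size [KByte], total time [s])
rows = [
    (100_000, 500, 304, 1.4),
    (100_000, 50, 30, 2.9),
    (10_000, 10, 6, 0.64),
    (10_000, 1, 0.6, 5.2),
]

for row in rows:
    rate, events, calls, fraction = derived_columns(*row)
    print(f"{row[2]:>6} KByte: {rate:6.1f} Mbit/s, {events:5.1f} kHz events, "
          f"{calls:5.2f} kHz calls, {fraction:4.0%} in function calls")
```

Under these assumptions the computed values agree with the tabulated ones to within a few percent, which supports reading the last column as the number of calls times 0.5 ms divided by the total time.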
# Event   Event/access    Event/access   Access block   Time [s]   Data transfer   Event rate   Function    Time for function
          36 word         128 word       size [KByte]              rate                         call rate   call (0.5 ms/call)
          (Spiroc EPFL)   (HPE-VA32)
100,000   500             146            304            1.4        340 Mbit/s      71.4 kHz     0.14 kHz    7%
100,000   50              15             30             2.9        166 Mbit/s      34.5 kHz     0.69 kHz    35%
                          8              16.6                      160 Mbit/s
10,000    10              2.9            6              0.64       75 Mbit/s       15.6 kHz     1.56 kHz    78%
                          1              2.1                       33 Mbit/s
10,000    1               0.3            0.6            5.2        9.2 Mbit/s      1.9 kHz      1.9 kHz     95%

Figure 2: Data transfer bandwidth measured without processing and storage for one board.

7 Calculation and comparison with measured speed for 3 examples

When the Fifo is detected to be almost full (the threshold is a value that can be chosen), the busy signal is set and no more triggers should be sent. The busy signal must be set at the latest when the Fifo can still contain one more event. Three examples are given below to illustrate the situation:

Example 1: Only one event per access is used. The Fifo almost-full flag is set if 8 events or more are in the Fifo (the Fifo has space for 16 events, so a margin of 50% is introduced). The Fifo check function, which takes 0.6 ms, needs to be called for each event; the ADC sampling can continue while the event is read. With a data access time of about 0.5 ms (dominated by the function call time), the total time is 0.6 ms + 0.5 ms = 1.1 ms, and we can therefore expect an event rate of about 0.9 kHz, compared to 0.91 kHz measured (see the first line of figure 3).

Example 2: The read data access is grouped by 8 events. The Fifo check function is called during the ADC data sampling until the 8-event limit is reached; this is also the moment at which the busy signal is set. ADC sampling is stopped until at least one event is read from the Fifo. The Fifo status check is performed during the ADC sampling time. The randomness of the check time introduces a dead time of about half the Fifo check function time, 0.3 ms on average. Since the ADC sampling time and the USB DAQ read time are of the same order, the performance is difficult to calculate. For 8 events, the time required is composed of 8 x 0.155 ms ADC sampling time, 0.6 ms / 2 Fifo check time, and 0.5 ms + 0.332 ms / 8, the time during which the ADC sampling is stopped due to a Fifo almost full.
This amounts to 0.26 ms per event or a 3.8 kHz event rate (measured: 3.9 kHz; see line 8 in figure 3).

Example 3: In this example the Fifo almost-full level is set to 14, and the access is grouped in blocks of 8 events. In this case the ADC sampling time is at least 0.155 ms per event; if the sampling is never blocked, this results in an event rate of 6.5 kHz. If we

calculate the DAQ read time, this is 0.832 ms for 8 events plus 0.6 ms for the Fifo check; this results in 0.179 ms per event or a 5.6 kHz event rate, which is in perfect agreement with the 5.6 kHz measured (see line 10 in figure 3).

The observed readout performance can be understood very precisely if the measured times for the function calls are used. The bandwidth of the USB readout is dominated by the delays introduced by the function calls unless the access size is very large (of the order of 100 KByte). To achieve high performance, multiple events should be read in one access. Event packing of the order of 50 to 100 reduces the effect of the slow function calls to a negligible level. For example, at a packing factor of 100 and a trigger rate of 2 kHz, the two function calls (Fifo check 0.6 ms and data read 0.5 ms) occupy only 2.2% of the time. In a system with 15 boards the time for function calls is 33%, which can be expressed as a usable bandwidth of 67% of the maximal user bandwidth of 360 Mbit/s, i.e. 240 Mbit/s.

8 Measured access performance for HPE-VA32

The event size is (128 data words + 2 header words) x 8 uplinks x 2 Byte = 2080 Byte. In the two following tables the performance measured on a standard PC running Windows XP is given. The first table shows the values for internal triggers, the trigger mode in which the next trigger is generated as soon as the ADC sampling of the last event is finished and the data Fifo can contain one more event. The performance in this mode is about 15% higher than in the external trigger mode, where the triggers are generated with a periodic generator running at 20 kHz. The external trigger is gated with the busy signal of the USB DAQ board (Trigger = Trigger gen AND not Busy).
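The arithmetic of the three examples in section 7 and of the packing estimate can be written out as a short script. This is a sketch in Python, not part of the DAQ software, and all function names are hypothetical. The 0.332 ms is read here as the pure data transfer time for 8 events (0.832 ms minus the 0.5 ms function call), which is an interpretation of the text.

```python
T_CHECK = 0.6     # ms, Fifo status check
T_CALL = 0.5      # ms, QuickUsbReadData function call
T_DATA_8 = 0.332  # ms, assumed pure transfer time for 8 events (0.832 - 0.5)
T_ADC = 0.155     # ms, ADC sampling time per event


def example1_ms():
    """One event per access: one Fifo check plus one read call per event."""
    return T_CHECK + T_CALL


def example2_ms():
    """Eight events per access, Fifo almost-full level 8.

    Sampling of 8 events, half a Fifo check of average dead time, the read
    call, and the transfer of one event during which sampling is stopped.
    """
    total_for_8 = 8 * T_ADC + T_CHECK / 2 + T_CALL + T_DATA_8 / 8
    return total_for_8 / 8


def example3_ms():
    """Eight events per access, almost-full level 14: readout limited."""
    return (T_CALL + T_DATA_8 + T_CHECK) / 8


def rate_khz(ms_per_event):
    """Event rate in kHz for a given time per event in ms."""
    return 1.0 / ms_per_event


def call_overhead(packing, trigger_khz, n_boards=1):
    """Fraction of time spent in the two function calls per packed access."""
    window_ms = packing / trigger_khz  # time needed to collect one packed block
    return n_boards * (T_CHECK + T_CALL) / window_ms


for name, t in [("Example 1", example1_ms()),
                ("Example 2", example2_ms()),
                ("Example 3", example3_ms())]:
    print(f"{name}: {t:.3f} ms/event, {rate_khz(t):.2f} kHz")
print(f"Packing 100 at 2 kHz, 1 board:   {call_overhead(100, 2.0):.1%}")
print(f"Packing 100 at 2 kHz, 15 boards: {call_overhead(100, 2.0, 15):.1%}")
```

The 15-board figure assumes, as the text does, that both calls are issued once per board for every packed access.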

Internal trigger mode:

Event     Event/    Only read       Pedestal noise         ADC sampling   USB DAQ        USB estimated   USB Fifo
number    access    the data [ms]   calculation, write     time [ms]      readout        speed           check [ms]
                                    to disc [ms]                          time [ms]
5000      1         5450            5570                   775            2521           @33 Mbit/s      3000
5000      2         2900            3000                   775            ...            ...             1500
5000      3         2200            2297                   775            1109           @75 Mbit/s      1000
5000      4         1670            1781                   775            ...            ...             750
5000      5         1375            1453                   775            ...            ...             600
5000      6         1210            1312                   775            ...            ...             500
5000      7         1234            1328                   775            ...            ...             429
5000      8         1280            1375                   775            520            @160 Mbit/s     375
5000*     8         520             -                      -              -              -               -
5000**    8         890             -                      775            520            -               375

Notes:
*  No check function called, only the data read function is called; it takes 0.5 ms per function call plus the data transfer time, on average 160 Mbit/s.
** Here the Fifo_almost_full level was increased to 14 events, but the readout is done after 8 events; this allows for continuous ADC sampling.

External trigger mode:

Event     Event/    Only read       Pedestal noise         ADC sampling   USB DAQ        USB estimated   USB Fifo
number    access    the data [ms]   calculation, write     time [ms]      readout        speed           check [ms]
                                    to disc [ms]                          time [ms]
5000      1         5420            5547                   775            2521           @33 Mbit/s      3000
5000      2         2890            3000                   775            ...            ...             1500
5000      3         2200            2300                   775            1109           @75 Mbit/s      1000
5000      4         1672            1780                   775            ...            ...             750
5000      5         1390            1490                   775            ...            ...             600
5000      6         1219            1313                   775            ...            ...             500
5000      7         1410            1516                   775            ...            ...             429
5000      8         1547            1656                   775            520            @160 Mbit/s     375

Figure 3: Internal trigger mode in the upper and external trigger mode in the lower table.
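Two columns of figure 3 are not independent measurements but follow from the per-call and per-event times quoted earlier: the Fifo check column is the number of accesses times 0.6 ms, and the ADC sampling column is the number of events times 0.155 ms. The sketch below (Python, hypothetical function names) recomputes them and converts a measured total time into an event rate.

```python
N_EVENTS = 5000     # events per measurement in figure 3
T_CHECK_MS = 0.6    # Fifo status check time per access
T_ADC_MS = 0.155    # ADC sampling time per event


def fifo_check_total_ms(events_per_access):
    """Total Fifo check time: one check per access."""
    return N_EVENTS / events_per_access * T_CHECK_MS


def adc_total_ms():
    """Total ADC sampling time for all events."""
    return N_EVENTS * T_ADC_MS


def event_rate_khz(total_ms):
    """Event rate for a measured total time (events per ms equals kHz)."""
    return N_EVENTS / total_ms


for n in range(1, 9):
    print(f"{n} events/access: Fifo check {fifo_check_total_ms(n):6.0f} ms")
print(f"ADC sampling: {adc_total_ms():.0f} ms")
print(f"Line 1:  {event_rate_khz(5450):.2f} kHz")
print(f"Line 8:  {event_rate_khz(1280):.2f} kHz")
print(f"Line 10: {event_rate_khz(890):.2f} kHz")
```

The three rates printed at the end correspond to the lines of figure 3 that section 7 compares with its calculated examples.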

9 System test with multiple boards (HPE-VA32)

The table in figure 4 gives the performance measured with a USB hub and 10 USB DAQ boards. From the performance with one board, the system performance can be estimated assuming that the hub doesn't introduce any overhead. One can observe that the presence of 10 USB devices instead of only one increases the readout time by about 30%. The increase is not due to the hub; the mere presence of the 10 devices is sufficient to introduce this penalty. We tested this by reading only one device while 9 devices were in idle state. The test also shows that the performance decreases further if ten boards are read instead of only one. The calculated and the measured readout times differ by a factor of two (figure 4).

Event     Event/    Internal Trigger    Expected due to the     In % of     External Trigger
number    access    (no processing)     limitation of the       expected    (no processing)
                                        USB DAQ*
5000      1         55.3 s              28.2 s                  51%         55.4 s
5000      2         31.0 s              ..                      ..          31.2 s
5000      3         22.8 s              12.1 s                  53%         22.8 s
5000      4         18.5 s              ..                      ..          18.5 s
5000      5         15.3 s              ..                      ..          15.3 s
5000      6         13.7 s              ..                      ..          13.6 s
5000      7         12.9 s              ..                      ..          13.6 s
5000      8         12.3 s              5.58 s                  45%         12.3 s

* The USB DAQ speed for the readout of 10 boards can be calculated from the different delays, which are all known: Fifo check 0.6 ms / 8 events, data read from each board 10 x 0.832 ms / 8 events => 1.115 ms / event and therefore 5.58 s for 5000 events.

Figure 4: Measured readout time for 10 USB DAQ boards, for internal and external trigger mode. The performance is only about 50% of that calculated from the one-board measurements.

In the tests performed, the Fifo check is applied only for the first board; this is possible since the system runs fully synchronously. In a system where zero suppression is done on the FPGA on the USB board, the event size becomes variable, and for each board the Fifo check function has to be called to read an event or multi-event length counter in order to read the correct amount of data from the Fifo. For fixed-length events, the length is the same for all USB boards. The access performance for 1 up to 10 boards is given in figure 5. The limiting factor for one board is the ADC sampling time, for two boards it is a mixture, and for 4 and more boards it is the USB DAQ speed. The internal and external trigger mode changes the

performance only for the one-board access; for several boards the difference is negligible.

Event     Event/    # USB     External Trigger    Limiting
number    access    boards    (no processing)
5000      8         1         1.44 s              ADC sampling
5000      8         2         2.27 s              DAQ, ADC
5000      8         4         4.00 s              DAQ
5000      8         6         6.38 s              DAQ
5000      8         8         8.73 s              DAQ
5000      8         10        12.3 s              DAQ

* The time is increased for the external trigger compared to the internal trigger due to the trigger rate of only 20 kHz of the generator. On average it takes half a clock cycle until a trigger is generated after the busy is removed.

Figure 5: Measured readout time for 1 up to 10 USB DAQ boards; external trigger mode was used.

9.1 More observations concerning the 10 board system

During the test we observed some dependence of the readout performance on the system configuration. The table in figure 6 gives an idea of how much changes of the system can influence the performance.

- There is a clear system performance degradation with an increasing number of USB boards seen by the PC. Reading one board only but leaving the other boards plugged in increases the readout time from 1.28 s to 1.69 s.
- Additional Fifo checking at each board degrades the performance.
- Adding additional USB devices to the PC degrades the performance.
- Using a different hub also has a negative influence. The 10-port (footnote 3) and the 7-port (footnote 4) hubs that we used in the test are from the same manufacturer; the 10-port hub behaves better.
- Using a laptop PC degrades the system performance considerably.

Footnote 3: MX-UA6 - USB 2.0 Hub 10-Port
Footnote 4: MX-217C - USB 2.0 Vertical Hub 7-Port
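The expected 10-board readout times in figure 4 and the event rate achieved with 10 boards can be reproduced from numbers already given in this note. The sketch below is in Python with hypothetical function names. It assumes that the per-access read time for 1 event per access can be taken from the "USB DAQ readout time" column of figure 3 (2521 ms for 5000 accesses), which is an inference that happens to reproduce the tabulated 28.2 s.

```python
T_CHECK_MS = 0.6     # one Fifo check per access, applied to the first board only
EVENT_BYTES = 2080   # HPE-VA32 event size per board (section 8)


def expected_total_s(n_events, events_per_access, n_boards, read_ms_per_access):
    """Expected readout time if the boards are read one after the other."""
    per_access_ms = T_CHECK_MS + n_boards * read_ms_per_access
    return n_events / events_per_access * per_access_ms / 1000.0


def measured_rate_and_bandwidth(n_events, total_s, n_boards):
    """Event rate [Hz] and aggregate USB bandwidth [Mbit/s] of a measurement."""
    rate_hz = n_events / total_s
    mbit_s = rate_hz * n_boards * EVENT_BYTES * 8 / 1e6
    return rate_hz, mbit_s


# 8 events per access, 0.832 ms per board and access (footnote of figure 4)
print(f"expected, 8 events/access: {expected_total_s(5000, 8, 10, 0.832):.2f} s")
# 1 event per access, read time assumed from figure 3: 2521 ms / 5000 accesses
print(f"expected, 1 event/access:  {expected_total_s(5000, 1, 10, 2521 / 5000):.1f} s")

rate, bandwidth = measured_rate_and_bandwidth(5000, 12.3, 10)
print(f"measured with 10 boards: {rate:.0f} Hz, {bandwidth:.1f} Mbit/s")
```

The last line uses the 12.3 s measured for 10 boards in figure 5 and the 2080 Byte event size per board.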

Column key:
(a) Internal trigger, boards that are not used are removed
(b) * = internal trigger, boards that are not used are left plugged in
(c) ** = * + check each Fifo at each board
(d) ** + add USB memory stick to PC
(e) ** + use two different hubs
(f) ** + use a laptop PC instead of the quite slow lab PC

Event     Event/    # USB     (a)       (b)       (c)       (d)       (e)       (f)
number    access    boards
5000      8         1         1.28 s    1.69 s    2.22 s
5000      8         2         2.27 s    2.77 s    3.92 s
5000      8         4         4.00 s    5.03 s    7.19 s    7.69 s    7.70 s    10.8 s
5000      8         6         6.38 s    7.53 s    10.8 s
5000      8         8         8.73 s    9.94 s    14.2 s
5000      8         10        12.3 s    12.3 s    17.6 s

Figure 6: Some measurements to see the influence of the system configuration.

10 QuickUSB Latency and Throughput

Taken from the QuickUSB User Guide:

The period of time between the start of a transfer and the time that it actually occurs is the transfer latency. USB transfer latency is the result of several factors. First is the fact that the USB is a frame oriented bus and that all packets must be scheduled to a timebase of either 1 ms (full speed) or 125 µs (Hi-Speed) (footnote 5). Secondly, the operating system generally assesses a software latency penalty when switching from user mode to kernel mode.

Throughput is a measure of data transfer speed and is generally expressed in megabytes per second (MB/s). Transfer latency affects throughput because it increases the amount of time a transfer takes regardless of the connection speed. However, as the data transfer size becomes larger, the transfer latency becomes a smaller fraction of the total transfer time, thereby diminishing its effect. When the transfer size is small, the transfer latency will seriously degrade throughput. Therefore, for applications that require the highest throughput, transfer sizes of at least 64 KB are recommended.

Another way to mitigate transfer latency issues is to minimize the amount of time that the USB subsystem waits to schedule USB packets. You can accomplish this using asynchronous function calls (footnote 6).
With asynchronous function calls, the transfer is scheduled when the function is called, but the function returns without waiting for the transfer to complete. Using this mechanism, one can concurrently schedule enough USB transfers to assure that the USB will not idle waiting for data to be transferred to or from your device. The simplest and most reliable technique for this is to employ multiple transfer buffers and rotate them on an as-needed basis.

Footnote 5: In our system we measure an actual latency of 0.5 ms for the data read and 0.6 ms for the CMD read access.
Footnote 6: For the measurements we use only synchronous data transfer. Asynchronous transfer might help to reduce the data transfer time for the multi-board access.
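The effect of a fixed transfer latency on throughput, as described in the quoted passage, can be illustrated with a simple model: total time = latency + size / link rate. This is a sketch only. The link rate of 360 Mbit/s is taken from the "maximal user bandwidth" quoted in section 7 and the 0.5 ms latency from footnote 5. The function names are hypothetical, and 64 KB is assumed to mean 65536 Byte.

```python
LATENCY_S = 0.5e-3     # measured data read latency (footnote 5)
LINK_BIT_S = 360e6     # assumed link rate: maximal user bandwidth from section 7


def total_time_s(size_bytes):
    """Model: fixed latency plus the pure transfer time at the link rate."""
    return LATENCY_S + size_bytes * 8 / LINK_BIT_S


def effective_throughput_mbit(size_bytes):
    """Throughput seen by the caller for one synchronous transfer."""
    return size_bytes * 8 / total_time_s(size_bytes) / 1e6


def latency_fraction(size_bytes):
    """Share of the total transfer time that is pure latency."""
    return LATENCY_S / total_time_s(size_bytes)


# One event (2080 Byte), 8 events, 64 KB and a 304 KByte block
for size in (2080, 8 * 2080, 64 * 1024, 304_000):
    print(f"{size:>7} Byte: {effective_throughput_mbit(size):6.1f} Mbit/s, "
          f"latency share {latency_fraction(size):4.0%}")
```

For a single 2080 Byte event this model gives a throughput close to the 33 Mbit/s estimated in figure 3, while at the recommended 64 KB the latency still accounts for roughly a quarter of the transfer time under these assumptions.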

11 Conclusion

A single-board readout with an event rate as high as 3.9 kHz at an event size of 2 KByte was measured. The limiting component of the speed of the system in this case is the ADC sampling time mixed with the DAQ readout time. The average bandwidth obtained on the single USB link is 70 Mbit/s.

The current setup with 10 USB boards, a standard PC and a USB hub allows a DAQ without zero suppression at an event rate of 400 Hz. The total event size is 20.8 KByte and the average bandwidth obtained on the USB link is, as in the single-board case, 70 Mbit/s. Note that in this case the ADC sampling time is no longer a limiting factor for the readout because all 10 boards sample in parallel. The readout time is the only limiting element for the 10-board setup.

For a large-system DAQ, the access block size should be as large as possible, but at least of the order of 64 KByte. For an estimate with the current setup one can assume the 16 KByte block size that was used in the readout where 8 full events were read. If one assumes a data reduction by zero suppression in the FPGA of the order of 100, the required number of events to be packed in one block access is 16 KByte / (2 KByte / 100) = 800 events. Packing and zero suppression can therefore lead back to acceptable event rates. For this test setup, zero suppression and packing of 800 events lead to an event rate of 100 x 400 Hz = 40 kHz. Note that the lab system is only 70% of a total PEBS readout.

References

[1] http://accms04.physik.rwth-aachen.de/~schael/pebs.html
[2] http://www.quickusb.com/store/
[3] http://lphe.epfl.ch/tell1/usb_board/