FPGA PCIe Bandwidth

Mike Rose
Department of Computer Science and Engineering
University of California San Diego

June 9, 2010

Abstract

The unique fusion of hardware and software that characterizes FPGAs has had a strong impact on the research community. Here at UCSD, there have been many successful applications of FPGAs, especially in implementations of machine learning techniques that react in real time. I have demonstrated that, with the addition of DMA transfers, the bandwidth between the FPGA and the host computer is tripled for large data transfers. This increase in communication bandwidth will expand the capabilities of previous work and will permit otherwise impossible future applications.

1 Introduction

Field Programmable Gate Arrays (FPGAs) have already shown an unparalleled ability to provide inexpensive real-time video processing solutions that are simply unavailable with competing technologies. This allows for unrivaled human-machine interaction by achieving visually undetectable response times of less than 100 ms. Many ongoing projects at UCSD have been using these FPGA devices, made by Xilinx, with great success. An especially impressive UCSD project that exploited FPGAs is the real-time face detector written by Junguk Cho et al. [1], which achieved a thirty-five-fold increase in system performance over an equivalent software implementation. Another exciting use of FPGA boards, currently in development, is a 60 frames per second point tracker by Matt Jacobson and Sunsern Cheamanunkul [4]. Mayank Cabra has also been looking into incorporating FPGAs into his work on analyzing pathology images for cancer detection.

Most of our FPGA solutions rely on both hardware and software components. With purely software systems, the latency is too high for real-time use because large amounts of incoming data must be buffered and parsed. And while it is relatively easy to integrate small, specialized hardware modules to accelerate simple tasks, the development effort and resources required to implement complex systems entirely in hardware are often unattainable. This means that for the majority of complex systems we must rely on hardware and software cooperating.

There are two ways that software programs run in our systems: on MicroBlaze, a small soft processor that runs directly on the FPGA using a portion of the chip's programmable hardware, or on a workstation (desktop computer) attached to and powering the FPGA. MicroBlaze is much closer to our hardware, since it runs on the same chip. This results in low communication latency, but MicroBlaze is limited by a slow clock speed of only about 100 MHz and a small amount of available memory. Workstations, in comparison, offer easier programming environments, seemingly limitless memory, multiple CPUs, and fast clock speeds. For these reasons, my goal was to increase the bandwidth of communication between our FPGA devices and our workstations so that we can take advantage of the superior performance of these powerful machines and increase the capabilities of the projects designed by my fellow students.

2 PCIe Overview

On our Xilinx Virtex-5 FPGA board, the Peripheral Component Interconnect Express (PCIe) is ideal for our communication needs because it is capable of both high-bandwidth and low-latency data transfer. Its major advantage over our other communication options is that it plugs the FPGA directly into the motherboard itself, like an expansion card. PCIe connections are built from serial links (called lanes) rather than a parallel interconnect. The new Xilinx FPGA boards support multiple lanes (up to 16x) on the PCIe interconnect, which allows for greater communication bandwidth, especially for unrelated or bidirectional data streams. Above the physical layer lies the Transaction Layer, which is the heart of the PCIe communication protocol. The Transaction Layer segments data into Transaction Layer Packets (TLPs), which are guaranteed in-order transmission across the bus.

Even though the protocol claims to be capable of 250 MB/s throughput across a single PCIe lane, this is usually unattainable in practice. In fact, in preliminary tests here at UCSD we were only able to send, process, and send back data at a rate of about 3.5 MB/s [2]. My first goal when I undertook this project was to investigate why our throughput numbers were so low when the protocol had given us such high hopes. Upon reading about PCIe, it quickly became evident that it is optimized for large data transfers (such as those that come constantly streaming from a graphics card). Since the entire memory size of the system we were testing was only 64 KB, data transfers larger than that could not occur. I also discovered that using only normal read and write calls via Programmed Input/Output (PIO) in the drivers means that each individual TLP can carry a payload of only 4 bytes, instead of something closer to the defined maximum of 4 KB.

Figure 1: TLP Packet Breakdown

Xilinx performed a theoretical analysis of the protocol based solely on the space overhead in every packet (see Figure 1) and demonstrated that the maximum theoretical efficiency can be computed with the following equation:

    Efficiency = Payload / (Payload + Overhead)

So if we use only 4-byte payloads, even the theoretical maximum efficiency drops to 16.7%, instead of the 99% achievable with 2 KB payloads [8]. But what complicated things in our case was that our systems were not actually using just basic reads and writes through the driver. We had a memory mapping in place that allowed us to access the memory of the FPGA as if it were our own. How memory mappings are implemented is completely machine dependent, so it was very difficult to know whether the same limitations applied in this case or whether Linux was using this mapping to optimize our data transfers. The recommended way to take advantage of the full payload size of TLPs is Direct Memory Access (DMA). In a DMA transfer, the PCIe logic knows the amount of data to be transferred beforehand and can segment the data accordingly.
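To make the overhead argument concrete, the following minimal sketch evaluates the efficiency equation for a few payload sizes. The 20-byte per-TLP overhead is my assumption, chosen so the output matches the 16.7% and 99% figures quoted above; the actual overhead in the Xilinx analysis comes from the header and framing fields shown in Figure 1.

    #include <stdio.h>

    /* Assumed fixed per-TLP overhead (header + framing), picked to match
     * the figures in the text: 4 B -> 16.7%, 2 KB -> 99%. */
    #define TLP_OVERHEAD_BYTES 20.0

    static double tlp_efficiency(double payload_bytes)
    {
        return payload_bytes / (payload_bytes + TLP_OVERHEAD_BYTES);
    }

    int main(void)
    {
        const double payloads[] = { 4, 64, 256, 1024, 2048 };
        for (unsigned i = 0; i < sizeof payloads / sizeof payloads[0]; i++)
            printf("%5.0f-byte payload: %4.1f%% efficient\n",
                   payloads[i], 100.0 * tlp_efficiency(payloads[i]));
        return 0;   /* prints 16.7% for 4 B and 99.0% for 2 KB */
    }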
3 Related Work

Patrick Lai, Matt Jacobson, and I began exploring the capabilities and possibilities of using FPGAs to implement machine learning algorithms under the direction of Yoav Freund here at UCSD in the Fall. It was at this time that we uncovered and created much of the framework still used today when beginning UCSD's FPGA-integrated projects [4].

At first we could communicate with our workstation only through an extremely limited JTAG interface, but Patrick went on to find and adapt a driver by Xilinx, allowing us to communicate via PCIe with much more ease and higher bandwidth [3]. I cannot stress enough that the technologies and ideas used in this project were not groundbreaking or new. Xilinx has had the framework for DMA communication via PCIe for quite some time, and others have managed to implement similar systems in the past [7]. However, my objective was to investigate and design a solution so that those of us researching here at UCSD could attain analogous results for our systems. Evan Ettinger designed the initial bandwidth benchmark of our PCIe communication; I edited it, used it for the baseline performance, and drew my comparisons against it [2].

4 Design and Implementation

The main communication concern we wished to address was how to maximize the throughput (data per second) that could be sent from the FPGA to the workstation, processed, and then returned to the FPGA. Since we wanted to benchmark the bandwidth of a round trip, we reasoned that the communication bandwidth would be the same whether the data originated on the workstation or the FPGA, as long as a full cycle was completed. It was easier at the time to locate the controlling application on the workstation, since development of a normal C program is much quicker and taking timing measurements is far simpler. The other reason it was easier to create a benchmark with the workstation in control was that our communication with the FPGA was still asymmetric: the workstation was completely in control of the source and destination addresses of any data transmissions across the PCIe. An important byproduct of my research was to enable the FPGA components to control reads and writes of workstation memory as well, creating a fully symmetric communication system.

With these goals in mind, the following benchmark was devised by Evan Ettinger (see Figure 2).

Figure 2: Original Design Flow

First, the workstation fills a buffer with test data. It then begins timing and sends the data over the PCIe using its mapping into BRAM (the FPGA's fast access memory). When the transmission is complete, the benchmark sets a flag at a specific location in BRAM, signaling that the data is ready for processing. Meanwhile, a hardware module on the FPGA named the bitflipper polls this location. Once it has been updated, the bitflipper wakes and begins negating each word of data that was received. The bitflipper then signals the workstation by writing to a different location in BRAM. In the meantime, the workstation must poll BRAM across the PCIe until it receives notification that the processing is complete. The workstation then reads all of the data back from the FPGA. Lastly, it stops the timer and checks the validity of the data by comparing the returned data to the inverse of the input data. The bitflipper was added because it proved a useful tool in verifying that our data was indeed received by the FPGA and that no caching or errors were occurring in our data transfers. A sketch of this round trip appears below.
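This round trip can be summarized in a short user-space sketch. Everything here is illustrative: the mapped BRAM window, the flag offsets, and the buffer size are hypothetical stand-ins for the benchmark's actual layout.

    #include <stdint.h>

    /* Hypothetical layout of the BRAM window as seen through the driver's
     * memory mapping; the real benchmark's offsets and sizes differ. */
    #define DATA_WORDS 1024
    #define READY_FLAG (DATA_WORDS)      /* workstation -> bitflipper */
    #define DONE_FLAG  (DATA_WORDS + 1)  /* bitflipper -> workstation */

    /* One timed round trip: write the data, raise the ready flag, poll
     * across the PCIe for completion, then read back and verify. */
    static int round_trip(volatile uint32_t *bram, const uint32_t *src)
    {
        for (int i = 0; i < DATA_WORDS; i++)
            bram[i] = src[i];            /* PIO writes across the PCIe */
        bram[READY_FLAG] = 1;            /* signal the bitflipper */

        while (bram[DONE_FLAG] == 0)
            ;                            /* costly: polls across the PCIe */
        bram[DONE_FLAG] = 0;             /* re-arm for the next iteration */

        for (int i = 0; i < DATA_WORDS; i++)
            if (bram[i] != ~src[i])      /* bitflipper negated each word */
                return -1;               /* caching or transfer error */
        return 0;
    }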
Based on my initial research, there were two areas that I perceived as the most likely bottlenecks in the benchmark. The first was using the memory map as our main mechanism for data transfers. Since the mechanisms used inside the memory map were such an unknown, it seemed very plausible that the resulting transmissions were sending only one word at a time. I decided that the most worthwhile optimization I could hope to achieve would be to convert the system to use DMA transfers, which are capable of taking full advantage of the PCIe protocol.

The other issue was having the workstation completely in control of data transmission. Not only did this create a strange, unnatural dependency and an ugly interface between the two halves of our systems, but I also believed that having the workstation poll across the PCIe channel created latency that could dominate the running time of small data transfers.

4.1 DMA Daemon

To realize these optimizations, the first step was to incorporate a DMA controller into the system. Luckily, Xilinx has an IP core available that can be generated for exactly this purpose. The XPS Central DMA Controller [6] transfers a programmable amount of data from a source address to a destination address. I generated the core and attached it to the PLB (Processor Local Bus). The PLB is the central bus of the FPGA; it can be accessed from the MicroBlaze and is used to reach other peripherals such as BRAM and the PCIe bridge.

The next step was completing a local, standalone DMA transfer. To achieve this, I wrote a small program running on the MicroBlaze that put some data into BRAM. For the MicroBlaze to initiate a DMA transfer, it must write four registers, located at the XPS Central DMA Controller's base address on the PLB plus corresponding offsets: source address, destination address, DMA control (DMACR), and length (sketched at the end of this subsection). The source and destination in this case were both addresses within BRAM, so the transfer would be local. Lastly, when the length register is set, the DMA transfer commences. It was quite useful to discover that while the transaction is underway you can poll the DMA status register to find out when the transfer has completed. Upon finishing and debugging this code, I had my first successful DMA transfer.

The bitflipper IP core was only ever connected directly to BRAM and never interfaced to the PLB. This created an interesting problem, since the bitflipper was therefore incapable of using the DMA controller on its own. A colleague advised me that writing hardware-level code to interface to the PLB was achievable but not a good use of my time. So instead of having the bitflipper IP core communicate with the DMA controller directly, I wrote a daemon that runs on the MicroBlaze, polls specific BRAM locations, and passes the parameters on to the DMA controller. I realized this would also be very useful for ordering DMA transfers from the workstation, since we already had working access to BRAM. This simple daemon made a clean and useful interface to the DMA controller and bridged the communication gap.
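The register-level sequence, as I understand it from the controller's data sheet [6], looks roughly like the sketch below. The base address is hypothetical, and the register offsets and status bit reflect my reading of the documentation; verify them against the version of the core you generate.

    #include <stdint.h>

    #define DMA_BASE   0x80200000u  /* hypothetical PLB base address of the core */
    #define DMACR      0x04u        /* DMA control register */
    #define DMA_SA     0x08u        /* source address */
    #define DMA_DA     0x0Cu        /* destination address */
    #define DMA_LENGTH 0x10u        /* byte count; writing it starts the DMA */
    #define DMA_DMASR  0x14u        /* DMA status register */
    #define DMASR_BUSY 0x80000000u  /* DMABSY: transfer still in progress */

    static inline void reg_write(uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)(DMA_BASE + off) = val;
    }

    static inline uint32_t reg_read(uint32_t off)
    {
        return *(volatile uint32_t *)(DMA_BASE + off);
    }

    /* Program a local BRAM-to-BRAM copy and poll until it completes. */
    void dma_copy(uint32_t src, uint32_t dst, uint32_t bytes)
    {
        reg_write(DMACR, 0xC0000000u);  /* SINC | DINC: increment both addresses */
        reg_write(DMA_SA, src);
        reg_write(DMA_DA, dst);
        reg_write(DMA_LENGTH, bytes);   /* last write: the transfer commences */
        while (reg_read(DMA_DMASR) & DMASR_BUSY)
            ;                           /* poll the status register */
    }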
4.2 Workstation Driver

Now that I had verified my ability to make local DMA transfers, it was necessary to make the workstation's memory addressable from the FPGA. The only place this can be done is at the kernel driver level, because user-level applications live in a world of virtual addresses that are process specific and would mean nothing to the out-of-context hardware on the FPGA. At the kernel level you can allocate true physical addresses, but here I uncovered another interesting problem. The workstation we were using had a 64-bit address space, so a buffer allocated with a simple call to kmalloc could easily lie in a region not addressable with only 32 bits. After some searching, I found exactly the system call required for this situation. The pci_alloc_consistent call takes a size and returns both a CPU virtual address to be used by the PC and a hardware address (guaranteed to lie within the first 32 bits of the address space) that maps to the same memory, to be used by the device. This memory is also managed to ensure consistency: either the workstation or the FPGA may be accessing it, but not both at any given time (using a mutex locking system). With this useful call in mind, I could now make some changes to the driver. Upon loading, the driver now allocates two buffers, one for sending data to the FPGA and one for receiving data from the FPGA. Using the previously working mechanism for writing across the PCIe (writel), I manually write the two hardware addresses to specific locations in BRAM, where they will be picked up by the daemon and stored for later use.
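A condensed sketch of this allocation step follows. The buffer size, the BRAM offsets, and the bar0 mapping are hypothetical; pci_alloc_consistent and writel are the kernel APIs described above.

    #include <linux/pci.h>
    #include <linux/io.h>

    #define BUF_SIZE     4096   /* hypothetical buffer size */
    #define BRAM_TX_ADDR 0x100  /* hypothetical BRAM offsets where the daemon */
    #define BRAM_RX_ADDR 0x104  /* expects to find the two hardware addresses */

    static void *tx_buf, *rx_buf;    /* CPU virtual addresses */
    static dma_addr_t tx_hw, rx_hw;  /* hardware addresses, 32-bit reachable */

    static int setup_dma_buffers(struct pci_dev *pdev, void __iomem *bar0)
    {
        /* One buffer per direction; the returned hardware addresses are
         * guaranteed to lie within the first 32 bits of address space. */
        tx_buf = pci_alloc_consistent(pdev, BUF_SIZE, &tx_hw);
        rx_buf = pci_alloc_consistent(pdev, BUF_SIZE, &rx_hw);
        if (!tx_buf || !rx_buf)
            return -ENOMEM;

        /* Publish the hardware addresses in BRAM so the MicroBlaze daemon
         * can hand them to the DMA controller later. */
        writel((u32)tx_hw, bar0 + BRAM_TX_ADDR);
        writel((u32)rx_hw, bar0 + BRAM_RX_ADDR);
        return 0;
    }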

I could now begin changing the read and write functions of the driver. Instead of issuing an individual write for each word, the write function now copies the user data into my preallocated workstation-to-FPGA buffer and then simply sends the DMA parameters of source, destination, and length to my DMA daemon. I was happy to discover that with my new system in place, reads became even easier. The bitflipper was now responsible for initiating the transfers from the FPGA to the workstation on its own, so the only functionality required in the workstation driver's read function was to copy the contents of the shared buffer back to the user. For a read operation there was no longer any need to transmit data across the PCIe.

4.3 PCIe Bridge

Most of the pieces of my project were now in place. The DMA controller was complete and the workstation driver was able to accept data transfers. But the final piece connecting it all together proved very challenging. The PCIe bridge (more specifically, the Xilinx PLBv46 PCIe core [5]) that we had always relied on for PCIe communication was still configured to allow only data requests that originated from the workstation. We had always known it should be capable of allowing the FPGA the same freedom, but had never understood how to accomplish this. The PCIe bridge itself is actually broken into two parts: the PCIe Controller, which we had used before to let the workstation access BRAM, and the PCIe Bar, which is meant to be an address window on the PLB that maps directly to workstation memory. The Bar had so far gone unused, and it was now my task to use the Controller to initialize and configure it.

The first issue I faced was that initially I could only find a mechanism for setting, at compile time, the workstation memory location the Bar should map to. This was not very useful, since I did not know the hardware address of my buffer until the driver was up and running. About half the time the driver would get the same address, but occasionally it would not, and it was unacceptable to consider a solution dependent on something so arbitrary. I finally found that if I rebuilt the Controller with the parameter C_INCLUDE_BAROFFSET_REG set to 1, there would be an allocated register named IPIFBAR2PCIBAR_0L where I could write my destination address at runtime [5].

The next problem I confronted can be summed up in one word: endianness. The MicroBlaze architecture is big endian, while the workstation, like any x86 computer, is little endian. This means that when I transferred over the hardware address the Bar should point to, it was received with its bytes scrambled. A bug like this is easily fixed with a short routine that flips the endianness of the data, though it was a difficult bug to detect. The hardest part was confirming which endianness I should even be using, since I could find no documentation stating which byte order the PCIe Controller expects an address to be written in.
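The fix itself is a short byte-order flip applied to the address before it is written to IPIFBAR2PCIBAR_0L; a minimal sketch (the helper name is mine):

    #include <stdint.h>

    /* Reverse the byte order of a 32-bit word, e.g. 0x12345678 -> 0x78563412,
     * bridging the big-endian MicroBlaze and little-endian x86 views. */
    static inline uint32_t swap32(uint32_t x)
    {
        return ((x & 0x000000FFu) << 24) |
               ((x & 0x0000FF00u) <<  8) |
               ((x & 0x00FF0000u) >>  8) |
               ((x & 0xFF000000u) >> 24);
    }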
The final difficulty I dealt with in altering the PCIe bridge was due to alignment. A subtlety of the PCIe Bar's design is that it can only be configured to begin at addresses that are a multiple of its allocated address range. At first, I had given it a very large address range without much thought. I later found that this was causing the Bar to ignore many of the least significant bits of the hardware address, which made the bridge appear broken when it was in fact working: the beginning of the window was simply located far before the beginning of my buffers. This problem can be alleviated either by lowering the address size of the Bar or by using bit manipulation on the hardware address to add the correct relative offset within the Bar.
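The bit-manipulation workaround amounts to splitting the hardware address into an aligned base for the Bar and an offset within its window; a minimal sketch, assuming a hypothetical power-of-two window size:

    #include <stdint.h>

    /* The Bar may only start at multiples of its window size, so point it
     * at the aligned base below the buffer and reach the buffer through
     * the remaining offset. BAR_SIZE is a hypothetical 64 KB window. */
    #define BAR_SIZE 0x10000u

    void split_for_bar(uint32_t hw_addr, uint32_t *bar_base, uint32_t *offset)
    {
        *bar_base = hw_addr & ~(BAR_SIZE - 1); /* written to IPIFBAR2PCIBAR_0L */
        *offset   = hw_addr &  (BAR_SIZE - 1); /* added on the PLB side */
    }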

4.4 The Completed Benchmark

The resulting data flow of the benchmark can be seen in Figure 3. The blue arrows represent the process of sending data to the FPGA, and the red arrows represent the return trip. As the diagram shows, the driver and the bitflipper both communicate with the DMA daemon, but neither is responsible for sending the bulk of the data being transferred.

Figure 3: Modified Design Flow

This system also allows both the workstation and the FPGA to poll only local memory, which eliminates the more costly process of polling across the PCIe.

5 Results

The results presented in this section were obtained by running the baseline benchmark, which did not include the configured PCIe Bar or DMA cores, against the new system that I designed. All of the timings in this paper were calculated using the RDTSC instruction to count CPU clock cycles; the cycle counts were divided by the workstation's clock speed to yield times (a sketch of this appears at the end of this section). I also checked these values using gettimeofday, which returned very similar results. Each test consisted of 10 trials, each requesting 1000 DMA transfers in both directions for the varying data sizes. I was careful to record both averages and standard deviations to ensure the quality of my results.

Figure 4 shows the results of running the original benchmark in green (on the left) and the results of running our new benchmark in blue (on the right). The error bars represent the standard deviation in our test data, which remained very small.

Figure 4: Processed Data Throughput

Table 1 shows the results from the same trials with the ratio of improvement added. It can clearly be seen that throughput improved for all sizes of data processed after the optimizations were added.

Table 1: Processed Data Throughput (MB/sec); columns: Bytes, Original, Modified, Ratio

It was especially interesting to observe that even the small 4-byte transfers showed a slight improvement, despite incurring extra overhead to begin the DMA transfers. This can only be explained by the fact that we eliminated the need to poll for data readiness across the PCIe. As the data size increases, the advantage of using DMA transfers begins to dominate: by the time we transfer 2 kilobytes of data, we have surpassed a 3x improvement in our data processing capabilities. Even though the results show great improvement, the numbers still do not approach the maximum bandwidth limits imposed by the PCIe protocol. However, I do not believe this in any way undercuts the impact that this added bandwidth will have.
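A minimal sketch of the cycle-counting timer described above, assuming an x86 workstation and a hypothetical CPU_HZ constant for the clock speed:

    #include <stdint.h>

    #define CPU_HZ 2.4e9  /* hypothetical workstation clock speed */

    /* Read the x86 time-stamp counter via the RDTSC instruction. */
    static inline uint64_t rdtsc(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    /* Elapsed cycles divided by the clock speed yields seconds. */
    double time_once(void (*benchmark_iteration)(void))
    {
        uint64_t start = rdtsc();
        benchmark_iteration();
        return (double)(rdtsc() - start) / CPU_HZ;
    }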

6 Future Work

Accelerating bulk communication between the workstation and the FPGA will remain an open research problem here at UCSD, as it will always limit the abilities of our researchers. I believe the next step is to contact Xilinx on their forums and ask what can be done to improve our results further. An interesting topic that I was unable to test due to time constraints is the impact of wider lane widths on the FPGA's PCIe communication. The ML506 board we are currently using supports only 1x communication, but the slightly newer ML555 series has 8x capability, and the just-released Virtex-6 FPGA boards have been reported to support up to 16x. I have read mixed figures on how much is gained from increasing the width, but an 8x lane width did yield close to an eight-fold speedup in a bandwidth test run by Xilinx [8] and is said to be especially beneficial for bidirectional communication.

The other great improvement to the benchmark would be to take advantage of the time in which the PCIe is not full. The benchmark could be pipelined so that there are multiple buffers of data and one buffer is being transferred while earlier data is being processed. I think this is the only true way to maximize the usage of the PCIe. I hope that Xilinx is able to diagnose the limiting factors in our current communication method. But at the very least, when pipelined data flow is incorporated with the wider lane widths of new boards, we will see far superior performance.

References

[1] Cho, J., Mirzaei, S., Oberg, J., and Kastner, R. FPGA-based face detection system using Haar classifiers. In FPGA '09: Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (New York, NY, USA, 2009), ACM.

[2] Ettinger, E. CSE291 BRAM PCIe. php/cse291_bram_pcie.

[3] Lai, P. PatrickLabDiary. edu/mediawiki/index.php/patricklabdiary.

[4] UCSD. FPGA. bin/view/fpgaweb/webhome.

[5] Xilinx. LogiCORE IP PLBv46 RC/EP Bridge for PCI Express (v4.04a). support/documentation/ip_documentation/plbv46_pcie.pdf.

[6] Xilinx. LogiCORE IP XPS Central DMA Controller (v2.01c). support/documentation/ip_documentation/xps_central_dma.pdf.

[7] Xilinx. PCI Express forum. PCI-Express/bd-p/PCIe.

[8] Xilinx. Spartan-6 FPGA Connectivity Targeted Reference Design Performance. http://ip_documentation/xps_central_dma.pdf.
