Challenges in high speed packet processing

Denis Salopek
University of Zagreb, Faculty of Electrical Engineering and Computing, Croatia
denis.salopek@fer.hr

Abstract - With billions of packets traveling through the Internet and through different devices every second, the demand for fast transfer rates and even faster processing is growing. Physical layer speed is reaching its limit for 100 Gigabit Ethernet, and it is necessary to increase the processing speed of midpoint devices such as routers and switches. This paper provides an overview of technologies which aim to improve flexibility and foster innovation in the domain of line-speed packet processing.

Index Terms - High speed networking, Packet processing, Packet forwarding, Hardware, FPGA

1. INTRODUCTION

Transfer rates of 100 Gbit/s are becoming widely adopted, while vendors are under pressure from telecom operators to introduce even faster line cards, with 400 Gbit/s expected in the following years [1], [2]. Commercial vendors meet the throughput demands by developing application specific integrated circuits (ASICs), which are employed in timing-critical datapaths. However, such an approach results in closed architectures with mostly inflexible designs. To achieve line speed (the hypothetical maximum physical layer bitrate) in packet processing while retaining a programmable environment for easy deployment of updates and modifications, numerous hardware and software implementations have been proposed [3]-[11], the most important of which are covered in this paper.

One of the applications of fast packet processing is route lookup, which is becoming increasingly difficult and challenging because of the rapidly growing number of devices connected to the Internet. According to recent data¹, there are over 500 thousand different IP prefixes on the Internet today.
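Route lookup amounts to longest-prefix matching: among all forwarding-table entries that cover a destination address, the most specific one wins. A minimal sketch of the idea (the table entries and next-hop names are invented for illustration; real routers use specialized structures, not a linear scan):

```python
import ipaddress

# Illustrative forwarding table: prefix -> next hop (entries are invented).
TABLE = {
    ipaddress.ip_network("0.0.0.0/0"): "default-gw",
    ipaddress.ip_network("10.0.0.0/8"): "if0",
    ipaddress.ip_network("10.1.0.0/16"): "if1",
    ipaddress.ip_network("10.1.2.0/24"): "if2",
}

def lookup(dst: str) -> str:
    """Return the next hop of the longest prefix covering dst."""
    addr = ipaddress.ip_address(dst)
    best = max((p for p in TABLE if addr in p),
               key=lambda p: p.prefixlen)
    return TABLE[best]

print(lookup("10.1.2.3"))   # -> if2 (most specific match, 10.1.2.0/24)
print(lookup("10.9.9.9"))   # -> if0 (falls back to 10.0.0.0/8)
```

A scan like this costs O(n) per packet; with hundreds of thousands of prefixes it cannot keep up with line rate, which is why trie- and hash-based lookup structures such as FlashTrie [3] exist.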
While transmission throughput is, as a rule, declared in multiples of bits per second, routers and switches have to perform roughly the same amount of processing regardless of packet size. The primary metric for packet processing speed should therefore be packets per second. As an example, with standard 1500-byte packets a 100 Gbit/s link can carry 8.13 million packets per second (Mpps), but with minimum-sized (64-byte) packets the figure rises to 148.8 Mpps in a single direction, or a new packet every 6.7 ns.

¹ http://bgp.potaroo.net/ - Growth of the BGP Table - 1994 to Present

Software defined networking (SDN) [12] is a relatively new trend in networking in which all network control is programmable and easily accessible to the user. It greatly simplifies tunneling, packet filtering, and the deployment of firewalls or packet inspection nodes without the need for special hardware. It does so by inserting a layer between the hardware (Network Interface Cards - NICs, queues and other network peripherals) and userspace, so that traffic can be controlled directly by the user without obstructions in between. This layer is the interface to the forwarding plane, and it distinguishes such devices from regular, closed networking devices that lack one.

Software implementations, in addition to those of corporations such as Cisco, are mostly being developed on general-purpose platforms (such as personal computers or servers) easily accessible to the wider population, enabling everyday users to modify and upgrade router functionality. The main problem with using a personal computer for packet handling, and the biggest reason why doing so would be impractical and slow, is the kernel packet path [13], explained in more detail later in this paper.
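The packets-per-second figures quoted earlier follow directly from Ethernet framing: each frame occupies its own length plus roughly 20 bytes of per-frame overhead on the wire (8-byte preamble and 12-byte inter-frame gap), and a 1500-byte payload implies a 1518-byte frame (14-byte header plus 4-byte FCS). A quick check of the numbers under those standard assumptions:

```python
def mpps(link_bps: float, frame_bytes: int) -> float:
    """Max packet rate in Mpps, assuming 20 B/frame on-wire overhead
    (8 B preamble + 12 B inter-frame gap)."""
    wire_bits = (frame_bytes + 20) * 8
    return link_bps / wire_bits / 1e6

LINK = 100e9  # 100 Gbit/s

# 1500 B payload => 1518 B frame (14 B header + 4 B FCS)
print(round(mpps(LINK, 1518), 2))  # -> 8.13 Mpps
# minimum-sized 64 B frames
print(round(mpps(LINK, 64), 1))    # -> 148.8 Mpps
# per-packet time budget at line rate, in ns
print(round(1e9 / (mpps(LINK, 64) * 1e6), 1))  # -> 6.7 ns
```

The 6.7 ns budget also explains why memory latency dominates: a single random DRAM access of around 50 ns already costs several packet times.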
The difference between using the kernel network path and avoiding it is best shown by a simple comparison in [14], where two ways of forwarding are compared: redirecting packets from the hardware directly to userspace, bypassing the kernel (on FreeBSD), and an unmodified kernel, where packets follow the kernel path to the userspace application (on Linux). With this configuration, a maximum of around 2M packets per second is achieved for FreeBSD, and around 1M packets per second for Linux. Even when the kernel path is avoided, other obstacles remain, such as data copying and interrupt handling, both of which are time consuming and need to be addressed in order to achieve high speed performance.

Hardware is generally harder to upgrade, so some features present in software implementations, such as the ease of updating and modifying, are not possible in hardware ones. Considering that the hardware development user base is much smaller, such implementations cannot spread to a wider developer community, and they are therefore less common.

The rest of this paper is organized as follows: section 2 gives an overview of current software implementations, how they work, and their advantages and disadvantages. Likewise, section 3 gives an overview of current hardware implementations. Finally, section 4 summarizes the paper and outlines possibilities for future work.
2. SOFTWARE IMPLEMENTATIONS

Today's general-purpose computers possess high memory bandwidth, but their random memory access latency is still very high (around 50 ns per access). This makes memory access one of the biggest bottlenecks in software implementations of high speed packet processing: the more random memory accesses per packet, the lower the throughput. Because of its large number of memory accesses, the kernel network path contributes a lot to this latency. Whenever a packet enters the computer via a physical interface, the hardware (NIC) stores it in kernel memory and interrupts the CPU to notify it that a new packet is available. The packet then waits its turn to be processed by the driver. The driver next has to determine how to forward the packet to the upper layer (for example, whether it belongs to a virtual LAN or is being bridged), which costs additional time, as do the error and validity checks performed at the IP layer. Depending on the packet destination (multicast, this machine, or another destination), the packet is forwarded to a socket with parameters that depend on the upper layer protocol, spending yet more time. Finally, after processing in the transport layer, the packet is ready for the application to receive it; some more validity and security checks are performed, and the packet is received.

A similar path, reversed but somewhat more linear and straightforward, applies to transmission: the application sends the packet to the appropriate socket, and the transport layer performs protocol-specific processing, builds a header, and copies the packet to kernel memory. The kernel hands the packet to the IP layer, which creates the IP header and performs the route lookup, after which the packet is sent to the link layer.
The link layer takes care of scheduling and queuing packets to be sent to the NIC, and ultimately out of the computer if the destination is an external one.

To avoid this complex path and its high number of memory accesses, several solutions already exist, such as raw sockets, the Berkeley Packet Filter (BPF) [15] and similar APIs, but they depend on certain hardware features, give unprotected access to the hardware, or lack the necessary integration with the OS. Some implementations [16] suggest solutions with a minimum number of high-level operations for forwarding an incoming packet:
1) direct memory access (DMA) of the packet from the NIC to memory
2) the CPU reads the packet header
3) the CPU processes the packet - route lookup
4) the CPU modifies the packet in memory
5) DMA of the packet from memory to the NIC

2.1 Click

Click [8] is one of the first software implementations for creating and configuring network processing devices (such as routers and switches). It is implemented in the Linux kernel, which makes it possible to avoid the kernel path, but it also has a user-level implementation working with BPF or any other packet socket mechanism. When Click was being developed (in the late 1990s), handling of multiple processes was done using CPU interrupts. This proved to be inefficient, mainly because of the livelocks interrupts caused. To mitigate these livelocks, Click switched to a polling mechanism: Click polls the device for new packets and receives them as they arrive, without interrupting the processor. To send a packet, Click (as a kernel module) puts it in the transmission queue, and the network device is configured to check whether there are new packets to send. Click can be used to build modular routers by assembling various elements. These elements are simple parts of a router that can individually and independently perform different functions, such as packet forwarding or packet classification, as shown in Figure 1.

Figure 1.
A Click example demonstrating connected elements [8]

Click gives the user a simple way to create fully functional custom routers using predefined elements, or by writing their own modules, without losing too much speed or functionality. Because of that, it is often used in other software router implementations, and it is one of the most popular software-based routers today. In the beginning of Click's development, its maximum loss-free forwarding rate was 333K packets per second, around 4 times higher than a Linux router on the same hardware at the time. According to [17], with some modifications to include netmap's I/O libraries, Click can achieve 3.95M packets per second today.

2.2 RouteBricks

RouteBricks [5] is a platform for high-speed packet processing that can use multiple interconnected general-purpose PCs, taking advantage of multiple processors and multiple cores per processor. Polling makes every CPU more efficient because no time is spent on interrupts, and the packet kernel path is avoided by using kernel-based Click connected directly to the NIC. To design a router with n ports in RouteBricks, the same number of n servers is needed. The router receives packets on every port and, after processing them, transfers each from its input port to the required output port. Because multiple servers handle the packets, they perform both port switching and packet processing at a lower speed and with less CPU cost per server than existing software router solutions that use only one host for packet handling.

Figure 2. Traditional vs. cluster-based router [16]

To achieve this, all the servers are connected in a cluster. This differs from the classical router structure, as shown in Figure 2, and allows processing parallelism and communication between all servers. To improve parallelism even further, processing can be spread across multiple CPUs, with multiple memory accesses, within every server. RouteBricks guarantees routing performance of 12.7 Gbit/s for 64B packets, which works out to around 19M packets per second, and minimal forwarding capability of 19.4 Gbit/s, i.e. around 29M packets per second, for a configuration with four 8-core servers and four ports.
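These per-packet figures can be reproduced with standard Ethernet framing arithmetic (assuming, as before, 20 bytes of on-wire overhead per frame; how the published Gbit/s numbers were converted is not stated in the paper, but this assumption matches them):

```python
def pps(link_bps: float, frame_bytes: int) -> float:
    # Assumes 20 B/frame on-wire overhead (preamble + inter-frame gap).
    return link_bps / ((frame_bytes + 20) * 8)

print(round(pps(12.7e9, 64) / 1e6))  # routing, 64 B packets -> 19 Mpps
print(round(pps(19.4e9, 64) / 1e6))  # minimal forwarding   -> 29 Mpps
```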
This can scale up depending on the number of servers and processing units, but with higher financial and power costs.

2.3 netmap

netmap [9] is a framework for connecting userspace applications with the hardware (NIC), giving fast access to received and transmitted network packets, including those used by the host stack. It is not bound to any specific hardware device and does not need any special hardware features. To achieve high performance and speed, netmap utilises several techniques:
- metadata refactoring, to shed unnecessary per-packet weight
- fixed size packets with memory preallocation
- no copying of packet data, only access to the packet buffers
- multiple hardware queues

It works by disconnecting the host stack from the network adapter and assigning the adapter to the application, as shown in Figure 3. This makes packets unavailable to the host; instead, they reside in a shared memory area for netmap to use. The same memory contains buffers that represent the contents of the network adapter rings (NIC rings - circular arrays of buffers representing hardware managed memory), improving packet processing speed. Those buffers are filled with received packets ready for processing, or with already processed packets being pushed out on the transmission side. netmap claims 14.88M packets per second on a 10 Gbit/s link, which is the maximum for that speed. This was achieved on a single-core 900 MHz computer, so it is safe to assume that faster, multi-core computers should reach even higher rates on faster network adapters (40/100G Ethernet). Newer measurements on Intel's ixl 40G network cards indicate 32M packets per second on the transmitting end and 24M packets per second on the receiving end, with source and receiver on two different ports of the same 40G card.

2.4 DPDK

DPDK [10] is, like netmap, a framework for fast userspace packet processing.
It avoids the kernel network path by communicating directly with the NIC, and it enables the userspace application to transfer packet data with minimal overhead.
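Both frameworks share the same core idea: packet buffers are fixed-size and preallocated in memory shared with the NIC, and ownership is passed by advancing ring indices rather than by copying data. A toy model of such a ring (the class and method names are invented for illustration and mirror no real netmap or DPDK API):

```python
class PacketRing:
    """Toy model of a preallocated NIC ring: fixed-size slots,
    producer/consumer indices, no per-packet allocation or copying."""

    def __init__(self, slots: int = 8, slot_size: int = 2048):
        # One preallocated buffer per slot, as netmap/DPDK preallocate pools.
        self.buffers = [bytearray(slot_size) for _ in range(slots)]
        self.lengths = [0] * slots
        self.head = 0   # next slot the consumer will read
        self.tail = 0   # next slot the producer (the "NIC") will fill
        self.slots = slots

    def nic_receive(self, frame: bytes) -> bool:
        """Producer side: the NIC DMAs a frame into the next free slot."""
        nxt = (self.tail + 1) % self.slots
        if nxt == self.head:            # ring full: frame would be dropped
            return False
        self.buffers[self.tail][:len(frame)] = frame   # write in place
        self.lengths[self.tail] = len(frame)
        self.tail = nxt
        return True

    def poll(self):
        """Consumer side: yield zero-copy views of ready slots."""
        while self.head != self.tail:
            i = self.head
            yield memoryview(self.buffers[i])[:self.lengths[i]]
            self.head = (i + 1) % self.slots   # hand the slot back

ring = PacketRing()
ring.nic_receive(b"first-frame")    # pretend the NIC DMA'd two frames
ring.nic_receive(b"second-frame")
print([bytes(p) for p in ring.poll()])  # -> [b'first-frame', b'second-frame']
```

Compare this with the kernel path, which allocates a buffer and copies the payload at least once per packet: here the consumer only ever sees views into memory the NIC already wrote.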
Figure 3. netmap communication with NIC [9]

Figure 4. PacketShader architecture [11]

DPDK is being developed for Intel CPUs as a set of software libraries. It optimizes NIC drivers [18] by:
- replacing interrupts with polling
- using fixed size packets with memory preallocation (as netmap does)
- spreading objects optimally across all DRAM channels

Intel claims a throughput of 80M packets per second on a single 8-core processor, and double that for a dual-processor configuration, on a Linux or FreeBSD userspace application. This test was performed with four 10 Gbit Ethernet dual-port NICs (two NICs per processor).

2.5 PacketShader

PacketShader [11] is a software router platform that uses Graphics Processing Units (GPUs) for packet processing. Graphics rendering is one of the most complicated and computationally intensive computer operations and requires a high degree of processing parallelism. This, as well as GPU memory management, is what makes GPUs suitable for packet processing and enables more calculations with less strain on the CPU. The PacketShader software architecture is shown in Figure 4. PacketShader uses Linux as a host, whose network stack is not suitable for high-speed packet processing, so several modifications were made to optimize packet I/O:
- Large packet buffer - stores a greater amount of packet data, which speeds up memory allocation and deallocation.
- Parallel hardware processing - thanks to the GPU, multiple packets can be processed at the same time in hardware, greatly reducing per-packet overhead.
- NUMA - Non-Uniform Memory Access enables minimal movement between different types of memory.
- Multi-core CPU scalability - optimizations that allow linear scaling on multi-core CPU systems.

PacketShader claims to achieve 40 Gbit/s packet forwarding for 64-byte packets, reaching almost 60M packets per second.

3. HARDWARE IMPLEMENTATIONS

Because of its speed and easier implementation of parallelism, working at the hardware level is a much better choice for gaining speed, but at the cost of losing extensibility (as already noted above) and some other advantages of software, such as memory management. Even with fast hardware packet processing, memory access remains one of the biggest bottlenecks in hardware-related development. This bottleneck is being addressed by shared memory modules and multithreaded access to those modules, but more research is needed on avoiding memory accesses in packet processing altogether. Hardware development platforms such as ASICs have long, time-consuming development cycles, which is why FPGAs are worth mentioning: they bring fast deployment and short development cycles, which makes them suitable for hardware implementations. Intel already integrated [19] an FPGA into its Xeon processors in 2014 to provide programmable, high performance acceleration to the CPU. The FPGA is reconfigurable, so its functionality can be modified depending on the task, with the intent of taking over some of the load from the CPU.

3.1 NetFPGA

Recent trends in hardware development have shown that FPGAs are becoming more and more popular. Due to their inherent parallelism, they can be used for many different tasks and are more suitable for some application implementations.
NetFPGA [20] is an open source hardware/software platform intended for research and development of high speed hardware-based networking systems. As such, NetFPGA is not in itself a solution for high speed packet processing, but a tool for accomplishing that goal. The newest NetFPGA platform at this moment is NetFPGA SUME, based on a Virtex-7 FPGA, with capabilities for 10, 40 and 100 Gbit/s operation; it is one of the platforms that could be used for high speed processing. Being open source, it is an easily accessible and affordable platform. NetFPGA can be used in tandem with a host via a PCIe adapter, or by itself as a standalone development platform. Figure 5 shows how the NetFPGA platform components are interconnected. Some of its features that show it is capable of processing a high number of packets include:
- High-Speed Interfaces Subsystem - 30 serial links connecting the FPGA board to the Ethernet interfaces, other extensible interfaces, interfaces for connecting the board to other boards, and PCIe for connecting to the host.
- Memory Subsystem - 2 x 4 GB DDR3-SDRAM replaceable modules and 500 MHz SRAM subsystems.
- PCIe Subsystem - 8-lane gen. 3 PCIe enabling up to 64 Gbit/s communication with the host.
and other options.

Figure 5. NetFPGA SUME block diagram [20]

3.2 SwitchBlade

SwitchBlade [4] is a platform for the deployment of custom protocols on programmable hardware, more specifically FPGAs. It is implemented on NetFPGA, but can use any type of FPGA. Besides the FPGA hardware component, it also uses software processing for exceptions where a packet needs to be processed by the CPU. SwitchBlade thus takes a hybrid hardware/software approach to packet processing, with the majority of the processing done by the FPGA. Before processing, SwitchBlade first prepends a platform header to every packet, containing information about the packet such as its unique hash value and how it will be forwarded. After processing the packet, this header is stripped. To forward received packets, SwitchBlade offers four different processing modes:
- Longest Prefix Matching - when a destination address matches multiple entries in the forwarding table, the entry with the longest subnet mask is chosen.
- Hashing - exact matching on the hash value from the platform header.
- Software processing - send the packet to the CPU for processing.
- User defined software processing - send the packet to the CPU for software processing, but only if it matches user defined filters.

SwitchBlade claims a maximum of 1.4M packets per second without packet drops, and it hits that limit only because the packet generator on the older, 1 Gbit/s version of NetFPGA could not generate packets any faster [21]. No tests were carried out on faster network cards (10G, 100G), so it is not safe to assume that the speed and performance would scale on such hardware.

4. CONCLUSION

The Internet is expanding and the need for faster packet processing is rising, so there is always room for improvement; focusing on what is important for the machine that does the processing, and concentrating on its functionality, should be a priority. In the case of packet forwarding, today's routers generally do not allow the user to modify and reconfigure the packet datapath, or to disregard some of its unnecessary functionality. This paper gave an overview of some of those technologies, outlining their limitations and advantages, the throughputs they can achieve, and the opportunities they open. The technologies in this field are always advancing and, with speed being a priority, new ways of keeping up with them should be researched and developed.

REFERENCES

[1] C. Hermsmeyer, H. Song, R. Schlenk, R. Gemelli, and S. Bunse, "Towards 100G packet processing: Challenges and technologies," Bell Labs Technical Journal, vol. 14, no. 2, pp. 57-79, 2009.
[2] "Evolution of Ethernet speeds: what's new and what's next." https://conference.apnic.net/data/37/apricot-2014-ethernet-speed-whats-new-and-next 1393233328.pdf. Created: 2014-02-25.
[3] M. Bando, Y.-L. Lin, and H. J. Chao, "FlashTrie: beyond 100-Gb/s IP route lookup using hash-based prefix-compressed trie," IEEE/ACM Transactions on Networking, vol. 20, no. 4, pp. 1262-1275, 2012.
[4] M. B. Anwer, M. Motiwala, M. b. Tariq, and N. Feamster, "SwitchBlade: a platform for rapid deployment of network protocols on programmable hardware," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 183-194, 2011.
[5] M. Dobrescu, N. Egi, K. Argyraki, B.-G. Chun, K. Fall, G. Iannaccone, A. Knies, M. Manesh, and S. Ratnasamy, "RouteBricks: exploiting parallelism to scale software routers," in Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles, pp. 15-28, ACM, 2009.
[6] L. Rizzo, "netmap: a novel framework for fast packet I/O," in USENIX Annual Technical Conference, pp. 101-112, 2012.
[7] W. Sun and R. Ricci, "Fast and flexible: parallel packet processing with GPUs and Click," in Proceedings of the Ninth ACM/IEEE Symposium on Architectures for Networking and Communications Systems, pp. 25-36, IEEE Press, 2013.
[8] R. Morris, E. Kohler, J. Jannotti, and M. F. Kaashoek, "The Click modular router," ACM Transactions on Computer Systems, 2000.
[9] L. Rizzo and M. Landi, "netmap: memory mapped access to network devices," ACM SIGCOMM Computer Communication Review, vol. 41, pp. 422-423, ACM, 2011.
[10] DPDK: Data Plane Development Kit. http://dpdk.org. Accessed: 2015-02-22.
[11] S. Han, K. Jang, K. Park, and S. Moon, "PacketShader: a GPU-accelerated software router," ACM SIGCOMM Computer Communication Review, vol. 41, no. 4, pp. 195-206, 2011.
[12] "Software-Defined Networking: The New Norm for Networks," ONF White Paper, April 2012.
[13] A. K. Chimata, "Path of a Packet in the Linux Kernel Stack," 2005.
[14] D. Salopek, V. Vasić, M. Zec, M. Mikuc, M. Vašarević, and V. Končar, "A network testbed for commercial telecommunications product testing," in 22nd International Conference on Software, Telecommunications and Computer Networks (SoftCOM 2014), 2014.
[15] S. McCanne and V. Jacobson, "The BSD packet filter: a new architecture for user-level packet capture," in Proceedings of the USENIX Winter 1993 Conference, p. 2, USENIX Association, 1993.
[16] K. Argyraki, S. Baset, B.-G. Chun, K. Fall, G. Iannaccone, A. Knies, E. Kohler, M. Manesh, S. Nedevschi, and S. Ratnasamy, "Can software routers scale?," in Proceedings of the ACM Workshop on Programmable Routers for Extensible Services of Tomorrow, pp. 21-26, ACM, 2008.
[17] L. Rizzo, M. Carbone, and G. Catalli, "Transparent acceleration of software packet forwarding using netmap," in INFOCOM, 2012 Proceedings IEEE, pp. 2471-2479, IEEE, 2012.
[18] "Intel Data Plane Development Kit (Intel DPDK) Overview: Packet Processing on Intel Architecture." http://www.intel.com/content/dam/www/public/us/en/documents/presentation/dpdk-packet-processing-ia-overview-presentation.pdf, 2012. Accessed: 2014-02-22.
[19] "Disrupting the Data Center to Create the Digital Services Economy." https://communities.intel.com/community/itpeernetwork/datastack/blog/2014/06/18/disrupting-the-data-center-to-create-the-digital-services-economy, 2014. Accessed: 2015-02-22.
[20] N. Zilberman, Y. Audzevich, A. Covington, and A. Moore, "NetFPGA SUME: Toward Research Commodity 100Gb/s," 2014.
[21] G. A. Covington, G. Gibb, J. W. Lockwood, and N. McKeown, "A packet generator on the NetFPGA platform," in Field Programmable Custom Computing Machines (FCCM '09), pp. 235-238, IEEE, 2009.