Linux Software Router: Data Plane Optimization and Performance Evaluation


JOURNAL OF NETWORKS, VOL. 2, NO. 3, JUNE 2007

Raffaele Bolla and Roberto Bruschi
DIST - Department of Communications, Computer and Systems Science, University of Genoa, Via Opera Pia 13, Genoa, Italy
{raffaele.bolla, roberto.bruschi}@unige.it

Abstract - Recent technological advances provide an excellent opportunity to achieve truly effective results in the field of open Internet devices, also known as Open Routers or ORs. Even though some initiatives have been undertaken over the last few years to investigate ORs and related topics, other extensive areas still require additional investigation. In this contribution we report the results of the in-depth optimization and testing carried out on a PC Open Router architecture based on Linux software and COTS hardware. The main focus of this paper is the evaluation of the forwarding performance of different Linux-based OR software architectures. This analysis was performed with both external (throughput and latencies) and internal (profiling) measurements. In particular, for the external measurements, a set of RFC 2544 compliant tests is also proposed and analyzed.

Index Terms - Linux Router; Open Router; RFC 2544; IP forwarding.

I. INTRODUCTION

Internet technology has been developed in an open environment: all Internet-related protocols, architectures and structures are publicly created and described. For this reason, in principle, everyone can easily develop an Internet device (e.g., a router). On the contrary, and to a certain extent quite surprisingly, most professional devices are developed in an extremely closed manner: it is very difficult to acquire details about their internal operations and to perform anything more complex than a parametrical configuration. From a general viewpoint this is not very strange, since it can be seen as a clear attempt to protect the industrial investment. However, the experimental nature of the Internet and its diffusion in many contexts might sometimes suggest a different approach. Such a need is even more evident within the scientific community, which often runs into various problems when carrying out experiments, testbeds and trials to evaluate new functionalities and protocols. Today, recent technological advances provide an opportunity to do something truly effective in the field of open Internet devices, sometimes called Open Routers (ORs). Such an opportunity arises from the use of Open Source Operating Systems (OSs) and COTS/PC components. The attractiveness of the OR solution can be summarized as: multi-vendor availability, low cost, and continuous updating/evolution of the basic parts. As far as performance is concerned, the PC architecture is general-purpose, which means that, in principle, it cannot attain the same performance level as custom, high-end network devices, which often use dedicated HW elements to handle and to parallelize the most critical operations. Nevertheless, the performance gap might not be so large and, in any case, it is more than justified by the cost difference. Our activities, carried out within the framework of the BORA-BORA project [1], are geared to facilitate this kind of investigation by reporting the results of an extensive optimization and testing operation carried out on an OR architecture based on Linux software. We focused our attention mainly on packet forwarding functionalities.
Our main objective was the performance evaluation of an optimized OR, carried out with both external (throughput and latencies) and internal (profiling) measurements. In this regard, we identified a high-end reference PC-based hardware architecture and the Linux 2.6 kernel for the software data plane. Subsequently, we optimized this OR structure, defined a test environment and finally developed a complete series of tests with an accurate evaluation of the software modules' role in defining the performance limits. With regard to the state of the art of OR devices, some initiatives have been undertaken over the last few years to develop and investigate ORs and related topics. In the software area, one of the most important initiatives is the Click Modular Router project [2], which proposes an effective data plane solution. In the control plane area two important projects can be cited: Zebra [3] and XORP [4]. Besides these custom developments, some standard Open Source OSs can also provide very effective support for an OR project; the most relevant OSs in this sense are Linux [5][6] and FreeBSD [7]. Other activities focus on hardware: [8] and [9] propose a router architecture based on a PC cluster, while [10] reports some performance results (on packet transmission and reception) obtained with a PC Linux-based testbed. Some evaluations have also been carried out on network boards (see, for example, [11]). Other interesting projects involving Linux-based ORs can be found in [12] and [13], where Bianco et al. report some significant performance results. In [14] a performance analysis of an OR architecture enhanced with FPGA line cards, which allow direct NIC-to-NIC packet forwarding, is introduced. [15] describes the Intel I/OAT, a technology that enables DMA engines to improve network reception and transmission by offloading some low-level operations from the CPU.

In [16] the virtualization of a multiservice OR architecture is discussed: the authors propose multiple forwarding chains virtualized with Xen. Finally, in [17], we proposed an in-depth study of the IP lookup mechanism included in the Linux kernel.

The paper is organized as follows. The software and hardware details of the proposed OR architecture are reported in Sections II and III, respectively, while Section IV contains a description of the performance tuning and optimization techniques. The benchmarking scenario and the performance results are reported in Sections V and VI, respectively. Conclusions are presented in Section VII.

II. LINUX OR SOFTWARE ARCHITECTURE

The OR architecture has to provide many different types of functionalities: from those directly involved in the packet forwarding process to the ones needed for control functionalities, dynamic configuration and monitoring. As outlined in [5], [18] and [19], all the forwarding functions are implemented inside the Linux kernel, while most of the control and monitoring operations (the signaling protocols, such as routing protocols, control protocols, etc.) are daemons/applications running in user mode. As in the older kernel versions, the Linux networking architecture is basically based on an interrupt mechanism: network boards signal the kernel upon packet reception or transmission through HW interrupts. Each HW interrupt is served as soon as possible by a handling routine, which suspends the operations currently being processed by the CPU; until it completes, the handling routine cannot be preempted by anything, not even by other interrupt handlers. Thus, with the clear purpose of keeping the system reactive, the interrupt handlers are designed to be very short, while all the time-consuming tasks are performed afterwards by the so-called Software Interrupts (SoftIRQs). This is the well-known "top half / bottom half" IRQ routine division implemented in the Linux kernel [18]. SoftIRQs are not real interrupts, but rather a form of kernel activity that can be scheduled for later execution. They differ from HW IRQs mainly in that a SoftIRQ is scheduled for execution by another kernel activity, such as an HW IRQ routine, and has to wait until it is called by the scheduler; SoftIRQs can be interrupted only by HW IRQ routines. The NET_TX_SOFTIRQ and the NET_RX_SOFTIRQ are two of the most important SoftIRQs in the Linux kernel and the backbone of the entire networking architecture, since they manage the packet transmission and reception operations, respectively. In detail, the forwarding process is triggered by an HW IRQ generated by a network device, which signals the reception or the transmission of packets. The corresponding routine then performs some fast checks and schedules the correct SoftIRQ, which is activated by the kernel scheduler as soon as possible. When the SoftIRQ is finally executed, it performs all the packet forwarding operations.
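To make the top half / bottom half split concrete, the following is a minimal, illustrative C sketch in the style of a 2.6.13-era network driver: the hardware interrupt handler only checks and masks the board, then defers all packet processing to SoftIRQ context through the old-style NAPI scheduling call. The driver-specific helpers (my_nic_irq_pending(), my_nic_disable_irqs()) are hypothetical; only request_irq(), netif_rx_schedule() and the handler signature reflect the real 2.6.13 API.

```c
#include <linux/interrupt.h>
#include <linux/netdevice.h>

/* Illustrative top half: do the minimum amount of work and defer the
 * packet elaboration to NET_RX_SOFTIRQ via the NAPI scheduling call. */
static irqreturn_t my_nic_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	struct net_device *dev = dev_id;

	if (!my_nic_irq_pending(dev))   /* hypothetical register check      */
		return IRQ_NONE;        /* shared line, not our interrupt   */

	my_nic_disable_irqs(dev);       /* hypothetical: mask board IRQs    */

	/* Put the device on the per-CPU poll list and raise NET_RX_SOFTIRQ;
	 * the actual forwarding work is done later in the poll callback
	 * invoked by net_rx_action(). */
	netif_rx_schedule(dev);

	return IRQ_HANDLED;
}

/* Registration, e.g. in the driver's open() method:
 *     request_irq(dev->irq, my_nic_interrupt, SA_SHIRQ, dev->name, dev);
 */
```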
As shown in Figure 1, which reports a scheme of the Linux source code involved in the forwarding process, the operations computed during the SoftIRQs can be organized in a chain of three different modules: a reception API that handles packet reception (the NAPI, which, in greater detail, also includes part of the HW interrupt handler), a module that carries out the IP-layer elaboration and, finally, a transmission API that manages the forwarding operations towards the egress network interfaces. In particular, the reception and the transmission APIs are the lowest-level modules, and are activated by both HW IRQ routines and scheduled SoftIRQs; they handle the network interfaces and perform some layer 2 functionalities. The NAPI [20] was introduced in recent kernel versions and has been explicitly created to increase the scalability of the reception process. It handles network interface requests with an interrupt moderation mechanism, through which it is possible to adaptively switch from a classical interrupt-based management of the network interfaces to a polling one. In greater detail, this is accomplished by inserting the identifier of the board generating the IRQ into a special list, called the poll list, during the HW IRQ routine, scheduling a reception SoftIRQ, and disabling the HW IRQs for that device. When the SoftIRQ is activated, the kernel polls all the devices whose identifier is included in the poll list, and a maximum of "quota" packets is served per device. If the board buffer (Rx Ring) is emptied, the identifier is removed from the poll list and the HW IRQs of the device are re-enabled; otherwise, its HW IRQs are left disabled, the identifier remains on the poll list and another SoftIRQ is scheduled. While this mechanism behaves like a pure interrupt mechanism in the presence of a low ingress rate (i.e., there is more or less one HW IRQ per packet), when the traffic increases, the probability of emptying the Rx Ring, and thus of re-enabling the HW IRQs, decreases more and more, and the NAPI starts working like a polling mechanism. For each packet received during the NAPI processing, a descriptor called skbuff [21] is immediately allocated. In particular, as shown in Figure 1, to avoid unnecessary memory transfer operations, the packets are left in the memory locations used by the DMA engines of the ingress network interfaces, and each subsequent operation is performed through the skbuffs. These descriptors essentially consist of pointers to the key fields of the headers contained in the associated packets, and are used for all the layer 2 and 3 operations. A packet is elaborated within the same NET_RX SoftIRQ until it is enqueued in an egress device buffer, called Qdisc. Each time a NET_TX SoftIRQ is activated or a new packet is enqueued, the Qdisc buffer is served.
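The reception behavior described above maps onto the old-style (pre-napi_struct) poll callback that 2.6.13 drivers register as dev->poll. The sketch below is a simplified example of such a callback: it serves at most "quota" packets per round, hands each one to the upper layers through netif_receive_skb(), and removes the device from the poll list (re-enabling its IRQs) only when the Rx Ring is empty. The my_nic_* helpers are hypothetical; the budget/quota handling and the netif_rx_complete() call follow the real 2.6.13 convention.

```c
#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/etherdevice.h>

/* Illustrative 2.6.13-style poll callback, registered as dev->poll. */
static int my_nic_poll(struct net_device *dev, int *budget)
{
	int limit = min(*budget, dev->quota);  /* at most "quota" pkts/round */
	int done = 0;
	struct sk_buff *skb;

	while (done < limit &&
	       (skb = my_nic_fetch_from_rx_ring(dev)) != NULL) {
		/* The payload already sits in the buffer filled by the NIC
		 * DMA engine: only the skbuff descriptor is set up here,
		 * no packet copy is performed. */
		skb->protocol = eth_type_trans(skb, dev);
		netif_receive_skb(skb);  /* IP processing, routing decision,
		                            enqueue on the egress Qdisc */
		done++;
	}

	*budget -= done;
	dev->quota -= done;

	if (my_nic_rx_ring_empty(dev)) {       /* hypothetical helper      */
		netif_rx_complete(dev);        /* leave the poll list      */
		my_nic_enable_irqs(dev);       /* hypothetical: unmask IRQs */
		return 0;                      /* no more work pending     */
	}
	return 1;                              /* stay on the poll list    */
}
```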

[Figure 1. Detailed scheme of the forwarding code in the 2.6 Linux kernel: the NAPI reception chain (HW interrupt handler, net_rx_action, e1000_clean_rx_irq, e1000_alloc_rx_buffers, netif_receive_skb), the IP processing chain (ip_rcv, ip_rcv_finish, ip_route_input, ip_forward, ip_output, ip_finish_output, with the Netfilter hooks), and the TX API (dev_queue_xmit, the root Qdisc, qdisc_restart, hard_start_xmit/e1000_xmit_frame, net_tx_action, e1000_clean_tx_irq), together with the per-device Rx/Tx Rings, the per-CPU poll list and the DMA engines.]

When a packet is dequeued from the Qdisc buffer, it is placed on the Tx Ring of the egress device. After the board successfully transmits one or more packets, it generates an HW IRQ, whose routine schedules a NET_TX SoftIRQ. The Tx Ring is periodically cleaned of all the descriptors of transmitted packets, which are de-allocated, and is refilled with the packets coming from the Qdisc buffer. Another interesting characteristic of the 2.6 kernels (introduced to reduce the performance deterioration due to CPU concurrency) is the Symmetric Multi-Processor (SMP) support, which may assign the management of each network interface to a single CPU for both the transmission and reception functionalities.

III. HARDWARE ARCHITECTURE

The Linux OS supports many different hardware architectures, but only a small portion of them can be effectively used to obtain high OR performance. In particular, we must take into account that, during networking operations, the PC internal data path relies on a centralized I/O structure consisting of the I/O bus, the memory channel (used by the DMA engines to transfer data between the network interfaces and the RAM) and the Front Side Bus (FSB) (used by the CPU, through the memory channel, to access the RAM during packet elaboration). The selection criteria for the hardware elements have therefore been: very fast internal busses, RAM with very low access times, and CPUs with high integer computational power (packet processing does not generally require any floating point operations). In order to understand how the hardware architecture affects overall system performance, we selected two different architectures that represent the current state of the art of server architectures and the state of the art of about three years ago, respectively. As the old HW architecture, we chose a system based on the Supermicro X5DL8-GG mainboard: it supports a dual-Xeon system with a dual memory channel and a 64-bit PCI-X bus at 133 MHz. The Xeon processors (32-bit and mono-core) we utilized have a 2.4 GHz clock and a 512 KB cache. For the new OR architecture we used a Supermicro X7DBE mainboard, equipped with both PCI Express and PCI-X busses, and with an Intel Xeon 5050 (a dual-core, 64-bit processor). Network interfaces are another critical element, since they can heavily affect PC router performance. As reported in [11], the network adapters on the market offer different performance levels and configurability. With this in mind, we selected two types of adapters with different features and speeds: a high-performance and configurable Gigabit Ethernet interface, namely the Intel PRO/1000, equipped with either a PCI-X controller (XT version) or a PCI Express one (PT version) [22]; and a D-Link DFE-580TX [23], a network card equipped with four Fast Ethernet interfaces and a PCI 2.1 controller.
IV. SOFTWARE PERFORMANCE TUNING

The entire networking architecture of the Linux kernel is quite complex and has numerous aspects and parameters that can be tuned for system optimization. In particular, since the OS has been developed to act as a network host (i.e., a workstation, a server, etc.), it is natively tuned for general-purpose network end-node usage.

In this last case, packets are not fully processed inside the kernel space, but are usually delivered from the network interfaces to applications in user space, and vice versa. When the Linux kernel is used in an OR architecture, it generally works in a different manner, and should be specifically tuned and customized to obtain the maximum packet forwarding performance. As reported in [19] and [25], where a more detailed description of the adopted tuning actions can be found, this optimization is very important for obtaining the maximum performance. Some of the optimal parameter values can be identified by logical considerations, but most of them have to be determined empirically, since their optimal values cannot be easily derived from the software structure and since they also depend on the hardware components. We therefore carried out our tuning first by identifying the critical elements on which to operate, and then by finding the most convenient values through both logical considerations and experimental measurements. As far as the adopted tuning settings are concerned, we used the e1000 driver [24], configured with both the Rx and Tx ring buffers set to 256 descriptors, while the Rx interrupt generation was not limited. The Qdisc size for all the adapters was dimensioned to 2,000 descriptors, while the scheduler clock frequency was fixed to 1000 Hz. Moreover, the 2.6.13 kernel images used to obtain the numerical results in Section VI include two structural patches that we created to test and/or optimize kernel functionalities; these patches are described in the following.

A. Skbuff Recycling patch

We studied and developed a new version of the skbuff Recycling patch, originally proposed by R. Olsson [26] for the e1000 driver. In particular, the new version has been stabilized for the 2.6.13 kernel and extended to the sundance driver. This patch intercepts the skbuff descriptors of transmitted packets before they are de-allocated, and reuses them for new incoming packets. As shown in [19], this architectural change significantly reduces the computational weight of the memory management operations, thus attaining a maximum throughput considerably higher than that of standard kernels.

B. Performance Counter patch

To further analyze the OR's internal behavior, we introduced a set of counters in the kernel source code in order to understand how many times a certain procedure is called, or how many packets are processed at a time. Specifically, we introduced the following counters: IRQ, the number of interrupt handlers generated by a network card; Tx/Rx IRQ, the number of Tx/Rx IRQ routines per device; Tx/Rx SoftIRQ, the number of Tx/Rx software IRQ routines; Qdiscrun and Qdiscpkt, the number of times the output buffer (Qdisc) is served and the number of packets served each time; Pollrun and Pollpkt, the number of times the Rx ring of a device is served and the number of packets served each time; Tx/Rx clean, the number of times the Tx/Rx cleaning procedures of the driver are activated. The values of all these counters have been mapped into the Linux proc file system.
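As an illustration of how such counters can be exported, the following is a minimal, self-contained sketch (not the authors' actual patch) of a kernel-side counter exposed through the proc file system with the 2.6-era create_proc_read_entry() interface. The counter name forw_pollrun and the increment site are hypothetical; in the real patch the increments sit directly inside the networking code paths being instrumented.

```c
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/proc_fs.h>

/* Hypothetical counter: incremented each time an Rx ring is polled
 * (the real patch instruments several such points in the kernel). */
static unsigned long pollrun_count;

static inline void count_pollrun(void)
{
	pollrun_count++;
}

/* read_proc callback: copy the current counter value into the page
 * buffer provided by the proc file system. */
static int pollrun_read_proc(char *page, char **start, off_t off,
			     int count, int *eof, void *data)
{
	*eof = 1;
	return sprintf(page, "%lu\n", pollrun_count);
}

static int __init forw_counters_init(void)
{
	/* Shows up as /proc/forw_pollrun (name chosen for the example). */
	create_proc_read_entry("forw_pollrun", 0444, NULL,
			       pollrun_read_proc, NULL);
	return 0;
}
module_init(forw_counters_init);
```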
V. BENCHMARKING SCENARIO

To benchmark the OR forwarding performance, we used a professional device, the Agilent N2X Router Tester [27], which can provide throughput and latency measurements with high availability and accuracy (the minimum guaranteed timestamp resolution is 1 ns). Moreover, with two dual-port Gigabit Ethernet cards and one 16-port Fast Ethernet card, we can analyze the OR behavior with a large number of Fast and Gigabit Ethernet interfaces. To better support the performance analysis and to identify the OR bottlenecks, we also performed some internal measurements using specific software tools (called profilers) placed inside the OR, which trace the percentage of CPU utilization of each software module running on the node. The problem is that, with many of these profilers, the computational effort they require perturbs the system performance, thus making the results scarcely meaningful. We verified with many different tests that one of the best is Oprofile [28], an open source tool that continuously monitors the system dynamics with frequent and quite regular sampling of the CPU hardware registers. Oprofile evaluates in depth the CPU utilization of each software application and of each single kernel function running in the system, with very low computational overhead. With regard to the benchmarking scenario, we decided to start by defining a reasonable set of test setups (with an increasing level of complexity) and, for each selected setup, to apply some of the tests defined in RFC 2544 [29]. In particular, we chose to perform these activities by using both a core and an edge router configuration: the former consists of a few high-speed (Gigabit Ethernet) network interfaces, while the latter utilizes a high-speed gateway interface and a large number of Fast Ethernet cards which collect traffic from the access networks. More specifically, we performed our tests by using the following setups (see Figure 2): 1) Setup A: a single mono-directional flow crosses the OR from one Gigabit port to another; 2) Setup B: two full-duplex flows cross the OR, each one using a different pair of Gigabit ports; 3) Setup C: a full-meshed (and full-duplex) traffic matrix applied to 4 Gigabit Ethernet ports; 4) Setup D: a full-meshed (and full-duplex) traffic matrix applied to 1 Gigabit Ethernet port and 12 Fast Ethernet interfaces.

In greater detail, each OR forwarding benchmarking session essentially consists of three test sets, namely:
a) Throughput and latency: this test set is performed by using constant bit rate (CBR) traffic flows, consisting of fixed-size datagrams, to obtain: a) the maximum effective throughput (in Kpackets/s and as a percentage of the theoretical value) versus the IP datagram size; b) the average, maximum and minimum latencies versus the IP datagram size.
b) Back-to-back: these tests are carried out by using bursty traffic flows and by changing both the burst dimension (i.e., the number of packets comprising the burst) and the datagram size. The main results of this kind of test are: a) the zero-loss burst length versus the IP datagram size; b) the average, maximum and minimum latencies versus the size of the IP datagrams comprising the burst (the zero-loss burst length is the maximum number of packets, transmitted with minimum inter-frame gaps, that the System Under Test (SUT) can handle without any loss).
c) Loss Rate: this kind of test is carried out by using CBR traffic flows with different offered loads and IP datagram sizes; the results can be summarized as the throughput versus both the offered load and the IP datagram size.
Note that all these tests have been performed by using different IP datagram sizes (i.e., 40, 64, 128, 256, 512, 1024 and 1500 bytes) and both CBR and bursty traffic flows.

[Figure 2. Benchmarking setups A, B, C and D.]

VI. NUMERICAL RESULTS

A selection of the experimental results is reported in this section. In particular, the results of the benchmarking setups shown in Figure 2 are reported in Subsections A, B, C and D. In all these cases, the tests were performed with the old hardware architecture described in Section III (i.e., 32-bit Xeon and PCI-X bus). With regard to the software architecture, we decided to compare different Linux kernel configurations and the Click Modular Router. In particular, we used the following versions of the Linux kernel: the single-processor optimized kernel (a version based on the standard one with single-processor support, which includes the descriptor recycling patch) and the dual-processor standard kernel (a standard NAPI kernel version similar to the previous one, but with SMP support). Note that we decided not to take into account the SMP versions of either the optimized Linux kernel or the Click Modular Router, since they lack a minimum acceptable level of stability. Subsection E summarizes the results obtained in the previous tests by showing the maximum performance for each benchmarking setup. Finally, the performance of the two hardware architectures described in Section III is reported in Subsection F, in order to evaluate how HW evolution affects forwarding performance.

A. Setup A numerical results

In the first benchmarking session, we performed the RFC 2544 tests by using setup A (see Figure 2) with both the single-processor optimized kernel and Click. As we can observe in Figs. 3, 4 and 5, which report the numerical results of the throughput and latency tests, both software architectures cannot achieve the maximum theoretical throughput in the presence of small datagram sizes. As demonstrated by the profiling measurements reported in Fig. 6, obtained with the single-processor optimized kernel and 64-byte datagrams, this effect is clearly caused by the computational CPU capacity, which limits the maximum forwarding rate of the Linux kernel to about 700 Kpackets/s (about 40% of the full Gigabit speed).
In fact, even though the CPU idle time goes to zero at about 40% of the full load, the CPU occupancies of all the most important function sets keep adapting their contributions up to about 700 Kpackets/s; beyond this point, their percentage contributions to the CPU utilization remain almost constant.

[Figure 3. Throughput and latency test, testbed A: effective throughput results for the single-processor optimized kernel and Click.]
[Figure 4. Throughput and latency test, testbed A: minimum and maximum latencies for both the single-processor optimized kernel and Click.]

More precisely, the profiling results in Fig. 6 show that the computational weight of the memory management operations (like sk_buff allocations and de-allocations) is substantially limited, thanks to the descriptor recycling patch, to less than 25%. In other works of ours, such as [19], we have shown that this patch can be used to save a CPU time share of about 20%.
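To give an idea of how the recycling technique works, the following is a conceptual C sketch, not the authors' actual patch: instead of freeing the sk_buff of every transmitted frame and allocating a fresh one for every received frame, the driver keeps a bounded per-device recycle queue and draws Rx buffers from it first. The queue name, the bound and the helper functions are hypothetical; only the sk_buff queue primitives belong to the real kernel API, and a real implementation must also reset the descriptor state before reuse.

```c
#include <linux/skbuff.h>

#define RECYCLE_MAX 256		/* arbitrary bound chosen for the example */

/* Hypothetical per-device recycle list; in a real driver this would live
 * in the driver's private structure. */
static struct sk_buff_head recycle_list;

/* TX-completion side: keep the descriptor instead of freeing it. */
static void recycle_tx_skb(struct sk_buff *skb)
{
	if (skb_queue_len(&recycle_list) < RECYCLE_MAX) {
		/* A real patch also resets the descriptor state (data
		 * pointers, checksum flags, etc.) so it can be reused. */
		skb_queue_head(&recycle_list, skb);
	} else {
		dev_kfree_skb_any(skb);	/* list full: normal free path */
	}
}

/* RX-refill side: try the recycle list first, then the allocator. */
static struct sk_buff *get_rx_skb(unsigned int size)
{
	struct sk_buff *skb = skb_dequeue(&recycle_list);

	if (!skb)
		skb = dev_alloc_skb(size);	/* standard allocation */
	return skb;
}

/* Somewhere in the driver initialization path:
 *     skb_queue_head_init(&recycle_list);
 */
```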

[Figure 5. Throughput and latency test, testbed A: average latencies for both the single-processor optimized kernel and Click.]
[Figure 6. Profiling results (CPU utilization share of idle, scheduler, memory, IP processing, NAPI, Tx API, IRQ, Ethernet processing and Oprofile itself) of the optimized Linux kernel obtained with testbed setup A.]
[Figure 7. Number of IRQ routines, polls and Rx SoftIRQs (second y-axis) for the Rx board with the skbuff recycling patched kernel, in the presence of an incoming traffic flow with only one IP source address.]
[Figure 8. Number of IRQ routines for the Tx board, and of Tx Ring cleanings performed by the Tx SoftIRQ ("func") and by the Rx SoftIRQ ("wake"), for the skbuff recycling patched kernel, in the presence of an incoming traffic flow with only one IP source address. The second y-axis refers to "wake".]

The behavior of the IRQ management operations may appear rather strange: their CPU utilization level decreases as the input rate increases. There are mainly two reasons for this behavior, both related to the packet grouping effect in the Tx and in the Rx API: when the ingress packet rate rises, the NAPI tends to moderate the IRQ rate by operating more like a polling than an interrupt mechanism (hence a first reduction in the number of interrupts), while the Tx API, under the same conditions, can better exploit the packet grouping mechanism by sending more packets at a time (and so the number of interrupts confirming successful transmissions decreases). When the IRQ weight becomes zero, the OR reaches the saturation point and operates like a pure polling mechanism. With regard to all the other operation sets (i.e., IP and Ethernet processing, NAPI and Tx API), their behavior is clearly bound to the number of forwarded packets: the weight of almost all these classes increases linearly up to the saturation point, and subsequently remains more or less constant. This analysis is also confirmed by the performance counters reported in Figs. 7 and 8, in which both the Tx and Rx boards reduce their IRQ generation rates, while the kernel passes from polling the Rx Ring about twice per received packet to about 0.22 times. The number of Rx SoftIRQs per received packet also decreases as the offered traffic load rises. As far as the transmission dynamics are concerned, Fig. 8 shows very low function occurrences: the Tx IRQ routines decrease their occurrences up to saturation, while the "wake" function, which represents the number of times that the Tx Ring is cleaned and the Qdisc buffer is served during an Rx SoftIRQ, exhibits a mirror-like behavior. This occurs because, when the OR reaches saturation, all the Tx functionalities are activated when the Rx SoftIRQ starts.

[Figure 9. Back-to-back test, testbed A: maximum zero-loss burst lengths.]
[Table I. Back-to-back test, testbed A: minimum, average and maximum latency values (in us) versus packet length for both the single-processor optimized kernel and Click.]

Similar considerations can also be made for the Click modular router: the performance limitations in the presence of short datagrams continue to be caused by a computational bottleneck, but Click's simple packet receive API, based on a pure polling mechanism, improves the throughput performance by lowering the weight of the IRQ management and Rx API functions.
For the same reasons, as shown in Figs. 4 and 5, the receive mechanism included in Click introduces higher packet latencies. Consistently with the previous results, the back-to-back tests, reported in Fig. 9 and Table I, also demonstrate that both the optimized Linux kernel and Click continue to be affected by small datagram sizes.

In fact, while with 256-byte or larger datagrams the measured zero-loss burst length is quite close to the maximum burst length used in the tests, it appears to be heavily limited in the presence of 40-, 64- and, only as far as the Linux kernel is concerned, 128-byte packets. An exception is the 128-byte case, in which the computational bottleneck starts to affect the NAPI while the forwarding rate continues to be very close to the theoretical one. The Linux kernel provides better support for bursty traffic than Click: its zero-loss burst lengths are longer and the associated latencies are smaller. The loss rate test results are reported in Fig. 10.

[Figure 10. Loss Rate test, testbed A: maximum throughput for 40-, 64- and 128-byte datagrams.]

B. Setup B numerical results

In the second benchmarking session we analyzed the performance achieved by the optimized single-processor Linux kernel, the standard SMP Linux kernel and the Click modular router with testbed setup B (see Fig. 2). Fig. 11 reports the maximum effective throughput in terms of forwarded packets per second for a single router interface. From this figure it is clear that, in the presence of short packets, the performance level of all three software architectures is not close to the theoretical one. More specifically, while the best throughput values are achieved by Click, the SMP kernel seems to provide better forwarding rates than the optimized single-processor kernel. In fact, as outlined in [25], if no explicit CPU-interface bindings are present, the SMP kernel processes the received packets using, if possible, the same CPU for the entire packet elaboration, and attempts to dynamically distribute the computational load among the CPUs.

[Figure 11. Throughput and latency test, testbed setup B: effective throughput for the optimized kernel, Click and the SMP kernel.]

Thus, in this particular setup, the computational load sharing tends to manage the two interfaces to which a traffic pair is applied with a single fixed CPU, fully processing each received packet with only one CPU and thus avoiding any memory concurrency problem. Figs. 12 and 13 report the minimum, average and maximum latency values versus the datagram size for all three software architectures. In particular, we note that both Linux kernels, which in this case provide very similar results, ensure lower minimum latencies than Click; Click, instead, provides better average and maximum latency values for short datagrams.

[Figure 12. Throughput and latency test, testbed B: minimum and maximum latencies.]
[Figure 13. Throughput and latency test, testbed B: average latencies.]
[Figure 14. Back-to-back test, testbed B: maximum zero-loss burst lengths.]

The back-to-back results, reported in Fig. 14 and Table II, show that the performance of all the analyzed architectures is nearly comparable in terms of zero-loss burst length, while, as far as latencies are concerned, the Linux kernels provide better values. By analyzing Fig. 15, which reports the loss rate results, we note how the performance values obtained with Click and the SMP kernel are better, especially for small datagrams, than those obtained with the optimized single-processor kernel. Moreover, Fig. 15 also shows that none of the three OR software architectures achieves the full Gigabit/s speed even for large datagrams, with a maximum forwarding rate of about 650 Mbps per interface.
To improve the readability of these results, in Fig. 15 and in all the following loss rate plots we report only the OR behavior with the minimum and maximum datagram sizes, since these represent the performance lower and upper bounds, respectively.

8 JOURNAL OF NETWORKS, VOL. 2, NO. 3, JUNE TABLE II. BACK-TO-BACK TEST, TESTBED B: LATENCY VALUES FOR ALL THREE SOFTWARE ARCHITECTURES. Optimized Kernel SMP Kernel Pkt Length [Byte] Min Average Max Min Average Max Min Average Max B 15B 4B 15B SMP 4B SMP 15B Figure 15. Loss Rate test, testbed B: maximum throughput versus both offered load and IP datagram sizes. C. Setup C numerical results In this benchmarking session, the three software architectures were tested in the presence of four Gigabit Ethernet interfaces with a full-meshed traffic matrix (Fig. 2). By analyzing the maximum effective throughput values in Fig. 16, we note that appears to achieve a better performance level with respect to the Linux kernels while, unlike the previous case, the single processor kernel provides maximum forwarding rates larger than the SMP version with small packets. In fact, the SMP kernel tries to share the computational load of the incoming traffic among the CPUs, resulting in an almost static assignment of each CPU to two specific network interfaces. Since, in the presence of a fullmeshed traffic matrix, about half of the forwarded packets cross the OR between two interfaces managed by different CPUs, this decreases performance due to memory concurrency problems [19]. Figs. 17 and 18 show the minimum, the maximum and the average latency values obtained during this test set. In observing the last results, we note how the SMP kernel, in the presence of short-sized datagrams, continues to undergo memory concurrency problems which lowers OR performance while considerably increasing both the average and the maximum latency values. By analyzing Fig. 19 and Table III, which report the back-to-back test results, we note that all three OR architectures achieve a similar zero-loss burst length, while reaches very high average and maximum latencies with respect to the single-processor and SMP kernels when small packets are used. The loss-rate results in Fig. 2 highlight the performance decay of the SMP kernel, while a fairly similar behavior is achieved by the other two architectures. Moreover, as in the previous benchmarking session, the maximum forwarding rate for each Gigabit network interface is limited to about 6/65 Mbps SMP Figure 16. Throughput and latencies test, setup C: effective throughput results min max min max SMP min SMP max Figure 17. Throughput and latencies test, testbed C: minimum and maximum latencies avg avg SMP avg Figure 18. Throughput and latencies test, results for testbed C: average latencies SMP Figure 19. Back-to-back test, testbed C: maximum zero loss burst lengths Burst Length [pkt] 4B 15B 4B 15B SMP 4B SMP 15B Figure 2. Loss Rate test, testbed C: maximum throughput versus both offered load and IP datagram sizes 27 ACADEMY PUBLISHER

9 14 JOURNAL OF NETWORKS, VOL. 2, NO. 3, JUNE 27 TABLE III. BACK-TO-BACK TEST, TESTBED C: LATENCY VALUES FOR THE SINGLE-PROCESSOR OPTIMIZED KERNEL, THE CLICK MODULAR ROUTER AND THE SMP KERNEL optimized Kernel SMP Kernel Pkt Length [Byte] Min Average Max Min Average Max Min Average Max D. Setup D numerical results In the last benchmarking session, we applied setup D, which provides a full-meshed traffic matrix between one Gigabit Ethernet and 12 Fast Ethernet interfaces, to the single-processor Linux kernel and to the SMP version. We did not use in this last test since, at the moment and for this software architecture, there are no drivers with polling support for the D-Link interfaces. By analyzing the throughput and latency results in Figs. 21, 22 and 23, we note how, in the presence of a high number of interfaces and a full-meshed traffic matrix, the performance of the SMP kernel version drops significantly: the maximum measured value for the effective throughput is limited to about 24 packets/s and the corresponding latencies would appear to be much higher with respect to those obtained with the single processor kernel. However, the single processor kernel also does not support the maximum theoretical rate: it achieves 1% of full speed in the presence of short-sized datagrams and about 75% for high datagram sizes. 8 7 SMP Figure 21. Throughput and latencies test, setup D: effective throughput results for both Linux kernels min max SMP min SMP max Figure 22. Throughput and latencies test, results for testbed D: minimum and maximum latencies for both Linux kernels. To better understand why the OR does not attain fullspeed with such a high number of interfaces, we decided to perform several profiling tests. In particular, these tests were carried out using two simple traffic matrices: the first (Fig. 24) consists of 12 CBR flows that cross the OR from the Fast Ethernet interfaces to the Gigabit one, while the second (Fig. 25) still consists of 12 CBR flows that cross the OR in the opposite direction (e.g., from the Gigabit to the Fast Ethernet interfaces). These simple traffic matrices allow us to separately analyze the reception and transmission operations avg SMP avg Figure 23. Throughput and latencies test, testbed D: average latencies for both the Linux kernels. CPU Percentage [%] Offered Load [Kpackets/s] idle scheduler memory IP processing NAPI Tx API IRQ Eth processing oprofile control Figure 24. Profiling results obtained by using 12 CBR flows that cross the OR from the Fast Ethernet interfaces to the Gigabit one. CPU Percentage [%] Offered Load [Kpackets/s] idle scheduler memory IP processing NAPI Tx API IRQ Eth processing oprofile control Figure 25. Profiling results obtained by using 12 CBR flows that cross the OR from a Gigabit interface to 12 FastEthernet ones. Thus, Figs. 24 and 25 report the profiling results corresponding to the two traffic matrices. The internal measurements shown in Fig. 24 highlight that fact that the CPUs are overloaded by the very high computational load of the IRQ and TX API management operations. This is due to the fact that during the transmission process each interface must signal the state of both the transmitting packets and the transmission ring to the associated driver instance through interrupts. More specifically, and again referring to Fig. 24, we note that IRQ CPU occupancy decreases by up to 3% of the offered load, and afterwards, while the OR reaches saturation, it remains constantly at about 5% of the computational resources. 
The initial decreasing behavior is due to the fact that by increasing the offered load traffic, the OR can better exploit packet grouping effects. Instead, the constant behavior is due to the fact that the OR manages the same packet quantity. Referring to Fig. 27 ACADEMY PUBLISHER

10 JOURNAL OF NETWORKS, VOL. 2, NO. 3, JUNE , we note how the presence of traffic incoming from many interfaces increases the computational weights of both the IRQ and the memory management operations. The decreasing behavior of the IRQ management computational weight is not due, as in the previous case, to the packet grouping effect, but to the typical NAPI structure that passes from an IRQ based mechanism to a polling one. The high memory management values can be explained quite simply by the fact that the recycling patch is not operating with the Fast Ethernet driver. Burst Length [pkt] SMP Figure 26. Back-to-back test, testbed D: maximum zero loss burst lengths. TABLE IV. BACK-TO-BACK TEST, TESTBED D: LATENCY VALUES FOR THE SINGLE-PROCESSOR OPTIMIZED KERNEL AND THE SMP KERNEL. Optimized Kernel SMP Kernel Pkt Length [Byte] Min Average Max Min Average Max B 15B 8 SMP 4B SMP 15B Figure 27. Loss Rate test, testbed D: maximum throughput versus both offered load and IP datagram sizes. The back-to-back results, reported in Fig. 26 and Table IV, show a very particular behavior: in fact, even if the single processor kernel can achieve longer zero-loss burst lengths than the SMP kernel, the latter appears to ensure lower minimum, average and maximum latency values. In the end, Fig. 27 reports the loss rate test results, which, compatible with the previous results, show that a single processor kernel can sustain a higher forwarding throughput than the SMP version. E. Maximum Performance In order to effectively synthesize and improve the evaluation of the proposed performance results, we report in Figs. 28 and 29 the aggregated 2 maximum values for each testbed of, respectively, the effective throughput and the maximum throughput (obtained in the loss rate test). By analyzing Fig. 28, we note that in the presence of more network interfaces, the OR generates values higher than 1 Gbps and, in particular, that it reaches maximum values equal to 1.6 Gbps with testbed D. We can also point out that the maximum effective throughput of setups B and C are almost the same: in fact, these very similar testbeds have only one difference (i.e., the traffic matrix), which has an effect only on the performance level of the SMP kernel, but practically no effect on the behaviors of the single processor kernel and. Effective Throughput [Mbps] 18 setup A 16 setup B 14 setup C setup D Figure 28. Maximum effective throughput values obtained in the implemented testbeds. Throughput [Mbps] setup A setup B setup C setup D Figure 29. Maximum throughput values obtained in the implemented testbeds. The aggregated maximum throughput values, as reported in Fig. 29, are obviously higher than the ones in Fig. 28. This highlights the fact that the maximum forwarding rates sustainable by the OR are achieved in setups B and C with 2.5 Gbps. Moreover, while in setup A the maximum theoretical rate is achieved for packet sizes larger than 128, in all the other setups the maximum throughput values are not much higher than half the theoretical ones. F. Hardware Architecture Impact In the final benchmarking session, we decided to compare the performance of the two hardware architectures introduced in Section III, which represent the current and the state-of-the-art of server architectures four years ago. The benchmarking scenario is the one used in testbed A (with reference to Fig. 2), while the selected software architecture is the single processor optimized kernel. 
2 In this case, "aggregated" refers to the sum of the forwarding rates of all the OR network interfaces.

It is clear that the purpose of these tests was to understand how the continuous evolution of COTS hardware affects overall OR performance. Therefore, Figs. 30, 31 and 32 report the results of the effective throughput tests for the old architecture (i.e., 32-bit Xeon) and the new one (i.e., 64-bit Xeon) equipped with both PCI-X and PCI-Express busses. The loss rate results are shown in Fig. 33.

[Figure 30. Throughput and latency test, setup A with the old HW architecture and the new one equipped with PCI-X and PCI-Express busses: effective throughput results for the single-processor optimized kernel. Note that the x-axis is in logarithmic scale.]
[Figure 31. Throughput and latency test, testbed A with the old HW architecture and the new one equipped with PCI-X and PCI-Express busses: minimum and maximum latencies for the single-processor optimized kernel.]
[Figure 32. Throughput and latency test, testbed A with the old HW architecture and the new one equipped with PCI-X and PCI-Express busses: average latencies for the single-processor optimized kernel.]

By observing the comparisons in Figs. 30 and 31, it is clear that the new architecture generally provides better performance than the old one: more specifically, while using the new architecture with the PCI-X bus only slightly improves performance, when PCI Express is used the OR effective throughput reaches an impressive 88% with 40-byte packets, and achieves the maximum theoretical rate for all the other packet sizes. All this is clearly due to the high efficiency of the PCI Express bus: with this I/O bus, DMA transfers occur with a very low control overhead (since it behaves like a leased line), which probably leads to lighter accesses to the RAM and, consequently, to benefits in terms of memory access by the CPU. In other words, this large performance enhancement is caused by more effective memory access by the CPU, thanks to the features of the PCI Express DMA.

[Figure 33. Loss Rate test, testbed A, for the old HW architecture and the new one equipped with PCI-X and PCI-Express busses: maximum throughput versus both offered load and IP datagram size.]

VII. CONCLUSIONS

In this contribution we report the results of the in-depth optimization and testing carried out on a PC Open Router architecture based on Linux software and, more specifically, on the Linux 2.6 kernel. We have presented a performance evaluation, in some common working environments, of three different data plane architectures, namely the optimized Linux 2.6 kernel, the Click Modular Router and the SMP Linux 2.6 kernel, with both external (throughput and latencies) and internal (profiling) measurements. The external measurements were performed in an RFC 2544 [29] compliant manner by using professional devices [27]. Two hardware architectures were also tested and compared for the purpose of understanding how the evolution of COTS hardware may affect performance. The experimental results show that the optimized version of the Linux kernel, with a suitable hardware architecture, can achieve performance levels high enough to effectively support several Gigabit interfaces. The results obtained show that the OR can reach very interesting performance levels, attaining aggregated forwarding rates of about 2.5 Gbps with relatively low latencies.
REFERENCES
[1] Building Open Router Architectures Based On Router Aggregation project (BORA-BORA), project homepage.
[2] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek, "The Click modular router", ACM Transactions on Computer Systems, 18(3), Aug. 2000, pp. 263-297.
[3] GNU Zebra.
[4] M. Handley, O. Hodson, E. Kohler, "XORP: an open platform for network research", ACM SIGCOMM Computer Communication Review, Vol. 33, Issue 1, Jan. 2003.
[5] S. Radhakrishnan, "Linux - Advanced networking overview".
[6] M. Rio et al., "A map of the networking code in Linux kernel 2.4.20", Technical Report DataTAG-2004-1, FP5/IST DataTAG Project, Mar. 2004.
[7] FreeBSD.

[8] B. Chen and R. Morris, "Flexible Control of Parallelism in a Multiprocessor PC Router", Proc. of the 2001 USENIX Annual Technical Conference (USENIX '01), Boston, USA, June 2001.
[9] C. Duret, F. Rischette, J. Lattmann, V. Laspreses, P. Van Heuven, S. Van den Berghe, P. Demeester, "High Router Flexibility and Performance by Combining Dedicated Lookup Hardware (IFT), off the Shelf Switches and Linux", Proc. of the 2nd International IFIP-TC6 Networking Conference, Pisa, Italy, May 2002, LNCS 2345, E. Gregori et al. (Eds.), Springer-Verlag, 2002.
[10] A. Barczyk, A. Carbone, J. P. Dufey, D. Galli, B. Jost, U. Marconi, N. Neufeld, G. Peco, V. Vagnoni, "Reliability of datagram transmission on Gigabit Ethernet at full link load", LHCb technical note, LHCb 2004-030 DAQ, Mar. 2004.
[11] P. Gray, A. Betz, "Performance Evaluation of Copper-Based Gigabit Ethernet Interfaces", Proc. of the 27th Annual IEEE Conference on Local Computer Networks (LCN'02), Tampa, Florida, November 2002.
[12] A. Bianco, R. Birke, D. Bolognesi, J. M. Finochietto, G. Galante, M. Mellia, M.L.N.P.P. Prashant, F. Neri, "Click vs. Linux: Two Efficient Open-Source IP Network Stacks for Software Routers", Proc. of the 2005 IEEE Workshop on High Performance Switching and Routing (HPSR 2005), Hong Kong, May 2005.
[13] A. Bianco, J. M. Finochietto, G. Galante, M. Mellia, F. Neri, "Open-Source PC-Based Software Routers: a Viable Approach to High-Performance Packet Switching", Proc. of the 3rd International Workshop on QoS in Multiservice IP Networks (QoS-IP 2005), Catania, Italy, Feb. 2005.
[14] A. Bianco, R. Birke, G. Botto, M. Chiaberge, J. Finochietto, G. Galante, M. Mellia, F. Neri, M. Petracca, "Boosting the Performance of PC-based Software Routers with FPGA-enhanced Network Interface Cards", Proc. of the 2006 IEEE Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, June 2006.
[15] A. Grover, C. Leech, "Accelerating Network Receive Processing: Intel I/O Acceleration Technology", Proc. of the 2005 Linux Symposium, Ottawa, Ontario, Canada, Jul. 2005, vol. 1.
[16] R. McIlroy, J. Sventek, "Resource Virtualization of Network Routers", Proc. of the 2006 IEEE Workshop on High Performance Switching and Routing (HPSR 2006), Poznan, Poland, June 2006.
[17] R. Bolla, R. Bruschi, "The IP Lookup Mechanism in a Linux Software Router: Performance Evaluation and Optimizations", Proc. of the 2007 IEEE Workshop on High Performance Switching and Routing (HPSR 2007), New York, USA.
[18] K. Wehrle, F. Pählke, H. Ritter, D. Müller, M. Bechler, "The Linux Networking Architecture: Design and Implementation of Network Protocols in the Linux Kernel", Pearson Prentice Hall, Upper Saddle River, NJ, USA, 2004.
[19] R. Bolla, R. Bruschi, "A high-end Linux based Open Router for IP QoS networks: tuning and performance analysis with internal (profiling) and external measurement tools of the packet forwarding capabilities", Proc. of the 3rd International Workshop on Internet Performance, Simulation, Monitoring and Measurements (IPS-MoMe 2005), Warsaw, Poland, Mar. 2005.
[20] J. H. Salim, R. Olsson, A. Kuznetsov, "Beyond Softnet", Proc. of the 5th Annual Linux Showcase & Conference, Oakland, California, USA, Nov. 2001.
[21] A. Cox, "Network Buffers and Memory Management", Linux Journal, Oct. 1996.
[22] The Intel PRO/1000 XT Server Adapter.
[23] The D-Link DFE-580TX quad Fast Ethernet network adapter.
[24] J. A. Ronciak, J. Brandeburg, G. Venkatesan, M. Williams, "Networking Driver Performance and Measurement - e1000, A Case Study", Proc. of the 2005 Linux Symposium, Ottawa, Ontario, Canada, July 2005, vol. 2.
[25] R. Bolla, R. Bruschi, "IP forwarding Performance Analysis in the presence of Control Plane Functionalities in a PC-based Open Router", Proc. of the 2005 Tyrrhenian International Workshop on Digital Communications (TIWDC 2005), Sorrento, Italy, June 2005; also in F. Davoli, S. Palazzo, S. Zappatore, Eds., "Distributed Cooperative Laboratories: Networking, Instrumentation, and Measurements", Springer, Norwell, MA, 2006.
[26] The descriptor recycling patch, ftp://robur.slu.se/pub/Linux/net-development/skb_recycling/.
[27] The Agilent N2X Router Tester, comms.agilent.com/n2x/products/.
[28] Oprofile.
[29] Request for Comments 2544 (RFC 2544), org/rfcs/rfc2544.html.

Raffaele Bolla was born in Savona, Italy. He received his Master of Science degree in Electronic Engineering from the University of Genoa in 1989 and his Ph.D. degree in Telecommunications from the Department of Communications, Computer and Systems Science (DIST) of the same university in 1994. From 1996 to 2004 he worked as a researcher at DIST, where, since 2004, he has been an Associate Professor and teaches a course in Telecommunication Networks and Telematics. His current research interests focus on resource allocation, Call Admission Control and routing in multi-service IP networks, as well as Multiple Access Control, resource allocation and routing in both cellular and ad hoc wireless networks. He has authored or co-authored over 100 scientific publications in international journals and conference proceedings, and has been the Principal Investigator of many projects in the field of telecommunication networks.

Roberto Bruschi was born in Genoa, Italy. He received his Master of Science degree in Telecommunication Engineering from the University of Genoa in 2002 and his Ph.D. in Electronic Engineering from the same university in 2006. He is presently working with the Telematics and Telecommunication Networks Lab (TNT) in the Department of Communications, Computer and Systems Science (DIST) at the University of Genoa, and he is also a member of CNIT, the Italian inter-university Consortium for Telecommunications. Roberto is an active member of various Italian research projects in the networking area, such as BORA-BORA, FAMOUS, TANGO and EURO. He has co-authored over 10 papers in international conferences and journals. His main interests include Linux software routers, network processors, TCP and network modeling, VPN design, P2P modeling, bandwidth allocation, admission control and routing in multi-service QoS IP/MPLS networks.


More information

Welcome to the Dawn of Open-Source Networking. Linux IP Routers Bob Gilligan gilligan@vyatta.com

Welcome to the Dawn of Open-Source Networking. Linux IP Routers Bob Gilligan gilligan@vyatta.com Welcome to the Dawn of Open-Source Networking. Linux IP Routers Bob Gilligan gilligan@vyatta.com Outline About Vyatta: Open source project, and software product Areas we re working on or interested in

More information

OpenFlow Switching: Data Plane Performance

OpenFlow Switching: Data Plane Performance This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE ICC 21 proceedings OpenFlow : Data Plane Performance Andrea Bianco,

More information

Open-source routing at 10Gb/s

Open-source routing at 10Gb/s Open-source routing at Gb/s Olof Hagsand, Robert Olsson and Bengt Gördén, Royal Institute of Technology (KTH), Sweden Email: {olofh, gorden}@kth.se Uppsala University, Uppsala, Sweden Email: robert.olsson@its.uu.se

More information

- An Essential Building Block for Stable and Reliable Compute Clusters

- An Essential Building Block for Stable and Reliable Compute Clusters Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative

More information

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring CESNET Technical Report 2/2014 HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring VIKTOR PUš, LUKÁš KEKELY, MARTIN ŠPINLER, VÁCLAV HUMMEL, JAN PALIČKA Received 3. 10. 2014 Abstract

More information

How To Improve Performance On A Linux Based Router

How To Improve Performance On A Linux Based Router Linux Based Router Over 10GE LAN Cheng Cui, Chui-hui Chiu, and Lin Xue Department of Computer Science Louisiana State University, LA USA Abstract High speed routing with 10Gbps link speed is still very

More information

The new frontier of the DATA acquisition using 1 and 10 Gb/s Ethernet links. Filippo Costa on behalf of the ALICE DAQ group

The new frontier of the DATA acquisition using 1 and 10 Gb/s Ethernet links. Filippo Costa on behalf of the ALICE DAQ group The new frontier of the DATA acquisition using 1 and 10 Gb/s Ethernet links Filippo Costa on behalf of the ALICE DAQ group DATE software 2 DATE (ALICE Data Acquisition and Test Environment) ALICE is a

More information

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin D. K. Panda Network Based

More information

Router Architectures

Router Architectures Router Architectures An overview of router architectures. Introduction What is a Packet Switch? Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers 2 1 Router Components

More information

Lustre Networking BY PETER J. BRAAM

Lustre Networking BY PETER J. BRAAM Lustre Networking BY PETER J. BRAAM A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC. APRIL 2007 Audience Architects of HPC clusters Abstract This paper provides architects of HPC clusters with information

More information

PCI Express* Ethernet Networking

PCI Express* Ethernet Networking White Paper Intel PRO Network Adapters Network Performance Network Connectivity Express* Ethernet Networking Express*, a new third-generation input/output (I/O) standard, allows enhanced Ethernet network

More information

VMWARE WHITE PAPER 1

VMWARE WHITE PAPER 1 1 VMWARE WHITE PAPER Introduction This paper outlines the considerations that affect network throughput. The paper examines the applications deployed on top of a virtual infrastructure and discusses the

More information

High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features

High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features UDC 621.395.31:681.3 High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features VTsuneo Katsuyama VAkira Hakata VMasafumi Katoh VAkira Takeyama (Manuscript received February 27, 2001)

More information

Packet Capture in 10-Gigabit Ethernet Environments Using Contemporary Commodity Hardware

Packet Capture in 10-Gigabit Ethernet Environments Using Contemporary Commodity Hardware Packet Capture in 1-Gigabit Ethernet Environments Using Contemporary Commodity Hardware Fabian Schneider Jörg Wallerich Anja Feldmann {fabian,joerg,anja}@net.t-labs.tu-berlin.de Technische Universtität

More information

Necessary Functions of a Multi-Stage Software Router

Necessary Functions of a Multi-Stage Software Router SNMP Management in a Distributed Software Router Architecture Andrea Bianco, Robert Birke, Fikru Getachew Debele, Luca Giraudo Dip. di Elettronica, Politecnico di Torino, Italy, Email: {last name}@tlc.polito.it

More information

High-performance vswitch of the user, by the user, for the user

High-performance vswitch of the user, by the user, for the user A bird in cloud High-performance vswitch of the user, by the user, for the user Yoshihiro Nakajima, Wataru Ishida, Tomonori Fujita, Takahashi Hirokazu, Tomoya Hibi, Hitoshi Matsutahi, Katsuhiro Shimano

More information

High-Speed TCP Performance Characterization under Various Operating Systems

High-Speed TCP Performance Characterization under Various Operating Systems High-Speed TCP Performance Characterization under Various Operating Systems Y. Iwanaga, K. Kumazoe, D. Cavendish, M.Tsuru and Y. Oie Kyushu Institute of Technology 68-4, Kawazu, Iizuka-shi, Fukuoka, 82-852,

More information

Wire-speed Packet Capture and Transmission

Wire-speed Packet Capture and Transmission Wire-speed Packet Capture and Transmission Luca Deri Packet Capture: Open Issues Monitoring low speed (100 Mbit) networks is already possible using commodity hardware and tools based on libpcap.

More information

Enabling Linux* Network Support of Hardware Multiqueue Devices

Enabling Linux* Network Support of Hardware Multiqueue Devices Enabling Linux* Network Support of Hardware Multiqueue Devices Zhu Yi Intel Corp. yi.zhu@intel.com Peter P. Waskiewicz, Jr. Intel Corp. peter.p.waskiewicz.jr@intel.com Abstract In the Linux kernel network

More information

TCP/IP Jumbo Frames Network Performance Evaluation on A Testbed Infrastructure

TCP/IP Jumbo Frames Network Performance Evaluation on A Testbed Infrastructure I.J. Wireless and Microwave Technologies, 2012, 6, 29-36 Published Online December 2012 in MECS (http://www.mecs-press.net) DOI: 10.5815/ijwmt.2012.06.05 Available online at http://www.mecs-press.net/ijwmt

More information

Receive Descriptor Recycling for Small Packet High Speed Ethernet Traffic

Receive Descriptor Recycling for Small Packet High Speed Ethernet Traffic IEEE MELECON 2006, May 6-9, Benalmádena (Málaga), Spain Receive Descriptor Recycling for Small Packet High Speed Ethernet Traffic Cedric Walravens Department of Electrical Engineering - ESAT Katholieke

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Quantifying TCP Performance for IPv6 in Linux- Based Server Operating Systems

Quantifying TCP Performance for IPv6 in Linux- Based Server Operating Systems Cyber Journals: Multidisciplinary Journals in Science and Technology, Journal of Selected Areas in Telecommunications (JSAT), November Edition, 2013 Volume 3, Issue 11 Quantifying TCP Performance for IPv6

More information

High-Density Network Flow Monitoring

High-Density Network Flow Monitoring Petr Velan petr.velan@cesnet.cz High-Density Network Flow Monitoring IM2015 12 May 2015, Ottawa Motivation What is high-density flow monitoring? Monitor high traffic in as little rack units as possible

More information

How To Test A Microsoft Vxworks Vx Works 2.2.2 (Vxworks) And Vxwork 2.4.2-2.4 (Vkworks) (Powerpc) (Vzworks)

How To Test A Microsoft Vxworks Vx Works 2.2.2 (Vxworks) And Vxwork 2.4.2-2.4 (Vkworks) (Powerpc) (Vzworks) DSS NETWORKS, INC. The Gigabit Experts GigMAC PMC/PMC-X and PCI/PCI-X Cards GigPMCX-Switch Cards GigPCI-Express Switch Cards GigCPCI-3U Card Family Release Notes OEM Developer Kit and Drivers Document

More information

The Performance Analysis of Linux Networking Packet Receiving

The Performance Analysis of Linux Networking Packet Receiving The Performance Analysis of Linux Networking Packet Receiving Wenji Wu, Matt Crawford Fermilab CHEP 2006 wenji@fnal.gov, crawdad@fnal.gov Topics Background Problems Linux Packet Receiving Process NIC &

More information

Assessing the Performance of Virtualization Technologies for NFV: a Preliminary Benchmarking

Assessing the Performance of Virtualization Technologies for NFV: a Preliminary Benchmarking Assessing the Performance of Virtualization Technologies for NFV: a Preliminary Benchmarking Roberto Bonafiglia, Ivano Cerrato, Francesco Ciaccia, Mario Nemirovsky, Fulvio Risso Politecnico di Torino,

More information

DROP: An Open-Source Project towards Distributed SW Router Architectures

DROP: An Open-Source Project towards Distributed SW Router Architectures 1 DROP: An Open-Source Project towards Distributed SW Router Architectures Raffaele Bolla, Member, IEEE, Roberto Bruschi, Guerino Lamanna and Andrea Ranieri Department of Communications, Computer and Systems

More information

Accelerating High-Speed Networking with Intel I/O Acceleration Technology

Accelerating High-Speed Networking with Intel I/O Acceleration Technology White Paper Intel I/O Acceleration Technology Accelerating High-Speed Networking with Intel I/O Acceleration Technology The emergence of multi-gigabit Ethernet allows data centers to adapt to the increasing

More information

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance M. Rangarajan, A. Bohra, K. Banerjee, E.V. Carrera, R. Bianchini, L. Iftode, W. Zwaenepoel. Presented

More information

Boosting Data Transfer with TCP Offload Engine Technology

Boosting Data Transfer with TCP Offload Engine Technology Boosting Data Transfer with TCP Offload Engine Technology on Ninth-Generation Dell PowerEdge Servers TCP/IP Offload Engine () technology makes its debut in the ninth generation of Dell PowerEdge servers,

More information

Virtualised MikroTik

Virtualised MikroTik Virtualised MikroTik MikroTik in a Virtualised Hardware Environment Speaker: Tom Smyth CTO Wireless Connect Ltd. Event: MUM Krackow Feb 2008 http://wirelessconnect.eu/ Copyright 2008 1 Objectives Understand

More information

Influence of Load Balancing on Quality of Real Time Data Transmission*

Influence of Load Balancing on Quality of Real Time Data Transmission* SERBIAN JOURNAL OF ELECTRICAL ENGINEERING Vol. 6, No. 3, December 2009, 515-524 UDK: 004.738.2 Influence of Load Balancing on Quality of Real Time Data Transmission* Nataša Maksić 1,a, Petar Knežević 2,

More information

A NOVEL RESOURCE EFFICIENT DMMS APPROACH

A NOVEL RESOURCE EFFICIENT DMMS APPROACH A NOVEL RESOURCE EFFICIENT DMMS APPROACH FOR NETWORK MONITORING AND CONTROLLING FUNCTIONS Golam R. Khan 1, Sharmistha Khan 2, Dhadesugoor R. Vaman 3, and Suxia Cui 4 Department of Electrical and Computer

More information

Where IT perceptions are reality. Test Report. OCe14000 Performance. Featuring Emulex OCe14102 Network Adapters Emulex XE100 Offload Engine

Where IT perceptions are reality. Test Report. OCe14000 Performance. Featuring Emulex OCe14102 Network Adapters Emulex XE100 Offload Engine Where IT perceptions are reality Test Report OCe14000 Performance Featuring Emulex OCe14102 Network Adapters Emulex XE100 Offload Engine Document # TEST2014001 v9, October 2014 Copyright 2014 IT Brand

More information

Gigabit Ethernet. Abstract. 1. Introduction. 2. Benefits of Gigabit Ethernet

Gigabit Ethernet. Abstract. 1. Introduction. 2. Benefits of Gigabit Ethernet Table of Contents Abstract... 2 1. Introduction... 2 2. Benefits of Gigabit Ethernet... 2 2.1 Easy Migration to Higher Performance Levels... 3 2.2 Decreased Overall Costs Over Time... 3 2.3 Supports for

More information

Accelerate In-Line Packet Processing Using Fast Queue

Accelerate In-Line Packet Processing Using Fast Queue Accelerate In-Line Packet Processing Using Fast Queue Chun-Ying Huang 1, Chi-Ming Chen 1, Shu-Ping Yu 1, Sheng-Yao Hsu 1, and Chih-Hung Lin 1 Department of Computer Science and Engineering, National Taiwan

More information

Introduction to PCI Express Positioning Information

Introduction to PCI Express Positioning Information Introduction to PCI Express Positioning Information Main PCI Express is the latest development in PCI to support adapters and devices. The technology is aimed at multiple market segments, meaning that

More information

Open Flow Controller and Switch Datasheet

Open Flow Controller and Switch Datasheet Open Flow Controller and Switch Datasheet California State University Chico Alan Braithwaite Spring 2013 Block Diagram Figure 1. High Level Block Diagram The project will consist of a network development

More information

Distributed applications monitoring at system and network level

Distributed applications monitoring at system and network level Distributed applications monitoring at system and network level Monarc Collaboration 1 Abstract Most of the distributed applications are presently based on architectural models that don t involve real-time

More information

The Bus (PCI and PCI-Express)

The Bus (PCI and PCI-Express) 4 Jan, 2008 The Bus (PCI and PCI-Express) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the

More information

Networking Virtualization Using FPGAs

Networking Virtualization Using FPGAs Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Massachusetts,

More information

4 Internet QoS Management

4 Internet QoS Management 4 Internet QoS Management Rolf Stadler School of Electrical Engineering KTH Royal Institute of Technology stadler@ee.kth.se September 2008 Overview Network Management Performance Mgt QoS Mgt Resource Control

More information

Chapter 5 Cubix XP4 Blade Server

Chapter 5 Cubix XP4 Blade Server Chapter 5 Cubix XP4 Blade Server Introduction Cubix designed the XP4 Blade Server to fit inside a BladeStation enclosure. The Blade Server features one or two Intel Pentium 4 Xeon processors, the Intel

More information

Putting it on the NIC: A Case Study on application offloading to a Network Interface Card (NIC)

Putting it on the NIC: A Case Study on application offloading to a Network Interface Card (NIC) This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE CCNC 2006 proceedings. Putting it on the NIC: A Case Study on application

More information

D1.2 Network Load Balancing

D1.2 Network Load Balancing D1. Network Load Balancing Ronald van der Pol, Freek Dijkstra, Igor Idziejczak, and Mark Meijerink SARA Computing and Networking Services, Science Park 11, 9 XG Amsterdam, The Netherlands June ronald.vanderpol@sara.nl,freek.dijkstra@sara.nl,

More information

Effects of Filler Traffic In IP Networks. Adam Feldman April 5, 2001 Master s Project

Effects of Filler Traffic In IP Networks. Adam Feldman April 5, 2001 Master s Project Effects of Filler Traffic In IP Networks Adam Feldman April 5, 2001 Master s Project Abstract On the Internet, there is a well-documented requirement that much more bandwidth be available than is used

More information

Performance Analysis of AQM Schemes in Wired and Wireless Networks based on TCP flow

Performance Analysis of AQM Schemes in Wired and Wireless Networks based on TCP flow International Journal of Soft Computing and Engineering (IJSCE) Performance Analysis of AQM Schemes in Wired and Wireless Networks based on TCP flow Abdullah Al Masud, Hossain Md. Shamim, Amina Akhter

More information

Architecture of distributed network processors: specifics of application in information security systems

Architecture of distributed network processors: specifics of application in information security systems Architecture of distributed network processors: specifics of application in information security systems V.Zaborovsky, Politechnical University, Sait-Petersburg, Russia vlad@neva.ru 1. Introduction Modern

More information

Challenges in high speed packet processing

Challenges in high speed packet processing Challenges in high speed packet processing Denis Salopek University of Zagreb, Faculty of Electrical Engineering and Computing, Croatia denis.salopek@fer.hr Abstract With billions of packets traveling

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

VON/K: A Fast Virtual Overlay Network Embedded in KVM Hypervisor for High Performance Computing

VON/K: A Fast Virtual Overlay Network Embedded in KVM Hypervisor for High Performance Computing Journal of Information & Computational Science 9: 5 (2012) 1273 1280 Available at http://www.joics.com VON/K: A Fast Virtual Overlay Network Embedded in KVM Hypervisor for High Performance Computing Yuan

More information

Network Performance Optimisation and Load Balancing. Wulf Thannhaeuser

Network Performance Optimisation and Load Balancing. Wulf Thannhaeuser Network Performance Optimisation and Load Balancing Wulf Thannhaeuser 1 Network Performance Optimisation 2 Network Optimisation: Where? Fixed latency 4.0 µs Variable latency

More information

Monitoring high-speed networks using ntop. Luca Deri <deri@ntop.org>

Monitoring high-speed networks using ntop. Luca Deri <deri@ntop.org> Monitoring high-speed networks using ntop Luca Deri 1 Project History Started in 1997 as monitoring application for the Univ. of Pisa 1998: First public release v 0.4 (GPL2) 1999-2002:

More information

Performance Analysis of Large Receive Offload in a Xen Virtualized System

Performance Analysis of Large Receive Offload in a Xen Virtualized System Performance Analysis of Large Receive Offload in a Virtualized System Hitoshi Oi and Fumio Nakajima The University of Aizu, Aizu Wakamatsu, JAPAN {oi,f.nkjm}@oslab.biz Abstract System-level virtualization

More information

Parallel Firewalls on General-Purpose Graphics Processing Units

Parallel Firewalls on General-Purpose Graphics Processing Units Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering

More information

PE310G4BPi40-T Quad port Copper 10 Gigabit Ethernet PCI Express Bypass Server Intel based

PE310G4BPi40-T Quad port Copper 10 Gigabit Ethernet PCI Express Bypass Server Intel based PE310G4BPi40-T Quad port Copper 10 Gigabit Ethernet PCI Express Bypass Server Intel based Description Silicom s quad port Copper 10 Gigabit Ethernet Bypass server adapter is a PCI-Express X8 network interface

More information

Computer Organization & Architecture Lecture #19

Computer Organization & Architecture Lecture #19 Computer Organization & Architecture Lecture #19 Input/Output The computer system s I/O architecture is its interface to the outside world. This architecture is designed to provide a systematic means of

More information

Building High-Performance iscsi SAN Configurations. An Alacritech and McDATA Technical Note

Building High-Performance iscsi SAN Configurations. An Alacritech and McDATA Technical Note Building High-Performance iscsi SAN Configurations An Alacritech and McDATA Technical Note Building High-Performance iscsi SAN Configurations An Alacritech and McDATA Technical Note Internet SCSI (iscsi)

More information

Tyche: An efficient Ethernet-based protocol for converged networked storage

Tyche: An efficient Ethernet-based protocol for converged networked storage Tyche: An efficient Ethernet-based protocol for converged networked storage Pilar González-Férez and Angelos Bilas 30 th International Conference on Massive Storage Systems and Technology MSST 2014 June

More information

Comparison of Web Server Architectures: a Measurement Study

Comparison of Web Server Architectures: a Measurement Study Comparison of Web Server Architectures: a Measurement Study Enrico Gregori, IIT-CNR, enrico.gregori@iit.cnr.it Joint work with Marina Buzzi, Marco Conti and Davide Pagnin Workshop Qualità del Servizio

More information

TCP Offload Engines. As network interconnect speeds advance to Gigabit. Introduction to

TCP Offload Engines. As network interconnect speeds advance to Gigabit. Introduction to Introduction to TCP Offload Engines By implementing a TCP Offload Engine (TOE) in high-speed computing environments, administrators can help relieve network bottlenecks and improve application performance.

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Laura Knapp WW Business Consultant Laurak@aesclever.com Applied Expert Systems, Inc. 2011 1 Background

More information

Cisco Integrated Services Routers Performance Overview

Cisco Integrated Services Routers Performance Overview Integrated Services Routers Performance Overview What You Will Learn The Integrated Services Routers Generation 2 (ISR G2) provide a robust platform for delivering WAN services, unified communications,

More information

基 於 SDN 與 可 程 式 化 硬 體 架 構 之 雲 端 網 路 系 統 交 換 器

基 於 SDN 與 可 程 式 化 硬 體 架 構 之 雲 端 網 路 系 統 交 換 器 基 於 SDN 與 可 程 式 化 硬 體 架 構 之 雲 端 網 路 系 統 交 換 器 楊 竹 星 教 授 國 立 成 功 大 學 電 機 工 程 學 系 Outline Introduction OpenFlow NetFPGA OpenFlow Switch on NetFPGA Development Cases Conclusion 2 Introduction With the proposal

More information

EVALUATING THE NETWORKING PERFORMANCE OF LINUX-BASED HOME ROUTER PLATFORMS FOR MULTIMEDIA SERVICES. Ingo Kofler, Robert Kuschnig, Hermann Hellwagner

EVALUATING THE NETWORKING PERFORMANCE OF LINUX-BASED HOME ROUTER PLATFORMS FOR MULTIMEDIA SERVICES. Ingo Kofler, Robert Kuschnig, Hermann Hellwagner EVALUATING THE NETWORKING PERFORMANCE OF LINUX-BASED HOME ROUTER PLATFORMS FOR MULTIMEDIA SERVICES Ingo Kofler, Robert Kuschnig, Hermann Hellwagner Institute of Information Technology (ITEC) Alpen-Adria-Universität

More information

FlexPath Network Processor

FlexPath Network Processor FlexPath Network Processor Rainer Ohlendorf Thomas Wild Andreas Herkersdorf Prof. Dr. Andreas Herkersdorf Arcisstraße 21 80290 München http://www.lis.ei.tum.de Agenda FlexPath Introduction Work Packages

More information

Presentation of Diagnosing performance overheads in the Xen virtual machine environment

Presentation of Diagnosing performance overheads in the Xen virtual machine environment Presentation of Diagnosing performance overheads in the Xen virtual machine environment September 26, 2005 Framework Using to fix the Network Anomaly Xen Network Performance Test Using Outline 1 Introduction

More information

Software Datapath Acceleration for Stateless Packet Processing

Software Datapath Acceleration for Stateless Packet Processing June 22, 2010 Software Datapath Acceleration for Stateless Packet Processing FTF-NET-F0817 Ravi Malhotra Software Architect Reg. U.S. Pat. & Tm. Off. BeeKit, BeeStack, CoreNet, the Energy Efficient Solutions

More information

Leased Line + Remote Dial-in connectivity

Leased Line + Remote Dial-in connectivity Leased Line + Remote Dial-in connectivity Client: One of the TELCO offices in a Southern state. The customer wanted to establish WAN Connectivity between central location and 10 remote locations. The customer

More information

pco.interface GigE & USB Installation Guide

pco.interface GigE & USB Installation Guide pco.interface GigE & USB Installation Guide In this manual you find installation instructions for the GigE Vision and USB2.0 interface on Microsoft Windows platforms. Target Audience: This camera is designed

More information

Datacenter Operating Systems

Datacenter Operating Systems Datacenter Operating Systems CSE451 Simon Peter With thanks to Timothy Roscoe (ETH Zurich) Autumn 2015 This Lecture What s a datacenter Why datacenters Types of datacenters Hyperscale datacenters Major

More information

Boosting the Performance of PC-based Software Routers with FPGA-enhanced Network Interface Cards

Boosting the Performance of PC-based Software Routers with FPGA-enhanced Network Interface Cards Boosting the Performance of PC-based Software Routers with FPGA-enhanced Network Interface Cards Andrea Bianco, Robert Birke, Gianluca Botto, Marcello Chiaberge, Jorge M. Finochietto, Giulio Galante, Marco

More information

Performance Modeling and Analysis of a Database Server with Write-Heavy Workload

Performance Modeling and Analysis of a Database Server with Write-Heavy Workload Performance Modeling and Analysis of a Database Server with Write-Heavy Workload Manfred Dellkrantz, Maria Kihl 2, and Anders Robertsson Department of Automatic Control, Lund University 2 Department of

More information

Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University

Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University Napatech - Sharkfest 2009 1 Presentation Overview About Napatech

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

Enabling Technologies for Distributed and Cloud Computing

Enabling Technologies for Distributed and Cloud Computing Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading

More information

Configuring IPS High Bandwidth Using EtherChannel Load Balancing

Configuring IPS High Bandwidth Using EtherChannel Load Balancing Configuring IPS High Bandwidth Using EtherChannel Load Balancing This guide helps you to understand and deploy the high bandwidth features available with IPS v5.1 when used in conjunction with the EtherChannel

More information

The proliferation of the raw processing

The proliferation of the raw processing TECHNOLOGY CONNECTED Advances with System Area Network Speeds Data Transfer between Servers with A new network switch technology is targeted to answer the phenomenal demands on intercommunication transfer

More information