RFC 2544 Performance Evaluation for a Linux Based Open Router


Raffaele Bolla, Roberto Bruschi
DIST - Department of Communications, Computer and Systems Science, University of Genoa, Via Opera Pia 13, Genova, Italy
{raffaele.bolla, roberto.bruschi}@unige.it

Abstract - Nowadays, networking equipment is built on decentralized architectures that often include special-purpose hardware elements. These elements considerably improve performance on one hand, while on the other they limit flexibility: it is very difficult both to obtain details about internal operations and to perform any intervention more complex than a configuration of parameters. However, the experimental nature of the Internet and its diffusion in many contexts sometimes suggest a different approach. This need is most evident inside the scientific community, which often encounters many difficulties in carrying out experiments. Recent technological advances offer a good chance to do something really effective in the field of open Internet equipment, also called Open Routers (ORs). Starting from these considerations, several initiatives have been launched in the last few years to investigate the OR and related issues. Despite these activities, large and interesting areas still require deeper investigation. This work tries to give a contribution by reporting the results of an in-depth activity of optimization and testing carried out on a PC Open Router architecture based on Linux software and COTS hardware. The main target of this paper is the forwarding performance evaluation of different Linux-based OR software architectures. This analysis has been performed with both external (throughput and latency) and internal (profiling) measurements. In particular, as far as the external measurements are concerned, a set of RFC 2544 compliant tests has been proposed and analyzed.

Keywords - Open Router; RFC 2544; IP forwarding.

I. INTRODUCTION

The Internet technology has been developed in an open environment; all Internet related protocols, architectures and structures are publicly created and described. For this reason, in principle, everyone can easily develop a piece of Internet equipment (e.g., a router). On the contrary, and in some sense quite surprisingly, most professional equipment is realized in a very closed way: it is very difficult both to obtain details about internal operations and to perform any kind of intervention more complex than a parametric configuration. Generally speaking this does not appear so strange; it is a clear attempt to protect the industrial investment. But sometimes the experimental nature of the Internet and its diffusion in many contexts suggest a different approach. This need is most evident inside the scientific community, which often finds many difficulties in realizing experiments, test-beds and trials for the evaluation of new functionalities, protocols and control mechanisms. The commercial context also frequently asks for a more open approach, like the one suggested by the Open Source philosophy for software, especially in situations where network functions must be embedded in products whose main aim is not exactly basic network capabilities. Today, recent technology advances give a good chance to do something really effective in the field of open Internet equipment, sometimes called Open Routers (ORs).
This possibility comes, on the software side, from Open Source Operating Systems (OSs) like Linux and FreeBSD (which have sophisticated and complete networking capabilities), and on the hardware side from COTS/PC components (whose performance keeps increasing while their cost keeps decreasing). The attractiveness of the OR solution can be summarized in multi-vendor availability, low cost and continuous update/evolution of its basic parts. Starting from these considerations, several initiatives have been launched in the last few years to investigate the Open Router and related issues. In the software area, one of the most important initiatives is the Click Modular Router Project [1-3] from MIT, which proposes an effective solution for the data plane. In the control plane area two important projects can be cited: Zebra [4] with its evolution Quagga [5], and Xorp [6, 7]. Apart from custom developments, some standard Open Source OSs can also give very effective support to an OR realization; the most relevant OSs in this sense are Linux [8-10] and FreeBSD [11]. Other activities are focused on hardware: [12] and [13] propose a router architecture based on a PC cluster, and [14] reports some performance results (in packet transmission and reception) obtained with a PC Linux-based test-bed. Some evaluations have also been carried out on network boards, see for example [15]. Despite these activities, large and interesting areas still require deeper investigation: the identification of the most appropriate HW structure, the comparison of different SW solutions (specific Linux kernel versions, FreeBSD, Click, etc.), the identification of the best software configurations with an indication of the most significant parameters, the accurate characterization of open node performance, and the identification of SW and HW bottlenecks. This work tries to give a contribution to the investigation of the above aspects by reporting the results of an

in-depth activity of optimization and testing carried out on an OR architecture based on Linux software, performed within our participation in the EURO project [16]. We have focused our effort on testing the packet forwarding functionality. Our main objectives have been the performance evaluation of an optimized OR, with both external (throughput) and internal (profiling) measurements. Moreover, specific tests have been carried out in an IP QoS aware context, to verify the effectiveness of flow differentiation and the impact of classification and scheduling functionalities on performance. To this end, we have identified a high-end reference PC based hardware architecture and the Linux 2.6 kernel for the software data plane, we have optimized this OR structure, defined a test environment and finally carried out a complete series of tests with an accurate evaluation of the role of the software modules in determining the performance limits. With the QoS tests we have verified the presence of serious problems in assuring flow separation, but we have also experimented with a feasible solution to this problem, obtained by using advanced capabilities of some modern network boards. The remainder of the paper is organized as follows.

II. HARDWARE AND SOFTWARE ARCHITECTURE

To define the OR reference architecture, we have established some main criteria and used them to select a priori a set of basic elements. The objective has been to obtain a high-end node base structure, able to support top performance with respect to IP packet forwarding and control plane elaboration. The decision process, the criteria and the final selection results are described in some detail in the following, separately for hardware and software elements.

A. The hardware architecture

The PC architecture is a general-purpose one and is not specifically optimized for network operations. This means that, in principle, it cannot reach the same performance level as custom high-end network equipment, which generally uses dedicated HW elements to handle and to parallelize the most critical operations. This characteristic has more impact on data plane performance, where custom equipment usually utilizes dedicated ASICs, FPGAs, Network Processors and specific internal busses to provide a high level of parallelism in packet elaboration and exchange. On the other hand, COTS hardware can guarantee two very important features: cheapness, and the fast and continuous evolution of many of its components. Moreover, the performance gap might not be so large, and in any case it is more than justified by the cost difference. During networking operations, the PC internal data path has to use a centralized I/O structure composed of: the I/O bus, the memory channel (both used by DMA to transfer data from the network interfaces to RAM and vice versa) and the Front Side Bus (FSB) (used by the CPU, through the memory channel, to access the RAM during packet elaboration). It is evident that the bandwidth of these busses and the PC computational capacity are the two most critical hardware elements in determining the maximum performance, in terms of both the peak passing bandwidth (in Mbps) and the maximum number of forwarded packets per second. So the selection criteria have been very fast internal busses and a dual-CPU system with high integer computational power.

Figure 1. Scheme of the packet path in a PC hardware architecture.
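To give a rough idea of the orders of magnitude involved, the following small C program is an illustrative back-of-the-envelope calculation (not part of the test software described in this paper). It assumes standard Ethernet framing overheads (8 bytes of preamble plus 12 bytes of inter-frame gap) and the simplification that each forwarded packet crosses the memory channel twice, once written by the Rx DMA and once read back by the Tx DMA, in line with the data path sketched in Fig. 1.

    /* Back-of-the-envelope estimate of the per-packet load that wire-rate
     * Gigabit Ethernet forwarding puts on the PC busses. Illustrative only:
     * frame sizes and the "two DMA crossings per packet" assumption are
     * simplifications of the architecture described above. */
    #include <stdio.h>

    int main(void)
    {
        const double link_bps = 1e9;               /* Gigabit Ethernet line rate */
        const int overhead = 8 + 12;               /* preamble + inter-frame gap */
        int frame_sizes[] = { 64, 128, 256, 512, 1024, 1518 }; /* bytes, with FCS */

        for (unsigned i = 0; i < sizeof(frame_sizes) / sizeof(frame_sizes[0]); i++) {
            int wire_bytes = frame_sizes[i] + overhead;
            double pps = link_bps / (wire_bytes * 8.0);
            /* each packet is written to RAM by the Rx DMA and read back by
             * the Tx DMA: two crossings of the memory channel              */
            double mem_gbps = 2.0 * pps * frame_sizes[i] * 8.0 / 1e9;
            printf("frame %4d B: %8.0f pkt/s, ~%.2f Gbit/s of memory traffic\n",
                   frame_sizes[i], pps, mem_gbps);
        }
        return 0;
    }

For minimum-sized frames this yields roughly 1.49 million packets per second per saturated Gigabit link, which is the per-packet pressure on the CPU and busses that the rest of the paper evaluates.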
With this goal, we have chosen a Supermicro X5DL8-GG mainboard with a ServerWorks GC-LE chipset, whose structure is shown in Fig. 2. This chipset can support a dual-Xeon system with a dual memory channel and a PCI-X bus at 133 MHz with 64 parallel bits. The Xeon processors used have a 2.4 GHz clock and a 512 KB cache. The memory bandwidth supported by this chipset matches the system bus speed, but the most important point is that the memory is 2-way interleaved, which assures high performance and low average access latencies. The bus that connects the North Bridge to the PCI bridges, namely the IMB, has more bandwidth (more than 25 Gbps) than the maximum combined bandwidth of the two PCI-X busses (16 Gbps) on each I/O bridge.

Figure 2. Scheme of the Supermicro X5DL8-GG mainboard.

Network interfaces are another critical element in the system, as they can heavily condition the PC Router performance. As reported in [15], the network adapters on the market have different levels of maximum performance and different configurability. In this respect, we have selected two types of OR adapters with different characteristics. As the first adapter kind, we have decided to use a high performance Gigabit Ethernet interface, namely the Intel PRO/1000 XT Server, which is equipped with a PCI-X controller supporting the 133 MHz frequency and a wide configuration range for many parameters like, for example, transmission and receive buffer lengths, maximum interrupt rates and other important features [17]. As the second network adapter type, we have chosen a D-Link DFE-580TX [18], a network card equipped with four Fast Ethernet interfaces and a PCI 2.1 controller (i.e., 32 parallel bits at a 66 MHz clock frequency). Although a Fast Ethernet adapter cannot considerably influence the OR performance, since it works at lower and less critical speeds than the Gigabit ones, the choice of a quad port adapter allows

us to analyze the OR behaviour in the presence of a high number of interfaces.

B. The software architecture

The software architecture of an OR has to provide many different functionalities: from the ones directly involved in the packet forwarding process to the ones needed for control, dynamic configuration and monitoring. In particular, we have chosen to study and analyze a Linux based OR framework, as it is one of the open source OSs with a large and sophisticated kernel-integrated network support, it is equipped with numerous GNU software applications, and it has been selected in recent years as the framework for a large part of networking research projects. Concerning the Linux OR architecture, as outlined in [1] and in [3], while all the forwarding functions are realized inside the Linux kernel, the large part of the control and monitoring operations runs as daemons/applications in user mode. Thus, we have to point out that, unlike in most commercial network equipment, the forwarding functionalities and the control ones have to share the CPUs of the system. In fact, high-end commercial network equipment in particular provides separate computational resources for these processes: the forwarding is managed by a switching fabric, a switching matrix often realized with ad hoc hardware elements (ASICs, FPGAs and Network Processors), while all the control functionalities are executed by one or more separate processors. [19] reports a detailed description of how the resource sharing between the control plane and the forwarding process can affect the overall performance in different OR configurations (e.g., SMP kernel, single processor kernel, etc.).

Let us now go into the details of the Linux OR architecture: as shown in Fig. 3 and previously sketched, the control plane functionalities run in user space, while the forwarding process is entirely realized inside the kernel. There are two main sets of functions to consider: the packet forwarding supporting functions (the real switching operation) and the control plane supporting functions (the signalling protocols: routing protocols, control protocols, etc.). For the control plane, one of the several open source tools that work as applications or daemons in user space, such as Zebra [4], Quagga [5] and Xorp [6, 7], can be used. The critical element for IP forwarding is the kernel, where all the link, network and transport layer operations are realized (1) (see Fig. 3). During the last years, the networking support integrated in the Linux kernel has undergone many structural and refining developments, mostly concerning the packet reception mechanism: it has rapidly evolved from a simple interrupt architecture (adopted up to kernel version 2.2), through a SW interrupt receive mechanism (called SoftNet) adopted in the following kernel versions, up to an interrupt moderation one (called NAPI, New API) adopted in the most recent kernels.

(1) To be more precise, some layer 2 operations are directly realized in the network interface drivers.

Figure 3. Block diagram of the SW architecture of the Linux PC-Router.

The SoftNet architecture, even if it maintains an interrupt based structure, improves the performance because it lowers the computational overhead of context switching by delaying the elaboration of the received packets through interrupt scheduling. In spite of these improvements, this architecture has proved to be inadequate at medium/high packet rates.
In fact, in the presence of a high ingress packet rate, the well-known interrupt livelock phenomenon (clearly described in [21]) causes heavy performance deterioration. The NAPI architecture has been explicitly created to increase the system scalability, as it can handle network interface requests with an interrupt moderation mechanism that allows adaptively switching from a classical interrupt management of the network interfaces to a polling one. For these reasons, we have chosen to use a last generation Linux kernel, more specifically a 2.6 version, which, besides the NAPI support, has other interesting new features (i.e., kernel preemption and an O(1) complexity scheduler). The forwarding mechanism of all Linux kernels is fundamentally composed of a chain of three different modules: a reception API that handles packet reception (NAPI), a module that carries out the IP layer elaboration and, finally, a transmission API that manages the forwarding operations towards the egress network interfaces.

Starting the analysis of the network code structure in 2.6 kernels, we have to note that all the kernel code follows a zero-copy approach [22]: to avoid unnecessary and onerous memory transfer operations, the packets are left in the memory locations used by the DMA-engines of the ingress network interfaces, and each subsequent operation is performed by using a sort of pointer to the packet and to its key fields, called a descriptor or sk_buff. These descriptors are effectively composed of pointers to the different fields of the headers contained in the associated packets. A descriptor is immediately allocated and associated with each received packet in the sub-IP layer, subsequently used in every networking operation, and finally deallocated when the network interface signals the successful transmission. Concerning the Symmetric Multi-Processor (SMP) support, in NAPI kernels, to reduce the performance deterioration due to CPU concurrency, the management of each network interface is assigned to a single CPU for both the transmission and reception functionalities. However, as outlined in [19], this interface-CPU assignment can be dynamically modified: for example, the cpu_affinity module tries to adaptively distribute the traffic load coming from the interfaces to all the CPUs in the system. In short, in the standard forwarding operations the received packets are transferred by the DMA-engine of the interface card, in a way transparent to the OS, and placed in memory areas reserved to the kernel, where the packets wait for the kernel layer elaboration. When the kernel handles the packets, it associates them with descriptors and, then,

it elaborates them one at a time. During the IP layer elaboration an egress interface is selected for each packet, and its descriptor is placed in a buffer, called the device egress qdisc, waiting for transmission. Let us now analyze in more detail the architecture of the three fundamental modules of the networking code of 2.6 kernels: the NAPI, the IP processing and the transmission (Tx) API.

1) NAPI

As previously written, NAPI is the new Linux reception API, and it has been proposed to achieve the scalability level needed to support Gigabit network interfaces. It implements an adaptive mechanism that, by using interrupt mitigation, behaves like the classical SoftNet API at low input rates, while, at higher rates, it works like a polling mechanism. Let us analyze the NAPI structure in depth. At the first received packet placed in the ring buffer, the network interface generates an interrupt. When a CPU receives the interrupt, the driver's interrupt routine calls netif_rx_schedule, which:
- records the identifier of the interface that has generated the interrupt in a buffer, called the poll_list;
- schedules the NET_RX_SOFTIRQ software interrupt, instead of immediately serving the requesting device, to delay the context switching;
- disables interrupt generation, if the network card hardware supports this operation.
All the subsequent packets, which are received before the kernel scheduler serves the SW interrupt, do not cause any interrupt to the OS, but are moved to a reserved memory location by the DMA-engine. To know the address of the reserved memory location, the DMA-engine uses a list of available descriptors. Such a list of descriptors is organized like a ring buffer, called the rx_ring, in the kernel reserved memory. If there is not enough space or there are no available descriptors, the packets are dropped. The scheduler wakes the softirq handler (net_rx_action), which scans the poll_list and, for each contained device:
- calls dev->poll, which processes, by using some driver functions, a number of packets equal to the minimum between the number of packets in the rx_ring and the quota parameter; for each received packet the netif_receive_skb function is invoked;
- if the rx_ring of the device becomes empty, the identifier is cleared from the poll_list, and the interrupt generation of the card is re-enabled;
- if the rx_ring of the device does not become empty, the identifier is kept in the poll_list.
The procedure stops when there are no more device identifiers in the poll_list or when a number of packets equal to the quota parameter has been processed. Note that the quota parameter sets the maximum number of packets that can be processed before activating other tasks.

2) IP processing

The IP network layer code is generally composed of four important segments, which manage respectively the reception, the routing, the forwarding and the transmission of datagrams. More in particular, the reception process is realized by two functions: ip_rcv, which checks the datagram header consistency (i.e., checksum), and ip_rcv_finish, which decides whether the IP destination is valid and where to send the datagram (to local delivery or to the forwarding chain).
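Before moving on to the routing and transmission stages, the polling mechanism described in point 1) above can be made more concrete with the following self-contained C program, which models it in user space. It is only an illustrative model: the rx_ring occupancies are simple counters, there is no real NIC, interrupt or sk_buff, and the identifiers merely mirror the kernel's poll_list, rx_ring and quota concepts.

    /* User-space model of the NAPI polling loop described above.
     * It mimics net_rx_action: devices enter a poll_list when "interrupted",
     * each poll round serves at most min(packets in rx_ring, quota) packets
     * per device, and a device leaves the list only when its rx_ring is
     * empty. Purely illustrative, not kernel code. */
    #include <stdio.h>

    #define NDEV   3
    #define QUOTA  4          /* per-device packets served per poll round */

    struct device {
        const char *name;
        int rx_ring;          /* packets currently waiting in the ring    */
        int in_poll_list;     /* 1 = interrupts disabled, being polled    */
    };

    static void poll_one(struct device *dev)
    {
        int served = dev->rx_ring < QUOTA ? dev->rx_ring : QUOTA;

        dev->rx_ring -= served;              /* netif_receive_skb() x served */
        printf("  %s: served %d, %d left in rx_ring\n",
               dev->name, served, dev->rx_ring);

        if (dev->rx_ring == 0) {
            dev->in_poll_list = 0;           /* re-enable interrupts         */
            printf("  %s: rx_ring empty, removed from poll_list\n", dev->name);
        }
    }

    int main(void)
    {
        struct device devs[NDEV] = {
            { "eth0", 10, 1 },   /* each device has already raised its first */
            { "eth1",  3, 1 },   /* interrupt and sits in the poll_list      */
            { "eth2",  6, 1 },
        };
        int round = 0, busy = 1;

        while (busy) {                        /* one net_rx_action run        */
            busy = 0;
            printf("poll round %d\n", ++round);
            for (int i = 0; i < NDEV; i++)
                if (devs[i].in_poll_list) {
                    poll_one(&devs[i]);
                    busy |= devs[i].in_poll_list;
                }
        }
        return 0;
    }

Running the model shows the interface with the fullest ring staying in the poll_list for several rounds, while lightly loaded interfaces are served once and return to interrupt mode, which is exactly the adaptive behaviour described above.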
If the datagram is not locally delivered, it is elaborated by the routing module, which is structured as a two-level address lookup: first a fast lookup is performed in ip_route_input by using a cache of the most frequently used destination addresses and, if the cache misses the requested destination IP, the lookup continues with the ip_route_input_slow routine, which uses the entire IP routing table. Afterwards, ip_forward decreases the TTL value by one. The forwarding action is completed by the ip_forward_finish function, which, if there are IP options, invokes ip_forward_options and then passes the sk_buff directly to the IP transmission module. The first function of the IP transmission module (ip_output) handles datagram fragmentation if necessary; then it calls the ip_output_finish function, which concludes the IP layer elaboration.

3) Tx API

Another important part of the kernel networking code is the transmission API. It is quite simple and, unlike the Rx one, has not undergone any great revision in the past years. The Tx API is based, in large part, on driver functions that signal to the network interfaces the presence of packets to transmit, or that receive messages about their successful elaboration. In particular, the Tx API elaboration starts by calling, for each packet coming from the IP layer, dev_queue_xmit: this function enqueues the descriptor in a buffer associated with the egress device, called the qdisc. Then, when there is enough space in the device tx_ring, the descriptors in the egress qdisc are served by the qdisc_restart function. For each served descriptor, the virtual method hard_start_xmit is called: this method is implemented in the device driver (e.g., for Intel Gigabit cards it is e1000_start_xmit). This function adds the descriptor to the tx_ring and signals to the network device that there are data to transfer. Then, the DMA-engine of the egress card transfers the packets from the kernel reserved memory to its hardware ring buffer and transmits them. When the packet transmission has been successfully completed, the network interface generates a hardware interrupt, whose routine schedules a software interrupt with the NET_TX_SOFTIRQ flag. Then, the scheduler wakes up the net_tx_action handler, which moves the descriptors of the transmitted packets to a queue, called the completion_queue, where the descriptors are periodically deallocated.

III. SOFTWARE PERFORMANCE TUNING

As previously sketched and shown in Fig. 4, the whole networking kernel architecture is quite complex and presents several aspects and many parameters that can be tuned for system optimization. As reported in [23], this tuning is very important for the final performance. Some of the optimal parameter values can be identified by logical considerations,

but most of them have to be determined empirically, since their optimal value cannot be easily derived from the software structure and since they also depend on the hardware components. So our tuning has been carried out first by identifying the critical parameters on which to operate and, then, by finding the most convenient values through both logical considerations and experimental measurements. The whole kernel architecture includes only three packet buffering points (see Fig. 4): the Rx ring buffers, the egress qdisc queues/Tx ring buffers and the completion queue. While this last queue has proved not to be critical for performance optimization purposes (it contains only descriptors of already sent packets waiting for memory deallocation), finding an optimal configuration of the other two queues is very important to maximize the performance. In fact, to avoid useless CPU waste it is preferable to minimize the losses at the egress qdisc buffers, where the packets have already been processed by the IP layer, and to keep these packet losses at the Rx ring buffers, before processing: this can be easily obtained, for example, by increasing the qdisc buffer length to a value of about 20,000 descriptors. Many network adapter drivers, for example, allow the dimensioning of the ring buffers and of the maximum interrupt rates. Both these parameters have a great influence on the NAPI performance. In fact, if we consider a medium-high traffic load condition, NAPI practically works by polling the network interfaces, which insert themselves in the poll list with the interrupt generated by the first received packet; all the packets received before the interface's turn in the polling sequence are enqueued in the hardware ring buffers. Thus, while it is clearly desirable not to limit the interrupt rate of network adapters in NAPI kernels, a large ring buffer permits reducing the packet drops in the presence of bursty traffic. Another interesting parameter to tune is the quota value, which fixes the number of packets that each device can elaborate at every polling cycle. It is also possible to act on some specific 2.6 kernel parameters by customizing them to the specific networking usage: for example, the profiling results (Section VI) show that the kernel scheduler operations uselessly employ about 4-5% of the overall computational resources. To avoid this CPU time waste, the OS scheduler clock frequency can be decreased: by reducing its value to 100 Hz, the forwarding rate improves by about 20 Kpackets per second (see Section VI). The rationalization of memory management is another important aspect: as highlighted in the profiling results of Section VI, a considerable part of the available resources is used in the allocation and deallocation of packet descriptors (memory management functions). [24] proposes a patch that allows recycling the descriptors of successfully sent packets: the basic idea is to save CPU resources during the NAPI receive operations by reusing the packet descriptors found in the completion queue. The use of this patch can further improve the performance.
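The idea behind the descriptor recycling patch can be illustrated with the following self-contained C sketch: instead of freeing every descriptor after transmission and allocating a fresh one for every received packet, descriptors parked in the completion queue are reused. The structure and function names are illustrative only and do not reproduce the actual patch in [24], which operates on kernel sk_buffs.

    /* Illustrative model of descriptor recycling: a small free list fed by
     * the completion queue replaces per-packet malloc()/free(). The names
     * (descriptor, completion_queue) mirror the text; the real patch in [24]
     * is more elaborate than this sketch. */
    #include <stdlib.h>
    #include <stdio.h>

    struct descriptor {
        struct descriptor *next;
        void *packet;               /* would point into DMA-able memory */
    };

    static struct descriptor *completion_queue;   /* recycled descriptors */

    /* Called on the receive path: prefer a recycled descriptor. */
    static struct descriptor *alloc_descriptor(void)
    {
        if (completion_queue) {
            struct descriptor *d = completion_queue;
            completion_queue = d->next;
            return d;                              /* no malloc() needed   */
        }
        return malloc(sizeof(struct descriptor));
    }

    /* Called when the NIC signals a successful transmission:
     * park the descriptor for reuse instead of freeing it. */
    static void release_descriptor(struct descriptor *d)
    {
        d->next = completion_queue;
        completion_queue = d;
    }

    int main(void)
    {
        /* forward a few "packets": after the first one, every descriptor
         * is served from the recycle list                                 */
        for (int i = 0; i < 5; i++) {
            struct descriptor *d = alloc_descriptor();
            printf("packet %d uses descriptor %p\n", i, (void *)d);
            release_descriptor(d);
        }
        while (completion_queue) {                 /* final cleanup        */
            struct descriptor *d = completion_queue;
            completion_queue = d->next;
            free(d);
        }
        return 0;
    }

In steady state the allocator almost never reaches malloc(), which is precisely the CPU saving that the profiling results in Section VI attribute to the patch.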
Summarizing, our optimized NAPI kernel image includes the descriptor recycling patch and the k2 version of the e1000 driver, and it has been configured with the following optimized values: I) the Rx and Tx ring buffers have been set to the maximum value of 4096 descriptors; II) the Rx interrupt generation has not been limited; III) the qdisc size for all the adapters has been set to 20,000 descriptors; IV) the NAPI quota parameter has been set to 23 descriptors; V) the scheduler clock frequency has been fixed to 100 Hz.

Figure 4. Detailed scheme of the forwarding operations in the 2.6 NAPI kernel.
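Since several of the values above (the Tx ring and qdisc sizes) dimension the transmission path described in Section II, the following self-contained C model recaps how that path buffers packets: descriptors flow from the egress qdisc to the tx_ring and, after transmission, to the completion queue. The queue lengths are toy values chosen to show an overflow quickly, not the tuned ones of Section III, and the functions only echo the kernel names.

    /* Illustrative user-space model of the Linux transmission path:
     * dev_queue_xmit() enqueues descriptors in the egress qdisc,
     * qdisc_restart() moves them to the tx_ring while space is available,
     * and completed transmissions park them in the completion queue. */
    #include <stdio.h>

    #define QDISC_LEN   4       /* egress qdisc capacity (toy value)       */
    #define TX_RING_LEN 4       /* hardware tx_ring capacity (toy value)   */

    static int qdisc, tx_ring, completion_queue, dropped;

    static void dev_queue_xmit(void)            /* packet arrives from IP layer */
    {
        if (qdisc < QDISC_LEN)
            qdisc++;
        else
            dropped++;                          /* qdisc full: packet is lost   */
    }

    static void qdisc_restart(void)             /* fill the tx_ring from qdisc  */
    {
        while (qdisc > 0 && tx_ring < TX_RING_LEN) {
            qdisc--;
            tx_ring++;                          /* hard_start_xmit() equivalent */
        }
    }

    static void tx_interrupt(int sent)          /* NIC reports transmissions    */
    {
        while (sent-- > 0 && tx_ring > 0) {
            tx_ring--;
            completion_queue++;                 /* descriptor awaits recycling  */
        }
    }

    int main(void)
    {
        /* offer two packets per step while the NIC completes only one:
         * the qdisc absorbs the excess until it overflows and drops start */
        for (int step = 0; step < 10; step++) {
            dev_queue_xmit();
            dev_queue_xmit();
            qdisc_restart();
            tx_interrupt(1);
            printf("step %d: qdisc=%d tx_ring=%d completed=%d dropped=%d\n",
                   step, qdisc, tx_ring, completion_queue, dropped);
        }
        return 0;
    }

With the toy qdisc length the drops appear after a few steps; with the 20,000-descriptor qdisc adopted above, the same offered excess would instead be absorbed, which is the rationale given earlier for moving losses away from the egress buffers.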

IV. BENCHMARKING TOOLS

A reliable performance evaluation of network equipment that forwards packets at Gigabit/s speed is not a simple task, and it generally requires an ad-hoc hardware platform: it is quite well known that the common generation/measurement software tools [25-27] running on PCs, even if widely used, do not provide an adequate level of accuracy. In particular, in the presence of high speed interfaces, significant limitations arise (i.e., on the maximum sustainable rate, and high packet generation/measurement latency), due to the bottleneck of the PC internal busses and the lack of computational resources. The alternative is to use professional products, which are generally realized with ad hoc hardware and can therefore guarantee high performance (i.e., a precision of a few nanoseconds both in generation and in measurement); besides being quite expensive, they cannot be completely and easily customized or modified, sometimes resulting in less flexibility than researchers need. Starting from these considerations, we have decided to build a traffic generator that combines the performance of ad hoc hardware elements with the flexibility of software based solutions. To achieve this result, we have decided to develop this traffic generator, namely PktGen, on custom highly programmable hardware, i.e., network processors [28]: in particular, we have decided to use a Radisys ENP-2611 evaluation board [29] that includes three Gigabit Ethernet interfaces and the Intel IXP 2400 network processor [30]. Analyzing the PktGen features thoroughly, we have to highlight that it has been designed not only to generate traffic at high speed with an adequate precision level, but also to generate a high number of different traffic profiles simultaneously. The latter can have very different (deterministic or stochastic) characteristics. In particular, PktGen can act independently for each traffic profile by tuning both the time parameters (i.e., interarrival times and burst lengths) and the packet templates (i.e., a set of specific rules to build the packets and to fill the header fields). A detailed description of the PktGen architecture can be found in [31]. In spite of the good performance level provided by this generation tool, to analyze the OR performance with the RFC 2544 tests [32, 33] we also need a measurement tool able to reliably estimate both the throughput and the latency provided by the System Under Test (SUT). Thus, to benchmark the OR forwarding performance and to test the PktGen capabilities, we have finally decided to use a professional equipment, namely the Agilent N2X Router Tester [34]. The latter allows obtaining throughput and latency measurements with very high availability and accuracy levels (i.e., the minimum guaranteed timestamp resolution is 10 ns). Moreover, with two dual Gigabit Ethernet test cards and one 16-port Fast Ethernet card at our disposal, it allows us to analyze the OR behaviour in the presence of a high number of heterogeneous interfaces. To better support the performance analysis and to identify the OR bottlenecks, we have also performed some internal measurements, obtained by using specific software tools (called profilers) placed inside the OR, which are able to trace the percentage of CPU utilization of each software module running on the node. Internal measurements are very useful for the identification of the architecture bottlenecks.
The problem is that many of these tools require a relevant computational effort, which perturbs the system performance and thus makes the results meaningless. In this respect, their correct choice is a strategic point. We have verified with many different tests that one of the best is Oprofile [35], an open source tool that realizes a continuous monitoring of the system dynamics with a frequent and quite regular sampling of the CPU hardware registers. Oprofile allows evaluating, in a very effective and deep way, the CPU utilization of both each software application and each single kernel function running in the system, with a very low computational overhead.

V. BENCHMARKING SCENARIO

To analyze the OR forwarding performance in depth, we have chosen to start by defining a reasonable set of benchmarking setups with an increasing level of complexity, and for each selected setup to apply some of the tests defined in [32]. In particular, we have chosen to perform these test activities by using both a core router configuration and an edge router one: the former is composed of a few high-speed (Gigabit Ethernet) network interfaces, while the latter is composed of a high-speed gateway interface and a high number of Fast Ethernet cards, which collect the traffic from the access networks. More in detail, we have performed our tests by using the following setups:
- Setup A (Figure 5): a single mono-directional flow crosses the OR from one Gigabit port to another;
- Setup B (Figure 6): two full duplex flows cross the OR, each one using a different pair of Gigabit ports;
- Setup C (Figure 7): the OR includes 4 Gigabit Ethernet ports and a full-meshed (and full-duplex) traffic matrix is applied;
- Setup D (Figure 8): the OR includes 1 Gigabit Ethernet port and 12 Fast Ethernet cards; the applied traffic matrix is full-meshed (and full-duplex).

Figure 5. Setup A
Figure 6. Setup B
Figure 7. Setup C
Figure 8. Setup D

The A, B and C setups, which obviously refer to a simple core router working configuration, are quite significant to understand the maximum performance obtainable by the

forwarding mechanism. Indeed, in such cases the OR has to handle the traffic from only a few network interfaces, reducing the computational overhead needed to manage a higher number of interfaces with respect to setup D. In particular, each OR forwarding benchmarking session is essentially composed of three test sets, namely throughput and latency, back-to-back burst length and packet loss rate. Note that all these tests have been performed by using different IP datagram sizes (i.e., 40, 64, 128, 256, 512, 1024 and 1500 bytes) and both CBR and bursty traffic flows. More in detail, we report a short description of all the applied forwarding tests:
1. Throughput and latency tests: this test set is performed by using CBR traffic flows, composed of fixed sized datagrams, to obtain:
   a. the maximum effective throughput, in Kpackets/s and as a percentage of the theoretical value, versus the IP datagram size;
   b. the average, maximum and minimum latencies versus the IP datagram size;
   where throughput and latency are interpreted as defined in [RFC1242].
2. Back-to-back tests: these tests are carried out by using bursty traffic flows and changing both the burst dimension (i.e., the number of packets composing the burst) and the datagram size. The main results for this kind of test are:
   a. the zero loss burst length versus the IP datagram size;
   b. the average, maximum and minimum latencies versus the size of the IP datagrams composing the burst;
   where the zero loss burst length is the maximum number of packets, transmitted with minimum inter-frame gaps, that the System Under Test (SUT) can handle without any loss.
3. Loss rate tests: this kind of test is carried out using CBR traffic flows with different offered loads and IP datagram sizes; the obtainable results can be summarized in:
   a. the throughput versus both the offered load and the IP datagram size.

VI. NUMERICAL RESULTS

In this Section some of the numerical results, obtained with the benchmarking techniques and setups described in the previous Sections, are reported. Moreover, we have decided to consider different Linux kernel configurations and the Click Modular Router as terms of comparison. In particular, we have decided to test the following versions of the Linux kernel:
- dual-processor standard kernel: a standard NAPI kernel version similar to the previous one but with SMP support;
- single-processor optimized kernel: a version based on the standard one with single processor support, which includes the descriptor recycling patch.
The driver parameter tuning for both the previously cited kernel versions includes dimensioning the receive ring buffers to 80 descriptors and the transmission ones to 256 descriptors; the receive and transmission interrupt delays have been set to zero, while the output qdisc buffers have been set to 20,000 descriptors and the scheduler clock to 100 Hz. For what concerns the Click Modular Router, we have used a single processor version mounted on a Linux kernel that includes the driver parameter tuning (both transmission and receive buffers have been set to 256 descriptors). Note that we have chosen not to take into account the SMP versions of the optimized Linux kernel and of the Click Modular Router, as these versions lack a minimum acceptable stability level.

A. Setup A numerical results

In the first benchmarking session, we have performed the RFC 2544 tests by using setup A (with reference to Figure 5) with both the single-processor optimized kernel and Click. As we can observe in Figs.
9, 10 and 11, which report the numerical results of the throughput and latency tests, neither software architecture can achieve the maximum theoretical throughput in the presence of small datagram sizes. As demonstrated by the profiling measurements reported in Fig. 12, obtained with the single processor optimized Linux kernel and with 64 byte datagrams, this effect is clearly caused by the computational CPU capacity, which limits the maximum packet forwarding rate of the Linux kernel to about 700 Kpackets per second. In fact, even if the CPU idle time goes to zero at an offered load of about 300 Kpackets/s, the CPU occupancies of all the most important function sets keep adapting their contributions up to 700 Kpackets/s; after this point their percentage contributions to the CPU utilization remain almost constant. More in particular, Fig. 12 shows that the computational weight of the memory management operations (like sk_buff allocations and de-allocations) is substantially limited, thanks to the descriptor recycling patch, to less than 1%. Both the Tx API (the most onerous operation set, which takes at minimum about 35% of the overall resources) and the interrupt management operations apparently have strange and somewhat similar behaviours: their CPU utilization level, after an initial growth, decreases as the input rate increases. This behaviour is mostly due to two different packet grouping effects in the Tx and Rx APIs: in particular, when the ingress packet rate rises, NAPI tends to moderate the Rx interrupt rate by changing its behaviour from an interrupt-like mechanism to a polling one (so we have a first reduction in the number of interrupts), while the Tx API, in the same condition, can better exploit the packet grouping mechanism by sending more packets at a time (and then the number of interrupts for successful transmission confirmations decreases). About the IP and Ethernet processing operations, we can note that their CPU percentage utilization increases almost linearly with the number of forwarded packets per second. Considerations similar to the previous ones also hold for Click: the performance limitation in the presence of short datagrams continues to be due to a computational bottleneck, but the simple Click packet receive API, based on a polling mechanism, allows it to achieve better

performance in terms of throughput by lowering the computational weight of the IRQ management and Rx API functions. For the same reasons, as shown in Figs. 10 and 11, obtained with the RFC 2544 throughput and latency test, the receive mechanism embedded in Click introduces higher packet latencies. In agreement with the previous results, the back-to-back tests, reported in Fig. 13 and Table I, show that the optimized Linux kernel and Click continue to suffer with small datagrams. In fact, while with 256 byte or larger datagrams the measured zero loss burst length is quite close to the maximum burst length used in the performed tests, it appears to be heavily limited in the presence of 40, 64 and, only for the Linux kernel, 128 byte packets. Apart from the single 128 byte case, where NAPI starts to suffer the computational bottleneck while Click continues to have a forwarding rate very close to the theoretical one, the Linux kernel provides better support to bursty traffic than Click, providing both higher zero loss burst lengths and lower associated latency times. In Fig. 14, the loss rate test results are finally reported.

Figure 9. Throughput and latency test, testbed A: effective throughput results for the single-processor optimized kernel and the Click Modular Router.
Figure 10. Throughput and latency test, testbed A: minimum and maximum latencies for both the single-processor optimized kernel and the Click Modular Router.
Figure 11. Throughput and latency test, testbed A: average latencies for both the single-processor optimized kernel and the Click Modular Router.
Figure 12. Profiling results of the optimized Linux kernel obtained with the testbed setup A.
Figure 13. Back-to-back test, testbed A: maximum zero loss burst lengths.

TABLE I. Back-to-back test, testbed A: latency values for both the single-processor optimized kernel and the Click Modular Router. [The table reports, for each packet length in bytes, the minimum, average and maximum latencies in microseconds for the optimized kernel and for Click; the numerical values are not reproduced in this transcription.]

Figure 14. Loss rate test, testbed A: maximum throughput versus both the offered load and the IP datagram size.

B. Setup B numerical results

In the second benchmarking session we have analyzed the performance achieved by the optimized single processor Linux kernel, by the standard SMP Linux kernel and by the Click Modular Router with the testbed setup B (with reference to Fig. 6). Fig. 15, which reports the maximum effective throughput in terms of forwarded packets per second for a single router interface, shows that, in the presence of short packets, none of the three software architectures provides a performance level close to the theoretical one. More in particular, while the best throughput values are achieved by Click, the SMP kernel seems to provide better forwarding rates with respect to the optimized kernel. In fact, as outlined in [19], if no explicit CPU-interface bindings are present, the SMP kernel processes the received packets (using, if possible, the same CPU for the whole packet elaboration) trying to dynamically share the computational load among the CPUs. Thus, with the considered setup, the computational load sharing tends to manage the two interfaces to which a traffic pair is applied with a single fixed CPU, fully processing each received packet with only one CPU and thus avoiding any memory concurrency problems. Figs. 16 and 17 report the minimum, average and maximum latency values for different datagram sizes obtained with all three considered software architectures. In particular, we can note that both Linux kernels, which provide very similar results in this case, assure minimum latencies lower than Click, while the latter provides better average and maximum latency values for short datagrams.

Figure 15. Throughput and latency test, testbed setup B: effective throughput for both the single-processor optimized kernel and the Click Modular Router.
Figure 16. Throughput and latency test, testbed B: minimum and maximum latencies for both the single-processor optimized kernel and the Click Modular Router.
Figure 17. Throughput and latency test, testbed B: average latencies for both the single-processor optimized kernel and the Click Modular Router.

The back-to-back results, reported in Fig. 18 and Table II, show that the performance level of all the analyzed architectures is almost comparable in terms of zero-loss burst length, while for what concerns the latencies the Linux kernels provide better values with respect to Click.

Figure 18. Back-to-back test, testbed B: maximum zero loss burst lengths.

By analyzing Fig. 19, which reports the loss rate numerical results, we can observe that the performance obtained with Click and the SMP kernel is better, especially for small datagrams, than the one obtained with the optimized single processor kernel. Moreover, Fig. 19 also shows that none of the three OR software architectures achieves the full Gigabit/s speed even for large datagrams, with a maximum forwarding rate of about 650 Mbps per interface. Note that, to improve the readability of the obtained results, we have decided to report in Fig. 19 and in all the following loss rate tests only the OR behavior with the minimum and

the maximum datagram sizes, since they are respectively the performance lower and upper bounds.

Figure 19. Loss rate test, testbed B: maximum throughput versus both the offered load and the IP datagram size.

TABLE II. Back-to-back test, testbed B: latency values for the single-processor optimized kernel, the Click Modular Router and the SMP kernel. [The table reports, for each packet length in bytes, the minimum, average and maximum latencies in microseconds for the optimized kernel, for Click and for the SMP kernel; the numerical values are not reproduced in this transcription.]

C. Setup C numerical results

In the third benchmarking session, the three considered software architectures have been tested in the presence of four Gigabit Ethernet interfaces with a full-meshed traffic matrix (Fig. 7). By analyzing the maximum effective throughput values, reported in Fig. 20, we can note that Click appears to achieve a better performance level with respect to the Linux kernels, while, unlike the previous case, the single processor Linux kernel provides maximum forwarding rates considerably higher than the SMP version. In fact, the SMP kernel tries to share the computational load of the incoming traffic among the CPUs and, in the presence of a full-meshed traffic matrix, this results in an almost static assignment of each CPU to two specific network interfaces. As, in this situation, about half of the forwarded packets cross the OR between two interfaces managed by different CPUs, this causes a performance loss due to memory concurrency problems. Figs. 21 and 22 show the minimum, maximum and average latency values obtained during this test set. Observing them, we can note how the SMP Linux kernel, in the presence of short datagrams, suffers memory concurrency problems that weigh on the OR performance and considerably increase both the average and the maximum latency values. Analyzing Fig. 23 and Table III, which report the back-to-back test results, we can note that, on the one hand, all three OR architectures achieve a similar zero-loss burst length, while on the other hand, Click reaches very high average and maximum latencies with respect to the single-processor and SMP kernels. The loss-rate results, shown in Fig. 24, highlight the performance decay of the Linux SMP kernel, while a fairly similar behaviour is achieved by the other two architectures. Moreover, as in the previous benchmarking session, the maximum forwarding rate for each Gigabit network interface is limited to about 600/650 Mbps.

Figure 20. Throughput and latency test, setup C: effective throughput results for the single-processor optimized kernel, the Click Modular Router and the SMP kernel.
Figure 21. Throughput and latency test, testbed C: minimum and maximum latencies for the single-processor optimized kernel, the Click Modular Router and the SMP kernel.
Figure 22. Throughput and latency test, testbed C: average latencies for the single-processor optimized kernel, the Click Modular Router and the SMP kernel.

Figure 23. Back-to-back test, testbed C: maximum zero loss burst lengths.

TABLE III. Back-to-back test, testbed C: latency values for the single-processor optimized kernel, the Click Modular Router and the SMP kernel. [The table reports, for each packet length in bytes, the minimum, average and maximum latencies in microseconds for the optimized kernel, for Click and for the SMP kernel; the numerical values are not reproduced in this transcription.]

Figure 24. Loss rate test, testbed C: maximum throughput versus both the offered load and the IP datagram size.

D. Setup D numerical results

In the last benchmarking session, we have applied setup D, which provides a full-meshed traffic matrix between one Gigabit Ethernet and 12 Fast Ethernet interfaces, to the single-processor Linux kernel and to the SMP version. Note that we have decided not to use Click in this last test since, at the moment and for this software architecture, there are no drivers with polling support for the used Fast Ethernet interfaces. By analyzing the throughput and latency results in Figs. 25, 26 and 27, we can note how, in the presence of a high number of interfaces and of a full-meshed traffic matrix, the SMP kernel version annihilates its performance: as shown in Fig. 25, the maximum measured value of the effective throughput is limited to about 2400 packets/s, and the corresponding latencies appear clearly higher with respect to the ones obtained with the single processor kernel. However, it can be highlighted that the single processor kernel also does not support the maximum theoretical rate: in particular, it achieves about 10% of the full speed in the presence of short datagrams and about 75% for large datagram sizes.

Figure 25. Throughput and latency test, setup D: effective throughput results for both the single-processor optimized kernel and the SMP kernel.
Figure 26. Throughput and latency test, testbed D: minimum and maximum latencies for both the single-processor optimized kernel and the SMP kernel.
Figure 27. Throughput and latency test, testbed D: average latencies for both the single-processor optimized kernel and the SMP kernel.

To better understand why the OR does not reach full speed with such a high number of Fast Ethernet interfaces, we have decided to perform several profiling tests. In particular, these tests were carried out by using two simple traffic matrixes: the first (Fig. 28) is composed of 12 CBR flows that cross the OR from the Fast Ethernet interfaces to the Gigabit one, while the second (Fig. 29) is composed of 12 CBR flows that cross the OR in the opposite direction (i.e., from the Gigabit to the Fast Ethernet interfaces). These simple traffic matrixes allow us to separately analyze the receive and transmission operations in this context. Figs. 28 and 29 report the profiling results corresponding to the two traffic matrixes. The obtained internal measurements in


More information

Computer Systems Structure Input/Output

Computer Systems Structure Input/Output Computer Systems Structure Input/Output Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Examples of I/O Devices

More information

Scalable Layer-2/Layer-3 Multistage Switching Architectures for Software Routers

Scalable Layer-2/Layer-3 Multistage Switching Architectures for Software Routers Scalable Layer-2/Layer-3 Multistage Switching Architectures for Software Routers Andrea Bianco, Jorge M. Finochietto, Giulio Galante, Marco Mellia, Davide Mazzucchi, Fabio Neri, Dipartimento di Elettronica,

More information

High-Speed TCP Performance Characterization under Various Operating Systems

High-Speed TCP Performance Characterization under Various Operating Systems High-Speed TCP Performance Characterization under Various Operating Systems Y. Iwanaga, K. Kumazoe, D. Cavendish, M.Tsuru and Y. Oie Kyushu Institute of Technology 68-4, Kawazu, Iizuka-shi, Fukuoka, 82-852,

More information

How To Provide Qos Based Routing In The Internet

How To Provide Qos Based Routing In The Internet CHAPTER 2 QoS ROUTING AND ITS ROLE IN QOS PARADIGM 22 QoS ROUTING AND ITS ROLE IN QOS PARADIGM 2.1 INTRODUCTION As the main emphasis of the present research work is on achieving QoS in routing, hence this

More information

The Bus (PCI and PCI-Express)

The Bus (PCI and PCI-Express) 4 Jan, 2008 The Bus (PCI and PCI-Express) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the

More information

High-performance vswitch of the user, by the user, for the user

High-performance vswitch of the user, by the user, for the user A bird in cloud High-performance vswitch of the user, by the user, for the user Yoshihiro Nakajima, Wataru Ishida, Tomonori Fujita, Takahashi Hirokazu, Tomoya Hibi, Hitoshi Matsutahi, Katsuhiro Shimano

More information

Networking Driver Performance and Measurement - e1000 A Case Study

Networking Driver Performance and Measurement - e1000 A Case Study Networking Driver Performance and Measurement - e1000 A Case Study John A. Ronciak Intel Corporation john.ronciak@intel.com Ganesh Venkatesan Intel Corporation ganesh.venkatesan@intel.com Jesse Brandeburg

More information

TCP Offload Engines. As network interconnect speeds advance to Gigabit. Introduction to

TCP Offload Engines. As network interconnect speeds advance to Gigabit. Introduction to Introduction to TCP Offload Engines By implementing a TCP Offload Engine (TOE) in high-speed computing environments, administrators can help relieve network bottlenecks and improve application performance.

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

Open Flow Controller and Switch Datasheet

Open Flow Controller and Switch Datasheet Open Flow Controller and Switch Datasheet California State University Chico Alan Braithwaite Spring 2013 Block Diagram Figure 1. High Level Block Diagram The project will consist of a network development

More information

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Measuring Cache and Memory Latency and CPU to Memory Bandwidth White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary

More information

Performance Evaluation of Linux Bridge

Performance Evaluation of Linux Bridge Performance Evaluation of Linux Bridge James T. Yu School of Computer Science, Telecommunications, and Information System (CTI) DePaul University ABSTRACT This paper studies a unique network feature, Ethernet

More information

A High Performance IP Traffic Generation Tool Based on the Intel IXP2400 Network Processor

A High Performance IP Traffic Generation Tool Based on the Intel IXP2400 Network Processor A High Performance IP Traffic Generation Tool Based on the Intel IXP2400 Network Processor Raffaele Bolla, Roberto Bruschi, Marco Canini, and Matteo Repetto Department of Communications, Computer and Systems

More information

A NOVEL RESOURCE EFFICIENT DMMS APPROACH

A NOVEL RESOURCE EFFICIENT DMMS APPROACH A NOVEL RESOURCE EFFICIENT DMMS APPROACH FOR NETWORK MONITORING AND CONTROLLING FUNCTIONS Golam R. Khan 1, Sharmistha Khan 2, Dhadesugoor R. Vaman 3, and Suxia Cui 4 Department of Electrical and Computer

More information

Architecture of distributed network processors: specifics of application in information security systems

Architecture of distributed network processors: specifics of application in information security systems Architecture of distributed network processors: specifics of application in information security systems V.Zaborovsky, Politechnical University, Sait-Petersburg, Russia vlad@neva.ru 1. Introduction Modern

More information

Lustre Networking BY PETER J. BRAAM

Lustre Networking BY PETER J. BRAAM Lustre Networking BY PETER J. BRAAM A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC. APRIL 2007 Audience Architects of HPC clusters Abstract This paper provides architects of HPC clusters with information

More information

UPPER LAYER SWITCHING

UPPER LAYER SWITCHING 52-20-40 DATA COMMUNICATIONS MANAGEMENT UPPER LAYER SWITCHING Gilbert Held INSIDE Upper Layer Operations; Address Translation; Layer 3 Switching; Layer 4 Switching OVERVIEW The first series of LAN switches

More information

Technical Bulletin. Arista LANZ Overview. Overview

Technical Bulletin. Arista LANZ Overview. Overview Technical Bulletin Arista LANZ Overview Overview Highlights: LANZ provides unparalleled visibility into congestion hotspots LANZ time stamping provides for precision historical trending for congestion

More information

Tyche: An efficient Ethernet-based protocol for converged networked storage

Tyche: An efficient Ethernet-based protocol for converged networked storage Tyche: An efficient Ethernet-based protocol for converged networked storage Pilar González-Férez and Angelos Bilas 30 th International Conference on Massive Storage Systems and Technology MSST 2014 June

More information

Computer Organization & Architecture Lecture #19

Computer Organization & Architecture Lecture #19 Computer Organization & Architecture Lecture #19 Input/Output The computer system s I/O architecture is its interface to the outside world. This architecture is designed to provide a systematic means of

More information

The proliferation of the raw processing

The proliferation of the raw processing TECHNOLOGY CONNECTED Advances with System Area Network Speeds Data Transfer between Servers with A new network switch technology is targeted to answer the phenomenal demands on intercommunication transfer

More information

4 Internet QoS Management

4 Internet QoS Management 4 Internet QoS Management Rolf Stadler School of Electrical Engineering KTH Royal Institute of Technology stadler@ee.kth.se September 2008 Overview Network Management Performance Mgt QoS Mgt Resource Control

More information

Network Layer: Network Layer and IP Protocol

Network Layer: Network Layer and IP Protocol 1 Network Layer: Network Layer and IP Protocol Required reading: Garcia 7.3.3, 8.1, 8.2.1 CSE 3213, Winter 2010 Instructor: N. Vlajic 2 1. Introduction 2. Router Architecture 3. Network Layer Protocols

More information

Smart Queue Scheduling for QoS Spring 2001 Final Report

Smart Queue Scheduling for QoS Spring 2001 Final Report ENSC 833-3: NETWORK PROTOCOLS AND PERFORMANCE CMPT 885-3: SPECIAL TOPICS: HIGH-PERFORMANCE NETWORKS Smart Queue Scheduling for QoS Spring 2001 Final Report By Haijing Fang(hfanga@sfu.ca) & Liu Tang(llt@sfu.ca)

More information

High-Density Network Flow Monitoring

High-Density Network Flow Monitoring Petr Velan petr.velan@cesnet.cz High-Density Network Flow Monitoring IM2015 12 May 2015, Ottawa Motivation What is high-density flow monitoring? Monitor high traffic in as little rack units as possible

More information

Introduction to PCI Express Positioning Information

Introduction to PCI Express Positioning Information Introduction to PCI Express Positioning Information Main PCI Express is the latest development in PCI to support adapters and devices. The technology is aimed at multiple market segments, meaning that

More information

基 於 SDN 與 可 程 式 化 硬 體 架 構 之 雲 端 網 路 系 統 交 換 器

基 於 SDN 與 可 程 式 化 硬 體 架 構 之 雲 端 網 路 系 統 交 換 器 基 於 SDN 與 可 程 式 化 硬 體 架 構 之 雲 端 網 路 系 統 交 換 器 楊 竹 星 教 授 國 立 成 功 大 學 電 機 工 程 學 系 Outline Introduction OpenFlow NetFPGA OpenFlow Switch on NetFPGA Development Cases Conclusion 2 Introduction With the proposal

More information

Linux Driver Devices. Why, When, Which, How?

Linux Driver Devices. Why, When, Which, How? Bertrand Mermet Sylvain Ract Linux Driver Devices. Why, When, Which, How? Since its creation in the early 1990 s Linux has been installed on millions of computers or embedded systems. These systems may

More information

LCMON Network Traffic Analysis

LCMON Network Traffic Analysis LCMON Network Traffic Analysis Adam Black Centre for Advanced Internet Architectures, Technical Report 79A Swinburne University of Technology Melbourne, Australia adamblack@swin.edu.au Abstract The Swinburne

More information

Autonomous NetFlow Probe

Autonomous NetFlow Probe Autonomous Ladislav Lhotka lhotka@cesnet.cz Martin Žádník xzadni00@stud.fit.vutbr.cz TF-CSIRT meeting, September 15, 2005 Outline 1 2 Specification Hardware Firmware Software 3 4 Short-term fixes Test

More information

Router Architectures

Router Architectures Router Architectures An overview of router architectures. Introduction What is a Packet Switch? Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers 2 1 Router Components

More information

Accelerating High-Speed Networking with Intel I/O Acceleration Technology

Accelerating High-Speed Networking with Intel I/O Acceleration Technology White Paper Intel I/O Acceleration Technology Accelerating High-Speed Networking with Intel I/O Acceleration Technology The emergence of multi-gigabit Ethernet allows data centers to adapt to the increasing

More information

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring CESNET Technical Report 2/2014 HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring VIKTOR PUš, LUKÁš KEKELY, MARTIN ŠPINLER, VÁCLAV HUMMEL, JAN PALIČKA Received 3. 10. 2014 Abstract

More information

Getting the most TCP/IP from your Embedded Processor

Getting the most TCP/IP from your Embedded Processor Getting the most TCP/IP from your Embedded Processor Overview Introduction to TCP/IP Protocol Suite Embedded TCP/IP Applications TCP Termination Challenges TCP Acceleration Techniques 2 Getting the most

More information

SYSTEM ecos Embedded Configurable Operating System

SYSTEM ecos Embedded Configurable Operating System BELONGS TO THE CYGNUS SOLUTIONS founded about 1989 initiative connected with an idea of free software ( commercial support for the free software ). Recently merged with RedHat. CYGNUS was also the original

More information

Leased Line + Remote Dial-in connectivity

Leased Line + Remote Dial-in connectivity Leased Line + Remote Dial-in connectivity Client: One of the TELCO offices in a Southern state. The customer wanted to establish WAN Connectivity between central location and 10 remote locations. The customer

More information

Scaling Networking Applications to Multiple Cores

Scaling Networking Applications to Multiple Cores Scaling Networking Applications to Multiple Cores Greg Seibert Sr. Technical Marketing Engineer Cavium Networks Challenges with multi-core application performance Amdahl s Law Evaluates application performance

More information

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance M. Rangarajan, A. Bohra, K. Banerjee, E.V. Carrera, R. Bianchini, L. Iftode, W. Zwaenepoel. Presented

More information

Switch Fabric Implementation Using Shared Memory

Switch Fabric Implementation Using Shared Memory Order this document by /D Switch Fabric Implementation Using Shared Memory Prepared by: Lakshmi Mandyam and B. Kinney INTRODUCTION Whether it be for the World Wide Web or for an intra office network, today

More information

Monitoring high-speed networks using ntop. Luca Deri <deri@ntop.org>

Monitoring high-speed networks using ntop. Luca Deri <deri@ntop.org> Monitoring high-speed networks using ntop Luca Deri 1 Project History Started in 1997 as monitoring application for the Univ. of Pisa 1998: First public release v 0.4 (GPL2) 1999-2002:

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

Per-Flow Queuing Allot's Approach to Bandwidth Management

Per-Flow Queuing Allot's Approach to Bandwidth Management White Paper Per-Flow Queuing Allot's Approach to Bandwidth Management Allot Communications, July 2006. All Rights Reserved. Table of Contents Executive Overview... 3 Understanding TCP/IP... 4 What is Bandwidth

More information

Network Simulation Traffic, Paths and Impairment

Network Simulation Traffic, Paths and Impairment Network Simulation Traffic, Paths and Impairment Summary Network simulation software and hardware appliances can emulate networks and network hardware. Wide Area Network (WAN) emulation, by simulating

More information

Comparing and Improving Current Packet Capturing Solutions based on Commodity Hardware

Comparing and Improving Current Packet Capturing Solutions based on Commodity Hardware Comparing and Improving Current Packet Capturing Solutions based on Commodity Hardware Lothar Braun, Alexander Didebulidze, Nils Kammenhuber, Georg Carle Technische Universität München Institute for Informatics

More information

Application Note. Windows 2000/XP TCP Tuning for High Bandwidth Networks. mguard smart mguard PCI mguard blade

Application Note. Windows 2000/XP TCP Tuning for High Bandwidth Networks. mguard smart mguard PCI mguard blade Application Note Windows 2000/XP TCP Tuning for High Bandwidth Networks mguard smart mguard PCI mguard blade mguard industrial mguard delta Innominate Security Technologies AG Albert-Einstein-Str. 14 12489

More information

CS 78 Computer Networks. Internet Protocol (IP) our focus. The Network Layer. Interplay between routing and forwarding

CS 78 Computer Networks. Internet Protocol (IP) our focus. The Network Layer. Interplay between routing and forwarding CS 78 Computer Networks Internet Protocol (IP) Andrew T. Campbell campbell@cs.dartmouth.edu our focus What we will lean What s inside a router IP forwarding Internet Control Message Protocol (ICMP) IP

More information

Packet Capture in 10-Gigabit Ethernet Environments Using Contemporary Commodity Hardware

Packet Capture in 10-Gigabit Ethernet Environments Using Contemporary Commodity Hardware Packet Capture in 1-Gigabit Ethernet Environments Using Contemporary Commodity Hardware Fabian Schneider Jörg Wallerich Anja Feldmann {fabian,joerg,anja}@net.t-labs.tu-berlin.de Technische Universtität

More information

EVALUATING THE NETWORKING PERFORMANCE OF LINUX-BASED HOME ROUTER PLATFORMS FOR MULTIMEDIA SERVICES. Ingo Kofler, Robert Kuschnig, Hermann Hellwagner

EVALUATING THE NETWORKING PERFORMANCE OF LINUX-BASED HOME ROUTER PLATFORMS FOR MULTIMEDIA SERVICES. Ingo Kofler, Robert Kuschnig, Hermann Hellwagner EVALUATING THE NETWORKING PERFORMANCE OF LINUX-BASED HOME ROUTER PLATFORMS FOR MULTIMEDIA SERVICES Ingo Kofler, Robert Kuschnig, Hermann Hellwagner Institute of Information Technology (ITEC) Alpen-Adria-Universität

More information

Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University

Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University Napatech - Sharkfest 2009 1 Presentation Overview About Napatech

More information

Virtualised MikroTik

Virtualised MikroTik Virtualised MikroTik MikroTik in a Virtualised Hardware Environment Speaker: Tom Smyth CTO Wireless Connect Ltd. Event: MUM Krackow Feb 2008 http://wirelessconnect.eu/ Copyright 2008 1 Objectives Understand

More information

Latency on a Switched Ethernet Network

Latency on a Switched Ethernet Network Application Note 8 Latency on a Switched Ethernet Network Introduction: This document serves to explain the sources of latency on a switched Ethernet network and describe how to calculate cumulative latency

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

Programmable Networking with Open vswitch

Programmable Networking with Open vswitch Programmable Networking with Open vswitch Jesse Gross LinuxCon September, 2013 2009 VMware Inc. All rights reserved Background: The Evolution of Data Centers Virtualization has created data center workloads

More information

Technical Bulletin. Enabling Arista Advanced Monitoring. Overview

Technical Bulletin. Enabling Arista Advanced Monitoring. Overview Technical Bulletin Enabling Arista Advanced Monitoring Overview Highlights: Independent observation networks are costly and can t keep pace with the production network speed increase EOS eapi allows programmatic

More information

Local-Area Network -LAN

Local-Area Network -LAN Computer Networks A group of two or more computer systems linked together. There are many [types] of computer networks: Peer To Peer (workgroups) The computers are connected by a network, however, there

More information

Course 12 Synchronous transmission multiplexing systems used in digital telephone networks

Course 12 Synchronous transmission multiplexing systems used in digital telephone networks Course 12 Synchronous transmission multiplexing systems used in digital telephone networks o Disadvantages of the PDH transmission multiplexing system PDH: no unitary international standardization of the

More information

How To Monitor And Test An Ethernet Network On A Computer Or Network Card

How To Monitor And Test An Ethernet Network On A Computer Or Network Card 3. MONITORING AND TESTING THE ETHERNET NETWORK 3.1 Introduction The following parameters are covered by the Ethernet performance metrics: Latency (delay) the amount of time required for a frame to travel

More information

Open-source routing at 10Gb/s

Open-source routing at 10Gb/s Open-source routing at Gb/s Olof Hagsand, Robert Olsson and Bengt Gördén, Royal Institute of Technology (KTH), Sweden Email: {olofh, gorden}@kth.se Uppsala University, Uppsala, Sweden Email: robert.olsson@its.uu.se

More information

Insiders View: Network Security Devices

Insiders View: Network Security Devices Insiders View: Network Security Devices Dennis Cox CTO @ BreakingPoint Systems CanSecWest/Core06 Vancouver, April 2006 Who am I? Chief Technology Officer - BreakingPoint Systems Director of Engineering

More information

SAN Conceptual and Design Basics

SAN Conceptual and Design Basics TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer

More information

ncap: Wire-speed Packet Capture and Transmission

ncap: Wire-speed Packet Capture and Transmission ncap: Wire-speed Packet Capture and Transmission L. Deri ntop.org Pisa Italy deri@ntop.org Abstract With the increasing network speed, it is no longer possible to capture and transmit network packets at

More information

An Oracle Technical White Paper November 2011. Oracle Solaris 11 Network Virtualization and Network Resource Management

An Oracle Technical White Paper November 2011. Oracle Solaris 11 Network Virtualization and Network Resource Management An Oracle Technical White Paper November 2011 Oracle Solaris 11 Network Virtualization and Network Resource Management Executive Overview... 2 Introduction... 2 Network Virtualization... 2 Network Resource

More information

Benchmarking Virtual Switches in OPNFV draft-vsperf-bmwg-vswitch-opnfv-00. Maryam Tahhan Al Morton

Benchmarking Virtual Switches in OPNFV draft-vsperf-bmwg-vswitch-opnfv-00. Maryam Tahhan Al Morton Benchmarking Virtual Switches in OPNFV draft-vsperf-bmwg-vswitch-opnfv-00 Maryam Tahhan Al Morton Introduction Maryam Tahhan Network Software Engineer Intel Corporation (Shannon Ireland). VSPERF project

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Laura Knapp WW Business Consultant Laurak@aesclever.com Applied Expert Systems, Inc. 2011 1 Background

More information

Unified Fabric: Cisco's Innovation for Data Center Networks

Unified Fabric: Cisco's Innovation for Data Center Networks . White Paper Unified Fabric: Cisco's Innovation for Data Center Networks What You Will Learn Unified Fabric supports new concepts such as IEEE Data Center Bridging enhancements that improve the robustness

More information

How To Improve Performance On A Linux Based Router

How To Improve Performance On A Linux Based Router Linux Based Router Over 10GE LAN Cheng Cui, Chui-hui Chiu, and Lin Xue Department of Computer Science Louisiana State University, LA USA Abstract High speed routing with 10Gbps link speed is still very

More information

A Comparative Study on Vega-HTTP & Popular Open-source Web-servers

A Comparative Study on Vega-HTTP & Popular Open-source Web-servers A Comparative Study on Vega-HTTP & Popular Open-source Web-servers Happiest People. Happiest Customers Contents Abstract... 3 Introduction... 3 Performance Comparison... 4 Architecture... 5 Diagram...

More information

PART III. OPS-based wide area networks

PART III. OPS-based wide area networks PART III OPS-based wide area networks Chapter 7 Introduction to the OPS-based wide area network 7.1 State-of-the-art In this thesis, we consider the general switch architecture with full connectivity

More information