RFC 2544 Performance Evaluation for a Linux Based Open Router


Raffaele Bolla, Roberto Bruschi
DIST - Department of Communications, Computer and Systems Science, University of Genoa, Via Opera Pia 13, Genova, Italy
{raffaele.bolla, roberto.bruschi}@unige.it

Abstract - Nowadays, networking equipment is built on decentralized architectures that often include special-purpose hardware elements. These elements considerably improve performance on one hand, while on the other they limit flexibility: it is very difficult both to obtain details about internal operations and to perform any intervention more complex than a configuration of parameters. However, the experimental nature of the Internet and its diffusion in many contexts sometimes suggest a different approach. This need is most evident inside the scientific community, which often encounters many difficulties in carrying out experiments. Recent technological advances offer a good chance to do something really effective in the field of open Internet equipment, also called Open Routers (ORs). Starting from these considerations, several initiatives have been launched in the last few years to investigate the OR and related issues. Despite these activities, large and interesting areas still require deeper investigation. This work tries to give a contribution by reporting the results of an in-depth activity of optimization and testing carried out on a PC Open Router architecture based on Linux software and COTS hardware. The main target of this paper is the forwarding performance evaluation of different Linux-based OR software architectures. This analysis has been performed with both external (throughput and latency) and internal (profiling) measurements. In particular, as far as the external measurements are concerned, a set of RFC 2544 compliant tests has been proposed and analyzed.

Keywords - Open Router; RFC 2544; IP forwarding.

I. INTRODUCTION

The Internet technology has been developed in an open environment; all Internet related protocols, architectures and structures are publicly created and described. For this reason, in principle, everyone can easily develop a piece of Internet equipment (e.g., a router). On the contrary, and in some sense quite surprisingly, most professional equipment is realized in a very closed way: it is very difficult both to obtain details about internal operations and to perform any kind of intervention more complex than a parametric configuration. Generally speaking this does not appear so strange; it is a clear attempt to protect the industrial investment. But sometimes the experimental nature of the Internet and its diffusion in many contexts suggest a different approach. This need is most evident inside the scientific community, which often finds many difficulties in realizing experiments, test-beds and trials for the evaluation of new functionalities, protocols and control mechanisms. The commercial context also frequently asks for a more open approach, like the one suggested by the Open Source philosophy for software, especially in situations where network functions must be embedded in products whose main aim is not exactly basic network capabilities. Today, recent technology advances give a good chance to do something really effective in the field of open Internet equipment, sometimes called Open Routers (ORs).
This possibility comes, on the software side, from Open Source Operating Systems (OSs) like Linux and FreeBSD (which have sophisticated and complete networking capabilities), and on the hardware side from COTS/PC components (whose performance keeps increasing while their cost keeps decreasing). The attractiveness of the OR solution can be summarized in multi-vendor availability, low cost and continuous update/evolution of its basic parts. Starting from these considerations, several initiatives have been launched in the last few years to investigate the Open Router and related issues. In the software area, one of the most important initiatives is the Click Modular Router Project [1-3] from MIT, which proposes an effective solution for the data plane. In the control plane area two important projects can be cited: Zebra [4] with its evolution Quagga [5], and Xorp [6, 7]. Apart from custom developments, some standard Open Source OSs can also give very effective support to an OR realization; the most relevant OSs in this sense are Linux [8-10] and FreeBSD [11]. Other activities are focused on hardware: [12] and [13] propose a router architecture based on a PC cluster, and [14] reports some performance results (in packet transmission and reception) obtained with a PC Linux-based test-bed. Some evaluations have also been carried out on network boards, see for example [15]. Despite these activities, large and interesting areas still require deeper investigation: the identification of the most appropriate HW structure, the comparison of different SW solutions (specific Linux kernel versions, FreeBSD, Click, etc.), the identification of the best software configurations with an indication of the most significant parameters, the accurate characterization of open node performance, and the identification of SW and HW bottlenecks. This work tries to give a contribution to the investigation of the above aspects by reporting the results of an

in-depth activity of optimization and testing carried out on an OR architecture based on Linux software, performed within our participation in the EURO project [16]. We have focused our effort on testing the packet forwarding functionality. Our main objectives have been the performance evaluation of an optimized OR, with both external (throughput) and internal (profiling) measurements. Moreover, specific tests have been carried out in an IP QoS aware context, to verify the effectiveness of flow differentiation and the impact of classification and scheduling functionalities on performance. To this end, we have identified a high-end reference PC based hardware architecture and the Linux 2.6 kernel for the software data plane, we have optimized this OR structure, defined a test environment and finally carried out a complete series of tests with an accurate evaluation of the role of the software modules in determining the performance limits. With the QoS tests we have verified the presence of serious problems in assuring flow separation, but we have also experimented with a feasible solution to this problem, obtained by using advanced capabilities of some modern network boards. The remainder of the paper is organized as follows.

II. HARDWARE AND SOFTWARE ARCHITECTURE

To define the OR reference architecture, we have established some main criteria and used them to select a priori a set of basic elements. The objective has been to obtain a high-end node base structure, able to support top performance with respect to IP packet forwarding and control plane elaboration. The decision process, the criteria and the final selection results are described in some detail in the following, separately for hardware and software elements.

A. The hardware architecture

The PC architecture is a general-purpose one and is not specifically optimized for network operations. This means that, in principle, it cannot reach the same performance level as custom high-end network equipment, which generally uses dedicated HW elements to handle and to parallelize the most critical operations. This characteristic has more impact on data plane performance, where custom equipment usually utilizes dedicated ASICs, FPGAs, Network Processors and specific internal busses to provide a high level of parallelism in packet elaboration and exchange. On the other hand, COTS hardware can guarantee two very important features: cheapness, and the fast and continuous evolution of many of its components. Moreover, the performance gap might not be so large, and in any case it is more than justified by the cost difference. During networking operations, the PC internal data path has to use a centralized I/O structure composed of: the I/O bus, the memory channel (both used by DMA to transfer data from the network interfaces to RAM and vice versa) and the Front Side Bus (FSB) (used by the CPU, through the memory channel, to access the RAM during packet elaboration). It is evident that the bandwidth of these busses and the PC computational capacity are the two most critical hardware elements in determining the maximum performance, in terms of both the peak passing bandwidth (in Mbps) and the maximum number of forwarded packets per second. So the selection criteria have been very fast internal busses and a dual-CPU system with high integer computational power.

Figure 1. Scheme of the packet path in a PC hardware architecture.
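To give a rough idea of the orders of magnitude involved, the following small C program is an illustrative back-of-the-envelope calculation (not part of the test software described in this paper). It assumes standard Ethernet framing overheads (8 bytes of preamble plus 12 bytes of inter-frame gap) and the simplification that each forwarded packet crosses the memory channel twice, once written by the Rx DMA and once read back by the Tx DMA, in line with the data path sketched in Fig. 1.

    /* Back-of-the-envelope estimate of the per-packet load that wire-rate
     * Gigabit Ethernet forwarding puts on the PC busses. Illustrative only:
     * frame sizes and the "two DMA crossings per packet" assumption are
     * simplifications of the architecture described above. */
    #include <stdio.h>

    int main(void)
    {
        const double link_bps = 1e9;               /* Gigabit Ethernet line rate */
        const int overhead = 8 + 12;               /* preamble + inter-frame gap */
        int frame_sizes[] = { 64, 128, 256, 512, 1024, 1518 }; /* bytes, with FCS */

        for (unsigned i = 0; i < sizeof(frame_sizes) / sizeof(frame_sizes[0]); i++) {
            int wire_bytes = frame_sizes[i] + overhead;
            double pps = link_bps / (wire_bytes * 8.0);
            /* each packet is written to RAM by the Rx DMA and read back by
             * the Tx DMA: two crossings of the memory channel              */
            double mem_gbps = 2.0 * pps * frame_sizes[i] * 8.0 / 1e9;
            printf("frame %4d B: %8.0f pkt/s, ~%.2f Gbit/s of memory traffic\n",
                   frame_sizes[i], pps, mem_gbps);
        }
        return 0;
    }

For minimum-sized frames this yields roughly 1.49 million packets per second per saturated Gigabit link, which is the per-packet pressure on the CPU and busses that the rest of the paper evaluates.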
With this goal, we have chosen a Supermicro X5DL8-GG mainboard with a ServerWorks GC-LE chipset, whose structure is shown in Fig. 2. This chipset can support a dual-Xeon system with a dual memory channel and a PCI-X bus at 133 MHz with 64 parallel bits. The Xeon processors used have a 2.4 GHz clock and a 512 KB cache. The memory bandwidth supported by this chipset matches the system bus speed, but the most important point is that the memory is 2-way interleaved, which assures high performance and low average access latencies. The bus that connects the North Bridge to the PCI bridges, namely the IMB, has more bandwidth (more than 25 Gbps) than the maximum combined bandwidth of the two PCI-X busses (16 Gbps) on each I/O bridge.

Figure 2. Scheme of the Supermicro X5DL8-GG mainboard.

Network interfaces are another critical element in the system, as they can heavily condition the PC Router performance. As reported in [15], the network adapters on the market have different levels of maximum performance and different configurability. In this respect, we have selected two types of OR adapters with different characteristics. As the first adapter kind, we have decided to use a high performance Gigabit Ethernet interface, namely the Intel PRO/1000 XT Server, which is equipped with a PCI-X controller supporting the 133 MHz frequency and a wide configuration range for many parameters like, for example, transmission and receive buffer lengths, maximum interrupt rates and other important features [17]. As the second network adapter type, we have chosen a D-Link DFE-580TX [18], a network card equipped with four Fast Ethernet interfaces and a PCI 2.1 controller (i.e., 32 parallel bits at a 66 MHz clock frequency). Although a Fast Ethernet adapter cannot considerably influence the OR performance, since it works at lower and less critical speeds than the Gigabit ones, the choice of a quad port adapter allows

us to analyze the OR behaviour in the presence of a high number of interfaces.

B. The software architecture

The software architecture of an OR has to provide many different functionalities: from the ones directly involved in the packet forwarding process to the ones needed for control, dynamic configuration and monitoring. In particular, we have chosen to study and analyze a Linux based OR framework, as it is one of the open source OSs with a large and sophisticated kernel-integrated network support, it is equipped with numerous GNU software applications, and it has been selected in recent years as the framework for a large part of networking research projects. Concerning the Linux OR architecture, as outlined in [1] and in [3], while all the forwarding functions are realized inside the Linux kernel, the large part of the control and monitoring operations runs as daemons/applications in user mode. Thus, we have to point out that, unlike in most commercial network equipment, the forwarding functionalities and the control ones have to share the CPUs of the system. In fact, high-end commercial network equipment in particular provides separate computational resources for these processes: the forwarding is managed by a switching fabric, a switching matrix often realized with ad hoc hardware elements (ASICs, FPGAs and Network Processors), while all the control functionalities are executed by one or more separate processors. [19] reports a detailed description of how the resource sharing between the control plane and the forwarding process can affect the overall performance in different OR configurations (e.g., SMP kernel, single processor kernel, etc.).

Let us now go into the details of the Linux OR architecture: as shown in Fig. 3 and previously sketched, the control plane functionalities run in user space, while the forwarding process is entirely realized inside the kernel. There are two main sets of functions to consider: the packet forwarding supporting functions (the real switching operation) and the control plane supporting functions (the signalling protocols: routing protocols, control protocols, etc.). For the control plane, one of the several open source tools that work as applications or daemons in user space, such as Zebra [4], Quagga [5] and Xorp [6, 7], can be used. The critical element for IP forwarding is the kernel, where all the link, network and transport layer operations are realized (1) (see Fig. 3). During the last years, the networking support integrated in the Linux kernel has undergone many structural and refining developments, mostly concerning the packet reception mechanism: it has rapidly evolved from a simple interrupt architecture (adopted up to kernel version 2.2), through a SW interrupt receive mechanism (called SoftNet) adopted in the following kernel versions, up to an interrupt moderation one (called NAPI, New API) adopted in the most recent kernels.

(1) To be more precise, some layer 2 operations are directly realized in the network interface drivers.

Figure 3. Block diagram of the SW architecture of the Linux PC-Router.

The SoftNet architecture, even if it maintains an interrupt based structure, improves the performance because it lowers the computational overhead of context switching by delaying the elaboration of the received packets through interrupt scheduling. In spite of these improvements, this architecture has proved to be inadequate at medium/high packet rates.
In fact, in the presence of a high ingress packet rate, the well-known interrupt livelock phenomenon (clearly described in [21]) causes heavy performance deterioration. The NAPI architecture has been explicitly created to increase the system scalability, as it can handle network interface requests with an interrupt moderation mechanism that allows adaptively switching from a classical interrupt management of the network interfaces to a polling one. For these reasons, we have chosen to use a last generation Linux kernel, more specifically a 2.6 version, which, besides the NAPI support, has other interesting new features (i.e., kernel preemption and an O(1) complexity scheduler). The forwarding mechanism of all Linux kernels is fundamentally composed of a chain of three different modules: a reception API that handles packet reception (NAPI), a module that carries out the IP layer elaboration and, finally, a transmission API that manages the forwarding operations towards the egress network interfaces.

Starting the analysis of the network code structure in 2.6 kernels, we have to note that all the kernel code follows a zero-copy approach [22]: to avoid unnecessary and onerous memory transfer operations, the packets are left in the memory locations used by the DMA-engines of the ingress network interfaces, and each subsequent operation is performed by using a sort of pointer to the packet and to its key fields, called a descriptor or sk_buff. These descriptors are effectively composed of pointers to the different fields of the headers contained in the associated packets. A descriptor is immediately allocated and associated with each received packet in the sub-IP layer, subsequently used in every networking operation, and finally deallocated when the network interface signals the successful transmission. Concerning the Symmetric Multi-Processor (SMP) support, in NAPI kernels, to reduce the performance deterioration due to CPU concurrency, the management of each network interface is assigned to a single CPU for both the transmission and reception functionalities. However, as outlined in [19], this interface-CPU assignment can be dynamically modified: for example, the cpu_affinity module tries to adaptively distribute the traffic load coming from the interfaces to all the CPUs in the system. In short, in the standard forwarding operations the received packets are transferred by the DMA-engine of the interface card, in a way transparent to the OS, and placed in memory areas reserved to the kernel, where the packets wait for the kernel layer elaboration. When the kernel handles the packets, it associates them with descriptors and, then,

it elaborates them one at a time. During the IP layer elaboration an egress interface is selected for each packet, and its descriptor is placed in a buffer, called the device egress qdisc, waiting for transmission. Let us now analyze in more detail the architecture of the three fundamental modules of the networking code of 2.6 kernels: the NAPI, the IP processing and the transmission (Tx) API.

1) NAPI

As previously written, NAPI is the new Linux reception API, and it has been proposed to achieve the scalability level needed to support Gigabit network interfaces. It implements an adaptive mechanism that, by using interrupt mitigation, behaves like the classical SoftNet API at low input rates, while, at higher rates, it works like a polling mechanism. Let us analyze the NAPI structure in depth. At the first received packet placed in the ring buffer, the network interface generates an interrupt. When a CPU receives the interrupt, the driver's interrupt routine calls netif_rx_schedule, which:
- records the identifier of the interface that has generated the interrupt in a buffer, called the poll_list;
- schedules the NET_RX_SOFTIRQ software interrupt, instead of immediately serving the requesting device, to delay the context switching;
- disables interrupt generation, if the network card hardware supports this operation.
All the subsequent packets, which are received before the kernel scheduler serves the SW interrupt, do not cause any interrupt to the OS, but are moved to a reserved memory location by the DMA-engine. To know the address of the reserved memory location, the DMA-engine uses a list of available descriptors. Such a list of descriptors is organized like a ring buffer, called the rx_ring, in the kernel reserved memory. If there is not enough space or there are no available descriptors, the packets are dropped. The scheduler wakes the softirq handler (net_rx_action), which scans the poll_list and, for each contained device:
- calls dev->poll, which processes, by using some driver functions, a number of packets equal to the minimum between the number of packets in the rx_ring and the quota parameter; for each received packet the netif_receive_skb function is invoked;
- if the rx_ring of the device becomes empty, the identifier is cleared from the poll_list, and the interrupt generation of the card is re-enabled;
- if the rx_ring of the device does not become empty, the identifier is kept in the poll_list.
The procedure stops when there are no more device identifiers in the poll_list or when a number of packets equal to the quota parameter has been processed. Note that the quota parameter sets the maximum number of packets that can be processed before activating other tasks.

2) IP processing

The IP network layer code is generally composed of four important segments, which manage respectively the reception, the routing, the forwarding and the transmission of datagrams. More in particular, the reception process is realized by two functions: ip_rcv, which checks the datagram header consistency (i.e., checksum), and ip_rcv_finish, which decides whether the IP destination is valid and where to send the datagram (to local delivery or to the forwarding chain).
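Before moving on to the routing and transmission stages, the polling mechanism described in point 1) above can be made more concrete with the following self-contained C program, which models it in user space. It is only an illustrative model: the rx_ring occupancies are simple counters, there is no real NIC, interrupt or sk_buff, and the identifiers merely mirror the kernel's poll_list, rx_ring and quota concepts.

    /* User-space model of the NAPI polling loop described above.
     * It mimics net_rx_action: devices enter a poll_list when "interrupted",
     * each poll round serves at most min(packets in rx_ring, quota) packets
     * per device, and a device leaves the list only when its rx_ring is
     * empty. Purely illustrative, not kernel code. */
    #include <stdio.h>

    #define NDEV   3
    #define QUOTA  4          /* per-device packets served per poll round */

    struct device {
        const char *name;
        int rx_ring;          /* packets currently waiting in the ring    */
        int in_poll_list;     /* 1 = interrupts disabled, being polled    */
    };

    static void poll_one(struct device *dev)
    {
        int served = dev->rx_ring < QUOTA ? dev->rx_ring : QUOTA;

        dev->rx_ring -= served;              /* netif_receive_skb() x served */
        printf("  %s: served %d, %d left in rx_ring\n",
               dev->name, served, dev->rx_ring);

        if (dev->rx_ring == 0) {
            dev->in_poll_list = 0;           /* re-enable interrupts         */
            printf("  %s: rx_ring empty, removed from poll_list\n", dev->name);
        }
    }

    int main(void)
    {
        struct device devs[NDEV] = {
            { "eth0", 10, 1 },   /* each device has already raised its first */
            { "eth1",  3, 1 },   /* interrupt and sits in the poll_list      */
            { "eth2",  6, 1 },
        };
        int round = 0, busy = 1;

        while (busy) {                        /* one net_rx_action run        */
            busy = 0;
            printf("poll round %d\n", ++round);
            for (int i = 0; i < NDEV; i++)
                if (devs[i].in_poll_list) {
                    poll_one(&devs[i]);
                    busy |= devs[i].in_poll_list;
                }
        }
        return 0;
    }

Running the model shows the interface with the fullest ring staying in the poll_list for several rounds, while lightly loaded interfaces are served once and return to interrupt mode, which is exactly the adaptive behaviour described above.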
If the datagram is not locally delivered, it is elaborated by the routing module, which is structured as a two-level address lookup: first a fast lookup is performed in ip_route_input by using a cache of the most frequently used destination addresses and, if the cache misses the requested destination IP, the lookup continues with the ip_route_input_slow routine, which uses the entire IP routing table. Afterwards, ip_forward decreases the TTL value by one. The forwarding action is completed by the ip_forward_finish function, which, if there are IP options, invokes ip_forward_options and then passes the sk_buff directly to the IP transmission module. The first function of the IP transmission module (ip_output) handles datagram fragmentation if necessary; then it calls the ip_output_finish function, which concludes the IP layer elaboration.

3) Tx API

Another important part of the kernel networking code is the transmission API. It is quite simple and, unlike the Rx one, has not undergone any great revision in the past years. The Tx API is based, in large part, on driver functions that signal to the network interfaces the presence of packets to transmit, or that receive messages about their successful elaboration. In particular, the Tx API elaboration starts by calling, for each packet coming from the IP layer, dev_queue_xmit: this function enqueues the descriptor in a buffer associated with the egress device, called the qdisc. Then, when there is enough space in the device tx_ring, the descriptors in the egress qdisc are served by the qdisc_restart function. For each served descriptor, the virtual method hard_start_xmit is called: this method is implemented in the device driver (e.g., for Intel Gigabit cards it is e1000_start_xmit). This function adds the descriptor to the tx_ring and signals to the network device that there are data to transfer. Then, the DMA-engine of the egress card transfers the packets from the kernel reserved memory to its hardware ring buffer and transmits them. When the packet transmission has been successfully completed, the network interface generates a hardware interrupt, whose routine schedules a software interrupt with the NET_TX_SOFTIRQ flag. Then, the scheduler wakes up the net_tx_action handler, which moves the descriptors of the transmitted packets to a queue, called the completion_queue, where the descriptors are periodically deallocated.

III. SOFTWARE PERFORMANCE TUNING

As previously sketched and shown in Fig. 4, the whole networking kernel architecture is quite complex and presents several aspects and many parameters that can be tuned for system optimization. As reported in [23], this tuning is very important for the final performance. Some of the optimal parameter values can be identified by logical considerations,

but most of them have to be determined empirically, since their optimal value cannot be easily derived from the software structure and since they also depend on the hardware components. So our tuning has been carried out first by identifying the critical parameters on which to operate and, then, by finding the most convenient values through both logical considerations and experimental measurements. The whole kernel architecture includes only three packet buffering points (see Fig. 4): the Rx ring buffers, the egress qdisc queues/Tx ring buffers and the completion queue. While this last queue has proved not to be critical for performance optimization purposes (it contains only descriptors of already sent packets waiting for memory deallocation), finding an optimal configuration of the other two queues is very important to maximize the performance. In fact, to avoid useless CPU waste it is preferable to minimize the losses at the egress qdisc buffers, where the packets have already been processed by the IP layer, and to keep these packet losses at the Rx ring buffers, before processing: this can be easily obtained, for example, by increasing the qdisc buffer length to a value of about 20,000 descriptors. Many network adapter drivers, for example, allow the dimensioning of the ring buffers and of the maximum interrupt rates. Both these parameters have a great influence on the NAPI performance. In fact, if we consider a medium-high traffic load condition, NAPI practically works by polling the network interfaces, which insert themselves in the poll list with the interrupt generated by the first received packet; all the packets received before the interface's turn in the polling sequence are enqueued in the hardware ring buffers. Thus, while it is clearly desirable not to limit the interrupt rate of network adapters in NAPI kernels, a large ring buffer permits reducing the packet drops in the presence of bursty traffic. Another interesting parameter to tune is the quota value, which fixes the number of packets that each device can elaborate at every polling cycle. It is also possible to act on some specific 2.6 kernel parameters by customizing them to the specific networking usage: for example, the profiling results (Section VI) show that the kernel scheduler operations uselessly employ about 4-5% of the overall computational resources. To avoid this CPU time waste, the OS scheduler clock frequency can be decreased: by reducing its value to 100 Hz, the forwarding rate improves by about 20 Kpackets per second (see Section VI). The rationalization of memory management is another important aspect: as highlighted in the profiling results of Section VI, a considerable part of the available resources is used in the allocation and deallocation of packet descriptors (memory management functions). [24] proposes a patch that allows recycling the descriptors of successfully sent packets: the basic idea is to save CPU resources during the NAPI receive operations by reusing the packet descriptors found in the completion queue. The use of this patch can further improve the performance.
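The idea behind the descriptor recycling patch can be illustrated with the following self-contained C sketch: instead of freeing every descriptor after transmission and allocating a fresh one for every received packet, descriptors parked in the completion queue are reused. The structure and function names are illustrative only and do not reproduce the actual patch in [24], which operates on kernel sk_buffs.

    /* Illustrative model of descriptor recycling: a small free list fed by
     * the completion queue replaces per-packet malloc()/free(). The names
     * (descriptor, completion_queue) mirror the text; the real patch in [24]
     * is more elaborate than this sketch. */
    #include <stdlib.h>
    #include <stdio.h>

    struct descriptor {
        struct descriptor *next;
        void *packet;               /* would point into DMA-able memory */
    };

    static struct descriptor *completion_queue;   /* recycled descriptors */

    /* Called on the receive path: prefer a recycled descriptor. */
    static struct descriptor *alloc_descriptor(void)
    {
        if (completion_queue) {
            struct descriptor *d = completion_queue;
            completion_queue = d->next;
            return d;                              /* no malloc() needed   */
        }
        return malloc(sizeof(struct descriptor));
    }

    /* Called when the NIC signals a successful transmission:
     * park the descriptor for reuse instead of freeing it. */
    static void release_descriptor(struct descriptor *d)
    {
        d->next = completion_queue;
        completion_queue = d;
    }

    int main(void)
    {
        /* forward a few "packets": after the first one, every descriptor
         * is served from the recycle list                                 */
        for (int i = 0; i < 5; i++) {
            struct descriptor *d = alloc_descriptor();
            printf("packet %d uses descriptor %p\n", i, (void *)d);
            release_descriptor(d);
        }
        while (completion_queue) {                 /* final cleanup        */
            struct descriptor *d = completion_queue;
            completion_queue = d->next;
            free(d);
        }
        return 0;
    }

In steady state the allocator almost never reaches malloc(), which is precisely the CPU saving that the profiling results in Section VI attribute to the patch.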
Summarizing, our optimized NAPI kernel image includes the descriptor recycling patch and the k2 version of the e1000 driver, and it has been configured with the following optimized values: I) the Rx and Tx ring buffers have been set to the maximum value of 4096 descriptors; II) the Rx interrupt generation has not been limited; III) the qdisc size for all the adapters has been set to 20,000 descriptors; IV) the NAPI quota parameter has been set to 23 descriptors; V) the scheduler clock frequency has been fixed to 100 Hz.

Figure 4. Detailed scheme of the forwarding operations in the 2.6 NAPI kernel.
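Since several of the values above (the Tx ring and qdisc sizes) dimension the transmission path described in Section II, the following self-contained C model recaps how that path buffers packets: descriptors flow from the egress qdisc to the tx_ring and, after transmission, to the completion queue. The queue lengths are toy values chosen to show an overflow quickly, not the tuned ones of Section III, and the functions only echo the kernel names.

    /* Illustrative user-space model of the Linux transmission path:
     * dev_queue_xmit() enqueues descriptors in the egress qdisc,
     * qdisc_restart() moves them to the tx_ring while space is available,
     * and completed transmissions park them in the completion queue. */
    #include <stdio.h>

    #define QDISC_LEN   4       /* egress qdisc capacity (toy value)       */
    #define TX_RING_LEN 4       /* hardware tx_ring capacity (toy value)   */

    static int qdisc, tx_ring, completion_queue, dropped;

    static void dev_queue_xmit(void)            /* packet arrives from IP layer */
    {
        if (qdisc < QDISC_LEN)
            qdisc++;
        else
            dropped++;                          /* qdisc full: packet is lost   */
    }

    static void qdisc_restart(void)             /* fill the tx_ring from qdisc  */
    {
        while (qdisc > 0 && tx_ring < TX_RING_LEN) {
            qdisc--;
            tx_ring++;                          /* hard_start_xmit() equivalent */
        }
    }

    static void tx_interrupt(int sent)          /* NIC reports transmissions    */
    {
        while (sent-- > 0 && tx_ring > 0) {
            tx_ring--;
            completion_queue++;                 /* descriptor awaits recycling  */
        }
    }

    int main(void)
    {
        /* offer two packets per step while the NIC completes only one:
         * the qdisc absorbs the excess until it overflows and drops start */
        for (int step = 0; step < 10; step++) {
            dev_queue_xmit();
            dev_queue_xmit();
            qdisc_restart();
            tx_interrupt(1);
            printf("step %d: qdisc=%d tx_ring=%d completed=%d dropped=%d\n",
                   step, qdisc, tx_ring, completion_queue, dropped);
        }
        return 0;
    }

With the toy qdisc length the drops appear after a few steps; with the 20,000-descriptor qdisc adopted above, the same offered excess would instead be absorbed, which is the rationale given earlier for moving losses away from the egress buffers.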

IV. BENCHMARKING TOOLS

A reliable performance evaluation of network equipment that forwards packets at Gigabit/s speed is not a simple task, and it generally requires an ad-hoc hardware platform: it is quite well known that the common generation/measurement software tools [25-27] running on PCs, even if widely used, do not provide an adequate level of accuracy. In particular, in the presence of high speed interfaces, significant limitations arise (i.e., on the maximum sustainable rate, and high packet generation/measurement latency), due to the bottleneck of the PC internal busses and the lack of computational resources. The alternative is to use professional products, which are generally realized with ad hoc hardware and can therefore guarantee high performance (i.e., a precision of a few nanoseconds both in generation and in measurement); besides being quite expensive, they cannot be completely and easily customized or modified, sometimes resulting in less flexibility than researchers need. Starting from these considerations, we have decided to build a traffic generator that combines the performance of ad hoc hardware elements with the flexibility of software based solutions. To achieve this result, we have decided to develop this traffic generator, namely PktGen, on custom highly programmable hardware, i.e., network processors [28]: in particular, we have decided to use a Radisys ENP-2611 evaluation board [29] that includes three Gigabit Ethernet interfaces and the Intel IXP 2400 network processor [30]. Analyzing the PktGen features thoroughly, we have to highlight that it has been designed not only to generate traffic at high speed with an adequate precision level, but also to generate a high number of different traffic profiles simultaneously. The latter can have very different (deterministic or stochastic) characteristics. In particular, PktGen can act independently for each traffic profile by tuning both the time parameters (i.e., interarrival times and burst lengths) and the packet templates (i.e., a set of specific rules to build the packets and to fill the header fields). A detailed description of the PktGen architecture can be found in [31]. In spite of the good performance level provided by this generation tool, to analyze the OR performance with the RFC 2544 tests [32, 33] we also need a measurement tool able to reliably estimate both the throughput and the latency provided by the System Under Test (SUT). Thus, to benchmark the OR forwarding performance and to test the PktGen capabilities, we have finally decided to use a professional equipment, namely the Agilent N2X Router Tester [34]. The latter allows obtaining throughput and latency measurements with very high availability and accuracy levels (i.e., the minimum guaranteed timestamp resolution is 10 ns). Moreover, with two dual Gigabit Ethernet test cards and one 16-port Fast Ethernet card at our disposal, it allows us to analyze the OR behaviour in the presence of a high number of heterogeneous interfaces. To better support the performance analysis and to identify the OR bottlenecks, we have also performed some internal measurements, obtained by using specific software tools (called profilers) placed inside the OR, which are able to trace the percentage of CPU utilization of each software module running on the node. Internal measurements are very useful for the identification of the architecture bottlenecks.
The problem is that many of these tools require a relevant computational effort, which perturbs the system performance and thus makes the results meaningless. In this respect, their correct choice is a strategic point. We have verified with many different tests that one of the best is Oprofile [35], an open source tool that realizes a continuous monitoring of the system dynamics with a frequent and quite regular sampling of the CPU hardware registers. Oprofile allows evaluating, in a very effective and deep way, the CPU utilization of both each software application and each single kernel function running in the system, with a very low computational overhead.

V. BENCHMARKING SCENARIO

To analyze the OR forwarding performance in depth, we have chosen to start by defining a reasonable set of benchmarking setups with an increasing level of complexity, and for each selected setup to apply some of the tests defined in [32]. In particular, we have chosen to perform these test activities by using both a core router configuration and an edge router one: the former is composed of a few high-speed (Gigabit Ethernet) network interfaces, while the latter is composed of a high-speed gateway interface and a high number of Fast Ethernet cards, which collect the traffic from the access networks. More in detail, we have performed our tests by using the following setups:
- Setup A (Figure 5): a single mono-directional flow crosses the OR from one Gigabit port to another;
- Setup B (Figure 6): two full duplex flows cross the OR, each one using a different pair of Gigabit ports;
- Setup C (Figure 7): the OR includes 4 Gigabit Ethernet ports and a full-meshed (and full-duplex) traffic matrix is applied;
- Setup D (Figure 8): the OR includes 1 Gigabit Ethernet port and 12 Fast Ethernet cards; the applied traffic matrix is full-meshed (and full-duplex).

Figure 5. Setup A
Figure 6. Setup B
Figure 7. Setup C
Figure 8. Setup D

The A, B and C setups, which obviously refer to a simple core router working configuration, are quite significant to understand the maximum performance obtainable by the

forwarding mechanism. Indeed, in such cases the OR has to handle the traffic from only a few network interfaces, reducing the computational overhead needed to manage a higher number of interfaces with respect to setup D. In particular, each OR forwarding benchmarking session is essentially composed of three test sets, namely throughput and latency, back-to-back burst length and packet loss rate. Note that all these tests have been performed by using different IP datagram sizes (i.e., 40, 64, 128, 256, 512, 1024 and 1500 bytes) and both CBR and bursty traffic flows. More in detail, we report a short description of all the applied forwarding tests:
1. Throughput and latency tests: this test set is performed by using CBR traffic flows, composed of fixed sized datagrams, to obtain:
   a. the maximum effective throughput, in Kpackets/s and as a percentage of the theoretical value, versus the IP datagram size;
   b. the average, maximum and minimum latencies versus the IP datagram size;
   where throughput and latency are interpreted as defined in [RFC1242].
2. Back-to-back tests: these tests are carried out by using bursty traffic flows and changing both the burst dimension (i.e., the number of packets composing the burst) and the datagram size. The main results for this kind of test are:
   a. the zero loss burst length versus the IP datagram size;
   b. the average, maximum and minimum latencies versus the size of the IP datagrams composing the burst;
   where the zero loss burst length is the maximum number of packets, transmitted with minimum inter-frame gaps, that the System Under Test (SUT) can handle without any loss.
3. Loss rate tests: this kind of test is carried out using CBR traffic flows with different offered loads and IP datagram sizes; the obtainable results can be summarized in:
   a. the throughput versus both the offered load and the IP datagram size.

VI. NUMERICAL RESULTS

In this Section some of the numerical results, obtained with the benchmarking techniques and setups described in the previous Sections, are reported. Moreover, we have decided to consider different Linux kernel configurations and the Click Modular Router as terms of comparison. In particular, we have decided to test the following versions of the Linux kernel:
- dual-processor standard kernel: a standard NAPI kernel version similar to the previous one but with SMP support;
- single-processor optimized kernel: a version based on the standard one with single processor support, which includes the descriptor recycling patch.
The driver parameter tuning for both the previously cited kernel versions includes dimensioning the receive ring buffers to 80 descriptors and the transmission ones to 256 descriptors; the receive and transmission interrupt delays have been set to zero, while the output qdisc buffers have been set to 20,000 descriptors and the scheduler clock to 100 Hz. For what concerns the Click Modular Router, we have used a single processor version mounted on a Linux kernel that includes the driver parameter tuning (both transmission and receive buffers have been set to 256 descriptors). Note that we have chosen not to take into account the SMP versions of the optimized Linux kernel and of the Click Modular Router, as these versions lack a minimum acceptable stability level.

A. Setup A numerical results

In the first benchmarking session, we have performed the RFC 2544 tests by using setup A (with reference to Figure 5) with both the single-processor optimized kernel and Click. As we can observe in Figs.
9, 10 and 11, which report the numerical results of the throughput and latency tests, neither software architecture can achieve the maximum theoretical throughput in the presence of small datagram sizes. As demonstrated by the profiling measurements reported in Fig. 12, obtained with the single processor optimized Linux kernel and with 64 byte datagrams, this effect is clearly caused by the computational CPU capacity, which limits the maximum packet forwarding rate of the Linux kernel to about 700 Kpackets per second. In fact, even if the CPU idle time goes to zero at an offered load of about 300 Kpackets/s, the CPU occupancies of all the most important function sets keep adapting their contributions up to 700 Kpackets/s; after this point their percentage contributions to the CPU utilization remain almost constant. More in particular, Fig. 12 shows that the computational weight of the memory management operations (like sk_buff allocations and de-allocations) is substantially limited, thanks to the descriptor recycling patch, to less than 1%. Both the Tx API (the most onerous operation set, which takes at minimum about 35% of the overall resources) and the interrupt management operations apparently have strange and somewhat similar behaviours: their CPU utilization level, after an initial growth, decreases as the input rate increases. This behaviour is mostly due to two different packet grouping effects in the Tx and Rx APIs: in particular, when the ingress packet rate rises, NAPI tends to moderate the Rx interrupt rate by changing its behaviour from an interrupt-like mechanism to a polling one (so we have a first reduction in the number of interrupts), while the Tx API, in the same condition, can better exploit the packet grouping mechanism by sending more packets at a time (and then the number of interrupts for successful transmission confirmations decreases). About the IP and Ethernet processing operations, we can note that their CPU percentage utilization increases almost linearly with the number of forwarded packets per second. Considerations similar to the previous ones also hold for Click: the performance limitation in the presence of short datagrams continues to be due to a computational bottleneck, but the simple Click packet receive API, based on a polling mechanism, allows it to achieve better

performance in terms of throughput by lowering the computational weight of the IRQ management and Rx API functions. For the same reasons, as shown in Figs. 10 and 11, obtained with the RFC 2544 throughput and latency test, the receive mechanism embedded in Click introduces higher packet latencies. In agreement with the previous results, the back-to-back tests, reported in Fig. 13 and Table I, show that the optimized Linux kernel and Click continue to suffer with small datagrams. In fact, while with 256 byte or larger datagrams the measured zero loss burst length is quite close to the maximum burst length used in the performed tests, it appears to be heavily limited in the presence of 40, 64 and, only for the Linux kernel, 128 byte packets. Apart from the single 128 byte case, where NAPI starts to suffer the computational bottleneck while Click continues to have a forwarding rate very close to the theoretical one, the Linux kernel provides better support to bursty traffic than Click, providing both higher zero loss burst lengths and lower associated latency times. In Fig. 14, the loss rate test results are finally reported.

Figure 9. Throughput and latency test, testbed A: effective throughput results for the single-processor optimized kernel and the Click Modular Router.
Figure 10. Throughput and latency test, testbed A: minimum and maximum latencies for both the single-processor optimized kernel and the Click Modular Router.
Figure 11. Throughput and latency test, testbed A: average latencies for both the single-processor optimized kernel and the Click Modular Router.
Figure 12. Profiling results of the optimized Linux kernel obtained with the testbed setup A.
Figure 13. Back-to-back test, testbed A: maximum zero loss burst lengths.

TABLE I. Back-to-back test, testbed A: latency values for both the single-processor optimized kernel and the Click Modular Router. [The table reports, for each packet length in bytes, the minimum, average and maximum latencies in microseconds for the optimized kernel and for Click; the numerical values are not reproduced in this transcription.]

Figure 14. Loss rate test, testbed A: maximum throughput versus both the offered load and the IP datagram size.

B. Setup B numerical results

In the second benchmarking session we have analyzed the performance achieved by the optimized single processor Linux kernel, by the standard SMP Linux kernel and by the Click Modular Router with the testbed setup B (with reference to Fig. 6). Fig. 15, which reports the maximum effective throughput in terms of forwarded packets per second for a single router interface, shows that, in the presence of short packets, none of the three software architectures provides a performance level close to the theoretical one. More in particular, while the best throughput values are achieved by Click, the SMP kernel seems to provide better forwarding rates with respect to the optimized kernel. In fact, as outlined in [19], if no explicit CPU-interface bindings are present, the SMP kernel processes the received packets (using, if possible, the same CPU for the whole packet elaboration) trying to dynamically share the computational load among the CPUs. Thus, with the considered setup, the computational load sharing tends to manage the two interfaces to which a traffic pair is applied with a single fixed CPU, fully processing each received packet with only one CPU and thus avoiding any memory concurrency problems. Figs. 16 and 17 report the minimum, average and maximum latency values for different datagram sizes obtained with all three considered software architectures. In particular, we can note that both Linux kernels, which provide very similar results in this case, assure minimum latencies lower than Click, while the latter provides better average and maximum latency values for short datagrams.

Figure 15. Throughput and latency test, testbed setup B: effective throughput for both the single-processor optimized kernel and the Click Modular Router.
Figure 16. Throughput and latency test, testbed B: minimum and maximum latencies for both the single-processor optimized kernel and the Click Modular Router.
Figure 17. Throughput and latency test, testbed B: average latencies for both the single-processor optimized kernel and the Click Modular Router.

The back-to-back results, reported in Fig. 18 and Table II, show that the performance level of all the analyzed architectures is almost comparable in terms of zero-loss burst length, while for what concerns the latencies the Linux kernels provide better values with respect to Click.

Figure 18. Back-to-back test, testbed B: maximum zero loss burst lengths.

By analyzing Fig. 19, which reports the loss rate numerical results, we can observe that the performance obtained with Click and the SMP kernel is better, especially for small datagrams, than the one obtained with the optimized single processor kernel. Moreover, Fig. 19 also shows that none of the three OR software architectures achieves the full Gigabit/s speed even for large datagrams, with a maximum forwarding rate of about 650 Mbps per interface. Note that, to improve the readability of the obtained results, we have decided to report in Fig. 19 and in all the following loss rate tests only the OR behavior with the minimum and

the maximum datagram sizes, since they are respectively the performance lower and upper bounds.

Figure 19. Loss rate test, testbed B: maximum throughput versus both the offered load and the IP datagram size.

TABLE II. Back-to-back test, testbed B: latency values for the single-processor optimized kernel, the Click Modular Router and the SMP kernel. [The table reports, for each packet length in bytes, the minimum, average and maximum latencies in microseconds for the optimized kernel, for Click and for the SMP kernel; the numerical values are not reproduced in this transcription.]

C. Setup C numerical results

In the third benchmarking session, the three considered software architectures have been tested in the presence of four Gigabit Ethernet interfaces with a full-meshed traffic matrix (Fig. 7). By analyzing the maximum effective throughput values, reported in Fig. 20, we can note that Click appears to achieve a better performance level with respect to the Linux kernels, while, unlike the previous case, the single processor Linux kernel provides maximum forwarding rates considerably higher than the SMP version. In fact, the SMP kernel tries to share the computational load of the incoming traffic among the CPUs and, in the presence of a full-meshed traffic matrix, this results in an almost static assignment of each CPU to two specific network interfaces. As, in this situation, about half of the forwarded packets cross the OR between two interfaces managed by different CPUs, this causes a performance loss due to memory concurrency problems. Figs. 21 and 22 show the minimum, maximum and average latency values obtained during this test set. Observing them, we can note how the SMP Linux kernel, in the presence of short datagrams, suffers memory concurrency problems that weigh on the OR performance and considerably increase both the average and the maximum latency values. Analyzing Fig. 23 and Table III, which report the back-to-back test results, we can note that, on the one hand, all three OR architectures achieve a similar zero-loss burst length, while on the other hand, Click reaches very high average and maximum latencies with respect to the single-processor and SMP kernels. The loss-rate results, shown in Fig. 24, highlight the performance decay of the Linux SMP kernel, while a fairly similar behaviour is achieved by the other two architectures. Moreover, as in the previous benchmarking session, the maximum forwarding rate for each Gigabit network interface is limited to about 600/650 Mbps.

Figure 20. Throughput and latency test, setup C: effective throughput results for the single-processor optimized kernel, the Click Modular Router and the SMP kernel.
Figure 21. Throughput and latency test, testbed C: minimum and maximum latencies for the single-processor optimized kernel, the Click Modular Router and the SMP kernel.
Figure 22. Throughput and latency test, testbed C: average latencies for the single-processor optimized kernel, the Click Modular Router and the SMP kernel.

Figure 23. Back-to-back test, testbed C: maximum zero loss burst lengths.

TABLE III. Back-to-back test, testbed C: latency values for the single-processor optimized kernel, the Click Modular Router and the SMP kernel. [The table reports, for each packet length in bytes, the minimum, average and maximum latencies in microseconds for the optimized kernel, for Click and for the SMP kernel; the numerical values are not reproduced in this transcription.]

Figure 24. Loss rate test, testbed C: maximum throughput versus both the offered load and the IP datagram size.

D. Setup D numerical results

In the last benchmarking session, we have applied setup D, which provides a full-meshed traffic matrix between one Gigabit Ethernet and 12 Fast Ethernet interfaces, to the single-processor Linux kernel and to the SMP version. Note that we have decided not to use Click in this last test since, at the moment and for this software architecture, there are no drivers with polling support for the used Fast Ethernet interfaces. By analyzing the throughput and latency results in Figs. 25, 26 and 27, we can note how, in the presence of a high number of interfaces and of a full-meshed traffic matrix, the SMP kernel version annihilates its performance: as shown in Fig. 25, the maximum measured value of the effective throughput is limited to about 2400 packets/s, and the corresponding latencies appear clearly higher with respect to the ones obtained with the single processor kernel. However, it can be highlighted that the single processor kernel also does not support the maximum theoretical rate: in particular, it achieves about 10% of the full speed in the presence of short datagrams and about 75% for large datagram sizes.

Figure 25. Throughput and latency test, setup D: effective throughput results for both the single-processor optimized kernel and the SMP kernel.
Figure 26. Throughput and latency test, testbed D: minimum and maximum latencies for both the single-processor optimized kernel and the SMP kernel.
Figure 27. Throughput and latency test, testbed D: average latencies for both the single-processor optimized kernel and the SMP kernel.

To better understand why the OR does not reach full speed with such a high number of Fast Ethernet interfaces, we have decided to perform several profiling tests. In particular, these tests were carried out by using two simple traffic matrixes: the first (Fig. 28) is composed of 12 CBR flows that cross the OR from the Fast Ethernet interfaces to the Gigabit one, while the second (Fig. 29) is composed of 12 CBR flows that cross the OR in the opposite direction (i.e., from the Gigabit to the Fast Ethernet interfaces). These simple traffic matrixes allow us to separately analyze the receive and transmission operations in this context. Figs. 28 and 29 report the profiling results corresponding to the two traffic matrixes. The obtained internal measurements in


More information

Computer Systems Structure Input/Output

Computer Systems Structure Input/Output Computer Systems Structure Input/Output Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Examples of I/O Devices

More information

Scalable Layer-2/Layer-3 Multistage Switching Architectures for Software Routers

Scalable Layer-2/Layer-3 Multistage Switching Architectures for Software Routers Scalable Layer-2/Layer-3 Multistage Switching Architectures for Software Routers Andrea Bianco, Jorge M. Finochietto, Giulio Galante, Marco Mellia, Davide Mazzucchi, Fabio Neri, Dipartimento di Elettronica,

More information

High-Speed TCP Performance Characterization under Various Operating Systems

High-Speed TCP Performance Characterization under Various Operating Systems High-Speed TCP Performance Characterization under Various Operating Systems Y. Iwanaga, K. Kumazoe, D. Cavendish, M.Tsuru and Y. Oie Kyushu Institute of Technology 68-4, Kawazu, Iizuka-shi, Fukuoka, 82-852,

More information

How To Provide Qos Based Routing In The Internet

How To Provide Qos Based Routing In The Internet CHAPTER 2 QoS ROUTING AND ITS ROLE IN QOS PARADIGM 22 QoS ROUTING AND ITS ROLE IN QOS PARADIGM 2.1 INTRODUCTION As the main emphasis of the present research work is on achieving QoS in routing, hence this

More information

The Bus (PCI and PCI-Express)

The Bus (PCI and PCI-Express) 4 Jan, 2008 The Bus (PCI and PCI-Express) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the

More information

High-performance vswitch of the user, by the user, for the user

High-performance vswitch of the user, by the user, for the user A bird in cloud High-performance vswitch of the user, by the user, for the user Yoshihiro Nakajima, Wataru Ishida, Tomonori Fujita, Takahashi Hirokazu, Tomoya Hibi, Hitoshi Matsutahi, Katsuhiro Shimano

More information

Networking Driver Performance and Measurement - e1000 A Case Study

Networking Driver Performance and Measurement - e1000 A Case Study Networking Driver Performance and Measurement - e1000 A Case Study John A. Ronciak Intel Corporation john.ronciak@intel.com Ganesh Venkatesan Intel Corporation ganesh.venkatesan@intel.com Jesse Brandeburg

More information

TCP Offload Engines. As network interconnect speeds advance to Gigabit. Introduction to

TCP Offload Engines. As network interconnect speeds advance to Gigabit. Introduction to Introduction to TCP Offload Engines By implementing a TCP Offload Engine (TOE) in high-speed computing environments, administrators can help relieve network bottlenecks and improve application performance.

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

Open Flow Controller and Switch Datasheet

Open Flow Controller and Switch Datasheet Open Flow Controller and Switch Datasheet California State University Chico Alan Braithwaite Spring 2013 Block Diagram Figure 1. High Level Block Diagram The project will consist of a network development

More information

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Measuring Cache and Memory Latency and CPU to Memory Bandwidth White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary

More information

Performance Evaluation of Linux Bridge

Performance Evaluation of Linux Bridge Performance Evaluation of Linux Bridge James T. Yu School of Computer Science, Telecommunications, and Information System (CTI) DePaul University ABSTRACT This paper studies a unique network feature, Ethernet

More information

A High Performance IP Traffic Generation Tool Based on the Intel IXP2400 Network Processor

A High Performance IP Traffic Generation Tool Based on the Intel IXP2400 Network Processor A High Performance IP Traffic Generation Tool Based on the Intel IXP2400 Network Processor Raffaele Bolla, Roberto Bruschi, Marco Canini, and Matteo Repetto Department of Communications, Computer and Systems

More information

A NOVEL RESOURCE EFFICIENT DMMS APPROACH

A NOVEL RESOURCE EFFICIENT DMMS APPROACH A NOVEL RESOURCE EFFICIENT DMMS APPROACH FOR NETWORK MONITORING AND CONTROLLING FUNCTIONS Golam R. Khan 1, Sharmistha Khan 2, Dhadesugoor R. Vaman 3, and Suxia Cui 4 Department of Electrical and Computer

More information

Architecture of distributed network processors: specifics of application in information security systems

Architecture of distributed network processors: specifics of application in information security systems Architecture of distributed network processors: specifics of application in information security systems V.Zaborovsky, Politechnical University, Sait-Petersburg, Russia vlad@neva.ru 1. Introduction Modern

More information

Lustre Networking BY PETER J. BRAAM

Lustre Networking BY PETER J. BRAAM Lustre Networking BY PETER J. BRAAM A WHITE PAPER FROM CLUSTER FILE SYSTEMS, INC. APRIL 2007 Audience Architects of HPC clusters Abstract This paper provides architects of HPC clusters with information

More information

UPPER LAYER SWITCHING

UPPER LAYER SWITCHING 52-20-40 DATA COMMUNICATIONS MANAGEMENT UPPER LAYER SWITCHING Gilbert Held INSIDE Upper Layer Operations; Address Translation; Layer 3 Switching; Layer 4 Switching OVERVIEW The first series of LAN switches

More information

Technical Bulletin. Arista LANZ Overview. Overview

Technical Bulletin. Arista LANZ Overview. Overview Technical Bulletin Arista LANZ Overview Overview Highlights: LANZ provides unparalleled visibility into congestion hotspots LANZ time stamping provides for precision historical trending for congestion

More information

Tyche: An efficient Ethernet-based protocol for converged networked storage

Tyche: An efficient Ethernet-based protocol for converged networked storage Tyche: An efficient Ethernet-based protocol for converged networked storage Pilar González-Férez and Angelos Bilas 30 th International Conference on Massive Storage Systems and Technology MSST 2014 June

More information

Computer Organization & Architecture Lecture #19

Computer Organization & Architecture Lecture #19 Computer Organization & Architecture Lecture #19 Input/Output The computer system s I/O architecture is its interface to the outside world. This architecture is designed to provide a systematic means of

More information

The proliferation of the raw processing

The proliferation of the raw processing TECHNOLOGY CONNECTED Advances with System Area Network Speeds Data Transfer between Servers with A new network switch technology is targeted to answer the phenomenal demands on intercommunication transfer

More information

4 Internet QoS Management

4 Internet QoS Management 4 Internet QoS Management Rolf Stadler School of Electrical Engineering KTH Royal Institute of Technology stadler@ee.kth.se September 2008 Overview Network Management Performance Mgt QoS Mgt Resource Control

More information

Network Layer: Network Layer and IP Protocol

Network Layer: Network Layer and IP Protocol 1 Network Layer: Network Layer and IP Protocol Required reading: Garcia 7.3.3, 8.1, 8.2.1 CSE 3213, Winter 2010 Instructor: N. Vlajic 2 1. Introduction 2. Router Architecture 3. Network Layer Protocols

More information

Smart Queue Scheduling for QoS Spring 2001 Final Report

Smart Queue Scheduling for QoS Spring 2001 Final Report ENSC 833-3: NETWORK PROTOCOLS AND PERFORMANCE CMPT 885-3: SPECIAL TOPICS: HIGH-PERFORMANCE NETWORKS Smart Queue Scheduling for QoS Spring 2001 Final Report By Haijing Fang(hfanga@sfu.ca) & Liu Tang(llt@sfu.ca)

More information

High-Density Network Flow Monitoring

High-Density Network Flow Monitoring Petr Velan petr.velan@cesnet.cz High-Density Network Flow Monitoring IM2015 12 May 2015, Ottawa Motivation What is high-density flow monitoring? Monitor high traffic in as little rack units as possible

More information

Introduction to PCI Express Positioning Information

Introduction to PCI Express Positioning Information Introduction to PCI Express Positioning Information Main PCI Express is the latest development in PCI to support adapters and devices. The technology is aimed at multiple market segments, meaning that

More information

基 於 SDN 與 可 程 式 化 硬 體 架 構 之 雲 端 網 路 系 統 交 換 器

基 於 SDN 與 可 程 式 化 硬 體 架 構 之 雲 端 網 路 系 統 交 換 器 基 於 SDN 與 可 程 式 化 硬 體 架 構 之 雲 端 網 路 系 統 交 換 器 楊 竹 星 教 授 國 立 成 功 大 學 電 機 工 程 學 系 Outline Introduction OpenFlow NetFPGA OpenFlow Switch on NetFPGA Development Cases Conclusion 2 Introduction With the proposal

More information

Linux Driver Devices. Why, When, Which, How?

Linux Driver Devices. Why, When, Which, How? Bertrand Mermet Sylvain Ract Linux Driver Devices. Why, When, Which, How? Since its creation in the early 1990 s Linux has been installed on millions of computers or embedded systems. These systems may

More information

LCMON Network Traffic Analysis

LCMON Network Traffic Analysis LCMON Network Traffic Analysis Adam Black Centre for Advanced Internet Architectures, Technical Report 79A Swinburne University of Technology Melbourne, Australia adamblack@swin.edu.au Abstract The Swinburne

More information

Autonomous NetFlow Probe

Autonomous NetFlow Probe Autonomous Ladislav Lhotka lhotka@cesnet.cz Martin Žádník xzadni00@stud.fit.vutbr.cz TF-CSIRT meeting, September 15, 2005 Outline 1 2 Specification Hardware Firmware Software 3 4 Short-term fixes Test

More information

Router Architectures

Router Architectures Router Architectures An overview of router architectures. Introduction What is a Packet Switch? Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers 2 1 Router Components

More information

Accelerating High-Speed Networking with Intel I/O Acceleration Technology

Accelerating High-Speed Networking with Intel I/O Acceleration Technology White Paper Intel I/O Acceleration Technology Accelerating High-Speed Networking with Intel I/O Acceleration Technology The emergence of multi-gigabit Ethernet allows data centers to adapt to the increasing

More information

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring CESNET Technical Report 2/2014 HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring VIKTOR PUš, LUKÁš KEKELY, MARTIN ŠPINLER, VÁCLAV HUMMEL, JAN PALIČKA Received 3. 10. 2014 Abstract

More information

Getting the most TCP/IP from your Embedded Processor

Getting the most TCP/IP from your Embedded Processor Getting the most TCP/IP from your Embedded Processor Overview Introduction to TCP/IP Protocol Suite Embedded TCP/IP Applications TCP Termination Challenges TCP Acceleration Techniques 2 Getting the most

More information

SYSTEM ecos Embedded Configurable Operating System

SYSTEM ecos Embedded Configurable Operating System BELONGS TO THE CYGNUS SOLUTIONS founded about 1989 initiative connected with an idea of free software ( commercial support for the free software ). Recently merged with RedHat. CYGNUS was also the original

More information

Leased Line + Remote Dial-in connectivity

Leased Line + Remote Dial-in connectivity Leased Line + Remote Dial-in connectivity Client: One of the TELCO offices in a Southern state. The customer wanted to establish WAN Connectivity between central location and 10 remote locations. The customer

More information

Scaling Networking Applications to Multiple Cores

Scaling Networking Applications to Multiple Cores Scaling Networking Applications to Multiple Cores Greg Seibert Sr. Technical Marketing Engineer Cavium Networks Challenges with multi-core application performance Amdahl s Law Evaluates application performance

More information

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance

TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance M. Rangarajan, A. Bohra, K. Banerjee, E.V. Carrera, R. Bianchini, L. Iftode, W. Zwaenepoel. Presented

More information

Switch Fabric Implementation Using Shared Memory

Switch Fabric Implementation Using Shared Memory Order this document by /D Switch Fabric Implementation Using Shared Memory Prepared by: Lakshmi Mandyam and B. Kinney INTRODUCTION Whether it be for the World Wide Web or for an intra office network, today

More information

Monitoring high-speed networks using ntop. Luca Deri <deri@ntop.org>

Monitoring high-speed networks using ntop. Luca Deri <deri@ntop.org> Monitoring high-speed networks using ntop Luca Deri 1 Project History Started in 1997 as monitoring application for the Univ. of Pisa 1998: First public release v 0.4 (GPL2) 1999-2002:

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

Per-Flow Queuing Allot's Approach to Bandwidth Management

Per-Flow Queuing Allot's Approach to Bandwidth Management White Paper Per-Flow Queuing Allot's Approach to Bandwidth Management Allot Communications, July 2006. All Rights Reserved. Table of Contents Executive Overview... 3 Understanding TCP/IP... 4 What is Bandwidth

More information

Network Simulation Traffic, Paths and Impairment

Network Simulation Traffic, Paths and Impairment Network Simulation Traffic, Paths and Impairment Summary Network simulation software and hardware appliances can emulate networks and network hardware. Wide Area Network (WAN) emulation, by simulating

More information

Comparing and Improving Current Packet Capturing Solutions based on Commodity Hardware

Comparing and Improving Current Packet Capturing Solutions based on Commodity Hardware Comparing and Improving Current Packet Capturing Solutions based on Commodity Hardware Lothar Braun, Alexander Didebulidze, Nils Kammenhuber, Georg Carle Technische Universität München Institute for Informatics

More information

Application Note. Windows 2000/XP TCP Tuning for High Bandwidth Networks. mguard smart mguard PCI mguard blade

Application Note. Windows 2000/XP TCP Tuning for High Bandwidth Networks. mguard smart mguard PCI mguard blade Application Note Windows 2000/XP TCP Tuning for High Bandwidth Networks mguard smart mguard PCI mguard blade mguard industrial mguard delta Innominate Security Technologies AG Albert-Einstein-Str. 14 12489

More information

CS 78 Computer Networks. Internet Protocol (IP) our focus. The Network Layer. Interplay between routing and forwarding

CS 78 Computer Networks. Internet Protocol (IP) our focus. The Network Layer. Interplay between routing and forwarding CS 78 Computer Networks Internet Protocol (IP) Andrew T. Campbell campbell@cs.dartmouth.edu our focus What we will lean What s inside a router IP forwarding Internet Control Message Protocol (ICMP) IP

More information

Packet Capture in 10-Gigabit Ethernet Environments Using Contemporary Commodity Hardware

Packet Capture in 10-Gigabit Ethernet Environments Using Contemporary Commodity Hardware Packet Capture in 1-Gigabit Ethernet Environments Using Contemporary Commodity Hardware Fabian Schneider Jörg Wallerich Anja Feldmann {fabian,joerg,anja}@net.t-labs.tu-berlin.de Technische Universtität

More information

EVALUATING THE NETWORKING PERFORMANCE OF LINUX-BASED HOME ROUTER PLATFORMS FOR MULTIMEDIA SERVICES. Ingo Kofler, Robert Kuschnig, Hermann Hellwagner

EVALUATING THE NETWORKING PERFORMANCE OF LINUX-BASED HOME ROUTER PLATFORMS FOR MULTIMEDIA SERVICES. Ingo Kofler, Robert Kuschnig, Hermann Hellwagner EVALUATING THE NETWORKING PERFORMANCE OF LINUX-BASED HOME ROUTER PLATFORMS FOR MULTIMEDIA SERVICES Ingo Kofler, Robert Kuschnig, Hermann Hellwagner Institute of Information Technology (ITEC) Alpen-Adria-Universität

More information

Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University

Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University Napatech - Sharkfest 2009 1 Presentation Overview About Napatech

More information

Virtualised MikroTik

Virtualised MikroTik Virtualised MikroTik MikroTik in a Virtualised Hardware Environment Speaker: Tom Smyth CTO Wireless Connect Ltd. Event: MUM Krackow Feb 2008 http://wirelessconnect.eu/ Copyright 2008 1 Objectives Understand

More information

Latency on a Switched Ethernet Network

Latency on a Switched Ethernet Network Application Note 8 Latency on a Switched Ethernet Network Introduction: This document serves to explain the sources of latency on a switched Ethernet network and describe how to calculate cumulative latency

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

Programmable Networking with Open vswitch

Programmable Networking with Open vswitch Programmable Networking with Open vswitch Jesse Gross LinuxCon September, 2013 2009 VMware Inc. All rights reserved Background: The Evolution of Data Centers Virtualization has created data center workloads

More information

Technical Bulletin. Enabling Arista Advanced Monitoring. Overview

Technical Bulletin. Enabling Arista Advanced Monitoring. Overview Technical Bulletin Enabling Arista Advanced Monitoring Overview Highlights: Independent observation networks are costly and can t keep pace with the production network speed increase EOS eapi allows programmatic

More information

Local-Area Network -LAN

Local-Area Network -LAN Computer Networks A group of two or more computer systems linked together. There are many [types] of computer networks: Peer To Peer (workgroups) The computers are connected by a network, however, there

More information

Course 12 Synchronous transmission multiplexing systems used in digital telephone networks

Course 12 Synchronous transmission multiplexing systems used in digital telephone networks Course 12 Synchronous transmission multiplexing systems used in digital telephone networks o Disadvantages of the PDH transmission multiplexing system PDH: no unitary international standardization of the

More information

How To Monitor And Test An Ethernet Network On A Computer Or Network Card

How To Monitor And Test An Ethernet Network On A Computer Or Network Card 3. MONITORING AND TESTING THE ETHERNET NETWORK 3.1 Introduction The following parameters are covered by the Ethernet performance metrics: Latency (delay) the amount of time required for a frame to travel

More information

Open-source routing at 10Gb/s

Open-source routing at 10Gb/s Open-source routing at Gb/s Olof Hagsand, Robert Olsson and Bengt Gördén, Royal Institute of Technology (KTH), Sweden Email: {olofh, gorden}@kth.se Uppsala University, Uppsala, Sweden Email: robert.olsson@its.uu.se

More information

Insiders View: Network Security Devices

Insiders View: Network Security Devices Insiders View: Network Security Devices Dennis Cox CTO @ BreakingPoint Systems CanSecWest/Core06 Vancouver, April 2006 Who am I? Chief Technology Officer - BreakingPoint Systems Director of Engineering

More information

SAN Conceptual and Design Basics

SAN Conceptual and Design Basics TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer

More information

ncap: Wire-speed Packet Capture and Transmission

ncap: Wire-speed Packet Capture and Transmission ncap: Wire-speed Packet Capture and Transmission L. Deri ntop.org Pisa Italy deri@ntop.org Abstract With the increasing network speed, it is no longer possible to capture and transmit network packets at

More information

An Oracle Technical White Paper November 2011. Oracle Solaris 11 Network Virtualization and Network Resource Management

An Oracle Technical White Paper November 2011. Oracle Solaris 11 Network Virtualization and Network Resource Management An Oracle Technical White Paper November 2011 Oracle Solaris 11 Network Virtualization and Network Resource Management Executive Overview... 2 Introduction... 2 Network Virtualization... 2 Network Resource

More information

Benchmarking Virtual Switches in OPNFV draft-vsperf-bmwg-vswitch-opnfv-00. Maryam Tahhan Al Morton

Benchmarking Virtual Switches in OPNFV draft-vsperf-bmwg-vswitch-opnfv-00. Maryam Tahhan Al Morton Benchmarking Virtual Switches in OPNFV draft-vsperf-bmwg-vswitch-opnfv-00 Maryam Tahhan Al Morton Introduction Maryam Tahhan Network Software Engineer Intel Corporation (Shannon Ireland). VSPERF project

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Laura Knapp WW Business Consultant Laurak@aesclever.com Applied Expert Systems, Inc. 2011 1 Background

More information

Unified Fabric: Cisco's Innovation for Data Center Networks

Unified Fabric: Cisco's Innovation for Data Center Networks . White Paper Unified Fabric: Cisco's Innovation for Data Center Networks What You Will Learn Unified Fabric supports new concepts such as IEEE Data Center Bridging enhancements that improve the robustness

More information

How To Improve Performance On A Linux Based Router

How To Improve Performance On A Linux Based Router Linux Based Router Over 10GE LAN Cheng Cui, Chui-hui Chiu, and Lin Xue Department of Computer Science Louisiana State University, LA USA Abstract High speed routing with 10Gbps link speed is still very

More information

A Comparative Study on Vega-HTTP & Popular Open-source Web-servers

A Comparative Study on Vega-HTTP & Popular Open-source Web-servers A Comparative Study on Vega-HTTP & Popular Open-source Web-servers Happiest People. Happiest Customers Contents Abstract... 3 Introduction... 3 Performance Comparison... 4 Architecture... 5 Diagram...

More information

PART III. OPS-based wide area networks

PART III. OPS-based wide area networks PART III OPS-based wide area networks Chapter 7 Introduction to the OPS-based wide area network 7.1 State-of-the-art In this thesis, we consider the general switch architecture with full connectivity

More information