RFC 2544 Performance Evaluation for a Linux Based Open Router
Raffaele Bolla, Roberto Bruschi
DIST - Department of Communications, Computer and Systems Science, University of Genoa, Via Opera Pia 13, Genova, Italy
{raffaele.bolla, roberto.bruschi}@unige.it

Abstract - Nowadays, networking equipment is built on decentralized architectures that often include special-purpose hardware elements. These elements considerably improve performance, but they also limit flexibility: it is very difficult both to obtain details about internal operations and to perform any intervention more complex than a parameter configuration. The experimental nature of the Internet and its diffusion in many contexts sometimes call for a different approach. This need is most evident inside the scientific community, which often encounters many difficulties in realizing experiments. Recent technological advances offer a good chance to do something really effective in the field of open Internet equipment, also called Open Routers (ORs). Starting from these considerations, several initiatives have been launched in the last few years to investigate the OR and related issues, but large and interesting areas still require deeper investigation. This work contributes by reporting the results of an in-depth optimization and testing activity carried out on a PC Open Router architecture based on Linux software and COTS hardware. The main goal of this paper is the forwarding performance evaluation of different Linux-based OR software architectures. The analysis has been performed with both external (throughput and latency) and internal (profiling) measurements. In particular, for the external measurements, a set of RFC 2544 compliant tests has been proposed and analyzed.

Keywords: Open Router; RFC 2544; IP forwarding.

I.
INTRODUCTION

Internet technology has been developed in an open environment: all Internet-related protocols, architectures and structures are publicly created and described. For this reason, in principle, anyone can develop a piece of Internet equipment (e.g., a router). On the contrary, and somewhat surprisingly, most professional equipment is realized in a very closed way: it is very difficult both to obtain details about internal operations and to perform any intervention more complex than a parametric configuration. Generally speaking, this is not so strange, as it is a clear attempt to protect the industrial investment. But sometimes the experimental nature of the Internet and its diffusion in many contexts suggest a different approach. This need is most evident inside the scientific community, which often faces many difficulties in realizing experiments, test-beds and trials for the evaluation of new functionalities, protocols and control mechanisms. The commercial world also frequently asks for a more open approach, like the one suggested by the Open Source philosophy for software, especially where network functions must be embedded in products whose main aim is not basic network capability. Today, recent technological advances give a good chance to do something really effective in the field of open Internet equipment, sometimes called Open Routers (ORs). This possibility comes, on the software side, from Open Source Operating Systems (OSs) like Linux and FreeBSD (which have sophisticated and complete networking capabilities), and, on the hardware side, from COTS/PC components (whose performance keeps increasing while costs decrease). The attractiveness of the OR solution can be summarized in multi-vendor availability, low cost and the continuous update/evolution of its basic parts.
Starting from these considerations, several initiatives have been launched in the last few years to investigate the Open Router and related issues. In the software area, one of the most important is the Click Modular Router Project [1-3] from MIT, which proposes an effective solution for the data plane. In the control plane area, two important projects can be cited: Zebra [4], with its evolution Quagga [5], and Xorp [6, 7]. Apart from custom developments, some standard Open Source OSs can also give very effective support to an OR realization; the most relevant in this sense are Linux [8-10] and FreeBSD [11]. Other activities are focused on hardware: [12] and [13] propose a router architecture based on a PC cluster, and [14] reports some performance results (in packet transmission and reception) obtained with a PC Linux-based test-bed. Some evaluations have also been carried out on network boards; see, for example, [15]. Despite these activities, large and interesting areas still require deeper investigation: the identification of the most appropriate HW structure, the comparison of different SW solutions (specific Linux kernel versions, FreeBSD, Click, etc.), the identification of the best software configurations with an indication of the most significant parameters, the accurate characterization of open node performance, and the identification of SW and HW bottlenecks. This work tries to contribute to the investigation of the above aspects by reporting the results of a
deep optimization and testing activity carried out on an OR architecture based on Linux software, realized within the EURO project [16]. We have focused our effort on testing the packet forwarding functionality. Our main objective has been the performance evaluation of an optimized OR, with both external (throughput) and internal (profiling) measurements. Moreover, specific tests have been performed in an IP QoS-aware context to verify the effectiveness of flow differentiation and the impact of the classification and scheduling functionalities on performance. To this end, we have identified a high-end reference PC-based hardware architecture and the Linux 2.6 kernel for the software data plane, optimized this OR structure, defined a test environment and, finally, carried out a complete series of tests with an accurate evaluation of the role of each software module in determining the performance limits. With the QoS tests we have verified the presence of serious problems in assuring flow separation, but we have also experimented with a feasible solution to this problem, obtained by using advanced capabilities of some modern network boards. The rest of the paper is organized as follows: Section II describes the hardware and software architecture, Section III the software performance tuning, Sections IV and V the benchmarking tools and scenarios, and Section VI the numerical results.

II. HARDWARE AND SOFTWARE ARCHITECTURE

To define the OR reference architecture, we have established some main criteria and used them to select a set of basic elements a priori. The objective has been to obtain a high-end node base structure, able to support top performance with respect to IP packet forwarding and control plane elaborations. The decision process, the criteria and the final selections are described in some detail in the following, separately for the hardware and software elements.

A. The hardware architecture

The PC architecture is a general-purpose one and is not specifically optimized for network operations.
This means that, in principle, it cannot reach the same performance level as custom high-end network equipment, which generally uses dedicated HW elements to handle and parallelize the most critical operations. This characteristic has the greatest impact on data plane performance, where custom equipment usually exploits dedicated ASICs, FPGAs, Network Processors and specific internal buses to provide a high level of parallelism in packet elaboration and exchange. On the other hand, COTS hardware guarantees two very important features: low cost and the fast, continuous evolution of many of its components. Moreover, the performance gap may not be so large, and is in any case more than justified by the cost difference. During networking operations, the PC internal data path uses a centralized I/O structure composed of the I/O bus and the memory channel (both used by DMA to transfer data between the network interfaces and the RAM) and the Front Side Bus (FSB) (used by the CPU, together with the memory channel, to access the RAM during packet elaboration). It is evident that the bandwidth of these buses and the PC computational capacity are the two most critical hardware elements in determining the maximum performance, in terms of both peak bandwidth (in Mbps) and maximum number of forwarded packets per second. The selection criteria have therefore been very fast internal buses and a dual-CPU system with high integer computational power.

Figure 1. Scheme of the packet path in a PC hardware architecture.

With this goal, we have chosen a Supermicro X5DL8-GG mainboard with a ServerWorks GC-LE chipset, whose structure is shown in Fig. 2. This chipset can support a dual-Xeon system with a dual memory channel and a 64-bit PCI-X bus at 133 MHz. The Xeon processors used have a 2.4 GHz clock and a 512 KB cache.
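As a sanity check on the figures above, the peak bus bandwidths can be reproduced with simple arithmetic. The twice-per-packet bus crossing below is our own simplifying assumption about the forwarding data path, not a statement from the paper:

```python
def bus_bandwidth_gbps(width_bits, clock_hz):
    """Peak bandwidth of a parallel bus in Gbit/s."""
    return width_bits * clock_hz / 1e9

# PCI-X as described above: 64 bits at 133 MHz -> about 8.5 Gbit/s per bus.
pcix_gbps = bus_bandwidth_gbps(64, 133e6)

# Assumption (ours): every forwarded packet crosses the I/O bus twice
# (ingress NIC -> RAM via DMA, then RAM -> egress NIC), so a single
# PCI-X bus can carry at most about half its raw bandwidth of routed traffic.
max_routed_gbps = pcix_gbps / 2
```

The same reasoning explains why the memory channel and FSB, which are also crossed for every packet, matter as much as the raw I/O bus speed.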
The memory bandwidth supported by this chipset matches the system bus speed, but the most important point is that the memory is 2-way interleaved, which assures high performance and low average access latencies. The bus that connects the North Bridge to the PCI bridges, namely the IMB, has more bandwidth (more than 25 Gbps) than the maximum combined bandwidth of the two PCI-X buses (16 Gbps) on each I/O bridge.

Figure 2. Scheme of the Supermicro X5DL8-GG mainboard.

The network interfaces are another critical element of the system, as they can heavily condition the PC Router performance. As reported in [15], the network adapters on the market differ in both maximum performance and configurability. We have therefore selected two types of OR adapters with different characteristics. For the first, we have chosen a high-performance Gigabit Ethernet interface, namely the Intel PRO/1000 XT Server, which is equipped with a PCI-X controller supporting the 133 MHz frequency and a wide configuration range for many parameters, such as transmit and receive buffer lengths, maximum interrupt rates and other important features [17]. For the second, we have chosen a D-Link DFE-580TX [18], a network card equipped with four Fast Ethernet interfaces and a PCI 2.1 controller (i.e., 32 bits at a 66 MHz clock frequency). Although a Fast Ethernet adapter cannot considerably influence the OR performance, since it works at lower and less critical speeds than the Gigabit ones, the choice of a quad-port adapter allows
us to analyze the OR behaviour in the presence of a large number of interfaces.

B. The software architecture

The software architecture of an OR has to provide many different functionalities: from those directly involved in the packet forwarding process to those needed for control, dynamic configuration and monitoring. In particular, we have chosen to study and analyze a Linux-based OR framework, as Linux is one of the open source OSs with a large and sophisticated kernel-integrated network support, it is equipped with numerous GNU software applications, and it has been selected in recent years as the framework for a large part of networking research projects. Concerning the Linux OR architecture, as outlined in [1] and [3], while all the forwarding functions are realized inside the Linux kernel, most of the control and monitoring operations run as daemons/applications in user mode. Thus, we have to point out that, unlike in most commercial network equipment, the forwarding and control functionalities have to share the CPUs in the system. In fact, high-end commercial network equipment in particular provides separate computational resources for these processes: the forwarding is managed by a switching fabric, i.e., a switching matrix often realized with ad hoc hardware elements (ASICs, FPGAs and Network Processors), while all the control functionalities are executed by one or more separate processors. [19] reports a detailed description of how the resource sharing between the control plane and the forwarding process can affect the overall performance in different OR configurations (e.g., SMP kernel, single-processor kernel, etc.). Let us now go into the details of the Linux OR architecture: as shown in Fig. 3 and previously sketched, the control plane functionalities run in user space, while the forwarding process is entirely realized inside the kernel.
There are two main sets of functions to consider: the packet forwarding supporting functions (the real switching operation) and the control plane supporting functions (the signalling protocols, such as routing and control protocols). For the control plane, one of the several open source tools that work as applications or daemons in user space, such as Zebra [4], Quagga [5] and Xorp [6, 7], can be used. The critical element for IP forwarding is the kernel, where all the link, network and transport layer operations are realized¹ (see Fig. 3). In recent years, the networking support integrated in the Linux kernel has undergone many structural and refining developments, mostly concerning the packet reception mechanism: it has rapidly evolved from a simple interrupt architecture (adopted up to kernel version 2.2), through a SW interrupt receive mechanism (called SoftNet), to an interrupt moderation one (called NAPI, New API) adopted in more recent kernels.

¹ To be more precise, some layer 2 operations are directly realized in the network interface drivers.

Figure 3. Block diagram of the SW architecture of the Linux PC-Router.

The SoftNet architecture, even if it maintains an interrupt-based structure, improves performance because it lowers the computational overhead of context switching by delaying the elaboration of received packets through interrupt scheduling. In spite of these improvements, this architecture has proved inadequate at medium/high packet rates: in the presence of a high ingress packet rate, the well-known interrupt livelock phenomenon (clearly described in [21]) causes heavy performance deterioration.
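The livelock effect just mentioned can be illustrated with a toy model (ours, not from the paper or from [21]): every arriving packet costs some interrupt-handling time, and only the CPU time left over actually forwards packets, so beyond a certain offered rate the forwarded rate collapses. The cost figures are illustrative, not measurements:

```python
def forwarded_pps(offered_pps, t_irq=5e-6, t_pkt=2e-6):
    """Toy model of receive livelock under a purely interrupt-driven scheme.

    Each arriving packet costs t_irq seconds of interrupt handling;
    whatever CPU fraction remains forwards packets at t_pkt seconds each.
    Illustrative parameters, not measured values.
    """
    irq_load = min(offered_pps * t_irq, 1.0)   # CPU fraction eaten by IRQs
    spare = 1.0 - irq_load                     # CPU fraction left to forward
    return min(offered_pps, spare / t_pkt)
```

With these numbers the node keeps up at 50 Kpps, but at 150 Kpps it already forwards less than it receives, and pushing the offered load higher only reduces the output further; this is exactly the deterioration that NAPI's interrupt moderation avoids.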
The NAPI architecture has been explicitly created to increase system scalability, as it handles network interface requests with an interrupt moderation mechanism that adaptively switches from classical interrupt management of the network interfaces to polling. For these reasons, we have chosen to use a last-generation Linux kernel, in particular a 2.6 version, which, besides NAPI support, has other interesting new features (e.g., kernel preemption and an O(1) scheduler). The forwarding mechanism of all Linux kernels is fundamentally composed of a chain of three different modules: a reception API that handles packet reception (NAPI), a module that carries out the IP layer elaboration and, finally, a transmission API that manages the forwarding operations towards the egress network interfaces. Starting the analysis of the network code structure in 2.6 kernels, we note that all the kernel code follows a zero-copy principle [22]: to avoid unnecessary and onerous memory transfers, packets are left in the memory locations used by the DMA engines of the ingress network interfaces, and each subsequent operation is performed through a sort of pointer to the packet and its key fields, called a descriptor or sk_buff. These descriptors are essentially composed of pointers to the different header fields of the associated packets. A descriptor is allocated and associated to each received packet immediately, in the sub-IP layer, then used in every networking operation, and finally deallocated when the network interface signals the successful transmission. Regarding Symmetric Multi-Processor (SMP) support, in NAPI kernels, to reduce the performance deterioration due to CPU concurrency, the management of each network interface is assigned to a single CPU for both the transmission and reception functionalities.
However, as outlined in [19], this interface-CPU assignment can be dynamically modified: for example, the cpu_affinity module tries to distribute the traffic load coming from the interfaces adaptively over all the CPUs in the system. In short, in the standard forwarding operations the received packets are transferred by the DMA engine of the interface card, transparently to the OS, into memory areas reserved to the kernel, where they wait for the kernel layer elaboration. When the kernel handles the packets, it associates them to descriptors and, then,
it elaborates them one at a time. During the IP layer elaboration, an egress interface is selected for each packet, and its descriptor is placed in a buffer, called the device egress qdisc, to wait for transmission. Let us now analyze in more detail the architecture of the three fundamental modules of the 2.6 networking code: NAPI, the IP processing and the transmission (Tx) API.

1) NAPI

As previously noted, NAPI is the new Linux reception API, proposed to achieve the scalability level needed to support Gigabit network interfaces. It implements an adaptive mechanism that, by using interrupt mitigation, behaves like the classical SoftNet API at low input rates, while, at higher rates, it works like a polling mechanism. Let us analyze the NAPI structure in depth. When the first received packet is placed in the ring buffer, the network interface generates an interrupt. When a CPU receives the interrupt, it invokes the driver's interrupt handler, which calls netif_rx_schedule and __netif_rx_schedule; these:

- record the identifier of the interface that generated the interrupt in a list, called poll_list;
- schedule the NET_RX_SOFTIRQ SW interrupt, instead of immediately serving the requesting device, to delay the context switch;
- disable the interrupt generation, if the network card hardware supports this operation.

All the subsequent packets, received before the kernel scheduler serves the SW interrupt, do not cause any interrupt to the OS, but are moved to a reserved memory location by the DMA engine. To know the address of the reserved memory location, the DMA engine uses a list of available descriptors. This list of descriptors is organized as a ring buffer, called rx_ring, in the kernel reserved memory. If there is not enough space, i.e., no available descriptor, the packets are dropped.
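The reception mechanism just described can be sketched as a small userspace simulation; this is a loose analogue with names mirroring the text, not real kernel code:

```python
from collections import deque

class Nic:
    """Toy model of one interface under the NAPI scheme described above."""
    def __init__(self, ring_size):
        self.rx_ring = deque()       # descriptors the DMA engine can fill
        self.ring_size = ring_size
        self.irq_enabled = True
        self.dropped = 0

poll_list = []                       # devices waiting to be polled

def packet_arrives(nic, pkt):
    if len(nic.rx_ring) >= nic.ring_size:
        nic.dropped += 1             # no free descriptor: packet is dropped
        return
    nic.rx_ring.append(pkt)          # DMA writes into a free descriptor
    if nic.irq_enabled:
        poll_list.append(nic)        # netif_rx_schedule: join the poll_list
        nic.irq_enabled = False      # later packets raise no interrupt

nic = Nic(ring_size=3)
for seq in range(5):                 # a 5-packet back-to-back burst
    packet_arrives(nic, seq)
```

After the burst, the device sits exactly once in the poll_list, its interrupts are disabled, the ring holds three packets, and the two packets in excess of the ring size have been silently dropped.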
The scheduler then wakes the softirq handler (net_rx_action), which scans the poll_list and, for each device it contains:

- calls dev->poll, which processes, using driver functions, a number of packets equal to the minimum between the number of packets in the rx_ring and the quota parameter; for each received packet, the netif_receive_skb function is invoked;
- if the rx_ring of the device becomes empty, clears the identifier from the poll_list and re-enables the interrupt generation of the card;
- if the rx_ring of the device does not become empty, keeps the identifier in the poll_list.

The procedure stops when there are no more device identifiers in the poll_list or when a total number of packets equal to the quota parameter has been processed. Note that the quota parameter sets the maximum number of packets that can be processed before activating other tasks.

2) IP processing

The IP network layer code is generally composed of four important segments, which manage, respectively, the reception, routing, forwarding and transmission of datagrams. More in particular, the reception process is realized by two functions: ip_rcv, which checks the datagram header consistency (i.e., the checksum), and ip_rcv_finish, which decides whether the IP destination is valid and where to send the datagram (to local delivery or to the forwarding chain). If the datagram is not locally delivered, it is elaborated by the routing module, which is structured as a two-level address lookup: first, a fast lookup is made in ip_route_input by using a cache of the most frequently used destination addresses; if the cache misses the requested destination IP, the lookup continues with the ip_route_input_slow routine, which uses the entire IP routing table. Afterwards, ip_forward decreases the TTL value by one.
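The two-level lookup just described (cache first, full table on a miss) can be sketched as follows; the table entries and port names are purely illustrative, and the longest-prefix match stands in for the kernel's real FIB algorithms:

```python
import ipaddress

route_cache = {}                     # fast path: destination -> egress port
routing_table = [                    # slow path; entries are illustrative
    (ipaddress.ip_network("10.0.0.0/8"), "eth1"),
    (ipaddress.ip_network("10.1.0.0/16"), "eth2"),
    (ipaddress.ip_network("0.0.0.0/0"), "eth0"),
]

def route_lookup(dst):
    if dst in route_cache:                       # ip_route_input: cache hit
        return route_cache[dst]
    addr = ipaddress.ip_address(dst)             # ip_route_input_slow:
    hits = [(net, hop) for net, hop in routing_table if addr in net]
    net, hop = max(hits, key=lambda h: h[0].prefixlen)  # longest prefix wins
    route_cache[dst] = hop                       # warm the cache
    return hop
```

The first lookup for a destination walks the whole table; repeated lookups for the same destination are answered from the cache, which is why cache hit rates matter so much for forwarding performance.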
The forwarding action is completed by the ip_forward_finish function which, if there are IP options, invokes ip_forward_options, and then passes the sk_buff directly to the IP transmission module. The first function of the IP transmission module (ip_output) handles datagram fragmentation, if necessary; it then calls the ip_output_finish function, which concludes the IP layer elaboration.

3) Tx API

Another important part of the kernel networking code is the transmission API. It is quite simple and, unlike the Rx one, has not undergone any great revision in recent years. The Tx API is based, in large part, on driver functions that signal to the network interfaces the presence of packets to transmit, or that receive messages about their successful elaboration. In particular, the Tx API elaboration starts by calling, for each packet coming from the IP layer, dev_queue_xmit: this function enqueues the descriptor in a buffer associated to the egress device, called qdisc. Then, when there is enough space in the device tx_ring, the descriptors in the egress qdisc are served by the qdisc_restart function. For each served descriptor, the virtual method hard_start_xmit is called; this method is implemented in the device driver (e.g., for Intel Gigabit cards it is e1000_start_xmit). This function adds the descriptor to the tx_ring and signals to the network device that there are data to transfer. Then, the DMA engine of the egress card transfers the packets from the kernel reserved memory to its hardware ring buffer and transmits them. When the packet transmission has been successfully completed, the network interface generates a hardware interrupt, whose routine schedules a NET_TX_SOFTIRQ software interrupt. Then, the scheduler wakes up the net_tx_action handler, which moves the descriptors of the transmitted packets to a queue, called completion_queue, where they are periodically deallocated.

III. SOFTWARE PERFORMANCE TUNING
As previously sketched and shown in Fig. 4, the whole networking kernel architecture is quite complex and presents several aspects and many parameters that can be tuned for system optimization. As reported in [23], this tuning is very important for the final performance. Some of the optimal parameter values can be identified by logical considerations,
but most of them have to be determined empirically, since their optimal value cannot be easily derived from the software structure and since they also depend on the hardware components. Our tuning has therefore been realized by first identifying the critical parameters on which to operate and then finding the most convenient values through both logical considerations and experimental measurements. The whole kernel architecture includes only three packet buffering points (see Fig. 4): the Rx ring buffers, the egress qdisc queues/Tx ring buffers and the completion queue. While this last queue has proved not to be critical for performance optimization purposes (it only contains descriptors of already sent packets waiting for memory deallocation), finding an optimal configuration of the other two queues is very important to maximize performance. In fact, to avoid useless CPU waste it is preferable to minimize the losses at the egress qdisc buffers, where the packets have already been processed by the IP layer, and to concentrate packet losses at the Rx ring buffers, before processing: this can be easily obtained, for example, by increasing the qdisc buffer length to about 20,000 descriptors. Many network adapter drivers, for example, allow the dimensioning of the ring buffers and of the maximum interrupt rates. Both these parameters have a great influence on NAPI performance. In fact, under a medium-high traffic load, NAPI practically works by polling the network interfaces, which insert themselves in the poll_list with the interrupt generated by the first received packet; all the packets received before the interface's turn in the polling sequence are enqueued in the hardware ring buffers. Thus, while it is clearly desirable not to limit the interrupt rate of network adapters in NAPI kernels, a large ring buffer permits a reduction of packet drops in the presence of bursty traffic.
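Why a larger ring absorbs longer bursts can be shown with a toy simulation (ours; the arrival and drain rates are illustrative): packets arrive back-to-back faster than the poll loop drains the ring, so the ring size bounds the longest burst that survives without drops:

```python
def zero_loss_burst(ring_size, drain_per_slot=1, arrivals_per_slot=4):
    """Longest back-to-back burst survivable without drops, by simulation.

    arrivals_per_slot > drain_per_slot models a line-rate burst arriving
    faster than NAPI polling drains the ring (illustrative numbers).
    """
    burst = 0
    while True:
        burst += 1
        queued = 0
        remaining = burst
        while remaining or queued:
            take = min(remaining, arrivals_per_slot)
            if queued + take > ring_size:
                return burst - 1          # this burst would drop a packet
            queued += take                # DMA fills free descriptors
            remaining -= take
            queued -= min(queued, drain_per_slot)   # poll loop drains
```

With these rates a ring of 4 descriptors survives bursts of 5 packets, while doubling the ring to 8 descriptors doubles the zero-loss burst length to 10, which is the same qualitative behaviour the back-to-back tests of Section V measure on the real node.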
Another interesting parameter to tune is the quota value, which fixes the number of packets that each device can elaborate at every polling cycle. It is also possible to act on some specific 2.6 kernel parameters by customizing them to the specific networking usage: for example, the profiling results (Section VI) show that the kernel scheduler operations uselessly employ about 4-5% of the overall computational resources. To avoid this waste of CPU time, the OS scheduler clock frequency can be decreased: by reducing its value to 100 Hz, the forwarding rate improves by about 20 Kpackets per second (see Section VI). The rationalization of memory management is another important aspect: as highlighted in the profiling results of Section VI, a considerable part of the available resources is used in the allocation and deallocation of packet descriptors (memory management functions). [24] proposes a patch that allows the recycling of the descriptors of successfully sent packets: the basic idea is to save CPU resources during the NAPI receive operations by reusing the packet descriptors in the completion queue. The use of this patch can further improve the performance. Summarizing, our optimized NAPI kernel image includes the descriptor recycling patch and the k2 version of the e1000 driver, and it has been configured with the following optimized values: I) the Rx and Tx ring buffers have been set to the maximum value, 4096 descriptors; II) the Rx interrupt generation has not been limited; III) the qdisc size for all the adapters has been dimensioned to 20,000 descriptors; IV) the NAPI quota parameter has been set to 23 descriptors; V) the scheduler clock frequency has been fixed at 100 Hz.

Figure 4. Detailed scheme of the forwarding operations in the 2.6 NAPI kernel.
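The idea behind the descriptor recycling patch of [24] can be sketched in a few lines; this is our own simplified illustration of the principle, not the patch itself:

```python
completion_queue = []                # descriptors of already-sent packets
stats = {"fresh_allocs": 0}

def alloc_descriptor():
    """Recycling allocator: reuse a parked descriptor when one is available."""
    if completion_queue:
        return completion_queue.pop()    # recycled: no allocator call at all
    stats["fresh_allocs"] += 1           # otherwise fall back to a real alloc
    return {"payload": None}

def tx_completed(desc):
    completion_queue.append(desc)        # park instead of freeing

first = alloc_descriptor()               # cold start: needs a fresh alloc
tx_completed(first)                      # NIC signals successful transmission
second = alloc_descriptor()              # reuses the very same descriptor
```

In steady-state forwarding, every transmitted packet returns a descriptor just in time for the next received one, so the per-packet allocation/deallocation cost that the profiling results highlight largely disappears.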
IV. BENCHMARKING TOOLS

A reliable performance evaluation of network equipment that forwards packets at Gigabit/s speeds is not a simple task, and it generally requires an ad hoc hardware platform: it is quite well known that the common software generation/measurement tools [25-27] running on PCs, even if widely used, do not provide an adequate level of accuracy. In particular, in the presence of high-speed interfaces, they exhibit some significant limitations (i.e., in maximum sustainable rate and in packet generation/measurement latency), due to the bottleneck of the PC internal buses and the lack of computational resources. The alternative is to use professional products, which are generally realized with ad hoc hardware and can therefore guarantee high performance (i.e., a precision of a few nanoseconds in both generation and measurement); besides being quite expensive, however, they cannot be completely and easily customized or modified, sometimes proving less flexible than researchers need. Starting from these considerations, we have decided to build a traffic generator that combines the performance of ad hoc hardware elements with the flexibility of software-based solutions. To achieve this result, we have developed this traffic generator, namely PktGen, on custom highly programmable hardware, i.e., network processors [28]: in particular, we have used a Radisys ENP-2611 evaluation board [29] that includes three Gigabit Ethernet interfaces and the Intel IXP 2400 network processor [30]. Analyzing the PktGen features thoroughly, we have to highlight that it has been designed not only to generate traffic at high speed with an adequate precision level, but also to generate a high number of different traffic profiles simultaneously. These profiles can have very different (deterministic or stochastic) characteristics.
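The difference between the profile types mentioned above can be illustrated with two interarrival-time generators; this is a hypothetical sketch of the concept, not PktGen's real interface:

```python
def cbr_gaps(rate_pps, n):
    """Constant-rate profile: identical interarrival times."""
    return [1.0 / rate_pps] * n

def bursty_gaps(rate_pps, burst_len, n):
    """Bursts of back-to-back packets separated by idle gaps.

    The idle gap after each burst is sized so that the average rate
    still equals rate_pps (n is assumed to be a multiple of burst_len).
    """
    gaps = []
    for i in range(n):
        last_of_burst = (i % burst_len == burst_len - 1)
        gaps.append(burst_len / rate_pps if last_of_burst else 0.0)
    return gaps
```

Both profiles carry the same average load, but the bursty one concentrates the packets back-to-back, which is precisely what stresses the ring buffers in the back-to-back tests of Section V.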
In particular, PktGen can act independently on each traffic profile by tuning both the time parameters (i.e., interarrival times and burst lengths) and the packet templates (i.e., a set of specific rules to build the packets and to fill the header fields). A detailed description of the PktGen architecture can be found in [31]. In spite of the good performance level provided by this generation tool, to analyze the OR performance with the RFC 2544 tests [32, 33] we also need a measurement tool able to reliably estimate both the throughput and the latency provided by the System Under Test (SUT). Thus, to benchmark the OR forwarding performance and to test the PktGen capabilities, we have finally decided to use professional equipment, namely the Agilent N2X Router Tester [34]. It provides throughput and latency measurements with very high availability and accuracy (i.e., the minimum guaranteed timestamp resolution is 10 ns). Moreover, with two dual Gigabit Ethernet test cards and one 16-port Fast Ethernet card at our disposal, it allows us to analyze the OR behaviour in the presence of a high number of heterogeneous interfaces. To better support the performance analysis and to identify the OR bottlenecks, we have also performed some internal measurements, obtained by using specific software tools (called profilers) placed inside the OR, which are able to trace the percentage of CPU utilization of each software module running on the node. Internal measurements are very useful for identifying the architecture bottlenecks. The problem is that many of these tools require a considerable computational effort that perturbs the system performance, making the results meaningless. In this respect, their correct choice is a strategic point. We have verified with many different tests that one of the best is Oprofile [35], an open source tool that realizes a continuous monitoring of the system dynamics with a frequent and quite regular sampling of the CPU hardware registers.
Oprofile allows evaluating, in a very effective and deep way, the CPU utilization of both each software application and each single kernel function running in the system, with a very low computational overhead.

V. BENCHMARKING SCENARIO

To analyze the OR forwarding performance in depth, we have chosen to start by defining a reasonable set of benchmarking setups with an increasing level of complexity and, for each selected setup, to apply some of the tests defined in [32]. In particular, we have chosen to perform these test activities by using both a core router configuration and an edge router one: the first is composed of a few high-speed (Gigabit Ethernet) network interfaces, while the second is composed of a high-speed gateway interface and a high number of Fast Ethernet cards, which collect the traffic from the access networks. More in detail, we have performed our tests with the following setups:

Setup A (Figure 5): a single mono-directional flow crosses the OR from one Gigabit port to another;
Setup B (Figure 6): two full-duplex flows cross the OR, each one using a different pair of Gigabit ports;
Setup C (Figure 7): the OR includes 4 Gigabit Ethernet ports, and a full-meshed (and full-duplex) traffic matrix is applied;
Setup D (Figure 8): the OR includes 1 Gigabit Ethernet port and 12 Fast Ethernet ports, and the traffic matrix used is full-meshed (and full-duplex).

Figure 5. Setup A    Figure 6. Setup B    Figure 7. Setup C    Figure 8. Setup D

The A, B and C setups, which clearly refer to a simple core router working configuration, are quite significant for understanding the maximum performance obtainable by the
forwarding mechanism. Indeed, in such cases the OR has to handle the traffic of only a few network interfaces, reducing the computational overhead needed to manage the larger number of interfaces of setup D. Each OR forwarding benchmarking session is essentially composed of three test sets, namely throughput and latency, back-to-back burst length, and packet loss rate. Note that all these tests have been performed using different IP datagram sizes (i.e., 40, 64, 128, 256, 512, 1024 and 1500 bytes) and both CBR and bursty traffic flows. In more detail, the applied forwarding tests are the following:

1. Throughput and latency tests: this test set is performed using CBR traffic flows, composed of fixed-size datagrams, to obtain:
a. the maximum effective throughput, in Kpackets/s and as a percentage of the theoretical value, versus the IP datagram size;
b. the average, maximum and minimum latencies versus the IP datagram size;
where throughput and latency are interpreted as defined in [RFC1242].

2. Back-to-back tests: these tests use bursty traffic flows and change both the burst size (i.e., the number of packets composing the burst) and the datagram size. The main results of this kind of test are:
a. the zero-loss burst length versus the IP datagram size;
b. the average, maximum and minimum latencies versus the size of the IP datagrams composing the burst;
where the zero-loss burst length is the maximum number of packets, transmitted with minimum inter-frame gaps, that the System Under Test (SUT) can handle without any loss.

3. Loss rate tests: this kind of test uses CBR traffic flows with different offered loads and IP datagram sizes; the obtainable results can be summarized as:
a. the throughput versus both the offered load and the IP datagram size.
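The "theoretical value" used as a reference in these tests is the Ethernet line rate for a given datagram size. As a minimal sketch (assuming the standard Ethernet overheads: a 14-byte header, a 4-byte FCS, a 64-byte minimum frame, plus an 8-byte preamble and a 12-byte inter-frame gap on the wire), it can be computed as follows:

```python
def max_frame_rate(ip_bytes, link_bps=1_000_000_000):
    """Theoretical maximum frames/s for a given IP datagram size on Ethernet.

    Each frame occupies: IP datagram + 14 B Ethernet header + 4 B FCS
    (padded up to the 64 B minimum frame), plus an 8 B preamble and a
    12 B inter-frame gap on the wire.
    """
    frame = max(ip_bytes + 14 + 4, 64)    # minimum Ethernet frame is 64 B
    wire_bits = (frame + 8 + 12) * 8      # add preamble + inter-frame gap
    return link_bps / wire_bits

# Reference rates for the datagram sizes used in the tests:
for size in (40, 64, 128, 256, 512, 1024, 1500):
    print(f"{size:5d} B -> {max_frame_rate(size) / 1e3:8.1f} Kframes/s")
```

For instance, 40-byte datagrams (padded to the minimum frame) yield about 1488 Kframes/s on a Gigabit link, while 1500-byte datagrams yield about 81 Kframes/s, which is why small-packet tests stress the per-packet processing path so much harder.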
VI. NUMERICAL RESULTS

In this Section we report some of the numerical results obtained with the benchmarking techniques and setups described above. As terms of comparison, we have considered different Linux kernel configurations and the Click Modular Router. In particular, we have tested the following versions of the Linux kernel:

o dual-processor standard kernel: a standard NAPI kernel version similar to the previous one, but with SMP support;
o single-processor optimized kernel: a version based on the standard one, with single-processor support, that includes the descriptor recycling patch.

The driver parameter tuning for both of the cited kernel versions sets the receive ring buffers to 80 descriptors and the transmission rings to 256 descriptors; the receive and transmission interrupt delays have been set to zero, the output qdisc buffers to 20,000 descriptors and the scheduler clock to 100 Hz. As far as the Click Modular Router is concerned, we have used a single-processor version mounted on a Linux kernel that includes the driver parameter tuning (both transmission and receive buffers set to 256 descriptors). Note that we have chosen not to consider the SMP versions of the optimized Linux kernel and of the Click Modular Router, as these versions lack an acceptable level of stability.

A. Setup A numerical results

In the first benchmarking session, we have performed the RFC 2544 tests using setup A (with reference to Figure 5) with both the single-processor optimized kernel and Click. As we can observe in Figs. 9, 10 and 11, which report the numerical results of the throughput and latency tests, neither software architecture can achieve the maximum theoretical throughput in the presence of small datagram sizes. As demonstrated by the profiling measurements reported in Fig.
12, obtained with the single-processor optimized Linux kernel and with 64-byte datagrams, this effect is caused by the computational CPU capacity, which limits the maximum packet forwarding rate of the Linux kernel to about 700 Kpackets/s. In fact, even if the CPU idle time drops to zero at an offered load of about 300 Kpackets/s, the CPU occupancies of all the most important function sets keep adapting their contributions up to 700 Kpackets/s; beyond this point their percentage contributions to the CPU utilization remain almost constant. In particular, Fig. 12 shows that the computational weight of memory management operations (such as sk_buff allocations and de-allocations) is kept below 1%, thanks to the descriptor recycling patch. Both the Tx API (the most onerous operation set, which consumes at least about 35% of the overall resources) and the interrupt management operations show unusual and somewhat similar behaviours: their CPU utilization, after an initial growth, decreases as the input rate increases. This behaviour is mostly due to two different packet grouping effects in the Tx and Rx APIs: when the ingress packet rate rises, NAPI tends to moderate the Rx interrupt rate by switching from an interrupt-driven mechanism to a polling one (a first reduction in the number of interrupts), while the Tx API, in the same conditions, can better exploit the packet grouping mechanism by sending more packets at a time (so the number of interrupts confirming successful transmissions decreases). As far as the IP and Ethernet processing operations are concerned, their CPU utilization percentages grow almost linearly with the number of forwarded packets per second.
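The interplay just described (fewer interrupts as the input rate grows) can be illustrated with a toy model. The sketch below is only an illustration of the NAPI idea, not a measurement: the `per_packet_us` and `irq_overhead_us` values are assumed, round numbers, and the steady-state batch equation is a simplification.

```python
def napi_interrupt_rate(offered_pps, per_packet_us=1.0, irq_overhead_us=5.0):
    """Toy model of NAPI interrupt moderation.

    An interrupt schedules a poll that keeps draining packets arriving
    while it runs, so in steady state a batch of b packets per poll
    satisfies b = (irq_overhead + b * per_packet) / interarrival.
    Larger batches mean fewer interrupts per second.
    """
    interarrival_us = 1e6 / offered_pps
    if per_packet_us >= interarrival_us:
        # Arrivals outpace service: the poll never drains the ring,
        # so the driver stays in pure polling mode and interrupts vanish.
        return 0.0
    batch = max(1.0, irq_overhead_us / (interarrival_us - per_packet_us))
    return offered_pps / batch

for pps in (10_000, 100_000, 500_000, 900_000):
    print(f"{pps:7d} pkt/s -> {napi_interrupt_rate(pps):9.0f} irq/s")
```

At low load every packet costs one interrupt; near saturation each poll drains tens of packets, so the interrupt rate falls even though the packet rate keeps rising, which matches the non-monotonic CPU share of interrupt management seen in the profiles.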
Considerations similar to the previous ones also apply to Click: the performance limitation in the presence of short datagrams is again due to a computational bottleneck, but the simple Click receive API, based on a polling mechanism, achieves better
performance in terms of throughput, by lowering the computational weight of the IRQ management and Rx API functions. For the same reasons, as shown in Figs. 10 and 11, obtained with the RFC 2544 throughput and latency tests, the polling-based receive mechanism embedded in Click introduces higher packet latencies. In accordance with the previous results, the back-to-back tests, reported in Fig. 13 and Table I, show that both the optimized Linux kernel and Click continue to suffer with small datagrams. In fact, while with datagrams of 256 bytes or more the measured zero-loss burst length is quite close to the maximum burst length used in the tests, it is heavily limited with 40-, 64- and, for the Linux kernel only, 128-byte packets. Apart from the 128-byte case, where NAPI starts to suffer the computational bottleneck while Click keeps a forwarding rate very close to the theoretical one, the Linux kernel supports bursty traffic better than Click, providing both longer zero-loss bursts and lower associated latencies. In Fig. 14, the loss rate test results are finally reported.

Figure 11. Throughput and latency test, testbed A: average latencies for both the single-processor optimized kernel and the Click Modular Router.
Figure 12. Profiling results of the optimized Linux kernel obtained with the testbed setup A.
Figure 9. Throughput and latency test, testbed A: effective throughput results for the single-processor optimized kernel and the Click Modular Router.
Figure 13. Back-to-back test, testbed A: maximum zero-loss burst lengths.
TABLE I. BACK-TO-BACK TEST, TESTBED A: LATENCY VALUES FOR BOTH THE SINGLE-PROCESSOR OPTIMIZED KERNEL AND THE CLICK MODULAR ROUTER.
Figure 10. Throughput and latency test, testbed A: minimum and maximum latencies for both the single-processor optimized kernel and the Click Modular Router.
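The zero-loss burst length reported by these tests is typically found by a monotonic search: if the device forwards a burst of n minimum-gap frames without loss, a longer burst is tried, otherwise a shorter one. A minimal sketch of that search, where `forwards_without_loss` is a hypothetical stand-in for one tester trial:

```python
def zero_loss_burst_length(forwards_without_loss, max_burst=8192):
    """Binary-search the longest back-to-back burst forwarded with no loss.

    `forwards_without_loss(n)` abstracts a single trial: send a burst of
    n frames at minimum inter-frame gap and report whether every frame
    was forwarded. Loss is assumed monotonic in the burst length.
    """
    lo, hi = 0, max_burst
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if forwards_without_loss(mid):
            lo = mid          # no loss: try a longer burst
        else:
            hi = mid - 1      # loss: the limit is below mid
    return lo

# Example with a fake SUT whose queues overflow beyond 1000 frames:
print(zero_loss_burst_length(lambda n: n <= 1000))
```

In practice each trial is repeated and averaged, as RFC 2544 recommends, but the search structure is the same.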
[Table I reports, for each packet length (bytes), the minimum, average and maximum latencies (µs) measured with the optimized kernel and with Click.]
Figure 14. Loss rate test, testbed A: maximum throughput versus both the offered load and the IP datagram size.

B. Setup B numerical results

In the second benchmarking session we have analyzed the performance of the optimized single-processor Linux kernel, of the standard SMP Linux kernel and of the Click Modular Router with the testbed setup B (with reference to Fig. 6). Fig. 15, which reports the maximum effective throughput, in forwarded packets per second, of a single router interface, shows that, in the presence of short packets, none of the three software architectures provides a performance level close to the theoretical one. In particular, while the best throughput values are achieved by Click, the SMP kernel seems to provide better forwarding rates than the optimized kernel. In fact, as outlined in [19], if no explicit CPU-interface bindings are present, the SMP kernel processes the received packets trying to dynamically share the computational load among the CPUs (using, if possible, the same CPU for the whole packet elaboration). Thus, with this setup, the computational load sharing tends to assign the two interfaces to which a traffic pair is applied to a single fixed CPU, fully processing each received packet on one CPU and avoiding in this way any memory concurrency problems. Figs. 16 and 17 report the minimum, average and maximum latency values for different datagram sizes obtained with all three considered software architectures. In particular, we can note that both Linux kernels, which provide very similar results in this case, ensure lower minimum latencies than Click, while the latter provides better average and maximum latency values for short datagrams.

Figure 16. Throughput and latency test, testbed B: minimum and maximum latencies for both the single-processor optimized kernel and the Click Modular Router. Figure 17.
Throughput and latency test, testbed B: average latencies for both the single-processor optimized kernel and the Click Modular Router.

The back-to-back results, reported in Fig. 18 and Table II, show that the performance of all the analyzed architectures is almost comparable in terms of zero-loss burst length, while in terms of latency the Linux kernels provide better values than Click.

Figure 18. Back-to-back test, testbed B: maximum zero-loss burst lengths.
Figure 15. Throughput and latency test, testbed setup B: effective throughput for both the single-processor optimized kernel and the Click Modular Router.

By analyzing Fig. 19, which reports the loss rate numerical results, we can observe that the performance obtained with Click and the SMP kernel is better, especially for small datagrams, than that obtained with the optimized single-processor kernel. Moreover, Fig. 19 also shows that none of the three OR software architectures achieves the full Gigabit speed, even with large datagrams: the maximum forwarding rate is about 650 Mbps per interface. Note that, to improve the readability of the results, in Fig. 19 and in all the following loss rate tests we report only the OR behaviour with the minimum and
the maximum datagram sizes, since they respectively represent the performance lower and upper bounds.

TABLE II. BACK-TO-BACK TEST, TESTBED B: LATENCY VALUES FOR THE SINGLE-PROCESSOR OPTIMIZED KERNEL, THE CLICK MODULAR ROUTER AND THE SMP KERNEL.
[Table II reports, for each packet length (bytes), the minimum, average and maximum latencies (µs) measured with the optimized kernel, with Click and with the SMP kernel.]

Figure 19. Loss rate test, testbed B: maximum throughput versus both the offered load and the IP datagram size.

C. Setup C numerical results

In the third benchmarking session, the three considered software architectures have been tested in the presence of four Gigabit Ethernet interfaces with a full-meshed traffic matrix (Fig. 7). By analyzing the maximum effective throughput values, reported in Fig. 20, we can note that Click appears to achieve a better performance level than the Linux kernels, while, unlike the previous case, the single-processor Linux kernel provides maximum forwarding rates considerably higher than the SMP version. In fact, the SMP kernel tries to share the computational load of the incoming traffic among the CPUs and, in the presence of a full-meshed traffic matrix, this results in an almost static assignment of each CPU to two specific network interfaces. As, in this situation, about half of the forwarded packets cross the OR between two interfaces managed by different CPUs, this causes a performance loss due to memory concurrency problems. Figs. 21 and 22 show the minimum, maximum and average latency values obtained during this test set. From these figures, we can note how the SMP Linux kernel, in the presence of short datagrams, suffers memory concurrency problems that degrade the OR performance and considerably increase both the average and the maximum latency values. Analyzing Fig. 23 and Table III, which report the back-to-back test results, we can note that, on one hand, all three OR architectures achieve a similar zero-loss burst length, while, on the other hand, Click reaches much higher average and maximum latencies than the single-processor and SMP kernels. The loss rate results, shown in Fig. 24, highlight the performance decay of the Linux SMP kernel, while a fairly similar behaviour is observed for the other two architectures. Moreover, as in the previous benchmarking session, the maximum forwarding rate of each Gigabit network interface is limited to about 600-650 Mbps.

Figure 20. Throughput and latency test, setup C: effective throughput results for the single-processor optimized kernel, the Click Modular Router and the SMP kernel.
Figure 21. Throughput and latency test, testbed C: minimum and maximum latencies for the single-processor optimized kernel, the Click Modular Router and the SMP kernel.
Figure 22. Throughput and latency test, testbed C: average latencies for the single-processor optimized kernel, the Click Modular Router and the SMP kernel.
Figure 23. Back-to-back test, testbed C: maximum zero-loss burst lengths.
TABLE III. BACK-TO-BACK TEST, TESTBED C: LATENCY VALUES FOR THE SINGLE-PROCESSOR OPTIMIZED KERNEL, THE CLICK MODULAR ROUTER AND THE SMP KERNEL.
[Table III reports, for each packet length (bytes), the minimum, average and maximum latencies (µs) measured with the optimized kernel, with Click and with the SMP kernel.]
Figure 24. Loss rate test, testbed C: maximum throughput versus both the offered load and the IP datagram size.

D. Setup D numerical results

In the last benchmarking session, we have applied setup D, which provides a full-meshed traffic matrix between one Gigabit Ethernet interface and 12 Fast Ethernet interfaces, to the single-processor Linux kernel and to the SMP version. Note that we have decided not to use Click in this last test since, at the moment and for this software architecture, there are no drivers with polling support for the Fast Ethernet interfaces used. By analyzing the throughput and latency results in Figs. 25, 26 and 27, we can note how, in the presence of a high number of interfaces and a full-meshed traffic matrix, the performance of the SMP kernel version collapses: as shown in Fig. 25, the maximum measured effective throughput is limited to about 2400 packets/s, and the corresponding latencies are clearly higher than the ones obtained with the single-processor kernel. However, it must be highlighted that the single-processor kernel does not sustain the maximum theoretical rate either: in particular, it achieves about 10% of the full speed in the presence of short datagrams and about 75% for large datagram sizes.

Figure 25. Throughput and latency test, setup D: effective throughput results for both the single-processor optimized kernel and the SMP kernel.
Figure 26. Throughput and latency test, testbed D: minimum and maximum latencies for both the single-processor optimized kernel and the SMP kernel. Figure 27.
Throughput and latency test, testbed D: average latencies for both the single-processor optimized kernel and the SMP kernel.

To better understand why the OR does not reach full speed with such a high number of Fast Ethernet interfaces, we have performed several profiling tests. In particular, these tests were carried out using two simple traffic matrices: the first (Fig. 28) is composed of 12 CBR flows that cross the OR from the Fast Ethernet interfaces to the Gigabit one, while the second (Fig. 29) is composed of 12 CBR flows that cross the OR in the opposite direction (i.e., from the Gigabit to the Fast Ethernet interfaces). These simple traffic matrices allow us to analyze the receive and transmission operations separately in this context. Figs. 28 and 29 report the profiling results corresponding to the two traffic matrices. The obtained internal measurements in
More informationRouter Architectures
Router Architectures An overview of router architectures. Introduction What is a Packet Switch? Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers 2 1 Router Components
More informationAccelerating High-Speed Networking with Intel I/O Acceleration Technology
White Paper Intel I/O Acceleration Technology Accelerating High-Speed Networking with Intel I/O Acceleration Technology The emergence of multi-gigabit Ethernet allows data centers to adapt to the increasing
More informationHANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring
CESNET Technical Report 2/2014 HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring VIKTOR PUš, LUKÁš KEKELY, MARTIN ŠPINLER, VÁCLAV HUMMEL, JAN PALIČKA Received 3. 10. 2014 Abstract
More informationGetting the most TCP/IP from your Embedded Processor
Getting the most TCP/IP from your Embedded Processor Overview Introduction to TCP/IP Protocol Suite Embedded TCP/IP Applications TCP Termination Challenges TCP Acceleration Techniques 2 Getting the most
More informationSYSTEM ecos Embedded Configurable Operating System
BELONGS TO THE CYGNUS SOLUTIONS founded about 1989 initiative connected with an idea of free software ( commercial support for the free software ). Recently merged with RedHat. CYGNUS was also the original
More informationLeased Line + Remote Dial-in connectivity
Leased Line + Remote Dial-in connectivity Client: One of the TELCO offices in a Southern state. The customer wanted to establish WAN Connectivity between central location and 10 remote locations. The customer
More informationScaling Networking Applications to Multiple Cores
Scaling Networking Applications to Multiple Cores Greg Seibert Sr. Technical Marketing Engineer Cavium Networks Challenges with multi-core application performance Amdahl s Law Evaluates application performance
More informationTCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance
TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation, and Performance M. Rangarajan, A. Bohra, K. Banerjee, E.V. Carrera, R. Bianchini, L. Iftode, W. Zwaenepoel. Presented
More informationSwitch Fabric Implementation Using Shared Memory
Order this document by /D Switch Fabric Implementation Using Shared Memory Prepared by: Lakshmi Mandyam and B. Kinney INTRODUCTION Whether it be for the World Wide Web or for an intra office network, today
More informationMonitoring high-speed networks using ntop. Luca Deri <deri@ntop.org>
Monitoring high-speed networks using ntop Luca Deri 1 Project History Started in 1997 as monitoring application for the Univ. of Pisa 1998: First public release v 0.4 (GPL2) 1999-2002:
More informationAchieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
More informationPer-Flow Queuing Allot's Approach to Bandwidth Management
White Paper Per-Flow Queuing Allot's Approach to Bandwidth Management Allot Communications, July 2006. All Rights Reserved. Table of Contents Executive Overview... 3 Understanding TCP/IP... 4 What is Bandwidth
More informationNetwork Simulation Traffic, Paths and Impairment
Network Simulation Traffic, Paths and Impairment Summary Network simulation software and hardware appliances can emulate networks and network hardware. Wide Area Network (WAN) emulation, by simulating
More informationComparing and Improving Current Packet Capturing Solutions based on Commodity Hardware
Comparing and Improving Current Packet Capturing Solutions based on Commodity Hardware Lothar Braun, Alexander Didebulidze, Nils Kammenhuber, Georg Carle Technische Universität München Institute for Informatics
More informationApplication Note. Windows 2000/XP TCP Tuning for High Bandwidth Networks. mguard smart mguard PCI mguard blade
Application Note Windows 2000/XP TCP Tuning for High Bandwidth Networks mguard smart mguard PCI mguard blade mguard industrial mguard delta Innominate Security Technologies AG Albert-Einstein-Str. 14 12489
More informationCS 78 Computer Networks. Internet Protocol (IP) our focus. The Network Layer. Interplay between routing and forwarding
CS 78 Computer Networks Internet Protocol (IP) Andrew T. Campbell campbell@cs.dartmouth.edu our focus What we will lean What s inside a router IP forwarding Internet Control Message Protocol (ICMP) IP
More informationPacket Capture in 10-Gigabit Ethernet Environments Using Contemporary Commodity Hardware
Packet Capture in 1-Gigabit Ethernet Environments Using Contemporary Commodity Hardware Fabian Schneider Jörg Wallerich Anja Feldmann {fabian,joerg,anja}@net.t-labs.tu-berlin.de Technische Universtität
More informationEVALUATING THE NETWORKING PERFORMANCE OF LINUX-BASED HOME ROUTER PLATFORMS FOR MULTIMEDIA SERVICES. Ingo Kofler, Robert Kuschnig, Hermann Hellwagner
EVALUATING THE NETWORKING PERFORMANCE OF LINUX-BASED HOME ROUTER PLATFORMS FOR MULTIMEDIA SERVICES Ingo Kofler, Robert Kuschnig, Hermann Hellwagner Institute of Information Technology (ITEC) Alpen-Adria-Universität
More informationWireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University
Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University Napatech - Sharkfest 2009 1 Presentation Overview About Napatech
More informationVirtualised MikroTik
Virtualised MikroTik MikroTik in a Virtualised Hardware Environment Speaker: Tom Smyth CTO Wireless Connect Ltd. Event: MUM Krackow Feb 2008 http://wirelessconnect.eu/ Copyright 2008 1 Objectives Understand
More informationLatency on a Switched Ethernet Network
Application Note 8 Latency on a Switched Ethernet Network Introduction: This document serves to explain the sources of latency on a switched Ethernet network and describe how to calculate cumulative latency
More informationWindows Server Performance Monitoring
Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly
More informationProgrammable Networking with Open vswitch
Programmable Networking with Open vswitch Jesse Gross LinuxCon September, 2013 2009 VMware Inc. All rights reserved Background: The Evolution of Data Centers Virtualization has created data center workloads
More informationTechnical Bulletin. Enabling Arista Advanced Monitoring. Overview
Technical Bulletin Enabling Arista Advanced Monitoring Overview Highlights: Independent observation networks are costly and can t keep pace with the production network speed increase EOS eapi allows programmatic
More informationLocal-Area Network -LAN
Computer Networks A group of two or more computer systems linked together. There are many [types] of computer networks: Peer To Peer (workgroups) The computers are connected by a network, however, there
More informationCourse 12 Synchronous transmission multiplexing systems used in digital telephone networks
Course 12 Synchronous transmission multiplexing systems used in digital telephone networks o Disadvantages of the PDH transmission multiplexing system PDH: no unitary international standardization of the
More informationHow To Monitor And Test An Ethernet Network On A Computer Or Network Card
3. MONITORING AND TESTING THE ETHERNET NETWORK 3.1 Introduction The following parameters are covered by the Ethernet performance metrics: Latency (delay) the amount of time required for a frame to travel
More informationOpen-source routing at 10Gb/s
Open-source routing at Gb/s Olof Hagsand, Robert Olsson and Bengt Gördén, Royal Institute of Technology (KTH), Sweden Email: {olofh, gorden}@kth.se Uppsala University, Uppsala, Sweden Email: robert.olsson@its.uu.se
More informationInsiders View: Network Security Devices
Insiders View: Network Security Devices Dennis Cox CTO @ BreakingPoint Systems CanSecWest/Core06 Vancouver, April 2006 Who am I? Chief Technology Officer - BreakingPoint Systems Director of Engineering
More informationSAN Conceptual and Design Basics
TECHNICAL NOTE VMware Infrastructure 3 SAN Conceptual and Design Basics VMware ESX Server can be used in conjunction with a SAN (storage area network), a specialized high speed network that connects computer
More informationncap: Wire-speed Packet Capture and Transmission
ncap: Wire-speed Packet Capture and Transmission L. Deri ntop.org Pisa Italy deri@ntop.org Abstract With the increasing network speed, it is no longer possible to capture and transmit network packets at
More informationAn Oracle Technical White Paper November 2011. Oracle Solaris 11 Network Virtualization and Network Resource Management
An Oracle Technical White Paper November 2011 Oracle Solaris 11 Network Virtualization and Network Resource Management Executive Overview... 2 Introduction... 2 Network Virtualization... 2 Network Resource
More informationBenchmarking Virtual Switches in OPNFV draft-vsperf-bmwg-vswitch-opnfv-00. Maryam Tahhan Al Morton
Benchmarking Virtual Switches in OPNFV draft-vsperf-bmwg-vswitch-opnfv-00 Maryam Tahhan Al Morton Introduction Maryam Tahhan Network Software Engineer Intel Corporation (Shannon Ireland). VSPERF project
More informationStream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
More informationVirtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308
Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Laura Knapp WW Business Consultant Laurak@aesclever.com Applied Expert Systems, Inc. 2011 1 Background
More informationUnified Fabric: Cisco's Innovation for Data Center Networks
. White Paper Unified Fabric: Cisco's Innovation for Data Center Networks What You Will Learn Unified Fabric supports new concepts such as IEEE Data Center Bridging enhancements that improve the robustness
More informationHow To Improve Performance On A Linux Based Router
Linux Based Router Over 10GE LAN Cheng Cui, Chui-hui Chiu, and Lin Xue Department of Computer Science Louisiana State University, LA USA Abstract High speed routing with 10Gbps link speed is still very
More informationA Comparative Study on Vega-HTTP & Popular Open-source Web-servers
A Comparative Study on Vega-HTTP & Popular Open-source Web-servers Happiest People. Happiest Customers Contents Abstract... 3 Introduction... 3 Performance Comparison... 4 Architecture... 5 Diagram...
More informationPART III. OPS-based wide area networks
PART III OPS-based wide area networks Chapter 7 Introduction to the OPS-based wide area network 7.1 State-of-the-art In this thesis, we consider the general switch architecture with full connectivity
More information