Exploiting Commodity Multi-core Systems for Network Traffic Analysis

Luca Deri, ntop.org, Pisa, Italy
Francesco Fusco, IBM Zurich Research Laboratory, Rüschlikon, Switzerland

Abstract
The current trend in computer processors is towards multi-core systems. Although operating systems were adapted a long time ago to support multi-processing, kernel network layers have not yet taken advantage of this new technology. The result is that packet capture, the cornerstone of every network monitoring application, is not efficient on modern systems, and its performance gets worse as the number of cores increases. This paper describes common pitfalls of network monitoring applications when used with multi-core systems, and presents solutions to these problems. In addition, it covers the design and implementation of a new multi-core aware packet capture kernel module that enables monitoring applications to scale with the number of cores, contrary to what happens in most operating systems.

Keywords: Passive packet capture, multi-core processors, network traffic monitoring, Linux kernel, operating systems.

1. Introduction
The complexity of Internet-based services and advances in interconnection technologies have increased the demand for advanced monitoring applications designed for high-speed networks. The increased complexity of monitoring tasks such as anomaly detection, intrusion detection and traffic classification has made software-based solutions extremely attractive, because software is more flexible and less expensive than dedicated hardware. On the other hand, analyzing high-speed networks by means of software applications running on commodity off-the-shelf (COTS) hardware presents major performance challenges. Packet capture is still one of the most resource-intensive tasks for the majority of passive monitoring applications. The industry followed three paths for accelerating software applications by means of specialized hardware while preserving software flexibility:
- Accelerating the packet capture process in hardware via packet capture accelerators [8].
- Splitting the traffic among different monitoring stations [24] in order to reduce the application workload.
- Developing programmable hardware boards with standard languages [25, 26].
Along with hardware-based solutions, researchers [15, 21, 27, 28] have demonstrated that the packet capture performance of software solutions based on commodity hardware can be substantially improved by enhancing general purpose operating systems for traffic analysis. Today's COTS hardware offers features (and performance) that just a few years ago were only provided by expensive special purpose hardware: desktop-class machines are becoming advanced multi-core (or even multi-processor) parallel architectures [1, 2, 22] capable of concurrently executing multiple threads. Modern network adapters feature several independent transmission (TX) and reception (RX) queues, each mapped on a separate core [9]. Initially designed to facilitate the implementation of virtual machine supervisors, network queues can also be used to accelerate network traffic tasks such as routing [31], by processing incoming packets in concurrent threads of execution. Unfortunately, operating systems are not yet able to fully take advantage of this breakthrough network technology for network traffic analysis, thus dramatically limiting its scope of application. In this paper, we introduce a new packet capture technology named TNAPI, which exploits the increased parallelism of commodity hardware for speeding up network analysis applications. The rest of the paper is structured as follows. Section 2 provides an analysis of multi-core technology when applied to packet capture. Section 3 describes the common design pitfalls when developing network analysis applications on top of multi-core systems. Section 4 describes TNAPI, and its performance is evaluated in Section 5. Finally, Section 6 lists some open issues and future development activities.

2. Packet Capture and Commodity Multi-core Systems
Exploiting parallel architectures to perform passive network analysis is a challenge for several reasons. First, packet capture applications are memory bound, but memory bandwidth does not seem to increase as fast as the number of available cores [14]. Second, balancing the traffic among different processing units is challenging, as it is not possible to predict the nature of the incoming traffic. Exploiting this parallelism with general-purpose operating systems is even more difficult, as they have not been designed for accelerating packet capture.
During the last three decades, memory access has always been one of the worst enemies of scalability, and several solutions to this problem have been proposed. With the advent of symmetric multiprocessor (SMP) systems, multiple processors are connected to the same memory bus, thereby causing processors to compete for the same memory bandwidth. Integrating the memory controller inside the processor is another approach for increasing the memory bandwidth. The main advantage of this architecture is fairly obvious: multiple memory modules can be attached to each processor, thus increasing the aggregate bandwidth. However, this architecture also has some disadvantages. Since memory is attached to individual processors yet shared among all of them, memory access is no longer uniform (hence the name NUMA: Non Uniform Memory Access). Local memory (i.e. the memory close to a processor) can be accessed quickly, whereas the situation is completely different when a processor accesses memory attached to another processor. In shared memory multiprocessors, such as SMP and NUMA systems, a cache coherence protocol [3] must be used in order to guarantee synchronization among processors. A multi-core processor is a processing system composed of two or more individual processors, called cores, integrated onto a single chip package. As a result, the inter-core bandwidth of multi-core processors can be many times greater than that of SMP systems. In addition, inter-core latencies are substantially reduced. In some cases, the number of processor cores is increased by

leveraging technologies such as hyper-threading, which allows a physical core to be partitioned even further. Cores often share components such as the level-2 cache and the front side bus, so they cannot usually be regarded as truly independent processors. For traffic monitoring applications, the most effective approach to optimize bandwidth utilization is to reduce the number of packet copies. With general purpose operating systems, the journey of a packet inside the kernel can be long and usually involves at least two copies: from the network card to a socket buffer, and from the socket buffer to the monitoring application. The packet capture performance can be substantially increased by shortening this journey using memory mapping [15]. Capture accelerators provide a straight path from the wire to the address space of the monitoring application, so the operating system overhead is completely avoided. However, bandwidth utilization can also be optimized by increasing the cache locality of the packet capture application. The work in [6] shows that improving the cache locality of packet capture software through packet reordering allows the overall performance to be significantly improved. However, this approach is unfeasible on high speed networks, as packet reordering is very costly. In order to better exploit the cache subsystem, Intel introduced dedicated circuitry to assist the task of handling incoming packets, known as Direct Cache Access (DCA) [4, 10], which allows incoming packets to be written directly into the core's cache. In multi-core and multi-processor architectures, memory bandwidth can be wasted [18] in many ways, including improper scheduling, wrong balancing of interrupt (IRQ) requests, and subtle mechanisms such as false sharing [20, 29]. For these reasons, squeezing performance out of these architectures requires additional effort. Even though there is a lot of ongoing research in this area [9], most of the existing schedulers are unaware of architectural differences between cores, therefore scheduling does not guarantee the best memory bandwidth utilization. This may happen when two threads using the same data set are scheduled on different processors, or on the same multi-core processor in separate cache domains. In this case the two processors (or cores) fetch the same data from memory, i.e., the same data is fetched twice. When cache levels are shared among cores, this situation can be avoided by placing both threads on the same multi-core processor. Unfortunately, schedulers are not so sophisticated, and thus the intervention of the programmer, through the manipulation of CPU affinity [23], is required in order to make scheduling more effective. The work described in [13] shows that the introduction of a software scheduler to better distribute the workload among threads can substantially increase scalability. The same work also demonstrated that the increased parallelism can justify the cost of synchronization between threads. Capture accelerators supporting multiple ring buffers implement in firmware the logic for balancing the traffic according to traffic rules, so that different processes or threads receive and analyze a portion of the traffic. It is worth noting that the balancing scheme, although programmable, is not meant to be modified at runtime; in fact, a rule set reconfiguration may take seconds, if not minutes.
Thanks to this feature, capture accelerators smoothed the transition towards parallel network packet processing. Similar technologies, such as Receive Side Scaling (RSS) [7], have also been introduced in modern off-the-shelf network interface cards to increase the networking throughput on multi-core machines. RSS-enabled network adapters include the logic for balancing incoming traffic across multiple RX queues. This balancing policy is implemented in hardware and thus RSS is not as flexible as rule-based systems. However, this simple policy is effective in practice also for network monitoring applications, as most of them are flow-oriented. To the best of our knowledge, none of the available packet capture software is able to fully benefit from RSS technology. In summary, we have identified the following issues while engineering monitoring applications for off-the-shelf multi-core and multi-processor architectures:
- The operating system scheduler is completely unaware of the workload, and in some cases it does not have the knowledge to relocate threads on the right cores/processors.
- Balancing interrupts among different cores may not be an appropriate strategy.
- Balancing the workload among processors is not straightforward, as the workload depends on the incoming traffic, which cannot be predicted.
- Preserving cache locality is a prerequisite for achieving good scalability on modern parallel architectures and for overcoming bandwidth limitations, but monitoring software has poor cache locality.

- Modern parallel architectures are getting complex and their performance is hard to predict. Optimizing code for parallel architectures may be much more difficult than optimizing code for uniprocessor architectures. Cache coherence has to be taken into account during the implementation.
As general purpose operating systems are not optimized for network monitoring, packet capture, the cornerstone of every monitoring application, does not fully benefit from the increased parallelism and from the advanced features offered by modern network adapters. The contribution of our work is a new packet capture technology based on a general purpose operating system, designed for exploiting the parallelism of commodity parallel architectures for network traffic analysis. The technology allows high packet capture rates to be achieved on those architectures without requiring extensive knowledge, and provides the basic mechanisms for dividing the workload among different processing units. For this reason, we believe it can smooth the transition of monitoring applications toward parallel traffic analysis.

3. Common Application Design Pitfalls
As previously explained, modern multi-core-aware network adapters are logically divided into several RX/TX queues, where packets are flow-balanced across queues using hardware-based facilities such as Intel RSS (Receive Side Scaling), part of Intel I/O Acceleration Technology (I/OAT) [5, 10]. As this technology has been designed for virtualization, it is not commonly used in network traffic monitoring. Figure 1 highlights some design limitations of many packet capture applications that could have been fixed by exploiting RSS, which logically partitions a network adapter into multiple virtual RX/TX queues. The number of such queues depends on the NIC chipset, and it is limited by the number of cores. This means that on a given NIC, RSS on a dual-core machine can create up to two queues per port, whereas RSS can virtualize four queues per port when using the same NIC on a quad-core machine.

[Figure 1. Design Limitations in Network Monitoring Architectures: RSS balances packets across RX queues on the 1/10 Gbit PHY, but the Linux device driver polls the queues sequentially and merges them into a single resource, for which the threads of the monitoring application must compete in userland.]

In network monitoring, it is very important to make sure that all incoming packets are captured and forwarded to the monitoring applications. Modern network adapters try to improve network performance by splitting a single queue into several queues, each mapped to a processor core. The idea is to balance the load, both in terms of packets and interrupts, across all cores and hence to improve the overall performance. Device drivers are unable to preserve this design up to the application: they merge all queues into one, as used to happen with legacy adapters featuring only one queue. This limitation is a major performance bottleneck, because even if a userland application uses several threads to consume packets, they all have to compete for receiving packets from the same socket. Competition is costly, as semaphores or similar techniques have to be used in order to serialize this work instead of carrying it out in parallel, as happens at the kernel level. In multi-core systems, this problem is even worse because it is often not possible to map the monitoring application on the same core from which packets are coming.
In addition, the use of semaphores has the side effect of invalidating the processor's cache, whose locality is the basic ingredient for preserving multi-core performance.

In a nutshell, the current network layer design needs to merge and split packets a couple of times and access them using semaphores, instead of providing a straight, lock-less path to the application with no performance limitation due to cache invalidation. In most operating systems, packets are fetched using packet polling [11, 12] techniques that were designed in the pre-multi-core age, when network adapters had only one queue. From the operating system point of view, there is no difference between a legacy 100 Mbit card and a modern 10 Gbit card, as the driver hides all card, media and network speed details. As a result, it is not possible to poll queues in parallel but only sequentially, nor is it possible that packets coming from queue X are marked as such and have this information delivered to user space. The latter information could be profitably used for balancing traffic inside the monitoring application. In summary, modern network adapters offer several features that are not exploited at the software level, as many applications sequentially read packets from a single source instead of reading them in parallel, without locks, from all the queues simultaneously. Moreover, these limitations also have an effect on cache usage, because they prevent applications from being bound to the same core from which packets are coming. Another area where performance can be improved is memory management. In network monitoring, packets are often received on dedicated adapters not used for routing or management; this means that they do not need to be forwarded nor routed, but are just used for monitoring. Additionally, in most operating systems, captured packets are moved to userland via mapped memory [15] rather than using system calls such as read(), which are much less efficient. This means that it is possible to use zero-copy techniques to carry packets from the kernel to user space. Unfortunately, incoming packets are first copied into a kernel memory area [16, 17] that holds the packet until it gets copied to userland via the memory map. This causes unnecessary kernel memory allocation and deallocation, as zero-copy could start directly at the driver layer and not just at the networking layer. The problem of efficient memory management is serious for multi-core systems because, in addition to the overhead due to allocation/deallocation, operating systems do not usually feature efficient multi-core-aware memory allocators [19], resulting in serialized memory allocation across cores/threads and hence further reduced application performance.

4. Towards Efficient Multi-core-aware Monitoring Architectures
So far this paper has identified some features of multi-core systems that need to be taken into account while designing applications, and it has also highlighted how operating systems do not fully take advantage of modern network interface cards. The enhancements previously discussed have been introduced into the Linux kernel by modifying both the networking layer and device drivers. This is necessary to poll queues independently and to let this information propagate up to the userland where monitoring applications are executed. Linux has been selected as the target kernel because it is the de-facto reference platform for the evaluation of novel packet capture technologies. However, all the concepts are general and can also be adapted to other operating systems.
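For reference, the memory-mapped capture path mentioned in the previous section can be illustrated with stock kernel interfaces alone. The following minimal sketch uses Linux AF_PACKET with a PACKET_RX_RING, where the kernel writes frames into a ring mapped into user space; PF_RING implements its own, more efficient memory-mapped ring, so this is only an analogue of the technique and not the PF_RING API. Error handling is omitted and the ring geometry is arbitrary.

/*
 * Minimal AF_PACKET PACKET_RX_RING consumer (illustrative analogue of
 * memory-mapped capture, not the PF_RING API).  The kernel writes frames
 * into a ring mapped into user space, so no per-packet read() is needed.
 * Requires root; error handling is omitted for brevity.
 */
#include <stdio.h>
#include <unistd.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <arpa/inet.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>

int main(void) {
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

    /* Ring geometry (arbitrary): 64 blocks of 4 KB, 2 KB frame slots. */
    struct tpacket_req req = {
        .tp_block_size = 4096,
        .tp_block_nr   = 64,
        .tp_frame_size = 2048,
        .tp_frame_nr   = 128,          /* 2 frames per block * 64 blocks */
    };
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    size_t len = (size_t)req.tp_block_size * req.tp_block_nr;
    char *ring = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    unsigned long pkts = 0;
    unsigned int idx = 0;
    for (;;) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)(ring + (size_t)idx * req.tp_frame_size);
        if (!(hdr->tp_status & TP_STATUS_USER)) {
            /* Slot still owned by the kernel: sleep until packets arrive. */
            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            poll(&pfd, 1, -1);
            continue;
        }
        if (++pkts % 1000000 == 0)           /* capture and discard */
            printf("%lu packets\n", pkts);
        hdr->tp_status = TP_STATUS_KERNEL;   /* hand the slot back to the kernel */
        idx = (idx + 1) % req.tp_frame_nr;
    }
    /* not reached */
    munmap(ring, len);
    close(fd);
    return 0;
}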
Instead of modifying a vanilla kernel, the authors enhanced a homegrown Linux kernel module named PF_RING [15] and modified the underlying network device drivers. In order to unleash the driver performance, all available RX queues are polled simultaneously by kernel threads (one kernel thread per queue), hence the term TNAPI (Threaded NAPI). As shown in Figure 2, TNAPI replaces the Linux NAPI [30] packet polling mechanism with one thread per queue that fetches packets from the NIC and pushes them to the networking stack. Depending on the PF_RING configuration, packets can follow the standard journey into the kernel or be pushed directly to PF_RING. Besides the speed advantage offered by PF_RING with respect to vanilla Linux, it allows the incoming queue id to be preserved. As this information is used by PF_RING to implement balancing across packet consumer applications, throughout this paper TNAPI is always used on top of PF_RING. Another advantage of TNAPI when used on top of PF_RING derives from memory management. As both the NIC ring and the PF_RING ring are allocated statically, TNAPI does not copy packets into socket buffers (skb) as vanilla Linux does; instead, it copies the packet payload from the NIC ring to the PF_RING ring. This means that for each incoming packet, costly skb memory allocation/deallocation is avoided, as well as memory mapping across the PCI bus. Finally, when TNAPI sends packets directly to PF_RING, it can refrain from passing packets to PF_RING whenever the PF_RING ring is full, in order to preserve CPU cycles.
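The core idea, one kernel thread polling each RX queue and pushing packets to PF_RING with the queue id attached, can be sketched as follows. This is not the actual TNAPI driver code: struct rx_queue, fetch_from_nic_queue() and push_to_pf_ring() are hypothetical placeholders standing in for driver- and PF_RING-specific internals, while kthread_create(), kthread_bind(), kthread_should_stop() and wake_up_process() are standard Linux kernel primitives.

/*
 * Sketch of TNAPI-style threaded polling: one kernel thread per RX queue,
 * bound to the core that services that queue's interrupts.  NOT the actual
 * TNAPI driver code: struct rx_queue, fetch_from_nic_queue() and
 * push_to_pf_ring() are hypothetical placeholders; kthread_*() are real
 * Linux kernel APIs.
 */
#include <linux/kthread.h>
#include <linux/sched.h>

struct rx_queue {
    int id;     /* hardware queue index                  */
    int core;   /* core where this queue's IRQ is routed */
    /* ... driver-specific descriptor ring state ...     */
};

struct packet;                                                  /* opaque, hypothetical */
extern struct packet *fetch_from_nic_queue(struct rx_queue *q); /* hypothetical */
extern void push_to_pf_ring(struct packet *pkt, int queue_id);  /* hypothetical */

static int tnapi_poll_thread(void *data)
{
    struct rx_queue *q = data;

    while (!kthread_should_stop()) {
        struct packet *pkt = fetch_from_nic_queue(q);
        if (pkt)
            push_to_pf_ring(pkt, q->id);  /* queue id preserved, no skb used */
        else
            cond_resched();               /* nothing to do: yield the CPU    */
    }
    return 0;
}

/* Spawn one polling thread per queue and pin it on the queue's core. */
static void tnapi_start(struct rx_queue *queues, int num_queues)
{
    int i;

    for (i = 0; i < num_queues; i++) {
        struct task_struct *t =
            kthread_create(tnapi_poll_thread, &queues[i], "tnapi/%d", i);
        kthread_bind(t, queues[i].core);
        wake_up_process(t);
    }
}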

[Figure 2. TNAPI (Threaded NAPI): RSS performs hardware per-flow balancing on the 1/10 Gbit NIC, per-queue kernel threads poll the queues into PF_RING, which exposes them to user space as virtual network interfaces consumed by threads or processes without mutexes.]

Contrary to PF_RING, which is a network layer on top of NIC drivers, TNAPI is implemented directly in the NIC driver, as it is necessary to change the mechanism that notifies the kernel when an incoming packet has been received. As of today, TNAPI is available for modern Intel 1 and 10 Gbit network adapters (82575/6 and 82598/9 chipsets) that support RSS and hence multiple RX/TX queues. The TNAPI driver is responsible for spawning one thread per queue per port, and for binding it to the same core where interrupts for that queue are received. PF_RING has been adapted to read the queue identifier from TNAPI before upper layers invalidate its value. In this way the authors implemented inside PF_RING the concept of a virtual network adapter, a feature previously provided only by high-end capture accelerators. Monitoring applications can either bind to a physical device (e.g. eth1) for receiving packets from all queues, or to a virtual device (e.g. eth1@2) for consuming packets from a specific queue. The latter solution allows applications to be easily split into several independent threads of execution, each receiving and analyzing a portion of the traffic. As previously explained, in order to avoid performance degradation due to suboptimal cache utilization, it is highly recommended to bind the thread or process to the same core to which the virtual interface/queue is bound. This operation is easily done by manipulating the CPU affinity. The following section explains CPU affinity tuning in more detail.

5. TNAPI Evaluation
In this section the performance and scalability of TNAPI is compared to PF_RING. The authors did not port TNAPI to other operating systems such as the BSD family, as the idea behind the validation is to position TNAPI with respect to PF_RING and vanilla Linux packet capture, and not to compare it with other operating systems. The authors evaluated the work using two different parallel architectures belonging to different market segments (low-end and high-end). The first monitoring station is based on a single Core 2 Duo processor (model E6300) running at 1.86 GHz and equipped with 2 MB of L2 cache shared among the cores. The second one is a high-end machine equipped with two Xeon i7 processors (model E5520) running at 2.27 GHz on a Supermicro X8DTL server board. Each processor is equipped with four hyper-threaded cores sharing 6 MB of L3 cache, for a total of 16 cores. Both monitoring stations are equipped with 4 GB of RAM and an Intel ET 1 Gbit adapter installed in a 4x PCI-E slot. The latter platform supports the latest features introduced by Intel, such as DCA and I/OAT, whereas the Core2Duo board does not. An IXIA 400 traffic generator has been used to generate the network traffic for the experiments. In order to exploit balancing across queues, the IXIA was configured to generate 64-byte TCP packets originating from a single IP address towards a rotating set of 4096 IP destination addresses. The first test measured the packet capture performance of a simple capture-and-discard (i.e. no packet processing is performed) application named pfcount.
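As a concrete illustration of consuming a single virtual queue, the following user-space sketch opens a per-queue device and pins the capturing thread to a chosen core. It uses only standard libpcap and pthread calls; the "eth1@2" device name follows the PF_RING virtual-device convention described above and is assumed to be understood by the PF_RING-enabled libpcap, and the core number is an assumption that must match the queue's IRQ binding.

/*
 * Sketch of a pfcount-like capture-and-discard consumer for one virtual
 * queue, pinned to one core.  Standard libpcap and pthread calls; the
 * "eth1@2" naming is the PF_RING virtual-device convention described in
 * the paper and is assumed to be accepted by the PF_RING-enabled libpcap.
 * Device name and core number are illustrative.
 */
#define _GNU_SOURCE
#include <pcap.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static unsigned long long pkt_count = 0;

static void on_packet(u_char *user, const struct pcap_pkthdr *h, const u_char *bytes)
{
    (void)user; (void)h; (void)bytes;
    pkt_count++;                                   /* capture and discard */
}

int main(int argc, char **argv)
{
    const char *dev = (argc > 1) ? argv[1] : "eth1@2";  /* one RX queue */
    int core        = (argc > 2) ? atoi(argv[2]) : 2;   /* queue's core */
    char errbuf[PCAP_ERRBUF_SIZE];

    /* Bind this thread to the core that also receives the queue's IRQs,
     * so that packet data stays in the local cache (and NUMA node).     */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
        fprintf(stderr, "warning: could not set CPU affinity\n");

    pcap_t *p = pcap_open_live(dev, 128, 1, 100, errbuf);
    if (p == NULL) {
        fprintf(stderr, "pcap_open_live(%s): %s\n", dev, errbuf);
        return 1;
    }
    pcap_loop(p, -1, on_packet, NULL);             /* runs until interrupted */
    printf("captured %llu packets\n", pkt_count);
    pcap_close(p);
    return 0;
}

Running one such process per virtual queue, each bound to the core that services that queue's interrupts, approximates the per-queue capture setup evaluated below.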
PF_RING is used both on top of vanilla NIC drivers and on top of PF_RING-aware drivers; in the latter case, the driver sends packets directly to PF_RING without using the Linux stack (this is called transparent mode). In both cases pfcount simply discards the received packets. Finally, pfcount has been tested on top of PF_RING with TNAPI. Figure 3 depicts the maximum loss-free packet capture rate of pfcount in various configurations when increasing the number of capture threads. The input packet rate is 1.488 Mpps with 64-byte packets (wire speed with minimum-size packets on 1 Gbit Ethernet). The test has been repeated in the following configurations:

- PF_RING in vanilla mode (i.e. hooked to the Linux kernel as AF_PACKET).
- PF_RING in transparent mode (i.e. packets are pushed by the NIC driver to PF_RING without using the standard kernel mechanisms).
- Single queue (SQ): the NIC driver is the one that comes with the Linux kernel and does not enable multiple queues on the NIC (default behavior on Linux).
- Multiple queues (MQ): the NIC driver is PF_RING-aware (i.e. it is part of the PF_RING source tree), so it sends incoming packets directly to PF_RING, also passing the identifier of the queue on which each packet has been received. In this configuration the driver is started in multi-queue mode.
- PF_RING+TNAPI: PF_RING is started in transparent mode, on top of the NIC driver started in MQ mode.
Note that in SQ mode, a single multi-threaded pfcount per NIC is activated. In MQ mode, two multi-threaded pfcount instances have been activated, each polling from a different queue; in this case the sum of the packets captured by both applications has been reported.

[Figure 3. TNAPI and PF_RING Performance Tests: maximum loss-free rate (Kpps) vs. number of threads, for PF_RING (vanilla) SQ, PF_RING (transparent) SQ, PF_RING (vanilla) MQ, PF_RING (transparent) MQ and PF_RING+TNAPI MQ, compared to the wire-rate.]

As shown in Figure 3, the original PF_RING does not benefit from multi-queue adapters. In fact, the performance is roughly the same whether a single queue or multiple queues are enabled. Spawning more packet capture threads does not provide any performance gain and, on the contrary, causes a performance drop. This is obvious for thread pools larger than the number of physical cores (greater than 2), and it is expected also in the two-thread scenario, as the threads compete for a shared resource and spend most of their time in synchronization. A similar behavior can be observed when TNAPI is enabled. In fact, by using two threads (the TNAPI kernel thread and the packet capture thread at the user level) the application can reach 1260 Kpps, which is close, but not equal, to the wire rate. On the other hand, by spawning two threads at user level (TNAPI disabled) the maximum loss-free rate is 540 Kpps, which is even lower than the rate obtained with a single thread (610 Kpps).

So far, this test has demonstrated that, contrary to the original PF_RING, TNAPI allows the parallelism of the Core2Duo processor to be exploited.

Test Platform                              PF_RING SQ   PF_RING MQ   PF_RING+TNAPI
Core2Duo (pfcount not bound to a CPU)      721 Kpps     610 Kpps     1264 Kpps
Dual i7 (pfcount not bound to a CPU)       1326 Kpps    708 Kpps     1488 Kpps
Dual i7 (pfcount bound to a CPU)           858 Kpps     1178 Kpps    1488 Kpps
Table 1. TNAPI and PF_RING Performance Tests Using pfcount (mono-threaded)

Table 1 reports the results of mono-threaded pfcount on the two selected test platforms. The test has demonstrated that:
- When TNAPI is enabled, the Core2Duo performs as well as the high-end platform equipped with the original version of PF_RING.
- TNAPI allows the high-end platform to capture packets at wire-rate.
- Multi-queue technology is not beneficial for packet capture unless properly used, as its use can result in major performance drops with respect to a single queue.
- Binding pfcount to a specific CPU (also known as SMP affinity, done by means of the taskset tool) can significantly change the performance figures.
In addition to SMP affinity, it is also important to take into account how IRQ balancing affects the results. In fact, the NIC sends an interrupt whenever it has to notify the operating system that a new packet has been received. With MQ, the operating system assigns a different IRQ to each queue. In order to better understand how SMP affinity and IRQ balancing affect the performance figures, and hence TNAPI scalability, the authors experimented with various setups on the dual i7 platform. Even if the NICs support up to 8 queues, the authors decided to configure only two queues, as they are enough for handling 1 Gbit of traffic. Before reporting the test results, it is worth understanding how the i7's NUMA architecture affects the tests.

[Figure 4. Typical NUMA Architecture: two CPUs, each with a local memory bank, connected via QPI to each other and to the I/O hub hosting the 1 Gbit NICs (on-board NICs and PCIe expansion slots).]

In NUMA architectures, memory is local to a CPU; when a core needs to access memory not allocated on the same CPU, it cannot access it directly but must go through QPI (QuickPath Interconnect). The same applies to IRQs, which for maximum performance must be delivered to the same core on which the application that consumes those interrupts resides. This means that in order to optimize performance, for a given queue it is necessary to bind on the same core:
- The TNAPI thread that fetches packets from the NIC adapter.
- The IRQ used for the queue.
In Linux, /proc/cpuinfo lists the available CPUs. Each entry contains:
- processor: unique core identifier.
- physical id: the identifier of the CPU on which the core resides.
- core id: the unique identifier of the core inside the CPU it is running on.

[Table 2. Cores Mapping on Linux with Dual i7: processor, physical id and core id for each of the 16 logical processors.]

Table 2 shows that processors with adjacent processor identifiers are not physically close. Furthermore, due to Hyper-Threading, processors 0 and 8 are basically two twin cores, as are 1 and 9, and so on. Table 3 reports the maximum packet capture rate when capturing from two NICs simultaneously while the traffic generator is injecting traffic at wire speed. In all tests, interrupts are bound to the same cores used for TNAPI: for NIC 1 processors 1 and 9 are used, for NIC 2 processors 2 and 10 are used.

[Table 3. How SMP Affinity and IRQ Balancing Affect Packet Capture Throughput (Kpps): tests combining different pfcount and TNAPI thread affinities on the two NICs, for instance binding pfcount and TNAPI threads per queue (NIC1@0 on core 1, NIC1@1 on core 9, NIC2@0 on core 2, NIC2@1 on core 10) or binding pfcount per NIC (NIC1 queues on cores 1,9 and NIC2 queues on cores 2,10), together with the resulting per-NIC capture rates.]

The test outcome shows that, even with TNAPI, a very fast machine such as a dual i7 is not able to capture at wire speed from two adapters simultaneously unless the affinity is properly tuned. In fact, setting the wrong CPU affinity may cause a substantial drop in the aggregated packet capture rate (i.e., the difference between test 2 and test 5 is over 900 Kpps). In principle, the pfcount application should be bound to the same core used for the queue it handles. However, splitting the load over two queues means that pfcount is idle most of the time, at least on fast processors such as the i7. As a consequence, pfcount must call poll() very often, as it has no packets to process and hence needs to go to sleep until new packets arrive; this may lead to packet losses. As system calls are slow, it is better to keep pfcount busy so that poll() calls are reduced. The best way of doing so is to bind pfcount to two queues, in order to increase the number of incoming packets it handles. In fact, in test 5 pfcount has been able to capture all incoming packets with no loss (i.e. the full aggregate wire rate). Note that if pfcount is replaced with a more computationally intensive application (which therefore does not call poll() too often), the settings of test 4 may provide better performance. The authors decided to plug two extra NICs into the system to check whether it was possible to reach wire rate with 4 NICs at the same time (4 Gbit/s of aggregated bandwidth). The third and fourth NICs were configured using the same tuning parameters as in test 5 and the measurements were repeated. All four pfcount applications captured the traffic with no loss. Preliminary tests at 10 Gbit confirmed that this setup is also effective in that scenario, with the difference that the card used in the tests (based on the 82598 chipset) does not support more than eight queues, so scalability beyond 4 Gbit needs to be verified on newer adapters (based on the 82599 chipset).
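The interrupt side of the binding used in these tests can be configured through the /proc interface: writing a CPU bitmask to /proc/irq/<n>/smp_affinity delivers that interrupt to the selected core(s). A minimal sketch follows; the IRQ numbers of the per-queue MSI-X vectors have to be looked up in /proc/interrupts, and the IRQ and core values used here are placeholders rather than values from the paper's testbed.

/*
 * Pin one interrupt to one core by writing a hexadecimal CPU mask to
 * /proc/irq/<irq>/smp_affinity (requires root).  The per-queue MSI-X IRQ
 * numbers must be taken from /proc/interrupts; the values below are
 * placeholders, not the ones used in the paper's testbed.
 */
#include <stdio.h>

static int set_irq_affinity(int irq, int core)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
    f = fopen(path, "w");
    if (f == NULL) {
        perror(path);
        return -1;
    }
    /* smp_affinity takes a hexadecimal CPU mask: bit N selects core N. */
    fprintf(f, "%x\n", 1u << core);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Example: deliver IRQ 73 (hypothetically a NIC RX queue vector) to core 1. */
    return set_irq_affinity(73, 1) ? 1 : 0;
}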

The conclusion of the TNAPI validation is that the correct binding of SMP affinity, IRQs, TNAPI threads and pfcount allows:
- The packet capture rate to scale linearly with the number of NICs.
- Multi-core computers to be partitioned processor-by-processor. This means that the load on each processor does not affect the load on the other processors.
In a nutshell, if incoming traffic can be balanced across all available processors, then in principle (the authors have not tested beyond 4 Gbit due to the lack of NICs on the traffic generator) TNAPI can scale linearly with the number of NICs and processors.

6. Open Issues and Future Work Items
One of the basic assumptions of multi-core systems is the ability to balance load across cores. Modern network adapters have been designed to share the load across queues using technologies such as RSS. The basic assumption is that incoming traffic can be balanced, which is often true but not all the time. In the unbalanced case, a few cores will be responsible for handling all incoming packets whereas other cores will be basically idle. This problem is also present in TNAPI, which takes advantage of RSS for traffic balancing. The authors are currently investigating software solutions whose aim is to further balance the traffic on top of RSS by creating virtual queues in addition to the physical ones, while preserving the current performance. This problem is the same as balancing the traffic at the kernel level on top of legacy network adapters that do not feature multiple queues.

7. Conclusions
The use of multi-core systems enables the development of both high-speed and memory/computing-intensive applications. The market trend is clearly towards many-core architectures, as they are the only ones able to provide developers with a positive answer to network monitoring needs, as well as the scalability required for future high-speed networks. This paper has highlighted some of the challenges users face when using multi-core systems for the purpose of network monitoring. Although multi-core is the future of computing, it is a technology that needs to be used with care in order to avoid performance issues and, in some cases, performance degradation. The authors have identified which aspects need to be taken into account when developing applications for multi-core systems, as well as the limitations of current monitoring architectures. TNAPI is a novel approach to overcome existing operating system limitations and unleash the power of multi-core systems when used for network monitoring. The validation process has demonstrated that by using TNAPI it is possible to capture packets very efficiently both at 1 and 10 Gbit and that, in contrast to the current generation of operating systems, packet capture can scale almost linearly with the number of processors.

Code Availability
This work is distributed under the GNU GPL2 license and is available from the ntop.org web site.

8. References
[1] P. Van Roy, The Challenges and Opportunities of Multiple Processors: Why Multi-Core Processors are Easy and Internet is Hard, Proceedings of ICMC, (2008).
[2] J. Frühe, Planning Considerations for Multi-core Processor Technology, White Paper, (2005).
[3] T. Tartalja and V. Milutinovich, The Cache Coherence Problem in Shared-Memory Multiprocessors: Software Solutions, (1996).
[4] A. Kumar and R. Huggahalli, Impact of Cache Coherence Protocols on the Processing of Network Traffic, 40th Annual IEEE/ACM International Symposium on Microarchitecture, (2007).
[5] Intel, Intelligent Queuing Technologies for Virtualization, White Paper, (2008).
[6] A. Papadogiannakis and others, Improving the Performance of Passive Network Monitoring Applications using Locality Buffering, Proceedings of MASCOTS, (2007).
[7] Microsoft, Scalable Networking: Eliminating the Receive Processing Bottleneck, Introducing RSS, WinHEC (Windows Hardware Engineering Conference), (2004).
[8] A. Heyde, Investigating the Performance of Endace DAG Monitoring Hardware and Intel NICs in the Context of Lawful Interception, CAIA Technical Report A, (2008).

[9] Intel, Improving Network Performance in Multi-Core Systems, White Paper, (2007).
[10] Intel, Accelerating High-Speed Networking with Intel I/O Acceleration Technology, White Paper, (2006).
[11] J. Salim and R. Olsson, Beyond Softnet, Proceedings of the 5th Annual Linux Showcase & Conference, (2001).
[12] L. Rizzo, Device Polling Support for FreeBSD, BSDConEurope Conference, (2001).
[13] L. Degioanni and G. Varenni, Introducing Scalability in Network Measurement: Toward 10 Gbps with Commodity Hardware, Internet Measurement Conference, (2004).
[14] K. Asanovic and others, The Landscape of Parallel Computing Research: A View from Berkeley, Technical Report UCB/EECS, EECS Department, University of California, Berkeley, December (2006).
[15] L. Deri, Improving Passive Packet Capture: Beyond Device Polling, Proceedings of SANE, (2004).
[16] A. Cox, Network Buffers and Memory Management, The Linux Journal, Issue 30, (1996).
[17] B. Milekic, Network Buffer Allocation in the FreeBSD Operating System, Proceedings of BSDCan, (2004).
[18] C. Leiserson and I. Mirman, How to Survive the Multi-core Software Revolution, Cilk Arts, (2009).
[19] S. Schneider, Scalable Locality-Conscious Multithreaded Memory Allocation, Proceedings of the ACM SIGPLAN, (2006).
[20] H. Sutter, Eliminate False Sharing, Dr. Dobb's Journal, Issue 5, (2009).
[21] F. Fusco and others, Enabling High-Speed and Extensible Real-Time Communications Monitoring, 11th IFIP/IEEE International Symposium on Integrated Network Management, (2009).
[22] T. Sterling, Multi-Core for HPC: Breakthrough or Breakdown?, Panel of the SC06 Conference, (2006).
[23] R. Love, Linux System Programming: Talking Directly to the Kernel and C Library, O'Reilly Media Inc., (2007).
[24] R. Kay, Pragmatic Network Latency Engineering Fundamental Facts and Analysis, White Paper, (2009).
[25] D. Comer, Network Systems Design Using Network Processors, Computing Reviews, Vol. 45, No. 9, (2004).
[26] A. Agarwal, The Tile Processor: A 64-core Multicore for Embedded Processing, Proceedings of the HPEC Workshop, (2007).
[27] M. Dashtbozorgi and M. Abdollahi Azgomi, A High-Performance Software Solution for Packet Capture and Transmission, Proceedings of ICCSIT, (2009).
[28] L. Deri, nCap: Wire-speed Packet Capture and Transmission, Proceedings of E2EMON, (2005).
[29] W.J. Bolosky and M.L. Scott, False Sharing and its Effect on Shared Memory Performance, USENIX Experiences with Distributed and Multiprocessor Systems, (1994).
[30] J. Hadi Salim and others, Beyond Softnet, Proceedings of the USENIX Annual Technical Conference, (2001).
[31] M. Dobrescu and others, RouteBricks: Exploiting Parallelism to Scale Software Routers, 22nd ACM Symposium on Operating Systems Principles (SOSP), (2009).


Performance Guideline for syslog-ng Premium Edition 5 LTS

Performance Guideline for syslog-ng Premium Edition 5 LTS Performance Guideline for syslog-ng Premium Edition 5 LTS May 08, 2015 Abstract Performance analysis of syslog-ng Premium Edition Copyright 1996-2015 BalaBit S.a.r.l. Table of Contents 1. Preface... 3

More information

Assessing the Performance of Virtualization Technologies for NFV: a Preliminary Benchmarking

Assessing the Performance of Virtualization Technologies for NFV: a Preliminary Benchmarking Assessing the Performance of Virtualization Technologies for NFV: a Preliminary Benchmarking Roberto Bonafiglia, Ivano Cerrato, Francesco Ciaccia, Mario Nemirovsky, Fulvio Risso Politecnico di Torino,

More information

Understanding the Benefits of IBM SPSS Statistics Server

Understanding the Benefits of IBM SPSS Statistics Server IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster

More information

Infrastructure for active and passive measurements at 10Gbps and beyond

Infrastructure for active and passive measurements at 10Gbps and beyond Infrastructure for active and passive measurements at 10Gbps and beyond Best Practice Document Produced by UNINETT led working group on network monitoring (UFS 142) Author: Arne Øslebø August 2014 1 TERENA

More information

Leveraging NIC Technology to Improve Network Performance in VMware vsphere

Leveraging NIC Technology to Improve Network Performance in VMware vsphere Leveraging NIC Technology to Improve Network Performance in VMware vsphere Performance Study TECHNICAL WHITE PAPER Table of Contents Introduction... 3 Hardware Description... 3 List of Features... 4 NetQueue...

More information

High-Performance Network Traffic Processing Systems Using Commodity Hardware

High-Performance Network Traffic Processing Systems Using Commodity Hardware High-Performance Network Traffic Processing Systems Using Commodity Hardware José LuisGarcía-Dorado, Felipe Mata, Javier Ramos, Pedro M. Santiago del Río, Victor Moreno, and Javier Aracil High Performance

More information

The proliferation of the raw processing

The proliferation of the raw processing TECHNOLOGY CONNECTED Advances with System Area Network Speeds Data Transfer between Servers with A new network switch technology is targeted to answer the phenomenal demands on intercommunication transfer

More information

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring CESNET Technical Report 2/2014 HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring VIKTOR PUš, LUKÁš KEKELY, MARTIN ŠPINLER, VÁCLAV HUMMEL, JAN PALIČKA Received 3. 10. 2014 Abstract

More information

Network Function Virtualization: Virtualized BRAS with Linux* and Intel Architecture

Network Function Virtualization: Virtualized BRAS with Linux* and Intel Architecture Intel Network Builders Reference Architecture Packet Processing Performance of Virtualized Platforms Network Function Virtualization: Virtualized BRAS with Linux* and Intel Architecture Packet Processing

More information

Linux Driver Devices. Why, When, Which, How?

Linux Driver Devices. Why, When, Which, How? Bertrand Mermet Sylvain Ract Linux Driver Devices. Why, When, Which, How? Since its creation in the early 1990 s Linux has been installed on millions of computers or embedded systems. These systems may

More information

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies

Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Virtualization Technologies and Blackboard: The Future of Blackboard Software on Multi-Core Technologies Kurt Klemperer, Principal System Performance Engineer [email protected] Agenda Session Length:

More information

Hitachi Virtage Embedded Virtualization Hitachi BladeSymphony 10U

Hitachi Virtage Embedded Virtualization Hitachi BladeSymphony 10U Hitachi Virtage Embedded Virtualization Hitachi BladeSymphony 10U Datasheet Brings the performance and reliability of mainframe virtualization to blade computing BladeSymphony is the first true enterprise-class

More information

Improving Passive Packet Capture: Beyond Device Polling

Improving Passive Packet Capture: Beyond Device Polling Improving Passive Packet Capture: Beyond Device Polling Luca Deri NETikos S.p.A. Via del Brennero Km 4, Loc. La Figuretta 56123 Pisa, Italy Email: [email protected] http://luca.ntop.org/ Abstract Passive

More information

MODULE 3 VIRTUALIZED DATA CENTER COMPUTE

MODULE 3 VIRTUALIZED DATA CENTER COMPUTE MODULE 3 VIRTUALIZED DATA CENTER COMPUTE Module 3: Virtualized Data Center Compute Upon completion of this module, you should be able to: Describe compute virtualization Discuss the compute virtualization

More information

Router Architectures

Router Architectures Router Architectures An overview of router architectures. Introduction What is a Packet Switch? Basic Architectural Components Some Example Packet Switches The Evolution of IP Routers 2 1 Router Components

More information

Parallel Firewalls on General-Purpose Graphics Processing Units

Parallel Firewalls on General-Purpose Graphics Processing Units Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering

More information

How System Settings Impact PCIe SSD Performance

How System Settings Impact PCIe SSD Performance How System Settings Impact PCIe SSD Performance Suzanne Ferreira R&D Engineer Micron Technology, Inc. July, 2012 As solid state drives (SSDs) continue to gain ground in the enterprise server and storage

More information

Open Source VoIP Traffic Monitoring

Open Source VoIP Traffic Monitoring Open Source VoIP Traffic Monitoring Luca Deri Why VoIP is a Hot Topic? Thanks to open source projects (e.g. Asterisk, Gizmo), and custom Linux distributions (e.g. Asterisk@Home) setting up a VoIP

More information

Cisco Integrated Services Routers Performance Overview

Cisco Integrated Services Routers Performance Overview Integrated Services Routers Performance Overview What You Will Learn The Integrated Services Routers Generation 2 (ISR G2) provide a robust platform for delivering WAN services, unified communications,

More information

Using Fuzzy Logic Control to Provide Intelligent Traffic Management Service for High-Speed Networks ABSTRACT:

Using Fuzzy Logic Control to Provide Intelligent Traffic Management Service for High-Speed Networks ABSTRACT: Using Fuzzy Logic Control to Provide Intelligent Traffic Management Service for High-Speed Networks ABSTRACT: In view of the fast-growing Internet traffic, this paper propose a distributed traffic management

More information

50. DFN Betriebstagung

50. DFN Betriebstagung 50. DFN Betriebstagung IPS Serial Clustering in 10GbE Environment Tuukka Helander, Stonesoft Germany GmbH Frank Brüggemann, RWTH Aachen Slide 1 Agenda Introduction Stonesoft clustering Firewall parallel

More information

High-performance vswitch of the user, by the user, for the user

High-performance vswitch of the user, by the user, for the user A bird in cloud High-performance vswitch of the user, by the user, for the user Yoshihiro Nakajima, Wataru Ishida, Tomonori Fujita, Takahashi Hirokazu, Tomoya Hibi, Hitoshi Matsutahi, Katsuhiro Shimano

More information

Design Patterns for Packet Processing Applications on Multi-core Intel Architecture Processors

Design Patterns for Packet Processing Applications on Multi-core Intel Architecture Processors White Paper Cristian F. Dumitrescu Software Engineer Intel Corporation Design Patterns for Packet Processing Applications on Multi-core Intel Architecture Processors December 2008 321058 Executive Summary

More information

SIDN Server Measurements

SIDN Server Measurements SIDN Server Measurements Yuri Schaeffer 1, NLnet Labs NLnet Labs document 2010-003 July 19, 2010 1 Introduction For future capacity planning SIDN would like to have an insight on the required resources

More information

I3: Maximizing Packet Capture Performance. Andrew Brown

I3: Maximizing Packet Capture Performance. Andrew Brown I3: Maximizing Packet Capture Performance Andrew Brown Agenda Why do captures drop packets, how can you tell? Software considerations Hardware considerations Potential hardware improvements Test configurations/parameters

More information

PCI Express High Speed Networks. Complete Solution for High Speed Networking

PCI Express High Speed Networks. Complete Solution for High Speed Networking PCI Express High Speed Networks Complete Solution for High Speed Networking Ultra Low Latency Ultra High Throughput Maximizing application performance is a combination of processing, communication, and

More information

Quantifying the Performance Degradation of IPv6 for TCP in Windows and Linux Networking

Quantifying the Performance Degradation of IPv6 for TCP in Windows and Linux Networking Quantifying the Performance Degradation of IPv6 for TCP in Windows and Linux Networking Burjiz Soorty School of Computing and Mathematical Sciences Auckland University of Technology Auckland, New Zealand

More information

Computer Systems Structure Input/Output

Computer Systems Structure Input/Output Computer Systems Structure Input/Output Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Examples of I/O Devices

More information

Chapter 14 Virtual Machines

Chapter 14 Virtual Machines Operating Systems: Internals and Design Principles Chapter 14 Virtual Machines Eighth Edition By William Stallings Virtual Machines (VM) Virtualization technology enables a single PC or server to simultaneously

More information

An Oracle Technical White Paper November 2011. Oracle Solaris 11 Network Virtualization and Network Resource Management

An Oracle Technical White Paper November 2011. Oracle Solaris 11 Network Virtualization and Network Resource Management An Oracle Technical White Paper November 2011 Oracle Solaris 11 Network Virtualization and Network Resource Management Executive Overview... 2 Introduction... 2 Network Virtualization... 2 Network Resource

More information

OpenMosix Presented by Dr. Moshe Bar and MAASK [01]

OpenMosix Presented by Dr. Moshe Bar and MAASK [01] OpenMosix Presented by Dr. Moshe Bar and MAASK [01] openmosix is a kernel extension for single-system image clustering. openmosix [24] is a tool for a Unix-like kernel, such as Linux, consisting of adaptive

More information

Chronicle: Capture and Analysis of NFS Workloads at Line Rate

Chronicle: Capture and Analysis of NFS Workloads at Line Rate Chronicle: Capture and Analysis of NFS Workloads at Line Rate Ardalan Kangarlou, Sandip Shete, and John Strunk Advanced Technology Group 1 Motivation Goal: To gather insights from customer workloads via

More information

Network Virtualization Technologies and their Effect on Performance

Network Virtualization Technologies and their Effect on Performance Network Virtualization Technologies and their Effect on Performance Dror Goldenberg VP Software Architecture TCE NFV Winter School 2015 Cloud Computing and NFV Cloud - scalable computing resources (CPU,

More information