Linux TCP Stack Performance Comparison and Analysis on SMP systems


Shourya P. Bhattacharya
Kanwal Rekhi School of Information Technology, Indian Institute of Technology, Bombay
Powai, Mumbai
shourya@it.iitb.ac.in

1. Introduction

With the phenomenal growth of networked applications and their users, there is a compelling need for very high speed and highly scalable communications software. A networked application can be made fast and scalable in a variety of ways. At the application layer, speed can be improved by refining the internal algorithms, and scale can be achieved by running multiple replicas of the application and balancing the load between them. However, at the heart of any application written for use over a network is the lower layer protocol stack implementation of the underlying operating system. For an application to run at high speed and achieve high scalability, the protocol stack implementation that it uses must also be able to keep up and not become a bottleneck. It is therefore important to study and understand the performance of the protocol stack implementation which the application will use.

Recent trends in technology show that although the raw transmission speeds used in networks are increasing rapidly, the rate of advancement of the microprocessor technology that needs to drive the network has slowed down over the last couple of years. Gigabit Ethernet networks have now become commonplace, while 10 Gbps Ethernet is rapidly making its presence felt [12]. Processor clock speeds, on the other hand, are no longer improving at a comparable rate. Hence major microprocessor manufacturers such as Intel and AMD are exploring newer processor architectures. Intel was the first to introduce simultaneous multithreading, or Hyper-Threading [5], in its Pentium IV processors, which enabled a single Pentium IV processor to appear as two logical processors. More recently, both AMD and Intel have launched dual core processors. Dual core processors, as the name suggests, package two processing cores on a single chip, effectively behaving like an SMP (Symmetric Multiprocessor) system. In the future we can expect CPUs with even more cores to be introduced.

The end result of this is that network protocol processing overheads have risen sharply in comparison with the time spent in packet transmission. Although network layer processing is highly optimised in routers, which are special-purpose devices, in application servers (i.e. hosts) this protocol processing has to happen on the general purpose hardware that the server runs on. When the load on the server increases, protocol processing overheads can start to dominate and can degrade the useful throughput of the application.

Several approaches have been employed to speed up and scale up protocol stack implementations.

In the context of TCP/IP, offloading the protocol processing to dedicated hardware, instead of carrying it out on the general purpose CPU, has been proposed. This essentially frees the CPU to be dedicated to application processing. In either case, i.e. with or without offloading, parallel processing architectures can be exploited to speed up protocol processing. Whether the purpose is to determine the TCP/IP components that have the highest processing requirements, or to determine how the implementation scales to SMP architectures, a careful performance study of the TCP/IP stack implementation of the operating system in question is a must.

In this paper, we discuss the results of such a study done for the Linux operating system. The Linux OS has shot up in popularity recently and is now used even by large-scale system operators. Specifically, we have focused on the network stack performance of Linux kernel 2.4 and kernel 2.6. Until recently, kernel 2.4 was the most stable Linux kernel and was used extensively. Kernel 2.6 is the latest stable Linux kernel and is fast replacing kernel 2.4. Although performance studies of network protocol stacks have been done before, this is the first time a thorough comparison of the Linux 2.4 and Linux 2.6 TCP/IP stack performance has been carried out. We have studied and compared the performance of these two Linux versions along various metrics: bulk data throughput, connection throughput and scalability across multiple processors. We also present a fine-grained profiling of resource usage by the TCP/IP stack functions, thereby identifying the bottlenecks.

In almost all the tests, kernel 2.6 performed better than kernel 2.4. It was observed that kernel 2.6 sustained higher data transfer rates and a much higher number of simultaneous connections than kernel 2.4. The higher sustained data transfer rates in kernel 2.6 were attributed to its more efficient copy routines, while the much higher number of simultaneous connections in kernel 2.6 was a result of its superior O(1) scheduler. Our experiments with kernel 2.6 on an SMP architecture show degraded performance when the processing of a single connection is spread over multiple processors, thus verifying the superiority of the Processor per Connection [2] model for protocol processing in SMP systems. The kernel profiling results also revealed that interrupt processing time, the device driver code and the copy routines take up a significant amount of the total network processing time.

This paper is organised as follows. Section 2 reviews approaches to speeding up protocol processing. Section 3 discusses the improvements made in Linux kernel 2.6 which affect the network performance of the system. Section 4 describes the benchmarking experiments performed, the hardware setup, their results and implications. In Section 5 we discuss the kernel profiling results obtained with OProfile [8] and the inferences drawn from them. Section 6 concludes our observations and suggests future work.

2. Speeding up protocol processing

Improving and speeding up the network protocol processing stack has become an area of active research interest. In this section we look at two ways in which the problem of improving protocol processing speed has been approached. Firstly, attempts have been made to offload the protocol processing from the host system to the NIC, which has dedicated hardware for it. Secondly, efforts have been made to parallelise the protocol stack processing to make use of multiple processors. We take a brief look at both these approaches.
It must be noted that these approaches are not mutually exclusive.

2.1. TCP offloading

TCP offload has been a well debated topic over the last decade [11]. The typical benefits of TCP/IP offloading include a reduction of host CPU requirements for stack processing and checksumming, fewer interrupts to the host CPU, fewer bytes copied over the system bus, and the offloading of computationally expensive features such as encryption to specialised hardware on the NIC. In spite of these benefits, there has been some criticism of TCP offload.

The primary argument against it has been that the TCP protocol in itself is not a very expensive operation and that the rapid increase of CPU processing power will nullify the cost effectiveness of offloading TCP to the NIC. Given the changing dynamics of technological development, however, TCP offloading has become extremely relevant [6]. Our experimental results in Section 5 show that significant protocol processing time is spent in buffer copying, checksumming and interrupt handling. TCP offloading would free the host CPU from these overheads and hence would be extremely useful.

2.2. Parallel Protocol Processing Approaches

The layered architecture of the network stack makes it difficult to parallelise efficiently. A lot of work has been done [2, 1, 7] in this field to effectively parallelise the protocol stack. We discuss some of the well known paradigms.

Processor per Message is a parallelising paradigm in which each processor executes the whole protocol stack for one message. With this approach heavily used connections can be served efficiently, since several processors can service different messages of the same connection. However, this also implies that the connection state has to be shared between the processors servicing the packets, which can lead to synchronisation problems.

Processor per Connection lets one processor handle all the messages belonging to a particular connection. This approach works well in SMP systems as it can make optimal use of the processor cache; however, it suffers from fragmentation of resources.

Processor per Protocol is another approach in which each layer of the protocol stack is processed by a particular processor. Messages may be shuffled from one processor to another as they move up or down the protocol stack. One limitation of this approach is that the messages cannot be cached efficiently as they are shuffled from one processor to the other [2].

Processor per Task is a parallelising technique in which each processor performs a specific task or function within a protocol, or a task common to more than one protocol. This paradigm, in theory, tries to reduce the processing time or latency of a message. The main disadvantage of this approach is that both protocol state and messages must be shared between processors. It also suffers from poor caching, as in the case of the processor per protocol paradigm.

It has been shown by Schmidt and Suda [10] that processor per message and processor per connection generally work better than the other two paradigms in a shared memory multiprocessor system. This happens because of the high cost of context switches while crossing protocol layers. Our results have shown that even the processor per message paradigm has a detrimental effect on TCP performance. This happens because TCP is a connection oriented protocol and needs to store the connection state in memory. Distributing the processing of packets belonging to a single connection over different processors leads to frequent cache invalidations in all the processors. Hence only the processor per connection paradigm is optimally suited for new generation processors having multiple cores and simultaneous multithreading.
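To make the processor per connection idea concrete, the sketch below shows one common way of realising it: hash the connection 4-tuple and use the hash to pick a CPU, so that every packet of a given connection is handled by the same processor and its TCP state stays warm in that processor's cache. This is an illustrative user-space sketch under our own assumptions, not code from the Linux kernel; the structure and function names are ours.

    #include <stdint.h>
    #include <stdio.h>

    /* A TCP connection is identified by its 4-tuple. */
    struct flow {
        uint32_t saddr, daddr;   /* IPv4 source/destination address */
        uint16_t sport, dport;   /* TCP source/destination port     */
    };

    /* Simple illustrative hash over the 4-tuple (not the kernel's). */
    static uint32_t flow_hash(const struct flow *f)
    {
        uint32_t h = f->saddr ^ f->daddr;
        h ^= ((uint32_t)f->sport << 16) | f->dport;
        h ^= h >> 16;
        h *= 0x45d9f3b;          /* arbitrary odd constant to mix the bits */
        return h ^ (h >> 13);
    }

    /* Processor per connection: every packet of a flow maps to one CPU. */
    static unsigned int flow_to_cpu(const struct flow *f, unsigned int ncpus)
    {
        return flow_hash(f) % ncpus;
    }

    int main(void)
    {
        struct flow f = { 0x0a000001, 0x0a000002, 43211, 80 };
        printf("flow handled by CPU %u of 4\n", flow_to_cpu(&f, 4));
        return 0;
    }

Because the mapping depends only on the 4-tuple, all packets of the connection land on the same CPU, which is exactly the cache-locality property discussed above.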
3. Improvements in Linux kernel 2.6

The Linux kernel 2.6 was a major upgrade from the earlier default kernel 2.4, with many improvements, and was expected to be faster and more responsive than the earlier kernel. In this section we discuss some of the improvements and changes made in kernel 2.6 which can have an impact on the performance of the networking subsystem.

Some of the major improvements in kernel 2.6 include minimal use of the Big Kernel Lock (BKL), an improved interrupt mechanism (NAPI), more efficient block copy routines and a vastly improved O(1) scheduler. The kernel 2.6 TCP stack also includes new congestion control and recovery algorithms which are not available in the kernel 2.4 TCP stack.

3.1. Minimal use of the Big Kernel Lock

The BKL is a global kernel lock which allows only one processor to run kernel code at any given time, to make the kernel safe for concurrent access from multiple CPUs. The BKL is essentially a spinlock, but with a couple of interesting properties:

- The BKL can be taken recursively; therefore two consecutive requests for it will not deadlock the process.
- Code holding the BKL can sleep and even enter the scheduler while holding the lock. The lock is released while the given thread sleeps, and re-acquired upon awakening.

The BKL made SMP Linux possible, but it does not scale very well, hence there is a continuous effort to avoid the BKL and use more fine grained locks instead. Although kernel 2.6 is still not completely free of the BKL, its usage has been greatly reduced. The kernel 2.6 networking stack has only one reference to the BKL.

3.2. New API - NAPI

One of the most significant changes in the kernel 2.6 network stack is the addition of NAPI ("New API"), which is designed to improve the performance of high-speed networking. The basic principles of NAPI are:

- Interrupt mitigation. High-speed networking can create thousands of interrupts per second, which can lead to an interrupt livelock. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.
- Packet throttling. When the system is overwhelmed and must drop packets, it is better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adapter itself, before the kernel sees them at all.

3.3. Efficient copy routines

The Linux kernel maintains separate address spaces for the kernel and user processes for protection against misbehaving programs. Because of the two separate address spaces, when a packet is sent or received over the network an additional step of copying the network buffer from user space to kernel space, or vice versa, is required. As a result, the efficiency of the kernel copy routine used has a profound impact on the overall network performance. Kernel 2.6 copy routines have been optimised for the x86 architecture and use the technique of a hand unrolled loop [3, 9] with integer registers, instead of the less efficient movsd instruction used in kernel 2.4.
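As a rough illustration of what a hand unrolled copy loop means here, the sketch below copies four 32-bit words per loop iteration instead of one byte per iteration, so fewer branch and index-update instructions are spent per byte moved. This is a simplified user-space analogue written for this discussion, not the kernel's copy_from_user implementation; the function names are ours.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Naive copy: one byte per loop iteration. */
    static void copy_bytewise(uint8_t *dst, const uint8_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* Hand-unrolled copy: four 32-bit words per iteration, tail done bytewise.
     * Fewer loop-control instructions per byte moved is the point of unrolling. */
    static void copy_unrolled(uint8_t *dst, const uint8_t *src, size_t n)
    {
        uint32_t *d = (uint32_t *)dst;
        const uint32_t *s = (const uint32_t *)src;
        size_t blocks = n / 16;

        while (blocks--) {
            d[0] = s[0];
            d[1] = s[1];
            d[2] = s[2];
            d[3] = s[3];
            d += 4;
            s += 4;
        }
        copy_bytewise((uint8_t *)d, (const uint8_t *)s, n % 16);
    }

    int main(void)
    {
        uint32_t srcbuf[25], dstbuf[25];              /* 100 bytes, word aligned */
        uint8_t *src = (uint8_t *)srcbuf, *dst = (uint8_t *)dstbuf;
        for (int i = 0; i < 100; i++)
            src[i] = (uint8_t)i;
        copy_unrolled(dst, src, 100);
        printf("copies match: %d\n", memcmp(dst, src, 100) == 0);
        return 0;
    }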
3.4. Scheduling algorithm

The Linux kernel 2.6 scheduler is probably the single most significant change made in the kernel. The kernel 2.6 scheduler was written completely from scratch to overcome some of the limitations of the kernel 2.4 scheduler. The kernel 2.4 scheduler, while widely used and quite reliable, had several undesirable characteristics. Its biggest flaw was that it contained O(n) algorithms, where n is the number of processes in the system, and hence was not scalable. The new scheduler in kernel 2.6, on the other hand, does not contain any algorithm that runs in worse than O(1) time. This is extremely important for applications like web servers, as it allows them to handle a large number of concurrent connections without dropping requests.

The Linux kernel 2.4 scheduling algorithm divides time into epochs, which are periods of time during which every task is allowed to use up its timeslice. Timeslices need to be computed for all tasks in the system when an epoch begins, which means that the scheduler's timeslice calculation runs in O(n) time, since it must iterate over every task. Kernel 2.6, on the other hand, uses more sophisticated algorithms which ensure that there is no point at which all tasks need a new timeslice calculated for them at the same time.
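The constant-time selection of the next task to run can be sketched with a priority bitmap: one ready queue per priority level plus a bitmap of non-empty levels, so that picking the next task reduces to a single find-first-set-bit operation regardless of how many tasks exist. The sketch below illustrates the idea only; it is not the kernel 2.6 scheduler code, and the sizes and names are ours.

    #include <stdint.h>
    #include <stdio.h>
    #include <strings.h>   /* ffs() */

    #define NPRIO 32       /* illustrative; lower number = higher priority */

    struct runqueue {
        uint32_t bitmap;          /* bit p set => queued[p] is non-zero */
        int      queued[NPRIO];   /* number of ready tasks per priority */
    };

    static void enqueue(struct runqueue *rq, int prio)
    {
        rq->queued[prio]++;
        rq->bitmap |= 1u << prio;
    }

    /* O(1): find the highest-priority non-empty queue with one bit scan,
     * independent of the total number of tasks in the system. */
    static int pick_next_prio(const struct runqueue *rq)
    {
        if (rq->bitmap == 0)
            return -1;                 /* nothing runnable */
        return ffs((int)rq->bitmap) - 1;   /* lowest set bit = best priority */
    }

    int main(void)
    {
        struct runqueue rq = { 0, { 0 } };
        enqueue(&rq, 7);
        enqueue(&rq, 3);
        printf("next task comes from priority level %d\n", pick_next_prio(&rq));
        return 0;
    }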

4. High level performance comparison tests

In this section we discuss the benchmarking experiments performed for a comparative study of kernel 2.4 and 2.6. Each subsection describes the type of test performed, the experimental setup and the results obtained.

4.1. Web server performance

HTTP performance tests were done on kernel 2.4 and 2.6 to get a comparative picture of both kernels' efficiency in a real world scenario. The freely available httperf [4] tool was used for load generation, and the performance of the Apache web server was tested on both kernels. The following two changes were made to the default Apache configuration file:

1. MaxClients was set to its maximum value.
2. MaxRequestsPerChild was set to zero (unlimited).

The Apache web server was run on a dual processor 3.2 GHz Xeon system. Separate tests were done on both kernel 2.4 and 2.6 with SMP enabled and disabled. The httperf load generator was run on two separate client machines. This was necessary as a single client was unable to generate enough requests to saturate the server. To reliably test the maximum number of simultaneous connections that the web server could handle, the clients were made to request a static text page only 6 bytes in size. This ensured that the 100 Mbps network bandwidth did not become a bottleneck. The following command line was used for the httperf experiments:

$httperf --hog --port 80 --uri /small.html --server (IP) --rate=(r) --num-conn (R x 15) --timeout 5

The maximum client connection request rate sustained by the server, the response time for the requests, the connection establishment time and the number of errors reported by the two kernels are shown in Figures 1, 2, 3 and 4 respectively.

Figure 1. Request rate sustained by kernel 2.4 and 2.6.
Figure 2. Response time comparison of kernel 2.4 and 2.6.

Figure 3. Connection time comparison of kernel 2.4 and 2.6.
Figure 4. Time-out error comparison of kernel 2.4 and 2.6.

These graphs clearly show that kernel 2.6 performance is much better than that of kernel 2.4. Kernel 2.4 struggled to handle more than 2800 simultaneous connections and started reporting errors beyond that point; its connection time and response time also started rising sharply. In contrast, kernel 2.6 could easily handle 4000 simultaneous connections and there was no sign of any increase in connection time or response time, suggesting that kernel 2.6 would be able to handle an even higher number of simultaneous connections than could be tested. These results again showcase the superiority of the kernel 2.6 scheduler.

4.2. Connection throughput

In this experiment we compared the maximum throughput of multiple simultaneous client connections in kernel 2.4 and 2.6. The tests were done on a single processor P4 machine which acted as a server by listening on many ports for connections. Clients repeatedly connected to and disconnected from the server without transmitting any data. The server forked a child process for each new connection request. Two experiments were done with slight variations. In the first experiment the server was made to listen on a fixed number (N) of ports. Exactly N clients were then started, which repeatedly connected and disconnected to each of the open ports on the server. The CPU utilisation of the server and the throughput as seen by the clients were measured. In the second experiment, the server, instead of listening on exactly N ports, was listening on M ports such that M >> N. In our experiments M was fixed at 300 while N was varied from 10 to 100. During both experiments the network I/O was monitored to make sure that the network bandwidth was not a bottleneck. The results obtained from the experiments are shown in Figure 5 and Figure 6. It must be noted that new connections on any given active port were made sequentially, after the previous connection had terminated. Hence 100% CPU utilisation could only be achieved with multiple active ports simultaneously attempting connection setup and tear down.
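The test programs themselves are not reproduced in the paper; purely as an illustration of the kind of server described above (fork a child per accepted connection, exchange no data, tear the connection down immediately), a minimal sketch could look like the following. The port number and all names are our own assumptions, not the authors' code.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        signal(SIGCHLD, SIG_IGN);                 /* reap children automatically */

        int srv = socket(AF_INET, SOCK_STREAM, 0);
        if (srv < 0) { perror("socket"); return 1; }

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(5000);              /* arbitrary illustrative port */

        if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(srv, 128) < 0) {
            perror("bind/listen");
            return 1;
        }

        for (;;) {
            int conn = accept(srv, NULL, NULL);   /* connection setup */
            if (conn < 0)
                continue;
            if (fork() == 0) {                    /* child owns the connection */
                close(srv);
                close(conn);                      /* no data exchanged: tear down */
                _exit(0);
            }
            close(conn);                          /* parent keeps accepting */
        }
    }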

Figure 5. Connection throughput comparison with varying number of active connections.
Figure 6. Measured throughput on the server with 300 open ports and varying number of active connections.
Table 1. Average CPU time spent per connection (in microseconds) on kernel 2.4 and kernel 2.6.

Table 1 displays the average CPU time per connection on the two kernels. This gives us the time the server spent servicing a single connection, i.e. the total time taken for accept() and close(). The observations from this experiment reveal a very interesting story. The first thing that is evident from Figure 5 and Table 1 is that the processing overhead in kernel 2.6 for connection setup and teardown is higher than that of kernel 2.4. This is attributed to the fact that the kernel 2.6 socket code contains many security hooks; for example, the socket system call in kernel 2.6 has the additional overhead of calling the functions security_socket_create() and security_socket_post_create(), which results in higher CPU utilisation.

Figure 6 reveals another very interesting fact. When there is a large number of open ports, the connection setup/teardown throughput of kernel 2.4 lags behind that of kernel 2.6. This clearly demonstrates the superiority of the kernel 2.6 scheduler. The kernel 2.4 scheduler has to cycle through all the processes listening on the open ports in the system, irrespective of whether they are active or not. The kernel 2.6 scheduler, on the other hand, is unaffected by the number of open ports in the system and its performance is comparable to that shown in Figure 5. Thus the two conclusions we can draw from this experiment are:

- The per connection processing cost in kernel 2.6 is slightly higher than that of kernel 2.4.
- The kernel 2.6 scheduler is vastly superior to the kernel 2.4 scheduler and to a large degree compensates for the higher per connection processing cost in kernel 2.6.

4.3. Socket system calls

This was a high level test intended to compare the performance of network related system calls. The tests were run on stock kernel 2.4 and kernel 2.6. The test environment consisted of a Pentium 4 machine with 256 MB RAM. All non-essential services were turned off. The tests were run with the highest priority set and with no other network activity. Custom TCP server and client programs were written, and both programs were run on the same machine, connecting over the loopback interface. The time taken by the socket system calls was measured on both the server and the client programs, using strace. The results obtained are shown in Table 2.

Table 2. Average time spent in each system call (socket(), bind(), listen(), connect()), in microseconds, on kernel 2.4 and kernel 2.6.

The observed socket system call times on the two kernels are very comparable. They show that there is not much difference in the bind() and socket() system call overheads between the two kernels, but the listen() and connect() system calls are a little cheaper in kernel 2.6.
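The measurements above came from strace; an alternative way to obtain comparable numbers is to time the calls directly around each system call, as in the hedged sketch below. This is our own illustration, not the authors' test program.

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <time.h>
    #include <unistd.h>

    /* Elapsed wall-clock time between two timestamps, in microseconds. */
    static double elapsed_us(const struct timespec *a, const struct timespec *b)
    {
        return (b->tv_sec - a->tv_sec) * 1e6 + (b->tv_nsec - a->tv_nsec) / 1e3;
    }

    int main(void)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        int s = socket(AF_INET, SOCK_STREAM, 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("socket(): %.1f us\n", elapsed_us(&t0, &t1));

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0;                        /* let the kernel pick a port */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        bind(s, (struct sockaddr *)&addr, sizeof(addr));
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("bind():   %.1f us\n", elapsed_us(&t0, &t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        listen(s, 16);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("listen(): %.1f us\n", elapsed_us(&t0, &t1));

        close(s);
        return 0;
    }

In practice such a micro-benchmark would repeat each call many times and average, as a single wall-clock sample is dominated by noise.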

4.4. Bulk data transfer performance on SMP systems

Bulk data transfer throughput is one of the most important parameters for measuring a TCP stack's performance. To measure the maximum data transfer throughput, the freely available tool iperf was used. iperf measures maximum TCP bandwidth, allows the tuning of various parameters, and can report bandwidth, delay jitter and datagram loss.

We used a system with dual 3.2 GHz Xeon CPUs as our test machine. The Xeon processors were HyperThreading (HT) [5] capable and we performed different sets of tests with both HT enabled and disabled. Kernels 2.4 and 2.6 were compiled with SMP support and their performance was compared. The tests were also repeated without SMP support on both kernels to obtain a reference point for comparison. The experiments were run over the loopback interface, as the physical 100 Mbps network interface was too slow to stress the CPU. Since the experiments were done over the loopback interface, the cost of interrupt handling and the efficiency of the device driver code did not come into the picture. The iperf server and client were invoked with the following command lines respectively:

$iperf -s -p 9999
$iperf -c localhost -p 9999 -w 255K -t 30

The tests were run for a duration of 30 seconds with the TCP window size set to the maximum of 255 KB. The buffer size was set to the default value of 8 KB. This experiment yielded some surprising results which provided significant insight into the SMP scalability issue of the kernel TCP stack. The results are shown in Figure 7.

As one might expect, in Uni-Processor mode kernel 2.6 was considerably faster than kernel 2.4, but the surprising result here is that of kernel 2.6 in SMP mode. The observed throughput in SMP mode oscillated between 3.4 Gbits/sec and 7.8 Gbits/sec. Such a large variation cannot be attributed to random errors. In the case of SMP kernel 2.4, on the other hand, the throughput rises only marginally from the 4.6 Gbits/sec measured in Uni-Processor mode.

The higher data throughput of kernel 2.6 in Uni-Processor mode is due to its more efficient copy routines, as discussed in Section 3. The variability in SMP kernel 2.6 can be attributed to cache bouncing. In kernel 2.6, because of its better scheduling logic and finer grained kernel locks, packet processing can be distributed over all available processors. In our tests, iperf creates a single TCP connection and sends data over that connection, but if incoming packets of a connection are processed on different CPUs this leads to frequent cache misses, as packets belonging to the same connection cannot reuse the TCP state information optimally. This results in poorer performance in comparison with the Uni-Processor kernel, where the entire processing is done on a single CPU which can take advantage of the TCP state information present in its cache, resulting in fewer misses.

This also explains the fluctuating high performance (around 7.5 Gbits/sec) on the 2.6 SMP kernel. Since the Intel Xeon processors are hyper-threaded, the SMP scheduler would at times schedule the packet processing on the two logical processors of the same physical processor. In such a situation there is no cache penalty, as the logical processors share the same cache. To verify this, HT was disabled and the SMP kernel tests were repeated. This stopped the performance oscillations and the 2.6 SMP kernel consistently gave a throughput of 3.4 Gbits/sec.
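As an aside, and not something the experiments above did, a test program can avoid this kind of cache bouncing without giving up SMP by pinning itself to a single CPU with the standard Linux sched_setaffinity() call, so that all processing of its connection stays on one cache. A minimal sketch, with the CPU number chosen arbitrarily:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);          /* run this process on CPU 0 only */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        printf("pinned to CPU 0; pid %d\n", (int)getpid());
        /* ... an iperf-style send/receive loop would run here ... */
        return 0;
    }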

Figure 7. Graphical representation of the data transfer rates achieved in the different test cases.
Figure 8. Data throughput (Gbps) with varying number of TCP connections (kernel 2.4-uni, 2.4-smp, 2.6-uni, 2.6-smp).

The results from the kernel profiling tests in Section 5 also confirm that the 2.6 SMP kernel spends excessive amounts of time in its copy routines.

Further tests were done by transmitting data over two TCP connections instead of one, to check the SMP performance of kernel 2.4 and 2.6. Figure 7 clearly shows that the data transfer throughput of SMP kernel 2.6 is much better with two TCP connections on a dual CPU system. The increase in data throughput on kernel 2.6 in SMP mode on a dual processor system with two TCP connections is nearly 180% when compared with the data throughput of a single TCP connection on a Uni-Processor kernel. Kernel 2.4, on the other hand, shows a speedup of only 147%.

To verify that the observed behaviour of kernel 2.6 in SMP mode with a single TCP connection is an anomaly, we did more tests with multiple simultaneous TCP connections. The results are shown in Figure 8. They clearly show that the low performance of kernel 2.6 in SMP mode with a single TCP connection is indeed an anomaly: when the number of simultaneous TCP connections is increased, kernel 2.6 gives excellent performance. A few other observations can be made from Figure 8. In Uni-Processor kernel 2.4 there is practically no change in the data throughput with an increasing number of simultaneous TCP connections. In SMP kernel 2.4, on the other hand, the data throughput rises initially with two simultaneous connections but drops slightly as the number of parallel TCP connections is increased further. Since the test machine had only two physical processors, this implies that kernel 2.4 SMP incurs some small penalty while multiplexing multiple TCP streams on a physical processor.

5. Kernel profiling results

The processing of TCP packets involves interaction with a large number of subsystems within the kernel. Any attempt to optimise and improve the packet processing time cannot succeed unless all these factors are considered. To identify these overheads, we profiled both Linux kernels using OProfile [8], a statistical profiler that uses the hardware performance counters available on modern processors to collect information about executing processes.

The profiling results also provide valuable insight into, and a concrete explanation of, the anomalous behaviour of SMP kernel 2.6 observed in Section 4.4 when processing a single TCP connection on a dual CPU system.

5.1. Breakup of TCP packet processing overheads

The breakup of TCP packet processing overheads is shown in Table 3, which lists the kernel functions that took more than 1% of the overall TCP packet processing time. The boomerang_interrupt function is the interrupt service routine for the 3COM 3c59x series NIC, which was used in our experiments. The other boomerang_* functions are also part of the NIC driver and are involved in packet transmission and reception. __copy_from_user_ll copies a block of memory from user space to kernel space, and csum_partial is the kernel checksumming routine. Thus we can see that the NIC driver code, interrupt processing, buffer copying and checksumming are the most CPU intensive operations during TCP packet processing. In comparison, the TCP functions themselves take up only a small part of the overall CPU time. These data make a strong case for TCP offloading, which could potentially lead to a saving of 30-40% of the CPU time.

Table 3. Breakup of TCP packet processing overheads in the kernel (CPU samples and percentages per function). Functions taking more than 1% of the time include boomerang_interrupt, boomerang_start_xmit, __copy_from_user_ll, issue_and_wait, csum_partial, mark_offset_tsc, ipt_do_table, boomerang_rx, tcp_sendmsg, irq_entries_start, default_idle, skb_release_data, ip_queue_xmit, tcp_v4_rcv and timer_interrupt.
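For reference, csum_partial computes the standard 16-bit one's-complement Internet checksum over a buffer. A portable, unoptimised C version of that computation (RFC 1071 style) is sketched below; it is an illustrative analogue of what the kernel's hand-tuned assembly routine does, not the kernel code itself.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* 16-bit one's-complement Internet checksum over a byte buffer. */
    static uint16_t inet_checksum(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint32_t sum = 0;

        while (len > 1) {                 /* sum 16-bit words */
            sum += (uint32_t)p[0] << 8 | p[1];
            p += 2;
            len -= 2;
        }
        if (len)                          /* odd trailing byte */
            sum += (uint32_t)p[0] << 8;

        while (sum >> 16)                 /* fold the carries back in */
            sum = (sum & 0xffff) + (sum >> 16);

        return (uint16_t)~sum;
    }

    int main(void)
    {
        uint8_t data[] = { 0x45, 0x00, 0x00, 0x3c, 0x1c, 0x46, 0x40, 0x00 };
        printf("checksum = 0x%04x\n", inet_checksum(data, sizeof(data)));
        return 0;
    }

Every byte of every packet passes through this kind of loop unless the NIC performs checksum offload, which is why checksumming shows up so prominently in the profile.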

5.2. Analysis of the kernel 2.6 SMP anomaly

In Section 4.4 we observed a sharp drop in the performance of SMP kernel 2.6 when a single TCP connection was set up on a dual CPU system, whereas with two or more TCP flows kernel 2.6 performed extremely well. To analyse this anomalous behaviour, we reran the data throughput experiments for kernel 2.6 in both SMP and Uni-Processor mode and profiled the kernel during that period. In these experiments, each TCP connection sent and received exactly 2 GB of data. This allowed us to directly compare the samples collected in the two situations.

Table 5. TCP packet processing overheads in kernel 2.6 UNI with a single TCP connection (CPU samples and percentages for __copy_from_user_ll, __copy_to_user_ll, system_call, (no symbols), schedule, tcp_sendmsg and __switch_to).

Table 4. TCP packet processing costs in kernel 2.6 SMP with a single TCP connection (samples and percentages per CPU for __copy_from_user_ll, __copy_to_user_ll, tcp_sendmsg, schedule, (no symbols), system_call and tcp_v4_rcv).

Table 6. TCP packet processing costs in kernel 2.6 SMP with two TCP connections (samples and percentages per CPU for __copy_from_user_ll, __copy_to_user_ll, schedule, tcp_sendmsg, system_call, __switch_to and tcp_v4_rcv).

The most striking fact emerging from Tables 4 and 6 is the large increase in time spent in the kernel copy routines. The functions __copy_from_user_ll() and __copy_to_user_ll() are used for copying buffers from user space to kernel space and from kernel space to user space respectively. There is a very sharp increase in the time spent in these two functions in the SMP kernel with a single TCP connection: more than 50% of the total time is spent in them. Such a sharp increase in the cost of the copy routines can be attributed to a high processor cache miss rate. To verify this, the __copy_from_user_ll() and __copy_to_user_ll() routines were analysed further and it was found that more than 95% of the time in these routines was spent on the assembly instruction

repz movsl %ds:(%esi),%es:(%edi)

This instruction copies data between the memory locations pointed to by the registers, in a loop. The performance of the movsl instruction is heavily dependent on processor data cache hits or misses. The significantly higher number of clock cycles required by the movsl instruction in the case of SMP kernel 2.6, for copying the same amount of data, can only be explained by an increase in the data cache misses of the processor.
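The instruction in question copies ECX 32-bit words from the address in ESI to the address in EDI. A minimal user-space reproduction of that copy loop, using GCC inline assembly, is shown below; it is an illustration only, not the kernel routine, and it builds only on x86 targets.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Copy n bytes (n must be a multiple of 4) with "rep movsl", the same
     * instruction the profiled kernel copy routines spend their time in. */
    static void copy_rep_movsl(void *dst, const void *src, size_t n)
    {
        size_t nwords = n / 4;
        __asm__ volatile("rep movsl"
                         : "+D"(dst), "+S"(src), "+c"(nwords)
                         : /* no other inputs */
                         : "memory");
    }

    int main(void)
    {
        uint32_t src[256], dst[256];
        for (int i = 0; i < 256; i++)
            src[i] = (uint32_t)i * 2654435761u;   /* arbitrary test pattern */
        copy_rep_movsl(dst, src, sizeof(src));
        printf("copies match: %d\n", memcmp(dst, src, sizeof(src)) == 0);
        return 0;
    }

The throughput of this single instruction is determined almost entirely by whether the source and destination lines are already in the local data cache, which is why bouncing a connection between CPUs shows up as extra cycles charged to the copy routines.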

6. Conclusion

We have compared the TCP/IP stack performance of Linux kernels 2.4 and 2.6 along several dimensions: web server performance, connection throughput, bulk data throughput and scalability on SMP systems. In almost all the tests kernel 2.6 performed better than kernel 2.4, sustaining higher data transfer rates thanks to its more efficient copy routines and a much higher number of simultaneous connections thanks to its O(1) scheduler. The profiling results show that interrupt processing, the device driver code, buffer copying and checksumming consume a significant share of the total network processing time, which strengthens the case for TCP offload, and that spreading the processing of a single connection across processors degrades performance, confirming the suitability of the processor per connection model for SMP systems. Evaluating TCP offload hardware and the behaviour of the 2.6 stack on processors with more cores are natural directions for future work.

References

[1] M. Björkman and P. Gunningberg. Locking effects in multiprocessor implementations of protocols. In Conference Proceedings on Communications Architectures, Protocols and Applications. ACM Press.
[2] M. Björkman and P. Gunningberg. Performance modeling of multiprocessor implementations of protocols. IEEE/ACM Transactions on Networking, 6(3).
[3] J. W. Davidson and S. Jinturkar. Improving instruction-level parallelism by loop unrolling and dynamic memory disambiguation. In MICRO 28: Proceedings of the 28th Annual International Symposium on Microarchitecture. IEEE Computer Society Press.
[4] httperf: A Tool for Measuring Web Server Performance. World Wide Web, Mosberger/httperf.html.
[5] D. Marr, F. Binns, D. Hill, G. Hinton, and D. Koufaty. Hyper-Threading Technology Architecture and Microarchitecture. Intel Technology Journal.
[6] J. C. Mogul. TCP offload is a dumb idea whose time has come. In Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, May 2003.
[7] E. M. Nahum, D. J. Yates, J. F. Kurose, and D. F. Towsley. Performance issues in parallelized network protocols. In Operating Systems Design and Implementation.
[8] OProfile: a profiling system for Linux 2.2/2.4/2.6. World Wide Web.
[9] V. S. Pai and S. Adve. Code transformations to improve memory parallelism. In MICRO 32: Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society.
[10] D. C. Schmidt and T. Suda. Measuring the impact of alternative parallel process architectures on communication subsystem performance. In Protocols for High-Speed Networks IV. Chapman & Hall, Ltd.
[11] P. Shivam and J. S. Chase. On the elusive benefits of protocol offload. In Proceedings of the ACM SIGCOMM Workshop on Network-I/O Convergence. ACM Press.
[12] H. Xie, L. Zhao, and L. Bhuyan. Architectural analysis and instruction-set optimization for design of network protocol processors. In Proceedings of the 1st IEEE/ACM/IFIP International Conference on Hardware/Software Codesign & System Synthesis. ACM Press, 2003.


More information

Performance Analysis of Large Receive Offload in a Xen Virtualized System

Performance Analysis of Large Receive Offload in a Xen Virtualized System Performance Analysis of Large Receive Offload in a Virtualized System Hitoshi Oi and Fumio Nakajima The University of Aizu, Aizu Wakamatsu, JAPAN {oi,f.nkjm}@oslab.biz Abstract System-level virtualization

More information

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment Technical Paper Moving SAS Applications from a Physical to a Virtual VMware Environment Release Information Content Version: April 2015. Trademarks and Patents SAS Institute Inc., SAS Campus Drive, Cary,

More information

QoS & Traffic Management

QoS & Traffic Management QoS & Traffic Management Advanced Features for Managing Application Performance and Achieving End-to-End Quality of Service in Data Center and Cloud Computing Environments using Chelsio T4 Adapters Chelsio

More information

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Winter Term 2008 / 2009 Jun.-Prof. Dr. André Brinkmann Andre.Brinkmann@uni-paderborn.de Universität Paderborn PC² Agenda Multiprocessor and

More information

PORTrockIT. Spectrum Protect : faster WAN replication and backups with PORTrockIT

PORTrockIT. Spectrum Protect : faster WAN replication and backups with PORTrockIT 1 PORTrockIT 2 Executive summary IBM Spectrum Protect, previously known as IBM Tivoli Storage Manager or TSM, is the cornerstone of many large companies data protection strategies, offering a wide range

More information

Real-Time Scheduling 1 / 39

Real-Time Scheduling 1 / 39 Real-Time Scheduling 1 / 39 Multiple Real-Time Processes A runs every 30 msec; each time it needs 10 msec of CPU time B runs 25 times/sec for 15 msec C runs 20 times/sec for 5 msec For our equation, A

More information

1-Gigabit TCP Offload Engine

1-Gigabit TCP Offload Engine White Paper 1-Gigabit TCP Offload Engine Achieving greater data center efficiencies by providing Green conscious and cost-effective reductions in power consumption. June 2009 Background Broadcom is a recognized

More information

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Laura Knapp WW Business Consultant Laurak@aesclever.com Applied Expert Systems, Inc. 2011 1 Background

More information

Operating Systems Design 16. Networking: Sockets

Operating Systems Design 16. Networking: Sockets Operating Systems Design 16. Networking: Sockets Paul Krzyzanowski pxk@cs.rutgers.edu 1 Sockets IP lets us send data between machines TCP & UDP are transport layer protocols Contain port number to identify

More information

A Packet Forwarding Method for the ISCSI Virtualization Switch

A Packet Forwarding Method for the ISCSI Virtualization Switch Fourth International Workshop on Storage Network Architecture and Parallel I/Os A Packet Forwarding Method for the ISCSI Virtualization Switch Yi-Cheng Chung a, Stanley Lee b Network & Communications Technology,

More information

Diagnosing Performance Overheads in the Xen Virtual Machine Environment

Diagnosing Performance Overheads in the Xen Virtual Machine Environment Diagnosing Performance Overheads in the Xen Virtual Machine Environment Aravind Menon EPFL, Lausanne aravind.menon@epfl.ch Jose Renato Santos HP Labs, Palo Alto joserenato.santos@hp.com Yoshio Turner HP

More information

An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems

An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems Rice University Computer Science Technical Report TR06-872 1 An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems Paul Willmann, Scott Rixner, and Alan L. Cox Rice University

More information

Leveraging NIC Technology to Improve Network Performance in VMware vsphere

Leveraging NIC Technology to Improve Network Performance in VMware vsphere Leveraging NIC Technology to Improve Network Performance in VMware vsphere Performance Study TECHNICAL WHITE PAPER Table of Contents Introduction... 3 Hardware Description... 3 List of Features... 4 NetQueue...

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Performance of Software Switching

Performance of Software Switching Performance of Software Switching Based on papers in IEEE HPSR 2011 and IFIP/ACM Performance 2011 Nuutti Varis, Jukka Manner Department of Communications and Networking (COMNET) Agenda Motivation Performance

More information

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0 Muse Server Sizing 18 June 2012 Document Version 0.0.1.9 Muse 2.7.0.0 Notice No part of this publication may be reproduced stored in a retrieval system, or transmitted, in any form or by any means, without

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Performance analysis of a Linux based FTP server

Performance analysis of a Linux based FTP server Performance analysis of a Linux based FTP server A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Technology by Anand Srivastava to the Department of Computer Science

More information

Performance of Host Identity Protocol on Nokia Internet Tablet

Performance of Host Identity Protocol on Nokia Internet Tablet Performance of Host Identity Protocol on Nokia Internet Tablet Andrey Khurri Helsinki Institute for Information Technology HIP Research Group IETF 68 Prague March 23, 2007

More information

PE310G4BPi40-T Quad port Copper 10 Gigabit Ethernet PCI Express Bypass Server Intel based

PE310G4BPi40-T Quad port Copper 10 Gigabit Ethernet PCI Express Bypass Server Intel based PE310G4BPi40-T Quad port Copper 10 Gigabit Ethernet PCI Express Bypass Server Intel based Description Silicom s quad port Copper 10 Gigabit Ethernet Bypass server adapter is a PCI-Express X8 network interface

More information

Infrastructure for active and passive measurements at 10Gbps and beyond

Infrastructure for active and passive measurements at 10Gbps and beyond Infrastructure for active and passive measurements at 10Gbps and beyond Best Practice Document Produced by UNINETT led working group on network monitoring (UFS 142) Author: Arne Øslebø August 2014 1 TERENA

More information

Network Function Virtualization

Network Function Virtualization Intel Network Builders Reference Architecture Network Function Virtualization Network Function Virtualization Quality of Service in Broadband Remote Access Servers with Linux* and Intel Architecture Audience

More information

Operating Systems 4 th Class

Operating Systems 4 th Class Operating Systems 4 th Class Lecture 1 Operating Systems Operating systems are essential part of any computer system. Therefore, a course in operating systems is an essential part of any computer science

More information

1000Mbps Ethernet Performance Test Report 2014.4

1000Mbps Ethernet Performance Test Report 2014.4 1000Mbps Ethernet Performance Test Report 2014.4 Test Setup: Test Equipment Used: Lenovo ThinkPad T420 Laptop Intel Core i5-2540m CPU - 2.60 GHz 4GB DDR3 Memory Intel 82579LM Gigabit Ethernet Adapter CentOS

More information

Microsoft Windows Server 2003 with Internet Information Services (IIS) 6.0 vs. Linux Competitive Web Server Performance Comparison

Microsoft Windows Server 2003 with Internet Information Services (IIS) 6.0 vs. Linux Competitive Web Server Performance Comparison April 23 11 Aviation Parkway, Suite 4 Morrisville, NC 2756 919-38-28 Fax 919-38-2899 32 B Lakeside Drive Foster City, CA 9444 65-513-8 Fax 65-513-899 www.veritest.com info@veritest.com Microsoft Windows

More information