Linux TCP Stack Performance Comparison and Analysis on SMP systems


Shourya P. Bhattacharya
Kanwal Rekhi School of Information Technology, Indian Institute of Technology, Bombay
Powai, Mumbai
shourya@it.iitb.ac.in

1. Introduction

With the phenomenal growth of networked applications and their users, there is a compelling need for very high speed and highly scalable communications software. A networked application can be made fast and scalable in a variety of ways. At the application layer, speed can be improved by refining the internal algorithms, and scale can be achieved by running multiple replicas of the application and balancing the load between them. However, at the heart of any application written for use over a network is the lower layer protocol stack implementation of the underlying operating system. For an application to run at high speed and achieve high scalability, the protocol stack implementation that it uses must also be able to keep up and not become a bottleneck. It is therefore important to study and understand the performance of the protocol stack implementation which the application will use.

Recent trends in technology show that although the raw transmission speeds used in networks are increasing rapidly, the rate of advancement of the microprocessor technology that needs to drive the network has slowed down over the last couple of years. Gigabit Ethernet networks have now become commonplace, while 10 Gbps Ethernet is rapidly making its presence felt [12]. Processor clock speeds, on the other hand, are no longer improving at a comparable rate. Hence major microprocessor manufacturers such as Intel and AMD are exploring newer processor architectures. Intel was the first to introduce simultaneous multithreading, or Hyper-Threading [5], in its Pentium IV processors, which enabled a single Pentium IV processor to appear as two logical processors. More recently, both AMD and Intel have launched dual core processors. Dual core processors, as the name suggests, package two processing cores on a single chip, effectively behaving like an SMP (Symmetric Multiprocessor) system. In the future we can expect CPUs with even more cores to be introduced.

The end result of this is that network protocol processing overheads have risen sharply in comparison with the time spent in packet transmission. Although network layer processing is highly optimised in routers, which are special-purpose devices, in application servers (i.e. hosts) this protocol processing has to happen on the general purpose hardware that the server runs on. When the load on the server increases, protocol processing overheads can start to dominate and can degrade the useful throughput of the application.

Several approaches have been employed to speed up and scale up protocol stack implementations.

In the context of TCP/IP, offloading the protocol processing to dedicated hardware, instead of carrying it out on the general purpose CPU, has been proposed. This essentially frees the CPU to be dedicated to application processing. In either case, i.e. with or without offloading, parallel processing architectures can be exploited to speed up protocol processing. Whether the purpose is to determine the TCP/IP components that have the highest processing requirements, or to determine how the implementation scales to SMP architectures, a careful performance study of the TCP/IP stack implementation of the operating system in question is a must.

In this paper, we discuss the results of such a study done for the Linux operating system. The Linux OS has shot up in popularity recently and is now used even by large-scale system operators. Specifically, we have focused on the network stack performance of Linux kernel 2.4 and kernel 2.6. Until recently, kernel 2.4 was the most stable Linux kernel and was used extensively. Kernel 2.6 is the latest stable Linux kernel and is fast replacing kernel 2.4. Although performance studies of network protocol stacks have been done before, this is the first time a thorough comparison of the Linux 2.4 and Linux 2.6 TCP/IP stack performance has been carried out. We have studied and compared the performance of these two Linux versions along various metrics: bulk data throughput, connection throughput and scalability across multiple processors. We also present a fine-grained profiling of resource usage by the TCP/IP stack functions, thereby identifying the bottlenecks.

In almost all the tests, kernel 2.6 performed better than kernel 2.4. It was observed that kernel 2.6 sustained higher data transfer rates and a much higher number of simultaneous connections than kernel 2.4. The higher sustained data transfer rates in kernel 2.6 were attributed to its more efficient copy routines, while the much higher number of simultaneous connections in kernel 2.6 was a result of its superior O(1) scheduler. Our experiments with kernel 2.6 on an SMP architecture show degraded performance when the processing of a single connection is spread over multiple processors, thus verifying the superiority of the Processor per Connection [2] model for protocol processing in SMP systems. The kernel profiling results also revealed that interrupt processing time, the device driver code and the copy routines take up a significant amount of the total network processing time.

This paper is organised as follows. Section 2 reviews approaches to speeding up protocol processing. Section 3 discusses the improvements made in Linux kernel 2.6 which affect the network performance of the system. Section 4 describes the benchmarking experiments performed, the hardware setup, their results and implications. In Section 5 we discuss the kernel profiling results obtained with OProfile [8] and the inferences drawn from them. Section 6 concludes our observations and suggests future work.

2. Speeding up protocol processing

Improving and speeding up the network protocol processing stack has become an area of active research interest. In this section we look at two ways in which the problem of improving protocol processing speed has been approached. Firstly, attempts have been made to offload the protocol processing from the host system to the NIC, which has dedicated hardware for it. Secondly, efforts have been made to parallelise the protocol stack processing to make use of multiple processors. We take a brief look at both these approaches.
It must be noted that these approaches are not mutually exclusive.

2.1. TCP offloading

TCP offload has been a well debated topic over the last decade [11]. The typical benefits of TCP/IP offloading include a reduction of host CPU requirements for stack processing and checksumming, fewer interrupts to the host CPU, fewer bytes copied over the system bus, and the offloading of computationally expensive features such as encryption to specialised hardware on the NIC. In spite of these benefits, there has been some criticism of TCP offload.

The primary argument against it has been that the TCP protocol in itself is not a very expensive operation and that the rapid increase of CPU processing power will nullify the cost effectiveness of offloading TCP to the NIC. Given the changing dynamics of technological development, however, TCP offloading has become extremely relevant [6]. Our experimental results in Section 5 show that significant protocol processing time is spent in buffer copying, checksumming and interrupt handling. TCP offloading would free the host CPU from these overheads and hence would be extremely useful.

2.2. Parallel Protocol Processing Approaches

The layered architecture of the network stack makes it difficult to parallelise efficiently. A lot of work has been done [2, 1, 7] in this field to effectively parallelise the protocol stack. We discuss some of the well known paradigms.

Processor per Message is a parallelising paradigm in which each processor executes the whole protocol stack for one message. With this approach heavily used connections can be served efficiently, since several processors can service different messages of the same connection. However, this also implies that the connection state has to be shared between the processors servicing the packets, which can lead to synchronisation problems.

Processor per Connection lets one processor handle all the messages belonging to a particular connection. This approach works well in SMP systems as it can make optimal use of the processor cache; however, it suffers from fragmentation of resources.

Processor per Protocol is another approach in which each layer of the protocol stack is processed by a particular processor. Messages may be shuffled from one processor to another as they move up or down the protocol stack. One limitation of this approach is that the messages cannot be cached efficiently as they are shuffled from one processor to the other [2].

Processor per Task is a parallelising technique in which each processor performs a specific task or function within a protocol, or a task common to more than one protocol. This paradigm, in theory, tries to reduce the processing time or latency of a message. The main disadvantage of this approach is that both protocol state and messages must be shared between processors. It also suffers from poor caching, as in the case of the processor per protocol paradigm.

It has been shown by Schmidt and Suda [10] that processor per message and processor per connection generally work better than the other two paradigms in a shared memory multiprocessor system. This happens because of the high cost of context switches while crossing protocol layers. Our results have shown that even the processor per message paradigm has a detrimental effect on TCP performance. This happens because TCP is a connection oriented protocol and needs to store the connection state in memory. Distributing the processing of packets belonging to a single connection over different processors leads to frequent cache invalidations in all the processors. Hence only the processor per connection paradigm is optimally suited for new generation processors having multiple cores and simultaneous multithreading.
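To make the processor per connection idea concrete, the sketch below shows one common way of realising it: hash the connection 4-tuple and use the hash to pick a CPU, so that every packet of a given connection is handled by the same processor and its TCP state stays warm in that processor's cache. This is an illustrative user-space sketch under our own assumptions, not code from the Linux kernel; the structure and function names are ours.

    #include <stdint.h>
    #include <stdio.h>

    /* A TCP connection is identified by its 4-tuple. */
    struct flow {
        uint32_t saddr, daddr;   /* IPv4 source/destination address */
        uint16_t sport, dport;   /* TCP source/destination port     */
    };

    /* Simple illustrative hash over the 4-tuple (not the kernel's). */
    static uint32_t flow_hash(const struct flow *f)
    {
        uint32_t h = f->saddr ^ f->daddr;
        h ^= ((uint32_t)f->sport << 16) | f->dport;
        h ^= h >> 16;
        h *= 0x45d9f3b;          /* arbitrary odd constant to mix the bits */
        return h ^ (h >> 13);
    }

    /* Processor per connection: every packet of a flow maps to one CPU. */
    static unsigned int flow_to_cpu(const struct flow *f, unsigned int ncpus)
    {
        return flow_hash(f) % ncpus;
    }

    int main(void)
    {
        struct flow f = { 0x0a000001, 0x0a000002, 43211, 80 };
        printf("flow handled by CPU %u of 4\n", flow_to_cpu(&f, 4));
        return 0;
    }

Because the mapping depends only on the 4-tuple, all packets of the connection land on the same CPU, which is exactly the cache-locality property discussed above.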
3. Improvements in Linux kernel 2.6

The Linux kernel 2.6 was a major upgrade from the earlier default kernel 2.4, with many improvements, and was expected to be faster and more responsive than the earlier kernel. In this section we discuss some of the improvements and changes made in kernel 2.6 which can have an impact on the performance of the networking subsystem.

Some of the major improvements in kernel 2.6 include minimal use of the Big Kernel Lock (BKL), an improved interrupt mechanism (NAPI), more efficient block copy routines and a vastly improved O(1) scheduler. The kernel 2.6 TCP stack also includes new congestion control and recovery algorithms which are not available in the kernel 2.4 TCP stack.

3.1. Minimal use of the Big Kernel Lock

The BKL is a global kernel lock which allows only one processor to run kernel code at any given time, to make the kernel safe for concurrent access from multiple CPUs. The BKL is essentially a spinlock, but with a couple of interesting properties:

- The BKL can be taken recursively; therefore two consecutive requests for it will not deadlock the process.
- Code holding the BKL can sleep and even enter the scheduler while holding the lock. The lock is released while the given thread sleeps, and re-acquired upon awakening.

The BKL made SMP Linux possible, but it does not scale very well, hence there is a continuous effort to avoid the BKL and use more fine grained locks instead. Although kernel 2.6 is still not completely free of the BKL, its usage has been greatly reduced. The kernel 2.6 networking stack has only one reference to the BKL.

3.2. New API - NAPI

One of the most significant changes in the kernel 2.6 network stack is the addition of NAPI ("New API"), which is designed to improve the performance of high-speed networking. The basic principles of NAPI are:

- Interrupt mitigation. High-speed networking can create thousands of interrupts per second, which can lead to an interrupt livelock. NAPI allows drivers to run with (some) interrupts disabled during times of high traffic, with a corresponding decrease in system load.
- Packet throttling. When the system is overwhelmed and must drop packets, it is better if those packets are disposed of before much effort goes into processing them. NAPI-compliant drivers can often cause packets to be dropped in the network adapter itself, before the kernel sees them at all.

3.3. Efficient copy routines

The Linux kernel maintains separate address spaces for the kernel and user processes for protection against misbehaving programs. Because of the two separate address spaces, when a packet is sent or received over the network an additional step of copying the network buffer from user space to kernel space, or vice versa, is required. As a result, the efficiency of the kernel copy routine used has a profound impact on the overall network performance. Kernel 2.6 copy routines have been optimised for the x86 architecture and use the technique of a hand unrolled loop [3, 9] with integer registers, instead of the less efficient movsd instruction used in kernel 2.4.
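As a rough illustration of what a hand unrolled copy loop means here, the sketch below copies four 32-bit words per loop iteration instead of one byte per iteration, so fewer branch and index-update instructions are spent per byte moved. This is a simplified user-space analogue written for this discussion, not the kernel's copy_from_user implementation; the function names are ours.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Naive copy: one byte per loop iteration. */
    static void copy_bytewise(uint8_t *dst, const uint8_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* Hand-unrolled copy: four 32-bit words per iteration, tail done bytewise.
     * Fewer loop-control instructions per byte moved is the point of unrolling. */
    static void copy_unrolled(uint8_t *dst, const uint8_t *src, size_t n)
    {
        uint32_t *d = (uint32_t *)dst;
        const uint32_t *s = (const uint32_t *)src;
        size_t blocks = n / 16;

        while (blocks--) {
            d[0] = s[0];
            d[1] = s[1];
            d[2] = s[2];
            d[3] = s[3];
            d += 4;
            s += 4;
        }
        copy_bytewise((uint8_t *)d, (const uint8_t *)s, n % 16);
    }

    int main(void)
    {
        uint32_t srcbuf[25], dstbuf[25];              /* 100 bytes, word aligned */
        uint8_t *src = (uint8_t *)srcbuf, *dst = (uint8_t *)dstbuf;
        for (int i = 0; i < 100; i++)
            src[i] = (uint8_t)i;
        copy_unrolled(dst, src, 100);
        printf("copies match: %d\n", memcmp(dst, src, 100) == 0);
        return 0;
    }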
3.4. Scheduling algorithm

The Linux kernel 2.6 scheduler is probably the single most significant change made in the kernel. The kernel 2.6 scheduler was written completely from scratch to overcome some of the limitations of the kernel 2.4 scheduler. The kernel 2.4 scheduler, while widely used and quite reliable, had several undesirable characteristics. Its biggest flaw was that it contained O(n) algorithms, where n is the number of processes in the system, and hence was not scalable. The new scheduler in kernel 2.6, on the other hand, does not contain any algorithm that runs in worse than O(1) time. This is extremely important for applications like web servers, as it allows them to handle a large number of concurrent connections without dropping requests.

The Linux kernel 2.4 scheduling algorithm divides time into epochs, which are periods of time during which every task is allowed to use up its timeslice. Timeslices need to be computed for all tasks in the system when an epoch begins, which means that the scheduler's timeslice calculation runs in O(n) time, since it must iterate over every task. Kernel 2.6, on the other hand, uses more sophisticated algorithms which ensure that there is no point at which all tasks need a new timeslice calculated for them at the same time.
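The constant-time selection of the next task to run can be sketched with a priority bitmap: one ready queue per priority level plus a bitmap of non-empty levels, so that picking the next task reduces to a single find-first-set-bit operation regardless of how many tasks exist. The sketch below illustrates the idea only; it is not the kernel 2.6 scheduler code, and the sizes and names are ours.

    #include <stdint.h>
    #include <stdio.h>
    #include <strings.h>   /* ffs() */

    #define NPRIO 32       /* illustrative; lower number = higher priority */

    struct runqueue {
        uint32_t bitmap;          /* bit p set => queued[p] is non-zero */
        int      queued[NPRIO];   /* number of ready tasks per priority */
    };

    static void enqueue(struct runqueue *rq, int prio)
    {
        rq->queued[prio]++;
        rq->bitmap |= 1u << prio;
    }

    /* O(1): find the highest-priority non-empty queue with one bit scan,
     * independent of the total number of tasks in the system. */
    static int pick_next_prio(const struct runqueue *rq)
    {
        if (rq->bitmap == 0)
            return -1;                 /* nothing runnable */
        return ffs((int)rq->bitmap) - 1;   /* lowest set bit = best priority */
    }

    int main(void)
    {
        struct runqueue rq = { 0, { 0 } };
        enqueue(&rq, 7);
        enqueue(&rq, 3);
        printf("next task comes from priority level %d\n", pick_next_prio(&rq));
        return 0;
    }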

4. High level performance comparison tests

In this section we discuss the benchmarking experiments performed for a comparative study of kernel 2.4 and 2.6. Each subsection describes the type of test performed, the experimental setup and the results obtained.

4.1. Web server performance

HTTP performance tests were done on kernel 2.4 and 2.6 to get a comparative picture of both kernels' efficiency in a real world scenario. The freely available httperf [4] tool was used for load generation, and the performance of the Apache web server was tested on both kernels. The following two changes were made to the default Apache configuration file:

1. MaxClients was set to its maximum value.
2. MaxRequestsPerChild was set to zero (unlimited).

The Apache web server was run on a dual processor 3.2 GHz Xeon system. Separate tests were done on both kernel 2.4 and 2.6 with SMP enabled and disabled. The httperf load generator was run on two separate client machines. This was necessary as a single client was unable to generate enough requests to saturate the server. To reliably test the maximum number of simultaneous connections that the web server could handle, the clients were made to request a static text page only 6 bytes in size. This ensured that the 100 Mbps network bandwidth did not become a bottleneck. The following command line was used for the httperf experiments:

$httperf --hog --port 80 --uri /small.html --server (IP) --rate=(r) --num-conn (R x 15) --timeout 5

The maximum client connection request rate sustained by the server, the response time for the requests, the connection establishment time and the number of errors reported by the two kernels are shown in Figures 1, 2, 3 and 4 respectively.

Figure 1. Request rate sustained by kernel 2.4 and 2.6.
Figure 2. Response time comparison of kernel 2.4 and 2.6.

Figure 3. Connection time comparison of kernel 2.4 and 2.6.
Figure 4. Time-out error comparison of kernel 2.4 and 2.6.

These graphs clearly show that kernel 2.6 performance is much better than that of kernel 2.4. Kernel 2.4 struggled to handle more than 2800 simultaneous connections and started reporting errors beyond that point; its connection time and response time also started rising sharply. In contrast, kernel 2.6 could easily handle 4000 simultaneous connections and there was no sign of any increase in connection time or response time, suggesting that kernel 2.6 would be able to handle an even higher number of simultaneous connections than could be tested. These results again showcase the superiority of the kernel 2.6 scheduler.

4.2. Connection throughput

In this experiment we compared the maximum throughput of multiple simultaneous client connections in kernel 2.4 and 2.6. The tests were done on a single processor P4 machine which acted as a server by listening on many ports for connections. Clients repeatedly connected to and disconnected from the server without transmitting any data. The server forked a child process for each new connection request. Two experiments were done with slight variations. In the first experiment the server was made to listen on a fixed number (N) of ports. Exactly N clients were then started, which repeatedly connected and disconnected to each of the open ports on the server. The CPU utilisation of the server and the throughput as seen by the clients were measured. In the second experiment, the server, instead of listening on exactly N ports, was listening on M ports such that M >> N. In our experiments M was fixed at 300 while N was varied from 10 to 100. During both experiments the network I/O was monitored to make sure that the network bandwidth was not a bottleneck. The results obtained from the experiments are shown in Figure 5 and Figure 6. It must be noted that new connections on any given active port were made sequentially, after the previous connection had terminated. Hence 100% CPU utilisation could only be achieved with multiple active ports simultaneously attempting connection setup and tear down.
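The test programs themselves are not reproduced in the paper; purely as an illustration of the kind of server described above (fork a child per accepted connection, exchange no data, tear the connection down immediately), a minimal sketch could look like the following. The port number and all names are our own assumptions, not the authors' code.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        signal(SIGCHLD, SIG_IGN);                 /* reap children automatically */

        int srv = socket(AF_INET, SOCK_STREAM, 0);
        if (srv < 0) { perror("socket"); return 1; }

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(5000);              /* arbitrary illustrative port */

        if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(srv, 128) < 0) {
            perror("bind/listen");
            return 1;
        }

        for (;;) {
            int conn = accept(srv, NULL, NULL);   /* connection setup */
            if (conn < 0)
                continue;
            if (fork() == 0) {                    /* child owns the connection */
                close(srv);
                close(conn);                      /* no data exchanged: tear down */
                _exit(0);
            }
            close(conn);                          /* parent keeps accepting */
        }
    }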

Figure 5. Connection throughput comparison with varying number of active connections.
Figure 6. Measured throughput on the server with 300 open ports and varying number of active connections.
Table 1. Average CPU time spent per connection (in microseconds) on kernel 2.4 and kernel 2.6.

Table 1 displays the average CPU time per connection on the two kernels. This gives us the time the server spent servicing a single connection, i.e. the total time taken for accept() and close(). The observations from this experiment reveal a very interesting story. The first thing that is evident from Figure 5 and Table 1 is that the processing overhead in kernel 2.6 for connection setup and teardown is higher than that of kernel 2.4. This is attributed to the fact that the kernel 2.6 socket code contains many security hooks; for example, the socket system call in kernel 2.6 has the additional overhead of calling the functions security_socket_create() and security_socket_post_create(), which results in higher CPU utilisation.

Figure 6 reveals another very interesting fact. When there is a large number of open ports, the connection setup/teardown throughput of kernel 2.4 lags behind that of kernel 2.6. This clearly demonstrates the superiority of the kernel 2.6 scheduler. The kernel 2.4 scheduler has to cycle through all the processes listening on the open ports in the system, irrespective of whether they are active or not. The kernel 2.6 scheduler, on the other hand, is unaffected by the number of open ports in the system and its performance is comparable to that shown in Figure 5. Thus the two conclusions we can draw from this experiment are:

- The per connection processing cost in kernel 2.6 is slightly higher than that of kernel 2.4.
- The kernel 2.6 scheduler is vastly superior to the kernel 2.4 scheduler and to a large degree compensates for the higher per connection processing cost in kernel 2.6.

4.3. Socket system calls

This was a high level test intended to compare the performance of network related system calls. The tests were run on stock kernel 2.4 and kernel 2.6. The test environment consisted of a Pentium 4 machine with 256 MB RAM. All non-essential services were turned off. The tests were run with the highest priority set and with no other network activity. Custom TCP server and client programs were written, and both programs were run on the same machine, connecting over the loopback interface. The time taken by the socket system calls was measured on both the server and the client programs, using strace. The results obtained are shown in Table 2.

Table 2. Average time spent in each system call (socket(), bind(), listen(), connect()), in microseconds, on kernel 2.4 and kernel 2.6.

The observed socket system call times on the two kernels are very comparable. They show that there is not much difference in the bind() and socket() system call overheads between the two kernels, but the listen() and connect() system calls are a little cheaper in kernel 2.6.
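The measurements above came from strace; an alternative way to obtain comparable numbers is to time the calls directly around each system call, as in the hedged sketch below. This is our own illustration, not the authors' test program.

    #include <stdio.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <time.h>
    #include <unistd.h>

    /* Elapsed wall-clock time between two timestamps, in microseconds. */
    static double elapsed_us(const struct timespec *a, const struct timespec *b)
    {
        return (b->tv_sec - a->tv_sec) * 1e6 + (b->tv_nsec - a->tv_nsec) / 1e3;
    }

    int main(void)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        int s = socket(AF_INET, SOCK_STREAM, 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("socket(): %.1f us\n", elapsed_us(&t0, &t1));

        struct sockaddr_in addr = { 0 };
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0;                        /* let the kernel pick a port */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        bind(s, (struct sockaddr *)&addr, sizeof(addr));
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("bind():   %.1f us\n", elapsed_us(&t0, &t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        listen(s, 16);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("listen(): %.1f us\n", elapsed_us(&t0, &t1));

        close(s);
        return 0;
    }

In practice such a micro-benchmark would repeat each call many times and average, as a single wall-clock sample is dominated by noise.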

4.4. Bulk data transfer performance on SMP systems

Bulk data transfer throughput is one of the most important parameters for measuring a TCP stack's performance. To measure the maximum data transfer throughput, the freely available tool iperf was used. iperf measures maximum TCP bandwidth, allows the tuning of various parameters, and can report bandwidth, delay jitter and datagram loss.

We used a system with dual 3.2 GHz Xeon CPUs as our test machine. The Xeon processors were HyperThreading (HT) [5] capable and we performed different sets of tests with both HT enabled and disabled. Kernels 2.4 and 2.6 were compiled with SMP support and their performance was compared. The tests were also repeated without SMP support on both kernels to obtain a reference point for comparison. The experiments were run over the loopback interface, as the physical 100 Mbps network interface was too slow to stress the CPU. Since the experiments were done over the loopback interface, the cost of interrupt handling and the efficiency of the device driver code did not come into the picture. The iperf server and client were invoked with the following command lines respectively:

$iperf -s -p 9999
$iperf -c localhost -p 9999 -w 255K -t 30

The tests were run for a duration of 30 seconds with the TCP window size set to the maximum of 255 KB. The buffer size was set to the default value of 8 KB. This experiment yielded some surprising results which provided significant insight into the SMP scalability issue of the kernel TCP stack. The results are shown in Figure 7.

As one might expect, in Uni-Processor mode kernel 2.6 was considerably faster than kernel 2.4, but the surprising result here is that of kernel 2.6 in SMP mode. The observed throughput in SMP mode oscillated between 3.4 Gbits/sec and 7.8 Gbits/sec. Such a large variation cannot be attributed to random errors. In the case of SMP kernel 2.4, on the other hand, the throughput rises only marginally from the 4.6 Gbits/sec measured in Uni-Processor mode.

The higher data throughput of kernel 2.6 in Uni-Processor mode is due to its more efficient copy routines, as discussed in Section 3. The variability in SMP kernel 2.6 can be attributed to cache bouncing. In kernel 2.6, because of its better scheduling logic and finer grained kernel locks, packet processing can be distributed over all available processors. In our tests, iperf creates a single TCP connection and sends data over that connection, but if incoming packets of a connection are processed on different CPUs this leads to frequent cache misses, as packets belonging to the same connection cannot reuse the TCP state information optimally. This results in poorer performance in comparison with the Uni-Processor kernel, where the entire processing is done on a single CPU which can take advantage of the TCP state information present in its cache, resulting in fewer misses.

This also explains the fluctuating high performance (around 7.5 Gbits/sec) on the 2.6 SMP kernel. Since the Intel Xeon processors are hyper-threaded, the SMP scheduler would at times schedule the packet processing on the two logical processors of the same physical processor. In such a situation there is no cache penalty, as the logical processors share the same cache. To verify this, HT was disabled and the SMP kernel tests were repeated. This stopped the performance oscillations and the 2.6 SMP kernel consistently gave a throughput of 3.4 Gbits/sec.
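As an aside, and not something the experiments above did, a test program can avoid this kind of cache bouncing without giving up SMP by pinning itself to a single CPU with the standard Linux sched_setaffinity() call, so that all processing of its connection stays on one cache. A minimal sketch, with the CPU number chosen arbitrarily:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(0, &mask);          /* run this process on CPU 0 only */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        printf("pinned to CPU 0; pid %d\n", (int)getpid());
        /* ... an iperf-style send/receive loop would run here ... */
        return 0;
    }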

Figure 7. Graphical representation of the data transfer rates achieved in the different test cases.
Figure 8. Data throughput (Gbps) with varying number of TCP connections (kernel 2.4-uni, 2.4-smp, 2.6-uni, 2.6-smp).

The results from the kernel profiling tests in Section 5 also confirm that the 2.6 SMP kernel spends excessive amounts of time in its copy routines.

Further tests were done by transmitting data over two TCP connections instead of one, to check the SMP performance of kernel 2.4 and 2.6. Figure 7 clearly shows that the data transfer throughput of SMP kernel 2.6 is much better with two TCP connections on a dual CPU system. The increase in data throughput on kernel 2.6 in SMP mode on a dual processor system with two TCP connections is nearly 180% when compared with the data throughput of a single TCP connection on a Uni-Processor kernel. Kernel 2.4, on the other hand, shows a speedup of only 147%.

To verify that the observed behaviour of kernel 2.6 in SMP mode with a single TCP connection is an anomaly, we did more tests with multiple simultaneous TCP connections. The results are shown in Figure 8. They clearly show that the low performance of kernel 2.6 in SMP mode with a single TCP connection is indeed an anomaly: when the number of simultaneous TCP connections is increased, kernel 2.6 gives excellent performance. A few other observations can be made from Figure 8. In Uni-Processor kernel 2.4 there is practically no change in the data throughput with an increasing number of simultaneous TCP connections. In SMP kernel 2.4, on the other hand, the data throughput rises initially with two simultaneous connections but drops slightly as the number of parallel TCP connections is increased further. Since the test machine had only two physical processors, this implies that kernel 2.4 SMP incurs some small penalty while multiplexing multiple TCP streams on a physical processor.

5. Kernel profiling results

The processing of TCP packets involves interaction with a large number of subsystems within the kernel. Any attempt to optimise and improve the packet processing time cannot succeed unless all these factors are considered. To identify these overheads, we profiled both Linux kernels using OProfile [8], a statistical profiler that uses the hardware performance counters available on modern processors to collect information about executing processes.

The profiling results also provide valuable insight into, and a concrete explanation of, the anomalous behaviour of SMP kernel 2.6 observed in Section 4.4 when processing a single TCP connection on a dual CPU system.

5.1. Breakup of TCP packet processing overheads

The breakup of TCP packet processing overheads is shown in Table 3, which lists the kernel functions that took more than 1% of the overall TCP packet processing time. The boomerang_interrupt function is the interrupt service routine for the 3COM 3c59x series NIC, which was used in our experiments. The other boomerang_* functions are also part of the NIC driver and are involved in packet transmission and reception. __copy_from_user_ll copies a block of memory from user space to kernel space, and csum_partial is the kernel checksumming routine. Thus we can see that the NIC driver code, interrupt processing, buffer copying and checksumming are the most CPU intensive operations during TCP packet processing. In comparison, the TCP functions themselves take up only a small part of the overall CPU time. These data make a strong case for TCP offloading, which could potentially lead to a saving of 30-40% of the CPU time.

Table 3. Breakup of TCP packet processing overheads in the kernel (CPU samples and percentages per function). Functions taking more than 1% of the time include boomerang_interrupt, boomerang_start_xmit, __copy_from_user_ll, issue_and_wait, csum_partial, mark_offset_tsc, ipt_do_table, boomerang_rx, tcp_sendmsg, irq_entries_start, default_idle, skb_release_data, ip_queue_xmit, tcp_v4_rcv and timer_interrupt.
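For reference, csum_partial computes the standard 16-bit one's-complement Internet checksum over a buffer. A portable, unoptimised C version of that computation (RFC 1071 style) is sketched below; it is an illustrative analogue of what the kernel's hand-tuned assembly routine does, not the kernel code itself.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* 16-bit one's-complement Internet checksum over a byte buffer. */
    static uint16_t inet_checksum(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint32_t sum = 0;

        while (len > 1) {                 /* sum 16-bit words */
            sum += (uint32_t)p[0] << 8 | p[1];
            p += 2;
            len -= 2;
        }
        if (len)                          /* odd trailing byte */
            sum += (uint32_t)p[0] << 8;

        while (sum >> 16)                 /* fold the carries back in */
            sum = (sum & 0xffff) + (sum >> 16);

        return (uint16_t)~sum;
    }

    int main(void)
    {
        uint8_t data[] = { 0x45, 0x00, 0x00, 0x3c, 0x1c, 0x46, 0x40, 0x00 };
        printf("checksum = 0x%04x\n", inet_checksum(data, sizeof(data)));
        return 0;
    }

Every byte of every packet passes through this kind of loop unless the NIC performs checksum offload, which is why checksumming shows up so prominently in the profile.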

5.2. Analysis of the kernel 2.6 SMP anomaly

In Section 4.4 we observed a sharp drop in the performance of SMP kernel 2.6 when a single TCP connection was set up on a dual CPU system, whereas with two or more TCP flows kernel 2.6 performed extremely well. To analyse this anomalous behaviour, we reran the data throughput experiments for kernel 2.6 in both SMP and Uni-Processor mode and profiled the kernel during that period. In these experiments, each TCP connection sent and received exactly 2 GB of data. This allowed us to directly compare the samples collected in the two situations.

Table 5. TCP packet processing overheads in kernel 2.6 UNI with a single TCP connection (CPU samples and percentages for __copy_from_user_ll, __copy_to_user_ll, system_call, (no symbols), schedule, tcp_sendmsg and __switch_to).

Table 4. TCP packet processing costs in kernel 2.6 SMP with a single TCP connection (samples and percentages per CPU for __copy_from_user_ll, __copy_to_user_ll, tcp_sendmsg, schedule, (no symbols), system_call and tcp_v4_rcv).

Table 6. TCP packet processing costs in kernel 2.6 SMP with two TCP connections (samples and percentages per CPU for __copy_from_user_ll, __copy_to_user_ll, schedule, tcp_sendmsg, system_call, __switch_to and tcp_v4_rcv).

The most striking fact emerging from Tables 4 and 6 is the large increase in time spent in the kernel copy routines. The functions __copy_from_user_ll() and __copy_to_user_ll() are used for copying buffers from user space to kernel space and from kernel space to user space respectively. There is a very sharp increase in the time spent in these two functions in the SMP kernel with a single TCP connection: more than 50% of the total time is spent in them. Such a sharp increase in the cost of the copy routines can be attributed to a high processor cache miss rate. To verify this, the __copy_from_user_ll() and __copy_to_user_ll() routines were analysed further and it was found that more than 95% of the time in these routines was spent on the assembly instruction

repz movsl %ds:(%esi),%es:(%edi)

This instruction copies data between the memory locations pointed to by the registers, in a loop. The performance of the movsl instruction is heavily dependent on processor data cache hits or misses. The significantly higher number of clock cycles required by the movsl instruction in the case of SMP kernel 2.6, for copying the same amount of data, can only be explained by an increase in the data cache misses of the processor.
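The instruction in question copies ECX 32-bit words from the address in ESI to the address in EDI. A minimal user-space reproduction of that copy loop, using GCC inline assembly, is shown below; it is an illustration only, not the kernel routine, and it builds only on x86 targets.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Copy n bytes (n must be a multiple of 4) with "rep movsl", the same
     * instruction the profiled kernel copy routines spend their time in. */
    static void copy_rep_movsl(void *dst, const void *src, size_t n)
    {
        size_t nwords = n / 4;
        __asm__ volatile("rep movsl"
                         : "+D"(dst), "+S"(src), "+c"(nwords)
                         : /* no other inputs */
                         : "memory");
    }

    int main(void)
    {
        uint32_t src[256], dst[256];
        for (int i = 0; i < 256; i++)
            src[i] = (uint32_t)i * 2654435761u;   /* arbitrary test pattern */
        copy_rep_movsl(dst, src, sizeof(src));
        printf("copies match: %d\n", memcmp(dst, src, sizeof(src)) == 0);
        return 0;
    }

The throughput of this single instruction is determined almost entirely by whether the source and destination lines are already in the local data cache, which is why bouncing a connection between CPUs shows up as extra cycles charged to the copy routines.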

6. Conclusion

We have compared the TCP/IP stack performance of Linux kernels 2.4 and 2.6 along several dimensions: web server performance, connection throughput, bulk data throughput and scalability on SMP systems. In almost all the tests kernel 2.6 performed better than kernel 2.4, sustaining higher data transfer rates thanks to its more efficient copy routines and a much higher number of simultaneous connections thanks to its O(1) scheduler. The profiling results show that interrupt processing, the device driver code, buffer copying and checksumming consume a significant share of the total network processing time, which strengthens the case for TCP offload, and that spreading the processing of a single connection across processors degrades performance, confirming the suitability of the processor per connection model for SMP systems. Evaluating TCP offload hardware and the behaviour of the 2.6 stack on processors with more cores are natural directions for future work.

References

[1] M. Björkman and P. Gunningberg. Locking effects in multiprocessor implementations of protocols. In Conference Proceedings on Communications Architectures, Protocols and Applications. ACM Press.
[2] M. Björkman and P. Gunningberg. Performance modeling of multiprocessor implementations of protocols. IEEE/ACM Transactions on Networking, 6(3).
[3] J. W. Davidson and S. Jinturkar. Improving instruction-level parallelism by loop unrolling and dynamic memory disambiguation. In MICRO 28: Proceedings of the 28th Annual International Symposium on Microarchitecture. IEEE Computer Society Press.
[4] httperf: A Tool for Measuring Web Server Performance. World Wide Web, Mosberger/httperf.html.
[5] D. Marr, F. Binns, D. Hill, G. Hinton, and D. Koufaty. Hyper-Threading Technology Architecture and Microarchitecture. Intel Technology Journal.
[6] J. C. Mogul. TCP offload is a dumb idea whose time has come. In Proceedings of HotOS IX: The 9th Workshop on Hot Topics in Operating Systems, May 2003.
[7] E. M. Nahum, D. J. Yates, J. F. Kurose, and D. F. Towsley. Performance issues in parallelized network protocols. In Operating Systems Design and Implementation.
[8] OProfile: a profiling system for Linux 2.2/2.4/2.6. World Wide Web.
[9] V. S. Pai and S. Adve. Code transformations to improve memory parallelism. In MICRO 32: Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture. IEEE Computer Society.
[10] D. C. Schmidt and T. Suda. Measuring the impact of alternative parallel process architectures on communication subsystem performance. In Protocols for High-Speed Networks IV. Chapman & Hall, Ltd.
[11] P. Shivam and J. S. Chase. On the elusive benefits of protocol offload. In Proceedings of the ACM SIGCOMM Workshop on Network-I/O Convergence. ACM Press.
[12] H. Xie, L. Zhao, and L. Bhuyan. Architectural analysis and instruction-set optimization for design of network protocol processors. In Proceedings of the 1st IEEE/ACM/IFIP International Conference on Hardware/Software Codesign & System Synthesis. ACM Press, 2003.


More information

Performance Analysis of Large Receive Offload in a Xen Virtualized System

Performance Analysis of Large Receive Offload in a Xen Virtualized System Performance Analysis of Large Receive Offload in a Virtualized System Hitoshi Oi and Fumio Nakajima The University of Aizu, Aizu Wakamatsu, JAPAN {oi,f.nkjm}@oslab.biz Abstract System-level virtualization

More information

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment Technical Paper Moving SAS Applications from a Physical to a Virtual VMware Environment Release Information Content Version: April 2015. Trademarks and Patents SAS Institute Inc., SAS Campus Drive, Cary,

More information

QoS & Traffic Management

QoS & Traffic Management QoS & Traffic Management Advanced Features for Managing Application Performance and Achieving End-to-End Quality of Service in Data Center and Cloud Computing Environments using Chelsio T4 Adapters Chelsio

More information

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6 Winter Term 2008 / 2009 Jun.-Prof. Dr. André Brinkmann Andre.Brinkmann@uni-paderborn.de Universität Paderborn PC² Agenda Multiprocessor and

More information

PORTrockIT. Spectrum Protect : faster WAN replication and backups with PORTrockIT

PORTrockIT. Spectrum Protect : faster WAN replication and backups with PORTrockIT 1 PORTrockIT 2 Executive summary IBM Spectrum Protect, previously known as IBM Tivoli Storage Manager or TSM, is the cornerstone of many large companies data protection strategies, offering a wide range

More information

Real-Time Scheduling 1 / 39

Real-Time Scheduling 1 / 39 Real-Time Scheduling 1 / 39 Multiple Real-Time Processes A runs every 30 msec; each time it needs 10 msec of CPU time B runs 25 times/sec for 15 msec C runs 20 times/sec for 5 msec For our equation, A

More information

1-Gigabit TCP Offload Engine

1-Gigabit TCP Offload Engine White Paper 1-Gigabit TCP Offload Engine Achieving greater data center efficiencies by providing Green conscious and cost-effective reductions in power consumption. June 2009 Background Broadcom is a recognized

More information

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308

Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Virtualization: TCP/IP Performance Management in a Virtualized Environment Orlando Share Session 9308 Laura Knapp WW Business Consultant Laurak@aesclever.com Applied Expert Systems, Inc. 2011 1 Background

More information

Operating Systems Design 16. Networking: Sockets

Operating Systems Design 16. Networking: Sockets Operating Systems Design 16. Networking: Sockets Paul Krzyzanowski pxk@cs.rutgers.edu 1 Sockets IP lets us send data between machines TCP & UDP are transport layer protocols Contain port number to identify

More information

A Packet Forwarding Method for the ISCSI Virtualization Switch

A Packet Forwarding Method for the ISCSI Virtualization Switch Fourth International Workshop on Storage Network Architecture and Parallel I/Os A Packet Forwarding Method for the ISCSI Virtualization Switch Yi-Cheng Chung a, Stanley Lee b Network & Communications Technology,

More information

Diagnosing Performance Overheads in the Xen Virtual Machine Environment

Diagnosing Performance Overheads in the Xen Virtual Machine Environment Diagnosing Performance Overheads in the Xen Virtual Machine Environment Aravind Menon EPFL, Lausanne aravind.menon@epfl.ch Jose Renato Santos HP Labs, Palo Alto joserenato.santos@hp.com Yoshio Turner HP

More information

An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems

An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems Rice University Computer Science Technical Report TR06-872 1 An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems Paul Willmann, Scott Rixner, and Alan L. Cox Rice University

More information

Leveraging NIC Technology to Improve Network Performance in VMware vsphere

Leveraging NIC Technology to Improve Network Performance in VMware vsphere Leveraging NIC Technology to Improve Network Performance in VMware vsphere Performance Study TECHNICAL WHITE PAPER Table of Contents Introduction... 3 Hardware Description... 3 List of Features... 4 NetQueue...

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Performance of Software Switching

Performance of Software Switching Performance of Software Switching Based on papers in IEEE HPSR 2011 and IFIP/ACM Performance 2011 Nuutti Varis, Jukka Manner Department of Communications and Networking (COMNET) Agenda Motivation Performance

More information

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0

Muse Server Sizing. 18 June 2012. Document Version 0.0.1.9 Muse 2.7.0.0 Muse Server Sizing 18 June 2012 Document Version 0.0.1.9 Muse 2.7.0.0 Notice No part of this publication may be reproduced stored in a retrieval system, or transmitted, in any form or by any means, without

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

Performance analysis of a Linux based FTP server

Performance analysis of a Linux based FTP server Performance analysis of a Linux based FTP server A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Technology by Anand Srivastava to the Department of Computer Science

More information

Performance of Host Identity Protocol on Nokia Internet Tablet

Performance of Host Identity Protocol on Nokia Internet Tablet Performance of Host Identity Protocol on Nokia Internet Tablet Andrey Khurri Helsinki Institute for Information Technology HIP Research Group IETF 68 Prague March 23, 2007

More information

PE310G4BPi40-T Quad port Copper 10 Gigabit Ethernet PCI Express Bypass Server Intel based

PE310G4BPi40-T Quad port Copper 10 Gigabit Ethernet PCI Express Bypass Server Intel based PE310G4BPi40-T Quad port Copper 10 Gigabit Ethernet PCI Express Bypass Server Intel based Description Silicom s quad port Copper 10 Gigabit Ethernet Bypass server adapter is a PCI-Express X8 network interface

More information

Infrastructure for active and passive measurements at 10Gbps and beyond

Infrastructure for active and passive measurements at 10Gbps and beyond Infrastructure for active and passive measurements at 10Gbps and beyond Best Practice Document Produced by UNINETT led working group on network monitoring (UFS 142) Author: Arne Øslebø August 2014 1 TERENA

More information

Network Function Virtualization

Network Function Virtualization Intel Network Builders Reference Architecture Network Function Virtualization Network Function Virtualization Quality of Service in Broadband Remote Access Servers with Linux* and Intel Architecture Audience

More information

Operating Systems 4 th Class

Operating Systems 4 th Class Operating Systems 4 th Class Lecture 1 Operating Systems Operating systems are essential part of any computer system. Therefore, a course in operating systems is an essential part of any computer science

More information

1000Mbps Ethernet Performance Test Report 2014.4

1000Mbps Ethernet Performance Test Report 2014.4 1000Mbps Ethernet Performance Test Report 2014.4 Test Setup: Test Equipment Used: Lenovo ThinkPad T420 Laptop Intel Core i5-2540m CPU - 2.60 GHz 4GB DDR3 Memory Intel 82579LM Gigabit Ethernet Adapter CentOS

More information

Microsoft Windows Server 2003 with Internet Information Services (IIS) 6.0 vs. Linux Competitive Web Server Performance Comparison

Microsoft Windows Server 2003 with Internet Information Services (IIS) 6.0 vs. Linux Competitive Web Server Performance Comparison April 23 11 Aviation Parkway, Suite 4 Morrisville, NC 2756 919-38-28 Fax 919-38-2899 32 B Lakeside Drive Foster City, CA 9444 65-513-8 Fax 65-513-899 www.veritest.com info@veritest.com Microsoft Windows

More information