Effective Computing with SMP Linux


Multi-processor systems were once a feature of high-end servers and mainframes, but today even desktops for personal use have multiple processors. Linux is a popular server OS and is increasingly being accepted as a mainstream desktop OS. But how good is Linux at multi-processing? The traditional Linux kernel was built for Uniprocessor (UP) systems and could not exploit the power offered by multiprocessor hardware. With the arrival of the Linux 2.6 kernel, this has changed drastically: the 2.6 kernel adds many features to support symmetric multi-processing. This paper discusses the relevance of an SMP kernel for today's computers, the changes in the Linux 2.6 kernel to support SMP, and the benefits of an SMP Linux system. The paper also explains how developers can take advantage of the SMP features of Linux to develop software that runs more efficiently.

About the Author

Brijai Sudarsan is an Associate Consultant at Tata Consultancy Services with about 13.5 years of experience in the IT industry. He holds a Bachelor's Degree in Electrical and Electronics Engineering and currently works on storage-related technologies for one of the TCS customers, a global leader in data storage platforms. During his career at TCS he has led multiple projects in the Linux/Unix domain and has extensive experience in developing distributed applications on Linux. He has also worked on operating systems development, mainly on HP NonStop Massively Parallel Processing (MPP) servers. His areas of interest include parallel processing, object-oriented design and development, and storage systems.

Table of Contents

1. Introduction
   Overview
   History
   What is Symmetric Multiprocessing or SMP?
2. Benefits of an SMP Linux System
   Performance and Scalability
   Cost: Is SMP Linux Cost Effective?
   Application Portability
3. Challenges in Building an SMP System
   Task Scheduling
   Synchronization and Parallelization
4. The Linux Solution
   Task Scheduling
   Scheduler Scalability
   Load Balancing
   CPU Affinity
   Synchronization and Parallelization
   Per-CPU Variables
   Atomic Variables
   Spin Locks
   Semaphores
   The Big Kernel Lock (BKL)
5. Application Development on an SMP Linux System
6. Summary

List of Abbreviations

CFS - Completely Fair Scheduler
CPU - Central Processing Unit (Processor)
DMA - Direct Memory Access
GHz - Gigahertz
IPC - Inter-Process Communication
JVM - Java Virtual Machine
MHz - Megahertz
MPP - Massively Parallel Processing
NPTL - Native POSIX Thread Library
NUMA - Non-Uniform Memory Access
OS - Operating System
PIC - Programmable Interrupt Controller
POSIX - Portable Operating System Interface for Unix
SMP - Symmetric Multiprocessing
UP - Uniprocessor

Introduction

There are two ways to improve the processing power of a computer: the first is to build a Uniprocessor (UP) system with a faster (and more expensive) processor, and the second is to build a system with multiple processors. The second approach is generally referred to as parallel processing. There are different approaches to building a parallel processing system, such as Symmetric Multiprocessing (SMP), Cluster Computing and Massively Parallel Processing (MPP).

Overview

This paper focuses on the most common approach to parallel processing today, Symmetric Multiprocessing (SMP), using Linux as a reference implementation. First, the paper presents the benefits of an SMP system over a UP system. After establishing the relevance of an SMP system, the paper delves into the implementation challenges of an SMP kernel as compared to a UP kernel and how the Linux 2.6 kernel addresses these challenges. The discussion is oriented more towards software design than hardware architecture; however, hardware features are briefly mentioned where appropriate. Finally, application development on an SMP Linux system is discussed, with specific focus on performance improvement by harnessing the parallel processing features of an SMP kernel. The paper concludes that an SMP Linux system is one of the best platforms available today for delivering low-cost, high-performance software solutions.

History

A bit of history before we dive deeper into the topic. Multi-processing was first commercially introduced by IBM in 1955 with the 704 system, designed by Gene Amdahl, who is also famous for Amdahl's law of parallel computing. The 704, however, was not an SMP system. The first SMP system was the Burroughs D825, which could support up to four CPUs. Later, in 1969, MIT, Bell Labs and General Electric developed the MULTICS (Multiplexed Information and Computing Service) system, which could support up to eight processors. MULTICS was a landmark in the development of multi-tasking systems and has had a significant influence on other operating systems such as UNIX. Other SMP systems, like the DEC KL10, became commercially available in the 70s and 80s. Still, parallel processing remained the domain of supercomputers, mainframes and high-end servers. In the 90s, desktop computers became popular but were fed with faster processors every few months, so the need for parallel processing was never felt.

Today, the situation has changed. Symmetric multiprocessing has come of age and is no longer limited to mainframes and supercomputers. Much of the credit goes to the popularity of desktop computers. As CPU speeds increased dramatically from a few MHz in the 90s to GHz today, chip manufacturers have found it increasingly difficult to design chips that operate at ever higher frequencies to meet the computing demands of today's desktops. As a result, they turned to multi-processing, and SMP has been the choice due to lower costs and operating system portability. Because of the symmetric design of an SMP system, a UP OS like Linux could be enhanced to support SMP without a complete re-write of the kernel code. OS vendors also continued to provide the same interfaces for application programmers (APIs, or system calls as they are popularly called) on SMP systems, so that applications developed for UP systems could run without any changes. Today, SMP is so popular that major desktop and server operating systems like Linux and Windows have SMP support built in.

The Linux 2.0 kernel supported SMP, but in a rather crude way, with a single large kernel lock. Major enhancements have been made since then, and the Linux 2.6 kernel has excellent support for SMP, with fine-grained synchronization mechanisms and a scalable scheduler. This is discussed in detail in later sections of this paper.

What is Symmetric Multiprocessing or SMP?

A symmetric multiprocessing system consists of two or more identical CPUs that share the system's main memory in a symmetric manner: every CPU has equal access to main memory, and data accessed by one CPU can be used by any other CPU in the system. An SMP system is built by tightly interconnecting the CPUs and main memory through a high-speed bus, as shown in Figure 1.0 below, so that they can communicate with each other at high speed. The components are usually built on a single board so that they communicate over short distances, and the bus is kept short to improve communication speed. Additionally, I/O devices that need Direct Memory Access (DMA) can also be connected to the bus in a symmetric manner.

[Figure 1.0: SMP system - four CPUs, each with its own cache, and shared main memory interconnected by a high-speed bus]

There are variants of the SMP system, such as SMP NUMA (Non-Uniform Memory Access) systems, which may also have local memory allocated to one processor or shared between a set of processors. Such architectures are not discussed in this paper.

Let us now see how an SMP system works from an Operating System (OS) perspective. An SMP OS is always a multi-tasking system, but that is not the critical difference between an SMP and a UP system; most UP systems are multi-tasking too, as described in the note below. The main difference is that an SMP system can run more than one task at any instant, as it has multiple CPUs at its disposal. A task may also get more time to run on a CPU, as the number of tasks per CPU is lower than on a UP system for a given system load.
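Whether Linux is running on one CPU or many is visible from user space. As a small illustration (a sketch added to this discussion, using the common glibc extension _SC_NPROCESSORS_ONLN), a program can query the number of online CPUs:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* _SC_NPROCESSORS_ONLN reports the CPUs currently online;
         * it is a widely supported extension rather than core POSIX. */
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

        if (ncpus < 1) {
            perror("sysconf");
            return 1;
        }
        printf("This system has %ld online CPU(s)\n", ncpus);
        return 0;
    }

On an SMP machine this prints a value greater than one; everything else about the programming interface stays the same, as discussed later under Application Portability.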

Figure 1.1 and Figure 1.2 provide a snapshot of a UP system and a 2 CPU SMP system respectively, each with 10 tasks, T0 to T9, to execute. In Figure 1.1, the 10 tasks are handled by a single CPU, whereas in Figure 1.2 the tasks are distributed across the 2 CPUs. Each box represents a time slice given to a process to run; let us assume each slice is 10 milliseconds long.

Multi-tasking on a Uniprocessor system: A uniprocessor system has to run multiple tasks on a single CPU. Since it would not be wise to execute these tasks sequentially - some tasks can take a long time to complete - the OS gives each task a specific amount of time (a time slice) to run, irrespective of how much time the task actually needs. The task is moved off the CPU if it cannot finish its job within the time given, and another task gets the chance to run. The swapped-out task waits until the OS allows it to run again. All the tasks are swapped in and out of the CPU until they complete. As the swapping happens very fast - usually a task occupies the CPU for only a few milliseconds - the OS gives the user the illusion that multiple tasks are running at any point of time.

[Figure 1.1: Task execution on a Uniprocessor system in a 100 millisecond timeframe - one CPU runs time slices of tasks T0 to T9 in sequence]

[Figure 1.2: Task execution on a 2 CPU SMP system in a 100 millisecond timeframe - the tasks are split between CPU 0 and CPU 1]

Clearly, the two systems do not schedule tasks in the same way. Note that a task is a generalization and could be a process, a thread or an interrupt routine. The following comparisons can be made:

1. The UP system runs all the tasks on a single CPU. The SMP system distributes tasks across the two CPUs: tasks T0 to T4 run on CPU 0 and tasks T5 to T9 run on CPU 1, with the exception of T1, which gets to run on both.

2. On the UP system, tasks take longer to complete, as they compete with more tasks for a time slice of the CPU and so get fewer time slices in a given period; some tasks (T3 and T5) do not get to run at all in the 100 millisecond timeframe. On the SMP system, tasks take less time to complete as they get more time slices; for instance, task T5 gets to run on CPU 1 as it is competing only with T6 to T9 and T1.

3. On the UP system, a task always runs on the same CPU. On the SMP system, a task tends to run on the same CPU, but there are exceptions, like task T1, which is re-scheduled on CPU 1 after initially running on CPU 0 for the first 70 milliseconds.

Comparison #3 is interesting, and there is a clear reason for tasks tending to continue on the same CPU: processor (CPU) affinity, discussed in the section Challenges in Building an SMP System.

Benefits of an SMP Linux System

An SMP system has multiple benefits that have made it a popular choice for desktops, mobile computers, and entry-level as well as mid-range servers. Let us look at the benefits an SMP Linux system has to offer.

Performance and Scalability

Moore's law predicted that the number of transistors on a processor would increase at an exponential rate, and there is considerable debate now on how long Moore's law will hold as a predictor of future processor performance. Another law is of more direct interest to a system's performance: Amdahl's law, which predicts a system's speedup based on the amount of parallelization that can be achieved:

    Speedup = 1 / ((1 - P) + P/N)

where N is the number of processors in the system, P is the proportion of the work that can be parallelized, and (1 - P) is the part that remains sequential.
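To make the formula concrete, the following small program (an illustration added for this discussion, not part of the original analysis) evaluates Amdahl's law for several parallelization levels:

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / ((1 - p) + p / n) */
    static double speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void)
    {
        const double levels[] = { 0.10, 0.25, 0.50, 0.75, 0.90 };
        const int cpus[] = { 1, 2, 4, 8, 16 };
        const int nlev = sizeof(levels) / sizeof(levels[0]);
        const int ncpu = sizeof(cpus) / sizeof(cpus[0]);

        for (int i = 0; i < nlev; i++) {
            printf("P = %2.0f%%:", levels[i] * 100.0);
            for (int j = 0; j < ncpu; j++)
                printf("  N=%-2d %5.2fx", cpus[j], speedup(levels[i], cpus[j]));
            printf("\n");
        }
        return 0;
    }

Running it reproduces the pattern of Figure 2.0: at P = 10% the speedup barely reaches 1.1x even with 16 CPUs, while at P = 90% it passes 6x and keeps climbing as CPUs are added.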

The speedup can be calculated for different levels of parallelization, as shown in Figure 2.0 below, where P10 denotes 10% parallelization and P90 a system in which only 10% of the work is sequential. Clearly, the performance of an SMP system depends on the level of parallelization that can be achieved. For instance, the P10 system shows hardly any improvement even when the number of CPUs is increased from 1 to 16, whereas the P90 system scales very well as more CPUs are added.

[Figure 2.0: Amdahl's law - speedup versus number of CPUs for parallelization levels P10, P25, P50, P75 and P90]

Parallelization depends on multiple elements within the system:

1. Hardware Scalability - As shown in Figure 1.0, all CPUs in an SMP system share the global main memory and communicate through a common bus. Only one CPU is given access to the bus at a time, and the other CPUs have to wait until the bus is released. The bus must have enough bandwidth that CPUs are not kept waiting to fetch instructions and data from main memory. I/O devices also share the bus for DMA and to communicate with the CPUs. Hence, bus speed is of critical importance and is the main bottleneck. SMP systems have been found to scale well for 2 to 16 CPU configurations; larger systems that need a greater degree of hardware parallelization and scalability usually use other approaches, such as Massively Parallel Processing (MPP) or clusters.

2. Kernel Synchronization - Tasks (processes, threads or interrupt routines) frequently need to execute in kernel mode to access services such as I/O and inter-process communication, and to synchronize access to shared resources. Hence, kernel code is run quite frequently by the CPUs and needs a high level of parallelization. The kernel must implement synchronization mechanisms, as different CPUs could be executing kernel code that accesses the same data. Linux provides various synchronization mechanisms, such as spin locks and semaphores, to ensure that all concurrent threads of execution in the kernel are synchronized.

3. Kernel Pre-emption - Linux has evolved, with each release of the kernel supporting more fine-grained synchronization. Up to version 2.4, the Linux kernel was non-preemptive: if a task was executing in kernel mode, it continued to execute until it voluntarily exited kernel mode. Although this simplifies the kernel design a great deal, as very little synchronization is required within the kernel, it is bad news for an SMP system.

The Linux 2.6 kernel is preemptive. Except for critical sections, which are non-preemptive, the 2.6 kernel allows multiple CPUs to execute in kernel mode at the same time. The critical sections, which could be updating data shared across CPUs, are protected by various locking mechanisms, as discussed in the section The Linux Solution.

4. Task Scheduling - Tasks need to be scheduled by the kernel so that the CPUs are utilized with a high level of efficiency. The kernel scheduler needs to distribute tasks across CPUs so that the system load is balanced across all of them. The scheduler also needs to ensure that tasks are re-scheduled on the same CPU to preserve cache affinity: the CPU cache must be rebuilt every time a task migrates to a different CPU, and this is expensive. For a high-priority process this can be especially costly, as the task is re-scheduled more frequently. Finally, the scheduler itself should be efficient, so that it does not burn a lot of CPU cycles deciding which task to run; ideally, the performance of the scheduler should remain the same irrespective of the number of tasks it has to schedule. The Linux 2.6 kernel has a highly scalable scheduler that addresses these challenges, as we shall see in the section The Linux Solution.

5. Application Parallelism - Even if the hardware and the kernel have good SMP support, applications may still not perform significantly better on an SMP system than on a UP system. Building large, monolithic applications is not a good idea on an SMP system: the kernel views such an application as a single task and allocates it time slices like any other task, irrespective of its size. To take advantage of the processing power available, it is better to split a large application into multiple processes and threads. Tasks that can run in parallel within an application can be written to run as independent processes or threads.

Cost: Is SMP Linux Cost Effective?

This is a critical question: is an SMP system more cost effective than building a cluster of UP systems or an MPP system? There is definitely a cost benefit from a hardware perspective. As an SMP system has a symmetric architecture - every CPU has equal access to main memory and I/O devices - it is a logical extension of the UP system with more CPUs added in. Hence, an N processor SMP system is cheaper to build than interconnecting N UP systems. Furthermore, an SMP-enabled Linux OS is available at no additional cost, and Linux scales well as more processors are added. SMP Linux systems with 2 to 16 CPU configurations have shown performance benefits significant enough to justify the cost of the additional processors. Distributed applications are well suited to harness the power of an SMP Linux system and can provide the expected cost benefits by reducing the hardware required to run software at a required performance level. A good example is the Java Virtual Machine (JVM), which has a multi-threaded architecture. The JVM has shown significant performance improvements using the NPTL threads library on SMP Linux systems; a major reason for this is the high-performance Completely Fair Scheduler (CFS) in Linux 2.6, which enables distributed applications to create a large number of processes and threads to take advantage of the additional processing power.
Application Portability

Linux provides the same programming interface (APIs or system calls) on both SMP and UP systems. The complexity of the SMP system is managed by the kernel and is transparent to the application developer. This means that applications developed for UP systems can run on an SMP system without any changes. However, an application may need further performance tuning to take advantage of the parallel processing capabilities of an SMP system.

Challenges in Building an SMP System

An SMP system is a logical extension of a UP system; the main difference between the two is that the SMP system is capable of multi-processing. Many of the other features of a UP system can be re-used, with relevant changes, on an SMP system. Hence, writing an SMP kernel usually does not involve developing a new kernel from scratch. An SMP system differs from a UP system mainly in two aspects:

Task Scheduling - With more processors, SMP systems can run more tasks and so need an efficient task scheduler.

Synchronization and Parallelization - Multiple processors concurrently execute different parts of the kernel in kernel mode. The kernel needs appropriate synchronization mechanisms so that a high level of parallelism can be achieved without corrupting kernel data.

Let us look at these two challenges in more detail.

Task Scheduling

Scheduler scalability - The main reason for choosing an SMP system is that it can handle a larger processing load than a UP system. As more processors are added, the system will have more runnable tasks, and the kernel task scheduler has to scale to the higher system load. The scheduler should not take too many CPU cycles to find the next task to run; ideally, the efficiency of the scheduling algorithm should be independent of the number of tasks it has to schedule, the number of CPUs, or any other system parameter.

Load balancing - Unlike a UP scheduler, an SMP scheduler also needs to balance the load across all CPUs. Processes and threads need to be assigned to CPUs so that all CPUs are optimally utilized. In addition to processes and threads, interrupts need to be serviced in a timely manner.

Processor affinity - Tasks should be re-scheduled on the same CPU as far as possible. Every CPU has a local cache where it holds data that it expects processes to use in the near future; this improves performance, as the CPU does not need to fetch data from the relatively slower main memory. If a task is re-scheduled on a different CPU, the data cached for the task on the previous CPU cannot be used, and the new CPU needs to build up its own local cache for the migrated task. This is expensive and can have a major impact on system performance, particularly if the migrating task is a high-priority one and is migrated frequently. The scheduler needs to keep the load balanced across all CPUs while at the same time respecting processor affinity; these two goals can be contradictory, and the right trade-off must be made to maximize system performance.

Synchronization and Parallelization

An SMP system has true parallelism: on an N CPU system, N tasks could be running, one on each CPU, at the same time. Some of these tasks could be running in user mode and some in kernel mode. A user-mode task enters kernel mode when it makes a system call such as fork() or write(); interrupts and kernel threads always execute in kernel mode. Hence, at any point of time there could be many concurrent threads of execution within the kernel: a process may have invoked a system call, a CPU could be servicing an interrupt, and a kernel thread could be running.

A UP kernel also has concurrency issues if the kernel is preemptive: a task running in kernel mode could be swapped off the CPU, and another task that modifies the same data structures could start running. However, certain conditions are unique to an SMP system:

Concurrency due to parallelism - An SMP kernel is designed to have parallel threads of execution. A non-preemptive kernel allows a task to run in kernel mode as long as it wishes, and the task can do a planned switch out of the kernel after ensuring that all shared data structures have been updated correctly. This greatly simplifies kernel synchronization on a UP system. However, it will not work on an SMP system, for the simple reason that multiple CPUs could be executing kernel code at the same time, some of them changing the same data structures. Serializing the CPUs' access to the kernel via non-preemption would defeat the very purpose of having an SMP system.

Interrupts - UP systems often prevent pre-emption of a kernel-mode process or thread by disabling interrupts. Although this can hamper system response, it simplifies kernel synchronization. Again, this will not work on an SMP system. I/O interrupts are delivered to the CPUs by a Programmable Interrupt Controller (PIC). The PIC has its own logic for delivering interrupts and can deliver an interrupt to any of the available CPUs. For example, on the Intel x86 platform, interrupts are delivered by the PIC to the CPU that is running the lowest-priority process; if multiple CPUs are running at the same low priority, the PIC delivers interrupts in a round-robin fashion. To sum up, neither kernel non-preemption nor interrupt disabling can simplify kernel synchronization on an SMP system. The kernel needs proper synchronization mechanisms to keep kernel data structures in a consistent state.

Locking granularity - Another challenge before the kernel is how much synchronization to do. The kernel can opt for coarse-grained or fine-grained locking. Code that updates shared data is called a critical region and must be protected by locking. Coarse-grained locking locks large areas of code in which a major percentage of the code may be updating shared kernel data; the entire locked region may not be a critical region, but a major part of it could be. If the region is too large, system performance suffers, as other tasks must wait until the large lock is released. On the positive side, fewer locks are required, which can mean lower code complexity and possibly fewer deadlocks. On low-end SMP systems this may even be desirable, as the level of parallelism, and hence the contention between tasks, may be low; having more locks can actually lower performance due to the cost of managing them, since a lock is often acquired even when there is no contention. However, coarse-grained locking is bad for scalability, and on higher-end SMP systems it becomes a bottleneck. For better scalability, fine-grained locking can be used. Fine-grained locking involves more precise identification of critical regions. For example, a linked list can be protected by a single large lock for the entire list; but if a large number of tasks need to access the list, all but one will block, and access to the entire list becomes serialized. A fine-grained approach can instead provide a lock for each node of the list.
Tasks typically need to access different nodes of the list, so there is less contention between them. As the number of tasks increases, however, many tasks may contend for a single node, and an even finer-grained approach may be needed, perhaps a lock for each element within a node. While this may give very good response on a high-end 32 CPU system, performance on a 2 CPU system could be disastrous, as many locks are created even though few tasks hit the critical region at the same time.
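The difference between the two granularities is easy to see in code. The sketch below (a user-space illustration with POSIX threads, added for clarity; the kernel uses its own locking primitives) contrasts one lock for a whole list with one lock per node:

    #include <pthread.h>

    /* Coarse-grained: a single mutex serializes access to the whole
     * list, so any two tasks touching the list contend with each other. */
    struct coarse_list {
        pthread_mutex_t lock;        /* protects head and every node */
        struct node *head;
    };

    /* Fine-grained: each node carries its own mutex, so tasks working
     * on different nodes do not contend at all. */
    struct node {
        pthread_mutex_t lock;        /* protects only this node's data */
        int data;
        struct node *next;
    };

    void update_node(struct node *n, int value)
    {
        pthread_mutex_lock(&n->lock);    /* contention limited to one node */
        n->data = value;
        pthread_mutex_unlock(&n->lock);
    }

A real fine-grained list also needs a safe traversal scheme (for example, lock coupling between adjacent nodes), which is where much of the extra complexity and lock-management cost comes from.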

The kernel has to be designed with the right balance of coarse- and fine-grained locking. This is often a continuous evolution that incorporates feedback from the real world and from kernel performance test teams.

The Linux Solution

Task Scheduling

The 2.6 kernel scheduler has been re-written to provide better performance on SMP systems; in fact, it has been re-written twice since the first release of the 2.6 kernel. The initial release sported an O(1) algorithm, per-processor run queues and SMP affinity. Later releases, starting with 2.6.23, have a new scheduling algorithm called the Completely Fair Scheduler (CFS). Let us take a closer look at the 2.6 scheduler to see how it meets the challenges posed by an SMP system.

Scheduler Scalability

A scheduler should remain scalable as the number of tasks and processors increases. Let us see how Linux handles these two parameters.

Task Scalability - Linux 2.4 had an O(n) scheduler: it selected the next task to run by searching the list of runnable tasks. The performance of the scheduler therefore depended on the number of tasks, and the time taken to select the next task grew linearly with that number.

The Big O notation: The O notation is often used in computer science to indicate an algorithm's efficiency or time complexity. For example, an algorithm that searches an unsorted array of n elements has order O(n), as its cost scales linearly with n; the larger n is, the slower the search. Similarly, an algorithm that searches an n*n matrix has time complexity O(n^2). An algorithm that fetches an element of a contiguous array of n elements has complexity O(1), as it can fetch the 10th or the 100th element in constant time.

The initial releases of the Linux 2.6 kernel provided an O(1) scheduler, which performs with the same efficiency irrespective of system parameters such as the number of runnable tasks or the number of CPUs: it finds the next task to run in constant time whether 10 tasks or 500 tasks are waiting. This was a major achievement, as it provides significant benefits on high-end systems that may be running thousands of tasks. Even on smaller machines, distributed applications like the Java Virtual Machine (JVM) can spawn hundreds of threads, and each thread is scheduled independently, like a process.

The scheduler achieves this constant-time behaviour by using task priority queues instead of a list of runnable tasks. Each task is given a priority that can range from 1 (highest) to N, where N is the number of priority levels in the system; hence there are N priority queues. A priority queue is empty if no runnable task has that priority. An N-bit bitmap tracks this, with one bit per priority: if a queue has tasks, its bit is set to 1; if the queue is empty, the bit is 0.
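To illustrate why this lookup is constant time, here is a simplified user-space sketch of the priority-bitmap idea (written for this discussion; it is not the kernel's actual code, and the 140 levels merely mirror the O(1) scheduler's priority range):

    #include <stdio.h>

    #define NUM_PRIO 140
    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    /* One bit per priority level: bit set => that queue has runnable tasks. */
    static unsigned long bitmap[(NUM_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG];

    /* Find the highest (numerically lowest) priority with a runnable task.
     * The loop is bounded by the fixed bitmap size, not by how many tasks
     * are runnable, so the cost does not grow with system load. */
    static int highest_prio(void)
    {
        for (unsigned i = 0; i < sizeof(bitmap) / sizeof(bitmap[0]); i++)
            if (bitmap[i])
                return (int)(i * BITS_PER_LONG) +
                       __builtin_ctzl(bitmap[i]);  /* GCC/Clang builtin */
        return -1;  /* no runnable tasks */
    }

    int main(void)
    {
        bitmap[1] = 1UL << 5;  /* pretend priority 69 is runnable (64-bit longs) */
        printf("next priority to run: %d\n", highest_prio());
        return 0;
    }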

The scheduler finds the first set bit in the bitmap, which corresponds to the highest-priority queue with runnable tasks, and selects the first task in that queue to run. Since finding the first set bit in a fixed-size bitmap is a constant-time operation, independent of the number of runnable tasks, the scheduler has O(1) efficiency. We will not discuss this further, as the algorithm has been replaced with completely new logic in later 2.6 releases.

Starting with release 2.6.23, Linux uses a new scheduling algorithm known as the Completely Fair Scheduler (CFS). The CFS has an efficiency of O(log n), where n is the number of runnable tasks. Thus, the CFS is not as efficient as the O(1) scheduler in a strict algorithmic sense, but it has other advantages, such as better system responsiveness for interactive and real-time tasks. Test results for CFS have been positive, and the difference in performance between the O(1) and CFS schedulers is marginal even at very high loads.

The CFS algorithm is based on the idea that a truly multi-tasking CPU would run all tasks simultaneously, giving each a fair share of its processing power. On real hardware, however, only one task can run on a CPU at a time, so CFS instead uses the concept of virtual time. Every task is assigned a virtual runtime in nanoseconds, tracked through a per-task variable; let us call this variable vruntime. A task with a low vruntime has not yet got its fair share of the CPU, whereas a task with a high value may have got more than its share. The aim of the CFS algorithm is to run tasks so that every task gets its fair share and the system stays balanced. CFS always picks the task with the lowest vruntime. For this it uses a time-ordered red-black tree instead of priority queues: the tree is a per-CPU binary search tree, sorted with vruntime as the key, and every runnable task on a CPU has a node in the tree. As a task runs, the scheduler increases its vruntime; at some point its vruntime is no longer the lowest in the tree, another task with a lower value gets selected, and the current task is switched out. The algorithm proceeds in this manner to ensure that every task gets its proportional share of CPU time.

Processor Scalability - On an SMP system, the scheduler can be invoked concurrently on any of the processors. If the scheduler's data structures were shared, they would become a source of contention between CPUs. The scheduler's main data structure, containing the list of runnable tasks, is called a run-queue. The run-queue is a container structure that stores data such as the list of runnable tasks, the load on the CPU, and a count of tasks. For the scheduler to scale well with more CPUs, it is important that access to this data structure is parallelized.

[Figure 4.0: Global run-queue in Linux 2.4 - all CPUs contend for a single task-queue lock]

The Linux 2.4 kernel had a single run queue shared by all processors, as shown in Figure 4.0. A processor would acquire a lock on the run-queue while scheduling the next task to run, blocking all other CPUs attempting to schedule at the same time. Since tasks run on a CPU for a very small time slice, the scheduler is invoked frequently, and as the number of CPUs goes up this contention can seriously hurt the scheduler's efficiency.

Linux 2.6 introduces a run-queue per processor, as shown in Figures 4.1a and 4.1b below.

[Figure 4.1a: Linux 2.6 per-CPU run-queue with the O(1) scheduler - each CPU has its own set of N priority FIFOs, an N-bit bitmap and a lock]

While the initial releases of Linux 2.6 kept N task priority queues (one per priority level) and a bitmap within each run-queue, the newer CFS scheduler stores the red-black tree and its associated elements within the run-queue. The run-queue also has additional fields, such as CPU load indicators and SMP load-balancing data. Each run-queue is protected by a spin lock. Tasks that are runnable on a particular CPU are stored in that CPU's own run-queue data structure and cannot be touched by another processor. Such a design is often called a multi-queue scheduler, and it significantly reduces contention between CPUs concurrently running the scheduler. This approach also solves the problem of CPU affinity, as described later in this section under CPU Affinity.

[Figure 4.1b: Linux 2.6 per-CPU run-queue with the CFS scheduler - tasks Ta to Tn are distributed across two per-CPU red-black trees, each protected by its own lock]
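The heart of the CFS idea fits in a few lines. The toy program below (written for this discussion; the real implementation lives in the kernel's scheduler code and uses an actual red-black tree) keeps a virtual runtime per task and always runs the task with the smallest one:

    #include <stdio.h>

    struct task {
        const char *name;
        unsigned long long vruntime;  /* virtual runtime, in nanoseconds */
    };

    /* Stand-in for the red-black tree lookup: CFS keeps tasks sorted by
     * vruntime, so the leftmost node is the next task; here we just scan. */
    static struct task *pick_next(struct task *rq, int n)
    {
        struct task *next = &rq[0];
        for (int i = 1; i < n; i++)
            if (rq[i].vruntime < next->vruntime)
                next = &rq[i];
        return next;
    }

    int main(void)
    {
        struct task rq[] = { { "A", 3000 }, { "B", 1000 }, { "C", 2000 } };

        for (int tick = 0; tick < 4; tick++) {
            struct task *t = pick_next(rq, 3);
            printf("tick %d: run %s (vruntime %llu)\n",
                   tick, t->name, t->vruntime);
            t->vruntime += 1500;  /* charge the task for the time it ran */
        }
        return 0;
    }

The run order comes out as B, C, B, A: the task that has had the least CPU time always goes next. In the kernel the charge is also weighted by the task's nice value, which is how priorities are expressed under CFS.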

Load Balancing

On an SMP system the scheduler needs to ensure that tasks are evenly distributed across all CPUs, apart from ensuring that all processes get a fair share of CPU time. The scheduler cannot predict how long a task will run when it is created and scheduled for the first time; that depends on the program flow. As more processes are created over time, the load on the CPUs is bound to become unbalanced: some tasks are short-lived, others long-running; some are CPU-intensive while others are more I/O-bound.

Linux 2.6 uses a sophisticated load-balancing algorithm that supports different multi-processing architectures, such as SMP, Hyper-Threading and NUMA. All CPUs are divided into scheduling domains. Scheduling domains are hierarchical and reflect the CPU topology of the system. Each domain is further split into groups, each group representing a subset of the domain's CPUs. Load balancing is always done between the groups of a scheduling domain: a process is migrated only if the workload of one group in a domain is much lower than the workload of another group in the same domain. Scheduling domains are not as relevant to plain SMP systems as they are to NUMA or Hyper-Threaded topologies, since an SMP system is symmetric: all CPUs are identical and have equal access to memory and I/O. An SMP system could have just one domain, with each group in the domain corresponding to a physical CPU.

Let us now look at load balancing from an SMP perspective. The kernel periodically examines the run-queues of all CPUs to see if the load is balanced; the run-queue of a processor tracks the load on that CPU. At periodic intervals the kernel re-calculates the load and determines whether balancing is required. How often the load balancer is invoked depends on the current state of the CPU: if the CPU is idle (its run-queue is empty), the load-balancing code is invoked quite frequently; if the CPU has an active run-queue, the balancer is called less often. Once invoked, the load balancer looks for the busiest CPU in the scheduling domain to determine the load imbalance. The busiest CPU is identified only if it is significantly busier than the other CPUs in the domain. The balancer then attempts to pull tasks from the busiest CPU to the local CPU on which it is running. A task migration happens only after taking the following into consideration (see the sketch after this list):

1. The process on the busiest CPU is not currently running.
2. The local CPU is idle.
3. Previous attempts to balance the system by migrating tasks from the busiest CPU have failed.
4. The process to be moved is not cache-hot. A process is likely to have a hot CPU cache if it ran recently on that CPU; moving a process with a hot cache is expensive, as the cache contents on the busiest CPU are lost and must be rebuilt on the local CPU.

If the task migration fails, the kernel looks for another idle CPU in the system and, if it finds one, re-attempts the migration from the busiest CPU to balance the system.
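The shape of that decision can be sketched in a few lines of C (illustrative pseudologic written for this discussion; the structure and function names are invented, and the real kernel applies thresholds and per-domain tuning on top of this basic shape):

    #include <stdbool.h>
    #include <stdio.h>

    struct cpu_rq {
        int load;        /* how busy this CPU's run-queue is */
        bool idle;       /* true if nothing is runnable here */
    };

    struct task_info {
        bool running;    /* currently executing on its CPU */
        bool cache_hot;  /* ran recently enough that its cache is warm */
    };

    /* Decide whether 'victim', queued on 'busiest', may be pulled to 'local'. */
    static bool can_migrate(const struct task_info *victim,
                            const struct cpu_rq *busiest,
                            const struct cpu_rq *local,
                            int failed_attempts)
    {
        if (busiest->load <= local->load)
            return false;              /* no imbalance worth fixing */
        if (victim->running)
            return false;              /* never pull a running task */
        if (local->idle || failed_attempts > 0)
            return true;               /* idle CPUs and repeated failures
                                          justify a more aggressive pull */
        return !victim->cache_hot;     /* otherwise leave cache-hot tasks alone */
    }

    int main(void)
    {
        struct cpu_rq busiest = { 8, false }, local = { 1, false };
        struct task_info t = { false, true };  /* not running, warm cache */

        printf("migrate? %s\n",
               can_migrate(&t, &busiest, &local, 0) ? "yes" : "no");
        return 0;
    }

With a warm cache and no failed attempts the answer is no: protecting cache affinity wins over perfect balance, exactly the trade-off described above.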

CPU Affinity

As mentioned earlier in this section, Linux 2.6 uses an individual run-queue for each CPU. As the run-queue contains only the list of tasks to be scheduled on that CPU, a task has a natural affinity for the last CPU on which it was scheduled and continues to be re-scheduled on the same CPU. The only scenario in which a task is migrated to the run-queue of another CPU is when the load-balancing algorithm runs and finds that the load needs to be rebalanced; an unlucky task may then get selected and migrated to a different CPU. While this may give better CPU load balancing, frequent task migration is expensive, as the CPU cache must be rebuilt for the migrated process. The load-balancing algorithm therefore tries to minimize task migration while balancing the load. One of the criteria it uses to decide whether a task can be migrated from a busy CPU to an idle one is whether the task has a hot cache, that is, whether it recently ran on the busy CPU; if so, the algorithm may not select the task for migration.

Synchronization and Parallelization

The Linux 2.6 kernel supports kernel pre-emption: a task running in kernel mode can be switched off the CPU at any time by the task scheduler, and another task can run. SMP together with kernel pre-emption increases the concurrency within the kernel. Linux provides several synchronization techniques to handle this challenge: spin locks, semaphores, per-CPU variables and atomic variables. While spin locks, semaphores and atomic variables ensure atomic operation of the critical regions they protect, per-CPU variables allow a greater degree of parallelization by avoiding the need to synchronize between CPUs entirely. The kernel also supports variants of these mechanisms, such as reader-writer spin locks and semaphores, sequence locks and completion variables; these are not discussed here, as the focus is on the fundamental techniques.

Per-CPU Variables

Per-CPU variables replicate a data structure so that every CPU has its own copy of the data and can access it without any contention from tasks running on other CPUs. Another advantage of a per-CPU variable is that a frequently used variable can stay resident in the local CPU cache and be accessed much faster. Per-CPU variables do not migrate along with tasks to other CPUs and so help reduce the cache invalidation caused by task migration (or task ping-pong, as it is popularly called in Linux circles). Although per-CPU variables do not need to be protected from tasks running on other CPUs, the kernel is preemptive, so protection may be needed from tasks on the same CPU. Also, if a task is pre-empted and later re-scheduled on another CPU, it may leave a per-CPU variable in an inconsistent state. For these reasons, it is desirable to disable kernel pre-emption while modifying per-CPU variables; the Linux kernel provides macros and functions to handle this.
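A minimal kernel-side sketch of a per-CPU variable (assuming the classic 2.6-era per-CPU API; the counter and its use are illustrative):

    #include <linux/percpu.h>

    /* One instance of this counter exists for every CPU in the system. */
    static DEFINE_PER_CPU(unsigned long, packet_count);

    static void count_packet(void)
    {
        /* get_cpu_var() disables pre-emption and returns this CPU's copy,
         * so the update cannot race with other CPUs and the task cannot
         * be migrated in the middle of it. */
        get_cpu_var(packet_count)++;
        put_cpu_var(packet_count);   /* re-enables pre-emption */
    }

No lock is taken anywhere: each CPU only ever touches its own copy, which is exactly why per-CPU data parallelizes so well.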

Atomic Variables

Often a critical region can be as small as incrementing a variable or adding a value to it. Within a CPU, only single instructions can be guaranteed to be atomic, and on an SMP system even some single but complex instructions may not be atomic, because all CPUs share the system bus that connects them to memory. A hardware memory arbiter regulates access to memory for all the CPUs and may hand the bus to another CPU before the current one has completed all the steps of an instruction. Code quite often needs atomic read-modify-write operations that increment a counter, assign a new value, or add two memory locations and store the result back to memory. For example, the following line of C code may be compiled into the sequence of instructions shown below it:

    /* Add j to i and assign to i. */
    i = i + j;

    copy i into a CPU register
    copy j into a CPU register
    add the two registers into a result register
    store the result register to memory location i

Clearly, the compiler has generated four instructions for the single line of C code. As the kernel supports pre-emption, the task executing this code may get pre-empted in the middle, and the memory locations could be updated by another task. To provide atomicity for such simple operations, Linux provides an atomic data type, atomic_t, along with macros and functions that perform simple operations such as incrementing the variable, adding a value to it, or adding and testing the result of the addition. Most hardware platforms provide atomic instructions that can increment/decrement or add/subtract a memory location; some platforms lack such instructions but allow the system bus to be locked while the CPU executes the operation. The Linux kernel implements operations on atomic_t using these machine facilities.
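A short sketch of the atomic_t interface (the functions are the kernel's real atomic operations; the usage counter itself is illustrative):

    #include <asm/atomic.h>

    static atomic_t open_count = ATOMIC_INIT(0);

    static void device_opened(void)
    {
        atomic_inc(&open_count);        /* atomic increment, no lock needed */
    }

    static int device_closed_last(void)
    {
        /* Atomically decrement and test for zero in a single step;
         * returns true when this was the last close. */
        return atomic_dec_and_test(&open_count);
    }

Because the increment, and the decrement-and-test, each map to a single atomic machine operation, no task on any CPU can observe or create a half-updated counter.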

Spin Locks

A spin lock is the most fundamental and widely used synchronization mechanism in an SMP kernel. Spin locks are unique to SMP Linux kernels and are not used on a UP Linux system. A spin lock, as the name indicates, is a light-weight spinning (looping) lock: a task attempting to acquire a spin lock either gets the lock immediately or loops until it does. A spin lock thus keeps the task running on the CPU, busy-waiting for the lock. Spin locks are usually implemented with platform-specific atomic test-and-set instructions that check the value of a memory location and set it in an atomic manner; the lock loops until the test-and-set succeeds. Figure 4.2 shows two tasks, A and B, trying to acquire a spin lock S: task A succeeds and executes the critical region while B spins on another CPU waiting for A to release the lock. Note that a spin lock is useful only on multiprocessor systems. On a UP system, if B were to spin, it would wait indefinitely: A would not get a chance to run (the spin lock could disable kernel pre-emption), so B would keep running and create a deadlock.

[Figure 4.2: Tasks A and B try to acquire spin lock S; B fails and spins, keeping its CPU busy, until A releases the lock]

A spin lock is used for fine-grained locking and protects short critical regions of code on SMP systems. As a waiting task spins, precious CPU cycles are spent doing nothing, so spin locks are suited only to locks held for short periods. Spin locks are also commonly used by interrupt handlers to lock a resource: interrupts cannot use sleeping mechanisms such as semaphores, because they are not schedulable tasks, and a sleeping lock may put its caller to sleep to be re-scheduled later. Clearly, spin locks are only useful for protecting regions of code that execute quickly; they cannot be used where the critical region may need to sleep waiting for an event, as the blocked tasks would spin away CPU cycles. For such critical regions Linux provides another locking mechanism: semaphores.

Semaphores

A semaphore is another fundamental locking mechanism and can be used to protect large critical regions, which may execute for longer and may sleep within the critical region while waiting for an event; using a spin lock in such scenarios is clearly inadvisable. Unlike a spin lock, a semaphore is a sleeping lock. If a task tries to acquire a semaphore that is currently locked, the task is added to a wait queue and put to sleep. Later, when the task holding the semaphore releases it, the first task in the wait queue is awakened; when scheduled to run, it acquires the semaphore and executes the critical region. As the waiting task sleeps, the kernel can schedule other work on that CPU. The task holding the semaphore can also itself be pre-empted, as a semaphore does not disable kernel pre-emption. Semaphores are suitable for coarse-grained locking, since they let the scheduler run other tasks. They are not suited to short critical regions, because they are not light-weight like spin locks: there is significant overhead in maintaining wait queues and in the context switches needed to put a task to sleep on a locked semaphore and wake it again when the lock is released.

The Big Kernel Lock (BKL)

The Big Kernel Lock (BKL) is a historic coarse-grained lock that was introduced in Linux 2.0 to support SMP. The BKL was used to lock large sections of code to provide synchronization within the kernel on an SMP system. The Linux 2.6 kernel still contains the BKL, mostly in legacy file-system code, but its use is restricted, as it increases serialization within the kernel and reduces performance as more processors are added.
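To make the choice between the two primitives concrete, here is a kernel-side sketch using the classic 2.6-era interfaces (the data being protected and the function names are illustrative):

    #include <linux/spinlock.h>
    #include <linux/errno.h>
    #include <asm/semaphore.h>

    static DEFINE_SPINLOCK(stats_lock);   /* protects the counter below */
    static unsigned long stats_counter;

    static struct semaphore config_sem;   /* set up with sema_init(&config_sem, 1) */

    /* Short, non-sleeping update: a spin lock is appropriate. */
    static void bump_stats(void)
    {
        spin_lock(&stats_lock);
        stats_counter++;                  /* critical region lasts a few cycles */
        spin_unlock(&stats_lock);
    }

    /* Long operation that may sleep (for example, waiting for I/O):
     * a semaphore lets the CPU run other tasks in the meantime. */
    static int update_config(void)
    {
        if (down_interruptible(&config_sem))
            return -EINTR;                /* interrupted while sleeping */
        /* ... long-running work that may block goes here ... */
        up(&config_sem);
        return 0;
    }

The rule of thumb follows directly from the discussion above: spin for nanoseconds, sleep for anything longer.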

Application Development on an SMP Linux System

An SMP system is a parallel processing system, and application design should mirror the system to take advantage of the parallel processing capabilities of SMP Linux. A large monolithic application will more often than not underperform on SMP Linux, because the application is serialized and cannot harness the parallelism of the system. Let us discuss this with an example.

An application needs to process N input feeds. Each feed has M records of information, each record can be processed independently of the others, and every feed can also be processed independently. Consider how a monolithic application handles this problem. As shown in Figure 5.0, the program opens a feed, processes each record in the feed sequentially, then opens the next feed, and continues until all feeds are processed. Assume each record takes time T to process. The time taken by the application to process N feeds is then T x M x N. If there are 10 feeds with 100 records per feed and each record takes 5 seconds to process, the time required is 5 x 100 x 10 = 5000 seconds. Can an SMP Linux system execute this application faster, all other run conditions being constant? No. Perhaps there can be a slight improvement if certain kernel parameters are tuned, but that is not the problem here.

[Figure 5.0: Monolithic application - a single loop opens each of the N feeds in turn and processes its M records sequentially]

The problem is that the application does not take advantage of the SMP capabilities of the Linux platform. It will be scheduled on one CPU and given time slices like any other process. If the load on that CPU is high, the process gets fewer time slices due to contention from other processes; its priority may also be lowered by the scheduler if it ran frequently early on. The application will probably never be moved to a less busy CPU, because of the CPU affinity that the scheduler honours while load balancing, and even if it were moved, no parallelism would result, as the entire application would simply migrate to the new CPU.

Linux provides processes and threads to distribute a program's execution across CPUs. A process or an NPTL thread is an independently schedulable entity and can run on any of the available CPUs. Hence, the right approach to application development on an SMP system is to develop distributed applications: independent tasks, or tasks with a high degree of independence, can be parallelized. In the real world, tasks are rarely completely independent, and there is bound to be some contention for shared resources; Linux provides user-space inter-process communication (IPC) and synchronization mechanisms such as semaphores, mutexes, condition variables and shared memory to coordinate the tasks of a distributed application.

Let us return to our application and see how the design can be improved. Each of the N feeds can be processed by its own feeder thread. Some of these N threads will execute on different CPUs at the same time. In the ideal case, if the system had N CPUs and all N threads ran on different CPUs, the application could complete in T x M time. Of course this is only the ideal, but real performance will still be much better than T x M x N. Threads that end up sharing a CPU are still independently schedulable entities and get their own time slices, so overall the application's performance is bound to improve. The parallelism can be pushed further by adding more threads: for example, each feed can be divided into S segments, with a thread per segment; or the application can use N processes, one per feed, with S threads per process, one per segment. Many decompositions are possible, and the parallelism of the SMP system gives the developer multiple options to improve application performance. A sketch of the thread-per-feed design follows.
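The thread-per-feed design might look like this (a minimal user-space sketch using POSIX threads; process_record() and the feed layout are illustrative placeholders; compile with -pthread):

    #include <pthread.h>
    #include <stdio.h>

    #define N_FEEDS   10
    #define M_RECORDS 100

    /* Placeholder for the real per-record work (time T in the text). */
    static void process_record(int feed, int record)
    {
        (void)feed;
        (void)record;   /* real parsing/transformation would go here */
    }

    /* Each feeder thread processes the M records of one feed independently. */
    static void *feeder(void *arg)
    {
        int feed = (int)(long)arg;
        for (int rec = 0; rec < M_RECORDS; rec++)
            process_record(feed, rec);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[N_FEEDS];

        /* The kernel schedules every thread independently, so the feeds
         * can be processed in parallel on whatever CPUs are available. */
        for (long i = 0; i < N_FEEDS; i++)
            pthread_create(&tid[i], NULL, feeder, (void *)i);
        for (int i = 0; i < N_FEEDS; i++)
            pthread_join(tid[i], NULL);

        printf("processed %d feeds\n", N_FEEDS);
        return 0;
    }

Splitting each feed into S segments would simply mean spawning N x S such threads (or N processes with S threads each), at the cost of more synchronization if the segments share any state.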

Summary

The Linux platform has been evolving at a rapid pace to support true symmetric multi-processing. From the crude SMP support in Linux 2.0, the platform has come a long way: the Linux 2.6 kernel boasts features such as a highly scalable task scheduler, kernel pre-emption and fine-grained locking mechanisms that provide better scalability and performance. For the application developer, Linux provides process and thread creation interfaces such as NPTL, along with IPC mechanisms, to harness the parallelism of the system. The Linux 2.6 kernel is one of the best choices for exploiting the SMP hardware platforms of today.

SMP computing is becoming common, with even personal desktops sporting multiple CPUs. High-end servers and mainframes can also benefit, as Linux is highly scalable and stable. Linux is also portable: much of it is written in C, so new SMP platforms can port Linux and avoid the costs involved in developing an OS from scratch. A fine example is the IBM System z mainframe, which can run Linux. Finally, Linux is open, free and evolving. A large and talented development community is devoted to the development and testing of the platform; innovation is encouraged, anyone is free to contribute to the future evolution of Linux, and there is no fear of development stopping dead in its tracks due to budgetary concerns.

To conclude, SMP Linux computing has a bright future ahead. It is already one of the leading SMP server operating systems and is a serious, free challenger for SMP desktop computing. The future of computing may well be oriented towards multi-processing as processor clock speeds hit an upper limit, and Linux could be the right platform to run these systems.


Linux scheduler history. We will be talking about the O(1) scheduler CPU Scheduling Linux scheduler history We will be talking about the O(1) scheduler SMP Support in 2.4 and 2.6 versions 2.4 Kernel 2.6 Kernel CPU1 CPU2 CPU3 CPU1 CPU2 CPU3 Linux Scheduling 3 scheduling

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Real-Time Systems Prof. Dr. Rajib Mall Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No. # 26 Real - Time POSIX. (Contd.) Ok Good morning, so let us get

More information

Chapter 2: OS Overview

Chapter 2: OS Overview Chapter 2: OS Overview CmSc 335 Operating Systems 1. Operating system objectives and functions Operating systems control and support the usage of computer systems. a. usage users of a computer system:

More information

Chapter 18: Database System Architectures. Centralized Systems

Chapter 18: Database System Architectures. Centralized Systems Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Optimizing Shared Resource Contention in HPC Clusters

Optimizing Shared Resource Contention in HPC Clusters Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs

More information

Processes and Non-Preemptive Scheduling. Otto J. Anshus

Processes and Non-Preemptive Scheduling. Otto J. Anshus Processes and Non-Preemptive Scheduling Otto J. Anshus 1 Concurrency and Process Challenge: Physical reality is Concurrent Smart to do concurrent software instead of sequential? At least we want to have

More information

independent systems in constant communication what they are, why we care, how they work

independent systems in constant communication what they are, why we care, how they work Overview of Presentation Major Classes of Distributed Systems classes of distributed system loosely coupled systems loosely coupled, SMP, Single-system-image Clusters independent systems in constant communication

More information

Chapter 5 Process Scheduling

Chapter 5 Process Scheduling Chapter 5 Process Scheduling CPU Scheduling Objective: Basic Scheduling Concepts CPU Scheduling Algorithms Why Multiprogramming? Maximize CPU/Resources Utilization (Based on Some Criteria) CPU Scheduling

More information

Chapter 1: Operating System Models 1 2 Operating System Models 2.1 Introduction Over the past several years, a number of trends affecting operating system design are witnessed and foremost among them is

More information

Linux Scheduler. Linux Scheduler

Linux Scheduler. Linux Scheduler or or Affinity Basic Interactive es 1 / 40 Reality... or or Affinity Basic Interactive es The Linux scheduler tries to be very efficient To do that, it uses some complex data structures Some of what it

More information

Performance Comparison of RTOS

Performance Comparison of RTOS Performance Comparison of RTOS Shahmil Merchant, Kalpen Dedhia Dept Of Computer Science. Columbia University Abstract: Embedded systems are becoming an integral part of commercial products today. Mobile

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

Operating Systems Concepts: Chapter 7: Scheduling Strategies

Operating Systems Concepts: Chapter 7: Scheduling Strategies Operating Systems Concepts: Chapter 7: Scheduling Strategies Olav Beckmann Huxley 449 http://www.doc.ic.ac.uk/~ob3 Acknowledgements: There are lots. See end of Chapter 1. Home Page for the course: http://www.doc.ic.ac.uk/~ob3/teaching/operatingsystemsconcepts/

More information

CPU SCHEDULING (CONT D) NESTED SCHEDULING FUNCTIONS

CPU SCHEDULING (CONT D) NESTED SCHEDULING FUNCTIONS CPU SCHEDULING CPU SCHEDULING (CONT D) Aims to assign processes to be executed by the CPU in a way that meets system objectives such as response time, throughput, and processor efficiency Broken down into

More information

Process Scheduling CS 241. February 24, 2012. Copyright University of Illinois CS 241 Staff

Process Scheduling CS 241. February 24, 2012. Copyright University of Illinois CS 241 Staff Process Scheduling CS 241 February 24, 2012 Copyright University of Illinois CS 241 Staff 1 Announcements Mid-semester feedback survey (linked off web page) MP4 due Friday (not Tuesday) Midterm Next Tuesday,

More information

Linux Scheduler Analysis and Tuning for Parallel Processing on the Raspberry PI Platform. Ed Spetka Mike Kohler

Linux Scheduler Analysis and Tuning for Parallel Processing on the Raspberry PI Platform. Ed Spetka Mike Kohler Linux Scheduler Analysis and Tuning for Parallel Processing on the Raspberry PI Platform Ed Spetka Mike Kohler Outline Abstract Hardware Overview Completely Fair Scheduler Design Theory Breakdown of the

More information

Principles and characteristics of distributed systems and environments

Principles and characteristics of distributed systems and environments Principles and characteristics of distributed systems and environments Definition of a distributed system Distributed system is a collection of independent computers that appears to its users as a single

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

CPU Scheduling. Core Definitions

CPU Scheduling. Core Definitions CPU Scheduling General rule keep the CPU busy; an idle CPU is a wasted CPU Major source of CPU idleness: I/O (or waiting for it) Many programs have a characteristic CPU I/O burst cycle alternating phases

More information

Tasks Schedule Analysis in RTAI/Linux-GPL

Tasks Schedule Analysis in RTAI/Linux-GPL Tasks Schedule Analysis in RTAI/Linux-GPL Claudio Aciti and Nelson Acosta INTIA - Depto de Computación y Sistemas - Facultad de Ciencias Exactas Universidad Nacional del Centro de la Provincia de Buenos

More information

Operating Systems. Virtual Memory

Operating Systems. Virtual Memory Operating Systems Virtual Memory Virtual Memory Topics. Memory Hierarchy. Why Virtual Memory. Virtual Memory Issues. Virtual Memory Solutions. Locality of Reference. Virtual Memory with Segmentation. Page

More information

Operating System: Scheduling

Operating System: Scheduling Process Management Operating System: Scheduling OS maintains a data structure for each process called Process Control Block (PCB) Information associated with each PCB: Process state: e.g. ready, or waiting

More information

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures Chapter 18: Database System Architectures Centralized Systems! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types! Run on a single computer system and do

More information

Operating Systems, 6 th ed. Test Bank Chapter 7

Operating Systems, 6 th ed. Test Bank Chapter 7 True / False Questions: Chapter 7 Memory Management 1. T / F In a multiprogramming system, main memory is divided into multiple sections: one for the operating system (resident monitor, kernel) and one

More information

Comp 204: Computer Systems and Their Implementation. Lecture 12: Scheduling Algorithms cont d

Comp 204: Computer Systems and Their Implementation. Lecture 12: Scheduling Algorithms cont d Comp 204: Computer Systems and Their Implementation Lecture 12: Scheduling Algorithms cont d 1 Today Scheduling continued Multilevel queues Examples Thread scheduling 2 Question A starvation-free job-scheduling

More information

Operating Systems for Parallel Processing Assistent Lecturer Alecu Felician Economic Informatics Department Academy of Economic Studies Bucharest

Operating Systems for Parallel Processing Assistent Lecturer Alecu Felician Economic Informatics Department Academy of Economic Studies Bucharest Operating Systems for Parallel Processing Assistent Lecturer Alecu Felician Economic Informatics Department Academy of Economic Studies Bucharest 1. Introduction Few years ago, parallel computers could

More information

The Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage

The Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage The Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage sponsored by Dan Sullivan Chapter 1: Advantages of Hybrid Storage... 1 Overview of Flash Deployment in Hybrid Storage Systems...

More information

Scheduling. Yücel Saygın. These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum

Scheduling. Yücel Saygın. These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum Scheduling Yücel Saygın These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum 1 Scheduling Introduction to Scheduling (1) Bursts of CPU usage alternate with periods

More information

Design and Implementation of the Heterogeneous Multikernel Operating System

Design and Implementation of the Heterogeneous Multikernel Operating System 223 Design and Implementation of the Heterogeneous Multikernel Operating System Yauhen KLIMIANKOU Department of Computer Systems and Networks, Belarusian State University of Informatics and Radioelectronics,

More information

Parallel Scalable Algorithms- Performance Parameters

Parallel Scalable Algorithms- Performance Parameters www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

Readings for this topic: Silberschatz/Galvin/Gagne Chapter 5

Readings for this topic: Silberschatz/Galvin/Gagne Chapter 5 77 16 CPU Scheduling Readings for this topic: Silberschatz/Galvin/Gagne Chapter 5 Until now you have heard about processes and memory. From now on you ll hear about resources, the things operated upon

More information

Parallelism and Cloud Computing

Parallelism and Cloud Computing Parallelism and Cloud Computing Kai Shen Parallel Computing Parallel computing: Process sub tasks simultaneously so that work can be completed faster. For instances: divide the work of matrix multiplication

More information

Process Scheduling in Linux

Process Scheduling in Linux The Gate of the AOSP #4 : Gerrit, Memory & Performance Process Scheduling in Linux 2013. 3. 29 Namhyung Kim Outline 1 Process scheduling 2 SMP scheduling 3 Group scheduling - www.kandroid.org 2/ 41 Process

More information

The CPU Scheduler in VMware vsphere 5.1

The CPU Scheduler in VMware vsphere 5.1 VMware vsphere 5.1 Performance Study TECHNICAL WHITEPAPER Table of Contents Executive Summary... 4 Introduction... 4 Terminology... 4 CPU Scheduler Overview... 5 Design Goals... 5 What, When, and Where

More information

Operating Systems Lecture #6: Process Management

Operating Systems Lecture #6: Process Management Lecture #6: Process Written by based on the lecture series of Dr. Dayou Li and the book Understanding 4th ed. by I.M.Flynn and A.McIver McHoes (2006) Department of Computer Science and Technology,., 2013

More information

White Paper Perceived Performance Tuning a system for what really matters

White Paper Perceived Performance Tuning a system for what really matters TMurgent Technologies White Paper Perceived Performance Tuning a system for what really matters September 18, 2003 White Paper: Perceived Performance 1/7 TMurgent Technologies Introduction The purpose

More information

Deciding which process to run. (Deciding which thread to run) Deciding how long the chosen process can run

Deciding which process to run. (Deciding which thread to run) Deciding how long the chosen process can run SFWR ENG 3BB4 Software Design 3 Concurrent System Design 2 SFWR ENG 3BB4 Software Design 3 Concurrent System Design 11.8 10 CPU Scheduling Chapter 11 CPU Scheduling Policies Deciding which process to run

More information

Multi-core Programming System Overview

Multi-core Programming System Overview Multi-core Programming System Overview Based on slides from Intel Software College and Multi-Core Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts,

More information

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...

More information

COS 318: Operating Systems. Virtual Machine Monitors

COS 318: Operating Systems. Virtual Machine Monitors COS 318: Operating Systems Virtual Machine Monitors Andy Bavier Computer Science Department Princeton University http://www.cs.princeton.edu/courses/archive/fall10/cos318/ Introduction Have been around

More information

Symmetric Multiprocessing

Symmetric Multiprocessing Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called

More information

Overview of Presentation. (Greek to English dictionary) Different systems have different goals. What should CPU scheduling optimize?

Overview of Presentation. (Greek to English dictionary) Different systems have different goals. What should CPU scheduling optimize? Overview of Presentation (Greek to English dictionary) introduction to : elements, purpose, goals, metrics lambda request arrival rate (e.g. 200/second) non-preemptive first-come-first-served, shortest-job-next

More information

Infrastructure Matters: POWER8 vs. Xeon x86

Infrastructure Matters: POWER8 vs. Xeon x86 Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report

More information

OpenMosix Presented by Dr. Moshe Bar and MAASK [01]

OpenMosix Presented by Dr. Moshe Bar and MAASK [01] OpenMosix Presented by Dr. Moshe Bar and MAASK [01] openmosix is a kernel extension for single-system image clustering. openmosix [24] is a tool for a Unix-like kernel, such as Linux, consisting of adaptive

More information

Operating System Tutorial

Operating System Tutorial Operating System Tutorial OPERATING SYSTEM TUTORIAL Simply Easy Learning by tutorialspoint.com tutorialspoint.com i ABOUT THE TUTORIAL Operating System Tutorial An operating system (OS) is a collection

More information

Overview of the Linux Scheduler Framework

Overview of the Linux Scheduler Framework Overview of the Linux Scheduler Framework WORKSHOP ON REAL-TIME SCHEDULING IN THE LINUX KERNEL Pisa, June 27th, 2014 Marco Cesati University of Rome Tor Vergata Marco Cesati (Univ. of Rome Tor Vergata)

More information

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance. Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance

More information

Garbage Collection in the Java HotSpot Virtual Machine

Garbage Collection in the Java HotSpot Virtual Machine http://www.devx.com Printed from http://www.devx.com/java/article/21977/1954 Garbage Collection in the Java HotSpot Virtual Machine Gain a better understanding of how garbage collection in the Java HotSpot

More information

Delivering Quality in Software Performance and Scalability Testing

Delivering Quality in Software Performance and Scalability Testing Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,

More information

Course Development of Programming for General-Purpose Multicore Processors

Course Development of Programming for General-Purpose Multicore Processors Course Development of Programming for General-Purpose Multicore Processors Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Richmond, VA 23284 wzhang4@vcu.edu

More information

CHAPTER 15: Operating Systems: An Overview

CHAPTER 15: Operating Systems: An Overview CHAPTER 15: Operating Systems: An Overview The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint

More information

Why Threads Are A Bad Idea (for most purposes)

Why Threads Are A Bad Idea (for most purposes) Why Threads Are A Bad Idea (for most purposes) John Ousterhout Sun Microsystems Laboratories john.ousterhout@eng.sun.com http://www.sunlabs.com/~ouster Introduction Threads: Grew up in OS world (processes).

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Operating Systems 4 th Class

Operating Systems 4 th Class Operating Systems 4 th Class Lecture 1 Operating Systems Operating systems are essential part of any computer system. Therefore, a course in operating systems is an essential part of any computer science

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

Scheduling. Monday, November 22, 2004

Scheduling. Monday, November 22, 2004 Scheduling Page 1 Scheduling Monday, November 22, 2004 11:22 AM The scheduling problem (Chapter 9) Decide which processes are allowed to run when. Optimize throughput, response time, etc. Subject to constraints

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

Distributed Systems LEEC (2005/06 2º Sem.)

Distributed Systems LEEC (2005/06 2º Sem.) Distributed Systems LEEC (2005/06 2º Sem.) Introduction João Paulo Carvalho Universidade Técnica de Lisboa / Instituto Superior Técnico Outline Definition of a Distributed System Goals Connecting Users

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

Multicore Programming with LabVIEW Technical Resource Guide

Multicore Programming with LabVIEW Technical Resource Guide Multicore Programming with LabVIEW Technical Resource Guide 2 INTRODUCTORY TOPICS UNDERSTANDING PARALLEL HARDWARE: MULTIPROCESSORS, HYPERTHREADING, DUAL- CORE, MULTICORE AND FPGAS... 5 DIFFERENCES BETWEEN

More information

Hadoop Fair Scheduler Design Document

Hadoop Fair Scheduler Design Document Hadoop Fair Scheduler Design Document October 18, 2010 Contents 1 Introduction 2 2 Fair Scheduler Goals 2 3 Scheduler Features 2 3.1 Pools........................................ 2 3.2 Minimum Shares.................................

More information

MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp

MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source

More information

W4118 Operating Systems. Instructor: Junfeng Yang

W4118 Operating Systems. Instructor: Junfeng Yang W4118 Operating Systems Instructor: Junfeng Yang Outline Advanced scheduling issues Multilevel queue scheduling Multiprocessor scheduling issues Real-time scheduling Scheduling in Linux Scheduling algorithm

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

Main Points. Scheduling policy: what to do next, when there are multiple threads ready to run. Definitions. Uniprocessor policies

Main Points. Scheduling policy: what to do next, when there are multiple threads ready to run. Definitions. Uniprocessor policies Scheduling Main Points Scheduling policy: what to do next, when there are multiple threads ready to run Or multiple packets to send, or web requests to serve, or Definitions response time, throughput,

More information

- An Essential Building Block for Stable and Reliable Compute Clusters

- An Essential Building Block for Stable and Reliable Compute Clusters Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative

More information

Outline: Operating Systems

Outline: Operating Systems Outline: Operating Systems What is an OS OS Functions Multitasking Virtual Memory File Systems Window systems PC Operating System Wars: Windows vs. Linux 1 Operating System provides a way to boot (start)

More information

Objectives. Chapter 5: Process Scheduling. Chapter 5: Process Scheduling. 5.1 Basic Concepts. To introduce CPU scheduling

Objectives. Chapter 5: Process Scheduling. Chapter 5: Process Scheduling. 5.1 Basic Concepts. To introduce CPU scheduling Objectives To introduce CPU scheduling To describe various CPU-scheduling algorithms Chapter 5: Process Scheduling To discuss evaluation criteria for selecting the CPUscheduling algorithm for a particular

More information

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture Last Class: OS and Computer Architecture System bus Network card CPU, memory, I/O devices, network card, system bus Lecture 3, page 1 Last Class: OS and Computer Architecture OS Service Protection Interrupts

More information

Road Map. Scheduling. Types of Scheduling. Scheduling. CPU Scheduling. Job Scheduling. Dickinson College Computer Science 354 Spring 2010.

Road Map. Scheduling. Types of Scheduling. Scheduling. CPU Scheduling. Job Scheduling. Dickinson College Computer Science 354 Spring 2010. Road Map Scheduling Dickinson College Computer Science 354 Spring 2010 Past: What an OS is, why we have them, what they do. Base hardware and support for operating systems Process Management Threads Present:

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

OPERATING SYSTEMS SCHEDULING

OPERATING SYSTEMS SCHEDULING OPERATING SYSTEMS SCHEDULING Jerry Breecher 5: CPU- 1 CPU What Is In This Chapter? This chapter is about how to get a process attached to a processor. It centers around efficient algorithms that perform

More information

Run-Time Scheduling Support for Hybrid CPU/FPGA SoCs

Run-Time Scheduling Support for Hybrid CPU/FPGA SoCs Run-Time Scheduling Support for Hybrid CPU/FPGA SoCs Jason Agron jagron@ittc.ku.edu Acknowledgements I would like to thank Dr. Andrews, Dr. Alexander, and Dr. Sass for assistance and advice in both research

More information

UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS

UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS UNIT 2 CLASSIFICATION OF PARALLEL COMPUTERS Structure Page Nos. 2.0 Introduction 27 2.1 Objectives 27 2.2 Types of Classification 28 2.3 Flynn s Classification 28 2.3.1 Instruction Cycle 2.3.2 Instruction

More information