A Survey of Fitting Device-Driver Implementations into Real-Time Theoretical Schedulability Analysis

Transcription

1 A Survey of Fitting Device-Driver Implementations into Real-Time Theoretical Schedulability Analysis Mark Stanovich Florida State University, USA Contents 1 Introduction 2 2 Scheduling Theory Workload Models Scheduling Static Schedulers Priority Scheduling Fixed-Priority Scheduling Dynamic-Priority Scheduling Schedulability Tests Basic Scheduling Threads Scheduler Switching Among Threads Choosing Threads Thread States Fairness Priorities Regaining Control Voluntary Yield Forced Yield Temporal Isolation Conveyance of Task/Scheduling Policy Semantics Device Drivers CPU Time Execution Accounting Control I/O Scheduling Device Characteristics Device Timings Backlogging of Work Global Scheduling Conclusion 22 4 Basic Requirements for RT Implementation Time Accounting Variabilities in WCET Context Switching Input and Output Operations Caching and Other Processor Optimizations Memory System Workloads Scheduling Overhead Temporal Control Scheduling Jitter Nonpreemptible Sections Non-unified Priority Spaces

2 Abstract General purpose operating systems (GPOSs) are commonly being used in embedded applications. These applications include cellphones, navigations systems (TomTom), routers, etc. While these systems may not be is considered hard real-time, they have timing constraints. That is, missing a deadline may not be catastrophic, but at the same time missed deadlines should not occur frequently. Recently there have been many enhancements in GPOS s real-time characteristics. These include standardized interfaces (e.g. POSIX), more precise processor scheduling, and increased control over system activities. While these enhancements have improved the ability for real-time applications to meet timing constraints, there is still much left to be desired. In particular, improper management of device driver activities can cause extreme difficulties for a system to meet timing constraints. Device drivers consume processor in competition with other real-time activities. Additionally, device drivers act as I/O schedulers, meaning that timeliness of activities such as network and storage device I/O are directly affected by the operation of the device drivers. In order to guarantee deadlines, these factors (processor time and I/O scheduling) must be properly managed. This requires understanding device driver characteristics and performing abstract analysis (on their operations) to ensure timing constraints will be met. This survey will first provide a brief introduction to realtime scheduling theory, which provides the foundation for ensuring timeliness of a system. We will then cover some basic attributes of an operating system that affect the ability of a system to adhere to the theoretical schedulability models. Finally, this paper will survey the approaches that have been developed to deal with the impact device drivers have on the timeliness of an implemented system. 1 Introduction A real-time system has constraints on when activities must be performed. Some examples of such systems include audio/video devices (e.g., mp3 players, mobile phones), control systems (e.g., anti-lock braking systems), and special devices (e.g., navigation systems, pacemakers). The correctness of the system not only depends on the correct output for a given input, but also depends on the time at which the output is provided. If the output arrives too early or too late, the system may fail. This failure could be as simple as a visual glitch when watching a movie, or could be as severe as an explosion at a chemical plant. These timing constraints are an integral part of the system, and guaranteeing that they will always be met is one of the main challenges of building such a system. Traditional real-time systems have typically been embedded devices. These devices have limited resources, use specialized hardware and software, and provide only a few functions. Traditional real-time systems have the advantage of simplicity, which makes the challenge of validating the system s timing correctness tractable. Timing constraints of the system are guaranteed to be met without great difficulty because mapping the theoretical models to the implementation is generally straight-forward and the slight variations from the theoretical models introduced by the implementation can be compensated by minor adjustments to the theoretical model. Over time embedded computing systems have become more prevalent and more integrated into the world around us. Embedded systems are now expected to perform numerous, complex functions. One illustration of this evolution toward greater complexity is the mobile phone. When these devices first emerged on the market, the only functionality expected was to allow voice communication. Now, if we look at any of the new smartphones we will see a much different device, not only capable of voice communication, but resembling a desktop system with services including gaming, , playing music, taking pictures, web browsing, and navigational support. These additional functionalities require much more complex hardware and software, compared to traditional embedded systems. While many real-time applications still run on specialized hardware and software, it is becoming much more common to see real-time systems utilizing off-the-shelf hardware and software. In particular, general purpose operating systems (GPOSs) have found their way into the realtime domain. One recent example of this is the development of the Android platform [21]. The Android platform utilizes a modified version of the popular Linux kernel and other readily available applications. Using a GPOS has numerous advantages, including wide-spread familiarity, lower cost, reduction in maintenance, and availability of many software components. However, GPOSs are much more complicated, making it very difficult to analyze them, and thereby guarantee timing constraints. These GPOSs were never designed for real-time environments. In fact, a common goal of a GPOS is to improve average-case performance and maximize throughput, many times at the cost of increasing the range of execution times between the best-case and worst-case behavior. In contrast to explicitly designed real-time systems, most GPOSs were not designed to consistently provide low-latency response times, predictable scheduling, or explicit allocation of resources. The lack of these attributes can significantly hinder the ability of a system to meet deadlines.

3 Fortunately, significant progress has been made toward providing real-time capabilities in GPOSs. For example, Linux kernel improvements have reduced non-preemptible sections [51], added high-resolution timer support [19, 20], and included application development support through POSIX real-time extensions [25]. On the other hand, the device-driver components of GPOSs have been generally overlooked as a concern for systems that must meet deadlines. Device drivers are an integral part of the overall system, allowing interaction between applications and hardware through a common interface. These device drivers allow an OS to support a wide range of devices without requiring a substantial rewrite of the OS code. Instead, if a new piece of hardware is added, one can just write a new device driver that provides the needed abstraction, without ever touching any of the application software that uses the device. Device drivers can have a considerable effect on the timing of a system. They typically have complete access to the raw hardware functionalities, including the entire processor instruction set, physical memory, and hardware scheduling mechanisms. Worse, device drivers are often developed by third parties whose primary concern is ensuring their devices meet timing constraints, without regard for other realtime activities [34]. Numerous theoretical techniques exist to guarantee given activities will be scheduled to complete by their associated timing constraints [67]. These theoretical techniques generally rely on the real-world activities adhering to some abstract workload models and the system scheduling this work according to a specified scheduling algorithm. The difficulties emerging are that many of the device driver workloads do not adhere to known workload models. Further, the scheduling of this work on GPOSs tends to be ad hoc and deviates significantly from the theoretical scheduling algorithms that have been analyzed, thereby making many analysis techniques unusable for such systems. Other difficulties with device drivers are that the scheduling of their CPU time is commonly performed through hardware schedulers. These hardware schedulers are typically not configurable and the scheduling policy is predetermined. This inflexibility in allocating CPU time means that workload models that provide better system schedulability may not be usable. Therefore, activities may not meet their timing constraints due to inappropriate allocation of the CPU, even though they logically should be able to because the amount of CPU time is available, just not at the right time. This paper is organized as follows: Section 2 will cover some of the basic aspects of scheduling theory. Section 3 will provide an overview of how scheduling of the CPU is commonly performed on computer systems. Section 4 will provide an idea of what is required in order to implement a real-time system. Finally, Section 5 will provide some of the more important problems and developments with fitting device drivers into a real-time system. 2 Scheduling Theory Scheduling theory provides techniques to abstractly verify that real-world activities will complete within their associated timing constraints. That is, scheduling theory provides the ability to guarantee that a given abstract workload will be scheduled on a given abstract set of processing resources by a given scheduling algorithm in a way that will satisfy a given set of timing constraints. There exists a substantial amount of theoretical research on analyzing real-time systems, much of which traces back to a seminal paper published by Liu and Layland in 1973 [41]. In this section we will review a small portion of this theoretical research. 2.1 Workload Models One aspect of a system that must be modeled is the work to be completed. An example is some calculation to be performed based on sensor inputs. If one were to think in terms of a gasoline engine in an automobile, 1 a calculation may be used to determine the amount of fuel to inject into a cylinder. This calculation would use the sensor readings such as air temperature, altitude, throttle position, and others as inputs. Given the inputs, a sequence of processor instructions would be used to determine the output of the calculation. Then, the appropriate signal(s) would be sent to the fuel injection mechanism. Execution of these processor instructions are the work for the processor resource. The term typically used for one instance of processor work (e.g. one calculation) is a job. Typically, to provide an ongoing system function such as fuel injection, a job must be performed over and over again. We can think of performing some system functionality as a potentially endless sequence of jobs. This sequence of jobs performing a particular system function is known as a task Figure 1. Gantt chart representing execution of work over time. One way to visualize the work being performed on a given resource over time is through a Gantt chart. In Fig- 1 The automobile engine used throughout this section is imaginary and used only as an illustrative analogy. The actual design of an automobile engine is at best more complicated than this and most likely much different. 3

4 ure 1, each shaded block represents a job using a given resource for 5 units of time. So, one job executes over the time interval between 0 and 5, another executes over the time interval between 25 and 30, etc. The amount of work performed by a given job will be referred to as the job s execution time. This is the amount of time the given job uses a resource. Note that all jobs of a given task may not have the same execution time. For instance, different code paths may be taken for a different sensor input values. One input may require fewer processor instructions, while another may require more, thereby varying the execution time from job to job. Each job has a release time, the earliest time when the job is allowed to begin execution. This release time may depend on data being available from sensors, another job being completed, or other reasons. This is not necessarily the point in time when the job begins execution, since it may be delayed if another job is already using a required resource and the newly released job cannot acquire the resource immediately. Another term used in the abstract workload model is deadline, which is the point in time when a job s work must be completed. In an automobile engine, a deadline may be set so that the fuel must be injected before the piston reaches a certain position in its cylinder. At that position a spark will be produced and if the fuel is not present, no combustion will take place. This point in time when the job must be completed may be represented in several different ways. One is known as a relative deadline, which is specified as some number of time units from the release of the job. The other is an absolute deadline, which is in relation to time zero of the time keeping clock. For example, a job with a release time of t = 12 and relative deadline of 5 would have an absolute deadline of 17. One commonly used abstract workload model used to describe a recurring arrival of work is the periodic task model. In the periodic task model, the next job is separated from the previous job by a constant amount of time. An example for the use of the periodic task model can be thought of in terms of an engine running at a constant number of revolutions per minute (RPM). The calculation for the amount of fuel to inject must be performed some constant amount of time from the previous calculation. While this model does not work if the RPMs of the engine change (this will be taken into account later), this does characterize many applications that are used to monitor and respond to events. To express a periodic task denoted as τ i, where i identifies a unique task on a system that contains multiple tasks. A task has a period, T i defining the inter-arrival time of jobs of the task. To represent the k th job of task i, the notation, j i,k is used. Therefore, a task τ i can be represented as a sequence of jobs j i,0, j i,1,..., j i,k, j i,k+1,... where the time between the arrival or release of any job j i,k+1 is T i time units from job j i,k. A task also has an execution or computation time, C i, that is the maximum execution time of all the jobs for a given task. This is referred to as the task s worst-case execution time (WCET). Each job also has a relative deadline from its release time. This relative deadline is the same for every job of a given task and is denoted as D i. At this point we can describe a periodic workload as τ i comprised of three parameters: C i, T i, and D i. ji,k r i,k completes d i,k r i,k+1 D i C i Τ i Figure 2. Representation of sporadic task τ i. The periodic task model has the constraint that the interarrival times between tasks must be equal to the period (T i ). A task that treats the period only as a lower bound between inter-arrival times is known as a sporadic task. So, if we have a sporadic task τ i and we denote the release time of job k of task τ i as r i,k, then r i,k+1 r i,k T i. Figure 2 illustrates the sporadic task model. In this figure, d i,k represents the deadline of job j i,k. With this relaxed model, an engine that runs at a varying RPMs can now be represented. The period using the sporadic task model would be the minimum time between the execution of the fuel amount calculation, which would happen at the maximum RPM the engine could possibly run. This time between arrivals at the maximum RPM would give us the period for our sporadic task. 2.2 Scheduling In order to allow multiple tasks on a system at one time, we must determine how to control access to the resource(s) to resolve contention. In the case where the processor is the resource, more than one job may want to use a single processor to perform computations. Similarly, one job may be using a resource when another job arrives and therefore contends for use of the processor. A scheduling algorithm is used to determine which job(s) can use what resource(s) at any given time Static Schedulers One desirable characteristic of a scheduling algorithm is the ability to adapt as the arrival pattern changes. This charac- 4

5 j 1 j 1 j 2 j 2 j 2 arrivals j 1 time j 2 arrivals j 1 time (a) Example of preemptive scheduling. (b) Example of non-preemptive scheduling. Figure 3. Comparison of preemptive and non-preemptive scheduling. teristic distinguishes between static (non-adaptive) versus dynamic (adaptive) schedulers. Since dynamic schedulers are the common case, we will only briefly describe static schedulers. Static scheduling algorithms precompute when each job will execute. Therefore, applying these algorithms requires knowledge of all future jobs properties, including release times, execution times, and deadlines. The static scheduler will then compute the schedule to be used at runtime. During runtime, the exact schedule is known. So, once one job completes or a given point in time is reached the next job will begin execution. One type of static scheduler is known as the cyclic executive, in which a sequence of jobs is executed one after the other in a recurring fashion. The jobs are not preemptible and are run to completion. A cyclic executive is typically implemented as an infinite loop that executes a set of jobs [42]. The cyclic executive model is simple to implement and validate, concurrency control is not necessary, and dependency constraints are taken into account by the scheduler. However, this model does have the drawback of being very inflexible [10, 43]. For instance, if additional work is added to the loop, it is likely that the time boundaries for portions of the original work will be different. These changes will require additional, extensive testing and verification to ensure the original timing requirements are still guaranteed. Ideally one would like the system to automatically adapt to changes in the workloads Priority Scheduling A priority scheduler uses numeric priority values as the primary attribute to order access to a resource. In most priority scheduling policies, priority values are assigned at the job level. When multiple jobs contend to use a given resource (e.g. processor) this contention is resolved by allocating the resource to the job with the highest priority. It is generally desired to provide the resource to the highest- priority job immediately. However, this is not always possible. This characteristic of being able to provide a resource immediately to a job is known as preemption. For instance, consider Figure 3a. Here we have two jobs, j 1 and j 2 sharing a resource. The subscripts indicate the job s priority and the lower the number indicates a higher priority. So, j 1 has higher priority than j 2. On the arrival of j 1, j 2 is stopped and execution of j 1 is started. This interruption of one job to start another job is known as a preemption. In this case, the higher priority job is able to preempt the lower priority job. If, j 1 is unable to preempt j 2 when it arrives as in Figure 3b, then j 2 is said to be non-preemptible. A non-preemptible job or resource means that once a job begins executing with the resource it will run to completion without interruption. Preemption may not always be desired if an operation (e.g., section of code) is required to be mutually exclusive. To inhibit preemption some form of locking mechanism is typically used (e.g., monitors, mutexes, semaphores). However, preventing preemption can result in a violation of priority scheduling assumptions known as priority inversion. Priority inversion is a condition where a lower-priority task is executing, but at the same time a higher-priority task is not suspended but is also not executing. Using an example from [39], consider three tasks τ 1, τ 2, and τ 3, where the subscript indicates the priority of the task. The larger the numeric priority, the higher the task s priority. Further, consider a monitor M by which τ 1 and τ 3 use for communication. Suppose τ 1 enters M and before τ 1 leaves M, τ 2 preempts τ 1. While τ 2 is executing, τ 3 preempts τ 2 and τ 3 attempts to enter M, but is forced to wait (τ 1 is currently in M) and therefore is suspended. The next highest-priority task will be chosen to execute which is τ 2. Now, τ 2 will execute, effectively preventing τ 3 from executing, resulting in priority inversion Fixed-Priority Scheduling Fixed-task-priority scheduling assigns priority values to tasks and all jobs of 5

6 τ 1 τ 1 τ (a) fixed priority (RM) scheduling time a given task are assigned the priority of their corresponding task. The assignment of priorities to tasks can be performed using a number of different policies. One widely known policy for assigning priorities for periodic tasks is what Liu and Layland termed rate-monotonic (RM) scheduling [41]. Using this scheduling policy, the shorter the task s period the higher the task s priority. One assumption of this policy is that the task s period is equal to its deadline. In order to generalize for tasks where the deadline may be less than the period, Audsley, et. al [3] introduced the deadlinemonotonic scheduling policy. Rather than assigning priorities related to the period of the task, this approach schedules priorities according to the deadline of the task. Similar to RM scheduling, deadline-monotonic assigns a priority that is inversely proportional to the length of a task s deadline Dynamic-Priority Scheduling As in fixedtask-priority scheduling the priority of a job does not change, however, with dynamic-priority scheduling jobs of a given task may have different priority values. One of the best known dynamic-priority scheduling algorithms is known as earliest deadline first (EDF), in which, the highest priority job is the job that has the earliest deadline. To illustrate the dynamic-priority vs fixed-priority scheduling, consider Figure 4. τ 1 and τ 2 are periodic tasks assigned priority values using either EDF (dynamic) or RM (fixed) priorities. τ 1 has an execution time of 2 and period/deadline of 5. τ 2 has an execution time of 4 and period/deadline of 8. So, at time 0 the job of τ 1 has a higher priority than the job of τ 2 in both EDF and RM. At time 5, for RM, τ 1 s job still has higher priority, however, for EDF, τ 2 s job now has a higher priority than τ 1 s, hence the priority assignment for jobs of a single task can change dynamically. 2.3 Schedulability Tests The tests used to determine whether timing constraints of given abstract workload models, scheduled on a given set of abstract resources using a given scheduling algorithm can be guaranteed are termed schedulability tests, or schedulability analyses. Figure 4. Fixed vs. dynamic priority scheduling. τ (b) dynamic priority (EDF) scheduling time In a real-time system one expects to guarantee the timing constraints for a given set of tasks are always met. In order to guarantee these timing constraints the work to be performed on the system, the resources available to perform this work, and the schedule of access to the resources all must considered. One possible conclusion from a schedulability analysis is that the set of tasks is schedulable, meaning that every job will complete by its associated deadline. In this case, the schedule produced is said to be feasible. Another possible conclusion from the schedulability test is that the schedule is not feasible, meaning that it is possible that at least one job will not meet its deadline. necessary and sufficient (guranteed unschedulable/schedulable) unschedulable task sets necessary only (guaranteed unschedulable) schedulable task sets sufficient only (guaranteed schedulable) Figure 5. Guarantees made by various schedulability tests. A schedulability test will typically report either a positive result, indicating that the task set is guaranteed to be schedulable, or a negative result, indicating the one or more jobs of the given task set may miss their deadlines. However, depending on the given schedulability test, the result may not be definite in either the positive or negative result. The terms sufficient-only, necessary-only, and sufficientand-necessary are commonly used to distinguish between the different types of tests as described below and illustrated in Figure 5. A schedulability test where a positive result means the task set is guaranteed to be schedulable, but a negative result means that there is still a possibility that the task set is schedulable is termed a sufficient-only test. Similarly, a test where the negative result means that the task set is certainly unschedulable, but the positive result means there is still a possibility that the task set is unschedulable is a necessary-only test. Ideally one would always strive 6

7 for tests that are necessary-and-sufficient, or exact, where a positive result means that all jobs are guaranteed to meet their deadlines and a negative result means that there is at least one scenario where a job may miss its deadline. Liu and Layland published one of the first works on fixed-priority scheduling [41]. In their work, the critical instant theorem was formulated. The critical instant is the worst-case scenario for a given periodic task, which Liu and Layland showed occur when the task is released with all tasks that have an equal or higher priority. This creates the most difficult scenario for the task to meet its deadline because the task will experience the largest amount of interference, thereby maximizing the job s response time. Liu and Layland used the critical instant to develop the Critical Zone Theorem which states that for a given set of independent periodic tasks, if τ i is released with all higher priority tasks and meets its first deadline, then τ i will meet all future deadlines, regardless of varying the task release times [41]. Using this theorem a necessary-and-sufficient test is developed by simulating all tasks at their critical instant to determine if they will meet their first deadline. If all tasks meet their first deadline, then the schedule is feasible. A naive implementation of this approach must consider all deadline and release points between the critical instant and the deadline of the lowest priority task. Therefore, for each task τ i, one must consider D n /T i such points, resulting in a complexity of O( n 1 i=0 D n T i ) [67]. While schedulability analyses like the one above are useful for determining whether a particular task set is schedulable, it is sometimes preferable to think of task sets in more general terms. For instance, we may want to think of task parameters in terms of ranges rather than exact values. One approach that is particularly useful is known as maximum schedulable utilization, where the test to determine the schedulability of a task set is based on its total processor utilization. Utilization of a periodic task is the fraction of processor time that the task can demand from a resource. Utilization is calculated by dividing the computation time by the period, U i = Ci T i. The utilization of the task set, or total utilization, is then the sum of utilization of the individual tasks in the set, U sum = n 1 i=0 U i, where n is the number of tasks in the set. Now to determine whether a task set is schedulable, one need only compare the utilization of the task set with that of the maximum schedulable utilization. As long as total utilization of the task set is less than maximum schedulable utilization then the task set is schedulable. The maximum schedulable utilization varies depending on the scheduling policy. Considering a uniprocessor system with preemptive scheduling and tasks assigned priorities according to the RM scheduling policy, the maximum schedulable utilization is n(2 1 n 1), and referred to as the RM utilization bound (U RM ) [41]. As long as U sum n(2 1 n 1) the tasks are guaranteed to always meet their deadlines. This RM utilization bound test is sufficient, but not necessary (failure of the test does not mean the task set is necessary unschedulable). Therefore, a task set satisfying the RM utilization test will always be schedulable, but task sets with higher utilization cannot be ensured schedulability. While the RM utilization bound cannot guarantee any task sets above U RM, one particularly useful class of task sets which can guarantee higher utilizations are those whose task periods are harmonic. These task sets can be guaranteed for utilizations up to 100% [67]. Preemptive EDF is another commonly used scheduling algorithm. The schedulability utilization bound provided for this scheduling policy is 100% on a uniprocessor [41]. This means that as long as the utilization of the task set does not exceed 100% then the task set is guaranteed to be schedulable. In fact, for a uniprocessor, the EDF scheduling algorithm is optimal, in the sense that if any feasible schedule exists than EDF can also produce a feasible schedule. Many other scheduling algorithms and analyses exist to provide guarantees of meeting deadlines. This is especially true for the area of multiprocessor scheduling. However, the basic principles are essentially the same, given a workload model and scheduling algorithm, a schedulability test can determine whether timing constraints of a given system will be met. 3 Basic Scheduling In this section we will cover basic terminology and methods used to allow multiple applications to reside on a single processor system. The major concern is management of resources to allow work to be performed in a flexible yet relatively predictable and analyzable manner. 3.1 Threads A sequence of instructions executing on a processor is referred to as a thread. On a typical system, many threads coexist; however, a processor may be allocated to only one thread at a time 2. Therefore, for many threads to use one processor, the allocation of a processor must be rotated among the available threads. 3.2 Scheduler Granting access to the processor is performed by the OS s scheduler, sometimes called the dispatcher. The scheduler decides which thread at a given time will execute on the CPU. 2 This is only considering a single CPU system with one processing core and no hyper-threading. 7

8 3.2.1 Switching Among Threads To control access to the processor, the scheduler must have a way to start, stop, and resume the execution of threads on a processor. This mechanism is known as a context switch. The thread that is removed from the processor will be known as the outgoing thread and the thread that is being given the processor will be called the incoming thread. The first step in a context switch involves saving all the information or context that will be needed to later resume the outgoing thread. This information must be saved since the incoming thread is likely to overwrite much of the context of the outgoing thread. Next the incoming thread s context will be restored to the original state when it was paused. At this point, processor control will be turned over to the incoming thread Choosing Threads Each time the scheduler is invoked, it decides which thread to run next based on a number of criteria. ready (competing for execution time) running (executing) Figure 6. State diagram of a thread. blocked Thread States One scheduling consideration is whether a thread can use the processor. At any given time, a thread is in one of three states (Figure 6). To explain these states, we will start with a thread that is not running but is ready to execute. At this point, the thread waits in the ready queue for the processor, and the thread is in a ready state and considered runnable. Once the scheduler selects the thread to execute, a context switch will occur. The chosen incoming thread will transition from the ready state to the running state. The thread will then execute on the processor. While the thread is executing, the processor may be taken away from the thread even though the thread has not completed all its work. This means that the thread will transition back to the ready state. In a different scenario, a thread in the running state may request some service, such as, reading from a file, sending a network packet, etc. Some of the requests cannot be fulfilled immediately and the system must wait for a subsystem to complete the request. While the thread is waiting, the processor can be used by other threads. So, if the current thread cannot continue until the service is completed, the thread will transition to the blocked state. Once in the blocked state, the thread will not execute. It is the job of the OS to change the thread from the blocked state to the ready state when the event for which it is blocked occurs Fairness With multiple threads on a system, one reasonable policy is to expect each thread to make similar progress. The scheduler may attempt to provide fairness among the ready threads by choosing the one that received the least amount of execution time in the recent past Priorities Providing fairness between all threads is not appropriate when one thread is more important or urgent than others. So, priorities are generally utilized in real-time scheduling policies. Under simple priority scheduling the highest priority thread will occupy the processor for as long as it desires. This means that one thread can lock up the system, causing the system to be unresponsive to other lower priority, ready threads. Therefore, threads scheduled with priorities must be programmed with caution. When two threads have the same priority, the scheduler can choose a thread based on which thread arrived first. Under fifo scheduling, earlier arriving threads have higher priority. Alternatively, with the use of round-robin scheduling, each thread at a given priority will be allotted a specific amount of time, known as a time slice. All threads at a given priority level will receive one time slice before any thread of that level receives additional time slices Regaining Control The scheduler is the component that decides which threads will be allocated the CPU. However, a question may arise as to how the scheduler gets scheduled to obtain the CPU Voluntary Yield As mentioned earlier, a thread may call the OS and request services. These calls, among other things, allow a thread to become blocked and yield the processor to other threads. When the current thread becomes blocked, the scheduler code will execute and choose another thread to use the processor Forced Yield If a thread does not voluntarily yield the processor, we need to rely on other mechanisms for the scheduler to regain control of the processor. 8

9 The typical way is through the use of interrupts. Interrupts are used to communicate between devices and the processor. The interrupts signal the processor that some event has taken place. When an interrupt is raised by some device, the processor that handles the interrupt transfers execution to the corresponding interrupt handler, or interrupt service routine (ISR). An ISR can be thought of as similar to another thread on the system. thread A thread B timer interrupt 4.1 Time Accounting The validity of schedulability analysis techniques depends on there being an accurate mapping of usage of the processor to the given workload in the theoretical model. We will refer to this mapping as time accounting. During execution of the system, all execution time must stay within the bounds of the model. For example in the periodic task model, if some time is used on the processor, this time should correspond to some given task. Further, this time should not exceed the task s WCET. The proper accounting of all the time on a system is difficult. This section will cover some of the more common problems that hinder a system from performing proper time accounting Variabilities in WCET Figure 7. Periodic timer interrupt. time To control the processor at some time in the future, the OS can program a timer interrupt, which is sent by a hardware component on the system. The timers are typically capable of at least two modes of operation. The legacy mechanism is periodic mode, where the timer will send interrupts repetitively at a specified interval. Figure 7 shows an example of periodic mode. The periodic timer interrupt allows threads A and B to share the processor through the intervention of the scheduler. The other timer interrupt mode is sometimes referred to as one-shot mode. In one-shot mode the timer is set to arrive some OS-specified time in the future. Once the timer expires and the interrupt is sent, the OS must reset the timer in order for another timer interrupt to be produced. 4 Basic Requirements for RT Implementation Using appropriate real-time scheduling analysis one can provide a guarantee that timing constraints of a set of conceptual activities (e.g. tasks) will be met. This assumptions that the analysis relies upon, must also be true from an implemented system. If the assumptions do not hold true for an implemented system, the guarantees made by the scheduling analysis may no longer be valid. Whether a system can support a given theoretical model relies on the system s ability to perform accurate accounting of time, control the behavior of tasks, and to properly communicate timing parameters as detailed in this section. The task abstraction requires that one know the WCET of each task. To determine the WCET of a task, one approach would be to enumerate all possible code paths that a task may take and use the time associated with the longest execution time path as the WCET. In simple system such as that of a cyclic executive, this approach may work, but using a GPOS, this WCET would unlikely reflect the true WCET since tasks on such systems could have additional complexities such as context switching, caching, blocking due to I/O operations, and so on. We will go over some common cases that cause variabilities in a task s WCET Context Switching Context switch overhead is typically small compared to the intervals of time a thread executes on a processor. However, if context switches occur often enough, this overhead becomes significant and must be accounted for in the analysis. Consider a job-level fixed-priority system where jobs cannot self suspend. If the time to perform a context switch is denoted as CS, then one needs to add 2CS to the WCET of each job of a task [42]. The reason is that each job can preempt at most one other job, and each job can incur at most two context switches: one when starting and one at its completion. Similar reasoning can be used to allow for self-suspending jobs where each self suspension adds two additional context switches. Therefore, if S i is the maximum number of self-suspensions per job for task i, then the WCET should be increased by 2(S i + 1)CS [42]. To include context switches in the analysis, one must also determine the time to perform a context switch. Ousterhout s empirical method [54] measures two processes communicating through a pipe. A process will create a pipe and fork off a child. Then the child and parent will switch between one and other each repeatedly performing a read and a write on the created pipe. Doing this some number 9

10 of times provides an estimate on the cost of performing a context switch. Ousterhout s method not only includes the cost of a context switch but also the cost of a read and a write system call on the pipe which in itself can contribute a significant amount of time. To factor out this time, McVoy and Staelin [48], measured the time of a single process performing the same number of write and read sequences on the pipe as performed by both processes previously. This measured time of only the system calls are subtracted from the time measured via Ousterhout s method, thereby leaving only the cost of the context switches. This method is implemented in the benchmarking tool, lmbench [49] Input and Output Operations Performing input and output operations during the time critical path of a real-time activity can create large variations in its service time. For example, accessing hard drives can last anywhere from few hundred microseconds to more than one second. Determining the blocking time for accessing the device is not only difficult, but can increase the worst-case completion time (WCCT) of a task to such a point that the system becomes unusable. Further, the analysis of combining I/O scheduling and processor scheduling becomes extremely complex and starts to reach the limits of real-time scheduling theory [5]. Since large timing variances cannot typically be tolerated for a real-time activity, it is common to ensure that these I/O operations do not occur in the time critical path. One way is to perform I/O in a separate server thread. This allows the actions that deal with the I/O devices to be scheduled with little interference on the real-time activities. Another approach is to perform I/Os as asynchronous operations, allowing the real-time threads to continue without blocking while the submitted I/O operations are performed. One must also be aware of indirect causes of I/O operations. For example, the use of virtual memory allows a system to use more than the physical RAM on the system by storing or swapping out currently unused portions of memory on secondary storage (e.g. hard disk). However, this can cause large increases in the WCCT of a real-time activity by delaying access to data stored in the memory. If this WCCT is exceeded, the timing guarantees of the scheduling theory will be invalidated. Fortunately, many GPOSs realize the consequences of these swapping effects on time- critical activities and therefore provide APIs that prevent memory from being relocated to secondary storage (e.g., POSIX s mlock set of APIs [29]). Even when memory pages are not swapped to secondary storage, virtual memory address translation still takes some amount of time. This concern is addressed by Bennett and Audsley [8] by providing time bounds for using virtual addressing. While it is not common for real-time systems to allow swapping, Puaut and Hardy [57] have provided support to permit the use of swapping real-time pages. At compile time, they select page-in and page-out points that provide bounded delays for memory access. The drawback is that hardware and software support is required in order to provide the implementation, which may not be available Caching and Other Processor Optimizations The number of instructions, the speed to execute these instructions, caching, processor optimizations, etc. can introduce extremely large variabilities in the time to execute a piece of code. As processors become increasingly complicated, the difficulty in determining accurate WCETs also becomes more complicated. Many difficulties arise from instruction-level parallelism in the processor pipeline, caching, branch prediction, etc. These developments make it difficult to discover what rare sequence of events induces the WCET. Given code for an application, there are generally three methods used to determine the WCET [53, 79]: compiler techniques [2, 26], simulators [52], and measurement techniques [40]. These methods can be effectively used together to take advantage of the strengths of each. For example, RapiTime [44, 45], a commercially available WCET analysis tool, combines measurement and static analysis. Static analysis is used to determine the overall structure of the code, and measurement techniques are used to establish the WCETs for sections of code on an actual processor Memory Theoretical analysis generally relies on the WCET of one task not being affected by another task. In practice, this affect on WCET is typically not true due to contention for memory bus access. With a uniprocessor the caching effects between one application and another may affect the execution time when context switching; however, this is typically taken into account in the WCET. Now as the trend is toward more processors per system, not only is caching an issue, but also the contention for access to the memory bus. What processes are concurrently accessing which regions of memory can greatly affect the time to complete an activity. When one process accesses a region of memory this can effectively lock out another process, forcing that process to idle its processor until the particular region of memory becomes available. Further, processes are not the only entities competing for memory accesses, peripheral devices also access memory, increasing the memory interference and making WCETs even more uncertain [56, 66]. 10

11 4.1.2 System Workloads When implementing tasks on top of a GPOS, system workloads may be created in order to support applications. These workloads contribute to the proper operation of the system, but do not directly correspond to work being performed. Further, since they may not be the result of any particular task, they do not fit into any of the task s execution times and can be easily overlooked. The problem is that the processor time used by the system competes with the time used by the tasks. Without properly accounting for this time in the abstract model, these system workloads can steal execution time from other activities on the system, thereby causing missed deadlines Scheduling Overhead The scheduler determines the mapping of tasks to processors. In order to perform this task, it uses processor time. In a GPOS, the change of task assignments to CPUs occurs when the scheduler is invoked from an interrupt or when a task selfsuspends/blocks. The timer hardware provides interrupts to perform time slicing between tasks as well as other timed events. Katcher et. al [37] describe two types of scheduling interrupts, timer-driven and event-driven. Tick scheduling [11] occurs when the timer periodically sends an interrupt to the processor. The interrupt handler then invokes the scheduler. From here, the scheduler will update the run queue by determining which tasks are available to execute. Any task that has release times at or before the current time will be put in the run queue and able to compete for the CPU at its given priority level. Performing these scheduling functions consumes CPU time which should be considered in the schedulability analysis. Overlooking system code called from a timer can be detrimental to the schedulability of a system because timer handlers can preempt any thread, regardless of the thread s priority or deadline. 4.2 Temporal Control Temporal control ensures that the enforcement mechanisms in the implementation correctly adhere to the realtime models used in the analysis. For the processor, this includes the system s ability to allocate the processor to a given activity in a timely manner. For example, when a job with a higher priority than that of the one currently executing on the processor is released (arrives), the preemptive scheduling model says the system should provide the processor to the higher priority job immediately. However, in practice this is not always possible Scheduling Jitter Scheduling points are events where the scheduler evaluates what tasks should be assigned to which CPU. In an ideal scheduling algorithm scheduling actions take place at the exact points in time when some state in the system changes causing the mapping of threads to CPUs to change. In a GPOS, the scheduling points are the points in time when the CPU scheduler is invoked, such as when a task completes one of its jobs and therefore self-suspends, or the system receives an interrupt. The difference between the ideal scheduling points in an abstract scheduling algorithm and that of the CPU scheduler is commonly called scheduling jitter. If a job is set to arrive or become runnable at time τ 1, but is not recognized by the system until time τ 2, the scheduling jitter is τ 2 τ 1. Minimizing scheduling jitter is important in real-time systems. Generally, the smaller the scheduling jitter, the better the theoretical results can be trusted to hold on the implemented system Nonpreemptible Sections Another common problem in real-world systems is that of nonpreemptible sections. A nonpreemptible section is a fragment of code that must complete execution before the processor may be given to another thread. Clearly, a long enough nonpreemptible section can cause a real-time task to miss its execution time window. While accounting for nonpreemptible sections in the schedulability analysis is necessary for guaranteeing timing guarantees, it is generally preferable to design such that nonpreemptible sections are avoided as much as possible. The reason is that nonpreemptible sections increase the amount of interference a given task may encounter, potentially making the system unschedulable Non-unified Priority Spaces When a device wishes to inform the CPU of some event, the device will interrupt the CPU, causing the execution of an interrupt handler. The interrupt handler is executed immediately without consulting the system s scheduler, creating two separate priority spaces: the hardware interrupt priority space and the OS scheduler s priority space, of which, the hardware interrupt scheduler always has the higher priority. Therefore, any interrupt handler, regardless of priority, may preempt an OS schedulable thread. The fact that all interrupts have higher priority than all OS schedulable threads, must be modeled as such in the theoretical analysis. The more code that runs at interrupt priority the greater the amount of interference an OS schedulable thread may experience, potentially causing OS threads to become unschedulable. 11

12 4.2.4 Temporal Isolation Exact WCETs can be extremely difficult to determine in many cases, therefore, only estimated WCETs may be specified. If a given task overruns their allotted time budget due to their exact WCET being longer than the specified WCET, one or more other tasks may also miss their deadlines. Rather than all tasks missing their deadlines, it is generally preferable to isolate the failure of one task from other tasks on the system. This property is known as temporal isolation. 4.3 Conveyance of Task/Scheduling Policy Semantics For an implemented system to adhere to a given theoretical model, one must be able to convey the characteristics of this model to the implemented system. To perform this in a GPOS, it is common to provide a set of system interfaces that inform the system of a task s parameters. For example, consider the periodic task model scheduled with fixed-priority preemptive scheduling. Each task is released periodically and competes at its specified priority level until its activity is completed. If a periodic task abstraction were available directly in a given OS, then the theoretical model could easily be implemented. However, in GPOSs such interfaces typically do not exist. However, many systems adhere to the POSIX operating systems standards, which support real-time primitives to allow for implementation of a periodic task model scheduled using fixed-priority preemptive scheduling. These interfaces include functions for setting a fixed priority to a thread and allowing a thread to self-suspend when a job is completed, which map from the task model to the implementation. These types of interfaces are critical for applications to convey their intentions and constraints to the system. They inform the OS of the abstract model parameters in order for the OS scheduler to make decisions that match the ideal scheduler. Lacking this information, the OS may make improper decisions, resulting in tasks missing their deadlines. 5 Device Drivers Devices are used to provide system services such as sending and receiving network packets, managing storage devices, displaying video, etc. The number of these services is small compared to the variety of hardware components, which are produced by a multitude of vendors, each with many distinct operating characteristics. For instance, sound cards provide a means to produce audio signals. The user will typically provide the audio signal in a digital format to the sound card and the sound card will output an analog audio signal that can be converted to sound waves through a speaker. However, to produce sound from an application, interaction between the system and the sound card must occur. Due to the many different features, components, and designs of the different cards, specifics (e.g., timings, buffers, commands) by which communication with these cards occurs is typically different depending on the manufacturer or even model. To ease the use of devices such as sound cards, OSs abstract much of the hardware component complexity into software components known as device drivers, which are typically provided by the device manufacturer. Therefore, instead of having to know the particulars of a given device, the application or OS can communicate generically with the device driver and the device driver, having knowledge of the device specifics, can communicate with the actual device. Using device drivers in a real-time system complicates the process of guaranteeing deadlines. These devices share many of the same resources used by the real-time tasks and can cause interference when contending for these resources. Further, many device drivers are used in the critical path of meeting deadlines. Therefore, device driver activity must be included in the schedulability analysis. The difficulty is that the device driver workloads generally do not conform to well understood real-time workload models. 5.1 CPU Time The CPU usage of device drivers tends to be different from other real-time application tasks, and therefore fitting the usage of CPU time into known, analyzable real-time workload models can be awkward. Trying to force usage into these models tends either be invalid due to lack of OS control over scheduling, inefficient due to the limited number of implementable scheduling algorithms, or impractical due to large WCETs being used for the analysis even though average case execution times may be much smaller. Further, many of the scheduling mechanisms created for user-space applications do not extend to the device drivers. That is, the explicit real-time constructs such as pre-emptive priority-driven resource scheduling, real-time synchronization mechanisms, etc. are not typically available to or used by device drivers. This section will enumerate some of the temporal effects associated with device drivers and show why these can hinder the proper functioning of a real-time system. We will see how using I/O devices in a system increases the time accounting errors, reduces the amount of control over system resources, and leads to incompatibility with existing workload models Execution Device drivers consume system resources and, therefore, compete with other activities on the system, including real- 12

13 time tasks. The contended-for system resources include CPU time, memory, and other core components of the system. For example, consider a network device driver. The end user expects a reliable, in-order network communication channel. The sending and receiving of basic data packets is handled by the card. However, execution on the processor is required to process the packets, which includes communicating with the network card, handling packet headers, and dealing with lost packets. Since device driver CPU usage competes with real-time tasks, the CPU time consumed must be considered in the schedulability analysis. The CPU usage due to device drivers may seem negligible for relatively slow devices such as hard disks. The speed differences between the processor and the hard disk should mean that only small slivers of time will be taken from the system. Unfortunately, the competition from some other device drivers for CPU time significantly impacts the timeliness of other activities on the system. The device driver overhead can be especially large for high bandwidth devices such as network cards. According to [40], the CPU usage for a Gigabit network device driver can be as high as 70%, which is large enough to interfere with a real-time task receiving enough CPU time before its deadline. The problem of device drivers interfering with real-time tasks is not likely to diminish over time. Devices are becoming faster and utilizing more system resources. One example is the replacement of solid-state storage for hard disk drives. The solid-state devices are much faster and can create significantly more CPU interference for other activities on the system. To better understand the problems with device drivers in the context of real-time scheduling, we will first look at the manner in which these components consume CPU time and how this can affect the ability of a system to meet timing constraints. Stewart [75] lists improper accounting for the use of interrupt handlers as a common pitfall when developing embedded real-time software. Interrupt handlers allow device drivers to obtain CPU time regardless of the OS s scheduling policy. While scheduling of application threads is carried out using the OS scheduler, the scheduling of interrupt handlers is accomplished through interrupt controllers typically implemented in hardware. Interrupts effectively create a hierarchy of schedulers, or two priority spaces, where all interrupts have a priority higher than other OS schedulable threads on the system. Interrupts prevent other activities from running on the system until they have completed. While an interrupt is being handled, other interrupts are commonly disabled. This produces a blocking effect for other activities that may arrive on the system. Until interrupts are re-enabled, no other threads can preempt the currently executing interrupt handler. Therefore, if a high-priority job arrives while interrupts are disabled, this job will have to wait until the interrupt completes, effectively reducing the time window the has to complete its activities. Since device drivers typically use interrupts, some, if not all, of the device driver processor time is out of the control of the OS scheduler. [62] pointed out that device drivers can in effect steal processor time from real-time tasks. This time stolen by device drivers can cause real-time tasks to miss their deadlines. In order to illustrate and quantify this stolen time, Regehr [60] describes how an application-level thread can monitor its own execution time without special OS support, in the implementation of a benchmark application program called Hourglass. In Hourglass, a synthetic real-time thread, which we call an hourglass thread, monitors the amount of processor time it consumes over a given time interval. The thread needs to measure the amount of processor time it receives, without the help of any OS internal instrumentation. This is difficult because processor allocation is typically broken up due to time slicing, interrupt processing, awakened threads, etc., and the endpoints of these intervals of execution are not directly visible to a thread. An hourglass thread infers the times of its transitions between executing and not executing, by reading the clock in a tight loop. If the time between two successive clock values is small, no preemption occurred. However, if the difference is large, then the thread was likely preempted. Using this technique to determine preemption points, an hourglass thread can find the start and stop times of each execution interval, and calculate the amount of processor time it receives in that interval. Knowing the amount of execution time allows hourglass threads to emulate various real-time workload models. For example, periodic workloads can be emulated by having the hourglass threads alternate between states of contention for the processor and self-suspension. More specifically, a periodic hourglass thread contends for the processor until it receives its nominal WCET, and then suspends itself until the beginning of its next period. The thread can also observe whether its deadlines are met or missed. Given that interrupts interfere with real-time applications, interrupt service time must be included in the analysis of schedulability. [65] considered the problem of including interrupt executions whose arrival times are not known in advance with other tasks scheduled by a static schedule constructed offline. The naive approach pointed out by [65] is to include the interrupt WCET into the execution times of all tasks on the system. However, using this mechanism is typically pessimistic and can reduce the ability to prove that a system is schedulable. Instead of adding the WCET to each task, [65] considers only adding the WCET to a task chain, which is a number of tasks that are always executed sequentially. The WCET of the interrupts is considered as 13

14 that of a higher-priority task which is considered to arrive at the start time of the chain. The point where the task chain is released is a critical instant and schedulability can then be calculated for all tasks in the chain. Another way to include device driver CPU time in schedulability analysis is to consider interrupt execution as a task. To do this in a fixed-priority system, one could model the interrupt as a sporadic task, with the execution time being the interrupt handler s WCET and the period being the smallest time between the arrivals of two subsequent interrupts. The priority of this interrupt task would need to be modeled as higher than any other task on the system, due to the nature of the built-in hardware scheduling mechanisms of interrupts. However, modeling the interrupts as a task with highest priority in the system may not be consistent with all scheduling algorithms. For instance, in an EDF scheduled system the work performed by the handler for the interrupt may have a logical deadline further in the future than other jobs. Therefore, according to the EDF scheduling policy, the interrupt should logically have a lower priority than jobs with earlier deadlines, but in fact will have a higher priority, violating the rules of the scheduling policy. Further, executing interrupts at a real-time priority may not be required. If the interrupts are not needed by any realtime task on the system, it may make sense, if possible, to schedule the interrupt execution as the lowest priority on the system, or with other non-realtime tasks. One possibility to gain some control over interrupts is through interrupt enabling and disabling. This can be accomplished by disabling interrupts whenever a logically higher-priority task begins execution and re-enabling interrupts when no higher-priority tasks exist. This provides one possibility for interrupt priorities to be interleaved with the real-time task priorities. However some interrupt handlers have hard real-time requirements of their own, which demand a high priority. For example, some devices require service from the CPU in a bounded amount of time and without acknowledgment from the CPU, the device may enter an unstable state or events may be lost. Other effects such as idling the device may occur. This can greatly reduce the utilization of certain devices. Consider the hard disk device. Once a request to the hard disk is completed, new requests, if any, should be presented to the hard disk to prevent idling. Idling a hard disk is normally unacceptable due to the relatively long service times. If the hard disk is unable to query the processor about another request via interrupts, the disk may become idle, wasting time that could be used to service requests. The OS scheduler is not always able to provide controlled access to shared resources (e.g. data structures) used inside interrupt handlers, therefore other mechanisms are needed to ensure proper access of these shared resources. One common protection mechanism is to disable the interrupt that may violate the access restrictions to shared resources, thereby preventing the interrupt handler executing. Disabling a single interrupt rather than all interrupts is known as interrupt masking and is typically accomplished by setting a bit in the corresponding interrupt controller register [30]. This approach, if done correctly, can be used to provide correct mutually exclusive access to shared resources, but introduces the issue of priority inversion due to locking [39, 69]. Interrupt masking introduces CPU overhead, since manipulating the registers of the interrupt controller typically involves off-chip access and cause effects such as pipeline flushes. Given that very few interrupt attempts occur during the periods of masked interrupts, [76] proposed optimistic interrupt protection, which does not mask the interrupts using the hardware. To maintain critical sections, a flag is set that indicates a critical section is being entered, and it is cleared at the end of the critical section. If a hardware interrupt occurs during the critical section, an interrupt handler prologue will note that an interrupt has occurred, save the necessary system state, and update the hardware interrupt mask. The interrupted code will then continue. At the end of the critical section, a check will be performed for any deferred interrupts. If one does exist, the corresponding interrupt routine will then be executed. In addition to maskable interrupts, systems may also contain non-maskable interrupts (NMIs) which must be included in the schedulability analysis. To complicate matters, some NMIs are handled by the BIOS firmware and do not travel through the OS. The most common form of NMIs handled by the BIOS are known as System Management Interrupts (SMI) and can cause added latency to activities on the system [82]. It is important in a real-time system that one be aware of, and account for if necessary, the time taken by SMI activities. Discretion must be used when performing computations inside interrupt handlers. For instance, Jones and Saroiu [34] provided a study of a soft modem. This study shows that performing the signal processing required for the soft modem in interrupt context is unnecessary and can prevent other activities on their system from meeting their deadlines. Therefore, one should minimize the amount of processing time consumed by interrupts and consider other possibilities. Interrupts are not the only way to synchronize a peripheral device and the CPU. Instead, the processor can poll the device to determine whether an event has occurred. Interrupts and polling each have their own merits which are discussed below. Interrupts allow the processor to detect events such as state changes of devices without constantly having to use the processor to poll the device. Further, with interrupt notification the time before detecting an event is generally 14

15 shorter than with polling, since the delay is only due to the signal transmission from the device to the CPU (assuming the interrupt is enabled). While this delay is non-zero, it is very small and generally intrinsic to the hardware design. With polling the elapsed time for noticing an event can be as much as the largest interval between polls. On the other hand, with polling, the processor communicates with the device at a time of the processor s choosing. The querying of the device can occur at a time that does not interfere with higher-priority tasks. The task that queries the device can be under the direct control of the system s scheduler, thereby providing much more flexibility for scheduling device-cpu synchronization activities. Interrupt execution can potentially consume all of the CPU time. This phenomenon, known as interrupt overload, is pointed out by Regehr and Duongsaa [61]. Interrupt overload occurs when interrupts arrive at such a rate that the interrupt processing time used starves other activities on the system (including the OS scheduler). Several situations may cause an unexpectedly high interrupt rate. One is a faulty device continuously sending interrupts, also known as a stuck interrupt. Another is a device that can legitimately send interrupts at high rates. In either case, servicing each interrupt as it arrives can starve other activities on the system. One device with a high maximum arrival rate of interrupts is the network card. When a packet arrives, the system would like to be informed in order to wake up and/or pass packets to threads waiting for information from the network. A low response time for noticing this event is desired because there may be high-priority tasks awaiting the data, and delaying the delivery will increase the tasks response time, possibly causing missed deadlines. As long as interrupts are infrequent, and their handlers take a small amount of time, the impact on the system may be considered negligible. However, the increase in performance of the newer gigabit and higher ethernet cards has the side effect of also increasing the maximum number of interrupt arrivals to a rate that can consume significant portions of CPU time. This means that the danger of interrupt overload is present on systems with these devices. For instance, the interrupt arrival rate of a gigabit ethernet device can be nearly 1.5 million per second [36, 61]. This arrival rate can overwhelm the processor with handling interrupts and leaves little or no time to perform other activities on the system. To address the problem of interrupt overloads, [61] proposed to rate-limit interrupts via intelligently enabling and disabling interrupts. This mechanism will either delay interrupts or shed the interrupt load by dropping excessive interrupts to ensure that thread-level processing can make progress and not be blocked for extended periods of time due to a malfunctioning hardware device. The first approach enforces a minimum interarrival time between subsequent interrupts. The second approach caps the maximum number of interrupts in a given time interval. These solutions only count the number of interrupts arriving rather than calculating the actual amount of processor time the interrupts use, since counting incurs lower overhead. Also, for simple systems, the execution time of any given interrupt handler is nearly constant. On systems where the execution time of interrupt handlers can vary widely or where the execution time depends on the state of the processor, counting interrupts alone may be insufficient. In the context of network cards, [50] provides another solution to throttle interrupts. One key observation is that when the first receive interrupt arrives (signaling a packet has been received), the second receive interrupt is not useful until the first packet has been processed. Therefore, without completion of the first packet, the second interrupt and further interrupts are only informing the system of something it already knows, that is, the device needs attention. Therefore, [50] proposes to switch between interrupt and polling modes dynamically, so, once one interrupt arrives from the network device, interrupts are disabled. The work required to service the network device will then be performed outside the interrupt handler, typically in a schedulable thread. That thread will poll, querying the network device for work to perform once the previous unit of work has been completed. Once the network device has no further work to perform, the interrupts for the network device will be reenabled and the receive thread will suspend itself, performing the transition back to interrupt mode and out of polling mode. To provide some control over hardware interrupts, hardware platforms such as the Motorola 68xxx series and some of the newer x86 microprocessors [31], provide the ability to set priorities for interrupt handlers within the same priority space as the system s schedulable threads. The OS is able to set a priority value for each interrupt. The OS then sets a priority value for the currently running thread. When an interrupt arrives to the interrupt controller, if the interrupt priority value is greater than that of the currently executing thread, then the processor will be interrupted. Otherwise, the interrupt processing must wait until the currently running thread s priority drops to a value below that of the interrupt. Assigning priorities for interrupts does alleviate some of the problems with interrupts. However, this mechanism is not available on all hardware platforms and does require OS support. While servicing a particular interrupt it is common to disable other interrupts, preventing any other task or interrupt handler from executing. To schedule the work of the interrupt handlers more effectively, it would be advantageous to put their execution under the control of the OS s scheduler [38], rather than allowing them to execute in a separate 15

16 priority space. This can be accomplished if the interrupt handlers are made into OS schedulable threads, thereby removing the restriction of interrupts being modeled as the highest-priority threads on the system. While this approach does simplify the scheduling of activities on the system, it typically increases the system s overhead. Further, it is common for the interrupt mechanisms to be highly optimized due to hardware support for switching into and out of interrupt-level processing, so that the dispatching of interrupt handlers can happen much faster than dispatching of a user-space thread. For that reason, [27] proposes to convert all threads into interrupts, thereby allowing the interrupt hardware to handle all the scheduling. This seems to work well on embedded systems with a limited number of threads. However, it is unclear whether this approach would scale to systems with larger number of threads Accounting Another challenge of scheduling device driver CPU time is proper time accounting, or attributing consumed CPU time to a given entity. This entity commonly corresponds to a task in the theoretical models. A thread is typically used to implement a task in a real-time system. This thread consumes CPU time in order to accomplish its given activities. Ideally, only the time the thread logically uses should be charged to the thread. Accurate accounting allows the scheduler to make correct scheduling decisions. If a thread is undercharged, this may result in one thread getting more than its correct share of the CPU resource, potentially causing missed deadlines for other threads. Conversely, if the thread is overcharged, this may result in the overcharged thread not receiving its required CPU time needed to complete necessary activities by a given deadline. Without improper time accounting, device driver time may be charged to the unlucky process that happened to be executing when the device driver begins execution. One proposed solution provided by [62] uses fine-grained time accounting in order to ascertain the amount of stolen time. This time can then be provided to the OS scheduler, allowing it to ensure that the affected threads are compensated and receive their allocated time. Device driver activity occurs in many different contexts. In one context an application invokes the device driver, by requesting service from the OS through the system call interface. For example, sending a network packet will typically involve a system call, which, in turn, will call the device driver that will send the packet. This call to the device driver consumes processor time, but since the call originates from a thread s context this time can be legitimately attributed to the application thread that sent the packet, and modeled for the purposes of schedulability analysis as part of the same task as that thread. So, such user-initiated device driver activity does not require the introduction of any additional tasks to the workload model or special time accounting. However, device driver activity may also occur in contexts that do not have any immediate association with a particular user-level thread. For instance, upon arrival of data at a network interface, the received packet must be processed by the network device driver and network protocol stack before the system can determine the user-level thread that should receive the packet. This device-driver execution is typically triggered by a hardware interrupt. It is not obvious how to map such device-initiated device-driver activity to the tasks of a workload model for the purpose of time accounting and schedulability analysis. Most logically, it ought to be attributed to conceptual sender tasks. Alternatively, it might be modeled by a system I/O server task, or by a mapping to the application-level tasks that are the ultimate recipients of the packets (if the recipient is known). These solutions present various difficulties for schedulability analysis, and none of them matches very well the way that device-initiated device-driver activities are typically scheduled and accounted for within operating systems today. For received network packets, determining the destination process as early as possible can improve the proper allocation of CPU time to all processes on the system, as described in [16]. Determining the destination as early as possible allows CPU and memory resources to be allocated fairly between processes on the system. In order to demultiplex early, it is ideal for this processing to take place on the network interface card, thereby acquiring enough information to ensure that any further processing takes place in the context of the thread waiting on the packet, and is charged to the correct thread. Unfortunately this functionality is not available in many network cards and so CPU time must be consumed to determine the destination thread. [16] proposes to perform demultiplexing of packets in the interrupt handler, allowing the recipient to be determined early and allowing further processing to be charged to the receiving thread. For the purpose of schedulability analysis, it is preferable to schedule the execution of all device driver activities in a way that allows them to be attributed to application threads. Zhang and West [83] sought to associate the processor time consumed by device-initiated device-driver activities with threads that utilize the device-driver services. A logical consequence is that the device-driver processing should take place at the priority of an associated receiver thread, and be charged to that thread. However, in the previous example of receiving a network packet, at the time that device driver activity occurs, it is often not yet clear what thread will be receiving the service. The receiver can only be determined after processor time has been consumed to demultiplex the packet. To solve this problem, [83] ex- 16

17 ecutes the device-driver code at the highest priority of all the threads waiting on its services. Then, once the receiving thread has been determined, the time used by the devicedriver is back-charged to that process. This allows more accurate accounting and control of device-driver execution. However, device-driver execution time is not always related or attributable to any application thread. For instance, device driver execution may be due to system maintenance operations or external events that turn out to not have any appropriate recipient thread. Further, even if an appropriate recipient thread is found, back-charging processor time after a thread has already executed can potentially violates the assumptions of scheduling theory (e.g. a low priority thread preempting a higher priority thread). Lewandowski et al. [40] proposed a way to characterize the worst-case processor workload generated by a network device driver. This characterization of the processor usage provided a workload model that could then be used in schedulability analysis. The device driver workload is characterized by an empirically derived demandbound function. Demand is the amount of processor time a given task can consume over a given sized time interval. To determine the worst-case demand of a device driver, one presents the system with an I/O workload that is believed to cause the device driver to consume the maximum possible amount of processor time. Then, using a technique similar to Hourglass (described above), one measures the observed maximum amount of processor time consumed by the device driver over a time intervals of size over multiple experiments, for various values of. These data points can then be used to approximate a demand bound function. To provide a simple and more intuitive measure of processor usage, the concept of load was introduced. If f( ) is a processor demand bound function the corresponding load is f( )/. In other words, the load of a device driver is the maximum percentage of the processor required to service the device driver over a given interval length. Therefore, determining the schedulability of a task for a uniprocessor system involves ensuring for a equal to the deadline, that the load of all higher-priority tasks plus the task itself does not exceed 100%. This provides a technique to include device-driver workloads into schedulability analysis. The technique addresses the two limitations mentioned for the Zhang and West [83] approach above, but has its own limitations. In particular, it is based on empirical estimation of worst-case processor demand of a device driver, measured under the conditions of a synthetic benchmark application. The problem of characterizing the processor workloads of device-initiated device-driver activities, in a way that supports whole-system schedulability analysis, continues to be a challenge. Many other issues arise when trying to account for resources consumed by device drivers. These include cases where multiple processes are responsible for a single device driver s activity. In these cases it becomes unclear which process should be charged for the device driver CPU time consumption. Further, it may be that device driver processing occurs at a higher priority than is permitted by the theoretical model being used to analyze the system schedulability. In this instance, charging the particular thread for the time may be an accurate reflection of reality, but it may still break the theoretical assumptions, meaning the timeliness guarantees made by the theoretical analysis are no longer valid Control Even with accurate CPU time accounting there may still be issues related to matching an implemented system to a theoretical model. Since device driver activity can be aperiodic and bursty (e.g., packet interrupt arrival times), it can be difficult matching this activity to any known, analyzable workload model. Fortunately, a class of algorithms known as aperiodic servers have been developed for these aperiodic and bursty workloads. Aperiodic servers force the service of unwieldy workloads into common simple workload abstractions. In fixed-priority systems, the periodic (polling) server [68], deferrable server [77], and sporadic server [70] all provide ways to bound and characterize the service demands of aperiodic workloads. Similarly for deadline-based scheduling aperiodic servers exists such as deadline sporadic server [18], constant bandwidth server [1], and total bandwidth server [71]. Using these servers allows one to take execution workload arrival patterns that may have no apparent or viable abstract model and schedule their service in a way that allows the workload to be analyzed as a periodic task. Lewandowski et al. [40] proposed to use the sporadic server scheduling algorithm to schedule device driver activity. Using sporadic server scheduling provides a mechanism to bound the maximum amount of CPU time that a given device driver is able to consume within any given time interval. That is, if one is able to obtain an estimate of the maximum execution time over a given time interval the device driver can consume, the sporadic server can be used to force the device driver activity to never exceed the this estimate. This is especially important for job-level fixedpriority systems since high-priority device driver activity can potentially consume all the available CPU time. Using aperiodic servers to schedule device-driver activity works well on systems that have precise scheduling control. Control here is the degree of precision with which the scheduling mechanisms are able to start and stop threads on the system. For instance, many OSs operate on a tickdriven mechanism where the scheduler is invoked on a periodic basis. Therefore, the scheduler can only be guaranteed 17

18 request a request b request c d a d b d c d a d b d c (a) disk requests to start and stop threads when the tick occurs. This tick preempts the running code allowing the scheduler to reevaluate and decide the subsequent task that should be running on the CPU. When the tick s period is relatively large, the control of the scheduler is very coarse. Imagine a thread released just after the scheduler returns from being invoked by the tick and that this thread has a higher priority than the currently running thread. In this instance, if the currently running thread does not voluntarily yield, then the higher-priority thread will have to wait until the next tick to arrive in order to begin execution on the CPU. This delays the higher-priority task s execution and shortens the time to its deadline. Without the proper control, the effectiveness of aperiodic servers is limited. Further, without precise control it is possible that a given thread can overrun its allocated CPU time, or budget. Several papers have addressed this problem. Ghazalie and Baker [18] present a mechanism whereby this overrun time is charged against future time that the server will receive. This mechanism was also adapted in [73] to handle overruns in the POSIX s version of sporadic server. CPU control aids in forcing the execution patterns of implementations to adhere to the theoretical models. This is important because these theoretical models are what provide the foundation for timeliness guarantees. While it may be difficult for implementations to provide exact compliance with a given theoretical model, it is important to come as close as possible. Minor discrepancies can be accounted for by tweaking the theoretical model (e.g., task parameters). However, if the discrepancies get too large, the amount of tweaking may lead to a significant reduction in the number of tasks that can be assured schedulable because of additional tweaking terms that must be added or will create a system that deviates from the theoretical assumptions and therefore meeting deadlines will not be guaranteed. 5.2 I/O Scheduling (b) EDF scheduling of disk requests time Figure 8. Scheduling disk requests. (c) improved scheduling of disk requests Ensuring that the device driver CPU time consumption is modeled correctly is important, but is not the only consideration in order to provide timely services. Another important aspect of device drivers is that they arbitrate access to a given device and therefore act as I/O schedulers. Therefore, to ensure timely service the scheduling of the device s resource by the device driver must be properly handled Device Characteristics Scheduling I/O devices is different than scheduling CPUs. I/O devices tend to be non-preemptible, have service times that vary substantially between the worst-case and the average case, and have physical characteristics that make predicting I/O service times difficult. These characteristics make it problematic to guarantee timing constraints for tasks that use I/O devices. As an example consider a hard disk scheduled using the EDF scheduling algorithm. The EDF scheduling algorithm works well on a uniprocessor, providing a 100% CPU utilization bound for guaranteeing timing constraints of task sets. However, using EDF scheduling on a hard disk devices tends to provide results far from optimal in terms of both efficiency and meeting timing constraints. Figure 8 illustrates the difficulty of scheduling disk requests with EDF. In this figure, we will assume a very simplified disk where the time to read one track takes one time unit and the time to move to an adjacent track takes one time unit. Requests a, b, and c each read one track and have corresponding deadlines d a, d b, and d c as shown in figure 8b. Scheduling the disk requests according to the EDF scheduling policy results in poor throughput of the disk, as well as, a missed deadline. Scheduling requests in the order of a, c, b, achieves better throughput and no missed deadlines as shown in figure 8c. Based on the poor performance of EDF to schedule disk requests, various modifications to the EDF scheduling algorithm have been proposed. SCAN-EDF [59] is a variant on the EDF scheduling algorithm. This algorithm first services requests in EDF order and services requests with the same deadline according to the a SCAN elevator algorithm. The SCAN algorithm services all requests in one direction (e.g. lowest address to highest address) and then services the remaining requests in the other direction (e.g. highest address to lowest address). Using the SCAN algorithm reduces the amount of seek time required to service a batch of requests. The effectiveness of SCAN-EDF relies on many requests having the same deadline to increase the efficiency of the hard disk time 18

19 service. Without having requests with the same deadline, the throughput provided by SCAN-EDF is poor. To overcome the deficiency that the throughput is dependent on requests having the same deadline, [13] proposes Deadline-Modification-SCAN (DM-SCAN). DM- SCAN first orders requests in EDF order as in SCAN-EDF. However, to maximize the number of requests with the same deadline, DM-SCAN creates groups of tasks that can be serviced in SCAN order without missing any deadlines and assigns modified deadlines to each task in the group so each task has the same deadline. The guarantees for both DM-SCAN and SCAN-EDF rely on being given a task set that is verified to be schedulable using the EDF scheduling policy. The problem is that for a single non-preemptive resource (e.g., hard disk, processor), determining schedulability for tasks tends to be NPhard in general [14, 72]. Jeffay [33] proposed a pseudo-polynomial time algorithm to determine schedulability for EDF scheduled tasks. This test involves two conditions. The first condition ensures the utilization is less than or equal to one. The second condition verifies that the demand for processor time does not exceed any task s interval L, where L begins at the invocation of the task and ends before its first deadline. That is, the second part of the schedulability test essentially simulates the execution of the tasks until a given task reaches meets or misses its first deadline. This simulation is performed for each task. [4] provided two schedulability tests that are O(1) by placing restrictions on the periodic tasks. The first schedulability test requires that the computation times of all tasks are equal and that the periods are integral multiples of the computation time. The second schedulability test requires that all tasks have equal periods, but may have arbitrary computation times. Both tests guarantee such task sets scheduled using EDF scheduling are schedulable as long as the sum of all tasks utilizations is less than or equal to one. Others [17, 28] have attempted to provide better nonpreemptive EDF scheduling by inserting idle time in the schedule. Inserting idle time allows certain jobs to be postponed that could otherwise cause missed deadlines Device Timings Knowing the timing characteristics of devices is critical for providing good schedules, both in terms of schedulability analysis and good utilization of I/O devices. For instance, to provide guaranteed service of I/O requests, schedulability analysis typically requires the worst-case time to service requests be known, which can be very difficult to extract. Devices are also difficult to model due to asynchronous events that are not directly related to I/O requests and proprietary algorithms in firmware. In this section we will look at some common methods to extract and control I/O device service times. Hard disks are one of the more complicated devices, and predicting precise service times can be difficult. Therefore, simplified models have been developed to provide good estimations of service times for hard disk I/O requests. For example, [63] uses the following formula for worst-case service time for one disk request: t seek + n t rotation + m t sector + t ovh + v t skew In the formula, the terms have the following meaning: t seek is the time to move the disk head(s) from one end of the disk (e.g. outermost cylinder) to the other end, that is, the time for a full sweep of a disk head across the disk platter; t rotation is the time for the platter to spin one revolution; n signifies that more than one rotation may be required to stabilize the disk head on a track in order to perform data operations in the worst case; t sector is the time to access one sector, for most modern disks, this time will vary due to zone-bit recording (ZBR) [24], where the outer tracks have more sectors than the inner tracks, for disks with ZBR, the worst-case service time for the inner track should be taken as t sector ; t ovh is the time for disk controller processing and data transfer between the disk controller and the host system, t ovh is assumed to be constant; t skew is the time to switch to a new cylinder and new head, depending on the request size, this may occur more than once, indicated by the parameter v. These parameters will typically vary from hard disk to hard disk and must be determined for each hard disk. While obtaining worst-case service times (WCSTs) for disk I/O requests is important for ensuring timing constraints will be met, it is also important for providing good service times for non-real-time requests. Since these nonreal-time requests do not have explicit timing constraints they are commonly referred to as best-effort requests. To provide service to best-effort requests, the L scheduler [9] uses the WCST for real-time disk I/O requests to determine slack time in the schedule. Slack time is either idle time or the remaining time before the latest time that real-time requests must be serviced so that no deadlines are missed. Whenever the remaining slack time in the schedule is greater than the time to service a best-effort request, the best-effort request is given priority over the next real-time request to service. Reuther and Pohlack [63] developed a hard disk scheduling technique known as Dynamic Active Subset (DAS) where at each time a new request is chosen to be serviced by the disk, a subset of requests is constructed so that no service guarantees will be violated for the subset. Any request can be chosen from this subset without the possibility of violating any deadlines. The choice of the next request to 19

20 execute can then be picked from the DAS based on the request that can be serviced most efficiently. Therefore, to pick the next request from the DAS, the shortest-accesstime-first (SATF) [32] algorithm is used, which orders request in order to lower mean response time and maximize throughput. In order to create the DAS and perform SATF scheduling, a detailed model of the hard disk must be created, which involves gathering detailed timing characteristics of the hard drive that is to be modeled. Gathering detailed timing characteristics for individual hard disks can be very time-consuming and challenging [46, 47, 80]. The difficulty in extracting timing characteristics is that much of the inner workings of a hard disk are proprietary and therefore must be empirically determined. Further, many events such as bad block remapping, thermal recalibration, read errors, etc. are difficult to predict and subject to change over the lifetime of each individual disk, making the utility of fine-grained external scheduling of hard disks questionable. Recognizing the trend of increased complexity and autonomy of modern disk drives, rather than attempting finegrained hard disk scheduling, Wu and Brandt [81] proposed a more coarse grained approach that provides support for soft real-time applications. Their basic idea is to control the rate of non-real-time requests in order to give real-time requests the best chance of meeting deadlines. The rate of non-real-time requests is controlled through an adapted traffic shaping mechanism originally designed for networking known as Token Bucket Filter (TBF) [15]. To use the TBF mechanism, information about the allocation to the non-real-time requests needs to be given. This is provided through a mechanism known as Missed Deadline Notification (MDN) [6] where real-time requests inform the OS that a deadline is missed. When a MDN is received, the service rate for non-real-time requests is decreased. As time passes with no MDN received, the allocation to non-realtime requests is increased. While this traffic-shaping approach does not provide hard real-time guarantees, it does provide good support for applications that are able to handle some missed deadlines. Many I/O devices have internal queues that can hold outstanding requests waiting to be serviced [78]. The order in which these requests are processed varies and can significantly increase the variability in request service time. In particular, the hard disk drives have internal queues that allow a disk to service requests in a throughput optimized order (with heuristics to prevent starvation). Without intimate knowledge of the scheduling mechanisms used in these disk drives, meeting timing constraints becomes very challenging. Stanovich et al. [74] developed a technique that allows one to use the optimized scheduling of the hard disk drives while still meeting request deadlines. Their approach is to maximize the number of outstanding requests that are sent to the hard disk in order to allow the best chance for optimizations and therefore increase throughput. At the same time, new requests to be sent to the hard disk are throttled so that real-time requests already issued will not be in jeopardy of missing their deadlines if the real-time request(s) happen to be serviced last. Empirical techniques were developed to invoke worst-case or near worst-case behavior in order to extract service times. These times were used to determine whether new requests would jeopardize already issued real-time requests. Rather than guaranteeing some amount of service time on a particular I/O device, requirements for most devices are typically in terms of data throughput rather than time utilization. For example, an application may need to read 1MB every 30msec. Guaranteeing data throughput can be much harder than guaranteeing service time. However, Kaldewey et al. [35] have shown that in many situations, guaranteeing time utilization for real-time applications provides better control, isolation, and efficient use of hard disk resources in comparison with guaranteeing bandwidth or I/O rates. The approach is to provide each application with a virtual disk, which is a fractional share of a physical hard disk. Each application is allowed to reserve a percentage of the hard disk s time utilization, where the maximum sum of 100% time utilization is enforced to ensure feasibility. An important goal of the virtual disk abstractions is that its performance should be independent of any other virtual disk using the same physical hard disk. This is performed by charging each application for time used. This charging is done by first estimating the service time of an I/O request to ensure the request has enough budget and then charging the application once the I/O request returns, the point at which the actual service time can be computed Backlogging of Work Since I/O device operations tend to be much slower than CPU operations, in many cases requests are cached in order to speed up performance. For example, when write requests to hard disks are sent, this data will typically be cached in main memory and actually written to the hard disk later. This caching can create problems later on by causing an avalanche of data being written to the hard disk at the same time a real-time request is attempting to send requests to the hard disk. Due to file system and journaling order dependencies, these real-time requests may be forced to wait until previous request are written to the hard disk. If too large a backlog of cached requests has accumulated, the delay to service the real-time request may be enough to cause a missed deadline. Flash memory is a type of non-volatile storage which unlike hard disks has no moving parts, thereby providing much more predictable timings for simple operations. This 20

21 predictability makes flash memory much more suitable for use in real-time systems. However, flash memory has its own challenges due to the requirement that storage blocks must be erased before data can be written to them. This means that directly overwriting of old data is not possible. Additionally, storage blocks can only be erased a limited number of times. A component called the garbage collector is typically used to provide pre-erased, or free blocks of storage for new writes. The garbage collector tries to spread the number of erases uniformly across the flash storage to prolong the life of the storage media. This process called wear leveling. Flash-based storage also has the characteristic that the erase granularity (i.e., flash block) is larger than the write granularity (i.e., flash page). Therefore, when an erase occurs, all pages within a block to be erased must not hold usable data. If there are no such erasable blocks, the garbage collector must move data from an erasable block before the erase operation takes place. Reading from flash storage tends to be very predictable, however write performance can vary substantially [55]. This unpredictability is due to the possibility of blocking while the garbage collector creates free pages. To address the unpredictable write time of flash, [12] provides a realtime garbage collection mechanism to provide guaranteed write times to flash storage. The basic idea is to ensure that free pages are immediately available for real-time tasks, removing any blocking times that otherwise may need to be included in the schedulability analysis. To do this, a realtime garbage-collector task is created for every real-time task that writes to the flash storage. This garbage collector is created to reclaim enough free pages to supply to the realtime task. To determine schedulability the real-time task and its associated garbage collector task are included in the analysis. The access to the flash based storage is assumed to be programmed I/O, meaning that the processor is occupied during any operations being performed on the flash memory. Processor scheduling is stated to be EDF and only one processor is used. Verification of schedulability is performed by summing the utilizations for all real-time tasks and their associated garbage collectors plus the worst-case blocking time due to nonpreemptive operations is less than or equal to one. The worst-case blocking term is defined to be the time for an erase operation (the longest operation on flash memory) over the minimum period of all real-time and garbage collecting tasks. 5.3 Global Scheduling Many real-time applications use multiple resources. These resources provide the necessary services that allow an application to complete some given activity. When considering a real-time application, this activity will have timing constraints associated with it. Each resource along the critical path of the real-time application affects the timeliness of the entire application. To guarantee that the application meets its end-to-end timing constraints, coordinated scheduling between the different resources must occur. Guaranteeing end-to-end timing constraints involving multiple resources and multiple applications tends to be intractable as one single problem. To simplify the problem, the resource decoupling approach is proposed in [58]. Resource decoupling takes the larger problem involving multiple resources and breaks it into a series of smaller problems, where each smaller problem involves the scheduling of only one resource. That is, each job is broken into multiple subjobs, where each job involves only one resource and each resource is scheduled independently. This allows the calculation of the end-to-end delay to be equal to the sum of the delays of each job, thereby providing a mechanism to determine whether a given end-to-end timing constraint will be met. However, the resource decoupling approach [58] is only valid if the resources are independent of one another. A problem is that scheduling I/O resources generally also requires the processor, breaking the independence requirement. That is, if we consider devices such as the network or disk, both require some of the CPU processing time in order to provide their service (e.g. device driver code). This means that managing or scheduling each resource independently of any other resource does not occur. [64] uses a cooperative scheduling server (CSS) which reserves both I/O and processor time in order to guarantee end-to-end timing requirements. The CSS approach divides the end-toend operation into multiple jobs, pre-processing, I/O device service time, and post-processing. Each sub-job is then assigned a deadline that guarantees the end-to-end deadline is met. These deadlines are assigned on a fixed slack sharing scheme where the slack in the schedule is divided proportionally between the different sub-jobs. Initially, deadlines are set to equal the execution time. Then, a fixed percentage of the total slack is given to the I/O service to extend its deadline. The remaining slack is then divided between the pre- and post-processing sub-jobs proportional to the reserved processor time of the two sub-jobs. The mechanism in which end-to-end slack time is divided between the different sub-jobs can affect the total number of tasks that can be admitted to the system. Since each resource is scheduled independently of other resources on the system, the more slack time given to one sub-job decreases its utilization and thereby increases the likelihood of a given sub-job being schedulable. Further, this additional slack time means that the total utilization of that resource will be less, increasing the number of tasks that can be accepted in the future. To provide better deadline assignment, [22] proposed to assign deadlines based on resource utilizations. For instance, if a disk resource were 25% utilized and 21

22 the CPU resource were 75% utilized, rather than equally dividing the slack between the two resources, a more sensible assignment would be to divide the slack based on the total utilization of a given resource. So, if the total slack were 100ms, the deadline assignment of the disk would be increased by 250ms and the deadline for the CPU usage deadline would be increased by 750ms. Decoupling the resources from one other simplifies the scheduling, however, it removes some valuable information that can potentially provide better scheduling. [23] mentions the problem of non-preemptive scheduling and meeting end-to-end deadlines. With inter-scheduler coordination, the scheduler of each resource makes locally optimal solutions, however, due to precedence constraints release times for future jobs may be known and releasing a nonpreemptible job may interfere with the end-to-end deadline of an already released job. To partially solve this problem [23] proposes inter-scheduler coordination that informs resources of future job arrivals and deadlines. This advance notification allows resources to insert place holders for upcoming real-time jobs, thereby, preventing any interfering jobs from starting that may jeopardize upcoming deadlines. However, this solution is only partial because it cannot prevent cases such as long non-preemptible jobs starting before the real-time job arrives. The theoretical task or job abstraction typically maps to a process (or thread) in an implemented system. This allows the process to be the accountable entity for resource usages. This accounting can then be used to make future scheduling decisions. However, many real-world jobs do not map exclusively to one process. If we consider a web server, a common scenario is for a request to arrive through the network interface, be handled by the server s connection manager, be served by another process (to isolate and protect the system from unruly applications), and then returned to the client again through network interface. In this scenario, any single process is not the accountable entity. To solve this problem, Banga et al. [7] proposed the resource container abstraction. Resource containers provide an accountable entity that is used to track processor and resource usages. Therefore, each activity (job) on the system has associated with it a resource container. When an activity travels through different processes, the resource container will travel along as well. This allows proper scheduling decisions to take place based on the activities (jobs) rather than processes. 6 Conclusion Real-time systems guarantee that activities are completed within specified timing constraints. To provide these guarantees, real-time scheduling theory provides a foundation. For an implemented system to benefit from these theoretical guarantees, the system workload and scheduling must adhere closely to the theoretical models. In the past real-time systems have typically been developed on specialized hardware and software that were simple and mapped easily to theoretical models. Recently, more systems with real-time requirements have been built with off-the-shelf hardware and software components. These components were not designed for use in a real-time system and therefore composing a real-time system with these components is challenging. In particular, GPOSs have found their way into many real-time systems. While there are many benefits of using GPOSs, one major drawback is that they do not conform well to uses where timeliness is a concern. For GPOSs to be a viable component of a real-time system, the obstacles that prevent the theoretical models from being implemented must be addressed. One class of OS components that causes difficulties for GPOSs to follow real-time theoretical models is device drivers. These components have many characteristics that violate the assumptions of scheduling theory. In this paper, a number of these characteristics have been enumerated, as well as some of the more prominent solutions that have been proposed in the literature. While these solutions do address portions of the problem, many of the challenges remain. References [1] L. Abeni and G. Buttazzo. Integrating multimedia applications in hard real-time systems. In Proc. 19th IEEE Real- Time Systems Symposium, pages 4 13, Madrid, Spain, Dec [2] R. Arnold, F. Mueller, D. Whalley, and M. Harmon. Bounding worst-case instruction cache performance. In Real-Time Systems Symposium, 1994., Proceedings., pages , Dec [3] N. C. Audsley, A. Burns, M. Richardson, and A. J. Wellings. Hard real-time scheduling: the deadline monotonic approach. In Proc. 8th IEEE Workshop on Real-Time Operating Systems and Software, pages , Atlanta, GA, USA, [4] Y.-C. Baek and K. Koh. Real-time scheduling of nonpreemptive periodic tasks for continuous media retrieval. In TENCON 94. IEEE Region 10 s Ninth Annual International Conference. Theme: Frontiers of Computer Technology. Proceedings of 1994, pages , Aug [5] T. P. Baker, A. A.-I. Wang, and M. Stanovich. Fitting linux device drivers into an analyzable scheduling framework. In OSPERT 2007, the Workshop on Operating Systems Platforms for Embedded Real-time Applications, pages 1 9, July [6] S. Banachowski, J. Wu, and S. A. Brandt. Missed deadline notification in best-effort schedulers. In SPIE, Multimedia 22

23 Computing and Networking (MMCN), San Jose, CA, USA, Jan [7] G. Banga, P. Druschel, and J. C. Mogul. Resource containers: A new facility for resource management in server systems. In Proc. 3rd USENIX Symposium on Operating Systems Design and Implementation, pages 45 58, New Orleans, Louisiana, Feb [8] M. Bennett and N. Audsley. Predictable and efficient virtual addressing for safety-critical real-time systems. In Real- Time Systems, 13th Euromicro Conference on, 2001., pages , June [9] P. Bosch and S. J. Mullender. Real-time disk scheduling in a mixed-media file system. In RTAS 00: Proceedings of the Sixth IEEE Real Time Technology and Applications Symposium (RTAS 2000), pages 23 32, Washington, DC, USA, May June IEEE Computer Society. [10] A. Burns. Preemptive priority-based scheduling: an appropriate engineering approach. In Advances in real-time systems, pages , Upper Saddle River, NJ, USA, Prentice-Hall, Inc. [11] A. Burns, A. J. Wellings, and A. Hutcheon. The impact of an ada run-time system s performance characteristics on scheduling models+, [12] L.-P. Chang, T.-W. Kuo, and S.-W. Lo. Real-time garbage collection for flash-memory storage systems of real-time embedded systems. ACM Trans. Embed. Comput. Syst., 3(4): , [13] R.-I. Chang, W.-K. Shih, and R.-C. Chang. Deadlinemodification-scan with maximum-scannable-groups for multimedia real-time disk scheduling. In Real-Time Systems Symposium, Proceedings., The 19th IEEE, pages 40 49, Madrid, Spain, Dec [14] S. Cheng, J. A. Stankovic, and K. Ramamritham. Scheduling algorithms for hard real-time systems a brief survey. Technical report, University of Massachusetts, Amherst, MA, USA, [15] D. D. Clark, S. Shenker, and L. Zhang. Supporting realtime applications in an integrated services packet network: architecture and mechanism. ACM SIGCOMM Computer Communication Review, 22(4):14 26, Oct [16] P. Druschel and G. Banga. Lazy receiver processing (LRP): a network subsystem architecture for server systems. In Proc. 2nd USENIX symposium on operating systems design and implementation, pages , Oct [17] C. Ekelin. Clairvoyant non-preemptive edf scheduling. In ECRTS 06: Proceedings of the 18th Euromicro Conference on Real-Time Systems, pages 23 32, Dresden, Germany, July [18] T. M. Ghazalie and T. P. Baker. Aperiodic servers in a deadline scheduling environment. Real-Time Systems, 9(1):31 67, [19] T. Gleixner. ktimers subsystem. Articles/152363/, Sept [20] T. Gleixner and D. Niehaus. Hrtimers and beyond: Transforming the linux time subsystems. In In Proceedings of the Ottawa Linux Symposium, Ottawa, Canada, July [21] What Is Andriod? com/guide/basics/what-is-android.html, [Online; accessed 14-February-2010]. [22] K. Gopalan and T. Chiueh. Multi-resource allocation and scheduling for periodic soft real-time applications. In Proc. Multimedia Computing and Networking, San Jose, CA, USA, Jan [23] K. Gopalan and K.-D. Kang. Coordinated allocation and scheduling of multiple resources in real-time operating systems. In Proc. of Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT), Pisa, Italy, June [24] T. P. Guide. Zoned Bit Recording. pcguide.com/ref/hdd/geom/trackszbr-c. html. [Online; accessed 20-August-2010]. [25] M. G. Harbour. Real-time posix: an overview. In VVConex 93 International Conference, Moscow, June [26] C. Healy, R. Arnold, F. Mueller, D. Whalley, and M. Harmon. Bounding pipeline and instruction cache performance. Computers, IEEE Transactions on, 48(1):53 70, Jan [27] W. Hofer, D. Lohmann, F. Scheler, and W. Schroder- Preikschat. Sloth: Threads as interrupts. Real-Time Systems Symposium, IEEE International, 0: , [28] R. R. Howell and M. K. Venkatrao. On non-preemptive scheduling of recurring tasks using inserted idle times. Information and Computation, 117(1):50 62, Feb [29] IEEE Portable Application Standards Committee (PASC). Standard for Information Technology - Portable Operating System Interface (POSIX) Base Specifations, Issue 7. IEEE, Dec [30] Intel AA I/O advanced programmable interrupt controller (IOAPIC). design/chipsets/datashts/ pdf, [31] Intel. Intel 64 architecture (x2apic) specification pdf, Mar [32] D. M. Jacobson and J. Wilkes. Disk scheduling algorithms based on rotational position. Technical Report HPL-CSP- 91-7rev1, HP Laboratories, [33] K. Jeffay, D. F. Stanat, and C. U. Martel. On non-preemptive scheduling of periodic and sporadic tasks. In Proc. 12th IEEE Symposium on Real-Time Systems, pages , San Antonio, TX, [34] M. B. Jones and S. Saroiu. Predictability requirements of a soft modem. SIGMETRICS Perform. Eval. Rev., 29(1):37 49, [35] T. Kaldewey, T. Wong, R. Golding, A. Povzner, S. Brand, and C. Maltzahn. Virtualizing disk performance. In Real- Time and Embedded Technology and Applications Symposium, RTAS 08. IEEE, pages , St. Louis, MO, USA, Apr [36] S. Karlin and L. Peterson. Maximum packet rates for fullduplex ethernet. Tech Report TR , Princeton University, Feb [37] D. Katcher, H. Arakawa, and J. Strosnider. Engineering and analysis of fixed priority schedulers. Software Engineering, IEEE Transactions on, 19(9): , Sept [38] S. Kleiman and J. Eykholt. Interrupts as threads. ACM SIGOPS Operating Systems Review, 29(2):21 26, Apr

24 [39] B. W. Lampson and D. D. Redell. Experience with processes and monitors in mesa. Commun. ACM, 23(2): , [40] M. Lewandowski, M. J. Stanovich, T. P. Baker, K. Gopalan, and A.-I. Wang. Modeling device driver effects in real-time schedulability analysis: Study of a network driver. In Real Time and Embedded Technology and Applications Symposium, RTAS th IEEE, pages 57 68, Apr [41] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a hard real-time environment. Journal of the ACM, 20(1):46 61, Jan [42] J. W. S. Liu. Real-Time Systems. Prentice-Hall, [43] C. D. Locke. Software architecture for hard real-time applications: cyclic executives vs. fixed priority executives. Real- Time Syst., 4(1):37 53, [44] R. S. Ltd. Hybrid (measurement and static analysis). WCET-Hybrid-Static-and-Measurement, May [45] R. S. Ltd. Rapitime on target timing analysis. May [46] C. R. Lumb, J. Schindler, and G. R. Ganger. Freeblock scheduling outside of disk firmware. In Proceedings of the Conf. on File and Storage Technologies (FAST), Monterey, CA, Jan USENIX. [47] A. A. M. Aboutabl and J.-D. Decotignie. Temporally determinate disk access: An experimental approach. Technical Report CS-TR-3752, University of Maryland, [48] L. McVoy and C. Staelin. lmbench: Portable tools for performance analysis. In In USENIX Annual Technical Conference, pages , Jan [49] L. McVoy and C. Staelin. Lmbench - tools for performance analysis. May [50] J. Mogul and K. Ramakrishnan. Eliminating receive livelock in an interrupt-driven kernel. ACM Transactions on Computer Systems, 15(3): , [51] I. Molnar. Preemptive kernel patches. redhat.com/mingo/. [52] K. Narasimhan and K. D. Nilsen. Portable execution time analysis for risc processors. In in Proceedings of the Workshop on Architectures for Real-Time Applications, [53] K. D. Nilsen and B. Rygg. Worst-case execution time analysis on modern processors. In LCTES 95: Proceedings of the ACM SIGPLAN 1995 workshop on Languages, compilers, & tools for real-time systems, pages 20 30, New York, NY, USA, ACM. [54] J. K. Ousterhout. Why aren t operating systems getting faster as fast as hardware? In Proceedings of the Usenix Summer 1990 Technical Conference, pages , Anaheim, CA, USA, June [55] D. Parthey. Analyzing real-time behavior of flash memories. Master s thesis, Chemnitz University of Technology, Chemnitz, Germany, Apr [56] R. Pellizzoni and M. Caccamo. Toward the predictable integration of real-time cots based systems. In Real-Time Systems Symposium, RTSS th IEEE International, pages 73 82, Dec [57] I. Puaut and D. Hardy. Predictable paging in real-time systems: A compiler approach. In Real-Time Systems, ECRTS th Euromicro Conference on, pages , July [58] R. Rajkumar, K. Juvva, A. Molano, and S. Oikawa. Resource kernels: A resource-centric approach to real-time systems. In Proc. SPIE/ACM Conf. Multimedia Computing and Networking, pages , Jan [59] A. L. N. Reddy and J. Wyllie. Disk scheduling in a multimedia I/O system. In MULTIMEDIA 93: Proceedings of the first ACM Int. Conf. on Multimedia, pages , New York, NY, USA, ACM Press. [60] J. Regehr. Inferring scheduling behavior with Hourglass. In Proc. of the USENIX Annual Technical Conf. FREENIX Track, pages , Monterey, CA, June [61] J. Regehr and U. Duongsaa. Preventing interrupt overload. In Proc ACM SIGPLAN/SIGBED Conf. on languages, compilers, and tools for embedded systems, pages 50 58, Chicago, Illinois, June [62] J. Regehr and J. A. Stankovic. Augmented CPU Reservations: Towards predictable execution on general-purpose operating systems. In Proc. of the 7th Real-Time Technology and Applications Symposium (RTAS 2001), pages , Taipei, Taiwan, May [63] L. Reuther and M. Pohlack. Rotational-position-aware realtime disk scheduling using a dynamic active subset (DAS). In RTSS 03: Proceedings of the 24th IEEE Int. Real- Time Systems Symposium, page 374, Washington, DC, USA, IEEE Computer Society. [64] S. Saewong and R. R. Rajkumar. Cooperative scheduling of multiple resources. In RTSS 99: Proceedings of the 20th IEEE Real-Time Systems Symposium, pages , Phoenix, AZ, USA, Dec IEEE Computer Society. [65] K. Sandstrom, C. Eriksson, and G. Fohler. Handling interrupts with static scheduling in an automotive vehicle control system. In Real-Time Computing Systems and Applications, Proceedings. Fifth International Conference on, pages , Oct [66] S. Schönberg. Impact of pci-bus load on applications in a pc architecture. In RTSS 03: Proceedings of the 24th IEEE International Real-Time Systems Symposium, pages , Cancun, Mexico, Dec IEEE Computer Society. [67] L. Sha, T. Abdelzaher, K. E. Årzén, A. Cervin, T. P. Baker, A. Burns, G. Buttazzo, M. Caccamo, J. Lehoczky, and A. K. Mok. Real time scheduling theory: A historical perspective. Real-Time Systems, 28(2 3): , Nov [68] L. Sha, J. P. Lehoczky, and R. Rajkumar. Solutions for some practical problems in prioritized preemptive scheduling. In Proc. 7th IEEE Real-Time Sytems Symposium, [69] L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority inheritance protocols: an approach to real-time synchronisation. IEEE Trans. Computers, 39(9): , [70] B. Sprunt, L. Sha, and L. Lehoczky. Aperiodic task scheduling for hard real-time systems. Real-Time Systems, 1(1):27 60, [71] M. Spuri and G. Buttazzo. Efficient aperiodic service under the earliest deadline scheduling. In Proc. IEEE Real-Time Systems Symposium, Dec

25 [72] J. A. Stankovic and K. Ramamritham, editors. Tutorial: hard real-time systems. IEEE Computer Society Press, Los Alamitos, CA, USA, [73] M. Stanovich, T. P. Baker, A.-I. A. Wang, and M. G. Harbour. Defects of the posix sporadic server and how to correct them. In Real Time and Embedded Technology and Applications Symposium, RTAS th IEEE, pages 35 45, Stockholm, Sweden, Apr IEEE Computer Society. [74] M. J. Stanovich, T. P. Baker, and A.-I. A. Wang. Throttling on-disk schedulers to meet soft-real-time requirements. In Real-Time and Embedded Technology and Applications Symposium, RTAS 08. IEEE, pages , Apr [75] D. B. Stewart. Twenty-five-most commons mistakes with real-time software development. In Embedded Systems Conference 1999., Sept [76] D. Stodolsky, J. B. Chen, and B. N. Bershad. Fast interrupt priority management in operating system kernels. In USENIX Microkernels and Other Kernel Architectures Symposium, pages , Berkeley, CA, USA, Sept USENIX Association. [77] J. Strosnider, J. P. Lehoczky, and L. Sha. The deferrable server algorithm for enhanced aperiodic responsiveness in real-time environments. IEEE Trans. Computers, 44(1):73 91, Jan [78] T10 Technical Committee on SCSI Storage Interfaces. SCSI architecture model - 3 (SAM-3). org/ftp/t10/drafts/sam3/sam3r14.pdf, [79] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra, F. Mueller, I. Puaut, P. Puschner, J. Staschulat, and P. Stenström. The worst-case execution-time problem overview of methods and survey of tools. ACM Trans. Embed. Comput. Syst., 7(3):1 53, [80] B. L. Worthington, G. R. Ganger, Y. N. Patt, and J. Wilkes. On-line extraction of SCSI disk drive parameters. In SIG- METRICS 95/PERFORMANCE 95: Proceedings of the 1995 ACM SIGMETRICS joint Int. Conf. on Measurement and modeling of computer systems, pages , Ottawa, Ontario, Canada, ACM Press. [81] J. C. Wu and S. A. Brandt. Storage access support for soft real-time applications. In RTAS 04: Proceedings of the 10th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS 04), page 164, Washington, DC, USA, IEEE Computer Society. [82] J. Zhang, R. Lumia, J. Wood, and G. Starr. Achieving deterministic, hard real-time control on an ibm-compatible pc: a general configuration guideline. In Systems, Man and Cybernetics, 2005 IEEE International Conference on, volume 1, pages Vol. 1, Oct [83] Y. Zhang and R. West. Process-aware interrupt scheduling and accounting. In Proc. 27th Real Time Systems Symposium, Rio de Janeiro, Brazil, Dec