Universität Karlsruhe (TH)
Institut für Telematik

QoS in the Linux Operating System
Technical Report

Marc Bechler and Hartmut Ritter
Institut für Telematik, Fakultät für Informatik
Universität Karlsruhe (TH)
E-Mail: [mbechler|ritter]@telematik.informatik.uni-karlsruhe.de
Table of Contents

QoS in the Linux Operating System ... 1
1 Introduction ... 1
2 The QoS Architecture ... 3
3 Background: Scheduling and Policy-based Networking within the Linux Kernel ... 7
3.1 Scheduling in the Linux Operating System ... 7
3.1.1 Basics: The Architecture of Linux ... 7
3.1.2 Processes, Threads and Tasks ... 9
3.1.3 Scheduling ... 11
3.1.3.1 Interrupt Handling ... 12
3.1.3.2 Scheduling Algorithm ... 13
3.1.3.2.1 Real-Time Tasks ... 14
3.1.3.2.2 Common Tasks ... 15
3.1.3.2.3 Initialization of Expired Tasks ... 18
3.1.3.3 Dispatching ... 19
3.1.4 Example ... 20
4 Design Issues for an Adaptive Scheduler ... 23
4.1 Variable Time Slices ... 23
4.1.1 Negative Side Effects ... 25
4.2 A QoS-based Scheduling Approach ... 27
4.2.1 Example ... 30
4.3 QoS Support in Theory and Practice ... 32
5 Further Implementation Details ... 35
5.1 Adding a New System Call ... 35
5.2 The Implementation of modify_qos() ... 36
6 Bibliography ... 39
Appendix A: Overview of the Modified Files ... 41
Appendix B: Getting Started ... 43
Appendix C: The Task Structure task_struct ... 45
Appendix D: Scheduling Source Code ... 47
Appendix E: Proof of Maximum Threshold ... 53
Appendix F: Derivation of the Percentage A ... 55
1 Introduction

This document describes a main part of the implementation of our QoS architecture, which has been developed in the context of our UNIQuE project [1]. In order to achieve end-to-end support for quality of service, our approach is based on feedback loops as proposed in [2]. This technical report covers the implementation details of the realization of an adaptive CPU scheduler, which is a key element of the QoS architecture. For the implementation we used the Linux operating system (version 2.3.49), as the QoS architecture requires modifications in the kernel mechanisms of the operating system to achieve the end-to-end QoS support, and the sources of Linux are publicly available. The document is structured as follows: In the next chapter the basic concept of our QoS architecture is described. In this context, it will become clearer what the term adaptive scheduler denotes and what requirements result from this approach. In chapter 3, the CPU scheduling in the existing Linux kernel is described in detail, whereas chapter 4 presents the modified adaptive scheduler as well as some theoretical and practical results. Chapter 5 finally gives an overview of implementation details, followed by a bibliography and technical appendices. As with much research and development, especially in the Linux area, this work is not finished but work-in-progress; however, we think it brings up new aspects of quality of service support that fit the user's needs and are tailored much better to individual user preferences. An electronic version of this document can be found on the home site of our UNIQuE project, as well as other papers related to this topic: http://www.telematik.informatik.uni-karlsruhe.de/forschung/unique
2 The QoS Architecture

The overall picture of the QoS architecture is given in Figure 2.1. Basically, it consists of a set of Monitoring&Control entities, which are each associated with a resource. The resources currently under consideration in the end system are the processing time (processor scheduler entity), the available data rate shared between different flows, and the available network links in a multi-homed end system. The Monitoring&Control entities monitor the resource sharing of the associated resource, i.e., they schedule the resource usage between the demands. The control part realizes the interaction with other Monitoring&Control entities. They accept demands for a modified resource share and pass this demand down to another entity according to some adaptation and coordination rules.

[Figure 2.1: Overall Picture of our QoS Architecture — the user and the (adaptive and proactive) applications on top; the process scheduler (kernel), data scheduler (network subsystem), and network switching (link layer), each with its own Monitoring&Control entity and information base; the adapters at the bottom.]

Many application-oriented approaches exist that try to maintain information about the application's requirements and perform resource reservation in the end system based on calculating the need in advance [3, 4]. Opposed to this, our QoS architecture is strictly user-oriented. The resource sharing in the end system and the bias of the resource distribution should be done in favor of those applications the user is currently focusing on. This could simply be the window that is active at the moment, as is done in the Windows 2000 operating system [5]. As this might fit
the user's interest only some of the time, in [6] we proposed a simple user interface that allows the user to signal his current commitment to one or another application more explicitly. This interface is also shown in Figure 2.1: A button labeled with the letter Q (denoting Quality of Service) is integrated into the window title bar of each application started on the system. This Q-Button can be integrated into any windowing system (X-Window as well as Windows 2000) and does not necessarily depend on an application. It is the job of the operating system (and, if supported, of the application itself) to determine the meaning of "better". The user is not concerned with technical details and parameters detaching him or her from productive work. Instead, a single user parameter, given just by a click on a button, must be mapped onto different system parameters like processing time, data rate, used network, cost factors etc. (cf. [7]). Different parts of the architecture have been realized up to now: The adaptive data scheduler is described in [5], the network switching is part of the current work. Whereas application adaptation is a well-known concept in distributed multimedia systems [8], we first introduced the notion of proactive applications [2]. In contrast to adaptive1 applications that try to adapt to the current situation but cannot actively control resources, proactive applications actively take effect on the sharing of resources in advance. Thus, besides user interaction by means of the Q-Button, applications can interact with the system on their own. There are some basic design decisions that result from the presented architecture and follow the goal not to design a completely new operating system kernel:

- Inside each Monitoring&Control entity, minor resource variations should be hidden from the neighbor entities. Thus, short-term and long-term variations are decoupled and the system stability is increased.
- The Monitoring&Control entities accept and deal with simple resource requests that are passed over from other Monitoring&Control entities as well as from proactive applications. In general, these requests result in a bias of the resource distribution in favor of the flow or process related to the request. Obviously, resources are limited and cannot be increased, but they should be given to the application that is currently in the user's focus. Thus a paradigm shift from calculation and reservation in advance to adaptation to the current situation is performed, and complexity is reduced.

- Especially the CPU scheduling is crucial for both the behavior and the performance of an operating system. Hence, modifications of the algorithms used should be done very carefully; existing applications relying on some of their properties should not be broken.

1 In our context, adaptive applications have the same meaning as reactive applications, as they react on dynamic environments by adapting to the current availability of resources.
Following these goals, in the next chapter the existing CPU scheduler in the Linux operating system is described and analyzed in more detail. Afterwards, the concept of an adaptive scheduler is presented in chapter 4.
3 Background: Scheduling and Policy-based Networking within the Linux Kernel

The basic idea for providing applications with the quality of service they need is to use an adaptive process scheduler combined with an adaptive networking subsystem, as described in the previous section. Before the implementation of the adaptive process scheduler is described in more detail, we give an introduction to the scheduling of tasks within the Linux kernel. This helps to understand the decisions we made for the implementation of our adaptive process scheduler. According to the previous chapters, the following general conditions were taken into account:

- Existing ("legacy") applications should still be executable without any restrictions. Additionally, those legacy applications should also take advantage of the QoS support.
- The mechanisms for achieving quality of service should be simple and efficient.
- The user (or the application, respectively) must be involved in the closed loop in an active way, i.e., s/he must be able to affect the quality of service for each application.
- If possible, encroachments on the mechanisms of the operating system should be minimized, as modifications may result in undesirable behavior of the system.

3.1 Scheduling in the Linux Operating System

Within the Linux kernel, the mechanisms for scheduling the available CPU resources to the running processes are primarily optimized for high efficiency, fairness, throughput, and low response times. However, quality of service aspects were not considered in the development of the scheduler. For a running application it is not possible to use any QoS mechanisms related to the scheduling of tasks, as a specification of the user's preferences is not provided. In the following sections, the terms process scheduling, thread scheduling, and task scheduling will appear several times, particularly in the context of the question to which kind of objects the CPU resources are scheduled. In order to avoid obscurity, we use the generic term scheduling for allotting the available computation time.
Further components and mechanisms of the Linux kernel are described in detail in [9, 10].

3.1.1 Basics: The Architecture of Linux

In contrast to the current evolution of operating systems towards a micro-kernel architecture, the Linux kernel is based on a monolithic architecture. This basic design aspect also affects the
functioning of the process scheduler. In micro-kernel architectures, e.g., the Mach kernel [11] or the kernel of Microsoft Windows 2000 [5], the operating system's kernel only provides a minimum of functionality for performing basic operations. Examples are inter-process communication (IPC) and memory management (MM). Built on top of the micro-kernel, the remaining functionality of the operating system is implemented in independent processes or threads running in user space. They communicate with the micro-kernel via a defined interface. For the scheduling of the available computing time, the micro-kernel only supports the basic mechanism for switching from one process or thread to the next, called a context switch. The schedule sequence of the processes or tasks is determined separately, e.g., by kernel threads. In monolithic architectures like the Linux operating system, the entire functionality is concentrated within the kernel, i.e., all mechanisms and information relevant for the scheduling are available in the kernel.

[Figure 3.1: The Architecture of the Linux Operating System — processes in user mode; their tasks and the scheduler in system mode; the CPU hardware below.]

Figure 3.1 gives a coarse view of the Linux architecture, focused on the interaction between applications, scheduler, and hardware. Further resource managers like the memory manager are not shown in Figure 3.1, as the illustration might become too complex. Like other operating systems such as Microsoft Windows 2000 [5], a Linux process resides in one of two modes: a privileged system mode and a non-privileged user mode. In general, applications are running as processes within the user mode. Each process exists independently of other processes, i.e., each process has its own memory space and is not allowed to access the memory of other processes. When a process enters the Linux kernel (e.g., in the case of an interrupt or a system call), the system switches to the privileged system mode. In the Linux operating system, only one process can be
executed in the system mode, as the kernel performs critical operations which must be protected from mutual preemption2. From the Linux kernel's point of view, the processes are noted as tasks and are treated like common functions. In Figure 3.1, the tasks are marked by the dotted lines of each process resumed in the system kernel. Once a process/task enters the privileged system mode, it has access to the resources of every other task currently running on the system, which is not possible in the user mode. The scheduler is completely implemented inside the operating system's kernel. In multi-processor environments, each processor has its own scheduler, but the scheduling algorithm used by each scheduler is always the same.

3.1.2 Processes, Threads and Tasks

In papers about Linux, the terms process, thread, and task are in many cases used interchangeably, although they are completely different. A process is generally defined as a sequence of activities related to a distinguished problem. Each process exists independently of other processes. Processes are not able to affect each other, albeit they can communicate with each other using kernel mechanisms for inter-process communication. Thus, switching from one process to the next one requires a complete switch of the environment, including address space and CPU registers. Within a process address space, one or more threads can exist in parallel, working on the same problem. Those threads are running within the same environment, i.e., switching from one thread to the next only requires a fast context switch rather than an extensive switch of the environment. Thus, threads are also known as light-weight processes. Another important differentiation is that parts of the process memory space (the so-called shared memory) are accessible by all threads running within that process. This feature allows for an easy and efficient synchronization of threads and an efficient data exchange between threads, as no communication paths have to be established.
The basic difference between threads and processes is illustrated in Figure 3.2. (Note that Figure 3.2 is a typical example for a multi-threading operating system, such as Solaris. It is not typical for Linux!) Within the environment of a process, one or more threads are running simultaneously. The arrows mark the current position of the instruction counter for resuming the computation. In multi-threaded operating system designs, threads are dispatched immediately to the processor, whereas in Linux processes are scheduled and threads might run within a process as described below.

2 Note that even in multi-processor environments only one process can be executed within the system mode.
[Figure 3.2: Difference between Processes and Tasks — several processes, each containing one or more threads, competing for the CPU.]

From the viewpoint of the Linux kernel, every process running on the system is referred to as a task. Compared to processes, tasks are allowed to access every resource (in particular, tasks are allowed to modify other tasks), as they are running in the privileged system mode. Thus, the scheduling algorithm running in the kernel shares the available CPU bandwidth between the tasks currently running on the system, independent of whether a task represents a process or a thread outside the Linux kernel. The task structure is defined in include/linux/sched.h (see struct task_struct) and contains parameters that characterize a task, such as memory management information, various variables for states, flags, and statistics, pointers to related tasks (such as child tasks, parent tasks, previous and next task in the task list, etc.), hash tables, process credentials, limits, file system info, signal handlers, etc. The definition of the complete task structure is printed in Appendix C. Originally, threads were not supported by the Linux operating system; applications were basically mapped to processes. In order to achieve compatibility with the POSIX 1003.1c standard [12] relevant for UNIX operating systems, there are two basic implementation designs enabling the use of threads in Linux. The basic difference between those approaches is the way the scheduler treats threads:

- The first approach is the implementation of threads within a process, implemented by the Pthread library [12]. If a process starts one or more Pthreads, it is responsible for scheduling the computation time it gets from the operating system's scheduler to the Pthreads currently running within its context. Thus, Pthreads are invisible to the scheduler, as they are completely implemented within the user space.
- The alternative is the mapping of threads to processes, i.e., each thread is represented by exactly one process.
One example for this approach is the system call clone() (defined in arch/i386/kernel/process.c). Using this system call, a new process is forked that is allowed to access shared memory allocated by the parent process. In this case, the
Linux scheduler is responsible for sharing the computation time between all threads, as those threads are represented as common tasks to the kernel. Thus, the more threads created by a process or an application, the more computation time is scheduled for this application. In the remainder of this document, the differentiation between processes, threads, and tasks is not considered further, as the scheduler of the Linux operating system is responsible for scheduling the available computation time to tasks. Note that the definition of the task structure (struct task_struct) can be found in include/linux/sched.h.

3.1.3 Scheduling

The principal job of the scheduler is to decide on the next executable task that is dispatched to the CPU. A task is marked as executable if the computation of the task can be (actively) continued and does not depend on any other resources, such as arriving data from a hard disk drive. From the implementation point of view, a task p is executable if its actual state is p->state == TASK_RUNNING. The scheduling functionality can be found in kernel/sched.c in the function schedule(). In Linux, scheduling works in a preemptive manner, i.e., a task will run until one of the following events occurs:

- the task voluntarily releases its control back to the scheduler, or
- the scheduler revokes the CPU resource from the active task and dispatches the CPU to another task.

The scheduling sequence is defined by a round-robin-based time division multiplexing mechanism. Each (executable) task is dispatched to the CPU for a short period of time. This period, the so-called time slice, has a default value of 200 ms in Linux (derived from the constant DEF_PRIORITY in include/linux/sched.h). For each task currently dispatched to a processor, the time slice will be counted downwards. If the time slice expires, the active task will be preempted in favor of another task. An expired task will not be dispatched again until the time slice of each executable task has expired.
After this time, all tasks, including the non-executable ones, will be re-initialized and the competition for the CPU starts from the beginning. In Linux, time is measured in ticks (specified in kernel/timer.c), defined by the constant HZ in the file include/asm/param.h (for x86-based CPU architectures). The temporal resolution of time depends on the CPU architecture. In x86-based processor architectures, one tick corresponds to a time period of 10 ms (HZ = 100), whereas in the Alpha architecture, the time is updated 1,024 times per second3, i.e., one tick takes 0.977 ms. The time [in ticks] passed since the last system startup is stored in the global variable jiffies, declared in the file

3 The reason for this unusual value is a register in the Alpha CPU which is updated 1,024 times per second.
include/linux/sched.h. However, in many documents about the Linux kernel, ticks and jiffies are used in the same sense. As mentioned above, the scheduler will be called in three cases:

1. The time slice of the active task expires.
2. A task voluntarily releases the control back to the scheduler (e.g., it has to wait for a resource).
3. An interrupt or a system call triggers the scheduler.

The functioning of the scheduler can be subdivided into three steps. First, interrupts that were not handled completely have to be finished. The second step is the selection of the next task that will be scheduled to the CPU. Finally, the new task will be dispatched to the CPU. The three steps are explained in the following sections in more detail.

3.1.3.1 Interrupt Handling

Interrupts are used for notifying the operating system that certain events occurred, e.g., a completed data transfer or an error in a peripheral device. Interrupts are handled within the kernel of an operating system; therefore, the system switches to the privileged system mode. In Linux, three types of interrupts can be distinguished:

- Fast interrupts, like the input of characters via the keyboard, require no time-intensive actions and are handled immediately. For the scheduler, the handling of fast interrupts is transparent; it can only be noticed by the system that there is less time available for the tasks.
- In contrast to fast interrupts, a slow interrupt can be interrupted by other interrupts. The handling of slow interrupts is realized in two steps. First, actions that are essential for handling the interrupt are performed. The remaining operations for its completion, called bottom halves, are carried out when the scheduler is called the next time. The completion of the interrupt routine is performed before the scheduling algorithm starts; on account of the 20 ticks for the initial time slice, the maximum amount of time until this completion is 200 ms. The system timer is a typical example for a slow interrupt; it is raised every 10 ms by the processor (on x86-based architectures).
The actions in the first step are generally to update the system time, i.e., to increment the jiffies. Later, the bottom half takes care of executing expired timers and updating statistics for the currently scheduled task.
- The third type of interrupt is a system call. A system call requires entering the system mode, which is realized by a soft interrupt4. The scheduler is never called directly; instead, the need_resched flag (defined in the task structure in include/linux/sched.h) of an executable task is set to 1. This flag is automatically checked after the completion of a system call and, when appropriate, the scheduler will be triggered.

If the scheduler is executed, the first step is to complete the bottom halves. In the Linux 2.3.49 kernel, the existence of bottom halves is indicated by the variable tq_scheduler, defined in include/linux/tqueue.h, as can be seen by the following lines of code.

444 if (tq_scheduler)
445     goto handle_tq_scheduler;

If the tq_scheduler flag is set, the function run_task_queue() (also defined and implemented in include/linux/tqueue.h) will be called, which works out all bottom halves that were marked by interrupt handling routines. Basically, the same procedure is performed in the next step for handling the bottom halves of software interrupts.

457 if (softirq_state[this_cpu].active & softirq_state[this_cpu].mask)
458     goto handle_softirq;

The treatment of software interrupts is performed by the function do_softirq(), which is implemented in kernel/softirq.c and basically works in the same way as described in the previous case.

3.1.3.2 Scheduling Algorithm

The scheduling algorithm in Linux is responsible for deciding which (executable) task is next dispatched to the processor. According to the POSIX 1003.1b standard [13] for UNIX operating systems, Linux distinguishes two classes of tasks: real-time tasks and common tasks. Compared to common tasks, real-time tasks are principally favored by the scheduler. They only have to compete with other real-time tasks for the available computation time.
For scheduling real-time tasks, the POSIX 1003.1b specification defines two policies: Periodic real-time tasks are scheduled by a round-robin-based algorithm and are preempted by the expiration of their time slice, whereas a sporadic real-time task is executed until it voluntarily releases control over the CPU or until a real-time task with a higher priority preempts it. In the Linux kernel, the

4 A soft interrupt is initiated by software. In Linux, soft interrupts for switching to the system mode are realized by calling interrupt 0x80.
different types of tasks are defined in include/linux/sched.h. Periodic real-time tasks are represented by SCHED_RR, sporadic real-time tasks by SCHED_FIFO, and common tasks by SCHED_OTHER. Before the core scheduling algorithm starts, some actions need to be performed that guarantee a correct and fair scheduling of the available CPU resources. The active periodic real-time task is appended to the end of the queue of real-time tasks, so other real-time tasks of equal priority will also be scheduled. Furthermore, tasks waiting for specific events must be marked as executable if the corresponding signals arrive [10]. Further modifications of a task's state are described in [14]. However, those modifications are only relevant for scheduling insofar as the set (realized as a list) of executable tasks is updated. For the decision on the next task that will be dispatched to the CPU, the scheduling algorithm computes a weight for each executable task. This is implemented in the function goodness() (in kernel/sched.c) and will be described in more detail. As real-time tasks are favored, it seems that there are different schedulers for each kind of task. Indeed, both common tasks and real-time tasks are scheduled by the same scheduling function; only the weight function differs for each class of tasks.

3.1.3.2.1 Real-Time Tasks

The priority of real-time tasks potentially running in the system is always higher compared to common tasks. Thus, a real-time task only has to compete with other real-time tasks for the available computation time. Therefore, a fixed priority scheme is used, where the decision for the next task dispatched to the CPU is exclusively determined by the real-time priority of a task, defined in the per-task variable rt_priority. Other factors, such as the amount of computation time consumed so far, are not taken into account. Generally, the task with the highest real-time priority will be executed; it has to share the CPU bandwidth with other real-time tasks having the same real-time priority.
If a task with a lower priority currently runs on the system, it will be preempted. A dynamic modification of the real-time priorities is in general possible by external programs; however, the scheduler does not alter their values. Thus the weight calculated for each real-time task is constant (as long as no external program modifies the real-time priority). An expired time slice causes the preemption of a periodic real-time task, which will be re-initialized immediately. The calculation of a real-time task's weight is very simple, as can be seen by the following two lines of code in the goodness() function:

120 if (p->policy != SCHED_OTHER) {
121     weight = 1000 + p->rt_priority;
A task that is not marked with the SCHED_OTHER policy thus gets a weight of its (constant) real-time priority plus 1,000. This weight cannot be reached by any common task; consequently, executable real-time tasks are always treated before any common tasks. The handling of real-time tasks reveals that the term real-time is not really appropriate in this context. Typically, a real-time task has to deliver a correct value within a pre-determined time. Although a Linux-based application can affect the real-time priority of its process, there is no correlation to the support of a guaranteed quality of service, as neither a reservation nor any admission control is performed. Real-time in the context of Linux only means that real-time tasks have a higher priority than common tasks. The scheduler handles real-time tasks according to the POSIX 1003.1b definitions [13]. Nevertheless, the term real-time will be used further on in this paper, as both the literature on Linux and the source code of the Linux kernel use this term. (Note that with RT-Linux [15] there is an extension to Linux supporting hard real-time capabilities!)

3.1.3.2.2 Common Tasks

Common tasks are only scheduled if no executable real-time task currently runs on the system. The scheduling mechanism uses a round-robin-based algorithm, where the scheduling sequence can change dynamically. In contrast to real-time tasks, other parameters are used for the calculation of a task's weight. For common tasks, the weight function guarantees a fair scheduling, which is not necessarily given for real-time tasks. All parameters used by the weight function are defined for each task in the task structure (struct task_struct, see Appendix C).

- Time slice: The current size [in ticks] of the time slice. This value is held in the variable counter and is decreased while the task is scheduled to the CPU. Thus, if a task's time slice expires, counter is 0. The time slice is the basis for the weight function.
If the time slice expires, the scheduler has to re-initialize the counter variable (which is done when the time slice of each executable task has expired).

- Processor switch: This parameter is only used if symmetric multi-processing is enabled. In SMP environments, each CPU has its own scheduler. Therefore, a slight advantage is given to the task's weight if the task was last dispatched to this CPU, as the processor's cache might still contain valid entries for the environment of this process, so the complete environment might not need to be reloaded from memory. Additionally, some valid address mappings could have remained in the CPU's TLB (translation look-aside buffer). This advantage gets lost when the task switches to another processor. The processor switch parameter is implemented by adding PROC_CHANGE_PENALTY to the weight. This schedule penalty is defined in include/asm/smp.h and is set to 15.
- Environment switch: Tasks running in the same memory space as the current task (which can be realized by forking threads with the clone() system call, see section 3.1.2) are additionally preferred by increasing their weight. In this case a more efficient context switch (restoring the processor registers) is sufficient for switching to the next task, instead of switching to a completely new environment. This slight advantage (which is implemented by incrementing the weight) results in an improved efficiency, albeit it contradicts the principle of fairness for common tasks.

- Priority: In the Linux operating system, the priority has two meanings. First, it defines the (minimum) size used for the initialization of the time slices (in the counter variable) when all executable tasks are expired and need to be re-initialized. The priority is implemented in the task_struct's variable priority and has a pre-defined value of 20 ticks for Intel architectures. That is why the default value for a time slice is 200 ms; it guarantees a fair sharing of the available computation time between the executable tasks. priority is not a static variable; e.g., the nice command manipulates the priority of a task, and thus its share of the CPU resources (raising a task's priority is only available to root). The second meaning of the priority arises in its use for the weight function, as it is added to the parameters listed above. Therefore, processes with a lower priority are indirectly penalized, as their time slice will not be consumed as fast as the time slice of tasks with a higher priority. Note that the priority used for common tasks is different from the priority used for weighting real-time tasks (rt_priority), which is of no relevance for common tasks.

With the description of the parameters involved in the calculation of the weight for common tasks, the following lines of code, which implement the core functionality of the goodness() function, summarize those facts and are easy to understand:

132 weight = p->counter;
133 if (!weight)
134     goto out;
135
136 #ifdef __SMP__
137 /* Give a largish advantage to the same processor...
    */
138 /* (this is equivalent to penalizing other processors) */
139 if (p->processor == this_cpu)
140     weight += PROC_CHANGE_PENALTY;
141 #endif
142
143 /* .. and a slight advantage to the current MM */
144 if (p->mm == this_mm || !p->mm)
145     weight += 1;
146 weight += p->priority;
147
148 out:
149     return weight;

As can be seen in line 133, the weight will not be calculated for expired tasks (i.e., counter must be greater than 0). The basis is the currently remaining time slice (counter), possibly increased by PROC_CHANGE_PENALTY if SMP is enabled in the kernel configuration (see above), possibly incremented if an environment switch can be avoided (line 145), and finally increased by the priority in line 146. Note that all the information used for the calculation of the weight function is stored in each task's structure. The only purpose of the goodness() function is basically to determine the next task that is dispatched to the CPU. It has no effect on the share a task gets of the available computation time; this share is defined only by the remaining time slice, represented in the counter parameter (as described above). Compared to the calculation of a task's weight, the scheduling algorithm itself is quite simple, as can be seen in the code printed below. The scheduling algorithm uses the goodness() functionality. However, the active task may have given up its control voluntarily; this is marked by setting the SCHED_YIELD flag in the task's policy. Thus, for the active task, the prev_goodness() routine will be called, which first checks whether the SCHED_YIELD flag is set. In this case, the task receives a weight of 0; otherwise the weight will be computed according to the goodness() routine.

486 /*
487  * this is the scheduler proper:
488  */
489
490 repeat_schedule:
491     /*
492      * Default process to select..
493      */
494     next = idle_task(this_cpu);
495     c = -1000;
496     if (prev->state == TASK_RUNNING)
497         goto still_running;
498
499 still_running_back:
500     list_for_each(tmp, &runqueue_head) {
501         p = list_entry(tmp, struct task_struct, run_list);
502         if (can_schedule(p)) {
503             int weight = goodness(p, this_cpu, prev->active_mm);
504             if (weight > c)
505                 c = weight, next = p;
Marc Bechler and Hartmut Ritter

506     }

Starting with the idle task (line 494; which is always at the first position in the run queue), it is checked for each (executable) task in the run queue whether the task can be scheduled or not (line 502). This check is only relevant for SMP systems; in single-CPU environments, the can_schedule() routine always returns true. Then, for each executable task the goodness() function is called, whereby the helper variable c (line 495) stores the highest value of weight. This process shows that there is a second aspect influencing the determination of the next task: if two tasks have the same (highest) weight, the one that is positioned in front of the other in the run queue wins the competition for the CPU resources. Remember that tasks with an expired time slice have a weight of zero. If the time slices of all executable tasks are expired, their time slices will be re-initialized, as described in the following section.

3.1.3.2.3 Initialization of Expired Tasks

If the time slice of a periodic real-time task expires, the scheduler re-initializes its time slice immediately by assigning the value stored in the priority parameter, i.e., task->counter = task->priority, as can be seen in lines 624-629 (Appendix D). In contrast to real-time tasks, common tasks are only re-initialized when the time slices of all executable tasks are expired. In this case, all tasks will be re-initialized, including the tasks that are currently not executable. Thereby, the response time for tasks waiting for some resource is improved, to achieve a faster reaction to the signal(s) the task is waiting for. The code for the re-calculation of the expired timers is also quite easy to understand. In order to avoid undefined states while modifying the tasks' parameters, the task list (tasklist) has to be locked; other routines are allowed to read the parameters, but not to modify them (line 603). After the re-initialization, the run queue lock is re-acquired (line 607).
599 recalculate:
600 {
601   struct task_struct *p;
602   spin_unlock_irq(&runqueue_lock);
603   read_lock(&tasklist_lock);
604   for_each_task(p)
605     p->counter = (p->counter >> 1) + p->priority;
606   read_unlock(&tasklist_lock);
607   spin_lock_irq(&runqueue_lock);
608 }
The re-initialization of the time slices is done in line 605. In addition to the priority parameter, half of the remaining time slice is added. Therefore, the new size of the time slice is calculated according to the following equation:

new timeslice = priority + old timeslice / 2

This guarantees that tasks waiting for resources (and thus not completely utilizing their time slice) can react more quickly when the resource becomes available. Dividing the time slice is realized as a simple and efficient bit shift. Note that tasks with an expired time slice are re-initialized only with the priority (which is equal to 200 ms), as their counter is zero. If, as an example, a re-initialization is invoked and a (waiting) task has a remaining time slice of 17 ticks, it will be re-initialized with (20 + 17/2) ticks = 28 ticks. The next time the scheduler is called, this task will be preferred, as the time slice is the basis for calculating a task's weight. Additionally, the CPU's bandwidth will be shared more equally between the tasks, as computation time not needed now will partially be available later. Although the time slice of a waiting task seems to increase with every re-initialization, it has a maximum threshold of double the priority (i.e., 40 ticks), which can easily be proved by induction (see Appendix E for the proof).

In the case that no executable task currently exists, Linux provides a special task that is always executable. This task is called the idle task and is scheduled in this situation. The idle task is implemented as a simple endless loop, which is defined as a pre-processor macro in the file kernel/sched.c. Accordingly, the operating system provides an idle task for each CPU in SMP environments.

3.1.3.3 Dispatching

The task with the highest weight wins the competition for the resources of the CPU and has to be dispatched to the corresponding processor in the last step of the scheduling. Before this step is done, some statistical operations are performed.
For the current task, the kstat.context_swtch variable (line 558) is incremented, which stores the current number of preemptions. Additionally, in multi-processor environments, the average utilization of the time slice is computed (lines 535-548). For dispatching, the context (i.e., the register values of the processor) of the preceding task must be saved before the new context can be loaded from memory. The components of a context are described in more detail in [16], as architecture-specific characteristics are taken into account. Note that the scheduler itself is not related to any task; until this point in time, the scheduler runs in the context of the previous task that is to be preempted. After dispatching, the scheduling is finished.
If the time slices of all executable tasks are expired, the time slices will be re-initialized (as described above) and the weight calculation for each executable task will be repeated. For dispatching the next task, the scheduler first switches the memory context (lines 568-586) before it loads the context of the new task into the CPU registers using the switch_to() function in line 592 (note that this routine is completely written in assembler code; see include/asm/system.h). The following section gives a simple example of the scheduling.

3.1.4 Example

Figure 3.3 (a) demonstrates the scheduling in an example using three executable (common) tasks. At time t0 = 0 ticks, all tasks have a remaining time slice of 20 ticks (which means that they were just re-initialized), and the task running before t0 has to wait for some resource. We also assume a system running with one CPU.

(a) [timeline: dispatching of Task 1 to Task 3 over 0-70 ticks]

(b)              Task 1            Task 2          Task 3
t0 =  0 ticks    20 + 20 = 40      20 + 20 = 40    20 + 20 = 40
t1 = 10 ticks    -                 20 + 20 = 40    20 + 20 = 40
t2 = 30 ticks    10 + 20 = 30      0               20 + 20 = 40
t3 = 40 ticks    10 + 20 = 30      0               10 + 20 + 1 = 31
t4 = 50 ticks    10 + 20 = 30      0               0
t5 = 60 ticks    20 + 20 + 1 = 41  20 + 20 = 40    20 + 20 = 40

Figure 3.3: Example for the Scheduling of Different Tasks

The dotted lines in Figure 3.3 mark the points in time at which the scheduler is activated. While running, the weight is calculated for each of the three tasks according to section 3.1.3.2.2. The results of the goodness() function are listed in Figure 3.3 (b) and are based on the following calculations:
- t0 = 0 ticks: At the beginning of the example, no task has consumed any of its time slice. Thus, the weight of 40 results from adding the time slice (counter, 20 ticks) to the priority (20 ticks). As Task 1 is the first in the run queue, it will be dispatched.
- t1 = 10 ticks: After 100 ms, Task 1 voluntarily gives control over the CPU back to the scheduler, as it has to wait for an external resource. At this point, the weights of Task 2 and Task 3 are still 40, as described above. Thus, Task 2 is the next task in the run queue and will be dispatched.
- t2 = 30 ticks: The time slice of Task 2 expires, resulting in the activation of the scheduler. The weight function will weight Task 2 with 0, as its time slice has expired. Thus, Task 3 will be scheduled to the CPU.
- t3 = 40 ticks: Task 1 is signaled that the needed resource is now available, and the scheduler will be called. Thus, Task 1 becomes executable and will be weighted with 30, as its remaining time slice is 10 ticks. Task 3, which also has a remaining time slice of 10 ticks, will be kept dispatched, as its weight is 31 due to the advantage that no switching of the memory context is needed.
- t4 = 50 ticks: The weights of Task 2 and Task 3 are 0, as the time slices of both tasks are expired.
- t5 = 60 ticks: At this point, all time slices are expired and the scheduler initiates a re-initialization, which results in a weight of 0 for all tasks. Thus, the time slices of all tasks will be re-initialized and the weight will be computed a second time for each task. Note that after the re-initialization, Task 1 will keep control over the CPU, as no memory context switch is needed and, thus, its new weight will be 41, compared to 40 for Task 2 and Task 3.

This brief and simple example demonstrates the rather inflexible but efficient operation of the scheduler. However, support for quality of service is not possible, as the Linux kernel does not provide any means to affect the scheduling of the tasks.
The following section describes the problems and the realization of an adaptive task scheduler for supporting quality of service for applications.
4 Design Issues for an Adaptive Scheduler

An adaptive scheduling mechanism is the basis for supporting quality of service at the operating system level. Our approach is to realize an adaptive scheduler with a minimum of modifications to the fundamental scheduling mechanisms in Linux. Two basic possibilities can be identified: the use of periodic real-time tasks with dynamic priorities, or a flexible time slice scheme for common tasks. We decided to realize the second approach, as the use of dynamic real-time priorities has several disadvantages:

- A problem we identified in many tests is the favored treatment of real-time tasks. In general, an executable real-time task is always scheduled before all other common tasks and need not compete with them for the CPU resources. If a complex multimedia application is based on one or more real-time tasks, the performance of all other applications running on the system will break down.
- Dynamic real-time priorities require an extensive and complex priority management. A Real-time QoS Manager (RQM) would be necessary to handle the scheduling policy of QoS-based applications. As an example, the RQM has to perform a continuous adaptation of the real-time priorities of all tasks in order to guarantee the desired sharing, as the Linux scheduler only considers executable tasks with the highest real-time priority.
- The Real-time QoS Manager also has to handle the priority inversion problem, which occurs with the use of flexible priorities for real-time tasks. In the priority inversion problem, a low-priority task that currently uses a resource blocks a (running) task with a higher real-time priority that tries to access this resource. Therefore, the RQM has to perform additional operations to handle (real-time) priority inversion, such as temporarily increasing the (real-time) priority.
- The last aspect is the real-time priority itself, which has no relation to the scheduled CPU resources. Therefore, no statement about the effective quality of service support is possible. In particular, it becomes very hard to map the application's QoS requirements to real-time priorities.
4.1 Variable Time Slices

Instead of using dynamic priorities for real-time tasks to support quality of service at the operating system level, we use a flexible time slice scheme for all tasks. Of special interest here is the priority variable, which defines the priority of a (common) task and simultaneously defines the (minimum) time slice for its re-initialization. Thus, the basic idea of our approach is to use the priority variable as a representative for quality of service. As can be seen in Figure 4.1, we also
realized a QoS Manager that is responsible for interacting with the applications (see also chapter 2; note that the QoS Manager and the RQM are different entities!). If a process requests improved QoS support from the QoS Manager, it gets more CPU resources, as its time slice will be increased. Note that the QoS Manager has to take into account whether scheduling more CPU resources to a task indeed improves the application's quality of service, or whether the CPU resources are not the bottleneck. The mapping of different time slices to the corresponding amounts of CPU resources is implicitly performed by the scheduler. This approach additionally has the advantage that other tasks will not starve (which might be possible in the real-time approach), as the Linux scheduler will be called at the latest after the time slices of all executable tasks are expired and, thus, all of them were dispatched during this time period. Conversely, it has to be ensured that the priority is at least one tick, as tasks with a priority of 0 are interpreted as expired and are not considered by the scheduler.

[Figure 4.1: A QoS-based Scheduler for Linux (the Scheduler and the QoS Manager reside in system mode, between the applications in user mode and the CPU hardware)]

The QoS support specified by the user or an application can easily be mapped onto the size of a task's time slice. On account of the coarse resolution of time within the Linux kernel, we chose the identity function for the mapping: a modification of the quality of service support by +1 results in an increase of the time slice by 1 tick. A formal description of the resulting percentage change of the share can be found in section 4.3. Note that the mapping of the specified quality of service covers only the scheduled computation time. Predictions about the points in time of the activation of the scheduler are not possible, as this mainly depends on the unpredictable occurrence of interrupts and other system events. Thus, it is not possible to specify the triggering of the scheduler. It is only possible to determine the maximum time period between two calls of the scheduler, which is double the
maximum time slice (see section 3.1.3.2.3 and Appendix E). However, this also means that it is not possible to guarantee the desired quality of service to an application, as this basically depends on the behavior of the other tasks, interrupts, and system events. That is, our approach only enables a statistical guarantee of the QoS support, which we call soft quality of service. But we believe that this kind of guarantee is sufficient for multimedia applications, as can be seen from the results described in [2]. Note that our approach described so far does not depend on explicit intervention in the scheduling mechanisms of the Linux kernel. However, there are some negative side effects, which require slight modifications of the scheduler.

4.1.1 Negative Side Effects

As some mechanisms of the Linux operating system are not designed for supporting quality of service within the end system, the use of variable time slices results in several negative side effects. As an example, the use of real-time tasks conflicts with the QoS support, as those tasks are fundamentally preferred, and neither the user nor the QoS Manager has hardly any possibility to control real-time tasks dynamically. One approach for avoiding this problem might be to treat real-time tasks as common tasks. In this case, only slight modifications of the scheduler are required. However, this encroachment contravenes the POSIX 1003.1b specification [13], which explicitly postulates the existence of real-time tasks and their scheduling. Additionally, the modifications could result in an undesirable behavior of existing real-time applications. Thus, our approach treats real-time tasks as before. The user might remedy this disadvantage either by closing real-time applications, or (if s/he has root access) by using the system call sched_setscheduler(), which resets a real-time task to a common task. A second obvious problem results from the demand-driven activation of the scheduler.
A task that is currently dispatched to the CPU will only be preempted if it voluntarily releases control over the CPU, if its time slice expires, or if a system call explicitly triggers the scheduler. Although this mechanism is very efficient for small time slices, as the scheduler will only be called when necessary, it is disadvantageous for large time slices. If, e.g., two MPEG-2 videos are running with a time slice of one second, the CPU resources are shared equally between both applications. However, while running, the first application gets the complete CPU resources for 1 s and displays the video sequence with the maximum frame rate, while the other application is suspended. After 1 s, the time slice of the first video expires and the second video runs for 1 s, while the first application is suspended. With a further increase of the time slice, this effect is reinforced. However, this kind of quality of service support does not meet the user's expectations: it would be more convenient if both video sequences were played simultaneously in a smoother way.
Another weak point arising from large time slices is the calculation of the weight for each task. The goodness() function uses a task's (re-)initialization value for the time slice (stored in the priority variable), resulting in a high weight for tasks with a large time slice. Thus, such a task first utilizes its complete time slice before other tasks are scheduled, even if the scheduler is called multiple times during this time period. This effect is clarified by the example shown in Figure 4.2. In this example, the time slice of Task 1 is 600 ms. Thus, its minimal weight will always exceed a value of 60 (until its time slice expires), which will never be reached by common tasks with a time slice of 200 ms. For common tasks, the maximum weight is 41 (20 ticks remaining time slice + 20 ticks priority + 1 tick for avoiding a context switch). Note that the weights for the tasks at t5 in Figure 4.2 are the values for the second call of the function goodness(), after all tasks were re-initialized.

(a) [timeline: dispatching of Task 1 to Task 4 over 0-140 ticks]

(b)               Task 1            Task 2          Task 3          Task 4
t0 =   0 ticks    60 + 60 = 120     20 + 20 = 40    20 + 20 = 40    20 + 20 = 40
t1 =  40 ticks    20 + 60 + 1 = 81  20 + 20 = 40    20 + 20 = 40    20 + 20 = 40
t2 =  60 ticks    0                 20 + 20 = 40    20 + 20 = 40    20 + 20 = 40
t3 =  80 ticks    0                 0               20 + 20 = 40    20 + 20 = 40
t4 = 100 ticks    0                 0               0               20 + 20 = 40
t5 = 120 ticks    60 + 60 = 120     20 + 20 = 40    20 + 20 = 40    20 + 20 + 1 = 41

Figure 4.2: Example for Tasks with Large Time Slices

In the context of large time slices, the re-initialization after the expiry of a task's time slice is also crucial for the QoS support. The specification of quality of service at the level of the operating system has no direct correlation to the time slice of a task. Additionally, a re-initialized time slice also depends on the current time slice (see line 605) and thus results in an unpredictable quality of service support if the task has to wait for any resources. If, e.g., an adaptive video application with a priority of 100 ticks has to wait for incoming data, its time slice increases up to 2 s.
When the data becomes available, the application gets the complete CPU resources for 2 s. Thereupon,
the quality of service support degrades suddenly after the next re-initialization, and the share of the CPU resources settles down at the desired level. Thus, the improved response time for waiting tasks in Linux also affects the QoS support in a negative way. Additionally, variations in the quality of service support are caused by applications that do not completely consume their scheduled computation time, resulting in a dynamic balance between user requirements, current load, behavior of applications, and interrupts from the hardware.

4.2 A QoS-based Scheduling Approach

The effects described in the previous section show the need for intervention in the basic scheduling mechanisms of Linux. Thus, the quality of service support from the operating system is achieved by a combination of two basic concepts: First, adaptation is achieved by using a flexible time slice scheme, which is controlled by a feedback loop. Additionally, a modified QoS-based scheduler provides an improved quality of service support. The negative effects result mainly from tasks with large time slices running on the system. Thus, one obvious approach could be a restriction of the maximum size of the time slice. Additionally, it is conceivable to reduce the priority, and thus the (re-)initialization value (which is at least 20 ticks), for each task. However, this approach is rather inflexible: on the one hand, it is difficult to specify an upper limit for the time slices of all applications; on the other hand, the default granularity of the system time is very coarse (1 tick = 10 ms). By lowering the priority level for all tasks, the granularity for adjusting the QoS support becomes much coarser (cf. section 4.3). The alternative we used for handling tasks with a large time slice is the additional use of a time-driven mechanism for triggering the scheduler. Therefore, an external timer calls the scheduler periodically. The activation of this timer is only necessary if the user or an application requires QoS support.
If the user needs no further quality of service (or a QoS-based application is terminated), the timer will be deactivated, provided no other QoS-based applications are currently running on the Linux system, restoring the original efficiency of the scheduling mechanism. Thus, the external timer provides a more regular (and thus qualitatively better) QoS support by the operating system, as large time slices will be split into several smaller units that are dispatched to the CPU, resulting in an improved interleaving of the tasks. We implemented the external timer by using the add_timer(), del_timer(), and mod_timer() kernel functions, which periodically trigger our scheduler with a frequency of 100 Hz. However, the external timer does not remedy the general problem of the coarse temporal granularity. Of course, a resolution of 10 ms causes little effort for the management of the timing aspects and the scheduling of tasks, but for current and future applications this granularity is not sufficient, as the desired frequency of the scheduler calls cannot be adjusted exactly. Note that a
six-way superscalar processor such as the Intel Pentium 4 CPU running at 1.5 GHz performs 15,000,000 CPU cycles in 10 ms and is thus theoretically able to execute up to 90,000,000 operations in 1 tick [17]. This suggests the idea of modifying the timing basis of Linux. This can easily be achieved by increasing the HZ constant (currently set to 100 in the file include/asm/param.h for x86-based CPU architectures), which results in a shorter period for one tick. However, every routine handling temporal aspects (such as the calculation of the time, or routines for statistical calculations such as the average time slice) has to be modified in order to achieve a correct system behavior.

However, tasks with a large time slice are still treated in the same way. As can be seen in Figure 4.2, Task 1 gets the CPU within the first 600 ms, even if the scheduler is called within this time period. Afterwards, it has to wait for 600 ms until the time slices of all other tasks with a smaller time slice are expired. A more regular scheduling policy can only be achieved by modifying the scheduling algorithm. The reasons for the behavior of tasks with a large time slice can mainly be seen in the goodness() function, as both the priority and the remaining time slice of each task are the most relevant parameters for the calculation of the weight. In order to retain the effectiveness of the scheduler, we decided not to modify the basic scheduling mechanisms but to adapt the goodness() function in the following way. (In contrast to earlier versions of the Linux kernel, we wrote a new function in parallel to goodness() called goodness_qos(), which is only used by the scheduling algorithm, as there are other functions [such as reschedule_idle()] that use the values generated by goodness().)

- For each task, we additionally retain the number of schedules in the variable schedule_counter (stored in struct task_struct in include/linux/sched.h). If all executable tasks are expired and their time slices are re-initialized, the schedule_counter will also be reset to 1.
Thus, the weight will be computed as follows: As the basis, we use the currently remaining time slice divided by the number of schedules (rounded down to the next whole number). Where applicable, both the advantage for avoiding a processor change penalty and the context switch benefit are added to this value, as described in section 3.1.3.2. Note that we have to take care that the weight does not become 0, as in this case the task would not be scheduled any further.
- Compared to the original goodness() function, the priority parameter is not considered in the weight.
- Another modification is required for the re-initialization of the time slices. As the time slice of a non-executable task increases, the control of the quality of service support becomes more difficult (and might even result in a degraded QoS support). Thus, we only re-initialize a task with its priority; i.e., it is easier to map the quality of service to the size of the time slice, and the QoS support becomes more transparent and calculable for the application.
The following lines of code show the important part of the implementation of the goodness_qos() function.

 1 static inline int goodness_qos(struct task_struct * p,
 2                                int this_cpu, struct mm_struct *this_mm)
 3 {
 4   int weight = 0;
...
23   if (!(p->counter))
24     goto out;
25
26
27   if (p->schedule_counter >= p->counter)
28   {
29     // schedule_counter must not get higher than the counter value,
30     // otherwise the DIV will become 0! Beware that there is no
31     // schedule_counter overflow. schedule_counter must not become
32     // 0, as in this case the following code will not be called.
33     // Set weight to 1 for optimization :-)
34
35     p->schedule_counter = p->counter;
36     weight = 1;
37   }
38   else
39   {
40     weight = (int) p->counter / p->schedule_counter;
41   }
42
43 #ifdef SMP
44   /* Give a largish advantage to the same processor... */
45   /* (this is equivalent to penalizing other processors) */
46   if (p->processor == this_cpu)
47     weight += PROC_CHANGE_PENALTY;
48 #endif
49
50   /* .. and a slight advantage to the current MM */
51   if (p->mm == this_mm || !p->mm)
52     weight += 1;
53
54 out:
55   return weight;
56 }
57
58 #endif //CONFIG_QOS_SCHED
On the basis of this calculation, tasks that are dispatched quite often between two re-initializations become slightly handicapped. However, in combination with the external timer described above, the available CPU resources will be scheduled much more smoothly between running tasks, as can be seen in the example in the next section. Within the function responsible for the scheduling of tasks (schedule() in file kernel/sched.c), only minor functionality has to be modified. The first modification is to increase the number of schedules (p->schedule_counter) of the current task every time the scheduler is called. Second, we have to call the goodness_qos() function instead of the goodness() function for the calculation of each task's weight. The third modification of the scheduler has to be done in the re-initialization of each task. We only use the priority value for the re-initialization and omit the current size of the time slice. Additionally, the schedule_counter parameter must be reset. Thus, the following lines of code perform the re-initialization of each task that currently exists on the Linux system.

1 for_each_task(p)
2 {
3   p->counter = p->priority;
4   p->schedule_counter = 1;
5 }

As can be seen from the previous lines of code, the effort of our adaptive scheduling mechanism increases the system load only minimally. Compared to the original scheduling mechanism, the adaptive scheduler needs an additional if-then-else construct in the goodness() function, which results in either two assignments of values to variables or a whole-numbered division for the variable weight (unfortunately, divisions are quite expensive in terms of CPU cycles). In return, we can omit one assignment and the addition of the priority to the weight. The scheduler needs an additional increment of the schedule_counter variable; the re-initialization also requires an additional assignment for the same variable, which is, however, compensated by dispensing with the calculation of the increased time slice if a task does not completely utilize its time slice (cf. line 605 in Appendix D).
Altogether, we have an additional overhead of one if-then-else query, with possibly a division, when passing through our adaptive scheduler for the Linux operating system.

4.2.1 Example

In order to demonstrate the behavior of our QoS-based scheduler, we used four tasks with different priorities. Task 1 and Task 2 both have a time slice of 40 ticks, and Task 3 was initialized with
20 ticks, and Task 4 had a pre-defined time slice of 10 ticks. The frequency of the external timer was adjusted to 10 Hz, i.e., the scheduler was called every 100 ms. Figure 4.3 (a) shows the schedule sequence of the four tasks; in Figure 4.3 (b) the weights of the different tasks at each call of the scheduler can be seen. In the terms a/b in Figure 4.3 (b), a is the remaining time slice of the corresponding task at time t, b represents the number of schedules (since the last re-initialization) that we introduced for our QoS-based scheduler, and the operator / denotes the division that rounds the result down to the next whole number. The bonus for avoiding a context switch was set to the default value of 1. The benefit for avoiding a processor switch is not considered in this scenario (as we developed and optimized our adaptive scheduler for single-processor systems). In Figure 4.4, we demonstrate the schedule sequence with the same example shown in Figure 4.2.

(a) [timeline: dispatching of Task 1 to Task 4 over 0-110 ticks]

(b)               Task 1          Task 2          Task 3         Task 4
t0  =   0 ticks   40/1 = 40       40/1 = 40       20/1 = 20      10/1 = 10
t1  =  10 ticks   30/2 + 1 = 16   40/1 = 40       20/1 = 20      10/1 = 10
t2  =  20 ticks   30/2 = 15       30/2 + 1 = 16   20/1 = 20      10/1 = 10
t3  =  30 ticks   30/2 = 15       30/2 = 15       10/2 + 1 = 6   10/1 = 10
t4  =  40 ticks   20/3 + 1 = 7    30/2 = 15       10/2 = 5       10/1 = 10
t5  =  50 ticks   20/3 = 6        20/3 + 1 = 7    10/2 = 5       10/1 = 10
t6  =  60 ticks   20/3 = 6        20/3 = 6        10/2 = 5       0
t7  =  70 ticks   10/4 + 1 = 3    20/3 = 6        10/2 = 5       0
t8  =  80 ticks   10/4 = 2        10/4 + 1 = 3    10/2 = 5       0
t9  =  90 ticks   10/4 = 2        10/4 = 2        0              0
t10 = 100 ticks   0               10/4 = 2        0              0
t11 = 110 ticks   40/1 = 40       40/1 + 1 = 41   20/1 = 20      10/1 = 10

Figure 4.3: Example of QoS-based Scheduling
(a) [timeline: dispatching of Task 1 to Task 4 over 0-120 ticks]

(b)               Task 1           Task 2         Task 3         Task 4
t0  =   0 ticks   60/1 = 60        20/1 = 20      20/1 = 20      20/1 = 20
t1  =  10 ticks   50/2 + 1 = 26    20/1 = 20      20/1 = 20      20/1 = 20
t2  =  20 ticks   40/3 + 1 = 14    20/1 = 20      20/1 = 20      20/1 = 20
t3  =  30 ticks   40/3 = 13        10/2 + 1 = 6   20/1 = 20      20/1 = 20
t4  =  40 ticks   40/3 = 13        10/2 = 5       10/2 + 1 = 6   20/1 = 20
t5  =  50 ticks   40/3 = 13        10/2 = 5       10/2 = 5       10/2 + 1 = 6
t6  =  60 ticks   30/4 + 1 = 8     10/2 = 5       10/2 = 5       10/2 = 5
t7  =  70 ticks   20/5 + 1 = 5     10/2 = 5       10/2 = 5       10/2 = 5
t8  =  80 ticks   10/6 + 1 = 2     10/2 = 5       10/2 = 5       10/2 = 5
t9  =  90 ticks   10/6 = 1         0              10/2 = 5       10/2 = 5
t10 = 100 ticks   10/6 = 1         0              0              10/2 = 5
t11 = 110 ticks   10/6 = 1         0              0              0
t12 = 120 ticks   60/1 + 1 = 61    20/1 = 20      20/1 = 20      20/1 = 20

Figure 4.4: QoS-based Scheduling in the Scenario of Figure 4.2

As can be seen from the two examples, our QoS-based scheduler provides an improved quality of service support for the applications running on the operating system, as the scheduler is called more often and the QoS-based scheduling mechanism provides a smoother schedule of the running tasks.

4.3 QoS Support in Theory and Practice

In our approach, the QoS support relates to the percentage of scheduled CPU resources within a time period. In this context, the time period is the time between two re-initialization processes. Assume that there are n executable tasks. T_i denotes the priority and, thus, the initial time slice of task i. If all tasks are executable, no further tasks are started, and the CPU is completely utilized, the percentage share A_i of the available CPU bandwidth for task i is defined by the following equation.
A_i = T_i / (T_1 + T_2 + ... + T_n)

If the user (or an application) changes the priority of task i by Δ [ticks], the percentage A'_i of the scheduled CPU resources for this task is defined by

A'_i = (T_i + Δ) / (T_1 + T_2 + ... + T_n + Δ)

The derivation of this equation can be found in Appendix F. We also measured the system behavior in practice. For this purpose, we started four processes performing integer additions in an endless loop, thus completely utilizing their scheduled CPU resources; i.e., each process gets about ¼ of the available CPU resources, while each process traced the time it was scheduled to the CPU. Figure 4.5 shows the results of this experiment. In Figure (a), we increased the time slice of process 1 by 2 ticks in each step; in Figure (b), the time slice of process 4 was decreased by 2 ticks in each step. The two figures show the effect on the CPU utilization for each process. As we can see (and as was already mentioned above), increasing the time slice of one task results in a smaller increase of the percentage of the CPU resources than the decrease caused by reducing the time slice by the same amount.⁵ These experiments show the feasibility of our adaptive scheduler. Further measurements, such as the resulting quality of service support for applications, are described in more detail in [1, 2]. We also implemented a QoS Library that allows applications to specify their own quality of service support in order to achieve an optimized convergence of the QoS requirements to the actual QoS support provided by the operating system. A brief description of the QoS Library can be found in [2].

⁵ Assuming the situation in Figure 4.5 starting with 20 ticks per process, increasing the time slice of Process 1 in (a) to 22 ticks results in an improved percentage of 1.829 %, whereas decreasing the priority of Process 4 (in (b)) results in a degraded percentage of 1.92 %.
(a) Advantage for Process 1: CPU utilization [%] of Processes 1-4 for an advantage of 0 to 10 ticks
(b) Disadvantage for Process 4: CPU utilization [%] of Processes 1-4 for a disadvantage of 0 to -10 ticks

Figure 4.5: Experiment - Modification of the Time Slices
5 Further Implementation Details

In order to interact with the scheduler, an interface is needed that applications can use to pass their requirements to the Linux kernel. As scheduling is performed completely within the Linux kernel, we realized a system call that provides access to the QoS Manager within the kernel, which itself modifies the scheduling of the available CPU resources or passes the request to the networking subsystem. Note that the following description is based on the Linux kernel version 2.3.49 for x86-based architectures (as mentioned earlier). Appendix A briefly summarizes the changes that were made in the different files.

5.1 Adding a New System Call

The new system call we added in our implementation has the following prototype:

int modify_qos(pid_t pid, int qosify, int deltaqos)

The implementation of a new system call in Linux is quite easy. This process is described in the following on the basis of our new system call modify_qos(). First, the new system call must be registered in the Linux kernel by extending the file arch/i386/kernel/entry.S. The sys_call_table in this file defines every system call of the Linux system. The new system call is registered by extending this table with .long SYMBOL_NAME(sys_modify_qos), which makes it the 218th system call in the Linux kernel (version 2.3.49). The second step is to assign the correct registration number of the system call as well as to define its prototype. This is done in the file include/asm/unistd.h. In order to define the corresponding number, we appended #define __NR_modify_qos 218 to the list of system call numbers. For defining the prototype (and, thus, propagating the definition of the system call in the kernel so it will also become accessible from outside the kernel), we used one of the _syscallx macros that are defined in the same file, where x specifies the number of arguments of the system call. For our system call modify_qos() we used

static inline _syscall3(int, modify_qos, pid_t, pid, int, qosify, int, deltaqos)
Note to include the return value and the name of the system call, respectively, in the list of arguments of the _syscall3() macro. These two steps are all that is necessary to add a new system call to your Linux system. Now you have to implement the functionality of your system call. In our example, we created the file kernel/qossched.c for the implementation of the modify_qos() system call. There are three topics that must be considered:

- The code runs completely in the privileged system mode, so the corresponding header files (that contain the necessary variables that are accessed) must be included within the new file. These header files can be found in the include path of your kernel tree. You can include them, e.g., via #include <linux/sched.h> or #include <net/ipv4.h>.

- The routine realizing the system call has to be defined in the following way:
  asmlinkage int sys_modify_qos(pid_t pid, int qosify, int deltaqos)

- For the compilation of the kernel (which is necessary when modifying system calls), the created file would not be considered unless the Makefile in the corresponding directory is extended. In our example, we have to add the object name of the file, i.e., qossched.o, to the O_OBJS variable of the Makefile in the kernel/ path.

Before you can use your system call, the dependencies for the kernel must be built and the complete kernel has to be compiled. Do not forget to copy the System.map file to the /boot directory so the new system call can be found while booting the QoS-based Linux kernel.

5.2 The Implementation of modify_qos()

As mentioned above, our system call modify_qos() is implemented in the file kernel/qossched.c with the following prototype:

asmlinkage int sys_modify_qos(pid_t pid, int qosify, int deltaqos)
The handling of this function is quite intuitive: pid specifies the process/task that should receive better/worse QoS support, qosify tells the QoS Manager to switch the QoS support for a task on (= 1) or off (= 0), and deltaqos defines the agility of the adaptation process, i.e., if deltaqos is +3, our QoS Manager increases the time slice of the specified task by 3 ticks. Note that deltaqos can also be negative, resulting in a degraded QoS support. For implementing the functionality of our system call, we used four routines (in the file kernel/qossched.c). call_schedule_func() is the callback function that will be called when the external timer expires. The frequency of this timer is defined by the constant TIMESLICE, which is set to 1 (= 1 tick = 10 ms). The callback function triggers the scheduler, which is done by
setting the need_resched variable of the current task to 1, and renews the external timer using the kernel function mod_timer(). Note that the scheduler should never be called immediately; instead, after a system call is executed, the need_resched flag is tested by the system and the scheduler will be called appropriately. This flag is defined for each task within the task structure. The second function is modify_task(), which is responsible for adapting the task structure if requested by the application. First, this function checks whether all arguments are valid (e.g., an existing process identification (PID) and a valid deltaqos parameter) and whether the access rights of the user are sufficient for performing this kind of modification. Currently, only root is allowed to modify the time slice of a task; otherwise you have to adapt the corresponding line of code. Finally, the qosified flag and the new value for the priority variable are assigned accordingly. The next function in the calling hierarchy is modify_child_tree(), which modifies all forked tasks of the specified task, i.e., if a task started several child tasks, the QoS support for the parent task is also extended to each child task. Therefore, this function is called recursively, performing a depth-first search over the complete tree with the parent process as its root. This function is called by the actual system call modify_qos() (with the arguments described above) in order to set the values for the corresponding parameters. Therefore, modify_qos() first has to find the task structure for the specified process identifier (PID); it then qosifies (or unqosifies) the corresponding task(s) and finally activates or deactivates the external timer. The return values of modify_qos() are defined as follows:

- -EINVAL: The PID of the specified process is invalid, i.e., it points to the idle task, which must not be modified.
- -ESRCH: The specified process does not exist in the system.
- -EPERM: Unauthorized access to a task, i.e., the user does not have sufficient access rights to modify the time slices of a task.
- 0: The task was unqosified successfully.
- >0: The task was qosified successfully, and the new size of the time slice is returned.

In order to use the functionality of the QoS support, the two files qossy.o (generated from qossy.c using gcc -Wall -c qossy.c -o qossy.o) and qossy.h can be used. qossy.h contains the prototype of the system call (thus, this file must be included in a QoS-based application), and qossy.o must be linked to that application. Within the code of the application, the system call can then simply be used by calling the function modify_qos() with the parameters specified above.
6 Bibliography

[1] UNIQuE Project (Universal Network Infrastructure with Quality of service Enhancements), University of Karlsruhe, Institute of Telematics, http://www.telematik.informatik.uni-karlsruhe.de/forschung/unique, 2001
[2] M. Bechler, H. Ritter, J. Schiller: Quality of Service in Mobile and Wireless Networks: The Need for Proactive and Adaptive Applications. 33rd Annual Hawai'i International Conference on System Sciences (HICSS), Maui, January 2000
[3] K. Nahrstedt, H. Chu, S. Narayan: QoS-aware Resource Management for Distributed Multimedia Applications. Journal on High-Speed Networking, Special Issue on Multimedia Networking, IOS Press, Vol. 8, No. 3-4, 1998, pp. 227-255
[4] L. C. Wolf, R. Steinmetz: Concepts for Resource Reservation in Advance. Special Issue of the Journal of Multimedia Tools and Applications on The State of The Art in Multimedia, Vol. 4, No. 3, May 1997
[5] D. A. Solomon, M. E. Russinovich: Inside Windows 2000, 3rd Edition. Microsoft Press, Redmond, 2000
[6] M. Bechler, H. Ritter, G. Schäfer, J. Schiller: Traffic Shaping in End Systems attached to QoS-supporting Networks. Proceedings of the 6th IEEE Symposium on Computers and Communications (ISCC 2001), July 2001
[7] A. Richards, G. Rogers, M. Antoniades, V. Witana: Mapping User Level QoS from a Single Parameter. Proceedings of the 2nd International Conference on Multimedia Networks and Services (MMNS '98), November 1998
[8] J. Gecsei: Adaptation in Distributed Multimedia Systems. IEEE Multimedia Magazine, April/June 1997
[9] R. Card, É. Dumas, F. Mével: The Linux Kernel Book. Wiley, Chichester/Weinheim, 1998
[10] R. W. Stevens: Advanced Programming in the UNIX Environment. Addison-Wesley, Massachusetts, 1997
[11] A. S. Tanenbaum: Modern Operating Systems. Prentice-Hall International, London, 1986
[12] B. Lewis, D. J. Berg: Threads Primer: A Guide to Multithreaded Programming. SunSoft Press, California, 1996
[13] IEEE Standard for Information Technology: Portable Operating System Interface (POSIX) Part 1: System Application Program Interface (API), Amendment 1: Realtime Extension [C Language].
Institute of Electrical and Electronics Engineers, New York, 1994
[14] D. A. Rusling: The Linux Kernel. Version 0.8-2, http://www.redhat.com, February 1998
[15] RT-Linux: http://www.rt-linux.com, 2000
[16] K. Hwang: Advanced Computer Architecture. McGraw-Hill, 1993
[17] Information on the Intel Pentium 4 processor, Intel Inc., http://www.intel.com/pentium4, 2001
Appendix A: Overview of the Modified Files

This section gives a brief overview of the files that were modified in order to achieve the QoS support. A more detailed description is given in section 4.2 and chapter 5. Within the files, modifications are edged with a sequence of # in the following way:

// #######
[Modifications]
// #######

As we use the opportunity to configure the QoS support in the kernel configuration, we have to include the linux/config.h file in each file that has been modified (see also Appendix B for more information).

include/linux/sched.h
Both the task structure (struct task_struct) and its initial definition (INIT_TASK) have to be extended by several components, states, and variables needed for the adaptive scheduler. Do not forget the INIT_TASK macro, which defines the initialization of each task that is created; this macro also has to be extended with the initialization values for the new parameters.

include/asm/unistd.h
Adds the new system call to the list of functions. Thus, applications can use the new system call if they include the header file qossy.h and use the qossy.o file for compilation. Otherwise, the applications themselves have to perform the includes and macro calls made in those two files.

arch/i386/kernel/entry.S
Adds the new system call to the definition list of all other system calls in this file, so it is registered and defined by the specified prototype.

kernel/sched.c
In this file the (modified) task scheduler is located. The relevant functions are goodness() and schedule().

kernel/qossched.c
In this file the functionality of the created system call is implemented. Note that qossched.c is not part of the original Linux kernel.

kernel/Makefile
Of course, the Makefile has to be extended for compiling the kernel/qossched.c file.

arch/i386/config.in
This file was modified to extend the configuration menu (menuconfig, xconfig, ...) with an entry to switch on the QoS support for the scheduler (reflected by the parameter CONFIG_QOS_SCHED). Within the Linux kernel, the statement #ifdef CONFIG_QOS_SCHED allows either switching on the support for quality of service or compiling the original Linux kernel (if switched off in the configuration menu). As a result, the value for CONFIG_QOS_SCHED is defined in include/linux/config.h and is set to 1. Thus, linux/config.h has to be included in each file that uses CONFIG_QOS_SCHED (after make menuconfig was executed).

arch/i386/defconfig
Defines the default values for the variables specified in arch/i386/config.in. The default value for each of the QoS-based variables is yes, i.e., CONFIG_QOS_SCHED is set to 1.
Appendix B: Getting Started

The installation of our QoS support for Linux is quite easy and similar to the installation process of other kernel patches. Unfortunately, the complete kernel needs to be rebuilt for the QoS-based scheduling, as central data structures used by several Linux kernel mechanisms are modified. Also, a reboot of the system will become necessary. First, take a 2.3.49 Linux kernel and patch this raw kernel with our QoS patch (qos-patch-2.3.49). Copy the QoS patch into the Linux directory (usually /usr/src/linux) and call the command

patch -p1 < qos-patch-2.3.49

After the installation of the QoS patch, the kernel has to be configured for your system and for the QoS support by calling a configuration program (e.g., make menuconfig). Choose the QoS support item in the menu and check that the QoS-based Scheduler item is activated. Note that the menu structure for the configuration menu is defined in the two files arch/i386/config.in (for the configuration of the menu structure) and arch/i386/defconfig (for the definition of the default values of each configuration parameter). The activation of the QoS-based Scheduling item sets the preprocessor variable CONFIG_QOS_SCHED, which is used within the kernel for the activation of the QoS-based scheduler. Thus, this variable is used to distinguish between the original mechanisms and the QoS-based mechanisms: with the item switched on, the QoS-based scheduler is used instead of the original one; with the item switched off, the original kernel will be built. Note that all modifications that change the functionality of the Linux kernel are edged using the #if[n]def CONFIG_QOS_SCHED query. Note also that the CONFIG_QOS_SCHED parameter is defined in the include/linux/config.h file; thus, linux/config.h must be included in every other file in which you use the CONFIG_QOS_SCHED parameter. After configuring the QoS support, the kernel needs to be compiled (including a new setup of the dependencies) and installed. For the building process, use the command

make dep clean bzImage
The new kernel image has to be installed (by copying it to the /boot directory) and activated, e.g., using lilo. Do not forget to copy the System.map file to the root or /boot directory, as a new system call was added to your Linux system. Finally, you have to reboot your machine and start the QoS-enabled kernel. Happy qosifying!
Appendix C: The Task Structure task_struct

The structure that defines a task can be found in the file include/linux/sched.h.

260 struct task_struct {
261 /* these are hardcoded - don't touch */
262 	volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */
263 	unsigned long flags;	/* per process flags, defined below */
264 	int sigpending;
265 	mm_segment_t addr_limit;	/* thread address space:
266 					   0-0xBFFFFFFF for user-thread
267 					   0-0xFFFFFFFF for kernel-thread
268 					 */
269 	struct exec_domain *exec_domain;
270 	volatile long need_resched;
271
272 	cycles_t avg_slice;
273 	int lock_depth;		/* Lock depth. We can context switch in and out of holding a syscall kernel lock... */
274 /* begin intel cache line */
275 	long counter;
276 	long priority;
277 	unsigned long policy;
278 /* memory management info */
279 	struct mm_struct *mm, *active_mm;
280 	int has_cpu;
281 	int processor;
282 	struct list_head run_list;
283 	struct task_struct *next_task, *prev_task;
284 	int last_processor;
285
286 /* task state */
287 	struct linux_binfmt *binfmt;
288 	int exit_code, exit_signal;
289 	int pdeath_signal;	/* The signal sent when the parent dies */
290 	/* ??? */
291 	unsigned long personality;
292 	int dumpable:1;
293 	int did_exec:1;
294 	pid_t pid;
295 	pid_t pgrp;
296 	pid_t tty_old_pgrp;
297 	pid_t session;
298 	/* boolean value for session group leader */
299 	int leader;
300 	/*
301 	 * pointers to (original) parent process, youngest child, younger sibling,
302 	 * older sibling, respectively.  (p->father can be replaced with
303 	 * p->p_pptr->pid)
304 	 */
305 	struct task_struct *p_opptr, *p_pptr, *p_cptr, *p_ysptr, *p_osptr;
306
307 /* PID hash table linkage. */
308 	struct task_struct *pidhash_next;
309 	struct task_struct **pidhash_pprev;
310
311 	wait_queue_head_t wait_chldexit;	/* for wait4() */
312 	struct semaphore *vfork_sem;		/* for vfork() */
313 	unsigned long rt_priority;
314 	unsigned long it_real_value, it_prof_value, it_virt_value;
315 	unsigned long it_real_incr, it_prof_incr, it_virt_incr;
316 	struct timer_list real_timer;
317 	struct tms times;
318 	unsigned long start_time;
319 	long per_cpu_utime[NR_CPUS], per_cpu_stime[NR_CPUS];
320 /* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
321 	unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
322 	int swappable:1;
323 /* process credentials */
324 	uid_t uid,euid,suid,fsuid;
325 	gid_t gid,egid,sgid,fsgid;
326 	int ngroups;
327 	gid_t groups[NGROUPS];
328 	kernel_cap_t cap_effective, cap_inheritable, cap_permitted;
329 	struct user_struct *user;
330 /* limits */
331 	struct rlimit rlim[RLIM_NLIMITS];
332 	unsigned short used_math;
333 	char comm[16];
334 /* file system info */
335 	int link_count;
336 	struct tty_struct *tty;	/* NULL if no tty */
337 /* ipc stuff */
338 	struct sem_undo *semundo;
339 	struct sem_queue *semsleeping;
340 /* CPU-specific state of this task */
341 	struct thread_struct thread;
342 /* filesystem information */
343 	struct fs_struct *fs;
344 /* open file information */
345 	struct files_struct *files;
346 /* signal handlers */
347 	spinlock_t sigmask_lock;	/* Protects signal and blocked */
348 	struct signal_struct *sig;
349 	sigset_t signal, blocked;
350 	struct signal_queue *sigqueue, **sigqueue_tail;
351 	unsigned long sas_ss_sp;
352 	size_t sas_ss_size;
353
354 /* Thread group tracking */
355 	u32 parent_exec_id;
356 	u32 self_exec_id;
357 /* Protection of fields allocation/deallocation */
358 	struct semaphore exit_sem;
359 };
Appendix D: Scheduling Source Code

The scheduling algorithm can be found in the file kernel/sched.c and is split into two relevant functions: goodness() and schedule().

static inline int goodness(struct task_struct * p, int this_cpu, struct mm_struct *this_mm)
112 {
113 	int weight;
114
115 	/*
116 	 * Realtime process, select the first one on the
117 	 * runqueue (taking priorities within processes
118 	 * into account).
119 	 */
120 	if (p->policy != SCHED_OTHER) {
121 		weight = 1000 + p->rt_priority;
122 		goto out;
123 	}
124
125 	/*
126 	 * Give the process a first-approximation goodness value
127 	 * according to the number of clock-ticks it has left.
128 	 *
129 	 * Don't do any other calculations if the time slice is
130 	 * over..
131 	 */
132 	weight = p->counter;
133 	if (!weight)
134 		goto out;
135
136 #ifdef __SMP__
137 	/* Give a largish advantage to the same processor... */
138 	/* (this is equivalent to penalizing other processors) */
139 	if (p->processor == this_cpu)
140 		weight += PROC_CHANGE_PENALTY;
141 #endif
142
143 	/* .. and a slight advantage to the current MM */
144 	if (p->mm == this_mm || !p->mm)
145 		weight += 1;
146 	weight += p->priority;
147
148 out:
149 	return weight;
150 }

436 asmlinkage void schedule(void)
437 {
438 	struct schedule_data * sched_data;
439 	struct task_struct *prev, *next, *p;
440 	struct list_head *tmp;
441 	int this_cpu, c;
442
443 	if (!current->active_mm) BUG();
444 	if (tq_scheduler)
445 		goto handle_tq_scheduler;
446 tq_scheduler_back:
447
448 	prev = current;
449 	this_cpu = prev->processor;
450
451 	if (in_interrupt())
452 		goto scheduling_in_interrupt;
453
454 	release_kernel_lock(prev, this_cpu);
455
456 	/* Do "administrative" work here while we don't hold any locks */
457 	if (softirq_state[this_cpu].active & softirq_state[this_cpu].mask)
458 		goto handle_softirq;
459 handle_softirq_back:
460
461 	/*
462 	 * 'sched_data' is protected by the fact that we can run
463 	 * only one process per CPU.
464 	 */
465 	sched_data = & aligned_data[this_cpu].schedule_data;
466
467 	spin_lock_irq(&runqueue_lock);
468
469 	/* move an exhausted RR process to be last.. */
470 	if (prev->policy == SCHED_RR)
471 		goto move_rr_last;
472 move_rr_back:
473
474 	switch (prev->state & ~TASK_EXCLUSIVE) {
475 		case TASK_INTERRUPTIBLE:
476 			if (signal_pending(prev)) {
477 				prev->state = TASK_RUNNING;
478 				break;
479 			}
480 		default:
481 			del_from_runqueue(prev);
482 		case TASK_RUNNING:
483 	}
484 	prev->need_resched = 0;
485
486 	/*
487 	 * this is the scheduler proper:
488 	 */
489
490 repeat_schedule:
491 	/*
492 	 * Default process to select..
493 	 */
494 	next = idle_task(this_cpu);
495 	c = -1000;
496 	if (prev->state == TASK_RUNNING)
497 		goto still_running;
498
499 still_running_back:
500 	list_for_each(tmp, &runqueue_head) {
501 		p = list_entry(tmp, struct task_struct, run_list);
502 		if (can_schedule(p)) {
503 			int weight = goodness(p, this_cpu, prev->active_mm);
504 			if (weight > c)
505 				c = weight, next = p;
506 		}
507 	}
508
509 	/* Do we need to re-calculate counters? */
510 	if (!c)
511 		goto recalculate;
512 	/*
513 	 * from this point on nothing can prevent us from
514 	 * switching to the next task, save this fact in
515 	 * sched_data.
516 	 */
517 	sched_data->curr = next;
518 #ifdef __SMP__
519 	next->has_cpu = 1;
520 	next->processor = this_cpu;
521 #endif
522 	spin_unlock_irq(&runqueue_lock);
523
524 	if (prev == next)
525 		goto same_process;
526
527 #ifdef __SMP__
528 	/*
529 	 * maintain the per-process 'average timeslice' value.
530 	 * (this has to be recalculated even if we reschedule to
531 	 * the same process) Currently this is only used on SMP,
532 	 * and it's approximate, so we do not have to maintain
533 	 * it while holding the runqueue spinlock.
534 	 */
535 	{
536 		cycles_t t, this_slice;
537
538 		t = get_cycles();
539 		this_slice = t - sched_data->last_schedule;
540 		sched_data->last_schedule = t;
541
542 		/*
543 		 * Exponentially fading average calculation, with
544 		 * some weight so it doesn't get fooled easily by
545 		 * smaller irregularities.
546 		 */
547 		prev->avg_slice = (this_slice*1 + prev->avg_slice*1)/2;
548 	}
549
550 	/*
551 	 * We drop the scheduler lock early (it's a global spinlock),
552 	 * thus we have to lock the previous process from getting
553 	 * rescheduled during switch_to().
554 	 */
555
556 #endif /* __SMP__ */
557
558 	kstat.context_swtch++;
559 	/*
560 	 * there are 3 processes which are affected by a context switch:
561 	 *
562 	 * prev == ... ==> (last => next)
563 	 *
564 	 * It's the 'much more previous' 'prev' that is on next's stack,
565 	 * but prev is set to (the just run) 'last' process by switch_to().
566 	 * This might sound slightly confusing but makes tons of sense.
567 	 */
568 	prepare_to_switch();
569 	{
570 		struct mm_struct *mm = next->mm;
571 		struct mm_struct *oldmm = prev->active_mm;
572 		if (!mm) {
573 			if (next->active_mm) BUG();
574 			next->active_mm = oldmm;
575 			atomic_inc(&oldmm->mm_count);
576 			enter_lazy_tlb(oldmm, next, this_cpu);
577 		} else {
578 			if (next->active_mm != mm) BUG();
579 			switch_mm(oldmm, mm, next, this_cpu);
580 		}
581
582 		if (!prev->mm) {
583 			prev->active_mm = NULL;
584 			mmdrop(oldmm);
585 		}
586 	}
587
588 	/*
589 	 * This just switches the register state and the
590 	 * stack.
591 	 */
592 	switch_to(prev, next, prev);
593 	__schedule_tail(prev);
594
595 same_process:
596 	reacquire_kernel_lock(current);
597 	return;
598
599 recalculate:
600 	{
601 		struct task_struct *p;
602 		spin_unlock_irq(&runqueue_lock);
603 		read_lock(&tasklist_lock);
604 		for_each_task(p)
605 			p->counter = (p->counter >> 1) + p->priority;
606 		read_unlock(&tasklist_lock);
607 		spin_lock_irq(&runqueue_lock);
608 	}
609 	goto repeat_schedule;
610
611 still_running:
612 	c = prev_goodness(prev, this_cpu, prev->active_mm);
613 	next = prev;
614 	goto still_running_back;
615
616 handle_softirq:
617 	do_softirq();
618 	goto handle_softirq_back;
619
620 handle_tq_scheduler:
621 	run_task_queue(&tq_scheduler);
622 	goto tq_scheduler_back;
623
624 move_rr_last:
625 	if (!prev->counter) {
626 		prev->counter = prev->priority;
627 		move_last_runqueue(prev);
628 	}
629 	goto move_rr_back;
630
631 scheduling_in_interrupt:
632 	printk("Scheduling in interrupt\n");
633 	*(int *)0 = 0;
634 	return;
635 }
Appendix E: Proof of the Maximum Threshold

In order to show that there is a threshold for the maximum size of the time slice of a waiting task, we use a proof by induction over the number of re-initializations. The calculation of a task's time slice ts_n follows the recursion

ts_1 = P, \qquad ts_{n+1} = P + \frac{ts_n}{2}

Proposition:

\lim_{n \to \infty} ts_n \le 2P

Base case:

ts_1 = P \le 2P

Induction step (insert the recursion, then the proposition):

\lim_{n \to \infty} ts_{n+1} = \lim_{n \to \infty} \left( P + \frac{ts_n}{2} \right) \le P + \frac{2P}{2} = P + P = 2P

q.e.d.
Appendix F: Derivation of the Percentage \Delta A_i

A task's percentage of the available CPU resources is defined by

A_i = \frac{P_i}{\sum_{j=1}^{n} P_j}

After the priority of task i was altered by \Delta_i [ticks], the basic approach for the calculation of the relative percentage \Delta A_i is the following (note that P_i is the original priority):

A_i' = A_i + \Delta A_i = \frac{P_i + \Delta_i}{\sum_{j=1}^{n} P_j + \Delta_i}

Solving for \Delta A_i and replacing A_i yields

\Delta A_i = \frac{P_i + \Delta_i}{\sum_{j=1}^{n} P_j + \Delta_i} - \frac{P_i}{\sum_{j=1}^{n} P_j}
= \frac{(P_i + \Delta_i) \sum_{j=1}^{n} P_j - P_i \left( \sum_{j=1}^{n} P_j + \Delta_i \right)}{\left( \sum_{j=1}^{n} P_j \right) \left( \sum_{j=1}^{n} P_j + \Delta_i \right)}
= \frac{\Delta_i \left( \sum_{j=1}^{n} P_j - P_i \right)}{\left( \sum_{j=1}^{n} P_j \right) \left( \sum_{j=1}^{n} P_j + \Delta_i \right)}

q.e.d.