Adaptive Allocation of Software and Hardware Real-Time Tasks for FPGA-based Embedded Systems



Rodolfo Pellizzoni and Marco Caccamo
Department of Computer Science, University of Illinois at Urbana-Champaign
{rpelliz2,mcaccamo}@uiuc.edu

Abstract

Operating systems for reconfigurable devices enable the development of embedded systems where software tasks, running on a CPU, can coexist with hardware tasks running on a reconfigurable hardware device (FPGA). Furthermore, in such systems relocatable tasks can be migrated from software to hardware and vice versa. The combination of the high performance and predictability of hardware execution with software flexibility makes such an architecture especially suitable for implementing high-performance real-time embedded systems. In this work, we first discuss design and scheduling issues for relocatable tasks. We then concentrate on the on-line admission control problem. Task allocation and migration between the CPU and the reconfigurable device are discussed, and sufficient feasibility tests are derived. Finally, the effectiveness of our relocation strategy is shown through a series of synthetic simulations.

1 Introduction

As systems-on-chips (SoCs) become more widely used due to their improved performance, in terms of both speed and power consumption, reconfigurable devices, and in particular field-programmable gate arrays (FPGAs), are becoming more and more popular in the development of embedded systems where issues such as short time-to-market and update capabilities after deployment are critical. Recent developments in the field of operating systems for reconfigurable devices (OSRD) [18, 25, 26] enable a highly dynamic use of partially reconfigurable FPGAs, running multiple concurrent circuits (hardware tasks) with full multitasking capabilities.
Furthermore, the introduction of embedded devices comprised of an FPGA and one or possibly several CPUs makes it possible to run both software and hardware tasks on the same silicon device, achieving even greater flexibility. While a lot of work has been done on the design of suitable operating system abstractions and on the development of working prototypes for OSRD, much more remains to be done to obtain a fully usable platform. (This work is supported in part by NSF grant CCR-0237884, NSF grant CCR-0325716, and NSF CNS-050268.) In particular, the important topic of real-time resource management has received little attention. In this work, we first introduce our vision for a reconfigurable platform that enables relocation (i.e. task migration between software and hardware) as a way to improve the system's ability to cope with dynamic workloads. We then propose a novel allocation and admission control scheme that is able to improve the usage of system resources while preserving all timing constraints. In particular, our main contribution is the development of a relocation scheme, with proven feasibility conditions, that is suitable for real-time applications. The paper is organized as follows. In Section 2 we introduce our system abstraction, discussing its applicability and practical limitations, and we further describe our resource management scheme. In Sections 3 and 4 we present our solutions to the allocation and relocation problems, providing simulation results in Section 5. Finally, in Section 6 we discuss related work, and in Section 7 we provide concluding remarks and future work.

2 System Model

We consider a system comprised of a general purpose CPU and a partially Reconfigurable Device (RD), together with main memory and I/O devices. Modern devices, like the Xilinx Virtex-II Pro and Virtex-4 families of FPGAs [27], implement all of the above on a single configurable SoC. An OSRD is used to manage the entire system; prototypes have been proposed in [18, 25, 26].
Tasks can be provided to the system in both a software and a hardware configuration. The software configuration is a traditional software program that runs on the CPU, while the hardware configuration is implemented as a hardware circuit on the RD. Codesign tools can be used to generate both configurations given an initial specification in a high-level language [10]. Since the RD is partially reconfigurable, it is possible to reconfigure a single hardware task at run-time by downloading its configuration data (known as a bitstream) without affecting the remaining hardware configurations. Tasks are relocatable, i.e. they can migrate from software to hardware and vice versa (relocation is implemented by [18]).

Tasks can be dynamically activated and terminated. Furthermore, we assume that tasks are subject to real-time constraints, i.e. once activated they are periodically executed and each task instance must terminate before a given deadline. We believe that this architectural model suits a variety of systems, including Micro unmanned Aerial Vehicles (MAVs) [], wearable computing [20] and sensor networks for complex tracking and surveillance applications [22], which exhibit characteristics that make solutions based on single or multiple CPUs, or on fixed hardware, unsuitable:

Dynamic workload with high computational demands: As an example, in applications based on multiple unmanned aerial vehicles coordinating on a global mission, each vehicle must perform multiple concurrent control tasks together with complex multi-vehicle coordination [12], wireless processing and data aggregation, sensor processing, and target tracking and localization. The workload is extremely dynamic, depending on both the vehicle and the mission status. Due to multiple target tracking and multiple surveillance objectives, different tasks contend for system resources. Wearable computing and sensor networks for tracking applications share similar characteristics (dynamic tasks with event-based workload surges); moreover, due to their long deployment times, software updates are often necessary. Because of these intrinsically dynamic aspects, static task allocation on fixed hardware is hardly possible. At the same time, general purpose processors cannot provide the required level of performance. The proposed model can constitute a valid solution, combining the flexibility of general purpose systems with the performance of hardware solutions; in particular, the advantages of relocation are thoroughly discussed in [17, 20].
Energy and cost constraints: All the proposed systems are severely energy constrained, since they draw power from either batteries or solar cells; the amount of energy used for computation is a significant percentage of the total energy consumption (see [] for details on MAVs). FPGAs have been proven to provide better performance and to be more power efficient than both general purpose and application specific processors for a variety of applications [20]. Furthermore, all the proposed systems are deployed in large numbers and are therefore cost-sensitive. The use of a single SoC including a high performance FPGA can easily replace a number of discrete chips, thus lowering board complexity and helping to reduce costs. Both energy and cost constraints imply that the available computational resources must be used efficiently, i.e. by maximizing the amount of computation (tasks) that the system can handle.

Real-time constraints: MAV applications, wearable computing systems and sensor networks can include critical monitoring and targeting tasks for which proven delay bounds must be guaranteed. Furthermore, flight control on MAVs is implemented through hard real-time tasks.

Following the discussion above, the overall goal of our resource management scheme can be stated as: maximize the number of tasks simultaneously running in the system, while guaranteeing all real-time constraints. Therefore, we need to provide an admission control test: every time a task is presented to the system, we run the test to check whether the new task can be admitted while guaranteeing all the already running tasks. We introduce the details of our management scheme in Section 2.3, after discussing some key model limitations in Section 2.1 and our task model in Section 2.2. It is important to note that our management goal assumes that, from a computational point of view, the software and hardware configurations of a task provide equal performance.
In other applications (for example, a multimedia terminal) it can be more useful to consider hardware configurations providing better service compared to the corresponding software configurations. In this case the overall system goal would be to maximize the quality of service perceived by the user. Although we do not consider such a scenario in this work, we are currently investigating it, and our current results show that the admission control scheme can be applied unchanged to this further case.

2.1 Model Limitations

When concerned with practical implementation of the proposed abstraction, several limitations of currently available reconfigurable devices and operating systems need to be considered. First of all, hardware configurations must be constrained to rectangular areas. Two area models are employed by current OSRD prototypes. In the slotted area model, the device is divided into a series of slots, each of which has the same dimensions. Each task is partitioned, by means of suitable design tools, into some number of slots, which can be positioned anywhere on the device. The slotted area model incurs internal fragmentation: some area on the device can be wasted if the area occupied by a task is not a multiple of the defined slot area. The slotted area model is employed by [17, 18]. In the 1D area model, each task occupies a rectangular area on the device. The vertical dimension is fixed and spans the height of the device, while the horizontal dimension can vary. The 1D area model incurs both internal and external fragmentation: the total area available on the device can be greater than the area required by a task, but placing the task can be impossible if the area is divided into smaller, unconnected stripes. The 1D area model is employed by [25]. A more complex 2D area model is also discussed in the literature but, to the best of our knowledge, no working OSRD prototype is able to employ such a model. In this work, we will only consider the slotted area model.
Task communication is also a major issue, and different solutions, including buses and packet-switched networks, have been proposed [25, 15]. Communication is particularly critical in the slotted model, since slots pertaining to the same task need strict synchronization. Real-time constraints for bus-based systems are introduced in [6]. Since in this work we are mainly concerned with the management and relocation problem, we will assume that the system provides enough communication resources to meet the needs of all tasks, and we reserve a more thorough analysis for future work.

An important issue regards the reconfiguration capability of the RD. Each time a new task is started on the device, its bitstream needs to be loaded into the device's configuration SRAM through the configuration interface; while a bitstream is being downloaded, the area it occupies clearly cannot be used (other tasks can still run undisturbed). The load time is proportional to the task area; for modern, large devices it is not negligible, on the order of tens of milliseconds to reconfigure a task that occupies the entire device [23]. This imposes severe constraints on how hardware tasks are managed. In particular, hardware tasks cannot be scheduled like periodic software tasks. Consider the slotted area model and suppose that hardware tasks are scheduled like periodic tasks, i.e. each hardware task is defined by a period and an execution time and is periodically activated. For simplicity, assume that all tasks occupy only one slot and have the same period p and execution time e. Let T_rec be the time needed to reconfigure the entire device, and t_rec = T_rec / A be the time needed to reconfigure a single slot, where A is the total number of slots. If we serialize slot reconfigurations, while a slot is reconfigured in t_rec time all other tasks can keep running; therefore, we define U as the task utilization (e + t_rec) / p. It is then easy to see that if we want to keep the device constantly busy, we need to reconfigure the entire device 1/U times every p seconds, thus the following inequality must hold: p >= T_rec / U. Supposing a typical time T_rec = 50 ms [23] and U = 1/4, we cannot achieve frequencies greater than 5 Hz. Therefore, in order to reduce the reconfiguration overhead we will impose that each hardware configuration executes for the entirety of its period, so that no reconfiguration is needed if no new task is activated. This is not a major limitation, since different synthesis parameters in tools permit a tradeoff between occupied area and execution time.
This means that although the hardware configuration executes longer than the software one, it occupies a much smaller area than the one needed by an equivalent CPU dedicated to running the software configuration. The last issue regards hardware/software relocation. While suspending and migrating a software task between homogeneous CPUs is relatively easy, since the state of a software task can easily be saved, saving the state of a hardware task is more complex, since it involves saving the state of all its internal registers. While this is not technically impossible [21], it can nevertheless incur an unbearable overhead. Instead, a different approach to relocation will be used. We assume that each task, in either its software or hardware configuration, eventually reaches a point at which the execution of its next periodic instance does not depend on the state of the task after the completion of its previous instance, i.e. no internal state must be preserved between two successive activations. When this point is reached, a task can be relocated at the end of its period. Note, however, that reconfiguration constraints must be taken into account: while we can usually safely assume that starting a task on the CPU takes zero time, this is not true for the RD. Therefore, the OSRD must first begin loading the task bitstream into the RD, which can possibly last for multiple task periods. When the loading operation completes and the end of a period is reached, the software configuration is terminated on the CPU and the hardware configuration is started on the RD. If a stateless point is never reached, we can add some additional logic to the task in order to save and restore state between instance activations. The resulting overhead is still much lower than that of permitting the state of the task to be saved at any time [16].
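The reconfiguration-rate bound derived above is easy to check numerically. The following sketch (variable and function names are ours, purely for illustration) reproduces the 5 Hz figure for T_rec = 50 ms and U = 1/4:

```python
# Numeric check of the bound p >= T_rec / U derived above: with a
# 50 ms full-device reconfiguration time [23] and task utilization
# U = (e + t_rec) / p = 1/4, the maximum activation frequency is 5 Hz.

def min_period(t_rec_full: float, utilization: float) -> float:
    """Smallest period p (in seconds) satisfying p >= T_rec / U."""
    return t_rec_full / utilization

T_REC = 0.050   # full-device reconfiguration time T_rec, in seconds
U = 1 / 4       # per-task utilization

p_min = min_period(T_REC, U)
print(p_min, 1.0 / p_min)   # 0.2 s minimum period, i.e. at most 5.0 Hz
```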
2.2 Task Model

Each relocatable task τ_i is defined by a period p_i, a relative deadline D_i and two configurations: τ_i^s (software), defined by an execution time e_i, and τ_i^h (hardware), defined by an area a_i. We assume relative deadlines equal to periods, i.e. for all i, D_i = p_i. The execution time of a software configuration can be either a worst-case parameter (for hard tasks) or an average-case parameter (for soft tasks). Furthermore, let U_i = e_i / p_i be the task's software utilization. Hardware configurations have no associated execution time: each periodic instance (also called a job) of a hardware configuration runs for the entire period. Since hardware configurations cannot be preempted, they always meet their deadlines as long as configuration changes (relocations) are only allowed between jobs. The area parameter depends on the area model of the RD: under the slotted area model, we denote with A the total number of slots on the RD and with a_i the number of slots occupied by τ_i^h. We assume that communication among tasks follows a synchronous dataflow approach, i.e. all inputs to a job are made available by the OSRD before the job starts, and all outputs are propagated at the end of the job to subsequent tasks in the data graph. The dataflow model has several advantages. First, it enables transparency between hardware and software configurations, since all data can be held in buffers managed by the operating system. Second, many commercially available languages and tools for hardware specification follow the dataflow model [3, 11]. Third, there is no need to account for blocking time due to critical sections during the execution of a task. Finally, careful placement of buffers takes care of delays in data propagation along the communication infrastructure; in particular, precedence constraints among successive tasks can be removed by buffering one full task period.
Software tasks can be scheduled on the CPU using any real-time scheduler with proven schedulability bounds and suitable isolation mechanisms. In this paper we will consider the EDF scheduler [14] in conjunction with the well-known Constant Bandwidth Server [1]. The CBS provides isolation between hard and soft tasks, so that all jobs of hard tasks are proven to complete within their deadlines if a feasibility condition is met. For a fixed task set S of software tasks, the following is a sufficient and necessary feasibility condition, provided that kernel overhead is included in task execution times:

U = Σ_{τ_i ∈ S} U_i <= 1, (1)

where U is known as the total software utilization.
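Condition (1) reduces to a one-line utilization check; the following minimal sketch (names are ours, assuming each task is given as its pair (e_i, p_i)) illustrates it:

```python
# Sufficient and necessary EDF feasibility test (condition (1)):
# a fixed set S of software tasks is schedulable iff sum(e_i / p_i) <= 1.

def edf_feasible(tasks) -> bool:
    """tasks: iterable of (e_i, p_i) pairs (execution time, period)."""
    total_utilization = sum(e / p for e, p in tasks)
    return total_utilization <= 1.0

# Example: U = 2/10 + 5/20 + 9/30 = 0.75 <= 1, hence feasible.
print(edf_feasible([(2, 10), (5, 20), (9, 30)]))   # True
```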

τ_i : i-th task
τ_i^s : i-th task, software configuration
τ_i^h : i-th task, hardware configuration
p_i : task period
D_i : task relative deadline
e_i : software configuration execution time
U_i : task utilization
a_i : hardware configuration area
A : RD area
T : task set
S : set of tasks in software configuration
H : set of tasks in hardware configuration
A_T = {S, H} : allocation for task set T
U_T = Σ_{τ_i ∈ T} U_i : total utilization of tasks in T
U_A = U_S : total utilization of tasks in software configuration
a_T = Σ_{τ_i ∈ T} a_i : total area of tasks in T

Table 1. System notation

In the same way, in order to be schedulable on the RD, hardware configurations must meet placement constraints.

Definition 1 (Slotted feasible placement) For the slotted area model, given a set H of hardware configurations scheduled on the RD, we say that their placement is feasible iff:

Σ_{τ_i ∈ H} a_i <= A. (2)

Tasks can dynamically join and leave the system. The activation time of a task corresponds to the activation of its first job; the termination time of a task corresponds to the deadline of its last job. At any time t, T(t) is the set of currently active tasks. Furthermore, let S(t) be the set of software tasks running on the CPU at time t and H(t) be the set of hardware tasks placed on the RD; then A_T(t) = {S(t), H(t)} is the allocation for T(t) iff S(t) ∪ H(t) = T(t). Hence, the allocation of a task set defines how tasks are partitioned between the CPU and the RD. T is said to be feasible iff each job of the tasks in T gets executed on either the CPU or the RD, and A_T(t) results in both a feasible schedule and a feasible placement. Table 1 summarizes the notation used throughout this work.

2.3 Management Scheme

The following overall management strategy will be used. When a task or a group of tasks arrives in the system, an admission test is run to determine whether it can be admitted. If the test succeeds, the task is immediately activated on the CPU; in fact, loading a hardware configuration on the RD would delay the activation of the task.
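The placement side of this admission machinery is a direct check of condition (2) plus the requirement that no task is allocated to both resources at once. A minimal sketch (function and parameter names are ours):

```python
# Slotted-model allocation check: in an allocation A_T = {S, H} a task
# runs on exactly one resource, and the placement of H is feasible iff
# the occupied slots do not exceed the RD area A (condition (2)).

def allocation_feasible(sw_ids, hw_ids, hw_areas, total_slots) -> bool:
    """sw_ids, hw_ids: task ids allocated to CPU and RD respectively;
    hw_areas: id -> a_i (slots occupied); total_slots: A."""
    if sw_ids & hw_ids:                        # a task cannot be on both
        return False
    used = sum(hw_areas[i] for i in hw_ids)    # sum of a_i over H
    return used <= total_slots                 # condition (2)

areas = {1: 2, 2: 3, 3: 1}
print(allocation_feasible({1}, {2, 3}, areas, 4))   # True: 3 + 1 <= 4
```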
After the new task is activated on the CPU, or whenever a task terminates, the system performs a relocation phase. The goal of the relocation phase is to relocate tasks, including the newly admitted one, in order to minimize the total software utilization while preserving all feasibility constraints. We feel that this optimization objective is sensible for multiple reasons:

- Since newly activated tasks are admitted on the CPU to avoid the RD configuration overhead, minimizing the CPU utilization maximizes the probability of passing the admission test.
- Although we only consider relocatable tasks in this work, real systems would probably also comprise software-only tasks that cannot be placed on the RD.
- Although we are only concerned with the admission control problem in this work, we can envision situations in which hardware configurations provide services with better performance and lower power consumption compared to the corresponding software configurations.
- The OS needs to run both the admission test and further computations to drive the relocation phase and to load hardware configurations. This added overhead can be considered as an added utilization term on the CPU.

We split the problem as follows. In Section 3, we discuss the problem of finding an optimal allocation for a given task set under the slotted area model, assuming that no task is already running and therefore no relocation is required. In the subsequent Section 4 we show how a pseudo-optimal solution can be used to drive the relocation phase. Due to space constraints, theorem proofs are not reported; they can be found in [1].

3 Allocation Problem

Given a task set T of relocatable tasks, the optimal allocation problem consists in determining the feasible allocation A_T that minimizes the total software utilization on the CPU, supposing that no task is already running in the system. The problem can be stated as an integer linear programming optimization problem.
Let us introduce, for each task τ_i in T, two indicator variables r_i and c_i: r_i is set to one if τ_i is placed on the RD, while c_i is set to one if the task is scheduled on the CPU. The optimal allocation problem can then be represented as follows:

Definition 2 (ILP ALLOC) Minimize Σ_{τ_i ∈ T} c_i U_i, subject to the following constraints and the restriction that the variables r_i, c_i take integer values only:

for all τ_i ∈ T, c_i + r_i = 1 (3)
Σ_{τ_i ∈ T} r_i a_i <= A (4)
for all τ_i ∈ T, 0 <= r_i <= 1 (5)
for all τ_i ∈ T, 0 <= c_i <= 1 (6)

Lemma 1 ([1]) Any optimal solution to ILP ALLOC is an optimal solution for the allocation problem under the slotted area model, supposing that no task is already running.

Now note that since for all τ_i, r_i + c_i = 1, min Σ_{τ_i ∈ T} c_i U_i = min Σ_{τ_i ∈ T} (1 - r_i) U_i = Σ_{τ_i ∈ T} U_i - max Σ_{τ_i ∈ T} r_i U_i. Therefore, the ILP ALLOC problem can be restated as the following equivalent ILP KNAP problem:

Definition 3 (ILP KNAP) Maximize Σ_{τ_i ∈ T} r_i U_i, subject to the following constraints and the restriction that the variables r_i take integer values only:

Σ_{τ_i ∈ T} r_i a_i <= A (7)
for all τ_i ∈ T, 0 <= r_i <= 1 (8)

Problem ILP KNAP is in the form of the well-known 0-1 KNAPSACK problem [13], which is known to be NP-hard in the weak sense. This means that pseudo-polynomial exact algorithms exist for the problem. However, since we are required to solve the allocation problem at run-time, even pseudo-polynomial algorithms can be excessively costly. Furthermore, as we will discuss in Section 4.1, using an optimal algorithm does not lead to a significant increase in performance. We will therefore use the simple greedy algorithm for 0-1 KNAPSACK to obtain a pseudo-optimal solution. The greedy algorithm works as follows, where R is used as a helper variable:

- Order all tasks in decreasing order of U_i / a_i.
- Assign R <- A.
- Starting from the first task τ_i in the defined order: if R >= a_i, then set r_i <- 1 and R <- R - a_i; else set r_i <- 0. Proceed to the next task.

Since we need to order the tasks, the complexity of the algorithm is O(N log(N)), where N is the number of tasks in T. In order to characterize the performance of the algorithm, let LP KNAP be the linear relaxation of ILP KNAP (obtained by removing the constraint that the r_i be integer), let OPT(ILP KNAP) and OPT(LP KNAP) be the optimal solutions to the ILP KNAP and LP KNAP problems respectively, and let GREEDY(ILP KNAP) be the greedy solution to ILP KNAP. Furthermore, let τ_c be the critical task, i.e. the first task, in decreasing order of U_i / a_i, such that r_c = 0 in the greedy solution.
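The greedy heuristic above can be sketched in a few lines (a minimal version with our own names; tasks are given as (U_i, a_i) pairs):

```python
# Greedy 0-1 KNAPSACK heuristic for ILP KNAP: visit tasks in decreasing
# order of U_i / a_i and place each one on the RD if it still fits; the
# residual CPU utilization is sum(U_i) minus the packed utilization.

def greedy_allocate(tasks, total_slots):
    """tasks: list of (U_i, a_i) pairs. Returns (RD task indices, CPU util)."""
    order = sorted(range(len(tasks)),
                   key=lambda i: tasks[i][0] / tasks[i][1],
                   reverse=True)                  # decreasing U_i / a_i
    on_rd, remaining = set(), total_slots         # R <- A
    for i in order:
        if tasks[i][1] <= remaining:              # R >= a_i: set r_i <- 1
            on_rd.add(i)
            remaining -= tasks[i][1]
    cpu_util = sum(u for j, (u, _) in enumerate(tasks) if j not in on_rd)
    return on_rd, cpu_util

rd, u = greedy_allocate([(0.5, 2), (0.4, 1), (0.3, 3)], 3)
print(sorted(rd), u)   # [0, 1] 0.3
```

The task set is then admitted iff the returned CPU utilization is at most one.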
It can be seen that the only difference between the OPT(LP KNAP) and GREEDY(ILP KNAP) solutions is that while τ_c is partially placed on the RD in the optimal linear solution, the greedy algorithm places it entirely on the CPU. Therefore, since OPT(LP KNAP) >= OPT(ILP KNAP) >= GREEDY(ILP KNAP), the following inequalities hold:

OPT(LP KNAP) >= GREEDY(ILP KNAP) > OPT(LP KNAP) - U_c (9)
OPT(ILP KNAP) >= GREEDY(ILP KNAP) > OPT(ILP KNAP) - U_c (10)

Given GREEDY(ILP KNAP), the total CPU utilization U can be computed as Σ_{τ_i ∈ T} (1 - r_i) U_i = Σ_{τ_i ∈ T} U_i - GREEDY(ILP KNAP). The task set can then be admitted if U <= 1. However, if some tasks are already running in the system, then a relocation phase is needed to reach the newly computed allocation A_T. The following section details the relocation phase and how to combine it with the admission test.

4 Relocation Phase

In this section we discuss how task relocation can be performed without violating any feasibility constraints under the slotted area model. We consider a general relocation problem of the following type: given a task set T, relocatable task sets T_{S→H} ⊆ S and T_{H→S} ⊆ H, and an allocation A_T = {S, H}, we want to relocate tasks in order to obtain a new allocation A'_T = {S', H'}, where S' = (S \ T_{S→H}) ∪ T_{H→S} and H' = (H \ T_{H→S}) ∪ T_{S→H}. Hence, S ∩ S' and H ∩ H' represent the sets of tasks that are kept on the CPU and RD respectively, while T_{S→H} and T_{H→S} represent the sets of tasks that are relocated from the CPU to the RD and from the RD to the CPU respectively. RD constraints must be considered when performing relocation. Consider a simple example in which T_{H→S} = {τ_i}, T_{S→H} = {τ_j}, τ_i and τ_j have the same area, and the slotted area model is used. Also suppose that the RD area is fully occupied. Then in order to perform relocation we really need to swap the two tasks between CPU and RD. However, tasks can only be relocated at the beginning of a job, and there may be no time instant at which two jobs of τ_i, τ_j start simultaneously. Furthermore, reconfiguring the device takes time.
Therefore, the only feasible approach is as follows: first, at the beginning of some job, τ_i^s is activated on the CPU while τ_i^h is suspended. Then, the bitstream of τ_j^h is loaded into the device. Finally, at the beginning of some job, τ_j^s is terminated and τ_j^h is started. Note that for some time both tasks' software configurations are running on the CPU. This implies that relocation incurs an overhead in terms of CPU utilization, in the sense that in order to perform relocation in a feasible way we need to leave some free computational power on the CPU, so that an additional software configuration can be feasibly scheduled. Feasibility constraints for software tasks are typically expressed for fixed task sets, while in our case the set of active software configurations S running on the CPU frequently changes. However, it can be trivially proven that, if a software configuration is considered to be active on the CPU until the deadline of its last software job, then the classic EDF utilization bound

for all t >= 0, U_{S(t)} <= 1 (11)

can still be applied. Relocating tasks that have different areas is more difficult. We can clearly always perform relocation by first activating on the CPU the software configurations of all tasks

executed on the RD and then reconfiguring the whole RD, but this is highly unlikely to be possible without violating software feasibility. We will therefore use the following idea: first, we partition both T_{S→H} and T_{H→S} into an equal number of sets of tasks that we call swapping groups, and we further create pairs of such swapping groups. Then, for each swapping pair, we perform relocation in a way similar to the single-task case described before: first, we activate the software configurations of all tasks in the pair's swapping group from T_{H→S}; then, we load and activate all hardware configurations in the swapping group from T_{S→H}. The key concept is that we build swapping groups in such a way as to minimize the CPU overhead required by the relocation process.

In order to determine the swapping groups in a way that is consistent and simple enough to be applicable at runtime, we impose further constraints on task area. In particular, the task area can only be chosen among a defined set of K areas {a_1, ..., a_K}, such that the area of the device is a multiple of a_K and, for 1 < k ≤ K, a_k is a multiple of a_{k−1}. For example, for a typical value of A = 96 and K = 6, {a_1 = 1, a_2 = 3, a_3 = 6, a_4 = 12, a_5 = 24, a_6 = 48} are possible values. While this may seem a major limitation, the system designer is free to choose K and the value set based on the tasks in the system; furthermore, following the dataflow model, each task can be decomposed into possibly several subtasks to better fit the area constraints (tools often provide functionality to partition logical functions in hardware).

Note that in both allocations A and A′ the RD may not be fully utilized, i.e. some space can be unallocated. Since this complicates the analysis, we solve the problem by introducing the concept of placeholder task. A placeholder task τ_i is by definition a task with a_i = 1, U_i = 0 and no associated bitstream/code. A placeholder task never executes: it is merely used to mark a certain area on the RD as occupied.
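The quantized-area constraint and the placeholder count can be checked mechanically. The sketch below uses our own function names and the reconstructed example values (device area 96 with K = 6), which are assumptions for illustration:

```python
# Sketch of the quantized-area constraint: every admissible area must divide
# the next larger one, and the device area must be a multiple of the largest.
def valid_area_set(areas, device_area):
    areas = sorted(areas)
    chain_ok = all(larger % smaller == 0
                   for smaller, larger in zip(areas, areas[1:]))
    return chain_ok and device_area % areas[-1] == 0

# Placeholder tasks (area 1, utilization 0) pad a relocation set up to the
# free RD area so that both swap directions cover the same number of slots.
def num_placeholders(device_area, relocated_area, kept_hw_area):
    return device_area - relocated_area - kept_hw_area

assert valid_area_set([1, 3, 6, 12, 24, 48], 96)   # the example values above
assert not valid_area_set([1, 3, 5], 96)           # 5 is not a multiple of 3
```

The divisibility chain is what guarantees that smaller groups can always be merged exactly into groups of the next size during GROUP PARTITION.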
We can then define new task sets T̄_{S→H} and T̄_{H→S} as follows:

Definition 4 Given task set T_{S→H} (respectively T_{H→S}), T̄_{S→H} (T̄_{H→S}) is the task set comprised of all the tasks in T_{S→H} (T_{H→S}) plus A − a_{T_{S→H}} − a_{T_H∩T_H′} (A − a_{T_{H→S}} − a_{T_H∩T_H′}) placeholder tasks.

Lemma 2 ([19]) For each allocation A′ = {T_S′, T_H′}:

a_{T̄_{S→H}} = a_{T̄_{H→S}} (12)

Note that since the allocation algorithm tries to place as many tasks as possible on the RD, the number of placeholder tasks is generally small. We can now define our swapping groups as follows:

Lemma 3 ([19]) Let a_max = max_{τ_i} {a_i}. Then each of T̄_{S→H}, T̄_{H→S} can be partitioned into M = ⌊a_{T̄_{S→H}}/a_max⌋ sets {S_1, ..., S_M}, {S′_1, ..., S′_M} of area a_max and at most one leftover set S_{M+1}, S′_{M+1} of area a_{T̄_{S→H}} mod a_max (by Lemma 2, M is the same for both sets); furthermore, if a_min = min_{τ_i} {a_i}, then a_{T̄_{S→H}} mod a_max ≤ a_max − a_min and a_{T̄_{H→S}} mod a_max ≤ a_max − a_min.

Note that the above lemma also suggests a constructive way to build the swapping groups; hence, algorithm GROUP PARTITION can be defined as follows: starting from the smallest tasks, at each step we group them so as to form tasks of the immediately greater size, placing the leftovers aside. We continue grouping until we reach the size of the maximum area, and we combine all leftovers to produce the unique leftover group. Thanks to the introduction of placeholder tasks, Lemmas 2 and 3 ensure that the resulting groups for T̄_{S→H} and T̄_{H→S} are of the same size. Note that GROUP PARTITION has a complexity of O(N²), where N is the total number of tasks in the set we are partitioning: after sorting all tasks by area in O(N log N), at each step the number of newly produced groups is at most half the number of tasks for that step, hence the quadratic complexity follows.

Once the swapping groups have been created, we need to define swapping pairs. The two leftover groups S_{M+1}, S′_{M+1} constitute a pair. Furthermore, suppose that the M swapping groups S_1, ..., S_M of T̄_{S→H} are arranged such that ∀k, 1 ≤ k < M: U_{S_k} ≥ U_{S_{k+1}}, and similarly that the M swapping groups S′_1, ..., S′_M of T̄_{H→S} are arranged such that ∀k, 1 ≤ k < M: U_{S′_k} ≤ U_{S′_{k+1}}. We can then form pairs {P_1 = (S_1, S′_1), ..., P_M = (S_M, S′_M)} and swap groups one pair at a time, starting from P_1 up to P_M. The following theorems express sufficient feasibility conditions for relocation.

Theorem 4 ([19]) Under the slotted area model, consider allocations A = {T_S, T_H}, A′ = {T_S′, T_H′} and the associated swapping pairs {P_1, ..., P_M}, with no leftover group. Then the following are sufficient feasibility conditions to relocate A to A′:

1. U_A + U_{S′_1} ≤ 1;
2. U_{A′} + U_{S_M} ≤ 1.

Theorem 5 ([19]) Under the slotted area model, consider allocations A = {T_S, T_H}, A′ = {T_S′, T_H′}, the associated swapping pairs {P_1, ..., P_M} and P_{M+1} = (S_{M+1}, S′_{M+1}); furthermore, suppose that P_{M+1} is swapped before P_1. Then the following are sufficient feasibility conditions to relocate A to A′:

1. U_A + U_{S′_{M+1}} ≤ 1;
2. U_A + U_{S′_{M+1}} − U_{S_{M+1}} + U_{S′_1} ≤ 1;
3. U_{A′} + U_{S_M} ≤ 1.
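Once the per-group utilizations are known, the sufficient conditions of Theorems 4 and 5 reduce to a few comparisons. The sketch below is our own rendering, with illustrative names: u_s and u_sp hold the U_{S_k} and U_{S′_k} values in pair order, and the *_left arguments are the leftover groups:

```python
# Sketch of the sufficient feasibility conditions. u_cpu_old / u_cpu_new are
# the CPU utilizations under allocations A and A'; u_s[k] and u_sp[k] are the
# utilizations of swapping groups S_{k+1} (CPU->RD side) and S'_{k+1}
# (RD->CPU side), already ordered as described above.
def feasible_theorem4(u_cpu_old, u_cpu_new, u_s, u_sp):
    # No leftover group: check the first swap-in group and the last swap-out one.
    return u_cpu_old + u_sp[0] <= 1.0 and u_cpu_new + u_s[-1] <= 1.0

def feasible_theorem5(u_cpu_old, u_cpu_new, u_s, u_sp, u_s_left, u_sp_left):
    # Leftover pair (S_{M+1}, S'_{M+1}) is swapped before P_1.
    return (u_cpu_old + u_sp_left <= 1.0
            and u_cpu_old + u_sp_left - u_s_left + u_sp[0] <= 1.0
            and u_cpu_new + u_s[-1] <= 1.0)
```

With the (reconstructed) values of the example in Section 4.2, feasible_theorem5(0.75, 0.4, [0.25], [0.2], 0.3, 0.0) holds, matching the three sums 0.75, 0.65 and 0.65 checked there.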

Note that Theorem 5 basically means that the feasibility of the relocation phase only depends on the utilization of the leftover groups and on the smallest utilization of any two other groups in T̄_{S→H}, T̄_{H→S} (including the placeholder tasks). Furthermore, note that Theorem 4 does not depend on the assumption U_{A′} ≤ U_A, which makes it applicable to any kind of relocation.

Theorems 4 and 5 rely on the assumption that tasks are already partitioned into swapping groups. However, since we can choose how to create the partition, we can maximize the probability of a relocation being feasible by using the following guidelines:

1. minimize U_{S′_{M+1}} and U_{S′_1};
2. minimize U_{S_M};
3. maximize U_{S_{M+1}}.

task | τ1  | τ2  | τ3  | τ4  | τ5  | τ6  | τ7  | τ8  | τ9   | τ10
a_i  | 3   | 1   | 1   | 1   | 1   | 1   | 1   | 3   | 3    | 1
U_i  | 0.3 | 0.2 | 0.2 | 0.1 | 0.1 | 0.1 | 0.2 | 0.2 | 0.25 | 0.3

Table 2. Example: task set

[Figure 1(a): Example: initial allocation (RD and CPU occupancy). Figure 1(b): Example: intermediate allocation.]

The swapping groups can then be created according to the above guidelines with algorithm GROUP PARTITION, by simply ordering all tasks by area and utilization. The complexity remains bounded by O(N²), since at each step we can merge the newly created groups preserving the ordering in linear time.

4.1 Admission Control

Using the feasibility tests from Theorems 4 and 5, an admission test can be run along the lines introduced in Section 3. Provided that all new tasks can be initially allocated on the CPU, we first run the GREEDY allocation algorithm to obtain a new pseudo-optimal allocation. Then, we create the swapping pairs. Finally, we check the feasibility conditions. If they hold, then relocation is possible. If not, we can choose between accepting the modified task set without relocation or rejecting the modification. The choice can depend on the criticality of the newly arrived tasks, although accepting them without performing relocation may clearly compromise future system performance in terms of admitted tasks.
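The GROUP PARTITION step used by this admission test can be sketched as follows. This is an illustrative rendering under the quantized-area assumption; the function name and data layout are ours, and pre-sorting tasks by utilization is how a caller would apply the grouping guidelines above:

```python
# Sketch of GROUP PARTITION: starting from the smallest admissible area,
# merge groups into groups of the immediately greater size; whatever cannot
# be merged at a level is set aside, and all leftovers together form the
# single leftover group.
def group_partition(tasks, area_levels, a_max):
    """tasks: list of (name, area); a_max: largest task area in the whole set.
    Returns (full_groups_of_area_a_max, leftover_group)."""
    levels = [a for a in sorted(area_levels) if a <= a_max]
    buckets = {a: [] for a in levels}
    for name, area in tasks:
        buckets[area].append([name])          # each task starts as its own group
    leftover = []
    for small, big in zip(levels, levels[1:]):
        k = big // small                      # groups of size `small` per `big`
        groups = buckets[small]
        while len(groups) >= k:
            buckets[big].append([t for g in groups[:k] for t in g])
            groups = groups[k:]
        leftover.extend(t for g in groups for t in g)
    return buckets[a_max], leftover

# E.g. the RD->CPU side of the example in Section 4.2 (three real tasks plus
# two placeholders, areas all 1, a_max = 3):
full, left = group_partition(
    [("t12", 1), ("t4", 1), ("t5", 1), ("t11", 1)], [1, 3], a_max=3)
```

With these inputs, the three-task group and the singleton leftover reproduce the S′_1, S′_2 split of the worked example.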
Note that we do not expect an optimal allocation algorithm to perform any better than the greedy solution. To understand why, consider that the only difference between OPT(LP-KNAP) and GREEDY(ILP-KNAP), as detailed in Section 3, lies in the allocation of the critical task τ_c. Therefore, while the greedy solution has a higher CPU utilization, it also has either free area on the RD or area occupied by tasks with lower utilization; we can thus expect that the minimum-utilization swapping group for T̄_{H→S} has lower utilization. As long as the critical task area a_c is not greater than a_max, these two factors typically balance out in the feasibility conditions.

Also note that our relocation scheme can take a non-negligible time to reconfigure the entire system in the presence of many swapping groups. This is not a major concern in multimedia systems, where task arrivals and terminations are triggered by user interaction, but it could be a problem in systems with short interarrival times, since a new task could arrive before relocation is finished. We plan to address this problem as part of our future work, modifying our scheme to allow tasks to be admitted even during a relocation phase.

[Figure 1(c): Example: final allocation. Figure 1. Example Relocation]

A final note regards the management of software-only tasks, i.e. tasks that can only be scheduled on the CPU. Such tasks can be trivially included in our framework by simply forcing them to be allocated in T_S ∩ T_S′.

4.2 Example

In this section we provide a comprehensive example of the admission control and relocation procedure. We assume an RD area A = 9, task areas in {1, 3} and an optimal allocation algorithm. The task set is reported in Table 2. Task parameters were chosen to keep the example simple and easily understandable; they should not be considered as real task cases. The initial situation is depicted in Figure 1(a), where the width of each task on the RD represents the number of slots occupied by its hardware configuration.
Tasks τ1 through τ7 are running on the RD, while tasks τ8 and τ9 are running on the

CPU. Note that since τ8 and τ9 have the lowest U_i/a_i ratio among the running tasks and there is no free space on the RD, the allocation is optimal. This situation changes when τ6 and τ7 simultaneously terminate and a new task τ10 arrives in the system. Since U_8 + U_9 + U_10 = 0.75 ≤ 1, the task can be safely admitted on the CPU, producing the allocation shown in Figure 1(b). A new allocation is then computed for task set T = {τ1, τ2, τ3, τ4, τ5, τ8, τ9, τ10}. It is easy to see that in the optimal solution τ1, τ2, τ3, τ9, τ10 are allocated on the RD and τ4, τ5, τ8 on the CPU. Therefore we can derive the following sets:

T_H ∩ T_H′ = {τ1, τ2, τ3}, T_{H→S} = {τ4, τ5}, T_S ∩ T_S′ = {τ8}, T_{S→H} = {τ9, τ10}

Note that since a_{T_{H→S}} + a_{T_H∩T_H′} = 7, we add two placeholder tasks τ11, τ12 with area 1 and utilization 0 to T_{H→S} (no placeholder is necessary for T_{S→H}). We can then run the GROUP PARTITION algorithm, producing the following swapping groups:

S′_2 = {τ11}, S′_1 = {τ12, τ4, τ5}, S_2 = {τ10}, S_1 = {τ9}

Once the swapping groups have been defined, we check the feasibility conditions:

1. U_8 + U_9 + U_10 + U_11 = 0.75 ≤ 1
2. U_8 + U_9 + U_10 + U_11 − U_10 + U_12 + U_4 + U_5 = 0.65 ≤ 1
3. U_8 + U_4 + U_5 + U_9 = 0.65 ≤ 1

We can finally relocate tasks as described in Section 4 by first swapping S_2 with S′_2 and then swapping S_1 with S′_1. Since all feasibility conditions hold, according to Theorem 5 no task misses its deadline. The final resulting allocation is shown in Figure 1(c).

5 Simulation Results

We have measured the effectiveness of our relocation strategy through a series of synthetic simulations. In particular, we have compared our admission test against a reference test, which simply tries to allocate each new task first on the RD and then on the CPU, rejecting the task if there is not sufficient free area and free utilization.
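For comparison, the reference test amounts to the following first-fit check. The sketch and its names are ours; the actual test also involves placement on the device:

```python
# Sketch of the reference admission test: try to place the new task on the
# RD first; if there is not enough free area, fall back to the CPU; reject
# if neither resource has enough spare capacity.
def reference_admit(area, util, rd_free_area, cpu_util):
    """Returns (accepted, new_rd_free_area, new_cpu_util)."""
    if area <= rd_free_area:
        return True, rd_free_area - area, cpu_util
    if cpu_util + util <= 1.0:
        return True, rd_free_area, cpu_util + util
    return False, rd_free_area, cpu_util

ok, rd_free, cpu_u = reference_admit(2, 0.3, rd_free_area=1, cpu_util=0.8)
# here the task neither fits on the RD nor within the CPU bound, so ok is False
```

Unlike our test, this scheme never moves already-running tasks, which is exactly why relocation can admit strictly more work.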
It is worth noticing that, to the best of our knowledge, no better test exists in the literature to perform admission control for the described system; in fact, our comparison choice is a trivial extension of the reference algorithm shown in [23]. For each test, we have simulated the arrival of 100,000 synthetic tasks and determined the rejection rate as the percentage of the area of rejected tasks with respect to all tasks arrived in the system. For each synthetic task τ_i, the area a_i is randomly chosen to account for tasks with very different computational requirements, and the utilization U_i is randomly generated with mean proportional to a_i. We define the load L of the system in a given interval of time [t_1, t_2] as the load offered by all tasks activated in [t_1, t_2]:

L([t_1, t_2]) = (Σ_{τ_i activated in [t_1, t_2]} U_i · T̄) / (t_2 − t_1) (13)

where T̄ is the average time that a task remains in the system; therefore, U_i · T̄ is the mean execution time required by all jobs of τ_i. T̄ is computed and task terminations are randomly generated such that the average load is equal to a given value L. Note that since the mean value of U_i is proportional to a_i, we could redefine the offered load in terms of a_i/A instead of U_i, as in [24]. Furthermore, since the system is comprised of both an RD and a CPU, a load L ≤ 2 should lead to no rejection (in practice, since task arrivals and terminations constitute a random process, rejections happen even for L ≤ 2).

Figures 2(a), 2(b), 2(c) and 2(d) show a subset of the results for an RD area of 192 slots (Xilinx XC4VFX140), with L ranging from 1 to 4; a more comprehensive set of graphs can be found in [19]. In Figures 2(a) and 2(b) the task area is chosen in the set {1, 2, 4, 8, 16, 32}, with smaller areas being extracted with higher probability than bigger ones (the average task area is 3.05). In Figure 2(c) the area is chosen in the set {1, 2, 4, 8, 16, 32, 64}, with each element being given equal probability (the average task area is 18.14).
Task utilization is randomly generated with standard deviation a_i/A in Figures 2(a) and 2(c), and 0.5·a_i/A in Figure 2(b). Figure 2(d) uses the same parameters as 2(a), but results are shown as the percentage of rejected tasks instead of rejected area. In all figures, relocation and relocation optim refer to our new admission test with relocation, while reference is the reference test. The new allocation is computed using the GREEDY algorithm in relocation, while an optimal dynamic programming algorithm [13] is used in relocation optim. The average time needed to perform a single admission test is 243 µs for relocation and 151 ms for relocation optim on our test system (a Pentium IV at 2.8 GHz).

Note that the graphs do not saturate, since we plot them as a function of the load offered to the system and not of the load of accepted tasks. Results in terms of rejected tasks and rejected task area show similar trends; the percentage of rejected area is higher, since tasks with bigger area are clearly more likely to be rejected. In all tests relocation clearly outperforms reference, rejecting less than one third of the tasks/area with respect to reference in the most favorable case of Figure 2(a). The performance of relocation clearly depends on both the average task area and the utilization standard deviation, with relocation performing better in the presence of small tasks with big differences between area and utilization; note that since all systems proposed in Section 1 include different kinds of activities, we expect that some tasks, like signal processing, can be much better optimized for hardware execution than others. The performance trend is expected, since the optimized allocation algorithm leads to better results as the utilization standard deviation increases. In the same way, smaller tasks lead to smaller swapping groups

with lower total utilization; therefore, the relocation phase is more likely to be feasibly executable. Finally, note that relocation optim does not provide any performance improvement over relocation, as predicted in Section 4.1. Since the run-time overhead of relocation optim is about three orders of magnitude greater than that of relocation, using the simpler GREEDY algorithm in the allocation phase is the best choice.

[Figure 2. Experimental Results. Each panel plots the percentage of rejected area (or tasks) against load from 1 to 4 for reference, relocation optim and relocation: (a) max area 32, standard deviation a_i/A; (b) max area 32, standard deviation 0.5·a_i/A; (c) max area 64, standard deviation a_i/A; (d) max area 32, standard deviation a_i/A, rejected tasks instead of rejected area.]

6 Related Work

To the best of our knowledge, no previous work on combined scheduling of software/hardware tasks has been published. The closest related work is presented in [24, 23], dealing with the admission control problem for real-time hardware tasks, and in [7], dealing with scheduling algorithms for periodic hardware tasks. However, only hard aperiodic tasks are considered in [24, 23], and furthermore no mentioned work takes configuration overheads into account. The problem of on-line task allocation for non-real-time hardware tasks, with the goal of minimizing task activation delay, has received more attention [2], including schemes that relocate (i.e. move) tasks on the RD, mainly in the interest of avoiding external fragmentation in the 1D and 2D area models [4, 5].
However, whenever relocation is performed, tasks are assumed to be suspendable at any time, which can be difficult to achieve, and possibly for significant periods of time, which is unacceptable for real-time execution. In [8] a technique to relocate tasks on FPGAs without suspending them is introduced, but there is no analysis of the overhead in terms of the area that needs to be left free on the RD to relocate a task.

7 Conclusions and Future Work

In this work, we have first proposed a pseudo-optimal allocation algorithm and a relocation scheme for relocatable tasks. We have then derived feasibility conditions for both software and hardware scheduling, and we have defined an admission control test based on such conditions. Finally, the performance benefits of relocation have been measured through a series of synthetic simulations. Although we only considered systems comprised of a single CPU, we believe that our scheme can be easily adapted to multi-CPU systems by modifying the allocation algorithm. As future work, we first plan to extend our analysis to the 1D and possibly the 2D area models. Since such models are

affected by external fragmentation, a suitable defragmentation scheme is needed to place tasks in a pseudo-optimal way. However, we believe that defragmentation can be easily accounted for in the schedulability analysis. Finally, as a long-term objective, we intend to develop an implementation of the proposed techniques on a working OSRD prototype.

References

[1] L. Abeni and G. Buttazzo. Integrating multimedia applications in hard real-time systems. In Proceedings of the 19th IEEE Real-Time Systems Symposium, Madrid, Spain, December 1998.
[2] K. Bazargan, R. Kastner, and M. Sarrafzadeh. Fast template placement for reconfigurable computing systems. IEEE Design and Test of Computers, 17(1):68–83, 2000.
[3] G. Berry, S. Moisan, and J.-P. Rigault. Esterel: Towards a synchronous and semantically sound high-level language for real-time applications. In Proc. IEEE Real-Time Systems Symposium, pages 30–40, Arlington, Virginia, 1983.
[4] G. Brebner and O. Diessel. Chip-based reconfigurable task management. In Proceedings of the 11th International Conference on Field-Programmable Logic and Applications (FPL), pages 182–191, 2001.
[5] K. Compton, Z. Li, J. Cooley, S. Knol, and S. Hauck. Configuration relocation and defragmentation for run-time reconfigurable computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 10(3):209–220, June 2002.
[6] K. Danne and M. Platzner. Memory-demanding periodic real-time applications on FPGA computers. In Work-in-Progress of the 17th Euromicro Conference on Real-Time Systems (ECRTS), Palma de Mallorca, Spain, July 2005.
[7] K. Danne and M. Platzner. Periodic real-time scheduling for FPGA computers. In Third IEEE Int'l Workshop on Intelligent Solutions in Embedded Systems (WISES), May 2005.
[8] M. Gericota, G. Alves, M. Silva, and J. Ferreira. Online defragmentation for run-time partially reconfigurable FPGAs.
In Proceedings of the 12th International Conference on Field-Programmable Logic and Applications (FPL), Montpellier, France, September 2002.
[9] J. M. Grasmeyer and M. T. Keennon. Development of the Black Widow micro air vehicle. In Proceedings of the AIAA Conference on Aerospace Sciences, 2001.
[10] G. Vanmeerbeeck, P. Schaumont, S. Vernalde, M. Engels, and I. Bolsens. Hardware/software partitioning for embedded systems in OCAPI-xl. In CODES '01, Copenhagen, Denmark, April 2001.
[11] N. Halbwachs, P. Caspi, and D. Pilaud. The synchronous dataflow programming language Lustre. In Another Look at Real Time Programming, Proceedings of the IEEE, Special Issue, September 1991.
[12] A. Howard, M. J. Matarić, and G. S. Sukhatme. An incremental self-deployment algorithm for mobile sensor networks. Autonomous Robots, 13(2):113–126, 2002.
[13] H. Kellerer, U. Pferschy, and D. Pisinger. Knapsack Problems. Springer, 2004.
[14] C. Liu and J. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the Association for Computing Machinery, 20(1), 1973.
[15] T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, and R. Lauwereins. Interconnection networks enable fine-grain dynamic multi-tasking on FPGAs. In Proc. of the 12th International Conference on Field-Programmable Logic and Applications (FPL), Montpellier, France, September 2002.
[16] J.-Y. Mignolet, V. Nollet, P. Coene, D. Verkest, S. Vernalde, and R. Lauwereins. Infrastructure for design and management of relocatable tasks in a heterogeneous reconfigurable system-on-chip. In Proceedings of the DATE '03 conference, Munich, Germany, March 2003.
[17] J.-Y. Mignolet, S. Vernalde, D. Verkest, and R. Lauwereins. Enabling hardware-software multitasking on a reconfigurable computing platform for networked portable multimedia appliances. In Proceedings of the International Conference on Engineering Reconfigurable Systems and Algorithms, pages 116–122, Las Vegas, June 2002.
[18] V. Nollet, P. Coene, D. Verkest, S.
Vernalde, and R. Lauwereins. Designing an operating system for a heterogeneous reconfigurable SoC. In Proceedings of the RAW '03 workshop, Nice, France, April 2003.
[19] R. Pellizzoni and M. Caccamo. Adaptive real-time management of relocatable tasks for FPGA-based embedded systems. Technical report, University of Illinois, 2005. http://pertsserver.cs.uiuc.edu/~mcaccamo/papers/.
[20] C. Plessl, R. Enzler, H. Walder, J. Beutel, M. Platzner, L. Thiele, and G. Tröster. The case for reconfigurable hardware in wearable computing. Personal and Ubiquitous Computing, October 2003.
[21] H. Simmler, L. Levinson, and R. Männer. Multitasking on FPGA coprocessors. In Proc. 10th Int'l Conf. on Field Programmable Logic and Applications, Villach, Austria, August 2000.
[22] G. Simon, M. Maróti, Á. Lédeczi, G. Balogh, B. Kusy, A. Nádas, G. Pap, J. Sallai, and K. Frampton. Sensor network-based countersniper system. In Proceedings of the ACM Second International Conference on Embedded Networked Sensor Systems (SenSys), 2004.
[23] C. Steiger, H. Walder, and M. Platzner. Operating systems for reconfigurable embedded platforms: Online scheduling of real-time tasks. IEEE Transactions on Computers, 53(11):1393–1407, 2004.
[24] C. Steiger, H. Walder, M. Platzner, and L. Thiele. Online scheduling and placement of real-time tasks to partially reconfigurable devices. In Proceedings of the 24th IEEE Real-Time Systems Symposium, Cancun, Mexico, December 2003.
[25] H. Walder and M. Platzner. Reconfigurable hardware operating systems: From concepts to realizations. In Proc. Int'l Conf. on Engineering of Reconfigurable Systems and Algorithms (ERSA), 2003.
[26] G. Wigley and D. Kearney. The development of an operating system for reconfigurable computing. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM), 2001.
[27] Xilinx, Inc. Virtex-4, Virtex-II Pro and Virtex-II Pro X FPGA User Guide. http://www.xilinx.com/.