A Multi-criteria Class-based Job Scheduler for Large Computing Farms


R. Baraglia, P. Dazzi, and R. Ferrini
ISTI "A. Faedo", CNR, Pisa, Italy

Abstract

In this paper we propose a new multi-criteria class-based job scheduler able to dynamically schedule a stream of batch jobs on large-scale computing farms. It is driven by several configuration parameters that allow the scheduler to be customized with respect to the goals of an installation. The proposed scheduling policies make it possible to maximize resource usage and to guarantee the QoS requirements of the applications. The proposed solution has been evaluated by simulations using different streams of synthetically generated jobs. To analyze the quality of our solution we propose a new methodology to estimate whether, at a given time, the resources in the system are really sufficient to meet the service level requested by the submitted jobs. Moreover, the proposed solution was also evaluated by comparing it with the Backfilling and Flexible backfilling algorithms. Our scheduler proved able to make good scheduling choices.

Keywords: Job scheduling; Deadline scheduling; Software license scheduling; Computing farm.

1. Introduction

In large computing farms providing utility computing to a large number of users with different functional and non-functional requirements, a scheduler plays a basic role in efficiently and effectively scheduling submitted jobs on the available resources. The objective of the scheduling is to assign tasks to specific resources, maximizing the overall resource utilization and guaranteeing the QoS required by the applications. The scheduling problem has been shown to be NP-complete in its general as well as in some restricted forms [3]. Scheduling in utility computing environments is multi-criteria in nature [8]. In fact, these environments generally manage computational requests that vary dynamically over time, have different computational requirements and constraints, and compete to access shared resources. Even if in the past research efforts have been devoted to developing multi-criteria job scheduling algorithms [7], [6], [1], [2], there is still the need to improve scheduling techniques able to manage an increasing number of jobs and to address all the application and installation requirements, as well as the sets of constraints, of the aforementioned computational environments.

In this paper, we propose a scheduler able to schedule a continuous stream of batch jobs on large-scale computing farms. As a typical scenario we consider a computing farm made up of heterogeneous, single-processor or SMP machines, linked by a low-latency, high-bandwidth network. Some characteristics of the computing nodes (e.g. processor type, memory size, number of CPUs, link bandwidth) are static and known, whereas others are dynamic (e.g. floating sw licenses). The adopted scheduling policies permit us to optimize the scheduling with respect to different, even contrasting, objectives, such as maximizing the resource usage and guaranteeing the non-functional requirements of the applications.

The rest of this paper is organized as follows. Section 2 describes some of the most common job scheduling algorithms. Section 3 gives a description of the problem. Section 4 describes our solution. Section 5 outlines and evaluates our job scheduler. Finally, conclusions and future work are described in Section 6.

2. Related work

Batch job scheduling algorithms are mainly divided into two classes: on-line and offline.
On-line algorithms are those that do not have any knowledge about the whole input job stream: they take decisions for each arriving job without knowing future inputs. Conversely, offline algorithms know all the jobs before taking scheduling decisions. Many of these algorithms are exploited in commercial and open source job schedulers [4]. The Backfilling algorithm [9] is a widely adopted scheduling approach; it is an optimization of the FCFS algorithm [10]. It requires that each job specify its execution time, so that the scheduler can estimate when jobs finish and other ones can be started. The main goal of Backfilling is to exploit a resource reservation approach to improve the FCFS policy by increasing the system resource usage and by decreasing the average job waiting time in the scheduler's queue. In order to improve performance, some backfilling variants, such as Flexible backfilling [5], have been proposed. The Flexible backfilling algorithm is obtained by exploiting a different order of the queued jobs: jobs are prioritized according to the scheduler goals, queued according to their priority value, and then selected for scheduling. Even if the multi-criteria approach seems to be the most viable one to solve the resource management and scheduling problem in heterogeneous and distributed computational environments, only a few research efforts have been made in this direction [1], [7], [2], [11]. In [1] a multi-criteria job scheduler for scheduling a continuous stream of batch jobs on large-scale computing farms is proposed.

It exploits a set of heuristics that drive the scheduler in taking decisions. Each heuristic manages a specific constraint, and contributes to computing the degree of matching between a job and a machine. The scheduler can be extended to manage a wide set of requirements and constraints. In [7] K. Kurowski et al. propose a two-level hierarchical multi-criteria scheduling approach for Grid environments. All participants of a scheduling process, i.e. end-users, Grid administrators and resource providers, express their requirements and preferences by using two sets of parameters: hard constraints and soft constraints. A Grid broker at the higher level exploits the hard constraints to compute a set of feasible solutions, which can be optimized by using soft constraints describing preferences regarding multiple criteria, such as various performance factors, QoS-based parameters, and characteristics of local schedulers. In [2] a bi-criteria algorithm for scheduling moldable jobs on cluster computing platforms is proposed. It exploits two pre-existing algorithms to simultaneously optimize two criteria: job makespan and weighted minimal average completion time. Such criteria are complementary, and well represent the objectives of both users and system administrators. The algorithm was evaluated by simulations using two different synthetic workloads. In [11] a solution based on advanced resource reservation that optimizes resource utilization and user QoS constraints for Grid environments is proposed. It supports advanced reservations to deal with the dynamics of Grids and provides a solution for agreement enforcement. The proposed advanced reservation solution is structured according to a 3-layered negotiation protocol. The preferences of end-users are taken into account to start a negotiation to select the resources to reserve. The user can select the most suitable offer or can decide to re-negotiate by changing some of the constraints. End-user preferences are modeled as utility functions for which end users have to specify required values and negotiation levels. In [6] a schedule-based solution for scheduling a continuous stream of batch jobs on computational Grids is proposed. The solution is based on the EG-EDF (Earliest Gap-Earliest Deadline First) rule and on a Tabu search technique. The EG-EDF rule incrementally builds the schedule for all jobs by applying a technique that fills the earliest existing gaps in the schedule with newly arriving jobs. If no gap is available for an incoming job, the EG-EDF rule uses the Earliest Deadline First (EDF) strategy to include the new job in the existing schedule. The schedule is then optimized by using a Tabu search algorithm that moves jobs into the earliest gaps. Scheduling choices are taken to meet the QoS requested by the submitted jobs and to optimize the hardware resource usage.

3. Problem Description

We consider jobs and machines annotated with information describing their requirements and features, respectively. Jobs in a stream can be sequential or multi-threaded, and all the jobs are independent of each other. Each job has an attached description containing both an identifier and a set of functional and non-functional requirements. Functional requirements include the number of processors, the RAM size and the software licenses a job needs to be executed. Non-functional requirements (also referred to as QoS requirements) are a job slowdown equal to one, a job deadline, and advanced resource reservation.
The description also includes an estimation of the time required to compute the job and the features of the processor used to obtain such an estimation (a benchmark score). Each job is executed on a single machine, and all jobs are preemptable. Job preemption can be performed when either a job submission or a job ending event takes place.

The machines composing the farm are described by a benchmark score, the number and type of CPUs, the size of the installed RAM, and the non-floating (i.e. bound to a machine) and floating (i.e. not bound to any specific machine) software licenses they can run. Each processor installed on a machine has an associated weight. Every machine can execute multiple jobs at the same time in a space-sharing fashion. All the machines support two basic forms of job preemption: stop/restart and suspend/resume. The checkpoint/restart form is possible only if the running job is properly instrumented. Machines are assigned to jobs in the form of sub-machines, namely subsets of a machine's processors. A sub-machine is managed by the scheduler as an instance of the machine from which it originates. Floating sw licenses can be assigned to any machine able to run them. The only limit is that the total number of licenses in use cannot exceed their availability. In our study, we consider the association of licenses to machines. As a consequence, if a set of jobs requiring the same license can be executed on the same machine, only one license copy is accounted for.

4. The scheduler architecture

The proposed scheduler is based on multiple job classes. Each job is assigned to a class on the basis of its functional and/or non-functional requirements. Figure 1 depicts the architecture of our scheduler.

Fig. 1: The scheduler architecture.

Three main components are represented: Job-Dispatcher, Class-Scheduler and Control-Scheduler. The Job-Dispatcher receives, classifies, and dispatches each job to the proper class. A class is an entity characterized by a set of dynamically assigned computational resources, a job queue and a Class-Scheduler. The classes are ranked according to a priority value assigned statically by the installation on the basis of the functional and non-functional requirements they manage. A Class-Scheduler (CLS) is associated with each class. This component is specialized in managing a specific class of job requirements. Each CLS has an associated job queue and a set of resources. The CLS extracts jobs from its queue and allocates resources to run them. In case of resource shortage, it issues a request for additional resources to the Control-Scheduler. A class releases the assigned resources when they have not been used for a predefined quantum of time fixed by the installation.
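To make the interplay between these components more concrete, the following Python skeleton sketches the Job-Dispatcher and Class-Scheduler roles just described; the class, attribute and method names (e.g. request, find_local_resources) are illustrative assumptions, not the paper's actual interfaces.

```python
class ClassScheduler:
    """Skeleton of a CLS: a job queue plus the resources currently
    assigned to its class. Names are illustrative, not the paper's API."""
    def __init__(self, rank, control_scheduler):
        self.rank = rank
        self.queue = []          # jobs of this class, ordered by the class policy
        self.resources = []      # sub-machines / license copies owned by the class
        self.cns = control_scheduler

    def schedule(self):
        for job in list(self.queue):
            res = self.find_local_resources(job)
            if res is None:
                # resource shortage: ask the Control-Scheduler for more
                res = self.cns.request(self, job)
            if res is not None:
                self.queue.remove(job)
                self.start(job, res)

    def find_local_resources(self, job): ...   # look within already-owned resources
    def start(self, job, res): ...             # run the job on the chosen sub-machine


class JobDispatcher:
    """Receives each submitted job, classifies it and forwards it to its CLS."""
    def __init__(self, class_schedulers, classify):
        self.class_schedulers = class_schedulers   # one CLS per class, indexed by id
        self.classify = classify                   # classification rules (Section 4)

    def submit(self, job):
        self.class_schedulers[self.classify(job)].queue.append(job)
```

Queue ordering and resource allocation are deliberately left abstract here, since they are class-specific, as detailed below.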

The Control-Scheduler (CNS) manages the requests issued by the CLSs. The CNS allocates resources to the CLSs in the form of sub-machines or floating sw licenses. As an example, consider a job asking for four processors in a computing farm composed only of eight-processor machines. The CNS assigns four free processors of an available suitable machine to the requesting class. That class becomes the temporary owner of the assigned resources until it releases them. Requests issued by higher ranked classes are scheduled before those issued by lower ranked ones, and requests issued by the same CLS are managed in FCFS order. The CNS defines new sub-machines according to two alternatives: 1) the sub-machine is defined on a machine already assigned to the requesting class; 2) the sub-machine is defined on a machine not assigned to any class, i.e. a free machine. If no machine is available, the CNS can decide to enact a resource stealing process. The definition of a new sub-machine is performed by exploiting the principle of least privilege, namely, among all the available machines the one with the least amount of resources sufficient to satisfy the requested assignment is chosen.

Sub-machines are managed using a data structure consisting of a vector V_FM whose length equals the number of processors P of the largest machine in the computing farm. Each element of the vector contains a list in which each element represents a farm machine. A machine belongs to the list at index i if its current number of available processors equals i. The lists are arranged in increasing order with respect to machine memory size, and then to the number of floating sw licenses the machine can run. In order to find a machine with p processors, a RAM of size r and l licenses, the CNS starts its search from the V_FM[p] element and continues until it finds a machine satisfying the assignment requirements or the vector ends. When a proper machine is found and a new sub-machine is created, the number of available processors of that machine decreases accordingly, and the data structure is updated. The idea behind this data structure is to keep larger machines available for subsequent requests and to reduce machine fragmentation.

Floating sw licenses are managed using a specific data structure consisting of a vector whose length equals the number of available floating sw licenses S. Each vector entry addresses a list storing the number of currently usable copies of a specific license. Each list is structured as three sublists storing, respectively, the number of available copies, the number of copies assigned to a class but not in use, and the number of copies assigned to a class and in use. Floating licenses belonging to the first two sublists are available for assignment to classes, whereas the copies in the third sublist are already assigned. When a free copy of a floating sw license is assigned to a class, it is removed from the first sublist and added to the third sublist. When a job using a floating sw license finishes its execution, the license copy is moved to the second sublist. After an installation-defined quantum of time, licenses in the second sublist are released and moved back to the first sublist.
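As a minimal sketch of the V_FM bookkeeping described above, assuming hypothetical machine attributes such as free_cpus, ram_mb and runnable_licenses, the least-privilege search could look as follows; this is an illustration of the data structure, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Machine:
    # Hypothetical machine descriptor; the field names are illustrative.
    name: str
    free_cpus: int
    ram_mb: int
    runnable_licenses: set     # floating sw licenses this machine can run

class VFM:
    """Sketch of the V_FM structure: index i holds the machines whose current
    number of free processors equals i, ordered by RAM and runnable licenses."""
    def __init__(self, machines, max_cpus):
        self.slots = [[] for _ in range(max_cpus + 1)]
        for m in machines:
            self.slots[m.free_cpus].append(m)
        for lst in self.slots:
            lst.sort(key=lambda m: (m.ram_mb, len(m.runnable_licenses)))

    def find_submachine(self, p, ram_mb, licenses):
        """Least-privilege search: scan from V_FM[p] upwards and carve a
        sub-machine out of the first machine that satisfies the request."""
        for i in range(p, len(self.slots)):
            for m in self.slots[i]:
                if m.ram_mb >= ram_mb and licenses <= m.runnable_licenses:
                    self.slots[i].remove(m)
                    m.free_cpus -= p            # p processors now form a sub-machine
                    self.slots[m.free_cpus].append(m)
                    self.slots[m.free_cpus].sort(
                        key=lambda x: (x.ram_mb, len(x.runnable_licenses)))
                    return m
        return None   # no suitable machine: the CNS may trigger resource stealing
```

Starting the scan at index p and keeping each list sorted by memory and licenses is what keeps larger, better-equipped machines free for later requests.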
In our study we considered six job classes (0 to 5). The first three classes (0, 1 and 2) manage jobs with both functional and non-functional requirements, whereas the other three (3, 4 and 5) manage jobs with only functional requirements. Jobs are assigned to the classes according to the following criteria:

Class 0: Jobs requiring a slowdown equal to 1. Jobs are managed by the related CLS in First Come First Served (FCFS) order.

Class 1: Jobs requiring advanced resource reservation. Jobs are scheduled according to their closeness to the reservation. Jobs for which the resource reservation fails are discarded. Alternatively, but not in our study, they could be moved to Class 3, 4 or 5 depending on their functional requirements.

Class 2: Jobs with a deadline. Jobs are scheduled according to the expected time at which they have to start in order to meet their deadline. The queue position of a job is determined by exploiting the solution proposed in [6]: the closer the deadline of a job is, the higher its position in the job queue.

Class 3: Sequential or parallel jobs requiring floating sw licenses.

Class 4: Parallel jobs not requiring any floating sw license.

Class 5: Sequential jobs not asking for floating sw licenses.

Jobs within classes 3, 4 and 5 are selected by the related CLS in FCFS order. If a job matches two different classes, it is assigned to the one with the higher priority. The assignment of resources to classes makes it possible to exploit the locality in job requirements. In fact, after an initial transient, it is highly probable that a class managing jobs with similar requirements already owns the resources needed to run the jobs assigned to it.
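The classification criteria above can be read as a priority-ordered rule chain. The sketch below illustrates this, using assumed job attribute names (slowdown_req, reservation, deadline, floating_licenses, n_cpus); because the rules are tested from Class 0 downwards, a job matching two classes is automatically assigned to the higher-priority one.

```python
def classify(job):
    """Sketch of the class-assignment rules of Section 4; the job attributes
    used here are assumed names, not the paper's actual descriptors."""
    if job.slowdown_req == 1:
        return 0        # Class 0: slowdown equal to 1
    if job.reservation is not None:
        return 1        # Class 1: advanced resource reservation
    if job.deadline is not None:
        return 2        # Class 2: deadline
    if job.floating_licenses:
        return 3        # Class 3: needs floating sw licenses
    if job.n_cpus > 1:
        return 4        # Class 4: parallel, no floating licenses
    return 5            # Class 5: sequential, no floating licenses
```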

Resource stealing: When a class lacks free resources to satisfy a request, its CLS issues a request to the CNS. As a consequence, the execution of jobs belonging to classes with a lower priority may have to be interrupted to release the needed resources. However, the interruption of a job has a cost for the computing farm: interrupting a job is convenient only if the gain resulting from the use of the released resources overcomes this cost. To evaluate this cost several parameters should be considered, e.g. the time the candidate job has already spent in execution, the number of sw licenses assigned to that job, etc. In our model a resource r can be moved from a class B to a class A if the following expression is verified: Rank_A > Cost_B(r), where Rank_A is the rank value of class A and Cost_B(r) is the cost associated with the interruption of the jobs running on resource r and belonging to B. Considering a generic class C, a resource r, and the number k of jobs running on r, such cost is computed as:

Cost_C(r) = Rank_C + Σ_{i=1..k} P_C(i), where P_C(i) = W_i · (T_ex(i) / T_tot(i))^{W_dead} + (W_pr · Pr(i)) + (W_l · L(i)).

T_ex(i) is the time spent executing job i and T_tot(i) is the total estimated execution time of job i. W_i is the weight associated with the form of preemption adopted to interrupt the execution of job i: it is small if the job supports checkpoint/restart, larger in case of suspend/resume, and maximum if the only option is stop/restart. W_dead is the weight associated with jobs having a deadline; it equals 1 for jobs without a deadline. W_pr is the weight associated with a processor and Pr(i) is the number of processors assigned to i. W_l is the weight associated with a floating sw license and L(i) is the number of floating sw licenses used by i.

The idea of this approach is to allow an installation to tune the W_i, W_pr, W_l and W_dead values and the class ranks according to its objectives. As an example, suppose that an installation goal is to respect in a very strict way the prioritization given by the job classes. To this end, the ranks associated with two consecutive classes have to differ by a value greater than the maximum value that P_C(i) can assume. Such a value is obtained when a job i (it makes no difference whether there is just one or several jobs, if the overall resource usage is the same) is using all the processors of the largest farm machine and all the available floating sw licenses, and is approaching the end of its execution. In the conducted tests, assuming a largest machine with 1024 processors and weights W_i = 2, W_pr = 1, W_l = 0.5 and W_dead = 2, the rank values of the six classes were fixed as follows: Rank_Class5 = 0, Rank_Class4 = 1500, Rank_Class3 = 3000, Rank_Class2 = 4500, Rank_Class1 = 6000, and Rank_Class0 = 7500.
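The cost model above can be summarized by the following sketch, which assumes per-job fields (t_exec, t_total, w_preempt, w_dead, n_cpus, n_floating_licenses) and global weights as illustrative names rather than the paper's API.

```python
def preemption_cost(class_rank, running_jobs, w_proc, w_lic):
    """Sketch of Cost_C(r) = Rank_C + sum_i P_C(i) for the jobs running on a
    resource r owned by class C; the job field names are assumed."""
    cost = class_rank
    for job in running_jobs:
        progress = job.t_exec / job.t_total               # fraction of work done
        cost += (job.w_preempt * progress ** job.w_dead   # preemption-form term
                 + w_proc * job.n_cpus                    # processors held
                 + w_lic * job.n_floating_licenses)       # floating licenses held
    return cost

def can_steal(requester_rank, victim_class_rank, victim_jobs, w_proc, w_lic):
    # Resource r may move from class B (victim) to class A (requester)
    # only if Rank_A > Cost_B(r).
    return requester_rank > preemption_cost(victim_class_rank, victim_jobs,
                                            w_proc, w_lic)
```

Note how the cost of interrupting a job grows as it approaches completion, so nearly finished jobs are the least attractive victims.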
Resource search: Considering a job i belonging to a class A and requiring Pr_x ≤ P processors and a set L_x of floating sw licenses, the resource search algorithm is structured according to the following steps:

1) Starting from the entry Pr_x, the V_FM data structure is scanned to find jobs to be interrupted.
2) For every machine m suitable for executing i indexed by Pr_x, a list of jobs that could be interrupted is created. The list also includes the free processors.
3) The first N_x jobs whose interruption yields the required Pr_x processors are selected.
4) If, by selecting the first N_x jobs, a number of processors greater than Pr_x is obtained, a refinement step is conducted to adjust the number of selected processors: the list of selected jobs is visited in reverse order to remove the exceeding processors.
5) The cost Cost_r is computed as the sum of the costs related to the jobs to be interrupted on m. If Cost_r is smaller than the costs computed for the other analyzed machines, machine m is selected, and the jobs executing on it are selected to be interrupted.
6) Steps 2 to 5 are repeated from Pr_x + 1 to P to find further machines suitable for executing i.
7) At the end of step 6, the list of jobs that could be interrupted (i.e. the jobs running on the machines with the lowest associated costs) is obtained.

The interruption of the selected jobs may free some of the required licenses. In this case, the found licenses are removed from L_x and the following steps are executed:

1) The floating sw license search starts from the queue of a license l belonging to L_x in the floating sw license data structure.
2) The cost Cost_r due to the interruption of a job using l is computed. The job with the smallest Cost_r is selected and its execution is interrupted.
3) Steps 1 and 2 are repeated until all the licenses needed to run job i are found.
4) At the end of step 3, the set of jobs to interrupt is found.

This phase is the most computationally expensive one. In fact, the search for free processors requires, in the worst case, analyzing all the available machines and floating sw licenses. The search for processors needs to sort the N jobs running on each of the M machines in the farm. Since the sort operation has complexity N log N, in the worst case the resource search algorithm has complexity C = M · N log N. The search for floating sw licenses has, in the worst case, complexity proportional to the number of floating sw licenses.

5. Performance Evaluation

The evaluation of the proposed scheduler was conducted by simulations using different streams of jobs and farms of different sizes. Job and machine parameters have been randomly generated from a uniform distribution in the ranges shown in Table 1. Moreover, we compared our solution with the Backfilling and Flexible backfilling algorithms.

The job priorities of the Flexible backfilling algorithm are updated at each job submission or ending event, and the reservation for the first queued job is maintained through events.

Table 1: Parameters used to generate jobs and machines.
Processor type: 1 - 5
Number of processors:
Benchmark score: 0.5 - 2
RAM: 5 Mb - 5 Gb
Job estimated execution time (secs): 16 - 2
Number of license copies: 5 - 7
Number of different licenses: 2

For each simulation the percentages of jobs with specific functional and non-functional requirements have been generated according to the values shown in Table 2.

Table 2: Percentages used to generate the job streams.
5% requires a slowdown equal to 1
3% has a deadline
5% needs advanced resource reservation
6% needs a software license
3% needs a floating software license
1% needs specific hardware
2% supports checkpointing
4% needs 1 processor
4% needs 2 processors
1% needs 4 processors
0.8% needs 8 processors
0.8% needs 16 processors
0.6% needs 32 processors
0.4% needs 64 processors
0.2% needs 128 processors

The duration of each simulation was set to 43,200 time units (i.e. the number of seconds in 12 hours). For each simulation time unit the system: (1) generates a job and puts it in the Dispatcher's job queue, (2) updates the status of the running jobs, (3) updates the status of the resources, (4) executes the CLSs, (5) executes the CNS, and (6) stores the simulation statistics. In the conducted experiments the number of generated machines varied from 1 to 12, and to obtain stable values each simulation was repeated 5 times with different farm configurations and job streams.

The performance metrics have been evaluated versus the system contention. Usually, this value is roughly computed as ResourceR / ResourceA, where ResourceR is the amount of a specific resource requested by the jobs in the system, and ResourceA is the available amount of that resource. This ratio does not provide accurate information on resource availability because it ignores the constraints implied by job allocation. In fact, all the requirements of a job must be satisfied to allocate it, so the variables describing the available resources cannot be considered independently. To clarify this point, let us suppose that the value computed by the above expression is less than 1. In principle, it indicates an availability of the considered resource; as a consequence, a scheduler should be able to properly allocate the jobs on the available resources. However, this is not always true. As an example, consider an availability of 20 processors in the system and a job to schedule requiring 16 processors. Clearly, if at least 16 of the 20 free processors are not available on the same machine, the job cannot be scheduled, even though a rough analysis would suggest enough processor availability. Unfortunately, in general it can be hard to understand whether a resource shortage is caused by the ineffectiveness of the adopted scheduler or by an insufficient number of available resources. To overcome this problem we introduce an index whose aim is to exploit a simple allocator to estimate, with a certain degree of approximation, whether at a certain time the resources in the system are sufficient to meet all the job requirements. In particular, in this paper we only consider processors when computing the index.
To this end, we considered the following four job scheduling algorithms (but in principle others could also be considered), each one basing its strategy on a different job allocation policy:

1) Largest Machine, which allocates a job on the machine with the largest number of free processors;
2) Smallest Machine, which allocates a job on the machine with the smallest number of free processors;
3) Smallest Residue, which allocates a job on the machine that, after the allocation, is left with the lowest number of free processors;
4) Largest Residue, which allocates a job on the machine that, after the allocation, is left with the largest number of free processors.

These algorithms were evaluated to find the one leading to the best processor usage in the simulated environment. To this end, a workload able to use all the available processors of the simulated farm was designed according to the following four steps:

1) random generation of a set of machines;
2) generation of a proper set of jobs for each machine;
3) random distribution of all the processors belonging to each machine among the generated set of jobs;
4) assignment of the generated jobs to a free computation slot in such a way that they all finish their execution on the target machine at a fixed time.

The index is computed as (Processors_r + Processors_q) / Processors_a, where Processors_r is the number of processors requested by the allocated jobs, Processors_q is the number of processors requested by the jobs not yet allocated, and Processors_a is the number of available processors. The higher the index value is, the higher the system contention. Smallest Residue is the method that obtained the best results in the 5 simulations we conducted varying the number and type of the machines inside the simulated farm. This is the allocator we used for computing the index. It behaves as a sort of probe measuring processor availability throughout a simulation, and it is executed each time a job execution is started.
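A compact sketch of the Smallest Residue probe and of the index computation is given below; the job and machine attribute names (n_cpus, free_cpus) are assumptions used only for illustration.

```python
def smallest_residue_allocate(job, machines):
    """Sketch of the Smallest Residue probe: allocate on the machine left with
    the fewest free processors after the allocation."""
    candidates = [m for m in machines if m.free_cpus >= job.n_cpus]
    if not candidates:
        return None                 # the job stays in the probe's queue
    best = min(candidates, key=lambda m: m.free_cpus - job.n_cpus)
    best.free_cpus -= job.n_cpus
    return best

def contention_index(allocated_jobs, queued_jobs, available_cpus):
    # index = (Processors_r + Processors_q) / Processors_a
    requested = sum(j.n_cpus for j in allocated_jobs)
    waiting = sum(j.n_cpus for j in queued_jobs)
    return (requested + waiting) / available_cpus
```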

To evaluate the scheduler efficiency, we analyzed the algorithms exploited by the CNS to handle sub-machines. Figure 2 shows the percentages of new sub-machine definitions and expansions obtained in the simulations. When the index value is low (< 0.4) there is a high rate of new sub-machine definitions, because the classes have only a few resources assigned. When the index value is between 0.4 and 0.8 there is a higher sub-machine expansion rate, because the classes already own a large number of sub-machines and, at the same time, there are enough available processors on the farm machines. When the index value is greater than 0.8, the percentages of the expansion and new-definition processes tend to stabilize at values of about 30% and 40%, respectively. This means that, when the system is heavily loaded, for example when the index is equal to 2, in about 35% of cases the classes already have a sub-machine able to execute a submitted job, while in the other 65% of cases the CLSs ask the CNS to extend a sub-machine (25%) or to define a new sub-machine (40%).

Fig. 2: New sub-machine definitions and expansions.

The graph of Figure 3 shows the percentage of processors used by the running jobs. When the index is greater than 1, i.e. when the requested resources begin to be unavailable, the processor utilization approaches 100%. However, the shape of the curve clearly shows that in some cases there are free processors even when the index is greater than 1. In fact, even when the index is greater than 1.2 (i.e. when the estimated number of requested resources exceeds the available ones), the figure shows that some processors are not used. This happens because, even when the number of requests is much greater than the available resources, the latter may not be able to run any of the waiting jobs. However, it is worth pointing out that our scheduler is able to schedule jobs in a way that keeps the number of unused resources low.

Fig. 3: Processor usage.

We also investigated the degree of satisfaction of the non-functional job requirements. We evaluated the quality of service provided by the Control-Scheduler on the basis of the decisions it made to allocate resources to the classes. This analysis was conducted to assess the choices made in the following areas: (1) the job classification policies and the rank values assigned to the classes; (2) the resource stealing technique. Bad choices can cause long queuing times for some types of jobs, in particular those belonging to low ranked classes. To evaluate the satisfaction level of the resource demands we used the slowdown metric. It measures the ratio between the response time of a job (i.e. the time elapsed between its submission and its termination) and its execution time, and is computed as (T_w + T_e) / T_e, where T_w is the time that a job spends waiting to start and/or restart its execution, and T_e is the job execution time [9]. Figure 4(a) shows the average slowdown obtained executing jobs belonging to the following job classes: Class 0, i.e. jobs requiring a slowdown equal to 1; Class 3, i.e. sequential or parallel jobs requiring floating licenses; Class 4, i.e. parallel jobs not requiring floating licenses; and Class 5, i.e. sequential jobs not asking for floating licenses. In this evaluation, jobs requesting advanced reservation or a deadline were not considered because for such jobs the slowdown is not indicative: depending on their characteristics, such jobs can spend some time enqueued before being executed without affecting their performance. In our tests all the requests of advanced resource reservation were satisfied.
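For clarity, the slowdown metric amounts to the following one-liner; the example values are purely illustrative.

```python
def slowdown(t_wait, t_exec):
    """Slowdown = (T_w + T_e) / T_e: response time over execution time."""
    return (t_wait + t_exec) / t_exec

# Example: a job that waited 150 s and ran for 600 s has slowdown 1.25.
```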
Figure 4(a) shows that, when the resources are available (i.e. the index is below 1), all the jobs obtain a slowdown equal to 1. When the index is greater than 1, i.e. when the competition among jobs to access the available computational resources increases, jobs are forced to spend some time in the queues, resulting in an increase of the average job slowdown. It can be seen that in the conducted tests the requirement slowdown = 1 is satisfied even when the index reaches 1.6 (i.e. high system contention). The slowdown of jobs asking for floating software licenses remains under 1.2 even with high system contention, while the slowdown increases only by up to 20% for parallel jobs. The slowdown of the serial jobs is instead the worst: their completion time can increase by up to 80%. Figure 4(b) shows the percentage of jobs executed respecting their deadline. The results obtained by the proposed scheduler were compared with those obtained by running the Backfilling and Flexible backfilling algorithms under the same simulation conditions, i.e. with both the same machines and the same job streams used to evaluate the proposed scheduler. As expected, the lower the system contention (index below 1), the higher the percentage of jobs meeting their deadline, and all the schedulers are able to satisfy all deadline requests.

The proposed scheduler obtains better results than the other algorithms: it achieves a percentage of jobs respecting their deadline very close to 100%, even with high system contention (index around 1.5). As the system contention increases, the Flexible backfilling algorithm reaches a performance that is 16% lower than the one obtained by the proposed scheduler, while the Backfilling algorithm obtains a performance significantly lower than that of our scheduler.

Fig. 4: (a) Job slowdown; (b) Job deadline; (c) Slowdown of jobs requiring slowdown = 1.

In Figure 4(c) we show the results obtained from the execution of jobs requiring a slowdown value equal to 1. It can be seen that when the resources are no longer available the Backfilling and Flexible backfilling algorithms are not able to guarantee this QoS. The proposed class scheduler, by using the resource stealing technique, makes the needed resources available even when the system contention is high (index around 1.5). It is worth pointing out that the Flexible backfilling algorithm maintains the slowdown value within an acceptable level, offering in this test a performance comparable to the one obtained by the proposed scheduler.

6. Conclusion

In this paper we propose a new multi-criteria scheduler to dynamically schedule a continuous stream of batch jobs on large-scale, non-dedicated computing farms made of heterogeneous, single-processor or SMP machines linked by a low-latency, high-bandwidth network. The proposed solution aims at scheduling the arriving jobs while respecting several functional and non-functional job requirements and optimizing the hardware and software resource usage. Several configuration parameters allow the scheduler to be customized with respect to the goals of an installation. The scheduler was evaluated by simulations using different synthetically generated job streams. To conduct the evaluation, a technique to measure the system contention throughout a simulation was adopted. The scheduler was also compared with the Backfilling and Flexible backfilling schedulers. In the conducted tests, the proposed scheduler proved able to make good scheduling choices. As future work, we plan: (1) to enhance the current scheduler by refining the adopted advanced resource reservation technique and by managing jobs requiring co-allocation, i.e. jobs to be executed on more than one machine; (2) to introduce energy efficiency policies dispatching workloads to more energy-efficient machines; (3) to evaluate the scheduler when applied to computing platforms made of distributed computing farms; (4) to investigate the feasibility of different scheduling criteria to estimate the index.

7. Acknowledgment

This work has been supported by the projects CONTRAIL (EU-FP7) and S-CUBE (EU-FP7).

References

[1] G. Capannini, R. Baraglia, D. Puppin, L. Ricci, and M. Pasquali. A job scheduling framework for large computing farms. In SC, page 54, 2007.
[2] P.-F. Dutot, L. Eyraud, G. Mounié, and D. Trystram. Bi-criteria algorithm for scheduling jobs on cluster platforms. In Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '04), New York, NY, USA, 2004. ACM.
[3] H. El-Rewini, T. G. Lewis, and H. H. Ali. Task Scheduling in Parallel and Distributed Systems. PTR Prentice Hall, Englewood Cliffs, New Jersey, 1994.
[4] Y. Etsion and D. Tsafrir. A short survey of commercial cluster batch schedulers.
Technical Report 2005-13, School of Computer Science and Engineering, The Hebrew University of Jerusalem, May 2005.
[5] D. Feitelson, L. Rudolph, and U. Schwiegelshohn. Parallel job scheduling: a status report. In Job Scheduling Strategies for Parallel Processing. Springer, 2005.
[6] D. Klusáček, H. Rudová, R. Baraglia, M. Pasquali, and G. Capannini. Comparison of multi-criteria scheduling techniques. In Grid Computing: Achievements and Prospects. Springer, 2008.
[7] K. Kurowski, J. Nabrzyski, A. Oleksiak, and J. Weglarz. A multicriteria approach to two-level hierarchy scheduling in grids. Journal of Scheduling, 11, October 2008.
[8] K. Kurowski, J. Nabrzyski, A. Oleksiak, and J. Weglarz. Scheduling jobs on the grid: multicriteria approach. Computational Methods in Science and Technology, 12(2), 2006.
[9] A. Mu'alem and D. Feitelson. Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling. IEEE Transactions on Parallel and Distributed Systems, 12(6), 2001.
[10] U. Schwiegelshohn and R. Yahyapour. Analysis of first-come-first-serve parallel job scheduling. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 1998.
[11] M. Siddiqui, A. Villazón, and T. Fahringer. Grid capacity planning with negotiation-based advance reservation for optimized QoS. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, page 13. ACM, 2006.


Comparison of Request Admission Based Performance Isolation Approaches in Multi-tenant SaaS Applications Comparison of Request Admission Based Performance Isolation Approaches in Multi-tenant SaaS Applications Rouven Kreb 1 and Manuel Loesch 2 1 SAP AG, Walldorf, Germany 2 FZI Research Center for Information

More information

Int. J. Advanced Networking and Applications 1367 Volume: 03, Issue: 05, Pages: 1367-1374 (2012)

Int. J. Advanced Networking and Applications 1367 Volume: 03, Issue: 05, Pages: 1367-1374 (2012) Int. J. Advanced Networking and Applications 1367 s to Improve Resource Utilization and Request Acceptance Rate in IaaS Cloud Scheduling Vivek Shrivastava International Institute of Professional Studies,

More information

Characterizing Task Usage Shapes in Google s Compute Clusters

Characterizing Task Usage Shapes in Google s Compute Clusters Characterizing Task Usage Shapes in Google s Compute Clusters Qi Zhang University of Waterloo qzhang@uwaterloo.ca Joseph L. Hellerstein Google Inc. jlh@google.com Raouf Boutaba University of Waterloo rboutaba@uwaterloo.ca

More information

An Approach to Load Balancing In Cloud Computing

An Approach to Load Balancing In Cloud Computing An Approach to Load Balancing In Cloud Computing Radha Ramani Malladi Visiting Faculty, Martins Academy, Bangalore, India ABSTRACT: Cloud computing is a structured model that defines computing services,

More information

Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database

Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database WHITE PAPER Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

The Probabilistic Model of Cloud Computing

The Probabilistic Model of Cloud Computing A probabilistic multi-tenant model for virtual machine mapping in cloud systems Zhuoyao Wang, Majeed M. Hayat, Nasir Ghani, and Khaled B. Shaban Department of Electrical and Computer Engineering, University

More information

How To Use A Cloud For A Local Cluster

How To Use A Cloud For A Local Cluster Marcos Dias de Assunção 1,2, Alexandre di Costanzo 1 and Rajkumar Buyya 1 1 Department of Computer Science and Software Engineering 2 National ICT Australia (NICTA) Victoria Research Laboratory The University

More information

BRAESS-LIKE PARADOXES FOR NON-COOPERATIVE DYNAMIC LOAD BALANCING IN DISTRIBUTED COMPUTER SYSTEMS

BRAESS-LIKE PARADOXES FOR NON-COOPERATIVE DYNAMIC LOAD BALANCING IN DISTRIBUTED COMPUTER SYSTEMS GESJ: Computer Science and Telecommunications 21 No.3(26) BRAESS-LIKE PARADOXES FOR NON-COOPERATIVE DYNAMIC LOAD BALANCING IN DISTRIBUTED COMPUTER SYSTEMS Said Fathy El-Zoghdy Department of Computer Science,

More information

Improving Compute Farm Throughput in Electronic Design Automation (EDA) Solutions

Improving Compute Farm Throughput in Electronic Design Automation (EDA) Solutions Improving Compute Farm Throughput in Electronic Design Automation (EDA) Solutions System Throughput in the EDA Design Flow Abstract Functional verification of Silicon on Chip (SoC) designs can contribute

More information

Scheduling Allowance Adaptability in Load Balancing technique for Distributed Systems

Scheduling Allowance Adaptability in Load Balancing technique for Distributed Systems Scheduling Allowance Adaptability in Load Balancing technique for Distributed Systems G.Rajina #1, P.Nagaraju #2 #1 M.Tech, Computer Science Engineering, TallaPadmavathi Engineering College, Warangal,

More information

Evaluation of Job-Scheduling Strategies for Grid Computing

Evaluation of Job-Scheduling Strategies for Grid Computing Evaluation of Job-Scheduling Strategies for Grid Computing Volker Hamscher 1, Uwe Schwiegelshohn 1, Achim Streit 2, and Ramin Yahyapour 1 1 Computer Engineering Institute, University of Dortmund, 44221

More information

A Comparison of General Approaches to Multiprocessor Scheduling

A Comparison of General Approaches to Multiprocessor Scheduling A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University

More information

Design and Evaluation of Job Scheduling Strategies for Grid Computing

Design and Evaluation of Job Scheduling Strategies for Grid Computing Design and Evaluation of Job Scheduling Strategies for Grid Computing Doctorial Thesis Ramin Yahyapour Genehmigte Dissertation zur Erlangung des akademischen Grades eines Doktors an der Fakultät für Elektrotechnik

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Comparison of PBRR Scheduling Algorithm with Round Robin and Heuristic Priority Scheduling Algorithm in Virtual Cloud Environment

Comparison of PBRR Scheduling Algorithm with Round Robin and Heuristic Priority Scheduling Algorithm in Virtual Cloud Environment www.ijcsi.org 99 Comparison of PBRR Scheduling Algorithm with Round Robin and Heuristic Priority Scheduling Algorithm in Cloud Environment Er. Navreet Singh 1 1 Asst. Professor, Computer Science Department

More information

Cloud Management: Knowing is Half The Battle

Cloud Management: Knowing is Half The Battle Cloud Management: Knowing is Half The Battle Raouf BOUTABA David R. Cheriton School of Computer Science University of Waterloo Joint work with Qi Zhang, Faten Zhani (University of Waterloo) and Joseph

More information

FAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING

FAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING FAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING Hussain Al-Asaad and Alireza Sarvi Department of Electrical & Computer Engineering University of California Davis, CA, U.S.A.

More information

Load Balancing on a Grid Using Data Characteristics

Load Balancing on a Grid Using Data Characteristics Load Balancing on a Grid Using Data Characteristics Jonathan White and Dale R. Thompson Computer Science and Computer Engineering Department University of Arkansas Fayetteville, AR 72701, USA {jlw09, drt}@uark.edu

More information

Optimizing Shared Resource Contention in HPC Clusters

Optimizing Shared Resource Contention in HPC Clusters Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 4, July-Aug 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 4, July-Aug 2014 RESEARCH ARTICLE An Efficient Service Broker Policy for Cloud Computing Environment Kunal Kishor 1, Vivek Thapar 2 Research Scholar 1, Assistant Professor 2 Department of Computer Science and Engineering,

More information

Power Management in Cloud Computing using Green Algorithm. -Kushal Mehta COP 6087 University of Central Florida

Power Management in Cloud Computing using Green Algorithm. -Kushal Mehta COP 6087 University of Central Florida Power Management in Cloud Computing using Green Algorithm -Kushal Mehta COP 6087 University of Central Florida Motivation Global warming is the greatest environmental challenge today which is caused by

More information

Running a Workflow on a PowerCenter Grid

Running a Workflow on a PowerCenter Grid Running a Workflow on a PowerCenter Grid 2010-2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise)

More information

Introduction to Apache YARN Schedulers & Queues

Introduction to Apache YARN Schedulers & Queues Introduction to Apache YARN Schedulers & Queues In a nutshell, YARN was designed to address the many limitations (performance/scalability) embedded into Hadoop version 1 (MapReduce & HDFS). Some of the

More information

Provisioning Spot Market Cloud Resources to Create Cost-Effective Virtual Clusters

Provisioning Spot Market Cloud Resources to Create Cost-Effective Virtual Clusters Provisioning Spot Market Cloud Resources to Create Cost-Effective Virtual Clusters William Voorsluys, Saurabh Kumar Garg, and Rajkumar Buyya Cloud Computing and Distributed Systems (CLOUDS) Laboratory

More information

Keywords: Dynamic Load Balancing, Process Migration, Load Indices, Threshold Level, Response Time, Process Age.

Keywords: Dynamic Load Balancing, Process Migration, Load Indices, Threshold Level, Response Time, Process Age. Volume 3, Issue 10, October 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Load Measurement

More information

Workload Characteristics of the DAS-2 Supercomputer

Workload Characteristics of the DAS-2 Supercomputer Workload Characteristics of the DAS-2 Supercomputer Hui Li Lex Wolters David Groep Leiden Institute of Advanced Computer National Institute for Nuclear and High Science (LIACS), Leiden University Energy

More information

Distributed and Scalable QoS Optimization for Dynamic Web Service Composition

Distributed and Scalable QoS Optimization for Dynamic Web Service Composition Distributed and Scalable QoS Optimization for Dynamic Web Service Composition Mohammad Alrifai L3S Research Center Leibniz University of Hannover, Germany alrifai@l3s.de Supervised by: Prof. Dr. tech.

More information

The International Journal Of Science & Technoledge (ISSN 2321 919X) www.theijst.com

The International Journal Of Science & Technoledge (ISSN 2321 919X) www.theijst.com THE INTERNATIONAL JOURNAL OF SCIENCE & TECHNOLEDGE Efficient Parallel Processing on Public Cloud Servers using Load Balancing Manjunath K. C. M.Tech IV Sem, Department of CSE, SEA College of Engineering

More information

Extended Round Robin Load Balancing in Cloud Computing

Extended Round Robin Load Balancing in Cloud Computing www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 3 Issue 8 August, 2014 Page No. 7926-7931 Extended Round Robin Load Balancing in Cloud Computing Priyanka Gautam

More information

Multi-service Load Balancing in a Heterogeneous Network with Vertical Handover

Multi-service Load Balancing in a Heterogeneous Network with Vertical Handover 1 Multi-service Load Balancing in a Heterogeneous Network with Vertical Handover Jie Xu, Member, IEEE, Yuming Jiang, Member, IEEE, and Andrew Perkis, Member, IEEE Abstract In this paper we investigate

More information

Adaptive Task Scheduling for Multi Job MapReduce

Adaptive Task Scheduling for Multi Job MapReduce Adaptive Task Scheduling for MultiJob MapReduce Environments Jordà Polo, David de Nadal, David Carrera, Yolanda Becerra, Vicenç Beltran, Jordi Torres and Eduard Ayguadé Barcelona Supercomputing Center

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

CHAPTER 4: SOFTWARE PART OF RTOS, THE SCHEDULER

CHAPTER 4: SOFTWARE PART OF RTOS, THE SCHEDULER CHAPTER 4: SOFTWARE PART OF RTOS, THE SCHEDULER To provide the transparency of the system the user space is implemented in software as Scheduler. Given the sketch of the architecture, a low overhead scheduler

More information

ABSTRACT. KEYWORDS: Cloud Computing, Load Balancing, Scheduling Algorithms, FCFS, Group-Based Scheduling Algorithm

ABSTRACT. KEYWORDS: Cloud Computing, Load Balancing, Scheduling Algorithms, FCFS, Group-Based Scheduling Algorithm A REVIEW OF THE LOAD BALANCING TECHNIQUES AT CLOUD SERVER Kiran Bala, Sahil Vashist, Rajwinder Singh, Gagandeep Singh Department of Computer Science & Engineering, Chandigarh Engineering College, Landran(Pb),

More information

Analysis of IP Network for different Quality of Service

Analysis of IP Network for different Quality of Service 2009 International Symposium on Computing, Communication, and Control (ISCCC 2009) Proc.of CSIT vol.1 (2011) (2011) IACSIT Press, Singapore Analysis of IP Network for different Quality of Service Ajith

More information

Resource Allocation Avoiding SLA Violations in Cloud Framework for SaaS

Resource Allocation Avoiding SLA Violations in Cloud Framework for SaaS Resource Allocation Avoiding SLA Violations in Cloud Framework for SaaS Shantanu Sasane Abhilash Bari Kaustubh Memane Aniket Pathak Prof. A. A.Deshmukh University of Pune University of Pune University

More information

Reverse Auction-based Resource Allocation Policy for Service Broker in Hybrid Cloud Environment

Reverse Auction-based Resource Allocation Policy for Service Broker in Hybrid Cloud Environment Reverse Auction-based Resource Allocation Policy for Service Broker in Hybrid Cloud Environment Sunghwan Moon, Jaekwon Kim, Taeyoung Kim, Jongsik Lee Department of Computer and Information Engineering,

More information

VSched: Mixing Batch And Interactive Virtual Machines Using Periodic Real-time Scheduling

VSched: Mixing Batch And Interactive Virtual Machines Using Periodic Real-time Scheduling VSched: Mixing Batch And Interactive Virtual Machines Using Periodic Real-time Scheduling Bin Lin Peter A. Dinda Prescience Lab Department of Electrical Engineering and Computer Science Northwestern University

More information