New scheduling problems coming from grid computing

Transcription

1 Electronic Notes in Discrete Mathematics 36 (2010) New scheduling problems coming from grid computing Alexandre Lissy 1 Laboratoire d Informatique Université François-Rabelais Tours, France Patrick Martineau 2 Laboratoire d Informatique Université François-Rabelais Tours, France Abstract In this article, we expose how the availability of an increasing computational power affects scheduling problems that uses grid computing to utilize this power. We first present how scheduling problems are currently handled, and the so-called nextgeneration clusters that will have different needs. Starting from those, we discuss real-life issues that will affect the scheduling and its resolutions and we expose how it affects scheduling and assignment problems. Then, we conclude with discussing some emerging methods to increase scheduling performances. Keywords: scheduling, clusters, grid, single system image, failure, checkpoint, hotswap, green it 1 alexandre.lissy@etu.univ-tours.fr 2 patrick.martineau@univ-tours.fr /$ see front matter 2010 Elsevier B.V. All rights reserved. doi: /j.endm

2 1034 A. Lissy, P. Martineau / Electronic Notes in Discrete Mathematics 36 (2010) Introduction We are interested in scheduling problems for grid computing. As per Moore s law, stating that each year, computation power doubles for the same price, one can notice that computing power available is becoming more and more important. But while this might be a good news, it also has consequences on scheduling problems. In the first section, we will present what methods are currently used for handling real clusters and how this relates to scheduling. Then, we will describe what we call next-generation clusters, and resume with what seems to be the current industrial way of evolution. Before we conclude, we will expose new problems coming from this evolution and the way they impact scheduling. 2 Currently in-use scheduling methods for clusters in production When talking about clusters in production, it either targets enormous supercomputer such as those from the Top500 list, grid-based such as Grid5000 project, or small-scale computing cluster that are dozen of computers. 2.1 Running and proved solutions All super-computer and grid shares one thing, that is critical for them to be usable, i.e. submission mecanishm. The technical details are irrelevant, what is important is the idea: a tool used to schedule jobs. Such tools are called batch schedulers and they can be as primitive as the classical cron tool, or as complex as OAR. All of these batch schedulers are very efficient scheduling software, given that they are provided with correct data as input (which is classical off-line scheduling data, such as job s availability date, duration, ressources needed (power, memory)). Issue are kind of obvious: no one is able to precisely estimate the needs. Worst may happen as batch scheduling softwares may kill tasks that are still running after their deadline, forcing users to cheat, resulting in a poor scheduling. 2.2 Next-generation solutions For the past decade, the size of clusters have been following some what like Moore s law, resulting in those machines to have more and more CPUs. Ac-

3 A. Lissy, P. Martineau / Electronic Notes in Discrete Mathematics 36 (2010) cording to Top500 3, more than the half of today s super-computers registered on the list have between 4000 and 8000 CPUs, by june One decade ago, in june 1999, there were between 65 to 128 CPUs for half of all the registered systems. It should be obvious from these figures that parallelism is heavily increasing: it has been multiplied bu more than sixty. Another recent development is aimed at reducing the level of use for supercomputing, by suppressing the batch scheduler. It is the Single System Image way. Main idea is to create a cluster of machines running on a network, and aggregate them so it looks like a single massively parallel machine. Everything is common and shared: filesystem, process space, I/O space, IPC space,... This allows to create low-grade yet powerful and easily usables clusters. Several projects 4 5 are working in this way, such as DragonFly BSD, Kerrighed, LinuxPMI, MOSIX, OpenSSI,... Those clusters have some nice features such as distant process fork, or process migration, which are scheduling problems ; deciding on which node fork a new process, or where to migrate which process. But the context for those scheduling problems is different than for classical batch scheduler: it is impossible to ask for anything such as process duration, ressources needed, etc., leading to new needs for an efficient scheduling around the cluster. 3 Impacting scheduling problems We will now present several issue linked to clusters, not yet real-life for now, but which will be as soon as those developments are used in production. For each one, we will expose how it impacts scheduling. We will first present the problems linked to cluster s life cycle. Then, we will talk about virtualization and its role. After this, we will talk about failure handling, before continuing with checkpointing. We will then talk about energetic footprint of clusters. We will finish with some considerations about communications in massively parallelized software. 3.1 Cluster s life cycle The main idea behind this title is that we see emerging techniques linked with the use of wide clusters: those clusters have hotswap (adding and/or removal of hardware) capabilities. One or several nodes might appear or disappear

4 1036 A. Lissy, P. Martineau / Electronic Notes in Discrete Mathematics 36 (2010) in the cluster. In the past, scheduling problems for clusters were much more static. But with hotswap capabilities being mainstream (and needed), it is not enough. This kind of features is not only limited to compute-only clusters, but we might also find them for any kind of massively parallel architecture, used to handle cheap but powerful storage. We will talk about the node removal later, while exposing failure handling. Impact of cluster s life cycle on scheduling problem can be resumed to: robustness. It is important to be aware that we will have some nodes unavailable when using the cluster, and the proposed solution should tolerate this and not degrade the objective. 3.2 Virtualization is getting cheaper Virtualization is a rather old (IBM introduced it with CP/CMS a couple of decades ago) yet useful tool. For the past few years, it became mainstream tool (VMware, VirtualBox, Xen, KVM). Main reasons why it might be interesting for scheduling is that it allows for hardware-independency, abstraction for running jobs and this without an important overhead. Last, many virtualization tools also provide a way to migrate virtual machines between physical hosts. With good standard hardware, it might be very fast (a couple of seconds), and has no impact: for code running, it might just look like the box has been freezed. Plus, every mainstream processor has virtualisation instructions. Virtualization can be used to enable high availability and redundancy of services. Scheduling problems might take advantage of this kind of functionnality to duplicate some critical jobs and thus being able to ensure that they will be terminated. Current literature treats of how we can use virtualization as a piece of solution for scheduling problem [8]. 3.3 The checkpointing issue The main idea behind this concept is to freeze the process, and save it. It can be useful to address several flaws, such as: a node running a process crashed, and your cluster doesn t handle this. So you have to restart your job. You can restart your process to what it was when checkpointed, and avoid loosing too many results. Another use might be for handling bugs in your program: the process crashed after several weeks of computation. Yet, efficient checkpointing is a non-trivial problem and has its roots in scheduling mainly. There are several sides of the problem to be solved, from

5 A. Lissy, P. Martineau / Electronic Notes in Discrete Mathematics 36 (2010) what and how to checkpointing, to when checkpointing. The second part of those questions, the when, is the one that will have impact on scheduling problems. Also, scheduling algorithm have to consider the time needed to make the checkpoint of the process when assigning jobs to node, and computing a solution. It is also linked to nodes failure handling: in order to checkpoint as later as possible before the node is fails. Current literature on checkpointing can be found [9]. 3.4 Failure handling This is a major topic, which several roots. First of all, hardware will fail. On small-scale system, it might be rather rare, but when you start to have enough hardware, you will have to face it. So, software (here, scheduling), needs to be aware. Failure is one component of cluster s life cycle. Even so it is the bad side of life, death, cluster should handle failures of one or several nodes in a way it doesn t affect, or at least as minimum as possible, the remaining nodes, in a way that preserves running jobs. When we talked about loosing one node, it is in a general sense, it does not specifically implies that hardware is broken. But such a node cannot be used for current scheduling in the cluster. It is derivated from cluster s life cycle, but it is a bit different. We already talked about the impact of appearance and disappearance of nodes at runtime. But there are other questions directly related specifically to the failure of a node. Scheduling should be able to try to prevent node failure and effectively handle those failures in an efficient way. Handling of failure is concomittant with tools like virtualization, checkpoint/restart and migration, as they can be used to address the problem So the impact for scheduling is major. There are several ways to avoid failure, or at least its consequences, but they all have a cost that need to be taken into account at scheduling level: it is possible to run several instances of the same process on several nodes but it will lower efficiency. This can be circumvented by enabling checkpoint feature, but then we should consider the checkpointing cost not to mention the question of checkpointing itself as shown above ; or you can use migration to continue execution on another node prior to failure. This can rely on checkpointing, or failure prediction, wether you want some proactive or reactive behavior. Once again, it has a cost. Failure prediction/prevention and proactive scheduling are becoming more and more popular as subjects, current literature is available [4,1]. Financial sector works on similar issues [6]. Literature also contains analysis of past

6 1038 A. Lissy, P. Martineau / Electronic Notes in Discrete Mathematics 36 (2010) failure [5]. 3.5 Dealing with energetic footprint This topic might seem rather strange, but energetic consumption will become more and more critical. It is both ecologically, economically and intellectually pertinent to achieve a better use of power. There are hardware-related advancements needed to achieve this, but software and scheduling is also a major piece of the puzzle. A cluster that schedule in a green-way will allow for more power to be available at the same ecological price. There are hundreds of ways to consider this side, but we can already see some companies results. Some huge companies might schedule jobs based on weather and time, in order to use datacenters that are needing the less cooling for this workload (think: desert and night). Combined with virtualization, some are migrating their workload towards the night, reducing the cooling needed. Some company is starting to use a freshly built datacenter with green it in mind. An example of this is their dynamic cooling infrastructure, allowing the power to only those coolers that are connected to racks that are effectively used, even modulating cooling power according to consumed power by those racks. Solutions to lower energy consumption will have both hardware and software side, and scheduling will be a major component. Scheduling algorithm for clusters should consider the energetic cost of their decisions. Minimizing this can be achieved using some of the tools presented above, such as checkpointing and virtualization. With all those, and some eco-friendly assignment scheduler in a cluster, we might be able to only power what is needed of the cluster, and thus limit the use of cooling devices, maximize electricity usage to lower electricity conversion-related losses. Scheduling energyaware can be found in [12,13]. 3.6 Communications are growing Another point that might affect scheduling community is the growing of parallelism. Even five years ago, we only had one-core based mainstream computers. Now, mainstream is reaching, on a mid-term scale, quad-core. And six-core is tomorrow. This is an illustration of how massively parallel architecture are going to be more and more used. A 1680-cores cluster is cheap, now. It is just 70 computers. That makes a large computation power available, and using it as best as possible when needed to have massively parallelized software. One thing might alter the performance at this point is the communication, which

7 A. Lissy, P. Martineau / Electronic Notes in Discrete Mathematics 36 (2010) won t stop. There are hardware methods to solve this, but a good assignment will avoid the need for expensive hardware solution. The more the tasks that needs to communicate are close, the better. Massively-parallel tasks will still need to communicate, and this will lead to an increase in communications. We stated that this is generally the most important part of negative speedup in parallel applications. So, if we want still to have massively parallel code being able to scale, it will be needed to schedule them in a way that reduces the performances drawback caused by communication, by trying to put the communication-intensive tasks together. The literature contains references about scheduling considering communications, [2]. 4 Conclusion Today s hardware is powerful, and there is still progress to be done to efficiently use it. For example, many of the in-production scheduling systems for clusters (of any kind, it can either be a compute-cluster targetted at running intensive and parallels tasks or a pool of webservers), are using naive and basic scheduling algorithms, that doesn t consider topics we covered. We can cite for example [10]. Other works in the literature that covers scheduling in grid are a bit more complex [7]. We can also find references of efficiency-oriented scheduling, for example [11], inspired by economy. Bi-criteria scheduling is covered, too [3]. Most of them will have an impact on energy consumption. The Green IT shouldn t only be seen as a marketing green-washing by some vendors, but as the sign of a need that has to be addressed. Another important point is the maturity of some Single-System Image solutions. Those clusters are easier to use than supercomputers, and improving their scheduling tools will help providing better performances at a lower cost. References [1] Benjamin,. T., Khoo and B. Veeravalli, Pro-active failure handling mechanisms for scheduling in grid computing environments, Journal of Parallel and Distributed Computing (2009). [2] Dodonov, E. and R. F. de Mello, A novel approach for distributed application scheduling based on prediction of communication events, Future Generation Computer Systems (2009).

8 1040 A. Lissy, P. Martineau / Electronic Notes in Discrete Mathematics 36 (2010) [3] Dutot, P.-F., L. Eyraud, G. Mounié and D. Trystram, Bi-criteria algorithm for scheduling jobs on cluster platforms, ACM Symposium on Parallel Algorithms and Architectures (2004) (2005). [4] Fu, S., Failure-aware construction and reconfiguration of distributed virtual machines for high availability computing, in: CCGRID 09: Proceedings of the th IEEE/ACM International Symposium on Cluster Computing and the Grid (2009), pp [5] Fu, S. and C.-Z. Xu, Exploring event correlation for failure prediction in coalitions of clusters, in: SC 07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing (2007), pp [6] Huysmans, J., B. Baesens, J. Vanthienen and T. V. Gestel, Failure prediction with self organizing maps, Expert Syst. Appl. 30 (2006), pp [7] Lin, G. and R. Rajaraman, Approximation algorithms for multiprocessor scheduling under uncertainty (2007). [8] Pérotin, M., Calcul Haute Performance sur Matériel Générique, Ph.D. thesis, Laboratoire d Informatique de l École Polytechnique de l Université de Tours (2008). [9] Rough, J. T. and A. M. Goscinski, The development of an efficient checkpointing facility exploiting operating systems services of the genesis cluster operating system, Future Gener. Comput. Syst. 20 (2004), pp [10] Rudolph, L., M. Silvkin-Allalouf and E. Upfal, A simple load balancing scheme for task allocation in parallel machines, ACM Symposium on Parallel Algorithms and Architectures (1991), pp [11] Sherwani, J., N. Ali, N. Lotia, Z. Hayat and R. Buyya, Libra: An economy driven job scheduling system for clusters, Technical Report Technical Report, July 2002, Dept. of Computer Science and Software Engineering, The University of Melbourne (2002). [12] Subrata, R., A. Y. Zomaya and B. Landfeldt, Cooperative power-aware scheduling in grid computing environments, Journal of Parallel and Distributed Computing 70 (2010), pp [13] Tsao, S.-L. and Y.-L. Chen, Energy-efficient packet scheduling algorithms for real-time communications in a mobile wimax system, Computer Communications 31 (2008), pp