Reliable Job Scheduler using RFOH in Grid Computing

VOL. 1, NO. 1, JULY 2010    ISSN 2079-8407
2009-2010 CIS Journal. All rights reserved.
http://www.cisjournal.org

Leyli Mohammad Khanli
Dept. of Computer Science, Tabriz University, Tabriz, Iran
l-khanli@tabrizu.ac.ir

Maryam Etminan Far
Dept. of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran
metminanfar@iaut.ac.ir

Ali Ghaffari
Dept. of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran
aghaffari@iaut.ac.ir

ABSTRACT

The distributed and dynamic nature of Grids makes the probability of failure high in such systems, so fault tolerance has become a crucial area in computational Grids. In this paper, we propose a new Genetic Algorithm that uses Resource Fault Occurrence History (RFOH) to achieve reliable job scheduling in a computational Grid. This strategy maintains the fault occurrence history of resources in the Grid Information Server (GIS). The Genetic Algorithm with RFOH finds a near-optimal solution to the problem. Furthermore, it increases the percentage of jobs executed within the specified deadline. The simulation results demonstrate that the proposed strategy decreases the probability of failure and therefore increases reliability. It also reduces the total execution time of jobs. We thus obtain a combination of reliability and user satisfaction.

Keywords: Grid Computing; Reliability; RFOH; Job Scheduling; Genetic Algorithm; Fault Tolerance

1. INTRODUCTION

Computer scientists in the mid-1990s began exploring the design and development of an analogous infrastructure called the computational power Grid [1, 2]. A Grid is a type of parallel and distributed system that enables the sharing, selection, and aggregation of geographically distributed, autonomous, and heterogeneous resources dynamically at runtime, depending on their availability, capability, performance, cost, and users' quality-of-service requirements [3]. Resources can be computers, memories, instruments (such as telescopes), software applications, and data, all connected through the Internet.

In Grid environments, there are many jobs to schedule for parallel execution on the system. Since this scheduling problem is NP-hard, we can use evolutionary algorithms to solve it. Among these algorithms, the Genetic Algorithm (GA) is the most common, and it is the one used in this paper. GA is a global search technique that maintains a pool of potential solutions called chromosomes [4]. The GA produces new solutions by randomly combining the good features of existing solutions. The Crossover operation performs global search through the solution space by randomly exchanging portions of two chromosomes. Another important operator, used for local search, is Mutation, which works by randomly changing one of the genes in a chromosome. The whole process is repeated a number of times, called generations or iterations.

Typically, the probability of a failure is higher in Grid computing than in traditional parallel computing [5, 6], and the failure of resources affects job execution fatally. Therefore, a fault tolerance service is essential in computational Grids [5]. One of the important goals in distributed systems such as the Grid is to construct the system in such a way that it can automatically recover from errors without a serious decrease in system performance.

To obtain reliable and therefore fault-tolerant job scheduling in the Grid, we propose a new strategy in which Resource Fault Occurrence History (RFOH) information is used in the GA. This yields a suitable solution to the scheduling problem that is also reliable, because we reduce the selection probability of resources with a larger fault occurrence history.

The rest of the paper is organized as follows: Section 2 describes the related work. Section 3 describes the proposed strategy. Section 4 discusses the simulation results, and Section 5 concludes the paper.

2. RELATED WORK

The probability of a failure in large-scale Grids is much greater than in traditional parallel systems [5, 6]. Thus, a Grid system should be able to identify and manage faults and support reliable execution of jobs. Failure handling techniques in Grid systems are classified as task-level and workflow-level [7]. Task-level techniques mask the effects of the execution failure of tasks in the Grid system, while workflow-level techniques manipulate the system structure, such as the execution flow, to deal with erroneous conditions.

The Checkpoint technique is one of the task-level techniques. It moves failed tasks transparently to other resources, so that a task can continue its execution from the point of failure [8]. In [8], a GA-based solution was proposed for allocating jobs to resources by the Grid scheduler, and the Checkpoint technique was used to make job scheduling fault tolerant. However, RFOH information was not considered there as it is in our strategy.

In [6], the fault occurrence history of each resource is maintained in the GIS. Whenever a resource broker has a job to schedule, it uses this information from the GIS and, depending on it, applies a different intensity of checkpointing, considering that resources have different tendencies towards faults. In that approach, only the quality of checkpointing is discussed and no attention is paid to the type of scheduling. We use this information in a different form, within a GA, to obtain a reliable and therefore fault-tolerant job scheduler.

In this paper, we use RFOH information in the GA. By doing this, we can reduce the selection probability of resources with a larger fault occurrence history. We therefore obtain reliable scheduling and a form of fault tolerance. We can also achieve user satisfaction in job scheduling, so the result is a combination of reliability and user satisfaction.

3. OUR NEW APPROACH

According to Fig. 1, our proposed strategy consists of three components: (a) Fault Detector, (b) RFOH Storage, and (c) Job Scheduler. Each of these components is explained below.

[Figure 1. An overview of our new scheduler components.]

A. Fault Detector

Fault detection on the resources is done by the resource broker. After allocating a job to a resource, the resource broker should receive a response about the job execution from it within a certain time interval. If the resource broker does not get any response in this interval, it concludes that a fault has occurred on that resource. In the next step, it sends the information about that fault to the GIS. The time interval is a function of the resource speed, the communication latency, and the queue length of the resource.

B. RFOH Storage

To obtain a reliable job scheduler, we use RFOH in the GA. This reduces the selection probability of resources with a larger fault occurrence history. To store RFOH, the GIS maintains the history of the faults that occurred in resources in a table called the Fault Occurrence History Table (FOHT) [9]. The FOHT has two columns. The first column holds the history of faults that occurred in each resource, and the second column keeps the number of jobs executed by each resource. The number of rows therefore equals the number of resources. According to [9], the FOHT is updated when:

1) The resource is unable to execute the given job within the specified deadline. The fault index of this resource (first cell) is incremented by 1.
2) A job is allocated to the resource. The number of jobs executed by this resource (second cell) is incremented by 1.

Therefore, the FOHT is updated with the following equation for the i-th resource [9]:

$$
\mathrm{FOHT}[i] :=
\begin{cases}
\mathrm{FOHT}[i,1] + 1 & \text{with fault occurrence in the } i\text{-th resource} \\
\mathrm{FOHT}[i,2] + 1 & \text{with job allocation to the } i\text{-th resource}
\end{cases}
\tag{1}
$$

Fig. 2 presents a part of the FOHT at a given time. For example, this figure indicates that the number of jobs executed by R1 is 6700, but a fault occurred in ten of these executions.

[Figure 2. A part of FOHT at a given time. Each row holds the fault index and the number of executed jobs of one resource; for example, R1: 10 faults, 6700 executions.]

C. Job Scheduler

We assume that there are n jobs J1, J2, ..., Jn and that we want to allocate them to m resources R1, R2, ..., Rm. This scheduling problem is NP-hard, so we use a Genetic Algorithm to solve it. As mentioned before, to obtain reliable and fault-tolerant job scheduling, we use RFOH in this algorithm. According to Fig. 3, this method has 5 steps, which work as described below:

[Figure 3. The steps of our proposed strategy.]

1) Population initialization: First, we randomly generate an initial population of chromosomes. As shown in Fig. 4, each chromosome represents a possible solution, which is a mapping sequence between jobs and resources. In this method, the length of a chromosome equals the number of jobs in the Grid. According to Fig. 4, jobs J1, J2, and J3 are executed by resources R6, R1, and R3, respectively.

[Figure 4. Representation of a part of a chromosome: J1 on R6, J2 on R1, J3 on R3.]

2) Chromosome evaluation: Each chromosome must be evaluated to determine its fitness value. This is done with a fitness function, which is defined based on the parameters below.

a) Users' satisfaction: We assume that the parameter the user cares about is the response time of job execution. So the fitness function is defined as:

$$
f_1 = \sum_{i=1}^{n} Rt_i \tag{2}
$$

In (2), Rt_i is the response time of the execution of the i-th job in a chromosome. According to this equation, a chromosome whose resources execute the jobs in less time will have a lower fitness value in the population. The lower the fitness value of a chromosome relative to the others, the better a solution it is for our problem.

b) Reliability: As mentioned, we use RFOH in our strategy to obtain a reliable job scheduler. By using RFOH information, resources with a greater tendency to fail have a lower probability of being selected. So we can define the fitness function as:

$$
f_2 = \sum_{i=1}^{n} \left( \frac{\mathrm{FOHT}[i,1]}{\mathrm{FOHT}[i,2]} \times 100 \right) \tag{3}
$$

According to (3), for each selected resource we use the FOHT to calculate the ratio of its fault occurrence history to its total number of job executions, as a percentage, and sum these percentages. Hence, a chromosome with a high fault occurrence probability will have a high fitness value in the population. We therefore attempt to minimize the fitness value of chromosomes, which reduces the selection probability of resources with a greater tendency to fail.

Now, by combining (2) and (3), we can introduce a fitness function for chromosome evaluation:

$$
f = \sum_{i=1}^{n} Rt_i + \sum_{i=1}^{n} \left( \frac{\mathrm{FOHT}[i,1]}{\mathrm{FOHT}[i,2]} \times 100 \right) \tag{4}
$$

This equation shows that if the resources within a chromosome have a lower total response time for executing the jobs and a smaller fault occurrence history, that chromosome will have a lower fitness value. Since we attempt to minimize the fitness value in the population, such a chromosome will have a high probability of being selected as a solution.

3) Crossover and Mutation operators: After evaluating the chromosomes, we apply Crossover and Mutation operators to search for a proper solution. There are various types of Crossover and Mutation operators. In this strategy, we use Two-Point Crossover and a kind of Uniform Mutation. The Two-Point Crossover operator selects a random pair of chromosomes and exchanges a random part of them. Our Mutation operator randomly selects a chromosome, then randomly selects a job within the chromosome, and, according to the first column of the FOHT, if its resource has more than one fault occurrence in its history, reassigns the job to a new resource chosen at random.

4) Replacement: After Crossover and Mutation are performed, we replace the parent chromosomes in the population with their offspring.

5) Termination: At this point, one iteration of the algorithm is complete. The algorithm stops when a predefined number of evolutions is reached, or all chromosomes converge to the same mapping, or there is no improvement in recent evaluations, or a cost bound is met [9]. Finally, after the algorithm stops, the chromosome with the lowest fitness value is selected as the solution to the problem.
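The FOHT update rule (1) and the combined fitness function (4) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names (`new_foht`, `record_allocation`, `record_fault`, `fault_percentage`, `fitness`) are hypothetical, the response times are assumed to be supplied as a job-by-resource matrix (the paper does not say how Rt_i is obtained), and a resource with no execution history is assumed to contribute 0% to the reliability term.

```python
def new_foht(num_resources):
    """Create an empty FOHT: one [fault_index, jobs_executed] row per resource."""
    return [[0, 0] for _ in range(num_resources)]


def record_allocation(foht, resource):
    """Rule (1), second case: a job was allocated to this resource."""
    foht[resource][1] += 1


def record_fault(foht, resource):
    """Rule (1), first case: the resource missed the deadline for a job."""
    foht[resource][0] += 1


def fault_percentage(foht, resource):
    """Percentage of executions on this resource that ended in a fault."""
    faults, executed = foht[resource]
    # Assumption: a resource with no history yet is treated as 0% faulty.
    return 100.0 * faults / executed if executed else 0.0


def fitness(chromosome, response_time, foht):
    """Equation (4): total response time plus summed fault percentages.

    chromosome[j] is the index of the resource assigned to job j.
    response_time[j][r] is the (assumed known) response time of job j on
    resource r. Lower values are better.
    """
    f1 = sum(response_time[j][r] for j, r in enumerate(chromosome))
    f2 = sum(fault_percentage(foht, r) for r in chromosome)
    return f1 + f2
```

With this layout, a chromosome that places jobs on a resource with a poor history is penalized even when that resource is fast, which is the trade-off the combined fitness function is meant to express.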

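The two genetic operators of step 3 might be sketched in Python as follows. This is a hedged illustration rather than the operators of the Matlab toolbox used in the paper: the function names `two_point_crossover` and `foht_mutation` are hypothetical, the chromosome is assumed to be a list of resource indices (one per job), and the FOHT is assumed to be a list of `[fault_index, jobs_executed]` rows. Excluding the current resource when reassigning is one reading of "a new resource".

```python
import random


def two_point_crossover(parent_a, parent_b, rng=random):
    """Exchange the segment between two random cut points of two parents."""
    n = len(parent_a)
    # Two distinct cut points, so the exchanged segment is never empty.
    i, j = sorted(rng.sample(range(n + 1), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b


def foht_mutation(chromosome, foht, num_resources, rng=random):
    """Pick a random job; reassign it only if its resource has a fault history.

    The job is moved when the fault index (first FOHT column) of its current
    resource is greater than one, as described in step 3.
    """
    mutated = list(chromosome)
    job = rng.randrange(len(mutated))
    current = mutated[job]
    if foht[current][0] > 1:
        # Assumption: "new resource" means any resource other than the current one.
        candidates = [r for r in range(num_resources) if r != current]
        if candidates:
            mutated[job] = rng.choice(candidates)
    return mutated
```

Passing a seeded `random.Random` instance as `rng` makes runs reproducible, which helps when comparing the scheduler with and without RFOH information.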
4. SIMULATION RESULTS

In this section, we evaluate our proposed strategy in two steps. In both steps, we used gatool, the Matlab toolbox for Genetic Algorithms, as the simulator. The simulation parameters and settings are listed in Table 1.

Table 1. Simulation parameters and settings

Number of jobs:          120
Job types (40 jobs in each group):
    Short:       1-10 instructions
    Middle:      45-55 instructions
    Long:        90-100 instructions
Number of resources:     120
Resource types (24 nodes in each group):
    Very Slow:   1 MFlops
    Slow:        10 MFlops
    Middle:      50 MFlops
    Fast:        90 MFlops
    Very Fast:   100 MFlops
Resource failure rates:
    Very Faulty: 90%-100% fault occurrence
    Faulty:      45%-55% fault occurrence
    Safe:        0%-10% fault occurrence

With 120 jobs to execute, and following Fig. 4, the general schema of the chromosomes is as shown in Fig. 5. According to step 1 of the job scheduler in Section 3, a random population is generated first. Then, in step 2, each chromosome is evaluated according to (4). We can create the FOHT and calculate the total execution time of each chromosome from the settings in Table 1. After that, in step 3, the Crossover and Mutation operators are applied to the population, the parents are replaced with their offspring in step 4, and the process resumes from step 2. After a number of iterations, one of the conditions introduced in step 5 occurs and stops the algorithm. At that point, the chromosome with the lowest fitness value is a near-best solution for our problem.

[Figure 5. General schema of chromosomes in our example: J1 on Ri, J2 on Rj, ..., J120 on Rk.]

D. Evaluation of total execution time

Simulation results illustrate that our proposed strategy can select proper resources that execute jobs in less time. For performance evaluation, we compared our new algorithm with an algorithm that never uses RFOH information. Fig. 6 shows that if no fault occurs during execution, the total execution time with our new strategy is less than the total execution time without RFOH information. So the GA with RFOH performs better than the simple GA in terms of total execution time.

[Figure 6. Total execution time comparison by varying number of resources. Vertical axis: execution time without failure (ms); horizontal axis: number of resources; series: with attention to RFOH, without attention to RFOH.]

E. Evaluation of fault tolerance and reliability

According to Fig. 7, our new strategy reduces the percentage probability of fault occurrence in the resources selected for executing jobs. This figure shows that without RFOH information, the probability of selecting faulty resources is high. Furthermore, this probability decreases as the number of resources decreases, and our strategy lowers it. Fig. 8 also shows that in the simple GA, which pays no attention to RFOH information, reliability decreases, whereas our new strategy increases reliability. We therefore obtain a fairly reliable selection, because resources with a larger fault occurrence history were not selected. Note that a resource that is reliable but slow is still unsuitable, so it is not selected to execute any of our jobs. This shows that we can have a combination of user satisfaction and reliability.

[Figure 7. Failure probability percentage comparison by varying number of resources. Vertical axis: failure probability percentage; horizontal axis: number of resources; series: with attention to RFOH, without attention to RFOH.]

[Figure 8. Reliability percentage comparison by varying number of resources. Vertical axis: reliability percentage; horizontal axis: number of resources; series: with attention to RFOH, without attention to RFOH.]

5. CONCLUSION

In Grid environments, task execution failures can occur for various reasons. In this paper, we presented a new GA for reliable job scheduling in the Grid. This algorithm uses RFOH information, which is maintained in the FOHT. Using this information reduces the chance of selecting resources that have a higher failure probability. Simulation results indicate that our proposed strategy decreases the total job execution time.

REFERENCES

[1] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: Enabling scalable virtual organizations," Supercomputing Applications, 2001.
[2] I. Foster and C. Kesselman, The Grid: Blueprint for a Future Computing Infrastructure, San Mateo, CA: Morgan Kaufmann, 1999.
[3] M. Baker, R. Buyya, and D. Laforenza, "Grids and Grid Technologies for Wide-area Distributed Computing," Software-Practice & Experience, Vol. 32, No. 15, 2002, pp. 1437-1466.
[4] A.Y. Zomaya, R.C. Lee, and S. Olariu, "An Introduction to Genetic-Based Scheduling in Parallel-Processor Systems," Solutions to Parallel and Distributed Computing Problems: Lessons from Biological Science, A.Y. Zomaya, F. Ercal, and S. Olariu, eds., New York: Wiley, 2001, chapter 5, pp. 111-133.
[5] HwaMin Lee, KwangSik Chung, SungHo Chin, JongHyuk Lee, and DaeWon Lee, "A resource management and fault tolerance services in grid computing," Journal of Parallel and Distributed Computing, Vol. 65, 2005, pp. 1305-1317.
[6] Babar Nazir and Taimoor Khan, "Fault Tolerant Job Scheduling in Computational Grid," 2nd International Conference on Emerging Technologies (IEEE ICET), Peshawar, Pakistan, 2006.
[7] S. Hwang and C. Kesselman, "Grid Workflow: A Flexible Failure Handling Framework for the Grid," in 12th IEEE International Symposium on High Performance Distributed Computing (HPDC'03), Seattle, Washington, USA, IEEE CS Press, Los Alamitos, CA, USA, June 22-24, 2003.
[8] S. Baghavathi Priya, M. Prakash, and K. K. Dhawan, "Fault Tolerance-Genetic Algorithm for Grid Task Scheduling using Check Point," The Sixth International Conference on Grid and Cooperative Computing (GCC), 2007.
[9] Leyli Mohammad Khanli, Maryam Etminan Far, and Amir Masoud Rahmani, "RFOH: a New Fault Tolerant Job Scheduler in Grid Computing," The 2nd International Conference on Computer Engineering and Applications (ICCEA), Bali Island, Indonesia, March 19-21, 2010.