Implementing Multiprocessor Scheduling Disciplines

Eric W. Parsons and Kenneth C. Sevcik
Computer Systems Research Institute
University of Toronto

Abstract. An important issue in multiprogrammed multiprocessor systems is the scheduling of parallel jobs. Consequently, there has been a considerable amount of analytic research in this area recently. A frequent criticism, however, is that proposed disciplines that are studied analytically are rarely ever implemented and even more rarely incorporated into commercial scheduling software. In this paper, we seek to bridge this gap by describing how at least one commercial scheduling system, namely Platform Computing's Load Sharing Facility, can be extended to support a wide variety of new scheduling disciplines. We then describe the design and implementation of a number of multiprocessor scheduling disciplines, each differing considerably in terms of the type of preemption that is assumed to be available and in terms of the flexibility allowed in allocating processors. In evaluating the performance of these disciplines, we find that preemption can significantly reduce overall response times, but that the performance of disciplines that must commit to allocations when a job is first activated can be significantly affected by transient loads.

1 Introduction

As large-scale multiprocessor systems become available to a growing user population, mechanisms to share such systems among users are becoming increasingly necessary. Users of these systems run applications that range from computationally-intensive scientific modeling to I/O-intensive databases, for the purpose of obtaining computational results, measuring application performance, or simply debugging new parallel codes. While in the past, systems may have been acquired exclusively for use by a small number of individuals, they are now being installed for the benefit of large user communities, making the efficient scheduling of these systems an important problem.

Although much analytic research has been done in this area, one of the frequent criticisms made is that proposed disciplines are rarely implemented and even more rarely ever become part of commercial scheduling systems. The commercial scheduling systems presently available, for the most part, only support run-to-completion (RTC) disciplines and have very little flexibility in adjusting

processor allocations. These constraints can lead to both high response times and low system utilizations. On the other hand, most research results support the need for both preemption and mechanisms for adjusting processor allocations of jobs. Given that a number of high-performance computing centers have begun to develop their own scheduling software [Hen95, Lif95, SCZL96, WMKS96], it is clear that existing commercial scheduling software is often inadequate. To support these centers, however, mechanisms to extend existing systems with external (customer-provided) policies are starting to become available in commercial software [SCZL96]. This allows new scheduling policies to be easily implemented, without having to re-implement much of the base functionality typically found in this type of software.

The primary objective of this paper is to help bridge the gap between some of the analytic research and practical implementations of scheduling disciplines. As such, we describe the implementation of a number of scheduling disciplines, involving various types of job preemption and processor allocation flexibility. Furthermore, we describe how different types of knowledge (e.g., amount of computational work or speedup characteristics) can be included in the design of these disciplines. A secondary objective of our work is to briefly examine the benefits preemption and knowledge may have on the performance of parallel scheduling disciplines.

The remainder of the paper is organized as follows. In the next section, we present motivation for the types of scheduling disciplines that we chose to implement. In Sect. 3, we describe Load Sharing Facility (LSF), the commercial scheduling software on which we based our implementation. In Sects. 4 and 5, we describe an extension library we have developed to facilitate the development of multiprocessor scheduling disciplines, followed by the set of disciplines we have implemented. Finally, we present our experimental results in Sect. 6 and our conclusions in Sect. 7.

2 Background

There have been many analytic studies done on parallel-job scheduling since it was first examined in the late eighties. Much of this work has led to three basic observations.

First, the performance of a system can be significantly degraded if a job is not given exclusive use of the processors on which it is running. Otherwise, the threads of a job may have to wait for significant amounts of time at synchronization points. This can either result in large context-switch overheads or wasted processor cycles. In general, a single thread is associated with each processor, an approach which is known as coordinated or gang scheduling [Ous82, FR92]. Sometimes, however, it is possible to multiplex threads of the same job on a reduced number of processors and still achieve good performance [MZ94]. (In the latter case, it is still assumed that only threads from a single job are simultaneously active on any given processor.)
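Coordinated (gang) scheduling of this kind is commonly pictured as a matrix with one column per processor and one row per time slot; all threads of a job are placed in one row so that they always run simultaneously. The sketch below is our own illustration of that idea (the class name and the first-fit row packing are assumptions, not the paper's implementation):

```python
# Sketch of a gang-scheduling matrix (assumed packing: first fit).
# Each row is one time slot; scheduling round-robins over rows, so every job
# has all of its threads co-scheduled whenever its row is active.

class GangMatrix:
    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.rows = []  # each row: list of (job_name, width) entries

    def place(self, job_name, width):
        """Place a job needing `width` processors into the first row with room."""
        if width > self.num_processors:
            raise ValueError("job wider than machine")
        for row in self.rows:
            used = sum(w for _, w in row)
            if used + width <= self.num_processors:
                row.append((job_name, width))
                return
        self.rows.append([(job_name, width)])  # open a new time slot

    def schedule(self, time_slot):
        """Jobs whose threads run simultaneously during `time_slot`."""
        row = self.rows[time_slot % len(self.rows)]
        return [name for name, _ in row]

m = GangMatrix(8)
m.place("A", 6)
m.place("B", 4)
m.place("C", 2)       # fits beside A in row 0
print(m.schedule(0))  # → ['A', 'C']
print(m.schedule(1))  # → ['B']
```

Note how the matrix trades packing efficiency for coordination: the two free processors in row 1 stay idle rather than run a thread of another job.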

Second, jobs generally make more efficient use of the processing resources given smaller processor allocations. As a result, providing the scheduler with some flexibility in allocating processors can significantly improve overall performance [GST91, Sev94, NSS93, RSD+94]. In most systems, users specify precisely the number of processors which should be allocated to each job, a practice that is known as rigid scheduling. In adaptive scheduling disciplines, the user specifies a minimum processor allocation, usually resulting from constraints due to memory, and a maximum, corresponding to the point after which no further processors are likely to be beneficial. In some cases, it may also be necessary to specify additional constraints on the allocation, such as being a power of two. If available, specific knowledge about jobs, such as amount of work or speedup characteristics, can further aid the scheduler in allocating processors in excess of minimum allocations.

In adaptive disciplines, jobs can be allocated a large number of processors at light loads, giving them good response times. As the load increases, however, allocation sizes can be decreased so as to improve the efficiency with which the processors are utilized, hence allowing a higher load to be sustained (i.e., a higher sustainable throughput). Also, adaptive disciplines can better utilize processors than rigid ones because, with the latter, processors are often left idle due to packing inefficiencies, while adaptive disciplines can adjust allocations to make use of all available processors.

The third observation is that workloads found in practice tend to have a very high degree of variability in the amount of computational work (also known as service demand) [CMV94, FN95, Gib96]. In other words, most jobs have very small service demands but a few jobs can run for a very long time. Run-to-completion (RTC) disciplines exhibit very high response times because once a long-running job is dispatched, short jobs must wait a considerable amount of time before processors become available. Preemption can significantly reduce the mean response times of these workloads relative to run-to-completion disciplines [PS95].

Unlike the sequential case, preemption of parallel jobs can be quite expensive and complex to support. Fortunately, results indicate that preemption does not need to be invoked frequently to be useful, since only long-running jobs ever need to be preempted. In this paper, we consider three distinct types of preemption, in increasing order of implementation complexity.

Simple In simple preemption, a job may be preempted but its threads may not be migrated to another processor. This type of preemption is the easiest to support (as threads need only be stopped), and may be the only type available on message-passing systems.

Migratable In migratable preemption, a job may be preempted and its threads migrated. Normally, this type of preemption can be easily supported in shared-memory systems, but ensuring that data accessed by each thread is also migrated appropriately can be difficult. In message-passing systems, operating-system support for migration is not usually provided, but checkpointing can often be employed instead.1 For example, the Condor system provides a transparent checkpointing facility for parallel applications that use either MPI or PVM [PL96]. When a checkpoint is requested, the run-time library flushes any network communications and I/O and saves the images of each process involved in the computation to disk; when the job is restarted, the run-time library re-establishes the necessary network connections and resumes the computation from the point at which the last checkpoint was taken. As such, using checkpointing to preempt a job is similar in cost to swapping, except that all kernel resources are relinquished.

Malleable In malleable preemption, the size of a job's processor allocation may be changed after it has begun execution, a feature that normally requires explicit support within the application.2 In the process control approach, the application must be designed to adapt dynamically to changes in processor allocation while it is running [TG89, GTS91, NVZ96]. As this type of support is uncommon, a simpler strategy may be to rely on application-level checkpointing, often used by long-running jobs to tolerate system failures. For these cases, it might be possible to modify the application so as to store checkpoints in a format that is independent of allocated processors, thus allowing the job to be subsequently restarted on a different number of processors.

A representative sample of coordinated scheduling disciplines that have been previously studied is presented in Table 1, classified according to the type of preemption available and the flexibility in processor allocation (i.e., rigid versus adaptive). Adaptive disciplines are further categorized by the type of information they assume to be available, which can include service demand, speedup characteristics, and memory requirements.3 All types of preemption (simple, migratable, malleable) can be applied to all adaptive disciplines, but only simple and migratable preemption are meaningful for rigid disciplines. The disciplines proposed in this paper are highlighted in italics. (A more complete version of this table can be found elsewhere [Par97].)
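The checkpoint/restart cycle used for migratable preemption can be sketched as follows. All names and record fields here are our own illustration of the sequence the text describes (flush communication, save one image per process, relinquish resources, later restart on a possibly different processor set); this is not Condor's or LSF's API:

```python
# Sketch of checkpoint-based migratable preemption. Names are illustrative.

def checkpoint(job):
    """Quiesce the job and save one image per process."""
    job["in_flight_messages"] = 0          # flush network communication and I/O
    images = [{"proc": p, "state": job["state"]} for p in job["processors"]]
    job["running"] = False                 # all kernel resources relinquished
    return images

def restart(job, images, new_processors):
    """Restart from the saved images; migratable preemption restarts with the
    same allocation size, just possibly on different processors."""
    assert len(new_processors) == len(images), "migratable: same width"
    job["processors"] = list(new_processors)
    job["running"] = True                  # connections re-established, resume
    return job

job = {"state": "step-42", "processors": [0, 1, 2, 3], "running": True,
       "in_flight_messages": 7}
imgs = checkpoint(job)
job = restart(job, imgs, [4, 5, 6, 7])     # resumed on a different set of nodes
print(job["running"], job["processors"])   # → True [4, 5, 6, 7]
```

A malleable job would differ only in the final step: its checkpoint format would have to be processor-count independent so that `restart` could take an allocation of a different width.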
LoadLeveler is a commercial scheduling system designed primarily for the IBM SP-2 system. A recent extension to LoadLeveler that has become popular is EASY [Lif95, SCZL96]. This is a rigid RTC scheduler that uses execution-time information provided by the user to offer both greater predictability and better system utilization. When a user submits a job, the scheduler indicates immediately a time by which that job will be run; jobs that are subsequently submitted may be run before this job only if they do not delay the start of any previously-scheduled job's execution (i.e., a gap exists in the schedule containing enough processors for sufficient time).

1 Although the costs of this approach may appear to be large, we have found that significant reductions in mean response times can be achieved with minimal impact on throughput, even with large checkpointing overheads.
2 Malleable preemption is often termed dynamic partitioning in the literature, but we find it more convenient to treat it as a type of preemption.
3 Some rigid schedulers do use service-demand information if available, but this distinction is not shown in this table.

Table 1. Representative set of disciplines that have been proposed and evaluated in the literature. Disciplines presented in this paper are italicized and have the prefix "LSF-"; for the adaptive ones, a regular and a "SUBSET" version are provided.

Preemption | Rigid                      | Adaptive                 | Work   Speedup  Mem.
-----------+----------------------------+--------------------------+----------------------
RTC        | RTC [ZM90], NQS,           | PPJ [RSD+94]             | yes    min/max  no
           | LoadLeveler, EASY [Lif95], | ASP [ST93]               |
           | LSF, LSF-RTC               | A+, A+&mM [Sev89]        |
           |                            | Equal, IP [RSD+94]       |
           |                            | SDF [CMV94]              | yes
           |                            | PWS [GST91]              | yes    pws      no
           |                            | AVG, Adapt-AVG [CMV94]   | no     avg      no
           |                            | LSF-RTC-AD(SUBSET)       | either either   either
simple     | Cosched (matrix) [Ous82],  | LSF-PREEMPT-AD(SUBSET)   | either either   either
           | LSF-PREEMPT                |                          |
migratable | Cosched (other) [Ous82],   | RRJob [MVZ93]            |
           | Round-Robin [ZM90],        | FB-ASP, FB-PWS           | no     pws      no
           | LSF-MIG                    | LSF-MIG-AD(SUBSET)       | either either   either
malleable  | (not applicable)           | BUDDY, EPOCH [MZ95]      | no
           |                            | Partition [TG89, MVZ93]  | no
           |                            | FOLD, EQUI [MZ94]        |
           |                            | Equi/Dynamic W&E [BG96]  | yes
           |                            | MPA [PS96b, PS96a]       | yes    yes      no
           |                            | LSF-MALL-AD(SUBSET)      | either either   either

The disciplines that we present in this paper have been implemented as extensions to another commercial scheduling system, called Load Sharing Facility (LSF). By building on top of LSF, we found that we could make direct use of LSF for many aspects of job management, including the user interfaces for submitting and monitoring jobs, as well as the low-level mechanisms for starting, stopping, and resuming jobs. LSF runs on a large number of platforms, including the SP-2, SGI Challenge, SGI Origin, and HP Exemplar, making it an attractive vehicle for this type of scheduling research. Our work is based on LSF version 2.2a.

3 Load Sharing Facility

Although originally designed for load balancing in workstation clusters, LSF is now becoming popular for parallel job scheduling on multiprocessor systems. Of greatest relevance to this work is the batch subsystem.

Queues provide the basis for much of the control over the scheduling of jobs. Each queue is associated with a set of processors, a priority, and many other parameters not described here. By default, jobs are selected in FCFS order from the highest-priority non-empty queue and run until completion, but it is possible to configure queues so that higher-priority jobs preempt lower-priority ones (a feature that is currently available only for the sequential-job case). The priority of a job is defined by the queue to which the job has been submitted.

To illustrate the use of queues, consider a policy where shorter jobs have higher priority than longer jobs (see Fig. 1). An administrator could define several queues, each in turn corresponding to increasing service demand and having decreasing priority. If jobs are submitted to the correct queue, short jobs will be executed before long ones. Moreover, LSF can be configured to preempt lower-priority jobs if higher-priority ones arrive, giving short jobs still better responsiveness. To permit enforcement of the policy, LSF can be configured to terminate any job that exceeds the execution-time threshold defined for the queue.
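The selection behaviour of such a queue configuration can be modeled in a few lines. This is a toy model of the policy only (queue names, priorities, and run limits taken from Fig. 1; the selection loop is our own illustration, not LSF code):

```python
# Toy model of the Fig. 1 policy: FCFS within queues, highest-priority
# non-empty queue served first; each queue carries the run limit that LSF
# would use to terminate jobs exceeding the queue's threshold.

from collections import deque

QUEUES = [  # (name, priority, run_limit_minutes); None = no limit
    ("short", 10, 5),
    ("medium", 5, 60),
    ("long", 0, None),
]

pending = {name: deque() for name, _, _ in QUEUES}

def submit(job, queue_name):
    pending[queue_name].append(job)

def next_job():
    """Pick FCFS from the highest-priority non-empty queue."""
    for name, _prio, limit in sorted(QUEUES, key=lambda q: -q[1]):
        if pending[name]:
            return pending[name].popleft(), limit
    return None, None

submit("sim-run", "long")
submit("compile", "short")
job, limit = next_job()
print(job, limit)   # → compile 5  (short queue wins despite arriving later)
```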
The current version of LSF provides only limited support for parallel jobs. As part of submitting a job, a user can specify the number of processors required. When LSF finds a sufficient number of processors satisfying the resource constraints for the job, it spawns an application "master" process on one of the processors, passing to this process a list of processors. The master process can then use this list of processors to spawn a number of "slave" processes to perform the parallel computation. The slave processes are completely under the control of the master process, and as such, are not known to the LSF batch scheduling system. LSF does provide, however, a library that simplifies several distributed programming activities, such as spawning remote processes, propagating Unix signals, and managing terminal output.
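The master/slave startup sequence just described can be sketched as follows. This is a local stand-in, not LSF's remote-execution library: slaves are plain records rather than remote processes, the host names are invented, and whether the master's own processor also runs a slave is an assumption:

```python
# Illustrative stand-in for the master/slave startup sequence: the scheduler
# hands the master a processor list; the master spawns the slaves itself,
# so the slaves are invisible to the batch scheduler.

def start_master(job_name, processors):
    """Mimics LSF creating the master on the first processor in the list."""
    master = {"job": job_name, "on": processors[0], "slaves": []}
    for p in processors[1:]:   # assumed: one slave per remaining processor
        master["slaves"].append(spawn_slave(job_name, p))
    return master

def spawn_slave(job_name, processor):
    # Managed entirely by the master, not by the batch scheduling system.
    return {"job": job_name, "on": processor}

m = start_master("fft", ["hostA", "hostB", "hostC"])
print(m["on"], [s["on"] for s in m["slaves"]])   # → hostA ['hostB', 'hostC']
```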

[Figure: three queues feeding a pool of processors. Short Jobs: Priority=10, Preemptive, Run Limit=5 mins; Medium Jobs: Priority=5, Preemptive/Preemptable, Run Limit=60 mins; Long Jobs: Priority=0, Preemptable, No Run Limit.]

Fig. 1. Example of a possible sequential-job queue configuration in LSF to favour short-running jobs. Jobs submitted to the short-job queue have the highest priority, followed by medium- and long-job queues. The queues are configured to be preemptable (allowing jobs in the queue to be preempted by higher-priority jobs) and preemptive (allowing jobs in the queue to preempt lower-priority jobs). Execution-time limits associated with each queue enforce the intended policy.

4 Scheduling Extension Library

The ideal approach to developing new scheduling disciplines is one that does not require any LSF source code modifications, as this allows any existing users of LSF to experiment with the new disciplines. For this purpose, LSF provides an extensive application-programmer interface (API), allowing many aspects of job scheduling to be controlled. Our scheduling disciplines are implemented within a process distinct from LSF, and are thus called scheduling extensions.

The LSF API, however, is designed to implement LSF-related commands rather than scheduling extensions. As a result, the interfaces are very low level and can be quite complex to use. For example, to determine the accumulated run time for a job (information commonly required by a scheduler), the programmer must use a set of LSF routines to open the LSF event-logging file, process each log item in turn, and compute the time between each pair of suspend/resume events for the job. Since the event-logging file is typically several megabytes in size, requiring several seconds to process in its entirety, it is necessary to cache information whenever possible. Clearly, it is difficult for a scheduling extension to take care of such details and to obtain the information efficiently.

One of our goals was thus to design a scheduling extension library that would provide simple and efficient access to information about jobs (e.g., processors currently used by a job), as well as to manipulate the state of jobs in the system

(e.g., suspend or migrate a job). This functionality is logically divided into two components:

Job and System Information Cache (JSIC) This component serves as a cache of system and job information obtained from LSF. It also allows a discipline to associate auxiliary, discipline-specific information with processors, queues, and jobs for its own book-keeping purposes.4

LSF Interaction Layer (LIL) This component provides a generic interface to all LSF-related activities. In particular, it updates the JSIC data structures by querying the LSF batch system and translates high-level parallel-job scheduling operations (e.g., suspend job) into the appropriate LSF-specific ones.

The basic designs of all our scheduling disciplines are quite similar. Each discipline is associated with a distinct set of LSF queues, which the discipline uses to manage its own set of jobs. All LSF jobs in this set of queues are assumed to be scheduled by the corresponding scheduling discipline. Normally, one LSF queue is designated as the submit queue, and other queues are used by the scheduling discipline as a function of a job's state. For example, pending jobs may be placed in one LSF queue, stopped jobs in another, and running jobs in a third. A scheduling discipline never explicitly dispatches or manipulates the processes of a job directly; rather, it implicitly requests LSF to perform such actions by switching jobs from one LSF queue to another. Continuing the same example, a pending queue would be configured so that it accepts jobs but never dispatches them, and a running queue would be configured so that LSF immediately dispatches any job in this queue on the processors specified for the job. In this way, a user submits a job to be scheduled by a particular discipline simply by specifying the appropriate LSF queue, and can track the progress of the job using all the standard LSF utilities.

Although it is possible for a scheduling discipline to contain internal job queues and data structures, we have found that this is rarely necessary because any state information that needs to be persistent can be encoded by the queue in which each job resides. This approach greatly simplifies the re-initialization of the scheduling extension in the event that the extension fails at some point, an important property of any production scheduling system.
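The queue-encoded state machine described above can be sketched as follows. The queue names and the `Discipline` class are our own illustration of the design, not the library's actual interface:

```python
# Sketch of the design above: a discipline never touches job processes; it
# moves jobs between LSF queues, and a job's current queue *is* its state,
# which is what makes recovery after an extension failure trivial.

class Discipline:
    QUEUES = ("pending", "stopped", "running")

    def __init__(self):
        self.queue_of = {}   # job id -> queue name (persistently held by LSF)

    def submit(self, job_id):
        self.queue_of[job_id] = "pending"    # submit queue; never dispatched

    def dispatch(self, job_id):
        # Switching to the running queue implicitly asks LSF to start the job.
        self.queue_of[job_id] = "running"

    def preempt(self, job_id):
        self.queue_of[job_id] = "stopped"

    def recover(self):
        """After a failure, rebuild all state from queue residency alone."""
        return {q: sorted(j for j, qq in self.queue_of.items() if qq == q)
                for q in self.QUEUES}

d = Discipline()
d.submit(1); d.submit(2); d.dispatch(1); d.preempt(1)
print(d.recover())   # → {'pending': [2], 'stopped': [1], 'running': []}
```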
Given our design, it is possible for several scheduling disciplines to coexist within the same extension process, a feature that is most useful in reducing overheads if different disciplines are being used in different partitions of the system. (For example, one partition could be used for production workloads while another could be used to experiment with a new scheduling discipline.) Retrieving system and job information from LSF can place significant load on the master processor,5 imposing a limit on the number of extension processes that can be run concurrently.

4 In future versions of LSF, it will be possible for information associated with jobs to be saved in log files so that it will not be lost in the event that the scheduler fails.
5 LSF runs its batch scheduler on a single, centralized processor.

Since each scheduling discipline is associated with a

different set of LSF queues, the set of processors associated with each discipline can be defined by assigning processors to the corresponding queues using the LSF queue administration tools. (Normally, each discipline uses a single queue for processor information.)

The extension library described here has also been used by Gibbons in studying a number of rigid scheduling disciplines, including two variants of EASY [Lif95, SCZL96, Gib96, Gib97]. One of the goals of Gibbons' work was to determine whether historical information about a job could be exploited in scheduling. He found that, for many workloads, historical information could provide up to 75% of the benefits of having perfect information. For the purpose of his work, Gibbons added an additional component to the extension library to gather, store, and analyze historical information about jobs. He then adapted the original EASY discipline to take into account this knowledge and showed how performance could be improved. The historical database and details of the scheduling disciplines studied by Gibbons are described elsewhere [Gib96, Gib97].

The high-level organization of the scheduling extension library (not including the historical database) is shown in Fig. 2. The extension process contains the extension library and each of the disciplines configured for the system. The extension process mainline essentially sleeps until a scheduling event or a timeout (corresponding to the scheduling quantum) occurs. The mainline then prompts the LIL to update the JSIC and calls a designated method for each of the configured disciplines. Next, we describe each component of the extension library in detail.

[Figure: the extension process contains the new scheduling disciplines (Sched Disc 1, Sched Disc 2, Sched Disc 3, ...) layered over the scheduling extension library, whose JSIC, data objects, and LIL poll the LSF batch subsystem.]

Fig. 2. High-level design of the scheduling extension library. As shown, the extension library supports multiple scheduling disciplines running concurrently within the same process.
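The mainline just described can be sketched as a simple loop. All class and method names below are illustrative assumptions (the real library is C/C++ against the LSF API); the fakes merely make the control flow concrete:

```python
# Sketch of the extension mainline: sleep until a scheduling event or the
# quantum timeout, ask the LIL to refresh the JSIC, then let each configured
# discipline act. Names are illustrative, not the library's interface.

QUANTUM_SECONDS = 5   # decisions are made at most once every five seconds

def mainline(lil, jsic, disciplines, wait_for_event, rounds):
    trace = []
    for _ in range(rounds):
        reason = wait_for_event(timeout=QUANTUM_SECONDS)  # "event" or "timeout"
        lil.update(jsic)                 # LIL refreshes the JSIC from LSF
        for d in disciplines:
            d.schedule(jsic)             # designated per-discipline method
        trace.append(reason)
    return trace

class FakeLIL:
    def __init__(self): self.updates = 0
    def update(self, jsic): self.updates += 1

class FakeDiscipline:
    def __init__(self): self.calls = 0
    def schedule(self, jsic): self.calls += 1

lil, disc = FakeLIL(), FakeDiscipline()
events = iter(["event", "timeout", "event"])
trace = mainline(lil, {}, [disc], lambda timeout: next(events), rounds=3)
print(trace, lil.updates, disc.calls)   # → ['event', 'timeout', 'event'] 3 3
```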

4.1 Job and System Information Cache

The Job and System Information Cache (JSIC) contains all the information about jobs, queues, and processors that is relevant to the scheduling disciplines that are part of the extension. Our data structures were designed taking into consideration the types of operations that we found to be most critical to the design of our scheduling disciplines:

- A scheduler must be able to scan sequentially through the jobs associated with a particular LSF queue. For each job, it must then be able to access in a simple manner any job-related information obtained from LSF (e.g., run times, processors on which a job is running, LSF job state).
- It must be able to scan the processors associated with any LSF queue and determine the state of each one of these (e.g., available or unavailable).
- Finally, a scheduler must be able to associate book-keeping information with either jobs or processors (e.g., the set of jobs running on a given processor).

In our library, information about each active job is stored in a JobInfo object. Pointers to instances of these objects are stored in a job hash table keyed by LSF job identifiers (jobId), allowing efficient lookup of individual jobs. Also, a list of job identifiers is maintained for each queue, permitting efficient scanning of jobs in any given queue (in the order submitted to LSF).

The information associated with a job is global, in that a single JobInfo object instance exists for each job. For processors, on the other hand, we found it convenient (for experimental reasons) to have distinct processor information objects associated with each queue. (Using a global approach similar to that for jobs would also be suitable if it is guaranteed that a processor is never associated with more than one discipline within an extension, but this was not necessarily the case on our system.) Similar to jobs, processors associated with a queue can be scanned sequentially, or can be accessed through a hash table keyed on the processor name. For each, the state of the processor and a list of jobs running on the processor can be obtained.

4.2 LSF Interaction Layer (LIL)

The most significant function of the LSF interaction layer is to update the JSIC data structures to reflect the current state of the system when prompted. Since LSF only supports a polling interface, however, the LIL must, for each update request, fetch all data from LSF and compare it to that which is currently stored in the JSIC. As part of this update, the LIL must also process an event-logging file, since certain types of information (e.g., total time spent pending, suspended, and running) are not provided directly by LSF. As such, the JSIC update code represents a large fraction of the total extension library code. (The extension library is approximately 1.5 KLOC.)
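The run-time accounting that the LIL derives from the event log can be sketched as follows: accumulated run time is the sum of the intervals between each dispatch/resume event and the matching suspend/finish event. This is a minimal model of that bookkeeping, with an invented event-record format, not the library's code:

```python
# Minimal model of deriving a job's accumulated run time from an event log:
# sum the intervals between each start/resume and the matching suspend/finish.
# Event names and the (timestamp, kind) log format are illustrative.

def accumulated_run_time(events, now):
    """events: time-ordered list of (timestamp, kind), with kind in
    {"dispatch", "suspend", "resume", "finish"}."""
    total, running_since = 0, None
    for t, kind in events:
        if kind in ("dispatch", "resume"):
            running_since = t
        elif kind in ("suspend", "finish") and running_since is not None:
            total += t - running_since
            running_since = None
    if running_since is not None:        # still running: count up to `now`
        total += now - running_since
    return total

log = [(100, "dispatch"), (160, "suspend"), (200, "resume")]
print(accumulated_run_time(log, now=230))   # → 90 (60s + 30s since resume)
```

Since the real log holds every job's events and is megabytes long, the library processes only events newer than the last update and caches the running totals, as described above.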

To update the JSIC, the LIL performs the following three actions:

- It obtains the list of all active jobs in the system from LSF. Each job record returned by LSF contains some static information, such as the submit time, start time, and resource requirements, as well as some dynamic information, such as the job status (e.g., running, stopped), processor set, and queue. All this information about each job is recorded in the JSIC.
- It opens the event-logging file, reads any new events that have occurred since the last update, and re-computes the pending time, aggregate processor run time, and wall-clock run time for each job. As well, aggregate processor and wall-clock run times since the job was last resumed (termed residual run times) are computed.
- It obtains the list of processors associated with each queue and queries LSF for the status of each of these processors.

LSF provides a mechanism by which the resources, such as physical memory, licenses, or swap space, required by a job can be specified upon submission. In our extensions, we do not use the default set of resources, to avoid having LSF make any scheduling decisions, but rather add a new set of pseudo-resources that are used to pass parameters or information about a job, such as minimum and maximum processor allocations or service demand, directly to the scheduling extension. As part of the first action performed by the LIL update routine, this information is extracted from the pseudo-resource specifications and stored in the JobInfo structure.

The remaining LIL functions, illustrated in Table 2, basically translate high-level scheduling operations into low-level LSF calls.

Table 2. High-level scheduling functions provided by the LSF Interaction Layer.

Operation      Description
setprocessors  This operation defines the list of processors to be allocated to a job. LSF dispatches the job by creating a master process on the first processor in the list; as described before, the master process uses the list to spawn its slave processes.
switch         This operation moves a job from one queue to another.
suspend        This operation suspends a job. The processes of the job hold onto virtual resources they possess, but normally release any physical resources (e.g., physical memory).
resume         This operation resumes a job that has previously been suspended.
migrate        This operation initiates the migration procedure for a job. It does not actually migrate the job, but rather places the job in a pending state, allowing it to be subsequently restarted on a different set of processors.

Preemption Considerations The LSF interaction layer makes certain assumptions about the way in which jobs can be preempted. For simple preemption, a job can be suspended by sending it a SIGTSTP signal, which is delivered

to the master process; this process must then propagate the signal to its slaves (which is automated in the distributed programming library provided by LSF) to ensure that all processes belonging to the job are stopped. Similarly, a job can be resumed by sending it a SIGCONT signal.

In contrast, we assume that migratable and malleable preemption are implemented via a checkpointing facility, as described in Sect. 2. As a result, preempted jobs do not occupy any kernel resources, allowing any number of jobs to be in this state (assuming disk space for checkpointing is abundant).

To identify migratable jobs, we set an LSF flag in the submission request indicating that the job is re-runnable. To migrate such a job, we first send it a checkpoint signal (in our case, the SIGUSR2 signal), and then send LSF a migrate request for the job. This would normally cause LSF to terminate the job (with a SIGTERM signal) and restart it on the set of processors specified (using the setprocessors interface). In most cases, however, we switch such a job to a queue that has been configured to not dispatch jobs prior to submitting the migration request, causing the job to be simply terminated and requeued as a pending job.

The interface for changing the processor allocation of a malleable job is identical to that for migrating a job, the only difference being the way it is used. In the migratable case, the scheduling discipline always restarts a job using the same number of processors as in the initial allocation, while in the malleable case, any number of processors can be specified.

4.3 A Simple Example

To illustrate how the extension library can be used to implement a discipline, consider a sequential-job, multi-level feedback discipline that degrades the priority of jobs as they acquire processing time. If the workload has a high degree of variability in service demands, as is typically the case even for batch sequential workloads, this approach will greatly improve response times without requiring users to specify the service demands of jobs in advance. For this discipline, we can use the same queue configuration as shown in Fig. 1; we eliminate the run-time limits, however, as the scheduling discipline will automatically move jobs from higher-priority queues to lower-priority ones as they acquire processing time.

Users initially submit their jobs to the high-priority queue (labeled Short Jobs in Fig. 1); when a job has acquired a certain amount of processing time, the scheduling extension switches the job to the medium-priority queue, and after some more processing time, to the low-priority queue. In this way, the extension relies on the LSF batch system to dispatch, suspend, and resume jobs as a function of the jobs in each queue. Users can track the progress of jobs simply by examining the jobs in each of the three queues.

5 Parallel-Job Scheduling Disciplines

We now turn our attention to the parallel-job scheduling disciplines that we have implemented as LSF extensions. Important to the design of these disciplines are

the costs associated with using LSF on our platform. It can take up to thirty seconds to dispatch a job once it is ready to run. Migratable or malleable preemption typically requires more than a minute to release the processors associated with a job; these processors are considered to be unavailable during this time. Finally, scheduling decisions are made at most once every five seconds to keep the load on the master (scheduling) processor at an acceptable level.

The disciplines described in this section all share a common job queue configuration. A pending queue is defined and configured to allow jobs to be submitted (i.e., open) but preventing any of these jobs from being dispatched automatically by LSF (i.e., inactive). A second queue, called the run queue, is used by the scheduler to start jobs. This queue is open, active, and possesses absolutely no load constraints. A scheduling extension uses this queue by first specifying the processors associated with a job (i.e., setprocessors) and then moving the job to this queue; given the queue configuration, LSF immediately dispatches jobs in this queue. Finally, a third queue, called the stopped queue, is defined to assist in migrating jobs. It too is configured to be open but inactive. When LSF is prompted to migrate a job in this queue, it terminates and requeues the job, preserving its job identifier. In all our disciplines, preempted jobs are left in this queue to distinguish them from jobs that have not had a chance to run yet (in the pending queue).

Each job in our system is associated with a minimum, desired, and maximum processor allocation, the desired value lying between the minimum and maximum. Rigid disciplines use the desired value while adaptive disciplines are free to choose any allocation between the minimum and the maximum values.
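The allocation rule above can be stated compactly in code. The record layout and the two chooser functions are our own illustration of the rule, not the extension's implementation (in particular, how an adaptive discipline splits free processors among competing jobs is deliberately simplified to a single job):

```python
# Sketch of the allocation rule: rigid disciplines use the job's desired
# allocation; adaptive ones may pick anything in [minimum, maximum].

def make_job(minimum, desired, maximum):
    assert minimum <= desired <= maximum
    return {"min": minimum, "desired": desired, "max": maximum}

def rigid_allocation(job, free):
    """Rigid: exactly the desired size, or wait (None) if it doesn't fit."""
    return job["desired"] if free >= job["desired"] else None

def adaptive_allocation(job, free):
    """Adaptive: any size in [min, max]; here, as much as is free."""
    if free < job["min"]:
        return None
    return min(free, job["max"])

job = make_job(minimum=4, desired=16, maximum=32)
print(rigid_allocation(job, free=12))     # → None (16 > 12 free, so it waits)
print(adaptive_allocation(job, free=12))  # → 12  (shrinks to what is free)
```

This is exactly the gap the adaptive disciplines exploit: the rigid job leaves 12 processors idle, while the adaptive one runs immediately at a smaller, more efficient allocation.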
If provided to the scheduler, service-demand information is specified in terms of the amount of computation required on a single processor, and speedup characteristics are specified in terms of the fraction of work that is sequential. Basically, service-demand information is used to run jobs having the least remaining processing time (to minimize mean response times), and speedup information is used to favour efficient jobs in allocating processors. Since jobs can vary considerably in terms of their speedup characteristics, computing the remaining processing time will only be accurate if speedup information is available.

5.1 Run-to-Completion Disciplines

Next, we describe the run-to-completion disciplines. All three variants listed in Table 1 (i.e., LSF-RTC, LSF-RTC-AD, and LSF-RTC-ADSUBSET) are quite similar and, as such, are implemented in a single module of the scheduling extension. The LSF-RTC discipline is defined as follows:

LSF-RTC Whenever a job arrives or departs, the scheduler repeatedly scans the pending queue until it finds the first job for which enough processors are available. It assigns processors to the job and switches the job to the run queue.

The LSF system, and hence the JSIC, maintains jobs in order of arrival, so the default RTC discipline is FCFS (skipping any jobs at the head of the queue for which not enough processors are available). If service-demand information is provided to the scheduler, then jobs are scanned in order of increasing service demand, resulting in a shortest processing time (SPT) discipline (again with skipping).

The LSF-RTC-AD discipline is very similar to the ASP discipline proposed by Setia et al. [ST93], except that jobs are selected for execution differently, because the LSF-based disciplines take into account memory requirements of jobs (and hence cannot be called ASP).

LSF-RTC-AD Whenever a job arrives or departs, the scheduler scans the pending queue, selecting the first job for which enough processors remain to satisfy the job's minimum processor requirements. When no more jobs fit, leftover processors are used to equalize processor allocations among selected jobs (i.e., giving processors to jobs having the smallest allocation). The scheduler then assigns processors to the selected jobs and switches these jobs to the run queue.

If speedup information is available, the scheduler allocates each leftover processor, in turn, to the job whose efficiency will be highest after the allocation. This approach minimizes both the processor and memory occupancy in a distributed-memory environment, leading to the highest possible sustainable throughput [PS96a].

The SUBSET variant seeks to improve the efficiency with which processors are utilized by applying an algorithm known as a subset-sum algorithm [MT90]. The basic principle is to try to minimize the number of processors allocated to jobs in excess of each job's minimum processor allocation (termed surplus processors). Since we assume that a job utilizes processors more efficiently as its allocation size decreases (down to the minimum allocation size), then this
principle allows the system to run at a higher overall efficiency.

LSF-RTC-ADSUBSET Let L be the number of jobs in the system and N be the number of jobs selected by the first-fit algorithm used in LSF-RTC-AD. The scheduler only commits to running the first N' of these jobs, where

    N' = N * max(1 - L/(βN), 0)

(β is a tunable parameter that determines how aggressively the scheduler seeks to minimize surplus processors as the load increases; for our experiments, we chose β = 5.) Using any leftover processors and leftover jobs, the scheduler applies the subset-sum algorithm to select the set of jobs that minimizes the number of surplus processors. The jobs chosen by the subset-sum algorithm are added to the list of jobs selected to run, and any surplus processors are allocated as in LSF-RTC-AD.

Simple Preemptive Disciplines In simple preemptive disciplines, jobs may be suspended but their processes may not be migrated. Since the resources used by

Simple Preemptive Disciplines. In simple preemptive disciplines, jobs may be suspended but their processes may not be migrated. Since the resources used by jobs are not released when they are in a preempted state, however, one must be careful not to over-commit system resources. In our disciplines, this is achieved by ensuring that no more than a certain number of processes ever exist on any given processor. In a more sophisticated implementation, we might instead ensure that the swap space associated with each processor is never overcommitted.

The two variants of the preemptive disciplines are quite different. In the rigid discipline, we allow a job to preempt another only if it possesses the same desired processor allocation. This minimizes the possibility of packing losses that might occur if jobs were not aligned in this way.6 In the adaptive discipline, we found this approach to be problematic. Consider a long-running job, either arriving during an idle period or having a large minimum processor requirement, that is dispatched by the scheduler. Any subsequent jobs preempting this first one would be configured for a large allocation size, causing them, and hence the entire system, to run inefficiently. As a result, we do not attempt to reduce packing losses with the adaptive, simple preemptive discipline.

LSF-PREEMPT: Whenever a job arrives or departs or when a quantum expires, the scheduler re-evaluates the selection of jobs currently running. Available processors are first allocated in the same way as in LSF-RTC. Then, the scheduler determines if any running job should be preempted by a pending or stopped job, according to the following criteria:

1. A stopped job can only preempt a job running on the same set of processors as those for which it is configured. A pending job can preempt any running job that has the same desired processor allocation value.
2. If no service-demand information is available, the aggregate cumulative processor time of the pending or stopped job must be some fraction less than that of the running job (in our case, we use the value of 50%); otherwise, the service demand of the preempting job must be a (different) fraction less than that of the running job (in our case, we use the value of 10%).
3. The running job must have been running for at least a certain specified amount of time (one minute in our case, since suspension and resumption only consist of sending a Unix signal to all processes of the job).
4. The number of processes present on any processor cannot exceed a pre-specified number (in our case, five processes).
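The four criteria can be expressed as a single predicate. The thresholds come from the text above; the job and per-processor load records are our own hypothetical structures, not LSF's.

```python
from collections import namedtuple

# Hypothetical records; LSF's real data structures differ.
Job = namedtuple("Job", "stopped procs desired demand cpu_time started")

TIME_FRACTION = 0.50      # criterion 2, without service-demand knowledge
DEMAND_FRACTION = 0.10    # criterion 2, with service-demand knowledge
MIN_RUN_SECONDS = 60      # criterion 3
MAX_PROCS_PER_CPU = 5     # criterion 4

def may_preempt(cand, running, now, load):
    """Check whether a pending or stopped job `cand` may preempt
    `running` under the four LSF-PREEMPT criteria."""
    # 1. A stopped job must reuse its own processors; a pending job
    #    must match the running job's desired allocation.
    if cand.stopped:
        if cand.procs != running.procs:
            return False
    elif cand.desired != running.desired:
        return False
    # 2. The preemptor must be "smaller": by cumulative processor time
    #    if service demand is unknown, by service demand otherwise.
    if cand.demand is None or running.demand is None:
        if cand.cpu_time > TIME_FRACTION * running.cpu_time:
            return False
    elif cand.demand > DEMAND_FRACTION * running.demand:
        return False
    # 3. The running job must have run for at least one minute.
    if now - running.started < MIN_RUN_SECONDS:
        return False
    # 4. No processor may exceed five resident processes.
    return all(load.get(p, 0) < MAX_PROCS_PER_CPU for p in running.procs)
```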
If several jobs can preempt a given running job, the one which has the least acquired aggregate processing time is chosen first if no service-demand knowledge is available, or the one with the shortest remaining service demand if service-demand knowledge is available.

Our adaptive, simple preemptive discipline uses a matrix approach to scheduling jobs, where each row of the matrix represents a different set of jobs to run and the columns represent the processors in the system.

6 Packing losses occur when processors are left idle, either because there is an insufficient number to meet the minimum processor requirements of pending jobs or because only some of the processors required by stopped jobs are available.
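Placement into such an Ousterhout-style matrix — first row with enough free columns, else a new row — can be sketched as follows (the representation, with `None` marking a free slot, is our own):

```python
def place(matrix, n_procs, job, need):
    """Put `job`, needing `need` processors, into the first row of the
    matrix with enough free columns (Ousterhout co-scheduling);
    open a new row (time slice) if no row fits.  Returns the row used."""
    for r, row in enumerate(matrix):
        free = [c for c in range(n_procs) if row[c] is None]
        if len(free) >= need:
            for c in free[:need]:
                row[c] = job
            return r
    matrix.append([job] * need + [None] * (n_procs - need))
    return len(matrix) - 1
```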

In Ousterhout's co-scheduling discipline, an incoming job is placed in the first row of the matrix that has enough free processors for the job; if no such row exists, then a new one is created. In our disciplines, we use a more dynamic approach.

LSF-PREEMPT-AD: Whenever the scheduler is awakened (due either to an arrival or departure or to a quantum expiry), the set of jobs currently running or stopped (i.e., preempted) is organized into the matrix just described, using the first row for those jobs that are running. Each row is then examined in turn. For each, the scheduler populates the uncommitted processors with the best pending, stopped, or running jobs. (If service-demand information is available, currently-stopped or running jobs may be preferable to a pending job; these jobs can switch rows if all processors being used by the job are uncommitted in the row currently being examined.) The scheduler also ensures that jobs that are currently running, but which have run for less than the minimum time since last being started or resumed, continue to run. If such jobs cannot be accommodated in the row being examined, then the scheduler skips to the next row.

Once the set of jobs that might be run in each row has been determined, the scheduler chooses the row containing the job having the least acquired processing time or, if service-demand information is available, the job having the shortest remaining service demand. Processors in the selected row available for pending jobs are distributed as before (i.e., equi-allocation if no speedup knowledge is available, or favouring efficient jobs if it is).

Migratable and Malleable Preemptive Disciplines. In contrast to the simple preemptive disciplines, the migratable and malleable ones assume that a job can be checkpointed and restarted at a later point in time. The primary difference between the two types is that, in the migratable case, jobs are always resumed with the same number of processors allocated when the job first started, whereas in the malleable case, a job can be restarted with a different number of processors.
LSF-MIG: Whenever a job arrives or departs or when a quantum expires, the scheduler re-evaluates the selection of jobs currently running. First, currently-running jobs which have not run for at least a certain configurable amount of time (in our case, ten minutes, since migration and processor reconfiguration are relatively expensive) are allowed to continue running. Processors not used by these jobs are considered to be available for reassignment. The scheduler then uses a first-fit algorithm to select the jobs from those remaining to run next, using a job's desired processor allocation. As before, if service-demand information is available, jobs are selected in order of least remaining service demand.
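The LSF-MIG re-evaluation can be sketched as follows; the field names and in-memory job list are our own assumptions (the real implementation drives LSF queues):

```python
from collections import namedtuple

# Hypothetical record: procs = desired allocation, remaining = known
# remaining service demand (None when no such knowledge is available).
Job = namedtuple("Job", "name running resumed procs remaining")

MIN_RUN_SECONDS = 600   # ten minutes: migration is relatively expensive

def reselect(jobs, total_procs, now):
    """LSF-MIG sketch: recently started or resumed jobs keep running;
    the rest compete again in a first-fit scan, ordered by least
    remaining service demand when that knowledge is available."""
    keep = [j for j in jobs
            if j.running and now - j.resumed < MIN_RUN_SECONDS]
    free = total_procs - sum(j.procs for j in keep)
    rest = [j for j in jobs if j not in keep]
    if all(j.remaining is not None for j in rest):
        rest.sort(key=lambda j: j.remaining)       # SRPT ordering
    for j in rest:                                 # first-fit scan
        if j.procs <= free:
            keep.append(j)
            free -= j.procs
    return keep
```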

LSF-MIG-AD and LSF-MALL-AD: Apart from their adaptiveness, these two disciplines are very similar to the LSF-MIG discipline. In the malleable version, the scheduler uses the same first-fit algorithm as in LSF-MIG to select jobs, except that it always uses a job's minimum processor allocation to determine if a job fits. Any leftover processors are then allocated as before, using an equi-allocation approach if no speedup information is available, and favouring efficient jobs otherwise. In the migratable version, the scheduler uses the size of a job's current processor allocation instead of its minimum if the job has already run (i.e., has been preempted) in the first-fit algorithm, and does not change the size of such a job's processor allocation if it is selected to run.

Similar to the run-to-completion case, SUBSET variants of the adaptive disciplines have also been implemented.

6 Performance Results

The evaluation of the disciplines described in the previous section is primarily qualitative in nature. There are two reasons for this. First, experiments must be performed in real time rather than in simulated time, requiring a considerable amount of time to execute a relatively small number of jobs. Moreover, failures that can (and do) occur during the experiments can significantly influence the results, although such failures can be tolerated by the disciplines. Second, we intend our implementations to demonstrate the practicality of a discipline and to observe its performance in a real context, rather than to analyze its performance under a wide variety of conditions (for which a simulation would be more suitable).

The experimental platform for the implementation is a network of workstations (NOW), consisting of sixteen IBM 43P (133 MHz, PowerPC 604) systems, connected by three independent networks (155 Mbps ATM, 100 Mbps Ethernet, 10 Mbps Ethernet).

To exercise the scheduling software, we use a parameterizable synthetic application designed to represent real applications. The basic reason for using a synthetic application is that it could be designed to not use any processing resources, yet behave in other respects (e.g., execution time, preemption) as a real parallel application.
This is important in the context of our network of workstations, because the system is being actively used by a number of other researchers. Using real (compute-intensive) applications would have prevented the system from being used by others during the tests, or would have caused the tests to be inconclusive if jobs were run at low priority.

Each of our scheduling disciplines ensures that only a single one of its jobs is ever running on a given processor and that all processes associated with the job are running simultaneously. As such, the behaviour of our disciplines, when used in conjunction with our synthetic application, is identical to that of a dedicated system running compute-intensive applications. In fact, by associating a different set of queues with each discipline, each one configured to use all processors, it was possible to conduct several experiments concurrently. (The jobs submitted to each submit queue for the different disciplines were generated independently.)

The synthetic application possesses three important features. First, it can be easily parameterized with respect to speedup and service demand, allowing it to model a wide range of real applications. Second, it supports adaptive processor allocations using the standard mechanism provided by LSF. Finally, it can be checkpointed and restarted, to model both migratable and malleable jobs.

An experiment consists of submitting a sequence of jobs to the scheduler according to a Poisson arrival process, using an arrival rate that reflects a moderately-heavy load. A small initial number of these jobs (e.g., 200) are tagged for mean response time and makespan measurements. (The makespan is the maximum completion time of any job in the set of jobs under consideration, assuming that the first job arrives at time zero.) Each experiment terminates only when all jobs in this initial set have left the system. To make the experiment more representative of large systems, we assume that each processor corresponds to eight processors in reality. Thus, all processor allocations are multiples of eight, and the minimum allocation is eight processors. Scaling the number of processors in this way affects the synthetic application in determining the amount of time it should execute, and the scheduling disciplines in determining the expected remaining service demand for a job.

6.1 Workload Model

Service demands for jobs are drawn from a hyper-exponential distribution, with a mean of 8000 seconds (2.2 hours) and coefficient of variation (CV) of 4, a distribution whose median is 2985 seconds.7 The parameters are consistent with measurements made over the past year at the Cornell Theory Center (scaled to 128 processors) [Hot96b, Hot96a]. The most significant difference is that the mean is about a quarter of that actually observed, which should not unduly affect results, as it only magnifies scheduling overheads. (Recall that in the migratable and malleable preemption cases, we only preempt a job if it has run at least 10 minutes, since preemption requires at least one minute.) All disciplines received exactly the same sequence of jobs in any particular experiment, and in general, individual experiments required anywhere from 24 to 48 hours to complete.
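The service-demand distribution can be reproduced with a standard two-branch hyperexponential. The balanced-means fit below is our assumption — the paper does not state which fit was used — but it matches the quoted mean and CV exactly:

```python
import math
import random

MEAN, CV = 8000.0, 4.0

# Balanced-means fit: each branch contributes half of the mean.
p = 0.5 * (1.0 + math.sqrt((CV**2 - 1.0) / (CV**2 + 1.0)))
mean_fast = MEAN / (2.0 * p)          # short-job branch
mean_slow = MEAN / (2.0 * (1.0 - p))  # long-job branch

def service_demand(rng=random):
    """Draw one service demand (seconds) from the hyperexponential."""
    m = mean_fast if rng.random() < p else mean_slow
    return rng.expovariate(1.0 / m)

# Analytic first two moments of this fit
# (an exponential with mean m has E[X^2] = 2 m^2).
m1 = p * mean_fast + (1.0 - p) * mean_slow
m2 = 2.0 * p * mean_fast**2 + 2.0 * (1.0 - p) * mean_slow**2
cv = math.sqrt(m2 - m1**2) / m1
```

With these parameters, the fit's median also comes out very close to the quoted 2985 seconds, which suggests a balanced-means hyperexponential is close to what was actually used.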
Minimum processor allocation sizes are uniformly chosen from one to sixteen processors, and maximum sizes are set at sixteen.8 This distribution is similar to those used in previous studies in this area [PS96a, MZ95, Set95]. The processor allocation size used for rigid disciplines is chosen from a uniform distribution between the minimum and the maximum processor allocations for the job.

It has been shown previously that the performance benefits of knowing speedup information can only be obtained if a large fraction of the total work in the workload has good speedup and, moreover, if larger-sized jobs tend to have better speedup than smaller-sized ones [PS96a].

7 The 25%, 50%, and 75% quantiles are 1230, 2985, and 6100 seconds, respectively.
8 Note that maximum processor allocation information is only useful at lighter loads, since at heavy loads jobs seldom receive many more processors than their minimum allocation.
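The figure of a speedup of 114 on 128 processors for 99.9% parallelizable work, quoted for the good-speedup class below, is consistent with Amdahl's law; a quick check (our own arithmetic, not from the paper):

```python
def amdahl(parallel_fraction, procs):
    """Amdahl's-law speedup for work with the given parallelizable
    fraction on `procs` processors."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / procs)

# 99.9% perfectly parallelizable work on 128 processors.
good_128 = amdahl(0.999, 128)   # about 113.6, i.e. the quoted 114
```

The poor-speedup class (6.4 on 8 processors, 9.3 on 128) does not fit a single Amdahl fraction, so the synthetic application presumably uses a more general speedup profile for that class.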

As such, we let 75% of the jobs have good speedup, where 99.9% of the work is perfectly parallelizable (corresponding to a speedup of 114 on 128 processors). Poor-speedup jobs have a speedup of 6.4 on 8 processors and a speedup of 9.3 on 128 processors.9

6.2 Results and Lessons Learned

The performance results of all disciplines under the four knowledge cases (no knowledge, service-demand knowledge, speedup knowledge, or both) are given in Table 3 and summarized in Figs. 3 and 4. As can be seen, the response times for the run-to-completion disciplines are much higher (by up to an order of magnitude) than for the migratable or malleable preemptive disciplines. The simple preemptive, rigid discipline does not offer any advantages over the corresponding run-to-completion version. The reason is that there is insufficient flexibility in allowing a job to preempt only another that has the same desired processor requirement. The adaptive preemptive discipline is considerably better in this regard.

Adaptability appears to have the most positive effect for the run-to-completion and malleable disciplines (see Fig. 4). In the former case, makespans decreased by nearly 50% from the rigid to the adaptive variant using the subset-sum algorithm. To achieve this improvement, however, the mean response times generally increased, because processor allocations tended to be smaller (leading to longer average run times). In the malleable case, adaptability resulted in smaller but noticeable decreases in makespans (5-10%). It should be noted that the opportunity for improvement is much lower than in the RTC case, because the minimum makespan is 65412 seconds for this experiment (compared to actual observed makespans of approximately 78000 seconds).
Service-demand and speedup knowledge appeared to be most effective when either the mean response time (for the former) or the makespan (for the latter) were large, but may not be as significant as one might expect. Service-demand knowledge had limited benefit in the run-to-completion disciplines, because the high response times result from long-running jobs being activated, which the scheduler must do at some point. In the migratable and malleable preemptive disciplines, the multilevel feedback approach achieved the majority of the benefits of having service-demand information. Highlighting this difference, we often found queue lengths for run-to-completion disciplines to grow as high as 60 jobs, while for migratable or malleable disciplines they were rarely larger than five.

Given our workload, we found speedup knowledge to be of limited benefit, because poor-speedup jobs can rarely run efficiently. (To utilize processors efficiently, such a job must have a low minimum processor requirement, and must be started at the same time as a high-efficiency job; even in the best case, the maximum efficiency of a poor-speedup job will only be 58% given a minimum processor allocation of eight after scaling.) From the results, one can observe that …

9 Such a two-speedup-class workload appears to be supported by data from the Cornell Theory Center if we examine the amount of CPU time consumed by each job relative to its elapsed time [Par97].

Table 3. Performance of LSF-based scheduling disciplines: mean response time (MRT) and makespan for each discipline under the No Knowledge, Service-Demand, Speedup, and Both cases. In some trials, the discipline did not terminate within a reasonable amount of time; in these cases, a minimum bound on the mean response times is reported (indicated by a >) and the number of unfinished jobs is given in parenthesis. [The numeric body of the table did not survive extraction; among the legible fragments are bounds such as an MRT of >1342 with a makespan of >192031 and one unfinished job.]

Fig. 3. Observed mean response times for each discipline.

Fig. 4. Observed makespans for each discipline.

…service-demand knowledge can sometimes negate the benefits of having speedup knowledge, as jobs having the least remaining service demand (rather than least acquired processing time) are given higher priority.

Fig. 5. Effects of highly variable service demands on the ability of a run-to-completion scheduler to activate jobs having large minimum processor requirements. Because of the long-running jobs, the system rarely reaches a state in which all processors are available, which is necessary to schedule a job having a large minimum processor requirement.

While performing our experiments, we monitored the behaviour of each of our schedulers in order to further understand the performance results. Our observations can be summarized as follows:

- Jobs having large minimum processor requirements can often experience significant delays in run-to-completion disciplines. Since service demands have a high degree of variability, there is often at least one job running having a large service demand, making it difficult to ever schedule a job having a large minimum processor requirement. This behaviour is illustrated in Fig. 5. Even at light loads, it is quite likely for some processors to be occupied, preventing the dispatching of a job having a large processor requirement. Even the use of the SUBSET variant of the RTC disciplines cannot counteract this effect, because the scheduler still requires all processors to be available at the time it makes its scheduling decision.

- Adaptive run-to-completion disciplines can lead to more variable makespans. In a 200-job workload, the makespan is dictated essentially by the long-running jobs in the system (e.g., in one of our experiments, one job had a sequential service demand of 265000 seconds, or almost 74 hours). The makespan of a rigid discipline will be relatively predictable, because the execution time of these long jobs is set in advance. In the adaptive case, a scheduler may allocate such jobs a small number of processors, which is good from an efficiency standpoint, but can lead to much longer makespans.
Also, if long jobs are allocated few processors, which tends to occur in most adaptive disciplines as the load increases, these long jobs will occupy processors for longer periods of time (relative to the rigid case). This can make it even more difficult for jobs with large minimum processor requirements to ever find enough available processors.

The conclusion is that run-to-completion disciplines are even more problematic than originally indicated. It has previously been shown how high variability in service demands can lead to poor response times if memory is abundant; these observations show that highly variable service demands can also lead to starvation for jobs having large minimum processor requirements.

- Migratable disciplines can significantly reduce response times relative to RTC ones. However, adaptive versions of migratable disciplines can exhibit unpredictable completion times for long-running jobs, as the scheduler must commit to an allocation when a job is first activated. In some cases, the scheduler allocates a small number of processors to long-running jobs, only to have other processors subsequently become available. In a production environment, this may encourage users submitting high service-demand jobs to specify a large minimum processor allocation simply to ensure that their jobs complete within a more desirable amount of time, but having a negative effect on the sustainable throughput. In other cases, long-running jobs were allocated a large number of processors, leading to potential starvation problems. (This was the cause of the large makespans in the full-knowledge LSF-MIG-AD and LSF-MIG-ADSUBSET experiments.) In order to resume such a job once stopped, the scheduler must be capable of preempting a sufficient number of running jobs to satisfy the stopped job's processor requirement. This can be difficult at high loads, where jobs with small processor allocations are continuously being started, suspended, and resumed, since we only preempt jobs that have run at least ten minutes. In a real workload, we believe this problem will become less important as the ratio of the migration overhead to the mean service demand becomes smaller.

- From a user's perspective, malleable disciplines are most attractive.
During periods of heavy load, the system allocates jobs a small number of processors, and as the load becomes lighter, long-running jobs receive more processors. Unused processors arising from imperfect packing are never a problem, allowing a high level of utilization to be achieved. Also, jobs rarely experience starvation, because the scheduler does not commit itself to a processor allocation upon activating a job for the first time. As a result, adaptive malleable disciplines consistently performed best, and have the highest potential for low response times and high throughputs (even given a 10% re-allocation overhead).

7 Conclusions

In this paper, we presented the design of parallel-job scheduling implementations based on Platform Computing's Load Sharing Facility (LSF). We consider a wide range of disciplines, from run-to-completion to malleable preemptive ones, each with varying degrees of knowledge of job characteristics. Although these disciplines were implemented on a network of workstations, they can be used on any distributed-memory multiprocessor system supporting LSF.


<Insert Picture Here> An Experimental Model to Analyze OpenMP Applications for System Utilization An Experimental Model to Analyze OpenMP Applications for System Utilization Mark Woodyard Principal Software Engineer 1 The following is an overview of a research project. It is intended

More information

Announcements. Basic Concepts. Histogram of Typical CPU- Burst Times. Dispatcher. CPU Scheduler. Burst Cycle. Reading

Announcements. Basic Concepts. Histogram of Typical CPU- Burst Times. Dispatcher. CPU Scheduler. Burst Cycle. Reading Announcements Reading Chapter 5 Chapter 7 (Monday or Wednesday) Basic Concepts CPU I/O burst cycle Process execution consists of a cycle of CPU execution and I/O wait. CPU burst distribution What are the

More information

Linux Process Scheduling Policy

Linux Process Scheduling Policy Lecture Overview Introduction to Linux process scheduling Policy versus algorithm Linux overall process scheduling objectives Timesharing Dynamic priority Favor I/O-bound process Linux scheduling algorithm

More information

A High Performance Computing Scheduling and Resource Management Primer

A High Performance Computing Scheduling and Resource Management Primer LLNL-TR-652476 A High Performance Computing Scheduling and Resource Management Primer D. H. Ahn, J. E. Garlick, M. A. Grondona, D. A. Lipari, R. R. Springmeyer March 31, 2014 Disclaimer This document was

More information

CPU SCHEDULING (CONT D) NESTED SCHEDULING FUNCTIONS

CPU SCHEDULING (CONT D) NESTED SCHEDULING FUNCTIONS CPU SCHEDULING CPU SCHEDULING (CONT D) Aims to assign processes to be executed by the CPU in a way that meets system objectives such as response time, throughput, and processor efficiency Broken down into

More information

Automatic load balancing and transparent process migration

Automatic load balancing and transparent process migration Automatic load balancing and transparent process migration Roberto Innocente rinnocente@hotmail.com November 24,2000 Download postscript from : mosix.ps or gzipped postscript from: mosix.ps.gz Nov 24,2000

More information

Design and Implementation of Distributed Process Execution Environment

Design and Implementation of Distributed Process Execution Environment Design and Implementation of Distributed Process Execution Environment Project Report Phase 3 By Bhagyalaxmi Bethala Hemali Majithia Shamit Patel Problem Definition: In this project, we will design and

More information

Operating Systems. III. Scheduling. http://soc.eurecom.fr/os/

Operating Systems. III. Scheduling. http://soc.eurecom.fr/os/ Operating Systems Institut Mines-Telecom III. Scheduling Ludovic Apvrille ludovic.apvrille@telecom-paristech.fr Eurecom, office 470 http://soc.eurecom.fr/os/ Outline Basics of Scheduling Definitions Switching

More information

Scheduling Algorithms and Support Tools for Parallel Systems

Scheduling Algorithms and Support Tools for Parallel Systems Scheduling Algorithms and Support Tools for Parallel Systems Igor Grudenić Fakultet elektrotehnike i računarstva, Unska 3, Zagreb Abstract High Perfomance Computing (HPC) is an evolving trend in computing

More information

Programming and Scheduling Model for Supporting Heterogeneous Architectures in Linux

Programming and Scheduling Model for Supporting Heterogeneous Architectures in Linux Programming and Scheduling Model for Supporting Heterogeneous Architectures in Linux Third Workshop on Computer Architecture and Operating System co-design Paris, 25.01.2012 Tobias Beisel, Tobias Wiersema,

More information

Scheduling. Scheduling. Scheduling levels. Decision to switch the running process can take place under the following circumstances:

Scheduling. Scheduling. Scheduling levels. Decision to switch the running process can take place under the following circumstances: Scheduling Scheduling Scheduling levels Long-term scheduling. Selects which jobs shall be allowed to enter the system. Only used in batch systems. Medium-term scheduling. Performs swapin-swapout operations

More information

Multi-GPU Load Balancing for Simulation and Rendering

Multi-GPU Load Balancing for Simulation and Rendering Multi- Load Balancing for Simulation and Rendering Yong Cao Computer Science Department, Virginia Tech, USA In-situ ualization and ual Analytics Instant visualization and interaction of computing tasks

More information

Scheduling in SAS 9.3

Scheduling in SAS 9.3 Scheduling in SAS 9.3 SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc 2011. Scheduling in SAS 9.3. Cary, NC: SAS Institute Inc. Scheduling in SAS 9.3

More information

Parallel and Sequential Job Scheduling in Heterogeneous Clusters: A Simulation Study using Software in the Loop

Parallel and Sequential Job Scheduling in Heterogeneous Clusters: A Simulation Study using Software in the Loop 21, HCS Research Lab. All Rights Reserved. Parallel and Sequential Job Scheduling in Heterogeneous Clusters: A Simulation Study using Software in the Loop Dave E. Collins* and Alan D. George** High-performance

More information

Introduction to Scheduling Theory

Introduction to Scheduling Theory Introduction to Scheduling Theory Arnaud Legrand Laboratoire Informatique et Distribution IMAG CNRS, France arnaud.legrand@imag.fr November 8, 2004 1/ 26 Outline 1 Task graphs from outer space 2 Scheduling

More information

Enhancing the Monitoring of Real-Time Performance in Linux

Enhancing the Monitoring of Real-Time Performance in Linux Master of Science Thesis Enhancing the Monitoring of Real-Time Performance in Linux Author: Nima Asadi nai10001@student.mdh.se Supervisor: Mehrdad Saadatmand mehrdad.saadatmand@mdh.se Examiner: Mikael

More information

CPU Scheduling Outline

CPU Scheduling Outline CPU Scheduling Outline What is scheduling in the OS? What are common scheduling criteria? How to evaluate scheduling algorithms? What are common scheduling algorithms? How is thread scheduling different

More information

Employee Tracker Time & Attendance System. Time Banking

Employee Tracker Time & Attendance System. Time Banking Employee Tracker Time & Attendance System Time Banking Table of Contents 2. Overview 3. Absent Codes 5. Time Bank Setup 7. Assign Time Banks to Employees 11. Time Bank Withdrawals from Transactions 13.

More information

A CP Scheduler for High-Performance Computers

A CP Scheduler for High-Performance Computers A CP Scheduler for High-Performance Computers Thomas Bridi, Michele Lombardi, Andrea Bartolini, Luca Benini, and Michela Milano {thomas.bridi,michele.lombardi2,a.bartolini,luca.benini,michela.milano}@

More information

ICS 143 - Principles of Operating Systems

ICS 143 - Principles of Operating Systems ICS 143 - Principles of Operating Systems Lecture 5 - CPU Scheduling Prof. Nalini Venkatasubramanian nalini@ics.uci.edu Note that some slides are adapted from course text slides 2008 Silberschatz. Some

More information

Overview of Presentation. (Greek to English dictionary) Different systems have different goals. What should CPU scheduling optimize?

Overview of Presentation. (Greek to English dictionary) Different systems have different goals. What should CPU scheduling optimize? Overview of Presentation (Greek to English dictionary) introduction to : elements, purpose, goals, metrics lambda request arrival rate (e.g. 200/second) non-preemptive first-come-first-served, shortest-job-next

More information

Ecole des Mines de Nantes. Journée Thématique Emergente "aspects énergétiques du calcul"

Ecole des Mines de Nantes. Journée Thématique Emergente aspects énergétiques du calcul Ecole des Mines de Nantes Entropy Journée Thématique Emergente "aspects énergétiques du calcul" Fabien Hermenier, Adrien Lèbre, Jean Marc Menaud menaud@mines-nantes.fr Outline Motivation Entropy project

More information

Scheduling. Yücel Saygın. These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum

Scheduling. Yücel Saygın. These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum Scheduling Yücel Saygın These slides are based on your text book and on the slides prepared by Andrew S. Tanenbaum 1 Scheduling Introduction to Scheduling (1) Bursts of CPU usage alternate with periods

More information

Load Balancing. Load Balancing 1 / 24

Load Balancing. Load Balancing 1 / 24 Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait

More information

Scheduling Algorithms for Dynamic Workload

Scheduling Algorithms for Dynamic Workload Managed by Scheduling Algorithms for Dynamic Workload Dalibor Klusáček (MU) Hana Rudová (MU) Ranieri Baraglia (CNR - ISTI) Gabriele Capannini (CNR - ISTI) Marco Pasquali (CNR ISTI) Outline Motivation &

More information

Load Balancing in Distributed System. Prof. Ananthanarayana V.S. Dept. Of Information Technology N.I.T.K., Surathkal

Load Balancing in Distributed System. Prof. Ananthanarayana V.S. Dept. Of Information Technology N.I.T.K., Surathkal Load Balancing in Distributed System Prof. Ananthanarayana V.S. Dept. Of Information Technology N.I.T.K., Surathkal Objectives of This Module Show the differences between the terms CPU scheduling, Job

More information

Cloud Management: Knowing is Half The Battle

Cloud Management: Knowing is Half The Battle Cloud Management: Knowing is Half The Battle Raouf BOUTABA David R. Cheriton School of Computer Science University of Waterloo Joint work with Qi Zhang, Faten Zhani (University of Waterloo) and Joseph

More information

2. How many years has your charter school been in operation?

2. How many years has your charter school been in operation? Oregon Charter School Director Survey 1. Who is the sponsor of your charter school? Local District 96.0% 72 Oregon Department of Education 4.0% 3 Name of Sponsoring District 69 answered question 75 skipped

More information

Batch Scheduling and Resource Management

Batch Scheduling and Resource Management Batch Scheduling and Resource Management Luke Tierney Department of Statistics & Actuarial Science University of Iowa October 18, 2007 Luke Tierney (U. of Iowa) Batch Scheduling and Resource Management

More information

Scheduling algorithms for Linux

Scheduling algorithms for Linux Scheduling algorithms for Linux Anders Peter Fugmann IMM-THESIS-2002-65 IMM Trykt af IMM, DTU Foreword This report is the result of a masters thesis entitled Scheduling algorithms for Linux. The thesis

More information

Real-Time Scheduling 1 / 39

Real-Time Scheduling 1 / 39 Real-Time Scheduling 1 / 39 Multiple Real-Time Processes A runs every 30 msec; each time it needs 10 msec of CPU time B runs 25 times/sec for 15 msec C runs 20 times/sec for 5 msec For our equation, A

More information

Processor Scheduling. Queues Recall OS maintains various queues

Processor Scheduling. Queues Recall OS maintains various queues Processor Scheduling Chapters 9 and 10 of [OS4e], Chapter 6 of [OSC]: Queues Scheduling Criteria Cooperative versus Preemptive Scheduling Scheduling Algorithms Multi-level Queues Multiprocessor and Real-Time

More information

U-LITE Network Infrastructure

U-LITE Network Infrastructure U-LITE: a proposal for scientific computing at LNGS S. Parlati, P. Spinnato, S. Stalio LNGS 13 Sep. 2011 20 years of Scientific Computing at LNGS Early 90s: highly centralized structure based on VMS cluster

More information

CS4410 - Fall 2008 Homework 2 Solution Due September 23, 11:59PM

CS4410 - Fall 2008 Homework 2 Solution Due September 23, 11:59PM CS4410 - Fall 2008 Homework 2 Solution Due September 23, 11:59PM Q1. Explain what goes wrong in the following version of Dekker s Algorithm: CSEnter(int i) inside[i] = true; while(inside[j]) inside[i]

More information

CPU Scheduling. CPU Scheduling

CPU Scheduling. CPU Scheduling CPU Scheduling Electrical and Computer Engineering Stephen Kim (dskim@iupui.edu) ECE/IUPUI RTOS & APPS 1 CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling

More information

A Study on the Application of Existing Load Balancing Algorithms for Large, Dynamic, Heterogeneous Distributed Systems

A Study on the Application of Existing Load Balancing Algorithms for Large, Dynamic, Heterogeneous Distributed Systems A Study on the Application of Existing Load Balancing Algorithms for Large, Dynamic, Heterogeneous Distributed Systems RUPAM MUKHOPADHYAY, DIBYAJYOTI GHOSH AND NANDINI MUKHERJEE Department of Computer

More information

Common Approaches to Real-Time Scheduling

Common Approaches to Real-Time Scheduling Common Approaches to Real-Time Scheduling Clock-driven time-driven schedulers Priority-driven schedulers Examples of priority driven schedulers Effective timing constraints The Earliest-Deadline-First

More information

Scheduling and Resource Management in Computational Mini-Grids

Scheduling and Resource Management in Computational Mini-Grids Scheduling and Resource Management in Computational Mini-Grids July 1, 2002 Project Description The concept of grid computing is becoming a more and more important one in the high performance computing

More information

Scheduling Support for Heterogeneous Hardware Accelerators under Linux

Scheduling Support for Heterogeneous Hardware Accelerators under Linux Scheduling Support for Heterogeneous Hardware Accelerators under Linux Tobias Wiersema University of Paderborn Paderborn, December 2010 1 / 24 Tobias Wiersema Linux scheduler extension for accelerators

More information

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details

10.04.2008. Thomas Fahrig Senior Developer Hypervisor Team. Hypervisor Architecture Terminology Goals Basics Details Thomas Fahrig Senior Developer Hypervisor Team Hypervisor Architecture Terminology Goals Basics Details Scheduling Interval External Interrupt Handling Reserves, Weights and Caps Context Switch Waiting

More information

Grid Scheduling Dictionary of Terms and Keywords

Grid Scheduling Dictionary of Terms and Keywords Grid Scheduling Dictionary Working Group M. Roehrig, Sandia National Laboratories W. Ziegler, Fraunhofer-Institute for Algorithms and Scientific Computing Document: Category: Informational June 2002 Status

More information

Real-Time Scheduling (Part 1) (Working Draft) Real-Time System Example

Real-Time Scheduling (Part 1) (Working Draft) Real-Time System Example Real-Time Scheduling (Part 1) (Working Draft) Insup Lee Department of Computer and Information Science School of Engineering and Applied Science University of Pennsylvania www.cis.upenn.edu/~lee/ CIS 41,

More information

Process design. Process design. Process design. Operations strategy. Supply network design. Layout and flow Design. Operations management.

Process design. Process design. Process design. Operations strategy. Supply network design. Layout and flow Design. Operations management. Process design Source: Joe Schwarz, www.joyrides.com Process design Process design Supply network design Operations strategy Layout and flow Design Operations management Improvement Process technology

More information

LoadLeveler Overview. January 30-31, 2012. IBM Storage & Technology Group. IBM HPC Developer Education @ TIFR, Mumbai

LoadLeveler Overview. January 30-31, 2012. IBM Storage & Technology Group. IBM HPC Developer Education @ TIFR, Mumbai IBM HPC Developer Education @ TIFR, Mumbai IBM Storage & Technology Group LoadLeveler Overview January 30-31, 2012 Pidad D'Souza (pidsouza@in.ibm.com) IBM, System & Technology Group 2009 IBM Corporation

More information

Efficiency of Batch Operating Systems

Efficiency of Batch Operating Systems Efficiency of Batch Operating Systems a Teodor Rus rus@cs.uiowa.edu The University of Iowa, Department of Computer Science a These slides have been developed by Teodor Rus. They are copyrighted materials

More information

Distributed Operating Systems. Cluster Systems

Distributed Operating Systems. Cluster Systems Distributed Operating Systems Cluster Systems Ewa Niewiadomska-Szynkiewicz ens@ia.pw.edu.pl Institute of Control and Computation Engineering Warsaw University of Technology E&IT Department, WUT 1 1. Cluster

More information

Operating Systems Lecture #6: Process Management

Operating Systems Lecture #6: Process Management Lecture #6: Process Written by based on the lecture series of Dr. Dayou Li and the book Understanding 4th ed. by I.M.Flynn and A.McIver McHoes (2006) Department of Computer Science and Technology,., 2013

More information

Map-Reduce for Machine Learning on Multicore

Map-Reduce for Machine Learning on Multicore Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

More information

Networking Virtualization Using FPGAs

Networking Virtualization Using FPGAs Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Massachusetts,

More information

AASPI SOFTWARE PARALLELIZATION

AASPI SOFTWARE PARALLELIZATION AASPI SOFTWARE PARALLELIZATION Introduction Generation of multitrace and multispectral seismic attributes can be computationally intensive. For example, each input seismic trace may generate 50 or more

More information

Abstract: Motivation: Description of proposal:

Abstract: Motivation: Description of proposal: Efficient power utilization of a cluster using scheduler queues Kalyana Chadalvada, Shivaraj Nidoni, Toby Sebastian HPCC, Global Solutions Engineering Bangalore Development Centre, DELL Inc. {kalyana_chadalavada;shivaraj_nidoni;toby_sebastian}@dell.com

More information

Near-Dedicated Scheduling

Near-Dedicated Scheduling Near-Dedicated Scheduling Chris Brady, CRI, Boulder, Colorado, USA, Mary Ann Ciuffini, NCAR, Boulder, Colorado, USA, Bryan Hardy, CRI, Boulder, Colorado, USA ABSTRACT: With the advent of high-performance

More information

A Review on Load Balancing In Cloud Computing 1

A Review on Load Balancing In Cloud Computing 1 www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 4 Issue 6 June 2015, Page No. 12333-12339 A Review on Load Balancing In Cloud Computing 1 Peenaz Pathak, 2 Er.Kamna

More information

Comparative Study of Distributed Resource Management Systems SGE, LSF, PBS Pro, and LoadLeveler

Comparative Study of Distributed Resource Management Systems SGE, LSF, PBS Pro, and LoadLeveler Comparative Study of Distributed Resource Management Systems SGE, LSF, PBS Pro, and LoadLeveler Yonghong Yan, Barbara Chapman {yanyh,chapman}@cs.uh.edu Department of Computer Science University of Houston

More information

CPU Scheduling. CSC 256/456 - Operating Systems Fall 2014. TA: Mohammad Hedayati

CPU Scheduling. CSC 256/456 - Operating Systems Fall 2014. TA: Mohammad Hedayati CPU Scheduling CSC 256/456 - Operating Systems Fall 2014 TA: Mohammad Hedayati Agenda Scheduling Policy Criteria Scheduling Policy Options (on Uniprocessor) Multiprocessor scheduling considerations CPU

More information

Embedded Systems. 6. Real-Time Operating Systems

Embedded Systems. 6. Real-Time Operating Systems Embedded Systems 6. Real-Time Operating Systems Lothar Thiele 6-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

Chapter 5: CPU Scheduling. Operating System Concepts 8 th Edition

Chapter 5: CPU Scheduling. Operating System Concepts 8 th Edition Chapter 5: CPU Scheduling Silberschatz, Galvin and Gagne 2009 Chapter 5: CPU Scheduling Basic Concepts Scheduling Criteria Scheduling Algorithms Thread Scheduling Multiple-Processor Scheduling Operating

More information

Predictable response times in event-driven real-time systems

Predictable response times in event-driven real-time systems Predictable response times in event-driven real-time systems Automotive 2006 - Security and Reliability in Automotive Systems Stuttgart, October 2006. Presented by: Michael González Harbour mgh@unican.es

More information