
Scheduling Multithreaded Computations by Work Stealing

Robert D. Blumofe
The University of Texas at Austin

Charles E. Leiserson
MIT Laboratory for Computer Science

Abstract

This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies.

Specifically, our analysis shows that the expected time to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.

This research was done while Robert D. Blumofe was at the MIT Laboratory for Computer Science and was supported in part by an ARPA High-Performance Computing Graduate Fellowship.
This research was supported in part by the Advanced Research Projects Agency under Contract N

1 Introduction

For efficient execution of a dynamically growing "multithreaded" computation on a MIMD-style parallel computer, a scheduling algorithm must ensure that enough threads are active concurrently to keep the processors busy. Simultaneously, it should ensure that the number of concurrently active threads remains within reasonable limits so that memory requirements are not unduly large. Moreover, the scheduler should also try to maintain related threads on the same processor, if possible, so that communication between them can be minimized. Needless to say, achieving all these goals simultaneously can be difficult.

Two scheduling paradigms have arisen to address the problem of scheduling multithreaded computations: work sharing and work stealing. In work sharing, whenever a processor generates new threads, the scheduler attempts to migrate some of them to other processors in hopes of distributing the work to underutilized processors. In work stealing, however, underutilized processors take the initiative: they attempt to "steal" threads from other processors. Intuitively, the migration of threads occurs less frequently with work stealing than with work sharing, since when all processors have work to do, no threads are migrated by a work-stealing scheduler, but threads are always migrated by a work-sharing scheduler.

The work-stealing idea dates back at least as far as Burton and Sleep's research on parallel execution of functional programs [16] and Halstead's implementation of Multilisp [30]. These authors point out the heuristic benefits of work stealing with regard to space and communication. Since then, many researchers have implemented variants on this strategy [11, 21, 23, 29, 34, 37, 46]. Rudolph, Slivkin-Allalouf, and Upfal [43] analyzed a randomized work-stealing strategy for load balancing independent jobs on a parallel computer, and Karp and Zhang [33] analyzed a randomized work-stealing strategy for parallel backtrack search. Recently, Zhang and Ortynski [48] have obtained good bounds on the communication requirements of this algorithm.

In this paper, we present and analyze a work-stealing algorithm for scheduling "fully strict" (well-structured) multithreaded computations. This class of computations encompasses both backtrack search computations [33, 48] and divide-and-conquer computations [47], as well as dataflow computations [2] in which threads may stall due to a data dependency. We analyze our algorithms in a stringent atomic-access model similar to the atomic message-passing model of [36] in which concurrent accesses to the same data structure are serially queued by an adversary.
Our main contribution is a randomized work-stealing scheduling algorithm for fully strict multithreaded computations that is provably efficient in terms of time, space, and communication. We prove that the expected time to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. In addition, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. These bounds are better than previous bounds for work-sharing schedulers [10], and the work-stealing scheduler is much simpler and eminently practical. Part of this improvement is due to our focusing on fully strict computations, as compared to the (general) strict computations studied in [10]. We also prove that the expected total communication of the execution is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This bound is existentially tight to within a constant factor, meeting the lower bound of Wu and Kung [47] for communication in parallel divide-and-conquer. In contrast, work-sharing schedulers have nearly worst-case behavior for communication. Thus, our results bolster the folk wisdom that work stealing is superior to work sharing.

Others have studied and continue to study the problem of efficiently managing the space requirements of parallel computations. Culler and Arvind [19] and Ruggiero and Sargeant [44] give heuristics for limiting the space required by dataflow programs. Burton [14] shows how to limit space in certain parallel computations without causing deadlock. More recently, Burton [15] has developed and analyzed a scheduling algorithm with provably good time and space bounds. Blelloch, Gibbons, Matias, and Narlikar [3, 4] have also recently developed and analyzed scheduling algorithms with provably good time and space bounds. It is not yet clear whether any of these algorithms are as practical as work stealing.

The remainder of this paper is organized as follows. In Section 2 we review the graph-theoretic model of multithreaded computations introduced in [10], which provides a theoretical basis for analyzing schedulers. Section 3 gives a simple scheduling algorithm which uses a central queue. This "busy-leaves" algorithm forms the basis for our randomized work-stealing algorithm, which we present in Section 4. In Section 5 we introduce the atomic-access model that we use to analyze execution time and communication costs for the work-stealing algorithm, and we present and analyze a combinatorial "balls and bins" game that we use to derive a bound on the contention that arises in random work stealing. We then use this bound along with a delay-sequence argument [41] in Section 6 to analyze the execution time and communication cost of the work-stealing algorithm. To conclude, in Section 7 we briefly discuss how the theoretical ideas in this paper have been applied to the Cilk programming language and runtime system [8, 25], as well as make some concluding remarks.

2 A model of multithreaded computation

This section reprises the graph-theoretic model of multithreaded computation introduced in [10]. We also define what it means for computations to be "fully strict." We conclude with a statement of the greedy-scheduling theorem, which is an adaptation of theorems by Brent [13] and Graham [27, 28] on dag scheduling.
A multithreaded computation is composed of a set of threads, each of which is a sequential ordering of unit-time instructions. The instructions are connected by dependency edges, which provide a partial ordering on which instructions must execute before which other instructions. In Figure 1, for example, each shaded block is a thread with circles representing instructions and the horizontal edges, called continue edges, representing the sequential ordering. Thread $\Gamma_5$ of this example contains 3 instructions: $v_{10}$, $v_{11}$, and $v_{12}$. The instructions of a thread must execute in this sequential order from the first (leftmost) instruction to the last (rightmost) instruction. In order to execute a thread, we allocate for it a chunk of memory, called an activation frame, that the instructions of the thread can use to store the values on which they compute.

A $P$-processor execution schedule for a multithreaded computation determines which processors of a $P$-processor parallel computer execute which instructions at each step. An execution schedule depends on the particular multithreaded computation and the number $P$ of processors. In any given step of an execution schedule, each processor executes at most one instruction.

During the course of its execution, a thread may create, or spawn, other threads. Spawning a thread is like a subroutine call, except that the spawning thread can operate concurrently with the spawned thread. We consider spawned threads to be children of the thread that did the spawning, and a thread may spawn as many children as it desires. In this way, threads are organized into a spawn tree as indicated in Figure 1 by the downward-pointing, shaded dependency edges, called spawn edges, that connect threads to their spawned children. The spawn tree is the parallel analog of a call tree. In our example computation, the spawn tree's root thread $\Gamma_1$ has two children, $\Gamma_2$ and $\Gamma_6$, and thread $\Gamma_2$ has three children, $\Gamma_3$, $\Gamma_4$, and $\Gamma_5$. Threads $\Gamma_3$, $\Gamma_4$, $\Gamma_5$, and $\Gamma_6$, which have no children, are leaf threads.

[Figure 1: A multithreaded computation. This computation contains 23 instructions $v_1, v_2, \ldots, v_{23}$ and 6 threads $\Gamma_1, \Gamma_2, \ldots, \Gamma_6$.]

Each spawn edge goes from a specific instruction (the instruction that actually does the spawn operation) in the parent thread to the first instruction of the child thread. An execution schedule must obey this edge in that no processor may execute an instruction in a spawned child thread until after the spawning instruction in the parent thread has been executed. In our example computation (Figure 1), due to the spawn edge $(v_6, v_7)$, instruction $v_7$ cannot be executed until after the spawning instruction $v_6$. Consistent with our unit-time model of instructions, a single instruction may spawn at most one child. When the spawning instruction executes, it allocates an activation frame for the new child thread. Once a thread has been spawned and its frame has been allocated, we say the thread is alive or living. When the last instruction of a thread executes, it deallocates its frame and the thread dies.

An execution schedule generally respects other dependencies besides those represented by continue and spawn edges. Consider an instruction that produces a data value to be consumed by another instruction. Such a producer/consumer relationship precludes the consuming instruction from executing until after the producing instruction. To enforce such orderings, other dependency edges, called join edges, may be required, as shown in Figure 1 by the curved edges. If the execution of a thread arrives at a consuming instruction before the producing instruction has executed, execution of the consuming thread cannot continue: the thread stalls. Once the producing instruction executes, the join dependency is resolved, which enables the consuming thread to resume its execution: the thread becomes ready. A multithreaded computation does not model the means by which join dependencies get resolved or by which unresolved join dependencies get detected. In implementation, resolution and detection can be accomplished using mechanisms such as join counters [8], futures [30], or I-structures [2].

We make two technical assumptions regarding join edges. We first assume that each instruction has at most a constant number of join edges incident on it. This assumption is consistent with our unit-time model of instructions. The second assumption is that no join edges enter the instruction immediately following a spawn. This assumption means that when a parent thread spawns a child thread, the parent cannot immediately stall. It continues to be ready to execute for at least one more instruction.

An execution schedule must obey the constraints given by the spawn, continue, and join edges of the computation. These dependency edges form a directed graph of instructions, and no processor may execute an instruction until after all of the instruction's predecessors in this graph have been executed. So that execution schedules exist, this graph must be acyclic. That is, it must be a directed acyclic graph, or dag. At any given step of an execution schedule, an instruction is ready if all of its predecessors in the dag have been executed.

We make the simplifying assumption that a parent thread remains alive until all its children die, and thus, a thread does not deallocate its activation frame until all its children's frames have been deallocated. Although this assumption is not absolutely necessary, it gives the execution a natural structure, and it will simplify our analyses of space utilization. In accounting for space utilization, we also assume that the frames hold all the values used by the computation; there is no global storage available to the computation outside the frames (or if such storage is available, then we do not account for it). Therefore, the space used at a given time in executing a computation is the total size of all frames used by all living threads at that time, and the total space used in executing a computation is the maximum such value over the course of the execution.

To summarize, a multithreaded computation can be viewed as a dag of instructions connected by dependency edges. The instructions are connected by continue edges into threads, and the threads form a spawn tree with the spawn edges. When a thread is spawned, an activation frame is allocated and this frame remains allocated as long as the thread remains alive. A living thread may be either ready or stalled due to an unresolved dependency.
A given multithreaded program when run on a given input can sometimes generate more than one multithreaded computation. In that case, we say the program is nondeterministic. If the same multithreaded computation is generated by the program on the input no matter how the computation is scheduled, then the program is deterministic. In this paper, we shall analyze multithreaded computations, not multithreaded programs. Specifically, we shall not worry about how the multithreaded computation is generated. Instead, we shall study its properties in an a posteriori fashion.

Because multithreaded computations with arbitrary dependencies can be impossible to schedule efficiently [10], we study subclasses of general multithreaded computations in which the kinds of synchronizations that can occur are restricted. A strict multithreaded computation is one in which all join edges from a thread go to an ancestor of the thread in the activation tree. In a strict computation, the only edge into a subtree (emanating from outside the subtree) is the spawn edge that spawns the subtree's root thread. For example, the computation of Figure 1 is strict, and the only edge into the subtree rooted at $\Gamma_2$ is the spawn edge $(v_2, v_3)$. Thus, strictness means that a thread cannot be invoked before all of its arguments are available, although the arguments can be garnered in parallel. A fully strict computation is one in which all join edges from a thread go to the thread's parent. A fully strict computation is, in a sense, a "well-structured" computation, in that all join edges from a subtree (of the spawn tree) emanate from the subtree's root. The example computation of Figure 1 is fully strict. Any multithreaded computation that can be executed in a depth-first manner on a single processor can be made either strict or fully strict by altering the dependency structure, possibly affecting the achievable parallelism, but not affecting the semantics of the computation [5].

We quantify and bound the execution time of a computation on a $P$-processor parallel computer in terms of the computation's "work" and "critical-path length." We define the work of the computation to be the total number of instructions and the critical-path length to be the length of a longest directed path in the dag. Our example computation (Figure 1) has work 23 and critical-path length 10. For a given computation, let $T(X)$ denote the time to execute the computation using $P$-processor execution schedule $X$, and let

$$T_P = \min_X T(X)$$

denote the minimum execution time with $P$ processors, the minimum being taken over all $P$-processor execution schedules for the computation. Then $T_1$ is the work of the computation, since a 1-processor computer can only execute one instruction at each step, and $T_\infty$ is the critical-path length, since even with arbitrarily many processors, each instruction on a path must execute serially. Notice that we must have $T_P \ge T_1/P$, because $P$ processors can execute only $P$ instructions per time step, and of course, we must have $T_P \ge T_\infty$.

Early work on dag scheduling by Brent [13] and Graham [27, 28] shows that there exist $P$-processor execution schedules $X$ with $T(X) \le T_1/P + T_\infty$. As the sum of two lower bounds, this upper bound is universally optimal to within a factor of 2. The following theorem, proved in [10, 20], extends these results minimally to show that this upper bound on $T_P$ can be obtained by greedy schedules: those in which at each step of the execution, if at least $P$ instructions are ready, then $P$ instructions execute, and if fewer than $P$ instructions are ready, then all execute.

Theorem 1 (The greedy-scheduling theorem) For any multithreaded computation with work $T_1$ and critical-path length $T_\infty$, and for any number $P$ of processors, any greedy $P$-processor execution schedule $X$ achieves $T(X) \le T_1/P + T_\infty$.

Generally, we are interested in schedules that achieve linear speedup, that is, $T(X) = O(T_1/P)$. For a greedy schedule, linear speedup occurs when the parallelism, which we define to be $T_1/T_\infty$, satisfies $T_1/T_\infty = \Omega(P)$.
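As an informal illustration of the greedy-scheduling theorem, the following Python sketch simulates a greedy $P$-processor schedule on a small dag and compares the number of steps against the work and the critical-path length. It is only a sketch under stated assumptions: the function names and the 8-instruction dag `EXAMPLE_DAG` are hypothetical and are not the computation of Figure 1, and a dag is represented simply as a mapping from each instruction to its list of successors.

```python
from collections import deque


def _indegrees(succ):
    """Count the predecessors of every instruction in the dag."""
    indeg = {v: 0 for v in succ}
    for v in succ:
        for w in succ[v]:
            indeg[w] += 1
    return indeg


def critical_path_length(succ):
    """Number of instructions on a longest directed path (T_infinity)."""
    indeg = _indegrees(succ)
    depth = {v: 1 for v in succ}
    queue = deque(v for v in succ if indeg[v] == 0)
    while queue:
        v = queue.popleft()
        for w in succ[v]:
            depth[w] = max(depth[w], depth[v] + 1)
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return max(depth.values())


def greedy_schedule_length(succ, P):
    """Steps taken by a greedy schedule: run min(P, #ready) instructions per step."""
    indeg = _indegrees(succ)
    ready = [v for v in succ if indeg[v] == 0]
    steps = 0
    while ready:
        running, ready = ready[:P], ready[P:]
        steps += 1
        for v in running:
            for w in succ[v]:
                indeg[w] -= 1
                if indeg[w] == 0:
                    ready.append(w)
    return steps


# Hypothetical example dag (an assumption for illustration, not Figure 1).
EXAMPLE_DAG = {
    "a": ["b", "c"], "b": ["d", "e"], "c": ["f"], "d": ["g"],
    "e": ["g"], "f": ["h"], "g": ["h"], "h": [],
}
```

For this example dag the work is the number of instructions and the critical-path length is computed by `critical_path_length`; for every $P$ the simulated step count should fall between $\max(T_1/P, T_\infty)$ and $T_1/P + T_\infty$.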
To quantify the space used by a given execution schedule of a computation, we define the stack depth of a thread to be the sum of the sizes of the activation frames of all its ancestors, including itself. The stack depth of a multithreaded computation is the maximum stack depth of any of its threads. We shall denote by $S_1$ the minimum amount of space possible for any 1-processor execution of a multithreaded computation, which is equal to the stack depth of the computation. Let $S(X)$ denote the space used by a $P$-processor execution schedule $X$ of a multithreaded computation. We shall be interested in those execution schedules that exhibit at most linear expansion of space, that is, $S(X) = O(S_1 P)$, which is existentially optimal to within a constant factor [10].
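The definition of stack depth can be made concrete with a short Python sketch. The spawn tree below follows the shape described for Figure 1 (root $\Gamma_1$ with children $\Gamma_2$ and $\Gamma_6$, and $\Gamma_2$ with children $\Gamma_3$, $\Gamma_4$, $\Gamma_5$), but the frame sizes are made-up values chosen only for illustration, and the names `stack_depth`, `PARENT`, and `FRAME_SIZE` are hypothetical.

```python
def stack_depth(parent, frame_size):
    """Maximum, over all threads, of the total frame size on the path to the root."""
    def depth_of(thread):
        total = 0
        while thread is not None:
            total += frame_size[thread]
            thread = parent[thread]
        return total

    return max(depth_of(t) for t in parent)


# Spawn tree shaped like the one described for Figure 1 (thread i stands for Gamma_i).
PARENT = {1: None, 2: 1, 3: 2, 4: 2, 5: 2, 6: 1}

# Assumed frame sizes; the source does not give any.
FRAME_SIZE = {1: 4, 2: 2, 3: 1, 4: 3, 5: 1, 6: 5}

S1 = stack_depth(PARENT, FRAME_SIZE)   # stack depth of the computation
LINEAR_EXPANSION_BOUND_P2 = S1 * 2     # the quantity S_1 * P for P = 2
```

With these assumed sizes, the deepest paths are the ones ending at thread 4 and at thread 6, and a schedule exhibiting linear expansion of space on 2 processors would use at most a constant multiple of `LINEAR_EXPANSION_BOUND_P2`.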

3 The busy-leaves property

Once a thread $\Gamma$ has been spawned in a strict computation, a single processor can complete the execution of the entire subcomputation rooted at $\Gamma$ even if no other progress is made on other parts of the computation. In other words, from the time the thread $\Gamma$ is spawned until the time $\Gamma$ dies, there is always at least one thread from the subcomputation rooted at $\Gamma$ that is ready. In particular, no leaf thread in a strict multithreaded computation can stall. As we shall see, this property allows an execution schedule to keep the leaves "busy." By combining this "busy-leaves" property with the greedy property, we derive execution schedules that simultaneously exhibit linear speedup and linear expansion of space.

In this section, we show that for any number $P$ of processors and any strict multithreaded computation with work $T_1$, critical-path length $T_\infty$, and stack depth $S_1$, there exists a $P$-processor execution schedule $X$ that achieves time $T(X) \le T_1/P + T_\infty$ and space $S(X) \le S_1 P$ simultaneously. We give a simple online $P$-processor parallel algorithm (the Busy-Leaves Algorithm) to compute such a schedule. This simple algorithm will form the basis for the randomized work-stealing algorithm presented in Section 4.

The Busy-Leaves Algorithm operates online in the following sense. Before the $t$th step, the algorithm has computed and executed the first $t - 1$ steps of the execution schedule. At the $t$th step, the algorithm uses only information from the portion of the computation revealed so far in the execution to compute and execute the $t$th step of the schedule. In particular, it does not use any information from instructions not yet executed or threads not yet spawned.

The Busy-Leaves Algorithm maintains all living threads in a single thread pool which is uniformly available to all $P$ processors. When spawns occur, new threads are added to this global pool, and when a processor needs work, it removes a ready thread from the pool. Though we describe the algorithm as a $P$-processor parallel algorithm, we shall not analyze it as such. Specifically, in computing the $t$th step of the schedule, we allow each processor to add threads to the thread pool and delete threads from it. Thus, we ignore the effects of processors contending for access to the pool. In fact, we shall only analyze properties of the schedule itself and ignore the cost incurred by the algorithm in computing the schedule. (Scheduling overheads will be analyzed for the randomized work-stealing algorithm, however.)

The Busy-Leaves Algorithm operates as follows. The algorithm begins with the root thread in the global thread pool and all processors idle. At the beginning of each step, each processor either is idle or has a thread to work on. Those processors that are idle begin the step by attempting to remove any ready thread from the pool. If there are sufficiently many ready threads in the pool to satisfy all of the idle processors, then every idle processor gets a ready thread to work on. Otherwise, some processors remain idle. Then, each processor that has a thread to work on executes the next instruction from that thread. In general, once a processor has a thread, call it $\Gamma_a$, to work on, it executes an instruction from $\Gamma_a$ at each step until the thread either spawns, stalls, or dies, in which case, it performs according to the following rules.

➊ Spawns: If the thread $\Gamma_a$ spawns a child $\Gamma_b$, then the processor finishes the current step by returning $\Gamma_a$ to the thread pool. The processor begins the next step working on $\Gamma_b$.

➋ Stalls: If the thread $\Gamma_a$ stalls, then the processor finishes the current step by returning $\Gamma_a$ to the thread pool. The processor begins the next step idle.

➌ Dies: If the thread $\Gamma_a$ dies, then the processor finishes the current step by checking to see if $\Gamma_a$'s parent thread $\Gamma_b$ currently has any living children. If $\Gamma_b$ has no live children and no other processor is working on $\Gamma_b$, then the processor takes $\Gamma_b$ from the pool and begins the next step working on $\Gamma_b$. Otherwise, the processor begins the next step idle.

[Figure 2: A 2-processor execution schedule computed by the Busy-Leaves Algorithm for the computation of Figure 1. This schedule lists the living threads in the global thread pool at each step just after each idle processor has removed a ready thread. It also lists the ready thread being worked on and the instruction executed by each of the 2 processors, $p_1$ and $p_2$, at each step. Living threads that are ready are listed in bold. The other living threads are stalled.]

Figure 2 illustrates these three rules in a 2-processor execution schedule computed by the Busy-Leaves Algorithm on the computation of Figure 1. Rule ➊: At step 2, processor $p_1$ working on thread $\Gamma_1$ executes $v_2$, which spawns the child $\Gamma_2$, so $p_1$ places $\Gamma_1$ back in the pool (to be picked up at the beginning of the next step by the idle $p_2$) and begins the next step working on $\Gamma_2$. Rule ➋: At step 8, processor $p_2$ working on thread $\Gamma_1$ executes $v_{21}$ and $\Gamma_1$ stalls, so $p_2$ returns $\Gamma_1$ to the pool and begins the next step idle (and remains idle since the thread pool contains no ready threads). Rule ➌: At step 13, processor $p_1$ working on $\Gamma_2$ executes $v_{15}$ and $\Gamma_2$ dies, so $p_1$ retrieves the parent $\Gamma_1$ from the pool and begins the next step working on $\Gamma_1$.

Besides being greedy, for any strict computation, the schedule computed by the Busy-Leaves Algorithm maintains the busy-leaves property: at every time step during the execution, every leaf in the "spawn subtree" has a processor working on it. We define the spawn subtree at any time step $t$ to be the portion of the spawn tree consisting of just those threads that are alive at step $t$. To restate the busy-leaves property, at every time step, every living thread that has no living descendants has a processor working on it. We shall now prove this fact and show that it implies linear expansion of space. It is worth noting that not every multithreaded computation has a schedule that maintains the busy-leaves property, but every strict multithreaded computation does. We begin by showing that any schedule that maintains the busy-leaves property exhibits linear expansion of space.

Lemma 2 For any multithreaded computation with stack depth $S_1$, any $P$-processor execution schedule $X$ that maintains the busy-leaves property uses space bounded by $S(X) \le S_1 P$.

Proof: The busy-leaves property implies that at all time steps $t$, the spawn subtree has at most $P$ leaves. For each such leaf, the space used by it and all of its ancestors is at most $S_1$, and therefore, the space in use at any time step $t$ is at most $S_1 P$.

For schedules that maintain the busy-leaves property, the upper bound $S_1 P$ is conservative. By charging $S_1$ space for each busy leaf, we may be overcharging. For some computations, by knowing that the schedule preserves the busy-leaves property, we can appeal directly to the fact that the spawn subtree never has more than $P$ leaves to obtain tight bounds on space usage [6].

We finish this section by showing that for strict computations, the Busy-Leaves Algorithm computes a schedule that is both greedy and maintains the busy-leaves property.

Theorem 3 For any number $P$ of processors and any strict multithreaded computation with work $T_1$, critical-path length $T_\infty$, and stack depth $S_1$, the Busy-Leaves Algorithm computes a $P$-processor execution schedule $X$ whose execution time satisfies $T(X) \le T_1/P + T_\infty$ and whose space satisfies $S(X) \le S_1 P$.

Proof: The time bound follows directly from the greedy-scheduling theorem (Theorem 1), since the Busy-Leaves Algorithm computes a greedy schedule. The space bound follows from Lemma 2 if we can show that the Busy-Leaves Algorithm maintains the busy-leaves property. We prove this fact by induction on the number of steps. At the first step of the algorithm, the spawn subtree contains just the root thread which is a leaf, and some processor is working on it. We must show that all of the algorithm rules preserve the busy-leaves property. When a processor has a thread $\Gamma_a$ to work on, it executes instructions from that thread until it either spawns, stalls, or dies. Rule ➊: If $\Gamma_a$ spawns a child $\Gamma_b$, then $\Gamma_a$ is not a leaf (even if it was before) and $\Gamma_b$ is a leaf. In this case, the processor works on $\Gamma_b$, so the new leaf is busy. Rule ➋: If $\Gamma_a$ stalls, then $\Gamma_a$ cannot be a leaf since in a strict computation, the unresolved dependency must come from a descendant. Rule ➌: If $\Gamma_a$ dies, then its parent thread $\Gamma_b$ may turn into a leaf. In this case, the processor works on $\Gamma_b$ unless some other processor already is, so the new leaf is guaranteed to be busy.

We now know that every strict multithreaded computation has an efficient execution schedule, and we know how to find it. But these facts take us only so far. Execution schedules must be computed efficiently online, and though the Busy-Leaves Algorithm does compute efficient execution schedules and does operate online, it surely does not do so efficiently, except possibly in the case of small-scale symmetric multiprocessors. This lack of scalability is a consequence of employing a single centralized thread pool at which all processors must contend for access. In the next section, we present a distributed online scheduling algorithm, and in the following sections, we prove that it is both efficient and scalable.

4 A randomized work-stealing algorithm

In this section, we present an online, randomized work-stealing algorithm for scheduling multithreaded computations on a parallel computer. Also, we present an important structural lemma which is used at the end of this section to show that for fully strict computations, this algorithm causes at most a linear expansion of space. This lemma reappears in Section 6 to show that for fully strict computations, this algorithm achieves linear speedup and generates existentially optimal amounts of communication.

In the Work-Stealing Algorithm, the centralized thread pool of the Busy-Leaves Algorithm is distributed across the processors. Specifically, each processor maintains a ready deque data structure of threads. The ready deque has two ends: a top and a bottom. Threads can be inserted on the bottom and removed from either end. A processor treats its ready deque like a call stack, pushing and popping from the bottom. Threads that are migrated to other processors are removed from the top.

In general, a processor obtains work by removing the thread at the bottom of its ready deque. It starts working on the thread, call it $\Gamma_a$, and continues executing $\Gamma_a$'s instructions until $\Gamma_a$ spawns, stalls, dies, or enables a stalled thread, in which case, it performs according to the following rules.

➊ Spawns: If the thread $\Gamma_a$ spawns a child $\Gamma_b$, then $\Gamma_a$ is placed on the bottom of the ready deque, and the processor commences work on $\Gamma_b$.

➋ Stalls: If the thread $\Gamma_a$ stalls, its processor checks the ready deque. If the deque contains any threads, then the processor removes and begins work on the bottommost thread. If the ready deque is empty, however, the processor begins work stealing: it steals the topmost thread from the ready deque of a randomly chosen processor and begins work on it. (This work-stealing strategy is elaborated below.)

➌ Dies: If the thread $\Gamma_a$ dies, then the processor follows rule ➋ as in the case of $\Gamma_a$ stalling.

➍ Enables: If the thread $\Gamma_a$ enables a stalled thread $\Gamma_b$, the now-ready thread $\Gamma_b$ is placed on the bottom of the ready deque of $\Gamma_a$'s processor.
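The deque discipline behind the four rules can be sketched in a few lines of Python. This is only an illustrative sketch: the class `ReadyDeque` and the helper names are hypothetical, threads are represented by plain strings, and the code models only how the owning processor manipulates its own deque, not the scheduling of instructions or any synchronization between processors.

```python
from collections import deque


class ReadyDeque:
    """Ready deque of one processor: left end is the top, right end is the bottom."""

    def __init__(self):
        self._threads = deque()

    def push_bottom(self, thread):
        self._threads.append(thread)

    def pop_bottom(self):
        """Used by the owner; returns None when the deque is empty."""
        return self._threads.pop() if self._threads else None

    def steal_top(self):
        """Used by a thief; returns None when the deque is empty."""
        return self._threads.popleft() if self._threads else None

    def __len__(self):
        return len(self._threads)


def on_spawn(ready, parent, child):
    """Rule 1: the parent goes on the bottom; the processor works on the child."""
    ready.push_bottom(parent)
    return child


def on_stall_or_die(ready):
    """Rules 2 and 3: resume the bottommost thread; None means 'start stealing'."""
    return ready.pop_bottom()


def on_enable(ready, enabled_thread):
    """Rule 4: the newly ready thread goes on the bottom of the enabler's deque."""
    ready.push_bottom(enabled_thread)
```

In this sketch, a processor whose current thread spawns twice in a row ends up with the oldest ancestor at the top of its deque, which is the thread a thief would take, while the owner itself resumes the most recently suspended parent from the bottom.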
A thread can simultaneously enable a stalled thread and stall or die, in which case we first perform rule ➍ for enabling and then rule ➋ for stalling or rule ➌ for dying. Except for rule ➍ for the case when a thread enables a stalled thread, these rules are analogous to the rules of the Busy-Leaves Algorithm, and as we shall see, rule ➍ is needed to ensure that the algorithm maintains important structural properties, including the busy-leaves property.

The Work-Stealing Algorithm begins with all ready deques empty. The root thread of the multithreaded computation is placed in the ready deque of one processor, while the other processors start work stealing.

When a processor begins work stealing, it operates as follows. The processor becomes a thief and attempts to steal work from a victim processor chosen uniformly at random. The thief queries the ready deque of the victim, and if it is nonempty, the thief removes and begins work on the top thread. If the victim's ready deque is empty, however, the thief tries again, picking another victim at random.
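The thief's loop described above can be sketched as follows. The sketch makes several assumptions that the text does not fix: the victim is drawn uniformly from the processors other than the thief (so at least two processors are required), each ready deque is a plain `collections.deque` whose left end is the top, and a hypothetical `max_attempts` cap is added only so that the loop terminates when no work exists anywhere. Contention between concurrent thieves is not modeled.

```python
import random
from collections import deque


def steal_once(deques, thief, rng):
    """One steal attempt: pick a random victim and take the top thread, if any."""
    victims = [p for p in range(len(deques)) if p != thief]
    victim = rng.choice(victims)          # uniform over the other processors
    if deques[victim]:
        return victim, deques[victim].popleft()   # left end is the top
    return victim, None


def steal_until_success(deques, thief, rng, max_attempts=1000):
    """Repeat steal attempts until a thread is obtained or the cap is reached."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        _, thread = steal_once(deques, thief, rng)
        if thread is not None:
            return thread, attempts
    return None, attempts
```

For example, with four processors of which only one holds threads, `steal_until_success(deques, 0, random.Random(0))` keeps choosing victims until it happens to pick the loaded processor, and it then returns that processor's topmost thread together with the number of attempts made.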

[Figure 3: The structure of a processor's ready deque. The black instruction in each thread indicates the thread's currently ready instruction. Only thread $\Gamma_k$ may have been worked on since it last spawned a child. The dashed edges are the "deque edges" introduced in Section 6.]

We now state and prove an important lemma on the structure of threads in the ready deque of any processor during the execution of a fully strict computation. This lemma is used later in this section to analyze execution space and in Section 6 to analyze execution time and communication. Figure 3 illustrates the lemma.

Lemma 4 In the execution of any fully strict multithreaded computation by the Work-Stealing Algorithm, consider any processor $p$ and any given time step at which $p$ is working on a thread. Let $\Gamma_0$ be the thread that $p$ is working on, let $k$ be the number of threads in $p$'s ready deque, and let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote the threads in $p$'s ready deque ordered from bottom to top, so that $\Gamma_1$ is the bottommost and $\Gamma_k$ is the topmost. If we have $k > 0$, then the threads in $p$'s ready deque satisfy the following properties:

➀ For $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$.

➁ If we have $k > 1$, then for $i = 1, 2, \ldots, k - 1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$.

Proof: The proof is a straightforward induction on execution time. Execution begins with the root thread in some processor's ready deque and all other ready deques empty, so the lemma vacuously holds at the outset. Now, consider any step of the algorithm at which processor $p$ executes an instruction from thread $\Gamma_0$. Let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote the $k$ threads in $p$'s ready deque before the step, and suppose that either $k = 0$ or both properties hold. Let $\Gamma'_0$ denote the thread (if any) being worked on by $p$ after the step, and let $\Gamma'_1, \Gamma'_2, \ldots, \Gamma'_{k'}$ denote the $k'$ threads in $p$'s ready deque after the step. We now look at the rules of the algorithm and show that they all preserve the lemma. That is, either $k' = 0$ or both properties hold after the step.

Rule ➊: If $\Gamma_0$ spawns a child, then $p$ pushes $\Gamma_0$ onto the bottom of the ready deque and commences work on the child. Thus, $\Gamma'_0$ is the child, we have $k' = k + 1 > 0$, and for $j = 1, 2, \ldots, k'$, we have $\Gamma'_j = \Gamma_{j-1}$. See Figure 4. Now, we can check both properties. Property ➀: If $k' > 1$, then for $j = 2, 3, \ldots, k'$, thread $\Gamma'_j$ is the parent of $\Gamma'_{j-1}$, since before the spawn we have $k > 0$, which means that for $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$. Moreover, $\Gamma'_1$ is obviously the parent of $\Gamma'_0$. Property ➁: If $k' > 2$, then for $j = 2, 3, \ldots, k' - 1$, thread $\Gamma'_j$ has not been worked on since it spawned $\Gamma'_{j-1}$, because before the spawn we have $k > 1$, which means that for $i = 1, 2, \ldots, k - 1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$. Finally, thread $\Gamma'_1$ has not been worked on since it spawned $\Gamma'_0$, because the spawn only just occurred.

[Figure 4: The ready deque of a processor before and after the thread $\Gamma_0$ that it is working on spawns a child. (Note that the threads $\Gamma_0$ and $\Gamma'_0$ are not actually in the deque; they are the threads being worked on before and after the spawn.) (a) Before spawn. (b) After spawn.]

Rules ➋ and ➌: If $\Gamma_0$ stalls or dies, then we have two cases to consider. If $k = 0$, the ready deque is empty, so the processor commences work stealing, and when the processor steals and begins work on a thread, we have $k' = 0$. If $k > 0$, the ready deque is not empty, so the processor pops the bottommost thread off the deque and commences work on it. Thus, we have $\Gamma'_0 = \Gamma_1$ (the popped thread) and $k' = k - 1$, and for $j = 1, 2, \ldots, k'$, we have $\Gamma'_j = \Gamma_{j+1}$. See Figure 5. Now, if $k' > 0$, we can check both properties. Property ➀: For $j = 1, 2, \ldots, k'$, thread $\Gamma'_j$ is the parent of $\Gamma'_{j-1}$, since for $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$. Property ➁: If $k' > 1$, then for $j = 1, 2, \ldots, k' - 1$, thread $\Gamma'_j$ has not been worked on since it spawned $\Gamma'_{j-1}$, because before the stall or death we have $k > 2$, which means that for $i = 2, 3, \ldots, k - 1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$.

[Figure 5: The ready deque of a processor before and after the thread $\Gamma_0$ that it is working on dies. (Note that the threads $\Gamma_0$ and $\Gamma'_0$ are not actually in the deque; they are the threads being worked on before and after the death.) (a) Before death. (b) After death.]

Rule ➍: If $\Gamma_0$ enables a stalled thread, then due to the fully strict condition, that previously stalled thread must be $\Gamma_0$'s parent. First, we observe that we must have $k = 0$.

we have $k > 0$, then the processor's ready deque is not empty, and this parent thread must be bottommost in the ready deque. Thus, this parent thread is ready and Rule ➍ does not apply. With $k = 0$, the ready deque is empty and the processor places the parent thread on the bottom of the ready deque. We have $\Gamma'_0 = \Gamma_0$ and $k' = k + 1 = 1$ with $\Gamma'_1$ denoting the newly enabled parent. We only have to check the first property. Property ➀: Thread $\Gamma'_1$ is obviously the parent of $\Gamma'_0$.

If some other processor steals a thread from processor $p$, then we must have $k > 0$, and after the steal we have $k' = k - 1$. If $k' > 0$ holds, then both properties are clearly preserved. All other actions by processor $p$, such as work stealing or executing an instruction that does not invoke any of the above rules, clearly preserve the lemma.

Before moving on, it is worth pointing out how it may happen that thread $\Gamma_k$ has been worked on since it spawned $\Gamma_{k-1}$, since Property ➁ excludes $\Gamma_k$. This situation arises when $\Gamma_k$ is stolen from processor $p$ and then stalls on its new processor. Later, $\Gamma_k$ is reenabled by $\Gamma_{k-1}$ and brought back to processor $p$'s ready deque. The key observation is that when $\Gamma_k$ is reenabled, processor $p$'s ready deque is empty and $p$ is working on $\Gamma_{k-1}$. The other threads $\Gamma_{k-2}, \Gamma_{k-3}, \ldots, \Gamma_0$ shown in Figure 3 were spawned after $\Gamma_k$ was reenabled.

We conclude this section by bounding the space used by the Work-Stealing Algorithm executing a fully strict computation.

Theorem 5. For any fully strict multithreaded computation with stack depth $S_1$, the Work-Stealing Algorithm run on a computer with $P$ processors uses at most $S_1 P$ space.

Proof: Like the Busy-Leaves Algorithm, the Work-Stealing Algorithm maintains the busy-leaves property: at every time step of the execution, every leaf in the current spawn subtree has a processor working on it. If we can establish this fact, then Lemma 2 completes the proof.

That the Work-Stealing Algorithm maintains the busy-leaves property is a simple consequence of Lemma 4. At every time step, every leaf in the current spawn subtree must be ready and therefore must either have a processor working on it or be in the ready deque of some processor. But Lemma 4 guarantees that no leaf thread sits in a processor's ready deque while the processor works on some other thread.

With the space bound in hand, we now turn attention to analyzing the time and communication bounds for the Work-Stealing Algorithm. Before we can proceed with this analysis, however, we must take care to define a model for coping with the contention that may arise when multiple thief processors simultaneously attempt to steal from the same victim.

5 Atomic accesses and the recycling game

This section presents the "atomic-access" model that we use to analyze contention during the execution of a multithreaded computation by the Work-Stealing Algorithm. We introduce a combinatorial "balls and bins" game, which we use to bound the total amount of delay incurred by random, asynchronous accesses in this model. We shall use the results of this section in Section 6, where we analyze the Work-Stealing Algorithm.

The atomic-access model is the machine model we use to analyze the Work-Stealing Algorithm. We assume that the machine is an asynchronous parallel computer with $P$

processors, and its memory can be either distributed or shared. Our analysis assumes that concurrent accesses to the same data structure are serially queued by an adversary, as in the atomic message-passing model of [36]. This assumption is more stringent than that in the model of Karp and Zhang [33]. They assume that if concurrent steal requests are made to a deque, in one time step, one request is satisfied and all the others are denied. In the atomic-access model, we also assume that one request is satisfied, but the others are queued by an adversary, rather than being denied. Moreover, from the collection of waiting requests for a given deque, the adversary gets to choose which is serviced and which continue to wait. The only constraint on the adversary is that if there is at least one request for a deque, then the adversary cannot choose that none be serviced.

The main result of this section is to show that if requests are made randomly by $P$ processors to $P$ deques with each processor allowed at most one outstanding request, then the total amount of time that the processors spend waiting for their requests to be satisfied is likely to be proportional to the total number $M$ of requests, no matter which processors make the requests and no matter how the requests are distributed over time. In order to prove this result, we introduce a "balls and bins" game that models the effects of queueing by the adversary.

The $(P, M)$-recycling game is a combinatorial game played by the adversary, in which balls are tossed at random into bins. The parameter $P$ is the number of balls in the game, which is equal to the number of bins. The parameter $M$ is the total number of ball tosses executed by the adversary. Initially, all $P$ balls are in a reservoir separate from the $P$ bins. At each step of the game, the adversary executes the following two operations in sequence:

1. The adversary chooses some of the balls in the reservoir (possibly all and possibly none), and then for each of these balls, the adversary removes it from the reservoir, selects one of the $P$ bins uniformly and independently at random, and tosses the ball into it.

2. The adversary inspects each of the $P$ bins in turn, and for each bin that contains at least one ball, the adversary removes any one of the balls in the bin and returns it to the reservoir.

The adversary is permitted to make a total of $M$ ball tosses. The game ends when $M$ ball tosses have been made and all balls have been removed from the bins and placed back in the reservoir.

The recycling game models the servicing of steal requests by the Work-Stealing Algorithm. We can view each ball and each bin as being owned by a distinct processor. If a ball is in the reservoir, it means that the ball's owner is not making a steal request. If a ball is in a bin, it means that the ball's owner has made a steal request to the deque of the bin's owner, but that the request has not yet been satisfied. When a ball is removed from a bin and returned to the reservoir, it means that the request has been serviced.

After each step $t$ of the game, there are some number $n_t$ of balls left in the bins, which correspond to steal requests that have not been satisfied. We shall be interested in the total delay $D = \sum_{t=1}^{T} n_t$, where $T$ is the total number of steps in the game. The goal of the adversary is to make the total delay as large as possible. The next lemma shows that despite

the choices that the adversary makes about which balls to toss into bins and which to return to the reservoir, the total delay is unlikely to be large.

Lemma 6. For any $\epsilon > 0$, with probability at least $1 - \epsilon$, the total delay in the $(P, M)$-recycling game is $O(M + P \lg P + P \lg(1/\epsilon))$.¹ The expected total delay is at most $M$. In other words, the total delay incurred by $M$ random requests made by $P$ processors in the atomic-access model is $O(M + P \lg P + P \lg(1/\epsilon))$ with probability at least $1 - \epsilon$, and the expected total delay is at most $M$.

¹ Greg Plaxton of the University of Texas, Austin has improved this bound to $O(M)$ for the case when $1/\epsilon$ is at most polynomial in $M$ and $P$ [40].

Proof: We first make the observation that the strategy by which the adversary chooses a ball from each bin is immaterial, and thus, we can assume that balls are queued in their bins in a first-in-first-out (FIFO) order. The adversary removes balls from the front of the queue, and when the adversary tosses a ball, it is placed on the back of the queue. If several balls are tossed into the same bin at the same step, they can be placed on the back of the queue in any order. The reason that assuming a FIFO discipline for queuing balls in a bin does not affect the total delay is that the number of balls in a given bin at a given step is the same no matter which ball is removed, and where balls are tossed has nothing to do with which ball is tossed.

For any given ball and any given step, the step either finishes with the ball in a bin or in the reservoir. Define the delay of ball $r$ to be the random variable $\delta_r$ denoting the total number of steps that finish with ball $r$ in a bin. Then, we have

$$D = \sum_{r=1}^{P} \delta_r . \tag{1}$$

Define the $i$th cycle of a ball to be those steps in which the ball remains in a bin from the $i$th time it is tossed until it is returned to the reservoir. Define also the $i$th delay of a ball to be the number of steps in its $i$th cycle.

We shall analyze the total delay by focusing, without loss of generality, on the delay $\delta = \delta_1$ of ball 1. If we let $m$ denote the number of times that ball 1 is tossed by the adversary, and for $i = 1, 2, \ldots, m$, let $d_i$ be the random variable denoting the $i$th delay of ball 1, then we have $\delta = \sum_{i=1}^{m} d_i$.

We say that the $i$th cycle of ball 1 is delayed by another ball $r$ if the $i$th toss of ball 1 places it in some bin $k$ and ball $r$ is removed from bin $k$ during the $i$th cycle of ball 1. Since the adversary follows the FIFO rule, it follows that the $i$th cycle of ball 1 can be delayed by another ball $r$ either once or not at all. Consequently, we can decompose each random variable $d_i$ into a sum $d_i = x_{i2} + x_{i3} + \cdots + x_{iP}$ of indicator random variables, where

$$x_{ir} = \begin{cases} 1 & \text{if the $i$th cycle of ball 1 is delayed by ball $r$;} \\ 0 & \text{otherwise.} \end{cases}$$

Thus, we have

$$\delta = \sum_{i=1}^{m} \sum_{r=2}^{P} x_{ir} . \tag{2}$$
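As an informal aside, the $(P, M)$-recycling game defined earlier in this section can be simulated directly. The Python sketch below is illustrative only and is not part of the paper's analysis. It assumes a stand-in for the adversary: each ball in the reservoir is tossed independently with probability `toss_prob`, and bins are served in FIFO order (which the proof argues loses no generality). The function name `play_recycling_game` and the parameters `toss_prob` and `seed` are hypothetical names introduced here.

```python
import random
from collections import deque


def play_recycling_game(P, M, toss_prob=1.0, seed=0):
    """Simulate one (P, M)-recycling game and return the total delay D.

    Assumption (not from the paper): the adversary is replaced by a simple
    randomized policy that tosses each reservoir ball with probability
    `toss_prob` until M tosses have been made. Bins are FIFO queues.
    """
    if P < 1 or M < 0 or not 0.0 < toss_prob <= 1.0:
        raise ValueError("need P >= 1, M >= 0, and 0 < toss_prob <= 1")
    rng = random.Random(seed)
    bins = [deque() for _ in range(P)]
    reservoir = list(range(P))  # initially all P balls are in the reservoir
    tosses = 0
    total_delay = 0
    # The game ends once M tosses have been made and every bin is empty.
    while tosses < M or any(bins):
        # Operation 1: toss some reservoir balls into uniformly random bins.
        kept = []
        for ball in reservoir:
            if tosses < M and rng.random() < toss_prob:
                bins[rng.randrange(P)].append(ball)
                tosses += 1
            else:
                kept.append(ball)
        reservoir = kept
        # Operation 2: every nonempty bin returns one ball to the reservoir.
        for queue in bins:
            if queue:
                reservoir.append(queue.popleft())
        # n_t is the number of balls left in the bins after this step.
        total_delay += sum(len(queue) for queue in bins)
    return total_delay
```

Averaging `play_recycling_game(P, M, seed=s)` over several seeds `s` gives an empirical estimate that can be compared with the bound $E[D] \le M$ stated in Lemma 6. Because this stand-in policy is far weaker than a true adversary, such an experiment can illustrate the bound but cannot validate it.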

We now prove an important property of these indicator random variables. Consider any set $S$ of pairs $(i, r)$, each of which corresponds to the event that the $i$th cycle of ball 1 is delayed by ball $r$. For any such set $S$, we claim that

$$\Pr\Bigl\{ \bigwedge_{(i,r) \in S} (x_{ir} = 1) \Bigr\} \le P^{-|S|} . \tag{3}$$

The crux of proving the claim is to show that

$$\Pr\Bigl\{ x_{ir} = 1 \Bigm| \bigwedge_{(i',r') \in S'} (x_{i'r'} = 1) \Bigr\} \le 1/P , \tag{4}$$

where $S' = S - \{(i, r)\}$, whence the claim (3) follows from Bayes's Theorem.

We can derive Inequality (4) from a careful analysis of dependencies. Because the adversary follows the FIFO rule, we have that $x_{ir} = 1$ only if, when the adversary executes the $i$th toss of ball 1, it falls into whatever bin contains ball $r$, if any. A priori, this event happens with probability either $1/P$ or $0$, and hence, with probability at most $1/P$. Conditioning on any collection of events relating which balls delay this or other cycles of ball 1 cannot increase this probability, as we now argue in two cases. In the first case, the indicator random variables $x_{i'r'}$, where $i' \ne i$, tell whether other cycles of ball 1 are delayed. This information tells nothing about where the $i$th toss of ball 1 goes. Therefore, these random variables are independent of $x_{ir}$, and thus, the probability $1/P$ upper bound is not affected. In the second case, the indicator random variables $x_{ir'}$ tell whether the $i$th toss of ball 1 goes to the bin containing ball $r'$, but this information tells us nothing about whether it goes to the bin containing ball $r$, because the indicator random variables tell us nothing to relate where ball $r$ and ball $r'$ are located. Moreover, no "collusion" among the indicator random variables provides any more information, and thus Inequality (4) holds.

Equation (2) shows that the delay $\delta$ encountered by ball 1 throughout all of its cycles can be expressed as a sum of $m(P-1)$ indicator random variables. In order for $\delta$ to equal or exceed a given value $\Delta$, there must be some set containing $\Delta$ of these indicator random variables, each of which must be 1. For any specific such set, Inequality (3) says that the probability is at most $P^{-\Delta}$ that all random variables in the set are 1. Since there are $\binom{m(P-1)}{\Delta} \le (emP/\Delta)^{\Delta}$ such sets, where $e$ is the base of the natural logarithm, we have

$$\Pr\{\delta \ge \Delta\} \le \left(\frac{emP}{\Delta}\right)^{\Delta} P^{-\Delta} = \left(\frac{em}{\Delta}\right)^{\Delta} \le \epsilon/P ,$$

whenever $\Delta \ge \max\{2em, \lg P + \lg(1/\epsilon)\}$.

Although our analysis was performed for ball 1, it applies to any other ball as well. Consequently, for any given ball $r$ which is tossed $m_r$ times, the probability that its delay $\delta_r$ exceeds $\max\{2em_r, \lg P + \lg(1/\epsilon)\}$ is at most $\epsilon/P$. By Boole's inequality and Equation (1),

it follows that with probability at least $1 - \epsilon$, the total delay $D$ is at most

$$D \le \sum_{r=1}^{P} \max\{2em_r, \lg P + \lg(1/\epsilon)\} = \Theta(M + P \lg P + P \lg(1/\epsilon)) ,$$

since $M = \sum_{r=1}^{P} m_r$.

The upper bound $E[D] \le M$ can be obtained as follows. Recall that each $\delta_r$ is the sum of $(P-1)m_r$ indicator random variables, each of which has expectation at most $1/P$. Therefore, by linearity of expectation, $E[\delta_r] \le m_r$. Using Equation (1) and again using linearity of expectation, we obtain $E[D] \le M$.

With this bound on the total delay incurred by $M$ random requests now in hand, we turn back to the Work-Stealing Algorithm.

6 Analysis of the work-stealing algorithm

In this section, we analyze the time and communication cost of executing a fully strict multithreaded computation with the Work-Stealing Algorithm. For any fully strict computation with work $T_1$ and critical-path length $T_\infty$, we show that the expected running time with $P$ processors, including scheduling overhead, is $T_1/P + O(T_\infty)$. Moreover, for any $\epsilon > 0$, the execution time on $P$ processors is $T_1/P + O(T_\infty + \lg P + \lg(1/\epsilon))$, with probability at least $1 - \epsilon$. We also show that the expected total communication during the execution of a fully strict computation is $O(P T_\infty (1 + n_d) S_{\max})$, where $n_d$ is the maximum number of join edges from a thread to its parent and $S_{\max}$ is the largest size of any activation frame.

Unlike in the Busy-Leaves Algorithm, the "ready pool" in the Work-Stealing Algorithm is distributed, and so there is no contention at a centralized data structure. Nevertheless, it is still possible for contention to arise when several thieves happen to descend on the same victim simultaneously. In this case, as we have indicated in the previous section, we make the conservative assumption that an adversary serially queues the work-stealing requests. We further assume that it takes unit time for a processor to respond to a work-stealing request. This assumption can be relaxed without materially affecting the results so that a work-stealing response takes any constant amount of time.

To analyze the running time of the Work-Stealing Algorithm executing a fully strict multithreaded computation with work $T_1$ and critical-path length $T_\infty$ on a computer with $P$ processors, we use an accounting argument. At each step of the algorithm, we collect $P$ dollars, one from each processor. At each step, each processor places its dollar in one of three buckets according to its actions at that step. If the processor executes an instruction at the step, then it places its dollar into the Work bucket. If the processor initiates a steal attempt at the step, then it places its dollar into the Steal bucket. And, if the processor merely waits for a queued steal request at the step, then it places its dollar into the Wait bucket. We shall derive the running-time bound by bounding the number of dollars in each bucket at the end of the execution, summing these three bounds, and then dividing by $P$.

We first bound the total number of dollars in the Work bucket.
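The accounting argument rests on one identity, written out here as a brief aside. The symbols are introduced only for this illustration and do not appear in the original text: $T$ is the number of steps in the execution, and $W$, $S$, and $Q$ are the final contents of the Work, Steal, and Wait buckets. Since exactly $P$ dollars are collected at every step and each dollar lands in exactly one bucket:

```latex
\[
  P \cdot T \;=\; W + S + Q
  \qquad\Longrightarrow\qquad
  T \;=\; \frac{W + S + Q}{P}.
\]
```

Any upper bounds on $W$, $S$, and $Q$ therefore translate into an upper bound on the running time after dividing their sum by $P$.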

Lemma 7. The execution of a fully strict multithreaded computation with work $T_1$ by the Work-Stealing Algorithm on a computer with $P$ processors terminates with exactly $T_1$ dollars in the Work bucket.

Proof: A processor places a dollar in the Work bucket only when it executes an instruction. Thus, since there are $T_1$ instructions in the computation, the execution ends with exactly $T_1$ dollars in the Work bucket.

Bounding the total dollars in the Steal bucket requires a significantly more involved "delay-sequence" argument. We first introduce the notion of a "round" of work-steal attempts, and we must also define an augmented dag that we then use to define "critical" instructions. The idea is as follows. If, during the course of the execution, a large number of steals are attempted, then we can identify a sequence of instructions (the delay sequence) in the augmented dag such that each of these steal attempts was initiated while some instruction from the sequence was critical. We then show that a critical instruction is unlikely to remain critical across a modest number of steal attempts. We can then conclude that such a delay sequence is unlikely to occur, and therefore, an execution is unlikely to suffer a large number of steal attempts.

A round of steal attempts is a set of at least $3P$ but fewer than $4P$ consecutive steal attempts such that if a steal attempt that is initiated at time step $t$ occurs in a particular round, then all other steal attempts initiated at time step $t$ are also in the same round. We can partition all of the steal attempts that occur during an execution into rounds as follows. The first round contains all steal attempts initiated at time steps $1, 2, \ldots, t_1$, where $t_1$ is the earliest time such that at least $3P$ steal attempts were initiated at or before $t_1$. We say that the first round starts at time step 1 and ends at time step $t_1$. In general, if the $i$th round ends at time step $t_i$, then the $(i+1)$st round begins at time step $t_i + 1$ and ends at the earliest time step $t_{i+1} > t_i + 1$ such that at least $3P$ steal attempts were initiated at time steps between $t_i + 1$ and $t_{i+1}$, inclusive. These steal attempts belong to round $i + 1$. By definition, each round contains at least $3P$ consecutive steal attempts. Moreover, since at most $P - 1$ steal attempts can be initiated in a single time step, each round contains fewer than $4P - 1$ steal attempts, and each round takes at least 4 steps.

The sequence of instructions that make up the delay sequence is defined with respect to an augmented dag obtained by modifying the original dag slightly. Let $G$ denote the original dag, that is, the dag consisting of the computation's instructions as vertices and its continue, spawn, and join edges as edges. The augmented dag $G'$ is the original dag $G$ together with some new edges, as follows. For every set of instructions $u$, $v$, and $w$ such that $(u, v)$ is a spawn edge and $(u, w)$ is a continue edge, the deque edge $(w, v)$ is placed in $G'$. These deque edges are shown dashed in Figure 3. In Section 2 we made the technical assumption that instruction $w$ has no incoming join edges, and so $G'$ is a dag. If $T_\infty$ is the length of a longest path in $G$, then the longest path in $G'$ has length at most $2T_\infty$. It is worth pointing out that $G'$ is only an analytical tool. The deque edges have no effect on the scheduling and execution of the computation by the Work-Stealing Algorithm.

The deque edges are the key to defining critical instructions. At any time step during the execution, we say that an unexecuted instruction $v$ is critical if every instruction that precedes $v$ (either directly or indirectly) in $G'$ has been executed, that is, if for every instruction $w$ such that there is a directed path from $w$ to $v$ in $G'$, instruction $w$ has been

executed. A critical instruction must be ready, since $G'$ contains every edge of $G$, but a ready instruction may or may not be critical. Intuitively, the structural properties of a ready deque enumerated in Lemma 4 guarantee that if a thread is deep in a ready deque, then its current instruction cannot be critical, because the predecessor of the thread's current instruction across the deque edge has not yet been executed.

We now formalize our definition of a delay sequence.

Definition 8. A delay sequence is a 3-tuple $(U, R, \Pi)$ satisfying the following conditions:

• $U = (u_1, u_2, \ldots, u_L)$ is a maximal directed path in $G'$. Specifically, for $i = 1, 2, \ldots, L-1$, the edge $(u_i, u_{i+1})$ belongs to $G'$, instruction $u_1$ has no incoming edges in $G'$ (instruction $u_1$ must be the first instruction of the root thread), and instruction $u_L$ has no outgoing edges in $G'$ (instruction $u_L$ must be the last instruction of the root thread).

• $R$ is a positive integer number of steal-attempt rounds.

• $\Pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$ is a partition of $R$ (that is, $R = \sum_{i=1}^{L} (\pi_i + \pi'_i)$), such that $\pi'_i \in \{0, 1\}$ for each $i = 1, 2, \ldots, L$.

The partition $\Pi$ induces a partition of a sequence of $R$ rounds as follows. The first piece of the partition corresponds to the first $\pi_1$ rounds. The second piece corresponds to the next $\pi'_1$ consecutive rounds after the first $\pi_1$ rounds. The third piece corresponds to the next $\pi_2$ consecutive rounds after the first $(\pi_1 + \pi'_1)$ rounds, and so on. We are interested primarily in the pieces corresponding to the $\pi_i$, not the $\pi'_i$, and so we define the $i$th group of rounds to be the $\pi_i$ consecutive rounds starting after the $r_i$th round, where $r_i = \sum_{j=1}^{i-1} (\pi_j + \pi'_j)$. Because $\Pi$ is a partition of $R$ and $\pi'_i \in \{0, 1\}$, for $i = 1, 2, \ldots, L$, we have

$$\sum_{i=1}^{L} \pi_i \ge R - L . \tag{5}$$

We say that a given round of steal attempts occurs while instruction $v$ is critical if all of the steal attempts that comprise the round are initiated at time steps when $v$ is critical. In other words, $v$ must be critical throughout the entire round. A delay sequence $(U, R, \Pi)$ is said to occur during an execution if for each $i = 1, 2, \ldots, L$, all $\pi_i$ rounds in the $i$th group occur while instruction $u_i$ is critical. In other words, $u_i$ must be critical throughout all $\pi_i$ rounds.

The following lemma states that if at least $R$ rounds take place during an execution, then some delay sequence $(U, R, \Pi)$ must occur. In particular, if we look at any execution in which at least $R$ rounds occur, then we can identify a path $U = (u_1, u_2, \ldots, u_L)$ in the dag $G'$ and a partition $\Pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$ of the first $R$ rounds, such that for each $i = 1, 2, \ldots, L$, all of the $\pi_i$ rounds in the $i$th group occur while $u_i$ is critical. Each $\pi'_i$ indicates whether $u_i$ is critical at the beginning of a round but gets executed before the round ends. Such a round cannot be part of any group, because no instruction is critical throughout.

Lemma 9. Consider the execution of a fully strict multithreaded computation with critical-path length $T_\infty$ by the Work-Stealing Algorithm on a computer with $P$ processors. If at least $4PR$ steal attempts occur during the execution, then some delay sequence $(U, R, \Pi)$ must occur.

Proof: For a given execution in which at least $4PR$ steal attempts take place, we construct a delay sequence $(U, R, \Pi)$ and show that it occurs. With at least $4PR$ steal attempts, there must be at least $R$ rounds. We construct the delay sequence by first identifying a set of instructions on a directed path in $G'$ such that for every time step during the execution, one of these instructions is critical. Then, we partition the first $R$ rounds according to when each round occurs relative to when each instruction on the path is critical.

To construct the path $U$, we work backwards from the last instruction of the root thread, which we denote by $v_1$. Let $v_{l_1}$ denote a (not necessarily immediate) predecessor instruction of $v_1$ in $G'$ with the latest execution time. Let $(v_{l_1}, \ldots, v_2, v_1)$ denote a directed path from $v_{l_1}$ to $v_1$ in $G'$. We extend this path back to the first instruction of the root thread by iterating this construction as follows. At the $i$th iteration we have an instruction $v_{l_i}$ and a directed path in $G'$ from $v_{l_i}$ to $v_1$. We let $v_{l_{i+1}}$ denote a predecessor of $v_{l_i}$ in $G'$ with the latest execution time, and let $(v_{l_{i+1}}, \ldots, v_{l_i + 1}, v_{l_i})$ denote a directed path from $v_{l_{i+1}}$ to $v_{l_i}$ in $G'$. We finish iterating the construction when we get to an iteration $k$ in which $v_{l_k}$ is the first instruction of the root thread. Our desired sequence is then $U = (u_1, u_2, \ldots, u_L)$, where $L = l_k$ and $u_i = v_{L-i+1}$ for $i = 1, 2, \ldots, L$. One can verify that at every time step of the execution, one of the $v_{l_i}$ is critical.

Now, to construct the partition $\Pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$, we partition the sequence of the first $R$ rounds according to when each round occurs. We would like our partition to be such that for each round (among the first $R$ rounds), we have the property that if the round occurs while some instruction $u_i$ is critical, then the round belongs to the $i$th group. Start with $\pi_1$, and let $\pi_1$ equal the number of rounds that occur while $u_1$ is critical. All of these rounds are consecutive at the beginning of the sequence, so these rounds comprise the 1st group (that is, they are the $\pi_1$ consecutive rounds starting after the $r_1 = 0$ first rounds). Next, if the round that immediately follows those first $\pi_1$ rounds begins after $u_1$ has been executed, then we set $\pi'_1 = 0$, and we go on to $\pi_2$. Otherwise, that round begins while $u_1$ is critical and ends after $u_1$ is executed (for otherwise, it would be part of the first group), so we set $\pi'_1 = 1$, and we go on to $\pi_2$. For $\pi_2$, we let $\pi_2$ equal the number of rounds that occur while $u_2$ is critical. Note that all of these rounds are consecutive beginning after the first $r_2 = \pi_1 + \pi'_1$ rounds, so these rounds comprise the 2nd group. We continue in this fashion, letting each $\pi_i$ be the number of rounds that occur while $u_i$ is critical and letting each $\pi'_i$ be the number of rounds that begin while $u_i$ is critical but do not end until after $u_i$ is executed. As an example, we may have a round that begins while $u_i$ is critical and then ends while $u_{i+2}$ is critical, and in this case, we set $\pi'_i = 1$ and $\pi'_{i+1} = 0$. In this example, the $(i+1)$st group is empty, so we also set $\pi_{i+1} = 0$.

We conclude the proof by verifying that the $(U, R, \Pi)$ as just constructed is a delay sequence and that it occurs. By construction, $U$ is a maximal path in $G'$. Now considering $\Pi$, we observe that each round among the first $R$ rounds is counted exactly once in either a $\pi_i$ or a $\pi'_i$, so $\Pi$ is indeed a partition of $R$. Moreover, for $i = 1, 2, \ldots, L$, at most one round can begin while the instruction $u_i$ is critical and end after $u_i$ is executed, so we have $\pi'_i \in \{0, 1\}$. Thus, $(U, R, \Pi)$ is a delay sequence. Finally, we observe that, by construction, for $i = 1, 2, \ldots, L$, the $\pi_i$ rounds that comprise the $i$th group all occur while the instruction $u_i$ is critical. Therefore, the delay sequence $(U, R, \Pi)$ occurs.

We now establish that a critical instruction is unlikely to remain critical across a modest number of rounds. Specifically, we first show that a critical instruction must be the ready