Scheduling Multithreaded Computations by Work Stealing

Robert D. Blumofe
The University of Texas at Austin

Charles E. Leiserson
MIT Laboratory for Computer Science

This research was supported in part by the Advanced Research Projects Agency under Contract N00014-94-1-0985. This research was done while Robert D. Blumofe was at the MIT Laboratory for Computer Science and was supported in part by an ARPA High-Performance Computing Graduate Fellowship.

Abstract

This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies.

Specifically, our analysis shows that the expected time to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.

1 Introduction

For efficient execution of a dynamically growing "multithreaded" computation on a MIMD-style parallel computer, a scheduling algorithm must ensure that enough threads are active concurrently to keep the processors busy. Simultaneously, it should ensure that the number of concurrently active threads remains within reasonable limits so that memory requirements are not unduly large. Moreover, the scheduler should also try to maintain related threads
on the same processor, if possible, so that communication between them can be minimized. Needless to say, achieving all these goals simultaneously can be difficult.

Two scheduling paradigms have arisen to address the problem of scheduling multithreaded computations: work sharing and work stealing. In work sharing, whenever a processor generates new threads, the scheduler attempts to migrate some of them to other processors in hopes of distributing the work to underutilized processors. In work stealing, however, underutilized processors take the initiative: they attempt to "steal" threads from other processors. Intuitively, the migration of threads occurs less frequently with work stealing than with work sharing, since when all processors have work to do, no threads are migrated by a work-stealing scheduler, but threads are always migrated by a work-sharing scheduler.

The work-stealing idea dates back at least as far as Burton and Sleep's research on parallel execution of functional programs [16] and Halstead's implementation of Multilisp [30]. These authors point out the heuristic benefits of work stealing with regards to space and communication. Since then, many researchers have implemented variants on this strategy [11, 21, 23, 29, 34, 37, 46]. Rudolph, Slivkin-Allalouf, and Upfal [43] analyzed a randomized work-stealing strategy for load balancing independent jobs on a parallel computer, and Karp and Zhang [33] analyzed a randomized work-stealing strategy for parallel backtrack search. Recently, Zhang and Ortynski [48] have obtained good bounds on the communication requirements of this algorithm.

In this paper, we present and analyze a work-stealing algorithm for scheduling "fully strict" (well-structured) multithreaded computations. This class of computations encompasses both backtrack search computations [33, 48] and divide-and-conquer computations [47], as well as dataflow computations [2] in which threads may stall due to a data dependency. We analyze our algorithms in a stringent atomic-access model similar to the atomic message-passing model of [36] in which concurrent accesses to the same data structure are serially queued by an adversary.
Our main contribution is a randomized work-stealing scheduling algorithm for fully strict multithreaded computations which is provably efficient in terms of time, space, and communication. We prove that the expected time to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. In addition, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. These bounds are better than previous bounds for work-sharing schedulers [10], and the work-stealing scheduler is much simpler and eminently practical. Part of this improvement is due to our focusing on fully strict computations, as compared to the (general) strict computations studied in [10]. We also prove that the expected total communication of the execution is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This bound is existentially tight to within a constant factor, meeting the lower bound of Wu and Kung [47] for communication in parallel divide-and-conquer. In contrast, work-sharing schedulers have nearly worst-case behavior for communication. Thus, our results bolster the folk wisdom that work stealing is superior to work sharing.

Others have studied and continue to study the problem of efficiently managing the space requirements of parallel computations. Culler and Arvind [19] and Ruggiero and Sargeant
[44] give heuristics for limiting the space required by dataflow programs. Burton [14] shows how to limit space in certain parallel computations without causing deadlock. More recently, Burton [15] has developed and analyzed a scheduling algorithm with provably good time and space bounds. Blelloch, Gibbons, Matias, and Narlikar [3, 4] have also recently developed and analyzed scheduling algorithms with provably good time and space bounds. It is not yet clear whether any of these algorithms are as practical as work stealing.

The remainder of this paper is organized as follows. In Section 2 we review the graph-theoretic model of multithreaded computations introduced in [10], which provides a theoretical basis for analyzing schedulers. Section 3 gives a simple scheduling algorithm which uses a central queue. This "busy-leaves" algorithm forms the basis for our randomized work-stealing algorithm, which we present in Section 4. In Section 5 we introduce the atomic-access model that we use to analyze execution time and communication costs for the work-stealing algorithm, and we present and analyze a combinatorial "balls and bins" game that we use to derive a bound on the contention that arises in random work stealing. We then use this bound along with a delay-sequence argument [41] in Section 6 to analyze the execution time and communication cost of the work-stealing algorithm. To conclude, in Section 7 we briefly discuss how the theoretical ideas in this paper have been applied to the Cilk programming language and runtime system [8, 25], as well as make some concluding remarks.

2 A model of multithreaded computation

This section reprises the graph-theoretic model of multithreaded computation introduced in [10]. We also define what it means for computations to be "fully strict." We conclude with a statement of the greedy-scheduling theorem, which is an adaptation of theorems by Brent [13] and Graham [27, 28] on dag scheduling.
A multithreaded computation is composed of a set of threads, each of which is a sequential ordering of unit-time instructions. The instructions are connected by dependency edges, which provide a partial ordering on which instructions must execute before which other instructions. In Figure 1, for example, each shaded block is a thread with circles representing instructions and the horizontal edges, called continue edges, representing the sequential ordering. Thread $\Gamma_5$ of this example contains 3 instructions: $v_{10}$, $v_{11}$, and $v_{12}$. The instructions of a thread must execute in this sequential order from the first (leftmost) instruction to the last (rightmost) instruction. In order to execute a thread, we allocate for it a chunk of memory, called an activation frame, that the instructions of the thread can use to store the values on which they compute.

A $P$-processor execution schedule for a multithreaded computation determines which processors of a $P$-processor parallel computer execute which instructions at each step. An execution schedule depends on the particular multithreaded computation and the number $P$ of processors. In any given step of an execution schedule, each processor executes at most one instruction.

During the course of its execution, a thread may create, or spawn, other threads. Spawning a thread is like a subroutine call, except that the spawning thread can operate concurrently with the spawned thread. We consider spawned threads to be children of the thread that did the spawning, and a thread may spawn as many children as it desires. In this way,
threads are organized into a spawn tree as indicated in Figure 1 by the downward-pointing, shaded dependency edges, called spawn edges, that connect threads to their spawned children. The spawn tree is the parallel analog of a call tree. In our example computation, the spawn tree's root thread $\Gamma_1$ has two children, $\Gamma_2$ and $\Gamma_6$, and thread $\Gamma_2$ has three children, $\Gamma_3$, $\Gamma_4$, and $\Gamma_5$. Threads $\Gamma_3$, $\Gamma_4$, $\Gamma_5$, and $\Gamma_6$, which have no children, are leaf threads.

Figure 1: A multithreaded computation. This computation contains 23 instructions $v_1, v_2, \ldots, v_{23}$ and 6 threads $\Gamma_1, \Gamma_2, \ldots, \Gamma_6$.

Each spawn edge goes from a specific instruction (the instruction that actually does the spawn operation) in the parent thread to the first instruction of the child thread. An execution schedule must obey this edge in that no processor may execute an instruction in a spawned child thread until after the spawning instruction in the parent thread has been executed. In our example computation (Figure 1), due to the spawn edge $(v_6, v_7)$, instruction $v_7$ cannot be executed until after the spawning instruction $v_6$. Consistent with our unit-time model of instructions, a single instruction may spawn at most one child. When the spawning instruction executes, it allocates an activation frame for the new child thread. Once a thread has been spawned and its frame has been allocated, we say the thread is alive or living. When the last instruction of a thread executes, it deallocates its frame and the thread dies.

An execution schedule generally respects other dependencies besides those represented by continue and spawn edges. Consider an instruction that produces a data value to be
consumed by another instruction. Such a producer/consumer relationship precludes the consuming instruction from executing until after the producing instruction. To enforce such orderings, other dependency edges, called join edges, may be required, as shown in Figure 1 by the curved edges. If the execution of a thread arrives at a consuming instruction before the producing instruction has executed, execution of the consuming thread cannot continue: the thread stalls. Once the producing instruction executes, the join dependency is resolved, which enables the consuming thread to resume its execution: the thread becomes ready. A multithreaded computation does not model the means by which join dependencies get resolved or by which unresolved join dependencies get detected. In implementation, resolution and detection can be accomplished using mechanisms such as join counters [8], futures [30], or I-structures [2].

We make two technical assumptions regarding join edges. We first assume that each instruction has at most a constant number of join edges incident on it. This assumption
is consistent with our unit-time model of instructions. The second assumption is that no join edges enter the instruction immediately following a spawn. This assumption means that when a parent thread spawns a child thread, the parent cannot immediately stall. It continues to be ready to execute for at least one more instruction.

An execution schedule must obey the constraints given by the spawn, continue, and join edges of the computation. These dependency edges form a directed graph of instructions, and no processor may execute an instruction until after all of the instruction's predecessors in this graph have been executed. So that execution schedules exist, this graph must be acyclic. That is, it must be a directed acyclic graph, or dag. At any given step of an execution schedule, an instruction is ready if all of its predecessors in the dag have been executed.

We make the simplifying assumption that a parent thread remains alive until all its children die, and thus, a thread does not deallocate its activation frame until all its children's frames have been deallocated. Although this assumption is not absolutely necessary, it gives the execution a natural structure, and it will simplify our analyses of space utilization. In accounting for space utilization, we also assume that the frames hold all the values used by the computation; there is no global storage available to the computation outside the frames (or if such storage is available, then we do not account for it). Therefore, the space used at a given time in executing a computation is the total size of all frames used by all living threads at that time, and the total space used in executing a computation is the maximum such value over the course of the execution.

To summarize, a multithreaded computation can be viewed as a dag of instructions connected by dependency edges. The instructions are connected by continue edges into threads, and the threads form a spawn tree with the spawn edges. When a thread is spawned, an activation frame is allocated and this frame remains allocated as long as the thread remains alive. A living thread may be either ready or stalled due to an unresolved dependency.
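The dag view summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-thread computation, the instruction names `a1` through `b2`, and the helpers `build_predecessors` and `ready` are hypothetical and do not come from the paper. Thread A spawns thread B at `a1`, and B's last instruction joins into A's last instruction, so the join edge goes to the parent and does not enter the instruction immediately following the spawn.

```python
# Hypothetical toy computation (not the paper's Figure 1).
# Thread A: a1 -> a2 -> a3.  Thread B: b1 -> b2.
continue_edges = [("a1", "a2"), ("a2", "a3"), ("b1", "b2")]
spawn_edges = [("a1", "b1")]   # a1 spawns thread B
join_edges = [("b2", "a3")]    # a3 consumes a value produced by b2


def build_predecessors(*edge_lists):
    """Map each instruction to the set of instructions that must precede it."""
    preds = {}
    for edges in edge_lists:
        for u, v in edges:
            preds.setdefault(u, set())
            preds.setdefault(v, set()).add(u)
    return preds


def ready(preds, executed):
    """Instructions not yet executed whose predecessors have all executed."""
    return {v for v, ps in preds.items() if v not in executed and ps <= executed}


preds = build_predecessors(continue_edges, spawn_edges, join_edges)
```

With nothing executed, only `a1` is ready. After `a1`, `a2`, and `b1` have executed, thread A is stalled at `a3` until `b2` runs.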
A given multithreaded program when run on a given input can sometimes generate more than one multithreaded computation. In that case, we say the program is nondeterministic. If the same multithreaded computation is generated by the program on the input no matter how the computation is scheduled, then the program is deterministic. In this paper, we shall analyze multithreaded computations, not multithreaded programs. Specifically, we shall not worry about how the multithreaded computation is generated. Instead, we shall study its properties in an a posteriori fashion.

Because multithreaded computations with arbitrary dependencies can be impossible to schedule efficiently [10], we study subclasses of general multithreaded computations in which the kinds of synchronizations that can occur are restricted. A strict multithreaded computation is one in which all join edges from a thread go to an ancestor of the thread in the activation tree. In a strict computation, the only edge into a subtree (emanating from outside the subtree) is the spawn edge that spawns the subtree's root thread. For example, the computation of Figure 1 is strict, and the only edge into the subtree rooted at $\Gamma_2$ is the spawn edge $(v_2, v_3)$. Thus, strictness means that a thread cannot be invoked before all of its arguments are available, although the arguments can be garnered in parallel. A fully strict computation is one in which all join edges from a thread go to the thread's parent. A fully strict computation is, in a sense, a "well-structured" computation, in that all join edges from a subtree (of the spawn tree) emanate from the subtree's root. The example computation of Figure 1 is fully strict. Any multithreaded computation that can be executed in a depth-first manner on a single processor can be made either strict or fully strict by altering the dependency structure, possibly affecting the achievable parallelism, but not affecting the semantics of the computation [5].

We quantify and bound the execution time of a computation on a $P$-processor parallel computer in terms of the computation's "work" and "critical-path length." We define the work of the computation to be the total number of instructions and the critical-path length to be the length of a longest directed path in the dag. Our example computation (Figure 1) has work 23 and critical-path length 10. For a given computation, let $T(X)$ denote the time to execute the computation using $P$-processor execution schedule $X$, and let

$$T_P = \min_X T(X)$$

denote the minimum execution time with $P$ processors, the minimum being taken over all $P$-processor execution schedules for the computation. Then $T_1$ is the work of the computation, since a 1-processor computer can only execute one instruction at each step, and $T_\infty$ is the critical-path length, since even with arbitrarily many processors, each instruction on a path must execute serially. Notice that we must have $T_P \ge T_1/P$, because $P$ processors can execute only $P$ instructions per time step, and of course, we must have $T_P \ge T_\infty$.

Early work on dag scheduling by Brent [13] and Graham [27, 28] shows that there exist $P$-processor execution schedules $X$ with $T(X) \le T_1/P + T_\infty$. As the sum of two lower bounds, this upper bound is universally optimal to within a factor of 2. The following theorem, proved in [10, 20], extends these results minimally to show that this upper bound on $T_P$ can be obtained by greedy schedules: those in which at each step of the execution, if at least $P$ instructions are ready, then $P$ instructions execute, and if fewer than $P$ instructions are ready, then all execute.

Theorem 1 (The greedy-scheduling theorem) For any multithreaded computation with work $T_1$ and critical-path length $T_\infty$, and for any number $P$ of processors, any greedy $P$-processor execution schedule $X$ achieves $T(X) \le T_1/P + T_\infty$.

Generally, we are interested in schedules that achieve linear speedup, that is $T(X) = O(T_1/P)$. For a greedy schedule, linear speedup occurs when the parallelism, which we define to be $T_1/T_\infty$, satisfies $T_1/T_\infty = \Omega(P)$.
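As a small numerical check of the greedy-scheduling bound, the following Python sketch computes the work and critical-path length of a dag and runs one greedy schedule on it. The dag, the function names, and the tie-breaking rule (sorting ready instructions by name) are assumptions made for illustration; path length is counted in instructions, matching the reading of $T_\infty$ as an execution time.

```python
# Hypothetical dag: each instruction maps to the set of its predecessors.
preds = {
    "a1": set(), "a2": {"a1"}, "b1": {"a1"},
    "b2": {"b1"}, "a3": {"a2", "b2"},
}


def work_and_critical_path(preds):
    """Return (T_1, T_inf), with path length counted in instructions."""
    longest = {}

    def length_ending_at(v):
        if v not in longest:
            longest[v] = 1 + max((length_ending_at(u) for u in preds[v]), default=0)
        return longest[v]

    return len(preds), max(length_ending_at(v) for v in preds)


def greedy_schedule_length(preds, P):
    """Run one greedy P-processor schedule and return its number of steps."""
    executed, steps = set(), 0
    while len(executed) < len(preds):
        ready = sorted(v for v in preds if v not in executed and preds[v] <= executed)
        executed.update(ready[:P])   # P ready instructions, or all if fewer
        steps += 1
    return steps


T1, Tinf = work_and_critical_path(preds)
for P in (1, 2, 3):
    T = greedy_schedule_length(preds, P)
    # Lower bounds max(T1/P, Tinf) and the upper bound of Theorem 1.
    assert max(T1 / P, Tinf) <= T <= T1 / P + Tinf
```

For this dag the parallelism $T_1/T_\infty$ is only 5/4, so adding processors beyond the first yields little speedup, which is what the linear-speedup condition predicts.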
To quantify the space used by a given execution schedule of a computation, we define the stack depth of a thread to be the sum of the sizes of the activation frames of all its ancestors, including itself. The stack depth of a multithreaded computation is the maximum stack depth of any of its threads. We shall denote by $S_1$ the minimum amount of space possible for any 1-processor execution of a multithreaded computation, which is equal to the stack depth of the computation. Let $S(X)$ denote the space used by a $P$-processor execution schedule $X$ of a multithreaded computation. We shall be interested in those execution schedules that exhibit at most linear expansion of space, that is, $S(X) = O(S_1 P)$, which is existentially optimal to within a constant factor [10].
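The stack-depth definition translates directly into code. The sketch below uses the spawn-tree shape described for Figure 1 (thread 1 is the root with children 2 and 6; thread 2 has children 3, 4, and 5). The frame sizes and function names are hypothetical, since the paper does not give frame sizes for the example.

```python
# Spawn tree of the example computation (thread -> parent).
parent = {1: None, 2: 1, 6: 1, 3: 2, 4: 2, 5: 2}
# Hypothetical activation-frame sizes, chosen only for illustration.
frame_size = {1: 4, 2: 3, 3: 2, 4: 5, 5: 1, 6: 6}


def thread_stack_depth(thread, parent, frame_size):
    """Sum of frame sizes of the thread and all of its ancestors."""
    total = 0
    while thread is not None:
        total += frame_size[thread]
        thread = parent[thread]
    return total


def stack_depth(parent, frame_size):
    """S_1: the maximum stack depth over all threads."""
    return max(thread_stack_depth(t, parent, frame_size) for t in parent)


S1 = stack_depth(parent, frame_size)
```

With these sizes the deepest thread is thread 4 (frames of 4, 2, and 1), so a schedule with linear expansion of space on $P$ processors would use $O(S_1 P)$ space with $S_1$ determined by that path.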
3 The busy-leaves property

Once a thread $\Gamma$ has been spawned in a strict computation, a single processor can complete the execution of the entire subcomputation rooted at $\Gamma$ even if no other progress is made on other parts of the computation. In other words, from the time the thread $\Gamma$ is spawned until the time $\Gamma$ dies, there is always at least one thread from the subcomputation rooted at $\Gamma$ that is ready. In particular, no leaf thread in a strict multithreaded computation can stall. As we shall see, this property allows an execution schedule to keep the leaves "busy." By combining this "busy-leaves" property with the greedy property, we derive execution schedules that simultaneously exhibit linear speedup and linear expansion of space.

In this section, we show that for any number $P$ of processors and any strict multithreaded computation with work $T_1$, critical-path length $T_\infty$, and stack depth $S_1$, there exists a $P$-processor execution schedule $X$ that achieves time $T(X) \le T_1/P + T_\infty$ and space $S(X) \le S_1 P$ simultaneously. We give a simple online $P$-processor parallel algorithm, the Busy-Leaves Algorithm, to compute such a schedule. This simple algorithm will form the basis for the randomized work-stealing algorithm presented in Section 4.

The Busy-Leaves Algorithm operates online in the following sense. Before the $t$th step, the algorithm has computed and executed the first $t - 1$ steps of the execution schedule. At the $t$th step, the algorithm uses only information from the portion of the computation revealed so far in the execution to compute and execute the $t$th step of the schedule. In particular, it does not use any information from instructions not yet executed or threads not yet spawned.

The Busy-Leaves Algorithm maintains all living threads in a single thread pool which is uniformly available to all $P$ processors. When spawns occur, new threads are added to this global pool, and when a processor needs work, it removes a ready thread from the pool. Though we describe the algorithm as a $P$-processor parallel algorithm, we shall not analyze it as such. Specifically, in computing the $t$th step of the schedule, we allow each processor to add threads to the thread pool and delete threads from it. Thus, we ignore the effects of processors contending for access to the pool. In fact, we shall only analyze properties of the schedule itself and ignore the cost incurred by the algorithm in computing the schedule. (Scheduling overheads will be analyzed for the randomized work-stealing algorithm, however.)

The Busy-Leaves Algorithm operates as follows. The algorithm begins with the root thread in the global thread pool and all processors idle. At the beginning of each step, each processor either is idle or has a thread to work on. Those processors that are idle begin the step by attempting to remove any ready thread from the pool. If there are sufficiently many ready threads in the pool to satisfy all of the idle processors, then every idle processor gets a ready thread to work on. Otherwise, some processors remain idle. Then, each processor that has a thread to work on executes the next instruction from that thread. In general, once a processor has a thread, call it $\Gamma_a$, to work on, it executes an instruction from $\Gamma_a$ at each step until the thread either spawns, stalls, or dies, in which case, it performs according to the following rules.

➊ Spawns: If the thread $\Gamma_a$ spawns a child $\Gamma_b$, then the processor finishes the current step by returning $\Gamma_a$ to the thread pool. The processor begins the next step working on $\Gamma_b$.
➋ Stalls: If the thread $\Gamma_a$ stalls, then the processor finishes the current step by returning $\Gamma_a$ to the thread pool. The processor begins the next step idle.

➌ Dies: If the thread $\Gamma_a$ dies, then the processor finishes the current step by checking to see if $\Gamma_a$'s parent thread $\Gamma_b$ currently has any living children. If $\Gamma_b$ has no live children and no other processor is working on $\Gamma_b$, then the processor takes $\Gamma_b$ from the pool and begins the next step working on $\Gamma_b$. Otherwise, the processor begins the next step idle.

[Schedule table with columns: step, thread pool, processor activity ($p_1$, $p_2$).]

Figure 2: A 2-processor execution schedule computed by the Busy-Leaves Algorithm for the computation of Figure 1. This schedule lists the living threads in the global thread pool at each step just after each idle processor has removed a ready thread. It also lists the ready thread being worked on and the instruction executed by each of the 2 processors, $p_1$ and $p_2$, at each step. Living threads that are ready are listed in bold. The other living threads are stalled.

Figure 2 illustrates these three rules in a 2-processor execution schedule computed by the Busy-Leaves Algorithm on the computation of Figure 1. Rule ➊: At step 2, processor $p_1$ working on thread $\Gamma_1$ executes $v_2$ which spawns the child $\Gamma_2$, so $p_1$ places $\Gamma_1$ back in the pool (to be picked up at the beginning of the next step by the idle $p_2$) and begins the next step working on $\Gamma_2$. Rule ➋: At step 8, processor $p_2$ working on thread $\Gamma_1$ executes $v_{21}$ and $\Gamma_1$ stalls, so $p_2$ returns $\Gamma_1$ to the pool and begins the next step idle (and remains idle since the thread pool contains no ready threads). Rule ➌: At step 13, processor $p_1$ working on $\Gamma_2$ executes $v_{15}$ and $\Gamma_2$ dies, so $p_1$ retrieves the parent $\Gamma_1$ from the pool and begins the next step working on $\Gamma_1$.

Besides being greedy, for any strict computation, the schedule computed by the Busy-Leaves Algorithm maintains the busy-leaves property: at every time step during the execution, every leaf in the "spawn subtree" has a processor working on it. We define the spawn subtree at any time step $t$ to be the portion of the spawn tree consisting of just
those threads that are alive at step $t$. To restate the busy-leaves property, at every time step, every living thread that has no living descendants has a processor working on it. We shall now prove this fact and show that it implies linear expansion of space. It is worth noting that not every multithreaded computation has a schedule that maintains the busy-leaves property, but every strict multithreaded computation does. We begin by showing that any schedule that maintains the busy-leaves property exhibits linear expansion of space.

Lemma 2 For any multithreaded computation with stack depth $S_1$, any $P$-processor execution schedule $X$ that maintains the busy-leaves property uses space bounded by $S(X) \le S_1 P$.

Proof: The busy-leaves property implies that at all time steps $t$, the spawn subtree has at most $P$ leaves. For each such leaf, the space used by it and all of its ancestors is at most $S_1$, and therefore, the space in use at any time step $t$ is at most $S_1 P$.

For schedules that maintain the busy-leaves property, the upper bound $S_1 P$ is conservative. By charging $S_1$ space for each busy leaf, we may be overcharging. For some computations, by knowing that the schedule preserves the busy-leaves property, we can appeal directly to the fact that the spawn subtree never has more than $P$ leaves to obtain tight bounds on space usage [6].

We finish this section by showing that for strict computations, the Busy-Leaves Algorithm computes a schedule that is both greedy and maintains the busy-leaves property.

Theorem 3 For any number $P$ of processors and any strict multithreaded computation with work $T_1$, critical-path length $T_\infty$, and stack depth $S_1$, the Busy-Leaves Algorithm computes a $P$-processor execution schedule $X$ whose execution time satisfies $T(X) \le T_1/P + T_\infty$ and whose space satisfies $S(X) \le S_1 P$.

Proof: The time bound follows directly from the greedy-scheduling theorem (Theorem 1), since the Busy-Leaves Algorithm computes a greedy schedule. The space bound follows from Lemma 2 if we can show that the Busy-Leaves Algorithm maintains the busy-leaves property. We prove this fact by induction on the number of steps. At the first step of the algorithm, the spawn subtree contains just the root thread which is a leaf, and some processor is working on it. We must show that all of the algorithm rules preserve the busy-leaves property. When a processor has a thread $\Gamma_a$ to work on, it executes instructions from that thread until it either spawns, stalls, or dies. Rule ➊: If $\Gamma_a$ spawns a child $\Gamma_b$, then $\Gamma_a$ is not a leaf (even if it was before) and $\Gamma_b$ is a leaf. In this case, the processor works on $\Gamma_b$, so the new leaf is busy. Rule ➋: If $\Gamma_a$ stalls, then $\Gamma_a$ cannot be a leaf since in a strict computation, the unresolved dependency must come from a descendant. Rule ➌: If $\Gamma_a$ dies, then its parent thread $\Gamma_b$ may turn into a leaf. In this case, the processor works on $\Gamma_b$ unless some other processor already is, so the new leaf is guaranteed to be busy.

We now know that every strict multithreaded computation has an efficient execution schedule, and we know how to find it. But these facts take us only so far. Execution schedules must be computed efficiently online, and though the Busy-Leaves Algorithm does compute efficient execution schedules and does operate online, it surely does not do so efficiently, except possibly in the case of small-scale symmetric multiprocessors. This lack of scalability is a consequence of employing a single centralized thread pool at which all processors must contend for access. In the next section, we present a distributed online scheduling algorithm, and in the following sections, we prove that it is both efficient and scalable.
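The charging argument in the proof of Lemma 2 can be checked on a concrete spawn subtree. The sketch below is an illustration only: it reuses the spawn-tree shape of the Figure 1 example, while the frame sizes, the chosen set of living threads, and the helper names are hypothetical. It counts the leaves of the spawn subtree and confirms that the space in use is at most $S_1$ per leaf.

```python
# Spawn tree shape of the example computation (thread -> parent).
parent = {1: None, 2: 1, 6: 1, 3: 2, 4: 2, 5: 2}
# Hypothetical activation-frame sizes.
frame_size = {1: 4, 2: 3, 3: 2, 4: 5, 5: 1, 6: 6}


def stack_depth_of(t):
    """Frame sizes summed over thread t and all of its ancestors."""
    above = stack_depth_of(parent[t]) if parent[t] is not None else 0
    return frame_size[t] + above


S1 = max(stack_depth_of(t) for t in parent)


def leaves(living):
    """Living threads with no living children (a parent outlives its children)."""
    with_living_child = {parent[t] for t in living}
    return {t for t in living if t not in with_living_child}


def space_in_use(living):
    """Total size of the frames of all living threads."""
    return sum(frame_size[t] for t in living)


# One possible spawn subtree during a 2-processor execution (an assumption).
living = {1, 2, 4, 6}
P = 2
assert len(leaves(living)) <= P   # busy leaves: each leaf needs its own processor
assert space_in_use(living) <= S1 * len(leaves(living)) <= S1 * P
```

Here the two leaves share the root's frame, so charging $S_1$ per leaf overcounts, which is the sense in which the bound $S_1 P$ is conservative.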
4 A randomized work-stealing algorithm

In this section, we present an online, randomized work-stealing algorithm for scheduling multithreaded computations on a parallel computer. Also, we present an important structural lemma which is used at the end of this section to show that for fully strict computations, this algorithm causes at most a linear expansion of space. This lemma reappears in Section 6 to show that for fully strict computations, this algorithm achieves linear speedup and generates existentially optimal amounts of communication.

In the Work-Stealing Algorithm, the centralized thread pool of the Busy-Leaves Algorithm is distributed across the processors. Specifically, each processor maintains a ready deque data structure of threads. The ready deque has two ends: a top and a bottom. Threads can be inserted on the bottom and removed from either end. A processor treats its ready deque like a call stack, pushing and popping from the bottom. Threads that are migrated to other processors are removed from the top.

In general, a processor obtains work by removing the thread at the bottom of its ready deque. It starts working on the thread, call it $\Gamma_a$, and continues executing $\Gamma_a$'s instructions until $\Gamma_a$ spawns, stalls, dies, or enables a stalled thread, in which case, it performs according to the following rules.

➊ Spawns: If the thread $\Gamma_a$ spawns a child $\Gamma_b$, then $\Gamma_a$ is placed on the bottom of the ready deque, and the processor commences work on $\Gamma_b$.

➋ Stalls: If the thread $\Gamma_a$ stalls, its processor checks the ready deque. If the deque contains any threads, then the processor removes and begins work on the bottommost thread. If the ready deque is empty, however, the processor begins work stealing: it steals the topmost thread from the ready deque of a randomly chosen processor and begins work on it. (This work-stealing strategy is elaborated below.)

➌ Dies: If the thread $\Gamma_a$ dies, then the processor follows rule ➋ as in the case of $\Gamma_a$ stalling.

➍ Enables: If the thread $\Gamma_a$ enables a stalled thread $\Gamma_b$, the now-ready thread $\Gamma_b$ is placed on the bottom of the ready deque of $\Gamma_a$'s processor.
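The four rules above act on one processor's local state, which can be sketched with Python's `collections.deque`. This is a minimal single-threaded illustration, not the paper's implementation: the class `Processor`, its method names, and the convention that the left end of the deque is the top and the right end is the bottom are all assumptions. It ignores concurrency and the atomic-access model entirely.

```python
from collections import deque


class Processor:
    """One processor's scheduling state: the current thread plus a ready deque."""

    def __init__(self):
        self.current = None    # thread being worked on, or None if idle
        self.ready = deque()   # left end = top, right end = bottom

    def on_spawn(self, child):
        # Rule 1: the parent goes on the bottom; work continues on the child.
        self.ready.append(self.current)
        self.current = child

    def on_stall_or_die(self):
        # Rules 2 and 3: resume the bottommost thread if there is one.
        # Returns True when the deque was empty and the caller must steal.
        if self.ready:
            self.current = self.ready.pop()
            return False
        self.current = None
        return True

    def on_enable(self, thread):
        # Rule 4: a newly enabled thread is placed on the bottom.
        self.ready.append(thread)

    def steal_top(self):
        # Called on behalf of a thief: remove the topmost thread, if any.
        return self.ready.popleft() if self.ready else None
```

Because spawned-from parents are pushed on the bottom and thieves take from the top, a thief always removes the thread that has been waiting longest, which is the ancestor nearest the root among those in the deque.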
A thread can simultaneously enable a stalled thread and stall or die, in which case we first perform rule ➍ for enabling and then rule ➋ for stalling or rule ➌ for dying. Except for rule ➍ for the case when a thread enables a stalled thread, these rules are analogous to the rules of the Busy-Leaves Algorithm, and as we shall see, rule ➍ is needed to ensure that the algorithm maintains important structural properties, including the busy-leaves property.

The Work-Stealing Algorithm begins with all ready deques empty. The root thread of the multithreaded computation is placed in the ready deque of one processor, while the other processors start work stealing.

When a processor begins work stealing, it operates as follows. The processor becomes a thief and attempts to steal work from a victim processor chosen uniformly at random. The thief queries the ready deque of the victim, and if it is nonempty, the thief removes and begins work on the top thread. If the victim's ready deque is empty, however, the thief tries again, picking another victim at random.
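The thief's retry loop can be sketched as follows. This is an illustrative sequential sketch under several assumptions: the function name `steal`, the `max_attempts` cutoff (added so the sketch terminates when every deque is empty), and the choice to draw the victim from all processors, including the thief itself, are not taken from the paper. The left end of each deque is treated as the top.

```python
import random
from collections import deque


def steal(deques, rng, max_attempts=1000):
    """Try random victims until one has a thread; return (thread, victim, attempts).

    Returns (None, None, max_attempts) if no attempt succeeds.
    """
    for attempt in range(1, max_attempts + 1):
        victim = rng.randrange(len(deques))   # victim chosen uniformly at random
        if deques[victim]:
            # Remove the top thread of the victim's ready deque.
            return deques[victim].popleft(), victim, attempt
    return None, None, max_attempts


# Four processors; only processor 1 has threads (top is the left end).
deques = [deque(), deque(["G1", "G2"]), deque(), deque()]
thread, victim, attempts = steal(deques, random.Random(0))
```

Each failed attempt corresponds to querying an empty deque and trying again; counting such attempts is what the later contention analysis has to account for.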
We now state and prove an important lemma on the structure of threads in the ready deque of any processor during the execution of a fully strict computation. This lemma is used later in this section to analyze execution space and in Section 6 to analyze execution time and communication. Figure 3 illustrates the lemma.

Figure 3: The structure of a processor's ready deque. The black instruction in each thread indicates the thread's currently ready instruction. Only thread $\Gamma_k$ may have been worked on since it last spawned a child. The dashed edges are the "deque edges" introduced in Section 6.

Lemma 4 In the execution of any fully strict multithreaded computation by the Work-Stealing Algorithm, consider any processor $p$ and any given time step at which $p$ is working on a thread. Let $\Gamma_0$ be the thread that $p$ is working on, let $k$ be the number of threads in $p$'s ready deque, and let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote the threads in $p$'s ready deque ordered from bottom to top, so that $\Gamma_1$ is the bottommost and $\Gamma_k$ is the topmost. If we have $k > 0$, then the threads in $p$'s ready deque satisfy the following properties:

➀ For $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$.

➁ If we have $k > 1$, then for $i = 1, 2, \ldots, k-1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$.

Proof: The proof is a straightforward induction on execution time. Execution begins with the root thread in some processor's ready deque and all other ready deques empty, so the lemma vacuously holds at the outset. Now, consider any step of the algorithm at which processor $p$ executes an instruction from thread $\Gamma_0$. Let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote the $k$ threads in $p$'s ready deque before the step, and suppose that either $k = 0$ or both properties hold. Let $\Gamma'_0$ denote the thread (if any) being worked on by $p$ after the step, and let $\Gamma'_1, \Gamma'_2, \ldots, \Gamma'_{k'}$ denote the $k'$ threads in $p$'s ready deque after the step. We now look at the rules of the algorithm and show that they all preserve the lemma. That is, either $k' = 0$ or both properties hold after the step.

Rule ➊: If $\Gamma_0$ spawns a child, then $p$ pushes $\Gamma_0$ onto the bottom of the ready deque and commences work on the child. Thus, $\Gamma'_0$ is the child, we have $k' = k + 1 > 0$, and for $j = 1, 2, \ldots, k'$, we have $\Gamma'_j = \Gamma_{j-1}$. See Figure 4. Now, we can check both properties. Property ➀: If $k' > 1$, then for $j = 2, 3, \ldots, k'$, thread $\Gamma'_j$ is the parent of $\Gamma'_{j-1}$, since before the spawn we have $k > 0$, which means that for $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$. Moreover, $\Gamma'_1$ is obviously the parent of $\Gamma'_0$. Property ➁: If $k' > 2$, then for $j = 2, 3, \ldots, k'-1$, thread $\Gamma'_j$ has not been worked on since it spawned $\Gamma'_{j-1}$, because before the spawn we have $k > 1$, which means that for $i = 1, 2, \ldots, k-1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$. Finally, thread $\Gamma'_1$ has not been worked on since it spawned $\Gamma'_0$, because the spawn only just occurred.

Figure 4: The ready deque of a processor before and after the thread $\Gamma_0$ that it is working on spawns a child. (Note that the threads $\Gamma_0$ and $\Gamma'_0$ are not actually in the deque; they are the threads being worked on before and after the spawn.) (a) Before spawn. (b) After spawn.

Rules ➋ and ➌: If $\Gamma_0$ stalls or dies, then we have two cases to consider. If $k = 0$, the ready deque is empty, so the processor commences work stealing, and when the processor steals and begins work on a thread, we have $k' = 0$. If $k > 0$, the ready deque is not empty, so the processor pops the bottommost thread off the deque and commences work on it. Thus, we have $\Gamma'_0 = \Gamma_1$ (the popped thread) and $k' = k - 1$, and for $j = 1, 2, \ldots, k'$, we have $\Gamma'_j = \Gamma_{j+1}$. See Figure 5. Now, if $k' > 0$, we can check both properties. Property ➀: For $j = 1, 2, \ldots, k'$, thread $\Gamma'_j$ is the parent of $\Gamma'_{j-1}$, since for $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$. Property ➁: If $k' > 1$, then for $j = 1, 2, \ldots, k'-1$, thread $\Gamma'_j$ has not been worked on since it spawned $\Gamma'_{j-1}$, because before the stall or death we have $k > 2$, which means that for $i = 2, 3, \ldots, k-1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$.

Figure 5: The ready deque of a processor before and after the thread $\Gamma_0$ that it is working on dies. (Note that the threads $\Gamma_0$ and $\Gamma'_0$ are not actually in the deque; they are the threads being worked on before and after the death.) (a) Before death. (b) After death.

Rule ➍: If $\Gamma_0$ enables a stalled thread, then due to the fully strict condition, that previously stalled thread must be $\Gamma_0$'s parent. First, we observe that we must have $k = 0$. If
we have $k > 0$, then the processor's ready deque is not empty, and this parent thread must be bottommost in the ready deque. Thus, this parent thread is ready and rule ➍ does not apply. With $k = 0$, the ready deque is empty and the processor places the parent thread on the bottom of the ready deque. We have $\Gamma'_0 = \Gamma_0$ and $k' = k + 1 = 1$ with $\Gamma'_1$ denoting the newly enabled parent. We only have to check the first property. Property ➀: Thread $\Gamma'_1$ is obviously the parent of $\Gamma'_0$.

If some other processor steals a thread from processor $p$, then we must have $k > 0$, and after the steal we have $k' = k - 1$. If $k' > 0$ holds, then both properties are clearly preserved. All other actions by processor $p$ (such as work stealing or executing an instruction that does not invoke any of the above rules) clearly preserve the lemma.

Before moving on, it is worth pointing out how it may happen that thread $\Gamma_k$ has been worked on since it spawned $\Gamma_{k-1}$, since Property ➁ excludes $\Gamma_k$. This situation arises when $\Gamma_k$ is stolen from processor $p$ and then stalls on its new processor. Later, $\Gamma_k$ is reenabled by $\Gamma_{k-1}$ and brought back to processor $p$'s ready deque. The key observation is that when $\Gamma_k$ is reenabled, processor $p$'s ready deque is empty and $p$ is working on $\Gamma_{k-1}$. The other threads $\Gamma_{k-2}, \Gamma_{k-3}, \ldots, \Gamma_0$ shown in Figure 3 were spawned after $\Gamma_k$ was reenabled.

We conclude this section by bounding the space used by the Work-Stealing Algorithm executing a fully strict computation.

Theorem 5 For any fully strict multithreaded computation with stack depth $S_1$, the Work-Stealing Algorithm run on a computer with $P$ processors uses at most $S_1 P$ space.

Proof: Like the Busy-Leaves Algorithm, the Work-Stealing Algorithm maintains the busy-leaves property: at every time step of the execution, every leaf in the current spawn subtree has a processor working on it. If we can establish this fact, then Lemma 2 completes the proof.

That the Work-Stealing Algorithm maintains the busy-leaves property is a simple consequence of Lemma 4. At every time step, every leaf in the current spawn subtree must be ready and therefore must either have a processor working on it or be in the ready deque of some processor. But Lemma 4 guarantees that no leaf thread sits in a processor's ready deque while the processor works on some other thread.

With the space bound in hand, we now turn attention to analyzing the time and communication bounds for the Work-Stealing Algorithm. Before we can proceed with this analysis, however, we must take care to define a model for coping with the contention that may arise when multiple thief processors simultaneously attempt to steal from the same victim.

5 Atomic accesses and the recycling game

This section presents the "atomic-access" model that we use to analyze contention during the execution of a multithreaded computation by the Work-Stealing Algorithm. We introduce a combinatorial "balls and bins" game, which we use to bound the total amount of delay incurred by random, asynchronous accesses in this model. We shall use the results of this section in Section 6, where we analyze the Work-Stealing Algorithm.

The atomic-access model is the machine model we use to analyze the Work-Stealing Algorithm. We assume that the machine is an asynchronous parallel computer with $P$
processors, and its memory can be either distributed or shared. Our analysis assumes that concurrent accesses to the same data structure are serially queued by an adversary, as in the atomic message-passing model of [36]. This assumption is more stringent than that in the model of Karp and Zhang [33]. They assume that if concurrent steal requests are made to a deque, in one time step, one request is satisfied and all the others are denied. In the atomic-access model, we also assume that one request is satisfied, but the others are queued by an adversary, rather than being denied. Moreover, from the collection of waiting requests for a given deque, the adversary gets to choose which is serviced and which continue to wait. The only constraint on the adversary is that if there is at least one request for a deque, then the adversary cannot choose that none be serviced.

The main result of this section is to show that if requests are made randomly by $P$ processors to $P$ deques with each processor allowed at most one outstanding request, then the total amount of time that the processors spend waiting for their requests to be satisfied is likely to be proportional to the total number $M$ of requests, no matter which processors make the requests and no matter how the requests are distributed over time. In order to prove this result, we introduce a "balls and bins" game that models the effects of queueing by the adversary.

The $(P, M)$-recycling game is a combinatorial game played by the adversary, in which balls are tossed at random into bins. The parameter $P$ is the number of balls in the game, which is equal to the number of bins. The parameter $M$ is the total number of ball tosses executed by the adversary. Initially, all $P$ balls are in a reservoir separate from the $P$ bins. At each step of the game, the adversary executes the following two operations in sequence:

1. The adversary chooses some of the balls in the reservoir (possibly all and possibly none), and then for each of these balls, the adversary removes it from the reservoir, selects one of the $P$ bins uniformly and independently at random, and tosses the ball into it.

2. The adversary inspects each of the $P$ bins in turn, and for each bin that contains at least one ball, the adversary removes any one of the balls in the bin and returns it to the reservoir.

The adversary is permitted to make a total of $M$ ball tosses. The game ends when $M$ ball tosses have been made and all balls have been removed from the bins and placed back in the reservoir.

The recycling game models the servicing of steal requests by the Work-Stealing Algorithm. We can view each ball and each bin as being owned by a distinct processor. If a ball is in the reservoir, it means that the ball's owner is not making a steal request. If a ball is in a bin, it means that the ball's owner has made a steal request to the deque of the bin's owner, but that the request has not yet been satisfied. When a ball is removed from a bin and returned to the reservoir, it means that the request has been serviced.

After each step $t$ of the game, there are some number $n_t$ of balls left in the bins, which correspond to steal requests that have not been satisfied. We shall be interested in the total delay $D = \sum_{t=1}^{T} n_t$, where $T$ is the total number of steps in the game. The goal of the adversary is to make the total delay as large as possible. The next lemma shows that despite
the choices that the adversary makes about which balls to toss into bins and which to return to the reservoir, the total delay is unlikely to be large.

Lemma 6 For any $\epsilon > 0$, with probability at least $1 - \epsilon$, the total delay in the $(P, M)$-recycling game is $O(M + P \lg P + P \lg(1/\epsilon))$.¹ The expected total delay is at most $M$. In other words, the total delay incurred by $M$ random requests made by $P$ processors in the atomic-access model is $O(M + P \lg P + P \lg(1/\epsilon))$ with probability at least $1 - \epsilon$, and the expected total delay is at most $M$.

Proof: We first make the observation that the strategy by which the adversary chooses a ball from each bin is immaterial, and thus, we can assume that balls are queued in their bins in a first-in-first-out (FIFO) order. The adversary removes balls from the front of the queue, and when the adversary tosses a ball, it is placed on the back of the queue. If several balls are tossed into the same bin at the same step, they can be placed on the back of the queue in any order. The reason that assuming a FIFO discipline for queuing balls in a bin does not affect the total delay is that the number of balls in a given bin at a given step is the same no matter which ball is removed, and where balls are tossed has nothing to do with which ball is tossed.

For any given ball and any given step, the step either finishes with the ball in a bin or in the reservoir. Define the delay of ball $r$ to be the random variable $\delta_r$ denoting the total number of steps that finish with ball $r$ in a bin. Then, we have

\[ D = \sum_{r=1}^{P} \delta_r . \qquad (1) \]

Define the $i$th cycle of a ball to be those steps in which the ball remains in a bin from the $i$th time it is tossed until it is returned to the reservoir. Define also the $i$th delay of a ball to be the number of steps in its $i$th cycle.

We shall analyze the total delay by focusing, without loss of generality, on the delay $\delta = \delta_1$ of ball 1. If we let $m$ denote the number of times that ball 1 is tossed by the adversary, and for $i = 1, 2, \ldots, m$, let $d_i$ be the random variable denoting the $i$th delay of ball 1, then we have $\delta = \sum_{i=1}^{m} d_i$.

We say that the $i$th cycle of ball 1 is delayed by another ball $r$ if the $i$th toss of ball 1 places it in some bin $k$ and ball $r$ is removed from bin $k$ during the $i$th cycle of ball 1. Since the adversary follows the FIFO rule, it follows that the $i$th cycle of ball 1 can be delayed by another ball $r$ either once or not at all. Consequently, we can decompose each random variable $d_i$ into a sum $d_i = x_{i2} + x_{i3} + \cdots + x_{iP}$ of indicator random variables, where

\[ x_{ir} = \begin{cases} 1 & \text{if the $i$th cycle of ball 1 is delayed by ball $r$;} \\ 0 & \text{otherwise.} \end{cases} \]

Thus, we have

\[ \delta = \sum_{i=1}^{m} \sum_{r=2}^{P} x_{ir} . \qquad (2) \]

¹ Greg Plaxton of the University of Texas, Austin has improved this bound to $O(M)$ for the case when $1/\epsilon$ is at most polynomial in $M$ and $P$ [40].
We now prove an important property of these indicator random variables. Consider any set $S$ of pairs $(i, r)$, each of which corresponds to the event that the $i$th cycle of ball 1 is delayed by ball $r$. For any such set $S$, we claim that

\[ \Pr\Bigl\{ \bigwedge_{(i,r) \in S} (x_{ir} = 1) \Bigr\} \le P^{-|S|} . \qquad (3) \]

The crux of proving the claim is to show that

\[ \Pr\Bigl\{ x_{ir} = 1 \;\Bigm|\; \bigwedge_{(i',r') \in S'} (x_{i'r'} = 1) \Bigr\} \le 1/P , \qquad (4) \]

where $S' = S - \{(i, r)\}$, whence the claim (3) follows from Bayes's Theorem.

We can derive Inequality (4) from a careful analysis of dependencies. Because the adversary follows the FIFO rule, we have that $x_{ir} = 1$ only if, when the adversary executes the $i$th toss of ball 1, it falls into whatever bin contains ball $r$, if any. A priori, this event happens with probability either $1/P$ or $0$, and hence, with probability at most $1/P$. Conditioning on any collection of events relating which balls delay this or other cycles of ball 1 cannot increase this probability, as we now argue in two cases. In the first case, the indicator random variables $x_{i'r'}$, where $i' \ne i$, tell whether other cycles of ball 1 are delayed. This information tells nothing about where the $i$th toss of ball 1 goes. Therefore, these random variables are independent of $x_{ir}$, and thus, the probability $1/P$ upper bound is not affected. In the second case, the indicator random variables $x_{ir'}$ tell whether the $i$th toss of ball 1 goes to the bin containing ball $r'$, but this information tells us nothing about whether it goes to the bin containing ball $r$, because the indicator random variables tell us nothing to relate where ball $r$ and ball $r'$ are located. Moreover, no "collusion" among the indicator random variables provides any more information, and thus Inequality (4) holds.

Equation (2) shows that the delay $\delta$ encountered by ball 1 throughout all of its cycles can be expressed as a sum of $m(P-1)$ indicator random variables. In order for $\delta$ to equal or exceed a given value $\Delta$, there must be some set containing $\Delta$ of these indicator random variables, each of which must be 1. For any specific such set, Inequality (3) says that the probability is at most $P^{-\Delta}$ that all random variables in the set are 1. Since there are $\binom{m(P-1)}{\Delta} \le (emP/\Delta)^{\Delta}$ such sets, where $e$ is the base of the natural logarithm, we have

\[ \Pr\{\delta \ge \Delta\} \;\le\; \Bigl(\frac{emP}{\Delta}\Bigr)^{\Delta} P^{-\Delta} \;=\; \Bigl(\frac{em}{\Delta}\Bigr)^{\Delta} \;\le\; \epsilon/P , \]

whenever $\Delta \ge \max\{2em, \lg P + \lg(1/\epsilon)\}$.

Although our analysis was performed for ball 1, it applies to any other ball as well. Consequently, for any given ball $r$ which is tossed $m_r$ times, the probability that its delay $\delta_r$ exceeds $\max\{2em_r, \lg P + \lg(1/\epsilon)\}$ is at most $\epsilon/P$. By Boole's inequality and Equation (1),
it follows that with probability at least $1 - \epsilon$, the total delay $D$ is at most

\[ D \;\le\; \sum_{r=1}^{P} \max\{2em_r, \lg P + \lg(1/\epsilon)\} \;=\; \Theta(M + P \lg P + P \lg(1/\epsilon)) , \]

since $M = \sum_{r=1}^{P} m_r$.

The upper bound $E[D] \le M$ can be obtained as follows. Recall that each $\delta_r$ is the sum of $(P-1)m_r$ indicator random variables, each of which has expectation at most $1/P$. Therefore, by linearity of expectation, $E[\delta_r] \le m_r$. Using Equation (1) and again using linearity of expectation, we obtain $E[D] \le M$.

With this bound on the total delay incurred by $M$ random requests now in hand, we turn back to the Work-Stealing Algorithm.

6 Analysis of the work-stealing algorithm

In this section, we analyze the time and communication cost of executing a fully strict multithreaded computation with the Work-Stealing Algorithm. For any fully strict computation with work $T_1$ and critical-path length $T_\infty$, we show that the expected running time with $P$ processors, including scheduling overhead, is $T_1/P + O(T_\infty)$. Moreover, for any $\epsilon > 0$, the execution time on $P$ processors is $T_1/P + O(T_\infty + \lg P + \lg(1/\epsilon))$, with probability at least $1 - \epsilon$. We also show that the expected total communication during the execution of a fully strict computation is $O(P T_\infty (1 + n_d) S_{\max})$, where $n_d$ is the maximum number of join edges from a thread to its parent and $S_{\max}$ is the largest size of any activation frame.

Unlike in the Busy-Leaves Algorithm, the "ready pool" in the Work-Stealing Algorithm is distributed, and so there is no contention at a centralized data structure. Nevertheless, it is still possible for contention to arise when several thieves happen to descend on the same victim simultaneously. In this case, as we have indicated in the previous section, we make the conservative assumption that an adversary serially queues the work-stealing requests. We further assume that it takes unit time for a processor to respond to a work-stealing request. This assumption can be relaxed without materially affecting the results so that a work-stealing response takes any constant amount of time.
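To make the serial-queueing assumption concrete, the $(P, M)$-recycling game of Section 5 can be simulated in a few lines. The following is a minimal sketch, not part of the original analysis: the function name `recycling_game`, the `seed` parameter, and the "greedy" adversary (which tosses every reservoir ball at every step while tosses remain) are all assumptions made for illustration. This adversary is only one possible strategy and is not claimed to be the worst case. Because the total delay depends only on how many balls sit in each bin, the sketch tracks counts rather than individual balls.

```python
import random


def recycling_game(P, M, seed=0):
    """Play one (P, M)-recycling game against a simple greedy adversary.

    Returns the total delay D, i.e. the sum over all steps t of n_t, the
    number of balls left in bins at the end of step t.
    The greedy adversary is an illustrative assumption, not the worst case.
    """
    rng = random.Random(seed)
    bins = [0] * P        # bins[k] = number of balls currently in bin k
    reservoir = P         # initially all P balls are in the reservoir
    tosses_left = M       # the adversary may make M tosses in total
    total_delay = 0

    while tosses_left > 0 or sum(bins) > 0:
        # Operation 1: toss balls from the reservoir into uniformly
        # random bins (greedy: as many as the toss budget allows).
        n_toss = min(reservoir, tosses_left)
        for _ in range(n_toss):
            bins[rng.randrange(P)] += 1
        reservoir -= n_toss
        tosses_left -= n_toss

        # Operation 2: every nonempty bin returns exactly one ball.
        for k in range(P):
            if bins[k] > 0:
                bins[k] -= 1
                reservoir += 1

        # n_t: balls still waiting in bins at the end of this step.
        total_delay += sum(bins)

    return total_delay


if __name__ == "__main__":
    P, M = 8, 1000
    delays = [recycling_game(P, M, seed=s) for s in range(100)]
    print("mean total delay:", sum(delays) / len(delays), "with M =", M)
```

Averaging the returned delay over many seeds gives an empirical quantity to compare against the bound $E[D] \le M$ of Lemma 6; a ball that lands alone in an empty bin is serviced in the same step and contributes no delay.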
To analyze the running time of the Work-Stealing Algorithm executing a fully strict multithreaded computation with work $T_1$ and critical-path length $T_\infty$ on a computer with $P$ processors, we use an accounting argument. At each step of the algorithm, we collect $P$ dollars, one from each processor. At each step, each processor places its dollar in one of three buckets according to its actions at that step. If the processor executes an instruction at the step, then it places its dollar into the Work bucket. If the processor initiates a steal attempt at the step, then it places its dollar into the Steal bucket. And, if the processor merely waits for a queued steal request at the step, then it places its dollar into the Wait bucket. We shall derive the running-time bound by bounding the number of dollars in each bucket at the end of the execution, summing these three bounds, and then dividing by $P$.

We first bound the total number of dollars in the Work bucket.
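As a compact restatement of the accounting argument described above, the identity it relies on can be written out explicitly. The symbols $T_P$ (number of steps in the execution) and $W$, $S$, $Q$ (final contents of the Work, Steal, and Wait buckets) are notation introduced here for illustration only; the identity assumes, as the argument does, that every processor contributes exactly one dollar to exactly one bucket at every step.

```latex
% Each of the T_P steps collects exactly P dollars, and every dollar
% ends up in exactly one of the three buckets.
\[
  P \cdot T_P \;=\; W + S + Q
  \qquad\Longrightarrow\qquad
  T_P \;=\; \frac{W + S + Q}{P} .
\]
```

Any upper bounds on the three bucket totals therefore translate directly into an upper bound on the running time after dividing their sum by $P$.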
Lemma 7 The execution of a fully strict multithreaded computation with work $T_1$ by the Work-Stealing Algorithm on a computer with $P$ processors terminates with exactly $T_1$ dollars in the Work bucket.

Proof: A processor places a dollar in the Work bucket only when it executes an instruction. Thus, since there are $T_1$ instructions in the computation, the execution ends with exactly $T_1$ dollars in the Work bucket.

Bounding the total dollars in the Steal bucket requires a significantly more involved "delay-sequence" argument. We first introduce the notion of a "round" of work-steal attempts, and we must also define an augmented dag that we then use to define "critical" instructions. The idea is as follows. If, during the course of the execution, a large number of steals are attempted, then we can identify a sequence of instructions (the delay sequence) in the augmented dag such that each of these steal attempts was initiated while some instruction from the sequence was critical. We then show that a critical instruction is unlikely to remain critical across a modest number of steal attempts. We can then conclude that such a delay sequence is unlikely to occur, and therefore, an execution is unlikely to suffer a large number of steal attempts.

A round of steal attempts is a set of at least $3P$ but fewer than $4P$ consecutive steal attempts such that if a steal attempt that is initiated at time step $t$ occurs in a particular round, then all other steal attempts initiated at time step $t$ are also in the same round. We can partition all of the steal attempts that occur during an execution into rounds as follows.
The first round contains all steal attempts initiated at time steps $1, 2, \ldots, t_1$, where $t_1$ is the earliest time such that at least $3P$ steal attempts were initiated at or before $t_1$. We say that the first round starts at time step 1 and ends at time step $t_1$. In general, if the $i$th round ends at time step $t_i$, then the $(i+1)$st round begins at time step $t_i + 1$ and ends at the earliest time step $t_{i+1} > t_i + 1$ such that at least $3P$ steal attempts were initiated at time steps between $t_i + 1$ and $t_{i+1}$, inclusive. These steal attempts belong to round $i + 1$. By definition, each round contains at least $3P$ consecutive steal attempts. Moreover, since at most $P - 1$ steal attempts can be initiated in a single time step, each round contains fewer than $4P - 1$ steal attempts, and each round takes at least 4 steps.

The sequence of instructions that make up the delay sequence is defined with respect to an augmented dag obtained by modifying the original dag slightly. Let $G$ denote the original dag, that is, the dag consisting of the computation's instructions as vertices and its continue, spawn, and join edges as edges. The augmented dag $G'$ is the original dag $G$ together with some new edges, as follows. For every set of instructions $u$, $v$, and $w$ such that $(u, v)$ is a spawn edge and $(u, w)$ is a continue edge, the deque edge $(w, v)$ is placed in $G'$. These deque edges are shown dashed in Figure 3. In Section 2 we made the technical assumption that instruction $w$ has no incoming join edges, and so $G'$ is a dag. If $T_\infty$ is the length of a longest path in $G$, then the longest path in $G'$ has length at most $2T_\infty$. It is worth pointing out that $G'$ is only an analytical tool. The deque edges have no effect on the scheduling and execution of the computation by the Work-Stealing Algorithm.

The deque edges are the key to defining critical instructions. At any time step during the execution, we say that an unexecuted instruction $v$ is critical if every instruction that precedes $v$ (either directly or indirectly) in $G'$ has been executed, that is, if for every instruction $w$ such that there is a directed path from $w$ to $v$ in $G'$, instruction $w$ has been
executed. A critical instruction must be ready, since $G'$ contains every edge of $G$, but a ready instruction may or may not be critical. Intuitively, the structural properties of a ready deque enumerated in Lemma 4 guarantee that if a thread is deep in a ready deque, then its current instruction cannot be critical, because the predecessor of the thread's current instruction across the deque edge has not yet been executed.

We now formalize our definition of a delay sequence.

Definition 8 A delay sequence is a 3-tuple $(U, R, \Pi)$ satisfying the following conditions:

• $U = (u_1, u_2, \ldots, u_L)$ is a maximal directed path in $G'$. Specifically, for $i = 1, 2, \ldots, L-1$, the edge $(u_i, u_{i+1})$ belongs to $G'$, instruction $u_1$ has no incoming edges in $G'$ (instruction $u_1$ must be the first instruction of the root thread), and instruction $u_L$ has no outgoing edges in $G'$ (instruction $u_L$ must be the last instruction of the root thread).

• $R$ is a positive integer number of steal-attempt rounds.

• $\Pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$ is a partition of $R$ (that is, $R = \sum_{i=1}^{L} (\pi_i + \pi'_i)$), such that $\pi'_i \in \{0, 1\}$ for each $i = 1, 2, \ldots, L$.

The partition $\Pi$ induces a partition of a sequence of $R$ rounds as follows. The first piece of the partition corresponds to the first $\pi_1$ rounds. The second piece corresponds to the next $\pi'_1$ consecutive rounds after the first $\pi_1$ rounds. The third piece corresponds to the next $\pi_2$ consecutive rounds after the first $(\pi_1 + \pi'_1)$ rounds, and so on. We are interested primarily in the pieces corresponding to the $\pi_i$, not the $\pi'_i$, and so we define the $i$th group of rounds to be the $\pi_i$ consecutive rounds starting after the $r_i$th round, where $r_i = \sum_{j=1}^{i-1} (\pi_j + \pi'_j)$. Because $\Pi$ is a partition of $R$ and $\pi'_i \in \{0, 1\}$, for $i = 1, 2, \ldots, L$, we have

\[ \sum_{i=1}^{L} \pi_i \;\ge\; R - L . \qquad (5) \]

We say that a given round of steal attempts occurs while instruction $v$ is critical if all of the steal attempts that comprise the round are initiated at time steps when $v$ is critical. In other words, $v$ must be critical throughout the entire round. A delay sequence $(U, R, \Pi)$ is said to occur during an execution if for each $i = 1, 2, \ldots, L$, all $\pi_i$ rounds in the $i$th group occur while instruction $u_i$ is critical. In other words, $u_i$ must be critical throughout all $\pi_i$ rounds.

The following lemma states that if at least $R$ rounds take place during an execution, then some delay sequence $(U, R, \Pi)$ must occur. In particular, if we look at any execution in which at least $R$ rounds occur, then we can identify a path $U = (u_1, u_2, \ldots, u_L)$ in the dag $G'$ and a partition $\Pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$ of the first $R$ rounds, such that for each $i = 1, 2, \ldots, L$, all of the $\pi_i$ rounds in the $i$th group occur while $u_i$ is critical. Each $\pi'_i$ indicates whether $u_i$ is critical at the beginning of a round but gets executed before the round ends. Such a round cannot be part of any group, because no instruction is critical throughout.

Lemma 9 Consider the execution of a fully strict multithreaded computation with critical-path length $T_\infty$ by the Work-Stealing Algorithm on a computer with $P$ processors. If at least $4PR$ steal attempts occur during the execution, then some delay sequence $(U, R, \Pi)$ must occur.
Proof: For a given execution in which at least $4PR$ steal attempts take place, we construct a delay sequence $(U, R, \Pi)$ and show that it occurs. With at least $4PR$ steal attempts, there must be at least $R$ rounds. We construct the delay sequence by first identifying a set of instructions on a directed path in $G'$ such that for every time step during the execution, one of these instructions is critical. Then, we partition the first $R$ rounds according to when each round occurs relative to when each instruction on the path is critical.

To construct the path $U$, we work backwards from the last instruction of the root thread, which we denote by $v_1$. Let $v_{l_1}$ denote a (not necessarily immediate) predecessor instruction of $v_1$ in $G'$ with the latest execution time. Let $(v_{l_1}, \ldots, v_2, v_1)$ denote a directed path from $v_{l_1}$ to $v_1$ in $G'$. We extend this path back to the first instruction of the root thread by iterating this construction as follows. At the $i$th iteration we have an instruction $v_{l_i}$ and a directed path in $G'$ from $v_{l_i}$ to $v_1$. We let $v_{l_{i+1}}$ denote a predecessor of $v_{l_i}$ in $G'$ with the latest execution time, and let $(v_{l_{i+1}}, \ldots, v_{l_i + 1}, v_{l_i})$ denote a directed path from $v_{l_{i+1}}$ to $v_{l_i}$ in $G'$. We finish iterating the construction when we get to an iteration $k$ in which $v_{l_k}$ is the first instruction of the root thread. Our desired sequence is then $U = (u_1, u_2, \ldots, u_L)$, where $L = l_k$ and $u_i = v_{L-i+1}$ for $i = 1, 2, \ldots, L$. One can verify that at every time step of the execution, one of the $v_{l_i}$ is critical.

Now, to construct the partition $\Pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$, we partition the sequence of the first $R$ rounds according to when each round occurs. We would like our partition to be such that for each round (among the first $R$ rounds), we have the property that if the round occurs while some instruction $u_i$ is critical, then the round belongs to the $i$th group. Start with $\pi_1$, and let $\pi_1$ equal the number of rounds that occur while $u_1$ is critical. All of these rounds are consecutive at the beginning of the sequence, so these rounds comprise the 1st group, that is, they are the $\pi_1$ consecutive rounds starting after the $r_1 = 0$ first rounds.
Next, if the round that immediately follows those first $\pi_1$ rounds begins after $u_1$ has been executed, then we set $\pi'_1 = 0$, and we go on to $\pi_2$. Otherwise, that round begins while $u_1$ is critical and ends after $u_1$ is executed (for otherwise, it would be part of the first group), so we set $\pi'_1 = 1$, and we go on to $\pi_2$. For $\pi_2$, we let $\pi_2$ equal the number of rounds that occur while $u_2$ is critical. Note that all of these rounds are consecutive beginning after the first $r_2 = \pi_1 + \pi'_1$ rounds, so these rounds comprise the 2nd group. We continue in this fashion, letting each $\pi_i$ be the number of rounds that occur while $u_i$ is critical and letting each $\pi'_i$ be the number of rounds that begin while $u_i$ is critical but do not end until after $u_i$ is executed. As an example, we may have a round that begins while $u_i$ is critical and then ends while $u_{i+2}$ is critical, and in this case, we set $\pi'_i = 1$ and $\pi'_{i+1} = 0$. In this example, the $(i+1)$st group is empty, so we also set $\pi_{i+1} = 0$.

We conclude the proof by verifying that the $(U, R, \Pi)$ as just constructed is a delay sequence and that it occurs. By construction, $U$ is a maximal path in $G'$. Now considering $\Pi$, we observe that each round among the first $R$ rounds is counted exactly once in either a $\pi_i$ or a $\pi'_i$, so $\Pi$ is indeed a partition of $R$. Moreover, for $i = 1, 2, \ldots, L$, at most one round can begin while the instruction $u_i$ is critical and end after $u_i$ is executed, so we have $\pi'_i \in \{0, 1\}$. Thus, $(U, R, \Pi)$ is a delay sequence. Finally, we observe that, by construction, for $i = 1, 2, \ldots, L$, the $\pi_i$ rounds that comprise the $i$th group all occur while the instruction $u_i$ is critical. Therefore, the delay sequence $(U, R, \Pi)$ occurs.

We now establish that a critical instruction is unlikely to remain critical across a modest number of rounds. Specifically, we first show that a critical instruction must be the ready
instruction of a thread that is near the top of its processor's ready deque. We then use this fact to show that after $O(1)$ rounds, a critical instruction is very likely to be executed.

Lemma 10 At every time step during the execution of a fully strict multithreaded computation by the Work-Stealing Algorithm, each critical instruction is the ready instruction of a thread that has at most 1 thread above it in its processor's ready deque.

Proof: Consider any time step, and let $u_0$ be the critical instruction of a thread $\Gamma_0$. Since $u_0$ is critical, $\Gamma_0$ is ready. Hence, for some processor $p$, either $\Gamma_0$ is in $p$'s ready deque or $\Gamma_0$ is being worked on by $p$. If $\Gamma_0$ has more than 1 thread above it in $p$'s ready deque, then Lemma 4 guarantees that each of the at least 2 threads above $\Gamma_0$ in $p$'s ready deque is an ancestor of $\Gamma_0$. Let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote $\Gamma_0$'s ancestor threads, where $\Gamma_1$ is the parent of $\Gamma_0$ and $\Gamma_k$ is the root thread. Further, for $i = 1, 2, \ldots, k$, let $u_i$ denote the instruction of thread $\Gamma_i$ that spawned thread $\Gamma_{i-1}$, and let $w_i$ denote $u_i$'s successor instruction in thread $\Gamma_i$. Because of the deque edges, each instruction $w_i$ is a predecessor of $u_0$ in $G'$, and consequently, since $u_0$ is critical, each instruction $w_i$ has been executed. Moreover, because each $w_i$ is the successor of the spawn instruction $u_i$ in thread $\Gamma_i$, each thread $\Gamma_i$ for $i = 1, 2, \ldots, k$ has been worked on since the time step at which it spawned thread $\Gamma_{i-1}$. But Lemma 4 guarantees that only the topmost thread in $p$'s ready deque can have this property. Thus, $\Gamma_1$ is the only thread that can possibly be above $\Gamma_0$ in $p$'s ready deque.

Lemma 11 Consider the execution of any fully strict multithreaded computation by the Work-Stealing Algorithm on a parallel computer with $P \ge 2$ processors. For any instruction $v$ and any number $r \ge 2$ of steal-attempt rounds, the probability that any particular set of $r$ rounds occur while the instruction $v$ is critical is at most the probability that only 0 or 1 of the steal attempts initiated in the first $r - 1$ of these rounds choose $v$'s processor, which is at most $e^{-2r+3}$.

Proof:
Let $t_a$ denote the first time step at which instruction $v$ is critical, and let $p$ denote the processor in whose ready deque $v$'s thread $\Gamma$ resides at time step $t_a$. Consider any particular set of $r$ rounds, and suppose that they all occur while instruction $v$ is critical. Now, consider the steal attempts that comprise the first $r - 1$ of these rounds, of which there must be at least $3P(r-1)$. Let $t_b$ denote the time step just after the time step at which the last of these steal attempts is initiated. Note that because the last round, like every round, must take at least two (in fact, four) steps, the time step $t_b$ must occur before the time step at which instruction $v$ is executed.

We shall first show that of these $3P(r-1)$ steal attempts initiated while instruction $v$ is critical and at least 2 time steps before $v$ is executed, at most 1 of them can choose processor $p$ as its target, for otherwise, $v$ would be executed at or before $t_b$. Recall from Lemma 10 that instruction $v$ is the ready instruction of a thread $\Gamma$, which has at most 1 thread above it in $p$'s ready deque as long as $v$ is critical.

If $\Gamma$ has no threads above it, then another thread cannot be placed above it until after instruction $v$ is executed, since only by processor $p$ executing instructions from $\Gamma$ can another thread be placed above it in its ready deque. Consequently, if a steal attempt targeting processor $p$ is initiated at some time step $t \ge t_a$, we are guaranteed that instruction $v$ is
executed at a time step no later than $t$, either by thread $\Gamma$ being stolen and executed or by $p$ executing the thread itself.

Now, suppose $\Gamma$ has one thread $\Gamma'$ above it in $p$'s ready deque. In this case, if a steal attempt targeting processor $p$ is initiated at time step $t \ge t_a$, then thread $\Gamma'$ gets stolen from $p$'s ready deque no later than time step $t$. Suppose further that another steal attempt targeting processor $p$ is initiated at time step $t'$, where $t_a \le t \le t' < t_b$. Then, we know that a second steal will be serviced by $p$ at or before time step $t' + 1$. If this second steal gets thread $\Gamma$, then instruction $v$ must get executed at or before time step $t' + 1 \le t_b$, which is impossible, since $v$ is executed after time step $t_b$. Consequently, this second steal must get thread $\Gamma'$ (the same thread that the first steal got). But this scenario can only occur if in the intervening time period, thread $\Gamma'$ stalls and is subsequently reenabled by the execution of some instruction from thread $\Gamma$, in which case instruction $v$ must be executed before time step $t' + 1 \le t_b$, which is once again impossible.

Thus, we must have $3P(r-1)$ steal attempts, each initiated at a time step $t$ such that $t_a \le t < t_b$, and at most 1 of which targets processor $p$. The probability that either 0 or 1 of $3P(r-1)$ steal attempts chooses processor $p$ is

\begin{align*}
\Bigl(1 - \frac{1}{P}\Bigr)^{3P(r-1)} + 3P(r-1)\,\frac{1}{P}\,\Bigl(1 - \frac{1}{P}\Bigr)^{3P(r-1)-1}
 &= \Bigl(1 + \frac{3(r-1)P}{P-1}\Bigr)\Bigl(1 - \frac{1}{P}\Bigr)^{3P(r-1)} \\
 &\le (6r - 5)\Bigl(1 - \frac{1}{P}\Bigr)^{3P(r-1)} \\
 &\le (6r - 5)\,e^{-3(r-1)} \\
 &\le e^{-2r+3}
\end{align*}

for $r \ge 2$.

We now complete the delay-sequence argument and bound the total dollars in the Steal bucket.

Lemma 12 Consider the execution of any fully strict multithreaded computation with critical-path length $T_\infty$ by the Work-Stealing Algorithm on a parallel computer with $P$ processors. For any $\epsilon > 0$, with probability at least $1 - \epsilon$, at most $O(P(T_\infty + \lg(1/\epsilon)))$ work-steal attempts occur. The expected number of steal attempts is $O(PT_\infty)$. In other words, with probability at least $1 - \epsilon$, the execution terminates with at most $O(P(T_\infty + \lg(1/\epsilon)))$ dollars in the Steal bucket, and the expected number of dollars in this bucket is $O(PT_\infty)$.

Proof: From Lemma 9, we know that if at least $4PR$ steal attempts occur, then some delay sequence $(U, R, \Pi)$ must occur. Consider a particular delay sequence $(U, R, \Pi)$ having $U = (u_1, u_2, \ldots, u_L)$ and $\Pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$, with $L \le 2T_\infty$. We shall compute the probability that $(U, R, \Pi)$ occurs.

Such a sequence occurs if for $i = 1, 2, \ldots, L$, each instruction $u_i$ is critical throughout all $\pi_i$ rounds in the $i$th group. From Lemma 11, we know that the probability of the $\pi_i$ rounds in the $i$th group all occurring while a given instruction $u_i$ is critical is at most the probability that only 0 or 1 of the steal attempts initiated in the first $\pi_i - 1$ of these rounds
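choose $v$'s processor, which is at most $e^{-2\pi_i + 3}$, provided $\pi_i \ge 2$. (For those values of $i$ with $\pi_i < 2$, we shall use 1 as an upper bound on this probability.) Moreover, since the targets of the work-steal attempts in the $\pi_i$ rounds of the $i$th group are chosen independently from the targets chosen in other rounds, we can bound the probability of the particular delay sequence $(U, R, \Pi)$ occurring as follows:

\begin{align*}
\Pr\{(U, R, \Pi) \text{ occurs}\}
 &= \prod_{1 \le i \le L} \Pr\{\text{the $\pi_i$ rounds of the $i$th group occur while $u_i$ is critical}\} \\
 &\le \prod_{\substack{1 \le i \le L \\ \pi_i \ge 2}} e^{-2\pi_i + 3} \\
 &\le \exp\Bigl[ -2 \Bigl( \sum_{\substack{1 \le i \le L \\ \pi_i \ge 2}} \pi_i \Bigr) + 3L \Bigr] \\
 &= \exp\Bigl[ -2 \Bigl( \sum_{1 \le i \le L} \pi_i - \sum_{\substack{1 \le i \le L \\ \pi_i < 2}} \pi_i \Bigr) + 3L \Bigr] \\
 &\le e^{-2((R - L) - L) + 3L} \\
 &= e^{-2R + 7L} ,
\end{align*}

where the second-to-last line follows from Inequality (5).

To bound the probability of some delay sequence $(U, R, \Pi)$ occurring, we need to count the number of such delay sequences and multiply by the probability that a particular such sequence occurs. The directed path $U$ in the modified dag $G'$ starts at the first instruction of the root thread and ends at the last instruction of the root thread. If the original dag has degree $d$, then $G'$ has degree at most $d + 1$. Consistent with our unit-time assumption for instructions, we assume that the degree $d$ is a constant. Since the length of a longest path in $G'$ is at most $2T_\infty$, there are at most $(d+1)^{2T_\infty}$ ways of choosing the path $U = (u_1, u_2, \ldots, u_L)$. There are at most $\binom{2L + R}{R} \le \binom{4T_\infty + R}{R}$ ways to choose $\Pi$, since $\Pi$ partitions $R$ into $2L$ pieces. As we have just shown, a given delay sequence has at most an $e^{-2R + 7L} \le e^{-2R + 14T_\infty}$ chance of occurring. Multiplying these three factors together bounds the probability that any delay sequence $(U, R, \Pi)$ occurs by

\[ (d+1)^{2T_\infty} \binom{4T_\infty + R}{R} e^{-2R + 14T_\infty} , \qquad (6) \]

which is at most $\epsilon$ for $R = cT_\infty \lg d + \lg(1/\epsilon)$, where $c$ is a sufficiently large constant. Thus, the probability that at least $4PR = \Theta(P(T_\infty \lg d + \lg(1/\epsilon))) = \Theta(P(T_\infty + \lg(1/\epsilon)))$ steal attempts occur is at most $\epsilon$. The expectation bound follows, because the tail of the distribution decreases exponentially.

With bounds on the number of dollars in the Work and Steal buckets, we now state the theorem that bounds the total execution time for a fully strict multithreaded computation by the Work-Stealing Algorithm, and we complete the proof by bounding the number of dollars in the Wait bucket.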
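The chain of inequalities that closes the proof of Lemma 11 can be spot-checked numerically. The sketch below is an illustration only and is no substitute for the proof: the function names `prob_at_most_one_hit` and `check_lemma11_chain` are hypothetical, and the check assumes exactly $3P(r-1)$ independent steal attempts, each choosing processor $p$ with probability $1/P$.

```python
import math


def prob_at_most_one_hit(P, r):
    """Exact probability that 0 or 1 of n = 3P(r-1) independent steal
    attempts, each uniform over P processors, target one fixed processor."""
    n = 3 * P * (r - 1)
    q = 1.0 - 1.0 / P                     # probability of missing p
    none = q ** n                         # no attempt targets p
    one = n * (1.0 / P) * q ** (n - 1)    # exactly one attempt targets p
    return none + one


def check_lemma11_chain(P, r):
    """Check exact <= (6r-5) e^{-3(r-1)} <= e^{-2r+3} for given P, r."""
    exact = prob_at_most_one_hit(P, r)
    middle = (6 * r - 5) * math.exp(-3 * (r - 1))
    final = math.exp(-2 * r + 3)
    return exact <= middle <= final


if __name__ == "__main__":
    ok = all(check_lemma11_chain(P, r)
             for P in range(2, 65) for r in range(2, 21))
    print("chain holds on the tested grid:", ok)
```

The last step of the chain is tightest at $r = 2$, where it reduces to $7 \le e^2$; this is why the lemma requires $r \ge 2$.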
Theorem 13 Consider the execution of any fully strict multithreaded computation with work $T_1$ and critical-path length $T_\infty$ by the Work-Stealing Algorithm on a parallel computer with $P$ processors. The expected running time, including scheduling overhead, is $T_1/P + O(T_\infty)$. Moreover, for any $\epsilon > 0$, with probability at least $1 - \epsilon$, the execution time on $P$ processors is $T_1/P + O(T_\infty + \lg P + \lg(1/\epsilon))$.²

Proof: Lemmas 7 and 12 bound the dollars in the Work and Steal buckets, so we now must bound the dollars in the Wait bucket. This bound is given by Lemma 6, which bounds the total delay (that is, the total dollars in the Wait bucket) as a function of the number $M$ of steal attempts (that is, the total dollars in the Steal bucket). This lemma says that for any $\epsilon > 0$, with probability at least $1 - \epsilon$, the number of dollars in the Wait bucket is at most a constant times the number of dollars in the Steal bucket plus $O(P \lg P + P \lg(1/\epsilon))$, and the expected number of dollars in the Wait bucket is at most the number in the Steal bucket.

We now add up the dollars in the three buckets and divide by $P$ to complete this proof.

The next theorem bounds the total amount of communication that a multithreaded computation executed by the Work-Stealing Algorithm performs in a distributed model. The analysis makes the assumption that at most a constant number of bytes need be communicated along a join edge to resolve the dependency.

Theorem 14 Consider the execution of any fully strict multithreaded computation with critical-path length $T_\infty$ by the Work-Stealing Algorithm on a parallel computer with $P$ processors. Then, the total number of bytes communicated has expectation $O(PT_\infty(1 + n_d)S_{\max})$ where $n_d$ is the maximum number of join edges from a thread to its parent and $S_{\max}$ is the size in bytes of the largest activation frame in the computation. Moreover, for any $\epsilon > 0$, the probability is at least $1 - \epsilon$ that the total communication incurred is $O(P(T_\infty + \lg(1/\epsilon))(1 + n_d)S_{\max})$.

Proof: We prove the bound for the expectation. The high-probability bound is analogous. By our bucketing argument, the expected number of steal attempts is at most $O(PT_\infty)$. When a thread is stolen, the communication incurred is at most $S_{\max}$. Communication also occurs whenever a join edge enters a parent thread from one of its children and the parent
has been stolen, but since each join edge accounts for at most a constant number of bytes, the communication incurred is at most $O(n_d)$ per steal. Finally, we can have communication when a child thread enables its parent and puts the parent into the child's processor's ready deque. This event can happen at most $n_d$ times for each time the parent is stolen, so the communication incurred is at most $n_d S_{\max}$ per steal. Thus, the expected total communication cost is $O(PT_\infty(1 + n_d)S_{\max})$.

The communication bounds in this theorem are existentially tight, in that there exist fully strict computations that require $\Omega(PT_\infty(1 + n_d)S_{\max})$ total communication for any execution schedule that achieves linear speedup. This result follows directly from a theorem of Wu and Kung [47], who showed that divide-and-conquer computations (a special case of fully strict computations with $n_d = 1$) require this much communication.

² With Plaxton's bound [40] for Lemma 6, this bound becomes $T_1/P + O(T_\infty)$, whenever $1/\epsilon$ is at most polynomial in $M$ and $P$.
In the case when we have $n_d = O(1)$ and the algorithm achieves linear expected speedup (that is, when $P = O(T_1/T_\infty)$), the total communication is at most $O(T_1 S_{\max})$. Moreover, if $P \ll T_1/T_\infty$, the total communication is much less than $T_1 S_{\max}$, which confirms the folk wisdom that work-stealing algorithms require much less communication than the possibly $\Theta(T_1 S_{\max})$ communication of work-sharing algorithms.

7 Conclusion

How practical are the methods analyzed in this paper? We have been actively engaged in building a C-based language called Cilk (pronounced "silk") for programming multithreaded computations [5, 8, 25, 32, 42]. Cilk is derived from the PCM "Parallel Continuation Machine" system [29], which was itself partly inspired by the research reported here. The Cilk runtime system employs the work-stealing algorithm described in this paper. Because Cilk employs a provably efficient scheduling algorithm, Cilk delivers guaranteed performance to user applications. Specifically, we have found empirically that the performance of an application written in the Cilk language can be predicted accurately using the model $T_1/P + T_\infty$.

The Cilk system currently runs on contemporary shared-memory multiprocessors, such as the Sun Enterprise, the Silicon Graphics Origin, the Intel Quad Pentium, and the DEC Alphaserver. (Earlier versions of Cilk ran on the Thinking Machines CM-5 MPP, the Intel Paragon MPP, and the IBM SP-2.) To date, applications written in Cilk include protein folding [38], graphic rendering [45], backtrack search, and the ⋆Socrates chess program [31], which won second prize in the 1995 ICCA World Computer Chess Championship running on a 1824-node Paragon at Sandia National Laboratories. Our more recent chess program, Cilkchess, won the 1996 Dutch Open Computer Chess Tournament. A team programming in Cilk won First Prize (undefeated) in the ICFP'98 Programming Contest sponsored by the International Conference on Functional Programming.
As part of our research, we have implemented a prototype runtime system for Cilk on networks of workstations. This runtime system, called Cilk-NOW [5, 11, 35], supports adaptive parallelism, where processors in a workstation environment can join a user's computation if they would be otherwise idle and yet be available immediately to leave the computation when needed again by their owners. Cilk-NOW also supports transparent fault tolerance, meaning that the user's computation can proceed even in the face of processors crashing, and yet the programmer writes the code in a completely fault-oblivious fashion. A more recent distributed implementation for clusters of SMP's is described in [42].

We have also investigated other topics related to Cilk, including distributed shared memory [6, 7, 24, 26] and debugging tools [17, 18, 22, 45]. Up-to-date information, papers, and software releases can be found on the World Wide Web at http://supertech.lcs.mit.edu/cilk.

For the case of shared-memory multiprocessors, we have recently generalized the time bound (but not the space or communication bounds) along two dimensions [1]. First, we have shown that for arbitrary (not necessarily fully strict or even strict) multithreaded computations, the expected execution time is $O(T_1/P + T_\infty)$. This bound is based on a new structural lemma and an amortized analysis using a potential function. Second, we have developed a nonblocking implementation of the work-stealing algorithm, and we have analyzed its execution time for a multiprogrammed environment in which the computation
setofprocessors,theboundspecializestomatchourpreviousbound.thenonblocking workstealerhasbeenimplementedinthehooduser-levelthreadslibrary[12,39].up-todateinformation,papers,andsoftwarereleasescanbefoundontheworldwidewebat http://www.cs.utexas.edu/users/hood. Acknowledgments ThankstoBruceMaggsofCarnegieMellon,whooutlinedthestrategyinSection6forusing adelay-sequenceargumenttoprovethetimeboundsonthework-stealingalgorithm,which ingiscontrolledbyanadversary.incasetheadversarychoosesnottogroworshrinkthe executesonasetofprocessorsthatgrowsandshrinksovertime.thisgrowingandshrink- commentsthatimprovedtheclarityofourpaper.thanksalsotoarvind,michaelhalbherr, technicalcommentsonourprobabilisticanalyses.thankstotheanonymousreferees,yanjun ZhangofSouthernMethodistUniversity,andWarrenBurtonofSimonFraserUniversityfor improvedourpreviousbounds.thankstogregplaxtonofuniversityoftexas,austinfor ChrisJoerg,BradleyKuszmaul,KeithRandall,andYuliZhouofMITforhelpfuldiscussions. References [1]NimarS.Arora,RobertD.Blumofe,andC.GregPlaxton.Threadschedulingformultiprogrammedmultiprocessors.InProceedingsoftheTenthAnnualACMSymposiumonParallel AlgorithmsandArchitectures(SPAA),pages119{129,PuertoVallarta,Mexico,June1998. [2]Arvind,RishiyurS.Nikhil,andKeshavK.Pingali.I-structures:Datastructuresforparallelcomputing.ACMTransactionsonProgrammingLanguagesandSystems,11(4):598{632, October1989. [3]GuyE.Blelloch,PhillipB.Gibbons,andYossiMatias.Provablyecientschedulingfor onparallelalgorithmsandarchitectures(spaa),pages1{12,santabarbara,california,july languageswithne-grainedparallelism.inproceedingsoftheseventhannualacmsymposium [4]GuyE.Blelloch,PhillipB.Gibbons,YossiMatias,andGirijaJ.Narlikar.Space-ecient schedulingofparallelismwithsynchronizationvariables.inproceedingsofthe9thannual 1995. [5]RobertD.Blumofe.ExecutingMultithreadedProgramsEciently.PhDthesis,DepartmentofElectricalEngineeringandComputerScience,MassachusettsInstituteofTechnology, RhodeIsland,June1997. 
ACMSymposiumonParallelAlgorithmsandArchitectures(SPAA),pages12{23,Newport, [6]RobertD.Blumofe,MatteoFrigo,ChristopherF.Joerg,CharlesE.Leiserson,andKeithH. September1995.AlsoavailableasMITLaboratoryforComputerScienceTechnicalReport MIT/LCS/TR-677. Randall.Ananalysisofdag-consistentdistributedshared-memoryalgorithms.InProceedings pages297{308,padua,italy,june1996.26 oftheeighthannualacmsymposiumonparallelalgorithmsandarchitectures(spaa),
[7] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. Dag-consistent distributed shared memory. In Proceedings of the Tenth International Parallel Processing Symposium (IPPS), pages 132–141, Honolulu, Hawaii, April 1996.
[8] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, August 1996.
[9] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356–368, Santa Fe, New Mexico, November 1994.
[10] Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM Journal on Computing, 27(1):202–229, February 1998.
[11] Robert D. Blumofe and Philip A. Lisiecki. Adaptive and reliable parallel computing on networks of workstations. In Proceedings of the USENIX 1997 Annual Technical Conference on UNIX and Advanced Computing Systems, pages 133–147, Anaheim, California, January 1997.
[12] Robert D. Blumofe and Dionisios Papadopoulos. The performance of work stealing in multiprogrammed environments. Technical Report TR-98-13, The University of Texas at Austin, Department of Computer Sciences, May 1998.
[13] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2):201–206, April 1974.
[14] F. Warren Burton. Storage management in virtual tree machines. IEEE Transactions on Computers, 37(3):321–328, March 1988.
[15] F. Warren Burton. Guaranteeing good memory bounds for parallel programs. IEEE Transactions on Software Engineering, 22(10), October 1996.
[16] F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture, pages 187–194, Portsmouth, New Hampshire, October 1981.
[17] Guang-Ien Cheng. Algorithms for data-race detection in multithreaded programs. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.
[18] Guang-Ien Cheng, Mingdong Feng, Charles E. Leiserson, Keith H. Randall, and Andrew F. Stark. Detecting data races in Cilk programs that use locks. In Tenth ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.
[19] David E. Culler and Arvind. Resource requirements of dataflow programs. In Proceedings of the 15th Annual International Symposium on Computer Architecture (ISCA), pages 141–150, Honolulu, Hawaii, May 1988. Also available as MIT Laboratory for Computer Science, Computation Structures Group Memo 280.
[20] Derek L. Eager, John Zahorjan, and Edward D. Lazowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, 38(3):408–423, March 1989.
[21] R. Feldmann, P. Mysliwietz, and B. Monien. Game tree search on a massively parallel system. Advances in Computer Chess 7, pages 203–219, 1993.
[22] Mingdong Feng and Charles E. Leiserson. Efficient detection of determinacy races in Cilk programs. In Ninth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1–11, Newport, Rhode Island, June 1997.
[23] Raphael Finkel and Udi Manber. DIB – a distributed implementation of backtracking. ACM Transactions on Programming Languages and Systems, 9(2):235–256, April 1987.
[24] Matteo Frigo. The weakest reasonable memory model. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, January 1998.
[25] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Montreal, Canada, June 1998.
[26] Matteo Frigo and Victor Luchangco. Computation-centric memory models. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 240–249, Puerto Vallarta, Mexico, June 1998.
[27] R. L. Graham. Bounds for certain multiprocessing anomalies. The Bell System Technical Journal, 45:1563–1581, November 1966.
[28] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416–429, March 1969.
[29] Michael Halbherr, Yuli Zhou, and Chris F. Joerg. MIMD-style parallel programming with continuation-passing threads. In Proceedings of the 2nd International Workshop on Massive Parallelism: Hardware, Software, and Applications, Capri, Italy, September 1994.
[30] Robert H. Halstead, Jr. Implementation of Multilisp: Lisp on a multiprocessor. In Conference Record of the 1984 ACM Symposium on Lisp and Functional Programming, pages 9–17, Austin, Texas, August 1984.
[31] Chris Joerg and Bradley C. Kuszmaul. Massively parallel chess. In Proceedings of the Third DIMACS Parallel Implementation Challenge, Rutgers University, New Jersey, October 1994.
[32] Christopher F. Joerg. The Cilk System for Parallel Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, January 1996.
[33] Richard M. Karp and Yanjun Zhang. Randomized parallel algorithms for backtrack search and branch-and-bound computation. Journal of the ACM, 40(3):765–789, July 1993.
[34] Bradley C. Kuszmaul. Synchronized MIMD Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1994. Also available as MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-645.
[35] Philip Lisiecki. Macroscheduling in the Cilk network of workstations environment. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1996.
[36] Pangfeng Liu, William Aiello, and Sandeep Bhatt. An atomic model for message-passing. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 154–163, Velen, Germany, June 1993.
[37] Eric Mohr, David A. Kranz, and Robert H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2(3):264–280, July 1991.
[38] Vijay S. Pande, Christopher F. Joerg, Alexander Yu Grosberg, and Toyoichi Tanaka. Enumerations of the Hamiltonian walks on a cubic sublattice. Journal of Physics A, 27, 1994.
[39] Dionysios P. Papadopoulos. Hood: A user-level thread library for multiprogramming multiprocessors. Master's thesis, Department of Computer Sciences, The University of Texas at Austin, August 1998.
[40] C. Gregory Plaxton, August 1994. Private communication.
[41] Abhiram Ranade. How to emulate shared memory. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science (FOCS), pages 185–194, Los Angeles, California, October 1987.
[42] Keith H. Randall. Cilk: Efficient Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.
[43] Larry Rudolph, Miriam Slivkin-Allalouf, and Eli Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 237–245, Hilton Head, South Carolina, July 1991.
[44] Carlos A. Ruggiero and John Sargeant. Control of parallelism in the Manchester dataflow machine. In Functional Programming Languages and Computer Architecture, number 274 in Lecture Notes in Computer Science, pages 1–15. Springer-Verlag, 1987.
[45] Andrew F. Stark. Debugging multithreaded programs that incorporate user-level locking. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.
[46] Mark T. Vandevoorde and Eric S. Roberts. WorkCrews: An abstraction for controlling parallelism. International Journal of Parallel Programming, 17(4):347–366, August 1988.
[47] I-Chen Wu and H. T. Kung. Communication complexity for parallel divide-and-conquer. In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science (FOCS), pages 151–162, San Juan, Puerto Rico, October 1991.
[48] Y. Zhang and A. Ortynski. The efficiency of randomized parallel backtrack search. In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing, Dallas, Texas, October 1994.