Scheduling Multithreaded Computations by Work Stealing

Robert D. Blumofe, The University of Texas at Austin
Charles E. Leiserson, MIT Laboratory for Computer Science

This research was supported in part by the Advanced Research Projects Agency under Contract N00014-94-1-0985. This research was done while Robert D. Blumofe was at the MIT Laboratory for Computer Science and was supported in part by an ARPA High-Performance Computing Graduate Fellowship.

Abstract

This paper studies the problem of efficiently scheduling fully strict (i.e., well-structured) multithreaded computations on parallel computers. A popular and practical method of scheduling this kind of dynamic MIMD-style computation is "work stealing," in which processors needing work steal computational threads from other processors. In this paper, we give the first provably good work-stealing scheduler for multithreaded computations with dependencies.

Specifically, our analysis shows that the expected time to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. Moreover, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. We also show that the expected total communication of the algorithm is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This communication bound justifies the folk wisdom that work-stealing schedulers are more communication efficient than their work-sharing counterparts. All three of these bounds are existentially optimal to within a constant factor.

1 Introduction

For efficient execution of a dynamically growing "multithreaded" computation on a MIMD-style parallel computer, a scheduling algorithm must ensure that enough threads are active concurrently to keep the processors busy. Simultaneously, it should ensure that the number of concurrently active threads remains within reasonable limits so that memory requirements are not unduly large. Moreover, the scheduler should also try to maintain related threads on the same processor, if possible, so that communication between them can be minimized. Needless to say, achieving all these goals simultaneously can be difficult.

Two scheduling paradigms have arisen to address the problem of scheduling multithreaded computations: work sharing and work stealing. In work sharing, whenever a processor generates new threads, the scheduler attempts to migrate some of them to other processors in hopes of distributing the work to underutilized processors. In work stealing, however, underutilized processors take the initiative: they attempt to "steal" threads from other processors. Intuitively, the migration of threads occurs less frequently with work stealing than with work sharing, since when all processors have work to do, no threads are migrated by a work-stealing scheduler, but threads are always migrated by a work-sharing scheduler.

The work-stealing idea dates back at least as far as Burton and Sleep's research on parallel execution of functional programs [16] and Halstead's implementation of Multilisp [30]. These authors point out the heuristic benefits of work stealing with regards to space and communication. Since then, many researchers have implemented variants on this strategy [11, 21, 23, 29, 34, 37, 46]. Rudolph, Slivkin-Allalouf, and Upfal [43] analyzed a randomized work-stealing strategy for load balancing independent jobs on a parallel computer, and Karp and Zhang [33] analyzed a randomized work-stealing strategy for parallel backtrack search. Recently, Zhang and Ortynski [48] have obtained good bounds on the communication requirements of this algorithm.

In this paper, we present and analyze a work-stealing algorithm for scheduling "fully strict" (well-structured) multithreaded computations. This class of computations encompasses both backtrack search computations [33, 48] and divide-and-conquer computations [47], as well as dataflow computations [2] in which threads may stall due to a data dependency. We analyze our algorithms in a stringent atomic-access model similar to the atomic message-passing model of [36] in which concurrent accesses to the same data structure are serially queued by an adversary.
Our main contribution is a randomized work-stealing scheduling algorithm for fully strict multithreaded computations which is provably efficient in terms of time, space, and communication. We prove that the expected time to execute a fully strict computation on $P$ processors using our work-stealing scheduler is $T_1/P + O(T_\infty)$, where $T_1$ is the minimum serial execution time of the multithreaded computation and $T_\infty$ is the minimum execution time with an infinite number of processors. In addition, the space required by the execution is at most $S_1 P$, where $S_1$ is the minimum serial space requirement. These bounds are better than previous bounds for work-sharing schedulers [10], and the work-stealing scheduler is much simpler and eminently practical. Part of this improvement is due to our focusing on fully strict computations, as compared to the (general) strict computations studied in [10]. We also prove that the expected total communication of the execution is at most $O(P T_\infty (1 + n_d) S_{\max})$, where $S_{\max}$ is the size of the largest activation record of any thread and $n_d$ is the maximum number of times that any thread synchronizes with its parent. This bound is existentially tight to within a constant factor, meeting the lower bound of Wu and Kung [47] for communication in parallel divide-and-conquer. In contrast, work-sharing schedulers have nearly worst-case behavior for communication. Thus, our results bolster the folk wisdom that work stealing is superior to work sharing.

Others have studied and continue to study the problem of efficiently managing the space requirements of parallel computations. Culler and Arvind [19] and Ruggiero and Sargeant [44] give heuristics for limiting the space required by dataflow programs. Burton [14] shows how to limit space in certain parallel computations without causing deadlock. More recently, Burton [15] has developed and analyzed a scheduling algorithm with provably good time and space bounds. Blelloch, Gibbons, Matias, and Narlikar [3, 4] have also recently developed and analyzed scheduling algorithms with provably good time and space bounds. It is not yet clear whether any of these algorithms are as practical as work stealing.

The remainder of this paper is organized as follows. In Section 2 we review the graph-theoretic model of multithreaded computations introduced in [10], which provides a theoretical basis for analyzing schedulers. Section 3 gives a simple scheduling algorithm which uses a central queue. This "busy-leaves" algorithm forms the basis for our randomized work-stealing algorithm, which we present in Section 4. In Section 5 we introduce the atomic-access model that we use to analyze execution time and communication costs for the work-stealing algorithm, and we present and analyze a combinatorial "balls and bins" game that we use to derive a bound on the contention that arises in random work stealing. We then use this bound along with a delay-sequence argument [41] in Section 6 to analyze the execution time and communication cost of the work-stealing algorithm. To conclude, in Section 7 we briefly discuss how the theoretical ideas in this paper have been applied to the Cilk programming language and runtime system [8, 25], as well as make some concluding remarks.

2 A model of multithreaded computation

This section reprises the graph-theoretic model of multithreaded computation introduced in [10]. We also define what it means for computations to be "fully strict." We conclude with a statement of the greedy-scheduling theorem, which is an adaptation of theorems by Brent [13] and Graham [27, 28] on dag scheduling.
A multithreaded computation is composed of a set of threads, each of which is a sequential ordering of unit-time instructions. The instructions are connected by dependency edges, which provide a partial ordering on which instructions must execute before which other instructions. In Figure 1, for example, each shaded block is a thread with circles representing instructions and the horizontal edges, called continue edges, representing the sequential ordering. Thread $\Gamma_5$ of this example contains 3 instructions: $v_{10}$, $v_{11}$, and $v_{12}$. The instructions of a thread must execute in this sequential order from the first (leftmost) instruction to the last (rightmost) instruction. In order to execute a thread, we allocate for it a chunk of memory, called an activation frame, that the instructions of the thread can use to store the values on which they compute.

Figure 1: A multithreaded computation. This computation contains 23 instructions $v_1, v_2, \ldots, v_{23}$ and 6 threads $\Gamma_1, \Gamma_2, \ldots, \Gamma_6$.

A $P$-processor execution schedule for a multithreaded computation determines which processors of a $P$-processor parallel computer execute which instructions at each step. An execution schedule depends on the particular multithreaded computation and the number $P$ of processors. In any given step of an execution schedule, each processor executes at most one instruction.

During the course of its execution, a thread may create, or spawn, other threads. Spawning a thread is like a subroutine call, except that the spawning thread can operate concurrently with the spawned thread. We consider spawned threads to be children of the thread that did the spawning, and a thread may spawn as many children as it desires. In this way, threads are organized into a spawn tree as indicated in Figure 1 by the downward-pointing, shaded dependency edges, called spawn edges, that connect threads to their spawned children. The spawn tree is the parallel analog of a call tree. In our example computation, the spawn tree's root thread $\Gamma_1$ has two children, $\Gamma_2$ and $\Gamma_6$, and thread $\Gamma_2$ has three children, $\Gamma_3$, $\Gamma_4$, and $\Gamma_5$. Threads $\Gamma_3$, $\Gamma_4$, $\Gamma_5$, and $\Gamma_6$, which have no children, are leaf threads.

Each spawn edge goes from a specific instruction (the instruction that actually does the spawn operation) in the parent thread to the first instruction of the child thread. An execution schedule must obey this edge in that no processor may execute an instruction in a spawned child thread until after the spawning instruction in the parent thread has been executed. In our example computation (Figure 1), due to the spawn edge $(v_6, v_7)$, instruction $v_7$ cannot be executed until after the spawning instruction $v_6$. Consistent with our unit-time model of instructions, a single instruction may spawn at most one child. When the spawning instruction executes, it allocates an activation frame for the new child thread. Once a thread has been spawned and its frame has been allocated, we say the thread is alive or living. When the last instruction of a thread executes, it deallocates its frame and the thread dies.
An execution schedule generally respects other dependencies besides those represented by continue and spawn edges. Consider an instruction that produces a data value to be consumed by another instruction. Such a producer/consumer relationship precludes the consuming instruction from executing until after the producing instruction. To enforce such orderings, other dependency edges, called join edges, may be required, as shown in Figure 1 by the curved edges. If the execution of a thread arrives at a consuming instruction before the producing instruction has executed, execution of the consuming thread cannot continue: the thread stalls. Once the producing instruction executes, the join dependency is resolved, which enables the consuming thread to resume its execution: the thread becomes ready. A multithreaded computation does not model the means by which join dependencies get resolved or by which unresolved join dependencies get detected. In implementation, resolution and detection can be accomplished using mechanisms such as join counters [8], futures [30], or I-structures [2].

We make two technical assumptions regarding join edges. We first assume that each instruction has at most a constant number of join edges incident on it. This assumption is consistent with our unit-time model of instructions. The second assumption is that no join edges enter the instruction immediately following a spawn. This assumption means that when a parent thread spawns a child thread, the parent cannot immediately stall. It continues to be ready to execute for at least one more instruction.

An execution schedule must obey the constraints given by the spawn, continue, and join edges of the computation. These dependency edges form a directed graph of instructions, and no processor may execute an instruction until after all of the instruction's predecessors in this graph have been executed. So that execution schedules exist, this graph must be acyclic. That is, it must be a directed acyclic graph, or dag. At any given step of an execution schedule, an instruction is ready if all of its predecessors in the dag have been executed.

We make the simplifying assumption that a parent thread remains alive until all its children die, and thus, a thread does not deallocate its activation frame until all its children's frames have been deallocated. Although this assumption is not absolutely necessary, it gives the execution a natural structure, and it will simplify our analyses of space utilization. In accounting for space utilization, we also assume that the frames hold all the values used by the computation; there is no global storage available to the computation outside the frames (or if such storage is available, then we do not account for it). Therefore, the space used at a given time in executing a computation is the total size of all frames used by all living threads at that time, and the total space used in executing a computation is the maximum such value over the course of the execution.

To summarize, a multithreaded computation can be viewed as a dag of instructions connected by dependency edges. The instructions are connected by continue edges into threads, and the threads form a spawn tree with the spawn edges. When a thread is spawned, an activation frame is allocated and this frame remains allocated as long as the thread remains alive. A living thread may be either ready or stalled due to an unresolved dependency.
A given multithreaded program when run on a given input can sometimes generate more than one multithreaded computation. In that case, we say the program is nondeterministic. If the same multithreaded computation is generated by the program on the input no matter how the computation is scheduled, then the program is deterministic. In this paper, we shall analyze multithreaded computations, not multithreaded programs. Specifically, we shall not worry about how the multithreaded computation is generated. Instead, we shall study its properties in an a posteriori fashion.

Because multithreaded computations with arbitrary dependencies can be impossible to schedule efficiently [10], we study subclasses of general multithreaded computations in which the kinds of synchronizations that can occur are restricted. A strict multithreaded computation is one in which all join edges from a thread go to an ancestor of the thread in the activation tree. In a strict computation, the only edge into a subtree (emanating from outside the subtree) is the spawn edge that spawns the subtree's root thread. For example, the computation of Figure 1 is strict, and the only edge into the subtree rooted at $\Gamma_2$ is the spawn edge $(v_2, v_3)$. Thus, strictness means that a thread cannot be invoked before all of its arguments are available, although the arguments can be garnered in parallel. A fully strict computation is one in which all join edges from a thread go to the thread's parent. A fully strict computation is, in a sense, a "well-structured" computation, in that all join edges from a subtree (of the spawn tree) emanate from the subtree's root. The example computation of Figure 1 is fully strict. Any multithreaded computation that can be executed in a depth-first manner on a single processor can be made either strict or fully strict by altering the dependency structure, possibly affecting the achievable parallelism, but not affecting the semantics of the computation [5].

We quantify and bound the execution time of a computation on a $P$-processor parallel computer in terms of the computation's "work" and "critical-path length." We define the work of the computation to be the total number of instructions and the critical-path length to be the length of a longest directed path in the dag. Our example computation (Figure 1) has work 23 and critical-path length 10. For a given computation, let $T(X)$ denote the time to execute the computation using $P$-processor execution schedule $X$, and let

$$T_P = \min_X T(X)$$

denote the minimum execution time with $P$ processors, the minimum being taken over all $P$-processor execution schedules for the computation. Then $T_1$ is the work of the computation, since a 1-processor computer can only execute one instruction at each step, and $T_\infty$ is the critical-path length, since even with arbitrarily many processors, each instruction on a path must execute serially. Notice that we must have $T_P \ge T_1/P$, because $P$ processors can execute only $P$ instructions per time step, and of course, we must have $T_P \ge T_\infty$.

Early work on dag scheduling by Brent [13] and Graham [27, 28] shows that there exist $P$-processor execution schedules $X$ with $T(X) \le T_1/P + T_\infty$. As the sum of two lower bounds, this upper bound is universally optimal to within a factor of 2. The following theorem, proved in [10, 20], extends these results minimally to show that this upper bound on $T_P$ can be obtained by greedy schedules: those in which at each step of the execution, if at least $P$ instructions are ready, then $P$ instructions execute, and if fewer than $P$ instructions are ready, then all execute.

Theorem 1 (The greedy-scheduling theorem) For any multithreaded computation with work $T_1$ and critical-path length $T_\infty$, and for any number $P$ of processors, any greedy $P$-processor execution schedule $X$ achieves $T(X) \le T_1/P + T_\infty$.

Generally, we are interested in schedules that achieve linear speedup, that is, $T(X) = O(T_1/P)$. For a greedy schedule, linear speedup occurs when the parallelism, which we define to be $T_1/T_\infty$, satisfies $T_1/T_\infty = \Omega(P)$.
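The definitions of work, critical-path length, and greedy schedule can be checked on a small example. The Python sketch below is illustrative only: the seven-instruction dag, the function names, and the first-come choice of which ready instructions to run are assumptions made here, not taken from the paper. It counts the critical-path length in instructions and confirms that one greedy schedule falls between the two lower bounds and the bound of the greedy-scheduling theorem.

```python
from collections import deque


def build_graph(nodes, edges):
    """Return successor lists and in-degree counts for a dag."""
    succs = {v: [] for v in nodes}
    indeg = {v: 0 for v in nodes}
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1
    return succs, indeg


def work(nodes):
    """T_1: the total number of unit-time instructions."""
    return len(nodes)


def critical_path_length(nodes, edges):
    """T_inf: the number of instructions on a longest directed path."""
    succs, indeg = build_graph(nodes, edges)
    depth = {v: 1 for v in nodes}
    queue = deque(v for v in nodes if indeg[v] == 0)
    while queue:  # process instructions in topological order
        u = queue.popleft()
        for v in succs[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return max(depth.values())


def greedy_schedule_length(nodes, edges, P):
    """Steps used by one greedy schedule: run min(P, #ready) instructions per step."""
    succs, indeg = build_graph(nodes, edges)
    ready = [v for v in nodes if indeg[v] == 0]
    steps = 0
    while ready:
        batch, ready = ready[:P], ready[P:]
        for u in batch:
            for v in succs[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)  # becomes ready for a later step
        steps += 1
    return steps


# Hypothetical dag (not Figure 1): a fork at 1, two chains, a join at 6.
NODES = [1, 2, 3, 4, 5, 6, 7]
EDGES = [(1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 6), (6, 7)]

T1 = work(NODES)
Tinf = critical_path_length(NODES, EDGES)
for P in (1, 2, 3, 4):
    steps = greedy_schedule_length(NODES, EDGES, P)
    # Lower bounds T_P >= T_1/P and T_P >= T_inf, and the greedy upper bound.
    assert max(T1 / P, Tinf) <= steps <= T1 / P + Tinf
```

Any other rule for picking among the ready instructions would also be greedy, so the same bound should hold for it; only the exact step count may differ.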
To quantify the space used by a given execution schedule of a computation, we define the stack depth of a thread to be the sum of the sizes of the activation frames of all its ancestors, including itself. The stack depth of a multithreaded computation is the maximum stack depth of any of its threads. We shall denote by $S_1$ the minimum amount of space possible for any 1-processor execution of a multithreaded computation, which is equal to the stack depth of the computation. Let $S(X)$ denote the space used by a $P$-processor execution schedule $X$ of a multithreaded computation. We shall be interested in those execution schedules that exhibit at most linear expansion of space, that is, $S(X) = O(S_1 P)$, which is existentially optimal to within a constant factor [10].
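As a small illustration of the stack-depth definition, the following Python sketch computes the stack depth of a spawn tree. The tree shape is the one described for Figure 1; the frame sizes, the names `PARENT`, `FRAME_SIZE`, and `stack_depth`, and the choice of $P = 2$ are assumptions made purely for illustration.

```python
def stack_depth(parent, frame_size):
    """Maximum, over all threads, of the total frame size of the thread
    and all of its ancestors in the spawn tree."""
    def depth_of(thread):
        total = 0
        while thread is not None:
            total += frame_size[thread]
            thread = parent[thread]
        return total

    return max(depth_of(t) for t in parent)


# Spawn tree described for Figure 1: G1 is the root with children G2 and G6;
# G2 has children G3, G4, and G5.
PARENT = {"G1": None, "G2": "G1", "G3": "G2", "G4": "G2", "G5": "G2", "G6": "G1"}
# Hypothetical frame sizes; the text does not give any.
FRAME_SIZE = {"G1": 4, "G2": 2, "G3": 1, "G4": 3, "G5": 1, "G6": 2}

S1 = stack_depth(PARENT, FRAME_SIZE)  # deepest path here is G1 -> G2 -> G4
P = 2
space_bound = S1 * P  # the linear-expansion target S_1 * P
```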

3 The busy-leaves property

Once a thread $\Gamma$ has been spawned in a strict computation, a single processor can complete the execution of the entire subcomputation rooted at $\Gamma$ even if no other progress is made on other parts of the computation. In other words, from the time the thread $\Gamma$ is spawned until the time $\Gamma$ dies, there is always at least one thread from the subcomputation rooted at $\Gamma$ that is ready. In particular, no leaf thread in a strict multithreaded computation can stall. As we shall see, this property allows an execution schedule to keep the leaves "busy." By combining this "busy-leaves" property with the greedy property, we derive execution schedules that simultaneously exhibit linear speedup and linear expansion of space.

In this section, we show that for any number $P$ of processors and any strict multithreaded computation with work $T_1$, critical-path length $T_\infty$, and stack depth $S_1$, there exists a $P$-processor execution schedule $X$ that achieves time $T(X) \le T_1/P + T_\infty$ and space $S(X) \le S_1 P$ simultaneously. We give a simple online $P$-processor parallel algorithm, the Busy-Leaves Algorithm, to compute such a schedule. This simple algorithm will form the basis for the randomized work-stealing algorithm presented in Section 4.

The Busy-Leaves Algorithm operates online in the following sense. Before the $t$th step, the algorithm has computed and executed the first $t-1$ steps of the execution schedule. At the $t$th step, the algorithm uses only information from the portion of the computation revealed so far in the execution to compute and execute the $t$th step of the schedule. In particular, it does not use any information from instructions not yet executed or threads not yet spawned.

The Busy-Leaves Algorithm maintains all living threads in a single thread pool which is uniformly available to all $P$ processors. When spawns occur, new threads are added to this global pool, and when a processor needs work, it removes a ready thread from the pool. Though we describe the algorithm as a $P$-processor parallel algorithm, we shall not analyze it as such. Specifically, in computing the $t$th step of the schedule, we allow each processor to add threads to the thread pool and delete threads from it. Thus, we ignore the effects of processors contending for access to the pool. In fact, we shall only analyze properties of the schedule itself and ignore the cost incurred by the algorithm in computing the schedule. (Scheduling overheads will be analyzed for the randomized work-stealing algorithm, however.)

The Busy-Leaves Algorithm operates as follows. The algorithm begins with the root thread in the global thread pool and all processors idle. At the beginning of each step, each processor either is idle or has a thread to work on. Those processors that are idle begin the step by attempting to remove any ready thread from the pool. If there are sufficiently many ready threads in the pool to satisfy all of the idle processors, then every idle processor gets a ready thread to work on. Otherwise, some processors remain idle. Then, each processor that has a thread to work on executes the next instruction from that thread. In general, once a processor has a thread, call it $\Gamma_a$, to work on, it executes an instruction from $\Gamma_a$ at each step until the thread either spawns, stalls, or dies, in which case, it performs according to the following rules.

➊ Spawns: If the thread $\Gamma_a$ spawns a child $\Gamma_b$, then the processor finishes the current step by returning $\Gamma_a$ to the thread pool. The processor begins the next step working on $\Gamma_b$.

➋ Stalls: If the thread $\Gamma_a$ stalls, then the processor finishes the current step by returning $\Gamma_a$ to the thread pool. The processor begins the next step idle.

➌ Dies: If the thread $\Gamma_a$ dies, then the processor finishes the current step by checking to see if $\Gamma_a$'s parent thread $\Gamma_b$ currently has any living children. If $\Gamma_b$ has no live children and no other processor is working on $\Gamma_b$, then the processor takes $\Gamma_b$ from the pool and begins the next step working on $\Gamma_b$. Otherwise, the processor begins the next step idle.

Figure 2: A 2-processor execution schedule computed by the Busy-Leaves Algorithm for the computation of Figure 1. This schedule lists the living threads in the global thread pool at each step just after each idle processor has removed a ready thread. It also lists the ready thread being worked on and the instruction executed by each of the 2 processors, $p_1$ and $p_2$, at each step. Living threads that are ready are listed in bold. The other living threads are stalled. (The table itself, with columns for step, thread pool, and processor activity over 14 steps, is not recoverable from this transcription.)

Figure 2 illustrates these three rules in a 2-processor execution schedule computed by the Busy-Leaves Algorithm on the computation of Figure 1. Rule ➊: At step 2, processor $p_1$ working on thread $\Gamma_1$ executes $v_2$ which spawns the child $\Gamma_2$, so $p_1$ places $\Gamma_1$ back in the pool (to be picked up at the beginning of the next step by the idle $p_2$) and begins the next step working on $\Gamma_2$. Rule ➋: At step 8, processor $p_2$ working on thread $\Gamma_1$ executes $v_{21}$ and $\Gamma_1$ stalls, so $p_2$ returns $\Gamma_1$ to the pool and begins the next step idle (and remains idle since the thread pool contains no ready threads). Rule ➌: At step 13, processor $p_1$ working on $\Gamma_2$ executes $v_{15}$ and $\Gamma_2$ dies, so $p_1$ retrieves the parent $\Gamma_1$ from the pool and begins the next step working on $\Gamma_1$.

Besides being greedy, for any strict computation, the schedule computed by the Busy-Leaves Algorithm maintains the busy-leaves property: at every time step during the execution, every leaf in the "spawn subtree" has a processor working on it. We define the spawn subtree at any time step $t$ to be the portion of the spawn tree consisting of just those threads that are alive at step $t$. To restate the busy-leaves property, at every time step, every living thread that has no living descendants has a processor working on it. We shall now prove this fact and show that it implies linear expansion of space. It is worth noting that not every multithreaded computation has a schedule that maintains the busy-leaves property, but every strict multithreaded computation does. We begin by showing that any schedule that maintains the busy-leaves property exhibits linear expansion of space.

Lemma 2 For any multithreaded computation with stack depth $S_1$, any $P$-processor execution schedule $X$ that maintains the busy-leaves property uses space bounded by $S(X) \le S_1 P$.

Proof: The busy-leaves property implies that at all time steps $t$, the spawn subtree has at most $P$ leaves. For each such leaf, the space used by it and all of its ancestors is at most $S_1$, and therefore, the space in use at any time step $t$ is at most $S_1 P$.

For schedules that maintain the busy-leaves property, the upper bound $S_1 P$ is conservative. By charging $S_1$ space for each busy leaf, we may be overcharging. For some computations, by knowing that the schedule preserves the busy-leaves property, we can appeal directly to the fact that the spawn subtree never has more than $P$ leaves to obtain tight bounds on space usage [6].

We finish this section by showing that for strict computations, the Busy-Leaves Algorithm computes a schedule that is both greedy and maintains the busy-leaves property.

Theorem 3 For any number $P$ of processors and any strict multithreaded computation with work $T_1$, critical-path length $T_\infty$, and stack depth $S_1$, the Busy-Leaves Algorithm computes a $P$-processor execution schedule $X$ whose execution time satisfies $T(X) \le T_1/P + T_\infty$ and whose space satisfies $S(X) \le S_1 P$.

Proof: The time bound follows directly from the greedy-scheduling theorem (Theorem 1), since the Busy-Leaves Algorithm computes a greedy schedule. The space bound follows from Lemma 2 if we can show that the Busy-Leaves Algorithm maintains the busy-leaves property. We prove this fact by induction on the number of steps. At the first step of the algorithm, the spawn subtree contains just the root thread which is a leaf, and some processor is working on it. We must show that all of the algorithm rules preserve the busy-leaves property. When a processor has a thread $\Gamma_a$ to work on, it executes instructions from that thread until it either spawns, stalls, or dies. Rule ➊: If $\Gamma_a$ spawns a child $\Gamma_b$, then $\Gamma_a$ is not a leaf (even if it was before) and $\Gamma_b$ is a leaf. In this case, the processor works on $\Gamma_b$, so the new leaf is busy. Rule ➋: If $\Gamma_a$ stalls, then $\Gamma_a$ cannot be a leaf since in a strict computation, the unresolved dependency must come from a descendant. Rule ➌: If $\Gamma_a$ dies, then its parent thread $\Gamma_b$ may turn into a leaf. In this case, the processor works on $\Gamma_b$ unless some other processor already is, so the new leaf is guaranteed to be busy.

We now know that every strict multithreaded computation has an efficient execution schedule, and we know how to find it. But these facts take us only so far. Execution schedules must be computed efficiently online, and though the Busy-Leaves Algorithm does compute efficient execution schedules and does operate online, it surely does not do so efficiently, except possibly in the case of small-scale symmetric multiprocessors. This lack of scalability is a consequence of employing a single centralized thread pool at which all processors must contend for access. In the next section, we present a distributed online scheduling algorithm, and in the following sections, we prove that it is both efficient and scalable.
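The busy-leaves property and the charging argument of Lemma 2 can be made concrete on a single snapshot of an execution. The Python sketch below is a hypothetical illustration: the snapshot (which threads are living and which are being worked on), the frame sizes, and all function names are assumptions, not data from Figure 2. It uses the model's assumption that a parent stays alive until all of its children die, so a living thread has no living descendants exactly when it has no living children.

```python
def living_leaves(parent, living):
    """Leaves of the spawn subtree: living threads with no living children."""
    has_living_child = {parent[t] for t in living if parent[t] in living}
    return {t for t in living if t not in has_living_child}


def has_busy_leaves(parent, living, working):
    """True if every leaf of the spawn subtree has a processor working on it."""
    return living_leaves(parent, living) <= set(working)


def path_size(parent, frame_size, thread):
    """Total frame size of a thread and all of its ancestors."""
    total = 0
    while thread is not None:
        total += frame_size[thread]
        thread = parent[thread]
    return total


# Spawn tree described for Figure 1; frame sizes are hypothetical.
PARENT = {"G1": None, "G2": "G1", "G3": "G2", "G4": "G2", "G5": "G2", "G6": "G1"}
FRAME_SIZE = {"G1": 4, "G2": 2, "G3": 1, "G4": 3, "G5": 1, "G6": 2}

# Hypothetical snapshot with P = 2 processors.
living = {"G1", "G2", "G3", "G6"}
leaves = living_leaves(PARENT, living)

busy = has_busy_leaves(PARENT, living, working={"G3", "G6"})
not_busy = has_busy_leaves(PARENT, living, working={"G1", "G3"})

# Charging each leaf for its whole ancestor path counts shared ancestors
# more than once, so the charge can exceed the space actually in use.
space_in_use = sum(FRAME_SIZE[t] for t in living)
charge = sum(path_size(PARENT, FRAME_SIZE, leaf) for leaf in leaves)
assert space_in_use <= charge
```

In this snapshot the root's frame is charged once per leaf, which is the overcharging that makes the $S_1 P$ bound conservative.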

4 A randomized work-stealing algorithm

In this section, we present an online, randomized work-stealing algorithm for scheduling multithreaded computations on a parallel computer. Also, we present an important structural lemma which is used at the end of this section to show that for fully strict computations, this algorithm causes at most a linear expansion of space. This lemma reappears in Section 6 to show that for fully strict computations, this algorithm achieves linear speedup and generates existentially optimal amounts of communication.

In the Work-Stealing Algorithm, the centralized thread pool of the Busy-Leaves Algorithm is distributed across the processors. Specifically, each processor maintains a ready deque data structure of threads. The ready deque has two ends: a top and a bottom. Threads can be inserted on the bottom and removed from either end. A processor treats its ready deque like a call stack, pushing and popping from the bottom. Threads that are migrated to other processors are removed from the top.

In general, a processor obtains work by removing the thread at the bottom of its ready deque. It starts working on the thread, call it $\Gamma_a$, and continues executing $\Gamma_a$'s instructions until $\Gamma_a$ spawns, stalls, dies, or enables a stalled thread, in which case, it performs according to the following rules.

➊ Spawns: If the thread $\Gamma_a$ spawns a child $\Gamma_b$, then $\Gamma_a$ is placed on the bottom of the ready deque, and the processor commences work on $\Gamma_b$.

➋ Stalls: If the thread $\Gamma_a$ stalls, its processor checks the ready deque. If the deque contains any threads, then the processor removes and begins work on the bottommost thread. If the ready deque is empty, however, the processor begins work stealing: it steals the topmost thread from the ready deque of a randomly chosen processor and begins work on it. (This work-stealing strategy is elaborated below.)

➌ Dies: If the thread $\Gamma_a$ dies, then the processor follows rule ➋ as in the case of $\Gamma_a$ stalling.

➍ Enables: If the thread $\Gamma_a$ enables a stalled thread $\Gamma_b$, the now-ready thread $\Gamma_b$ is placed on the bottom of the ready deque of $\Gamma_a$'s processor.

A thread can simultaneously enable a stalled thread and stall or die, in which case we first perform rule ➍ for enabling and then rule ➋ for stalling or rule ➌ for dying. Except for rule ➍ for the case when a thread enables a stalled thread, these rules are analogous to the rules of the Busy-Leaves Algorithm, and as we shall see, rule ➍ is needed to ensure that the algorithm maintains important structural properties, including the busy-leaves property.

The Work-Stealing Algorithm begins with all ready deques empty. The root thread of the multithreaded computation is placed in the ready deque of one processor, while the other processors start work stealing.

When a processor begins work stealing, it operates as follows. The processor becomes a thief and attempts to steal work from a victim processor chosen uniformly at random. The thief queries the ready deque of the victim, and if it is nonempty, the thief removes and begins work on the top thread. If the victim's ready deque is empty, however, the thief tries again, picking another victim at random.
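The ready-deque discipline described above (push and pop at the bottom, steal from the top of a random victim) can be sketched in a few lines of Python. This is a sequential toy model under stated assumptions, not the paper's implementation: the `Processor` class, the `obtain_work` helper, the `max_attempts` cutoff, and the decision to exclude the thief itself from the set of victims are all choices made here for illustration, and real concurrency and the atomic-access model are ignored.

```python
import random
from collections import deque


class Processor:
    """A processor with a ready deque; left end is the top, right end the bottom."""

    def __init__(self, name):
        self.name = name
        self.ready = deque()

    def push_bottom(self, thread):
        self.ready.append(thread)

    def pop_bottom(self):
        return self.ready.pop() if self.ready else None

    def steal_top(self):
        return self.ready.popleft() if self.ready else None


def obtain_work(proc, processors, rng, max_attempts=100):
    """Take the bottommost thread of our own deque; if the deque is empty,
    repeatedly pick a random victim and try to steal its topmost thread."""
    thread = proc.pop_bottom()
    if thread is not None:
        return thread
    victims = [q for q in processors if q is not proc]  # assumption: no self-steal
    if not victims:
        return None
    for _ in range(max_attempts):  # cutoff only so that the sketch terminates
        victim = rng.choice(victims)
        thread = victim.steal_top()
        if thread is not None:
            return thread
    return None


rng = random.Random(0)
p1, p2 = Processor("p1"), Processor("p2")
procs = [p1, p2]

# p1 works down a spawn chain: each spawn pushes the parent on the bottom.
p1.push_bottom("G1")  # G1 spawned G2
p1.push_bottom("G2")  # G2 spawned G3; p1 is now working on G3

stolen = obtain_work(p2, procs, rng)   # p2's deque is empty, so it steals the top
resumed = obtain_work(p1, procs, rng)  # G3 dies or stalls: p1 pops its own bottom
```

In this run the thief receives the oldest thread on the victim's deque, while the victim resumes the most recently suspended one, which matches the call-stack view of the ready deque.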

Figure 3: The structure of a processor's ready deque. The black instruction in each thread indicates the thread's currently ready instruction. Only thread $\Gamma_k$ may have been worked on since it last spawned a child. The dashed edges are the "deque edges" introduced in Section 6.

We now state and prove an important lemma on the structure of threads in the ready deque of any processor during the execution of a fully strict computation. This lemma is used later in this section to analyze execution space and in Section 6 to analyze execution time and communication. Figure 3 illustrates the lemma.

Lemma 4 In the execution of any fully strict multithreaded computation by the Work-Stealing Algorithm, consider any processor $p$ and any given time step at which $p$ is working on a thread. Let $\Gamma_0$ be the thread that $p$ is working on, let $k$ be the number of threads in $p$'s ready deque, and let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote the threads in $p$'s ready deque ordered from bottom to top, so that $\Gamma_1$ is the bottommost and $\Gamma_k$ is the topmost. If we have $k > 0$, then the threads in $p$'s ready deque satisfy the following properties:

➀ For $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$.

➁ If we have $k > 1$, then for $i = 1, 2, \ldots, k-1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$.

Proof: The proof is a straightforward induction on execution time. Execution begins with the root thread in some processor's ready deque and all other ready deques empty, so the lemma vacuously holds at the outset. Now, consider any step of the algorithm at which processor $p$ executes an instruction from thread $\Gamma_0$. Let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote the $k$ threads in $p$'s ready deque before the step, and suppose that either $k = 0$ or both properties hold. Let $\Gamma'_0$ denote the thread (if any) being worked on by $p$ after the step, and let $\Gamma'_1, \Gamma'_2, \ldots, \Gamma'_{k'}$ denote the $k'$ threads in $p$'s ready deque after the step. We now look at the rules of the algorithm and show that they all preserve the lemma. That is, either $k' = 0$ or both properties hold after the step.

Rule ➊: If $\Gamma_0$ spawns a child, then $p$ pushes $\Gamma_0$ onto the bottom of the ready deque and commences work on the child. Thus, $\Gamma'_0$ is the child, we have $k' = k + 1 > 0$, and for $j = 1, 2, \ldots, k'$, we have $\Gamma'_j = \Gamma_{j-1}$. See Figure 4. Now, we can check both properties. Property ➀: If $k' > 1$, then for $j = 2, 3, \ldots, k'$, thread $\Gamma'_j$ is the parent of $\Gamma'_{j-1}$, since before the spawn we have $k > 0$, which means that for $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$.

Moreover, $\Gamma'_1$ is obviously the parent of $\Gamma'_0$. Property ➁: If $k' > 2$, then for $j = 2, 3, \ldots, k'-1$, thread $\Gamma'_j$ has not been worked on since it spawned $\Gamma'_{j-1}$, because before the spawn we have $k > 1$, which means that for $i = 1, 2, \ldots, k-1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$. Finally, thread $\Gamma'_1$ has not been worked on since it spawned $\Gamma'_0$, because the spawn only just occurred.

Figure 4: The ready deque of a processor before and after the thread $\Gamma_0$ that it is working on spawns a child. (Note that the threads $\Gamma_0$ and $\Gamma'_0$ are not actually in the deque; they are the threads being worked on before and after the spawn.) (a) Before spawn. (b) After spawn.

Rules ➋ and ➌: If $\Gamma_0$ stalls or dies, then we have two cases to consider. If $k = 0$, the ready deque is empty, so the processor commences work stealing, and when the processor steals and begins work on a thread, we have $k' = 0$. If $k > 0$, the ready deque is not empty, so the processor pops the bottommost thread off the deque and commences work on it. Thus, we have $\Gamma'_0 = \Gamma_1$ (the popped thread) and $k' = k - 1$, and for $j = 1, 2, \ldots, k'$, we have $\Gamma'_j = \Gamma_{j+1}$. See Figure 5. Now, if $k' > 0$, we can check both properties. Property ➀: For $j = 1, 2, \ldots, k'$, thread $\Gamma'_j$ is the parent of $\Gamma'_{j-1}$, since for $i = 1, 2, \ldots, k$, thread $\Gamma_i$ is the parent of $\Gamma_{i-1}$. Property ➁: If $k' > 1$, then for $j = 1, 2, \ldots, k'-1$, thread $\Gamma'_j$ has not been worked on since it spawned $\Gamma'_{j-1}$, because before the stall or death we have $k > 2$, which means that for $i = 2, 3, \ldots, k-1$, thread $\Gamma_i$ has not been worked on since it spawned $\Gamma_{i-1}$.

Figure 5: The ready deque of a processor before and after the thread $\Gamma_0$ that it is working on dies. (Note that the threads $\Gamma_0$ and $\Gamma'_0$ are not actually in the deque; they are the threads being worked on before and after the death.) (a) Before death. (b) After death.

Rule ➍: If $\Gamma_0$ enables a stalled thread, then due to the fully strict condition, that previously stalled thread must be $\Gamma_0$'s parent. First, we observe that we must have $k = 0$. If we have $k > 0$, then the processor's ready deque is not empty, and this parent thread must be bottommost in the ready deque. Thus, this parent thread is ready and rule ➍ does not apply. With $k = 0$, the ready deque is empty and the processor places the parent thread on the bottom of the ready deque. We have $\Gamma'_0 = \Gamma_0$ and $k' = k + 1 = 1$ with $\Gamma'_1$ denoting the newly enabled parent. We only have to check the first property. Property ➀: Thread $\Gamma'_1$ is obviously the parent of $\Gamma'_0$.

If some other processor steals a thread from processor $p$, then we must have $k > 0$, and after the steal we have $k' = k - 1$. If $k' > 0$ holds, then both properties are clearly preserved. All other actions by processor $p$ (such as work stealing or executing an instruction that does not invoke any of the above rules) clearly preserve the lemma.

Before moving on, it is worth pointing out how it may happen that thread $\Gamma_k$ has been worked on since it spawned $\Gamma_{k-1}$, since Property ➁ excludes $\Gamma_k$. This situation arises when $\Gamma_k$ is stolen from processor $p$ and then stalls on its new processor. Later, $\Gamma_k$ is reenabled by $\Gamma_{k-1}$ and brought back to processor $p$'s ready deque. The key observation is that when $\Gamma_k$ is reenabled, processor $p$'s ready deque is empty and $p$ is working on $\Gamma_{k-1}$. The other threads $\Gamma_{k-2}, \Gamma_{k-3}, \ldots, \Gamma_0$ shown in Figure 3 were spawned after $\Gamma_k$ was reenabled.

We conclude this section by bounding the space used by the Work-Stealing Algorithm executing a fully strict computation.

Theorem 5 For any fully strict multithreaded computation with stack depth $S_1$, the Work-Stealing Algorithm run on a computer with $P$ processors uses at most $S_1 P$ space.

Proof: Like the Busy-Leaves Algorithm, the Work-Stealing Algorithm maintains the busy-leaves property: at every time step of the execution, every leaf in the current spawn subtree has a processor working on it. If we can establish this fact, then Lemma 2 completes the proof.

That the Work-Stealing Algorithm maintains the busy-leaves property is a simple consequence of Lemma 4. At every time step, every leaf in the current spawn subtree must be ready and therefore must either have a processor working on it or be in the ready deque of some processor. But Lemma 4 guarantees that no leaf thread sits in a processor's ready deque while the processor works on some other thread.

With the space bound in hand, we now turn attention to analyzing the time and communication bounds for the Work-Stealing Algorithm. Before we can proceed with this analysis, however, we must take care to define a model for coping with the contention that may arise when multiple thief processors simultaneously attempt to steal from the same victim.

5 Atomic accesses and the recycling game

This section presents the "atomic-access" model that we use to analyze contention during the execution of a multithreaded computation by the Work-Stealing Algorithm. We introduce a combinatorial "balls and bins" game, which we use to bound the total amount of delay incurred by random, asynchronous accesses in this model. We shall use the results of this section in Section 6, where we analyze the Work-Stealing Algorithm.

The atomic-access model is the machine model we use to analyze the Work-Stealing Algorithm. We assume that the machine is an asynchronous parallel computer with $P$ processors, and its memory can be either distributed or shared. Our analysis assumes that concurrent accesses to the same data structure are serially queued by an adversary, as in the atomic message-passing model of [36]. This assumption is more stringent than that in the model of Karp and Zhang [33]. They assume that if concurrent steal requests are made to a deque, in one time step, one request is satisfied and all the others are denied. In the atomic-access model, we also assume that one request is satisfied, but the others are queued by an adversary, rather than being denied. Moreover, from the collection of waiting requests for a given deque, the adversary gets to choose which is serviced and which continue to wait. The only constraint on the adversary is that if there is at least one request for a deque, then the adversary cannot choose that none be serviced.

The main result of this section is to show that if requests are made randomly by $P$ processors to $P$ deques with each processor allowed at most one outstanding request, then the total amount of time that the processors spend waiting for their requests to be satisfied is likely to be proportional to the total number $M$ of requests, no matter which processors make the requests and no matter how the requests are distributed over time. In order to prove this result, we introduce a "balls and bins" game that models the effects of queueing by the adversary.

The $(P, M)$-recycling game is a combinatorial game played by the adversary, in which balls are tossed at random into bins. The parameter $P$ is the number of balls in the game, which is equal to the number of bins. The parameter $M$ is the total number of ball tosses executed by the adversary. Initially, all $P$ balls are in a reservoir separate from the $P$ bins. At each step of the game, the adversary executes the following two operations in sequence:

1. The adversary chooses some of the balls in the reservoir (possibly all and possibly none), and then for each of these balls, the adversary removes it from the reservoir, selects one of the $P$ bins uniformly and independently at random, and tosses the ball into it.

2. The adversary inspects each of the $P$ bins in turn, and for each bin that contains at least one ball, the adversary removes any one of the balls in the bin and returns it to the reservoir.

The adversary is permitted to make a total of $M$ ball tosses. The game ends when $M$ ball tosses have been made and all balls have been removed from the bins and placed back in the reservoir.

The recycling game models the servicing of steal requests by the Work-Stealing Algorithm. We can view each ball and each bin as being owned by a distinct processor. If a ball is in the reservoir, it means that the ball's owner is not making a steal request. If a ball is in a bin, it means that the ball's owner has made a steal request to the deque of the bin's owner, but that the request has not yet been satisfied. When a ball is removed from a bin and returned to the reservoir, it means that the request has been serviced.

After each step $t$ of the game, there are some number $n_t$ of balls left in the bins, which correspond to steal requests that have not been satisfied. We shall be interested in the total delay $D = \sum_{t=1}^{T} n_t$, where $T$ is the total number of steps in the game. The goal of the adversary is to make the total delay as large as possible. The next lemma shows that despite the choices that the adversary makes about which balls to toss into bins and which to return to the reservoir, the total delay is unlikely to be large.

Lemma 6 For any $\epsilon > 0$, with probability at least $1 - \epsilon$, the total delay in the $(P, M)$-recycling game is $O(M + P \lg P + P \lg(1/\epsilon))$.^1 The expected total delay is at most $M$. In other words, the total delay incurred by $M$ random requests made by $P$ processors in the atomic-access model is $O(M + P \lg P + P \lg(1/\epsilon))$ with probability at least $1 - \epsilon$, and the expected total delay is at most $M$.

Proof: We first make the observation that the strategy by which the adversary chooses a ball from each bin is immaterial, and thus, we can assume that balls are queued in their bins in a first-in-first-out (FIFO) order. The adversary removes balls from the front of the queue, and when the adversary tosses a ball, it is placed on the back of the queue. If several balls are tossed into the same bin at the same step, they can be placed on the back of the queue in any order. The reason that assuming a FIFO discipline for queuing balls in a bin does not affect the total delay is that the number of balls in a given bin at a given step is the same no matter which ball is removed, and where balls are tossed has nothing to do with which ball is tossed.

For any given ball and any given step, the step either finishes with the ball in a bin or in the reservoir. Define the delay of ball $r$ to be the random variable $\delta_r$ denoting the total number of steps that finish with ball $r$ in a bin. Then, we have

$$D = \sum_{r=1}^{P} \delta_r . \tag{1}$$

Define the $i$th cycle of a ball to be those steps in which the ball remains in a bin from the $i$th time it is tossed until it is returned to the reservoir. Define also the $i$th delay of a ball to be the number of steps in its $i$th cycle.

We shall analyze the total delay by focusing, without loss of generality, on the delay $\delta = \delta_1$ of ball 1. If we let $m$ denote the number of times that ball 1 is tossed by the adversary, and for $i = 1, 2, \ldots, m$, let $d_i$ be the random variable denoting the $i$th delay of ball 1, then we have $\delta = \sum_{i=1}^{m} d_i$.

We say that the $i$th cycle of ball 1 is delayed by another ball $r$ if the $i$th toss of ball 1 places it in some bin $k$ and ball $r$ is removed from bin $k$ during the $i$th cycle of ball 1. Since the adversary follows the FIFO rule, it follows that the $i$th cycle of ball 1 can be delayed by another ball $r$ either once or not at all. Consequently, we can decompose each random variable $d_i$ into a sum $d_i = x_{i2} + x_{i3} + \cdots + x_{iP}$ of indicator random variables, where

$$x_{ir} = \begin{cases} 1 & \text{if the $i$th cycle of ball 1 is delayed by ball $r$;} \\ 0 & \text{otherwise.} \end{cases}$$

Thus, we have

$$\delta = \sum_{i=1}^{m} \sum_{r=2}^{P} x_{ir} . \tag{2}$$

^1 Greg Plaxton of the University of Texas, Austin has improved this bound to $O(M)$ for the case when $1/\epsilon$ is at most polynomial in $M$ and $P$ [40].
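The recycling game defined earlier in this section is easy to simulate, which gives a feel for the total delay $D = \sum_t n_t$. The sketch below is a hedged illustration only: the function name `recycling_game` and the `toss_prob` parameter are hypothetical, and the adversary modeled here is a simple randomized one (each reservoir ball is tossed with probability `toss_prob`, bins are served in FIFO order), not a worst-case adversary.

```python
import random


def recycling_game(P, M, rng, toss_prob=1.0):
    """Play one (P, M)-recycling game and return the total delay D.

    Assumption: a non-adversarial strategy that tosses each reservoir
    ball with probability `toss_prob` (must be > 0) at every step.
    """
    reservoir = list(range(P))        # balls that are not in any bin
    bins = [[] for _ in range(P)]     # FIFO queue of balls for each bin
    tosses = 0
    total_delay = 0
    while tosses < M or any(bins):
        # Operation 1: toss chosen reservoir balls into uniformly random bins.
        kept = []
        for ball in reservoir:
            if tosses < M and rng.random() < toss_prob:
                bins[rng.randrange(P)].append(ball)
                tosses += 1
            else:
                kept.append(ball)
        reservoir = kept
        # Operation 2: every nonempty bin returns its front ball.
        for queue in bins:
            if queue:
                reservoir.append(queue.pop(0))
        # n_t is the number of balls still waiting in bins after this step.
        total_delay += sum(len(queue) for queue in bins)
    return total_delay
```

Averaging `recycling_game(P, M, random.Random(seed))` over many seeds gives an empirical estimate that can be compared against the expected-delay bound of Lemma 6 for this particular strategy.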

We now prove an important property of these indicator random variables. Consider any set $S$ of pairs $(i, r)$, each of which corresponds to the event that the $i$th cycle of ball 1 is delayed by ball $r$. For any such set $S$, we claim that

$$\Pr\Bigl\{ \bigwedge_{(i,r) \in S} (x_{ir} = 1) \Bigr\} \le P^{-|S|} . \tag{3}$$

The crux of proving the claim is to show that

$$\Pr\Bigl\{ x_{ir} = 1 \Bigm| \bigwedge_{(i',r') \in S'} (x_{i'r'} = 1) \Bigr\} \le 1/P , \tag{4}$$

where $S' = S - \{(i, r)\}$, whence the claim (3) follows from Bayes's Theorem.

We can derive Inequality (4) from a careful analysis of dependencies. Because the adversary follows the FIFO rule, we have that $x_{ir} = 1$ only if, when the adversary executes the $i$th toss of ball 1, it falls into whatever bin contains ball $r$, if any. A priori, this event happens with probability either $1/P$ or $0$, and hence, with probability at most $1/P$. Conditioning on any collection of events relating which balls delay this or other cycles of ball 1 cannot increase this probability, as we now argue in two cases. In the first case, the indicator random variables $x_{i'r'}$, where $i' \ne i$, tell whether other cycles of ball 1 are delayed. This information tells nothing about where the $i$th toss of ball 1 goes. Therefore, these random variables are independent of $x_{ir}$, and thus, the probability $1/P$ upper bound is not affected. In the second case, the indicator random variables $x_{ir'}$ tell whether the $i$th toss of ball 1 goes to the bin containing ball $r'$, but this information tells us nothing about whether it goes to the bin containing ball $r$, because the indicator random variables tell us nothing to relate where ball $r$ and ball $r'$ are located. Moreover, no "collusion" among the indicator random variables provides any more information, and thus Inequality (4) holds.

Equation (2) shows that the delay $\delta$ encountered by ball 1 throughout all of its cycles can be expressed as a sum of $m(P-1)$ indicator random variables. In order for $\delta$ to equal or exceed a given value $\Delta$, there must be some set containing $\Delta$ of these indicator random variables, each of which must be 1. For any specific such set, Inequality (3) says that the probability is at most $P^{-\Delta}$ that all random variables in the set are 1. Since there are $\binom{m(P-1)}{\Delta} \le (emP/\Delta)^{\Delta}$ such sets, where $e$ is the base of the natural logarithm, we have

$$\Pr\{\delta \ge \Delta\} \le \Bigl(\frac{emP}{\Delta}\Bigr)^{\Delta} P^{-\Delta} = \Bigl(\frac{em}{\Delta}\Bigr)^{\Delta} \le \epsilon/P ,$$

whenever $\Delta \ge \max\{2em, \lg P + \lg(1/\epsilon)\}$.

Although our analysis was performed for ball 1, it applies to any other ball as well. Consequently, for any given ball $r$ which is tossed $m_r$ times, the probability that its delay $\delta_r$ exceeds $\max\{2em_r, \lg P + \lg(1/\epsilon)\}$ is at most $\epsilon/P$. By Boole's inequality and Equation (1), it follows that with probability at least $1 - \epsilon$, the total delay $D$ is at most

$$D \le \sum_{r=1}^{P} \max\{2em_r, \lg P + \lg(1/\epsilon)\} = \Theta(M + P \lg P + P \lg(1/\epsilon)) ,$$

since $M = \sum_{r=1}^{P} m_r$.

The upper bound $E[D] \le M$ can be obtained as follows. Recall that each $\delta_r$ is the sum of $(P-1)m_r$ indicator random variables, each of which has expectation at most $1/P$. Therefore, by linearity of expectation, $E[\delta_r] \le m_r$. Using Equation (1) and again using linearity of expectation, we obtain $E[D] \le M$.

With this bound on the total delay incurred by $M$ random requests now in hand, we turn back to the Work-Stealing Algorithm.

6 Analysis of the work-stealing algorithm

In this section, we analyze the time and communication cost of executing a fully strict multithreaded computation with the Work-Stealing Algorithm. For any fully strict computation with work $T_1$ and critical-path length $T_\infty$, we show that the expected running time with $P$ processors, including scheduling overhead, is $T_1/P + O(T_\infty)$. Moreover, for any $\epsilon > 0$, the execution time on $P$ processors is $T_1/P + O(T_\infty + \lg P + \lg(1/\epsilon))$, with probability at least $1 - \epsilon$. We also show that the expected total communication during the execution of a fully strict computation is $O(P T_\infty (1 + n_d) S_{max})$, where $n_d$ is the maximum number of join edges from a thread to its parent and $S_{max}$ is the largest size of any activation frame.

Unlike in the Busy-Leaves Algorithm, the "ready pool" in the Work-Stealing Algorithm is distributed, and so there is no contention at a centralized data structure. Nevertheless, it is still possible for contention to arise when several thieves happen to descend on the same victim simultaneously. In this case, as we have indicated in the previous section, we make the conservative assumption that an adversary serially queues the work-stealing requests. We further assume that it takes unit time for a processor to respond to a work-stealing request. This assumption can be relaxed without materially affecting the results so that a work-stealing response takes any constant amount of time.

To analyze the running time of the Work-Stealing Algorithm executing a fully strict multithreaded computation with work $T_1$ and critical-path length $T_\infty$ on a computer with $P$ processors, we use an accounting argument. At each step of the algorithm, we collect $P$ dollars, one from each processor. At each step, each processor places its dollar in one of three buckets according to its actions at that step. If the processor executes an instruction at the step, then it places its dollar into the Work bucket. If the processor initiates a steal attempt at the step, then it places its dollar into the Steal bucket. And, if the processor merely waits for a queued steal request at the step, then it places its dollar into the Wait bucket. We shall derive the running-time bound by bounding the number of dollars in each bucket at the end of the execution, summing these three bounds, and then dividing by $P$.

We first bound the total number of dollars in the Work bucket.

Lemma 7 The execution of a fully strict multithreaded computation with work $T_1$ by the Work-Stealing Algorithm on a computer with $P$ processors terminates with exactly $T_1$ dollars in the Work bucket.

Proof: A processor places a dollar in the Work bucket only when it executes an instruction. Thus, since there are $T_1$ instructions in the computation, the execution ends with exactly $T_1$ dollars in the Work bucket.

Bounding the total dollars in the Steal bucket requires a significantly more involved "delay-sequence" argument. We first introduce the notion of a "round" of work-steal attempts, and we must also define an augmented dag that we then use to define "critical" instructions. The idea is as follows. If, during the course of the execution, a large number of steals are attempted, then we can identify a sequence of instructions (the delay sequence) in the augmented dag such that each of these steal attempts was initiated while some instruction from the sequence was critical. We then show that a critical instruction is unlikely to remain critical across a modest number of steal attempts. We can then conclude that such a delay sequence is unlikely to occur, and therefore, an execution is unlikely to suffer a large number of steal attempts.

A round of steal attempts is a set of at least $3P$ but fewer than $4P$ consecutive steal attempts such that if a steal attempt that is initiated at time step $t$ occurs in a particular round, then all other steal attempts initiated at time step $t$ are also in the same round. We can partition all of the steal attempts that occur during an execution into rounds as follows. The first round contains all steal attempts initiated at time steps $1, 2, \ldots, t_1$, where $t_1$ is the earliest time such that at least $3P$ steal attempts were initiated at or before $t_1$. We say that the first round starts at time step 1 and ends at time step $t_1$. In general, if the $i$th round ends at time step $t_i$, then the $(i+1)$st round begins at time step $t_i + 1$ and ends at the earliest time step $t_{i+1} > t_i + 1$ such that at least $3P$ steal attempts were initiated at time steps between $t_i + 1$ and $t_{i+1}$, inclusive. These steal attempts belong to round $i + 1$. By definition, each round contains at least $3P$ consecutive steal attempts. Moreover, since at most $P - 1$ steal attempts can be initiated in a single time step, each round contains fewer than $4P - 1$ steal attempts, and each round takes at least 4 steps.

The sequence of instructions that make up the delay sequence is defined with respect to an augmented dag obtained by modifying the original dag slightly. Let $G$ denote the original dag, that is, the dag consisting of the computation's instructions as vertices and its continue, spawn, and join edges as edges. The augmented dag $G'$ is the original dag $G$ together with some new edges, as follows. For every set of instructions $u$, $v$, and $w$ such that $(u, v)$ is a spawn edge and $(u, w)$ is a continue edge, the deque edge $(w, v)$ is placed in $G'$. These deque edges are shown dashed in Figure 3. In Section 2 we made the technical assumption that instruction $w$ has no incoming join edges, and so $G'$ is a dag. If $T_\infty$ is the length of a longest path in $G$, then the longest path in $G'$ has length at most $2T_\infty$. It is worth pointing out that $G'$ is only an analytical tool. The deque edges have no effect on the scheduling and execution of the computation by the Work-Stealing Algorithm.

The deque edges are the key to defining critical instructions. At any time step during the execution, we say that an unexecuted instruction $v$ is critical if every instruction that precedes $v$ (either directly or indirectly) in $G'$ has been executed, that is, if for every instruction $w$ such that there is a directed path from $w$ to $v$ in $G'$, instruction $w$ has been executed. A critical instruction must be ready, since $G'$ contains every edge of $G$, but a ready instruction may or may not be critical. Intuitively, the structural properties of a ready deque enumerated in Lemma 4 guarantee that if a thread is deep in a ready deque, then its current instruction cannot be critical, because the predecessor of the thread's current instruction across the deque edge has not yet been executed.

We now formalize our definition of a delay sequence.

Definition 8 A delay sequence is a 3-tuple $(U, R, \pi)$ satisfying the following conditions:

• $U = (u_1, u_2, \ldots, u_L)$ is a maximal directed path in $G'$. Specifically, for $i = 1, 2, \ldots, L-1$, the edge $(u_i, u_{i+1})$ belongs to $G'$, instruction $u_1$ has no incoming edges in $G'$ (instruction $u_1$ must be the first instruction of the root thread), and instruction $u_L$ has no outgoing edges in $G'$ (instruction $u_L$ must be the last instruction of the root thread).

• $R$ is a positive integer number of steal-attempt rounds.

• $\pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$ is a partition of $R$ (that is, $R = \sum_{i=1}^{L} (\pi_i + \pi'_i)$), such that $\pi'_i \in \{0, 1\}$ for each $i = 1, 2, \ldots, L$.

The partition $\pi$ induces a partition of a sequence of $R$ rounds as follows. The first piece of the partition corresponds to the first $\pi_1$ rounds. The second piece corresponds to the next $\pi'_1$ consecutive rounds after the first $\pi_1$ rounds. The third piece corresponds to the next $\pi_2$ consecutive rounds after the first $(\pi_1 + \pi'_1)$ rounds, and so on. We are interested primarily in the pieces corresponding to the $\pi_i$, not the $\pi'_i$, and so we define the $i$th group of rounds to be the $\pi_i$ consecutive rounds starting after the $r_i$th round, where $r_i = \sum_{j=1}^{i-1} (\pi_j + \pi'_j)$. Because $\pi$ is a partition of $R$ and $\pi'_i \in \{0, 1\}$, for $i = 1, 2, \ldots, L$, we have

$$\sum_{i=1}^{L} \pi_i \ge R - L . \tag{5}$$

We say that a given round of steal attempts occurs while instruction $v$ is critical if all of the steal attempts that comprise the round are initiated at time steps when $v$ is critical. In other words, $v$ must be critical throughout the entire round. A delay sequence $(U, R, \pi)$ is said to occur during an execution if for each $i = 1, 2, \ldots, L$, all $\pi_i$ rounds in the $i$th group occur while instruction $u_i$ is critical. In other words, $u_i$ must be critical throughout all $\pi_i$ rounds.

The following lemma states that if at least $R$ rounds take place during an execution, then some delay sequence $(U, R, \pi)$ must occur. In particular, if we look at any execution in which at least $R$ rounds occur, then we can identify a path $U = (u_1, u_2, \ldots, u_L)$ in the dag $G'$ and a partition $\pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$ of the first $R$ rounds, such that for each $i = 1, 2, \ldots, L$, all of the $\pi_i$ rounds in the $i$th group occur while $u_i$ is critical. Each $\pi'_i$ indicates whether $u_i$ is critical at the beginning of a round but gets executed before the round ends. Such a round cannot be part of any group, because no instruction is critical throughout.

Lemma 9 Consider the execution of a fully strict multithreaded computation with critical-path length $T_\infty$ by the Work-Stealing Algorithm on a computer with $P$ processors. If at least $4PR$ steal attempts occur during the execution, then some delay sequence $(U, R, \pi)$ must occur.
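The round partition that Lemma 9 relies on (rounds of at least $3P$ steal attempts, cut at time-step boundaries) can be made concrete with a short sketch. The function name `partition_into_rounds` and the input encoding (a list giving the number of steal attempts initiated at each time step) are assumptions made for illustration; a trailing run of attempts that never reaches $3P$ is simply left out of the result.

```python
def partition_into_rounds(attempts_per_step, P):
    """Group time steps into rounds containing at least 3P steal attempts.

    attempts_per_step[t - 1] is the number of steal attempts initiated at
    time step t (assumed to be at most P - 1 per step).
    Returns one (start_step, end_step, num_attempts) tuple per complete round.
    """
    rounds = []
    start, count = 1, 0
    for t, attempts in enumerate(attempts_per_step, start=1):
        count += attempts
        if count >= 3 * P:
            # The round ends at the earliest step where 3P attempts are reached.
            rounds.append((start, t, count))
            start, count = t + 1, 0
    return rounds
```

For example, with $P = 4$ and three attempts at every step, each round spans four steps and holds exactly 12 attempts, consistent with the "at least 4 steps" and "fewer than $4P - 1$ attempts" observations in the text.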

Proof: For a given execution in which at least $4PR$ steal attempts take place, we construct a delay sequence $(U, R, \pi)$ and show that it occurs. With at least $4PR$ steal attempts, there must be at least $R$ rounds. We construct the delay sequence by first identifying a set of instructions on a directed path in $G'$ such that for every time step during the execution, one of these instructions is critical. Then, we partition the first $R$ rounds according to when each round occurs relative to when each instruction on the path is critical.

To construct the path $U$, we work backwards from the last instruction of the root thread, which we denote by $v_1$. Let $v_{l_1}$ denote a (not necessarily immediate) predecessor instruction of $v_1$ in $G'$ with the latest execution time. Let $(v_{l_1}, \ldots, v_2, v_1)$ denote a directed path from $v_{l_1}$ to $v_1$ in $G'$. We extend this path back to the first instruction of the root thread by iterating this construction as follows. At the $i$th iteration we have an instruction $v_{l_i}$ and a directed path in $G'$ from $v_{l_i}$ to $v_1$. We let $v_{l_{i+1}}$ denote a predecessor of $v_{l_i}$ in $G'$ with the latest execution time, and let $(v_{l_{i+1}}, \ldots, v_{l_i + 1}, v_{l_i})$ denote a directed path from $v_{l_{i+1}}$ to $v_{l_i}$ in $G'$. We finish iterating the construction when we get to an iteration $k$ in which $v_{l_k}$ is the first instruction of the root thread. Our desired sequence is then $U = (u_1, u_2, \ldots, u_L)$, where $L = l_k$ and $u_i = v_{L-i+1}$ for $i = 1, 2, \ldots, L$. One can verify that at every time step of the execution, one of the $v_{l_i}$ is critical.

Now, to construct the partition $\pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$, we partition the sequence of the first $R$ rounds according to when each round occurs. We would like our partition to be such that for each round (among the first $R$ rounds), we have the property that if the round occurs while some instruction $u_i$ is critical, then the round belongs to the $i$th group. Start with $\pi_1$, and let $\pi_1$ equal the number of rounds that occur while $u_1$ is critical. All of these rounds are consecutive at the beginning of the sequence, so these rounds comprise the 1st group; that is, they are the $\pi_1$ consecutive rounds starting after the $r_1 = 0$ first rounds. Next, if the round that immediately follows those first $\pi_1$ rounds begins after $u_1$ has been executed, then we set $\pi'_1 = 0$, and we go on to $\pi_2$. Otherwise, that round begins while $u_1$ is critical and ends after $u_1$ is executed (for otherwise, it would be part of the first group), so we set $\pi'_1 = 1$, and we go on to $\pi_2$. For $\pi_2$, we let $\pi_2$ equal the number of rounds that occur while $u_2$ is critical. Note that all of these rounds are consecutive beginning after the first $r_2 = \pi_1 + \pi'_1$ rounds, so these rounds comprise the 2nd group. We continue in this fashion, letting each $\pi_i$ be the number of rounds that occur while $u_i$ is critical and letting each $\pi'_i$ be the number of rounds that begin while $u_i$ is critical but do not end until after $u_i$ is executed. As an example, we may have a round that begins while $u_i$ is critical and then ends while $u_{i+2}$ is critical, and in this case, we set $\pi'_i = 1$ and $\pi'_{i+1} = 0$. In this example, the $(i+1)$st group is empty, so we also set $\pi_{i+1} = 0$.

We conclude the proof by verifying that the $(U, R, \pi)$ as just constructed is a delay sequence and that it occurs. By construction, $U$ is a maximal path in $G'$. Now considering $\pi$, we observe that each round among the first $R$ rounds is counted exactly once in either a $\pi_i$ or a $\pi'_i$, so $\pi$ is indeed a partition of $R$. Moreover, for $i = 1, 2, \ldots, L$, at most one round can begin while the instruction $u_i$ is critical and end after $u_i$ is executed, so we have $\pi'_i \in \{0, 1\}$. Thus, $(U, R, \pi)$ is a delay sequence. Finally, we observe that, by construction, for $i = 1, 2, \ldots, L$, the $\pi_i$ rounds that comprise the $i$th group all occur while the instruction $u_i$ is critical. Therefore, the delay sequence $(U, R, \pi)$ occurs.

We now establish that a critical instruction is unlikely to remain critical across a modest number of rounds. Specifically, we first show that a critical instruction must be the ready instruction of a thread that is near the top of its processor's ready deque. We then use this fact to show that after $O(1)$ rounds, a critical instruction is very likely to be executed.

Lemma 10 At every time step during the execution of a fully strict multithreaded computation by the Work-Stealing Algorithm, each critical instruction is the ready instruction of a thread that has at most 1 thread above it in its processor's ready deque.

Proof: Consider any time step, and let $u_0$ be the critical instruction of a thread $\Gamma_0$. Since $u_0$ is critical, $\Gamma_0$ is ready. Hence, for some processor $p$, either $\Gamma_0$ is in $p$'s ready deque or $\Gamma_0$ is being worked on by $p$. If $\Gamma_0$ has more than 1 thread above it in $p$'s ready deque, then Lemma 4 guarantees that each of the at least 2 threads above $\Gamma_0$ in $p$'s ready deque is an ancestor of $\Gamma_0$. Let $\Gamma_1, \Gamma_2, \ldots, \Gamma_k$ denote $\Gamma_0$'s ancestor threads, where $\Gamma_1$ is the parent of $\Gamma_0$ and $\Gamma_k$ is the root thread. Further, for $i = 1, 2, \ldots, k$, let $u_i$ denote the instruction of thread $\Gamma_i$ that spawned thread $\Gamma_{i-1}$, and let $w_i$ denote $u_i$'s successor instruction in thread $\Gamma_i$. Because of the deque edges, each instruction $w_i$ is a predecessor of $u_0$ in $G'$, and consequently, since $u_0$ is critical, each instruction $w_i$ has been executed. Moreover, because each $w_i$ is the successor of the spawn instruction $u_i$ in thread $\Gamma_i$, each thread $\Gamma_i$ for $i = 1, 2, \ldots, k$ has been worked on since the time step at which it spawned thread $\Gamma_{i-1}$. But Lemma 4 guarantees that only the topmost thread in $p$'s ready deque can have this property. Thus, $\Gamma_1$ is the only thread that can possibly be above $\Gamma_0$ in $p$'s ready deque.

Lemma 11 Consider the execution of any fully strict multithreaded computation by the Work-Stealing Algorithm on a parallel computer with $P \ge 2$ processors. For any instruction $v$ and any number $r \ge 2$ of steal-attempt rounds, the probability that any particular set of $r$ rounds occur while the instruction $v$ is critical is at most the probability that only 0 or 1 of the steal attempts initiated in the first $r - 1$ of these rounds choose $v$'s processor, which is at most $e^{-2r+3}$.

Proof: Let $t_a$ denote the first time step at which instruction $v$ is critical, and let $p$ denote the processor in whose ready deque $v$'s thread $\Gamma$ resides at time step $t_a$. Consider any particular set of $r$ rounds, and suppose that they all occur while instruction $v$ is critical. Now, consider the steal attempts that comprise the first $r - 1$ of these rounds, of which there must be at least $3P(r-1)$. Let $t_b$ denote the time step just after the time step at which the last of these steal attempts is initiated. Note that because the last round, like every round, must take at least two (in fact, four) steps, the time step $t_b$ must occur before the time step at which instruction $v$ is executed.

We shall first show that of these $3P(r-1)$ steal attempts initiated while instruction $v$ is critical and at least 2 time steps before $v$ is executed, at most 1 of them can choose processor $p$ as its target, for otherwise, $v$ would be executed at or before $t_b$. Recall from Lemma 10 that instruction $v$ is the ready instruction of a thread $\Gamma$, which has at most 1 thread above it in $p$'s ready deque as long as $v$ is critical.

If $\Gamma$ has no threads above it, then another thread cannot be placed above it until after instruction $v$ is executed, since only by processor $p$ executing instructions from $\Gamma$ can another thread be placed above it in its ready deque. Consequently, if a steal attempt targeting processor $p$ is initiated at some time step $t \ge t_a$, we are guaranteed that instruction $v$ is executed at a time step no later than $t$, either by thread $\Gamma$ being stolen and executed or by $p$ executing the thread itself.

Now, suppose $\Gamma$ has one thread $\Gamma'$ above it in $p$'s ready deque. In this case, if a steal attempt targeting processor $p$ is initiated at time step $t \ge t_a$, then thread $\Gamma'$ gets stolen from $p$'s ready deque no later than time step $t$. Suppose further that another steal attempt targeting processor $p$ is initiated at time step $t'$, where $t_a \le t \le t' < t_b$. Then, we know that a second steal will be serviced by $p$ at or before time step $t' + 1$. If this second steal gets thread $\Gamma$, then instruction $v$ must get executed at or before time step $t' + 1 \le t_b$, which is impossible, since $v$ is executed after time step $t_b$. Consequently, this second steal must get thread $\Gamma'$, the same thread that the first steal got. But this scenario can only occur if in the intervening time period, thread $\Gamma'$ stalls and is subsequently reenabled by the execution of some instruction from thread $\Gamma$, in which case instruction $v$ must be executed before time step $t' + 1 \le t_b$, which is once again impossible.

Thus, we must have $3P(r-1)$ steal attempts, each initiated at a time step $t$ such that $t_a \le t < t_b$, and at most 1 of which targets processor $p$. The probability that either 0 or 1 of $3P(r-1)$ steal attempts chooses processor $p$ is

$$\begin{aligned}
\Bigl(1 - \frac{1}{P}\Bigr)^{3P(r-1)} + 3P(r-1)\,\frac{1}{P}\Bigl(1 - \frac{1}{P}\Bigr)^{3P(r-1)-1}
&= \Bigl(1 + 3(r-1)\frac{P}{P-1}\Bigr)\Bigl(1 - \frac{1}{P}\Bigr)^{3P(r-1)} \\
&\le (6r - 5)\Bigl(1 - \frac{1}{P}\Bigr)^{3P(r-1)} \\
&\le (6r - 5)\,e^{-3(r-1)} \\
&\le e^{-2r+3}
\end{aligned}$$

for $r \ge 2$.

We now complete the delay-sequence argument and bound the total dollars in the Steal bucket.

Lemma 12 Consider the execution of any fully strict multithreaded computation with critical-path length $T_\infty$ by the Work-Stealing Algorithm on a parallel computer with $P$ processors. For any $\epsilon > 0$, with probability at least $1 - \epsilon$, at most $O(P(T_\infty + \lg(1/\epsilon)))$ work-steal attempts occur. The expected number of steal attempts is $O(PT_\infty)$. In other words, with probability at least $1 - \epsilon$, the execution terminates with at most $O(P(T_\infty + \lg(1/\epsilon)))$ dollars in the Steal bucket, and the expected number of dollars in this bucket is $O(PT_\infty)$.

Proof: From Lemma 9, we know that if at least $4PR$ steal attempts occur, then some delay sequence $(U, R, \pi)$ must occur. Consider a particular delay sequence $(U, R, \pi)$ having $U = (u_1, u_2, \ldots, u_L)$ and $\pi = (\pi_1, \pi'_1, \pi_2, \pi'_2, \ldots, \pi_L, \pi'_L)$, with $L \le 2T_\infty$. We shall compute the probability that $(U, R, \pi)$ occurs.

Such a sequence occurs if for $i = 1, 2, \ldots, L$, each instruction $u_i$ is critical throughout all $\pi_i$ rounds in the $i$th group. From Lemma 11, we know that the probability of the $\pi_i$ rounds in the $i$th group all occurring while a given instruction $u_i$ is critical is at most the probability that only 0 or 1 of the steal attempts initiated in the first $\pi_i - 1$ of these rounds choose $v$'s processor, which is at most $e^{-2\pi_i+3}$, provided $\pi_i \ge 2$. (For those values of $i$ with $\pi_i < 2$, we shall use 1 as an upper bound on this probability.) Moreover, since the targets of the work-steal attempts in the $\pi_i$ rounds of the $i$th group are chosen independently from the targets chosen in other rounds, we can bound the probability of the particular delay sequence $(U, R, \pi)$ occurring as follows:

$$\begin{aligned}
\Pr\{(U, R, \pi) \text{ occurs}\}
&= \prod_{1 \le i \le L} \Pr\{\text{the $\pi_i$ rounds of the $i$th group occur while $u_i$ is critical}\} \\
&\le \prod_{\substack{1 \le i \le L \\ \pi_i \ge 2}} e^{-2\pi_i+3} \\
&\le \exp\Bigl[-2\Bigl(\sum_{\substack{1 \le i \le L \\ \pi_i \ge 2}} \pi_i\Bigr) + 3L\Bigr] \\
&= \exp\Bigl[-2\Bigl(\sum_{1 \le i \le L} \pi_i - \sum_{\substack{1 \le i \le L \\ \pi_i < 2}} \pi_i\Bigr) + 3L\Bigr] \\
&\le e^{-2((R-L)-L)+3L} \\
&= e^{-2R+7L} ,
\end{aligned}$$

where the second-to-last line follows from Inequality (5).

To bound the probability of some delay sequence $(U, R, \pi)$ occurring, we need to count the number of such delay sequences and multiply by the probability that a particular such sequence occurs. The directed path $U$ in the modified dag $G'$ starts at the first instruction of the root thread and ends at the last instruction of the root thread. If the original dag has degree $d$, then $G'$ has degree at most $d + 1$. Consistent with our unit-time assumption for instructions, we assume that the degree $d$ is a constant. Since the length of a longest path in $G'$ is at most $2T_\infty$, there are at most $(d+1)^{2T_\infty}$ ways of choosing the path $U = (u_1, u_2, \ldots, u_L)$. There are at most $\binom{2L+R}{R} \le \binom{4T_\infty+R}{R}$ ways to choose $\pi$, since $\pi$ partitions $R$ into $2L$ pieces. As we have just shown, a given delay sequence has at most an $e^{-2R+7L} \le e^{-2R+14T_\infty}$ chance of occurring. Multiplying these three factors together bounds the probability that any delay sequence $(U, R, \pi)$ occurs by

$$(d+1)^{2T_\infty} \binom{4T_\infty+R}{R} e^{-2R+14T_\infty} , \tag{6}$$

which is at most $\epsilon$ for $R = cT_\infty \lg d + \lg(1/\epsilon)$, where $c$ is a sufficiently large constant. Thus, the probability that at least $4PR = \Theta(P(T_\infty \lg d + \lg(1/\epsilon))) = \Theta(P(T_\infty + \lg(1/\epsilon)))$ steal attempts occur is at most $\epsilon$. The expectation bound follows, because the tail of the distribution decreases exponentially.

With bounds on the number of dollars in the Work and Steal buckets, we now state the theorem that bounds the total execution time for a fully strict multithreaded computation by the Work-Stealing Algorithm, and we complete the proof by bounding the number of dollars in the Wait bucket.
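The per-group bound $e^{-2r+3}$ from Lemma 11, which the proof of Lemma 12 multiplies across groups, can be sanity-checked numerically. The sketch below is an illustration under the assumption that each of the $3P(r-1)$ steal attempts independently targets a fixed processor with probability $1/P$; the function names are hypothetical.

```python
import math


def prob_at_most_one_hit(P, r):
    """Probability that 0 or 1 of n = 3P(r-1) independent uniform steal
    attempts choose one fixed processor out of P."""
    n = 3 * P * (r - 1)
    miss = 1.0 - 1.0 / P
    return miss ** n + n * (1.0 / P) * miss ** (n - 1)


def lemma11_bound(r):
    """The closed-form upper bound e^(-2r+3) stated in Lemma 11."""
    return math.exp(-2 * r + 3)


def check_bound(max_P=64, max_r=10):
    """Return True if the exact probability never exceeds the bound
    for 2 <= P <= max_P and 2 <= r <= max_r."""
    return all(
        prob_at_most_one_hit(P, r) <= lemma11_bound(r)
        for P in range(2, max_P + 1)
        for r in range(2, max_r + 1)
    )
```

For instance, with $P = 2$ and $r = 2$ there are six attempts and the exact probability is $7/64$, comfortably below $e^{-1}$.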

Theorem 13 Consider the execution of any fully strict multithreaded computation with work $T_1$ and critical-path length $T_\infty$ by the Work-Stealing Algorithm on a parallel computer with $P$ processors. The expected running time, including scheduling overhead, is $T_1/P + O(T_\infty)$. Moreover, for any $\epsilon > 0$, with probability at least $1 - \epsilon$, the execution time on $P$ processors is $T_1/P + O(T_\infty + \lg P + \lg(1/\epsilon))$.²

² With Plaxton's bound [40] for Lemma 6, this bound becomes $T_1/P + O(T_\infty)$, whenever $1/\epsilon$ is at most polynomial in $M$ and $P$.

Proof: Lemmas 7 and 12 bound the dollars in the Work and Steal buckets, so we now must bound the dollars in the Wait bucket. This bound is given by Lemma 6, which bounds the total delay (that is, the total dollars in the Wait bucket) as a function of the number $M$ of steal attempts (that is, the total dollars in the Steal bucket). This lemma says that for any $\epsilon > 0$, with probability at least $1 - \epsilon$, the number of dollars in the Wait bucket is at most a constant times the number of dollars in the Steal bucket plus $O(P \lg P + P \lg(1/\epsilon))$, and the expected number of dollars in the Wait bucket is at most the number in the Steal bucket. We now add up the dollars in the three buckets and divide by $P$ to complete this proof.

The next theorem bounds the total amount of communication that a multithreaded computation executed by the work-stealing algorithm performs in a distributed model. The analysis makes the assumption that at most a constant number of bytes need be communicated along a join edge to resolve the dependency.

Theorem 14 Consider the execution of any fully strict multithreaded computation with critical-path length $T_\infty$ by the work-stealing algorithm on a parallel computer with $P$ processors. Then, the total number of bytes communicated has expectation $O(PT_\infty(1 + n_d)S_{\max})$, where $n_d$ is the maximum number of join edges from a thread to its parent and $S_{\max}$ is the size in bytes of the largest activation frame in the computation. Moreover, for any $\epsilon > 0$, the probability is at least $1 - \epsilon$ that the total communication incurred is $O(P(T_\infty + \lg(1/\epsilon))(1 + n_d)S_{\max})$.

Proof: We prove the bound for the expectation. The high-probability bound is analogous. By our bucketing argument, the expected number of steal attempts is at most $O(PT_\infty)$. When a thread is stolen, the communication incurred is at most $S_{\max}$. Communication also occurs whenever a join edge enters a parent thread from one of its children and the parent has been stolen, but since each join edge accounts for at most a constant number of bytes, the communication incurred is at most $O(n_d)$ per steal. Finally, we can have communication when a child thread enables its parent and puts the parent into the child's processor's ready deque. This event can happen at most $n_d$ times for each time the parent is stolen, so the communication incurred is at most $n_d S_{\max}$ per steal. Thus, the expected total communication cost is $O(PT_\infty(1 + n_d)S_{\max})$.

The communication bounds in this theorem are existentially tight, in that there exist fully strict computations that require $\Theta(PT_\infty(1 + n_d)S_{\max})$ total communication for any execution schedule that achieves linear speedup. This result follows directly from a theorem of Wu and Kung [47], who showed that divide-and-conquer computations (a special case of fully strict computations with $n_d = 1$) require this much communication.
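The bucket accounting behind Theorem 13 can be illustrated with a toy simulation. The sketch below is an assumption-laden simplification rather than the algorithm analyzed in the paper: it unfolds a binary spawn tree with unit-time tasks, has no join edges, and ignores contention delay, so only the Work and Steal buckets receive dollars. The function name and the step-by-step serialization are hypothetical.

```python
import random
from collections import deque

def simulate_work_stealing(depth, num_procs, seed=0):
    """Toy randomized work stealing on a binary spawn tree of the given depth.

    Tasks are identified by their depth in the tree. Returns
    (steps, executed, steal_attempts).
    """
    rng = random.Random(seed)
    ready = [deque() for _ in range(num_procs)]
    ready[0].append(0)                       # root task starts on processor 0
    total_tasks = 2 ** (depth + 1) - 1       # work T_1 of the toy computation
    executed = steps = steal_attempts = 0
    while executed < total_tasks:
        steps += 1
        # Busy processors take the bottom task of their own deque.
        running = [dq.pop() if dq else None for dq in ready]
        # Idle processors each make one steal attempt on a uniformly random
        # other processor, taking the topmost task if one is available.
        loot = {}
        for thief in range(num_procs):
            if running[thief] is None and num_procs > 1:
                victim = rng.randrange(num_procs - 1)
                if victim >= thief:
                    victim += 1
                steal_attempts += 1          # one dollar in the Steal bucket
                if ready[victim]:
                    loot[thief] = ready[victim].popleft()
        # Busy processors execute one task; internal tasks spawn two children.
        for proc, task in enumerate(running):
            if task is not None:
                executed += 1                # one dollar in the Work bucket
                if task < depth:
                    ready[proc].extend((task + 1, task + 1))
        # Stolen tasks become runnable on the thief at the next step.
        for thief, task in loot.items():
            ready[thief].append(task)
    return steps, executed, steal_attempts

if __name__ == "__main__":
    steps, executed, attempts = simulate_work_stealing(depth=10, num_procs=8)
    print(steps, executed, attempts)
```

In this simplified model every processor spends each step either executing a task or attempting a steal, so `num_procs * steps` equals `executed + steal_attempts` exactly; dividing the two buckets by the number of processors gives the running time, mirroring the final step of the proof.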

In the case when we have $n_d = O(1)$ and the algorithm achieves linear expected speedup (that is, when $P = O(T_1/T_\infty)$), the total communication is at most $O(T_1 S_{\max})$. Moreover, if $P \ll T_1/T_\infty$, the total communication is much less than $T_1 S_{\max}$, which confirms the folk wisdom that work-stealing algorithms require much less communication than the possibly $\Theta(T_1 S_{\max})$ communication of work-sharing algorithms.

7 Conclusion

How practical are the methods analyzed in this paper? We have been actively engaged in building a C-based language called Cilk (pronounced "silk") for programming multithreaded computations [5, 8, 25, 32, 42]. Cilk is derived from the PCM "Parallel Continuation Machine" system [29], which was itself partly inspired by the research reported here. The Cilk runtime system employs the work-stealing algorithm described in this paper. Because Cilk employs a provably efficient scheduling algorithm, Cilk delivers guaranteed performance to user applications. Specifically, we have found empirically that the performance of an application written in the Cilk language can be predicted accurately using the model $T_1/P + T_\infty$.

The Cilk system currently runs on contemporary shared-memory multiprocessors, such as the Sun Enterprise, the Silicon Graphics Origin, the Intel Quad Pentium, and the DEC Alphaserver. (Earlier versions of Cilk ran on the Thinking Machines CM-5 MPP, the Intel Paragon MPP, and the IBM SP-2.) To date, applications written in Cilk include protein folding [38], graphic rendering [45], backtrack search, and the ⋆Socrates chess program [31], which won second prize in the 1995 ICCA World Computer Chess Championship running on a 1824-node Paragon at Sandia National Laboratories. Our more recent chess program, Cilkchess, won the 1996 Dutch Open Computer Chess Tournament. A team programming in Cilk won First Prize (undefeated) in the ICFP'98 Programming Contest sponsored by the International Conference on Functional Programming.
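The performance model $T_1/P + T_\infty$ mentioned above is easy to explore numerically. The following minimal sketch uses hypothetical helper names and made-up sample values for the work and critical-path length; it shows that the predicted speedup is close to $P$ when $P$ is much smaller than the average parallelism $T_1/T_\infty$ and levels off as $P$ approaches it.

```python
def predicted_time(t1, t_inf, p):
    """Running time predicted by the model T_1/P + T_inf."""
    return t1 / p + t_inf

def predicted_speedup(t1, t_inf, p):
    """Speedup over the one-processor work T_1 under the same model."""
    return t1 / predicted_time(t1, t_inf, p)

if __name__ == "__main__":
    t1, t_inf = 1000.0, 10.0                 # assumed sample values
    print("average parallelism:", t1 / t_inf)
    for p in (1, 10, 100, 1000):
        print(p, predicted_time(t1, t_inf, p), predicted_speedup(t1, t_inf, p))
```

The model here uses a constant of 1 in front of $T_\infty$, as in the empirical claim; the theorem itself only guarantees $O(T_\infty)$.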
As part of our research, we have implemented a prototype runtime system for Cilk on networks of workstations. This runtime system, called Cilk-NOW [5, 11, 35], supports adaptive parallelism, where processors in a workstation environment can join a user's computation if they would be otherwise idle and yet be available immediately to leave the computation when needed again by their owners. Cilk-NOW also supports transparent fault tolerance, meaning that the user's computation can proceed even in the face of processors crashing, and yet the programmer writes the code in a completely fault-oblivious fashion. A more recent distributed implementation for clusters of SMP's is described in [42].

We have also investigated other topics related to Cilk, including distributed shared memory [6, 7, 24, 26] and debugging tools [17, 18, 22, 45]. Up-to-date information, papers, and software releases can be found on the World Wide Web at http://supertech.lcs.mit.edu/cilk.

For the case of shared-memory multiprocessors, we have recently generalized the time bound (but not the space or communication bounds) along two dimensions [1]. First, we have shown that for arbitrary (not necessarily fully strict or even strict) multithreaded computations, the expected execution time is $O(T_1/P + T_\infty)$. This bound is based on a new structural lemma and an amortized analysis using a potential function. Second, we have developed a nonblocking implementation of the work-stealing algorithm, and we have analyzed its execution time for a multiprogrammed environment in which the computation

executes on a set of processors that grows and shrinks over time. This growing and shrinking is controlled by an adversary. In case the adversary chooses not to grow or shrink the set of processors, the bound specializes to match our previous bound. The nonblocking work stealer has been implemented in the Hood user-level threads library [12, 39]. Up-to-date information, papers, and software releases can be found on the World Wide Web at http://www.cs.utexas.edu/users/hood.

Acknowledgments

Thanks to Bruce Maggs of Carnegie Mellon, who outlined the strategy in Section 6 for using a delay-sequence argument to prove the time bounds on the work-stealing algorithm, which improved our previous bounds. Thanks to Greg Plaxton of University of Texas, Austin for technical comments on our probabilistic analyses. Thanks to the anonymous referees, Yanjun Zhang of Southern Methodist University, and Warren Burton of Simon Fraser University for comments that improved the clarity of our paper. Thanks also to Arvind, Michael Halbherr, Chris Joerg, Bradley Kuszmaul, Keith Randall, and Yuli Zhou of MIT for helpful discussions.

References

[1] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 119–129, Puerto Vallarta, Mexico, June 1998.

[2] Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. I-structures: Data structures for parallel computing. ACM Transactions on Programming Languages and Systems, 11(4):598–632, October 1989.

[3] Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias. Provably efficient scheduling for languages with fine-grained parallelism. In Proceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1–12, Santa Barbara, California, July 1995.

[4] Guy E. Blelloch, Phillip B. Gibbons, Yossi Matias, and Girija J. Narlikar. Space-efficient scheduling of parallelism with synchronization variables. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 12–23, Newport, Rhode Island, June 1997.

[5] Robert D. Blumofe. Executing Multithreaded Programs Efficiently. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, September 1995. Also available as MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-677.

[6] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. An analysis of dag-consistent distributed shared-memory algorithms. In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 297–308, Padua, Italy, June 1996.

[7] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Charles E. Leiserson, and Keith H. Randall. Dag-consistent distributed shared memory. In Proceedings of the Tenth International Parallel Processing Symposium (IPPS), pages 132–141, Honolulu, Hawaii, April 1996.

[8] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 37(1):55–69, August 1996.

[9] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356–368, Santa Fe, New Mexico, November 1994.

[10] Robert D. Blumofe and Charles E. Leiserson. Space-efficient scheduling of multithreaded computations. SIAM Journal on Computing, 27(1):202–229, February 1998.

[11] Robert D. Blumofe and Philip A. Lisiecki. Adaptive and reliable parallel computing on networks of workstations. In Proceedings of the USENIX 1997 Annual Technical Conference on UNIX and Advanced Computing Systems, pages 133–147, Anaheim, California, January 1997.

[12] Robert D. Blumofe and Dionisios Papadopoulos. The performance of work stealing in multiprogrammed environments. Technical Report TR-98-13, The University of Texas at Austin, Department of Computer Sciences, May 1998.

[13] Richard P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2):201–206, April 1974.

[14] F. Warren Burton. Storage management in virtual tree machines. IEEE Transactions on Computers, 37(3):321–328, March 1988.

[15] F. Warren Burton. Guaranteeing good memory bounds for parallel programs. IEEE Transactions on Software Engineering, 22(10), October 1996.

[16] F. Warren Burton and M. Ronan Sleep. Executing functional programs on a virtual tree of processors. In Proceedings of the 1981 Conference on Functional Programming Languages and Computer Architecture, pages 187–194, Portsmouth, New Hampshire, October 1981.

[17] Guang-Ien Cheng. Algorithms for data-race detection in multithreaded programs. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.

[18] Guang-Ien Cheng, Mingdong Feng, Charles E. Leiserson, Keith H. Randall, and Andrew F. Stark. Detecting data races in Cilk programs that use locks. In Tenth ACM Symposium on Parallel Algorithms and Architectures (SPAA), Puerto Vallarta, Mexico, June 1998.

[19] David E. Culler and Arvind. Resource requirements of dataflow programs. In Proceedings of the 15th Annual International Symposium on Computer Architecture (ISCA), pages 141–150, Honolulu, Hawaii, May 1988. Also available as MIT Laboratory for Computer Science, Computation Structures Group Memo 280.

[20] Derek L. Eager, John Zahorjan, and Edward D. Lazowska. Speedup versus efficiency in parallel systems. IEEE Transactions on Computers, 38(3):408–423, March 1989.

[21] R. Feldmann, P. Mysliwietz, and B. Monien. Game tree search on a massively parallel system. Advances in Computer Chess 7, pages 203–219, 1993.

[22] Mingdong Feng and Charles E. Leiserson. Efficient detection of determinacy races in Cilk programs. In Ninth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 1–11, Newport, Rhode Island, June 1997.

[23] Raphael Finkel and Udi Manber. DIB - a distributed implementation of backtracking. ACM Transactions on Programming Languages and Systems, 9(2):235–256, April 1987.

[24] Matteo Frigo. The weakest reasonable memory model. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, January 1998.

[25] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Montreal, Canada, June 1998.

[26] Matteo Frigo and Victor Luchangco. Computation-centric memory models. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 240–249, Puerto Vallarta, Mexico, June 1998.

[27] R. L. Graham. Bounds for certain multiprocessing anomalies. The Bell System Technical Journal, 45:1563–1581, November 1966.

[28] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416–429, March 1969.

[29] Michael Halbherr, Yuli Zhou, and Chris F. Joerg. MIMD-style parallel programming with continuation-passing threads. In Proceedings of the 2nd International Workshop on Massive Parallelism: Hardware, Software, and Applications, Capri, Italy, September 1994.

[30] Robert H. Halstead, Jr. Implementation of Multilisp: Lisp on a multiprocessor. In Conference Record of the 1984 ACM Symposium on Lisp and Functional Programming, pages 9–17, Austin, Texas, August 1984.

[31] Chris Joerg and Bradley C. Kuszmaul. Massively parallel chess. In Proceedings of the Third DIMACS Parallel Implementation Challenge, Rutgers University, New Jersey, October 1994.

[32] Christopher F. Joerg. The Cilk System for Parallel Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, January 1996.

[33] Richard M. Karp and Yanjun Zhang. Randomized parallel algorithms for backtrack search and branch-and-bound computation. Journal of the ACM, 40(3):765–789, July 1993.

[34] Bradley C. Kuszmaul. Synchronized MIMD Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1994. Also available as MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-645.

[35] Philip Lisiecki. Macroscheduling in the Cilk network of workstations environment. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1996.

[36] Pangfeng Liu, William Aiello, and Sandeep Bhatt. An atomic model for message-passing. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 154–163, Velen, Germany, June 1993.

[37] Eric Mohr, David A. Kranz, and Robert H. Halstead, Jr. Lazy task creation: A technique for increasing the granularity of parallel programs. IEEE Transactions on Parallel and Distributed Systems, 2(3):264–280, July 1991.

[38] Vijay S. Pande, Christopher F. Joerg, Alexander Yu Grosberg, and Toyoichi Tanaka. Enumerations of the hamiltonian walks on a cubic sublattice. Journal of Physics A, 27, 1994.

[39] Dionysios P. Papadopoulos. Hood: A user-level thread library for multiprogramming multiprocessors. Master's thesis, Department of Computer Sciences, The University of Texas at Austin, August 1998.

[40] C. Gregory Plaxton, August 1994. Private communication.

[41] Abhiram Ranade. How to emulate shared memory. In Proceedings of the 28th Annual Symposium on Foundations of Computer Science (FOCS), pages 185–194, Los Angeles, California, October 1987.

[42] Keith H. Randall. Cilk: Efficient Multithreaded Computing. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.

[43] Larry Rudolph, Miriam Slivkin-Allalouf, and Eli Upfal. A simple load balancing scheme for task allocation in parallel machines. In Proceedings of the Third Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 237–245, Hilton Head, South Carolina, July 1991.

[44] Carlos A. Ruggiero and John Sargeant. Control of parallelism in the Manchester dataflow machine. In Functional Programming Languages and Computer Architecture, number 274 in Lecture Notes in Computer Science, pages 1–15. Springer-Verlag, 1987.

[45] Andrew F. Stark. Debugging multithreaded programs that incorporate user-level locking. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1998.

[46] Mark T. Vandevoorde and Eric S. Roberts. WorkCrews: An abstraction for controlling parallelism. International Journal of Parallel Programming, 17(4):347–366, August 1988.

[47] I-Chen Wu and H. T. Kung. Communication complexity for parallel divide-and-conquer. In Proceedings of the 32nd Annual Symposium on Foundations of Computer Science (FOCS), pages 151–162, San Juan, Puerto Rico, October 1991.

[48] Y. Zhang and A. Ortynski. The efficiency of randomized parallel backtrack search. In Proceedings of the 6th IEEE Symposium on Parallel and Distributed Processing, Dallas, Texas, October 1994.