Call Forwarding: A Simple Interprocedural Optimization Technique for Dynamically Typed Languages

Koen De Bosschere (†), Saumya Debray (‡), David Gudeman (‡), Sampath Kannan (‡)

† Department of Electronics, Universiteit Gent, B-9000 Gent, Belgium
‡ Department of Computer Science, The University of Arizona, Tucson, AZ 85721, USA

K. De Bosschere was supported by the National Fund for Scientific Research of Belgium and by the Belgian National Incentive Program for fundamental research in Artificial Intelligence. S. Debray and D. Gudeman were supported in part by the National Science Foundation under grant number CCR-9123520. S. Kannan was supported in part by the National Science Foundation under grant number CCR-9108969.

Copyright 1994 ACM. Appeared in the Proceedings of the 21st Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, January 1994, pp. 409-420.

Abstract

This paper discusses call forwarding, a simple interprocedural optimization technique for dynamically typed languages. The basic idea behind the optimization is straightforward: find an ordering for the "entry actions" of a procedure, and generate multiple entry points for the procedure, so as to maximize the savings realized from different call sites bypassing different sets of entry actions. We show that the problem of computing optimal solutions to arbitrary call forwarding problems is NP-complete, and describe an efficient greedy algorithm for the problem. Experimental results indicate that (i) this algorithm is effective, in that the solutions produced are generally close to optimal; and (ii) the resulting optimization leads to significant performance improvements for a number of benchmarks tested.

1 Introduction

The code generated for a function or procedure in a dynamically typed language typically has to carry out various type and range checks on its arguments before it can operate on them. These runtime tests can incur a significant performance overhead. As a very simple example, consider the following function to compute the average of a list of numbers:
    ave(l, sum, count) =
        if null(l) then sum/count
        else ave(tail(l), sum+head(l), count+1)

In a straightforward implementation of this function, the code generated checks the type of each of its arguments each time around the loop: the first argument must be a (empty or non-empty) list, while the second and third arguments must be numbers (see footnote 1). Notice, however, that some of this type checking is unnecessary: the expression sum+head(l) evaluates correctly only if sum is a number, in which case its value is also a number; similarly, count+1 evaluates correctly only if count is a number, and in that case it also evaluates to a number. Thus, once the types of sum and count have been checked at the entry to the loop, further type checks on the second and third arguments are not necessary.

(Footnote 1: In reality, the generated code would distinguish between the numeric types int and float, e.g., using "message splitting" techniques as in [5, 6]; the distinction is not important here, and we assume a single numeric type for simplicity of exposition.)

The function in this example is tail recursive, making it easy to recognize the iterative nature of its computation and use some form of invariant code motion to move the type check out of the loop. In general, however, such redundant actions may be encountered where the definitions are not tail recursive and where the loop structure is not as easy to recognize. An alternative approach, which works in general, is to generate multiple entry points for the function ave, so that a particular call site can enter at the "appropriate" entry point, bypassing any code it does not need to execute. In the example above, this would give exactly the desired result: tail call optimization would compile the recursive call to ave into a jump instruction, and noticing that the recursive call does not need to test the types of its second and third arguments, the target of this jump
would be chosen to bypass these tests.

However, notice that in the example above, even if we generate multiple entry points for ave, the optimization works only if the tests are generated in the right order: since it is necessary to test the type of the first argument each time around the loop, the tests on the second and third arguments cannot be bypassed if the type test on the first argument precedes those on the other two arguments. As this example illustrates, the order in which the tests are generated influences the amount of unnecessary code that can be bypassed at runtime, and therefore the performance of the program.

In general, functions and procedures in dynamically typed languages contain a set of (idempotent) "entry actions," such as type tests, initialization actions (especially for variadic procedures), etc., that are executed at entry to the procedure. Moreover, these actions can typically be carried out in any of a number of different "legal" orders (in general, not all orderings of entry actions may be legal, since some actions may depend on the outcomes of others; for example, the type of an expression head(x) cannot be checked until x has been verified to be of type list). The code generated for a procedure therefore consists of a set of entry actions in some order, followed by code for its body. There are a number of different call sites for each procedure, and at each call site we have some information about the actual parameters at that call site, allowing that call to skip some of these entry actions. Moreover, each call site has a different execution frequency (estimated, for example, from profile information or from the structure of the call graph). In general, different call sites have different information available about their actual parameters, so that an order for the entry actions of a procedure that is good for one call site, in terms of the number of unnecessary entry actions that can be skipped, may not be as good for another call site. A good compiler should therefore attempt to find an ordering on the entry actions that maximizes the benefits, over all call sites, due to bypassing unnecessary code. We refer to determining such an order for the entry actions and then "forwarding" the branch instructions at different call sites so as to bypass unnecessary code as "call forwarding."
While many systems compile functions with multiple entry points, we do not know of any that attempt to order the entry actions carefully in order to exploit this to the fullest. In this paper, we address the problem of determining a "good" order for the set of tests a function or procedure has to carry out. We show that generating an optimal order is NP-complete in general, and give an efficient algorithm for selecting an ordering using a greedy heuristic. The result generalizes a number of optimizations for traditional compilers, such as jump chain collapsing and invariant code motion out of loops. Experimental results indicate that (i) the heuristic is good, in that the orderings it generates are usually not far from the optimal; and (ii) the resulting optimization is effective, in the sense that it typically leads to significant speed improvements.

The issues and optimizations discussed in this paper are primarily at the intermediate code level: for this reason, we do not make many assumptions about the source language, except that a call to a procedure typically involves executing a set of idempotent "entry actions." This covers a wide variety of dynamically typed languages, e.g., functional programming languages such as Lisp and Scheme (e.g., see [15]), logic programming languages such as Prolog [4], GHC [17] and Janus [11, 13], imperative languages such as SETL [14], and object-oriented languages such as Smalltalk [10] and SELF [6]. The optimization we discuss is likely to be most beneficial for languages and programs where procedure calls are common, and which are therefore liable to benefit significantly from reducing the cost of procedure calls. However, the title of the paper notwithstanding, the optimization is not limited, a priori, to dynamically typed languages: it is also applicable, in principle, to idempotent entry actions, such as initialization and array bound checks, in statically typed languages, and some optimizations used in statically typed languages, such as inverse eta-reduction/uncurrying/argument flattening in Standard ML of New Jersey [1], can also be thought of as instances of call forwarding (see Section 6).
2 The Call Forwarding Problem

As discussed in the previous section, the code generated for a procedure consists of a set of entry actions, which can be carried out in a number of different legal orders, followed by the code for its body. Each procedure has a number of call sites, and at each call site there is some information about the actual parameters for calls issued from that site, specifying which entry actions must be executed and which may be skipped (see footnote 2). This is modelled by associating, with each call site, a set of entry actions that must be executed by that call site. Moreover, each call site has associated with it an estimate of its execution frequency: such estimates can be obtained from profile information, or from the structure of the call graph of the program (see, for example, [3, 19]). Finally, different entry actions may require a different number of machine instructions to execute, and therefore have different "sizes."

(Footnote 2: The precise mechanism by which this information is obtained, e.g., dataflow analysis, user declarations, etc., is orthogonal to the issues discussed in this paper, and so is not addressed here.)

Our objective is to order the entry actions of the procedures in a program, and redirect calls so as to bypass unnecessary actions where possible, in such a way that the total number of instructions that are skipped, over the entire execution of the program, is as large as possible. However, it is not difficult to see that for any procedure p in a program, the code to set up and execute procedure calls in the body of p is separate from the entry actions of p. Because of this, the order of p's entry actions (and therefore, the number of instructions that are skipped by calls to p in an execution of the program) neither influences nor is influenced by the order of the entry actions for any other procedure in the program. The problem of maximizing the total number of instructions skipped by call forwarding for the entire program, then, reduces to the problem of maximizing, for each procedure, the number of instructions skipped by calls to that procedure. For our purposes, therefore, the call forwarding problem is the problem of determining a "good" order for the entry actions of a procedure so that the savings accruing from bypassing unnecessary entry actions over all call sites for that procedure, weighted by execution frequency, is as large as possible.

The problem can be generalized by allowing code to be copied from a procedure to the call sites for that procedure. As an example, suppose we have a procedure with entry actions a and b, and two call sites: A, which can skip a but must execute b; and B, which can skip b but must execute a. Suppose the entry actions are generated in the order ⟨a, b⟩; then call site A can skip a, but B cannot skip b and therefore executes unnecessary code (a symmetric problem arises if the other possible order is chosen). A solution is to copy the entry action a at the call site B, i.e., execute the entry action at B before jumping to the callee. If we allow arbitrarily many entry actions to be copied to call sites in this manner, it is trivial to generate an optimal solution to any call forwarding problem: simply copy to each call site the entry actions that call site must execute, then branch into the callee bypassing all entry actions at the callee. This obviously produces an optimal solution, since each call site executes exactly those entry actions that it must execute, and can be done efficiently in polynomial time. However, it has the problem that such unrestricted copying can lead to significant code
bloat, since there may be many call sites for a procedure, each of them getting a copy of most of the entry actions for that procedure (we have observed this phenomenon in a number of application programs).

The best solution to this problem is to impose a global bound on the total number of entry actions that may be copied, across all the call sites occurring in a program, but this turns out to be complicated to implement because when performing call forwarding on any particular procedure, we have to keep track of the number of entry actions copied for all the procedures in the program, including those that have not yet been processed by the optimizer! A simple and effective approximation to this approach is to assign, for each procedure, a bound on the number of entry actions that can be copied to each call site for that procedure. If we start with a global bound on the total number of entry actions that can be copied, such per-procedure bounds can be obtained by "dividing up" the global bound among the procedures (possibly taking into account, for each procedure, the number of call sites for it and their execution frequencies, so that procedures with deeply nested call sites can copy more entry actions and thereby effect greater optimization). A discussion of heuristics for establishing such per-procedure bounds is beyond the scope of this abstract: we simply assume, in the discussion that follows, that for each procedure there is a bound on the number of its entry actions that can be copied to any call site.

The call forwarding problem can therefore be formulated in the abstract as follows:

Definition 2.1 A call forwarding problem is a 5-tuple $\langle E, C, w, s, k \rangle$, where:

- $E$ is a finite set (representing the entry actions of the procedure concerned);
- $C$ is a multiset of subsets of $E$ (representing the entry actions that each call site must execute);
- $w : C \to \mathbb{N}$, where $\mathbb{N}$ is the set of natural numbers, is a function that maps each call site to its "weight", i.e., execution frequency;
- $s : E \to \mathbb{N}$ represents the "size" of each element of $E$ (representing the number of machine instructions needed to realize the corresponding entry action); and
- $k \geq 0$ represents a bound on the number of entry actions that can be copied to call sites.

A solution to a call forwarding problem $\langle E, C, w, s, k \rangle$ is a permutation $\pi$ of $E$, i.e., a 1-1 function $\pi : E \to \{1, \ldots, |E|\}$. The cost of a solution is, intuitively, the total number of machine instructions executed, over all call sites, given that the entry actions are generated in the order $\pi$. Given a call forwarding problem $\langle E, C, w, s, k \rangle$, the cost of a solution $\pi$ for it is defined as follows. First, let $\mathit{copied}(c, \pi, i)$ denote (the indices of) those entry actions in $\pi$ that have to be copied to a call site $c$ if the entry point for $c$ is to bypass the first $i$ elements of $\pi$:

$$\mathit{copied}(c, \pi, i) = \{\, j \mid j \leq i \wedge \pi^{-1}(j) \in c \,\}.$$
Here, $\pi^{-1}(j)$ denotes the element of $E$ that is the $j$th element of the permutation $\pi$. For any call site $c \in C$, given the bound $k$ on the number of actions that can be copied to $c$, the maximum number of entry actions that can be skipped by $c$ (either because $c$ does not have to execute that action, or because it has been copied from the callee to the call site) is given by

$$\mathit{Skip}(c, \pi) = \max\{\, i : |\mathit{copied}(c, \pi, i)| \leq k \,\}.$$

The cost of a solution $\pi$ can then be expressed as the weighted sum, over all call sites, of (the sizes of) the instructions that cannot be skipped by the call sites:

$$\mathit{cost}(\pi) = \sum_{c \in C} \sum \{\, w(c) \cdot s(I) \mid I \in E \wedge \pi(I) > \mathit{Skip}(c, \pi) \,\}.$$

3 Algorithmic Issues

We first consider the complexity of determining optimal solutions to call forwarding problems. The following result shows that the existence of efficient algorithms for this is unlikely:

Theorem 3.1 The determination of an optimal solution to a call forwarding problem is NP-complete. It remains NP-complete even if all entry actions have equal size.

Proof By reduction from the Optimal Linear Arrangement problem, which is known to be NP-complete [8, 9]. See the Appendix for details.

This result might very well be of only academic interest if the number of entry actions encountered in typical programs could be guaranteed to be small. However, our experience has been that this is not the case in many actual applications. The reason for this is that, even if the number of arguments to procedures is small for most programs encountered in practice, it is not unusual to have a number of entry actions associated with a single argument (e.g., see Section 4), involving type and range checks, pattern matching and indexing code, pointer chain dereferencing (a common operation in logic programming languages), and so on. Because of this, the total number of entry actions in a procedure can be quite large, making exhaustive search for an optimal solution impractical. We therefore seek efficient polynomial time heuristics for call forwarding that produce good solutions for common cases.
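As a worked illustration of the cost definition from Section 2, consider the two-action procedure used there: entry actions a and b, call site A must execute only b, and call site B must execute only a. The unit weights and unit sizes below are assumptions made for the arithmetic, and the index i in the definition of Skip is assumed to range over 0, ..., |E|. With the order ⟨a, b⟩ and no copying allowed (k = 0):

```latex
\begin{align*}
E &= \{a, b\}, \quad \pi = \langle a, b \rangle, \quad c_A = \{b\}, \quad c_B = \{a\}, \quad k = 0,\\
\mathit{copied}(c_A, \pi, 1) &= \emptyset, \quad
\mathit{copied}(c_A, \pi, 2) = \{2\}
  \;\Rightarrow\; \mathit{Skip}(c_A, \pi) = 1,\\
\mathit{copied}(c_B, \pi, 1) &= \{1\}
  \;\Rightarrow\; \mathit{Skip}(c_B, \pi) = 0,\\
\mathit{cost}(\pi) &= \underbrace{1 \cdot s(b)}_{c_A}
  + \underbrace{1 \cdot \bigl(s(a) + s(b)\bigr)}_{c_B}
  = 1 + 2 = 3.
\end{align*}
```

Call site A bypasses a and pays only for b, while call site B must enter at the top and pays for both actions, including the unnecessary b. The order ⟨b, a⟩ gives the same total by symmetry.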
3.1 A Greedy Algorithm

While the problem of computing optimal solutions for arbitrary call forwarding problems is NP-complete in general, a greedy algorithm appears to work quite well in practice (see Table 1). Given a call forwarding problem for a procedure with a bound of k on the number of actions that can be copied from the callee to the call sites, the general idea is to pick actions one at a time, at each step choosing an action that minimizes the cost to be paid at that step. The algorithm maintains a list of call sites that do not need to execute more than k of the actions chosen up to that point, and therefore can still have some actions copied to them; such call sites are said to be active. Each active call site c has associated with it a counter, denoted by count[c] in Figure 1, that keeps track of how many more actions can be copied to that call site. The weight of an action, at any point in the algorithm, is computed as the sum of the weights of the active call sites that need to execute that action, divided by the "size" of that action (recall that the size of an action represents the number of machine instructions needed to implement it). Thus, everything else being equal, an action that is more expensive in terms of the number of machine instructions it requires will have a smaller weight than one with smaller size, and hence be picked earlier, thereby allowing more call sites to bypass it.

Since in general there may be dependencies between instructions that restrict the set of legal orderings (e.g., see the example in Section 4), the algorithm first constructs a dependency graph G whose nodes are the entry actions under consideration, and where there is an edge from a node e1 to a node e2 if e1 must precede e2 in any legal execution; the set of predecessors of a node x in this graph is denoted by preds(x). The algorithm is simple: it repeatedly picks an "available" action (i.e., an action whose predecessors in the dependency graph G have already been picked) of least weight, then updates the counters of the appropriate call sites as well as the list of active call sites, deleting from this list any call site that has reached its limit of the number of actions that can be copied from the callee. This process continues until all actions have been enumerated. The algorithm is described in Figure 1.
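The greedy procedure described above can be sketched in Python as follows. This is a minimal sketch reconstructed from the prose, not a transcription of Figure 1: the function name `greedy_order`, the dictionary-based representation, the tie-breaking rule (first available action in the given list order), and the three-action instance in the usage lines are all assumptions made for illustration.

```python
def greedy_order(actions, sites, weight, size, k, preds):
    """Order entry actions greedily (sketch; names are hypothetical).

    actions: list of entry actions
    sites:   dict mapping call site -> set of actions it must execute
    weight:  dict mapping call site -> execution frequency
    size:    dict mapping action -> number of machine instructions
    k:       bound on the number of actions copied to any call site
    preds:   dict mapping action -> set of actions that must precede it
    """
    order, done = [], set()
    count = {c: k for c in sites}   # copies still allowed per call site
    active = set(sites)             # sites that can still bypass actions

    while len(order) < len(actions):
        # An action is available once all its predecessors are picked.
        available = [a for a in actions
                     if a not in done and preds.get(a, set()) <= done]
        if not available:
            raise ValueError("dependency graph contains a cycle")

        def action_weight(a):
            # Total weight of active sites needing a, per instruction.
            return sum(weight[c] for c in active if a in sites[c]) / size[a]

        pick = min(available, key=action_weight)  # ties: first in list
        order.append(pick)
        done.add(pick)

        # A site that needs the picked action either uses up one of its
        # remaining copies or, if it has none left, becomes inactive.
        for c in list(active):
            if pick in sites[c]:
                if count[c] == 0:
                    active.discard(c)
                else:
                    count[c] -= 1
    return order


# Hypothetical instance: a frequently executed site needs only x,
# a rarely executed site needs everything, and z depends on y.
example = greedy_order(
    actions=["x", "y", "z"],
    sites={"hot": {"x"}, "cold": {"x", "y", "z"}},
    weight={"hot": 10, "cold": 1},
    size={"x": 1, "y": 1, "z": 1},
    k=0,
    preds={"z": {"y"}},
)
print(example)
```

In this instance the actions needed only by the infrequent site are placed first, so the frequent site can enter just before x and bypass y and z.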
4 An Example

In this section we consider in more detail the ave function from Section 1 to see the effect of call forwarding on the code generated. To illustrate the fact that this optimization is not limited to code for type checking, we consider here a realization of this function in Prolog. As in other logic programming languages, unification between variables in Prolog can set up chains of pointers, and loading the value of a variable requires dereferencing such chains. A number of authors have shown that significant performance improvements are possible if the lengths of these pointer chains can be predicted via compile-time analysis, so that unnecessary dereferencing code can be deleted [7, 12, 16]; however, the analyses involved are fairly complex. Here we show how, in many
cases, unnecessary dereference operations can be eliminated using call forwarding. The procedure is defined as follows:

    ave([], Sum, Count, Avg) :-
        Avg is Sum/Count.
    ave([H|L], Sum, Count, Avg) :-
        Sum1 is Sum+H, Count1 is Count+1,
        ave(L, Sum1, Count1, Avg).

Assume that, as in many modern Lisp and Prolog implementations, parameters are passed in (virtual machine) registers, so that the first parameter is in register Arg1, the second parameter in register Arg2, and so on. Figure 2(a) gives the intermediate code that might be generated in a straightforward way. (In reality, the generated code would distinguish between the numeric types int and float, e.g., using "message splitting" techniques as in [5, 6]; the distinction is not important here, and we assume a single numeric type for simplicity of exposition.) The first six instructions of ave are entry actions that can be executed in any order where the dereferencing of a register precedes its use. Moreover, at the (recursive) call site for ave, we know from the semantics of the add instruction that Arg2 and Arg3 are both numbers, and that there is no need for either dereferencing or type checking of these registers. The entry actions corresponding to dereferencing and type checking of these registers can therefore be bypassed by the recursive call site. Assume that apart from the recursive call, there is another call site (the "initial" call) for the procedure ave. For notational brevity in the discussion that follows, denote the instructions above as follows:

    Arg1 := deref(Arg1)           ↦ a
    Arg2 := deref(Arg2)           ↦ b
    Arg3 := deref(Arg3)           ↦ c
    if ¬list(Arg1) goto Err       ↦ d
    if ¬number(Arg2) goto Err     ↦ e
    if ¬number(Arg3) goto Err     ↦ f

Finally, assume that no copying of code to call sites is allowed. Then, we can formulate this as a call forwarding problem ⟨E, C, w, s, k⟩ as follows:

- E = {a, b, c, d, e, f};
- C = {c1, c2}, where c1 = {a, b, c, d, e, f} is the initial call site, and c2 = {a, d} is the recursive call site;
- w = {c1 ↦ 1, c2 ↦ 10}, i.e., we assume that loops iterate about 10 times on the average;
- the "size function" s maps each entry action in E to 1 (for simplicity); and
- k = 0, i.e., no copying of code to call sites is allowed.
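The instance above is small enough to check by exhaustive search. The following sketch, which is an illustration and not part of the jc implementation, encodes the instance, evaluates the cost function of Section 2, and enumerates every legal order. The names `PREDS`, `is_legal`, `skip` and `cost` are hypothetical, and the dependency sets assume only that each dereference must precede the type test on the same register.

```python
from itertools import permutations

# The call forwarding problem for ave, as formulated above.
E = ["a", "b", "c", "d", "e", "f"]
C = {"c1": set(E), "c2": {"a", "d"}}
w = {"c1": 1, "c2": 10}
s = {x: 1 for x in E}
k = 0

# Assumed dependencies: deref of a register precedes its type test.
PREDS = {"d": {"a"}, "e": {"b"}, "f": {"c"}}


def is_legal(order):
    """True if every action appears after all of its predecessors."""
    pos = {x: i for i, x in enumerate(order)}
    return all(pos[p] < pos[x] for x, ps in PREDS.items() for p in ps)


def skip(site, order):
    """Largest prefix length containing at most k actions the site needs."""
    needed = 0
    for i, x in enumerate(order):
        if x in C[site]:
            needed += 1
            if needed > k:
                return i
    return len(order)


def cost(order):
    """Weighted size of the actions each site cannot bypass."""
    return sum(w[c] * sum(s[x] for x in order[skip(c, order):]) for c in C)


legal = [p for p in permutations(E) if is_legal(p)]
best = min(cost(p) for p in legal)

naive = ("a", "b", "c", "d", "e", "f")
forwarded = ("b", "c", "e", "f", "a", "d")
print(len(legal), cost(naive), cost(forwarded), best)
```

Under these assumptions, an order that places the tests on Arg2 and Arg3 ahead of those on Arg1 attains the minimum cost over all legal orders, while generating the actions in the order a through f does not.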
Initially, the set of available actions is {a, b, c}, and both call sites are active, so the weights computed for these actions are: a: 11; b: 1; c: 1. There are two actions, b and c, that have lowest weight, and one of them (say, b) is picked by the algorithm. As a result, the call site c1 becomes inactive. The set of available actions at this point is {a, c, e}, with weights 10, 0, 0 respectively. There are two actions, c and e, with lowest weight, and one of them (say, c) is picked. The algorithm proceeds in this manner, eventually producing the sequence ⟨b, c, e, f, a, d⟩ as a solution to this call forwarding problem. In other words, call forwarding orders the entry actions so that the dereferencing and type tests on Arg2 and Arg3 come first, and can be skipped by the recursive call to ave. The resulting code is shown in Figure 2(b). Notice that the code for dereferencing and type checking the second and third arguments has effectively been "hoisted" out of the loop. Moreover, this has been accomplished, not by recognizing and dealing with loops in some special way, but simply by using the information available at call sites. It is applicable, therefore, even to computations that are not iterative (i.e., tail recursive), including procedures that involve arbitrary linear, nonlinear, and mutual recursion.

5 Experimental Results

We ran experiments on a number of small benchmarks to gauge (i) the efficacy of the greedy algorithm, i.e., the quality of its solutions compared to the optimal; and (ii) the efficacy of the optimization, i.e., the performance improvements resulting from it. The numbers presented reflect the performance of jc [11], an implementation of a logic programming language called Janus [13], on a Sparcstation-1 (see footnote 3). This system is currently available by anonymous FTP from cs.arizona.edu.

Table 1 gives, for each benchmark, the number of machine instructions that would be executed over all call sites for the entry actions in the procedures only, using (i) no call forwarding; (ii) call forwarding using the greedy algorithm; and (iii) optimal call forwarding.
The weights for the call sites were estimated using the structure of the call graph: we assumed that on the average, each loop iterates about 10 times, and the branches of a conditional are taken with equal frequency. (Footnote 3: Our implementation uses a variant of call forwarding where entry actions are copied from the callee to the call sites as long as this will allow a later action to be skipped.) While the optimizations were carried out at the intermediate code level, we used counts of the number of Sparc assembly instructions for each intermediate code instruction, together with the execution frequencies estimated from the call graph structure, to estimate the runtime cost
of the different solutions. The results indicate that the greedy heuristic has uniformly good performance: on the benchmarks, it attains the optimal solution in each case.

Table 2 gives the improvements in speed resulting from our optimizations, and serves to evaluate the efficacy of call forwarding. The time reported for each benchmark, in milliseconds, is the time taken to execute the program once. This time was obtained by iterating the program long enough to eliminate most effects due to multiprogramming and clock granularity, then dividing the total time taken by the number of iterations. The experiments were repeated 20 times for each benchmark, and the average time was taken in each case. Call forwarding accounts for improvements ranging from about 12% to over 45%. Most of this improvement comes from code motion out of inner loops: the vast majority of type tests etc. in a procedure appear as entry actions that are bypassed in recursive calls due to call forwarding, effectively "hoisting" such tests out of inner loops. As a result, much of the runtime overhead from dynamic type checking is optimized away.

Table 3 puts these numbers in perspective by comparing the performance of jc to Quintus and Sicstus Prologs, two widely used commercial Prolog systems. On comparing the performance numbers from Table 2 for jc before and after optimization, it can be seen that the performance of jc is competitive with these systems even before the application of the optimizations discussed in this paper. It is easy to take a poorly engineered system with a lot of inefficiencies and get huge performance improvements by eliminating some of these inefficiencies. The point of this table is that when evaluating the efficacy of our optimizations, we were careful to begin with a system with good performance, so as to avoid drawing overly optimistic conclusions.
Finally, Table 4 compares the performance of our Janus system with C code for some small benchmarks (see footnote 4). Again, these were run on a Sparcstation 1, with cc as the C compiler. The programs were written in the style one would expect of a competent C programmer: no recursion (except in tak and nrev, an O(n²) "naive reverse" program for reversing a linked list of integers, where it is hard to avoid), destructive updates, and the use of arrays rather than linked lists (except in nrev, which by definition traverses a list). The source code for these benchmarks is given in Appendix B. It can be seen that the performance of jc is not very far from that of C, attaining approximately the same performance as unoptimized C code, and being only about a factor of 2, on the average, slower than C code optimized at level -O4. On some benchmarks, such as nrev, jc outperforms unoptimized C and is not much slower than optimized C, even though the C program uses destructive assignment and does not allocate new cons cells, while Janus is a single assignment language where the program allocates new cons cells at each iteration; its performance can be attributed at least in part to the benefits of call forwarding.

(Footnote 4: The Janus version of qsort used in this table is slightly different from that of Table 3: in this case there are explicit integer type tests in the program source, to be consistent with int declarations in the C program and allow a fair comparison between the two programs. The presence of these tests provides additional information to the jc compiler and allows some additional optimizations.)

6 Related Work

The optimizations described here can be seen as generalizing some optimizations for traditional imperative languages [2]. In the special case of a (conditional or unconditional) jump whose target is a (conditional or unconditional) jump instruction, call forwarding generalizes the flow-of-control optimization that collapses chains of jump instructions. Call forwarding is able to deal with conditional jumps to conditional jumps (this turns out to be an important source of performance improvement in practice), while traditional compilers for imperative languages such as C and Fortran typically deal only with jump chains where there is at most one conditional jump (see, for example, [2], p. 556).
When we consider call forwarding for the last call in a recursive procedure, what we get is essentially a generalization of code motion out of loops, in the sense that the code that is bypassed due to call forwarding at a particular call site need not be invariant with respect to the entire loop. The point is best illustrated by an example: consider a function

    f(x) = if x = 0 then 1
           else if p(x) then f(g(x-1))   /* 1 */
           else f(h(x-1))                /* 2 */

Assume that the entry actions for this function include a test that its argument is an integer, and suppose that we know, from dataflow analysis, that g() returns an integer, but do not know anything about the return type of h(). From the conventional definition of a "loop" in a flow graph (see, for example, [2]), there is one loop in the flow graph of this function that includes both the tail recursive call sites for f(). Because of our lack of knowledge about the return type of h(), we cannot claim that "the argument to f() is an integer" is an invariant for the entire loop. However, using call forwarding, the integer test in the portion of the loop arising from call site 1 can be bypassed. Effectively, this moves some code out of "part of" a loop. Moreover, our algorithm implements interprocedural optimization and can deal with both direct and mutual recursion, as well as non-tail-recursive code, without having to do anything
special, while traditional code motion algorithms handle only the intra-procedural case.

The idea of compiling functions with multiple entry points is not new: many Lisp systems do this, Standard ML of New Jersey and Yale Haskell generate dual entry points for functions, and Aquarius Prolog generates multiple entry points for primitive operations [18]. However, we do not know of any system that attempts to order the entry actions carefully in order to maximize the savings from bypassing entry actions.

Some optimizations used in statically typed languages can also be thought of in terms of call forwarding. For example, Standard ML of New Jersey uses a combination of three transformations (inverse eta-reduction, uncurrying, and argument flattening) to optimize functions where all of the known call sites pass tuples of the same size as arguments, but where the function may "escape," i.e., not all of the call sites are known at compile time [1]. The idea is to have the known call sites pass arguments in registers instead of constructing and deconstructing tuples on the heap, while call sites that are unknown at compile time execute additional code to correctly deconstruct the tuples they pass. This optimization can be thought of in terms of call forwarding as follows: suppose that each known call site for a function constructs and passes an n-tuple as the argument, which is then deconstructed with n select operations at the callee. We can copy the n select operations from the callee to each known call site, and forward the calls to enter the callee bypassing these operations. At each of these call sites, the construction of the argument n-tuple followed by n selects on it can easily be recognized as inverse operations that can be optimized to avoid having to actually build tuples on the heap. Thus, known call sites can be executed efficiently, while call sites that are not known at compile time enter at the original entry point and execute the select operations in the expected way. Indeed, the whole point of inverse eta-reduction is to generate two entry points for a function so that known call sites can bypass unnecessary code: call forwarding can be seen as a way of extending this idea to get more than two entry points where necessary.
Chambers and Ungar consider compile-time optimization techniques to reduce runtime type checking in dynamically typed object-oriented languages [5, 6]. Their approach uses type analysis to generate multiple copies of program fragments, in particular loop bodies, where each copy is specialized to a particular type and therefore can omit some type tests. Some of the effects of the optimization we discuss, e.g., "hoisting" type tests out of loops (see Section 4), are similar to effects achieved by the optimization of Chambers and Ungar. In general, however, it is essentially orthogonal to the work described here, in that it is concerned primarily with type inference and code specialization rather than with code ordering. Because of this, the two optimizations are complementary: even if the body of a procedure has been optimized using the techniques of Chambers and Ungar, it may contain type tests etc. at the entry, which are candidates for the optimization we discuss; conversely, the "message splitting" optimization of Chambers and Ungar can enhance the effects of call forwarding considerably.

7 Conclusions

This paper discusses call forwarding, a simple interprocedural optimization technique for dynamically typed languages. The basic idea behind the optimization is extremely straightforward: find an ordering for the "entry actions" of a procedure such that the savings realized from different call sites bypassing different sets of entry actions, weighted by their estimated execution frequencies, is as large as possible. It turns out, however, to be quite effective for improving program performance. We show that the problem of computing optimal solutions to arbitrary call forwarding problems is NP-complete, and describe an efficient heuristic for the problem. Experimental results indicate that the solutions produced are generally optimal or close to optimal, and lead to significant performance improvements for a number of benchmarks tested. A variant of these ideas has been implemented in jc, a logic programming system that is available by anonymous FTP from cs.arizona.edu.

References

[1] A. Appel, Compiling with Continuations, Cambridge University Press, 1992.

[2] A. V. Aho, R. Sethi and J. D. Ullman, Compilers - Principles, Techniques and Tools, Addison-Wesley, 1986.
[3] T. Ball and J. Larus, "Optimally Profiling and Tracing Programs", Proc. 19th ACM Symp. on Principles of Programming Languages, Albuquerque, NM, Jan. 1992, pp. 59-70.

[4] M. Carlsson and J. Widen, SICStus Prolog User's Manual, Swedish Institute of Computer Science, Oct. 1988.

[5] C. Chambers and D. Ungar, "Iterative Type Analysis and Extended Message Splitting: Optimizing Dynamically Typed Object-Oriented Programs", Proc. SIGPLAN '90 Conference on Programming Language Design and Implementation, White Plains, NY, June 1990, pp. 150-164. SIGPLAN Notices vol. 25 no. 6.

[6] C. Chambers, D. Ungar and E. Lee, "An Efficient Implementation of SELF, A Dynamically Typed Object-Oriented Language Based on Prototypes", Proc. OOPSLA '89, New Orleans, LA, 1989, pp. 49-70.

[7] S. K. Debray, "A Simple Code Improvement Scheme for Prolog", J. Logic Programming, vol. 13 no. 1, May 1992, pp. 57-88.

[8] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, New York, 1979.

[9] M. R. Garey, D. S. Johnson, and L. Stockmeyer, "Some Simplified NP-complete Graph Problems", Theoretical Computer Science vol. 1, pp. 237-267, 1976.

[10] A. Goldberg and D. Robson, Smalltalk-80: The Language and its Implementation, Addison-Wesley, 1983.

[11] D. Gudeman, K. De Bosschere, and S. K. Debray, "jc: An Efficient and Portable Implementation of Janus", Proc. Joint International Conference and Symposium on Logic Programming, Washington DC, Nov. 1992. MIT Press.

[12] A. Marien, G. Janssens, A. Mulkers, and M. Bruynooghe, "The Impact of Abstract Interpretation: An Experiment in Code Generation", Proc. Sixth International Conference on Logic Programming, Lisbon, June 1989, pp. 33-47. MIT Press.

[13] V. Saraswat, K. Kahn, and J. Levy, "Janus: A step towards distributed constraint programming", in Proc. 1990 North American Conference on Logic Programming, Austin, TX, Oct. 1990, pp. 431-446. MIT Press.

[14] J. T. Schwartz, R. B. K. Dewar, E. Dubinsky, and E. Schonberg, Programming with Sets: An Introduction to SETL, Springer-Verlag, 1986.

[15] G. L. Steele Jr., Common Lisp: The Language, Digital Press, 1984.

[16] A. Taylor, "Removal of Dereferencing and Trailing in Prolog Compilation", Proc. Sixth International Conference on Logic Programming, Lisbon, June 1989, pp. 48-60. MIT Press.

[17] K. Ueda, "Guarded Horn Clauses", in Concurrent Prolog: Collected Papers, vol. 1, ed. E. Shapiro, pp. 140-156, 1987. MIT Press.

[18] P. Van Roy, Can Logic Programming Execute as Fast as Imperative Programming?, PhD Dissertation, University of California, Berkeley, Nov. 1990.

[19] D. W. Wall, "Predicting Program Behavior Using Real or Estimated Profiles", Proc. SIGPLAN-91 Conf. on Programming Language Design and Implementation, June 1991, pp. 59-70.
A Appendix: Proof of NP-Completeness

The following problem is useful in discussing the complexity of optimal call forwarding:

Definition A.1 The Optimal Linear Arrangement problem (OLA) is defined as follows: given a graph $G = (V, E)$ and an integer $k$, find a permutation $f$ from the vertices in $V$ to $1, \ldots, n$ such that, defining the length of edge $(i, j)$ to be $|f(i) - f(j)|$, the total length of all edges is less than or equal to $k$.

The following result is due to Garey, Johnson, and Stockmeyer [8, 9]:

Theorem A.1 The Optimal Linear Arrangement problem is NP-complete.

The following result gives the complexity of optimal call forwarding:

Theorem 3.1 The determination of an optimal solution to a call forwarding problem is NP-complete. It remains NP-complete even if every entry action has equal size.

Proof: We first formulate optimal call forwarding as a decision problem, as follows: "Given a call forwarding problem $I$ and an integer $K \ge 0$, is there a solution to $I$ with cost no greater than $K$?" We refer to this problem as CF. The proof is by reduction from the Optimal Linear Arrangement problem, which, by Theorem A.1, is NP-complete. Let $G = (V, E)$, $k$ be a particular instance of OLA. We make the following transformation to an instance $\langle A, C, w, s, k \rangle$ of CF, where:

- $A$ is the set of vertices $1, \ldots, n$ in $V$ along with two dummy vertices $s$ and $t$;
- the elements of $C$ are all doubleton sets:
    - corresponding to each edge $(u, v) \in E$, there is an element $\{u, v\}$ in $C$ with weight 1. For terminological simplicity in the discussion that follows, we refer to these elements as normal sets;
    - let $\Delta$ be the maximum degree of any vertex in $G$. Then, corresponding to each vertex $i \in G$ of degree $d_i$, there is an element $\{i, s\}$ in $C$ with weight $\frac{1}{2}(\Delta - d_i)$ (some of these sets could have zero weight, in which case they can effectively be removed). We refer to these elements as special sets;
    - finally, there is an element $\{s, t\}$ in $C$ of weight $M$, where $M$ is large enough to ensure that $s$ and $t$ have to be the last two elements in any optimal ordering of the vertices ($M$ can be chosen to be $n^3$ or greater). We refer to this element as a heavy set;
- $s(i) = 1$ for every $i \in A$;
- the fifth component of the CF instance is 0 (it is distinct from the OLA bound $k$, which enters only through $K$ below).

We also have to define the number $K$ that is to bound the cost of the call forwarding problem so constructed. Let

$$K = \tfrac{1}{4}\,\Delta\, n(n+5) + 3M + k/2 .$$

We claim that the instance of CF so defined has a solution with cost no greater than $K$ if and only if the given instance of OLA has a solution.

Consider any proposed order of elements in a solution to the instance of CF defined above. The cost of this solution can be decomposed as follows.

As we march along the list of elements, at each point we charge $\Delta/2$ to each of the elements we have seen so far, but not to either of the special elements $s$ and $t$. If vertex $i \in G$ is encountered, the charge of $\Delta/2$ on vertex $i$ from then on can be thought of as paying $1/2$ towards each of the $d_i$ normal sets that contain $i$ and paying the entire cost, $\frac{1}{2}(\Delta - d_i)$, of the special set that contains $i$. Once both elements of a normal set have been encountered, the total cost of the set is from then on picked up by these charges to the vertices. For a normal set $\{i, j\}$, after $i$ has been encountered and before $j$ has been encountered, the extra charge of $1/2$ at each stage is charged to the edge $(i, j)$.

Breaking up the charges in this way, one finds that for any order in which $s$ and $t$ finish last, the charge to the vertices is a constant independent of the order, equal to $\frac{1}{4}\Delta\, n(n+5)$, and the charge for the heavy set is fixed at $3M$. The only variable is the charge to the edges, and this charge is exactly half the total length of the edges, since an edge is charged only after one of its endpoints has been encountered and before the other endpoint has been encountered, i.e., for the "duration" of its length.

Thus there is a YES answer to the instance of CF created if and only if the total length of all "normal" edges is kept to $k$ or less, or, in other words, if and only if the instance of OLA is a YES-instance. (Note that since the cost of the special sets is entirely picked up by the vertices, the lengths of the special edges do not matter.)
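To make Definition A.1 concrete, the following is a minimal C sketch of the OLA objective and of a brute-force decision procedure for very small graphs. The names `edge`, `ola_length` and `ola_decide` are illustrative and do not come from the paper, vertices are numbered from 0 here for array indexing, and the exhaustive search over permutations is an assumption made only for illustration (it is exponential, as one would expect for an NP-complete problem).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical representation: an undirected edge between vertices u and v,
   with vertices numbered 0..n-1. */
typedef struct { int u, v; } edge;

/* Total length of all edges under arrangement f, where f[i] is the
   position (1..n) assigned to vertex i: the sum of |f(u) - f(v)|. */
long ola_length(const edge *edges, int n_edges, const int *f)
{
    long total = 0;
    for (int e = 0; e < n_edges; e++)
        total += labs((long)f[edges[e].u] - (long)f[edges[e].v]);
    return total;
}

/* Enumerate all permutations of f[pos..n-1] by swapping and return the
   smallest total edge length found. */
static long min_rec(const edge *edges, int n_edges, int *f, int n, int pos)
{
    if (pos == n)
        return ola_length(edges, n_edges, f);
    long best = -1;
    for (int i = pos; i < n; i++) {
        int t = f[pos]; f[pos] = f[i]; f[i] = t;   /* choose f[i] for pos */
        long len = min_rec(edges, n_edges, f, n, pos + 1);
        if (best < 0 || len < best)
            best = len;
        t = f[pos]; f[pos] = f[i]; f[i] = t;       /* undo the swap */
    }
    return best;
}

/* Decision version: returns 1 if some arrangement has total edge length
   at most k, and 0 otherwise. Requires n >= 1. */
int ola_decide(const edge *edges, int n_edges, int n, long k)
{
    assert(n >= 1);
    int *f = malloc((size_t)n * sizeof *f);
    assert(f != NULL);
    for (int i = 0; i < n; i++)
        f[i] = i + 1;
    long best = min_rec(edges, n_edges, f, n, 0);
    free(f);
    return best <= k;
}
```

For a triangle on three vertices, every arrangement has edge lengths 1, 1 and 2, so `ola_decide` answers yes for k = 4 and no for k = 3.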
B Source Code for Some Benchmarks

The source code for the benchmarks used in the comparison between jc and C is given below. For space reasons, only the code for the main functions is given.

nrev:

C:

    typedef struct s {
        int head;
        struct s *tail;
    } cons_node;

    cons_node *append(l1, l2)
    cons_node *l1, *l2;
    {
        cons_node *l3;
        if (l1 == NULL) return l2;
        else {
            /* walk to the last node of l1 */
            for (l3 = l1; l3->tail != NULL; l3 = l3->tail)
                ;
            l3->tail = l2;
            return l1;
        }
    }

    cons_node *nrev(l)
    cons_node *l;
    {
        cons_node *l1;
        if (l == NULL) return NULL;
        else {
            l1 = l->tail;
            l->tail = NULL;    /* reclaim head node */
            return append(nrev(l1), l);
        }
    }

Janus:

    nrev([], ^[]).
    nrev([H|L1], ^R) :-
        nrev(L1, ^R1), app(R1, [H], ^R).
    app([], L, ^L).
    app([H|L1], L2, ^[H|L3]) :- app(L1, L2, ^L3).

binomial:

C:

    /* fact() as in the factorial benchmark */
    int pow(x, i)
    int x, i;
    {
        int prod;
        for (prod = 1; i > 0; i--) prod *= x;
        return prod;
    }

    int choose(n, k)
    int n, k;
    {
        return fact(n) / (fact(k) * fact(n-k));
    }

    int binomial(x, y, n)
    int x, y, n;
    {
        int i, prod = 0;
        for (i = 0; i <= n; i++)
            prod += choose(n, i) * pow(x, i) * pow(y, n-i);
        return prod;
    }

Janus:

    /* fact() as in the factorial benchmark */
    pow(X, N, ^P) :- int(X) | pow(X, N, ^P, 1).
    pow(X, 0, ^P, A) :- int(X), int(A) | P = A.
    pow(X, N, ^P, A) :-
        int(X), int(N), int(A), N > 0 |
        pow(X, N-1, ^P, X*A).
    choose(N, K, ^C) :- int(N), int(K) |
        fact(N, ^F1), fact(K, ^F2), fact(N-K, ^F3),
        C = F1 // (F2*F3).
    binomial(X, Y, N, ^Z) :-
        int(X), int(Y), int(N), N >= 0 |
        binomial(X, Y, N, ^Z, N).
    binomial(_, _, _, ^0, 0).
    binomial(X, Y, N, ^Z, K) :-
        int(X), int(Y), int(N), int(K), K > 0 |
        binomial(X, Y, N, ^Z1, K-1),
        choose(N, K, ^C),
        pow(X, K, ^Xp),
        pow(Y, N-K, ^Yp),
        Z = Z1 + C*Xp*Yp.

dnf:

C:

    dnf(in, r, w, b)
    int in[], r, w, b;
    {
        int temp;
        while (r <= w) {
            if (in[w] == 0) {
                temp = in[w]; in[w] = in[r]; in[r] = temp;
                r += 1;
            } else if (in[w] == 1)
                w -= 1;
            else if (in[w] == 2) {
                temp = in[w]; in[w] = in[b]; in[b] = temp;
                b -= 1; w -= 1;
            }
        }
    }

Janus:

    dnf(In, R, W, B, ^Out) :-
        int(R), int(W), R > W |
        Out = In.
    dnf(In, R, W, B, ^Out) :-
        int(R), int(W), R =< W, In.W = red |
        dnf(In[R -> In.W, W -> In.R], R+1, W, B, ^Out).
    dnf(In, R, W, B, ^Out) :-
        int(R), int(W), R =< W, In.W = white |
        dnf(In, R, W-1, B, ^Out).
    dnf(In, R, W, B, ^Out) :-
        int(R), int(W), R =< W, In.W = blue |
        dnf(In[B -> In.W, W -> In.B], R, W-1, B-1, ^Out).

tak:

C:

    int tak(x, y, z)
    int x, y, z;
    {
        if (x <= y) return z;
        return tak(tak(x-1, y, z),
                   tak(y-1, z, x),
                   tak(z-1, x, y));
    }

Janus:

    tak(X, Y, Z, ^A) :-
        int(X), int(Y), int(Z), X > Y |
        tak(X-1, Y, Z, ^A1),
        tak(Y-1, Z, X, ^A2),
        tak(Z-1, X, Y, ^A3),
        tak(A1, A2, A3, ^A).
    tak(X, Y, Z, ^A) :-
        int(X), int(Y), int(Z), X =< Y |
        A = Z.

factorial:

C:

    int fact(n)
    int n;
    {
        int prod;
        for (prod = 1; n > 0; n--)
            prod *= n;
        return prod;
    }

Janus:

    fact(N, ^X) :-
        int(N), N >= 0 |
        fact(N, ^X, 1).
    fact(N, ^F, A) :-
        int(A), int(N), N > 0 |
        fact(N-1, ^F, A*N).
    fact(0, ^F, A) :- int(A) |
        F = A.
Input: A call forwarding problem I = ⟨E, C, w, s, k⟩.
Output: A solution to I, i.e., a permutation σ of E.
Method:

    begin
        construct the dependency graph G for legal execution orders;
        ActiveSites := C;
        σ := ⟨⟩;
        foreach c ∈ C do count[c] := k od
        Processed := ∅;
        AvailInstrs := the root nodes of G;
        while AvailInstrs ≠ ∅ do
            foreach I ∈ AvailInstrs do
                compute the weight of I as (Σ{w(c) | c ∈ ActiveSites and I ∈ c}) / s(I);
            od;
            I := an element of AvailInstrs with the least weight so computed;
            σ := append I to the end of σ;                 /* extend solution */
            Processed := Processed ∪ {I};
            /* update list of available instructions */
            AvailInstrs := (AvailInstrs \ {I}) ∪ {J ∈ E | preds(J) ⊆ Processed};
            foreach c ∈ ActiveSites s.t. I ∈ c do          /* update list of active sites */
                if count[c] = 0 then
                    delete c from ActiveSites;
                else count[c] := count[c] - 1;
            od
        od;
        return σ;
    end

Figure 1: A Greedy Algorithm for Call Forwarding
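The greedy selection of Figure 1 can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the jc implementation: the type `cf_problem`, the function `cf_greedy`, the bitmask encoding of call sites and predecessor sets (at most 32 entry actions and 32 call sites), and the tie-breaking rule (lowest-numbered action wins) are all hypothetical. The sketch also assumes that an action is available when it is unprocessed and all its predecessors are processed, and that a call site is dropped from the active set when its counter is already 0 and otherwise has its counter decremented, which is one reading of the figure.

```c
#include <assert.h>

#define MAX_SITES 32

/* Hypothetical encoding of a call forwarding problem: actions are numbered
   0..n-1, and each set of actions is a bitmask (bit i set = action i). */
typedef struct {
    int n;                   /* number of entry actions, |E| (<= 32)     */
    int m;                   /* number of call sites, |C| (<= MAX_SITES) */
    const unsigned *site;    /* site[c]: the set of actions in site c    */
    const double *w;         /* w[c]: weight of call site c              */
    const double *s;         /* s[i]: size of action i, must be > 0      */
    const unsigned *preds;   /* preds[i]: actions that must precede i    */
    int k;                   /* initial value of count[c]                */
} cf_problem;

/* Writes the chosen order of actions into order[] and returns how many
   actions were placed (fewer than n only if the dependencies are cyclic). */
int cf_greedy(const cf_problem *p, int *order)
{
    unsigned processed = 0;
    int active[MAX_SITES], count[MAX_SITES];
    int placed = 0;

    assert(p->n >= 0 && p->n <= 32 && p->m >= 0 && p->m <= MAX_SITES);
    for (int c = 0; c < p->m; c++) {
        active[c] = 1;
        count[c] = p->k;
    }

    while (placed < p->n) {
        int best = -1;
        double best_w = 0.0;

        /* Weigh every available action; keep the one of least weight. */
        for (int i = 0; i < p->n; i++) {
            unsigned bit = 1u << i;
            if ((processed & bit) || (p->preds[i] & ~processed))
                continue;                    /* placed already, or blocked */
            double wt = 0.0;
            for (int c = 0; c < p->m; c++)
                if (active[c] && (p->site[c] & bit))
                    wt += p->w[c];
            wt /= p->s[i];
            if (best < 0 || wt < best_w) {
                best = i;
                best_w = wt;
            }
        }
        if (best < 0)
            break;                           /* no available action left */

        order[placed++] = best;              /* extend solution */
        processed |= 1u << best;

        /* Update the active sites that contain the chosen action. */
        for (int c = 0; c < p->m; c++) {
            if (active[c] && (p->site[c] & (1u << best))) {
                if (count[c] == 0)
                    active[c] = 0;
                else
                    count[c]--;
            }
        }
    }
    return placed;
}
```

With three independent actions of size 1, three singleton call sites of weights 1, 5 and 3, and k = 0, the sketch orders the actions 0, 2, 1: the action contained in the lightest active site is always placed first.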
(a) before call forwarding

    ave: Arg1 := deref(Arg1)
         if ¬list(Arg1) goto Err
         Arg2 := deref(Arg2)
         if ¬number(Arg2) goto Err
         Arg3 := deref(Arg3)
         if ¬number(Arg3) goto Err
         if Arg1 == nil goto L1
         t1 := head(Arg1)
         t1 := deref(t1)
         if ¬number(t1) goto Err
         Arg2 := add(Arg2, t1)
         Arg3 := add(Arg3, 1)
         Arg1 := tail(Arg1)
         goto ave
    L1:  t1 := div(Arg2, Arg3)
         Arg4 := deref(Arg4)
         assign(Arg4, t1)

(b) after call forwarding

    ave: Arg2 := deref(Arg2)
         if ¬number(Arg2) goto Err
         Arg3 := deref(Arg3)
         if ¬number(Arg3) goto Err
    L0:  Arg1 := deref(Arg1)
         if ¬list(Arg1) goto Err
         if Arg1 == nil goto L1
         t1 := head(Arg1)
         t1 := deref(t1)
         if ¬number(t1) goto Err
         Arg2 := add(Arg2, t1)
         Arg3 := add(Arg3, 1)
         Arg1 := tail(Arg1)
         goto L0
    L1:  t1 := div(Arg2, Arg3)
         Arg4 := deref(Arg4)
         assign(Arg4, t1)

Figure 2: The Effect of Call Forwarding on Intermediate Code for the ave procedure
Program      no optimization   greedy   optimal
hanoi        1776              225      225
tak          492               172      172
nrev         574               360      360
qsort        726               450      450

Remaining entries for the rows factorial, merge, dnf and pi, in source order: 129 720 330 24; 5963 124 306 1304 25 1304 330 24

Table 1: Efficacy of the greedy Call Forwarding heuristic (in Sparc assembly instructions)

Program      w/o forwarding (ms)   with forwarding (ms)   % improvement
hanoi        186                   163                    12.4
tak          299                   207                    30.8
nrev         1.17                  0.716                  38.8
qsort        2.31                  1.87                   19.0
merge        0.745                 0.613                  17.7
dnf          0.356                 0.191                  46.3
binomial     5.95                  5.14                   13.6

One further entry in the % improvement column, 25, has no recoverable row.

Table 2: Performance Improvement due to Call Forwarding

Program          jc (J) (ms)   Sicstus (S) (ms)   Quintus (Q) (ms)   S/J     Q/J
hanoi            163           300                690                1.84    4.23
tak              207           730                2200               3.53    10.63
nrev             0.716         1.8                7.9                2.51    11.03
qsort            1.87          5.1                9.4                2.73    5.03
factorial        0.049         0.27               0.44               5.51    8.98
Geometric Mean:                                                      3.31    6.72

Table 3: The performance of jc, compared with Sicstus and Quintus Prolog

Program          jc (J) (ms)   C (unopt) (ms)   C (opt: -O4)   J/C-unopt   J/C-opt
nrev             0.716         0.89             0.52           0.80        1.38
binomial         5.14          4.76             3.17           1.08        1.62
dnf              0.191         0.191            0.061          1.00        3.13
qsort            1.33          1.25             0.34           1.06        3.91
tak              207           208              72             1.00        2.88
factorial        0.049         0.049            0.036          1.00        1.36
Geometric Mean:                                                0.98        2.18

Table 4: The performance of jc compared to C