Software Pipelining with Register Allocation and Spilling

Jian Wang†, Andreas Krall, M. Anton Ertl, Christine Eisenbeis‡
Institut für Computersprachen, Technische Universität Wien
Argentinierstr. 8, A-1040 Wien, Austria

Abstract

Simultaneous register allocation and software pipelining is still poorly understood and remains an open problem. In this paper, we first present the Register Requirement Graph (RRG), which can dynamically reflect the register requirement during software pipelining. Then, using the RRG as a basis, we develop a Register-Pressure-Sensitive (RPS) scheduling technique and study the problem of register spilling for software pipelining. We also present three algorithms: RPS without spilling, RPS with spilling, and software pipelining with a limited number of registers. The preliminary experimental results show that the first two algorithms can efficiently reduce the register requirement without degradation of the optimal performance, and that the third can effectively exploit instruction-level parallelism within loops even for machines with a small register file.

Keywords: Instruction-level Parallelism, Loop Scheduling, Software Pipelining, Register Allocation, Spilling, Data Dependence Graph

1 Introduction

It is well known that exploiting Instruction-Level Parallelism (ILP) within loops has become a key compilation issue for instruction-level parallel processors such as Very Long Instruction Word (VLIW) and superscalar machines [1,2,3]. Software pipelining has been proposed for exploiting ILP within loops; it can effectively overlap the execution of operations from different iterations [4,5,6,7,8,9,10,11,12].

Register allocation is another key compilation issue [13,14,15,16,17]. It is well known that performing register allocation before software pipelining may introduce unacceptable anti-dependences due to the reuse of registers, which may limit software pipelining [7,…]. On the other hand, if software pipelining is done before register allocation, more registers than necessary may be needed, which may cause unnecessary register spilling and severely degrade the performance of the pipelined loop […]. However, simultaneous register allocation and software pipelining is still poorly understood and remains open.

† This work was supported by the Lise Meitner Stipendium funded by the Austrian Science Foundation (FWF) and the Austrian Science and Research Ministry.
‡ Dr. Eisenbeis is with INRIA-Rocquencourt, Domaine de Voluceau, BP 105 - 78153, Le Chesnay Cedex, France.

The interaction between register allocation and loop-free code scheduling has been studied since the mid-1980s […0,…8,…,…6,…9], and register allocation for software pipelined loops has been studied by many researchers, with some efficient techniques proposed […0,…,…7,…5]. However, the interaction between register allocation and software pipelining has only recently been considered, and in few studies. Mangione-Smith et al. developed a lower bound on the number of registers needed for a given modulo scheduled loop […]. Ning and Gao have presented a framework of register allocation for software pipelining by which they deduce the minimal number of registers needed for finding some optimal software pipelined loop […], but they do not consider the resource constraints. A technique called lifetime-sensitive modulo scheduling has been presented by Huff […], in which he uses the idea of bidirectional slack-scheduling to perform modulo scheduling while trying to shorten the lifetime of a variable, but he does not consider the register spilling problem.

The approaches presented in this paper differ from all of the above. In order to understand the interaction between register allocation and software pipelining, we present a novel framework, called the Register Requirement Graph (RRG), which can dynamically reflect the register requirement during software pipelining. During software pipelining, the RRG is used to control the register pressure caused by software pipelining itself. On one hand, the RRG gives the register-related information needed to guide the scheduling process such that no more registers than necessary are needed. On the other hand, from the RRG we can dynamically estimate the register requirement such that the spilling decision and the tradeoff between the initiation interval and the register pressure are made efficiently.

The next section gives a background to make this paper self-contained. The work reported in this paper can be summarized as follows: (1) present the RRG to estimate the register requirement during software pipelining (Section 3); (2) use the RRG to develop a Register-Pressure-Sensitive (RPS) scheduling technique (Section 4); (3) study the problem of register spilling to reduce the register pressure without degradation of the optimal performance (Section 5); (4) present three software pipelining algorithms: RPS without spilling, RPS with spilling, and software pipelining with a limited number of registers (Section 6); (5) give the preliminary experimental results to indicate the efficiency of the three algorithms (Section 7).

2 Decomposed Software Pipelining (DESP)

The data dependences of a loop can be represented by a Loop Data Dependence Graph (LDDG), $(O, E, \lambda, \delta)$, where $O$ is the operation set and $E$ the dependence edge set; the dependence distance $\lambda$ and the delay $\delta$ are two non-negative integers associated with each edge. For example, $e = (op, op')$ and $(\lambda(e), \delta(e))$ denote that $op'$ can only be issued $\delta(e)$ cycles after the start of the operation $op$ of the $\lambda(e)$-th previous iteration […,9].

Note: For all examples in this paper, the loop-independent dependence edges are solid edges whereas loop-carried ones are dotted if we do not attach $(\lambda, \delta)$ to each edge.

DESP is a novel modulo scheduling approach, and its idea can be illustrated by Figure 2.1 as an example. First, we modify the LDDG by removing some edges so that the graph becomes acyclic. Secondly, we apply the list scheduling technique on the modified graph to generate the software pipelined loop body under the resource constraints, and use the row-number to denote the cycle-number of each operation in the loop body. Thirdly, we determine the iteration number (denoted as column-number in the context of DESP) of each operation such that all data dependences in the LDDG are satisfied.

Formally, DESP decomposes the loop schedule into two functions, row-number and column-number.

[Figure 2.1. Decomposed software pipelining of an example loop (panels: LDDG, MLDDG, row-numbers rn, steps 1 to 3)]

Definition 2.1. Let $G = (O, E, \lambda, \delta)$ be the LDDG of a loop, and $\sigma$ a valid loop schedule for $G$ with initiation interval $II$. We define the row-number $rn$ and the column-number $cn$, two mappings from $O$ to $N$ (the non-negative integer set), such that

$$\sigma(op, 1) = rn(op) + II \cdot (cn(op) - 1) \quad \text{and} \quad \sigma(op, i) = \sigma(op, 1) + II \cdot (i - 1).$$

Thus, software pipelining can be described as follows with the concepts of row-number and column-number.

Definition 2.2. (Decomposed Software Pipelining) Let $G = (O, E, \lambda, \delta)$ be the LDDG of a loop. We say that the row-number, $rn$, and the column-number, $cn$, are valid for the loop if and only if the following constraints are satisfied:

1. Resource constraints: $\forall op_i, op_j \in O$, if $rn(op_i) = rn(op_j)$, then $op_i$ and $op_j$ cannot be in resource conflict;

2. Dependence constraints:

$$\exists II \in N, \ \forall e = (op, op') \in E: \quad rn(op') - rn(op) + II \cdot (\lambda(e) + cn(op') - cn(op)) \ge \delta(e).$$

$II$ is called the initiation interval, or the length of the software pipelined loop body. The goal of decomposed software pipelining is to find a valid row-number and column-number with minimum $II$.

Note: Here, we only consider pipelined operations and single-cycle operations, but the definition is easily extended to the case of multi-cycle non-pipelined operations.

In our previous papers [24,25,26], we have proven the following theoretical results.

Theorem 2.1. For a given LDDG, suppose we have constructed a row-number $rn$ which satisfies the resource constraints. We can construct a column-number $cn$ such that the data dependence constraints are also satisfied if and only if, for each cycle $C$ of the LDDG,

$$\sum_{e \in C} \mu(e) \le 0, \quad \text{where } \mu(e) = -\lambda(e) + \left\lceil \frac{\delta(e) + rn(op) - rn(op')}{II} \right\rceil, \ e = (op, op').$$
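As an illustration of Definition 2.2 and Theorem 2.1, the following is a minimal Python sketch, not the authors' implementation. It assumes the dependence constraint is solved for the column-numbers as cn(op') − cn(op) ≥ ⌈(δ(e) + rn(op) − rn(op'))/II⌉ − λ(e) and then relaxes these lower bounds repeatedly (a Bellman-Ford style pass). The function names, the three-operation graph and its row-numbers are hypothetical, and resource constraints are not checked.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def column_numbers(ops, edges, rn, ii):
    """Return a column-number for every operation, or None if none exists.

    edges holds tuples (op, op2, lam, delta). Each edge requires
    cn[op2] - cn[op] >= ceil((delta + rn[op] - rn[op2]) / ii) - lam,
    which is the dependence constraint of Definition 2.2 solved for cn.
    """
    cn = {op: 1 for op in ops}
    for _ in range(len(ops)):
        changed = False
        for op, op2, lam, delta in edges:
            need = cn[op] + ceil_div(delta + rn[op] - rn[op2], ii) - lam
            if cn[op2] < need:
                cn[op2] = need
                changed = True
        if not changed:
            return cn
    # Still changing after |O| passes: some cycle violates Theorem 2.1.
    return None


def dependences_hold(edges, rn, cn, ii):
    """Check the dependence constraints of Definition 2.2 directly."""
    return all(rn[b] - rn[a] + ii * (lam + cn[b] - cn[a]) >= delta
               for a, b, lam, delta in edges)


def sigma(rn, cn, ii, op, i):
    """Issue cycle of operation op in iteration i (Definition 2.1)."""
    return rn[op] + ii * (cn[op] - 1) + ii * (i - 1)


# Hypothetical example: three operations, one loop-carried edge c -> a.
ops = ["a", "b", "c"]
rn = {"a": 2, "b": 1, "c": 2}
feasible = [("a", "b", 0, 3), ("b", "c", 0, 1), ("c", "a", 3, 1)]
infeasible = [("a", "b", 0, 3), ("b", "c", 0, 1), ("c", "a", 2, 1)]

cn = column_numbers(ops, feasible, rn, ii=2)
print(cn)                                      # {'a': 1, 'b': 3, 'c': 3}
print(dependences_hold(feasible, rn, cn, 2))   # True
print(column_numbers(ops, infeasible, rn, 2))  # None
```

In the second graph the cycle a, b, c has a positive sum of lower bounds, so no column-number assignment exists for II = 2; a larger II or a different row-number assignment would be needed.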

The following corollary follows directly from Theorem 2.1.

Corollary 2.1. For an LDDG without cycles, if we have constructed a row-number taking into account the resource constraints, then we can always construct a column-number such that the data dependence constraints are also satisfied.

3 Register Requirement Graph

In decomposed software pipelining, the column-number is an important parameter for controlling the register requirement of each variable. In fact, the register requirement is mainly determined by the difference between the column-numbers of two operations which have a data dependence (denoted as $dcn_{ij}$). For example, suppose variable $u$ is written by $op_i$ and read by $op_j$; then $dcn_{ij}$ gives the estimate of the lifetime of $u$. Thus, we first present the Register Requirement Graph (RRG), which can dynamically estimate $dcn_{ij}$. The RRG gives the heuristics to guide the scheduling process (determining the row-number).

Our software pipelining framework is based on DESP as shown in Figure 2.1. In the first step, we use the following method to modify the LDDG [24,25,26]:

(1) find all strongly connected components (SCCs) in the LDDG, and remove all edges which are not included in the SCCs;
(2) under unlimited resource constraints, generate a software pipelined loop for the SCCs, denoted as $(rn', cn')$;
(3) for each edge $e = (op_i, op_j)$ of the SCCs, if $rn'(op_j) - rn'(op_i) < \delta(e)$, then remove $e$ from the SCCs.

The remaining graph is acyclic, denoted as MLDDG. We have proven that any row-numbers satisfying the data dependences of the MLDDG must satisfy the condition of Theorem 2.1.

Given the LDDG $(O, E, \lambda, \delta)$ of a loop, after the first step of decomposed software pipelining, we obtain an acyclic dependence graph MLDDG $= (O, E_m, \delta)$. A new graph, called the Register Requirement Graph, is defined as RRG $= (O, E, \omega)$, where $\omega$ is a weight on each edge which represents the estimated difference between the column-numbers of two operations in the worst case. Let $MII$ be the estimated minimum initiation interval. Before scheduling the software pipelined loop body, we initially define $\omega$ as follows:

(1) $\omega(e) = -\lambda(e)$, $\forall e \in E_m$;
(2) $\omega(e) = -\lambda(e) + \lceil (\delta(e) + MII - 1) / MII \rceil$, $\forall e \in E - E_m$.

While scheduling the software pipelined loop body, we recompute $\omega(e)$ for each $e = (op_i, op_j) \in E - E_m$ as follows:

(1) $\omega(e) = -\lambda(e) + \lceil (\delta(e) - 1 + rn(op_i)) / MII \rceil$, if $rn(op_i)$ is determined but $rn(op_j)$ is not;
(2) $\omega(e) = -\lambda(e) + 1 + \lceil (\delta(e) - rn(op_j)) / MII \rceil$, if $rn(op_j)$ is determined but $rn(op_i)$ is not;
(3) $\omega(e) = -\lambda(e) + \lceil (\delta(e) - (rn(op_j) - rn(op_i))) / MII \rceil$, if $rn(op_i)$ and $rn(op_j)$ are both determined.

An example of an RRG is given in Figures 3.1 and 3.2. Figure 3.1(1) is the loop and (2) the machine model. Its LDDG and MLDDG are shown in Figure 3.2(1) and (2), respectively. Figure 3.2(3) is the initial RRG. Figure 3.2(4) is the RRG when rn(op…) = rn(op…) = rn(op5) = rn(op6) = … and rn(op…) = rn(op4) = ….
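The case analysis for the edge weights can be written as one small function. This is a hedged Python sketch of the rules as reconstructed above, assuming row-numbers range over 1..MII; the function and parameter names are illustrative and not part of the paper.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def omega(lam, delta, mii, in_mlddg, rn_i=None, rn_j=None):
    """Worst-case column-number difference for an edge e = (op_i, op_j).

    lam, delta : dependence distance and delay of e
    mii        : estimated minimum initiation interval
    in_mlddg   : True if e belongs to E_m (its weight is never recomputed)
    rn_i, rn_j : row-numbers already determined, or None if still open
    """
    if in_mlddg:
        return -lam
    if rn_i is None and rn_j is None:      # initial weight
        return -lam + ceil_div(delta + mii - 1, mii)
    if rn_j is None:                       # only rn(op_i) is known
        return -lam + ceil_div(delta - 1 + rn_i, mii)
    if rn_i is None:                       # only rn(op_j) is known
        return -lam + 1 + ceil_div(delta - rn_j, mii)
    return -lam + ceil_div(delta - (rn_j - rn_i), mii)


# A loop-carried edge with distance 1 and delay 1, MII = 2.
print(omega(1, 1, 2, in_mlddg=False))                    # 0
print(omega(1, 1, 2, in_mlddg=False, rn_i=1, rn_j=2))    # -1
```

The second call shows the effect exploited by the scheduler: once both row-numbers are fixed so that the edge is satisfied within the loop body, the weight drops to −λ(e).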

Figure 3.1 An Example

(1) The Loop

The original loop:

    for i = 1 to n do
        s = s + a[i]
        a[i] = s * s * a[i]
    enddo

The code of the loop body:

    1. t0 = t0 + 1;
    2. t1 = a[t0];
    3. s = s + t1;
    4. t2 = s * s;
    5. t3 = t2 * t1;
    6. a[t0] = t3

(2) The Machine Model

    Pipeline       Number   Operation             Latency
    Memory port    …        Load, Store           …
    Address ALU    …        Add/Sub               …
    Adder          …        FAdd/FSub, IAdd/ISub  …
    Multiplier     …        FMUL, IMUL            …

[Figure 3.2. LDDG, MLDDG and RRGs (panels: (1) LDDG, (2) MLDDG, (3) initial RRG, (4) RRG during scheduling)]

A definition-use path is defined as a path from the operation writing a variable to any operation reading the variable in the LDDG. The critical definition-use path of variable $u$, $cdup_u$, is defined by

$$\sum_{e \in cdup_u} \omega(e) = \max_{\text{any } dup \text{ of } u} \Big( \sum_{e \in dup_u} \omega(e) \Big).$$

Let RRG $= (O, E, \omega)$. For each edge $e \in E$, $\nu(e)$ is defined as the number of variables whose critical definition-use path includes $e$.

The RRG has the following two properties:

(1) Let RRG $= (O, E, \omega)$ and let $cdup_u$ be the critical definition-use path of $u$. Then $\sum_{e \in cdup_u} \omega(e)$ gives the estimate of the register requirement of $u$.

(2) Let RRG $= (O, E, \omega)$. During scheduling of the software pipelined loop body, for any edge $e$ which is in the LDDG but not included in the MLDDG, if $e$ is satisfied (that is, $rn(op_j) - rn(op_i) \ge \delta(e)$, $e = (op_i, op_j)$), then the register requirement may be decreased by up to $(\omega(e) + \lambda(e)) \cdot \nu(e)$ registers compared to the case when $e$ is not satisfied.

4 RPS Scheduling

We present the following two heuristics to direct the scheduling process:

(1) Delay some operations to be scheduled such that some dependence edges can be satisfied in the software pipelined loop body;
(2) Develop register-pressure-sensitive heuristics to determine the scheduling priorities for operations.
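To make the two RRG properties concrete, here is a small Python sketch, offered as an illustration under stated assumptions rather than as the paper's procedure. It assumes the definition-use paths of each variable have already been enumerated as lists of edges; the variables u and v, the operation names and the weights are hypothetical.

```python
from collections import Counter


def critical_path(paths, omega):
    """Pick the definition-use path with the largest total weight."""
    return max(paths, key=lambda path: sum(omega[e] for e in path))


def analyse(dups, omega):
    """Return (register estimate per variable, nu per edge).

    dups  : {variable: [path, ...]}, each path a list of edges (op_i, op_j)
    omega : {edge: weight} taken from the RRG
    nu[e] counts the variables whose critical path includes edge e.
    """
    estimate, nu = {}, Counter()
    for var, paths in dups.items():
        cdup = critical_path(paths, omega)
        estimate[var] = sum(omega[e] for e in cdup)
        nu.update(set(cdup))
    return estimate, nu


# Hypothetical RRG fragment: u has two definition-use paths, v has one.
omega = {("a", "b"): 1, ("a", "c"): 1, ("c", "d"): 1}
dups = {
    "u": [[("a", "b")], [("a", "c"), ("c", "d")]],
    "v": [[("c", "d")]],
}
estimate, nu = analyse(dups, omega)
print(estimate)          # {'u': 2, 'v': 1}
print(nu[("c", "d")])    # 2
```

With these counts, the bound of property (2) for an edge e is simply (omega[e] + lam[e]) * nu[e].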

In the second step of our software pipelining framework, we use list scheduling on the MLDDG (obtained in the first step) to determine the row-numbers for all operations. First, we find all schedulable operations at the current cycle and put them into the Data Ready Set (DRS); then we select the operations with the highest scheduling priority to schedule.

As most dependence edges originally in the LDDG have been removed in the MLDDG, there may be many schedulable operations in the DRS at each cycle. Without increasing the estimated II, it is quite possible that some operations can be delayed so that some dependence edges avoid being broken unnecessarily.

We suggest that an operation can be delayed and removed from the current DRS only if

(1) The operation does not use the critical resources. A resource res is one of the critical resources if $t - 1 + \lceil n/N \rceil \ge$ the estimated II, where $t$ is the current cycle, $n$ is the number of operations using res and $N$ is the number of res in the machine; and

(2) The lengths of the resulting dependence paths are not greater than the estimated II. That is, $t + \delta(e) + height(op) \le$ the estimated II, where $t$ is the current cycle, $e$ is the dependence edge which we are willing to hold and $height(op)$ is the height of $op$ in the MLDDG.

Note: The estimated II can be derived from the critical cycle of the LDDG and the number of operations using the critical resources.

When more than one operation can be delayed, we first consider the operation with the greatest value of $(\omega(e) + \lambda(e)) \cdot \nu(e)$, where $e$ is the dependence edge which we are willing to hold. For the example of Figure 3.2(2), at the first cycle, all operations are schedulable and can be put into the DRS, but only operation … can be delayed and removed from the DRS.

Next we discuss how to determine the scheduling priorities for the operations of the DRS. In order to obtain the optimal time efficiency, we consider the height of an operation in the MLDDG as the first heuristic. The second heuristic is sensitive to the register pressure and is derived from the RRG.

At the current cycle $t$, suppose $op_i$ and $op_j$ are the operations with the greatest height in the MLDDG. If $op_i$ and $op_j$ are not in resource conflict, then they should both be scheduled at $t$. If $op_i$ and $op_j$ are in resource conflict, then we use the second heuristic to determine their scheduling priorities as follows:

(1) If one operation is scheduled at $t$, then the other should be scheduled after the $t$-th cycle;

(2) Suppose $op_i$ is scheduled at $t$, and let $DES(op_i)$ be the dependence edge set which includes all edges adjacent to $op_i$. Letting $rn(op_i) = t$, we re-compute the new value of $\omega$ for each edge in $DES(op_i)$, denoted as $\omega_{new}$. Thus, we can compute the register-benefit of $op_i$,

$$\beta(op_i, t) = \sum_{e \in DES(op_i)} (\omega(e) - \omega_{new}(e)) \cdot \nu(e);$$

(3) By the same method as step (2), we compute $\beta(op_j, t)$;

(4) The operation with the greater value of $\beta$ (the register-benefit) is the one with the higher scheduling priority.
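The delay conditions and the register-benefit can be sketched in Python as follows. This is a minimal illustration, not the paper's code: the comparison used for the critical-resource test (t − 1 + ⌈n/N⌉ ≥ estimated II) is an assumption made here, and all function names and the sample weights are hypothetical.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def is_critical(t, n_ops, n_units, est_ii):
    """Assumed test: the resource cannot absorb its n_ops within est_ii."""
    return t - 1 + ceil_div(n_ops, n_units) >= est_ii


def may_delay(t, delta_e, height, uses_critical, est_ii):
    """Conditions (1) and (2) for removing an operation from the DRS."""
    return (not uses_critical) and t + delta_e + height <= est_ii


def register_benefit(des, omega, omega_new, nu):
    """beta(op, t): weighted drop of the edge weights adjacent to op."""
    return sum((omega[e] - omega_new[e]) * nu[e] for e in des)


def pick(conflicting_ops, benefit):
    """Among conflicting operations of equal height, prefer the largest beta."""
    return max(conflicting_ops, key=lambda op: benefit[op])


# One multiplier, two multiplications, estimated II of 2, first cycle.
print(is_critical(t=1, n_ops=2, n_units=1, est_ii=2))    # True

# Hypothetical weights for two edges adjacent to an operation.
omega = {"e1": 1, "e2": 2}
omega_new = {"e1": 0, "e2": 2}
nu = {"e1": 2, "e2": 1}
print(register_benefit(["e1", "e2"], omega, omega_new, nu))   # 2
```

The first call mirrors the example loop, where the single multiplier must serve two multiplications per iteration and is therefore treated as critical from the first cycle.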

4.1 Register Spilling

Spilling decisions are conventionally made only when a register conflict occurs, that is, when the number of simultaneously live variables is greater than the number of available machine registers. The effect of spilling is to keep the result of a computation in memory rather than in a register, so that the register can be re-used to keep the result of a new computation, at the cost of increasing the number of load/store operations and probably degrading the code performance. Software pipelining overlaps the execution of operations from different iterations, increasing register pressure and generating excessive spill code in the case of small machine register files.

This section discusses the register spilling problem for software pipelining. Our starting point is that the spilling decision should be made during software pipelining, so that the interactions between register allocation and loop scheduling can be seen. The RRG can dynamically reflect the change in the register requirement during software pipelining and makes this starting point feasible.

Two problems are discussed: (1) When is a spilling decision made during software pipelining? (2) How is spilling done?

In the loop body, we suppose that a variable has only one definition (the operation defining the variable) but may have more than one use (the operations using the variable). We first make a remark: the meaning of spilling in the context of this paper differs from the conventional spilling problem [14]. We speak of spilling a use (or a group of uses), not of spilling a variable (that is, spilling all its uses). By spilling a use, we mean that a store operation after the definition and a load operation before the spilled use are inserted, while the other uses still reference the value of the variable in a register.

From the RRG, we can dynamically estimate the register requirement at each cycle. Spilling is needed only if the number of required registers is greater than the number of available machine registers. In fact, other measures, such as delaying the scheduling of some operations and introducing some dependence edges into the MLDDG, can also decrease the register requirement.

Another necessary condition for spilling is that the load/store operations caused by spilling do not increase the estimated II. In the case that there are not enough available machine registers to reach the estimated II, we first increase the estimated II and then consider spilling or other measures to decrease the register requirement (see next section).

The spilling process consists of two steps: (1) Select a use (or a group of uses) for spilling; (2) Modify the MLDDG and the RRG by adding the necessary load/store operations and recomputing the values of the corresponding $\omega$ and $\nu$.

The spilling-benefit of a use is defined as the number of saved registers per inserted load/store operation. More precisely, given a use, $use(op, u)$, where $op$ is the operation using variable $u$, under the assumption that $use(op, u)$ has been spilled, we re-compute the minimal register requirement of variable $u$ and the newly introduced variable, denoted as $K'_u$. Thus, the spilling-benefit of $use(op, u)$ is $\lceil (K_u - K'_u) / 2 \rceil$, as a store and a load are inserted into the MLDDG and the RRG for spilling a use.

Obviously, a use with a greater value of spilling-benefit is the one with the higher spilling priority.

We take the loop shown in Figure 3.1 as an example to illustrate the above ideas. We discuss two cases: (1) scheduling without spilling; (2) scheduling with spilling.

For the first case, the estimated II is 2, since the machine has one multiplier but the loop body contains two multiplications. The software pipelined loop body can be found under the constraints of the MLDDG (shown in Figure 3.2(2)) and the initial RRG (shown in Figure 3.2(3)). By delaying operation …, we obtain rn(…) = rn(…) = rn(5) = rn(6) = … and rn(…) = rn(4) = …. It is easy to compute the number of required registers, which is ….

For the second case, the estimated II is also 2.
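The spilling-benefit and the resulting spilling priority can be sketched as below. This is an illustrative Python fragment under assumptions: the register requirements before and after spilling are taken as given inputs, the rule that a use is only selected when its benefit is positive is an assumption of this sketch, and the uses and their numbers are hypothetical rather than taken from the example loop.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)


def spilling_benefit(k_u, k_u_spilled):
    """Registers saved per inserted operation (one store plus one load)."""
    return ceil_div(k_u - k_u_spilled, 2)


def select_use(requirements):
    """Return the use with the greatest positive spilling-benefit, or None.

    requirements maps a use, written (op, variable), to the pair
    (K_u, K'_u): register requirement before and after spilling that use.
    """
    best, best_benefit = None, 0
    for use, (k_u, k_u_spilled) in requirements.items():
        benefit = spilling_benefit(k_u, k_u_spilled)
        if benefit > best_benefit:
            best, best_benefit = use, benefit
    return best


# Hypothetical requirements for three uses.
requirements = {
    ("opA", "x"): (5, 2),   # benefit ceil(3 / 2) = 2
    ("opB", "x"): (5, 4),   # benefit ceil(1 / 2) = 1
    ("opC", "y"): (3, 3),   # benefit 0, never selected
}
print(select_use(requirements))   # ('opA', 'x')
```

A fuller version would also reject a candidate whose inserted load and store would raise the estimated II, as required by the necessary condition stated above.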

After computing the spilling-benefits of all uses, we find that use(op6, t0) has the greatest value of spilling-benefit, which is ⌈(7 − …)/2⌉ = …, so use(op6, t0) has the highest spilling priority. After spilling use(op6, t0), the modified MLDDG and the modified initial RRG are shown in Figure 5.1. By delaying operation …, we obtain rn(…) = rn(…) = rn(5) = rn(6) = rn(s) = … and rn(…) = rn(4) = rn(l) = …. It is easy to compute the number of required registers, which is ….

[Figure 5.1. Scheduling with Register Spilling: (1) the modified MLDDG; (2) the modified initial RRG; l and s denote the inserted load and store]

An important observation is that spilling can decrease the register requirement without degradation of the optimal software pipelining performance if the spilling decision can be controlled efficiently.

5 Algorithms

On the basis of the last three sections, we present three software pipelining algorithms. The first two perform software pipelining to minimize the register requirement, and the third performs software pipelining with a limited number of registers.

5.1 RPS Scheduling without Spilling

The algorithm is described as follows:

Algorithm RPS-without-Spilling;
INPUT: The loop to be software pipelined and its LDDG;
OUTPUT: The software pipelined loop;
BEGIN
1. Construct the MLDDG, determine the estimated II;
2. Compute the height of each operation in the MLDDG;
3. Find all definition-use paths of each variable, construct the RRG;
4. Find all schedulable operations and put them in the DRS;
5. Find those operations which can be delayed, one by one, and remove them from the DRS;
6. Determine the scheduling priorities of all operations in the DRS;
7. Under the constraint of resources, select the operation with the highest scheduling priority from the DRS and place it in the current cycle, update the DRS. This step repeats until no operation can be placed in the current cycle;
8. If all operations of the loop have been scheduled then go to step 9; else update the DRS and the RRG and go to step 5;
9. For each operation, let its row-number be its cycle-number. From Theorem 2.1, the column-number of each operation is computed in terms of the row-numbers and the II;
10. Generate the software pipelined loop in terms of the row-numbers and the column-numbers;
END;

5.2 RPS Scheduling with Spilling

This algorithm differs from the RPS-without-Spilling algorithm in that a new spill-checking step is inserted between step 5 and step 6. The new step calls a spill-checking algorithm, which is described as follows:

Algorithm Spill-Checking;
BEGIN
1. If the memory access unit is one of the critical resources, then return;
2. Compute the spilling-benefit of each use; we actually only consider those uses which are on the critical definition-use paths;
3. Under the constraint of not increasing the estimated II, select a use (or a group of uses) for spilling. In this step, if no use can be selected then return;
4. Update the MLDDG, the RRG and the DRS; return;
END;

5.3 Software Pipelining with a Limited Number of Registers

The above two algorithms try to obtain the optimal software pipelined loop with the minimal register requirement. In this section we present an approach for software pipelining with a limited number of registers.

Our idea is that we first estimate the register requirement; if the number of required registers is greater than the given number of available machine registers, then we increase the estimated II such that the register requirement is reduced.

However, it is difficult and complicated to estimate the register requirement precisely. The RRG only estimates the register requirement of each variable. The problem of which variables can share the same registers remains open during software pipelining.

We present the following heuristics. Let $K_0$ be the given number of available machine registers, and $K_{est}$ the estimated number of required registers from the RRG. A non-negative integer $N_0$ is introduced. If $K_{est} - N_0 \le K_0$, then we call the algorithm of RPS scheduling with spilling; else we first increase the estimated II (and maybe also increase $N_0$ in some cases) to satisfy $K_{est} - N_0 \le K_0$.

After getting the software pipelined loop body, we can precisely compute the number of required registers. If the number is greater than $K_0$, then we increase the estimated II and call the algorithm of RPS with spilling again. The process repeats until a software pipelined loop body is obtained whose register requirement is not greater than $K_0$.

We do not yet have any theoretical analysis of $N_0$, but we believe that $N_0$ can be estimated empirically.
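The iteration of Section 5.3 can be outlined in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: the two callbacks estimate_registers and schedule_with_spilling stand in for the RRG estimate and for the RPS-with-spilling algorithm, the cap max_ii is added only to guarantee termination, and the option of also increasing N0 is not modelled. The toy register model at the end is purely hypothetical.

```python
def pipeline_with_register_limit(k0, n0, est_ii, estimate_registers,
                                 schedule_with_spilling, max_ii=64):
    """Raise the II until a loop body fits into k0 registers.

    estimate_registers(ii)     -> K_est, the RRG-based estimate (hypothetical)
    schedule_with_spilling(ii) -> (loop_body, registers_actually_required)
    Returns (ii, loop_body), or None if no body fits up to max_ii.
    """
    ii = est_ii
    while ii <= max_ii:
        # First bring the estimate within reach: K_est - N0 <= K0.
        while estimate_registers(ii) - n0 > k0 and ii < max_ii:
            ii += 1
        body, required = schedule_with_spilling(ii)
        # The precise requirement is only known once the body exists.
        if required <= k0:
            return ii, body
        ii += 1
    return None


# Toy model: register needs shrink as the II grows.
def toy_estimate(ii):
    return 20 // ii + 4


def toy_schedule(ii):
    return ("loop body for II=%d" % ii, 16 // ii + 2)


result = pipeline_with_register_limit(k0=8, n0=2, est_ii=2,
                                      estimate_registers=toy_estimate,
                                      schedule_with_spilling=toy_schedule)
print(result)   # (3, 'loop body for II=3')
```

In the toy run, the estimate at II = 2 exceeds the limit even after subtracting N0, so the II is raised to 3 before any scheduling is attempted; the scheduled body then needs 7 registers and is accepted.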

6 Preliminary Experimental Results

The effort to implement the algorithms presented in this paper is underway. Before conducting extensive experimental tests, we select six examples to verify our algorithms. Except for example 1, the other five examples are selected from the Livermore Benchmarks, shown in Table 1. As our preliminary experiments are mainly conducted by manual simulation, we try to select some simple loops in a random way. The machine model we use in the experiments is shown in Figure 3.1(2).

Table 1. Experimental Examples

    Example   L   MII   with lcd?   Remarks
    …         …   …     …           …

note 1: L = the length of the longest dependence path in the loop body.
note 2: MII = the minimal II.
note 3: lcd = loop-carried dependence.
(Example 1 is the loop of Figure 3.1(1); the remaining examples are Livermore kernels.)

Table 2 gives the register requirements for the optimal software pipelining performance by three scheduling approaches: DESP, RPS without spilling and RPS with spilling. Although DESP itself adopts measures to reduce the register requirement when it determines the column-numbers, the algorithm of RPS without spilling can still obtain an improvement over DESP from 7.4% to 7.9% in register use without degradation of the optimal performance, except for examples … and 5. For examples … and 5, no improvement in register use can be obtained since the initiation intervals (II) of the software pipelined loops are …. For examples … and …, the algorithm of RPS with spilling can further obtain an improvement over DESP in register use of ….…% and ….…%, respectively, without degradation of the optimal performance.

Table 2. Register Requirement for Three Scheduling Approaches

    Example   II   DESP   RPS without Spilling   RPS with Spilling
    …         …    …      …                      …

note: II = the initiation interval of the software pipelined loop.

The results of our algorithm for software pipelining with a limited number of registers are presented in Table 3 and Figure 7.1.

Table 3 gives the initiation intervals (II) obtained by our algorithm for the six examples when the number of available machine registers ($K_0$) is 8, 16 and 32, respectively. The relations between $K_0$ and the speedup are shown in Figure 7.1. The speedup is defined as $L/II$, where $L$ is the length of the longest dependence path in the loop body (shown in Table 1), representing the optimal performance when we only exploit the ILP within the loop body. The results show that our algorithm can obtain the optimal speedup when $K_0 = 32$ (the minimal size of the register file in current ILP processors) and an average speedup of ….4 when $K_0 = 8$, indicating that our algorithm can still efficiently exploit the ILP across iterations for loops even with a small register file ($K_0 = 8$).

Table 3. Software Pipelining with a Limited Number of Registers (The Initiation Interval of the Software Pipelined Loop)

    Example   K0 = 8   K0 = 16   K0 = 32
    …         …        …         …

[Figure 7.1. Software Pipelining with a Limited Number of Registers (speedup per example for each value of K0)]

7 Conclusion

This paper presents the Register Requirement Graph (RRG), which can dynamically reflect the register requirement during software pipelining. On the basis of the RRG, a Register-Pressure-Sensitive (RPS) scheduling technique is developed and the problem of register spilling for software pipelining is studied. We also present three algorithms: RPS without spilling, RPS with spilling, and software pipelining with a limited number of registers. The preliminary experimental results indicate that the first two algorithms can efficiently improve the register use without degradation of the optimal performance, and that the third can effectively exploit the ILP across iterations for loops even for machines with a small register file.

The three algorithms are being implemented on our compiler testbed. We expect to conduct extensive experimental tests.

References

[…] J. A. Fisher, D. Landskov, and B. D. Shriver. Microcode compaction: Looking backward and looking forward. In Proceedings of 198… National Computer Conference.
[…] F. Gasperoni. Compilation techniques for VLIW architectures. Technical Report TR45…, New York University, March 1989.
[…] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview and perspective. The Journal of Supercomputing, 7(…), January 199….
[4] B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the …4th International Symposium on Microprogramming and Microarchitectures (MICRO-…4), pages …8…–…98, October 198….
[5] A. Aiken and A. Nicolau. Perfect pipelining: A new loop parallelization technique. In Proceedings of European Symposium on Programming, Lecture Notes in Computer Science, No. …00, pages …–…5. Springer-Verlag, June 1988.
[6] P. Y. T. Hsu. Highly Concurrent Scalar Processing. PhD thesis, University of Illinois, Urbana-Champaign, 1986.
[7] K. Ebcioglu. A compilation technique for software pipelining of loops with conditional jumps. In Proceedings of the …0th International Symposium on Microprogramming and Microarchitectures (MICRO-…0), pages 69–79, 1987.
[8] B. Su, S. Ding, and J. Xia. URPR - an extension of URCR for software pipelining. In Proceedings of the …9th International Symposium on Microprogramming and Microarchitectures (MICRO-…9), pages …04–…08, 1986.
[9] Bogong Su and Jian Wang. Loop-carried dependence and the general URPR software pipelining approach. In Proceedings of the …4th Annual Hawaii International Conference on System Sciences, pages …66–…7…. IEEE and ACM, January 199….
[10] R. F. Touzeau. A Fortran compiler for the FPS-…64 scientific computer. In Proceedings of ACM SIGPLAN Symposium on Compiler Construction, 1984.
[…] A. E. Charlesworth. An approach to scientific array processing: The architecture design of the AP-…0B/FPS-…64 family. Computer, pages …8–…7, September ….
[…] M. S. Lam. A Systolic Array Optimizing Compiler. PhD thesis, CMU, 1987. CMU-CS-87-….
[13] D. G. Bradlee, S. J. Eggers, and R. R. Henry. Integrated register allocation and instruction scheduling for RISCs. In Proceedings of the 4th International Conference on ASPLOS, 199….
[14] G. J. Chaitin. Register allocation and spilling via graph coloring. In Proceedings of ACM SIGPLAN Symp. on Compiler Construction, 198….
[15] L. J. Hendren, G. R. Gao, E. R. Altman, and C. Mukerji. Register allocation using cyclic interval graph: A new approach to an old problem. Technical Report ACAPS Technical Memo …, McGill University, 199….
[16] S. S. Pinter. Register allocation with instruction scheduling: A new approach. In Proceedings of ACM SIGPLAN PLDI, 199….
[17] B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. In Proceedings of PLDI, 199….
[18] J. R. Goodman and W. Hsu. Code scheduling and register allocation in large basic blocks. In Proceedings of International Conference on Supercomputing, 1988.
[19] S. A. Mahlke, W. Y. Chen, P. P. Chang, and W. W. Hwu. Scalar program performance on multiple-instruction-issue processors with a limited number of registers. In Proceedings of the …5th Hawaii International Conference on System Sciences, January 199….
[20] C. Eisenbeis, W. Jalby, and A. Lichnewsky. Compile-time optimization of memory and register usage on the Cray-…. In Proceedings of the Second Workshop on Languages and Compilers, 1989.
[…] Qi Ning and Guang R. Gao. A novel framework of register allocation for software pipelining. Technical Report ACAPS Technical Memo …4, McGill University, 199….
[…] William Mangione-Smith, S. G. Abraham, and E. S. Davidson. Register requirements of pipelined processors. In Proceedings of 199… ACM International Conference on Supercomputing, 199….
[…] R. Huff. Lifetime-sensitive modulo scheduling. In Proceedings of ACM SIGPLAN PLDI, pages …58–…67, June 199….
[24] Jian Wang and Christine Eisenbeis. Decomposed Software Pipelining: A new approach to exploit instruction level parallelism for loop programs. In Michel Cosnard, Kemal Ebcioglu, and Jean-Luc Gaudiot, editors, Proceedings of IFIP WG …0.… Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, pages …–…5. IFIP, North-Holland, January 199….
[25] Jian Wang and Christine Eisenbeis. Decomposed Software Pipelining. Research Report RR-…8…8, INRIA-Rocquencourt, France, 199….
[26] Jian Wang, Christine Eisenbeis, Martin Jourdan, and Bogong Su. Decomposed Software Pipelining: A new perspective and a new approach. International Journal of Parallel Programming, …(…):…57–…79, 1994.


More information

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1 Pipeline Hazards Structure hazard Data hazard Pipeline hazard: the major hurdle A hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage Taxonomy of

More information

A Lab Course on Computer Architecture

A Lab Course on Computer Architecture A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,

More information

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

GPU Computing with CUDA Lecture 4 - Optimizations. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile GPU Computing with CUDA Lecture 4 - Optimizations Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile 1 Outline of lecture Recap of Lecture 3 Control flow Coalescing Latency hiding

More information

Building an Inexpensive Parallel Computer

Building an Inexpensive Parallel Computer Res. Lett. Inf. Math. Sci., (2000) 1, 113-118 Available online at http://www.massey.ac.nz/~wwiims/rlims/ Building an Inexpensive Parallel Computer Lutz Grosz and Andre Barczak I.I.M.S., Massey University

More information

THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS CHRISTOPHER J. ZIMMER

THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS CHRISTOPHER J. ZIMMER THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS By CHRISTOPHER J. ZIMMER A Thesis submitted to the Department of Computer Science In partial fulfillment of

More information
