Software Pipelining with Register Allocation and Spilling

Jian Wang†, Andreas Krall, M. Anton Ertl, Christine Eisenbeis‡

Institut für Computersprachen, Technische Universität Wien, Argentinierstr. 8, A-1040 Wien, Austria

Abstract

Simultaneous register allocation and software pipelining is still poorly understood and remains an open problem. In this paper, we first present the Register Requirement Graph (RRG), which can dynamically reflect the register requirement during software pipelining. Then, using the RRG as a basis, we develop a Register-Pressure-Sensitive (RPS) scheduling technique and study the problem of register spilling for software pipelining. We also present three algorithms: RPS without spilling, RPS with spilling, and software pipelining with a limited number of registers. The preliminary experimental results show that the first two algorithms can efficiently reduce the register requirement without degradation of the optimal performance, and that the third can effectively exploit instruction-level parallelism within loops even for machines with a small register file.

Keywords: Instruction-level Parallelism, Loop Scheduling, Software Pipelining, Register Allocation, Spilling, Data Dependence Graph

1 Introduction

It is well known that exploiting Instruction-Level Parallelism (ILP) within loops has become a key compilation issue for instruction-level parallel processors such as Very Long Instruction Word (VLIW) and superscalar machines [1, 2, 3]. Software pipelining has been proposed for exploiting ILP within loops; it can effectively overlap the execution of operations from different iterations [4, 5, 6, 7, 8, 9, 10, 11, 12].

Register allocation is another key compilation issue [13, 14, 15, 16, 17]. It is well known that performing register allocation before software pipelining may introduce unacceptable anti-dependences due to the reuse of registers, which may limit software pipelining [7, ]. On the other hand, if software pipelining is done before register allocation, more registers than necessary may be needed, which may cause unnecessary register spilling and severely degrade the performance of the pipelined loop []. However, simultaneous register allocation and software pipelining is still poorly understood and remains open.

This work was supported by the Lise Meitner Stipendium funded by the Austrian Science Foundation (FWF) and the Austrian Science and Research Ministry.
‡ Dr. Eisenbeis is with INRIA-Rocquencourt, Domaine de Voluceau, BP 105 - 78153, Le Chesnay Cedex, France.
The interaction between register allocation and loop-free code scheduling has been studied since the mid 1980s [0, 8, , 6, 9], and register allocation for software pipelined loops has been studied by many researchers, with some efficient techniques proposed [0, , 7, 5]. However, the interaction between register allocation and software pipelining has only lately been considered, and in few studies. Mangione-Smith et al. developed a lower bound on the number of registers needed for a given modulo scheduled loop [22]. Ning and Gao have presented a framework of register allocation for software pipelining by which they deduce the minimal number of registers needed for finding some optimal software pipelined loop [21], but they do not consider the resource constraints. A technique called lifetime-sensitive modulo scheduling has been presented by Huff [23], in which he uses the idea of bidirectional slack-scheduling to perform the modulo scheduling while trying to shorten the lifetime of a variable, but he does not consider the register spilling problem.

Our approaches presented in this paper are different from all of the above. In order to understand the interaction between register allocation and software pipelining, we present a novel framework, called the Register Requirement Graph (RRG), which can dynamically reflect the register requirement during software pipelining. During software pipelining, the RRG is used to control the register pressure caused by software pipelining itself. On one hand, the RRG gives the register-related information to guide the scheduling process such that no more registers than necessary are needed. On the other hand, from the RRG we can dynamically estimate the register requirement such that the spilling decision and the tradeoff between the initiation interval and register pressure are efficiently made.

The next section gives the background needed to make this paper self-contained. The work reported in this paper can be summarized as follows: (1) Present the RRG to estimate the register requirement during software pipelining (Section 3); (2) Use the RRG to develop a Register-Pressure-Sensitive (RPS) scheduling technique (Section 4); (3) Study the problem of register spilling to reduce the register pressure without degradation of the optimal performance (Section 5); (4) Present three software pipelining algorithms: RPS without spilling, RPS with spilling, and software pipelining with a limited number of registers (Section 6); (5) Give the preliminary experimental results to indicate the efficiency of the three algorithms (Section 7).

2 Decomposed Software Pipelining (DESP)

The data dependences of a loop can be represented by a Loop Data Dependence Graph (LDDG), $(O, E, \lambda, \delta)$, where $O$ is the operation set and $E$ the dependence edge set; the dependence distance $\lambda$ and the delay $\delta$ are two non-negative integers associated with each edge. For example, $e = (op, op')$ and $(\lambda(e), \delta(e))$ denote that $op'$ can only be issued $\delta(e)$ cycles after the start of the operation $op$ of the $\lambda(e)$-th previous iteration [, 9]. (For all examples in this paper, the loop-independent dependence edges are solid edges whereas loop-carried ones are dotted if we do not attach $(\lambda, \delta)$ to each edge.)

DESP is a novel modulo scheduling approach, and its idea can be illustrated by Figure 2.1 as an example. First, we modify the LDDG by removing some edges so that the graph becomes acyclic; secondly, we apply the list scheduling technique on the modified graph to generate the software pipelined loop body under the resource constraints, and use the row-number to denote the cycle-number of each operation in the loop body; thirdly, we determine the iteration number (denoted as column-number in the context of DESP) of each operation such that all data dependences in the LDDG are satisfied.

Formally, DESP decomposes the loop schedule into two functions, row-number and column-number.
[Figure 2.1: Decomposed Software Pipelining. Panels: LDDG, MLDDG; steps 1 to 3; row-numbers rn.]

Definition 2.1. Let $G = (O, E, \lambda, \delta)$ be the LDDG of a loop, and $\sigma$ a valid loop schedule for $G$ with initiation interval $II$. We define the row-number $rn$ and the column-number $cn$, two mappings from $O$ to $N$ (the non-negative integer set), such that

$$\sigma(op, 1) = rn(op) + II \cdot (cn(op) - 1) \quad \text{and} \quad \sigma(op, i) = \sigma(op, 1) + II \cdot (i - 1).$$

Thus, software pipelining can be described below with the concepts of row-number and column-number.

Definition 2.2 (Decomposed Software Pipelining). Let $G = (O, E, \lambda, \delta)$ be the LDDG of a loop. We say that the row-number, $rn$, and the column-number, $cn$, are valid for the loop, if and only if the following constraints are satisfied:

1. Resource constraints: $\forall op_i, op_j \in O$, if $rn(op_i) = rn(op_j)$, then $op_i$ and $op_j$ cannot be resource-conflict. (Here, we only consider the pipelined operations and the single-cycle operations, but the definition is easily extended to the case of multi-cycle non-pipelined operations.)

2. Dependence constraints:

$$\exists II \in N, \ \forall e = (op, op') \in E: \quad rn(op') - rn(op) + II \cdot (\lambda(e) + cn(op') - cn(op)) \ge \delta(e).$$

$II$ is called the initiation interval or the length of the software pipelined loop body. The goal of decomposed software pipelining is to find a valid row-number and column-number with minimum $II$.

In our previous papers [24, 25, 26], we have proven the following theoretical results.

Theorem 2.1. For a given LDDG, suppose we have constructed a row-number $rn$ which satisfies the resource constraints. We can construct a column-number $cn$ such that the data dependence constraints are also satisfied, if and only if, for each cycle $C$ of the LDDG,

$$\sum_{e \in C} \mu(e) \le 0,$$

where $\mu(e) = -\lambda(e) + \lceil (\delta(e) + rn(op) - rn(op')) / II \rceil$, $e = (op, op')$.
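The condition of Theorem 2.1 can be checked constructively. The following is only a minimal Python sketch under stated assumptions, not the authors' implementation: edges are given as hypothetical tuples `(src, dst, lam, delta)`, the row-numbers `rn` are assumed to already respect the resource constraints, and the names `mu`, `column_numbers` and `dependences_hold` are introduced here for illustration. Column-numbers are obtained by longest-path relaxation over the constraints $cn(op') - cn(op) \ge \mu(e)$; if the values still change after $|O|$ passes, some cycle has a positive total weight and no valid column-number exists.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for b > 0."""
    return -(-a // b)


def mu(lam, delta, rn_src, rn_dst, ii):
    """Smallest admissible cn(dst) - cn(src) for one edge (weight of Theorem 2.1)."""
    return -lam + ceil_div(delta + rn_src - rn_dst, ii)


def column_numbers(ops, edges, rn, ii):
    """Return a dict of column-numbers, or None if some cycle has positive weight."""
    cn = {op: 0 for op in ops}
    for _ in range(len(ops)):
        changed = False
        for src, dst, lam, delta in edges:
            need = cn[src] + mu(lam, delta, rn[src], rn[dst], ii)
            if cn[dst] < need:
                cn[dst] = need
                changed = True
        if not changed:
            return cn
    return None  # still relaxing after |O| passes: the cycle condition fails


def dependences_hold(edges, rn, cn, ii):
    """Check the dependence constraints of Definition 2.2 directly."""
    return all(rn[d] - rn[s] + ii * (lam + cn[d] - cn[s]) >= delta
               for s, d, lam, delta in edges)


# Hypothetical two-operation recurrence: a -> b (lam=0, delta=2), b -> a (lam=1, delta=1).
edges = [("a", "b", 0, 2), ("b", "a", 1, 1)]
too_short = column_numbers(["a", "b"], edges, {"a": 1, "b": 2}, 2)
feasible = column_numbers(["a", "b"], edges, {"a": 1, "b": 3}, 3)
```

In this made-up example the recurrence needs three cycles, so the attempt with II = 2 fails while II = 3 yields column-numbers that satisfy every dependence constraint.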
The following corollary is direct from Theorem 2.1.

Corollary 2.1. For an LDDG without cycles, if we have constructed a row-number taking into account the resource constraints, then we can always construct a column-number such that the data dependence constraints are also satisfied.

3 Register Requirement Graph

Our software pipelining framework is based on DESP as shown in Figure 2.1. In the first step, we use the following method to modify the LDDG [24, 25, 26]:

(1) find all strongly connected components (SCCs) in the LDDG, and remove all edges which are not included in the SCCs;
(2) under unlimited resource constraints, generate a software pipelined loop for the SCCs, denoted as $(rn', cn')$;
(3) for each edge $e = (op_i, op_j)$ of the SCCs, if $rn'(op_j) - rn'(op_i) < \delta(e)$, then remove $e$ from the SCCs.

The remaining graph is acyclic, denoted as MLDDG. We have proven that any row-numbers satisfying the data dependences of the MLDDG must satisfy the condition of Theorem 2.1.

In decomposed software pipelining, the column-number is an important parameter to control the register requirement of each variable. In fact, the register requirement is mainly determined by the difference between the column-numbers of two operations which have a data dependence (denoted as $dcn_{ij}$). For example, suppose variable $u$ is written by $op_i$ and read by $op_j$; then $dcn_{ij}$ gives the estimate of the lifetime of $u$. Thus, we first present the Register Requirement Graph (RRG), which can dynamically estimate $dcn_{ij}$. The RRG gives the heuristics to guide the scheduling process (determining the row-number).

Given the LDDG $(O, E, \lambda, \delta)$ of a loop, after the first step of decomposed software pipelining, we obtain an acyclic dependence graph MLDDG $= (O, E_m, \delta)$. A new graph, called the Register Requirement Graph, is defined as RRG $= (O, E, \omega)$, where $\omega$ is a weight on each edge which represents the estimated difference between the column-numbers of two operations in the worst case. Let $MII$ be the estimated minimum initiation interval. Before scheduling the software pipelined loop body, we initially define $\omega$ as follows:

(1) $\omega(e) = -\lambda(e)$, $\forall e \in E_m$;
(2) $\omega(e) = -\lambda(e) + \lceil (\delta(e) + MII - 1) / MII \rceil$, $\forall e \in E - E_m$.

While scheduling the software pipelined loop body, we recompute $\omega(e)$ for each $e = (op_i, op_j) \in E - E_m$ as follows:

(1) $\omega(e) = -\lambda(e) + \lceil (\delta(e) - 1 + rn(op_i)) / MII \rceil$, if $rn(op_i)$ is determined but $rn(op_j)$ is not;
(2) $\omega(e) = -\lambda(e) + 1 + \lceil (\delta(e) - rn(op_j)) / MII \rceil$, if $rn(op_j)$ is determined but $rn(op_i)$ is not;
(3) $\omega(e) = -\lambda(e) + \lceil (\delta(e) - (rn(op_j) - rn(op_i))) / MII \rceil$, if $rn(op_i)$ and $rn(op_j)$ are both determined.

An example of the RRG is given in Figures 3.1 and 3.2. Figure 3.1(1) is the loop and (2) the machine model. Its LDDG and MLDDG are shown in Figure 3.2(1) and (2), respectively. Figure 3.2(3) is the initial RRG. Figure 3.2(4) is the RRG when rn(op[?]) = rn(op[?]) = rn(op5) = rn(op6) = [?] and rn(op[?]) = rn(op4) = [?].
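The weight rules above can be collected into a single function. This is a hedged sketch rather than the authors' code: the function name `omega`, its parameters, and the convention that an undetermined row-number is passed as `None` are assumptions made for illustration; rows are assumed to be numbered from 1 to MII, which is what makes the worst-case terms match the initial definition.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for b > 0."""
    return -(-a // b)


def omega(lam, delta, mii, in_mlddg, rn_i=None, rn_j=None):
    """Estimated worst-case column-number difference for edge e = (op_i, op_j).

    lam, delta : dependence distance and delay of the edge
    mii        : estimated minimum initiation interval
    in_mlddg   : True if the edge survives in the MLDDG (e in E_m)
    rn_i, rn_j : row-numbers already determined, or None if not yet scheduled
    """
    if in_mlddg:
        return -lam
    if rn_i is None and rn_j is None:          # initial weight
        return -lam + ceil_div(delta + mii - 1, mii)
    if rn_j is None:                           # only rn(op_i) is known
        return -lam + ceil_div(delta - 1 + rn_i, mii)
    if rn_i is None:                           # only rn(op_j) is known
        return -lam + 1 + ceil_div(delta - rn_j, mii)
    return -lam + ceil_div(delta - (rn_j - rn_i), mii)


# Hypothetical edge with lam = 0, delta = 2 and MII = 2.
initial = omega(0, 2, 2, False)
after_src = omega(0, 2, 2, False, rn_i=1)
worst_pair = omega(0, 2, 2, False, rn_i=2, rn_j=1)
```

With both endpoints placed in the least favourable rows (`rn_i = MII`, `rn_j = 1`), the third rule reproduces the initial weight, which is one way to see why the initial value is a worst-case estimate.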
(1) The Loop

The original loop:

    for i = 1 to n do
        s = s + a[i]
        a[i] = s*s*a[i]
    enddo

The code of the loop body:

    1. t0 = t0 + 1;
    2. t1 = a[t0];
    3. s = s + t1;
    4. t2 = s*s;
    5. t3 = t2*t1;
    6. a[t0] = t3

(2) The Machine Model (columns: Pipeline, Number, Operation, Latency)

    Memory port    Load, Store
    Address ALU    Add/Sub
    Adder          FAdd/FSub, IAdd/ISub
    Multiplier     FMUL, IMUL

Figure 3.1 An Example

A definition-use path is defined as a path from the operation writing a variable to any operation reading the variable in the LDDG. The critical definition-use path of variable $u$, $cdup_u$, is defined by

$$\sum_{e \in cdup_u} \omega(e) = \max_{\text{any } dup \text{ of } u} \Big( \sum_{e \in dup_u} \omega(e) \Big).$$

Let RRG $= (O, E, \omega)$. For each edge $e \in E$, $\nu(e)$ is defined as the number of variables whose critical definition-use path includes $e$.

[Figure 3.2: LDDG, MLDDG and RRGs. Panels: (1) LDDG, (2) MLDDG, (3) initial RRG, (4) RRG during scheduling.]

The RRG has the following two properties:

(1) Let RRG $= (O, E, \omega)$ and let $cdup_u$ be the critical definition-use path of $u$; then $\sum_{e \in cdup_u} \omega(e)$ gives the estimate of the register requirement of $u$.

(2) Let RRG $= (O, E, \omega)$. During scheduling of the software pipelined loop body, for any edge $e$ which is in the LDDG but not included in the MLDDG, if $e$ is satisfied (that is, $rn(op_j) - rn(op_i) \ge \delta(e)$, $e = (op_i, op_j)$), then the register requirement may be decreased by up to $(\omega(e) + \lambda(e)) \, \nu(e)$ registers compared to the case when $e$ is not satisfied.

4 RPS Scheduling

We present the following two heuristics to direct the scheduling process:

(1) Delay the scheduling of some operations such that some dependence edges can be satisfied in the software pipelined loop body;
(2) Develop register-pressure-sensitive heuristics to determine the scheduling priorities for operations.
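Both RRG properties, and the delaying heuristic that relies on them, depend on the critical definition-use paths. The sketch below is an illustration under assumptions, not the authors' procedure: the graph is a hypothetical successor map `succ`, edges are identified by `(src, dst)` pairs (so parallel edges are not modelled), interior nodes are never repeated so that the enumeration terminates on cyclic graphs, and ties between equally heavy paths are broken by enumeration order. All names are introduced here.

```python
def def_use_paths(succ, definer, users):
    """Yield each path (list of edges) from `definer` to an operation in `users`."""
    def walk(node, seen):
        for nxt in succ.get(node, ()):
            edge = (node, nxt)
            if nxt in users:
                yield [edge]
            if nxt not in seen:
                for rest in walk(nxt, seen | {nxt}):
                    yield [edge] + rest
    yield from walk(definer, {definer})


def critical_path(succ, omega, definer, users):
    """Definition-use path with the largest total weight (empty if none exists)."""
    return max(def_use_paths(succ, definer, users),
               key=lambda path: sum(omega[e] for e in path), default=[])


def edge_usage(succ, omega, variables):
    """nu[e]: number of variables whose critical definition-use path includes e."""
    nu = {e: 0 for e in omega}
    for definer, users in variables.values():
        for e in critical_path(succ, omega, definer, users):
            nu[e] += 1
    return nu


def max_saving(e, omega, lam, nu):
    """Upper bound of property (2) on registers saved when edge e is satisfied."""
    return (omega[e] + lam[e]) * nu[e]


# Hypothetical three-operation graph with two variables, both defined by "a".
succ = {"a": ["b", "c"], "b": ["c"]}
omega = {("a", "b"): 1, ("b", "c"): 0, ("a", "c"): 0}
variables = {"u": ("a", {"c"}), "v": ("a", {"b"})}
path_u = critical_path(succ, omega, "a", {"c"})
estimate_u = sum(omega[e] for e in path_u)
nu = edge_usage(succ, omega, variables)
```

Here `estimate_u` plays the role of the register requirement estimate of property (1), and `max_saving` evaluates the bound of property (2) for a single edge.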
In the second step of our software pipelining framework, we use list scheduling on the MLDDG (obtained in the first step) to determine the row-numbers for all operations. First, we find all schedulable operations at the current cycle and put them into the Data Ready Set (DRS); then we select the operations with the highest scheduling priority to schedule.

As most dependence edges originally in the LDDG have been removed in the MLDDG, there may be many schedulable operations in the DRS at each cycle. Without increasing the estimated II, it is quite possible that some operations can be delayed so that some dependence edges avoid being unnecessarily broken. (The estimated II can be derived from the critical cycle of the LDDG and the number of operations using the critical resources.)

We suggest that an operation can be delayed and removed from the current DRS only if

(1) The operation does not use the critical resources. $res$ is one of the critical resources if $t - 1 + \lceil n/N \rceil \ge$ the estimated II, where $t$ is the current cycle, $n$ is the number of operations using $res$ and $N$ is the number of $res$ in the machine; and
(2) The lengths of the resulting dependence paths are not greater than the estimated II. That is, $t + \delta(e) + height(op) \le$ the estimated II, where $t$ is the current cycle, $e$ is the dependence edge which we are willing to hold and $height(op)$ is the height of $op$ in the MLDDG.

When more than one operation can be delayed, we first consider the operation with the greatest value of $(\omega(e) + \lambda(e)) \, \nu(e)$, where $e$ is the dependence edge which we are willing to hold. For the example of Figure 3.1(1), at the first cycle, all operations are schedulable and can be put into the DRS, but only operation [?] can be delayed and removed from the DRS.

Next we discuss how to determine the scheduling priorities for the operations of the DRS. In order to obtain the optimal time efficiency, we consider the height of an operation in the MLDDG as the first heuristic. The second heuristic is sensitive to the register pressure and is derived from the RRG.

At the current cycle $t$, suppose $op_i$ and $op_j$ are the operations with the greatest value of height in the MLDDG. If $op_i$ and $op_j$ are not resource-conflict, then they should both be scheduled at $t$. If $op_i$ and $op_j$ are resource-conflict, then we use the second heuristic to determine their scheduling priorities as follows:

(1) If one operation is scheduled at $t$, then the other should be scheduled after the $t$-th cycle;
(2) Suppose $op_i$ is scheduled at $t$, and let $DES(op_i)$ be the dependence edge set which includes all edges adjacent to $op_i$. Letting $rn(op_i) = t$, we re-compute the new value of $\omega$ of each edge in $DES(op_i)$, denoted as $\omega_{new}$. Thus, we can compute the register-benefit of $op_i$,

$$\beta(op_i, t) = \sum_{e \in DES(op_i)} (\omega(e) - \omega_{new}(e)) \, \nu(e);$$

(3) By the same method as step (2), we compute $\beta(op_j, t)$;
(4) The operation with the greater value of $\beta$ (the register-benefit) is the one with the higher scheduling priority.
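The two-level priority of RPS scheduling (height first, register-benefit second) can be sketched as follows. This is an illustrative sketch only: the weights before and after the tentative placement are assumed to be supplied as dictionaries `omega` and `omega_new` keyed by hypothetical edge identifiers, `nu` is the per-edge variable count of Section 3, and the function names are not from the paper.

```python
def register_benefit(adjacent_edges, omega, omega_new, nu):
    """beta(op, t): sum over DES(op) of (omega(e) - omega_new(e)) * nu(e)."""
    return sum((omega[e] - omega_new[e]) * nu[e] for e in adjacent_edges)


def pick_operation(candidates, height, benefit):
    """Choose among resource-conflicting candidates.

    The height in the MLDDG is compared first; the register-benefit
    breaks ties between operations of equal height.
    """
    return max(candidates, key=lambda op: (height[op], benefit[op]))


# Hypothetical data: placing x at the current cycle lowers the weight of edge e1.
omega = {"e1": 2, "e2": 1}
omega_new = {"e1": 1, "e2": 1}
nu = {"e1": 2, "e2": 3}
beta_x = register_benefit(["e1", "e2"], omega, omega_new, nu)
chosen = pick_operation(["x", "y"], {"x": 3, "y": 3}, {"x": beta_x, "y": 0})
```

Comparing the tuple `(height, benefit)` keeps time efficiency as the primary criterion, so the register-pressure heuristic can only reorder operations that the height heuristic considers equal.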
4.1 Register Spilling

Spilling decisions are conventionally made only when a register conflict occurs, that is, when the number of simultaneously live variables is greater than the number of available machine registers. The effect of spilling is to keep the result of a computation in memory rather than in a register, such that the register can be re-used to keep the result of a new computation at the cost of increasing the number of load/store operations and probably degrading the code performance. Software pipelining overlaps the execution of operations from different iterations, increasing register pressure and generating excessive spill code in the case of small machine register files.

This section discusses the register spilling problem for software pipelining. Our starting point is that spilling decisions should be made during software pipelining such that the interactions between register allocation and loop scheduling can be seen. The RRG can dynamically reflect the change in the register requirement during software pipelining and makes our starting point feasible.

Two problems to be discussed are as follows: (1) When is a spilling decision made during software pipelining? (2) How is a spill performed?

In the loop body, we suppose that a variable has only one definition (the operation defining the variable) but may have more than one use (the operations using the variable). We first want to make a remark: the meaning of spilling in the context of this paper is somewhat different from the conventional spilling problem [14]. We speak of spilling a use (or a group of uses) but not of spilling a variable (that is, spilling all its uses). By spilling a use, we mean that a store operation after the definition and a load operation before the spilled use are inserted, and other uses still reference the value of the variable in a register.

From the RRG, we can dynamically estimate the register requirement at each cycle. Spilling is needed only if the number of required registers is greater than the number of available machine registers. In fact, other measures, like delaying the scheduling of some operations and introducing some dependence edges into the MLDDG, can also decrease the register requirement.

Another necessary condition for spilling is that the load/store operations caused by spilling do not increase the estimated II. In the case that there are not enough available machine registers to reach the estimated II, we first increase the estimated II and then consider spilling or other measures to decrease the register requirement (see next section).

The spilling process consists of two steps: (1) Select a use (or a group of uses) for spilling; (2) Modify the MLDDG and the RRG by adding the necessary load/store operations and recomputing the values of the corresponding $\omega$ and $\nu$.

The spilling-benefit of a use is defined as the number of saved registers per inserted load/store operation. More precisely, given a use, $use(op, u)$, where $op$ is the operation using variable $u$, under the assumption that $use(op, u)$ has been spilled, we re-compute the minimal register requirement of variable $u$ and the newly introduced variable, denoted as $K'_u$. Thus, the spilling-benefit of $use(op, u)$ is $\lceil (K_u - K'_u)/2 \rceil$, as a store and a load are inserted into the MLDDG and the RRG for spilling a use. Obviously, a use with a greater value of spilling-benefit is the one with higher spilling priority.

We take the loop shown in Figure 3.1 as an example to illustrate the above ideas. We discuss two cases: (1) scheduling without spilling; (2) scheduling with spilling.

For the first case, the estimated II is 2 since the machine has one multiplier but the loop body contains two multiplications. The software pipelined loop body can be found under the constraints of the MLDDG (shown in Figure 3.2(2)) and the initial RRG (shown in Figure 3.2(3)). By delaying operation [?], we obtain rn([?]) = rn([?]) = rn(5) = rn(6) = [?] and rn([?]) = rn(4) = [?]. It is easy to compute the number of required registers, which is [?].

For the second case, the estimated II is also 2. After computing the spilling-benefits of all uses, we find that use(op6, t0) has the greatest value of spilling-benefit, which is $\lceil (7 - [?])/2 \rceil$ = [?], so use(op6, t0) has the highest spilling priority. After spilling use(op6, t0), the modified MLDDG and the modified initial RRG are shown in Figure 5.1. By delaying operation [?], we obtain rn([?]) = rn([?]) = rn(5) = rn(6) = rn(s) = [?] and rn([?]) = rn(4) = rn(l) = [?]. It is easy to compute the number of required registers, which is [?].
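The spilling-benefit and the resulting spilling priority can be written down directly. The following is a small sketch with made-up numbers, not the values of the example above: `candidates` is a hypothetical map from a use `(op, variable)` to the pair $(K_u, K'_u)$, and returning `None` when no use has a positive benefit is an assumption of this sketch rather than a rule stated in the text.

```python
def spilling_benefit(k_u, k_u_spilled):
    """ceil((K_u - K'_u) / 2): registers saved per inserted load/store."""
    return -(-(k_u - k_u_spilled) // 2)


def select_use(candidates):
    """Return the use with the greatest spilling-benefit, or None.

    candidates maps (op, variable) to (K_u, K'_u), the register requirement
    before spilling and the re-computed requirement after spilling that use.
    """
    best = max(candidates,
               key=lambda use: spilling_benefit(*candidates[use]),
               default=None)
    if best is None or spilling_benefit(*candidates[best]) <= 0:
        return None  # assumption: spilling without a gain is never selected
    return best


# Hypothetical uses of a variable u by two operations.
candidates = {("op_a", "u"): (5, 2), ("op_b", "u"): (4, 4)}
best_use = select_use(candidates)
```

The division by two reflects that each spilled use costs one store and one load, so the benefit is measured per inserted memory operation.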
An important observation is that spilling can decrease the register requirement without degradation of the optimal software pipelining performance if the spilling decision can be efficiently controlled.

[Figure 5.1: Scheduling with Register Spilling. Panels: (1) the modified MLDDG, (2) the modified initial RRG; l and s denote the inserted load and store operations.]

5 Algorithms

On the basis of the last three sections, we present three software pipelining algorithms. The first two perform software pipelining to minimize the register requirement, and the third performs software pipelining with a limited number of registers.

5.1 RPS Scheduling without Spilling

The algorithm is described as follows:

Algorithm RPS-without-Spilling;
INPUT: The loop to be software pipelined and its LDDG;
OUTPUT: The software pipelined loop;
BEGIN
1. Construct the MLDDG, determine the estimated II;
2. Compute the height of each operation in the MLDDG;
3. Find all definition-use paths of each variable, construct the RRG;
4. Find all schedulable operations and put them in the DRS;
5. Find those operations which can be delayed one by one, remove them from the DRS;
6. Determine the scheduling priorities of all operations in the DRS;
7. Under the constraint of resources, select the operation with the highest scheduling priority from the DRS and place it in the current cycle, update the DRS. This step repeats until no operation can be placed in the current cycle;
8. If all operations of the loop have been scheduled then go to step 9; else update the DRS and the RRG and go to step 5;
9. For each operation, let its row-number be its cycle-number. From Theorem 2.1, the column-number of each operation is computed in terms of the row-numbers and the II;
10. Generate the software pipelined loop in terms of the row-numbers and the column-numbers;
END;
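The core of steps 4 to 8 is a list scheduler over the acyclic MLDDG. The sketch below is a deliberately simplified illustration, not the algorithm as published: it omits the delaying step 5 and the RRG updates, recomputes the DRS once per cycle rather than after every placement, and assumes an acyclic graph in which every resource class has at least one unit. The data layout (`preds`, `delay`, `unit`, `capacity`) and the `priority` callback are hypothetical.

```python
def list_schedule(ops, preds, delay, unit, capacity, priority):
    """Assign a row-number (cycle, starting at 1) to every operation.

    preds[op]      : predecessors of op in the acyclic graph
    delay[(p, op)] : delay of the edge p -> op
    unit[op]       : resource class used by op
    capacity[r]    : number of units of resource class r (at least 1)
    priority(op)   : larger values are placed first
    """
    rn = {}
    cycle = 1
    while len(rn) < len(ops):
        used = {}
        # Data Ready Set: unscheduled operations whose predecessors have completed.
        drs = [op for op in ops if op not in rn and
               all(p in rn and rn[p] + delay[(p, op)] <= cycle
                   for p in preds.get(op, ()))]
        for op in sorted(drs, key=priority, reverse=True):
            if used.get(unit[op], 0) < capacity[unit[op]]:
                rn[op] = cycle
                used[unit[op]] = used.get(unit[op], 0) + 1
        cycle += 1
    return rn


# Hypothetical loop body: one load feeding two multiplications, then a store.
ops = ["ld", "m1", "m2", "st"]
preds = {"m1": ["ld"], "m2": ["ld"], "st": ["m1"]}
delay = {("ld", "m1"): 1, ("ld", "m2"): 1, ("m1", "st"): 1}
unit = {"ld": "mem", "st": "mem", "m1": "mul", "m2": "mul"}
capacity = {"mem": 1, "mul": 1}
height = {"ld": 3, "m1": 2, "m2": 1, "st": 1}
rows = list_schedule(ops, preds, delay, unit, capacity, lambda op: height[op])
```

In this made-up example the single multiplier forces the two multiplications into different rows, so the schedule occupies three rows. A fuller version would pass a priority that combines the height with the register-benefit and would remove delayable operations from the DRS before sorting.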
5.2 RPS Scheduling with Spilling

This algorithm differs from the RPS-without-Spilling algorithm in that a new spill-checking step is inserted between step 5 and step 6. The new step calls a spill-checking algorithm, which is described as follows:

Algorithm Spill-Checking;
BEGIN
1. If the memory access unit is one of the critical resources, then return;
2. Compute the spilling-benefit of each use; we actually only consider those uses which are on the critical definition-use paths;
3. Under the constraint of not increasing the estimated II, select a use (or a group of uses) for spilling. In this step, if no use can be selected then return;
4. Update the MLDDG, the RRG and the DRS; return;
END;

5.3 Software Pipelining with a Limited Number of Registers

The above two algorithms try to obtain the optimal software pipelined loop with the minimal register requirement. In this section we present an approach for software pipelining with a limited number of registers.

Our idea is that we first estimate the register requirement; if the number of required registers is greater than the given number of available machine registers, then we increase the estimated II such that the register requirement is reduced. However, it is difficult and complicated to precisely estimate the register requirement. The RRG only estimates the register requirement of each variable. The problem of which variables can share the same registers remains open during software pipelining.

We present the following heuristics: Let $K_0$ be the given number of available machine registers and $K_{est}$ be the estimated number of required registers from the RRG. A non-negative integer $N_0$ is introduced. If $K_{est} - N_0 \le K_0$ then we call the algorithm of RPS scheduling with spilling; else we first increase the estimated II (and maybe also increase $N_0$ in some cases) to satisfy $K_{est} - N_0 \le K_0$. After getting the software pipelined loop body, we can precisely compute the number of required registers. If the number is greater than $K_0$, then we increase the estimated II and call the algorithm of RPS with spilling again. The process repeats until a software pipelined loop body is obtained whose register requirement is not greater than $K_0$.

We do not yet have any theoretical analysis of $N_0$, but we believe that $N_0$ can be estimated empirically.
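The control loop of this heuristic can be sketched as a driver around two callbacks. Everything below is an assumption made for illustration: `estimate_registers(ii)` stands for the RRG-based estimate $K_{est}$, `rps_with_spilling(ii)` stands for the algorithm of Section 5.2 and is assumed to return the schedule together with its exact register count, $N_0$ is kept fixed, and the `max_ii` cut-off is added only so that the sketch always terminates.

```python
def pipeline_with_limited_registers(k0, ii, estimate_registers,
                                    rps_with_spilling, n0=0, max_ii=64):
    """Increase the estimated II until the pipelined loop fits in k0 registers.

    Returns (ii, schedule), or None if no schedule is found up to max_ii.
    """
    while ii <= max_ii:
        # Only attempt scheduling when the estimate, relaxed by n0, fits.
        if estimate_registers(ii) - n0 <= k0:
            schedule, used = rps_with_spilling(ii)
            if used <= k0:          # precise count after scheduling
                return ii, schedule
        ii += 1
    return None


# Hypothetical callbacks: register need shrinks as the II grows.
def fake_estimate(ii):
    return 20 - 2 * ii


def fake_rps(ii):
    return "schedule", 18 - 2 * ii


result = pipeline_with_limited_registers(8, 2, fake_estimate, fake_rps, n0=2)
```

With these made-up callbacks the driver skips II = 2, 3 and 4 because the relaxed estimate exceeds 8 registers, and accepts II = 5, where the precise count is exactly 8.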
6 Preliminary Experimental Results

The effort to implement the algorithms presented in this paper is under way. Before carrying out extensive experimental tests, we select six examples to verify our algorithms. Except for example 1, the other five examples are selected from the Livermore Benchmarks, shown in Table 1. As our preliminary experiments are mainly conducted by manual simulation, we try to select some simple loops in a random way. The machine model we use in the experiments is shown in Figure 3.1(2).

Table 2 gives the register requirements for the optimal software pipelining performance by three scheduling approaches: DESP, RPS without spilling and RPS with spilling. Although DESP itself adopts measures to reduce the register requirement when it determines the column-numbers, the algorithm of RPS without spilling can still obtain an improvement over DESP from [?]7.4% to [?]7.9% in register use without degradation of the optimal performance, except for examples [?] and 5. For examples [?] and 5, no improvement in register use can be obtained since the initiation intervals (II) of the software pipelined loops are [?]. For examples [?] and [?], the algorithm of RPS with spilling can further obtain an improvement over DESP in register use of [?]% and [?]%, respectively, without degradation of the optimal performance.

[Table 1. Experimental Examples. Columns: Example, L, MII, with lcd?, Remarks. Example 1 is the loop of Figure 3.1(1); the other examples are Livermore kernels.]
Note: L = the length of the longest dependence path in the loop body.
Note: MII = the minimal II.
Note: lcd = loop-carried dependence.

[Table 2. Register Requirement for Three Scheduling Approaches. Columns: Example, II, DESP, RPS without Spilling, RPS with Spilling.]
Note: II = the initiation interval of the software pipelined loop.
The results of our algorithm for software pipelining with a limited number of registers are presented in Table 3 and Figure 7.1. Table 3 gives the initiation intervals (II) obtained by our algorithm for the six examples when the number of available machine registers ($K_0$) is 8, 16 and 32, respectively. The relations between $K_0$ and the speedup are shown in Figure 7.1. The speedup is defined as L/II, where L is the length of the longest dependence path in the loop body (shown in Table 1), representing the optimal performance when we only exploit the ILP within the loop body. The results show that our algorithm can obtain the optimal speedup when $K_0$ = 32 (the minimal size of register file in current ILP processors) and an average speedup of [?].4 when $K_0$ = 8, indicating that our algorithm can still efficiently exploit the ILP across iterations for loops even for a small register file ($K_0$ = 8).

[Table 3. Software Pipelining with a Limited Number of Registers (the initiation interval of the software pipelined loop). Columns: Example; the number of available machine registers $K_0$ = 8, 16, 32.]

[Figure 7.1: Software Pipelining with a Limited Number of Registers. Axes: speedup versus example; series: $K_0$ = 8, 16, 32.]

7 Conclusion

This paper presents the Register Requirement Graph (RRG), which can dynamically reflect the register requirement during software pipelining. On the basis of the RRG, a Register-Pressure-Sensitive (RPS) scheduling technique is developed and the problem of register spilling for software pipelining is studied. We also present three algorithms: RPS without spilling, RPS with spilling, and software pipelining with a limited number of registers. The preliminary experimental results indicate that the first two algorithms can efficiently improve the register use without degradation of the optimal performance, and that the third can effectively exploit the ILP across iterations for loops even for machines with a small register file.

The three algorithms are being implemented on our compiler testbed. We expect to carry out extensive experimental tests.
References

[1] J. A. Fisher, D. Landskov, and B. D. Shriver. Microcode compaction: Looking backward and looking forward. In Proceedings of 1981 National Computer Conference, 1981.
[2] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview and perspective. The Journal of Supercomputing, 7(1), January 1993.
[3] F. Gasperoni. Compilation techniques for VLIW architectures. Technical Report TR435, New York University, March 1989.
[4] B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proceedings of the 14th International Symposium on Microprogramming and Microarchitectures (MICRO-14), pages 183-198, October 1981.
[5] A. Aiken and A. Nicolau. Perfect pipelining: A new loop parallelization technique. In Proceedings of European Symposium on Programming, Lecture Notes in Computer Science, No. 300, pages 221-235. Springer-Verlag, June 1988.
[6] P. Y. T. Hsu. Highly Concurrent Scalar Processing. PhD thesis, University of Illinois, Urbana-Champaign, 1986.
[7] K. Ebcioglu. A compilation technique for software pipelining of loops with conditional jumps. In Proceedings of the 20th International Symposium on Microprogramming and Microarchitectures (MICRO-20), pages 69-79, 1987.
[8] B. Su, S. Ding, and J. Xia. URPR - an extension of URCR for software pipelining. In Proceedings of the 19th International Symposium on Microprogramming and Microarchitectures (MICRO-19), pages ?04-?08, 1986.
[9] Bogong Su and Jian Wang. Loop-carried dependence and the general URPR software pipelining approach. In Proceedings of the 24th Annual Hawaii International Conference on System Sciences, pages ?66-?7?. IEEE and ACM, January 1991.
[10] R. F. Touzeau. A Fortran compiler for the FPS-164 scientific computer. In Proceedings of ACM SIGPLAN Symposium on Compiler Construction, 1984.
[11] A. E. Charlesworth. An approach to scientific array processing: The architecture design of the AP-120B/FPS-164 family. Computer, pages 18-27, September 1981.
[12] M. S. Lam. A Systolic Array Optimizing Compiler. PhD thesis, CMU, 1987. CMU-CS-87-
[13] D. G. Bradlee, S. J. Eggers, and R. R. Henry. Integrated register allocation and instruction scheduling for RISCs. In Proceedings of the 4th International Conference on ASPLOS, 1991.
[14] G. J. Chaitin. Register allocation and spilling via graph coloring. In Proceedings of ACM SIGPLAN Symp. on Compiler Construction, 1982.
[15] L. J. Hendren, G. R. Gao, E. R. Altman, and C. Mukerji. Register allocation using cyclic interval graph: A new approach to an old problem. Technical Report ACAPS Technical Memo ?, McGill University, 199?.
[16] S. S. Pinter. Register allocation with instruction scheduling: A new approach. In Proceedings of ACM SIGPLAN PLDI, 1993.
[17] B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. In Proceedings of PLDI, 1992.
[18] J. R. Goodman and W. Hsu. Code scheduling and register allocation in large basic blocks. In Proceedings of International Conference on Supercomputing, 1988.
[19] S. A. Mahlke, W. Y. Chen, P. P. Chang, and W. W. Hwu. Scalar program performance on multiple-instruction-issue processors with a limited number of registers. In Proceedings of the 25th Hawaii International Conference on System Sciences, January 1992.
[20] C. Eisenbeis, W. Jalby, and A. Lichnewsky. Compile-time optimization of memory and register usage on the Cray-2. In Proceedings of the Second Workshop on Languages and Compilers, 1989.
[21] Qi Ning and Guang R. Gao. A novel framework of register allocation for software pipelining. Technical Report ACAPS Technical Memo ?4?, McGill University, 199?.
[22] William Mangione-Smith, S. G. Abraham, and E. S. Davidson. Register requirements of pipelined processors. In Proceedings of 1992 ACM International Conference on Supercomputing, 1992.
[23] R. Huff. Lifetime-sensitive modulo scheduling. In Proceedings of ACM SIGPLAN PLDI, pages 258-267, June 1993.
[24] Jian Wang and Christine Eisenbeis. Decomposed Software Pipelining: A new approach to exploit instruction level parallelism for loop programs. In Michel Cosnard, Kemal Ebcioglu, and Jean-Luc Gaudiot, editors, Proceedings of IFIP WG 10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, pages ?-?5. IFIP, North-Holland, January 1993.
[25] Jian Wang and Christine Eisenbeis. Decomposed Software Pipelining. Research Report RR-1838, INRIA-Rocquencourt, France, 1993.
[26] Jian Wang, Christine Eisenbeis, Martin Jourdan, and Bogong Su. Decomposed Software Pipelining: A new perspective and a new approach. International Journal of Parallel Programming, 22(3):357-379, 1994.
Software Pipelining by Modulo Scheduling. Philip Sweany University of North Texas
Software Pipelining by Modulo Scheduling Philip Sweany University of North Texas Overview Opportunities for Loop Optimization Software Pipelining Modulo Scheduling Resource and Dependence Constraints Scheduling
More informationSoftware Pipelining - Modulo Scheduling
EECS 583 Class 12 Software Pipelining - Modulo Scheduling University of Michigan October 15, 2014 Announcements + Reading Material HW 2 Due this Thursday Today s class reading» Iterative Modulo Scheduling:
More informationSoftware Pipelining. Y.N. Srikant. NPTEL Course on Compiler Design. Department of Computer Science Indian Institute of Science Bangalore 560 012
Department of Computer Science Indian Institute of Science Bangalore 560 2 NPTEL Course on Compiler Design Introduction to Overlaps execution of instructions from multiple iterations of a loop Executes
More informationAN IMPLEMENTATION OF SWING MODULO SCHEDULING WITH EXTENSIONS FOR SUPERBLOCKS TANYA M. LATTNER
AN IMPLEMENTATION OF SWING MODULO SCHEDULING WITH EXTENSIONS FOR SUPERBLOCKS BY TANYA M. LATTNER B.S., University of Portland, 2000 THESIS Submitted in partial fulfillment of the requirements for the degree
More informationInstruction scheduling
Instruction ordering Instruction scheduling Advanced Compiler Construction Michel Schinz 2015 05 21 When a compiler emits the instructions corresponding to a program, it imposes a total order on them.
More informationModule: Software Instruction Scheduling Part I
Module: Software Instruction Scheduling Part I Sudhakar Yalamanchili, Georgia Institute of Technology Reading for this Module Loop Unrolling and Instruction Scheduling Section 2.2 Dependence Analysis Section
More informationThe Performance of Scalar Replacement on the HP 715/50
The Performance of Scalar Replacement on the HP 715/5 1 Introduction Steve Carr Qunyan Wu Department of Computer Science Michigan Technological University Houghton MI 49931-1295 It has been shown that
More informationProblems and Measures Regarding Waste 1 Management and 3R Era of public health improvement Situation subsequent to the Meiji Restoration
More information
EECS 583 Class 11 Instruction Scheduling Software Pipelining Intro
EECS 58 Class Instruction Scheduling Software Pipelining Intro University of Michigan October 8, 04 Announcements & Reading Material Reminder: HW Class project proposals» Signup sheet available next Weds