Parallel Programming and Performance Evaluation with the Ursa Tool Family

Insung Park, Michael Voss, Brian Armstrong, Rudolf Eigenmann
School of Electrical and Computer Engineering
Purdue University

This work was supported in part by Purdue University, U.S. Army contract #DABT63-92-C-33, and an NSF CAREER award. This work is not necessarily representative of the positions or policies of the U.S. Army or the Government.

Abstract

This paper contributes to the solution of several open problems with parallel programming tools and their integration with performance evaluation environments. First, we propose interactive compilation scenarios instead of the usual black-box-oriented use of compiler tools. In such scenarios, information gathered by the compiler and the compiler's reasoning are presented to the user in meaningful ways and on demand. Second, a tight integration of compilation and performance analysis tools is advocated. Many of the existing, advanced instruments for gathering performance results are being used in the presented environment and their results are combined in integrated views with compiler information and data from other tools. Initial instruments that assist users in "data mining" this information are presented and the need for much stronger facilities is explained.

The Ursa Family provides two tools addressing these issues. Ursa Minor supports a group of users at a specific site, such as a research or development project. Ursa Major complements this tool by making available the gathered results to the user community at large via the World-Wide Web.

This paper presents objectives, functionality, experience, and next development steps of the Ursa tool family. Two case studies are presented that illustrate the use of the tools for developing and studying parallel applications and for evaluating parallelizing compilers.

1 Introduction

Interactive use of parallelizing compilers. Many programming tools exist that assist the user in the challenging task of developing well-performing parallel programs. Parallelizing compilers are one important class of such tools [BDE+96, HAA+96]. The apparent advantage of using a parallelizing compiler is that the conversion of a given serial program into parallel form is done mechanically by the tool. One disadvantage of this scenario is that the compiler may have insufficient knowledge or limited capabilities to parallelize a program optimally. In some cases it would be easy for the user to make up for these shortcomings. For example, although the compiler detects a value-specific data dependence, the user may know that in every reasonable program input the values are such that the dependence does not occur. In other cases, users may know that the array sections accessed in different loop iterations do not overlap. Furthermore, certain program transformations may make a substantial performance difference, but are applicable to very few programs, and hence not built into a compiler's repertoire. If a user can find the reason why a loop was not parallelized automatically, a small modification may be applied that ensures parallel execution. For these reasons, manual code modification in addition to automatic parallelization is often necessary to achieve good performance.
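As a small illustration of the value-specific dependence mentioned above, consider a loop of the form a[i + k] = a[i] + 1, where the offset k is only known at run time. The sketch below is not taken from Polaris or any other compiler; the loop shape and the function name `has_carried_dependence` are assumptions chosen for illustration. It checks by enumeration whether any iteration reads an element that a different iteration writes.

```python
def has_carried_dependence(n: int, k: int) -> bool:
    """Check the hypothetical loop `for i in range(n): a[i + k] = a[i] + 1`.

    Returns True when some iteration reads an element that a different
    iteration writes, i.e. when the iterations cannot safely run in any order.
    """
    # Map each written element index to the iteration that writes it.
    writer_of = {i + k: i for i in range(n)}
    # Iteration i reads element i; a conflict needs a writer other than i itself.
    return any(i in writer_of and writer_of[i] != i for i in range(n))


if __name__ == "__main__":
    for offset in (0, 3, 8, 100):
        print(offset, has_carried_dependence(8, offset))
```

A compiler that sees only a symbolic k has to assume the conflicting case. A user who knows that k is never smaller than the trip count for any reasonable input could assert that the iterations are independent.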
Integrated compilation and performance evaluation. During the process of compiling a parallel program and measuring its performance, a considerable amount of information is gathered. For example, timing information becomes available from various program runs, structural information of the program is gathered from the code documentation, and compilers offer a large amount of program analysis information. Finding parallelism starts from looking through this information and locating potentially parallel sections of code. Improving parallel performance is the immediate next step. Decisions are made based on timing results and their relationship to program characteristics. The bookkeeping effort accompanying this procedure is often overwhelming. Tools that assist this process are important.

In this paper, we introduce an on-going tool project that supports a scenario of user-plus-compiler parallelization. The tool helps a programmer understand the structure of a program, identify parallelism, and compare performance results of different program variants. The tool, Ursa Minor [PVAE97], gathers information along the course of compiling and running a program and presents it in a format that is easy to look up and comprehend. Using the tool, the programmer comes to an understanding of the compilation process, the characteristics of the given program, its performance results, and the relationships of these data. It is the basis for enhancing the performance of an existing parallel program as well as for beginning to parallelize a serial program.

The presented tool is closely related to the Polaris compiler infrastructure [BDE+96]. Polaris, as a compiler, includes advanced program analysis and transformation techniques, such as array privatization, symbolic and nonlinear data dependence testing, idiom recognition, interprocedural analysis, and symbolic program analysis. Polaris also represents a general infrastructure for analyzing and manipulating Fortran programs, which can provide useful information about the program structure and its potential parallelism. Polaris plays a major role in generating the data files used as input to Ursa Minor. Examples of such files are loop parallelization summaries, data-dependence information, and loop/subroutine call graphs. Polaris also instruments programs for timing measurements and maximum parallelism detection.

Section 2 presents our objectives in developing Ursa Minor. Section 3 gives an overview and discusses its functionality. Section 4 presents the Ursa Major tool [PE98], a Web-based tool built upon Ursa Minor that was designed for distribution and evaluation of experimental results with various parallel applications. Section 5 then shows two case studies of Ursa Minor in use. Section 6 concludes the paper.

2 Objectives of Ursa Minor

The intended users of the Ursa Minor tool are parallel programmers that have some experience using parallelizing compilers and performance analysis tools. In order to assist them in identifying and exploiting parallelism, the tool pursues the following objectives:

Integrated Browsers for Program, Compilation, and Performance Data: The Ursa Minor tool collects and facilitates the use of program, compilation, and performance data. The information needs to be presented in a format that conveys high-level as well as detailed descriptions of a program. In this way, a user can start from an overall view of the program and inspect the details whenever he or she feels the need to concentrate on a specific portion of the program. The tool complements and integrates capabilities provided by tools such as the Pablo [Ree9], Paradyn [MCC+95], and PTOPP [EM93] performance analysis environments.

Interactive Compilers: The current, predominantly black-box use of parallelizing compilers needs to be changed into an interactive scenario. This goes beyond interactive pass invocation as pioneered by tools such as Start/Pat [ASM89] and ParaScope [BKK+89]. The ultimate goal of the Ursa Minor project is to provide a comprehensive environment that encompasses the process of writing, compiling, running, and improving parallel programs. To this end interactive capabilities
are provided to view program information gathered by the compiler and relate it to information provided by other programming tools.

These objectives distinguish our approach from related efforts, such as the Polaris, Pablo and Paradyn projects, which provide advanced facilities for optimizing and instrumenting programs, gathering performance data, and visualizing this information. The Ursa Minor environment provides aids for the user to understand the gathered performance data and to reason about the information in an interactive way. In the sense that the tool provides users with advice to improve performance, Ursa Minor has a similar objective to that of VTune [Int97], which is an advanced tool for single-processor systems.

In addition to the main objectives, we observe the following design rules to make our tool more useful and easily accessible:

Portability: For disseminating a new tool to the user community, it is important that it be easy to install on new platforms. We approach this goal by implementing Ursa Minor in the target-independent Java language, and by using only widely-available application programming interfaces (APIs). The tool makes use of information gathered by other facilities, such as the Polaris compiler and its performance analysis libraries, which themselves are portable to many platforms. In addition, Ursa Minor is to be flexible in the data format it can read, such that it can adapt to the tools (compilers and performance analyzers) available on the local platform.

Expandability: The main function of the Ursa Minor tool is information gathering and browsing. Hence, whenever we obtain new types of information about the given program we should be able to see it through the tool with minimal modifications. We can also enable the tool to read a generic data file, so that new types of information can be understood without significant modifications.

Leveraging off existing tools: We consider using other available tools to augment the features of Ursa Minor that we regarded as "not original but nice to have". For instance, there are spreadsheets capable of rich graphical presentation of data. By allowing the information to be understood by one of these spreadsheets, we can take advantage of its features to create charts, while focusing on the new functionality of Ursa Minor.

3 Description of Ursa Minor

In this section, we provide an overview of Ursa Minor [PVAE97] and describe its functionality. We will discuss how our design objectives were realized in the concrete tool.

3.1 Overview

The Ursa Minor project provides tools that assist parallel programmers in effectively writing and tuning codes. It provides users with information available from various sources in a comprehensible way. These sources include tools such as compilers, profilers, and simulators. It interacts with users through a graphical interface, which can provide selective views and combinations of the data. Figure 1 illustrates the interaction between Ursa Minor and the various data files.

Ursa Minor collects and combines information from various sources. Timing information is gathered from instrumented program runs. The tool performing this instrumentation is a Polaris-based utility, not discussed further in this paper [Eig93]. Maximum parallelism estimates are supplied by the Max/P tool [Pet93, KE97]. Information about which loops are serial or parallel is provided by the actual Polaris compiler. The Ursa Minor tool includes a subroutine and loop nest structure analyzer, also implemented using the Polaris infrastructure.

In the current implementation, these information sources are available in files that need to be created explicitly by the user before Ursa Minor can read and combine them. Once they exist, several tool options are provided to read from the various original files, add to the existing information incrementally,
store the entire database, or read from a previously saved database. In future releases we plan to automate the process of creating the information sources by, for example, invoking the compiler on demand.

[Figure 1: Components of the Ursa Minor tool and their interactions. Information sources generated by Polaris-based tools (calling structure analyzer result, performance results, data dependence test summary, simulation report from Max/P) and other information sources from other tools feed the UMD (Ursa DataBase) inside Ursa Minor. The UMD can be opened from or saved to a saved database, exported to a spreadsheet, and related to the source file. The user interacts with the UMD through the Loop Table View and the Call Graph View, both of which support presentation and editing.]

Internally, Ursa Minor stores information in Ursa Minor/Major Databases (UMD). A UMD is a storage unit that holds the collective information about a program and its execution results in a certain system environment. This database is organized as a text file, which can optionally be inspected with an editor and printed. Furthermore, the information can be saved in a format that can be read by commercial spreadsheets, providing a richer set of data manipulation functions and graphical representations.

The Ursa Minor tool is written in Java. Thus, any platform on which the Java runtime environment is available can be used to run the tool. It uses the basic Java language with standard APIs, which enhances the portability of the tool. Object orientation in Java allows a relatively easy addition of new types of data to the database. The windowing toolkits and utilities provide a good environment for prototyping user interfaces, which enables us to focus on the design of the tool functionality. Furthermore, Java, with its network support, makes a useful language for realizing another goal of this project: making available the gathered program, compilation, and performance results to remote users. This goal has been realized in the Ursa Major tool, which is discussed in Section 4. In the next section, we examine the functionality of Ursa Minor more closely.

3.2 Functionality

The Ursa Minor tool presents information to the user through two display windows: a loop information table and a call graph. The user interacts with the tool by choosing menu items or mouse-clicking.

Figure 2 shows the loop table view, each line displaying information for an individual loop. Currently, the table displays information such as timing results from various program runs, the number of invocations of each loop, the parent in the nest structure, and the maximum degree of parallelism provided by Max/P [Pet93, KE97]. It also indicates whether a loop is serial or parallel as detected by Polaris. If it is serial, the reason given by the compiler can be displayed on mouse-clicking. In Figure 2,
the user has clicked on loop RESTAR do56 to see the reason inhibiting parallelization.

[Figure 2: Loop Table View of the Ursa Minor tool.]

Whenever new information from other tools becomes available, the user can add columns in this view. Also, a user can rearrange columns, delete columns, and sort the entries alphabetically or based on the execution time. By specifying a reference column, speedups can be calculated on demand. In our program tuning projects, an Ursa Minor loop table is usually present all the time. After each program run, the newly collected timing information is included as an additional column in the loop table. In this way, performance differences can be inspected immediately for each individual loop as well as for the overall program. Effects of program modifications on other program sections become obvious as well. The modification may change the relative importance of the loops, so that sorting them by their newest execution time yields a new most-time-consuming loop on which the programmer can focus next.

Another view of Ursa Minor provides the calling structure of a given program, which includes subroutine, function, and loop nest information as shown in Figure 3. Each rectangle represents either a subroutine, function, or loop. For example, parallel loops are represented by green rectangles, and serial loops by red rectangles. Clicking one of these will display the corresponding source code. In Figure 3 the user is inspecting the loop ACTFOR do2 in this way. If one wants a wider view of the program structure, the user can zoom in and out. This display helps to understand the program structure for tasks such as interchanging loops or finding outer or inner candidate parallel loops.

[Figure 3: Annotated Call Graph View of the Ursa Minor tool.]

Ursa Minor can save the database in a format that generic spreadsheet programs can understand. In Figure 4 we have read this form into the commercial Xess3 spreadsheet program. This allows one to exploit the many options and graphical representations of this tool. In Figure 4 the user has chosen an execution time graph for the program BDNA, comparing the performance of Polaris with the compiler from Sun Microsystems (a third line indicates "linear speedup" for reference).

[Figure 4: Spread-Sheet View of the Ursa Minor tool.]

4 Ursa Major: Web-based evaluation of parallel applications

Ursa Major [PE98] is an extension of the Ursa Minor tool. Because we chose Java as an implementation language, it was natural to combine our tool capabilities with the rapidly advancing Internet technology and, in this way, allow users at remote sites to access our experimental data.

In extending the use of our tool to a world-wide audience we are addressing several new issues:

First, affordable multiprocessor workstations and PCs are currently leading to a substantial increase in non-expert users and programmers. However, there are no established programming methodologies that could guide these users in exploiting the new machines. Ursa Major provides a methodology of "learning by example" to both local and remote users. New users see a variety of sample programs, their serial and parallel source code, performance improvements resulting from compilation or source code modifications, etc. The tool helps relate all these pieces of information, so that, for example, one can identify precisely the effect of a source code change on the performance for both the modified code section and the overall program.

Second, a core need for advancing the state of the art of computer systems is performance evaluation and the comparison of results with those obtained by others. To this end, many test applications have been made publicly available for study and benchmarking by both researchers and industry. Although a large body of measurements obtained from these programs can be found in the literature and on public data repositories, it is usually extremely difficult to combine them into a form meaningful for new purposes. In part this is because data are not readily available (i.e., they have to be extracted from several papers) and they have to undergo substantial re-categorizations and transformations. In addressing this issue, the Ursa Major project is creating a comprehensive database of information. From the beginning, the abstraction of performance and program information into a form that answers the questions of the observer was one of our goals. However, this issue becomes drastically more complex as we consider large data repositories organized into a multitude of dimensions. The Internet technology and its combination with high-performance computing tools opens this new realm of questions and opportunities, which we are beginning to explore with Ursa Major.

4.1 Description of Ursa Major

Ursa Major is a Web-based tool capable of presenting the Ursa Minor/Major database to a remote user. Figure 5 shows an overall view of the interactions between Ursa Major, a user, and the Ursa Major repository (UMR), which will be discussed in the next section. Ursa Major is available at http://www.ecn.purdue.edu/~ipark/um/index.html.

Ursa Minor's facilities for manipulating databases and for creating graphical user interfaces are basic building blocks for Ursa Major. Java class inheritance was utilized extensively for developing Ursa Major's modules from these components. In addition, new modules were created for the tool's networking features and for organizing the data into a repository that is easy to access from remote Web sites. The latter includes the definition of naming schemes with which information can be found intuitively and can easily be related to other information.

Since it is based on Ursa Minor, Ursa Major offers the same basic functionality. One difference is the access to the repository. Remote Java applications cannot access disk files directly. They have to retrieve data in the form of Web documents. This is due to Java security restrictions. Users may choose the UMDs of their interest by examining the descriptions provided for the available UMDs. UMDs are then retrieved by their URL. Once a UMD is displayed, users may perform the same tasks as they do with Ursa Minor, except that they cannot save files on the local disk. The look and feel of the Ursa Major tool are almost identical to those of Ursa Minor, but Ursa Major is embedded in a Web page through a Java applet and is invoked by clicking a button in the Web page.

4.2 Ursa Major Repository (UMR)

During the process of compiling a parallel program and measuring its performance, a considerable amount of information is gathered. Several such efforts are ongoing in our group, hence the UMR is
continuously being extended. It currently contains several benchmark suites that have been studied at Purdue University, including SPEC and Perfect benchmarks.

[Figure 5: Interaction provided by the Ursa Major tool. A remote server holds the UMR (Ursa Major Repository). The Ursa Major applet is obtained by Java program download and retrieves a UMD (Ursa Database) by database download. The user interacts with the database through the Loop Table View and the Call Graph View, both of which support presentation and editing.]

The specific data includes structural program information, results of program analysis, simulation reports, as well as the timing information of various program runs. Finding parallelism starts from looking through this information and locating potentially parallel sections of code. Several tools and methodologies are being used to gather and organize such data [VGGJ+89, EM93].

One issue in designing the repository was to define storage schemes that make it easy for users to find information entered by other users. To this end, the repository structure consists of extensions on file and directory names indicating data such as the program names, platforms, compilers, optimization, and parallel languages. To be flexible, these extensions are not hard-coded. Instead, they are described in a configuration file that is read by Ursa Major at the start of a session.

4.3 Experiences with Ursa Major

We present early experiences with using the Ursa Major tool and with its implementation. We have used the tool in our research team, on multiple workstation platforms and also PCs connected through modems at home. Our team includes researchers at two universities, so that realistic remote accesses were involved. Based on these experiences we can picture scenarios of how the different user communities can best take advantage of the tool and what challenges need to be addressed to make it even more useful in the future.

Ursa Major targets several audiences. They include novice parallel programmers, advanced programmers, and researchers interested in performance evaluation and benchmarking. Obviously these categories can overlap. For beginners, the tool supports a methodology of "learning by example". New programmers start by getting the general feel for the repository. This is best done starting with the call graph view and clicking on several nodes in this graph to inspect the source programs. To get more
insights about an individual program the user now can step through the most time-consuming loops and compare serial and parallel program versions. Ursa Major supports this by providing the loop table view. Source code corresponding to the serial and the parallel variant can be opened. The loop table also shows timings of the two variants, giving the user a first view of the speedups obtained by each loop. The tool can compute and display these speedup numbers as an option. Comparing these program variants gives the new user a first idea of how programs need to be transformed to run in parallel and what performance improvement can be obtained.

The advanced programmer may benefit from this tool by exploiting the features allowing the inspection of the reasons why certain parallel loops or program sections perform well or poorly in more detail and why a code section is not parallel. In this way, users may identify the bottleneck and possible improvements by combining the perspectives from both performance evaluation and compiler analysis results.

Ursa Major further serves the research community in general by making available the large amount of information kept in the Ursa Major repository and facilitating access to this information in various dimensions. Even within our research group the availability of the repository enabled many different studies, such as architectural comparisons, comparisons of different compilers, different programs, different subroutines and loops within a program, and scalability studies over numbers of processors and data set sizes. Increasing the support for inspecting our database from these various angles is an important ongoing effort.

5 Case Studies

5.1 Experiments with the ARC2D Application

In a current study, we are comparing parallel directive languages for their suitability as a portable compiler output representation [Vos97]. In doing so, we have expressed the parallelism in several benchmark codes with various directive sets. If the performance results of these codes are significantly different, Ursa Minor is used in the search for explanations of these differences. An example of such a search, performed on the Perfect Benchmarks code ARC2D, is presented here as our first case study.

First, as a base-line measurement, a loop by loop profile of the serial version of the code executed on a 4 processor UltraSPARC workstation was done. The results of this instrumentation were then gathered by Ursa Minor and transformed into a form which is readable by commercial spreadsheet packages such as Excel and Xess3. One concern with instrumentation is that the associated overhead will noticeably impact the measured performance. Using the number of times each loop is executed, as well as the execution time measured by the instrumentation, it is easily determined when such perturbation occurs. In ARC2D, 11 of the 19 loops had an instrumentation overhead of more than .1% of the loop execution time. We chose .1% as the cutoff to ensure that the instrumented timing measurements still reflected the program performance with high accuracy. Removing the instrumentation from these 11 loops reduced the total execution time of the program by 6%. Ursa Minor currently provides the average execution times for computing the overhead. In future releases of the tool this computation will be fully automated.

Additionally, the most time-consuming loops were identified in the serial code. The Polaris-parallelized versions of these loops were used to compare the performance of several parallel directive languages. The major loops in ARC2D parallelized by Polaris are FILERX do19, STEPFX do21 and STEPFX do23. The identification of these loops was straightforward given that Ursa Minor presented the execution times of each loop as well as annotated it as parallel or serial. The relative importance of these loops in the serial version can be seen in Figure 6.

The parallelism found by Polaris was expressed in two forms: one using the native Sun SPARC dialect and the other using the portable KAP/Pro directive set [Kuc88], a close relative of the new OpenMP industry standard [OMP97]. Browsing through the performance results displayed by Ursa Minor, it was seen that on 4 processors, the KAP/Pro directive language exhibited superior performance.
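The perturbation check described earlier in this section (using each loop's invocation count and its measured time) can be sketched as follows. This is a minimal sketch under an assumption the paper does not state: that every instrumented invocation costs a fixed amount of time, `cost_per_call`. The function names, the example profile, and all numbers are hypothetical and are not ARC2D measurements. The cutoff is left as a parameter.

```python
def overhead_fraction(invocations: int, measured_time: float, cost_per_call: float) -> float:
    """Estimated share of a loop's measured time spent in timing calls."""
    if measured_time <= 0:
        raise ValueError("measured_time must be positive")
    return invocations * cost_per_call / measured_time


def perturbed_loops(profile, cost_per_call, cutoff):
    """Return the names of loops whose estimated overhead exceeds the cutoff.

    profile maps a loop name to (invocations, measured_time_in_seconds).
    """
    return sorted(
        name
        for name, (invocations, measured_time) in profile.items()
        if overhead_fraction(invocations, measured_time, cost_per_call) > cutoff
    )


# Hypothetical profile, not taken from the paper.
EXAMPLE_PROFILE = {
    "LOOP_A": (1_000_000, 2.0),  # invoked very often, short body
    "LOOP_B": (10, 5.0),         # invoked rarely, long body
}

if __name__ == "__main__":
    # Prints ['LOOP_A']: only the frequently invoked loop is flagged.
    print(perturbed_loops(EXAMPLE_PROFILE, cost_per_call=1e-6, cutoff=0.001))
```

Loops flagged this way are candidates for having their instrumentation removed before the timings are trusted.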
[Figure 6: Percentage of execution time spent in major loops of ARC2D. STEPFX do23 (11.7%), STEPFX do21 (1.9%), FILERX do19 (6.%), Others (71.%).]

Furthermore, by adding the loop-by-loop profile of ARC2D, as parallelized by the Sun native compiler, an interesting phenomenon was discovered: a significant "negative overhead" existed for many of the loops in the KAP/Pro version when comparing the 1 processor parallel execution to the execution of the untransformed code. Apparently, sequential optimizations were performed in the KAP/Pro version which were not performed in the serial version. Interestingly, this same optimization was often performed in the loops found to be parallel by the native compiler, but not in the Polaris version which used the Sun SPARC directives. The performance of the three major loops is shown in Figure 7.

Using the source code browsing capabilities, a side-by-side comparison of the loop nests uncovered the reason. Loop interchanging was being applied to many of the loop nests in the KAP/Pro directive version by the back-end compiler. The use of the Sun SPARC directives inhibited this transformation. Loop interchanging was not disabled when parallelizing the code with the native Sun parallelizing compiler; however, it was applied less frequently. For a more detailed discussion of this phenomenon and others uncovered during the analysis of ARC2D, please refer to [Vos97].

A further analysis of the serial source, the Polaris translated versions, and their graphical loop structure representation showed that the two most significant loops, STEPFX do21 and STEPFX do23, were imperfectly nested in the original source, but were transformed into a perfect nest by Polaris. The application of forward substitution and dead code elimination by Polaris created perfectly nested loops, which the back-end compiler was then able to interchange. Therefore, although the native Sun parallelizing compiler was able to identify the same amount of parallelism as Polaris, it did not apply further optimizations. Figure 8 shows the performance of the three parallel versions of ARC2D executed on 4 processors of the UltraSPARC. This figure also shows the performance that would be obtained in the Sun SPARC directive version if the interchanging had been done.

[Figure 7: Loop performance of ARC2D on an UltraSPARC: (a) Execution time of FILERX do19, (b) Speedup of FILERX do19, (c) Execution time of STEPFX do21, (d) Speedup of STEPFX do21, (e) Execution time of STEPFX do23 and (f) Speedup of STEPFX do23. Axes: Execution Time (sec) or Speedup versus Number of Processors. Legend: Native Parallelizer, Polaris+Native Directives, Polaris+KAP/Pro Directives.]

[Figure 8: Performance of ARC2D on 4 Processors of UltraSPARC. Versions shown: Native Sun Parallelizer, Polaris+Sun Directives, +Perfect Nest Interchange, +Imperfect Nest Interchange, Polaris+KAP/Pro Directives.]

Ursa Minor allowed the characteristics responsible for the performance differences in ARC2D to be quickly identified. The often tedious task of tabularizing profiling results was performed automatically and the identification of the parallel loops in this table was made obvious. The nesting structure of the loops was a major factor in the performance of this code, and Ursa Minor's graphical display of the loop structure was a significant aid in quickly identifying this phenomenon. A detailed study of the several versions of the source code for each loop nest was often necessary, and a side-by-side comparison was easily performed with the browsing facilities. The graphs presented in Figures 6 through 8 can be generated by exporting the Ursa Minor/Major database to the Xess3 spreadsheet and using its graphing functions.

The full results of this study, performed on 8 benchmark programs across multiprocessor architectures, can be interactively explored through the Ursa Major Web page. Performance measurements obtained on a 4 processor SPARCstation 20, a 6 processor UltraSPARC Enterprise, a 16 processor Silicon Graphics Power Challenge and a 32 processor Origin2000 have been made available as UMDs at
that site. For a detailed description of these results refer to [Vos97].

5.2 Experiment with the Seismic Application Suite

As the second case study, we introduce another project that characterizes and analyzes large-scope industrial applications [AE97]. One of the programs we considered was the Seismic Benchmark Suite [MH93], a seismic activity simulation program consisting of 20,000 lines of Fortran code. The Seismic Benchmark Suite contains a deep hierarchy of nested subroutines and loops. Our goal is to understand how the computational complexity of the overall application suite scales with the number of processors and with the input data space. Here, we will briefly describe how the Ursa Minor tool can be of help in the process of characterizing a large application.

[Figure 9: Actual measurements of loop execution times were compared with predicted times to determine the accuracy of the model on a loop-by-loop basis. The separate colors represent the loops of this seismic phase. The actual measurements were gathered using a 32-processor node of an SGI/Cray Origin2000 of NCSA at the University of Illinois.]

To characterize an application's execution time we sum the times contributed by each loop. A loop's execution time, exclusive of any inner loops, is estimated by obtaining an expression for the number of iterations the loop will execute and combining this expression with the average time per iteration of the loop from actual measurements. In order to do this we use the loop table view in Ursa Minor, which provides average loop execution times as well as a loop's parent in the calling structure. With codes as large as the Seismic suite the simple task of locating the beginning and ending of loops becomes cumbersome and prone to human errors. Ursa Minor greatly simplifies this task and provides a visual description of the loop nest hierarchy with its call graph view.
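The characterization just described, summing per-loop estimates that combine an iteration-count expression with a measured average time per iteration, can be sketched as below. The function name, the example loops, and their iteration expressions are hypothetical; they only illustrate the shape of such a model and do not reproduce the Seismic Suite characterization.

```python
def predicted_application_time(loops, data_size, processors):
    """Sum the predicted times of all loops.

    loops is a list of (name, iteration_expr, time_per_iteration) where
    iteration_expr(data_size, processors) returns the number of iterations
    one processor executes (exclusive of inner loops) and time_per_iteration
    is a measured average in seconds.
    """
    return sum(
        iteration_expr(data_size, processors) * time_per_iteration
        for _name, iteration_expr, time_per_iteration in loops
    )


# Hypothetical two-loop model: an outer loop over n items and an inner
# body executed n * n times, both divided evenly among p processors.
EXAMPLE_LOOPS = [
    ("outer", lambda n, p: n / p, 2e-3),
    ("inner", lambda n, p: n * n / p, 1e-6),
]

if __name__ == "__main__":
    for procs in (1, 4, 8):
        print(procs, predicted_application_time(EXAMPLE_LOOPS, 1000, procs))
```

Because the iteration expressions take the data size and processor count as arguments, the same model can be evaluated for configurations that were never measured, which is the basis for comparing forecasts with measurements loop by loop.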
[Figure 9 plot: Phase: Depth Migration. Time (seconds) versus Number Processors, with paired bars F = Forecasted and M = Measured.]

After we characterized the code's performance, we used Ursa Minor to determine the accuracy of our characterization and locate the points needing refinement. Figure 9 compares actual measurements with our predicted times for one seismic processing phase (called Depth Migration) as the number of processors increases from 1 to 32. Ursa Minor aided in gathering the data from both the measurements and our model so that each loop's performance could be analyzed individually. Loops that scaled differently from the measured timings were easy to find. Our model could then be modified for more accurate
predictions. We used this process to test our model's scalability when the data size increased as well as when the number of processors increased.

The final goal of our characterizing process was extrapolating the Seismic Suite's performance to machines larger (more processors) than we currently have available and to input data sizes appropriate for such large machines. Databases of predicted execution times were exported from Ursa Minor for importing into an Xess3 spreadsheet, in which we produced graphs visually depicting the scalability of the application. Figure 10 shows extrapolation results for one seismic processing phase, again Depth Migration, as the number of processors is increased from 1 to 2,048 processors. The data set is one which would use 3 terabytes of disk space.

[Figure 10: Forecasted performance of the Seismic Suite as the machine size is scaled up. The curves divide the total execution time into computation, communication, and disk IO times. The total time is dominated by the computation time (because of this the curves "total" and "comp" overlap). Phase: Depth Migration. Time (seconds, log scale) versus Number of Processors. Curves: Total, Comp, Comm, Disk IO, Disk Reads, Disk Writes.]

Another objective of the Seismic Benchmark case study was to produce a well-performing loop-parallel program. As originally written, the Seismic Benchmark is a message-passing code. We investigated how well a loop-parallel version of the program would perform using Polaris as a starting point. Ursa Minor calculated the speedup of our loop-parallel program for each loop, flagging the loops with speedups below 1. These loops were then investigated further to improve their automatic parallelization by Polaris. If no improvements could be made, we forced a loop to execute serially so that it would not incur any parallel execution overhead.

The data from the Seismic Benchmark case study is currently available to outside users through the use of Ursa Major. Measurements were gathered using the SGI/Cray Origin2000 at NCSA.

6 Conclusion

We have presented an on-going project that provides tools and methodologies for parallel program development and performance evaluation. Ursa Minor and Ursa Major support user models of "parallel programming by examples" for beginners and interactive compilation and performance tuning for experts. They also serve as a program and benchmark database for computing systems research. The
tools integrate information available from performance analysis tools, compilers, simulators, and source programs to a degree not provided by previous tools. Ursa Major can be executed on the World-Wide Web, from where a growing repository of information can be viewed.

The Ursa tool family is evolving in a need-driven way. Its developers are also involved in projects such as the characterization and analysis of real applications and the development of parallelizing compilers. Tool capabilities needed in these efforts are being integrated in both Ursa Minor and Ursa Major. Keeping close together the tool design projects and application characterization efforts will ensure the practicality of our tool in the future.

Several enhancements are planned next. New categories of information will be integrated into the tools and their user views. For example, we will include improved compiler explanations why certain optimizations were or were not performed. This enables the programmer to input missing data to the compiler or to perform certain transformations by hand. Another important goal is the support for user methodologies. As a long-term goal we envision facilities that allow one to query the information repository directly for suggested improvements of programs, compilers, or architectures. Better support for the tool's Web response is another ongoing effort. As we have only begun to explore the potential offered by the new Internet technology, continuous feedback from its user community will help improve the tool's service to a world-wide audience.

References

[AE97] Brian Armstrong and Rudolf Eigenmann. Performance forecasting: Characterization of applications on current and future architectures. Technical Report ECE-HPCLab-9722, Purdue University, School of Electrical and Computer Engineering, High-Performance Computing Laboratory, February 97.

[AO88] J. Ambras and V. O'Day. MicroScope: A Knowledge-Based Programming Environment. IEEE Software, pages 5-58, May 1988.

[ASM89] Bill Appelbe, Kevin Smith, and Charles McDowell. Start/Pat: A Parallel-Programming Toolkit. IEEE Software, 6():29-38, July 1989.

[BDE+96] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu. Parallel programming with Polaris. IEEE Computer, pages 78-82, December 1996.

[BKK+89] V. Balasundaram, K. Kennedy, U. Kremer, K. McKinley, and J. Subhlok. The ParaScope editor: An interactive parallel programming tool. In International Conference on Supercomputing, pages 5-55, 1989.

[BST86] G. Bruno, P. Spiller, and I. Tota. AISPE: An Advanced, Industrial Software-Production Environment. Proceedings of Computer Software and Applications Conf., pages 9-99, 1986.

[Eig93] Rudolf Eigenmann. Toward a Methodology of Optimizing Programs for High-Performance Computers. Conference Proceedings, ICS'93, Tokyo, Japan, pages 27-36, July 2-22, 1993.

[EM93] Rudolf Eigenmann and Patrick McClaughry. Practical Tools for Optimizing Parallel Programs. Presented at the 1993 SCS Multiconference, Arlington, VA, March 27-April 1.

[HAA+96] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, pages 8-89, December 1996.
[Int97] Intel. VTune: Visual Tuning Environment, 1997. http://developer.intel.com/design/perftool/vtune/index.htm.
[KE97] Seon-Wook Kim and Rudolf Eigenmann. Max/P: detecting the maximum parallelism in a Fortran program. Purdue University, School of Electrical and Computer Engineering, High-Performance Computing Laboratory, 1997. Manual ECE-HPCLab-9721.
[KT87] J. H. Kuo and H. C. Tu. Prototyping a Software Information Base for Software-Engineering Environments. Proceedings of Computer Software and Applications Conf., pages 38-, 1987.
[Kuc88] Kuck & Associates, Inc., Champaign, Illinois. KAP User's Guide, 1988.
[MCC+95] Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna Kunchithapadam, and Tia Newhall. The Paradyn parallel performance measurement tools. IEEE Computer, 28(11), November 1995.
[MH93] C. C. Mosher and S. Hassanzadeh. ARCO seismic processing performance evaluation suite, user's guide. Technical report, ARCO, Plano, TX., 1993.
[OMP97] OpenMP: A Proposed Industry Standard API for Shared Memory Programming. Technical report, OpenMP, October 1997.
[PE98] Insung Park and Rudolf Eigenmann. Ursa Major: Exploring Web technology for design and evaluation of high-performance systems. In Proc. of the International Conference on High Performance Computing and Networking, April 1998.
[Pet93] Paul Marx Petersen. Evaluation of Programs and Parallelizing Compilers Using Dynamic Analysis Techniques. PhD thesis, Univ. of Illinois at Urbana-Champaign, Center for Supercomputing Res. & Dev., January 1993.
[PVAE97] Insung Park, Michael J. Voss, Brian Armstrong, and Rudolf Eigenmann. Interactive compilation and performance analysis with Ursa Minor. In Workshop of Languages and Compilers for Parallel Computing, August 97.
[Ree9] Daniel A. Reed. Experimental performance analysis of parallel systems: Techniques and open problems. In Proc. of the 7th Int' Conf on Modelling Techniques and Tools for Computer Performance Evaluation, pages 25-51, 199.
[VGGJ+89] Vincent Guarna Jr., Dennis Gannon, David Jablonowski, Allen Malony, and Yogesh Gaur. Faust: An Integrated Environment for the Development of Parallel Programs. IEEE Software, pages 2-27, July 1989.
[Vos97] Michael J. Voss. Portable loop-level parallelism for shared memory multiprocessor architectures. Master's thesis, School of Electrical and Computer Engineering, Purdue University, October 97.