
Parallel Programming and Performance Evaluation with the Ursa Tool Family

Insung Park, Michael Voss, Brian Armstrong, Rudolf Eigenmann
School of Electrical and Computer Engineering
Purdue University

Abstract

This paper contributes to the solution of several open problems with parallel programming tools and their integration with performance evaluation environments. First, we propose interactive compilation scenarios instead of the usual black-box-oriented use of compiler tools. In such scenarios, information gathered by the compiler and the compiler's reasoning are presented to the user in meaningful ways and on-demand. Second, a tight integration of compilation and performance analysis tools is advocated. Many of the existing, advanced instruments for gathering performance results are being used in the presented environment and their results are combined in integrated views with compiler information and data from other tools. Initial instruments that assist users in "data mining" this information are presented and the need for much stronger facilities is explained.

The Ursa Family provides two tools addressing these issues. Ursa Minor supports a group of users at a specific site, such as a research or development project. Ursa Major complements this tool by making available the gathered results to the user community at large via the World-Wide Web. This paper presents objectives, functionality, experience, and next development steps of the Ursa tool family. Two case studies are presented that illustrate the use of the tools for developing and studying parallel applications and for evaluating parallelizing compilers.

1 Introduction

Interactive use of parallelizing compilers. Many programming tools exist that assist the user in the challenging task of developing well-performing parallel programs. Parallelizing compilers are one important class of such tools [BDE+96, HAA+96]. The apparent advantage of using a parallelizing compiler is that the conversion of a given serial program into parallel form is done mechanically by the tool. One disadvantage of this scenario is that the compiler may have insufficient knowledge or limited capabilities to parallelize a program optimally. In some cases it would be easy for the user to make up for these shortcomings. For example, although the compiler detects a value-specific data dependence, the user may know that in every reasonable program input the values are such that the dependence does not occur. In other cases, users may know that the array sections accessed in different loop iterations do not overlap. Furthermore, certain program transformations may make a substantial performance difference, but are applicable to very few programs, and hence not built into a compiler's repertoire. If a user can find the reason why a loop was not parallelized automatically, a small modification may be applied that ensures parallel execution. Because of these reasons, manual code modification in addition to automatic parallelization is often necessary to achieve good performance.

This work was supported in part by Purdue University, U.S. Army contract #DABT63-92-C-33, and an NSF CAREER award. This work is not necessarily representative of the positions or policies of the U.S. Army or the Government.

Integrated compilation and performance evaluation. During the process of compiling a parallel program and measuring its performance, a considerable amount of information is gathered. For example, timing information becomes available from various program runs, structural information of the program is gathered from the code documentation, and compilers offer a large amount of program analysis information. Finding parallelism starts from looking through this information and locating potentially parallel sections of code. Improving parallel performance is the immediate next step. Decisions are made based on timing results and their relationship to program characteristics. The bookkeeping effort accompanying this procedure is often overwhelming. Tools that assist this process are important.

In this paper, we introduce an on-going tool project that supports a scenario of user-plus-compiler parallelization. The tool helps a programmer understand the structure of a program, identify parallelism, and compare performance results of different program variants. The tool, Ursa Minor [PVAE97], gathers information along the course of compiling and running a program and presents it in a format that is easy to look up and comprehend. Using the tool, the programmer comes to an understanding of the compilation process, the characteristics of the given program, its performance results, and the relationships of these data. It is the basis for enhancing the performance of an existing parallel program as well as for beginning to parallelize a serial program.

The presented tool is closely related to the Polaris compiler infrastructure [BDE+96]. Polaris, as a compiler, includes advanced program analysis and transformation techniques, such as array privatization, symbolic and nonlinear data dependence testing, idiom recognition, interprocedural analysis, and symbolic program analysis. Polaris also represents a general infrastructure for analyzing and manipulating Fortran programs, which can provide useful information about the program structure and its potential parallelism. Polaris plays a major role in generating the data files used as input to Ursa Minor. Examples of such files are loop parallelization summaries, data-dependence information, and loop/subroutine call graphs. Polaris also instruments programs for timing measurements and maximum parallelism detection.

Section 2 presents our objectives in developing Ursa Minor. Section 3 gives an overview and discusses its functionality. Section 4 presents the Ursa Major tool [PE98], a web-based tool built upon Ursa Minor that was designed for distribution and evaluation of experimental results with various parallel applications. Section 5 then shows two case studies of Ursa Minor in use. Section 6 concludes the paper.

2 Objectives of Ursa Minor

The intended users of the Ursa Minor tool are parallel programmers that have some experience using parallelizing compilers and performance analysis tools. In order to assist them in identifying and exploiting parallelism, the tool pursues the following objectives:

Integrated Browsers for Program, Compilation, and Performance Data: The Ursa Minor tool collects and facilitates the use of program, compilation, and performance data. The information needs to be presented in a format that conveys high-level as well as detailed descriptions of a program. In this way, a user can start from an overall view of the program and inspect the details whenever he or she feels the need to concentrate on a specific portion of the program. The tool complements and integrates capabilities provided by tools such as the Pablo [Ree9], Paradyn [MCC+95], and PTOPP [EM93] performance analysis environments.

Interactive Compilers: The current, predominantly black-box use of parallelizing compilers needs to be changed into an interactive scenario. This goes beyond interactive pass invocation as pioneered by tools such as Start/Pat [ASM89] and ParaScope [BKK+89]. The ultimate goal of the Ursa Minor project is to provide a comprehensive environment that encompasses the process of writing, compiling, running, and improving parallel programs. To this end interactive capabilities are provided to view program information gathered by the compiler and relate it to information provided by other programming tools.

These objectives distinguish our approach from related efforts, such as the Polaris, Pablo and Paradyn projects, which provide advanced facilities for optimizing and instrumenting programs, gathering performance data, and visualizing this information. The Ursa Minor environment provides aids for the user to understand the gathered performance data and to reason about the information in an interactive way. In the sense that the tool provides users with advice to improve performance, Ursa Minor has a similar objective to that of VTune [Int97], which is an advanced tool for single-processor systems.

In addition to the main objectives, we observe the following design rules to make our tool more useful and easily accessible:

Portability: For disseminating a new tool to the user community, it is important that it be easy to install on new platforms. We approach this goal by implementing Ursa Minor in the target-independent Java language, and by using only widely-available application programming interfaces (APIs). The tool makes use of information gathered by other facilities, such as the Polaris compiler and its performance analysis libraries, which themselves are portable to many platforms. In addition, Ursa Minor is to be flexible in the data format it can read, such that it can adapt to the tools (compilers and performance analyzers) available on the local platform.

Expandability: The main function of the Ursa Minor tool is information gathering and browsing. Hence, whenever we obtain new types of information about the given program we should be able to see it through the tool with minimal modifications. We can also enable the tool to read a generic data file, so that new types of information can be understood without significant modifications.

Leveraging of existing tools: We consider using other available tools to augment the features of Ursa Minor that we regarded as "not original but nice to have". For instance, there are spreadsheets capable of rich graphical presentation of data. By allowing the information to be understood by one of these spreadsheets, we can take advantage of its features to create charts, while focusing on the new functionality of Ursa Minor.

3 Description of Ursa Minor

In this section, we provide an overview of Ursa Minor [PVAE97] and describe its functionality. We will discuss how our design objectives were realized in the concrete tool.

3.1 Overview

The Ursa Minor project provides tools that assist parallel programmers in effectively writing and tuning codes. It provides users with information available from various sources in a comprehensible way. These sources include tools such as compilers, profilers, and simulators. It interacts with users through a graphical interface, which can provide selective views and combinations of the data. Figure 1 illustrates the interaction between Ursa Minor and the various data files.

Ursa Minor collects and combines information from various sources. Timing information is gathered from instrumented program runs. The tool performing this instrumentation is a Polaris-based utility, not discussed further in this paper [Eig93]. Maximum parallelism estimates are supplied by the Max/P tool [Pet93, KE97]. Information about which loops are serial or parallel is provided by the actual Polaris compiler. The Ursa Minor tool includes a subroutine and loop nest structure analyzer, also implemented using the Polaris infrastructure.

In the current implementation, these information sources are available in files that need to be created explicitly by the user before Ursa Minor can read and combine them. Once they exist, several tool options are provided to read from the various original files, add to the existing information incrementally, store the entire database, or read from a previously saved database. In future releases we plan to automate the process of creating the information sources by, for example, invoking the compiler on demand.
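The incremental combination of information sources described in the overview can be illustrated with a minimal sketch. It is not the tool's actual code or file format: the function name `add_source`, the column names, and the loop names are all hypothetical, and the database is modeled simply as one row of named values per loop.

```python
from collections import defaultdict


def add_source(database, column, values):
    """Merge one information source into the database as a new column.

    `values` maps a loop name to whatever that source reports for the loop.
    Existing columns are left untouched, so sources can be added one at a time.
    """
    for loop_name, value in values.items():
        database[loop_name][column] = value
    return database


# Hypothetical loop names and values, for illustration only.
database = defaultdict(dict)
add_source(database, "parallel", {"MAIN do10": True, "SOLVE do20": False})
add_source(database, "serial_time_sec", {"MAIN do10": 3.5, "SOLVE do20": 1.25})
add_source(database, "max_parallelism", {"MAIN do10": 64})

for loop_name, row in sorted(database.items()):
    print(loop_name, row)
```

Keying every source by loop name is an assumption made here so that a source reporting on only some loops can still be merged without disturbing the others.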

Figure 1: Components of the Ursa Minor tool and their interactions. (Diagram: information sources generated by Polaris-based tools, namely the calling structure analyzer result, performance results, the data dependence test summary, and the simulation report from Max/P, together with information from other tools, a saved database, and the source file, feed the Ursa Minor UMD (Ursa DataBase). The UMD can be opened, saved, and exported to a spreadsheet, and is presented to and edited by the user through the Loop Table View and the Call Graph View.)

Internally, Ursa Minor stores information in Ursa Minor/Major Databases (UMD). A UMD is a storage unit that holds the collective information about a program and its execution results in a certain system environment. This database is organized as a text file, which can optionally be inspected with an editor and printed. Furthermore, the information can be saved in a format that can be read by commercial spreadsheets, providing a richer set of data manipulation functions and graphical representations.

The Ursa Minor tool is written in Java. Thus, any platform on which the Java runtime environment is available can be used to run the tool. It uses the basic Java language with standard APIs, which enhances the portability of the tool. Object orientation in Java allows a relatively easy addition of new types of data to the database. The windowing toolkits and utilities provide a good environment for prototyping user interfaces, which enable us to focus on the design of the tool functionality. Furthermore, Java, with its network support, makes a useful language for realizing another goal of this project: making available the gathered program, compilation, and performance results to remote users. This goal has been realized in the Ursa Major tool, which is discussed in Section 4. In the next section, we examine the functionality of Ursa Minor more closely.

3.2 Functionality

The Ursa Minor tool presents information to the user through two display windows: a loop information table and a call graph. The user interacts with the tool by choosing menu items or mouse-clicking.

Figure 2 shows the loop table view, each line displaying information for an individual loop. Currently, the table displays information such as timing results from various program runs, the number of invocations of each loop, the parent in the nest structure, and the maximum degree of parallelism provided by Max/P [Pet93, KE97]. It also indicates whether a loop is serial or parallel as detected by Polaris. If it is serial, the reason given by the compiler can be displayed on mouse-clicking. In Figure 2, the user has clicked on loop RESTAR do56 to see the reason inhibiting parallelization.

Whenever new information from other tools becomes available, the user can add columns in this view. Also, a user can rearrange columns, delete columns, sort the entries alphabetically or based on the execution time. By specifying a reference column, speedups can be calculated on-demand. In our program tuning projects, an Ursa Minor loop table is usually present all the time. After each program run, the newly collected timing information is included as an additional column in the loop table. In this way, performance differences can be inspected immediately for each individual loop as well as for the overall program. Effects of program modifications on other program sections become obvious as well. The modification may change the relative importance of the loops, so that sorting them by their newest execution time yields a new most-time-consuming loop on which the programmer can focus next.

Figure 2: Loop Table View of the Ursa Minor tool.

Another view of Ursa Minor provides the calling structure of a given program, which includes subroutine, function, and loop nest information as shown in Figure 3. Each rectangle represents either a subroutine, function, or loop. For example, parallel loops are represented by green rectangles, and serial loops by red rectangles. Clicking one of these will display the corresponding source code. In Figure 3 the user is inspecting the loop ACTFOR do2 in this way. If one wants a wider view of the program structure, the user can zoom in and out. This display helps to understand the program structure for tasks such as interchanging loops or finding outer or inner candidate parallel loops.

Figure 3: Annotated Call Graph View of the Ursa Minor tool.

Ursa Minor can save the database in a format that generic spreadsheet programs can understand. In Figure 4 we have read this form into the commercial Xess3 spreadsheet program. This allows one to exploit the many options and graphical representations of this tool. In Figure 4 the user has chosen an execution time graph for the program BDNA, comparing the performance of Polaris with the compiler from Sun Microsystems (a third line indicating "linear speedup" for reference).

Figure 4: Spread-Sheet View of the Ursa Minor tool.

4 Ursa Major: Web-based evaluation of parallel applications

Ursa Major [PE98] is an extension of the Ursa Minor tool. Because we chose Java as an implementation language, it was natural to combine our tool capabilities with the rapidly advancing Internet technology and, in this way, allow users at remote sites to access our experimental data.

In extending the use of our tool to a world-wide audience we are addressing several new issues:

First, affordable multiprocessor workstations and PCs are currently leading to a substantial increase in non-expert users and programmers. However, there are no established programming methodologies that could guide these users in exploiting the new machines. Ursa Major provides a methodology of "learning by example" to both local and remote users. New users see a variety of sample programs, their serial and parallel source code, performance improvements resulting from compilation or source code modifications, etc. The tool helps relate all these pieces of information, so that, for example, one can identify precisely the effect of a source code change on the performance for both the modified code section and the overall program.

Second, a core need for advancing the state of the art of computer systems is performance evaluation and the comparison of results with those obtained by others. To this end, many test applications have been made publicly available for study and benchmarking by both researchers and industry. Although a large body of measurements obtained from these programs can be found in the literature and on public data repositories, it is usually extremely difficult to combine them into a form meaningful for new purposes. In part this is because data are not readily available (i.e., they have to be extracted from several papers) and they have to undergo substantial re-categorizations and transformations. In addressing this issue, the Ursa Major project is creating a comprehensive database of information. From the beginning, the abstraction of performance and program information into a form that answers the questions of the observer was one of our goals. However, this issue becomes drastically more complex as we consider large data repositories organized into a multitude of dimensions. The Internet technology and its combination with high-performance computing tools opens this new realm of questions and opportunities, which we are beginning to explore with Ursa Major.
4.1 Description of Ursa Major

Ursa Major is a web-based tool capable of presenting the Ursa Minor/Major database to a remote user. Figure 5 shows an overall view of the interactions between Ursa Major, a user, and the Ursa Major repository (UMR), which will be discussed in the next section. Ursa Major is available at http://www.ecn.purdue.edu/~ipark/um/index.html.

Ursa Minor's facilities for manipulating databases and for creating graphical user interfaces are basic building blocks for Ursa Major. Java class inheritance was utilized extensively for developing Ursa Major's modules from these components. In addition, new modules were created for the tool's networking features and for organizing the data into a repository that is easy to access from remote Web sites. The latter includes the definition of naming schemes with which information can be found intuitively and can easily be related to other information.

Since it is based on Ursa Minor, Ursa Major offers the same basic functionality. One difference is the access to the repository. Remote Java applications cannot access disk files directly. They have to retrieve data in the form of web documents. This is due to Java security restrictions. Users may choose the UMDs of their interest by examining the descriptions provided for the available UMDs. UMDs are then retrieved by their URL. Once a UMD is displayed, users may perform the same tasks as they do with Ursa Minor, except that they cannot save files on the local disk. The look and feel of the Ursa Major tool are almost identical to those of Ursa Minor, but Ursa Major is embedded in a web page as a Java applet and is invoked by clicking a button in the web page.

4.2 Ursa Major Repository (UMR)

During the process of compiling a parallel program and measuring its performance, a considerable amount of information is gathered. Several such efforts are ongoing in our group, hence the UMR is continuously being extended. It currently contains several benchmark suites that have been studied at Purdue University, including SPEC and Perfect benchmarks.

The specific data includes structural program information, results of program analysis, simulation reports, as well as the timing information of various program runs. Finding parallelism starts from looking through this information and locating potentially parallel sections of code. Several tools and methodologies are being used to gather and organize such data [VGGJ+89, EM93].

One issue in designing the repository was to define storage schemes that make it easy for users to find information entered by other users. To this end, the repository structure consists of extensions on file and directory names indicating data such as the program names, platforms, compilers, optimization, and parallel languages. To be flexible, these extensions are not hard-coded. Instead, they are described in a configuration file that is read by Ursa Major at the start of a session.

Figure 5: Interaction provided by the Ursa Major tool. (Diagram: the Ursa Major applet and UMDs are downloaded from the UMR (Ursa Major Repository) on a remote server; the user interacts with the downloaded UMD (Ursa Database) through the Loop Table View and the Call Graph View.)

4.3 Experiences with Ursa Major

We present early experiences with using the Ursa Major tool and with its implementation. We have used the tool in our research team, on multiple workstation platforms and also PCs connected through modems at home. Our team includes researchers at two universities, so that realistic remote accesses were involved. Based on these experiences we can picture scenarios of how the different user communities can best take advantage of the tool and what challenges need to be addressed to make it even more useful in the future.

Ursa Major targets several audiences. They include novice parallel programmers, advanced programmers, and researchers interested in performance evaluation and benchmarking. Obviously these categories can overlap. For beginners, the tool supports a methodology of "learning by example". New programmers start by getting the general feel for the repository. This is best done starting with the call graph view and clicking on several nodes in this graph to inspect the source programs.

To get more insights about an individual program the user now can step through the most time-consuming loops and compare serial and parallel program versions. Ursa Major supports this by providing the loop table view. Source code corresponding to the serial and the parallel variant can be opened. The loop table also shows timings of the two variants giving the user a first view of the speedups obtained by each loop. The tool can compute and display these speedup numbers as an option. Comparing these program variants gives the new user a first idea of how programs need to be transformed to run in parallel and what performance improvement can be obtained.

The advanced programmer may benefit from this tool by exploiting the features allowing the inspection of the reasons why certain parallel loops or program sections perform well or poorly in more detail and why a code section is not parallel. In this way, users may identify the bottleneck and possible improvements by combining the perspectives from both performance evaluation and compiler analysis results.

Ursa Major further serves the research community in general by making available the large amount of information kept in the Ursa Major repository and facilitating access to this information in various dimensions. Even within our research group the availability of the repository enabled many different studies, such as architectural comparisons, comparisons of different compilers, different programs, different subroutines and loops within a program, and scalability studies over numbers of processors and data set sizes. Increasing the support for inspecting our database from these various angles is an important ongoing effort.
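The speedup numbers that the loop table can compute from a reference column, and the sorting of loops by their newest execution time, can be sketched as follows. This is only an illustration under stated assumptions: the table is modeled as a dictionary of rows, and the function names, column names, loop names, and timings are hypothetical rather than taken from the tool.

```python
def add_speedup_column(table, reference, target):
    """Add a column holding reference time / target time for every loop."""
    column = f"speedup({target})"
    for row in table.values():
        ref_time, new_time = row.get(reference), row.get(target)
        # Leave the entry empty when a timing is missing or zero.
        row[column] = ref_time / new_time if ref_time and new_time else None
    return column


def sort_loops(table, column):
    """Return loop names ordered by `column`, largest value first."""
    return sorted(table, key=lambda name: table[name].get(column) or 0.0, reverse=True)


# Hypothetical timings in seconds, for illustration only.
loop_table = {
    "MAIN do10": {"serial": 8.0, "run2": 2.0},
    "SOLVE do20": {"serial": 6.0, "run2": 3.0},
    "INIT do5": {"serial": 1.0},
}
speedup = add_speedup_column(loop_table, reference="serial", target="run2")
print(sort_loops(loop_table, "run2"))
```

In this example the loop that dominated the serial run is no longer the most time-consuming one after the second run, which is the situation in which re-sorting by the newest column points the programmer to the next loop to examine.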
5 Case Studies

5.1 Experiments with the ARC2D Application

In a current study, we are comparing parallel directive languages for their suitability as a portable compiler output representation [Vos97]. In doing so, we have expressed the parallelism in several benchmark codes with various directive sets. If the performance results of these codes are significantly different, Ursa Minor is used in the search for explanations of these differences. An example of such a search, performed on the Perfect Benchmarks code ARC2D, is presented here as our first case study.

First, as a base-line measurement, a loop by loop profile of the serial version of the code executed on a 4 processor UltraSPARC workstation was done. The results of this instrumentation were then gathered by Ursa Minor and transformed into a form which is readable by commercial spreadsheet packages such as Excel and Xess3. One concern with instrumentation is that the associated overhead will noticeably impact the measured performance. Using the number of times each loop is executed, as well as the execution time measured by the instrumentation, it is easily determined when such perturbation occurs. In ARC2D, 11 of the 19 loops had an instrumentation overhead of more than .1% of the loop execution time. We chose .1% as the cutoff to ensure that the instrumented timing measurements still reflected the program performance with high accuracy. Removing the instrumentation from these 11 loops reduced the total execution time of the program by 6%. Ursa Minor currently provides the average execution times for computing the overhead. In future releases of the tool this computation will be fully automated.

Additionally, the most time-consuming loops were identified in the serial code. The Polaris-parallelized versions of these loops were used to compare the performance of several parallel directive languages. The major loops in ARC2D parallelized by Polaris are FILERX do19, STEPFX do21 and STEPFX do23. The identification of these loops was straightforward given that Ursa Minor presented the execution times of each loop as well as annotated it as parallel or serial. The relative importance of these loops in the serial version can be seen in Figure 6.

The parallelism found by Polaris was expressed in two forms: one using the native Sun SPARC dialect and the other using the portable KAP/Pro directive set [Kuc88], a close relative of the new OpenMP industry standard [OMP97]. Browsing through the performance results displayed by Ursa Minor it was seen that on 4 processors, the KAP/Pro directive language exhibited superior performance.
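The instrumentation overhead check used for the base-line profile can be sketched as below. The paper does not give the formula, so this is an assumption: the overhead of a loop is taken to be its invocation count times a fixed per-invocation timing cost, compared against the measured loop time. The names `overhead_fraction` and `perturbed_loops`, the per-invocation cost, and the loop data are hypothetical, and the default cutoff of 0.001 reads the quoted ".1%" as 0.1%.

```python
def overhead_fraction(invocations, cost_per_invocation, measured_time):
    """Estimated share of the measured loop time spent in the timing calls."""
    return invocations * cost_per_invocation / measured_time


def perturbed_loops(profile, cost_per_invocation, cutoff=0.001):
    """Names of loops whose estimated overhead exceeds `cutoff`.

    `profile` maps a loop name to (number of invocations, measured seconds).
    """
    flagged = []
    for name, (invocations, measured_time) in profile.items():
        if measured_time <= 0:
            continue  # no usable timing for this loop
        if overhead_fraction(invocations, cost_per_invocation, measured_time) > cutoff:
            flagged.append(name)
    return flagged


# Hypothetical profile: a frequently invoked inner loop and a rarely invoked outer loop.
profile = {"INNER do30": (1_000_000, 2.0), "OUTER do10": (10, 5.0)}
flagged = perturbed_loops(profile, cost_per_invocation=1e-6)
print(flagged)
```

Loops flagged this way would be the candidates for having their instrumentation removed before the timings are trusted.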

Figure 6: Percentage of execution time spent in major loops of ARC2D. [Pie chart; slices: STEPFX do23, STEPFX do21, FILERX do19, Others.]

Furthermore, by adding the loop-by-loop profile of ARC2D, as parallelized by the Sun native compiler, an interesting phenomenon was discovered: a significant "negative overhead" existed for many of the loops in the KAP/Pro version when comparing the 1 processor parallel execution to the execution of the untransformed code. Apparently, sequential optimizations were performed in the KAP/Pro version which were not performed in the serial version. Interestingly, this same optimization was often performed in the loops found to be parallel by the native compiler, but not in the Polaris version which used the Sun SPARC directives. The performance of the three major loops is shown in Figure 7.

Figure 7: Loop performance of ARC2D on an UltraSPARC: (a) Execution time of FILERX do19, (b) Speedup of FILERX do19, (c) Execution time of STEPFX do21, (d) Speedup of STEPFX do21, (e) Execution time of STEPFX do23 and (f) Speedup of STEPFX do23. [Plots of execution time (sec) and speedup versus number of processors; legend: Native Parallelizer, Polaris+Native Directives, Polaris+KAP/Pro Directives.]

Using the source code browsing capabilities, a side-by-side comparison of the loop nests uncovered the reason. Loop interchanging was being applied to many of the loop nests in the KAP/Pro directive version by the back-end compiler. The use of the Sun SPARC directives inhibited this transformation. Loop interchanging was not disabled when parallelizing the code with the native Sun parallelizing compiler; however, it was applied less frequently. For a more detailed discussion of this phenomenon and others uncovered during the analysis of ARC2D, please refer to [Vos97].

A further analysis of the serial source, the Polaris translated versions, and their graphical loop structure representation showed that the two most significant loops, STEPFX do21 and STEPFX do23, were imperfectly nested in the original source, but were transformed into a perfect nest by Polaris. The application of forward substitution and dead code elimination by Polaris created perfectly nested loops, which the back-end compiler was then able to interchange. Therefore, although the native Sun parallelizing compiler was able to identify the same amount of parallelism as Polaris, it did not apply further optimizations. Figure 8 shows the performance of the three parallel versions of ARC2D executed on 4 processors of the UltraSPARC. This figure also shows the performance that would be obtained in the Sun SPARC directive version if the interchanging had been done.

Figure 8: Performance of ARC2D on 4 Processors of UltraSPARC. [Chart; legend: Native Sun Parallelizer, Polaris+Sun Directives, +Perfect Nest Interchange, +Imperfect Nest Interchange, Polaris+KAP/Pro Directives.]

Ursa Minor allowed the characteristics responsible for the performance differences in ARC2D to be quickly identified. The often tedious task of tabularizing profiling results was performed automatically, and the identification of the parallel loops in this table was made obvious. The nesting structure of the loops was a major factor in the performance of this code, and Ursa Minor's graphical display of the loop structure was a significant aid in quickly identifying this phenomenon. A detailed study of the several versions of the source code for each loop nest was often necessary, and a side-by-side comparison was easily performed with the browsing facilities. The graphs presented in Figures 6 through 8 can be generated by exporting the Ursa Minor/Major database to the Xess3 spreadsheet and using its graphing functions.

The full results of this study, performed on 8 benchmark programs across multiprocessor architectures, can be interactively explored through the Ursa Major web page. Performance measurements obtained on a 4 processor SPARCstation 20, a 6 processor UltraSPARC Enterprise, a 16 processor Silicon Graphics Power Challenge and a 32 processor Origin 2000 have been made available as UMDs at that site. For a detailed description of these results refer to [Vos97].

5.2 Experiment with the Seismic Application Suite

As the second case study, we introduce another project that characterizes and analyzes large-scope industrial applications [AE97]. One of the programs we considered was the Seismic Benchmark Suite [MH93], a seismic activity simulation program consisting of 20,000 lines of Fortran code. The Seismic Benchmark Suite contains a deep hierarchy of nested subroutines and loops. Our goal is to understand how the computational complexity of the overall application suite scales with the number of processors and with the input data space. Here, we will briefly describe how the Ursa Minor tool can be of help in the process of characterizing a large application.

To characterize an application's execution time we sum the times contributed by each loop. A loop's execution time, exclusive of any inner loops, is estimated by obtaining an expression for the number of iterations the loop will execute and combining this expression with the average time per iteration of the loop from actual measurements. In order to do this we use the loop table view in Ursa Minor, which provides average loop execution times as well as a loop's parent in the calling structure. With codes as large as the Seismic suite, the simple task of locating the beginning and ending of loops becomes cumbersome and prone to human errors. Ursa Minor greatly simplifies this task and provides a visual description of the loop nest hierarchy with its call graph view.

Figure 9: Actual measurements of loop execution times were compared with predicted times to determine the accuracy of the model on a loop-by-loop basis. The separate colors represent the loops of this seismic phase. The actual measurements were gathered using a 32-processor node of an SGI/Cray Origin 2000 of NCSA at the University of Illinois.
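The loop-by-loop timing model described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Loop` record, its field names, the parameter names `n`, `m` and `p`, and the iteration-count expressions are all hypothetical, and the numbers are invented for demonstration. Each loop's exclusive time is estimated as its iteration count (an expression in the problem parameters) multiplied by the measured average time per iteration, and the application time is the sum over all loops.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Loop:
    """One loop in the characterization (hypothetical record layout)."""
    name: str
    # Expression for the number of iterations, as a function of the
    # problem parameters (here: data sizes n, m and processor count p).
    iterations: Callable[[Dict[str, float]], float]
    # Measured average time per iteration in seconds, exclusive of inner loops.
    avg_time_per_iter: float

    def predicted_time(self, params: Dict[str, float]) -> float:
        """Exclusive loop time = iteration count * average time per iteration."""
        return self.iterations(params) * self.avg_time_per_iter


def predicted_total(loops: List[Loop], params: Dict[str, float]) -> float:
    """Application time modeled as the sum of the per-loop exclusive times."""
    return sum(loop.predicted_time(params) for loop in loops)


# Illustrative values only; they are not taken from the Seismic suite.
loops = [
    Loop("outer", lambda q: q["n"] / q["p"], 2.0e-6),
    Loop("inner", lambda q: q["n"] * q["m"] / q["p"], 5.0e-7),
]

params = {"n": 1000, "m": 200, "p": 4}
total = predicted_total(loops, params)
print(f"predicted total: {total:.6f} s")
```

Evaluating `predicted_time` for each loop at several processor counts and setting the results beside measured times gives the kind of loop-by-loop comparison that the Figure 9 caption describes.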
After we characterized the code's performance, we used Ursa Minor to determine the accuracy of our characterization and locate the points needing refinement. Figure 9 compares actual measurements with our predicted times for one seismic processing phase (called depth migration) as the number of processors increases from 1 to 32. Ursa Minor aided in gathering the data from both the measurements and our model so that each loop's performance could be analyzed individually. Loops that scaled differently from the measured timings were easy to find. Our model could then be modified for more accurate predictions. We used this process to test our model's scalability when the data size increased as well as when the number of processors increased.

The final goal of our characterizing process was extrapolating the Seismic Suite's performance to machines larger (more processors) than we currently have available and to input data sizes appropriate for such large machines. Databases of predicted execution times were exported from Ursa Minor for importing into an Xess3 spreadsheet, in which we produced graphs visually depicting the scalability of the application. Figure 10 shows extrapolation results for one seismic processing phase, again depth migration, as the number of processors is increased from 1 to 2,048 processors. The data set is one which would use 3 terabytes of disk space.

Figure 10: Forecasted performance of the Seismic Suite as the machine size is scaled up. The curves divide the total execution time into computation, communication, and disk IO times. The total time is dominated by the computation time (because of this the curves "Total" and "Comp" overlap). [Plot for phase depth migration: time (seconds, log scale) versus number of processors; curves: Total, Comp, Comm, Disk IO, Disk Reads, Disk Writes.]

Another objective of the Seismic Benchmark case study was to produce a well-performing loop-parallel program. As originally written, the Seismic Benchmark is a message-passing code. We investigated how well a loop-parallel version of the program would perform using Polaris as a starting point. Ursa Minor calculated the speedup of our loop-parallel program for each loop, flagging the loops with speedups below 1. These loops were then investigated further to improve their automatic parallelization by Polaris. If no improvements could be made, we forced a loop to execute serially so that it would not incur any parallel execution overhead.

The data from the Seismic Benchmark case study is currently available to outside users through the use of Ursa Major. Measurements were gathered using the SGI/Cray Origin 2000 at NCSA.

6 Conclusion

We have presented an on-going project that provides tools and methodologies for parallel program development and performance evaluation. Ursa Minor and Ursa Major support user models of "parallel programming by examples" for beginners and interactive compilation and performance tuning for experts. They also serve as a program and benchmark database for computing systems research. The tools integrate information available from performance analysis tools, compilers, simulators, and source programs to a degree not provided by previous tools. Ursa Major can be executed on the World-Wide Web, from where a growing repository of information can be viewed.

The Ursa tool family is evolving in a need-driven way. Its developers are also involved in projects such as the characterization and analysis of real applications and the development of parallelizing compilers. Tool capabilities needed in these efforts are being integrated in both Ursa Minor and Ursa Major. Keeping close together the tool design projects and application characterization efforts will ensure the practicality of our tool in the future.

Several enhancements are planned next. New categories of information will be integrated into the tools and their user views. For example, we will include improved compiler explanations of why certain optimizations were or were not performed. This enables the programmer to input missing data to the compiler or to perform certain transformations by hand. Another important goal is the support for user methodologies. As a long-term goal we envision facilities that allow one to query the information repository directly for suggested improvements of programs, compilers, or architectures. Better support for the tool's web response is another ongoing effort. As we have only begun to explore the potential offered by the new Internet technology, continuous feedback from its user community will help improve the tool's service to a world-wide audience.

References

[AE97] Brian Armstrong and Rudolf Eigenmann. Performance forecasting: Characterization of applications on current and future architectures. Technical report ECE-HPCLab-9722, Purdue University, School of Electrical and Computer Engineering, High-Performance Computing Laboratory, February 97.

[AO88] J. Ambras and V. O'Day. MicroScope: A Knowledge-Based Programming Environment. IEEE Software, pages 50-58, May 1988.

[ASM89] Bill Appelbe, Kevin Smith, and Charles McDowell. Start/Pat: A Parallel-Programming Toolkit. IEEE Software, 6(4):29-38, July 1989.

[BDE+96] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and P. Tu. Parallel programming with Polaris. IEEE Computer, pages 78-82, December 1996.

[BKK+89] V. Balasundaram, K. Kennedy, U. Kremer, K. McKinley, and J. Subhlok. The ParaScope editor: An interactive parallel programming tool. In International Conference on Supercomputing, pages 540-550, 1989.

[BST86] G. Bruno, P. Spiller, and I. Tota. AISPE: An Advanced, Industrial Software-Production Environment. Proceedings of Computer Software and Applications Conf., pages 9-99, 1986.

[Eig93] Rudolf Eigenmann. Toward a Methodology of Optimizing Programs for High-Performance Computers. Conference Proceedings, ICS'93, Tokyo, Japan, pages 27-36, July 20-22, 1993.

[EM93] Rudolf Eigenmann and Patrick McClaughry. Practical Tools for Optimizing Parallel Programs. Presented at the 1993 SCS Multiconference, Arlington, VA, March 27 - April 1, 1993.

[HAA+96] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, pages 84-89, December 1996.

[Int97] Intel. VTune: Visual Tuning Environment, 1997. http://developer.intel.com/design/perftool/vtune/index.htm.

[KE97] Seon-Wook Kim and Rudolf Eigenmann. Max/P: detecting the maximum parallelism in a Fortran program. Purdue University, School of Electrical and Computer Engineering, High-Performance Computing Laboratory, 1997. Manual ECE-HPCLab-9721.

[KT87] J. H. Kuo and H. C. Tu. Prototyping a Software Information Base for Software-Engineering Environments. Proceedings of Computer Software and Applications Conf., pages 38-, 1987.

[Kuc88] Kuck & Associates, Inc., Champaign, Illinois. KAP User's Guide, 1988.

[MCC+95] Barton P. Miller, Mark D. Callaghan, Jonathan M. Cargille, Jeffrey K. Hollingsworth, R. Bruce Irvin, Karen L. Karavanic, Krishna Kunchithapadam, and Tia Newhall. The Paradyn parallel performance measurement tools. IEEE Computer, 28(11), November 1995.

[MH93] C. C. Mosher and S. Hassanzadeh. ARCO seismic processing performance evaluation suite, user's guide. Technical report, ARCO, Plano, TX., 1993.

[OMP97] OpenMP: A Proposed Industry Standard API for Shared Memory Programming. Technical report, OpenMP, October 1997.

[PE98] Insung Park and Rudolf Eigenmann. Ursa Major: Exploring Web technology for design and evaluation of high-performance systems. In Proc. of the International Conference on High Performance Computing and Networking, April 1998.

[Pet93] Paul Marx Petersen. Evaluation of Programs and Parallelizing Compilers Using Dynamic Analysis Techniques. PhD thesis, Univ. of Illinois at Urbana-Champaign, Center for Supercomputing Res. & Dev., January 1993.

[PVAE97] Insung Park, Michael J. Voss, Brian Armstrong, and Rudolf Eigenmann. Interactive compilation and performance analysis with Ursa Minor. In Workshop of Languages and Compilers for Parallel Computing, August 97.

[Ree94] Daniel A. Reed. Experimental performance analysis of parallel systems: Techniques and open problems. In Proc. of the 7th Int'l Conf. on Modelling Techniques and Tools for Computer Performance Evaluation, pages 25-51, 1994.

[VGGJ+89] Vincent Guarna Jr., Dennis Gannon, David Jablonowski, Allen Malony, and Yogesh Gaur. Faust: An Integrated Environment for the Development of Parallel Programs. IEEE Software, pages 20-27, July 1989.
[Vos97] Michael J. Voss. Portable loop-level parallelism for shared memory multiprocessor architectures. Master's thesis, School of Electrical and Computer Engineering, Purdue University, October 97.