

HIGHLY PARALLEL SPARSE CHOLESKY FACTORIZATION

JOHN R. GILBERT* AND ROBERT SCHREIBER†

* Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, California 94304. Copyright (c) 1990, 1991 Xerox Corporation. All rights reserved.

† Research Institute for Advanced Computer Science, MS T045-1, NASA Ames Research Center, Moffett Field, CA 94035. This author's work was supported by the Numerical Aerodynamic Simulation Systems Division of NASA and by DARPA via Cooperative Agreement NCC 2-387 between NASA and the Universities Space Research Association.

Abstract. We develop and compare several fine-grained parallel algorithms to compute the Cholesky factorization of a sparse matrix. Our experimental implementations are on the Connection Machine, a distributed-memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special-purpose algorithms in which the matrix structure conforms to the connection structure of the machine, our focus is on matrices with arbitrary sparsity structure.

The most promising alternative is a supernodal, multifrontal algorithm whose inner loop performs several dense factorizations simultaneously on a two-dimensional grid of processors. A fine-grained parallel, dense factorization algorithm is used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, we conclude that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems.

We also present a performance model and use it to analyze our algorithms. We find that asymptotic analysis combined with experimental measurement of parameters is accurate enough to be useful in choosing among alternative algorithms for a complicated problem.

Key words. sparse matrix algorithms, Cholesky factorization, supernodal factorization, multifrontal factorization, systems of linear equations, parallel computing, data parallel algorithms, chordal graphs, clique trees, Connection Machine, performance analysis.

AMS subject classifications. 05C50, 15A23, 65F05, 65F50, 68M20.

1. Introduction.

1.1. Data parallelism. Highly parallel, local memory computer architectures promise to achieve high performance inexpensively by assembling a large amount of simple hardware in a way that scales without bottlenecks. A data parallel programming model simplifies the programming of local memory parallel architectures by associating a processor with every data element in a computation (at least conceptually), thus hiding from the programmer the details of distribution of data and work to the processors.

Some major challenges come along with these promises. Communication is expensive relative to computation, so an algorithm must minimize communication. Locality is important in communication, so it pays to substitute communication with nearby processors for more general patterns where possible. The sequential programmer tunes the inner loop of an algorithm for high performance, but simple data parallel algorithms tend to have "everything in the inner loop" because a sequential loop over the data is typically replaced by a parallel operation. (From this point of view, the advantage of the more complicated of our two algorithms, Grid Cholesky, is that it removes the general-pattern communication from the inner loop.)

Algorithms for data parallel architectures must make different trade-offs than sequential algorithms: they must exploit regularity in the data. For efficiency on SIMD machines, they must also be highly regular in the time dimension.

In some cases entirely new approaches may be appropriate; examples of experiments with such approaches include particle-in-box flow simulation, knowledge base maintenance [5], and the entire field of neural computation [20]. On the other hand, the same kind of regularity in a problem or an algorithm can often be exploited in a wide range of architectures; therefore, many ideas from sequential computation turn out to be surprisingly applicable in the highly parallel domain. For example, block-oriented matrix operations are useful in sequential machines with hierarchical storage and conventional vector supercomputers [3]; we shall see that they are also crucial to efficient data parallel matrix algorithms.

1.2. Goals of this study. Data parallel algorithms are natural for computations on matrices that are dense or have regular nonzero structures arising from, for example, regular finite difference discretizations. The main goal of this research is to determine whether data parallelism is useful in dealing with irregular, arbitrarily structured problems. Specifically, we consider computing the Cholesky factorization of an arbitrary sparse, symmetric, positive definite matrix. We will make no assumptions about the nonzero structure of the matrix besides symmetry. We will present evidence that arbitrary sparse problems can be solved nearly as efficiently as dense problems by carefully exploiting regularities in the nonzero structure of the triangular factor that come from the clique structure of its chordal graph.

A second goal is to perform a case study in analysis of parallel algorithms. The analysis of sequential algorithms and data structures is a mature and useful science that has contributed to sparse matrix computation for many years. By contrast, the study of complexity of parallel algorithms is in its infancy, and it remains to be seen how useful parallel complexity theory will be in designing efficient algorithms for real parallel machines. We will argue by example that, at least within a particular class of parallel architectures, asymptotic analysis combined with experimental measurement of parameters is accurate enough to be useful in choosing among alternative algorithms for a single fairly complicated problem.
1.3. Outline. The structure of the remainder of the paper is as follows. In Section 2 we review the definitions we need from numerical linear algebra and graph theory, sketch the architecture of the Connection Machine, and present a timing model for a generalized data parallel computer that abstracts that architecture.

In Section 3 we present the first of two parallel algorithms for sparse Cholesky factorization. The algorithm, which we call Router Cholesky, is based on a theoretically efficient algorithm in the PRAM model of parallel computation. We analyze the algorithm and point out two reasons that it fails to be practical, one having to do with communication and one with processor utilization.

In Section 4 we present a second algorithm, which we call Grid Cholesky. Grid Cholesky is a data parallel implementation of a supernodal, multifrontal method that draws on the ideas of Duff and Reid [7] and Ashcraft et al. [1]. It improves on Router Cholesky by using a two-dimensional grid of processors to operate on dense submatrices, thus replacing most of the slow generally-routed communication of Router Cholesky with faster grid communication. It also solves the processor utilization problem by assigning different data elements to the working processors at different stages of the computation. We present an analysis and experimental results for a pilot implementation of Grid Cholesky on the Connection Machine.

The pilot implementation of Grid Cholesky is approximately as efficient as a dense Cholesky factorization algorithm, but is still slow compared to the theoretical peak performance of the machine. Several steps necessary to improve the absolute efficiency of the algorithm, most of which concern efficient Cholesky factorization of dense matrices, are described. Finally we draw some conclusions and discuss avenues for further research.

2. Definitions. The first two subsections below summarize the linear algebra and graph theory needed to study sparse Cholesky factorization. Most of this material is covered in more detail by George and Liu [13, 23, 24]. The remainder of the section outlines the data parallel programming model and its implementation on the Connection Machine, and describes our parametric model of a data parallel computer.

2.1. Linear algebra. Let $A = (a_{ij})$ be an $n \times n$ real, symmetric, positive definite sparse matrix. There is a unique $n \times n$ lower triangular matrix $L = (l_{ij})$ with positive diagonal such that

$$A = LL^T.$$

This is the Cholesky factorization of $A$. We seek to compute $L$; with it we may solve the linear system $Ax = b$ by solving $Ly = b$ and $L^T x = y$. We will discuss algorithms for computing $L$ below. In general, $L$ is less sparse than $A$. The nonzero positions of $L$ that were zero in $A$ are called fill or fill-in. For any matrix $X$, we write $\eta(X)$ to denote the number of nonzero elements of $X$.

The rows and columns of $A$ may be symmetrically reordered so that the system solved is

$$PAP^T(Px) = Pb,$$

where $P$ is a permutation matrix. We assume that such a reordering, chosen to reduce $\eta(L)$ and the number of operations required to compute $L$, has been done. We further assume that the structure of $L$ has been determined by a symbolic factoring process. We ignore these preliminary computations in this study because the cost of actually computing $L$ typically dominates. (In many cases, several identically structured matrices may be factored using the same ordering and symbolic factorization.) A study of the implementation of appropriate reordering and symbolic factorization procedures on data parallel architectures is in preparation [18].

If the matrix $A$ is such that its Cholesky factor $L$ has no more nonzeros than the lower triangle of $A$, i.e. there is no fill, then $A$ is a perfect elimination matrix. If $PAP^T$ is a perfect elimination matrix for some permutation matrix $P$, we call the ordering corresponding to $P$ a perfect elimination ordering of $A$.

Let $R$ and $S$ be subsets of $\{1, \ldots, n\}$. Then $A(R, S)$ is the $|R| \times |S|$ matrix whose elements are $a_{r,s}$ for $r \in R$ and $s \in S$. (For any set $S$, we write $|S|$ to denote its cardinality.)

2.2. Graph theory.

Vertex elimination. We associate two ordered, undirected graphs with the sparse, symmetric matrix $A$. First, $G(A)$, the graph of $A$, is the graph with vertices $\{1, 2, \ldots, n\}$ and edges

$$E(A) = \{(i, j) \mid a_{ij} \neq 0\}.$$

(Note that $E(A)$ is a set of unordered pairs.) Next, we define the filled graph, $G^+(A)$, with vertices $\{1, 2, \ldots, n\}$ and edges

$$E^+(A) = \{(i, j) \mid l_{ij} \neq 0\},$$

so that $G^+(A)$ is $G(L + L^T)$. The edges in $G^+(A)$ that are not edges of $G(A)$ are called fill edges. The output of a symbolic factorization of $A$ is a representation of $G^+(A)$.

For every fill edge $(i, j)$ in $E^+(A)$ there is a path in $G(A)$ from vertex $i$ to vertex $j$ whose vertices all have numbers lower than both $i$ and $j$; moreover, for every such path in $G(A)$ there is an edge in $G^+(A)$ [28]. Consider renumbering the vertices of $G^+(A)$. With another numbering, this last property may or may not hold. If it does, then the new ordering is a perfect elimination ordering of $G^+(A)$; Liu [24] calls it an equivalent reordering of $G(A)$.

Chordal graphs. Every cycle of more than three vertices in $G^+(A)$ has an edge between two nonconsecutive vertices (a chord). Such a graph is said to be chordal. Not only is $G^+(A)$ chordal for every $A$, but every chordal graph is the filled graph of some matrix [27].

Let $G = G(V, E)$ be any undirected graph. A clique is a subset $X$ of $V$ such that for all $u, v \in X$, $(u, v) \in E$. A clique is maximal if it is not a proper subset of any other clique. For any $v \in V$, the neighborhood of $v$, written $adj(v)$, is the set $\{u \in V \mid (u, v) \in E\}$. The monotone neighborhood of $v$, written $madj(v)$, is the smaller set $\{u \in V \mid u > v, (u, v) \in E\}$. We extend $adj$ and $madj$ to sets of vertices in the usual way.

A vertex $v$ is simplicial if $adj(v)$ is a clique. Two vertices, $u$ and $v$, are indistinguishable if $\{u\} \cup adj(u) = \{v\} \cup adj(v)$. Two vertices are independent if there is no edge between them. A set of vertices is independent if every pair of vertices in it is independent; two sets $A$ and $B$ are independent if no vertex of $A$ is adjacent to a vertex of $B$.

It is immediate that any two simplicial vertices are either independent or indistinguishable. A set of indistinguishable simplicial vertices thus forms a clique, though not in general a maximal clique. The equivalence relation of indistinguishability partitions the simplicial vertices into pairwise independent cliques. We call these the simplicial cliques of the graph.

Elimination trees. A fundamental tool in studying sparse Gaussian elimination is the elimination tree. This structure was defined by Schreiber [30]; Liu [24] gives a survey of its many uses. Let $A$ have the Cholesky factor $L$. The elimination tree $T(A)$ is a rooted spanning forest of $G^+(A)$ defined as follows. If vertex $u$ has a higher-numbered neighbor $v$, then the parent $p(u)$ of $u$ in $T(A)$ is the smallest such neighbor; otherwise $u$ is a root. In other words, the first off-diagonal nonzero element in the $u$th column of $L$ is in row $p(u)$. It is easy to show that $T(A)$ is a forest consisting of one tree for each connected component of $G(A)$. For simplicity we shall assume in what follows that $A$ is irreducible, so that vertex $n$ is the only root, though our algorithms do not assume this.

There is a monotone increasing path in $T(A)$ from every vertex to the root. If $(u, v)$ is an edge of $G^+(A)$ with $u < v$ (that is, if $l_{vu} \neq 0$) then $v$ is on this unique path from $u$ to the root. This means that when $T(A)$ is considered as a spanning tree of $G^+(A)$, there are no "cross edges" joining vertices in different subtrees. It implies that, if we think of the vertices of $T(A)$ as columns of $A$ or $L$, any given column of $L$ depends only on columns that are its descendants in the tree.

Clique trees. The elimination tree describes a Cholesky factorization in terms of operations on single columns. A description in terms of operations on full blocks can yield algorithms with better locality of reference, which is an advantage either on a machine with a memory hierarchy (registers, cache, main memory, disk) or on a distributed-memory parallel machine. The Connection Machine falls into both of these categories.
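The definitions of fill and of the elimination tree can be made concrete on a small example. The following is a minimal sketch, not the paper's code: it assumes NumPy, uses dense storage and 0-based vertex numbers, and the helper names `elimination_tree`, `fill_count`, and `arrow` are hypothetical. It factors a 4 by 4 "arrow" matrix under two orderings and reports the fill and the parent of each vertex (with -1 marking a root).

```python
import numpy as np

def elimination_tree(L, tol=1e-12):
    """parent[u] = row of the first off-diagonal nonzero in column u of L, or -1 for a root."""
    n = L.shape[0]
    parent = [-1] * n
    for u in range(n):
        below = np.nonzero(np.abs(L[u + 1:, u]) > tol)[0]
        if below.size:
            parent[u] = u + 1 + int(below[0])
    return parent

def fill_count(A, L, tol=1e-12):
    """Number of nonzero positions of L that are zero in the lower triangle of A."""
    lower_a = np.abs(np.tril(A)) > tol
    lower_l = np.abs(L) > tol
    return int(np.sum(lower_l & ~lower_a))

def arrow(first):
    """4x4 SPD arrow matrix; the dense row and column is vertex 0 if first, else vertex 3."""
    A = 4.0 * np.eye(4)
    hub = 0 if first else 3
    A[hub, :] = 1.0
    A[:, hub] = 1.0
    A[hub, hub] = 4.0
    return A

for first in (True, False):
    A = arrow(first)
    L = np.linalg.cholesky(A)  # lower triangular factor with A = L L^T
    print("hub first" if first else "hub last", fill_count(A, L), elimination_tree(L))
```

Eliminating the dense vertex first fills in the whole remaining triangle and yields a chain-shaped tree, while eliminating it last is a perfect elimination ordering whose tree has three independent leaves.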

Describing symmetric factorization in terms of operations on full submatrices is the key idea in both Duff and Reid's "multifrontal" algorithm [7] and the "supernodal" algorithm of Ashcraft, Grimes, Lewis, Peyton, and Simon [1], which can be traced back to the so-called element model of factorization [15, 33]. A full submatrix of $L$ is a clique in the chordal graph $G^+(A)$. The clique structure of chordal graphs has been explored extensively in the combinatorial literature; representations of chordal graphs as trees of cliques date back at least to 1972 [10] and continue to be used [16, 25, 26].

A clique tree for matrix $A$ is a tree whose nodes are sets that partition the vertices of $G^+(A)$ into cliques, in such a way that all the vertices of a node $N$ are indistinguishable simplicial vertices in the graph that results by deleting from $G^+(A)$ all vertices in nodes that are proper descendants of $N$. An equivalent definition is to think of starting with an elimination tree and collapsing vertices that are indistinguishable from their parents. (This definition differs slightly from that of Peyton [26], whose tree nodes are overlapping maximal cliques of $G^+(A)$.)

A clique which is a node in a clique tree is also called a supernode. If all vertices of $G^+(A)$ in proper descendants of some supernode are deleted, the supernode becomes a simplicial clique in the resulting graph. The clique tree is sometimes called a supernode tree or supernodal elimination tree [2]. A matrix may have many different clique trees; indeed, the elimination tree itself is one. Our numerical factorization algorithm Grid Cholesky can actually use any clique tree; the symbolic factorization we describe in Section 4.1 uses a block Jess and Kees algorithm to compute a shortest possible clique tree.

2.3. The Connection Machine.

Architecture. The Connection Machine (model CM-2) is a local memory, SIMD parallel computer. The description we present here corresponds to the machine architecture presented by the assembly language instruction set Paris [34]. We programmed the CM in *Lisp, which is compiled into Paris.

A full-sized CM has $2^{16} = 65{,}536$ processors, each of which can directly access 65,536 bits of memory. (Since this work was done, larger memories have become available.) The processors are connected by a communication network called the router, which is configured by a combination of microcode and hardware to be a 16-dimensional hypercube.

The essential feature of the CM programming model is the parallel variable or pvar. A pvar is an array of data in which every processor stores and manipulates one element. The size of a pvar may be a multiple of the number of physical machine processors, $p$. If there are $V$ times as many elements in the pvar $x$ as there are processors, then (through microcode) each physical processor simulates $V$ virtual processors; thus the programmer's view remains "one processor per datum." The ratio $V$ is called the virtual processor (VP) ratio. The CM requires that $V$ must be a power of two. Thus we will find useful the notation $\lceil x \rceil$, meaning the smallest power of two not smaller than the real number $x$.

The geometry of each set of virtual processors (and its associated pvars) is also determined by the programmer, who may choose to view it as any multidimensional array with dimensions that are powers of two. The VP sets and their pvars are embedded in the machine (using Gray codes) so that neighboring virtual processors are simulated by the same or neighboring physical processors.
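The power-of-two rounding behind the VP ratio is easy to get wrong by one step, so a small sketch may help. It assumes a pvar with a given number of elements spread over a given number of physical processors; the function name `vp_ratio` and the sizes in the usage lines are hypothetical, chosen only for illustration.

```python
def vp_ratio(num_elements, num_processors):
    """Smallest power of two V such that V * num_processors >= num_elements."""
    needed = -(-num_elements // num_processors)  # ceiling division
    v = 1
    while v < needed:
        v *= 2
    return v

elements, processors = 100_000, 16_384          # illustrative sizes only
v = vp_ratio(elements, processors)
utilization = elements / (v * processors)        # fraction of virtual processors holding data
print(v, round(utilization, 3))
```

Because of the rounding, a pvar only slightly larger than a multiple of the machine size can leave nearly half of the virtual processors without data.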

The Paris instruction set corresponds fairly well to the abstract model of data parallel programming that the CM attempts to present to the programmer, but it does not correspond closely to the actual hardware of the CM. Largely for this reason, it is hard to get high performance when programming in Paris or in a language that is compiled into Paris [31]. We shall go into this point in detail later. There are other ways to view and to program the hardware of the CM-2 that can provide better performance. These are just now becoming available to users, but were not when this work was done.

Connection Machine programming. Parallel computation on the CM is expressed through elementwise binary operations on pairs of pvars that reside in the same VP set, that is, have the same VP ratio and layout on the machine. (Optionally, one may specify a boolean mask that selects only certain virtual processors to be active.) These operations take time proportional to $V$, since the actual processors must loop over their simulated virtual processors. This remains true even when the set of selected processors is very sparse.

Interprocessor communication is expressed and accomplished in three ways, which we discuss in order of increasing generality but decreasing speed.

Communication with virtual processors at nearest neighbor grid cells is most efficient. A pvar may be shifted along any of its axes using this type of communication. The shift may be circular or end-off at the programmer's discretion.

A second communication primitive, scan, allows broadcast of data. For example, if $x$ is a one-dimensional pvar with the value [1, 2, 3, 4, 5, 6, 7, 8] then a scan of $x$ yields [1, 1, 1, 1, 1, 1, 1, 1]. Scans are implemented using the hypercube connections. The time for a scan of length $n$ is linear in $\log n$. Scans can also be used to broadcast along either rows or columns of a two-dimensional array. Scans that perform parallel prefix arithmetic operations are also available, but we do not use them.

Scans of subarrays are possible. In a segmented scan, the programmer specifies a boolean pvar, the segment pvar, congruent to $x$. The segments of $x$ between adjacent T values in the segment pvar are scanned independently. Thus, for example, if we use the segment pvar [T F F F T F F T] and $x$ is as above, then a segmented scan returns [1, 1, 1, 1, 5, 5, 5, 8].
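The broadcast behavior of scan and segmented scan in the two examples above can be emulated sequentially. This is only a sketch of the semantics, written in Python with hypothetical function names; it assumes that a T value marks the first element of a segment, which is consistent with the example, and it says nothing about how the CM implements scans in logarithmic time.

```python
def copy_scan(x):
    """Broadcast the first element of x to every position."""
    return [x[0]] * len(x)

def segmented_copy_scan(x, segment):
    """Broadcast the first element of each segment; True in `segment` starts a new segment."""
    out = []
    current = None
    for value, starts_segment in zip(x, segment):
        if starts_segment or current is None:
            current = value
        out.append(current)
    return out

x = [1, 2, 3, 4, 5, 6, 7, 8]
T, F = True, False
print(copy_scan(x))
print(segmented_copy_scan(x, [T, F, F, F, T, F, F, T]))
```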
The third and most general form of communication allows a virtual processor to access data in the memory of any other virtual processor. These operations go by several different names even within the CM environment; we shall refer to them in terms already familiar in sparse matrix computation: gather and scatter.

A gather allows processors to read data in the memory of other processors. The CM term for a gather is pref!! (for the *Lisp programmer) or get (for the Paris programmer). In a gather, three pvars are involved: the source, the destination, and the address. The address of the processor whose memory is to be read is taken from the integer address pvar. Suppose the source is the one-dimensional pvar [15, 14, 13, ..., 2, 1, 0] and the address pvar is [0, 1, 2, 0, 1, 2, ..., 0, 1, 2, 0]. Then the data stored in the destination is [15, 14, 13, 15, 14, 13, ..., 15, 14, 13, 15]. The Fortran-90 or Matlab statement that accomplishes this is dest = source(address); it performs the assignment dest(i) ← source(address(i)) for 1 ≤ i ≤ length(dest).

A scatter allows processors to write data to the memory of other processors. The CM term for a scatter is *pset (for the *Lisp programmer) or send (for the Paris programmer). Again the three pvars are a source, a destination, and an integer address. The Fortran-90 or Matlab version is dest(address) = source, and the effect is dest(address(i)) ← source(i) for 1 ≤ i ≤ length(source).
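For readers more used to array languages, gather and scatter correspond to indexed reads and indexed writes. The sketch below reproduces the gather example from the text with NumPy (an assumption of this illustration, with 0-based addresses as in the example) and shows a collision-free scatter through a permutation; the reversal permutation is a hypothetical choice, not taken from the paper.

```python
import numpy as np

source = np.arange(15, -1, -1)        # [15, 14, ..., 1, 0]
address = np.arange(16) % 3           # [0, 1, 2, 0, 1, 2, ..., 0]

# Gather: dest[i] = source[address[i]]
gathered = source[address]
print(gathered.tolist())

# Scatter without collisions: dest[perm[i]] = source[i], where perm is a permutation
perm = np.arange(15, -1, -1)          # hypothetical address pvar: reversal
scattered = np.zeros(16, dtype=source.dtype)
scattered[perm] = source
print(scattered.tolist())
```

With duplicate addresses, a plain indexed write keeps only one of the colliding values, which is why the CM offers combining operators for scatter.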

In a scatter, when the address pvar has duplicate values, data from several source processors are sent to the same destination processor. (When this happens we say that a collision has occurred.) The programmer can select one of several different ways to combine colliding values by specifying a combining operator. For example, if the source and address are as above and the destination initially has the value [1, 1, ..., 1] then after a scatter-with-add, the destination has the value [45, 40, 35, 1, 1, 1, ..., 1]. The sum of elements source(j) such that address(j) = k is stored in dest(k) if there are any such elements; otherwise dest(k) is unchanged. Other combining operators include product, maximum, minimum, and "error". We have found combining scatter to be a powerful aid to data parallel programming.

Measured CM performance. We will develop a model of performance on data parallel architectures and use it to analyze performance of several algorithms for sparse Cholesky factorization. The essential machine characteristics in the model are described by five parameters:

$\mu$     The memory reference time for a 32-bit word
$\phi$    The 32-bit floating-point multiply time, in units of $\mu$
$\sigma$  The 32-bit scan time, in units of $\mu$
$\nu$     The 32-bit grid communication time, in units of $\mu$
$\rho$    The 32-bit router time, in units of $\mu$

(Floating-point addition takes the same time as multiplication, $\phi$.) In our model, execution time scales linearly with VP ratio, which is essentially correct for the Connection Machine. Therefore $\mu$ is proportional to VP ratio, and the other parameters are independent of VP ratio. In Table 1, we give measured values for these parameters obtained by experiment on the CM-2.

Table 1. Parameters of CM model (Connection Machine parametric model).

Parameter   Description               Measured CM-2 value
$V$         Virtual processor ratio
$\mu$       Memory reference time     4.8 $V$ μsec
$\phi$      Multiply or add time      7 $\mu$
$\sigma$    Scan time                 (6.2 + 1.2 $\log_2$ scan-distance) $\mu$
$\nu$       News time                 3 $\mu$
$\rho$      Route time                scatter (no collisions): 64 $\mu$
                                      scatter-add (4 collisions): 110 $\mu$
                                      scatter-add (100 collisions): 200 $\mu$
                                      gather (many collisions): 430 $\mu$

We observe that router times range over a factor of four depending on the number of collisions; it is possible to design pathological routing patterns that perform even worse than this. For any given pattern, gather usually takes just twice as long as scatter, presumably because it is implemented by sending a request and then sending a reply. In our approximate analyses, therefore, we generally choose a value of $\rho$ for scatter corresponding to the number of collisions observed, and model gather as taking $2\rho$ floating-point times.
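The combining scatter described above can be emulated in a few lines. This sketch follows the semantics stated in the text (the sum of the colliding source values is stored in the destination, and destinations that receive nothing are left unchanged); it assumes NumPy, and the function name `scatter_with_add` is hypothetical rather than a CM or Paris name.

```python
import numpy as np

def scatter_with_add(dest, address, source):
    """Store in dest[k] the sum of source[j] over all j with address[j] == k.

    Destinations that no source element addresses keep their old value.
    """
    n = dest.shape[0]
    sums = np.bincount(address, weights=source, minlength=n)
    hit = np.bincount(address, minlength=n) > 0
    out = dest.astype(float).copy()
    out[hit] = sums[hit]
    return out

source = np.arange(15, -1, -1)        # [15, 14, ..., 1, 0]
address = np.arange(16) % 3           # [0, 1, 2, 0, 1, 2, ..., 0]
dest = np.ones(16)
print(scatter_with_add(dest, address, source).tolist())
```

Only the first three destinations are addressed, so they receive the sums 45, 40, and 35, matching the example in the text, and the remaining entries stay at 1.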

3. Router Cholesky. Our first parallel Cholesky factorization algorithm is a parallel implementation of a standard column-oriented sparse Cholesky; it is based closely on Gilbert and Hafsteinsson's theoretically efficient algorithm [17] for the PRAM model of computation. Its communication requirements are too unstructured for it to be very efficient on a fine-grained multiprocessor like the CM, but we implemented and analyzed it to use as a basis for comparison and to help tune our performance model of the CM.

3.1. The Router Cholesky algorithm. Router Cholesky uses the elimination tree $T(A)$ to organize its computation. For the present, assume that both the tree and the symbolic factorization $G^+(A)$ are available. (In our experiments we computed the symbolic factorization sequentially; Gilbert and Hafsteinsson [17] describe a parallel algorithm.) Each vertex of the tree corresponds to a column of the matrix.

Following George et al. [12], we express a sequential column-oriented Cholesky factorization in terms of two primitive operations, cdiv and cmod:

    procedure Sequential Cholesky(matrix A);
        for j ← 1 to n do
            for each edge (i, j) of G+(A) with i < j do
                cmod(j, i) od;
            cdiv(j) od
    end Sequential Cholesky;

Routine cdiv(j) divides the subdiagonal elements of column $j$ by the square root of the diagonal element in that column, and routine cmod(j, i) modifies column $j$ by subtracting a multiple of column $i$. This is called a left-looking algorithm because column $j$ accumulates all necessary updates cmod(j, i) from columns to its left just before the cdiv(j) that completes its computation. By contrast, a right-looking algorithm would perform all the updates cmod(j, i) using column $i$ immediately after the cdiv(i).

Now consider the elimination tree $T(A)$. A given column (vertex) is modified only by columns (vertices) that are its descendants in the tree [30]. Therefore a parallel left-looking algorithm can compute all the leaf vertex columns at once.

    procedure Router Cholesky(matrix A);
        for h ← 0 to height(n) do
            for each edge (i, j) with height(i) < height(j) = h pardo
                cmod(j, i) od;
            for each vertex j with height(j) = h pardo
                cdiv(j) od
        od
    end Router Cholesky;

Here height(j) is the length of the longest path in $T(A)$ from vertex $j$ to a leaf. Thus the leaves have height 0, vertices whose children are all leaves have height 1, and so forth. The outer loop of this algorithm works sequentially from the leaves of the elimination tree up to the root. At each step, an entire level's worth of cmod's and cdiv's are done.

A processor is assigned to every nonzero of the triangular factor (or, equivalently, to every edge and vertex of the filled graph $G^+$). Suppose processor $P_{ij}$ is assigned to the nonzero that is initially $a_{ij}$ and will eventually become $l_{ij}$. (If $l_{ij}$ is a fill, then $a_{ij}$ is initially zero; recall that we assume that the symbolic factorization is already done, so we know which $l_{ij}$ will be nonzero.) In the parallel cdiv(j), processor $P_{jj}$ computes $l_{jj}$ as the square root of its element, and sends $l_{jj}$ to processors $P_{ij}$ for $i > j$, which then divide their own nonzeros by $l_{jj}$. In the parallel cmod(j, i), processor $P_{ji}$ sends the multiplier $l_{ji}$ to the processors $P_{ki}$ with $k > j$. Each such $P_{ki}$ then computes the update $l_{ki} l_{ji}$ locally and sends it to $P_{kj}$ to be subtracted from $l_{kj}$.

We call this a left-initiated algorithm because the multiplications in cmod(j, i) are performed by the processors in column $i$ who then, on their own initiative, send these updates to a processor in column $j$. Each column $i$ is involved in at most one cmod at a time because every column modifying $j$ is a descendant of $j$ in $T(A)$, and the subtrees rooted at vertices of any given height are disjoint. Therefore each processor participates in at most one cmod or cdiv at each parallel step. If we ignore the time taken by communication (including the time to combine updates to a single $P_{kj}$ that may come from different $P_{ki_1}$, $P_{ki_2}$, ...), then each parallel step takes a constant amount of time and the parallel algorithm runs in time proportional to the height of the elimination tree $T(A)$.
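The two procedures above can be checked against each other on a small matrix. The following is a sequential emulation, not the CM implementation: it assumes NumPy, dense storage, and 0-based indices, and it obtains the nonzero pattern of $L$ by running the sequential factorization first, a shortcut that stands in for a real symbolic factorization. The function names are hypothetical. The level-by-level loop visits columns in order of their height in the elimination tree, so every column at one height could be processed in parallel.

```python
import numpy as np

def cdiv(L, j):
    """Take the square root of the diagonal of column j and scale the subdiagonal by it."""
    L[j, j] = np.sqrt(L[j, j])
    L[j + 1:, j] /= L[j, j]

def cmod(L, j, i):
    """Subtract from column j the multiple l_ji of (finished) column i."""
    L[j:, j] -= L[j, i] * L[j:, i]

def sequential_cholesky(A):
    L = np.tril(A).astype(float)
    n = L.shape[0]
    for j in range(n):
        for i in range(j):
            if L[j, i] != 0.0:            # (i, j) is an edge of the filled graph
                cmod(L, j, i)
        cdiv(L, j)
    return L

def level_scheduled_cholesky(A):
    n = A.shape[0]
    pattern = sequential_cholesky(A) != 0.0   # shortcut for symbolic factorization
    parent = [next((v for v in range(u + 1, n) if pattern[v, u]), -1) for u in range(n)]
    height = [0] * n
    for u in range(n):                        # children are numbered below their parents
        if parent[u] >= 0:
            height[parent[u]] = max(height[parent[u]], height[u] + 1)
    L = np.tril(A).astype(float)
    for h in range(max(height) + 1):
        level = [j for j in range(n) if height[j] == h]
        for j in level:                       # independent columns: "pardo" on the CM
            for i in range(j):
                if pattern[j, i]:
                    cmod(L, j, i)
        for j in level:
            cdiv(L, j)
    return L, height

A = np.array([[4.0, 1, 0, 1], [1, 4, 1, 0], [0, 1, 4, 1], [1, 0, 1, 4]])
L, height = level_scheduled_cholesky(A)
print(np.allclose(L @ L.T, A), height)
```

The number of passes through the outer loop is the tree height plus one, which is the quantity the running-time analysis in Section 3.3 depends on.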

3.2. CM implementation of Router Cholesky. To implement Router Cholesky on the CM we must specify how to assign data to processors, and then describe how to do the communication in cdiv and cmod.

We use one (virtual) processor for each nonzero in the triangular factor $L$. We lay out the nonzeros in a one-dimensional array in column major order, as in many standard sequential sparse matrix algorithms [13]. This makes operations within a single column efficient because they can use the CM scan instructions. Each column is represented by a processor for its diagonal element followed by a processor for each sub-diagonal nonzero. The symmetric upper triangle is not stored. We can also think of this processor assignment as a processor for each vertex $j$ of the filled graph, followed by a processor for each edge $(i, j)$ with $i > j$.

Router Cholesky, like many data parallel algorithms, is profligate of parallel variable storage. Each processor contains some working storage and the following pvars:

l            Element of factor matrix $L$, initially $A$.
i            Row number of this element.
j            Column number of this element.
jht          height(j) in $T(A)$.
iht          height(i) in $T(A)$.
diagonalp    Boolean: Is this a diagonal element?
eparent      In processor $P_{ij}$, a pointer to $P_{i,p(j)}$.
nextupdate   Pointer to next element this one may update.

(Recall that $p(j) > j$ is the elimination tree parent of vertex $j < n$.)

At each stage of the sequential outer loop, each processor uses iht and jht to decide whether it participates in a cdiv or cmod. By comparing the local processor's iht or jht to the current value of the outer loop index, a processor can determine if its element is in a column that is involved in a cdiv or a cmod.

The cdiv uses a scan operation to copy the diagonal element to the rest of the active column.

The cmod uses a similar scan to copy the multiplier $l_{ji}$ down to the rest of column $i$. The actual update is done by a scatter-with-add, which uses the router to send the update to its destination.

To figure out where to send the update, each element maintains a pointer called nextupdate to a later element in its row. The nonzero positions in each row are a connected subgraph of the elimination tree, and are linked together in this tree structure by the eparent pointers. Each nonzero updates only elements in columns that are its ancestors in the elimination tree. At each stage, nextupdate is moved one step up the tree using the eparent pointers.

3.3. Router Cholesky performance: Theory. Each stage of Router Cholesky does a constant number of router operations, scans, and arithmetic operations. The number of stages is $h + 1$, where $h$ is the height of the elimination tree. In terms of the parameters of the machine model in Section 2.3.2, its running time is

$$(c_1 \rho + c_2 \sigma + c_3 \phi)\, \mu h.$$

Recall that the memory reference time $\mu$ is proportional to the virtual processor ratio, which in this case is $\lceil \eta(L)/p \rceil$. The $c_i$ are constants.

The most time-consuming step is incrementing the next-update pointer; the router is used twice in this operation. The dominant term is the router term $c_1 \rho \mu h$. Notice that we do not explicitly count time for combining updates to the same element from different sources, since this is handled within the router and is thus included in $\rho$.

To get a feeling for this analysis consider the model problem, a 5-point finite difference mesh in two dimensions ordered by nested dissection [11]. If the mesh has $k$ points on a side, then the graph is a $k$ by $k$ square grid, and we have $n = k^2$, $h = O(k)$, and $\eta(L) = O(k^2 \log k)$. The number of arithmetic operations in the Cholesky factorization is $O(k^3)$, in either the sequential or parallel algorithms. Router Cholesky's running time is $O(k^3 \log k / p)$. If we define performance as the ratio of the number of operations in the sequential algorithm to parallel time, we find that the performance is $O(p / \log k)$ (taking $\rho$ to be a constant independent of $p$ or $k$; this is approximately correct for the Connection Machine, although theoretically $\rho$ should grow at least with $\log p$). This analysis points out two weak points of Router Cholesky. First, the performance on the model problem drops with increasing problem size. (This depends on the problem, of course; for a three-dimensional model problem a similar analysis shows that performance is $O(p)$ regardless of problem size.) More seriously, the constant in the leading term of the complexity is proportional to the router time $\rho$, because every step uses general communication.

This analysis can be extended to any two-dimensional finite element mesh with bounded node degree, ordered by nested dissection [17]. The asymptotic analysis is the same but the values of the constants will be different.

3.4. Router Cholesky performance: Experiments. In order to validate the timing model and analysis, we experimented with Router Cholesky on a variety of sparse matrices. We present one example here in detail. The original matrix is $2500 \times 2500$ with 7400 nonzeros (counting symmetric entries only once), representing a $50 \times 50$ five-point square mesh. It is preordered by Sparspak's automatic nested dissection heuristic [13], which gives orderings very similar to the ideal nested dissection ordering used in the analysis of the model problem above. The Cholesky factor has $\eta(L) = 48{,}608$ nonzeros, an elimination tree of height $h = 144$, and takes 1,734,724 arithmetic operations to compute.

We ran this problem on CM-2's at the Xerox Palo Alto Research Center and the NASA Ames Research Center. The results quoted here are from $p = 8{,}192$ processors, with floating point coprocessors, of the machine at NASA. The VP ratio was therefore $V = \lceil \eta(L)/p \rceil = 8$. (Rounding up to a power of two has considerable cost here, since we use only 48,608 of the 65,536 virtual processors.) We observed a running time of 53 seconds, of which about 41 seconds was due to gathers and scatters. Substituting into the analysis above (using $\rho = 200$ since there were in general many collisions), we would predict router time $c_1 \rho \mu h = 39$ seconds and other time $(c_2 \sigma + c_3 \phi) \mu h = 1.5$ seconds. This is not a bad fit for router time; it is not clear why the remaining time is such a poor fit, but the expensive square root and the data movement involved in the pointer updates contribute to it, and it seems that I/O may have affected the measured 53 seconds.

The observation, in any case, is that router time completely dominates.

3.5. Remarks on Router Cholesky. Router Cholesky is too slow as it stands to be a cost-effective way to factor sparse matrices. Each stage does two gathers and a scatter with exactly the same communication pattern. More careful use of the router could probably speed it up by a factor of two to five. However, this would not be enough to make it practical; something more like a hundredfold improvement in router speed would be needed.

The one advantage of Router Cholesky is the extreme simplicity of its code. It is no more complicated than the numeric factorization routine of a conventional sequential sparse Cholesky package [13]. It is interesting to note that column-oriented sparse Cholesky codes on MIMD message-passing multiprocessors [12, 14, 35] are more complex. They exploit MIMD capability to implement dynamic scheduling of the cmod and cdiv tasks. They allow arbitrary assignment of columns to processors and therefore are required to use indirect addressing of columns. Finally, they are written with low-level communication primitives, the explicit "send" and "receive."

Router Cholesky's simplicity comes dearly. Flexibility in scheduling allows MIMD implementations to gain a modest performance advantage over any possible SIMD implementation. More important, we employ $\eta(L)$ virtual processors, regardless of the number of physical processors. It is essential that these virtual processors not all sit idle, consuming physical processor time slices, when there is nothing for them to do. As implemented by the Paris instruction set, they do sit idle.

We described Router Cholesky as a left-initiated, left-looking algorithm. In a right-initiated algorithm, processor $P_{ij}$ would perform the updates to $l_{ij}$. In a right-looking algorithm, updates would be applied as soon as the updating column of $L$ was computed instead of immediately before the updated column of $L$ was to be computed. Router Cholesky is thus one of four cousins. It is the only one of the four that maps operations to processors evenly; the other three alternatives require an inner sequential loop of some kind. All four versions require at least $h$ router operations.
4. Grid Cholesky. In this section we present a parallel supernodal, multifrontal Cholesky algorithm and its implementation on the CM. Multifrontal methods, introduced by Duff and Reid [7], compute a sparse Cholesky factorization by performing symmetric updates on a set of dense submatrices. We follow Liu [23] in referring to an algorithm that uses rank-1 updates as "multifrontal" and the block version that uses rank-$k$ updates as "supernodal multifrontal." The idea of using block methods to operate on supernodes has been used in many different sparse factorization algorithms [1, 7]. Parallel supernodal or multifrontal algorithms have been used on MIMD message-passing and shared-memory machines [2, 6, 32].

The algorithm uses a two-dimensional VP set (which we call the "playing field") to partially factor, in parallel, a number of dense principal submatrices of the partially factored matrix. By working on the playing field, we may use the fast grid and scan mechanisms for all the necessary communication during the factorization of the dense submatrices. Only when we need to move these dense submatrices back and forth to the playing field do we need to use the router. In this way we drastically reduce the use of the router: for the model problem on a $k \times k$ grid we reduce the number of uses from $h = 3k - 1$ to $2 \log_2 k - 1$. The playing field can also operate at a lower VP ratio in general because it does not need to store the entire factored matrix at once.

This method, like all multifrontal methods, is in essence an "out of core" method in that the Cholesky factor is kept in a data structure that is not referred to while doing arithmetic, all of which is done on dense submatrices. The novelty here is the factorization of many of these dense problems in parallel; the simultaneous exploitation of the parallelism available within each of the dense factorizations; the use of a two-dimensional grid of processors for these parallel dense factorizations; the use of the machine's router for parallel transfers from the matrix storage data structure; and the use of the combining scatter operations for parallel update of the matrix storage data structure.

13 theuseofthecombiningscatteroperationsforparallelupdateofthematrixstorage datastructure. 4.1.TheGridCholeskyalgorithm BlockJessandKeesreordering.FirstwedescribeanequivalentreorderingofthechordalgraphG=G(A)thatwecalltheblockJess/Keesordering. BlockJess/Keesisaperfecteliminationorderingthathastwopropertiesthatmake itthebestequivalentreorderingforourpurposes:iteliminatesverticeswithidenticalmonotoneneighborhoodsconsecutively,anditproducesacliquetreeofminimum height. OurreorderingeliminatesallthesimplicialverticesofGsimultaneouslyasone majorstep.intheprocess,itpartitionsalltheverticesofgintosupernodes.each ofthesesupernodesisacliqueing,andisasimplicialcliquewhenitscomponent verticesareabouttobeeliminated.eachvertexislabeledwiththestage,ormajor stepnumber,atwhichitiseliminated.inmoredetail,thereorderingalgorithmisas follows.procedurereorder(graphg) activestage?1; whilegisnotemptydo activestage activestage+1; NumberallthesimplicialverticesinG,assigning consecutivenumberswithineachsupernode; stage(v) activestageforallsimplicialverticesv; RemoveallthesimplicialverticesfromGod; h activestage endreorder Thecliquesarethenodesofacliquetreewhoseheightish,onelessthanthenumber ofmajoreliminationsteps.theparentofagivencliqueisthelowest-stageclique adjacenttothegivenclique. 
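The reordering procedure above can be sketched sequentially as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is a dictionary of neighbor sets, vertices are comparable integers, and simplicial vertices are grouped into one supernode when they share the same closed neighborhood in the current graph. The names `is_clique` and `block_jess_kees` are hypothetical.

```python
def is_clique(g, vertices):
    """True if every pair of the given vertices is adjacent in g."""
    vs = list(vertices)
    return all(vs[b] in g[vs[a]]
               for a in range(len(vs)) for b in range(a + 1, len(vs)))


def block_jess_kees(adj):
    """Eliminate all simplicial vertices of a chordal graph, stage by stage.

    adj maps each vertex to the set of its neighbors.
    Returns (order, stage, supernodes, h).
    """
    g = {v: set(nbrs) for v, nbrs in adj.items()}   # working copy of the graph
    order, stage, supernodes = [], {}, []
    active_stage = -1
    while g:
        active_stage += 1
        # A vertex is simplicial when its neighbors form a clique.
        simplicial = [v for v in sorted(g) if is_clique(g, g[v])]
        if not simplicial:
            raise ValueError("graph is not chordal")
        # Assumption: simplicial vertices with equal closed neighborhoods
        # form one supernode and are numbered consecutively.
        groups = {}
        for v in simplicial:
            groups.setdefault(frozenset(g[v] | {v}), []).append(v)
        for members in groups.values():
            supernodes.append((active_stage, members))
            order.extend(members)
            for v in members:
                stage[v] = active_stage
        removed = set(simplicial)
        g = {v: nbrs - removed for v, nbrs in g.items() if v not in removed}
    return order, stage, supernodes, active_stage
```

For a triangle on vertices 0, 1, 2 with a pendant vertex 3 attached to 2, the first stage removes the supernodes {0, 1} and {3}, and the second stage removes {2}, so the clique tree height returned is 1.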
The name "block Jess/Kees" indicates a relationship with an algorithm due to Jess and Kees [21] that finds an equivalent reordering for a chordal graph so as to minimize the height of its elimination tree. The original (or "point") Jess/Kees ordering eliminates just one vertex from each simplicial clique at each major step. (This is a maximum-size independent set of simplicial vertices.) Each step of point Jess/Kees produces one level of an elimination tree, from the leaves up. The resulting elimination tree has minimum height over all perfect elimination orders on $G$. Our block Jess/Kees eliminates all the simplicial vertices at each major step, producing a clique tree one level at a time from the leaves up. This ordering may not minimize the height of the elimination tree. However, as Blair and Peyton [4] have shown, it does produce a clique tree of minimum height over all perfect elimination orders on $G$.

Every vertex is included in exactly one supernode. We number the supernodes as $\{S_1, \ldots, S_m\}$ in such a way that if $i < j$ then the vertices in $S_i$ have lower numbers than the vertices in $S_j$. The stage at which a supernode $S$ is eliminated is the iteration of the while loop at which its vertices are numbered and eliminated; thus, for all $v \in S$, stage($v$) = stage($S$) is the height of node $S$ in the clique tree.

Parallel supernodal multifrontal elimination. Let $C$ be a supernode. It is immediate that $K = \mathrm{adj}(C) \cup C$ is a clique, and that it is maximal. Our factorization algorithm works by forming the principal submatrices of $A$ corresponding to vertices in the maximal cliques generated by supernodes. Write $A(K,K)$ for the principal submatrix of order $|K| = |C| + |\mathrm{adj}(C)|$ consisting of elements $a_{i,j}$ with $i, j \in K$. It is natural to partition the principal submatrix $A(K,K)$ of $A$ as

$$A(K,K) = \begin{pmatrix} X & E \\ E^T & Y \end{pmatrix},$$

where $X = A(C,C)$ is $|C| \times |C|$, $Y = A(\mathrm{adj}(C),\mathrm{adj}(C))$ is $|\mathrm{adj}(C)| \times |\mathrm{adj}(C)|$, and $E$ is $|C| \times |\mathrm{adj}(C)|$.

In the terminology of the previous section, Grid Cholesky is a "block right-looking" algorithm. The details are as follows.

    procedure GridCholesky(matrix A)
        for activestage ← 0 to h do
            for each supernode C with stage(C) = activestage pardo
                Move A(K,K) to the playing field, where K = C ∪ adj(C);
                Set Y to zero on the playing field;
                Perform |C| steps of parallel Gaussian elimination
                    to compute the Cholesky factor L of X,
                    the updated submatrix E' = L^{-1} E,
                    and the Schur complement Y' = -E^T X^{-1} E,
                    where X, E, and Y partition A(K,K) as above;
                A(C,C) ← L;
                A(adj(C),C) ← E'^T;
                A(adj(C),adj(C)) ← A(adj(C),adj(C)) + Y'
            od
        od
    end GridCholesky;

4.2. Multiple dense partial factorization. In order to make this approach useful, we need a fast dense matrix factorization on two-dimensional VP sets. To that end, we discuss an implementation of LU decomposition without pivoting. (We use LU instead of Cholesky here because we can see no efficient way to exploit symmetry with a two-dimensional machine; moreover, LU avoids the square root at each step and so is a bit faster.) We analyzed and implemented two methods: a systolic algorithm that uses only nearest neighbor communication on the grid, and a rank-1 update algorithm that uses row and column broadcast. With either of these methods, all the submatrices $A(K,K)$ corresponding to supernodes at a given stage are distributed about the two-dimensional playing field simultaneously (each as a separate "baseball diamond"), and the partial factorization is applied to all the submatrices at once.

We describe the rank-1 algorithm in terms of its effect on a single submatrix $A(K,K)$, with a supernode of size $|C|$ and a Schur complement of size $|\mathrm{adj}(C)|$. A single rank-1 update consists of a division to compute the reciprocal of the diagonal element, a scan down the columns to copy the pivot row to the remaining rows, a parallel multiplication to compute the multiplier for each row, another scan to copy the multiplier across each row, and finally a parallel multiply and subtract to apply the update. The number of rank-1 updates is $|C|$, the size of the supernode.

An entire stage of partial factorizations is performed at once, using segmented scans to keep each factorization within its own "diamond." The number of steps on the playing field at stage $s$ is the maximum value of $|C|$ over all supernodes at

15 Herec3is2s,andc4isproportionaltosaswell. stages.thenastageofrank-1partialfactorizationtakestime below,foracompletefactorization(thatis,oneinwhich=0).thebookkeeping includesnearest-neighborcommunicationtomovethreeone-bittokensthatcontrol whichprocessorsperformreciprocals,multiplications,andsoonateachstep. Scans(rowandcolumnbroadcast): Therelativecostofthevariouspartsoftherank-1updatecodearesummarized (c3+c4): Multiply(computingmultipliers): News(movingthetokens): 79.7% 5.5% allcommunicationisbetweengridneighbors.thusitscommunicationtermsare proportionaltoratherthan.thisadvantageismorethanosetbythefactthat Divide(reciprocalofpivotelement): Multiply-subtract(Gaussianelimination): Unliketherank-1implementation,thesystolicimplementationneverusesascan; 7.1% 2.7% 3s+2ssequentialiterationsarenecessary,whiletherank-1methodonlyneedss Remarksondensepartialfactorization.Theoretically,systolicfactorizationshouldbeasymptoticallymoreecientasmachinesizeandproblemsize 4.8% forthetwo-dimensionalmodelproblemtheaverageschurcomplementsizesisabout Butforapartialfactorizationtherank-1algorithmistheclearwinner.Forexample, 4s,sotherank-1codehasan11-to-1advantageinnumberofsteps.Thismorethan threefoldincreaseinnumberofsteps,andsothesystolicmethodissomewhatfaster. grows.realistically,however,thecmhappenstohave6;forafullfactorization makesupforthefactthatscanismuchslowerthangridcommunication. 
growwithoutbound,becausescansmustbecomemoreexpensiveasthemachine (=0)asixfolddecreaseincommunicationtimeperstepmorethanbalancesthe algorithm,themultiply-subtract,accountsforonly1=21ofthetotaltimeintherank- dousefulwork,sincetheactivepartofthematrixoccupiesasmallerandsmaller 1parallelalgorithm.Moreover,only1=3ofthemultiply-subtractoperationsactually subsetoftheplayingeldasthecomputationprogresses.thisgivesthecodean VPratios.Thereasonsarethese:scanisslowrelativetoarithmetic;thedivideand havefoundthistobetypicalin*lispcodesformatrixoperations,especiallywithhigh multiplyoperationsoccuronverysparsevpsets;andthevpratioremainsconstant overalleciencyofonepartin63forluandonepartin126forcholesky.we Itisinterestingtonotethattheonlyarithmeticthatmattersinasequential canreadilybemadeatcompiletime.asanalternative,wecouldhaverewrittenthe code;thevpsetcouldshrinkasthematrixshrinks,andthedivideandthemultiplies couldbeperformedinasparservpset.15 processorsthatareactive.sometimesadeterminationthatthisisgoingtohappen substantially:theloopovervvirtualprocessorsshouldberestrictedtothosevirtual astheactivepartofthematrixgetssmaller. Signicantperformanceimprovementscouldcomefromseveralpossiblesources. Amoreecientimplementationofvirtualprocessorscouldimproveperformance
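The rank-1 partial factorization of Section 4.2 can be sketched sequentially as below. This is only an illustration: on the CM the copy of the pivot row and of the multipliers is done by scans and the inner loop is a single parallel multiply-subtract, whereas here they are ordinary Python loops. The function name `partial_lu` and the list-of-lists layout are assumptions made for the example.

```python
def partial_lu(a, steps):
    """LU without pivoting, in place, stopped after `steps` rank-1 updates.

    After the call, the trailing block a[steps:][steps:] holds the original
    trailing block plus the Schur complement update.
    """
    n = len(a)
    for k in range(steps):
        recip = 1.0 / a[k][k]              # divide: reciprocal of the pivot
        pivot_row = a[k][:]                # copied down the columns by a scan on the CM
        for i in range(k + 1, n):
            mult = a[i][k] * recip         # multiply: multiplier for row i
            a[i][k] = mult                 # keep the multiplier as the L entry
            for j in range(k + 1, n):      # multiply-subtract: apply the update
                a[i][j] -= mult * pivot_row[j]
    return a
```

As a usage example, take a supernode of size 2 with $X = \begin{pmatrix}4&2\\2&5\end{pmatrix}$, $E = (2, 1)^T$, and $Y$ set to zero. After two steps the trailing entry equals $-E^T X^{-1} E = -1$, which is the Schur complement $Y'$ that Grid Cholesky adds back into matrix storage.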

The scans could be sped up considerably within the hypercube connection structure of the CM. Ho and Johnsson [19] have developed an ingenious algorithm that takes $O(b/d + d)$ time to broadcast $b$ bits in a $d$-dimensional hypercube, in contrast to the scan which takes $O(b + d)$. The copy scans could also be implemented so that each physical processor gets only one copy of the data rather than $v$ copies.

More efficient use of the low-level floating-point architecture of the CM-2 is possible. The Paris instruction set hides the fact that every 32 physical processors share one vector floating-point arithmetic chip. Performing 32 floating-point operations implies moving 32 numbers in bit-serial fashion into a transposer chip, then moving them in parallel into the vector chip, then reversing the process to store the result. While this mode of operation conforms to the one-processor-per-data-element programming model, it wastes time and memory bandwidth when only a few processors are actually active (such as computing multipliers or diagonal reciprocals in LU), since 32 memory cycles are required just to access one floating-point number. This mode also requires intermediate results to be stored back to main memory, precluding the use of block algorithms [3] that could store intermediate results in the registers in the transposer. Thus the computation rate is limited by the bandwidth between the transposer and memory (about 3.5 gigaflops for a 64K processor CM) instead of by the operation rate of the vector chip (about 27 gigaflops total).

A more efficient dense matrix factorization can be achieved by treating each 32-processor-plus-transposer-and-vector-chip unit as a single processor, and representing 32-bit floating-point numbers "slicewise," with one bit per physical processor, so that they do not need to be moved bit-serially into the arithmetic unit. (Viewed this way, we were working with just 256 real processors!) Also, if virtualization is handled efficiently, we need only keep 256 processors busy. At the time this work was done (late in 1988) the tools for programming on this level were not easily usable. While CM Fortran allows this model, it does not allow scans and scatter with combining, so we still (early 1991) cannot use it.

4.3. CM implementation of Grid Cholesky. Grid Cholesky uses two VP sets with different geometries: the matrix storage stores the nonzero elements of $A$ and $L$ (doing almost no computation), and the playing field is where the dense partial factorizations are done. The top-level factorization procedure is just a loop that moves the active submatrices to the playing field, factors them, and moves updates back to the main matrix.

Matrix storage. The matrix storage VP set is a one-dimensional array of virtual processors that stores all of $A$ and $L$ in essentially the same form as Router Cholesky. Each of the following pvars has one element for each nonzero in $L$.

    lvalue       Elements of L, initially those of A.
    updates      Working storage for sum of incoming updates.
    gridi        The playing field row in which this element sits.
    gridj        The playing field column in which this element sits.
    activestage  The stage at which j occurs in a supernode.

We use a scatter to move the active columns from matrix storage to the playing field. The supernodes $C$ are disjoint, but their neighboring sets $\mathrm{adj}(C)$ may overlap; that is, more than one Schur complement may be computing updates to the same element of $L$ at the same stage. Therefore, we use scatter-with-add to move the partially factored matrix from the playing field back to matrix storage.
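The need for a combining scatter can be seen in a small sequential sketch. The function below imitates scatter-with-add on plain Python lists; it is not the Paris primitive, and the name `scatter_with_add` and its argument layout are assumptions for illustration only.

```python
def scatter_with_add(dest_size, destinations, values):
    """Send values[i] to slot destinations[i]; colliding values are summed."""
    out = [0.0] * dest_size
    for d, v in zip(destinations, values):
        out[d] += v
    return out


# Two supernodes whose adjacency sets overlap both update storage slot 2.
updates = scatter_with_add(
    dest_size=4,
    destinations=[0, 2, 2, 3],        # slot 2 receives two contributions
    values=[-1.0, -0.5, -0.25, -2.0],
)
lvalue = [4.0, 3.0, 5.0, 6.0]
lvalue = [l + u for l, u in zip(lvalue, updates)]   # one addition per stage
```

A plain scatter would keep only one of the two values sent to slot 2, which is why the overlapping Schur complements require the combining form.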

17 ThepvarsusedinthisVPsetare eciency.itssizeisdeterminedaspartofthesymbolicfactorizationandreordering. stage,althoughitcouldactuallyusedierentvpratiosatdierentstagesformore two-dimensionalgridonwhichthesupernodesarefactored.inourimplementation itislargeenoughtoholdalltheprincipalsubmatricesforallmaximalcliquesatany Theplayingeld.ThesecondVPset,calledtheplayingeld,isthe ofallthemaximalcliques. aswellassomebooleanagsusedtocoordinatethesimultaneouspartialfactorization doingrank-1updates(seesectionsection4.2)ofallthedensesubmatricesstored there,usingsegmentedscanstodistributethepivotrowsandcolumnswithinall TheprocessorsoftheplayingeldcomputeLUfactorizationsbysimultaneously updatedestthematrixstoragelocation(processor)thatholdsthismatrixelement;anintegerpvararrayindexedbystage. denseatheplayingeldformatrixelements. submatricesatthesametime.thenumberofrank-1updatestepsisthesizeofthe largestsupernodeatthecurrentstage.thesubmatricesmaybedierentsizes;each matrixonlydoesasmanyrank-1updatesasthesizeofitssupernode. squarearraysontothesmallestpossiblerectangularplayingeld(whosebordersmust A(K;K)forallthemaximalcliquesKateverystage.Thisisatwo-dimensionalbin \rsttbylevels"heuristic.thislayoutisdoneduringthesequentialsymbolic bepowersoftwo).optimaltwo-dimensionalbinpackingisingeneralannp-hard problem,thoughvariousecientheuristicsexist[9].ourexperimentsuseasimple factorization,beforethenumericfactorizationisbegun. 
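The text does not spell out the "first fit by levels" heuristic, so the following is only a plausible sketch of a level-based first-fit packing of square submatrices onto a playing field whose width is fixed and whose height is rounded up to a power of two. Sorting the squares by decreasing size, the function names, and the returned layout format are all assumptions made for this illustration.

```python
def next_pow2(x):
    """Smallest power of two that is at least x (and at least 1)."""
    p = 1
    while p < x:
        p *= 2
    return p


def first_fit_by_levels(sides, width):
    """Pack squares with the given side lengths into a strip of fixed width.

    Returns (placement, height): placement maps the index of each square to
    the (row, column) of its top-left corner, and height is the strip height
    rounded up to a power of two.
    """
    levels = []          # each level is [row_offset, used_width]
    placement = {}
    next_row = 0
    # Largest squares first, so each level is as tall as its first square.
    for idx in sorted(range(len(sides)), key=lambda i: -sides[i]):
        s = sides[idx]
        if s > width:
            raise ValueError("square wider than the playing field")
        for level in levels:
            if level[1] + s <= width:          # first level with room
                placement[idx] = (level[0], level[1])
                level[1] += s
                break
        else:                                  # no level fits: open a new one
            levels.append([next_row, s])
            placement[idx] = (next_row, 0)
            next_row += s
    return placement, next_pow2(next_row)
```

With squares of sides 3, 2, 2, 1 and a field of width 4, the sketch opens two levels and needs a height of 5, which is rounded up to 8 to respect the power-of-two restriction on the playing field borders.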
packingproblem.inordertominimizecmcomputationtime,wewanttopackthese Inordertousethisprocedureweneedtondaplacementofallthesubmatrices forgridcholeskyintotimeinthematrixstoragevpsetandtimeontheplaying isoneadditionperstagetoaddtheaccumulatedupdatestothematrix.)thereisa xednumberofrouterusesperstage,sothematrixstoragetimeis eld.theformerincludesalltheroutertrac,andessentiallynothingelse.(there forsomeconstantc5.thesubscriptmsindicatesthatthevalueofistakenin 4.4.GridCholeskyperformance:Theory.Weseparatetherunningtime completelycomputedcolumnsandtheschurcomplementsbacktomatrixstorage. ofrank-1updatesisthesizeofthelargestsupernodeatthatlevel,whichiss. thematrixstoragevpset,whosevpratiosetisvms=d(l)=pe.inthecurrent playingeldatthebeginningofastage,andthentwoscattersareusedtomovethe implementationc5=4,sincetwoscattersareusedtomovethedensematricestothe Weexpresstheplayingeldtimeasasumoverlevels.Ateachlevelsthenumber TMS=c5MSh wherec6andc7areconstants(infactc6=2),andthesubscriptsindicatesthat AccordingtotheanalysisinSection4.2, thevalueofistakenintheplayingeldvpsetatstages.thevpratiointhis VPsetcouldbeapproximatelytheratioofthetotalsizeofthedensesubmatricesat TPF=(c6+c7)hXs=0ss; 17

18 Stages h?1 h?3 Subproblemcountsandplayingeldsizeforthemodelproblem. R(s)s 1 ktable2 h?4 h?5 h? k=4 3k=24:5k2 3k=29k2 3k=218k2 5k=425k2 7k=824:5k2 ks+sp(c+c)2 h? k=8 k=16 5k=825k2 7k=1624:5k2 k2 stagestothenumberofprocessors,changingateachstageasthenumberandsize maximumofthisvalueoverallstages. ofthemaximalcliquesvary.howeverinourimplementationitissimplyxedatthe Again,togetafeelingforthisanalysisletusconsiderthemodelproblem,ave- h?2r?122r+1k=2r+17k=2r+124:5k222r 5k=2r25k2 O(k3)arithmeticoperations.Table2summarizesthenumberandsizesofthecliques pointnitedierencemeshonakkgridorderedbynesteddissection.forthis problemn=k2,h=o(logk),and(l)=o(k2logk).thefactorizationrequires thatoccurateachstage.thecolumnsinthetableareasfollows. sizeplayingeldateverystage.accordingtotable2,aplayingeldofsized25k2e R(s) s+s Numberofsupernodesatstages. sucesiftheproblemscanbepackedinwithouttoomuchloss.thevpratiois P(C+C)2TotalareaofalldensesubmatricesA(K;K)atstages. sothematrixstoragetimeiso(k2log2k=p).ourpilotimplementationusesthesame TheVPratioinmatrixstorageforthemodelproblemisO((L))=p=O(k2logk=p), SizeoflargestmaximalcliqueC[adj(C)atstages. Sizeoflargestsupernodeatstages. eldtimeiso(k3=p).insummary,thetotalrunningtimeofgridcholeskyforthe modelproblemis O(k2=p).ThesumoverallstagesofsisO(k)(infactitis3k+O(1)),sotheplaying term.thisisbecausetheplayingeldcomputationsaredoneondensematriceswith moreimportantinpractice,therouterspeedappearsonlyinthesecond-order done,hasalowervpratiothanthematrixstoragestructure.second,andmuch vanished.thisisbecausetheplayingeld,wherethebulkofthecomputationis arithmeticoperationstotime,iso(p);thelogkineciencyofroutercholeskyhas Twothingsarenotableaboutthis:First,theperformance,orratioofsequential Ok2log2k p+k3 p: withtheproblemsize. moreecientgridcommunication.thismeansthattheroutertimebecomesless importantastheproblemsizegrows,whetherornotthenumberofprocessorsgrows 18
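The two terms of the total running time for the model problem can be traced as a short derivation. It assumes, as the analysis above does, that the cost of each router operation or elimination step is proportional to the VP ratio of the VP set in which it runs; the symbols $v_{PF}$ and $N_{\mathrm{steps}}$ are introduced here only for this sketch.

```latex
\begin{align*}
T_{MS} &\propto h \cdot v_{MS}
        = O(\log k)\cdot O\!\left(\frac{k^2\log k}{p}\right)
        = O\!\left(\frac{k^2\log^2 k}{p}\right),\\
T_{PF} &\propto N_{\mathrm{steps}}\cdot v_{PF}
        = O(k)\cdot O\!\left(\frac{k^2}{p}\right)
        = O\!\left(\frac{k^3}{p}\right),\\
T_{MS}+T_{PF} &= O\!\left(\frac{k^2\log^2 k}{p}+\frac{k^3}{p}\right).
\end{align*}
```

Here $N_{\mathrm{steps}}$ is the sum over all stages of the size of the largest supernode, which the text gives as $3k + O(1)$, and $v_{PF}$ is the VP ratio of a playing field of about $25k^2$ virtual processors.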

19 andmachinesizesothatthevpratioremainsconstant.thenthemodelproblem Choleskyisa\scalable"parallelalgorithm. analysisofthemodelproblemcarriesthrough(withdierentconstantfactors)for requireso(k)totalparalleloperations,butonlyo(logk)routeroperations.the analysiscarriesthroughforanythree-dimensionalniteelementproblem.thus,grid anytwo-dimensionalniteelementproblemorderedbynesteddissection;asimilar Onewayoflookingatthisanalysisistothinkofincreasingbothproblemsize pointdiscretizationofthelaplacianonasquare6363mesh,orderedbynested mentalresultsfromafairlysmallmodelproblem,thematrixarisingfromtheve- metricentriesonlyonce).thecholeskyfactorhas(l)=86,408nonzeros,aclique dissection.thismatrixhasn=3,969columnsand11,781nonzeros(countingsym- compute. treewithh=11stagesofsupernodes,andtakes3,589,202arithmeticoperationsto 4.5.GridCholeskyperformance:Experiments.Herewepresentexperi- localcomputation).theother2:04secondswasmatrixstoragetime,consistingmostly VPs.Theresultsquotedherearefrom8,192processors,withoating-point coprocessors,ofthemachineatnasa.bothvpsetsthereforehadavpratioof16. (AlargerproblemwouldneedahigherVPratiointhematrixstoragethaninthe time(3:12forthescans,0:15fornearest-neighbormovesofone-bittokens,and0:82for playingeld.) ThematrixstorageVPsetrequires128KVPs.Thexed-sizeplayingeldrequires comestobetween1:5and4:7seconds,dependingonwhichvaluewechoosefor.in fact3=4oftherouteroperationsarescatterswithnocollisions,andtheother1=4 withexperiment.themodelpredictsmatrixstoragetimeofabouth4ms.this ofthefourscattersateachstage.ouranalyticmodelpredictsplayingeldtimeto beabout3k(2+4)pf.thiscomesto4:0seconds,whichisingoodagreement arescatter-with-add,typicallywithtwotofourcollisions.thettothemodelis Weobservedarunningtimeof6:13seconds.Ofthis,4:09secondswasplayingeld thereforequiteclose. 
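The timing figures quoted above can be cross-checked with a few lines of arithmetic. The sketch only recombines numbers stated in the text; the variable names are illustrative.

```python
# Playing field time, in seconds, broken down as in the text.
scan_time = 3.12
news_time = 0.15
local_time = 0.82
playing_field_time = scan_time + news_time + local_time

matrix_storage_time = 2.04
total_time = playing_field_time + matrix_storage_time

# Arithmetic operations needed to compute the factor of the 63 x 63 problem.
operations = 3_589_202
mflops_8k = operations / total_time / 1e6
mflops_64k = 8 * mflops_8k            # naive scaling from 8K to 64K processors

# VP ratio of a 128K-VP set on 8,192 processors.
vp_ratio = (128 * 1024) // 8192
```

The parts sum to the observed 4.09 and 6.13 seconds, the rate comes to roughly 0.586 megaflops on 8K processors, and the VP ratio is 16.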
4.6. Remarks on Grid Cholesky. On this small example, Grid Cholesky is about 20 times as fast as Router Cholesky. It is, however, only running at .586 megaflops on 8K processors, which would scale to 4.68 megaflops on a full 64K machine. A larger problem would run somewhat faster, but it is clear that making Grid Cholesky attractive will require major improvements. Some of these were sketched in Section 4.2.

A first question is whether Grid Cholesky is a router-bound code like Router Cholesky. For the small sample problem the relative times for router and non-router computations are as follows.

    Moving data to the playing field:                  21%
    Factoring on the playing field:                    67%
    Moving Schur complements back to matrix storage:   12%

Evidently, the Grid Cholesky code is not router-bound for this problem. For larger (or structurally denser) problems this situation gets better still: for a machine of fixed size, the time spent using the router grows like $O(\log^2 k)$ while the time on the playing field grows like $O(k^3)$ for a $k \times k$ grid, as we showed above. If we solved the same problem on a full-sized 64K processor machine, the relative times would presumably be the same as above; but if we solved a problem 8 times as large the operation count would increase by a factor of about 22 while the number of stages, and router operations, would increase only by a factor of about 1.3.

Next, we ask whether our use of the playing field is efficient or not. The number of parallel elimination steps on the playing field is given by the sum, over all stages, of the size of the largest supernode at each stage, which is 177 for the example. On the playing field of 131,072 processors this allows us to do $22.8 \times 10^6$ multiply-adds or $45.6 \times 10^6$ flops. The number of flops actually done is $3.69 \times 10^6$, so the processor utilization is 7.9%. There are several reasons for this loss of efficiency:

    The algorithm does both halves of the symmetric dense subproblems (factor of 2);
    the implementation uses the same playing field size at every level (factor of about 4/3);
    the architecture forces the dimensions of the playing field to be powers of two (factor of about 4/3);
    the playing field is not fully covered by active processors initially, and as the dense factorization progresses processors in the supernodes fall idle (factor of about 7/2).

These effects account for a factor of roughly 12.4, which is consistent with our measured result; 1/.079 = 12.6. In other words, every step must use all 131,072 virtual processors, but on average we only have work for about 10,500 of them. Thus, the Paris virtualization method cost us a factor of roughly (131,072/10,500) = 12.5.

A similar analysis shows that virtualization slows the use of the router by a factor of 7.6.

We summarize the reasons that our achieved performance is so far below the CM's peak:

    Virtualization. We used $2^{17}$ virtual processors. On average, we make use of 10,500 on the playing field and 5,600 in matrix storage. In actuality, there are 256 physical (floating-point) processors in the machine we used.
    The slow router.
    Communication costs for scans on the playing field, using the built-in suboptimal algorithm.
    The SIMD restriction. This causes us to have to wait for the divides and multiplies. (Since there are very few active virtual processors during these tasks, most of this effect could also be removed by efficient virtualization.)
    The Paris instruction set, which makes reuse of data in registers impossible, thus exposing the low memory bandwidth of the CM.
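The way the four loss factors combine, and how they compare with the virtualization estimate, can be checked directly. The sketch uses only the factors and counts quoted above; the dictionary keys are descriptive labels chosen for this example.

```python
from fractions import Fraction

# Efficiency losses on the playing field, as listed in the text.
loss_factors = {
    "both halves of symmetric subproblems": Fraction(2),
    "same playing field size at every level": Fraction(4, 3),
    "power-of-two playing field dimensions": Fraction(4, 3),
    "idle processors within the field": Fraction(7, 2),
}

combined = Fraction(1)
for factor in loss_factors.values():
    combined *= factor                     # independent losses multiply

virtualization_cost = 131_072 / 10_500     # VPs used per step / VPs with work
measured_loss = 1 / 0.079                  # inverse of the 7.9% utilization
```

The product of the four factors is 112/9, about 12.4, which sits close to both the virtualization estimate of about 12.5 and the inverse of the measured utilization.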
Table 3 gives an upper bound on the improvement that removal of each of these impediments to performance would provide. In each case, we hypothesize an "ideal" machine in which the corresponding cost is completely removed. Thus, for example, the statistics for the third row of the table are for a marvelous machine in which a router operation takes no time whatever.

Table 3. Factors affecting efficiency of Grid Cholesky. (Columns: impediment removed, time, Mflops, speedup. Rows include: none, virtualization, slow router, slow scans.)

Clearly, good efficiency is possible, even on an SIMD machine with a router. The chief problems are the Paris virtualization method; the lack of a fast broadcast in the grid; and the memory-to-memory instruction set. All of these are peculiarities of Paris on the CM-2; none is fundamental to SIMD, data parallel computing.

5. Conclusions. We have compared two highly parallel general-purpose sparse Cholesky factorization algorithms, implemented for a data parallel computer. The first, Router Cholesky, is concise and elegant and takes advantage of the parallelism present in the elimination tree, but because it pays little attention to the cost of communication it is not practical.

We therefore developed a parallel algorithm, Grid Cholesky, that does most of its work with grid communication on dense submatrices. Analysis shows that the requirement for expensive general-purpose communication grows only logarithmically with problem size; even for modest problems the code is not limited by the CM router. Experiment shows that Grid Cholesky is about 20 times as fast as Router Cholesky for a moderately small sample problem.

Our pilot implementation of Grid Cholesky is not fast enough to make the Connection Machine cost-effective for solving generally structured sparse matrix problems. We believe, however, that our experiments and analysis lead to the conclusion that a parallel supernodal, multifrontal algorithm can be made to perform efficiently on a highly parallel SIMD machine. We showed in detail that Grid Cholesky could run two to three orders of magnitude faster with improvements in the instruction set of the Connection Machine. With such improvements, the gap separating our pilot implementation from the 27-gigaflop theoretical peak performance of a 64K processor CM would yawn somewhat less dauntingly.

Most of these improvements are below the level of the Paris virtual processor abstraction, which is to say below the level of the assembly-language architecture of the machine. Although TMC has recently released a low-level language called CMIS in which a user can program below the virtual-processor level, we believe that ultimately most of these optimizations should be applied by high-level language compilers. Future compilers for highly parallel machines, while they will support the data-parallel virtual processor abstraction at the user's level, will generate code at a level below that abstraction.

Although Grid Cholesky is more complicated than Router Cholesky, we are still able to use the data parallel programming paradigm to express it in a straightforward way. The high-level scan and scatter-with-add communication primitives substantially simplified the programming. The simplicity of our codes speaks well for this data parallel programming model.

In summary, even though our pilot implementation is not fast, we are nonetheless encouraged about the Grid Cholesky algorithm and about the potential of data parallel programming and highly parallel architectures for solving unstructured problems. We believe that future generations of highly parallel machines may allow efficient parallel programs for complex tasks to be written nearly as easily as sequential programs. To get to that point, there will have to be improvements in compilers, instruction sets, and router technology. Virtualization will have to be implemented without sacrificing efficiency.

We mention four avenues for further research.

The first is scheduling the dense partial factorizations efficiently. The tree of supernodes identifies a precedence relationship among the various partial factorizations. Our simple approach of scheduling these one level at a time onto a fixed-size playing field is not the only possible one. There is in general no need to perform all the partial factorizations at a single level simultaneously. It should be possible to use more sophisticated heuristics to schedule these factorizations onto a playing field of varying VP ratio, or even (for the Connection Machine) onto a playing field considered as a mesh of individual vector floating-point chips. Some theoretical work has been done on scheduling "rectangular" tasks onto a square grid of processors efficiently [8].

The second avenue is improving the time spent in the matrix storage VP set. Of course, as problems get larger this time becomes a smaller fraction of the total. At present matrix storage time is not very significant even for a small problem, but it will become more so as the playing field time is improved.

Third, we mention the possibility of an out-of-main-memory version of Grid Cholesky for very large problems. Here the clique tree would be used to schedule transfers of data between the high-speed parallel disk array connected to the CM and the processors themselves.

Fourth and finally, we mention the possibility of performing the combinatorial preliminaries to the numerical factorization in parallel. Our pilot implementation uses a sequentially generated ordering, symbolic factorization, and clique tree. We are currently designing data parallel algorithms to do these three steps [18].

We conclude by extracting one last moral from Grid Cholesky. We find it interesting and encouraging that the key idea of the algorithm, namely partitioning the matrix into dense submatrices in a systematic way, has also been used to make sparse Cholesky factorization more efficient on vector supercomputers [32] and even on workstations [29]. In the former case, the dense submatrices vectorize efficiently; in the latter, the dense submatrices are carefully blocked to minimize traffic between cache memory and main memory. We expect that more experience will show that many techniques for attaining efficiency on sequential machines with hierarchical storage will turn out to be useful for highly parallel machines.

REFERENCES

[1] C. Ashcraft, R. Grimes, J. Lewis, B. Peyton, and H. Simon, Recent progress in sparse matrix methods for large linear systems, International Journal of Supercomputer Applications, (1987), pp. 10–30.
[2] C. C. Ashcraft, The domain/segment partition for the factorization of sparse symmetric positive definite matrices, Tech. Report ECA–TR–148, Boeing Computer Services Engineering Computing and Analysis Division, Seattle,
[3] C. H. Bischof and J. J. Dongarra, A project for developing a linear algebra library for high-performance computers, Tech. Report MCS–P105–0989, Argonne National Laboratory,
[4] J. R. S. Blair and B. W. Peyton, On finding minimum-diameter clique trees, Tech. Report ORNL/TM–11850, Oak Ridge National Laboratory, 1991.
[5] M. Dixon and J. de Kleer, Massively parallel assumption-based truth maintenance, in Proceedings of the National Conference on Artificial Intelligence, 1988, pp. 199–204.
[6] I. S. Duff, Multiprocessing a sparse matrix code on the Alliant FX/8, Tech. Report CSS 210, Computer Science and Systems Division, AERE Harwell, 1988.
[7] I. S. Duff and J. K. Reid, The multifrontal solution of indefinite sparse symmetric linear equations, ACM Transactions on Mathematical Software, 9 (1983), pp. 302–325.
[8] A. Feldmann, J. Sgall, and S.-H. Teng, Dynamic scheduling on parallel machines.
[9] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, 1979.
[10] F. Gavril, Algorithms for minimum coloring, maximum clique, minimum covering by cliques, and maximum independent set of a chordal graph, SIAM Journal on Computing, 1 (1972), pp. 180–187.
[11] A. George, Nested dissection of a regular finite element mesh, SIAM Journal on Numerical Analysis, 10 (1973), pp. 345–363.
[12] A. George, M. T. Heath, J. Liu, and E. Ng, Sparse Cholesky factorization on a local-memory multiprocessor, SIAM Journal on Scientific and Statistical Computing, 9 (1988), pp. 327–340.
[13] A. George and J. W. H. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall, 1981.
[14] A. George and E. Ng, On the complexity of sparse QR and LU factorization of finite-element matrices, SIAM Journal on Scientific and Statistical Computing, 9 (1988), pp. 849–861.
[15] J. A. George and D. McIntyre, On the application of the minimum degree algorithm to finite element systems, SIAM Journal on Numerical Analysis, 15 (1978), pp. 90–112.
[16] J. R. Gilbert, Some nested dissection order is nearly optimal, Information Processing Letters, 26 (1988), pp. 325–328.
[17] J. R. Gilbert and H. Hafsteinsson, Parallel solution of sparse linear systems, in SWAT 88: Proceedings of the First Scandinavian Workshop on Algorithm Theory, Springer-Verlag Lecture Notes in Computer Science 318, 1988, pp. 145–153.
[18] J. R. Gilbert, C. Lewis, and R. Schreiber, Parallel preordering for sparse matrix factorization. In preparation.
[19] C.-T. Ho and S. L. Johnsson, Spanning balanced trees in Boolean cubes, SIAM Journal on Scientific and Statistical Computing, 10 (1989), pp. 607–630.
[20] J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Science, 79 (1982), pp. 2554–2558.
[21] J. A. G. Jess and H. G. M. Kees, A data structure for parallel L/U decomposition, IEEE Transactions on Computers, C-31 (1982), pp. 231–239.
[22] S. G. Kratzer, Massively parallel sparse matrix computations, Tech. Report SRC–TR–90–008, Supercomputer Research Center, 1990.
[23] J. W. H. Liu, The multifrontal method for sparse matrix solution: Theory and practice, Tech. Report CS–90–04, York University Computer Science Department, 1990.
[24] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134–172.
[25] J. Naor, M. Naor, and A. J. Schäffer, Fast parallel algorithms for chordal graphs, SIAM Journal on Computing, 18 (1989), pp. 327–349.
[26] B. W. Peyton, Some Applications of Clique Trees to the Solution of Sparse Linear Systems, PhD thesis, Clemson University, 1986.
[27] D. J. Rose, A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations, in Graph Theory and Computing, R. C. Read, ed., 1972, pp. 183–217.
[28] D. J. Rose, R. E. Tarjan, and G. S. Lueker, Algorithmic aspects of vertex elimination on graphs, SIAM Journal on Computing, 5 (1976), pp. 266–283.
[29] E. Rothberg and A. Gupta, Fast sparse matrix factorization on modern workstations, Tech. Report STAN–CS–89–1286, Stanford University, 1989.
[30] R. Schreiber, A new implementation of sparse Gaussian elimination, ACM Transactions on Mathematical Software, 8 (1982), pp. 256–276.
[31] R. Schreiber, An assessment of the Connection Machine, in Scientific Applications of the Connection Machine, H. Simon, ed., World Scientific, Singapore, 1991.
[32] H. Simon, P. Vu, and C. Yang, Performance of a supernodal general sparse solver on the Cray Y-MP, Tech. Report SCA–TR–117, Boeing Computer Services,
[33] B. Speelpenning, The generalized element method, Tech. Report UIUCDCS–R–78–946, University of Illinois,
[34] Thinking Machines Corporation, Cambridge, Massachusetts, Paris reference manual, version
[35] E. Zmijewski, Sparse Cholesky Factorization on a Multiprocessor, PhD thesis, Cornell University, 1987.