HIGHLY PARALLEL SPARSE CHOLESKY FACTORIZATION

JOHN R. GILBERT* AND ROBERT SCHREIBER†

Abstract. We develop and compare several fine-grained parallel algorithms to compute the Cholesky factorization of a sparse matrix. Our experimental implementations are on the Connection Machine, a distributed-memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special-purpose algorithms in which the matrix structure conforms to the connection structure of the machine, our focus is on matrices with arbitrary sparsity structure. The most promising alternative is a supernodal, multifrontal algorithm whose inner loop performs several dense factorizations simultaneously on a two-dimensional grid of processors. A fine-grained parallel, dense factorization algorithm is used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, we conclude that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems.

We also present a performance model and use it to analyze our algorithms. We find that asymptotic analysis combined with experimental measurement of parameters is accurate enough to be useful in choosing among alternative algorithms for a complicated problem.

Key words. sparse matrix algorithms, Cholesky factorization, supernodal factorization, multifrontal factorization, systems of linear equations, parallel computing, data parallel algorithms, chordal graphs, clique trees, Connection Machine, performance analysis.

AMS(MOS) subject classifications. 05C50, 15A23, 65F05, 65F50, 68M20.

1. Introduction.

1.1. Data parallelism. Highly parallel, local memory computer architectures promise to achieve high performance inexpensively by assembling a large amount of simple hardware in a way that scales without bottlenecks. A data parallel programming model simplifies the programming of local memory parallel architectures by associating a processor with every data element in a computation (at least conceptually), thus hiding from the programmer the details of distribution of data and work to the processors.

Some major challenges come along with these promises. Communication is expensive relative to computation, so an algorithm must minimize communication. Locality is important in communication, so it pays to substitute communication with nearby processors for more general patterns where possible. The sequential programmer tunes the inner loop of an algorithm for high performance, but simple data parallel algorithms tend to have "everything in the inner loop" because a sequential loop over the data is typically replaced by a parallel operation. (From this point of view, the advantage of the more complicated of our two algorithms, Grid Cholesky, is that it removes the general-pattern communication from the inner loop.)

Algorithms for data parallel architectures must make different trade-offs than sequential algorithms: they must exploit regularity in the data. For efficiency on SIMD machines, they must also be highly regular in the time dimension. In some cases entirely new approaches may be appropriate; examples of experiments with such approaches include particle-in-box flow simulation, knowledge base maintenance [5], and the entire field of neural computation [20]. On the other hand, the same kind of regularity in a problem or an algorithm can often be exploited in a wide range of architectures; therefore, many ideas from sequential computation turn out to be surprisingly applicable in the highly parallel domain. For example, block-oriented matrix operations are useful in sequential machines with hierarchical storage and conventional vector supercomputers [3]; we shall see that they are also crucial to efficient data parallel matrix algorithms.

1.2. Goals of this study. Data parallel algorithms are natural for computations on matrices that are dense or have regular nonzero structures arising from, for example, regular finite difference discretizations. The main goal of this research is to determine whether data parallelism is useful in dealing with irregular, arbitrarily structured problems. Specifically, we consider computing the Cholesky factorization of an arbitrary sparse, symmetric, positive definite matrix. We will make no assumptions about the nonzero structure of the matrix besides symmetry. We will present evidence that arbitrary sparse problems can be solved nearly as efficiently as dense problems by carefully exploiting regularities in the nonzero structure of the triangular factor that come from the clique structure of its chordal graph.

A second goal is to perform a case study in analysis of parallel algorithms. The analysis of sequential algorithms and data structures is a mature and useful science that has contributed to sparse matrix computation for many years. By contrast, the study of complexity of parallel algorithms is in its infancy, and it remains to be seen how useful parallel complexity theory will be in designing efficient algorithms for real parallel machines. We will argue by example that, at least within a particular class of parallel architectures, asymptotic analysis combined with experimental measurement of parameters is accurate enough to be useful in choosing among alternative algorithms for a single fairly complicated problem.

* Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, California 94304. Copyright © 1990, 1991 Xerox Corporation. All rights reserved.
† Research Institute for Advanced Computer Science, MS T045-1, NASA Ames Research Center, Moffett Field, CA 94035. This author's work was supported by the Numerical Aerodynamic Simulation Systems Division of NASA and by DARPA via Cooperative Agreement NCC2-387 between NASA and the Universities Space Research Association.
1.3. Outline. The structure of the remainder of the paper is as follows. In Section 2 we review the definitions we need from numerical linear algebra and graph theory, sketch the architecture of the Connection Machine, and present a timing model for a generalized data parallel computer that abstracts that architecture.

In Section 3 we present the first of two parallel algorithms for sparse Cholesky factorization. The algorithm, which we call Router Cholesky, is based on a theoretically efficient algorithm in the PRAM model of parallel computation. We analyze the algorithm and point out two reasons that it fails to be practical, one having to do with communication and one with processor utilization.

In Section 4 we present a second algorithm, which we call Grid Cholesky. Grid Cholesky is a data parallel implementation of a supernodal, multifrontal method that draws on the ideas of Duff and Reid [7] and Ashcraft et al. [1]. It improves on Router Cholesky by using a two-dimensional grid of processors to operate on dense submatrices, thus replacing most of the slow generally-routed communication of Router Cholesky with faster grid communication. It also solves the processor utilization problem by assigning different data elements to the working processors at different stages of the computation. We present an analysis and experimental results for a pilot implementation of Grid Cholesky on the Connection Machine.

The pilot implementation of Grid Cholesky is approximately as efficient as a dense Cholesky factorization algorithm, but is still slow compared to the theoretical peak performance of the machine. Several steps necessary to improve the absolute efficiency of the algorithm, most of which concern efficient Cholesky factorization of dense matrices, are described. Finally we draw some conclusions and discuss avenues for further research.

2. Definitions. The first two subsections below summarize the linear algebra and graph theory needed to study sparse Cholesky factorization. Most of this material is covered in more detail by George and Liu [13, 23, 24]. The remainder of the section outlines the data parallel programming model and its implementation on the Connection Machine, and describes our parametric model of a data parallel computer.

2.1. Linear algebra. Let A = (aij) be an n × n real, symmetric, positive definite sparse matrix. There is a unique n × n lower triangular matrix L = (lij) with positive diagonal such that

    A = L L^T.

This is the Cholesky factorization of A. We seek to compute L; with it we may solve the linear system Ax = b by solving Ly = b and L^T x = y. We will discuss algorithms for computing L below. In general, L is less sparse than A. The nonzero positions of L that were zero in A are called fill or fill-in. For any matrix X, we write η(X) to denote the number of nonzero elements of X.

The rows and columns of A may be symmetrically reordered so that the system solved is

    P A P^T (Px) = Pb,

where P is a permutation matrix. We assume that such a reordering, chosen to reduce η(L) and the number of operations required to compute L, has been done. We further assume that the structure of L has been determined by a symbolic factoring process. We ignore these preliminary computations in this study because the cost of actually computing L typically dominates. (In many cases, several identically structured matrices may be factored using the same ordering and symbolic factorization.) A study of the implementation of appropriate reordering and symbolic factorization procedures on data parallel architectures is in preparation [18].

If the matrix A is such that its Cholesky factor L has no more nonzeros than the lower triangle of A, i.e. there is no fill, then A is a perfect elimination matrix. If P A P^T is a perfect elimination matrix for some permutation matrix P, we call the ordering corresponding to P a perfect elimination ordering of A.

Let R and S be subsets of {1, ..., n}. Then A(R, S) is the |R| × |S| matrix whose elements are a(r,s) for r ∈ R and s ∈ S. (For any set S, we write |S| to denote its cardinality.)

2.2. Graph theory.

2.2.1. Vertex elimination. We associate two ordered, undirected graphs with the sparse, symmetric matrix A. First, G(A), the graph of A, is the graph with vertices {1, 2, ..., n} and edges E(A) = {(i, j) | aij ≠ 0}. (Note that E(A) is a set of unordered pairs.) Next, we define the filled graph, G⁺(A), with vertices {1, 2, ..., n} and edges

    E⁺(A) = {(i, j) | lij ≠ 0},

so that G⁺(A) is G(L + L^T). The edges in G⁺(A) that are not edges of G(A) are called fill edges. The output of a symbolic factorization of A is a representation of G⁺(A).

For every fill edge (i, j) in E⁺(A) there is a path in G(A) from vertex i to vertex j whose vertices all have numbers lower than both i and j; moreover, for every such path in G(A) there is an edge in G⁺(A) [28]. Consider renumbering the vertices of G⁺(A). With another numbering, this last property may or may not hold. If it does, then the new ordering is a perfect elimination ordering of G⁺(A); Liu [24] calls it an equivalent reordering of G⁺(A).

2.2.2. Chordal graphs. Every cycle of more than three vertices in G⁺(A) has an edge between two nonconsecutive vertices (a chord). Such a graph is said to be chordal. Not only is G⁺(A) chordal for every A, but every chordal graph is the filled graph of some matrix [27].

Let G = G(V, E) be any undirected graph. A clique is a subset X of V such that for all u, v ∈ X, (u, v) ∈ E. A clique is maximal if it is not a proper subset of any other clique. For any v ∈ V, the neighborhood of v, written adj(v), is the set {u ∈ V | (u, v) ∈ E}. The monotone neighborhood of v, written madj(v), is the smaller set {u ∈ V | u > v, (u, v) ∈ E}. We extend adj and madj to sets of vertices in the usual way.

A vertex v is simplicial if adj(v) is a clique. Two vertices, u and v, are indistinguishable if {u} ∪ adj(u) = {v} ∪ adj(v). Two vertices are independent if there is no edge between them. A set of vertices is independent if every pair of vertices in it is independent; two sets A and B are independent if no vertex of A is adjacent to a vertex of B.

It is immediate that any two simplicial vertices are either independent or indistinguishable. A set of indistinguishable simplicial vertices thus forms a clique, though not in general a maximal clique. The equivalence relation of indistinguishability partitions the simplicial vertices into pairwise independent cliques. We call these the simplicial cliques of the graph.

2.2.3. Elimination trees. A fundamental tool in studying sparse Gaussian elimination is the elimination tree. This structure was defined by Schreiber [30]; Liu [24] gives a survey of its many uses. Let A have the Cholesky factor L. The elimination tree T(A) is a rooted spanning forest of G⁺(A) defined as follows. If vertex u has a higher-numbered neighbor v, then the parent p(u) of u in T(A) is the smallest such neighbor; otherwise u is a root. In other words, the first off-diagonal nonzero element in the uth column of L is in row p(u). It is easy to show that T(A) is a forest consisting of one tree for each connected component of G(A). For simplicity we shall assume in what follows that A is irreducible, so that vertex n is the only root, though our algorithms do not assume this.

There is a monotone increasing path in T(A) from every vertex to the root. If (u, v) is an edge of G⁺(A) with u < v (that is, if lvu ≠ 0) then v is on this unique path from u to the root. This means that when T(A) is considered as a spanning tree of G⁺(A), there are no "cross edges" joining vertices in different subtrees. It implies that, if we think of the vertices of T(A) as columns of A or L, any given column of L depends only on columns that are its descendants in the tree.

2.2.4. Clique trees. The elimination tree describes a Cholesky factorization in terms of operations on single columns. A description in terms of operations on full blocks can yield algorithms with better locality of reference, which is an advantage either on a machine with a memory hierarchy (registers, cache, main memory, disk) or on a distributed-memory parallel machine. The Connection Machine falls into both of these categories.

Describing symmetric factorization in terms of operations on full submatrices is the key idea in both Duff and Reid's "multifrontal" algorithm [7] and the "supernodal" algorithm of Ashcraft, Grimes, Lewis, Peyton, and Simon [1], which can be traced back to the so-called element model of factorization [15, 33]. A full submatrix of L is a clique in the chordal graph G⁺(A). The clique structure of chordal graphs has been explored extensively in the combinatorial literature; representations of chordal graphs as trees of cliques date back at least to 1972 [10] and continue to be used [16, 25, 26].

A clique tree for matrix A is a tree whose nodes are sets that partition the vertices of G⁺(A) into cliques, in such a way that all the vertices of a node N are indistinguishable simplicial vertices in the graph that results by deleting from G⁺(A) all vertices in nodes that are proper descendants of N. An equivalent definition is to think of starting with an elimination tree and collapsing vertices that are indistinguishable from their parents. (This definition differs slightly from that of Peyton [26], whose tree nodes are overlapping maximal cliques of G⁺(A).)

A clique which is a node in a clique tree is also called a supernode. If all vertices of G⁺(A) in proper descendants of some supernode are deleted, the supernode becomes a simplicial clique in the resulting graph. The clique tree is sometimes called a supernode tree or supernodal elimination tree [2]. A matrix may have many different clique trees; indeed, the elimination tree itself is one. Our numerical factorization algorithm Grid Cholesky can actually use any clique tree; the symbolic factorization we describe in Section 4.1 uses a block Jess and Kees algorithm to compute a shortest possible clique tree.

2.3. The Connection Machine.

2.3.1. Architecture. The Connection Machine (model CM-2) is a local memory, SIMD parallel computer. The description we present here corresponds to the machine architecture presented by the assembly language instruction set Paris [34]. We programmed the CM in *Lisp, which is compiled into Paris.

A full-sized CM has 2^16 = 65,536 processors, each of which can directly access 65,536 bits of memory. (Since this work was done, larger memories have become available.) The processors are connected by a communication network called the router, which is configured by a combination of microcode and hardware to be a 16-dimensional hypercube.

The essential feature of the CM programming model is the parallel variable or pvar. A pvar is an array of data in which every processor stores and manipulates one element. The size of a pvar may be a multiple of the number of physical machine processors, p. If there are v times as many elements in the pvar x as there are processors, then (through microcode) each physical processor simulates v virtual processors; thus the programmer's view remains "one processor per datum." The ratio v is called the virtual processor (VP) ratio. The CM requires that v must be a power of two. Thus we will find useful the notation ⌈x⌉, meaning the smallest power of two not smaller than the real number x.

The geometry of each set of virtual processors (and its associated pvars) is also determined by the programmer, who may choose to view it as any multidimensional array with dimensions that are powers of two. The VP sets and their pvars are embedded in the machine (using Gray codes) so that neighboring virtual processors are simulated by the same or neighboring physical processors.
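The parent rule of Section 2.2.3 (p(u) is the row of the first off-diagonal nonzero in column u of L) can be computed directly from the structure of A, without forming L, by Liu's well-known path-compression algorithm. The following Python sketch is illustrative only; the function name and input format are ours, not the paper's, and vertices are numbered 0..n−1 with −1 marking a root.

```python
def etree(struct):
    """Elimination-tree parents from the strict lower triangle of A.

    struct[j] lists the rows i < j with a[j][i] != 0 (row j of the strict
    lower triangle).  Uses Liu's algorithm: climb compressed "virtual
    ancestor" links from each neighbor i up to the current root and make
    column j its parent.  Runs in nearly linear time in the input size.
    """
    n = len(struct)
    parent = [-1] * n
    ancestor = [-1] * n            # path-compressed ancestor links
    for j in range(n):
        for i in struct[j]:
            r = i
            while ancestor[r] != -1 and ancestor[r] != j:
                nxt = ancestor[r]
                ancestor[r] = j    # compress the path toward j
                r = nxt
            if ancestor[r] == -1:  # r is a root: j becomes its parent
                ancestor[r] = j
                parent[r] = j
    return parent
```

On the 4-vertex graph with edges (0,1), (0,2), (1,3), eliminating vertex 0 creates the fill edge (1,2), so the tree is the path 0→1→2→3 even though (1,2) is not an edge of A.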
The Paris instruction set corresponds fairly well to the abstract model of data parallel programming that the CM attempts to present to the programmer, but it does not correspond closely to the actual hardware of the CM. Largely for this reason, it is hard to get high performance when programming in Paris or in a language that is compiled into Paris [31]. We shall go into this point in detail later. There are other ways to view and to program the hardware of the CM-2 that can provide better performance. These are just now becoming available to users, but were not when this work was done.

2.3.2. Connection Machine programming. Parallel computation on the CM is expressed through elementwise binary operations on pairs of pvars that reside in the same VP set, that is, have the same VP ratio and layout on the machine. (Optionally, one may specify a boolean mask that selects only certain virtual processors to be active.) These operations take time proportional to v, since the actual processors must loop over their simulated virtual processors. This remains true even when the set of selected processors is very sparse.

Interprocessor communication is expressed and accomplished in three ways, which we discuss in order of increasing generality but decreasing speed.

Communication with virtual processors at nearest neighbor grid cells is most efficient. A pvar may be shifted along any of its axes using this type of communication. The shift may be circular or end-off at the programmer's discretion.

A second communication primitive, scan, allows broadcast of data. For example, if x is a one-dimensional pvar with the value [1, 2, 3, 4, 5, 6, 7, 8] then a scan of x yields [1, 1, 1, 1, 1, 1, 1, 1]. Scans are implemented using the hypercube connections. The time for a scan of length n is linear in log n. Scans can also be used to broadcast along either rows or columns of a two-dimensional array. Scans that perform parallel prefix arithmetic operations are also available, but we do not use them.

Scans of subarrays are possible. In a segmented scan, the programmer specifies a boolean pvar, the segment pvar, congruent to x. The segments of x between adjacent T values in the segment pvar are scanned independently. Thus, for example, if we use the segment pvar [T F F F T F F T] and x is as above, then a segmented scan returns [1, 1, 1, 1, 5, 5, 5, 8].
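The copy-scan semantics just described can be modeled serially in a few lines of Python (an illustrative sketch, ours; the real CM primitive runs in O(log n) steps on the hypercube rather than serially):

```python
def copy_scan(x, segment=None):
    """Broadcast the first element of each segment to the rest of it.

    segment is a boolean list congruent to x whose True entries mark
    segment starts; with no segment pvar the whole array is one segment.
    """
    if segment is None:
        segment = [True] + [False] * (len(x) - 1)
    out, current = [], None
    for xi, start in zip(x, segment):
        if start or current is None:
            current = xi           # new segment: new value to broadcast
        out.append(current)
    return out
```

Running it on the text's example reproduces both the unsegmented scan [1, 1, 1, 1, 1, 1, 1, 1] and the segmented result [1, 1, 1, 1, 5, 5, 5, 8].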
The third and most general form of communication allows a virtual processor to access data in the memory of any other virtual processor. These operations go by several different names even within the CM environment; we shall refer to them in terms already familiar in sparse matrix computation: gather and scatter.

A gather allows processors to read data in the memory of other processors. The CM term for a gather is pref!! (for the *Lisp programmer) or get (for the Paris programmer). In a gather, three pvars are involved: the source, the destination, and the address. The address of the processor whose memory is to be read is taken from the integer address pvar. Suppose the source is the one-dimensional pvar [15, 14, 13, ..., 2, 1, 0] and the address pvar is [0, 1, 2, 0, 1, 2, ..., 0, 1, 2, 0]. Then the data stored in the destination is [15, 14, 13, 15, 14, 13, ..., 15, 14, 13, 15]. The Fortran-90 or Matlab statement that accomplishes this is dest = source(address); it performs the assignment dest(i) ← source(address(i)) for 1 ≤ i ≤ length(dest).

A scatter allows processors to write data to the memory of other processors. The CM term for a scatter is *pset (for the *Lisp programmer) or send (for the Paris programmer). Again the three pvars are a source, a destination, and an integer address. The Fortran-90 or Matlab version is dest(address) = source, and the effect is dest(address(i)) ← source(i) for 1 ≤ i ≤ length(source).
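Gather and combining scatter can likewise be modeled serially; the sketch below (ours, for illustration) implements exactly the semantics stated in the text, including the rule that a destination receiving no colliding values is left unchanged.

```python
def gather(source, address):
    """dest(i) <- source(address(i)): each processor reads another's memory."""
    return [source[a] for a in address]

def scatter_add(source, address, dest):
    """Scatter-with-add: the sum of all source(j) with address(j) = k
    replaces dest(k); a dest(k) that receives nothing is unchanged."""
    sums = {}
    for s, a in zip(source, address):
        sums[a] = sums.get(a, 0) + s
    return [sums.get(k, d) for k, d in enumerate(dest)]
```

With the 16-element source [15, 14, ..., 0] and address [0, 1, 2, 0, 1, 2, ...] from the text, gather yields [15, 14, 13, 15, 14, 13, ...], and scatter_add into a destination of all ones yields [45, 40, 35, 1, 1, ...].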
In a scatter, when the address pvar has duplicate values, data from several source processors are sent to the same destination processor. (When this happens we say that a collision has occurred.) The programmer can select one of several different ways to combine colliding values by specifying a combining operator. For example, if the source and address are as above and the destination initially has the value [1, 1, ..., 1], then after a scatter-with-add the destination has the value [45, 40, 35, 1, 1, 1, ..., 1]. The sum of elements source(j) such that address(j) = k is stored in dest(k) if there are any such elements; otherwise dest(k) is unchanged. Other combining operators include product, maximum, minimum, and "error". We have found combining scatter to be a powerful aid to data parallel programming.

2.3.3. Measured CM performance. We will develop a model of performance on data parallel architectures and use it to analyze performance of several algorithms for sparse Cholesky factorization. The essential machine characteristics in the model are described by five parameters:

    μ   the memory reference time for a 32-bit word;
    τ   the 32-bit floating-point multiply time, in units of μ;
    σ   the 32-bit scan time, in units of μ;
    ν   the 32-bit grid ("news") communication time, in units of μ;
    ρ   the 32-bit router time, in units of μ.

(Floating-point addition takes the same time as multiplication, τ.) In our model, execution time scales linearly with VP ratio, which is essentially correct for the Connection Machine. Therefore μ is proportional to the VP ratio V, and the other parameters are independent of VP ratio. In Table 1, we give measured values for these parameters obtained by experiment on the CM-2.

                                   Table 1
                            Parameters of CM model
                      (Connection Machine parametric model)

    Parameter  Description                          Measured CM-2 value
    V          Virtual processor ratio
    μ          Memory reference time                4.8 V μsec
    τ          Multiply or add time                 7
    σ          Scan time                            6.2 + 1.2 log2(scan distance)
    ν          News (grid) time                     3
    ρ          Route time                           scatter (no collisions): 64
                                                    scatter-add (4 collisions): 110
                                                    scatter-add (100 collisions): 200
                                                    gather (many collisions): 430

We observe that router times range over a factor of four depending on the number of collisions; it is possible to design pathological routing patterns that perform even worse than this. For any given pattern, gather usually takes just twice as long as scatter, presumably because it is implemented by sending a request and then sending a reply. In our approximate analyses, therefore, we generally choose a value of ρ for scatter corresponding to the number of collisions observed, and model gather as taking twice that value, 2ρ.
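As a concrete reading of Table 1: a cost quoted in units of the memory reference time converts to absolute time by multiplying by 4.8V μsec, with V rounded up to a power of two as the machine requires. A small Python sketch (helper names and argument conventions are ours, not the paper's):

```python
def ceil_pow2(x):
    """The paper's ceiling notation: smallest power of two >= x."""
    v = 1
    while v < x:
        v *= 2
    return v

def op_time(cost_in_mu, nelems, p, mu0=4.8e-6):
    """Absolute time (seconds) of one CM operation whose Table 1 cost is
    given in units of the memory reference time.  That unit scales with
    the VP ratio V = ceil_pow2(nelems / p); mu0 is the measured 4.8 usec
    per unit of VP ratio."""
    V = ceil_pow2(nelems / p)
    return cost_in_mu * mu0 * V
```

For the experiment reported in Section 3.4 (48,608 factor nonzeros on 8,192 processors, so V = 8), one scatter-add with many collisions (cost 200) costs about 7.7 milliseconds by this estimate.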
3. Router Cholesky. Our first parallel Cholesky factorization algorithm is a parallel implementation of a standard column-oriented sparse Cholesky; it is based closely on Gilbert and Hafsteinsson's theoretically efficient algorithm [17] for the PRAM model of computation. Its communication requirements are too unstructured for it to be very efficient on a fine-grained multiprocessor like the CM, but we implemented and analyzed it to use as a basis for comparison and to help tune our performance model of the CM.

3.1. The Router Cholesky algorithm. Router Cholesky uses the elimination tree T(A) to organize its computation. For the present, assume that both the tree and the symbolic factorization G⁺(A) are available. (In our experiments we computed the symbolic factorization sequentially; Gilbert and Hafsteinsson [17] describe a parallel algorithm.) Each vertex of the tree corresponds to a column of the matrix.

Following George et al. [12], we express a sequential column-oriented Cholesky factorization in terms of two primitive operations, cdiv and cmod:
    procedure sequentialcholesky(matrix A);
        for j := 1 to n do
            for each edge (i, j) of G⁺(A) with i < j do
                cmod(j, i) od;
            cdiv(j) od
    end sequentialcholesky;

Routine cdiv(j) divides the subdiagonal elements of column j by the square root of the diagonal element in that column, and routine cmod(j, i) modifies column j by subtracting a multiple of column i. This is called a left-looking algorithm because column j accumulates all necessary updates cmod(j, i) from columns to its left just before the cdiv(j) that completes its computation. By contrast, a right-looking algorithm would perform all the updates cmod(j, i) using column i immediately after the cdiv(i).

Now consider the elimination tree T(A). A given column (vertex) is modified only by columns (vertices) that are its descendants in the tree [30]. Therefore a parallel left-looking algorithm can compute all the leaf vertex columns at once.

    procedure routercholesky(matrix A);
        for h := 0 to height(n) do
            for each edge (i, j) with height(i) < height(j) = h pardo
                cmod(j, i) od;
            for each vertex j with height(j) = h pardo
                cdiv(j) od
        od
    end routercholesky;

Here height(j) is the length of the longest path in T(A) from vertex j to a leaf. Thus the leaves have height 0, vertices whose children are all leaves have height 1, and so forth. The outer loop of this algorithm works sequentially from the leaves of the elimination tree up to the root. At each step, an entire level's worth of cmod's and cdiv's are done.

A processor is assigned to every nonzero of the triangular factor (or, equivalently, to every edge and vertex of the filled graph G⁺(A)). Suppose processor Pij is assigned to the nonzero that is initially aij and will eventually become lij. (If lij is a fill, then aij is initially zero; recall that we assume that the symbolic factorization is already done, so we know which lij will be nonzero.) In the parallel cdiv(j), processor Pjj computes ljj as the square root of its element, and sends ljj to processors Pij for i > j, which then divide their own nonzeros by ljj. In the parallel cmod(j, i), processor Pji sends the multiplier lji to the processors Pki with k > j. Each such Pki then computes the update lki·lji locally and sends it to Pkj to be subtracted from lkj.

We call this a left-initiated algorithm because the multiplications in cmod(j, i) are performed by the processors in column i, who then, on their own initiative, send these updates to a processor in column j. Each column i is involved in at most one cmod at a time because every column modifying j is a descendant of j in T(A), and the subtrees rooted at vertices of any given height are disjoint. Therefore each processor participates in at most one cmod or cdiv at each parallel step. If we ignore the time taken by communication (including the time to combine updates to a single Pkj that may come from different Pki1, Pki2, ...), then each parallel step takes a constant amount of time and the parallel algorithm runs in time proportional to the height of the elimination tree T(A).

3.2. CM implementation of Router Cholesky. To implement Router Cholesky on the CM we must specify how to assign data to processors, and then describe how to do the communication in cdiv and cmod.

We use one (virtual) processor for each nonzero in the triangular factor L. We lay out the nonzeros in a one-dimensional array in column major order, as in many standard sequential sparse matrix algorithms [13]. This makes operations within a single column efficient because they can use the CM scan instructions. Each column is represented by a processor for its diagonal element followed by a processor for each sub-diagonal nonzero. The symmetric upper triangle is not stored. We can also think of this processor assignment as a processor for each vertex j of the filled graph, followed by a processor for each edge (i, j) with i > j.

Router Cholesky, like many data parallel algorithms, is profligate of parallel variable storage. Each processor contains some working storage and the following pvars:

    l           Element of factor matrix L, initially A.
    i           Row number of this element.
    j           Column number of this element.
    iht         height(i) in T(A).
    jht         height(j) in T(A).
    diagonalp   Boolean: Is this a diagonal element?
    eparent     In processor Pij, a pointer to P(i, p(j)).
    nextupdate  Pointer to next element this one may update.

(Recall that p(j) > j is the elimination tree parent of vertex j < n.)

At each stage of the sequential outer loop, each processor uses iht and jht to decide whether it participates in a cdiv or cmod. By comparing the local processor's iht or jht to the current value of the outer loop index, a processor can determine if its element is in a column that is involved in a cdiv or a cmod.

The cdiv uses a scan operation to copy the diagonal element to the rest of the active column. The cmod uses a similar scan to copy the multiplier lji down to the rest of column i. To figure out where to send the update, each element maintains a pointer called nextupdate to a later element in its row. The nonzero positions in each row are a connected subgraph of the elimination tree, and are linked together in this tree structure by the eparent pointers. Each nonzero updates only elements in columns that are its ancestors in the elimination tree. At each stage, nextupdate is moved one step up the tree using the eparent pointers. The actual update is done by a scatter-with-add, which uses the router to send the update to its destination.

3.3. Router Cholesky performance: Theory. Each stage of Router Cholesky does a constant number of router operations, scans, and arithmetic operations. The number of stages is h + 1, where h is the height of the elimination tree. In terms of the parameters of the machine model in Section 2.3.3, its running time is

    (c1 ρ + c2 σ + c3) h μ.

Recall that the memory reference time μ is proportional to the virtual processor ratio, which in this case is ⌈η(L)/p⌉. The ci are constants. The most time-consuming step is incrementing the nextupdate pointer; the router is used twice in this operation. The dominant term is the router term c1 ρ h μ. Notice that we do not explicitly count time for combining updates to the same element from different sources, since this is handled within the router and is thus included in ρ.

To get a feeling for this analysis consider the model problem, a 5-point finite difference mesh in two dimensions ordered by nested dissection [11]. If the mesh has k points on a side, then the graph is a k by k square grid, and we have n = k², h = O(k), and η(L) = O(k² log k). The number of arithmetic operations in the Cholesky factorization is O(k³), in either the sequential or parallel algorithms. Router Cholesky's running time is O(k³ log k / p). If we define performance as the ratio of the number of operations in the sequential algorithm to parallel time, we find that the performance is O(p / log k) (taking ρ to be a constant independent of p or k; this is approximately correct for the Connection Machine, although theoretically ρ should grow at least with log p). This analysis points out two weak points of Router Cholesky. First, the performance on the model problem drops with increasing problem size. (This depends on the problem, of course; for a three-dimensional model problem a similar analysis shows that performance is O(p) regardless of problem size.) More seriously, the constant in the leading term of the complexity is proportional to the router time, because every step uses general communication.

This analysis can be extended to any two-dimensional finite element mesh with bounded node degree, ordered by nested dissection [17]. The asymptotic analysis is the same but the values of the constants will be different.

3.4. Router Cholesky performance: Experiments. In order to validate the timing model and analysis, we experimented with Router Cholesky on a variety of sparse matrices. We present one example here in detail. The original matrix is 2500 × 2500 with 7400 nonzeros (counting symmetric entries only once), representing a 50 × 50 five-point square mesh. It is preordered by Sparspak's automatic nested dissection heuristic [13], which gives orderings very similar to the ideal nested dissection ordering used in the analysis of the model problem above. The Cholesky factor has η(L) = 48,608 nonzeros, an elimination tree of height h = 144, and takes 1,734,724 arithmetic operations to compute.

We ran this problem on CM-2's at the Xerox Palo Alto Research Center and the NASA Ames Research Center. The results quoted here are from p = 8,192 processors, with floating point coprocessors, of the machine at NASA. The VP ratio was therefore V = ⌈η(L)/p⌉ = 8. (Rounding up to a power of two has considerable cost here, since we use only 48,608 of the 65,536 virtual processors.) We observed a running time of 53 seconds, of which about 41 seconds was due to gathers and scatters. Substituting into the analysis above (using ρ = 200 since there were in general many collisions), we would predict router time c1 ρ h μ = 39 seconds and other time (c2 σ + c3) h μ = 1.5 seconds. This is not a bad fit for router time; it is not clear why the remaining time is such a poor fit, but the expensive square root and the data movement involved in the pointer updates contribute to it, and it seems that I/O may have affected the measured 53 seconds. The observation, in any case, is that router time completely dominates.

3.5. Remarks on Router Cholesky. Router Cholesky is too slow as it stands to be a cost-effective way to factor sparse matrices. Each stage does two gathers and a scatter with exactly the same communication pattern. More careful use of the router could probably speed it up by a factor of two to five. However, this would not be enough to make it practical; something more like a hundredfold improvement in router speed would be needed.

The one advantage of Router Cholesky is the extreme simplicity of its code. It is no more complicated than the numeric factorization routine of a conventional sequential sparse Cholesky package [13]. It is interesting to note that column-oriented sparse Cholesky codes on MIMD message-passing multiprocessors [12, 14, 35] are more complex. They exploit MIMD capability to implement dynamic scheduling of the cmod and cdiv tasks. They allow arbitrary assignment of columns to processors and therefore are required to use indirect addressing of columns. Finally, they are written with low-level communication primitives, the explicit "send" and "receive."

Router Cholesky's simplicity comes dearly. Flexibility in scheduling allows MIMD implementations to gain a modest performance advantage over any possible SIMD implementation. More important, we employ η(L) virtual processors, regardless of the number of physical processors. It is essential that these virtual processors not all sit idle, consuming physical processor time slices, when there is nothing for them to do. As implemented by the Paris instruction set, they do sit idle.

We described Router Cholesky as a left-initiated, left-looking algorithm. In a right-initiated algorithm, processor Pij would perform the updates to lij. In a right-looking algorithm, updates would be applied as soon as the updating column of L was computed instead of immediately before the updated column of L was to be computed. Router Cholesky is thus one of four cousins. It is the only one of the four that maps operations to processors evenly; the other three alternatives require an inner sequential loop of some kind. All four versions require at least h router operations.
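The cdiv/cmod formulation of Section 3.1 can be rendered as a short serial sketch (ours, not the paper's code) on a dense row-major matrix; the sparse algorithm is the same except that the inner loop runs only over edges of the filled graph.

```python
import math

def cdiv(L, j):
    """Take the square root of the diagonal and divide column j by it."""
    L[j][j] = math.sqrt(L[j][j])
    for i in range(j + 1, len(L)):
        L[i][j] /= L[j][j]

def cmod(L, j, i):
    """Subtract L[j][i] times column i from column j (rows j..n-1)."""
    for k in range(j, len(L)):
        L[k][j] -= L[k][i] * L[j][i]

def left_looking_cholesky(A):
    """Left-looking Cholesky: column j accumulates all its cmod updates
    from columns to its left just before the cdiv(j) that completes it."""
    n = len(A)
    L = [[A[r][c] if c <= r else 0.0 for c in range(n)] for r in range(n)]
    for j in range(n):
        for i in range(j):     # sparse version: only i with L[j][i] != 0
            cmod(L, j, i)
        cdiv(L, j)
    return L
```

For A = [[4, 2], [2, 5]] this produces L = [[2, 0], [1, 2]], and indeed L·Lᵀ recovers A.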
4.GridCholesky.Inthissectionwepresentaparallelsupernodal,multifrontal CholeskyalgorithmanditsimplementationontheCM.Multifrontalmethods,introducedbyDuandReid[7],computeasparseCholeskyfactorizationbyperforming symmetricupdatesonasetofdensesubmatrices.wefollowliu[23]inreferring toanalgorithmthatusesrank-1updates\multifrontal"andtheblockversionthat usesrank-kupdates\supernodalmultifrontal."theideaofusingblockmethods tooperateonsupernodeshasbeenusedinmanydierentsparsefactorizationalgorithms[1,7].parallelsupernodalormultifrontalalgorithmshavebeenusedonmimd message-passingandshared-memorymachines[2,6,32]. Thealgorithmusesatwo-dimensionalVPset(whichwecallthe\playingeld") topartiallyfactor,inparallel,anumberofdenseprincipalsubmatricesofthepartially factoredmatrix.byworkingontheplayingeld,wemayusethefastgridandscan mechanismsforallthenecessarycommunicationduringthefactorizationofthedense submatrices.onlywhenweneedtomovethesedensesubmatricesbackandforthto theplayingelddoweneedtousetherouter.inthiswaywedrasticallyreducethe useoftherouter:forthemodelproblemonakkgridwereducethenumberof usesfromh=3k?1to2log2k?1.theplayingeldcanalsooperateatalowervp ratioingeneralbecauseitdoesnotneedtostoretheentirefactoredmatrixatonce. Thismethod,likeallmultifrontalmethods,isinessencean\outofcore"method inthatthecholeskyfactoriskeptinadatastructurethatisnotreferredtowhile doingarithmetic,allofwhichisdoneondensesubmatrices.thenoveltyhereisthe factorizationofmanyofthesedenseproblemsinparallel;thesimultaneousexploitationoftheparallelismavailablewithineachofthedensefactorizations;theuseofa two-dimensionalgridofprocessorsfortheseparalleldensefactorizations;theuseof themachine'srouterforparalleltransfersfromthematrixstoragedatastructure;and 12
the use of the combining scatter operations for parallel update of the matrix storage data structure.

4.1. The Grid Cholesky algorithm.

4.1.1. Block Jess and Kees reordering. First we describe an equivalent reordering of the chordal graph G = G(A) that we call the block Jess/Kees ordering. Block Jess/Kees is a perfect elimination ordering that has two properties that make it the best equivalent reordering for our purposes: it eliminates vertices with identical monotone neighborhoods consecutively, and it produces a clique tree of minimum height.

Our reordering eliminates all the simplicial vertices of G simultaneously as one major step. In the process, it partitions all the vertices of G into supernodes. Each of these supernodes is a clique in G, and is a simplicial clique when its component vertices are about to be eliminated. Each vertex is labeled with the stage, or major step number, at which it is eliminated. In more detail, the reordering algorithm is as follows.

procedure reorder (graph G)
    activestage <- -1;
    while G is not empty do
        activestage <- activestage + 1;
        Number all the simplicial vertices in G, assigning
            consecutive numbers within each supernode;
        stage(v) <- activestage for all simplicial vertices v;
        Remove all the simplicial vertices from G
    od;
    h <- activestage
end reorder

The cliques are the nodes of a clique tree whose height is h, one less than the number of major elimination steps. The parent of a given clique is the lowest-stage clique adjacent to the given clique.
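The staged elimination in procedure reorder can be sketched concretely. The following Python sketch is ours, not the paper's implementation (the actual codes are in *Lisp on the CM): it represents the graph as adjacency sets, tests each vertex for being simplicial (its neighborhood is a clique), and groups simplicial vertices with identical closed neighborhoods into supernodes, numbering each supernode's members consecutively. It assumes the input graph is chordal, so a simplicial vertex exists at every step and the loop terminates.

```python
# Illustrative sketch of the block Jess/Kees staged reordering (not the
# paper's CM implementation).  A vertex is simplicial when its neighbors
# form a clique; simplicial vertices with identical closed neighborhoods
# form one supernode and receive consecutive numbers.

def reorder(adj):
    """adj: dict vertex -> set of neighbors (chordal graph assumed).
    Returns (order, stage, h): elimination order, stage labels, tree height."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    order, stage, active = [], {}, -1
    while adj:
        active += 1
        # a vertex is simplicial if every pair of its neighbors is adjacent
        simplicial = [v for v in adj
                      if all(u in adj[w]
                             for u in adj[v] for w in adj[v] if u != w)]
        # group by closed neighborhood: each group is one supernode
        groups = {}
        for v in simplicial:
            groups.setdefault(frozenset(adj[v] | {v}), []).append(v)
        for members in groups.values():
            for v in members:                     # consecutive numbers
                order.append(v)
                stage[v] = active
        for v in simplicial:                      # remove eliminated vertices
            for u in adj[v]:
                adj[u].discard(v)
            del adj[v]
    return order, stage, active                   # active == h
```

On a path 1-2-3, the two endpoints are eliminated at stage 0 and the middle vertex at stage 1, giving h = 1; on a triangle, all three vertices form one supernode at stage 0 and h = 0.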
The name "block Jess/Kees" indicates a relationship with an algorithm due to Jess and Kees [21] that finds an equivalent reordering for a chordal graph so as to minimize the height of its elimination tree. The original (or "point") Jess/Kees ordering eliminates just one vertex from each simplicial clique at each major step. (This is a maximum-size independent set of simplicial vertices.) Each step of point Jess/Kees produces one level of an elimination tree, from the leaves up. The resulting elimination tree has minimum height over all perfect elimination orders on G. Our block Jess/Kees eliminates all the simplicial vertices at each major step, producing a clique tree one level at a time from the leaves up. This ordering may not minimize the height of the elimination tree. However, as Blair and Peyton [4] have shown, it does produce a clique tree of minimum height over all perfect elimination orders on G.

Every vertex is included in exactly one supernode. We number the supernodes as {S_1, ..., S_m} in such a way that if i < j then the vertices in S_i have lower numbers than the vertices in S_j. The stage at which a supernode S is eliminated is the iteration of the while loop at which its vertices are numbered and eliminated; thus, for all v in S, stage(v) = stage(S) is the height of node S in the clique tree.

4.1.2. Parallel supernodal multifrontal elimination. Let C be a supernode. It is immediate that K = adj(C) ∪ C is a clique, and that it is maximal. Our factorization algorithm works by forming the principal submatrices of A corresponding
to vertices in the maximal cliques generated by supernodes. Let γ = |C| and δ = |adj(C)|. Write A(K,K) for the principal submatrix of order |K| = γ + δ consisting of elements a_ij with i, j in K. It is natural to partition the principal submatrix A(K,K) of A as

    A(K,K) = [ X    E ]
             [ E^T  Y ]

where X = A(C,C) is γ x γ, Y = A(adj(C), adj(C)) is δ x δ, and E is γ x δ.

In the terminology of the previous section, Grid Cholesky is a "block right-looking" algorithm. The details are as follows.

procedure gridcholesky (matrix A)
    for activestage <- 0 to h do
        for each supernode C with stage(C) = activestage pardo
            Move A(K,K) to the playing field, where K = C ∪ adj(C);
            Set Y to zero on the playing field;
            Perform γ = |C| steps of parallel Gaussian elimination
                to compute the Cholesky factor L of X,
                the updated submatrix E' = L^{-1}E,
                and the Schur complement Y' = -E^T X^{-1} E,
                where X, E, and Y partition A(K,K) as above;
            A(C,C) <- L;
            A(adj(C),C) <- E'^T;
            A(adj(C),adj(C)) <- A(adj(C),adj(C)) + Y'
        od
    od
end gridcholesky

4.2. Multiple dense partial factorization. In order to make this approach useful, we need a fast dense matrix factorization on two-dimensional VP sets. To that end, we discuss an implementation of LU decomposition without pivoting. (We use LU instead of Cholesky here because we can see no efficient way to exploit symmetry with a two-dimensional machine; moreover, LU avoids the square root at each step and so is a bit faster.) We analyzed and implemented two methods: a systolic algorithm that uses only nearest neighbor communication on the grid, and a rank-1 update algorithm that uses row and column broadcast. With either of these methods, all the submatrices A(K,K) corresponding to supernodes at a given stage are distributed about the two-dimensional playing field simultaneously (each as a separate "baseball diamond"), and the partial factorization is applied to all the submatrices at once.

We describe the rank-1 algorithm in terms of its effect on a single submatrix A(K,K), with a supernode of size γ and a Schur complement of size δ. A single rank-1 update consists of a division to compute the reciprocal of the diagonal element, a scan down the columns to copy the pivot row to the remaining rows, a parallel multiplication to compute the multiplier for each row, another scan to copy the multiplier across each row, and finally a parallel multiply and subtract to apply the update. The number of rank-1 updates is γ, the size of the supernode.

An entire stage of partial factorizations is performed at once, using segmented scans to keep each factorization within its own "diamond." The number of steps on the playing field at stage s is γ_s, the maximum value of γ over all supernodes at stage s. Then a stage of rank-1 partial factorization takes time

    (c_3 + c_4 ν_s) γ_s.

Here c_3 is 2σ, and c_4 is proportional to σ as well. The relative costs of the various parts of the rank-1 update code are summarized below, for a complete factorization (that is, one in which δ = 0). The bookkeeping includes nearest-neighbor communication to move three one-bit tokens that control which processors perform reciprocals, multiplications, and so on at each step.

    Scans (row and column broadcast):          79.7%
    News (moving the tokens):                   7.1%
    Multiply (computing multipliers):           5.5%
    Multiply-subtract (Gaussian elimination):   4.8%
    Divide (reciprocal of pivot element):       2.7%

Unlike the rank-1 implementation, the systolic implementation never uses a scan; all communication is between grid neighbors, so its communication terms are proportional to the grid-neighbor time rather than the scan time. This advantage is more than offset by the fact that 3γ_s + 2δ_s sequential iterations are necessary, while the rank-1 method only needs γ_s.

4.2.1. Remarks on dense partial factorization. Theoretically, systolic factorization should be asymptotically more efficient as machine size and problem size
grow without bound, because scans must become more expensive as the machine grows. Realistically, however, the CM happens to have scans about six times as slow as grid communication; for a full factorization (δ = 0) a sixfold decrease in communication time per step more than balances the threefold increase in number of steps, and so the systolic method is somewhat faster. But for a partial factorization the rank-1 algorithm is the clear winner. For example, for the two-dimensional model problem the average Schur complement size δ is about 4γ, so the rank-1 code has an 11-to-1 advantage in number of steps. This more than makes up for the fact that scan is much slower than grid communication.

It is interesting to note that the only arithmetic that matters in a sequential algorithm, the multiply-subtract, accounts for only 1/21 of the total time in the rank-1 parallel algorithm. Moreover, only 1/3 of the multiply-subtract operations actually do useful work, since the active part of the matrix occupies a smaller and smaller subset of the playing field as the computation progresses. This gives the code an overall efficiency of one part in 63 for LU and one part in 126 for Cholesky. We have found this to be typical in *Lisp codes for matrix operations, especially with high VP ratios. The reasons are these: scan is slow relative to arithmetic; the divide and multiply operations occur on very sparse VP sets; and the VP ratio remains constant as the active part of the matrix gets smaller.

Significant performance improvements could come from several possible sources. A more efficient implementation of virtual processors could improve performance substantially: the loop over ν virtual processors should be restricted to those virtual processors that are active. Sometimes a determination that this is going to happen can readily be made at compile time. As an alternative, we could have rewritten the code; the VP set could shrink as the matrix shrinks, and the divide and the multiplies could be performed in a sparser VP set.
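The arithmetic effect of the rank-1 partial factorization of Section 4.2 can be checked with a small serial numpy sketch. The sketch is ours, standing in for the data parallel playing field code: γ rank-1 elimination steps applied to a partitioned A(K,K), after which the trailing block holds the Schur complement (here Y is zeroed, as in gridcholesky, so the trailing block should equal -E^T X^{-1} E).

```python
import numpy as np

def partial_factor(AKK, gamma):
    """Return a copy of AKK after gamma rank-1 elimination steps
    (LU without pivoting).  The trailing block then holds the Schur
    complement Y - E^T X^{-1} E of the leading gamma x gamma block X."""
    A = AKK.copy()
    for j in range(gamma):
        A[j+1:, j] /= A[j, j]                                # multipliers
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])    # rank-1 update
    return A

# Build a test submatrix in the paper's partitioned form, with Y zeroed.
rng = np.random.default_rng(0)
gamma, delta = 3, 4
X = rng.standard_normal((gamma, gamma))
X = X @ X.T + gamma * np.eye(gamma)          # positive definite, no pivoting needed
E = rng.standard_normal((gamma, delta))
AKK = np.block([[X, E], [E.T, np.zeros((delta, delta))]])

F = partial_factor(AKK, gamma)
schur = F[gamma:, gamma:]
assert np.allclose(schur, -E.T @ np.linalg.solve(X, E))
```

The strictly lower part of the leading block holds the multipliers and the upper part holds U, so L U reconstructs X; the sketch performs exactly the division, multiplier, and multiply-subtract steps that the playing field code spreads over scans.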
More efficient use of the low-level floating-point architecture of the CM-2 is possible. The Paris instruction set hides the fact that every 32 physical processors share one vector floating-point arithmetic chip. Performing 32 floating point operations implies moving 32 numbers in bit-serial fashion into a transposer chip, then moving them in parallel into the vector chip, then reversing the process to store the result. While this mode of operation conforms to the one-processor-per-data-element programming model, it wastes time and memory bandwidth when only a few processors are actually active (such as computing multipliers or diagonal reciprocals in LU), since 32 memory cycles are required just to access one floating-point number. This mode also requires intermediate results to be stored back to main memory, precluding the use of block algorithms [3] that could store intermediate results in the registers in the transposer. Thus the computation rate is limited by the bandwidth between the transposer and memory (about 3.5 gigaflops for a 64K processor CM) instead of by the operation rate of the vector chip (about 27 gigaflops total).

The scans could be sped up considerably within the hypercube connection structure of the CM. Ho and Johnsson [19] have developed an ingenious algorithm that takes O(b/d + d) time to broadcast b bits in a d-dimensional hypercube, in contrast to the scan, which takes O(b + d). The copy scans could also be implemented so that each physical processor gets only one copy of the data rather than ν copies.
A more efficient dense matrix factorization can be achieved by treating each 32-processor-plus-transposer-and-vector-chip unit as a single processor, and representing 32-bit floating-point numbers "slicewise," with one bit per physical processor, so that they do not need to be moved bit-serially into the arithmetic unit. (Viewed this way, we were working with just 256 real processors!) Also, if virtualization is handled efficiently, we need only keep 256 processors busy. At the time this work was done (late in 1988) the tools for programming on this level were not easily usable. While CM Fortran allows this model, it does not allow scans and scatter with combining, so we still (early 1991) cannot use it.

4.3. CM implementation of Grid Cholesky. Grid Cholesky uses two VP sets with different geometries: the matrix storage stores the nonzero elements of A and L (doing almost no computation), and the playing field is where the dense partial factorizations are done. The top-level factorization procedure is just a loop that moves the active submatrices to the playing field, factors them, and moves updates back to the main matrix.

4.3.1. Matrix storage. The matrix storage VP set is a one-dimensional array of virtual processors that stores all of A and L in essentially the same form as Router Cholesky. Each of the following pvars has one element for each nonzero in L.

    lvalue       elements of L, initially those of A.
    gridi        the playing field row in which this element sits.
    gridj        the playing field column in which this element sits.
    activestage  the stage at which j occurs in a supernode.
    updates      working storage for the sum of incoming updates.

We use a scatter to move the active columns from matrix storage to the playing field. The supernodes C are disjoint, but their neighboring sets adj(C) may overlap; that is, more than one Y' may be computing updates to the same element of L at the same stage. Therefore, we use scatter-with-add to move the partially factored matrix from the playing field back to matrix storage.
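The combining scatter can be mimicked serially. In the sketch below (ours), np.add.at stands in for the CM scatter-with-add: several playing field elements target the same matrix storage slot, as happens when adj(C) sets overlap, and the collisions combine by addition into the updates pvar.

```python
import numpy as np

# Stand-in for scatter-with-add.  'updatedest' plays the role of the pvar
# of that name: the matrix storage slot each playing field value targets.
lvalue = np.array([10.0, 20.0, 30.0])        # matrix storage, one slot per nonzero of L
updatedest = np.array([0, 2, 2, 1])          # destinations; slot 2 receives a collision
field_vals = np.array([1.0, 2.0, 3.0, 4.0])  # updates computed on the playing field

updates = np.zeros_like(lvalue)
np.add.at(updates, updatedest, field_vals)   # combining scatter: collisions add
lvalue += updates                            # one addition per stage, as in the text
assert np.allclose(lvalue, [11.0, 24.0, 35.0])
```

np.add.at is the unbuffered form of addition at repeated indices; a plain fancy-index assignment would silently drop one of the colliding contributions to slot 2, which is exactly the hazard the combining scatter avoids.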
4.3.2. The playing field. The second VP set, called the playing field, is the two-dimensional grid on which the supernodes are factored. In our implementation it is large enough to hold all the principal submatrices for all maximal cliques at any stage, although it could actually use different VP ratios at different stages for more efficiency. Its size is determined as part of the symbolic factorization and reordering.

The pvars used in this VP set are

    densea      the playing field for matrix elements.
    updatedest  the matrix storage location (processor) that holds this
                matrix element; an integer pvar array indexed by stage.

as well as some boolean flags used to coordinate the simultaneous partial factorization of all the maximal cliques.

The processors of the playing field compute LU factorizations by simultaneously doing rank-1 updates (see Section 4.2) of all the dense submatrices stored there, using segmented scans to distribute the pivot rows and columns within all submatrices at the same time. The number of rank-1 update steps is the size of the largest supernode at the current stage. The submatrices may be different sizes; each matrix only does as many rank-1 updates as the size of its supernode.

In order to use this procedure we need to find a placement of all the submatrices A(K,K) for all the maximal cliques K at every stage. This is a two-dimensional bin packing problem. In order to minimize CM computation time, we want to pack these square arrays onto the smallest possible rectangular playing field (whose borders must be powers of two). Optimal two-dimensional bin packing is in general an NP-hard problem, though various efficient heuristics exist [9]. Our experiments use a simple "first fit by levels" heuristic. This layout is done during the sequential symbolic factorization, before the numeric factorization is begun.

4.4. Grid Cholesky performance: Theory. We separate the running time for Grid Cholesky into time in the matrix storage VP set and time on the playing field. The former includes all the router traffic, and essentially nothing else. (There is one addition per stage to add the accumulated updates to the matrix.) There is a fixed number of router uses per stage, so the matrix storage time is

    T_MS = c_5 ν_MS h

for some constant c_5. The subscript MS indicates that the value of ν is taken in the matrix storage VP set, whose VP ratio is ν_MS = ceil(η(L)/p). In the current implementation c_5 = 4, since two scatters are used to move the dense matrices to the playing field at the beginning of a stage, and then two scatters are used to move the completely computed columns and the Schur complements back to matrix storage.

We express the playing field time as a sum over levels. At each level s the number of rank-1 updates is the size of the largest supernode at that level, which is γ_s. According to the analysis in Section 4.2,

    T_PF = Σ_{s=0}^{h} (c_6 + c_7 ν_s) γ_s,

where c_6 and c_7 are constants (in fact c_6 = 2σ), and the subscript s indicates that the value of ν is taken in the playing field VP set at stage s. The VP ratio in this VP set could be approximately the ratio of the total size of the dense submatrices at stage s to the number of processors, changing at each stage as the number and size of the maximal cliques vary; however, in our implementation it is simply fixed at the maximum of this value over all stages.
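The two-term running-time model can be wrapped in a small evaluator. In the sketch below (ours), the constants c_5, c_6, c_7 and the per-stage data are placeholder arguments that show the shape of the model, not measured CM values.

```python
def grid_cholesky_time(c5, c6, c7, nu_ms, h, gamma, nu_pf):
    """Evaluate the two-term running-time model of Section 4.4:
       T_MS = c5 * nu_ms * h                         (router traffic)
       T_PF = sum_s (c6 + c7 * nu_pf[s]) * gamma[s]  (playing field)
    gamma[s] is the largest supernode at stage s and nu_pf[s] the
    playing field VP ratio there (fixed across stages in the paper's
    implementation)."""
    t_ms = c5 * nu_ms * h
    t_pf = sum((c6 + c7 * nu) * g for g, nu in zip(gamma, nu_pf))
    return t_ms, t_pf

# Placeholder numbers only, to exercise the formula:
t_ms, t_pf = grid_cholesky_time(c5=4, c6=2, c7=1, nu_ms=16, h=11,
                                gamma=[5, 8, 16], nu_pf=[16, 16, 16])
```

With the fixed playing field of the pilot implementation, nu_pf is the same at every stage, so T_PF reduces to (c_6 + c_7 ν_PF) Σ_s γ_s, which is the form used in the experimental comparison of Section 4.5.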
Again, to get a feeling for this analysis let us consider the model problem, a five-point finite difference mesh on a k x k grid ordered by nested dissection. For this problem n = k^2, h = O(log k), and η(L) = O(k^2 log k). The factorization requires O(k^3) arithmetic operations. Table 2 summarizes the number and sizes of the cliques that occur at each stage. The columns in the table are as follows.

    R(s)                  Number of supernodes at stage s.
    γ_s                   Size of largest supernode at stage s.
    γ_s + δ_s             Size of largest maximal clique C ∪ adj(C) at stage s.
    Σ_C (γ_C + δ_C)^2     Total area of all dense submatrices A(K,K) at stage s.

Table 2
Subproblem counts and playing field size for the model problem.

    Stage s    R(s)        γ_s          γ_s + δ_s     Σ_C (γ_C + δ_C)^2
    h          1           k            k             k^2
    h-1        2           k/2          3k/2          4.5k^2
    h-2        4           k/2          3k/2          9k^2
    h-3        8           k/4          3k/2          18k^2
    h-4        16          k/4          5k/4          25k^2
    h-5        32          k/8          7k/8          24.5k^2
    h-6        64          k/8          5k/8          25k^2
    h-7        128         k/16         7k/16         24.5k^2
    h-2r       2^{2r}      k/2^r        5k/2^r        25k^2
    h-2r-1     2^{2r+1}    k/2^{r+1}    7k/2^{r+1}    24.5k^2

The VP ratio in matrix storage for the model problem is O(η(L))/p = O(k^2 log k / p), so the matrix storage time is O(k^2 log^2 k / p). Our pilot implementation uses the same size playing field at every stage. According to Table 2, a playing field of size about 25k^2 suffices if the problems can be packed in without too much loss. The VP ratio is then O(k^2/p). The sum over all stages of γ_s is O(k) (in fact it is 3k + O(1)), so the playing field time is O(k^3/p). In summary, the total running time of Grid Cholesky for the model problem is

    O(k^2 log^2 k / p + k^3 / p).

Two things are notable about this. First, the performance, or ratio of sequential arithmetic operations to time, is O(p); the log k inefficiency of Router Cholesky has vanished. This is because the playing field, where the bulk of the computation is done, has a lower VP ratio than the matrix storage structure. Second, and much more important in practice, the router speed appears only in the second-order term. This is because the playing field computations are done on dense matrices with more efficient grid communication. This means that the router time becomes less important as the problem size grows, whether or not the number of processors grows with the problem size.
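The general rows of Table 2 can be verified mechanically: at stage h-2r there are 2^{2r} cliques of side 5k/2^r, and at stage h-2r-1 there are 2^{2r+1} cliques of side 7k/2^{r+1}, so the total dense area stays at 25k^2 and 24.5k^2 respectively. A quick check (ours):

```python
def table2_area(k, r):
    """Total dense-submatrix area at stages h-2r and h-2r-1
    (general rows of Table 2 for the k x k model problem)."""
    even = (2 ** (2 * r)) * (5 * k / 2 ** r) ** 2            # stage h-2r
    odd = (2 ** (2 * r + 1)) * (7 * k / 2 ** (r + 1)) ** 2   # stage h-2r-1
    return even, odd

# The areas are independent of r, which is why a fixed playing field
# of about 25 k^2 cells suffices at every stage.
for r in range(2, 6):
    even, odd = table2_area(64, r)
    assert abs(even - 25 * 64 ** 2) < 1e-6
    assert abs(odd - 24.5 * 64 ** 2) < 1e-6
```

The same arithmetic shows δ_s = 4γ_s on the even rows (clique side 5k/2^r against supernode size k/2^r), which is the "average Schur complement size about 4γ" used in Section 4.2.1.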
One way of looking at this analysis is to think of increasing both problem size and machine size so that the VP ratio remains constant. Then the model problem requires O(k) total parallel operations, but only O(log k) router operations. The analysis of the model problem carries through (with different constant factors) for any two-dimensional finite element problem ordered by nested dissection; a similar analysis carries through for any three-dimensional finite element problem. Thus, Grid Cholesky is a "scalable" parallel algorithm.

4.5. Grid Cholesky performance: Experiments. Here we present experimental results from a fairly small model problem, the matrix arising from the five-point discretization of the Laplacian on a square 63 x 63 mesh, ordered by nested dissection. This matrix has n = 3,969 columns and 11,781 nonzeros (counting symmetric entries only once). The Cholesky factor has η(L) = 86,408 nonzeros, a clique tree with h = 11 stages of supernodes, and takes 3,589,202 arithmetic operations to compute.

The matrix storage VP set requires 128K VPs. The fixed-size playing field requires 256 x 512 VPs. The results quoted here are from 8,192 processors, with floating-point coprocessors, of the machine at NASA. Both VP sets therefore had a VP ratio of 16. (A larger problem would need a higher VP ratio in the matrix storage than in the playing field.)

We observed a running time of 6.13 seconds. Of this, 4.09 seconds was playing field time (3.12 for the scans, 0.15 for nearest-neighbor moves of one-bit tokens, and 0.82 for local computation). The other 2.04 seconds was matrix storage time, consisting mostly of the four scatters at each stage. Our analytic model predicts playing field time to be about 3k(c_6 + c_7 ν_PF); this comes to 4.0 seconds, which is in good agreement with experiment. The model predicts matrix storage time of about 4h ν_MS router operations; this comes to between 1.5 and 4.7 seconds, depending on which value we choose for the time per router operation. In fact 3/4 of the router operations are scatters with no collisions, and the other 1/4 are scatter-with-add, typically with two to four collisions. The fit to the model is therefore quite close.
4.6. Remarks on Grid Cholesky. On this small example, Grid Cholesky is about 20 times as fast as Router Cholesky. It is, however, only running at 0.586 megaflops on 8K processors, which would scale to 4.68 megaflops on a full 64K machine. A larger problem would run somewhat faster, but it is clear that making Grid Cholesky attractive will require major improvements. Some of these were sketched in Section 4.2.1.

A first question is whether Grid Cholesky is a router-bound code like Router Cholesky. For the small sample problem the relative times for router and non-router computations are as follows.

    Moving data to the playing field:                21%
    Factoring on the playing field:                  67%
    Moving Schur complements back to matrix storage: 12%

Evidently, the Grid Cholesky code is not router-bound for this problem. For larger (or structurally denser) problems this situation gets better still: for a machine of fixed size, the time spent using the router grows like O(log^2 k) while the time on the playing field grows like O(k^3) for a k x k grid, as we showed above. If we solved the same
problem on a full-sized 64K processor machine, the relative times would presumably be the same as above; but if we solved a problem 8 times as large the operation count would increase by a factor of about 22 while the number of stages, and router operations, would increase only by a factor of about 1.3.

Next, we ask whether our use of the playing field is efficient or not. The number of parallel elimination steps on the playing field is given by

    Σ_{s=1}^{h} γ_s,

which is 177 for the example. On the playing field of 131,072 processors this allows us to do 22.8 x 10^6 multiply-adds or 45.6 x 10^6 flops. The number of flops actually done is 3.69 x 10^6, so the processor utilization is 7.9%. There are several reasons for this loss of efficiency:

    The algorithm does both halves of the symmetric dense subproblems (factor of 2);
    The implementation uses the same playing field size at every level (factor of about 4/3);
    The architecture forces the dimensions of the playing field to be powers of two (factor of about 4/3);
    The playing field is not fully covered by active processors initially, and as the dense factorization progresses processors in the supernodes fall idle (factor of about 7/2).

These effects account for a factor of roughly 12.4, which is consistent with our measured result: 1/.079 = 12.6. In other words, every step must use all 131,072 virtual processors, but on average we only have work for about 10,500 of them. Thus, the Paris virtualization method costs us a factor of roughly 131,072/10,500 = 12.5. A similar analysis shows that virtualization slows the use of the router by a factor of 7.6.

We summarize the reasons that our achieved performance is so far below the CM's peak:

    Virtualization. We used 2^17 virtual processors. On average, we make use of 10,500 on the playing field and 5,600 in matrix storage. In actuality, there are 256 physical (floating-point) processors in the machine we used.
    The slow router.
    Communication costs for scans on the playing field, using the built-in suboptimal algorithm.
    The SIMD restriction. This causes us to have to wait for the divides and multiplies. (Since there are very few active virtual processors during these tasks, most of this effect could also be removed by efficient virtualization.)
    The Paris instruction set, which makes reuse of data in registers impossible, thus exposing the low memory bandwidth of the CM.
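The utilization bookkeeping above is simple arithmetic, and the following replay (ours; the paper rounds some of the intermediate counts) approximately reproduces both the 7.9% utilization and the combined loss factor of roughly 12.4.

```python
# Replay of the playing field utilization arithmetic for the 63 x 63 example.
steps = 177                    # sum of gamma_s over all stages
field = 131_072                # playing field virtual processors (256 x 512)
mult_add_slots = steps * field # multiply-add slots available (~23 x 10^6)
flops_avail = 2 * mult_add_slots
flops_done = 3.69e6            # flops actually performed
utilization = flops_done / flops_avail

# The four loss factors cited in the text: symmetric halves, fixed field
# size, power-of-two borders, and idle processors late in each factorization.
combined = 2 * (4 / 3) * (4 / 3) * (7 / 2)
```

utilization comes out near 0.079 and combined near 12.4, matching the measured 1/.079 = 12.6 to within the rounding in the individual factors.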
Table 3 gives an upper bound on the improvement that removal of each of these impediments to performance would provide. In each case, we hypothesize an "ideal" machine in which the corresponding cost is completely removed. Thus, for example, the statistics for the third row of the table are for a marvelous machine in which a router operation takes no time whatever.
Table 3
Factors affecting efficiency of Grid Cholesky.

    Impediment removed    Time     Mflops    Speedup
    None                  6.13        4.8
    Virtualization        0.60       49.       10.
    Slow router           0.33       89.        1.8
    Slow scans            0.065     450.        5.0
    SIMD restriction      0.021   1,400.        3.1
    Memory bandwidth      0.004  27,000.

Clearly, good efficiency is possible, even on an SIMD machine with a router. The chief problems are the Paris virtualization method; the lack of a fast broadcast in the grid; and the memory-to-memory instruction set. All of these are peculiarities of Paris on the CM-2; none is fundamental to SIMD, data parallel computing.

5. Conclusions. We have compared two highly parallel general-purpose sparse Cholesky factorization algorithms, implemented for a data parallel computer. The first, Router Cholesky, is concise and elegant and takes advantage of the parallelism present in the elimination tree, but because it pays little attention to the cost of communication it is not practical.

We therefore developed a parallel algorithm, Grid Cholesky, that does most of its work with grid communication on dense submatrices. Analysis shows that the requirement for expensive general-purpose communication grows only logarithmically with problem size; even for modest problems the code is not limited by the CM router. Experiment shows that Grid Cholesky is about 20 times as fast as Router Cholesky for a moderately small sample problem.

Our pilot implementation of Grid Cholesky is not fast enough to make the Connection Machine cost-effective for solving generally structured sparse matrix problems. We believe, however, that our experiments and analysis lead to the conclusion that a parallel supernodal, multifrontal algorithm can be made to perform efficiently on a highly parallel SIMD machine. We showed in detail that Grid Cholesky could run two to three orders of magnitude faster with improvements in the instruction set of the Connection Machine. Viewed in that light, the gap separating our pilot implementation from the 27-gigaflop theoretical peak performance of a 64K processor CM yawns somewhat less dauntingly.

Most of these improvements are below the level of the Paris virtual processor abstraction, which is to say below the level of the assembly-language architecture of the machine. Although TMC has recently released a low-level language called CMIS in which a user can program below the virtual-processor level, we believe that ultimately most of these optimizations should be applied by high-level language compilers. Future compilers for highly parallel machines, while they will support the data-parallel virtual processor abstraction at the user's level, will generate code at a level below that abstraction.

Although Grid Cholesky is more complicated than Router Cholesky, we are still able to use the data parallel programming paradigm to express it in a straightforward way. The high-level scan and scatter-with-add communication primitives substantially simplified the programming. The simplicity of our codes speaks well for this data parallel programming model.
In summary, even though our pilot implementation is not fast, we are nonetheless encouraged about the Grid Cholesky algorithm and about the potential of data parallel programming and highly parallel architectures for solving unstructured problems. We believe that future generations of highly parallel machines may allow efficient parallel programs for complex tasks to be written nearly as easily as sequential programs. To get to that point, there will have to be improvements in compilers, instruction sets, and router technology. Virtualization will have to be implemented without sacrificing efficiency.

We mention four avenues for further research.

The first is scheduling the dense partial factorizations efficiently. The tree of supernodes identifies a precedence relationship among the various partial factorizations. Our simple approach of scheduling these one level at a time onto a fixed-size playing field is not the only possible one. There is in general no need to perform all the partial factorizations at a single level simultaneously. It should be possible to use more sophisticated heuristics to schedule these factorizations onto a playing field of varying VP ratio, or even (for the Connection Machine) onto a playing field considered as a mesh of individual vector floating-point chips. Some theoretical work has been done on scheduling "rectangular" tasks onto a square grid of processors efficiently [8].

The second avenue is improving the time spent in the matrix storage VP set. Of course, as problems get larger this time becomes a smaller fraction of the total. At present matrix storage time is not very significant even for a small problem, but it will become more so as the playing field time is improved.

Third, we mention the possibility of an out-of-main-memory version of Grid Cholesky for very large problems. Here the clique tree would be used to schedule transfers of data between the high-speed parallel disk array connected to the CM and the processors themselves.

Fourth and finally, we mention the possibility of performing the combinatorial preliminaries to the numerical factorization in parallel. Our pilot implementation uses a sequentially generated ordering, symbolic factorization, and clique tree. We are currently designing data parallel algorithms to do these three steps [18].

We conclude by extracting one last moral from Grid Cholesky. We find it interesting and encouraging that the key idea of the algorithm, namely partitioning the matrix into dense submatrices in a systematic way, has also been used to make sparse Cholesky factorization more efficient on vector supercomputers [32] and even on workstations [29]. In the former case, the dense submatrices vectorize efficiently; in the latter, the dense submatrices are carefully blocked to minimize traffic between cache memory and main memory. We expect that more experience will show that many techniques for attaining efficiency on sequential machines with hierarchical storage will turn out to be useful for highly parallel machines.

REFERENCES

[1] C. Ashcraft, R. Grimes, J. Lewis, B. Peyton, and H. Simon, Recent progress in sparse matrix methods for large linear systems, International Journal of Supercomputer Applications, 1 (1987), pp. 10-30.
[2] C. C. Ashcraft, The domain/segment partition for the factorization of sparse symmetric positive definite matrices, Tech. Report ECA-TR-148, Boeing Computer Services Engineering Computing and Analysis Division, Seattle, 1990.
[3] C. H. Bischof and J. J. Dongarra, A project for developing a linear algebra library for high-performance computers, Tech. Report MCS-P105-0989, Argonne National Laboratory, 1989.
[4] J. R. S. Blair and B. W. Peyton, On finding minimum-diameter clique trees, Tech. Report ORNL/TM-11850, Oak Ridge National Laboratory, 1991.
[5] M. Dixon and J. de Kleer, Massively parallel assumption-based truth maintenance, in Proceedings of the National Conference on Artificial Intelligence, 1988, pp. 199-204.
[6] I. S. Duff, Multiprocessing a sparse matrix code on the Alliant FX/8, Tech. Report CSS 210, Computer Science and Systems Division, AERE Harwell, 1988.
[7] I. S. Duff and J. K. Reid, The multifrontal solution of indefinite sparse symmetric linear equations, ACM Transactions on Mathematical Software, 9 (1983), pp. 302-325.
[8] A. Feldmann, J. Sgall, and S.-H. Teng, Dynamic scheduling on parallel machines.
[9] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, 1979.
[10] F. Gavril, Algorithms for minimum coloring, maximum clique, minimum covering by cliques, and maximum independent set of a chordal graph, SIAM Journal on Computing, 1 (1972), pp. 180-187.
[11] A. George, Nested dissection of a regular finite element mesh, SIAM Journal on Numerical Analysis, 10 (1973), pp. 345-363.
[12] A. George, M. T. Heath, J. Liu, and E. Ng, Sparse Cholesky factorization on a local-memory multiprocessor, SIAM Journal on Scientific and Statistical Computing, 9 (1988), pp. 327-340.
[13] A. George and J. W. H. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall, 1981.
[14] A. George and E. Ng, On the complexity of sparse QR and LU factorization of finite-element matrices, SIAM Journal on Scientific and Statistical Computing, 9 (1988), pp. 849-861.
[15] J. A. George and D. McIntyre, On the application of the minimum degree algorithm to finite element systems, SIAM Journal on Numerical Analysis, 15 (1978), pp. 90-112.
[16] J. R. Gilbert, Some nested dissection order is nearly optimal, Information Processing Letters, 26 (1988), pp. 325-328.
[17] J. R. Gilbert and H. Hafsteinsson, Parallel solution of sparse linear systems, in SWAT 88: Proceedings of the First Scandinavian Workshop on Algorithm Theory, Springer-Verlag Lecture Notes in Computer Science 318, 1988, pp. 145-153.
[18] J. R. Gilbert, C. Lewis, and R. Schreiber, Parallel preordering for sparse matrix factorization. In preparation.
[19] C.-T. Ho and S. L. Johnsson, Spanning balanced trees in Boolean cubes, SIAM Journal on Scientific and Statistical Computing, 10 (1989), pp. 607-630.
[20] J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Science, 79 (1982), pp. 2554-2558.
[21] J. A. G. Jess and H. G. M. Kees, A data structure for parallel L/U decomposition, IEEE Transactions on Computers, C-31 (1982), pp. 231-239.
[22] S. G. Kratzer, Massively parallel sparse matrix computations, Tech. Report SRC-TR-90-008, Supercomputer Research Center, 1990.
[23] J. W. H. Liu, The multifrontal method for sparse matrix solution: Theory and practice, Tech. Report CS-90-04, York University Computer Science Department, 1990.
[24] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134-172.
[25] J. Naor, M. Naor, and A. J. Schaffer, Fast parallel algorithms for chordal graphs, SIAM Journal on Computing, 18 (1989), pp. 327-349.
[26] B. W. Peyton, Some Applications of Clique Trees to the Solution of Sparse Linear Systems, PhD thesis, Clemson University, 1986.
[27] D. J. Rose, A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations, in Graph Theory and Computing, R. C. Read, ed., 1972, pp. 183-217.
[28] D. J. Rose, R. E. Tarjan, and G. S. Lueker, Algorithmic aspects of vertex elimination on graphs, SIAM Journal on Computing, 5 (1976), pp. 266-283.
[29] E. Rothberg and A. Gupta, Fast sparse matrix factorization on modern workstations, Tech. Report STAN-CS-89-1286, Stanford University, 1989.
[30] R. Schreiber, A new implementation of sparse Gaussian elimination, ACM Transactions on Mathematical Software, 8 (1982), pp. 256-276.
[31] R. Schreiber, An assessment of the Connection Machine, in Scientific Applications of the Connection Machine, H. Simon, ed., World Scientific, Singapore, 1991.
[32] H. Simon, P. Vu, and C. Yang, Performance of a supernodal general sparse solver on the Cray Y-MP, Tech. Report SCA-TR-117, Boeing Computer Services, 1989.
[33] B. Speelpenning, The generalized element method, Tech. Report UIUCDCS-R-78-946, University of Illinois, 1978.
[34] Thinking Machines Corporation, Cambridge, Massachusetts, Paris reference manual, version 5.0.
[35] E. Zmijewski, Sparse Cholesky Factorization on a Multiprocessor, PhD thesis, Cornell University, 1987.