HIGHLY PARALLEL SPARSE CHOLESKY FACTORIZATION

JOHN R. GILBERT* AND ROBERT SCHREIBER†

Abstract. We develop and compare several fine-grained parallel algorithms to compute the Cholesky factorization of a sparse matrix. Our experimental implementations are on the Connection Machine, a distributed-memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special-purpose algorithms in which the matrix structure conforms to the connection structure of the machine, our focus is on matrices with arbitrary sparsity structure.

The most promising alternative is a supernodal, multifrontal algorithm whose inner loop performs several dense factorizations simultaneously on a two-dimensional grid of processors. A fine-grained parallel, dense factorization algorithm is used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, we conclude that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems.

We also present a performance model and use it to analyze our algorithms. We find that asymptotic analysis combined with experimental measurement of parameters is accurate enough to be useful in choosing among alternative algorithms for a complicated problem.

Keywords. sparse matrix algorithms, Cholesky factorization, supernodal factorization, multifrontal factorization, systems of linear equations, parallel computing, data parallel algorithms, chordal graphs, clique trees, Connection Machine, performance analysis.

AMS subject classifications. 05C50, 15A23, 65F05, 65F50, 68M20.

* Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, California 94304. Copyright (c) 1990, 1991 Xerox Corporation. All rights reserved.

† Research Institute for Advanced Computer Science, MS T045-1, NASA Ames Research Center, Moffett Field, CA 94035. This author's work was supported by the Numerical Aerodynamic Simulation Systems Division of NASA and by DARPA via Cooperative Agreement NCC 2-387 between NASA and the Universities Space Research Association.

1. Introduction.

1.1. Data parallelism. Highly parallel, local memory computer architectures promise to achieve high performance inexpensively by assembling a large amount of simple hardware in a way that scales without bottlenecks. A data parallel programming model simplifies the programming of local memory parallel architectures by associating a processor with every data element in a computation (at least conceptually), thus hiding from the programmer the details of distribution of data and work to the processors.

Some major challenges come along with these promises. Communication is expensive relative to computation, so an algorithm must minimize communication. Locality is important in communication, so it pays to substitute communication with nearby processors for more general patterns where possible. The sequential programmer tunes the inner loop of an algorithm for high performance, but simple data parallel algorithms tend to have "everything in the inner loop" because a sequential loop over the data is typically replaced by a parallel operation. (From this point of view, the advantage of the more complicated of our two algorithms, Grid Cholesky, is that it removes the general-pattern communication from the inner loop.)

Algorithms for data parallel architectures must make different trade-offs than sequential algorithms: they must exploit regularity in the data. For efficiency on SIMD machines, they must also be highly regular in the time dimension. In some
cases entirely new approaches may be appropriate; examples of experiments with such approaches include particle-in-box flow simulation, knowledge base maintenance [5], and the entire field of neural computation [20]. On the other hand, the same kind of regularity in a problem or an algorithm can often be exploited in a wide range of architectures; therefore, many ideas from sequential computation turn out to be surprisingly applicable in the highly parallel domain. For example, block-oriented matrix operations are useful in sequential machines with hierarchical storage and conventional vector supercomputers [3]; we shall see that they are also crucial to efficient data parallel matrix algorithms.

1.2. Goals of this study. Data parallel algorithms are natural for computations on matrices that are dense or have regular nonzero structures arising from, for example, regular finite difference discretizations. The main goal of this research is to determine whether data parallelism is useful in dealing with irregular, arbitrarily structured problems. Specifically, we consider computing the Cholesky factorization of an arbitrary sparse, symmetric, positive definite matrix. We will make no assumptions about the nonzero structure of the matrix besides symmetry. We will present evidence that arbitrary sparse problems can be solved nearly as efficiently as dense problems by carefully exploiting regularities in the nonzero structure of the triangular factor that come from the clique structure of its chordal graph.

A second goal is to perform a case study in analysis of parallel algorithms. The analysis of sequential algorithms and data structures is a mature and useful science that has contributed to sparse matrix computation for many years. By contrast, the study of complexity of parallel algorithms is in its infancy, and it remains to be seen how useful parallel complexity theory will be in designing efficient algorithms for real parallel machines. We will argue by example that, at least within a particular class of parallel architectures, asymptotic analysis combined with experimental measurement of parameters is accurate enough to be useful in choosing among alternative algorithms for a single fairly complicated problem.
1.3. Outline. The structure of the remainder of the paper is as follows. In Section 2 we review the definitions we need from numerical linear algebra and graph theory, sketch the architecture of the Connection Machine, and present a timing model for a generalized data parallel computer that abstracts that architecture.

In Section 3 we present the first of two parallel algorithms for sparse Cholesky factorization. The algorithm, which we call Router Cholesky, is based on a theoretically efficient algorithm in the PRAM model of parallel computation. We analyze the algorithm and point out two reasons that it fails to be practical, one having to do with communication and one with processor utilization.

In Section 4 we present a second algorithm, which we call Grid Cholesky. Grid Cholesky is a data parallel implementation of a supernodal, multifrontal method that draws on the ideas of Duff and Reid [7] and Ashcraft et al. [1]. It improves on Router Cholesky by using a two-dimensional grid of processors to operate on dense submatrices, thus replacing most of the slow generally-routed communication of Router Cholesky with faster grid communication. It also solves the processor utilization problem by assigning different data elements to the working processors at different stages of the computation. We present an analysis and experimental results for a pilot implementation of Grid Cholesky on the Connection Machine.

The pilot implementation of Grid Cholesky is approximately as efficient as a dense Cholesky factorization algorithm, but is still slow compared to the theoretical peak performance of the machine. Several steps necessary to improve the absolute
efficiency of the algorithm, most of which concern efficient Cholesky factorization of dense matrices, are described. Finally we draw some conclusions and discuss avenues for further research.

2. Definitions. The first two subsections below summarize the linear algebra and graph theory needed to study sparse Cholesky factorization. Most of this material is covered in more detail by George and Liu [13, 23, 24]. The remainder of the section outlines the data parallel programming model and its implementation on the Connection Machine, and describes our parametric model of a data parallel computer.

2.1. Linear algebra. Let $A = (a_{ij})$ be an $n \times n$ real, symmetric, positive definite sparse matrix. There is a unique $n \times n$ lower triangular matrix $L = (l_{ij})$ with positive diagonal such that

$$A = LL^T.$$

This is the Cholesky factorization of $A$. We seek to compute $L$; with it we may solve the linear system $Ax = b$ by solving $Ly = b$ and $L^T x = y$. We will discuss algorithms for computing $L$ below. In general, $L$ is less sparse than $A$. The nonzero positions of $L$ that were zero in $A$ are called fill or fill-in. For any matrix $X$, we write $\eta(X)$ to denote the number of nonzero elements of $X$.

The rows and columns of $A$ may be symmetrically reordered so that the system solved is

$$PAP^T(Px) = Pb$$

where $P$ is a permutation matrix. We assume that such a reordering, chosen to reduce $\eta(L)$ and the number of operations required to compute $L$, has been done. We further assume that the structure of $L$ has been determined by a symbolic factoring process. We ignore these preliminary computations in this study because the cost of actually computing $L$ typically dominates. (In many cases, several identically structured matrices may be factored using the same ordering and symbolic factorization.) A study of the implementation of appropriate reordering and symbolic factorization procedures on data parallel architectures is in preparation [18].

If the matrix $A$ is such that its Cholesky factor $L$ has no more nonzeros than the lower triangle of $A$, i.e. there is no fill, then $A$ is a perfect elimination matrix. If $PAP^T$ is a perfect elimination matrix for some permutation matrix $P$, we call the ordering corresponding to $P$ a perfect elimination ordering of $A$.

Let $R$ and $S$ be subsets of $\{1, \ldots, n\}$. Then $A(R, S)$ is the $|R| \times |S|$ matrix whose elements are $a_{r,s}$ for $r \in R$ and $s \in S$. (For any set $S$, we write $|S|$ to denote its cardinality.)

2.2. Graph theory.

Vertex elimination. We associate two ordered, undirected graphs with the sparse, symmetric matrix $A$. First, $G(A)$, the graph of $A$, is the graph with vertices $\{1, 2, \ldots, n\}$ and edges

$$E(A) = \{(i, j) \mid a_{ij} \neq 0\}.$$

(Note that $E(A)$ is a set of unordered pairs.) Next, we define the filled graph, $G^+(A)$, with vertices $\{1, 2, \ldots, n\}$ and edges

$$E^+(A) = \{(i, j) \mid l_{ij} \neq 0\},$$
so that $G^+(A)$ is $G(L + L^T)$. The edges in $G^+(A)$ that are not edges of $G(A)$ are called fill edges. The output of a symbolic factorization of $A$ is a representation of $G^+(A)$.

For every fill edge $(i, j)$ in $E^+(A)$ there is a path in $G(A)$ from vertex $i$ to vertex $j$ whose vertices all have numbers lower than both $i$ and $j$; moreover, for every such path in $G(A)$ there is an edge in $G^+(A)$ [28]. Consider renumbering the vertices of $G^+(A)$. With another numbering, this last property may or may not hold. If it does, then the new ordering is a perfect elimination ordering of $G^+(A)$; Liu [24] calls it an equivalent reordering of $G(A)$.

Chordal graphs. Every cycle of more than three vertices in $G^+(A)$ has an edge between two nonconsecutive vertices (a chord). Such a graph is said to be chordal. Not only is $G^+(A)$ chordal for every $A$, but every chordal graph is the filled graph of some matrix [27].

Let $G = G(V, E)$ be any undirected graph. A clique is a subset $X$ of $V$ such that for all $u, v \in X$, $(u, v) \in E$. A clique is maximal if it is not a proper subset of any other clique. For any $v \in V$, the neighborhood of $v$, written $adj(v)$, is the set $\{u \in V \mid (u, v) \in E\}$. The monotone neighborhood of $v$, written $madj(v)$, is the smaller set $\{u \in V \mid u > v, (u, v) \in E\}$. We extend $adj$ and $madj$ to sets of vertices in the usual way.

A vertex $v$ is simplicial if $adj(v)$ is a clique. Two vertices, $u$ and $v$, are indistinguishable if $\{u\} \cup adj(u) = \{v\} \cup adj(v)$. Two vertices are independent if there is no edge between them. A set of vertices is independent if every pair of vertices in it is independent; two sets $A$ and $B$ are independent if no vertex of $A$ is adjacent to a vertex of $B$.

It is immediate that any two simplicial vertices are either independent or indistinguishable. A set of indistinguishable simplicial vertices thus forms a clique, though not in general a maximal clique. The equivalence relation of indistinguishability partitions the simplicial vertices into pairwise independent cliques. We call these the simplicial cliques of the graph.

Elimination trees. A fundamental tool in studying sparse Gaussian elimination is the elimination tree. This structure was defined by Schreiber [30]; Liu [24] gives a survey of its many uses. Let $A$ have the Cholesky factor $L$. The elimination tree $T(A)$ is a rooted spanning forest of $G^+(A)$ defined as follows. If vertex $u$ has a higher-numbered neighbor $v$, then the parent $p(u)$ of $u$ in $T(A)$ is the smallest such neighbor; otherwise $u$ is a root. In other words, the first off-diagonal nonzero element in the $u$th column of $L$ is in row $p(u)$. It is easy to show that $T(A)$ is a forest consisting of one tree for each connected component of $G(A)$. For simplicity we shall assume in what follows that $A$ is irreducible, so that vertex $n$ is the only root, though our algorithms do not assume this.

There is a monotone increasing path in $T(A)$ from every vertex to the root. If $(u, v)$ is an edge of $G^+(A)$ with $u < v$ (that is, if $l_{vu} \neq 0$) then $v$ is on this unique path from $u$ to the root. This means that when $T(A)$ is considered as a spanning tree of $G^+(A)$, there are no "cross edges" joining vertices in different subtrees. It implies that, if we think of the vertices of $T(A)$ as columns of $A$ or $L$, any given column of $L$ depends only on columns that are its descendants in the tree.

Clique trees. The elimination tree describes a Cholesky factorization in terms of operations on single columns. A description in terms of operations on full blocks can yield algorithms with better locality of reference, which is an advantage either on a machine with a memory hierarchy (registers, cache, main memory, disk) or
on a distributed-memory parallel machine. The Connection Machine falls into both of these categories.

Describing symmetric factorization in terms of operations on full submatrices is the key idea in both Duff and Reid's "multifrontal" algorithm [7] and the "supernodal" algorithm of Ashcraft, Grimes, Lewis, Peyton, and Simon [1], which can be traced back to the so-called element model of factorization [15, 33]. A full submatrix of $L$ is a clique in the chordal graph $G^+(A)$. The clique structure of chordal graphs has been explored extensively in the combinatorial literature; representations of chordal graphs as trees of cliques date back at least to 1972 [10] and continue to be used [16, 25, 26].

A clique tree for matrix $A$ is a tree whose nodes are sets that partition the vertices of $G^+(A)$ into cliques, in such a way that all the vertices of a node $N$ are indistinguishable simplicial vertices in the graph that results by deleting from $G^+(A)$ all vertices in nodes that are proper descendants of $N$. An equivalent definition is to think of starting with an elimination tree and collapsing vertices that are indistinguishable from their parents. (This definition differs slightly from that of Peyton [26], whose tree nodes are overlapping maximal cliques of $G^+(A)$.)

A clique which is a node in a clique tree is also called a supernode. If all vertices of $G^+(A)$ in proper descendants of some supernode are deleted, the supernode becomes a simplicial clique in the resulting graph. The clique tree is sometimes called a supernode tree or supernodal elimination tree [2]. A matrix may have many different clique trees; indeed, the elimination tree itself is one. Our numerical factorization algorithm Grid Cholesky can actually use any clique tree; the symbolic factorization we describe in Section 4.1 uses a block Jess and Kees algorithm to compute a shortest possible clique tree.

2.3. The Connection Machine.

Architecture. The Connection Machine (model CM-2) is a local memory, SIMD parallel computer. The description we present here corresponds to the machine architecture presented by the assembly language instruction set Paris [34]. We programmed the CM in *Lisp, which is compiled into Paris.

A full-sized CM has $2^{16} = 65{,}536$ processors, each of which can directly access 65,536 bits of memory. (Since this work was done, larger memories have become available.) The processors are connected by a communication network called the router, which is configured by a combination of microcode and hardware to be a 16-dimensional hypercube.

The essential feature of the CM programming model is the parallel variable or pvar. A pvar is an array of data in which every processor stores and manipulates one element. The size of a pvar may be a multiple of the number of physical machine processors, $p$. If there are $V$ times as many elements in the pvar $x$ as there are processors, then (through microcode) each physical processor simulates $V$ virtual processors; thus the programmer's view remains "one processor per datum." The ratio $V$ is called the virtual processor (VP) ratio. The CM requires that $V$ must be a power of two. Thus we will find useful the notation $\lceil x \rceil$, meaning the smallest power of two not smaller than the real number $x$.

The geometry of each set of virtual processors (and its associated pvars) is also determined by the programmer, who may choose to view it as any multidimensional array with dimensions that are powers of two. The VP sets and their pvars are embedded in the machine (using Gray codes) so that neighboring virtual processors are simulated by the same or neighboring physical processors.
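The power-of-two rounding that fixes the VP ratio can be illustrated with a short sketch. The function names `ceil_pow2` and `vp_ratio` are hypothetical, chosen for this illustration only, and the result is clamped to at least 1 on the assumption that a VP ratio below one is not meaningful. The example numbers are the ones quoted for the experiment in Section 3.4.

```python
def ceil_pow2(x):
    """Smallest power of two that is not smaller than x (never below 1)."""
    power = 1
    while power < x:
        power *= 2
    return power


def vp_ratio(num_elements, num_processors):
    """Number of virtual processors simulated by each physical processor."""
    return ceil_pow2(num_elements / num_processors)


# Example: 48,608 data elements on 8,192 physical processors.
V = vp_ratio(48608, 8192)
print(V, V * 8192)  # VP ratio and the number of virtual processors allocated
```

Under these assumptions the rounding allocates noticeably more virtual processors than there are data elements, which is the cost the text attributes to the power-of-two restriction.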
The Paris instruction set corresponds fairly well to the abstract model of data parallel programming that the CM attempts to present to the programmer, but it does not correspond closely to the actual hardware of the CM. Largely for this reason, it is hard to get high performance when programming in Paris or in a language that is compiled into Paris [31]. We shall go into this point in detail later. There are other ways to view and to program the hardware of the CM-2 that can provide better performance. These are just now becoming available to users, but were not when this work was done.

Connection Machine programming. Parallel computation on the CM is expressed through elementwise binary operations on pairs of pvars that reside in the same VP set, that is, have the same VP ratio and layout on the machine. (Optionally, one may specify a boolean mask that selects only certain virtual processors to be active.) These operations take time proportional to $V$, since the actual processors must loop over their simulated virtual processors. This remains true even when the set of selected processors is very sparse.

Interprocessor communication is expressed and accomplished in three ways, which we discuss in order of increasing generality but decreasing speed.

Communication with virtual processors at nearest neighbor grid cells is most efficient. A pvar may be shifted along any of its axes using this type of communication. The shift may be circular or end-off at the programmer's discretion.

A second communication primitive, scan, allows broadcast of data. For example, if $x$ is a one-dimensional pvar with the value [1, 2, 3, 4, 5, 6, 7, 8] then a scan of $x$ yields [1, 1, 1, 1, 1, 1, 1, 1]. Scans are implemented using the hypercube connections. The time for a scan of length $n$ is linear in $\log n$. Scans can also be used to broadcast along either rows or columns of a two-dimensional array. Scans that perform parallel prefix arithmetic operations are also available, but we do not use them.

Scans of subarrays are possible. In a segmented scan, the programmer specifies a boolean pvar, the segment pvar, congruent to $x$. The segments of $x$ between adjacent T values in the segment pvar are scanned independently. Thus, for example, if we use the segment pvar [T F F F T F F T] and $x$ is as above, then a segmented scan returns [1, 1, 1, 1, 5, 5, 5, 8].
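The broadcast scan and its segmented variant can be emulated sequentially to make the semantics concrete. This is only a sketch: `copy_scan` is a hypothetical name, the loop runs in linear time on one processor, and it says nothing about the logarithmic-time hypercube implementation on the CM.

```python
def copy_scan(x, segment=None):
    """Broadcast the first value of each segment to the rest of that segment.

    `segment` is a list of booleans of the same length as `x`; True marks
    the start of a segment. With no segment list the whole array is one
    segment.
    """
    if segment is None:
        segment = [True] + [False] * (len(x) - 1)
    out = []
    current = None
    for value, starts_segment in zip(x, segment):
        if starts_segment or current is None:
            current = value          # a new segment picks up its own head
        out.append(current)
    return out


x = [1, 2, 3, 4, 5, 6, 7, 8]
T, F = True, False
print(copy_scan(x))                            # [1, 1, 1, 1, 1, 1, 1, 1]
print(copy_scan(x, [T, F, F, F, T, F, F, T]))  # [1, 1, 1, 1, 5, 5, 5, 8]
```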
The third and most general form of communication allows a virtual processor to access data in the memory of any other virtual processor. These operations go by several different names even within the CM environment; we shall refer to them in terms already familiar in sparse matrix computation: gather and scatter.

A gather allows processors to read data in the memory of other processors. The CM term for a gather is pref!! (for the *Lisp programmer) or get (for the Paris programmer). In a gather, three pvars are involved: the source, the destination, and the address. The address of the processor whose memory is to be read is taken from the integer address pvar. Suppose the source is the one-dimensional pvar [15, 14, 13, ..., 2, 1, 0] and the address pvar is [0, 1, 2, 0, 1, 2, ..., 0, 1, 2, 0]. Then the data stored in the destination is [15, 14, 13, 15, 14, 13, ..., 15, 14, 13, 15]. The Fortran-90 or Matlab statement that accomplishes this is dest = source(address); it performs the assignment dest(i) $\leftarrow$ source(address(i)) for $1 \le i \le$ length(dest).

A scatter allows processors to write data to the memory of other processors. The CM term for a scatter is *pset (for the *Lisp programmer) or send (for the Paris programmer). Again the three pvars are a source, a destination, and an integer address. The Fortran-90 or Matlab version is dest(address) = source, and the effect is dest(address(i)) $\leftarrow$ source(i) for $1 \le i \le$ length(source).
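Gather and scatter can be written as two short sequential functions, which may help in reading the examples above. The names `gather` and `scatter` follow the text, but the Python signatures are assumptions of this sketch, addresses are zero-based as in the example pvars, and the scatter shown here is only meaningful when the addresses are distinct.

```python
def gather(source, address):
    """dest[i] <- source[address[i]] for every i."""
    return [source[a] for a in address]


def scatter(source, address, dest):
    """dest[address[i]] <- source[i]; assumes the addresses are distinct."""
    result = list(dest)
    for value, a in zip(source, address):
        result[a] = value
    return result


source = list(range(15, -1, -1))         # [15, 14, 13, ..., 1, 0]
address = [i % 3 for i in range(16)]     # [0, 1, 2, 0, 1, 2, ..., 0]
print(gather(source, address))           # [15, 14, 13, 15, 14, 13, ..., 15]

# A scatter through a permutation (here a reversal) has no collisions.
reversal = list(range(15, -1, -1))
print(scatter(source, reversal, [None] * 16))   # [0, 1, 2, ..., 15]
```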
In a scatter, when the address pvar has duplicate values, data from several source processors are sent to the same destination processor. (When this happens we say that a collision has occurred.) The programmer can select one of several different ways to combine colliding values by specifying a combining operator. For example, if the source and address are as above and the destination initially has the value [1, 1, ..., 1] then after a scatter-with-add, the destination has the value [45, 40, 35, 1, 1, 1, ..., 1]. The sum of elements source(j) such that address(j) = k is stored in dest(k) if there are any such elements; otherwise dest(k) is unchanged. Other combining operators include product, maximum, minimum, and "error". We have found combining scatter to be a powerful aid to data parallel programming.

Measured CM performance. We will develop a model of performance on data parallel architectures and use it to analyze performance of several algorithms for sparse Cholesky factorization. The essential machine characteristics in the model are described by five parameters:

    $\mu$      The memory reference time for a 32 bit word
    $\phi$     The 32 bit floating-point multiply time, in units of $\mu$
    $\sigma$   The 32 bit scan time, in units of $\mu$
    $\gamma$   The 32 bit grid communication time, in units of $\mu$
    $\rho$     The 32 bit router time, in units of $\mu$

(Floating-point addition takes the same time as multiplication, $\phi$.) In our model, execution time scales linearly with VP ratio, which is essentially correct for the Connection Machine. Therefore $\mu$ is proportional to VP ratio, and the other parameters are independent of VP ratio. In Table 1, we give measured values for these parameters obtained by experiment on the CM-2. We observe that router times range over a factor of four depending on the number of collisions; it is possible to design pathological routing patterns that perform even worse than this. For any given pattern, gather usually takes just twice as long as scatter, presumably because it is implemented by sending a request and then sending a reply. In our approximate analyses, therefore, we generally choose a value of $\rho$ for scatter corresponding to the number of collisions observed, and model gather as taking $2\rho$ floating-point times.

Table 1. Parameters of CM model.

    Connection Machine Parametric Model

    Parameter   Description               Measured CM-2 value
    $V$         Virtual processor ratio
    $\mu$       Memory reference time     4.8 $V$ $\mu$sec
    $\phi$      Multiply or add time      7 $\mu$
    $\gamma$    News time                 3 $\mu$
    $\sigma$    Scan time                 (6.2 + 1.2 $\log_2$ scan-distance) $\mu$
    $\rho$      Route time                scatter (no collisions): 64 $\mu$
                                          scatter-add (4 collisions): 110 $\mu$
                                          scatter-add (100 collisions): 200 $\mu$
                                          gather (many collisions): 430 $\mu$
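The combining scatter described above can be emulated as follows. This is a sequential sketch with a hypothetical function name `scatter_add`; it follows the stated rule that a destination receives the sum of the values sent to it and is left unchanged when nothing is sent, and it does not model router timing or collisions.

```python
def scatter_add(source, address, dest):
    """Scatter with the 'add' combining operator.

    dest[k] receives the sum of all source[j] with address[j] == k;
    positions that no source element addresses keep their old value.
    """
    sums = {}
    for value, a in zip(source, address):
        sums[a] = sums.get(a, 0) + value   # combine colliding values
    result = list(dest)
    for a, total in sums.items():
        result[a] = total
    return result


source = list(range(15, -1, -1))          # [15, 14, 13, ..., 1, 0]
address = [i % 3 for i in range(16)]      # many collisions on 0, 1, 2
print(scatter_add(source, address, [1] * 16))
# [45, 40, 35, 1, 1, ..., 1]
```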
3. Router Cholesky. Our first parallel Cholesky factorization algorithm is a parallel implementation of a standard column-oriented sparse Cholesky; it is based closely on Gilbert and Hafsteinsson's theoretically efficient algorithm [17] for the PRAM model of computation. Its communication requirements are too unstructured for it to be very efficient on a fine-grained multiprocessor like the CM, but we implemented and analyzed it to use as a basis for comparison and to help tune our performance model of the CM.

3.1. The Router Cholesky algorithm. Router Cholesky uses the elimination tree $T(A)$ to organize its computation. For the present, assume that both the tree and the symbolic factorization $G^+(A)$ are available. (In our experiments we computed the symbolic factorization sequentially; Gilbert and Hafsteinsson [17] describe a parallel algorithm.) Each vertex of the tree corresponds to a column of the matrix.

Following George et al. [12], we express a sequential column-oriented Cholesky factorization in terms of two primitive operations, cdiv and cmod:
    procedure Sequential Cholesky (matrix A);
        for j <- 1 to n do
            for each edge (i, j) of G+(A) with i < j do
                cmod(j, i) od;
            cdiv(j) od
    end Sequential Cholesky;

Routine cdiv($j$) divides the subdiagonal elements of column $j$ by the square root of the diagonal element in that column, and routine cmod($j, i$) modifies column $j$ by subtracting a multiple of column $i$. This is called a left-looking algorithm because column $j$ accumulates all necessary updates cmod($j, i$) from columns to its left just before the cdiv($j$) that completes its computation. By contrast, a right-looking algorithm would perform all the updates cmod($j, i$) using column $i$ immediately after the cdiv($i$).

Now consider the elimination tree $T(A)$. A given column (vertex) is modified only by columns (vertices) that are its descendants in the tree [30]. Therefore a parallel left-looking algorithm can compute all the leaf vertex columns at once.

    procedure Router Cholesky (matrix A);
        for h <- 0 to height(n) do
            for each edge (i, j) with height(i) < height(j) = h pardo
                cmod(j, i) od;
            for each vertex j with height(j) = h pardo
                cdiv(j) od
        od
    end Router Cholesky;

Here height($j$) is the length of the longest path in $T(A)$ from vertex $j$ to a leaf. Thus the leaves have height 0, vertices whose children are all leaves have height 1, and so forth. The outer loop of this algorithm works sequentially from the leaves of the elimination tree up to the root. At each step, an entire level's worth of cmod's and cdiv's are done.

A processor is assigned to every nonzero of the triangular factor (or, equivalently, to every edge and vertex of the filled graph $G^+$). Suppose processor $P_{ij}$ is assigned to the nonzero that is initially $a_{ij}$ and will eventually become $l_{ij}$. (If $l_{ij}$ is a fill, then $a_{ij}$ is initially zero; recall that we assume that the symbolic factorization is already done, so we know which $l_{ij}$ will be nonzero.) In the parallel cdiv($j$), processor $P_{jj}$ computes $l_{jj}$ as the square root of its element, and sends $l_{jj}$ to processors $P_{ij}$ for $i > j$, which then divide their own nonzeros by $l_{jj}$. In the parallel cmod($j, i$), processor $P_{ji}$ sends the multiplier $l_{ji}$ to the processors $P_{ki}$ with $k > j$. Each such $P_{ki}$ then computes the update $l_{ki} l_{ji}$ locally and sends it to $P_{kj}$ to be subtracted from $l_{kj}$.

We call this a left-initiated algorithm because the multiplications in cmod($j, i$) are performed by the processors in column $i$ who then, on their own initiative, send these updates to a processor in column $j$. Each column $i$ is involved in at most one cmod at a time because every column modifying $j$ is a descendant of $j$ in $T(A)$, and the subtrees rooted at vertices of any given height are disjoint. Therefore each processor participates in at most one cmod or cdiv at each parallel step. If we ignore the time taken by communication (including the time to combine updates to a single $P_{kj}$ that may come from different $P_{ki_1}$, $P_{ki_2}$, ...), then each parallel step takes a constant
amount of time and the parallel algorithm runs in time proportional to the height of the elimination tree $T(A)$.

3.2. CM implementation of Router Cholesky. To implement Router Cholesky on the CM we must specify how to assign data to processors, and then describe how to do the communication in cdiv and cmod.

We use one (virtual) processor for each nonzero in the triangular factor $L$. We lay out the nonzeros in a one-dimensional array in column major order, as in many standard sequential sparse matrix algorithms [13]. This makes operations within a single column efficient because they can use the CM scan instructions. Each column is represented by a processor for its diagonal element followed by a processor for each sub-diagonal nonzero. The symmetric upper triangle is not stored. We can also think of this processor assignment as a processor for each vertex $j$ of the filled graph, followed by a processor for each edge $(i, j)$ with $i > j$.

Router Cholesky, like many data parallel algorithms, is profligate of parallel variable storage. Each processor contains some working storage and the following pvars:

    l             Element of factor matrix L, initially A.
    i             Row number of this element.
    j             Column number of this element.
    j-ht          height(j) in T(A).
    i-ht          height(i) in T(A).
    diagonal-p    Boolean: Is this a diagonal element?
    e-parent      In processor P_ij, a pointer to P_{i,p(j)}.
    next-update   Pointer to next element this one may update.

(Recall that $p(j) > j$ is the elimination tree parent of vertex $j < n$.)

At each stage of the sequential outer loop, each processor uses i-ht and j-ht to decide whether it participates in a cdiv or cmod. By comparing the local processor's i-ht or j-ht to the current value of the outer loop index, a processor can determine if its element is in a column that is involved in a cdiv or a cmod.

The cdiv uses a scan operation to copy the diagonal element to the rest of the active column.

The cmod uses a similar scan to copy the multiplier $l_{ji}$ down to the rest of column $i$. The actual update is done by a scatter-with-add, which uses the router to send the update to its destination.

To figure out where to send the update, each element maintains a pointer called next-update to a later element in its row. The nonzero positions in each row are a connected subgraph of the elimination tree, and are linked together in this tree structure by the e-parent pointers. Each nonzero updates only elements in columns that are its ancestors in the elimination tree. At each stage, next-update is moved one step up the tree using the e-parent pointers.

3.3. Router Cholesky performance: Theory. Each stage of Router Cholesky does a constant number of router operations, scans, and arithmetic operations. The number of stages is $h + 1$, where $h$ is the height of the elimination tree. In terms of the parameters of the machine model in Section 2.3.2, its running time is

$$(c_1 \rho + c_2 \sigma + c_3 \phi)\, \mu h.$$

Recall that the memory reference time $\mu$ is proportional to the virtual processor ratio, which in this case is $\lceil \eta(L)/p \rceil$. The $c_i$ are constants.
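The cdiv and cmod primitives and the level-by-level schedule of Section 3.1 can be sketched in a few lines. This is a sequential illustration under several assumptions rather than the CM implementation: the matrix is held as a dense list of lists with zero-based indices, the nonzero structure of the factor is read off a previously computed factor instead of a symbolic factorization, and the names `sequential_cholesky`, `etree_heights`, and `router_cholesky` are hypothetical. The `pardo` loops are run one iteration at a time.

```python
import math


def cdiv(L, j):
    """Replace the diagonal of column j by its square root, then scale the column."""
    L[j][j] = math.sqrt(L[j][j])
    for i in range(j + 1, len(L)):
        L[i][j] /= L[j][j]


def cmod(L, j, i):
    """Subtract from column j the multiple l_ji of the finished column i."""
    for k in range(j, len(L)):
        L[k][j] -= L[k][i] * L[j][i]


def lower_triangle(A):
    n = len(A)
    return [[float(A[r][c]) if c <= r else 0.0 for c in range(n)] for r in range(n)]


def sequential_cholesky(A):
    """Left-looking factorization: column j collects its updates, then is scaled."""
    L = lower_triangle(A)
    for j in range(len(L)):
        for i in range(j):
            if L[j][i] != 0.0:            # (i, j) is an edge of the filled graph
                cmod(L, j, i)
        cdiv(L, j)
    return L


def etree_heights(L):
    """height[j] = longest path from j down to a leaf of the elimination tree."""
    n = len(L)
    height = [0] * n
    for u in range(n):                    # children are numbered below parents
        parent = next((v for v in range(u + 1, n) if L[v][u] != 0.0), None)
        if parent is not None:
            height[parent] = max(height[parent], height[u] + 1)
    return height


def router_cholesky(A, structure, height):
    """Same arithmetic, scheduled by elimination-tree level from the leaves up."""
    L = lower_triangle(A)
    n = len(L)
    for h in range(max(height) + 1):
        for j in (v for v in range(n) if height[v] == h):
            for i in range(n):
                if i != j and structure[max(i, j)][min(i, j)] and height[i] < h:
                    cmod(L, j, i)
        for j in (v for v in range(n) if height[v] == h):
            cdiv(L, j)
    return L


A = [[4, 0, 2, 0],
     [0, 9, 3, 0],
     [2, 3, 6, 2],
     [0, 0, 2, 5]]
L = sequential_cholesky(A)
structure = [[value != 0.0 for value in row] for row in L]
heights = etree_heights(L)
print(heights)                                   # [0, 0, 1, 2]
print(router_cholesky(A, structure, heights) == L)
```

In this small example columns 0 and 1 are both leaves, so the level schedule finishes them in the same step; the number of sequential steps is the tree height plus one rather than the number of columns.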
The most time-consuming step is incrementing the next-update pointer; the router is used twice in this operation. The dominant term is the router term $c_1 \rho \mu h$. Notice that we do not explicitly count time for combining updates to the same element from different sources, since this is handled within the router and is thus included in $\rho$.

To get a feeling for this analysis consider the model problem, a 5-point finite difference mesh in two dimensions ordered by nested dissection [11]. If the mesh has $k$ points on a side, then the graph is a $k$ by $k$ square grid, and we have $n = k^2$, $h = O(k)$, and $\eta(L) = O(k^2 \log k)$. The number of arithmetic operations in the Cholesky factorization is $O(k^3)$, in either the sequential or parallel algorithms. Router Cholesky's running time is $O(k^3 \log k / p)$. If we define performance as the ratio of the number of operations in the sequential algorithm to parallel time, we find that the performance is $O(p / \log k)$ (taking $\rho$ to be a constant independent of $p$ or $k$; this is approximately correct for the Connection Machine, although theoretically $\rho$ should grow at least with $\log p$). This analysis points out two weak points of Router Cholesky. First, the performance on the model problem drops with increasing problem size. (This depends on the problem, of course; for a three-dimensional model problem a similar analysis shows that performance is $O(p)$ regardless of problem size.) More seriously, the constant in the leading term of the complexity is proportional to the router time $\rho$, because every step uses general communication.

This analysis can be extended to any two-dimensional finite element mesh with bounded node degree, ordered by nested dissection [17]. The asymptotic analysis is the same but the values of the constants will be different.

3.4. Router Cholesky performance: Experiments. In order to validate the timing model and analysis, we experimented with Router Cholesky on a variety of sparse matrices. We present one example here in detail. The original matrix is $2500 \times 2500$ with 7400 nonzeros (counting symmetric entries only once), representing a $50 \times 50$ five-point square mesh. It is preordered by Sparspak's automatic nested dissection heuristic [13], which gives orderings very similar to the ideal nested dissection ordering used in the analysis of the model problem above. The Cholesky factor has $\eta(L) = 48{,}608$ nonzeros, an elimination tree of height $h = 144$, and takes 1,734,724 arithmetic operations to compute.

We ran this problem on CM-2's at the Xerox Palo Alto Research Center and the NASA Ames Research Center. The results quoted here are from $p = 8{,}192$ processors, with floating point coprocessors, of the machine at NASA. The VP ratio was therefore $V = \lceil \eta(L)/p \rceil = 8$. (Rounding up to a power of two has considerable cost here, since we use only 48,608 of the 65,536 virtual processors.) We observed a running time of 53 seconds, of which about 41 seconds was due to gathers and scatters. Substituting into the analysis above (using $\rho = 200$ since there were in general many collisions), we would predict router time $c_1 \rho \mu h = 39$ seconds and other time $(c_2 \sigma + c_3 \phi) \mu h = 1.5$ seconds. This is not a bad fit for router time; it is not clear why the remaining time is such a poor fit, but the expensive square root and the data movement involved in the pointer updates contribute to it, and it seems that I/O may have affected the measured 53 seconds.

The observation, in any case, is that router time completely dominates.

3.5. Remarks on Router Cholesky. Router Cholesky is too slow as it stands to be a cost-effective way to factor sparse matrices. Each stage does two gathers and a scatter with exactly the same communication pattern. More careful use of the router could probably speed it up by a factor of two to five. However, this would not
be enough to make it practical; something more like a hundredfold improvement in router speed would be needed.

The one advantage of Router Cholesky is the extreme simplicity of its code. It is no more complicated than the numeric factorization routine of a conventional sequential sparse Cholesky package [13]. It is interesting to note that column-oriented sparse Cholesky codes on MIMD message-passing multiprocessors [12, 14, 35] are more complex. They exploit MIMD capability to implement dynamic scheduling of the cmod and cdiv tasks. They allow arbitrary assignment of columns to processors and therefore are required to use indirect addressing of columns. Finally, they are written with low-level communication primitives, the explicit "send" and "receive."

Router Cholesky's simplicity comes dearly. Flexibility in scheduling allows MIMD implementations to gain a modest performance advantage over any possible SIMD implementation. More important, we employ $\eta(L)$ virtual processors, regardless of the number of physical processors. It is essential that these virtual processors not all sit idle, consuming physical processor time slices, when there is nothing for them to do. As implemented by the Paris instruction set, they do sit idle.

We described Router Cholesky as a left-initiated, left-looking algorithm. In a right-initiated algorithm, processor $P_{ij}$ would perform the updates to $l_{ij}$. In a right-looking algorithm, updates would be applied as soon as the updating column of $L$ was computed instead of immediately before the updated column of $L$ was to be computed. Router Cholesky is thus one of four cousins. It is the only one of the four that maps operations to processors evenly; the other three alternatives require an inner sequential loop of some kind. All four versions require at least $h$ router operations.
4. Grid Cholesky. In this section we present a parallel supernodal, multifrontal Cholesky algorithm and its implementation on the CM. Multifrontal methods, introduced by Duff and Reid [7], compute a sparse Cholesky factorization by performing symmetric updates on a set of dense submatrices. We follow Liu [23] in referring to an algorithm that uses rank-1 updates "multifrontal" and the block version that uses rank-$k$ updates "supernodal multifrontal." The idea of using block methods to operate on supernodes has been used in many different sparse factorization algorithms [1, 7]. Parallel supernodal or multifrontal algorithms have been used on MIMD message-passing and shared-memory machines [2, 6, 32].

The algorithm uses a two-dimensional VP set (which we call the "playing field") to partially factor, in parallel, a number of dense principal submatrices of the partially factored matrix. By working on the playing field, we may use the fast grid and scan mechanisms for all the necessary communication during the factorization of the dense submatrices. Only when we need to move these dense submatrices back and forth to the playing field do we need to use the router. In this way we drastically reduce the use of the router: for the model problem on a $k \times k$ grid we reduce the number of uses from $h = 3k - 1$ to $2 \log_2 k - 1$. The playing field can also operate at a lower VP ratio in general because it does not need to store the entire factored matrix at once.

This method, like all multifrontal methods, is in essence an "out of core" method in that the Cholesky factor is kept in a data structure that is not referred to while doing arithmetic, all of which is done on dense submatrices. The novelty here is the factorization of many of these dense problems in parallel; the simultaneous exploitation of the parallelism available within each of the dense factorizations; the use of a two-dimensional grid of processors for these parallel dense factorizations; the use of the machine's router for parallel transfers from the matrix storage data structure; and
the use of the combining scatter operations for parallel update of the matrix storage data structure.

4.1. The Grid Cholesky algorithm.

Block Jess and Kees reordering. First we describe an equivalent reordering of the chordal graph $G = G(A)$ that we call the block Jess/Kees ordering. Block Jess/Kees is a perfect elimination ordering that has two properties that make it the best equivalent reordering for our purposes: it eliminates vertices with identical monotone neighborhoods consecutively, and it produces a clique tree of minimum height.

Our reordering eliminates all the simplicial vertices of $G$ simultaneously as one major step. In the process, it partitions all the vertices of $G$ into supernodes. Each of these supernodes is a clique in $G$, and is a simplicial clique when its component vertices are about to be eliminated. Each vertex is labeled with the stage, or major step number, at which it is eliminated. In more detail, the reordering algorithm is as follows.

    procedure reorder(graph G)
        activestage <- -1;
        while G is not empty do
            activestage <- activestage + 1;
            Number all the simplicial vertices in G, assigning
                consecutive numbers within each supernode;
            stage(v) <- activestage for all simplicial vertices v;
            Remove all the simplicial vertices from G
        od;
        h <- activestage
    end reorder

The cliques are the nodes of a clique tree whose height is $h$, one less than the number of major elimination steps. The parent of a given clique is the lowest-stage clique adjacent to the given clique.
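The reordering procedure above can be sketched sequentially as follows. This is a minimal illustration, not the authors' implementation: the adjacency-dictionary representation and the name `block_jess_kees` are hypothetical, and supernodes are formed here by grouping simplicial vertices with identical closed neighborhoods, which is an assumption about how "consecutive numbers within each supernode" is realized.

```python
from itertools import combinations

def is_simplicial(g, v):
    """A vertex is simplicial when its neighbors are pairwise adjacent."""
    return all(b in g[a] for a, b in combinations(g[v], 2))

def block_jess_kees(graph):
    """Return (order, stage, supernodes, h) for a chordal graph.

    graph maps each vertex to the set of its neighbors.
    """
    g = {v: set(nb) for v, nb in graph.items()}   # work on a private copy
    order, stage, supernodes = [], {}, []
    active_stage = -1
    while g:
        active_stage += 1
        simplicial = [v for v in g if is_simplicial(g, v)]
        if not simplicial:
            raise ValueError("graph is not chordal")
        # Assumption: simplicial vertices sharing a closed neighborhood
        # form one supernode and are numbered consecutively.
        groups = {}
        for v in simplicial:
            groups.setdefault(frozenset(g[v] | {v}), []).append(v)
        for members in groups.values():
            members.sort()
            supernodes.append((active_stage, members))
            order.extend(members)
            for v in members:
                stage[v] = active_stage
        # Remove all simplicial vertices simultaneously.
        for v in simplicial:
            for w in g.pop(v):
                if w in g:
                    g[w].discard(v)
    return order, stage, supernodes, active_stage
```

For two triangles sharing an edge, the two unshared corners are eliminated at stage 0 as separate supernodes and the shared edge forms a single supernode at stage 1, so the clique tree has height $h = 1$.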
The name "block Jess/Kees" indicates a relationship with an algorithm due to Jess and Kees [21] that finds an equivalent reordering for a chordal graph so as to minimize the height of its elimination tree. The original (or "point") Jess/Kees ordering eliminates just one vertex from each simplicial clique at each major step. (This is a maximum-size independent set of simplicial vertices.) Each step of point Jess/Kees produces one level of an elimination tree, from the leaves up. The resulting elimination tree has minimum height over all perfect elimination orders on $G$. Our block Jess/Kees eliminates all the simplicial vertices at each major step, producing a clique tree one level at a time from the leaves up. This ordering may not minimize the height of the elimination tree. However, as Blair and Peyton [4] have shown, it does produce a clique tree of minimum height over all perfect elimination orders on $G$.

Every vertex is included in exactly one supernode. We number the supernodes as $\{S_1, \ldots, S_m\}$ in such a way that if $i < j$ then the vertices in $S_i$ have lower numbers than the vertices in $S_j$. The stage at which a supernode $S$ is eliminated is the iteration of the while loop at which its vertices are numbered and eliminated; thus, for all $v \in S$, stage$(v)$ = stage$(S)$ is the height of node $S$ in the clique tree.

Parallel supernodal multifrontal elimination. Let $C$ be a supernode. It is immediate that $K = \mathrm{adj}(C) \cup C$ is a clique, and that it is maximal. Our factorization algorithm works by forming the principal submatrices of $A$ corresponding
to vertices in the maximal cliques generated by supernodes. Let $\mu = |C|$ and $\nu = |\mathrm{adj}(C)|$. Write $A(K,K)$ for the principal submatrix of order $|K| = \mu + \nu$ consisting of elements $a_{i,j}$ with $i, j \in K$. It is natural to partition the principal submatrix $A(K,K)$ of $A$ as

\[ A(K,K) = \begin{pmatrix} X & E \\ E^T & Y \end{pmatrix}, \]

where $X = A(C,C)$ is $\mu \times \mu$, $Y = A(\mathrm{adj}(C), \mathrm{adj}(C))$ is $\nu \times \nu$, and $E$ is $\mu \times \nu$.

In the terminology of the previous section, Grid Cholesky is a "block right-looking" algorithm. The details are as follows.

    procedure GridCholesky(matrix A)
        for activestage <- 0 to h do
            for each supernode C with stage(C) = activestage pardo
                Move A(K,K) to the playing field,
                    where K = C U adj(C);
                Set Y to zero on the playing field;
                Perform mu = |C| steps of parallel Gaussian elimination
                    to compute the Cholesky factor L of X,
                    the updated submatrix E' = L^{-1} E,
                    and the Schur complement Y' = -E^T X^{-1} E,
                    where X, E, and Y partition A(K,K) as above;
                A(C,C) <- L;
                A(adj(C),C) <- E'^T;
                A(adj(C),adj(C)) <- A(adj(C),adj(C)) + Y'
            od
        od
    end GridCholesky;

4.2. Multiple dense partial factorization. In order to make this approach useful, we need a fast dense matrix factorization on two-dimensional VP sets. To that end, we discuss an implementation of LU decomposition without pivoting. (We use LU instead of Cholesky here because we can see no efficient way to exploit symmetry with a two-dimensional machine; moreover, LU avoids the square root at each step and so is a bit faster.) We analyzed and implemented two methods: a systolic algorithm that uses only nearest neighbor communication on the grid, and a rank-1 update algorithm that uses row and column broadcast. With either of these methods, all the submatrices $A(K,K)$ corresponding to supernodes at a given stage are distributed about the two-dimensional playing field simultaneously (each as a separate "baseball diamond"), and the partial factorization is applied to all the submatrices at once.

We describe the rank-1 algorithm in terms of its effect on a single submatrix $A(K,K)$, with a supernode of size $\mu$ and a Schur complement of size $\nu$. A single rank-1 update consists of a division to compute the reciprocal of the diagonal element, a scan down the columns to copy the pivot row to the remaining rows, a parallel multiplication to compute the multiplier for each row, another scan to copy the multiplier across each row, and finally a parallel multiply and subtract to apply the update. The number of rank-1 updates is $\mu$, the size of the supernode.

An entire stage of partial factorizations is performed at once, using segmented scans to keep each factorization within its own "diamond." The number of steps on the playing field at stage $s$ is $\mu_s$, the maximum value of $\mu$ over all supernodes at
stage $s$. Then a stage of rank-1 partial factorization takes time

\[ (c_3 \sigma + c_4 \phi)\, v, \]

where $\sigma$ and $\phi$ are the scan and floating-point arithmetic times of the performance model and $v$ is the VP ratio of the playing field. Here $c_3$ is $2\mu_s$, and $c_4$ is proportional to $\mu_s$ as well.

The relative costs of the various parts of the rank-1 update code are summarized below, for a complete factorization (that is, one in which $\nu = 0$). The bookkeeping includes nearest-neighbor communication to move three one-bit tokens that control which processors perform reciprocals, multiplications, and so on at each step.

    Scans (row and column broadcast):            79.7%
    Multiply-subtract (Gaussian elimination):     4.8%

The other three items are News (moving the tokens), Divide (reciprocal of pivot element), and Multiply (computing multipliers); each of them accounts for between 2.7% and 7.1% of the time (the three shares are 7.1%, 5.5%, and 2.7%).

Unlike the rank-1 implementation, the systolic implementation never uses a scan; all communication is between grid neighbors. Thus its communication terms are proportional to the grid communication time $\gamma$ rather than to $\sigma$. This advantage is more than offset by the fact that $3\mu_s + 2\nu_s$ sequential iterations are necessary, while the rank-1 method only needs $\mu_s$.

Remarks on dense partial factorization. Theoretically, systolic factorization should be asymptotically more efficient as machine size and problem size grow without bound, because scans must become more expensive as the machine grows. Realistically, however, the CM happens to have $\sigma \approx 6\gamma$; for a full factorization ($\nu = 0$) a sixfold decrease in communication time per step more than balances the threefold increase in number of steps, and so the systolic method is somewhat faster. But for a partial factorization the rank-1 algorithm is the clear winner. For example, for the two-dimensional model problem the average Schur complement size $\nu_s$ is about $4\mu_s$, so the rank-1 code has an 11-to-1 advantage in number of steps. This more than makes up for the fact that scan is much slower than grid communication.

It is interesting to note that the only arithmetic that matters in a sequential algorithm, the multiply-subtract, accounts for only 1/21 of the total time in the rank-1 parallel algorithm. Moreover, only 1/3 of the multiply-subtract operations actually do useful work, since the active part of the matrix occupies a smaller and smaller subset of the playing field as the computation progresses. This gives the code an overall efficiency of one part in 63 for LU and one part in 126 for Cholesky. We have found this to be typical in *Lisp codes for matrix operations, especially with high VP ratios. The reasons are these: scan is slow relative to arithmetic; the divide and multiply operations occur on very sparse VP sets; and the VP ratio remains constant as the active part of the matrix gets smaller.

Significant performance improvements could come from several possible sources.

A more efficient implementation of virtual processors could improve performance substantially: the loop over $v$ virtual processors should be restricted to those virtual processors that are active. Sometimes a determination that this is going to happen can readily be made at compile time. As an alternative, we could have rewritten the code; the VP set could shrink as the matrix shrinks, and the divide and the multiplies could be performed in a sparser VP set.
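The rank-1 partial factorization of Section 4.2 can be sketched for a single dense submatrix as below. This is a sequential illustration under stated assumptions: the function names `partial_lu` and `step_counts` are hypothetical, the matrix is a plain list of lists, and each inner loop stands in for what would be one parallel operation (with scans doing the broadcasts) on the playing field.

```python
def partial_lu(a, mu):
    """Apply mu rank-1 update steps (LU without pivoting) to square matrix a, in place.

    Afterwards the leading mu columns hold the multipliers below the diagonal,
    the leading mu rows hold the U factor, and the trailing block holds
    Y - E^T X^{-1} E for the partition A(K,K) = [[X, E], [E^T, Y]].
    """
    n = len(a)
    for t in range(mu):
        recip = 1.0 / a[t][t]            # divide: reciprocal of the pivot element
        for i in range(t + 1, n):
            a[i][t] *= recip             # multiply: one multiplier per remaining row
        for i in range(t + 1, n):        # multiply-subtract: the rank-1 update
            m = a[i][t]
            for j in range(t + 1, n):
                a[i][j] -= m * a[t][j]
    return a

def step_counts(mu, nu):
    """Sequential step counts quoted in the text: (rank-1, systolic)."""
    return mu, 3 * mu + 2 * nu
```

With $Y$ set to zero beforehand, the trailing block after `partial_lu` is exactly the Schur complement update $Y' = -E^T X^{-1} E$. For a Schur complement four times the supernode size, `step_counts(mu, 4 * mu)` reproduces the 11-to-1 ratio in favor of the rank-1 method.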
The scans could be sped up considerably within the hypercube connection structure of the CM. Ho and Johnsson [19] have developed an ingenious algorithm that takes $O(b/d + d)$ time to broadcast $b$ bits in a $d$-dimensional hypercube, in contrast to the scan which takes $O(b + d)$. The copy scans could also be implemented so that each physical processor gets only one copy of the data rather than $v$ copies.

More efficient use of the low-level floating-point architecture of the CM-2 is possible. The Paris instruction set hides the fact that every 32 physical processors share one vector floating-point arithmetic chip. Performing 32 floating-point operations implies moving 32 numbers in bit-serial fashion into a transposer chip, then moving them in parallel into the vector chip, then reversing the process to store the result. While this mode of operation conforms to the one-processor-per-data-element programming model, it wastes time and memory bandwidth when only a few processors are actually active (such as computing multipliers or diagonal reciprocals in LU), since 32 memory cycles are required just to access one floating-point number.

This mode also requires intermediate results to be stored back to main memory, precluding the use of block algorithms [3] that could store intermediate results in the registers in the transposer. Thus the computation rate is limited by the bandwidth between the transposer and memory (about 3.5 gigaflops for a 64K processor CM) instead of by the operation rate of the vector chip (about 27 gigaflops total).

A more efficient dense matrix factorization can be achieved by treating each 32-processor-plus-transposer-and-vector-chip unit as a single processor, and representing 32-bit floating-point numbers "slicewise," with one bit per physical processor, so that they do not need to be moved bit-serially into the arithmetic unit. (Viewed this way, we were working with just 256 real processors!) Also, if virtualization is handled efficiently, we need only keep 256 processors busy. At the time this work was done (late in 1988) the tools for programming on this level were not easily usable. While CM Fortran allows this model, it does not allow scans and scatter with combining, so we still (early 1991) cannot use it.

4.3. CM implementation of Grid Cholesky. Grid Cholesky uses two VP sets with different geometries: the matrix storage stores the nonzero elements of $A$ and $L$ (doing almost no computation), and the playing field is where the dense partial factorizations are done. The top-level factorization procedure is just a loop that moves the active submatrices to the playing field, factors them, and moves updates back to the main matrix.

Matrix storage. The matrix storage VP set is a one-dimensional array of virtual processors that stores all of $A$ and $L$ in essentially the same form as Router Cholesky. Each of the following pvars has one element for each nonzero in $L$.

    lvalue        elements of L, initially those of A.
    updates       working storage for sum of incoming updates.
    gridi         the playing field row in which this element sits.
    gridj         the playing field column in which this element sits.
    activestage   the stage at which j occurs in a supernode.

We use a scatter to move the active columns from matrix storage to the playing field. The supernodes $C$ are disjoint, but their neighboring sets $\mathrm{adj}(C)$ may overlap; that is, more than one $Y_C$ may be computing updates to the same element of $L$ at the same stage. Therefore, we use scatter-with-add to move the partially factored matrix from the playing field back to matrix storage.
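The role of the combining scatter can be shown with a small sequential sketch. The function name `scatter_with_add` and the flat-list representation are assumptions for illustration; on the CM the whole operation would be a single router operation, whereas here it is an explicit loop.

```python
def scatter_with_add(dest_len, dest_index, values):
    """Send values[i] to position dest_index[i]; colliding values are summed."""
    out = [0.0] * dest_len
    for d, x in zip(dest_index, values):
        out[d] += x          # a collision combines by addition instead of overwriting
    return out

# Two Schur complements contribute to the same element of L (index 2 here),
# so their updates must be added rather than one overwriting the other.
updates = scatter_with_add(4, [0, 2, 2, 3], [1.0, 2.0, 3.0, 4.0])
```

A plain scatter suffices in the other direction (matrix storage to playing field) because each playing field cell receives at most one matrix element.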
The playing field. The second VP set, called the playing field, is the two-dimensional grid on which the supernodes are factored. In our implementation it is large enough to hold all the principal submatrices for all maximal cliques at any stage, although it could actually use different VP ratios at different stages for more efficiency. Its size is determined as part of the symbolic factorization and reordering. The pvars used in this VP set are

    densea        the playing field for matrix elements.
    updatedest    the matrix storage location (processor) that holds this
                  matrix element; an integer pvar array indexed by stage.

as well as some boolean flags used to coordinate the simultaneous partial factorization of all the maximal cliques.

The processors of the playing field compute LU factorizations by simultaneously doing rank-1 updates (see Section 4.2) of all the dense submatrices stored there, using segmented scans to distribute the pivot rows and columns within all submatrices at the same time. The number of rank-1 update steps is the size of the largest supernode at the current stage. The submatrices may be different sizes; each matrix only does as many rank-1 updates as the size of its supernode.

In order to use this procedure we need to find a placement of all the submatrices $A(K,K)$ for all the maximal cliques $K$ at every stage. This is a two-dimensional bin packing problem. In order to minimize CM computation time, we want to pack these square arrays onto the smallest possible rectangular playing field (whose borders must be powers of two). Optimal two-dimensional bin packing is in general an NP-hard problem, though various efficient heuristics exist [9]. Our experiments use a simple "first fit by levels" heuristic. This layout is done during the sequential symbolic factorization, before the numeric factorization is begun.

4.4. Grid Cholesky performance: Theory. We separate the running time for Grid Cholesky into time in the matrix storage VP set and time on the playing field. The former includes all the router traffic, and essentially nothing else. (There is one addition per stage to add the accumulated updates to the matrix.) There is a fixed number of router uses per stage, so the matrix storage time is

\[ T_{MS} = c_5\, \rho\, v_{MS}\, h \]

for some constant $c_5$, where $\rho$ is the router time of the performance model. The subscript MS indicates that the value of $v$ is taken in the matrix storage VP set, whose VP ratio is $v_{MS} = \lceil \eta(L)/p \rceil$. In the current implementation $c_5 = 4$, since two scatters are used to move the dense matrices to the playing field at the beginning of a stage, and then two scatters are used to move the completely computed columns and the Schur complements back to matrix storage.

We express the playing field time as a sum over levels. At each level $s$ the number of rank-1 updates is the size of the largest supernode at that level, which is $\mu_s$. According to the analysis in Section 4.2,

\[ T_{PF} = (c_6 \sigma + c_7 \phi) \sum_{s=0}^{h} v_s\, \mu_s, \]

where $c_6$ and $c_7$ are constants (in fact $c_6 = 2$), $\sigma$ and $\phi$ are the scan and floating-point times of the performance model, and the subscript $s$ indicates that the value of $v$ is taken in the playing field VP set at stage $s$. The VP ratio in this VP set could be approximately the ratio of the total size of the dense submatrices at
stage $s$ to the number of processors, changing at each stage as the number and size of the maximal cliques vary. However, in our implementation it is simply fixed at the maximum of this value over all stages.

Again, to get a feeling for this analysis let us consider the model problem, a five-point finite difference mesh on a $k \times k$ grid ordered by nested dissection. For this problem $n = k^2$, $h = O(\log k)$, and $\eta(L) = O(k^2 \log k)$. The factorization requires $O(k^3)$ arithmetic operations. Table 2 summarizes the number and sizes of the cliques that occur at each stage. The columns in the table are as follows.

    R(s)                    Number of supernodes at stage s.
    mu_s                    Size of largest supernode at stage s.
    mu_s + nu_s             Size of largest maximal clique C U adj(C) at stage s.
    Sum (mu_C + nu_C)^2     Total area of all dense submatrices A(K,K) at stage s.

Table 2. Subproblem counts and playing field size for the model problem.

    Stage s      R(s)        mu_s         mu_s + nu_s     Sum (mu_C + nu_C)^2
    h            1           k            k               k^2
    h-1          2           k/2          3k/2            4.5 k^2
    h-2          4           k/2          3k/2            9 k^2
    h-3          8           k/4          3k/2            18 k^2
    h-4          16          k/4          5k/4            25 k^2
    h-5          32          k/8          7k/8            24.5 k^2
    h-6          64          k/8          5k/8            25 k^2
    h-7          128         k/16         7k/16           24.5 k^2
    h-2r         2^(2r)      k/2^r        5k/2^r          25 k^2
    h-2r-1       2^(2r+1)    k/2^(r+1)    7k/2^(r+1)      24.5 k^2

The VP ratio in matrix storage for the model problem is $O(\eta(L)/p) = O(k^2 \log k / p)$, so the matrix storage time is $O(k^2 \log^2 k / p)$. Our pilot implementation uses the same size playing field at every stage. According to Table 2, a playing field of size $\lceil 25k^2 \rceil$ suffices if the problems can be packed in without too much loss. The VP ratio is $O(k^2/p)$. The sum over all stages of $\mu_s$ is $O(k)$ (in fact it is $3k + O(1)$), so the playing field time is $O(k^3/p)$. In summary, the total running time of Grid Cholesky for the model problem is

\[ O\!\left( \frac{k^2 \log^2 k}{p} + \frac{k^3}{p} \right). \]

Two things are notable about this: First, the performance, or ratio of sequential arithmetic operations to time, is $O(p)$; the $\log k$ inefficiency of Router Cholesky has vanished. This is because the playing field, where the bulk of the computation is done, has a lower VP ratio than the matrix storage structure. Second, and much more important in practice, the router speed appears only in the second-order term. This is because the playing field computations are done on dense matrices with more efficient grid communication. This means that the router time becomes less important as the problem size grows, whether or not the number of processors grows with the problem size.
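The entries of Table 2 and the two totals used in the analysis (a maximum area of $25k^2$ and a supernode-size sum of $3k + O(1)$) can be checked with a short script. This is a sketch under assumptions: the function name `table2_row` is hypothetical, $k$ is taken to be a power of two, and the idealized row formulas are applied all the way down to supernodes of size 1, ignoring boundary effects of a real mesh.

```python
from fractions import Fraction

def table2_row(k, d):
    """Idealized Table 2 row for stage s = h - d of the k-by-k model problem.

    Returns (R, mu, clique, area): supernode count, largest supernode size,
    largest clique size, and total area of the dense submatrices.
    """
    r, odd = divmod(d, 2)
    count = 2 ** d
    mu = Fraction(k, 2 ** (r + odd))
    if d == 0:
        clique = Fraction(k)                      # top separator, no Schur complement
    elif d <= 3:
        clique = Fraction(3 * k, 2)               # first three levels below the top
    elif odd:
        clique = Fraction(7 * k, 2 ** (r + 1))    # general row h - 2r - 1
    else:
        clique = Fraction(5 * k, 2 ** r)          # general row h - 2r
    return count, mu, clique, count * clique ** 2

k = 64
areas = [table2_row(k, d)[3] / k ** 2 for d in range(8)]
mu_sum = sum(table2_row(k, d)[1] for d in range(13))   # stages down to size-1 supernodes
```

Here `areas` reproduces the last column of the table in units of $k^2$, its maximum is 25, and `mu_sum` equals $3k - 2$, consistent with the $3k + O(1)$ stated above.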
One way of looking at this analysis is to think of increasing both problem size and machine size so that the VP ratio remains constant. Then the model problem requires $O(k)$ total parallel operations, but only $O(\log k)$ router operations. The analysis of the model problem carries through (with different constant factors) for any two-dimensional finite element problem ordered by nested dissection; a similar analysis carries through for any three-dimensional finite element problem. Thus, Grid Cholesky is a "scalable" parallel algorithm.

4.5. Grid Cholesky performance: Experiments. Here we present experimental results from a fairly small model problem, the matrix arising from the five-point discretization of the Laplacian on a square $63 \times 63$ mesh, ordered by nested dissection. This matrix has $n = 3{,}969$ columns and 11,781 nonzeros (counting symmetric entries only once). The Cholesky factor has $\eta(L) = 86{,}408$ nonzeros, a clique tree with $h = 11$ stages of supernodes, and takes 3,589,202 arithmetic operations to compute.

The matrix storage VP set requires 128K VPs. The fixed-size playing field also requires 128K VPs. The results quoted here are from 8,192 processors, with floating-point coprocessors, of the machine at NASA. Both VP sets therefore had a VP ratio of 16. (A larger problem would need a higher VP ratio in the matrix storage than in the playing field.)

We observed a running time of 6.13 seconds. Of this, 4.09 seconds was playing field time (3.12 for the scans, 0.15 for nearest-neighbor moves of one-bit tokens, and 0.82 for local computation). The other 2.04 seconds was matrix storage time, consisting mostly of the four scatters at each stage. Our analytic model predicts playing field time to be about $3k(2\sigma + 4\phi)\,v_{PF}$, where $\sigma$ and $\phi$ are the scan and floating-point times of the performance model and $v_{PF}$ is the VP ratio of the playing field. This comes to 4.0 seconds, which is in good agreement with experiment. The model predicts matrix storage time of about $h \cdot 4\rho\, v_{MS}$, where $\rho$ is the router time and $v_{MS}$ the VP ratio of matrix storage. This comes to between 1.5 and 4.7 seconds, depending on which value we choose for $\rho$. In fact 3/4 of the router operations are scatters with no collisions, and the other 1/4 are scatter-with-add, typically with two to four collisions. The fit to the model is therefore quite close.
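The measured figures above are mutually consistent, as a short arithmetic check shows. The function name `experiment_summary` and the dictionary keys are hypothetical; every number is taken from the paragraph above, and the megaflop rate is simply the operation count divided by the observed time.

```python
def experiment_summary():
    """Recompute derived quantities from the measurements quoted in the text."""
    processors = 8192
    virtual_processors = 128 * 1024           # size of each VP set
    playing_field = 3.12 + 0.15 + 0.82        # scans + token moves + local computation
    matrix_storage = 2.04
    total = playing_field + matrix_storage
    operations = 3_589_202
    return {
        "vp_ratio": virtual_processors // processors,
        "playing_field_seconds": round(playing_field, 2),
        "total_seconds": round(total, 2),
        "playing_field_share": round(playing_field / total, 2),
        "megaflops": round(operations / total / 1e6, 3),
    }

summary = experiment_summary()
```

The playing field accounts for roughly two thirds of the running time, and the overall rate works out to a little under 0.6 megaflops on 8,192 processors.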
about20timesasfastasroutercholesky.itis,however,onlyrunningat:586 megaopson8kprocessors,whichwouldscaleto4:68megaopsonafull64kmachine.alargerproblemwouldrunsomewhatfaster,butitisclearthatmakinggrid Choleskyattractivewillrequiremajorimprovements.Someoftheseweresketched insection Cholesky.Forthesmallsampleproblemtherelativetimesforrouterandnon-router computationsareasfollows. 4.6.RemarksonGridCholesky.Onthissmallexample,GridCholeskyis Movingdatatotheplayingeld: ArstquestioniswhetherGridCholeskyisarouter-boundcodelikeRouter Evidently,theGridCholeskycodeisnotrouter-boundforthisproblem.Forlarger (orstructurallydenser)problemsthissituationgetsbetterstill:foramachineofxed size,thetimespentusingtheroutergrowslikeo(log2k)whilethetimeontheplaying MovingSchurcomplementsbacktomatrixstorage: Factoringontheplayingeld: 21% 67% 12% eldgrowslikeo(k3)forakkgrid,asweshowedabove.ifwesolvedthesame 19
problem on a full-sized 64K processor machine, the relative times would presumably be the same as above; but if we solved a problem 8 times as large the operation count would increase by a factor of about 22 while the number of stages, and router operations, would increase only by a factor of about 1.3.

Next, we ask whether our use of the playing field is efficient or not. The number of parallel elimination steps on the playing field is given by

\[ \sum_{s=1}^{h} \mu_s, \]

which is 177 for the example. On the playing field of 131,072 processors this allows us to do $22.8 \times 10^6$ multiply-adds or $45.6 \times 10^6$ flops. The number of flops actually done is $3.59 \times 10^6$, so the processor utilization is 7.9%. There are several reasons for this loss of efficiency:

    The algorithm does both halves of the symmetric dense subproblems (factor of 2);
    the implementation uses the same playing field size at every level (factor of about 4/3);
    the architecture forces the dimensions of the playing field to be powers of two (factor of about 4/3);
    the playing field is not fully covered by active processors initially, and as the dense factorization progresses processors in the supernodes fall idle (factor of about 7/2).

These effects account for a factor of roughly 12.4, which is consistent with our measured result; 1/0.079 = 12.6. In other words, every step must use all 131,072 virtual processors, but on average we only have work for about 10,500 of them. Thus, the Paris virtualization method cost us a factor of roughly (131,072/10,500) = 12.5. A similar analysis shows that virtualization slows the use of the router by a factor of 7.6.

We summarize the reasons that our achieved performance is so far below the CM's peak:

    Virtualization. We used $2^{17}$ virtual processors. On average, we make use of 10,500 on the playing field and 5,600 in matrix storage. In actuality, there are 256 physical (floating-point) processors in the machine we used.
    The slow router.
    Communication costs for scans on the playing field, using the built-in suboptimal algorithm.
    The SIMD restriction. This causes us to have to wait for the divides and multiplies. (Since there are very few active virtual processors during these tasks, most of this effect could also be removed by efficient virtualization.)
    The Paris instruction set, which makes reuse of data in registers impossible, thus exposing the low memory bandwidth of the CM.
Table 3 gives an upper bound on the improvement that removal of each of these impediments to performance would provide. In each case, we hypothesize an "ideal" machine in which the corresponding cost is completely removed. Thus, for example, the statistics for the third row of the table are for a marvelous machine in which a router operation takes no time whatever.
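The utilization figures in Section 4.6 can be cross-checked with a few lines of arithmetic. The variable names are hypothetical; the loss factors, the 131,072 virtual processors, the 10,500 active processors, the 45.6 million available flops, and the 3,589,202 operations are the values quoted in the text.

```python
from fractions import Fraction

# Loss factors listed for the playing field.
loss_factors = [
    Fraction(2),       # both halves of the symmetric subproblems
    Fraction(4, 3),    # same playing field size at every level
    Fraction(4, 3),    # playing field dimensions forced to powers of two
    Fraction(7, 2),    # partially covered field, processors falling idle
]

combined_loss = Fraction(1)
for factor in loss_factors:
    combined_loss *= factor

utilization_percent = 3_589_202 / 45.6e6 * 100    # flops done / flops available
virtualization_cost = 131_072 / 10_500            # virtual processors / active ones
```

The combined loss (about 12.4), the reciprocal of the measured utilization (about 12.6 to 12.7), and the virtualization cost (about 12.5) all agree to within a few percent, which is the consistency the text points out.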
Table 3. Factors affecting efficiency of Grid Cholesky. (Columns: impediment removed, time, Mflops, speedup. Rows: none, virtualization, slow router, slow scans.)

Clearly, good efficiency is possible, even on an SIMD machine with a router. The chief problems are the Paris virtualization method; the lack of a fast broadcast in the grid; and the memory-to-memory instruction set. All of these are peculiarities of Paris on the CM-2; none is fundamental to SIMD, data parallel computing.

5. Conclusions. We have compared two highly parallel general-purpose sparse Cholesky factorization algorithms, implemented for a data parallel computer. The first, Router Cholesky, is concise and elegant and takes advantage of the parallelism present in the elimination tree, but because it pays little attention to the cost of communication it is not practical.

We therefore developed a parallel algorithm, Grid Cholesky, that does most of its work with grid communication on dense submatrices. Analysis shows that the requirement for expensive general-purpose communication grows only logarithmically with problem size; even for modest problems the code is not limited by the CM router. Experiment shows that Grid Cholesky is about 20 times as fast as Router Cholesky for a moderately small sample problem.

Our pilot implementation of Grid Cholesky is not fast enough to make the Connection Machine cost-effective for solving generally structured sparse matrix problems. We believe, however, that our experiments and analysis lead to the conclusion that a parallel supernodal, multifrontal algorithm can be made to perform efficiently on a highly parallel SIMD machine. We showed in detail that Grid Cholesky could run two to three orders of magnitude faster with improvements in the instruction set of the Connection Machine. With such improvements, the gap separating our pilot implementation from the 27-gigaflop theoretical peak performance of a 64K processor CM yawns somewhat less dauntingly.

Most of these improvements are below the level of the Paris virtual processor abstraction, which is to say below the level of the assembly-language architecture of the machine. Although TMC has recently released a low-level language called CMIS in which a user can program below the virtual-processor level, we believe that ultimately most of these optimizations should be applied by high-level language compilers. Future compilers for highly parallel machines, while they will support the data-parallel virtual processor abstraction at the user's level, will generate code at a level below that abstraction.

Although Grid Cholesky is more complicated than Router Cholesky, we are still able to use the data parallel programming paradigm to express it in a straightforward way. The high-level scan and scatter-with-add communication primitives substantially simplified the programming. The simplicity of our codes speaks well for this data parallel programming model.
In summary, even though our pilot implementation is not fast, we are nonetheless encouraged about the Grid Cholesky algorithm and about the potential of data parallel programming and highly parallel architectures for solving unstructured problems. We believe that future generations of highly parallel machines may allow efficient parallel programs for complex tasks to be written nearly as easily as sequential programs. To get to that point, there will have to be improvements in compilers, instruction sets, and router technology. Virtualization will have to be implemented without sacrificing efficiency.

We mention four avenues for further research.

The first is scheduling the dense partial factorizations efficiently. The tree of supernodes identifies a precedence relationship among the various partial factorizations. Our simple approach of scheduling these one level at a time onto a fixed-size playing field is not the only possible one. There is in general no need to perform all the partial factorizations at a single level simultaneously. It should be possible to use more sophisticated heuristics to schedule these factorizations onto a playing field of varying VP ratio, or even (for the Connection Machine) onto a playing field considered as a mesh of individual vector floating-point chips. Some theoretical work has been done on scheduling "rectangular" tasks onto a square grid of processors efficiently [8].

The second avenue is improving the time spent in the matrix storage VP set. Of course, as problems get larger this time becomes a smaller fraction of the total. At present matrix storage time is not very significant even for a small problem, but it will become more so as the playing field time is improved.

Third, we mention the possibility of an out-of-main-memory version of Grid Cholesky for very large problems. Here the clique tree would be used to schedule transfers of data between the high-speed parallel disk array connected to the CM and the processors themselves.

Fourth and finally, we mention the possibility of performing the combinatorial preliminaries to the numerical factorization in parallel. Our pilot implementation uses a sequentially generated ordering, symbolic factorization, and clique tree. We are currently designing data parallel algorithms to do these three steps [18].

We conclude by extracting one last moral from Grid Cholesky. We find it interesting and encouraging that the key idea of the algorithm, namely partitioning the matrix into dense submatrices in a systematic way, has also been used to make sparse Cholesky factorization more efficient on vector supercomputers [32] and even on workstations [29]. In the former case, the dense submatrices vectorize efficiently; in the latter, the dense submatrices are carefully blocked to minimize traffic between cache memory and main memory. We expect that more experience will show that many techniques for attaining efficiency on sequential machines with hierarchical storage will turn out to be useful for highly parallel machines.

REFERENCES

[1] C. Ashcraft, R. Grimes, J. Lewis, B. Peyton, and H. Simon, Recent progress in sparse matrix methods for large linear systems, International Journal of Supercomputer Applications, (1987), pp. 10-30.
[2] C. C. Ashcraft, The domain/segment partition for the factorization of sparse symmetric positive definite matrices, Tech. Report ECA-TR-148, Boeing Computer Services Engineering Computing and Analysis Division, Seattle,
[3] C. H. Bischof and J. J. Dongarra, A project for developing a linear algebra library for high-performance computers, Tech. Report MCS-P105-0989, Argonne National Laboratory,
[4] J. R. S. Blair and B. W. Peyton, On finding minimum-diameter clique trees, Tech. Report ORNL/TM-11850, Oak Ridge National Laboratory, 1991.
[5] M. Dixon and J. de Kleer, Massively parallel assumption-based truth maintenance, in Proceedings of the National Conference on Artificial Intelligence, 1988, pp. 199-204.
[6] I. S. Duff, Multiprocessing a sparse matrix code on the Alliant FX/8, Tech. Report CSS 210, Computer Science and Systems Division, AERE Harwell, 1988.
[7] I. S. Duff and J. K. Reid, The multifrontal solution of indefinite sparse symmetric linear equations, ACM Transactions on Mathematical Software, 9 (1983), pp. 302-325.
[8] A. Feldmann, J. Sgall, and S.-H. Teng, Dynamic scheduling on parallel machines.
[9] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, 1979.
[10] F. Gavril, Algorithms for minimum coloring, maximum clique, minimum covering by cliques, and maximum independent set of a chordal graph, SIAM Journal on Computing, 1 (1972), pp. 180-187.
[11] A. George, Nested dissection of a regular finite element mesh, SIAM Journal on Numerical Analysis, 10 (1973), pp. 345-363.
[12] A. George, M. T. Heath, J. Liu, and E. Ng, Sparse Cholesky factorization on a local-memory multiprocessor, SIAM Journal on Scientific and Statistical Computing, 9 (1988), pp. 327-340.
[13] A. George and J. W. H. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall, 1981.
[14] A. George and E. Ng, On the complexity of sparse QR and LU factorization of finite-element matrices, SIAM Journal on Scientific and Statistical Computing, 9 (1988), pp. 849-861.
[15] J. A. George and D. McIntyre, On the application of the minimum degree algorithm to finite element systems, SIAM Journal on Numerical Analysis, 15 (1978), pp. 90-112.
[16] J. R. Gilbert, Some nested dissection order is nearly optimal, Information Processing Letters, 26 (1988), pp. 325-328.
[17] J. R. Gilbert and H. Hafsteinsson, Parallel solution of sparse linear systems, in SWAT 88: Proceedings of the First Scandinavian Workshop on Algorithm Theory, Springer-Verlag Lecture Notes in Computer Science 318, 1988, pp. 145-153.
[18] J. R. Gilbert, C. Lewis, and R. Schreiber, Parallel preordering for sparse matrix factorization. In preparation.
[19] C.-T. Ho and S. L. Johnsson, Spanning balanced trees in Boolean cubes, SIAM Journal on Scientific and Statistical Computing, 10 (1989), pp. 607-630.
[20] J. J. Hopfield, Neural networks and physical systems with emergent collective computational abilities, Proceedings of the National Academy of Science, 79 (1982), pp. 2554-2558.
[21] J. A. G. Jess and H. G. M. Kees, A data structure for parallel L/U decomposition, IEEE Transactions on Computers, C-31 (1982), pp. 231-239.
[22] S. G. Kratzer, Massively parallel sparse matrix computations, Tech. Report SRC-TR-90-008, Supercomputer Research Center, 1990.
[23] J. W. H. Liu, The multifrontal method for sparse matrix solution: Theory and practice, Tech. Report CS-90-04, York University Computer Science Department, 1990.
[24] J. W. H. Liu, The role of elimination trees in sparse factorization, SIAM Journal on Matrix Analysis and Applications, 11 (1990), pp. 134-172.
[25] J. Naor, M. Naor, and A. J. Schaffer, Fast parallel algorithms for chordal graphs, SIAM Journal on Computing, 18 (1989), pp. 327-349.
[26] B. W. Peyton, Some Applications of Clique Trees to the Solution of Sparse Linear Systems, PhD thesis, Clemson University, 1986.
[27] D. J. Rose, A graph-theoretic study of the numerical solution of sparse positive definite systems of linear equations, in Graph Theory and Computing, R. C. Read, ed., 1972, pp. 183-217.
[28] D. J. Rose, R. E. Tarjan, and G. S. Lueker, Algorithmic aspects of vertex elimination on graphs, SIAM Journal on Computing, 5 (1976), pp. 266-283.
[29] E. Rothberg and A. Gupta, Fast sparse matrix factorization on modern workstations, Tech. Report STAN-CS-89-1286, Stanford University, 1989.
[30] R. Schreiber, A new implementation of sparse Gaussian elimination, ACM Transactions on Mathematical Software, 8 (1982), pp. 256-276.
[31] R. Schreiber, An assessment of the Connection Machine, in Scientific Applications of the Connection Machine, H. Simon, ed., World Scientific, Singapore, 1991.
[32] H. Simon, P. Vu, and C. Yang, Performance of a supernodal general sparse solver on the Cray Y-MP, Tech. Report SCA-TR-117, Boeing Computer Services,
[33] B. Speelpenning, The generalized element method, Tech. Report UIUCDCS-R-78-946, University of Illinois,
[34] Thinking Machines Corporation, Cambridge, Massachusetts, Paris reference manual, version
[35] E. Zmijewski, Sparse Cholesky Factorization on a Multiprocessor, PhD thesis, Cornell University, 1987.
C 060 CCTV Coaxial Cable
C 060 CCTV Coaxial Cable These cables are used for video transmissons of CCTV applications..4...... Coaxial Cable (C 060).. Inner Conductor : Ø 0.60 mm Bare Copper.. Insulation : Ø.80 mm Gas-injected Foam
CAMERA MTGS. Astro-Brac 1-Piece Camera Bracket Stellar Series
Astro-Brac 1-Piece Camera Bracket Stellar Series AS-3004 SH-0514 SH-0514 Stellar Series Mount, Alum AS-0170 - - - 46=46 : 29, 36, 42, 48, 56, 72, 84, 96, or 114. Specify by including -SS in the part number,
Dr.Web anti-viruses Visual standards
Dr.Web anti-viruses Visual standards Contents Who are we? The main Dr.Web logo Logos of Dr.Web products Registered trademarks Typography 1 Who are we? Doctor Web is a Russian developer of information security
Overview of Fire Alarm Systems and Maintenance
Overview of Fire Alarm Systems and Maintenance By Mr. David Goh Vice President Fire Safety Managers Association Convener of Working Group for CP 10 : 2005 Er. Matthew Kwek Committee Member Fire Safety