Compiler and Run-Time Support for Adaptive Load Balancing in Software Distributed Shared Memory Systems*

Sotiris Ioannidis and Sandhya Dwarkadas
Department of Computer Science
University of Rochester
Rochester, NY 14627-0226

Abstract. Networks of workstations offer inexpensive and highly available high performance computing environments. A critical issue for achieving good performance in any parallel system is load balancing, even more so in workstation environments where the machines might be shared among many users. In this paper, we present and evaluate a system that combines compiler and run-time support to achieve load balancing dynamically on software distributed shared memory programs. We use information provided by the compiler to help the run-time system distribute the work of the parallel loops, not only according to the relative power of the processors, but also in such a way as to minimize communication and page sharing.

1 Introduction

Clusters of workstations, whether uniprocessors or symmetric multiprocessors (SMPs), offer cost-effective and highly available parallel computing environments. Software distributed shared memory (SDSM) provides a shared memory abstraction on a distributed memory machine, with the advantage of ease-of-use. Previous work [5] has shown that an SDSM run-time can prove to be an effective target for a parallelizing compiler. The advantages of using an SDSM system include reduced complexity at compile-time, and the ability to combine compile-time and run-time information to achieve better performance ([6, 18]).

One issue in achieving good performance in any parallel system is load balancing. This issue is even more critical in a workstation environment where the machines might be shared among many users. In order to maximize performance based on available resources, the parallel system must not only optimally distribute the work according to the inherent computation and communication demands of the application, but also according to available computation and communication resources.

* This work was supported in part by NSF grants CDA-..., CCR-..., and CCR-...; and an external research grant from Digital Equipment Corporation.

In this paper, we present and evaluate a system that combines compiler and run-time support to achieve load balancing dynamically on SDSM programs. The compiler provides access pattern information to the run-time at the points in the code that will be executed in parallel. The run-time uses these points to determine available computational and communication resources. Based on the access patterns across phases, as well as on available computing power, the run-time can then make intelligent decisions not only to distribute the computational load evenly, but also to minimize communication overhead in the future. The result is a system that adapts both to changes in access patterns as well as to changes in computational power, resulting in reduced execution time.

Our target run-time system is TreadMarks [2], along with the extensions for prefetching and consistency/communication avoidance described in [6]. We implemented the necessary compiler extensions in the SUIF [1] compiler framework. Our experimental environment consists of eight DEC AlphaServer 2100 4/233 computers, each with four 21064A processors operating at 233MHz. Preliminary results show that our system is able to adapt to changes in load, with performance within 20% of ideal.

The rest of this paper is organized as follows. Section 2 describes the run-time system, the necessary compiler support, and the algorithm used to make dynamic load balancing decisions. Section 3 presents some preliminary results. Section 4 describes related work. Finally, we present our conclusions and discuss on-going work in Section 5.

2 Design and Implementation

We first provide some background on TreadMarks [2], the run-time system we used in our implementation. We then describe the compiler support, followed by the run-time support necessary for load balancing.

2.1 The Base Software DSM Library

TreadMarks [2] is an SDSM system built at Rice University. It is an efficient user-level SDSM system that runs on commonly available Unix systems. TreadMarks provides parallel programming primitives similar to those used in hardware shared memory machines, namely, process creation, shared memory allocation, and lock and barrier synchronization. The system supports a release consistent (RC) memory model [10], requiring the programmer to use explicit synchronization to ensure that changes to shared data become visible.

The virtual memory hardware is used to detect accesses to shared memory. Consequently, the consistency unit is a virtual memory page. TreadMarks uses a lazy invalidate [14] version of RC and a multiple-writer protocol [3] to reduce the overhead involved in implementing the shared memory abstraction. The multiple-writer protocol reduces the effects of false sharing with such a large consistency unit. With this protocol, two or more processors can simultaneously modify their own

copy of a shared page. Their modifications are merged at the next synchronization operation in accordance with the definition of RC, thereby reducing the effects of false sharing. The merge is accomplished through the use of diffs. A diff is a run-length encoding of the modifications made to a page, generated by comparing the page to a copy saved prior to the modifications (called a twin).

With the lazy invalidate protocol, a process invalidates, at the time of an acquire synchronization operation [10], those pages for which it has received notice of modifications by other processors. On a subsequent page fault, the process fetches the diffs necessary to update its copy.

2.2 Compile-Time Support for Load Balancing

For the source-to-source translation from a sequential program to a parallel program using TreadMarks, we use the Stanford University Intermediate Format (SUIF) [11] compiler. The SUIF system is organized as a set of compiler passes built on top of a kernel that defines the intermediate format. The passes are implemented as separate programs that typically perform a single analysis or transformation and then write the results out to a file. The files always use the same format.

The input to the compiler is a sequential version of the code. The output that we start with is a version of the code parallelized for shared address space machines. The compiler generates a single-program, multiple-data (SPMD) program that we modified to make calls to the TreadMarks run-time library. Alternatively, the user can provide the SPMD program (instead of having the SUIF compiler generate it) by identifying the parallel loops in the program that are executed by all processors.

Our SUIF pass extracts the shared data access patterns in each of the SPMD regions, and feeds this information to the run-time system. The pass is also responsible for adding hooks in the parallelized code to allow the run-time library to change the load distribution in the parallel loops if necessary.

Access pattern extraction

In order to generate access pattern summaries, our SUIF pass walks through the program looking for accesses to shared memory (identified using the sh prefix). A regular section [12] is then created for each such shared access. Regular section descriptors (RSDs) concisely represent the array accesses in a loop nest. The RSDs represent the accessed data as linear expressions of the upper and lower loop bounds along each dimension, and include stride information. This information is combined with the corresponding loop boundaries of that index, and the size of each dimension of the array, to determine the access pattern. Depending on the kind of data sharing between parallel tasks, we follow different strategies of load redistribution in case of imbalance. We will discuss these strategies further in Section 2.3.

Prefetching

The access pattern information can also be used to prefetch data [6]. The TreadMarks library offers prefetching calls. These calls, given a range of addresses, prefetch the data contained in the pages in that range, and provide appropriate (read/write) permissions on the page. This prefetching prevents faulting and consistency actions on uncached data that is guaranteed to be accessed in the future, and also allows communication optimization by taking advantage of bulk transfer.

Load balancing interface and strategy

The run-time system needs a way of changing the amount of work assigned to each parallel task. This essentially means changing the number of loop iterations performed by each task. To accomplish this, we augment the code with calls to the run-time library before the parallel loops. This call is responsible for changing the loop bounds and consequently the amount of work done by each task.

The compiler can direct the run-time to choose between two partitioning strategies for distributing the parallel loops. The goal is to minimize execution time by considering both the communication and the computation components.

1. Shifting of loop boundaries: This approach changes the upper and lower bounds of each parallel task, so that tasks on lightly loaded processors will end up with more work than tasks on heavily loaded processors. With this scheme we avoid the creation of new boundaries, and therefore possible sharing, on the data accessed by our tasks. Applications with nearest neighbor sharing will benefit from this scheme. This policy, however, has the drawback of causing more communication at the time of load redistribution, since data has to be moved between all neighboring tasks rather than only from the slow processor.

2. Multiple loop bounds: This scheme is aimed at minimizing unnecessary data movement. Each process that uses this policy can access non-contiguous data by using multiple loop bounds. This policy fragments the shared data among the processors, but reduces communication at load redistribution time. Hence, care must be taken to ensure that this fragmentation does not result in either false sharing or excess true sharing due to load redistribution.

2.3 Run-Time Load Balancing Support

The run-time library is responsible for keeping track of the progress of each process. It collects statistics about the execution time of each parallel task and adjusts the load accordingly. The execution time for each parallel task is maintained on a per-processor basis (TaskTime). The relative processing power of the processor (RelativePower) is calculated on the basis of the current load distribution (RelativePower) as well as the per-processor task time, as described in Figure 1. Each processor executes this code prior to each parallel loop (SPMD region). It is crucial not to try to adjust too quickly to changes in execution

float RelativePower[NumOfProcessors];
float TaskTime[NumOfProcessors];
float SumOfPowers = 0;

for all processors i
    RelativePower[i] /= TaskTime[i];
    SumOfPowers += RelativePower[i];
for all processors i
    RelativePower[i] /= SumOfPowers;

Fig. 1. Algorithm to Determine Relative Processing Power

time, because sudden changes in the distribution of the data might cause the system to oscillate. To make this clear, imagine a processor that for some reason is very slow the first time we gather statistics. If we adjust the load, we will end up sending most of its work to another processor. This will cause it to be very fast the second time around, resulting in a redistribution once again.

For this reason we have added some hysteresis to our system. We redistribute the load only if the relative power remains consistently at odds with the current allocation through a certain number of task creation points. Similarly, load is balanced only if the variance in relative power exceeds a threshold. If the time of the slowest process is within n% of the time of the fastest process we don't change the distribution of work. Otherwise, minor oscillations may result as communication is generated due to the adjusted load. In our experiments, we collect statistics for 10 task creation points before trying to adjust, and then, if the time of the slowest process is not within 10% of the time of the fastest process, we redistribute the work. These cut-offs were heuristically determined on the basis of our experimental platform, and are a function of the amount of computation and any extra communication.

Load Balancing vs. Locality Management

Previous work [20] has shown that locality management is at least as important as load balancing. This is even more so in software DSM, where the processors are not tightly coupled, making communication expensive. Consequently, we need to avoid unnecessary movement of data and at the same time minimize page sharing. In order to deal with this problem, the run-time library uses the information supplied by the compiler about what loop distribution strategy to use. In addition, it keeps track of accesses to the shared arrays as declared in previous SPMD regions. Changes in partitioning that might result in extra communication are avoided in favor of a small amount of load imbalance. We call this method locality-conscious load balancing.

2.4 Example

Consider the parallel loop of Figure 2. Our compiler pass transforms this loop into that in Figure 3. The new code makes a redistribute call to the run-time library, providing it with all the necessary information to compute the access patterns (the arrays, the types of accesses, the upper and lower bounds of the loops, and the format of the expressions for the indices). The redistribute call computes the relative powers of the processors (using the algorithm shown in Figure 1), and then uses the access pattern information to decide how to distribute the workload.

    int sh_dat1[N], sh_dat2[N];

    for (i = lower_bound; i < upper_bound; i += stride)
        sh_dat1[a*i+b] += sh_dat2[c*i+d];

Fig. 2. Initial parallel loop.

    int sh_dat1[N], sh_dat2[N];

    redistribute(
        list of shared arrays,          /* sh_dat1, sh_dat2 */
        list of types of accesses,      /* read/write */
        list of lower bounds,           /* lower_bound */
        list of upper bounds,           /* upper_bound */
        list of coefficients and
            constants for indices       /* a, c, b, d */
    );
    while (there are still ranges) {
        lower_bound = new lower bound for that range;
        upper_bound = new upper bound for that range;
        for (i = lower_bound; i < upper_bound; i += stride)
            sh_dat1[a*i+b] += sh_dat2[c*i+d];
        range = range->next;
    }

Fig. 3. Parallel loop with added code that serves as an interface with the run-time library. The run-time system can then change the amount of work assigned to each parallel task.

3 Experimental Evaluation

3.1 Environment

Our experimental environment consists of eight DEC AlphaServer 2100 4/233 computers. Each AlphaServer is equipped with four 21064A processors operating at 233MHz and with 256MB of shared memory, as well as a Memory Channel network interface. Each AlphaServer runs Digital Unix 4.0D with TruCluster v. 1.5 extensions. The programs, the run-time library, and TreadMarks were compiled with gcc using the -O2 optimization flag.

3.2 Load Balancing Results

We evaluate our system on two applications: a matrix multiplication of three 256x256 shared matrices of longs (which is repeated 100 times) and Jacobi, with a matrix size of 2050x2050 floats. The current implementation only uses the first policy, shifting of loop boundaries, and does not use prefetching. To test the performance of our load balancing library, we introduced an artificial load on one of the processors of each SMP. This consists of a tight loop that writes on an array of 10240 longs. This load takes up 50% of the CPU time.

Our preliminary results appear in Figures 4 and 5. We present execution times on 1, 2, 4, 8, and 16 processors, using up to four SMPs. We added one artificial load for every four processors, except in the case of two processors where we only added one load. The load balancing scheme we use is the shifting of loop boundaries (we do not use multiple loop bounds). The first column shows the execution times for the cases where there was no load in the system. The second column shows the execution times with the artificial load, and finally the last column is the case where the system is loaded but we are using our load balancing library.

The introduction of load slows down both matrix multiply and Jacobi by as much as 100% in the case of two processors (with the overhead at 4, 8, and 16 processors not being far off).

Our load balancing strategy provides a significant improvement in performance compared to execution time with load. In order to determine how good the results of our load balancing algorithm are, we compare the execution times obtained using 8 processors with load and our load balance scheme with those using 7 processors without any load. This 7-processor run serves as a bound on how well we can perform with load balancing, since that is the best we can hope to achieve (two of our eight processors are loaded, and operate at only 50% of their power, giving us the equivalent of seven processors). The results are presented in Figure 6. For matrix multiply, our load balancing algorithm is only 9% slower than the seven-processor load-free case. Jacobi is 20% slower, partly due to the fact that while computation can be redistributed, communication per processor remains the same.

In Figure 7, we present a breakdown of the normalized execution time relative to that on 8 processors with no load, indicating the relative time spent in user

Fig. 4. Execution Times for Matrix Multiply (times in seconds on 1, 2, 4, 8, and 16 processors: no load, with load, with load balance).

code, in the protocol, and in communication and wait time (at synchronization points). When we use our load balancing algorithm, we reduce the time spent waiting at synchronization points relative to the execution time with load and no load balance, because we have a better distribution of work, and therefore improve overall performance.

Finally, we wanted to measure the overhead imposed by our run-time system. We ran matrix multiplication and Jacobi in a load-free environment with and without use of our run-time library. The results are presented in Figure 8. In the worst case we impose less than 6% overhead.

3.3 Locality-conscious Load Balancing Results

For the evaluation of our locality-conscious load balancing policy we used Shallow, with input size 514x514 matrices of doubles. Shallow operates on the interior elements of the arrays and then updates the boundaries. Compiler parallelized code or a naive implementation would have each process update a part of the boundaries along each dimension in parallel. This can result in multiple processes writing the same pages, i.e., false sharing. A smarter approach is to have the processes that own the boundary pages do the updates; this eliminates false

sharing. Our integrated compiler/run-time system is able to make the decision at run-time, using the access pattern information provided by the compiler. It identifies which process caches the data and can repartition the work so that it maximizes locality.

Fig. 5. Execution Time for Jacobi (times in seconds on 1, 2, 4, 8, and 16 processors: no load, with load, with load balance).

We present our results in Figure 9. In these experiments we don't introduce any load imbalance to our system, since we want to evaluate our locality-conscious load balancing policy. We have optimized the manual parallelization to eliminate false sharing as suggested earlier. A naive compiler parallelization that doesn't consider data placement performs very poorly as the number of processors increases, because of the multiple writers on the same page. However, when we combine the compiler parallelization with our locality-conscious load balancing run-time system, the performance is equivalent to the hand-optimized code.

4 Related Work

There have been several approaches to the problems of locality management and load balancing. Perhaps the most common approach is the task queue model. In this scheme, there is a central queue of loop iterations. Once a processor has

Fig. 6. Comparison of the running times (in seconds) of matrix multiplication and Jacobi using our load balancing algorithm on 8 loaded processors, compared to their performance on 7 load-free processors.

finished its assigned portion, more work is obtained from this queue. There are several variations, including self-scheduling [23], fixed-size chunking [15], guided self-scheduling [22], and adaptive guided self-scheduling [7].

Markatos and LeBlanc in [20] argue that locality management is more important than load balancing in thread assignment. They introduce a policy they call memory-conscious scheduling that assigns threads to processors whose local memory holds most of the data the thread will access. Their results show that the looser the interconnection network, the more important the locality management.

Based on the observation that the locality of the data that a loop accesses is very important, affinity scheduling was introduced in [19]. The loop iterations are divided over all the processors equally in local queues. When a processor is idle, it removes 1/k of the iterations in its local work queue and executes them. k is a parameter of their algorithm, which they define as P in most of their experiments. If a processor's work queue is empty, it finds the most loaded processor and removes 1/P of the iterations in that processor's work queue and executes them, where P is the number of processors.

Building on [19], Yan et al. in [24] suggest adaptive affinity scheduling. Their algorithm is similar to affinity scheduling, but their runtime system can

Fig. 7. Breakup of the normalized execution time for matrix multiplication and Jacobi (relative to 8 processors with no load; configurations: 8p no load, 8p with load, 7p no load, 8p with load balance) into time spent in user code, communication and wait at synchronization points, and protocol time.

modify k during the execution of the program. When a processor is loaded, k is increased so that other processors with a lighter load can get loop iterations from the loaded processor's local work queue. They present four possible policies for changing k: an exponential adaptive mechanism, a linear adaptive mechanism, a conservative adaptive mechanism, and a greedy adaptive mechanism.

In [4], Cierniak et al. study loop scheduling in heterogeneous environments with respect to programs, processors, and the interconnection networks. Their results indicate that taking into account the relative computation power as well as any heterogeneity in the loop format while doing the loop distribution improves the overall performance of the application. Similarly, Moon and Saltz [21] also looked at applications with irregular access patterns. To compensate for load imbalance, they introduce periodic re-mapping, or re-mapping at predetermined points of the execution, and dynamic re-mapping, in which they determine if repartitioning is required at every time step.

In the context of dynamically changing environments, Edjlali et al. in [8] and Kaddoura in [13] present a run-time approach for handling such environments. Before each parallel section of the program they check if there is a need to re-

Fig. 8. Running times for matrix multiplication and Jacobi in a load-free environment, with and without use of our run-time library.

map the loop. This is similar to our approach; however, their approach deals with message passing programs.

A discussion on global vs. local and distributed vs. centralized strategies for load balancing is presented in [25]. Based on the information they use to make load balancing decisions, strategies can be divided into local and global. Distributed and centralized refers to whether the load balancer is one master processor or distributed among the processors. The authors argue that depending on the application and system parameters, each of those schemes can be more suitable than the others.

The system that seems most related to ours is Adapt, presented in [17]. Adapt is implemented in concert with the Distributed Filaments software kernel [9], a DSM system. It monitors communication and page faults, and dynamically modifies loop boundaries so that the processes access data that are local if possible. Adapt is able to extract the access patterns by inspecting the patterns of the page faults. It can only recognize two patterns, nearest-neighbor and broadcast; this limits its flexibility. In our system we use the compiler to extract the access patterns and provide them to the run-time system, making our approach more general and flexible.

Finally, there are systems like Condor [16] that support transparent migration of processes from one workstation to another. However, such systems don't support parallel programs efficiently.

Fig. 9. Running times of the three different implementations of Shallow, in seconds, on 1, 2, 4, 8, and 16 processors. The manual parallelization takes into account data placement in order to avoid page sharing. The compiler parallelization doesn't consider data placement. The adaptive parallelization uses the compiler parallelization with our run-time library, which adjusts the workload taking data placement into account dynamically.

Our system deals with software distributed shared memory programs, in contrast to closely coupled shared memory or message passing. Our load balancing method targets both irregularities of the loops as well as possible heterogeneous processors and load caused by competing programs. Furthermore, our system addresses locality management by trying to minimize communication and page sharing.

5 Conclusions

In this paper, we address the problem of load balancing in SDSM systems by coupling compile-time and run-time information. SDSM has unique characteristics that are attractive: it offers the ease of programming of a shared memory model in a widely available workstation-based message passing environment. However, multiple users and loosely connected processors challenge the perfor-

mance of SDSM programs on such systems due to load imbalances and high communication latencies.

Our integrated system uses access information available at compile-time to dynamically adjust load at run-time based on the available relative processing power and communication speeds. The same access pattern information is also used to prefetch data. Preliminary results are encouraging. Performance tests on two applications and a fixed load indicate that the performance with load balance is within 9 and 20% of the ideal performance. Additionally, our system is able to partition the work so that processes access only their local data, minimizing false sharing. Our system identified regions where false sharing existed and changed the loop boundaries to avoid it. The performance on our third application, when the number of processors was high, was equivalent to the best possible workload partitioning.

Further work to collect results on a larger number of applications is necessary. In addition, for a more thorough evaluation, we need to determine the sensitivity of our strategy to dynamic changes in load, as well as to changes in the hysteresis factor used when determining when to redistribute work. The tradeoff between locality management and load must also be further investigated.

References

1. S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and C. W. Tseng. The SUIF compiler for scalable parallel machines. In Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, February 1995.
2. C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18-28, February 1996.
3. J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Techniques for reducing consistency-related information in distributed shared memory systems. ACM Transactions on Computer Systems, 13(3):205-243, August 1995.
4. M. Cierniak, W. Li, and M. J. Zaki. Loop scheduling for heterogeneity. In Fourth International Symposium on High Performance Distributed Computing, August 1995.
5. A. L. Cox, S. Dwarkadas, H. Lu, and W. Zwaenepoel. Evaluating the performance of software distributed shared memory as a target for parallelizing compilers. In Proceedings of the 11th International Parallel Processing Symposium, pages 474-482, April 1997.
6. S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. An integrated compile-time/run-time software distributed shared memory system. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, October 1996.
7. D. L. Eager and J. Zahorjan. Adaptive guided self-scheduling. Technical report, Department of Computer Science, University of Washington, January 1992.
8. G. Edjlali, G. Agrawal, A. Sussman, and J. Saltz. Data parallel programming in an adaptive environment. In International Parallel Processing Symposium, April 1995.
9. V. W. Freeh, D. K. Lowenthal, and G. R. Andrews. Distributed Filaments: Efficient fine-grain parallelism on a cluster of workstations. In Proceedings of the First USENIX Symposium on Operating System Design and Implementation, pages 201-213, November 1994.
10. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15-26, May 1990.
11. The SUIF Group. An overview of the SUIF compiler system.
12. P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350-360, July 1991.
13. M. Kaddoura. Load balancing for regular data-parallel applications on workstation network. In Communication and Architecture Support for Network-Based Parallel Computing, pages 173-183, February 1997.
14. P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13-21, May 1992.
15. C. Kruskal and A. Weiss. Allocating independent subtasks on parallel processors.
16. M. Litzkow and M. Solomon. Supporting checkpointing and process migration outside the Unix kernel. In USENIX Winter Conference, 1992.
17. D. K. Lowenthal and G. R. Andrews. An adaptive approach to data placement.
18. H. Lu, A. L. Cox, S. Dwarkadas, R. Rajamony, and W. Zwaenepoel. Software distributed shared memory support for irregular applications. In Proceedings of the 6th Symposium on the Principles and Practice of Parallel Programming, pages 48-56, June 1997.
19. E. P. Markatos and T. J. LeBlanc. Using processor affinity in loop scheduling on shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 5(4):379-400, April 1994.
20. E. P. Markatos and T. J. LeBlanc. Load balancing versus locality management in shared-memory multiprocessors. In Proceedings of the 1992 International Conference on Parallel Processing, pages I:258-267, September 1992.
21. B. Moon and J. Saltz. Adaptive runtime support for direct simulation Monte Carlo methods on distributed memory architectures. In Scalable High Performance Computing Conference, May 1994.
22. C. D. Polychronopoulos and D. J. Kuck. Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. IEEE Transactions on Computers.
23. P. Tang and P. C. Yew. Processor self-scheduling: A practical scheduling scheme for parallel computers. In International Conference on Parallel Processing.
24. Y. Yan, C. Jin, and X. Zhang. Adaptively scheduling parallel loops in distributed shared-memory systems. IEEE Transactions on Parallel and Distributed Systems, volume 8, January 1997.
25. M. J. Zaki, W. Li, and S. Parthasarathy. Customized dynamic load balancing for a network of workstations. Technical Report 602, Department of Computer Science, University of Rochester, December 1995.


More information

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters COSC 6374 Parallel Computation Parallel I/O (I) I/O basics Spring 2008 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network

More information

Optimizing Load Balance Using Parallel Migratable Objects

Optimizing Load Balance Using Parallel Migratable Objects Optimizing Load Balance Using Parallel Migratable Objects Laxmikant V. Kalé, Eric Bohm Parallel Programming Laboratory University of Illinois Urbana-Champaign 2012/9/25 Laxmikant V. Kalé, Eric Bohm (UIUC)

More information

What is Multi Core Architecture?

What is Multi Core Architecture? What is Multi Core Architecture? When a processor has more than one core to execute all the necessary functions of a computer, it s processor is known to be a multi core architecture. In other words, a

More information

AN4156 Application note

AN4156 Application note Application note Hardware abstraction layer for Android Introduction This application note provides guidelines for successfully integrating STMicroelectronics sensors (accelerometer, magnetometer, gyroscope

More information

ZA-12. Temperature - Liquidus + 45 o C (81 o C) Vacuum = 90mm

ZA-12. Temperature - Liquidus + 45 o C (81 o C) Vacuum = 90mm Ragonne Fluidity, Inches Zn-Al Impact 38 34 30 26 22 18 14 No. 3 Zn-Al ZA-8 Liquidius ZA-12 Temperature - Liquidus + 45 o C (81 o C) Vacuum = 90mm Zn-Al (0.01-0.02 percent mg) ZA-27 10 0 2 4 6 8 10 12

More information

Mesh Partitioning and Load Balancing

Mesh Partitioning and Load Balancing and Load Balancing Contents: Introduction / Motivation Goals of Load Balancing Structures Tools Slide Flow Chart of a Parallel (Dynamic) Application Partitioning of the initial mesh Computation Iteration

More information

Optimizing matrix multiplication Amitabha Banerjee [email protected]

Optimizing matrix multiplication Amitabha Banerjee abanerjee@ucdavis.edu Optimizing matrix multiplication Amitabha Banerjee [email protected] Present compilers are incapable of fully harnessing the processor architecture complexity. There is a wide gap between the available

More information

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach S. M. Ashraful Kadir 1 and Tazrian Khan 2 1 Scientific Computing, Royal Institute of Technology (KTH), Stockholm, Sweden [email protected],

More information

Customized Dynamic Load Balancing for a Network of Workstations 1

Customized Dynamic Load Balancing for a Network of Workstations 1 JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 43, 156 162 (1997) ARTICLE NO. PC971339 Customized Dynamic Load Balancing for a Network of Workstations 1 Mohammed Javeed Zaki, Wei Li, and Srinivasan Parthasarathy

More information

M-RPC:ARemoteProcedureCallServiceforMobileClients. DepartmentofComputerScience. Rutgers,TheStateUniversityofNewJersey. Piscataway,NJ08855

M-RPC:ARemoteProcedureCallServiceforMobileClients. DepartmentofComputerScience. Rutgers,TheStateUniversityofNewJersey. Piscataway,NJ08855 M-RPC:ARemoteProcedureCallServiceforMobileClients AjayBakreandB.R.Badrinath DepartmentofComputerScience Rutgers,TheStateUniversityofNewJersey Piscataway,NJ08855 Email:fbakre,[email protected] Abstract

More information

Optimizing Performance. Training Division New Delhi

Optimizing Performance. Training Division New Delhi Optimizing Performance Training Division New Delhi Performance tuning : Goals Minimize the response time for each query Maximize the throughput of the entire database server by minimizing network traffic,

More information

Arbitration and Switching Between Bus Masters

Arbitration and Switching Between Bus Masters February 2010 Introduction Reference Design RD1067 Since the development of the system bus that allows multiple devices to communicate with one another through a common channel, bus arbitration has been

More information

Linux Kernel Architecture

Linux Kernel Architecture Linux Kernel Architecture Amir Hossein Payberah [email protected] Contents What is Kernel? Kernel Architecture Overview User Space Kernel Space Kernel Functional Overview File System Process Management

More information

threads threads threads

threads threads threads AHybridMultithreading/Message-PassingApproachforSolving IrregularProblemsonSMPClusters Jan-JanWu InstituteofInformationScience AcademiaSinica Taipei,Taiwan,R.O.C. Chia-LienChiang Nai-WeiLin Dept.ComputerScience

More information

How To Get A Computer Science Degree At Appalachian State

How To Get A Computer Science Degree At Appalachian State 118 Master of Science in Computer Science Department of Computer Science College of Arts and Sciences James T. Wilkes, Chair and Professor Ph.D., Duke University [email protected] http://www.cs.appstate.edu/

More information

ENGI E1112 Departmental Project Report: Computer Science/Computer Engineering

ENGI E1112 Departmental Project Report: Computer Science/Computer Engineering ENGI E1112 Departmental Project Report: Computer Science/Computer Engineering Daniel Estrada Taylor, Dev Harrington, Sekou Harris December 2012 Abstract This document is the final report for ENGI E1112,

More information

CUDA Programming. Week 4. Shared memory and register

CUDA Programming. Week 4. Shared memory and register CUDA Programming Week 4. Shared memory and register Outline Shared memory and bank confliction Memory padding Register allocation Example of matrix-matrix multiplication Homework SHARED MEMORY AND BANK

More information

Architectures and Platforms

Architectures and Platforms Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation

More information

Avid ISIS 7000. www.avid.com

Avid ISIS 7000. www.avid.com Avid ISIS 7000 www.avid.com Table of Contents Overview... 3 Avid ISIS Technology Overview... 6 ISIS Storage Blade... 6 ISIS Switch Blade... 7 ISIS System Director... 7 ISIS Client Software... 8 ISIS Redundant

More information

To connect to the cluster, simply use a SSH or SFTP client to connect to:

To connect to the cluster, simply use a SSH or SFTP client to connect to: RIT Computer Engineering Cluster The RIT Computer Engineering cluster contains 12 computers for parallel programming using MPI. One computer, cluster-head.ce.rit.edu, serves as the master controller or

More information

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001 Agenda Introduzione Il mercato Dal circuito integrato al System on a Chip (SoC) La progettazione di un SoC La tecnologia Una fabbrica di circuiti integrati 28 How to handle complexity G The engineering

More information

Common Approaches to Real-Time Scheduling

Common Approaches to Real-Time Scheduling Common Approaches to Real-Time Scheduling Clock-driven time-driven schedulers Priority-driven schedulers Examples of priority driven schedulers Effective timing constraints The Earliest-Deadline-First

More information

chapater 7 : Distributed Database Management Systems

chapater 7 : Distributed Database Management Systems chapater 7 : Distributed Database Management Systems Distributed Database Management System When an organization is geographically dispersed, it may choose to store its databases on a central database

More information

MICHELIN CARGOXBIB and XP 27

MICHELIN CARGOXBIB and XP 27 TRAI LERS The trailer tires that protect your soil MICHELIN CARGOXBIB and XP 27 Sizes MICHELIN CARGOXBIB 500/60 R22.5 155D 560/60 R22.5 161D 600/50 R22.5 159D 710/45 R22.5 165D 600/55 R26.5 165D Sizes

More information

Programming Languages for Large Scale Parallel Computing. Marc Snir

Programming Languages for Large Scale Parallel Computing. Marc Snir Programming Languages for Large Scale Parallel Computing Marc Snir Focus Very large scale computing (>> 1K nodes) Performance is key issue Parallelism, load balancing, locality and communication are algorithmic

More information

Software-Programmable FPGA IoT Platform. Kam Chuen Mak (Lattice Semiconductor) Andrew Canis (LegUp Computing) July 13, 2016

Software-Programmable FPGA IoT Platform. Kam Chuen Mak (Lattice Semiconductor) Andrew Canis (LegUp Computing) July 13, 2016 Software-Programmable FPGA IoT Platform Kam Chuen Mak (Lattice Semiconductor) Andrew Canis (LegUp Computing) July 13, 2016 Agenda Introduction Who we are IoT Platform in FPGA Lattice s IoT Vision IoT Platform

More information

DISK: A Distributed Java Virtual Machine For DSP Architectures

DISK: A Distributed Java Virtual Machine For DSP Architectures DISK: A Distributed Java Virtual Machine For DSP Architectures Mihai Surdeanu Southern Methodist University, Computer Science Department [email protected] Abstract Java is for sure the most popular programming

More information

HPC enabling of OpenFOAM R for CFD applications

HPC enabling of OpenFOAM R for CFD applications HPC enabling of OpenFOAM R for CFD applications Towards the exascale: OpenFOAM perspective Ivan Spisso 25-27 March 2015, Casalecchio di Reno, BOLOGNA. SuperComputing Applications and Innovation Department,

More information

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism Intel Cilk Plus: A Simple Path to Parallelism Compiler extensions to simplify task and data parallelism Intel Cilk Plus adds simple language extensions to express data and task parallelism to the C and

More information

!NAVSEC':!A!Recommender!System!for!3D! Network!Security!Visualiza<ons!

!NAVSEC':!A!Recommender!System!for!3D! Network!Security!Visualiza<ons! !:!A!Recommender!System!for!3D! Network!Security!Visualiza

More information

ICRI-CI Retreat Architecture track

ICRI-CI Retreat Architecture track ICRI-CI Retreat Architecture track Uri Weiser June 5 th 2015 - Funnel: Memory Traffic Reduction for Big Data & Machine Learning (Uri) - Accelerators for Big Data & Machine Learning (Ran) - Machine Learning

More information

High Performance Computing in the Multi-core Area

High Performance Computing in the Multi-core Area High Performance Computing in the Multi-core Area Arndt Bode Technische Universität München Technology Trends for Petascale Computing Architectures: Multicore Accelerators Special Purpose Reconfigurable

More information

The Asterope compute cluster

The Asterope compute cluster The Asterope compute cluster ÅA has a small cluster named asterope.abo.fi with 8 compute nodes Each node has 2 Intel Xeon X5650 processors (6-core) with a total of 24 GB RAM 2 NVIDIA Tesla M2050 GPGPU

More information

E) Modeling Insights: Patterns and Anti-patterns

E) Modeling Insights: Patterns and Anti-patterns Murray Woodside, July 2002 Techniques for Deriving Performance Models from Software Designs Murray Woodside Second Part Outline ) Conceptual framework and scenarios ) Layered systems and models C) uilding

More information

D1.1 Service Discovery system: Load balancing mechanisms

D1.1 Service Discovery system: Load balancing mechanisms D1.1 Service Discovery system: Load balancing mechanisms VERSION 1.0 DATE 2011 EDITORIAL MANAGER Eddy Caron AUTHORS STAFF Eddy Caron, Cédric Tedeschi Copyright ANR SPADES. 08-ANR-SEGI-025. Contents Introduction

More information

Dynamo: Amazon s Highly Available Key-value Store

Dynamo: Amazon s Highly Available Key-value Store Dynamo: Amazon s Highly Available Key-value Store Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and

More information

GPU Hardware Performance. Fall 2015

GPU Hardware Performance. Fall 2015 Fall 2015 Atomic operations performs read-modify-write operations on shared or global memory no interference with other threads for 32-bit and 64-bit integers (c. c. 1.2), float addition (c. c. 2.0) using

More information

Distributed communication-aware load balancing with TreeMatch in Charm++

Distributed communication-aware load balancing with TreeMatch in Charm++ Distributed communication-aware load balancing with TreeMatch in Charm++ The 9th Scheduling for Large Scale Systems Workshop, Lyon, France Emmanuel Jeannot Guillaume Mercier Francois Tessier In collaboration

More information

Design and Implementation of Distributed Process Execution Environment

Design and Implementation of Distributed Process Execution Environment Design and Implementation of Distributed Process Execution Environment Project Report Phase 3 By Bhagyalaxmi Bethala Hemali Majithia Shamit Patel Problem Definition: In this project, we will design and

More information

CPUInheritance Scheduling. http://www.cs.utah.edu/projects/flux/ UniversityofUtah

CPUInheritance Scheduling. http://www.cs.utah.edu/projects/flux/ UniversityofUtah CPUInheritance Scheduling DepartmentofComputerScience ComputerSystemsLaboratory BryanFord SaiSusarla http://www.cs.utah.edu/projects/flux/ UniversityofUtah October30,1996 [email protected] 1 KeyConcepts

More information

RESEARCH PAPER International Journal of Recent Trends in Engineering, Vol 1, No. 1, May 2009

RESEARCH PAPER International Journal of Recent Trends in Engineering, Vol 1, No. 1, May 2009 An Algorithm for Dynamic Load Balancing in Distributed Systems with Multiple Supporting Nodes by Exploiting the Interrupt Service Parveen Jain 1, Daya Gupta 2 1,2 Delhi College of Engineering, New Delhi,

More information

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations

Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Roy D. Williams, 1990 Presented by Chris Eldred Outline Summary Finite Element Solver Load Balancing Results Types Conclusions

More information

A Real Time, Object Oriented Fieldbus Management System

A Real Time, Object Oriented Fieldbus Management System A Real Time, Object Oriented Fieldbus Management System Mr. Ole Cramer Nielsen Managing Director PROCES-DATA Supervisor International P-NET User Organisation Navervej 8 8600 Silkeborg Denmark [email protected]

More information

MODULE BOUSSOLE ÉLECTRONIQUE CMPS03 Référence : 0660-3

MODULE BOUSSOLE ÉLECTRONIQUE CMPS03 Référence : 0660-3 MODULE BOUSSOLE ÉLECTRONIQUE CMPS03 Référence : 0660-3 CMPS03 Magnetic Compass. Voltage : 5v only required Current : 20mA Typ. Resolution : 0.1 Degree Accuracy : 3-4 degrees approx. after calibration Output

More information

Managing Devices. Lesson 5

Managing Devices. Lesson 5 Managing Devices Lesson 5 Objectives Objective Domain Matrix Technology Skill Objective Domain Description Objective Domain Number Connecting Plug-and-Play Devices Connecting Plug-and-Play Devices 5.1.1

More information

Accelerating High-Speed Networking with Intel I/O Acceleration Technology

Accelerating High-Speed Networking with Intel I/O Acceleration Technology White Paper Intel I/O Acceleration Technology Accelerating High-Speed Networking with Intel I/O Acceleration Technology The emergence of multi-gigabit Ethernet allows data centers to adapt to the increasing

More information

Guide to SATA Hard Disks Installation and RAID Configuration

Guide to SATA Hard Disks Installation and RAID Configuration Guide to SATA Hard Disks Installation and RAID Configuration 1. Guide to SATA Hard Disks Installation... 2 1.1 Serial ATA (SATA) Hard Disks Installation... 2 2. Guide to RAID Configurations... 3 2.1 Introduction

More information

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

Distributed Dynamic Load Balancing for Iterative-Stencil Applications Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,

More information

Oracle Java SE and Oracle Java Embedded Products

Oracle Java SE and Oracle Java Embedded Products Oracle Java SE and Oracle Java Embedded Products This document describes the Oracle Java SE product editions, Oracle Java Embedded products, and the features available with them. It contains the following

More information

Operating Systems Networking for Home and Small Businesses Chapter 2

Operating Systems Networking for Home and Small Businesses Chapter 2 Operating Systems Networking for Home and Small Businesses Chapter 2 Copyleft 2012 Vincenzo Bruno (www.vincenzobruno.it) Released under Crative Commons License 3.0 By-Sa Cisco name, logo and materials

More information

NP-completeproblemstractable Copyingquantumcomputermakes MikaHirvensalo February1998 ISBN952-12-0158-4 TurkuCentreforComputerScience ISSN1239-1891 TUCSTechnicalReportNo161 superpositions,weshowthatnp-completeproblemscanbesolvedprobabilisticallyinpolynomialtime.wealsoproposetwomethodsthatcould

More information

Control 2004, University of Bath, UK, September 2004

Control 2004, University of Bath, UK, September 2004 Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

More information

Keystone 600N5 SERVER and STAND-ALONE INSTALLATION INSTRUCTIONS

Keystone 600N5 SERVER and STAND-ALONE INSTALLATION INSTRUCTIONS The following instructions are required for installation of Best Access System s Keystone 600N5 (KS600N) network key control software for the server side. Please see the system requirements on the Keystone

More information

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27 Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.

More information

Architecture of Hitachi SR-8000

Architecture of Hitachi SR-8000 Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data

More information

front unit 1 3 back unit

front unit 1 3 back unit GreedyApproximationsofIndependentSetsinLowDegree Magn sm.halld rsson GraphsKiyohitoYoshiharay lemincubicgraphsandgraphsofmaximumdegreethree.thesealgorithmiterativelyselect verticesofminimumdegree,butdierinthesecondaryruleforchoosingamongmanycandidates.westudythreesuchalgorithms,andprovetightperformanceratios,withthebest

More information

Last class: Distributed File Systems. Today: NFS, Coda

Last class: Distributed File Systems. Today: NFS, Coda Last class: Distributed File Systems Issues in distributed file systems Sun s Network File System case study Lecture 19, page 1 Today: NFS, Coda Case Study: NFS (continued) Case Study: Coda File System

More information

Middleware. Peter Marwedel TU Dortmund, Informatik 12 Germany. technische universität dortmund. fakultät für informatik informatik 12

Middleware. Peter Marwedel TU Dortmund, Informatik 12 Germany. technische universität dortmund. fakultät für informatik informatik 12 Universität Dortmund 12 Middleware Peter Marwedel TU Dortmund, Informatik 12 Germany Graphics: Alexandra Nolte, Gesine Marwedel, 2003 2010 年 11 月 26 日 These slides use Microsoft clip arts. Microsoft copyright

More information

Intel Pentium 4 Processor on 90nm Technology

Intel Pentium 4 Processor on 90nm Technology Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended

More information

Data Storage at IBT. Topics. Storage, Concepts and Guidelines

Data Storage at IBT. Topics. Storage, Concepts and Guidelines Data Storage at IBT Storage, Concepts and Guidelines Topics Hard Disk Drives (HDDs) Storage Technology New Storage Hardware at IBT Concepts and Guidelines? 2 1 Hard Disk Drives (HDDs) First hard disk:

More information

Load Imbalance Analysis

Load Imbalance Analysis With CrayPat Load Imbalance Analysis Imbalance time is a metric based on execution time and is dependent on the type of activity: User functions Imbalance time = Maximum time Average time Synchronization

More information

TRACE PERFORMANCE TESTING APPROACH. Overview. Approach. Flow. Attributes

TRACE PERFORMANCE TESTING APPROACH. Overview. Approach. Flow. Attributes TRACE PERFORMANCE TESTING APPROACH Overview Approach Flow Attributes INTRODUCTION Software Testing Testing is not just finding out the defects. Testing is not just seeing the requirements are satisfied.

More information