Compiler and Run-Time Support for Adaptive Load Balancing in Software Distributed Shared Memory Systems*

Sotiris Ioannidis and Sandhya Dwarkadas
Department of Computer Science, University of Rochester, Rochester, NY 14627-0226

Abstract. Networks of workstations offer inexpensive and highly available high performance computing environments. A critical issue for achieving good performance in any parallel system is load balancing, even more so in workstation environments where the machines might be shared among many users. In this paper, we present and evaluate a system that combines compiler and run-time support to achieve load balancing dynamically on software distributed shared memory programs. We use information provided by the compiler to help the run-time system distribute the work of the parallel loops, not only according to the relative power of the processors, but also in such a way as to minimize communication and page sharing.

1 Introduction

Clusters of workstations, whether uniprocessors or symmetric multiprocessors (SMPs), offer cost-effective and highly available parallel computing environments. Software distributed shared memory (SDSM) provides a shared memory abstraction on a distributed memory machine, with the advantage of ease-of-use. Previous work [5] has shown that an SDSM run-time can prove to be an effective target for a parallelizing compiler. The advantages of using an SDSM system include reduced complexity at compile-time, and the ability to combine compile-time and run-time information to achieve better performance ([6, 18]).

One issue in achieving good performance in any parallel system is load balancing. This issue is even more critical in a workstation environment where the machines might be shared among many users. In order to maximize performance based on available resources, the parallel system must not only optimally distribute the work according to the inherent computation and communication demands of the application, but also according to available computation and communication resources.

* This work was supported in part by NSF grants CDA-…, CCR-…, and CCR-…; and an external research grant from Digital Equipment Corporation.
In this paper, we present and evaluate a system that combines compiler and run-time support to achieve load balancing dynamically on SDSM programs. The compiler provides access pattern information to the run-time at the points in the code that will be executed in parallel. The run-time uses these points to determine available computational and communication resources. Based on the access patterns across phases, as well as on available computing power, the run-time can then make intelligent decisions not only to distribute the computational load evenly, but also to minimize communication overhead in the future. The result is a system that adapts both to changes in access patterns as well as to changes in computational power, resulting in reduced execution time.

Our target run-time system is TreadMarks [2], along with the extensions for prefetching and consistency/communication avoidance described in [6]. We implemented the necessary compiler extensions in the SUIF [1] compiler framework. Our experimental environment consists of eight DEC AlphaServer 2100 4/233 computers, each with four 21064A processors operating at 233MHz. Preliminary results show that our system is able to adapt to changes in load, with performance within 20% of ideal.

The rest of this paper is organized as follows. Section 2 describes the run-time system, the necessary compiler support, and the algorithm used to make dynamic load balancing decisions. Section 3 presents some preliminary results. Section 4 describes related work. Finally, we present our conclusions and discuss on-going work in Section 5.

2 Design and Implementation

We first provide some background on TreadMarks [2], the run-time system we used in our implementation. We then describe the compiler support, followed by the run-time support necessary for load balancing.

2.1 The Base Software DSM Library

TreadMarks [2] is an SDSM system built at Rice University. It is an efficient user-level SDSM system that runs on commonly available Unix systems. TreadMarks provides parallel programming primitives similar to those used in hardware shared memory machines, namely, process creation, shared memory allocation, and lock and barrier synchronization. The system supports a release consistent (RC) memory model [10], requiring the programmer to use explicit synchronization to ensure that changes to shared data become visible.

The virtual memory hardware is used to detect accesses to shared memory. Consequently, the consistency unit is a virtual memory page. TreadMarks uses a lazy invalidate [14] version of RC and a multiple-writer protocol [3] to reduce the overhead involved in implementing the shared memory abstraction. The multiple-writer protocol reduces the effects of false sharing with such a large consistency unit. With this protocol, two or more processors can simultaneously modify their own copy of a shared page. Their modifications are merged at the next synchronization operation in accordance with the definition of RC, thereby reducing the effects of false sharing. The merge is accomplished through the use of diffs. A diff is a run-length encoding of the modifications made to a page, generated by comparing the page to a copy saved prior to the modifications (called a twin).

With the lazy invalidate protocol, a process invalidates, at the time of an acquire synchronization operation [10], those pages for which it has received notice of modifications by other processors. On a subsequent page fault, the process fetches the diffs necessary to update its copy.
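As an illustration of the twin/diff mechanism, the following minimal sketch builds a run-length-style encoding of a page's modifications by comparing the page against its twin. This is not TreadMarks' actual code; the page size, structure layout, and function name are our assumptions for illustration.

    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 8192  /* assumed page size (the Alpha uses 8KB pages) */

    /* One run of modified bytes: offset into the page, length, and the
     * new contents of that range. */
    typedef struct diff_run {
        size_t offset;
        size_t length;
        unsigned char *data;
        struct diff_run *next;
    } diff_run;

    /* Compare a modified page against its twin and build a run-length
     * encoding of the changed byte ranges. */
    diff_run *make_diff(const unsigned char *page, const unsigned char *twin)
    {
        diff_run *head = NULL, **tail = &head;
        size_t i = 0;

        while (i < PAGE_SIZE) {
            if (page[i] == twin[i]) { i++; continue; }

            size_t start = i;                 /* start of a modified run */
            while (i < PAGE_SIZE && page[i] != twin[i])
                i++;

            diff_run *r = malloc(sizeof *r);
            r->offset = start;
            r->length = i - start;
            r->data   = malloc(r->length);
            memcpy(r->data, page + start, r->length);
            r->next   = NULL;
            *tail = r;                        /* append to the run list */
            tail  = &r->next;
        }
        return head;
    }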
2.2 Compile-Time Support for Load Balancing

For the source-to-source translation from a sequential program to a parallel program using TreadMarks, we use the Stanford University Intermediate Format (SUIF) [11] compiler. The SUIF system is organized as a set of compiler passes built on top of a kernel that defines the intermediate format. The passes are implemented as separate programs that typically perform a single analysis or transformation and then write the results out to a file. The files always use the same format.

The input to the compiler is a sequential version of the code. The output that we start with is a version of the code parallelized for shared address space machines. The compiler generates a single-program, multiple-data (SPMD) program that we modified to make calls to the TreadMarks run-time library. Alternatively, the user can provide the SPMD program (instead of having the SUIF compiler generate it) by identifying the parallel loops in the program that are executed by all processors.

Our SUIF pass extracts the shared data access patterns in each of the SPMD regions, and feeds this information to the run-time system. The pass is also responsible for adding hooks in the parallelized code to allow the run-time library to change the load distribution in the parallel loops if necessary.

Access pattern extraction. In order to generate access pattern summaries, our SUIF pass walks through the program looking for accesses to shared memory (identified using the sh_ prefix). A regular section [12] is then created for each such shared access. Regular section descriptors (RSDs) concisely represent the array accesses in a loop nest. The RSDs represent the accessed data as linear expressions of the upper and lower loop bounds along each dimension, and include stride information. This information is combined with the corresponding loop boundaries of that index, and the size of each dimension of the array, to determine the access pattern. Depending on the kind of data sharing between parallel tasks, we follow different strategies of load redistribution in case of imbalance. We will discuss these strategies further in Section 2.3.
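As a concrete illustration, an RSD for a linear access such as sh_dat1[a*i+b] might be represented as sketched below. The structure and field names are our own; the paper does not specify the actual SUIF or run-time data structures.

    #define MAX_DIMS 4  /* assumed maximum array rank for this sketch */

    typedef enum { ACCESS_READ, ACCESS_WRITE } access_t;

    /* Linear access a*i + b over iterations lower <= i < upper, step stride. */
    typedef struct {
        int coeff;       /* a: coefficient of the loop index */
        int constant;    /* b: constant offset               */
        int lower;       /* lower loop bound                 */
        int upper;       /* upper loop bound                 */
        int stride;      /* loop increment                   */
    } dim_section_t;

    /* Regular section descriptor for one shared-array access in a loop nest. */
    typedef struct {
        void         *base;               /* base address of the shared array */
        int           ndims;              /* number of dimensions accessed    */
        int           dim_size[MAX_DIMS]; /* size of each array dimension     */
        dim_section_t section[MAX_DIMS];  /* per-dimension access description */
        access_t      access;             /* read or write                    */
    } rsd_t;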
Prefetching. The access pattern information can also be used to prefetch data [6]. The TreadMarks library offers prefetching calls. These calls, given a range of addresses, prefetch the data contained in the pages in that range, and provide appropriate (read/write) permissions on the page. This prefetching prevents faulting and consistency actions on uncached data that is guaranteed to be accessed in the future, as well as allows communication optimization by taking advantage of bulk transfer.

Load balancing interface and strategy. The run-time system needs a way of changing the amount of work assigned to each parallel task. This essentially means changing the number of loop iterations performed by each task. To accomplish this, we augment the code with calls to the run-time library before the parallel loops. This call is responsible for changing the loop bounds and consequently the amount of work done by each task.

The compiler can direct the run-time to choose between two partitioning strategies for distributing the parallel loops, sketched below. The goal is to minimize execution time by considering both the communication and the computation components.

1. Shifting of loop boundaries: This approach changes the upper and lower bounds of each parallel task, so that tasks on lightly loaded processors will end up with more work than tasks on heavily loaded processors. With this scheme we avoid the creation of new boundaries, and therefore possible sharing, on the data accessed by our tasks. Applications with nearest-neighbor sharing will benefit from this scheme. This policy, however, has the drawback of causing more communication at the time of load redistribution, since data has to be moved between all neighboring tasks rather than only from the slow processor.

2. Multiple loop bounds: This scheme is aimed at minimizing unnecessary data movement. Each process that uses this policy can access non-contiguous data by using multiple loop bounds. This policy fragments the shared data among the processors, but reduces communication at load redistribution time. Hence, care must be taken to ensure that this fragmentation does not result in either false sharing or excess true sharing due to load redistribution.
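A minimal sketch of the first strategy follows: each task's share of the iteration space is made proportional to its processor's relative power while every share stays contiguous, so no new internal boundaries are created. The function name and interface are our assumptions, not the paper's actual run-time API.

    /* Partition iterations [lower, upper) into nprocs contiguous chunks whose
     * sizes are proportional to each processor's relative power (normalized
     * by the run-time to sum to 1.0).  Writes each task's bounds into
     * new_lower[] and new_upper[]. */
    void shift_loop_bounds(int lower, int upper, int nprocs,
                           const float relative_power[],
                           int new_lower[], int new_upper[])
    {
        int total = upper - lower;
        float carry = 0.0f;              /* accumulates fractional iterations */
        int next = lower;

        for (int p = 0; p < nprocs; p++) {
            float share = relative_power[p] * total + carry;
            int count = (int)share;      /* iterations given to task p */
            carry = share - count;

            new_lower[p] = next;
            new_upper[p] = next + count;
            next += count;
        }
        new_upper[nprocs - 1] = upper;   /* rounding residue goes to the last task */
    }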
2.3 Run-Time Load Balancing Support

The run-time library is responsible for keeping track of the progress of each process. It collects statistics about the execution time of each parallel task and adjusts the load accordingly. The execution time for each parallel task is maintained on a per-processor basis (TaskTime). The relative processing power of the processor (RelativePower) is calculated on the basis of the current load distribution (RelativePower) as well as the per-processor task time, as described in Figure 1. Each processor executes this code prior to each parallel loop (SPMD region).

    float RelativePower[NumOfProcessors];
    float TaskTime[NumOfProcessors];
    float SumOfPowers = 0;

    forall processors i
        RelativePower[i] /= TaskTime[i];
        SumOfPowers += RelativePower[i];
    forall processors i
        RelativePower[i] /= SumOfPowers;

Fig. 1. Algorithm to Determine Relative Processing Power

It is crucial not to try to adjust too quickly to changes in execution time, because sudden changes in the distribution of the data might cause the system to oscillate. To make this clear, imagine a processor that for some reason is very slow the first time we gather statistics. If we adjust the load, we will end up sending most of its work to another processor. This will cause it to be very fast the second time around, resulting in a redistribution once again.

For this reason we have added some hysteresis to our system. We redistribute the load only if the relative power remains consistently at odds with the current allocation through a certain number of task creation points. Similarly, load is balanced only if the variance in relative power exceeds a threshold: if the time of the slowest process is within n% of the time of the fastest process, we don't change the distribution of work. Otherwise, minor oscillations may result as communication is generated due to the adjusted load. In our experiments, we collect statistics for 10 task creation points before trying to adjust, and then redistribute the work only if the time of the slowest process is not within 10% of the time of the fastest process. These cut-offs were heuristically determined on the basis of our experimental platform, and are a function of the amount of computation and any extra communication.

Load Balancing vs. Locality Management. Previous work [20] has shown that locality management is at least as important as load balancing. This is even more so in software DSM, where the processors are not tightly coupled, making communication expensive. Consequently, we need to avoid unnecessary movement of data and at the same time minimize page sharing. In order to deal with this problem, the run-time library uses the information supplied by the compiler about which loop distribution strategy to use. In addition, it keeps track of accesses to the shared array as declared in previous SPMD regions. Changes in partitioning that might result in extra communication are avoided in favor of a small amount of load imbalance. We call this method locality-conscious load balancing.
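The sketch below turns Figure 1's pseudocode into runnable C and adds the 10% hysteresis check described above. The function names, the fixed processor count, and the exact trigger logic are our illustrative assumptions.

    #define NUM_PROCS 16
    #define HYSTERESIS_POINTS 10      /* task creation points between adjustments */
    #define IMBALANCE_THRESHOLD 0.10f /* slowest within 10% of fastest: no change */

    static float RelativePower[NUM_PROCS]; /* normalized shares, sum to 1.0     */
    static float TaskTime[NUM_PROCS];      /* measured per-task execution times */
    static int   points_seen = 0;

    /* Figure 1: scale each share by measured speed, then renormalize. */
    static void update_relative_power(void)
    {
        float sum = 0.0f;
        for (int i = 0; i < NUM_PROCS; i++) {
            RelativePower[i] /= TaskTime[i];
            sum += RelativePower[i];
        }
        for (int i = 0; i < NUM_PROCS; i++)
            RelativePower[i] /= sum;
    }

    /* Returns nonzero once enough statistics have been gathered and the gap
     * between the slowest and fastest task exceeds the threshold. */
    static int should_redistribute(void)
    {
        if (++points_seen < HYSTERESIS_POINTS)
            return 0;

        float fastest = TaskTime[0], slowest = TaskTime[0];
        for (int i = 1; i < NUM_PROCS; i++) {
            if (TaskTime[i] < fastest) fastest = TaskTime[i];
            if (TaskTime[i] > slowest) slowest = TaskTime[i];
        }
        if (slowest <= fastest * (1.0f + IMBALANCE_THRESHOLD))
            return 0;                 /* within 10%: leave the distribution alone */

        points_seen = 0;              /* reset the hysteresis window */
        return 1;
    }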
2.4 Example

Consider the parallel loop of Figure 2. Our compiler pass transforms this loop into that in Figure 3. The new code makes a redistribute call to the run-time library, providing it with all the necessary information to compute the access patterns (the arrays, the types of accesses, the upper and lower bounds of the loops, and the format of the expressions for the indices). The redistribute call computes the relative powers of the processors (using the algorithm shown in Figure 1), and then uses the access pattern information to decide how to distribute the workload.

    int sh_dat1[n], sh_dat2[n];

    for (i = lower_bound; i < upper_bound; i += stride)
        sh_dat1[a*i+b] += sh_dat2[c*i+d];

Fig. 2. Initial parallel loop.

    int sh_dat1[n], sh_dat2[n];

    redistribute(
        list of shared arrays,        /* sh_dat1, sh_dat2 */
        list of types of accesses,    /* read/write       */
        list of lower bounds,         /* lower_bound      */
        list of upper bounds,         /* upper_bound      */
        list of coefficients and
            constants for indices     /* a, c, b, d       */
    );
    while (there are still ranges) {
        lower_bound = new lower bound for that range;
        upper_bound = new upper bound for that range;
        for (i = lower_bound; i < upper_bound; i += stride)
            sh_dat1[a*i+b] += sh_dat2[c*i+d];
        range = range->next;
    }

Fig. 3. Parallel loop with added code that serves as an interface with the run-time library. The run-time system can then change the amount of work assigned to each parallel task.
3 Experimental Evaluation

3.1 Environment

Our experimental environment consists of eight DEC AlphaServer 2100 4/233 computers. Each AlphaServer is equipped with four 21064A processors operating at 233MHz and with 256MB of shared memory, as well as a Memory Channel network interface. Each AlphaServer runs Digital Unix 4.0D with TruCluster v1.5 extensions. The programs, the run-time library, and TreadMarks were compiled with gcc using the -O2 optimization flag.

3.2 Load Balancing Results

We evaluate our system on two applications: a matrix multiplication of three 256x256 shared matrices of longs (which is repeated 100 times), and Jacobi, with a matrix size of 2050x2050 floats. The current implementation only uses the first policy, shifting of loop boundaries, and does not use prefetching. To test the performance of our load balancing library, we introduced an artificial load on one of the processors of each SMP. This consists of a tight loop that writes on an array of 10240 longs. This load takes up 50% of the CPU time.

Our preliminary results appear in Figures 4 and 5. We present execution times on 1, 2, 4, 8, and 16 processors, using up to four SMPs. We added one artificial load for every four processors, except in the case of two processors where we only added one load. The load balancing scheme we use is the shifting of loop boundaries (we do not use multiple loop bounds). The first column shows the execution times for the cases where there was no load in the system. The second column shows the execution times with the artificial load, and finally the last column is the case where the system is loaded but we are using our load balancing library.

The introduction of load slows down both matrix multiply and Jacobi by as much as 100% in the case of two processors (with the overhead at 4, 8, and 16 processors not being far off).

Our load balancing strategy provides a significant improvement in performance compared to execution time with load. In order to determine how good the results of our load balancing algorithm are, we compare the execution times obtained using 8 processors with load and our load balance scheme with those using 7 processors without any load. This 7-processor run serves as a bound on how well we can perform with load balancing, since that is the best we can hope to achieve (two of our eight processors are loaded, and operate at only 50% of their power, giving us the equivalent of seven processors). The results are presented in Figure 6. For matrix multiply, our load balancing algorithm is only 9% slower than the seven-processor load-free case. Jacobi is 20% slower, partly due to the fact that while computation can be redistributed, communication per processor remains the same.

In Figure 7, we present a breakdown of the normalized execution time relative to that on 8 processors with no load, indicating the relative time spent in user code, in the protocol, and in communication and wait time (at synchronization points).
Fig. 4. Execution Times for Matrix Multiply, in seconds, on 1, 2, 4, 8, and 16 processors: no load, with load, and with load balance.

When we use our load balancing algorithm, we reduce the time spent waiting at synchronization points relative to the execution time with load and no load balance, because we have a better distribution of work, and therefore improve overall performance.

Finally, we wanted to measure the overhead imposed by our run-time system. We ran matrix multiplication and Jacobi in a load-free environment with and without use of our run-time library. The results are presented in Figure 8. In the worst case we impose less than 6% overhead.

3.3 Locality-conscious Load Balancing Results

For the evaluation of our locality-conscious load balancing policy we used Shallow, with input size 514x514 matrices of doubles. Shallow operates on the interior elements of the arrays and then updates the boundaries. Compiler-parallelized code or a naive implementation would have each process update a part of the boundaries along each dimension in parallel. This can result in multiple processes writing the same pages, i.e., false sharing. A smarter approach is to have the processes that own the boundary pages do the updates; this eliminates false sharing.
Fig. 5. Execution Time for Jacobi, in seconds, on 1, 2, 4, 8, and 16 processors: no load, with load, and with load balance.

Our integrated compiler/run-time system is able to make the decision at run-time, using the access pattern information provided by the compiler. It identifies which process caches the data and can repartition the work so that it maximizes locality.

We present our results in Figure 9. In these experiments we don't introduce any load imbalance to our system, since we want to evaluate our locality-conscious load balancing policy. We have optimized the manual parallelization to eliminate false sharing as suggested earlier. A naive compiler parallelization that doesn't consider data placement performs very poorly as the number of processors increases, because of the multiple writers on the same page. However, when we combine the compiler parallelization with our locality-conscious load balancing run-time system, the performance is equivalent to the hand-optimized code.

4 Related Work

There have been several approaches to the problems of locality management and load balancing. Perhaps the most common approach is the task queue model. In this scheme, there is a central queue of loop iterations. Once a processor has finished its assigned portion, more work is obtained from this queue. There are several variations, including self-scheduling [23], fixed-size chunking [15], guided self-scheduling [22], and adaptive guided self-scheduling [7].
Fig. 6. Comparison of the running times of Matrix Multiplication and Jacobi, in seconds, using our load balancing algorithm on 8 loaded processors (SUIF 8p with load balance), compared to their performance on 7 load-free processors (Manual 7p no load).

Markatos and LeBlanc in [20] argue that locality management is more important than load balancing in thread assignment. They introduce a policy they call memory-conscious scheduling that assigns threads to processors whose local memory holds most of the data the thread will access. Their results show that the looser the interconnection network, the more important the locality management.

Based on the observation that the locality of the data that a loop accesses is very important, affinity scheduling was introduced in [19]. The loop iterations are divided over all the processors equally in local queues. When a processor is idle, it removes 1/k of the iterations in its local work queue and executes them; k is a parameter of their algorithm, which they define as p in most of their experiments. If a processor's work queue is empty, it finds the most loaded processor and removes 1/p of the iterations in that processor's work queue and executes them, where p is the number of processors.
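To make the scheme concrete, here is a minimal, single-threaded sketch of the queue arithmetic just described; real affinity scheduling runs this concurrently with per-queue locking, and the names and rounding choices are our assumptions rather than code from [19].

    #define P 4          /* number of processors                     */
    #define K P          /* chunk parameter; [19] usually sets k = p */

    static int queue[P]; /* loop iterations remaining in each local queue */

    /* An idle processor claims 1/k of its own remaining iterations or,
     * if its queue is empty, steals 1/p of the most loaded processor's
     * iterations.  Returns the number of iterations claimed (0 when no
     * work remains anywhere). */
    static int claim_work(int self)
    {
        if (queue[self] > 0) {
            int n = (queue[self] + K - 1) / K;   /* ceil(remaining / k) */
            queue[self] -= n;
            return n;
        }
        int victim = 0;                          /* most loaded processor */
        for (int i = 1; i < P; i++)
            if (queue[i] > queue[victim])
                victim = i;
        if (queue[victim] == 0)
            return 0;
        int n = (queue[victim] + P - 1) / P;     /* ceil(remaining / p) */
        queue[victim] -= n;
        return n;
    }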
Building on [19], Yan et al. in [24] suggest adaptive affinity scheduling. Their algorithm is similar to affinity scheduling, but their runtime system can modify k during the execution of the program. When a processor is loaded, k is increased so that other processors with a lighter load can get loop iterations from the loaded processor's local work queue. They present four possible policies for changing k: an exponential adaptive mechanism, a linear adaptive mechanism, a conservative adaptive mechanism, and a greedy adaptive mechanism.

Fig. 7. Breakup of normalized execution time for matrix multiplication (MM) and Jacobi into time spent in user code, communication and wait at synchronization points, and protocol time, for 8p no load, 8p with load, 7p no load, and 8p with load balance.

In [4], Cierniak et al. study loop scheduling in heterogeneous environments with respect to programs, processors, and the interconnection networks. Their results indicate that taking into account the relative computation power as well as any heterogeneity in the loop format while doing the loop distribution improves the overall performance of the application. Similarly, Moon and Saltz [21] also looked at applications with irregular access patterns. To compensate for load imbalance, they introduce periodic re-mapping, or re-mapping at predetermined points of the execution, and dynamic re-mapping, in which they determine if repartitioning is required at every time step.

In the context of dynamically changing environments, Edjlali et al. in [8] and Kaddoura in [13] present a run-time approach for handling such environments. Before each parallel section of the program they check if there is a need to re-map the loop. This is similar to our approach; however, their approach deals with message passing programs.
Fig. 8. Running times for matrix multiplication (MM) and Jacobi in a load-free environment, with and without use of our run-time library (8p no lb, 8p with lb).

A discussion on global vs. local and distributed vs. centralized strategies for load balancing is presented in [25]. Based on the information they use to make load balancing decisions, load balancers can be divided into local and global; distributed and centralized refers to whether the load balancer is one master processor or distributed among the processors. The authors argue that depending on the application and system parameters, each of those schemes can be more suitable than the others.

The system that seems most related to ours is Adapt, presented in [17]. Adapt is implemented in concert with the Distributed Filaments software kernel [9], a DSM system. It monitors communication and page faults, and dynamically modifies loop boundaries so that the processes access data that are local if possible. Adapt is able to extract the access patterns by inspecting the patterns of the page faults. It can only recognize two patterns, nearest-neighbor and broadcast, and this limits its flexibility. In our system we use the compiler to extract the access patterns and provide them to the run-time system, making our approach more general and flexible.

Finally, there are systems like Condor [16] that support transparent migration of processes from one workstation to another. However, such systems don't support parallel programs efficiently.
Fig. 9. Running times of the three different implementations of Shallow, in seconds, on 1, 2, 4, 8, and 16 processors. The manual parallelization takes into account data placement in order to avoid page sharing. The compiler parallelization doesn't consider data placement. The adaptive parallelization uses the compiler parallelization with our run-time library, which adjusts the workload taking data placement into account dynamically.

Our system deals with software distributed shared memory programs, in contrast to closely coupled shared memory or message passing. Our load balancing method targets both irregularities of the loops as well as possible heterogeneous processors and load caused by competing programs. Furthermore, our system addresses locality management by trying to minimize communication and page sharing.

5 Conclusions

In this paper, we address the problem of load balancing in SDSM systems by coupling compile-time and run-time information. SDSM has unique characteristics that are attractive: it offers the ease of programming of a shared memory model in a widely available workstation-based message passing environment. However, multiple users and loosely connected processors challenge the performance
of SDSM programs on such systems due to load imbalances and high communication latencies.

Our integrated system uses access information available at compile-time to dynamically adjust load at run-time based on the available relative processing power and communication speeds. The same access pattern information is also used to prefetch data. Preliminary results are encouraging. Performance tests on two applications and a fixed load indicate that the performance with load balance is within 9 and 20% of the ideal performance. Additionally, our system is able to partition the work so that processes access only their local data, minimizing false sharing. Our system identified regions where false sharing existed and changed the loop boundaries to avoid it. The performance on our third application, when the number of processors was high, was equivalent to the best possible workload partitioning.

Further work to collect results on a larger number of applications is necessary. In addition, for a more thorough evaluation, we need to determine the sensitivity of our strategy to dynamic changes in load, as well as to changes in the hysteresis factor used when determining when to redistribute work. The tradeoff between locality management and load must also be further investigated.

References

1. S. P. Amarasinghe, J. M. Anderson, M. S. Lam, and C. W. Tseng. The SUIF compiler for scalable parallel machines. In Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, February 1995.
2. C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel. TreadMarks: Shared memory computing on networks of workstations. IEEE Computer, 29(2):18–28, February 1996.
3. J. B. Carter, J. K. Bennett, and W. Zwaenepoel. Techniques for reducing consistency-related information in distributed shared memory systems. ACM Transactions on Computer Systems, 13(3):205–243, August 1995.
4. Michal Cierniak, Wei Li, and Mohammed Javeed Zaki. Loop scheduling for heterogeneity. In Fourth International Symposium on High Performance Distributed Computing, August 1995.
5. A. L. Cox, S. Dwarkadas, H. Lu, and W. Zwaenepoel. Evaluating the performance of software distributed shared memory as a target for parallelizing compilers. In Proceedings of the 11th International Parallel Processing Symposium, pages 474–482, April 1997.
6. S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. An integrated compile-time/run-time software distributed shared memory system. In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, October 1996.
7. D. L. Eager and J. Zahorjan. Adaptive guided self-scheduling. Technical Report, Department of Computer Science, University of Washington, January 1992.
8. Guy Edjlali, Gagan Agrawal, Alan Sussman, and Joel Saltz. Data parallel programming in an adaptive environment. In International Parallel Processing Symposium, April 1995.
9. V. W. Freeh, D. K. Lowenthal, and G. R. Andrews. Distributed Filaments: Efficient fine-grain parallelism on a cluster of workstations. In Proceedings of the First USENIX Symposium on Operating System Design and Implementation, pages 201–213, November 1994.
10. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy. Memory consistency and event ordering in scalable shared-memory multiprocessors. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 15–26, May 1990.
11. The SUIF Group. An overview of the SUIF compiler system.
12. P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350–360, July 1991.
13. Maher Kaddoura. Load balancing for regular data-parallel applications on workstation network. In Communication and Architecture Support for Network-Based Parallel Computing, pages 173–183, February 1997.
14. P. Keleher, A. L. Cox, and W. Zwaenepoel. Lazy release consistency for software distributed shared memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 13–21, May 1992.
15. C. Kruskal and A. Weiss. Allocating independent subtasks on parallel processors.
16. M. Litzkow and M. Solomon. Supporting checkpointing and process migration outside the Unix kernel. In USENIX Winter Conference, 1992.
17. David K. Lowenthal and Gregory R. Andrews. An adaptive approach to data placement. In Transactions on Computer Systems, October.
18. H. Lu, A. L. Cox, S. Dwarkadas, R. Rajamony, and W. Zwaenepoel. Software distributed shared memory support for irregular applications. In Proceedings of the 6th Symposium on the Principles and Practice of Parallel Programming, pages 48–56, June 1997.
19. Evangelos P. Markatos and Thomas J. LeBlanc. Using processor affinity in loop scheduling on shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 5(4):379–400, April 1994.
20. Evangelos P. Markatos and Thomas J. LeBlanc. Load balancing versus locality management in shared-memory multiprocessors. In Proceedings of the 1992 International Conference on Parallel Processing, pages I:258–267, August 1992.
21. Bongki Moon and Joel Saltz. Adaptive runtime support for direct simulation Monte Carlo methods on distributed memory architectures. In Scalable High Performance Computing Conference, May 1994.
22. C. D. Polychronopoulos and D. J. Kuck. Guided self-scheduling: a practical scheduling scheme for parallel supercomputers. IEEE Transactions on Computers, September 1992.
23. P. Tang and P. C. Yew. Processor self-scheduling: A practical scheduling scheme for parallel computers. In International Conference on Parallel Processing, August.
24. Yong Yan, Canming Jin, and Xiaodong Zhang. Adaptively scheduling parallel loops in distributed shared-memory systems. IEEE Transactions on Parallel and Distributed Systems, volume 8, January 1997.
25. Mohammed Javeed Zaki, Wei Li, and Srinivasan Parthasarathy. Customized dynamic load balancing for a network of workstations. Technical Report 602, Department of Computer Science, University of Rochester, December 1995.