Universitat Autònoma de Barcelona

A new distributed diffusion algorithm for dynamic load-balancing in parallel systems

Departament d'Informàtica
Unitat d'Arquitectura d'Ordinadors i Sistemes Operatius

A thesis submitted by Ana Cortés Fité in fulfilment of the requirements for the degree of Doctor per la Universitat Autònoma de Barcelona.

Barcelona (Spain), September 2000

A new distributed diffusion algorithm for dynamic load-balancing in parallel systems

Thesis submitted by Ana Cortés Fité in fulfilment of the requirements for the degree of Doctor per la Universitat Autònoma de Barcelona. This work has been developed in the Computer Science Department of the Universitat Autònoma de Barcelona and was advised by Dra. Ana Ripoll Aracil.

Bellaterra, September 2000

Thesis Advisor: Ana Ripoll Aracil
A la meva filla Júlia: "... mami, explica'm aquest conte."
(To my daughter Júlia: "... Mummy, tell me this story.")
ACKNOWLEDGEMENTS

A lot of people have made this work possible. I wish to express my sincere gratitude to them all for being there, for working with me and for helping me.

First of all I want to thank Ana Ripoll Aracil for being my advisor throughout this work, for her constant advice and never-ending encouragement. I have learned from her how to organise basic concepts and new ideas, and how to describe and transmit them with clarity.

I would like to express my gratitude to Emilio Luque for his suggestions at the outset of this work, as well as for his inestimable contribution to its development at certain critical moments.

My deepest thanks go to Miquel Angel Senar for his technical support and the clarity of his discussion of this work, and, particularly, for not being discouraged at my own discouragement, for bringing calm and good sense in moments of desperation and madness and, above all, for being there at all times. Without his support and affection this work would never have reached its conclusion.

Thanks to Ferran Cedó, member of the Department of Mathematics at the Universitat Autònoma de Barcelona, for his mathematical support in the formal detailed descriptions and analysis of this work.

To Tomàs Margalef for thinking of me at the very beginning of this work, and for getting me under starters' orders.

To Dani Franco and Indira Garcés for lending me the beta version of their network simulator in order to obtain "deadlock"-free reliable communication times.

I would also like to thank Maria Serrano, Lola Rexachs, Eduardo César and Elisa Heymann for putting up with long talks with me without ever losing their patience.

Thanks to Josep Pons and Jaume Jo, who have collaborated in the implementation of some of the programs related to this work.

Special thanks to Porfidio Hernández for his academic help during the period in which I was intensively preparing my lectures.

I am also grateful to the rest of my colleagues from the Computer Architecture and Operating Systems Group, as well as to those who have passed through the Group during the gestation and preparation periods of this work, for their constant support and encouragement.

Last, but decidedly not least, I would like to thank my family for their support throughout this time, and especially my daughter Júlia who, without being aware of it, has given me the strength to cope with the work's tough final stage, and who has had to suffer in ignorance all of my many and various moods and frames of mind.
Contents

PREFACE

CHAPTER 1  THE DYNAMIC LOAD-BALANCING PROBLEM
  1.1 INTRODUCTION
  1.2 KEY ISSUES IN THE LOAD-BALANCING PROCESS
  1.3 PROCESSOR LEVEL
    1.3.1 Load Manager block
    1.3.2 Load Balancing Algorithm block
      Load Balancing Activation
      Work Transfer Calculation
    1.3.3 Migration Manager block
      Load unit selection
      Load transfer
  1.4 SYSTEM LEVEL
    1.4.1 Centralised
    1.4.2 Totally distributed
    1.4.3 Partially distributed
    1.4.4 Synchronous versus asynchronous
  1.5 LOAD-BALANCING ALGORITHM TAXONOMY
  1.6 DYNAMIC LOAD-BALANCING STRATEGIES
    1.6.1 Randomised
    1.6.2 Physical Optimisation
    1.6.3 Diffusion
    1.6.4 Dimension Exchange
    1.6.5 Gradient Model
    1.6.6 Minimum-direction
  1.7 SOFTWARE FACILITIES FOR SUPPORTING DYNAMIC LOAD-BALANCING
    1.7.1 Architecture of process-based LBS
    1.7.2 Design issues of a process migration mechanism
      Migration Initiation
      State Capture
      State Transfer
      Process Restart
    1.7.3 Limitations of process migration mechanisms
    1.7.4 Examples of existing packages for supporting dynamic load-balancing

CHAPTER 2  NEAREST-NEIGHBOUR LOAD-BALANCING METHODS
  2.2 BASIC NOTATION AND ASSUMPTIONS
  2.3 ITERATIVE NEAREST-NEIGHBOUR LOAD-BALANCING ALGORITHMS
    2.3.1 Algorithm convergence
    2.3.2 Termination detection problem
  2.4 ANALYSIS OF RELEVANT NEAREST-NEIGHBOURS LOAD-BALANCING ALGORITHMS
    2.4.1 The SID (Sender Initiated Diffusion) algorithm
    2.4.2 The GDE (Generalised Dimension Exchange) algorithm
    2.4.3 The AN (Average Neighbourhood) algorithm
  2.5 SUMMARY OF THIS CHAPTER

CHAPTER 3  DASUD LOAD-BALANCING ALGORITHM
  3.1 DASUD (DIFFUSION ALGORITHM SEARCHING UNBALANCED DOMAINS)'S MOTIVATION
  3.2 DESCRIPTION OF THE DASUD ALGORITHM
    3.2.1 Description of the first stage of DASUD
    3.2.2 Description of the second stage of DASUD
      Searching Unbalanced Domains (SUD) block
      Fine Load Distribution (FLD) block
      Sending Instruction Message (SIM) block
      Processing Instruction Messages (PIM) block
  3.3 AN EXAMPLE OF DASUD EXECUTION
  3.4 DASUD'S COMPLEXITY
  3.5 DASUD'S CONVERGENCE
  3.6 DASUD'S CONVERGENCE RATE
  3.7 PERFECT LOCAL BALANCE ACHIEVED BY DASUD
  3.8 GLOBAL BALANCE DEGREE ACHIEVED BY DASUD

CHAPTER 4  COMPARATIVE STUDY OF NEAREST-NEIGHBOUR LOAD-BALANCING ALGORITHMS
  4.1 SIMULATION FRAMEWORK
    4.1.1 Interconnection Networks
    4.1.2 Synthetic load distributions
  4.2 QUALITY METRICS
  4.3 STABILITY ANALYSIS
    4.3.1 Influence of the initial load distribution pattern in the dif_max
    4.3.2 Influence of the system size in the dif_max
    4.3.3 Influence of the initial load distribution shape in the dif_max
    4.3.4 Influence of the initial load distribution pattern in the σ
    4.3.5 Influence of the system size in the σ
    4.3.6 Influence of the initial load distribution shape in the σ
    4.3.7 Conclusions of the stability analysis
  4.4 EFFICIENCY ANALYSIS
    4.4.1 Influence of the initial load distribution pattern in u
    4.4.2 Influence of the system size in u
    4.4.3 Influence of the initial load distribution shape in u
    4.4.4 Influence of the initial load distribution pattern in steps
    4.4.5 Influence of the system size in steps
    4.4.6 Influence of the initial load distribution shape in steps
    4.4.7 Conclusions of the efficiency analysis
  4.5 SUMMARY AND CONCLUSIONS OF THE COMPARATIVE STUDY

CHAPTER 5  SCALABILITY OF DASUD
  5.1 INTRODUCTION
  5.2 DASUD'S SCALABILITY WITH RESPECT TO THE PROBLEM SIZE
  5.3 DASUD'S SCALABILITY WITH RESPECT TO SYSTEM SIZE
  5.4 CONCLUSION ABOUT DASUD'S SCALABILITY

CHAPTER 6  ENLARGING THE DOMAIN (k-DASUD)
  6.1 INTRODUCTION
  6.2 EXTENDED SYSTEM MODEL
  6.3 METRICS
    6.3.1 Communication periods
      The information collection period
      The transfer period
    6.3.2 Computational period
    6.3.3 Trade-off factor (t_off(k))
  6.4 THE EXPERIMENTAL STUDY OF k-DASUD
    6.4.1 The best degree of final balance
    6.4.2 Greater unbalancing reduction
  6.5 CONCLUSIONS TO THIS CHAPTER

CHAPTER 7  CONCLUSIONS AND FUTURE WORK
  7.1 CONCLUSIONS AND MAIN CONTRIBUTIONS
  7.2 CURRENT AND FUTURE WORK

REFERENCES

APPENDIX A  DASUD LOAD-BALANCING ALGORITHM: EXPERIMENTAL AND THEORETICAL ANNEXES
  A.1 EXPERIMENTAL VALIDATION OF DASUD'S FINAL BALANCE DEGREE
  A.2 EXPERIMENTAL VALIDATION OF DASUD'S CONVERGENCE RATE
  A.3 A REALISTIC LOAD-BALANCING MODEL

APPENDIX B  COMPARATIVE STUDY OF NEAREST-NEIGHBOUR LOAD-BALANCING ALGORITHMS: COMPLEMENTARY TABLES

APPENDIX C  ENLARGING THE DOMAIN (k-DASUD): COMPLEMENTARY FIGURES

APPENDIX D  ENLARGING THE DOMAIN (k-DASUD): COMPLEMENTARY FIGURES AND TABLES
PREFACE

Advances in hardware and software technologies have led to increased interest in the use of large-scale parallel and distributed systems for database, real-time, and large-scale scientific and commercial applications. The operating systems and the management of the concurrent processes constitute integral parts of parallel and distributed environments. One of the biggest issues in such systems is the development of effective techniques for the distribution of processes among processing elements to achieve some performance goal(s), such as minimising execution time, minimising communication delays, and/or maximising resource utilisation.

Load-balancing is one of the most important problems which must be solved in order to enable the efficient use of multiprocessor systems. Load-balancing aims at improving the performance of multiprocessor systems by equalising the computational load over all processors in the system, since it is commonly agreed that evenly balancing the load among all processors in the system directly leads to a minimisation of the total execution time.

There are applications that can be partitioned into tasks with regular computation and communication patterns and, therefore, load-balancing algorithms can be used to assign computational tasks to processors before beginning the execution. This is called static load-balancing. However, there is an important and increasingly common class of scientific applications (such as particle/plasma simulations, parallel solvers for partial differential equations, numerical integration and the N-body problem, to name just a few) where the computational load associated with a particular processor may change over the course of a computation and cannot be estimated beforehand. For this class of non-uniform problems with unpredictable a priori computation and communication requirements, dynamic load-balancing algorithms are needed to efficiently distribute the computational load at run time on the multiprocessor system. This work is about dynamic load-balancing in message-passing parallel computers where, in general, a direct, point-to-point interconnection network is used for communication.

Load-balancing is performed by transferring load from heavily to lightly loaded processors. For that purpose, a load-balancing algorithm has to resolve the issues of when to invoke a balancing operation, who makes load-balancing decisions according to what information, and how to manage load migrations between processors. We can find several answers to these questions, which results in a wide set of load-balancing techniques.

A highly popular class of load-balancing strategies are nearest-neighbour approaches, which are edge-local, that is, methods that can be implemented in a local manner by each processor consulting only its neighbours, thereby avoiding expensive global communication in distributed applications. The load moved along each edge is related to the gradient in the loads across it. These kinds of distributed load-balancing algorithms are appealingly simple and they degrade gracefully in the presence of asynchrony and faults. Most of these algorithms are implemented in an iterative way to achieve a globally balanced load state and, therefore, they are referred to as iterative load-balancing algorithms. The load balance stems from successive approximations to a globally optimal load distribution obtained by being concerned only with local load movements. In the ideal case, a perfect state is achieved when all processors have the same load. These kinds of load-balancing algorithms are suited to appreciably decreasing large imbalances.

Most iterative load-balancing algorithms proposed in the literature consider an idealised version of the load-balancing problem in which the loads are treated as real numbers; therefore, loads can be split arbitrarily. However, in a more realistic setting of the problem, which covers medium- and large-grain parallelism, the loads (processes, data, threads) are not infinitely divisible and, as a consequence, they are treated as natural numbers. There are two categories of load-balancing algorithms that consider a discrete load model: on the one hand, the load-balancing algorithms that were originally designed under the discrete load model assumption, and, on the other hand, the discrete adaptations of the idealised load-balancing algorithms obtained by performing rounding operations. Iterative load-balancing algorithms using a discrete load model produce situations in which a global load balance cannot be guaranteed when the load-balancing process terminates. Furthermore, the convergence analysis of these iterative load-balancing algorithms using a discrete load model has not been explored in the literature.

We raised the issue of the development of a realistic iterative load-balancing algorithm which was able to solve the balancing problems of existing discrete load-balancing algorithms in an asynchronous fashion. One important goal in this work was to derive a rigorous mathematical description of the proposed algorithm, which allows us to analyse its convergence, as well as other formal aspects of the algorithm such as its complexity and convergence rate. The proposed algorithm, called DASUD (Diffusion Algorithm Searching Unbalanced Domains), is noteworthy for its ability to detect locally unbalanced situations that are not detected by other algorithms and for always achieving the optimal local balance distribution.

Once DASUD was fully described, we were interested in comparing our proposal to the most relevant load-balancing strategies within the same family, in order to evaluate the goodness of DASUD. For that purpose, we raised the need to develop a simulation environment in which the whole load-balancing process could be simulated for different iterative load-balancing algorithms under the same conditions. Parameters such as topology, load distributions, system size and problem size should be easily variable in order to analyse the influence of each of them on the behaviour of the simulated load-balancing algorithms.

By simulation we have compared our algorithm with three well-known nearest-neighbour load-balancing algorithms from the literature, attending to two quality measurements: stability and efficiency. Stability measures the ability of the algorithm to coerce any initial load distribution into a global stable state as close to even as possible. Efficiency measures the time delay for arriving at the globally stable state. From the results we are able to conclude that DASUD exhibits the best trade-off between the degree of balance achieved and the time incurred to reach it.

This work is organised as follows:

The first chapter gives an overview of the dynamic load-balancing problem in parallel computing, where the key issues that must be considered in this problem are described. Following this, a load-balancing algorithm taxonomy that illustrates how load distribution can be carried out is developed, and a simple description of the most relevant dynamic load-balancing strategies is included.

The second chapter is focused on the nearest-neighbour load-balancing methods. Since this kind of load-balancing algorithm works in an iterative way, two important issues, namely algorithm convergence and termination detection, are discussed. Three relevant load-balancing algorithms from this category are described and analysed in detail.

In chapter three the proposed dynamic load-balancing algorithm, DASUD, is described and analysed. DASUD's complexity is provided, as well as its convergence proof and upper bounds for the final balance degree and the number of balance iterations needed to achieve it.

In chapter four the proposed load-balancing algorithm is compared by simulation to three of the most relevant load-balancing algorithms within the nearest-neighbour category that were described in chapter 2. The simulation framework has been designed to include different interconnection networks as well as a wide range of system sizes. Moreover, the simulated load distribution patterns vary from lightly unbalanced situations to highly unbalanced ones. The comparison has been carried out in terms of the unbalance degree reached by the algorithms and the time needed to achieve this final state.

In chapter five the scalability of the proposed strategy is analysed in order to show its capacity for reacting similarly under different problem and system sizes.

A question that occurred to us during the evaluation of DASUD was: how would DASUD work if it was able to collect more load information than only that of its immediate neighbours? For that purpose, in chapter six, an extended system model is provided. The influence of enlarging the domain on the time incurred in transferring messages beyond one link, and the extra computational cost incurred by the extended version of DASUD (k-DASUD), have also been considered in order to evaluate which enlargement provides the best trade-off between the balance improvement and the load-balancing time spent.

Chapter seven summarises the main conclusions derived from this thesis, outlining, in addition, current and future work in this field.

Finally, the complete bibliography is provided, and complementary tables and figures are included in four appendixes.
Chapter 1
The dynamic load-balancing problem

Abstract

This chapter gives an overview of the dynamic load-balancing problem in parallel computing where, firstly, the key issues that must be considered in this problem are described. Following this, a load-balancing algorithm taxonomy that illustrates how load distribution can be carried out is presented, as well as a simple description of the most relevant dynamic load-balancing strategies. Finally, environments and existing tools for supporting load-balancing are described.
1.1 Introduction

When a parallel application is divided into a fixed number of processes (tasks)* that are to be executed in parallel, each processor performs a certain amount of work. However, it may be that some processors will complete their tasks before others and become idle, because the work is unevenly divided, or because some processors operate faster than others, or both. Ideally, all the processors should be operating continuously on tasks, which would lead to the minimum execution time. Achieving this goal by spreading the tasks evenly across the processors is called load-balancing.

Load-balancing can be attempted statically, before the execution of any process, or dynamically, during the execution of the processes. Static load-balancing is usually referred to as the mapping problem or the scheduling problem. Dynamic load-balancing techniques assume little or no compile-time knowledge of the runtime parameters of the problem, such as task execution times or communication delays. These techniques are particularly useful in efficiently resolving applications that have unpredictable computational requirements or irregular communication patterns. Adaptive calculations, circuit simulations and VLSI design, N-body problems, parallel discrete event simulation, and data mining are just a few of those applications.

Dynamic load-balancing (DLB) is based on the redistribution of load among the processors during execution time, so that each processor has the same or nearly the same amount of work to do. This redistribution is performed by transferring load units (data, threads, processes) from the heavily loaded processors to the lightly loaded processors with the aim of obtaining the highest possible execution speed. DLB and load sharing are used as interchangeable terms in the literature. While DLB views redistribution as the assignment of processes among the processors, load sharing defines redistribution as the sharing of the system's processing power among the processes. The result of applying an ideal DLB algorithm to a 3x3 torus is shown in figure 1.1. The numbers inside the circles denote the load value of each processor. Initially, at time t0, the load is unevenly distributed among the processors. The load becomes the same in all processors after executing an ideal DLB strategy (time t1).

* In this context, both terms (process and task) are used indistinctly.

Figure 1.1. Dynamic load-balancing process: the load before DLB (time t0) and after DLB (time t1).

Every DLB strategy has to resolve the issues of when to invoke a balancing operation, who makes load-balancing decisions according to what information, and how to manage load migrations between processors. There has been much research on DLB strategies for distributed computing systems. However, on parallel computing systems the DLB problem takes on different characteristics. First, parallel computers typically use a regular point-to-point interconnection network, instead of a random network configuration. Second, the load imbalance in a distributed computer is due primarily to external task arrivals, whereas the load imbalance in a parallel computer is due to the uneven and unpredictable nature of tasks. The advantage of dynamic load-balancing over static load-balancing is that the system need not be aware of the run-time behaviour of the applications before execution. Nevertheless, the major disadvantage of DLB schemes is the run-time overhead due to the load information transfer among processors, the execution of the load-balancing strategy, and the communication delays due to load relocation itself.

1.2 Key issues in the load-balancing process

The design of a dynamic load-balancing algorithm requires resolving issues such as: who specifies the amount of load information made available to the decision-maker; who determines the condition under which a unit of load should be transferred; who identifies the destination processor of the load to transfer; and how to manage load migrations between processors, amongst other issues. Combining different answers to the above questions results in a large space of possible designs of load-balancing algorithms with widely varying characteristics. On the one hand, there are decisions which must be taken at processor level and, on the other, decisions that require a greater or lesser degree of co-ordination between different processors, so the latter become system-level decisions. In order to be systematic in the description of all the necessary decisions related to the load-balancing process, we distinguish two different design points of view: the processor-level point of view and the system-level point of view (see figure 1.2). We refer to the processor level when the load-balancing operations respond to decisions taken by a single processor. Otherwise, we talk about the system level when the decisions affect a group of processors. In the following sections, we outline the description of each one of these levels.

Figure 1.2. Load-balancing: two design points of view (processor level and system level).

1.3 Processor level

A processor which intervenes in the load-balancing process will execute computational operations both for application tasks and for load-balancing operations. This section describes the load-balancing operations carried out at processor level and their design alternatives. In order to perform the load-balancing operations, a processor must allocate three functional blocks to effectively implement the load-balancing process: the Load Manager (LM), the Load-Balancing Algorithm (LBA) and the Migration Manager (MM), as shown in figure 1.3. The Load Manager block is the one related to all load-keeping issues. The Load-Balancing Algorithm block is related to the concrete specification of the load-balancing strategy. Finally, a Migration Manager block is needed in order to actually perform the load movements. The processors which incorporate each one of these three blocks will be referred to as running processors.

Figure 1.3. Functional blocks that integrate the load-balancing process within a processor: the Load Manager, the Load Balancing Algorithm and the Migration Manager, alongside the computational operations.

The next sections will discuss the implementation issues for each one of these blocks, as well as their co-operation.

1.3.1 Load Manager block

One of the most important issues in the load-balancing process is to quantify the amount of load (data, threads or processes) of a given processor (the load index). It is impossible to quantify exactly the execution time of the resident processes of a processor. Therefore, some measurable parameters should be used to determine the load index in a system, such as the process sizes, the number of ready processes, the amount of data to be processed and so on. However, previous studies have shown that simple definitions such as the number of ready processes are particularly effective in quantifying the load index of a processor [Kun91].

The Load Manager block has the responsibility of updating the load information of the running processor, as well as gathering load information from a set of processors of the system (the underlying domain). The time at which the load index of each processor is to be updated is known as the load evaluation instant. A non-negative variable (integer or real number), taking on a zero value if the processor is idle and taking on increasing positive values as the load increases, will be measured at this time according to the load unit definition [Fer87]. There must be a trade-off between the load-gathering frequency and the ageing of the load information kept by the LM block, in order to avoid the use of obsolete values by the Load Balancing Algorithm block. This trade-off is captured in the following three load collection rules:

- On-demand: processors collect the load information from each other whenever a load-balancing operation is about to begin or be initiated [Sta84][Zna91].
- Periodical: processors periodically report their load information to others, regardless of whether the information is useful to others or not [Yua90][Kal88].
- On-state-change: processors disseminate their load information whenever their state changes by a certain degree [Xu95][Sal90].

The on-demand load-gathering method minimises the number of communication messages, but postpones the collection of system-wide load information until the time when a load-balancing operation is to be initiated. Its main disadvantage is that it results in an extra delay for load-balancing operations. Typically, this category includes bidding methods, where underloaded processors ask for load information from other processors to choose the best partner in performing load-balancing [Sta84][Zna91]. Conversely, the periodic method allows processors in need of a balancing operation to initiate the operation based on maintained load information without delay. The problem with the periodical scheme is how to set the interval for information gathering. A short interval would incur heavy communication overheads, while a long interval would sacrifice the accuracy of the load information used in the load-balancing algorithm. A protocol to exchange load information periodically, called LIEP (Load Information Exchange Protocol), was presented in [Yua90]. In that work processors were arranged into a logical hypercube with dimension d (the topology diameter). During each period of load information exchange, a processor invoked d rounds of load exchanges in such a way that the processor exchanged its load value with the directly connected processor in the inspected dimension. A way to optimise the load collection process is reported in [Kal88], where the methodology proposed consists of periodically piggy-backing the load information on regular messages. The on-state-change rule is a compromise between the on-demand and periodic schemes. In [Sal90] an on-state-change method is reported in order to include the advantages of both approaches. In this case, a processor sends a status message to its neighbours only if its own load has changed by a certain value and an update interval has elapsed since the last update. This reduces unnecessarily frequent updates.

Nevertheless, how the LM block proceeds to collect and hold load information is not relevant to the Load Balancing Algorithm block. The information required by this block is limited to a set of non-negative numbers that represent the load index of each one of the processors belonging to the underlying domain. These values will be used subsequently to evaluate whether or not it is necessary to perform load movements and how these movements must be performed.

1.3.2 Load Balancing Algorithm block

The Load Balancing Algorithm block uses the load information provided by the previous LM block to decide whether or not it is necessary to balance the load, the source and destination processors of load movements, and the amount of load to be transferred. An LBA algorithm can be divided into two phases: Load Balancing Activation and Work Transfer Calculation.

Load Balancing Activation

This phase uses the load information kept by the LM block to determine the presence of a load imbalance. The criterion used to evaluate whether a processor is balanced or not is known as the trigger condition and is normally associated with a threshold value that can be defined as:

- Fixed threshold: one or several fixed values are used as criteria to determine whether a processor is an overloaded processor or not [Zho88][Mun95].
- Adaptive threshold: the threshold values are evaluated during the execution of the load-balancing algorithm and their values are usually state dependent [Wil93][Xu93][Cor99c].

Usually, the election of fixed thresholds as the trigger condition produces simple strategies where each processor compares its load index with a fixed threshold to determine whether the processor has load excess (overloaded) or not (underloaded). In applications where the total load is expected to remain fairly constant, the load-balancing would be undertaken only in those cases where the load index of some processor falls outside specified upper and lower thresholds [Mun95]. Another method that has been suggested for situations in which the total load is changing is to balance load if the difference between a processor's load and the local load average (i.e. the average load of a processor and its neighbours) exceeds some threshold [Cor99c][Wil93]. Another, similar approach consists of setting the threshold value using the global load average instead of the local load average to determine the trigger condition at each processor [Xu93].

All running processors in the system will evaluate the trigger condition at the start of executing the load-balancing algorithm. However, not all of the running processors will satisfy their trigger condition. The processors whose trigger condition evaluation does not fail will be called active processors. We refer to sender-initiated (SI) approaches when the active processors are the ones with load excess, and we refer to receiver-initiated (RI) schemes [Eag85][Wil93] when the underloaded processors become the active processors by requesting load from their overloaded counterparts.

Work Transfer Calculation

This phase is concerned with devising an appropriate transfer strategy to correct the imbalance previously detected and measured. After determining that load-balancing is required, source and destination processor pairs are determined, as well as how much work should be transferred from one processor to another. The function used to determine the destination of the load can be implemented using one of the following choices:

- Randomly: no information about the domain state of the underlying processor is needed and destination processors are chosen in a random fashion [Zho88].
- Fixed: decisions produced by the active processors are not state dependent. The quantity of load to be transferred from one processor to another is set a priori as a fixed value [Xu97][Cyb89].
- Evaluated: the amount of load to be moved between processors is evaluated at run time following some predetermined function [Wil93][Cor99c].
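To make the activation phase more concrete, the following Python sketch shows how a single processor could evaluate a fixed-threshold and an adaptive (local-load-average) trigger condition. It is only an illustration of the ideas above: the function names, the tolerance parameter and the example values are assumptions, not definitions taken from the strategies cited.

def fixed_threshold_trigger(my_load, upper, lower):
    # Fixed threshold: the processor is active if its load index falls
    # outside the a-priori upper/lower bounds.
    if my_load > upper:
        return "overloaded"
    if my_load < lower:
        return "underloaded"
    return "balanced"

def adaptive_threshold_trigger(my_load, neighbour_loads, tolerance=0):
    # Adaptive threshold: compare the load index against the local load
    # average (the average over the processor and its immediate neighbours).
    local_avg = (my_load + sum(neighbour_loads)) / (1 + len(neighbour_loads))
    if my_load > local_avg + tolerance:
        return "overloaded"      # sender-initiated schemes activate here
    if my_load < local_avg - tolerance:
        return "underloaded"     # receiver-initiated schemes activate here
    return "balanced"

# Example: a processor holding 40 load units whose neighbours hold 2, 3, 2 and 1.
print(adaptive_threshold_trigger(40, [2, 3, 2, 1], tolerance=1))   # -> overloaded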
1.3.3 Migration Manager block

Finally, the last block is the Migration Manager (MM) block. This block receives as input the information generated by the Load Balancing Algorithm block, i.e., the destination processors and the amount of load that should be transferred to them. This block can be divided into two phases, load unit selection and load transfer, in order to differentiate between the way of choosing the individual load units to be transferred and the physical transfer of those elements. Both phases are described below.

Load unit selection

Source processors select the most suitable load units (processes, threads, data, ...) which properly match the load value to be moved. The quality of load-unit selection directly affects the ultimate quality of load-balancing. Sometimes, it may prove to be impossible to choose a group of load units whose associated load index corresponds exactly to the value that needs to be moved. The problem of selecting which load units to move is weakly NP-complete, since it is simply the subset sum problem. Fortunately, approximation algorithms exist which allow the subset sum problem to be solved to a specified non-zero accuracy in polynomial time [Pap94]. Before considering such an algorithm, it is important to note that other concerns may constrain the load transfer options. In particular, we would like to avoid costly transfers of either large numbers of processes or large quantities of data unless absolutely necessary. We would also like to guide load selection to preserve, as well as possible, the existing communication locality in the application. In general, we would like to associate a cost with the transfer of a given set of load units and then find the lowest-cost set for a particular desired transfer (a simple greedy sketch of this selection step is given at the end of this section).

Load transfer

This module should provide the appropriate mechanisms to correctly migrate the selected load units (which can be either processes, data or threads) to any destination processor. Data migration load-balancing systems support dynamic balancing through transparent data redistribution. Data migration mechanisms usually exhibit the lowest complexity amongst the three mechanisms, as they only have to move data. Systems based on thread migration support dynamic load-balancing through thread redistribution in multithreading environments. In such systems, a user application consists of a number of processes assigned to different processors, and each process encapsulates a certain number of threads that can be created/destroyed dynamically. Transparent migration of threads implies the movement of the data and the computation state of a particular thread from one process located on a processor to another process located on a different processor. Process migration load-balancing systems support dynamic load-balancing through transparent process redistribution in parallel and/or distributed computing environments. As in thread migration load-balancing systems, process migration implies the movement of the data and the computation state. However, process migration mechanisms exhibit the highest complexity as they must be aware of a huge amount of information. In the case of a process, the computation state is considerably more complex than in the thread case and, moreover, the application binaries must also be moved. In section 1.7, a more detailed description of the migration mechanisms provided by some load-balancing software packages is reported.

Having described the behaviour of each one of the blocks corresponding to one load-balancing operation, it is important to indicate that this decomposition of the load-balancing process into different modules allows us to experiment, in a plug-and-play fashion, with different implementations of each one of the above blocks, allowing the space of techniques to be more fully and readily explored. It is also possible to customise a load-balancing algorithm for a particular application by replacing some general methods with those specifically designed for a certain class of computations.
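As an illustration of the load-unit selection step described in section 1.3.3, the following Python sketch applies a simple greedy heuristic that prefers units covering much load at low transfer cost, rather than a full subset-sum approximation scheme. The data layout, the names and the cost values are hypothetical and are used only for the example.

def select_load_units(units, target):
    # units: list of (unit_id, load, transfer_cost); pick a subset whose total
    # load approximates `target` without exceeding it, preferring units that
    # cover a lot of load per unit of transfer cost.
    candidates = sorted(units, key=lambda u: (u[2] / u[1], -u[1]))
    selected, remaining = [], target
    for unit_id, load, cost in candidates:
        if load <= remaining:
            selected.append(unit_id)
            remaining -= load
    return selected, target - remaining   # chosen units and load actually covered

# Example: move about 12 load units out of a processor holding five processes.
units = [("p1", 6, 2.0), ("p2", 5, 1.0), ("p3", 4, 4.0), ("p4", 2, 0.5), ("p5", 1, 0.5)]
print(select_load_units(units, 12))       # -> (['p2', 'p4', 'p5', 'p3'], 12)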
1.4 System level

This level analyses which processors are involved in the load-balancing process and how their co-operation is carried out. Hence, the first decision that must be considered is the election of the set of running processors that participates in the load-balancing process. Depending on the number of processors belonging to this set we can distinguish between centralised, totally distributed and semi-distributed approaches. In totally distributed and semi-distributed schemes the load-balancing goal is obtained because load-balancing operations are concurrently executed in more than one processor as time goes by. In particular, when the load-balancing blocks are executed simultaneously in all running processors of the system, we are considering a synchronous implementation of the load-balancing process. Otherwise, the system works in an asynchronous way. The influence of each one of the above characteristics on the load-balancing process will be discussed in the next subsections.

1.4.1 Centralised

Centralised load-balancing strategies are characterised by the use of a dedicated processor for maintaining a global view of the system state and for decision making. This processor is called the central scheduler (or central job dispatcher). A central strategy can improve resource utilisation by having all the information of the processors, and it can achieve optimal performance by using sophisticated algorithms. It can also impose less overhead on the communication network by avoiding transfers of duplicate or inaccurate host state information. Global scheduling can also avoid task thrashing caused by contradictory load-balancing decisions. However, centralised models have low reliability. If the central processor fails, the operation of the whole system can be corrupted. In addition, in large systems with high load fluctuation, the messages with load information can overload the interconnection structure around the central processor.

1.4.2 Totally distributed

An alternative to centralised approaches is a distributed scheme, in which the load-balancing decisions are carried out by all the processors of the system. Load information exchanges are restricted to a local sphere of processors and load-balancing operations are also performed within this sphere or domain. Depending on the existing relationship between different domains, we can distinguish between overlapped domains and non-overlapped domains. In figure 1.4, the processors in red are chosen as the running processors. In this example, we consider the domain of a given processor to be the processors directly connected to it. Therefore, the blue and yellow colours indicate the domains of each one of the running processors. In figure 1.4.a we can observe that there are some common processors between the blue and the yellow domains. Hence, we refer to them as overlapped domains. Otherwise, we refer to non-overlapped domains (figure 1.4.b).

Figure 1.4. Overlapped domains (a) and non-overlapped domains (b).

When the domain includes a given processor and its immediate neighbours we refer to it as a nearest-neighbour approach. Nearest-neighbour load-balancing methods operate on the principle of reducing the load imbalance between each processor and its immediate neighbours, with the aim of diffusing load through the system and converging towards a system-wide balance. Otherwise, load-balancing strategies are categorised as non-nearest-neighbour approaches. Non-nearest-neighbour load-balancing alternatives work in a decentralised form by using local information which is not restricted to immediate neighbours. Under this assumption the scope of the domain is extended to a larger radius that may also include the neighbours' neighbours and so on. Totally distributed approaches, in using local information, do not make such effective balance decisions as centralised approaches but, in contrast, they incur smaller synchronisation overheads.

1.4.3 Partially distributed

For large systems (more than 100 processors), neither centralised nor distributed strategies have proved to be appropriate. Although centralised strategies have the potential of yielding optimal performance, they also have disadvantages that make them suitable only for small or moderate systems [Bau88]. On the other hand, distributed strategies have good scalability, but for large systems it is difficult to achieve a global optimum because the processors have a limited view of the global system state. Partially distributed strategies (also called semi-distributed) were proposed as a trade-off between centralised and fully distributed mechanisms. The main idea is to divide the system into regions and thus split the load-balancing problem into subtasks. These strategies can be viewed at two levels: (i) load-balancing within a region and (ii) load-balancing among all the regions. Different solutions can be devised for each level of the strategy. Each region is usually managed by a single master-processor using a centralised strategy and, at the level of the regions, master-processors may (or may not) exchange aggregated information about their corresponding regions.

1.4.4 Synchronous versus asynchronous

Taking into account the instant at which load-balancing operations are invoked, both totally and partially distributed strategies can be further subdivided into synchronous and asynchronous strategies. From this point of view, we talk about synchronous algorithms when all processors involved in load-balancing (running processors) carry out balancing operations at the same instant of time, so that each processor cannot proceed with normal computation until the load migrations demanded by the current operations have been completed. Otherwise, if each running processor performs load-balancing operations regardless of what the other processors do, we refer to asynchronous approaches. Figure 1.5 shows these behaviours for a four-processor system. Notice that the distinction between synchronous and asynchronous does not apply to centralised schemes due to the existence of only one running processor in the entire system.

Figure 1.5. Synchronous (a) and asynchronous (b) load-balancing operations (computational operations interleaved with load-balancing operations over time).

1.5 Load-balancing algorithm taxonomy

Most of the load-balancing strategies proposed in the literature are focused basically on the development of approaches for solving the Load Balancing Algorithm block mentioned in section 1.3.2. In terms of the algorithmic method used by these strategies we can derive the taxonomy shown in figure 1.6. The main criterion in classifying these algorithms concerns the way in which load distribution is carried out.

Figure 1.6. Load-balancing taxonomy in terms of algorithmic aspects from the processor-level point of view (Randomised, Physical Optimisation, Diffusion, Dimension Exchange, and the Single Direction schemes: Gradient Model and Minimum-direction).

In stochastic methods, the load is redistributed in some randomised fashion, subject to the objective of load-balancing. Stochastic load-balancing methods attempt to drive the system into an equilibrium state with high probability. Two different approaches can be found: randomised allocation and physical optimisation. Randomised allocation methods are very simple methods that do not use information about potential destination processors. A neighbour processor is selected at random and the process is transferred to that processor. No exchange of state information among the processors is required in deciding where to transfer a load unit. On the other hand, stochastic algorithms where physical optimisation is applied are based on analogies with physical systems. They map the load-balancing problem onto some physical system, and then solve the problem using simulation or techniques from theoretical or experimental physics. Physical optimisation algorithms offer slightly more variety in the control of the randomness in the redistribution of load units. This control mechanism makes the process of load-balancing less susceptible to being trapped in local optima and therefore these stochastic algorithms are superior to other randomised approaches, which could produce locally but not globally optimal results.

Deterministic methods proceed according to certain predefined strategies. These solutions are usually performed in an iterative form, where the execution of the load-balancing algorithm is repeated more than once in a given processor before restarting the execution of the user application [Xu94]. Deterministic methods can be classified into two categories according to the load distribution within the domain: diffusion and single-direction. Firstly, in diffusion methods the load excess of an overloaded processor is simultaneously distributed amongst all processors of the underlying domain following an iteration of the load-balancing algorithm. In contrast, in single-direction methods only one processor of the underlying domain can be chosen as the destination processor after executing one iteration of the load-balancing algorithm. Single-direction methods are further classified according to how the destination processor is selected. When the direction of the closest lightly loaded processor is used as the selection criterion we refer to the Gradient Model, and when the chosen processor is the least loaded processor of the underlying domain we talk about Minimum-direction schemes. Techniques where all the processors of the domain are considered one by one at each load-balancing iteration are called Dimension Exchange strategies.

We will now describe some of the most relevant strategies that appear in the literature.

1.6 Dynamic load-balancing strategies

Following the taxonomy described in the previous paragraph, and bearing in mind the design characteristics outlined in section 1.4, we have constructed table 1.1, which draws together all published strategies, as far as we are aware. In particular, in the case of the processor level, the algorithmic aspects seen in section 1.5 are used. In each box of the table, the mnemonic for the strategy and its reference are given. Strategies indicated with a continuous line are not feasible or have not been proposed as far as the author knows. We now describe some of the strategies indicated in the table, starting with those classified in the randomised category.
Randomised: Reservation [Eag85]; RANDOM [Zho88]; THRHLD [Zho88]; LOWEST [Zho88]; MYPE [Yua90]
Physical Optimisation: GCTA [Bau95]; CBLB [Bau95]; S.-A. [Fox89]; CSAM [Cha95]; MFAM [Cha95]
Diffusion: Multi-level Diffusion [Hor93]; NA [Sal90]; DDE [Wu96]; AN-n [Cor99c]; Diffusion [Cyb89][Boi90]; SID, RID [Wil93]; ATD [Wat98]; AN [Cor99c]; [Son94][Die99][Hu99]; Membership-exc. [Eva94]; Joint-membership [Eva94]
Dimension Exchange: DE [Cyb89]; GDE [Xu97]; DN [Cor99c]; EDN [Cor99c]; Graph-Colouring [Hos90]
Gradient Model: GM [Lin87]; B [Bar90]; X-GM [Lül91]; EG [Mun95]
Minimum-direction: Central [Zho88]; LBC [Lin92]; GLRM [Xu93]; GMLM [Xu93]; LLRM [Xu93]; LMLM [Xu93]; CWN [Kal88]; ACWN [Shu89]; Sphere-like [Ahm91]; Hierch. Sched. [Dan97]

Table 1.1. Load-balancing techniques classified with respect to the system-level view (centralised; totally distributed, either non-nearest-neighbours or nearest-neighbours; semi-distributed) and the processor-level view, grouped here by algorithmic category.
1.6.1 Randomised

As we have seen, in randomised load-balancing algorithms the destination processors for load transfer are chosen in a random fashion. Therefore, these kinds of algorithms use less system state information than deterministic algorithms. These algorithms use only local load information to make movement decisions. In such cases a threshold value is preset as a criterion in determining whether a processor must send out part of its load or not. Several randomised algorithms based on a threshold value (T) as a trigger condition (RANDOM, THRHLD, LOWEST) are reported in [Zho88]. In the RANDOM algorithm, when a processor detects that its local load is bigger than T, a processor is randomly selected as the destination of load movements. Since all processors are able to make load movement decisions, this algorithm is classified as a totally distributed and non-nearest-neighbours approach. The THRHLD and LOWEST algorithms are similar to the RANDOM algorithm in the sense that they also select the destination processor in a random way. However, a number of randomly selected processors, up to a limit L_p, are inspected instead of selecting only one candidate. In the THRHLD algorithm, extra load is transferred to the first processor whose load is below the threshold value. In contrast, in the LOWEST algorithm a fixed number of processors (L_p) are polled and the most lightly loaded processor is selected as the destination processor. A similar scheme is used in the MYPE algorithm [Yua90]. Instead of using only one threshold value, the MYPE algorithm uses two threshold values, N_u and N_l, to determine the state of the underlying processor. A processor is overloaded when its load is higher than N_u. Underloaded processors are the ones whose load is lower than N_l. Otherwise, they are referred to as neuter processors. An overloaded processor randomly selects a number of processors (up to a preset limit) whose load indexes are lower than N_l as potential receivers. Then a polling scheme is used to determine the final destination of the load. The load excess will be sent to the first processor whose current load is lower than N_l.

1.6.2 Physical Optimisation

The most common physical optimisation algorithm for the load-balancing problem is simulated annealing. Simulated annealing is a general and powerful technique for combinatorial optimisation problems borrowed from crystal annealing in statistical physics. Since simulated annealing is very expensive and one of the requirements for dynamic load-balancing is yielding the result in limited time, two hybrid methods combining statistical and deterministic approaches are proposed in [Cha95]: the Clustering Simulated Annealing Model (CSAM) and the Mean Field Annealing Model (MFAM). They were proposed to allocate or reallocate tasks at run time, so that every processor in the system had a nearly equal execution load and interprocessor communication was minimised. In these methods, the load-balancing is activated on a specific processor called the local balancer. The local balancer repeatedly activates the task allocation algorithm among a subset of processors. Each local balancer makes task allocation decisions for a group of four to nine processors. Groups are overlapped with each other, allowing tasks to be transferred through the whole system. The CSAM combines a heuristic clustering algorithm (HCA) and the simulated annealing technique. The HCA generates clusters, where each cluster contains tasks which involve high intertask communication. Various task assignments (called system configurations) are generated from the HCA to provide clusters of various sizes that are suitable for the annealing process. During the annealing process, system configurations are updated by reassigning a cluster of tasks from one processor to another. The procedure of simulated annealing is used to either accept or reject a new configuration. The MFAM (Mean Field Annealing Model) was derived from modelling the distributed system as a large physical system in which load imbalance and communication costs cause the system to be in a state of non-equilibrium. The imbalance is reduced through a dynamic equation whose solution reduces the system imbalance. The dynamics of the MFAM are derived from the Gibbs distribution. Initially all tasks have the same probability of being allocated to each processor. Several iterations of an annealing algorithm are carried out so that the system is brought to a situation where each process is assigned to only one processor. A major advantage of the MFAM is that the computation of the annealing algorithm can be implemented in parallel. A similar load-balancing algorithm that uses the simulated annealing technique is reported in [Fox89].

In addition to the simulated annealing technique, genetic algorithms constitute another optimisation method that has borrowed ideas from natural science and has also been adapted to dynamic load-balancing. Examples of genetic load-balancing algorithms can be found in [Bau95]. The first algorithm presented in the paper is the Genetic Central Task Assigner (GCTA). It uses a genetic algorithm to perform the entire load-balancing action. The second, the Classifier-Based Load Balancer (CBLB), augments an existing load-balancing algorithm using a simple classifier system to tune the parameters of the algorithm.
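The accept/reject step of the annealing process mentioned above can be illustrated with the standard Metropolis criterion used in simulated annealing. The cost function, the variable names and the example values below are generic assumptions made for this illustration; they are not the concrete formulation used by CSAM or MFAM in [Cha95].

import math
import random

def metropolis_accept(new_cost, current_cost, temperature):
    # Always accept improvements; accept a worse configuration with a
    # probability that shrinks as the temperature is lowered.
    if new_cost <= current_cost:
        return True
    return random.random() < math.exp((current_cost - new_cost) / temperature)

def imbalance_cost(loads):
    # Toy cost of a configuration: squared deviation of every processor's
    # load from the mean load.
    mean = sum(loads) / len(loads)
    return sum((l - mean) ** 2 for l in loads)

# Example: reassigning a cluster of tasks turns the loads (9, 5) into (7, 7);
# the cost decreases, so the new configuration is always accepted.
print(metropolis_accept(imbalance_cost([7, 7]), imbalance_cost([9, 5]), temperature=2.0))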
1.6.3 Diffusion

One simple method for dynamic load-balancing is for each overloaded processor to transfer a portion of its load to its underloaded neighbours with the aim of achieving a local load balance. Such methods correspond closely to simple iterative methods for the solution of diffusion problems; indeed, the surplus load can be interpreted as diffusing through the processors towards a steady balanced state. Diffusion algorithms assume that a processor is able to send and receive messages to/from all its neighbours simultaneously. Corradi et al. propose a more precise definition of diffusive load-balancing strategies in [Cor99c]. In particular, they define an LB strategy as diffusive when:

- it is based on replicated load-balancing operations, each with the same behaviour and capable of autonomous activity;
- the LB goal is locally pursued: the scope of the action for each running processor is bound to a local area of the system (domain). Each running processor tries to balance the load of its domain as if it were the whole system, based only on the load information in its domain; and
- each running processor's domain overlaps with the domain controlled by at least one other running processor, and the union of these domains achieves full coverage of the whole system.

Cybenko describes in [Cyb89] a simple diffusion algorithm where a processor i compares its load with all its immediate neighbours in order to determine which neighbouring processors have a load value smaller than the underlying processor's load. Such processors will be considered underloaded neighbour processors. Once the underloaded neighbours are determined, the underlying processor will evaluate the load difference between itself and each one of its neighbours. Then, a fixed portion of the corresponding load difference is sent to each one of the underloaded neighbours.

This strategy, as well as other strategies from the literature based on it [Ber89][Die99][Hu99], was originally conceived under the assumption that load can be divided into arbitrary fractions, i.e., the load was treated as a non-negative real quantity. However, to cover medium- and large-grain parallelism, which is more realistic and more common in practical parallel computing environments, we must treat the loads of the processors as non-negative integers, as was carried out in [Cor99c][Son94][Wat98][Wil93].

A relevant strategy in this area is the SID (Sender Initiated Diffusion) algorithm [Wil93]. In this algorithm, each processor i has a load value equal to w_i and it evaluates its local load average (w̄_i) to be used as the trigger condition. If the load of processor i is bigger than the load average of its domain, then it is an overloaded processor. Otherwise, the processor is referred to as underloaded. An overloaded processor distributes its excess load among its underloaded neighbours. A neighbour processor j of the underlying processor i is a neighbour with deficit if its load is smaller than the load average of the underlying domain (w_j < w̄_i). Then, the surplus of a given processor i is distributed among its deficient neighbours in a proportional way. This strategy is classified as a sender-initiated scheme because the overloaded processors are the active processors. The same authors described a similar strategy called RID (Receiver Initiated Diffusion), which is based on the same idea as the SID algorithm but uses a receiver-initiated scheme to determine the processors which take load-movement decisions. An example of the behaviour of this algorithm is shown in figure 1.7. The number inside each node represents the load of the corresponding processor. Processor 6 has a load value equal to 40 (w_6 = 40) and the load average within its domain is 12 (w̄_6 = 12). Therefore, the load excess of processor 6 is equal to 28 units of load. After applying the SID algorithm, processor 6 decides to move 1, 11, 5 and 7 load units to processors 2, 5, 7 and 10 respectively. These load movements are denoted by the numbers beside the arrows.

Figure 1.7. An example of the execution of one iteration of the SID strategy in processor 6 (w_6 = 40, w̄_6 = 12, excess w_6 - w̄_6 = 28).

The reader can find more examples of deterministic diffusion load-balancing strategies in [Cor99c][Eva94][Hor93][Sal90][Wat98] and [Wu96].
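The proportional distribution rule used by SID can be sketched as follows. This is a minimal, sequential illustration of one iteration on a single processor with integer loads; the variable names, the naive rounding and the neighbour loads in the example are assumptions, and the exact formulation of [Wil93] (and the one analysed later in this thesis) differs in its details.

def sid_iteration(my_load, neighbour_loads):
    # Return {neighbour: units_to_send} for an overloaded processor.
    domain = [my_load] + list(neighbour_loads.values())
    local_avg = sum(domain) / len(domain)
    excess = my_load - local_avg
    if excess <= 0:
        return {}                      # not overloaded: nothing to send
    # Deficits of the underloaded neighbours within the domain.
    deficits = {j: local_avg - w for j, w in neighbour_loads.items() if w < local_avg}
    total_deficit = sum(deficits.values())
    if total_deficit == 0:
        return {}
    # Distribute the excess proportionally to each neighbour's deficit.
    return {j: int(excess * d / total_deficit) for j, d in deficits.items()}

# Example in the spirit of figure 1.7: processor 6 holds 40 units and its four
# neighbours hold hypothetical loads giving a domain average of 12.
print(sid_iteration(40, {2: 10, 5: 0, 7: 6, 10: 4}))   # -> {2: 2, 5: 12, 7: 6, 10: 8}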
1.6.4 Dimension Exchange

This load-balancing method was initially studied for hypercube topologies, where a processor's neighbours are inspected by following each dimension of the hypercube; this is the origin of the dimension exchange (DE) name. Originally, in DE methods, the processors of a k-dimensional hypercube pair up with their neighbours in each dimension and exchange half the difference in their respective loads [Cyb89]. The load value of the underlying processor is updated at each neighbour inspection and the new value is considered for the next revision. Going through all the neighbours once constitutes a "sweep" of the load-balancing algorithm. Such behaviour is shown in figure 1.8.

Figure 1.8. Load movements for DE methods in a 3-dimensional hypercube through three iterations of the load-balancing algorithm (one sweep) in a running processor: first, second and third dimensions.

Xu et al. present in [Xu97] a generalisation of this technique for arbitrary topologies, which they call the GDE (Generalised Dimension Exchange) strategy. For arbitrary topologies the technique of edge colouring of undirected graphs (where each node of the graph identifies one processor of the system and the edges are the links) is used to determine the number of dimensions and the dimension associated with each link. The links between neighbouring processors are minimally coloured so that no processor has two links of the same colour [Hos90]. A "dimension" is then defined as the collection of all edges of the same colour. At each iteration, one particular colour/dimension is considered, and only processors on edges with this colour execute the dimension exchange procedure. The portion of load exchanged is a fixed value and is called the exchange parameter. This process is repeated until a balanced state is reached. The DE algorithm uses the same value of the exchange parameter for all topologies, while the GDE algorithm uses different values depending on the underlying topology.

The DN (Direct Neighbour) algorithm is a strategy based on the dimension exchange philosophy which uses a discrete load model [Cor99c]. This strategy allows load exchange only between two processors directly connected by a physical link. A balancing action within a domain strives to balance the load of the two processors involved. In order to assure the convergence of this method, the running processors must synchronise amongst themselves in such a way that the running processors active at any given moment have non-overlapping domains. The same authors describe an extension to this algorithm, the EDN (Extended Direct Neighbour) algorithm, which works as a non-nearest-neighbour strategy. This strategy allows a dynamic domain definition by moving load between direct neighbours, overcoming the neighbourhood limit through underloaded processors. Load reallocation stops when there are no more useful movements, i.e., a processor is reached whose load is minimal in its neighbourhood.
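A single sweep of the dimension exchange procedure on a hypercube can be sketched as follows, using the classical exchange parameter of one half (each pair simply averages its loads). The sketch is sequential rather than message-passing, and the function name and data layout are assumptions made for the illustration; GDE would use a topology-dependent exchange parameter instead.

def de_sweep(loads):
    # loads: list of 2**k processor loads, indexed by node identifier.
    n = len(loads)
    k = n.bit_length() - 1            # hypercube dimension, assuming n == 2**k
    loads = list(loads)
    for dim in range(k):              # one dimension is inspected per iteration
        for node in range(n):
            partner = node ^ (1 << dim)
            if node < partner:        # handle each neighbouring pair once
                avg = (loads[node] + loads[partner]) / 2
                loads[node] = loads[partner] = avg
    return loads

# Example: an 8-processor hypercube with all the load initially on node 0
# is perfectly balanced after one sweep.
print(de_sweep([80, 0, 0, 0, 0, 0, 0, 0]))   # -> [10.0, 10.0, ..., 10.0]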
In the Gradient Model algorithm described in [Lin87] two-tiered load-balancing steps are employed. The first step is to let each individual processor determine its own loading condition: light, moderate or heavy. The second step consists of establishing a system-wide gradient surface to facilitate load migration. The gradient surface is represented by the aggregate of all proximities, where the proximity of a processor i is the minimum distance between that processor and a lightly loaded processor in the system. The gradient surface is approximated by a distributed measurement called the pressure surface, and the excess load from heavily loaded processors is then routed to the neighbour with the least pressure (proximity). The resulting effect is a form of relaxation where load migrating through the system is guided by the proximity gradient and gravitates towards underloaded processors.

Figure 1.9 shows an example of a gradient surface in a 4x4 torus network where there are two lightly loaded processors (processors 2 and 10). The value between brackets represents the pressure surface of each processor. Let us suppose that processor 12 is an overloaded processor (yellow colour). By following the proximities depicted in the figure, the load excess of processor 12 will be guided through the red links in order to reach one of the two lightly loaded processors within the system (in this case processor 6).

Figure 1.9 The GM scheme on a 4x4 torus.
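As a rough sketch of how these proximities could be obtained and used, the following Python fragment (our own simplified approximation, not the distributed procedure of [Lin87]; the names pressure_surface and next_hop are invented) computes the distance from every processor to the nearest lightly loaded one with a breadth-first search and then selects, for an overloaded processor, the neighbour with the least pressure:

    from collections import deque

    def pressure_surface(neighbours, lightly_loaded):
        """Distance from every processor to the nearest lightly loaded one.

        neighbours: dict {processor id: list of neighbour ids}.
        lightly_loaded: set of processors whose load is below the 'light' level.
        """
        pressure = {p: float('inf') for p in neighbours}
        queue = deque()
        for p in lightly_loaded:            # lightly loaded processors have proximity 0
            pressure[p] = 0
            queue.append(p)
        while queue:                        # multi-source BFS over the network graph
            p = queue.popleft()
            for q in neighbours[p]:
                if pressure[q] > pressure[p] + 1:
                    pressure[q] = pressure[p] + 1
                    queue.append(q)
        return pressure

    def next_hop(p, neighbours, pressure):
        """Neighbour of minimum pressure: where an overloaded p sends its excess."""
        return min(neighbours[p], key=lambda q: pressure[q])

    ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
    surface = pressure_surface(ring, lightly_loaded={2})
    print(surface)                   # {0: 2, 1: 1, 2: 0, 3: 1}
    print(next_hop(0, ring, surface))  # 1 (ties broken arbitrarily)

In the real algorithm each processor estimates its pressure from its neighbours' values rather than from a global traversal; the sketch only illustrates the quantity being approximated.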
This basic gradient model has serious drawbacks. First, when a large portion of moderately loaded processors suddenly turns lightly loaded, the result is considerable commotion. The authors of [Lül91] proposed an improved version of the GM algorithm to remedy this problem, the Extended Gradient Model (X-GM). This method adds a suction surface which is based on the (estimated) proximities of non-heavily-loaded processors to heavily-loaded processors. This information causes load migration from heavily-loaded processors to nearby local minima, which may be moderately-loaded processors. Since the system load changes dynamically, the proximity information kept by a processor may be considerably out-of-date. And finally, if there are only a few lightly-loaded processors in the system, more than one overloaded processor may emit some load toward the same underloaded processor. This "overflow" effect has the potential to transform underloaded processors into overloaded ones. The authors of [Mun95] propose another extension to the GM scheme, the EG (Extended Gradient) mechanism, to overcome the problems mentioned. The EG mechanism is a two-phase strategy, where an overloaded processor confirms that a processor is still underloaded before transferring load to it, and the underloaded processor is then reserved while the load is being transferred.

1.6.6 Minimum-direction

The minimum-direction scheme is an alternative to dimension exchange methods and the gradient model within the single-direction category of deterministic load-balancing algorithms. In the strategies based on this scheme, the running processor chooses the least loaded processor within its domain as the only destination of a load movement after executing the load-balancing algorithm once. Notice that, depending on the scope of the domain, the least loaded processor within the underlying domain may coincide with the least loaded processor in the whole system. Such a match is typically produced in centralised load-balancing systems where the running processors have access to the load of the entire system.

The LBC (Load-Balancing with a Central job dispatcher) strategy reported in [Lin92] makes load-balancing decisions based on global state information which is maintained by a central job dispatcher. Each processor sends a message to the central site whenever its state changes. Upon receiving a state-change message, the central dispatcher updates the load value kept in its memory accordingly. When a processor becomes underloaded, the state-change message is also used as a load request message. In response to this load request, the dispatcher consults the table where load values are kept, and the most loaded processor is chosen as the load source.
Then this processor is notified to transfer some load to the requesting processor. The LBC strategy is a centralised algorithm because the central site guides all load movements. This strategy is also classified as a receiver-initiated method in the sense that the underloaded processors are the ones which start the load-balancing operations.

The CENTRAL algorithm described in [Zho88] is a centralised sender-initiated algorithm that works in a complementary form to the LBC strategy. When a processor detects that it is overloaded, it notifies the load information centre (LIC) of this fact by sending a message with its current load value. The LIC selects a processor with the lowest load value and informs the originating processor to send the extra load to the selected processor.

GLRM (Global Least Recently Migrated) and GMLM (Global Minimum Load Maintained) are two totally distributed non-nearest-neighbour strategies where the domain of each processor includes all processors in the system [Xu93]. Both the GLRM and GMLM strategies use the global load average in the system as a threshold to determine whether a processor is overloaded or not. This threshold is computed at each processor using the load values received from the information collector (IC) processor. The IC processor has the responsibility of collecting the load of the entire system and broadcasting it to all processors. These actions are performed on a time-window basis. Once a processor is considered to be overloaded, a destination processor must be chosen. GLRM selects the destination processor by applying the least recently migrated discipline within a time window, and the GMLM strategy determines the destination processor as the processor with the minimum load value in the current time window. If the domain of each processor is restricted to the immediate neighbours, two nearest-neighbour strategies are easily derived from the two previous ones: LLRM (Local Least Recently Migrated) and LMLM (Local Minimum Load Maintained).
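The selection rule shared by these minimum-direction strategies can be summarised in a brief sketch (Python, with invented names; a simplification in the spirit of GMLM/LMLM rather than their exact specification):

    def minimum_direction_destination(my_load, domain_loads, global_avg):
        """Minimum-direction selection rule, in the spirit of GMLM.

        domain_loads: dict {processor id: load} for the processors in the domain
                      (the whole system for GMLM, immediate neighbours for LMLM).
        global_avg:   system load average broadcast by the information collector (IC).
        Returns the chosen destination, or None if the processor is not overloaded.
        """
        if my_load <= global_avg:           # global average used as overload threshold
            return None
        # single destination: the least loaded processor within the domain
        return min(domain_loads, key=domain_loads.get)

    print(minimum_direction_destination(9, {1: 4, 2: 7, 3: 2}, global_avg=5.5))  # -> 3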
Another algorithm based on the minimum-direction scheme is the CWN (Contracting Within Neighbourhood) strategy [Kal88]. CWN is a totally distributed strategy where each processor only uses load information about its immediate neighbours. A processor migrates its excess load to the neighbour with the least load. A processor that receives some load keeps it for execution if it is the most lightly loaded when compared with all its neighbours; otherwise, it forwards the load to its least loaded neighbour. This scheme has two parameters: the radius, i.e. the maximum distance a load unit is allowed to travel, and the horizon, i.e. the minimum distance a load unit is required to travel. If we allow these parameters to be tuneable at run-time, the algorithm becomes ACWN (Adaptive Contracting Within a Neighbourhood) [Shu89].

In the semi-distributed strategy proposed by Ahmad [Ahm91], called Sphere-like, the system is divided into symmetric regions called 'spheres'. Considering the load-balancing method applied among these spheres, this strategy falls into the minimum-direction category. The strategy has a two-level load-balancing scheme. At the first level the load is balanced among different spheres using global system information. At the second level, load balancing is carried out within individual spheres. Each sphere has a processor that acts as a centralised controller for its own sphere. Since this strategy is primarily designed for massively parallel systems, it also addresses the problem of creating the spheres. The state information maintained by each centralised controller is the cumulative load of its sphere. In addition, a linked list is maintained in non-decreasing order that sorts the processors of the sphere according to their loads. The scheduling algorithm first considers the load of the least loaded processor in the local sphere and, if it is less than or equal to a chosen threshold1, the task is scheduled on that processor. Otherwise the scheduler checks the cumulative load of other spheres. If the load of the least loaded sphere is less than threshold2, the task is sent to that sphere, where it is executed without further migration to any other sphere. If there is more than one such sphere, one is selected randomly. In the case that there is no such sphere, the task is scheduled on the least loaded processor of the local sphere. The parameters threshold1 and threshold2 are adjustable depending upon system load and network characteristics.

More dynamic load-balancing algorithms included within the minimum-direction category are reported in [Dan97].
1.7 Software facilities for supporting dynamic load-balancing

Software systems that support some kind of adaptive parallel application execution are basically classified into two main classes: the system-level class and the user-level class. In the system-level class, load-balancing support is implemented at the operating system level [Sin97]. In contrast, the user-level class includes all the systems where the load-balancing support is not integrated into the operating system; they are built on top of existing operating systems and communication environments. In that sense, load-balancing systems supported at the system level provide more transparency and less interference (migration can be carried out more efficiently, for instance) compared to load-balancing systems supported at the user level. However, they are not as portable as the user-level implementations. In the remainder of the section we focus mainly on the second class of systems. Readers interested in load-balancing systems based on system-level support can refer to the description of systems such as Sprite [Dou91], V System [The85] and Mach [Mil93].

Load-Balancing Systems (LBS) implemented at the user level can be further subdivided into data-based, thread-based or process-based systems, according to the load unit that is migrated, as was mentioned previously in sections 1.3.1 and 1.3.3. Therefore, we will refer to load unit migration as a general term that does not differentiate whether migration involves data, threads or processes. Moreover, data-based and thread-based LBS are usually based on a distributed shared memory paradigm. As a consequence, some problems addressed in process-based LBS (process communication, for instance) do not always appear in data-based and thread-based LBS. We will focus below mainly on the design issues related to process-based LBS, although the reader should bear in mind that many issues are also applicable to the other two classes of LBS.
1.7.1 Architecture of Process-based LBS

Despite the particular characteristics of different LBS, a similar system architecture is shared by most of them. This architecture is based on a set of layers where upper-layer components interact with lower-layer components through library functions (see figure 1.10).

Figure 1.10 Layered structure of a Load Balancing System (Application, Load Balancing, Communication Environment, Operating System and Hardware Platform layers, connected through the LB, CE and OS libraries).

A parallel application is made up of a set of processes that execute in a distributed environment and co-operate/synchronise by means of message passing. For that purpose the Communication Environment (CE) offers a set of services that communicate information between tasks in a transparent way and conceal the particular OS network characteristics. Similarly, the Load Balancing (LB) layer takes advantage of the services offered by the Communication layer. PVM and MPI constitute common examples of such a communication layer and have been used in many existing LBS.

The LB layer is responsible for carrying out all the actions involved in process migration. In that sense, this layer includes all the mechanisms that implement the object migration. Moreover, this layer should also include the policies mentioned in section 1.6 that manage the resources and are responsible for maintaining load balance. Interaction between the user application and the LB layer can be carried out by invoking certain functions of the LB layer directly.
Alternatively, the programming language of the application may be augmented with new constructs and a code pre-processor will transform those constructs into LB functions. In both cases, the LB functions will be linked to the user application at a later stage. In contrast to LBS where the interaction between the application and the LB layer is accomplished through a linked library, there are LBS where such interaction is carried out by means of messages using the services of the Communication layer. In this case, the user application treats the LB system as another application task within the communication domain of the message-passing environment.

1.7.2 Design issues of a process migration mechanism

A major requirement for LBS is that the migration should not affect the correctness of the application. Execution of the application should proceed as if the migration had never taken place, the migration being "transparent". Such transparency can be ensured if the state of a process on the source processor is reconstructed on the target processor. The process migration mechanism can be roughly divided into four main stages: migration initiation, state capture, state transfer and process restart. Below, we present the most important issues that a process migration environment must address in practice.

Migration Initiation

This stage triggers the decision of starting the migration of a given process, i.e. it decides when migration occurs. Moreover, it should indicate which process should migrate and where to. In principle, this information depends on the decisions adopted by the load-balancing strategy running in the system. The scope of the migration event causes the migration mechanism to be synchronous or asynchronous. Asynchronous migration allows a process to migrate independently of what the other processes in the application are doing. Synchronous migration implies that all the processes are first executing and agree to enter into a migration phase where the selected processes will finally be relocated.

State Capture

This stage implies capturing the process' state in the source processor. In this context, the process' state includes:
(i) the processor state (contents of the machine registers, program counter, program status word, etc.), (ii) the state held by the process itself (its text, static and dynamic data, and stack segments), (iii) the state held by the OS for the process (blocked and pending signals, open files, socket connections, page table entries, controlling terminals, process relationship information, etc.), and (iv) the OS state held by the process (file descriptors, process identifiers, host name and time). The previous information, known to the process, is only valid in the context of the local execution environment (local operating system and host). Furthermore, a process has a state as viewed from the perspective of the communication layer. In this regard, a process' state includes its communication identifiers and the messages sent to/from that process.

Capturing the process state can be non-restricted or restricted. Non-restricted capture of the process' state means that the state can be saved at any moment. The LBS must block the process (for instance, using the Unix signal mechanism) and capture its state. Restricted capture means that the state will be saved only when the process executes a special piece of code that has been inserted into the application. This implies that all points where a process may be suspended for migration must be known at compilation time. The special code usually consists of a call to an LB service that passes control to the LB layer. The LB layer then decides whether the process should be suspended for migration or whether the call is serviced and control returned to the process. In the latter case, the service invocation also serves to capture and preserve the process state information.

The process state can be saved on disk, creating a checkpoint of the process. Readers interested in this area should refer to [Tan95] for a detailed description of the checkpointing mechanism applied to a single user process in the Condor system. The checkpoint mechanism has the advantage of being minimally obtrusive and providing fault-tolerance. However, it requires significant disk space consumption.
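The restricted-capture scheme described above can be pictured with a small illustrative fragment (Python is used here only for brevity; lb_migration_point and MIGRATION_REQUESTED are hypothetical names, and real LBS insert equivalent calls into C code via a pre-processor or a library):

    import pickle

    MIGRATION_REQUESTED = False      # in a real LBS this is set by the LB layer's policy

    def lb_migration_point(state, checkpoint_file="proc.ckpt"):
        """Hypothetical LB-layer call inserted at known points (restricted capture)."""
        if MIGRATION_REQUESTED:
            with open(checkpoint_file, "wb") as f:
                pickle.dump(state, f)            # save the application-level state
            raise SystemExit("suspended for migration")
        # otherwise the call simply returns and the application continues

    # application code with an inserted migration point at every iteration
    state = {"iteration": 0, "partial_result": 0.0}
    for state["iteration"] in range(100):
        state["partial_result"] += 1.0           # the real computation would go here
        lb_migration_point(state)                # known suspension point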
Other migration mechanisms do not store the process state on disk. They create a skeleton process on the target processor to receive the migrating process, and the process state is then sent by the migrating process directly to the skeleton process.

State Transfer

For LBS implemented at user level, the entire virtual address space of a process is usually transferred at this stage [Cas95]. There are different mechanisms to transfer this information and they depend on the method that was used to capture the process state. When the checkpoint of the process has been stored on disk (indirect checkpointing), the state transfer is carried out by accessing the checkpoint files from the target processor. The use of a global file system (NFS, for instance) is the simplest solution in this case. Otherwise, checkpoint files must be transferred from one host to another through the network. When a skeleton process mechanism is used (direct checkpointing), this stage of the migration protocol implies that the skeleton process was successfully started at the target processor (using the same executable file automatically "migrates" the text of the process). Then the process must detach from the local processor and its state, which was previously preserved (data, stack and processor context), must be transferred to the target processor through a socket [Cas95][Ove96].

Process Restart

The process restart implies that a new process is created in the target host and its data and stack information is assimilated according to the information obtained from the state of the process in the source host. The new process in the target host reads data and stack either from disk or from a socket, depending on the mechanism used to capture the process state. Once the new process has assimilated all the information needed as its own process state, the process in the source host is removed.

Before the new process can re-participate as part of the application, it first has to re-enrol itself with the local server of the Communication layer. This implies that
some actions are carried out to ensure the correct delivery of messages. It must be ensured that all processes send all their future messages destined for the migrated process to the new destination, and that no in-transit messages are dropped during migration. These actions must also solve problems related to the process identifiers within the communication environment and to message sequencing. Different mechanisms have been proposed to ensure the correct delivery and sequencing of in-transit messages. They can be roughly classified into three main categories:

Message forwarding: a shadow process can be created in the processor where the process was originally created. This shadow process will be responsible for forwarding all the messages directed to the process in its new location. When a message arrives at a processor and finds that the destination process is not there, the message is forwarded to the new location [Cas95].

Message restriction: this technique ensures that a process is not communicating with another process at the moment of migration. That imposes the notion of critical sections in which all interprocess communication is embedded. Migration can only take place outside a critical section [Ove96].

Message flushing: in this technique, a protocol is used to ensure that all pending messages have been received. Therefore, the network is drained once all the pending messages have been received [Pru95].

Prior to restarting a migrated process, it must be connected again to the Communication Environment in order to establish a new communication identifier. The new identifier must be broadcast to all the hosts so that a mapping of old identifiers to new identifiers for each process is maintained in a local table. All future communications will go through this mapping table before they are passed into and out of the Communication Environment.
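The identifier-remapping idea can be sketched as follows (a schematic Python fragment with invented names; it does not correspond to the API of PVM, MPI or any particular LBS):

    class IdMap:
        """Local table mapping old communication identifiers to current ones."""
        def __init__(self):
            self.table = {}

        def update(self, old_id, new_id):
            # called when the broadcast announcing a migrated process is received
            self.table[old_id] = new_id

        def resolve(self, proc_id):
            # follow the chain in case a process has migrated more than once
            while proc_id in self.table:
                proc_id = self.table[proc_id]
            return proc_id

    idmap = IdMap()
    idmap.update("task-7@hostA", "task-7@hostB")   # hypothetical identifiers
    print(idmap.resolve("task-7@hostA"))           # messages now go to task-7@hostB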
1.7.3 Limitations of process migration mechanisms

In practice, there are additional problems that must also be addressed in the implementation of the LBS. These problems are related to the management of file I/O (which includes application binaries, application data files and checkpoint files), the management of terminal I/O and GUIs, and the management of cross-application and inter-application communication.

Access to the same set of files can be carried out via a networked file system (NFS, for example). When there is no common file system, remote access is accomplished by maintaining a "shadow" process on the machine where the task was initially running. The "shadow" process acts as an agent for file access by the migrated process. Similar solutions can be devised for accessing terminal I/O.

However, some limitations are imposed by existing LBS. For instance, processes which execute fork() or exec(), or which communicate with other processes via signals, sockets or pipes, are not suitable for migration because existing LBS cannot save and restore sufficient information on the state of these processes. This limitation is reasonable according to the layered architecture of figure 1.10. User applications are restricted to using the facilities provided by the Communication layer or the LB layer to establish communication between processes or to create/destroy processes.

Additionally, process migration is normally restricted to machines with homogeneous architectures, i.e., with the same instruction sets and data formats. However, there are some systems that allow migration of sequential processes between heterogeneous machines. For instance, Tui [Smi97] is a migration system that is able to translate the memory image of a program (written in ANSI C) between four common architectures (MC68000, SPARC, 486 and PowerPC). Another example is the Porch compiler [Str98], which enables machine-independent checkpoints by automatic generation of checkpointing and recovery code.

1.7.4 Examples of existing packages for supporting dynamic load balancing

In this subsection, we briefly review some of the most significant software packages that have been developed, or are in an early stage of development, in the framework of dynamic load balancing and load unit migration. These tools usually fall into two main classes [Bak96]:
Job Management Software: these software tools are designed to manage application jobs submitted to parallel systems or workstation clusters. Most of them might be regarded as direct descendants of traditional batch and queuing systems [Bak96][Jon97]. Process-based LBS usually belong to this group.

Distributed Computing Environments: these software tools are used as an application environment, similar in many ways to a distributed shared memory system. The application programmer is usually provided with a set of development libraries added to a standard language that allows the development of a distributed application to be run on the hardware platform (usually, a distributed cluster of workstations). The environment also contains a runtime system that extends or partially replaces the underlying operating system in order to provide support for load unit migration. Data-based and thread-based LBS mainly belong to this group of tools.

For each class of LBS, we briefly describe the main characteristics of one of the most relevant tools, which serves as a representative example of tools of that class. This description is completed with a list of references for other similar tools.

a) Data-based LBS. Most of the dynamic migration environments that distribute data are based on the SPMD model of computation, where the user program is replicated on several processors and each copy of the program, executing in parallel, performs its computations on a subset of the data. Dome [Ara96] is a computing environment that supports heterogeneous checkpointing through the use of C++ class abstractions. When an object of one class is instantiated, it is automatically partitioned and adapted within the distributed environment. Load-balancing is performed by remapping data based on the time taken by each process during the last computational phase. Due to the SPMD computational nature of the applications, the synchronisation between computational phases and load-balancing phases is straightforward. Other systems similar to Dome that also provide data migration are described in [Sil94] and [Bru99].
An architecture-independent package is presented in [Sil94], where the user is responsible for inserting calls to specify the data to be saved and to perform the checkpoints. A framework implemented in the context of the Charm++ system [Kal96] is presented in [Bru99]. This framework automatically creates load-balanced Charm++ applications by means of object migration. Load-balancing decisions are guided by the information provided by the run-time system, which can measure the work incurred by particular objects and can also record object-to-object communication patterns. In this framework migration can only occur between method invocations, so that migration is limited to the data members of the object.

b) Thread-based LBS. These systems are usually object-based systems that provide a programming environment exporting a thread-based object-oriented programming model to the user. The objects share a single address space per application that is distributed across the nodes in the network, and the objects are free (under certain constraints) to migrate from one node to another. Arachne [Dim98] is a thread system that supports thread migration between heterogeneous platforms. It is based on the C and C++ languages, which have been augmented in order to facilitate thread migration. Conventional C++ is generated by a preprocessor that inserts special code to enable the saving and subsequent restoration of a thread state. Migrating threads must previously be suspended, and suspension takes place when a thread invokes an Arachne primitive. Therefore, threads may be suspended (and potentially migrated) only at particular points that must be known at compilation time. The Arachne environment also includes a runtime system that manages the threads during program execution. The heterogeneity of the environment is supported by generating executables beforehand for each machine. Other examples of object-based environments that support migrating threads are Ariadne [Mas96], Emerald [Ste95] and Ythreads [San94]. In contrast, UPVM [Kon97] is a process-based environment that provides thread migration for PVM programs written in single program multiple data (SPMD) style.

c) Process-based LBS. Condor/CARMI [Pru95] constitutes one of the most notable examples of process migration environments implemented at the user level.
It is based on Condor, a distributed batch processing system for Unix that was extended with additional services to support parallel PVM applications. Condor uses a checkpoint/roll-back mechanism to support migration of sequential processes. The Condor system takes a snapshot of the state of the programs it is running. This is done by taking a core dump of the process and merging it with the executable file. At migration time, the currently running process is immediately terminated and it is resumed on another host, based on the last checkpoint file. Condor was extended with CARMI (Condor Application Resource Management Interface), which provides an asynchronous Application Programming Interface (API) for PVM applications. CARMI provides services to allocate resources to an application and allows applications to make use of and manage those resources by creating processes to run there. CoCheck (Consistent Checkpointing) is the third component of the system. It is built on top of Condor and implements a network consistency protocol to ensure that the entire state of the PVM network is saved during a checkpoint and that communication can be resumed following a checkpoint.

Process migration and checkpointing (with certain limitations in most cases) have also been developed, or are under development, in some research packages such as MIST [Cas95], DynamicPVM [Ove96], Pbeam [Pet98] and Hector [Rus99], and in some commercial packages that were also based on Condor, such as [Cod97] and LoadLeveler [IBM93].
Chapter 2

Nearest-neighbour load-balancing methods

Abstract

As reported in the previous chapter, totally distributed dynamic load-balancing algorithms seem to be particularly adequate for parallel systems. In this chapter, we focus on nearest-neighbour load-balancing algorithms, where each processor only considers its immediate neighbour processors to perform load-balancing actions. First of all, we introduce a generic notation for nearest-neighbour load-balancing algorithms. Due to the iterative nature of most nearest-neighbour load-balancing methods, two important issues, namely the convergence of the algorithm and the detection of its termination, are subsequently discussed. Finally, three relevant LB algorithms from this category (SID, GDE and AN) are described and analysed in detail.
2.1 Introduction

As was commented in chapter 1, nearest-neighbour load-balancing algorithms have emerged as one of the most important techniques for parallel computers based on direct networks. This is corroborated by the fact that, amongst all the categories of load-balancing algorithms reported in table 1.1, the greatest concentration of strategies is found in this one (the nearest-neighbours column). Of the two families into which this category can be divided when we focus on the algorithmic aspects of the strategy, the load-balancing algorithms described below belong to the deterministic category. As was commented in section 1.5, deterministic algorithms are, in turn, divided into the subfamilies of diffusion and single direction. The load-balancing algorithm proposed in this thesis (chapter 3) belongs to the diffusion schemes and is compared, in chapter 4, with three well-known load-balancing strategies from the literature: the SID (Sender Initiated Diffusion) algorithm, the AN (Average Neighbourhood) algorithm and the GDE (Generalised Dimension Exchange) algorithm. SID and AN are diffusive strategies and have been chosen for their popularity and their similarity to the one proposed in this thesis. GDE belongs to the single-direction family and has been selected as a representative algorithm within this family and for being one of its most popular and traditional load-balancing algorithms.

In order to be rigorous in the description of the above-mentioned load-balancing algorithms, some generic notation and assumptions are introduced in section 2.2. All these algorithms perform the load-balancing process in an iterative way; therefore the convergence of the algorithm and its termination detection are important issues to be considered. Section 2.3 deals with these two problems. Finally, section 2.4 of this chapter includes a detailed description of the SID, AN and GDE algorithms, and the balance problems that arise from each of them are also analysed.

2.2 Basic notation and assumptions

The load-balancing algorithms described in the following section, as well as the one proposed in chapter 3, are suitable for parallel systems that may be represented by a
simple undirected graph $G=(P,E)$, i.e. without loops and with one or zero edges between two different vertices. The set of vertices $P=\{1,2,\ldots,n\}$ represents the set of processors. An edge $\{i,j\} \in E$ exists if there is a link between processor $i$ and processor $j$. Let $r_i$ denote the degree of a vertex $i$ in $G$ (the number of direct neighbours of processor $i$) and $r$ denote the maximum degree of $G$'s nodes. Note that in symmetrical topologies the number of immediate neighbours is the same for all the processors in the system, i.e., $r_i$ and $r$ will have the same value for any given processor $i$. The neighbourhood or domain of a given processor $i$ is defined as the following set of processors: $N_i = \{\, j \in P : \{i,j\} \in E \,\} \cup \{i\}$.

We assume that the basic communication model is the all-port one. This model allows a processor to exchange messages with all its direct neighbours simultaneously in one communication step, i.e., the communication hardware supports parallel communications over the set of links of a processor. Furthermore, we state that at a given instant $t$ each processor $i$ has a certain load $w_i(t)$, and the load distribution over the whole system is denoted by the load vector $W(t) = (w_1(t), \ldots, w_n(t))$.

The components of the load vector $W(t)$ can be real or integer values depending on the granularity of the application. When the computational load of a processor is assumed to be infinitely divisible, load can be represented by a real number and, therefore, perfect balance is achieved when all processors in the system have the same load value. This assumption is valid in parallel programs that exploit very fine grain parallelism. To cover medium and large grain parallelism, the algorithm must be able to handle indivisible tasks. Hence, the load is represented by a non-negative integer number and, therefore, the globally balanced load distribution is the one in which the maximum load difference throughout the entire system is 0 or 1 load unit.

Let $L = \sum_{i=1}^{n} w_i(t)$ be the total load of the system at a given time $t$ and $\overline{w}(t) = L/n$ denote the global load average at the same time. Thus, the objective of a load-balancing algorithm is to distribute load such that at some instant $t$ each $w_i(t)$ is "close" to $\overline{w}(t)$.
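As a small companion to this notation (plain Python, not part of the thesis formulation), the global load average and the two balance criteria can be expressed as:

    def global_average(W):
        """w_bar = L/n for a load vector W = (w_1, ..., w_n)."""
        return sum(W) / len(W)

    def balanced_real(W, eps=1e-9):
        """Infinitely divisible load: every w_i must equal the global average."""
        avg = global_average(W)
        return all(abs(w - avg) < eps for w in W)

    def balanced_integer(W):
        """Indivisible load units: maximum difference across the system is 0 or 1."""
        return max(W) - min(W) <= 1

    print(balanced_integer([3, 4, 4, 3]))   # True
    print(balanced_integer([2, 4, 4, 3]))   # False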
2.3 Iterative nearest-neighbour load-balancing algorithms

Most nearest-neighbour load-balancing algorithms are implemented in an iterative form where a processor balances its load by exchanging load with the neighbouring processors in successive executions of the three functional blocks introduced in section 1.3: the LM block, the LBA block and the MM block. Neglecting the computational operations related to the underlying application, and concerning ourselves only with the load-balancing process, a given processor $i$ executes the loop shown in figure 2.1.

    While (not_converged) {
        LM block
        LBA block
        MM block
    }

Figure 2.1 Iterative LB process in a processor.

By their distributed nature, nearest-neighbour load-balancing algorithms lack central coordination points that control the execution of load-balancing operations in each processor. As load movement decisions are taken locally by each processor, an important issue that must be considered when designing iterative distributed load-balancing algorithms is the convergence of the algorithm. The convergence property guarantees that there exists a finite number of balancing iterations beyond which all the processors in the system neither receive nor send any amount of load, and therefore a stable state is reached. With regard to the practical side of these load-balancing methods, another important issue to be addressed is the termination detection problem. In order to assist the processors in inferring global termination of the load-balancing process from local load information, it is necessary to superimpose a distributed termination detection mechanism on the load-balancing procedure. These two issues are discussed in the following sections.
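Assuming the generic structure of figure 2.1, the loop can be written schematically as follows (an illustrative Python skeleton only; the three callables simply stand for the LM, LBA and MM blocks, and converged stands for whatever termination-detection mechanism is superimposed):

    def load_balancing_loop(lm_block, lba_block, mm_block, converged):
        """Skeleton of the iterative LB process of figure 2.1 (illustrative)."""
        while not converged():
            local_view = lm_block()            # gather load information for the domain
            movements = lba_block(local_view)  # decide how much load goes where
            mm_block(movements)                # select load units and transfer them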
2.3.1 Algorithm convergence

Several analytical methods for studying distributed load-balancing algorithms have been developed for the idealised scenario where loads are treated as infinitely divisible and represented by real numbers; therefore, loads can be split to arbitrary precision [Ber89][Boi90][Cyb89][Hos90][Mut98][Wil93][Xu97]. The convergence of these methods has been studied using different mathematical approaches (ranging from linear algebra to other mathematical models). Cybenko [Cyb89], and independently Boillat in [Boi90], made an important contribution by showing that ideas from linear system theory can be employed in characterising the behaviour of load balancing. We summarise Cybenko's approach. The load assigned to processor $i$ at time $t+1$ is given by

$w_i(t+1) = w_i(t) + \sum_{j=1}^{n} \alpha_{ij}\,\big(w_j(t) - w_i(t)\big)$

where the $\alpha_{ij}$, for $1 \le j \le n$, satisfy the following conditions: $\alpha_{ij} = \alpha_{ji}$ for all $i,j$; if $\{i,j\} \notin E$, then $\alpha_{ij} = \alpha_{ji} = 0$; and $\sum_{j=1}^{n} \alpha_{ij} \le 1$ for $1 \le i \le n$.

This type of load assignment can be regarded as "diffusing" loads among the processors. Cybenko has shown that, with certain choices of the diffusion coefficients $\alpha_{ij}$, this process tends to balance the total load among the vertices. It is possible to express the load vector at time $t+1$ in terms of a matrix and the load vector at time $t$. This matrix is called the diffusion matrix and is denoted by $M$, where $M_{ij} = \alpha_{ij}$ for $i \ne j$ and $M_{ii} = 1 - \sum_{j \ne i} \alpha_{ij}$. It immediately follows that $W(t+1) = M\,W(t)$. It is straightforward to check that if each row of $M$ sums to 1, then $L$ is conserved. Different choices of $M$ result in different idealised load-balancing algorithms.
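The matrix formulation can be made concrete with a small example (Python with plain nested lists; the ring topology and the coefficient value 1/4 are arbitrary choices for illustration, not values prescribed by [Cyb89]):

    def diffusion_matrix(edges, n, alpha):
        """Build M with M_ij = alpha for {i,j} in E and M_ii = 1 - sum of row."""
        M = [[0.0] * n for _ in range(n)]
        for i, j in edges:
            M[i][j] = M[j][i] = alpha
        for i in range(n):
            M[i][i] = 1.0 - sum(M[i])          # diagonal chosen so each row sums to 1
        return M

    def diffuse(M, w):
        """One idealised iteration: w(t+1) = M w(t)."""
        n = len(w)
        return [sum(M[i][j] * w[j] for j in range(n)) for i in range(n)]

    # 4-processor ring, alpha_ij = 1/4 on every link
    M = diffusion_matrix([(0, 1), (1, 2), (2, 3), (3, 0)], 4, 0.25)
    w = [8.0, 0.0, 0.0, 0.0]
    for _ in range(20):
        w = diffuse(M, w)
    print(w)   # approaches [2, 2, 2, 2]; the total load (8) is conserved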
Works from the literature that match this matrix behaviour concentrate their efforts on bounding the time the algorithm takes to appreciably decrease the imbalance and on how this process can be accelerated. Classical works in this area [Boi90][Cyb89][Hos90][Xu97] focused on analysing the influence of the value of $\alpha_{ij}$ on the convergence rate, and different distributed load-balancing algorithms were proposed by simply changing the value of the diffusion coefficient, considering the eigenvalues of the Laplacian of the underlying topological graph. More recent works try to accelerate the convergence rate of the previous diffusive approaches by keeping, at each load-balancing iteration, a memory of what happened in past iterations [Die99][Hu99][Mut98]. In these alternatives, the properties of the Chebyshev polynomials are applied to study the convergence rate.

All the above-mentioned idealised distributed load-balancing algorithms are assumed to work in synchronous environments where all processors of the system perform load-balancing operations simultaneously. Another framework on which the analysis of idealised distributed load-balancing algorithms has focused is the asynchronous one, where each processor proceeds regardless of what the other processors do. Under this assumption, no global co-ordination is needed between processors. Asynchronous idealised iterative load-balancing algorithms differ from their synchronous counterparts in the manner in which the portion of excess load to be moved is evaluated. While synchronous algorithms use a fixed value ($\alpha_{ij}$) to apportion the excess load of an overloaded processor, asynchronous algorithms use a variable parameter which tends to depend on the current local load distribution [Cor99c][Wil93]. Therefore, the matrix model cannot be directly applied to this kind of algorithm. Bertsekas and Tsitsiklis proposed a load-balancing model for asynchronous idealised load-balancing algorithms [Ber89]. They also proved their convergence under the assumption that the balancing decisions do not reverse the roles of the involved processors, and including some bounds on the message delays in arbitrary message-passing networks (partial asynchronism).

In the more realistic setting of the problem, which covers medium and large grain parallelism, the loads are not infinitely divisible and, as a consequence, they are treated as natural numbers. An interesting question that arises is: what do the results
mean in the realistic setting where the load is represented by natural numbers? Some preliminary analyses of the problems related to considering loads as indivisible can be found in [Luq95][Mut97][Sub94]. A common solution to this question is found in all the above-mentioned idealised distributed load-balancing algorithms. This solution simply consists of rounding the load quantities to be moved in order to make them integral. The key is then to analyse the convergence of these algorithms with respect to the analysis performed in the idealised situation. Intuitively, it is clear that a scheme with the small perturbation (of rounding down) will behave similarly to the original one. However, as was stated in [Mut98], applying standard linear algebraic tools to handle perturbation, such as Gerschgorin's theorem, in order to analyse the convergence rate of the algorithms yields only weak results. In some cases the convergence of the algorithm has only been substantiated by simulation results, as happens for the realistic versions outlined in [Son94][Wil93][Xu97]. In other cases, some theoretical works related to the convergence rate of the realistic counterparts are provided, but only with a view to obtaining tight bounds on the global imbalance [Cor99c][Cyb89][Hos90][Mut98], whereas the convergence proof was left open. Therefore, as far as we are aware, there is no proof in the literature of the convergence of iterative load-balancing algorithms that work with a realistic load model.

2.3.2 Termination detection problem

With regard to the practical side of iterative load-balancing methods, another important issue to be addressed is the termination detection problem. With iterative LB algorithms, the load-balancing procedure is considered terminated when the system reaches a globally stable load distribution. From the practical point of view, the detection of global termination is by no means a trivial problem, because every processor lacks consistent knowledge about the whole load distribution as the load balancing progresses. In order to assist the processors in inferring global termination of the load-balancing process from local load information, it is necessary to superimpose a distributed termination detection mechanism on the load-balancing procedure.

There is extensive literature on the termination detection of synchronous algorithms [Eri88][Haz87][Mat87][Ran83][Ron90][Szy85] as well as of asynchronous ones [Cha85][Dij83][Fra82][Kum92][Sav96][Top84]. Most of the iterative LB
algorithms proposed in the literature overlooked this problem, attending only to the analysis of their convergence. Ensuring that all these algorithms lead to termination consists of including one of the existing termination detection algorithms, at the expense of increasing the overhead of the load-balancing process by the termination delay. As an illustration of what has been mentioned, we shall now describe a synchronous termination algorithm called SSP, whose name derives from the initials of its authors [Szy85]. At the point of termination evaluation, two states should be distinguished in a processor: the busy and the idle state. A processor is in an idle state when the load-balancing process has locally finished, i.e., when the load remains unchanged after an iteration of the load-balancing process. Otherwise, a processor is considered to be in a busy state. Subsequently, however, a busy processor can become idle and an idle processor may return to the busy state. Global termination occurs when all processors become idle. In order to detect this global termination, control messages are used to pass information about termination around, and these can emanate from both busy and idle processors. All system processors keep a counter S to determine how far away the nearest busy processor might be. The counter's value changes as the termination detection algorithm proceeds. S is equal to 0 if and only if the processor is in a busy state. At the end of a load-balancing iteration, each processor exchanges its counter value with all of its nearest neighbours. Then each idle processor updates its counter to be 1 + min{S, InputS}, where InputS is the set of all received counter values. Evidently, the counter in a processor actually corresponds to the distance between this processor and the nearest busy processor. Since the control information of a busy processor can propagate over at most one edge in one iteration, when the value of S is d+1, where d is the diameter of the underlying topology, the global termination condition has been accomplished. The pseudo-code of this termination algorithm is shown in figure 2.2.
    Algorithm SSP for processor i
        S = 0;
        while (S < d+1) {
            collect the S values from all neighbour processors and store them in InputS;
            S = min{S, InputS};
            if (local_terminated) S = S + 1;
            else S = 0;
        }

Figure 2.2 SSP algorithm for global termination detection.

2.4 Analysis of relevant nearest-neighbour load-balancing algorithms

As has been mentioned in the previous section, the LM, LBA and MM blocks are the three blocks that constitute the load-balancing process. In this section, we focus on the description of the LBA block for three well-known load-balancing algorithms, leaving aside the LM and MM blocks: the SID (Sender Initiated Diffusion) algorithm, the GDE (Generalised Dimension Exchange) algorithm and the AN (Average Neighbourhood) algorithm. The LBA block implements the load-balancing rules that allow a given processor $i$ to obtain, based on its own load and on that of the processors belonging to its domain evaluated at time $t$, its new load value at time $t+1$. Therefore, the load value of processor $i$ at time $t+1$ can be defined as a function of all load values within the underlying domain at time $t$ ($w_j(t)\ \forall j \in N_i$) and can be expressed as follows:

$w_i(t+1) = F\big(w_j(t),\ \forall j \in N_i\big)$

Below, the formal descriptions and analyses of the LBA block of the SID, the GDE and the AN algorithms are reported.
2.4.1 The SID (Sender Initiated Diffusion) algorithm

The SID strategy is a nearest-neighbour diffusion approach which employs overlapping balancing domains (defined in section 1.4.2) to achieve global balancing [Wil93]. The scheme is purely distributed and asynchronous. Each processor acts independently, apportioning excess load to deficient neighbours. Balancing is performed by each processor whenever it receives a load update message from a neighbour $j$ indicating that the neighbour's load is lower than a preset threshold $L_{low}$, i.e., $w_j(t) < L_{low}$. Each processor is limited to load information from within its own domain, which consists of itself and its immediate neighbours. All processors inform their nearest neighbours of their load levels and update this information throughout program execution. Load-balancing activation is determined by first computing the average load in the domain, $\overline{w}_i(t)$, as follows:

$\overline{w}_i(t) = \frac{1}{|N_i|} \sum_{j \in N_i} w_j(t)$

Next, if a processor's load exceeds the average load by a prespecified amount, that is $w_i(t) - \overline{w}_i(t) > L_{threshold}$, then it proceeds to make the load movement decisions. Load distribution is performed by apportioning excess load to deficient neighbours. Each neighbour $j$ is assigned a weight $d_j(t)$ according to the following formula:

$d_j(t) = \overline{w}_i(t) - w_j(t)$ if $w_j(t) < \overline{w}_i(t)$, and $d_j(t) = 0$ otherwise.

These weights are summed to determine the total deficiency, which is obtained as follows:

$D_i(t) = \sum_{j \in N_i} d_j(t)$

Finally, the portion of processor $i$'s excess load that is assigned to neighbour $j$, $p_{ij}(t)$, is defined as

$p_{ij}(t) = \dfrac{d_j(t)}{D_i(t)}$
Then, a non-negative amount of load, denoted by $s_{ij}(t)$, is transferred from processor $i$ to processor $j$ at time $t$, and is computed as

$s_{ij}(t) = p_{ij}(t)\,\big(w_i(t) - \overline{w}_i(t)\big)$

Balancing continues throughout application execution whenever a processor's load exceeds the local average by more than a certain amount. Figure 2.3 summarises the LBA block for the SID algorithm. In this figure $r_{ij}(t)$ denotes the amount of load received by processor $i$ from the neighbour processor $j$ at time $t$. A typical value for the $L_{threshold}$ parameter is zero, which forces the load-balancing algorithm to reach an even local load distribution.

    LBA block
        evaluate $\overline{w}_i(t)$
        if ($w_i(t) - \overline{w}_i(t) > L_{threshold}$)
            evaluate $s_{ij}(t)$ for every $j \in N_i$
        $w_i(t+1) = w_i(t) - \sum_{j \in N_i} s_{ij}(t) + \sum_{j \in N_i} r_{ij}(t)$

Figure 2.3 LBA block for the SID algorithm in processor i.

The SID algorithm was originally devised to be applied under the assumption that load can be treated as infinitely divisible. Under this assumption, the algorithm has been experimentally shown to converge [Wil93]. However, if the algorithm is modified to support integer loads, some problems arise. There are two rounding operations that allow $s_{ij}(t)$ to be transformed into an integer value: floor or ceiling. Under the assumption of discrete load values, the load-balancing process is considered to coerce the system into a perfectly balanced load distribution if all processors of the system end up with a load value equal to $\lceil L/n \rceil$ or $\lfloor L/n \rfloor$. We recall from section 2.2 that $L$ is the total amount of load across the system, and $n$ is the number of processors.
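The rules above can be condensed into a short sketch (Python, with our own variable names; the integer_loads switch corresponds to the floor-based discrete variant discussed next and is not part of the original formulation in [Wil93]):

    import math

    def sid_lba(w_i, neighbour_loads, L_threshold=0.0, integer_loads=False):
        """SID work-transfer calculation for one processor (illustrative sketch).

        w_i: load of the running processor; neighbour_loads: {j: w_j} for j in N_i.
        Returns {j: s_ij}, the amount of load to send to each deficient neighbour.
        """
        domain = [w_i] + list(neighbour_loads.values())
        avg = sum(domain) / len(domain)                  # domain load average
        if w_i - avg <= L_threshold:
            return {}                                    # processor not overloaded
        deficiency = {j: avg - w_j for j, w_j in neighbour_loads.items() if w_j < avg}
        total_deficiency = sum(deficiency.values())
        excess = w_i - avg
        moves = {}
        for j, d_j in deficiency.items():
            s_ij = (d_j / total_deficiency) * excess     # proportional apportioning
            moves[j] = math.floor(s_ij) if integer_loads else s_ij
        return moves

    print(sid_lba(8, {1: 4, 2: 4, 3: 4, 4: 4}))                     # 0.8 to each neighbour
    print(sid_lba(8, {1: 4, 2: 4, 3: 4, 4: 4}, integer_loads=True)) # all moves rounded down to 0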
In figure 2.4 it is possible to observe how the convergence of the integer version of the SID algorithm greatly depends on how $s_{ij}(t)$ is rounded, as has been reported in [Cor97]. The nodes in yellow are the processors with load excess. The execution of the discrete SID algorithm in these processors will produce the load movements indicated with the black arrows. If we apply ceiling rounding operations, the processor loads can oscillate between unbalanced states. In the example shown in figure 2.4(a) the central node starts the LB process with a load excess equal to 3.25. However, after executing one iteration of SID, it becomes an idle processor, while its neighbouring processors suddenly become overloaded processors. As a consequence, the external processors will try to distribute their new excess load by returning some units of load to the central processor, starting a new cycle of ping-pong load movements. A clear consequence of such behaviour is that the LB algorithm will never converge, and overloaded processors can suddenly become idle processors, as happens to the central one. On the other hand, with the floor approach, the load distribution converges to a situation that does not exhibit a perfect global balance (see figure 2.4(b)). Therefore, when a discrete version of the SID algorithm needs to be implemented, the floor approach is chosen because it stops at a stable state, although this may not be the even one. The LBA block of the discrete version of SID in a given processor $i$ is thus summarised in figure 2.5.

The conditional sentence in figure 2.5 allows each processor to detect whether it is balanced or not according to its domain load average. Taking an $L_{threshold}$ value equal to 0, if the comparison $w_i(t) - \overline{w}_i(t) > L_{threshold}$ is not satisfied, then the load of processor $i$ coincides with its domain load average and it is considered to be balanced. In this case, no load movements are guided by the LBA block. Otherwise, if the load of processor $i$ differs from its own domain load average, then it is considered as unbalanced and the load-balancing rules are evaluated. However, the LBA block may produce no load movements as a consequence of the rounding-down operation. Therefore, although the discrete version of the SID algorithm is able to detect that the underlying processor is not balanced, it is not always able to correct this situation. Below, we analyse in more detail the reasons for such a situation.
Figure 2.4 Integer versions of the SID algorithm applying the ceiling (a) and floor (b) rounding operations (panels show the initial and final load distributions at t=1 and t=2).

    LBA block
        evaluate $\overline{w}_i(t)$
        if ($w_i(t) - \overline{w}_i(t) > L_{threshold}$)
            evaluate $s_{ij}(t)$ for every $j \in N_i$ and round it down

Figure 2.5 LBA block for the discrete SID algorithm in processor i.
More precisely, there are three unevenly balanced local load distributions that the discrete SID algorithm is not able to arrange, in spite of their being detected as unbalanced. These three situations are outlined below.

a) Processor $i$ is an overloaded processor which has the highest load value within its own domain. In particular, the load value of processor $i$ is equal to $m+s$ ($w_i(t) = m + s$) and all its $r_i$ neighbours have the same load value $m$ ($w_j(t) = m$ for all $j$ such that $\{i,j\} \in E$); see figure 2.6(a). If the value of $s$ does not exceed the number of neighbours ($0 < s \le r_i$), then processor $i$ guides no load movements. Moreover, if $s>1$, the underlying domain is unbalanced. Otherwise, if $s$ is equal to 1, the maximum load difference within the underlying domain is 1, and this situation is then considered as balanced. Figure 2.6(b) shows an example of this kind of load distribution, where the central processor has a load value equal to 8 ($w_i(t) = 8$) and the neighbouring processors have a load equal to 4. Thus, $m$ is 4 and, since $s$ is also equal to 4, the domain is unbalanced and the discrete version of SID is not able to arrange such a situation, because after executing the LBA block the resultant load movements are all 0.

Figure 2.6 Unbalanced domain where the central processor has the highest load value and all its neighbour processors have the same load value.

b) Processor $i$ is the most loaded processor within its domain, with a load value equal to $m$ ($w_i(t) = m$), but not all its neighbours have the same load value. Suppose that the load values of the neighbouring processors are $n_1, n_2, \ldots, n_{r_i}$ and that they satisfy the following two conditions:
Chapter 2...<ñ <m, and m + Then, no load movements are guded by processor /. Furthermore, f IH-/I, > 1 there exsts an uneven load dstrbuton. Ths generc stuaton s depcted n fgure 2.7(a). (a) (b) Fgure 2. 7 Unbalanced doman where the central processor has the hghest load value wthn ts doman, and ts neghbours have dfferent load values. Fgure 2.7(b) shows an example of ths knd of load dstrbuton. The red processor s the central processor / whch has a load value equal to 4. The neghbourng processors have the followng load values: 2, 2 and 3 whch are dentfed wth «,,«2,«3 respectvely. The two condtons exposed above are accomplshed: w = 2 < «= 2 < «= 3 < 4. and. = 2 = n, Furthermore, wth 4-2>1, then the underlyng doman s unbalanced and SID s not able to balance t. 54
c) The last local unbalanced situation that SID is not able to arrange is the following. Suppose that processor $i$'s load is equal to or larger than the average load within its domain, but its load is not the biggest. Then, denoting the loads of its neighbours by $n_1, n_2, \ldots, n_{r_i}$ and the processor's own load by $m$, if the load distribution within the domain of processor $i$ satisfies

$n_1 \le \ldots \le n_l \le m \le n_{l+1} \le \ldots \le n_{r_i}$, and $m = \left\lfloor \dfrac{m + \sum_{j=1}^{r_i} n_j}{r_i + 1} \right\rfloor + 1$,

then processor $i$ does not perform load movements. In addition, if $n_{r_i} - n_1 > 1$, the current local load distribution is unbalanced. Figure 2.8(a) shows a generic illustration of this situation.

Figure 2.8 Unbalanced domain where the central processor is overloaded but is not the most loaded processor within the domain.

Figure 2.8(b) depicts an example of this situation. Processor $i$ has a load value equal to 4, and its neighbours have the following load distribution: 3, 3 and 5. In this situation the two conditions reported above are satisfied:

$n_1 = 3 \le n_2 = 3 \le m = 4 \le n_3 = 5$, and $4 = \left\lfloor \dfrac{4+3+3+5}{4} \right\rfloor + 1 = 3 + 1$.
Furthermore, as $5-3 > 1$, the underlying domain is unbalanced, but the underlying processor will not perform any load movement.

Finally, we can conclude that, although the original SID algorithm is able to detect unbalanced situations and arrange them, the discrete version of SID, which is a more realistic approach, is not able to achieve even load distributions. Furthermore, although it has been experimentally observed that the discrete SID stops, no theoretical proof of its convergence has been provided.

2.4.2 The GDE (Generalised Dimension Exchange) algorithm

The GDE algorithm belongs to the single-direction family within the nearest-neighbour strategies, as stated in table 1.1. This strategy is based on the dimension exchange method, which was initially studied intensively in hypercube-structured multicomputers [Cyb89][Ran88][Shi89b]. In these kinds of architectures, dimension exchange works in such a way that each processor compares its load with those of its nearest neighbours one after another. At each of these comparisons, the processor tries to equalise its load with its neighbour's. To do this systematically, all the processors follow the order implied by the dimension indices of the hypercube: equalising load with the neighbour along dimension 1, then along dimension 2, and so on. Thus, the load-balancing algorithm block of the Dimension Exchange algorithm is defined as follows:

    LBA block
        For each dimension {
            if there exists an edge (i,j) along the current dimension
                $w_i(t) = \frac{1}{2}\,w_i(t) + \frac{1}{2}\,w_j(t)$
        }

Figure 2.9 LBA block for the original DE algorithm in processor i.

The Dimension Exchange method is characterised by "equal splitting" of load between a pair of neighbours at every communication step. Due to this fact, this method does not take full advantage of the all-port communication model
described in section 2.2, having to perform as many communication steps as there are dimensions in the underlying hypercube in order to execute the entire for-loop in the LBA block. This algorithm's way of working adapts best to the one-port communication model, where a processor is restricted to exchanging messages with at most one direct neighbour at a time.

It has been shown that this simple load-balancing method yields a uniform distribution from any given initial load distribution in hypercube topologies in a single sweep (i.e. one iteration of the for-loop) [Cyb89]. For arbitrary networks this may not be the case, and the dimensions can be defined by edge-colouring techniques. With edge-colouring techniques [Hos90], the edges of a given system graph $G$ are coloured with some minimum number of colours $k$ such that no two adjoining edges are of the same colour. The colours are indexed with integer numbers from 1 to $k$, and the $k$-coloured graph is represented as $G^k$. An edge between vertices $i$ and $j$ with chromatic index $c$ in $G^k$ is represented by a 3-tuple $(i,j,c)$. It is known that the minimum number of colours $k$ is strictly bounded by $r$, with $r \le k \le r+1$ [Fio78]. Figure 2.10 shows an example of a 4x4 mesh coloured with four colours, where the numbers beside the edges correspond to the four colours. During each sweep, all colours/dimensions are considered in turn. Since no two adjoining edges have the same colour, each node needs to deal with at most one neighbour at each iteration step (each step corresponds to one colour; a sweep corresponds to going through all the colours once, figure 2.11(a)).

Figure 2.10 Edge colouring for a 4x4 mesh.
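One valid colouring of this kind can be produced by a simple constructive rule (illustrative Python; this is not the specific procedure of [Hos90], and the colour assignment may differ from the one drawn in figure 2.10):

    def mesh_edge_colouring(rows, cols):
        """Colour the edges of a rows x cols mesh with at most 4 colours.

        Horizontal edges get colour 1 or 2 depending on the parity of the left
        endpoint's column; vertical edges get colour 3 or 4 depending on the
        parity of the upper endpoint's row.  No two adjoining edges share a
        colour, so each colour class can be used as one "dimension".
        """
        colouring = {}
        for r in range(rows):
            for c in range(cols):
                node = r * cols + c
                if c + 1 < cols:                          # horizontal edge
                    colouring[(node, node + 1)] = 1 if c % 2 == 0 else 2
                if r + 1 < rows:                          # vertical edge
                    colouring[(node, node + cols)] = 3 if r % 2 == 0 else 4
        return colouring

    print(mesh_edge_colouring(4, 4))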
Hosseini et al. [Hos90] also studied the convergence property of this method by treating load as real numbers and using linear system theory. Moreover, the authors also derived an integer version of this load-balancing algorithm as a perturbation of the linear version. The load-balancing algorithm block applied to this alternative is shown in figure 2.11(b).

    LBA block (a)
        For (c=1; c <= k; c=c+1) {
            if there exists an edge (i,j) with colour c
                $w_i(t) = \frac{1}{2}\,w_i(t) + \frac{1}{2}\,w_j(t)$
        }

    LBA block (b)
        For (c=1; c <= k; c=c+1) {
            if there exists an edge (i,j) with colour c
                $w_i(t) =$ the average $\frac{1}{2}\,w_i(t) + \frac{1}{2}\,w_j(t)$ rounded to an integer
        }

Figure 2.11 LBA block for the DE algorithm applied to arbitrary networks in processor i using real (a) and discrete (b) load values.

A question that occurred to the authors when this integer version was proposed is: how does one decide which node receives the floor and which the ceiling of the average of the paired nodes? Rather than dealing directly with this question, the authors introduced a notation that records the choices made in each balance operation, allowing the load-balancing algorithm to reach a load distribution whose deviation from the global load average remains bounded. This imprecision denotes the difficulty of deriving, using the dimension exchange idea, a load-balancing algorithm that treats load as integer numbers, which is a more realistic approach. Xu and Lau introduced a more precise integer version of the dimension exchange idea for arbitrary networks which ensures its convergence. The revised algorithm proposed in [Xu97] is shown in figure 2.12.
LBA block
For (c = 1; c <= k; c = c + 1) {
    if there exists an edge (i, j) with colour c
        w_i(t+1) = ⌈0.5·w_i(t) + 0.5·w_j(t)⌉  if w_i(t) > w_j(t)
        w_i(t+1) = ⌊0.5·w_i(t) + 0.5·w_j(t)⌋  otherwise
}

Figure 2.12 LBA block for the discrete version of the DE algorithm for arbitrary networks in processor i.

Figure 2.13 shows the algorithm's behaviour for a 4x4 mesh when the algorithm of figure 2.12 is applied. The edge colouring used corresponds to the one shown in figure 2.10. Suppose that the load distribution at some time instant is as in figure 2.13(a), in which the number inside a processor represents the load of that processor. Then, after one sweep of the dimension-exchange procedure, the load distribution changes to that in figure 2.13(b). The load distributions obtained when executing the load-balancing function at each dimension/colour are also shown. The red edges denote the pairs of processors involved in the load exchange within the inspected dimension/colour, and the arrows show the amount of load moved between them. In order to achieve a stable global load distribution, the load-balancing algorithm block (one sweep) must be executed three more times, as shown in figure 2.14. The final load distribution obtained is not a globally balanced situation, because the maximum load difference throughout the entire system is two load units instead of one. Thus, one can conclude that an equal splitting of load between pairs of processors does not always lead to an even global load distribution, in which the maximum load difference between any two processors within the system is one load unit.
Figure 2.13 Load distribution during a sweep of DE for arbitrary networks (panels (a) and (b))

Figure 2.14 DE for arbitrary networks (initial load distribution and final load distribution after 3 sweeps)

In the light of the intuition that a non-equal splitting of load might reduce the number of sweeps necessary for obtaining a uniform distribution in certain networks, Xu and Lau generalised the dimension-exchange method by adding an exchange parameter to control the splitting of load between a pair of directly connected processors. They called this parameterised method the Generalised Dimension Exchange (GDE) method [Xu92][Xu94][Xu97]. The GDE method is based on edge-colouring of the
given system graph G, using the idea introduced by Hosseini et al. in [Hos90] and reported above. For a given G^k, let w_i(t) denote the current local load of a processor i and λ denote the exchange parameter chosen. Then, the change of w_i(t) in the processor using the GDE method is governed by the following LBA block:

LBA block
For (c = 1; c <= k; c = c + 1) {
    if there exists an edge (i, j) with colour c
        w_i(t+1) = λ·w_i(t) + (1 − λ)·w_j(t)
}

Figure 2.15 LBA block for the GDE algorithm in processor i.

In order to guarantee w_i(t) ≥ 0, the domain of the exchange parameter λ is restricted to [0,1]. Two choices of the exchange parameter have been suggested as rational choices in the literature:

λ = 1/2: equal splitting of the total load of a pair of processors. Note that this choice reduces the GDE algorithm to the Dimension Exchange method. This special version of the GDE algorithm is referred to as the ADE (Averaging Dimension Exchange) method. As mentioned above, this value of the exchange parameter favours hypercube topologies.

λ = 1/(1 + sin(π/k)) in the mesh and λ = 1/(1 + sin(2π/k)) in the torus, where k is the largest dimension of the corresponding topology. This variant of the GDE method is known as the ODE (Optimally tuned Dimension Exchange) method. It has been proved in [Xu95] that, under the assumption that no load is created or consumed during the load-balancing process, these two values of the exchange parameter are optimal for meshes and tori respectively.

Xu and Lau have also shown that, in plain numeric terms, the optimal exchange parameter for k-ary n-cube topologies ranges between 0.7 and 0.8.
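To make the sweep structure concrete, the following minimal Python sketch applies the GDE rule over an edge colouring such as the one built after figure 2.10; it is only a sequential illustration of the update (the real algorithm runs inside the processors themselves), and the rounding used for integer loads anticipates the discrete variant discussed next. The data layout and names are assumptions, not taken from the thesis.

    import math

    def gde_sweep(load, colouring, k, lam=0.5):
        # One sweep: visit colours 1..k in turn; within a colour each node is
        # paired with at most one neighbour (i, j).  lam = 0.5 reduces to the
        # averaging (ADE / plain dimension-exchange) rule.  For integer loads
        # the more loaded partner takes the ceiling and the other the floor,
        # so the total load is preserved.
        for c in range(1, k + 1):
            for (i, j), colour in colouring.items():
                if colour != c:
                    continue
                wi, wj = load[i], load[j]
                new_i = lam * wi + (1 - lam) * wj
                new_j = lam * wj + (1 - lam) * wi
                if wi > wj:
                    load[i], load[j] = math.ceil(new_i), math.floor(new_j)
                else:
                    load[i], load[j] = math.floor(new_i), math.ceil(new_j)
        return load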
These theoretical results were obtained by Xu and Lau treating loads as real numbers. As we have previously mentioned, in order to cover medium- and large-grain parallelism, which is more realistic and more common in practical parallel computing environments, load may more conveniently be treated as non-negative integers. A modified version of the GDE algorithm is then needed. Such a variant is shown in figure 2.16, where the revised formula of the LBA block is given.

LBA block
For (c = 1; c <= k; c = c + 1) {
    if there exists an edge (i, j) with colour c
        w_i(t+1) = ⌈λ·w_i(t) + (1 − λ)·w_j(t)⌉  if w_i(t) > w_j(t)
        w_i(t+1) = ⌊λ·w_i(t) + (1 − λ)·w_j(t)⌋  otherwise
}

Figure 2.16 LBA block for the discrete version of the GDE algorithm in processor i.

Because of the use of integer loads, the load-balancing process will end with a variance of some threshold value (in load units) between neighbouring processors. This threshold value can be tuned to obtain satisfactory performance. By setting this value to one load unit, the closest possible total load balance is enforced. Then, it is clear that 0.5 ≤ λ < 1, because a pair of neighbouring processors with a variance of more than one load unit would no longer balance their loads when λ < 0.5. Xu and Lau demonstrated experimentally that the optimal exchange parameter λ when the integer load model is applied is not always 0.5, but somewhere between 0.7 and 0.8, which is in agreement with their theoretical results. As happens with other load-balancing algorithms that were originally devised to treat load as infinitely divisible, the discretisation of the original GDE algorithm also raises some problems, which have been analysed in [Mur97][Sub94]. Figure 2.17 shows the load changes performed in the pictured domain after executing one sweep (three iterations) of the discrete version of the
GDE algorithm, where the exchange parameter λ has been chosen as 0.75. As can be observed in the example, the discrete version of GDE may converge to a situation that does not exhibit a perfect global balance.

Figure 2.17 Discrete version of the GDE algorithm with exchange parameter λ equal to 0.75 (initial load distribution, distribution at t = 1, and final load distribution after 1 sweep)

2.4.3 The AN (Average Neighbourhood) algorithm

The Average Neighbourhood algorithm is a load-balancing algorithm which belongs to the diffusion family, as does the SID algorithm (see table 1.1). These two algorithms are similar in the sense that both evaluate the load average within the domain of each processor to determine whether it has a load excess or not. In other words, if the load value of a given processor i is bigger than the load average within its domain, the LBA block may produce some load-movement decisions to distribute this surplus load among the neighbouring processors. From the analysis of SID's behaviour reported in section 2.4.1, we have concluded that, although there is a time beyond which the execution of SID in a given processor i produces no load movements, the load distribution obtained at that time may not be balanced. The AN algorithm tries to remedy this problem by including the capability of moving load between non-directly connected processors. These load movements are restricted to being performed between processors that have a common neighbouring processor. Therefore, the goal of AN's LBA block on processor i is to balance the load of all the processors belonging to its domain [Cor99c]. This goal is wider than simply trying to balance the load of processor i alone. The LBA block computes the average domain load using the
load information from all the processors belonging to its domain. Bearing in mind that the domain of a given processor i is denoted by N_i and that r is the number of direct neighbours in a symmetric topology, this average is evaluated as follows:

w̄_i(t) = ( Σ_{j ∈ N_i} w_j(t) ) / (r + 1)

Two threshold values, named T_sender and T_receiver, which are centred around the domain load average, are also evaluated. A processor i is classified as:
- sender if T_sender is smaller than the current load value of processor i (w_i(t) > T_sender);
- receiver if T_receiver is bigger than the current load value of processor i (w_i(t) < T_receiver);
- neuter in any other case, i.e., T_receiver ≤ w_i(t) ≤ T_sender.

Any time a processor i receives load information from one of its neighbours, the new load average within its domain is calculated and T_sender and T_receiver are updated to evaluate the load-balancing needs of the domain. In a single balancing iteration, depending on the domain load situation, one of the following actions can be decided (a sketch of this classification is given after the list):

1. if processor i is a sender and at least one neighbouring processor is a receiver, processor i sends load towards the receivers in proportion to their load deficit;
2. if processor i is a receiver and there is at least one sender in the domain, processor i requests load from one of them, in particular the most loaded one, and the amount of load requested will be proportional to its excess;
3. if processor i is neuter but there are senders in the domain, it requests the most loaded sender to provide load and to give it to the most underloaded processor of the domain.
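A minimal sketch of the sender/receiver/neuter classification just listed is given below (Python used as notation; the symmetric offset delta around the domain average is an assumption introduced only for the illustration, since the exact way AN centres T_sender and T_receiver is a tuning decision discussed later in the text):

    def classify(own_load, neighbour_loads, delta=0.5):
        # Domain load average over the processor itself and its r direct neighbours.
        domain = [own_load] + list(neighbour_loads)
        avg = sum(domain) / len(domain)
        t_sender = avg + delta            # assumed thresholds centred on the average
        t_receiver = avg - delta
        if own_load > t_sender:
            return "sender"
        if own_load < t_receiver:
            return "receiver"
        return "neuter"                   # T_receiver <= w_i(t) <= T_sender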
Nearest-neghbour load-balancng methods In other words, a gven processor / tres to push the load of all processors wthn ts doman as close as possble to the doman load average. However, the algorthm gves prorty to balancng processor / (pont 1 and 2). Only f t s balanced does the algorthm try to balance neghbourng processors (pont 3). The descrbed mplementaton acheves the goal of movng the load of every doman processor wthn the range defned by Tender and T Kcever. In order to acheve a perfect balanced doman where the maxmum local load dfference s no bgger than one load unt, both thresholds must concde wth the local load average,.e., T recever = T sender = w (t). It s mportant to remark that a load movement must not reverse the role of processors n the doman to grant stablty, n other words, the load reallocaton must be lmted to avod a sender processor becomng a recever n the doman, and vceversa. Fgure 2.18 summarses the LBA block for the AN algorthm. An mportant dfference n the AN algorthm wth respect to the two loadf balancng algorthms descrbed n the prevous sectons (SID and GDE) les n the lmtaton of the number of processors n the system that can be executng the balancng process smultaneously. Ths means that some degree of co-ordnaton between the LBA blocks s needed to avod stuatons n whch balancng operatons take place concurrently n overlappng domans. For ths reason, all system processors are organsed n processors subgroups, n such a way that the processor domans relevant to a partcular subgroup are totally dsjont. These subgroups are referred to as seralsaton sets, and there should be the mnmum possble number n exstence. Snce the AN algorthm s a totally dstrbuted load-balancng algorthm, each processor of the system belongs to a seralsaton set. Bearng n mnd ths seralsaton restrcton, t would mply coverng all the seralsaton sets at once n order to obtan that all processors execute the balancng process one sngle tme. Fgure 2.19 depcts all the seralsaton sets n a 4x4 torus topology where the red processors are the processors whch are allowed to execute the load-balancng process. As we can observe there are 8 seralsaton sets wth 2 non-overlapped doman n each one of them. 65
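As an illustration of how serialisation sets could be obtained, the following greedy sketch (only an assumption of one possible construction; it makes no claim of producing the minimum number of sets required by the algorithm) packs processors into groups whose domains, taken as the processor plus its direct neighbours, are pairwise disjoint:

    def serialisation_sets(neighbours):
        # neighbours: dict mapping each node to the set of its direct neighbours.
        remaining = set(neighbours)
        sets = []
        while remaining:
            current, covered = [], set()
            for node in sorted(remaining):
                domain = {node} | neighbours[node]
                if domain.isdisjoint(covered):   # domains within a set must not overlap
                    current.append(node)
                    covered |= domain
            sets.append(current)
            remaining -= set(current)
        return sets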
LBA block
evaluate w̄_i(t), T_sender and T_receiver
classify the domain processors as senders, receivers or neuters and decide the corresponding load movements; in this case, load movements can be commanded between processors belonging to the underlying domain

Figure 2.18 LBA block for the AN algorithm in a master processor.

Figure 2.19 Serialisation sets for a 4x4 torus topology
Nearest-neghbour load-balancng methods Whenever the actvty of the algorthm s stopped because competng loadbalancng operatons are takng place n overlappng domans, the algorthm cannot smply wat for ther completon. The stuaton of the doman processors s lkely to have been changed, and the nformaton upon whch a decson was based could have become obsolete. The load-balancng process must abort whenever t s delayed, because of competng actons. Therefore, the need for seralsng the loadbalancng process can be vewed as an mportant constrant, snce t generates hgh synchronsaton message traffc and contnuous nterruptons of the load-balancng operatons n some processors. The convergence of the AN algorthm has been proved wth respect to keepng the global load varance bounded beyond a gven tme [Cor96]. However, fnal load dstrbuton does not always exhbt a global balance stuaton as happens n the example depcted n fgure 2.20 where the maxmum load dfference between any two processors at the end of the load-balancng process s 2. Fgure 2.20 Fnal unbalanced stuaton applyng the AN algorthm 2.5 Summary of ths chapter Ths secton s orented to summarse the man characterstcs of each one of the three algorthms descrbed above, n order to have a global vew of ther smlartes and dfferences wth respect the man mplementaton key ssues and ther derved capabltes. Recallng from chapter 1 that the load-balancng process vewed from the processor level pont of vew, s decomposed nto three functonal blocks: LM (Load Manager) block, LBA (Load Balancng Algorthm) block and MM (Mgraton Manager) block. In partcular, n ths chapter, we concentrate on an exhaustve descrpton of the correspondng LBA (Load-Balancng Algorthm) bock whch, 67
recalling from chapter 1, is the block of the global load-balancing process that decides the source and destination processors for load movements, as well as the amount of load to be moved between them.

MAIN CHARACTERISTICS OF SID, GDE AND AN

Implementation key issues (processor level):
- Original load model: SID - infinitely divisible load units; GDE - infinitely divisible load units; AN - discrete load units.
- Domain: nearest neighbours (SID, GDE and AN).
- Load-balancing activation (trigger condition): SID - one adaptive threshold (local load average); GDE - one adaptive threshold (load difference between two neighbours); AN - two adaptive thresholds (evaluated around the local load average).
- Work transfer calculation: SID - evaluated (depending on the load distribution within the domain); GDE - fixed (exchange parameter); AN - evaluated (depending on the load distribution within the domain).
- Set of running processors: all system processors (SID, GDE and AN).

System level:
- Degree of co-operation: SID - asynchronous; GDE - synchronous; AN - synchronous.
- Simultaneously active processors: SID - all system processors; GDE - processors involved in the same dimension; AN - processors belonging to the same serialisation set.

Behaviour features:
- Convergence: not proved (SID, GDE and AN).
- Detection of unbalanced domains: not allowed (SID, GDE and AN).
- Local balance: not always achieved (SID, GDE and AN).
- Global balance: not always achieved (SID, GDE and AN).
- Movements between non-directly connected processors: SID - not allowed; GDE - not allowed; AN - allowed.

Table 2.1 Summary of the main characteristics of SID, GDE and AN.
Nearest-neghbour load-balancng methods As was descrbed n secton 1.3.2 of ths work, the LBA block s dvded nto two phases: Load Balancng Actvaton and Work Transfer Calculaton. The frst phase s responsble of testng the trgger condton and, therefore, determnng whch processors are the actve processors at a certan tme,.e., whch processors overcome the trgger condton and wll consequently execute the second phase of the LBA block. In table 2.1, we have ncluded the partcular charactersaton of each one of these phases for all analysed algorthms. However, not only these characterstcs are reported n that table. Other relevant ssues from the processor level pont of vew, as well as from the system level pont of vew are also consdered n ts summary. Furthermore, table 2.1 ncorporates the behavour features derved from the executon of each algorthm. 69
A new dstrbuted dffuson algorthm for dynamc load-balancng n parallel systems Chapter 3 DASUD load-balancng algorthm Abstract In ths chapter a new dynamc load-balancng algorthm called DASUD (Dffuson Algorthm Searchng Unbalanced Domans), whch uses a local teratve scheme to acheve a global balance, s proposed. DASUD was developed for applcatons wth a coarse and large granularty where load must be treated as non-negatve nteger values. Frstly, n ths chapter, we descrbe the proposed algorthm, ts complexty s analysed and ts functonng demonstrated for an example n whch the underlyng topology s a 3-dmensonal hypercube. Subsequently, the proof of DASUD's convergence s provded, and bounds for the degree of overall balance acheved are provded, as well as for the number of teratons requred for such balance. 71
DASUD load-balancng algorthm 3. í DASUD (Dffuson Algorthm Searchng Unbalanced Domans)'s motvaton The load-balancng strateges wthn the nearest-neghbour category exhbted some problems when load s packaged nto dscrete unts. In partcular, n chapter 2, the problems related to the three algorthms descrbed have been analysed. Summarsng, the man problem that all these strateges exhbted s that they may produce solutons whch, although they are locally balanced, prove to be globally unbalanced. Fgure 3.1 shows the load of 6 processors connected n a lnear array, whch are obtaned as balanced soluton by most of the nearest-neghbour loadbalancng algorthms. However, ths load dstrbuton s not an even dstrbuton because the maxmum load dfference through the whole system s 5 load unts. We recall from chapter 2 that the system s balanced when the global maxmum load dfference s 0 or 1 load unt. Fgure 3.1 Fnal stable load dstrbuton but globally unbalanced. The proposed algorthm DASUD, s based on the SID load-balancng algorthm and ncorporates the evaluaton of some new parameters to detect stable unbalanced local load dstrbutons acheved by SID and generates some load movements to slghtly arrange them. DASUD detects as unbalanced stuatons such as the one shown n fgure 3.1 and s able to drve the system to a fnal balanced load dstrbuton such as the one depcted n fgure 3.2. Fgure 3.2 Fnal balanced load dstrbuton. In table 3.1, we enumerate the characterstcs of DASUD by usng the same scheme that the one used n secton 2.5 to summarse the man characterstcs of the three load-balancng algorthms descrbed n chapter 2 (SID, GDE and AN). In the followng secton, an exhaustve descrpton of DASUD's behavour s provded. 73
MAIN CHARACTERISTICS OF DASUD

Implementation key issues (processor level):
- Original load model: discrete load units.
- Domain: nearest neighbours.
- Load-balancing activation (trigger condition): one adaptive threshold (local load average).
- Work transfer calculation: evaluated (depending on the load distribution within the domain).
- Set of running processors: all system processors.

System level:
- Degree of co-operation: asynchronous.
- Simultaneously active processors: all system processors.

Behaviour features:
- Convergence: finitude proved.
- Detection of unbalanced domains: allowed.
- Local balance: always achieved.
- Global balance: upper bounded.
- Movements between non-directly connected processors: allowed.

Table 3.1 Main characteristics of the DASUD load-balancing algorithm

3.2 Description of the DASUD algorithm

We now discuss the behaviour of one iteration of DASUD or, what is in effect the same thing, the LBA block of the DASUD algorithm. Essentially, one iteration of DASUD consists of two load-balancing stages, as shown in figure 3.3. The first stage performs a coarse distribution of the load excess of the underlying processor, whereas the second stage produces a more accurate excess distribution in order to achieve a perfect load balance within the underlying domain. More precisely, in the first stage, if the underlying processor is an overloaded processor, it proportionally distributes its excess load among its underloaded neighbouring processors in such a way that the most underloaded neighbours receive more load than the less underloaded ones. In the second stage, firstly, each processor checks its own domain to determine whether it is unbalanced or not. In order to balance the
DASUD load-balancng algorthm underlyng doman, each processor can proceed by completng ts excess load dstrbuton n a more refned way, sendng messages to an overloaded neghbour nstructng t to send load to an underloaded neghbour, or performng load movements between non-drected connected processors. We now formally descrbe each one of the stages of DASUD. In order to do so, we use the same nomenclature descrbed n chapter 2, and we ntroduce certan new notaton that s descrbed below. In the DASUD algorthm, each processor sends a message, at certan gven tmes, to all ts neghbours contanng ts local load. Ths nformaton s not updated mmedately n all neghbour processors due to delays ntroduced by the network. Therefore, each processor / keeps n ts local memory an estmaton of processor's j load (we denote w, y (0 as the load estmaton of processor j kept n memory by processor ; at tme 0- Then, f / and j are neghbour processors (.e. {,j} e E), w j(t) - wj (T j (t)), where T tj (t) s a certan tme nstant satsfyng O < T y (t) < t. For convenence, f / #j and {,j} e E then ^.(0 = 0. Each processor s able to assess ts current load at each nstant. Ths means that w^t) = w,(t). For our algorthm, t s also convenent that each processor has a lst of ts neghbours sorted accordng to the assgned processor ndexes. Ths sorted lst wll be used as crtera for selecton n those cases n whch a processor must be selected from amongst a subset of neghbourng processors. DASUD works asynchronously. Therefore, each processor / can be assocated wth a set of nstants T, that wll be denoted as a set of load balancng tmes for processor /. At each one of these nstants, processor / begns the frst stage of the DASUD algorthm. In ths stage each processor compares ts load wth the load estmaton of all ts neghbours that are stored n ts local memory. As each T, s a tme-dscrete set, n order to study DASUD's behavour, we can dscrmnate the varable t. Therefore t assumes the value 0,1,2,... 75
Figure 3.3 One iteration of the DASUD algorithm
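The per-processor bookkeeping introduced above (the current local load, the possibly outdated estimates of the neighbours' loads, the received instruction messages and the sorted neighbour list) can be pictured with a minimal data-structure sketch; the field names are illustrative assumptions only:

    from dataclasses import dataclass, field

    @dataclass
    class ProcessorState:
        index: int                  # processor index i
        load: int = 0               # current local load w_i(t)
        estimate: dict = field(default_factory=dict)            # j -> w_ij(t), possibly outdated
        messages: list = field(default_factory=list)            # received instruction messages
        sorted_neighbours: list = field(default_factory=list)   # neighbour indices in ascending order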
Firstly, processor i performs some initial actions, starting by checking the load estimations of its neighbours kept in its memory (w_ij(t)), and it computes the load average of its domain as follows:

w̄_i(t) = ( Σ_{j ∈ N_i} w_ij(t) ) / #N_i

Once processor i has computed its local load average, w̄_i(t), it also evaluates a local load weight, denoted by d_ii(t), in order to detect whether it is an overloaded processor or not: d_ii(t) = w̄_i(t) − w_i(t). If processor i is an overloaded processor, d_ii(t) will be a negative value (d_ii(t) < 0). Otherwise, if processor i is an underloaded processor, d_ii(t) will be a non-negative value (d_ii(t) ≥ 0). Then, depending on the value of d_ii(t), one of the two stages of DASUD will be performed. If d_ii(t) < 0, then the first stage will be executed. Otherwise, if d_ii(t) ≥ 0, the computation will go on with the second stage.

3.2.1 Description of the first stage of DASUD

In this stage, the load excess of processor i is distributed among its neighbouring processors with a load deficit. The load-excess distribution is performed proportionally to the corresponding load deficits. For this purpose, a load weight d_ij(t) is evaluated for each processor j belonging to the domain of processor i according to the following formula: d_ij(t) = w̄_i(t) − w_ij(t). An overloaded processor i (d_ii(t) < 0) performs load balancing by apportioning its excess load only to deficient neighbours j whose load weight is a positive value (d_ij(t) > 0). The amount of excess load to be moved from an overloaded processor i to one of its deficient neighbours j will be denoted by s_ij(t). In order to evaluate s_ij(t), a new weight, called d*_ij(t), is computed for all processors j:
d*_ij(t) = d_ij(t) if d_ij(t) > 0, and d*_ij(t) = 0 otherwise.

The total amount of load deficit is then computed in order to determine the total deficiency of the domain:

D_i(t) = Σ_{j ∈ N_i} d*_ij(t)

Subsequently, the portion of the excess load of processor i that is assigned to neighbour j, P_ij(t), is computed as follows:

P_ij(t) = d*_ij(t) / D_i(t) if D_i(t) ≠ 0, and P_ij(t) = 0 otherwise.

Then, a non-negative amount of load, denoted by s_ij(t), is transferred from processor i to processor j at time t; it is computed as

s_ij(t) = floor( −P_ij(t)·d_ii(t) )

If s_ij(t) = 0 for all j, i.e., no load movements are performed by the first stage of DASUD, then the second stage will be executed. Otherwise, the second stage of DASUD is skipped and no more load movements will be performed during this iteration.

3.2.2 Description of the second stage of DASUD

In this stage, DASUD evaluates the balance degree of the domain of processor i by searching for unbalanced domains. This stage is composed of four blocks which work together with the aim of completely balancing the underlying domain. These blocks are the SUD (Searching Unbalanced Domains), FLD (Fine Load Distribution), SIM (Sending Instruction Messages) and PIM (Processing Instruction Messages) blocks. Each of these blocks is described below.
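Before turning to these blocks individually, the first-stage proportional distribution of section 3.2.1 can be summarised with a short Python sketch (the data layout is an illustrative assumption; loads are non-negative integers):

    import math

    def dasud_first_stage(own_load, estimates):
        # estimates: dict mapping neighbour index j to the estimate w_ij(t).
        domain = [own_load] + list(estimates.values())
        avg = sum(domain) / len(domain)              # domain load average
        d_ii = avg - own_load                        # local load weight; negative if overloaded
        if d_ii >= 0:
            return {}                                # not overloaded: proceed to the second stage
        deficits = {j: max(avg - wj, 0) for j, wj in estimates.items()}   # d*_ij(t)
        total = sum(deficits.values())               # D_i(t)
        if total == 0:
            return {}
        moves = {}
        for j, d in deficits.items():
            s = math.floor(-(d / total) * d_ii)      # s_ij(t) = floor(-P_ij(t) * d_ii(t))
            if s > 0:
                moves[j] = s
        return moves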
Searching Unbalanced Domains (SUD) block

In this block, four parameters are evaluated:
a) the maximum load value of the whole domain (including i): w_i^max(t),
b) the minimum load value of the whole domain (including i): w_i^min(t),
c) the maximum load value of the neighbouring processors of processor i: w_Ni^max(t),
d) the minimum load value of the neighbouring processors of processor i: w_Ni^min(t).

The maximum load difference throughout the domain of processor i is evaluated (w_i^max(t) − w_i^min(t)) in order to detect whether its domain is unbalanced or not. We recall that N_i is balanced at instant t if w_i^max(t) − w_i^min(t) ≤ 1. If the domain is not balanced, the FLD block will be executed. Otherwise, if the domain is balanced, the PIM block will be executed.

Fine Load Distribution (FLD) block

If the domain is unbalanced (w_i^max(t) − w_i^min(t) > 1), then one of the two following actions can be carried out, according to the values of the four parameters evaluated in the previous block.

Action 1: If processor i is the processor with the maximum load of its domain (w_i(t) = w_i^max(t)) and all its neighbours have the same load (w_Ni^max(t) = w_Ni^min(t)), then processor i will distribute α units of load to its neighbours one by one. The value of α is computed as α = w_i^max(t) − w_i^min(t) − 1, in order to keep the load value of processor i lower-bounded by the load average of its domain. The distribution pattern coincides with the neighbour order kept in memory by processor i (j_1 < j_2 < ... < j_r), so processor i will send one load unit to each of the processors j_1, j_2, ..., j_α. Note that the value of α will always be smaller than the number of neighbours (α < r); otherwise, the first stage of DASUD would have performed some load movements and this part would not start up.
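Action 1 admits a very small sketch (illustrative Python; the sorted neighbour list is assumed to be available, and the processor is assumed to hold the domain maximum, as the action requires):

    def fld_action1(own_load, w_min_domain, sorted_neighbours):
        # Distribute alpha = w_max - w_min - 1 single load units, one unit per
        # neighbour, following the ascending index order j_1 < j_2 < ...
        alpha = own_load - w_min_domain - 1          # here w_i(t) = w_i^max(t)
        return {j: 1 for j in sorted_neighbours[:alpha]} if alpha > 0 else {}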
Action 2: If processor i is the processor with the maximum load of its domain (w_i(t) = w_i^max(t)), but not all the processors belonging to the domain have the same load (w_Ni^max(t) ≠ w_Ni^min(t)), then one unit of load is sent to one of the least loaded neighbouring processors, denoted by j_min, which is obtained as follows:

j_min = min{ j ∈ N_i | w_ij(t) = w_Ni^min(t) }

If action 1 or action 2 has produced some load movements, then the second stage of DASUD has finished. Otherwise, the next block to be executed will be the SIM block. A preliminary proposal of DASUD, which only incorporated the two actions described above, was reported in [Luq95].

Sending Instruction Messages (SIM) block

This block is related to the possibility of commanding load movements between non-directly connected processors. If the domain of processor i is not balanced (w_i^max(t) − w_i^min(t) > 1), but processor i is not the most loaded processor (w_i(t) ≠ w_i^max(t)), then processor i will command one of its neighbours with maximum load, j_max, to send a load unit to one of its neighbours with minimum load, j_min. The values of j_max and j_min are obtained as follows:

j_max = min{ j ∈ N_i | w_ij(t) = w_Ni^max(t) },   j_min = min{ j ∈ N_i | w_ij(t) = w_Ni^min(t) }

and the message (i, j_min, t, w_ijmax(t)) is sent from processor i to processor j_max, where i is the index of the processor that sends the message, j_min is the index of the target processor, t is the time instant at which the message is sent, and w_ijmax(t) is the load estimation of processor j_max that processor i keeps in memory when the
message is sent. As a consequence of this action, processor i commands the movement of one unit of load between two processors belonging to its domain. Note that these two processors may be non-directly connected processors. The PIM block will subsequently be executed.

Processing Instruction Messages (PIM) block

This block is always executed after the SIM block, and also when no load movements have been performed by the first stage of DASUD and the underlying domain is balanced. The block is related to the management of the instruction messages received from processors belonging to the underlying domain. Processor i considers the received instruction messages, which have the following content: (j, j', t', w_ji(t')). All messages whose fourth component satisfies w_ji(t') = w_i(t) are sorted according to the following criteria:
* firstly, in descending order of sending time t';
* secondly, in ascending order of index j;
* lastly, in ascending order of target processor j'.

The first element of the sorted list is chosen, and processor i sends one load unit to processor j', crossing processor j. Finally, at the end of each iteration of DASUD, processor i proceeds to delete all the instruction messages received from processors belonging to its domain. This action is always carried out independently of which previous stage has been executed.

As a consequence of applying the second stage of DASUD, some load movements can eventually be carried out to correct an unbalanced situation. This amount of load is denoted by δ_i(t), and it can be equal to 0, 1 or α. Notice that when no load movements are produced by the first stage of DASUD (s_ij(t) = 0 for all j), the value of δ_i(t) can be a non-negative value (δ_i(t) ≥ 0). Otherwise, if processor i
sends a portion of its excess load as a consequence of applying the first stage of DASUD, no extra load movements will be carried out by the second stage of DASUD (δ_i(t) = 0). Notice that the first stage of DASUD coincides with the SID algorithm. Then, bearing in mind that s_ij(t) and δ_i(t) identify the amount of load sent by processor i to its neighbours at time t, and denoting by r_ji(t) the amount of load that processor i receives from processor j at time t, the load of processor i at time t+1 can be expressed by the following formula:

w_i(t+1) = w_i(t) − Σ_{j ∈ N_i} s_ij(t) − δ_i(t) + Σ_{j ∈ N_i} r_ji(t)     (1)

Finally, the complete behaviour of DASUD is summarised in figure 3.4, where the LBA block of DASUD is provided.

LBA block
evaluate w̄_i(t)
if (w_i(t) > w̄_i(t)) {
    first stage: evaluate and send s_ij(t)
}
if (no load movement has been produced) {
    second stage: SUD, FLD, SIM and PIM blocks
}
delete the received instruction messages

Figure 3.4 LBA block of the DASUD algorithm in processor i
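Returning to the PIM block, the selection among the pending instruction messages can be illustrated with a short sketch (Python; messages are assumed to be stored as tuples (j, j_target, t_sent, estimate), which is only an illustrative layout):

    def pim_select(own_load, messages):
        # Keep only the messages whose recorded estimate still matches the current
        # load, then order them: most recent sending time first, then smaller
        # sender index j, then smaller target index j'.  The first one is served.
        valid = [m for m in messages if m[3] == own_load]
        if not valid:
            return None
        valid.sort(key=lambda m: (-m[2], m[0], m[1]))
        return valid[0]   # processor i sends one load unit to m[1], crossing m[0]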
DASUD load-balancng algorthm 3.3 An example of DASUD executon Ths secton llustrates the behavour of the DASUD load-balancng algorthm by consderng a 3-dmensonal hypercube wth an ntal load dstrbuton denoted by the load vector: w(0) = (4, 3, 5, 3, 2, 1, 3, 8). In fgures 3.5 the numbers nsde the nodes denote the load of each processor and the subndexes are the correspondng processor ndex. The load movement between two processors s ndcated wth an arrow between them and the label of each arrow represents the amount of load to be moved. The actons carred out at each stage of DASUD, as well as the most relevant parameters evaluated for each processor durng each teraton of the load balancng process, are summarsed n table 3.1. Partcularly, column 3 n the table llustrates the receved but not processed nstructon messages of each processor. The fourth and ffth columns nclude the value of the load average wthn the underlyng doman and the load excess of the underlyng processor, respectvely. The load movements generated by the frst stage of DASUD are shown n the sxth column. When no load movements are performed from processor / to any of ts neghbours as a consequence of applyng ths stage, the content of the correspondng cell s s v (\) = 0. Otherwse, the subndexes ndcate the source and destnaton processors of a partcular load movement. The remanng columns correspond to the dfferent blocks of the second stage of DASUD. A dscontnuous lne n the cell ndcates that the correspondng acton s skpped. The word no ndcates that no load movements are performed by that block. In the FLD block column, the amount of load to be moved and the destnaton processor are ndcated by / toj. For example, for processor 3 the expresson "1 to 2" n the column Acton 1 of the FLD block, ndcates that one load unt s moved to processor 2. Fnally, the SIM block column reflects the message to be sent and the destnaton processor. In ths example, two teratons of DASUD are needed to acheve the fnal balanced load dstrbuton. Durng the frst teraton, processors 1 and 8 perform load movements as a consequence of executng the frst stage of DASUD. Therefore, stage two of DASUD s skpped. Processors 2 to 7 do not perform load movements 83
Chapter 3 as a consequence of applyng the frst stage of DASUD (s, y (l) = 0), stage 2 s then carred out. All these processors detect that ther domans are not balanced, so the load-balancng algorthm goes on through the FLD block. Processor 3 observes that ts load s the largest wthn ts' doman and all ts neghbours have the same load value. Then, acton 1 of the FLD block s executed by processor 3 and 1 load unt wll be sent to processor 2. Processors 2, 4, 5, 6 and 7 produce no load movements when the FLD block s executed, each one then executes the SIM and PIM blocks. The nstructon messages sent by each one of these processors are reported n the SIM block column of table 3.1. As the current teraton s the frst one, no processor has receved messages, therefore, no processor wll perform any load movements by the PIM block. t At the begnnng of the second teraton of the DASUD algorthm, processors 2, 3 and 8 have nstructon messages to be processed. Processors 2 and 5 have enough load excess to dstrbute among ther neghbours by applyng stage 1 of DASUD then stage 2 wll be skpped. Processor 1, 6 and 7 detect that ther domans are unbalanced. As they have performed no load movements at the frst stage of DASUD, ther load-balancng process goes to stage 2. Processor 7 sends 1 load unt to processor 6 as a consequence of executng acton 2 of the FLD block, the SIM and PIM blocks wll then be skpped by ths processor. Processors 1 and 6 each send one nstructon message to processor 5 and nether of them perform load movements by the PIM block because they have not receved messages from the prevous teraton. Processors 3, 4 and 8 perform no load movements by the frst stage of DASUD and ther domans are balanced, ther computatons then, go on drectly to the PIM block. Processor 4 has not receved nstructon messages so no acton can be carred out by the PIM block. The receved-nstructon messages of processors 3 and 8 are dscarded because the load value of these processors have changed snce the messages have been sent. At the end of the load-balancng teraton/each processor deletes ts receved nstructon messages ndependently of whch prevous stage has been executed.! The fnal maxmum load dfference obtaned throughout the whole system s one unt of load, and so the system s balanced. 84
DASUD load-balancng algorthm teraton 1 teraton 2 unbalanced ntal load dstrbuton > - Load movement * - Sendng one nstructon message Fgure 3.5 An example ofdasud's executon balanced fnal load dstrbuton rfert.f; 1 2 Proc. 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 receved nstructon messages no no no no no no no no no (6,6,1,3) (2,6,1,5) no no no no (4,4,1,8) (5,6,1,8) (7,6,1,8) w,(0 3 3.75 3.5 5 3.75 2.25 4.25 4 4 3 4 3.75 3.25 3.5 3.25 4.25 4/«) -1 0.75-1.5 2 1.75 1.25 1.25-4 1-1 0-0.25-1.75 2.5-0.75 0.25 DASUD Stage 1 Unbalance d doman?, 5 (D = I,/) = o ty(l) = 0 *(,(!) = 0 *(,(» = 0 s,j(\) = 0 tf(l) = 0 %,(!) = 1 í 85 0) = 2 87 (l) = l íy(2) = 0, í 26 (2) = 1 *(,(2)-=0 50(2) = 0 *S6<2) = 1 */,(2) = 0 j, (2) = 0 ty(2) = 0... yes yes yes yes yes yes yes no no yes yes no 5toee2 FLD block Acton Acton 1 2 no 1 to 2 no no no no no -- no no no no no no no no no --... no 1 to 6 SIM block (2,6,1, 5) to 3 (4,4, 1,8) to 8 (5,6, 1,8) to 8 (6,6, 1,3) to 2 (7,6,1, 8) to 8 (1,1, 2,5) to 5 _ -~ (6,6,2,5) to 5... PIM block no... no no no no no dscard message s no no dscard message s Table 3.1 Evoluton of the DASUD executon for the example of fgure 3.5 85
3.4 DASUD's complexity

Now that we have described the behaviour of a single DASUD iteration, we shall analyse the complexity involved therein. Before starting the DASUD execution, all processors should sort the indices of their neighbours using the same criteria. However, the cost of this can be neglected because it is insignificant when the global computational cost is considered. One iteration of DASUD in a processor i involves different operations depending on which stage or block produces the load distribution among its neighbours. The pseudo-code introduced in figure 3.6 shows how the different blocks of DASUD are organised, and table 3.2 summarises the maximum number of operations that should be performed in each one (we recall that the number of neighbouring processors of processor i is denoted by r). From this overall analysis we can conclude that the computational complexity of DASUD is dominated by the complexity of the PIM block (O(r log r)). This means that the overall complexity of the computational period of DASUD is low.

{
  evaluate w̄_i(t), d_ii(t);
  /* FIRST STAGE */
  if (d_ii(t) < 0)
      evaluate s_ij(t) for all j in N_i \ {i};
  if (no load movements are performed by the FIRST STAGE)
  /* SECOND STAGE */
  {
      SUD block;
      if (unbalanced domain)
      {
          /* FLD block */
          if ((w_i(t) = w_i^max(t)) && (w_Ni^max(t) = w_Ni^min(t)))  Action 1;
          if ((w_i(t) = w_i^max(t)) && (w_Ni^max(t) ≠ w_Ni^min(t)))  Action 2;
          if (no load movement performed by Action 1 and Action 2)
          {
              SIM block;
              PIM block;
          }
      } else
          PIM block;
  }
  Deleting Instruction Messages;
}

Figure 3.6 Pseudo-code of one single DASUD iteration.
DASUD load-balancng algorthm Iteraton Intal Actons DASUD's operatons Actons knd of operaton check load estmatons memory accesses evaluate w,-() dvson addton ' evaluate d n (t) subtracton quantty r 1 r+1 1 evaluate d j(t) subtracton r Frst Stage evaluate D evaluate P -(t) addton dvson r r evaluate s j(t) multplcaton round r r SUD evaluate w^(t), w r(t), subtracton 2(r+1) Second Stage LFD SIM evaluate vl (f)-w ""(t): compare W(t)wth w, ma (0 send one message subtracton subtracton transmsson 1 1 1 PIM sort a lst of r elements dependng on 3 parameters comparsons Of / log r; Iteraton Completon Actons delete r messages Table 3.2. All possble operatons performed by one sngle DASUD teraton. In the followng sectons, formal aspects of DASUD are theoretcally studed by frstly regardng ts convergence. 87
í Chapter 3 3.5DASUD's convergence t I ' In ths secton we wll demonstrate DASUD convergence. For ths purpose we assume the partally asynchronous assumpton whch was ntroduced by Bertseka and Tstskls n [Ber89]. In that work, the authors dvded asynchronous algorthms t nto two groups: totally asynchronous and partally asynchronous. To paraphrase them, totally asynchronous algorthms "can tolerate arbtrarly large communcaton and computaton delays", but partally asynchronous algorthms "are not guaranteed to work unless there s an upper bound on those delays". Ths bound s denoted by a constant B called asynchronsm measure. That assumpton has been modfed to be appled to the realstc load model assumed by DASUD, and to consder some partcular characterstcs of the provded below: same. The formal descrpton of ths assumpton s Assumpton 1. (Partally Asynchronsm) There exsts a postve nteger B such that: a) for every / e P and for every t e N, {/, t + \,...,t + B-\] fl T #0 b) for every / e P, for every t e N and for every jen, t-b< T V (/)</ c) the load sent from processor, / to processor j at tme t s receved by processor j before tme t+b. ' d) the nstructon message sent by processor /to processor y at tme t, s receved by processor; before tme t+b. Part (a) of ths assumpton postulates that each processor performs a step of the load balancng process at least once durng any tme nterval of length 6; part (b) states that the load estmatons kept n memory by any processor at a gven tme t were obtaned at any tme between t-b and t; part (c) postulates that load messages wll not be delayed more than B tme unts as well as the nstructon message sent by processor /to processor/ as has been stated n part (d) of ths assumpton. Under assumpton 1, to prove the DASUD's convergence conssts of provng the followng theorem. Theorem 1. Under assumpton 1, DASUD s a fnte algorthm. 88
DASUD load-balancng algorthm For that purpose, we ntroduce the followng three defntons and two lemmas. Defnton 1. m,(0 = mm {w,(t') ep, t-3b<t'<t} s the mnmum load value among the total system at a gven nterval of tme of length three tmes 8 Defnton 2. m k (t) = mn {w,(t') \ ep, t-3b<t'<t w,.(f')>m t _,(0}, for any k>\. The value of m k (t) s the mnmum load value that occupes the k-st place f loads are sorted n ascendng order at a gven nterval of tme of length 38. Defnton 3. An agreement s: mn 0= L+1 Lemma 1. The sequence (m\ (/)), 20 s non-decreasng and upper bounded. Furthermore, there exsts a non-negatve nteger t } such that, a) m ] (t) = m ] (t ] ) for all t>t, and b) / _,,(0 = ^(0 = ^(0 = 0, Y/eP, Vt>t t Ve P such that w, (f, ) = m, (f, ) Lemma 1 states that there s a tme /, n whch the mnmum load value becomes stable (under DASUD algorthm) and all the processors wth that mnmum load value nether send nor receve any amount of load. Proof. Let fx a processor ep and a tme te N. We shall prove that, w,(/ + l) > w,(0. If tet then processor / does not execute the load balancng process, so t can receve load from some neghbour processor, but t has not sent any amount of load. Thus If t e T then two dfferent stuatons can be found dependng on whch block of DASUD produces the load movements: 89
Chapter 3 case 1: No load movements are generated by the FLD and PIM blocks of DASUD (s ()= o ). Two dfferent scenaros can be analysed: a) Processor / s an underloaded processor (d,,(t)>q). Then s,j(t) = Q for all neghbours and so that w (t +1) > w (t) > m, (t). b) Processor ; s an overloaded processor (d u (t) < 0). Then, 7=1 7=1 7=1 7=1 n = w (t) + d u (/) + r j, (f) = ïï, (í) + X r J W - "< W 7=1 7=1 n The nequalty (1) s obtaned by takng nto account that: 0- Moreover, the equalty (2) s true because y(0 = l Snce w,-(0" s the load average of some processors at a gven tme between t-b and t, t s obvous that, w,(0> w,(0- Hence w (t + ])>m,(0. case 2: Some unt of load can be moved by applyng FLD and PIM blocks of DASUD (S,(t)>0). In ths case processor / has not generated load movements by the frst stage of the DASUD algorthm (-s,j(t) = Q for all neghbours j). Two dfferent stuatons can be found: "Í a) Acton 1 or acton 2 of the FLD block of DASUD are performed. In ths case W l (/) = <"(/), and w, (/)-w,"'" (0 > (/). Thus w (/+1)=w, (/) - «y, (o + >> (O * w/ (O - s, w ><'" w > w, b) The PIM block of DASUD s executed, then processor / servces a message such as: (/, /, /', w yï (/')) ln ths case w,(0 = w ;,(O,.e., the estmaton load that processor y keeps n ts memory about processor / at the tme whch ths message was sent, concdes wth the current load value of processor /. Furthermore, t-2b<t'<t because all messages sent by a processor y to processor / before 90
DASUD load-balancng algorthm tme t -2B, wll be receved by ths processor before t-b and deleted before t. The meanng of the message (/, /, /', w,,(o) s the followng: so, there s a tme T, wth t-3b<t'-b<r < t'<t, such that w.,(t') = w,(r). Therefore, w (t)-w.-(t)>\ and we can conclude that w,,(t + 1) > w, (í) - 1 > W. (T) > m (O. As a concluson, we have proved that w (t + 1) >/«,(/) V/ e P and VfeN. So, m, (f +!)>/», (f) V/eN. Snce (/M, (/)), 20 s an upper bounded non-decreasng nteger sequence, (L s an upper bound) there exsts a nonnegatve nteger t 0 such that m (t) = m (t 0 ) Vf >/ 0. Then, part (a) of lemma 1 s proved for any tme t that t>t 0. We shall now prove part (b) of lemma 1. For ths purpose, we set P } (t) = {ep w (0 = w, (í)} as the set of processor wth mnmum load value of the system at tme f and we shall see that: P, (í 0 )2/ J 1 (/ +\)^P l (t 0 +2) 3...,.e., the sequence of sets of processors wth mnmum load s a non-ncreasng sequence beyond t 0. Let t > t Q and / e P \ P, (O *. We shall see that / g P, (í + 1),.e., f processor /' s not a processor wth mnmum load at tme t, then t wll never be a processor wth mnmum load value at tme t+1. ' Note that: P \ P } (t) = P, (t) 91
Chapter 3! t If t T t, then we have seen that w,(; + l) > w (t) > m,(0 = m,(f +1), so /*/>,(* +!). \ If t e r,,, some dfferent stuatons can be found: * * => f Sj(t)>Q then, as we have seen before w,(t + \)>m } (t) = m (t +1), so, /*/»,(* +!).!! => f,(0 = 0and processor/s an underloaded processor (d, v (0^0), then t has been proved that w, (t +1) > w, (0 > w, (0 = m (t +1), thus / g P, (í +1). => f,(0 = 0 and processor / s an overloaded processor (d,(t)<0), then we have that w, (t +1) > w (t) > m } (t) = m, (t +1) and / P, (t +1). j So, we can conclude that P, (t 0 )2P, ( 0 +l)3p ( 0 +2)3... Snce P,(/ 0 ) s a fnte set, there exsts an nteger í, >/ 0 such that P, (t) = P l (t } ) v/ > /,. Note that f processor / has the mnmum load value at tme í, (/e P, (/, )), then t,(0 = OV/>/,. Furthermore, ths processor becomes an underloaded processor (d(t)>0 V/>í,). In such stuaton, ^(O^O.VyeP, Ví>í,. Fnally, f the load of t processor / keeps constant and lt sends no load to any of ts neghbours, obvously, no load can be receved,.e. rj (t) = Q, Vy'eP, W>7,. Then, part (b) of lemma 1 s proved.! Snce part (a) has been proved for any t such that Vr > t 0 and í, > f 0, then lemma 1 s proved t Lemma 2. Let us defne P k (t) = {tp \ w,(t) = m k (t)}. Then, there exsts an ncreasng sequence t\,t 2... of postve ntegers such that, a) m k (t) = m k (t k ) for al t>t k, and '. *' b) ^(0 = ^(0 = ^(0 = 0, V/eP, vt>t k andv/eq/> A (/ t ) 92
DASUD load-balancng algorthm Ths lemma states that there exsts a tme t k beyond whch the mnmum load value of order k becomes stable, and all the processors that belong to the sets of processors wth load values equal to or less than that load value nether send nor receve any amount of load. Proof. We prove the result by nducton on k. For k = 1 we have lemma 1. Assume that k > 1 and the result s true for /c-1. k- Let fx a tme t>t k _ } +35 and a processor /ep\(jp A (/ A.,),.e., the load of processor / at tme t k _\ s bgger than mnmum load value of order k-1. We shall see that w (t + \)>m k (t). h=\ If t g T then processor / does not execute the load balancng process, so t can receve load from some neghbour processor, but t has not sent any amount of load. Thus If t e T then two dfferent stuatons can be found, dependng on whch block of DASUD produces the load movements: case 1: No load movements are generated by the FLD and PIM blocks of DASUD ( g (t) - 0 ). Three dfferent scenaros can be analysed: a) Processor / s an underloaded processor (d u (t) > 0). Then s^t) = W e P and so w(t + \'z w(t)>m(t). 93
Chapter 3! b) Processor / s an overloaded processor (d u (t)<0) and there exsts a r neghbourng processor j whose load s less than or equal to the mnmum load k-\ value of order k-1 (3 j e (J P,, (/*_,) such that/ e W,). Then the porton of excess /!=! *< I load to be moved to ths neghbour processor wll be bgger than the porton of any other neghbour y'whose load value s bgger than the mnmum load value of order k-1 (P,(0<^//(0 V/e P \ J/>,,(/*_,) and j e TV,). Therefore, 5., (0 < s j (0 = 0 and then w (t +1) > w (r) > m k (/). c) Processor / s an overloaded processor (d u (t) < 0) and there does not exst any neghbourng processor y whose load s less or equal to the mnmum load value *" 1. _ of order k-1 (Zje(JP h (t k _ ) such that/ e TV,) then w,(t)>m k (t). Bearng n mnd h=\ the proof of lemma 1, we see that: w,(t +1) > w (t), so w (/ +!)> m k (í) t case 2: Some unt of load can be moved by applyng FLD and PIM blocks of DASUD (S,(t)>Q). In ths case processor / has not generated load movements by the frst stage of the DASUD algorthm ( ^ (/) = 0 for all neghbours j). Two dfferent stuatons can be found: a) Acton 1 or acton 2 of the FLD block of DASUD are performed. In ths case ">,(')-<""(') ><?/(') Therefore, w / (/ + l) w,(0- /(0>w / mftl (0*^(0. The last nequalty s true because a neghbour j of processor / that has mnmal k-\ load cannot belong to (JJp/, (/*_,). If processor y belonged to ths set t could h=\ I not receve load from /', but <5,(0> 0 t b) The PIM block of DASUD s executed, then processor ; servces a message lke ths: (J,J",T,WJ (T)). Then w y/ (r)=w,(0, t-2b<r <t,.e., the estmaton 94
DASUD load-balancng algorthm load that processor) keeps n ts memory about processor / at the tme whch ths message was sent, concdes wth the current load value of processor /'. Then, the estmaton load of the target processor (/") that processor y keeps n ts memory s smaller than the current load of processor ; mnus one (w^.(r) < w (t)-\). The target processor has a load value bgger than the mnmum load value of order k-1 (fep\[jp h (t/! _ l )). A= where t k _ { < t-3b<r-b< r'< r<t. k-\ Now, w^(r)=w 7,(r'), Then w, (/ +1) = w, (/)-! + r jt (t) > w f (r 1 ) > m k (/). 7=1 Wth the above steps, we have proved that: Hence, the sequence of ntegers (m k (t)) t>t +3 # s a non-decreasng nteger sequence upper bounded by /., then there exsts a tme t' k wth t' k > t k _ } + 35 such that m k (t) = m k (t' k ) \/t>t' k. We have now proved the part (a) of lemma 2. Now, we shall prove part (b) of lemma 2. For ths purpose, we shall prove that P*(4)3/ > *( í * +1 )^/ > t('*+2)2... -e., the sequence of sets of processors wth mnmum load of order k s a non-ncreasng sequence beyond t k. k Let t > t' k and let / P \ (J p h (t). We shall see that f processor / s a processor /! =! wth a load value bgger than the mnmum load value of order k at tme t, then ts load 95
Chapter 3 t wll reman lower bounded by the mnmum load value of order k beyond t+1 If t $. T,,, then as we have already seen before, w (t + 1) > w, (t) > m k (t) = m k (t + 1) k and, therefore, ep\ [J p h (t + V). h = \ If t e T, then some dfferent'stuatons can be found: t => If o,(t)> 0, as we have already seen, we have that w (t + \)>m k (t) -m k (/ +!), *! and, therefore, ep\\jp h (t + Y). h = \ => If S (t)=0 and the procesèor / s an underloaded processor (d,,(t)^0) then t w,(t + \)> w (0 > m * k (0 = m k (t + ]) and, therefore, / e P \ J p h (t + 1). "=' :=> If <5,(0=0 and processor / s an overloaded processor ( /,,(/)<0) and there exsts a neghbourng processor) wth load value less than or equal to the mnmum load *-l value of order k-1 (3 j e (J p/, (0 such that y E N ) then s j (t) < ^^ (r) = O. t+b k-\ Then, as P J <(t)<p j (t) V fep\\\p h (t) wth/ea^,., we have that *í s,j<(t)<sj(t) =0, thus w (.(í + l)>w,.(0>'» A (0 =»í A (/ +!). Therefore ep\\jp h (t + \) " =1 If s (t)=0 and processor / s an overloaded processor (d (t)<0) and there does! not exst any neghbourng processor wth load value less or equal to the k-\ mnmum load value of order, k-1, (2y e(j/>/,(/) such thatyen,) then w, (í) > m k (t) = m k (/ +1) and t w (t +1) > w, (t). Therefore, / P \ (J p h (t +1). h=\ 96
DASUD load-balancng algorthm Hence, we can conclude that p k (t' k )^p k (t' k +l)3p k (t' k +2)^... Snce p k (t' k ) s fnte, 3 t k > t' k such that p k (t) = p k (t k ) v '^*- Note that f processor has the mnmum load value of order k at tme t k (e/» t (f t ))then,(0=0 Vt>t k. k-\ Moreover, f ay e y /»/,(/) wth j e N, then s (/ (0 = 0 V/ e P and W>/ t, and f wth j e NI then d,,(t)>0 Vt>t k and, therefore, ^,(0 = 0 V/ e P and W >f A. Hence, we also have r j, () = ü V/ e P and \ft>t k, part (b) of lemma 2 s proved. Proof (of the Theorem). Let t }, t 2,... the ncreasng sequence of postve ntegers of Lemma 2. * Snce P s fnte, there s a postve nteger k such that y p h (t k ) = P. h=\ Therefore, beyond tme t k the DASUD algorthm does not perform any addtonal load movement (a sketch of ths proof was presented n [Cor98]). Fnally, t should be noted that ths algorthm operates wthout the processors knowng the value of L. If the value of L vares, the algorthm s able to adapt to those changes durng the global load-balancng process. The DASUD's convergence proof provdes the bass for developng a general convergence proof for realstc teratve and totally dstrbuted load-balancng algorthms. A general model for realstc load-balancng algorthms s developed and the convergence of ths realstc load-balancng model s proved. Ths mportant contrbuton s ncluded n appendx A. 97
3.6 DASUD's convergence rate

As has already been commented, DASUD is an iterative load-balancing algorithm; therefore, after having established its convergence, the next logical step is to determine the number of iterations required to reach a stable state. Given the difficulty of obtaining this number exactly, an upper bound for the convergence rate is conjectured. We motivate this proposal by an analysis of the worst initial load distribution, in which all the load is concentrated in one processor. Bearing in mind that n is the number of processors of the system and L is the total load, the maximum amount of load to be moved through the system is (n − 1) blocks of ⌈L/n⌉ units of load. If one multiplies this amount of load by the maximum number of edges to be crossed in the worst case, one obtains an upper bound on the total number of unit-load movements. From this bound the following conjecture is derived:

Conjecture A. DASUD achieves the stable situation within at most (n/2)·(L + 1) steps.

A more accurate upper bound can be conjectured by considering the topology diameter (d) and the maximum initial load difference (D_0).

Conjecture B. DASUD achieves the stable situation within at most (d/2)·(D_0 + 1) steps.

As their name indicates, both proposed upper bounds are conjectures and remain valid only up to the point at which no counterexample is found. Since no mathematical proof is provided for these conjectures, we decided to validate them experimentally by comparing the
DASUD load-balancng algorthm expermental results wth the theoretcal conjectured value. In partcular, we only valdate Conjecture B for beng more precse than Conjecture A. Ths expermantal valdaton s provded n chapter 5 of ths work. 3.7Perfect local balance acheved by DASUD As we have reported n the general assumptons ntroduced n chapter 2, a doman s consdered to be unbalanced, as s the entre system, when the maxmum load dfference between any two processor belongng to t s bgger than one load unt. Recallng from secton 3.2 that one teraton of DASUD n a processor / s composed by two load-balancng stages where the frst stage performs a coarse load dstrbuton of the load excess of the underlyng processor, whereas the second stage produces a more accurate excess dstrbuton n order to acheve the perfect load balance. In secton 2.4.1 we have provded the descrpton of the SID algorthm, as well the unbalanced stuatons that SID was not able to balance. The same examples, used to exemplfy SID's problems wll be used n ths secton to show how DASUD s able to solve them. These cases are shown together n fgures 3.7. 9 (a) (b) (c) Fgure 3.7 Three unbalanced load dstrbutons acheved by the SID algorthm. O (a) (b) (c) Fgure 3.8 Balanced load dstrbutons aceved by DASUD. 99
Chapter 3 Let us analyse each one of these three problems ndvdually by startng wth example 3.7(a). In ths case, snce the local load average evaluated by the red processor s equal to 4.6, ts load excess s 3.4. Snce the frst stage of DASUD results n no load movng, the second stage must be executed, more precsely, acton 1 of the FLD block. The red processor dstrbutes 3 load unts ndvdually amongst certan neghbourng processors'. The fnal load dstrbuton obtaned after the DASUD teraton n the red processor s llustrated n fgure 3.8(a). Let us now consder the unbalanced stuaton depcted n fgure 3.6(b). In ths case, not all processor neghbourng the red one have the same load value, but the maxmum load value corresponds to ths one. As happens wth the prevous example, the second stage of DASUD should be executed because the frst one generates no load movements, but now the acton 2 of the FLD block wll be executed nstead of acton 1. Acton 2 generates the red processor sendng one load unt to one of ts less loaded neghbours. In partcular, to the processor whch s dentfed wth the smaller ndex n the sorted lst of the underlyng processor. Therefore, the load dstrbuton acheved n ths case s the one depcted n fgure 3.8(b). > > Fnally, the unbalanced load dstrbuton shown n fgures 3.7(c) s treated. In ths case, the processor n yellow detects that ts doman s not balanced because the maxmum load dfference wthn, t s bgger than one load unt, but t has no excess load to move. Therefore, t commands the red processor to send 1 load unt to one of the blue processors, specfcally the one whose ndex s the smallest. Ths commandng acton s performed by sendng an nstructon message from the yellow processor to the red one. The block responsble for sendng ths s the SIM block. Ths nstructon message wll be receved by the red processor n a posteror tme nstant and t wll be stored to be processed when requred. The DASUD's executon n the red processor goes drectly to ts second stage because ths processor has no load excess. Snce the underlyng doman s detected to be balanced, the PIM block s executed and, as a consequence, one load unt s sent from the red processor to the selected blue one va de yellow processor. The fnal load dstrbuton after ths balancng process s depcted n fgures 3.8(c). 100
DASUD load-balancng algorthm In concluson, we have seen that DASUD has the ablty of detectng unbalanced domans and gudng the necessary load movements to acheve a local load dstrbuton where the maxmum load dfference s one load unt. However, ths fact does not ensure the capablty of reachng an even global load dstrbuton. The followng secton deals wth ths fact. 3.8 Global balance degree acheved by DASUD As we have just seen, on DASUD's completng ts executon, each processor's doman s balanced. Snce the balance condton s locally but not globally assured, stuatons such as the one shown n fgure 3.9 may be attaned. Fgure 3.9 Global state unbalanced, local doman balanced As we can observe n the fgure, each doman s balanced because ts maxmum load dfference s one load unt, but the exstence of overlapped domans stems from havng a global unbalance load dstrbuton snce the global maxmum load dfference s 3 load unts. Notce that n ths example, each processor observes ts doman as balanced. However, each processor s not able to control the balance degree between processors outsde ts doman, although ther domans overlap wth the underlyng doman. All that can be assured s that the load from the relevant processors of the 2 overlapped domans wll dffer at most n one load unt of the processor load common to both domans. However, ths fact does not apply between the non-common processors of two overlapped domans as s observed from fgure 3.9 where the maxmum load dfference between a par of non-common processors belongng to two overlapped domans s sometmes 2 load unts. In the worst case, ths effect would be spread by the shortest path between two processors located at the maxmum dstance,. e. by a path wth a dstance equal to the dameter of the archtecture (d) drvng the system to a fnal global unbalanced load dstrbutons. We call such effect a "platform effecf. Thus, an upper bound for the fnal maxmum global load dfference should be delvered. 101
Let t_f be the time instant at which DASUD finishes in all processors. Then the maximum load difference within any domain is upper bounded by one load unit, which is formally denoted as follows:

    w_j(t_f) - w_k(t_f) <= 1,   for all j, k in N_i ∪ {i} and for all i in P.

Therefore, if i, j ∈ P and i = l_0, l_1, l_2, ..., l_r = j is a minimum-length path between processors i and j then, since l_k and l_{k+2} both belong to N_{l_{k+1}} for every k, the pairs {l_0, l_2}, {l_2, l_4}, ... can be chained along the path to obtain

    | w_i(t_f) - w_j(t_f) | <= s + 1,   where s = ⌊ r / 2 ⌋.

The previous inequality corresponds to the formal description of the platform effect described above. Therefore, if d is the diameter of the architecture graph G, then r <= d for any pair of processors, i.e., the maximum global load difference at the end of DASUD is upper bounded by p, which is defined as follows:

    p = ⌊ d / 2 ⌋ + 1.
A new distributed diffusion algorithm for dynamic load-balancing in parallel systems

Chapter 4
Comparative study of nearest-neighbour load-balancing algorithms

Abstract

In this chapter, the proposed load-balancing algorithm DASUD is compared to the three nearest-neighbour load-balancing algorithms described in chapter 2: SID, GDE and AN. The simulation framework includes different interconnection networks, such as the hypercube and the torus, a wide set of system sizes ranging from 8 to 128 processors, and different load distribution patterns that vary from lightly unbalanced to highly unbalanced situations. The comparison has been carried out in terms of stability and efficiency. Stability concerns the goodness of the final stable load distribution achieved by each of the tested algorithms, and efficiency measures the cost incurred in reaching that final situation, in terms of the number of simulation steps and the amount of load moved during the global load-balancing process.
4.1 Simulation framework

Recall from chapter 1 that, in a totally distributed load-balancing framework, each processor in the system alternates its execution time between computational operations from the underlying application and load-balancing operations. These load-balancing operations are divided into three blocks: the LM (Load Manager) block, the LBA (Load-Balancing Algorithm) block and the MM (Migration Manager) block. This decomposition of the load-balancing process into three blocks allows experimenting in a "plug & play" fashion with different strategies at each block. As mentioned in chapter 2, we are interested in analysing different nearest-neighbour strategies with respect to the behaviour of their LBA (Load-Balancing Algorithm) block. For that purpose, we have developed an LBA simulator, which allows us to:

test the behaviour of different load-balancing algorithms under the same conditions;
evaluate the behaviour of the load-balancing algorithms for different processor networks;
evaluate the behaviour of the algorithms for different load situations.

This simulation environment has been used to evaluate the effectiveness of the load-balancing algorithms analysed in chapters 2 (SID, GDE and AN) and 3 (DASUD). One simplification was adopted in the load-balancing simulation process in order to simplify programming and make the results easier to interpret: we assumed that all the simulated algorithms were globally synchronised. All processors perform the while-loop introduced in section 2.3 in global simulation steps, where no processor proceeds with the next iteration of its load-balancing process until the current one has finished in all processors of the system. Therefore, the load-balancing simulation proceeds by consecutive simulation steps, a simulation step being the execution of one iteration of the load-balancing operations in as many system processors as possible in a simultaneous manner.
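The synchronous step structure just described can be captured in a few lines. The sketch below is an illustrative skeleton, not the actual LBA simulator of this thesis: lba_step and active_set are placeholder names for, respectively, one iteration of whichever LBA block is being simulated on a processor and the rule that decides which processors may run in a given step; the termination condition used is the one adopted later in this section (two consecutive steps without load movement).

```python
# Minimal sketch of the synchronous simulation loop (illustrative, assumed names).
def simulate(loads, neighbours, lba_step, active_set, max_steps=10_000):
    """Run global simulation steps until no load moves for two consecutive steps."""
    loads = list(loads)
    quiet_steps, step = 0, 0
    while quiet_steps < 2 and step < max_steps:
        step += 1
        moves = []
        for i in active_set(step, loads):            # processors allowed to run this step
            moves += lba_step(i, loads, neighbours)   # each returns (src, dst, amount) moves
        for src, dst, amount in moves:                # apply all decisions of the step at once
            loads[src] -= amount
            loads[dst] += amount
        quiet_steps = quiet_steps + 1 if not moves else 0
    return loads, step                                # final distribution and last_step
```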
Although the simulation is performed synchronously for all of them, each simulated load-balancing algorithm (DASUD, SID, GDE and AN) has different synchronisation requirements. Let us describe, for each algorithm, what is identified as a step of the load-balancing simulation process.

DASUD and SID have no synchronisation requirements between processors; therefore the definition of one simulation step coincides for both algorithms and consists of executing the algorithm simultaneously in all processors of the system once. This definition applies neither to GDE nor to AN, because they have particular synchronisation requirements that SID and DASUD do not. On the one hand, the GDE algorithm superimposes an order on the communication steps, guided by the number of dimensions of the underlying topology. Therefore, if the dimension of the underlying topology is equal to c (recall from chapter 2 that dimension and edge colour are assumed to be the same), then one step of the load-balancing simulation process consists of concurrently executing the algorithm, once, in all processors that have one edge in the inspected dimension. Notice that one step does not coincide with the term sweep (introduced in section 2.4.2), which corresponds to executing as many load-balancing steps as there are dimensions in the underlying system. In this case, as has already been observed, the all-port communication model assumed in this discussion is not fully exploited. On the other hand, the restriction imposed by the AN algorithm lies in the impossibility of executing the load-balancing process simultaneously in processors whose domains overlap, giving rise to the creation of groups called serialisation sets, described in section 2.4.3. Therefore, we identify one step of the load-balancing simulation process as the simultaneous execution of the load-balancing operations in all processors belonging to the same serialisation set.
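These three scheduling rules map naturally onto the active_set hook of the loop sketched above. The policies below are illustrative renderings of the rules just described, written by us rather than taken from the simulator; edge_colours and sets are assumed input structures.

```python
# Illustrative per-step scheduling policies for the synchronous simulator sketch.
def all_processors(n):
    """SID and DASUD: every processor runs the algorithm at every step."""
    return lambda step, loads: range(n)

def gde_dimension(edge_colours, n, c):
    """GDE: at step s only the dimension (edge colour) s mod c is inspected; a
    processor is active only if one of its edges has that colour.
    edge_colours[i] is assumed to be the set of colours of processor i's edges."""
    return lambda step, loads: [i for i in range(n) if (step % c) in edge_colours[i]]

def an_serialisation_sets(sets):
    """AN: processors with overlapping domains never run together; precomputed
    serialisation sets are activated in round-robin order."""
    return lambda step, loads: sets[step % len(sets)]
```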
Figure 4.1 indicates in red those processors that execute load-balancing during a given simulation step for each of the simulated algorithms, where the underlying topology is a 3x3 torus.

Figure 4.1 Processors that execute load-balancing simultaneously in a certain simulation step under SID and DASUD (a), GDE (b) and AN (c).

The load-balancing simulation process was run until global termination was detected. This termination condition can be a limit on the number of simulation steps set beforehand, or the detection that no load movements have been carried out from one step to the next, i.e., that the algorithms have converged. Specifically, in our experiments, simulations were stopped when no new load movements were performed during two consecutive steps of the load-balancing simulation process. We refer to the simulation step at which the load-balancing simulation finishes as last_step. Although the simulation did not mimic the truly asynchronous behaviour of some algorithms, its results can still help us to understand the performance of the algorithms, since the final load imbalances are the same whether an algorithm is implemented synchronously or asynchronously; the main difference is in the convergence speed. Subsequently, we describe the complete simulation framework by first presenting the different kinds of interconnection networks used, followed by the set of initial load distributions applied.

4.1.1 Interconnection Networks

The load-balancing simulator has been designed to execute iterative load-balancing algorithms on arbitrary networks.
In our experimental study, the following k-ary n-cube topologies have been used: the 2-ary n-cube (hypercube) and the k-ary 2-cube (2-dimensional torus). The sizes of these communication networks were 8, 16, 32, 64 and 128 processors. However, in order to have square k-ary 2-cubes, instead of 8, 32 and 128 processors the sizes of these topologies were changed to 9 (3x3), 36 (6x6) and 121 (11x11), respectively.

4.1.2 Synthetic load distributions

In our simulations, the problem size is known beforehand, and all the experiments included in this chapter are performed for a fixed problem size L equal to 3000 load units. Therefore, the expected final load at each processor, i.e., the global load average, can be evaluated a priori as ⌈L/n⌉ or ⌊L/n⌋, n being the size of the topology. We generated an initial set of synthetic load distributions that were used as inputs to the simulator. The set of initial load distributions was classified into two main groups: likely distributions and pathological distributions.

Likely distributions cover the situations that are assumed to appear in real scenarios, where most of the processors start with a non-zero initial load. In this case, each element w_i(0) of the initial global load distribution, denoted by w(0), has been obtained by random generation from one of four uniform distribution patterns. These four patterns cover a wide range of likely configurations, from highly balanced to highly unbalanced initial situations:

Initial load distributions varying 25% from the global load average: ∀i  w_i(0) ∈ [L/n − 0.25·L/n, L/n + 0.25·L/n]
Initial load distributions varying 50% from the global load average: ∀i  w_i(0) ∈ [L/n − 0.50·L/n, L/n + 0.50·L/n]
Initial load distributions varying 75% from the global load average: ∀i  w_i(0) ∈ [L/n − 0.75·L/n, L/n + 0.75·L/n]
Initial load distributions varying 100% from the global load average: ∀i  w_i(0) ∈ [L/n − L/n, L/n + L/n]

The 25% variation pattern corresponds to the situation where all processors have a similar load at the beginning, and these loads are close to the global average, i.e., the initial situation is quite balanced.
On the other hand, the 100% variation pattern corresponds to the situation where the difference of load between processors at the beginning is considerable. The 50% and 75% variation patterns constitute intermediate situations between the other two. For every likely distribution pattern, 10 different initial load distributions were used.

The group of pathological distributions was also used in order to evaluate the behaviour of the strategies under extreme initial distributions. In these distributions, a significant number of processors have a zero initial load. Such scenarios seem less likely to appear in practice, but we have used them for the sake of completeness in the evaluation of the strategies. The pathological distributions were classified into four groups:

A spiked initial load distribution, where all the load is located on a single processor: w(0) = (L, 0, ..., 0), i.e., there are n−1 idle processors in the system.
25% of idle processors: a quarter of the processors have an initial load equal to 0.
50% of idle processors: half of the processors start with an initial load equal to 0.
75% of idle processors: only a quarter of the processors hold all the initial load.

In addition to the above-mentioned load distributions, each one was scattered using two different shapes, a single mountain shape and a chain shape, defined as follows:

Single Mountain (SM), where the load values of the initial load distribution have been scattered by drawing a single mountain surface, i.e., there is a localised concentration of load around a given processor in the network. The unbalance is therefore concentrated, and its real magnitude is not easily recognisable with a simple overview of the system (see figure 4.2(a)).
Chain, where the load values of the initial load distribution have been scattered by drawing multiple mountain surfaces, i.e., there are several processors with a local concentration of load and, as a consequence, the unbalance is distributed homogeneously across the system (see figure 4.2(b)).

Figure 4.2. Two shapes: Single Mountain (a) and Chain (b)

As a consequence, we have evaluated not only the influence of the values of the initial load distribution, but also the influence of how these values are placed onto the processors. To sum up, the total number of distributions tested for a given processor network was 87, obtained in the following way: 10 likely distributions * 4 patterns * 2 shapes + 3 pathological distributions * 2 shapes + 1 spiked pathological distribution.
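The generation of such synthetic inputs can be sketched as below. This is one possible way to produce the likely, pathological and spiked patterns described above; the integer rounding and the single-mountain scattering are simplified choices of ours and do not reproduce the exact generation code used in the thesis.

```python
# Illustrative generation of the synthetic initial load distributions (assumed details).
import random

def likely(n, L, variation):
    """Likely pattern: w_i(0) uniform in [L/n - v*L/n, L/n + v*L/n], rescaled to total L."""
    avg = L / n
    w = [random.uniform(avg - variation * avg, avg + variation * avg) for _ in range(n)]
    scale = L / sum(w)
    return [round(x * scale) for x in w]

def pathological(n, L, idle_fraction):
    """Pathological pattern: a fraction of processors starts with zero load."""
    idle = set(random.sample(range(n), int(idle_fraction * n)))
    w = [0 if i in idle else random.uniform(0, 1) for i in range(n)]
    scale = L / sum(w)
    return [round(x * scale) for x in w]

def spiked(n, L):
    """Spiked pattern: all the load on a single processor, n-1 idle ones."""
    return [L] + [0] * (n - 1)

def single_mountain(w):
    """SM shape: sort the values into a profile that rises and then falls,
    concentrating the large values around one point."""
    s = sorted(w)
    return s[::2] + s[1::2][::-1]

print(single_mountain(likely(16, 3000, 0.75)))
```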
The study outlined below compares the simulated algorithms (DASUD, SID, GDE and AN) according to their stability and efficiency. For this purpose, the following section introduces the quality indexes measured to perform the study; the subsequent sections provide the stability and efficiency analysis of the simulated load-balancing algorithms; and the last section of this chapter summarises all the results and reports the main conclusions of the comparative study.

4.2 Quality metrics

Stability measures the goodness of the final load distribution achieved by a load-balancing algorithm. Bearing in mind that we are dealing with integer load values, the final balanced state is the one where the maximum load difference between any two processors of the topology is zero or one, depending on L and the number of processors: if L is an exact multiple of n, the optimal final balanced state is the one where the maximum load difference between any two processors of the system is zero; otherwise, it is one. In our experiments, two different indexes have been measured to evaluate the stability of the compared algorithms:

dif_max: maximum load difference between the most loaded processor and the least loaded processor throughout the whole system;
σ: global standard load deviation.

Since efficiency reflects the cost incurred in arriving at the equilibrium state, the following two indexes were evaluated to measure this cost for all strategies:

steps: the number of simulation steps needed to reach a final stable distribution;
load units (u): the quantity of load movement incurred in the global load-balancing process. For a given step s of the simulation process, the maximum amount of load moved from any processor to one of its neighbours is called max_load(s). According to our synchronous simulation paradigm, step s does not end until max_load(s) units of load have been moved from the corresponding processor to its neighbour; therefore the duration of each step depends directly on the value of max_load(s). The underlying communication model is the all-port one, as commented in section 2.2, and therefore the value of u is evaluated as follows:

    u = Σ_{s=1}^{last_step} max_load(s)
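The four indexes are straightforward to compute from a simulation trace. The sketch below assumes a trace format of our own (a list, per step, of the (src, dst, amount) movements performed); the metric definitions themselves follow the ones given above.

```python
# Illustrative computation of the quality indexes from an assumed trace format.
import statistics

def dif_max(loads):
    return max(loads) - min(loads)               # stability index dif_max

def sigma(loads):
    return statistics.pstdev(loads)              # global standard load deviation

def load_units(moves_per_step):
    """u = sum over steps of the maximum load moved over a single link in that step."""
    return sum(max(amount for _, _, amount in moves) if moves else 0
               for moves in moves_per_step)

# Example with a two-step trace: u = 4 + 2 = 6.
trace = [[(0, 1, 4), (2, 3, 1)], [(1, 2, 2)]]
print(load_units(trace))                          # -> 6
```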
According to the definitions presented above, whereas the steps index determines the number of simulation steps needed to reach the final load distribution without considering the duration of each one of them, the index u is a good measure for dealing with this aspect. Furthermore, u is a representative index of the "time" incurred by each step of the simulation process, because the global time is directly related to the amount of load that has to be moved at each load-balancing step.

As mentioned in the description of our simulation framework, the global load-balancing process may be iterated until no load movements are carried out throughout the entire system, or it can be stopped at a predetermined step. We have selected the first of these two alternatives. Therefore, since we are evaluating the complete load-balancing process, the quality indexes described above are measured at the end of the load-balancing simulation process. In the following, the simulation results for stability and efficiency are reported, starting with the former.

4.3 Stability analysis

Stability is the ability of a load-balancing algorithm to coerce any initial load distribution into an equilibrium state. As mentioned previously, the final balance degree achieved by each of the compared load-balancing algorithms has been evaluated by measuring the dif_max and σ indexes at the end of the load-balancing simulation process. For both indexes, the influence of the following three parameters is individually considered:

the initial load distribution pattern (% variation from the global load average in likely initial load distributions, and % of idle processors in pathological distributions);
the system size (number of processors);
the shape of the initial load distribution (single mountain versus chain).
The results obtained for each of these parameters are outlined in the following sections. For all of them, the influence of the underlying interconnection network (hypercube and torus), as well as that of the two groups of initial load distributions (likely and pathological), has been considered individually by showing the results in separate graphics or tables. Finally, at the end of the stability analysis, the conclusions extracted from these experiments are outlined.

4.3.1 Influence of the initial load distribution pattern on dif_max

Figures 4.3 and 4.4 show the average results obtained in the stability comparison in terms of dif_max for hypercube and torus, respectively. In particular, for hypercube interconnection networks, the influence of likely and pathological initial load distributions is depicted in figures 4.3(a) and 4.3(b), and for torus topologies the influence of both load patterns is shown in figures 4.4(a) and 4.4(b), respectively. The maximum load difference obtained by SID is always greater than the one obtained by DASUD, GDE and AN, independently of the initial load distribution and the underlying topology. DASUD, GDE and AN have the quality of keeping the maximum load difference nearly constant for any load distribution pattern. Nevertheless, DASUD outperforms GDE and AN because it obtains a better final balance degree in all cases. On average, GDE obtained a maximum difference of 3.2 for torus and 3.3 for hypercubes. AN keeps its maximum load difference bounded by 2.2 and 2.5 for hypercube and torus topologies, respectively. Finally, DASUD obtained a maximum difference of 1.4 for hypercube and 1.8 for torus. A negligibly slight increase in the maximum difference was obtained on average by these strategies for pathological distributions, but their relative positions are maintained. Notice that the behaviour of all simulated algorithms is very similar whatever topology is used, and only a slight increase in the final maximum load difference is observed for torus topologies.
Figure 4.3 Maximum load difference for the DASUD, SID, GDE and AN algorithms considering (a) likely and (b) pathological initial load distributions for the hypercube topology with respect to the initial load patterns. Panel (a), "Likely distributions (Hypercube)", plots dif_max against the % variation from the average (25, 50, 75, 100); panel (b), "Pathological distributions (Hypercube)", plots dif_max against the % of idle processors (25, 50, 75, n-1).
Figure 4.4 Maximum load difference for the DASUD, SID, GDE and AN algorithms considering (a) likely and (b) pathological initial load distributions for the torus topology with respect to the initial load patterns. Panel (a), "Likely distributions (Torus)", plots dif_max against the % variation from the average (25, 50, 75, 100); panel (b), "Pathological distributions (Torus)", plots dif_max against the % of idle processors (25, 50, 75, n-1).
4.3.2 Influence of the system size on dif_max

Tables 4.1(a) and 4.1(b) show, for hypercube and torus topologies respectively and for likely and pathological initial distributions, the influence of the size of the architecture on the final balance for all simulated algorithms. The values included in those tables are the mean values over all initial load distribution patterns within each group of initial load distributions; tables B.1 and B.2 in appendix B include the individual values for each load pattern for hypercube and torus, respectively.

From the results shown in tables 4.1 we can extract that, as the number of processors increases, the maximum difference obtained at the end likewise increases for DASUD, GDE and AN. The increase is not very significant for these three algorithms; for instance, on average, the maximum difference was always less than 3 for DASUD, 5 for GDE and 4 for AN when the number of processors was 128, for both likely and pathological distributions and for hypercube interconnection schemes. When DASUD is considered for torus topologies, the maximum load difference obtained is 4, whereas for GDE and AN it is 5, for the largest size (121 processors). In contrast, the SID algorithm exhibits a different behaviour: for large system sizes (121 or 128 processors), the maximum load difference slightly decreases instead of increasing. Since the problem size remains constant for any system size, the global load average decreases as the number of processors increases; therefore, initial load distributions for large architectures are less unbalanced than initial load distributions for small systems. Consequently, SID does not improve its balancing ability as the number of processors increases, but rather takes advantage of the more balanced distribution at the beginning of the load-balancing process.

By comparing the results obtained for both topologies, we can conclude that all strategies perform slightly better on hypercube schemes than on torus schemes. The size of the diameter, which is directly related to the domain size, is the main reason explaining this phenomenon.
Since the diameter of the torus is bigger than the hypercube's diameter for the same (or a similar) number of processors, a load gradient effect appears along overlapping domains which has more incidence in torus than in hypercube topologies.

Table 4.1(a) Hypercube (dif_max)

Number of procs | likely distributions          | pathological distributions
                | DASUD  SID    GDE   AN        | DASUD  SID    GDE   AN
8               | 0.4    8.76   2.32  0.8       | 0.74   3.75   2.5   0.25
16              | 1      23.73  2.65  1.27      | 1      18.25  3     1.5
32              | 1.56   31.27  3.2   1.95      | 1.75   43.73  3.37  2.12
64              | 1.98   27.01  3.74  3.25      | 2.25   49.37  3.87  2.62
128             | 2.28   18.37  4.08  3.77      | 2.37   32.37  4     4.12

Table 4.1(b) Torus (dif_max)

Number of procs | likely distributions          | pathological distributions
                | DASUD  SID    GDE   AN        | DASUD  SID    GDE   AN
9               | 0.85   8.1    1.77  1.07      | 1      6      2     1
16              | 1      24.01  2.52  1.4       | 1.25   17.5   2.25  1
36              | 1      34.48  2.92  2.37      | 1      54.87  3.25  2
64              | 2      29.71  4.18  3.05      | 2      56     4.12  2.87
121             | 3.05   20.01  4.87  4.25      | 3.75   43.12  4.5   4.12

Table 4.1 Maximum load difference for DASUD, SID, GDE and AN considering likely and pathological initial load distributions for hypercubes (a) and torus (b) with respect to the architecture size.

4.3.3 Influence of the initial load distribution shape on dif_max

As a final consideration of the stability analysis with respect to the maximum load difference within the system (dif_max), we have examined the results obtained according to the shape used to scatter the initial load distribution. In all the experiments we have considered two different shapes for every initial load distribution: a Single Mountain (SM) shape and a Chain shape. For all topologies we have observed that the final maximum load difference depends on how the load distribution was scattered through the system. Tables 4.2(a) and 4.2(b) show this dependency for hypercubes and torus respectively, for both likely and pathological distributions.
Each number is the mean value over all distribution patterns, which are reported in tables B.3 and B.4 in appendix B. One can observe that, on average, with the chain-shape initial scattering the final state obtained is slightly more balanced than with the single-mountain scattering, for both topologies and for all simulated algorithms. This behaviour can be explained because, with the single mountain shape, there is a strong load gradient across the whole system. As a consequence, since the maximum local load difference that can be achieved in the best case is 1 load unit, the existence of a strong initial load gradient favours maintaining a global load gradient at the end of the load-balancing process. This effect has a remarkable influence on the SID algorithm, because it is not able to detect unbalanced domains and, when the load is scattered using the single mountain shape, all domains are initially unbalanced. With the chain shape, the load is scattered into several high-load areas surrounded by low-load areas. As a consequence, the initial load gradient is lower than in the single mountain shape, and it is therefore easier for all strategies to correct the global unbalance.

Table 4.2(a) Hypercube (dif_max by shapes)

      | likely distributions            | pathological distributions
      | DASUD  SID    GDE   AN          | DASUD  SID    GDE   AN
SM    | 1.725  32.17  3.35  2.17        | 1.9    38.3   3.65  2.93
Chain | 1.185  17.39  3.07  2.16        | 1.06   16.9   3.25  2.7

Table 4.2(b) Torus (dif_max by shapes)

      | likely distributions            | pathological distributions
      | DASUD  SID    GDE   AN          | DASUD  SID    GDE   AN
SM    | 1.59   34.08  3.35  2.42        | 1.75   42.15  3.49  2.47
Chain | 1.56   12.56  3.17  2.42        | 1.73   23.9   3.3   2.6

Table 4.2. Maximum load difference for DASUD, SID, GDE and AN considering likely and pathological initial load distributions for hypercubes (a) and torus (b) with respect to the shape of the initial load distribution.
4.3.4 Influence of the initial load distribution pattern on σ

Tables 4.3(a) and 4.3(b) show the average results obtained in the stability comparison in terms of the global standard deviation (σ) for hypercube and torus, respectively, considering the influence of likely and pathological initial load distributions in both cases. As can be observed from the results included in both tables, all simulated load-balancing algorithms behave similarly whichever initial load distribution group (likely or pathological) is applied. Only SID shows a slight difference between the global standard load deviation achieved for likely and for pathological initial load distributions. The rest of the simulated algorithms (DASUD, GDE and AN) are always able to drive the system to the same degree of balance whatever initial load distribution the load-balancing process starts from.

From the individual analysis of each algorithm we can extract the following conclusions: the AN algorithm exhibits the best final degree of balance, followed by DASUD and GDE in that order. As expected, the balance degree achieved by the SID algorithm is the worst. In previous sections we saw that the maximum load difference achieved by DASUD was smaller than the maximum load difference achieved by AN. We can therefore deduce that AN is able to bring most of the processors to a final load value equal to the global load average, whereas DASUD drives all processors to a final situation where their load values are very close to the global load average. However, since the maximum load difference obtained by AN is larger than the one obtained by DASUD, at the end of the load-balancing process with AN there is a small number of processors whose load values differ from the global load average by a significant amount.
Table 4.3(a) Hypercube (standard deviation σ)

      | likely distributions          | pathological distributions
      | 25%   50%   75%   100%        | 25%   50%   75%   n-1
DASUD | 0.21  0.2   0.2   0.21        | 0.2   0.2   0.2   0.2
SID   | 3.8   5.11  5.9   6.2         | 6.32  5.85  5.66  7.66
GDE   | 0.78  0.7   0.75  0.8         | 0.86  0.8   0.82  0.88
AN    | 0.1   0.11  0.1   0.11        | 0.1   0.1   0.1   0.1

Table 4.3(b) Torus (standard deviation σ)

      | likely distributions          | pathological distributions
      | 25%   50%   75%   100%        | 25%   50%   75%   n-1
DASUD | 0.3   0.31  0.31  0.31        | 0.345 0.34  0.34  0.35
SID   | 3.89  5.49  6.63  7.56        | 8.43  8.82  8.48  10.89
GDE   | 0.81  0.8   0.83  0.85        | 0.87  0.86  0.86  0.88
AN    | 0.14  0.14  0.14  0.14        | 0.17  0.17  0.17  0.17

Table 4.3 Standard deviation for DASUD, SID, GDE and AN considering likely and pathological initial load distributions for hypercube (a) and torus (b), for all system sizes and all shapes, with respect to the initial distribution patterns.

4.3.5 Influence of the system size on σ

Tables 4.4(a) and 4.4(b) show the global standard load deviation obtained by all the strategies with respect to the system size. The values included in each table are the mean values over all initial load distribution patterns; the separate values are included in tables B.5 and B.6 of appendix B.

As can be seen, DASUD, GDE and AN achieve a very low deviation for all topologies and distributions; in particular, it is less than 1.5 in all cases. In contrast, SID exhibits a much higher standard deviation: on average, more than 10 times the deviation obtained by the other three algorithms. Nevertheless, all strategies share a common trend as the system size increases: the balance degree obtained by all load-balancing algorithms
worsens as the number of processors grows. This characteristic reflects the fact that totally distributed load-balancing algorithms are affected by the architecture size and, in particular, by the diameter of the topology. We also observe a slight difference between the results obtained for each topology: the final balance degree attained in the torus is worse than the balance degree obtained in the hypercube topology. The reason for this difference is the diameter of each topology, since for the same diameter the final standard deviation reached is very similar whichever interconnection network is used. For example, the 4-dimensional hypercube and the 4x4 torus have the same diameter, equal to 4, and their final standard deviations coincide.

Table 4.4(a) Hypercube (standard load deviation σ)

Number of procs | likely distributions        | pathological distributions
                | DASUD SID  GDE  AN          | DASUD SID   GDE  AN
8               | 0.03  2.6  0.6  0.02        | 0.0   1.2   0.7  0.0
16              | 0.5   5.2  0.7  0.21        | 0.5   4.7   0.8  0.21
32              | 0.0   5.8  0.8  0.0         | 0.0   8.4   0.8  0.0
64              | 0.01  6.3  0.9  0.01        | 0.01  8.5   1    0.01
128             | 0.5   6.4  0.9  0.3         | 0.5   9.3   1    0.3

Table 4.4(b) Torus (standard load deviation σ)

Number of procs | likely distributions        | pathological distributions
                | DASUD SID  GDE  AN          | DASUD SID   GDE  AN
9               | 0.35  2.5  0.5  0.15        | 0.47  2.1   0.5  0.18
16              | 0.5   5.7  0.6  0.2         | 0.5   4.4   0.6  0.21
36              | 0.01  5.8  0.8  0.01        | 0.01  11.3  0.8  0.01
64              | 0.01  7.4  1    0.02        | 0.01  13.6  1    0.01
121             | 0.72  8.2  1.2  0.33        | 0.75  14.4  1.5  0.43

Table 4.4 Global standard deviation for DASUD, SID, GDE and AN considering likely and pathological initial load distributions for hypercubes (a) and torus (b) with respect to the architecture size.
As a final consideration, we observe that there is no significant difference between the final balance degree achieved for the two groups of initial load distributions (likely and pathological). This means that the final balance degree obtained by these load-balancing algorithms does not depend greatly on the initial load distribution.

4.3.6 Influence of the initial load distribution shape on σ

The last parameter evaluated in the stability analysis is how the final balance degree is affected by the initial load distribution shape. The values depicted in tables 4.5(a) and 4.5(b) are the mean values of the final standard deviation over all initial load patterns in the corresponding initial load distribution group (likely or pathological). The individual values for each load pattern are included in tables B.7 and B.8 in appendix B. As in the analysis performed for the maximum load difference (dif_max), the global standard deviation is very similar whichever load scattering is applied in the case of DASUD, GDE and AN. Only for SID is a significant increment observed when the initial load distribution shape is Single Mountain. This fact confirms that the presence of a high unbalance gradient throughout the whole system favours the existence of domains that are balanced internally but unbalanced when compared to the domains they overlap with.

4.3.7 Conclusions of the stability analysis

In this section we summarise the main conclusions extracted from the stability analysis outlined previously. Table 4.6 presents these conclusions, taking into account each of the analysed parameters. Notice that, in general, DASUD, GDE and AN exhibit a common behaviour regarding the ability to drive the system to a stable state as close as possible to the even load distribution. DASUD and AN obtain the best results for all analysed parameters; to emphasise this fact, the names of both algorithms are written in red in table 4.6.
Table 4.5(a) Hypercube (σ by shapes)

      | likely distributions          | pathological distributions
      | DASUD SID   GDE   AN          | DASUD SID   GDE   AN
SM    | 0.2   7.43  0.76  0.1         | 0.2   8.06  0.84  0.14
Chain | 0.21  3.07  0.75  0.07        | 0.2   3.7   0.82  0.1

Table 4.5(b) Torus (σ by shapes)

      | likely distributions          | pathological distributions
      | DASUD SID   GDE   AN          | DASUD SID   GDE   AN
SM    | 0.32  8.45  0.82  0.14        | 0.345 10.74 0.87  0.17
Chain | 0.32  3.33  0.82  0.14        | 0.34  6.45  0.86  0.16

Table 4.5. Global standard deviation for DASUD, SID, GDE and AN considering likely and pathological initial load distributions for hypercubes (a) and torus (b) with respect to the shape of the initial load distribution.

Stability Summary

patterns:
  dif_max: DASUD, GDE, AN remain low and invariant for any topology, any initial load group and any initial load distribution pattern. SID is very high, and depends on the initial load distribution group.
  standard deviation (σ): DASUD, GDE, AN remain low and invariant for any topology, any initial load group and any initial load distribution pattern. SID is very high, and depends on the initial load distribution group.

system size:
  dif_max: DASUD, GDE, AN remain low and increase with the system size, but not greatly; higher for torus than for hypercube topologies. SID is very high and increases as the system size increases.
  standard deviation (σ): DASUD, GDE, AN remain low and increase with the system size, but not greatly; higher for torus than for hypercube topologies. SID is very high and increases as the system size increases.

shape:
  dif_max: DASUD, GDE, AN remain low and are slightly larger for the SM shape than for the Chain shape. SID is very high and larger for the SM shape than for the Chain shape.
  standard deviation (σ): DASUD, GDE, AN remain low and are slightly larger for the SM shape than for the Chain shape. SID is high and larger for the SM shape than for the Chain shape.

Table 4.6. Summary of the results of the comparative study with respect to the stability analysis.
4.4 Efficiency analysis

In this section, the comparison focuses on evaluating the costs incurred by the load-balancing process of all simulated algorithms, in terms of the number of load-balancing simulation steps needed to reach the stable state and the quantity of load units moved throughout the system during the complete load-balancing process (u). We recall from section 4.1 that one simulation step is defined as the execution of the load-balancing operations in as many processors as can run simultaneously, and that u (load units) accumulates the maximum amount of load moved at each simulation step over the whole load-balancing process. For the sake of simplicity, in the rest of this section we refer to simulation steps simply as steps. The exposition of the efficiency results follows the same scheme as the stability analysis: both efficiency indexes, load units (u) and steps, are analysed by considering the influence of the initial load distribution pattern, the system size and the shape of the initial load distribution. As in the stability case, all studies treat the underlying topology (hypercube and torus) independently, as well as the initial load distribution group (likely and pathological). Finally, the conclusions of the efficiency analysis are reported.

4.4.1 Influence of the initial load distribution pattern on u

Table 4.8 summarises the amount of load moved throughout the system for all simulated algorithms, taking into account the initial load distribution pattern. Each number is the mean value obtained over all system sizes. It can be observed that all strategies exhibit a similar behaviour for both initial load distribution groups (likely and pathological): as the initial unbalance degree increases, the amount of load movement required to reach the final load distribution increases as well. However, the amount of load moved by each individual algorithm presents some important differences. The two algorithms that expend the least effort in correcting the initial unbalance in terms of u are SID and DASUD; both behave, on average, in a similar way, independently of the underlying topology. AN behaves consistently across both topologies but exhibits a significant increment in the load moved throughout the system compared to SID and DASUD. Finally, GDE clearly depends on the underlying topology, obtaining worse results for torus
interconnection networks than for hypercubes. The main reason for these differences is the degree of concurrency allowed by each algorithm in the execution of the load-balancing operations among all system processors (simultaneously running processors). Since SID and DASUD exploit this capacity maximally, they are able to overlap more load movements than their counterparts; by contrast, GDE and AN restrict this degree of concurrency to a subset of processors within the system, and thus the total load movements are spread over more time. Notice, however, that GDE depends heavily on the underlying topology. This algorithm does not take advantage of the all-port communication model, because at each step a given processor can only perform load movements between itself and one immediate neighbour. However, when it is executed on a hypercube topology, since every processor has a link in each inspected dimension, all processors execute the load-balancing process simultaneously. This does not apply to torus topologies, because the number of links of each processor does not coincide with the number of dimensions (colours, in this case). For that reason, the total number of load units moved in the entire load-balancing process exhibits a considerable increment.

Table 4.8(a) Hypercube (load units u)

      | likely distributions               | pathological distributions
      | 25%    50%     75%     100%        | 25%     50%    75%     n-1
DASUD | 38.62  75.75   108.17  155.64      | 117     180.2  301.7   781.4
SID   | 29.41  61.92   93.66   140.47      | 99.7    160.9  284.4   753.8
GDE   | 45.1   91.26   82.49   183.92      | 263.12  389.3  444.8   1222.2
AN    | 65.16  129.24  180.42  264.18      | 289.9   425.8  517.5   1261.4

Table 4.8(b) Torus (load units u)

      | likely distributions               | pathological distributions
      | 25%    50%     75%     100%        | 25%     50%    75%     n-1
DASUD | 37.53  75.88   121.42  139.77      | 118.8   169.1  312.4   1041.2
SID   | 27.11  61.3    41.21   123.68      | 93.8    142.8  278.4   1012
GDE   | 168.9  247.5   337.2   459.9       | 365.76  750.9  1567.0  2957.6
AN    | 53.74  98.65   129.37  188.14      | 239.3   370.2  494.3   1152.8

Table 4.8 Load units moved (u) for the DASUD, SID, GDE and AN algorithms considering likely and pathological initial load distributions for hypercube and torus topologies with respect to the initial load distribution patterns.
4.4.2 Influence of the system size on u

Figures 4.5 and 4.6 give more detailed information about the movement of load units (u) for hypercube and torus topologies, respectively. From their analysis we can extract the following observations. First, DASUD and SID turn out to be independent of the underlying topology: starting from the same load distribution, either likely or pathological, both strategies produce similar results for hypercube and for torus. Secondly, the total amount of load moved during the load-balancing process by DASUD was only slightly higher than the quantity moved by SID for any initial load distribution (likely or pathological) and for any topology (hypercube or torus). In contrast, GDE and AN obtain, in general, worse results. The amount of load they move during the load-balancing process is, on average, twice that moved by DASUD and SID for likely initial load distributions and, for pathological patterns, this difference can be more than three times.

It is worth observing that GDE exhibits a topology dependence that becomes apparent under a more detailed analysis of its behaviour. This dependence is easily detected by comparing it with AN's results. Whereas the GDE algorithm obtains better results, on average, than AN for hypercube topologies, for the torus AN is faster than GDE, i.e., the movement measure is smaller for AN than for GDE. But notice that, since AN does not significantly vary its behaviour whatever topology is analysed (its u values are nearly the same for torus and hypercube), the change must stem from GDE's dependence on the underlying topology (its u values for torus are twice those for hypercube). One important reason for this is that GDE is based on the DE algorithm, described in chapter 2, which was originally developed for hypercube interconnection networks. Under this topology, at each step of the simulation process all processors execute load-balancing operations, because they all have an edge in every dimension of the hypercube. Therefore, in spite of not exploiting the all-port communication model, for hypercubes the maximum concurrency in load-balancing operations is achieved. For torus topologies, however, GDE suffers a considerable loss in efficiency: in each simulation step not all processors in the system will be executing load-balancing operations, since in this case the number of dimensions is obtained by minimally colouring the graph associated with the topology.
Figure 4.5 Efficiency in terms of the movement measure (u) for hypercube topologies, considering (a) likely and (b) pathological initial load distributions. Both panels, "Likely distributions (Hypercube)" and "Pathological distributions (Hypercube)", plot the load units moved against the number of processors (8, 16, 32, 64, 128) for DASUD, SID, GDE and AN.
Figure 4.6 Efficiency in terms of the movement measure (u) for torus topologies, considering (a) likely and (b) pathological initial load distributions. Both panels, "Likely distributions (Torus)" and "Pathological distributions (Torus)", plot the load units moved against the number of processors (9, 16, 36, 64, 121) for DASUD, SID, GDE and AN.
4.4.3 Influence of the initial load distribution shape on u

The effect of how the initial load distribution is scattered over the processors has also been investigated, in order to observe whether it has any influence on the evaluated indexes for all algorithms. Table 4.9 shows the amount of load moved on average by each algorithm to reach a stable state, depending on how the initial load distribution is scattered to the processors. Each value is the mean value over all sizes of hypercubes and torus and over all initial load patterns; the separate values are included in tables A.5 and A.6 in appendix A.

From the analysis of tables 4.9(a) and 4.9(b) we extract that, for the pathological group, the value of u for single mountain scattering is larger than that obtained when chain scattering is applied, for all strategies. This is attributable to the kind of load movements generated. When single mountain scattering is applied, all local load movements have the same global direction, from heavily loaded processors towards lightly loaded ones, because this kind of scattering generates a local load gradient equal to the global one; consequently, all local load movements are productive movements. Furthermore, since in the single mountain shape all the load is concentrated in the same region of the system while the remaining processors are idle, load redistribution becomes difficult: non-idle processors on the boundary of the loaded region continuously receive load from processors within the loaded area and send it on to the idle region, which increases the load moved around the system. On the other hand, when chain scattering is used, idle processors are surrounded by loaded processors and, consequently, the load distribution is faster and less laborious.

When we analyse the values obtained for likely initial load distributions, only GDE and AN behave in the same way as in the pathological cases. DASUD and SID change their behaviour, producing more load movement for the chain shape than for the single mountain shape. In this case, both algorithms are penalised by their capacity to have all processors executing load-balancing operations concurrently. The following scenario appears: some processors see themselves as locally load-maximum while not being globally maximum and, therefore, some load thrashing is generated. GDE and AN do not exhibit this problem, because only a few
processors execute the load-balancing process simultaneously, and the thrashing effect is avoided.

Table 4.9(a) Hypercube (load units u, by shapes)

      | likely distributions              | pathological distributions
      | DASUD   SID    GDE    AN          | DASUD   SID     GDE     AN
SM    | 88.34   69.94  110.5  192.21      | 344.05  320.45  629.07  705.35
Chain | 100.75  92.79  90.87  127.29      | 201     187.33  300.1   302.13

Table 4.9(b) Torus (load units u, by shapes)

      | likely distributions              | pathological distributions
      | DASUD   SID    GDE    AN          | DASUD   SID     GDE     AN
SM    | 88.46   69.48  332.3  118.63      | 420.45  385.75  1482.6  580.6
Chain | 92.99   84.1   274.5  116.32      | 140     346     798.13  346

Table 4.9 Load units moved (u) for DASUD, SID, GDE and AN considering likely and pathological initial load distributions for hypercube (a) and torus (b) topologies with respect to the shape of the initial load distribution.

4.4.4 Influence of the initial load distribution pattern on steps

Finally, in the following sections we deal with the number of simulation steps needed to reach the final stable load distribution. In this section, we analyse the influence of the initial load distribution pattern on the total number of steps performed during the load-balancing simulation process. As with the load units, the number of steps incurred by all simulated algorithms increases as the initial load imbalance increases, whichever initial load distribution group is observed. From an individual analysis of each load-balancing algorithm, the algorithm that incurs the fewest steps is SID; DASUD occupies second place in the ranking, followed by AN and, finally, GDE. However, all strategies need more steps to reach the final load distribution when they are executed on a torus topology than on a hypercube topology; the higher connectivity degree exhibited by hypercube interconnection networks is the reason for this difference. GDE obtains the worst results because it does not exploit the all-port communication model, and it needs more steps to become aware of load changes in each domain.
Table 4.10(a) Hypercube (steps)

      | likely distributions            | pathological distributions
      | 25%    50%    75%    100%       | 25%   50%   75%   n-1
DASUD | 9.56   13.47  15.26  16.78      | 15.7  17.2  17    24.2
SID   | 5.35   7.51   9.03   9.79       | 8.2   9.9   9.5   12
GDE   | 34.11  34.73  35.9   45.11      | 44.5  47    38.4  51.5
AN    | 29.56  33.6   35.6   38.64      | 38.8  39.6  38    51.2

Table 4.10(b) Torus (steps)

      | likely distributions            | pathological distributions
      | 25%    50%    75%    100%       | 25%   50%   75%   n-1
DASUD | 22.5   28.5   33.02  38.16      | 38.8  47.8  48.2  68
SID   | 5.25   9.09   11.25  14.11      | 11.6  19.1  22.4  38.6
GDE   | 47.76  49.24  50.9   52.9       | 66.1  67.8  69.5  70.4
AN    | 22.43  25.76  28.18  30.28      | 32.1  35.9  33.9  42.4

Table 4.10 Steps for the DASUD, SID, GDE and AN algorithms considering likely and pathological initial load distributions for hypercube (a) and torus (b) topologies with respect to the initial load distribution patterns.

4.4.5 Influence of the system size on steps

Figures 4.7 and 4.8 show, for the DASUD, SID, GDE and AN strategies, the average number of simulation steps needed for the whole simulation process for hypercubes and torus, respectively. The individual values corresponding to each initial load distribution pattern are included in appendix B. SID is the strategy that needs the smallest number of steps, for all topologies and all numbers of processors, to stop the load-balancing process. This stems from SID's inability to attain an even load distribution at the end of the
load-balancing process, as was established in the stability analysis. For that reason, the low number of steps of SID is not an important point in this analysis. That is not the case, however, for the other simulated strategies, whose results we now analyse in further detail.

We begin with the GDE load-balancing algorithm. If we look globally at GDE's behaviour for hypercubes and torus, we see that, whilst for hypercubes the number of simulation steps increases as the system size increases, for the torus the number of simulation steps remains fairly constant. The main reason is the connectivity degree of a given processor, i.e., the number of processors directly connected to it. In hypercube topologies, the size of the domain of a given processor increases as the system size increases, whereas for the torus the domain size remains constant as the number of processors grows. In particular, for hypercube topologies the size of the domain coincides with the diameter of the system, whereas the number of immediate neighbours of a given processor in a torus interconnection network is always 4.

We now analyse the results obtained by the AN algorithm. In this case, we can also highlight a distinct behaviour for the topologies analysed. When the load-balancing process is performed on hypercube topologies, the number of steps spent in the global load-balancing process behaves homogeneously, whatever the system size. By contrast, when the underlying topology is the torus, a more irregular behaviour is obtained; the degree of connectivity exhibited by each topology is highly relevant to these results. Since the AN strategy has the constraint of not allowing the simultaneous execution of the load-balancing algorithm in processors whose domains overlap, large domain sizes result in fewer overlapping load movements, because fewer concurrent load-balancing operations may be performed in the system; therefore, the number of steps increases.
Figure 4.7 Steps for the DASUD, SID, GDE and AN algorithms considering (a) likely and (b) pathological initial load distributions for the hypercube topology with respect to the system size. Both panels, "Likely distributions (Hypercube)" and "Pathological distributions (Hypercube)", plot the number of steps against the number of processors (8, 16, 32, 64, 128).
Figure 4.8 Steps for the DASUD, SID, GDE and AN algorithms considering (a) likely and (b) pathological initial load distributions for the torus topology with respect to the system size. Both panels, "Likely distributions (Torus)" and "Pathological distributions (Torus)", plot the number of steps against the number of processors (9, 16, 36, 64, 121).
4.4.6 Influence of the initial load distribution shape on steps

We have also investigated the influence of the initial load scattering on the number of steps needed by the load-balancing process to reach the termination condition. Tables 4.11(a) and 4.11(b) show the average number of steps over all sizes of hypercubes and torus, respectively; more detailed information can be found in tables B.15 and B.16 in appendix B. For single mountain shapes the number of steps is higher than for chain shapes. This characteristic is independent of the initial unbalance degree and of the initial load distribution group (likely or pathological). For all likely distributions, the number of steps required for single mountain shapes is, on average, approximately twice the number required for chain distributions. This feature closely resembles the behaviour exhibited by the total amount of load moved when the load distribution shape is considered (section 4.4.3): the local unbalance gradient observed by each processor coincides with the global one, and therefore a number of productive steps are performed to coerce the imbalance into a load distribution as close to the even one as possible.

Table 4.11(a) Hypercube topology (steps by shapes)

      | likely distributions            | pathological distributions
      | DASUD  SID    GDE    AN         | DASUD  SID    GDE    AN
SM    | 17.79  9.22   40.64  36.56      | 21.6   11.55  48.19  44.8
Chain | 9.74   6.62   34.28  32.14      | 12.53  7      39.55  34.9

Table 4.11(b) Torus topology (steps by shapes)

      | likely distributions            | pathological distributions
      | DASUD  SID    GDE    AN         | DASUD  SID    GDE    AN
SM    | 39.6   13.34  52.71  28.55      | 58.05  27.5   69.57  37.37
Chain | 21.49  6.5    47.72  24.77      | 35.13  11.6   66.32  32.26

Table 4.11 Steps for DASUD, SID, GDE and AN considering likely and pathological initial load distributions for (a) hypercube and (b) torus topologies with respect to the initial load distribution shape.
Efficiency Summary

patterns:
  load units (u): DASUD, GDE, AN, SID increase as the initial unbalance degree increases.
  simulation steps: DASUD, GDE, AN, SID increase as the initial unbalance degree increases.

system size:
  load units (u): DASUD, GDE, AN, SID decrease as the system size increases.
  simulation steps: DASUD, GDE, AN, SID increase as the system size increases, but not greatly.

shape:
  load units (u): DASUD, SID: for likely distributions, larger for the Chain than for the SM shape; for pathological distributions, larger for SM than for Chain. AN, GDE: larger for the SM shape than for the Chain shape.
  simulation steps: DASUD, GDE, AN, SID: larger for the SM shape than for the Chain shape.

Table 4.12. Summary of the results of the comparative study with respect to the efficiency analysis.

4.4.7 Conclusions of the efficiency analysis

Finally, in this section we summarise the main conclusions extracted from the efficiency analysis. Table 4.12 presents these conclusions, taking into account each of the analysed parameters; for each parameter (pattern, system size and shape), the two algorithms providing the best results are written in red. Notice that, for all parameters and quality indexes, the load-balancing algorithm that requires the least time to reach the final load distribution is SID. However, DASUD's behaviour remains very close to SID's throughout the whole experiment. It is worth noting that the good efficiency results obtained by the SID algorithm reflect its inability to coerce the system into a balanced load distribution: the load-balancing process with SID lasts a short time because, although it allows all processors to work simultaneously, the load movements performed only slightly reduce the system imbalance. In contrast, DASUD incurs a similar load-balancing time, but performs more productive load-balancing
movements. AN and GDE are both penalised by their synchronisation requirements. This restriction forces them to execute more load-balancing steps, and load movements cannot be overlapped as they are when the SID or DASUD algorithms are applied. This drawback implies that AN and GDE take a long time to reach the final load distribution.

4.5 Summary and conclusions of the comparative study

Table 4.13 summarises all the experimental studies performed in this chapter. Since the detailed analysis of the stability and efficiency results has already been carried out in the corresponding sections, table 4.13 uses very simple terms to denote the goodness of each algorithm with respect to each study. A trade-off column is also included to expose the main conclusions of this chapter. On the one hand, with respect to the stability analysis, DASUD, AN and GDE obtain the best results; in particular, DASUD is the best. On the other hand, considering the efficiency analysis, SID and DASUD are the two algorithms that finish the load-balancing process first. Therefore, DASUD is the load-balancing algorithm that exhibits the best trade-off between the final balance degree and the cost incurred in achieving it.

Summary of the comparative study

      | Stability  | Efficiency | Trade-off
DASUD | Very good  | Low cost   | The best trade-off between stability and efficiency
AN    | Very good  | High cost  | Medium trade-off
GDE   | Good       | High cost  | Bad trade-off
SID   | Very bad   | Low cost   | Bad trade-off

Table 4.13. Summary of the results of the comparative study.
A new distributed diffusion algorithm for dynamic load-balancing in parallel systems

Chapter 5
Scalability of DASUD

Abstract

In this chapter, using the same simulation framework reported in chapter 4, we analyse the scalability of DASUD with respect to problem size as well as system size.
5.1 Introduction

An important aspect of performance analysis in load-balancing algorithms is the study of how well the algorithm adapts to changes in parameters such as problem size and system size. The experimental study outlined in chapter 4 considered different system sizes and topologies, but the problem size was kept constant for all experiments. The aim of that chapter was to compare different load-balancing algorithms with respect to their ability to reach the final stable state and the cost incurred in reaching it. From that analysis, we concluded that DASUD offers the best trade-off between the final balance degree and the cost incurred in achieving it. In this chapter, we perform a more precise analysis of DASUD's behaviour under different problem sizes. The distributed nature of DASUD leads us to expect that it will react in a similar way for small problems as for large ones, and that it will behave similarly as the problem size increases.

The results provided in the following sections have been obtained by setting the total amount of load distributed among the whole system (L) to one of the following values: 3000, 6000, 12000, 24000 and 48000 load units. Two experimental studies are outlined below. Firstly, we analyse how DASUD adapts to different problem sizes when the underlying topology and system size do not change; for that purpose we do not simulate the load-balancing process for all the problem sizes mentioned, but only for the two extreme values and the middle one (3000, 12000 and 48000). Secondly, the influence of the system size is considered. Since in chapter 4 this parameter was evaluated for a fixed problem size, here we analyse what happens when the problem size changes as the system size changes as well; in this case, all the above-mentioned problem sizes (L) have been used, in order to obtain a similar initial load imbalance for all system sizes.
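The kind of experiment used in this chapter can be sketched as follows: for each problem size L, run the synchronous simulation over a fixed topology and record the evolution of dif_max and the standard deviation at every step, which is what figures 5.1 and 5.2 plot. The sketch reuses the illustrative helpers introduced in chapter 4 (likely, dif_max, sigma, and the per-processor lba_step convention); it is not the experimental driver used in the thesis, and the choice of the 75% likely pattern is ours.

```python
# Illustrative scalability experiment over problem sizes (assumed helpers from ch. 4).
def scalability_run(n, neighbours, lba_step, problem_sizes=(3000, 12000, 48000), max_steps=500):
    """For each L, simulate synchronously and record (dif_max, sigma) per step."""
    traces = {}
    for L in problem_sizes:
        loads = likely(n, L, variation=0.75)          # one likely initial distribution
        history = [(dif_max(loads), sigma(loads))]
        for _ in range(max_steps):
            moves = [m for i in range(n) for m in lba_step(i, loads, neighbours)]
            if not moves:                              # converged: no movement this step
                break
            for src, dst, amount in moves:
                loads[src] -= amount
                loads[dst] += amount
            history.append((dif_max(loads), sigma(loads)))
        traces[L] = history                            # evolution plotted in figs. 5.1/5.2
    return traces
```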
Chapter 5! 5.2 DASUD's scalablty wth respect to the problem sze j Ths secton s focused on the evaluaton of the DASUD's ablty to work ndependently on problem sze for a fxed topology, and on system sze. Snce we have executed the load-balancng smulaton process for the three problem szes mentoned (L): 3000, 12000 and 48000, the fnal global load average wll be dfferent for each problem sze. At each step of the smulaton process, the maxmum load dfference (df_max) throughout) the system and the global standard devaton (a), denoted as stdev n the graphcs, have been evaluated n order to plot ther evoluton as the load-balancng smulaton process progresses, both ndexes have been prevously ntroduced n chapter 4. The graphcs for all topologes and system szes consderng lkely and pathologcal ntal load dstrbutons are reported n appendx C to ths work. Snce DASUD exhbts a smlar behavour for all of them, we only nclude n ths secton the analyss of the fgures correspondng to the largest system sze for each topology (7-dmensonal hypercube and 11x11 torus) and for lkely ntal load dstrbutons. r Fgures 5.1 (a) and 5.1(b) show the global maxmum load dfference and the global load standard devaton as the load-balancng smulaton process progresses for a 7-dmensonal hypercube. Each plot for those fgures s the mean value of the global maxmum load dfference for all ntal lkely load dstrbutons at the same smulaton step. Fgures 5.2(a) and 5.2(b) depct the same nformaton descrbed above but n ths case for a 11x11 torus topology. As can be observed n fgures 5.1 (a) and 5.1(b), the balance rate has an nsgnfcant degradaton as the problem sze ncreases for a 7-dmensonal hypercube topology. Consequently, for hypercubes, DASUD's ablty to have a hgh decreasng gradent n the global unbalance durng the ntal teratons of the load-balancng process s ndependent of the problem sze. From a more accurate analyss of fgure 5.1 (a) we note that the global maxmum load dfference.shows rregular behavour throughout the ntal loadbalancng smulaton steps. However, the global standard devaton (fgure 5.1(b)) shows constant decreasng throughout the whole load-balancng smulaton process. Ths stuaton s not usual, but t sometmes appears at the very begnnng of the loadbalancng smulaton process as a consequence of some small load thrashng. However, ths fact s not very relevant, snce the global load standard devaton shows 142
no fluctuations during this small period. Therefore, the load-balancing simulation process in hypercubes seems to be less influenced by the growth of the problem size.

[Figure 5.1: plots (a) and (b) for likely distributions on a hypercube with d=7, with one curve per problem size (L=3000, L=12000, L=48000) versus simulation steps.]

Figure 5.1 Influence of the problem size on (a) the global dif_max and (b) the global load stdev as the load-balancing process progresses for a 7-dimensional hypercube and for likely initial load distributions.
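The two indexes plotted in these figures can be computed directly from a snapshot of the per-processor loads. The following minimal sketch (the function and variable names are ours, not the simulator's) shows the calculation:

```python
import statistics

def global_metrics(loads):
    """Return (dif_max, stdev) for one snapshot of the system load.

    loads -- list with the load (in load units) held by each processor.
    dif_max is the maximum pairwise load difference in the whole system;
    stdev is the global standard deviation around the load average.
    """
    dif_max = max(loads) - min(loads)
    stdev = statistics.pstdev(loads)   # population standard deviation
    return dif_max, stdev

# Example: a hypothetical initial distribution on 8 processors
print(global_metrics([900, 150, 420, 30, 610, 75, 480, 335]))
```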
[Figure 5.2: plots (a) and (b) for likely distributions on an 11x11 torus, with one curve per problem size (L=3000, L=12000, L=48000) versus simulation steps.]

Figure 5.2 Influence of the problem size on (a) the global dif_max and (b) the global load stdev as the load-balancing process progresses for an 11x11 torus and for likely initial load distributions.
The analysis of DASUD's behaviour on torus topologies when the problem size changes exhibits some important characteristics. As happens for hypercubes, the global maximum load difference and the global load standard deviation decrease monotonically as the load-balancing simulation process progresses. However, DASUD's response rate in the 11x11 torus decreases more slowly than in hypercubes, i.e., as the problem size increases the response rate slightly degrades, as can be observed in figures 5.2(a) and 5.2(b). These results can be explained by the larger diameter of this particular interconnection network. The diameter of an 11x11 torus is equal to 10, whereas the diameter of a 7-dimensional hypercube is 7. Therefore, since we only consider local load movements, the larger the topology diameter, the more slowly the load distribution is performed. With regard to the results obtained for topologies with similar diameters, the response rates remain very close, as is shown in figure 5.3. This figure shows DASUD's behaviour with respect to the global standard deviation (stdev) for a 3-dimensional hypercube (5.3(a)), whose diameter is equal to 3, and for a 3x3 torus (5.3(b)), whose diameter is 2, for likely initial load distributions (these graphics are extracted from appendix C). It is easy to observe that the behaviour depicted in both figures is practically the same. In conclusion, the ability of DASUD to reach a good balance degree depends minimally on the problem size. Additionally, this load-balancing algorithm is also able to act similarly for different configurations of the architecture, exhibiting only a slight dependence on the topology's diameter.

5.3 DASUD's scalability with respect to system size

In this section, we analyse DASUD's response rate as the number of processors increases for a given interconnection pattern. For this purpose we have varied the number of processors from 8 to 128. As in the previous section, the study has been performed for hypercubes and tori, starting the load-balancing process from initial load distributions including both likely and pathological patterns. The results included below correspond to those obtained when likely initial load distributions are applied. Since the observed behaviour for pathological initial load
distributions is the same as for likely patterns, the graphics for pathological distributions are not included in this section; however, they may be consulted in appendix C.

[Figure 5.3: stdev versus simulation steps for likely distributions, with one curve per problem size (L=3000, L=12000, L=48000), on (a) a 3-dimensional hypercube and (b) a 3x3 torus.]

Figure 5.3 Influence of the problem size on the global standard deviation as the load-balancing process progresses for (a) a hypercube with d=3, and (b) a torus with d=2, for likely initial load distributions.
The evolution of the global maximum load difference (dif_max) for hypercubes and tori is depicted in figures 5.4(a) and 5.4(b) respectively. The results for the global load standard deviation (stdev) are shown in figures 5.5(a) and 5.5(b). As mentioned previously, DASUD's capacity for moving load around the system is limited by the diameter of the underlying interconnection network. Larger diameters yield slower load propagation, since load movements are performed locally. The results obtained for torus topologies seem, at first sight, worse than the results obtained for hypercubes; however, what is happening is that, for the same system size, the diameter of each topology has a different value. In torus interconnection schemes, the number of directly connected processors for each processor remains constant as the system size increases, whereas in hypercube architectures the size of the domain increases as the system size increases. Therefore, for the same number of processors, the hypercube exhibits a higher connectivity degree than the torus and, consequently, the value of the diameter is, at most, the same for the hypercube as for the torus, but in most cases it is smaller for the same system size. Notice that the largest system sizes are 128 and 121 processors for hypercube and torus respectively, whereas the corresponding diameters are 7 and 10. But if we observe the results for topologies with the same diameter value, as happens for a 4-dimensional hypercube and a 4x4 torus, for instance, we observe that there is only a slight difference in the obtained response rate, whether we analyse the global maximum load difference or the global standard deviation. Finally, a similar conclusion to that of the scalability analysis with respect to problem size can be extracted: the ability of DASUD to reach a good balance degree depends minimally on the system size; more precisely, DASUD exhibits a slight dependence on the topology's diameter whatever interconnection pattern we have.
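To make the connectivity and diameter argument concrete, the following small sketch (our own illustration, not part of the simulation framework) returns the per-processor degree and the diameter of the two interconnection patterns used in this study:

```python
import math

def hypercube(p):
    """Degree and diameter of a hypercube with p = 2^d processors."""
    d = int(math.log2(p))
    return {"degree": d, "diameter": d}

def torus(n):
    """Degree and diameter of an n x n torus with wrap-around links."""
    return {"degree": 4, "diameter": 2 * (n // 2)}

print(hypercube(128))  # degree 7, diameter 7
print(torus(11))       # degree 4, diameter 10, as quoted in the text
print(torus(4))        # diameter 4, same as a 4-dimensional hypercube
```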
[Figure 5.4: dif_max versus simulation steps for likely distributions; (a) hypercubes with d=3 to d=7, (b) tori from 3x3 to 11x11.]

Figure 5.4 Influence of the system size on the global maximum load difference as the load-balancing process progresses for (a) hypercubes and (b) tori for likely initial load distributions.
[Figure 5.5: stdev versus simulation steps for likely distributions; (a) hypercubes with d=3 to d=7, (b) tori from 3x3 to 11x11.]

Figure 5.5 Influence of the system size on the global standard deviation as the load-balancing process progresses for (a) hypercubes and (b) tori for likely initial load distributions.
5.4 Conclusion about DASUD's scalability

From the conclusions obtained in the two previous scalability analyses, where the influence of the problem size and the system size have been studied, we conclude that DASUD is a load-balancing algorithm whose balance properties depend only slightly on the diameter of the underlying topology. The main reason for this is its totally distributed nature: since load movements are performed locally, larger diameters yield slower load propagation.
A new distributed diffusion algorithm for dynamic load-balancing in parallel systems

Chapter 6

Enlarging the domain (d_s-DASUD)

Abstract

In this chapter, the following question is analysed: is it possible to accelerate the load-balancing process by enlarging DASUD's domain to include non-directly connected processors? For this purpose, an extended system model is provided. The influence of enlarging the domain on the time incurred in transferring messages beyond one link, and on the extra computational cost incurred by the extended version of DASUD (d_s-DASUD), has been analysed. From this analysis we determine which enlargement provides the best trade-off between balance improvement and load-balancing time spent.
6.1 Introduction

Up to this point, the DASUD algorithm identifies the domain of a given processor i with the processors with which it has a direct neighbourhood relation. A question that arose during the evaluation of DASUD was: how would DASUD work if it were able to collect more load information than only that of its immediate neighbours? Intuitively, we expect that if DASUD had the capability of using more load information to take load-balancing decisions, it should globally be able to reach a better final balance situation while spending less time. We also expect this effect to be proportional to the enlargement of the domain. For instance, if each processor were able to collect the load information from the whole system, i.e., with DASUD working as a totally distributed load-balancing algorithm with global information, DASUD should then be very fast in correcting unbalanced load distributions, and the final state should be even. However, we know that this DASUD extension incurs an extra time cost due to increments in communication and computational costs. Thus, we were interested in studying the influence of the domain enlargement on the trade-off between the improvement in the balance degree and the time incurred during the load-balancing process. We refer to the extended version of DASUD as d_s-DASUD, where d_s is the domain scope, a non-negative integer value that represents the minimum number of links needed to cross from the underlying processor to the furthest processor belonging to its domain. Notice that when d_s is equal to 1, the algorithm coincides with the original DASUD algorithm. Although d_s-DASUD is still a totally distributed load-balancing algorithm, its implementation introduces some changes in the underlying system model described in chapter 2. Therefore, an extended system model is described in the following section. Subsequently, the time metrics evaluated to determine whether or not it is worth enlarging the domains are introduced. Finally, we report the experimental study performed to analyse how DASUD works when the domain of each processor is enlarged.
6.2 Extended system model

We recall from chapter 2 that we represent a system by a simple undirected graph G = (P, E), i.e. without loops and with at most one edge between two different vertices. The set of vertices P = {1, 2, ..., n} represents all processors in the system. An edge {i, j} ∈ E exists if there is a link between processor i and processor j. A processor i has a domain, defined as the set of processors from which processor i maintains load information, including itself:

N_i = { j ∈ P | {i, j} ∈ E } ∪ {i}.

The size of a given domain is the number of processors belonging to that domain and is denoted by #N_i. The number of direct neighbours of a given processor is denoted by r_i, which coincides with #N_i − 1.

In the new definition of DASUD (d_s-DASUD), when the value of d_s is bigger than one (d_s > 1), virtual links are considered in order to connect the underlying processor to non-directly connected processors belonging to its domain. A new set of edges E_v is then defined, with {i, j} ∈ E_v if there is a real or virtual link between processor i and processor j. The set of processors belonging to the domain of processor i, including itself, for a certain value of d_s is denoted by N_i^{d_s}. The resulting topology is referred to as the virtual topology and, consequently, a virtual diameter (d_v) should be defined. Figures 6.1(a) and 6.1(b) show the domain of a given processor (red colour) for a virtual 3-dimensional hypercube topology with d_s equal to 1 and 2, respectively. Notice that when the value of d_s is equal to 1, the set of processors belonging to the domain of the underlying processor matches its direct neighbourhood (the processors directly connected to it); therefore, the virtual topology coincides with the real one. The size of the virtual domain of processor i depends on the value of d_s and is denoted by #N_i^{d_s}, and the number of virtual neighbours is denoted by r_v. In order to simplify notation, in the rest of this chapter, when we refer to neighbour processors we consider both real and virtual neighbours.
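One possible way to build the domain N_i^{d_s} of a processor is a breadth-first search bounded to d_s links over the real topology graph. The sketch below is our own illustration; the adjacency-list representation and function names are assumptions, not the data structures used in the actual implementation:

```python
from collections import deque

def virtual_domain(adj, i, ds):
    """Processors within ds links of processor i, including i itself.

    adj -- dict mapping each processor to the list of its directly
           connected (real) neighbours; ds -- domain scope (ds >= 1).
    """
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == ds:
            continue                      # do not expand beyond ds links
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

# 3-dimensional hypercube: nodes 0..7, neighbours differ in one bit
adj = {u: [u ^ (1 << b) for b in range(3)] for u in range(8)}
print(sorted(virtual_domain(adj, 0, 1)))  # direct neighbourhood of 0 plus itself
print(sorted(virtual_domain(adj, 0, 2)))  # enlarged domain for 2-DASUD
```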
[Figure 6.1: domain of a given processor on a 3-dimensional hypercube, with real links and virtual links marked.]

Figure 6.1 Virtual hypercube topology for (a) 1-DASUD and (b) 2-DASUD

The original version of DASUD was theoretically analysed in chapter 3. In that chapter, upper bounds were provided for the final balance degree and the balance rate. Both of these were derived under the assumption that a given processor is restricted to using load information from its directly connected processors. However, if this condition is relaxed as a consequence of enlarging the domain of each processor, these upper bounds must be updated in order to be applied to the extended version of DASUD (d_s-DASUD). Since this modification affects the diameter of the underlying topology (d), which must be replaced by its virtual version d_v (the virtual diameter), the original formulas for the above-mentioned upper bounds must be updated accordingly. The notation for these extended upper bounds is shown in table 6.1.

Table 6.1 Extended upper bounds for d_s-DASUD
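Since a virtual link joins every pair of processors at most d_s real links apart, the virtual diameter follows from the real one. The relation below is our own derivation (the thesis does not state it explicitly), shown as a small sketch:

```python
import math

def virtual_diameter(d, ds):
    """Diameter of the virtual topology: every stretch of up to ds real
    links collapses into one virtual hop, so d_v = ceil(d / ds)."""
    return math.ceil(d / ds)

print(virtual_diameter(7, 1))  # original DASUD on a 7-dimensional hypercube -> 7
print(virtual_diameter(7, 3))  # 3-DASUD on the same hypercube -> 3 virtual hops
print(virtual_diameter(4, 4))  # the 5-processor linear array used later (4-DASUD) -> 1
```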
The extended version of DASUD seems an interesting alternative for increasing the convergence rate of the original algorithm. Since larger domains allow more accurate balance decisions to be taken, we are interested in evaluating such a possibility. Therefore, we analyse the influence of the domain scope enlargement on the balance degree with the aim of determining a domain scope (d_s) that exhibits the best trade-off between balance degree improvement and the time overhead introduced when the scope of the domain is enlarged. For this purpose, in the following section, we describe all the times involved during the load-balancing process, and how these times have been evaluated by taking into account the influence of the domain scope. Subsequently, these metrics will be used to experimentally analyse whether or not it is worth enlarging the domain scope of the DASUD algorithm.

6.3 Metrics

Iterative load-balancing algorithms improve the load imbalance of a system by successive iterations of the load-balancing operations. In such approaches the total load-balancing time depends on the number of iterations performed. Since the load-balancing process has been evaluated by simulation using the same load-balancing simulator described in chapter 4, we will use the term simulation step (or step, for simplicity) as in chapter 4, instead of iteration. Under the synchronous simulation paradigm, the total load-balancing overhead incurred by any iterative load-balancing algorithm (T_bal) can be obtained as follows:

$$T_{bal} = \sum_{s=1}^{last\_step} T_{bal}^{s}$$

where $T_{bal}^{s}$ is the time required to execute one simulation step of the load-balancing process in the whole system, and last_step denotes the last step of the load-balancing simulation process. More precisely, the duration of the s-th step of the load-balancing simulation process ($T_{bal}^{s}$) can be divided into two communication periods (the information collection and transfer periods) and the computational period:
Communication periods:
=> Information collection period ($T^{s}_{col}$): interval of time required to gather all the load information needed by all processors for executing the load-balancing algorithm.
=> Transfer period ($T^{s}_{trf}$): time required to perform all load movements from source processors to destination processors.

Computational period ($T^{s}_{l\_bal}$): period of time dedicated by the load-balancing strategy to evaluating the load movements in all processors.

Consequently, and bearing in mind the synchronous paradigm, the total time overhead introduced by the global load-balancing process ($T_{bal}$) can be obtained as follows:

$$T_{bal} = \sum_{s=1}^{last\_step} \left( T^{s}_{col} + T^{s}_{l\_bal} + T^{s}_{trf} \right)$$

Notice that the duration of these load-balancing periods is directly affected by the domain scope (d_s), i.e., as the domain scope increases, those periods likewise extend. Therefore, it is necessary to be able to evaluate this overhead in order to obtain reliable results. We now describe how the communication and computational periods have been evaluated. Subsequently, these times are used to introduce a goodness index, called the trade-off factor, which allows us to determine which d_s provides the best results.

6.3.1 Communication periods

Since in direct networks the blocking time of a message (defined as the time spent waiting for a channel currently being used by another message [Ni93]) cannot be determined in advance, the interconnection network functional simulator NETSIM [Fra99] has been used to evaluate the communication times incurred by the extended version of DASUD (d_s-DASUD) during the information collection and transfer periods. This simulator uses wormhole routing as its routing technique and takes into account the resource contentions that a message encounters along its path. In this
routing technique, the original message is broken into small units called flits. The header flit(s) of a message contain all the necessary routing information, and all the other flits contain the data elements. The flits of the message are transmitted through the network in a pipelined fashion. Since only the header flit(s) carry the routing information, all the trailing flits follow the header flit(s) contiguously. Flits of two different messages cannot be interleaved at any intermediate node [Moh98]. The communication latency for a wormhole routing technique is obtained by considering the following times:

Start-up time (t_s): time needed to prepare a message. This delay is incurred only once for a single message.
Per-hop time (t_h): the time taken by the header of a message to travel between two directly-connected processors in the network.
Per-flit transfer time (t_w): the time taken by a flit of the message to traverse one link.

If we consider a message that traverses a path with l links, then the header of the message takes l·t_h time to reach the destination. If the message is m flits long, then the entire message will arrive in time m·t_w after the arrival of the header. Therefore, the total communication time (t_comm) for wormhole routing, i.e. the time taken by one message to go from the source processor to the destination one, is given by

$$t_{comm} = t_s + l \cdot t_h + m \cdot t_w$$

The injection of messages into the network has been simulated using the worst-case pattern, in which all processors start the injection process simultaneously. This injection pattern has been used for both communication periods, the information collection period and the transfer period. From the execution of the NETSIM simulator we obtain the total simulation time incurred by the whole communication process, including the injection of the messages by the source processors, the resource contentions, and deadlock detection and recovery. The particular usage of this network simulator for each communication period is described below.
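The latency model above translates directly into a small helper function; the numerical values used in the example are hypothetical, since the thesis does not list the actual t_s, t_h and t_w values used by NETSIM:

```python
def wormhole_latency(ts, th, tw, links, flits):
    """Communication time of one message under wormhole routing,
    ignoring contention: t_comm = t_s + links * t_h + flits * t_w."""
    return ts + links * th + flits * tw

# Hypothetical timings: a 2-flit load-information message crossing 3 links
print(wormhole_latency(ts=5.0, th=1.0, tw=0.5, links=3, flits=2))  # 9.0
```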
The information collection period ($T^{s}_{col}$)

Remember that the information collection period is the time spent in gathering load information for all processors in the system. Since the domain scope is a fixed value, which does not change at each load-balancing step, the time incurred by the information collection period will be the same for each load-balancing step. Therefore, the simulation of load-information messages travelling around the system only needs to be executed once. This simulation has been performed by injecting 2-flit messages from every processor to all of its neighbours, and the total NETSIM simulation time is taken as the time spent by the information collection period.

The transfer period ($T^{s}_{trf}$)

In contrast to the information collection period, the duration of the transfer period depends on the current step of the load-balancing simulation process. At each load-balancing step, the execution of the load-balancing algorithm at each individual processor produces different load movements. Therefore, the time spent in sending and receiving those messages must be evaluated at each load-balancing step. For that purpose, a trace file is generated for each experiment, in which all the load movement decisions generated by all processors at each load-balancing simulation step are recorded. The information stored in this trace file is used as input to the NETSIM simulator in order to evaluate the NETSIM time spent at each load-balancing simulation step in performing the corresponding load transfer movements. In particular, for a given load transfer of size M, the length of the corresponding message injected into the network was 2M.

6.3.2 Computational period ($T^{s}_{l\_bal}$)

The computational period for d_s-DASUD has been evaluated using the formula for DASUD's complexity derived in chapter 3. Since that formula was derived by considering the domain of a given processor to be its immediate neighbours, it must be adapted to take into account the extended model of DASUD. This updating consists of substituting the number of direct neighbours (r) by the number of virtual neighbours (r_v), giving O(r_v · log r_v).
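The number of virtual neighbours r_v that enters this cost can be counted in closed form for the two regular topologies used here. These formulas are our own illustration and assume d_s does not exceed the topology radius:

```python
from math import comb

def rv_hypercube(d, ds):
    """Virtual neighbours of a node in a d-dimensional hypercube:
    all nodes at Hamming distance 1..ds."""
    return sum(comb(d, k) for k in range(1, ds + 1))

def rv_torus(ds):
    """Virtual neighbours in a (large enough) 2-D torus: the diamond of
    radius ds around a node contains 2*ds*(ds+1) other nodes."""
    return 2 * ds * (ds + 1)

print(rv_hypercube(5, 1), rv_torus(1))   # 5 and 4: the real neighbourhoods
print(rv_hypercube(5, 3), rv_torus(3))   # 25 and 24 virtual neighbours
```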
Since the domain size is fixed for a given d_s, and it is the same for all processors, the computational time incurred by each processor at each load-balancing step remains constant during the whole load-balancing simulation process.

6.3.3 Trade-off factor (t_off(k))

Finally, we introduce an index to measure the trade-off between the global standard deviation at balancing step k (σ(k)) and the time incurred to achieve it (T_bal(until_k)). This goodness index is called the trade-off factor and is denoted by t_off(k). For this purpose, let us introduce the term T_bal(until_k). In the previous section, we described how the three time periods involved in the load-balancing process (the information collection period, the computational period and the transfer period) are individually evaluated. We evaluate the duration of these periods at each load-balancing step in the way described in the previous section. By adding together all these times from step 1 until step k, we obtain the time spent on the load-balancing process throughout these k steps. Figure 6.3 graphically shows the serialisation of these times, and formula (2) formally describes it. We observe that the terms $T^{s}_{col}$ and $T^{s}_{l\_bal}$ in formula (2) remain constant for all steps, whereas $T^{s}_{trf}$ depends on the current simulation step, as previously commented. Furthermore, when k coincides with last_step, the evaluated time is the time spent on the whole load-balancing simulation process.

$$T_{bal}(until\_k) = \sum_{s=1}^{k} \left( T^{s}_{col} + T^{s}_{l\_bal} + T^{s}_{trf} \right) \qquad (2)$$

Finally, the trade-off factor is obtained by multiplying the two parameters involved, σ(k) and T_bal(until_k), as shown in formula (3):

$$t\_off(k) = T_{bal}(until\_k) \cdot \sigma(k) \qquad (3)$$
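A minimal sketch of how T_bal(until_k) and the trade-off factor can be accumulated is given below (function and variable names are ours; the per-step times in the example are hypothetical):

```python
def time_until(k, t_col, t_lb, t_trf):
    """T_bal(until_k): accumulated load-balancing time over steps 1..k.

    t_col, t_lb -- collection and computation times (constant per step);
    t_trf       -- list with the transfer-period time of each step
                   (t_trf[0] corresponds to step 1).
    """
    return sum(t_col + t_lb + t_trf[s] for s in range(k))

def trade_off(k, stdev_k, t_col, t_lb, t_trf):
    """t_off(k) = T_bal(until_k) * sigma(k); smaller values are better."""
    return time_until(k, t_col, t_lb, t_trf) * stdev_k

t_trf = [12.0, 9.5, 8.0, 7.5, 7.0]        # hypothetical per-step transfer times
print(time_until(4, t_col=3.0, t_lb=1.0, t_trf=t_trf))          # 53.0
print(trade_off(4, stdev_k=14.6, t_col=3.0, t_lb=1.0, t_trf=t_trf))
```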
[Figure 6.3: the information collection, computation and transfer periods of successive load-balancing steps laid end to end, making up the total load-balancing time.]

Figure 6.3 Time evolution as the load-balancing process progresses.

In particular, this trade-off factor tends to be optimal for small global load standard deviation values and for small load-balancing times as well; furthermore, the trade-off factor will be small whenever either of the two operands is small.

6.4 The experimental study of d_s-DASUD

This section is aimed at analysing the influence of domain enlargement on balance improvement, taking into account the extra time incurred as a consequence of sending and receiving messages beyond immediate neighbours and of spending more time executing the load-balancing algorithm. The experimental study outlined below has been carried out using the same simulation framework described in chapter 4. We recall that the set of initial load distributions used is divided into two groups (likely and pathological), and that each initial load distribution has been scattered following two different shapes (Single Mountain and Chain). Moreover, the load-balancing simulation process has been executed on two different interconnection networks (hypercube and torus), with the system size ranging from 8/9 to 121/128 processors. The problem size varies with the system size, in order to obtain a similar initial load imbalance in all cases.
In order to analyse the relation between balance improvement throughout the load-balancing simulation process and the time spent during that process, the global load standard deviation has been evaluated at each simulation step k (σ(k)), together with the corresponding T_bal(until_k). This parameter has been evaluated for all simulation steps until last_step is reached. As in the previous experimental studies, the simulation process has been run until no load movements are produced from one iteration to the next. However, we have imposed a maximum number of simulation steps at which the simulation process is stopped, even if the final stable load distribution has not been achieved. This simulation step limit has been set to 2000. The experimental study outlined in the following section is aimed at determining which domain scope (d_s) provides the best trade-off between the final balance degree and the time incurred to reach it. In the subsequent section, the reported experimental study focuses on deriving the d_s that provides the greatest balance improvement at the beginning of the load-balancing process, without executing the load-balancing process to completion.

6.4.1 The best degree of final balance

Figures 6.4 and 6.5 show the evolution of the global load standard deviation (stdev) against simulation time (time) for likely initial load distributions for a 5-dimensional hypercube and for a 6x6 torus, respectively, where all possible d_s values have been considered. These two examples have been chosen as representative of both topologies because the remaining system sizes exhibit a similar behaviour. Nevertheless, complete simulation results may be found in Appendix D. Each point of the depicted curves shows the mean value of the global load standard deviation at a given step k (σ(k)) of the load-balancing simulation process over all initial load distributions, versus the mean time needed to reach that situation (T_bal(until_k)). Although these values are obtained for the entire balance simulation process, in order to make the analysis of the curves easier, only the time interval in which the relevant variations are detected has been plotted in figures 6.4 and 6.5. This does not affect the comprehension of the following discussion.
[Figures 6.4 and 6.5: stdev versus load-balancing time, one curve per domain scope.]

Figure 6.4 Global load standard deviation versus time for a 5-dimensional hypercube, varying d_s from 1 to 5, for likely initial load distributions.

Figure 6.5 Global load standard deviation versus time for a 6x6 torus, varying d_s from 1 to 6, for likely initial load distributions.
From a preliminary analysis of figures 6.4 and 6.5 we can conclude that there seems to be no difference in the global balance rate, whichever domain scope is applied. However, we observe that there is a time beyond which the balance rate of larger d_s's slows down, reversing their behaviour with respect to small domain scopes. We call this atypical phenomenon the inversion rate effect. The magnitude of this effect is more appreciable in tables 6.2 and 6.3, where the mean values of the final global load standard deviation, the total load-balancing time and the total number of load-balancing simulation steps are shown for hypercube and torus respectively. It is interesting to observe that the final balance degree achieved for all domain scopes is approximately the same; only a slight improvement is obtained for larger d_s's. Notice that in the case of the largest d_s, the perfect final balance situation is achieved, as expected, i.e., stdev equal to 0 when L (the problem size) is an exact multiple of the number of processors, and very close to 0 otherwise. In the case of the largest system sizes for both torus and hypercube topologies (121 and 128 processors), the values included in the tables do not represent the real final situation because in both cases the previously mentioned simulation step limit was reached. With respect to the load-balancing simulation steps needed to reach the final load distribution, one can observe that, on average, this follows a sequence of values that exhibits a global minimum at approximately d_s equal to 3 for torus topologies, whilst for hypercube topologies the minimum number of steps alternates between d_s equal to 1 and 2 (yellow cells). In contrast, the time incurred in attaining this final situation increases significantly as the domain scope increases. For all topologies and system sizes, the biggest domain scope, instead of being the fastest in reaching the final stable situation, becomes the slowest. Unexpectedly, the minimum time is obtained with d_s equal to 1 (green cells).
[Table 6.2 contains results for likely and pathological distributions.]

Table 6.2 Global load standard deviation, load-balancing time and steps at the end of the load-balancing process in hypercube topologies for all domain scopes.
[Table 6.3 contains results for likely and pathological distributions.]

Table 6.3 Global load standard deviation, load-balancing time and steps at the end of the load-balancing process in torus topologies for all domain scopes.
Let us analyse more precisely the reasons for this atypical behaviour. For this purpose figures 6.6(a) and 6.6(b) must also be studied. These figures show the mean value of the global maximum load difference at each load-balancing step, over all likely initial load distributions, for a 5-dimensional hypercube and a 6x6 torus topology respectively, and for all domain scopes. The results for the remaining system sizes and initial load distribution patterns can be found in Appendix D. Note that the global maximum load difference also suffers from the inversion effect, which is caused by the existence of a step beyond which larger d_s's slow down the reduction of the global maximum load difference, as happens for the standard deviation. The reasons for such behaviour, although apparently contradictory, are the ability of DASUD to evenly balance unbalanced domains and certain convergence requirements. We now analyse these motives in detail.

[Figure 6.6: dif_max versus load-balancing steps; (a) hypercube with d=5 varying d_s from 1 to 5, (b) 6x6 torus varying d_s from 1 to 6, likely distributions.]

Figure 6.6 Global maximum load difference versus load-balancing steps for (a) a 5-dimensional hypercube and (b) a 6x6 torus for all possible domain scopes.
In chapter 4 we observed that the original DASUD algorithm is able, in most cases, to reach an even load distribution or, at worst, to keep the unbalance degree bounded by half the value of the diameter of the topology plus one. The extended version of DASUD, d_s-DASUD, also exhibits this ability. In particular, when the value of d_s coincides with the topology diameter (d), we can assert that the final stable state will be even. This characteristic stems from the capability of this load-balancing algorithm to search for unbalanced domains and to equilibrate them (maximum load difference within the domain of 0 or 1). Since the domain of each processor becomes the whole system when d_s is equal to d, d_s-DASUD is iterated until the perfect final balance situation is achieved. Attaining the balanced state implies a great effort by d_s-DASUD for large d_s in terms of load-balancing steps, as is clearly depicted in figures 6.6(a) and 6.6(b). However, this effort is spent in moving small load quantities around the system, since the maximum load difference is not greatly reduced at each load-balancing step. The main reason for such an anomalous situation is certain convergence requirements needed by DASUD, and by d_s-DASUD as well. We recall some implementation features of DASUD from chapter 3 which are directly involved in the inversion rate effect. During the execution of one iteration of the DASUD algorithm on a given processor, load movement decisions may be made as a consequence of executing the PIM block. This DASUD block processes all the received instruction messages and, sometimes, as a consequence of attending to one of them, the movement of one load unit can be ordered. In this case, the rest of the instruction messages are deleted. Bearing in mind such behaviour, let us analyse what occurs in the following example. Assume 5 processors connected in a linear array, as shown in figure 6.7. The real diameter of this interconnection network is 4 (d=4) and, since the extended version of DASUD used in this example is 4-DASUD, its virtual diameter is 1 (d_v=1). Therefore, the domain of each processor includes all system processors, and the virtual topology corresponds to a fully connected system. The initial load distribution is that depicted in figure 6.7(a), where the maximum load difference in the whole system, which coincides with each processor's domain, is 2 load units. Thus, it is detected as unbalanced by 4-DASUD, and some actions are carried out to correct it. In particular, all processors detect their domains as unbalanced, because the global load average is equal to 9.2, but only processors 4 and 5 actually perform any actions. These actions are derived
from the execution of the second stage of 4-DASUD. Since both processors observe that the maximum load difference within their domain is bigger than 1 load unit, and that their load values are not the biggest in the corresponding domain, both processors execute the SIM block of 4-DASUD. Consequently, both processors send one instruction message to processor 1 commanding 1 load unit to be sent to processor 4. When processor 1 processes these messages, it will only attend to one of them, performing the load movement shown in figure 6.7(a). The resulting load distribution is depicted in figure 6.7(b). The unattended message is deleted. Subsequently, the execution of the next iteration of 4-DASUD causes processors 1, 4 and 5 to send one instruction message each to processor 2, commanding 1 load unit to be sent to processor 5. Finally, only one of these messages will be attended to, and the final load distribution achieved is that shown in figure 6.7(c).

[Figure 6.7: load of each of the 5 processors in the linear array at stages (a), (b) and (c).]

Figure 6.7 Load-balancing process applied using the 4-DASUD algorithm to a linear array with 5 processors.
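The sequence of moves described above can be replayed with a toy script. The concrete initial vector [10, 10, 10, 8, 8] is our assumption, chosen to be consistent with the stated load average of 9.2 and maximum load difference of 2 units:

```python
# Toy replay of the 4-DASUD example on a 5-processor linear array.
loads = [10, 10, 10, 8, 8]          # assumed initial distribution, average 9.2

# Iteration 1: processors 4 and 5 both instruct processor 1 to send one
# load unit to processor 4; processor 1 attends to only one message.
loads[0] -= 1; loads[3] += 1        # -> [9, 10, 10, 9, 8]

# Iteration 2: processors 1, 4 and 5 instruct processor 2 to send one
# load unit to processor 5; again only one message is attended to.
loads[1] -= 1; loads[4] += 1        # -> [9, 9, 10, 9, 9]

print(loads, max(loads) - min(loads))   # final maximum difference of 1 load unit
```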
This slow way of balancing a domain that includes all processors in the system becomes slower as the system grows. Notice that all instruction messages sent at each iteration of the load-balancing process are directed to the same processor instead of being sent to different processors. This design characteristic is needed to ensure the convergence of the algorithm. Therefore, we can conclude that the convergence requirements, together with the perfect local balance always achieved by DASUD whatever d_s is used, are the causes of the slow convergence rate for larger d_s. At this point of the analysis of d_s-DASUD, a preliminary conclusion arises: although the extended version of DASUD is intuitively expected to be more effective in reaching a final stable load distribution, we have experimentally demonstrated that the best trade-off between the ability to stop the balancing process and the time incurred to achieve it is exhibited by the original DASUD. There are two reasons that explain this anomalous behaviour. First, as the domain scope increases, the communication and computational costs increase significantly as well. Secondly, once a certain balance degree has been achieved in the system, the process of further reducing the remaining imbalance is slower for large domains than for small ones. Therefore, we reoriented the current study towards finding an alternative solution when the load-balancing process is not carried out to completion. In particular, we were interested in the possibility of setting a priori a number of balancing steps at which to stop the load-balancing process, with the certainty of having reached a substantial unbalance reduction.

6.4.2 Greater unbalance reduction

Tables 6.4 and 6.5 show the global standard deviation (stdev), the trade-off factor (t_off) and the percentage of unbalance reduction (%) at particular steps from the very beginning of the load-balancing process (steps 4 and 7). Each cell contains the mean value of the results obtained for likely initial load distributions. Since pathological initial load distributions exhibit a similar behaviour, their results are omitted here, but they are included in tables D.1 and D.2 in appendix D. In order to put the initial load unbalance at the same level for all system sizes, the problem size L
has been varied proportionally to the number of processors. This is shown in the column denoted step 0, which contains the global load standard deviation before executing the load-balancing process. Although we only show the values of the global load standard deviation and the trade-off factor at steps 4 and 7, the same information has been evaluated and recorded throughout the whole load-balancing process. We have chosen these initial balancing steps because we detected that during this period d_s-DASUD was able to reduce the initial load unbalance by up to 90% and, in most cases, the improvement could be even greater. For each system size, green indicates the biggest unbalance reduction together with the corresponding load standard deviation, and yellow indicates the best trade-off factor. We can observe that in most cases both colour indices coincide at the same domain scope value. More precisely, for hypercube topologies it is clear that d_s equal to 3 is, on average, the best choice. However, this criterion does not apply to the torus topology. In this case we clearly observe that the best domain scope depends on the system size. In particular, we can experimentally derive an optimal d_s for obtaining a fast unbalance reduction by the following formula:

$$d_s = \frac{d}{2} + 1$$

where d is the diameter of the topology. The optimal d_s values for both topologies are indicated in red in both tables.
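This choice of domain scope can be wrapped in a small helper (a sketch; the hypercube value of 3 and the torus formula come from the discussion above, and the example tori follow the same rule):

```python
def best_ds(topology, d):
    """Experimentally derived domain scope giving the fastest unbalance
    reduction: 3 for hypercubes, d/2 + 1 for tori (d = topology diameter)."""
    return 3 if topology == "hypercube" else d // 2 + 1

print(best_ds("torus", 4))      # 4x4 torus, diameter 4  -> 3
print(best_ds("torus", 6))      # 6x6 torus, diameter 6  -> 4
print(best_ds("hypercube", 7))  # any hypercube          -> 3
```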
[Table 6.4 lists, for each hypercube size and each domain scope, the stdev at step 0 and the stdev, unbalance reduction percentage and trade-off factor at steps 4 and 7, for likely distributions.]

Table 6.4 Standard deviation, unbalance reduction percentage and trade-off between balance degree and time for hypercube topologies and for all possible d_s at different load-balancing steps.
[Table 6.5 lists, for each torus size (3x3 to 11x11) and each domain scope, the stdev at step 0 and the stdev, unbalance reduction percentage and trade-off factor at steps 4 and 7, for likely distributions.]

Table 6.5 Standard deviation, unbalance reduction percentage and trade-off between balance degree and time for torus topologies and for all possible d_s at different load-balancing steps.
Let us analyse in more detail what occurs with the torus topology. The torus topology exhibits a low connectivity degree because the number of directly connected processors remains constant as the system size increases. Therefore, the size of the domain rises slowly, increasing both the number of iterations needed to significantly reduce the initial unbalance and the time incurred in doing so. However, the proposed d_s value exhibits the best unbalance reduction for all system sizes, excluding the 11x11 torus, without a substantial increase in the trade-off factor at step 4. At step 7 the unbalance reduction is very close to that of step 4, but the trade-off factor is increased by more than 40%, which represents a very high increment in load-balancing time. Although this penalty is not so important for the hypercube topology, step 4 also exhibits the best compromise between balance reduction and time there. Therefore, by iterating the load-balancing process 4 times using the proposed optimal d_s, the unbalance reduction obtained will be more than 90% in most cases.

6.5 Conclusions to this chapter

We now summarise the main conclusions that arise from the above study. In scenarios where the best final balance degree is desired, the best choice is to execute the original DASUD algorithm until it finishes. Typical frameworks that suit this choice well are the execution of parallel applications such as image thinning, N-body, Gaussian reduction, Branch&Bound, ..., where the load units can be expressed in terms of data. These applications require synchronisation at particular points of their execution, and these points are good candidates for executing the load-balancing process in order to distribute the data evenly. The load-balancing process should be executed simultaneously on all processors with the aim of achieving the best balance degree. In contrast, frameworks where load units can be identified with independent processes without dependency restrictions best fit the second approach, where no termination detection technique is needed, but the number of balancing steps and the domain scope should be set before starting, depending on the underlying topology and system size. In particular, for the hypercube topology, the best d_s is equal to 3,
whereas for the torus topology the best d_s depends on the diameter of the underlying topology in the following way:

$$d_s = \frac{d}{2} + 1$$

However, for both topologies, in order to obtain a balance improvement larger than 90% with respect to the initial load imbalance, it is enough to execute the load-balancing process no more than 10 times.
A new distributed diffusion algorithm for dynamic load-balancing in parallel systems

Chapter 7

Conclusions and future work

Abstract

This chapter presents the conclusions obtained from this thesis, together with the work currently in progress and the work planned for the future in order to continue research on load-balancing algorithms.
7.1 Conclusions and main contributions

After working for the last few years on the development of this thesis, one has to think about what the initial goals were and what degree of achievement has been attained for them. We shall now describe each of the main goals of this thesis and how each one of them has been satisfactorily achieved.

First of all, our work was aimed at developing an overview of the dynamic load-balancing problem in parallel computing by summarising the main issues that must be considered in this problem. The exhaustive analysis of the state of the art in load-balancing algorithms led to the elaboration of a load-balancing algorithm taxonomy in which algorithmic design aspects, as well as generic implementation features of load-balancing strategies, are considered. This load-balancing classification is included in the following book:

[Sen00] M. A. Senar, A. Cortés, A. Ripoll et al., Chapter 4: Dynamic Load Balancing, Parallel Program Development for Cluster Computing: Methodology, Tools and Integrated Environments, editors José C. Cunha, Peter Kacsuk & Stephen C. Winter, Ed. Nova Science, (to be published).

The next goal was to develop a dynamic load-balancing algorithm that drives any initial load distribution in a parallel system into an even load distribution by treating load as indivisible units. We focused on the development of a totally distributed load-balancing algorithm, because this arose as the most popular approach in the literature, as opposed to centralised schemes. In particular, we were interested in nearest-neighbour strategies for their apparent simplicity, effectiveness and scalability. Our preliminary proposal of a load-balancing algorithm can be found in:
[Luq94] E. Luque, A. Ripoll, A. Cortés and T. Margalef, A Distributed Diffusion Method for Dynamic Load-Balancing on Parallel Computers, Actas de las V Jornadas de Paralelismo, (Málaga) Spain, September 1994, pp. 74-85.

[Luq95] E. Luque, A. Ripoll, A. Cortés and T. Margalef, A Distributed Diffusion Method for Dynamic Load Balancing on Parallel Computers, Proceedings of the Euromicro Workshop on Parallel and Distributed Processing, January 1995, pp. 43-50.

This preliminary DASUD proposal incorporated two important features. One of them was its ability to detect most of the local unbalanced load distributions and to correct them by performing only local load movements; the other lay in its asynchronous implementation. Later, a more accurate analysis of these problems led to the incorporation of actions complementary to DASUD, facilitating load movements between non-directly connected processors. With the incorporation of these actions, DASUD was always able to reach optimal local load distributions and, consequently, in most cases global balanced load distributions were achieved. The analysis of the balancing problems exhibited by discrete nearest-neighbour load-balancing algorithms can be found in the following publications:

[Cor97] A. Cortés, A. Ripoll, M. A. Senar and E. Luque, Dynamic Load Balancing Strategy for Scalable Parallel Systems, Parallel Computing: Fundamentals, Applications and New Directions (ParCo97), 1998 Elsevier Science B.V., pp. 735-738.

[Rip98] A. Ripoll, M. A. Senar, A. Cortés and E. Luque, Mapping and Load-Balancing Strategies for Parallel Programming, Computers and Artificial Intelligence, Vol. 17, No. 5, 1998, pp. 481-491.
Since DASUD is executed in an iterative way, one of the most important issues that arose was to ensure that the load-balancing process provides a final stable load distribution, i.e., that the algorithm globally converges. As far as we are aware, there is no proof in the literature that demonstrates the convergence of realistic iterative load-balancing algorithms. For this reason, a new important goal arose: to describe DASUD in a formal way that allows us to prove its convergence. This goal was completely accomplished. Furthermore, having a formal description of DASUD allowed us to derive theoretical upper bounds for the final balance degree achieved by DASUD and for the number of iterations of the algorithm needed to obtain it. The formal description of DASUD, as well as a sketch of its convergence proof, were reported in the following publication:

[Cor98] A. Cortés, A. Ripoll, M. A. Senar, F. Cedó and E. Luque, On the Stability of a Distributed Dynamic Load Balancing Algorithm, IEEE Proc. Int'l Conf. Parallel and Distributed Systems (ICPADS), 1998, pp. 435-446.

The full convergence demonstration of DASUD is provided in the following technical report:

[Cor98e] A. Cortés, A. Ripoll, M. A. Senar, F. Cedó and E. Luque, On the convergence of SID and DASUD load-balancing algorithms, PIRDI-8/98, Technical Report, Computer Science Department, Universitat Autònoma de Barcelona, 1998. http://pirdi.uab.es/

The proposed algorithm was compared by simulation to three well-known load-balancing algorithms within the nearest-neighbour category: SID (Sender-Initiated Diffusion), GDE (Generalised Dimension Exchange) and AN (Average Neighbourhood). For this purpose, a simulation environment was developed in which different load-balancing strategies could be simulated under the same conditions. This load-balancing simulator was designed to easily change parameters such as
topology (hypercube and torus), system size (from 8 to 128 processors) and initial load distribution. In particular, the set of initial load distributions simulated included situations varying from load distributions that exhibit a light initial imbalance to load distributions with a high initial unbalance degree. The comparative analysis was performed in terms of stability and efficiency. On the one hand, stability was evaluated by measuring the dif_max and σ (global standard deviation) quality metrics. On the other hand, the number of simulation steps needed to reach the final stable load distribution (steps) and the quantity of load moved during the global load-balancing process (load units, u) were evaluated to measure efficiency. The whole comparative study was performed for each quality metric and, for all of them, the influence of the initial load distribution pattern, the system size and the shape of the initial load distribution were individually considered. The results show that our algorithm obtains the best trade-off between the final balance degree and the time incurred to achieve it. This simulation environment also provided a framework to experimentally validate the theoretical upper bound for the final balance degree achieved by DASUD, as well as the upper bound for the number of iterations needed to coerce the system into a stable state. The most relevant results from this experimental evaluation have been published in:

[Cor98b] A. Cortés, A. Ripoll, M. A. Senar, P. Pons and E. Luque, Un algoritmo para balancear dinámicamente las tareas de un programa en sistemas paralelos, Actas del IV Congreso Argentino Internacional de Ciencias de la Computación (CACIC'98), 1998, pp. 707-721.

[Cor98c] A. Cortés, A. Ripoll, M. A. Senar and E. Luque, Evaluation of the Balancing Qualities of the DASUD algorithm, Actas de las IX Jornadas de Paralelismo, San Sebastián (Spain), September 1998, pp. 381-388.
[Cor98d] A. Cortés, A. Ripoll, M. A. Senar and E. Luque, Performance Comparison of Dynamic Load-Balancing Strategies for Distributed Computing, PIRDI-9/98, Technical Report, Computer Science Department, Universitat Autònoma de Barcelona, 1998. http://pirdi.uab.es/

[Cor99] A. Cortés, A. Ripoll, M. A. Senar and E. Luque, Performance Comparison of Dynamic Load-Balancing Strategies for Distributed Computing, IEEE Proc. Hawaii Int'l Conference on Systems Sciences (HICSS-32), 1999.

[Cor99b] A. Cortés, A. Ripoll, M. A. Senar, P. Pons and E. Luque, On the Performance of Nearest-Neighbours Load Balancing Algorithms in Parallel Systems, IEEE Proc. of the Seventh Euromicro Workshop on Parallel and Distributed Processing (PDP'99), 1999, pp. 170-177.

Once the goodness of DASUD had been established with respect to other algorithms in the literature, we focused on studying different aspects of DASUD. In particular, we experimentally tested the scalability of DASUD with respect to the problem size and also the system size. From the experiments performed with this aim, we can conclude that DASUD scales well, exhibiting the same behaviour for large systems as well as for large problem sizes, with only a slight dependence on the topology's diameter being shown. Finally, DASUD's ability to always reach a locally balanced load distribution by considering only load information from its immediate neighbours led us to raise the issue of what would happen if we increased DASUD's scope so as to provide it, as near neighbours, not only with processors at distance 1 from any processor, but also with those at distance 2 or 3, and so on. We analysed this effect of enlarging the locality scope of the algorithm, and an extended version of the original DASUD was proposed (d_s-DASUD). Although we intuitively sensed that having more load information at our disposal would produce the best results, from the experimental study we concluded that the increase in the communication costs incurred in collecting load information and transferring load beyond immediate neighbours, as well as the increment in the computational time of the load-balancing algorithm, degrades the
response rate of DASUD. Therefore, in execution environments where a perfectly balanced distribution is needed (the maximum load difference in the system should be smaller than or equal to one load unit), the original DASUD algorithm, in which the domain of each processor is restricted to its direct neighbourhood, obtains the best trade-off between the final balance degree and the time spent in attaining it with respect to any d_s-DASUD. However, in execution environments where it is sufficient to greatly reduce the initial load imbalance, the extended version of DASUD provides better results if the load-balancing process is stopped before reaching its conclusion. In particular, we have observed that in order to decrease the original imbalance by more than 90%, it is enough to iterate the load-balancing process 10 times. The d_s values that exhibit this ability are 3 for hypercube topologies, and half the diameter plus one for tori. For example, for a 4x4 torus topology, whose diameter is equal to 4, the best d_s is 3 and, for a 6x6 torus, we will choose d_s equal to 4. These experiments are currently in the process of publication.

7.2 Current and future work

From the experience obtained throughout the development of this work, as explained in this thesis, new ideas have emerged, some of which are practically concluded, whilst others are still being worked on. We now outline all current and future lines of work, as well as their current degree of development. An important contribution deriving from this work is the development of a general convergence proof for realistic iterative and totally distributed load-balancing algorithms. DASUD's convergence proof provides the basis for this new demonstration. A general model for realistic load-balancing algorithms has been developed and the convergence of this realistic load-balancing model is proved. As far as we are aware, it is the first convergence proof for load-balancing algorithms that treat loads as indivisible units. Therefore, since most of the realistic strategies from the literature fit well with this load-balancing model, their convergence is, in this way, fully established. This proof is reported in:
[Ced00] F. Cedó, A. Ripoll, A. Cortés, M. A. Senar and E. Luque, On the convergence of distributed load-balancing algorithms with integer load, PIRDI-2/00, Technical Report, Computer Science Department, Universitat Autònoma de Barcelona, 2000. http://pirdi.uab.es/ (submitted to SIAM Journal on Computing).

The load-balancing algorithm proposed in this work, DASUD, has been validated by simulation for a wide set of initial load distributions that have been considered as static load situations, where load is neither created nor destroyed during the load-balancing process. Therefore, in order to evaluate the impact of application dynamicity on different load-balancing algorithms, some changes should be introduced into the current load-balancing simulator. For this purpose, the load-balancing simulator used in the experimental study described in this thesis has been changed to incorporate the capability of updating the load values of each individual processor during the load-balancing simulation process.

Given the theoretical nature of this work, an immediate challenge was to carry out a real implementation of our DASUD algorithm. Currently, a preliminary real asynchronous version of DASUD has been developed. This implementation runs on a cluster of PCs under Linux. In order to be able to execute DASUD for different interconnection patterns (hypercube, torus, ...), a mechanism is provided to logically configure the underlying platform as different interconnection schemes. The following degree project is derived from this work [Her00]. Once DASUD has been implemented on a real platform, the following step should be to incorporate DASUD into a real application whose parallel execution in a multiprocessor environment generates a significant computational unbalance between processors as the application's computation progresses. Currently, we are in the process of joining the DASUD load-balancing algorithm with a parallel version of the image thinning application [Pla00]. Since this is an application with data parallelism, the observed imbalance is caused by the data distribution among the processors.
We recall that in chapter 1 of this thesis we introduced the idea that, in designing a load-balancing strategy, three functional blocks should be considered: the Load Manager block, the Load-Balancing Algorithm block and the Migration Manager block. In this thesis we focused on the development of a Load-Balancing Algorithm (DASUD), leaving the other blocks aside. However, in execution environments where the load consists of independent tasks, DASUD could be applied directly. In this case, it would simply require a tool to facilitate moving tasks between processors. But what happens when the underlying parallel application is defined as a set of cooperating tasks, i.e., when there are dependency relations between the tasks? A new line of work arises from this question, whose principal objective is to provide a framework for distribution strategies to obtain the data necessary for estimating the costs and profits of possible migrations, and then to decide whether or not to migrate such tasks.
References

[Ahm91] Ishfaq Ahmad and Arif Ghafoor, Semi-Distributed Load Balancing For Massively Parallel Multicomputer Systems, IEEE Transactions on Software Engineering, Vol. 17, No. 10, October 1991, pp. 987-1004.

[Ara96] J. Arabe, A. Beguelin, B. Lowekamp, E. Seligman, M. Starkey and P. Stephan, Dome: parallel programming in a distributed computing environment, Proc. of the International Parallel Processing Symposium (IPPS-96), 1996, pp. 218-224.

[Bak96] M. A. Baker, G. C. Fox and H. W. Yau, A Review of Commercial and Research Cluster Management Software, Northeast Parallel Architectures Center, Syracuse University, Technical Report, June 1996.

[Bar90] C. Barmon, M. N. Faruqui and G. P. Battacharjee, Dynamic Load Balancing Algorithm in a Distributed System, Microprocessing and Microprogramming 29, 1990/91, pp. 273-285.

[Bau88] K. M. Baumgartner, R. Kling and B. Wah, A Global Load Balancing Strategy for a Distributed System, Proc. Int. Conf. on Future Trends in Distributed Computing Systems, 1988, pp. 93-102.

[Bau95] Joey Baumgartner, Diane J. Cook and Behrooz Shirazi, Genetic Solutions to the Load Balancing Problem, International Conference on Parallel Processing, ICPP Workshop on Challenges for Parallel Processing, CRC Press, August 1995, pp. 72-81.

[Ber89] D. P. Bertsekas and J. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods, Prentice-Hall, Englewood Cliffs, NJ, 1989.

[Boi90] J. E. Boillat, Load Balancing and Poisson Equation in a Graph, Concurrency: Practice and Experience, Vol. 2(4), December 1990, pp. 289-313.

[Bru99] R. K. Brunner and L. V. Kalé, Handling application-induced load imbalance using parallel objects, Technical Report 99-03, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign, May 1999.

[Cha95] Hua Wu David Chang and William J. B. Oldham, Dynamic task allocation models for large distributed computing systems, IEEE Transactions on Parallel and Distributed Systems, 6(12), December 1995, pp. 1301-1315.
[Cas95] J. Casas et al., MPVM: A migration transparent version of PVM, Technical Report CSE-95-002, Dept. of Computer Science and Engineering, Oregon Graduate Institute of Science & Technology, February 1995.

[Ced00] F. Cedó, A. Ripoll, A. Cortés, M. A. Senar and E. Luque, On the convergence of distributed load-balancing algorithms with integer load, PIRDI-2/00, Technical Report, Computer Science Department, Universitat Autònoma de Barcelona, 2000. http://pirdi.uab.es/

[Cha85] K.M. Chandy and L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Trans. Comput. Syst., Vol. 3, 1985, pp. 63-75.

[Cor96] Antonio Corradi and Letizia Leonardi, Diffusive Algorithms for Dynamic Load Balancing in Massively Parallel Architectures, DEIS Technical Report No. DEIS-LIA-96-001, LIA Series No. 8, April 1996. http://wwwlia.deis.unibo.it/research/techreport.html

[Cod97] Codine: Computing in Distributed Networked Environments, GENIAS Software, URL: www.genias.de/genias/english/codine.html, 1997.

[Cor97] A. Cortés, A. Ripoll, M. A. Senar and E. Luque, Dynamic Load Balancing Strategy for Scalable Parallel Systems, Parallel Computing: Fundamentals, Applications and New Directions (ParCo97), Elsevier Science B.V., 1998, pp. 735-738.

[Cor98] A. Cortés, A. Ripoll, M. A. Senar, F. Cedó and E. Luque, On the Stability of a Distributed Dynamic Load Balancing Algorithm, IEEE Proc. Int'l Conf. Parallel and Distributed Systems (ICPADS), 1998, pp. 435-446.

[Cor98b] A. Cortés, A. Ripoll, M. A. Senar, P. Pons and E. Luque, Un algoritmo para balancear dinámicamente las tareas de un programa en sistemas paralelos, Actas del IV Congreso Argentino Internacional de Ciencias de la Computación (CACIC'98), 1998, pp. 707-721.

[Cor98c] A. Cortés, A. Ripoll, M. A. Senar and E. Luque, Evaluation of the Balancing Qualities of the DASUD algorithm, Actas de las IX Jornadas de Paralelismo, San Sebastián (Spain), September 1998, pp. 381-388.

[Cor98d] A. Cortés, A. Ripoll, M. A. Senar and E. Luque, Performance Comparison of Dynamic Load-Balancing Strategies for Distributed Computing, PIRDI-9/98,
Technical Report, Computer Science Department, Universitat Autònoma de Barcelona, 1998. http://pirdi.uab.es/

[Cor98e] A. Cortés, A. Ripoll, M. A. Senar, F. Cedó and E. Luque, On the convergence of SID and DASUD load-balancing algorithms, PIRDI-8/98, Technical Report, Computer Science Department, Universitat Autònoma de Barcelona, 1998. http://pirdi.uab.es/

[Cor99] A. Cortés, A. Ripoll, M. A. Senar and E. Luque, Performance Comparison of Dynamic Load-Balancing Strategies for Distributed Computing, IEEE Proc. Hawaii Int'l Conference on Systems Sciences (HICSS-32), 1999.

[Cor99b] A. Cortés, A. Ripoll, M. A. Senar, P. Pons and E. Luque, On the Performance of Nearest-Neighbours Load Balancing Algorithms in Parallel Systems, IEEE Proc. of the Seventh Euromicro Workshop on Parallel and Distributed Processing (PDP'99), 1999, pp. 170-177.

[Cor99c] Antonio Corradi, Letizia Leonardi and Franco Zambonelli, Diffusive Load-Balancing Policies for Dynamic Applications, IEEE Concurrency: Parallel, Distributed & Mobile Computing, January-March 1999, pp. 22-31.

[Cyb89] George Cybenko, Dynamic Load Balancing for Distributed Memory Multiprocessors, J. Parallel Distributed Comput. 7, 1989, pp. 279-301.

[Dan97] S.P. Dandamudi and M. Lo, A Hierarchical Load Sharing Policy for Distributed Systems, IEEE Int. Symp. on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Haifa, Israel, 1997, pp. 3-10.

[Die99] Ralf Diekmann, Andreas Frommer and Burkhard Monien, Efficient schemes for nearest neighbour load balancing, Parallel Computing, 25, 1999, pp. 789-812.

[Dij83] E.W. Dijkstra, W.H.J. Feijen and A.J.M. van Gasteren, Derivation of a termination algorithm for distributed computations, Inform. Processing Lett., Vol. 16, 1983, pp. 217-219.

[Dim98] B. Dimitrov and V. Rego, Arachne: a portable threads system supporting migrant threads on heterogeneous network farms, IEEE Trans. on Parallel and Distributed Systems, Vol. 9(5), 1998, pp. 459-469.
[Dou91] F. Douglis and J. Ousterhout, Transparent process migration: design alternatives and the Sprite implementation, Software - Practice and Experience, 21(8), August 1991, pp. 757-785.

[Eag85] Derek L. Eager, Edward D. Lazowska and John Zahorjan, A Comparison of Receiver-Initiated and Sender-Initiated Adaptive Load Sharing, ACM SIGMETRICS Conference on Measurement and Modelling of Computer Systems, 1985, pp. 1-3.

[Eri88] O. Eriksen, A termination detection protocol and its formal verification, Journal of Parallel and Distributed Computing, 5, 1988, pp. 82-91.

[Eva94] D.J. Evans and W.U.N. Butt, Load Balancing with Network Partitioning Using Host Groups, Parallel Computing 20, 1994, pp. 325-345.

[Fer87] D. Ferrari and S. Zhou, An empirical investigation of load indices for load balancing applications, Proc. of Performance '87, 1987, pp. 515-528.

[Fio78] S. Fiorini and R.J. Wilson, Edge-colouring of graphs, In L. W. Beineke and R.J. Wilson, editors, Selected Topics in Graph Theory, Academic Press, 1978, pp. 103-125.

[Fox89] G.C. Fox, W. Furmanski, J. Koller and P. Simic, Physical optimization and load balancing algorithms, In Proceedings of the Conference on Hypercube Concurrent Computers and Applications, 1989, pp. 591-594.

[Fra82] N. Francez and M. Rodeh, Achieving distributed termination without freezing, IEEE Trans. Software Eng., Vol. SE-8, May 1982, pp. 287-292.

[Fra99] D. Franco, I. Garcés and E. Luque, A new method to make communication latency uniform: Distributed Routing Balancing, Proc. of ACM International Conference on Supercomputing (ICS99), 1999, pp. 210-219.

[Her00] Jaime Herrero Sánchez, Diseño e implementación de un algoritmo asíncrono de balanceo de carga en un sistema distribuido, Enginyeria Superior en Informàtica (E.T.S.E.), Universitat Autònoma de Barcelona, September 2000.

[Hor93] G. Horton, A Multi-Level Diffusion Method for Dynamic Load Balancing, Parallel Computing 19, 1993, pp. 209-218.

[Hos90] S.H. Hosseini, B. Litow, M. Malkawi, J. McPherson and K. Vairavan, Analysis of a Graph Coloring Based Distributed Load Balancing Algorithm, Journal of Parallel and Distributed Computing 10, 1990, pp. 160-166.
[Hu99] Y.F. Hu and R.J. Blake, An improved diffusion algorithm for dynamic load balancing, Parallel Computing 25, 1999, pp. 417-444.

[IBM93] IBM, IBM LoadLeveler: General Information, IBM, September 1993.

[Jon97] J.P. Jones and C. Brickell, Second Evaluation of Job Queuing/Scheduling Software: Phase 1 Report, NASA Ames Research Center, NAS Technical Report NAS-97-013, June 1997.

[Kal88] L.V. Kale, Comparing the Performance of Two Dynamic Load Distribution Methods, Proceedings of the 1988 International Conference on Parallel Processing, Vol. 1, pp. 8-12.

[Kal96] L.V. Kale and S. Krishnan, Charm++: Parallel programming with message-driven objects, in Gregory V. Wilson and Paul Lu, editors, Parallel Programming using C++, MIT Press, 1996, pp. 175-213.

[Kon97] R.B. Konuru, S.W. Otto and J. Walpole, A Migratable User-level Process Package for PVM, Journal of Parallel and Distributed Computing, 40, 1997, pp. 81-102.

[Kum92] D. Kumar, Development of a class of distributed termination detection algorithms, IEEE Trans. Knowledge and Data Eng., Vol. 4, No. 2, April 1992, pp. 145-155.

[Kun91] T. Kunz, The influence of different workload descriptions on a heuristic load balancing scheme, IEEE Trans. on Software Engineering, 17(7), 1991, pp. 725-730.

[Lin87] Frank C. H. Lin and Robert M. Keller, The Gradient Model Load Balancing Method, IEEE Transactions on Software Engineering, Vol. SE-13, No. 1, January 1987, pp. 32-38.

[Lin92] Hwa-Chun Lin and C. S. Raghavendra, A Dynamic Load-Balancing Policy with a Central Job Dispatcher (LBC), IEEE Transactions on Software Engineering, Vol. 18, No. 2, February 1992, pp. 148-158.

[Lül91] R. Lüling, B. Monien and F. Ramme, Load balancing in large networks: A comparative study, In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, December 1991, pp. 686-689.

[Luq94] E. Luque, A. Ripoll, A. Cortés and T. Margalef, A Distributed Diffusion Method for Dynamic Load-Balancing on Parallel Computers, Actas de las V
Jornadas de Paralelismo, Las Alpujarras (Málaga), Spain, September 1994, pp. 74-85.

[Luq95] E. Luque, A. Ripoll, A. Cortés and T. Margalef, A Distributed Diffusion Method for Dynamic Load Balancing on Parallel Computers, Proceedings of the Euromicro Workshop on Parallel and Distributed Processing, January 1995, pp. 43-50.

[Mas96] E. Mascarenhas and V. Rego, Ariadne: Architecture of a portable threads system supporting thread migration, Software - Practice and Experience, Vol. 26(3), 1996, pp. 327-357.

[Mat87] F. Mattern, Algorithms for distributed termination detection, Distributed Computing, 2, 1987, pp. 161-175.

[Mil93] D.S. Milojicic, W. Zint, A. Dangel and P. Giese, Task migration on top of the Mach microkernel, in Mach III Symposium Proceedings, Santa Fe, New Mexico, April 19-21, 1993, pp. 273-289.

[Mun95] F.J. Muniz and E.J. Zaluska, Parallel load-balancing: An extension to the gradient model, Parallel Computing, 21, 1995, pp. 287-301.

[Mut98] S. Muthukrishnan, B. Ghosh and M.H. Schultz, First- and Second-Order Diffusive Methods for Rapid, Coarse, Distributed Load Balancing, Theory of Computing Systems, 31, 1998, pp. 331-354.

[Mur97] Tina A. Murphy and John G. Vaughan, On the Relative Performance of Diffusion and Dimension Exchange Load Balancing in Hypercubes, Proc. of the Fifth Euromicro Workshop on Parallel and Distributed Processing (PDP'97), January 1997, pp. 29-34.

[Ni93] L.M. Ni and P.K. McKinley, A survey of wormhole routing techniques in direct networks, IEEE Computer 26(2), 1993, pp. 62-76.

[Ove96] B.J. Overeinder, P.M.A. Sloot, R.N. Heederik and L.O. Hertzberger, A dynamic load balancing system for parallel cluster computing, Future Generation Computer Systems, 12(1), May 1996, pp. 101-115.

[Pap84] C. Papadimitriou, Computational Complexity, Addison-Wesley, 1994.

[Pet98] S. Petri, M. Bolz and H. Langendörfer, Migration and rollback transparency for arbitrary distributed applications in workstation clusters, Proc. of the Workshop on Run-Time Systems for Parallel Programming, held in conjunction with IPPS/SPDP'98, 1998.
[Pla00] Mercedes Planas Sánchez, Diseño e implementación de una aplicación paralela en un sistema distribuido: Algoritmo de Thinning, Enginyeria Superior en Informàtica (E.T.S.E.), Universitat Autònoma de Barcelona, September 2000.

[Pru95] J. Pruyne and M. Livny, Providing resource management services to parallel applications, in J. Dongarra and B. Tourancheau, editors, 2nd Workshop on Environments and Tools for Parallel Scientific Computing, 1995, pp. 152-161.

[Ran83] S.P. Rana, A distributed solution of the distributed termination problem, Inf. Process. Letters, Vol. 17, 1983, pp. 43-46.

[Ran88] S. Ranka, Y. Won and S. Sahni, Programming a hypercube multicomputer, IEEE Software, 5, September 1988, pp. 69-77.

[Rip98] A. Ripoll, M.A. Senar, A. Cortés and E. Luque, Mapping and Load-Balancing Strategies for Parallel Programming, Computers and Artificial Intelligence, Vol. 17, No. 5, 1998, pp. 481-491.

[Ron90] S. Rönn and H. Haikkonen, Distributed termination detection with counters, Information Processing Letters, Vol. 34, 1990, pp. 223-227.

[Rus99] S.H. Russ et al., Hector: An Agent-Based Architecture for Dynamic Resource Management, IEEE Concurrency, April-June 1999, pp. 47-55.

[Sal90] Vikram A. Saletore, A Distributed and Adaptive Dynamic Load Balancing Scheme for Parallel Processing of Medium-Grain Tasks, In Proc. of the 5th Distributed Memory Computing Conference, 1990, pp. 994-999.

[San94] J. Sang, G. Peters and V. Rego, Thread migration on heterogeneous systems via compile-time transformations, Proc. Int'l Conf. Parallel and Distributed Systems (ICPADS), 1994, pp. 634-639.

[Sav96] Serap A. Savari and Dimitri P. Bertsekas, Finite Termination of Asynchronous Iterative Algorithms, Parallel Computing, Vol. 22, 1996, pp. 39-56.

[Sen00] M.A. Senar, A. Cortés, A. Ripoll et al., Chapter 4: Dynamic Load Balancing, in Parallel Program Development for Cluster Computing: Methodology, Tools and Integrated Environments, editors José C. Cunha, Peter Kacsuk and Stephen C. Winter, Nova Science (to be published).
[Shi89b] Y. Shin and J. Fier, Hypercube systems and key applications, In K. Hwang and D. DeGroot, editors, Parallel Processing for Supercomputers and Artificial Intelligence, McGraw-Hill Publishing Co., 1989, pp. 203-243.

[Shu89] W. Shu and L.V. Kale, A dynamic scheduling strategy for the Chare-kernel system, In Proceedings of Supercomputing '89, November 1989.

[Sil94] L. Silva, B. Veer and J. Silva, Checkpointing SPMD applications on transputer networks, Proc. of the Scalable High Performance Computing Conference, 1994, pp. 694-701.

[Sin97] P. K. Sinha, Distributed Operating Systems: Concepts and Design, IEEE Press, 1997.

[Smi97] P. Smith and N. C. Hutchinson, Heterogeneous Process Migration: The Tui System, Technical Report, Department of Computer Science, University of British Columbia, March 14, 1997.

[Son94] Jianjian Song, A Partially Asynchronous and Iterative Algorithm for Distributed Load Balancing, Parallel Computing 20, 1994, pp. 853-868.

[Sta84] J.A. Stankovic and I.S. Sidhu, An adaptive bidding algorithm for processes, clusters and distributed groups, In Proceedings of the 4th International Conference on Distributed Computer Systems, May 1984, pp. 49-59.

[Ste95] B. Steensgaard and E. Jul, Object and native code thread mobility among heterogeneous computers, Proc. ACM Symp. on Operating Systems Principles, 1995, pp. 68-78.

[Str98] V. Strumpen and B. Ramkumar, Portable Checkpointing for Heterogeneous Architectures, in Fault-Tolerant Parallel and Distributed Systems, editors Dimiter R. Avresky and David R. Kaeli, Kluwer Academic Press, 1998, chapter 4, pp. 73-92.

[Sub94] Raghu Subramanian and Isaac D. Scherson, An Analysis of Diffusive Load-Balancing, In Proceedings of the 6th ACM Symposium on Parallel Algorithms and Architectures, 1994.

[Szy85] B. Szymanski, Y. Shi and S. Prywes, Synchronized distributed termination, IEEE Transactions on Software Engineering, SE-11(10), October 1985, pp. 1136-1140.

[Tan95] T. Tannenbaum and M. Litzkow, The Condor distributed processing system, Dr. Dobb's Journal, 1995, pp. 40-48.
[The85] M. M. Theimer, K. A. Lantz and D. R. Cheriton, Preemptable remote execution facilities for the V-System, Proceedings of the 10th ACM Symposium on Operating Systems Principles, Orcas Island, Washington, December 1-4, 1985, pp. 2-12.

[Top84] R.W. Topor, Termination detection for distributed computations, Inform. Process. Lett., Vol. 18, 1984, pp. 33-36.

[Wat98] Jerrell Watts and Stephen Taylor, A Practical Approach to Dynamic Load Balancing, IEEE Transactions on Parallel and Distributed Systems, Vol. 9, No. 3, March 1998, pp. 235-248.

[Wil93] Marc H. Willebeek-LeMair and Anthony P. Reeves, Strategies for Dynamic Load Balancing on Highly Parallel Computers, IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 9, September 1993, pp. 979-993.

[Wu96] Min-You Wu and Wei Shu, The Direct Dimension Exchange Method for Load Balancing in k-ary n-cubes, IEEE Symposium on Parallel and Distributed Processing, October 1996, pp. 366-369.

[Xu92] C.-Z. Xu and F. C. M. Lau, Analysis of the Generalized Dimension Exchange Method for Dynamic Load Balancing, Journal of Parallel and Distributed Computing, 16, 1992, pp. 385-393.

[Xu93] C.-Z. Xu and F.C.M. Lau, Optimal Parameters for Load Balancing Using the Diffusion Method in k-ary n-cube Networks, Information Processing Letters 47, 1993, pp. 181-187.

[Xu94] Chengzhong Xu and Francis C. M. Lau, Iterative Dynamic Load Balancing in Multicomputers, Journal of the Operational Research Society, Vol. 45, No. 7, July 1994, pp. 786-796.

[Xu95] Cheng-Zhong Xu and Francis C. M. Lau, The Generalized Dimension Exchange Method for Load Balancing in k-ary n-cubes and Variants, Journal of Parallel and Distributed Computing 24, 1995, pp. 72-85.

[Xu97] Chengzhong Xu and Francis C. M. Lau, Load Balancing in Parallel Computers: Theory and Practice, Kluwer Academic Publishers, 1997.

[Yua90] Shyan-Ming Yuan, An Efficient Periodically Exchanged Dynamic Load-Balancing Algorithm, International Journal of Mini and Microcomputers, Vol. 12, No. 1, 1990, pp. 1-6.
[Zho88] Songnian Zhou, A Trace-Driven Simulation Study of Dynamic Load Balancing, IEEE Transactions on Software Engineering, Vol. 14, No. 9, September 1988, pp. 1327-1341.

[Zna91] T.F. Znati, R.G. Melhem and K.R. Pruhs, Dilation-based bidding schemes for dynamic load balancing on distributed processing systems, In Proceedings of the 6th Distributed Memory Computing Conference, April 1991, pp. 129-136.
Appendix A

DASUD load-balancing algorithm: experimental and theoretical annexes

In chapter 3, a full description of DASUD's behaviour was given together with a theoretical study of the algorithm. From that analysis, theoretical upper bounds were derived for the final balance degree and for the convergence rate of the proposed algorithm. In this appendix, we include an experimental validation of these bounds using the same experimental framework introduced in chapter 4. Finally, a general load-balancing model and its convergence proof are provided.
A.1 Experimental validation of DASUD's final balance degree

Recall that one of the relevant characteristics of DASUD is its ability to search for and balance unbalanced domains. Since the domain of a given processor i coincides with its immediate neighbours, DASUD can reach load distributions that are locally balanced but not always globally balanced. However, this fact is controlled by the existence of an upper bound on the maximum global load difference achieved at the end of the load-balancing process, which was provided in chapter 3. This bound prevents DASUD from reaching poorly balanced final situations. Recall from that chapter that this bound is referred to as β and is defined as

    β = ⌈d/2⌉ + 1,

where d is the diameter of the interconnection topology.

In this section, we experimentally validate this upper bound for the final balance degree by comparing the final maximum load difference obtained by simulation against the theoretical value. Firstly, table A.1 shows the value of the diameter (d) for the hypercube and torus topologies for all simulated system sizes, together with the corresponding β value, which has been evaluated theoretically. Table A.2 shows the maximum value of the maximum load difference obtained by DASUD in all our tests. As can be seen, even in the worst case DASUD always achieves a maximum difference lower than the corresponding value of β. This means that, even for highly pathological initial distributions, DASUD is able to obtain a final maximum difference bounded by half of the diameter of the architecture plus one.

    n processors              | 8/9   | 16    | 32/36 | 64    | 121/128
    Topology (diameter)       | d  β  | d  β  | d  β  | d  β  | d   β
    Hypercube (log2 n)        | 3  3  | 4  3  | 5  4  | 6  4  | 7   5
    Torus (2·⌊√n/2⌋)          | 2  2  | 4  3  | 6  4  | 8  5  | 10  6

Table A.1. Diameter of the simulated topologies and the corresponding β bound.
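As a quick cross-check of Table A.1, the β bound can be computed directly from the topology parameters. The following Python sketch is our own illustration, not part of the thesis simulator, and all function names in it are ours; it evaluates β = ⌈d/2⌉ + 1 for the simulated hypercube and torus sizes and tests whether a final load distribution respects the bound.

```python
import math

def hypercube_diameter(n_procs):
    # A hypercube with n = 2^k processors has diameter k = log2(n).
    return int(math.log2(n_procs))

def torus_diameter(n_procs):
    # A square k x k torus (n = k^2 processors) has diameter 2 * floor(k / 2).
    k = math.isqrt(n_procs)
    return 2 * (k // 2)

def beta_bound(diameter):
    # Final balance degree bound: half the diameter, rounded up, plus one.
    return math.ceil(diameter / 2) + 1

def respects_beta(final_loads, diameter):
    # True if the final maximum load difference stays within the beta bound.
    return max(final_loads) - min(final_loads) <= beta_bound(diameter)

# Reproduces the hypercube and torus rows of Table A.1.
for n_hyper, n_torus in [(8, 9), (16, 16), (32, 36), (64, 64), (128, 121)]:
    d_h, d_t = hypercube_diameter(n_hyper), torus_diameter(n_torus)
    print(n_hyper, d_h, beta_bound(d_h), "|", n_torus, d_t, beta_bound(d_t))
```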
Likely distributions (% load variation):

              % load var. | 8/9 | 16 | 32/36 | 64 | 121/128
    Hypercube 25%         |  1  |  1 |   2   |  3 |   3
              50%         |  1  |  1 |   2   |  2 |   3
              75%         |  1  |  1 |   2   |  3 |   3
              100%        |  1  |  1 |   2   |  2 |   3
    Torus     25%         |  1  |  1 |   2   |  3 |   4
              50%         |  1  |  1 |   2   |  3 |   4
              75%         |  1  |  2 |   2   |  3 |   4
              100%        |  1  |  1 |   2   |  3 |   4

Pathological distributions (idle processors):

              idle proc.  | 8/9 | 16 | 32/36 | 64 | 121/128
    Hypercube 25%         |  0  |  1 |   2   |  2 |   3
              50%         |  2  |  1 |   2   |  2 |   3
              75%         |  0  |  1 |   2   |  3 |   3
              n-1         |  2  |  1 |   2   |  3 |   3
    Torus     25%         |  1  |  1 |   2   |  3 |   4
              50%         |  1  |  2 |   1   |  2 |   4
              75%         |  1  |  1 |   2   |  2 |   4
              n-1         |  1  |  1 |   1   |  2 |   4

Table A.2. Maximum dif_max on average for likely and pathological distributions.

In the following, a similar experimental validation is reported for the number of iterations needed by DASUD to reach the final stable load distributions.

A.2 Experimental validation of DASUD's convergence rate

We have seen, by validating the theoretical results against those obtained by simulation, that DASUD achieves a good global balance degree at the end of the load-balancing process. In this section, the same comparison is performed with respect to the "time" needed to reach the final stable situation. Since DASUD is an iterative load-balancing algorithm, we measure this "time" in terms of simulation steps. In chapter 3, two upper bounds on the number of steps needed to complete the load-balancing process were conjectured. We validated only the one referred to as Conjecture B, since it is the more precise of the two; its definition is given below. Remember that d denotes the topology diameter and D0 is the maximum initial load difference.

    B = (d/2)·(D0 + 1)

It was verified by simulation that this bound holds for all distributions. As an example, we show the data corresponding to the situations in which the greatest number of steps was consumed, contrasted with the theoretical values in tables A.3 and A.4, where the results for likely and pathological load
distributions are shown, respectively. On the one hand, the Bound B column of both tables shows the value of the bound obtained according to Conjecture B; each value has been computed using the biggest initial load difference for the given set of distributions with the same load variation and the same number of processors. On the other hand, the Max. steps column contains the maximum number of steps spent by the DASUD algorithm for that particular set of distributions.

Likely distributions:

                              Hypercube                    Torus
            % load variation  25%   50%   75%   100%      25%   50%   75%   100%
    8/9     Max. steps         11    13    14    14        10    11    11    12
            Bound B           282   564   845  1125       167   333   499   666
    16      Max. steps         14    16    17    19        14    17    18    20
            Bound B           189   377   564   752       189   377   564   750
    32/36   Max. steps         17    21    24    27        18    26    31    33
            Bound B           119   236   354   471       127   252   377   499
    64      Max. steps         16    25    27    30        19    28    35    40
            Bound B            73   143   213   284        97   191   285   379
    121/128 Max. steps         11    19    24    28        16    30    33    45
            Bound B            44    85   126   164        66   128   190   252

Table A.3. Maximum number of steps obtained by simulation against Conjecture B for hypercube and torus topologies using likely initial load distributions.

Pathological distributions:

                        8/9           16            32/36         64            121/128
         idle proc.  Max.  Bound   Max.  Bound   Max.  Bound   Max.  Bound   Max.  Bound
                    steps    B    steps    B    steps    B    steps    B    steps    B
    Hyper.  25%       12    751     15    502     21    315     21    192     18    112
            50%       11   1126     18    752     21    472     27    285     27    168
            75%       15   2251     17   1502     26    940     27    567     33    332
            n-1       12   4500     17   6002     25   7502     32   9003     35  10503
    Torus   25%        9    430     13    502     25    339     28    256     31    170
            50%       10    600     17    752     33    504     41    380     49    255
            75%        9   1000     17   1502     38   1005     53    756     67    490
            n-1       11   3001     17   6002     40   9003     59  12004     82  15005

Table A.4. Maximum number of steps obtained by simulation against Conjecture B for hypercube and torus topologies using pathological initial load distributions.
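To make the comparison in Tables A.3 and A.4 mechanical, the conjectured bound can be evaluated directly. The helper below is our own sketch (not the thesis simulator) and assumes the reconstructed form B = (d/2)·(D0 + 1); the D0 value used in the example is inferred from the table entry and is only illustrative.

```python
import math

def conjecture_b(diameter, d0):
    # Conjecture B as reconstructed above: B = (d / 2) * (D0 + 1),
    # truncated to a whole number of steps.
    return math.floor(diameter / 2 * (d0 + 1))

def within_conjecture_b(observed_steps, diameter, d0):
    # The simulated step count should never exceed the conjectured bound.
    return observed_steps <= conjecture_b(diameter, d0)

# Example: a torus of 121 processors (d = 10) with all load initially on one
# processor and an assumed initial difference D0 = 3000 gives 15005, matching
# the corresponding Bound B entry of Table A.4.
print(conjecture_b(10, 3000))             # -> 15005
print(within_conjecture_b(82, 10, 3000))  # the observed 82 steps respect the bound
```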
A.3 A realistic load-balancing model

Each processor i keeps in its memory an estimate w_ij(t) of the load carried by each neighbouring processor, and by some other non-immediate neighbour processors j, at time t. The load values from neighbour processors are periodically updated because each processor periodically sends information about its load to its neighbours. Furthermore, each processor i sporadically receives information about the load of some other non-directly connected processors through its neighbours. Due to communication delays and asynchronism, these estimates can be outdated, and we assume that

    w_ij(t) = w_j(τ_ij(t)),

where τ_ij(t) is a certain time instant satisfying 0 ≤ τ_ij(t) ≤ t. As the set of processors whose load values are kept in the memory of a given processor i is not a static set, because it depends on time, we call it the temporal domain and we refer to it as the t-domain of processor i at time t. Formally, the t-domain of processor i is defined as

    D(i, t) = { j ∈ P | i has an estimate of the load of processor j at time t }.

Periodically, each processor i compares its load with the estimates of the loads of the processors in its t-domain. We say that processor i detects its t-domain as unbalanced at time t if there is j ∈ D(i, t) such that w_i(t) − w_ij(t) > 1. In this case, processor i transfers a non-negative amount of load, s_ij(t), to processor j. Note that s_ij(t) is an integer variable. So there is a set of times T_i at which processor i sends information about its load to its neighbours and compares its load with the loads of the processors in its t-domain; if it finds that it is overloaded, it transfers some of its load to underloaded processors in its t-domain following assumption 1.
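To make the notation concrete, the sketch below is purely illustrative (it is not how the real DASUD implementation is necessarily structured) and models one processor's local view: its own load, the timestamped estimates that define its t-domain, and the unbalance test w_i(t) − w_ij(t) > 1.

```python
from dataclasses import dataclass, field

@dataclass
class Processor:
    # Illustrative local state of one processor in the model; names are ours.
    pid: int
    load: int                                      # w_i(t), an integer number of load units
    estimates: dict = field(default_factory=dict)  # j -> (w_ij(t), tau_ij(t))

    def t_domain(self):
        # D(i, t): the processors for which i currently holds a load estimate.
        return set(self.estimates)

    def detects_unbalance(self):
        # The t-domain is unbalanced if some estimated load is more than
        # one unit below the local load w_i(t).
        return any(self.load - w_ij > 1 for (w_ij, _) in self.estimates.values())
```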
Assumption 1. For every non-negative integer t and all i, j ∈ P, s_ij(t) is a non-negative integer. Furthermore, we have:

    s_ij0(t) > 0  ⟹  t ∈ T_i,  j0 ∈ D(i, t)  and  w_i(t) − Σ_{j ∈ P} s_ij(t) ≥ w_ij0(t) + s_ij0(t).

This assumption is needed in order to prohibit processor i from transferring a very large amount of load and creating a load imbalance in the opposite direction, and it precludes the possibility that two processors keep sending load to each other back and forth without ever reaching equilibrium. More precisely, if processor i is overloaded at time t and it decides to distribute a portion of its load among several processors belonging to its t-domain at that time t, then the remaining load of processor i, after carrying out those load movements, must be greater than or equal to the load of any of the processors that have received some load. As we have previously mentioned, Bertsekas and Tsitsiklis [Ber89] divide asynchronous algorithms into two groups: totally asynchronous and partially asynchronous. To paraphrase them, totally asynchronous algorithms "can tolerate arbitrarily large communication and computation delays", but partially asynchronous ones "are not guaranteed to work unless there is an upper bound on those delays". This bound is denoted by a constant B called the asynchronism measure. We adopt the partially asynchronous assumption, which is described by assumption 2.

Assumption 2 (partial asynchronism). There exists a positive integer B such that:

(a2) For every i ∈ P and for every t ≥ 0, T_i ∩ {t, t+1, ..., t+B−1} ≠ ∅.

(b2) For all i ∈ P and all t, and all j ∈ D(i, t), t − B < τ_ij(t) ≤ t.

(c2) The load s_ij(t) sent from processor i to processor j at time t ∈ T_i is received by processor j before time t + B.
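Returning to Assumption 1, its inequality can be checked mechanically for a proposed set of transfers from one sender; the following helper is our own sketch under that reading of the assumption, not part of DASUD's specification.

```python
def respects_assumption_1(w_i, estimates, transfers):
    # Check Assumption 1 for one sending processor (illustrative; names are ours).
    #   w_i       -- current load of the sender, w_i(t)
    #   estimates -- dict j -> w_ij(t), estimates for the sender's t-domain
    #   transfers -- dict j -> s_ij(t), proposed transfers
    if any(not isinstance(s, int) or s < 0 for s in transfers.values()):
        return False                      # transfers must be non-negative integers
    remaining = w_i - sum(transfers.values())
    # After sending, the sender may not drop below any receiver's estimated new load.
    return all(remaining >= estimates[j] + s
               for j, s in transfers.items() if s > 0)
```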
Part (a2) of assumption 2 postulates that each processor performs a step of the load-balancing process at least once during any time interval of length B; part (b2) states that the load estimates kept in memory by any processor at a given time t were obtained at some time between t − B and t; and part (c2) postulates that load messages will not be delayed by more than B time units. Note that synchronous algorithms also fulfil this assumption, since they are a particular case of the partially asynchronous one in which the asynchronism measure B becomes zero. Finally, assumption 3 describes the so-called static situation, where no load is created or consumed during the load-balancing process and, therefore, load is conserved.

Assumption 3. The total load L of the system is constant. More precisely: let v_ij(t) be the amount of load that has been sent from processor i to processor j before time t, but that has not been received by processor j before time t, and let r_ji(t) be the load received by processor i from processor j at time t. Then we have

    w_i(t+1) = w_i(t) − Σ_{j ∈ P} s_ij(t) + Σ_{j ∈ P} r_ji(t),

and so

    Σ_{i ∈ P} w_i(t) + Σ_{i, j ∈ P} v_ij(t) = L   for all t.
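Assumption 3 can be tracked as an explicit invariant in a simulation of this model. The update rule below is an idealised sketch of our own (not DASUD itself, and simplified to a fixed message delay); after every step, processor loads plus load still in transit should still add up to L.

```python
def step(loads, in_transit, sends, now, delay=1):
    # One update of the model's state (illustrative sketch, not DASUD itself).
    #   loads      -- dict i -> integer load w_i(now)
    #   in_transit -- list of (sender, receiver, amount, arrival_time) messages
    #   sends      -- dict (i, j) -> non-negative integer s_ij(now)
    #   delay      -- message delay, which part (c2) requires to stay below B
    arrived = [m for m in in_transit if m[3] <= now]
    in_transit = [m for m in in_transit if m[3] > now]
    for _, receiver, amount, _ in arrived:
        loads[receiver] += amount                  # the r_ji(now) contributions
    for (i, j), s in sends.items():
        if s > 0:
            loads[i] -= s                          # load leaves the sender immediately
            in_transit.append((i, j, s, now + delay))
    return loads, in_transit

def load_is_conserved(loads, in_transit, total_load):
    # Assumption 3: processor loads plus load still in transit always equal L.
    return sum(loads.values()) + sum(m[2] for m in in_transit) == total_load
```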
The aim of the rest of this section is to state and prove that this iterative distributed load-balancing (IDLB) model is finite. For that purpose, Theorem A.1 is stated and proved.

Theorem A.1. Under assumptions 1, 2 and 3, the load-balancing algorithm described above is finite. That means that there exists a time t' ≥ 0 such that, for all t ≥ t' and all i, j ∈ P, s_ij(t) = 0.

Proof. For notational convenience, we define w_i(t) = w_i(0) for all t < 0. Let

    m_1(t) = min { w_i(τ) | i ∈ P, t − B < τ ≤ t };

that is, m_1(t) is the minimum load value in the whole system over a given time interval of length B. The minimum load value that occupies the k-th place when loads are sorted in ascending order over a given time interval of length B is defined as

    m_k(t) = min { w_i(τ) | i ∈ P, t − B < τ ≤ t, w_i(τ) > m_{k−1}(t) }

for every integer k > 1. For notational convenience, we define min ∅ = L + 1. Let

    P_k(t) = { i ∈ P | w_i(τ) = m_k(t) for some τ with t − B < τ ≤ t }

for every integer k ≥ 1; it represents the set of processors whose load value is equal to the minimum load value of order k. By induction on k, we shall prove that there exists an increasing sequence t_1 < t_2 < t_3 < ... of positive integers such that, for all k ≥ 1:
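The quantities m_k(t) and P_k(t) can be computed directly from the load histories over a window of length B. The helper below is a sketch based on our reading of the definitions above (in particular of P_k(t)), so its exact membership rule should be checked against the formal statement in chapter 3.

```python
def minimum_orders(history, t, B):
    # Compute the pairs (m_k(t), P_k(t)) from per-processor load histories.
    #   history -- dict i -> list of loads w_i(0), ..., w_i(t)
    #   B       -- the asynchronism measure
    lo = max(0, t - B + 1)                          # integer times in the window (t - B, t]
    window = {i: set(loads[lo:t + 1]) for i, loads in history.items()}
    levels = sorted(set().union(*window.values()))  # m_1(t) < m_2(t) < ...
    return [(m, {i for i, vals in window.items() if m in vals}) for m in levels]
```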
(1) m_k(t) = m_k(t_k), for all t ≥ t_k;

(2) P_k(t) = P_k(t_k), for all t ≥ t_k;

(3) s_ij(t) = r_ji(t) = 0, for all t ≥ t_k, all i ∈ P_k(t_k) and all j ∈ P.

The previous three items postulate that there exists a time t_k beyond which the minimum load value of order k remains constant (1), the set of processors holding that load value does not change (2), and no load is received or sent by any of these processors (3). Note that P = ∪_k P_k(t) for all t. Since this is a disjoint union and P is a finite set, by definition of P_k(t) we see that P = ∪_{k=1}^{n} P_k(t) for all t, where n = |P|. Thus, with t' = t_n, the theorem follows.

As the model's convergence is based on the three items outlined above, let us prove each of them separately. Since the proofs are performed by induction on k, we first include the proofs of items (1), (2) and (3) for k = 1 and, following that, the induction step for k > 1 is provided for each item.

In order to see (1) for k = 1, we fix a processor i and a time t. If s_ij(t) = 0 for all j ∈ P, then

    w_i(t+1) = w_i(t) + Σ_{j ∈ P} r_ji(t) ≥ w_i(t) ≥ m_1(t).

If s_ij0(t) > 0 for some j0 ∈ P, then

    w_i(t+1) = w_i(t) − Σ_{j ∈ P} s_ij(t) + Σ_{j ∈ P} r_ji(t)
             ≥ w_ij0(t) + s_ij0(t)      (by assumption 1)
             ≥ m_1(t)                   (by (b2)).
Thus m_1(t+1) ≥ m_1(t) for all t. Since (m_1(t))_{t ≥ 0} is a non-decreasing sequence of integers and m_1(t) ≤ L, there exists a positive integer t'_1 such that m_1(t) = m_1(t'_1) for all t ≥ t'_1. So (1) follows for k = 1.

In order to prove (2), we shall see that

    P_1(t'_1) ⊇ P_1(t'_1 + 1) ⊇ P_1(t'_1 + 2) ⊇ ...   (*)

Let t ≥ t'_1 and let i ∈ P \ P_1(t). If s_ij(t) = 0 for all j ∈ P, then w_i(t+1) ≥ w_i(t) > m_1(t) = m_1(t+1). If s_ij0(t) > 0 for some j0 ∈ P, then, as above, w_i(t+1) > m_1(t) = m_1(t+1). Thus i ∈ P \ P_1(t+1), and this proves (*). Since P_1(t'_1) is a finite set, there exists t_1 ≥ t'_1 such that P_1(t) = P_1(t_1) for all t ≥ t_1. So (2) is true for k = 1, since P_1(t) = P_1(t_1) for all t ≥ t_1.

In order to prove (3), let t ≥ t_1, let i ∈ P_1(t_1) and let j0 ∈ P. If j0 ∈ D(i, t) then