Load Balancing Techniques

Lecture Outline

The following topics will be discussed:
- Static load balancing
- Dynamic load balancing
- Mapping for load balancing
- Minimizing interaction
Objectives

- The amount of computation assigned to each processor is balanced, so that some processors do not sit idle while others are executing tasks.
- The interaction among the different processors is minimized, so that the processors spend most of their time doing useful work.
- These goals can conflict: to balance the load among processors, it may be necessary to assign tasks that interact heavily to different processors.

General Classes of Problems

- The first class consists of problems in which all the tasks are available at the beginning of the computation, but the amount of time required by each task is different.
- The second class consists of problems in which tasks are available at the beginning, but as the computation progresses the amount of time required by each task changes.
- The third class consists of problems in which tasks are not available at the beginning but are generated dynamically.
Static Load Balancing

- Distributes the work among processors prior to the execution of the algorithm.
- Example: matrix-matrix computation.
- Easy to design and implement.

Dynamic Load Balancing

- Distributes the work among processors during the execution of the algorithm.
- Algorithms that require dynamic load balancing are somewhat more complicated (examples: parallel graph partitioning and adaptive finite element computations).

Static Load-Balancing Techniques

Load is assigned before the execution of any process. Some potential static load-balancing techniques:
- Round robin: passes out tasks in sequential order of processes, coming back to the first process when all processes have been given a task.
- Randomized: selects processes at random to take tasks.
- Recursive bisection: recursively divides the problem into subproblems of equal computational effort while minimizing message passing.
- Simulated annealing: an optimization technique.
- Genetic algorithms: another optimization technique.
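As a minimal sketch (in Python, with illustrative names not taken from the lecture), the round-robin scheme above can be expressed as a simple modulo assignment:

```python
# Sketch of round-robin static task assignment: task i goes to process
# i mod num_procs, cycling back to process 0 when all have one task.
def round_robin_assign(num_tasks, num_procs):
    """Return a dict mapping each process to the list of tasks it receives."""
    assignment = {p: [] for p in range(num_procs)}
    for task in range(num_tasks):
        assignment[task % num_procs].append(task)
    return assignment

# Example: 7 tasks handed out over 3 processes.
print(round_robin_assign(7, 3))  # {0: [0, 3, 6], 1: [1, 4], 2: [2, 5]}
```

Note that this balances the *number* of tasks, not their execution times, which is exactly why static schemes struggle when task times vary.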
Limitations of Static Load Balancing

Since load is balanced prior to execution, static load balancing has several fundamental flaws, even if a mathematical solution exists:
- It is very difficult to estimate accurately the execution times of various parts of a program without actually executing those parts.
- Communication delays vary under different circumstances.
- Some problems have an indeterminate number of steps to reach their solution.

General Strategy for Load Balancing
Static Load Balancing: Array Distribution Schemes

- One-dimensional (strip) distribution
- Block distribution of a matrix (checkerboard)
- Block-cyclic distribution
- Randomized block distribution

Using Striped Data Decomposition

When striped data decomposition is used to derive concurrency, a suitable decomposition of the data can itself be used to balance the load and minimize interactions.
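The one-dimensional strip scheme can be sketched as follows (a Python illustration with assumed names): each process receives a contiguous range of rows, with the remainder spread over the first few processes.

```python
# Sketch of a one-dimensional (strip) distribution: the n rows of a matrix
# are split into contiguous strips, one strip per process.
def strip_distribution(n_rows, num_procs):
    """Return, for each process, the half-open (start, end) range of rows it owns."""
    base, extra = divmod(n_rows, num_procs)
    ranges, start = [], 0
    for p in range(num_procs):
        size = base + (1 if p < extra else 0)  # first `extra` procs get one extra row
        ranges.append((start, start + size))
        start += size
    return ranges

print(strip_distribution(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```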
Matrix-Matrix Addition

[Figure: one-dimensional strip distribution for matrix-matrix addition; processes P0 through P7 each own the corresponding strip of matrices A, B, and C]

Block Distribution (Dense Matrix-Matrix Multiplication)

[Figure: two-dimensional block (checkerboard) distribution for dense matrix-matrix multiplication]
[Figure: (a) a block distribution and (b) a block-cyclic distribution of a matrix among 16 processes, P0 through P15]

Distribution (a) will lead to load imbalances, for example for a sparse matrix in Gaussian elimination. Using the block-cyclic distribution shown in (b) spreads the computations more evenly over the processors.

Schemes for Dynamic Load Balancing

Dynamic Partitioning

- There are problems in which we cannot statically partition the work among the processors.
- In these problems, a static work partitioning is either impossible (e.g., the third class of problems above) or can potentially lead to serious load-imbalance problems (e.g., the first and second classes).
- The only way to develop efficient message-passing programs for these classes of problems is to allow dynamic load balancing.
- Thus, during the execution of the program, work is dynamically transferred from processors that have a lot of work to those that have little or no work.
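The block-cyclic ownership rule behind distribution (b) reduces to a pair of modulo operations. A minimal sketch in Python (function and parameter names are illustrative):

```python
# Sketch of 2-D block-cyclic ownership: block (i, j) of the matrix is
# assigned to process (i mod proc_rows, j mod proc_cols), so consecutive
# blocks cycle over the process grid instead of clustering on one process.
def block_cyclic_owner(block_row, block_col, proc_rows, proc_cols):
    """Return the (row, col) coordinates of the process owning a given block."""
    return (block_row % proc_rows, block_col % proc_cols)

# With a 2x2 process grid, neighboring blocks land on different processes:
print(block_cyclic_owner(0, 0, 2, 2))  # (0, 0)
print(block_cyclic_owner(2, 3, 2, 2))  # (0, 1)
```

Because each process owns blocks scattered across the whole matrix, work that concentrates in one region (as in Gaussian elimination) is still shared by all processes.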
Dynamic Load Balancing

Dynamic load balancing can be classified as:
- Centralized
- Decentralized

Centralized dynamic load balancing: tasks are handed out from a centralized location, in a master-slave structure.

Decentralized dynamic load balancing: tasks are passed between arbitrary processes. A collection of worker processes operates upon the problem and interacts among themselves, finally reporting to a single process. A worker process may receive tasks from other worker processes and may send tasks to other worker processes (to complete or pass on at their discretion).
Centralized Dynamic Load Balancing

- The master process(or) holds the collection of tasks to be performed.
- Tasks are sent to the slave processes.
- When a slave process completes one task, it requests another task from the master process.
- Terms used: work pool, replicated worker, processor farm.

Centralized Work Pool

[Figure: centralized work pool, with the master holding the task queue and slaves requesting tasks and returning results]
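The work-pool pattern can be sketched with a shared queue and threads standing in for the slave processes (a Python simulation under assumed names, not the message-passing code of the lecture):

```python
# Sketch of a centralized work pool: slaves repeatedly pull a task from the
# shared pool, do the work (here: squaring a number), and report a result.
import queue
import threading

def slave(tasks, results):
    """Slave loop: request another task whenever the previous one completes."""
    while True:
        try:
            n = tasks.get_nowait()   # request a task from the pool
        except queue.Empty:
            return                   # pool exhausted: this slave stops
        results.put(n * n)           # complete the task, report the result
        tasks.task_done()

tasks, results = queue.Queue(), queue.Queue()
for n in range(8):
    tasks.put(n)                     # master fills the work pool

slaves = [threading.Thread(target=slave, args=(tasks, results)) for _ in range(3)]
for s in slaves: s.start()
for s in slaves: s.join()

print(sorted(results.queue))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Because each slave asks for work only when idle, faster slaves automatically take more tasks, which is the point of the scheme.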
Termination

The computation terminates when:
- the task queue is empty, and
- every process has made a request for another task without any new tasks being generated.

It is not sufficient to terminate merely when the task queue is empty: one or more processes may still be running, and a running process may generate new tasks for the task queue.

Decentralized Dynamic Load Balancing

Distributed Work Pool

[Figure: distributed work pool, with each process holding its own local task queue]
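The two-part termination condition can be stated as a small predicate (a sketch with illustrative variable names):

```python
# Sketch of the termination test for a centralized work pool: terminate only
# when the queue is empty AND every slave has requested a task without any
# new task being generated (i.e., no running slave can still add work).
def can_terminate(task_queue_empty, idle_requests, num_slaves):
    return task_queue_empty and idle_requests == num_slaves

print(can_terminate(True, 3, 4))  # False: one slave still running may add tasks
print(can_terminate(True, 4, 4))  # True: queue empty and all slaves idle
```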
Process Selection

Algorithms for selecting a process to request tasks from:
- Round robin: process Pi requests tasks from process Px, where x is given by a counter that is incremented after each request, using modulo-n arithmetic (for n processes) and excluding x = i.
- Random polling: process Pi requests tasks from process Px, where x is a number selected at random between 0 and n - 1 (excluding i).

Load Balancing Using a Line Structure
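Both selection rules are easy to express directly; here is a sketch in Python (the helper names are illustrative):

```python
# Sketch of the two target-selection rules from the slide.
import random

def round_robin_target(i, counter, n):
    """Target for process i: counter mod n, skipping i itself."""
    x = counter % n
    if x == i:
        x = (x + 1) % n  # never request from yourself
    return x

def random_polling_target(i, n):
    """Uniformly random target in [0, n-1], excluding i."""
    x = random.randrange(n - 1)      # draw from n-1 candidates
    return x if x < i else x + 1     # shift upward to skip i

print(round_robin_target(2, 2, 4))   # 3 (the counter landed on i = 2, so skip)
print(round_robin_target(2, 5, 4))   # 1
```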
Load Balancing Using a Line Structure

- The master process (P0) feeds the queue with tasks at one end, and tasks are shifted down the queue.
- When a process Pi (1 <= i < n) detects a task at its input from the queue and is idle, it takes the task from the queue. Tasks then shuffle down the queue to the right so that the space held by the taken task is filled.
- A new task is inserted into the left end of the queue. Eventually, all processes have a task and the queue is filled with new tasks.
- High-priority or larger tasks could be placed in the queue first.

Code Using Time Sharing Between Communication and Computation

Master process (P0):

    for (i = 0; i < no_tasks; i++) {
        recv(P1, request_tag);         /* request for task */
        send(&task, P1, task_tag);     /* send task into queue */
    }
    recv(P1, request_tag);             /* request for task */
    send(&empty, P1, task_tag);        /* signal end of tasks */
Process Pi (1 < i < n):

    if (buffer == empty) {
        send(Pi-1, request_tag);          /* request new task */
        recv(&buffer, Pi-1, task_tag);    /* task from left neighbor */
    }
    if ((buffer == full) && (!busy)) {    /* get next task */
        task = buffer;                    /* take the task */
        buffer = empty;                   /* mark buffer empty */
        busy = TRUE;                      /* mark process busy */
    }
    nrecv(Pi+1, request_tag, request);    /* check for msg from right */
    if (request && (buffer == full)) {
        send(&buffer, Pi+1);              /* shift task forward */
        buffer = empty;
    }
    if (busy) {                           /* continue on current task */
        Do some work on task.
        If task finished, set busy to false.
    }

The nonblocking nrecv() is necessary to check for a request being received from the right.

Load Balancing Using a Tree

Tasks are passed from a node into one of the two nodes below it when that node's buffer is empty.
General Techniques for Choosing the Right Parallel Algorithm

- Maximize data locality
- Minimize the volume of data exchanged
- Minimize the frequency of interactions
- Overlap computations with interactions
- Decide on data replication
- Minimize contention
- Use highly optimized collective interaction operations
- Use collective data transfers and computations
- Maximize concurrency

Decision Tree to Choose a Mapping Strategy
- Static number of tasks
  - Structured communication pattern
    - Roughly constant computation time per task: join tasks to minimize communication; create one task per processor.
    - Computation time per task varies: cyclically map tasks to processors for computational load balancing.
  - Unstructured communication pattern: use static load balancing techniques.
- Dynamic number of tasks
  - Frequent communications between tasks: use dynamic load balancing techniques.
  - Many short-lived tasks with no inter-task communication: use run-time task scheduling algorithms.
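The decision tree above can be sketched as a single function (a Python illustration; the flag names and answer strings are assumptions, chosen to mirror the tree's labels):

```python
# Sketch of the mapping-strategy decision tree as a function: each boolean
# flag corresponds to one branch point in the tree.
def choose_mapping(static_tasks, structured_comm=True,
                   constant_task_time=True, frequent_comm=True):
    if static_tasks:
        if structured_comm:
            if constant_task_time:
                return "join tasks; one task per processor"
            return "cyclic mapping of tasks to processors"
        return "static load balancing"
    if frequent_comm:
        return "dynamic load balancing"
    return "run-time task scheduling"

print(choose_mapping(static_tasks=True, structured_comm=True,
                     constant_task_time=False))
# cyclic mapping of tasks to processors
```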