Static Load Balancing



Load Balancing

Load Balancing Load balancing: distributing data and/or computations across multiple processes to maximize the efficiency of a parallel program. Static load balancing: the algorithm decides a priori how to divide the workload. Dynamic load balancing: the algorithm collects statistics as it runs and uses that information to rebalance the workload across the processes.

Static Load Balancing Round-robin: hand out the tasks in round-robin fashion. Randomized: assign the tasks in a randomized manner to help balance the load. Recursive bisection: recursively divide a problem into sub-problems of equal computational effort. Heuristic techniques: for example, a genetic algorithm that determines a good static workload distribution.
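The two simplest static schemes above can be sketched in a few lines of Python (an illustrative sketch only; the function names and task representation are assumptions, not part of the lecture):

```python
import random

def round_robin(tasks, p):
    """Static round-robin: task i goes to process i mod p."""
    bins = [[] for _ in range(p)]
    for i, t in enumerate(tasks):
        bins[i % p].append(t)
    return bins

def randomized(tasks, p, seed=0):
    """Static randomized: each task goes to a uniformly random process."""
    rng = random.Random(seed)
    bins = [[] for _ in range(p)]
    for t in tasks:
        bins[rng.randrange(p)].append(t)
    return bins

print(round_robin(list(range(7)), 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```

Both decide the assignment before execution starts; neither reacts to how long the tasks actually take, which is exactly the limitation dynamic load balancing addresses.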

Dynamic Load Balancing Manages a queue of tasks, known as the workpool. Usually more effective than static load balancing. Overhead of collecting statistics while the program runs. More complex to implement. Two styles of dynamic load balancing: Centralized Workpool: The workpool is kept at a coordinator process, which hands out tasks and collects newly generated tasks from worker processes. Distributed Workpool: The workpool is distributed across the worker processes. Tasks are exchanged between arbitrary processes. Requires a more complex termination detection technique to know when the program has finished.

Centralized Workpool [Figure: the coordinator process P0 holds the workpool (a task queue); worker processes P1, ..., Pp request tasks, return results, and submit newly generated tasks.] The workpool holds a collection of tasks to be performed. Processes are supplied with tasks when they finish a previously assigned task and request another; this leads to load balancing. Processes can also generate new tasks to be added to the workpool. Termination: the workpool program terminates when the task queue is empty and every worker process has requested another task without any new tasks being generated.

Distributed Workpool [Figure: a distributed task queue; processes exchange requests and tasks directly.] The task queue is distributed across the processes. Any process can request a task from any other process or send it a task. Suitable when the memory required to store the tasks is larger than can fit on one system.

Distributed Workpool How does the workload get balanced? Receiver-initiated: a process that is idle or lightly loaded asks another process for a task. Works better for a system with high load. Sender-initiated: a process that has a heavy load sends task(s) to another process. Works better for a system with light load. How do we determine which process to contact? Round robin. Random polling. Structured: the processes can be arranged in a logical ring or a tree.
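Random polling only needs a way to choose a peer other than oneself; a minimal sketch in Python (the function name `pick_victim` is an assumption for illustration):

```python
import random

def pick_victim(my_rank, p, rng):
    """Random polling: choose a peer rank other than our own to ask
    for (or send) work, uniformly among the other p-1 processes."""
    v = rng.randrange(p - 1)
    return v if v < my_rank else v + 1   # shift to skip our own rank

rng = random.Random(42)
victims = {pick_victim(2, 4, rng) for _ in range(100)}
print(sorted(victims))  # every rank except our own (2) can be chosen
```

Round robin would instead keep a counter and cycle through the other ranks deterministically; random polling avoids the synchronized contention patterns that deterministic cycling can produce.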

Distributed Workpool Termination Two conditions must hold to terminate a distributed workpool correctly: local termination conditions exist on each process, and no messages are in transit. This is tricky; here are two ways to deal with it: Tree-based termination algorithm: a tree order is imposed on the processes based on who sends a message to a process for the first time. At termination the tree is traversed bottom-up to the root. Dual-pass token ring algorithm: a separate phase passes a token around to determine whether the distributed algorithm has finished. The algorithm specifically detects whether any messages were in transit.

Dual-pass Token Ring Termination [Figure: processes P0 ... Pn-1 arranged in a ring passing a white token; a process Pi that sends a task backwards in the ring turns black and colors the token black.] Process 0 becomes white when it has terminated locally and passes a white token to process 1. Each process passes the token on after meeting its local termination conditions. However, if a process sends a message to a process earlier than itself in the ring, it colors itself black. A black process colors the token black before passing it on; a white process passes the token on unchanged. If process 0 receives a white token, the termination conditions have been met. If it receives a black token, it starts a new round with another white token.
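A toy simulation of one token pass makes the coloring rule concrete (a sketch under simplifying assumptions: no real messages, each process's color is given up front, and a process resets to white after forwarding the token):

```python
WHITE, BLACK = "white", "black"

def token_round(colors):
    """Simulate one pass of the termination token around the ring.
    A black process colors the token black, then turns white after
    forwarding it; a white process forwards the token unchanged.
    Returns the token color that arrives back at process 0."""
    token = WHITE
    for i in range(len(colors)):
        if colors[i] == BLACK:
            token = BLACK
        colors[i] = WHITE          # reset to white after forwarding
    return token

# P2 sent work backwards in the ring, so it marked itself black:
colors = [WHITE, WHITE, BLACK, WHITE]
print(token_round(colors))  # black -> P0 must start another round
print(token_round(colors))  # white -> termination detected
```

The first pass returns a black token, so process 0 launches a second pass; since no further backward sends occurred, the second pass comes back white and termination is declared.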

Example: Shortest Paths Given a directed graph with n vertices and m weighted edges, find the shortest paths from a source vertex to all other vertices. For two given vertices i and j, the weight of the edge between them is given by the weight function w(i, j); the distance is infinite if there is no edge from i to j. The graph can be represented in two different ways: Adjacency matrix: a two-dimensional array w[0..n-1][0..n-1] holds the weights of the edges. Adjacency lists: an array adj[0..n-1] of lists, where the ith list represents the vertices adjacent to the ith vertex; the list also stores the weights of the corresponding edges. Sequential shortest-paths algorithms: Dijkstra's shortest paths algorithm uses a priority queue to grow the shortest-paths tree one edge at a time; it has limited opportunities for parallelism. Moore's shortest path algorithm works by finding new shorter paths all over the graph; it allows for more parallelism but can do extra work by exploring a given vertex multiple times.
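The two representations can be illustrated on a small 4-vertex graph (the concrete weights are made up for illustration):

```python
INF = float("inf")

# Adjacency matrix: w[i][j] is the weight of edge (i, j),
# infinity where no edge exists, 0 on the diagonal.
w = [
    [0,   5,   INF, 9],
    [INF, 0,   2,   INF],
    [INF, INF, 0,   3],
    [INF, INF, INF, 0],
]

# The same graph as adjacency lists: adj[i] holds (j, weight) pairs
# for every outgoing edge of vertex i.
adj = [[(j, w[i][j]) for j in range(4) if i != j and w[i][j] != INF]
       for i in range(4)]
print(adj)  # [[(1, 5), (3, 9)], [(2, 2)], [(3, 3)], []]
```

The matrix gives O(1) edge lookup but O(n^2) storage; the lists store only the m edges that exist, which matters for sparse graphs.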

Moore's Algorithm A FIFO queue of vertices to explore is maintained; initially it contains just the source vertex. A distance array dist[0..n-1] holds the current shortest distance to each vertex; initially the distance to the source vertex is zero and all other distances are infinity. Remove the vertex i at the front of the queue and explore the edges from it. Suppose vertex j is connected to vertex i; then compare the currently known shortest distance to j with the distance going through vertex i, namely dist[i] + w(i, j). If the new distance is shorter, update dist[j] and add vertex j to the queue (if it is not in the queue already). [Figure: edge of weight w(i, j) from vertex i, with distance dist[i], to vertex j, with distance dist[j].] Repeat until the vertex queue is empty.
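The steps above can be sketched sequentially in Python (an illustrative sketch; the adjacency-list representation and the in-queue flag array follow the slide's description):

```python
from collections import deque

def moore(adj, s):
    """Sequential Moore's algorithm. adj[i] is a list of (j, weight)
    pairs for vertex i's outgoing edges; s is the source vertex.
    Returns the array of shortest distances from s."""
    n = len(adj)
    dist = [float("inf")] * n
    dist[s] = 0
    q = deque([s])
    in_q = [False] * n            # avoid duplicate queue entries
    in_q[s] = True
    while q:
        i = q.popleft()
        in_q[i] = False
        for j, wij in adj[i]:
            if dist[i] + wij < dist[j]:
                dist[j] = dist[i] + wij
                if not in_q[j]:   # re-enqueue j only if not queued
                    q.append(j)
                    in_q[j] = True
    return dist

adj = [[(1, 5), (3, 9)], [(2, 2)], [(3, 3)], []]
print(moore(adj, 0))  # [0, 5, 7, 9]
```

Note that a vertex can re-enter the queue when a shorter path to it is found later; this re-exploration is the "extra work" mentioned above, and also what makes the algorithm tolerant of tasks being processed in any order in parallel.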

Shortest Paths using Centralized Workpool Task (for Shortest paths): One vertex to be explored. Coordinator process (process 0): Holds the workpool, which consists of the queue of vertices to be explored. This queue shrinks and grows dynamically.

Centralized Workpool Pseudo-Code

centralized_shortest_paths(s, w, n, p, id)
// id: current process id; p processes in total, numbered 0, ..., p-1
// s: source vertex; w[0..n-1][0..n-1]: weight matrix; dist[0..n-1]: shortest distances
// Q: queue of vertices to explore, initially empty
coordinator <- 0
bcast(w, &n, coordinator)
if (id == coordinator)                        // the coordinator part
    numworkers <- 0
    enqueue(Q, s)
    do
        recv(&worker, P_any, ANY_TAG, &tag)
        if (tag == NEW_TASK_TAG)
            recv(&j, &newdist, P_worker, NEW_TASK_TAG)
            dist[j] <- min(dist[j], newdist)
            enqueue(Q, j)
        else if (tag == INIT_TAG or tag == REQUEST_TAG)
            if (tag == REQUEST_TAG)
                numworkers <- numworkers - 1
            if (queue_not_empty(Q))
                v <- dequeue(Q)
                send(&v, P_worker, TASK_TAG)
                send(dist, n, P_worker, TASK_TAG)
                numworkers <- numworkers + 1
    while (numworkers > 0)
    for i <- 1 to p-1
        send(&dummy, P_i, TERMINATE_TAG)

Centralized Workpool Pseudo-Code (contd.)

else                                          // the worker part
    send(&id, P_coordinator, INIT_TAG)
    recv(&v, P_coordinator, ANY_TAG, &tag)
    while (tag != TERMINATE_TAG)
        recv(dist, n, P_coordinator, TASK_TAG)
        for j <- 0 to n-1
            if (w[v][j] != infinity)
                newdist_j <- dist[v] + w[v][j]
                if (newdist_j < dist[j])
                    dist[j] <- newdist_j
                    send(&id, P_coordinator, NEW_TASK_TAG)
                    send(&j, &newdist_j, P_coordinator, NEW_TASK_TAG)
        send(&id, P_coordinator, REQUEST_TAG)
        recv(&v, P_coordinator, ANY_TAG, &tag)

Improvements to the Centralized Workpool Solution Make each task contain multiple vertices for coarser granularity. Instead of sending a new task every time a lower distance is found, wait until all edges out of a vertex have been explored and then send the results together in one message. Updating a local copy of the distance array would eliminate many tasks from being created in the first place, giving further improvement. Use a priority queue instead of a FIFO queue for the workpool; this should give some more improvement for large enough graphs.
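As an illustration of the last point, replacing the FIFO queue with a priority queue keyed on current distance makes the exploration Dijkstra-like, so vertices are rarely explored twice (a sequential sketch with a lazy-deletion heap, not the parallel workpool itself):

```python
import heapq

def moore_pq(adj, s):
    """Moore's algorithm with the FIFO workpool replaced by a priority
    queue ordered by current distance; nearest vertices are explored
    first, which reduces re-exploration (nonnegative weights assumed)."""
    n = len(adj)
    dist = [float("inf")] * n
    dist[s] = 0
    pq = [(0, s)]
    while pq:
        d, i = heapq.heappop(pq)
        if d > dist[i]:               # stale entry: i was improved later
            continue
        for j, wij in adj[i]:
            if d + wij < dist[j]:
                dist[j] = d + wij
                heapq.heappush(pq, (dist[j], j))
    return dist

adj = [[(1, 5), (3, 9)], [(2, 2)], [(3, 3)], []]
print(moore_pq(adj, 0))  # [0, 5, 7, 9]
```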

Shortest Paths using Distributed Workpool Process i searches around vertex i and records whether vertex i is currently in the queue. Process i keeps track of the ith entry of the distance array. Process i stores the adjacency matrix row or the adjacency list for vertex i. If a process receives a message containing a distance, it compares it with its stored value; if the new distance is smaller, it updates the distances to its neighbors and sends messages to the corresponding processes.

Improvements to the Distributed Workpool Solution Make each task contain multiple vertices for coarser granularity. Combine messages and only send known minimums by keeping a local estimate of the distance array. Maintain the local copy of the distance array as a priority queue. In the actual implementation, the distributed workpool solution (with the optimizations) was able to scale much better than the centralized solution.

Comparison of Various Implementations [Figure: performance comparison of the implementations.]

Further Reading Pencil Beam Redefinition Algorithm: a dynamic load balancing scheme that is adaptive in nature. The statistics are collected centrally but the data is rebalanced in a distributed manner! This is based on an actual medical application code. Parallel Toolkit Library: a Master's project by Kirsten Allison. This library provides centralized and distributed workpool design patterns that any application programmer can use without having to implement the same complex patterns again and again. Notes on both are on the class website under lecture notes.