
Load balancing. Prof. Richard Vuduc, Georgia Institute of Technology. CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.26]. Thursday, April 17, 2008.

Today's sources: CS 194/267 at UCB (Yelick/Demmel); Introduction to Parallel Computing by Grama, Gupta, Karypis, & Kumar.

Sources of inefficiency in parallel programs: poor single-processor performance, e.g., in the memory system; overheads, e.g., thread creation, synchronization, communication; and load imbalance, i.e., unbalanced work per processor, or heterogeneous processors and/or other resources.

Parallel efficiency: 4 scenarios. Consider load balance, concurrency, and overhead.

Recognizing inefficiency. Cost = (no. of procs) × (execution time), so $C_1 = T_1$ and $C_p = p \cdot T_p$.
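In this notation, parallel efficiency is the ratio of serial to parallel cost; this is the standard definition, made explicit here since the slide only names the costs:

```latex
E_p \;=\; \frac{C_1}{C_p} \;=\; \frac{T_1}{p\,T_p},
\qquad \text{so } E_p = 1 \text{ exactly when the speedup } T_1/T_p \text{ equals } p.
```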

Tools: VAMPIR, ParaProf (TAU), Paradyn, HPCToolkit (serial).

Sources of irregular parallelism: hierarchical parallelism, e.g., adaptive mesh refinement; divide-and-conquer parallelism, e.g., sorting; branch-and-bound search, e.g., game tree search, where the challenge is that the work depends on computed values; and discrete-event simulation.

Major issues in load balancing. Task costs: how much work per task? Dependencies: how must tasks be sequenced? Locality: how does data or information flow? Heterogeneity: do processors operate at the same or different speeds? The common question: when is this information known? The answers span a spectrum of load-balancing techniques.

Task costs. Easy: equal costs, so n tasks map directly onto p processor bins. Harder: different but known costs. Hardest: unknown costs.

Dependencies. Easy: none. Harder: predictable structure, e.g., a wave-front, trees (balanced or unbalanced), or a general DAG. Hardest: dynamically evolving structure.

Locality (communication). Easy: no communication. Harder: predictable communication pattern, regular or irregular. Hardest: unpredictable pattern.

When the information is known determines a spectrum of scheduling solutions. Static: everything known in advance; off-line algorithms. Semi-static: information known at well-defined points, e.g., at start-up or the start of each time-step; run an off-line algorithm between major steps. Dynamic: information only known in mid-execution; on-line algorithms.

Dynamic load balancing. Motivating example: search algorithms. Techniques: centralized vs. distributed.

Motivating example: search problems, such as optimal layout of VLSI chips, robot motion planning, chess and other games, and constructing a phylogeny tree from a set of genes.

Example: tree search. The search tree unfolds dynamically, and may be a graph if there are common sub-problems. (Figure: a tree with non-terminal nodes, terminal goal nodes, and terminal non-goal nodes.)

Search algorithms. Depth-first search: simple back-tracking; branch-and-bound, which tracks the best solution so far (the "bound") and prunes subtrees guaranteed to be worse than the bound; and iterative deepening, i.e., DFS with a bounded depth, repeatedly increasing the bound. Breadth-first search. A serial branch-and-bound sketch appears below.
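To make the pruning rule concrete, here is a minimal serial branch-and-bound sketch for a minimization search; the node interface (children, lower_bound, is_leaf, value) is hypothetical, chosen only for illustration:

```python
import math

def branch_and_bound(root, children, lower_bound, is_leaf, value):
    """Depth-first search that tracks the best solution found so far
    and prunes any subtree whose lower bound cannot beat it."""
    best = math.inf
    stack = [root]
    while stack:
        node = stack.pop()
        if lower_bound(node) >= best:      # prune: guaranteed no better
            continue
        if is_leaf(node):
            best = min(best, value(node))  # update the bound
        else:
            stack.extend(children(node))   # expand: simple back-tracking
    return best
```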

Parallel search example: simple back-tracking DFS. A static approach: spawn each new task on an idle processor. (Figure: example task placements on 2 and 4 processors.)

Centralized scheduling. Maintain a shared task queue: a dynamic, on-line approach in which worker threads pull tasks from the queue. Good for a small number of workers and for independent tasks known in advance. For loops, this is self-scheduling: a task is a subset of iterations, useful when the loop body has unpredictable running time. Tang & Yew (ICPP '86). A sketch follows below.
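A minimal sketch of centralized self-scheduling in shared memory, assuming Python worker threads and a unit chunk size (one iteration per grab); names like run_iteration are illustrative:

```python
import queue
import threading

def run_iteration(i):
    return i * i          # stand-in for a loop body of unpredictable cost

def worker(tasks, results):
    while True:
        try:
            i = tasks.get_nowait()   # grab the next unit of work
        except queue.Empty:
            return                   # queue drained: worker retires
        results[i] = run_iteration(i)

n, p = 1000, 4
tasks = queue.Queue()
for i in range(n):
    tasks.put(i)
results = [None] * n
threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(p)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Grabbing one iteration at a time maximizes balance but also maximizes contention on the shared queue, which is exactly the trade-off the next slide addresses.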

Self-scheduling trade-off. The unit of work to grab balances load balance against contention. Some variations: grab a fixed-size chunk; guided self-scheduling; tapering; weighted factoring.

Variation 1: fixed chunk size. Kruskal and Weiss (1985) give a model for computing the optimal chunk size, assuming independent subtasks, assumed distributions of running time for each subtask (e.g., IFR), and a random overhead for extracting each task. Limitations: one must know the distributions; however, a chunk size of n/p does OK (about 0.8 of optimal for large n/p). Ref: "Allocating independent subtasks on parallel processors".

Variation 2: guided self-scheduling. Idea: hand out large chunks at first to avoid overhead, and small chunks near the end to even out finish times. Chunk size $K_i = \lceil R_i / p \rceil$, where $R_i$ is the number of remaining tasks. Polychronopoulos & Kuck (1987): "Guided self-scheduling: A practical scheduling scheme for parallel supercomputers". A sketch follows below.
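A small sketch of the chunk-size rule, computing the schedule of grabs for n iterations on p processors; this follows the formula above, not Polychronopoulos & Kuck's code:

```python
import math

def gss_chunks(n, p):
    """Guided self-scheduling: each grab takes ceil(R/p) of the
    R remaining iterations, so chunks shrink toward the end."""
    chunks, remaining = [], n
    while remaining > 0:
        k = math.ceil(remaining / p)
        chunks.append(k)
        remaining -= k
    return chunks

# 100 iterations, 4 processors:
# [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]
print(gss_chunks(100, 4))
```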

Variation 3: tapering. Idea: choose the chunk size as $K_i = f\!\left(\mu, \sigma, \kappa, \frac{R_i}{p}, h\right)$, where $(\mu, \sigma)$ are estimated from history, $\kappa$ is the minimum chunk size, and $h$ is the selection overhead. High variance implies a small chunk size; with low variance, larger chunks are OK. S. Lucco (1994), "Adaptive parallel programs", PhD thesis. Better than guided self-scheduling, at least by a little.

Variation 4: weighted factoring. What if the hardware is heterogeneous? Idea: divide the task cost by the computational power of the requesting node. Ref: Hummel, Schmidt, Uma, & Wein (1996), "Load-sharing in heterogeneous systems via weighted factoring", in SPAA. A hedged sketch follows below.
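One plausible reading of the idea, sketched under assumptions: allocate factoring-style batches (half the remaining work per round) and scale each node's chunk by its relative speed. The published weighted-factoring schedule differs in detail:

```python
import math

def weighted_chunk(remaining, speed, total_speed):
    """Chunk for a requesting node: its speed-weighted share of half
    the remaining work (an assumed reading of the scheme)."""
    batch = remaining / 2
    return max(1, math.ceil(batch * speed / total_speed))

# 100 tasks left, 4 nodes, requester twice as fast as each of the
# other three (speeds 2, 1, 1, 1): it receives 20 of the next 50.
print(weighted_chunk(100, speed=2.0, total_speed=5.0))
```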

When self-scheduling is useful: task costs are unknown; locality is not important; shared memory, or small numbers of processors; tasks without dependencies (it can be used with dependencies, but most analyses ignore them).

Distributed task queues extend the approach to distributed memory: the shared task queue becomes a distributed task queue, or "bag", where idle processors pull work and busy processors push work. When to use: distributed memory, or shared memory with high synchronization overhead and small tasks; locality not important; tasks known in advance, with dependencies computed on the fly; task costs not known in advance.

Distributed dynamic load balancing for a tree search: processors search disjoint parts of the tree, and busy and idle processors exchange work, communicating asynchronously. (Figure: state machine. A busy processor alternates between doing a fixed amount of work and servicing pending messages. An idle processor selects a processor and requests work; if no work is found, it services pending messages and tries again; once it gets work, it becomes busy.)

Selecting a donor processor: basic techniques. Asynchronous round-robin: each processor k maintains its own target_k; when out of work, it requests from target_k and then updates target_k. Global round-robin: processor 0 maintains a single global target for all processors. Random polling/stealing: pick a donor uniformly at random. Sketches of all three follow below.
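Minimal sketches of the three rules; the targets list and GLOBAL_TARGET cell stand in for per-processor and shared state:

```python
import random

p = 8
targets = [(k + 1) % p for k in range(p)]  # per-processor pointers
GLOBAL_TARGET = [0]                        # single shared pointer

def async_round_robin(k):
    donor = targets[k]
    targets[k] = (targets[k] + 1) % p      # advance only k's pointer
    return donor

def global_round_robin():
    donor = GLOBAL_TARGET[0]               # every request hits this one
    GLOBAL_TARGET[0] = (donor + 1) % p     # shared counter: contention at scale
    return donor

def random_polling(k):
    donor = random.randrange(p - 1)        # uniform over the other p-1
    return donor if donor < k else donor + 1   # skip k itself
```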

How to split work? How many tasks should be split off? The total number of tasks is unknown, unlike in the self-scheduling case. Which tasks? Send the oldest tasks (from the bottom of the stack) and execute the most recent (from the top). Other strategies? See the sketch below.
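One common realization is a deque per processor: the owner works at the top while a requester takes from the bottom, where the oldest tasks (often the largest subtrees, nearest the root) sit. A minimal sketch, with locking omitted for clarity:

```python
from collections import deque

class WorkStack:
    def __init__(self):
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)        # owner adds newly spawned work

    def pop(self):
        # Owner executes the most recent task (top of the stack).
        return self.tasks.pop() if self.tasks else None

    def split(self):
        """Give a requester roughly half of the oldest tasks (bottom)."""
        k = len(self.tasks) // 2
        return [self.tasks.popleft() for _ in range(k)]
```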

A general analysis of parallel DFS. Let w be the work at some processor, split into two parts $\rho w$ and $(1-\rho)w$ with $0 < \rho < 1$. Then there is a constant $\varphi$, $0 < \varphi \le \tfrac{1}{2}$, such that $\varphi w \le \rho w$ and $\varphi w \le (1-\rho) w$: each partition has at least $\varphi w$ work, or equivalently at most $(1-\varphi) w$.

A general analysis of parallel DFS. If processor P_i initially has work w_i and receives a request from P_j, then after splitting, P_i and P_j each have at most $(1-\varphi) w_i$ work. For a given load-balancing strategy, let V(p) be the number of work requests after which each processor has received at least one work request; note $V(p) \ge p$. Initially, P_0 has W units of work and all others have none. After V(p) requests, the maximum work on any processor is less than $(1-\varphi) W$; after 2V(p) requests, less than $(1-\varphi)^2 W$. Hence the total number of requests is $O(V(p) \log W)$.

Computing V(p) for random polling. Consider randomly throwing n balls into p baskets: V(p) is the average number of trials needed to get at least one ball in each basket. What is V(p)?
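This is the classical coupon-collector problem; its expected value (a standard fact, not spelled out on the slide) is

```latex
V(p) \;=\; p \, H_p \;=\; p \sum_{k=1}^{p} \frac{1}{k} \;=\; \Theta(p \log p),
```

consistent with the random-polling bound on the next slide.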

A general analysis of parallel DFS: isoefficiency. Asynchronous round-robin: $V(p) = O(p^2)$, giving $W = O(p^2 \log p)$. Global round-robin: $W = O(p^2 \log p)$. Random polling: $W = O(p \log^2 p)$.

Theory: a randomized algorithm is optimal with high probability. Karp & Zhang (1988) prove this for trees with equal-cost tasks, in "A randomized parallel branch-and-bound procedure" (JACM): parents must complete before children, the tree unfolds at run time, task numbers and priorities are not known a priori, and children are pushed to random processors.

Theory: a randomized algorithm is optimal with high probability. Blumofe & Leiserson (1994) prove this for a fixed task tree with variable-cost tasks. Idea: work stealing, where an idle processor pulls ("steals") work instead of having work pushed to it; they also bound the total memory required. "Scheduling multithreaded computations by work stealing." Chakrabarti, Ranade, & Yelick (1994) show the same for a dynamic tree with variable-cost tasks, using pushes instead of pulls, with possibly worse locality. "Randomized load-balancing for tree-structured computation."

Diffusion-based load balancing. Randomized schemes treat the machine as fully connected; diffusion-based balancing accounts for the topology. Better locality, but slower. The cost of tasks is assumed known at creation time, and there are no dependencies between tasks.

Diffusion-based load balancing. Model the machine as a graph. At each step, compute the weight of the tasks remaining on each processor; each processor then compares its weight with its neighbors' and averages. See Ghosh, Muthukrishnan, & Schultz (1996): "First- and second-order diffusive methods for rapid, coarse, distributed load balancing" (SPAA). A sketch follows below.
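A minimal sketch of first-order diffusion on a ring of processors; the diffusion parameter alpha is an assumed tuning knob (stability on a degree-2 graph needs alpha ≤ 1/2):

```python
def diffuse(load, alpha=0.25, steps=50):
    """Each step, every processor exchanges a fixed fraction of its
    load difference with each ring neighbor; total load is conserved."""
    p = len(load)
    for _ in range(steps):
        new = list(load)
        for i in range(p):
            left, right = load[(i - 1) % p], load[(i + 1) % p]
            new[i] += alpha * (left - load[i]) + alpha * (right - load[i])
        load = new
    return load

print(diffuse([100.0, 0.0, 0.0, 0.0]))  # loads flatten toward 25 each
```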

Summary. Unpredictable loads → on-line algorithms. Fixed set of tasks with unknown costs → self-scheduling. Dynamically unfolding set of tasks → work stealing. Other scenarios: what if locality is of paramount importance, or the task graph is known in advance?

Administrivia

Final stretch. Project checkpoints are due already.

Locality considerations

What if locality is important? Example scenarios: a bag of tasks that need to communicate, or an arbitrary task graph where tasks share data. In both cases, tasks need to run on the same or nearby processors.

Stencil computation on a regular mesh. Load balancing: equally sized partitions. Locality: minimize the perimeter of each partition to minimize processor edge-crossings. For an $n \times n$ mesh on p processors, 1-D strip partitions cross $n(p-1)$ edges, while 2-D block partitions cross $2n(\sqrt{p}-1)$. See the sketch below.
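A small sketch of the comparison, assuming an n × n mesh and p a perfect square so the blocks form a √p × √p grid:

```python
import math

def strip_crossings(n, p):
    return n * (p - 1)            # 1-D strips: p-1 internal boundaries

def block_crossings(n, p):
    q = math.isqrt(p)             # q x q grid of blocks
    return 2 * n * (q - 1)        # q-1 boundaries in each direction

n, p = 1024, 64
print(strip_crossings(n, p))      # 64512 edge-crossings
print(block_crossings(n, p))      # 14336 edge-crossings
```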

In conclusion

These ideas apply broadly: physical sciences, e.g., plasmas, molecular dynamics, electron-beam lithography device simulation, and fluid dynamics; and generalized n-body problems (talk to your classmate, Ryan Riegel).

Backup slides