CS 267: Applications of Parallel Computers. Load Balancing
|
|
- Kevin McCoy
- 8 years ago
- Views:
Transcription
1 CS 267: Applications of Parallel Computers Load Balancing Kathy Yelick 04/18/2007 CS267 Lecture 24 1
2 Outline Motivation for Load Balancing Recall graph partitioning as load balancing technique Overview of load balancing problems, as determined by Task costs Task dependencies Locality needs Spectrum of solutions Static - all information available before starting Semi-Static - some info before starting Dynamic - little or no info before starting Survey of solutions How each one works Theoretical bounds, if any When to use it 04/18/2007 CS267 Lecture 24 2
3 Load Imbalance in Parallel Applications The primary sources of inefficiency in parallel codes: Poor single processor performance Typically in the memory system Too much parallelism overhead Thread creation, synchronization, communication Load imbalance Different amounts of work across processors Computation and communication Different speeds (or available resources) for the processors Possibly due to load on the machine How to recognizing load imbalance Time spent at synchronization is high and is uneven across processors, but not always so simple 04/18/2007 CS267 Lecture 24 3
4 Measuring Load Imbalance Challenges: Can be hard to separate from high synch overhead Especially subtle if not bulk-synchronous Spin locks can make synchronization look like useful work Note that imbalance may change over phases Insufficient parallelism always leads to load imbalance Tools like TAU can help (acts.nersc.gov) 04/18/2007 CS267 Lecture 24 4
5 Review of Graph Partitioning Partition G(N,E) so that N = N 1 U U N p, with each N i ~ N /p As few edges connecting different N i and N k as possible If N = {tasks}, each unit cost, edge e=(i,j) means task i has to communicate with task j, then partitioning means balancing the load, i.e. each N i ~ N /p minimizing communication volume Optimal graph partitioning is NP complete, so we use heuristics (see earlier lectures) Spectral Kernighan-Lin Multilevel Speed of partitioner trades off with quality of partition Better load balance costs more; may or may not be worth it Need to know tasks, communication pattern before starting What if you don t? 04/18/2007 CS267 Lecture 24 5
6 Load Balancing Overview Load balancing differs with properties of the tasks (chunks of work): Tasks costs Do all tasks have equal costs? If not, when are the costs known? Before starting, when task created, or only when task ends Task dependencies Can all tasks be run in any order (including parallel)? If not, when are the dependencies known? Locality Before starting, when task created, or only when task ends Is it important for some tasks to be scheduled on the same processor (or nearby) to reduce communication cost? When is the information about communication known? 04/18/2007 CS267 Lecture 24 6
7 Task Cost Spectrum, search 04/18/2007 CS267 Lecture 24 7
8 Task Dependency Spectrum 04/18/2007 CS267 Lecture 24 8
9 Task Locality Spectrum (Communication) 04/18/2007 CS267 Lecture 24 9
10 Spectrum of Solutions A key question is when certain information about the load balancing problem is known. Leads to a spectrum of solutions: Static scheduling. All information is available to scheduling algorithm, which runs before any real computation starts. Off-line algorithms, eg graph partitioning Semi-static scheduling. Information may be known at program startup, or the beginning of each timestep, or at other welldefined points. Offline algorithms may be used even though the problem is dynamic. eg Kernighan-Lin Dynamic scheduling. Information is not known until midexecution. On-line algorithms 04/18/2007 CS267 Lecture 24 10
11 Dynamic Load Balancing Motivation for dynamic load balancing Search algorithms as driving example Centralized load balancing Overview Special case for schedule independent loop iterations Distributed load balancing Overview Engineering Theoretical results Example scheduling problem: mixed parallelism Demonstrate use of coarse performance models 04/18/2007 CS267 Lecture 24 11
12 Search Search problems are often: Computationally expensive Have very different parallelization strategies than physical simulations. Require dynamic load balancing Examples: Optimal layout of VLSI chips Robot motion planning Chess and other games (N-queens) Speech processing Constructing phylogeny tree from set of genes 04/18/2007 CS267 Lecture 24 12
13 Example Problem: Tree Search In Tree Search the tree unfolds dynamically May be a graph if there are common sub-problems along different paths Graphs unlike meshes which are precomputed and have no ordering constraints Terminal node (non-goal) Non-terminal node Terminal node (goal) 04/18/2007 CS267 Lecture 24 13
14 Sequential Search Algorithms Depth-first search (DFS) Simple backtracking Search to bottom, backing up to last choice if necessary Depth-first branch-and-bound Keep track of best solution so far ( bound ) Cut off sub-trees that are guaranteed to be worse than bound Iterative Deepening Choose a bound on search depth, d and use DFS up to depth d If no solution is found, increase d and start again Iterative deepening A* uses a lower bound estimate of cost-tosolution as the bound Breadth-first search (BFS) Search across a given level in the tree 04/18/2007 CS267 Lecture 24 14
15 Depth vs Breadth First Search DFS with Explicit Stack Put root into Stack Stack is data structure where items added to and removed from the top only While Stack not empty If node on top of Stack satisfies goal of search, return result, else Mark node on top of Stack as searched If top of Stack has an unsearched child, put child on top of Stack, else remove top of Stack BFS with Explicit Queue Put root into Queue Queue is data structure where items added to end, removed from front While Queue not empty If node at front of Queue satisfies goal of search, return result, else Mark node at front of Queue as searched If node at front of Queue has any unsearched children, put them all at end of Queue Remove node at front from Queue 04/18/2007 CS267 Lecture 24 15
16 Parallel Search Consider simple backtracking search Try static load balancing: spawn each new task on an idle processor, until all have a subtree Load balance on 2 processors Load balance on 4 processors We can and should do better than this 04/18/2007 CS267 Lecture 24 16
17 Centralized Scheduling Keep a queue of task waiting to be done May be done by manager task Or a shared data structure protected by locks worker worker worker Task Queue worker worker worker 04/18/2007 CS267 Lecture 24 17
18 Centralized Task Queue: Scheduling Loops When applied to loops, often called self scheduling: Tasks may be range of loop indices to compute Assumes independent iterations Loop body has unpredictable time (branches) or the problem is not interesting Originally designed for: Scheduling loops by compiler (or runtime-system) Original paper by Tang and Yew, ICPP 1986 This is: Dynamic, online scheduling algorithm Good for a small number of processors (centralized) Special case of task graph independent tasks, known at once 04/18/2007 CS267 Lecture 24 18
19 Variations on Self-Scheduling Typically, don t want to grab smallest unit of parallel work, e.g., a single iteration Too much contention at shared queue Instead, choose a chunk of tasks of size K. If K is large, access overhead for task queue is small If K is small, we are likely to have even finish times (load balance) (at least) Four Variations: 1. Use a fixed chunk size 2. Guided self-scheduling 3. Tapering 4. Weighted Factoring 04/18/2007 CS267 Lecture 24 19
20 Variation 1: Fixed Chunk Size Kruskal and Weiss give a technique for computing the optimal chunk size Requires a lot of information about the problem characteristics e.g., task costs as well as number Not very useful in practice. Task costs must be known at loop startup time E.g., in compiler, all branches be predicted based on loop indices and used for task cost estimates 04/18/2007 CS267 Lecture 24 20
21 Variation 2: Guided Self-Scheduling Idea: use larger chunks at the beginning to avoid excessive overhead and smaller chunks near the end to even out the finish times. The chunk size K i at the i th access to the task pool is given by ceiling(r i /p) where R i is the total number of tasks remaining and p is the number of processors See Polychronopolous, Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers, IEEE Transactions on Computers, Dec /18/2007 CS267 Lecture 24 21
22 Variation 3: Tapering Idea: the chunk size, K i is a function of not only the remaining work, but also the task cost variance variance is estimated using history information high variance => small chunk size should be used low variance => larger chunks OK See S. Lucco, Adaptive Parallel Programs, PhD Thesis, UCB, CSD , Gives analysis (based on workload distribution) Also gives experimental results -- tapering always works at least as well as GSS, although difference is often small 04/18/2007 CS267 Lecture 24 22
23 Variation 4: Weighted Factoring Idea: similar to self-scheduling, but divide task cost by computational power of requesting node Useful for heterogeneous systems Also useful for shared resource clusters, e.g., built using all the machines in a building as with Tapering, historical information is used to predict future speed speed may depend on the other loads currently on a given processor See Hummel, Schmit, Uma, and Wein, SPAA 96 includes experimental data and analysis 04/18/2007 CS267 Lecture 24 23
24 When is Self-Scheduling a Good Idea? Useful when: A batch (or set) of tasks without dependencies can also be used with dependencies, but most analysis has only been done for task sets without dependencies The cost of each task is unknown Locality is not important Shared memory machine, or at least number of processors is small centralization is OK 04/18/2007 CS267 Lecture 24 24
25 Distributed Task Queues The obvious extension of task queue to distributed memory is: a distributed task queue (or bag ) Doesn t appear as explicit data structure in message-passing Idle processors can pull work, or busy processors push work When are these a good idea? Distributed memory multiprocessors Or, shared memory with significant synchronization overhead Locality is not (very) important Tasks that are: known in advance, e.g., a bag of independent ones dependencies exist, i.e., being computed on the fly The costs of tasks is not known in advance 04/18/2007 CS267 Lecture 24 25
26 Distributed Dynamic Load Balancing Dynamic load balancing algorithms go by other names: Work stealing, work crews, Basic idea, when applied to tree search: busy Each processor performs search on disjoint part of tree When finished, get work from a processor that is still busy Requires asynchronous communication Service pending messages Select a processor and request work idle No work found Do fixed amount of work Got work Service pending messages 04/18/2007 CS267 Lecture 24 26
27 How to Select a Donor Processor Three basic techniques: 1. Asynchronous round robin Each processor k, keeps a variable target k When a processor runs out of work, requests work from target k Set target k = (target k +1) mod procs 2. Global round robin Proc 0 keeps a single variable target When a processor needs work, gets target, requests work from target Proc 0 sets target = (target + 1) mod procs 3. Random polling/stealing When a processor needs work, select a random processor and request work from it Repeat if no work is found 04/18/2007 CS267 Lecture 24 27
28 How to Split Work First parameter is number of tasks to split Related to the self-scheduling variations, but total number of tasks is now unknown Second question is which one(s) Send tasks near the bottom of the stack (oldest) Execute from the top (most recent) May be able to do better with information about task costs Bottom of stack Top of stack 04/18/2007 CS267 Lecture 24 28
29 Theoretical Results (1) Main result: A simple randomized algorithm is optimal with high probability Karp and Zhang [88] show this for a tree of unit cost (equal size) tasks Parent must be done before children Tree unfolds at runtime Task number/priorities not known a priori Children pushed to random processors Show this for independent, equal sized tasks Throw balls into random bins : Θ ( log n / log log n ) in largest bin Throw d times and pick the smallest bin: log log n / log d = Θ (1) [Azar] Extension to parallel throwing [Adler et all 95] Shows p log p tasks leads to good balance 04/18/2007 CS267 Lecture 24 29
30 Theoretical Results (2) Main result: A simple randomized algorithm is optimal with high probability Blumofe and Leiserson [94] show this for a fixed task tree of variable cost tasks their algorithm uses task pulling (stealing) instead of pushing, which is good for locality I.e., when a processor becomes idle, it steals from a random processor also have (loose) bounds on the total memory required Chakrabarti et al [94] show this for a dynamic tree of variable cost tasks works for branch and bound, i.e. tree structure can depend on execution order uses randomized pushing of tasks instead of pulling, so worse locality Open problem: does task pulling provably work well for dynamic trees? 04/18/2007 CS267 Lecture 24 30
31 Distributed Task Queue References Introduction to Parallel Computing by Kumar et al (text) Multipol library (See C.-P. Wen, UCB PhD, 1996.) Part of Multipol ( Try to push tasks with high ratio of cost to compute/cost to push Ex: for matmul, ratio = 2n 3 cost(flop) / 2n 2 cost(send a word) Goldstein, Rogers, Grunwald, and others (independent work) have all shown advantages of integrating into the language framework very lightweight thread creation CILK (Leiserson et al) (supertech.lcs.mit.edu/cilk) Space bound on task stealing X10 from IBM 04/18/2007 CS267 Lecture 24 31
32 Diffusion-Based Load Balancing In the randomized schemes, the machine is treated as fully-connected. Diffusion-based load balancing takes topology into account Locality properties better than prior work Load balancing somewhat slower than randomized Cost of tasks must be known at creation time No dependencies between tasks 04/18/2007 CS267 Lecture 24 32
33 Diffusion-based load balancing The machine is modeled as a graph At each step, we compute the weight of task remaining on each processor This is simply the number if they are unit cost tasks Each processor compares its weight with its neighbors and performs some averaging Analysis using Markov chains See Ghosh et al, SPAA96 for a second order diffusive load balancing algorithm takes into account amount of work sent last time avoids some oscillation of first order schemes Note: locality is still not a major concern, although balancing with neighbors may be better than random 04/18/2007 CS267 Lecture 24 33
34 Mixed Parallelism As another variation, consider a problem with 2 levels of parallelism course-grained task parallelism good when many tasks, bad if few fine-grained data parallelism good when much parallelism within a task, bad if little Appears in: Adaptive mesh refinement Discrete event simulation, e.g., circuit simulation Database query processing Sparse matrix direct solvers 04/18/2007 CS267 Lecture 24 34
35 Mixed Parallelism Strategies 04/18/2007 CS267 Lecture 24 35
36 Which Strategy to Use And easier to implement 04/18/2007 CS267 Lecture 24 36
37 Switch Parallelism: A Special Case 04/18/2007 CS267 Lecture 24 37
38 List of CS267 Projects 04/18/2007 CS267 Lecture 24 38
39 CS267 Projects Matt: Sorting algorithms (UPC impl. & LoPC model) Nutt & Shoiab: parallel liquid simulation for graphic (PIC and parallel sparse poisson solver) Jim: Phylogentic tree construction (neighbor joining algorithm); some aspects run in parallel, but large trees don t fit on single computer (MPI with C) David, Peter, Steve: Parallel driver for iterative refinement for ScaLAPACK Jack: Characterizing and scaling material codes for optimal properties (diag large matrices and calc entries) (MPI with Fortran) JRL: TBD Narayanan & Statish: Statistical model on GPU (NVIDIA) 04/18/2007 CS267 Lecture 24 39
40 More Projects Kyle: DNS simulations of lid-driven cavity flow using Chombo (AMR NavStokes Poisson solvers; change current ring vortex changed to lid; wind on a lake ) (C++ and Fortran with MPI; could you C++ expertise) Max: Adam s unstructured multigrid applied to pumice deformation; may modify to take breakage into account. (C with MPI); maybe room for other partners 04/18/2007 CS267 Lecture 24 40
41 Extra Slides 04/18/2007 CS267 Lecture 24 41
42 Simple Performance Model for Data Parallelism 04/18/2007 CS267 Lecture 24 42
43 04/18/2007 CS267 Lecture 24 43
44 Modeling Performance To predict performance, make assumptions about task tree complete tree with branching factor d>= 2 d child tasks of parent of size N are all of size N/c, c>1 work to do task of size N is O(N a ), a>= 1 Example: Sign function based eigenvalue routine d=2, c=2 (on average), a=3 Combine these assumptions with model of data parallelism 04/18/2007 CS267 Lecture 24 44
45 Actual Speed of Sign Function Eigensolver Starred lines are optimal mixed parallelism Solid lines are data parallelism Dashed lines are switched parallelism Intel Paragon, built on ScaLAPACK Switched parallelism worthwhile! 04/18/2007 CS267 Lecture 24 45
46 Values of Sigma (Problem Size for Half Peak) 04/18/2007 CS267 Lecture 24 46
47 Small Example The 0/1 integer-linear-programming problem E.g., Given integer matrices/vectors as follows: an nxm matrix A, an m-element vector b, and an n-element vector c Find n-element vector x whose elements are 0 or 1 Satisfies the constraint: A*x >= b The function: f(x) = c.dot x should be minimized A = b = 2 c = x 1 + 2x 2 + x 3 + 2x 4 >= 8 (and 2 others inequalities) Minimize: 2x 1 + x 2 x 3 2x Note: 4 24 possible values for x 04/18/2007 CS267 Lecture 24 47
48 Discrete Optimizations Problems in General A discrete optimization problem (S, f) S is a set of feasible solutions that satisfy given constraints. S is finite or countably infinite. f is the cost function that maps each element of S onto the set of real numbers R. The objective of a discrete optimization problem (DOP) is to find a feasible solution x opt, such that f(x opt ) <= f(x) for all x in S. Discrete optimizations problems are NP-complete, so only exponential solutions are known Parallelism gives only a constant speedup Need to focus on average case behavior 04/18/2007 CS267 Lecture 24 48
49 Best-First Search Rather than searching to the bottom, keep set of current states in the space Pick the best one (by some heuristic) for the next step Use lower bound l(x) as heuristic l(x) = g(x) + h(x) g(x) is the cost of reaching the current state h(x) is a heuristic for the cost of reaching the goal Choose h(x) to be a lower bound on actual cost E.g., h(x) might be sum of number of moves for each piece in game problem to reach a solution (ignoring other pieces) 04/18/2007 CS267 Lecture 24 49
50 Branch and Bound Search Revisited The load balancing algorithms as described were for full depth-first search For most real problems, the search is bounded Current bound (e.g., best solution so far) logically shared For large-scale machines, may be replicated All processors need not always agree on bounds Big savings in practice Trade-off between Work spent updating bound Time wasted search unnecessary part of the space 04/18/2007 CS267 Lecture 24 50
51 Simulated Efficiency of Eigensolver Starred lines are optimal mixed parallelism Solid lines are data parallelism Dashed lines are switched parallelism 04/18/2007 CS267 Lecture 24 51
52 Simulated efficiency of Sparse Cholesky Starred lines are optimal mixed parallelism Solid lines are data parallelism Dashed lines are switched parallelism 04/18/2007 CS267 Lecture 24 52
Load balancing. Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.26] Thursday, April 17, 2008
Load balancing Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.26] Thursday, April 17, 2008 1 Today s sources CS 194/267 at UCB (Yelick/Demmel) Intro
More informationLoad balancing. David Bindel. 12 Nov 2015
Load balancing David Bindel 12 Nov 2015 Inefficiencies in parallel code Poor single processor performance Typically in the memory system Saw this in matrix multiply assignment Overhead for parallelism
More informationLoad Balancing. Load Balancing 1 / 24
Load Balancing Backtracking, branch & bound and alpha-beta pruning: how to assign work to idle processes without much communication? Additionally for alpha-beta pruning: implementing the young-brothers-wait
More informationΗΜΥ 656 ΠΡΟΧΩΡΗΜΕΝΗ ΑΡΧΙΤΕΚΤΟΝΙΚΗ ΗΛΕΚΤΡΟΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝ Εαρινό Εξάμηνο 2007
ΗΜΥ 656 ΠΡΟΧΩΡΗΜΕΝΗ ΑΡΧΙΤΕΚΤΟΝΙΚΗ ΗΛΕΚΤΡΟΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝ Εαρινό Εξάμηνο 2007 ΔΙΑΛΕΞΗ 6: Load Balancing and Scheduling for Multiprocessor Systems ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ (ttheocharides@ucy.ac.cy) Load Balancing
More informationPPD: Scheduling and Load Balancing 2
PPD: Scheduling and Load Balancing 2 Fernando Silva Computer Science Department Center for Research in Advanced Computing Systems (CRACS) University of Porto, FCUP http://www.dcc.fc.up.pt/~fds 2 (Some
More informationA Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture
A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture Yangsuk Kee Department of Computer Engineering Seoul National University Seoul, 151-742, Korea Soonhoi
More informationScheduling Task Parallelism" on Multi-Socket Multicore Systems"
Scheduling Task Parallelism" on Multi-Socket Multicore Systems" Stephen Olivier, UNC Chapel Hill Allan Porterfield, RENCI Kyle Wheeler, Sandia National Labs Jan Prins, UNC Chapel Hill Outline" Introduction
More informationDistributed Dynamic Load Balancing for Iterative-Stencil Applications
Distributed Dynamic Load Balancing for Iterative-Stencil Applications G. Dethier 1, P. Marchot 2 and P.A. de Marneffe 1 1 EECS Department, University of Liege, Belgium 2 Chemical Engineering Department,
More informationCSE 4351/5351 Notes 7: Task Scheduling & Load Balancing
CSE / Notes : Task Scheduling & Load Balancing Task Scheduling A task is a (sequential) activity that uses a set of inputs to produce a set of outputs. A task (precedence) graph is an acyclic, directed
More informationLoad Balancing Part 2: Static Load Balancing
Load Balancing Part 2: Static Load Balancing Kathy Yelick yelick@cs.berkeley.edu www.cs.berkeley.edu/~yelick/cs194f07 10/7/2007 CS194 Lecture 1 Load Balancing Overview Load balancing differs with properties
More informationA Review of Customized Dynamic Load Balancing for a Network of Workstations
A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester
More informationPraktikum Wissenschaftliches Rechnen (Performance-optimized optimized Programming)
Praktikum Wissenschaftliches Rechnen (Performance-optimized optimized Programming) Dynamic Load Balancing Dr. Ralf-Peter Mundani Center for Simulation Technology in Engineering Technische Universität München
More informationLoad Balancing Techniques
Load Balancing Techniques 1 Lecture Outline Following Topics will be discussed Static Load Balancing Dynamic Load Balancing Mapping for load balancing Minimizing Interaction 2 1 Load Balancing Techniques
More informationLoad balancing in a heterogeneous computer system by self-organizing Kohonen network
Bull. Nov. Comp. Center, Comp. Science, 25 (2006), 69 74 c 2006 NCC Publisher Load balancing in a heterogeneous computer system by self-organizing Kohonen network Mikhail S. Tarkov, Yakov S. Bezrukov Abstract.
More informationLoad balancing Static Load Balancing
Chapter 7 Load Balancing and Termination Detection Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection
More informationCost Model: Work, Span and Parallelism. 1 The RAM model for sequential computation:
CSE341T 08/31/2015 Lecture 3 Cost Model: Work, Span and Parallelism In this lecture, we will look at how one analyze a parallel program written using Cilk Plus. When we analyze the cost of an algorithm
More informationA STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti 2, Nidhi Rajak 3
A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti, Nidhi Rajak 1 Department of Computer Science & Applications, Dr.H.S.Gour Central University, Sagar, India, ranjit.jnu@gmail.com
More informationParallelization: Binary Tree Traversal
By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First
More informationDesign and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
More informationHome Page. Data Structures. Title Page. Page 1 of 24. Go Back. Full Screen. Close. Quit
Data Structures Page 1 of 24 A.1. Arrays (Vectors) n-element vector start address + ielementsize 0 +1 +2 +3 +4... +n-1 start address continuous memory block static, if size is known at compile time dynamic,
More information18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two
age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,
More informationStatic Load Balancing
Load Balancing Load Balancing Load balancing: distributing data and/or computations across multiple processes to maximize efficiency for a parallel program. Static load-balancing: the algorithm decides
More informationA number of tasks executing serially or in parallel. Distribute tasks on processors so that minimal execution time is achieved. Optimal distribution
Scheduling MIMD parallel program A number of tasks executing serially or in parallel Lecture : Load Balancing The scheduling problem NP-complete problem (in general) Distribute tasks on processors so that
More informationLoad Balancing and Termination Detection
Chapter 7 Load Balancing and Termination Detection 1 Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection
More informationMesh Generation and Load Balancing
Mesh Generation and Load Balancing Stan Tomov Innovative Computing Laboratory Computer Science Department The University of Tennessee April 04, 2012 CS 594 04/04/2012 Slide 1 / 19 Outline Motivation Reliable
More information{emery,browne}@cs.utexas.edu ABSTRACT. Keywords scalable, load distribution, load balancing, work stealing
Scalable Load Distribution and Load Balancing for Dynamic Parallel Programs E. Berger and J. C. Browne Department of Computer Science University of Texas at Austin Austin, Texas 78701 USA 01-512-471-{9734,9579}
More informationParallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications
More informationPerformance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations
Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations Roy D. Williams, 1990 Presented by Chris Eldred Outline Summary Finite Element Solver Load Balancing Results Types Conclusions
More informationOutline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits
Outline NP-completeness Examples of Easy vs. Hard problems Euler circuit vs. Hamiltonian circuit Shortest Path vs. Longest Path 2-pairs sum vs. general Subset Sum Reducing one problem to another Clique
More informationA Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
More informationParallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
More informationLoad Balancing & Termination
Load Load & Termination Lecture 7 Load and Termination Detection Load Load Want all processors to operate continuously to minimize time to completion. Load balancing determines what work will be done by
More informationLoad Balancing and Termination Detection
Chapter 7 slides7-1 Load Balancing and Termination Detection slides7-2 Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination
More informationDecentralized Utility-based Sensor Network Design
Decentralized Utility-based Sensor Network Design Narayanan Sadagopan and Bhaskar Krishnamachari University of Southern California, Los Angeles, CA 90089-0781, USA narayans@cs.usc.edu, bkrishna@usc.edu
More informationMPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp
MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source
More informationLocality-Preserving Dynamic Load Balancing for Data-Parallel Applications on Distributed-Memory Multiprocessors
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 18, 1037-1048 (2002) Short Paper Locality-Preserving Dynamic Load Balancing for Data-Parallel Applications on Distributed-Memory Multiprocessors PANGFENG
More informationHardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
More informationSubgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro
Subgraph Patterns: Network Motifs and Graphlets Pedro Ribeiro Analyzing Complex Networks We have been talking about extracting information from networks Some possible tasks: General Patterns Ex: scale-free,
More informationResource Allocation Schemes for Gang Scheduling
Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian
More informationUTS: An Unbalanced Tree Search Benchmark
UTS: An Unbalanced Tree Search Benchmark LCPC 2006 1 Coauthors Stephen Olivier, UNC Jun Huan, UNC/Kansas Jinze Liu, UNC Jan Prins, UNC James Dinan, OSU P. Sadayappan, OSU Chau-Wen Tseng, UMD Also, thanks
More informationLoad Balancing and Termination Detection
Chapter 7 Slide 1 Slide 2 Load Balancing and Termination Detection Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination
More informationCOMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP
More informationCS/COE 1501 http://cs.pitt.edu/~bill/1501/
CS/COE 1501 http://cs.pitt.edu/~bill/1501/ Lecture 01 Course Introduction Meta-notes These notes are intended for use by students in CS1501 at the University of Pittsburgh. They are provided free of charge
More informationAsynchronous Computations
Asynchronous Computations Asynchronous Computations Computations in which individual processes operate without needing to synchronize with other processes. Synchronizing processes is an expensive operation
More informationHigh Performance Computing for Operation Research
High Performance Computing for Operation Research IEF - Paris Sud University claude.tadonki@u-psud.fr INRIA-Alchemy seminar, Thursday March 17 Research topics Fundamental Aspects of Algorithms and Complexity
More informationCPSC 211 Data Structures & Implementations (c) Texas A&M University [ 313]
CPSC 211 Data Structures & Implementations (c) Texas A&M University [ 313] File Structures A file is a collection of data stored on mass storage (e.g., disk or tape) Why on mass storage? too big to fit
More informationReliable Systolic Computing through Redundancy
Reliable Systolic Computing through Redundancy Kunio Okuda 1, Siang Wun Song 1, and Marcos Tatsuo Yamamoto 1 Universidade de São Paulo, Brazil, {kunio,song,mty}@ime.usp.br, http://www.ime.usp.br/ song/
More informationDesign and Implementation of a Massively Parallel Version of DIRECT
Design and Implementation of a Massively Parallel Version of DIRECT JIAN HE Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA ALEX VERSTAK Department
More informationChapter 7 Load Balancing and Termination Detection
Chapter 7 Load Balancing and Termination Detection Load balancing used to distribute computations fairly across processors in order to obtain the highest possible execution speed. Termination detection
More informationA Locally Cache-Coherent Multiprocessor Architecture
A Locally Cache-Coherent Multiprocessor Architecture Kevin Rich Computing Research Group Lawrence Livermore National Laboratory Livermore, CA 94551 Norman Matloff Division of Computer Science University
More informationParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008
ParFUM: A Parallel Framework for Unstructured Meshes Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008 What is ParFUM? A framework for writing parallel finite element
More information! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.
Approximation Algorithms 11 Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of three
More informationAlgorithm Design and Analysis
Algorithm Design and Analysis LECTURE 27 Approximation Algorithms Load Balancing Weighted Vertex Cover Reminder: Fill out SRTEs online Don t forget to click submit Sofya Raskhodnikova 12/6/2011 S. Raskhodnikova;
More informationBranch, Cut, and Price: Sequential and Parallel
Branch, Cut, and Price: Sequential and Parallel T.K. Ralphs 1, L. Ladányi 2, and L.E. Trotter, Jr. 3 1 Department of Industrial and Manufacturing Systems Engineering, Lehigh University, Bethlehem, PA 18017,
More informationEECS 750: Advanced Operating Systems. 01/28 /2015 Heechul Yun
EECS 750: Advanced Operating Systems 01/28 /2015 Heechul Yun 1 Recap: Completely Fair Scheduler(CFS) Each task maintains its virtual time V i = E i 1 w i, where E is executed time, w is a weight Pick the
More informationPART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design
PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General
More informationIntroduction to Scheduling Theory
Introduction to Scheduling Theory Arnaud Legrand Laboratoire Informatique et Distribution IMAG CNRS, France arnaud.legrand@imag.fr November 8, 2004 1/ 26 Outline 1 Task graphs from outer space 2 Scheduling
More informationDynamic Load Balancing in a Network of Workstations
Dynamic Load Balancing in a Network of Workstations 95.515F Research Report By: Shahzad Malik (219762) November 29, 2000 Table of Contents 1 Introduction 3 2 Load Balancing 4 2.1 Static Load Balancing
More informationGPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
More informationDECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH
DECENTRALIZED LOAD BALANCING IN HETEROGENEOUS SYSTEMS USING DIFFUSION APPROACH P.Neelakantan Department of Computer Science & Engineering, SVCET, Chittoor pneelakantan@rediffmail.com ABSTRACT The grid
More informationScalability and Classifications
Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static
More informationWhite Paper Perceived Performance Tuning a system for what really matters
TMurgent Technologies White Paper Perceived Performance Tuning a system for what really matters September 18, 2003 White Paper: Perceived Performance 1/7 TMurgent Technologies Introduction The purpose
More informationLecture 12: Partitioning and Load Balancing
Lecture 12: Partitioning and Load Balancing G63.2011.002/G22.2945.001 November 16, 2010 thanks to Schloegel,Karypis and Kumar survey paper and Zoltan website for many of today s slides and pictures Partitioning
More informationParallel Scalable Algorithms- Performance Parameters
www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for
More information32 K. Stride 8 32 128 512 2 K 8 K 32 K 128 K 512 K 2 M
Evaluating a Real Machine Performance Isolation using Microbenchmarks Choosing Workloads Evaluating a Fixed-size Machine Varying Machine Size Metrics All these issues, plus more, relevant to evaluating
More informationScientific Computing Programming with Parallel Objects
Scientific Computing Programming with Parallel Objects Esteban Meneses, PhD School of Computing, Costa Rica Institute of Technology Parallel Architectures Galore Personal Computing Embedded Computing Moore
More informationPERFORMANCE EVALUATION OF THREE DYNAMIC LOAD BALANCING ALGORITHMS ON SPMD MODEL
PERFORMANCE EVALUATION OF THREE DYNAMIC LOAD BALANCING ALGORITHMS ON SPMD MODEL Najib A. Kofahi Associate Professor Department of Computer Sciences Faculty of Information Technology and Computer Sciences
More information! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.
Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of
More informationCustomized Dynamic Load Balancing for a Network of Workstations
Customized Dynamic Load Balancing for a Network of Workstations Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester, Rochester NY 4627 zaki,wei,srini
More informationPortable Parallel Programming for the Dynamic Load Balancing of Unstructured Grid Applications
Portable Parallel Programming for the Dynamic Load Balancing of Unstructured Grid Applications Rupak Biswas MRJ Technology Solutions NASA Ames Research Center Moffett Field, CA 9435, USA rbiswas@nas.nasa.gov
More informationAI: A Modern Approach, Chpts. 3-4 Russell and Norvig
AI: A Modern Approach, Chpts. 3-4 Russell and Norvig Sequential Decision Making in Robotics CS 599 Geoffrey Hollinger and Gaurav Sukhatme (Some slide content from Stuart Russell and HweeTou Ng) Spring,
More informationDynamic Load Balancing. Using Work-Stealing 35.1 INTRODUCTION CHAPTER. Daniel Cederman and Philippas Tsigas
CHAPTER Dynamic Load Balancing 35 Using Work-Stealing Daniel Cederman and Philippas Tsigas In this chapter, we present a methodology for efficient load balancing of computational problems that can be easily
More informationDynamic load balancing of parallel cellular automata
Dynamic load balancing of parallel cellular automata Marc Mazzariol, Benoit A. Gennart, Roger D. Hersch Ecole Polytechnique Fédérale de Lausanne, EPFL * ABSTRACT We are interested in running in parallel
More informationApproximation Algorithms
Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms
More informationEfficient Parallel Graph Exploration on Multi-Core CPU and GPU
Efficient Parallel Graph Exploration on Multi-Core CPU and GPU Pervasive Parallelism Laboratory Stanford University Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun Graph and its Applications Graph Fundamental
More informationContributions to Gang Scheduling
CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,
More informationFast Multipole Method for particle interactions: an open source parallel library component
Fast Multipole Method for particle interactions: an open source parallel library component F. A. Cruz 1,M.G.Knepley 2,andL.A.Barba 1 1 Department of Mathematics, University of Bristol, University Walk,
More informationLoad Balancing on a Non-dedicated Heterogeneous Network of Workstations
Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department
More informationOptimization on Huygens
Optimization on Huygens Wim Rijks wimr@sara.nl Contents Introductory Remarks Support team Optimization strategy Amdahls law Compiler options An example Optimization Introductory Remarks Modern day supercomputers
More informationParallel & Distributed Optimization. Based on Mark Schmidt s slides
Parallel & Distributed Optimization Based on Mark Schmidt s slides Motivation behind using parallel & Distributed optimization Performance Computational throughput have increased exponentially in linear
More informationAll ju The State of Software Development Today: A Parallel View. June 2012
All ju The State of Software Development Today: A Parallel View June 2012 2 What is Parallel Programming? When students study computer programming, the normal approach is to learn to program sequentially.
More informationScalable Dynamic Load Balancing Using UPC
Scalable Dynamic Load Balancing Using UPC Stephen Olivier, Jan Prins Department of Computer Science University of North Carolina at Chapel Hill {olivier, prins}@cs.unc.edu Abstract An asynchronous work-stealing
More informationThe Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,
The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, cheapest first, we had to determine whether its two endpoints
More informationParallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel
Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:
More informationRESEARCH PAPER International Journal of Recent Trends in Engineering, Vol 1, No. 1, May 2009
An Algorithm for Dynamic Load Balancing in Distributed Systems with Multiple Supporting Nodes by Exploiting the Interrupt Service Parveen Jain 1, Daya Gupta 2 1,2 Delhi College of Engineering, New Delhi,
More informationHIGH PERFORMANCE BIG DATA ANALYTICS
HIGH PERFORMANCE BIG DATA ANALYTICS Kunle Olukotun Electrical Engineering and Computer Science Stanford University June 2, 2014 Explosion of Data Sources Sensors DoD is swimming in sensors and drowning
More informationLoad Balancing on Massively Parallel Networks: A Cellular Automaton Approach
Load Balancing on Massively Parallel Networks: A Cellular Automaton Approach by Brian Dickens at the Thomas Jefferson High School for Science and Technology supervising instructor: Donald Hyatt originally
More informationHectiling: An Integration of Fine and Coarse Grained Load Balancing Strategies 1
Copyright 1998 IEEE. Published in the Proceedings of HPDC 7 98, 28 31 July 1998 at Chicago, Illinois. Personal use of this material is permitted. However, permission to reprint/republish this material
More informationA Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster
Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906
More informationEfficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing
Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing James D. Jackson Philip J. Hatcher Department of Computer Science Kingsbury Hall University of New Hampshire Durham,
More informationDistributed Load Balancing for Machines Fully Heterogeneous
Internship Report 2 nd of June - 22 th of August 2014 Distributed Load Balancing for Machines Fully Heterogeneous Nathanaël Cheriere nathanael.cheriere@ens-rennes.fr ENS Rennes Academic Year 2013-2014
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationA FAST AND HIGH QUALITY MULTILEVEL SCHEME FOR PARTITIONING IRREGULAR GRAPHS
SIAM J. SCI. COMPUT. Vol. 20, No., pp. 359 392 c 998 Society for Industrial and Applied Mathematics A FAST AND HIGH QUALITY MULTILEVEL SCHEME FOR PARTITIONING IRREGULAR GRAPHS GEORGE KARYPIS AND VIPIN
More informationLOAD BALANCING MECHANISMS IN DATA CENTER NETWORKS
LOAD BALANCING Load Balancing Mechanisms in Data Center Networks Load balancing vs. distributed rate limiting: an unifying framework for cloud control Load Balancing for Internet Distributed Services using
More informationSupercomputing applied to Parallel Network Simulation
Supercomputing applied to Parallel Network Simulation David Cortés-Polo Research, Technological Innovation and Supercomputing Centre of Extremadura, CenitS. Trujillo, Spain david.cortes@cenits.es Summary
More informationLecture 4 Online and streaming algorithms for clustering
CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line
More informationSynchronization. Todd C. Mowry CS 740 November 24, 1998. Topics. Locks Barriers
Synchronization Todd C. Mowry CS 740 November 24, 1998 Topics Locks Barriers Types of Synchronization Mutual Exclusion Locks Event Synchronization Global or group-based (barriers) Point-to-point tightly
More informationMulticore Parallel Computing with OpenMP
Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large
More information