Lecture 4: Principles of Parallel Algorithm Design (part 4)


Mapping Techniques for Load Balancing

Sources of overhead:
- Inter-process interaction
- Idling

Goals:
- Reduce interaction time
- Reduce the total amount of time that processes spend idle
Remark: these two goals often conflict.

Classes of mapping:
- Static
- Dynamic

Schemes for Static Mapping
- Mapping based on data partitioning
- Mapping based on task-graph partitioning
- Hybrid strategies

Mapping Based on Data Partitioning
By the owner-computes rule, mapping the relevant data onto processes is equivalent to mapping tasks onto processes.
- Arrays or matrices: block distributions; cyclic and block-cyclic distributions
- Irregular data (example: data associated with an unstructured mesh): graph partitioning

1D Block Distribution
Example: distribute the rows or columns of a matrix to different processes.
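As a concrete sketch (the helper names are my own, not from the lecture), the block boundaries and the owner of a given row under a 1D block distribution can be computed as:

```python
# Sketch: 1D block distribution of n rows over p processes.
# Process q owns the contiguous rows [q*ceil(n/p), min((q+1)*ceil(n/p), n)).
from math import ceil

def block_range(n, p, q):
    """Rows owned by process q under a 1D block distribution."""
    b = ceil(n / p)              # block size
    return range(q * b, min((q + 1) * b, n))

def block_owner(n, p, i):
    """Process that owns row i."""
    return i // ceil(n / p)

# Example: 10 rows over 4 processes -> blocks of size 3, 3, 3, 1.
print([list(block_range(10, 4, q)) for q in range(4)])
```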

Multi-D Block Distribution
Example: distribute blocks of a matrix to different processes.

Load Balance for a Block Distribution
Example: n × n dense matrix multiplication C = A · B using p processes.
- Decomposition is based on the output data: each entry of C requires the same amount of computation, so equal-size partitions balance the load.
- Either a 1D or a 2D block distribution can be used:
  - 1D distribution: n/p rows are assigned to each process
  - 2D distribution: a block of size n/√p × n/√p is assigned to each process
- A multi-dimensional distribution allows a higher degree of concurrency, and can also help to reduce interactions.
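A minimal sketch of the 2D case (my own helper, assuming p is a perfect square that divides n) showing which block of C each process owns:

```python
# Sketch: 2D block distribution of an n x n matrix over p processes,
# assuming p is a perfect square so the processes form a sqrt(p) x sqrt(p) grid.
from math import isqrt

def block_2d(n, p, q):
    """Return the (row range, column range) of C owned by process q."""
    s = isqrt(p)                  # process grid is s x s
    assert s * s == p and n % s == 0
    b = n // s                    # each block is b x b
    r, c = divmod(q, s)           # process coordinates on the grid
    return range(r * b, (r + 1) * b), range(c * b, (c + 1) * b)

# Example: 8x8 matrix over 4 processes -> four 4x4 blocks.
for q in range(4):
    rows, cols = block_2d(8, 4, q)
    print(q, list(rows), list(cols))
```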

Cyclic and Block-Cyclic Distributions
If the amount of work differs for different entries of a matrix, a block distribution can lead to load imbalance.
Example: Doolittle's method of LU factorization of a dense matrix.

Doolittle's method of LU factorization

$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} = LU = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ l_{21} & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ 0 & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & u_{nn} \end{pmatrix}$

By matrix-matrix multiplication:

$u_{1j} = a_{1j}$ for $j = 1, 2, \ldots, n$ (1st row of U)
$l_{j1} = a_{j1} / u_{11}$ for $j = 1, 2, \ldots, n$ (1st column of L)

For $i = 2, 3, \ldots, n-1$ do
  $u_{ii} = a_{ii} - \sum_{t=1}^{i-1} l_{it} u_{ti}$
  $u_{ij} = a_{ij} - \sum_{t=1}^{i-1} l_{it} u_{tj}$ for $j = i+1, \ldots, n$ (ith row of U)
  $l_{ji} = \bigl(a_{ji} - \sum_{t=1}^{i-1} l_{jt} u_{ti}\bigr) / u_{ii}$ for $j = i+1, \ldots, n$ (ith column of L)
End

$u_{nn} = a_{nn} - \sum_{t=1}^{n-1} l_{nt} u_{tn}$
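The recurrences translate directly into a serial routine; the following sketch (my own, for illustration only and without pivoting) builds L and U explicitly:

```python
# Sketch: serial Doolittle LU factorization (no pivoting), following
# the recurrences above. L is unit lower triangular, U upper triangular.

def doolittle_lu(A):
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 1.0
        # ith row of U
        for j in range(i, n):
            U[i][j] = A[i][j] - sum(L[i][t] * U[t][j] for t in range(i))
        # ith column of L (below the diagonal)
        for j in range(i + 1, n):
            L[j][i] = (A[j][i] - sum(L[j][t] * U[t][i] for t in range(i))) / U[i][i]
    return L, U

A = [[4.0, 3.0], [6.0, 3.0]]
L, U = doolittle_lu(A)
print(L)  # [[1.0, 0.0], [1.5, 1.0]]
print(U)  # [[4.0, 3.0], [0.0, -1.5]]
```

Note that computing $u_{ij}$ or $l_{ji}$ costs on the order of $i$ operations, so later rows and columns are more expensive; this is exactly why the entries involve unequal amounts of work.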

Serial Column-Based LU
Remark: matrices L and U share storage with A.

Work required to compute the entries of L and U.

A block distribution of the LU factorization tasks leads to load imbalance.

Block-Cyclic Distribution
A variation of the block distribution that can be used to alleviate load imbalance.
Steps:
1. Partition the array into many more blocks than there are available processes.
2. Assign blocks to processes in a round-robin manner so that each process gets several nonadjacent blocks.

(a) The rows of the array are grouped into blocks of two rows each, resulting in eight blocks of rows. These blocks are distributed to four processes in a wraparound fashion. (b) The matrix is partitioned into 16 blocks, each of size 4 × 4, and mapped onto a 2 × 2 grid of processes in a wraparound fashion.
Cyclic distribution: the special case where the block size is 1.
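In code, the round-robin rule is a one-liner; this sketch (mine) shows the 1D mapping from the figure and the cyclic special case:

```python
# Sketch: 1D block-cyclic distribution with block size b over p processes.
# Block k goes to process k % p; row i lies in block i // b.

def block_cyclic_owner(i, b, p):
    return (i // b) % p

# 16 rows, block size 2, 4 processes: blocks wrap around the processes.
print([block_cyclic_owner(i, 2, 4) for i in range(16)])
# -> [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]

# Cyclic distribution is the special case b = 1:
print([block_cyclic_owner(i, 1, 4) for i in range(8)])
# -> [0, 1, 2, 3, 0, 1, 2, 3]
```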

Graph Partitioning
- Assign an equal number of nodes (or cells) to each process.
- Minimize the edge count of the graph partition, since cut edges correspond to interactions.
Compare: random partitioning vs. partitioning for minimizing the edge count.
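The edge-count objective is easy to express; this small sketch (my own) counts the edges cut by a given node-to-process assignment:

```python
# Sketch: count the edges of a graph that cross between processes
# under a given partition (node -> process mapping).

def edge_cut(edges, part):
    return sum(1 for u, v in edges if part[u] != part[v])

# Square graph 0-1-2-3-0 split two ways:
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(edge_cut(edges, {0: 0, 1: 0, 2: 1, 3: 1}))  # 2 (good split)
print(edge_cut(edges, {0: 0, 1: 1, 2: 0, 3: 1}))  # 4 (bad split)
```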

Mappings Based on Task Partitioning
Mapping based on task partitioning can be used when the computation is naturally expressed as a static task-dependency graph with known task sizes.
- Finding an optimal mapping that minimizes idle time and interaction time is NP-complete.
- Heuristic solutions exist for many structured graphs.

Mapping a Binary Tree Task-Dependency Graph
Example: mapping the tree graph onto 8 processes.
The mapping minimizes interaction overhead by placing independent tasks on the same process (i.e., process 0) and others on processes only one communication link away from each other.
Idling exists; it is inherent in this task-dependency graph.

Mapping a Sparse Graph
Example: sparse matrix-vector multiplication using 3 processes.
Arrow distribution.

Partitioning the task-interaction graph to reduce interaction overhead.
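To make the interaction explicit, here is a toy sketch (my own, with a CSR-like layout) of row-partitioned sparse matrix-vector multiplication; the column indices a process needs but does not own are precisely the vector entries it must receive:

```python
# Sketch: row-partitioned sparse matrix-vector multiplication y = A x.
# rows[i] is the list of (column, value) pairs of row i; each process
# owns a range of rows and needs x[j] for every column j appearing in
# its rows -- columns outside its own range imply interaction.

def columns_needed(rows, my_rows):
    return {j for i in my_rows for j, _ in rows[i]}

def spmv_rows(rows, x, my_rows):
    return {i: sum(v * x[j] for j, v in rows[i]) for i in my_rows}

rows = {0: [(0, 2.0), (3, 1.0)],
        1: [(1, 4.0)],
        2: [(2, 1.0), (0, 5.0)],
        3: [(3, 3.0)]}
x = [1.0, 1.0, 1.0, 1.0]
# Process 0 owns rows 0-1; it needs x[0], x[1], and the remote x[3].
print(columns_needed(rows, [0, 1]))   # {0, 1, 3}
print(spmv_rows(rows, x, [0, 1]))     # {0: 3.0, 1: 4.0}
```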

Schemes for Dynamic Mapping
Use dynamic mapping when static mapping results in a highly imbalanced distribution of work among processes, or when the task-dependency graph itself is dynamic.
The primary goal is to balance the load: dynamic load balancing.
Example: dynamic load balancing for adaptive mesh refinement (AMR).
Types:
- Centralized
- Distributed

Centralized Dynamic Mapping
Processes:
- Master: manages a pool of available tasks
- Slaves: depend on the master to obtain work
Idea:
- When a slave process runs out of work, it takes a portion of the available work from the master.
- When a new task is generated, it is added to the pool of tasks at the master process.
Potential problem: when many processes are used, the master process may become a bottleneck.
Solution: chunk scheduling; every time a process runs out of work, it gets a group of tasks.
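A minimal shared-memory sketch of chunk scheduling (the chunk size and worker code are my own choices; on distributed memory the queue would live at the master process):

```python
# Sketch: centralized dynamic mapping with chunk scheduling.
# A shared queue plays the role of the master's task pool; each worker
# repeatedly grabs a chunk of tasks instead of one task at a time,
# reducing pressure on the central pool.
import queue
import threading

CHUNK = 4
tasks = queue.Queue()
for t in range(20):                         # 20 dummy tasks
    tasks.put(t)

def worker(wid, results):
    while True:
        chunk = []
        try:
            for _ in range(CHUNK):          # take up to CHUNK tasks
                chunk.append(tasks.get_nowait())
        except queue.Empty:
            pass
        if not chunk:
            return                          # pool exhausted
        for t in chunk:
            results.append((wid, t * t))    # "process" the task

results = []
threads = [threading.Thread(target=worker, args=(w, results)) for w in range(3)]
for th in threads: th.start()
for th in threads: th.join()
print(len(results))  # 20
```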

Distributed Dynamic Mapping
All processes are peers: tasks are distributed among the processes, which exchange tasks at run time to balance the work. Each process can send work to, or receive work from, any other process.
Design questions:
- How are sending and receiving processes paired together?
- Is the work transfer initiated by the sender or by the receiver?
- How much work is transferred?
- When is the work transfer performed?
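One standard set of answers is receiver-initiated transfer, i.e. work stealing; the following toy simulation (entirely my own construction, not from the lecture) lets an idle process take half the tasks of its most loaded peer:

```python
# Sketch: receiver-initiated transfer (work stealing), simulated
# sequentially. An idle process steals half of the tasks of its most
# heavily loaded peer.
work = {0: list(range(12)), 1: [], 2: []}   # imbalanced initial load

def steal(thief):
    victim = max((q for q in work if q != thief), key=lambda q: len(work[q]))
    half = len(work[victim]) // 2
    work[thief], work[victim] = work[victim][:half], work[victim][half:]

steal(1)   # process 1 is idle: takes 6 of process 0's 12 tasks
steal(2)   # process 2 is idle: takes 3 of process 0's remaining 6
print({q: len(ts) for q, ts in work.items()})  # {0: 3, 1: 6, 2: 3}
```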

Techniques to Minimize Interaction Overheads
- Maximize data locality:
  - Maximize the reuse of recently accessed data.
  - Minimize the volume of data exchange; use a high-dimensional distribution (example: the 2D block distribution for matrix multiplication).
- Minimize the frequency of interactions:
  - Restructure the algorithm so that shared data are accessed and used in large pieces.
  - Combine messages between the same source-destination pair.

Techniques to Minimize Interaction Overheads (continued)
- Minimize contention and hot spots.
Contention occurs when multiple tasks try to access the same resource concurrently: multiple processes sending messages to the same process, or multiple simultaneous accesses to the same memory block.
With A, B, and C partitioned into $\sqrt{p} \times \sqrt{p}$ blocks, computing
$C_{i,j} = \sum_{k=0}^{\sqrt{p}-1} A_{i,k} B_{k,j}$
causes contention. For example, $C_{0,0}, C_{0,1}, \ldots, C_{0,\sqrt{p}-1}$ all attempt to read $A_{0,0}$ at once.
A contention-free alternative is
$C_{i,j} = \sum_{k=0}^{\sqrt{p}-1} A_{i,\,(i+j+k) \bmod \sqrt{p}} \, B_{(i+j+k) \bmod \sqrt{p},\,j}$
At every step $k$, all tasks $P_{i,j}$ that work on the same row of C access a block $A_{i,\,(i+j+k) \bmod \sqrt{p}}$ that is different for each task.
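A tiny sketch (mine) makes the difference visible by printing which block of A each task in row i = 0 reads at step k = 0 under the two orderings:

```python
# Sketch: which block of A does task (i, j) read at step k?
# Naive order: A[i][k] -- every task in row i reads the same block at step k.
# Staggered order: A[i][(i + j + k) % s] -- all different within a row.
s = 4  # sqrt(p): the matrices are partitioned into s x s blocks

i, k = 0, 0
print("naive:    ", [(i, k) for j in range(s)])
# -> [(0, 0), (0, 0), (0, 0), (0, 0)]   all 4 tasks contend for A[0][0]
print("staggered:", [(i, (i + j + k) % s) for j in range(s)])
# -> [(0, 0), (0, 1), (0, 2), (0, 3)]   no contention
```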

Techniques to Minimize Interaction Overheads (continued)
- Overlap computations with interactions: use non-blocking communication.
- Replicate data or computations: replicate a copy of shared data on each process if possible, so that the only interaction is the initial one during replication.
- Use collective interaction operations.
- Overlap interactions with other interactions.
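As an example of the first point, with mpi4py (an assumption on my part; any non-blocking message-passing API works the same way) a non-blocking receive lets local computation proceed while the message is in flight:

```python
# Sketch: overlapping computation with communication via a non-blocking
# receive (mpi4py). Run with: mpiexec -n 2 python overlap.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send([1, 2, 3], dest=1, tag=7)
elif rank == 1:
    req = comm.irecv(source=0, tag=7)          # post receive, don't wait
    local = sum(x * x for x in range(10_000))  # useful work meanwhile
    data = req.wait()                          # block only when data is needed
    print(local, data)
```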

Parallel Algorithm Models
- Data parallel: each task performs similar operations on different data; tasks are typically mapped to processes statically.
- Task graph: use the task-dependency graph to promote locality or reduce interactions.
- Master-slave: one or more master processes generate tasks and allocate them to slave processes; allocation may be static or dynamic.
- Pipeline / producer-consumer: pass a stream of data through a sequence of processes, each of which performs some operation on it.
- Hybrid: apply multiple models hierarchically, or apply different models in sequence to different phases.
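To close, a minimal sketch of the pipeline/producer-consumer model (the stage functions are my own) with two stages connected by queues:

```python
# Sketch: a two-stage pipeline. Stage 1 squares items, stage 2 sums them;
# queues connect the stages, and None marks the end of the stream.
import queue
import threading

q01, q12 = queue.Queue(), queue.Queue()

def stage1():
    while (x := q01.get()) is not None:
        q12.put(x * x)
    q12.put(None)                     # propagate end-of-stream

def stage2(out):
    total = 0
    while (x := q12.get()) is not None:
        total += x
    out.append(total)

out = []
t1 = threading.Thread(target=stage1)
t2 = threading.Thread(target=stage2, args=(out,))
t1.start(); t2.start()
for x in range(5):
    q01.put(x)                        # producer feeds the stream
q01.put(None)                         # end of stream
t1.join(); t2.join()
print(out)                            # [30] = 0 + 1 + 4 + 9 + 16
```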