ΗΜΥ 656 Advanced Computer Architecture, Spring Semester 2007

1 ΗΜΥ 656 Advanced Computer Architecture, Spring Semester 2007. Lecture 6: Load Balancing and Scheduling for Multiprocessor Systems. ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ

2 Load Balancing in General Enormous and diverse literature on load balancing: computer science systems (operating systems, parallel computing, distributed computing); computer science theory; operations research (IEOR); application domains. A closely related problem is scheduling, which determines the order in which tasks run.

3 Understanding Different Load Balancing Problems Load balancing problems differ in: Task costs: Do all tasks have equal costs? If not, when are the costs known? - Before starting, when the task is created, or only when the task ends. Task dependencies: Can all tasks be run in any order (including in parallel)? If not, when are the dependencies known? - Before starting, when the task is created, or only when the task ends. Locality: Is it important for some tasks to be scheduled on the same processor (or nearby) to reduce communication cost? When is the information about communication between tasks known?

4 Task cost spectrum

5 Task Dependency Spectrum

6 Task Locality Spectrum (Data Dependencies)

7 Spectrum of Solutions One of the key questions is when certain information about the load balancing problem is known Leads to a spectrum of solutions: Static scheduling. All information is available to scheduling algorithm, which runs before any real computation starts. (offline algorithms) Semi-static scheduling. Information may be known at program startup, or the beginning of each timestep, or at other well-defined points. Offline algorithms may be used even though the problem is dynamic. Dynamic scheduling. Information is not known until mid-execution. (online algorithms)

8 Approaches Static load balancing Semi-static load balancing Self-scheduling Distributed task queues Diffusion-based load balancing DAG scheduling Mixed Parallelism Note: these are not all-inclusive, but represent some of the problems for which good solutions exist.

9 Static Load Balancing Static load balancing is used when all information is available in advance. Common cases: dense matrix algorithms, such as LU factorization - done using blocked/cyclic layout - blocked for locality, cyclic for load balance; most computations on a regular mesh, e.g., FFT - done using cyclic+transpose+blocked layout for 1D - similar for higher dimensions, i.e., with transpose; sparse-matrix-vector multiplication - use graph partitioning - assumes the graph does not change over time (or at least within a timestep during an iterative solve)

10 Semi-Static Load Balance If the domain changes slowly over time and locality is important: use a static algorithm; do some computation (usually one or more timesteps), allowing some load imbalance on later steps; recompute a new load balance using the static algorithm. Often used in: particle simulations, particle-in-cell (PIC) methods - poor locality may be more of a problem than load imbalance as particles move from one grid partition to another; tree-structured computations (Barnes-Hut, etc.); grid computations with a dynamically changing grid, which changes slowly

11 Self-Scheduling Self-scheduling: keep a centralized pool of tasks that are available to run. When a processor completes its current task, it looks at the pool. If the computation of one task generates more tasks, add them to the pool. Originally used for: scheduling loops by the compiler (really the runtime system). Original paper by Tang and Yew, ICPP 1986

12 When is Self-Scheduling a Good Idea? Useful when: there is a batch (or set) of tasks without dependencies - can also be used with dependencies, but most analysis has only been done for task sets without dependencies; the cost of each task is unknown; locality is not important; using a shared memory multiprocessor, so a centralized pool of tasks is fine

13 Variations on Self-Scheduling Typically, we don't want to grab the smallest unit of parallel work. Instead, choose a chunk of tasks of size K. If K is large, access overhead for the task queue is small. If K is small, we are likely to have even finish times (load balance). Four variations: use a fixed chunk size; guided self-scheduling; tapering; weighted factoring. Note: there are more

14 Variation 1: Fixed Chunk Size Kruskal and Weiss give a technique for computing the optimal chunk size. Requires a lot of information about the problem characteristics, e.g., task costs and number of tasks. Results in an off-line algorithm; not very useful in practice. For use in a compiler, for example, the compiler would have to estimate the cost of each task. All tasks must be known in advance

15 Variation 2: Guided Self-Scheduling Idea: use larger chunks at the beginning to avoid excessive overhead and smaller chunks near the end to even out the finish times. The chunk size $K_i$ at the $i$-th access to the task pool is given by $K_i = \lceil R_i / p \rceil$, where $R_i$ is the total number of tasks remaining and $p$ is the number of processors. See Polychronopoulos, Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers, IEEE Transactions on Computers, Dec
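
The chunk-size rule on this slide can be traced with a few lines of Python. This is a minimal sketch that only computes the sequence of chunk sizes; the function name `guided_chunks` and the example of 20 tasks on 4 processors are illustrative assumptions, not taken from the cited paper.

```python
def guided_chunks(total_tasks, num_procs):
    """Return the chunk sizes handed out by guided self-scheduling.

    Each access to the task pool takes K_i = ceil(R_i / p) tasks,
    where R_i is the number of tasks still in the pool.
    """
    chunks = []
    remaining = total_tasks
    while remaining > 0:
        # Integer ceiling division, avoids floating point rounding.
        k = (remaining + num_procs - 1) // num_procs
        chunks.append(k)
        remaining -= k
    return chunks


# Hypothetical example: 20 tasks shared by 4 processors.
chunks = guided_chunks(20, 4)
print(chunks)
```

The early chunks are large (few accesses to the shared pool), while the last several chunks contain a single task each, which is what evens out the finish times.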

16 Variation 3: Tapering Idea: the chunk size $K_i$ is a function of not only the remaining work, but also the task cost variance. Variance is estimated using history information: high variance => a small chunk size should be used; low variance => larger chunks are OK. See S. Lucco, Adaptive Parallel Programs, PhD Thesis, UCB, CSD. Gives analysis (based on workload distribution). Also gives experimental results: tapering always works at least as well as GSS, although the difference is often small

17 Variation 4: Weighted Factoring Idea: similar to self-scheduling, but divide task cost by the computational power of the requesting node. Useful for heterogeneous systems. Also useful for shared-resource NOWs (networks of workstations), e.g., built using all the machines in a building. As with tapering, historical information is used to predict future speed; speed may depend on the other loads currently on a given processor. See Hummel, Schmidt, Uma, and Wein, SPAA 96, which includes experimental data and analysis

18 Distributed Task Queues The obvious extension of self-scheduling to distributed memory is a distributed task queue (or bag). When are these a good idea? Distributed memory multiprocessors; or shared memory with significant synchronization overhead; locality is not (very) important; tasks that are - known in advance, e.g., a bag of independent ones - subject to dependencies, i.e., being computed on the fly; the cost of tasks is not known in advance

19 Diffusion-Based Load Balancing In the randomized schemes, the machine is treated as fully connected. Diffusion-based load balancing takes topology into account: locality properties are better than in prior work; load balancing is somewhat slower than with randomized schemes; the cost of tasks must be known at creation time; no dependencies between tasks

20 Diffusion-based load balancing The machine is modeled as a graph. At each step, we compute the weight of the tasks remaining on each processor; this is simply the number of tasks if they are unit-cost tasks. Each processor compares its weight with its neighbors and performs some averaging - Markov chain analysis. See Ghosh et al., SPAA 96 for a second-order diffusive load balancing algorithm: takes into account the amount of work sent last time; avoids some oscillation of first-order schemes. Note: locality is still not a major concern, although balancing with neighbors may be better than random
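
The neighbor-averaging step can be sketched as a first-order diffusion update. This is an illustrative sketch, not the second-order algorithm of Ghosh et al.: the update rule w_i <- w_i + alpha * sum over neighbors j of (w_j - w_i), the value of `alpha`, and the 4-processor ring are all assumptions chosen for the example.

```python
def diffusion_step(load, neighbors, alpha):
    """One first-order diffusion step on the processor graph.

    load[i] is the weight of work on processor i, neighbors[i] lists the
    processors adjacent to i, and alpha is the fraction of each pairwise
    difference that is exchanged per step (assumed small enough for
    stability, e.g. alpha <= 1 / (max_degree + 1)).
    """
    new_load = list(load)
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            # Work flows from the heavier to the lighter neighbor.
            new_load[i] += alpha * (load[j] - load[i])
    return new_load


# Hypothetical machine: 4 processors connected in a ring.
ring = [[1, 3], [0, 2], [1, 3], [0, 2]]
load = [8.0, 0.0, 0.0, 0.0]   # all work starts on processor 0
for _ in range(50):
    load = diffusion_step(load, ring, alpha=0.25)
print(load)
```

Because the graph is undirected, every unit of work leaving one processor arrives at a neighbor, so the total load is conserved while the per-processor loads approach the average.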

21 DAG Scheduling For some problems, you have a directed acyclic graph (DAG) of tasks: nodes represent computation (may be weighted); edges represent orderings and usually communication (may also be weighted); it is not that common to have the DAG in advance. Two application domains where DAGs are known: digital signal processing computations; sparse direct solvers (mainly Cholesky, since it doesn't require pivoting). More on this in another lecture. The basic offline strategy: partition the DAG to minimize communication and keep all processors busy. NP-complete, so approximations are needed. Different from graph partitioning, which was for tasks with communication but no dependencies. See Gerasoulis and Yang, IEEE Transactions on P&DS, Jun 93.

22 Task Graphs Directed Acyclic Graph (DAG) $G = (V, E, \tau, c)$: $V$ is the set of nodes, representing computational tasks; $E$ is the set of edges, representing communication of data between tasks; $\tau(v)$ is the execution cost for node $v$; $c(i,j)$ is the communication cost for edge $(i,j)$. Referred to as the delay model (macro dataflow model)

23 Small Task Graph Example

24 Task Scheduling Algorithms Multiprocessor Scheduling Problem: for each task, assign - a starting time - a processor ($P_1, \ldots, P_N$). Goal: minimize execution time, given - precedence constraints - execution cost - communication cost. Algorithms in the literature: list scheduling approaches (ERT, FLB); critical path scheduling approaches (TDS, MCP). Categories: fixed number of processors, fixed c and/or τ, ...
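
To make the problem statement concrete, here is a generic list-scheduling sketch under the delay model. It is not an implementation of ERT, FLB, TDS, or MCP; the priority (longest remaining path, often called the bottom level), the greedy earliest-start placement, and the four-task diamond graph are assumptions chosen for illustration.

```python
def list_schedule(tau, comm, num_procs):
    """Greedy list scheduling of a task DAG.

    tau:  {task: execution cost}
    comm: {(u, v): communication cost} for each edge u -> v; the cost is
          paid only when u and v run on different processors.
    Returns {task: (processor, start, finish)}.
    """
    succs = {v: [] for v in tau}
    preds = {v: [] for v in tau}
    for (u, v) in comm:
        succs[u].append(v)
        preds[v].append(u)

    # Priority: longest path (execution + communication) to an exit task.
    blevel = {}

    def bottom_level(v):
        if v not in blevel:
            blevel[v] = tau[v] + max(
                (comm[(v, s)] + bottom_level(s) for s in succs[v]), default=0)
        return blevel[v]

    for v in tau:
        bottom_level(v)

    proc_free = [0] * num_procs   # time at which each processor is free
    placed = {}
    unscheduled = set(tau)
    while unscheduled:
        ready = [v for v in unscheduled if all(u in placed for u in preds[v])]
        v = min(ready, key=lambda t: (-blevel[t], t))  # highest priority
        best = None
        for p in range(num_procs):
            data_ready = max(
                (placed[u][2] + (0 if placed[u][0] == p else comm[(u, v)])
                 for u in preds[v]), default=0)
            start = max(proc_free[p], data_ready)
            if best is None or start < best[1]:
                best = (p, start)
        p, start = best
        placed[v] = (p, start, start + tau[v])
        proc_free[p] = start + tau[v]
        unscheduled.remove(v)
    return placed


# Hypothetical diamond-shaped task graph: a -> {b, c} -> d.
tau = {"a": 2, "b": 3, "c": 3, "d": 1}
comm = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 1}
schedule = list_schedule(tau, comm, num_procs=2)
makespan = max(finish for (_, _, finish) in schedule.values())
print(schedule, makespan)
```

The output assigns each task a starting time and a processor, which is exactly the pair of decisions the slide asks for; the makespan is the quantity being minimized.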

25 Which Strategy to Use

26 Multiprocessor Systems Continuous need for faster computers: shared memory model; message passing multiprocessor; wide area distributed system

27 Multiprocessors Definition: A computer system in which two or more CPUs share full access to a common RAM

28 Multiprocessor Hardware (1) Bus-based multiprocessors

29 Multiprocessor Hardware (2) UMA Multiprocessor using a crossbar switch

30 Multiprocessor Hardware (3) UMA multiprocessors using multistage switching networks can be built from 2x2 switches (a) 2x2 switch (b) Message format

31 Multiprocessor Hardware (4) Omega Switching Network
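
Routing through an omega network built from 2x2 switches can be sketched as follows. This assumes the usual destination-tag scheme: before each stage the lines are permuted by a perfect shuffle, and the switch at stage s looks at bit s of the destination (most significant bit first), sending the message to its upper output for 0 and its lower output for 1. The function name and the 8-line, 3-stage example are illustrative assumptions.

```python
def omega_route(source, dest, stages):
    """Return the line number a message occupies after each stage.

    The network has 2**stages input and output lines.
    """
    mask = (1 << stages) - 1
    line = source
    path = []
    for s in range(stages):
        # Perfect shuffle: rotate the line number left by one bit.
        line = ((line << 1) | (line >> (stages - 1))) & mask
        # The 2x2 switch picks its output from one destination bit.
        bit = (dest >> (stages - 1 - s)) & 1
        line = (line & ~1) | bit   # 0 = upper output, 1 = lower output
        path.append(line)
    return path


# Hypothetical example: CPU 3 (011) sends to memory module 6 (110).
print(omega_route(3, 6, stages=3))
```

After the last stage the line number equals the destination, regardless of the source, because each stage shifts in one more destination bit.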

32 Multiprocessor Hardware (5) NUMA Multiprocessor Characteristics 1. Single address space visible to all CPUs 2. Access to remote memory via commands - LOAD - STORE 3. Access to remote memory slower than to local

33 Multiprocessor Hardware (6) (a) 256-node directory based multiprocessor (b) Fields of 32-bit memory address (c) Directory at node 36

34 Multiprocessor OS Types (1) Each CPU has its own operating system

35 Multiprocessor OS Types (2) Master-Slave multiprocessors

36 Multiprocessor OS Types (3) Symmetric Multiprocessors: SMP multiprocessor model

37 Multiprocessor Synchronization (1) TSL instruction can fail if bus already locked

38 Multiprocessor Synchronization (2) Multiple locks used to avoid cache thrashing

39 Multiprocessor Synchronization (3) Spinning versus Switching In some cases the CPU must wait, e.g., when it waits to acquire the ready list. In other cases a choice exists: spinning wastes CPU cycles; switching also uses up CPU cycles; it is also possible to make a separate decision each time a locked mutex is encountered

40 Assigning Processes to Processors How are processes/threads assigned to processors? Static assignment. Advantages: dedicated short-term queue for each processor; less overhead in scheduling; allows for group or gang scheduling; process remains with processor from activation until completion. Disadvantages: one or more processors can be idle; one or more processors could be backlogged; difficult to load balance; context transfers are costly.

41 Assigning Processes to Processors Who handles the assignment? Master/Slave: a single processor handles O.S. functions; one processor is responsible for scheduling jobs; tends to become a bottleneck; failure of the master brings the system down. Peer: the O.S. can run on any processor; more complicated operating system. Generally use simple schemes: overhead is a greater problem; threads add additional concerns; CPU utilization is not always the primary factor.

42 Process Scheduling Single queue for all processes. Multiple queues are used for priorities. All queues feed to the common pool of processors. The specific scheduling discipline is less important with more than one processor. A simple FCFS discipline, or FCFS within a static priority scheme, may suffice for a multiple-processor system.

43 Thread Scheduling A thread executes separately from the rest of the process. An application can be a set of threads that cooperate and execute concurrently in the same address space. Threads running on separate processors yield a dramatic gain in performance. However, applications requiring significant interaction among threads may see a significant performance impact with multiprocessing.

44 Multiprocessor Thread Scheduling Load sharing: processes are not assigned to a particular processor. Gang scheduling: a set of related threads is scheduled to run on a set of processors at the same time. Dedicated processor assignment: threads are assigned to a specific processor. Dynamic scheduling: the number of threads can be altered during the course of execution

45 Multiprocessor Scheduling (1) Timesharing note use of single data structure for scheduling

46 Load Sharing Load is distributed evenly across the processors. Select threads from a global queue. Avoids idle processors. No centralized scheduler required. Uses global queues. Widely used. FCFS; smallest number of threads first; preemptive smallest number of threads first

47 Multiprocessor Scheduling (2) Space sharing multiple threads at same time across multiple CPUs

48 Disadvantages of Load Sharing Central queue needs mutual exclusion: may be a bottleneck when more than one processor looks for work at the same time. Preempted threads are unlikely to resume execution on the same processor: cache use is less efficient. If all threads are in the global queue, it is unlikely that all threads of a program will gain access to the processors at the same time.

49 Multiprocessor Scheduling (3) Problem with communication between two threads both belong to process A both running out of phase

50 Multiprocessor Scheduling (4) Solution: Gang Scheduling 1. Groups of related threads scheduled as a unit (a gang) 2. All members of gang run simultaneously on different timeshared CPUs 3. All gang members start and end time slices together

51 Gang Scheduling Schedule related threads on processors to run at the same time. Useful for applications where performance severely degrades when any part of the application is not running. Threads often need to synchronize with each other. Interacting threads are more likely to be running and ready to interact. Less overhead since we schedule multiple processors at once. Have to allocate processors.

52 Multiprocessor Scheduling (5) Gang Scheduling

53 Dedicated Processor Assignment When an application is scheduled, each of its threads is assigned to a processor. Advantage: avoids process switching. Disadvantage: some processors may be idle. Works best when the number of threads equals the number of processors.

54 Dynamic Scheduling The number of threads in a process is altered dynamically by the application. The operating system adjusts the load to improve utilization: assign idle processors; new arrivals may be assigned to a processor that is used by a job currently using more than one processor; hold the request until a processor is available; new arrivals will be given a processor before existing running applications

55 Issues in MP Scheduling Starvation: number of active parallel threads < number of allocated processors. Overhead: CPU time used to transfer and start various portions of the application. Contention: multiple threads attempt to use the same shared resource. Latency: delay in communication between processors and I/O devices

56 How to allocate processors Allocate proportional to average parallelism Other factors: System load Variable parallelism Min/Max parallelism Acquire/relinquish processors based on current program needs

57 Cache Affinity While a program runs, data needed is placed in local cache When job is rescheduled, it will likely access some of the same data Scheduling jobs where they have affinity improves performance by reducing cache penalties

58 Cache Affinity (cont) Tradeoff between processor reallocation and cost of reallocation Utilization versus cache behavior Scheduling policies: Equipartition: constant number of processors allocated evenly to all jobs. Low overhead. Dynamic: constantly reallocates jobs to maximize utilization. High utilization.

59 Cache Affinity (cont) Vaswani and Zahorjan, 1991: When a processor becomes available, allocate it to the runnable process that was last run on that processor, or to a higher priority job. If a job requests additional processors, allocate critical tasks on the processor with the highest affinity. If an allocated processor becomes idle, hold it for a small amount of time in case a task with affinity comes along

60 Vaswani and Zahorjan, 1991 Results showed that utilization was the dominant effect on performance, not cache affinity. But their algorithm did not degrade performance. Predicted that as processor speeds increase, the significance of cache affinity will also increase. Later studies validated their predictions

61 How to Measure Utilization? IPC is not necessarily the best predictor: IPC can have high variations throughout a process; high-IPC threads may unfairly take system resources from low-IPC threads. Other predictors: low number of conflicts, high cache hit rate, diverse instruction mix. Balance: the schedule with the lowest deviation in IPC between coschedules is considered best

62 What About Priorities? The scheduler estimates the natural IPC of a job. If a high-priority job is not meeting the desired IPC, it will be exclusively scheduled on the CPU. Provides a truer implementation of priority: normal schedulers only guarantee proportional resource sharing and assume no interaction between jobs

63 Another Priority Algorithm: SMT hardware fetches instructions to issue from a queue. The scheduler can bias the fetching algorithm to give preference to high-priority threads. Hardware already exists, minimal modifications

64 Multiprocessor Scheduling Will consider only shared memory multiprocessor Salient features: One or more caches: cache affinity is important Semaphores/locks typically implemented as spinlocks: preemption during critical sections

65 Multiprocessor Scheduling Central queue: the queue can be a bottleneck. Distributed queues: load balancing between queues

66 Scheduling Common mechanisms: combine a central queue with per-processor queues (SGI IRIX). Exploit cache affinity: try to schedule on the same processor that a process/thread executed on last. Context switch overhead: quantum sizes are larger on multiprocessors than on uniprocessors

67 Parallel Applications on SMPs Effect of spin-locks: what happens if preemption occurs in the middle of a critical section? Preempt the entire application (co-scheduling); raise priority so preemption does not occur (smart scheduling); both of the above. Provide applications with more control over their scheduling: users should not have to check if it is safe to make certain system calls; if one thread blocks, others must be able to run

68 Distributed Scheduling: Motivation Distributed system with N workstations. Model each workstation as an identical, independent M/M/1 system. Utilization u, P(system idle) = 1-u. What is the probability that at least one system is idle and one job is waiting?
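
The question on this slide can be answered numerically under the stated model. The sketch below assumes the standard M/M/1 stationary distribution P(k jobs) = (1-u)u^k and reads "a job is waiting" as "some workstation holds at least two jobs" (one in service, one queued); both the function name and that reading are assumptions. By inclusion-exclusion, the probability is 1 - P(no node idle) - P(no node has a waiting job) + P(every node has exactly one job).

```python
def p_idle_and_waiting(n, u):
    """P(at least one of n nodes is idle AND at least one has a job waiting).

    Each node is an independent M/M/1 queue with utilization u, so
    P(idle) = 1 - u, P(exactly one job) = (1 - u) * u, P(>= 2 jobs) = u**2.
    """
    p_no_idle = u ** n                       # every node has >= 1 job
    p_none_waiting = (1 - u * u) ** n        # every node has <= 1 job
    p_all_exactly_one = (u * (1 - u)) ** n   # both of the above at once
    return 1 - p_no_idle - p_none_waiting + p_all_exactly_one


# Hypothetical example: 20 workstations at low, moderate, high utilization.
for u in (0.05, 0.5, 0.99):
    print(u, round(p_idle_and_waiting(20, u), 4))
```

For a single workstation the result is zero, as expected, since one node cannot be idle and have a waiting job at the same time. For many workstations the probability peaks at moderate utilization.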

69 Implications The probability is high for moderate system utilization: potential for performance improvement via load distribution. High utilization => little benefit. Low utilization => a job is rarely waiting. Distributed scheduling (aka load balancing) is potentially useful. What is the performance metric? Mean response time. What is the measure of load? Must be easy to measure; must reflect performance improvement

70 Design Issues Measure of load: queue lengths at CPU, CPU utilization. Types of policies: static (decisions hardwired into system); dynamic (uses load information); adaptive (policy varies according to load). Preemptive versus non-preemptive. Centralized versus decentralized. Stability: $\lambda > \mu$ => instability; $\lambda_1 + \lambda_2 < \mu_1 + \mu_2$ => load balance. Job floats around and load oscillates

71 Components Transfer policy: when to transfer a process? Threshold-based policies are common and easy Selection policy: which process to transfer? Prefer new processes Transfer cost should be small compared to execution cost Select processes with long execution times Location policy: where to transfer the process? Polling, random, nearest neighbor Information policy: when and from where? Demand driven [only if sender/receiver], time-driven [periodic], state-change-driven [send update if load changes]

72 Sender-initiated Policy Transfer policy. Selection policy: newly arrived process. Location policy: three variations. Random: may generate lots of transfers => limit max transfers. Threshold: probe n nodes sequentially; transfer to the first node below threshold; if none, keep the job. Shortest: poll $N_p$ nodes in parallel; choose the least loaded node below T
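
The threshold variation can be sketched as below. The slide leaves the transfer policy unspecified, so this sketch assumes a common choice: a node tries to transfer a newly arrived job when accepting it would push its queue length above the threshold T. Load is measured as queue length, probing is simulated by reading a shared list, and all names are hypothetical.

```python
import random


def probe_for_receiver(loads, sender, threshold, probe_limit, rng):
    """Threshold location policy: probe nodes one at a time, in random
    order, and return the first one whose load is below the threshold."""
    others = [n for n in range(len(loads)) if n != sender]
    rng.shuffle(others)
    for node in others[:probe_limit]:
        if loads[node] < threshold:
            return node
    return None   # no suitable receiver found within the probe limit


def place_new_job(loads, sender, threshold, probe_limit, rng):
    """Place a job that has just arrived at `sender`; return the node used."""
    # Assumed transfer policy: transfer only if the sender would exceed T.
    if loads[sender] + 1 > threshold:
        target = probe_for_receiver(loads, sender, threshold, probe_limit, rng)
        if target is not None:
            loads[target] += 1
            return target
    loads[sender] += 1   # keep the job locally
    return sender


# Hypothetical 3-node system; a new job arrives at overloaded node 0.
loads = [5, 0, 5]
placed_on = place_new_job(loads, sender=0, threshold=2, probe_limit=2,
                          rng=random.Random(0))
print(placed_on, loads)
```

Only the newly arrived job is considered for transfer, which matches the selection policy on the slide and avoids the cost of migrating a partially executed process.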

73 Receiver-initiated Policy Transfer policy: if a departing process causes load < T, find a process from elsewhere. Selection policy: newly arrived or partially executed process. Location policy: Threshold: probe up to $N_p$ other nodes sequentially; transfer from the first one above threshold; if none, do nothing. Shortest: poll n nodes in parallel; choose the node with the heaviest load above T

74 Symmetric Policies Nodes act as both senders and receivers: combine previous two policies without change Use average load as threshold Improved symmetric policy: exploit polling information Two thresholds: LT, UT, LT <= UT Maintain sender, receiver and OK nodes using polling info Sender: poll first node on receiver list Receiver: poll first node on sender list

75 Code and Process Migration Motivation How does migration occur? Resource migration Agent-based system Details of process migration

76 Motivation Key reasons: performance and flexibility Process migration (aka strong mobility) Improved system-wide performance better utilization of system-wide resources Examples: Condor, DQS Code migration (aka weak mobility) Shipment of server code to client filling forms (reduce communication, no need to pre-link stubs with client) Ship parts of client application to server instead of data from server to client (e.g., databases) Improve parallelism agent-based web searches

77 Motivation Flexibility: dynamic configuration of distributed system. Clients don't need preinstalled software: download on demand

78 Migration models Process = code seg + resource seg + execution seg Weak versus strong mobility Weak => transferred program starts from initial state Sender-initiated versus receiver-initiated Sender-initiated (code is with sender) Client sending a query to database server Client should be pre-registered Receiver-initiated Java applets Receiver can be anonymous

79 Do Resources Migrate? Depends on resource to process binding By identifier: specific web site, ftp server By value: Java libraries By type: printers, local devices Depends on type of attachments Unattached to any node: data files Fastened resources (can be moved only at high cost) Database, web sites Fixed resources Local devices, communication end points

80 Resource Migration Actions
                               Resource-to-machine binding
Process-to-resource binding    Unattached        Fastened          Fixed
By identifier                  MV (or GR)        GR (or MV)        GR
By value                       CP (or MV, GR)    GR (or CP)        GR
By type                        RB (or GR, CP)    RB (or GR, CP)    RB (or GR)
Actions to be taken with respect to the references to local resources when migrating code to another machine. GR: establish a global system-wide reference. MV: move the resource. CP: copy the resource. RB: rebind the process to a locally available resource