LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig

Size: px

Start display at page:

Download "LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2015. Hermann Härtig"

Arthur Montgomery
10 years ago
Views:

1 LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2015 Hermann Härtig

2 ISSUES starting points independent Unix processes and block synchronous execution who does it load migration mechanism (MosiX) management algorithms (MosiX) information dissemination decision algorithms 2

migration mechanism (MosiX) management algorithms

3 EXTREME STARTING POINTS independent OS processes block synchronous execution (HPC) sequence: compute - communicate all processes wait for all other processes often: message passing for example Message Passing Library (MPI) 3

communicate all processes wait for all other processes

4 SIMPLE PROGRAM MULTIPLE DATA all processes execute same program while (true) { work; exchange data (barrier)} common in High Performance Computing: Message Passing Interface (MPI) library 4

5 DIVIDE AND CONQUER CPU #1 result 1 part 1 result 3 CPU #1 part 2 problem result result 2 part 3 result 4 CPU #2 CPU #2 part 4 node 1 node 2 5

6 MPI BRIEF OVERVIEW Library for message-oriented parallel programming Programming model: Multiple instances of same program Independent calculation Communication, synchronization 6

7 MPI STARTUP & TEARDOWN MPI program is started on all processors MPI_Init(), MPI_Finalize() Communicators (e.g., MPI_COMM_WORLD) MPI_Comm_size() MPI_Comm_rank(): Rank of process within this set Typed messages Dynamically create and spread processes using MPI_Spawn() (since MPI-2) 7

, MPI_COMM_WORLD) MPI_Comm_size() MPI_Comm_rank(): Rank of process

8 MPI EXECUTION Communication Point-to-point Collectives Synchronization Test MPI_Recv( MPI_Bcast( MPI_Reduce( MPI_Barrier( MPI_Send( MPI_Test( MPI_Wait( MPI_Comm void* MPI_Request* buffer, sendbuf, comm request, ) void int MPI_Status count, *flag, *recvbuf, *status ) MPI_Datatype, int MPI_Status count *status ) int MPI_Datatype, source, root, dest, int MPI_Comm MPI_Op tag, op, comm ) MPI_Comm int root, comm, ) MPI_Status MPI_Comm comm *status ) Wait Barrier 8

MPI_Status count, *flag, *recvbuf, *status ) MPI_Datatype, int MPI_Status count *status ) int MPI_Datatype, source,

9 BLOCK AND SYNC blocking call non-blocking call synchronous communication returns when message has been delivered returns immediately, following test/wait checks for delivery asynchronous communication returns when send buffer can be reused returns immediately, following test/wait checks for send buffer 9

test/wait checks for delivery asynchronous communication returns when send

10 int rank, total; MPI_Init(); MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &total); EXAMPLE MPI_Bcast(...); /* work on own part, determined by rank */ if (id == 0) { for (int rr = 1; rr < total; ++rr) MPI_Recv(...); /* Generate final result */ } else { MPI_Send(...); } MPI_Finalize(); 10

..); /* work on own part, determined by rank */ if (id == 0) { for (int rr =

11 AMDAHLS LAW interpretation for parallel systems: P: section that can be parallelized 1-P: serial section N: number of CPUs 11

12 AMDAHL S LAW Serial section: communicate, longest sequential section Parallel, Serial, possible speedup: 1ms, 100 μs: 1/ ms, 1 μs: 1/ μs, 1 μs: 0.01/

13 WEAK VS. STRONG SCALING Strong: accelerate same problem size Weak: extend to larger problem size 13

14 WHO DOES IT application run-time library operating system 14

15 SCHEDULER: GLOBAL RUN QUEUE CPU 1 CPU 2 CPU N all ready processes immediate approach: global run queue 15

16 SCHEDULER: GLOBAL RUN QUEUE does not scale shared memory only contended critical section cache affinity separate run queues with explicit movement of processes 16

17 OS/HW & APPLICATION Operating System / Hardware: All participating CPUs: active / inactive Partitioning (HW) Gang Scheduling (OS) Within Gang/Partition: Applications balance!!! 17

18 HW PARTITIONS & ENTRY QUEUE request queue Application cation 18

19 HW PARTITIONS optimizes usage of network takes OS off critical path best for strong scaling burdens application/library with balancing potentially wastes resources current state of the art in High Performance Computing (HPC) 19

application/library with balancing potentially wastes

20 TOWARDS OS/RT BALANCING work item time Barrier 20

21 EXECUTION TIME JITTER work item time Barrier 21

22 SOURCES FOR EXECUTION JITTER Hardware??? Application Operating system noise 22

23 OPERATING SYSTEM NOISE Use common sense to avoid: OS usually not directly on the critical path, BUT OS controls: interference via interrupts, caches, network, memory bus, (RTS techniques) avoid or encapsulate side activities small critical sections (if any) partition networks to isolate traffic of different applications (HW: Blue Gene) do not run Python scripts or printer daemons in parallel 23

24 SPLITTING BIG JOBS work item time Barrier overdecomposition & oversubscription 24

25 SMALL JOBS (NO DEPS) work item time Barrier Execute small jobs in parallel (if possible) 25

26 BALANCING AT LIBRARY LEVEL Programming Model many (small) decoupled work items overdecompose create more work items than active units run some balancing algorithm Example: CHARM ++ 26

27 BALANCING AT SYSTEM LEVEL create many more processes use OS information on run-time and system state to balance load examples: create more MPI processes than nodes run multiple applications 27

28 CAVEATS added overhead additional communication between smaller work items (memory & cycles) more context switches OS on critical path (for example communication) 28

29 BALANCING ALGORITHMS required: mechanism for migrating load information gathering decisions MosiX system as an example -> Barak s slides now 29

30 FORK IN MOSIX Deputy fork() syscall request Remote Establish a new link fork() Deputy(child) Remote(child) Reply from fork() fork() Reply from fork() 30

31 PROCESS MIGRATION IN MOSIX Process migration request migdaemon Send state, memory maps, dirty pages ack Deputy Remote fork() Transition Finalize migration Ack Migration completed 31

32 SOME PRACTICAL PROBLEMS flooding all processes jump to one new empty node => decide immediately before migration commitment extra communication, piggy packed ping pong if thresholds are very close, processes moved back and forth => tell a little higher load than real 32

33 PING PONG Scenario: compare load on nodes 1 and 2 node 1 moves process to node 2 Solutions: add one + little bit to load average over time Node 1 Node 2 One process two nodes Solves short peaks problem as well (short cron processes) 33

34 LESSONS execution time jitter matters (Amdahl) approaches: partition./. balance dynamic balance components: migration of load, information, decision who, when, where 34

High performance computing systems. Lab 1

High performance computing systems Lab 1 Dept. of Computer Architecture Faculty of ETI Gdansk University of Technology Paweł Czarnul For this exercise, study basic MPI functions such as: 1. for MPI management: