Load Balancing in Charm++ Eric Bohm

Transcription

1 Load Balancing in Charm++ and AMPI Eric Bohm

2 How to Diagnose Load Imbalance? Often hidden in statements such as: o Very high synchronization overhead Most processors are waiting at a reduction Count total amount of computation (ops/flops) per processor o In each phase! o Because the balance may change from phase to phase August 5th, 2009 Charm++ and AMPI: Session II 2

3 Golden Rule of Load Balancing Fallacy: objective of load balancing is to minimize variance in load across processors Example: 50,000 tasks of equal size, 500 processors: A: All processors get 99, except last 5 gets = 199 OR, B: All processors have 101, except last 5 get 1 Identical variance, but situation A is much worse! Golden Rule: It is ok if a few processors idle, but avoid having processors that are overloaded with work Finish time = max i {Time on processor i}, excepting data dependence and communication overhead issues August 5th, 2009 Charm++ and AMPI: Session II 3

4 Amdahls s Law and Grainsize Before we get to load balancing: Original law : o If a program has K % sequential section, then speedup is limited to 100/K. If the rest of the program is parallelized completely Grainsize corollary: o If any individual piece of work is > K time units, and the sequential program takes T seq, Speedup is limited to T seq / K So: o o Examine performance data via histograms to find the sizes of remappable work units If some are too big, change the decomposition method to make smaller units August 5th, 2009 Charm++ and AMPI: Session II 4

5 Grainsize (working) Definition: the amount of computation per potentially parallel event (task creation, enqueue/dequeue, messaging, locking..) Time 1 processor Grainsize p processors August 5th, 2009 Charm++ and AMPI: Session II 5

6 Rules of Thumb for Grainsize Make it as small as possible, as long as it amortizes the overhead More specifically, ensure: o Average grainsize is greater than k v (say 10v) o No single grain should be allowed to be too large Must be smaller than T/p, but actually we can express it as Must be smaller than k m v (say 100v) Important corollary: o You can be close to optimal grainsize without having to think about P, the number of processors August 5th, 2009 Charm++ and AMPI: Session II 8

7 Molecular Dynamics in NAMD Collection of [charged] atoms, with bonds o Newtonian mechanics o Thousands of atoms (1, ,000) o 1 femtosecond time step, millions needed! At each time step o Calculate forces on each atom Bonds: Non bonded: electrostatic and van der Waal s Short distance: every timestep Long distance: every 4 timesteps using PME (3D FFT) Multiple Time Stepping o Calculate velocities and advance positions Collaboration with K. Schulten, R. Skeel, and coworkers August 5th, 2009 Charm++ and AMPI: Session II 10

8 Hybrid Decomposition Object Based Parallelization for MD: Force Decomp. + Spatial Decomp. We have many objects to load balance: o Each diamond can be assigned to any proc. o Number of diamonds (3D): 14 Number of Patches August 5th, 2009 Charm++ and AMPI: Session II 11

9 Grainsize Analysis via Histograms Grainsize distribution number of objects Solution: Split compute objects that may have too much work, using a heuristic based on the number of interacting atoms grainsize in milliseconds Problem August 5th, 2009 Charm++ and AMPI: Session II 13

10 Fine Grained Decomposition on BlueGene Force Evaluation Integration Decomposing atoms into smaller bricks gives finer grained parallelism August 5th, 2009 Charm++ and AMPI: Session II 16

11 Load Balancing Strategies Classified by when it is done: o o o Initially Dynamic: Periodically Dynamic: Continuously Classified by whether decisions are taken with global information o o o Fully centralized Quite good a choice when load balancing period is high Fully distributed Each processor knows only about a constant number of neighbors Extreme case: totally local decision (send work to a random destination processor, with some probability). Use aggregated global information, and detailed neighborhood info. August 5th, 2009 Charm++ and AMPI: Session II 17

12 Dynamic Load Balancing Scenarios: Examples representing typical classes of situations o Particles distributed over simulation space Dynamic: because Particles move. Cases: Highly non uniform distribution (cosmology) Relatively Uniform distribution o Structured grids, with dynamic refinements/coarsening o Unstructured grids with dynamic refinements/coarsening August 5th, 2009 Charm++ and AMPI: Session II 18

13 Measurement Based Load Balancing Principle of persistence o Object communication patterns and computational loads tend to persist over time, in spite of dynamic behavior Abrupt but infrequent changes Slow and small changes Runtime instrumentation o Measures communication volume and computation time Measurement based load balancers o Use the instrumented data base periodically to make new decisions o Many alternative strategies can use the database August 5th, 2009 Charm++ and AMPI: Session II 21

14 Load Balancing Steps Regular Timesteps Detailed, aggressive Load Balancing Time Instrumented Timesteps Refinement Load Balancing August 5th, 2009 Charm++ and AMPI: Session II 23

15 Charm++ Strategies Centralized GreedyLB GreedyCommLB RecBisectBfLB MetisLB TopoCentLB RefineLB RefineCommLB OrbLB NeighborLB NeighborCommLB WSLB Distributed HybridLB o Combine strategies hierarchically August 5th, 2009 Charm++ and AMPI: Session II 25

16 Load Balancer in Action Automatic Load Balancing in Crack Propagation Number of Iterations Per second Elements Added 2. Load Balancer Invoked 3. Chunks Migrated Iteration Num ber August 5th, 2009 Charm++ and AMPI: Session II 28

17 Distributed Load Balancing Centralized strategies o Still ok for 3000 processors for NAMD Distributed balancing is needed when: o o Number of processors is large and/or Load variation is rapid Large machines: o o o Need to handle locality of communication Topology sensitive placement Need to work with scant global information Approximate or aggregated global information (average/max load) Incomplete global info (only neighborhood ) Work diffusion strategies (1980s work by Kale and others!) Achieving global effects by local action August 5th, 2009 Charm++ and AMPI: Session II 29

18 Load Balancing on Large Machines Existing load balancing strategies don t scale on extremely large machines Limitations of centralized strategies: o Central node: memory/communication bottleneck o Decision making algorithms tend to be very slow Limitations of distributed strategies: o Difficult to achieve well informed load balancing decisions August 5th, 2009 Charm++ and AMPI: Session II 30

19 Simulation Study Memory Overhead Simulation performed with the performance simulator BigSim Memory usage (MB) K 256K 512K 1M 32K processors 64K processors Number of objects lb_test benchmark is a parameterized program that creates a specified number of communicating objects in 2D mesh. August 5th, 2009 Charm++ and AMPI: Session II 31

20 Hierarchical Load Balancers Hierarchical distributed load balancers o Divide into processor groups o Apply different strategies at each level o Scalable to a large number of processors August 5th, 2009 Charm++ and AMPI: Session II 33

21 Our HybridLB Scheme Refinement-based Load balancing 1 Load Data Load Data (OCG) Greedy-based Load balancing token object August 5th, 2009 Charm++ and AMPI: Session II 34

22 Hybrid Load Balancing Performance Time(s) 500 Simulation of lb_test for 64K processors Load Balance Time K 512K 1M Number of Objects Maximum predicted load (seconds) GreedyCommLB HybridLB(GreedyCommLB) Application Time 256K 512K 1M Number of Objects N procs Memory 6.8MB 22.57MB 22.63MB lb_test benchmark s actual run on BG/L at IBM (512K objects) August 5th, 2009 Charm++ and AMPI: Session II 35

23 Load Balancing: Hands on August 5th, 2009 Charm++ and AMPI: Session II 36

24 Simple Imbalance LB_Test.C 1D Array of chares Half of which have 2x computation load Strong scaling o make will produce LB_Test o Run LB_Test with these parameters Arguments: Chares per core, iterations, workload multiplier, array size Use at least 7 processors (precede those arguments with np 7) August 5th, 2009 Charm++ and AMPI: Session II 37

25 Output Without Balancing Charm++> cpu topology info is being gathered. Charm++> 1 unique compute nodes detected. Running on 7 processors with 40 chares per pe All array elements ready at seconds. Computation Begins [0] Element 0 took seconds for work 1 at iteration 0 sumc 4.664e+13 [1] Element 40 took seconds for work 2 at iteration 0 sumc 8.748e+14 [0] Element 0 took seconds for work 1 at iteration 99 sumc 4.664e+13 [1] Element 40 took seconds for work 2 at iteration 99 sumc 8.748e+14 Total work performed = seconds Average total chare work per iteration = seconds Average iteration time = seconds Done after seconds August 5th, 2009 Charm++ and AMPI: Session II 38

26 Analyze Performance Productivity => not wasting your time o Measure twice, cut once make projections o Produces LB_Test_prj o Change your job script to run LB_Test_prj o mkdir nobalancetrace o Add arguments +logsize traceroot $PWD/nobalancetrace o Execution will create trace files in nobalancetrace August 5th, 2009 Charm++ and AMPI: Session II 39

27 Download and Visualize Download the contents of nobalancetrace Or extract sample from nobalancetrace.tar o tar xf nobalancetrace.tar Run projections o Load LB_Test_prj.sts Open timeprofile on several steps o 4s to 8s for sample August 5th, 2009 Charm++ and AMPI: Session II 40

28 Time Profile no Balance August 5th, 2009 Charm++ and AMPI: Session II 41

29 Fix Migration Fix the pup routine for the LB_Test chare o PUP each member variable p varname; o Need to do memory allocation when unpacking if(p.isunpacking){ /* allocate dynamic members */ } o PUP dynamically created arrays PUParray(p, varname, numelements); o Remove the CkAbort August 5th, 2009 Charm++ and AMPI: Session II 42

30 Add Load Balancer Support Add call to AtSync in LB_Test::next_iter if ((iteration == balance_iteration) && usesatsync) { AtSync(); } else { compute(); } Add ResumeFromSync void ResumeFromSync(void) { // Called by Load balancing framework compute(); } Answer is in LB_Test_final.C August 5th, 2009 Charm++ and AMPI: Session II 43

31 Use GreedyLB Change job script to run LB_Test_LB Add argument +balancer GreedyLB o Run on the same number of processors with the same arguments o August 5th, 2009 Charm++ and AMPI: Session II 44

32 Output with Balancing Charm++> cpu topology info is being gathered. Charm++> 1 unique compute nodes detected. [0] GreedyLB created Running on 7 processors with 40 chares per pe All array elements ready at seconds. Computation Begins [0] Element 0 took seconds for work 1 at iteration 0 sumc 4.664e+13 [1] Element 40 took seconds for work 2 at iteration 0 sumc 8.748e+14 [6] Element 0 took seconds for work 1 at iteration 99 sumc 4.664e+13 [6] Element 40 took seconds for work 2 at iteration 99 sumc 8.748e+14 Total work performed = seconds Average total chare work per iteration = seconds Average iteration time = seconds Done after seconds August 5th, 2009 Charm++ and AMPI: Session II 45

33 Compare Consider average iteration time Consider total CPU time o Walltime * number of processors o The more processors you use, the more important it is to reduce iteration time through efficiency o Look for overloaded processors Underloading is just a symptom Overload implies bottleneck August 5th, 2009 Charm++ and AMPI: Session II 46

34 Usage Profile Use Usage Profile from Tools menu Examine area before load balancing o Note, intervals are in 100ms o 3000ms to 4000ms works for the sample August 5th, 2009 Charm++ and AMPI: Session II 47

35 Analyze Performance Again Productivity => not wasting your time o Measure twice, cut once Make projections o Produces LB_Test_LB_prj o Change your job script to run LB_Test_LB_prj o mkdir balancetrace o Add arguments +logsize traceroot $PWD/balancetrace o Execution will create trace files in balancetrace August 5th, 2009 Charm++ and AMPI: Session II 48

36 Usage Profile Before Balance August 5th, 2009 Charm++ and AMPI: Session II 49

37 Timeline Across Balancer Open timeline spanning load balancing o 4s to 8s works for sample o Try a large time span on a few cores then zoom in August 5th, 2009 Charm++ and AMPI: Session II 50

38 Summary Look for load imbalance Migratable objects are not hard to use Charm++ has significant infrastructure to help o On your own try this benchmark at varying processor numbers See the impact on scaling with different array sizes See the impact on total runtime when the number of iterations grows large. Try other load balancers 1p.html#lbFramework August 5th, 2009 Charm++ and AMPI: Session II 51

39 Sanjay Kale & Eric Bohm INTERMEDIATE CHARM ++ August 5th, 2009 Charm++ and AMPI: Session II 52

40 Outline Messages Groups, nodegroups Startup process Fault tolerance Advanced o Communication optimization o Advanced arrays o Conditional parking o Make your own LB strategy o Interact with CCS and Python o Higher level languages August 5th, 2009 Charm++ and AMPI: Session II 53

41 Parameter Marshalling Application passes parameters by value o myproxy.myentry(... arguments...); PUP::able types as arguments The receiver cannot maintain a pointer to the input The system allocates a message containing the parameters to send (CkMarshallMsg) entry void receive(int v); entry void startstep(); entry void eastghost(int n, double vals[n]); n vals_off vals_cnt vals August 5th, 2009 Charm++ and AMPI: Session II 54

42 message InfoMsg; class InfoMsg : public CMessage_InfoMsg { int iter;... other data... methods } Messages Necessary in some situations o E.g. Specify order of operations (priority) Possible optimizations o Avoid memcopy and memory allocation o Reuse the same message multiple times E.g. Yield processor using a message void MyArray::compute(InfoMsg *msg) {... do some work } if (workdone) delete msg; else thisproxy[thisindex].compute(msg); August 5th, 2009 Charm++ and AMPI: Session II 55

43 Jacobi::startStep() { Ghost *msg = new (localrows) Ghost(localRows); for (int i = 1; i < localrows; i++) msg >vals[i 1] = values[i][localcols+1]; } Variable Messages (Jacobi) } thisproxy(thisindex.x + 1, thisindex.y).westghost(msg);... Jacobi::northGhost(Ghost *msg) { north = msg; ghostreceived ++; A[0] = msg >vals 1; attemptcompute(); } Jacobi::attemptCompute() {... delete north; } class Ghost : CMessage_Ghost{ int len; double *vals; } len vals *vals north message Ghost { double vals[]; A[0][1]A[0][localCols] August 5th, 2009 Charm++ and AMPI: Session II 56

44 Message Priorities Application assigns priorities to some messages Charm scheduler respects priorities while draining message queues Separate message queues for zero, negative and positive priorities It is an optimization Beware of starvation! o A message might never get scheduled Charm ++ does not guarantee the delivery order, only a best effort August 5th, 2009 Charm++ and AMPI: Session II 57

45 Message Priorities (Cont.) Different queuing strategies CK_QUEUEING_FIFO, CK_QUEUEING_LIFO To specify the priority: int prio =... MsgType *msg = new (8*sizeof(int)) MsgType; *CkPriorityPtr(msg) = prio; CkSetQueueing(msg, CK_QUEUEING_IFIFO); CK_QUEUEING_IFIFO, CK_QUEUEING_ILIFO negative high 0 positive low August 5th, 2009 Charm++ and AMPI: Session II 58

46 mainchare Main {... Main::Main(CkArgMsg }; *m) : CBase_Main(m) { array [1D] MyArray {... CProxy_MyGroup }; group1, group2; group MyGroup { entry MyGroup(int n); entry MyGroup(); }; } Groups Collection of chares in which exactly one chare is present in each processor o Indexable with the processor rank It is an optimization o Useful for libraries, when each processor needs a local branch to service local chares Ex. Software cache manager: all chares in a processor share the same read only data, avoid extra communication arrayproxy = CProxy_MyArray::ckNew(nElem); group1 = CProxy_MyGroup::ckNew(); group2 = Cproxy_MyGroup::ckNew(100); August 5th, 2009 Charm++ and AMPI: Session II 59

47 Groups (Cont.) Should not be used to perform computation in place of chare arrays! o Groups are not load balanced Nodegroups: o Like groups, but with one chare per node o Differ from groups only if Charm ++ compiled for SMP o Can execute on any processor within the node, even concurrently Keywork exclusive to prevent data races August 5th, 2009 Charm++ and AMPI: Session II 60

48 Startup Process initnode and initproc executed o Run once every node or processor, respectively o Declared in the.ci file All mainchare constructors executed o Create chare arrays/groups Constructor methods are executed immediately on proc 0 o They can set readonly variables Readonly are synchronized Every other entry method is executed o This includes constructor methods on proc#0 August 5th, 2009 Charm++ and AMPI: Session II 61 61

49 Fault Tolerance Checkpointing o Simply PUP all Charm ++ entities to disk o Trigger with CkStartCheckpoint( dir, cb) o Callback cb called upon checkpoint completion Called both after checkpoint and upon restart o To restart: +restart <logdir> Live recovery methods (experimental) o Double in memory checkpoint o Message logging Only faulty processor rolls back August 5th, 2009 Charm++ and AMPI: Session II 62

50 Fault Tolerance: Example Main::Main(CkMigrateMessage *m) : CBase_Main(m) { // Subtle: Chare proxy readonly needs to be updated // manually because of the object pointer inside it! mainproxy = thisproxy; } void Main::pup(PUP::er &p) {... } readonly CProxy_Main mainproxy; mainchare [migratable] Main {... }; group [migratable] MyGroup {... }; void Main::next(CkReductionMsg *m) { if ((++step % 10) == 0) { CkCallback cb(ckindex_hello::sayhi(),helloproxy); CkStartCheckpoint("log",cb); } else { helloproxy.sayhi(); } delete m; } August 5th, 2009 Charm++ and AMPI: Session II 63

51 Sanjay Kale & Eric Bohm ADVANCED TUTORIAL August 5th, 2009 Charm++ and AMPI: Session II 64

52 Communication Optimization Optimize certain most recurrent communication patterns o Streaming: reduce overhead of many small msgs o Multicast o All to all Each must be used with its own API o Each may have multiple alternative implementations, embodying different strategies o Programmer can choose best strategy for their scenario August 5th, 2009 Charm++ and AMPI: Session II 65

53 Advanced Arrays Sections: create proxies representing slices of a chare array o Optimize communication with Comlib or CkMulticast o Ex: Row/column of a 2D chare array Mapping: manually specify map of chares to PEs o Ex: Place communicating objects on same processor Bound arrays: Tie two chare arrays together o The system places and migrates corresponding indices together o Ex: FFT helper library bound to work array August 5th, 2009 Charm++ and AMPI: Session II 66

54 Conditional Packing for SMP Pass pointer if destination is one the same node Copy data into message if remote destination message Slice { conditional Boomarray<double> data; } chare Integrate { entry Integrate(Slice *m); entry Integrate(Boomarray<double> d conditional); } void Integrate::Integrate(Slice *msg) { Boomarray<double> &b = *msg >data;... do work using b... // Send back the modified data mainproxy.results(msg); } class Slice : CMessage_Slice { double *data; } August 5th, 2009 Charm++ and AMPI: Session II 67

55 Make Your Own LB strategy You can overwrite automatic measurements, with application supplied performance estimates o Reimplement UserSetLBLoad() in your chare o Use setobjtime(time) and getobjtime() Or, you can implement a new strategy o foolb::work(centrallb::ldstats* stats, int count) o Use the gathered data, decide a new assignment of objects to processor o The system will handle migration of objects August 5th, 2009 Charm++ and AMPI: Session II 68

56 CCS: Converse Client Server Allows interactivity User registers callbacks to execute when certain messages are received by the application from the outside CcsRegisterHandler( myrequest, CkCallback(CkIndex_Main::request(0), mainproxy)); Current uses: o LiveViz (visualization) o CharmDebug o Projections August 5th, 2009 Charm++ and AMPI: Session II 69

57 Interact with Python Scripting Upload Python scripts via CCS and run them on demand There are three ways in which Python scripts can interact with application o Low level read/write (access single variables) o High level (call local entry methods) o Iterative (apply a Python function to a set of objects) Client binding for C++ and Java August 5th, 2009 Charm++ and AMPI: Session II 70

58 Higher Level Languages Incomplete but simple languages Target specific patterns of interaction Interoperate effectively with each other o And with Charm ++, AMPI o Because of message drive scheduler in Charm ++ SDAG: describes life cycle of a chare clearly Charisma: orchestrates multiple collections of chares, describing global flow of data and control MSA (Multiphase Shared Arrays): disciplined shared memory August 5th, 2009 Charm++ and AMPI: Session II 71

59 More References Online tutorial: o Charm ++ manual: o o CCS and liveviz under Converse manual Comprehensive FAQ o August 5th, 2009 Charm++ and AMPI: Session II 72