Optimizing Load Balance Using Parallel Migratable Objects




Optimizing Load Balance Using Parallel Migratable Objects
Laxmikant V. Kalé, Eric Bohm
Parallel Programming Laboratory, University of Illinois Urbana-Champaign
2012/9/25

Optimization Do's and Don'ts
- Correcting load imbalance is an optimization: like all optimizations, it is a cure for a performance ailment
- Diagnose the ailment before applying treatment
- Use performance analysis tools to understand performance (ironically, we cover that material later, but your process should be to use them early)
- A sampling of tools of interest:
  - Compiler reports for inlining, instruction-level parallelism, etc.
  - Profiling tools (gprof, xprof, manual timing)
  - Hardware counters (PAPI, PCL, etc.)
  - Valgrind memory tool suite
  - Parallel analysis tools: Projections, HPCToolkit, TAU, JumpShot, etc.

How to Diagnose Load Imbalance
- It is often hidden in statements such as:
  - "Very high synchronization overhead"
  - "Most processors are waiting at a reduction"
- Count the total amount of computation (ops/flops) per processor
  - In each phase! The balance may change from phase to phase
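The per-processor counting above can be condensed into a single number. A minimal sketch (plain C++, not from the slides): the imbalance factor max/average of the measured per-processor op counts, computed separately for each phase. A value of 1.0 means perfect balance, and (max − avg)/max approximates the fraction of parallel time lost to imbalance.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Imbalance factor for one phase: max load / average load.
// 1.0 is perfectly balanced; larger values mean more idle time.
double imbalance_factor(const std::vector<double>& per_proc_ops) {
    double mx  = *std::max_element(per_proc_ops.begin(), per_proc_ops.end());
    double avg = std::accumulate(per_proc_ops.begin(), per_proc_ops.end(), 0.0)
                 / per_proc_ops.size();
    return mx / avg;  // (mx - avg) / mx estimates the lost fraction
}
```

Computing this per phase, as the slide insists, matters: a run can look balanced in aggregate while individual phases alternate between opposite imbalances.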

Golden Rule of Load Balancing
- Fallacy: the objective of load balancing is to minimize variance in load across processors
- Example: 50,000 tasks of equal size, 500 processors:
  - A: all processors get 99 tasks, except the last 5, which get 199 (100 + 99)
  - B: all processors get 101 tasks, except the last 5, which get 1
- Identical variance, but situation A is much worse!
- Golden Rule: it is OK if a few processors idle, but avoid having processors that are overloaded with work
- Finish time = max_i(time on processor i), excepting data-dependence and communication-overhead issues
- The speed of any group is the speed of the slowest member of that group
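To make the rule concrete, here is an illustrative C++ sketch (not from the slides) that evaluates the two scenarios: both distribute the same 50,000 task units and have identical variance, but finish time is the maximum per-processor load, so A finishes at 199 while B finishes at 101.

```cpp
#include <algorithm>
#include <vector>

// Finish time is governed by the slowest (most loaded) processor.
double finish_time(const std::vector<double>& load) {
    return *std::max_element(load.begin(), load.end());
}

// Build a load vector: (nprocs - noutliers) processors carry `common`
// units each; the last `noutliers` carry `outlier` units each.
std::vector<double> scenario(double common, double outlier,
                             int nprocs, int noutliers) {
    std::vector<double> load(nprocs - noutliers, common);
    load.insert(load.end(), noutliers, outlier);
    return load;
}
```

Scenario A is `scenario(99.0, 199.0, 500, 5)` and B is `scenario(101.0, 1.0, 500, 5)`: in B the five near-idle processors cost nothing, while in A the five overloaded ones stall everyone else.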

Automatic Dynamic Load Balancing
- Measurement-based load balancers
  - Principle of persistence: in many CSE applications, computational loads and communication patterns tend to persist, even in dynamic computations
  - Therefore, the recent past is a good predictor of the near future
- Charm++ provides a suite of load balancers
  - Periodic measurement and migration of objects
- Seed balancers (for task parallelism)
  - Useful for divide-and-conquer and state-space-search applications
  - Seeds for Charm++ objects are moved around until they take root

Using the Load Balancer
- Link a load balancing module: -module <strategy>
  - RefineLB, NeighborLB, GreedyCommLB, and others
  - EveryLB includes all load balancing strategies
- Compile-time option (specifies the default balancer): -balancer RefineLB
- Runtime option: +balancer RefineLB

Code to Use Load Balancing
- At a natural barrier, insert: if (myLBStep) AtSync(); else ResumeFromSync();
- Implement ResumeFromSync() to resume execution
- A typical ResumeFromSync() contributes to a reduction

Example: Stencil

  while (!converged) {
    atomic {
      int x = thisIndex.x, y = thisIndex.y, z = thisIndex.z;
      copyToBoundaries();
      thisProxy(wrapX(x-1),y,z).updateGhosts(i, RIGHT, dimY, dimZ, right);
      /* ...similar calls to send the 6 boundaries... */
      thisProxy(x,y,wrapZ(z+1)).updateGhosts(i, FRONT, dimX, dimY, front);
    }
    for (remoteCount = 0; remoteCount < 6; remoteCount++) {
      when updateGhosts[i](int i, int d, int w, int h, double b[w*h])
        atomic { updateBoundary(d, w, h, b); }
    }
    atomic {
      int c = computeKernel() < DELTA;
      CkCallback cb(CkReductionTarget(Jacobi, checkConverged), thisProxy);
      if (i % 5 == 1) contribute(sizeof(int), &c, CkReduction::logical_and, cb);
    }
    if (i % lbPeriod == 0) {
      atomic { AtSync(); }
      when ResumeFromSync() {}
    }
    if (++i % 5 == 0) {
      when checkConverged(bool result)
        atomic { if (result) { mainProxy.done(); converged = true; } }
    }
  }

Chare Migration: Motivations
- Chares are initially placed according to a placement map
  - The user can specify this map
- While running, some processors might become overloaded
  - Need to rebalance the load
- Automatic checkpoint
  - Migration to disk
- Chares are made serializable for transport using the Pack/UnPack (PUP) framework

The PUP Process

PUP Usage Sequence
- Migration out: ckAboutToMigrate, Sizing, Packing, Destructor
- Migration in: Migration constructor, UnPacking, ckJustMigrated

Writing a PUP Routine

  class MyChare : public CBase_MyChare {
    int a;
    float b;
    char c;
    float localArray[LOCAL_SIZE];
    int heapArraySize;
    float* heapArray;
    MyClass* pointer;

   public:
    MyChare();
    MyChare(CkMigrateMessage* msg) {}
    ~MyChare() {
      if (heapArray != NULL) {
        delete [] heapArray;
        heapArray = NULL;
      }
    }
    void pup(PUP::er &p) {
      CBase_MyChare::pup(p);
      p | a; p | b; p | c;
      p(localArray, LOCAL_SIZE);
      p | heapArraySize;
      if (p.isUnpacking()) {
        heapArray = new float[heapArraySize];
      }
      p(heapArray, heapArraySize);
      int isNull = (pointer == NULL) ? 1 : 0;
      p | isNull;
      if (!isNull) {
        if (p.isUnpacking()) pointer = new MyClass();
        p | *pointer;
      }
    }
  };

PUP: Issues
- If variables are added to an object, update the PUP routine
- If the object allocates data on the heap, copy it recursively, not just the pointer
- Remember to allocate memory while unpacking
- Sizing, Packing, and UnPacking must scan the same variables in the same order
- Test PUP routines with +balancer RotateLB

Performance

Grainsize and Load Balancing: How Much Balance Is Possible?

Grainsize for Extreme Scaling
- Strong scaling is limited by the expressed parallelism
- Minimum iteration time is limited by the lengthiest computation
  - The largest grains set the lower bound
- 1-away, generalized to k-away, provides fine-grained control of granularity

NAMD: 2-AwayX Example

Load Balancing Strategies
- Classified by when balancing is done:
  - Initially
  - Dynamic: periodically
  - Dynamic: continuously
- Classified by whether decisions are made with global information:
  - Fully centralized: a good choice when the load balancing period is long
  - Fully distributed: each processor knows only about a constant number of neighbors
    - Extreme case: totally local decisions (send work to a random destination processor, with some probability)
  - Use aggregated global information together with detailed neighborhood information

Dynamic Load Balancing Scenarios
Examples representing typical classes of situations:
- Particles distributed over the simulation space
  - Dynamic, because particles move
  - Highly non-uniform distribution (cosmology)
  - Relatively uniform distribution
- Structured grids with dynamic refinement/coarsening
- Unstructured grids with dynamic refinement/coarsening

Example Case: Particles
- Orthogonal Recursive Bisection (ORB): at each stage, divide the particles equally
- The processor count does not need to be a power of 2: with 5 processors, divide in proportion 2:3
- How to choose the dimension along which to cut? Choose the longest one
- How to draw the line?
  - If all data is on one processor: sort along each dimension
  - Otherwise: run a distributed histogramming algorithm to find the line, recursively
- Find the entire tree, and then do all data movement at once
  - Or do it in two or three steps, but there is no reason to redistribute particles after drawing each line
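The ORB procedure above can be sketched in plain C++ (an illustrative single-node helper, not the Charm++ implementation): recursively cut the longest dimension at the median until 2^depth groups remain. A real implementation would cut in processor-count proportions and use distributed histogramming rather than a local median, as the slide notes.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

using Pt = std::pair<double, double>;  // 2D particle position (x, y)

// Orthogonal recursive bisection: split `pts` into 2^depth equal groups,
// always cutting the axis with the largest extent at its median.
void orb(std::vector<Pt> pts, int depth, std::vector<std::vector<Pt>>& out) {
    if (depth == 0 || pts.size() < 2) { out.push_back(std::move(pts)); return; }
    double xmin = 1e300, xmax = -1e300, ymin = 1e300, ymax = -1e300;
    for (const Pt& p : pts) {
        xmin = std::min(xmin, p.first);  xmax = std::max(xmax, p.first);
        ymin = std::min(ymin, p.second); ymax = std::max(ymax, p.second);
    }
    bool cutX = (xmax - xmin) >= (ymax - ymin);  // cut the longest dimension
    auto mid = pts.begin() + pts.size() / 2;     // equal halves at the median
    std::nth_element(pts.begin(), mid, pts.end(),
        [cutX](const Pt& a, const Pt& b) {
            return cutX ? a.first < b.first : a.second < b.second;
        });
    std::vector<Pt> left(pts.begin(), mid), right(mid, pts.end());
    orb(std::move(left),  depth - 1, out);
    orb(std::move(right), depth - 1, out);
}
```

Note the use of nth_element instead of a full sort: finding the median line does not require globally ordered particles, which is also why distributed histogramming suffices in the parallel case.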

Dynamic Load Balancing Using Objects
- Object-based decomposition (i.e., virtualized decomposition) helps
  - It allows the RTS to remap objects to balance load
- But how does the RTS decide where to map objects?
  - Just move objects away from overloaded processors to underloaded processors
- How is load determined?

Measurement-Based Load Balancing
- Principle of persistence
  - Object communication patterns and computational loads tend to persist over time
  - In spite of dynamic behavior: abrupt but infrequent changes, or slow and small changes
- Runtime instrumentation
  - Measures communication volume and computation time
- Measurement-based load balancers
  - Use the instrumented database periodically to make new decisions
  - Many alternative strategies can use the database

Periodic Load Balancing
- Stop the computation?
- Centralized strategies: the Charm RTS collects data (on one processor) about:
  - Computational load and communication for each pair
  - If you are not using AMPI/Charm, you can do the same instrumentation and data collection
- Partition the graph of objects across processors
  - Take communication into account: point-to-point, as well as multicast over a subset
  - As you map an object, add to the load on both the sending and receiving processors
  - Multicasts to multiple co-located objects are effectively the cost of a single send

Typical Load Balancing Steps

Object Partitioning Strategies
- You can use graph partitioners like METIS or K-R, but the graphs here are smaller and the optimization criteria are different
- Greedy strategies:
  - If communication costs are low, use a simple greedy strategy:
    - Sort objects by decreasing load
    - Maintain processors in a heap (keyed by assigned load)
    - In each step, assign the heaviest remaining object to the least loaded processor
  - With small-to-moderate communication cost: same strategy, but add communication costs as you add an object to a processor
- Always add a refinement step at the end:
  - Move work from the most heavily loaded processor to some other processor
  - Repeat a few times or until there is no improvement
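The simple greedy strategy above can be sketched in plain C++ (illustrative, not Charm++'s balancer source): sort objects by decreasing load, keep processors in a min-heap keyed by assigned load, and always give the heaviest remaining object to the least loaded processor.

```cpp
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

// Greedy assignment ignoring communication costs.
// Returns assignment[i] = processor chosen for object i;
// `loads` receives the final per-processor load.
std::vector<int> greedy_assign(std::vector<double> objLoad, int nprocs,
                               std::vector<double>& loads) {
    // (load, index) pairs, heaviest object first
    std::vector<std::pair<double,int>> objs;
    for (int i = 0; i < (int)objLoad.size(); i++)
        objs.push_back({objLoad[i], i});
    std::sort(objs.rbegin(), objs.rend());

    // min-heap of (assigned load, processor)
    using P = std::pair<double,int>;
    std::priority_queue<P, std::vector<P>, std::greater<P>> heap;
    for (int p = 0; p < nprocs; p++) heap.push({0.0, p});

    std::vector<int> assignment(objLoad.size());
    loads.assign(nprocs, 0.0);
    for (auto& [w, i] : objs) {
        auto [load, p] = heap.top(); heap.pop();  // least loaded processor
        assignment[i] = p;
        loads[p] = load + w;
        heap.push({loads[p], p});
    }
    return assignment;
}
```

A refinement pass, as the slide recommends, would then repeatedly try moving an object off the maximally loaded processor whenever that lowers the maximum.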

Object Partitioning Strategies (2)
- When communication cost is significant, still use a greedy strategy, but:
  - At each assignment step, choose between assigning object O to the least loaded processor and to the processor that already holds the objects that communicate most with O, based on the degree of difference between the two metrics
- Two-stage assignment:
  - In early stages, consider communication costs as long as the processors are in the same (broad) load class
  - In later stages, decide based on load
- Branch-and-bound: searches for the optimum, but can be stopped after a fixed time

Crack Propagation
- Decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right). The middle area contains cohesive elements. Both decompositions were obtained using METIS. Pictures: S. Breitenfeld and P. Geubelle
- As the computation progresses, the crack propagates and new elements are added, leading to more complex computations in some chunks

Load Balancing Crack Propagation

Distributed Load Balancing
- Centralized strategies are still OK at 3000 processors for NAMD
- Distributed balancing is needed when the number of processors is large and/or load variation is rapid
- Large machines:
  - Need to handle locality of communication: topology-sensitive placement
  - Need to work with scant global information:
    - Approximate or aggregated global information (average/max load)
    - Incomplete global information (only the neighborhood)
  - Work diffusion strategies (1980s work by Kale and others!): achieving global effects by local action

Load Balancing on Large Machines
- Existing load balancing strategies don't scale on extremely large machines
- Limitations of centralized strategies:
  - The central node becomes a memory/communication bottleneck
  - Decision-making algorithms tend to be very slow
- Limitations of distributed strategies:
  - It is difficult to achieve well-informed load balancing decisions

Simulation Study: Memory Overhead
- lb_test experiments performed with the performance simulator BigSim
- The lb_test benchmark is a parameterized program that creates a specified

Hierarchical Load Balancers
- Partition the processor allocation into processor groups
- Apply different strategies at each level
- Scalable to a large number of processors

Our Hybrid Scheme

Hybrid Load Balancing Performance

Summary
- Use profiling and performance analysis tools early: measure twice, cut once!
- Look for overloaded processors, not underloaded processors
- Use PUP for object serialization
  - It enables migration for load balancing or fault tolerance
- Don't forget to consider granularity