Distributed communication-aware load balancing with TreeMatch in Charm++ The 9th Scheduling for Large Scale Systems Workshop, Lyon, France Emmanuel Jeannot Guillaume Mercier Francois Tessier In collaboration with the Charm++ Team from the PPL (UIUC, IL) : Esteban Meneses-Rojas, Gengbin Zheng, Sanjay Kale July 1, 2014 Francois Tessier TreeMatch in Charm++ 1 / 19
Introduction
Scalable execution of parallel applications
- The number of cores is increasing, but memory per core is decreasing
- Applications will need to communicate even more than they do now
Our solution
- Process placement should take process affinity into account
- Here: load balancing in Charm++ considering:
  - CPU load
  - affinity between processes (or other communicating objects)
  - topology: network and intra-node
Charm++
Features
- Parallel object-oriented programming language based on C++
- Programs are decomposed into a number of cooperating message-driven objects called chares; in general there are more chares than processing units
- Chares are mapped to physical processors by an adaptive runtime system
- Load balancers can be called to migrate chares
- Charm++ can use MPI for process communication
Processes Placement
Why we should consider it
- Many current and future parallel platforms have several levels of hierarchy
- Application chares/processes do not exchange the same amount of data (affinity)
- The process placement policy may have an impact on performance: cache hierarchy, memory bus, high-performance network...
[Diagram: topology hierarchy: switch, cabinets, nodes, processors, cores]
Problems
Given:
- the parallel machine topology
- the application communication pattern
Map application processes to physical resources (cores) so as to reduce communication costs (NP-complete).
[Figure: communication matrix (zeus16.map), sender rank vs. receiver rank]
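As a toy illustration of the objective being minimized, the cost of a placement can be sketched as the sum of communication volumes weighted by the topology distance between the cores the processes land on. The helper below is our own illustrative sketch, not part of TreeMatch:

```cpp
#include <cstddef>
#include <vector>

// Illustrative helper (not TreeMatch API): cost of a placement sigma, where
// comm[i][j] is the volume exchanged by processes i and j, and dist[p][q]
// is the topology distance between cores p and q. Mapping heavy
// communicators onto distant cores makes the cost grow.
double mapping_cost(const std::vector<std::vector<double>>& comm,
                    const std::vector<std::vector<double>>& dist,
                    const std::vector<int>& sigma) {
    double cost = 0.0;
    for (std::size_t i = 0; i < comm.size(); ++i)
        for (std::size_t j = 0; j < comm.size(); ++j)
            cost += comm[i][j] * dist[sigma[i]][sigma[j]];
    return cost;
}
```

Finding the σ minimizing this cost over all permutations is the NP-complete problem mentioned above, hence heuristics such as TreeMatch.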
TreeMatch
The TreeMatch Algorithm
Algorithm and environment to compute process placement based on process affinities and the NUMA topology.
Input:
- the communication pattern of the application
  - preliminary execution with a monitored MPI implementation for static placement
  - dynamic recovery on iterative applications with Charm++
- a model (tree) of the underlying architecture: hwloc can provide this
Output:
- a process permutation σ such that σ(i) is the core number on which process i has to be bound
TreeMatch can only work on tree topologies. How to deal with a 3d torus?
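The core idea behind TreeMatch's bottom-up traversal of the topology tree can be conveyed by a much simpler greedy sketch: at each level, gather the objects that communicate the most under the same parent (core, socket, node). The function below is our illustrative stand-in for one level with arity 2, not the actual TreeMatch algorithm:

```cpp
#include <utility>
#include <vector>

// Sketch of one grouping level (arity 2): repeatedly pick the pair of
// still-unused objects with the heaviest traffic and put them in the same
// group, so heavy communication stays under the same parent in the tree.
std::vector<std::pair<int, int>> group_by_affinity(
        const std::vector<std::vector<double>>& comm) {
    const int n = static_cast<int>(comm.size());
    std::vector<bool> used(n, false);
    std::vector<std::pair<int, int>> groups;
    for (int g = 0; g < n / 2; ++g) {
        int bi = -1, bj = -1;
        double best = -1.0;
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                if (!used[i] && !used[j] && comm[i][j] > best) {
                    best = comm[i][j];
                    bi = i;
                    bj = j;
                }
        used[bi] = used[bj] = true;
        groups.push_back({bi, bj});
    }
    return groups;
}
```

Applying this recursively, level by level, from the leaves up to the root of the hwloc tree is the intuition behind the permutation σ that TreeMatch outputs.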
Network placement
libtopomap
T. Hoefler and M. Snir, "Generic Topology Mapping Strategies for Large-Scale Parallel Architectures", Proc. Int'l Conf. on Supercomputing (ICS), pp. 75-84, 2011.
- Library that enables mapping processes onto various network topologies
- Used in TreeMatchLB to take the Blue Waters 3d torus into account
Figure: 3d Torus and a Cray Gemini router
Load balancing
Principle
- For iterative applications, the load balancer is called at regular intervals
- Migrates chares in order to optimize several criteria
- The Charm++ runtime system provides: chare load, chare affinity, etc.
Constraints
- Dealing with complex modern architectures
- Taking into account communications between elements
Some other communication-aware load-balancing algorithms
- [L. L. Pilla, et al. 2012] NUCOLB, shared memory machines
- [L. L. Pilla, et al. 2012] HwTopoLB
- Some "built-in" Charm++ load balancers: RefineCommLB, GreedyCommLB...
Several issues raised
Not so easy...
- Scalability of TreeMatch
- How to deal with process mapping (user, core numbering)
  - Intel Xeon 5550: 0,2,4,6,1,3,5,7
  - Intel Xeon 5550: 0,1,2,3,4,5,6,7 (!!)
  - AMD Interlagos: 0,1,2,3,4,5,6,7,...,30,31
- Need to find a relevant compromise between process affinities and load balancing
- What about load balancing time?
The next slides present our load balancer, relying on TreeMatch and libtopomap, which performs a parallel and distributed communication-aware load balancing.
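The core-numbering issue above boils down to one translation step: the permutation is computed over logical core numbers, but binding must use the machine's physical numbering, which varies between CPUs (0,2,4,6,1,3,5,7 on one Xeon 5550 setup vs. sequential on Interlagos). A minimal sketch, with helper names of our own invention:

```cpp
#include <vector>

// Illustration of the core-numbering pitfall (helper is ours, not a
// Charm++/TreeMatch API): translate a placement over logical cores into
// physical OS core ids before binding. numbering[k] is the OS id of
// logical core k on this machine.
std::vector<int> to_physical(const std::vector<int>& sigma,
                             const std::vector<int>& numbering) {
    std::vector<int> phys(sigma.size());
    for (std::size_t i = 0; i < sigma.size(); ++i)
        phys[i] = numbering[sigma[i]];
    return phys;
}
```

With the Xeon 5550 numbering from the slide, the identity placement of four processes lands on OS cores 0, 2, 4, 6: the four "even" ids are the first socket, so ignoring the numbering would silently split neighbors across sockets.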
Strategy for Charm++ - Network Placement
First step: minimize communication cost on the network
- libtopomap reorders the processes of a communicator
- How to use it to reorder groups of processes (or chares)? Example: groups of chares on nodes
- Charm++ uses MPI: full access to the MPI API
- New MPI communicator with MPI_Comm_split
[Diagram: nodes 0-3 on the network (3d torus, tree, ...) gathered into a new communicator]
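A minimal sketch of how such a per-node communicator could be built, assuming ranks are packed by node (the helper name and the packing assumption are ours; the slides only state that MPI_Comm_split is used):

```cpp
// Sketch only: one color per node, so that processes of the same node end
// up in the same sub-communicator. In the MPI code this color would feed
//   MPI_Comm_split(MPI_COMM_WORLD, node_color(rank, cpn), rank, &node_comm);
// after which one representative per node can be reordered on the network
// by libtopomap. Assumes ranks 0..cpn-1 are on node 0, cpn..2*cpn-1 on
// node 1, and so on.
int node_color(int rank, int cores_per_node) {
    return rank / cores_per_node;
}
```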
Strategy for Charm++ - Intra-node placement
TreeMatch load balancer
- 1st step: remap groups of chares onto nodes according to the communication on the network, with libtopomap (example: part of a 3d torus)
- 2nd step: reorder chares inside each node (distributed)
  - apply TreeMatch on the NUMA topology and the chares communication pattern
  - bind chares according to their load (leveling on the less loaded chares)
  - each node carries out its own placement in parallel
[Diagram: CPU load and groups of chares assigned to nodes on the network (3d torus, hierarchical, ...)]
Figure: Part of a 3d Torus attributed by the resource manager
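The load-leveling part of the second step can be conveyed by a plain greedy assignment: heaviest chare first, each to the currently least loaded core. This is our illustrative stand-in, not the exact TreeMatchLB heuristic:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Sketch of intra-node load leveling (greedy LPT, not TreeMatchLB's actual
// heuristic): sort chares by decreasing load, then assign each one to the
// core with the smallest accumulated load so far. Returns, for each chare,
// the index of the core it is bound to.
std::vector<int> level_loads(const std::vector<double>& chare_load,
                             int ncores) {
    std::vector<int> order(chare_load.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return chare_load[a] > chare_load[b]; });
    std::vector<double> core_load(ncores, 0.0);
    std::vector<int> placement(chare_load.size());
    for (int c : order) {
        int best = static_cast<int>(
            std::min_element(core_load.begin(), core_load.end()) -
            core_load.begin());
        placement[c] = best;
        core_load[best] += chare_load[c];
    }
    return placement;
}
```

In the real load balancer this leveling is constrained by the affinity groups TreeMatch computed, and each node runs it independently on its own chares, which is what makes the strategy distributed.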
Results
commbench
- Benchmark designed to simulate irregular communications
- Experiments on 16 nodes with 32 cores each (AMD Interlagos 6276) - Blue Waters cluster
- 1 MB messages - 100 iterations - 2 distant receivers for each chare
[Figure: average time of one iteration (in ms) of commbench on 512 cores, 8192 elements, 1 MB message size, for DummyLB, RefineCommLB and TreeMatchLB]
Results
commbench
- 1 MB messages - 100 iterations - 2 distant receivers for each chare
- TreeMatch applied on a chares communication matrix
[Figure: chares communication matrix of CommBench on one PlaFRIM node, before and after TreeMatch reordering]
Figure: σ(i) = 0, 8, 4, 5, 12, 1, 9, 6, 14, 2, 3, 13, 7, 10, 11, 15
Results
kneighbor
- Benchmark application designed to simulate regular, intensive communication between processes
- Experiments on 8 nodes with 8 cores each (Intel Xeon 5550) - PlaFRIM cluster
- Particularly compared to RefineCommLB, which takes both load and communication into account and minimizes migrations
[Figure: execution time (in seconds) of kneighbor on 64 cores, with 128 and 256 elements and 1 MB message size, for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased]
Results
kneighbor
- Experiments on 16 nodes with 8 cores each (Intel Xeon 5550) - PlaFRIM cluster
- 1 MB messages - 100 iterations - 7-Neighbor
[Figure: average time of each 7-kNeighbor iteration (in ms) versus number of chares per core (1 to 16), for DummyLB with cores 0-7, DummyLB with cores 0,2,4,6,1,3,5,7, and TreeMatchLB]
Results
What about the load balancing time?
- Comparison between the sequential and the distributed versions of TreeMatchLB
- The master node distributes the data to each node, which then computes its own chares placement
- This data distribution can be done in parallel (around 20% of improvement)
[Figure: time repartition (in seconds) for each step of the load balancing process (initialization, sequential TreeMatch, parallel TreeMatch), sequential vs. distributed versions, for 4096, 8192 and 16384 chares]
[Figure: timeline of the distributed load balancing of 4096 chares on 8 nodes: initialization, data distribution, placement calculation, results returned to and processed by the master]
Results
What about the load balancing time?
- Execution time grows linearly as the number of chares is doubled
- TreeMatchLB is slower than the other, greedy strategies
- RefineCommLB, which provides good results for communication-bound applications, is not scalable (fails from 8192 chares)
Figure: Load balancing time of the different strategies (running on 128 cores) vs. number of chares for the KNeighbor application.
Future work and Conclusion
Conclusion
- Topology is not flat! Process affinities are not homogeneous
- Taking this information into account to map chares gives us improvements
- Algorithm adapted to large problems (distributed)
- Published at IEEE Cluster 2013
Future work
- Find a better way to gather the topology (hwloc?)
- Improve the network part (BG/Q routing?)
- Perform more large-scale experiments
- Evaluate our solution on other applications (CFD?)
The End
Thanks for your attention! Any questions?