Distributed communication-aware load balancing with TreeMatch in Charm++




Distributed communication-aware load balancing with TreeMatch in Charm++
The 9th Scheduling for Large Scale Systems Workshop, Lyon, France
Emmanuel Jeannot, Guillaume Mercier, Francois Tessier
In collaboration with the Charm++ team from the PPL (UIUC, IL): Esteban Meneses-Rojas, Gengbin Zheng, Sanjay Kale
July 1, 2014

Introduction

Scalable execution of parallel applications:
- The number of cores keeps increasing, but the memory per core is decreasing.
- Applications will therefore need to communicate even more than they do now.

Our solution: process placement should take process affinity into account. Here, load balancing in Charm++ considers:
- the CPU load,
- the affinity between processes (or other communicating objects),
- the topology, both network and intra-node.

Charm++ Features

- Parallel object-oriented programming language based on C++.
- Programs are decomposed into a number of cooperating message-driven objects called chares; in general there are more chares than processing units.
- Chares are mapped to physical processors by an adaptive runtime system.
- Load balancers can be called to migrate chares.
- Charm++ can use MPI for inter-process communication.

A minimal chare sketch follows.
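The slides do not include code; the following is a minimal sketch of a migratable Charm++ chare array that opts in to measurement-based load balancing through AtSync(). The module, class and entry names (worker, Worker, iterate) are invented for illustration, and termination logic is omitted.

```cpp
// worker.ci (interface file, shown here as a comment):
//   mainmodule worker {
//     mainchare Main { entry Main(CkArgMsg*); };
//     array [1D] Worker {
//       entry Worker();
//       entry void iterate(int step);
//     };
//   };

#include "worker.decl.h"

class Main : public CBase_Main {
 public:
  Main(CkArgMsg* msg) {
    delete msg;
    // More chares than processing elements gives the runtime room to balance.
    CProxy_Worker workers = CProxy_Worker::ckNew(8 * CkNumPes());
    workers.iterate(0);
  }
};

class Worker : public CBase_Worker {
  int step = 0;
 public:
  Worker() { usesAtSync = true; }            // enable AtSync-based load balancing
  Worker(CkMigrateMessage*) {}
  void pup(PUP::er& p) { CBase_Worker::pup(p); p | step; }  // state for migration

  void iterate(int s) {
    step = s;
    // ... compute and exchange messages with other chares ...
    if (step > 0 && step % 10 == 0)
      AtSync();                              // hand control to the load balancer
    else
      thisProxy[thisIndex].iterate(step + 1);
  }
  void ResumeFromSync() {                    // resumed after possible migration
    thisProxy[thisIndex].iterate(step + 1);
  }
};

#include "worker.def.h"
```

At launch, a balancer is chosen on the command line (for example +balancer GreedyCommLB, with the common balancers linked in via -module CommonLBs); the TreeMatch-based balancer presented in this talk is selected the same way.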

Process Placement

Why we should consider it:
- Many current and future parallel platforms have several levels of hierarchy.
- Application chares/processes do not exchange the same amount of data with each other (affinity).
- The process placement policy can therefore have an impact on performance: cache hierarchy, memory bus, high-performance network...

[Figure: hardware hierarchy, from switch to cabinets, nodes, processors and cores]

Problem

Given:
- the parallel machine topology,
- the application communication pattern,
map the application processes onto the physical resources (cores) so as to reduce the communication cost. This mapping problem is NP-complete. One common way to formalize the cost is sketched below.

[Figure: communication matrix of the application (zeus16.map), sender rank vs. receiver rank for 16 processes]
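The slides do not spell out the objective function; a common formalization (an assumption here, in the spirit of hop-bytes metrics) weights each exchanged volume by the distance between the cores the two processes are mapped to.

```cpp
#include <cstddef>
#include <vector>

// comm[i][j]: amount of data exchanged between processes i and j.
// dist[p][q]: distance (e.g. number of hops) between cores p and q.
// sigma[i]:   core chosen for process i.
double mappingCost(const std::vector<std::vector<double>>& comm,
                   const std::vector<std::vector<double>>& dist,
                   const std::vector<int>& sigma) {
  double cost = 0.0;
  for (std::size_t i = 0; i < comm.size(); ++i)
    for (std::size_t j = 0; j < comm.size(); ++j)
      cost += comm[i][j] * dist[sigma[i]][sigma[j]];
  return cost;  // finding a sigma minimizing this is NP-complete in general
}
```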

The TreeMatch Algorithm

Algorithm and environment to compute a process placement based on the process affinities and the NUMA topology.

Input:
- The communication pattern of the application, obtained either from a preliminary execution with a monitored MPI implementation (static placement) or dynamically on iterative applications with Charm++.
- A model (tree) of the underlying architecture: hwloc can provide this (see the sketch below).

Output:
- A process permutation sigma such that sigma(i) is the core number on which process i has to be bound.

TreeMatch can only work on tree topologies. How do we deal with a 3D torus?
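As a hedged illustration of the "tree model of the architecture" input, the sketch below walks the hwloc topology of the local machine level by level; this is roughly the shape of information a tree-based placement algorithm consumes. It is not TreeMatch's actual loader.

```cpp
#include <hwloc.h>
#include <cstdio>

int main() {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);           // discover the local machine

  // Print, for each level of the tree, its arity and object type
  // (machine, NUMA node, package, cache, core, PU, ...).
  int depth = hwloc_topology_get_depth(topo);
  for (int d = 0; d < depth; ++d) {
    char type[64];
    hwloc_obj_type_snprintf(type, sizeof(type),
                            hwloc_get_obj_by_depth(topo, d, 0), 0);
    std::printf("level %d: %u x %s\n", d,
                hwloc_get_nbobjs_by_depth(topo, d), type);
  }

  hwloc_topology_destroy(topo);
  return 0;
}
```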

Network Placement: libtopomap

T. Hoefler and M. Snir, "Generic Topology Mapping Strategies for Large-Scale Parallel Architectures", Proc. Int'l Conf. on Supercomputing (ICS), pp. 75-84, 2011.

- A library that maps processes onto various network topologies.
- Used in TreeMatchLB to take the Blue Waters 3D torus into account.

[Figure: 3D torus and a Cray Gemini router]
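libtopomap's own API is not reproduced here. As a hedged illustration of the same idea using only standard MPI, an application can hand its communication graph to the library and ask for a reordered communicator; whether the reordering actually matches the physical torus depends on the MPI implementation, which is precisely why a dedicated library such as libtopomap is useful.

```cpp
#include <mpi.h>
#include <vector>

// neighbors: ranks this process communicates with (symmetric pattern assumed);
// weights:   corresponding traffic volumes. Returns a communicator in which
//            ranks may have been renumbered to better match the network.
MPI_Comm reorderFromGraph(MPI_Comm comm,
                          const std::vector<int>& neighbors,
                          const std::vector<int>& weights) {
  MPI_Comm graph_comm;
  MPI_Dist_graph_create_adjacent(
      comm,
      static_cast<int>(neighbors.size()), neighbors.data(), weights.data(),
      static_cast<int>(neighbors.size()), neighbors.data(), weights.data(),
      MPI_INFO_NULL, /*reorder=*/1, &graph_comm);
  return graph_comm;
}
```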

Load Balancing

Principle:
- For iterative applications, the load balancer is called at regular intervals.
- Chares are migrated in order to optimize several criteria.
- The Charm++ runtime system provides the chares' load, the chares' affinity, etc.

Constraints:
- dealing with complex modern architectures,
- taking the communications between elements into account.

Some other communication-aware load-balancing algorithms:
- NUCOLB, for shared-memory machines [L. L. Pilla et al., 2012]
- HwTopoLB [L. L. Pilla et al., 2012]
- some built-in Charm++ load balancers: RefineCommLB, GreedyCommLB...

Several Issues Raised

Not so easy... several issues are raised:
- Scalability of TreeMatch itself.
- How to deal with the core numbering exposed to the user, which differs across machines (see the hwloc sketch below), e.g.:
  Intel Xeon 5550: 0,2,4,6,1,3,5,7
  Intel Xeon 5550: 0,1,2,3,4,5,6,7 (!!)
  AMD Interlagos: 0,1,2,3,4,5,6,7,...,30,31
- The need to find a relevant compromise between process affinities and load balance.
- What about the time spent in the load balancer itself?

The next slides present our load balancer, which relies on TreeMatch and libtopomap and performs a parallel, distributed, communication-aware load balancing.
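The numbering discrepancy above can be handled portably: hwloc distinguishes logical indexes (stable, topology-ordered) from OS indexes (whatever the BIOS/kernel exposes), so a placement computed on logical core numbers can be translated per machine instead of hard-coding numbering tables. A small hedged sketch:

```cpp
#include <hwloc.h>
#include <cstdio>

int main() {
  hwloc_topology_t topo;
  hwloc_topology_init(&topo);
  hwloc_topology_load(topo);

  // Map hwloc's logical core order to the OS numbering of this machine.
  int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
  for (int i = 0; i < ncores; ++i) {
    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
    std::printf("logical core %d -> OS core %u\n", i, core->os_index);
  }

  hwloc_topology_destroy(topo);
  return 0;
}
```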

Strategy for Charm++ - Network Placement

First step: minimize the communication cost on the network.
- libtopomap reorders the processes of a communicator. How can it be used to reorder groups of processes (or chares)? Example: groups of chares on nodes.
- Charm++ uses MPI, so we have full access to the MPI API: a new communicator with one representative per node is built with MPI_Comm_split (a sketch follows this slide).

[Figure: network (3D torus, tree, ...) connecting nodes 0-3, with the new communicator spanning the nodes]
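The slide only states that MPI_Comm_split is used to build the new communicator. A hedged sketch of one way to do it: detect which ranks share a node with MPI_Comm_split_type, then gather the per-node leaders into the communicator that the network-mapping step will reorder.

```cpp
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  // Ranks running on the same node end up in the same communicator.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                      MPI_INFO_NULL, &node_comm);

  int node_rank;
  MPI_Comm_rank(node_comm, &node_rank);

  // One communicator containing only the node leaders (rank 0 of each node):
  // this is the communicator handed to the network-level reordering.
  MPI_Comm leaders_comm;
  MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                 world_rank, &leaders_comm);

  if (leaders_comm != MPI_COMM_NULL) MPI_Comm_free(&leaders_comm);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```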

Strategy for Charm++ - Intra-node Placement

TreeMatch load balancer:
- 1st step: remap groups of chares onto nodes according to the communication on the network, using libtopomap (example: part of a 3D torus).
- 2nd step: reorder the chares inside each node (distributed):
  - apply TreeMatch to the NUMA topology and the chares' communication pattern;
  - bind chares according to their load (leveling onto the least loaded chares);
  - each node carries out its own placement in parallel (see the binding sketch after this slide).

[Figures, animation steps: part of a 3D torus allocated by the resource manager; groups of chares assigned to nodes according to the network and CPU load; chares reordered and assigned to cores inside each node]
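The talk does not show the binding code; as a hedged illustration of what the intra-node step ultimately resolves to, the helper below pins the calling thread to a chosen core with hwloc. In the real load balancer the chares are reassigned to PEs by the Charm++ runtime rather than bound directly, and this helper is a hypothetical name, not part of Charm++.

```cpp
#include <hwloc.h>

// Bind the calling thread to logical core `core_idx` of the local node.
// Returns 0 on success, -1 on failure.
int bindToCore(hwloc_topology_t topo, unsigned core_idx) {
  hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, core_idx);
  if (core == nullptr) return -1;
  return hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_THREAD);
}
```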

Results: commbench

Benchmark designed to simulate irregular communications. Experiments on 16 nodes with 32 cores each (AMD Interlagos 6276) on the Blue Waters cluster; 1 MB messages, 100 iterations, 2 distant receivers for each chare.

[Figure: average time of one iteration (ms) for commbench on 512 cores, 8192 elements, 1 MB messages, comparing DummyLB, RefineCommLB and TreeMatchLB]

Results: commbench

1 MB messages, 100 iterations, 2 distant receivers for each chare. TreeMatch is applied to the chares' communication matrix.

[Figure: chares communication matrices of CommBench on 1 PlaFRIM node, before and after TreeMatch reordering; sigma = (0, 8, 4, 5, 12, 1, 9, 6, 14, 2, 3, 13, 7, 10, 11, 15)]

Results: kneighbor

Benchmark application designed to simulate regular, intensive communication between processes. Experiments on 8 nodes with 8 cores each (Intel Xeon 5550) on the PlaFRIM cluster.

Mainly compared to RefineCommLB, which takes load and communication into account and minimizes migrations.

[Figure: execution time (s) of kneighbor on 64 cores with 128 and 256 elements, 1 MB messages, comparing DummyLB, GreedyCommLB, GreedyLB, RefineCommLB and TMLB_TreeBased]

Results: kneighbor

Experiments on 16 nodes with 8 cores each (Intel Xeon 5550) on the PlaFRIM cluster; 1 MB messages, 100 iterations, 7-neighbor pattern.

[Figure: average time per 7-kNeighbor iteration (ms) versus the number of chares per core (1 to 16), for DummyLB with 0-7 numbering, DummyLB with 0,2,4,6,1,3,5,7 numbering, and TreeMatchLB]

Results: what about the load-balancing time?

Comparison between the sequential and the distributed versions of TreeMatchLB: the master node distributes the data to the nodes, and each node then computes the placement of its own chares. This data distribution can itself be done in parallel (around 20% improvement).

[Figure: time breakdown (initialization, sequential TreeMatch, parallel TreeMatch) of the sequential and distributed versions for 4096, 8192 and 16384 chares]

[Figure: timeline of one distributed load-balancing step (4096 chares, parallel master): init, distribute, calculate, return and process results across processors 0-7, between t = 165.6 s and t = 166.2 s]

Results: what about the load-balancing time?

- The load-balancing time grows roughly linearly as the number of chares is doubled.
- TreeMatchLB is slower than the other, greedy strategies.
- RefineCommLB, which gives good results for communication-bound applications, is not scalable (it fails from 8192 chares onwards).

[Figure: load-balancing time (ms, log scale) of GreedyCommLB, GreedyLB, RefineCommLB and TreeMatchLB, running on 128 cores, versus the number of chares (128 to 8192) for the kNeighbor application]

Future Work and Conclusion

Conclusion:
- The topology is not flat and process affinities are not homogeneous.
- Taking this information into account when mapping chares gives us improvements.
- The algorithm is adapted to large problems (distributed).
- Published at IEEE Cluster 2013.

Future work:
- Find a better way to gather the topology (hwloc?).
- Improve the network part (BG/Q routing?).
- Perform more large-scale experiments.
- Evaluate our solution on other applications (CFD?).

The End

Thanks for your attention! Any questions?