Asynchronous Dynamic Load Balancing (ADLB)

A high-level, non-general-purpose, but easy-to-use programming model and portable library for task parallelism

Rusty Lusk
Mathematics and Computer Science Division
Argonne National Laboratory
Two Classes of Parallel Programming Models

! Data parallelism
  - Parallelism arises from the fact that physics is largely local
  - Same operations carried out on different data representing different patches of space
  - Communication usually necessary between patches (local); global (collective) communication sometimes also needed
  - Load balancing sometimes needed
! Task parallelism
  - Work to be done consists of largely independent tasks, perhaps not all of the same type
  - Little or no communication between tasks
  - Traditionally needs a separate master task for scheduling
  - Load balancing fundamental
Load Balancing

! Definition: the assignment (scheduling) of tasks (code + data) to processes so as to minimize the total idle time of processes
! Static load balancing
  - all tasks are known in advance and pre-assigned to processes
  - works well if all tasks take the same amount of time
  - requires no coordination process
! Dynamic load balancing
  - tasks are assigned to a worker process by the master process when the worker becomes available by completing its previous task
  - requires communication between manager and worker processes
  - tasks may create additional tasks
  - tasks may be quite different from one another
Generic Manager/Worker Algorithm

[Diagram: a manager process holding a shared work queue, serving multiple worker processes]

! Easily implemented in MPI
! Solves some problems
  - implements dynamic load balancing
  - termination
  - dynamic task creation
  - can implement workflow structure of tasks
! Provides some scalability
! Problems
  - Manager can become a communication bottleneck (granularity dependent)
  - Memory can become a bottleneck (depends on task description size)
The ADLB Idea

! No explicit master for load balancing; workers make calls to the ADLB library; those subroutines access local and remote data structures (remote ones via MPI).
! A simple Put/Get interface from application code to the distributed work queue hides the MPI calls
! Potential for proactive load balancing in the background
The ADLB Model (no master)

[Diagram: five workers sharing a distributed work queue; the MPI complexity is hidden inside the queue implementation]

! Doesn't really change the algorithms in the workers
! Not a new idea (e.g., Linda)
! But needs a scalable, portable, distributed implementation of the shared work queue
API for a Simple Programming Model

! Basic calls
  - ADLB_Init( num_servers, am_server, app_comm )
  - ADLB_Server()
  - ADLB_Put( type, priority, len, buf, target_rank, answer_dest )
  - ADLB_Reserve( req_types, handle, len, type, prio, answer_dest )
  - ADLB_Ireserve( )
  - ADLB_Get_Reserved( handle, buffer )
  - ADLB_Set_Done()
  - ADLB_Finalize()
! A few others, for optimizing and debugging
  - ADLB_{Begin,End}_Batch_Put()
  - Getting performance statistics with ADLB_Get_info(key)
API Notes

! Types, answer_rank, and target_rank can be used to implement some common patterns
  - Sending a message
  - Decomposing a task into subtasks
  - Maybe should be built into the API
! Return codes (defined constants)
  - ADLB_SUCCESS
  - ADLB_NO_MORE_WORK
  - ADLB_DONE_BY_EXHAUSTION
  - ADLB_NO_CURRENT_WORK (for ADLB_Ireserve)
! Batch puts are for inserting work units that share a large proportion of their data
More API Notes

! If some parameters are allowed to default, this becomes a simple, high-level, work-stealing API; examples follow
! The fancier parameters on Puts and Reserve-Gets enable variations from which more elaborate patterns can be constructed
! This allows ADLB to be used as a low-level execution engine for higher-level models
  - ADLB is being used as the execution engine for the Swift workflow management language
How It Works (production implementation)

[Diagram: application processes making put/get calls to a set of ADLB servers]
The ADLB Server Logic

! Main loop:
  - MPI_Iprobe for a message in a busy loop
  - MPI_Recv the message
  - Process according to type
  - Update the status vector of work stored on remote servers
  - Manage the work queue and request queue (may involve posting MPI_Isends to the isend queue)
  - MPI_Test all requests in the isend queue
  - Return to top of loop
! The status vector replaces a single master or shared memory
  - Circulates among servers at high priority
ADLB Uses Multiple MPI Features

! ADLB_Init returns a separate application communicator, so application processes can communicate with one another using MPI as well as via ADLB features.
! Servers sit in an MPI_Iprobe loop for responsiveness.
! MPI_Datatypes for some complex, structured messages (status)
! Servers use nonblocking sends and receives and maintain a queue of active MPI_Request objects.
! The queue is traversed and each request kicked with MPI_Test each time through the loop; could use MPI_Testany. No MPI_Wait.
! The client side uses MPI_Ssend to implement ADLB_Put, in order to conserve memory on the servers, and MPI_Send for other actions.
! Servers respond to requests with MPI_Rsend, since the matching MPI_Irecvs are known to be posted by clients before the requests.
! MPI provides portability: laptop, Linux cluster, BG/Q
! The MPI profiling library is used to understand application/ADLB behavior.
Typical Code Pattern

    rc = MPI_Init( &argc, &argv );
    aprintf_flag = 0;        /* no output from adlb itself */
    num_servers = 1;         /* one server might be enough */
    use_debug_server = 0;    /* default: no debug server */
    rc = ADLB_Init( num_servers, use_debug_server, aprintf_flag,
                    num_t, type_vec, &am_server, &am_debug_server,
                    &app_comm );
    if ( am_server ) {
        ADLB_Server( 3000000, 0.0 );   /* mem limit, no logging */
    } else {
        /* application process: code using ADLB_Put, ADLB_Reserve,
           ADLB_Get_Reserved, etc. */
    }
    ADLB_Finalize();
    MPI_Finalize();
Some Example Applications

! Fun: Sudoku solver
! Simple but useful: physics application parameter sweep
! World's simplest batch scheduler for clusters
! Serious: GFMC, a complex Monte Carlo physics application
A Tutorial Example: Sudoku

[Figure: a partially filled 9x9 Sudoku board]

! (The following algorithm is not a good way to solve this, but it fits on one slide.)
Parallel Sudoku Solver with ADLB

[Figure: the partially filled board again]

! Work unit = partially completed board

Program:
    if (rank == 0)
        ADLB_Put initial board
    ADLB_Get board (Reserve+Get)
    while success (else done)
        ooh
        find first blank square
        if failure (problem solved!)
            print solution
            ADLB_Set_Done
        else
            for each valid value
                set blank square to value
                ADLB_Put new board
        ADLB_Get board
    end while
How it Works

[Figure: a worker Gets a board from the pool of work units, fills the first blank square with each valid value (4, 6, and 8 in this example), and Puts the three resulting boards back into the pool]

! After the initial Put, all processes execute the same loop (no master)
Optimizing Within the ADLB Framework

! Can embed smarter strategies in this algorithm
  - ooh = optional optimization here, to fill in more squares
  - Even so, potentially a lot of work units for ADLB to manage
! Can use priorities to address this problem
  - On ADLB_Put, set priority to the number of filled squares
  - This will guide a depth-first search while ensuring that there is enough work to go around
  - How one would do it sequentially
! Exhaustion automatically detected by ADLB (e.g., proof that there is only one solution, or the case of an invalid input board)
A Physics Application: Parameter Sweep in Materials Science

! Finding materials to use in luminescent solar concentrators
  - Stationary, no moving parts
  - Operate efficiently under diffuse light conditions (northern climates)
  - Inexpensive collector, concentrating light on a high-performance solar cell
! In this case, the authors never learned any parallel programming approach other than ADLB
The "Batcher": World's Simplest Job Scheduler for Linux Clusters

! Simple (100 lines of code) but potentially useful
! Input is a file (or stream) of Unix command lines, which become the ADLB work units put into the work pool by one manager process
! ADLB worker processes execute each one with the Unix system call
! Easy to add priority considerations
Green's Function Monte Carlo: A Complex Application

! Green's Function Monte Carlo, the gold standard for ab initio calculations in nuclear physics at Argonne (Steve Pieper, PHY)
! A non-trivial manager/worker algorithm, with assorted work types and priorities; multiple processes create work dynamically; large work units
! Had scaled to 2000 processors on BG/L, then hit a scalability wall
! Needed to reach tens of thousands of processors at least, in order to carry out calculations on 12C, an explicit goal of the UNEDF SciDAC project
! The algorithm threatened to become even more complex, with more types and dependencies among work units, together with smaller work units
! Wanted to maintain the master/slave structure of the physics code
! This situation brought forth ADLB
! Achieving scalability has been a multi-step process
  - balancing processing
  - balancing memory
  - balancing communication
! Now runs on 100,000 processes
Early Experiments with GFMC/ADLB on BG/P

! Using GFMC to compute the binding energy of 14 neutrons in an artificial well ("neutron drop" = teeny-weeny neutron star)
! A weak scaling experiment

  BG/P cores | ADLB servers | Configs | Time (min.) | Efficiency (incl. servers)
  4K         | 130          | 20      | 38.1        | 93.8%
  8K         | 230          | 40      | 38.2        | 93.7%
  16K        | 455          | 80      | 39.6        | 89.8%
  32K        | 905          | 160     | 44.2        | 80.4%

! Recent work: micro-parallelization needed for 12C; OpenMP in GFMC
  - a successful example of hybrid programming, with ADLB + MPI + OpenMP
Progress with GFMC

[Figure: 12C ADLB+GFMC efficiency (compute_time/wall_time, in %) versus number of nodes (4 OpenMP cores per node), from 128 to 32,768 nodes; successive curves from Feb 2009 through 25 Feb 2010 show efficiency improving from below 70% toward 90% and above]
An Alternate Implementation of the Same API

! Motivation for the 1-sided, single-server version:
  - Eliminate multiple views of the shared queue data structure and the effort required to keep them (almost) coherent
  - Free up more processors for application calculations by eliminating most servers
  - Use the larger client memory to store work packages
! Relies on passive-target MPI-2 remote memory operations
! The single master proved to be a scalability bottleneck at 32,000 processors (8K nodes on BG/P), not because of processing capability but because of network congestion

[Diagram: ADLB_Put and ADLB_Get implemented with MPI_Put and MPI_Get]
Getting ADLB

! The web site is http://www.cs.mtsu.edu/~rbutler/adlb
  - documentation
  - download button
! What you get:
  - source code
  - configure script and Makefile
  - README, with API documentation
  - Examples
    - Sudoku
    - Batcher (with its own README)
    - Traveling Salesman Problem
! To run your application
  - Configure and make to build the ADLB library
  - Compile your application with mpicc, using the provided Makefile as an example
  - Run with mpiexec
! Problems/questions/suggestions to {lusk,rbutler}@mcs.anl.gov
Future Directions

! API design
  - Some higher-level function calls might be useful
  - The user community will generate these
! Implementations
  - The one-sided version
    - implemented a single server to coordinate matching of requests to work units
    - stores work units on client processes
    - uses MPI_Put/Get (passive target) to move work
    - hit a scalability wall for GFMC at about 8000 processes
  - The thread version
    - uses a separate thread on each client; no servers
    - the original plan
    - maybe for BG/Q, where there are more threads per node
    - not re-implemented (yet)
Conclusions

! There are benefits to limiting the generality of an approach
! Scalability need not come at the price of added complexity
! ADLB might be handy
The End