The Asynchronous Dynamic Load-Balancing Library

Similar documents

Asynchronous Dynamic Load Balancing (ADLB)

Multicore Parallel Computing with OpenMP

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp

Stream Processing on GPUs Using Distributed Multimedia Middleware

Four Keys to Successful Multicore Optimization for Machine Vision. White Paper

Part I Courses Syllabus

Software Development around a Millisecond

Remote Graphical Visualization of Large Interactive Spatial Data

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

Contributions to Gang Scheduling

Spring 2011 Prof. Hyesoon Kim

So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell

End-user Tools for Application Performance Analysis Using Hardware Counters

Operatin g Systems: Internals and Design Principle s. Chapter 10 Multiprocessor and Real-Time Scheduling Seventh Edition By William Stallings

Overlapping Data Transfer With Application Execution on Clusters

Kernel comparison of OpenSolaris, Windows Vista and. Linux 2.6

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Server Consolidation with SQL Server 2008

Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach

Parallel Computing with MATLAB

Cellular Computing on a Linux Cluster

HPC Wales Skills Academy Course Catalogue 2015

Multi-core Programming System Overview

A Flexible Cluster Infrastructure for Systems Research and Software Development

Scalability and Classifications

PMI: A Scalable Parallel Process-Management Interface for Extreme-Scale Systems

How To Install Linux Titan

Scientific Computing Programming with Parallel Objects

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003

LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig

Introduction to grid technologies, parallel and cloud computing. Alaa Osama Allam Saida Saad Mohamed Mohamed Ibrahim Gaber

Performance of the JMA NWP models on the PC cluster TSUBAME.

Equalizer. Parallel OpenGL Application Framework. Stefan Eilemann, Eyescale Software GmbH

CHAPTER 1 INTRODUCTION

INTRODUCTION ADVANTAGES OF RUNNING ORACLE 11G ON WINDOWS. Edward Whalen, Performance Tuning Corporation

BSC vision on Big Data and extreme scale computing

Multiprocessor Scheduling and Scheduling in Linux Kernel 2.6

Memory Debugging with TotalView on AIX and Linux/Power

CHAPTER 5 WLDMA: A NEW LOAD BALANCING STRATEGY FOR WAN ENVIRONMENT

Search Engine Architecture

Estimate Performance and Capacity Requirements for Workflow in SharePoint Server 2010

MPI and Hybrid Programming Models. William Gropp

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-GPU Load Balancing for Simulation and Rendering

Developing Parallel Applications with the Eclipse Parallel Tools Platform

Parallel Computing using MATLAB Distributed Compute Server ZORRO HPC

A PERFORMANCE ANALYSIS of HADOOP CLUSTERS in OPENSTACK CLOUD and in REAL SYSTEM

High Performance Computing in CST STUDIO SUITE

BLM 413E - Parallel Programming Lecture 3

Final Project Report. Trading Platform Server

FPGA area allocation for parallel C applications

David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

Interoperability between Sun Grid Engine and the Windows Compute Cluster

Petascale Software Challenges. William Gropp

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

An Introduction to Parallel Computing/ Programming

Virtual Machines.

Data-Flow Awareness in Parallel Data Processing

Basics of VTune Performance Analyzer. Intel Software College. Objectives. VTune Performance Analyzer. Agenda

Petascale Software Challenges. Piyush Chaudhary High Performance Computing

The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud.

Part V Applications. What is cloud computing? SaaS has been around for awhile. Cloud Computing: General concepts

Make the Most of Big Data to Drive Innovation Through Reseach

Adaptive Tolerance Algorithm for Distributed Top-K Monitoring with Bandwidth Constraints

ParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008

Numerix CrossAsset XL and Windows HPC Server 2008 R2

CUDA programming on NVIDIA GPUs

A General Framework for Tracking Objects in a Multi-Camera Environment

PostgreSQL Performance Characteristics on Joyent and Amazon EC2

Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data

ANALYSIS OF GRID COMPUTING AS IT APPLIES TO HIGH VOLUME DOCUMENT PROCESSING AND OCR

Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server

DELL s Oracle Database Advisor

In: Proceedings of RECPAD th Portuguese Conference on Pattern Recognition June 27th- 28th, 2002 Aveiro, Portugal

Parallel Algorithm Engineering

Mathematical Libraries on JUQUEEN. JSC Training Course

CA NSM System Monitoring Option for OpenVMS r3.2

A High Performance Computing Scheduling and Resource Management Primer

Chapter 18: Database System Architectures. Centralized Systems

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

Scheduling Task Parallelism" on Multi-Socket Multicore Systems"

Lightweight Service-Based Software Architecture

COMP5426 Parallel and Distributed Computing. Distributed Systems: Client/Server and Clusters

IBM Deep Computing Visualization Offering

OpenMP & MPI CISC 879. Tristan Vanderbruggen & John Cavazos Dept of Computer & Information Sciences University of Delaware

Eloquence Training What s new in Eloquence B.08.00

Transcription:

The Asynchronous Dynamic Load-Balancing Library Rusty Lusk, Steve Pieper, Ralph Butler, Anthony Chan Mathematics and Computer Science Division Nuclear Physics Division

Outline The Nuclear Physics problem - GFMC The CS problem: scalability, load balancing, simplicity for the application Approach: a portable, scalable library, implementing a simple programming model Status Choices made Lessons learned Promising results Plans Scaling up Branching out Using threads so far 2

The Physics Project: Green s function Monte Carlo (GFMC) The benchmark for nuclei with 12 or fewer nucleons Starts with variational wave function containing non-central correlations Uses imaginary-time propagation to filter out excited-state contamination Samples removed or multiplied during propagation -- work fluctuates Local energies evaluated every 40 steps For 12 C expect ~10,000 Monte Carlo samples to be propagated for ~1000 steps Non ADLB version: Several samples per node; work on one sample not parallelized ADLB version: Three-body potential and propagator parallelized Wave functions for kinetic energy and two-body potential done in parallel Can use many processors per sample Physics target: Properties of both ground and excited states of 12 C 3

The Computer Science Project: ADLB Specific Problem: to scale up GFMC, a master/slave code General Problem: scaling up the master/slave paradigm in general Usually based on a single (or shared) data structure Goal for GFMC: scale to 160,000 processes (available on BG/P) General goal: provide simple yet scalable programming model for algorithms parallelized via master/slave structure General goal in GFMC setting: eliminate (most) MPI calls from GFMC and scale the general approach GFMC is not an easy case: Multiple types of work Any process can create work Large work units (multi-megabyte) Priorities and types used together to specify some sequencing without constraining parallelism ( workflow ) 4

Master/Slave Algorithms and Load Balancing Master Shared Work queue Slave Slave Slave Slave Slave Advantages Automatic load balancing Disadvantages Scalability - master can become bottleneck Wrinkles Slaves may create new work Multiple work types and priorities that impose ordering 5

Load Balancing in GFMC Before ADLB Master/Slave algorithm Slaves do create work dynamically Newly created work stored locally Periodic load balancing Done by master Slaves communicate work queue lengths to master Master determines reallocation of work Master tells slaves to reallocate work Slaves communicate work to one another Computation continues with balanced work queues 6

GFMC Before ADLB 7

Zoomed in 8

BlueGene P 4 cpus per node 4 threads or processes per node 160,000 cpus Efficient MPI communication Original GFMC not expected to continue to scale past 2000 processors More parallelism needed to exploit BG/P for Carbon 12 Existing load-balancing mechanism will not scale A general-purpose load-balancing library should Enable GFMC to scale Be of use to other codes as well Simplify parallel programming of applications 9

The Vision No explicit master for load balancing; slaves make calls to ADLB library; those subroutines access local and remote data structures (remote ones via MPI). Simple Put/Get interface from application code to distributed work queue hides most MPI calls Advantage: multiple applications may benefit Wrinkle: variable-size work units, in Fortran, introduce some complexity in memory management Proactive load balancing in background Advantage: application never delayed by search for work from other slaves Wrinkle: scalable work-stealing algorithms not obvious 10

The API (Application Programming Interface) Basic calls ADLB_Init( num_servers, am_server, app_comm) ADLB_Server() ADLB_Put( type, priority, len, buf, answer_dest ) ADLB_Reserve( req_types, work_handle, work_len, work_type, work_prio, answer_dest) ADLB_Ireserve( ) ADLB_Get_Reserved( ) ADLB_Set_done() ADLB_Finalize() A few others, for tuning and debugging (still at experimental stage) 11

Asynchronous Dynamic Load Balancing - Thread Approach The basic idea: Shared Memory Put/get Work queue Application Threads ADLB Library Thread MPI Communication with other nodes 12

A A A ADLB AA - Process Approach A put/get Application Processes ADLB Servers 13

Early Version of ADLB in GFMC on BG/L 14

History and Status Year 1: learned the application; worked out first version of API; did thread version of implementation Year 2: Switched to process version, began experiments at scale On BG/L (4096), SiCortex (5800), and BG/P (16K so far) Variational Monte Carlo (part of GFMC) Full GFMC (see following performance graph) Latest: GFMC for neutron drop at large scale Some additions to the API for increased scalability, memory management Internal changes to manage memory Still working on memory management for full GMFC with fine-grain parallelism, needed for 12 C Basic API not changing 15

Comparing Speedup 16

Most Recent Runs 14-neutron drop on 16,384 processors of BG/P Speedup of 13,634 (83% efficiency) No microparallization since more configurations ADLB processes 171 million work packages of size 129KB each, total of 20.5 terabytes of data moved Heretofore uncomputed level of accuracy for the computed energy and density Also some benchmarking runs for 9 Be and 7 Li 17

Future Plans Short detour into scalable debugging tools for understanding behavior, particularly memory usage Further microparallelization of GFMC using ADLB Scaling up on BG/P Revisit the thread model for ADLB implementation, particularly in anticipation of Q and experimental compute-node Linux on P Help with multithreading the application (locally parallel execution of work units via OpenMP) Work with others in the project who use manager/worker patterns in their codes Work with others outside the project in other SciDACs 18

Summary We have designed a simple programming model for a class of applications We are working on this in the context of a specific UNEDF application, which is a challenging one We have done two implementations Performance results are promising so far Still have not completely conquered the problem Needed: tools for understanding behavior, particularly to help with application-level debugging 19

The End 20