LOAD BALANCING
Distributed Operating Systems, Scalability, SS 2015, Hermann Härtig
ISSUES
- starting points: independent Unix processes and block-synchronous execution
- who does it
- load migration mechanism (MosiX)
- management algorithms (MosiX): information dissemination, decision algorithms
EXTREME STARTING POINTS
- independent OS processes
- block-synchronous execution (HPC): a sequence of compute - communicate; all processes wait for all other processes
- often: message passing, for example via the Message Passing Interface (MPI)
SINGLE PROGRAM MULTIPLE DATA
- all processes execute the same program: while (true) { work; exchange data (barrier); }
- common in High Performance Computing: Message Passing Interface (MPI) library
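A minimal C/MPI sketch of this compute/exchange/barrier loop; compute_step(), spmd_loop(), and the ring-neighbour exchange are invented for illustration and are not part of the slides:

#include <mpi.h>

static void compute_step(double *data, int n)   /* hypothetical local work */
{
    for (int i = 0; i < n; ++i)
        data[i] += 1.0;
}

void spmd_loop(double *data, int n, int steps)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int s = 0; s < steps; ++s) {
        compute_step(data, n);                        /* work */
        int right = (rank + 1) % size;                /* exchange data: ring */
        int left  = (rank + size - 1) % size;
        MPI_Sendrecv(&data[n - 1], 1, MPI_DOUBLE, right, 0,
                     &data[0],     1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Barrier(MPI_COMM_WORLD);                  /* all wait for all */
    }
}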
DIVIDE AND CONQUER
[diagram: a problem is divided into parts 1-4, processed on CPU #1 and CPU #2 of node 1 and node 2; the partial results 1-4 are combined into the final result]
MPI BRIEF OVERVIEW
- library for message-oriented parallel programming
- programming model: multiple instances of the same program; independent calculation; communication and synchronization
MPI STARTUP & TEARDOWN
- the MPI program is started on all processors: MPI_Init(), MPI_Finalize()
- communicators (e.g., MPI_COMM_WORLD): MPI_Comm_size(); MPI_Comm_rank() gives the rank of the process within this set
- typed messages
- dynamically create and spread processes using MPI_Comm_spawn() (since MPI-2)
MPI EXECUTION
Communication
- point-to-point:
  MPI_Send(void *buffer, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm)
  MPI_Recv(void *buffer, int count, MPI_Datatype type, int source, int tag, MPI_Comm comm, MPI_Status *status)
- collectives:
  MPI_Bcast(void *buffer, int count, MPI_Datatype type, int root, MPI_Comm comm)
  MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype type, MPI_Op op, int root, MPI_Comm comm)
Synchronization
- barrier: MPI_Barrier(MPI_Comm comm)
- test: MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
- wait: MPI_Wait(MPI_Request *request, MPI_Status *status)
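A small, self-contained example (not from the slides) of how these calls combine: the root broadcasts the problem size, every rank sums its own slice, and MPI_Reduce collects the total at the root.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        n = 1000;                                /* chosen by the root */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    long local = 0, total = 0;                   /* each rank sums its slice of 1..n */
    for (int i = rank + 1; i <= n; i += size)
        local += i;

    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum(1..%d) = %ld\n", n, total);

    MPI_Finalize();
    return 0;
}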
BLOCK AND SYNC
- synchronous communication: a blocking call returns when the message has been delivered; a non-blocking call returns immediately, a following test/wait checks for delivery
- asynchronous communication: a blocking call returns when the send buffer can be reused; a non-blocking call returns immediately, a following test/wait checks the send buffer
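A sketch of the non-blocking variant (exchange() and its parameters are invented for illustration): the calls return immediately, independent computation can overlap, and MPI_Waitall plays the role of the test/wait step.

#include <mpi.h>

void exchange(int peer, double *out, double *in)
{
    /* non-blocking: both calls return immediately; neither buffer may be
       touched until MPI_Wait/MPI_Test reports completion */
    MPI_Request reqs[2];
    MPI_Irecv(in,  1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(out, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... overlap independent computation here ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* the test/wait step */
}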
EXAMPLE
int rank, total;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &total);
MPI_Bcast(...);
/* work on own part, determined by rank */
if (rank == 0) {
    for (int rr = 1; rr < total; ++rr)
        MPI_Recv(...);
    /* generate final result */
} else {
    MPI_Send(...);
}
MPI_Finalize();
AMDAHL'S LAW
interpretation for parallel systems:
- P: fraction of the work that can be parallelized
- 1-P: serial fraction
- N: number of CPUs
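The formula itself is not reproduced here; in these terms Amdahl's law reads:

S(N) = \frac{1}{(1-P) + \frac{P}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1-P}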
AMDAHL'S LAW
serial section: communication, the longest sequential section
parallel time, serial time, possible speedup:
- 1 ms, 100 µs: 1/0.1 ≈ 10
- 1 ms, 1 µs: 1/0.001 ≈ 1000
- 10 µs, 1 µs: 0.01/0.001 ≈ 10 ...
WEAK VS. STRONG SCALING
- strong scaling: accelerate the same problem size
- weak scaling: extend to a larger problem size
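One common way to make the distinction precise (standard definitions, not from the slide; T(N, W) is the run time on N CPUs for problem size W): strong scaling keeps W fixed and measures speedup, weak scaling grows W with N and measures efficiency.

S_{\text{strong}}(N) = \frac{T(1, W)}{T(N, W)}
\qquad
E_{\text{weak}}(N) = \frac{T(1, W)}{T(N, N \cdot W)}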
WHO DOES IT
- application
- run-time library
- operating system
SCHEDULER: GLOBAL RUN QUEUE
immediate approach: a single global run queue of all ready processes, served by CPU 1, CPU 2, ..., CPU N
SCHEDULER: GLOBAL RUN QUEUE
does not scale:
- shared memory only
- contended critical section
- loss of cache affinity
=> separate run queues with explicit movement of processes
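A sketch of per-CPU run queues with explicit movement, under simplified, invented types (struct task, NR_CPUS, pull_one() are illustrative, not actual scheduler code):

#include <pthread.h>

#define NR_CPUS 8                  /* illustrative constant */

struct task {
    struct task *next;
    /* registers, address space, ... */
};

struct runqueue {                  /* one queue per CPU instead of one global queue */
    pthread_spinlock_t lock;       /* short, per-CPU critical section */
    struct task *head;
    int length;                    /* simple load metric for balancing decisions */
};

struct runqueue rq[NR_CPUS];       /* locks initialized elsewhere with pthread_spin_init() */

/* explicit movement of load: pull one task from 'src' into 'dst' */
struct task *pull_one(struct runqueue *dst, struct runqueue *src)
{
    pthread_spin_lock(&src->lock);
    struct task *t = src->head;
    if (t) {
        src->head = t->next;
        src->length--;
    }
    pthread_spin_unlock(&src->lock);

    if (t) {
        pthread_spin_lock(&dst->lock);
        t->next = dst->head;
        dst->head = t;
        dst->length++;
        pthread_spin_unlock(&dst->lock);
    }
    return t;
}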
OS/HW & APPLICATION
operating system / hardware decides which CPUs participate (active / inactive):
- partitioning (HW)
- gang scheduling (OS)
within a gang/partition: applications balance!!!
HW PARTITIONS & ENTRY QUEUE
[diagram: applications wait in a request queue until a hardware partition becomes available]
HW PARTITIONS
- optimizes usage of the network
- takes the OS off the critical path
- best for strong scaling
- but: burdens the application/library with balancing
- but: potentially wastes resources
- current state of the art in High Performance Computing (HPC)
TOWARDS OS/RT BALANCING
[diagram: work items per CPU over time, all synchronizing at a barrier]
EXECUTION TIME JITTER
[diagram: work items of varying length; all processes wait at the barrier for the slowest one]
SOURCES OF EXECUTION JITTER
- hardware???
- application
- operating system noise
OPERATING SYSTEM NOISE
Use common sense to avoid it: the OS is usually not directly on the critical path, BUT it controls interference via interrupts, caches, network, memory bus (RTS techniques):
- avoid or encapsulate side activities
- keep critical sections small (if any)
- partition networks to isolate the traffic of different applications (HW: Blue Gene)
- do not run Python scripts or printer daemons in parallel
SPLITTING BIG JOBS
[diagram: big work items are split into smaller ones ahead of the barrier]
overdecomposition & oversubscription
SMALL JOBS (NO DEPS)
[diagram: small independent work items ahead of the barrier]
execute small jobs in parallel (if possible)
BALANCING AT LIBRARY LEVEL
programming model: many (small) decoupled work items
- overdecompose: create more work items than active units
- run some balancing algorithm
example: CHARM++
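To illustrate the idea only (this is not CHARM++ code; UNITS, ITEMS, and the cost values are made up): overdecompose into many work items, then let a simple greedy balancer assign each item to the currently least-loaded unit.

#include <stdio.h>

#define UNITS 4            /* active execution units */
#define ITEMS 32           /* overdecomposition: many more work items than units */

int main(void)
{
    double cost[ITEMS];    /* estimated cost per work item (made-up, uneven) */
    double load[UNITS] = { 0 };

    for (int i = 0; i < ITEMS; ++i)
        cost[i] = 1.0 + (i % 5);

    /* greedy balancing: put each item on the least-loaded unit */
    for (int i = 0; i < ITEMS; ++i) {
        int best = 0;
        for (int u = 1; u < UNITS; ++u)
            if (load[u] < load[best])
                best = u;
        load[best] += cost[i];
    }

    for (int u = 0; u < UNITS; ++u)
        printf("unit %d: load %.1f\n", u, load[u]);
    return 0;
}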
BALANCING AT SYSTEM LEVEL
- create many more processes
- use OS information on run time and system state to balance the load
examples: create more MPI processes than nodes; run multiple applications
CAVEATS
- added overhead: additional communication between smaller work items (memory & cycles), more context switches
- OS on the critical path (for example for communication)
BALANCING ALGORITHMS
required:
- a mechanism for migrating load
- information gathering
- decisions
MosiX system as an example -> Barak's slides now
FORK IN MOSIX
[sequence diagram: the deputy receives the fork() syscall and sends a request to the remote; a new link is established; fork() on both sides creates deputy(child) and remote(child); the reply from fork() is passed back]
PROCESS MIGRATION IN MOSIX
[sequence diagram: a process migration request is sent to the migdaemon on the target node; state, memory maps, and dirty pages are transferred and acknowledged; the migdaemon fork()s; after the transition the migration is finalized and acknowledged; migration completed, deputy and remote are in place]
SOME PRACTICAL PROBLEMS
- flooding: all processes jump to one new, empty node => decide immediately before the migration commitment (extra communication, piggy-backed)
- ping-pong: if thresholds are very close, processes are moved back and forth => advertise a slightly higher load than the real one
PING PONG
scenario: node 1 and node 2 compare their loads; with one process and two nodes, node 1 moves the process to node 2, node 2 moves it back, and so on
solutions (sketched below):
- add one + a little bit to the load
- average the load over time (also solves the short-peaks problem, e.g., short cron processes)
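A sketch of the two countermeasures; ALPHA, EPSILON, update_load(), and should_migrate() are invented names, and the load is counted in runnable processes (an illustration, not MOSIX code).

#include <stdio.h>

#define ALPHA   0.25    /* smoothing factor for the load average */
#define EPSILON 0.2     /* the "little bit" added on top of the +1 process */

static double smoothed_load = 0.0;   /* load in runnable processes */

/* average over time: a short peak (e.g. a brief cron job) barely moves it */
void update_load(double instant_load)
{
    smoothed_load = ALPHA * instant_load + (1.0 - ALPHA) * smoothed_load;
}

/* migrate only if the remote node stays less loaded than we are
   even after receiving the process ("add one + a little bit") */
int should_migrate(double remote_smoothed_load)
{
    return remote_smoothed_load + 1.0 + EPSILON < smoothed_load;
}

int main(void)
{
    double samples[] = { 2.0, 2.0, 7.0, 2.0, 2.0 };   /* one short peak */
    for (int i = 0; i < 5; ++i)
        update_load(samples[i]);
    printf("smoothed load %.2f, migrate to a node with load 2.0? %d\n",
           smoothed_load, should_migrate(2.0));
    return 0;
}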
LESSONS
- execution time jitter matters (Amdahl)
- approaches: partitioning vs. dynamic balancing
- components of balancing: migration of load, information, decision (who, when, where)