Automatic load balancing and transparent process migration




Roberto Innocente
rinnocente@hotmail.com
November 24, 2000

Download postscript from: mosix.ps, or gzipped postscript from: mosix.ps.gz

Introduction

The purpose of this presentation is to introduce the most important features of MOSIX:
- automatic load balancing
- transparent process migration
and to give an overview of related projects and of the different possible implementations of these mechanisms.

Scheme of the presentation

1. Overview of some distributed operating systems
2. Load balancing mechanisms
3. Process migration mechanisms
4. Focus on MOSIX

Overview of some distributed OS

Distributed O.S.

- Sprite
- Charlotte
- Accent
- Amoeba
- V System
- MOSIX
- Condor (this is just a batch system, but it offers job migration)

Sprite (Douglis, Ousterhout* 1987)

An experimental network OS developed at UC Berkeley; it ran on Suns and DECstations. Process migration in Sprite introduced the simplifying concept of the Home Node of a process (the originating node, on which the process should still appear to run). This concept has since been adopted by many other implementations.

*of Tcl/Tk fame

Charlotte (Artsy, Finkel 1989)

A distributed operating system developed at the University of Wisconsin-Madison, based on the message-passing paradigm. It provides a well-developed process migration mechanism that leaves no residual dependencies (a migrated process can survive the death of its originating node), and is therefore also suitable for fault-tolerant systems.

Accent (Zayas 1987)

- developed at CMU as a precursor of Mach
- a kernel, not Unix compatible, with IPC based on a port abstraction
- process migration was added by Zayas (1987)
- no load balancing

Amoeba (Tanenbaum et al. 1990)

- Tanenbaum, van Renesse, van Staveren, Sharp, Mullender, Jansen, van Rossum* ("Experiences with the Amoeba Distributed Operating System", 1990)
- Zhu, Steketee ("An Experimental Study of Load Balancing on Amoeba", 1994)
- does not support virtual memory
- based on the processor-pool model, which assumes there are more processors than processes

*of Python fame

V System (Theimer 1987)

Developed at Stanford; microkernel based, it runs on a set of diskless workstations. V supports both remote execution and process migration; these facilities were added with the goal of utilizing workstations during their idle periods.

MOSIX (Barak 1990, 1995)

- its roots go back to the mid-'80s: MOS, a multicomputer OS (Barak, HUJI 1985)
- a set of enhancements to BSD/OS: MOSIX '90
- port to Linux (MOSIX 7th edition); around '95 the system-call-based management was transformed into a /proc filesystem interface
- '98 enhancements (memory ushering algorithm, ...)

Condor (Litzkow, Livny 1988)

Condor is not a distributed OS. It is a batch system whose purpose is to utilize the idle time of workstations and PCs. It can migrate jobs from machine to machine, but it imposes restrictions on the jobs that can be migrated, and programs have to be linked with a special checkpoint library. (More details at http://hpc.sissa.it/batch)

Load balancing mechanisms

Load balancing methods taxonomy

- job initiation only (aka job placement, aka remote execution):
  - static
  - dynamic (almost all batch systems can do this: NQS and derivatives)
- migration (MOSIX, some distributed OSs)
- job initiation + migration

Load balancing algorithms

Two main categories:
- Non-pre-emptive: used only when a new job has to be started; they decide where to run the new job (aka job placement; many batch systems)
- Pre-emptive: jobs can be migrated during execution, so load balancing can be applied dynamically while jobs run (distributed OSs, Condor)

Non-pre-emptive load balancing (aka job placement)

- centralized initiation: a single load balancer in the system assigns new jobs to computers, according to the information it collects
- distributed initiation: each computer runs a load balancer that transfers its load to the others whenever it detects a load change; the local balancer decides, according to the load info it has received, where to start new jobs
- random initiation: when a new job arrives, the local load balancer selects at random the computer to which it is assigned
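As a sketch of the last two schemes, the placement decision reduces to a scan of whatever load table the local balancer holds; the node count and load values below are invented for illustration:

    #include <stdio.h>
    #include <stdlib.h>

    #define NNODES 4

    /* Load table as seen by the local balancer; the values are
     * whatever metric the nodes disseminate (made up here). */
    static int load[NNODES] = { 7, 2, 5, 3 };

    /* Distributed initiation: place the new job on the node that
     * currently looks least loaded according to local info. */
    static int place_least_loaded(void)
    {
        int best = 0;
        for (int i = 1; i < NNODES; i++)
            if (load[i] < load[best])
                best = i;
        return best;
    }

    /* Random initiation: no load info is consulted at all. */
    static int place_random(void) { return rand() % NNODES; }

    int main(void)
    {
        printf("least-loaded picks node %d, random picks node %d\n",
               place_least_loaded(), place_random());
        return 0;
    }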

Pre-emptive load balancing

- random: when a computer becomes overloaded, it selects another computer at random and transfers a process to it
- central: one load balancer regularly polls the other machines for load info; if it detects a load imbalance, it selects a computer with low load to receive a process migrated from an overloaded computer
- neighbour (Bryant, Finkel 1981): each computer sends a request to its neighbours; when a request is accepted, a pair is formed. If the two detect a load imbalance between them, they migrate some processes from one to the other, after which the pair is broken
- broadcast: server based
- distributed: each node receives a measure of the load of the other nodes; if it discovers a load imbalance, it tries to migrate some of its processes to a less loaded node

Processor load measures

The metric used to evaluate load is extremely important in a load balancing system. More than 20 different metrics are cited in a single paper (Mehra & Wah 1993):
- std Unix: number of tasks in the run queue
- size of free available memory
- rate of CPU context switches
- rate of system calls
- 1-minute load average
- amount of CPU free time
The conclusion of many studies, however, is that the simple run-queue length is a better measure than complex load criteria in most situations.
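The run-queue metric is cheap to sample on Linux: the fourth field of /proc/loadavg reports the number of currently runnable entities over the total. A minimal reader follows (the balancing policy built on top of it would be system specific):

    #include <stdio.h>

    /* Sample the instantaneous run-queue length on Linux.
     * /proc/loadavg looks like: "0.20 0.18 0.12 1/80 11206";
     * the fourth field is runnable/total processes. */
    int runqueue_length(void)
    {
        FILE *f = fopen("/proc/loadavg", "r");
        if (!f)
            return -1;
        double l1, l5, l15;
        int runnable = -1, total;
        fscanf(f, "%lf %lf %lf %d/%d", &l1, &l5, &l15, &runnable, &total);
        fclose(f);
        return runnable;
    }

    int main(void)
    {
        printf("runnable tasks: %d\n", runqueue_length());
        return 0;
    }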

Process migration mechanisms

User/kernel level migration

Kernel-level process migration (MOSIX):
- no special requirements, no need to recompile
- transparent migration

User-level process migration (Condor, ...):
- compilation with special libraries (libckpt.a)
- a snapshot of the running process is taken on signal
- the snapshot is restarted on another machine
- more details at http://hpc.sissa.it/batch
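To make the snapshot-on-signal idea concrete, here is a minimal, hypothetical sketch in C: the state saved (a loop counter) and the file name state.ckpt are invented for illustration, and real checkpoint libraries such as Condor's capture the whole address space rather than explicit variables.

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t do_ckpt = 0;

    static void on_sigusr1(int sig) { (void)sig; do_ckpt = 1; }

    int main(void)
    {
        long i = 0;
        FILE *f;

        /* Resume from a previous snapshot if one exists. */
        if ((f = fopen("state.ckpt", "r"))) {
            fscanf(f, "%ld", &i);
            fclose(f);
        }
        signal(SIGUSR1, on_sigusr1);

        for (;; i++) {
            if (do_ckpt) {            /* snapshot requested by signal */
                f = fopen("state.ckpt", "w");
                fprintf(f, "%ld\n", i);
                fclose(f);
                do_ckpt = 0;          /* a real system would exit here and
                                         restart on another machine */
            }
            usleep(100000);
        }
    }

Sending the process SIGUSR1 (kill -USR1 <pid>) triggers a snapshot; restarting the program resumes from the saved counter, which is the essence of migration by checkpoint/restart.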

Process migration factors:
- transparency: how much a process needs to be aware of the migration
- residual dependency: how much a process depends on the source node after migration
- performance: how much the migration affects performance
- complexity: how complex the migration is

Transparency:
- location transparency
- access (or network) transparency
- migration transparency
- failure transparency

System call interposing

In many cases transparency is obtained by interposing some code at the system call level. There are different ways to do it:
- very easy on message-based OSs, where a system call is just a message to the kernel
- a modified kernel: completely transparent to the process (MOSIX), and reasonably efficient for some kinds of processes
- a modified system library: programs need to be linked with a special library (Condor); reasonably transparent (a sketch of this technique follows the list below)

Process migration parts:
- execution state migration
- memory space migration
- communication state migration
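On Linux the modified-library approach can be demonstrated without even relinking, using LD_PRELOAD; this toy wrapper (not how MOSIX or Condor actually interpose, just the general mechanism) logs each open() before forwarding it to the real libc entry point:

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Interposed open(): a migration system could redirect the call
     * to the home node here; we merely log it and forward it. */
    int open(const char *path, int flags, ...)
    {
        static int (*real_open)(const char *, int, ...) = NULL;
        if (!real_open)
            real_open = dlsym(RTLD_NEXT, "open");

        mode_t mode = 0;
        if (flags & O_CREAT) {      /* mode is only present with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }
        fprintf(stderr, "interposed open(%s)\n", path);
        return real_open(path, flags, mode);
    }

Built with gcc -shared -fPIC -o interpose.so interpose.c -ldl, it can be applied to any dynamically linked program with LD_PRELOAD=./interpose.so.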

Execution state migration

This is not that difficult: for the most part the code can be the same as that used for swapping. The difference is that for swapping the destination is a disk, while for migration the destination is a different node.

Communication state migration

This is a difficult task, and it is usually avoided by forcing the migrated process to redirect networking operations to the Home Node, where they are performed. On Amoeba:
- each thread of a process has its own communication state, which remembers the stage of its ongoing RPCs; these states are kept in the kernel
- during migration, any message received for a suspended process is rejected and a "process migrating" response is returned; the sending machine will retransmit after some delay
- the sending machine, using a special protocol (FLIP), will locate the migrated process
- the same technique is used for reply messages to a migrating client

Migratable sockets

Unfortunately TCP/IP has no direct support for this, and research is ongoing on different solutions. One proposed user-level implementation requires:
- substitution of the socket library
- relinking of programs with the new library (Bubak, Zbik et al. 1999)
In this proposal, communication takes place between virtual addresses (servers keep a correspondence table between virtual and real addresses). When an endpoint migrates, a stub remains on the original node to respond with a "process migrating" message. The other endpoint then consults the server for the new real address to contact.

Memory space migration

- total copy (Charlotte, Locus): the process is frozen and the entire memory space is moved; the process is then restarted on the remote node (this can take many seconds and, furthermore, can transfer memory that will never be used)
- pre-copying (V System: Theimer): lets the process continue executing while its memory space is transferred to the target; V then freezes the process on the source and re-copies the pages modified during the transfer (sketched below)
- lazy copying, or demand paging (Accent: Zayas): pages are left on the source machine and are transferred only on page faults. This mechanism can leave dirty pages on all the nodes the process passes through. A small variation used by MOSIX is to freeze the process during the transfer of all the dirty pages, and then to transfer the other pages from the backing store as they are referenced
- shared file server (Sprite): Sprite uses the network file system as a backing store for virtual memory. Dirty pages are flushed to the network file system, and the process is restarted on the target (it is just a swap-out, swap-in)
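The round structure of pre-copying can be sketched as a self-contained simulation; the page counts, dirty-page tracking, and convergence thresholds below are all invented stand-ins for what a real kernel provides:

    #include <stdio.h>
    #include <stdlib.h>

    #define NPAGES 256

    static int dirty[NPAGES];

    /* Stand-in for the process running on (and dirtying pages of)
     * the source node while the transfer is in progress. */
    static int mutate(int writes)
    {
        for (int i = 0; i < writes; i++)
            dirty[rand() % NPAGES] = 1;
        int n = 0;
        for (int p = 0; p < NPAGES; p++)
            n += dirty[p];
        return n;
    }

    /* Stand-in for copying one page over the network. */
    static void send_page(int p) { dirty[p] = 0; }

    int main(void)
    {
        /* Round 0: copy everything while the process keeps running. */
        for (int p = 0; p < NPAGES; p++)
            send_page(p);
        int n = mutate(128);

        /* Re-copy rounds: resend pages dirtied during the last copy,
         * until the dirty set is small or we give up on converging. */
        int round = 1;
        while (n > 8 && round < 8) {
            for (int p = 0; p < NPAGES; p++)
                if (dirty[p])
                    send_page(p);
            n = mutate(n);   /* fewer writes as rounds shrink */
            printf("round %d: %d pages still dirty\n", round++, n);
        }

        /* Freeze (no more mutate calls) and stop-and-copy the rest. */
        for (int p = 0; p < NPAGES; p++)
            if (dirty[p])
                send_page(p);
        printf("frozen after %d rounds; process restarts on target\n", round);
        return 0;
    }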

Focus on MOSIX

MOSIX important points

- Probabilistic info dissemination
- PPM (Pre-emptive Process Migration)
- Dynamic load balancing
- Memory ushering
- Decentralized control
- Efficient kernel communications

Probabilistic info dissemination

- each node, at regular intervals, sends info on its resources to a random subset of nodes (see the UDP sketch below)
- each node, at the same time, keeps a small buffer with the most recently received info

MOSIX load balancing algorithm

MOSIX uses a pre-emptive and distributed load balancing algorithm that is essentially based on a Unix load-average metric. Each node tries to find a less loaded node to which it can migrate one of its processes. When a node is low on memory a different algorithm prevails: memory ushering.
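The dissemination step can be sketched as a small UDP sender; the peer list, port number, and load-vector layout below are invented for illustration (the actual MOSIX wire format lives inside its kernel code):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <unistd.h>

    struct load_info {            /* illustrative load vector */
        uint32_t load;            /* scaled load average      */
        uint32_t free_mem_kb;
    };

    static const char *peers[] = { "10.0.0.2", "10.0.0.3",
                                   "10.0.0.4", "10.0.0.5" };
    #define NPEERS  4
    #define SUBSET  2             /* random subset size per interval */
    #define PORT    9999

    int main(void)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct load_info li = { .load = 42, .free_mem_kb = 123456 };

        for (;;) {                /* runs forever, like a daemon */
            /* Send our load vector to a random subset of nodes. */
            for (int i = 0; i < SUBSET; i++) {
                struct sockaddr_in to = { .sin_family = AF_INET,
                                          .sin_port   = htons(PORT) };
                inet_pton(AF_INET, peers[rand() % NPEERS], &to.sin_addr);
                sendto(s, &li, sizeof li, 0,
                       (struct sockaddr *)&to, sizeof to);
            }
            sleep(1);             /* "at regular intervals" */
        }
    }

Randomizing the recipients keeps the protocol's cost constant per node while still spreading reasonably fresh load information across the cluster.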

MOSIX info variables

These are the variables exchanged between MOSIX nodes to feed the distributed load balancing algorithm:
- Load (expressed in units of a Pentium II 400 MHz)
- Number of processors
- Speed of processor
- Memory used
- Raw memory used
- Total memory
See http://hpc.sissa.it/walrus/mjm for a Java applet showing these values over a MOSIX cluster.

Memory ushering algorithm

Normally only the load balancing algorithm is active. When free memory falls under a threshold (paging), the memory ushering algorithm is started and takes precedence over the load balancing algorithm. It tries to migrate the smallest running process to the node with the most free memory. Eventually processes are migrated between remote nodes to create sufficient space on a single node.
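The selection rule (smallest process to the node with the most free memory) is easy to state in code; the process and node tables below are invented for the sketch:

    #include <stdio.h>

    struct proc { int pid; long rss_kb; };     /* resident size per process */
    struct node { int id;  long free_kb; };    /* free memory per node      */

    /* Memory ushering choice: migrate the smallest resident process
     * to the node currently advertising the most free memory. */
    static void usher(struct proc *p, int np, struct node *n, int nn)
    {
        int sp = 0, bn = 0;
        for (int i = 1; i < np; i++)
            if (p[i].rss_kb < p[sp].rss_kb) sp = i;
        for (int i = 1; i < nn; i++)
            if (n[i].free_kb > n[bn].free_kb) bn = i;
        printf("migrate pid %d (%ld kB) -> node %d (%ld kB free)\n",
               p[sp].pid, p[sp].rss_kb, n[bn].id, n[bn].free_kb);
    }

    int main(void)
    {
        struct proc procs[] = { {101, 5120}, {102, 900}, {103, 20480} };
        struct node nodes[] = { {1, 4096}, {2, 65536}, {3, 1024} };
        usher(procs, 3, nodes, 3);
        return 0;
    }

Migrating the smallest process that relieves the memory pressure keeps the transfer cheap, since migration cost grows with the size of the memory image to be moved.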

MOSIX network protocols

These protocols are used to exchange load info among the nodes and to negotiate and carry out process migrations between nodes. MOSIX uses:
- UDP to transfer load info
- TCP to migrate processes
These communications are not encrypted, authenticated, or otherwise protected, so it is better to run the MNP over a private network. (Unfortunately the sockets used by MOSIX are not reported by standard tools such as netstat.)

MOSIX process migration

- kernel-level migration: done without any modification to the programs, because MOSIX changes the Linux kernel
- fully transparent: the process involved does not need to know anything about the migration
- strong residual dependency: the migrated process is completely dependent on the deputy for system calls (particularly I/O and networking)

MOSIX migration transparency

MOSIX achieves complete transparency, at the price of some efficiency, by interposing at the system call level a link layer that redirects execution of the calls to the originating node of the process (the Unique Home Node). The skeleton of the process that remains on the originating node to execute system calls, receive signals, and perform I/O is called in MOSIX the deputy (what Condor calls the shadow). The migrated process is called the remote.

MOSIX deputy/remote mechanism

[Diagram: the deputy on the UHN (Unique Home Node) and the remote on the running node each sit behind a link layer; the two link layers communicate over the MNP (Mosix Network Protocol).]
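The deputy/remote split is essentially remote procedure call applied to system calls. This toy sketch forwards a write() from a "remote" child to a "deputy" parent over a local socketpair; real MOSIX ships the arguments between kernels over the MNP, not over a user-space socket:

    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    struct req { int fd; int len; char buf[64]; };  /* a "write" request */

    int main(void)
    {
        int sv[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);

        if (fork() == 0) {                 /* "remote": redirects the call */
            struct req r = { .fd = 1 };
            r.len = snprintf(r.buf, sizeof r.buf, "hello from the remote\n");
            write(sv[1], &r, sizeof r);    /* ship the syscall to the deputy */
            _exit(0);
        }
        /* "deputy": executes the system call on the home node. */
        struct req r;
        read(sv[0], &r, sizeof r);
        write(r.fd, r.buf, r.len);
        wait(NULL);
        return 0;
    }

The extra round trip per system call is exactly why I/O-heavy processes are penalized under this scheme, as the next slides discuss.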

MOSIX performance/1

While a CPU-bound process obtains clear advantages from MOSIX, being penalized very little by migration, I/O-bound and network-bound processes suffer big drawbacks and are usually not good candidates for migration. A first solution, addressing only part of the problem, has been the development of a daemon supporting remote initiation (MExec/MPmake) of processes. With the help of this daemon, processes that can run in parallel are dispatched during the exec call (you need to change the source and recode the exec as mexec), thus avoiding the bottleneck of a single Home Node where all the I/O would be performed. GNU make is a tool already prepared to be recompiled to exploit its inherent parallelism (follow the indications in the MExec tarball).

MOSIX performance/2

A more general solution to the performance problem of I/O-bound processes under migration is envisioned to be the use of a cache-coherent global file system:
- cache coherent: clients keep their caches synchronized, or only one cache is kept, at the server node
- global: every file can be seen from every node under the same name
If such a file system is available, I/O operations can be executed directly from the running node, without being redirected to the Unique Home Node. MOSIX has recently incorporated the possibility of declaring that a file system supports Direct File System Access; for such file systems MOSIX performs I/O operations directly on the running node.

MOSIX DFSA (Direct File System Access)

[Diagram: with DFSA, the remote on the running node accesses the file system directly, instead of going through its link layer and the MNP to the deputy on the UHN.]

Cache coherent global distributed file systems

Being a cache-coherent global file system is not sufficient to serve as a MOSIX DFSA: some additional functions (not available in standard Linux super-blocks) must be provided at the super-block level: identify, compare, reconstruct. A prototype DFSA, just a small software layer over the standard Linux file system, is MFS (Mosix File System), provided directly by the MOSIX team. Other possible DFSA candidates are:
- GFS (Global File System)
- xFS (serverless file system)
The integration of MOSIX with GFS now seems stable.