Improving Job Scheduling by using Machine Learning & Yet an another SLURM simulator. David Glesser, Yiannis Georgiou (BULL) Denis Trystram(LIG)



Similar documents
Simulation of batch scheduling using real production-ready software tools

Introduction to Virtual Machines

Utilization Driven Power-Aware Parallel Job Scheduling

Power and Energy aware job scheduling techniques

Testing Automation for Distributed Applications By Isabel Drost-Fromm, Software Engineer, Elastic

Cloud Computing through Virtualization and HPC technologies

Virtual Machines. Virtual Machines

Advances in Virtualization In Support of In-Memory Big Data Applications

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Building a Volunteer Cloud

1 Bull, 2011 Bull Extreme Computing

Share and aggregate GPUs in your cluster. F. Silla Technical University of Valencia Spain

Full and Para Virtualization

Compiler-Assisted Binary Parsing

Algorithms. Miroslav Ruda 1 Hana Rudová 2. Masaryk University. Masaryk University. Grid Simulator with Production Scheduling.

Real Scale Experimentations of SLURM Resource and Job Management System

COM 444 Cloud Computing

Experimenting OAR in a virtual cluster environment for batch schedulers comparative evaluation

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab

Virtualization. Types of Interfaces

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

How do Users and Processes interact with the Operating System? Services for Processes. OS Structure with Services. Services for the OS Itself

Suh yun Ki m (KIS T) (KIS suhyunk@.com

Multi-core Programming System Overview

Abstract: Motivation: Description of proposal:

Virtualization. Dr. Yingwu Zhu

A Scheduler-Level Incentive Mechanism for Energy Efficiency in HPC

VMware and CPU Virtualization Technology. Jack Lo Sr. Director, R&D

Overlapping Data Transfer With Application Execution on Clusters

Uses for Virtual Machines. Virtual Machines. There are several uses for virtual machines:

Distributed and Cloud Computing

MPI / ClusterTools Update and Plans

A Taxonomy and Survey of Energy-Efficient Data Centers and Cloud Computing Systems

Altaro VM Backup vs. Microsoft s System Center DPM

System Structures. Services Interface Structure

SSL for VM: The Hard Way and the Easy Way

SLURM Resources isolation through cgroups. Yiannis Georgiou Matthieu Hautreux

OS Virtualization Frank Hofmann

Virtual Machine Monitors. Dr. Marc E. Fiuczynski Research Scholar Princeton University

MAQAO Performance Analysis and Optimization Tool

Provisioning and Resource Management at Large Scale (Kadeploy and OAR)

Xen and the Art of. Virtualization. Ian Pratt

Corso di Reti di Calcolatori L-A. Cloud Computing

Example of Standard API

Cloud Computing. Up until now

Mobile Application Hacking for Android and iphone. 4-Day Hands-On Course. Syllabus

IBM TRIRIGA Anywhere Version 10 Release 4. Installing a development environment

Virtualization with Windows

Data Center Specific Thermal and Energy Saving Techniques

So#ware Tools and Techniques for HPC, Clouds, and Server- Class SoCs Ron Brightwell

How To Use A Cloud For A Local Cluster

CPS221 Lecture: Operating System Structure; Virtual Machines

How to use PDFlib products with PHP

A Unified View of Virtual Machines

Running applications on the Cray XC30 4/12/2015

Monitoring Infrastructure for Superclusters: Experiences at MareNostrum

Datacenter Operating Systems

OPERATING SYSTEM SERVICES

Virtualization. Jukka K. Nurminen

12. Introduction to Virtual Machines

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Confinement Problem. The confinement problem Isolating entities. Example Problem. Server balances bank accounts for clients Server security issues:

Equalizer. Parallel OpenGL Application Framework. Stefan Eilemann, Eyescale Software GmbH

OpenMosix Presented by Dr. Moshe Bar and MAASK [01]

Virtual machines and operating systems

Performance Monitoring of Parallel Scientific Applications

DevOps Best Practices for Mobile Apps. Sanjeev Sharma IBM Software Group

Getting help - guide to the ticketing system. Thomas Röblitz, UiO/USIT/UAV/ITF/FI ;)

Hardware Based Virtualization Technologies. Elsie Wahlig Platform Software Architect

Box Leangsuksun+ * Thammasat University, Patumtani, Thailand # Oak Ridge National Laboratory, Oak Ridge, TN, USA + Louisiana Tech University, Ruston,

DB2 Connect for NT and the Microsoft Windows NT Load Balancing Service

Stacey D. Son Consultant/SRI International. BSDCan Developer s Summit 15-May-2013

Section 1.4. Java s Magic: Bytecode, Java Virtual Machine, JIT,

Final Report. Cluster Scheduling. Submitted by: Priti Lohani

Lecture 17: Mobile Computing Platforms: Android. Mythili Vutukuru CS 653 Spring 2014 March 24, Monday

Monitoring, Tracing, Debugging (Under Construction)

Quality assurance for mobile applications Case studies for GUI test automation. Alexandra Schladebeck

Scaling from 1 PC to a super computer using Mascot

Code and Process Migration! Motivation!

ELEC 377. Operating Systems. Week 1 Class 3

Basics in Energy Information (& Communication) Systems Virtualization / Virtual Machines

Application-Level Debugging and Profiling: Gaps in the Tool Ecosystem. Dr Rosemary Francis, Ellexus

System Architecture V3.2. Last Update: August 2015

Please note that this SDK will only work with Xcode or above. If you need an SDK for an older Xcode version please support.

Feedback Autonomic Provisioning for guaranteeing performance (and reliability. - application to Big Data Systems

Lecture 2 Cloud Computing & Virtualization. Cloud Application Development (SE808, School of Software, Sun Yat-Sen University) Yabo (Arber) Xu

OpenSAF A Standardized HA Solution

Cross Platform Mobile. -Vinod Doshi

BG/Q Performance Tools. Sco$ Parker BG/Q Early Science Workshop: March 19-21, 2012 Argonne Leadership CompuGng Facility

Nested Virtualization

Libmonitor: A Tool for First-Party Monitoring

Virtualization across the organization

Hadoop Fair Scheduler Design Document

COS 318: Operating Systems. Virtual Machine Monitors

References. Introduction to Database Systems CSE 444. Motivation. Basic Features. Outline: Database in the Cloud. Outline

Introduction to Database Systems CSE 444

Advanced Server Virtualization: Vmware and Microsoft Platforms in the Virtual Data Center

Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation

Mobile Application Development Android

Adaptive Processor Allocation for Moldable Jobs in Computational Grid

Transcription:

Improving Job Scheduling by using Machine Learning & Yet an another SLURM simulator David Glesser, Yiannis Georgiou (BULL) Denis Trystram(LIG)

Improving Job Scheduling by using Machine Learning & Yet an another SLURM simulator

Improving Job Scheduling by using Machine Learning Improving Backfilling by using Machine Learning to Predict Running Times By Eric Gaussier, David Glesser, Valentin Reis, Denis Trystram In proceedings of SuperComputing 2015 3

Improving Job Scheduling by using Machine Learning Machine Learning algorithms can learn odd patterns SLURM uses a backfilling algorithm the running time given by the user is used for scheduling, as the actual running time is not known The value used is very important better running time estimation => better performances Predict the running time to improve the scheduling 4

Improving Job Scheduling by using Machine Learning We select a Machine Learning algorithm that: Use classic job parameters as input parameters Work online (to adapt to new behaviors) Use past knowledge of each user (as each user has its own behaviour) Robust to noise (parameters are given by humans, jobs can segfault...) 5

Improving Job Scheduling by using Machine Learning We test 128 different algorithms on 6 logs (from the Feitelson Workload Archive) on the Pyss simulator A leave-one-out cross validation product give us the best algo that we called E-Loss: Online linear regression model Predict that a running time is more than the actual value cost more to the model When we under estimate a running time, we add a fixed value (1min, 5min, 15 min, 30 min ) When we backfill jobs we sort them by shortest first 6

Improving Job Scheduling by using Machine Learning Backfilling performances can be improved by changing running times More precise running times does not mean better performances Scheduling performances can be increased using basic Machine Learning algorithms 7

Improving Job Scheduling by using Machine Learning Ongoing works Impelement E-Loss in SLURM We need a simulator within SLURM to test it Machine Learning algorithms perform best when they have a lot of data to learn from Instead of customizing the priority factors by hand, a Machine Learning can do it for you! 8

Improving Job Scheduling by using Machine Learning & Yet an another SLURM simulator

Yet an another SLURM simulator Previous work: run sleeps instead of actual jobs, multiple slurmd per physical node (to emulate bigger cluster than you have access to) Why not advancing in time when everybody sleeps? Use simulation! 10

Yet an another SLURM simulator SLURM libraries OS VM Virtual Machines + perfect behaviour - heavy and slow + No modifications to SLURM SLURM simulator Classic simulators - no guarantee on the behaviour + extra light - Modifications of SLURM 11

Yet an another SLURM simulator Introducing Simunix, an UNIX simulator We implement the "UNIX" API: pthreads, pthread_mutex, gettimeofday, sleep, send, recv Use Simgrid framework We can run an unmodified slurm on a simulated cluster 12

Yet an another SLURM simulator SLURM libraries OS VM Virtual Machines + perfect behaviour - heavy and slow + No modifications to SLURM SLURM libraries Simunix Simunix + close behaviour + light + No modifications to SLURM SLURM simulator Classic simulators - no guarantee on the behaviour + extra light - Modifications of SLURM 13

Yet an another SLURM simulator How to force a binary to use our libraries? Change how linking is done! The Linux linker load from the system and LD_PRELOAD the needed shared libraries It fills the GOT (Global Object Table) with the address of each functions of each libraries The compiler compile sleep(10); to GOT[ sleep@libc ](10); (Of course, it's not exactly like this, if you have more question RTFM of the ELF format) 14

Yet an another SLURM simulator How to force a binary to use our libraries? Change how linking is done! At runtime, simunix rewrite the GOT Of the selected binary/libraries Not on the simunix library nor the Simgrid library! Addresses in the GOT are replace by our own functions: GOT[ sleep@libc ] = &simunix_sleep; GOT[ time@libc ] = &simunix_time;... 15

Yet an another SLURM simulator Simgrid a framework to design simulators of distributed applications Supports: advanced network models energy consumption models I/O models Actively developped Good practice : they (in)validate their simulator (they explictly give the strengths and weaknesses of their models by testing them and compared them to real runs!) 16

Yet an another SLURM simulator How this work? Then, each intercepted calls communicate to an independent maestro process Slurmd Slurmd Simgrid Slurmctld Simunix wrapper Simunix wrapper Simunix maestro Simunix wrapper 17

Yet an another SLURM simulator Current works Optimize to simulate 1 year in a reasonable amount of time Support more Simgrid features: run simulated apps not just a sleep (network contention ) DVFS and energy Try out with other schedulers (every Linux software is compatible!) 18

Thanks