Simulation of batch scheduling using real production-ready software tools

Alejandro Lucero
Barcelona Supercomputing Center

Abstract. Batch scheduling for high performance cluster installations has two main goals: 1) to keep the whole machine working at full capacity at all times, and 2) to respect priorities, avoiding lower priority jobs jeopardizing higher priority ones. Batch schedulers usually allow different policies, with several variables to be tuned per policy. Other features, like special job requests, reservations or job preemption, increase the complexity of achieving a fine-tuned algorithm. A local decision for a specific job can change the full schedule for a large number of jobs, and what looks reasonable in the short term can make no sense over a long trace measured in weeks or months. Although it is possible to extract the algorithms from batch scheduling software to simulate large job traces, this is not the ideal approach, since scheduling is not an isolated part of this type of tool, and replicating the same environment requires an important effort plus a high maintenance cost. We present a method for obtaining a special mode of operation of real production-ready scheduling software, SLURM, in which we can simulate the execution of real job traces to evaluate the impact of scheduling policies and policy tuning.

1 Introduction

The Simple Linux Utility for Resource Management (SLURM) [1] has been used at the Barcelona Supercomputing Center (BSC) [2]. Initially designed as a resource manager for job execution across large Linux cluster installations, today it also has extended features for job scheduling. This software is a key tool for the successful management of a machine as large as MareNostrum, a 2500-node cluster installed at BSC.

The first SLURM versions supported a very simple scheduler: a First Come First Served algorithm. Later versions improve scheduling with a main algorithm based on job priority, allowing different quality of service queues to be defined, with users getting access to specific ones. A second, configurable scheduling algorithm comes with SLURM: the backfilling algorithm. It can work together with the main scheduler, its goal being to use free resources for lower priority jobs as long as no higher priority job is delayed. SLURM offers other features such as reservations, special job requests and job preemption. It is worth mentioning that SLURM scheduling is fully dependent on how resources are managed. For a better understanding of this relationship, scheduling should be thought of as a two-step process:
1. A first step selects a job whose requirements can be granted based on the pure scheduling algorithm. At this point a job has been selected and there are enough resources to execute it.
2. A second step selects the best nodes for job execution, trying to use resources in such a way that the job requirements are met and no more resources than really needed are wasted.

For example, jobs can ask for an amount of memory per node, so resource management needs to know which nodes own that amount of memory and, if other jobs are already running on those nodes, how much memory is currently available. Resource management is currently done with cpus/cores/threads and memory, but it would be possible to use other resources like disk, network bandwidth or memory bandwidth.

Taking SLURM as the batch scheduler on which the simulation software will be based, it would be possible to extract the SLURM code related to scheduling and work with it. However, as we have seen above, scheduling is linked to resource management, so the resource management code would need to be extracted as well if we really want to work with real scheduling software. Assuming this task is feasible without a high effort, there are other components we would need for using this code:

- job submission supporting special job requests
- reservations
- controlling nodes
- controlling job execution
- job preemption
- ...

For a valid simulation all of these possibilities should be contemplated, therefore extracting just the scheduler code from SLURM is not enough. It would be possible to build a simulation bench based on SLURM code, but there are two main problems with this approach:

1. It would require an important effort to implement all the components needed.
2. It would have a high maintenance cost for every new SLURM version.

We present a solution for job trace execution using SLURM as the simulation tool with minor SLURM code changes. Our design goal is to hide the simulation mode from SLURM developers and to have everything SLURM supports available in simulation mode. We think the effort should be done once; then new SLURM features in future versions will be ready for simulation purposes with minor effort, making simulation maintenance easier. The rest of the paper covers related work, design, implementation and preliminary results.
2 Related Work

Process scheduling is an interesting part of operating systems and has been a focus of research papers for decades. It is well known that there is no perfect scheduler for every workload, and in specific systems, like those with real-time requirements, scheduling is tuned and configured in detail. Batch scheduling for supercomputer clusters shares the same complexity and problems as the process or I/O scheduling found in operating systems.

Previous work on testing scheduling algorithms and scheduling configurations has been done, as in [3], where process scheduling simulation is done in user space by copying the scheduler code from the Linux kernel. That work suffers from one of the problems listed above: maintenance is needed to keep the user space copy in sync with the kernel scheduler code. Another work, more related to batch scheduling for supercomputers, is presented in [4], which describes well the importance of having a way of testing scheduling policies. But this work also faces problems, since the simulator code replaces the real scheduler and wrappers for the resource management software are specifically built, so the simulation relies on the assumption that such an environment is realistic enough. Other works [5] use CSIM [6], modelling batch scheduling algorithms with discrete events. Although this is broadly used for pure algorithm research, it requires an important effort, and it is always difficult to reproduce the same environment, which can lead to underestimating other parts of real scheduler systems, like resource management.

Research using a real batch scheduler system has been done [7], but without a time domain for simulation: the assumption is made of dividing by 1000 the time events obtained from a job trace. In this case the batch scheduling software runs as usual, but every event takes 1/1000 of its real time. This approach has some drawbacks, for example what happens with jobs taking less than 1000 seconds to execute. Clearly it is not possible to simulate a real trace this way, and it is important to notice that small changes in job scheduling can have a huge impact on global scheduling results.

It is worth mentioning the Moab [8] cluster software, which is currently used at BSC along with SLURM. Moab takes care of scheduling, with help from SLURM to know the state of nodes and running jobs. Moab has had support for a simulator mode since its initial versions, although it has not always worked as the marketing said. In fact, this work was started to overcome the problems found when trying to use this Moab feature in the past. Moab being commercial software, there is no documentation about how this simulation mode was implemented. However, we got some information from Moab engineers commenting on how high the cost of maintaining this feature across new versions was.

We have not found any work using the preloading technique with dynamic libraries for creating the simulation time domain, nor any work using this approach with such a multithreaded and distributed program as SLURM.
3 Simulator Design

SLURM is distributed multithreaded software with two main components: a slurm controller (slurmctl) and a slurm daemon (slurmd). There is just one slurmctl per cluster installation and as many slurmd as computing nodes. The slurmctl tasks are resource management, job scheduling and job execution, although in some installations SLURM is used along with an external scheduler such as Moab. Once a job is scheduled for execution, the slurmctl communicates with the slurmd belonging to the nodes selected for the job. Behind the scenes there is a lot of processing to allow job execution on nodes, like preserving the user environment from when the job was submitted.

SLURM design had scalability as a main goal, so it supports thousands of jobs and thousands of nodes/cores per cluster. When slurmctl needs to send a message to the slurmds it uses agents (communication threads) to support heavy peaks of use, for example signaling a multinode job for termination, getting nodes status or launching several jobs at the same time. Besides those threads, created for a specific purpose and with a short life, there are other main threads: the main controller thread, the power management thread, the backup thread, the remote procedure call (RPC) thread and the backfilling thread. In the case of slurmd there is just one main thread waiting for messages from the slurmctl, an equivalent to the slurmctl RPC thread. On job termination slurmd sends a message to the slurmctl with the jobid and the job execution result code.

Figure 1 shows the SLURM system components. Along with slurmctl and slurmd there are several programs for interacting with slurmctl: sbatch for submitting jobs, squeue and sinfo for getting information about jobs and cluster state, scancel for cancelling jobs, and some other programs not shown in the picture.

Fig. 1. SLURM architecture: user commands (sbatch, sinfo, squeue, scancel) talk to slurmctl and its backup, which control the slurmd daemons.

From a simulation point of view the task done by slurmd is simpler, since job execution is not necessary; we just need to know how long it will take, information we get from the job trace. A simple modification for simulation mode in the slurmd will support this easily, just sending a message when the job finishes, knowing how long ago the job was launched. The key is how to know the current time, since in simulation mode we want to execute a trace of days or months in just hours or minutes. Another issue is supporting simulation of thousands of nodes: it is not necessary to have one slurmd per node, just one super-slurmd taking control of the jobs launched and their durations.
No change is needed for this super-slurmd, since SLURM has an option called FRONTEND mode which allows this behaviour, although it was intended for other purposes. Although slurmd by design can run on a different machine than slurmctl, for simulation this is not needed.

For slurmctl no changes are needed. However, it is a good idea to avoid executing some threads related to power management, backup, slurm global state saving and signal handling, which make no sense in simulation mode. Also, during normal execution SLURM takes control of all computational nodes running a slurmd: periodically the slurm controller pings the nodes and asks for the jobs currently being executed, so SLURM can react to node or job problems. Under simulation we work within a controlled environment where no such problem is expected, therefore that functionality is not needed.

As in the case of slurmd, slurmctl execution is based on elapsed time:

- both the main scheduler algorithm and backfilling are infinite loops executed with a specific period;
- a job priority is calculated based on fixed parameters dependent on user/group/qos, but usually priority increases proportionally to the job's time queued;
- jobs have a wall clock limit fixing the maximum execution time allowed for the job;
- reservations are made for future job execution.

In simulation mode we want to speed up time, so that long traces can be executed in a short time in order to study the best scheduling policy to apply. Our design is based on the LD_PRELOAD [9] functionality available with shared libraries in UNIX systems. Using this feature it is possible to capture specific function calls exported by shared libraries and to execute whatever code is needed. As SLURM is based on shared libraries, we can use LD_PRELOAD to control simulation time: calls to the time() function are captured and the current simulated time is returned instead of the real time. Along with time(), other time-related functions used by SLURM are wrapped: gettimeofday() and sleep(). There are other time-related functions, like usleep() and select(), but SLURM uses them for communications, which is not going to affect the simulation. Finally, as SLURM design is based on threads, and calls to time functions are made indistinctly by all of them, some POSIX thread functions are wrapped as well.
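To make the technique concrete, here is a minimal sketch of what the time wrappers might look like; all names (sim_current_time in particular) are illustrative, not the actual sim_lib symbols:

```c
/* Sketch of the sim_lib time wrappers (names are illustrative).  Built
 * as a shared library and loaded with LD_PRELOAD, these definitions
 * shadow the libc versions, so every SLURM thread reads the simulated
 * clock kept in shared memory by sim_mgr instead of the real time. */
#include <time.h>
#include <sys/time.h>

/* Mapped from the POSIX shared memory segment at library init;
 * sim_mgr increments it once per simulation cycle. */
extern volatile time_t *sim_current_time;

time_t time(time_t *tloc)
{
    time_t now = *sim_current_time;   /* simulated seconds, not real time */
    if (tloc)
        *tloc = now;
    return now;
}

int gettimeofday(struct timeval *tv, struct timezone *tz)
{
    static unsigned int fake_usec;    /* fake microseconds: a counter
                                         simply bumped on every call */
    (void)tz;
    tv->tv_sec  = *sim_current_time;
    fake_usec   = (fake_usec + 1) % 1000000;
    tv->tv_usec = fake_usec;
    return 0;
}
```

Both daemons then only need to be started with the library preloaded, along the lines of LD_PRELOAD=./sim_lib.so before launching slurmctl and slurmd.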
Figure 2 shows how the simulator design works: a specific program external to SLURM is created, sim_mgr, along with a shared library, sim_lib, containing the wrappers for the time and thread functions. During the initialization phase, sim_mgr waits for the programs to control: slurmctl and slurmd. Once those programs are registered (executing an init function inside sim_lib, 0a and 0b in the figure), every time pthread_create is called the wrapper code in sim_lib registers the new thread (1a and 4b1). In the figure, lines with arrows represent simulation-related actions, with the first number of the identifier representing the simulation cycle when that action happens. The letter in the identifier just differentiates the two threads shown, and the second number indicates the sequence of actions within the same cycle. So the slurmctl thread was created at simulation cycle 1 and the slurmd thread was created at simulation cycle 4.

Fig. 2. Simulator design: sim_mgr coordinates, through sim_lib, the pthreads of slurmctl and slurmd.

Inside the pthread wrappers there is a call to the real pthread functions, which is not true for the time wrappers, which fully replace the real time functions. Registration links a thread to a POSIX semaphore used for simulation synchronization, and it also links the thread to a structure containing the sleep value for that thread and other fields used by sim_mgr. A simulation clock cycle is equivalent to one real-time second, and in each cycle sim_mgr goes through an internal thread array working with the registered threads. There are two classes of threads from the sim_mgr point of view:

- those whose sleep calls run through several simulation cycles (and some through the whole simulation); sim_mgr detects a sleeping thread and keeps a counter for waking the thread up when necessary;
- those which are created, do something and exit within the same simulation cycle.
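The registration performed inside the pthread_create wrapper might be sketched as follows; sim_register_thread() is an assumed helper standing in for the real registration code, which the paper only describes:

```c
/* Sketch of the sim_lib pthread_create wrapper.  Unlike the time
 * wrappers, this one forwards to the real libpthread function. */
#define _GNU_SOURCE
#include <pthread.h>
#include <dlfcn.h>

extern void sim_register_thread(pthread_t tid);  /* assumed helper: atomic
                                                    slot + semaphore setup */

typedef int (*create_fn)(pthread_t *, const pthread_attr_t *,
                         void *(*)(void *), void *);

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*start_routine)(void *), void *arg)
{
    static create_fn real_create;
    int rc;

    if (!real_create)   /* resolve the real symbol once, lazily */
        real_create = (create_fn)dlsym(RTLD_NEXT, "pthread_create");

    rc = real_create(thread, attr, start_routine, arg);
    if (rc == 0)
        sim_register_thread(*thread);  /* make the new thread visible to
                                          sim_mgr (1a and 4b1 in Fig. 2) */
    return rc;
}
```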
Back to figure 2: thread A does a sleep call at simulation cycle 3 (3a) and sim_mgr wakes the thread up at simulation cycle 4 (4a), so we know it was a sleep call for just one second. At that point sim_mgr waits until that thread does another sleep call, or maybe a call to pthread_exit. The main threads in slurmctl have this behavior: infinite loops with a per-thread frequency. The other type of thread is shown as thread B in the figure. This thread is created at simulation cycle 4 (4b1) and in the same cycle it calls pthread_exit (4b2). When sim_mgr needs to act on that thread there are two situations: the thread did not finish yet, or it did finish. In the first case sim_mgr waits until the thread calls pthread_exit. In the second one sim_mgr just sets the thread's array slot free for new threads.

4 Simulator Implementation

A main goal for the simulator is to avoid SLURM modifications or, if changes are necessary, to make them transparent to SLURM developers. The ideal situation would be the simulator feature going along with future SLURM versions smoothly, with just minor maintenance. If this turns out to be an impossible task, a discussion should be opened on the SLURM mailing list exposing the problem and studying the cost. Only if the simulator becomes a key piece for SLURM could major changes to core SLURM functionality make sense.

The simulator design implies implementing two pieces of code outside the SLURM code tree: the simulation manager and the simulation shared library. The simulation manager, sim_mgr, keeps control of threads and simulation time, waiting for threads to terminate or waking up threads when the sleep timeouts they requested expire. The simulation library captures time-related calls thanks to the LD_PRELOAD functionality and synchronizes with sim_mgr for sleep calls or for getting the simulation time. This design is quite simple for a single process/thread, but it gets more complex for multithreaded and distributed software like SLURM. Sim_mgr needs a way to identify threads and to control each one independently, therefore sim_lib implements wrappers for some pthread functions, like pthread_create, pthread_exit and pthread_join, which give information to sim_mgr when they are called.

A first implementation with sockets inside those pthread wrappers was made, but it made sim_mgr less intuitive to understand and debug. A second implementation was done using POSIX shared memory containing global variables and per-thread variables. One of the structures inside the shared memory is an array of POSIX semaphores used for synchronization between sim_mgr and the slurm threads. A limitation of this design using POSIX mechanisms is that slurmd needs to be on the same machine as slurmctl and sim_mgr, although, as discussed above, this is not a problem for simulation purposes.

When sim_mgr is launched, shared memory and semaphore initialization is done. After that, a new thread is registered whenever the pthread_create wrapper is called. Both slurmctl and slurmd can create threads, so registration needs to be atomic in order to get a per-thread simulation identifier. Linked to this identifier are two main fields: a sleep counter and a semaphore index.
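The paper does not list the shared structures field by field; a minimal sketch of what the segment might contain, with all names assumed, could look like this:

```c
/* Sketch of the shared state created by sim_mgr and mapped by sim_lib
 * from slurmctl and slurmd (all names assumed; the paper only gives the
 * concepts: a global simulation time, per-thread sleep counters and an
 * array of POSIX semaphores). */
#include <semaphore.h>
#include <pthread.h>
#include <time.h>

#define SIM_MAX_THREADS 64        /* current sim_mgr thread limit */

struct sim_thread {
    pthread_t tid;                /* value returned by the real pthread_create */
    int       state;              /* free / running / sleeping / joining / exited */
    int       sleep_count;        /* simulation cycles left before wake-up */
    sem_t     sem;                /* sim_mgr -> thread: wake up */
    sem_t     sem_back;           /* thread -> sim_mgr: slept again or exited */
};

struct sim_shm {
    volatile time_t   sim_time;       /* incremented once per simulation cycle */
    int               proto_threads;  /* creations waiting for a free slot */
    struct sim_thread threads[SIM_MAX_THREADS];
};
```

Each semaphore would be initialized with sem_init(&s, 1, 0); the non-zero pshared flag is what lets slurmctl, slurmd and sim_mgr synchronize across process boundaries.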
For tracking which thread has which identifier, a table is created using the pthread_t value returned by the real pthread_create function. A thread calling the pthread_exit, pthread_join or sleep functions will use that table and the value of pthread_self() to find out which semaphore to use for synchronization with sim_mgr.

When a thread calls the time function, the sim_lib wrapper for time is executed. Inside this wrapper there is just a read of a global simulation address in the shared memory space containing the current simulation time. The same happens for gettimeofday, with a difference related to the microseconds value returned by this function: initially we do not need a simulation with such fine-grained precision, so microseconds is a fake value which increments every time gettimeofday is called inside the same simulation cycle.

When a thread calls the sleep function, the sim_lib wrapper is executed as follows (the steps are combined in the sketch at the end of this section):

1. The wrapper code gets the thread identifier that thread has inside the simulation. For this, a search is made through a table linking pthread_t values coming from pthread_create with simulation thread identifiers. A thread can know its pthread_t value with a simple call to pthread_self, which is part of the standard pthread library. A potential problem with this approach is the possibility of getting two threads with the same pthread_t value. This is not possible within a single Linux process, but SLURM is a distributed program with at least two processes, a slurmctl and a slurmd, so it could be possible to have two threads with the same pthread_t value. Linux uses the stack address of the created pthread for the pthread_t value, so it is quite rare for two different programs to get the same pthread_t values. Nonetheless, a table using pthread_t along with the process identifier (pid) would be safer.
2. Using the thread identifier as an array index into the thread info structure, the thread's sleep field is updated with the value passed as a parameter to the sleep function.
3. The next step is synchronization with sim_mgr. There is one POSIX semaphore per pthread by design, but this is not enough to get synchronization right, so there are two semaphores per thread: the first semaphore is for the thread waiting to be woken up by sim_mgr, meanwhile the other one is for sim_mgr to wait for the thread to come back to sleep or to exit.
4. If the thread calls sleep again, the loop goes back to 1). If pthread_exit is called by the thread, a sem_post(sem_back[thread_id]) is done inside the pthread_exit wrapper.

We have seen what the wrappers do, except for pthread_join. This wrapper is quite simple, adding a new valid thread state, joining, to the simulation. This is necessary to avoid interlocks when SLURM agents are used.
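Putting steps 1-4 together, the sleep wrapper might look roughly as follows; sim_thread_id() and the shm pointer are assumptions, reusing the sim_shm layout sketched in the previous section:

```c
/* Sketch of the sim_lib sleep() wrapper following steps 1-4 above. */
#include <unistd.h>
#include <pthread.h>
#include <semaphore.h>

extern struct sim_shm *shm;               /* mapped shared segment */
extern int sim_thread_id(pthread_t tid);  /* step 1: pthread_t -> sim id */

unsigned int sleep(unsigned int seconds)
{
    int id = sim_thread_id(pthread_self());
    struct sim_thread *t = &shm->threads[id];

    t->sleep_count = seconds;     /* step 2: record the requested timeout */
    sem_post(&t->sem_back);       /* step 3: tell sim_mgr we are asleep   */
    sem_wait(&t->sem);            /* block until sim_mgr wakes us up      */
    return 0;                     /* a simulated sleep is never cut short */
}
```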
Simulation manager

The wrappers implemented inside the sim_lib library are useful for connecting slurm threads with the simulation manager, sim_mgr. The global picture of how the simulation works can be clarified with a detailed view of what sim_mgr does in a simulation cycle (a code sketch follows the list):

1. For each registered thread which is neither a never-ending thread nor a new thread:
   (a) it waits until that thread sleeps, exits or joins;
   (b) if the thread is sleeping and its sleep counter is equal to 0:
       i. it does a sem_post(sem[thread_id]);
       ii. it does a sem_wait(sem_back[thread_id]).
   The term never-ending thread needs some explanation. What sim_mgr does is to allow just one slurm thread to execute at the same time, while still allowing each thread to execute within a simulation cycle. However, some specific threads need less control, to avoid an interlock between threads. This is the case for the threads getting remote requests at both slurmctl and slurmd. Usually those threads call pthread_create when a new message arrives, and the new thread will be under sim_mgr control as usual.
2. Using the trace information, if there are new jobs to be submitted at this simulation cycle, sbatch, the usual slurm program for job submission, is executed with the right parameters for the job.
3. For each new thread, sim_mgr waits until that thread sleeps, exits or joins. As slurm can face specific situations with a burst of new threads, sim_mgr needs a way of controlling how threads are created, avoiding having more than 64 threads at the same time, since this is the current limit sim_mgr supports. So this last part of the simulation cycle manages thread creation, delaying some pthread_create calls and identifying what we have named proto-threads, that is, almost-created threads. This is the loop managing this situation:
   (a) For each new thread, if the thread is still alive, wait until it calls sleep, exit or join. Otherwise, it means this thread did its task quickly and called pthread_exit.
   (b) When all new threads have been managed, all threads which called pthread_exit during this simulation cycle can release their slot, so new threads can be created. If there were no free slots, it could be that a pthread_create call is waiting to get a simulation thread identifier. A global counter shows how many proto-threads are waiting for free slots. If this counter is greater than zero, then sim_mgr leaves some time for those pthread_create calls to finish and goes back to the previous step, looking again for new threads. Otherwise the simulation cycle goes on.
4. At this point any thread alive, or any thread created during this simulation cycle, has had a chance to execute, so if a thread is still alive it is waiting on a semaphore. Sim_mgr can now decrement the sleep value of all those threads.
5. All slots belonging to threads which called pthread_exit during this cycle are released.
6. The last step is to increment the simulation time by one second.
7. Go back to 1.
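As a compact restatement of the cycle above, sim_mgr's main loop might be sketched like this; the helper functions are assumed stand-ins for behaviour the paper only describes, and the sim_shm layout is the one sketched earlier:

```c
/* Sketch of one sim_mgr simulation cycle (steps 1-7 above). */
#include <semaphore.h>
#include <time.h>

extern struct sim_shm *shm;
extern int  controlled(struct sim_thread *t);    /* neither never-ending nor new */
extern void submit_trace_jobs(time_t now);       /* step 2: run sbatch as needed */
extern void drain_new_threads(struct sim_shm *); /* step 3: new/proto-threads    */
extern void release_exited_slots(struct sim_shm *); /* step 5 */

void sim_cycle(void)
{
    for (int id = 0; id < SIM_MAX_THREADS; id++) {   /* step 1 */
        struct sim_thread *t = &shm->threads[id];
        if (!controlled(t))
            continue;
        /* 1a is implicit: sem_back was posted when the thread last slept */
        if (t->sleep_count == 0) {                   /* 1b */
            sem_post(&t->sem);        /* i:  wake the thread up            */
            sem_wait(&t->sem_back);   /* ii: wait until it sleeps or exits */
        }
    }
    submit_trace_jobs(shm->sim_time);                /* step 2 */
    drain_new_threads(shm);                          /* step 3 */
    for (int id = 0; id < SIM_MAX_THREADS; id++)     /* step 4: tick sleepers */
        if (shm->threads[id].sleep_count > 0)
            shm->threads[id].sleep_count--;
    release_exited_slots(shm);                       /* step 5 */
    shm->sim_time++;                                 /* step 6; the caller
                                                        loops back (step 7) */
}
```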
5 SLURM code changes

Our main design goal was to use the SLURM programs for simulation without major changes. The design section described which changes are needed in slurmd and slurmctl in order to support the simulation mode. Once the implementation has been done, the changes can be shown in detail:

- Our design requires threads to call pthread_exit explicitly. Usually this is not needed, and the operating system/pthread library knows what to do when a thread function is over. Therefore we have added pthread_exit calls to each thread code routine, which is not an important change to SLURM code.
- Under simulation we build a controlled environment: we know when jobs will be submitted, when a node will have problems (if the simulation supports this feature) and how long a job will run. There will be no unexpected job or node failures, therefore we can avoid executing the code that monitors job and node state when simulation mode is on.
- As jobs are not really executed, it is not necessary to send packets to each node for job cancelling, job signaling, job modification, etc. Removing this part of SLURM simplifies the simulator's work significantly, since slurmctl uses agents for these communications. In clusters with thousands of nodes there can be jobs spanning thousands of nodes, so sending messages to each node is not negligible at all. SLURM design was made to support those communication bursts, and agents take care of sending messages from slurmctl to the slurmds. The problem with agents is that they are strongly based on threads: agents are threads themselves, there are pools of available agents with a static maximum limit, and watchdog threads are specifically created to monitor agent threads. Although this works fine for SLURM, making it highly scalable software, such a mesh of threads is too complex for our simulation design, since interlocks between threads appear easily.
- As we said before, slurmd is simpler under simulation, since no job needs to be really executed. On the other hand, an extra task needs to be done: knowing how long a job will run and sending a job completion message when simulation time reaches that point. A new thread is created, a simulator helper, which each second checks for running jobs to be completed, taking into account the current time and job durations, and sends a message to slurmctl if necessary (see the sketch after this list).
- Finally, slurmctl works as in normal SLURM execution. Just some changes for sending messages without using agents have been made, which makes sim_mgr simpler and facilitates the synchronization needed to obtain determinism across simulations using the same job trace.
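A rough sketch of that helper thread follows; the job bookkeeping names and notify_job_complete() are assumptions. Note that time() and sleep() resolve to the sim_lib wrappers here, so "each second" means each simulated second:

```c
/* Sketch of the slurmd simulator helper thread. */
#include <stdint.h>
#include <time.h>
#include <unistd.h>

struct sim_job { uint32_t jobid; time_t start; time_t duration; int done; };

extern struct sim_job running_jobs[];  /* filled in from the job trace */
extern int n_running;
extern void notify_job_complete(uint32_t jobid, int rc); /* msg to slurmctl */

static void *sim_helper(void *arg)
{
    (void)arg;
    for (;;) {
        time_t now = time(NULL);       /* simulated time via the wrapper */
        for (int i = 0; i < n_running; i++) {
            struct sim_job *j = &running_jobs[i];
            if (!j->done && now >= j->start + j->duration) {
                j->done = 1;
                notify_job_complete(j->jobid, 0); /* job ended successfully */
            }
        }
        sleep(1);                      /* one simulation cycle */
    }
    return NULL;
}
```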
6 Preliminary Results

A first simulator implementation is working and executing traces with jobs from different users and queues. A regression test bench has been built using synthetic traces, useful for testing exceptional situations like a burst of jobs starting or completing in the same cycle. Several simulations using the same synthetic trace are executed to validate the design, expecting the same scheduling results, which implies we achieve determinism under simulation.

After some simulator executions we discovered a problem that is hard to overcome, since it is linked to normal SLURM behavior. As we have seen, sim_mgr controls simulation execution, letting threads execute but in sequential order, one by one. This is just the opposite of what threads were designed for, but it allows executing traces with determinism. Leaving threads without control would lead to small differences, like a job starting one second later. Although this looks harmless, it could mean other jobs being scheduled in a different way, with such a change spreading through the whole simulation and ending in a totally different trace execution. So sim_mgr lets a thread execute and waits until that thread calls the sleep function again. There is a slurmctl thread for backfilling which is executed twice per minute by default, and the problem is that the task done by this backfilling thread can easily take more than a second: the more jobs queued, the more time it takes. It is also related to the amount of resources: when simulating a cluster with 1000 queued jobs, with job durations from 30 seconds to an hour, it is normal for the backfilling thread to take up to 5 seconds to complete. Therefore the goal behind the simulation, to speed up job execution, is compromised by this natural SLURM behaviour. We hope the SLURM simulator can help to overcome this problem for real SLURM; then the simulator itself can benefit by executing long traces faster.

The next table shows results using an Intel Core Duo at 2.53 GHz and synthetic traces of 3000 jobs executed on a simulated cluster (columns: Trace number, Executions, Sim Time, Exec Time, Speedup; the numeric values were lost in transcription). Time results per trace number are the mean of 5 executions. Simulating four days of job execution takes around one hour and a half. Extrapolating these results, a one-month trace simulation would last a bit more than 10 hours.

7 Conclusions and Future Work

A solution has been presented for using real batch scheduling software, SLURM, for job trace simulation. Although the SLURM design is strongly based on threads, it is possible to keep the original SLURM functionality with minor changes.
For this, some calls to standard shared libraries are captured and specific code for simulation purposes is executed. The implementation complexity has been kept outside the SLURM code, so the main SLURM developers will not need to be aware of this simulation mode for future SLURM enhancements, and simulator maintenance will be easier. The next step will be to present this work to SLURM core developers and users and, if it is welcomed, to integrate the code into the SLURM code project.

References

1. M. Jette, M. Grondona. SLURM: Simple Linux Utility for Resource Management. Proceedings of ClusterWorld Conference and Expo, San Jose, California, June 2003.
2. Barcelona Supercomputing Center.
3. J. M. Calandrino, D. P. Baumberger, T. Li, J. C. Young, S. Hahn. LinSched: The Linux Scheduler Simulator. jmc/linsched/
4. M. W. Margo, K. Yoshimoto, P. Kovatch, P. Andrews. Impact of Reservations on Production Job Scheduling. Proceedings of the 13th International Conference on Job Scheduling Strategies for Parallel Processing.
5. S. Banen, A. I. D. Bucur, D. H. J. Epema. A Measurement-Based Simulation Study of Processor Co-Allocation in Multicluster Systems. In Job Scheduling Strategies for Parallel Processing.
6. CSIM simulation engine.
7. E. Frachtenberg, D. G. Feitelson, J. Fernandez, F. Petrini. Parallel Job Scheduling Under Dynamic Workloads (2003). In Job Scheduling Strategies for Parallel Processing.
8. Moab cluster software.
9. ld-linux.so(8), Linux Programmer's Manual: Administration and Privileged Commands.