The Sun ONE Grid Engine Batch System




Juan Luis Chaves Sanabria
Centro Nacional de Cálculo Científico (CeCalCULA)
Latin American School in HPC on Linux Cluster, October 27 - November 7, 2003

What is SGE?
- SGE is cluster resource management software.
- It accepts jobs submitted by users and schedules them for execution on the cluster according to resource management policies (who gets how many resources, and when).
- Jobs are distributed in a way that keeps the workload uniform across the cluster.

Who develops SGE?
SGE is developed by Sun Microsystems:
- http://www.sun.com/gridware
- http://gridengine.sunsource.net
Sun acquired Gridware, a developer of Distributed Resource Management (DRM) software, in July 2000. Sun then released SGE as a free downloadable binary for the Solaris and Linux operating systems to facilitate the deployment of compute farms. The source code is available as an open source project to enable the Grid Computing model.

SGE 5.3 supported platforms:
- Compaq Tru64 Unix 5.0, 5.1
- Hewlett Packard HP-UX 10.20, 11.00
- IBM AIX 4.3.X
- Linux x86, kernel 2.4, glibc 2.2
- Linux Alpha/AXP, kernel 2.2, glibc 2.2
- SGI IRIX 6.2 - 6.5
- Sun Solaris (SPARC) 2.6, 7, 8, 9, 32-bit
- Sun Solaris (SPARC) 2.6, 7, 8, 9, 64-bit
- Sun Solaris (x86) 8

How the System Operates
- SGE accepts job requests for computing resources (each job carries a requirement profile).
- Job requests are placed in a holding area until they can be executed.
- When a request is ready to be executed, it is forwarded to the appropriate execution device(s).
- SGE manages the execution of the request.
- SGE logs a record of the execution when it has finished.

SGE Components
Hosts:
- Master (sge_qmaster and sge_schedd): controls all the SGE components and the overall cluster activity.
- Execution (sge_execd): authorized to execute jobs through SGE.
- Administration: designated to carry out any kind of administrative task for the SGE system.
- Submit: for submitting (qsub) and controlling (qstat, qdel, qhold, qrls, ...) batch jobs.

SGE Components (2)
Queues:
- A queue is a container for a class of jobs (batch, parallel, interactive, checkpointing) allowed to execute concurrently on a particular host.
- Commands applied to a queue affect all jobs associated with it.

SGE Components (3)
Queues (2): Properties:
- name: the queue's name
- hostname: the machine hosting the queue
- processors: on a multiprocessor system, the processors to which the queue has access
- qtype: the types of jobs permitted to run in this queue (interactive, batch, parallel, checkpointing)
- slots: the number of jobs that can run concurrently in the queue

SGE Components (4)
Queues (3): Properties (2):
- owner_lists: the queue's owners
- user_lists: the user or group IDs of those who may access the queue
- xuser_lists: the user or group IDs of those who may not access the queue
- complex_list: indicates the complexes associated with the queue
- complex_values: assigns capacities, as provided for this queue, for certain complex attributes

SGE Components (5)
- Complex: a set of features (resources) associated with a queue, a host, or the entire cluster that are known to SGE.
- Cell: each loosely separated SGE cluster, with its own configuration and master machine. The SGE_CELL environment variable makes it possible to discriminate among clusters.

SGE Functionality
SGE is controlled by four daemons:
- sge_qmaster: controls all of the cluster's management and scheduling activities. It receives scheduling decisions from sge_schedd, requests actions from sge_execd on the execution hosts, and maintains tables about the cluster status.
- sge_shadowd: a daemon used when a backup host (shadow master host) exists to take over the functionality of sge_qmaster.

SGE Functionality (2)
- sge_schedd: maintains an up-to-date view of the cluster's status using the data provided by the sge_qmaster daemon. It decides which jobs are forwarded to which queues, and communicates these decisions to sge_qmaster, which initiates the appropriate actions.

SGE Functionality (3)
- sge_execd: is responsible for the queues on its host and for the execution of the jobs in those queues. It sends information to the master host (sge_qmaster) about job status and the load on its host.
- sge_commd: all the daemons communicate with one another through the communication daemons (one per host).

SGE Functionality (4)
[Architecture diagram: the master host runs sge_qmaster, sge_schedd and sge_commd; each execution host runs sge_execd and sge_commd and hosts one or more queues (q1-q5); all hosts are connected through a switch.]

Using SGE
What a command does depends on the type of user executing it. SGE defines four types of users:
- Managers: have full capabilities to manipulate SGE.
- Operators: can execute all the commands that managers can, except for making configuration changes to SGE.
- Owners: are defined per queue, and can manipulate the queues they own and the jobs within them.
- Users: can only manage their own jobs, and can only use the queues and parallel environments for which they are authorized.

Using SGE (2)
The slide tabulates the commands qacct, qalter, qconf, qdel, qhold, qhost and qlogin against the four user types. The restrictions noted: for qconf, operators may make no system setup changes, while for owners and users the configuration is shown only.

Using SGE (3)
The slide tabulates the commands qmod, qmon, qrls, qselect, qsh and qstat against the four user types. The restrictions noted: no system setup changes; for qmod, owners may act on their own jobs and owned queues only; for qmon, no configuration changes may be made.

Submitting Jobs
Prerequisites: ensure that no commands that need a terminal (tty) are executed in your .[t]cshrc or .bashrc; guard them behind a tty test.

For bash, sh or ksh:

    tty -s
    if [ $? = 0 ]; then
        stty erase ^H
    fi

For csh or tcsh:

    tty -s
    if ( $status = 0 ) then
        stty erase ^H
    endif

Submitting Jobs (2)
Prerequisites (2): ensure that your .[t]cshrc or .bashrc sets the executable search path and the other SGE environmental conditions.

For csh or tcsh:

    source <sge_root_dir>/default/common/settings.csh

For bash, sh or ksh:

    . <sge_root_dir>/default/common/settings.sh

Submitting Jobs (3)
Specify which script should be executed:

    qsub -cwd job_script

- -cwd: run the job from the current working directory (default: $HOME).
- In the simplest case the job script contains a single line: the name of the executable.
- Various examples are available in <sge_root_dir>/examples/jobs/.
- Many other options are available for qsub; see man qsub.
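A complete minimal submission along the lines described above might look like the following sketch; the file name hello.sh and the job name "hello" are invented for illustration.

```shell
# Create a minimal job script; the "#$" lines are embedded qsub options
# (active SGE comments). File name and job name are hypothetical.
cat > hello.sh <<'EOF'
#!/bin/sh
#$ -cwd
#$ -N hello
echo "hello from SGE"
EOF
# On a cluster with SGE installed you would now submit it with:
#   qsub hello.sh
# Here we only display the script that was generated:
cat hello.sh
```

On submission, SGE would schedule the script onto a suitable execution host and run it there.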

Submitting Jobs (4)
Example of a script file:

    #!/bin/sh
    WORKDIR=/tmp/scratch/$USER
    DATADIR=$HOME/data
    mkdir -p $WORKDIR
    cp $DATADIR/input_data $WORKDIR
    cd $WORKDIR
    executable < input_data > out_executable
    cp out_executable $DATADIR
    rm -rf $WORKDIR

Submitting Jobs (5)
Output and error redirection:
- Default standard output filename: <Job_name>.o<Job_id>. Can be changed with the -o option.
- Default standard error filename: <Job_name>.e<Job_id>. Can be changed with the -e option.
- Active SGE comments (embedded qsub options) in script files are identified, by default, by the #$ prefix.
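The default naming scheme can be illustrated with a quick calculation; the job name and job id below are invented.

```shell
# Default stdout/stderr file names for a hypothetical job named
# "myjob" that SGE assigned job id 1234.
JOB_NAME=myjob
JOB_ID=1234
OUT="${JOB_NAME}.o${JOB_ID}"
ERR="${JOB_NAME}.e${JOB_ID}"
echo "$OUT $ERR"   # → myjob.o1234 myjob.e1234
```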

Submitting Jobs (6)
Array Jobs:
- Are parameterized executions of the same script.
- SGE views them as an array of independent tasks joined into a single job.
- task_id is the array job task index number.
- Each task can use the environment variable $SGE_TASK_ID to retrieve its own task index number and use it to access the input data set arranged for that task_id.

Submitting Jobs (7)
Array Jobs (2):
Example:

    qsub -l h_cpu=0:30:0 -t 2-10:2 script.sh input.data

- Default standard output filename: <Job_name>.o<Job_id>.<Task_id>
- Default standard error filename: <Job_name>.e<Job_id>.<Task_id>
- Array jobs can be monitored and controlled as a whole, or by individual tasks or subsets of tasks.
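The role of $SGE_TASK_ID can be sketched without a cluster by setting it by hand; under SGE the execution daemon exports it for every task, and the per-task input-file naming used here is an assumption.

```shell
# Simulate what one task of an array job submitted with
#   qsub -t 2-10:2 script.sh
# might do. Under SGE, SGE_TASK_ID is set automatically per task;
# here we set it manually for illustration.
SGE_TASK_ID=4
INPUT="input.data.${SGE_TASK_ID}"   # hypothetical per-task input file
echo "task ${SGE_TASK_ID} reads ${INPUT}"   # → task 4 reads input.data.4
```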

Submitting Jobs (8)
Interactive Jobs:
Are executed on interactive queues. Three ways are available:
- qlogin: starts a telnet-like session on a host chosen by SGE.
- qrsh: is like the rsh or rlogin UNIX commands.
- qsh: is an xterm brought up with the display set according to the DISPLAY environment variable. If this variable is not set, the xterm is directed to the 0.0 screen of the X server on the host from which the interactive job was submitted. DISPLAY can be set with the -display option.

Monitoring and Controlling Jobs
- qstat: shows job/queue status. Without arguments, it shows running/pending jobs. With -j, it shows detailed information on running/pending jobs; with -f, it shows submitted jobs and a full listing of all queues.
- qhost: shows job/host status. Without arguments, it shows all execution hosts and their configuration. With -q, it shows detailed information on the queues at each host.

Monitoring and Controlling Jobs (2)
- qdel: cancels jobs submitted through SGE: qdel <job_id>
- qmod: suspends/unsuspends running jobs: qmod -s <job_id> (suspend), qmod -us <job_id> (unsuspend)
- qhold: holds back pending jobs from execution.
- qrls: releases jobs from holds previously assigned to them.

Parallel Jobs
- Are submitted to run in parallel environments.
- Parallel environments are procedures that fulfil the requirements needed to run a specific parallel application.
- There is one parallel environment for each class or type of parallel application configured in the cluster.

Parallel Jobs (2)
- qconf -ap <parallel environment name>: create a new parallel environment
- qconf -spl: list all defined parallel environments
- qconf -sp <parallel environment name>: show detailed information on the specified parallel environment

Parallel Jobs (3)
Parallel environment example:

    $ qconf -sp mpich
    pe_name            mpich
    queue_list         all
    slots              8
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /usr/local/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
    stop_proc_args     /usr/local/sge/mpi/stopmpi.sh
    allocation_rule    $round_robin
    control_slaves     TRUE
    job_is_first_task  FALSE

Parallel Jobs (4)
Script example:

    #!/bin/csh
    #
    # (c) 2002 Sun Microsystems, Inc. Use is subject to license terms.
    #
    # our name
    #$ -N MPI_calc_PI_Job
    #
    # pe request
    #$ -pe mpich 2-6
    #
    #$ -v MPIR_HOME=/usr/local/mpich
    #
    # needs in
    #   $NSLOTS           the number of tasks to be used
    #   $TMPDIR/machines  a valid machine file to be passed to mpirun
    #
    echo "Got $NSLOTS slots."
    $MPIR_HOME/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines $HOME/MPI/cpi

Checkpointing
- SGE supports two classes of checkpointing: user-level checkpointing and operating-system-level checkpointing.
- Checkpointing environments must be defined for each type of application with this support.
- When a checkpointing job is launched, this must be indicated with the -ckpt option of the qsub command.
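What the script above finds in its environment can be simulated outside SGE; the host names below are fabricated, and under SGE the machine file would be generated by the startmpi.sh start procedure of the parallel environment.

```shell
# Simulate the $NSLOTS / $TMPDIR/machines contract that the mpich
# parallel environment provides to a running job (values assumed).
NSLOTS=4
TMPDIR=$(mktemp -d)
printf '%s\n' node01 node02 node03 node04 > "$TMPDIR/machines"
echo "Got $NSLOTS slots."
echo "machine file has $(wc -l < "$TMPDIR/machines" | tr -d ' ') hosts"
```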

Checkpointing (2)
Checkpointing environments are defined in configuration files, which define the operations for:
- initiating a checkpoint generation
- migrating a checkpointed job to another host
- restarting a checkpointed application
as well as the list of queues that are eligible for a checkpointing method.

Checkpointing (3)
Checkpoint environment file format:
- ckpt_name: <name>
- interface: user-defined or OS-provided
- ckpt_command: command to initiate the checkpoint
- migr_command: command used during the migration of a checkpointing job from one host to another
- restart_command: command used to restart a previously checkpointed application
- clean_command: command used to clean up after a checkpointed application has finished
- ckpt_dir: where the checkpoint files should be stored
- queue_list: all, or a comma-separated list of queues
- signal: the Unix signal to be sent to a job to initiate a checkpoint generation
- when: when to generate the checkpoints:
  - s: on shutdown of the node
  - m: periodically, at the min_cpu_interval interval defined by the queue
  - x: when the job gets suspended
  - r: when the job will be rescheduled (not checkpointed)
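Put together, a checkpointing environment definition using the fields above might look like the following sketch; every value here (the name, the paths, the signal, the when flags) is hypothetical and would have to match the checkpointing tool actually installed on the cluster.

```
ckpt_name        sample_ckpt
interface        userdefined
ckpt_command     /usr/local/ckpt/do_checkpoint.sh $job_id
migr_command     /usr/local/ckpt/migrate.sh $job_id
restart_command  /usr/local/ckpt/restart.sh $job_id
clean_command    /usr/local/ckpt/clean.sh $job_id
ckpt_dir         /scratch/ckpt
queue_list       all
signal           SIGUSR1
when             xmr
```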

SGE Administration
All administration activities on SGE can be carried out through the qconf command. Basically:
- qconf -a<h|q|s> <associated arguments> (add)
- qconf -d<h|q|e|conf|s> <associated arguments> (delete)
- qconf -m<q|conf> <associated arguments> (modify)
- qconf -s<h|s|sel|conf> <associated arguments> (show)

QMON: the SGE GUI