Grid Engine experience in Finis Terrae, a large Itanium cluster supercomputer
Pablo Rey Mayo, Systems Technician, Galicia Supercomputing Centre (CESGA)
Agenda
- Introducing CESGA
- Finis Terrae Architecture
- Grid Engine Experience in Finis Terrae
- Grid Engine Utilities
- Grid Engine and gLite Middleware
Santiago de Compostela
- UNESCO World Heritage Site (1985)
- End of the St. James Way
- 100,000 pilgrims in 2007
Past, Present, Future
- Users: the three main Galician universities, the Spanish Research Council and the regional weather forecast service.
- Services: high performance computing, storage and communication resources (RedIris PoP); promotion of new information and communication technologies (HPC & Grid projects).
- Future: Centre of Excellence in Computational Science (C2SRC), 141 research staff, 75 M€ budget (31% building, 23% HPC).
CESGA's Technological Evolution (Past & Present)
[Chart: performance growth from 2.5 GFLOPS (VP 2400, 1993) through the VPP 300, AP 3000, HPC 4500, Beowulf, SVG, HPC 320/Superdome and Myrinet Linux cluster generations (9.6 GF to 3,142 GF, 1998-2006) up to 16,000 GFLOPS in 2007. Grid Engine was first used on the HPC 320 / Superdome / SVG systems; a minor and a major CPD rebuild are also marked. Trend: standardization of processors, network and OS.]
Finis Terrae Supercomputer
- Spanish National Unique Scientific & Technological Infrastructure
- More than: 16,000 GFLOPS, 2,580 Itanium2 CPUs, 19,640 GB of memory
- Open source software
Finis Terrae SMP NUMA Cluster
Demand for computing resources at CESGA
Grid Engine experience in Finis Terrae
Tight HP-MPI Integration
- The integration has been tested with both GE 6.1 and GE 6.2.
- GE 6.1 uses rsh. Limitations: a 256-connection limit, but only one connection is opened per node and job; no prompt in the session (we are not aware of any problem in batch jobs).
- GE 6.2 uses an internal communication protocol. Limitations: no prompt; no X11 forwarding.
- More info: http://wiki.gridengine.info/wiki/index.php/Tight-HP-MPI-Integration-Notes
Tight HP-MPI Integration
Set up the Grid Engine Parallel Environment (PE):

    pe_name            mpi
    slots              9999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    <YOUR_SGE_ROOT>/mpi/startmpi.sh -catch_rsh $pe_hostfile
    stop_proc_args     <YOUR_SGE_ROOT>/mpi/stopmpi.sh
    allocation_rule    $fill_up
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE

Modify the starter_method and pe_list options of the queue you want to use:

    starter_method <YOUR_PATH>/job_starter.sh
    pe_list        mpi mpi_rr mpi_1p mpi_2p mpi_4p mpi_8p mpi_16p

Set up the HP-MPI environment:

    export MPI_REMSH=$TMPDIR/rsh
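For reference, a minimal submission sketch using this PE; the application name, slot count and the use of HP-MPI's -hostfile option with the $TMPDIR/machines file generated by the standard startmpi.sh template are assumptions, not part of the original slides:

    #!/bin/bash
    #$ -N hpmpi_test
    #$ -pe mpi 16
    #$ -cwd
    # HP-MPI starts its remote ranks through the rsh wrapper created by
    # startmpi.sh, so every process stays under Grid Engine control.
    export MPI_REMSH=$TMPDIR/rsh
    mpirun -hostfile $TMPDIR/machines ./my_mpi_app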
qrsh and memory consumption on Itanium
- qrsh in GE 6.2 adds builtin communication (IJS, Interactive Job Support) using threads.
- Memory consumption is about twice the stack limit; if there is no stack limit, 32 MB are used.
- The number of threads spawned by each qrsh depends on the shell: if the shell is not recognised as sh, an extra thread consumes 30 MB.
- With a 256 MB stack limit each qrsh consumes more than 500 MB of virtual memory, so jobs are killed when they exceed their memory limit.
- On Itanium the optimal memory consumption is obtained with a 1 MB stack. Set this limit inside the rsh script before qrsh is launched (see the sketch below).
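A minimal sketch of the relevant lines in the rsh wrapper; the surrounding variables follow the standard SGE mpi/rsh template, the exact wrapper is site-specific:

    # Limit the stack to 1 MB (ulimit -s takes KB) so each qrsh only needs
    # about 2 MB of virtual memory instead of twice the default stack limit.
    ulimit -s 1024
    exec $SGE_ROOT/bin/$ARC/qrsh -inherit -nostdin $rhost $cmd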
Tight Intel-MPI Integration: extra steps
- Create independent rings by adding this variable, to avoid processes interfering with each other in the allexit step:

    export I_MPI_JOB_CONTEXT=$JOB_ID-$SGE_TASK_ID

- If the Intel-MPI version is older than 3.2, the rsh script is not able to detect the Python version, so the probe command is redirected to a helper script:

    if [ x$just_wrap = x ]; then
        if [ $minus_n -eq 1 ]; then
            if [ "$cmd" = 'python -c "from sys import version; print version[:3]"' ]; then
                exec $SGE_ROOT/bin/$ARC/qrsh $CESGA_VARS -inherit -nostdin $rhost \
                    /opt/cesga/sistemas/sge/tools/util/version_python.sh
            else
                exec $SGE_ROOT/bin/$ARC/qrsh $CESGA_VARS -inherit -nostdin $rhost $cmd
            fi
        else
            ...

- Modify mpirun to use rsh by default instead of ssh:

    other_mpdboot_opt="$other_mpdboot_opt --rsh=$TMPDIR/rsh"
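The contents of version_python.sh are not shown in the slides; presumably it just reproduces locally the probe that mpdboot would have run remotely, something like this hypothetical sketch:

    #!/bin/sh
    # Hypothetical version_python.sh: print the Python version (e.g. "2.4"),
    # which is what mpdboot expects back from its remote probe.
    exec python -c "from sys import version; print version[:3]"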
Disabling direct SSH connection to the nodes
To accomplish this, the following steps are required:
- MPI tight integration.
- qlogin: configure it to run over qrsh, also following the tight integration approach:

    qlogin_command   builtin
    qlogin_daemon    builtin
    rlogin_command   builtin
    rlogin_daemon    builtin
    rsh_command      builtin
    rsh_daemon       builtin

- Reconfigure ssh to allow only certain administrative users to connect (AllowUsers option). In this way all processes are correctly accounted in the accounting file.
- To facilitate interactive use of the nodes, a wrapper was created. It also solves the X11 forwarding limitation in GE 6.2.
- More info: http://wiki.gridengine.info/wiki/index.php/disabling_direct_ssh_connection_to_the_nodes
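A minimal sketch of the AllowUsers restriction on the compute nodes; the account names are placeholders, not CESGA's actual configuration:

    # /etc/ssh/sshd_config: only administrative accounts may open direct SSH
    # sessions; everyone else has to go through Grid Engine (qrsh/qlogin).
    AllowUsers root sgeadmin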
Checkpointing in GE
- Berkeley Lab Checkpoint/Restart (BLCR), integrated with Grid Engine (GE).
- Set a ckpt_list in the queue configuration.
- Configure a checkpoint environment:

    [sdiaz@svgd ~]$ qconf -sckpt BLCR
    ckpt_name        BLCR
    interface        application-level
    ckpt_command     $BLCR_PATH/bin/cr_checkpoint -f $ckpt_dir/$ckptfile --stop -T $process2
    migr_command     $BLCR_PATH/bin/cr_restart $ckptfile
    restart_command  none
    clean_command    $BLCR_PATH/bin/clean.sh $job_id $ckpt_dir
    ckpt_dir         $SGE_O_WORKDIR
    signal           none
    when             xsmr

- Currently used for checkpointing Gaussian jobs. Future: checkpointing of parallel jobs.
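A submission sketch using this checkpoint environment (the job script name is a placeholder): -ckpt selects the BLCR environment and -c sx requests a checkpoint when the execution daemon shuts down or the job is suspended.

    qsub -ckpt BLCR -c sx gaussian_job.sh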
Utilities
qsub wrapper
Instead of the original qsub, a wrapper lets us check that jobs fulfil some basic requirements before they are queued:
- Check that the user has given an estimation of all the required resources (num_proc, s_rt, s_vmem and h_fsize) and that they fulfil the default limits.
- Distribute jobs to the corresponding queues (cluster partitioning).
- Verify that the job can run in the queues before submission.
- Define environment variables needed by some applications (e.g. OMP_NUM_THREADS).
- Support special resource requests (per-user limits).
- Similar to Job Submission Verifiers (JSVs)?
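A minimal sketch of the idea behind the wrapper; the path, messages and argument parsing are illustrative assumptions, not CESGA's actual script (for instance, it ignores resource requests embedded in the job script itself):

    #!/bin/bash
    # qsub wrapper sketch: reject jobs that do not request the mandatory
    # resources on the command line, then hand over to the real qsub binary.
    REAL_QSUB=$SGE_ROOT/bin/$ARC/qsub.real   # hypothetical location of real qsub
    for res in num_proc s_rt s_vmem h_fsize; do
        if ! echo "$@" | grep -q "$res="; then
            echo "Error: please request $res with -l $res=<value>" >&2
            exit 1
        fi
    done
    exec $REAL_QSUB "$@"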
qsub wrapper: default limits
- Maximum number of processors (num_proc * slots): 160
- Maximum value of num_proc: 16
- Maximum number of slots: 160
- Execution time (s_rt):
  - Sequential jobs: up to 1,000 hours
  - Parallel jobs:
    - 2 to 16 processors: 200 hours
    - 17 to 32 processors: 100 hours
    - 33 to 64 processors: 50 hours
    - 65 to 128 processors: 30 hours
    - 129 to 160 processors: 20 hours
    - More than 161 processors: 10 hours
- Memory: 8 GB per core, 112 GB per node
- h_fsize: 500 GB
qsub wrapper: queue distribution

    Number of cores   Small queue   Medium queue   Large queue
    <= 4              <= 1 hour     > 1 hour       -
    < 16              -             <= 1 hour      > 1 hour
    >= 16             -             -              > 1 hour

    Queue     Interactive   Small   Medium   Large   meteo
    # nodes   2             14      62       62      2
Management of special user requirements
- Default limits are not always enough: users can request access to extra resources (memory, CPU time, number of processors, scratch disk space, ...).
- The special resource requests are stored in a database and then exported to an XML file.
- The XML file is processed by the qsub wrapper:

    <users>
      <user login="prey">
        <application id_sol="71">
          <start>2009-06-01</start>
          <end>2010-05-31</end>
          <priority>1</priority>
          <s_rt>1440</s_rt>
          <s_vmem_total>0</s_vmem_total>
          <h_fsize>0</h_fsize>
          <n_proc_total>256</n_proc_total>
          <num_proc>0</num_proc>
          <slots_mpi>0</slots_mpi>
          <s_vmem_core>0</s_vmem_core>
          <exclusivity>0</exclusivity>
        </application>
      </user>
      ...
    </users>
Job prioritization
- Some kinds of jobs (special user requirements, challenges, agreements, ...) need to be prioritized.
- The qsub wrapper does not prioritize jobs, so we have to do it after they are submitted:
  - Change the number of override tickets (qalter -ot) and the priority (qalter -p) for the specified users.
  - In very special cases a reservation (qalter -R y) is also made.
  - Change the hard requested queues.
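For illustration, the corresponding commands for a single job; the job ID and values are examples, and granting override tickets or raising the priority above 0 requires operator/manager privileges:

    qalter -ot 10000 1234567    # grant override tickets to the job
    qalter -p 1024 1234567      # raise the POSIX priority (maximum is 1024)
    qalter -R y 1234567         # enable resource reservation (special cases)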
Job prioritization: special case
Jobs of the regional weather forecast service cannot wait for free slots, so its pending jobs are moved into execution. Process:
- Check whether there are jobs in error state and clear this state if necessary.
- Hold all pending jobs except those of this user.
- Change the priority for this user.
- Restrict access to the nodes while complex_values such as num_proc or memory are being increased, to prevent other jobs from entering the selected nodes.
- Restore the complex_values of the selected nodes.
- Remove the hold state of the pending jobs.
Should some pending job be prioritized?
- A daily email reports information about pending jobs (resource request list, time since submission, predecessor job, ...).
- A job will be prioritized if:
  - the time since it was submitted is greater than the requested time, or
  - the time since it was submitted is greater than a maximum waiting time limit (100 hours).
- The script can send this information by email and/or save it in a file in tab or CSV format. It can also save a summary by user and queue (historical information).
- The script is based on the XML output of the qstat command (qstat -u "*" -s p -r -xml).
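As a small illustration of consuming that XML output (the grep-based extraction is an assumption, not the actual script), the pending job IDs can be pulled from the JB_job_number elements:

    # Extract the JB_job_number element of every pending job.
    qstat -u "*" -s p -r -xml | grep -o '<JB_job_number>[0-9]*</JB_job_number>'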
Are nodes overloaded?
- Nodes are shared between different jobs (big SMP nodes, especially the Superdome).
- Users do not always make proper use of the number of slots requested by their jobs, so nodes can end up overloaded or underloaded.
- For each node, the load is compared with the number of available processors and with the number of processors requested by all the jobs running on it.
- The script is based on the XML output of the qstat command (qstat -u "*" -s r -r -f -ne -xml -q "*@NODE") for each host (qhost).
Application integration with GE: Gaussian
Gaussian is one of the applications most commonly used incorrectly, so a wrapper was created to avoid this. The wrapper lets us:
- Control how it is used:
  - Gaussian is run only under the queue system.
  - Users do not try to use the MPI environment.
  - The queue system requirements are consistent with the Gaussian input.
- Facilitate its use: it is not necessary to set all the requirements in the Gaussian input.
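A minimal sketch of such a wrapper; the installation path is hypothetical, and the consistency check between the queue request and the Gaussian input described above is not shown:

    #!/bin/bash
    # Gaussian wrapper sketch: refuse to run outside the queue system and
    # point Gaussian's scratch files at the per-job scratch directory.
    if [ -z "$JOB_ID" ]; then
        echo "Gaussian may only be run through the queue system (qsub)." >&2
        exit 1
    fi
    export GAUSS_SCRDIR=$TMPDIR       # per-job scratch directory
    exec /opt/gaussian/g03/g03 "$@"   # hypothetical installation path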
When will node X be free?
Sometimes we need to know when a node will be free (maintenance, tests, ...). Using the list of jobs running on the node and the requested s_rt limit of each job, we can obtain the expected end time of all the jobs.

    JOBS RUNNING IN NODES: sdd001 sdd002 sdd003
    node     jobid     user       start time            s_rt     end time
    -------------------------------------------------------------------------------------
    sdd001   1702180   uscqojvc   08/27/2009 09:02:01   716400   2009-09-04 16:02:01
    sdd001   1704961   uviqoarl   08/31/2009 11:48:16   684000   2009-09-08 09:48:16
    -------------------------------------------------------------------------------------
    sdd002   1694756   uviqfjhr   08/26/2009 21:08:39   720000   2009-09-04 05:08:39
    -------------------------------------------------------------------------------------
    sdd003   1704957   uviqoarl   08/31/2009 11:46:07   684000   2009-09-08 09:46:07
    -------------------------------------------------------------------------------------
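The end time is simply the start time plus the requested s_rt in seconds; for example, the first row of the table above can be reproduced with GNU date:

    date -d "2009-08-27 09:02:01 716400 seconds" "+%Y-%m-%d %H:%M:%S"
    # -> 2009-09-04 16:02:01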
GE integration in Ganglia
Management of accounting information
Integration of GE in the gLite middleware (EGEE project)
Grid Engine in the gLite middleware
- GE JobManager maintainers and developers.
- GE certification testbed.
- GE stress tests: https://twiki.cern.ch/twiki/bin/view/lcg/sge_stress
  We also performed stress tests on the Finis Terrae supercomputer to check GE behaviour in a big cluster.
- Documentation: GE Cookbook (https://edms.cern.ch/document/858737/1.3)
- End-user support for GE to the EGEE community.
- More info: https://twiki.cern.ch/twiki/bin/view/lcg/genericinstallguide310#the_sge_batch_system
Pablo Rey Mayo
prey@cesga.es
www.cesga.es
Grid Engine experience in Finis Terrae, a large Itanium cluster supercomputer