HPC-Nutzer Informationsaustausch: The Workload Management System LSF




Content
- Cluster facts
- Job submission
- esub messages
- Scheduling strategies
- Tools and security
- Future plans

Some facts about the cluster
- about 2,000 hosts (1,400 from the BULL installation of 2011)
- about 35,000 cores; hosts have between 12 and 128 cores and 24 to 2,048 GB of memory
- 10,000 to several 100,000 jobs in the queues
- waiting times depend on job size and on how full the cluster is: from several minutes up to several weeks
- about 150 active users at any time; ten times as many have been seen on the batch system over time
- about 240 million CPU hours per year (BULL installation)
- new additional installations planned for 2016 and 2018

Submission
- bsub < jobscript vs. bsub [options] command: please use job scripts -> easier to read, which helps debugging (see the example below)
- the verification script describes rules for jobs: project management, additional resources (bcs, phi, gpu, ...)
- it informs the user in various cases about the decisions made
- we regularly review long-pending jobs
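A minimal job script, as a sketch (resource values, job name, and the program to run are illustrative, not site defaults):

    #!/usr/bin/env bash
    # Illustrative LSF job script; all values are examples.
    #BSUB -J myjob              # job name
    #BSUB -o myjob.%J.log      # output file (%J expands to the job ID)
    #BSUB -W 1:00              # wall time limit (hh:mm)
    #BSUB -M 1024              # memory limit (unit depends on site configuration)
    #BSUB -n 4                 # number of slots

    ./my_program               # hypothetical executable

Submitting it with bsub < jobscript makes LSF parse the #BSUB lines, so the verification script sees the complete request.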

esub messages
- mail-related messages:
  - "... is not a valid email address, please use a correct email address!"
  - "The mailbox is out of quota ..."
- prohibitive messages:
  - "Using a loginshell (-L) is not allowed!"
  - "Setting requeue values (-Q) is not allowed!"
  - "Stacksize must not be bigger than the requested memory size."
  - "You seem to use hpcwork but did not request it."
- project-related messages:
  - "The project is not (yet) active! ..."
  - "The project has used up all of its quota ..."
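These messages come from the esub verification script. A minimal sketch of how such checks work (LSB_SUB_PARM_FILE, LSB_SUB_ABORT_VALUE, and the LSB_SUB_* option variables are the documented LSF esub interface; the concrete checks are illustrative, not our actual script):

    #!/bin/sh
    # Illustrative esub fragment: LSF writes the submission options as
    # shell variable assignments into $LSB_SUB_PARM_FILE.
    . "$LSB_SUB_PARM_FILE"

    # Prohibit login shells (bsub -L)
    if [ -n "$LSB_SUB_LOGIN_SHELL" ]; then
        echo "Using a loginshell (-L) is not allowed!" >&2
        exit "$LSB_SUB_ABORT_VALUE"    # rejects the job at submission time
    fi

    # Very rough email sanity check (bsub -u); the pattern is illustrative
    case "$LSB_SUB_MAIL_USER" in
        *@*.*) ;;                      # looks plausible, accept
        ?*) echo "$LSB_SUB_MAIL_USER is not a valid email address, please use a correct email address!" >&2
            exit "$LSB_SUB_ABORT_VALUE" ;;
    esac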

Scheduling strategies / rules 1
- set default values if not given by the user: job name, output file, memory limit, wall time limit (see the esub sketch below)
- put jobs into the right queue depending on the request (exception: the GPU queue)
- request additional resources depending on the job type:
  - BCS boards for bcs jobs
  - resources according to the requested project (ih exclusive)
  - additional settings, e.g. for mic jobs on phi nodes
- describe the job through the job description (-Jd), which gets parsed
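Setting defaults uses the same esub mechanism: options appended to $LSB_SUB_MODIFY_FILE are re-read by LSF after the script finishes. A sketch (the default job name and queue name are illustrative):

    #!/bin/sh
    # Illustrative esub fragment: fill in defaults the user did not set.
    . "$LSB_SUB_PARM_FILE"

    # Default job name if not given by the user
    if [ -z "$LSB_SUB_JOB_NAME" ]; then
        echo 'LSB_SUB_JOB_NAME="noname"' >> "$LSB_SUB_MODIFY_FILE"
    fi

    # Route the job to a default queue unless the user requested one
    if [ -z "$LSB_SUB_QUEUE" ]; then
        echo 'LSB_SUB_QUEUE="normal"' >> "$LSB_SUB_MODIFY_FILE"
    fi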

Scheduling strategies / rules 2
- JARA jobs are scheduled to mpi-s by default
  - we are able to automatically define a ptile depending on the memory request
  - request -m mpi-l or -a bcs for hosts with larger memory (see the examples below)
- jobs requesting 32 or more slots are set exclusive
  - such a job cannot be disturbed by other, possibly misbehaving jobs
- jobs get a priority depending on various factors (fairshare):
  - it increases while the job is pending
  - it decreases with running jobs from the same LSF user group
  - the initial priority is defined by the granted quota (rwth / jara) or the provided hardware (ih fairshare)
- jobs of projects / users that are over quota get scheduled to the low queue
- a cron job (run once a day) takes care of switching jobs between queues
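Example submissions for the placement rules above (the memory value and the script name are illustrative):

    # Request hosts from the mpi-l group for a large-memory job:
    bsub -m mpi-l -M 65536 < jobscript

    # Or submit as a BCS job via the bcs application profile:
    bsub -a bcs < jobscript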

Tools and security
- no detailed job information is shown for jobs of users in other user groups (projects)
- the price is a lack of summary information in the jara-, rwth-, ... queues -> use the tools jarajobs, rwthjobs, lecturejobs
- r_batch_usage gives a per-month summary of your batch system usage (see the examples below)
  - add -q to also see your batch quota
  - add -p <projectid> to see the usage of that project, if you are allowed to submit to it
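Typical invocations (the project ID is a made-up example; output is site-specific):

    r_batch_usage                # per-month summary of your own usage
    r_batch_usage -q             # additionally show your batch quota
    r_batch_usage -p jara0042    # usage of a project you may submit to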

Future plans
- upgrade to LSF 9.1.3
  - change of concepts from slots to tasks; we still need to fully evaluate the consequences
- activation of CPU and memory affinity scheduling (see the sketch below)
  - badly behaving jobs no longer influence good jobs as much
  - should increase job performance
- activation of power-aware scheduling
  - jobs running at lower CPU frequencies reduce power costs
  - rewarded with fewer accounted core hours
- reporting for projects (first ih, later also jara, rwth, ...)
- lowering the threshold for job exclusivity
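With affinity scheduling active, requirements like the following become possible (a sketch following the LSF 9.1 resource requirement syntax; the values are illustrative and the feature is not yet enabled on the cluster):

    # Give each of 8 tasks its own core and bind memory to the local NUMA node:
    bsub -n 8 -R "affinity[core(1):membind=localonly]" < jobscript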

Many thanks for your attention!